Solaris services type status - service

Solaris defines the following service states:
degraded – The instance is running or available to run, but is functioning at a limited capacity.
disabled – The instance is not enabled and is not running or available to run.
maintenance – The instance is enabled but not able to run. The instance might be transitioning through the maintenance state because
an administrative action has not yet completed. Otherwise,
administrative action is required to resolve the problem.
offline – The instance is enabled but not running or available to run. For example, if the dependencies of an enabled service are not
satisfied, the service is kept in the offline state.
online – The instance is enabled and running or available to run. The online state is the expected operating state for a correctly
configured service instance with all dependencies satisfied.
uninitialized – This state is the initial state for all services.
I want to understand the 'uninitialized' state. Could someone please explain it? Does it mean that the service is installed and enabled but not yet running, or that the service is disabled?

This is the state a service is placed in when you clear it from the maintenance state. Here you can find more explanations.
UNINITIALIZED
This state is the initial state for all service instances, including newly created instances (through executing svccfg add,
importing a manifest, or applying a profile). When an instance is
cleared from the maintenance state (svcadm clear), it is placed in the
uninitialized state so that its restarter can re-evaluate its
configuration. Instances are moved to maintenance, offline, or
disabled state after evaluation by the appropriate restarter. Note
that evaluation of an instance can only occur if its restarter service
is online.
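
The transitions described in the quoted text can be sketched as a small state model. This is only an illustration, not how SMF is implemented: the function names `clear` and `evaluate`, the `config_valid` flag, and the order of the checks are assumptions made for the example.

```python
from enum import Enum


class State(Enum):
    UNINITIALIZED = "uninitialized"
    OFFLINE = "offline"
    ONLINE = "online"
    DEGRADED = "degraded"
    MAINTENANCE = "maintenance"
    DISABLED = "disabled"


def clear(state):
    """Model of 'svcadm clear': only affects an instance in maintenance."""
    if state is State.MAINTENANCE:
        return State.UNINITIALIZED
    return state


def evaluate(state, *, enabled, config_valid, restarter_online):
    """Model of the restarter evaluating an uninitialized instance.

    Assumption: an invalid configuration leads to maintenance and is
    checked before the enabled flag. The real restarter may differ.
    """
    if state is not State.UNINITIALIZED:
        return state
    if not restarter_online:
        # Evaluation can only occur while the restarter service is online.
        return State.UNINITIALIZED
    if not config_valid:
        return State.MAINTENANCE
    if not enabled:
        return State.DISABLED
    # Enabled and valid: offline until dependencies are satisfied.
    return State.OFFLINE
```

So 'uninitialized' says nothing yet about enabled or disabled: it is the holding state before the restarter has looked at the instance.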

Related

General guidelines to self-organized task allocation within a Microservice

I have some general problems/questions regarding self managed Microservices (in Kubernetes).
The Situation:
I have a provider (Discord API) for my desired state, which tells me the count (or multiples of the count) of sharded connections (websocket -> stateful in some way) I should establish with the provider.
Currently I have a "monolithic" microservice (it can't be deployed in an autoscaling service and has to be stateful) that determines the number of connections I should have, plus a factor based on the currently active pods that can establish a connection to this API.
It further (by heartbeating and updating the connection target of all those pods) manages the state of every pod and achieves this target configuration.
It also handles the case of a pod being removed from the service and a change of target configuration, by rolling out the updated target and only after updating the target discontinuing the old connections.
The Cons:
This does not in any way resemble a good microservice architecture
A failure of the manager (even when the current state is persisted in a cache or DB of some sort) means the target set by the target provider is no longer achieved, and a pod may fail without the manager handling it gracefully
The Pros:
It's "easy" to understand and maintain a centrally managed system
There is no case (assuming a running manager system) where a pod can fail and it won't be handled -> connection resumed on another pod
My Plan:
I would like these websocket connection pods to manage themselves in some way.
Theoretically there has to be a way in which a "swarm" (swarm here is just a descriptive word for pods within a service) can determine a swarm wide accepted target.
The tasks to achieve this target (or change of target) should then be allocated across the swarm by the swarm itself.
Every failure of a member of the swarm has to be recognized, and the now unhandled tasks (in my case websocket connections) have to be resumed on different members of the swarm.
Also updates of the target have to be rolled out across the swarm in a distinct manner, retaining the tasks for the old target till all tasks for the new target are handled.
My ideas so far:
As a general syncing point a cache like redis or a db like mongodb could be used.
Here the current target (and the old target, for creating earlier mentioned smooth target changes) could be stored, along with all tasks that have to be handled to achieve this desired target.
This should be relatively easy to set up and also a "voting process" for the current target could be possible - if even necessary (every swarm member checks the current target of the target provider and the target that is determined by most of the swarm members is set as the vote outcome).
But now we face the problem already mentioned in the pros for the managed system: I currently can't think of a way the failure of a swarm member can be recognized and managed by the swarm consistently.
How should a failure be determined without a constant exchange between swarm members, which I think should be avoided because:
swarm members should operate entirely target driven and interact with each other as little as possible
Kubernetes itself isn't really designed for easy intra-service communication
Every contribution, idea or further question here helps.
My tech stack would be but isn't limited to:
Java with Micronaut for the application
gRPC as the only exchange protocol
Kubernetes as the orchestrator
Since you're on the JVM, you could use Akka Cluster to take care of failure detection between the pods (even in Kubernetes, though there's some care needed with service meshes to exempt the pod-to-pod communications from being routed through the mesh) and use (as one of many possibilities for this) Distributed Data's implementations of CRDTs to distribute state (in this case the target) among the pods.
This wouldn't require you to use Akka HTTP or Akka's gRPC implementations, so you could still use Micronaut for external interactions. It would effectively create a stateful self-organizing service which presents to Kubernetes as a regular stateless service.
If for some reason Akka isn't appealing, looking through the code and docs for its failure detection (phi-accrual) might provide some ideas for implementing a failure detector using (e.g.) periodic updates to a DB.
Disclaimer: I am employed by Lightbend, which provides commercial support for Akka and employs or has employed at some point most of the contributors to and maintainers of Akka.
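
To make the "periodic updates to a DB" idea concrete, here is a minimal sketch. It uses a plain fixed-timeout heartbeat check rather than phi-accrual, and an in-memory object stands in for Redis or a database. All names (`SharedStore`, `heartbeat`, `live_members`, `rebalance`) are hypothetical and not part of Akka or any other library.

```python
from dataclasses import dataclass, field


@dataclass
class SharedStore:
    """In-memory stand-in for a shared cache or DB table (assumption)."""
    heartbeats: dict = field(default_factory=dict)  # member -> last heartbeat time
    owners: dict = field(default_factory=dict)      # shard id -> owning member


def heartbeat(store, member, now):
    """Each pod calls this periodically to record that it is alive."""
    store.heartbeats[member] = now


def live_members(store, now, timeout):
    """Members whose last heartbeat is no older than `timeout`."""
    return sorted(m for m, t in store.heartbeats.items() if now - t <= timeout)


def rebalance(store, shard_count, now, timeout):
    """Keep shards on live owners and hand orphaned shards to the least loaded."""
    live = live_members(store, now, timeout)
    if not live:
        store.owners = {}
        return {}
    load = {m: 0 for m in live}
    kept = {}
    for shard in range(shard_count):
        owner = store.owners.get(shard)
        if owner in load:
            kept[shard] = owner
            load[owner] += 1
    for shard in range(shard_count):
        if shard not in kept:
            # Tie-break on the member name so every pod computes the same result.
            target = min(live, key=lambda m: (load[m], m))
            kept[shard] = target
            load[target] += 1
    store.owners = kept
    return dict(kept)
```

In a real deployment the reads and writes in `rebalance` would need to be atomic (for example a transaction or compare-and-set in the store), otherwise two pods could claim the same shard at once.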

Service Fabric - How to repair a failing stateful application

I have a stateful service that configures state backups for the primary replica on RunAsync using an Azure storage account.
The other day someone inadvertently deleted the storage account being used for backups. On our next deployment, the services began throwing errors as they initialize due to this 404 error response.
I have noticed that during a deployment Service Fabric apparently shuffles the old version of the service around, spinning up new primaries as needed to free up the VM it is upgrading. If the old version of the code fails to instantiate by throwing an exception, the upgrade process fails and causes a rollback.
My problem is that, once I create a new storage account, I am still left with seemingly no way to bring the existing services back to a healthy state. My existing services are using storage account URLs with AccountKeys that no longer exist in Azure. Attempts to upgrade fail because the old service instances can't instantiate due to the now bad configuration.
Are there any ways to deal with this situation?
The simplest thing would be to use an unmonitored manual upgrade to force through the change that would point the service to the new storage account.
However, this puts a lot of management overhead on you, particularly if there are many other services, since you need to be careful to perform all safety and functionality checks manually so as not to regress anything.
The recommended solution is to use the ServiceTypeHealthPolicyMap described here to "mask out" the unhealthy service (since you expect it to be unhealthy during the upgrade). You may also need to adjust some of the other upgrade parameters depending on the exact situation.
A third recommendation, or maybe something to improve in the future, would be to make the upgrade to change the account information a configuration only upgrade. This would ensure that SF tries to change the config in-place without restarting the services (by default), which would prevent the existing services from failing over during the upgrade and encountering issues. This is demonstrated in this example.

Service Fabric upgrades keep active connections alive

I am trying to upgrade an application deployed to service fabric.
How can I only upgrade nodes that have no active connections and wait for the busy nodes to finish before upgrading them?
Most of the time, you don't really have to worry about upgrades at the node level, as the SF runtime handles them internally if configured in Monitored mode. This is what we've been using with a high level of success, and we never really had to do much. It also fits our requirement that all upgrade domains (nodes) must match our health state policies before being considered healthy.
If you want more advanced control over your upgrades, such as request draining, have a look at the info mentioned here. But to be honest, we've been quite happy with just using Monitored mode and investigating why stuff fails if it does. We had some apps with a long background task running as a stateful actor that sometimes failed to upgrade, and almost always that was due to an issue in the background task itself rather than anything to do with Service Fabric.
Service Fabric knew when no active connections or background tasks were running and only then upgraded the nodes; we could actually see the nodes that were temporarily 'stuck' while waiting for an active background task to finish.

Meaning of the "Mode" parameter of Set-AzureDeployment

What is the meaning of the "Mode" parameter of Set-AzureDeployment?
-Mode
Specifies the mode of upgrade. Supported values are: "Auto", "Manual", and "Simultaneous".
What do "Auto", "Manual", and "Simultaneous" mean?
I am particularly interested in "Simultaneous". Does it mean my package will be deployed to multiple instances simultaneously?
Thanks
Mode specifies the type of update to initiate. Role instances are allocated to update domains when the service is deployed. Updates can be initiated manually in each update domain or initiated automatically in all update domains.
If not specified, the default value is Auto. If set to Manual, WalkUpgradeDomain must be called to apply the update. If set to Auto, the update is automatically applied to each update domain in sequence.
To perform an automatic update of a deployment, call Upgrade Deployment or Change Deployment Configuration with the Mode element set to automatic. The update proceeds from that point without a need for further input. You can call Get Operation Status to determine when the update is complete.
To perform a manual update, first call Upgrade Deployment with the Mode element set to manual. Next, call Walk Upgrade Domain to update each domain within the deployment. You should make sure that the operation is complete by calling Get Operation Status before updating the next domain. For more information, please refer to this link.
One of the new deployment options we now support is the ability to do a “Simultaneous Update” of a Cloud Service (we sometimes also refer to this as the “Blast Option”). When you use this option we bypass the normal upgrade domain walk that is done by default with Cloud Services (where we upgrade parts of the Cloud Service sequentially to avoid ever bringing the entire service down) and we instead upgrade all roles and instances simultaneously. With today’s release this simultaneous update logic now happens within Windows Azure (on the cloud side). This has the benefit of enabling the Cloud Service update to happen much faster. For more information, please refer to this link.
I am particularly interested in "Simultaneous". Does it mean my
package will be deployed to multiple instances simultaneously?
The answer is yes.
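
The difference between the modes can be pictured as the batches of role instances that are updated together. The sketch below is only a model of the behaviour described above, not Azure code: `plan_update` and the instance names are made up for the illustration.

```python
def plan_update(instances_by_domain, mode):
    """Return the batches of instances that are updated together, in order.

    instances_by_domain maps an update domain number to its role instances.
    """
    domains = sorted(instances_by_domain)
    if mode == "Simultaneous":
        # One batch: every instance in every update domain at the same time.
        return [[i for d in domains for i in instances_by_domain[d]]]
    if mode in ("Auto", "Manual"):
        # One batch per update domain, walked in sequence. With "Auto" the
        # walk advances on its own; with "Manual" each batch only starts
        # when the operator calls Walk Upgrade Domain.
        return [list(instances_by_domain[d]) for d in domains]
    raise ValueError(f"unsupported mode: {mode}")


if __name__ == "__main__":
    layout = {0: ["web_0", "web_2"], 1: ["web_1", "web_3"]}
    for mode in ("Auto", "Simultaneous"):
        print(mode, plan_update(layout, mode))
```

With "Simultaneous" there is a single batch, which is why the update is faster but the whole service can be down while it runs.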

Azure Service Fabric

Please help me understand: is there any option in Azure Service Fabric to delay deprovisioning? I have a microservice application hosted in Service Fabric, with instances distributed across different nodes. If I try to disengage/deprovision the service from the portal, can Service Fabric internally check whether a transaction is in progress on any of the instances and, if so, wait for it to complete? Also, if Microsoft does not provide such a feature, is there a PowerShell command to check the instance status?
Thanks
I assume that by "disengage/deprovision the service from portal" you are referring to deleting the service via the Service Fabric Explorer web app (perhaps via a link followed from the portal). Please correct me if this is wrong.
To answer your question directly, the framework will not wait for in-flight operations to complete during a service delete. Every replica for the service will lose its read and write permissions, causing all in-flight operations to fail. We do not offer a way to stall during this step in order to, for example, allow currently open transactions to be completed.
The reason we do not offer this semantic is that service deletion is expected to be rare or permanent, and that delaying deletion for the final operation doesn't enable any additional scenarios. In either case, if a client is attempting operations on a service being deleted, either:
The last client operation may fail due to delete racing and revoking read/write permissions
Every subsequent client operation will fail due to the service no longer existing
or
The last client operation will succeed due to deletion being delayed
Every subsequent client operation will fail due to the service no longer existing
The expectation is that any client or dependent service should have already been updated or deleted prior to deleting the service they depend on, as you are making the permanent decision that this service should no longer exist.