How can I do a zero-downtime deployment in a cluster environment? - deployment

I need to roll out a major deployment (more than 15 EAR files) on my system, which is a high-availability system. How can I do this deployment with zero downtime?
My application server is IBM WAS.

After updating the applications, you can use the "Rollout Update" feature. Rather than saving and synchronizing the nodes yourself after the update, you can use this feature, which automatically performs the following tasks to propagate the changes to all deployment targets while maintaining high availability (assuming you have a horizontal cluster, such that cluster members exist on multiple nodes):
1) Save session changes to the master configuration
2) For each node in the cluster (one at a time, to enable continuous availability):
   a) Stop the cluster members on the node
   b) Synchronize the node
   c) Start the application servers (which automatically starts the application)
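To make the ordering concrete, here is a minimal Python simulation of the node-by-node sequence described above. It is only a sketch: the Node and Cluster classes and the rollout method are hypothetical names invented for illustration and do not call any WebSphere API. It assumes the cluster has at least two nodes, so that one can keep serving while another is updated.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    app_version: str
    running: bool = True


@dataclass
class Cluster:
    nodes: list
    log: list = field(default_factory=list)

    def serving(self):
        """Names of the nodes that are currently able to take traffic."""
        return [n.name for n in self.nodes if n.running]

    def rollout(self, new_version):
        """Update one node at a time so the other nodes keep serving."""
        for node in self.nodes:
            others_up = [n for n in self.nodes if n is not node and n.running]
            if not others_up:
                raise RuntimeError("no other node available to serve traffic")
            node.running = False             # stop the cluster members on the node
            node.app_version = new_version   # synchronize the node
            node.running = True              # start the application servers
            self.log.append((node.name, new_version))


cluster = Cluster([Node("NodeA", "1.0"), Node("NodeB", "1.0")])
cluster.rollout("2.0")
print(cluster.log)
```

The check before each stop is the whole point of the horizontal-cluster assumption: with a single node there is nothing left to serve requests, so the simulated rollout refuses to proceed.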

Related

Rundeck ansible inventory: static instead of dynamic

I deployed Rundeck (rundeck/rundeck:4.2.0) and it imports and discovers my inventory using the Ansible Resource Model Source. I have 300 nodes, of which roughly 150 are accessible/online at any given time; the rest are offline (IoT devices). All of this works fine.
My challenge is that when creating jobs I can assign only the nodes that are online, whereas I want to assign ALL nodes (including the offline ones) and keep retrying the job for the failed ones only. Only this way can I track the completeness of my deployment. Ideally, I would love Rundeck to be intelligent enough to run the job automatically as soon as a node comes back online.
Any ideas/hints on how to achieve that?
Thanks,
The easiest way is to use the health checks feature (only available in PagerDuty Process Automation On-Prem, formerly "Rundeck Enterprise"), so that you can use a node filter that matches only "healthy" (up) nodes.
With this approach (e.g. configuring a command health check against all nodes) you can dispatch your jobs only to "up" nodes out of the global set of nodes. This is possible using .* as the node filter and !healthcheck:status: HEALTHY as the exclude node filter. If an "offline" node comes back up, the filter/exclude filter should pick it up automatically.
For the Ansible/Rundeck integration, this works when host key checking is disabled, either with the environment variable ANSIBLE_HOST_KEY_CHECKING=False or with host_key_checking=false in the [defaults] section of the ansible.cfg file.
That way, you can see all Ansible hosts in your Rundeck nodes, and your commands/jobs should be dispatched only to online nodes; if an "offline" node changes its status, the filter should start matching it.
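The effect of filtering on health status, combined with the "retry only what is still missing" idea from the question, can be pictured with a small Python simulation. This is a sketch only: dispatch_round, the node names and the health dictionary are hypothetical and are not part of Rundeck or its API. It assumes some external loop (for example a scheduled job) calls the function repeatedly.

```python
def dispatch_round(pending, health, run_job):
    """Run the job on every pending node that is currently healthy.

    Returns the set of nodes that still need the job afterwards.
    """
    targets = {n for n in pending if health.get(n) == "HEALTHY"}
    succeeded = {n for n in targets if run_job(n)}
    return pending - succeeded


all_nodes = {"iot-01", "iot-02", "iot-03"}
health = {"iot-01": "HEALTHY", "iot-02": "UNHEALTHY", "iot-03": "HEALTHY"}

# First round: only the two healthy nodes receive the job.
remaining = dispatch_round(all_nodes, health, run_job=lambda node: True)
print(sorted(remaining))

# Later the offline device comes back, so the next round covers it.
health["iot-02"] = "HEALTHY"
remaining = dispatch_round(remaining, health, run_job=lambda node: True)
print(sorted(remaining))
```

The deployment is complete once the remaining set is empty, which gives the completeness tracking asked for in the question.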

K8s graceful upgrade of service with long-running connections

tl;dr: I have a server that handles WebSocket connections. The nature of the workload is that it is necessarily stateful (i.e., each connection has long-running state). Each connection can last ~20m-4h. Currently, I only deploy new revisions of this service at off hours to avoid interrupting users too much.
I'd like to move to a new model where deploys happen whenever, and the services gracefully drain connections over the course of ~30 minutes (typically the frontend can find a "good" time to make that switchover within 30 minutes, and if not, we just forcibly disconnect them). I can do that pretty easily with K8s by setting terminationGracePeriodSeconds on the pod spec.
However, what's less clear is how to do rollouts such that new connections only go to the most recent deployment. Suppose I have five replicas running. Normal deploys have an undesirable mode where a client is on R1 (replica 1) and then K8s deploys R1' (upgraded version) and terminates R1; frontend then reconnects and gets routed to R2; R2 terminates, frontend reconnects, gets routed to R3.
Is there any easy way to ensure that after the upgrade starts, new clients get routed only to the upgraded versions? I'm already running Istio (though not using very many of its features), so I could imagine doing something complicated with some custom deployment infrastructure (currently just using Helm) that spins up a new deployment, cuts over new connections to the new deployment, and gracefully drains the old deployment... but I'd rather keep it simple (just Helm running in CI) if possible.
Any thoughts on this?
This is already how things work with normal Services. Once a pod is terminating, it has already been removed from the Endpoints, so new connections are no longer routed to it. You'll probably need to raise the max surge (maxSurge) in the rolling update settings of the Deployment to 100%, so that it spawns all the new pods at once and then starts the shutdown process on all the old ones.
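As a rough sketch of what those settings could look like, the snippet below builds the relevant Deployment fields as a Python dictionary and prints them as JSON, which could then be applied as a patch (for example with kubectl patch). The 30-minute value comes from the question; setting maxUnavailable to 0 is my own assumption and is not something the answer requires.

```python
import json

DRAIN_SECONDS = 30 * 60  # assumed drain window, taken from the ~30 minutes in the question

patch = {
    "spec": {
        "strategy": {
            "type": "RollingUpdate",
            "rollingUpdate": {
                "maxSurge": "100%",    # allow a full extra set of new pods to start at once
                "maxUnavailable": 0,   # assumption: keep old pods until replacements are ready
            },
        },
        "template": {
            "spec": {
                # how long a terminating pod may keep running before it is killed
                "terminationGracePeriodSeconds": DRAIN_SECONDS,
            }
        },
    }
}

print(json.dumps(patch, indent=2))
```

Note that the grace period only helps if the server reacts to the termination signal by draining its existing connections rather than exiting immediately.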

How do I set up an Active/Passive environment with two nodes in OpenShift?

I am trying to configure an Active/Passive cluster with two nodes (using OpenShift). The second, passive node should be a hot standby; in other words, it is up and running but not doing anything until the first node dies. Then the passive node becomes active and a new passive node is started.
I have read the High Availability documentation, but it seems to cover only the theory. Furthermore, it seems like overkill (I am thinking there might be an easier way to meet my goal).
Where would I start?
What you are asking for goes against the usual practice for how Kubernetes/OpenShift is used. You wouldn't have hot standby nodes; you would always use all nodes in the cluster. You would then allow for enough additional capacity in your cluster that losing a node doesn't cause a problem, because the other nodes have enough capacity to run the applications. In this scenario the Kubernetes scheduler automatically restarts any applications that were on the failed node on the other nodes in the cluster, without you needing to perform any explicit failover steps.
So don't try to do anything special: set up your cluster with the two nodes, with applications distributed across both. If you need to be able to run on only a single node, make sure it has enough capacity to run everything. If over time you add more applications and one node is no longer enough, add a third node, with all three being used in the normal case. You can then again handle the failure of a single node.
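The capacity rule above boils down to a simple check: the total demand of your applications has to fit on whatever is left after the largest node fails. A minimal sketch in Python, where survives_single_node_loss, the node names and the numbers are all made up for illustration, and where scheduling details such as per-pod fragmentation are deliberately ignored:

```python
def survives_single_node_loss(node_capacity, app_demand):
    """Return True if all apps still fit after the largest node fails."""
    if len(node_capacity) < 2:
        return False  # with one node there is nowhere to fail over to
    remaining = sum(node_capacity.values()) - max(node_capacity.values())
    return sum(app_demand.values()) <= remaining


nodes = {"node-1": 16, "node-2": 16}            # e.g. GiB of memory per node
apps = {"frontend": 4, "api": 6, "worker": 5}   # total demand: 15

print(survives_single_node_loss(nodes, apps))

# Adding another application pushes demand past a single node,
# which is the point at which a third node is needed.
apps["reports"] = 4
print(survives_single_node_loss(nodes, apps))
nodes["node-3"] = 16
print(survives_single_node_loss(nodes, apps))
```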

Adding Desired State Configuration extension to a service fabric VMSS

We recently needed to add the Microsoft.Powershell.DSC extension to our VMSS that contain our service fabric cluster. We redeployed the cluster using our ARM template, with the addition of the new extension for DSC. During the deployment we observed that as many as 4 out of 5 scale set instances were in the restarting stage at a given time. The services in our cluster were also unresponsive during that time. The outage was only a few minutes long, but this seems like something that should not happen.
Reliability Level: Silver
Durability Level: Bronze
This is caused by the selected durability level 'bronze'.
The durability tier is used to indicate to the system the privileges that your VMs have with the underlying Azure infrastructure. In the primary node type, this privilege allows Service Fabric to pause any VM level infrastructure request (such as a VM reboot, VM reimage, or VM migration) that impact the quorum requirements for the system services and your stateful services. In the non-primary node types, this privilege allows Service Fabric to pause any VM level infrastructure requests like VM reboot, VM reimage, VM migration etc., that impact the quorum requirements for your stateful services running in it.
Bronze - No privileges. This is the default and is recommended if you are only running stateless workloads in your cluster.
I suggest reading this article. It's an MS employee's blog. I'll copy out the relevant part:
If you don’t mind all your VMs being rebooted at the same time, you can set upgradePolicy to “Automatic”. Otherwise set it to “Manual” and take care of applying changes to the scale set model to individual VMs yourself. It is fairly easy to script rolling out the update to VMs while maintaining application uptime. See https://learn.microsoft.com/en-us/azure/virtual-machine-scale-sets/virtual-machine-scale-sets-upgrade-scale-set for more details.
If your scale set is in a Service Fabric cluster, certain updates like changing OS version are blocked (currently – that will change in future), and it is recommended that upgradePolicy be set to “Automatic”, as Service Fabric takes care of safely applying model changes (like updated extension settings) while maintaining availability.
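Why four of five instances restarting at once takes the services down can be seen from a simple majority-quorum calculation. The following Python sketch assumes a replica set of 5 (one replica per scale set instance in the question) and a strict-majority quorum; has_quorum is a hypothetical helper, not a Service Fabric API.

```python
def has_quorum(replica_count, unavailable):
    """A replica set keeps quorum while a strict majority of replicas is up."""
    available = replica_count - unavailable
    return available >= replica_count // 2 + 1


print(has_quorum(5, 1))  # one instance restarting
print(has_quorum(5, 4))  # four of five restarting, as observed in the question
```

With Bronze durability nothing holds back the infrastructure from restarting several VMs together, so the second case can occur during a redeployment.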

Deployment in an IBM WebSphere 7 cluster with high-availability nodes

Environment:
Application: Java EE web app
JDK: 1.6
AS: WebSphere Application Server 7
OS: Red Hat zLinux
I am not a WebSphere admin, and I have been asked to develop a procedure or a script to solve the issue below.
I have a cluster with three nodes: NodeA, NodeB and NodeC. My application runs on this cluster. I want to deploy my application to these nodes without having to bring all of them down at once. Currently the deployment is done this way: we come in at night and stop all the servers at once from the console. Then we install the application on the main node, which is on the same machine as the deployment manager, and then we synchronize and bring all the servers back up one by one.
What I have been asked to do is upgrade the application or install the new EAR file without bringing everything down, as this causes downtime for the application. Is there a way to achieve this? WAS 7 is a very mature product, so I am sure there must be a way to do it.
I looked at the documentation/tutorial. We can do something like "Update", where we select the application (from Applications > WebSphere enterprise applications), select Update, then select the radio button "Replace Entire Application" and the radio button "Local file system", and point to the new EAR file. But the doc says that in that case it will bring down all the servers as well while updating. It's the same as before: no online deployment.
I am a Java programmer, so I thought of using the tools I have to solve this.
Tell me if this could be an issue:
1) We bring down NodeA.
2) We remove NodeA from the cluster (by pressing the remove node button or using removeNode.sh).
3) We install the new EAR on NodeA (can we do this in the same admin console, or through a shell script or Jython, or maybe like a standalone server?).
4) We then start it up again and add it back to the cluster.
Now we have NodeA with the new application, while NodeB and NodeC have the old application version.
Then we bring down NodeB,
remove NodeB from the cluster,
install the application on NodeB,
start it up again,
and add it back to the cluster.
Now we have two nodes with the new application and NodeC with the old one.
We repeat the same process for NodeC.
Will this work? Has anyone tried this? What issues can you think of that could happen?
I would really appreciate any feedback. I am sure there are experienced people on this forum. I don't think this is a rare issue; I believe this is something any organization with high availability requirements would want.
Thanks in advance for any help.
Syed...
This is a possible duplicate of How can i do zero down time deployment on cluster environment?. Here is essentially my answer from that question:
After updating the application, you can use the "Rollout Update" feature. Rather than saving and synchronizing the nodes yourself after the update, you can use this feature, which automatically performs the following tasks to propagate the changes to all deployment targets while maintaining high availability (assuming you have a horizontal cluster, such that cluster members exist on multiple nodes, which it sounds like you do):
1) Save session changes to the master configuration
2) For each node in the cluster (one at a time, to enable continuous availability):
   a) Stop the cluster members on the node
   b) Synchronize the node
   c) Start the application servers (which automatically starts the application)
Alternatively, you can use the following procedure:
1) Stop all node agents except the one on Node A.
2) Comment out or disable Node A in the load balancer or plugin (so that traffic will not reach the node).
3) Deploy the application.
4) The changes will be synchronized only to Node A, as its node agent is the only one running.
5) Uncomment/enable Node A in the plugin/load balancer.
6) Comment out/disable Node B in the plugin/load balancer to stop incoming traffic to the node.
7) Start the node agent of Node B so that it synchronizes the file changes to the node. The EAR application will stop and start after synchronization.
8) Uncomment/enable Node B in the plugin/load balancer.
9) Repeat steps 6, 7 and 8 for all the remaining nodes.
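If it helps to turn this procedure into a checklist for any number of nodes, the ordering can be generated with a few lines of Python. This is only a sketch that prints the steps; manual_rollout_plan is a hypothetical helper and does not drive WebSphere, the plugin, or the load balancer.

```python
def manual_rollout_plan(nodes):
    """Return the ordered steps for updating the given nodes one at a time."""
    if not nodes:
        return []
    first, rest = nodes[0], nodes[1:]
    steps = [
        f"Stop all node agents except the one on {first}",
        f"Disable {first} in the load balancer/plugin",
        "Deploy the application",
        f"Wait for {first} to synchronize",
        f"Enable {first} in the load balancer/plugin",
    ]
    for node in rest:
        # Each remaining node is drained, synchronized and re-enabled in turn.
        steps += [
            f"Disable {node} in the load balancer/plugin",
            f"Start the node agent on {node} and wait for synchronization",
            f"Enable {node} in the load balancer/plugin",
        ]
    return steps


for number, step in enumerate(manual_rollout_plan(["NodeA", "NodeB", "NodeC"]), start=1):
    print(f"{number}) {step}")
```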
Regards,
Laique Ahmed