Redis in our production environment is in cluster mode, with 6 nodes, 3 master nodes and 3 slave nodes. When nodes are switched due to network and other reasons, Redis-client cannot automatically refresh these nodes, and will report exception
MOVED 5649 192.168.1.1:6379
The vertx-redis-client version I use is 4.2.1
My configuration looks like this:
RedisOptions options = new RedisOptions();
options.setType(RedisClientType.CLUSTER)
.setRole(RedisRole.MASTER)
.setUseReplicas(RedisReplicas.NEVER)
Every time I encounter this problem, I need to restart my application service to restore it. I want to ask if there is any way to make vertx-redis-client automatically refresh the node?
Thank you ~
Related
We sometimes use Python scripts to spin up and monitor Kubernetes Pods running on Google Kubernetes Engine using the Official Python client library for kubernetes. We also enable auto-scaling on several of our node pools.
According to this, "Master VM is automatically scaled, upgraded, backed up and secured". The post also seems to indicate that some automatic scaling of the control plane / Master VM occurs when the node count increases from 0-5 to 6+ and potentially at other times when more nodes are added.
It seems like the control plane can go down at times like this, when many nodes have been brought up. In and around when this happens, our Python scripts that monitor pods via the control plane often crash, seemingly unable to find the KubeApi/Control Plane endpoint triggering some of the following exceptions:
ApiException, urllib3.exceptions.NewConnectionError, urllib3.exceptions.MaxRetryError.
What's the best way to handle this situation? Are there any properties of the autoscaling events that might be helpful?
To clarify what we're doing with the Python client is that we are in a loop reading the status of the pod of interest via read_namespaced_pod every few minutes, and catching exceptions similar to the provided example (in addition we've tried also catching exceptions for the underlying urllib calls). We have also added retrying with exponential back-off, but things are unable to recover and fail after a specified max number of retries, even if that number is high (e.g. keep retrying for >5 minutes).
One thing we haven't tried is recreating the kubernetes.client.CoreV1Api object on each retry. Would that make much of a difference?
When a nodepool size changes, depending on the size, this can initiate a change in the size of the master. Here are the nodepool sizes mapped with the master sizes. In the case where the nodepool size requires a larger master, automatic scaling of the master is initiated on GCP. During this process, the master will be unavailable for approximately 1-5 minutes. Please note that these events are not available in Stackdriver Logging.
At this point all API calls to the master will fail, including the ones from the Python API client and kubectl. However after 1-5 minutes the master should be available and calls from both the client and kubectl should work. I was able to test this by scaling my cluster from 3 node to 20 nodes and for 1-5 minutes the master wasn't available .
I obtained the following errors from the Python API client:
Max retries exceeded with url: /api/v1/pods?watch=False (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at>: Failed to establish a new connection: [Errno 111] Connection refused',))
With kubectl I had :
“Unable to connect to the server: dial tcp”
After 1-5 minutes the master was available and the calls were successful. There was no need to recreate kubernetes.client.CoreV1Api object as this is just an API endpoint.
According to your description, your master wasn't accessible after 5 minutes which signals a potential issue with your master or setup of the Python script. To troubleshoot this further on side while your Python script runs, you can check for availability of master by running any kubectl command.
I have a 5 cluster MariaDB/Galera cluster running in production environment.
I also have a monitor which checks every 20 seconds for cluster size changes. One of our other engineers has been running queries using MySQL Workbench, and when that application is running, I start seeing alerts coming from my monitor where cluster size is 1. It does recover in a few seconds back to the correct size of 5, however it's disconcerting that this client app is causing issues on the cluster. I've requested everyone on our team to not use this app... however I wonder if anyone else has seen this, or knows what it is doing to the cluster.
I changed my initialization script after creating a cluster with 2 worker nodes for spark. Then I changed the script a bit and tried to update the cluster with 2 more worker nodes. The script failed because I simply forgot to apt-get update before apt-get install, so dataproc reports error and the cluster's status changed to ERROR. When I try to reduce the size back to 2 nodes again, it doesn't work anymore with the following message
ERROR: (gcloud.dataproc.clusters.update) Cluster 'cluster-1' must be running before it can be updated, current cluster state is 'ERROR'.
The two worker nodes are still added, but they don't seem to be detected by a running spark application at first because no more executors are added. I manually reset the two instances on the Google Compute Engine page, and then 4 executors are added. So it seems everything is working fine again except that the cluster's status is still ERROR, and I cannot increase or decrease the number of worker nodes anymore.
How can I update the cluster status back to normal (RUNNING)?
In your case ERROR indicates that workflow to re-configure the cluster has failed, and Dataproc is not sure of its health. At this point Dataproc cannot guarantee that another reconfigure attempt will succeed so further updates are disallowed. You can however submit jobs.
Your best bet is to delete it and start over.
Environment :
Java EE webApp
JDK: 1.6,
AS: Websphere app server 7,
OS:redhatzLinux
I am not a websphere admin and I am asked to develop a way or a script to solve the issue below:
I have a cluster with three nodes NodeA NodeB and NodeC. My application runs on these clusters. I want to deploy my application on these nodes such that i dont need to bring all of them down at once. These days the deployments is done this way : we come at night to stop all the servers all at once from console. Then we install the application on the main node which is on the same machine as the deployment manager and then we synchronize and bring all the servers back up one by one.
What I am asked to do is that we upgrade the application or install the new ear file by not bringing everything down as this is causing downtime to the application. Is there a way to acheive this. WAS 7 is a very mature product i am sure there must be a way to do it.
I looked at the documentation/tutorial we can do something like "Update" where we select the application (from Apllications> websphere enterprise application)and select update and then select radio button "Replace Entire Application" and radio button"local file system" and point to the new ear file. But in that case the doc says that it will bring down all the servers as well when updating. its the same as before. no online deployment.
I am a java programmer so I thought of using what tools I have to solve this
Tell me if this is can be an issue :
1) We bring down NODEA
2) We remove the NODEA from the cluster (by pressing remove node button or using the removeNode.sh)
3) Install the new Ear on the NODEA (can we do this in the same admin console? or through shell script or jython or may be like a standalone server)
3) We then start it up back again and then add it to cluster.
NOW we have NODEA with new applicaition while NODE B and NODEC are with old application versions.
Then we bring down NODEB
remove NODEB from cluster
install applciation on NODEB
start it up again
Add it back to cluster
NOW we have two nodes with new application and NODEC with old
we try the same process for NODEC.
Will this work. Has any one tried this. what issues can you think of that can happen.
I will so appreciate any feedback from here. I am sure there are experienced ppl on this forum. I dont think this is a rare issue,i believe this is something any organization would want with High Availability requirements.
Thanks for any help in advance.
Syed...
This is a possible duplicate of How can i do zero down time deployment on cluster environment?. Here is essentially my answer from that question:
After updating the application, you can utilize the "Rollout Update" feature. Rather than saving and synchronizing the nodes after updating, you can use this feature which automatically performs the following tasks to enable the changes to propagate to all deployment targets while maintaining high availability (assuming you have a horizontal cluster, such that cluster members exist on multiple nodes, which it sounds like you do):
Save session changes to the master configuration
For each node in the cluster (one at a time, to enable continuous availability):
Stop the cluster members on the node
Synchronize the node
Start the application servers (which automatically starts the application)
Alternatively, you can follow the following procedure.
Stop all nodeagents except Node A.
Comment out or disable the Node A from Load Balancer or Plugin (So the traffic will not come to the node)
Deploy the application.
Changes will be synchronized only on Node A as its nodeagent is up.
Uncomment/enable the Node A from plugin / load balancer.
Comment/disable Node B from plugin/load balancer to stop incomming traffic on the node.
Start the nodeagent of Node B so it will synchronize the file changes on the Node. The ear application will stop and start after synchronization.
Uncomment/enable the Node B from plugin / load balancer.
Repeat steps 6,7,8 for all the remaining nodes.
Regards,
Laique Ahmed
I need to deploy a major deployment on my system (more that 15 ear file ) , my system is high available system , So how can I do this deployment with zero downtime ?
my application server is IBM-WAS
After updating the applications, you can utilize the "Rollout Update" feature. Rather than saving and synchronizing the nodes after updating, you can use this feature which automatically performs the following tasks to enable the changes to propagate to all deployment targets while maintaining high availability (assuming you have a horizontal cluster, such that cluster members exist on multiple nodes):
Save session changes to the master configuration
For each node in the cluster (one at a time, to enable continuous availability):
Stop the cluster members on the node
Synchronize the node
Start the application servers (which automatically starts the application)