We have a DC/OS 1.9.0 cluster with 11 public agent nodes (CentOS). Sometime in the last couple of days, one of the agent nodes became detached and is no longer visible in the DC/OS UI.
I'm able to ssh into the detached agent node. Is there any way to re-attach this instance without data loss? Can you suggest the best way to debug this issue?
We were able to fix the problem by following the instructions at https://dcos.io/docs/1.9/administering-clusters/update-a-node/, in the sub-section "Updating nodes by manually killing agents".
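For reference, that sub-section boils down to roughly the following commands on the detached agent. This is a hedged sketch: service names and paths are the DC/OS 1.9 defaults and may differ on your install.

```shell
# Run on the detached agent node (assumes systemd-managed DC/OS 1.9 defaults).
# Stop the agent service; public agents typically run dcos-mesos-slave-public:
sudo systemctl stop dcos-mesos-slave-public

# Remove the stale agent metadata so the node re-registers with a fresh agent ID
# (task data under /var/lib/mesos is left in place):
sudo rm -f /var/lib/mesos/slave/meta/slaves/latest

# Restart the agent; it should reappear in the DC/OS UI after re-registration:
sudo systemctl start dcos-mesos-slave-public
```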
I am going through the guide here:
https://learn.microsoft.com/en-us/azure/service-fabric/service-fabric-cluster-creation-for-windows-server
Section "Step 1B: Create a multi-machine cluster".
I have installed the cluster on one box and am trying to use the same JSON (as per the instructions) to install it on another box, so that I can have the cluster running on 2 VMs.
I now get this error when I run TestConfig.ps1:
Previous Fabric installation detected on machine XXX. Please clean the machine.
Previous Fabric installation detected on machine XXX. Please clean the machine.
Data Root node Dev Box1 exists on machine XXX in \XXX\C$\ProgramData\SF\Dev Box1. This is an artifact from a previous installation - please delete the directory corresponding to this node.
First, take a look at this link. These are the requirements that each cluster node needs to meet if you want to create the cluster.
The error is pretty self-explanatory: you most likely already have SF installed on the machine, either the SF runtime or some uncleaned cluster data.
Your first step should be to run the CleanFabric PowerShell script from the SF standalone package on each node. It should clean up all SF data (cluster, runtime, registry, etc.). Then run the TestConfiguration script again. If that does not help, go to each node and manually delete any SF data the TestConfiguration script complains about.
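As a sketch, the cleanup on each node might look like this, run from an elevated PowerShell prompt. The package path and data-root path below are placeholders for wherever you unzipped the standalone package and whatever data root TestConfiguration complains about:

```powershell
# Hypothetical path to the extracted standalone package
cd C:\temp\Microsoft.Azure.ServiceFabric.WindowsServer

# Remove any previous SF runtime, cluster data and registry entries on this node
.\CleanFabric.ps1

# Also delete any leftover data root the TestConfiguration output points at, e.g.:
Remove-Item -Recurse -Force 'C:\ProgramData\SF' -ErrorAction SilentlyContinue

# Then re-run the configuration test from the package root
.\TestConfiguration.ps1 -ClusterConfigFilePath .\ClusterConfig.json
```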
I have a cluster in Azure that failed to update automatically, so I'm trying a manual update. I tried via the portal and it failed, so I kicked off an update using PowerShell, which failed as well. The update starts, then just sits at "UpdatingUserConfiguration", and after an hour or so fails with a timeout. I have removed all application types and checked my certs for "NETWORK SERVICE". The cluster is 5 VMs, a single node type, Windows.
Error
Set-AzureRmServiceFabricUpgradeType : Code: ClusterUpgradeFailed,
Message: Cluster upgrade failed. Reason Code: 'UpgradeDomainTimeout',
Upgrade Progress:
{
  "upgradeDescription": {
    "targetCodeVersion": "6.0.219.9494",
    "targetConfigVersion": "1",
    "upgradePolicyDescription": {
      "upgradeMode": "UnmonitoredAuto",
      "forceRestart": false,
      "upgradeReplicaSetCheckTimeout": "37201.09:59:01",
      "kind": "Rolling"
    }
  },
  "targetCodeVersion": "6.0.219.9494",
  "targetConfigVersion": "1",
  "upgradeState": "RollingBackCompleted",
  "upgradeDomains": [
    {"name": "1", "state": "Completed"},
    {"name": "2", "state": "Completed"},
    {"name": "3", "state": "Completed"},
    {"name": "4", "state": "Completed"}
  ],
  "rollingUpgradeMode": "UnmonitoredAuto",
  "upgradeDuration": "02:02:07",
  "currentUpgradeDomainDuration": "00:00:00",
  "unhealthyEvaluations": [],
  "currentUpgradeDomainProgress": {"upgradeDomainName": "", "nodeProgressList": []},
  "startTimestampUtc": "2018-05-17T03:13:16.4152077Z",
  "failureTimestampUtc": "2018-05-17T05:13:23.574452Z",
  "failureReason": "UpgradeDomainTimeout",
  "upgradeDomainProgressAtFailure": {
    "upgradeDomainName": "1",
    "nodeProgressList": [
      {
        "nodeName": "_mstarsf10_1",
        "upgradePhase": "PreUpgradeSafetyCheck",
        "pendingSafetyChecks": [{"kind": "EnsureSeedNodeQuorum"}]
      }
    ]
  }
}
Any ideas on what I can do about an "EnsureSeedNodeQuorum" error?
The root cause was that there were only 3 seed nodes in the cluster, a result of the cluster being built with a VM scale set that had "overprovision" set to true. Lesson learned: remember to set "overprovision" to false.
I ended up deleting the cluster and scale set and recreating them from my stored ARM template.
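For anyone redeploying from a template: the setting sits on the scale set resource. A minimal illustrative fragment (the apiVersion and parameter name here are placeholders; only the "overprovision" property is the point):

```json
{
  "type": "Microsoft.Compute/virtualMachineScaleSets",
  "apiVersion": "2017-12-01",
  "name": "[parameters('vmNodeType0Name')]",
  "properties": {
    "overprovision": false
  }
}
```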
I changed my initialization script after creating a cluster with 2 worker nodes for Spark. Then I changed the script a bit and tried to update the cluster with 2 more worker nodes. The script failed because I simply forgot to run apt-get update before apt-get install, so Dataproc reported an error and the cluster's status changed to ERROR. When I try to reduce the size back to 2 nodes, it no longer works, failing with the following message:
ERROR: (gcloud.dataproc.clusters.update) Cluster 'cluster-1' must be running before it can be updated, current cluster state is 'ERROR'.
The two worker nodes were still added, but at first they don't seem to be detected by a running Spark application, because no more executors were added. I manually reset the two instances on the Google Compute Engine page, and then 4 executors were added. So everything seems to be working again, except that the cluster's status is still ERROR, and I can no longer increase or decrease the number of worker nodes.
How can I get the cluster status back to normal (RUNNING)?
In your case, ERROR indicates that the workflow to re-configure the cluster failed, so Dataproc is not sure of its health. At this point Dataproc cannot guarantee that another reconfigure attempt will succeed, and further updates are therefore disallowed. You can, however, still submit jobs.
Your best bet is to delete it and start over.
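A hedged sketch of the delete-and-recreate path with gcloud. The region, worker count and init-action bucket path are placeholders for your own values:

```shell
# Delete the cluster stuck in ERROR:
gcloud dataproc clusters delete cluster-1 --region=us-central1

# Recreate it with the fixed initialization script
# (which now runs apt-get update before apt-get install):
gcloud dataproc clusters create cluster-1 \
    --region=us-central1 \
    --num-workers=2 \
    --initialization-actions=gs://my-bucket/init.sh
```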
I have configured a SOA cluster with one admin node and two managed nodes, with all server nodes configured on three different machines. Once I deploy a BPEL composite to one managed node, it is automatically deployed to the other managed node as well (the default behavior). In SOA Enterprise Manager, those deployed composites can be viewed under [SOA -> managed node -> Default -> ...], which is the same place where we deploy new ones. I accidentally undeployed all the composites (you can do this by right-clicking a managed node and choosing the undeploy option).
Now I'm having a hard time getting back to the previous state: how do I deploy all those projects again to a specific managed node? I tried restarting the node hoping it would sync again, yet the managed server went into the "admin" state (not the OK state).
Is there anything that needs to be done?
Thanks, Hemal
You'll need to start the server from the command line; that will work.
To manage 'managed servers' from EM or the WLS console, there is one additional step needed during the installation process.
Modify the nodemanager.properties file of WLS and set the property StartScriptEnabled=true.
http://download.oracle.com/docs/cd/E12839_01/core.1111/e10105/start.htm#CIHBACFI
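For reference, the change in nodemanager.properties is a one-liner (the file location varies by WebLogic version, e.g. under WL_HOME/common/nodemanager; restart the Node Manager after editing):

```properties
# Start managed servers through the domain start script, so servers
# started from EM/WLS console come up in RUNNING rather than ADMIN state
StartScriptEnabled=true
```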
Environment:
Java EE web app
JDK: 1.6
AS: WebSphere Application Server 7
OS: Red Hat zLinux
I am not a WebSphere admin, and I have been asked to develop a way (or a script) to solve the issue below:
I have a cluster with three nodes: NodeA, NodeB and NodeC. My application runs on this cluster. I want to deploy my application to these nodes in such a way that I don't need to bring all of them down at once. Today the deployment is done like this: we come in at night and stop all the servers at once from the console. Then we install the application on the main node, which is on the same machine as the deployment manager, and then we synchronize and bring all the servers back up one by one.
What I am asked to do is upgrade the application (install the new EAR file) without bringing everything down, since that causes downtime for the application. Is there a way to achieve this? WAS 7 is a very mature product; I am sure there must be a way.
I looked at the documentation/tutorials. We can do something like "Update": select the application (from Applications > WebSphere enterprise applications), choose Update, then select the "Replace Entire Application" and "Local file system" radio buttons and point to the new EAR file. But in that case the doc says it will bring down all the servers while updating. It's the same as before: no online deployment.
I am a Java programmer, so I thought of using the tools I have to solve this.
Tell me if this could be an issue:
1) We bring down NodeA.
2) We remove NodeA from the cluster (by pressing the Remove Node button, or using removeNode.sh).
3) We install the new EAR on NodeA (can we do this in the same admin console, or through a shell script or Jython, or maybe as a standalone server?).
4) We then start it back up and add it to the cluster again.
Now we have NodeA with the new application while NodeB and NodeC have the old version.
Then we bring down NodeB,
remove NodeB from the cluster,
install the application on NodeB,
start it up again,
and add it back to the cluster.
Now we have two nodes with the new application and NodeC with the old.
We try the same process for NodeC.
Will this work? Has anyone tried this? What issues can you think of that might happen?
I would appreciate any feedback; I am sure there are experienced people on this forum. I don't think this is a rare issue; I believe it is something any organization with high-availability requirements would want.
Thanks for any help in advance.
Syed...
This is a possible duplicate of "How can i do zero down time deployment on cluster environment?". Here is essentially my answer from that question:
After updating the application, you can use the "Rollout Update" feature. Rather than saving and synchronizing the nodes yourself after the update, this feature automatically performs the following tasks to propagate the changes to all deployment targets while maintaining high availability (assuming you have a horizontal cluster, i.e. cluster members exist on multiple nodes, which it sounds like you do):
Save session changes to the master configuration
For each node in the cluster (one at a time, to enable continuous availability):
Stop the cluster members on the node
Synchronize the node
Start the application servers (which automatically starts the application)
Alternatively, you can follow this procedure:
1) Stop all nodeagents except the one on Node A.
2) Comment out or disable Node A in the load balancer or plugin (so no traffic reaches the node).
3) Deploy the application.
4) The changes will be synchronized only to Node A, as its nodeagent is the only one up.
5) Uncomment/enable Node A in the plugin / load balancer.
6) Comment out/disable Node B in the plugin / load balancer to stop incoming traffic to the node.
7) Start the nodeagent of Node B so it synchronizes the file changes on that node. The EAR application will stop and restart after synchronization.
8) Uncomment/enable Node B in the plugin / load balancer.
Repeat steps 6, 7 and 8 for all the remaining nodes.
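The node-by-node part of the procedure can be sketched as a small driver script. Everything here is a placeholder: disable_in_plugin, enable_in_plugin and sync_node stand in for your actual plugin-cfg.xml edits and for starting the nodeagent (which triggers the config sync and app restart).

```shell
#!/bin/sh
# Placeholder hooks; replace with real plugin/load-balancer edits and
# WAS commands (e.g. startNode.sh) for your environment.
disable_in_plugin() { echo "traffic off: $1"; }
sync_node()         { echo "synced: $1"; }
enable_in_plugin()  { echo "traffic on: $1"; }

synced=0
for node in nodeA nodeB nodeC; do
  disable_in_plugin "$node"   # stop incoming traffic to this node
  sync_node "$node"           # nodeagent up -> config sync -> EAR restart
  enable_in_plugin "$node"    # put the node back in rotation
  synced=$((synced + 1))
done
echo "rolled out to $synced nodes"
```

Because only one node is out of rotation at a time, the application stays available throughout the rollout.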
Regards,
Laique Ahmed