Condor central manager could not see the other computing nodes - hpc

I connected three servers to form an HPC cluster, using Condor as the middleware. When I run condor_status on the central manager, it does not show the other nodes. I can run jobs on the central manager and connect to the other nodes via SSH, so it seems that something is missing in the Condor configuration files. There I set the central manager as the Condor host and allowed read and write access for everyone, and I kept the MASTER and STARTD daemons in the daemon list for the worker nodes.
When I run condor_status on the central manager, it shows only the central manager. When I run it on a compute node, it gives the error "CEDAR:6001:Failed to connect to" followed by the central manager's IP address and port number.

I managed to solve it. The problem was that the central manager's firewall (iptables in my case) was running.
When I stopped the firewall (su -c "service iptables stop"), all nodes appeared normally in the output of condor_status.
The firewall status can be checked using "service iptables status".
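Before switching the firewall off entirely, it can help to confirm from a worker node that the central manager's collector port is the thing being blocked. The following is a minimal sketch, not part of the original answer: the helper name can_connect is made up for illustration, and it only tests whether a plain TCP connection can be opened.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) opens within timeout.

    Hypothetical helper for illustration; it only checks reachability and
    says nothing about whether Condor itself is configured correctly.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

Run on a compute node, can_connect("<central manager IP>", 9618) returning False while SSH to the same host works would point at a firewall rule. Port 9618 is the default collector port; use the port from the error message if your pool is configured differently.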

There are a number of things that could be going on here. I'd suggest you follow this tutorial and see if it resolves your problems -
http://spinningmatt.wordpress.com/2011/06/12/getting-started-creating-a-multiple-node-condor-pool/

In my case the service "condor.exe" was not running on the server because I had stopped it manually. I just started it and everything went fine.

FabricGateway.exe goes into a boot loop after a server reboot

I have an on-prem Service Fabric 3-node cluster running 8.2.1571.9590. It has been running for months without any problems.
The cluster nodes were rebooted overnight as part of operating system patching, and the cluster will no longer establish connections.
If I run connect-servicefabriccluster -verbose, I get the timeout error:
System.Fabric.FabricTransientException: Could not ping any of the provided Service Fabric gateway endpoints.
Looking at the running processes, I can see that all the expected processes start and are stable, with the exception of FabricGateway.exe, which goes into a boot loop.
I have confirmed that:
I can ping between the nodes in the cluster
I can open a PowerShell remote session between the nodes in the cluster
No cluster certs have expired
Any suggestions as to how to debug this issue?
Actual Problem
On checking the Windows event logs (Admin Tools > Event Viewer > Application & Service Logs > Microsoft Service Fabric > Admin), I could see errors related to the startup of the FabricGateway process. The errors and warnings come in repeated sets in the following basic order:
CreateSecurityDescriptor: failed to convert mydomain\admin-old to SID: NotFound
failed to set security settings to { provider=Negotiate protection=EncryptAndSign remoteSpn='' remote='mydomain\admin-old, mydomain\admin-new, mydomain\sftestuser, ServiceFabricAdministrators, ServiceFabricAllowedUsers' isClientRoleInEffect=true adminClientIdentities='mydomain\admin-old, mydomain\admin-new, ServiceFabricAdministrators' claimBasedClientAuthEnabled=false }: NotFound
Failed to initialize external channel error : NotFound
EntreeService proxy open failed : NotFound
FabricGateway failed with error code = S_OK
client-sfweb1:19000/[::1]:19000: error = 2147943625, failureCount=9082. This is conclusive that there is no listener. Connection failure is expected if listener was never started, or listener / its process was stopped before / during connecting. Filter by (type~Transport.St && ~"(?i)sfweb1:19000") on the other node to get listener lifecycle, or (type~Transport.GlobalTransportData) for all listeners
Using Windows Task Manager (or a similar tool), you could see the FabricGateway.exe process starting and terminating every few seconds.
The net effect was that Service Fabric cluster communication could not be established.
Solution
The problem was that the domain account mydomain\admin-old (an old account, not used for a long time) had been deleted from Active Directory, so no SID could be found for it. This failure was causing the loop, even though the other admin accounts were valid.
The fix was to remove the deleted ID from the currently active settings.xml file on each cluster node. The process I used was:
RDP onto a cluster node VM
Stop the Service Fabric service
Find the current Service Fabric cluster configuration, e.g. the newest folder of the form D:\SvcFab\VM0\Fabric\Fabric.Config.4.123456
Edit settings.xml and remove the deleted account mydomain\admin-old from the AdminClientIdentities block, so I ended up with:
<Section Name="Security">
<Parameter Name="AdminClientIdentities" Value="mydomain\admin-new" />
...
Once the file is saved, restart the Service Fabric service; it should start as normal. Remember, it will take a minute or two to start up.
Repeat the process on the other nodes in the cluster.
Once completed, the cluster starts and operates as expected.
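The edit to the AdminClientIdentities value amounts to dropping one entry from a comma-separated list. As a rough sketch of that one step (the function name remove_identity is hypothetical, and this works on the attribute value only, not on the settings.xml file itself):

```python
def remove_identity(value, identity):
    """Drop one account from a comma-separated identity list.

    Hypothetical helper for illustration. The comparison ignores case,
    since Windows account names are not case-sensitive.
    """
    parts = [part.strip() for part in value.split(",")]
    kept = [p for p in parts if p and p.lower() != identity.lower()]
    return ", ".join(kept)
```

For example, applying it to the adminClientIdentities list shown in the event log with the identity mydomain\admin-old leaves mydomain\admin-new and ServiceFabricAdministrators in place. The same deleted account may also appear in other identity lists in the file, so it is worth searching the whole file for it.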

mongodb mms monitoring agent does not find group members

I have installed the latest mongodb mms agent (6.5.0.456) on ubuntu 16.04 and initialised the replicaset. Hence I am running a single node replicaset with the monitoring agent enabled. The agent works fine, however it does not seem to actually find the replicaset member:
[2018/05/26 18:30:30.222] [agent.info] [components/agent.go:Iterate:170] Received new configuration: Primary agent, Assigned 0 out of 0 plus 0 chunk monitor(s)
[2018/05/26 18:30:30.222] [agent.info] [components/agent.go:Iterate:182] Nothing to do. Either the server detected the possibility of another monitoring agent running, or no Hosts are configured on the Group.
[2018/05/26 18:30:30.222] [agent.info] [components/agent.go:Run:199] Done. Sleeping for 55s...
[2018/05/26 18:30:30.222] [discovery.monitor.info] [components/discovery.go:discover:746] Performing discovery with 0 hosts
[2018/05/26 18:30:30.222] [discovery.monitor.info] [components/discovery.go:discover:803] Received discovery responses from 0/0 requests after 891ns
I can see two processes for monitor agents:
/bin/sh -c /usr/bin/mongodb-mms-monitoring-agent -conf /etc/mongodb-mms/monitoring-agent.config >> /var/log/mongodb-mms/monitoring-agent.log 2>&1
/usr/bin/mongodb-mms-monitoring-agent -conf /etc/mongodb-mms/monitoring-agent.config
However if I terminate one, it also tears down the other, so I do not think that is the problem.
So the question is: what is the Group that the agent is referring to? Where is it configured? Or how do I find out which Group the agent refers to, and how do I check whether the group is configured correctly?
The rs.config() output looks fine: it has one replica set member, whose host field looks correct. I can use that value to connect to the instance with the mongo command. No auth is configured.
EDIT
It kind of looks that the cloud manager now needs to be configured with the seed host. Then it starts to discover all the other nodes in the replicaset. This seems to be different to pre-cloud-manager days, where the agent was able to track the rs - if I remember correctly... Probably there still is a way to get this done easier, so I am leaving this question open for now...
So the question is: what is the Group that the agent is referring to? Where is it configured? Or how do I find out which Group the agent refers to, and how do I check whether the group is configured correctly?
Configuration values for the Cloud Manager agent (such as mmsGroupId and mmsApiKey) are set in the config file, which is /etc/mongodb-mms/monitoring-agent.config by default. The agent needs this information in order to communicate with the Cloud Manager servers.
For more details, see Install or Update the Monitoring Agent and Monitoring Agent Configuration in the Cloud Manager documentation.
It kind of looks that the cloud manager now needs to be configured with the seed host. Then it starts to discover all the other nodes in the replicaset.
Unless a MongoDB process is already managed by Cloud Manager automation, I believe it has always been the case that you need to add an existing MongoDB process to monitoring to start the process of initial topology discovery. Once a deployment is monitored, any changes in deployment membership should automatically be discovered by the Cloud Manager agent.
Production deployments should have authentication and access control enabled, so in addition to adding a seed hostname and port via the Cloud Manager UI, you usually need to provide appropriate credentials.
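To see which Group a given agent is pointed at, you can read mmsGroupId out of the agent's config file. Here is a minimal sketch, assuming the file consists of simple key=value lines with # comments; the function name read_agent_settings is made up for illustration.

```python
def read_agent_settings(text):
    """Parse key=value lines into a dict, skipping blanks and # comments.

    Assumes a simple properties-style file; this is a hypothetical helper,
    not part of the monitoring agent.
    """
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings
```

Passing in the contents of /etc/mongodb-mms/monitoring-agent.config and looking up the "mmsGroupId" key gives the ID you can compare against the Group shown in the Cloud Manager UI.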

How to resolve the Marathon leader with mesos-dns

I've installed mesos-dns in our cluster and it is running fine. We can look up the domains of the apps installed in Marathon, but I would like to know on which host Marathon itself is running. If I run dig on marathon.domain, it does not resolve to anything.
According to the doc of mesos-dns: "A records ({framework}.domain) and SRV records (_framework._tcp.{framework}.domain) - for every known Mesos master"
Thanks.
It's marathon.mesos unless you've used a different TLD. The Marathon scheduler runs on the Master.
You can use my mesosdns-resolver bash script to get the endpoint from Mesos DNS.
You can use it like:
mesosdns-resolver.sh -sn <service-name>.marathon.mesos -s <IP_ADDRESS_OF_MESOS_DNS_SERVER>

Mesos cluster does not recover when physical hosts restart

I'm using Mesosphere on 3 hosts running Ubuntu 14.04, as follows:
one with the Mesos master
two with Mesos slaves
Everything works fine, but after restarting all the physical hosts, all scheduled jobs were lost. Is this normal? I expected that ZooKeeper would store the current jobs, so that when the system has to restart, all jobs would be rescheduled after the master boots.
Update:
I'm running Marathon and Mesos on the same node, and I run Marathon with the --zk flag.
With marathon's --zk and --ha enabled, Marathon should be storing its state in ZK and recovering it on restart, as long as Mesos allows it to reregister with the same framework ID.
However, you'll also need to enable the Mesos registry (even for a single master), to ensure that Mesos persists information about what frameworkIds are registered in the event of master failover. This can be accomplished by setting the --registry=replicated_log (default), --quorum=1 (since you only have 1 master), and --work_dir=/path/to/registry (where to store the state).
I solved the problem by following these installation instructions: How To Configure a Production-Ready Mesosphere Cluster on Ubuntu 14.04
Although you found a solution, I'd like to explain this issue in more detail. :)
From the official docs: http://mesos.apache.org/documentation/latest/slave-recovery/
Note that if the operating system on the slave is rebooted, all
executors and tasks running on the host are killed and are not
automatically restarted when the host comes back up.
So all frameworks on Mesos will be killed after a reboot. One way to restart the frameworks is to run them all on Marathon, which will manage the other frameworks and restart them as needed.
However, you then need to restart Marathon automatically when it is killed. In the DigitalOcean link you mentioned, Marathon is installed with a script in init.d, so it is restarted after a reboot. Otherwise, if you installed Marathon from source, you can use a tool like supervisord to monitor Marathon.

Having Capistrano skip over down hosts

My setup
I am deploying a Ruby on Rails application to 70+ hosts. These hosts are located behind consumer-grade ADSL connections which may or may not be up. The probability of a host being up is around 99%, but definitely not 100%.
The deploy process works perfectly fine and I have no problem specific to it.
The problem
When Capistrano encounters a down host, it stops the entire process. This is a problem because if host n°30 is down, then the 40 other hosts after it do not get the deployment.
What I would like is definitely an error for the hosts that are down, but I would also like Capistrano to continue deploying to all the hosts that are up.
Is there any setting or configuration that would enable me to do this?
I ended up running a Capistrano instance for each IP, then parsing the logs to see which ones failed and which ones succeeded.
A little Python script adjusted to my needs does this fine.
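For anyone wanting to reproduce the idea, here is a rough sketch of such a wrapper. It is not the original script: it checks each run's exit code rather than parsing logs, and the names deploy_each and build_command are made up for illustration. How a single host is passed to Capistrano depends on your version and setup, so the command is supplied by the caller.

```python
import subprocess


def deploy_each(hosts, build_command, timeout=None):
    """Run one command per host and return (succeeded, failed) host lists.

    build_command is a caller-supplied function that maps a host to an
    argument list, e.g. a Capistrano invocation limited to that host.
    A failure on one host never stops the loop.
    """
    succeeded, failed = [], []
    for host in hosts:
        try:
            result = subprocess.run(
                build_command(host),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            ok = result.returncode == 0
        except (OSError, subprocess.TimeoutExpired):
            # Command missing, not executable, or the host took too long.
            ok = False
        (succeeded if ok else failed).append(host)
    return succeeded, failed
```

Setting a timeout is worthwhile here, since a host behind a dead ADSL line may otherwise hang the run for a long time. The failed list can then be retried later once those hosts are back up.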