How to resolve the Marathon leader with mesos-dns

I've installed mesos-dns in our cluster and it is running fine. We can resolve the domains of the apps installed in Marathon, but I would like to know which host Marathon itself is installed on. A dig on marathon.domain does not resolve anything.
According to the mesos-dns docs: "A records ({framework}.domain) and SRV records (_framework._tcp.{framework}.domain) - for every known Mesos master"
Thanks.

It's marathon.mesos unless you've used a different TLD. The Marathon scheduler runs on the Master.
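A minimal Python sketch of that lookup, in case it helps. It builds the record names from the pattern quoted in the question and asks the system resolver for the A record. It assumes the default "mesos" TLD and that the machine's resolver already points at Mesos-DNS; the helper names are made up for illustration.

```python
import socket


def framework_record_names(framework: str, domain: str = "mesos") -> dict:
    """Build the A and SRV record names for a framework, following the
    pattern quoted from the mesos-dns docs."""
    fqdn = f"{framework}.{domain}"
    return {"A": fqdn, "SRV": f"_framework._tcp.{fqdn}"}


def resolve_a_record(name: str) -> list:
    """Return the IPv4 addresses the system resolver reports for `name`,
    or an empty list if the name does not resolve."""
    try:
        _, _, addresses = socket.gethostbyname_ex(name)
    except OSError:  # covers socket.gaierror and socket.herror
        return []
    return addresses
```

Calling resolve_a_record(framework_record_names("marathon")["A"]) on a host that uses Mesos-DNS should then list the address(es) registered for Marathon. An empty list means the name did not resolve through that resolver.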

You can use my mesosdns-resolver bash script to get the endpoint from Mesos DNS.
You can use it like:
mesosdns-resolver.sh -sn <service-name>.marathon.mesos -s <IP_ADDRESS_OF_MESOS_DNS_SERVER>

Marathon (Mesos) - Stuck in "Loading applications"

I am building a mesos cluster from scratch (using Vagrant, which is not relevant for this issue).
OS: Ubuntu 16.04 (trusty)
Setup:
Master -> Runs ZooKeeper, Mesos-master, Marathon and Chronos
Slave -> Runs Mesos-slave
This is my provisioning script for the master node https://github.com/zeitgeist2018/infrastructure/blob/fix-marathon/provision/scripts/install-master.sh.
I have managed to register the slave with Mesos, install the Marathon and Chronos frameworks, and run scheduled jobs in Chronos (both with Docker and shell commands), but I can't get Marathon to work properly. The UI gets stuck on "Loading applications" as soon as I open it, and when I call the API, the request hangs forever with no response. Through the API I tried to fetch basic Marathon information and to run deployments, both with the same hanging result.
I've been checking the Marathon logs but I don't see any errors there. Just a couple of log lines that may (or may not) be a hint:
[2020-03-08 10:33:21,819] INFO Prompting Mesos for a heartbeat via explicit task reconciliation (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marathon-akka.actor.default-dispatcher-6)
[2020-03-08 10:33:21,822] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-87)
[2020-03-08 10:33:25,957] INFO Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-7)
Installing jdk11 and choosing it as default fixed this issue for me without downgrading the Marathon to any other version.
On Ubuntu 20.04:
sudo apt install openjdk-11-jre-headless
sudo update-alternatives --config java
I increased the number of CPUs of the virtual machine on which Marathon was installed to 3, and the problem was solved.
I have managed to make it work. It was as simple as downgrading Marathon to v1.7.189. After that, it starts properly, and the API responds to requests.

mongodb mms monitoring agent does not find group members

I have installed the latest mongodb mms agent (6.5.0.456) on ubuntu 16.04 and initialised the replicaset. Hence I am running a single node replicaset with the monitoring agent enabled. The agent works fine, however it does not seem to actually find the replicaset member:
[2018/05/26 18:30:30.222] [agent.info] [components/agent.go:Iterate:170] Received new configuration: Primary agent, Assigned 0 out of 0 plus 0 chunk monitor(s)
[2018/05/26 18:30:30.222] [agent.info] [components/agent.go:Iterate:182] Nothing to do. Either the server detected the possibility of another monitoring agent running, or no Hosts are configured on the Group.
[2018/05/26 18:30:30.222] [agent.info] [components/agent.go:Run:199] Done. Sleeping for 55s...
[2018/05/26 18:30:30.222] [discovery.monitor.info] [components/discovery.go:discover:746] Performing discovery with 0 hosts
[2018/05/26 18:30:30.222] [discovery.monitor.info] [components/discovery.go:discover:803] Received discovery responses from 0/0 requests after 891ns
I can see two processes for monitor agents:
/bin/sh -c /usr/bin/mongodb-mms-monitoring-agent -conf /etc/mongodb-mms/monitoring-agent.config >> /var/log/mongodb-mms/monitoring-agent.log 2>&1
/usr/bin/mongodb-mms-monitoring-agent -conf /etc/mongodb-mms/monitoring-agent.config
However if I terminate one, it also tears down the other, so I do not think that is the problem.
So, question is what is the Group that the agent is referring to. Where is that configured? Or how do I find out which Group the agent refers to and how do I check if the group is configured correctly.
The rs.config() looks fine, with one replicaset member, which has a host field, which looks just fine. I can use that value to connect to the instance using the mongo command. no auth is configured.
EDIT
It kind of looks that the cloud manager now needs to be configured with the seed host. Then it starts to discover all the other nodes in the replicaset. This seems to be different to pre-cloud-manager days, where the agent was able to track the rs - if I remember correctly... Probably there still is a way to get this done easier, so I am leaving this question open for now...
"So, question is what is the Group that the agent is referring to. Where is that configured? Or how do I find out which Group the agent refers to and how do I check if the group is configured correctly."
Configuration values for the Cloud Manager agent (such as mmsGroupId and mmsApiKey) are set in the config file, which is /etc/mongodb-mms/monitoring-agent.config by default. The agent needs this information in order to communicate with the Cloud Manager servers.
For more details, see Install or Update the Monitoring Agent and Monitoring Agent Configuration in the Cloud Manager documentation.
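To check which Group an agent is configured for, a short Python sketch like the following can read the config file. It assumes the file uses plain key=value lines with # comments; the function name is hypothetical and this is not part of the agent itself.

```python
def parse_agent_config(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and # comments."""
    settings = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        settings[key.strip()] = value.strip()
    return settings


def read_group_id(path: str = "/etc/mongodb-mms/monitoring-agent.config"):
    """Return the mmsGroupId value from the config file, or None if unset."""
    with open(path, encoding="utf-8") as handle:
        return parse_agent_config(handle.read()).get("mmsGroupId")
```

The value returned by read_group_id() can then be compared with the Group shown in the Cloud Manager UI.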
"It kind of looks that the cloud manager now needs to be configured with the seed host. Then it starts to discover all the other nodes in the replicaset."
Unless a MongoDB process is already managed by Cloud Manager automation, I believe it has always been the case that you need to add an existing MongoDB process to monitoring to start the process of initial topology discovery. Once a deployment is monitored, any changes in deployment membership should automatically be discovered by the Cloud Manager agent.
Production deployments should have authentication and access control enabled, so in addition to adding a seed hostname and port via the Cloud Manager UI you usually need to provide appropriate credentials.

Mesos cluster does not recover when physical host restart

I'm using Mesosphere on 3 hosts running Ubuntu 14.04, as follows:
one with mesos master
two with mesos slave
Everything works fine, but after restarting all the physical hosts, all scheduled jobs were lost. Is that normal? I expected ZooKeeper to store the current jobs, so that when the system has to be restarted, all jobs are rescheduled after the master boots.
Update:
I'm running Marathon and Mesos on the same node, and I run Marathon with the --zk flag.
With marathon's --zk and --ha enabled, Marathon should be storing its state in ZK and recovering it on restart, as long as Mesos allows it to reregister with the same framework ID.
However, you'll also need to enable the Mesos registry (even for a single master), to ensure that Mesos persists information about what frameworkIds are registered in the event of master failover. This can be accomplished by setting the --registry=replicated_log (default), --quorum=1 (since you only have 1 master), and --work_dir=/path/to/registry (where to store the state).
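As a rough illustration, the three master flags mentioned above can be assembled like this in Python. It is only a sketch: the work directory is an example path, the helper name is hypothetical, and a real master will normally need further flags (for example its ZooKeeper URL) that are not shown here.

```python
import shlex


def mesos_master_args(work_dir: str, quorum: int = 1) -> list:
    """Build a mesos-master command line that enables the replicated-log registry."""
    return [
        "mesos-master",
        "--registry=replicated_log",  # the default, spelled out for clarity
        f"--quorum={quorum}",         # 1 for a single master
        f"--work_dir={work_dir}",     # where the registry state is stored
    ]


def mesos_master_command(work_dir: str, quorum: int = 1) -> str:
    """Render the argument list as a shell-safe command string."""
    return shlex.join(mesos_master_args(work_dir, quorum))
```

For example, mesos_master_command("/var/lib/mesos") gives a command line that can be pasted into an init script or passed to a process supervisor.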
I solved the problem by following these installation instructions: How To Configure a Production-Ready Mesosphere Cluster on Ubuntu 14.04
Although you found a solution, I'd like to explain this issue a bit more :)
From the official docs (http://mesos.apache.org/documentation/latest/slave-recovery/):
"Note that if the operating system on the slave is rebooted, all executors and tasks running on the host are killed and are not automatically restarted when the host comes back up."
So all frameworks running on Mesos will be killed after a reboot. One way to restart the frameworks is to run all of them on Marathon, which will manage the other frameworks and restart them when needed.
However, you then need to restart Marathon automatically when it is killed. In the DigitalOcean link you mentioned, Marathon is installed with a script in init.d, so it is restarted after a reboot. Otherwise, if you installed Marathon from source, you can use a tool like supervisord to monitor Marathon.

Changing hostnames in added hosts in Ambari server?

In my Ambari server I have added seven slaves to the master. The problem is that I have since changed the hostnames on the slaves, so the master node can no longer identify them.
Can anyone help me change the hostnames of the hosts that are already added?
Thank You.
As of Ambari 2.2.2 you can change hostnames using ambari-server update-host-names <hostnames.json>
The basic steps are:
1. Back up your Ambari DB
2. Disable Kerberos
3. Stop ambari-server and ambari-agent on all hosts
4. Create hostnames.json to map old names to new names. For example:
   {"clusterName":{"oldhost1.example.com":"newhost1.example.com","oldhost2.example.com":"newhost2.example.com"}}
5. On Ambari Server: ambari-server update-host-names hostnames.json
6. Updates hostnames on all nodes
7. If the hostname of Ambari Server changed, update ambari-agent.ini on every Ambari Agent node
8. On Ambari Server: ambari-server start
9. On all Agents: ambari-agent start
10. Re-enable Kerberos if needed
http://docs.hortonworks.com/HDPDocuments/Ambari-2.2.2.0/bk_ambari_reference_guide/content/ch_changing_host_names.html
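With seven hosts, writing hostnames.json by hand is error-prone, so here is a small Python sketch that generates it from a mapping. The cluster name and host names are placeholders taken from the example above, and the function name is made up for illustration.

```python
import json


def build_hostnames_json(cluster_name: str, renames: dict) -> str:
    """Return the hostnames.json content: {cluster: {old_host: new_host}}."""
    return json.dumps({cluster_name: renames}, indent=2)


def write_hostnames_json(path: str, cluster_name: str, renames: dict) -> None:
    """Write the mapping to `path` for use with ambari-server update-host-names."""
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(build_hostnames_json(cluster_name, renames))
```

For example, write_hostnames_json("hostnames.json", "clusterName", {"oldhost1.example.com": "newhost1.example.com"}) produces a file in the same shape as the example in the steps above.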
I'm facing a similar problem. What I have found so far is a mechanism for setting up custom hostnames:
https://ambari.apache.org/1.2.3/installing-hadoop-using-ambari/content/ambari-chap7a.html
It should solve the issue of changing hostnames, however I'm afraid it won't be that straightforward because of the first 2 steps:
On the Install Options screen, select Perform Manual Registration for Ambari Agents.
Install the Agents manually as described in Installing Ambari Agents Manually.
which you probably can't redo long after you have set up the whole cluster.

Condor central manager could not see the other computing nodes

I connected three servers to form an HPC cluster using Condor as middleware. When I run condor_status on the central manager, it does not show the other nodes. I can run jobs on the central manager and connect to the other nodes via SSH, but something seems to be missing in the Condor configuration files, where I set the central manager as the Condor host and allowed read and write access for everyone. I kept the MASTER and STARTD daemons in the daemon list for the worker nodes.
When I run condor_status on the central manager it shows only the central manager, and when I run it on a compute node it gives me the error "CEDAR:6001:Failed to connect to" followed by the central manager's IP and port number.
I managed to solve it. The problem was the central manager's firewall (in my case iptables), which was running.
When I stopped the firewall (su -c "service iptables stop"), all nodes appeared normally when typing "condor_status".
The firewall status can be checked using "service iptables status".
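To confirm that a firewall is the cause before turning it off, you can test from a compute node whether the central manager's port is reachable at all. Below is a minimal Python sketch. The host name is a placeholder, and 9618 is the default collector port, so adjust it if your pool uses the port shown in the CEDAR error instead.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # refused, timed out, unreachable, or name not resolved
        return False
```

For example, can_connect("central-manager.example.com", 9618) returning False while SSH to the same host works points to a blocked port rather than a Condor misconfiguration.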
There are a number of things that could be going on here. I'd suggest you follow this tutorial and see if it resolves your problems -
http://spinningmatt.wordpress.com/2011/06/12/getting-started-creating-a-multiple-node-condor-pool/
In my case the "condor.exe" service was not running on the server because I had stopped it manually. I just started it again and everything went fine.