Marathon (Mesos) - Stuck in "Loading applications" - apache-zookeeper

I am building a mesos cluster from scratch (using Vagrant, which is not relevant for this issue).
OS: Ubuntu 16.04 (trusty)
Setup:
Master -> Runs ZooKeeper, Mesos-master, Marathon and Chronos
Slave -> Runs Mesos-slave
This is my provisioning script for the master node https://github.com/zeitgeist2018/infrastructure/blob/fix-marathon/provision/scripts/install-master.sh.
I have managed to register the slave with Mesos, install the Marathon and Chronos frameworks, and run scheduled jobs in Chronos (both with Docker and shell commands), but I can't get Marathon to work properly. The UI gets stuck on "Loading applications" as soon as I open it, and when I call the API, the request hangs forever with no response. Through the API I tried both fetching basic Marathon information and creating deployments, with the same hanging result.
I've been checking the Marathon logs but I don't see any errors there. Just a couple of log lines that may (or may not) be a hint:
[2020-03-08 10:33:21,819] INFO Prompting Mesos for a heartbeat via explicit task reconciliation (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marathon-akka.actor.default-dispatcher-6)
[2020-03-08 10:33:21,822] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-87)
[2020-03-08 10:33:25,957] INFO Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-7)
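To characterize a hang like this, it can help to probe the API with a hard timeout instead of waiting indefinitely. The following is a minimal sketch, not part of the original setup: the MARATHON_URL variable and the choice of endpoints to probe are assumptions, and http://localhost:8080 is only Marathon's usual default address.

```shell
# List the Marathon endpoints to probe for a given base URL.
# The base URL is an assumption; pass whatever address your Marathon listens on.
marathon_endpoints() {
  base="${1%/}"                      # drop one trailing slash, if present
  for path in /ping /v2/info /v2/apps; do
    printf '%s%s\n' "$base" "$path"
  done
}

# Probe each endpoint with a 5 second limit so a hung request fails fast.
# Runs only when MARATHON_URL is set, e.g. MARATHON_URL=http://localhost:8080
if [ -n "${MARATHON_URL:-}" ]; then
  marathon_endpoints "$MARATHON_URL" | while read -r url; do
    if curl --silent --show-error --fail --max-time 5 --output /dev/null "$url"; then
      echo "OK      $url"
    else
      echo "FAILED  $url (curl exit $?)"
    fi
  done
fi
```

A curl exit code of 28 means the request timed out, which matches the "hangs forever" symptom; other codes (for example connection refused) point to a different problem.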

Installing JDK 11 and selecting it as the default Java fixed this issue for me, without downgrading Marathon to another version.
On Ubuntu 20.04:
sudo apt install openjdk-11-jre-headless
sudo update-alternatives --config java
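After switching, it is worth confirming which Java is actually active. This is a small sketch; the helper name java_major is hypothetical, and it assumes the usual banner format printed by java -version.

```shell
# Extract the major Java version from a `java -version` banner.
# Example input: openjdk version "11.0.6" 2020-01-14
java_major() {
  ver=$(printf '%s\n' "$1" | sed -n 's/.*version "\([^"]*\)".*/\1/p' | head -n 1)
  case "$ver" in
    1.*) printf '%s\n' "$ver" | cut -d. -f2 ;;   # legacy scheme: 1.8.0_242 -> 8
    *)   printf '%s\n' "$ver" | cut -d. -f1 ;;   # current scheme: 11.0.6 -> 11
  esac
}

# `java -version` writes to stderr, hence the 2>&1.
if command -v java >/dev/null 2>&1; then
  echo "active Java major version: $(java_major "$(java -version 2>&1)")"
fi
```

If this does not print 11, the alternatives selection did not take effect for the user that runs Marathon.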

I increased the number of CPUs of the virtual machine on which Marathon was installed to 3, and the problem was solved.

I have managed to make it work. It was as simple as downgrading Marathon to v1.7.189. After that, it starts properly, and the API responds to requests.

Related

Installing kubernetes on less ram

Is it possible to install Kubernetes with the kubeadm init command on a system that has less than 1 GB of RAM? I tried, but it failed at the kubeadm init step.
As mentioned in the installation steps to be taken before you begin, you need to have:
linux compatible system for master and nodes
2GB or more RAM per machine
network connectivity
swap disabled on every node
But going back to your question: it may be possible to get through the installation process, but the cluster will not be usable in practice. This configuration will not be stable.
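Two of the prerequisites listed above, RAM and swap, can be read from /proc/meminfo before running kubeadm init. This is a rough sketch with hypothetical helper names; note that MemTotal is usually somewhat lower than the installed RAM, because the kernel reserves part of it.

```shell
# Read /proc/meminfo-style text on stdin and print total RAM in MiB.
mem_total_mib() {
  awk '/^MemTotal:/ { print int($2 / 1024) }'
}

# Read /proc/meminfo-style text on stdin and print configured swap in kB.
swap_total_kb() {
  awk '/^SwapTotal:/ { print $2 }'
}

# Report both values on a Linux host; compare them with the list above.
if [ -r /proc/meminfo ]; then
  echo "MemTotal:  $(mem_total_mib < /proc/meminfo) MiB (2 GB or more is asked for)"
  if [ "$(swap_total_kb < /proc/meminfo)" = "0" ]; then
    echo "Swap:      disabled"
  else
    echo "Swap:      enabled (disable it, e.g. with 'sudo swapoff -a')"
  fi
fi
```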

Airflow web server starts without Gunicorn and is not accessible

I'm using Airflow 1.9. It worked fine for over 2 months, but now I am no longer able to start the Airflow webserver on Gunicorn.
nohup airflow webserver $* > webserver_new.logs &
just starts the web server process, but the log does not contain any mention of Gunicorn. The UI is not accessible. I have checked that the environment variable $AIRFLOW_HOME points to the correct path.
Also when the web server is being started it doesn't create a webserver-pid file in $AIRFLOW_HOME.
When I uninstall Gunicorn and start the Airflow web server, I do not get any error, but the UI is still not accessible. Basically, it behaves the same whether Gunicorn is present or not.
Environment
I use a Python 2.7 virtualenv on a CentOS box. A few other developers had updated some Python packages such as pyhive, thrift and six. I have uninstalled all of those, and uninstalled Airflow using pip and then reinstalled it.
Log contents
The web server logs do not contain any mention of Gunicorn, and they do not contain any other error when the server is started from the command line. The DAGs are running but the UI is still down.
[2018-02-21 14:13:36,082] {default_celery.py:41} WARNING - Celery Executor will run without SSL
Additional observation
After starting Gunicorn manually, I found that the workers time out as soon as they are created.
I found out that the problem was a DAG that used a for loop to generate dynamic tasks (all of its tasks were dynamic), but the task IDs were the same in every iteration. I removed that DAG and the webserver came back like a charm.

How do I upgrade concourse from 3.4.0 to 3.5.0 without causing jobs to abort with state error?

When I did the upgrade of concourse from 3.4.0 to 3.5.0, suddenly all running jobs changed their state from running to errored. I can see the string 'no workers' appearing at the start of their log now. Starting the jobs manually or triggered by the next changes didn't have any problem.
The upgrade of concourse itself was successful.
I was watching what bosh did at the time and I saw this change of job states took place all at once while either the web or the db VM was upgraded (I don't know which one). I am pretty sure that the worker VMs were not touched yet by bosh.
Is there a way to avoid this behavior?
We have one db, one web VM and six workers.
With only one web VM, it's possible that it was out of service long enough for all workers to expire. Workers heartbeat continuously, and if they miss two heartbeats (which takes 1 minute by default) they'll stall. They should come back after the deploy is finished, but if scheduling happened before they heartbeated again, that would cause those errors.

How to resolve the Marathon leader with mesos-dns

I've installed mesos-dns in our cluster and it is running fine. We can look up the domains of the apps installed in Marathon, but I would like to know on which host Marathon itself is running. If I run dig against marathon.domain, it does not resolve to anything.
According to the mesos-dns docs: "A records ({framework}.domain) and SRV records (_framework._tcp.{framework}.domain) - for every known Mesos master"
Thanks.
It's marathon.mesos unless you've used a different TLD. The Marathon scheduler runs on the Master.
You can use my mesosdns-resolver bash script to get the endpoint from Mesos DNS.
You can use it like:
mesosdns-resolver.sh -sn <service-name>.marathon.mesos -s <IP_ADDRESS_OF_MESOS_DNS_SERVER>
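The record names can also be built by hand from the scheme quoted from the mesos-dns docs and queried with dig. A minimal sketch: the helper names and the MESOS_DNS_SERVER variable are hypothetical, and "mesos" is assumed as the domain unless you pass a different one.

```shell
# Build the A record name for a framework: {framework}.domain
framework_a_name() {
  printf '%s.%s\n' "$1" "${2:-mesos}"
}

# Build the SRV record name for a framework: _framework._tcp.{framework}.domain
framework_srv_name() {
  printf '_framework._tcp.%s.%s\n' "$1" "${2:-mesos}"
}

# Query both records; runs only when MESOS_DNS_SERVER is set and dig is installed.
if [ -n "${MESOS_DNS_SERVER:-}" ] && command -v dig >/dev/null 2>&1; then
  dig +short "@$MESOS_DNS_SERVER" "$(framework_a_name marathon)" A
  dig +short "@$MESOS_DNS_SERVER" "$(framework_srv_name marathon)" SRV
fi
```

The SRV answer includes the port as well as the host, which the A record alone does not give you.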

Mesos cluster does not recover when physical host restart

I'm using Mesosphere on 3 hosts running Ubuntu 14.04, as follows:
one with mesos master
two with mesos slave
Everything works fine, but after restarting all the physical hosts, all scheduled jobs were lost. Is this normal? I expected ZooKeeper to store the current jobs, so that after a restart all jobs would be rescheduled once the master boots.
Update:
I'm running Marathon and Mesos on the same node, and I run Marathon with the --zk flag.
With marathon's --zk and --ha enabled, Marathon should be storing its state in ZK and recovering it on restart, as long as Mesos allows it to reregister with the same framework ID.
However, you'll also need to enable the Mesos registry (even for a single master), so that Mesos persists information about which framework IDs are registered in the event of a master failover. This can be accomplished by setting --registry=replicated_log (the default), --quorum=1 (since you only have 1 master), and --work_dir=/path/to/registry (where to store the state).
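Put together, the three flags form a master command line like the one this sketch prints. The helper name is hypothetical, /var/lib/mesos is only an example work directory, and other flags your setup needs (such as --zk) are left out on purpose.

```shell
# Print a mesos-master command line with the registry flags named above.
# $1: work directory for the replicated log (required)
# $2: quorum size (defaults to 1, which fits a single master)
mesos_master_cmd() {
  work_dir="$1"
  quorum="${2:-1}"
  printf 'mesos-master --registry=replicated_log --quorum=%s --work_dir=%s\n' \
    "$quorum" "$work_dir"
}

# Example: a single master keeping its state under /var/lib/mesos (assumed path).
mesos_master_cmd /var/lib/mesos 1
```

The work directory has to be on persistent storage and survive reboots; otherwise the registry is lost along with the framework IDs it records.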
I solved the problem by following these installation instructions: How To Configure a Production-Ready Mesosphere Cluster on Ubuntu 14.04
Although you found a solution, I'd like to explain a bit more about this issue :)
From the official docs: http://mesos.apache.org/documentation/latest/slave-recovery/
Note that if the operating system on the slave is rebooted, all
executors and tasks running on the host are killed and are not
automatically restarted when the host comes back up.
So all frameworks running on Mesos will be killed after a reboot. One way to restart them is to run all frameworks on Marathon, which will manage the other frameworks and restart them when needed.
However, you then need Marathon itself to be restarted automatically when it is killed. In the DigitalOcean guide you mentioned, Marathon is installed with an init.d script, so it is restarted after a reboot. Otherwise, if you installed Marathon from source, you can use a tool like supervisord to monitor Marathon.