Issue using same zookeeper for Kafka and Mesos - apache-kafka

I am trying to set up Kafka and Spark with Mesos on our 8-node cluster as follows, but I am having issues launching/starting the Mesos agents using the ZooKeeper endpoint of the Mesos masters.
Install and set up ZooKeeper on 3 nodes (server00, server01, server02) (through $KAFKA_HOME/config/zookeeper.properties).
Install Kafka brokers on all 8 nodes and point them to the 3 ZooKeepers by setting the following property in $KAFKA_HOME/config/server.properties:
zookeeper.connect=server00:2181,server01:2181,server02:2181
Install Mesos masters on 3 nodes (server00, server01, server02) and update /etc/mesos/zk with the following line:
zk://server00:2181,server01:2181,server02:2181/mesos
Install Mesos agents on all 8 nodes.
Edit the /etc/mesos/zk file on all the other servers to contain the following line:
zk://server00:2181,server01:2181,server02:2181/mesos
Start the Mesos master on all 3 master servers as below (I verified that all Mesos masters are running and available by opening http://server00:5050/#/, http://server01:5050/#/, and http://server02:5050/#/):
sudo /usr/sbin/mesos-master --cluster=server_mesos_cluster --log_dir=/var/log/mesos --work_dir=/var/lib/mesos
Start Mesos Agent on all 8 servers.
Example of launching this on server00:
sudo /usr/sbin/mesos-slave --work_dir=/var/lib/mesos --master=zk://server00:2181,server01:2181,server02:2181/mesos --ip=9.1.69.150
But the above doesn't launch the agent.
However, the following command does, which makes me think that perhaps the Mesos masters are not getting registered with ZooKeeper.
sudo /usr/sbin/mesos-slave --work_dir=/var/lib/mesos --master=server00:5050 --ip=9.1.69.150
Could anyone shed some light on whether:
my configuration is not right, or
I have to set up separate ZooKeepers for the Mesos cluster?
How can I verify whether the Mesos masters are getting registered with ZooKeeper?
Once this setup is working, I intend to run Spark on all 8 nodes.

On Ubuntu, at least, /etc/mesos/zk and the other config files under /etc/mesos are only read by /usr/bin/mesos-init-wrapper, so your master isn't seeing your ZooKeeper config.
You'll either need to launch it with the init script (service mesos-master start), run the wrapper manually, or pass the --zk option to mesos-master:
sudo /usr/sbin/mesos-master --cluster=server_mesos_cluster --log_dir=/var/log/mesos --work_dir=/var/lib/mesos --zk=zk://server00:2181,server01:2181,server02:2181/mesos
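As for verifying the registration: a master started with the --zk option creates znodes under the /mesos path, which you can inspect with any ZooKeeper client, for example the zookeeper-shell script bundled with Kafka (a quick sketch; exact znode names vary by Mesos version):
$KAFKA_HOME/bin/zookeeper-shell.sh server00:2181 ls /mesos
# Expect entries such as json.info_0000000001 once a master has registered;
# an empty list means no master is registered under /mesos yet.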

Related

Marathon (Mesos) - Stuck in "Loading applications"

I am building a mesos cluster from scratch (using Vagrant, which is not relevant for this issue).
OS: Ubuntu 16.04 (trusty)
Setup:
Master -> Runs ZooKeeper, Mesos-master, Marathon and Chronos
Slave -> Runs Mesos-slave
This is my provisioning script for the master node https://github.com/zeitgeist2018/infrastructure/blob/fix-marathon/provision/scripts/install-master.sh.
I have managed to register the slave with Mesos, install the Marathon and Chronos frameworks, and run scheduled jobs in Chronos (both with Docker and shell commands), but I can't get Marathon to work properly. The UI gets stuck in "Loading applications" as soon as I open it, and when I try to call the API, the request hangs forever with no response. In the API I tried to get simple Marathon information and do deployments, both with the same hanging result.
I've been checking the Marathon logs but I don't see any errors there. Just a couple of log lines that may (or may not) be a hint:
[2020-03-08 10:33:21,819] INFO Prompting Mesos for a heartbeat via explicit task reconciliation (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor$$anon$1:marathon-akka.actor.default-dispatcher-6)
[2020-03-08 10:33:21,822] INFO Received fake heartbeat task-status update (mesosphere.marathon.core.heartbeat.MesosHeartbeatMonitor:Thread-87)
[2020-03-08 10:33:25,957] INFO Found no roles suitable for revive repetition. (mesosphere.marathon.core.launchqueue.impl.ReviveOffersStreamLogic$ReviveRepeaterLogic:marathon-akka.actor.default-dispatcher-7)
Installing JDK 11 and choosing it as the default fixed this issue for me, without downgrading Marathon to any other version.
On Ubuntu 20.04:
sudo apt install openjdk-11-jre-headless
sudo update-alternatives --config java
I increased the number of CPUs of the virtual machine on which Marathon was installed to 3, and the problem was solved.
I have managed to make it work. It was as simple as downgrading Marathon to v1.7.189. After that, it starts properly, and the API responds to requests.

spark-shell on multinode Spark cluster fails to spawn executor on remote worker node

I installed a Spark cluster in standalone mode with 2 nodes: the Spark master runs on the first node and a Spark worker on the other. When I try to run spark-shell on the worker node with word-count code it runs fine, but when I try to run spark-shell on the master node it gives the following output:
WARN scheduler.TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient resources
The executor is not triggered to run the job, even though there is a worker available to the Spark master. Any help is appreciated, thanks.
You are using client deploy mode, so the best bet is that the executor nodes cannot connect to the driver port on the local machine. It could be a firewall issue or a problem with the advertised IP / hostname. Please make sure that:
spark.driver.bindAddress
spark.driver.host
spark.driver.port
use the expected values. Please refer to the networking section of the Spark documentation.
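For instance, a minimal sketch of setting these explicitly when starting the shell (the IP 192.168.1.10 and port 35000 are placeholders for the driver machine's address as seen from the workers and an open port):
spark-shell --master spark://master-host:7077 \
  --conf spark.driver.host=192.168.1.10 \
  --conf spark.driver.port=35000 \
  --conf spark.driver.bindAddress=0.0.0.0
# The workers must be able to reach 192.168.1.10:35000 through any firewall
# between the nodes; otherwise executors start but cannot reach the driver
# and the job never gets resources.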
Less likely, it is a lack of resources. Please check that you are not requesting more resources than the workers provide.

Apache Spark in cluster mode: where to run the jobs, on the master or on a worker node?

I have installed Spark in cluster mode with 1 master and 2 workers. When I start spark-shell on the master node, it runs continuously without ever showing the Scala shell.
But when I run spark-shell on a worker node, I get the Scala shell and I am able to run the jobs:
val file=sc.textFile("hdfs://192.168.1.20:9000/user/1gbdata")
file.count()
And for this I got the output.
So my doubt is: where should I actually run the Spark jobs?
Is it on the worker nodes?
Based on the documentation, you need to connect your spark-shell to the master node with the following command: spark-shell --master spark://IP:PORT. This URL can be retrieved from the master's UI or log file.
You should be able to launch spark-shell on the master node (machine); make sure to check the UI to see whether the spark-shell application is effectively running and the prompt is shown (you might need to press Enter on your keyboard after issuing spark-shell).
Please note that when you use spark-submit in cluster mode, the driver will run directly on one of the worker nodes, contrary to client mode where it runs as a client process on the submitting machine. Refer to the documentation for more details.
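For illustration, connecting the shell to a standalone master whose UI reports the URL spark://192.168.1.20:7077 (placeholder address) would look like this:
spark-shell --master spark://192.168.1.20:7077
# Once the prompt appears, the driver runs in this shell and the work is
# executed on the registered workers, e.g. the count from the question:
#   sc.textFile("hdfs://192.168.1.20:9000/user/1gbdata").count()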

Thoughts on How to Hotdeploy using JBoss 5

I am trying to see if this is possible. I will first give you some background on how the application currently runs.
The application is deployed to 4 separate nodes (using the 'all' config): 2 nodes on ServerA and 2 nodes on ServerB, named node1, node2, node3 and node4.
The application is behind a web server running Apache with mod_jk to redirect traffic.
Assume that version 1.0.0 is currently deployed.
I will be trying to deploy 1.0.1 which will only have a minor change.
The goal will be to take down node4, deploy version 1.0.1 to node4 (while node1-node3 are still up and running).
They will be sharing the same database which in theory should be fine as long as our code doesn't require us to update anything in our database.
The next step would be to direct traffic using apache + mod_jk to only load balance node1-node3. node4 will be accessed directly.
node4 will be tested running version 1.0.1.
Apache + mod_jk will be changed to serve node4.
Version 1.0.1 will be deployed to node1-node3.
All nodes should now be running version 1.0.1.
I know this is extremely high level and I am already facing problems (not to mention application-specific problems).
I just want to know what other ways there are of approaching this, or which JBoss-specific problems I might run into.
Should I be putting the hotdeploy node in a different cluster and have the rest join later?
Any suggestions would help. Thanks.
You can take advantage of your Apache with mod_jk in front. Imagine you have something like this in your configuration:
JkMount /myapp/* workerApp
JkWorkersFile /etc/httpd/conf/workerApp.properties
Well, instead of having a single file named workerApp.properties, use these 3 files:
workerApp-deploy1.properties
Will contain configuration to connect only to node 4
workerApp-deploy2.properties
Will contain configuration to connect only to nodes 1,2 and 3
workerApp-normal.properties
This will be your actual workers file
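For example, workerApp-deploy2.properties could look roughly like this (hostnames and AJP ports are illustrative; nodes sharing a physical server need distinct AJP ports):
# workerApp-deploy2.properties: load balance only across node1-node3
worker.list=workerApp
worker.workerApp.type=lb
worker.workerApp.balance_workers=node1,node2,node3
worker.node1.type=ajp13
worker.node1.host=servera.example.com
worker.node1.port=8009
# ...node2 and node3 defined the same way, pointing at their own AJP ports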
Now workerApp.properties, instead of being a file, is a symlink, so under normal circumstances:
ln -s workerApp-normal.properties workerApp.properties
When you deploy a new version
rm -f workerApp.properties
ln -s workerApp-deploy2.properties workerApp.properties
reload apache
Now you can deploy the new version on node4 and all requests will be routed through nodes 1, 2 and 3. When the deployment on node4 is ready:
rm -f workerApp.properties
ln -s workerApp-deploy1.properties workerApp.properties
reload apache
In this situation all clients will be routed to node4 and you can upgrade the version on the other nodes. When you're done:
rm -f workerApp.properties
ln -s workerApp-normal.properties workerApp.properties
reload apache
And you get all requests balanced between the servers again.
This has another advantage: you can, for example, define a VirtualHost like preflighttest.yourcompany.com using a different set of workers, so you can test your new version on node 4 before effectively rolling it out to production.
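That could look roughly like this, assuming a separate load balancer worker (here called workerAppPreflight) is defined in the workers file and points only at node4:
<VirtualHost *:80>
    ServerName preflighttest.yourcompany.com
    JkMount /myapp/* workerAppPreflight
</VirtualHost>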
Hope it helps.

Mesos cluster does not recover when physical hosts restart

I'm using Mesosphere on 3 hosts running Ubuntu 14.04, as follows:
one with the Mesos master
two with Mesos slaves
Everything works fine, but after restarting all the physical hosts, all scheduled jobs were lost. Is that normal? I expected that ZooKeeper would store the current jobs, so that when the system needs to restart, all jobs would be rescheduled after the master boots.
Update:
I'm using Marathon and Mesos on the same node, and I run Marathon with the --zk flag.
With marathon's --zk and --ha enabled, Marathon should be storing its state in ZK and recovering it on restart, as long as Mesos allows it to reregister with the same framework ID.
However, you'll also need to enable the Mesos registry (even for a single master), to ensure that Mesos persists information about what frameworkIds are registered in the event of master failover. This can be accomplished by setting the --registry=replicated_log (default), --quorum=1 (since you only have 1 master), and --work_dir=/path/to/registry (where to store the state).
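Put together, a single-master invocation with the registry enabled could look roughly like this (the cluster name, paths and ZooKeeper address are placeholders):
sudo /usr/sbin/mesos-master \
  --cluster=my_mesos_cluster \
  --zk=zk://zkhost:2181/mesos \
  --registry=replicated_log \
  --quorum=1 \
  --work_dir=/var/lib/mesos \
  --log_dir=/var/log/mesos
# The registry state persisted under --work_dir survives a reboot, so frameworks
# can reregister with their previous framework IDs after the master comes back.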
I solved the problem by following these installation instructions: How To Configure a Production-Ready Mesosphere Cluster on Ubuntu 14.04.
Although you found a solution, I'd like to explain this issue a bit more :)
From the official doc: http://mesos.apache.org/documentation/latest/slave-recovery/
Note that if the operating system on the slave is rebooted, all executors and tasks running on the host are killed and are not automatically restarted when the host comes back up.
So all frameworks running on Mesos will be killed after a reboot. One way to restart them is to run all frameworks on Marathon, which will manage the other frameworks and restart them as needed.
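As an illustration only, running a framework scheduler such as Chronos under Marathon could be done with an app definition like this (host names, ports and the Chronos command line are assumptions):
curl -X POST http://marathon-host:8080/v2/apps \
  -H "Content-Type: application/json" \
  -d '{
        "id": "/chronos",
        "cmd": "/usr/bin/chronos --master zk://zkhost:2181/mesos --zk_hosts zkhost:2181",
        "cpus": 1,
        "mem": 1024,
        "instances": 1
      }'
# Marathon then keeps one instance of the scheduler running and restarts it
# whenever its task dies, e.g. after a slave host reboots.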
However, you then need to auto-restart Marathon itself when it is killed. In the DigitalOcean link you mentioned, Marathon is installed with an init.d script, so it can be restarted after a reboot. Otherwise, if you installed Marathon from source, you can use a tool like supervisord to monitor Marathon.
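A minimal supervisord sketch for that last case (the binary path and ZooKeeper addresses are assumptions about your layout):
[program:marathon]
command=/opt/marathon/bin/marathon --master zk://zkhost:2181/mesos --zk zk://zkhost:2181/marathon --ha
autostart=true
autorestart=true
stdout_logfile=/var/log/marathon/stdout.log
stderr_logfile=/var/log/marathon/stderr.log
; supervisord relaunches Marathon whenever the process exits unexpectedly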