Can I catch events such as on Executor start in Apache Spark? - scala

What I want to do is for the executor to start a program, such as a profiling tool, when it starts (that is, before it starts executing any task). That way, it would be possible to monitor things like the CPU usage of an executor. Does Spark provide such hooks/callbacks? I have used SparkListener, but that is used on the driver side. Do we have a similar thing for executors?

This should work for your requirement.
http://spark.apache.org/developer-tools.html#profiling
Set up YourKit to work with both drivers and slaves (executors). It doesn't start profiling until you tell it to. Connect to the master or a slave, start profiling, and then run your tests.
Happy profiling!!
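If you want the agent attached from the moment each executor JVM starts (i.e. before any task runs), one option is to pass it via spark.executor.extraJavaOptions. A minimal sketch, assuming the YourKit agent is installed at the same (hypothetical) path on every slave:

import org.apache.spark.{SparkConf, SparkContext}

object ProfiledJob {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("profiled-job")
      // Hypothetical agent path; it must exist on every worker machine.
      // The agent loads when the executor JVM starts, i.e. before any task executes.
      .set("spark.executor.extraJavaOptions",
        "-agentpath:/opt/yourkit/bin/linux-x86-64/libyjpagent.so=sampling")
    val sc = new SparkContext(conf)
    // ... run the job as usual; executors now start with the agent attached
    sc.stop()
  }
}

The same trick works for any JVM agent, not just YourKit.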

Related

Spark Standalone Cluster deployMode = "cluster": Where is my Driver?

I have researched this for a significant amount of time, and the answers I find seem to be for a slightly different question than mine.
UPDATE: The Spark docs say the Driver runs on a cluster Worker with deployMode: cluster. This does not seem to be true when you don't use spark-submit.
My Spark 2.3.3 cluster is running fine. I see the GUI at http://master-address:8080; there are 2 idle workers, as configured.
I have a Scala application that creates a context and starts a Job. I do not use spark-submit; I start the Job programmatically, and this is where many answers diverge from my question.
In "my-app" I create a new SparkConf, with the following code (slightly abbreviated):
conf.setAppName("my-job")
conf.setMaster("spark://master-address:7077")
conf.set("deployMode", "cluster")
// other settings like driver and executor memory requests
// the driver and executor memory requests are for all mem on the slaves, more than
// mem available on the launching machine with "my-app"
val jars = listJars("/path/to/lib")
conf.setJars(jars)
…
When I launch the job I see 2 executors running on the 2 nodes/workers/slaves. The logs show their IP addresses and call them executor 0 and 1.
With a YARN cluster I would expect the "Driver" to run on/in the YARN Master, but I am using the Spark Standalone Master. Where is the Driver part of the Job running? If it runs on a random Worker or elsewhere, is there a way to find it from the logs?
Where is my Spark Driver executing? Does deployMode = cluster work when not using spark-submit? Evidence shows a cluster with one master (on the same machine as executor 0) and 2 Workers. It also shows identical memory usage on both Workers during the job. From the logs I know both Workers are running Executors. Where is the Driver?
The "Driver" creates and broadcasts some large data structures, so the need for an answer is more critical than with more typical tiny Drivers.
Where is the driver running? How do I find it given logs and monitoring? I can't reconcile what I see with the docs; they contradict each other.
This is answered by the official documentation:
In cluster mode, however, the driver is launched from one of the Worker processes inside the cluster, and the client process exits as soon as it fulfills its responsibility of submitting the application without waiting for the application to finish.
In other words, the driver runs on an arbitrary worker node, so on such a small cluster it is likely to be co-located with one of the executors. And to anticipate the follow-up question: this behavior is not configurable. You just have to make sure that the cluster has capacity to start both the required executors and the driver with its requested memory and cores.
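To address the "how do I find it from logs and monitoring" part: the running application knows its own driver address, so one small sketch (nothing standalone-specific, and the app name is made up) is to log spark.driver.host once the context is up and compare it with the worker addresses shown in the master UI:

import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch: report where the driver actually ended up.
// Spark fills in "spark.driver.host" when the context starts.
val sc = new SparkContext(new SparkConf().setAppName("where-is-my-driver"))
println(s"Driver host: ${sc.getConf.get("spark.driver.host")}, " +
        s"driver UI: ${sc.uiWebUrl.getOrElse("<UI disabled>")}")

If the printed host is the machine that launched "my-app", the application is effectively running in client mode, which would also be consistent with the UPDATE above.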

How can I kill a distributed worker in a Kafka cluster?

I am working with Apache Kafka and using a distributed worker. I can start my worker as shown below:
# Command to start the distributed worker
bin/connect-distributed.sh config/connect-distributed.properties
This is from the official documentation. After this we can create connectors and tasks, and this works fine.
But when I change my connector or task logic, I have to add the new jar to Kafka's classpath and then restart the worker.
I'm not sure what the right procedure is; I think we should stop and restart the worker.
But when I want to stop the worker, I don't know how to do it correctly.
Of course, I can find my process with ps aux | grep worker, kill it, and then kill the REST server, which I would also have to find with ps. But that seems like a strange approach. Killing two processes isn't a good idea, and I can't find any information on how to do it another way.
If you know the right way, please help me :)
Thanks for your time.
Killing two processes isn't a good idea
ConnectDistributed is only one process. There is no separate REST server to stop.
And yes, pausing the connector (PUT /connectors/<name>/pause on the REST API) followed by a kill <pid> is the correct way to stop it.
If installed with a recent version of Confluent Platform, you can stop/start using systemctl.
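For reference, the pause step is just a call to the Connect REST API on the worker's REST port (8083 by default). A rough sketch, with a hypothetical connector name; the same request can of course be sent with curl:

import java.net.{HttpURLConnection, URL}

object PauseConnector {
  def main(args: Array[String]): Unit = {
    // Hypothetical connector name and default Connect REST port.
    val url = new URL("http://localhost:8083/connectors/my-connector/pause")
    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("PUT")
    // Connect acknowledges the pause request asynchronously.
    println(s"Pause request returned HTTP ${conn.getResponseCode}")
    conn.disconnect()
  }
}

After that, a single kill <pid> of the ConnectDistributed process (or systemctl stop, where it is managed as a service) is all that is needed.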

Spark Mesos cluster mode is slower than local mode

I submit the same jar in both local mode and Mesos cluster mode, and found that for some exactly identical stages, local mode takes only several milliseconds to finish while cluster mode takes seconds!
Listed below is one example: stage 659.
local mode:
659
Streaming job from [output operation 1, batch time 17:45:50]
map at KafkaHelper.scala:35 +details
2016/03/22 17:46:31 11 ms
mesos cluster mode:
659
Streaming job from [output operation 1, batch time 18:01:20]
map at KafkaHelper.scala:35 +details
2016/03/22 18:09:33 3 s
I also found from the Spark UI that Mesos cluster mode consistently takes 4 seconds to finish the foreachRDD jobs. Why is that? Are there any submit command options that can help with this?
Bunch of thanks in advance!
That behavior depends on multiple factors. You don't specify what kind of job you run, in which cluster mode, and with which settings. If Spark is not installed on the slaves, you'll see overhead because the distribution needs to be downloaded, etc.
Furthermore, the jars you're using need to be distributed to the executors, which can add to startup time as well.
As said, this all depends on how you run Spark on Mesos.
See
http://spark.apache.org/docs/latest/running-on-mesos.html
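As a concrete illustration of the distribution overhead mentioned above, two Mesos-related settings from that page are worth checking; the tarball URI and version below are placeholders:

import org.apache.spark.SparkConf

// Sketch only: pre-stage the Spark distribution so slaves don't fetch it per job,
// and use coarse-grained mode so executors stay up instead of being launched per task.
val conf = new SparkConf()
  .setMaster("mesos://master-address:5050")
  .setAppName("kafka-streaming-job")
  .set("spark.executor.uri", "hdfs:///packages/spark-1.6.1-bin-hadoop2.6.tgz") // placeholder
  .set("spark.mesos.coarse", "true")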

How to flush out jobs on a Mesos slave?

I want to take a host (mesos-slave) out of the Mesos cluster in a clean manner by draining the executors it's running. Is it possible for the mesos-master to not give any further work to a mesos-slave but still receive updates for the currently running executors? If that's possible, I can make the mesos-master give no more work to this slave, and once the slave is done with its currently running executors, I can take it out. Please feel free to suggest a better way of achieving the same thing.
I think you're looking for maintenance primitives, which were recently added to Mesos. A user doc is here.
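For completeness, scheduling maintenance boils down to POSTing a maintenance window for that machine to the master. The sketch below only approximates the request shown in the maintenance docs; the endpoint path, field names, hostname/IP and timestamp are assumptions you should verify against your Mesos version:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object DrainAgent {
  def main(args: Array[String]): Unit = {
    // Assumed schema: one maintenance window covering the slave to drain.
    val window =
      """{"windows":[{"machine_ids":[{"hostname":"slave1.example.com","ip":"10.0.0.11"}],
        |"unavailability":{"start":{"nanoseconds":1500000000000000000}}}]}"""
        .stripMargin.replaceAll("\n", "")

    val conn = new URL("http://master-address:5050/maintenance/schedule")
      .openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    conn.getOutputStream.write(window.getBytes(StandardCharsets.UTF_8))
    conn.getOutputStream.close()
    println(s"Schedule request returned HTTP ${conn.getResponseCode}")
  }
}

Once a window is scheduled, frameworks receive inverse offers for that machine, which is intended to let them drain it gracefully before you take it down.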

Queuing systems - what is a good way to start up multiple workers?

How have you set-up one or more worker scripts for queue-oriented systems?
How do you arrange to start up - and restart if necessary - worker scripts as required? (I'm thinking about such tools as init.d/, Ruby-based 'god', DJB's Daemontools, etc.)
I'm developing an asynchronous queue/worker system, in this case using PHP & Beanstalkd (though the actual language and daemon aren't important). The tasks themselves are not too hard - encoding an array with the commands and parameters into JSON for transport through the Beanstalkd daemon, and picking them up in a worker script to action them as required.
There are a number of other similar queue/worker setups out there, such as Starling, Gearman, Amazon's SQS and other more 'enterprise'-oriented systems like IBM's MQ and RabbitMQ. If you run something like Gearman or SQS, how do you start and control the worker pool? The question is about the initial worker startup, and then being able to add extra workers and shut them down at will (though I can send a message through the queue to shut them down - as long as some 'watcher' won't automatically restart them). This is not a PHP problem; it's about plain Unix processes: setting up one or more processes to run on startup, or adding more workers to the pool.
A bash script to loop a script is already in place - this calls the PHP script which then collects and runs tasks from the queue, occasionally exiting to be able to clean itself up (it can also pause a few seconds on failure, or via a planned event). This works fine, and building the worker processes on top of that won't be very hard at all.
Getting a good worker controller system is about flexibility, starting one or two automatically on a machine start, and being able to add a couple more from the command line when the queue is busy, shutting down the extras when no longer required.
I've been helping a friend who's working on a project that involves a Gearman-based queue that will dispatch various asynchronous jobs to various PHP and C daemons on a pool of several servers.
The workers have been designed to behave just like classic unix/linux daemons, thanks to simple shell scripts in /etc/init.d/, and commands like:
invoke-rc.d myWorker start|stop|restart|reload
This mechanism is simple and efficient. And as it relies on standard linux features, even people with a limited knowledge of your app can launch a daemon or stop one, if they know how it's called system-wise (aka "myWorker" in the above example).
Another advantage of this mechanism is that it makes managing your worker pool easy as well. You could have 10 daemons on your machine (myWorker1, myWorker2, ...) and have a "worker manager" start or stop them depending on the queue length. And as these commands can be run through ssh, you can easily manage several servers.
This solution may sound cheap, but if you build it with well-coded daemons and reliable management scripts, I don't see why it would be less efficient than big-bucks solutions, for any average (as in "non critical") project.
Real message queuing middleware like WebSphere MQ or MSMQ offers "triggers", where a service that is part of the MQM will start a worker when new messages are placed into a queue.
AFAIK, no "web service" queuing system can do that, by the nature of the beast. However, I have only looked hard at SQS. There you have to poll the queue, and in Amazon's case overly eager polling is going to cost you some real $$.
I've recently been working on such a tool. It's not entirely finished (though it shouldn't take more than a few more days before I hit something I could call 1.0) and clearly not ready for production yet, but the important parts are already coded. Anybody can have a look at the code here: https://gitorious.org/workers_pool.
Supervisor is a good process-monitoring tool. It includes a web UI where you can monitor and manage workers.
Here is a simple config file for a worker.
[program:demo]
command=php worker.php ; php command to run worker file
numprocs=2 ; number of processes
process_name=%(program_name)s_%(process_num)03d ; unique name for each process if numprocs > 1
directory=/var/www/demo/ ; directory containing worker file
stdout_logfile=/var/www/demo/worker.log ; log file location
autostart=true ; auto start program when supervisor starts
autorestart=true ; auto restart program if it exits