I want to kill my Kafka Connect distributed worker, but I am unable (or do not know how) to determine which process running on Linux is that worker.
When running
ps aux | grep worker
I do see a lot of worker processes, but I am unsure which is the Connect worker and which are ordinary, non-Connect workers.
It is true that only one of these processes was started yesterday, and I suspect that is the one, but that would obviously not be a sufficient criterion in all cases, for example if the whole Kafka cluster was brought online yesterday. So, in general, how can I determine which process is a Kafka Connect worker?
What is the foolproof method here?
If the other worker processes are not related to Connect, you can find the Connect process by grepping for the properties file that you passed when starting the worker:
ps aux | grep connect-distributed.properties
There is no kill script for Connect workers. You have to send SIGTERM with the kill command to stop the worker process gracefully.
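For example, a minimal sketch of a graceful shutdown (this assumes the worker was started with connect-distributed.properties and is the only process matching that pattern; verify both on your machine):
# find the PID of the Connect worker by the properties file it was started with
pid=$(pgrep -f connect-distributed.properties)
# SIGTERM asks the worker to shut down gracefully
kill -TERM "$pid"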
I've been working with Airflow for a while now; it was set up by a colleague. Lately I have run into several errors, which require me to know in more depth how to fix certain things within Airflow.
I do understand what the three processes are; I just don't understand the underlying things that happen when I run them. What exactly happens when I run one of the commands? Can I see somewhere afterwards that they are running? And if I run one of these commands, does this overwrite older webservers/schedulers/workers or add a new one?
Moreover, if I run airflow webserver, for example, the screen shows some of the things that are happening. Can I simply get out of this by pressing CTRL + C? Because when I do this, it says things like Worker exiting and Shutting down: Master. Does this mean I'm shutting everything down? If not, how else should I get out of the webserver screen?
Each process does what it is built to do while it is running (the webserver provides a UI, the scheduler determines when things need to be run, and the workers actually run the tasks).
I think your confusion is that you may be seeing them as commands that tell some sort of "Airflow service" to do something, but they are each standalone commands that start the processes that do the work. I.e., starting from nothing, you run airflow scheduler: now you have a scheduler running. Run airflow webserver: now you have a webserver running. When you run airflow webserver, it starts a Python Flask app. While that process is running, the webserver is running; if you kill the command, it goes down.
All three have to be running for Airflow as a whole to work (assuming you are using an executor that needs workers). You should only ever have one scheduler running, but if you were to run two processes of airflow webserver (ignoring port conflicts), you would then have two separate HTTP servers running against the same metadata database. Workers are a little different, in that you may want multiple worker processes running so you can execute more tasks concurrently. So if you create multiple airflow worker processes, you'll end up with multiple processes taking jobs from the queue, executing them, and updating the task instance with the status of the task.
When you run any of these commands you'll see the stdout and stderr output in the console. If you are running them as a daemon or background process, you can check which processes are running on the server.
If you ctrl+c, you are sending a signal to kill the process. Ideally, for a production Airflow cluster, you should have some supervisor monitoring the processes and ensuring that they are always running. Locally, you can either run the commands in the foreground of separate shells, minimize them, and just keep them running when you need them, or run them as background daemons with the -D argument, e.g. airflow webserver -D.
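For example, a minimal sketch of running all three components as background daemons (this assumes an executor that needs workers; note that newer Airflow versions use airflow celery worker instead of airflow worker):
airflow webserver -D
airflow scheduler -D
airflow worker -D
# afterwards you can confirm they are running
ps aux | grep airflow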
The upstart service is responsible for creating gearman workers, which run in parallel (one per CPU) with the help of gnu-parallel. To understand the problem you can read my Stack Overflow post, which describes how to run workers in parallel:
Fork processes indefinitely using gnu-parallel, which catches individual exit errors and respawns
Upstart service: workon.conf
# workon
description "worker load"
start on runlevel [2345]
stop on runlevel [!2345]
respawn
script
exec seq 1000000 | parallel -N0 --joblog out.log ./worker
end script
Alright, so the above service is started:
$ sudo service workon start
workon start/running, process 4620
4620 is the process ID of the workon service.
Four workers will be spawned, one per CPU core, for example:
Name   | PID
-------|-----
worker | 1011
worker | 1012
worker | 1013
worker | 1014
perl   | 1000
perl is the process which is running gnu-parallel.
And gnu-parallel is responsible for running the parallel worker processes.
Now, here is the problem. If I kill the workon service:
$ sudo kill 4620
The service has an instruction to respawn if killed, so it restarts. But the processes created by the service are not killed, which means it creates a new set of processes. Now we have 2 perl processes and 8 workers:
Name   | PID
-------|-----
worker | 1011
worker | 1012
worker | 1013
worker | 1014
worker | 2011
worker | 2012
worker | 2013
worker | 2014
perl   | 1000
perl   | 2000
You might ask: the old processes abandoned by the service, are they zombies?
Well, the answer is no. They are alive; I tested them. Every time the service dies it creates a new set.
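One way to check for yourself (the PIDs are the illustrative ones from the tables above; a zombie would show a Z in the STAT column, while these show up as ordinary running or sleeping processes):
ps -o pid,ppid,stat,cmd -p 1011,1012,1013,1014
# or look at the whole picture
ps -ef --forest | grep -E 'perl|worker'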
Well, that is one problem. Another problem is with gnu-parallel.
Let's say I started the service fresh and it is running fine.
I ran this command to kill gnu-parallel, i.e. perl:
$ sudo kill 1000
This doesn't kill the workers, and they are again left without any parent. But the workon service detects the death of perl and respawns a new set of workers. This time we have 1 perl and 8 workers. All 8 workers are alive: 4 of them with a parent and 4 orphans.
Now, how do I solve this problem? I want to kill all processes created by the service whenever it crashes.
Well, I was able to solve this issue with post-stop. It is an event listener, I believe, which executes after a service ends. In my case, if I run kill -9 -pid- (the PID of the service), the post-stop block is executed after the service process is killed, so I can put the necessary code there to remove all the processes spawned by the service.
Here is my code using post-stop:
post-stop script
exec killall php & killall perl
end script
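Putting it together, a sketch of the full workon.conf with the post-stop stanza (this assumes, as above, that the workers are PHP scripts and that perl is the gnu-parallel process; the two killall lines are written separately here, and note that killall is blunt and will also kill any unrelated php or perl processes on the box):
# workon
description "worker load"
start on runlevel [2345]
stop on runlevel [!2345]
respawn
script
exec seq 1000000 | parallel -N0 --joblog out.log ./worker
end script
post-stop script
killall php
killall perl
end script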
I am working with Apache Kafka and using a distributed worker. I can start my worker as below:
// Command to start the distributed worker.
"bin/connect-distributed.sh config/connect-distributed.properties"
This is from the official documentation. After this we can create connectors and tasks, and this works fine.
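For reference, a minimal sketch of creating a connector through the worker's REST API (the connector name, the connector class, and the default port 8083 are all illustrative assumptions):
curl -X POST -H "Content-Type: application/json" \
  --data '{"name": "my-connector", "config": {"connector.class": "com.example.MyConnector", "tasks.max": "1"}}' \
  http://localhost:8083/connectors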
But when I change my connector or task logic, I have to add a new jar to Kafka's classpath, and after this I have to restart the worker.
I don't know what the right procedure is; I think we should stop the worker and then start it again.
But when I want to stop the worker, I don't know how to do it correctly.
Of course, I can find my process with ps aux | grep worker, kill it, and then kill the REST server, which I would also have to find with ps. But that seems like a strange approach. Killing two processes isn't a good idea, but I can't find any information on how to do it another way.
If you know the right way, please help me :)
Thanks for your time.
Killing two processes isn't a good idea
ConnectDistributed is only one process. There is no separate REST server to stop.
And yes, pausing the connectors with PUT /connectors/{name}/pause followed by a kill <pid> is the correct way to stop it.
If installed with a recent version of Confluent Platform, you can stop/start using systemctl.
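For example, a sketch of the whole sequence (the connector name my-connector, the default REST port 8083, and the confluent-kafka-connect systemd unit name are assumptions to adapt to your install):
# pause the connector so its tasks stop processing before the restart
curl -X PUT http://localhost:8083/connectors/my-connector/pause
# stop the single ConnectDistributed process gracefully
kill "$(pgrep -f connect-distributed.properties)"
# or, on a Confluent Platform systemd install:
sudo systemctl stop confluent-kafka-connect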
I have a doubt about Storm, and here it goes:
Can multiple supervisors run on a single node, or can we run only one supervisor per machine?
Thanks.
In principle, there should be one Supervisor daemon per physical machine. Why?
Answer: Nimbus receives heartbeats from the Supervisor daemon and tries to restart it if the supervisor dies; if the restart attempt fails permanently, Nimbus will assign that work to another Supervisor.
Imagine two Supervisors going down at the same time because they live on the same physical machine: poor fault tolerance!
Running two Supervisor daemons would also be a waste of memory resources.
If you have machines with plenty of memory, simply increase the number of workers by adding more ports to supervisor.slots.ports in storm.yaml instead of adding another supervisor.
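A sketch of the relevant storm.yaml fragment (the extra ports are illustrative; each entry in supervisor.slots.ports is one worker slot on that supervisor):
supervisor.slots.ports:
    - 6700
    - 6701
    - 6702
    - 6703
    - 6704
    - 6705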
Theoretically possible - practically you may not need to do it, unless you are doing a PoC/demo. I did this for one of the demos I gave, by making multiple copies of Storm and changing the ports for one of the supervisors - you can do it by changing supervisor.slots.ports.
It is basically designed per node, so one node should have only one supervisor. This daemon manages the worker processes that you configured via ports.
So there is no need for an extra supervisor daemon per node.
It is possible to run multiple supervisors on a single host. Have a look at this post on the storm-user mailing list:
Just copy multiple Storm, and change the storm.yaml to specify
different ports for each supervisor(supervisor.slots.ports)
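A sketch of what the second copy's conf/storm.yaml might contain (the ports and local dir are illustrative; each copy needs its own storm.local.dir and its own, non-overlapping slot ports so the two supervisors do not clash):
storm.local.dir: "/var/storm2"
supervisor.slots.ports:
    - 6800
    - 6801
    - 6802
    - 6803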
Supervisor is configured on a per-node basis. Running multiple supervisors on a single node does not make much sense. The sole purpose of the supervisor daemon is to start/stop the worker processes (each of these workers is responsible for running a subset of topologies). From the doc page:
The supervisor listens for work assigned to its machine and starts and stops worker processes as necessary based on what Nimbus has assigned to it.
Can a worker job on Heroku make a socket connection (e.g. POP3) to an external server?
I guess scaling the worker process to 2 or more will run jobs in parallel, and they will all try to connect to the same server/port from the same client/port. Am I right, or am I missing something?
Yes, Heroku workers can connect to the outside world. However, there is no built-in provision for handling the sort of problems that you mention; you'd need to handle that bit yourself.
Just look at the workers as a variety of separate EC2 instances.
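If you want to sanity-check outbound connectivity yourself, something like the following works from a one-off dyno (the app name and POP3 host are placeholders, and this assumes curl is available in your dyno):
heroku run --app my-app "curl -v --max-time 5 telnet://pop.example.com:110"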