Celery: all workers stuck, how to diagnose - celery

Periodically all my Celery workers get stuck on something. I cannot figure out what is causing this, as inspect doesn't work as all the workers are busy.
celery inspect active
Error: No nodes replied within time constraint
Is it possible to get Celery status, like active tasks, even if nodes are doing something (that seems to be causing problems)? Can I somehow spin up a temporary worker just to get inspect output?
What kind of other strategies there would be to diagnose this issue?
Celery 4.x. Redis backend.

This turned out to be a deadlock issue with Celery + gevent (evil monkey patch) + Sentry's Raven logger.
https://github.com/getsentry/raven-python/issues/305
To diagnose issues
You can start Celery workers with different queues (-q, -n) parameters and see when workers hang. Even if some worker groups are hung the others still may respond to inspect queries.
Celery file logs may reveal the error
2017-02-27 08:36:34,371 CRITI [celery.worker][DummyThread-6] Unrecoverable error: AttributeError("'NoneType' object has no attribute 'readline'",)
Traceback (most recent call last):
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 370, in start
return self.obj.start()
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 594, in start
c.loop(*c.loop_args())
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/worker/loops.py", line 118, in synloop
connection.drain_events(timeout=2.0)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/connection.py", line 301, in drain_events
return self.transport.drain_events(self.connection, **kwargs)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/virtual/base.py", line 961, in drain_events
get(self._deliver, timeout=timeout)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 359, in get
ret = self.handle_event(fileno, event)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 341, in handle_event
return self.on_readable(fileno), self
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 337, in on_readable
chan.handlers[type]()
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 714, in _brpop_read
**options)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/redis/client.py", line 585, in parse_response
response = connection.read_response()
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/redis/connection.py", line 577, in read_response
response = self._parser.read_response()
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/redis/connection.py", line 238, in read_response
response = self._buffer.readline()
AttributeError: 'NoneType' object has no attribute 'readline'

Related

Reading logs of autoscaled Ray worker nodes

We're running ray tasks on kubernetes with autoscaling. From time to time, a worker dies, and we get the following:
WARNING worker.py:1114 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: 874253cc82fde8d4ffffffffffffffffffffffff02000000 Worker ID: 98dcd78c9bebbc2176a14fc230b40afdd574dea608dd573aebeae00e Node ID: d4a7b594aec2970b02dde480f2eb2c6070b7886b333d8f1cb1087758 Worker IP address: 10.40.4.2 Worker port: 10010 Worker PID: 307
We're using a config.yaml similar to the autoscaler/kubernetes/default.yaml.
The problem we're facing is that the nodes were the worker runs is autoscaling, causing the node to often be scaled away before we have time to read the logs (in /tmp/ray).
Any way we can persist the logs of the autoscaled (worker) nodes?
I tried setting the temp dir to point to a filestore that's shared between the workers, by calling ray.init(_temp_dir="/our_mounted_filestore/persistent_ray_logs") but this does not seem to work.
I also tried adding --temp-dir=<path> in the worker_start_ray_commands, but this just gives the following error:
PermissionError: [Errno 13] Permission denied: '/filestore/ray_temp'
mkdir(name, mode)
File "/home/ray/anaconda3/lib/python3.8/os.py", line 223, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/home/ray/anaconda3/lib/python3.8/os.py", line 213, in makedirs
os.makedirs(directory_path, exist_ok=True)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/utils.py", line 797, in try_to_create_directory
try_to_create_directory(self._temp_dir)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/node.py", line 278, in _init_temp
self._init_temp(redis_client)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/node.py", line 166, in init
node = ray.node.Node(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 580, in start
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 763, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1062, in main
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1137, in call
return cli()
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1808, in main
sys.exit(main())
File "/home/ray/anaconda3/bin/ray", line 8, in
Traceback (most recent call last):

Librabbitmq 2.0.0 with Python 3 gives TypeError: can't pickle memoryview objects

I am using the latest master branch of the git repo https://github.com/celery/librabbitmq and installing librabbitmq==2.0.0 for Python 3.6 by following the instructions in the readme
Using the development version
You can clone the repository by doing the following:
$ git clone git://github.com/celery/librabbitmq.git
Then install it by doing the following:
$ cd librabbitmq
$ make install # or make develop
This works fine (after installing certain binaries for c compliation in the OS), but when I then make a small a+b add task and call it with add.delay(2,2) it fails with the following error. I looked up and saw that Celery 4 uses json as serializer, so clearly it is not because if pickle serialization
Changing from librabbitmq to pyamqp broker works normally
Same exact situation in both MacOS and Ubuntu 16
[2018-04-30 23:40:02,956: CRITICAL/MainProcess] Unrecoverable error:
SystemError(' returned a result with an error set',) Traceback (most
recent call last): File
"/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/kombu/messaging.py",
line 624, in _receive_callback
return on_m(message) if on_m else self.receive(decoded, message) File
"/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/worker/consumer/consumer.py",
line 570, in on_task_received
callbacks, File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/worker/strategy.py",
line 145, in task_message_handler
handle(req) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/worker/worker.py",
line 221, in _process_task_sem
return self._quick_acquire(self._process_task, req) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/kombu/async/semaphore.py",
line 62, in acquire
callback(*partial_args, **partial_kwargs) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/worker/worker.py",
line 226, in _process_task
req.execute_using_pool(self.pool) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/worker/request.py",
line 531, in execute_using_pool
correlation_id=task_id, File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/concurrency/base.py",
line 155, in apply_async
**options) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/billiard/pool.py",
line 1486, in apply_async
self._quick_put((TASK, (result._job, None, func, args, kwds))) File
"/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/concurrency/asynpool.py",
line 813, in send_job
body = dumps(tup, protocol=protocol) TypeError: can't pickle memoryview objects
The above exception was the direct cause of the following exception:
Traceback (most recent call last): File
"/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/worker/worker.py",
line 203, in start
self.blueprint.start(self) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/bootsteps.py",
line 119, in start
step.start(parent) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/bootsteps.py",
line 370, in start
return self.obj.start() File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/worker/consumer/consumer.py",
line 320, in start
blueprint.start(self) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/bootsteps.py",
line 119, in start
step.start(parent) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/worker/consumer/consumer.py",
line 596, in start
c.loop(*c.loop_args()) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/celery/worker/loops.py",
line 88, in asynloop
next(loop) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/kombu/async/hub.py",
line 354, in create_loop
cb(*cbargs) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/kombu/transport/base.py",
line 236, in on_readable
reader(loop) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/kombu/transport/base.py",
line 218, in _read
drain_events(timeout=0) File "/Users/somghosh/.virtualenvs/ctdb/lib/python3.6/site-packages/librabbitmq-2.0.0-py3.6-macosx-10.6-intel.egg/librabbitmq/init.py",
line 227, in drain_events
self._basic_recv(timeout) SystemError: returned a result with an error set
This library is not recommended to use as rabbitmq broker with celery. Instead please try py-amqp. this is more maintained and less buggy.

Error while retrieving result in pymongo

I have an python application which creates number of threads for a job. each thread connects to mongodb and retrieve data. Number of allowed connection to mongodb is 200 which I'm taking care using semaphore. And once mongo querying job is done each thread closes mongodb connection. But while executing this application I'm getting same error for all threads. Error is:
Traceback (most recent call last):
File "C:\Python34\lib\threading.py", line 921, in _bootstrap_inner
self.run()
File "C:\Python34\lib\threading.py", line 869, in run
self._target(*self._args, **self._kwargs)
File "C:/path/pytest/under_construction/testAlgo.py", line 95, in sample_thread
status=monObj.process_status(list_value1,list_value2,5,120,120)
File "C:\path\pytest\under_construction\mongo_lib.py", line 153, in process_status
result=self.mongo_result('Submission','find',q={})
File "C:\path\pytest\under_construction\mongo_lib.py", line 53, in mongo_result
result=list(_query[query_type.lower()](query_string[keys]))
File "C:\Python34\lib\site-packages\pymongo\cursor.py", line 1076, in __next__
if len(self.__data) or self._refresh():
File "C:\Python34\lib\site-packages\pymongo\cursor.py", line 1037, in _refresh
limit, self.__id))
File "C:\Python34\lib\site-packages\pymongo\cursor.py", line 933, in __send_message
res = client._send_message_with_response(message, **kwargs)
File "C:\Python34\lib\site-packages\pymongo\mongo_client.py", line 1205, in _send_message_with_response
response = self.__send_and_receive(message, sock_info)
File "C:\Python34\lib\site-packages\pymongo\mongo_client.py", line 1182, in __send_and_receive
return self.__receive_message_on_socket(1, request_id, sock_info)
File "C:\Python34\lib\site-packages\pymongo\mongo_client.py", line 1174, in __receive_message_on_socket
return self.__receive_data_on_socket(length - 16, sock_info)
File "C:\Python34\lib\site-packages\pymongo\mongo_client.py", line 1153, in __receive_data_on_socket
chunk = sock_info.sock.recv(length)
MemoryError
Code for creating mongo connection
client=MongoClient(mc_name,port)
I was thinking, is this error due to results of all threads accumulating at one port of machine running my application?
MongoClient is a thread-safe connection pool, so you should be creating a single instance that's shared by all the worker threads rather than having each thread create its own.
The connection pool size defaults to 100, but if you want to make it even larger you can use the maxPoolSize parameter to do that (e.g. maxPoolSize=200).

redis-py raises AttributeError

In what circumstances would redis-py raise the following AttributeError exception?
Isn't redis-py built by design to raise only redis.exceptions.RedisError based exceptions?
What would be a reasonable handling logic?
Traceback (most recent call last):
File "c:\Python27\Lib\threading.py", line 551, in __bootstrap_inner
self.run()
File "c:\Python27\Lib\threading.py", line 504, in run
self.__target(*self.__args, **self.__kwargs)
File C:\Users\Administrator\Documents\my_proj\my_module.py", line 33, in inner
ret = protected_func(*args, **kwargs)
File C:\Users\Administrator\Documents\my_proj\my_module.py", line 104, in _listen
for message in _pubsub.listen():
File "C:\Users\Administrator\virtual_environments\my_env\lib\site-packages\redis\client.py", line 1555, in listen
r = self.parse_response()
File "C:\Users\Administrator\virtual_environments\my_env\lib\site-packages\redis\client.py", line 1499, in parse_response
response = self.connection.read_response()
File "C:\Users\Administrator\virtual_environments\my_env\lib\site-packages\redis\connection.py", line 306, in read_response
response = self._parser.read_response()
File "C:\Users\Administrator\virtual_environments\my_env\lib\site-packages\redis\connection.py", line 104, in read_response
response = self.read()
File "C:\Users\Administrator\virtual_environments\my_env\lib\site-packages\redis\connection.py", line 89, in read
return self._fp.readline()[:-2]
AttributeError: 'NoneType' object has no attribute 'readline'
seems like an old question, but I faced the same problem recently.
My setup was using celery with redis as a broker. A ThreadPoolExecutor uses the shared celery object to batch tasks to workers. The batcher function waits for the submitted tasks to finish using celery.result.ResultSet.
After quick investigations, I found that celery somewhere uses a pub/sub mechanism to wait for the tasks to finish. And that is it, pub/sub don't play well with thread-safety per the official readme https://github.com/andymccurdy/redis-py#thread-safety
Honestly, I didn't try to prove my theory and fixed my problem by switching to a ProcessPoolExecutor instead.

ipython 0.13 zmq errors

I encounter weird behavior of an ipython cluster. The calculations finish, but many results never reach the client (and the engines just idle after finishing their first calculation).
I suspect something is wrong with zmq because 1) from time to time I see the following error:
File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 118, in get
if not self.ready():
File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 132, in ready
self.wait(0)
File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/parallel/client/asyncresult.py", line 142, in wait
self._ready = self._client.wait(self.msg_ids, timeout)
File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 1058, in wait
self.spin()
File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 1015, in spin
self._flush_results(self._task_socket)
File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/parallel/client/client.py", line 814, in _flush_results
idents,msg = self.session.recv(sock, mode=zmq.NOBLOCK)
File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/zmq/session.py", line 642, in recv
idents, msg_list = self.feed_identities(msg_list, copy)
File "/data/misc/nano/python/env_stable/lib/python2.7/site-packages/IPython/zmq/session.py", line 673, in feed_identities
idx = msg_list.index(DELIM)
ValueError: '<IDS|MSG>' is not in list
Additionally IPython.zmq has two test failures:
======================================================================
ERROR: test_send (IPython.zmq.tests.test_session.TestSession)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/clusterdata/python/env_stable/lib/python2.7/site-packages/IPython/zmq/tests/test_session.py", line 76, in test_send
socket = MockSocket(zmq.Context.instance(),zmq.PAIR)
File "/clusterdata/python/env_stable/lib/python2.7/site-packages/IPython/zmq/tests/test_session.py", line 34, in __init__
self.data = []
File "/clusterdata/python/env_stable/lib/python2.7/site-packages/zmq/sugar/attrsettr.py", line 38, in __setattr__
self.__class__.__name__, upper_key)
AttributeError: MockSocket has no such option: DATA
======================================================================
ERROR: test_send (IPython.zmq.tests.test_session.TestSession)
----------------------------------------------------------------------
Traceback (most recent call last):
File "/clusterdata/python/env_stable/lib/python2.7/site-packages/zmq/tests/__init__.py", line 108, in tearDown
raise RuntimeError("context could not terminate, open sockets likely remain in test")
RuntimeError: context could not terminate, open sockets likely remain in test
----------------------------------------------------------------------
I use pyzmq 13.0.0 (as installed by pip), and the zeromq 3.2.2, compiled by the setup of pyzmq. I use ipython 13.1 and python 2.7.3.
Any suggestions of what could this be, and if not how I could figure out more information why these errors occur?
Update: It turns out the slowdown was due to a long task queue of ipcontroller, which was then taking 100% CPU and lagging horribly. That is a separate issue, but I would still appreciate feedback on the above.
Answered by #minrk in comments. ZMQ errors were unimportant, performance was due to scheduling, and was solved by setting TaskScheduler.hwm=0.