Celery: Remote workers frequently losing connection

I have a Celery broker running on a cloud server (Django app), and two workers on local servers in my office behind a NAT. The local workers frequently lose their connection to the broker and have to be restarted to re-establish it. Usually celeryd restart hangs the first time I try it, so I have to Ctrl+C and retry once or twice to get it back up and connected. The two most common errors in the workers' logs are:
[2014-08-03 00:08:45,398: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 278, in start
blueprint.start(self)
File "/usr/local/lib/python2.7/dist-packages/celery/bootsteps.py", line 123, in start
step.start(parent)
File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 796, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python2.7/dist-packages/celery/worker/loops.py", line 72, in asynloop
next(loop)
File "/usr/local/lib/python2.7/dist-packages/kombu/async/hub.py", line 320, in create_loop
cb(*cbargs)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/base.py", line 159, in on_readable
reader(loop)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/base.py", line 142, in _read
raise ConnectionError('Socket was disconnected')
ConnectionError: Socket was disconnected
[2014-03-07 20:15:41,963: CRITICAL/MainProcess] Couldn't ack 11, reason:RecoverableConnectionError(None, 'connection already closed', None, '')
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/kombu/message.py", line 93, in ack_log_error
self.ack()
File "/usr/local/lib/python2.7/dist-packages/kombu/message.py", line 88, in ack
self.channel.basic_ack(self.delivery_tag)
File "/usr/local/lib/python2.7/dist-packages/amqp/channel.py", line 1583, in basic_ack
self._send_method((60, 80), args)
File "/usr/local/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 50, in _send_method
raise RecoverableConnectionError('connection already closed')
How do I go about debugging this? Is the fact that the workers are behind a NAT an issue? Is there a good tool to monitor whether the workers have lost connection? At least with that, I could get them back online by manually restarting the worker.

Unfortunately, yes: there is a problem with late acks in Celery+Kombu, where the task handler tries to use an already-closed connection.
I worked around it like this:
CELERY_CONFIG = {
    'CELERYD_MAX_TASKS_PER_CHILD': 1,
    'CELERYD_PREFETCH_MULTIPLIER': 1,
    'CELERY_ACKS_LATE': True,
}
CELERYD_MAX_TASKS_PER_CHILD guarantees that the worker process will be restarted after finishing each task.
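For completeness, one way to apply a dict like this is with app.conf.update; this is just a sketch, the app name and broker URL are placeholders, and with django-celery you may instead put these keys directly in your Django settings:

from celery import Celery

app = Celery('proj', broker='amqp://guest@broker-host//')  # placeholder broker URL
app.conf.update(**CELERY_CONFIG)  # apply the workaround settings above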
As for the tasks that have already lost their connection, there is nothing you can do right now. Maybe it'll be fixed in version 4. I just make sure that the tasks are as idempotent as possible.
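As for a tool to monitor whether the workers have lost their connection, a minimal sketch using Celery's built-in ping control command is below; the import path and worker names are placeholders. You could run something like this from cron and restart any worker that stops replying:

from proj.celery import app  # hypothetical import path for your Celery app

# Ask all workers to reply within 5 seconds; a worker that has lost its
# broker connection simply won't show up in the replies.
replies = app.control.ping(timeout=5)
alive = {name for reply in replies for name in reply}
expected = {'worker1@office-1', 'worker2@office-2'}  # placeholder worker names

for missing in expected - alive:
    print('worker %s did not reply, restart it' % missing)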

Related

KafkaProducer - Getting error to connect to kafka (Failed to update metadata after 60.0 secs)

I am trying to read data from Oracle and send it to a Kafka topic. I was able to read from Oracle and put it into a dataframe, and I set all the Kafka parameters as shown in my code below, but I am getting this error:
kafka.errors.KafkaTimeoutError: KafkaTimeoutError: Failed to update metadata after 60.0 secs.
This link looks similar, but it did not help me:
KafkaTimeoutError: Failed to update metadata after 60.0 secs
I use Amazon Managed Streaming for Apache Kafka (MSK).
I have two brokers. Do I need to put both of them as my bootstrap servers, or just the main one?
It connects to Kafka and then disconnects, but doesn't send any messages.
Here is my code ...
try:
    conn = OracleHook(oracle_conn_id=oracle_conn_id).get_conn()
    query = "Select * from sales"
    df = pd.read_sql(query, conn)
    topic = 'my-topic'
    producer = KafkaProducer(bootstrap_servers=['localhost:9092'],
                             value_serializer=lambda x: dumps(x).encode('utf-8'),
                             api_version=(0, 10, 1))
    for raw in pd.read_sql(query, conn):
        producer.send(topic, raw.encode('utf-8'))
    print('Number os records')
    conn.close()
except Exception as error:
    raise error
return
... and the log
{{conn.py:381}} INFO - <BrokerConnection node_id=bootstrap-0 host='my-bootstrap_servers': connecting to 'my-server']
{{conn.py:410}} INFO - <BrokerConnection node_id=bootstrap-0 host='my-bootstrap_servers': Connection complete.
{{conn.py:1096}} ERROR - <BrokerConnection node_id=bootstrap-0 host='my-bootstrap_servers': socket disconnected
{{conn.py:919}} INFO - <BrokerConnection node_id=bootstrap-0 host='my-bootstrap_servers': Closing connection. KafkaConnectionError: socket disconnected
{{taskinstance.py:1703}} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1332, in _run_raw_task
self._execute_task_with_callbacks(context)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1458, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1514, in _execute_task
result = execute_callable(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 151, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 162, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/kafka/producer/kafka.py", line 576, in send
self._wait_on_metadata(topic, self.config['max_block_ms'] / 1000.0)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/kafka/producer/kafka.py", line 703, in _wait_on_metadata
"Failed to update metadata after %.1f secs." % (max_wait,))
kafka.errors.KafkaTimeoutError: KafkaTimeoutError: Failed to update metadata after 60.0 secs.
{{taskinstance.py:1280}} INFO - Marking task as FAILED. dag_id=bkbne_ora_to_kafka, task_id=task_id, execution_date=20220624T204102, start_date=20220628T171225, end_date=20220628T171327
{{standard_task_runner.py:91}} ERROR - Failed to execute job 95 for task task_id
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/task/task_runner/standard_task_runner.py", line 85, in _start_by_fork
args.func(args, dag=self.dag)
File "/usr/local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 48, in command
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/airflow/utils/cli.py", line 92, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/airflow/cli/commands/task_command.py", line 292, in task_run
_run_task_by_selected_method(args, dag, ti)
File "/usr/local/lib/python3.7/site-packages/airflow/cli/commands/task_command.py", line 107, in _run_task_by_selected_method
_run_raw_task(args, ti)
File "/usr/local/lib/python3.7/site-packages/airflow/cli/commands/task_command.py", line 184, in _run_raw_task
error_file=args.error_file,
File "/usr/local/lib/python3.7/site-packages/airflow/utils/session.py", line 70, in wrapper
return func(*args, session=session, **kwargs)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1332, in _run_raw_task
self._execute_task_with_callbacks(context)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1458, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1514, in _execute_task
result = execute_callable(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 151, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 162, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/dags/send_to_kafka/src/send_to_kafka.py", line 63, in f_se
raise e
File "/usr/local/airflow/dags/send_to_kafka/src/send_to_kafka.py", line 55, in send_to_kafka
producer.send(topic, row.encode('utf-8'))
File "/usr/local/airflow/.local/lib/python3.7/site-packages/kafka/producer/kafka.py", line 576, in send
self._wait_on_metadata(topic, self.config['max_block_ms'] / 1000.0)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/kafka/producer/kafka.py", line 703, in _wait_on_metadata
"Failed to update metadata after %.1f secs." % (max_wait,))
kafka.errors.KafkaTimeoutError: KafkaTimeoutError: Failed to update metadata after 60.0 secs.
Could someone help me with this? I don't know what is happening here.
Ensure that you actually have connectivity to the upstream Kafka brokers (preferably every one of them) with something like ping/ncat/the Kafka console tools. The fact that you can't get metadata (and see socket disconnects) points to network "problems" (bad config / firewall?).
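If you'd rather check from Python (for example from the same Airflow worker that runs the task), here is a minimal sketch; the broker hostnames and port are placeholders for the ones MSK gives you:

import socket

# Placeholder MSK broker endpoints; replace with your real bootstrap servers.
brokers = [('b-1.mycluster.kafka.us-east-1.amazonaws.com', 9092),
           ('b-2.mycluster.kafka.us-east-1.amazonaws.com', 9092)]

for host, port in brokers:
    try:
        # Plain TCP connect; if this fails, the producer cannot work either.
        socket.create_connection((host, port), timeout=5).close()
        print('%s:%d reachable' % (host, port))
    except OSError as exc:
        print('%s:%d NOT reachable: %s' % (host, port, exc))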
Do I need to put both of them as my bootstrap servers, or just the main one?
Need? No.
However, the more servers you put into the bootstrap list, the more tolerant to failures your application is (at least in the Java client, which picks a random one to connect to first; the C/Python client should behave the same, AFAICT).
Your code isn't running on the actual brokers, so bootstrap_servers=['localhost:9092'] should be changed to the address(es) that MSK provides you. You may also need to add authentication settings, depending on which port you use and how you have configured your cluster.
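As a rough sketch of what that producer configuration might look like: the broker endpoints are placeholders, and the TLS port and security settings are assumptions that depend on how your MSK cluster is configured:

from json import dumps
from kafka import KafkaProducer

producer = KafkaProducer(
    # Both MSK brokers, not localhost; 9094 is MSK's default TLS listener port.
    bootstrap_servers=['b-1.mycluster.kafka.us-east-1.amazonaws.com:9094',
                       'b-2.mycluster.kafka.us-east-1.amazonaws.com:9094'],
    security_protocol='SSL',
    value_serializer=lambda x: dumps(x).encode('utf-8'),
)
producer.send('my-topic', {'hello': 'world'})
producer.flush()  # block until the message has actually been delivered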
Regarding the logic of your code, I'd suggest using MSK Connect with JDBC Source or Debezium to read a database table into Kafka.

Airflow scheduler with Kubernetes executor fails: Unknown error in KubernetesJobWatcher

I am trying to set up Airflow with the Kubernetes executor. I have cloned Airflow 1.10.6 and am building the Docker image and then deploying it on Kubernetes.
The pods are running and the Airflow service also starts. The webserver is working fine.
But when I check the logs for the scheduler, I get the following error.
ERROR - Error while health checking kube watcher process. Process died for unknown reasons
INFO - Event: and now my watch begins starting at resource_version: 0
ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/executors/kubernetes_executor.py", line 333, in run
self.worker_uuid, self.kube_config)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/executors/kubernetes_executor.py", line 358, in _run
**kwargs):
File "/usr/local/lib/python2.7/dist-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/usr/local/lib/python2.7/dist-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 781, in read_chunked
self._original_response.close()
File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 439, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
ReadTimeoutError: HTTPSConnectionPool(host='10.0.0.1', port=443): Read timed out.
Any help/pointers are appreciated.

Airflow scheduler is throwing out an error - 'DisabledBackend' object has no attribute '_get_task_meta_for'

I am trying to install Airflow (distributed mode) in WSL. I have set up the Airflow webserver, Airflow scheduler, Airflow worker, Celery (3.1) and RabbitMQ.
While running the Airflow scheduler, it throws the error below even though the result backend is set up.
ERROR
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/airflow/executors/celery_executor.py", line 92, in sync
state = task.state
File "/usr/local/lib/python3.6/dist-packages/celery/result.py", line 398, in state
return self._get_task_meta()['status']
File "/usr/local/lib/python3.6/dist-packages/celery/result.py", line 341, in _get_task_meta
return self._maybe_set_cache(self.backend.get_task_meta(self.id))
File "/usr/local/lib/python3.6/dist-packages/celery/backends/base.py", line 288, in get_task_meta
meta = self._get_task_meta_for(task_id)
AttributeError: 'DisabledBackend' object has no attribute '_get_task_meta_for'
https://issues.apache.org/jira/browse/AIRFLOW-1840
This is the exact error I am getting, but I couldn't find a solution there.
Result backend configuration:
result_backend = db+postgresql://postgres:****@localhost:5432/postgres
broker_url = amqp://rabbitmq_user_name:rabbitmq_password@localhost/rabbitmq_virtual_host_name
Help please; I have gone through almost all the documentation but couldn't find a solution.
I was facing the same issue on Celery version 3.1.26.post2 (with RabbitMQ, PostgreSQL and Airflow). The reason is that the configuration dictionary built in Celery's base.py file (at lib/python3.5/site-packages/celery/app/base.py) does not capture the Celery backend under the key CELERY_RESULT_BACKEND; instead it captures it under the key result_backend.
So the solution is to go to the _get_config function in that same base.py file and, at the end of the function, just before the dictionary s is returned, add the line below:
s['CELERY_RESULT_BACKEND'] = s['result_backend'] #code to be added
return s
This solved the problem.
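If you want to confirm which backend your Celery install actually resolves, here is a small diagnostic sketch based on the explanation above; the backend URL is a placeholder, so use the same value as in your airflow.cfg:

from celery import Celery

app = Celery('diag')
app.conf.update(result_backend='db+postgresql://postgres:****@localhost:5432/postgres')

# On an affected install this prints 'DisabledBackend', because only the
# uppercase CELERY_RESULT_BACKEND key is honoured; once the backend is
# picked up correctly it prints 'DatabaseBackend' instead.
print(type(app.backend).__name__)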

PyMongo AutoReconnect: timed out

I work in an Azure environment. I have a VM that runs a Django application (Open edX) and a Mongo server on another VM instance (Ubuntu 16.04). Whenever I try to load anything in the application (where the data is fetched from the Mongo server), I get an error like this one:
Feb 23 12:49:43 xxxxx [service_variant=lms][mongodb_proxy][env:sandbox] ERROR [xxxxx 13875] [mongodb_proxy.py:55] - Attempt 0
Traceback (most recent call last):
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/mongodb_proxy.py", line 53, in wrapper
return func(*args, **kwargs)
File "/edx/app/edxapp/edx-platform/common/lib/xmodule/xmodule/contentstore/mongo.py", line 135, in find
with self.fs.get(content_id) as fp:
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/gridfs/__init__.py", line 159, in get
return GridOut(self.__collection, file_id)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/gridfs/grid_file.py", line 406, in __init__
self._ensure_file()
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/gridfs/grid_file.py", line 429, in _ensure_file
self._file = self.__files.find_one({"_id": self.__file_id})
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/collection.py", line 1084, in find_one
for result in cursor.limit(-1):
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1149, in next
if len(self.__data) or self._refresh():
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1081, in _refresh
self.__codec_options.uuid_representation))
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/cursor.py", line 996, in __send_message
res = client._send_message_with_response(message, **kwargs)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1366, in _send_message_with_response
raise AutoReconnect(str(e))
AutoReconnect: timed out
First I thought it was because my Mongo server lived on an instance outside of the Django application's virtual network, so I created a new Mongo server on an instance inside the same virtual network, but I still get these issues. Mind you, I do receive the data eventually, but I feel I wouldn't get timed-out errors if the connection were normal.
If it helps, here's the Ansible playbook that I used to create the Mongo server: https://github.com/edx/configuration/tree/master/playbooks/roles/mongo_3_2
Also I have tailed the Mongo log file and this is the only line that would appear at the same time I would get the timed out error on the application server:
2018-02-23T12:49:20.890+0000 [conn5] authenticate db: edxapp { authenticate: 1, user: "user", nonce: "xxx", key: "xxx" }
mongostat and mongotop don't show anything out of the ordinary.
I don't know what else to look for or how to fix this issue.
I forgot to change the Mongo server IPs in the Django application settings to point to the new private IP address inside the virtual network instead of the public IP. After I changed that, I don't get the issue anymore.
If you are reading this, make sure you change the private IP to a static one in Azure if you are using that IP address in the Django application settings.
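For illustration, a minimal sketch of the kind of settings change this amounts to; the private IP is a placeholder and this is not the actual Open edX settings key:

from pymongo import MongoClient

# Placeholder private IP of the Mongo VM inside the virtual network
# (not the public IP). serverSelectionTimeoutMS needs PyMongo 3.x+.
client = MongoClient('mongodb://10.0.0.5:27017/', serverSelectionTimeoutMS=5000)
print(client.server_info())  # raises a timeout error if the server is unreachable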

GSM Shield with Raspberry Pi Port Open

I have connected an a-gsm shield (http://itbrainpower.net/a-gsm/downloadables/a-gsm-series-presentation-v1.01.pdf) to a Raspberry Pi 3.
I have imported and attempted to execute the power-on code from http://itbrainpower.net/a-gsm/RaspberryPI-gsm-shield-library-powerOnOff-demo-code-a-gsm. However, I am getting the following error message:
sudo ./poweronoff.py
Traceback (most recent call last):
File "./poweronoff.py", line 66, in <module>
agsm.open()
File "/usr/lib/python2.7/dist-packages/serial/serialposix.py", line 271, in open
raise SerialException("Port is already open.")
serial.serialutil.SerialException: Port is already open.
I assume I need to open a different port. If that is the solution, how do I do it? If it is not, what do I need to do to resolve this error message?
Follow the directions in this post: https://www.raspberrypi.org/forums/viewtopic.php?uid=195125&f=28&t=165897&start=0.
Basically, "enable_uart=1" should be the last line in '/boot/config.txt' for serial port operations. Reboot the RPi3 and everything should work.