PyMongo AutoReconnect: timed out - mongodb

I work in an Azure environment. I have a VM that runs a Django application (Open edX) and a Mongo server on another VM instance (Ubuntu 16.04). Whenever I try to load anything in the application (where the data is fetched from the Mongo server), I get an error like this one:
Feb 23 12:49:43 xxxxx [service_variant=lms][mongodb_proxy][env:sandbox] ERROR [xxxxx 13875] [mongodb_proxy.py:55] - Attempt 0
Traceback (most recent call last):
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/mongodb_proxy.py", line 53, in wrapper
return func(*args, **kwargs)
File "/edx/app/edxapp/edx-platform/common/lib/xmodule/xmodule/contentstore/mongo.py", line 135, in find
with self.fs.get(content_id) as fp:
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/gridfs/__init__.py", line 159, in get
return GridOut(self.__collection, file_id)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/gridfs/grid_file.py", line 406, in __init__
self._ensure_file()
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/gridfs/grid_file.py", line 429, in _ensure_file
self._file = self.__files.find_one({"_id": self.__file_id})
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/collection.py", line 1084, in find_one
for result in cursor.limit(-1):
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1149, in next
if len(self.__data) or self._refresh():
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1081, in _refresh
self.__codec_options.uuid_representation))
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/cursor.py", line 996, in __send_message
res = client._send_message_with_response(message, **kwargs)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1366, in _send_message_with_response
raise AutoReconnect(str(e))
AutoReconnect: timed out
First I thought it was because my Mongo server lived in an instance outside of the Django application's virtual network, but after I created a new Mongo server on an instance inside the same virtual network I still got these errors. Mind you, the data does arrive eventually, but I wouldn't expect timed out errors if the connection were healthy.
If it helps, here's the Ansible playbook that I used to create the Mongo server: https://github.com/edx/configuration/tree/master/playbooks/roles/mongo_3_2
I have also tailed the Mongo log file, and this is the only line that appears at the same time the timed out error shows up on the application server:
2018-02-23T12:49:20.890+0000 [conn5] authenticate db: edxapp { authenticate: 1, user: "user", nonce: "xxx", key: "xxx" }
mongostat and mongotop don't show anything out of the ordinary, and neither does htop.
I don't know what else to look for or how to fix this issue.

I forgot to change the Mongo server IPs in the Django application settings to point to the new private IP address inside the virtual network instead of the public IP. After I changed that, I don't get the issue anymore.
If you are reading this, make sure you make that private IP static in Azure if you are using it in the Django application settings.
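For anyone debugging a similar setup, a quick way to compare connectivity and latency against the public and private addresses is a small PyMongo script along these lines (a rough sketch assuming PyMongo 3.x or later; the host IPs, port and timeout below are placeholders, not values from this deployment):
import time
from pymongo import MongoClient
from pymongo.errors import AutoReconnect, ServerSelectionTimeoutError

def ping(host):
    # Fail fast instead of waiting out the default server selection timeout.
    client = MongoClient(host, 27017, serverSelectionTimeoutMS=2000)
    start = time.time()
    try:
        client.admin.command('ping')
        print('%s reachable in %.0f ms' % (host, (time.time() - start) * 1000))
    except (AutoReconnect, ServerSelectionTimeoutError) as exc:
        print('%s failed: %s' % (host, exc))

ping('10.0.0.4')    # placeholder: private IP inside the virtual network
ping('52.160.0.1')  # placeholder: public IP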

Related

Airflow 2.1.4 Composer V2 GKE kubernetes in custom VPC Subnet returning 404

So I have two V2 Composers running in the same project. The only difference between them is that one uses the default subnet and default/autogenerated values for cluster-ipv4-cidr & services-ipv4-cidr. In the other I've created another subnet in the same (default) VPC, in the same region but with a different IP range, and I reference this subnet when creating the Composer; additionally I give it cluster-ipv4-cidr=xx.44.0.0/17 and services-ipv4-cidr=xx.45.4.0/22.
Everything else is the same between these two Composer environments. In the environment with the custom subnet I'm not able to run any KubernetesPodOperator jobs; they return the error:
ERROR - Exception when attempting to create Namespaced Pod:
Traceback (most recent call last):
File "/opt/python3.8/lib/python3.8/site-packages/airflow/providers/cncf/kubernetes/utils/pod_manager.py", line 111, in run_pod_async
resp = self._client.create_namespaced_pod(
File "/opt/python3.8/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 6174, in create_namespaced_pod
(data) = self.create_namespaced_pod_with_http_info(namespace, body, **kwargs) # noqa: E501
File "/opt/python3.8/lib/python3.8/site-packages/kubernetes/client/api/core_v1_api.py", line 6251, in create_namespaced_pod_with_http_info
return self.api_client.call_api(
File "/opt/python3.8/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 340, in call_api
return self.__call_api(resource_path, method,
File "/opt/python3.8/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 172, in __call_api
response_data = self.request(
File "/opt/python3.8/lib/python3.8/site-packages/kubernetes/client/api_client.py", line 382, in request
return self.rest_client.POST(url,
File "/opt/python3.8/lib/python3.8/site-packages/kubernetes/client/rest.py", line 272, in POST
return self.request("POST", url,
File "/opt/python3.8/lib/python3.8/site-packages/kubernetes/client/rest.py", line 231, in request
raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (404)
and this pod does not appear if I go to GKE to check workloads. These two GKE environments use the same Composer service account, K8s service account and namespaces, but from my understanding that is not an issue. Jobs outside of the KubernetesPodOperator work fine. I had a theory that perhaps the non-default subnet needed additional permissions, but I haven't been able to confirm or deny it yet.
From the logs I can see that the KubernetesPodOperator can't locate the worker, even though I can find it from the UI, and non-KubernetesPodOperator jobs do this successfully.
I would appreciate some guidance on what to do / where to look.
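For context, the failing tasks are plain KubernetesPodOperator tasks along the lines of the sketch below (the DAG id, namespace and image are placeholders rather than the real values):
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG('k8s_pod_example', start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    # Minimal pod task; this is the kind of task that returns the 404
    # in the custom-subnet environment but works in the default one.
    hello = KubernetesPodOperator(
        task_id='hello-pod',
        name='hello-pod',
        namespace='default',                      # placeholder namespace
        image='us-docker.pkg.dev/example/hello',  # placeholder image
        cmds=['echo', 'hello'],
        get_logs=True,
    )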

Airflow scheduler is throwing out an error - 'DisabledBackend' object has no attribute '_get_task_meta_for'

I am trying to install Airflow (distributed mode) in WSL; I have set up the Airflow webserver, Airflow scheduler, Airflow worker, Celery (3.1) and RabbitMQ.
While running the Airflow scheduler it throws the error below, even though the result backend is set up.
ERROR
Traceback (most recent call last):
File "/usr/local/lib/python3.6/dist-packages/airflow/executors/celery_executor.py", line 92, in sync
state = task.state
File "/usr/local/lib/python3.6/dist-packages/celery/result.py", line 398, in state
return self._get_task_meta()['status']
File "/usr/local/lib/python3.6/dist-packages/celery/result.py", line 341, in _get_task_meta
return self._maybe_set_cache(self.backend.get_task_meta(self.id))
File "/usr/local/lib/python3.6/dist-packages/celery/backends/base.py", line 288, in get_task_meta
meta = self._get_task_meta_for(task_id)
AttributeError: 'DisabledBackend' object has no attribute '_get_task_meta_for'
https://issues.apache.org/jira/browse/AIRFLOW-1840
This is the exact error I am getting, but I couldn't find a solution there.
Result Backend-
result_backend = db+postgresql://postgres:****#localhost:5432/postgres
broker_url = amqp://rabbitmq_user_name:rabbitmq_password#localhost/rabbitmq_virtual_host_name
Help please; I have gone through almost all the documentation but couldn't find a solution.
I was facing the same issue on Celery version 3.1.26.post2 (with RabbitMQ, PostgreSQL and Airflow). The reason for this issue is that the settings dictionary built in Celery's base.py (lib/python3.5/site-packages/celery/app/base.py) does not capture the result backend under the key CELERY_RESULT_BACKEND; instead it captures it under the key result_backend.
So the solution is to go to the _get_config function in that same base.py file and, at the end of the function, just before the dictionary s is returned, add the line below.
s['CELERY_RESULT_BACKEND'] = s['result_backend'] #code to be added
return s
This solved the problem.
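A quick way to confirm whether Celery actually picked up the result backend (rather than silently falling back to DisabledBackend) is to inspect the executor's Celery app from a Python shell. A rough sketch, assuming an Airflow 1.10-style install where the Celery app lives in airflow.executors.celery_executor:
from airflow.executors.celery_executor import app

# Before the patch this typically prints DisabledBackend; after aliasing
# result_backend to CELERY_RESULT_BACKEND it should print a real backend
# (e.g. a database backend for a db+postgresql:// URL).
print(type(app.backend))
print(app.conf.get('CELERY_RESULT_BACKEND'))
print(app.conf.get('result_backend'))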

XMPP: error while authenticating

I am using the xmpppy library to connect to an XMPP server. I was able to connect to the server, but during authentication I get the error "Plugging ignored: another instance already plugged."
What does this error mean and how can I resolve it?
In [37]: c.isConnected()
Out[37]: ''
In [38]: jid = xmpp.protocol.JID('gathole#localhost')
In [39]: c.auth(jid.getNode(),'password', resource=jid.getResource())
DEBUG: sasl start Plugging <xmpp.auth.SASL instance at 0x108fc2710> into <xmpp.client.Client instance at 0x109003c68>
DEBUG: sasl error Plugging ignored: another instance already plugged.
Traceback (most recent call last):
File "/Users/gathole/.virtualenvs/driveu/lib/python2.7/site-packages/IPython/terminal/interactiveshell.py", line 459, in interact
line = self.raw_input(prompt)
File "/Users/gathole/.virtualenvs/driveu/lib/python2.7/site-packages/IPython/terminal/interactiveshell.py", line 528, in raw_input
line = py3compat.cast_unicode_py2(self.raw_input_original(prompt))
KeyboardInterrupt
DEBUG: gen_auth start Plugging <xmpp.auth.NonSASL instance at 0x108fc2710> into <xmpp.client.Client instance at 0x109003c68>
DEBUG: gen_auth error Plugging ignored: another instance already plugged.
Traceback (most recent call last):
File "/Users/gathole/.virtualenvs/driveu/lib/python2.7/site-packages/IPython/terminal/interactiveshell.py", line 459, in interact
line = self.raw_input(prompt)
File "/Users/gathole/.virtualenvs/driveu/lib/python2.7/site-packages/IPython/terminal/interactiveshell.py", line 528, in raw_input
line = py3compat.cast_unicode_py2(self.raw_input_original(prompt))
KeyboardInterrupt
This requires a change in the ejabberd XMPP server config: replace the line {hosts, ["localhost"]} with {hosts, ["localhost", "server-domain", "server-ip-address"]} in the ejabberd.cfg file.
Then restart the server and create another user under the new hosts, using the server domain or server IP.
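After that, authenticating with a JID whose domain matches one of the configured hosts should go through. A minimal xmpppy sketch (the JID, domain and password below are placeholders):
import xmpp

# Placeholder JID; the domain part must appear in ejabberd's {hosts, [...]} list.
jid = xmpp.protocol.JID('gathole@your-server-domain')
client = xmpp.Client(jid.getDomain(), debug=[])
client.connect()
client.auth(jid.getNode(), 'your-password', resource=jid.getResource())
client.sendInitPresence()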

JupyterHub on Google Compute Engine

I'm trying to set up a JupyterHub instance to serve IPython notebooks on Google Compute Engine. However, when running jupyterhub I am faced with an error regarding sockets:
[E 2015-08-31 10:27:55.617 JupyterHub app:1097]
Traceback (most recent call last):
File "/home/esten/anaconda3/envs/py3k/lib/python3.3/site- packages/jupyterhub/app.py", line 1095, in launch_instance_async
yield self.start()
File "/home/esten/anaconda3/envs/py3k/lib/python3.3/site-packages/jupyterhub/app.py", line 1027, in start
self.http_server.listen(self.hub_port, address=self.hub_ip)
File "/home/esten/anaconda3/envs/py3k/lib/python3.3/site-packages/tornado/tcpserver.py", line 126, in listen
sockets = bind_sockets(port, address=address)
File "/home/esten/anaconda3/envs/py3k/lib/python3.3/site-packages/tornado/netutil.py", line 187, in bind_sockets
sock.bind(sockaddr)
OSError: [Errno 99] Cannot assign requested address
The address/port assigned by the config file is localhost/8081, and binding a socket like below works perfectly fine:
import socket
s = socket.socket()
s.bind(("localhost", 8081))
Does jupyterhub look somewhere else for the information or is something done differently when binding the socket through my own code?
This seems to be a problem with GCE not supporting IPv6.
I found this link explaining that enabling IPv6 solved the issue on another machine.
Running with the --ip flag solved the issue:
jupyter notebook --ip="*"
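If you are running JupyterHub itself rather than a plain notebook server, the equivalent fix is to bind to an IPv4 address in jupyterhub_config.py. A rough sketch (standard JupyterHub option names; the addresses are just examples):
# jupyterhub_config.py
c.JupyterHub.ip = '0.0.0.0'        # proxy: listen on all IPv4 interfaces
c.JupyterHub.hub_ip = '127.0.0.1'  # hub API: explicit IPv4 address instead of a name that may resolve to IPv6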

Celery: Remote workers frequently losing connection

I have a Celery broker running on a cloud server (Django app), and two workers on local servers in my office connected behind a NAT. The local workers frequently lose connection and have to be restarted to re-establish the connection with the broker. Usually celeryd restart hangs the first time I try it, so I have to Ctrl+C and retry once or twice to get it back up and connected. These are the two most common errors in the workers' logs:
[2014-08-03 00:08:45,398: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 278, in start
blueprint.start(self)
File "/usr/local/lib/python2.7/dist-packages/celery/bootsteps.py", line 123, in start
step.start(parent)
File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 796, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python2.7/dist-packages/celery/worker/loops.py", line 72, in asynloop
next(loop)
File "/usr/local/lib/python2.7/dist-packages/kombu/async/hub.py", line 320, in create_loop
cb(*cbargs)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/base.py", line 159, in on_readable
reader(loop)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/base.py", line 142, in _read
raise ConnectionError('Socket was disconnected')
ConnectionError: Socket was disconnected
[2014-03-07 20:15:41,963: CRITICAL/MainProcess] Couldn't ack 11, reason:RecoverableConnectionError(None, 'connection already closed', None, '')
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/kombu/message.py", line 93, in ack_log_error
self.ack()
File "/usr/local/lib/python2.7/dist-packages/kombu/message.py", line 88, in ack
self.channel.basic_ack(self.delivery_tag)
File "/usr/local/lib/python2.7/dist-packages/amqp/channel.py", line 1583, in basic_ack
self._send_method((60, 80), args)
File "/usr/local/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 50, in _send_method
raise RecoverableConnectionError('connection already closed')
How do I go about debugging this? Is the fact that the workers are behind a NAT an issue? Is there a good tool to monitor whether the workers have lost connection? At least with that, I could get them back online by manually restarting the worker.
Unfortunately yes, there is a problem with late acks in Celery+Kombu: the task handler tries to use a closed connection.
I worked around it like this:
CELERY_CONFIG = {
    'CELERYD_MAX_TASKS_PER_CHILD': 1,
    'CELERYD_PREFETCH_MULTIPLIER': 1,
    'CELERY_ACKS_LATE': True,
}
CELERYD_MAX_TASKS_PER_CHILD guarantees that the worker process is restarted after finishing each task.
As for the tasks that have already lost their connection, there is nothing you can do right now; maybe it'll be fixed in version 4. I just make sure that the tasks are as idempotent as possible.
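For completeness, here is roughly how those settings get applied to a Celery app (the app name and broker URL are placeholders; the option names are the Celery 3.x ones from the snippet above):
from celery import Celery

app = Celery('proj', broker='amqp://user:password@broker-host//')  # placeholder broker URL
app.conf.update(
    CELERYD_MAX_TASKS_PER_CHILD=1,   # recycle the worker process after every task
    CELERYD_PREFETCH_MULTIPLIER=1,   # fetch only one task at a time
    CELERY_ACKS_LATE=True,           # acknowledge only after the task finishes
)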