Airflow scheduler with Kubernetes executor fails: Unknown error in KubernetesJobWatcher

I am trying to set up Airflow with the Kubernetes executor. I cloned Airflow 1.10.6, built the Docker image, and deployed it to Kubernetes.
The pods are running and the airflow service also starts. The webserver is working fine.
But when I check the scheduler logs, I get the following error:
ERROR - Error while health checking kube watcher process. Process died for unknown reasons
INFO - Event: and now my watch begins starting at resource_version: 0
ERROR - Unknown error in KubernetesJobWatcher. Failing
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/executors/kubernetes_executor.py", line 333, in run
self.worker_uuid, self.kube_config)
File "/usr/local/lib/python2.7/dist-packages/airflow/contrib/executors/kubernetes_executor.py", line 358, in _run
**kwargs):
File "/usr/local/lib/python2.7/dist-packages/kubernetes/watch/watch.py", line 144, in stream
for line in iter_resp_lines(resp):
File "/usr/local/lib/python2.7/dist-packages/kubernetes/watch/watch.py", line 48, in iter_resp_lines
for seg in resp.read_chunked(decode_content=False):
File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 781, in read_chunked
self._original_response.close()
File "/usr/lib/python2.7/contextlib.py", line 35, in __exit__
self.gen.throw(type, value, traceback)
File "/usr/local/lib/python2.7/dist-packages/urllib3/response.py", line 439, in _error_catcher
raise ReadTimeoutError(self._pool, None, "Read timed out.")
ReadTimeoutError: HTTPSConnectionPool(host='10.0.0.1', port=443): Read timed out.
Any help/pointers are appreciated.


KafkaProducer - Getting error to connect to kafka (Failed to update metadata after 60.0 secs)

I am trying to read data from Oracle and send it to a Kafka topic. I was able to read from Oracle and put the result into a dataframe, and I set all the Kafka parameters as shown in my code below, but I am getting this error:
kafka.errors.KafkaTimeoutError: KafkaTimeoutError: Failed to update metadata after 60.0 secs.
This link looks similar, but it did not help me:
KafkaTimeoutError: Failed to update metadata after 60.0 secs
I use Amazon Managed Streaming for Apache Kafka (MSK).
I have two brokers. Do I need to put both as my bootstrap servers or just the main bootstrap server?
The producer connects to Kafka and disconnects, but doesn't send any messages to Kafka.
Here is my code ...
try:
    conn = OracleHook(oracle_conn_id=oracle_conn_id).get_conn()
    query = "Select * from sales"
    df = pd.read_sql(query, conn)
    topic = 'my-topic'
    producer = KafkaProducer(
        bootstrap_servers=['localhost:9092'],
        value_serializer=lambda x: dumps(x).encode('utf-8'),
        api_version=(0, 10, 1),
    )
    for raw in pd.read_sql(query, conn):
        producer.send(topic, raw.encode('utf-8'))
    print('Number os records')
    conn.close()
except Exception as error:
    raise error
return
... and the log
{{conn.py:381}} INFO - <BrokerConnection node_id=bootstrap-0 host='my-bootstrap_servers': connecting to 'my-server']
{{conn.py:410}} INFO - <BrokerConnection node_id=bootstrap-0 host='my-bootstrap_servers': Connection complete.
{{conn.py:1096}} ERROR - <BrokerConnection node_id=bootstrap-0 host='my-bootstrap_servers': socket disconnected
{{conn.py:919}} INFO - <BrokerConnection node_id=bootstrap-0 host='my-bootstrap_servers': Closing connection. KafkaConnectionError: socket disconnected
{{taskinstance.py:1703}} ERROR - Task failed with exception
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1332, in _run_raw_task
self._execute_task_with_callbacks(context)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1458, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1514, in _execute_task
result = execute_callable(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 151, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 162, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/kafka/producer/kafka.py", line 576, in send
self._wait_on_metadata(topic, self.config['max_block_ms'] / 1000.0)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/kafka/producer/kafka.py", line 703, in _wait_on_metadata
"Failed to update metadata after %.1f secs." % (max_wait,))
kafka.errors.KafkaTimeoutError: KafkaTimeoutError: Failed to update metadata after 60.0 secs.
{{taskinstance.py:1280}} INFO - Marking task as FAILED. dag_id=bkbne_ora_to_kafka, task_id=task_id, execution_date=20220624T204102, start_date=20220628T171225, end_date=20220628T171327
{{standard_task_runner.py:91}} ERROR - Failed to execute job 95 for task task_id
Traceback (most recent call last):
File "/usr/local/lib/python3.7/site-packages/airflow/task/task_runner/standard_task_runner.py", line 85, in _start_by_fork
args.func(args, dag=self.dag)
File "/usr/local/lib/python3.7/site-packages/airflow/cli/cli_parser.py", line 48, in command
return func(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/airflow/utils/cli.py", line 92, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.7/site-packages/airflow/cli/commands/task_command.py", line 292, in task_run
_run_task_by_selected_method(args, dag, ti)
File "/usr/local/lib/python3.7/site-packages/airflow/cli/commands/task_command.py", line 107, in _run_task_by_selected_method
_run_raw_task(args, ti)
File "/usr/local/lib/python3.7/site-packages/airflow/cli/commands/task_command.py", line 184, in _run_raw_task
error_file=args.error_file,
File "/usr/local/lib/python3.7/site-packages/airflow/utils/session.py", line 70, in wrapper
return func(*args, session=session, **kwargs)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1332, in _run_raw_task
self._execute_task_with_callbacks(context)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1458, in _execute_task_with_callbacks
result = self._execute_task(context, self.task)
File "/usr/local/lib/python3.7/site-packages/airflow/models/taskinstance.py", line 1514, in _execute_task
result = execute_callable(context=context)
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 151, in execute
return_value = self.execute_callable()
File "/usr/local/lib/python3.7/site-packages/airflow/operators/python.py", line 162, in execute_callable
return self.python_callable(*self.op_args, **self.op_kwargs)
File "/usr/local/airflow/dags/send_to_kafka/src/send_to_kafka.py", line 63, in f_se
raise e
File "/usr/local/airflow/dags/send_to_kafka/src/send_to_kafka.py", line 55, in send_to_kafka
producer.send(topic, row.encode('utf-8'))
File "/usr/local/airflow/.local/lib/python3.7/site-packages/kafka/producer/kafka.py", line 576, in send
self._wait_on_metadata(topic, self.config['max_block_ms'] / 1000.0)
File "/usr/local/airflow/.local/lib/python3.7/site-packages/kafka/producer/kafka.py", line 703, in _wait_on_metadata
"Failed to update metadata after %.1f secs." % (max_wait,))
kafka.errors.KafkaTimeoutError: KafkaTimeoutError: Failed to update metadata after 60.0 secs.
Could someone help me with this? I don't know what is happening here.
Ensure that you actually have the connectivity to upstream Kafka brokers (preferably every one of them) with something like ping/ncat/kafka console tools. The fact you can't get metadata (have socket disconnects) points to network "problems" (bad config / firewall?).
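As a rough sketch of that connectivity check in Python, using only the standard library: the helper names `can_connect` and `check_brokers` are made up for illustration, and the broker addresses in the usage comment are placeholders, not real MSK endpoints. It only proves that a TCP connection can be opened; it says nothing about TLS or authentication.

```python
import socket


def can_connect(host, port, timeout=5.0):
    """Return True if a TCP connection to host:port opens within `timeout` seconds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and DNS resolution failures.
        return False


def check_brokers(brokers, timeout=5.0):
    """Map each 'host:port' string to whether a TCP connection could be opened."""
    results = {}
    for broker in brokers:
        host, _, port = broker.rpartition(":")
        results[broker] = can_connect(host, int(port), timeout)
    return results


# Usage (placeholder addresses; run this from the machine that runs the Airflow task):
# print(check_brokers(["broker-1.example.com:9092", "broker-2.example.com:9092"]))
```

If any broker comes back False from the worker host, look at security groups, firewalls, or routing before touching the producer code.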
Do I need to put both as my bootstrap servers or just the main bootstrap server?
Need? No.
However, the more servers you put into the bootstrap list, the more tolerant to failures your application is (at least in the Java client, which picks a random one to connect to first; the Python one should be the same AFAICT).
Your code isn't running on the actual brokers, so bootstrap_servers=['localhost:9092'] should be changed to the address(es) that MSK provides you. You may also need to add authentication settings, depending on which port you use and how you have configured your cluster.
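A minimal sketch of what that change could look like with kafka-python (the client visible in the traceback). The helper `build_producer_config` is hypothetical, the broker hostnames are placeholders, and the assumption that the listener is TLS-only is mine; check which listener and auth mode your cluster actually exposes.

```python
import json


def build_producer_config(bootstrap_servers, use_tls=False):
    """Build keyword arguments for kafka-python's KafkaProducer.

    `bootstrap_servers` may be a list of 'host:port' strings or a single
    comma-separated string.
    """
    if isinstance(bootstrap_servers, str):
        bootstrap_servers = [s.strip() for s in bootstrap_servers.split(",") if s.strip()]
    if not bootstrap_servers:
        raise ValueError("at least one bootstrap server is required")

    config = {
        "bootstrap_servers": list(bootstrap_servers),
        # The serializer does the encoding, so pass plain objects to send().
        "value_serializer": lambda value: json.dumps(value).encode("utf-8"),
    }
    if use_tls:
        # Assumption: the cluster listener requires TLS without SASL.
        config["security_protocol"] = "SSL"
    return config


# Usage (placeholder broker addresses; requires the kafka-python package):
# from kafka import KafkaProducer
# producer = KafkaProducer(**build_producer_config(
#     "b-1.example.com:9094,b-2.example.com:9094", use_tls=True))
# producer.send("my-topic", {"id": 1})
# producer.flush()
```

Note that with a `value_serializer` configured, calling `.encode('utf-8')` on the value before `send()` would hand bytes to `json.dumps`, which fails; pass the unencoded record instead.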
Regarding the logic of your code, I'd suggest using MSK Connect with JDBC Source or Debezium to read a database table into Kafka.

gRPC query waits indefinitely during execution, giving no output, in Seldon-Core

I am trying to run the Seldon gRPC example on a Google Cloud Kubernetes cluster. I am able to get the access_token, but the gRPC query waits indefinitely. Finally, I get the following error:
Traceback (most recent call last):
File "test_request.py", line 53, in <module>
grpc_request()
File "test_request.py", line 50, in grpc_request
response = stub.Predict(request=request,metadata=metadata)
File "/usr/local/lib/python2.7/dist-packages/grpc/_channel.py", line 487, in __call__
return _end_unary_response_blocking(state, call, False, deadline)
File "/usr/local/lib/python2.7/dist-packages/grpc/_channel.py", line 437, in _end_unary_response_blocking
raise _Rendezvous(state, None, None, deadline)
grpc._channel._Rendezvous: <_Rendezvous of RPC that terminated with (StatusCode.UNAVAILABLE, OS Error)>
What's the issue?

Why can't I connect to my Service Fabric cluster?

I've tried to connect to a cluster that I've just created via PowerShell and an ARM template, using this command (IP numbers replaced with z):
sfctl cluster select --endpoint http://z.z.z.z:19000
This is the error that occurs shortly afterwards:
C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise>sfctl cluster select --endpoint http://z.z.z.z:19000
Error occurred in request., ConnectionError: HTTPConnectionPool(host='z.z.z.z', port=19000): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x05F39F90>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
Traceback (most recent call last):
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\urllib3\connection.py", line 141, in _new_conn
(self.host, self.port), self.timeout, **extra_kw)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\urllib3\util\connection.py", line 83, in create_connection
raise err
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\urllib3\util\connection.py", line 73, in create_connection
sock.connect(sa)
TimeoutError: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\urllib3\connectionpool.py", line 601, in urlopen
chunked=chunked)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\urllib3\connectionpool.py", line 357, in _make_request
conn.request(method, url, **httplib_request_kw)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\http\client.py", line 1239, in request
self._send_request(method, url, body, headers, encode_chunked)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\http\client.py", line 1285, in _send_request
self.endheaders(body, encode_chunked=encode_chunked)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\http\client.py", line 1234, in endheaders
self._send_output(message_body, encode_chunked=encode_chunked)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\http\client.py", line 1026, in _send_output
self.send(msg)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\http\client.py", line 964, in send
self.connect()
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\urllib3\connection.py", line 166, in connect
conn = self._new_conn()
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\urllib3\connection.py", line 150, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.HTTPConnection object at 0x05F39F90>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\requests\adapters.py", line 440, in send
timeout=timeout
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\urllib3\connectionpool.py", line 668, in urlopen
**response_kw)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\urllib3\connectionpool.py", line 668, in urlopen
**response_kw)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\urllib3\connectionpool.py", line 668, in urlopen
**response_kw)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\urllib3\connectionpool.py", line 639, in urlopen
_stacktrace=sys.exc_info()[2])
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\urllib3\util\retry.py", line 388, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPConnectionPool(host='51.145.27.4', port=19000): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x05F39F90>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\msrest\service_client.py", line 201, in send
**kwargs)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\requests\sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\requests\sessions.py", line 618, in send
r = adapter.send(request, **kwargs)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\requests\adapters.py", line 508, in send
raise ConnectionError(e, request=request)
requests.exceptions.ConnectionError: HTTPConnectionPool(host='51.145.27.4', port=19000): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x05F39F90>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\knack\cli.py", line 125, in invoke
cmd_result = self.invocation.execute(args)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\knack\invocation.py", line 85, in execute
cmd_result = parsed_args.func(params)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\knack\commands.py", line 67, in __call__
return self.handler(*args, **kwargs)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\knack\commands.py", line 123, in _command_handler
result = op(client, **command_args) if client else op(**command_args)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\sfctl\custom_cluster.py", line 95, in select
rest_client.send(rest_client.get('/')).raise_for_status()
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\msrest\service_client.py", line 227, in send
raise_with_traceback(ClientRequestError, msg, err)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\msrest\exceptions.py", line 45, in raise_with_traceback
raise error.with_traceback(exc_traceback)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\msrest\service_client.py", line 201, in send
**kwargs)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\requests\sessions.py", line 508, in request
resp = self.send(prep, **send_kwargs)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\requests\sessions.py", line 618, in send
r = adapter.send(request, **kwargs)
File "c:\users\user\appdata\local\programs\python\python36-32\lib\site-packages\requests\adapters.py", line 508, in send
raise ConnectionError(e, request=request)
msrest.exceptions.ClientRequestError: Error occurred in request., ConnectionError: HTTPConnectionPool(host='51.145.27.4', port=19000): Max retries exceeded with url: / (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x05F39F90>: Failed to establish a new connection: [WinError 10060] A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond',))
C:\Program Files (x86)\Microsoft Visual Studio\2017\Enterprise>
I have read answers about checking the load balancer to ensure it has rules. The load balancer doesn't have rules:
I've been scratching my head trying to understand why it's not working. I'm trying to create a VSTS release that deploys to this Service Fabric cluster, and I'm testing that the cluster is reachable because I'm getting this error in the release:
##[error]No cluster endpoint is reachable, please check if there is connectivity/firewall/DNS issue.
You're using the wrong port. sfctl uses Service Fabric's HTTP API, which is typically port 19080 on your cluster (confirmed by the LBHttpRule in your load balancer settings screenshot). Port 19000 on your cluster is typically the binary connection port.
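To illustrate the port mix-up, here is a small sketch that rewrites an endpoint from the binary client port to the HTTP gateway port before handing it to sfctl. The function name `to_http_gateway_endpoint` is invented for this example, and 19000/19080 are the typical defaults mentioned above; your cluster may be configured differently. It assumes a plain host (no user info, no IPv6 literal).

```python
from urllib.parse import urlsplit, urlunsplit

CLIENT_CONNECTION_PORT = 19000  # typical binary client connection port
HTTP_GATEWAY_PORT = 19080       # typical HTTP gateway port used by sfctl


def to_http_gateway_endpoint(endpoint):
    """Swap the client connection port for the HTTP gateway port, if present."""
    parts = urlsplit(endpoint)
    if parts.port != CLIENT_CONNECTION_PORT:
        return endpoint  # nothing to fix
    netloc = "{}:{}".format(parts.hostname, HTTP_GATEWAY_PORT)
    return urlunsplit((parts.scheme, netloc, parts.path, parts.query, parts.fragment))


# Usage:
# endpoint = to_http_gateway_endpoint("http://z.z.z.z:19000")
# then run: sfctl cluster select --endpoint <endpoint>
```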

PyMongo AutoReconnect: timed out

I work in an Azure environment. I have a VM that runs a Django application (Open edX) and a Mongo server on another VM instance (Ubuntu 16.04). Whenever I try to load anything in the application (where the data is fetched from the Mongo server), I would get an error like this one:
Feb 23 12:49:43 xxxxx [service_variant=lms][mongodb_proxy][env:sandbox] ERROR [xxxxx 13875] [mongodb_proxy.py:55] - Attempt 0
Traceback (most recent call last):
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/mongodb_proxy.py", line 53, in wrapper
return func(*args, **kwargs)
File "/edx/app/edxapp/edx-platform/common/lib/xmodule/xmodule/contentstore/mongo.py", line 135, in find
with self.fs.get(content_id) as fp:
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/gridfs/__init__.py", line 159, in get
return GridOut(self.__collection, file_id)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/gridfs/grid_file.py", line 406, in __init__
self._ensure_file()
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/gridfs/grid_file.py", line 429, in _ensure_file
self._file = self.__files.find_one({"_id": self.__file_id})
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/collection.py", line 1084, in find_one
for result in cursor.limit(-1):
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1149, in next
if len(self.__data) or self._refresh():
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/cursor.py", line 1081, in _refresh
self.__codec_options.uuid_representation))
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/cursor.py", line 996, in __send_message
res = client._send_message_with_response(message, **kwargs)
File "/edx/app/edxapp/venvs/edxapp/local/lib/python2.7/site-packages/pymongo/mongo_client.py", line 1366, in _send_message_with_response
raise AutoReconnect(str(e))
AutoReconnect: timed out
At first I thought it was because my Mongo server lived in an instance outside of the Django application's virtual network. I created a new Mongo server on an instance inside the same virtual network and still got these errors. Mind you, I receive the data eventually, but I feel like I wouldn't get timeout errors if the connection were normal.
If it helps, here's the Ansible playbook that I used to create the Mongo server: https://github.com/edx/configuration/tree/master/playbooks/roles/mongo_3_2
Also I have tailed the Mongo log file and this is the only line that would appear at the same time I would get the timed out error on the application server:
2018-02-23T12:49:20.890+0000 [conn5] authenticate db: edxapp { authenticate: 1, user: "user", nonce: "xxx", key: "xxx" }
mongostat and mongotop don't show anything out of the ordinary. Also here's the htop output:
I don't know what else to look for or how to fix this issue.
I forgot to change the Mongo server IPs in the Django application settings to point to the new private IP address inside the virtual network instead of the public IP. After I changed that, I no longer get this issue.
If you are reading this, make sure you set the private IP to a static one in Azure if you use that IP address in the Django application settings.
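A quick way to catch this kind of misconfiguration is to check whether the Mongo hosts in your settings are private addresses. This is only a sketch: `classify_host` is a made-up helper, the sample addresses are placeholders, and "private" here means whatever Python's `ipaddress` module treats as private, which is broader than just the RFC 1918 ranges (it includes loopback, for example).

```python
import ipaddress


def classify_host(host):
    """Return 'private', 'public', or 'hostname' for a configured host string."""
    try:
        address = ipaddress.ip_address(host)
    except ValueError:
        # Not an IP literal, so it needs DNS resolution to say more.
        return "hostname"
    return "private" if address.is_private else "public"


# Usage with placeholder values standing in for the hosts in your settings:
# for host in ["10.0.0.5", "8.8.8.8"]:
#     print(host, classify_host(host))
```

Anything reported as public for a database that lives in the same virtual network is worth a second look.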

Celery: Remote workers frequently losing connection

I have a Celery broker running on a cloud server (Django app), and two workers on local servers in my office connected behind a NAT. The local workers frequently lose connection, and have to be restarted to re-establish connection with the broker. Usually celeryd restart hangs the first time I try it, so I have to Ctrl+C and retry once or twice to get it back up and connected. The two most common errors in the workers' logs are:
[2014-08-03 00:08:45,398: WARNING/MainProcess] consumer: Connection to broker lost. Trying to re-establish the connection...
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 278, in start
blueprint.start(self)
File "/usr/local/lib/python2.7/dist-packages/celery/bootsteps.py", line 123, in start
step.start(parent)
File "/usr/local/lib/python2.7/dist-packages/celery/worker/consumer.py", line 796, in start
c.loop(*c.loop_args())
File "/usr/local/lib/python2.7/dist-packages/celery/worker/loops.py", line 72, in asynloop
next(loop)
File "/usr/local/lib/python2.7/dist-packages/kombu/async/hub.py", line 320, in create_loop
cb(*cbargs)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/base.py", line 159, in on_readable
reader(loop)
File "/usr/local/lib/python2.7/dist-packages/kombu/transport/base.py", line 142, in _read
raise ConnectionError('Socket was disconnected')
ConnectionError: Socket was disconnected
[2014-03-07 20:15:41,963: CRITICAL/MainProcess] Couldn't ack 11, reason:RecoverableConnectionError(None, 'connection already closed', None, '')
Traceback (most recent call last):
File "/usr/local/lib/python2.7/dist-packages/kombu/message.py", line 93, in ack_log_error
self.ack()
File "/usr/local/lib/python2.7/dist-packages/kombu/message.py", line 88, in ack
self.channel.basic_ack(self.delivery_tag)
File "/usr/local/lib/python2.7/dist-packages/amqp/channel.py", line 1583, in basic_ack
self._send_method((60, 80), args)
File "/usr/local/lib/python2.7/dist-packages/amqp/abstract_channel.py", line 50, in _send_method
raise RecoverableConnectionError('connection already closed')
How do I go about debugging this? Is the fact that the workers are behind a NAT an issue? Is there a good tool to monitor whether the workers have lost connection? At least with that, I could get them back online by manually restarting the worker.
Unfortunately yes, there is a problem with late acks in Celery+Kombu: the task handler tries to use a closed connection.
I worked around it like this:
CELERY_CONFIG = {
    'CELERYD_MAX_TASKS_PER_CHILD': 1,
    'CELERYD_PREFETCH_MULTIPLIER': 1,
    'CELERY_ACKS_LATE': True,
}
CELERYD_MAX_TASKS_PER_CHILD guarantees that the worker process is restarted after finishing each task.
As for the tasks that have already lost their connection, there is nothing you can do right now. Maybe it'll be fixed in version 4. I just make sure that the tasks are as idempotent as possible.
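On the idempotency point: with late acks, a task whose ack was lost can be delivered and run again, so the handler should tolerate repeats. Below is a minimal sketch of one way to do that, not Celery API. The `IdempotentHandler` class is hypothetical, and the in-memory dict is only a stand-in: with workers restarting after every task, a real setup would need a store shared across processes (a database or cache, for instance).

```python
class IdempotentHandler:
    """Wrap a function so repeated deliveries of the same task ID run it once."""

    def __init__(self, func):
        self._func = func
        # task_id -> result. In-memory stand-in for a durable, shared store.
        self._results = {}

    def __call__(self, task_id, *args, **kwargs):
        if task_id in self._results:
            # Redelivered task: return the recorded result, skip the side effects.
            return self._results[task_id]
        result = self._func(*args, **kwargs)
        self._results[task_id] = result
        return result


# Usage:
# charge = IdempotentHandler(lambda amount: "charged %d" % amount)
# charge("task-11", 50)   # runs the function
# charge("task-11", 50)   # redelivery: returns the stored result
```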