When too many tasks run at the same time on Airflow, some tasks hit a heartbeat exception and get stuck. Sometimes, even when a task has completed, Airflow cannot retrieve its completion log.
Our setup:
Airflow v2.3.4 in Docker
PostgreSQL 11.6 on x86_64-pc-linux-gnu
Example log:
[2022-10-19, 09:01:15 +03] {base_job.py:229} ERROR - LocalTaskJob heartbeat got an exception
....
File "/home/airflow/.local/lib/python3.7/site-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "xxx" (xxx), port xxx failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
The above exception was the direct cause of the following exception:
Traceback (most recent call last):
....
return self.dbapi.connect(*cargs, **cparams)
File "/home/airflow/.local/lib/python3.7/site-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
sqlalchemy.exc.OperationalError: (psycopg2.OperationalError) connection to server at "xxx" (xxx), port xxx failed: server closed the connection unexpectedly
This probably means the server terminated abnormally
before or while processing the request.
(Background on this error at: https://sqlalche.me/e/14/e3q8)
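For transient connection drops like the one in this log, one common mitigation is to retry the connection attempt with exponential backoff. The following is only a minimal sketch of that idea, not how Airflow handles heartbeats internally; call_with_retry and its parameters are hypothetical names introduced here for illustration.

```python
import time
from typing import Callable, Tuple, Type, TypeVar

T = TypeVar("T")


def call_with_retry(
    func: Callable[[], T],
    retry_on: Tuple[Type[BaseException], ...],
    attempts: int = 5,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on the given exception types with exponential backoff."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except retry_on:
            if attempt == attempts:
                raise  # out of attempts: surface the last error to the caller
            # Delay doubles each time: base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
```

With psycopg2 this could be called as call_with_retry(lambda: psycopg2.connect(dsn), retry_on=(psycopg2.OperationalError,)), where dsn stands for your own connection string. If the server is closing connections because it is overloaded by the number of concurrent tasks, retrying only hides the symptom, so it is still worth checking the PostgreSQL server log for the reason the connection was closed.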
I created an app with Flask and connected it to PostgreSQL using psycopg2 as the connector. I then created a Dockerfile, and the build fails with this error:
'''Traceback (most recent call last):
File "init_db.py", line 5, in <module>
conn = psycopg2.connect(
File "/usr/local/lib/python3.8/site-packages/psycopg2/__init__.py", line 122, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: connection to server at "192.168.0.167", port 5432 failed: Connection refused
Is the server running on that host and accepting TCP/IP connections?'''
I have specified the host, port, database, username and password in it, but the issue remains. I have also tried changing ports, but the error persists.
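"Connection refused" means the host answered but nothing accepted the connection on that port, so credentials are not the problem at this stage. A quick way to confirm this from inside the container is a plain TCP probe that does not involve psycopg2 at all. This is a rough sketch; probe_tcp is a hypothetical helper name, and the host and port in the usage note are the ones from the traceback above.

```python
import errno
import socket


def probe_tcp(host: str, port: int, timeout: float = 3.0) -> str:
    """Return 'open', 'refused', 'timeout', 'unreachable', or 'error: ...'."""
    try:
        # create_connection resolves the host and performs the TCP handshake
        with socket.create_connection((host, port), timeout=timeout):
            return "open"
    except ConnectionRefusedError:
        # Host reached, but no process is listening on that port
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout
        return "timeout"
    except OSError as exc:
        if exc.errno in (errno.EHOSTUNREACH, errno.ENETUNREACH):
            return "unreachable"
        return f"error: {exc}"
```

Running print(probe_tcp("192.168.0.167", 5432)) in the same environment where init_db.py runs shows whether the port is reachable from there. Since the traceback appears during the image build, the result may differ from what you see on the host machine.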
When I type
swift stat
I get this error:
HTTPConnectionPool(host='controller', port=8080): Max retries exceeded with url: /v1/AUTH_d17698cf7bbf4dcc8fc59ed6f7b48052 (Caused by NewConnectionError('<urllib3.connection.HTTPConnection object at 0x7f1c597f9910>: Failed to establish a new connection: [Errno 111] Connection refused'))
Issue with Ambassador deployment starting in Kubernetes:
Kubernetes - v1.13
Ambassador - image: quay.io/datawire/ambassador:0.50.3
Container Runtime - Docker
Cluster networking is done using Flannel.
The entire setup runs on two Oracle VMs on a single Windows 10 machine.
The network type is Host-Only, with the master at 192.168.99.110 and a node at 192.168.99.101.
I am deploying Ambassador using kubectl apply -f https://getambassador.io/yaml/ambassador/ambassador-rbac.yaml. About 30 seconds after the pod starts kubewatch, it goes into a 'CrashLoopBackOff' state. I inspected the pod's logs, shown below; the last line of the traceback states that 10.96.0.1 (the API server cluster IP) is unreachable:
" []# kubectl logs ambassador-76f644ddfb-vnj4d
2019-03-06 17:12:13 kubewatch [23 TMainThread] 0.50.3 INFO: kubewatch starting: mode 'cluster-id' ambassador_config_dir '/no/such/path' envoy_config_file '/dev/null' debug 'False' delay '1.0' pid 'None'
2019-03-06 17:12:13 kubewatch [23 TMainThread] 0.50.3 INFO: namespace default, watching all namespaces
2019-03-06 17:12:14,131 WARNING Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f739506d748>: Failed to establish a new connection: [Errno 113] Host is unreachable',)': /api/v1/namespaces/default
2019-03-06 17:12:14 kubewatch [23 TMainThread] 0.50.3 WARNING: Retrying (Retry(total=2, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f739506d748>: Failed to establish a new connection: [Errno 113] Host is unreachable',)': /api/v1/namespaces/default
2019-03-06 17:12:15,136 WARNING Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f739506d7f0>: Failed to establish a new connection: [Errno 113] Host is unreachable',)': /api/v1/namespaces/default
2019-03-06 17:12:15 kubewatch [23 TMainThread] 0.50.3 WARNING: Retrying (Retry(total=1, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f739506d7f0>: Failed to establish a new connection: [Errno 113] Host is unreachable',)': /api/v1/namespaces/default
2019-03-06 17:12:16,140 WARNING Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f739506d898>: Failed to establish a new connection: [Errno 113] Host is unreachable',)': /api/v1/namespaces/default
2019-03-06 17:12:16 kubewatch [23 TMainThread] 0.50.3 WARNING: Retrying (Retry(total=0, connect=None, read=None, redirect=None, status=None)) after connection broken by 'NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f739506d898>: Failed to establish a new connection: [Errno 113] Host is unreachable',)': /api/v1/namespaces/default
2019-03-06 17:12:17 kubewatch [23 TMainThread] 0.50.3 WARNING: kubewatch failed!
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/urllib3/connection.py", line 159, in _new_conn
(self._dns_host, self.port), self.timeout, **extra_kw)
File "/usr/lib/python3.6/site-packages/urllib3/util/connection.py", line 80, in create_connection
raise err
File "/usr/lib/python3.6/site-packages/urllib3/util/connection.py", line 70, in create_connection
sock.connect(sa)
OSError: [Errno 113] Host is unreachable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 600, in urlopen
chunked=chunked)
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 343, in _make_request
self._validate_conn(conn)
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 839, in _validate_conn
conn.connect()
File "/usr/lib/python3.6/site-packages/urllib3/connection.py", line 301, in connect
conn = self._new_conn()
File "/usr/lib/python3.6/site-packages/urllib3/connection.py", line 168, in _new_conn
self, "Failed to establish a new connection: %s" % e)
urllib3.exceptions.NewConnectionError: <urllib3.connection.VerifiedHTTPSConnection object at 0x7f739506d9b0>: Failed to establish a new connection: [Errno 113] Host is unreachable
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/ambassador/kubewatch.py", line 527, in <module>
main()
File "/usr/lib/python3.6/site-packages/click/core.py", line 764, in __call__
return self.main(*args, **kwargs)
File "/usr/lib/python3.6/site-packages/click/core.py", line 717, in main
rv = self.invoke(ctx)
File "/usr/lib/python3.6/site-packages/click/core.py", line 956, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/usr/lib/python3.6/site-packages/click/core.py", line 555, in invoke
return callback(*args, **kwargs)
File "/ambassador/kubewatch.py", line 518, in main
watcher.run(id_only=True)
File "/ambassador/kubewatch.py", line 342, in run
self.get_cluster_id(v1)
File "/ambassador/kubewatch.py", line 407, in get_cluster_id
ret = v1.read_namespace(wanted)
File "/usr/lib/python3.6/site-packages/kubernetes/client/apis/core_v1_api.py", line 17572, in read_namespace
(data) = self.read_namespace_with_http_info(name, **kwargs)
File "/usr/lib/python3.6/site-packages/kubernetes/client/apis/core_v1_api.py", line 17657, in read_namespace_with_http_info
collection_formats=collection_formats)
File "/usr/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 321, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/usr/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
_request_timeout=_request_timeout)
File "/usr/lib/python3.6/site-packages/kubernetes/client/api_client.py", line 342, in request
headers=headers)
File "/usr/lib/python3.6/site-packages/kubernetes/client/rest.py", line 231, in GET
query_params=query_params)
File "/usr/lib/python3.6/site-packages/kubernetes/client/rest.py", line 205, in request
headers=headers)
File "/usr/lib/python3.6/site-packages/urllib3/request.py", line 68, in request
**urlopen_kw)
File "/usr/lib/python3.6/site-packages/urllib3/request.py", line 89, in request_encode_url
return self.urlopen(method, url, **extra_kw)
File "/usr/lib/python3.6/site-packages/urllib3/poolmanager.py", line 323, in urlopen
response = conn.urlopen(method, u.request_uri, **kw)
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 667, in urlopen
**response_kw)
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 667, in urlopen
**response_kw)
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 667, in urlopen
**response_kw)
File "/usr/lib/python3.6/site-packages/urllib3/connectionpool.py", line 638, in urlopen
_stacktrace=sys.exc_info()[2])
File "/usr/lib/python3.6/site-packages/urllib3/util/retry.py", line 398, in increment
raise MaxRetryError(_pool, url, error or ResponseError(cause))
urllib3.exceptions.MaxRetryError: HTTPSConnectionPool(host='10.96.0.1', port=443): Max retries exceeded with url: /api/v1/namespaces/default (Caused by NewConnectionError('<urllib3.connection.VerifiedHTTPSConnection object at 0x7f739506d9b0>: Failed to establish a new connection: [Errno 113] Host is unreachable',))
AMBASSADOR: kubewatch cluster-id exited with status 1
AMBASSADOR: shutting down (1)"
An update: I have disabled the firewalld service on both the master and the node, and the issue is now solved. I think the root cause was that pods on the node could not reach kube-dns on the master because of firewall restrictions.
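For anyone comparing logs like these: on Linux, [Errno 113] is EHOSTUNREACH ("host is unreachable") and [Errno 111] is ECONNREFUSED ("connection refused"). The first usually points at routing or a firewall, which fits the firewalld fix above; the second usually means nothing is listening on the port. A small sketch for pulling the code out of a log line and looking it up on the local system (explain_errno is a hypothetical helper name, and the numeric values are platform-specific):

```python
import errno
import os
import re
from typing import Optional, Tuple

# Matches the "[Errno N]" fragment that Python puts in OSError messages
_ERRNO_RE = re.compile(r"\[Errno (\d+)\]")


def explain_errno(log_line: str) -> Optional[Tuple[int, str, str]]:
    """Return (number, symbolic name, OS description) for the first '[Errno N]' found."""
    match = _ERRNO_RE.search(log_line)
    if match is None:
        return None
    number = int(match.group(1))
    # errno.errorcode maps numbers to names for the platform this runs on
    name = errno.errorcode.get(number, "UNKNOWN")
    return number, name, os.strerror(number)
```

Passing one of the WARNING lines from the pod log to explain_errno on a Linux machine gives the symbolic name for the code, which is easier to search for than the bare number.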