ERROR: gcloud crashed (CannotConnectToMetadataServerException): <urlopen error [Errno -2] Name does not resolve> - kubernetes

I am having issues configuring my container to point to my Kubernetes cluster with the command gcloud container clusters get-credentials. I get the following error.
ERROR: gcloud crashed (CannotConnectToMetadataServerException): <urlopen error [Errno -2] Name does not resolve>
If you would like to report this issue, please run the following command:
gcloud feedback
To check gcloud for common problems, please run the following command:
gcloud info --run-diagnostics
Enhanced logging:
CannotConnectToMetadataServerException: <urlopen error [Errno -2] Name does not resolve>
2018-04-10 18:00:42,625 ERROR ___FILE_ONLY___ BEGIN CRASH STACKTRACE
Traceback (most recent call last):
File "/google-cloud-sdk/lib/googlecloudsdk/gcloud_main.py", line 147, in main
gcloud_cli.Execute()
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 818, in Execute
self._HandleAllErrors(exc, command_path_string, specified_arg_names)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 856, in _HandleAllErrors
exceptions.HandleError(exc, command_path_string, self.__known_error_handler)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/exceptions.py", line 526, in HandleError
core_exceptions.reraise(exc)
File "/google-cloud-sdk/lib/googlecloudsdk/core/exceptions.py", line 111, in reraise
six.reraise(type(exc_value), exc_value, tb)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 792, in Execute
resources = calliope_command.Run(cli=self, args=args)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 751, in Run
self._parent_group.RunGroupFilter(tool_context, args)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 692, in RunGroupFilter
self._parent_group.RunGroupFilter(context, args)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 693, in RunGroupFilter
self._common_type().Filter(context, args)
File "/google-cloud-sdk/lib/surface/container/__init__.py", line 71, in Filter
context['api_adapter'] = api_adapter.NewAPIAdapter('v1')
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/container/api_adapter.py", line 147, in NewAPIAdapter
return NewV1APIAdapter()
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/container/api_adapter.py", line 151, in NewV1APIAdapter
return InitAPIAdapter('v1', V1Adapter)
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/container/api_adapter.py", line 172, in InitAPIAdapter
api_client = core_apis.GetClientInstance('container', api_version)
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/apis.py", line 297, in GetClientInstance
api_name, api_version, no_http, _CheckResponse, enable_resource_quota)
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/apis_internal.py", line 153, in _GetClientInstance
http_client = http.Http(enable_resource_quota=enable_resource_quota)
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/http.py", line 64, in Http
creds = store.LoadIfEnabled()
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 281, in LoadIfEnabled
return Load()
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 348, in Load
cred = STATIC_CREDENTIAL_PROVIDERS.GetCredentials(account)
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 162, in GetCredentials
cred = provider.GetCredentials(account)
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 214, in GetCredentials
if account in c_gce.Metadata().Accounts():
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 127, in Accounts
gce_read.GOOGLE_GCE_METADATA_ACCOUNTS_URI + '/')
File "/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 289, in DecoratedFunction
exceptions.reraise(to_reraise[1], tb=to_reraise[2])
File "/google-cloud-sdk/lib/googlecloudsdk/core/exceptions.py", line 111, in reraise
six.reraise(type(exc_value), exc_value, tb)
File "/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 159, in TryFunc
return func(*args, **kwargs), None
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 52, in _ReadNoProxyWithCleanFailures
raise CannotConnectToMetadataServerException(e)
CannotConnectToMetadataServerException: <urlopen error [Errno -2] Name does not resolve>
To give some color, we kick off a build on CircleCI every time we push code to GitHub. We have a container, internally called belushi, that we use to run our entire infrastructure, and it has gcloud installed. CircleCI's infrastructure is on AWS, and when it spins up the belushi container we run gcloud get-credentials to point the container at our project in Google Cloud, which has a Kubernetes cluster configured; we run all of our functional CI testing in that cluster. So we need the belushi container to be configured for the CI project before we can move forward.
The weird thing is that the belushi:latest image always configures properly. However, when we work on belushi we often branch and build a new image to run tests. For example, I create a branch in belushi that gets a new hash of 1234567, so we spin up the belushi:1234567 image and try to run things. The first thing we do is configure it to point to the CI project, and that is where we get the metadata resolve issue.
I feel like it is DNS related, or maybe the metadata server isn't allowing the new belushi image to communicate with it right away. After I retry a bunch of times it eventually configures properly (without any code changes). So I wonder whether the metadata server is rejecting it for some reason, or whether the name is failing to resolve on the AWS side for some unknown reason.

The first thing you can do to troubleshoot, when you get this error, is to run:
curl -H "Metadata-Flavor:Google" http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/
The metadata server should respond straight away with your service account metadata.
Is your container behind any kind of http proxy?
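Since the error is "Name does not resolve", it can also help to check the DNS lookup on its own, separately from the HTTP request. Below is a minimal sketch, assuming Python 3 is available in the container; the helper name resolve_addresses and the injectable resolver argument are illustrative, not part of gcloud. The metadata.google.internal name is normally only resolvable from inside Google Compute Engine, so an empty result on a non-GCE host would be expected rather than surprising.

```python
import socket

# Hostname taken from the curl command above.
METADATA_HOST = "metadata.google.internal"


def resolve_addresses(hostname, resolver=socket.getaddrinfo):
    """Return the sorted addresses a hostname resolves to, or [] on failure.

    `resolver` is injectable (hypothetical parameter) so the function can be
    exercised without network access.
    """
    try:
        infos = resolver(hostname, 80)
    except socket.gaierror:
        # This is the same class of failure as "[Errno -2] Name does not resolve".
        return []
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


if __name__ == "__main__":
    addresses = resolve_addresses(METADATA_HOST)
    if addresses:
        print("resolved:", addresses)
    else:
        print("could not resolve", METADATA_HOST)
```

Running this a few times in a freshly started container would show whether the lookup itself is what fails intermittently.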


AWX-Web error when installing awx-operator on Kubernetes

I am currently installing the awx-operator and have run into an issue while trying to expose the application to the outside world.
The awx-web container within the awx-5b58db49c-9r4hp pod reports an error. When I run kubectl logs pod/awx-5b58db49c-9r4hp -c awx-web, I get the following output:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 81, in _ctit_db_wrapper
yield
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 411, in __getattr__
value = self._get_local(name)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 355, in _get_local
setting = Setting.objects.filter(key=name, user__isnull=True).order_by('pk').first()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 653, in first
for obj in (self if self.ordered else self.order_by('pk'))[:1]:
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 274, in __iter__
self._fetch_all()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 1242, in _fetch_all
self._result_cache = list(self._iterable_class(self))
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 55, in __iter__
results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/sql/compiler.py", line 1140, in execute_sql
cursor = self.connection.cursor()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 256, in cursor
return self._cursor()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 233, in _cursor
self.ensure_connection()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 195, in connect
self.connection = self.get_new_connection(conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 178, in get_new_connection
connection = Database.connect(**conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/psycopg2/__init__.py", line 126, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: FATAL: password authentication failed for user "awx"
2021-05-12 14:28:54,478 ERROR [-] awx.conf.settings Database settings are not available, using defaults.
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/oauth2_provider/settings.py", line 138, in __getattr__
val = self.user_settings[attr]
KeyError: 'OAUTH2_VALIDATOR_CLASS'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 195, in connect
self.connection = self.get_new_connection(conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 178, in get_new_connection
connection = Database.connect(**conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/psycopg2/__init__.py", line 126, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL: password authentication failed for user "awx"
I am not too sure whether this is a big deal or just a red herring. I am just in need of some clarification. If I need to get any more information to aid troubleshooting, please let me know!
According to the issue titled AWX 19.0.0: password authentication failed for user "awx", the problem no longer occurs with minikube v1.20.0 and awx-operator 0.9.0, so the advice is to try those versions for now.
I was also seeing the same error message: password authentication failed for user "awx".
I was running awx-operator version 0.10.0 on a Kubernetes cluster created with kubeadm, not minikube.
I hosted the persistent volume needed by the AWX postgres pod on a worker node that still held stale data previously migrated from a different Kubernetes cluster. I had to clean up that leftover data in the worker node hostPath where the persistent volume was mounted and do a fresh install so that the postgres pod started with fresh data. After that, the password authentication error never came back.

AttributeError: 'AuthorizedSession' object has no attribute 'configure_mtls_channel'

I was orchestrating two Dataflow jobs with Cloud Composer, and it had been working fine for months. Suddenly the two jobs stopped working with the following error message:
in download_blob
File "/usr/local/lib/python3.6/site-packages/google/cloud/storage/client.py", line 399, in get_bucket
retry=retry,
File "/usr/local/lib/python3.6/site-packages/google/cloud/storage/bucket.py", line 1002, in reload
retry=retry,
File "/usr/local/lib/python3.6/site-packages/google/cloud/storage/_helpers.py", line 225, in reload
retry=retry,
File "/usr/local/lib/python3.6/site-packages/google/cloud/storage/_http.py", line 63, in api_request
return call()
File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 286, in retry_wrapped_func
on_error=on_error,
File "/usr/local/lib/python3.6/site-packages/google/api_core/retry.py", line 184, in retry_target
return target()
File "/usr/local/lib/python3.6/site-packages/google/cloud/_http.py", line 479, in api_request
timeout=timeout,
File "/usr/local/lib/python3.6/site-packages/google/cloud/_http.py", line 337, in _make_request
method, url, headers, data, target_object, timeout=timeout
File "/usr/local/lib/python3.6/site-packages/google/cloud/_http.py", line 374, in _do_request
return self.http.request(
File "/usr/local/lib/python3.6/site-packages/google/cloud/_http.py", line 157, in http
return self._client._http
File "/usr/local/lib/python3.6/site-packages/google/cloud/client.py", line 187, in _http
self._http_internal.configure_mtls_channel(self._client_cert_source)
AttributeError: 'AuthorizedSession' object has no attribute 'configure_mtls_channel'
In the jobs I download a file from Google Cloud Storage with the storage client. I assumed the error was caused by a dependency issue. In the Composer environment I installed google-cloud-storage without specifying a version. I tried specifying different versions of the package, but nothing seems to work.
Thanks!
This seems to be related to this issue.
Try pinning google-cloud-core to 1.5.0. Once the jobs are working again, I highly recommend that you drain them (assuming they are streaming jobs).
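Before and after pinning, it may be useful to confirm which versions are actually installed in the environment, since the traceback involves several packages. The following is a small sketch using only the standard library; it assumes Python 3.8 or newer (importlib.metadata is not available on the Python 3.6 shown in the traceback), and the helper name installed_versions and the package list are illustrative.

```python
from importlib import metadata


def installed_versions(package_names):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in package_names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    # Packages that appear in the traceback paths, plus google-auth, which is
    # assumed here to be the package providing AuthorizedSession.
    for name, version in installed_versions(
        ["google-cloud-core", "google-cloud-storage", "google-auth"]
    ).items():
        print(name, version)
```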

ansible k8s module failing to connect to cluster with 503 - appends /version/openshift to non openshift cluster

I'm trying to use Ansible's new k8s module (based on k8s_raw from 2.6) to maintain an AKS Kubernetes cluster.
While I can work with the cluster using kubectl, any command with the k8s module fails with a 503 error.
For example this task:
- name: deploy kured daemonset
  k8s:
    state: present
    context: "{{ cluster_name }}"
    host: "redacted"  # tried specifying this, but it does not help
    kubeconfig: "~/.kube/config"
    src: "aks/utils/kured-ds.yaml"
And failure:
Traceback (most recent call last):
File "/home/alonisser/.ansible/tmp/ansible-tmp-1549320815.98-157731551192134/AnsiballZ_k8s.py", line 113, in <module>
_ansiballz_main()
File "/home/alonisser/.ansible/tmp/ansible-tmp-1549320815.98-157731551192134/AnsiballZ_k8s.py", line 105, in _ansiballz_main
invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
File "/home/alonisser/.ansible/tmp/ansible-tmp-1549320815.98-157731551192134/AnsiballZ_k8s.py", line 48, in invoke_module
imp.load_module('__main__', mod, module, MOD_DESC)
File "/tmp/ansible_k8s_payload_IYmGFG/__main__.py", line 233, in <module>
File "/tmp/ansible_k8s_payload_IYmGFG/__main__.py", line 229, in main
File "/tmp/ansible_k8s_payload_IYmGFG/ansible_k8s_payload.zip/ansible/module_utils/k8s/raw.py", line 131, in execute_module
File "/tmp/ansible_k8s_payload_IYmGFG/ansible_k8s_payload.zip/ansible/module_utils/k8s/common.py", line 172, in get_api_client
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 103, in __init__
self.__init_cache()
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 113, in __init_cache
self.__resources.update(self.parse_api_groups())
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 169, in parse_api_groups
new_group[version] = self.get_resources_for_api_version(prefix, group['name'], version, preferred)
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 181, in get_resources_for_api_version
resources_response = load_json(self.request('GET', path))['resources']
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 363, in request
_return_http_data_only=params.get('_return_http_data_only', True)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 321, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
_request_timeout=_request_timeout)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 342, in request
headers=headers)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 231, in GET
query_params=query_params)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 222, in request
raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (503)
Reason: Service Unavailable
Ansible version: 2.7/8(dev)
What am I missing?
UPDATE:
When I added print statements to the libraries used underneath the module, I found that somewhere in the pipeline /version/openshift is appended to the host name, which of course fails because this is not an OpenShift cluster.
Is there any workaround for this bug?
Answer: it turned out there were two failing requests. The first, to version/openshift, is caught by the client and does not cause the crash. The crash actually happened because of an error with my cluster's metrics server: although it is not really needed by the Kubernetes client used by Ansible, the client still makes a request to it, and that request fails.
So if anyone bumps into this, it might be helpful.
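A failing aggregated API such as the metrics server can usually be spotted from the APIService objects in the cluster. The sketch below assumes you have the JSON produced by kubectl get apiservices -o json and that a healthy service carries a condition of type Available with status "True"; the function name unavailable_apiservices is hypothetical.

```python
import json


def unavailable_apiservices(api_service_list):
    """Return names of APIService items lacking an Available=True condition."""
    failing = []
    for item in api_service_list.get("items", []):
        conditions = item.get("status", {}).get("conditions", [])
        available = any(
            cond.get("type") == "Available" and cond.get("status") == "True"
            for cond in conditions
        )
        if not available:
            failing.append(item["metadata"]["name"])
    return failing


def unavailable_from_json(text):
    """Convenience wrapper for raw JSON text, e.g. saved kubectl output."""
    return unavailable_apiservices(json.loads(text))
```

Any name this returns is a candidate for the request that produces the 503 during API discovery.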

OSError: [Errno 99] Cannot assign requested address

Trying to run jupyter notebook on a CentOS 7. It comes back with:
OSError: [Errno 99] Cannot assign requested address
And the stack trace:
[user#desktop ~]$ jupyter notebook
Traceback (most recent call last):
File "/home/use/anaconda3/bin/jupyter-notebook", line 6, in <module>
sys.exit(notebook.notebookapp.main())
File "/home/user/anaconda3/lib/python3.6/site-packages/jupyter_core/application.py", line 267, in launch_instance
return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
File "/home/user/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 657, in launch_instance
app.initialize(argv)
File "<decorator-gen-7>", line 2, in initialize
File "/home/user/anaconda3/lib/python3.6/site-packages/traitlets/config/application.py", line 87, in catch_config_error
return method(app, *args, **kwargs)
File "/home/user/anaconda3/lib/python3.6/site-packages/notebook/notebookapp.py", line 1296, in initialize
self.init_webapp()
File "/home/user/anaconda3/lib/python3.6/site-packages/notebook/notebookapp.py", line 1120, in init_webapp
self.http_server.listen(port, self.ip)
File "/home/user/anaconda3/lib/python3.6/site-packages/tornado/tcpserver.py", line 142, in listen
sockets = bind_sockets(port, address=address)
File "/home/user/anaconda3/lib/python3.6/site-packages/tornado/netutil.py", line 197, in bind_sockets
sock.bind(sockaddr)
OSError: [Errno 99] Cannot assign requested address
jupyter notebook --ip=127.0.0.1 --port=8888
I simply had to set the ip/port parameters. The issue was likely that the default ip/port it was previously trying to bind was already taken!
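To see whether a given address and port can be bound at all, independently of Jupyter, a short socket test can help. This is a minimal sketch using the standard library; the helper name try_bind is illustrative. It reports the symbolic errno name, so an address that does not belong to any local interface would be expected to show up as EADDRNOTAVAIL (errno 99 on Linux), while a port already in use would show up as EADDRINUSE.

```python
import errno
import socket


def try_bind(host, port):
    """Try to bind a TCP socket; return None on success, else the errno name."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError as exc:
            # Fall back to the raw number when there is no symbolic name.
            return errno.errorcode.get(exc.errno, str(exc.errno))
    return None


if __name__ == "__main__":
    for host in ("127.0.0.1", "localhost"):
        # Port 0 asks the OS for any free port, so only the address is tested.
        print(host, try_bind(host, 0))
```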
In a remote VM, I solved the issue by
$ jupyter-notebook --ip=0.0.0.0 --port=8888
...
Copy/paste this URL into your browser when you connect for the first time,
to login with a token:
http://0.0.0.0:8888/?token=1234567890abcdefghijklmnopqrstuvwxyz (the token is for demo)
...
Note: do not bind to a specific IP here; use 0.0.0.0.
then I can connect to jupyter notebook via:
http://your_vm_ip:8888/?token=1234567890abcdefghijklmnopqrstuvwxyz
(replace 0.0.0.0 with your_vm_ip)
Here is a permanent solution.
Create a configuration file for Jupyter by entering in the terminal: jupyter notebook --generate-config
This command will create the file /home/USER/.jupyter/jupyter_notebook_config.py
Open the file jupyter_notebook_config.py and edit the variable c.NotebookApp.ip as follows:
# c.NotebookApp.ip = 'localhost'
c.NotebookApp.ip = '127.0.0.1'
Enter in the terminal: jupyter notebook
Remark: sometimes you first need to chmod the file to grant permissions.
If you've tried several ports already (using --port XXXX), and none work:
Check that the localhost entry in /etc/hosts is not set to something other than 127.0.0.1.

Celery: all workers stuck, how to diagnose

Periodically all my Celery workers get stuck on something. I cannot figure out what is causing this, because inspect does not work while all the workers are busy.
celery inspect active
Error: No nodes replied within time constraint
Is it possible to get Celery status, like active tasks, even if nodes are doing something (that seems to be causing problems)? Can I somehow spin up a temporary worker just to get inspect output?
What other strategies are there to diagnose this issue?
Celery 4.x. Redis backend.
This turned out to be a deadlock issue with Celery + gevent (evil monkey patch) + Sentry's Raven logger.
https://github.com/getsentry/raven-python/issues/305
To diagnose issues
You can start Celery workers with different queue and node name parameters (-Q, -n) and see when workers hang. Even if some worker groups are hung, the others may still respond to inspect queries.
Celery file logs may reveal the error:
2017-02-27 08:36:34,371 CRITI [celery.worker][DummyThread-6] Unrecoverable error: AttributeError("'NoneType' object has no attribute 'readline'",)
Traceback (most recent call last):
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 370, in start
return self.obj.start()
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 594, in start
c.loop(*c.loop_args())
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/worker/loops.py", line 118, in synloop
connection.drain_events(timeout=2.0)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/connection.py", line 301, in drain_events
return self.transport.drain_events(self.connection, **kwargs)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/virtual/base.py", line 961, in drain_events
get(self._deliver, timeout=timeout)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 359, in get
ret = self.handle_event(fileno, event)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 341, in handle_event
return self.on_readable(fileno), self
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 337, in on_readable
chan.handlers[type]()
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 714, in _brpop_read
**options)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/redis/client.py", line 585, in parse_response
response = connection.read_response()
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/redis/connection.py", line 577, in read_response
response = self._parser.read_response()
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/redis/connection.py", line 238, in read_response
response = self._buffer.readline()
AttributeError: 'NoneType' object has no attribute 'readline'
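The idea of querying worker groups separately, so that hung workers do not hide the ones that still respond, can be sketched as follows. It assumes a configured Celery application object and relies on app.control.inspect(destination=..., timeout=...) and its active() method; the function name active_tasks_by_worker and the worker node names are hypothetical.

```python
def active_tasks_by_worker(app, worker_names, timeout=5.0):
    """Query each worker on its own and record its active tasks.

    A value of None means that worker did not reply within `timeout`,
    which is a hint that it may be hung.
    """
    results = {}
    for name in worker_names:
        # Inspect one node at a time so a silent worker cannot mask the others.
        reply = app.control.inspect(destination=[name], timeout=timeout).active()
        results[name] = reply.get(name) if reply else None
    return results
```

Usage would look like active_tasks_by_worker(app, ["default@host1", "reports@host1"]), where the node names match whatever was passed with -n when the workers were started.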