Reading logs of autoscaled Ray worker nodes - Kubernetes

We're running Ray tasks on Kubernetes with autoscaling. From time to time a worker dies, and we get the following:
WARNING worker.py:1114 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: 874253cc82fde8d4ffffffffffffffffffffffff02000000 Worker ID: 98dcd78c9bebbc2176a14fc230b40afdd574dea608dd573aebeae00e Node ID: d4a7b594aec2970b02dde480f2eb2c6070b7886b333d8f1cb1087758 Worker IP address: 10.40.4.2 Worker port: 10010 Worker PID: 307
We're using a config.yaml similar to the autoscaler/kubernetes/default.yaml.
The problem we're facing is that the nodes where the workers run are autoscaled, so a node is often scaled away before we have time to read its logs (in /tmp/ray).
Any way we can persist the logs of the autoscaled (worker) nodes?
I tried setting the temp dir to point to a filestore that's shared between the workers by calling ray.init(_temp_dir="/our_mounted_filestore/persistent_ray_logs"), but this does not seem to work.
I also tried adding --temp-dir=<path> to the worker_start_ray_commands, but this just gives the following error:
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1808, in main
return cli()
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 580, in start
node = ray.node.Node(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/node.py", line 166, in __init__
self._init_temp(redis_client)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/node.py", line 278, in _init_temp
try_to_create_directory(self._temp_dir)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/utils.py", line 797, in try_to_create_directory
os.makedirs(directory_path, exist_ok=True)
File "/home/ray/anaconda3/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/home/ray/anaconda3/lib/python3.8/os.py", line 223, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/filestore/ray_temp'
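In the meantime, one workaround we're considering is syncing Ray's default log directory out to the filestore from each worker pod. A minimal sketch, assuming /tmp/ray/session_latest/logs (Ray's default per-session log path) and our /our_mounted_filestore mount:

```python
import os
import shutil
import socket
import time

SRC = "/tmp/ray/session_latest/logs"  # Ray's default per-session log dir
DST = f"/our_mounted_filestore/persistent_ray_logs/{socket.gethostname()}"

# Copy logs out every 30s so they survive the node being scaled away.
while True:
    if os.path.isdir(SRC):
        # dirs_exist_ok needs Python 3.8+, which matches the image
        # in the traceback above.
        shutil.copytree(SRC, DST, dirs_exist_ok=True)
    time.sleep(30)
```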

Related

Not able to run OpenDistro for Elastic in Kubernetes as non-root - supervisord error

I am setting up OpenDistro for Elastic in Kubernetes. The cluster has pod security in place that will not allow privileged pods. When I start the cluster, the logs indicate a permission issue with /usr/share/supervisor/supervisord.log.
I have a securityContext set on the deployment:
securityContext:
  runAsUser: 1000
  fsGroup: 1000
The error message from kubectl logs es-master-0 is:
/usr/share/elasticsearch/config/elasticsearch.yml seems to be already configured for Security. Quit.
Traceback (most recent call last):
File "/usr/bin/supervisord", line 9, in <module>
load_entry_point('supervisor==4.0.2', 'console_scripts', 'supervisord')()
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/supervisord.py", line 358, in main
go(options)
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/supervisord.py", line 368, in go
d.main()
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/supervisord.py", line 70, in main
self.options.make_logger()
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/options.py", line 1472, in make_logger
backups=self.logfile_backups,
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/loggers.py", line 417, in handle_file
handler = RotatingFileHandler(filename, 'a', maxbytes, backups)
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/loggers.py", line 212, in __init__
FileHandler.__init__(self, filename, mode)
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/loggers.py", line 159, in __init__
self.stream = open(filename, mode)
IOError: [Errno 13] Permission denied: '/usr/share/supervisor/supervisord.log'

Ansible k8s module failing to connect to cluster with 503 - appends /version/openshift to non-OpenShift cluster

I'm trying to use Ansible's new k8s module (based on k8s_raw from 2.6) to maintain an AKS k8s cluster.
While I can work with the cluster via kubectl, any command using the k8s module fails with a 503 error.
For example this task:
- name: deploy kured daemonset
  k8s:
    state: present
    context: "{{ cluster_name }}"
    host: "redacted"  # tried specifying this, but does not help
    kubeconfig: "~/.kube/config"
    src: "aks/utils/kured-ds.yaml"
And failure:
Traceback (most recent call last):
File "/home/alonisser/.ansible/tmp/ansible-tmp-1549320815.98-157731551192134/AnsiballZ_k8s.py", line 113, in <module>
_ansiballz_main()
File "/home/alonisser/.ansible/tmp/ansible-tmp-1549320815.98-157731551192134/AnsiballZ_k8s.py", line 105, in _ansiballz_main
invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
File "/home/alonisser/.ansible/tmp/ansible-tmp-1549320815.98-157731551192134/AnsiballZ_k8s.py", line 48, in invoke_module
imp.load_module('__main__', mod, module, MOD_DESC)
File "/tmp/ansible_k8s_payload_IYmGFG/__main__.py", line 233, in <module>
File "/tmp/ansible_k8s_payload_IYmGFG/__main__.py", line 229, in main
File "/tmp/ansible_k8s_payload_IYmGFG/ansible_k8s_payload.zip/ansible/module_utils/k8s/raw.py", line 131, in execute_module
File "/tmp/ansible_k8s_payload_IYmGFG/ansible_k8s_payload.zip/ansible/module_utils/k8s/common.py", line 172, in get_api_client
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 103, in __init__
self.__init_cache()
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 113, in __init_cache
self.__resources.update(self.parse_api_groups())
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 169, in parse_api_groups
new_group[version] = self.get_resources_for_api_version(prefix, group['name'], version, preferred)
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 181, in get_resources_for_api_version
resources_response = load_json(self.request('GET', path))['resources']
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 363, in request
_return_http_data_only=params.get('_return_http_data_only', True)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 321, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
_request_timeout=_request_timeout)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 342, in request
headers=headers)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 231, in GET
query_params=query_params)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 222, in request
raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (503)
Reason: Service Unavailable
Ansible version: 2.7/8(dev)
What am I missing?
UPDATE:
When I added print statements to the libraries used by the module, I found that somewhere in the pipeline /version/openshift is appended to the host name, which of course fails, because it's a non-OpenShift cluster.
Any workaround for this bug?
Answer: it turned out there were two failing requests. The first, to /version/openshift, is caught by the client and doesn't cause the crash. The crash actually happened because of an error with my cluster's metrics server, which, while not really needed by the k8s client used by Ansible, still fails a request made to it.
So if anyone bumps into this, it might be helpful.
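If you want to check for the same failure mode, here is a minimal sketch using the kubernetes Python client (the context name is hypothetical) that lists aggregated APIs whose backing service is unavailable, which is what can break API discovery with a 503:

```python
from kubernetes import client, config

config.load_kube_config(context="my-aks-context")  # hypothetical context name

# An aggregated API (e.g. metrics.k8s.io) whose backing service is down
# can make full API discovery fail even on an otherwise healthy cluster.
for svc in client.ApiregistrationV1Api().list_api_service().items:
    conditions = {c.type: c.status for c in (svc.status.conditions or [])}
    if conditions.get("Available") != "True":
        print("unavailable aggregated API:", svc.metadata.name)
```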

ERROR: gcloud crashed (CannotConnectToMetadataServerException): <urlopen error [Errno -2] Name does not resolve>

I am having issues configuring my container to point to my Kubernetes cluster with the command gcloud container clusters get-credentials. I get the following error.
ERROR: gcloud crashed (CannotConnectToMetadataServerException): <urlopen error [Errno -2] Name does not resolve>
If you would like to report this issue, please run the following command:
gcloud feedback
To check gcloud for common problems, please run the following command:
gcloud info --run-diagnostics
Enhanced logging:
CannotConnectToMetadataServerException: <urlopen error [Errno -2] Name does not resolve>
2018-04-10 18:00:42,625 ERROR ___FILE_ONLY___ BEGIN CRASH STACKTRACE
Traceback (most recent call last):
File "/google-cloud-sdk/lib/googlecloudsdk/gcloud_main.py", line 147, in main
gcloud_cli.Execute()
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 818, in Execute
self._HandleAllErrors(exc, command_path_string, specified_arg_names)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 856, in _HandleAllErrors
exceptions.HandleError(exc, command_path_string, self.__known_error_handler)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/exceptions.py", line 526, in HandleError
core_exceptions.reraise(exc)
File "/google-cloud-sdk/lib/googlecloudsdk/core/exceptions.py", line 111, in reraise
six.reraise(type(exc_value), exc_value, tb)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/cli.py", line 792, in Execute
resources = calliope_command.Run(cli=self, args=args)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 751, in Run
self._parent_group.RunGroupFilter(tool_context, args)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 692, in RunGroupFilter
self._parent_group.RunGroupFilter(context, args)
File "/google-cloud-sdk/lib/googlecloudsdk/calliope/backend.py", line 693, in RunGroupFilter
self._common_type().Filter(context, args)
File "/google-cloud-sdk/lib/surface/container/__init__.py", line 71, in Filter
context['api_adapter'] = api_adapter.NewAPIAdapter('v1')
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/container/api_adapter.py", line 147, in NewAPIAdapter
return NewV1APIAdapter()
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/container/api_adapter.py", line 151, in NewV1APIAdapter
return InitAPIAdapter('v1', V1Adapter)
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/container/api_adapter.py", line 172, in InitAPIAdapter
api_client = core_apis.GetClientInstance('container', api_version)
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/apis.py", line 297, in GetClientInstance
api_name, api_version, no_http, _CheckResponse, enable_resource_quota)
File "/google-cloud-sdk/lib/googlecloudsdk/api_lib/util/apis_internal.py", line 153, in _GetClientInstance
http_client = http.Http(enable_resource_quota=enable_resource_quota)
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/http.py", line 64, in Http
creds = store.LoadIfEnabled()
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 281, in LoadIfEnabled
return Load()
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 348, in Load
cred = STATIC_CREDENTIAL_PROVIDERS.GetCredentials(account)
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 162, in GetCredentials
cred = provider.GetCredentials(account)
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/store.py", line 214, in GetCredentials
if account in c_gce.Metadata().Accounts():
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 127, in Accounts
gce_read.GOOGLE_GCE_METADATA_ACCOUNTS_URI + '/')
File "/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 289, in DecoratedFunction
exceptions.reraise(to_reraise[1], tb=to_reraise[2])
File "/google-cloud-sdk/lib/googlecloudsdk/core/exceptions.py", line 111, in reraise
six.reraise(type(exc_value), exc_value, tb)
File "/google-cloud-sdk/lib/googlecloudsdk/core/util/retry.py", line 159, in TryFunc
return func(*args, **kwargs), None
File "/google-cloud-sdk/lib/googlecloudsdk/core/credentials/gce.py", line 52, in _ReadNoProxyWithCleanFailures
raise CannotConnectToMetadataServerException(e)
CannotConnectToMetadataServerException: <urlopen error [Errno -2] Name does not resolve>
To give some color: we kick off a build on CircleCI every time we push code to GitHub. We have a container, internally called belushi, that we use to run our entire infrastructure; it has gcloud installed. CircleCI's infrastructure is on AWS, and when they spin up the belushi container we run gcloud get-credentials to point it at our project in Google Cloud, which has a Kubernetes cluster configured; we run all of our functional CI testing in that cluster. So we need that belushi pod configured against the CI project to move forward.
The weird thing is that the belushi:latest image always configures properly. However, when we are working on belushi we often branch and build a new image to run tests. For example, I'll create a branch in belushi that gets a new hash of 1234567, so we spin up the belushi:1234567 image; the first thing we do is configure it to point to the CI project, and that's where we hit the metadata resolve issue.
I feel like it's DNS-related, or maybe the metadata server isn't allowing the new belushi image to communicate with it right away. After I retry a bunch of times it will eventually configure properly (without any code changes). So I wonder whether the metadata server is rejecting it for some reason, or whether something on the AWS side isn't resolving the name.
The first thing you can do to troubleshoot, when you get this error, is to attempt:
curl -H "Metadata-Flavor:Google" http://metadata.google.internal/computeMetadata/v1/instance/service-accounts/
The metadata server should respond straight away with your service account metadata.
Is your container behind any kind of HTTP proxy?
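Equivalently, since [Errno -2] is a DNS failure, you can check from inside the container whether the metadata hostname resolves at all, without involving gcloud; a quick sketch:

```python
import socket

# metadata.google.internal only resolves on GCP (or where DNS is set up
# to point at 169.254.169.254); on AWS this is expected to fail.
try:
    print(socket.getaddrinfo("metadata.google.internal", 80))
except socket.gaierror as exc:
    print("DNS lookup failed:", exc)
```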

Celery: all workers stuck, how to diagnose

Periodically all my Celery workers get stuck on something. I cannot figure out what is causing this, since inspect doesn't work while all the workers are busy.
celery inspect active
Error: No nodes replied within time constraint
Is it possible to get Celery status, like active tasks, even if nodes are doing something (that seems to be causing problems)? Can I somehow spin up a temporary worker just to get inspect output?
What other strategies are there to diagnose this issue?
Celery 4.x. Redis backend.
This turned out to be a deadlock issue with Celery + gevent (evil monkey patch) + Sentry's Raven logger.
https://github.com/getsentry/raven-python/issues/305
To diagnose issues:
You can start Celery workers with different queue (-Q) and node-name (-n) parameters and see which workers hang. Even if some worker groups are hung, the others may still respond to inspect queries, as in the sketch below.
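For the inspect side, a minimal sketch, assuming your Celery app is importable as proj.celery_app (a hypothetical module name), that queries workers with a longer-than-default timeout:

```python
from proj import celery_app  # hypothetical: wherever your Celery app lives

# A longer timeout gives busy-but-alive workers a chance to reply;
# the default is only about a second.
insp = celery_app.control.inspect(timeout=10)
print(insp.active())    # tasks currently executing, per worker
print(insp.reserved())  # tasks prefetched but not yet started
print(insp.stats())     # pool and broker statistics
```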
Celery file logs may also reveal the error:
2017-02-27 08:36:34,371 CRITI [celery.worker][DummyThread-6] Unrecoverable error: AttributeError("'NoneType' object has no attribute 'readline'",)
Traceback (most recent call last):
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/worker/worker.py", line 203, in start
self.blueprint.start(self)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 370, in start
return self.obj.start()
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 318, in start
blueprint.start(self)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/bootsteps.py", line 119, in start
step.start(parent)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/worker/consumer/consumer.py", line 594, in start
c.loop(*c.loop_args())
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/celery/worker/loops.py", line 118, in synloop
connection.drain_events(timeout=2.0)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/connection.py", line 301, in drain_events
return self.transport.drain_events(self.connection, **kwargs)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/virtual/base.py", line 961, in drain_events
get(self._deliver, timeout=timeout)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 359, in get
ret = self.handle_event(fileno, event)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 341, in handle_event
return self.on_readable(fileno), self
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 337, in on_readable
chan.handlers[type]()
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/kombu/transport/redis.py", line 714, in _brpop_read
**options)
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/redis/client.py", line 585, in parse_response
response = connection.read_response()
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/redis/connection.py", line 577, in read_response
response = self._parser.read_response()
File "/srv/pyramid/xxx/venv/lib/python3.5/site-packages/redis/connection.py", line 238, in read_response
response = self._buffer.readline()
AttributeError: 'NoneType' object has no attribute 'readline'

Error while retrieving result in pymongo

I have a Python application which creates a number of threads for a job. Each thread connects to MongoDB and retrieves data. The number of allowed connections to MongoDB is 200, which I'm enforcing with a semaphore, and each thread closes its MongoDB connection once its query job is done. But while executing this application I get the same error in every thread. The error is:
Traceback (most recent call last):
File "C:\Python34\lib\threading.py", line 921, in _bootstrap_inner
self.run()
File "C:\Python34\lib\threading.py", line 869, in run
self._target(*self._args, **self._kwargs)
File "C:/path/pytest/under_construction/testAlgo.py", line 95, in sample_thread
status=monObj.process_status(list_value1,list_value2,5,120,120)
File "C:\path\pytest\under_construction\mongo_lib.py", line 153, in process_status
result=self.mongo_result('Submission','find',q={})
File "C:\path\pytest\under_construction\mongo_lib.py", line 53, in mongo_result
result=list(_query[query_type.lower()](query_string[keys]))
File "C:\Python34\lib\site-packages\pymongo\cursor.py", line 1076, in __next__
if len(self.__data) or self._refresh():
File "C:\Python34\lib\site-packages\pymongo\cursor.py", line 1037, in _refresh
limit, self.__id))
File "C:\Python34\lib\site-packages\pymongo\cursor.py", line 933, in __send_message
res = client._send_message_with_response(message, **kwargs)
File "C:\Python34\lib\site-packages\pymongo\mongo_client.py", line 1205, in _send_message_with_response
response = self.__send_and_receive(message, sock_info)
File "C:\Python34\lib\site-packages\pymongo\mongo_client.py", line 1182, in __send_and_receive
return self.__receive_message_on_socket(1, request_id, sock_info)
File "C:\Python34\lib\site-packages\pymongo\mongo_client.py", line 1174, in __receive_message_on_socket
return self.__receive_data_on_socket(length - 16, sock_info)
File "C:\Python34\lib\site-packages\pymongo\mongo_client.py", line 1153, in __receive_data_on_socket
chunk = sock_info.sock.recv(length)
MemoryError
Code for creating the mongo connection:
client = MongoClient(mc_name, port)
I was thinking: is this error due to the results from all threads accumulating at one port of the machine running my application?
MongoClient is a thread-safe connection pool, so you should create a single instance shared by all the worker threads rather than having each thread create its own.
The connection pool size defaults to 100, but if you want to make it larger you can use the maxPoolSize parameter (e.g. maxPoolSize=200).
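A minimal sketch of that shared-client pattern (hedged: the host/port placeholders stand in for the question's mc_name/port, the mydb database name is hypothetical, and Submission is the collection from the traceback):

```python
from pymongo import MongoClient

MC_NAME, PORT = "mongo-host", 27017  # placeholders for the question's mc_name/port

# One client per process; MongoClient maintains its own thread-safe pool,
# so worker threads share it instead of opening their own connections.
client = MongoClient(MC_NAME, PORT, maxPoolSize=200)

def sample_thread():
    # Iterating the cursor streams documents in server-side batches;
    # materializing everything with list(...) is what can exhaust memory.
    for doc in client.mydb.Submission.find({}):
        pass  # process each document here
```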