Not able to run OpenDistro for Elasticsearch in Kubernetes as non-root - supervisord error

I am setting up OpenDistro for Elasticsearch in Kubernetes. The cluster has pod security in place that will not allow privileged pods. When I start the cluster, the logs indicate a permission issue with /usr/share/supervisor/supervisord.log.
I have a securityContext set on the deployment
```
securityContext:
  runAsUser: 1000
  fsGroup: 1000
```
The error message from kubectl logs es-master-0 is
```
/usr/share/elasticsearch/config/elasticsearch.yml seems to be already configured for Security. Quit.
Traceback (most recent call last):
File "/usr/bin/supervisord", line 9, in <module>
load_entry_point('supervisor==4.0.2', 'console_scripts', 'supervisord')()
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/supervisord.py", line 358, in main
go(options)
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/supervisord.py", line 368, in go
d.main()
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/supervisord.py", line 70, in main
self.options.make_logger()
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/options.py", line 1472, in make_logger
backups=self.logfile_backups,
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/loggers.py", line 417, in handle_file
handler = RotatingFileHandler(filename, 'a', maxbytes, backups)
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/loggers.py", line 212, in __init__
FileHandler.__init__(self, filename, mode)
File "/usr/lib/python2.7/site-packages/supervisor-4.0.2-py2.7.egg/supervisor/loggers.py", line 159, in __init__
self.stream = open(filename, mode)
IOError: [Errno 13] Permission denied: '/usr/share/supervisor/supervisord.log'
```
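The traceback shows supervisord (started by the image) trying to open its log file under /usr/share/supervisor, a directory owned by root in the image, so UID 1000 cannot write there. A minimal sketch of one possible workaround, assuming the pod template can be patched and that the container is named elasticsearch (adjust to your chart), is to mount a writable emptyDir over that path:

```
# Sketch only: give supervisord a writable log directory without running as root.
# The container name and the /usr/share/supervisor path (taken from the error
# above) are assumptions; adjust them to match your image/chart.
spec:
  template:
    spec:
      securityContext:
        runAsUser: 1000
        fsGroup: 1000
      containers:
        - name: elasticsearch
          volumeMounts:
            - name: supervisor-writable
              mountPath: /usr/share/supervisor
      volumes:
        - name: supervisor-writable
          emptyDir: {}
```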

Related

Airflow webserver No module named 'airflow.www.fab_security'

I am running Airflow 1.10.12 on Ubuntu. Airflow was running fine using the LocalExecutor and MySQL.
In order to conduct some tests, I have moved to the CeleryExecutor with RabbitMQ.
Based on a tutorial, here is my config file:
[core]
executor = CeleryExecutor
[celery]
broker_url = pyamqp://rabbitmq:rabbitmq@localhost/
result_backend = db+mysql+pymysql://airflow:airflow@localhost:3306/airflow_db
But when I run:
airflow webserver
The following error is thrown:
Traceback (most recent call last):
File "/usr/local/bin/airflow", line 37, in <module>
args.func(args)
File "/usr/local/lib/python3.8/dist-packages/airflow/utils/cli.py", line 76, in wrapper
return f(*args, **kwargs)
File "/usr/local/lib/python3.8/dist-packages/airflow/bin/cli.py", line 1076, in webserver
app = cached_app_rbac(None) if settings.RBAC else cached_app(None)
File "/usr/local/lib/python3.8/dist-packages/airflow/www_rbac/app.py", line 300, in cached_app
app, _ = create_app(config, session, testing)
File "/usr/local/lib/python3.8/dist-packages/airflow/www_rbac/app.py", line 65, in create_app
app.config.from_pyfile(settings.WEBSERVER_CONFIG, silent=True)
File "/usr/local/lib/python3.8/dist-packages/flask/config.py", line 132, in from_pyfile
exec(compile(config_file.read(), filename, "exec"), d.__dict__)
File "/home/helia/airflow/webserver_config.py", line 21, in <module>
from airflow.www.fab_security.manager import AUTH_DB
ModuleNotFoundError: No module named 'airflow.www.fab_security'

Reading logs of autoscaled Ray worker nodes

We're running Ray tasks on Kubernetes with autoscaling. From time to time, a worker dies and we get the following:
WARNING worker.py:1114 -- A worker died or was killed while executing a task by an unexpected system error. To troubleshoot the problem, check the logs for the dead worker. Task ID: 874253cc82fde8d4ffffffffffffffffffffffff02000000 Worker ID: 98dcd78c9bebbc2176a14fc230b40afdd574dea608dd573aebeae00e Node ID: d4a7b594aec2970b02dde480f2eb2c6070b7886b333d8f1cb1087758 Worker IP address: 10.40.4.2 Worker port: 10010 Worker PID: 307
We're using a config.yaml similar to the autoscaler/kubernetes/default.yaml.
The problem we're facing is that the node where the worker runs is autoscaled, so it is often scaled away before we have time to read the logs (in /tmp/ray).
Any way we can persist the logs of the autoscaled (worker) nodes?
I tried setting the temp dir to point to a filestore that's shared between the workers, by calling ray.init(_temp_dir="/our_mounted_filestore/persistent_ray_logs") but this does not seem to work.
I also tried adding --temp-dir=<path> in the worker_start_ray_commands, but this just gives the following error:
Traceback (most recent call last):
File "/home/ray/anaconda3/bin/ray", line 8, in <module>
sys.exit(main())
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 1808, in main
return cli()
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1137, in __call__
return self.main(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1062, in main
rv = self.invoke(ctx)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1668, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 1404, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/home/ray/anaconda3/lib/python3.8/site-packages/click/core.py", line 763, in invoke
return __callback(*args, **kwargs)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/scripts/scripts.py", line 580, in start
node = ray.node.Node(
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/node.py", line 166, in __init__
self._init_temp(redis_client)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/node.py", line 278, in _init_temp
try_to_create_directory(self._temp_dir)
File "/home/ray/anaconda3/lib/python3.8/site-packages/ray/_private/utils.py", line 797, in try_to_create_directory
os.makedirs(directory_path, exist_ok=True)
File "/home/ray/anaconda3/lib/python3.8/os.py", line 213, in makedirs
makedirs(head, exist_ok=exist_ok)
File "/home/ray/anaconda3/lib/python3.8/os.py", line 223, in makedirs
mkdir(name, mode)
PermissionError: [Errno 13] Permission denied: '/filestore/ray_temp'
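One direction that may help, sketched under assumptions (not verified against your Ray version): the shared filestore has to be mounted into the worker pods and be writable by the container user before --temp-dir can point at it. Since the autoscaler's worker definition is a plain Kubernetes pod spec, a fragment like the following could be merged in; the container name, claim name, mount path, and fsGroup value are all assumptions:

```
# Hypothetical pod-spec fragment for the autoscaler's worker pod template.
# Assumes a PVC "ray-shared-logs" backed by the shared filestore and that the
# container user can write once fsGroup is applied; names/IDs may differ.
spec:
  securityContext:
    fsGroup: 1000            # assumed UID/GID of the ray user in the image
  containers:
    - name: ray-node         # container name from the default config (assumed)
      volumeMounts:
        - name: ray-shared-logs
          mountPath: /filestore/ray_temp
  volumes:
    - name: ray-shared-logs
      persistentVolumeClaim:
        claimName: ray-shared-logs
```

With a writable mount in place, passing --temp-dir=/filestore/ray_temp in worker_start_ray_commands (as attempted above) at least has a writable target to create its directories in.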

AWX-Web error when installing awx-operator on Kubernetes

I am currently installing the awx-operator and have run into an issue while trying to expose the application to the outside world.
The awx-web container within the awx-5b58db49c-9r4hp pod is reporting an error. When I run kubectl logs pod/awx-5b58db49c-9r4hp -c awx-web, I get the following output:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 81, in _ctit_db_wrapper
yield
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 411, in __getattr__
value = self._get_local(name)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/awx/conf/settings.py", line 355, in _get_local
setting = Setting.objects.filter(key=name, user__isnull=True).order_by('pk').first()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 653, in first
for obj in (self if self.ordered else self.order_by('pk'))[:1]:
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 274, in __iter__
self._fetch_all()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 1242, in _fetch_all
self._result_cache = list(self._iterable_class(self))
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/query.py", line 55, in __iter__
results = compiler.execute_sql(chunked_fetch=self.chunked_fetch, chunk_size=self.chunk_size)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/models/sql/compiler.py", line 1140, in execute_sql
cursor = self.connection.cursor()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 256, in cursor
return self._cursor()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 233, in _cursor
self.ensure_connection()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/utils.py", line 89, in __exit__
raise dj_exc_value.with_traceback(traceback) from exc_value
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 195, in connect
self.connection = self.get_new_connection(conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 178, in get_new_connection
connection = Database.connect(**conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/psycopg2/__init__.py", line 126, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
django.db.utils.OperationalError: FATAL: password authentication failed for user "awx"
2021-05-12 14:28:54,478 ERROR [-] awx.conf.settings Database settings are not available, using defaults.
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/oauth2_provider/settings.py", line 138, in __getattr__
val = self.user_settings[attr]
KeyError: 'OAUTH2_VALIDATOR_CLASS'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 217, in ensure_connection
self.connect()
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/base/base.py", line 195, in connect
self.connection = self.get_new_connection(conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/django/db/backends/postgresql/base.py", line 178, in get_new_connection
connection = Database.connect(**conn_params)
File "/var/lib/awx/venv/awx/lib64/python3.8/site-packages/psycopg2/__init__.py", line 126, in connect
conn = _connect(dsn, connection_factory=connection_factory, **kwasync)
psycopg2.OperationalError: FATAL: password authentication failed for user "awx"
I am not too sure whether this is a big deal or just a red herring. I am just in need of some clarification. If I need to provide any more information to aid troubleshooting, please let me know!
As per AWX 19.0.0: password authentication failed for user "awx", the issue no longer occurs with minikube v1.20.0 and awx-operator 0.9.0, so the advice for now is to try those versions.
I was also observing the same error message => password authentication failed for user "awx".
I was running awx-operator version 0.10.0 on a Kubernetes cluster created with kubeadm, not minikube.
The postgres persistent volume needed by the AWX postgres pod was hosted on my worker node, and it already contained stale data migrated from a different Kubernetes cluster. I had to clean up that leftover data on the worker node's hostPath where I mounted the persistent volume and do a fresh install with fresh data from the postgres pod, and the password authentication error never came back.
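To illustrate that last point: when the postgres data directory lives on a hostPath-backed PersistentVolume, stale PGDATA from an earlier install keeps its old credentials, so the password generated for the new AWX deployment will not match. A rough, hypothetical sketch of such a PV (name, size, and path are made up for illustration):

```
# Hypothetical hostPath PV for the AWX postgres pod. If /data/awx-postgres
# still holds data from a previous install, postgres keeps the old password
# and the new AWX credentials fail; clear the path so it re-initializes.
apiVersion: v1
kind: PersistentVolume
metadata:
  name: awx-postgres-pv
spec:
  capacity:
    storage: 8Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain
  hostPath:
    path: /data/awx-postgres
```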

Running supervisord on a read-only filesystem

I'm trying to run supervisord on a read-only filesystem.
I have tried to stop supervisord from writing log and pid files using the following configuration:
[supervisord]
nodaemon=true
user=root
logfile=/dev/stdout
logfile_maxbytes=0
pidfile=/dev/null
However, when I attempt to start, I still receive the following error:
Traceback (most recent call last):
File "/usr/bin/supervisord", line 11, in <module>
load_entry_point('supervisor==3.3.4', 'console_scripts', 'supervisord')()
File "/usr/lib/python2.7/site-packages/supervisor/supervisord.py", line 349, in main
options = ServerOptions()
File "/usr/lib/python2.7/site-packages/supervisor/options.py", line 428, in __init__
existing_directory, default=tempfile.gettempdir())
File "/usr/lib/python2.7/tempfile.py", line 275, in gettempdir
tempdir = _get_default_tempdir()
File "/usr/lib/python2.7/tempfile.py", line 217, in _get_default_tempdir
("No usable temporary directory found in %s" % dirlist))
IOError: [Errno 2] No usable temporary directory found in ['/tmp', '/var/tmp', '/usr/tmp', '/']
Is it possible to start/run supervisord on a read only filesystem?
Set the TMPDIR environment variable (which Python's tempfile module consults, as seen in the traceback) to a read-write volume mount.
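In a Kubernetes pod, for example, that can be done by mounting an emptyDir and pointing TMPDIR at it; a minimal sketch (container and volume names are arbitrary):

```
# Sketch: keep the root filesystem read-only but give supervisord a writable
# temp directory via an emptyDir, exposed through TMPDIR.
spec:
  containers:
    - name: app
      securityContext:
        readOnlyRootFilesystem: true
      env:
        - name: TMPDIR
          value: /writable-tmp
      volumeMounts:
        - name: writable-tmp
          mountPath: /writable-tmp
  volumes:
    - name: writable-tmp
      emptyDir: {}
```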

Ansible k8s module failing to connect to cluster with 503 - appends /version/openshift to non-OpenShift cluster

I'm trying to use Ansible's new k8s module (based on k8s_raw from 2.6) to maintain an AKS Kubernetes cluster.
While I can work with the cluster using kubectl, any task using the k8s module fails with a 503 error.
For example this task:
```
- name: deploy kured daemonset
  k8s:
    state: present
    context: "{{ cluster_name }}"
    host: "redacted"  # tried specifying this, but it does not help
    kubeconfig: "~/.kube/config"
    src: "aks/utils/kured-ds.yaml"
```
And failure:
Traceback (most recent call last):
File "/home/alonisser/.ansible/tmp/ansible-tmp-1549320815.98-157731551192134/AnsiballZ_k8s.py", line 113, in <module>
_ansiballz_main()
File "/home/alonisser/.ansible/tmp/ansible-tmp-1549320815.98-157731551192134/AnsiballZ_k8s.py", line 105, in _ansiballz_main
invoke_module(zipped_mod, temp_path, ANSIBALLZ_PARAMS)
File "/home/alonisser/.ansible/tmp/ansible-tmp-1549320815.98-157731551192134/AnsiballZ_k8s.py", line 48, in invoke_module
imp.load_module('__main__', mod, module, MOD_DESC)
File "/tmp/ansible_k8s_payload_IYmGFG/__main__.py", line 233, in <module>
File "/tmp/ansible_k8s_payload_IYmGFG/__main__.py", line 229, in main
File "/tmp/ansible_k8s_payload_IYmGFG/ansible_k8s_payload.zip/ansible/module_utils/k8s/raw.py", line 131, in execute_module
File "/tmp/ansible_k8s_payload_IYmGFG/ansible_k8s_payload.zip/ansible/module_utils/k8s/common.py", line 172, in get_api_client
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 103, in __init__
self.__init_cache()
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 113, in __init_cache
self.__resources.update(self.parse_api_groups())
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 169, in parse_api_groups
new_group[version] = self.get_resources_for_api_version(prefix, group['name'], version, preferred)
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 181, in get_resources_for_api_version
resources_response = load_json(self.request('GET', path))['resources']
File "/home/alonisser/.local/lib/python2.7/site-packages/openshift/dynamic/client.py", line 363, in request
_return_http_data_only=params.get('_return_http_data_only', True)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 321, in call_api
_return_http_data_only, collection_formats, _preload_content, _request_timeout)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 155, in __call_api
_request_timeout=_request_timeout)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/api_client.py", line 342, in request
headers=headers)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 231, in GET
query_params=query_params)
File "/home/alonisser/.local/lib/python2.7/site-packages/kubernetes/client/rest.py", line 222, in request
raise ApiException(http_resp=r)
kubernetes.client.rest.ApiException: (503)
Reason: Service Unavailable
Ansible version: 2.7/8(dev)
What am I missing?
UPDATE:
When I added print statements to the libraries used by the module, I found that somewhere in the pipeline /version/openshift is appended to the host name, which of course fails because it's a non-OpenShift cluster.
Any workaround for this bug?
Answer: it turned out there were two failing requests. The first, to /version/openshift, is caught by the client and doesn't cause the crash. The crash actually happened because of an error with my cluster's metrics server, which, while not really needed by the k8s client used by Ansible, still fails the request made to it.
So if anyone bumps into this, it might be helpful.
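For reference, the aggregated API that typically produces this kind of 503 during discovery is the metrics API, registered through an APIService object roughly like the one below (the standard metrics-server registration is shown only to indicate where the failing group lives; your cluster's manifest may differ):

```
# Standard-style registration of the aggregated metrics API. If the backing
# metrics-server service is unhealthy, discovery of this group returns 503,
# which is what tripped up the dynamic client used by the k8s module here.
apiVersion: apiregistration.k8s.io/v1
kind: APIService
metadata:
  name: v1beta1.metrics.k8s.io
spec:
  group: metrics.k8s.io
  version: v1beta1
  service:
    name: metrics-server
    namespace: kube-system
  insecureSkipTLSVerify: true
  groupPriorityMinimum: 100
  versionPriority: 100
```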