Kubernetes Executor does not spawn Kubernetes Pod Operator

I can't figure out where to look. I am running Airflow on GKE. It was running fine, but recently tasks started failing: the DAG starts and then its tasks fail, even though they were running okay a week ago.
It seems like something changed in the cluster, but based on the logs I can't figure out what.
My KubernetesExecutor stopped spawning KubernetesPodOperator pods, and there are no logs or errors.
If I run the template I use for the operator directly (kubectl apply -f), it runs successfully.
Airflow 2.1.2
Kubectl
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:40:09Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.8-gke.900", GitCommit:"28ab8501be88ea42e897ca8514d7cd0b436253d9", GitTreeState:"clean", BuildDate:"2021-06-30T09:23:36Z", GoVersion:"go1.15.13b5", Compiler:"gc", Platform:"linux/amd64"}
Executor Template
apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  restartPolicy: Never
  serviceAccountName: airflow  # this account has rights to create pods
  automountServiceAccountToken: true
  volumes:
    - name: dags
      emptyDir: {}
    - name: logs
      emptyDir: {}
    - configMap:
        name: airflow-git-sync-configmap
      name: airflow-git-sync-configmap
  initContainers:
    - name: git-sync-clone
      securityContext:
        runAsUser: 65533
        runAsGroup: 65533
      image: k8s.gcr.io/git-sync/git-sync:v3.3.1
      imagePullPolicy: Always
      volumeMounts:
        - mountPath: /tmp/git
          name: dags
      resources:
        ...
      args: ["--one-time"]
      envFrom:
        - configMapRef:
            name: airflow-git-sync-configmap
        - secretRef:
            name: airflow-git-sync-secret
  containers:
    - name: base
      image: <artifactory_url>/airflow:latest
      volumeMounts:
        - name: dags
          mountPath: /opt/airflow/dags
        - name: logs
          mountPath: /opt/airflow/logs
      imagePullPolicy: Always
Pod template
apiVersion: v1
kind: Pod
metadata:
  ....
spec:
  serviceAccountName: airflow
  automountServiceAccountToken: true
  volumes:
    - name: sql
      emptyDir: {}
  initContainers:
    - name: git-sync
      image: k8s.gcr.io/git-sync/git-sync:v3.3.1
      imagePullPolicy: Always
      args: ["--one-time"]
      volumeMounts:
        - name: sql
          mountPath: /tmp/git/
      resources:
        requests:
          memory: 300Mi
          cpu: 500m
        limits:
          memory: 600Mi
          cpu: 1000m
      envFrom:
        - configMapRef:
            name: git-sync-configmap
        - secretRef:
            name: airflow-git-sync-secret
  containers:
    - name: base
      imagePullPolicy: Always
      image: <artifactory_url>/clickhouse-client-gcloud:20.6.4.44
      volumeMounts:
        - name: sql
          mountPath: /opt/sql
      resources:
        ....
      env:
        - name: GS_SERVICE_ACCOUNT
          valueFrom:
            secretKeyRef:
              name: gs-service-account
              key: service_account.json
        - name: DB_CREDENTIALS
          valueFrom:
            secretKeyRef:
              name: estimation-db-secret
              key: db_cred.json
DAG code
from textwrap import dedent

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

TEMPLATE_PATH = "/opt/airflow/dags/airflow-dags.git/pod_templates"

args = {
    ...
}


def create_pipeline(dag_: DAG):
    task_startup_client = KubernetesPodOperator(
        name="clickhouse-client",
        task_id="clickhouse-client",
        labels={"application": "clickhouse-client-gsutil"},
        pod_template_file=f"{TEMPLATE_PATH}/template.yaml",
        cmds=["sleep", "60000"],
        reattach_on_restart=True,
        is_delete_operator_pod=False,
        get_logs=True,
        log_events_on_failure=True,
        dag=dag_,
    )
    task_startup_client


with DAG(
    dag_id="MANUAL-GKE-clickhouse-client",
    default_args=args,
    schedule_interval=None,
    max_active_runs=1,
    start_date=days_ago(2),
    tags=["utility"],
) as dag:
    create_pipeline(dag)
I ran Airflow with DEBUG logging and there is nothing suspicious; everything reports successful completion:
Scheduler log
...
Event: manualgkeclickhouseclientaticlickhouseclient.9959fa1fd13a4b6fbdaf40549a09d2f9 Succeeded
...
Executor logs
[2021-08-15 18:40:27,045] {settings.py:208} DEBUG - Setting up DB connection pool (PID 1)
[2021-08-15 18:40:27,046] {settings.py:276} DEBUG - settings.prepare_engine_args(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=1
[2021-08-15 18:40:27,095] {cli_action_loggers.py:40} DEBUG - Adding <function default_action_log at 0x7f0556c5e280> to pre execution callback
[2021-08-15 18:40:28,070] {cli_action_loggers.py:66} DEBUG - Calling callbacks: [<function default_action_log at 0x7f0556c5e280>]
[2021-08-15 18:40:28,106] {settings.py:208} DEBUG - Setting up DB connection pool (PID 1)
[2021-08-15 18:40:28,107] {settings.py:244} DEBUG - settings.prepare_engine_args(): Using NullPool
[2021-08-15 18:40:28,109] {dagbag.py:496} INFO - Filling up the DagBag from /opt/airflow/dags/ati-airflow-dags.git/dag_clickhouse-client.py
[2021-08-15 18:40:28,110] {dagbag.py:311} DEBUG - Importing /opt/airflow/dags/ati-airflow-dags.git/dag_clickhouse-client.py
/usr/local/lib/python3.9/site-packages/airflow/providers/cncf/kubernetes/backcompat/backwards_compat_converters.py:26 DeprecationWarning: This module is deprecated. Please use `kubernetes.client.models.V1Volume`.
/usr/local/lib/python3.9/site-packages/airflow/providers/cncf/kubernetes/backcompat/backwards_compat_converters.py:27 DeprecationWarning: This module is deprecated. Please use `kubernetes.client.models.V1VolumeMount`.
[2021-08-15 18:40:28,135] {dagbag.py:461} DEBUG - Loaded DAG <DAG: MANUAL-GKE-clickhouse-client>
[2021-08-15 18:40:28,176] {plugins_manager.py:281} DEBUG - Loading plugins
[2021-08-15 18:40:28,176] {plugins_manager.py:225} DEBUG - Loading plugins from directory: /opt/airflow/plugins
[2021-08-15 18:40:28,177] {plugins_manager.py:205} DEBUG - Loading plugins from entrypoints
[2021-08-15 18:40:28,238] {plugins_manager.py:418} DEBUG - Integrate DAG plugins
Running <TaskInstance: MANUAL-GKE-clickhouse-client.clickhouse-client 2021-08-15T18:39:38.150950+00:00 [queued]> on host manualgkeclickhouseclientclickhouseclient.9959fa1fd13a4b6fbd
[2021-08-15 18:40:28,670] {cli_action_loggers.py:84} DEBUG - Calling callbacks: []
[2021-08-15 18:40:28,670] {settings.py:302} DEBUG - Disposing DB connection pool (PID 1)

Related

cloudInitNoCloud userData is not working in k8s

My YAML file looks like:
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstance
metadata:
  name: m1
spec:
  domain:
    cpu:
      cores: 4
    devices:
      disks:
        - name: harddrive
          disk: {}
        - name: cloudinitdisk
          disk: {}
      interfaces:
        - name: ovs-net
          bridge: {}
        - name: default
          masquerade: {}
    resources:
      requests:
        memory: 8G
  volumes:
    - name: harddrive
      containerDisk:
        image: 1.1.1.1:8888/redhat/redhat79:latest
    - name: cloudinitdisk
      cloudInitNoCloud:
        userData: |
          #!/bin/bash
          echo 1 > /opt/1.txt
  networks:
    - name: ovs-net
      multus:
        networkName: ovs-vlan-100
    - name: default
      pod: {}
The VMI is running and I can log in to the VM, but there is nothing in the directory /opt. I found a disk, sdb; when I mount sdb to /mnt, I can see the file 'userdata', and its content is correct.
I don't know what I did wrong.
K8S 1.22.10
I also tried two other methods:
1)
cloudInitNoCloud:
  userData: |
    bootcmd:
      - touch /opt/1.txt
    runcmd:
      - touch /opt/2.txt
2)
cloudInitNoCloud:
  secretRef:
    name: my-vmi-secret
I expected cloudInitNoCloud to work and run my commands.
I found the problem: the Docker image that I used doesn't have the cloud-init packages installed.
The KubeVirt documentation doesn't mention this; I thought I could use it directly.
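For reference, once cloud-init is present in the guest image, the bootcmd/runcmd form also needs the #cloud-config header as the first line of userData, otherwise cloud-init will not interpret it as cloud-config. A minimal sketch:

cloudInitNoCloud:
  userData: |
    #cloud-config
    bootcmd:
      - touch /opt/1.txt
    runcmd:
      - touch /opt/2.txt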

Is EFS a good logs backup option if Loki pod terminated accidentally in EKS Fargate

I am currently using Loki to store logs generated by my applications on EKS Fargate. A sidecar pattern with promtail is used to scrape logs. A single Loki pod is used, and S3 is configured as the destination to store logs. It works nicely as expected. However, when I tested the availability of the logging system by deleting pods, I discovered that if Loki's pod was deleted, some logs would be missing (ranging from around 20 minutes before the pod was deleted up to the time it was deleted), even after the pod restarted.
To solve this problem, I tried to use EFS as the persistent volume of Loki's pod, mounted at the path /loki. I followed the process in this article (https://aws.amazon.com/blogs/aws/new-aws-fargate-for-amazon-eks-now-supports-amazon-efs/). But I got an error from the Loki pod with msg "error running loki" err="mkdir /loki/compactor: permission denied"
Therefore, I have 2 questions in my mind:
Should I use EFS as a solution for log backup in my case?
Why did I get a permission denied inside the pod, any ways to solve this problem?
My Loki-config.yaml
auth_enabled: false

server:
  http_listen_port: 3100
  # grpc_listen_port: 9096

ingester:
  wal:
    enabled: true
    dir: /loki/wal
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    # final_sleep: 0s
  chunk_idle_period: 3m
  chunk_retain_period: 30s
  max_transfer_retries: 0
  chunk_target_size: 1048576

schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: index_
        period: 24h

storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
    shared_store: s3
  aws:
    bucketnames: bucketnames
    endpoint: s3.us-west-2.amazonaws.com
    region: us-west-2
    access_key_id: access_key_id
    secret_access_key: secret_access_key
    sse_encryption: true

compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 5m

limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 48h

chunk_store_config:
  max_look_back_period: 0s

table_manager:
  retention_deletes_enabled: true
  retention_period: 96h

querier:
  query_ingesters_within: 0

analytics:
  reporting_enabled: false
Deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: fargate-api-dev
  name: dev-loki
spec:
  selector:
    matchLabels:
      app: dev-loki
  template:
    metadata:
      labels:
        app: dev-loki
    spec:
      volumes:
        - name: loki-config
          configMap:
            name: dev-loki-config
        - name: dev-loki-efs-pv
          persistentVolumeClaim:
            claimName: dev-loki-efs-pvc
      containers:
        - name: loki
          image: loki:2.6.1
          args:
            - -print-config-stderr=true
            - -config.file=/tmp/loki.yaml
          resources:
            limits:
              memory: "500Mi"
              cpu: "200m"
          ports:
            - containerPort: 3100
          volumeMounts:
            - name: dev-loki-config
              mountPath: /tmp
              readOnly: false
            - name: dev-loki-efs-pv
              mountPath: /loki
Promtail-config.yaml
server:
  log_level: info
  http_listen_port: 9080

clients:
  - url: http://loki.com/loki/api/v1/push

positions:
  filename: /run/promtail/positions.yaml

scrape_configs:
  - job_name: api-log
    static_configs:
      - targets:
          - localhost
        labels:
          job: apilogs
          pod: ${POD_NAME}
          __path__: /var/log/*.log
I had a similar issue using EFS as the volume to store the logs, and I found this solution: https://github.com/grafana/loki/issues/2018#issuecomment-1030221498
Basically, the Loki container is not able to create the directories it needs to start on its own, so we used an initContainer to create them for it (see the sketch below).
This solution worked like a charm for us.
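A minimal sketch of that workaround, assuming the EFS volume is named dev-loki-efs-pv as in the Deployment above, and that Loki runs as UID/GID 10001 (the default in the upstream image; adjust to match your image):

      # Added to the Loki Deployment's pod spec, alongside `containers:`.
      initContainers:
        - name: init-loki-dirs
          image: busybox:1.36
          command:
            - sh
            - -c
            # Pre-create the directories Loki needs and hand ownership
            # to the Loki user (UID/GID 10001 is an assumption).
            - |
              mkdir -p /loki/wal /loki/index /loki/index_cache /loki/compactor
              chown -R 10001:10001 /loki
          volumeMounts:
            - name: dev-loki-efs-pv
              mountPath: /loki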

Mount camera to pod get MountVolume.SetUp failed for volume "default-token-c8hm5" : failed to sync secret cache: timed out waiting for the condition

On my Jetson NX, I'd like to write a YAML file that mounts two cameras into a pod.
The YAML:
containers:
  - name: my-pod
    image: my_image:v1.0.0
    imagePullPolicy: Always
    volumeMounts:
      - mountPath: /dev/video0
        name: dev-video0
      - mountPath: /dev/video1
        name: dev-video1
    resources:
      limits:
        nvidia.com/gpu: 1
    ports:
      - containerPort: 9000
    command: ["/bin/bash"]
    args: ["-c", "while true; do echo hello; sleep 10; done"]
    securityContext:
      privileged: true
volumes:
  - hostPath:
      path: /dev/video0
      type: ""
    name: dev-video0
  - hostPath:
      path: /dev/video1
      type: ""
    name: dev-video1
But when I deploy it as a pod, I get the error:
MountVolume.SetUp failed for volume "default-token-c8hm5" : failed to sync secret cache: timed out waiting for the condition
I tried removing the volumes from the YAML, and then the pod deploys successfully. Any comments on this issue?
Another issue: when a pod runs into problems, it consumes the rest of the storage on my Jetson NX. I guess k8s makes lots of temporary files or logs when something goes wrong? Is there any solution to this? Otherwise all of my pods will be evicted...

Pathing Issue when Executing Airflow run & backfill commands on Kubernetes

Versions
Airflow: 1.10.7
Kubernetes: 1.14.9
Setup
Airflow is configured to use Kubernetes Executors; normal operations work just fine.
DAGs and logs are accessed via EFS volumes defined with PersistentVolume & PersistentVolumeClaim specs.
I have the following k8s spec, which I want to run backfill jobs with:
apiVersion: v1
kind: Pod
metadata:
  name: backfill-test
  namespace: airflow
spec:
  serviceAccountName: airflow-service-account
  volumes:
    - name: airflow-dags
      persistentVolumeClaim:
        claimName: airflow-dags
    - name: airflow-logs
      persistentVolumeClaim:
        claimName: airflow-logs
  containers:
    - name: somename
      image: myimage
      volumeMounts:
        - name: airflow-dags
          mountPath: /usr/local/airflow/dags
          readOnly: true
        - name: airflow-logs
          mountPath: /usr/local/airflow/logs
          readOnly: false
      env:
        - name: AIRFLOW__CORE__EXECUTOR
          value: KubernetesExecutor
        - name: AIRFLOW__KUBERNETES__NAMESPACE
          value: airflow
        - name: AIRFLOW__CORE__DAGS_FOLDER
          value: dags
        - name: AIRFLOW__CORE__BASE_LOG_FOLDER
          value: logs
        # - name: AIRFLOW__KUBERNETES__DAGS_VOLUME_MOUNT_POINT
        #   value: /usr/local/airflow/dags
        - name: AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH
          value: dags
        - name: AIRFLOW__KUBERNETES__LOGS_VOLUME_SUBPATH
          value: logs
        - name: AIRFLOW__KUBERNETES__DAGS_VOLUME_CLAIM
          value: airflow-dags
        - name: AIRFLOW__KUBERNETES__LOGS_VOLUME_CLAIM
          value: airflow-logs
        - name: AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY
          value: someimage_uri
        - name: AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG
          value: latest
        - name: AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME
          value: airflow-service-account
        - name: AIRFLOW_HOME
          value: usr/local/airflow
      # command: ["sleep", "1h"]
      command: ["airflow", "backfill",
                "my_dag",
                # # "--subdir", ".",
                # "--local",
                "--task_regex", "my_task_task",
                "--start_date", "2020-07-01T00:00:00",
                "--end_date", "2020-08-01T00:00:00"]
  restartPolicy: Never
Problem
The issue with this seems to be some pathing problem when the task is added to the queue.
When running the initial command, the CLI finds the DAG and the associated task:
airflow#backfill-test:~$ airflow run my_dag my_task 2020-07-01T01:15:00+00:00 --local --raw --force
[2020-08-27 23:14:42,038] {__init__.py:51} INFO - Using executor KubernetesExecutor
[2020-08-27 23:14:42,040] {dagbag.py:403} INFO - Filling up the DagBag from /usr/local/airflow/dags
Running %s on host %s <TaskInstance: my_dag.my_task 2020-07-01T01:15:00+00:00 [failed]> backfill-test
However, the task gets added to the queue with a weird path. The log of the actual task execution attempt is below.
[2020-08-27 23:14:43,019] {taskinstance.py:867} INFO - Starting attempt 3 of 2
[2020-08-27 23:14:43,019] {taskinstance.py:868} INFO -
--------------------------------------------------------------------------------
[2020-08-27 23:14:43,043] {taskinstance.py:887} INFO - Executing <Task(PostgresOperator): my_task> on 2020-07-01T01:15:00+00:00
[2020-08-27 23:14:43,046] {standard_task_runner.py:52} INFO - Started process 191 to run task
[2020-08-27 23:14:43,085] {logging_mixin.py:112} INFO - [2020-08-27 23:14:43,085] {dagbag.py:403} INFO - Filling up the DagBag from /usr/local/airflow/dags/usr/local/airflow/my_dag.py
[2020-08-27 23:14:53,006] {logging_mixin.py:112} INFO - [2020-08-27 23:14:53,006] {local_task_job.py:103} INFO - Task exited with return code 1
Adding --subdir to the initial command doesn't actually get propagated to the task queue, and results in the same log output.
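Note the doubled prefix in the DagBag line above (/usr/local/airflow/dags/usr/local/airflow/my_dag.py), which looks like a relative path being joined onto the DAGs folder, and that AIRFLOW_HOME in the spec is missing a leading slash. A hedged sketch of absolute-path settings to try, assuming the DAGs really live at /usr/local/airflow/dags:

      env:
        - name: AIRFLOW_HOME
          value: /usr/local/airflow        # leading slash added
        - name: AIRFLOW__CORE__DAGS_FOLDER
          value: /usr/local/airflow/dags   # absolute instead of relative
        - name: AIRFLOW__CORE__BASE_LOG_FOLDER
          value: /usr/local/airflow/logs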

Kubernetes doesn't allow mounting a file to a container

I ran into the below error when trying to deploy an application in a Kubernetes cluster. It looks like Kubernetes doesn't allow mounting a file into a container; do you know the possible reason?
deployment config file
apiVersion: extensions/v1beta1
kind: Deployment
metadata:
  name: model-loader-service
  namespace: "{{ .Values.nsPrefix }}-aai"
spec:
  selector:
    matchLabels:
      app: model-loader-service
  template:
    metadata:
      labels:
        app: model-loader-service
      name: model-loader-service
    spec:
      containers:
        - name: model-loader-service
          image: "{{ .Values.image.modelLoaderImage }}:{{ .Values.image.modelLoaderVersion }}"
          imagePullPolicy: {{ .Values.pullPolicy }}
          env:
            - name: CONFIG_HOME
              value: /opt/app/model-loader/config/
          volumeMounts:
            - mountPath: /etc/localtime
              name: localtime
              readOnly: true
            - mountPath: /opt/app/model-loader/config/
              name: aai-model-loader-config
            - mountPath: /var/log/onap
              name: aai-model-loader-logs
            - mountPath: /opt/app/model-loader/bundleconfig/etc/logback.xml
              name: aai-model-loader-log-conf
              subPath: logback.xml
          ports:
            - containerPort: 8080
            - containerPort: 8443
        - name: filebeat-onap-aai-model-loader
          image: {{ .Values.image.filebeat }}
          imagePullPolicy: {{ .Values.pullPolicy }}
          volumeMounts:
            - mountPath: /usr/share/filebeat/filebeat.yml
              name: filebeat-conf
            - mountPath: /var/log/onap
              name: aai-model-loader-logs
            - mountPath: /usr/share/filebeat/data
              name: aai-model-loader-filebeat
      volumes:
        - name: localtime
          hostPath:
            path: /etc/localtime
        - name: aai-model-loader-config
          hostPath:
            path: "/dockerdata-nfs/{{ .Values.nsPrefix }}/aai/model-loader/appconfig/"
        - name: filebeat-conf
          hostPath:
            path: /dockerdata-nfs/{{ .Values.nsPrefix }}/log/filebeat/logback/filebeat.yml
Detailed information about this issue:
message: 'invalid header field value "oci runtime error: container_linux.go:247:
starting container process caused \"process_linux.go:359: container init
caused \\\"rootfs_linux.go:53: mounting \\\\\\\"/dockerdata-nfs/onap/log/filebeat/logback/filebeat.yml\\\\\\\"
to rootfs \\\\\\\"/var/lib/docker/aufs/mnt/7cd32a29938e9f70a727723f550474cb5b41c0966f45ad0c323360779f08cf5c\\\\\\\"
at \\\\\\\"/var/lib/docker/aufs/mnt/7cd32a29938e9f70a727723f550474cb5b41c0966f45ad0c323360779f08cf5c/usr/share/filebeat/filebeat.yml\\\\\\\"
caused \\\\\\\"not a directory\\\\\\\"\\\"\"\n"'
....
$ docker version
Client:
  Version:      1.12.6
  API version:  1.24
  Go version:   go1.6.4
  Git commit:   78d1802
  Built:        Tue Jan 10 20:38:45 2017
  OS/Arch:      linux/amd64
Server:
  Version:      1.12.6
  API version:  1.24
  Go version:   go1.6.4
  Git commit:   78d1802
  Built:        Tue Jan 10 20:38:45 2017
  OS/Arch:      linux/amd64
$ kubectl version
Client Version: version.Info{Major:"1", Minor:"7", GitVersion:"v1.7.4", GitCommit:"793658f2d7ca7f064d2bdf606519f9fe1229c381", GitTreeState:"clean", BuildDate:"2017-08-17T08:48:23Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
Server Version: version.Info{Major:"1", Minor:"8+", GitVersion:"v1.8.3-rancher3", GitCommit:"772c4c54e1f4ae7fc6f63a8e1ecd9fe616268e16", GitTreeState:"clean", BuildDate:"2017-11-27T19:51:43Z", GoVersion:"go1.8.3", Compiler:"gc", Platform:"linux/amd64"}
caused "not a directory" is kind of self explanatory. What is the exact volume and volumeMount definition you use ? do you use subPath in your declaration ?
EDIT: change

- name: filebeat-conf
  hostPath:
    path: /dockerdata-nfs/{{ .Values.nsPrefix }}/log/filebeat/logback/filebeat.yml

to

- name: filebeat-conf
  hostPath:
    path: /dockerdata-nfs/{{ .Values.nsPrefix }}/log/filebeat/logback/

and add subPath: filebeat.yml to the volumeMount (see the sketch below).
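Put together, the relevant parts of the Deployment would look roughly like this (a sketch based on the manifest above):

          volumeMounts:
            - mountPath: /usr/share/filebeat/filebeat.yml
              name: filebeat-conf
              subPath: filebeat.yml  # mount just this file out of the directory
      ...
      volumes:
        - name: filebeat-conf
          hostPath:
            path: /dockerdata-nfs/{{ .Values.nsPrefix }}/log/filebeat/logback/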
SELinux may also be the culprit here. Log on to the node and execute sestatus. If the policy is disabled, you will see SELINUX=disabled in the output; otherwise it will be something similar to this:

SELinux status:                 enabled
SELinuxfs mount:                /sys/fs/selinux
SELinux root directory:         /etc/selinux
Loaded policy name:             mcs
Current mode:                   permissive
Mode from config file:          permissive
Policy MLS status:              enabled
Policy deny_unknown status:     allowed
Max kernel policy version:      31
First option:
You can disable SELinux by editing the /etc/selinux/config file and changing SELINUX=permissive to SELINUX=disabled. Once done, reboot the machine and redeploy to see if that fixed it. However, this is not the recommended way and should be seen as a temporary fix.
Second option:
Log on to the node and execute ps -efZ | grep kubelet, which will give something like this:

system_u:system_r:kernel_t:s0 root 1592 1 2 May23 ? 09:58:18 /usr/local/bin/kubelet --anonymous-auth=false

Now, from this output, capture the string system_u:system_r:kernel_t:s0, which can be translated into a securityContext in your deployment as below.
securityContext:
  seLinuxOptions:
    user: system_u
    role: system_r
    type: spc_t
    level: s0
Deploy your application and check the logs to see if it is fixed. Do let me know if this works for you or if you need any further help.
Is this a multi-node cluster? If so, the file needs to exist on all Kubernetes nodes, since the pod is typically scheduled on a randomly chosen available node. In any case, ConfigMaps are a much better way to supply static/read-only files to a container.
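A rough sketch of the ConfigMap approach for the filebeat configuration above (resource names are illustrative):

apiVersion: v1
kind: ConfigMap
metadata:
  name: filebeat-conf
data:
  filebeat.yml: |
    # ... filebeat configuration goes here ...

Then reference it from the pod spec instead of the hostPath volume; the volumeMount with subPath: filebeat.yml can stay the same as in the earlier sketch:

      volumes:
        - name: filebeat-conf
          configMap:
            name: filebeat-conf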