Pathing Issue when Executing Airflow run & backfill commands on Kubernetes

Versions
Airflow: 1.10.7
Kubernetes: 1.14.9
Setup
Airflow is configured to use the Kubernetes Executor; normal operations work just fine.
DAGs and logs are accessed via EFS volumes defined with PersistentVolume & PersistentVolumeClaim specs.
I have the following k8s spec, which I want to use to run backfill jobs:
apiVersion: v1
kind: Pod
metadata:
  name: backfill-test
  namespace: airflow
spec:
  serviceAccountName: airflow-service-account
  volumes:
    - name: airflow-dags
      persistentVolumeClaim:
        claimName: airflow-dags
    - name: airflow-logs
      persistentVolumeClaim:
        claimName: airflow-logs
  containers:
    - name: somename
      image: myimage
      volumeMounts:
        - name: airflow-dags
          mountPath: /usr/local/airflow/dags
          readOnly: true
        - name: airflow-logs
          mountPath: /usr/local/airflow/logs
          readOnly: false
      env:
        - name: AIRFLOW__CORE__EXECUTOR
          value: KubernetesExecutor
        - name: AIRFLOW__KUBERNETES__NAMESPACE
          value: airflow
        - name: AIRFLOW__CORE__DAGS_FOLDER
          value: dags
        - name: AIRFLOW__CORE__BASE_LOG_FOLDER
          value: logs
        # - name: AIRFLOW__KUBERNETES__DAGS_VOLUME_MOUNT_POINT
        #   value: /usr/local/airflow/dags
        - name: AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH
          value: dags
        - name: AIRFLOW__KUBERNETES__LOGS_VOLUME_SUBPATH
          value: logs
        - name: AIRFLOW__KUBERNETES__DAGS_VOLUME_CLAIM
          value: airflow-dags
        - name: AIRFLOW__KUBERNETES__LOGS_VOLUME_CLAIM
          value: airflow-logs
        - name: AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY
          value: someimage_uri
        - name: AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG
          value: latest
        - name: AIRFLOW__KUBERNETES__WORKER_SERVICE_ACCOUNT_NAME
          value: airflow-service-account
        - name: AIRFLOW_HOME
          value: usr/local/airflow
      # command: ["sleep", "1h"]
      command: ["airflow", "backfill",
                "my_dag",
                # "--subdir", ".",
                # "--local",
                "--task_regex", "my_task_task",
                "--start_date", "2020-07-01T00:00:00",
                "--end_date", "2020-08-01T00:00:00"]
  restartPolicy: Never
Problem
The issue seems to be a pathing problem when the task is added to the queue.
When running the initial command, the CLI finds the DAG and its associated task:
airflow#backfill-test:~$ airflow run my_dag my_task 2020-07-01T01:15:00+00:00 --local --raw --force
[2020-08-27 23:14:42,038] {__init__.py:51} INFO - Using executor KubernetesExecutor
[2020-08-27 23:14:42,040] {dagbag.py:403} INFO - Filling up the DagBag from /usr/local/airflow/dags
Running %s on host %s <TaskInstance: my_dag.my_task 2020-07-01T01:15:00+00:00 [failed]> backfill-test
However, the task gets added to the queue with a mangled path. The log of the actual task execution attempt is below:
[2020-08-27 23:14:43,019] {taskinstance.py:867} INFO - Starting attempt 3 of 2
[2020-08-27 23:14:43,019] {taskinstance.py:868} INFO -
--------------------------------------------------------------------------------
[2020-08-27 23:14:43,043] {taskinstance.py:887} INFO - Executing <Task(PostgresOperator): my_task> on 2020-07-01T01:15:00+00:00
[2020-08-27 23:14:43,046] {standard_task_runner.py:52} INFO - Started process 191 to run task
[2020-08-27 23:14:43,085] {logging_mixin.py:112} INFO - [2020-08-27 23:14:43,085] {dagbag.py:403} INFO - Filling up the DagBag from /usr/local/airflow/dags/usr/local/airflow/my_dag.py
[2020-08-27 23:14:53,006] {logging_mixin.py:112} INFO - [2020-08-27 23:14:53,006] {local_task_job.py:103} INFO - Task exited with return code 1
Adding --subdir to the initial command doesn't actually get propagated to the queued task and results in the same log output.
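For what it's worth, the doubled path in the log (/usr/local/airflow/dags/usr/local/airflow/my_dag.py) looks like the relative dags value being joined onto the file location, compounded by AIRFLOW_HOME missing its leading slash. A minimal sketch of the same env entries using absolute paths, as one thing to try (an assumption on my part, not a confirmed fix):

env:
  - name: AIRFLOW_HOME
    value: /usr/local/airflow          # leading slash added (assumed typo in the spec above)
  - name: AIRFLOW__CORE__DAGS_FOLDER
    value: /usr/local/airflow/dags     # absolute path instead of the relative "dags"
  - name: AIRFLOW__CORE__BASE_LOG_FOLDER
    value: /usr/local/airflow/logs     # absolute path instead of the relative "logs"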

Related

Argo Workflow fails with no such directory error when using input parameters

I'm currently doing a PoC to validate usage of Argo Workflow. I created a workflow spec with the following template (this is just a small portion of the workflow yaml):
templates:
  - name: dummy-name
    inputs:
      parameters:
        - name: params
    container:
      name: container-name
      image: <image>
      volumeMounts:
        - name: vault-token
          mountPath: "/etc/secrets"
          readOnly: true
      imagePullPolicy: IfNotPresent
      command: ['workflow', 'f10', 'reports', 'expiry', '.', '--days-until-expiry', '30', '--vault-token-file-path', '/etc/secrets/token', '--environment', 'corporate', '--log-level', 'debug']
The above way of passing the commands works without any issues upon submitting the workflow. However, if I replace the command with {{inputs.parameters.params}} like this:
templates:
  - name: dummy-name
    inputs:
      parameters:
        - name: params
    container:
      name: container-name
      image: <image>
      volumeMounts:
        - name: vault-token
          mountPath: "/etc/secrets"
          readOnly: true
      imagePullPolicy: IfNotPresent
      command: ['workflow', '{{inputs.parameters.params}}']
it fails with the following error:
DEBU[2023-01-20T18:11:07.220Z] Log line
content="Error: failed to find name in PATH: exec: \"workflow f10 reports expiry . --days-until-expiry 30 --vault-token-file-path /etc/secrets/token --environment corporate --log-level debug\":
stat workflow f10 reports expiry . --days-until-expiry 30 --vault-token-file-path /etc/secrets/token --environment corporate --log-level debug: no such file or directory"
Am I missing something here?
FYI: The Dockerfile that builds the container has the following ENTRYPOINT set:
ENTRYPOINT ["workflow"]
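The error suggests the whole parameter value is being passed as a single argv entry, so the runtime looks for a binary literally named "workflow f10 reports expiry …". One workaround, sketched below on the assumption that the image ships a POSIX shell, is to let a shell do the word splitting:

templates:
  - name: dummy-name
    inputs:
      parameters:
        - name: params
    container:
      name: container-name
      image: <image>
      imagePullPolicy: IfNotPresent
      # Override the ENTRYPOINT with a shell so the parameter string is split
      # into separate arguments before the workflow CLI sees it
      command: ['sh', '-c']
      args: ['workflow {{inputs.parameters.params}}']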

Is EFS a good log backup option if the Loki pod is terminated accidentally in EKS Fargate?

I am currently using Loki to store logs generated by my applications on EKS Fargate. The sidecar pattern with Promtail is used to scrape logs. A single Loki pod is used, and S3 is configured as the destination to store logs. It works nicely as expected. However, when I tested the availability of the logging system by deleting pods, I discovered that if Loki's pod was deleted, some logs would be missing (from around 20 minutes before the pod was deleted up to the time it was deleted), even after the pod restarted.
To solve this problem, I tried to use EFS as the persistent volume of Loki's pod, mounted at /loki. I followed this article for the whole process (https://aws.amazon.com/blogs/aws/new-aws-fargate-for-amazon-eks-now-supports-amazon-efs/), but I got an error from the Loki pod with msg "error running loki" err="mkdir /loki/compactor: permission denied"
Therefore, I have two questions:
Should I use EFS as a solution for log backup in my case?
Why did I get a permission-denied error inside the pod, and is there any way to solve it?
My Loki-config.yaml
auth_enabled: false
server:
  http_listen_port: 3100
  # grpc_listen_port: 9096
ingester:
  wal:
    enabled: true
    dir: /loki/wal
  lifecycler:
    ring:
      kvstore:
        store: inmemory
      replication_factor: 1
    # final_sleep: 0s
  chunk_idle_period: 3m
  chunk_retain_period: 30s
  max_transfer_retries: 0
  chunk_target_size: 1048576
schema_config:
  configs:
    - from: 2020-05-15
      store: boltdb-shipper
      object_store: aws
      schema: v11
      index:
        prefix: index_
        period: 24h
storage_config:
  boltdb_shipper:
    active_index_directory: /loki/index
    cache_location: /loki/index_cache
    shared_store: s3
  aws:
    bucketnames: bucketnames
    endpoint: s3.us-west-2.amazonaws.com
    region: us-west-2
    access_key_id: access_key_id
    secret_access_key: secret_access_key
    sse_encryption: true
compactor:
  working_directory: /loki/compactor
  shared_store: s3
  compaction_interval: 5m
limits_config:
  reject_old_samples: true
  reject_old_samples_max_age: 48h
chunk_store_config:
  max_look_back_period: 0s
table_manager:
  retention_deletes_enabled: true
  retention_period: 96h
querier:
  query_ingesters_within: 0
analytics:
  reporting_enabled: false
Deploy.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  namespace: fargate-api-dev
  name: dev-loki
spec:
  selector:
    matchLabels:
      app: dev-loki
  template:
    metadata:
      labels:
        app: dev-loki
    spec:
      volumes:
        - name: loki-config
          configMap:
            name: dev-loki-config
        - name: dev-loki-efs-pv
          persistentVolumeClaim:
            claimName: dev-loki-efs-pvc
      containers:
        - name: loki
          image: loki:2.6.1
          args:
            - -print-config-stderr=true
            - -config.file=/tmp/loki.yaml
          resources:
            limits:
              memory: "500Mi"
              cpu: "200m"
          ports:
            - containerPort: 3100
          volumeMounts:
            - name: dev-loki-config
              mountPath: /tmp
              readOnly: false
            - name: dev-loki-efs-pv
              mountPath: /loki
Promtail-config.yaml
server:
  log_level: info
  http_listen_port: 9080
clients:
  - url: http://loki.com/loki/api/v1/push
positions:
  filename: /run/promtail/positions.yaml
scrape_configs:
  - job_name: api-log
    static_configs:
      - targets:
          - localhost
        labels:
          job: apilogs
          pod: ${POD_NAME}
          __path__: /var/log/*.log
I had a similar issue using EFS as the volume to store the logs, and I found this solution: https://github.com/grafana/loki/issues/2018#issuecomment-1030221498
Basically, the Loki container is not able to create the directories it needs to start by itself, so we used an initContainer to create them for it.
This solution worked like a charm for me.
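To illustrate that workaround, here is a rough sketch of such an initContainer for the pod spec in Deploy.yaml above (the busybox image and the 10001 UID are assumptions; check which user your Loki image actually runs as):

initContainers:
  - name: init-loki-dirs
    image: busybox:1.36
    command:
      - sh
      - -c
      # Pre-create the directories Loki expects on the EFS volume and hand
      # ownership to the non-root Loki user so the mkdir no longer fails
      - mkdir -p /loki/wal /loki/index /loki/index_cache /loki/compactor &&
        chown -R 10001:10001 /loki
    volumeMounts:
      - name: dev-loki-efs-pv
        mountPath: /loki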

How to use an env variable in a Kubernetes script?

I have this Kubernetes script in an Argo Workflows template:
- name: rendition-composer
  inputs:
    parameters:
      - name: original_resolution
  script:
    image: node:9.1-alpine
    command: [node]
    source: |
      // some node.js script
      ...
      console.log($(SD_RENDITION));
    volumeMounts:
      - name: workdir
        mountPath: /mnt/vol
      - name: config
        mountPath: /config
        readOnly: true
    env:
      - name: SD_RENDITION
        valueFrom:
          configMapKeyRef:
            name: rendition-specification
            key: res480p
In the line console.log($(SD_RENDITION)); I can't get the env value. It returns this error:
ReferenceError: $ is not defined
I already did all the ConfigMap setup per the official Kubernetes documentation.
Is there anything I missed?
process.env.SD_RENDITION
The above syntax solved my problem. It seems I was missing some essential concepts about JS's process object.
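For completeness, a sketch of the corrected template, unchanged apart from the script body, which reads the value through Node's process.env:

- name: rendition-composer
  inputs:
    parameters:
      - name: original_resolution
  script:
    image: node:9.1-alpine
    command: [node]
    source: |
      // Values injected via `env:` are ordinary environment variables,
      // so read them with process.env rather than $(...)
      console.log(process.env.SD_RENDITION);
    env:
      - name: SD_RENDITION
        valueFrom:
          configMapKeyRef:
            name: rendition-specification
            key: res480p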

Kubernetes Executor does not spawn Kubernetes Pod Operator

I can't figure out where to look. I am running Airflow on GKE. It was running fine but recently started failing. Basically, the DAG starts and then its tasks fail, although they were running okay a week ago.
It seems like something changed in the cluster, but based on the logs I can't figure out what.
My KubernetesExecutor stopped spawning KubernetesPodOperators, and there are no logs or errors.
If I run the template I use for the operator directly (kubectl apply -f), it runs successfully.
Airflow 2.1.2
Kubectl
Client Version: version.Info{Major:"1", Minor:"20", GitVersion:"v1.20.7", GitCommit:"132a687512d7fb058d0f5890f07d4121b3f0a2e2", GitTreeState:"clean", BuildDate:"2021-05-12T12:40:09Z", GoVersion:"go1.15.12", Compiler:"gc", Platform:"darwin/amd64"}
Server Version: version.Info{Major:"1", Minor:"20+", GitVersion:"v1.20.8-gke.900", GitCommit:"28ab8501be88ea42e897ca8514d7cd0b436253d9", GitTreeState:"clean", BuildDate:"2021-06-30T09:23:36Z", GoVersion:"go1.15.13b5", Compiler:"gc", Platform:"linux/amd64"}
Executor Template
apiVersion: v1
kind: Pod
metadata:
  ...
spec:
  restartPolicy: Never
  serviceAccountName: airflow # this account has rights to create pods
  automountServiceAccountToken: true
  volumes:
    - name: dags
      emptyDir: {}
    - name: logs
      emptyDir: {}
    - configMap:
        name: airflow-git-sync-configmap
      name: airflow-git-sync-configmap
  initContainers:
    - name: git-sync-clone
      securityContext:
        runAsUser: 65533
        runAsGroup: 65533
      image: k8s.gcr.io/git-sync/git-sync:v3.3.1
      imagePullPolicy: Always
      volumeMounts:
        - mountPath: /tmp/git
          name: dags
      resources:
        ...
      args: ["--one-time"]
      envFrom:
        - configMapRef:
            name: airflow-git-sync-configmap
        - secretRef:
            name: airflow-git-sync-secret
  containers:
    - name: base
      image: <artifactory_url>/airflow:latest
      volumeMounts:
        - name: dags
          mountPath: /opt/airflow/dags
        - name: logs
          mountPath: /opt/airflow/logs
      imagePullPolicy: Always
Pod template
apiVersion: v1
kind: Pod
metadata:
  ....
spec:
  serviceAccountName: airflow
  automountServiceAccountToken: true
  volumes:
    - name: sql
      emptyDir: {}
  initContainers:
    - name: git-sync
      image: k8s.gcr.io/git-sync/git-sync:v3.3.1
      imagePullPolicy: Always
      args: ["--one-time"]
      volumeMounts:
        - name: sql
          mountPath: /tmp/git/
      resources:
        requests:
          memory: 300Mi
          cpu: 500m
        limits:
          memory: 600Mi
          cpu: 1000m
      envFrom:
        - configMapRef:
            name: git-sync-configmap
        - secretRef:
            name: airflow-git-sync-secret
  containers:
    - name: base
      imagePullPolicy: Always
      image: <artifactory_url>/clickhouse-client-gcloud:20.6.4.44
      volumeMounts:
        - name: sql
          mountPath: /opt/sql
      resources:
        ....
      env:
        - name: GS_SERVICE_ACCOUNT
          valueFrom:
            secretKeyRef:
              name: gs-service-account
              key: service_account.json
        - name: DB_CREDENTIALS
          valueFrom:
            secretKeyRef:
              name: estimation-db-secret
              key: db_cred.json
DAG code
from textwrap import dedent

from airflow import DAG
from airflow.utils.dates import days_ago
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator

TEMPLATE_PATH = "/opt/airflow/dags/airflow-dags.git/pod_templates"

args = {
    ...
}


def create_pipeline(dag_: DAG):
    task_startup_client = KubernetesPodOperator(
        name="clickhouse-client",
        task_id="clickhouse-client",
        labels={"application": "clickhouse-client-gsutil"},
        pod_template_file=f"{TEMPLATE_PATH}/template.yaml",
        cmds=["sleep", "60000"],
        reattach_on_restart=True,
        is_delete_operator_pod=False,
        get_logs=True,
        log_events_on_failure=True,
        dag=dag_,
    )
    task_startup_client


with DAG(
    dag_id="MANUAL-GKE-clickhouse-client",
    default_args=args,
    schedule_interval=None,
    max_active_runs=1,
    start_date=days_ago(2),
    tags=["utility"],
) as dag:
    create_pipeline(dag)
I ran Airflow with DEBUG logging and there is nothing useful, just a successful completion:
Scheduler log
...
Event: manualgkeclickhouseclientaticlickhouseclient.9959fa1fd13a4b6fbdaf40549a09d2f9 Succeeded
...
Executor logs
[2021-08-15 18:40:27,045] {settings.py:208} DEBUG - Setting up DB connection pool (PID 1)
[2021-08-15 18:40:27,046] {settings.py:276} DEBUG - settings.prepare_engine_args(): Using pool settings. pool_size=5, max_overflow=10, pool_recycle=1800, pid=1
[2021-08-15 18:40:27,095] {cli_action_loggers.py:40} DEBUG - Adding <function default_action_log at 0x7f0556c5e280> to pre execution callback
[2021-08-15 18:40:28,070] {cli_action_loggers.py:66} DEBUG - Calling callbacks: [<function default_action_log at 0x7f0556c5e280>]
[2021-08-15 18:40:28,106] {settings.py:208} DEBUG - Setting up DB connection pool (PID 1)
[2021-08-15 18:40:28,107] {settings.py:244} DEBUG - settings.prepare_engine_args(): Using NullPool
[2021-08-15 18:40:28,109] {dagbag.py:496} INFO - Filling up the DagBag from /opt/airflow/dags/ati-airflow-dags.git/dag_clickhouse-client.py
[2021-08-15 18:40:28,110] {dagbag.py:311} DEBUG - Importing /opt/airflow/dags/ati-airflow-dags.git/dag_clickhouse-client.py
/usr/local/lib/python3.9/site-packages/airflow/providers/cncf/kubernetes/backcompat/backwards_compat_converters.py:26 DeprecationWarning: This module is deprecated. Please use `kubernetes.client.models.V1Volume`.
/usr/local/lib/python3.9/site-packages/airflow/providers/cncf/kubernetes/backcompat/backwards_compat_converters.py:27 DeprecationWarning: This module is deprecated. Please use `kubernetes.client.models.V1VolumeMount`.
[2021-08-15 18:40:28,135] {dagbag.py:461} DEBUG - Loaded DAG <DAG: MANUAL-GKE-clickhouse-client>
[2021-08-15 18:40:28,176] {plugins_manager.py:281} DEBUG - Loading plugins
[2021-08-15 18:40:28,176] {plugins_manager.py:225} DEBUG - Loading plugins from directory: /opt/airflow/plugins
[2021-08-15 18:40:28,177] {plugins_manager.py:205} DEBUG - Loading plugins from entrypoints
[2021-08-15 18:40:28,238] {plugins_manager.py:418} DEBUG - Integrate DAG plugins
Running <TaskInstance: MANUAL-GKE-clickhouse-client.clickhouse-client 2021-08-15T18:39:38.150950+00:00 [queued]> on host manualgkeclickhouseclientclickhouseclient.9959fa1fd13a4b6fbd
[2021-08-15 18:40:28,670] {cli_action_loggers.py:84} DEBUG - Calling callbacks: []
[2021-08-15 18:40:28,670] {settings.py:302} DEBUG - Disposing DB connection pool (PID 1)

Keycloak and CockroachDB Cloud

I am trying to use Keycloak's DB against CockroachDB Cloud. I used the https://github.com/codecentric/helm-charts/tree/master/charts/keycloak chart for the Kubernetes deployment. I created a database for Keycloak and provided the configuration needed for a successful connection. In my values.yaml I added additional env vars:
extraEnv: |
  - name: DB_VENDOR
    value: postgres
  - name: DB_ADDR
    value: xxxx.xxx.cockroachlabs.cloud
  - name: DB_PORT
    value: "xxx"
  - name: DB_DATABASE
    value: keycloak
  - name: DB_USER_FILE
    value: /secrets/db-creds/user
  - name: DB_PASSWORD_FILE
    value: /secrets/db-creds/password
  - name: JDBC_PARAMS
    value: sslmode=verify-ca&sslrootcert=/secrets/crdb-creds/xxx.crt
  - name: JDBC_PARAMS_FILE
    value: /secrets/crdb-creds/xxx.crt
and also
# Add additional volumes, e.g. for custom themes
extraVolumes: |
  - name: crdb-creds
    secret:
      secretName: keycloak-crdb-creds
  - name: db-creds
    secret:
      secretName: keycloak-db-creds
and the volume mounts:
# Add additional volume mounts, e.g. for custom themes
extraVolumeMounts: |
  - name: crdb-creds
    mountPath: /secrets/crdb-creds
    readOnly: true
  - name: db-creds
    mountPath: /secrets/db-creds
    readOnly: true
So, in theory, there is no restriction on using CockroachDB as the postgres DB vendor in Keycloak(!). I gave it a try; it didn't error out immediately, but the pod restarts after a while and keeps restarting at the same interval, giving either:
Caused by: liquibase.exception.DatabaseException: liquibase.exception.DatabaseException: java.sql.SQLException: IJ031040: Connection is not associated with a managed connection: org.jboss.jca.adapters.jdbc.jdk8.WrappedConnectionJDK8#3a612dd6
or
10:55:31,907 FATAL [org.keycloak.services] (ServerService Thread Pool -- 64) Error during startup: java.lang.IllegalStateException: Failed to retrieve lock
So my question is: which variable is used to provide the .crt path, and are there any additional steps needed to get this running correctly?