We are running a self-managed Airflow 1.10.2 with KubernetesExecutor on a GKE cluster in GCP. All internal operators are working fine so far, except the KubernetesPodOperator, which we would like to use for running our custom docker images. It seems that the Airflow worker images don't have privileges to start other pods inside the Kubernetes cluster. DAG just does not seem to be doing anything after starting it. This is what we found in the logs initially:
FileNotFoundError: [Errno 2] No such file or directory: '/root/.kube/config'
Next try - in_cluster=True parameter in the KubernetesPodOperator section does not seem to help. After that, we tried to use this parameter in airflow.cfg, section [kubernetes]:
gcp_service_account_keys = kubernetes-executor-private-key:/var/tmp/private/kubernetes_executor_private_key.json
and the error message was now TypeError: a bytes-like object is required, not 'str'
This is the parameter definition from github:
# GCP Service Account Keys to be provided to tasks run on Kubernetes Executors
# Should be supplied in the format: key-name-1:key-path-1,key-name-2:key-path-2
gcp_service_account_keys =
Already tried using various kinds of parentheses and quotes here, no success.
DAG code:
from datetime import datetime, timedelta
from airflow.contrib.operators.kubernetes_pod_operator import KubernetesPodOperator
from airflow.operators.dummy_operator import DummyOperator
default_args = {
'owner': 'xxx',
'depends_on_past': False,
'start_date': datetime.utcnow(),
'email': ['airflow#example.com'],
'email_on_failure': False,
'email_on_retry': False,
'retries': 1,
'retry_delay': timedelta(minutes=5)
}
dag = DAG(
'kubernetes_sample', default_args=default_args, schedule_interval=timedelta(minutes=10))
start = DummyOperator(task_id='run_this_first', dag=dag)
passing = KubernetesPodOperator(namespace='default',
image="Python:3.6",
cmds=["Python","-c"],
arguments=["print('hello world')"],
labels={"foo": "bar"},
name="passing-test",
task_id="passing-task",
in_cluster=True,
get_logs=True,
dag=dag
)
failing = KubernetesPodOperator(namespace='default',
image="ubuntu:1604",
cmds=["Python","-c"],
arguments=["print('hello world')"],
labels={"foo": "bar"},
in_cluster=True,
name="fail",
task_id="failing-task",
get_logs=True,
dag=dag
)
passing.set_upstream(start)
failing.set_upstream(start)
Anyone facing the same problem? Am i missing something here?
Related
I do not know if i am lack of airflow scheduler knowledge or if this is a potential bug from airflow.
situation is like this:
my dag's start date is set to be "start_date": airflow.utils.dates.days_ago(1),
i uploaded the dag to the folder where airflow scans the DAGs
i then turn the dag on (it was by default 'off')
the tasks in the pipeline immediately goes into 'up_for_retry' and you do not really see what had been tried before.
airflow Version Info: Version : 1.10.14. it is run on kubenetes in azure
use Celery executor with Redis
the task instance details are listed below:
Task Instance Details
Dependencies Blocking Task From Getting Scheduled
Dependency Reason
Task Instance State Task is in the 'up_for_retry' state which is not a valid state for execution. The task must be cleared in order to be run.
Not In Retry Period Task is not ready for retry yet but will be retried automatically. Current date is 2021-05-17T09:06:57.239015+00:00 and task will be retried at 2021-05-17T09:09:50.662150+00:00.
am i missing something to judge if it is a bug or if it is expected?
addition, below is the DAG definition as requested.
import airflow
from airflow import DAG
from airflow.contrib.operators.databricks_operator import DatabricksSubmitRunOperator
from airflow.models import Variable
dag_args = {
"owner": "our_project_team_name",
"retries": 1,
"email": ["ouremail_address_replaced_by_this_string"],
"email_on_failure": True,
"email_on_retry": True,
"depends_on_past": False,
"start_date": airflow.utils.dates.days_ago(1),
}
# Implement cluster reuse on Databricks, pick from light, medium, heavy cluster type based on workloads
clusters = Variable.get("our_project_team_namejob_cluster_config", deserialize_json=True)
databricks_connection = "our_company_databricks"
adl_connection = "our_company_wasb"
pipeline_name = "process_our_data_from_boomi"
dag = DAG(dag_id=pipeline_name, default_args=dag_args, schedule_interval="0 3 * * *")
notebook_dir = "/Shared/our_data_name"
lib_path_sub = ""
lib_name_dev_plus_branch = ""
atlas_library = {
"whl": f"dbfs:/python-wheels/atlas{lib_path_sub}/atlas_library-0{lib_name_dev_plus_branch}-py3-none-any.whl"
}
create_our_data_name_source_data_from_boomi_notebook_params = {
"existing_cluster_id": clusters["our_cluster_name"],
"notebook_task": {
"notebook_path": f"{notebook_dir}/create_our_data_name_source_data_from_boomi",
"base_parameters": {"Extraction_date": "{{ ds_nodash }}"},
},
}
create_our_data_name_standardized_table_from_source_xml_notebook_params = {
"existing_cluster_id": clusters["our_cluster_name"],
"notebook_task": {
"notebook_path": f"{notebook_dir}/create_our_data_name_standardized_table_from_source_xml",
"base_parameters": {"Extraction_date": "{{ ds_nodash }}"},
},
}
create_our_data_name_enriched_table_from_standardized_notebook_params = {
"existing_cluster_id": clusters["our_cluster_name"],
"notebook_task": {
"notebook_path": f"{notebook_dir}/create_our_data_name_enriched",
"base_parameters": {"Extraction_date": "{{ ds_nodash }}"},
},
}
layer_1_task = DatabricksSubmitRunOperator(
task_id="Load_our_data_name_to_source",
databricks_conn_id=databricks_connection,
dag=dag,
json=create_our_data_name_source_data_from_boomi_notebook_params,
libraries=[atlas_library],
)
layer_2_task = DatabricksSubmitRunOperator(
task_id="Load_our_data_name_to_standardized",
databricks_conn_id=databricks_connection,
dag=dag,
json=create_our_data_name_standardized_table_from_source_xml_notebook_params,
libraries=[
{"maven": {"coordinates": "com.databricks:spark-xml_2.11:0.5.0"}},
{"pypi": {"package": "inflection"}},
atlas_library,
],
)
layer_3_task = DatabricksSubmitRunOperator(
task_id="Load_our_data_name_to_enriched",
databricks_conn_id=databricks_connection,
dag=dag,
json=create_our_data_name_enriched_table_from_standardized_notebook_params,
libraries=[atlas_library],
)
layer_1_task >> layer_2_task >> layer_3_task
after getting some help from #AnandVidvat about trying to make retry=0 experiment and some firend help to change operator to either DummyOperator or PythonOperator, i can confirm that the issue is not to do with DatabricksOperator or airflow version 1.10.x. i.e it is not an airflow bug.
so in summary, when a DAG, has meaningful operator, my setup fails in first Execution without any task log, and during retry works OK (the task log hides the fact it had been retried, because the failure had no logs).
In order to reduce the total run time. The workaround/patch, before finding the real cause, is to set the retry_delay to 10 seconds (default is 5 mins, and it makes DAG run long unnessicssarily.)
Next step is to figure out what is causing this 1st failure thing, by checking logs on scheduler or woker pods in our current setup (azure K8s, postgresql, Redis, celery executor).
p.s. I used below DAG tested and get the conclusion.
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
import time
from pprint import pprint
dag_args = {
"owner": "min_test",
"retries": 1,
"email": ["c243d70b.domain.onmicrosoft.com#emea.teams.ms"],
"email_on_failure": True,
"email_on_retry": True,
"depends_on_past": False,
"start_date": airflow.utils.dates.days_ago(1),
}
pipeline_name = "min_test_debug_airflow_baseline_PythonOperator_1_retry"
dag = DAG(
dag_id=pipeline_name,
default_args=dag_args,
schedule_interval="0 3 * * *",
tags=["min_test_airflow"],
)
def my_sleeping_function(random_base):
"""This is a function that will run within the DAG execution"""
time.sleep(random_base)
def print_context(ds, **kwargs):
pprint(kwargs)
print(ds)
return "Whatever you return gets printed in the logs"
run_this = PythonOperator(
task_id="print_the_context",
provide_context=True,
python_callable=print_context,
dag=dag,
)
# Generate 3 sleeping tasks, sleeping from 0 to 2 seconds respectively
for i in range(3):
task = PythonOperator(
task_id="sleep_for_" + str(i),
python_callable=my_sleeping_function,
op_kwargs={"random_base": float(i) / 10},
dag=dag,
)
task.set_upstream(run_this)
I am adding a task to an airflow dag as follows:
examples_task = KubernetesPodOperator(
task_id='examples_generation',
dag=dag,
namespace='test',
image='test_amazon_image',
name='pipe-labelled-examples-generation-tf-record-operator',
env={
'GOOGLE_APPLICATION_CREDENTIALS': Variable.get('google_cloud_credentials')
},
arguments=[
"--assets_path", Variable.get('assets_path'),
"--folder_source", Variable.get('folder_source'),
"--folder_destination", Variable.get('folder_destination'),
"--gcs_folder_destination", Variable.get('gcs_folder_destination'),
"--aws_region", Variable.get('aws_region'),
"--s3_endpoint", Variable.get('s3_endpoint')
],
get_logs=True)
I thought I could paste the service account json file as a variable and call it but this doesn't work and airflow/google documentation isn't clear. How do you do this?
Solutions to port the json into an argument
examples_task = KubernetesPodOperator(
task_id='examples_generation',
dag=dag,
namespace='test',
image='test_amazon_image',
name='pipe-labelled-examples-generation-tf-record-operator',
arguments=[
"--folder_source", Variable.get('folder_source'),
"--folder_destination", Variable.get('folder_destination'),
"--gcs_folder_destination", Variable.get('gcs_folder_destination'),
"--aws_region", Variable.get('aws_region'),
"--s3_endpoint", Variable.get('s3_endpoint')
"--gcs_credentials", Variable.get('google_cloud_credentials')
],
get_logs=True)
then in the cli set
import json
from google.cloud import storage
from google.oauth2 import service_account
credentials = service_account.Credentials.from_service_account_info(json.loads(gcs_credentials))
client = storage.Client(project='project_id', credentials=credentials)
I'm having a confusion with KubernetesPodOperator from Airflow, and I'm wondering how to pass the load_users_into_table() function that it has a conn_id parameter stored in connection of Airflow in the Pod ?
In the official doc proposes to put the conn_id in Secret but I don't understand how can I pass it in my function load_users_into_table() after that.
https://airflow.apache.org/docs/stable/kubernetes.html
the function (task) to be executed in the pod:
def load_users_into_table(postgres_hook, schema, path):
gdf = read_csv(path)
gdf.to_sql('users', con=postgres_hook.get_sqlalchemy_engine(), schema=schema)
the dag:
_pg_hook = PostgresHook(postgres_conn_id = _conn_id)
with dag:
test = KubernetesPodOperator(
namespace=namespace,
image=image_name,
cmds=["python", "-c"],
arguments=[load_users_into_table],
labels={"dag-id": dag.dag_id},
name="airflow-test-pod",
task_id="task-1",
is_delete_operator_pod=True,
in_cluster=in_cluster,
get_logs=True,
config_file=config_file,
executor_config={
"KubernetesExecutor": {"request_memory": "512Mi",
"limit_memory": "1024Mi",
"request_cpu": "1",
"limit_cpu": "2"}
}
)
Assuming you want to run with K8sPodOperator, you can use argparse and add arguments to the docker cmd. Something in these lines should do the job:
import argparse
def f(arg):
print(arg)
parser = argparse.ArgumentParser()
parser.add_argument('--foo', help='foo help')
args = parser.parse_args()
if __name__ == '__main__':
f(args.foo)
Dockerfile:
FROM python:3
COPY main.py main.py
CMD ["python", "main.py", "--foo", "somebar"]
There are other ways to solve this such as using secrets, configMaps or even Airflow Variables, but this should get you moving forward.
Summary of my DAG:
I am using SSH Operator to SSH to an EC2 instance and run a JAR file which will connect to multiple DBs. I've declared the Airflow Connection in my DAG file and able to pass the variables into the EC2 instance. As you can see from below, I'm passing properties into JAVA command.
Airflow version - airflow-1-10.7
Package installed - apache-airflow[crypto]
from airflow import DAG
from datetime import datetime, timedelta
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.hooks.base_hook import BaseHook
from airflow.models.connection import Connection
ssh_hook = SSHHook(ssh_conn_id='ssh_to_ec2')
ssh_hook.no_host_key_check = True
redshift_connection = BaseHook.get_connection("my_redshift")
rs_user = redshift_connection.login
rs_password = redshift_connection.password
mongo_connection = BaseHook.get_connection("my_mongo")
mongo_user = mongo_connection.login
mongo_password = mongo_connection.password
default_args = {
'owner': 'AIRFLOW',
'start_date': datetime(2020, 4, 1, 0, 0),
'email': [],
'retries': 1,
}
dag = DAG('connect_to_redshift', default_args=default_args)
t00_00 = SSHOperator(
task_id='ssh_and_connect_db',
ssh_hook=ssh_hook,
command="java "
"-Drs_user={rs_user} -Drs_pass={rs_pass} "
"-Dmongo_user={mongo_user} -Dmongo_pass={mongo_pass} "
"-jar /home/airflow/root.jar".format(rs_user=rs_user,rs_pass=rs_pass,mongo_user=mongo_user,mongo_pass=mongo_pass),
dag=dag)
t00_00
Problem
The value for rs_pass,mongo_pass will be exposed in Rendered_Template/Airflow log which is not good and I would like to have a solution that can hide all these sensitive information from log and rendered template with SSH Operator.
So far I've tried to minimum the log verbose to ERROR in airflow.cfg, but it still shows in Rendered_Template.
Please enlighten me.
Thanks
I'm setting up an Airflow environment on Google Cloud Composer for testing. I've added some secrets to my namespace, and they show up fine:
$ kubectl describe secrets/eric-env-vars
Name: eric-env-vars
Namespace: eric-dev
Labels: <none>
Annotations: <none>
Type: Opaque
Data
====
VERSION_NUMBER: 6 bytes
I've referenced this secret in my DAG definition file (leaving out some code for brevity):
env_var_secret = Secret(
deploy_type='env',
deploy_target='VERSION_NUMBER',
secret='eric-env-vars',
key='VERSION_NUMBER',
)
dag = DAG('env_test', schedule_interval=None, start_date=start_date)
operator = KubernetesPodOperator(
name='k8s-env-var-test',
task_id='k8s-env-var-test',
dag=dag,
image='ubuntu:16.04',
cmds=['bash', '-cx'],
arguments=['env'],
config_file=os.environ['KUBECONFIG'],
namespace='eric-dev',
secrets=[env_var_secret],
)
But when I run this DAG, the VERSION_NUMBER env var isn't printed out. It doesn't look like it's being properly linked to the pod either (apologies for imprecise language, I am new to both Kubernetes and Airflow). This is from the Airflow task log of the pod creation response (also formatted for brevity/readability):
'env': [
{
'name': 'VERSION_NUMBER',
'value': None,
'value_from': {
'config_map_key_ref': None,
'field_ref': None,
'resource_field_ref': None,
'secret_key_ref': {
'key': 'VERSION_NUMBER',
'name': 'eric-env-vars',
'optional': None}
}
}
]
I'm assuming that we're somehow calling the constructor for the Secret wrong, but I am not entirely sure. Guidance appreciated!
Turns out this was a misunderstanding of the logs!
When providing an environment variable to a Kubernetes pod via a Secret, that value key in the API response is None because the value comes from the secret_key_ref.