Airflow Kubernetes / DAGs in different clusters - kubernetes

We have a few applications which span multiple AWS regions. So, rather than having multiple deployments of Airflow to handle our ETL tasks (one in each region), we would like to figure out if there is a way to have workers in different regions/clusters/namespaces.
Our Airflow deployment runs in EKS, so I'm guessing this would be a setting on the KubernetesPodOperator, if it's possible at all. I also don't see a way to specify a cluster via the DAG, but I'm hoping some of the geniuses here may have an idea.
Thanks in advance,
Bill

At the company I work for, we use the KubernetesPodOperator to run tasks in different namespaces.
KubernetesPodOperator has a parameter named 'namespace', which determines the namespace the pod will run in.
write_xcom = KubernetesPodOperator(
    namespace='default',
    image='alpine',
    cmds=["sh", "-c", "mkdir -p /airflow/xcom/;echo '[1,2,3,4]' > /airflow/xcom/return.json"],
    name="write-xcom",
    do_xcom_push=True,
    is_delete_operator_pod=True,
    in_cluster=True,
    task_id="write-xcom",
    get_logs=True,
)
When I searched for how to work in a different cluster, I saw that KubernetesPodOperator has a parameter named 'config_file' in the latest stable version. This value defaults to '~/.kube/config'.
Link: https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html
I haven't tried it myself, but it may be possible to work with different clusters through the 'config_file' parameter by creating a separate config file for each cluster. I'll keep following this thread for better solutions.
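For illustration, here is a minimal, untested sketch of that idea. It assumes a kubeconfig for the remote cluster has been mounted into the Airflow workers at a hypothetical path (/opt/airflow/kubeconfigs/eu-cluster.config) containing a hypothetical context named eu-cluster:

from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

run_in_remote_cluster = KubernetesPodOperator(
    task_id="run-in-remote-cluster",
    name="run-in-remote-cluster",
    namespace="default",
    image="alpine",
    cmds=["sh", "-c", "echo hello from the other cluster"],
    in_cluster=False,                                          # don't use the local pod's service account
    config_file="/opt/airflow/kubeconfigs/eu-cluster.config",  # hypothetical path, mounted into the workers
    cluster_context="eu-cluster",                              # hypothetical context name inside that kubeconfig
    is_delete_operator_pod=True,
    get_logs=True,
)

Each target cluster would get its own kubeconfig file (or context), and the DAG author picks the right one per task.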
I came across an example solution, shown below:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.configuration import conf
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2022, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

namespace = conf.get('kubernetes', 'NAMESPACE')

# This will detect the default namespace locally and read the
# environment namespace when deployed to Astronomer.
if namespace == 'default':
    config_file = '/usr/local/airflow/include/.kube/config'
    in_cluster = False
else:
    in_cluster = True
    config_file = None

dag = DAG('example_kubernetes_pod', schedule_interval='@once', default_args=default_args)

with dag:
    KubernetesPodOperator(
        namespace=namespace,
        image="hello-world",
        labels={"<pod-label>": "<label-name>"},
        name="airflow-test-pod",
        task_id="task-one",
        in_cluster=in_cluster,  # if set to True, looks in the cluster; if False, looks for the config file
        cluster_context="docker-desktop",  # ignored when in_cluster is set to True
        config_file=config_file,
        is_delete_operator_pod=True,
        get_logs=True,
    )
For details, you can refer to this link: https://docs.astronomer.io/software/kubepodoperator-local

Related

Restoring an AWS documentdb snapshot with terraform

I am unsure how to restore an AWS documentdb cluster that is managed by terraform.
My terraform setup looks like this:
resource "aws_docdb_cluster" "this" {
cluster_identifier = var.env_name
engine = "docdb"
engine_version = "4.0.0"
master_username = "USERNAME"
master_password = random_password.this.result
db_cluster_parameter_group_name = aws_docdb_cluster_parameter_group.this.name
availability_zones = ["us-east-1a", "us-east-1b", "us-east-1c"]
db_subnet_group_name = aws_docdb_subnet_group.this.name
deletion_protection = true
backup_retention_period = 7
preferred_backup_window = "07:00-09:00"
skip_final_snapshot = false
# Added on 6.25.22 to rollback an incorrect application of the namespace
# migration, which occurred at 2AM EST on June 23.
snapshot_identifier = "...the arn for the snapshot..."
}
resource "aws_docdb_cluster_instance" "this_2a" {
count = 1
engine = "docdb"
availability_zone = "us-east-1a"
auto_minor_version_upgrade = true
cluster_identifier = aws_docdb_cluster.this.id
instance_class = "db.r5.large"
}
resource "aws_docdb_cluster_instance" "this_2b" {
count = 1
engine = "docdb"
availability_zone = "us-east-1b"
auto_minor_version_upgrade = true
cluster_identifier = aws_docdb_cluster.this.id
instance_class = "db.r5.large"
}
resource "aws_docdb_subnet_group" "this" {
name = var.env_name
subnet_ids = module.vpc.private_subnets
}
I added the snapshot_identifier parameter and applied it, expecting a rollback. However, this did not have the intended effect of restoring documentdb state to its settings on June 23rd. (As far as I can tell, nothing changed at all)
I wanted to avoid using the AWS console approach (described here) because that creates a new cluster which won't be tracked by terraform.
What is the proper way of accomplishing this rollback using terraform?
The snapshot_identifier parameter is only used when Terraform creates a new cluster. Setting it after the cluster has been created just tells Terraform "If you ever have to recreate this cluster, use this snapshot".
To actually get Terraform to recreate the cluster you would need to do something else to make Terraform think the cluster needs to be recreated. Possible options are:
Run terraform taint aws_docdb_cluster.this to signal to Terraform that the resource needs to be recreated. It will then recreate it the next time you run terraform apply, as sketched after this list.
Delete the cluster through some other means, like the AWS console, and then run terraform apply.
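A minimal sketch of the taint route, using the resource address from the configuration above (note that deletion_protection = true in that configuration will block the destroy step until it is disabled):

terraform taint aws_docdb_cluster.this
terraform apply

# Terraform 0.15.2+ can do the same in one step, without tainting:
terraform apply -replace=aws_docdb_cluster.this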
The general approach is this, but I have no experience with DocumentDB. Hope this helps.
0. Take a backup of your Terraform state file: terraform state pull > backup_state_file_timestamp.json
1. Restore through the console to the point in time you want.
2. Remove the old instances and cluster from your Terraform state file:
terraform state rm aws_docdb_cluster_instance.this_2a
terraform state rm aws_docdb_cluster_instance.this_2b
terraform state rm aws_docdb_cluster.this
3. Import the manually restored cluster and instances into Terraform:
terraform import aws_docdb_cluster.this cluster_identifier
terraform import aws_docdb_cluster_instance.this_2a identifier
terraform import aws_docdb_cluster_instance.this_2b identifier
(see the import section at the bottom of https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/docdb_cluster_instance and https://registry.terraform.io/providers/hashicorp/aws/latest/docs/resources/docdb_cluster)

Airflow- dag_id could not be found issue when using kubernetes executor

I am using the Airflow stable helm chart with the Kubernetes Executor. A new pod gets scheduled for the DAG, but it fails with a "dag_id could not be found" error. I am using git-sync to fetch the DAGs. Below are the error and the Kubernetes config values. Can someone please help me resolve this issue?
Error:
[2020-07-01 23:18:36,939] {__init__.py:51} INFO - Using executor LocalExecutor
[2020-07-01 23:18:36,940] {dagbag.py:396} INFO - Filling up the DagBag from /opt/airflow/dags/dags/etl/sampledag_dag.py
Traceback (most recent call last):
  File "/home/airflow/.local/bin/airflow", line 37, in <module>
    args.func(args)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/utils/cli.py", line 75, in wrapper
    return f(*args, **kwargs)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/bin/cli.py", line 523, in run
    dag = get_dag(args)
  File "/home/airflow/.local/lib/python3.6/site-packages/airflow/bin/cli.py", line 149, in get_dag
    'parse.'.format(args.dag_id))
airflow.exceptions.AirflowException: dag_id could not be found: sampledag . Either the dag did not exist or it failed to parse.
Config:
AIRFLOW__KUBERNETES__DELETE_WORKER_PODS: false
AIRFLOW__KUBERNETES__GIT_REPO: git@git.com/dags.git
AIRFLOW__KUBERNETES__GIT_BRANCH: master
AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT: /dags
AIRFLOW__KUBERNETES__GIT_SSH_KEY_SECRET_NAME: git-secret
AIRFLOW__KUBERNETES__WORKER_CONTAINER_REPOSITORY: airflow-repo
AIRFLOW__KUBERNETES__WORKER_CONTAINER_TAG: tag
AIRFLOW__KUBERNETES__RUN_AS_USER: "50000"
sampledag
import logging
import datetime
from airflow import models
from airflow.contrib.operators import kubernetes_pod_operator
import os

args = {
    'owner': 'airflow'
}

YESTERDAY = datetime.datetime.now() - datetime.timedelta(days=1)

try:
    print("Entered try block")
    with models.DAG(
            dag_id='sampledag',
            schedule_interval=datetime.timedelta(days=1),
            start_date=YESTERDAY) as dag:
        print("Initialized dag")
        kubernetes_min_pod = kubernetes_pod_operator.KubernetesPodOperator(
            # The ID specified for the task.
            task_id='trigger-task',
            # Name of task you want to run, used to generate Pod ID.
            name='trigger-name',
            namespace='scheduler',
            in_cluster=True,
            cmds=["./docker-run.sh"],
            is_delete_operator_pod=False,
            image='imagerepo:latest',
            image_pull_policy='Always',
            dag=dag)
        print("done")
except Exception as e:
    print(str(e))
    logging.error("Error at {}, error={}".format(__file__, str(e)))
    raise
I had the same issue. I solved it by adding the following to my config:
AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH: repo/
What was happening is that the init container downloads your DAGs into [AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT]/[AIRFLOW__KUBERNETES__GIT_SYNC_DEST], and AIRFLOW__KUBERNETES__GIT_SYNC_DEST defaults to repo (https://airflow.apache.org/docs/stable/configurations-ref.html#git-sync-dest).
I am guessing the problem comes from the mismatch in your setup between /opt/airflow/dags/dags/etl/sampledag_dag.py and AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT: /dags.
I'd double check that these are what you want, and are what you expect.
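To illustrate (my sketch, assuming the default GIT_SYNC_DEST of repo together with the settings above):

AIRFLOW__KUBERNETES__GIT_DAGS_FOLDER_MOUNT_POINT: /dags
AIRFLOW__KUBERNETES__GIT_SYNC_DEST: repo               # the default
AIRFLOW__KUBERNETES__DAGS_VOLUME_SUBPATH: repo/
# git-sync clones the repository into /dags/repo/, so a DAG file ends up at
# /dags/repo/<path-in-repo>/sampledag_dag.py inside the pod; without the repo/
# subpath, Airflow looks directly under /dags/ and cannot find the dag_id.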
I was facing the same issue while trying to use the Kubernetes Executor with the stable Airflow helm chart. In my case, I was able to resolve it by changing
AIRFLOW__KUBERNETES__RUN_AS_USER: "50000" to AIRFLOW__KUBERNETES__GIT_SYNC_RUN_AS_USER: "65533" in the env section of the helm chart.
The same value is mentioned in this link.
I came to this conclusion because the git-sync init container, which runs before the temporary worker pod comes up, was not able to clone/sync the git DAGs to the worker pods. In my case, there was a permissions error (even though the kube secret for the ssh clone was passed correctly).
Getting the same issue, I solved it with the suggestion from @gtrip to set the UID of the git-sync run user to 65533.
I would add the following debug hints:
the git-sync init container returns no error even if it fails to fetch the DAGs
Kubernetes debugging information for init containers
kubectl get pods -n [NAMESPACE]
kubectl logs -n [NAMESPACE] [POD_ID] -c git-sync

How to send logs to Cloudwatch from a background process running in AWS Fargate containers?

I'm using Fargate. My container runs two processes: a Celery worker in the background and Django in the foreground. The foreground process emits logs to stdout, so AWS takes care of sending the Django logs to the relevant CloudWatch Log Group and Stream.
Since the Celery worker runs in the background, how do I send its logs to CloudWatch (to a different Log Stream within the same Log Group)?
If there's no way to move the second process to a separate container and log it as usual, you may install the awslogs package in the container and set it up to read the background process's log files and send their content to CloudWatch.
But I'd not recommend such an approach.
Again, this is not necessarily a Fargate-specific issue or question. For logging in Celery, check this out: http://docs.celeryproject.org/en/latest/userguide/tasks.html#logging.
The worker won't update the redirection if you create a logger instance somewhere in your task or task module.
If you want to redirect sys.stdout and sys.stderr to a custom logger you have to enable this manually, for example:
import sys

from celery.utils.log import get_task_logger

logger = get_task_logger(__name__)

@app.task(bind=True)  # `app` is your Celery application instance
def add(self, x, y):
    old_outs = sys.stdout, sys.stderr
    rlevel = self.app.conf.worker_redirect_stdouts_level
    try:
        self.app.log.redirect_stdouts_to_logger(logger, rlevel)
        print('Adding {0} + {1}'.format(x, y))
        return x + y
    finally:
        sys.stdout, sys.stderr = old_outs
And for logging with Fargate, I would use the awslogs log driver. You can configure it either in the ECS console or in a CloudFormation template, as documented here: https://docs.aws.amazon.com/AmazonECS/latest/developerguide/using_awslogs.html
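As a rough sketch of what that configuration boils down to (mine, not the original answer's; shown via boto3 with placeholder names, ARNs and region rather than in the console or CloudFormation):

import boto3

ecs = boto3.client("ecs", region_name="us-east-1")  # placeholder region

ecs.register_task_definition(
    family="my-app",  # placeholder family name
    requiresCompatibilities=["FARGATE"],
    networkMode="awsvpc",
    cpu="512",
    memory="1024",
    executionRoleArn="arn:aws:iam::123456789012:role/ecsTaskExecutionRole",  # placeholder
    containerDefinitions=[
        {
            "name": "web",
            "image": "123456789012.dkr.ecr.us-east-1.amazonaws.com/my-app:latest",  # placeholder
            "essential": True,
            # awslogs driver: the container's stdout/stderr go to CloudWatch Logs
            "logConfiguration": {
                "logDriver": "awslogs",
                "options": {
                    "awslogs-group": "/ecs/my-app",    # log group (placeholder)
                    "awslogs-region": "us-east-1",
                    "awslogs-stream-prefix": "web",    # prefix of the resulting log stream name
                },
            },
        },
    ],
)

With awslogs-stream-prefix set, each container's stdout/stderr lands in a stream named prefix/container-name/task-id inside the given log group.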
Like @OK999 said, Celery is designed to swallow logs, whether it's on Fargate or not. We ended up using a Django LOGGING config like:
LOGGING = {
    'version': 1,
    # This only "disables" but the loggers don't propagate
    # 'disable_existing_loggers': False,
    ...
    'handlers': {
        'console': {
            'level': env.str('LOGGING_LEVEL'),
            'class': 'logging.StreamHandler',
            'formatter': 'verbose',
        },
    },
    'loggers': {
        ...
        # celery won't route logs to console without this
        'celery': {
            # filtered at the handler
            'level': logging.DEBUG,
            'handlers': ['console'],
        },
        ...
    },
}
We had to make this change long before transitioning to Fargate.

Amazon EKS: generate/update kubeconfig via python script

When using Amazon's K8s offering, the EKS service, at some point you need to connect the Kubernetes API and configuration to the infrastructure established within AWS. In particular, we need a kubeconfig with proper credentials and URLs to connect to the k8s control plane provided by EKS.
The Amazon commandline tool aws provides a routine for this task
aws eks update-kubeconfig --kubeconfig /path/to/kubecfg.yaml --name <EKS-cluster-name>
Question: how can the same be done through Python/boto3?
When looking at the Boto API documentation, I seem to be unable to spot the equivalent of the above-mentioned aws routine. Maybe I am looking in the wrong place.
is there a ready-made function in boto to achieve this?
otherwise how would you approach this directly within python (other than calling out to aws in a subprocess)?
There isn't a method function to do this, but you can build the configuration file yourself like this:
import boto3
import yaml

# region, cluster_name and config_file are assumed to be defined elsewhere

# Set up the client
s = boto3.Session(region_name=region)
eks = s.client("eks")

# get cluster details
cluster = eks.describe_cluster(name=cluster_name)
cluster_cert = cluster["cluster"]["certificateAuthority"]["data"]
cluster_ep = cluster["cluster"]["endpoint"]

# build the cluster config hash
cluster_config = {
    "apiVersion": "v1",
    "kind": "Config",
    "clusters": [
        {
            "cluster": {
                "server": str(cluster_ep),
                "certificate-authority-data": str(cluster_cert)
            },
            "name": "kubernetes"
        }
    ],
    "contexts": [
        {
            "context": {
                "cluster": "kubernetes",
                "user": "aws"
            },
            "name": "aws"
        }
    ],
    "current-context": "aws",
    "preferences": {},
    "users": [
        {
            "name": "aws",
            "user": {
                "exec": {
                    "apiVersion": "client.authentication.k8s.io/v1alpha1",
                    "command": "heptio-authenticator-aws",
                    "args": [
                        "token", "-i", cluster_name
                    ]
                }
            }
        }
    ]
}

# Write it out as YAML.
config_text = yaml.dump(cluster_config, default_flow_style=False)
open(config_file, "w").write(config_text)
This is explained in the Create kubeconfig manually section of https://docs.aws.amazon.com/eks/latest/userguide/create-kubeconfig.html, which is in fact referenced from the boto3 EKS docs. The manual method there is very similar to @jaxxstorm's answer, except that it doesn't show the Python code you would need; on the other hand, it also does not assume the Heptio authenticator (it shows both the token and IAM authenticator approaches).
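If you'd rather not depend on heptio-authenticator-aws, one option (my sketch, not from the linked page) is to swap the exec section of the cluster_config dictionary above for the aws CLI's eks get-token command, mirroring the users section that aws eks update-kubeconfig itself generates (see the kubeconfig excerpt further down in this thread):

# Alternative "users" entry for cluster_config above, using the aws CLI instead of
# heptio-authenticator-aws; assumes the aws CLI (and credentials) are available
# wherever the kubeconfig ends up being used.
cluster_config["users"] = [
    {
        "name": "aws",
        "user": {
            "exec": {
                "apiVersion": "client.authentication.k8s.io/v1beta1",
                "command": "aws",
                "args": ["--region", region, "eks", "get-token", "--cluster-name", cluster_name],
            }
        }
    }
]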
I faced the same problem and decided to implement it as a Python package.
it can be installed via
pip install eks-token
and then simply do
from eks_token import get_token
response = get_token(cluster_name='<value>')
More details and examples here
Amazon's aws tool is included in the python package awscli, so one option is to add awscli as a python dependency and just call it from python. The code below assumes that kubectl is installed (but you can remove the test if you want).
kubeconfig depends on ~/.aws/credentials
One challenge here is that the kubeconfig file generated by aws has a users section like this:
users:
- name: arn:aws:eks:someregion:1234:cluster/somecluster
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - --region
      - someregion
      - eks
      - get-token
      - --cluster-name
      - somecluster
      command: aws
So if you mount it into a container or move it to a different machine, you'll get this error when you try to use it:
Unable to locate credentials. You can configure credentials by running "aws configure".
Based on that user section, kubectl is running aws eks get-token and it's failing because the ~/.aws dir doesn't have the credentials that it had when the kubeconfig file was generated.
You could get around this by also staging the ~/.aws dir everywhere you want to use the kubeconfig file, but I have automation that takes a lone kubeconfig file as a parameter, so I'll be modifying the user section to include the necessary secrets as env vars.
Be aware that this makes it possible for whoever gets that kubeconfig file to use the secrets we've included for other things. Whether this is a problem will depend on how much power your aws user has.
Assume Role
If your cluster uses RBAC, you might need to specify which role you want for your kubeconfig file. The code below does this by first generating a separate set of creds and then using them to generate the kubeconfig file.
Role assumption has a timeout (I'm using 12 hours below), so you'll need to call the script again if you can't manage your mischief before the token times out.
The Code
You can generate the file like:
pip install awscli boto3 pyyaml sh
python mkkube.py > kubeconfig
...if you put the following in mkkube.py
from pathlib import Path
from tempfile import TemporaryDirectory
from time import time

import boto3
import yaml
from sh import aws, sh

aws_access_key_id = "AKREDACTEDAT"
aws_secret_access_key = "ubREDACTEDaE"
role_arn = "arn:aws:iam::1234:role/some-role"
cluster_name = "mycluster"
region_name = "someregion"

# assume a role that has access
sts = boto3.client(
    "sts",
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
assumed = sts.assume_role(
    RoleArn=role_arn,
    RoleSessionName="mysession-" + str(int(time())),
    DurationSeconds=(12 * 60 * 60),  # 12 hrs
)

# these will be different than the ones you started with
credentials = assumed["Credentials"]
access_key_id = credentials["AccessKeyId"]
secret_access_key = credentials["SecretAccessKey"]
session_token = credentials["SessionToken"]

# make sure our cluster actually exists
eks = boto3.client(
    "eks",
    aws_session_token=session_token,
    aws_access_key_id=access_key_id,
    aws_secret_access_key=secret_access_key,
    region_name=region_name,
)
clusters = eks.list_clusters()["clusters"]
if cluster_name not in clusters:
    raise RuntimeError(f"configured cluster: {cluster_name} not found among {clusters}")

with TemporaryDirectory() as kube:
    kubeconfig_path = Path(kube) / "config"

    # let awscli generate the kubeconfig
    result = aws(
        "eks",
        "update-kubeconfig",
        "--name",
        cluster_name,
        _env={
            "AWS_ACCESS_KEY_ID": access_key_id,
            "AWS_SECRET_ACCESS_KEY": secret_access_key,
            "AWS_SESSION_TOKEN": session_token,
            "AWS_DEFAULT_REGION": region_name,
            "KUBECONFIG": str(kubeconfig_path),
        },
    )

    # read the generated file
    with open(kubeconfig_path, "r") as f:
        kubeconfig_str = f.read()
    kubeconfig = yaml.load(kubeconfig_str, Loader=yaml.SafeLoader)

    # the generated kubeconfig assumes that upon use it will have access to
    # `~/.aws/credentials`, but maybe this filesystem is ephemeral,
    # so add the creds as env vars on the aws command in the kubeconfig
    # so that even if the kubeconfig is separated from ~/.aws it is still
    # useful
    users = kubeconfig["users"]
    for i in range(len(users)):
        kubeconfig["users"][i]["user"]["exec"]["env"] = [
            {"name": "AWS_ACCESS_KEY_ID", "value": access_key_id},
            {"name": "AWS_SECRET_ACCESS_KEY", "value": secret_access_key},
            {"name": "AWS_SESSION_TOKEN", "value": session_token},
        ]

    # write the updates to disk
    with open(kubeconfig_path, "w") as f:
        f.write(yaml.dump(kubeconfig))

    awsclipath = str(Path(sh("-c", "which aws").stdout.decode()).parent)
    kubectlpath = str(Path(sh("-c", "which kubectl").stdout.decode()).parent)
    pathval = f"{awsclipath}:{kubectlpath}"

    # test the modified file without a ~/.aws/ dir
    # this will throw an exception if we can't talk to the cluster
    sh(
        "-c",
        "kubectl cluster-info",
        _env={
            "KUBECONFIG": str(kubeconfig_path),
            "PATH": pathval,
            "HOME": "/no/such/path",
        },
    )

print(yaml.dump(kubeconfig))

Dynamically Creating DAG based on Row available on DB Connection

I want to create DAGs dynamically from a database table query. Dynamically creating DAGs from a fixed numeric range, or based on objects available in the Airflow settings, works fine. However, when I use a PostgresHook and create a DAG for each row of my table, I can see a new DAG generated whenever I add a new row to the table, but I can't click the newly created DAG in the Airflow webserver UI. For more context, I'm using Google Cloud Composer. I already followed the steps mentioned in "DAGs not clickable on Google Cloud Composer webserver, but working fine on a local Airflow", but it's still not working in my case.
Here's my code
from datetime import datetime, timedelta
from airflow import DAG
import psycopg2
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from psycopg2.extras import NamedTupleCursor
import os

default_args = {
    "owner": "debug",
    "depends_on_past": False,
    "start_date": datetime(2018, 10, 17),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

def create_dag(dag_id,
               schedule,
               default_args):
    def hello_world_py(*args):
        print('Hello from DAG: {}'.format(dag_id))

    dag = DAG(dag_id,
              schedule_interval=timedelta(days=1),
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id=dag_id,
            python_callable=hello_world_py,
            dag_id=dag_id)

    return dag

dag = DAG("dynamic_yolo_pg_", default_args=default_args,
          schedule_interval=timedelta(hours=1))

"""
Behavior:
Create an exact DAG which in turn will create its own file
https://www.astronomer.io/guides/dynamically-generating-dags/
"""
pg_hook = PostgresHook(postgres_conn_id='some_db')
conn = pg_hook.get_conn()
cursor = conn.cursor(cursor_factory=NamedTupleCursor)
cursor.execute("SELECT * FROM airflow_test_command;")
commands = cursor.fetchall()

for command in commands:
    dag_id = command.id
    schedule = timedelta(days=1)
    id = "dynamic_yolo_" + str(dag_id)
    print(id)
    globals()[id] = create_dag(id,
                               schedule,
                               default_args)
Best,
This can be solved by using a self-managed Airflow webserver, following the steps mentioned in [1]. After you do this, if you decide to add authentication in front of your self-managed webserver, then once you have created the ingress, your BackendServices should appear on the Google IAP console and you can enable IAP. In case you want to access your Airflow instance programmatically, you can also use JWT authentication with a service account for your self-managed Airflow webserver [2].
[1] https://cloud.google.com/composer/docs/how-to/managing/deploy-webserver
[2] https://cloud.google.com/iap/docs/authentication-howto