Dynamically creating DAGs based on rows available from a DB connection (PostgreSQL)

I want to create DAGs dynamically from a database table query. When I generate DAGs dynamically from a fixed range of numbers, or from objects available in the Airflow settings, it works. However, when I use a PostgresHook and create one DAG per row of my table, I can see a new DAG appear whenever I add a row to the table, but the newly created DAGs are not clickable in the Airflow webserver UI. For context, I'm using Google Cloud Composer. I already followed the steps mentioned in "DAGs not clickable on Google Cloud Composer webserver, but working fine on a local Airflow", but it is still not working in my case.
Here's my code
from datetime import datetime, timedelta
from airflow import DAG
import psycopg2
from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.bash_operator import BashOperator
from airflow.operators.python_operator import PythonOperator
from psycopg2.extras import NamedTupleCursor
import os
default_args = {
    "owner": "debug",
    "depends_on_past": False,
    "start_date": datetime(2018, 10, 17),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=5),
    # 'queue': 'bash_queue',
    # 'pool': 'backfill',
    # 'priority_weight': 10,
    # 'end_date': datetime(2016, 1, 1),
}

def create_dag(dag_id,
               schedule,
               default_args):
    def hello_world_py(*args):
        print('Hello from DAG: {}'.format(dag_id))

    dag = DAG(dag_id,
              schedule_interval=timedelta(days=1),
              default_args=default_args)

    with dag:
        t1 = PythonOperator(
            task_id=dag_id,
            python_callable=hello_world_py)

    return dag

dag = DAG("dynamic_yolo_pg_", default_args=default_args,
          schedule_interval=timedelta(hours=1))
"""
Bahavior:
Create an exact DAG which in turn will create it's own file
https://www.astronomer.io/guides/dynamically-generating-dags/
"""
pg_hook = PostgresHook(postgres_conn_id='some_db')
conn = pg_hook.get_conn()
cursor = conn.cursor(cursor_factory=NamedTupleCursor)
cursor.execute("SELECT * FROM airflow_test_command;")
commands = cursor.fetchall()
for command in commands:
    dag_id = command.id
    schedule = timedelta(days=1)
    id = "dynamic_yolo_" + str(dag_id)
    print(id)
    globals()[id] = create_dag(id,
                               schedule,
                               default_args)
Best,

This can be solved by running a self-managed Airflow webserver, following the steps in [1]. After you do this, if you decide to add authentication in front of your self-managed webserver, then once you have created the ingress, your BackendServices should appear in the Google IAP console and you can enable IAP. If you want to access Airflow programmatically, you can also use JWT authentication with a service account against your self-managed Airflow webserver [2].
[1] https://cloud.google.com/composer/docs/how-to/managing/deploy-webserver
[2] https://cloud.google.com/iap/docs/authentication-howto
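For the programmatic path in [2], the usual pattern is to mint an OpenID Connect ID token for the IAP OAuth client from a service account key and send it as a Bearer token. Below is a rough sketch using the google-auth library; the client ID, key path, webserver URL and endpoint are placeholders, not values from the original setup.
from google.oauth2 import service_account
from google.auth.transport.requests import AuthorizedSession

IAP_CLIENT_ID = "<oauth-client-id>.apps.googleusercontent.com"  # placeholder OAuth client used by IAP
WEBSERVER_URL = "https://airflow.example.internal"              # placeholder self-managed webserver URL

# Build credentials that produce an ID token audience-bound to the IAP client
credentials = service_account.IDTokenCredentials.from_service_account_file(
    "/path/to/service-account.json",  # placeholder key file
    target_audience=IAP_CLIENT_ID,
)

# AuthorizedSession refreshes the ID token and attaches it as an Authorization: Bearer header
session = AuthorizedSession(credentials)
resp = session.get(WEBSERVER_URL + "/api/experimental/test")  # placeholder endpoint
print(resp.status_code, resp.text)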


Airflow Kubernetes / DAGs in different clusters

We have a few applications which span multiple AWS regions. So, rather than having multiple deployments of Airflow to handle our ETL tasks (one in each region), we would like to figure out if there is a way to have workers in different regions/clusters/namespaces.
Our Airflow deployment runs in EKS, so I'm guessing this would maybe be a setting in the KubernetesPodOperator if at all. I also don't see a way to specify a cluster via the DAG but I'm hoping some of the geniuses here may have an idea.
Thanks in advance,
Bill
At the company I work for, we use the KubernetesPodOperator to run jobs in different namespaces. It has a parameter named 'namespace' that determines which namespace the pod will run in. For example:
write_xcom = KubernetesPodOperator(
    namespace='default',
    image='alpine',
    cmds=["sh", "-c", "mkdir -p /airflow/xcom/;echo '[1,2,3,4]' > /airflow/xcom/return.json"],
    name="write-xcom",
    do_xcom_push=True,
    is_delete_operator_pod=True,
    in_cluster=True,
    task_id="write-xcom",
    get_logs=True,
)
When I searched for a way to work with a different cluster, I saw that KubernetesPodOperator has a parameter named 'config_file' in the latest stable version. It defaults to '~/.kube/config'.
Link: https://airflow.apache.org/docs/apache-airflow-providers-cncf-kubernetes/stable/operators.html
I haven't tried it myself, but it may be possible to work with different clusters by pointing the 'config_file' parameter at different kubeconfig files. I'll keep following this thread for better solutions.
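As an untested sketch of what that could look like: one DAG with two pod tasks, each pointed at a different kubeconfig file mounted into the Airflow workers (the file paths, namespace, and cluster names here are made up for illustration).
from datetime import datetime

from airflow import DAG
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

with DAG("multi_cluster_example", start_date=datetime(2022, 1, 1), schedule_interval=None) as dag:
    run_in_us = KubernetesPodOperator(
        task_id="run-in-us-cluster",
        name="run-in-us-cluster",
        namespace="etl",
        image="alpine",
        cmds=["echo", "hello from the us cluster"],
        in_cluster=False,
        config_file="/opt/airflow/kube/us-cluster.config",  # hypothetical mounted kubeconfig
        get_logs=True,
    )
    run_in_eu = KubernetesPodOperator(
        task_id="run-in-eu-cluster",
        name="run-in-eu-cluster",
        namespace="etl",
        image="alpine",
        cmds=["echo", "hello from the eu cluster"],
        in_cluster=False,
        config_file="/opt/airflow/kube/eu-cluster.config",  # hypothetical mounted kubeconfig
        get_logs=True,
    )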
I came across an example solution, shown below.
from datetime import datetime, timedelta
from airflow import DAG
from airflow.configuration import conf
from airflow.providers.cncf.kubernetes.operators.kubernetes_pod import KubernetesPodOperator

default_args = {
    'owner': 'airflow',
    'depends_on_past': False,
    'start_date': datetime(2022, 1, 1),
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}

namespace = conf.get('kubernetes', 'NAMESPACE')

# This will detect the default namespace locally and read the
# environment namespace when deployed to Astronomer.
if namespace == 'default':
    config_file = '/usr/local/airflow/include/.kube/config'
    in_cluster = False
else:
    in_cluster = True
    config_file = None

dag = DAG('example_kubernetes_pod', schedule_interval='@once', default_args=default_args)

with dag:
    KubernetesPodOperator(
        namespace=namespace,
        image="hello-world",
        labels={"<pod-label>": "<label-name>"},
        name="airflow-test-pod",
        task_id="task-one",
        in_cluster=in_cluster,  # if set to true, will look in the cluster, if false, looks for file
        cluster_context="docker-desktop",  # is ignored when in_cluster is set to True
        config_file=config_file,
        is_delete_operator_pod=True,
        get_logs=True,
    )
For details, you can refer to this link: https://docs.astronomer.io/software/kubepodoperator-local

How to use Flask-Babel gettext in Celery?

I have a Flask + Celery setup with Flask-Babel translating my texts. I can't translate inside Celery tasks. I believe this is because the worker doesn't know the current language (and I'm not sure it could even if it did), since Celery doesn't have access to the request context (from what I understood)...
What would be the solution to be able to translate?
You rightly pointed out the issue: Celery doesn't have access to the request context, which means flask_babelex.get_locale returns None. You can use a force_locale context manager, like the one available in Flask-Babel, which provides a dummy request context.
from contextlib import contextmanager
from flask import current_app
from babel import Locale
from ..config import SERVER_NAME, PREFERRED_URL_SCHEME

@contextmanager
def force_locale(locale=None):
    if not locale:
        yield
        return
    env = {
        'wsgi.url_scheme': PREFERRED_URL_SCHEME,
        'SERVER_NAME': SERVER_NAME,
        'SERVER_PORT': '',
        'REQUEST_METHOD': ''
    }
    with current_app.request_context(env) as ctx:
        ctx.babel_locale = Locale.parse(locale)
        yield
Sample Celery Task
@celery.task()
def some_task(user_id):
    user = User.objects.get(id=user_id)
    with force_locale(user.locale):
        ...gettext('TranslationKey')...
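Not from the original answer, but for completeness, one way to hand the locale to the worker explicitly is to capture it while a request context is still active and pass it as a task argument. This sketch assumes the Celery app and the force_locale helper above are importable (the import paths are hypothetical) and uses flask_babel's get_locale.
from flask import Flask
from flask_babel import get_locale, gettext

# `celery` (the Celery app) and `force_locale` are assumed to come from the
# application modules shown above; these import paths are hypothetical.
from myapp.extensions import celery
from myapp.i18n import force_locale

app = Flask(__name__)

@app.route("/notify/<int:user_id>")
def notify(user_id):
    # Capture the locale while the request context exists and pass it along.
    send_notification.delay(user_id, str(get_locale()))
    return "queued", 202

@celery.task()
def send_notification(user_id, locale):
    # Inside the worker there is no request context, so force the captured locale.
    with force_locale(locale):
        print(gettext('TranslationKey'))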

Amazon EKS: generate/update kubeconfig via python script

When using Amazon's K8s offering, the EKS service, at some point you need to connect the Kubernetes API and configuration to the infrastructure established within AWS. In particular, we need a kubeconfig with the proper credentials and URLs to connect to the k8s control plane provided by EKS.
The Amazon commandline tool aws provides a routine for this task
aws eks update-kubeconfig --kubeconfig /path/to/kubecfg.yaml --name <EKS-cluster-name>
Question: how do I do the same through Python/boto3?
When looking at the Boto API documentation, I seem to be unable to spot the equivalent of the above-mentioned aws routine. Maybe I am looking in the wrong place.
Is there a ready-made function in boto3 to achieve this?
Otherwise, how would you approach this directly within Python (other than calling out to aws in a subprocess)?
There isn't a ready-made method for this, but you can build the configuration file yourself like this:
import boto3
import yaml

# region, cluster_name and config_file are assumed to be defined elsewhere

# Set up the client
s = boto3.Session(region_name=region)
eks = s.client("eks")

# get cluster details
cluster = eks.describe_cluster(name=cluster_name)
cluster_cert = cluster["cluster"]["certificateAuthority"]["data"]
cluster_ep = cluster["cluster"]["endpoint"]

# build the cluster config hash
cluster_config = {
    "apiVersion": "v1",
    "kind": "Config",
    "clusters": [
        {
            "cluster": {
                "server": str(cluster_ep),
                "certificate-authority-data": str(cluster_cert)
            },
            "name": "kubernetes"
        }
    ],
    "contexts": [
        {
            "context": {
                "cluster": "kubernetes",
                "user": "aws"
            },
            "name": "aws"
        }
    ],
    "current-context": "aws",
    "preferences": {},
    "users": [
        {
            "name": "aws",
            "user": {
                "exec": {
                    "apiVersion": "client.authentication.k8s.io/v1alpha1",
                    "command": "heptio-authenticator-aws",
                    "args": [
                        "token", "-i", cluster_name
                    ]
                }
            }
        }
    ]
}

# Write in YAML.
config_text = yaml.dump(cluster_config, default_flow_style=False)
with open(config_file, "w") as f:
    f.write(config_text)
This is explained in the "Create kubeconfig manually" section of https://docs.aws.amazon.com/eks/latest/userguide/create-kubeconfig.html, which is in fact referenced from the boto3 EKS docs. The manual method there is very similar to @jaxxstorm's answer, except that it doesn't show the Python code you would need; on the other hand, it also does not assume the heptio authenticator (it shows both the token and IAM authenticator approaches).
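For reference, switching the dict above from the heptio authenticator to the token approach from that page roughly means replacing the "users" entry with something like the following (a sketch; region and cluster_name are the same variables used earlier):
# Alternative "users" entry using `aws eks get-token` instead of heptio-authenticator-aws
cluster_config["users"] = [
    {
        "name": "aws",
        "user": {
            "exec": {
                "apiVersion": "client.authentication.k8s.io/v1beta1",
                "command": "aws",
                "args": ["--region", region, "eks", "get-token", "--cluster-name", cluster_name],
            }
        },
    }
]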
I faced the same problem and decided to implement it as a Python package.
it can be installed via
pip install eks-token
and then simply do
from eks_token import get_token
response = get_token(cluster_name='<value>')
More details and examples here
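As a hedged sketch of how such a token might be wired into the official kubernetes Python client, assuming get_token returns the same ExecCredential-style payload as aws eks get-token (with the bearer token under status.token):
import base64
import tempfile

import boto3
from eks_token import get_token
from kubernetes import client as k8s  # pip install kubernetes

cluster_name = "<value>"  # same placeholder as above

# Endpoint and CA come from describe_cluster, the bearer token from eks_token
cluster = boto3.client("eks").describe_cluster(name=cluster_name)["cluster"]
token = get_token(cluster_name=cluster_name)["status"]["token"]

# The client wants the CA certificate as a file on disk
ca_file = tempfile.NamedTemporaryFile(delete=False, suffix=".crt")
ca_file.write(base64.b64decode(cluster["certificateAuthority"]["data"]))
ca_file.close()

conf = k8s.Configuration()
conf.host = cluster["endpoint"]
conf.ssl_ca_cert = ca_file.name
conf.api_key["authorization"] = "Bearer " + token

v1 = k8s.CoreV1Api(k8s.ApiClient(conf))
print([item.metadata.name for item in v1.list_namespace().items])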
Amazon's aws tool is included in the Python package awscli, so one option is to add awscli as a Python dependency and just call it from Python. The code below also assumes that kubectl is installed (but you can remove that test if you want).
kubeconfig depends on ~/.aws/credentials
One challenge here is that the kubeconfig file generated by aws has a users section like this:
users:
- name: arn:aws:eks:someregion:1234:cluster/somecluster
  user:
    exec:
      apiVersion: client.authentication.k8s.io/v1beta1
      args:
      - --region
      - someregion
      - eks
      - get-token
      - --cluster-name
      - somecluster
      command: aws
So if you mount it into a container or move it to a different machine, you'll get this error when you try to use it:
Unable to locate credentials. You can configure credentials by running "aws configure".
Based on that user section, kubectl is running aws eks get-token and it's failing because the ~/.aws dir doesn't have the credentials that it had when the kubeconfig file was generated.
You could get around this by also staging the ~/.aws dir everywhere you want to use the kubeconfig file, but I have automation that takes a lone kubeconfig file as a parameter, so I'll be modifying the user section to include the necessary secrets as env vars.
Be aware that this makes it possible for whoever gets that kubeconfig file to use the secrets we've included for other things. Whether this is a problem will depend on how much power your aws user has.
Assume Role
If your cluster uses RBAC, you might need to specify which role you want for your kubeconfig file. The code below does this by first generating a separate set of creds and then using them to generate the kubeconfig file.
Role assumption has a timeout (I'm using 12 hours below), so you'll need to call the script again if you can't manage your mischief before the token times out.
The Code
You can generate the file like:
pip install awscli boto3 pyyaml sh
python mkkube.py > kubeconfig
...if you put the following in mkkube.py
from pathlib import Path
from tempfile import TemporaryDirectory
from time import time

import boto3
import yaml
from sh import aws, sh

aws_access_key_id = "AKREDACTEDAT"
aws_secret_access_key = "ubREDACTEDaE"
role_arn = "arn:aws:iam::1234:role/some-role"
cluster_name = "mycluster"
region_name = "someregion"

# assume a role that has access
sts = boto3.client(
    "sts",
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key,
)
assumed = sts.assume_role(
    RoleArn=role_arn,
    RoleSessionName="mysession-" + str(int(time())),
    DurationSeconds=(12 * 60 * 60),  # 12 hrs
)

# these will be different than the ones you started with
credentials = assumed["Credentials"]
access_key_id = credentials["AccessKeyId"]
secret_access_key = credentials["SecretAccessKey"]
session_token = credentials["SessionToken"]

# make sure our cluster actually exists
eks = boto3.client(
    "eks",
    aws_session_token=session_token,
    aws_access_key_id=access_key_id,
    aws_secret_access_key=secret_access_key,
    region_name=region_name,
)
clusters = eks.list_clusters()["clusters"]
if cluster_name not in clusters:
    raise RuntimeError(f"configured cluster: {cluster_name} not found among {clusters}")

with TemporaryDirectory() as kube:
    kubeconfig_path = Path(kube) / "config"

    # let awscli generate the kubeconfig
    result = aws(
        "eks",
        "update-kubeconfig",
        "--name",
        cluster_name,
        _env={
            "AWS_ACCESS_KEY_ID": access_key_id,
            "AWS_SECRET_ACCESS_KEY": secret_access_key,
            "AWS_SESSION_TOKEN": session_token,
            "AWS_DEFAULT_REGION": region_name,
            "KUBECONFIG": str(kubeconfig_path),
        },
    )

    # read the generated file
    with open(kubeconfig_path, "r") as f:
        kubeconfig_str = f.read()
    kubeconfig = yaml.load(kubeconfig_str, Loader=yaml.SafeLoader)

    # the generated kubeconfig assumes that upon use it will have access to
    # `~/.aws/credentials`, but maybe this filesystem is ephemeral,
    # so add the creds as env vars on the aws command in the kubeconfig
    # so that even if the kubeconfig is separated from ~/.aws it is still
    # useful
    users = kubeconfig["users"]
    for i in range(len(users)):
        kubeconfig["users"][i]["user"]["exec"]["env"] = [
            {"name": "AWS_ACCESS_KEY_ID", "value": access_key_id},
            {"name": "AWS_SECRET_ACCESS_KEY", "value": secret_access_key},
            {"name": "AWS_SESSION_TOKEN", "value": session_token},
        ]

    # write the updates to disk
    with open(kubeconfig_path, "w") as f:
        f.write(yaml.dump(kubeconfig))

    awsclipath = str(Path(sh("-c", "which aws").stdout.decode()).parent)
    kubectlpath = str(Path(sh("-c", "which kubectl").stdout.decode()).parent)
    pathval = f"{awsclipath}:{kubectlpath}"

    # test the modified file without a ~/.aws/ dir
    # this will throw an exception if we can't talk to the cluster
    sh(
        "-c",
        "kubectl cluster-info",
        _env={
            "KUBECONFIG": str(kubeconfig_path),
            "PATH": pathval,
            "HOME": "/no/such/path",
        },
    )

print(yaml.dump(kubeconfig))

Configure Jenkins plugin / credentials values with an API

I want to know if there is a Jenkins API (a remote access API) to set values in a Jenkins plugin's configuration. For example, the Artifactory plugin asks for the Artifactory URL only in the configuration manager (http://jenkins-url/configure), and a new URL cannot be created while creating a job.
Also, how can we create new credentials (SSH / username and password) on a Jenkins system with the Jenkins remote API?
Check out this example: https://gist.github.com/iocanel/9de5c976cc0bd5011653
import jenkins.model.*
import com.cloudbees.plugins.credentials.*
import com.cloudbees.plugins.credentials.common.*
import com.cloudbees.plugins.credentials.domains.*
import com.cloudbees.plugins.credentials.impl.*
import com.cloudbees.jenkins.plugins.sshcredentials.impl.*
import hudson.plugins.sshslaves.*

domain = Domain.global()
store = Jenkins.instance.getExtensionList('com.cloudbees.plugins.credentials.SystemCredentialsProvider')[0].getStore()

privateKey = new BasicSSHUserPrivateKey(
    CredentialsScope.GLOBAL,
    "jenkins-slave-key",
    "root",
    new BasicSSHUserPrivateKey.UsersPrivateKeySource(),
    "",
    ""
)

usernameAndPassword = new UsernamePasswordCredentialsImpl(
    CredentialsScope.GLOBAL,
    "jenkins-slave-password", "Jenkins Slave with Password Configuration",
    "root",
    "jenkins"
)

store.addCredentials(domain, privateKey)
store.addCredentials(domain, usernameAndPassword)
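The Groovy above has to run inside Jenkins (for example in the Script Console). One way to drive it remotely is the script console's /scriptText endpoint, which accepts the Groovy source as a `script` form field. A rough sketch with Python requests, assuming a user and API token that are allowed to run scripts (the URL and credentials are placeholders):
import requests

JENKINS_URL = "http://jenkins-url"   # placeholder, as in the question
AUTH = ("admin", "<api-token>")      # placeholder user / API token

# The credentials-creation Groovy from above could be sent here verbatim.
groovy = "println(Jenkins.instance.pluginManager.plugins.collect { it.shortName })"

resp = requests.post(
    JENKINS_URL + "/scriptText",
    auth=AUTH,
    data={"script": groovy},
)
resp.raise_for_status()
print(resp.text)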

How to delete a queue in RabbitMQ

I am using RabbitMQ via the pika library.
I use the following code to create a consumer:
#!/usr/bin/env python
import pika
import time
import json
import datetime

connection = pika.BlockingConnection(pika.ConnectionParameters(
    host='localhost'))
channel = connection.channel()

channel.queue_declare(queue='hello')

def callback(ch, method, properties, body):
    # print(" current time: %s " % (str(int((time.time()) * 1000))))
    print(body)

channel.basic_consume(callback,
                      queue='hello',
                      no_ack=True)

channel.start_consuming()
Since I declare the queue every time (re-declaring it in case it doesn't exist yet), the queue has become corrupted, and now I want to delete it. How do I do that?
Since this seems to be a maintenance procedure, and not something you'll be doing routinely in your code, you should probably use the RabbitMQ management plugin and delete the queue from there.
Anyway, you can delete it from pika with:
channel.queue_delete(queue='hello')
https://pika.readthedocs.org/en/latest/modules/channel.html#pika.channel.Channel.queue_delete
The detailed answer is as follows (building on the very helpful answer above):
import pika

connection = pika.BlockingConnection(pika.ConnectionParameters(
    'localhost'))
channel = connection.channel()
channel.queue_delete(queue='hello')
connection.close()
The RabbitMQ management GUI makes that easy:
$ sudo rabbitmq-plugins enable rabbitmq_management
http://localhost:15672/#/queues
Username : guest
password : guest
inspired by this
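With the management plugin enabled as above, the queue can also be deleted over its HTTP API. A sketch assuming the default guest/guest account and the default '/' vhost (URL-encoded as %2F):
import requests

# DELETE /api/queues/<vhost>/<queue>; the default vhost '/' is encoded as %2F
resp = requests.delete(
    "http://localhost:15672/api/queues/%2F/hello",
    auth=("guest", "guest"),
)
print(resp.status_code)  # 204 No Content on success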