How to execute a PostgreSQL SELECT query using Cloud SQL in Cloud Composer's Airflow?

I am new to Cloud Composer and I want to execute a PostgreSQL SELECT query using the gcp_cloud_sql hook in Cloud Composer's Airflow. I tried CloudSqlQueryOperator, but it doesn't work with SELECT queries.
I want to create DAGs based on the results I get from this SELECT query. However, I am not able to create even a simple connection for this SELECT query.
import os
from datetime import date, datetime, timedelta

from six.moves.urllib.parse import quote_plus

import airflow
from airflow import models
from airflow.contrib.operators.gcp_sql_operator import CloudSqlQueryOperator
GCP_PROJECT_ID = "adtech-dev"
GCP_REGION = "<my cluster zone>"
GCSQL_POSTGRES_INSTANCE_NAME_QUERY = "testpostgres"
GCSQL_POSTGRES_DATABASE_NAME = ""
GCSQL_POSTGRES_USER = "<PostgreSQL User Name>"
GCSQL_POSTGRES_PASSWORD = "**********"
GCSQL_POSTGRES_PUBLIC_IP = "0.0.0.0"
GCSQL_POSTGRES_PUBLIC_PORT = "5432"
rule_query = "select r.id from rules r where r.id = 1"
postgres_kwargs = dict(
    user=quote_plus(GCSQL_POSTGRES_USER),
    password=quote_plus(GCSQL_POSTGRES_PASSWORD),
    public_port=GCSQL_POSTGRES_PUBLIC_PORT,
    public_ip=quote_plus(GCSQL_POSTGRES_PUBLIC_IP),
    project_id=quote_plus(GCP_PROJECT_ID),
    location=quote_plus(GCP_REGION),
    instance=quote_plus(GCSQL_POSTGRES_INSTANCE_NAME_QUERY),
    database=quote_plus(GCSQL_POSTGRES_DATABASE_NAME)
)
default_args = {
    'owner': 'airflow',
    'start_date': datetime(2018, 5, 31),
    'email': ['aniruddha.dwivedi@xyz.com'],
    'email_on_failure': True,
    'email_on_retry': False,
    'depends_on_past': False,
    'catchup': False,
    'retries': 3,
    'retry_delay': timedelta(minutes=10),
}
os.environ['AIRFLOW_CONN_PROXY_POSTGRES_TCP'] = \
    "gcpcloudsql://{user}:{password}@{public_ip}:{public_port}/{database}?" \
    "database_type=postgres&" \
    "project_id={project_id}&" \
    "location={location}&" \
    "instance={instance}&" \
    "use_proxy=True&" \
    "sql_proxy_use_tcp=True".format(**postgres_kwargs)
connection_names = [
    "proxy_postgres_tcp"
]
tasks = []
with models.DAG(
    dag_id='example_gcp_sql_query',
    default_args=default_args,
    schedule_interval=None
) as dag:
    prev_task = None
    for connection_name in connection_names:
        task = CloudSqlQueryOperator(
            gcp_cloudsql_conn_id=connection_name,
            task_id="example_gcp_sql_task_" + connection_name,
            sql=rule_query
        )
        tasks.append(task)
        if prev_task:
            prev_task >> task
        prev_task = task
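Note that CloudSqlQueryOperator is built for DML/DDL statements and does not hand the selected rows back to you, so it cannot feed results into DAG generation. A common workaround, sketched below, is to run the SELECT through PostgresHook inside a PythonOperator; postgres_rules_db is a hypothetical Airflow connection pointing at the Cloud SQL instance (public IP or local Cloud SQL proxy):

from airflow.hooks.postgres_hook import PostgresHook
from airflow.operators.python_operator import PythonOperator

def fetch_rule_ids():
    # 'postgres_rules_db' is a hypothetical connection id; create it in
    # Admin > Connections against the instance's public IP or local proxy.
    hook = PostgresHook(postgres_conn_id='postgres_rules_db')
    rows = hook.get_records(rule_query)
    return [row[0] for row in rows]

fetch_task = PythonOperator(
    task_id='fetch_rule_ids',
    python_callable=fetch_rule_ids,
    dag=dag,
)

The return value lands in XCom, so downstream tasks can read the ids and branch on them.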

Related

How to hide the password from the log and rendered template when passing another Airflow connection to the Airflow SSH Operator

Summary of my DAG:
I am using the SSH Operator to SSH into an EC2 instance and run a JAR file which connects to multiple DBs. I've declared the Airflow connections in my DAG file and am able to pass the variables into the EC2 instance. As you can see below, I'm passing the properties into the java command.
Airflow version - 1.10.7
Package installed - apache-airflow[crypto]
from airflow import DAG
from datetime import datetime, timedelta
from airflow.contrib.hooks.ssh_hook import SSHHook
from airflow.contrib.operators.ssh_operator import SSHOperator
from airflow.hooks.base_hook import BaseHook
from airflow.models.connection import Connection
ssh_hook = SSHHook(ssh_conn_id='ssh_to_ec2')
ssh_hook.no_host_key_check = True
redshift_connection = BaseHook.get_connection("my_redshift")
rs_user = redshift_connection.login
rs_password = redshift_connection.password
mongo_connection = BaseHook.get_connection("my_mongo")
mongo_user = mongo_connection.login
mongo_password = mongo_connection.password
default_args = {
    'owner': 'AIRFLOW',
    'start_date': datetime(2020, 4, 1, 0, 0),
    'email': [],
    'retries': 1,
}
dag = DAG('connect_to_redshift', default_args=default_args)
t00_00 = SSHOperator(
    task_id='ssh_and_connect_db',
    ssh_hook=ssh_hook,
    command="java "
            "-Drs_user={rs_user} -Drs_pass={rs_pass} "
            "-Dmongo_user={mongo_user} -Dmongo_pass={mongo_pass} "
            "-jar /home/airflow/root.jar".format(
                rs_user=rs_user, rs_pass=rs_password,
                mongo_user=mongo_user, mongo_pass=mongo_password),
    dag=dag)
t00_00
Problem
The values of rs_pass and mongo_pass are exposed in the Rendered Template and the Airflow log, which is not good, and I would like a solution that hides all of this sensitive information from both the log and the rendered template when using the SSH Operator.
So far I've tried lowering the log verbosity to ERROR in airflow.cfg, but the values still show up in the Rendered Template.
Please enlighten me.
Thanks
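One workaround (a sketch, not an official Airflow feature) relies on the fact that the Rendered Template view only displays the fields listed in an operator's template_fields; subclassing SSHOperator with an empty tuple keeps the command out of that view:

from airflow.contrib.operators.ssh_operator import SSHOperator

class MaskedSSHOperator(SSHOperator):
    # With no template_fields, Airflow neither renders the command with Jinja
    # nor shows it in the Rendered Template tab.
    template_fields = ()

This only hides the rendered field; if the remote process echoes its arguments, the credentials can still reach the task log, so reading them from environment variables or a secrets store on the EC2 instance itself is safer.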

Why isn't my table showing up when I run a CREATE TABLE statement using Airflow?

So, I am trying to create a table in my Redshift DB using Airflow. My connection works, and I tested it with a SQL command, but when I change the SQL to a CREATE TABLE statement it runs successfully without the table showing up in my Redshift DB.
Here is my code:
from datetime import datetime, timedelta
from airflow import DAG
from airflow.hooks.postgres_hook import PostgresHook
from airflow.models import BaseOperator
from airflow.operators.python_operator import PythonOperator
default_args = {
    'owner': 'james_c',
    'depends_on_past': False,
    'start_date': datetime(2019, 4, 1),
    'email': ['myemail@aol.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 2,
    'retry_delay': timedelta(minutes=1)
}
def get_activated_sources():
    request = "CREATE TABLE if not exists schema1.db1.tb1 (vendor_id varchar(50) PRIMARY KEY, vendor_name VARCHAR(255) NOT NULL);"
    pg_hook = PostgresHook(postgres_conn_id="postgres_default", schema='schema1')
    connection = pg_hook.get_conn()
    cursor = connection.cursor()
    cursor.execute(request)
    # cursor.fetchall()
    cursor.close()
    connection.close()

with DAG('create_sample_table_dagg', description='testing my redshift connection',
         default_args=default_args, schedule_interval='@once', catchup=False) as dag:
    hook_task = PythonOperator(task_id='hook_task', python_callable=get_activated_sources)
Any ideas/suggestions as to why it runs and completes without actually creating the table in Redshift?
Your code is fine; you just need to add:
connection.commit()
after
cursor.execute(request)
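Alternatively, the hook can execute and commit in one call; a minimal sketch using PostgresHook.run, which takes an autocommit flag:

pg_hook = PostgresHook(postgres_conn_id="postgres_default", schema='schema1')
# run() opens a connection, executes the statement, and commits it
# when autocommit=True.
pg_hook.run(request, autocommit=True)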

PostgresOperator in Airflow throwing an error while passing a parameter

I have a DAG which queries the Postgres database using PostgresOperator; however, when passing the parameter I am getting the error below.
psycopg2.ProgrammingError: column "132" does not exist
LINE 1: ...d,derived_tstamp FROM atomic.events WHERE event_name = "132"
A snapshot of my DAG is below:
default_args = {
    "owner": "airflow",
    "depends_on_past": False,
    "start_date": airflow.utils.dates.days_ago(1),
    "email": ["airflow@airflow.com"],
    "email_on_failure": False,
    "email_on_retry": False,
    "retries": 1,
    "retry_delay": timedelta(minutes=1),
}

dag = DAG("PostgresTest", default_args=default_args, schedule_interval='3,33 * * * *',
          template_searchpath=['/root/airflow/sql/'])

dailyOperator = PostgresOperator(
    task_id='Refresh_DailyScore',
    postgres_conn_id='postgress_sophi',
    params={"e_name": '"132"'},
    sql='atomTest.sql',
    dag=dag)
Snapshot of atomTest.sql:
SELECT domain_userid,derived_tstamp FROM atomic.events WHERE event_name = {{ params.e_name }}
I have been hitting my head against this the whole day, trying to understand why Airflow considers the value 132 a column.
Please suggest.
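For what it's worth: in PostgreSQL, double quotes delimit identifiers and single quotes delimit string literals, so the rendered WHERE event_name = "132" is parsed as a reference to a column named 132. A minimal fix (a sketch keeping the names above) is to pass the bare value and single-quote it in the template:

dailyOperator = PostgresOperator(
    task_id='Refresh_DailyScore',
    postgres_conn_id='postgress_sophi',
    params={"e_name": "132"},  # raw value, no embedded quotes
    sql='atomTest.sql',
    dag=dag)

with atomTest.sql becoming:

SELECT domain_userid, derived_tstamp FROM atomic.events WHERE event_name = '{{ params.e_name }}'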

Airflow task not running on schedule with PrestoDB Query

I have defined an Airflow sample task where I want to run a PrestoDB query followed by a Spark job to perform a simple word-count example. Here is the DAG I defined:
from pandas import DataFrame
import logging
from datetime import timedelta
from operator import add
import airflow
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.hooks.presto_hook import PrestoHook
default_args = {
    'owner': 'airflow',
    'start_date': airflow.utils.dates.days_ago(1),
    'depends_on_past': False,
    'email': ['airflow@example.com'],
    'email_on_failure': False,
    'email_on_retry': False,
    'retries': 1,
    'retry_delay': timedelta(minutes=5),
}
dag = DAG(
    'presto_dag',
    default_args=default_args,
    description='A simple tutorial DAG with PrestoDB and Spark',
    # Run the DAG once per day
    schedule_interval='@daily',
)

def talk_to_presto(**kwargs):
    # provide_context=True passes the Airflow context as keyword arguments,
    # so the callable must accept **kwargs.
    ph = PrestoHook(host='presto.myhost.com', port=9988)
    # Query PrestoDB
    query = "show catalogs"
    # Fetch Data
    data = ph.get_records(query)
    logging.info(data)
    return data

def submit_to_spark(**kwargs):
    # conf = SparkConf().setAppName("PySpark App").setMaster("http://sparkhost.com:18080/")
    # sc = SparkContext(conf)
    # data = sc.parallelize(list("Hello World"))
    # counts = data.map(lambda x: (x, 1)).reduceByKey(add).sortBy(lambda x: x[1], ascending=False).collect()
    # for (word, count) in counts:
    #     print("{}: {}".format(word, count))
    # sc.stop()
    return "Hello"

presto_task = PythonOperator(
    task_id='talk_to_presto',
    provide_context=True,
    python_callable=talk_to_presto,
    dag=dag,
)
spark_task = PythonOperator(
    task_id='submit_to_spark',
    provide_context=True,
    python_callable=submit_to_spark,
    dag=dag,
)
presto_task >> spark_task
When I submit the task, about 20 DAG instances stay in the running state, but they never complete and no logs are generated, at least for the PrestoDB query. I am able to run the same PrestoDB query correctly from Airflow's Data Profiling > Ad-Hoc Query section.
I have intentionally commented out the PySpark code, as it wasn't running and is not the focus of this question.
I have two questions:
Why aren't the tasks completing, and why do they stay in the running state?
What am I doing wrong with the PrestoHook, given that the query isn't running?
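On the second question: PrestoHook is a DbApiHook, so it reads host and port from an Airflow connection rather than from constructor arguments; as written, host='presto.myhost.com' and port=9988 are most likely ignored and the default presto_default connection is used instead. A sketch of the conventional setup, assuming a hypothetical connection id presto_myhost created under Admin > Connections with that host and port:

def talk_to_presto(**kwargs):
    # 'presto_myhost' is a hypothetical connection id configured with
    # host=presto.myhost.com and port=9988.
    ph = PrestoHook(presto_conn_id='presto_myhost')
    return ph.get_records("show catalogs")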

How to know from which table a record is retrieved in Sphinx?

from sphinx.conf:
source src0 {
    type = pgsql
    sql_host = localhost
    sql_user = <db user>
    sql_pass = <pwd>
    sql_db = <db name>
    sql_port = 5432
    sql_query = \
        SELECT id, header, text, "app_main" as table_name \
        FROM app_main
    sql_query_info = SELECT * FROM app_main WHERE id=$id
    sql_attr_string = table_name
}

source src1 {
    type = pgsql
    sql_host = localhost
    sql_user = <db user>
    sql_pass = <pwd>
    sql_db = <db name>
    sql_port = 5432
    sql_query = \
        SELECT id, header, text, "app_product" as table_name \
        FROM app_product
    sql_query_info = SELECT * FROM app_product WHERE id=$id
    sql_attr_string = table_name
}

index global_index {
    source = src0
    source = src1
    path = D:/blizzard/Projects/Python/Web/moz455/app/sphinx/data/global_index
    docinfo = extern
    charset_type = utf-8
}
The command
client.Query(S, '*')
returns
{'status': 0, 'matches': [{'id': 5, 'weight': 30, 'attrs': {}}], 'fields': ['header', 'text'], 'time': '0.000', 'total_found': 1, 'warning': '', 'attrs': [], 'words': [{'docs': 1, 'hits': 2, 'word': 'styless'}], 'error': '', 'total': 1}
Why is the attrs dict empty? Is this the right way to get the table name, and if not, what is?
Make sure you rebuild the index after changing the config file.
It is best to restart Sphinx after changing the config.
Specify the actual index name(s) in the query, rather than just using '*' - all indexes searched together should have the required attribute(s).
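For illustration, a sketch using the Python sphinxapi client, assuming client is a sphinxapi.SphinxClient and S is the search string:

# Query the named index so every searched index shares the table_name attribute.
result = client.Query(S, 'global_index')
for match in result['matches']:
    # 'attrs' now carries the per-document string attribute.
    print(match['id'], match['attrs']['table_name'])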