How to connect a database (Postgres) to Airflow Composer on Google Cloud Platform? - postgresql

I have an Airflow setup on my local machine. The DAGs are written so that they need to access a Postgres database. I am trying to set up the same thing on Google Cloud Platform, but I am not able to connect the database to Airflow in Cloud Composer; I keep getting the error "no host postgres". Any suggestions for setting up Airflow on GCP or connecting a database to Airflow in Composer?
Here is a link to my complete Airflow folder (this setup works fine on my local machine with Docker):
https://github.com/digvijay13873/airflow-docker.git
I am using GCP Composer. The Postgres database is in a Cloud SQL instance. My table-creation DAG is here:
https://github.com/digvijay13873/airflow-docker/blob/main/dags/tablecreation.py
What changes should I make in my existing DAG to connect it to Postgres in the Cloud SQL instance? I tried giving the public IP address of Postgres in the Host parameter.

Answering your main question, connecting to a Cloud SQL instance from a Cloud Composer environment can be done in two ways:
Using the public IP
Using the Cloud SQL proxy (recommended): secure access without the need for authorized networks and SSL configuration
Connecting using the public IP:
Postgres: connect directly via TCP (non-SSL)
os.environ['AIRFLOW_CONN_PUBLIC_POSTGRES_TCP'] = (
    "gcpcloudsql://{user}:{password}@{public_ip}:{public_port}/{database}?"
    "database_type=postgres&"
    "project_id={project_id}&"
    "location={location}&"
    "instance={instance}&"
    "use_proxy=False&"
    "use_ssl=False".format(**postgres_kwargs)
)
For more information, refer to the Cloud SQL connection examples in the Airflow GitHub repository.
For connecting using the Cloud SQL proxy: you can connect using the Auth proxy from GKE as per this documentation.
After setting up the SQL proxy, you can connect Composer to your SQL instance through the proxy.
Example code:
import os
from os.path import expanduser
from datetime import timedelta
from urllib.parse import quote_plus

# Assumed imports: Airflow 1.x-style (contrib) operators and the GCS client used below.
from airflow import DAG
from airflow.operators.python_operator import PythonOperator
from airflow.contrib.operators.gcp_sql_operator import CloudSqlQueryOperator
from google.cloud import storage

SQL = [
    'CREATE TABLE IF NOT EXISTS TABLE_TEST (I INTEGER)',
    'CREATE TABLE IF NOT EXISTS TABLE_TEST (I INTEGER)',
    'INSERT INTO TABLE_TEST VALUES (0)',
    'CREATE TABLE IF NOT EXISTS TABLE_TEST2 (I INTEGER)',
    'DROP TABLE TABLE_TEST',
    'DROP TABLE TABLE_TEST2',
]

HOME_DIR = expanduser("~")

def get_absolute_path(path):
    if path.startswith("/"):
        return path
    else:
        return os.path.join(HOME_DIR, path)

# The GCSQL_POSTGRES_* / GCP_* values and default_args are assumed to be
# defined elsewhere (e.g. read from environment variables or Airflow Variables).
postgres_kwargs = dict(
    user=quote_plus(GCSQL_POSTGRES_USER),
    password=quote_plus(GCSQL_POSTGRES_PASSWORD),
    public_port=GCSQL_POSTGRES_PUBLIC_PORT,
    public_ip=quote_plus(GCSQL_POSTGRES_PUBLIC_IP),
    project_id=quote_plus(GCP_PROJECT_ID),
    location=quote_plus(GCP_REGION),
    instance=quote_plus(GCSQL_POSTGRES_INSTANCE_NAME_QUERY),
    database=quote_plus(GCSQL_POSTGRES_DATABASE_NAME),
)

# Airflow connection URI for connecting through the Cloud SQL proxy.
os.environ['AIRFLOW_CONN_PROXY_POSTGRES_TCP'] = \
    "gcpcloudsql://{user}:{password}@{public_ip}:{public_port}/{database}?" \
    "database_type=postgres&" \
    "project_id={project_id}&" \
    "location={location}&" \
    "instance={instance}&" \
    "use_proxy=True&" \
    "sql_proxy_use_tcp=True".format(**postgres_kwargs)

connection_names = [
    "proxy_postgres_tcp",
]

dag = DAG(
    'con_SQL',
    default_args=default_args,
    description='A DAG that connects to the SQL server.',
    schedule_interval=timedelta(days=1),
)

def print_client(ds, **kwargs):
    client = storage.Client()
    print(client)

print_task = PythonOperator(
    task_id='print_the_client',
    provide_context=True,
    python_callable=print_client,
    dag=dag,
)

for connection_name in connection_names:
    task = CloudSqlQueryOperator(
        gcp_cloudsql_conn_id=connection_name,
        task_id="example_gcp_sql_task_" + connection_name,
        sql=SQL,
        dag=dag,
    )
    print_task >> task

Related

How to connect to an Oracle DB from a Python Azure Synapse notebook?

I am trying to query an Oracle database from within an Azure Synapse notebook, preferably using pyodbc, but a PySpark solution would also be acceptable. The complexity here comes from, I believe, the low configurability of the Spark pool - I believe the code is generally correct.
host = 'my_endpoint.com:[port here as plain numbers, e.g. 1111]/orcl'
database = 'my_db_name'
username = 'my_username'
password = 'my_password'
conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=' + host + ';'
                      'DATABASE=' + database + ';'
                      'UID=' + username + ';'
                      'PWD=' + password + ';')
Approaches I have tried:
Pyodbc - I can use the default driver available ({ODBC Driver 17 for SQL Server}), and I get login timeouts. I have tried both the normal URL of the server and the IP, with all combinations of port, no port, comma port, colon port, and with and without the service name "orcl" appended. A code sample is above, but I believe the issue lies with the drivers.
Pyspark .read - With no JDBC driver specified, I get a "No suitable driver" error. I am able to add the OJDBC .jar to the workspace or to my file directory, but I was not able to figure out how to tell Spark that it should be used.
cx_oracle - This is not permitted in my workspace.
If the solution requires setting environment variables or using spark-submit, please provide a link that explains how best to do that in Synapse. I would be happy with either a JDBC or an ODBC solution.
By adding the .jar here (ojdbc8-19.15.0.0.1.jar) to the Synapse workspace packages and then adding that package to the Apache Spark pool packages, I was able to execute the following code:
host = 'my_host_url'
port = 1521
service_name = 'my_service_name'
jdbcUrl = f'jdbc:oracle:thin:@{host}:{port}:{service_name}'
sql = 'SELECT * FROM my_table'
user = 'my_username'
password = 'my_password'
jdbcDriver = 'oracle.jdbc.driver.OracleDriver'

jdbcDF = spark.read.format('jdbc') \
    .option('url', jdbcUrl) \
    .option('query', sql) \
    .option('user', user) \
    .option('password', password) \
    .option('driver', jdbcDriver) \
    .load()

display(jdbcDF)
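A side note, not from the original answer: the host:port:service_name form of the thin URL is the SID syntax, so if my_service_name is actually an Oracle service name rather than a SID, the //host:port/service_name form is usually what is needed, e.g.:
# Hypothetical alternative, only if 'my_service_name' is a service name, not a SID.
jdbcUrl = f'jdbc:oracle:thin:@//{host}:{port}/{service_name}'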

'h' format requires -32768 <= number <= 32767

I am trying to write a dataframe into a PostgreSQL database table.
When I write it into Heroku's Postgres database, everything works fine. No problems.
For Heroku PostgreSQL, I use the connection string:
connection_string = "postgresql+psycopg2://%s:%s#%s/%s" % (
conn_params['user'],
conn_params['password'],
conn_params['host'],
conn_params['dbname'])
However, when I try to write the dataframe into GCP's Cloud SQL table, I get the following error:
struct.error: 'h' format requires -32768 <= number <= 32767
The connection string I use for GCP Cloud SQL is as follows:
connection_string = \
    f"postgresql+pg8000://{conn_params['user']}:{conn_params['password']}@{conn_params['host']}/{conn_params['dbname']}"
The command I use to write to the database is the same for both GCP and Heroku:
df_Output.to_sql(sql_tablename, con=conn, schema='public', index=False, if_exists=if_exists, method='multi')
I'd recommend using the Cloud SQL Python Connector to manage your connections and take care of the connection string for you. It supports the pg8000 driver and should help resolve your troubles.
from google.cloud.sql.connector import connector
import sqlalchemy

# configure Cloud SQL Python Connector properties
def getconn():
    conn = connector.connect(
        "project:region:instance",
        "pg8000",
        user="YOUR_USER",
        password="YOUR_PASSWORD",
        db="YOUR_DB"
    )
    return conn

# create connection pool to re-use connections
pool = sqlalchemy.create_engine(
    "postgresql+pg8000://",
    creator=getconn,
)

# query or insert into Cloud SQL database
with pool.connect() as db_conn:
    # query database
    result = db_conn.execute("SELECT * from my_table").fetchall()
    # Do something with the results
    for row in result:
        print(row)
For more detailed examples refer to the README of the repository.
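Since the original problem was a dataframe write, the same engine can be handed straight to pandas. A minimal sketch, reusing the pool object from the snippet above and the df_Output / sql_tablename names from the question ('append' stands in for the question's if_exists value):
# Write the dataframe through the connector-backed SQLAlchemy engine.
df_Output.to_sql(
    sql_tablename,
    con=pool,
    schema='public',
    index=False,
    if_exists='append',   # placeholder for the question's if_exists value
    method='multi',
)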

How to connect to an AWS Postgres DB over SSH using Python

I have a Postgres DB on AWS, and currently we connect to it using the Postico client by providing the information below:
DB Host
DB Port
DB Username
DB Password
DB Name
SSH Host (it is domain)
SSH Port
SSH Private Key
During this time I use my organisation's VPN. But now I have to connect to the same database with Python code, and I believe that since I can connect with Postico I should be able to through code as well. I have used the code below but am unable to connect to the DB and fetch records, so can anyone give me an idea or sample code?
from sshtunnel import SSHTunnelForwarder
from sqlalchemy import create_engine
import pandas as pd

class Postgresql_connect:
    def __init__(self, pgres_host, pgres_port, db, ssh, ssh_user, ssh_host, ssh_pkey):
        # SSH Tunnel Variables
        self.pgres_host = pgres_host
        self.pgres_port = pgres_port
        if ssh == True:
            self.server = SSHTunnelForwarder(
                (ssh_host, 22),
                ssh_username=ssh_user,
                ssh_private_key=ssh_pkey,
                remote_bind_address=(pgres_host, pgres_port),
            )
            server = self.server
            server.start()  # start ssh server
            self.local_port = server.local_bind_port
            print(f'Server connected via SSH || Local Port: {self.local_port}...')
        elif ssh == False:
            pass

    def query(self, db, query, psql_user, psql_pass):
        engine = create_engine(f'postgresql://{psql_user}:{psql_pass}@{self.pgres_host}:{self.local_port}/{db}')
        print(f'Database [{db}] session created...')
        print(f'host [{self.pgres_host}]')
        self.query_df = pd.read_sql(query, engine)
        print('<> Query Successful <>')
        engine.dispose()
        return self.query_df

# p_host, p_port, db, ssh, ssh_user, ssh_host, ssh_pkey, user and password
# are assumed to be defined elsewhere.
pgres = Postgresql_connect(pgres_host=p_host, pgres_port=p_port, db=db, ssh=ssh, ssh_user=ssh_user, ssh_host=ssh_host, ssh_pkey=ssh_pkey)
print(ssh_pkey)
query_df = pgres.query(db=db, query='Select * from table', psql_user=user, psql_pass=password)
print(query_df)
Connect just as you would locally after creating an SSH tunnel.
https://www.howtogeek.com/168145/how-to-use-ssh-tunneling/
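A minimal sketch of that advice, assuming the sshtunnel and SQLAlchemy packages and reusing the variable names from the question (p_host, p_port, ssh_host, ssh_user, ssh_pkey, user, password, db; the table name is a placeholder). The important detail is that, once the tunnel is up, the engine should point at 127.0.0.1 and the tunnel's local port rather than at the remote DB host:
from sshtunnel import SSHTunnelForwarder
from sqlalchemy import create_engine
import pandas as pd

# Open an SSH tunnel to the bastion host, forwarding to the Postgres server.
server = SSHTunnelForwarder(
    (ssh_host, 22),
    ssh_username=ssh_user,
    ssh_pkey=ssh_pkey,
    remote_bind_address=(p_host, p_port),
)
server.start()

# Connect through the local end of the tunnel, not the remote host name.
engine = create_engine(
    f'postgresql://{user}:{password}@127.0.0.1:{server.local_bind_port}/{db}'
)
query_df = pd.read_sql('SELECT * FROM my_table', engine)
print(query_df)

engine.dispose()
server.stop()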

Unable to import S3 data to RDS (PostgreSQL) using the aws_commons extension

I am trying to import an S3 file (a CSV file) into a PostgreSQL RDS instance.
I am running the S3 import queries from the RDS instance as per the AWS docs, but I am getting the error below:
ERROR: syntax error at or near ":"
LINE 5: :'s3_uri',
^
SQL state: 42601
Character: 76
The query I used for importing the S3 file is below:
SELECT aws_s3.table_import_from_s3(
't1',
'',
'(format csv)',
:'s3_uri',
);
I am running the query as per the AWS doc: https://docs.aws.amazon.com/AmazonRDS/latest/UserGuide/PostgreSQL.Procedural.Importing.html#USER_PostgreSQL.S3Import.Overview
I had the same error. I solved it with these steps:
TL;DR: when you're importing data with aws_commons, you want to provide the credentials and path of the S3 asset you're working with rather than a variable that stores that information. So, assuming you've already created an IAM role and policy to allow your database to access S3, you run:
SELECT aws_s3.table_import_from_s3('table_name',
    '"col1", "col2", "col3"',
    '(format csv)',
    's3_bucket_name',
    'file_name.csv',
    'region',
    'access_key',
    'secret_key');
1. Upgrade Postgres to version 14.4 (though 12.4 also works)
2. Install aws_commons: CREATE EXTENSION aws_s3 CASCADE;
3. Create the IAM role: aws iam create-role --role-name $ROLE_NAME --assume-role-policy-document '{"Version": "2012-10-17", "Statement": [{"Effect": "Allow", "Principal": {"Service": "rds.amazonaws.com"}, "Action": "sts:AssumeRole"}]}'
4. Create an IAM policy for the IAM role: aws iam create-policy --policy-name $POLICY_NAME --policy-document '{"Version": "2012-10-17", "Statement": [{"Sid": "s3import", "Action": ["s3:GetObject", "s3:ListBucket"], "Effect": "Allow", "Resource": ["arn:aws:s3:::${BUCKET_NAME}", "arn:aws:s3:::${BUCKET_NAME}/*"]}]}'
5. Attach the policy: aws iam attach-role-policy --policy-arn arn:aws:iam::$AWS_ACCOUNT_ID:policy/$POLICY_NAME --role-name $ROLE_NAME
Note: steps 1 - 5 worked for allowing data exports from Postgres; importing from S3 was failing until I added step 6.
6. Create the VPC endpoint for the S3 service: aws ec2 create-vpc-endpoint --vpc-id $VPC_ID --service-name com.amazonaws.$REGION.s3 --route-table-ids $ROUTE_TABLE_ID
The route table ID related to the VPC where the endpoint is created can be retrieved with:
aws ec2 describe-route-tables | jq -r '.RouteTables[] | "\(.VpcId) \(.RouteTableId)"'
7. Run: SELECT aws_s3.table_import_from_s3('table_name', '"col1", "col2", "col3"', '(format csv)', 's3_bucket_name', 'file_name.csv', 'region', 'access_key', 'secret_key');
References:
Importing data from Amazon S3 into an RDS for PostgreSQL DB instance
Examples of importing data from Amazon S3 into an RDS for PostgreSQL DB instance
:'s3_uri' is psql variable-substitution syntax, so it is not valid outside the psql client with that variable set. The :'s3_uri' placeholder, including the leading colon, needs to be replaced with an actual S3 URI. When you are done, there should be no literal colon there outside the quotes.
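For illustration, a sketch of doing that substitution from Python with psycopg2, letting aws_commons.create_s3_uri build the URI server-side (the connection settings, bucket, file name and region below are placeholders):
import psycopg2

# Placeholder connection settings.
conn = psycopg2.connect(host='my-rds-endpoint', dbname='mydb',
                        user='myuser', password='mypassword')

with conn, conn.cursor() as cur:
    # Build the S3 URI with aws_commons.create_s3_uri and pass it to the
    # import function directly; no psql :'variable' syntax is involved.
    cur.execute("""
        SELECT aws_s3.table_import_from_s3(
            't1', '', '(format csv)',
            aws_commons.create_s3_uri(%s, %s, %s)
        )
    """, ('my-bucket', 'file_name.csv', 'us-east-1'))

conn.close()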
Try adding a newline before the colon:
SELECT aws_s3.table_import_from_s3( 't1', '', '(format csv)',
:'s3_uri', );

Access a database running in an EC2 instance through an AWS Lambda function

I wrote a Lambda function in Python 3.6 to access a PostgreSQL database that is running in an EC2 instance.
psycopg2.connect(user="<USER NAME>",
                 password="<PASSWORD>",
                 host="<EC2 IP Address>",
                 port="<PORT NUMBER>",
                 database="<DATABASE NAME>")
I created a deployment package with the required dependencies as a zip file and uploaded it to AWS Lambda. To build the dependencies I followed THIS reference guide.
I also configured the Virtual Private Cloud (VPC) as the default one and included the EC2 instance details, but I couldn't get a connection to the database: trying to connect to the database from Lambda results in a timeout.
Lambda function:
from __future__ import print_function
import json
import ast, datetime
import psycopg2

def lambda_handler(event, context):
    received_event = json.dumps(event, indent=2)
    load = ast.literal_eval(received_event)
    connection = None
    cursor = None
    try:
        connection = psycopg2.connect(user="<USER NAME>",
                                      password="<PASSWORD>",
                                      host="<EC2 IP Address>",
                                      # host="localhost",
                                      port="<PORT NUMBER>",
                                      database="<DATABASE NAME>")
        cursor = connection.cursor()
        postgreSQL_select_Query = "select * from test_table limit 10"
        cursor.execute(postgreSQL_select_Query)
        print("Selecting rows from mobile table using cursor.fetchall")
        mobile_records = cursor.fetchall()
        print("Print each row and its columns values")
        for row in mobile_records:
            print("Id = ", row[0])
    except Exception as error:
        print("Error while fetching data from PostgreSQL", error)
    finally:
        # closing database connection
        if connection:
            cursor.close()
            connection.close()
            print("PostgreSQL connection is closed")
    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!'),
        'dt': str(datetime.datetime.now())
    }
I googled quite a lot, but I couldn't find any workaround for this. Is there any way to accomplish this requirement?
Your configuration would need to be:
A database in a VPC
The Lambda function configured to use the same VPC as the database
A security group on the Lambda function (Lambda-SG)
A security group on the database (DB-SG) that permits inbound connections from Lambda-SG on the relevant database port
That is, DB-SG refers to Lambda-SG.
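As an illustration only, that ingress rule can be expressed with boto3 roughly like this (the security group IDs are hypothetical placeholders, and 5432 assumes the default Postgres port):
import boto3

ec2 = boto3.client("ec2")

# Allow the database security group (DB-SG) to accept inbound Postgres
# traffic from the Lambda function's security group (Lambda-SG).
ec2.authorize_security_group_ingress(
    GroupId="sg-0aaaaaaaaaaaaaaaa",  # DB-SG (hypothetical ID)
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,            # default Postgres port (assumption)
        "ToPort": 5432,
        "UserIdGroupPairs": [{"GroupId": "sg-0bbbbbbbbbbbbbbbb"}],  # Lambda-SG (hypothetical ID)
    }],
)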
For Lambda to connect to any resources inside a VPC, it needs to set up ENIs in the related private subnets of the VPC. Have you set up the VPC association and security groups of the EC2 instance correctly?
You can refer https://docs.aws.amazon.com/lambda/latest/dg/vpc.html