I am trying to write a dataframe into a PostgreSQL database table.
When I write it into Heroku's PostgreSQL database, everything works fine. No problems.
For Heroku PostgreSQL, I use the connection string:
connection_string = "postgresql+psycopg2://%s:%s@%s/%s" % (
    conn_params['user'],
    conn_params['password'],
    conn_params['host'],
    conn_params['dbname'])
However, when I try to write the dataframe into GCP's Cloud SQL table, I get the following error:
struct.error: 'h' format requires -32768 <= number <= 32767
The connection string I use for GCP Cloud SQL is as follows:
connection_string = \
    f"postgresql+pg8000://{conn_params['user']}:{conn_params['password']}@{conn_params['host']}/{conn_params['dbname']}"
The command I use to write to the database is the same for both GCP and Heroku:
df_Output.to_sql(sql_tablename, con=conn, schema='public', index=False, if_exists=if_exists, method='multi')
I'd recommend using the Cloud SQL Python Connector to manage your connections and take care of the connection string for you. It supports the pg8000 driver and should help resolve your troubles.
from google.cloud.sql.connector import connector
import sqlalchemy

# configure Cloud SQL Python Connector properties
def getconn():
    conn = connector.connect(
        "project:region:instance",
        "pg8000",
        user="YOUR_USER",
        password="YOUR_PASSWORD",
        db="YOUR_DB"
    )
    return conn

# create connection pool to re-use connections
pool = sqlalchemy.create_engine(
    "postgresql+pg8000://",
    creator=getconn,
)

# query or insert into Cloud SQL database
with pool.connect() as db_conn:
    # query database
    result = db_conn.execute("SELECT * from my_table").fetchall()

    # Do something with the results
    for row in result:
        print(row)
For more detailed examples refer to the README of the repository.
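As a follow-up sketch (reusing df_Output, sql_tablename, and the if_exists flag from your question is my assumption): the struct.error appears to come from pg8000 packing the bind-parameter count as a signed 16-bit integer, so with method='multi' you also want to cap how many rows go into each INSERT.

# hedged sketch: hand the connector-backed engine ("pool" from above) to pandas
df_Output.to_sql(
    sql_tablename,
    con=pool,              # SQLAlchemy engine created with creator=getconn
    schema='public',
    index=False,
    if_exists=if_exists,   # same flag as in your call
    method='multi',
    chunksize=500          # keep rows-per-INSERT low so the total number of bind
                           # parameters stays well under pg8000's 32767 limit
)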
Related
I am trying to query an Oracle database from within an Azure Synapse notebook, preferably using pyodbc, but a PySpark solution would also be acceptable. The difficulty, I believe, comes from the low configurability of the Spark pool; I believe the code itself is generally correct.
host = 'my_endpoint.com:[port here as plain numbers, e.g. 1111]/orcl'
database = 'my_db_name'
username = 'my_username'
password = 'my_password'

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=' + host + ';'
                      'DATABASE=' + database + ';'
                      'UID=' + username + ';'
                      'PWD=' + password + ';')
Approaches I have tried:
Pyodbc - I can use the default driver available ({ODBC Driver 17 for SQL Server}) and I get login timeouts. I have tried both the normal URL of the server and the IP, and with all combinations of port, no port, comma port, colon port, and without the service name "orcl" appended. Code sample is above, but I believe the issue lies with the drivers.
Pyspark .read - With no JDBC driver specified, I get a "No suitable driver" error. I am able to add the OJDBC .jar to the workspace or to my file directory, but I was not able to figure out how to tell spark that it should be used.
cx_oracle - This is not permitted in my workspace.
If the solution requires setting environment variables or using spark-submit, please provide a link that explains how best to do that in Synapse. I would be happy with either a JDBC or an ODBC solution.
By adding the .jar here (ojdbc8-19.15.0.0.1.jar) to the Synapse workspace packages and then adding that package to the Apache spark pool packages, I was able to execute the following code:
host = 'my_host_url'
port = 1521
service_name = 'my_service_name'
jdbcUrl = f'jdbc:oracle:thin:@{host}:{port}:{service_name}'
sql = 'SELECT * FROM my_table'
user = 'my_username'
password = 'my_password'
jdbcDriver = 'oracle.jdbc.driver.OracleDriver'

jdbcDF = spark.read.format('jdbc') \
    .option('url', jdbcUrl) \
    .option('query', sql) \
    .option('user', user) \
    .option('password', password) \
    .option('driver', jdbcDriver) \
    .load()
display(jdbcDF)
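One caveat, as an assumption on my part about your listener setup: the @{host}:{port}:{name} form above is the SID syntax. If my_service_name is actually a service name rather than a SID, the thin driver usually expects the @// form instead:

# alternative URL form if the database is exposed via a service name rather than a SID
jdbcUrl = f'jdbc:oracle:thin:@//{host}:{port}/{service_name}'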
I have Airflow set up on my local machine. The DAGs are written in a way that they need to access a PostgreSQL database. I am trying to set up a similar thing on Google Cloud Platform, but I am not able to connect the database to Airflow in Cloud Composer. I keep getting the error "no host postgres". Any suggestions for setting up Airflow on GCP or connecting a database to Cloud Composer?
Here is the link to my complete Airflow folder (this setup works fine on my local machine with Docker):
https://github.com/digvijay13873/airflow-docker.git
I am using GCP Composer. The Postgres database is in a Cloud SQL instance. My table-creation DAG is here:
https://github.com/digvijay13873/airflow-docker/blob/main/dags/tablecreation.py
What changes should I make in my existing DAG to connect it with Postgres in the Cloud SQL instance? I tried giving the public IP address of Postgres in the host parameter.
Answering your main question: connecting to a Cloud SQL instance from a Cloud Composer environment can be done in two ways:
Using a public IP
Using the Cloud SQL proxy (recommended): secure access without the need for authorized networks or SSL configuration
Connecting using a public IP:
Postgres: connect directly via TCP (non-SSL):
os.environ['AIRFLOW_CONN_PUBLIC_POSTGRES_TCP'] = (
    "gcpcloudsql://{user}:{password}@{public_ip}:{public_port}/{database}?"
    "database_type=postgres&"
    "project_id={project_id}&"
    "location={location}&"
    "instance={instance}&"
    "use_proxy=False&"
    "use_ssl=False".format(**postgres_kwargs)
)
For more information, refer to the example on GitHub.
Connecting using the Cloud SQL proxy: you can connect using the Auth proxy from GKE as per this documentation.
After setting up the SQL proxy, you can connect Composer to your Cloud SQL instance through it.
Example code:
# Note: the imports below assume an Airflow 1.10-style environment (Cloud Composer 1);
# newer Google provider packages expose CloudSQLExecuteQueryOperator instead.
# The GCSQL_* constants and default_args are expected to be defined elsewhere
# (e.g. read from environment variables), as in the upstream Airflow example.
import os
from datetime import timedelta
from os.path import expanduser
from urllib.parse import quote_plus

from airflow import DAG
from airflow.contrib.operators.gcp_sql_operator import CloudSqlQueryOperator
from airflow.operators.python_operator import PythonOperator
from google.cloud import storage

SQL = [
    'CREATE TABLE IF NOT EXISTS TABLE_TEST (I INTEGER)',
    'CREATE TABLE IF NOT EXISTS TABLE_TEST (I INTEGER)',
    'INSERT INTO TABLE_TEST VALUES (0)',
    'CREATE TABLE IF NOT EXISTS TABLE_TEST2 (I INTEGER)',
    'DROP TABLE TABLE_TEST',
    'DROP TABLE TABLE_TEST2',
]

HOME_DIR = expanduser("~")

def get_absolute_path(path):
    if path.startswith("/"):
        return path
    else:
        return os.path.join(HOME_DIR, path)

postgres_kwargs = dict(
    user=quote_plus(GCSQL_POSTGRES_USER),
    password=quote_plus(GCSQL_POSTGRES_PASSWORD),
    public_port=GCSQL_POSTGRES_PUBLIC_PORT,
    public_ip=quote_plus(GCSQL_POSTGRES_PUBLIC_IP),
    project_id=quote_plus(GCP_PROJECT_ID),
    location=quote_plus(GCP_REGION),
    instance=quote_plus(GCSQL_POSTGRES_INSTANCE_NAME_QUERY),
    database=quote_plus(GCSQL_POSTGRES_DATABASE_NAME),
)

os.environ['AIRFLOW_CONN_PROXY_POSTGRES_TCP'] = \
    "gcpcloudsql://{user}:{password}@{public_ip}:{public_port}/{database}?" \
    "database_type=postgres&" \
    "project_id={project_id}&" \
    "location={location}&" \
    "instance={instance}&" \
    "use_proxy=True&" \
    "sql_proxy_use_tcp=True".format(**postgres_kwargs)

connection_names = [
    "proxy_postgres_tcp",
]

dag = DAG(
    'con_SQL',
    default_args=default_args,
    description='A DAG that connects to the SQL server.',
    schedule_interval=timedelta(days=1),
)

def print_client(ds, **kwargs):
    client = storage.Client()
    print(client)

print_task = PythonOperator(
    task_id='print_the_client',
    provide_context=True,
    python_callable=print_client,
    dag=dag,
)

for connection_name in connection_names:
    task = CloudSqlQueryOperator(
        gcp_cloudsql_conn_id=connection_name,
        task_id="example_gcp_sql_task_" + connection_name,
        sql=SQL,
        dag=dag
    )
    print_task >> task
I have a Glue/Connection for an RDS/PostgreSQL DB pre-built via CloudFormation, which works fine in a Glue/Scala Spark shell via the getJDBCSink API to write a DataFrame to that DB.
But I also need to run plain SQL against the same DB, like create index ... or create table ..., etc.
How can I issue that sort of statement in the same Glue/Spark shell?
In Python, you can provide the pg8000 dependency to the Glue Spark job and then run the SQL commands by establishing a connection to the RDS instance using pg8000, as in the sketch below.
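A minimal sketch of that approach (the host, database name, and credentials are hypothetical placeholders; in a real job you would take them from the Glue connection or Secrets Manager, and ship pg8000 via the --additional-python-modules job parameter):

import pg8000

# hypothetical connection details -- substitute the values from your Glue connection
conn = pg8000.connect(
    host="my-rds-instance.abc123.eu-west-1.rds.amazonaws.com",
    port=5432,
    database="mydb",
    user="my_user",
    password="my_password",
)

# plain DDL statements that the Spark JDBC writer cannot issue for you
cursor = conn.cursor()
cursor.execute("CREATE INDEX IF NOT EXISTS idx_users_quote ON users (quote)")
conn.commit()
conn.close()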
In Scala, you can establish a JDBC connection directly without any external library; as far as the driver is concerned, the Postgres driver is available in AWS Glue.
You can create the connection as follows:
import java.sql.{Connection, DriverManager, ResultSet}

object pgconn extends App {
  println("Postgres connector")

  classOf[org.postgresql.Driver]
  val con_str = "jdbc:postgresql://localhost:5432/DB_NAME?user=DB_USER"
  val conn = DriverManager.getConnection(con_str)
  try {
    val stm = conn.createStatement(ResultSet.TYPE_FORWARD_ONLY, ResultSet.CONCUR_READ_ONLY)
    // the same Statement can also run plain DDL, e.g. stm.execute("CREATE INDEX ...")
    val rs = stm.executeQuery("SELECT * from Users")

    while (rs.next) {
      println(rs.getString("quote"))
    }
  } finally {
    conn.close()
  }
}
or follow this blog
I wrote a Lambda function in Python 3.6 to access the PostgreSQL database which is running on an EC2 instance.
psycopg2.connect(user="<USER NAME>",
                 password="<PASSWORD>",
                 host="<EC2 IP Address>",
                 port="<PORT NUMBER>",
                 database="<DATABASE NAME>")
I created a deployment package with the required dependencies as a zip file and uploaded it to AWS Lambda. To build the dependencies I followed THIS reference guide.
I also configured the Virtual Private Cloud (VPC) as the default one and included the EC2 instance details, but I couldn't get a connection to the database. Trying to connect to the database from Lambda results in a timeout.
Lambda function:
from __future__ import print_function
import json
import ast, datetime
import psycopg2

def lambda_handler(event, context):
    received_event = json.dumps(event, indent=2)
    load = ast.literal_eval(received_event)
    connection = None
    try:
        connection = psycopg2.connect(user="<USER NAME>",
                                      password="<PASSWORD>",
                                      host="<EC2 IP Address>",
                                      # host="localhost",
                                      port="<PORT NUMBER>",
                                      database="<DATABASE NAME>")
        cursor = connection.cursor()
        postgreSQL_select_Query = "select * from test_table limit 10"

        cursor.execute(postgreSQL_select_Query)
        print("Selecting rows from mobile table using cursor.fetchall")
        mobile_records = cursor.fetchall()

        print("Print each row and its columns values")
        for row in mobile_records:
            print("Id = ", row[0], )
    except Exception as error:
        print("Error while fetching data from PostgreSQL", error)
    finally:
        # closing database connection
        if connection:
            cursor.close()
            connection.close()
            print("PostgreSQL connection is closed")

    return {
        'statusCode': 200,
        'body': json.dumps('Hello from Lambda!'),
        'dt': str(datetime.datetime.now())
    }
I googled quite a lot, but I couldn't find any workaround for this. Is there any way to accomplish this requirement?
Your configuration would need to be:
A database in a VPC
The Lambda function configured to use the same VPC as the database
A security group on the Lambda function (Lambda-SG)
A security group on the database (DB-SG) that permits inbound connections from Lambda-SG on the relevant database port
That is, DB-SG refers to Lambda-SG, as in the sketch below.
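As a hedged illustration of that DB-SG rule (the security-group IDs and port are hypothetical placeholders; adjust to your actual setup), this is roughly what the ingress rule looks like when created with boto3:

import boto3

ec2 = boto3.client("ec2")

# hypothetical IDs: the first stands in for DB-SG, the second for Lambda-SG
ec2.authorize_security_group_ingress(
    GroupId="sg-0db1111111111111",    # DB-SG, attached to the database instance
    IpPermissions=[{
        "IpProtocol": "tcp",
        "FromPort": 5432,             # PostgreSQL port
        "ToPort": 5432,
        "UserIdGroupPairs": [{"GroupId": "sg-0lambda22222222"}],  # Lambda-SG as source
    }],
)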
For Lambda to connect to any resources inside a VPC, it needs to set up ENIs in the related private subnets of the VPC. Have you set up the VPC association and the security groups of the EC2 instance correctly?
You can refer to https://docs.aws.amazon.com/lambda/latest/dg/vpc.html
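If the function was created without a VPC attachment, a hedged sketch of attaching it afterwards (the function name, subnet IDs, and security-group ID are hypothetical) looks like this:

import boto3

lambda_client = boto3.client("lambda")

# hypothetical identifiers -- use your function name, the VPC's private subnets,
# and the Lambda-SG mentioned in the previous answer
lambda_client.update_function_configuration(
    FunctionName="my-postgres-reader",
    VpcConfig={
        "SubnetIds": ["subnet-0aaa1111", "subnet-0bbb2222"],
        "SecurityGroupIds": ["sg-0lambda22222222"],
    },
)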
I recently started working with Robot Framework, and I had a requirement to connect to a Postgres DB.
I am able to connect to the DB, but when I try to execute queries, the flow gets stuck; the test does not even fail. The following is what I did:
Connect To Database    psycopg2    ${DBName}    ${DBUser}    ${DBPass}    ${DBHost}    ${DBPort}
${current_row_count} =    Row Count    Select * from xyz
The first statement executes fine, but then it gets stuck on the second statement.
Can somebody help me out with this?
To execute a query and get data from the result:
Connect To Database    psycopg2    ${DBName}    ${DBUser}    ${DBPass}    ${DBHost}    ${DBPort}
${output} =    Query    SELECT * from xyz;
Log    ${output}
${DataResults}=    Get from list    ${output}    0
${DataResults}=    Convert to list    ${DataResults}
${DataResults}=    Get from list    ${DataResults}    0
${DataResults}=    Convert to string    ${DataResults}
Disconnect From Database
You are not executing your query. Read below for a bit of documentation and an example ;)
In the example you can see placeholder variables; substitute your own data ;)
Name: Connect To Database Using Custom Params
Source: DatabaseLibrary
Arguments:
[ dbapiModuleName=None | db_connect_string= ]
Loads the DB API 2.0 module given dbapiModuleName then uses it to connect to the database using the map string db_custom_param_string.
Example usage:
Connect To Database Using Custom Params    pymssql    database='${db_database}', user='${db_user}', password='${db_password}', host='${db_host}'
${queryResults}    Query    ${query}
Disconnect From Database