How to connect to an Oracle DB from a Python Azure Synapse notebook? - pyspark

I am trying to query an Oracle database from within an Azure Synapse notebook, preferably using pyodbc, but a PySpark solution would also be acceptable. I believe the difficulty comes from the low configurability of the Spark pool rather than from the code itself, which I think is generally correct.
import pyodbc

host = 'my_endpoint.com:[port here as plain numbers, e.g. 1111]/orcl'
database = 'my_db_name'
username = 'my_username'
password = 'my_password'

conn = pyodbc.connect('DRIVER={ODBC Driver 17 for SQL Server};'
                      'SERVER=' + host + ';'
                      'DATABASE=' + database + ';'
                      'UID=' + username + ';'
                      'PWD=' + password + ';')
Approaches I have tried:
Pyodbc - I can use the default driver available ({ODBC Driver 17 for SQL Server}), and I get login timeouts. I have tried both the normal URL of the server and the IP, with all combinations of port, no port, comma port, colon port, and without the service name "orcl" appended. The code sample is above, but I believe the issue lies with the drivers (see the sketch below this list).
Pyspark .read - With no JDBC driver specified, I get a "No suitable driver" error. I am able to add the OJDBC .jar to the workspace or to my file directory, but I was not able to figure out how to tell spark that it should be used.
cx_Oracle - This is not permitted in my workspace.
If the solution requires setting environment variables or using spark-submit, please provide a link that explains how best to do that in Synapse. I would be happy with either a JDBC or an ODBC solution.
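On the pyodbc attempt (the sketch promised above): {ODBC Driver 17 for SQL Server} only speaks the SQL Server protocol, so it cannot reach an Oracle listener no matter how the SERVER string is written, and the login timeout is expected. pyodbc would only work here if an Oracle ODBC driver (shipped with Oracle Instant Client) were installed on the Spark pool, which is exactly the configuration the pool does not expose. A purely hypothetical sketch of what that would look like, with an assumed driver name and Oracle's DBQ keyword:

import pyodbc

# Hypothetical: requires an Oracle ODBC driver (Oracle Instant Client) on the pool,
# which is the part that is not configurable here. The driver name varies by version.
conn = pyodbc.connect(
    'DRIVER={Oracle ODBC Driver};'       # assumed driver name
    'DBQ=my_endpoint.com:1521/orcl;'     # host:port/service_name
    'UID=' + username + ';'
    'PWD=' + password + ';'
)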

By adding the ojdbc8 .jar (ojdbc8-19.15.0.0.1.jar) to the Synapse workspace packages and then adding that package to the Apache Spark pool packages, I was able to execute the following code:
host = 'my_host_url'
port = 1521
service_name = 'my_service_name'
jdbcUrl = f'jdbc:oracle:thin:@{host}:{port}:{service_name}'
sql = 'SELECT * FROM my_table'
user = 'my_username'
password = 'my_password'
jdbcDriver = 'oracle.jdbc.driver.OracleDriver'

jdbcDF = spark.read.format('jdbc') \
    .option('url', jdbcUrl) \
    .option('query', sql) \
    .option('user', user) \
    .option('password', password) \
    .option('driver', jdbcDriver) \
    .load()

display(jdbcDF)
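One note on the URL: 'jdbc:oracle:thin:@host:port:name' is the SID form of the thin-driver URL. If the database is actually addressed by a service name rather than a SID, Oracle's thin driver expects 'jdbc:oracle:thin:@//host:port/service_name' instead. A small variation of the read above under that assumption, with the same jar, credentials, and an optional fetch size:

# Service-name form of the thin URL (assumes the same ojdbc8 jar and variables as above).
jdbcUrl = f'jdbc:oracle:thin:@//{host}:{port}/{service_name}'

jdbcDF = (spark.read.format('jdbc')
    .option('url', jdbcUrl)
    .option('query', sql)
    .option('user', user)
    .option('password', password)
    .option('driver', jdbcDriver)
    .option('fetchsize', 1000)   # optional: rows fetched per round trip
    .load())

display(jdbcDF)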

Related

Pyspark connector invalid connection string

I'm having problems connecting to a MongoDB Atlas serverless instance. I downloaded the Spark connector from the repo and uploaded the jar to my Azure Synapse Spark pool instance, and I am trying to connect with this code snippet:
query = (spark.read.format("mongodb")
    .option('spark.mongodb.connection.uri', 'mongodb+srv://USER:PASSWORD@SERVERURL.mongodb.net/test?retryWrites=true&w=majority')
    .option('spark.mongodb.database', 'test')
    .option('spark.mongodb.collection', 'test4')
    .load())
But it says
Invalid connection string: 'mongodb+srv://USER:PASSWORD@SERVERURL.mongodb.net/test?retryWrites=true&w=majority'
Caused by: com.mongodb.MongoConfigurationException: A TXT record is only permitted to contain the keys [authsource, replicaset], but the TXT record for 'SERVERURL.mongodb.net' contains the keys [loadbalanced, authsource]
What does this mean? It seems the URL is wrong, but in other snippets I have seen on the forums it appears to be specified the same way.
My versions are:
spark: 3.2.2+
python: 3.8+
mongodb-driver-3.9.1.jar
mongo-spark-connector-10.1.0-SNAPSHOT.jar
mongodb-driver-core-3.9.1.jar
bson-3.9.1.jar

How to connect a database (Postgres) to Airflow Composer on Google Cloud Platform?

I have Airflow set up on my local machine. The DAGs are written so that they need to access a database (Postgres). I am trying to set up a similar thing on Google Cloud Platform, but I am not able to connect the database to Airflow in Composer; I keep getting the error "no host postgres". Any suggestions for setting up Airflow on GCP or connecting a database to Airflow in Composer?
Here is the link to my complete Airflow folder (this setup works fine on my local machine with Docker):
https://github.com/digvijay13873/airflow-docker.git
I am using GCP Composer, and the Postgres database is in a Cloud SQL instance. My table-creation DAG is here:
https://github.com/digvijay13873/airflow-docker/blob/main/dags/tablecreation.py
What changes should I make in my existing DAG to connect it with Postgres in the SQL instance? I tried giving the public IP address of Postgres in the host parameter.
Answering your main question: connecting to a Cloud SQL instance from a Cloud Composer environment can be done in two ways:
Using the public IP
Using the Cloud SQL proxy (recommended): secure access without the need for authorized networks and SSL configuration
Connecting using the public IP:
Postgres: connect directly via TCP (non-SSL)
os.environ['AIRFLOW_CONN_PUBLIC_POSTGRES_TCP'] = (
    "gcpcloudsql://{user}:{password}@{public_ip}:{public_port}/{database}?"
    "database_type=postgres&"
    "project_id={project_id}&"
    "location={location}&"
    "instance={instance}&"
    "use_proxy=False&"
    "use_ssl=False".format(**postgres_kwargs)
)
For more information, refer to the example on GitHub.
Connecting using the Cloud SQL proxy: you can connect using the Auth proxy from GKE as per this documentation.
After setting up the SQL proxy, you can connect Composer to your SQL instance through the proxy.
Exemplar code:
# The snippet below assumes the usual Airflow example imports (os, expanduser, quote_plus,
# timedelta, DAG, PythonOperator, CloudSqlQueryOperator, google.cloud storage) and that the
# GCSQL_POSTGRES_* / GCP_* settings and default_args are defined elsewhere.
SQL = [
    'CREATE TABLE IF NOT EXISTS TABLE_TEST (I INTEGER)',
    'CREATE TABLE IF NOT EXISTS TABLE_TEST (I INTEGER)',  # duplicated in the example; IF NOT EXISTS makes the second statement a no-op
    'INSERT INTO TABLE_TEST VALUES (0)',
    'CREATE TABLE IF NOT EXISTS TABLE_TEST2 (I INTEGER)',
    'DROP TABLE TABLE_TEST',
    'DROP TABLE TABLE_TEST2',
]

HOME_DIR = expanduser("~")

def get_absolute_path(path):
    if path.startswith("/"):
        return path
    else:
        return os.path.join(HOME_DIR, path)

postgres_kwargs = dict(
    user=quote_plus(GCSQL_POSTGRES_USER),
    password=quote_plus(GCSQL_POSTGRES_PASSWORD),
    public_port=GCSQL_POSTGRES_PUBLIC_PORT,
    public_ip=quote_plus(GCSQL_POSTGRES_PUBLIC_IP),
    project_id=quote_plus(GCP_PROJECT_ID),
    location=quote_plus(GCP_REGION),
    instance=quote_plus(GCSQL_POSTGRES_INSTANCE_NAME_QUERY),
    database=quote_plus(GCSQL_POSTGRES_DATABASE_NAME),
)

os.environ['AIRFLOW_CONN_PROXY_POSTGRES_TCP'] = \
    "gcpcloudsql://{user}:{password}@{public_ip}:{public_port}/{database}?" \
    "database_type=postgres&" \
    "project_id={project_id}&" \
    "location={location}&" \
    "instance={instance}&" \
    "use_proxy=True&" \
    "sql_proxy_use_tcp=True".format(**postgres_kwargs)

connection_names = [
    "proxy_postgres_tcp",
]

dag = DAG(
    'con_SQL',
    default_args=default_args,
    description='A DAG that connects to the SQL server.',
    schedule_interval=timedelta(days=1),
)

def print_client(ds, **kwargs):
    client = storage.Client()
    print(client)

print_task = PythonOperator(
    task_id='print_the_client',
    provide_context=True,
    python_callable=print_client,
    dag=dag,
)

for connection_name in connection_names:
    task = CloudSqlQueryOperator(
        gcp_cloudsql_conn_id=connection_name,
        task_id="example_gcp_sql_task_" + connection_name,
        sql=SQL,
        dag=dag,
    )
    print_task >> task

'h' format requires -32768 <= number <= 32767

I am trying to write a dataframe into a PostgreSQL database table.
When I write it into Heroku's PostgreSQL database, everything works fine, no problems.
For Heroku PostgreSQL, I use the connection string
connection_string = "postgresql+psycopg2://%s:%s@%s/%s" % (
    conn_params['user'],
    conn_params['password'],
    conn_params['host'],
    conn_params['dbname'])
However, when I try to write the dataframe into GCP's Cloud SQL table, I get the following error:
struct.error: 'h' format requires -32768 <= number <= 32767
The connection string I use for GCP Cloud SQL is as follows.
connection_string = \
    f"postgresql+pg8000://{conn_params['user']}:{conn_params['password']}@{conn_params['host']}/{conn_params['dbname']}"
The command I use to write to the database is the same for both GCP and Heroku:
df_Output.to_sql(sql_tablename, con=conn, schema='public', index=False, if_exists=if_exists, method='multi')
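The 'h' in that message is the struct format code for a signed 16-bit integer. With pg8000 this usually means a single statement ended up with more than 32767 bind parameters: method='multi' folds many rows into one INSERT, and the parameter count is encoded as a 16-bit value, so wide frames overflow it. A minimal, hedged workaround sketch, assuming the same df_Output, conn, sql_tablename, and if_exists as above, is to cap the chunk size so rows x columns stays under the limit:

# Keep each multi-row INSERT under the 16-bit parameter limit
# (assumes df_Output, conn, sql_tablename and if_exists from the snippets above).
max_params = 32767
chunksize = max(1, max_params // max(1, len(df_Output.columns)))

df_Output.to_sql(sql_tablename, con=conn, schema='public', index=False,
                 if_exists=if_exists, method='multi', chunksize=chunksize)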
I'd recommend using the Cloud SQL Python Connector to manage your connections and take care of the connection string for you. It supports the pg8000 driver and should help resolve your troubles.
from google.cloud.sql.connector import connector
import sqlalchemy

# configure Cloud SQL Python Connector properties
def getconn():
    conn = connector.connect(
        "project:region:instance",
        "pg8000",
        user="YOUR_USER",
        password="YOUR_PASSWORD",
        db="YOUR_DB"
    )
    return conn

# create connection pool to re-use connections
pool = sqlalchemy.create_engine(
    "postgresql+pg8000://",
    creator=getconn,
)

# query or insert into Cloud SQL database
with pool.connect() as db_conn:
    # query database
    result = db_conn.execute("SELECT * from my_table").fetchall()
    # Do something with the results
    for row in result:
        print(row)
For more detailed examples refer to the README of the repository.
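Since the original goal was a dataframe write, the pool above can also be handed straight to pandas; a short sketch assuming the pool, df_Output, sql_tablename, and if_exists from earlier:

# pandas accepts a SQLAlchemy engine, so the connector-backed pool can drive to_sql directly.
df_Output.to_sql(sql_tablename, con=pool, schema='public', index=False,
                 if_exists=if_exists, method='multi', chunksize=1000)  # a modest chunksize keeps parameter counts low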

Spark job dataframe write to Oracle using jdbc failing

When writing a Spark dataframe to an Oracle database (Oracle Database 11g Enterprise Edition Release 11.2.0.4.0 - 64bit), the Spark job fails with the exception java.sql.SQLRecoverableException: IO Error: The Network Adapter could not establish the connection. The Scala code is
dataFrame.write.mode(SaveMode.Append).jdbc("jdbc:oracle:thin:@" + ipPort + ":" + sid, table, props)
I have already tried setting the below properties for the JDBC connection, but it hasn't worked.
props.put("driver", "oracle.jdbc.OracleDriver")
props.setProperty("testOnBorrow","true")
props.setProperty("testOnReturn","false")
props.setProperty("testWhileIdle","false")
props.setProperty("validationQuery","SELECT 1 FROM DUAL")
props.setProperty("autoReconnect", "true")
Based on earlier search results, it seems that the connection is opened initially but is killed by the firewall after some idle time. The connection URL is verified and working, as SELECT queries run fine. I need help getting this resolved.
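A hedged note on the error itself: "The Network Adapter could not establish the connection" is raised before any SQL runs, meaning the JVM could not open a TCP connection to the listener at all, and with Spark the listener has to be reachable from the executors, not just the driver. A quick reachability check pushed out to the executors (PySpark here for brevity; the host and port below are placeholders standing in for the values inside ipPort):

import socket

def can_reach(host, port, timeout=5):
    # True if a TCP connection to the Oracle listener can be opened from this machine.
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# From the driver...
print(can_reach('db-host.example.com', 1521))    # placeholder host/port

# ...and from the executors, where the JDBC write actually runs.
print(sc.parallelize(range(4), 4)
        .map(lambda _: can_reach('db-host.example.com', 1521))
        .collect())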

How do I access a postgres table from pyspark on IBM's Data Science Experience?

Here is my code:
uname = "xxxxx"
pword = "xxxxx"
dbUrl = "jdbc:postgresql:dbserver"
table = "xxxxx"
jdbcDF = spark.read.format("jdbc").option("url", dbUrl).option("dbtable",table).option("user", uname).option("password", pword).load()
I'm getting a "No suitable driver" error after adding the postgres driver jar (%Addjar -f https://jdbc.postgresql.org/download/postgresql-9.4.1207.jre7.jar). Is there a working example of loading data from postgres in pyspark 2.0 on DSX?
Please make use of the PixieDust package manager to install the Postgres driver at the Spark service level.
http://datascience.ibm.com/docs/content/analyze-data/Package-Manager.html
Since PixieDust is only supported on Spark 1.6, run
pixiedust.installPackage("https://jdbc.postgresql.org/download/postgresql-9.4.1207.jre7.jar")
Once you install this, restart the kernel, then
switch to Spark 2.0 to run your Postgres connection and get a Spark dataframe using the SparkSession.
uname = "username"
pword = "xxxxxx"
dbUrl = "jdbc:postgresql://hostname:10635/compose?user="+uname+"&password="+pword
table = "tablename"
Df = spark.read.format('jdbc').options(url=dbUrl,database='compose',dbtable=table).load()
Df.take(1)
Working notebook:
https://apsportal.ibm.com/analytics/notebooks/8b220408-6fc7-48a9-8350-246fbbf10ac8/view?access_token=7297af80b2e4109087a78365e7df3205f6ed9d0840c0c46d2208bc00ed0b0274
Thanks,
Charles.
Just provide the driver option:
option("driver", "org.postgresql.Driver")