Pyspark connection to Postgres database in ipython notebook

Pyspark connection to Postgres database in ipython notebook - postgresql

I've read previous posts on this, but I still cannot pinpoint why I am unable to connect my ipython notebook to a Postgres db.
I am able to launch pyspark in an ipython notebook, SparkContext is loaded as 'sc'.
I have the following in my .bash_profile for finding the Postgres driver:
export SPARK_CLASSPATH=/path/to/downloaded/jar
Here's what I am doing in the ipython notebook to connect to the db (based on this post):
from pyspark.sql import DataFrameReader as dfr
sqlContext = SQLContext(sc)
table= 'some query'
url = 'postgresql://localhost:5432/dbname'
properties = {'user': 'username', 'password': 'password'}
df = dfr(sqlContext).jdbc(
url='jdbc:%s' % url, table=table, properties=properties
)
The error:
Py4JJavaError: An error occurred while calling o156.jdbc.
: java.SQL.SQLException: No suitable driver.
I understand it's an error with finding the driver I've downloaded, but I don't understand why I am getting this error when I've added the path to it in my .bash_profile.
I also tried to set driver via pyspark --jars, but I get a "no such file or directory" error.
This blogpost also shows how to connect to Postgres data sources, but the following also gives me a "no such directory" error:
./bin/spark-shell --packages org.postgresql:postgresql:42.1.4
Additional info:
spark version: 2.2.0
python version: 3.6
java: 1.8.0_25
postgres driver: 42.1.4

I am not sure why the above answer did not work for me but I thought I could also share what actually worked for me when running pyspark from a jupyter notebook (Spark 2.3.1 - Python 3.6.3):
from pyspark.sql import SparkSession
spark = SparkSession.builder.config('spark.driver.extraClassPath', '/path/to/postgresql.jar').getOrCreate()
url = 'jdbc:postgresql://host/dbname'
properties = {'user': 'username', 'password': 'pwd'}
df = spark.read.jdbc(url=url, table='tablename', properties=properties)

They've changed how this works several times in Apache Spark. Looking at my setup, this is what I have in my .bashrc (aka .bash_profile on Mac), so you could try it: export SPARK_CLASSPATH=$SPARK_CLASSPATH:/absolute/path/to/your/driver.jar Edit: I'm using Spark 1.6.1.
And, as always, make sure you use a new shell or source the script so you have the updated envvar (verify with echo $SPARK_CLASSPATH in your shell before you run ipython notebook).

I followed directions in this post. SparkContext is already set as sc for me, so all I had to do was remove the SPARK_CLASSPATH setting from my .bash_profile, and use the following in my ipython notebook:
import os
os.environ['PYSPARK_SUBMIT_ARGS'] = '--driver-class-path /path/to/postgresql-42.1.4.jar --jars /path/to/postgresql-42.1.4.jar pyspark-shell'
I added a 'driver' settings to properties as well, and it worked. As stated elsewhere in this post, this is likely because SPARK_CLASSPATH is deprecated, and it is preferable to use --driver-class-path.

Related

How to connect to an Oracle DB from a Python Azure Synapse notebook?

I am trying to query an Oracle database from within an Azure Synapse notebook, preferably using Pyodbc but a pyspark solution would also be acceptable. The complexity here comes from, I believe, the low configurability of the spark pool - I believe the code is generally correct.
host = 'my_endpoint.com:[port here as plain numbers, e.g. 1111]/orcl'
database = 'my_db_name'
username = 'my_username'
password = 'my_password'
conn = pyodbc.connect( 'DRIVER={ODBC Driver 17 for SQL Server};'
'SERVER=' + host + ';'
'DATABASE=' + database + ';'
'UID=' + username + ';'
'PWD=' + password + ';')
Approaches I have tried:
Pyodbc - I can use the default driver available ({ODBC Driver 17 for SQL Server}) and I get login timeouts. I have tried both the normal URL of the server and the IP, and with all combinations of port, no port, comma port, colon port, and without the service name "orcl" appended. Code sample is above, but I believe the issue lies with the drivers.
Pyspark .read - With no JDBC driver specified, I get a "No suitable driver" error. I am able to add the OJDBC .jar to the workspace or to my file directory, but I was not able to figure out how to tell spark that it should be used.
cx_oracle - This is not permitted in my workspace.
If the solution requires setting environment variables or using spark-submit, please provide a link that explains how best to do that in Synapse. I would be happy with either a JDBC or an ODBC solution.

By adding the .jar here (ojdbc8-19.15.0.0.1.jar) to the Synapse workspace packages and then adding that package to the Apache spark pool packages, I was able to execute the following code:
host = 'my_host_url'
port = 1521
service_name = 'my_service_name'
jdbcUrl = f'jdbc:oracle:thin:#{host}:{port}:{service_name}'
sql = 'SELECT * FROM my_table'
user = 'my_username'
password = 'my_password'
jdbcDriver = 'oracle.jdbc.driver.OracleDriver'
jdbcDF = spark.read.format('jdbc') \
.option('url', jdbcUrl) \
.option('query', sql) \
.option('user', user) \
.option('password', password) \
.option('driver', jdbcDriver) \
.load()
display(jdbcDF)

How do I access a postgres table from pyspark on IBM's Data Science Experience?

Here is my code:
uname = "xxxxx"
pword = "xxxxx"
dbUrl = "jdbc:postgresql:dbserver"
table = "xxxxx"
jdbcDF = spark.read.format("jdbc").option("url", dbUrl).option("dbtable",table).option("user", uname).option("password", pword).load()
I'm getting a "No suitable driver" error after adding the postgres driver jar (%Addjar -f https://jdbc.postgresql.org/download/postgresql-9.4.1207.jre7.jar). Is there a working example of loading data from postgres in pyspark 2.0 on DSX?

Please make use of pixiedust package manager to install the postgres driver on spark service level.
http://datascience.ibm.com/docs/content/analyze-data/Package-Manager.html
Since Pixiedust is only supported on spark 1.6 , run
pixiedust.installPackage("https://jdbc.postgresql.org/download/postgresql-9.4.1207.jre7.jar")
Once you install this, restart kernel and then
Switch to spark 2.0 to run your postgres connection to get spark dataframe using sparksession.
uname = "username"
pword = "xxxxxx"
dbUrl = "jdbc:postgresql://hostname:10635/compose?user="+uname+"&password="+pword
table = "tablename"
Df = spark.read.format('jdbc').options(url=dbUrl,database='compose',dbtable=table).load()
houseDf.take(1)
Working Notebook:-
https://apsportal.ibm.com/analytics/notebooks/8b220408-6fc7-48a9-8350-246fbbf10ac8/view?access_token=7297af80b2e4109087a78365e7df3205f6ed9d0840c0c46d2208bc00ed0b0274
Thanks,
Charles.

Just provide driver option
option("driver", "org.postgresql.Driver")

Cannot connect to remote MongoDB from EMR cluster with spark-shell

I'm trying to connect to a remote Mongo database from a EMR cluster. The following code is executed with the command spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2:
import com.stratio.datasource.mongodb._
import com.stratio.datasource.mongodb.config._
import com.stratio.datasource.mongodb.config.MongodbConfig._
val builder = MongodbConfigBuilder(Map(Host -> List("[IP.OF.REMOTE.HOST]:3001"), Database -> "meteor", Collection ->"my_target_collection", ("user", "user_name"), ("database", "meteor"), ("password", "my_password")))
val readConfig = builder.build()
val mongoRDD = sqlContext.fromMongoDB(readConfig)
Spark-shell responds with the following error:
16/07/26 15:44:35 INFO SparkContext: Starting job: aggregate at MongodbSchema.scala:47
16/07/26 15:44:45 WARN DAGScheduler: Creating new stage failed due to exception - job: 1
com.mongodb.MongoTimeoutException: Timed out after 10000 ms while waiting to connect. Client view of cluster state is {type=Unknown, servers=[{address=[IP.OF.REMOTE.HOST]:3001, type=Unknown, state=Connecting, exception={java.lang.IllegalArgumentException: response too long: 1347703880}}]
at com.mongodb.BaseCluster.getDescription(BaseCluster.java:128)
at com.mongodb.DBTCPConnector.getClusterDescription(DBTCPConnector.java:394)
at com.mongodb.DBTCPConnector.getType(DBTCPConnector.java:571)
at com.mongodb.DBTCPConnector.getReplicaSetStatus(DBTCPConnector.java:362)
at com.mongodb.Mongo.getReplicaSetStatus(Mongo.java:446)
.
.
.
After reading for a while, a few responses here in SO and other forums state that the java.lang.IllegalArgumentException: response too long: 1347703880 error might be caused by a faulty Mongo driver. Based on that I started executing spark-shell with updated drivers like so:
spark-shell --packages com.stratio.datasource:spark-mongodb_2.10:0.11.2 --jars casbah-commons_2.10-3.1.1.jar,casbah-core_2.10-3.1.1.jar,casbah-query_2.10-3.1.1ja.jar,mongo-java-driver-2.13.0.jar
Of course before this I downloaded the jars and stored them in the same route as the spark-shell was executed. Nonetheless, with this approach spark-shell answers with the following cryptic error message:
Exception in thread "dag-scheduler-event-loop" java.lang.NoClassDefFoundError: com/mongodb/casbah/query/dsl/CurrentDateOp
at com.mongodb.casbah.MongoClient.apply(MongoClient.scala:218)
at com.stratio.datasource.mongodb.partitioner.MongodbPartitioner.isShardedCollection(MongodbPartitioner.scala:78)
It is worth mentioning that the target MongoDB is a Meteor Mongo database, that's why I'm trying to connect with [IP.OF.REMOTE.HOST]:3001 instead of using the port 27017.
What might be the issue? I've followed many tutorials but all of them seem to have the MongoDB in the same host, allowing them to declare localhost:27017 in the credentials. Is there something I'm missing?
Thanks for the help!

I ended up using MongoDB's official Java driver instead. This was my first experience with Spark and the Scala programming language, so I wasn't very familiar with the idea of using plain Java JARs yet.
The solution
I downloaded the necessary JARs and stored them in the same directory as the job file, which is a Scala file. So the directory looked something like:
/job_directory
|--job.scala
|--bson-3.0.1.jar
|--mongodb-driver-3.0.1.jar
|--mongodb-driver-core-3.0.1.jar
Then, I start spark-shell as follows to load the JARs and its classes into the shell environment:
spark-shell --jars "mongodb-driver-3.0.1.jar,mongodb-driver-core-3.0.1.jar,bson-3.0.1.jar"
Next, I execute the following to load the source code of the job into the spark-shell:
:load job.scala
Finally I execute the main object in my job like so:
MainObject.main(Array())
As of the code inside the MainObject, it is merely as the tutorial states:
val mongo = new MongoClient(IP_OF_REMOTE_MONGO , 27017)
val db = mongo.getDB(DB_NAME)
Hopefully this will help future readers and spark-shell/Scala beginners!

SAP HANA Vora failing to see the table contents in Scala

Failing to see the data in Scala using Vora.
VORA: 1.2
Spark: 1.5.2 / Spark controller: 1.5.8
The hdfs file "content" showing fine.
hdfs dfs -cat /user/vora/XXXXXXXX/part-00000
AB05,560
CD06,340
EF07,590
GH08,230
Table showing up fine in the "show datasourcestables" command
scala> vc.sql(s"""SHOW DATASOURCETABLES USING com.sap.spark.vora""".stripMargin ).show
Output
Show table is failing in Scala
scala> vc.sql("select * from VVCSV").show
scala> vc.sql("select * from VVCSV").show
java.lang.RuntimeException: Table Not Found: VVCSV
at scala.sys.package$.error(package.scala:27)
at >org.apache.spark.sql.catalyst.analysis.SimpleCatalog.lookupRelation(Catalog.scala:139)
at >org.apache.spark.sql.extension.ExtendableSQLContext$$anon$2.org$apache$spark$sql$catalyst$analysis$OverrideCatalog$$super$lookupRelation(ExtendableSQLContext.scala:52)
at >org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
at >?>org.apache.spark.sql.catalyst.analysis.OverrideCatalog$$anonfun$lookupRelation$3.apply(Catalog.scala:203)
at scala.Option.getOrElse(Option.scala:120)

Command show datasourcetables is deprecated with Vora1.2 and been replaced by show tables using com.sap.spark.vora. However that command is only showing what is persisted in the Vora catalog. To load the tables in the current Spark context (e.g. after restarting the spark-shell) you need to run the register tables command:
vc.sql("register all tables using com.sap.spark.vora").show
To check what is in the current Spark context you can use the show tables command (without the 'using' clause). For more details you can check the Vora Developer Guide and the Spark documentation.

HiveContext setting in scala+spark project to access existing HDFS

I am trying to access my existing hadoop setup in my spark+scala project
Spark Version 1.4.1
Hadoop 2.6
Hive 1.2.1
from Hive Console I able to create table and access it without any issue, I can also see the same table from Hadoop URL as well.
the problem is when I try to create a table from project, system shows error
ERROR Driver: FAILED: Execution Error, return code 1 from
org.apache.hadoop.hive.ql.exec.DDLTask.
MetaException(message:file:/user/hive/warehouse/src is not a directory
or unable to create one)
following is the code I write:
import
import org.apache.spark._
import org.apache.spark.sql.hive._
Code
val sparkContext = new SparkContext("local[2]", "HiveTable")
val hiveContext = new org.apache.spark.sql.hive.HiveContext(sparkContext)
hiveContext.setConf("hive.metastore.warehouse.dir", "hdfs://localhost:54310/user/hive/warehouse")
hiveContext.sql("CREATE TABLE IF NOT EXISTS src (key INT, value STRING)")
Edit:
instead of create table if I had to execute insert statement like:
hiveContext.sql("INSERT INTO TABLE default.src SELECT 'username','password' FROM foo;")
any help to resolve his issue would be highly appreciable.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Pyspark connection to Postgres database in ipython notebook - postgresql

Related

How to connect to an Oracle DB from a Python Azure Synapse notebook?

How do I access a postgres table from pyspark on IBM's Data Science Experience?

Cannot connect to remote MongoDB from EMR cluster with spark-shell

SAP HANA Vora failing to see the table contents in Scala

HiveContext setting in scala+spark project to access existing HDFS

Categories

Resources