Using PostgreSQL JDBC source with Apache Spark on EMR

I have an existing EMR cluster running and wish to create a DataFrame from a PostgreSQL DB source.
To do this, it seems you need to modify spark-defaults.conf with an updated spark.driver.extraClassPath pointing to the relevant PostgreSQL JAR that has already been downloaded on the master and slave nodes, or you can pass these as arguments to a spark-submit job.
Since I want to use an existing Jupyter notebook to wrangle the data, and am not really looking to relaunch the cluster, what is the most efficient way to resolve this?
I tried the following:
Created a new directory (/usr/lib/postgresql/) on the master and slaves and copied the PostgreSQL jar into it (postgresql-9.41207.jre6.jar).
Edited spark-defaults.conf to include the wildcard location:
spark.driver.extraClassPath :/usr/lib/postgresql/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/$
Tried to create the dataframe in a Jupyter cell using the following code:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"
spark.read.jdbc(SQL_CONN, table="someTable", properties={"driver":'com.postgresql.jdbc.Driver'})
I get the following Java error:
Py4JJavaError: An error occurred while calling o396.jdbc.
: java.lang.ClassNotFoundException: com.postgresql.jdbc.Driver
Help appreciated.

I think you don't need to copy the Postgres jar to the slaves, as the driver program and the cluster manager take care of everything. I've created a dataframe from a Postgres external source in the following way:
Download the Postgres driver jar:
cd $HOME && wget https://jdbc.postgresql.org/download/postgresql-42.2.5.jar
Create the dataframe:
attribute = {'url' : 'jdbc:postgresql://{host}:{port}/{db}?user={user}&password={password}' \
                 .format(host=<host>, port=<port>, db=<db>, user=<user>, password=<password>),
             'database' : <db>,
             # dbtable takes a table name, or a parenthesised subquery with an alias,
             # e.g. '(select * from table) as t'
             'dbtable' : <table>}
df = spark.read.format('jdbc').options(**attribute).load()
Submit to spark job:
Add the downloaded jar to the driver class path when submitting the Spark job:
--properties spark.driver.extraClassPath=$HOME/postgresql-42.2.5.jar,spark.jars.packages=org.postgresql:postgresql:42.2.5
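Since the original question is about an existing Jupyter notebook, a hedged alternative (a sketch, assuming you build the SparkSession yourself in the notebook rather than reusing a pre-created one) is to pull the driver by its Maven coordinates when the session is constructed, so no jars need to be copied to the nodes:
from pyspark.sql import SparkSession

# Sketch: resolve the Postgres driver from Maven instead of copying jars around.
# Note: spark.jars.packages only takes effect if it is set before the session's
# JVM starts, i.e. when the SparkSession is first created.
spark = (
    SparkSession.builder
    .appName("postgres-jdbc")
    .config("spark.jars.packages", "org.postgresql:postgresql:42.2.5")
    .getOrCreate()
)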

Check the GitHub repo of the driver. The driver class name seems to be org.postgresql.Driver, not com.postgresql.jdbc.Driver as used in the question. Try using that.
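Concretely, the read from the question should work once the driver class name is fixed; a minimal sketch using the question's own placeholder connection string:
SQL_CONN = "jdbc:postgresql://some_postgresql_db:5432/dbname?user=user&password=password"

df = spark.read.jdbc(
    url=SQL_CONN,
    table="someTable",
    properties={"driver": "org.postgresql.Driver"},  # not com.postgresql.jdbc.Driver
)
df.show()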

Related

Using Spark-Submit to write to S3 in "local" mode using S3A Directory Committer

I'm currently running PySpark via local mode. I want to be able to efficiently output parquet files to S3 via the S3 Directory Committer. This PySpark instance is using the local disk, not HDFS, as it is being submitted via spark-submit --master local[*].
I can successfully write to my S3 Instance without enabling the directory committer. However, this involves writing staging files to S3 and renaming them, which is slow and unreliable. I would like for Spark to write to my local filesystem as a temporary store, and then copy to S3.
I have the following configuration in my PySpark conf:
self.spark.conf.set("spark.sql.execution.arrow.pyspark.enabled", "true")
self.spark.conf.set("spark.hadoop.fs.s3a.committer.name", "directory")
self.spark.conf.set("spark.sql.sources.commitProtocolClass", "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
self.spark.conf.set("spark.sql.parquet.output.committer.class", "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
My spark-submit command looks like this:
spark-submit --master local[*] --py-files files.zip --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark.internal.io.cloud.PathOutputCommitProtocol --driver-memory 4G --name clean-raw-recording_data main.py
spark-submit gives me the following error, due to the requisite JAR not being in place:
java.lang.ClassNotFoundException: org.apache.spark.internal.io.cloud.PathOutputCommitProtocol
My questions are:
Which JAR (specifically, the maven coordinates) do I need to include in spark-submit --packages in order to be able to reference PathOutputCommitProtocol?
Once I have (1) working, will I be able to use PySpark's local mode to stage temporary files on the local filesystem? Or is HDFS a strict requirement?
I need this to be running in local mode, not cluster mode.
EDIT:
I got this to work with the following configuration:
Using pyspark version 3.1.2 and the package
org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253.
I needed to add the cloudera repository using the --repositories option for spark-submit:
spark-submit --repositories https://repository.cloudera.com/artifactory/cloudera-repos/ --packages com.amazonaws:aws-java-sdk:1.11.375,org.apache.hadoop:hadoop-aws:3.2.0,org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253
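For completeness, a hedged sketch of the same dependency setup done programmatically when building the session in local mode (repository and package coordinates are copied from the spark-submit line above; the committer settings mirror the question's conf):
from pyspark.sql import SparkSession

# Sketch: declare the extra repository and packages before the session is created,
# then enable the S3A directory committer bindings.
spark = (
    SparkSession.builder
    .master("local[*]")
    .config("spark.jars.repositories",
            "https://repository.cloudera.com/artifactory/cloudera-repos/")
    .config("spark.jars.packages",
            "com.amazonaws:aws-java-sdk:1.11.375,"
            "org.apache.hadoop:hadoop-aws:3.2.0,"
            "org.apache.spark:spark-hadoop-cloud_2.12:3.1.1.3.1.7270.0-253")
    .config("spark.hadoop.fs.s3a.committer.name", "directory")
    .config("spark.sql.sources.commitProtocolClass",
            "org.apache.spark.internal.io.cloud.PathOutputCommitProtocol")
    .config("spark.sql.parquet.output.committer.class",
            "org.apache.spark.internal.io.cloud.BindingParquetOutputCommitter")
    .getOrCreate()
)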
You need the spark-hadoop-cloud module for the release of Spark you are using.
The committer is happy using the local FS (that is how the public integration test suites at https://github.com/hortonworks-spark/cloud-integration now work). All that's needed is a "real" filesystem shared across all workers and the Spark driver, so the driver gets the manifests of each pending commit.
Print the _SUCCESS file after a job to see what the committer did: a 0-byte file == old committer, JSON with diagnostics == new one.
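As a quick check of that last point, a minimal sketch (the output path is hypothetical) that inspects _SUCCESS from PySpark:
# Hypothetical output path; point this at wherever the job just wrote.
success_path = "s3a://my-bucket/output/_SUCCESS"

# wholeTextFiles returns (path, contents) pairs for the matched file(s).
for path, body in spark.sparkContext.wholeTextFiles(success_path).collect():
    if body.strip():
        print("New committer: _SUCCESS contains JSON diagnostics")
        print(body)
    else:
        print("Old committer: _SUCCESS is a 0-byte marker file")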

Working with jdbc jar in pyspark

I need to read from a PostgreSQL database in PySpark.
I know this has been asked before, such as here, here and many other places; however, the solutions there either use a jar in the local running directory or copy it to all workers manually.
I downloaded the postgresql-9.4.1208 jar and placed it in /tmp/jars. I then proceeded to call pyspark with the --jars and --driver-class-path switches:
pyspark --master yarn --jars /tmp/jars/postgresql-9.4.1208.jar --driver-class-path /tmp/jars/postgresql-9.4.1208.jar
Inside pyspark I did:
df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name").load()
df.count()
However, while using --jars and --driver-class-path worked fine for jars I created, it failed for the JDBC driver, and I got an exception from the workers:
java.lang.IllegalStateException: Did not find registered driver with class org.postgresql.Driver
If I copy the jar manually to all workers and add --conf spark.executor.extraClassPath and --conf spark.driver.extraClassPath, it does work (with the same jar). The documentation, by the way, suggests using SPARK_CLASSPATH, which is deprecated and actually just adds these two switches (but has the side effect of preventing the addition of OTHER jars with the --jars option, which I need to do).
So my question is: what is special about the JDBC driver that makes it not work, and how can I add it without having to manually copy it to all workers?
Update:
I did some more looking and found this in the documentation:
"The JDBC driver class must be visible to the primordial class loader on the client session and on all executors. This is because Java’s DriverManager class does a security check that results in it ignoring all drivers not visible to the primordial class loader when one goes to open a connection. One convenient way to do this is to modify compute_classpath.sh on all worker nodes to include your driver JARs.".
The problem is I can't seem to find compute_classpath.sh, nor do I understand what the primordial class loader means.
I did find this, which basically explains that this needs to be done locally.
I also found this, which basically says there is a fix, but it is not yet available in version 1.6.1.
I found a solution which works (I don't know if it is the best one, so feel free to continue commenting).
Apparently, if I add the option driver="org.postgresql.Driver", this works properly, i.e. my full line (inside pyspark) is:
df = sqlContext.read.format("jdbc").options(url="jdbc:postgresql://ip_address:port/db_name?user=myuser&password=mypasswd", dbtable="table_name",driver="org.postgresql.Driver").load()
df.count()
Another thing: if you are already using a fat jar of your own (as I am in my full application), then all you need to do is add the JDBC driver to your pom file like so:
<dependency>
    <groupId>org.postgresql</groupId>
    <artifactId>postgresql</artifactId>
    <version>9.4.1208</version>
</dependency>
and then you don't have to add the driver as a separate jar, just use the jar with dependencies.
What version of the documentation are you looking at?
It seems like compute-classpath.sh was removed a while back - it is still present in Spark 1.3.1 but gone in 1.4.0:
$ unzip -l spark-1.3.1.zip | egrep '\.sh' | egrep classpa
6592 2015-04-11 00:04 spark-1.3.1/bin/compute-classpath.sh
$ unzip -l spark-1.4.0.zip | egrep '\.sh' | egrep classpa
produces nothing.
I think you should be using load-spark-env.sh to set your classpath:
$/opt/spark-1.6.0-bin-hadoop2.6/bin/load-spark-env.sh
and you'll need to set SPARK_CLASSPATH in your $SPARK_HOME/conf/spark-env.sh file (which you'll copy over from the template file $SPARK_HOME/conf/spark-env.sh.template).
I think this is another manifestation of the issue described and fixed here: https://github.com/apache/spark/pull/12000. I authored that fix 3 weeks ago and there has been no movement on it. Maybe if others also report that they have been affected by it, that would help?

Spark cannot find the postgres jdbc driver

EDIT: See the edit at the end
First of all, I am using Spark 1.5.2 on Amazon EMR and Amazon RDS for my Postgres database. Second, I am a complete newbie in this world of Spark, Hadoop and MapReduce.
Essentially my problem is the same as for this guy:
java.sql.SQLException: No suitable driver found when loading DataFrame into Spark SQL
So the dataframe is loaded, but when I try to evaluate it (doing df.show(), where df is the dataframe), I get the error:
java.sql.SQLException: No suitable driver found for jdbc:postgresql://mypostgres.cvglvlp29krt.eu-west-1.rds.amazonaws.com:5432/mydb
I should note that I start spark like this:
spark-shell --driver-class-path /home/hadoop/postgresql-9.4.1207.jre7.jar
The solutions suggest delivering the jar to the worker nodes and setting the classpath on them somehow, which I don't really understand how to do. But then they say the issue was apparently fixed in Spark 1.4, and I'm using 1.5.2 and still having this issue, so what is going on?
EDIT: It looks like I resolved the issue; however, I still don't quite understand why this works and the approach above doesn't, so I guess my question is now: why does doing this:
spark-shell --driver-class-path /home/hadoop/postgresql-9.4.1207.jre7.jar --conf spark.driver.extraClassPath=/home/hadoop/postgresql-9.4.1207.jre7.jar --jars /home/hadoop/postgresql-9.4.1207.jre7.jar
solve the problem? It seems I just added the path as a parameter to a few more of the flags.
spark-shell --driver-class-path .... --jars ... works because all jar files listed in --jars are automatically distributed over the cluster.
Alternatively you could use
spark-shell --packages org.postgresql:postgresql:9.4.1207.jre7
and specify the driver class as an option for DataFrameReader / DataFrameWriter:
val df = sqlContext.read.format("jdbc").options(Map(
"url" -> url, "dbtable" -> table, "driver" -> "org.postgresql.Driver"
)).load()
or even manually copy required jars to the workers and place these somewhere on the CLASSPATH.
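Since most of the related questions are PySpark-based, here is a hedged PySpark equivalent of the same read (host, port and credentials are placeholders):
df = (spark.read.format("jdbc")
      .option("url", "jdbc:postgresql://ip_address:port/db_name")  # placeholder connection details
      .option("dbtable", "table_name")
      .option("user", "myuser")
      .option("password", "mypasswd")
      .option("driver", "org.postgresql.Driver")
      .load())
df.count()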

Exception after Setting property 'spark.sql.hive.metastore.jars' in 'spark-defaults.conf'

Given below are the versions of Spark and Hive I have installed on my system:
Spark : spark-1.4.0-bin-hadoop2.6
Hive : apache-hive-1.0.0-bin
I have configured the Hive installation to use MySQL as the Metastore. The goal is to access the MySQL Metastore and execute HiveQL queries inside spark-shell (using HiveContext).
So far I am able to execute HiveQL queries by accessing the Derby Metastore (as described here; I believe Spark 1.4 comes bundled with Hive 0.13.1, which in turn uses the internal Derby database as the Metastore).
Then I tried to point spark-shell to my external Metastore (MySQL in this case) by setting the property (as suggested here) given below in $SPARK_HOME/conf/spark-defaults.conf:
spark.sql.hive.metastore.jars /home/mountain/hv/lib:/home/mountain/hp/lib
I have also copied $HIVE_HOME/conf/hive-site.xml into $SPARK_HOME/conf, but I am getting the following exception when I start spark-shell:
mountain#mountain:~/del$ spark-shell
Spark context available as sc.
java.lang.ClassNotFoundException: java.lang.NoClassDefFoundError:
org/apache/hadoop/hive/ql/session/SessionState when creating Hive client
using classpath: file:/home/mountain/hv/lib/, file:/home/mountain/hp/lib/
Please make sure that jars for your version of hive and hadoop are
included in the paths passed to spark.sql.hive.metastore.jars.
Am I missing something, or am I not setting the property spark.sql.hive.metastore.jars correctly?
Note: verified on Linux Mint.
If you set properties in spark-defaults.conf, Spark will pick up those settings only when you submit your job using spark-submit.
file: spark-defaults.conf
spark.driver.extraJavaOptions -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1
In the terminal, run your job, say wordcount.py:
spark-submit /path-to-file/wordcount.py
If you want to run your job in development mode from an IDE, then you should use the config() method. Here we set the Kafka jar packages:
spark = SparkSession.builder \
.appName('Hello Spark') \
.master('local[3]') \
.config("spark.streaming.stopGracefullyOnShutdown", "true") \
.config("spark.jars.packages", "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
.getOrCreate()
A corrupted version of hive-site.xml will cause this... please copy the correct hive-site.xml.

HBase Export/Import: Unable to find output directory

I am using HBase for my application and I am trying to export the data using org.apache.hadoop.hbase.mapreduce.Export, as directed here. The issue I am facing is that once the command is executed, there are no errors while creating the export, but the specified output directory does not appear in its place. The command I used was:
$ bin/hbase org.apache.hadoop.hbase.mapreduce.Export table_name db_dump/
I found the solution, so I am posting my own answer.
You must have the following two lines in hadoop-env.sh in the conf directory of Hadoop:
export HBASE_HOME=/home/sitepulsedev/hbase/hbase-0.90.4
export HADOOP_CLASSPATH=$HBASE_HOME/hbase-0.90.4.jar:$HBASE_HOME/conf:$HBASE_HOME/hbase-0.90.4-test.jar:$HBASE_HOME/lib/zookeeper-3.3.2.jar:$HBASE_HOME
Save it and restart MapReduce with ./stop-mapred.sh and ./start-mapred.sh.
Now, from the bin directory of Hadoop, run:
./hadoop jar ~/hbase/hbase-0.90.4/hbase-0.90.4.jar export your_table /export/your_table
Now you can verify the dump by running:
./hadoop fs -ls /export
Finally, you need to copy the whole thing to your local filesystem, for which you run:
./hadoop fs -copyToLocal /export/your_table ~/local_dump/your_table
Here are the references that helped me out with export/import and with Hadoop shell commands.
Hope this one helps you out!!
As you noticed, the HBase export tool creates the backup in HDFS; if you instead want the output written to your local FS, you can use a file URI. In your example it would be something similar to:
bin/hbase org.apache.hadoop.hbase.mapreduce.Export table_name file:///tmp/db_dump/
Related to your own answer, this would also avoid going through HDFS. Just be very careful if you are running this in a cluster of servers, because each server will write the result files to its own local filesystem.
This is true for HBase 0.94.6 at least.
Hope this helps
I think the previous answer needs some modification:
Platform: AWS EC2,
OS: Amazon Linux
Hbase Version: 0.96.1.1
Hadoop Distribution: Cloudera CDH5.0.1
MR engine: MRv1
To export data from an HBase table to the local filesystem:
sudo -u hdfs /usr/bin/hbase org.apache.hadoop.hbase.mapreduce.Export -Dmapred.job.tracker=local "table_name" "file:///backups/"
This command will dump the data in HFile format, with the number of files equal to the number of regions of that table in HBase.