How to save a DataFrame to MySQL using PySpark

I am new to Apache Spark. I have a use case where I have to save DataFrame data to MySQL. I found the code below to do that:
data_frame.write.format('jdbc').options(
    url='URI',
    driver='com.mysql.jdbc.Driver',
    dbtable=table_name,
    user=user_name,
    password='your_password').mode('append').save()
But when I ran the code, I got the below error:
File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o207.save.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
I might be missing out on a very minute detail. How can I fix this?

The error clearly indicates that Spark is not able to locate the JDBC driver class. You will have to include the JAR file containing com.mysql.jdbc.Driver, for example by launching with
pyspark --jars <jar-file-location>
See this question - How to add third-party Java JAR files for use in PySpark.
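Alternatively, the driver JAR can be pulled in when the SparkSession is created instead of on the pyspark command line. A minimal sketch, assuming the session is built in the script; the connector coordinates mysql:mysql-connector-java:8.0.28 and the connection details are placeholders to adjust for your environment:

from pyspark.sql import SparkSession

# Pull the MySQL JDBC driver from Maven Central so that com.mysql.jdbc.Driver
# is on the classpath of both the driver and the executors.
# The connector version is an assumption; match it to your MySQL server.
spark = (SparkSession.builder
         .appName('mysql-write')
         .config('spark.jars.packages', 'mysql:mysql-connector-java:8.0.28')
         .getOrCreate())

# Placeholder data standing in for the question's data_frame
data_frame = spark.createDataFrame([(1, 'a'), (2, 'b')], ['id', 'value'])

data_frame.write.format('jdbc').options(
    url='jdbc:mysql://db-host:3306/mydb',   # placeholder connection URI
    driver='com.mysql.jdbc.Driver',
    dbtable='my_table',
    user='user',
    password='your_password').mode('append').save()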

Related

databricks-connect, py4j.protocol.Py4JJavaError: An error occurred while calling o342.cache

The connection to Databricks works fine, and working with DataFrames goes smoothly (operations like join, filter, etc.).
The problem appears when I call cache on a DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o342.cache.
: java.io.InvalidClassException: failed to read class descriptor
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$client53442a94a3$$anonfun$mapPartitions$1$$anonfun$apply$23
at java.lang.ClassLoader.findClass(ClassLoader.java:523)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:48)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:257)
at org.apache.spark.sql.util.ProtoSerializer.org$apache$spark$sql$util$ProtoSerializer$$readResolveClassDescriptor(ProtoSerializer.scala:4316)
at org.apache.spark.sql.util.ProtoSerializer$$anon$4.readClassDescriptor(ProtoSerializer.scala:4304)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1857)
... 71 more
I work with Java 8 as required; clearing pycache doesn't help.
The same code submitted as a job to Databricks works fine.
It looks like a local problem at the Python-JVM bridge level, but the Java version (8) and Python version (3.7) are as required. Switching to Java 13 produces much the same message.
Versions: databricks-connect==6.2.0, openjdk version "1.8.0_242", Python 3.7.6
EDIT:
The behavior depends on how the DataFrame is created: if the source of the DataFrame is external, it works fine; if the DataFrame is created locally, the error appears.
# works fine
df = spark.read.csv("dbfs:/some.csv")
df.cache()
# ERROR in 'cache' line
df = spark.createDataFrame([("a",), ("b",)])
df.cache()
This is a known issue, and I think a recent patch fixed it. This was seen on Azure; I am not sure whether you are using Azure or AWS, but it's solved. Please check the issue: https://github.com/MicrosoftDocs/azure-docs/issues/52431

Unsupported UCS-4 endianness (3412) detected - Scala/Spark

I was running some code on Spark/Scala and got an error which I don't know the meaning of:
Error : Caused by: java.io.CharConversionException: Unsupported UCS-4 endianness (3412) detected
I imported some files, stored them in a variable, and passed it as a parameter to a function; some files work, but some don't.

Spark executor is throwing error "java.lang.ClassNotFoundException: oracle.jdbc.OracleDriver"

I am trying to import a table from my Oracle database using Spark, and I am using Scala to import it.
My JDBC driver is ojdbc7.jar, and it's added to both spark.driver.extraClassPath and spark.executor.extraClassPath in the configuration file:
spark.driver.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/home/hadoop/ojdbc7.jar
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/home/hadoop/ojdbc7.jar
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
I can successfully import the table and print its schema. But when performing any action like count() or show(), it throws the error below:
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.OracleDriver
at java.lang.ClassLoader.findClass(ClassLoader.java:530)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
... 21 more
This error occurs because Spark is not able to locate ojdbc7.jar on every core node, so placing this JAR in a shared location like /usr/lib/spark/jars will resolve the issue.
You can also do a few other things, including adding the JAR file's full path as a dependency (artifact) under the spark section in the interpreter.
If you just want %jdbc to work, update the jdbc section under the interpreter, add the JAR file's full path as an artifact under the dependencies, and also update default.driver, default.url, default.user, and default.password accordingly.
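Another option is to ship the driver with the job itself via spark.jars, so every executor receives it. A minimal PySpark sketch, assuming the /home/hadoop/ojdbc7.jar path from the configuration above; the connection URL, table, and credentials are placeholders:

from pyspark.sql import SparkSession

# spark.jars distributes the listed JARs to the driver and all executors,
# so oracle.jdbc.OracleDriver can be loaded when an action runs.
spark = (SparkSession.builder
         .appName('oracle-read')
         .config('spark.jars', '/home/hadoop/ojdbc7.jar')
         .getOrCreate())

df = (spark.read.format('jdbc')
      .option('url', 'jdbc:oracle:thin:@//db-host:1521/SERVICE')  # placeholder URL
      .option('dbtable', 'MY_TABLE')                              # placeholder table
      .option('user', 'user')
      .option('password', 'password')
      .option('driver', 'oracle.jdbc.OracleDriver')
      .load())

df.count()  # actions like count() or show() now succeed on the executors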

Pyspark and phoenix table

I would like to use Phoenix tables with PySpark. I tried the solution that I found here: https://phoenix.apache.org/phoenix_spark.html
But I get an error. Can you help me solve it?
df_metadata = sqlCtx.read.format("org.apache.phoenix.spark").option("zkUrl", "xxx").load("lib.name_of_table")
print(df_metadata.collect())
and the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o103.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.phoenix.spark. Please find packages at http://spark-packages.org
How can I use org.apache.phoenix.spark with pyspark?
OK, I found how to correct this code.
I added this part to my spark-submit:
--jars /opt/phoenix-4.8.1-HBase-1.2/phoenix-spark-4.8.1-HBase-1.2.jar,/opt/phoenix-4.8.1-HBase-1.2/phoenix-4.8.1-HBase-1.2-client.jar \
I know the answer given by @Zop works.
I've got this error py4j.protocol.Py4JJavaError: An error occurred while calling o53.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.phoenix.spark. Please find packages at http://spark.apache.org/third-party-projects.html
You can do it this way too:
spark-submit --jars /usr/hdp/current/phoenix-client/phoenix-spark2.jar,/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.4.0-91-client.jar,/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.4.0-91-server.jar <file here>
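Equivalently, the same JARs can be attached when the session is built instead of on the spark-submit command line. A sketch assuming the HDP client paths quoted above and the table/zkUrl placeholders from the question:

from pyspark.sql import SparkSession

# spark.jars takes a comma-separated list; these are the HDP paths from the
# spark-submit example above and may differ in your installation.
spark = (SparkSession.builder
         .appName('phoenix-read')
         .config('spark.jars',
                 '/usr/hdp/current/phoenix-client/phoenix-spark2.jar,'
                 '/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.4.0-91-client.jar')
         .getOrCreate())

df_metadata = (spark.read.format('org.apache.phoenix.spark')
               .option('table', 'lib.name_of_table')  # table name from the question
               .option('zkUrl', 'xxx')                # placeholder ZooKeeper quorum
               .load())
print(df_metadata.collect())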

PriviledgedActionException while running kmeans on hadoop

I am trying to run KMeans on Hadoop, using these guidelines:
http://www.slideshare.net/titusdamaiyanti/hadoop-installation-k-means-clustering-mapreduce?qid=44b5881c-089d-474b-b01d-c35a2f91cc67&v=qf1&b=&from_search=1#likes-panel
I am running this in Eclipse Luna. When I executed it, both map and reduce showed they completed 100%, but I am not getting output. Instead, I am getting the following error at the end. Please help me solve this:
15/03/20 11:29:44 INFO mapred.JobClient: Cleaning up the staging area file:/tmp/hadoop-hduser/mapred/staging/hduser378797276/.staging/job_local378797276_0002
15/03/20 11:29:44 ERROR security.UserGroupInformation: PriviledgedActionException as:hduser cause:java.io.IOException: No input paths specified in job
Exception in thread "main" java.io.IOException: No input paths specified in job
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.listStatus(FileInputFormat.java:193)
at org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat.listStatus(SequenceFileInputFormat.java:55)
at org.apache.hadoop.mapreduce.lib.input.FileInputFormat.getSplits(FileInputFormat.java:252)
at org.apache.hadoop.mapred.JobClient.writeNewSplits(JobClient.java:1054)
at org.apache.hadoop.mapred.JobClient.writeSplits(JobClient.java:1071)
at org.apache.hadoop.mapred.JobClient.access$700(JobClient.java:179)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:983)
at org.apache.hadoop.mapred.JobClient$2.run(JobClient.java:936)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1190)
at org.apache.hadoop.mapred.JobClient.submitJobInternal(JobClient.java:936)
at org.apache.hadoop.mapreduce.Job.submit(Job.java:550)
at org.apache.hadoop.mapreduce.Job.waitForCompletion(Job.java:580)
at com.clustering.mapreduce.KMeansClusteringJob.main(KMeansClusteringJob.java:114)
You have to provide the input file location before running the MapReduce program. There are two ways of providing it:
Using Eclipse, go to Run Configurations and provide the file name as an argument.
Convert your program to a JAR file and run the command below inside your Hadoop cluster:
hadoop jar NameOfYourJarFile InputFileLocation OutputFileLocation