PySpark and Phoenix table - pyspark

I would like to use Phoenix tables with PySpark. I tried the solution I found here: https://phoenix.apache.org/phoenix_spark.html
But I get an error. Can you help me solve it?
df_metadata = sqlCtx.read.format("org.apache.phoenix.spark").option("zkUrl", "xxx").load("lib.name_of_table")
print(df_metadata.collect())
and the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o103.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.phoenix.spark. Please find packages at http://spark-packages.org
How can I use org.apache.phoenix.spark with PySpark?

OK, I found how to correct this code.
I added this part to my spark-submit:
--jars /opt/phoenix-4.8.1-HBase-1.2/phoenix-spark-4.8.1-HBase-1.2.jar,/opt/phoenix-4.8.1-HBase-1.2/phoenix-4.8.1-HBase-1.2-client.jar \
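That makes the org.apache.phoenix.spark data source visible to the driver and executors. For reference, here is a minimal sketch of the same fix expressed through spark.jars instead of the --jars flag, assuming the same JAR paths and the placeholder zkUrl/table name from the question (spark.jars has to be set before the SparkContext is created, e.g. at the top of a script launched with spark-submit):

from pyspark.sql import SparkSession

# Sketch: the JAR paths are the ones from the answer above; adjust them,
# the zkUrl and the table name to your installation.
spark = (
    SparkSession.builder
    .appName("phoenix-read")
    .config(
        "spark.jars",
        "/opt/phoenix-4.8.1-HBase-1.2/phoenix-spark-4.8.1-HBase-1.2.jar,"
        "/opt/phoenix-4.8.1-HBase-1.2/phoenix-4.8.1-HBase-1.2-client.jar",
    )
    .getOrCreate()
)

df_metadata = (
    spark.read.format("org.apache.phoenix.spark")
    .option("zkUrl", "xxx")          # placeholder ZooKeeper quorum URL
    .load("lib.name_of_table")
)
print(df_metadata.collect())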

I know the answer given by @Zop works.
I've got this error py4j.protocol.Py4JJavaError: An error occurred while calling o53.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.phoenix.spark. Please find packages at http://spark.apache.org/third-party-projects.html
You can also do it this way:
spark-submit --jars /usr/hdp/current/phoenix-client/phoenix-spark2.jar,/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.4.0-91-client.jar,/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.4.0-91-server.jar <file here>

Related

java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf.useDeprecatedKafkaOffsetFetching()Z

I want to display a Spark DataFrame, and I used:
df.writeStream.outputMode("append").start().awaitTermination()
But I got this error when running that line:
21/07/16 01:20:53 ERROR MicroBatchExecution: Query [id = f243e6e6-c02e-4e70-b5c3-6a821fd33232, runId = 312544cf-fea8-45b4-94a1-c052306538cf] terminated with error
java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf.useDeprecatedKafkaOffsetFetching()Z
Check the version of Spark and the versions of the dependencies you have added, and make sure they match. This will resolve the issue.
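For example, a quick way to confirm the version the job is actually running on from PySpark is sketched below (the org.apache.spark:spark-sql-kafka-0-10 coordinate in the comment is named only as an illustration of the kind of dependency that must match):

from pyspark.sql import SparkSession
import pyspark

spark = SparkSession.builder.appName("version-check").getOrCreate()

# The Spark version the job runs on; any Kafka integration artifact you
# add (e.g. org.apache.spark:spark-sql-kafka-0-10_<scala>:<version>)
# must be built for this same version.
print(spark.version)
print(pyspark.__version__)

If the connector JAR was built for a newer Spark than the one running the job, methods such as SQLConf.useDeprecatedKafkaOffsetFetching() can be missing at runtime, which is typically what the NoSuchMethodError above indicates.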

ERROR Stopping due to error (org.apache.kafka.connect.cli.ConnectDistributed) java.lang.NoClassDefFoundError: io/debezium/util/IoUtil

Objective
I am trying to connect to my Oracle database (12c) from Kafka Connect (ideally in distributed mode) using the Debezium connector (1.2.4.Final). The Kafka version I am using is 2.13-2.6.0.
Command used
As mentioned here, I am running this command:
C:\Users\username\Downloads\kafka>bin\windows\connect-distributed.bat config\connect-distributed.properties
Error
The error I am getting is:
ERROR Stopping due to error (org.apache.kafka.connect.cli.ConnectDistributed)
java.lang.NoClassDefFoundError: io/debezium/util/IoUtil
at io.debezium.connector.oracle.Module.<clinit>(Module.java:19)
at io.debezium.connector.oracle.OracleConnector.version(OracleConnector.java:23)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.versionFor(DelegatingClassLoader.java:390)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.versionFor(DelegatingClassLoader.java:395)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.getPluginDesc(DelegatingClassLoader.java:365)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanPluginPath(DelegatingClassLoader.java:337)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanUrlsAndAddPlugins(DelegatingClassLoader.java:268)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.registerPlugin(DelegatingClassLoader.java:260)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initPluginLoader(DelegatingClassLoader.java:229)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initLoaders(DelegatingClassLoader.java:206)
at org.apache.kafka.connect.runtime.isolation.Plugins.<init>(Plugins.java:61)
at org.apache.kafka.connect.cli.ConnectDistributed.startConnect(ConnectDistributed.java:91)
at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:78)
Caused by: java.lang.ClassNotFoundException: io.debezium.util.IoUtil
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.kafka.connect.runtime.isolation.PluginClassLoader.loadClass(PluginClassLoader.java:104)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 13 more
Settings
In my connect-distributed.properties, I have this:
plugin.path=C:/Users/username/Downloads/kafka/libs/debezium
And inside the debezium folder (following Gunnar's recommendation from the comment in this question), I have the Debezium connector JARs.
I also added the plugin path to %PATH% as follows:
echo %PATH% | findstr debezium
XXX;C:\Users\username\Downloads\kafka\libs\debezium;
Help
Any help would be greatly appreciated, as I hope to replace my database polling with this Debezium connector, which seems like a better approach. Thanks!
The solution from Gunnar here works! (His explanation is there too if you want to check it out.)
plugin.path=C:\\Users\\username\\Downloads\\kafka\\libs
and these also work:
plugin.path=C:/Users/username/Downloads/kafka/libs
plugin.path=C:\Users\username\Downloads\kafka\libs
plugin.path=/Users/username/Downloads/kafka/libs
The mistake was that plugin.path should point up to libs, not to libs/debezium.

databricks-connect, py4j.protocol.Py4JJavaError: An error occurred while calling o342.cache

Connecting to Databricks works fine, and working with DataFrames goes smoothly (operations like join, filter, etc.).
The problem appears when I call cache on a DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o342.cache.
: java.io.InvalidClassException: failed to read class descriptor
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$client53442a94a3$$anonfun$mapPartitions$1$$anonfun$apply$23
at java.lang.ClassLoader.findClass(ClassLoader.java:523)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:48)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:257)
at org.apache.spark.sql.util.ProtoSerializer.org$apache$spark$sql$util$ProtoSerializer$$readResolveClassDescriptor(ProtoSerializer.scala:4316)
at org.apache.spark.sql.util.ProtoSerializer$$anon$4.readClassDescriptor(ProtoSerializer.scala:4304)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1857)
... 71 more
I work with Java 8 as required; clearing pycache doesn't help.
The same code submitted as a job to Databricks works fine.
It looks like a local problem at the Python-JVM bridge level, but the Java version (8) and Python version (3.7) are as required. Switching to Java 13 produces much the same message.
Versions: databricks-connect==6.2.0, openjdk version "1.8.0_242", Python 3.7.6
EDIT:
The behavior depends on how the DataFrame is created: if its source is external it works fine, but if the DataFrame is created locally the error appears.
# works fine
df = spark.read.csv("dbfs:/some.csv")
df.cache()
# ERROR in 'cache' line
df = spark.createDataFrame([("a",), ("b",)])
df.cache()
This is a known issue, and I think a recent patch fixed it. It was seen on Azure; I am not sure whether you are using Azure or AWS, but it's solved. Please check the issue: https://github.com/MicrosoftDocs/azure-docs/issues/52431

How to save data-frame in MySQL using PySpark

I am new to Apache Spark. I have a use case where I have to save DataFrame data in MySQL. I got the code below to do this:
data_frame.write.format('jdbc').options(
    url='URI',
    driver='com.mysql.jdbc.Driver',
    dbtable=table_name,
    user=user_name,
    password='your_password').mode('append').save()
But when I ran the code, I got the below error:
File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o207.save.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
I might be missing a very minute detail. How can I fix this?
The error description clearly indicates that Spark is not able to locate the JDBC driver class. You will have to include the JAR file for com.mysql.jdbc.Driver using
pyspark --jars <jar-file-location>
See this question - How to add third-party Java JAR files for use in PySpark.
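If you prefer not to pass the flag on the command line, a rough equivalent is to point spark.jars at the driver when building the session, before the SparkContext is created. The path below is a placeholder for your local MySQL Connector/J JAR:

from pyspark.sql import SparkSession

# Sketch: replace the path with the actual location of your MySQL
# Connector/J JAR (e.g. mysql-connector-java-8.0.x.jar).
spark = (
    SparkSession.builder
    .appName("mysql-write")
    .config("spark.jars", "/path/to/mysql-connector-java.jar")
    .getOrCreate()
)

# The data_frame.write ... .save() call from the question should then be
# able to load com.mysql.jdbc.Driver.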

Spark executor is throwing error "java.lang.ClassNotFoundException: oracle.jdbc.OracleDriver"

I am trying to import a table from my Oracle database using Spark, and here I am using Scala to import the table.
My JDBC driver is ojdbc7.jar, and it is added to both spark.driver.extraClassPath and spark.executor.extraClassPath in the configuration file:
spark.driver.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/home/hadoop/ojdbc7.jar
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/home/hadoop/ojdbc7.jar
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
I can successfully import the table and print its schema, but performing any operation like count() or show() throws the error below:
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.OracleDriver
at java.lang.ClassLoader.findClass(ClassLoader.java:530)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
... 21 more
This error occurred because Spark could not locate ojdbc7.jar on every core node. Placing the JAR in a shared location such as /usr/lib/spark/jars will resolve the issue.
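Once the driver JAR is visible on every node, a plain JDBC read should work. Here is a minimal PySpark sketch with placeholder connection details (the question itself uses Scala, but the options are the same):

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("oracle-read").getOrCreate()

# Sketch: URL, table and credentials below are placeholders.
df = (
    spark.read.format("jdbc")
    .option("url", "jdbc:oracle:thin:@//dbhost:1521/SERVICE")
    .option("driver", "oracle.jdbc.OracleDriver")
    .option("dbtable", "SCHEMA.TABLE_NAME")
    .option("user", "user")
    .option("password", "password")
    .load()
)

df.printSchema()   # worked even before the fix (driver only needed on the driver node)
print(df.count())  # actions now succeed because executors can load OracleDriver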
You can also do a few other things, including adding the JAR file's full location as a dependency (artifact) under the spark section of the interpreter.
If you just want %jdbc to work, update the jdbc section under the interpreter, add the JAR file's full location as an artifact under the dependencies, and also update default.driver, default.url, default.user, and default.password accordingly.