I would like to use Phoenix tables with PySpark. I tried the solution I found here: https://phoenix.apache.org/phoenix_spark.html
But I get an error. Can you help me solve it?
df_metadata = sqlCtx.read.format("org.apache.phoenix.spark").option("zkUrl", "xxx").load("lib.name_of_table")
print(df_metadata.collect())
and the error:
py4j.protocol.Py4JJavaError: An error occurred while calling o103.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.phoenix.spark. Please find packages at http://spark-packages.org
How can I use org.apache.phoenix.spark with pyspark?
OK, I found how to fix this.
I added this part to my spark-submit command:
--jars /opt/phoenix-4.8.1-HBase-1.2/phoenix-spark-4.8.1-HBase-1.2.jar,/opt/phoenix-4.8.1-HBase-1.2/phoenix-4.8.1-HBase-1.2-client.jar \
I know the answer given by #Zop works.
I got this error:
py4j.protocol.Py4JJavaError: An error occurred while calling o53.load.
: java.lang.ClassNotFoundException: Failed to find data source: org.apache.phoenix.spark. Please find packages at http://spark.apache.org/third-party-projects.html
You can also do it this way:
spark-submit --jars /usr/hdp/current/phoenix-client/phoenix-spark2.jar,/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.4.0-91-client.jar,/usr/hdp/current/phoenix-client/phoenix-4.7.0.2.6.4.0-91-server.jar <file here>
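Both answers above come down to passing every required Phoenix JAR to spark-submit as a single comma-separated --jars value. A minimal sketch of assembling that command from Python is below; the helper name build_spark_submit and the script name my_job.py are made up for illustration, and the JAR paths are simply the ones from the first answer.

```python
import os
import shlex


def build_spark_submit(script, jars, check_exists=True):
    """Return a spark-submit argument list that ships the given JARs.

    Hypothetical helper for illustration, not part of Spark or Phoenix.
    """
    if check_exists:
        missing = [j for j in jars if not os.path.isfile(j)]
        if missing:
            raise FileNotFoundError("JAR(s) not found: " + ", ".join(missing))
    # --jars takes one comma-separated value with no spaces between entries.
    return ["spark-submit", "--jars", ",".join(jars), script]


if __name__ == "__main__":
    cmd = build_spark_submit(
        "my_job.py",  # hypothetical script name
        [
            "/opt/phoenix-4.8.1-HBase-1.2/phoenix-spark-4.8.1-HBase-1.2.jar",
            "/opt/phoenix-4.8.1-HBase-1.2/phoenix-4.8.1-HBase-1.2-client.jar",
        ],
        check_exists=False,  # skip the file check so the sketch runs anywhere
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

With check_exists left at its default, a wrong path fails early with a FileNotFoundError instead of surfacing later as a ClassNotFoundException inside Spark.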
I want to show a Spark DataFrame, and I used:
df.writeStream.outputMode("append").start().awaitTermination()
But I got this error when I ran that line:
21/07/16 01:20:53 ERROR MicroBatchExecution: Query [id = f243e6e6-c02e-4e70-b5c3-6a821fd33232, runId = 312544cf-fea8-45b4-94a1-c052306538cf] terminated with error
java.lang.NoSuchMethodError: org.apache.spark.sql.internal.SQLConf.useDeprecatedKafkaOffsetFetching()Z
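Check your Spark version against the versions of the dependencies you have added, and make sure they match. A NoSuchMethodError on an org.apache.spark class usually means that a JAR built for a different Spark release is on the classpath, here most likely the Kafka connector (spark-sql-kafka). Aligning the versions should resolve the issue.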
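As a rough illustration of that check, the sketch below compares a Spark version string (in PySpark you can read it from spark.version) with the Maven coordinates passed via --packages. The function name mismatched_spark_packages and the example coordinates are assumptions for illustration only.

```python
def mismatched_spark_packages(spark_version, packages):
    """Return org.apache.spark coordinates whose version differs from spark_version.

    Hypothetical helper: coordinates are expected as group:artifact:version.
    """
    mismatched = []
    for coord in packages:
        parts = coord.split(":")
        if len(parts) != 3:
            raise ValueError("expected group:artifact:version, got %r" % coord)
        group, _artifact, version = parts
        # Only Spark's own modules have to track the Spark release exactly.
        if group == "org.apache.spark" and version != spark_version:
            mismatched.append(coord)
    return mismatched


if __name__ == "__main__":
    # Example values, not taken from the question.
    packages = ["org.apache.spark:spark-sql-kafka-0-10_2.12:3.1.2"]
    print(mismatched_spark_packages("3.0.1", packages))
```

An empty result means every org.apache.spark package in the list matches the running Spark version; third-party packages are ignored by this sketch.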
Objective
I am trying to connect to my Oracle Database (12c) from Kafka Connect (ideally in distributed mode) using the Debezium connector (1.2.4.Final). The Kafka version I am using is 2.13-2.6.0.
Command used
As mentioned here, I am running this command:
C:\Users\username\Downloads\kafka>bin\windows\connect-distributed.bat config\connect-distributed.properties
Error
The error I am getting is:
ERROR Stopping due to error (org.apache.kafka.connect.cli.ConnectDistributed)
java.lang.NoClassDefFoundError: io/debezium/util/IoUtil
at io.debezium.connector.oracle.Module.<clinit>(Module.java:19)
at io.debezium.connector.oracle.OracleConnector.version(OracleConnector.java:23)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.versionFor(DelegatingClassLoader.java:390)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.versionFor(DelegatingClassLoader.java:395)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.getPluginDesc(DelegatingClassLoader.java:365)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanPluginPath(DelegatingClassLoader.java:337)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.scanUrlsAndAddPlugins(DelegatingClassLoader.java:268)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.registerPlugin(DelegatingClassLoader.java:260)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initPluginLoader(DelegatingClassLoader.java:229)
at org.apache.kafka.connect.runtime.isolation.DelegatingClassLoader.initLoaders(DelegatingClassLoader.java:206)
at org.apache.kafka.connect.runtime.isolation.Plugins.<init>(Plugins.java:61)
at org.apache.kafka.connect.cli.ConnectDistributed.startConnect(ConnectDistributed.java:91)
at org.apache.kafka.connect.cli.ConnectDistributed.main(ConnectDistributed.java:78)
Caused by: java.lang.ClassNotFoundException: io.debezium.util.IoUtil
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.kafka.connect.runtime.isolation.PluginClassLoader.loadClass(PluginClassLoader.java:104)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
... 13 more
Settings
In my connect-distributed.properties, I have this:
plugin.path=C:/Users/username/Downloads/kafka/libs/debezium
And inside the debezium folder (following Gunnar's recommendation from the comment in this question), I have these jars:
I also added the plugin path in %PATH% as follows:
echo %PATH% | findstr debezium
XXX;C:\Users\username\Downloads\kafka\libs\debezium;
Help
Any help would be greatly appreciated, as I hope to replace my database polling with this Debezium connector, which seems a better approach. Thanks!
The solution from Gunnar here works! (His explanation is there too if you want to check it out.)
plugin.path=C:\\Users\\username\\Downloads\\kafka\\libs
These also work:
plugin.path=C:/Users/username/Downloads/kafka/libs
plugin.path=C:\Users\username\Downloads\kafka\libs
plugin.path=/Users/username/Downloads/kafka/libs
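The mistake was that plugin.path should point to libs, not to libs/debezium. Kafka Connect treats each entry directly inside a plugin.path directory as a separate plugin with its own class loader, so when plugin.path is libs/debezium, each Debezium jar is loaded in isolation and the connector cannot see classes such as io.debezium.util.IoUtil from the other jars.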
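To make the fix concrete, here is a small sketch that derives the plugin.path line from the folder that holds the connector jars: it takes the parent directory and writes it with forward slashes, which avoids backslash escaping in the properties file. The function name plugin_path_for is hypothetical; the path is the one from the question.

```python
from pathlib import PureWindowsPath


def plugin_path_for(connector_dir):
    """Return a plugin.path line pointing at the parent of connector_dir.

    Hypothetical helper: connector_dir is the folder that directly
    contains the connector's jars (e.g. ...\\libs\\debezium).
    """
    parent = PureWindowsPath(connector_dir).parent
    # as_posix() renders the Windows path with forward slashes.
    return "plugin.path=" + parent.as_posix()


if __name__ == "__main__":
    print(plugin_path_for(r"C:\Users\username\Downloads\kafka\libs\debezium"))
```

PureWindowsPath only manipulates the string, so the sketch runs on any operating system and does not check that the folder exists.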
The connection to Databricks works fine, and working with DataFrames goes smoothly (operations like join, filter, etc.).
The problem appears when I call cache on a DataFrame.
py4j.protocol.Py4JJavaError: An error occurred while calling o342.cache.
: java.io.InvalidClassException: failed to read class descriptor
...
Caused by: java.lang.ClassNotFoundException: org.apache.spark.rdd.RDD$client53442a94a3$$anonfun$mapPartitions$1$$anonfun$apply$23
at java.lang.ClassLoader.findClass(ClassLoader.java:523)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.java:35)
at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.java:40)
at org.apache.spark.util.ChildFirstURLClassLoader.loadClass(ChildFirstURLClassLoader.java:48)
at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.util.Utils$.classForName(Utils.scala:257)
at org.apache.spark.sql.util.ProtoSerializer.org$apache$spark$sql$util$ProtoSerializer$$readResolveClassDescriptor(ProtoSerializer.scala:4316)
at org.apache.spark.sql.util.ProtoSerializer$$anon$4.readClassDescriptor(ProtoSerializer.scala:4304)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1857)
... 71 more
I use Java 8 as required, and clearing pycache doesn't help.
The same code submitted as a job to databricks works fine.
It looks like a local problem at the Python-JVM bridge level, but the Java version (8) and the Python version (3.7) are as required. Switching to Java 13 produces much the same message.
Versions: databricks-connect==6.2.0, openjdk version "1.8.0_242", Python 3.7.6
EDIT:
The behavior depends on how the DataFrame is created: if its source is external, it works fine; if it is created locally, the error appears.
# works fine
df = spark.read.csv("dbfs:/some.csv")
df.cache()
# ERROR in 'cache' line
df = spark.createDataFrame([("a",), ("b",)])
df.cache()
This is a known issue, and I think a recent patch fixed it. It was reported for Azure; I am not sure whether you are using Azure or AWS, but it has been solved. Please check the issue: https://github.com/MicrosoftDocs/azure-docs/issues/52431
I am new to Apache Spark. I have a use case where I have to save DataFrame data to MySQL. I found the code below to do that:
data_frame.write.format('jdbc').options(
url='URI',
driver='com.mysql.jdbc.Driver',
dbtable=table_name,
user=user_name,
password='your_password').mode('append').save()
But when I ran the code, I got the error below:
File "/usr/local/Cellar/apache-spark/2.3.1/libexec/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling o207.save.
: java.lang.ClassNotFoundException: com.mysql.jdbc.Driver
I might be missing some small detail. How can I fix this?
The error message clearly indicates that Spark cannot locate the JDBC driver class. You will have to include the JAR file that contains com.mysql.jdbc.Driver using
pyspark --jars <jar-file-location>
See this question - How to add third-party Java JAR files for use in PySpark.
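Before passing a JAR with --jars, it can help to confirm that the file really contains the class named in the ClassNotFoundException. A JAR is a ZIP archive, so Python's standard zipfile module can inspect it. This is only a sketch: jar_contains_class is a hypothetical helper, and the JAR path in the usage example is a placeholder for your own connector file.

```python
import zipfile


def jar_contains_class(jar_path, class_name):
    """Return True if the JAR holds the compiled class, e.g. com.mysql.jdbc.Driver.

    Hypothetical helper for illustration.
    """
    # Inside a JAR, com.mysql.jdbc.Driver is stored as com/mysql/jdbc/Driver.class
    entry = class_name.replace(".", "/") + ".class"
    with zipfile.ZipFile(jar_path) as jar:
        return entry in jar.namelist()


if __name__ == "__main__":
    import sys

    # Usage: python check_jar.py <path-to-connector.jar>
    if len(sys.argv) == 2:
        print(jar_contains_class(sys.argv[1], "com.mysql.jdbc.Driver"))
```

If this returns False for the JAR you pass to --jars, the driver class name in the options does not match that connector version.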
I am trying to import a table from my Oracle database using Spark, and I am using Scala to do it.
My JDBC driver is ojdbc7.jar, and it is added to both spark.driver.extraClassPath and spark.executor.extraClassPath in the configuration file:
spark.driver.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/home/hadoop/ojdbc7.jar
spark.driver.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
spark.executor.extraClassPath :/usr/lib/hadoop-lzo/lib/*:/usr/lib/hadoop/hadoop-aws.jar:/usr/share/aws/aws-java-sdk/*:/usr/share/aws/emr/emrfs/conf:/usr/share/aws/emr/emrfs/lib/*:/usr/share/aws/emr/emrfs/auxlib/*:/usr/share/aws/emr/security/conf:/usr/share/aws/emr/security/lib/*:/home/hadoop/ojdbc7.jar
spark.executor.extraLibraryPath /usr/lib/hadoop/lib/native:/usr/lib/hadoop-lzo/lib/native
I can successfully import the table and print its schema. But any operation such as count or show() throws the error below:
`
Caused by: java.lang.ClassNotFoundException: oracle.jdbc.OracleDriver
at java.lang.ClassLoader.findClass(ClassLoader.java:530)
at org.apache.spark.util.ParentClassLoader.findClass(ParentClassLoader.scala:26)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:34)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.util.ParentClassLoader.loadClass(ParentClassLoader.scala:30)
at org.apache.spark.repl.ExecutorClassLoader.findClass(ExecutorClassLoader.scala:77)
... 21 more
`
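This error occurred because Spark could not locate ojdbc7.jar on every core node. Reading the schema only needs the driver, which is why that step worked, while count and show() run tasks on the executors. Placing the jar in a location that exists on all nodes, such as /usr/lib/spark/jars, will resolve this issue.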
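One way to confirm this on a given node is to check that every entry of the configured class path actually exists there. The sketch below is a hypothetical helper (missing_classpath_entries is not a Spark API); it assumes the colon-separated format shown in the configuration above, where an entry ending in /* stands for all jars in that directory.

```python
import os


def missing_classpath_entries(classpath, sep=":"):
    """Return the class path entries that do not exist on this machine.

    Hypothetical helper: run it on each node with the value of
    spark.executor.extraClassPath.
    """
    missing = []
    for entry in classpath.split(sep):
        if not entry:
            continue  # a leading ":" yields an empty entry, as in the config above
        # "dir/*" means "all jars in dir", so check the directory itself.
        target = entry[:-2] if entry.endswith("/*") else entry
        if not os.path.exists(target):
            missing.append(entry)
    return missing


if __name__ == "__main__":
    print(missing_classpath_entries(":/usr/lib/hadoop-lzo/lib/*:/home/hadoop/ojdbc7.jar"))
```

An entry such as /home/hadoop/ojdbc7.jar showing up in the result on a core node would match the diagnosis above.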
You can also do a few other things, including adding the full path of the jar file as an artifact under the dependencies of the spark section in the interpreter settings.
If you just want %jdbc to work, update the jdbc section under the interpreter settings: add the full path of the jar file as an artifact under the dependencies, and update default.driver, default.url, default.user, and default.password accordingly.