SQLException: No suitable driver found for jdbc:phoenix:(hosts) - scala

I am running a spark-submit command that will do some database work via a Scala class.
spark-submit \
  --verbose \
  --class mycompany.MyClass \
  --conf spark.driver.extraJavaOptions=-Dconfig.resource=dev-test.conf \
  --conf spark.executor.extraJavaOptions=-Dconfig.resource=dev-test.conf \
  --master yarn \
  --driver-library-path /usr/lib/hadoop-lzo/lib/native/ \
  --jars /home/hadoop/mydir/dbp.spark-utils-1.1.0-SNAPSHOT.jar,/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar,/usr/lib/hadoop-lzo/lib/hadoop-lzo.jar,/usr/lib/hadoop/lib/commons-compress-1.18.jar,/usr/lib/hadoop/hadoop-aws-3.2.1-amzn-5.jar,/usr/share/aws/aws-java-sdk/aws-java-sdk-bundle-1.12.31.jar \
  --files /home/hadoop/mydir/dev-test.conf \
  --num-executors 1 \
  --executor-memory 3g \
  --driver-memory 3g \
  --queue default \
  /home/hadoop/mydir/dbp.spark-utils-1.1.0-SNAPSHOT.jar \
  <<args to MyClass>>
When I run, I get an exception:
Caused by: java.sql.SQLException: No suitable driver found for jdbc:phoenix:host1,host2,host3:2181:/hbase;
at java.sql.DriverManager.getConnection(DriverManager.java:689)
at java.sql.DriverManager.getConnection(DriverManager.java:208)
at org.apache.phoenix.util.QueryUtil.getConnection(QueryUtil.java:422)
at org.apache.phoenix.util.QueryUtil.getConnection(QueryUtil.java:414)
Here are the relevant parts of my Scala code:
val conf: SerializableHadoopConfiguration =
  new SerializableHadoopConfiguration(sc.hadoopConfiguration)
Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")

val tableRowKeyPairs: RDD[(Cell, ImmutableBytesWritable)] =
  df.rdd.mapPartitions(partition => {
    val configuration = conf.get()
    val partitionConn: JavaConnection = QueryUtil.getConnection(configuration)
    // ...
  })
My spark-submit command includes /usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar using the --jars argument. When I search that file for "org.apache.phoenix.jdbc.PhoenixDriver", I find it:
$ jar -tf /usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar | grep -i driver
...
org/apache/phoenix/jdbc/PhoenixDriver.class
...
So why can't my program locate the driver?

I was able to get the program to find the driver by adding the following argument to the spark-submit command shown in the question:
--conf "spark.executor.extraClassPath=/usr/lib/phoenix/phoenix-client-hbase-2.4-5.1.2.jar"
This StackOverflow article has great explanations of what the various classpath-related arguments do.
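One detail worth noting: the Class.forName call in the question runs on the driver, while the QueryUtil.getConnection call that fails runs inside the executors' mapPartitions tasks. Below is a minimal sketch (reusing df and conf from the question; the val name is arbitrary, and this is an illustration rather than the accepted fix above) of loading the driver class on the executors as well, so it registers with java.sql.DriverManager in each executor JVM. Whether that alone is enough depends on which classloader served the --jars entries, which is why the extraClassPath setting above is what ultimately worked:

import java.sql.{Connection => JavaConnection}
import org.apache.phoenix.util.QueryUtil

val perPartitionResults = df.rdd.mapPartitions { partition =>
  // Runs on the executor, so the Phoenix driver is registered with DriverManager
  // in the JVM that actually calls getConnection.
  Class.forName("org.apache.phoenix.jdbc.PhoenixDriver")
  val partitionConn: JavaConnection = QueryUtil.getConnection(conf.get())
  // ... do the per-partition database work, materialize the results, then close partitionConn ...
  partition
}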

Related

Spark-submit not picking modules and sub-modules of project structure

Folder structure of the PySpark project in PyCharm:
TEST
  TEST  (marked as sources root)
    com
      earl
        test
          pyspark
            utils
              utilities.py
            test_main.py
test_main.py has:
from _ast import arg
__author__ = "earl"
from pyspark.sql.functions import to_json, struct, lit
from com.earl.test.pyspark.utils.utilities import *
import sys
utilities.py has:
__author__ = "earl"
from py4j.protocol import Py4JJavaError
from pyspark.sql import SparkSession
import sys
In PyCharm, I execute the code by running test_main.py and it works absolutely fine: it calls functions from utilities.py and executes perfectly. I set Run -> Edit Configurations -> Parameters in PyCharm to D:\Users\input\test.json localhost:9092, read them with sys.argv[1] and sys.argv[2], and that works as well.
Spark submit command:
spark-submit --master local --conf spark.sparkContext.setLogLevel=WARN --name test D:\Users\earl\com\earl\test\pyspark\test_main.py --files D:\Users\earl\com\test\pyspark\utils\utilities.py D:\Users\input\test.json localhost:9092
Error:
Traceback (most recent call last):
File "D:\Users\earl\com\earl\test\pyspark\test_main.py", line 5, in <module>
from com.earl.test.pyspark.utils.utilities import *
ModuleNotFoundError: No module named 'com'
Fixed it by setting the property below before running spark-submit.
PYTHONPATH was previously set to %PY_HOME%\Lib;%PY_HOME%\DLLs;%PY_HOME%\Lib\lib-tk
set PYTHONPATH=%PYTHONPATH%;D:\Users\earl\TEST\ (Path of the project home structure)
And updated spark-submit as follows (only the main script needs to be specified):
spark-submit --master local --conf spark.sparkContext.setLogLevel=WARN --name test D:\Users\earl\com\earl\test\pyspark\test_main.py D:\Users\input\test.json localhost:9092

Reading a BigQuery table into a Spark RDD on GCP DataProc, why is the class missing for use in newAPIHadoopRDD

Just over a week ago, I was able to read a BigQuery table into an RDD for a Spark job running on a Dataproc cluster, using the guide at https://cloud.google.com/dataproc/docs/tutorials/bigquery-connector-spark-example as a template. Since then, I have been encountering missing-class issues, despite having made no changes to the code derived from that guide.
I have attempted to track down the missing class, com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList, although I cannot find any information on whether or not this class is now excluded from the gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
The job submission request is as follows:
gcloud dataproc jobs submit pyspark \
--cluster $CLUSTER_NAME \
--jars gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar \
--bucket gs://$BUCKET_NAME \
--region europe-west2 \
--py-files $PYSPARK_PATH/main.py
The PySpark code breaks at the following point:
bq_table_rdd = spark_context.newAPIHadoopRDD(
    'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
    'org.apache.hadoop.io.LongWritable',
    'com.google.gson.JsonObject',
    conf=conf)
where conf is a Python dict structured as follows:
conf = {
    'mapred.bq.project.id': project_id,
    'mapred.bq.gcs.bucket': gcs_staging_bucket,
    'mapred.bq.temp.gcs.path': input_staging_path,
    'mapred.bq.input.project.id': bq_input_project_id,
    'mapred.bq.input.dataset.id': bq_input_dataset_id,
    'mapred.bq.input.table.id': bq_input_table_id,
}
When my output indicates that the code has reached the above spark_context.newAPIHadoopRDD function, the following is printed to stdout:
class com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.DefaultPlatform: cannot cast result of calling 'com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory#getInstance' to 'com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.BackendFactory': java.lang.ClassCastException: Cannot cast com.google.cloud.hadoop.repackaged.gcs.com.google.common.flogger.backend.log4j.Log4jBackendFactory to com.google.cloud.hadoop.repackaged.bigquery.com.google.common.flogger.backend.system.BackendFactory
Traceback (most recent call last):
File "/tmp/0af805a2dd104e46b087037f0790691f/main.py", line 31, in <module>
sc)
File "/tmp/0af805a2dd104e46b087037f0790691f/extract.py", line 65, in bq_table_to_rdd
conf=conf)
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/context.py", line 749, in newAPIHadoopRDD
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.NoClassDefFoundError: com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList
This had not been an issue as recently as last week. I am concerned that even the hello world example on the GCP website is not stable in the short term. If anyone could shed some light on this issue, it would be greatly appreciated. Thanks.
I reproduced the problem
$ gcloud dataproc clusters create test-cluster --image-version=1.4
$ gcloud dataproc jobs submit pyspark wordcount_bq.py \
--cluster test-cluster \
--jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
then exactly the same error happened:
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.NoClassDefFoundError: com/google/cloud/hadoop/repackaged/bigquery/com/google/common/collect/ImmutableList
I noticed there was a new release 1.0.0 on Aug 23:
$ gsutil ls -l gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-**
...
4038762 2018-10-03T20:59:35Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.8.jar
4040566 2018-10-19T23:32:19Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.9.jar
14104522 2019-06-28T21:08:57Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0-RC1.jar
14104520 2019-07-01T20:38:18Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0-RC2.jar
14149215 2019-08-23T21:08:03Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-1.0.0.jar
14149215 2019-08-24T00:27:49Z gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
Then I tried version 0.13.9, and it worked:
$ gcloud dataproc jobs submit pyspark wordcount_bq.py \
--cluster test-cluster \
--jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-0.13.9.jar
It is a problem with 1.0.0; there is already an issue filed on GitHub. We'll fix it and improve the tests.

Could not read csv file in pyspark

I am new to PySpark and have done some initial tutorials. When I try to load a CSV file from my local machine into Spark using a Jupyter Notebook, the error shown below pops up. My Java version is 8.
from pyspark.sql import SparkSession

spark = SparkSession \
    .builder \
    .appName('sql based spark data analysis') \
    .config('spark.some.config.option', 'some-value') \
    .getOrCreate()

df = spark.read.csv('C:/Users/sitaram/Downloads/creditcardfraud/creditcard.csv')
My error is as follows:
Py4JJavaError: An error occurred while calling o55.csv.
: org.apache.spark.sql.AnalysisException: java.lang.RuntimeException: java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: (null) entry in command string: null ls -F C:\tmp\hive
at org.apache.hadoop.util.Shell$ShellCommandExecutor.execute(Shell.java:770)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:866)
at org.apache.hadoop.util.Shell.execCommand(Shell.java:849)
at org.apache.hadoop.fs.FileUtil.execCommand(FileUtil.java:1097)
at org.apache.hadoop.fs.RawLocalFileSystem$DeprecatedRawLocalFileStatus.loadPermissionInfo(RawLocalFileSystem.java:65
Caused by: java.lang.RuntimeException: java.lang.RuntimeException: Error while running command to get file permissions : java.io.IOException: (null) entry in command string: null ls -F C:\tmp\hive
Please try C://Users//sitaram//Downloads//creditcardfraud//creditcard.csv

case class with 200 attributes gets java.lang.StackOverflowError error and message "That entry seems to have slain the compiler."

In spark-shell, trying to create a Scala case class with 200 attributes doesn't work. I get the error below saying that it slays the compiler:
I got a suggestion to increase the thread stack size:
spark-shell --conf "spark.driver.extraJavaOptions=-Xss2M" --conf "spark.executor.extraJavaOptions=-Xss2M"
That helped initially, but when I pass more arguments to spark-shell for Kafka and HBase:
spark-shell \
  --conf "spark.driver.extraJavaOptions=-Xss4M" \
  --conf "spark.executor.extraJavaOptions=-Xss4M" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=/etc/kafka/conf/kafka_jaas.conf" \
  --driver-java-options "-Djava.security.auth.login.config=/etc/kafka/conf/kafka_jaas.conf" \
  -classpath /usr/hdp/current/hbase-client/lib/hbase-common.jar:/usr/hdp/current/hbase-client/lib/hbase-client.jar:/usr/hdp/current/hbase-client/lib/hbase-server.jar:/usr/hdp/current/hbase-client/lib/hbase-protocol.jar:shc-core-1.1.1-2.1-s_2.11.jar:spark-streaming-kafka-0-10_2.11-2.0.0.jar:/usr/hdp/2.6.1.0-129/kafka/libs/kafka-clients-0.10.1.2.6.1.0-129.jar \
  --files /etc/hbase/conf/hbase-site.xml
The same error keeps happening:
java.lang.StackOverflowError
at scala.tools.nsc.typechecker.Typers$Typer.normalTypedApply$1(Typers.scala:4504)
at scala.tools.nsc.typechecker.Typers$Typer.typedApply$1(Typers.scala:4580)
at scala.tools.nsc.typechecker.Typers$Typer.typedInAnyMode$1(Typers.scala:5343)
.............
That entry seems to have slain the compiler. Shall I replay
your session? I can re-run each line except the last one.
[y/n]
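Separately from the stack-size settings, and purely as a hedged illustration with made-up field names: one workaround for this kind of compiler blow-up is to group a very wide record into several smaller nested case classes, so that no single constructor, apply, or copy method carries all 200 parameters:

// Hypothetical field groupings -- the real attribute names would come from your schema.
case class CustomerFields(id: Long, name: String, email: String /* , ... */)
case class AccountFields(accountId: Long, balance: Double, currency: String /* , ... */)
case class AuditFields(createdAt: java.sql.Timestamp, updatedAt: java.sql.Timestamp /* , ... */)

// One wide record composed of smaller groups instead of a single 200-parameter case class.
case class WideRecord(
  customer: CustomerFields,
  account: AccountFields,
  audit: AuditFields
  /* , ... further groups ... */
)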

Spark Pipe example

I'm new to Spark and trying to figure out how the pipe method works. I have the following code in Scala:
sc.textFile(hdfsLocation).pipe("preprocess.py").saveAsTextFile(hdfsPreprocessedLocation)
The values hdfsLocation and hdfsPreprocessedLocation are fine. As proof, the following command works from the command line:
hadoop fs -cat hdfsLocation/* | ./preprocess.py | head
When I run the Spark code above, I get the following error:
14/11/25 09:41:50 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Cannot run program "preprocess.py": error=2, No such file or directory
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1041)
at org.apache.spark.rdd.PipedRDD.compute(PipedRDD.scala:119)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.rdd.MappedRDD.compute(MappedRDD.scala:31)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
at org.apache.spark.scheduler.Task.run(Task.scala:54)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:177)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
Caused by: java.io.IOException: error=2, No such file or directory
at java.lang.UNIXProcess.forkAndExec(Native Method)
at java.lang.UNIXProcess.<init>(UNIXProcess.java:135)
at java.lang.ProcessImpl.start(ProcessImpl.java:130)
at java.lang.ProcessBuilder.start(ProcessBuilder.java:1022)
... 12 more
To solve this for Hadoop Streaming I would just use the --files option, so I tried the same thing for Spark. I start Spark with the following command:
bin/spark-shell --files ./preprocess.py
but that gave the same error.
I couldn't find a good example of using Spark with an external process via pipe, so I'm not sure if I'm doing this correctly. Any help would be greatly appreciated.
Thanks
I'm not sure if this is the correct answer, so I won't finalize it, but it appears that the file paths are different when running Spark in local and cluster mode. When running Spark without --master, the paths to the pipe command are relative to the local machine. When running Spark with --master, the paths to the pipe command are ./
UPDATE:
This actually isn't correct. I was using SparkFiles.get() to get the file name. It turns out that when you call .pipe() on an RDD, the command string is evaluated on the driver and then passed to the workers. Because of this, SparkFiles.get() is not the appropriate way to get the file name. The file name should be ./ because SparkContext.addFile() should put the file at ./ relative to where each worker runs. But I'm so sour on .pipe now that I've taken it out of my code entirely, in favor of .mapPartitions combined with a PipeUtils object that I wrote here. This is more efficient because I only incur the script startup cost once per partition instead of once per record.
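For a concrete picture of that approach, here is a simplified sketch. It is not the actual PipeUtils object linked above: the helper name is made up, it assumes the script was shipped to the executors with sc.addFile("preprocess.py"), and it invokes the script via python (adjust if you rely on the script's shebang instead).

import java.io.{File, PrintWriter}
import org.apache.spark.SparkFiles
import org.apache.spark.rdd.RDD
import scala.sys.process._

// Run an external script once per partition instead of once per record.
def pipePartitions(rdd: RDD[String], scriptName: String): RDD[String] =
  rdd.mapPartitions { partition =>
    val script = SparkFiles.get(scriptName)          // resolved on each executor
    // Write this partition's records to a temp file that becomes the script's stdin,
    // so the script's startup cost is paid once per partition.
    val tmp = File.createTempFile("pipe-input-", ".txt")
    val writer = new PrintWriter(tmp)
    try partition.foreach(line => writer.println(line)) finally writer.close()
    val output = (Seq("python", script) #< tmp).!!   // run the script, capture its stdout
    tmp.delete()
    output.split("\n").iterator
  }

// Usage, mirroring the original pipe() call:
// sc.addFile("preprocess.py")
// pipePartitions(sc.textFile(hdfsLocation), "preprocess.py").saveAsTextFile(hdfsPreprocessedLocation)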