ClassNotFoundException: com.databricks.spark.csv.DefaultSource - scala

I am trying to export data from Hive using spark scala. But I am getting following error.
Caused by: java.lang.ClassNotFoundException:com.databricks.spark.csv.DefaultSource
My scala script is like below.
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM sparksdata")
df.write.format("com.databricks.spark.csv").save("/root/Desktop/home.csv")
I have also try this command but still is not resolved please help me.
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0

If you wish to run that script the way you are running it, you'll need to use the --jars for local jars or --packages for remote repo when you run the command.
So running the script should be like this :
spark-shell -i /path/to/script/scala --packages com.databricks:spark-csv_2.10:1.5.0
If you'd also want to stop the spark-shell after the job is done, you'll need to add :
System.exit(0)
by the end of your script.
PS: You won't be needing to fetch this dependency with spark 2.+.

Related

sequence files from sqoop import

I have imported a table using sqoop and saved it as a sequence file.
How do I read this file into an RDD or Dataframe?
I have tried sc.sequenceFile() but I'm not sure what to pass as keyClass and value Class. I tried tried using org.apache.hadoop.io.Text, org.apache.hadoop.io.LongWritable for keyClass and valueClass
but it did not work. I am using pyspark for reading the files.
in python its not working however in SCALA it works:
You need to do following steps:
step1:
If you are importing as sequence file from sqoop, there is a jar file generated, you need to use that as ValueClass while reading sequencefile. This jar file is generally placed in /tmp folder, but you can redirect it to a specific folder (i.e. to local folder not hdfs) using --bindir option.
example:
sqoop import --connect jdbc:mysql://ms.itversity.com/retail_export --
username retail_user --password itversity --table customers -m 1 --target-dir '/user/srikarthik/udemy/practice4/problem2/outputseq' --as-sequencefile --delete-target-dir --bindir /home/srikarthik/sqoopjars/
step2:
Also, you need to download the jar file from below link:
http://www.java2s.com/Code/Jar/s/Downloadsqoop144hadoop200jar.htm
step3:
Suppose, customers table is imported using sqoop as sequence file.
Run spark-shell --jars path-to-customers.jar,sqoop-1.4.4-hadoop200.jar
example:
spark-shell --master yarn --jars /home/srikarthik/sqoopjars/customers.jar,/home/srikarthik/tejdata/kjar/sqoop-1.4.4-hadoop200.jar
step4: Now run below commands inside the spark-shell
scala> import org.apache.hadoop.io.LongWritable
scala> val data = sc.sequenceFile[LongWritable,customers]("/user/srikarthik/udemy/practice4/problem2/outputseq")
scala> data.map(tup => (tup._1.get(), tup._2.toString())).collect.foreach(println)
You can use SeqDataSourceV2 package to read the sequence file with the DataFrame API without any prior knowledge of the schema (aka keyClass and valueClass).
Please note that the current version is only compatible with Spark 2.4
$ pyspark --packages seq-datasource-v2-0.2.0.jar
df = spark.read.format("seq").load("data.seq")
df.show()

How to refer deltalake tables in jupyter notebook using pyspark

I'm trying to start use DeltaLakes using Pyspark.
To be able to use deltalake, I invoke pyspark on Anaconda shell-prompt as —
pyspark — packages io.delta:delta-core_2.11:0.3.0
Here is the reference from deltalake — https://docs.delta.io/latest/quick-start.html
All commands for delta lake works fine from Anaconda shell-prompt.
On jupyter notebook, reference to a deltalake table gives error.Here is the code I am running on Jupyter Notebook -
df_advisorMetrics.write.mode("overwrite").format("delta").save("/DeltaLake/METRICS_F_DELTA")
spark.sql("create table METRICS_F_DELTA using delta location '/DeltaLake/METRICS_F_DELTA'")
Below is the code I am using at start of notebook to connect to pyspark -
import findspark
findspark.init()
findspark.find()
import pyspark
findspark.find()
Below is the error I get:
Py4JJavaError: An error occurred while calling o116.save.
: java.lang.ClassNotFoundException: Failed to find data source: delta. Please find packages at http://spark.apache.org/third-party-projects.html
Any suggestions?
I have created a Google Colab/Jupyter Notebook example that shows how to run Delta Lake.
https://github.com/prasannakumar2012/spark_experiments/blob/master/examples/Delta_Lake.ipynb
It has all the steps needed to run. This uses the latest spark and delta version. Please change the versions accordingly.
A potential solution is to follow the techniques noted in Import PySpark packages with a regular Jupyter notebook.
Another potential solution is to download the delta-core JAR and place it in the $SPARK_HOME/jars folder so when you run jupyter notebook it automatically includes the Delta Lake JAR.
I use DeltaLake all the time from a Jupyter notebook.
Try the following in you Jupyter notebook running Python 3.x.
### import Spark libraries
from pyspark.sql import SparkSession
import pyspark.sql.functions as F
### spark package maven coordinates - in case you are loading more than just delta
spark_packages_list = [
'io.delta:delta-core_2.11:0.6.1',
]
spark_packages = ",".join(spark_packages_list)
### SparkSession
spark = (
SparkSession.builder
.config("spark.jars.packages", spark_packages)
.config("spark.delta.logStore.class", "org.apache.spark.sql.delta.storage.S3SingleDriverLogStore")
.config("spark.sql.extensions", "io.delta.sql.DeltaSparkSessionExtension")
.config("spark.sql.catalog.spark_catalog", "org.apache.spark.sql.delta.catalog.DeltaCatalog")
.getOrCreate()
)
sc = spark.sparkContext
### Python library in delta jar.
### Must create sparkSession before import
from delta.tables import *
Assuming you have a spark dataframe df
HDFS
Save
### overwrite, change mode="append" if you prefer
(df.write.format("delta")
.save("my_delta_file", mode="overwrite", partitionBy="partition_column_name")
)
Load
df_delta = spark.read.format("delta").load("my_delta_file")
AWS S3 ObjectStore
Initial S3 setup
### Spark S3 access
hdpConf = sc._jsc.hadoopConfiguration()
user = os.getenv("USER")
### Assuming you have your AWS credentials in a jceks keystore.
hdpConf.set("hadoop.security.credential.provider.path", f"jceks://hdfs/user/{user}/awskeyfile.jceks")
hdpConf.set("fs.s3a.fast.upload", "true")
### optimize s3 bucket-level parquet column selection
### un-comment to use
# hdpConf.set("fs.s3a.experimental.fadvise", "random")
### Pick one upload buffer option
hdpConf.set("fs.s3a.fast.upload.buffer", "bytebuffer") # JVM off-heap memory
# hdpConf.set("fs.s3a.fast.upload.buffer", "array") # JVM on-heap memory
# hdpConf.set("fs.s3a.fast.upload.buffer", "disk") # DEFAULT - directories listed in fs.s3a.buffer.dir
s3_bucket_path = "s3a://your-bucket-name"
s3_delta_prefix = "delta" # or whatever
Save
### overwrite, change mode="append" if you prefer
(df.write.format("delta")
.save(f"{s3_bucket_path}/{s3_delta_prefix}/", mode="overwrite", partitionBy="partition_column_name")
)
Load
df_delta = spark.read.format("delta").load(f"{s3_bucket_path}/{s3_delta_prefix}/")
Spark Submit
Not directly answering the original question, but for completeness, you can do the following as well.
Add the following to your spark-defaults.conf file
spark.jars.packages io.delta:delta-core_2.11:0.6.1
spark.delta.logStore.class org.apache.spark.sql.delta.storage.S3SingleDriverLogStore
spark.sql.extensions io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog org.apache.spark.sql.delta.catalog.DeltaCatalog
Refer to conf file in spark-submit command
spark-submit \
--properties-file /path/to/your/spark-defaults.conf \
--name your_spark_delta_app \
--py-files /path/to/your/supporting_pyspark_files.zip \
--class Main /path/to/your/pyspark_script.py

BigQuery connector ClassNotFoundException in PySpark on Dataproc

I'm trying to run a script in PySpark, using Dataproc.
The script is kind of a merge between this example and what I need to do, as I wanted to check if everything works. Obviously, it doesn't.
The error I get is:
File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.newAPIHadoopRDD.
: java.lang.ClassNotFoundException: com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat
I made sure I have all the jars, added some new jars as suggested in other similar posts. I also checked the SPARK_HOME variable.
Below you can see the code; the error appears when trying to instantiate table_data.
"""BigQuery I/O PySpark example."""
from __future__ import absolute_import
import json
import pprint
import subprocess
import pyspark
from pyspark.sql import SQLContext
sc = pyspark.SparkContext()
bucket = sc._jsc.hadoopConfiguration().get('fs.gs.system.bucket')
project = sc._jsc.hadoopConfiguration().get('fs.gs.project.id')
input_directory = 'gs://{}/hadoop/tmp/bigquery/pyspark_input'.format(bucket)
conf = {
'mapred.bq.project.id': project,
'mapred.bq.gcs.bucket': bucket,
'mapred.bq.temp.gcs.path': input_directory,
'mapred.bq.input.project.id': 'publicdata',
'mapred.bq.input.dataset.id': 'samples',
'mapred.bq.input.table.id': 'shakespeare',
}
output_dataset = 'wordcount_dataset'
output_table = 'wordcount_output'
table_data = sc.newAPIHadoopRDD(
'com.google.cloud.hadoop.io.bigquery.JsonTextBigQueryInputFormat',
'org.apache.hadoop.io.LongWritable',
'com.google.gson.JsonObject',
conf=conf)
As pointed out in the example, you need to include BigQuery connector jar when submitting the job.
Through Dataproc jobs API:
gcloud dataproc jobs submit pyspark --cluster=${CLUSTER} \
/path/to/your/script.py \
--jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar
or spark-submit from inside the cluster:
spark-submit --jars=gs://hadoop-lib/bigquery/bigquery-connector-hadoop2-latest.jar \
/path/to/your/script.py

spark - hadoop argument

I am running both hadoop and spark and I want to use files from hdfs as an argument on spark-submit, so I made a folder in hdfs with the files
eg. /user/hduser/test/input
and I want to run spark-submit like this:
$SPARK_HOME/bin/spark-submit --master spark://admin:7077 ./target/scala-2.10/test_2.10-1.0.jar hdfs://user/hduser/test/input
but I cant make it work, what's the right way to do it?
the error I am getting is :
WARN FileInputDStream: Error finding new files
java.lang.NullPointerException
Check if you are able to access HDFS from Spark code, If yes then you need to add following line of code in your Scala import.
import org.apache.hadoop.fs.FileSystem
import org.apache.hadoop.fs.Path
import org.apache.spark.SparkFiles
then in your code add following lines
var hadoopConf = new org.apache.hadoop.conf.Configuration()
var fileSystem = FileSystem.get(hadoopConf)
var path = new Path(args(0))
actually the problem was the path. I had to use hdfs://localhost:9000/user/hduser/...

In Spark's interactive shell, how to integrate conf param in spark context?

I'm a newbie with Spark. In multiple examples seen on the net, we see something like:
conf.set("es.nodes", "from.escluster.com")
val sc = new SparkContext(conf)
but when I try in the interactive shell to do such a thing, I get the error:
org.apache.spark.SparkException: Only one SparkContext may be running in this JVM (see SPARK-2243)
So how am I supposed to set a conf parameter and be sure that sc "integrates" it?
Two options :
call sc.stop() before this code
Add --conf with your additional configuration as spark-shell argument