Failed to load com.databricks.spark.csv while running with spark-submit - scala

I am trying to run my code with spark-submit with the below command.
spark-submit --class "SampleApp" --master local[2] target/scala-2.11/sample-project_2.11-1.0.jar
My sbt file has the following dependencies:
libraryDependencies += "org.apache.spark" %% "spark-core" % "1.4.1"
libraryDependencies += "org.apache.spark" % "spark-sql_2.11" % "1.5.2"
libraryDependencies += "com.databricks" % "spark-csv_2.11" % "1.2.0"
My code:
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import scala.collection.mutable.ArrayBuffer
import org.apache.spark.sql.SQLContext

object SampleApp {
  def main(args: Array[String]) {
    val conf = new SparkConf().setAppName("Sample App").setMaster("local[2]")
    val sc = new SparkContext(conf)
    val sqlContext = new SQLContext(sc)
    import sqlContext._
    import sqlContext.implicits._

    val df = sqlContext.load("com.databricks.spark.csv", Map("path" -> "/root/input/Account.csv", "header" -> "true"))
    val column_names = df.columns
    val row_count = df.count
    val column_count = column_names.length
    var pKeys = ArrayBuffer[String]()
    for (i <- column_names) {
      if (row_count == df.groupBy(i).count.count) {
        pKeys += df.groupBy(i).count.columns(0)
      }
    }
    pKeys.foreach(print)
  }
}
The error:
16/03/11 04:47:37 INFO BlockManagerMaster: Registered BlockManager
Exception in thread "main" java.lang.RuntimeException: Failed to load class for data source: com.databricks.spark.csv
at scala.sys.package$.error(package.scala:27)
at org.apache.spark.sql.sources.ResolvedDataSource$.lookupDataSource(ddl.scala:220)
at org.apache.spark.sql.sources.ResolvedDataSource$.apply(ddl.scala:233)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:114)
at org.apache.spark.sql.SQLContext.load(SQLContext.scala:1253)
My Spark Version is 1.4.1 and Scala is 2.11.7
(I am following this link: http://www.nodalpoint.com/development-and-deployment-of-spark-applications-with-scala-eclipse-and-sbt-part-1-installation-configuration/)
I have tried the below versions of spark-csv:
spark-csv_2.10 1.2.0
1.4.0
1.3.1
1.3.0
1.2.0
1.1.0
1.0.3
1.0.2
1.0.1
1.0.0
etc.
Please help!

Since you are running the job in local mode, add the external jar paths using the --jars option:
spark-submit --class "SampleApp" --master local[2] --jars file:[path-of-spark-csv_2.11.jar],file:[path-of-other-dependency-jar] target/scala-2.11/sample-project_2.11-1.0.jar
e.g.
spark-submit --jars file:/root/Downloads/jars/spark-csv_2.10-1.0.3.jar,file:/root/Downloads/jars/commons-csv-1.2.jar,file:/root/Downloads/jars/spark-sql_2.11-1.4.1.jar --class "SampleApp" --master local[2] target/scala-2.11/my-proj_2.11-1.0.jar
Another thing you can do is create a fat jar. For SBT see proper-way-to-make-a-spark-fat-jar-using-sbt, and for Maven refer to create-a-fat-jar-file-maven-assembly-plugin.
Note: mark the scope of the Spark jars (i.e. spark-core, spark-streaming, spark-sql, etc.) as provided, otherwise the fat jar will become too fat to deploy.
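For illustration, a minimal build.sbt sketch of that layout, assuming the sbt-assembly plugin is wired in via project/plugins.sbt and aligning the Spark modules on the 1.4.1 version used in the question:

scalaVersion := "2.11.7"

libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.4.1" % "provided",  // supplied by the Spark runtime
  "org.apache.spark" %% "spark-sql"  % "1.4.1" % "provided",
  "com.databricks"   %% "spark-csv"  % "1.2.0"                // bundled into the fat jar
)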

A better solution is to use the --packages option, like below:
spark-submit --class "SampleApp" --master local[2] --packages com.databricks:spark-csv_2.10:1.5.0 target/scala-2.11/sample-project_2.11-1.0.jar
Make sure that the --packages option precedes the application jar.

You have added the spark-csv library to your sbt config, which means you can compile your code against it,
but that does not mean it is present at runtime (spark-sql and spark-core are there by default).
So either use the --jars option of spark-submit to add the spark-csv jar to the runtime classpath, or build a fat jar (I'm not sure how you are doing that with sbt).

You are using the Spark 1.3 syntax of loading the CSV file into a dataframe.
If you check the repository here, you should use the following syntax on Spark 1.4 and higher:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")       // Use first line of all files as header
  .option("inferSchema", "true")  // Automatically infer data types
  .load("cars.csv")

I was looking for a way to skip the --packages option and provide the dependency directly in the assembly jar. The reason I faced this exception was that I used
sqlContext.read.format("csv"), which assumes Spark already knows a data source called csv. To tell Spark where the CSV format actually lives, use sqlContext.read.format("com.databricks.spark.csv") instead, so it knows which class to load and does not throw the exception.
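For illustration, a sketch of that change, reusing the path and header option from the original question:

// What triggered the exception for me on Spark 1.x:
// val df = sqlContext.read.format("csv").load("/root/input/Account.csv")

// Fully qualified data source name, resolvable from the assembly jar:
val df = sqlContext.read
  .format("com.databricks.spark.csv")
  .option("header", "true")
  .load("/root/input/Account.csv")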

Related

Spark Shell Add Multiple Drivers/Jars to Classpath using spark-defaults.conf

We are using the spark-shell REPL mode to test various use cases and to connect to multiple sources/sinks.
We need to add custom drivers/jars in the spark-defaults.conf file. I have tried to add multiple jars separated by commas, like:
spark.driver.extraClassPath = /home/sandeep/mysql-connector-java-5.1.36.jar
spark.executor.extraClassPath = /home/sandeep/mysql-connector-java-5.1.36.jar
But it's not working. Can anyone please provide details of the correct syntax?
Note: Verified in Linux Mint and Spark 3.0.1
If you set properties in spark-defaults.conf, Spark will pick those settings up only when you submit your job using spark-submit.
Note: behaviour with spark-shell and pyspark still needs to be verified.
file: spark-defaults.conf
spark.driver.extraJavaOptions -Dlog4j.configuration=file:log4j.properties -Dspark.yarn.app.container.log.dir=app-logs -Dlogfile.name=hello-spark
spark.jars.packages org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,org.apache.spark:spark-avro_2.12:3.0.1
In the terminal run your job say wordcount.py
spark-submit /path-to-file/wordcount.py
If you want to run your job in development mode from an IDE, then you should use the config() method. Here we will set the Kafka and Avro packages. Also, if you want to include log4j.properties, use extraJavaOptions.
AppName and master can be provided in two ways:
use .appName() and .master()
use a .conf file
file: hellospark.py
from logger import Log4j
from util import get_spark_app_config
from pyspark.sql import SparkSession

# First approach: set appName/master and the packages directly on the builder.
spark = SparkSession.builder \
    .appName('Hello Spark') \
    .master('local[3]') \
    .config("spark.streaming.stopGracefullyOnShutdown", "true") \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1,"
            "org.apache.spark:spark-avro_2.12:3.0.1") \
    .config("spark.driver.extraJavaOptions",
            "-Dlog4j.configuration=file:log4j.properties "
            "-Dspark.yarn.app.container.log.dir=app-logs "
            "-Dlogfile.name=hello-spark") \
    .getOrCreate()

# Second approach: load the base settings from spark.conf via get_spark_app_config().
conf = get_spark_app_config()
spark = SparkSession.builder \
    .config(conf=conf) \
    .config("spark.jars.packages",
            "org.apache.spark:spark-sql-kafka-0-10_2.12:3.0.1") \
    .getOrCreate()

logger = Log4j(spark)
file: logger.py
from pyspark.sql import SparkSession

class Log4j(object):
    def __init__(self, spark: SparkSession):
        conf = spark.sparkContext.getConf()
        app_name = conf.get("spark.app.name")
        log4j = spark._jvm.org.apache.log4j
        self.logger = log4j.LogManager.getLogger(app_name)

    def warn(self, message):
        self.logger.warn(message)

    def info(self, message):
        self.logger.info(message)

    def error(self, message):
        self.logger.error(message)

    def debug(self, message):
        self.logger.debug(message)
file: util.py
import configparser
from pyspark import SparkConf

def get_spark_app_config(enable_delta_lake=False):
    """
    Read configuration from the spark.conf file and build an instance of
    SparkConf(). Can be used to create
    SparkSession.builder.config(conf=conf).getOrCreate()
    :return: instance of SparkConf()
    """
    spark_conf = SparkConf()
    config = configparser.ConfigParser()
    config.read("spark.conf")
    for (key, value) in config.items("SPARK_APP_CONFIGS"):
        spark_conf.set(key, value)
    if enable_delta_lake:
        for (key, value) in config.items("DELTA_LAKE_CONFIGS"):
            spark_conf.set(key, value)
    return spark_conf
file: spark.conf
[SPARK_APP_CONFIGS]
spark.app.name = Hello Spark
spark.master = local[3]
spark.sql.shuffle.partitions = 3
[DELTA_LAKE_CONFIGS]
spark.jars.packages = io.delta:delta-core_2.12:0.7.0
spark.sql.extensions = io.delta.sql.DeltaSparkSessionExtension
spark.sql.catalog.spark_catalog = org.apache.spark.sql.delta.catalog.DeltaCatalog
As an example in addition to Prateek's answer, I have had some success by adding the following to the spark-defaults.conf file to be loaded when starting a spark-shell session in client mode.
spark.jars jars_added/aws-java-sdk-1.7.4.jar,jars_added/hadoop-aws-2.7.3.jar,jars_added/sqljdbc42.jar,jars_added/jtds-1.3.1.jar
Adding that exact line to the spark-defaults.conf file loads the listed jar files, as long as they are stored in the jars_added folder and spark-shell is run from that specific directory (doing this for me also seems to remove the need to have the jar files loaded onto the slaves in the specified locations). I created the jars_added folder in my $SPARK_HOME directory, so whenever I run spark-shell I must run it from this directory (I have not yet worked out how to change the location that the spark.jars setting uses as the initial path; it seems to default to the current directory when launching spark-shell). As hinted at by Prateek, the jar files need to be comma separated.
I also had to set SPARK_CONF_DIR to $SPARK_HOME/conf (export SPARK_CONF_DIR="${SPARK_HOME}/conf") for spark-shell to recognise the location of my config file (i.e. spark-defaults.conf). I'm using PuTTY to ssh onto the master.
Just to clarify: once I have added spark.jars jar1,jar2,jar3 to my spark-defaults.conf file, I type the following to start my spark-shell session:
cd $SPARK_HOME   # navigate to the Spark home directory, which contains the jars_added folder
spark-shell
On start-up, spark-shell then loads the specified jar files from the jars_added folder.
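If the working-directory dependence becomes a nuisance, one option (untested here, so treat the paths as illustrative) is to list absolute paths in spark.jars, which the setting accepts just as it accepts relative ones:

spark.jars /opt/spark/jars_added/aws-java-sdk-1.7.4.jar,/opt/spark/jars_added/hadoop-aws-2.7.3.jar,/opt/spark/jars_added/sqljdbc42.jar,/opt/spark/jars_added/jtds-1.3.1.jar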

ClassNotFoundException: com.databricks.spark.csv.DefaultSource

I am trying to export data from Hive using Spark with Scala, but I am getting the following error:
Caused by: java.lang.ClassNotFoundException:com.databricks.spark.csv.DefaultSource
My Scala script is as follows:
import org.apache.spark.sql.hive.HiveContext
val sqlContext = new HiveContext(sc)
val df = sqlContext.sql("SELECT * FROM sparksdata")
df.write.format("com.databricks.spark.csv").save("/root/Desktop/home.csv")
I have also tried this command, but the issue is still not resolved. Please help me.
spark-shell --packages com.databricks:spark-csv_2.10:1.5.0
If you wish to run that script the way you are running it, you'll need to use the --jars option for local jars or --packages for a remote repository when you run the command.
So running the script should look like this:
spark-shell -i /path/to/script/scala --packages com.databricks:spark-csv_2.10:1.5.0
If you also want to stop the spark-shell after the job is done, you'll need to add:
System.exit(0)
at the end of your script.
PS: you won't need to fetch this dependency with Spark 2.x, where CSV support is built in.
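Putting that together, a sketch of the script as it would be run via -i (the table name and output path come from the question; spark-csv is assumed to be supplied through --packages as shown above):

import org.apache.spark.sql.hive.HiveContext

val sqlContext = new HiveContext(sc)   // sc is provided by spark-shell
val df = sqlContext.sql("SELECT * FROM sparksdata")

df.write
  .format("com.databricks.spark.csv")  // fully qualified data source name
  .save("/root/Desktop/home.csv")

System.exit(0)  // stop the spark-shell once the export finishes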

SBT : Running Spark job on remote cluster from sbt

I have a Spark job (let's call it wordcount) written in Scala, which I am able to run in the following ways:
Run on a local Spark instance from within sbt:
sbt> runMain WordCount [InputFile] [OutputDir] local[*]
Run on a remote Spark cluster by packaging and spark-submitting the jar:
sbt> package
$> spark-submit --master spark://192.168.1.1:7077 --class WordCount target/scala-2.10/wordcount_2.10-1.5.0-SNAPSHOT.jar [InputFile] [OutputDir]
Code:
// get arguments
val inputFile = args(0)
val outputDir = args(1)

// if a 3rd argument is defined then use it as the master
val conf =
  if (args.length == 3) new SparkConf().setAppName("WordCount").setMaster(args(2))
  else new SparkConf().setAppName("WordCount")

val sc = new SparkContext(conf)
How can I run this job on a remote Spark cluster from sbt?
There is an sbt plugin for spark-submit: https://github.com/saurfang/sbt-spark-submit

Scala Word count jar is not running in spark

I'm very new to both Scala and Spark. I added the Scala IDE to Eclipse Luna and created a Maven project in Eclipse. I was able to run the program within Eclipse with the Run As configuration option and got the output successfully. But when I create the jar for the following program and try to run it through the Spark shell, I get the following error.
error: ';' expected but 'class' found.
Command used to run the jar:
spark-submit --class com.kirthi.spark.proj.sparkexamples.WordsCount --master local /home/cloudera/workspace/sparkwc1.jar hdfs://localhost:8020/kirthi3/dataset.txt hdfs://localhost:8020/kirthi3/sparkwco
The word count program which I tried:
package com.kirthi.spark.proj.sparkexamples

import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._

object WordsCount {
  def main(args: Array[String]) {
    val conf = new SparkConf()
      .setAppName("Word Count")
      .setMaster("local")
    val sc = new SparkContext(conf)
    val textFile = sc.textFile(args(0))
    val words = textFile.flatMap(line => line.split(","))
    val counts = words.map(word => (word, 1))
    val wordcount = counts.reduceByKey(_ + _)
    val wordcount_sorted = wordcount.sortByKey()
    wordcount_sorted.foreach(println)
    wordcount_sorted.saveAsTextFile(args(1))
  }
}
Kindly help me out with this, as I am stuck on my initial Spark program.
I am using Cloudera QuickStart CDH 5.5.
As shown in the comments, you ran the above command in the Scala REPL, whereas you should run it from a regular Linux shell.

Running h2o in Jupyter scala notebook

I'm trying to get H2O running in a Jupyter notebook with a Scala kernel, with no success so far. Maybe someone can give me a hint on what could be wrong? The code I'm executing at the moment is:
classpath.add("ai.h2o" % "sparkling-water-core_2.10" % "1.6.5")
import org.apache.spark.h2o._
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
val conf = new SparkConf().setAppName("appName").setMaster("local")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
val h2oContext = new H2OContext(sc).start()
It fails on the last line with the error:
java.lang.NoClassDefFoundError: water/H2O
....
And prints out the exception:
java.lang.RuntimeException: Cannot launch H2O on executors: numOfExecutors=1, executorStatus=(driver,false) (Cannot launch H2O on executors: numOfExecutors=1, executorStatus=(driver,false))
org.apache.spark.h2o.H2OContextUtils$.startH2O(H2OContextUtils.scala:169)
org.apache.spark.h2o.H2OContext.start(H2OContext.scala:214)
If you use Toree, then in /usr/local/share/jupyter/kernels/apache_toree_scala/kernel.json you should add --packages ai.h2o:sparkling-water-core_2.10:1.6.6 to __TOREE_SPARK_OPTS__, something like:
"__TOREE_SPARK_OPTS__": "--master local[*] --executor-memory 12g --driver-memory 12g --packages ai.h2o:sparkling-water-core_2.10:1.6.6",
Then sc is already created when you open your notebook, so you don't need to recreate it.