I am learning Scala programming to write a driver program for word count in Apache Spark. I am using Windows 7 and the latest Spark version, 2.2.0. While executing the program I get the error mentioned below.
How do I fix this and get the result?
SBT
name := "sample"
version := "0.1"
scalaVersion := "2.12.3"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % sparkVersion,
"org.apache.spark" % "spark-sql_2.11" % sparkVersion,
"org.apache.spark" % "spark-streaming_2.11" % sparkVersion
)
Driver Program
package com.demo.file
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SparkSession
object Reader {
def main(args: Array[String]): Unit = {
println("Welcome to Reader.")
val filePath = "C:\\notes.txt"
val spark = SparkSession.builder.appName("Simple app").config("spark.master", "local").getOrCreate()
val fileData = spark.read.textFile(filePath).cache()
val count_a = fileData.filter(line => line.contains("a")).count()
val count_b = fileData.filter(line => line.contains("b")).count()
println(s" count of A $count_a and count of B $count_b")
spark.stop()
}
}
Error
Welcome to Reader.
Exception in thread "main" java.lang.NoClassDefFoundError: scala/Product$class
at org.apache.spark.SparkConf$DeprecatedConfig.<init>(SparkConf.scala:723)
at org.apache.spark.SparkConf$.<init>(SparkConf.scala:571)
at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
at org.apache.spark.SparkConf.set(SparkConf.scala:92)
at org.apache.spark.SparkConf.set(SparkConf.scala:81)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6$$anonfun$apply$1.apply(SparkSession.scala:905)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6$$anonfun$apply$1.apply(SparkSession.scala:905)
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:138)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:229)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:138)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:905)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.demo.file.Reader$.main(Reader.scala:11)
at com.demo.file.Reader.main(Reader.scala)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 18 more
Spark 2.2.0 is built and distributed to work with Scala 2.11 by default. To write applications in Scala, you need to use a compatible Scala version (e.g. 2.11.x), but your Scala version is 2.12.x. That is why it is throwing the exception.
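To fix it, either switch the project to Scala 2.11.x or move to a Spark release that is published for 2.12. A minimal sketch of a consistent build.sbt, assuming you stay on Spark 2.2.0 (built against Scala 2.11) and let the %% operator pick the matching _2.11 artifacts:
name := "sample"
version := "0.1"
scalaVersion := "2.11.12"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)
After changing scalaVersion, run sbt clean so nothing compiled against 2.12 is left behind, then rebuild.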
Related
I am setting up a Spark + Scala + SBT project in IntelliJ.
Scala Version: 2.12.8
SBT Version: 1.4.2
Java Version: 1.8
Build.sbt file:
name := "Spark_Scala_Sbt"
version := "0.1"
scalaVersion := "2.12.8"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % "2.3.3",
"org.apache.spark" %% "spark-sql" % "2.3.3"
)
Scala file:
import org.apache.spark.sql.SparkSession
object FirstSparkApplication extends App {
val spark = SparkSession.builder
.master("local[*]")
.appName("Sample App")
.getOrCreate()
val data = spark.sparkContext.parallelize(
Seq("I like Spark", "Spark is awesome", "My first Spark job is working now and is counting down these words")
)
val filtered = data.filter(line => line.contains("awesome"))
filtered.collect().foreach(print)
}
But it is showing the error messages below:
1. Cannot resolve symbol apache.
2. Cannot resolve symbol SparkSession
3. Cannot resolve symbol sparkContext
4. Cannot resolve symbol filter.
5. Cannot resolve symbol collect.
6. Cannot resolve symbol contains.
What should I change here?
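The same compatibility rule from the answer above most likely applies here: Spark 2.3.x is only published for Scala 2.11, so with scalaVersion := "2.12.8" the %% operator asks for spark-core_2.12 and spark-sql_2.12 artifacts at version 2.3.3, SBT cannot resolve them, and IntelliJ is then unable to resolve any Spark symbols. A minimal sketch of a consistent Build.sbt, assuming you keep Spark 2.3.3:
name := "Spark_Scala_Sbt"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "2.3.3",
  "org.apache.spark" %% "spark-sql" % "2.3.3"
)
Alternatively, keep Scala 2.12 and move to a Spark release that is published for it (2.4.x or newer). Either way, reload the SBT project in IntelliJ afterwards so the dependencies are re-imported.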
I was trying to write a simple Scala program that uses Spark, with the following content.
src/main/scala/SimpleApp.scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.random
object SimpleApp {
def main(args: Array[String]) {
val logFile = "<Some Valid Text File Path>" // Should be some file on your system
val spark = SparkSession.builder.appName("Simple Application").master("local").getOrCreate()
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
spark.stop()
}
}
build.sbt:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.10"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5"
but when I run the program I get the following exception stack trace:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/03/21 03:23:07 INFO SparkContext: Running Spark version 2.4.5
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/collect/Maps
at org.apache.hadoop.metrics2.lib.MetricsRegistry.<init>(MetricsRegistry.java:42)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:93)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:140)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:38)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:36)
at org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:120)
at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:236)
at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:293)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at SimpleApp$.main(SimpleApp.scala:9)
at SimpleApp.main(SimpleApp.scala)
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Maps
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 17 more
I tried running in debug mode, and the exception seems to be thrown when trying to create the SparkSession object. What am I missing?
I have installed Spark via brew and it works from the terminal.
I found a solution. To run this in the IDE I needed to add a few extra dependencies. I appended the following to build.sbt:
libraryDependencies += "com.google.guava" % "guava" % "28.2-jre"
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-core" % "2.10.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.2"
This question already has answers here:
Why does format("kafka") fail with "Failed to find data source: kafka." (even with uber-jar)?
(8 answers)
Closed 4 years ago.
I am trying to use Spark 2.3.1 structured streaming with Kafka. Getting the following error:
java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:190)
at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:164)
... 49 elided
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at scala.reflect.internal.util.AbstractFileClassLoader.findClass(AbstractFileClassLoader.scala:62)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 51 more
Any advice would be appreciated.
Scala code (using IntelliJ 2018.1):
import org.apache.log4j.Logger._
import org.apache.spark.sql.SparkSession
object test {
def main(args: Array[String]): Unit = {
println("test3")
import org.apache.log4j._
getLogger("org").setLevel(Level.ERROR)
getLogger("akka").setLevel(Level.ERROR)
val spark = SparkSession.
builder.
master("local").
appName("StructuredNetworkWordCount").
getOrCreate()
import spark.implicits._
val lines = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "t_tweets")
.load()
lines.selectExpr("CAST(value AS STRING)")
.as[(String)]
// Split the lines into words
val words = lines.as[String].flatMap(_.split(" "))
// Generate running word count
val wordCounts = words.groupBy("value").count()
val query = wordCounts.writeStream
.outputMode("complete")
.format("console")
.start()
query.awaitTermination()
}
}
Build.sbt:
name := "scalaSpark3"
version := "0.1"
scalaVersion := "2.11.8"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.3.1"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.1"
// https://mvnrepository.com/artifact/org.apache.spark/spark-sql-kafka-0-10
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.1" % "provided"
// https://mvnrepository.com/artifact/org.apache.kafka/kafka-clients
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "1.1.0"
// https://mvnrepository.com/artifact/org.apache.spark/spark-streaming
libraryDependencies += "org.apache.spark" %% "spark-streaming" % "2.3.1" % "provided"
Full error log:
objc[7301]: Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines/jdk1.8.0_151.jdk/Contents/Home/bin/java (0x10c6334c0) and /Library/Java/JavaVirtualMachines/jdk1.8.0_151.jdk/Contents/Home/jre/lib/libinstrument.dylib (0x10c6bf4e0). One of the two will be used. Which one is undefined.
test3
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.streaming.DataStreamReader.load(DataStreamReader.scala:159)
at test$.main(test.scala:33)
at test.main(test.scala)
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:335)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 3 more
My code is based on the sample code here:
https://spark.apache.org/docs/2.3.1/structured-streaming-programming-guide.html
https://spark.apache.org/docs/2.3.1/structured-streaming-kafka-integration.html
The Kafka integration is not included on your Spark classpath because the dependency is marked as provided.
Therefore, remove provided from
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.1" % "provided"
And make sure you create an uber jar before you run spark-submit.
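A minimal sketch of what that looks like in practice, assuming the sbt-assembly plugin is used to build the uber jar (the plugin version below is an example):
// build.sbt: no "provided" on the Kafka source
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.1"
// project/plugins.sbt
addSbtPlugin("com.eed3si9n" % "sbt-assembly" % "0.14.6")
Then sbt assembly produces a jar you can pass to spark-submit. Alternatively, when launching with spark-submit you can supply the integration at runtime with --packages org.apache.spark:spark-sql-kafka-0-10_2.11:2.3.1 instead of bundling it.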
I am attempting to set up a Kafka stream using a CSV file so that I can stream it into Spark. However, I keep getting
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
My code looks like this
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types._
object SpeedTester {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.master("local[4]").appName("SpeedTester").config("spark.driver.memory", "8g").getOrCreate()
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
import spark.implicits._
val mySchema = StructType(Array(
StructField("incident_id", IntegerType),
StructField("date", StringType),
StructField("state", StringType),
StructField("city_or_county", StringType),
StructField("n_killed", IntegerType),
StructField("n_injured", IntegerType)
))
val streamingDataFrame = spark.readStream.schema(mySchema).csv("C:/Users/zoldham/IdeaProjects/flinkpoc/Data/test")
streamingDataFrame.selectExpr("CAST(incident_id AS STRING) AS key",
"to_json(struct(*)) AS value").writeStream
.format("kafka")
.option("topic", "testTopic")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "C:/Users/zoldham/IdeaProjects/flinkpoc/Data")
.start()
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "testTopic").load()
val df1 = df.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)").as[(String, Timestamp)]
.select(from_json(col("value"), mySchema).as("data"), col("timestamp"))
.select("data.*", "timestamp")
df1.writeStream
.format("console")
.option("truncate","false")
.start()
.awaitTermination()
}
}
And my build.sbt file looks like this
name := "Spark POC"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "com.microsoft.sqlserver" % "mssql-jdbc" % "6.2.1.jre8"
libraryDependencies += "org.scalafx" %% "scalafx" % "8.0.144-R12"
libraryDependencies += "org.apache.ignite" % "ignite-core" % "2.5.0"
libraryDependencies += "org.apache.ignite" % "ignite-spring" % "2.5.0"
libraryDependencies += "org.apache.ignite" % "ignite-indexing" % "2.5.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10_2.11" % "2.3.0"
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.11.0.1"
What is causing that error? As you can see, I plainly included Kafka in the library dependencies, and even followed the official guide. Here is the stack trace:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:283)
at SpeedTester$.main(SpeedTester.scala:61)
at SpeedTester.main(SpeedTester.scala)
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 3 more
You need to add the missing dependency
"org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"
as stated in the documentation.
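Note that the artifact already in the build, spark-streaming-kafka-0-10, is the integration for the older DStream API; the kafka data source used by Structured Streaming lives in the separate spark-sql-kafka-0-10 artifact, which is why the source cannot be found. A minimal sketch of the relevant build.sbt lines, assuming you stay on Spark 2.3.0 and Scala 2.11 as in the question:
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"
With %% in place, SBT resolves the matching _2.11 artifacts for scalaVersion := "2.11.12".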
I am trying to create a DataFrame using the Spark sqlContext. I am using Spark 1.6.3 and Scala 2.10.5. Below is my code for creating DataFrames.
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
import org.apache.spark.SparkConf
import org.apache.spark.sql.SQLContext
import com.knoldus.pipeline.KMeansPipeLine
object SimpleApp{
def main(args:Array[String]){
val conf = new SparkConf().setAppName("Simple Application")
val sc = new SparkContext(conf)
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
import sqlContext.implicits._
val kMeans = new KMeansPipeLine()
val df = sqlContext.createDataFrame(Seq(
("a#email.com", 12000,"M"),
("b#email.com", 43000,"M"),
("c#email.com", 5000,"F"),
("d#email.com", 60000,"M")
)).toDF("email", "income","gender")
val categoricalFeatures = List("gender","email")
val numberOfClusters = 2
val iterations = 10
val predictionResult = kMeans.predict(sqlContext,df,categoricalFeatures,numberOfClusters,iterations)
}
}
It is giving me the following exception. What am I doing wrong? Can anyone help me resolve this?
Exception in thread "main" java.lang.NoSuchMethodError:
org.apache.spark.sql.SQLContext.createDataFrame(Lscala/collection/Seq;Lscala/ref lect/api/TypeTags$TypeTag;)Lorg/apache/spark/sql/Dataset;
at SimpleApp$.main(SimpleApp.scala:24)
at SimpleApp.main(SimpleApp.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:731)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:181)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:206)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:121)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
The dependencies I have used are:
scalaVersion := "2.10.5"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.10" % "2.0.0" % "provided",
"org.apache.spark" % "spark-sql_2.10" % "2.0.0" % "provided",
"org.apache.spark" % "spark-mllib_2.10" % "2.0.0" % "provided",
"knoldus" % "k-means-pipeline" % "0.0.1" )
As I see it, your createDataFrame call is missing the second argument. The method pattern is described here:
https://spark.apache.org/docs/1.6.1/api/scala/index.html#org.apache.spark.sql.SQLContext#createDataFrame(org.apache.spark.api.java.JavaRDD,%20java.lang.Class)
In your case it will be
def createDataFrame[A <: Product](data: Seq[A])(implicit arg0: scala.reflect.api.JavaUniverse.TypeTag[A]): DataFrame
:: Experimental :: Creates a DataFrame from a local Seq of Product.
OR
Converting the Seq into a List/RDD and using the method pattern with two arguments.
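A minimal sketch of that second option, reusing the sc and sqlContext already defined in the question (the column values are just the sample rows from the question):
import sqlContext.implicits._
// Build an RDD of tuples first, then convert it with toDF instead of
// passing the Seq straight to createDataFrame.
val rdd = sc.parallelize(Seq(
  ("a#email.com", 12000, "M"),
  ("b#email.com", 43000, "M"),
  ("c#email.com", 5000, "F"),
  ("d#email.com", 60000, "M")
))
val df = rdd.toDF("email", "income", "gender")
Separately, it is worth noting that the build pulls Spark 2.0.0 artifacts while the program runs on Spark 1.6.3; compiling against one Spark version and running against another can itself trigger a NoSuchMethodError, so aligning those versions is also worth checking.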