Spark word count with Kafka serialization error - scala

I am trying to implement a use case with Kafka and Spark using Scala. I built a consumer and a producer using the Kafka libraries, and now I am building the data processor to count words with Spark. Here is my build.sbt:
name := """scala-akka-stream-kafka"""
version := "1.0"
// scalaVersion := "2.12.4"
scalaVersion := "2.11.8"
libraryDependencies ++= Seq(
"org.apache.kafka" %% "kafka" % "0.10.2.0",
"org.apache.kafka" % "kafka-streams" % "0.10.2.0",
"org.apache.spark" %% "spark-core" % "2.2.0",
"org.apache.spark" %% "spark-streaming" % "2.2.0",
"org.apache.spark" %% "spark-streaming-kafka-0-10" % "2.0.0")
dependencyOverrides ++= Seq(
"com.fasterxml.jackson.core" % "jackson-databind" % "2.6.5",
"com.fasterxml.jackson.core" % "jackson-module-scala" % "2.6.5")
resolvers ++= Seq(
"Sonatype OSS Snapshots" at "https://oss.sonatype.org/content/repositories/snapshots"
)
resolvers += Resolver.sonatypeRepo("releases")
My word-count data processor fails with an error on the line val wordMap = words.map(word => (word, 1)):
package com.spark.streams
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka010.{ConsumerStrategies, KafkaUtils, LocationStrategies}
import org.apache.spark.streaming.{Durations, StreamingContext}
import scala.collection.mutable
object WordCountSparkStream extends App {
val kafkaParam = new mutable.HashMap[String, String]()
kafkaParam.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
kafkaParam.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
kafkaParam.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, "org.apache.kafka.common.serialization.StringDeserializer")
kafkaParam.put(ConsumerConfig.GROUP_ID_CONFIG, "group1")
kafkaParam.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "latest")
kafkaParam.put(ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG, "true")
val conf = new SparkConf().setMaster("local[2]").setAppName("WordCountSparkStream")
// Read messages in batches of 5 seconds
val sparkStreamingContext = new StreamingContext(conf, Durations.seconds(5))
// Configure Spark to listen to messages on the topic
val topicList = List("streams-plaintext-input")
// Read value of each message from Kafka and return it
val messageStream = KafkaUtils.createDirectStream(sparkStreamingContext,
LocationStrategies.PreferConsistent,
ConsumerStrategies.Subscribe[String, String](topicList, kafkaParam))
val lines = messageStream.map(consumerRecord => consumerRecord.value().asInstanceOf[String])
// Break every message into words and return list of words
val words = lines.flatMap(_.split(" "))
// Take every word and return Tuple with (word,1)
val wordMap = words.map( word => (word, 1))
// Count occurrences of each word
val wordCount = wordMap.reduceByKey((first, second) => first + second)
//Print the word count
wordCount.print()
sparkStreamingContext.start()
sparkStreamingContext.awaitTermination()
// "streams-wordcount-output"
}
But this is not a compilation error, and not a library conflict either. It says it cannot deserialize, yet I am using the String deserializer, which is what my producer is generating.
17/12/12 17:02:50 INFO DAGScheduler: Submitting 8 missing tasks from ShuffleMapStage 0 (MapPartitionsRDD[3] at map at WordCountSparkStream.scala:37) (first 15 tasks are for partitions Vector(0, 1, 2, 3, 4, 5, 6, 7))
17/12/12 17:02:50 INFO TaskSchedulerImpl: Adding task set 0.0 with 8 tasks
17/12/12 17:02:50 INFO TaskSetManager: Starting task 0.0 in stage 0.0 (TID 0, localhost, executor driver, partition 0, PROCESS_LOCAL, 4710 bytes)
17/12/12 17:02:50 INFO TaskSetManager: Starting task 1.0 in stage 0.0 (TID 1, localhost, executor driver, partition 1, PROCESS_LOCAL, 4710 bytes)
17/12/12 17:02:50 INFO Executor: Running task 0.0 in stage 0.0 (TID 0)
17/12/12 17:02:50 INFO Executor: Running task 1.0 in stage 0.0 (TID 1)
17/12/12 17:02:50 ERROR Executor: Exception in task 1.0 in stage 0.0 (TID 1)
java.lang.ClassNotFoundException: scala.None$
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at java.lang.Class.forName0(Native Method)
at java.lang.Class.forName(Class.java:348)
at org.apache.spark.serializer.JavaDeserializationStream$$anon$1.resolveClass(JavaSerializer.scala:67)
at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1863)
at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1746)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2037)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.defaultReadFields(ObjectInputStream.java:2282)
at java.io.ObjectInputStream.readSerialData(ObjectInputStream.java:2206)
at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:2064)
at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1568)
at java.io.ObjectInputStream.readObject(ObjectInputStream.java:428)
at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:75)
at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:114)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:309)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)

Try adding this to your build.sbt:
fork := true
It works for me, but I don't know exactly why.
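The likely mechanism (an assumption on my part; the answer above does not explain it): without forking, sbt run executes the application inside sbt's own JVM, and sbt's layered classloaders can prevent Spark's Java deserializer from seeing the Scala library classes, hence the ClassNotFoundException: scala.None$. A minimal sketch, scoping the setting to the run task only:
// build.sbt (sketch): fork only the run task into a separate JVM so the
// application gets a flat classpath instead of sbt's layered classloaders.
fork in run := true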

Related

sparkSession throwing Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/collect/Maps

I was trying to write a simple Scala program using Spark, with the following content.
src/main/scala/SimpleApp.scala:
import org.apache.spark.sql.SparkSession
import org.apache.spark.util.random
object SimpleApp {
def main(args: Array[String]) {
val logFile = "<Some Valid Text File Path>" // Should be some file on your system
val spark = SparkSession.builder.appName("Simple Application").master("local").getOrCreate()
val logData = spark.read.textFile(logFile).cache()
val numAs = logData.filter(line => line.contains("a")).count()
val numBs = logData.filter(line => line.contains("b")).count()
println(s"Lines with a: $numAs, Lines with b: $numBs")
spark.stop()
}
}
build.sbt:
name := "Simple Project"
version := "1.0"
scalaVersion := "2.12.10"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.5"
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.5"
but when I run the program I get the following exception stack trace:
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
20/03/21 03:23:07 INFO SparkContext: Running Spark version 2.4.5
Exception in thread "main" java.lang.NoClassDefFoundError: com/google/common/collect/Maps
at org.apache.hadoop.metrics2.lib.MetricsRegistry.<init>(MetricsRegistry.java:42)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:93)
at org.apache.hadoop.metrics2.impl.MetricsSystemImpl.<init>(MetricsSystemImpl.java:140)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<init>(DefaultMetricsSystem.java:38)
at org.apache.hadoop.metrics2.lib.DefaultMetricsSystem.<clinit>(DefaultMetricsSystem.java:36)
at org.apache.hadoop.security.UserGroupInformation$UgiMetrics.create(UserGroupInformation.java:120)
at org.apache.hadoop.security.UserGroupInformation.<clinit>(UserGroupInformation.java:236)
at org.apache.spark.util.Utils$.$anonfun$getCurrentUserName$1(Utils.scala:2422)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.util.Utils$.getCurrentUserName(Utils.scala:2422)
at org.apache.spark.SparkContext.<init>(SparkContext.scala:293)
at org.apache.spark.SparkContext$.getOrCreate(SparkContext.scala:2520)
at org.apache.spark.sql.SparkSession$Builder.$anonfun$getOrCreate$5(SparkSession.scala:935)
at scala.Option.getOrElse(Option.scala:189)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:926)
at SimpleApp$.main(SimpleApp.scala:9)
at SimpleApp.main(SimpleApp.scala)
Caused by: java.lang.ClassNotFoundException: com.google.common.collect.Maps
at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 17 more
I tried running in debug mode, and the exception seems to be thrown when trying to create the SparkSession object. What am I missing?
I have installed Spark from brew and it works from the terminal.
I found a solution. To run this in the IDE I needed to add a few extra dependencies. I appended the following to build.sbt:
libraryDependencies += "com.google.guava" % "guava" % "28.2-jre"
libraryDependencies += "com.fasterxml.jackson.core" % "jackson-core" % "2.10.0"
libraryDependencies += "org.apache.hadoop" % "hadoop-client" % "2.7.2"

Streaming K-means Spark Scala: Getting java.lang.NumberFormatException for input string

I am reading CSV data containing double values from a directory and applying a streaming K-means model on it as follows:
//CSV file
40.729,-73.9422
40.7476,-73.9871
40.7424,-74.0044
40.751,-73.9869
40.7406,-73.9902
.....
//SBT dependencies:
name := "Application name"
version := "0.1"
scalaVersion := "2.11.12"
val sparkVersion ="2.3.1"
libraryDependencies ++= Seq(
"org.apache.spark" %% "spark-core" % sparkVersion,
"org.apache.spark" % "spark-streaming_2.11" % sparkVersion,
"org.apache.spark" %% "spark-mllib" % "2.3.1")
//import statement
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.streaming.OutputMode
import org.apache.spark.sql.types._
import org.apache.spark.{SparkConf, SparkContext, rdd}
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.mllib.clustering.{ KMeans,StreamingKMeans}
import org.apache.spark.mllib.linalg.Vectors
//Reading Csv data
val trainingData = ssc.textFileStream ("directory path")
.map(x=>x.toDouble)
.map(x=>Vectors.dense(x))
// applying Streaming kmeans model
val model = new StreamingKMeans()
.setK(numClusters)
.setDecayFactor(1.0)
.setRandomCenters(numDimensions, 0.0)
model.trainOn(trainingData)
I get the following error:
18/07/24 11:20:04 ERROR Executor: Exception in task 0.0 in stage 2.0 (TID 1)
java.lang.NumberFormatException: For input string: "40.7473,-73.9857"
at sun.misc.FloatingDecimal.readJavaFormatString(FloatingDecimal.java:2043)
at sun.misc.FloatingDecimal.parseDouble(FloatingDecimal.java:110)
at java.lang.Double.parseDouble(Double.java:538)
at scala.collection.immutable.StringLike$class.toDouble(StringLike.scala:285)
at scala.collection.immutable.StringOps.toDouble(StringOps.scala:29)
at ubu$$anonfun$1.apply(uberclass.scala:305)
at ubu$$anonfun$1.apply(uberclass.scala:305)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:410)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:193)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:63)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:109)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:345)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Exception in thread "streaming-job-executor-0" java.lang.Error: java.lang.InterruptedException
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1155)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
at java.lang.Thread.run(Thread.java:748)
Can anyone please help?
There was a dimension issue: the dimension of the vectors and the numDimensions passed to the streaming K-means model must be the same.
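The NumberFormatException itself comes from calling toDouble on a whole "lat,lon" line. A minimal sketch of the corrected parsing, assuming two-dimensional input like the CSV above (ssc, numClusters and the directory path are taken from the question):
import org.apache.spark.mllib.clustering.StreamingKMeans
import org.apache.spark.mllib.linalg.Vectors

// Split each "lat,lon" line on the comma and build a 2-dimensional
// dense vector instead of converting the whole string to a Double.
val trainingData = ssc.textFileStream("directory path")
  .map(line => Vectors.dense(line.split(",").map(_.trim.toDouble)))

// The dimension passed to setRandomCenters must match the vector size (2 here).
val model = new StreamingKMeans()
  .setK(numClusters)
  .setDecayFactor(1.0)
  .setRandomCenters(2, 0.0)

model.trainOn(trainingData)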

Spark 2.3.0 Failed to find data source: kafka

I am attempting to set up a Kafka stream using a CSV so that I can stream it into Spark. However, I keep getting
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
My code looks like this
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.log4j.{Level, Logger}
import org.apache.spark.sql.execution.streaming.FileStreamSource.Timestamp
import org.apache.spark.sql.types._
object SpeedTester {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder.master("local[4]").appName("SpeedTester").config("spark.driver.memory", "8g").getOrCreate()
val rootLogger = Logger.getRootLogger()
rootLogger.setLevel(Level.ERROR)
import spark.implicits._
val mySchema = StructType(Array(
StructField("incident_id", IntegerType),
StructField("date", StringType),
StructField("state", StringType),
StructField("city_or_county", StringType),
StructField("n_killed", IntegerType),
StructField("n_injured", IntegerType)
))
val streamingDataFrame = spark.readStream.schema(mySchema).csv("C:/Users/zoldham/IdeaProjects/flinkpoc/Data/test")
streamingDataFrame.selectExpr("CAST(incident_id AS STRING) AS key",
"to_json(struct(*)) AS value").writeStream
.format("kafka")
.option("topic", "testTopic")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("checkpointLocation", "C:/Users/zoldham/IdeaProjects/flinkpoc/Data")
.start()
val df = spark.readStream.format("kafka").option("kafka.bootstrap.servers", "localhost:9092")
.option("subscribe", "testTopic").load()
val df1 = df.selectExpr("CAST(value AS STRING)", "CAST(timestamp AS TIMESTAMP)").as[(String, Timestamp)]
.select(from_json(col("value"), mySchema).as("data"), col("timestamp"))
.select("data.*", "timestamp")
df1.writeStream
.format("console")
.option("truncate","false")
.start()
.awaitTermination()
}
}
And my build.sbt file looks like this
name := "Spark POC"
version := "0.1"
scalaVersion := "2.11.12"
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.3.0"
libraryDependencies += "com.microsoft.sqlserver" % "mssql-jdbc" % "6.2.1.jre8"
libraryDependencies += "org.scalafx" %% "scalafx" % "8.0.144-R12"
libraryDependencies += "org.apache.ignite" % "ignite-core" % "2.5.0"
libraryDependencies += "org.apache.ignite" % "ignite-spring" % "2.5.0"
libraryDependencies += "org.apache.ignite" % "ignite-indexing" % "2.5.0"
libraryDependencies += "org.apache.spark" %% "spark-streaming-kafka-0-10_2.11" % "2.3.0"
libraryDependencies += "org.apache.kafka" % "kafka-clients" % "0.11.0.1"
What is causing that error? As you can see, I plainly included Kafka in the library dependencies, and even followed the official guide. Here is the stack trace:
Exception in thread "main" java.lang.ClassNotFoundException: Failed to find data source: kafka. Please find packages at http://spark.apache.org/third-party-projects.html
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:635)
at org.apache.spark.sql.streaming.DataStreamWriter.start(DataStreamWriter.scala:283)
at SpeedTester$.main(SpeedTester.scala:61)
at SpeedTester.main(SpeedTester.scala)
Caused by: java.lang.ClassNotFoundException: kafka.DefaultSource
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:349)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23$$anonfun$apply$15.apply(DataSource.scala:618)
at scala.util.Try$.apply(Try.scala:192)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at org.apache.spark.sql.execution.datasources.DataSource$$anonfun$23.apply(DataSource.scala:618)
at scala.util.Try.orElse(Try.scala:84)
at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:618)
... 3 more
You need to add the missing dependency
"org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"
as stated in the documentation.
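Note that spark-streaming-kafka-0-10, which is already in the build, is the integration for the older DStream API; the Structured Streaming format("kafka") source lives in spark-sql-kafka-0-10. In the build.sbt above that means adding (keeping the version in line with spark-sql):
libraryDependencies += "org.apache.spark" %% "spark-sql-kafka-0-10" % "2.3.0"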

Scala Exception

I am learning Scala programming to write a driver program for word count in Apache Spark. I am using Windows 7 and the latest Spark version, 2.2.0. While executing the program I get the error mentioned below.
How do I fix this and get a result?
SBT
name := "sample"
version := "0.1"
scalaVersion := "2.12.3"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
"org.apache.spark" % "spark-core_2.11" % sparkVersion,
"org.apache.spark" % "spark-sql_2.11" % sparkVersion,
"org.apache.spark" % "spark-streaming_2.11" % sparkVersion
)
Driver Program
package com.demo.file
import org.apache.spark._
import org.apache.spark.SparkContext._
import org.apache.spark.sql.SparkSession
object Reader {
def main(args: Array[String]): Unit = {
println("Welcome to Reader.")
val filePath = "C:\\notes.txt"
val spark = SparkSession.builder.appName("Simple app").config("spark.master", "local")getOrCreate();
val fileData = spark.read.textFile(filePath).cache()
val count_a = fileData.filter(line => line.contains("a")).count()
val count_b = fileData.filter(line => line.contains("b")).count()
println(s" count of A $count_a and count of B $count_b")
spark.stop()
}
}
Error
Welcome to Reader.
Exception in thread "main" java.lang.NoClassDefFoundError: scala/Product$class
at org.apache.spark.SparkConf$DeprecatedConfig.<init>(SparkConf.scala:723)
at org.apache.spark.SparkConf$.<init>(SparkConf.scala:571)
at org.apache.spark.SparkConf$.<clinit>(SparkConf.scala)
at org.apache.spark.SparkConf.set(SparkConf.scala:92)
at org.apache.spark.SparkConf.set(SparkConf.scala:81)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6$$anonfun$apply$1.apply(SparkSession.scala:905)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6$$anonfun$apply$1.apply(SparkSession.scala:905)
at scala.collection.mutable.HashMap.$anonfun$foreach$1(HashMap.scala:138)
at scala.collection.mutable.HashTable.foreachEntry(HashTable.scala:236)
at scala.collection.mutable.HashTable.foreachEntry$(HashTable.scala:229)
at scala.collection.mutable.HashMap.foreachEntry(HashMap.scala:40)
at scala.collection.mutable.HashMap.foreach(HashMap.scala:138)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:905)
at org.apache.spark.sql.SparkSession$Builder$$anonfun$6.apply(SparkSession.scala:901)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.SparkSession$Builder.getOrCreate(SparkSession.scala:901)
at com.demo.file.Reader$.main(Reader.scala:11)
at com.demo.file.Reader.main(Reader.scala)
Caused by: java.lang.ClassNotFoundException: scala.Product$class
at java.net.URLClassLoader.findClass(URLClassLoader.java:381)
at java.lang.ClassLoader.loadClass(ClassLoader.java:424)
at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:331)
at java.lang.ClassLoader.loadClass(ClassLoader.java:357)
... 18 more
Spark 2.2.0 is built and distributed to work with Scala 2.11 by default. To write applications in Scala, you need to use a compatible Scala version (e.g. 2.11.x). Your Scala version is 2.12.x, which is why it is throwing this exception.
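A sketch of a compatible build.sbt, using %% so sbt resolves the artifacts that match scalaVersion (the exact 2.11.x patch release is an assumption):
name := "sample"
version := "0.1"
// Spark 2.2.0 artifacts are published for Scala 2.11, so compile with 2.11.x.
scalaVersion := "2.11.12"
val sparkVersion = "2.2.0"
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % sparkVersion,
  "org.apache.spark" %% "spark-sql" % sparkVersion,
  "org.apache.spark" %% "spark-streaming" % sparkVersion
)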

Reading RDF in apache spark

I'm trying to read an RDF/XML file into Apache Spark (Scala 2.11, Apache Spark 1.4.1) using Apache Jena. I wrote this Scala snippet:
val factory = new RdfXmlReaderFactory()
HadoopRdfIORegistry.addReaderFactory(factory)
val conf = new Configuration()
conf.set("rdf.io.input.ignore-bad-tuples", "false")
val data = sc.newAPIHadoopFile(path,
classOf[RdfXmlInputFormat],
classOf[LongWritable], //position
classOf[TripleWritable], //value
conf)
data.take(10).foreach(println)
But it throws an error:
INFO readers.AbstractLineBasedNodeTupleReader: Got split with start 0 and length 21765995 for file with total length of 21765995
15/07/23 01:52:42 ERROR readers.AbstractLineBasedNodeTupleReader: Error parsing whole file, aborting further parsing
org.apache.jena.riot.RiotException: Producer failed to ever call start(), declaring producer dead
at org.apache.jena.riot.lang.PipedRDFIterator.hasNext(PipedRDFIterator.java:272)
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractWholeFileNodeTupleReader.nextKeyValue(AbstractWholeFileNodeTupleReader.java:242)
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractRdfReader.nextKeyValue(AbstractRdfReader.java:85)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
...
ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.io.IOException: Error parsing whole file at position 0, aborting further parsing
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractWholeFileNodeTupleReader.nextKeyValue(AbstractWholeFileNodeTupleReader.java:285)
at org.apache.jena.hadoop.rdf.io.input.readers.AbstractRdfReader.nextKeyValue(AbstractRdfReader.java:85)
at org.apache.spark.rdd.NewHadoopRDD$$anon$1.hasNext(NewHadoopRDD.scala:143)
at org.apache.spark.InterruptibleIterator.hasNext(InterruptibleIterator.scala:39)
at scala.collection.Iterator$$anon$10.hasNext(Iterator.scala:350)
The file is good, because I can parse it locally. What am I missing?
EDIT
Some information to reproduce the behaviour
Imports:
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.io.LongWritable
import org.apache.jena.hadoop.rdf.io.registry.HadoopRdfIORegistry
import org.apache.jena.hadoop.rdf.io.registry.readers.RdfXmlReaderFactory
import org.apache.jena.hadoop.rdf.types.QuadWritable
import org.apache.spark.SparkContext
scalaVersion := "2.11.7"
dependencies:
"org.apache.hadoop" % "hadoop-common" % "2.7.1",
"org.apache.hadoop" % "hadoop-mapreduce-client-common" % "2.7.1",
"org.apache.hadoop" % "hadoop-streaming" % "2.7.1",
"org.apache.spark" % "spark-core_2.11" % "1.4.1",
"com.hp.hpl.jena" % "jena" % "2.6.4",
"org.apache.jena" % "jena-elephas-io" % "0.9.0",
"org.apache.jena" % "jena-elephas-mapreduce" % "0.9.0"
I'm using a sample RDF dump of freely available information about John Peel sessions.
So it appears your problem was down to you manually managing your dependencies.
In my environment I was simply passing the following to my Spark shell:
--packages org.apache.jena:jena-elephas-io:0.9.0
This does all the dependency resolution for you.
If you are building an SBT project, it should be sufficient to add the following to your build.sbt:
libraryDependencies += "org.apache.jena" % "jena-elephas-io" % "0.9.0"
Thanks all for the discussion in the comments. The problem was really tricky and not clear from the stack trace: the code needs one extra dependency, jena-core, and this dependency must be packaged first.
"org.apache.jena" % "jena-core" % "2.13.0"
"com.hp.hpl.jena" % "jena" % "2.6.4"
I use this assembly strategy:
lazy val strategy = assemblyMergeStrategy in assembly <<= (assemblyMergeStrategy in assembly) { (old) => {
case PathList("META-INF", xs @ _*) =>
(xs map {_.toLowerCase}) match {
case ("manifest.mf" :: Nil) | ("index.list" :: Nil) | ("dependencies" :: Nil) => MergeStrategy.discard
case _ => MergeStrategy.discard
}
case x => MergeStrategy.first
}
}