Spark Streaming cannot call the toDF function - Scala

I am trying to make a Spark Streaming application that connects to Flume.
I managed to save the data while it is an RDD, but if I try to convert it to a DataFrame using the toDF function it throws an error. I am working in the shell, so it is hard to see what the error is.
This is the code I am running:
//importing relevant libraries
import org.apache.spark.SparkConf
import org.apache.spark.streaming._
import org.apache.spark.streaming.flume._
import org.apache.spark.util.IntParam
import org.apache.spark.storage.StorageLevel
import org.apache.spark.sql.SQLContext
//creating the spark streaming configuration
val ssc = new StreamingContext(sc, Seconds(5))
val stream = FlumeUtils.createStream(ssc, "0.0.0.0", 44444, StorageLevel.MEMORY_ONLY_SER_2)
//starting the streaming job
val textStream = stream.map(e => new String(e.event.getBody.array) )
val numlines = textStream.count()
numlines.print()
textStream.foreachRDD { rdd =>
//some stuff that needs to be created
import java.util.Date
val d = new Date
//delimiter of '&'
val rdd_s = rdd.map(line => line.split("&"))
val rdd_split = rdd_s.map(line => (d.getTime.toString, line(2), line(3).toInt))
//the data is only saved if the toDF call below is commented out
rdd_split.saveAsTextFile("/flume/text/final/")
//creating the data-frame - if commented out, the data will be saved to file
val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
import sqlContext.implicits._
val df = rdd_split.toDF("moment", "ID","amount")
df.saveAsTextFile("/idan/streaming/flume/text/final/withtime")
}
ssc.start()
I found the answer to this.
What I needed to do was create a lazy singleton of the SQLContext to create the DataFrame.
Here is the final singleton code:
//creating the case class for the DataFrame
case class Record(moment: String, name: String, id: String, amount: Int)

/** Lazily instantiated singleton instance of SQLContext */
object SQLContextSingleton {
  @transient private var instance: SQLContext = null

  // Instantiate SQLContext on demand
  def getInstance(sc: SparkContext): SQLContext = synchronized {
    if (instance == null) {
      instance = new SQLContext(sc)
    }
    instance
  }
}
And to create the DataFrame I needed to go through the singleton:
val sqlContext = SQLContextSingleton.getInstance(rdd_s.sparkContext)
import sqlContext.implicits._
val df = sqlContext.createDataFrame(rdd_s.map(line => (d.getTime, line(0), line(1), line(2), line(3))))
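For reference, here is a minimal sketch of how the pieces fit together inside foreachRDD. The column names, field indices, and output path below are placeholders based on the snippets above, so adjust them to your data:
textStream.foreachRDD { rdd =>
  val sqlContext = SQLContextSingleton.getInstance(rdd.sparkContext)
  import sqlContext.implicits._
  val d = new java.util.Date
  //split each '&'-delimited line; the field indices here are assumptions
  val rdd_split = rdd.map(_.split("&"))
    .map(line => (d.getTime.toString, line(2), line(3).toInt))
  val df = rdd_split.toDF("moment", "ID", "amount")
  //a DataFrame has no saveAsTextFile; save the underlying RDD instead (or use df.write)
  df.rdd.saveAsTextFile("/flume/text/final/withtime")
}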
Edit:
This is the error it gives me:
16/08/09 13:16:05 ERROR scheduler.JobScheduler: Error running job streaming job 1470748565000 ms.1
java.lang.NullPointerException
at org.apache.spark.sql.hive.client.ClientWrapper.conf(ClientWrapper.scala:205)
at org.apache.spark.sql.hive.HiveContext.hiveconf$lzycompute(HiveContext.scala:554)
at org.apache.spark.sql.hive.HiveContext.hiveconf(HiveContext.scala:553)
at org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:540)
at org.apache.spark.sql.hive.HiveContext$$anonfun$configure$1.apply(HiveContext.scala:539)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
at scala.collection.AbstractTraversable.map(Traversable.scala:105)
at org.apache.spark.sql.hive.HiveContext.configure(HiveContext.scala:539)
at org.apache.spark.sql.hive.HiveContext.metadataHive$lzycompute(HiveContext.scala:252)
at org.apache.spark.sql.hive.HiveContext.metadataHive(HiveContext.scala:239)
at org.apache.spark.sql.hive.HiveContext$$anon$2.<init>(HiveContext.scala:459)
at org.apache.spark.sql.hive.HiveContext.catalog$lzycompute(HiveContext.scala:459)
at org.apache.spark.sql.hive.HiveContext.catalog(HiveContext.scala:458)
at org.apache.spark.sql.hive.HiveContext$$anon$3.<init>(HiveContext.scala:475)
at org.apache.spark.sql.hive.HiveContext.analyzer$lzycompute(HiveContext.scala:475)
at org.apache.spark.sql.hive.HiveContext.analyzer(HiveContext.scala:474)
at org.apache.spark.sql.execution.QueryExecution.assertAnalyzed(QueryExecution.scala:34)
at org.apache.spark.sql.DataFrame.<init>(DataFrame.scala:133)
at org.apache.spark.sql.DataFrame$.apply(DataFrame.scala:52)
at org.apache.spark.sql.SQLContext.createDataFrame(SQLContext.scala:417)
at org.apache.spark.sql.SQLImplicits.rddToDataFrameHolder(SQLImplicits.scala:155)
at $line46.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:58)
at $line46.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:48)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:661)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:50)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:426)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:49)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:39)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:224)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:223)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
16/08/09 13:16:08 ERROR scheduler.ReceiverTracker: Deregistered receiver for stream 0: Stopped by driver
Did anyone happen to have this kind of problem and know how to help?
Thanks :)

Related

Spark NaiveBayesTextClassification

I'm trying to create a text classifier Spark (1.6.2) app, but I don't know what I am doing wrong. This is my code:
import org.apache.spark.ml.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.mllib
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
/**
* Created by kebodev on 2016.11.29..
*/
object PredTest {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("IktatoSparkRunner")
.set("spark.executor.memory", "2gb")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val trainData = sqlContext.read.json("src/main/resources/tst.json")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(trainData)
val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("features").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val model = NaiveBayes.train(featurizedData)
}
}
The NaiveBayes object doesn't have a train method, so what should I import?
If I try to use it this way:
val naBa = new NaiveBayes()
naBa.fit(featurizedData)
I get this exception:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column label must be of type DoubleType but was actually StringType.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:56)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:40)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.ProbabilisticClassifier.validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:68)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:89)
at PredTest$.main(PredTest.scala:37)
at PredTest.main(PredTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
This is what my JSON file looks like:
{"text":"any text","label":"6.0"}
I'm really a noob on this topic. Can anyone help me with how to create a model, and then how to predict a new value?
Thank you!
Labels and feature vectors only contain Doubles; your label column contains a String.
See your stacktrace:
Column label must be of type DoubleType but was actually StringType.
You can use the StringIndexer or CountVectorizer to convert it appropriately. See http://spark.apache.org/docs/latest/ml-features.html#stringindexer for further details.
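As a rough sketch of the StringIndexer route, assuming the column names from the code above (the "labelIndex" name is just a placeholder):
import org.apache.spark.ml.feature.StringIndexer
//index the string label column into a numeric column of DoubleType
val indexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("labelIndex")
val indexedData = indexer.fit(featurizedData).transform(featurizedData)
//point NaiveBayes at the indexed label and the hashed features
val model = new NaiveBayes()
  .setLabelCol("labelIndex")
  .setFeaturesCol("features")
  .fit(indexedData)
//predictions on data that went through the same tokenizer/hashingTF/indexer steps
model.transform(indexedData).select("labelIndex", "prediction").show()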

Unable to serialize SparkContext in foreachRDD

I am trying to save streaming data from Kafka to Cassandra. I am able to read and parse the data, but when I call the lines below to save it I get a Task not serializable exception. My class extends Serializable, but I am not sure why I am seeing this error; I didn't get much help even after googling for 3 hours. Can somebody give me any pointers?
val collection = sc.parallelize(Seq((obj.id, obj.data)))
collection.saveToCassandra("testKS", "testTable ", SomeColumns("id", "data"))
import org.apache.spark.SparkConf
import org.apache.spark.SparkContext
import org.apache.spark.sql.SaveMode
import org.apache.spark.sql.SQLContext
import org.apache.spark.streaming.Seconds
import org.apache.spark.streaming.StreamingContext
import org.apache.spark.streaming.kafka.KafkaUtils
import com.datastax.spark.connector._
import kafka.serializer.StringDecoder
import org.apache.spark.rdd.RDD
import com.datastax.spark.connector.SomeColumns
import java.util.Formatter.DateTime
object StreamProcessor extends Serializable {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setMaster("local[2]").setAppName("StreamProcessor")
.set("spark.cassandra.connection.host", "127.0.0.1")
val sc = new SparkContext(sparkConf)
val ssc = new StreamingContext(sc, Seconds(2))
val sqlContext = new SQLContext(sc)
val kafkaParams = Map("metadata.broker.list" -> "localhost:9092")
val topics = args.toSet
val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
ssc, kafkaParams, topics)
stream.foreachRDD { rdd =>
if (!rdd.isEmpty()) {
try {
rdd.foreachPartition { iter =>
iter.foreach {
case (key, msg) =>
val obj = msgParseMaster(msg)
val collection = sc.parallelize(Seq((obj.id, obj.data)))
collection.saveToCassandra("testKS", "testTable ", SomeColumns("id", "data"))
}
}
}
}
}
ssc.start()
ssc.awaitTermination()
}
import org.json4s._
import org.json4s.native.JsonMethods._
case class wordCount(id: Long, data: String) extends Serializable
implicit val formats = DefaultFormats
def msgParseMaster(msg: String): wordCount = {
val m = parse(msg).extract[wordCount]
return m
}
}
I am getting:
org.apache.spark.SparkException: Task not serializable
Below is the full log:
16/08/06 10:24:52 ERROR JobScheduler: Error running job streaming job 1470504292000 ms.0
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2055)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:919)
at org.apache.spark.rdd.RDD$$anonfun$foreachPartition$1.apply(RDD.scala:918)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:150)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:111)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:316)
at org.apache.spark.rdd.RDD.foreachPartition(RDD.scala:918)
at
SparkContext isn't serializable, so you can't use it inside foreachRDD, and from the shape of your job you don't need it. Instead, you can simply map over each RDD, parse out the relevant data, and save that new RDD to Cassandra:
stream
  .map {
    case (_, msg) =>
      val result = msgParseMaster(msg)
      (result.id, result.data)
  }
  .foreachRDD(rdd => if (!rdd.isEmpty)
    rdd.saveToCassandra("testKS",
      "testTable",
      SomeColumns("id", "data")))
You can not call sc.parallelize within a function passed to foreachPartition - that function would have to be serialized and sent to each executor, and SparkContext is (intentionally) not serializable (it should only reside within the Driver application, not the executor).
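If you really do need per-record control inside foreachPartition, a hedged alternative is to open a Cassandra session per partition via the connector's CassandraConnector instead of touching the SparkContext (sketch only, assuming the DataStax spark-cassandra-connector is on the classpath):
import com.datastax.spark.connector.cql.CassandraConnector
//built once on the driver; CassandraConnector itself is serializable
val connector = CassandraConnector(sc.getConf)
stream.foreachRDD { rdd =>
  rdd.foreachPartition { iter =>
    //one session per partition, no SparkContext on the executor
    connector.withSessionDo { session =>
      iter.foreach { case (_, msg) =>
        val obj = msgParseMaster(msg)
        session.execute(
          "INSERT INTO testKS.testTable (id, data) VALUES (?, ?)",
          java.lang.Long.valueOf(obj.id), obj.data)
      }
    }
  }
}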

spark java.io.NotSerializableException: org.apache.spark.SparkContext

I'm trying to implement a check, in Spark Streaming, of whether a record already exists for each message received from Kafka. When I run the RunReadLogByKafka object, a NotSerializableException for SparkContext is thrown. I googled it, but I still don't know how to fix it. Could anyone suggest how to rewrite it? Thanks in advance.
package com.test.spark.hbase
import java.sql.{DriverManager, PreparedStatement, Connection}
import java.text.SimpleDateFormat
import com.powercn.spark.LogRow
import com.powercn.spark.SparkReadHBaseTest.{SensorStatsRow, SensorRow}
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{TableName, HBaseConfiguration}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Result, Put}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.sql.{SQLContext, Row}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
case class LogRow(rowkey: String, content: String)
object LogRow {
def parseLogRow(result: Result): LogRow = {
val rowkey = Bytes.toString(result.getRow())
val p0 = rowkey
val p1 = Bytes.toString(result.getValue(Bytes.toBytes("data"), Bytes.toBytes("content")))
LogRow(p0, p1)
}
}
class ReadLogByKafka(sct:SparkContext) extends Serializable {
implicit def func(records: String) {
@transient val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "log")
@transient val sc = sct
@transient val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._
try {
//query info table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
println(hBaseRDD.count())
// transform (ImmutableBytesWritable, Result) tuples into an RDD of Results
val resultRDD = hBaseRDD.map(tuple => tuple._2)
println(resultRDD.count())
val logRDD = resultRDD.map(LogRow.parseLogRow)
val logDF = logRDD.toDF()
logDF.printSchema()
logDF.show()
// register the DataFrame as a temp table
logDF.registerTempTable("LogRow")
val logAdviseDF = sqlContext.sql("SELECT rowkey, content as content FROM LogRow ")
logAdviseDF.printSchema()
logAdviseDF.take(5).foreach(println)
} catch {
case e: Exception => e.printStackTrace()
} finally {
}
}
}
package com.test.spark.hbase
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.io.Text
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
object RunReadLogByKafka extends Serializable {
def main(args: Array[String]): Unit = {
val broker = "192.168.13.111:9092"
val topic = "log"
@transient val sparkConf = new SparkConf().setAppName("RunReadLogByKafka")
@transient val streamingContext = new StreamingContext(sparkConf, Seconds(2))
@transient val sc = streamingContext.sparkContext
val kafkaConf = Map("metadata.broker.list" -> broker,
"group.id" -> "group1",
"zookeeper.connection.timeout.ms" -> "3000",
"kafka.auto.offset.reset" -> "smallest")
// Define which topics to read from
val topics = Set(topic)
val messages = KafkaUtils.createDirectStream[Array[Byte], String, DefaultDecoder, StringDecoder](
streamingContext, kafkaConf, topics).map(_._2)
messages.foreachRDD(rdd => {
val readLogByKafka =new ReadLogByKafka(sc)
//parse every message, it will throw NotSerializableException
rdd.foreach(readLogByKafka.func)
})
streamingContext.start()
streamingContext.awaitTermination()
}
}
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2021)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:889)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:888)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:888)
at com.test.spark.hbase.RunReadLogByKafka$$anonfun$main$1.apply(RunReadLogByKafka.scala:38)
at com.test.spark.hbase.RunReadLogByKafka$$anonfun$main$1.apply(RunReadLogByKafka.scala:35)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:207)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:207)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:207)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext@5207878f)
- field (class: com.test.spark.hbase.ReadLogByKafka, name: sct, type: class org.apache.spark.SparkContext)
- object (class com.test.spark.hbase.ReadLogByKafka, com.test.spark.hbase.ReadLogByKafka@60212100)
- field (class: com.test.spark.hbase.RunReadLogByKafka$$anonfun$main$1$$anonfun$apply$1, name: readLogByKafka$1, type: class com.test.spark.hbase.ReadLogByKafka)
- object (class com.test.spark.hbase.RunReadLogByKafka$$anonfun$main$1$$anonfun$apply$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 30 more
The object you get as an argument in foreachRDD is a standard Spark RDD, so you have to obey exactly the same rules as usual, including no nested actions or transformations and no access to the SparkContext. It is not exactly clear what you are trying to achieve (it doesn't look like ReadLogByKafka.func is doing anything useful), but I guess you're looking for some kind of join.
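As a hedged sketch of that join idea, reusing the HBase lookup and LogRow parser from the code above: the body of foreachRDD runs on the driver, so the SparkContext can be used there to build the HBase side and join it with the incoming batch (how the rowkey relates to the Kafka message is an assumption here):
messages.foreachRDD { rdd =>
  //this block runs on the driver, so using the SparkContext here is fine
  val sc = rdd.sparkContext
  val conf = HBaseConfiguration.create()
  conf.set(TableInputFormat.INPUT_TABLE, "log")
  val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
    classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
    classOf[org.apache.hadoop.hbase.client.Result])
  //key both sides by rowkey and join them
  val existing = hBaseRDD.map { case (_, result) =>
    val row = LogRow.parseLogRow(result)
    (row.rowkey, row.content)
  }
  val incoming = rdd.map(msg => (msg, msg)) //assumption: the message itself is the rowkey
  incoming.join(existing).take(5).foreach(println)
}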

Registering UDF in SparkStreaming Application

I have a Spark Streaming application, written in Scala and using Spark SQL, that attempts to register a UDF after getting an RDD. I get the error below. Is it not possible to register UDFs in a Spark Streaming app?
Here is the code snippet that throws the error:
sessionStream.foreachRDD((rdd: RDD[(String)], time: Time) => {
val sqlcc = SqlContextSingleton.getInstance(rdd.sparkContext)
sqlcc.udf.register("getUUID", () => java.util.UUID.randomUUID().toString)
...
})
Here is the error thrown when I attempt to register the function:
Exception in thread "pool-6-thread-6" java.lang.NoSuchMethodError: scala.reflect.api.JavaUniverse.runtimeMirror(Ljava/lang/ClassLoader;)Lscala/reflect/api/JavaUniverse$JavaMirror;
at com.ignitionone.datapipeline.ClusterApp$$anonfun$CreateCheckpointStreamContext$1.apply(ClusterApp.scala:173)
at com.ignitionone.datapipeline.ClusterApp$$anonfun$CreateCheckpointStreamContext$1.apply(ClusterApp.scala:164)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:42)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:32)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:176)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:176)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:175)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
sessionStream.foreachRDD((rdd: RDD[Event], time: Time) => {
  val f = (t: Long) => t - t % 60000
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  val df = rdd.toDF()
  val per_min = udf(f)
  val grouped = df.groupBy(per_min(df("created_at")) as "created_at",
    df("blah"),
    df("status")
  ).agg(sum("price") as "price", sum("payout") as "payout", sum("counter") as "counter")
  ...
})
is working fine for me.
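For completeness, a hedged sketch of also registering the UDF for use from SQL on the same per-batch SQLContext (the temp table name here is just a placeholder):
sessionStream.foreachRDD((rdd: RDD[Event], time: Time) => {
  val sqlContext = SQLContext.getOrCreate(rdd.sparkContext)
  import sqlContext.implicits._
  //register the UDF on this batch's SQLContext, then call it from SQL
  sqlContext.udf.register("getUUID", () => java.util.UUID.randomUUID().toString)
  val df = rdd.toDF()
  df.registerTempTable("events") //placeholder table name
  sqlContext.sql("SELECT getUUID() AS row_id FROM events").show()
})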

Spark Streaming into HBase with filtering logic

I have been trying to understand how Spark Streaming and HBase connect, but have not been successful. What I am trying to do is, given a Spark stream, process that stream and store the results in an HBase table. So far this is what I have:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HBaseAdmin,HTable,Put,Get}
import org.apache.hadoop.hbase.util.Bytes
def blah(row: Array[String]) {
val hConf = new HBaseConfiguration()
val hTable = new HTable(hConf, "table")
val thePut = new Put(Bytes.toBytes(row(0)))
thePut.add(Bytes.toBytes("cf"), Bytes.toBytes(row(0)), Bytes.toBytes(row(0)))
hTable.put(thePut)
}
val ssc = new StreamingContext(sc, Seconds(1))
val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
val words = lines.map(_.split(","))
val store = words.foreachRDD(rdd => rdd.foreach(blah))
ssc.start()
I am currently running the above code in spark-shell. I am not sure what I am doing wrong.
I get the following error in the shell:
14/09/03 16:21:03 ERROR scheduler.JobScheduler: Error running job streaming job 1409786463000 ms.0
org.apache.spark.SparkException: Job aborted due to stage failure: Task not serializable: java.io.NotSerializableException: org.apache.spark.streaming.StreamingContext
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1033)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1017)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1015)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1015)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitMissingTasks(DAGScheduler.scala:770)
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$submitStage(DAGScheduler.scala:713)
at org.apache.spark.scheduler.DAGScheduler.handleJobSubmitted(DAGScheduler.scala:697)
at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1176)
at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
at akka.actor.ActorCell.invoke(ActorCell.scala:456)
at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
at akka.dispatch.Mailbox.run(Mailbox.scala:219)
at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
I also double checked the hbase table, just in case, and nothing new is written in there.
I am running nc -lk 9999 on another terminal to feed in data into the spark-shell for testing.
With help from users on the Spark user group, I was able to figure out how to get this to work. It looks like I needed to wrap my streaming, mapping and foreach calls inside serializable objects:
import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.StreamingContext._
import org.apache.spark.storage.StorageLevel
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.client.{HBaseAdmin,HTable,Put,Get}
import org.apache.hadoop.hbase.util.Bytes
object Blaher {
  def blah(row: Array[String]) {
    val hConf = new HBaseConfiguration()
    val hTable = new HTable(hConf, "table")
    val thePut = new Put(Bytes.toBytes(row(0)))
    thePut.add(Bytes.toBytes("cf"), Bytes.toBytes(row(0)), Bytes.toBytes(row(0)))
    hTable.put(thePut)
  }
}

object TheMain extends Serializable {
  def run() {
    val ssc = new StreamingContext(sc, Seconds(1))
    val lines = ssc.socketTextStream("localhost", 9999, StorageLevel.MEMORY_AND_DISK_SER)
    val words = lines.map(_.split(","))
    val store = words.foreachRDD(rdd => rdd.foreach(Blaher.blah))
    ssc.start()
  }
}

TheMain.run()
This seems to be a typical antipattern.
See the "Design Patterns for using foreachRDD" section at http://spark.apache.org/docs/latest/streaming-programming-guide.html for the correct pattern.