How to create Spark Dataframe from case classes that contains Enums [duplicate] - scala

I have been trying to create Spark Dataset using case classes that contain Enums but I'm not able to. I'm using Spark version 1.6.0. The exceptions is complaining about that there are no encoder found for my Enum. Is this not possible in Spark, to have enums in the data?
Code:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
object MyEnum extends Enumeration {
type MyEnum = Value
val Hello, World = Value
}
case class MyData(field: String, other: MyEnum.Value)
object EnumTest {
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlCtx = new SQLContext(sc)
import sqlCtx.implicits._
val df = sc.parallelize(Array(MyData("hello", MyEnum.World))).toDS()
println(s"df: ${df.collect().mkString(",")}}")
}
}
Error:
Exception in thread "main" java.lang.UnsupportedOperationException: No Encoder found for com.company.MyEnum.Value
- field (class: "scala.Enumeration.Value", name: "other")
- root class: "com.company.MyData"
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor(ScalaReflection.scala:597)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor$1.apply(ScalaReflection.scala:509)
at org.apache.spark.sql.catalyst.ScalaReflection$$anonfun$org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor$1.apply(ScalaReflection.scala:502)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.TraversableLike$$anonfun$flatMap$1.apply(TraversableLike.scala:251)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.flatMap(TraversableLike.scala:251)
at scala.collection.AbstractTraversable.flatMap(Traversable.scala:105)
at org.apache.spark.sql.catalyst.ScalaReflection$.org$apache$spark$sql$catalyst$ScalaReflection$$extractorFor(ScalaReflection.scala:502)
at org.apache.spark.sql.catalyst.ScalaReflection$.extractorsFor(ScalaReflection.scala:394)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:54)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:41)
at com.company.EnumTest$.main(EnumTest.scala:22)
at com.company.EnumTest.main(EnumTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)

You can create your own encoder:
import org.apache.spark.sql.SQLContext
import org.apache.spark.{SparkConf, SparkContext}
object MyEnum extends Enumeration {
type MyEnum = Value
val Hello, World = Value
}
case class MyData(field: String, other: MyEnum.Value)
object MyDataEncoders {
implicit def myDataEncoder: org.apache.spark.sql.Encoder[MyData] =
org.apache.spark.sql.Encoders.kryo[MyData]
}
object EnumTest {
import MyDataEncoders._
def main(args: Array[String]): Unit = {
val sparkConf = new SparkConf().setAppName("test").setMaster("local[*]")
val sc = new SparkContext(sparkConf)
val sqlCtx = new SQLContext(sc)
import sqlCtx.implicits._
val df = sc.parallelize(Array(MyData("hello", MyEnum.World))).toDS()
println(s"df: ${df.collect().mkString(",")}}")
}
}

Related

How to create a PolygonRDD from H3 boundary?

I'm using Apache Spark with Apache Sedona (previously called GeoSpark), and I'm trying to do the following:
Take a DataFrame containing latitude and longitude in each row (it comes from an arbitrary source, it neither is a PointRDD nor comes from a specific file format) and transform it into a DataFrame with the H3 index of each point.
Take that DataFrame and create a PolygonRDD containing the H3 cell boundaries of each distinct H3 index.
This is what I have so far:
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.sedona.core.spatialRDD.PolygonRDD
import org.apache.sedona.sql.utils.SedonaSQLRegistrator
import org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
import org.apache.sedona.viz.sql.utils.SedonaVizRegistrator
import org.locationtech.jts.geom.{Polygon, GeometryFactory, Coordinate}
import com.uber.h3core.H3Core
import com.uber.h3core.util.GeoCoord
object Main {
def main(args: Array[String]) {
val sparkSession: SparkSession = SparkSession
.builder()
.config("spark.serializer", classOf[KryoSerializer].getName)
.config("spark.kryo.registrator", classOf[SedonaVizKryoRegistrator].getName)
.master("local[*]")
.appName("Sedona-Analysis")
.getOrCreate()
import sparkSession.implicits._
SedonaSQLRegistrator.registerAll(sparkSession)
SedonaVizRegistrator.registerAll(sparkSession)
val df = Seq(
(-8.01681, -34.92618),
(-25.59306, -49.39895),
(-7.17897, -34.86518),
(-20.24521, -42.14273),
(-20.24628, -42.14785),
(-27.01641, -50.94109),
(-19.72987, -47.94319)
).toDF("latitude", "longitude")
val core: H3Core = H3Core.newInstance()
val geoFactory = new GeometryFactory()
val geoToH3 = udf((lat: Double, lng: Double, res: Int) => core.geoToH3(lat, lng, res))
val trdd = df
.select(geoToH3($"latitude", $"longitude", lit(7)).as("h3index"))
.distinct()
.rdd
.map(row => {
val h3 = row.getAs[Long](0)
val lboundary = core.h3ToGeoBoundary(h3)
val aboundary = lboundary.toArray(Array.ofDim[GeoCoord](lboundary.size))
val poly = geoFactory.createPolygon(
aboundary.map((c: GeoCoord) => new Coordinate(c.lat, c.lng))
)
poly.setUserData(h3)
poly
})
val polyRDD = new PolygonRDD(trdd)
polyRDD.rawSpatialRDD.foreach(println)
sparkSession.stop()
}
}
However, after running sbt assembly and submitting the output jar to spark-submit, I get this error:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:416)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:406)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:162)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2362)
at org.apache.spark.rdd.RDD.$anonfun$map$1(RDD.scala:396)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:388)
at org.apache.spark.rdd.RDD.map(RDD.scala:395)
at Main$.main(Main.scala:44)
at Main.main(Main.scala)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at java.base/jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at java.base/jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.base/java.lang.reflect.Method.invoke(Method.java:566)
at org.apache.spark.deploy.JavaMainApplication.start(SparkApplication.scala:52)
at org.apache.spark.deploy.SparkSubmit.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:928)
at org.apache.spark.deploy.SparkSubmit.doRunMain$1(SparkSubmit.scala:180)
at org.apache.spark.deploy.SparkSubmit.submit(SparkSubmit.scala:203)
at org.apache.spark.deploy.SparkSubmit.doSubmit(SparkSubmit.scala:90)
at org.apache.spark.deploy.SparkSubmit$$anon$2.doSubmit(SparkSubmit.scala:1007)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:1016)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.io.NotSerializableException: com.uber.h3core.H3Core
Serialization stack:
- object not serializable (class: com.uber.h3core.H3Core, value: com.uber.h3core.H3Core#3407ded1)
- element of array (index: 0)
- array (class [Ljava.lang.Object;, size 2)
- field (class: java.lang.invoke.SerializedLambda, name: capturedArgs, type: class [Ljava.lang.Object;)
- object (class java.lang.invoke.SerializedLambda, SerializedLambda[capturingClass=class Main$, functionalInterfaceMethod=scala/Function1.apply:(Ljava/lang/Object;)Ljava/lang/Object;, implementation=invokeStatic Main$.$anonfun$main$2:(Lcom/uber/h3core/H3Core;Lorg/locationtech/jts/geom/GeometryFactory;Lorg/apache/spark/sql/Row;)Lorg/locationtech/jts/geom/Polygon;, instantiatedMethodType=(Lorg/apache/spark/sql/Row;)Lorg/locationtech/jts/geom/Polygon;, numCaptured=2])
- writeReplace data (class: java.lang.invoke.SerializedLambda)
- object (class Main$$$Lambda$1710/0x0000000840d7f040, Main$$$Lambda$1710/0x0000000840d7f040#4853f592)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:41)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:101)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:413)
... 22 more
What is the proper way to achieve what I'm trying to do?
So, basically just adding the Serializable trait to an object containing the H3Core was enough. Also, I had to adjust the Coordinate array to begin and end with the same point.
import org.apache.spark.serializer.KryoSerializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.sedona.core.spatialRDD.PolygonRDD
import org.apache.sedona.sql.utils.SedonaSQLRegistrator
import org.apache.sedona.viz.core.Serde.SedonaVizKryoRegistrator
import org.apache.sedona.viz.sql.utils.SedonaVizRegistrator
import org.locationtech.jts.geom.{Polygon, GeometryFactory, Coordinate}
import com.uber.h3core.H3Core
import com.uber.h3core.util.GeoCoord
object H3 extends Serializable {
val core = H3Core.newInstance()
val geoFactory = new GeometryFactory()
}
object Main {
def main(args: Array[String]) {
val sparkSession: SparkSession = SparkSession
.builder()
.config("spark.serializer", classOf[KryoSerializer].getName)
.config("spark.kryo.registrator", classOf[SedonaVizKryoRegistrator].getName)
.master("local[*]")
.appName("Sedona-Analysis")
.getOrCreate()
import sparkSession.implicits._
SedonaSQLRegistrator.registerAll(sparkSession)
SedonaVizRegistrator.registerAll(sparkSession)
val df = Seq(
(-8.01681, -34.92618),
(-25.59306, -49.39895),
(-7.17897, -34.86518),
(-20.24521, -42.14273),
(-20.24628, -42.14785),
(-27.01641, -50.94109),
(-19.72987, -47.94319)
).toDF("latitude", "longitude")
val geoToH3 = udf((lat: Double, lng: Double, res: Int) => H3.core.geoToH3(lat, lng, res))
val trdd = df
.select(geoToH3($"latitude", $"longitude", lit(7)).as("h3index"))
.distinct()
.rdd
.map(row => {
val h3 = row.getAs[Long](0)
val lboundary = H3.core.h3ToGeoBoundary(h3)
val aboundary = lboundary.toArray(Array.ofDim[GeoCoord](lboundary.size))
val poly = H3.geoFactory.createPolygon({
val ps = aboundary.map((c: GeoCoord) => new Coordinate(c.lat, c.lng))
ps :+ ps(0)
})
poly.setUserData(h3)
poly
})
val polyRDD = new PolygonRDD(trdd)
polyRDD.rawSpatialRDD.foreach(println)
sparkSession.stop()
}
}

No implicits found for parameters i0: TypedColumn.Exists

I am trying out frameless library for Scala and getting an "No implicits found for parameters i0: TypedColumn.Exists". If you can help me resolve it - that would be awesome....
I am using spark 2.4.0 and frameless 0.8.0.
Following is my code
import org.apache.spark.sql.SparkSession
import frameless.TypedDataset
object TestSpark {
def main(args: Array[String]): Unit = {
val spark = SparkSession.builder()
.master("local[*]")
.appName("Spark Test")
.getOrCreate
import spark.implicits._
val empDS = spark.read
.option("header",true)
.option("delimiter",",")
.csv("emp.csv")
.as[Emp]
val empTyDS = TypedDataset.create(empDS)
import frameless.syntax._
empTyDS.show(10,false).run
val deptCol = empTyDS('dept) //Get the error here.`
}
}
case class for the code is
case class Emp (
name: String,
dept: String,
manager: String,
salary: String
)

Spark NaiveBayesTextClassification

i'm trying to create a text classifier spark(1.6.2) app, but I don't know what am I doing wrong. This is my code:
import org.apache.spark.ml.classification.{NaiveBayes, NaiveBayesModel}
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.{Row, SQLContext}
import org.apache.spark.mllib
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.ml.feature.{HashingTF, IDF, Tokenizer}
/**
* Created by kebodev on 2016.11.29..
*/
object PredTest {
def main(args: Array[String]): Unit = {
val conf = new SparkConf()
.setMaster("local[*]")
.setAppName("IktatoSparkRunner")
.set("spark.executor.memory", "2gb")
val sc = new SparkContext(conf)
val sqlContext = new SQLContext(sc)
val trainData = sqlContext.read.json("src/main/resources/tst.json")
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val wordsData = tokenizer.transform(trainData)
val hashingTF = new HashingTF()
.setInputCol("words").setOutputCol("features").setNumFeatures(20)
val featurizedData = hashingTF.transform(wordsData)
val model = NaiveBayes.train(featurizedData)
}
}
The NaiveBayes object doesn't have train method, what should I import?
If i try to use this way:
val naBa = new NaiveBayes()
naBa.fit(featurizedData)
I get this exception:
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Column label must be of type DoubleType but was actually StringType.
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
at org.apache.spark.ml.classification.Classifier.org$apache$spark$ml$classification$ClassifierParams$$super$validateAndTransformSchema(Classifier.scala:56)
at org.apache.spark.ml.classification.ClassifierParams$class.validateAndTransformSchema(Classifier.scala:40)
at org.apache.spark.ml.classification.ProbabilisticClassifier.org$apache$spark$ml$classification$ProbabilisticClassifierParams$$super$validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.classification.ProbabilisticClassifierParams$class.validateAndTransformSchema(ProbabilisticClassifier.scala:37)
at org.apache.spark.ml.classification.ProbabilisticClassifier.validateAndTransformSchema(ProbabilisticClassifier.scala:53)
at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:68)
at org.apache.spark.ml.Predictor.fit(Predictor.scala:89)
at PredTest$.main(PredTest.scala:37)
at PredTest.main(PredTest.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at com.intellij.rt.execution.application.AppMain.main(AppMain.java:147)
This is how my json file looks like:
{"text":"any text","label":"6.0"}
I'm really noob in this topic. Can anyone help me how to create a model, and then how to predict a new value.
Thank you!
Labels and Feature Vectors only contain Doubles. Your label column contains a String.
See your stacktrace:
Column label must be of type DoubleType but was actually StringType.
You can use the StringIndexer or CountVectorizer to convert it appropriately. See http://spark.apache.org/docs/latest/ml-features.html#stringindexer for further details.

spark java.io.NotSerializableException: org.apache.spark.SparkContext

I'm try to implement to check exist record by received message from kafka in spark by spark streaming, now when i run RunReadLogByKafka object, there is a NotSerializableException for SparkContext was throwed, i google it, but i still don't know how to fix it, Could anyone suggest me how to rewrite it? thanks in advance.
package com.test.spark.hbase
import java.sql.{DriverManager, PreparedStatement, Connection}
import java.text.SimpleDateFormat
import com.powercn.spark.LogRow
import com.powercn.spark.SparkReadHBaseTest.{SensorStatsRow, SensorRow}
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.{TableName, HBaseConfiguration}
import org.apache.hadoop.hbase.client.{ConnectionFactory, Result, Put}
import org.apache.hadoop.hbase.io.ImmutableBytesWritable
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.io.Text
import org.apache.hadoop.mapred.JobConf
import org.apache.spark.sql.{SQLContext, Row}
import org.apache.spark.{SparkContext, SparkConf}
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
case class LogRow(rowkey: String, content: String)
object LogRow {
def parseLogRow(result: Result): LogRow = {
val rowkey = Bytes.toString(result.getRow())
val p0 = rowkey
val p1 = Bytes.toString(result.getValue(Bytes.toBytes("data"), Bytes.toBytes("content")))
LogRow(p0, p1)
}
}
class ReadLogByKafka(sct:SparkContext) extends Serializable {
implicit def func(records: String) {
#transient val conf = HBaseConfiguration.create()
conf.set(TableInputFormat.INPUT_TABLE, "log")
#transient val sc = sct
#transient val sqlContext = SQLContext.getOrCreate(sc)
import sqlContext.implicits._
try {
//query info table
val hBaseRDD = sc.newAPIHadoopRDD(conf, classOf[TableInputFormat],
classOf[org.apache.hadoop.hbase.io.ImmutableBytesWritable],
classOf[org.apache.hadoop.hbase.client.Result])
println(hBaseRDD.count())
// transform (ImmutableBytesWritable, Result) tuples into an RDD of Results
val resultRDD = hBaseRDD.map(tuple => tuple._2)
println(resultRDD.count())
val logRDD = resultRDD.map(LogRow.parseLogRow)
val logDF = logRDD.toDF()
logDF.printSchema()
logDF.show()
// register the DataFrame as a temp table
logDF.registerTempTable("LogRow")
val logAdviseDF = sqlContext.sql("SELECT rowkey, content as content FROM LogRow ")
logAdviseDF.printSchema()
logAdviseDF.take(5).foreach(println)
} catch {
case e: Exception => e.printStackTrace()
} finally {
}
}
}
package com.test.spark.hbase
import kafka.serializer.{DefaultDecoder, StringDecoder}
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.hbase.HBaseConfiguration
import org.apache.hadoop.hbase.mapreduce.{TableInputFormat, TableOutputFormat}
import org.apache.hadoop.hbase.util.Bytes
import org.apache.hadoop.io.Text
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}
object RunReadLogByKafka extends Serializable {
def main(args: Array[String]): Unit = {
val broker = "192.168.13.111:9092"
val topic = "log"
#transient val sparkConf = new SparkConf().setAppName("RunReadLogByKafka")
#transient val streamingContext = new StreamingContext(sparkConf, Seconds(2))
#transient val sc = streamingContext.sparkContext
val kafkaConf = Map("metadata.broker.list" -> broker,
"group.id" -> "group1",
"zookeeper.connection.timeout.ms" -> "3000",
"kafka.auto.offset.reset" -> "smallest")
// Define which topics to read from
val topics = Set(topic)
val messages = KafkaUtils.createDirectStream[Array[Byte], String, DefaultDecoder, StringDecoder](
streamingContext, kafkaConf, topics).map(_._2)
messages.foreachRDD(rdd => {
val readLogByKafka =new ReadLogByKafka(sc)
//parse every message, it will throw NotSerializableException
rdd.foreach(readLogByKafka.func)
})
streamingContext.start()
streamingContext.awaitTermination()
}
}
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2021)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:889)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1.apply(RDD.scala:888)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:306)
at org.apache.spark.rdd.RDD.foreach(RDD.scala:888)
at com.test.spark.hbase.RunReadLogByKafka$$anonfun$main$1.apply(RunReadLogByKafka.scala:38)
at com.test.spark.hbase.RunReadLogByKafka$$anonfun$main$1.apply(RunReadLogByKafka.scala:35)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
at org.apache.spark.streaming.dstream.DStream$$anonfun$foreachRDD$1$$anonfun$apply$mcV$sp$3.apply(DStream.scala:631)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply$mcV$sp(ForEachDStream.scala:42)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1$$anonfun$apply$mcV$sp$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.DStream.createRDDWithLocalProperties(DStream.scala:399)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply$mcV$sp(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at org.apache.spark.streaming.dstream.ForEachDStream$$anonfun$1.apply(ForEachDStream.scala:40)
at scala.util.Try$.apply(Try.scala:161)
at org.apache.spark.streaming.scheduler.Job.run(Job.scala:34)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply$mcV$sp(JobScheduler.scala:207)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:207)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler$$anonfun$run$1.apply(JobScheduler.scala:207)
at scala.util.DynamicVariable.withValue(DynamicVariable.scala:57)
at org.apache.spark.streaming.scheduler.JobScheduler$JobHandler.run(JobScheduler.scala:206)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.io.NotSerializableException: org.apache.spark.SparkContext
Serialization stack:
- object not serializable (class: org.apache.spark.SparkContext, value: org.apache.spark.SparkContext#5207878f)
- field (class: com.test.spark.hbase.ReadLogByKafka, name: sct, type: class org.apache.spark.SparkContext)
- object (class com.test.spark.hbase.ReadLogByKafka, com.test.spark.hbase.ReadLogByKafka#60212100)
- field (class: com.test.spark.hbase.RunReadLogByKafka$$anonfun$main$1$$anonfun$apply$1, name: readLogByKafka$1, type: class com.test.spark.hbase.ReadLogByKafka)
- object (class com.test.spark.hbase.RunReadLogByKafka$$anonfun$main$1$$anonfun$apply$1, <function1>)
at org.apache.spark.serializer.SerializationDebugger$.improveException(SerializationDebugger.scala:40)
at org.apache.spark.serializer.JavaSerializationStream.writeObject(JavaSerializer.scala:47)
at org.apache.spark.serializer.JavaSerializerInstance.serialize(JavaSerializer.scala:84)
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:301)
... 30 more
Object you get as an argument in foreachRDD is a standard Spark RDD so you have to obey exactly the same rules as usual including no nested actions or transformations and no access to the SparkContext. It is not exactly clear what you try to achieve (It doesn't look like ReadLogByKafka.func is doing anything useful) but I am guess you're looking for some kind of join.

A not_so_familiar exception in Apache Spark/Scala

I encountered the following when I tried to do a spark submit:
Exception in thread "main" java.lang.NullPointerException
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:606)
at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:328)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Is this a known problem?
Thanks and regards,
The following program is able to reproduce the above error message:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
class anomaly_model(val inputfile: String, val clusterNum: Int, val maxIterations: Int, val epsilon: Double, val scenarioNum: Int, val outputfile: String){
val conf = new SparkConf().setAppName("Anomaly Model")
val sc = new SparkContext(conf)
val data = sc.textFile(inputfile)
def main(args: Array[String]) {
val inputfile = "sqlexpt.txt"
val clusterNum = 5
val maxIterations = 1000
val epsilon = 0.001
val scenarioNum = 10
val outputfile = "output.csv"
val am = new anomaly_model(inputfile, clusterNum, maxIterations, epsilon, scenarioNum, outputfile)
}
}
Probably it was my mistake to define the main method inside the class... I should have defined it inside a companion object instead (I come from a Java background!).
If you want to put main class in anomaly_model class, you may need to change "class" to "object". And yes, if you want to have anomaly_model a class, you may need to put main method in another object.
In anomaly_model class:
import org.apache.spark.SparkContext
import org.apache.spark.SparkConf
class anomaly_model(val inputfile: String, val clusterNum: Int, val maxIterations: Int, val epsilon: Double, val scenarioNum: Int, val outputfile: String){
val conf = new SparkConf().setAppName("Anomaly Model")
val sc = new SparkContext(conf)
val data = sc.textFile(inputfile)
}
And in Main.scala:
import anomaly_model
object Main {
def main(args: Array[String]) : Unit{
val inputfile = "sqlexpt.txt"
val clusterNum = 5
val maxIterations = 1000
val epsilon = 0.001
val scenarioNum = 10
val outputfile = "output.csv"
val am = new anomaly_model(inputfile, clusterNum, maxIterations, epsilon, scenarioNum, outputfile)
}
}
You have a null value in one of your variables, as the message says.
I had the same problem and had made the same mistake as you. anomaly_model should be an object and not a class