Code working in Spark-Shell not in eclipse - scala

I have a small piece of Scala code that works properly in spark-shell but not in Eclipse with the Scala plugin. I can access HDFS from the plugin: I tried writing another file and it worked.
FirstSpark.scala
package bigdata.spark
import org.apache.spark.SparkConf
import java.io._
import org.apache.spark.SparkContext
import org.apache.spark.SparkContext._
object FirstSpark {
  def main(args: Array[String]) = {
    val conf = new SparkConf().setMaster("local").setAppName("FirstSparkProgram")
    val sparkcontext = new SparkContext(conf)
    val textFile = sparkcontext.textFile("hdfs://pranay:8020/spark/linkage")
    val m = new Methods()
    val q = textFile.filter(x => !m.isHeader(x)).map(x => m.parse(x))
    q.saveAsTextFile("hdfs://pranay:8020/output")
  }
}
Methods.scala
package bigdata.spark
import java.util.function.ToDoubleFunction
class Methods {
  def isHeader(s: String): Boolean = {
    s.contains("id_1")
  }
  def parse(line: String) = {
    val pieces = line.split(',')
    val id1 = pieces(0).toInt
    val id2 = pieces(1).toInt
    val matches = pieces(11).toBoolean
    val mapArray = pieces.slice(2, 11).map(toDouble)
    MatchData(id1, id2, mapArray, matches)
  }
  def toDouble(s: String) = {
    if ("?".equals(s)) Double.NaN else s.toDouble
  }
}
case class MatchData(id1: Int, id2: Int,
  scores: Array[Double], matched: Boolean)
Error Message:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:304)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:294)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:122)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2032)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:335)
at org.apache.spark.rdd.RDD$$anonfun$filter$1.apply(RDD.scala:334)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:147)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:108)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:310)
Can anyone please help me with this?

Try changing class Methods { .. } to object Methods { .. } (and call Methods.isHeader and Methods.parse directly instead of creating m).
I think the problem is at val q =textFile.filter(x => !m.isHeader(x)).map(x=> m.parse(x)). When Spark sees the filter and map calls, it tries to serialize the functions passed to them (x => !m.isHeader(x) and x=> m.parse(x)) so that it can ship them to the executors (this is the Task referred to in the error). To do that it must also serialize m, because m is referenced inside the functions (it is in the closure of both anonymous functions), and it cannot, since Methods is not serializable.
You could add extends Serializable to the Methods class, but in this case an object is more appropriate: the functions reach an object through a static reference, so no instance has to be serialized along with them.
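The capture described above can be reproduced without Spark. The following is a minimal sketch that uses plain Java serialization as a stand-in for Spark's closure check; HeaderCheckClass, HeaderCheckObject and ClosureCaptureDemo are hypothetical names, not part of the question's code. It assumes the code is compiled as an ordinary program (Scala 2.12 or later) rather than pasted into a REPL, where wrapper classes can change what a closure captures.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Plain class: its instances are not Serializable, like `class Methods` in the question.
class HeaderCheckClass {
  def isHeader(s: String): Boolean = s.contains("id_1")
}

// Singleton object: closures reach it through a static reference, so nothing is captured.
object HeaderCheckObject {
  def isHeader(s: String): Boolean = s.contains("id_1")
}

object ClosureCaptureDemo {
  // Returns true if Java serialization accepts the value.
  def canSerialize(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(value)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  // The local `m` is captured by the closure and has to be serialized with it.
  def viaInstance(): String => Boolean = {
    val m = new HeaderCheckClass
    line => !m.isHeader(line)
  }

  // No instance is captured here.
  def viaObject(): String => Boolean =
    line => !HeaderCheckObject.isHeader(line)

  def main(args: Array[String]): Unit = {
    println(s"closure over class instance serializable: ${canSerialize(viaInstance())}")
    println(s"closure over object serializable: ${canSerialize(viaObject())}")
  }
}
```

Only the closure that captures an instance of the plain class is rejected, which is the same situation as m in the filter and map calls.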

Related

Scala Reflection exception during creation of DataSet in Spark

I want to run Spark Job on Spark Jobserver.
During execution, I got an exception:
stack:
java.lang.RuntimeException: scala.ScalaReflectionException: class com.some.example.instrument.data.SQLMapping in JavaMirror with org.apache.spark.util.MutableURLClassLoader@55b699ef of type class org.apache.spark.util.MutableURLClassLoader with classpath [file:/app/spark-job-server.jar] and parent being sun.misc.Launcher$AppClassLoader@2e817b38 of type class sun.misc.Launcher$AppClassLoader with classpath [.../classpath jars/] not found.
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:123)
at scala.reflect.internal.Mirrors$RootsBase.staticClass(Mirrors.scala:22)
at com.some.example.instrument.DataRetriever$$anonfun$combineMappings$1$$typecreator15$1.apply(DataRetriever.scala:136)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe$lzycompute(TypeTags.scala:232)
at scala.reflect.api.TypeTags$WeakTypeTagImpl.tpe(TypeTags.scala:232)
at org.apache.spark.sql.catalyst.encoders.ExpressionEncoder$.apply(ExpressionEncoder.scala:49)
at org.apache.spark.sql.Encoders$.product(Encoders.scala:275)
at org.apache.spark.sql.LowPrioritySQLImplicits$class.newProductEncoder(SQLImplicits.scala:233)
at org.apache.spark.sql.SQLImplicits.newProductEncoder(SQLImplicits.scala:33)
at com.some.example.instrument.DataRetriever$$anonfun$combineMappings$1.apply(DataRetriever.scala:136)
at com.some.example.instrument.DataRetriever$$anonfun$combineMappings$1.apply(DataRetriever.scala:135)
at scala.util.Success$$anonfun$map$1.apply(Try.scala:237)
at scala.util.Try$.apply(Try.scala:192)
at scala.util.Success.map(Try.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.Future$$anonfun$map$1.apply(Future.scala:237)
at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
at scala.concurrent.impl.ExecutionContextImpl$AdaptedForkJoinTask.exec(ExecutionContextImpl.scala:121)
at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
In DataRetriever I convert a simple case class to a Dataset.
case class definition:
case class SQLMapping(id: String,
it: InstrumentPrivateKey,
cc: Option[String],
ri: Option[SourceInstrumentId],
p: Option[SourceInstrumentId],
m: Option[SourceInstrumentId])
case class SourceInstrumentId(instrumentId: Long,
providerId: String)
case class InstrumentPrivateKey(instrumentId: Long,
providerId: String,
clientId: String)
code that causes a problem:
import session.implicits._
def someFunc(future: Future[ID]): Dataset[SQLMapping] = {
future.map {f =>
val seq: Seq[SQLMapping] = getFromEndpoint(f)
val ds: Dataset[SQLMapping] = seq.toDS()
...
}
}
The job sometimes works, but when I re-run it, it throws the exception.
Update 28.03.2018
I forgot to mention one detail that turns out to be important: the Dataset was constructed inside a Future.
Calling toDS() inside the Future causes the ScalaReflectionException.
I decided to construct the Dataset outside future.map.
You can verify that a Dataset can't be constructed inside future.map with this example job.
package com.example.sparkapplications
import com.typesafe.config.Config
import org.apache.spark.SparkContext
import org.apache.spark.sql.SparkSession
import scala.concurrent.Await
import scala.concurrent.Future
import scala.concurrent.duration._
import scala.concurrent.ExecutionContext.Implicits.global
import spark.jobserver.SparkJob
import spark.jobserver.SparkJobValid
import spark.jobserver.SparkJobValidation
object FutureJob extends SparkJob{
override def runJob(sc: SparkContext,
jobConfig: Config): Any = {
val session = SparkSession.builder().config(sc.getConf).getOrCreate()
import session.implicits._
val f = Future{
val seq = Seq(
Dummy("1", 1),
Dummy("2", 2),
Dummy("3", 3),
Dummy("4", 4),
Dummy("5", 5)
)
val ds = seq.toDS
ds.collect()
}
Await.result(f, 10.seconds)
}
case class Dummy(id: String, value: Long)
override def validate(sc: SparkContext,
config: Config): SparkJobValidation = SparkJobValid
}
Later I will report whether the problem persists with Spark 2.3.0 and when the jar is passed directly via spark-submit.

java.lang.NoSuchMethodException: <Class>.<init>(java.lang.String) when copying custom Transformer

I am currently playing with custom transformers in my spark-shell, using both Spark 2.0.1 and 2.2.1.
While writing a custom ML transformer to add to a pipeline, I noticed an issue with overriding the copy method.
In my case, the copy method is called by the fit method of TrainValidationSplit.
The error I get:
java.lang.NoSuchMethodException: Custom.<init>(java.lang.String)
at java.lang.Class.getConstructor0(Class.java:3082)
at java.lang.Class.getConstructor(Class.java:1825)
at org.apache.spark.ml.param.Params$class.defaultCopy(params.scala:718)
at org.apache.spark.ml.PipelineStage.defaultCopy(Pipeline.scala:42)
at Custom.copy(<console>:16)
... 48 elided
I then tried to directly call the copy method but I still get the same error.
Here is my class and the call I perform:
import org.apache.spark.ml.Transformer
import org.apache.spark.sql.{Dataset, DataFrame}
import org.apache.spark.sql.types.{StructField, StructType, DataTypes}
import org.apache.spark.ml.param.{Param, ParamMap}
// Simple DF
val doubles = Seq((0, 5d, 100d), (1, 4d,500d), (2, 9d,700d)).toDF("id", "rating","views")
class Custom(override val uid: String) extends org.apache.spark.ml.Transformer {
def this() = this(org.apache.spark.ml.util.Identifiable.randomUID("custom"))
def copy(extra: org.apache.spark.ml.param.ParamMap): Custom = {
defaultCopy(extra)
}
override def transformSchema(schema: org.apache.spark.sql.types.StructType): org.apache.spark.sql.types.StructType = {
schema.add(org.apache.spark.sql.types.StructField("trending", org.apache.spark.sql.types.IntegerType, false))
}
def transform(df: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame = {
df.withColumn("trending", (df.col("rating") > 4 && df.col("views") > 40))
}
}
val mycustom = new Custom("Custom")
// This call throws the exception.
mycustom.copy(new org.apache.spark.ml.param.ParamMap())
Does anyone know if this is a known issue? I can't seem to find it anywhere.
Is there another way to implement the copy method in a custom transformer?
Thanks
These are a couple of things that I would change about your custom Transformer (also to enable SerDe operations of your PipelineModel):
Implement the DefaultParamsWritable trait
Add a Companion object that extends the DefaultParamsReadable Interface
e.g.
class Custom(override val uid: String) extends Transformer
with DefaultParamsWritable {
...
...
}
object Custom extends DefaultParamsReadable[Custom]
Do take a look at UnaryTransformer if you have only one input column and one output column.
Finally, why exactly do you need to call mycustom.copy(new ParamMap())?
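The stack trace in the question shows defaultCopy failing inside Class.getConstructor while looking for a constructor that takes a single String. One possible explanation, which is an assumption and not confirmed anywhere in this thread, is that a class defined in spark-shell is compiled as a member of a REPL wrapper, so its JVM constructor takes an extra parameter for the enclosing instance. The Spark-free sketch below reproduces that effect with hypothetical TopLevel, Outer and Inner classes.

```scala
// Hypothetical stand-in for a transformer compiled as a normal top-level class.
class TopLevel(val uid: String)

// Hypothetical stand-in for a REPL wrapper that encloses user-defined classes.
class Outer {
  val prefix: String = "outer-"

  // Inner uses a member of Outer, so its JVM constructor also takes the Outer instance.
  class Inner(val uid: String) {
    def label: String = prefix + uid
  }
}

object ConstructorLookupDemo {
  // Same kind of lookup as in the stack trace: a public constructor taking exactly one String.
  def hasStringConstructor(cls: Class[_]): Boolean =
    try {
      cls.getConstructor(classOf[String])
      true
    } catch {
      case _: NoSuchMethodException => false
    }

  def main(args: Array[String]): Unit = {
    println(s"TopLevel has (String) constructor: ${hasStringConstructor(classOf[TopLevel])}")
    println(s"Outer#Inner has (String) constructor: ${hasStringConstructor(classOf[Outer#Inner])}")
  }
}
```

If this is indeed the cause, compiling the transformer into a jar as a top-level class, or writing copy without the reflective lookup, would be the directions to try.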

State management not serializable

In my application, I want to keep track of multiple states. Thus I tried to encapsulate the whole state management logic within a class StateManager as follows:
@SerialVersionUID(xxxxxxxL)
class StateManager(
inputStream: DStream[(String, String)],
initialState: RDD[(String, String)]
) extends Serializable {
lazy val state = inputStream.mapWithState(stateSpec).map(_.get)
lazy val stateSpec = StateSpec
.function(trackStateFunc _)
.initialState(initialState)
.timeout(Seconds(30))
def trackStateFunc(key: String, value: Option[String], state: State[String]): Option[(String, String)] = {}
}
object StateManager { def apply(dstream: DStream[(String, String)], initialstate: RDD[(String, String)]) = new StateManager(dstream, initialstate) }
The @SerialVersionUID(xxxxxxxL) ... extends Serializable is an attempt to solve my problem.
But when calling StateManager from my main class like the following:
val lStreamingContext = StreamingEnvironment(streamingWindow, checkpointDirectory)
val statemanager= StateManager(lStreamingEnvironment.sparkContext, 1, None)
val state= statemanager.state(lKafkaStream)
state.foreachRDD(_.foreach(println))
(see below for StreamingEnvironment), I get:
Exception in thread "main" org.apache.spark.SparkException: Task not serializable
[...]
Caused by: java.io.NotSerializableException: Object of org.apache.spark.streaming.kafka.DirectKafkaInputDStream is being serialized possibly as a part of closure of an RDD operation. This is because the DStream object is being referred to from within the closure. Please rewrite the RDD operation inside this DStream to avoid this. This has been enforced to avoid bloating of Spark tasks with unnecessary objects.
The error is clear, but I still don't understand at what point it is triggered.
Where does it trigger?
What could I do to solve this and have a reusable class?
The might-be-useful StreamingEnvironment class:
class StreamingEnvironment(mySparkConf: SparkConf, myKafkaConf: KafkaConf, myStreamingWindow: Duration, myCheckPointDirectory: String) {
val sparkContext = spark.SparkContext.getOrCreate(mySparkConf)
lazy val streamingContext = new StreamingContext(sparkContext , mMicrobatchPeriod)
streamingContext.checkpoint(mCheckPointDirectory)
streamingContext.remember(Minutes(1))
def stream() = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](streamingContext, myKafkaConf.mBrokers, myKafkaConf.mTopics)
}
object StreamingEnvironment {
def apply(streamingWindow: Duration, checkpointDirectory: String) = {
//setup sparkConf and kafkaConf
new StreamingEnvironment(sparkConf , kafkaConf, streamingWindow, checkpointDirectory)
}
}
When we lift a method into a function, the reference to the enclosing instance becomes part of that function, as in function(trackStateFunc _).
Declaring trackStateFunc directly as a function (i.e. as a val) will probably take care of the problem.
Also note that marking a class Serializable does not magically make it so. DStream is not serializable, so the field holding it should be annotated as @transient, which will probably solve the issue as well.
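The difference between a lifted method and a function value can be sketched without Spark. In the example below, FakeStream is a hypothetical stand-in for the non-serializable DStream, Tracker for StateManager, and plain Java serialization for Spark's check; none of these names come from the question.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for DStream, which is not serializable.
class FakeStream

class Tracker(val input: FakeStream) extends Serializable {
  def trackMethod(key: String): Int = key.length

  // Turning the method into a function keeps a reference to `this`,
  // and with it the non-serializable `input` field.
  def lifted: String => Int = key => trackMethod(key)

  // A function value that uses no member of the class does not reference `this`.
  val trackFunction: String => Int = key => key.length
}

object LiftedMethodDemo {
  // Returns true if Java serialization accepts the value.
  def canSerialize(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(value)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    val tracker = new Tracker(new FakeStream)
    println(s"lifted method serializable: ${canSerialize(tracker.lifted)}")
    println(s"function value serializable: ${canSerialize(tracker.trackFunction)}")
  }
}
```

The function value only stays free of the enclosing instance as long as its body does not touch other members of the class.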

Spark: Task not serializable (Broadcast/RDD/SparkContext)

There are numerous questions about Task is not serializable in Spark. However, this case seems quite particular.
I have created a class:
class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
val allEs: RDD[(String, E)] = e.map(e => (e.w, e))
.persist()
val sc = allEs.sparkContext
val centroids = sc.broadcast(m.clusterCenters)
[...]
The class defines the following method:
private def centroidDistances(v: Vector): Array[Double] = {
centroids.value.map(c => (centroids.value.indexOf(c), Vectors.sqdist(v, c)))
.sortBy(_._1)
.map(_._2)
}
However, when the class is called, a Task is not serializable exception is thrown.
Strangely enough, a tiny change in the header of class Neighbours suffices to fix the issue. Instead of creating a val sc: SparkContext to use for broadcasting, I merely inline the code that obtains the Spark context:
class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
val allEs: RDD[(String, E)] = e.map(e => (e.w, e))
.setName("embeddings")
.persist()
val centroids = allEs.sparkContext.broadcast(m.clusterCenters)
[...]
My question is: how does the second variant make a difference? What goes wrong in the first one? Intuitively, this should be merely syntactic sugar. Is this a bug in Spark?
I use Spark 1.4.1 on a Hadoop/Yarn cluster.
When you define
class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
...
val sc = allEs.sparkContext
val centroids = sc.broadcast(m.clusterCenters)
...
}
You have made sc a member of the class, meaning it can be accessed from an instance of Neighbours, e.g. neighbours.sc. This means that sc needs to be serializable, which it is not.
When you inline the code, only the final value of centroids needs to be serializable. centroids is of type Broadcast which is Serializable.
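The effect of the extra val can be shown with a small Spark-free sketch. FakeContext is a hypothetical stand-in for SparkContext, its broadcast method simply returns its argument, and Java serialization stands in for Spark's check; the two variant classes mirror the two versions of Neighbours only in shape.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for SparkContext: not serializable.
class FakeContext {
  def broadcast(values: Vector[Double]): Vector[Double] = values
}

// `sc` is a val in the class body, so it becomes a field and is serialized with the instance.
class FieldVariant(context: FakeContext) extends Serializable {
  val sc: FakeContext = context
  val centroids: Vector[Double] = sc.broadcast(Vector(1.0, 2.0))
}

// The context is only used while the constructor runs, so no field refers to it.
class InlineVariant(context: FakeContext) extends Serializable {
  val centroids: Vector[Double] = context.broadcast(Vector(1.0, 2.0))
}

object FieldCaptureDemo {
  // Returns true if Java serialization accepts the value.
  def canSerialize(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(value)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(s"field variant serializable: ${canSerialize(new FieldVariant(new FakeContext))}")
    println(s"inline variant serializable: ${canSerialize(new InlineVariant(new FakeContext))}")
  }
}
```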

Enriching SparkContext without incurring serialization issues

I am trying to use Spark to process data that comes from HBase tables. This blog post gives an example of how to use NewHadoopAPI to read data from any Hadoop InputFormat.
What I have done
Since I will need to do this many times, I was trying to use implicits to enrich SparkContext, so that I can get an RDD from a given set of columns in HBase. I have written the following helper:
trait HBaseReadSupport {
implicit def toHBaseSC(sc: SparkContext) = new HBaseSC(sc)
implicit def bytes2string(bytes: Array[Byte]) = new String(bytes)
}
final class HBaseSC(sc: SparkContext) extends Serializable {
def extract[A](data: Map[String, List[String]], result: Result, interpret: Array[Byte] => A) =
data map { case (cf, columns) =>
val content = columns map { column =>
val cell = result.getColumnLatestCell(cf.getBytes, column.getBytes)
column -> interpret(CellUtil.cloneValue(cell))
} toMap
cf -> content
}
def makeConf(table: String) = {
val conf = HBaseConfiguration.create()
conf.setBoolean("hbase.cluster.distributed", true)
conf.setInt("hbase.client.scanner.caching", 10000)
conf.set(TableInputFormat.INPUT_TABLE, table)
conf
}
def hbase[A](table: String, data: Map[String, List[String]])
(interpret: Array[Byte] => A) =
sc.newAPIHadoopRDD(makeConf(table), classOf[TableInputFormat],
classOf[ImmutableBytesWritable], classOf[Result]) map { case (key, row) =>
Bytes.toString(key.get) -> extract(data, row, interpret)
}
}
It can be used like
val rdd = sc.hbase[String](table, Map(
"cf" -> List("col1", "col2")
))
In this case we get an RDD of (String, Map[String, Map[String, String]]), where the first component is the row key and the second is a map whose keys are column families and whose values are maps from column names to cell values.
Where it fails
Unfortunately, it seems that my job gets a reference to sc, which is itself not serializable by design. What I get when I run the job is
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
I can remove the helper classes and use the same logic inline in my job and everything runs fine. But I want to get something which I can reuse instead of writing the same boilerplate over and over.
By the way, the issue is not specific to implicit, even using a function of sc exhibits the same problem.
For comparison, the following helper to read TSV files (I know it's broken as it does not support quoting and so on, never mind) seems to work fine:
trait TsvReadSupport {
implicit def toTsvRDD(sc: SparkContext) = new TsvRDD(sc)
}
final class TsvRDD(val sc: SparkContext) extends Serializable {
def tsv(path: String, fields: Seq[String], separator: Char = '\t') = sc.textFile(path) map { line =>
val contents = line.split(separator).toList
(fields, contents).zipped.toMap
}
}
How can I encapsulate the logic to read rows from HBase without unintentionally capturing the SparkContext?
Just add the @transient annotation to the sc variable:
final class HBaseSC(@transient val sc: SparkContext) extends Serializable {
...
}
and make sure sc is not used within the extract function, since it won't be available on the workers.
If it is necessary to access the Spark context from within a distributed computation, the rdd.context function might be used:
val rdd = sc.newAPIHadoopRDD(...)
rdd map {
case (k, v) =>
val ctx = rdd.context
....
}
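What @transient changes can be checked without Spark. In this minimal sketch, StubContext is a hypothetical stand-in for SparkContext and the two helper classes stand in for HBaseSC with and without the annotation; Java serialization is used in place of Spark's own check.

```scala
import java.io.{ByteArrayOutputStream, NotSerializableException, ObjectOutputStream}

// Hypothetical stand-in for SparkContext: not serializable.
class StubContext

// The context is an ordinary field, so serializing the helper tries to serialize it too.
class PlainHelper(val sc: StubContext) extends Serializable

// The @transient field is skipped by Java serialization.
class TransientHelper(@transient val sc: StubContext) extends Serializable

object TransientFieldDemo {
  // Returns true if Java serialization accepts the value.
  def canSerialize(value: AnyRef): Boolean =
    try {
      val out = new ObjectOutputStream(new ByteArrayOutputStream())
      out.writeObject(value)
      out.close()
      true
    } catch {
      case _: NotSerializableException => false
    }

  def main(args: Array[String]): Unit = {
    println(s"plain helper serializable: ${canSerialize(new PlainHelper(new StubContext))}")
    println(s"transient helper serializable: ${canSerialize(new TransientHelper(new StubContext))}")
  }
}
```

A skipped field is not restored on the receiving side and comes back as null after deserialization, which is why sc must not be used in code that runs on the workers.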