Failed to execute user defined function in Spark-Scala

Below is the UDF to convert a multivalued column into a map.
def convertToMapFn(c: String): Map[String, String] = {
  val str = Option(c).getOrElse(return Map[String, String]())
  val arr = str.split(",")
  val l = arr.toList
  val regexPattern = ".*(=).*".r
  s"$c".toString match {
    case regexPattern(a) =>
      l.map(x => x.split("="))
        .map(a => if (a.size == 2) a(0).toString -> a(1).toString else "ip_adr" -> a(0).toString)
        .toMap
    case "null" => Map[String, String]()
  }
}
val convertToMapUDF = udf(convertToMapFn _)
I am able to display the data, but while trying to insert it into a Delta table, I get the error below.
Caused by: org.apache.spark.SparkException: Job aborted due to stage failure: Task 9 in stage 97.0 failed 4 times, most recent failure: Lost task 9.3 in stage 97.0 (TID 2561, 10.73.244.39, executor 5): org.apache.spark.SparkException: Failed to execute user defined function($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$2326/1884779796: (string) => map<string,string>)
Caused by: scala.MatchError: a8:9f:e (of class java.lang.String)
at line396de0100d5344c9994f63f7de7884fe49.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.convertToMapFn
Caused by: org.apache.spark.SparkException: Failed to execute user defined function($read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$Lambda$2326/1884779796: (string) => map<string,string
Could someone please let me know how to fix this? Thank you.

You can see in the error message that you have a MatchError. This happens when you don't account for all possible match cases. A basic fix is to change case "null" => to case _ =>, which will match anything the regex doesn't.
Other matters:
s"$c".toString is equivalent to writing c in this case.
I think you mean to match on str and not c.
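Putting both fixes together, a minimal sketch of the corrected UDF (untested; it keeps the original "ip_adr" fallback and only changes what the remarks above point out):
import org.apache.spark.sql.functions.udf

def convertToMapFn(c: String): Map[String, String] = {
  val str = Option(c).getOrElse(return Map[String, String]())
  val regexPattern = ".*(=).*".r
  str match {                        // match on str, not on c
    case regexPattern(_) =>
      str.split(",").toList
        .map(_.split("="))
        .map(a => if (a.size == 2) a(0) -> a(1) else "ip_adr" -> a(0))
        .toMap
    case _ => Map[String, String]()  // catch-all instead of case "null"
  }
}
val convertToMapUDF = udf(convertToMapFn _)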

Related

Spark - Why do I get an NPE when write.mode(SaveMode.Overwrite) even if the dataframe allows other actions such as first or show?

I have a dataframe with 3 columns which has got a schema similar to this:
org.apache.spark.sql.types.StructType = StructType(StructField(UUID,StringType,true), StructField(NAME,StringType,true), StructField(DOCUMENT,ArrayType(MapType(StringType,StringType,true),true),true))
This could be a sample of a row in this dataframe:
org.apache.spark.sql.Row = [11223344,ALAN,28,WrappedArray(Map(source -> central, document_number -> 1234, first_seen -> 2018-05-01))]
I am generating a new column by applying a UDF over the last column of this dataframe, the one whose type is an Array of Map<String,String>.
This is the code I am applying:
def number_length(num: String): String = { if (num.length < 6) "000000" else num }

def validating_doc = udf((inputSeq: Seq[Map[String, String]]) => {
  inputSeq.map(x => Map(
    "source" -> x("source"),
    "document_number" -> number_length(x("document_number")),
    "first_seen" -> x("first_seen")))
})
val newDF = DF.withColumn("VALID_DOCUMENT", validating_doc($"DOCUMENT"))
After this everything works fine and I can perform some actions like show and first, which return:
org.apache.spark.sql.Row = [11223344,ALAN,28,WrappedArray(Map(source -> central, document_number -> 1234, first_seen -> 2018-05-01)),WrappedArray(Map(source -> central, document_number -> 000000, first_seen -> 2018-05-01))]
But if I try to write this DataFrame as Avro, like this:
newDF.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("hdfs:///data/mypath")
I get the following error:
WARN scheduler.TaskSetManager: Lost task 3.0 in stage 0.0 (TID 6, myserver.azure.com): org.apache.spark.SparkException: Task failed while writing rows.
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:272)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.sql.execution.datasources.InsertIntoHadoopFsRelation$$anonfun$run$1$$anonfun$apply$mcV$sp$3.apply(InsertIntoHadoopFsRelation.scala:150)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.NullPointerException
at $line101.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$validating_doc$1.apply(<console>:52)
at $line101.$read$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$anonfun$validating_doc$1.apply(<console>:51)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$SpecificUnsafeProjection.apply(Unknown Source)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:51)
at org.apache.spark.sql.execution.Project$$anonfun$1$$anonfun$apply$1.apply(basicOperators.scala:49)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at org.apache.spark.sql.execution.datasources.DefaultWriterContainer.writeRows(WriterContainer.scala:263)
But if I drop this new column, it is possible to write the dataframe.
What am I missing when writing the dataframe? Is the UDF changing something in the schema that I am not aware of?
Your code gives an NPE in the UDF call. The function you use is not null-safe; it will fail if:
inputSeq is null.
Any element of inputSeq is null.
document_number is null in any element of inputSeq.
It would also fail if any item was missing (although that is not the problem here). You have to include proper checks, starting with something like this (not tested):
def number_length(num: String): String = num match {
  case null => null
  case _ => if (num.length < 6) "000000" else num
}

def validating_doc = udf((inputSeq: Seq[Map[String, String]]) => inputSeq match {
  case null => null
  case xs => xs.map {
    case null => null
    case x => Map(
      "source" -> x("source"),
      "document_number" -> number_length(x("document_number")),
      "first_seen" -> x("first_seen")
    )
  }
})
Why do I get an NPE when write.mode(SaveMode.Overwrite) even if the dataframe allows other actions such as first or show?
Because both first and show evaluate only a subset of the data and clearly don't hit the problematic row.
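If you want to surface such row-level failures before writing, one option is to force a full evaluation first (a sketch, using the newDF defined above):
newDF.foreach(_ => ())   // touches every row, unlike show()/first(), so the NPE (if any) appears here
// or: newDF.rdd.count() // also materializes every row, including the UDF-generated column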

Scala - Spark : return vertex properties from particular node

I have a Graph and I want to compute the max degree. In particular, for the vertex with the max degree I want to know all its properties.
These are the snippets of code:
def max(a: (VertexId, Int), b: (VertexId, Int)): (VertexId, Int) = {
  if (a._2 > b._2) a else b
}
val maxDegrees: (VertexId, Int) = graphX.degrees.reduce(max)
max: (a: (org.apache.spark.graphx.VertexId, Int), b: (org.apache.spark.graphx.VertexId, Int))(org.apache.spark.graphx.VertexId, Int)
maxDegrees: (org.apache.spark.graphx.VertexId, Int) = (2063726182,56387)
val startVertexRDD = graphX.vertices.filter { case (hash_id, (id, state)) => hash_id == maxDegrees._1 }
startVertexRDD.collect()
But it returned this exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 145.0 failed 1 times, most recent failure: Lost task 0.0 in stage 145.0 (TID 5380, localhost, executor driver): scala.MatchError: (1009147972,null) (of class scala.Tuple2)
How can I fix it?
I think this is the problem. Here:
val startVertexRDD = graphX.vertices.filter { case (hash_id, (id, state)) => hash_id == maxDegrees._1 }
it tries to match a tuple like this
(2063726182,56387)
against a pattern expecting something like this:
(hash_id, (id, state))
raising a scala.MatchError because it is comparing a Tuple2 of (VertexId, Int) with a Tuple2 of (VertexId, Tuple2(id, state)).
Be careful with this as well:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 145.0 failed 1 times, most recent failure: Lost task 0.0 in stage 145.0 (TID 5380, localhost, executor driver): scala.MatchError: (1009147972,null) (of class scala.Tuple2)
Concretely here:
scala.MatchError: (1009147972,null)
No degree was calculated for vertex 1009147972 and its attribute is null, so the match can raise problems there as well.
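As an illustration, a null-tolerant version of the filter would look like this (a sketch, assuming the vertex attribute is an (id, state) tuple when it is present):
val startVertexRDD = graphX.vertices.filter {
  case (hash_id, attr) => attr != null && hash_id == maxDegrees._1
}
startVertexRDD.collect()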
Hope this helps.

org.apache.spark.SparkException: Task not serializable in Spark Scala

I am trying to get employeeId from employee_table and use this id to query the employee_address table to fetch the address.
There is nothing wrong with the tables. But when I run the code below, I get org.apache.spark.SparkException: Task not serializable.
I think I know the issue: the sparkContext lives on the master and not on the workers. But I don't know how to get my head around this.
val employeeRDDRdd = sc.cassandraTable("local_keyspace", "employee_table")

try {
  val data = employeeRDDRdd
    .map(row => {
      row.getStringOption("employeeID") match {
        case Some(s) if (s != null) && s.nonEmpty => s
        case None => ""
      }
    })

  // create a tuple of employee id and address, filtering out cases where an employee's address is empty
  val id = data
    .map(s => (s, getID(s)))
    .filter(tups => tups._2.nonEmpty)

  // printing out total size of rdd
  println(id.count())
} catch {
  case e: Exception => e.printStackTrace()
}

def getID(employeeID: String): String = {
  val addressRDD = sc.cassandraTable("local_keyspace", "employee_address")
  val data = addressRDD.map(row => row.getStringOption("address") match {
    case Some(s) if (s != null) && s.nonEmpty => s
    case None => ""
  })
  data.collect()(0)
}
Exception ==>
org.apache.spark.SparkException: Task not serializable
at org.apache.spark.util.ClosureCleaner$.ensureSerializable(ClosureCleaner.scala:298)
at org.apache.spark.util.ClosureCleaner$.org$apache$spark$util$ClosureCleaner$$clean(ClosureCleaner.scala:288)
at org.apache.spark.util.ClosureCleaner$.clean(ClosureCleaner.scala:108)
at org.apache.spark.SparkContext.clean(SparkContext.scala:2039)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:366)
at org.apache.spark.rdd.RDD$$anonfun$map$1.apply(RDD.scala:365)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:151)
at org.apache.spark.rdd.RDDOperationScope$.withScope(RDDOperationScope.scala:112)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:358)
at org.apache.spark.rdd.RDD.map(RDD.scala:365)
Serialization Error Caused by SparkContext Captured in Lambda
The serialization issue is caused by
val addressRDD = sc.cassandraTable("local_keyspace", "employee_address")
This portion is used inside of a serialized lambda here:
val id = data
  .map(s => (s, getID(s)))
All RDD transformations represent remotely executed code, which means their entire contents must be serializable.
The SparkContext is not serializable, but it is necessary for getID to work, so there is an exception. The basic rule is that you cannot touch the SparkContext within any RDD transformation.
If you are actually trying to join with data in Cassandra, you have a few options.
If you are just pulling rows based on the partition key:
https://github.com/datastax/spark-cassandra-connector/blob/master/doc/2_loading.md#using-joinwithcassandratable
If you are trying to join on some other field, load both RDDs separately and do a Spark join:
val leftRdd = sc.cassandraTable("test", "table1")
val rightRdd = sc.cassandraTable("test", "table2")
leftRdd.join(rightRdd)
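For the concrete tables in the question, a rough sketch of the second option (untested, and assuming employee_address also carries an employeeID column to join on):
import com.datastax.spark.connector._ // already implied by the question's use of sc.cassandraTable

// Load both tables as (employeeID, value) pair RDDs; no SparkContext is captured in any closure.
val employees = sc.cassandraTable("local_keyspace", "employee_table")
  .flatMap(_.getStringOption("employeeID"))
  .filter(_.nonEmpty)
  .map(id => (id, ()))

val addresses = sc.cassandraTable("local_keyspace", "employee_address")
  .flatMap(row => for {
    id   <- row.getStringOption("employeeID")
    addr <- row.getStringOption("address")
  } yield (id, addr))
  .filter { case (_, addr) => addr.nonEmpty }

// (employeeID, address) pairs for employees that have a non-empty address
val idWithAddress = employees.join(addresses).mapValues { case (_, addr) => addr }
println(idWithAddress.count())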

Error when extracting features (Spark)

I encountered some problems when I tried to extract features from raw data.
Here is my data:
25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
and here is my code:
val rawData = sc.textFile("data/myData.data")
val lines = rawData.map(_.split(","))
val categoriesMap = lines.map(fields => fields(1)).distinct.collect.zipWithIndex.toMap
Here is the error info:
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 1.0 failed 1 times, most recent failure: Lost task 1.0 in stage 1.0 (TID 3, localhost): java.lang.ArrayIndexOutOfBoundsException: 1
I want to extract the second column as the categorical feature, but it seems that it cannot read the column and leads to ArrayIndexOutOfBoundsException.
I tried many times but still cannot solve the problem.
val categoriesMap1 = lines.map(fields => fields(1)).distinct.collect.zipWithIndex.toMap
val labelpointRDD = lines.map { fields =>
  val categoryFeaturesArray1 = Array.ofDim[Double](categoriesMap1.size)
  val categoryIdx1 = categoriesMap1(fields(1))
  categoryFeaturesArray1(categoryIdx1) = 1
}
Your code works for the example you supplied - which means it's fine for "valid" rows - but your input probably contains some invalid rows - in this case, rows with no commas.
You can either clean your data or improve the code to handle these rows more gracefully, for example using some default value for bad rows:
val rawData = sc.parallelize(Seq(
  "25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0",
  "BAD LINE"
))
val lines = rawData.map(_.split(","))
val categoriesMap = lines.map {
  case Array(_, s, _*) => s // for arrays with 2 or more items - use the 2nd
  case _ => "UNKNOWN"       // default
}.distinct().collect().zipWithIndex.toMap
println(categoriesMap) // prints Map(UNKNOWN -> 0, Private -> 1)
UPDATE: per updated question - assuming these rows are indeed invalid, you can just skip them entirely, both when extracting the categories map and when mapping to labeled points:
val secondColumn: RDD[String] = lines.collect {
  case Array(_, s, _*) => s // for arrays with 2 or more items - use the 2nd
  // shorter arrays (bad records) simply don't match and are skipped
}
val categoriesMap = secondColumn.distinct().collect().zipWithIndex.toMap
val labelpointRDD = secondColumn.map { field =>
  val categoryFeaturesArray1 = Array.ofDim[Double](categoriesMap.size)
  val categoryIdx1 = categoriesMap(field)
  categoryFeaturesArray1(categoryIdx1) = 1
  categoryFeaturesArray1
}
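If the goal is an actual RDD of MLlib LabeledPoints (as the name labelpointRDD suggests), a hypothetical follow-up could look like this, assuming the last field of each valid row is the label, as in the sample row above:
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.mllib.regression.LabeledPoint

val labeledPoints = lines.collect {
  // same guard as before: rows with fewer than two fields are skipped
  case fields if fields.length >= 2 =>
    val oneHot = Array.ofDim[Double](categoriesMap.size)
    oneHot(categoriesMap(fields(1))) = 1.0
    LabeledPoint(fields.last.trim.toDouble, Vectors.dense(oneHot))
}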

Spark Streaming stateful transformation mapWithState function getting error java.util.NoSuchElementException: None.get

I wanted to replace my updateStateByKey function with mapWithState function (Spark 1.6) to improve performance of my program.
I was following these two documents:
https://databricks.com/blog/2016/02/01/faster-stateful-stream-processing-in-spark-streaming.html
https://docs.cloud.databricks.com/docs/spark/1.6/index.html#examples/Streaming%20mapWithState.html
but I am getting the error scala.MatchError: [Ljava.lang.Object;
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 71.0 failed 4 times, most recent failure: Lost task 0.3 in stage 71.0 (TID 88, ttsv-lab-vmdb-01.englab.juniper.net): scala.MatchError: [Ljava.lang.Object;@eaf8bc8 (of class [Ljava.lang.Object;)
at HbaseCovrageStream$$anonfun$HbaseCovrageStream$$tracketStateFunc$1$3.apply(HbaseCoverageStream_mapwithstate.scala:84)
at HbaseCovrageStream$$anonfun$HbaseCovrageStream$$tracketStateFunc$1$3.apply(HbaseCoverageStream_mapwithstate.scala:84)
at scala.Option.flatMap(Option.scala:170)
at HbaseCovrageStream$.HbaseCovrageStream$$tracketStateFunc$1(HbaseCoverageStream_mapwithstate.scala:84)
Reference code:
def trackStateFunc(key: String, value: Option[Array[Long]], current: State[Seq[Array[Long]]]): Option[Array[Long]] = {
  /* adding current state to the previous state */
  val res = value.map(x => x +: current.getOption().get).orElse(current.getOption())
  current.update(res.get)
  res.flatMap {
    case as: Seq[Array[Long]] => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption // throws match error
  }
}
val statespec:StateSpec[String, Array[Long], Array[Long], Option[Array[Long]]] = StateSpec.function(trackStateFunc _)
val state: MapWithStateDStream[String, Array[Long], Array[Long], Option[Array[Long]]] = parsedStream.mapWithState(statespec)
My previous working code which was using updateStateByKey function:
val state: DStream[(String, Array[Long])] = parsedStream.updateStateByKey(
  (current: Seq[Array[Long]], prev: Option[Array[Long]]) => {
    prev.map(_ +: current).orElse(Some(current))
      .flatMap(as => Try(as.map(BDV(_)).reduce(_ + _).toArray).toOption)
  })
Your problem might be with the case when the value is absent: you'll wrap the state in Some and then you should match it. Or you can use state.getOption (check once again the example in the link you've attached).
Thanks Igor. I changed my trackStateFunc and it's working now.
For reference my working code with mapWithState:
def trackStateFunc(batchTime: Time, key: String, value: Option[Array[Long]], state: State[Array[Long]])
  : Option[(String, Array[Long])] = {
  // Check if state exists
  if (state.exists) {
    val newState: Array[Long] = Array(state.get, value.get).transpose.map(_.sum)
    state.update(newState) // Set the new state
    Some((key, newState))
  } else {
    val initialState = value.get
    state.update(initialState) // Set the initial state
    Some((key, initialState))
  }
}
// StateSpec[KeyType, ValueType, StateType, MappedType]
val stateSpec: StateSpec[String, Array[Long], Array[Long], (String, Array[Long])] = StateSpec.function(trackStateFunc _)
val state: MapWithStateDStream[String, Array[Long], Array[Long], (String, Array[Long])] = parsedStream.mapWithState(stateSpec)
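One operational note not from the original thread: stateful DStream transformations such as mapWithState require checkpointing to be enabled on the StreamingContext, so something along these lines is assumed before starting the stream (the path is hypothetical, and ssc stands for the StreamingContext used to build parsedStream):
ssc.checkpoint("/tmp/spark-checkpoint") // required for stateful operations like mapWithState
state.print()                           // e.g. inspect the (key, summed array) output of each batch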