Spark Scala Exception in thread "main" java.lang.UnsupportedOperationException: empty collection - scala

Launching spark-submit I'm getting the following error:
Exception in thread "main" java.lang.UnsupportedOperationException: empty collection
Spark Code Class 1:
var vertices = sc.textFile("hdfs:///user/cloudera/dstlf.txt")
.flatMap { line => line.split("\\s+") }
.distinct()
vertices.map {
vertex => vertex.replace("-", "") + "\t" + vertex.saveAsTextFile("hdfs:///user/cloudera/metadata-lookup-tlf")
}
Spark Code Class 2:
val sc = new SparkContext(conf)
val graph = GraphLoader.edgeListFile(sc,"hdfs:///user/cloudera/metadata-processed-tlf")
// Run PageRank
val ranks = graph.pageRank(0.0001).vertices
Looks like for some reason it is not working in the RDD
The input file dstlf contains 2 columns with numbers:
94-92250 94-92174
94-92889 94-91869
94-94172 94-91682
...

Related

org.apache.spark.SparkException: This RDD lacks a SparkContext error

Complete error is:
org.apache.spark.SparkException: This RDD lacks a SparkContext. It
could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x =>
rdd2.values.count() * x) is invalid because the values transformation
and count action cannot be performed inside of the rdd1.map
transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the
streaming job is used in DStream operations. For more information, See
SPARK-13758.
but I think I didn't use nested rdd transform in my code.
how to solve it?
my scala code:
stream.foreachRDD { rdd => {
val nRDD = rdd.map(item => item.value())
val oldRDD = sc.textFile("hdfs://localhost:9011/recData/miniApp/mall")
val top = oldRDD.sortBy(item => {
val arr = item.split(' ')
arr(0)
}, ascending = false).take(200)
val topRDD = sc.makeRDD(top)
val unionRDD = topRDD.union(nRDD)
val validRDD = unionRDD.map(item => {
val arr = item.split(' ')
((arr(1), arr(2)), arr(3).toDouble)
})
.reduceByKey((f, s) => {
if (f > s) f else s
})
.distinct()
val ratings = validRDD.map(item => {
Rating(item._1._2.toInt, item._1._1.toInt, item._2)
})
val rank = 10
val numIterations = 5
val model = ALS.train(ratings, rank, numIterations, 0.01)
nRDD.map(item => {
val arr = item.split(' ')
arr(2)
}).toDS()
.distinct()
.foreach(item=>{
println("als recommending for user "+item)
val recommendRes = model.recommendProducts(item.toInt, 10)
for (elem <- recommendRes) {
println(elem)
}
})
nRDD.saveAsTextFile("hdfs://localhost:9011/recData/miniApp/mall")
}
}
The error is telling you that you're missing a SparkContext. I'm guessing that the program fails on this line:
val oldRDD = sc.textFile("hdfs://localhost:9011/recData/miniApp/mall")
The documentation provides an example of creating a SparkContext to use in this situation.
From the docs:
val stream: DStream[String] = ...
stream.foreachRDD { rdd =>
// Get the singleton instance of SparkSession
val spark = SparkSession.builder.config(rdd.sparkContext.getConf).getOrCreate()
import spark.implicits._
// Do things...
}
Although you're using RDDs instead of DataFrames, the same principles should apply.

UDF to extract String in scala

I'm trying to extract the last set number from this data type:
urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)
In this example I'm trying to extract 10342800535 as a string.
This is my code in scala,
def extractNestedUrn(urn: String): String = {
val arr = urn.split(":").map(_.trim)
val nested = arr(3)
val clean = nested.substring(1, nested.length -1)
val subarr = clean.split(":").map(_.trim)
val res = subarr(3)
val out = res.split(",").map(_.trim)
val fin = out(1)
fin.toString
}
This is run as an UDF and it throws the following error,
org.apache.spark.SparkException: Failed to execute user defined function
What am I doing wrong?
You can simply use regexp_extract function. Check this
val df = Seq(("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)")).toDF("x")
df.show(false)
+-------------------------------------------------------------------+
|x |
+-------------------------------------------------------------------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|
+-------------------------------------------------------------------+
df.withColumn("NestedUrn", regexp_extract(col("x"), """.*,(\d+)""", 1)).show(false)
+-------------------------------------------------------------------+-----------+
|x |NestedUrn |
+-------------------------------------------------------------------+-----------+
|urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)|10342800535|
+-------------------------------------------------------------------+-----------+
One reason that org.apache.spark.SparkException: Failed to execute user defined function exception are raised is when an exception is raised inside your user defined function.
Analysis
If I try to run your user defined function with the example input you provided, using the code below:
import org.apache.spark.sql.functions.{col, udf}
import sparkSession.implicits._
val dataframe = Seq("urn:fb:candidateHiringState:(urn:fb:contract:187236028,10342800535)").toDF("urn")
def extractNestedUrn(urn: String): String = {
val arr = urn.split(":").map(_.trim)
val nested = arr(3)
val clean = nested.substring(1, nested.length -1)
val subarr = clean.split(":").map(_.trim)
val res = subarr(3)
val out = res.split(",").map(_.trim)
val fin = out(1)
fin.toString
}
val extract_urn = udf(extractNestedUrn _)
dataframe.select(extract_urn(col("urn"))).show(false)
I get this complete stack trace:
Exception in thread "main" org.apache.spark.SparkException: Failed to execute user defined function(UdfExtractionError$$$Lambda$1165/1699756582: (string) => string)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1130)
at org.apache.spark.sql.catalyst.expressions.Alias.eval(namedExpressions.scala:156)
...
at UdfExtractionError$.main(UdfExtractionError.scala:37)
at UdfExtractionError.main(UdfExtractionError.scala)
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
at UdfExtractionError$.extractNestedUrn$1(UdfExtractionError.scala:29)
at UdfExtractionError$.$anonfun$main$4(UdfExtractionError.scala:35)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.$anonfun$f$2(ScalaUDF.scala:157)
at org.apache.spark.sql.catalyst.expressions.ScalaUDF.eval(ScalaUDF.scala:1127)
... 86 more
The important part of this stack trace is actually:
Caused by: java.lang.ArrayIndexOutOfBoundsException: 3
This is the exception raised when executing your user defined function code.if we analyse your function code, you split two times the input by :. The result of the first split is actually this array:
["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]
and not this array:
["urn", "fb", "candidateHiringState", "(urn:fb:contract:187236028,10342800535)"]
So, if we execute the remaining statements of your function, you get:
val arr = ["urn", "fb", "candidateHiringState", "(urn", "fb", "contract", "187236028,10342800535)"]
val nested = "(urn"
val clean = "urn"
val subarr = ["urn"]
As at the next line you call the fourth element of the array subarr that contains only one element, an ArrayOutOfBound exception is raised and then Spark returns a SparkException
Solution
Although the best solution to your problem is obviously the previous answer with regexp_extract, you can correct your user defined function as below:
def extractNestedUrn(urn: String): String = {
val arr = urn.split(':') // split using character instead of string regexp
val nested = arr.last // get last element of array, here "187236028,10342800535)"
val subarr = nested.split(',')
val res = subarr.last // get last element, here "10342800535)"
val out = res.init // take all the string except the last character, to remove ')'
out // no need to use .toString as out is already a String
}
However, as said before, the best solution is to use spark inner function regexp_extract as explained in first answer. Your code will be easier to understand and more performant

Spark Graphx : class not found error on EMR cluster

I am trying to process Hierarchical Data using Grapghx Pregel and the code I have works fine on my local.
But when I am running on my Amazon EMR cluster it is giving me an error:
java.lang.NoClassDefFoundError: Could not initialize class
What would be the reason of this happening? I know the class is there in the jar file as it run fine on my local as well there is no build error.
I have included GraphX dependency on pom file.
Here is a snippet of code where error is being thrown:
def calcTopLevelHierarcy (vertexDF: DataFrame, edgeDF: DataFrame): RDD[(Any, (Int, Any, String, Int, Int))] =
{
val verticesRDD = vertexDF.rdd
.map { x => (x.get(0), x.get(1), x.get(2)) }
.map { x => (MurmurHash3.stringHash(x._1.toString).toLong, (x._1.asInstanceOf[Any], x._2.asInstanceOf[Any], x._3.asInstanceOf[String])) }
//create the edge RD top down relationship
val EdgesRDD = edgeDF.rdd.map { x => (x.get(0), x.get(1)) }
.map { x => Edge(MurmurHash3.stringHash(x._1.toString).toLong, MurmurHash3.stringHash(x._2.toString).toLong, "topdown") }
// create the edge RD top down relationship
val graph = Graph(verticesRDD, EdgesRDD).cache()
//val pathSeperator = """/"""
//initialize id,level,root,path,iscyclic, isleaf
val initialMsg = (0L, 0, 0.asInstanceOf[Any], List("dummy"), 0, 1)
val initialGraph = graph.mapVertices((id, v) => (id, 0, v._2, List(v._3), 0, v._3, 1, v._1))
val hrchyRDD = initialGraph.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(setMsg, sendMsg, mergeMsg)
//build the path from the list
val hrchyOutRDD = hrchyRDD.vertices.map { case (id, v) => (v._8, (v._2, v._3, pathSeperator + v._4.reverse.mkString(pathSeperator), v._5, v._7)) }
hrchyOutRDD
}
I was able to narrow down the line that is causing an error:
val hrchyRDD = initialGraph.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(setMsg, sendMsg, mergeMsg)
I had this exact same issue happening to me, where I was able to run it on spark-shell failing when executed from spark-submit. Here’s an example of the code I was trying to execute (looks like it's the same as yours)
The error that pointed me to the right solution was:
org.apache.spark.SparkException: A master URL must be set in your configuration
In my case, I was getting that error because I had defined the SparkContext outside the main function:
object Test {
val sc = SparkContext.getOrCreate
val sqlContext = new SQLContext(sc)
def main(args: Array[String]) {
...
}
}
I was able to solve it by moving SparkContext and sqlContext inside the main function as described in this other post

NullPointerException applying a function to spark RDD that works on non-RDD

I have a function that I want to apply to a every row of a .csv file:
def convert(inString: Array[String]) : String = {
val country = inString(0)
val sellerId = inString(1)
val itemID = inString(2)
try{
val minidf = sqlContext.read.json( sc.makeRDD(inString(3):: Nil) )
.withColumn("country", lit(country))
.withColumn("seller_id", lit(sellerId))
.withColumn("item_id", lit(itemID))
val finalString = minidf.toJSON.collect().mkString(",")
finalString
} catch{
case e: Exception =>println("AN EXCEPTION "+inString.mkString(","))
("this is an exception "+e+" "+inString.mkString(","))
}
}
This function transforms an entry of the sort:
CA 112578240 132080411845 [{"id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
Where I have 4 columns, the 4th being a json blob, into
[{"country":"CA", "seller":112578240", "product":112578240, "id":"general_spam_policy","severity":"critical","timestamp":"2017-02-26T08:30:16Z"}]
which is the json object where the first 3 columns have been inserted into the fourth.
Now, this works:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).collect().map(x => convert(x))
or this:
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).take(10).map(x => convert(x))
but this does not
val conv_string = sc.textFile(path_to_file).map(_.split('\t')).map(x => convert(x))
The last one throw a java.lang.NullPointerException.
I included a try catch clause so see where exactly is this failing and it's failing for every single row.
What am I doing wrong here?
You cannot put sqlContext or sparkContext in a Spark map, since that object can only exist on the driver node. Essentially they are in charge of distributing your tasks.
You could rewite the JSON parsing bit using one of these libraries in pure scala: https://manuel.bernhardt.io/2015/11/06/a-quick-tour-of-json-libraries-in-scala/

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read input file and compare with set "123,200,300" if match found, gives matching data
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
def parseLine(invCol: String) : RDD[String] = {
println(s"INPUT, $invCol")
val inv_rdd = sc.parallelize(Seq(invCol.toString))
val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
return inv_rdd.intersection(bs_meta_rdd)
}
def main(args: Array[String]) {
val filePathName = "hdfs://xxx/tmp/input.csv"
val rawData = sc.textFile(filePathName)
val datad = rawData.map{r => parseLine(r)}
}
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int,
pName: String
) => {
val ids = Array("123","200","300")
var idsFound : String = ""
for (id <- ids){
if (pName.contains(id)){
idsFound = idsFound + id + ","
}
}
if (idsFound.length() > 0) {
idsFound = idsFound.substring(0,idsFound.length -1)
}
idsFound
})
Use UDF in withCoulmn()
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For simple answer, why we are making it so complex? In this case we don't require UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.