Spark GraphX: class not found error on EMR cluster - Scala

I am trying to process hierarchical data using GraphX Pregel, and the code I have works fine locally.
But when I run it on my Amazon EMR cluster, it gives me an error:
java.lang.NoClassDefFoundError: Could not initialize class
What could be the reason for this? I know the class is in the jar file, as it runs fine locally and there is no build error.
I have included the GraphX dependency in my pom file.
Here is a snippet of the code where the error is thrown:
def calcTopLevelHierarcy(vertexDF: DataFrame, edgeDF: DataFrame): RDD[(Any, (Int, Any, String, Int, Int))] = {
  val verticesRDD = vertexDF.rdd
    .map { x => (x.get(0), x.get(1), x.get(2)) }
    .map { x => (MurmurHash3.stringHash(x._1.toString).toLong, (x._1.asInstanceOf[Any], x._2.asInstanceOf[Any], x._3.asInstanceOf[String])) }

  // create the edge RDD (top-down relationship)
  val EdgesRDD = edgeDF.rdd.map { x => (x.get(0), x.get(1)) }
    .map { x => Edge(MurmurHash3.stringHash(x._1.toString).toLong, MurmurHash3.stringHash(x._2.toString).toLong, "topdown") }

  // build the graph from the vertex and edge RDDs
  val graph = Graph(verticesRDD, EdgesRDD).cache()

  //val pathSeperator = """/"""
  // initialize id, level, root, path, iscyclic, isleaf
  val initialMsg = (0L, 0, 0.asInstanceOf[Any], List("dummy"), 0, 1)
  val initialGraph = graph.mapVertices((id, v) => (id, 0, v._2, List(v._3), 0, v._3, 1, v._1))
  val hrchyRDD = initialGraph.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(setMsg, sendMsg, mergeMsg)

  // build the path from the list
  val hrchyOutRDD = hrchyRDD.vertices.map { case (id, v) => (v._8, (v._2, v._3, pathSeperator + v._4.reverse.mkString(pathSeperator), v._5, v._7)) }
  hrchyOutRDD
}
I was able to narrow down the line that is causing the error:
val hrchyRDD = initialGraph.pregel(initialMsg, Int.MaxValue, EdgeDirection.Out)(setMsg, sendMsg, mergeMsg)

I had this exact same issue: I was able to run the code in spark-shell, but it failed when executed from spark-submit. Here's an example of the code I was trying to execute (it looks like it's the same as yours).
The error that pointed me to the right solution was:
org.apache.spark.SparkException: A master URL must be set in your configuration
In my case, I was getting that error because I had defined the SparkContext outside the main function:
object Test {
  val sc = SparkContext.getOrCreate
  val sqlContext = new SQLContext(sc)

  def main(args: Array[String]) {
    ...
  }
}
I was able to solve it by moving the SparkContext and sqlContext inside the main function, as described in this other post.
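A minimal sketch of that rearrangement (the names and app name are placeholders, not taken from the linked post):
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

object Test {
  def main(args: Array[String]): Unit = {
    // Create the contexts inside main, so they are only initialized on the driver at runtime
    // instead of during object initialization when the class is loaded on an executor.
    val sc = SparkContext.getOrCreate(new SparkConf().setAppName("Test"))
    val sqlContext = new SQLContext(sc)
    // ... rest of the job ...
    sc.stop()
  }
}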

Related

How to construct (key, value) list from parallelized list in scala spark?

I am new to Scala and I have some questions about how it works.
I want to do the following: given a list of values, I want to construct some imitation of a dictionary in parallel, something like this: (1,2,3,4) -> ((1,1), (2,2), (3,3), (4,4)). I know that if we deal with parallelized collections we should use accumulators. So here is my attempt:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.util.AccumulatorV2
import scala.collection.mutable.ListBuffer
class DictAccumulatorV2 extends AccumulatorV2[Int, ListBuffer[(Int, Int)]] {
  private var dict: ListBuffer[(Int, Int)] = new ListBuffer[(Int, Int)]

  def reset(): Unit = {
    dict.clear()
  }

  def add(v: Int): Unit = {
    dict.append((v, v))
  }

  def value(): ListBuffer[(Int, Int)] = {
    return dict
  }

  def isZero(): Boolean = {
    return dict.isEmpty
  }

  def copy(): AccumulatorV2[Int, ListBuffer[(Int, Int)]] = {
    // I do not understand how to code it correctly
    return new DictAccumulatorV2
  }

  def merge(other: AccumulatorV2[Int, ListBuffer[(Int, Int)]]): Unit = {
    // I do not understand how to code it correctly without reinitializing dict from val to var
    dict = dict ++ other.value
  }
}

object FirstSparkApplication {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("MyFirstApp").setMaster("local")
    val sc = new SparkContext(conf)

    val accum = new DictAccumulatorV2()
    sc.register(accum, "mydictacc")

    val data = Array(1, 2, 3, 4, 5)
    val distData = sc.parallelize(data)
    var res = distData.map(x => accum.add(x))
    res.count()
    println(accum)
  }
}
So I wonder whether I am doing it right or whether there are any mistakes.
In general I also have questions about how sc.parallelize works. Does it actually parallelize the job on my machine, or is it just a fictional line of code? What should I put instead of "local" in setMaster? How can I see which nodes the task is being performed on? Is the task performed on all of the nodes at the same time, or is there some sequence?
(1,2,3,4) -> ((1,1), (2,2), (3,3), (4,4) )
You can do this in Scala by doing
val list = List(1,2,3,4)
val dict = list.map(i => (i,i))
Spark accumulators are used as a means of communication from the Spark executors back to the driver, not for building the result of a computation (see the sketch after the spark-shell example below).
If you want to do the above in parallel, then you would construct an RDD out of this list and apply a map transformation to it, as shown above.
In spark-shell it would look like this:
val list = List(1,2,3,4)
val listRDD = sc.parallelize(list)
val dictRDD = listRDD.map(i => (i,i))
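For completeness, here is a minimal sketch of what accumulators are actually meant for, namely counting or summing side information that flows from the executors back to the driver (the names below are my own illustration, not from the question):
val counter = sc.longAccumulator("processed")      // built-in LongAccumulator (Spark 2.x)
val pairs = sc.parallelize(List(1, 2, 3, 4)).map { i =>
  counter.add(1)   // side-channel update, reported back to the driver
  (i, i)           // the actual result still comes from the transformation itself
}
pairs.count()                // an action forces the mapping to run
println(counter.value)       // 4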
how sc.parallelize works
It creates a distributed dataset (an RDD in Spark terms) from the collection that you pass in to the function. More information.
It does parallelize your job.
If you are submitting your Spark job to a cluster, then you should be able to see a YARN application ID or URL after running the spark-submit command. You can visit the YARN application URL and see how many executors are processing the distributed dataset and in what sequence they run.
What should I put instead of "local" in setMaster
From the Spark documentation -
The master URL to connect to, such as "local" to run locally with one thread, "local[4]" to run locally with 4 cores, or "spark://master:7077" to run on a Spark standalone cluster.
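As a rough illustration of those options (the host and port are placeholders for your own cluster):
// Local development: one JVM, 4 worker threads
val localConf = new SparkConf().setAppName("MyFirstApp").setMaster("local[4]")
// Spark standalone cluster (replace master:7077 with your master's address)
val clusterConf = new SparkConf().setAppName("MyFirstApp").setMaster("spark://master:7077")
// On YARN it is more common to omit setMaster in the code and pass it to spark-submit instead:
//   spark-submit --master yarn --deploy-mode cluster myapp.jar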

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read the input file and compare it with the set "123,200,300"; if a match is found, give the matching data:
200,300 (from the 1st input line)
300 (from the 2nd input line)
123 (from the 4th input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
  val conf = new SparkConf()
    .setMaster("local")
    .setAppName("CountingSheep")
  val sc = new SparkContext(conf)

  def parseLine(invCol: String): RDD[String] = {
    println(s"INPUT, $invCol")
    val inv_rdd = sc.parallelize(Seq(invCol.toString))
    val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
    return inv_rdd.intersection(bs_meta_rdd)
  }

  def main(args: Array[String]) {
    val filePathName = "hdfs://xxx/tmp/input.csv"
    val rawData = sc.textFile(filePathName)
    val datad = rawData.map { r => parseLine(r) }
  }
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
The problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int, pName: String) => {
  val ids = Array("123", "200", "300")
  var idsFound: String = ""
  for (id <- ids) {
    if (pName.contains(id)) {
      idsFound = idsFound + id + ","
    }
  }
  if (idsFound.length() > 0) {
    idsFound = idsFound.substring(0, idsFound.length - 1)
  }
  idsFound
})
Use the UDF in withColumn():
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For a simple answer, why are we making it so complex? In this case we don't need a UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300:
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("\\|"))   // escape the pipe, since String.split takes a regex
  .map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
  .foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063) or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.
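As an aside, the usual workaround for this kind of nesting is to ship the small reference set to the executors and do the matching inside a single transformation; here is a hedged sketch using a broadcast variable (my own rewrite, not part of the answers above):
// Hypothetical reworking of parseLine without creating RDDs inside a transformation
val matchSet = sc.broadcast("123,200,300".split(",").toSet)
val rawData = sc.textFile("hdfs://xxx/tmp/input.csv")   // path as in the question
val matched = rawData.map { line =>
  line.split(",").toSet.intersect(matchSet.value).mkString(",")
}
matched.foreach(println)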

sortByKey in Spark

New to Spark and Scala. Trying to sort a word counting example. My code is based on this simple example.
I want to sort the results alphabetically by key. If I add the key sort to an RDD:
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
then I get a compile error:
error: No implicit view available from java.io.Serializable => Ordered[java.io.Serializable].
[INFO] val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
I don't know what the lack of an implicit view means. Can someone tell me how to fix it? I am running the Cloudera 5 Quickstart VM. I think it bundles Spark version 0.9.
Source of the Scala job
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.SparkContext._

object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))

    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        Array("NO NAME")
    }

    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Some (unsorted) output
("INTERNATIONAL EYELETS INC",879)
("SHAQUITA SALLEY",865)
("PAZ DURIGA",791)
("TERESSA ALCARAZ",824)
("MING CHAIX",878)
("JACKSON SHIELDS YEISER",837)
("AUDRY HULLINGER",875)
("GABRIELLE MOLANDS",802)
("TAM TACKER",775)
("HYACINTH VITELA",837)
No implicit view means there is no Scala function like this defined:
implicit def SerializableToOrdered(x :java.io.Serializable) = new Ordered[java.io.Serializable](x) //note this function doesn't work
The reason this error comes up is that your function returns two different types whose common supertype is java.io.Serializable (one is a String, the other an Array[String]). Also, sortByKey for obvious reasons requires the key to be orderable. Fix it like this:
object SparkWordCount {
  def main(args: Array[String]) {
    val sc = new SparkContext(new SparkConf().setAppName("Spark Count"))
    val files = sc.textFile(args(0)).map(_.split(","))

    def f(x: Array[String]) = {
      if (x.length > 3)
        x(3)
      else
        "NO NAME"
    }

    val names = files.map(f)
    val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
    System.out.println(wordCounts.collect().mkString("\n"))
  }
}
Now the function just returns Strings instead of two different types, so the key type is String and sortByKey can find an Ordering for it.
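A small self-contained sketch of the same point, using inline data instead of the question's input file (my own illustration, not from the answer):
// Once every key is a String, the implicit Ordering[String] satisfies sortByKey
val names = sc.parallelize(Seq("BOB", "ALICE", "BOB", "NO NAME"))
val wordCounts = names.map((_, 1)).reduceByKey(_ + _).sortByKey()
wordCounts.collect().foreach(println)   // (ALICE,1), (BOB,2), (NO NAME,1)
// Sorting by count instead of by key can be done with sortBy
val byCount = wordCounts.sortBy(_._2, ascending = false)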

Enriching SparkContext without incurring in serialization issues

I am trying to use Spark to process data that comes from HBase tables. This blog post gives an example of how to use NewHadoopAPI to read data from any Hadoop InputFormat.
What I have done
Since I will need to do this many times, I was trying to use implicits to enrich SparkContext, so that I can get an RDD from a given set of columns in HBase. I have written the following helper:
trait HBaseReadSupport {
  implicit def toHBaseSC(sc: SparkContext) = new HBaseSC(sc)

  implicit def bytes2string(bytes: Array[Byte]) = new String(bytes)
}

final class HBaseSC(sc: SparkContext) extends Serializable {
  def extract[A](data: Map[String, List[String]], result: Result, interpret: Array[Byte] => A) =
    data map { case (cf, columns) =>
      val content = columns map { column =>
        val cell = result.getColumnLatestCell(cf.getBytes, column.getBytes)
        column -> interpret(CellUtil.cloneValue(cell))
      } toMap

      cf -> content
    }

  def makeConf(table: String) = {
    val conf = HBaseConfiguration.create()
    conf.setBoolean("hbase.cluster.distributed", true)
    conf.setInt("hbase.client.scanner.caching", 10000)
    conf.set(TableInputFormat.INPUT_TABLE, table)
    conf
  }

  def hbase[A](table: String, data: Map[String, List[String]])
    (interpret: Array[Byte] => A) =
    sc.newAPIHadoopRDD(makeConf(table), classOf[TableInputFormat],
      classOf[ImmutableBytesWritable], classOf[Result]) map { case (key, row) =>
        Bytes.toString(key.get) -> extract(data, row, interpret)
      }
}
It can be used like
val rdd = sc.hbase[String](table, Map(
  "cf" -> List("col1", "col2")
))
In this case we get an RDD of (String, Map[String, Map[String, String]]), where the first component is the rowkey and the second is a map whose keys are column families and whose values are maps whose keys are columns and whose contents are the cell values.
Where it fails
Unfortunately, it seems that my job gets a reference to sc, which is itself not serializable by design. What I get when I run the job is
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
I can remove the helper classes and use the same logic inline in my job and everything runs fine. But I want to get something which I can reuse instead of writing the same boilerplate over and over.
By the way, the issue is not specific to implicits; even using a plain function of sc exhibits the same problem.
For comparison, the following helper to read TSV files (I know it's broken as it does not support quoting and so on, never mind) seems to work fine:
trait TsvReadSupport {
  implicit def toTsvRDD(sc: SparkContext) = new TsvRDD(sc)
}

final class TsvRDD(val sc: SparkContext) extends Serializable {
  def tsv(path: String, fields: Seq[String], separator: Char = '\t') = sc.textFile(path) map { line =>
    val contents = line.split(separator).toList
    (fields, contents).zipped.toMap
  }
}
How can I encapsulate the logic to read rows from HBase without unintentionally capturing the SparkContext?
Just add the @transient annotation to the sc variable:
final class HBaseSC(@transient val sc: SparkContext) extends Serializable {
  ...
}
and make sure sc is not used within the extract function, since it won't be available on the workers.
If it's necessary to access the Spark context from within a distributed computation, the rdd.context function might be used:
val rdd = sc.newAPIHadoopRDD(...)

rdd map {
  case (k, v) =>
    val ctx = rdd.context
    ....
}
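To illustrate why sc must not be touched on the workers, here is a generic sketch of what @transient does during plain Java serialization (my own example, independent of the answer's code):
import java.io._

// A @transient field is skipped by serialization and comes back as null,
// which is exactly what happens to sc when the closure is shipped to an executor.
class Holder(@transient val heavy: Thread) extends Serializable

object TransientDemo {
  def main(args: Array[String]): Unit = {
    val bytes = new ByteArrayOutputStream()
    new ObjectOutputStream(bytes).writeObject(new Holder(new Thread()))   // Thread is not serializable
    val copy = new ObjectInputStream(new ByteArrayInputStream(bytes.toByteArray))
      .readObject().asInstanceOf[Holder]
    println(copy.heavy == null)   // prints true: the field did not survive the round trip
  }
}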

Accumulator fails on cluster, works locally

In the official Spark documentation, there is an example of an accumulator used in a foreach call directly on an RDD:
scala> val accum = sc.accumulator(0)
accum: spark.Accumulator[Int] = 0
scala> sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)
...
10/09/29 18:41:08 INFO SparkContext: Tasks finished in 0.317106 s
scala> accum.value
res2: Int = 10
I implemented my own accumulator:
val myCounter = sc.accumulator(0)
val myRDD = sc.textFile(inputpath) // :spark.RDD[String]
myRDD.flatMap(line => foo(line)) // line 69
def foo(line: String) = {
  myCounter += 1 // line 82 throwing NullPointerException
  // compute something on the input
}
println(myCounter.value)
In a local setting, this works just fine. However, if I run this job on a spark standalone cluster with several machines, the workers throw a
13/07/22 21:56:09 ERROR executor.Executor: Exception in task ID 247
java.lang.NullPointerException
at MyClass$.foo(MyClass.scala:82)
at MyClass$$anonfun$2.apply(MyClass.scala:67)
at MyClass$$anonfun$2.apply(MyClass.scala:67)
at scala.collection.Iterator$$anon$21.hasNext(Iterator.scala:440)
at scala.collection.Iterator$$anon$19.hasNext(Iterator.scala:400)
at spark.PairRDDFunctions.writeToFile$1(PairRDDFunctions.scala:630)
at spark.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:640)
at spark.PairRDDFunctions$$anonfun$saveAsHadoopDataset$2.apply(PairRDDFunctions.scala:640)
at spark.scheduler.ResultTask.run(ResultTask.scala:77)
at spark.executor.Executor$TaskRunner.run(Executor.scala:98)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:722)
at the line which increments the accumulator myCounter.
My question is: Can accumulators only be used in "top-level" anonymous functions which are applied directly to RDDs and not in nested functions?
If yes, why does my call succeed locally and fail on a cluster?
edit: increased verbosity of exception.
In my case too, the accumulator was null in the closure when I used 'extends App' to create the Spark application, as shown below:
object AccTest extends App {
  val conf = new SparkConf().setAppName("AccTest").setMaster("yarn-client")
  val sc = new SparkContext(conf)
  sc.setLogLevel("ERROR")

  val accum = sc.accumulator(0, "My Accumulator")
  sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

  println("count:" + accum.value)

  sc.stop
}
I replaced extends App with a main() method and it worked on a YARN cluster in HDP 2.4:
object AccTest {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("AccTest").setMaster("yarn-client")
    val sc = new SparkContext(conf)
    sc.setLogLevel("ERROR")

    val accum = sc.accumulator(0, "My Accumulator")
    sc.parallelize(Array(1, 2, 3, 4)).foreach(x => accum += x)

    println("count:" + accum.value)

    sc.stop
  }
}
This worked.
What if you define the function like this:
def foo(line: String, myc: org.apache.spark.Accumulator[Int]) = {
  myc += 1
}
And then call it like this?
foo(line, myCounter)
If you use "flatMap" then "myCounter" will not update because "flatMap" is lazy function. You can use this code:
myRDD.foreach(line => foo(line))
def foo(line: String) = { myCounter += 1 }
println(myCounter.value)
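Putting the answers together, here is a minimal sketch of a version that should update the accumulator on a cluster (the app name and input path are placeholders): no extends App, the accumulator is created inside main, and the update happens inside an action (foreach) rather than an unevaluated transformation.
import org.apache.spark.{SparkConf, SparkContext}

object CounterJob {                          // plain object with main(), not `extends App`
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("CounterJob"))
    val myCounter = sc.accumulator(0)        // same (pre-2.0) accumulator API as the question
    val myRDD = sc.textFile("hdfs:///path/to/input")   // placeholder path

    myRDD.foreach(_ => myCounter += 1)       // foreach is an action, so the updates actually run
    println(myCounter.value)                 // read the value only on the driver

    sc.stop()
  }
}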