Nested Spark DataFrame when calling Vegas-viz .withDataFrame - scala

Quick abstract :
I am trying to display multiple histograms from Spark DataFrames with Vegas-viz in Scala. I created a trait to create different types of histograms, and implemented classes expending it. When I create an instance of a child class, I get a NullPointerException which makes me think there is a nested DataFrame somewhere.
Is there a workaround? Did I miss something and the error is something else?
Details :
Here is the trait :
trait Histogram {
val rawdf: DataFrame
val sparseDim: Seq[String]
val name: String
val xColumn: String
val yColumn: String
val group: DataFrame
val plot: ExtendedUnitSpecBuilder = Vegas(name).
withDataFrame(group).
encodeX(
field = xColumn,
Quantitative,
scale = Scale(ScaleType.Log),
title = sparseDim.reduce((a, b) => a + ", " + b)
).
encodeY(field = yColumn, Quantitative).
mark(Bar)
def show(): Unit = plot.show
}
And here is one of the classes extending it :
class HistogramCount(val rawdf: DataFrame,
val sparseDim: Seq[String],
val name: String = "Histogram Count") extends Histogram {
val xColumn = "cube"
val yColumn = "count"
override val group: DataFrame = rawdf.
select("VALUE", sparseDim: _*).
groupBy(sparseDim.head, sparseDim.tail: _*).
count().
withColumnRenamed("count", "cube").
groupBy("cube").
count()
}
When i create an instance of the child class, the following error occures :
Exception in thread "main" java.lang.NullPointerException
at <Pointing to .withDataFrame(group) in the trait>
I guess this is because the evaluation of group is lazy and that it is called in .withDataFrame(group) when plot is created.
I tried to evaluate the group DataFrame before Calling plot with a val evaluate: Long = group.rdd.count(), but it does not solve the issue.

Solved it by making the variable plot lazy. Still not sure if that is the best way thought.

Related

Create Spark UDF of a function that depends on other resources

I have a code for tokenizing a string.
But that tokenization method uses some data which is loaded when my application starts.
val stopwords = getStopwords();
val tokens = tokenize("hello i am good",stopwords)
def tokenize(string:String,stopwords: List[String]) : List[String] = {
val splitted = string.split(" ")
// I use this stopwords for filtering my splitted array.
// Then i return the items back.
}
Now I want to make the tokenize method an UDF for Spark.I want to use it to create new column in DataFrame Transformations.
I created simple UDFs before which had no dependencies like it needs items that needs to be read from text file etc.
Can some one tell me how to do these kind of operation?
This is what I have tried ,and its working.
val moviesDF = Seq(
("kingdomofheaven"),
("enemyatthegates"),
("salesinfointheyearofdecember"),
).toDF("column_name")
val tokenizeUDF: UserDefinedFunction = udf(tokenize(_: String): List[String])
moviesDF.withColumn("tokenized", tokenizeUDF(col("column_name"))).show(100, false)
def tokenize(name: String): List[String] = {
val wordFreqMap: Map[String, Double] = DataProviderUtil.getWordFreqMap()
val stopWords: Set[String] = DataProviderUtil.getStopWordSet()
val maxLengthWord: Int = wordFreqMap.keys.maxBy(_.length).length
.................
.................
}
Its giving me the expected output:
+----------------------------+--------------------------+
|columnname |tokenized |
+----------------------------+--------------------------+
|kingdomofheaven |[kingdom, heaven] |
|enemyatthegates |[enemi, gate] |
|salesinfointheyearofdecember|[sale, info, year, decemb]|
+----------------------------+--------------------------+
Now my question is , will it work when its deployed ? Currently I am
running it locally. My main concern it that this function reads from a
file to get information like stopwords,wordfreq etc for making the
tokenization possible. So registering it like this will work properly
?
At this point, if you deploy this code Spark will try to serialize your DataProviderUtil, you would need to mark as serializable that class. Another possibility is to declare you logic inside an Object. Functions inside objects are considered static functions and they are not serialized.

Getting java.lang.ArrayIndexOutOfBoundsException: 1 in spark when applying aggregate functions

I am trying to do some transformations on a data set. After reading the data set when performing df.show() operations, I am getting the rows listed in spark shell. But when I try to do df.count or any aggregate functions, I am getting
java.lang.ArrayIndexOutOfBoundsException: 1.
val itpostsrow = sc.textFile("/home/jayk/Downloads/spark-data")
import scala.util.control.Exception.catching
import java.sql.Timestamp
implicit class StringImprovements(val s:String) {
def toIntSafe = catching(classOf[NumberFormatException])
opt s.toInt
def toLongsafe = catching(classOf[NumberFormatException])
opt s.toLong
def toTimeStampsafe = catching(classOf[IllegalArgumentException]) opt Timestamp.valueOf(s)
}
case class Post(commentcount:Option[Int],lastactivitydate:Option[java.sql.Timestamp],ownerUserId:Option[Long],body:String,score:Option[Int],creattiondate:Option[java.sql.Timestamp],viewcount:Option[Int],title:String,tags:String,answerCount:Option[Int],acceptedanswerid:Option[Long],posttypeid:Option[Long],id:Long)
def stringToPost(row:String):Post = {
val r = row.split("~")
Post(r(0).toIntSafe,
r(1).toTimeStampsafe,
r(2).toLongsafe,
r(3),
r(4).toIntSafe,
r(5).toTimeStampsafe,
r(6).toIntSafe,
r(7),
r(8),
r(9).toIntSafe,
r(10).toLongsafe,
r(11).toLongsafe,
r(12).toLong)
}
val itpostsDFcase1 = itpostsrow.map{x=>stringToPost(x)}
val itpostsDF = itpostsDFcase1.toDF()
Your function stringToPost() might cause a Java error ArrayIndexOutOfBoundsException if the text file contains some empty row or if the number of fields after the split is not 13.
Due to Spark's lazy evaluation one notices such errors only when performing an action like count.

String filter using Spark UDF

input.csv:
200,300,889,767,9908,7768,9090
300,400,223,4456,3214,6675,333
234,567,890
123,445,667,887
What I want:
Read input file and compare with set "123,200,300" if match found, gives matching data
200,300 (from 1 input line)
300 (from 2 input line)
123 (from 4 input line)
What I wrote:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD
object sparkApp {
val conf = new SparkConf()
.setMaster("local")
.setAppName("CountingSheep")
val sc = new SparkContext(conf)
def parseLine(invCol: String) : RDD[String] = {
println(s"INPUT, $invCol")
val inv_rdd = sc.parallelize(Seq(invCol.toString))
val bs_meta_rdd = sc.parallelize(Seq("123,200,300"))
return inv_rdd.intersection(bs_meta_rdd)
}
def main(args: Array[String]) {
val filePathName = "hdfs://xxx/tmp/input.csv"
val rawData = sc.textFile(filePathName)
val datad = rawData.map{r => parseLine(r)}
}
}
I get the following exception:
java.lang.NullPointerException
Please suggest where I went wrong
Problem is solved. This is very simple.
val pfile = sc.textFile("/FileStore/tables/6mjxi2uz1492576337920/input.csv")
case class pSchema(id: Int, pName: String)
val pDF = pfile.map(_.split("\t")).map(p => pSchema(p(0).toInt,p(1).trim())).toDF()
pDF.select("id","pName").show()
Define UDF
val findP = udf((id: Int,
pName: String
) => {
val ids = Array("123","200","300")
var idsFound : String = ""
for (id <- ids){
if (pName.contains(id)){
idsFound = idsFound + id + ","
}
}
if (idsFound.length() > 0) {
idsFound = idsFound.substring(0,idsFound.length -1)
}
idsFound
})
Use UDF in withCoulmn()
pDF.select("id","pName").withColumn("Found",findP($"id",$"pName")).show()
For simple answer, why we are making it so complex? In this case we don't require UDF.
This is your input data:
200,300,889,767,9908,7768,9090|AAA
300,400,223,4456,3214,6675,333|BBB
234,567,890|CCC
123,445,667,887|DDD
and you have to match it with 123,200,300
val matchSet = "123,200,300".split(",").toSet
val rawrdd = sc.textFile("D:\\input.txt")
rawrdd.map(_.split("|"))
.map(arr => arr(0).split(",").toSet.intersect(matchSet).mkString(",") + "|" + arr(1))
.foreach(println)
Your output:
300,200|AAA
300|BBB
|CCC
123|DDD
What you are trying to do can't be done the way you are doing it.
Spark does not support nested RDDs (see SPARK-5063).
Spark does not support nested RDDs or performing Spark actions inside of transformations; this usually leads to NullPointerExceptions (see SPARK-718 as one example). The confusing NPE is one of the most common sources of Spark questions on StackOverflow:
call of distinct and map together throws NPE in spark library
NullPointerException in Scala Spark, appears to be caused be collection type?
Graphx: I've got NullPointerException inside mapVertices
(those are just a sample of the ones that I've answered personally; there are many others).
I think we can detect these errors by adding logic to RDD to check whether sc is null (e.g. turn sc into a getter function); we can use this to add a better error message.

Spark: Task not serializable (Broadcast/RDD/SparkContext)

There are numerous questions about Task is not serializable in Spark. However, this case seems quite particular.
I have created a class:
class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
val allEs: RDD[(String, E)] = e.map(e => (e.w, e))
.persist()
val sc = allEs.sparkContext
val centroids = sc.broadcast(m.clusterCenters)
[...]
The class defines the following method:
private def centroidDistances(v: Vector): Array[Double] = {
centroids.value.map(c => (centroids.value.indexOf(c), Vectors.sqdist(v, c)))
.sortBy(_._1)
.map(_._2)
}
However, when the class is called, a Task is not serializable exception is thrown.
Strange enough, a tiny change in the header of class Neighbours suffices to fix the issue. Instead of creating a val sc: SparkContext to use for broadcasting, I merely inline the code that creates the Spark context:
class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
val allEs: RDD[(String, E)] = e.map(e => (e.w, e))
.setName("embeddings")
.persist()
val centroids = allEmbeddings.sparkContext(m.clusterCenters)
[...]
My question is: how does the second variant make a difference? What goes wrong in the first one? Intuitively, this should be merely syntactic sugar, is this a bug in Spark?
I use Spark 1.4.1 on a Hadoop/Yarn cluster.
When you define
class Neighbours(e: RDD[E], m: KMeansModel) extends Serializable {
...
val sc = allEmbeddings.sparkContext
val centroids = sc.broadcast(m.clusterCenters)
...
}
You have made sc into a class variable, meaning it could be accessed from an instance of Neighbours e.g. neighbours.sc. This means that sc needs to be serializable, which is it not.
When you inline the code, only the final value of centroids needs to be serializable. centroids is of type Broadcast which is Serializable.

Enriching SparkContext without incurring in serialization issues

I am trying to use Spark to process data that comes from HBase tables. This blog post gives an example of how to use NewHadoopAPI to read data from any Hadoop InputFormat.
What I have done
Since I will need to do this many times, I was trying to use implicits to enrich SparkContext, so that I can get an RDD from a given set of columns in HBase. I have written the following helper:
trait HBaseReadSupport {
implicit def toHBaseSC(sc: SparkContext) = new HBaseSC(sc)
implicit def bytes2string(bytes: Array[Byte]) = new String(bytes)
}
final class HBaseSC(sc: SparkContext) extends Serializable {
def extract[A](data: Map[String, List[String]], result: Result, interpret: Array[Byte] => A) =
data map { case (cf, columns) =>
val content = columns map { column =>
val cell = result.getColumnLatestCell(cf.getBytes, column.getBytes)
column -> interpret(CellUtil.cloneValue(cell))
} toMap
cf -> content
}
def makeConf(table: String) = {
val conf = HBaseConfiguration.create()
conf.setBoolean("hbase.cluster.distributed", true)
conf.setInt("hbase.client.scanner.caching", 10000)
conf.set(TableInputFormat.INPUT_TABLE, table)
conf
}
def hbase[A](table: String, data: Map[String, List[String]])
(interpret: Array[Byte] => A) =
sc.newAPIHadoopRDD(makeConf(table), classOf[TableInputFormat],
classOf[ImmutableBytesWritable], classOf[Result]) map { case (key, row) =>
Bytes.toString(key.get) -> extract(data, row, interpret)
}
}
It can be used like
val rdd = sc.hbase[String](table, Map(
"cf" -> List("col1", "col2")
))
In this case we get an RDD of (String, Map[String, Map[String, String]]), where the first component is the rowkey and the second is a map whose key are column families and the values are maps whose keys are columns and whose content are the cell values.
Where it fails
Unfortunately, it seems that my job gets a reference to sc, which is itself not serializable by design. What I get when I run the job is
Exception in thread "main" org.apache.spark.SparkException: Job aborted: Task not serializable: java.io.NotSerializableException: org.apache.spark.SparkContext
at org.apache.spark.scheduler.DAGScheduler$$anonfun$org$apache$spark$scheduler$DAGScheduler$$abortStage$1.apply(DAGScheduler.scala:1028)
I can remove the helper classes and use the same logic inline in my job and everything runs fine. But I want to get something which I can reuse instead of writing the same boilerplate over and over.
By the way, the issue is not specific to implicit, even using a function of sc exhibits the same problem.
For comparison, the following helper to read TSV files (I know it's broken as it does not support quoting and so on, never mind) seems to work fine:
trait TsvReadSupport {
implicit def toTsvRDD(sc: SparkContext) = new TsvRDD(sc)
}
final class TsvRDD(val sc: SparkContext) extends Serializable {
def tsv(path: String, fields: Seq[String], separator: Char = '\t') = sc.textFile(path) map { line =>
val contents = line.split(separator).toList
(fields, contents).zipped.toMap
}
}
How can I encapsulate the logic to read rows from HBase without unintentionally capturing the SparkContext?
Just add #transient annotation to sc variable:
final class HBaseSC(#transient val sc: SparkContext) extends Serializable {
...
}
and make sure sc is not used within extract function, since it won't be available on workers.
If it's necessary to access Spark context from within distributed computation, rdd.context function might be used:
val rdd = sc.newAPIHadoopRDD(...)
rdd map {
case (k, v) =>
val ctx = rdd.context
....
}