Bigram frequency of query log events with Apache Spark - Scala

I would like to study user actions within sessions extracted from search engine query logs. I first define two kinds of actions: Queries and Clicks.
sealed trait Action{}
case class Query(val input:String) extends Action
case class Click(val link:String) extends Action
Suppose the first action in the query log occurs at the following timestamp, in milliseconds:
val t0 = 1417444964686L // 2014-12-01 15:42:44
Let's define a corpus of temporally ordered actions associated with session ids.
val query_log:Array[(String, (Action, Long))] = Array (
("session1",(Query("query1"),t0)),
("session1",(Click("link1") ,t0+1000)),
("session1",(Click("link2") ,t0+2000)),
("session1",(Query("query2"),t0+3000)),
("session1",(Click("link3") ,t0+4000)),
("session2",(Query("query3"),t0+5000)),
("session2",(Click("link4") ,t0+6000)),
("session2",(Query("query4"),t0+7000)),
("session2",(Query("query5"),t0+8000)),
("session2",(Click("link5") ,t0+9000)),
("session2",(Click("link6") ,t0+10000)),
("session3",(Query("query6"),t0+11000))
)
Then we create an RDD for this query_log:
import org.apache.spark.rdd.RDD
var logs:RDD[(String, (Action, Long))] = sc.makeRDD(query_log)
The logs are then grouped by session id:
val sessions_groups:RDD[(String, Iterable[(Action, Long)])] = logs.groupByKey().cache()
Now we want to study Action cooccurrences within a session, for example the number of query rewritings in a session. We therefore define a class Cooccurrences, which is initialized from a session's actions.
case class Cooccurrences(
var numQueriesWithClicks:Int = 0,
var numQueries:Int = 0,
var numRewritings:Int = 0,
var numQueriesBeforeClicks:Int = 0
) {
// The cooccurrence object is initialized from a list of timestamped actions, i.e. one session group
def initFromActions(actions:Iterable[(Action, Long)]) = {
// 30 seconds is the maximal time (in milliseconds) between two queries (q1, q2) for q2 to be considered a rewriting of q1
var thirtySeconds = 30000
var hasClicked = false
var hasRewritten = false
// in the observed action sequence, we extract consecutive (sliding(2)) actions sorted by timestamp
// for each bigram in the sequence, we update the counters of the cooccurrence object
actions.toSeq.sortBy(_._2).sliding(2).foreach{
// case Seq(l0) => // session with only one Action
case Seq((e1:Click, t0)) => { // click without any query
numQueries = 0
}
case Seq((e1:Query, t0)) => { // query without any click
numQueries = 1
numQueriesBeforeClicks = 1
}
// case Seq(l0, l1) => // session with at least two Actions
case Seq((e1:Click, t0), (e2:Query, t1)) => { // a click followed by a query
if(! hasClicked)
numQueriesBeforeClicks = numQueries
hasClicked = true
}
case Seq((e1:Click, t0), (e2:Click, t1)) => { // two consecutive clicks
if(! hasClicked)
numQueriesBeforeClicks = numQueries
hasClicked = true
}
case Seq((e1:Query, t0), (e2:Click, t1)) => { // a query followed by a click
numQueries += 1
if(! hasClicked)
numQueriesBeforeClicks = numQueries
hasClicked = true
numQueriesWithClicks +=1
}
case Seq((e1:Query, t0), (e2:Query, t1)) => { // two consecutive queries
val dt = t1 - t0
numQueries += 1
if(dt < thirtySeconds && e1.input != e2.input){
hasRewritten = true
numRewritings += 1
}
}
}
}
}
Now, let's try to compute an RDD of Cooccurrences, one per session:
val session_cooc_stats:RDD[Cooccurrences] = sessions_groups.map{
case (sessionId, actions) => {
var coocs = Cooccurrences()
coocs.initFromActions(actions)
coocs
}
}
Unfortunately, it raises the following MatchError
scala> session_cooc_stats.take(2)
15/02/06 22:50:08 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 4) scala.MatchError: List((Query(query3),1417444969686), (Click(link4),1417444970686)) (of class scala.collection.immutable.$colon$colon) at $line25.$read$$iwC$$iwC$Cooccurrences$$anonfun$initFromActions$2.apply(<console>:29)
at $line25.$read$$iwC$$iwC$Cooccurrences$$anonfun$initFromActions$2.apply(<console>:29)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at $line25.$read$$iwC$$iwC$Cooccurrences.initFromActions(<console>:29)
at $line28.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:31)
at $line28.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:28)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:1081)
at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:1081)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/06 22:50:08 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 4, localhost): scala.MatchError: List((Query(query3),1417444969686), (Click(link4),1417444970686)) (of class scala.collection.immutable.$colon$colon)
at $line25.$read$$iwC$$iwC$Cooccurrences$$anonfun$initFromActions$2.apply(<console>:29)
at $line25.$read$$iwC$$iwC$Cooccurrences$$anonfun$initFromActions$2.apply(<console>:29)
...
If I build my own action list, equivalent to the first group in the session_cooc_stats RDD,
val actions:Iterable[(Action, Long)] = Array(
(Query("query1"),t0),
(Click("link1") ,t0+1000),
(Click("link2") ,t0+2000),
(Query("query2"),t0+3000),
(Click("link3") ,t0+4000)
)
I get the expected result
var c = Cooccurrences()
c.initFromActions(actions)
// c == Cooccurrences(2,2,0,1)
Something seems wrong when I build a Cooccurrences object from an RDD.
It seems linked to the CompactBuffer built by groupByKey().
What am I missing?
I am new to Spark and Scala.
Thanks in advance for your help.
Thomas

As you advised, I rewrote the code in IntelliJ and created a companion object for the main function.
Surprisingly, the code compiles (with sbt) and runs flawlessly.
However, I don't really understand why the compiled code runs whereas the same code doesn't work in spark-shell.
Thank you for your answer!

I set your code up in IntelliJ, created one class each for Action, Query, Click, and Cooccurrences, and put your code in a main method.
val sessions_groups:RDD[(String, Iterable[(Action, Long)])] = logs.groupByKey().cache()
val session_cooc_stats:RDD[Cooccurrences] = sessions_groups.map{
case (sessionId, actions) => {
val coocs = Cooccurrences()
coocs.initFromActions(actions)
coocs
}
}
session_cooc_stats.take(2).foreach(println(_))
I just changed var coocs to val coocs.
I guess that's the point. The output:
Cooccurrences(0,1,0,1)
Cooccurrences(2,3,1,1)

Related

Suddenly throwing "This RDD lacks a SparkContext"; it was working before, when all the code was in the main method

It was a working piece of code, but it suddenly stopped working after I tried creating the SparkSession from a different Scala object.
val b = a.filter { x => (!x._2._1.isEmpty) && (!x._2._2.isEmpty) }
val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct
for (i <- primary_key_distinct) {
b.foreach(println)
}
Error:
ERROR Executor: Exception in task 0.0 in stage 5.0 (TID 5)
org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
It is still not working even after I reverted the change, and I'm not using any objects.
Updated code:
object try {
def main(args: Array[String]) {
val spark = SparkSession.builder().master("local").appName("50columns3nodes").getOrCreate()
var s = spark.read.csv("/home/hadoopuser/Desktop/input/source.csv").rdd.map(_.mkString(","))
var k = spark.read.csv("/home/hadoopuser/Desktop/input/destination.csv").rdd.map(_.mkString(","))
val source_primary_key = s.map(rec => (rec.split(",")(0), rec))
val destination_primary_key = k.map(rec => (rec.split(",")(0), rec))
val a = source_primary_key.cogroup(destination_primary_key).filter { x => ((x._2._1) != (x._2._2)) }
val b = a.filter { x => (!x._2._1.isEmpty) && (!x._2._2.isEmpty) }
var extra_In_Dest = a.filter(x => x._2._1.isEmpty && !x._2._2.isEmpty).map(rec => (rec._2._2.mkString("")))
var extra_In_Src = a.filter(x => !x._2._1.isEmpty && x._2._2.isEmpty).map(rec => (rec._2._1.mkString("")))
val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct
for (i <- primary_key_distinct) {
var lengthofarray = 0
println(i)
b.foreach(println)
}
}
}
Input data follows
s=1,david
2,ajay
3,jijo
4,abi
5,surendhar
k=1,david
2,ajay
3,jijoaa
4,abisdsdd
5,surendhar
val a contains (3,(jijo,jijoaa)) and (5,(abi,abisdsdd))
If you read the first message carefully:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
It clearly states that actions and transformations cannot be performed inside another transformation.
primary_key_distinct is a transformation on b, and b itself is a transformation on a. And b.foreach(println) is an action invoked inside the loop over primary_key_distinct, i.e. inside another RDD operation.
So if you collect b or primary_key_distinct on the driver, the code should run properly:
val b = a.filter { x => (!x._2._1.isEmpty) && (!x._2._2.isEmpty) }.collect
or
val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct.collect
or, if you don't use an action inside another transformation, the code should run properly too, as in:
for (i <- 1 to 2) {
var lengthofarray = 0
println(i)
b.foreach(println)
}
I hope the explanation is clear.
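For instance, collecting primary_key_distinct is enough on its own. A minimal sketch reusing the question's b, with only the changed lines shown: once primary_key_distinct is a local array on the driver, the for loop runs on the driver and b.foreach(println) becomes an ordinary action rather than a nested one.
val primary_key_distinct = b.map(rec => (rec._1.split(",")(0))).distinct.collect() // local Array[String] on the driver
for (i <- primary_key_distinct) { // plain driver-side loop
println(i)
b.foreach(println) // RDD action invoked from the driver, which is allowed
}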

scala.MatchError on a tuple

After processing some input data, I ended up with an RDD[(String, String, Long)], call it input.
input: org.apache.spark.rdd.RDD[(String, String, Long)] = MapPartitionsRDD[9] at flatMap at <console>:54
The String fields here represent vertices of a graph, and the Long field is the weight of the edge.
To create a graph out of this, I first insert each vertex into a map with a unique id if it is not already known. If it was already encountered, I reuse the vertex id that was assigned previously. Essentially, each vertex is assigned a unique id of type Long, and then I want to create Edges.
Here is what I am doing:
var vertexMap = collection.mutable.Map[String, Long]()
var vid : Long = 0 // global vertex id counter
var srcVid : Long = 0 // source vertex id
var dstVid : Long = 0 // destination vertex id
val graphEdges = input.map {
case Row(src: String, dst: String, weight: Long) => (
if (vertexMap.contains(src)) {
srcVid = vertexMap(src)
if (vertexMap.contains(dst)) {
dstVid = vertexMap(dst)
} else {
vid += 1 // pick a new vertex id
vertexMap += (dst -> vid)
dstVid = vid
}
Edge(srcVid, dstVid, weight)
} else {
vid += 1
vertexMap(src) = vid
srcVid = vid
if (vertexMap.contains(dst)) {
dstVid = vertexMap(dst)
} else {
vid += 1
vertexMap(dst) = vid
dstVid = vid
}
Edge(srcVid, dstVid, weight)
}
}
val graph = Graph.fromEdges(graphEdges, 0)
println("num edges = " + graph.numEdges);
println("num vertices = " + graph.numVertices);
What I see is that graphEdges is of type RDD[org.apache.spark.graphx.Edge[Long]] and graph is of type Graph[Int,Long]:
graphEdges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Long]] = MapPartitionsRDD[10] at map at <console>:64
graph: org.apache.spark.graphx.Graph[Int,Long] = org.apache.spark.graphx.impl.GraphImpl@1b48170a
but I get the following error while printing the graph's edge and vertex counts:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 8.0 failed 1 times, most recent failure: Lost task 1.0 in stage 8.0 (TID 9, localhost, executor driver): scala.MatchError: (vertexA, vertexN, 2000) (of class scala.Tuple3)
at $anonfun$1.apply(<console>:64)
at $anonfun$1.apply(<console>:64)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I don't understand where the mismatch is here.
Thanks @Joe K for the helpful tip. I started using zipWithIndex and the code looks compact now; however, graph instantiation still fails. Here is the updated code:
val vertices = input.map(r => r._1).union(input.map(r => r._2)).distinct.zipWithIndex
val graphEdges = input.map {
case (src, dst, weight) =>
Edge(vertices.lookup(src)(0), vertices.lookup(dst)(0), weight)
}
val graph = Graph.fromEdges(graphEdges, 0)
println("num edges = " + graph.numEdges);
So, from the original 3-tuples, I form a union of the 1st and 2nd fields (which are vertices), then assign a unique id to each after de-duplicating them. I then use their ids while creating edges. However, it fails with the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 23, localhost, executor driver): org.apache.spark.SparkException: This RDD lacks a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:89)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:937)
at $anonfun$1.apply(<console>:55)
at $anonfun$1.apply(<console>:53)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
Any thoughts?
This specific error is coming from trying to match a tuple as a Row, which it is not.
Change:
case Row(src: String, dst: String, weight: Long) => {
to just:
case (src, dst, weight) => {
Also, your larger plan for generating vertex ids will not work. All of the logic inside the map will happen in parallel in different executors, which will have different copies of the mutable map.
You should flatMap your edges to get a list of all vertices, then call .distinct.zipWithIndex to assign each vertex a single unique Long value. You would then need to re-join the ids with the original edges, as sketched below.
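A minimal sketch of that approach, assuming input: RDD[(String, String, Long)] as in the question (the intermediate names are illustrative, not from the original post):
import org.apache.spark.graphx.{Edge, Graph}
// collect every vertex name, de-duplicate, and assign each one a unique Long id
val vertexIds = input.flatMap { case (src, dst, _) => Seq(src, dst) }.distinct.zipWithIndex // RDD[(String, Long)]
// re-join the ids onto the edges, one join per endpoint
val graphEdges = input
  .map { case (src, dst, weight) => (src, (dst, weight)) }
  .join(vertexIds) // (src, ((dst, weight), srcId))
  .map { case (_, ((dst, weight), srcId)) => (dst, (srcId, weight)) }
  .join(vertexIds) // (dst, ((srcId, weight), dstId))
  .map { case (_, ((srcId, weight), dstId)) => Edge(srcId, dstId, weight) }
val graph = Graph.fromEdges(graphEdges, 0)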

Apache Spark: Why can't I use a broadcast var defined in a global object?

Here is a simplified example to show my concern. It contains 3 files with 3 objects, and depends on Spark 1.6.1.
//file globalObject.scala
import org.apache.spark.broadcast.Broadcast
object globalObject {
var br_value: Broadcast[Map[Int, Double]] = null
}
//file someFunc.scala
import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
object someFunc {
def go(rdd: RDD[Int])(implicit sc: SparkContext): Array[Int] = {
rdd.map(i => {
val acc = globalObject.br_value.value
if(acc.contains(i)) {
i + 1
} else {
i
}
}).take(100)
}
}
//testMain.scala
import org.apache.spark.{SparkConf, SparkContext}
object testMain {
def bootStrap()(implicit sc:SparkContext): Unit = {
globalObject.br_value = sc.broadcast(Map(1->2, 2->3, 4->5))
}
def main(args: Array[String]): Unit = {
lazy val appName = getClass.getSimpleName.split("\\$").last
implicit val sc = new SparkContext(new SparkConf().setAppName(appName))
val datardd = sc.parallelize(Range(0, 200), 200)
.flatMap(i => Range(0, 1000))
bootStrap()
someFunc.go(datardd).foreach(println)
}
}
When I run this code on a cluster, it gives me the following error:
ERROR executor.Executor: Exception in task 0.0 in stage 0.0 (TID 0)
java.lang.NullPointerException
at someFunc$$anonfun$go$1.apply$mcII$sp(someFunc.scala:7)
at someFunc$$anonfun$go$1.apply(someFunc.scala:6)
at someFunc$$anonfun$go$1.apply(someFunc.scala:6)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
Apparently, the data is not successfully broadcast. I ran into this problem while refactoring my code recently. I want different Scala objects to share the same broadcast variable, but here we are. It's pretty confusing: to my understanding, the driver uses a pointer to refer to the broadcast variable, so reading a broadcast variable shouldn't be restricted to the same code scope.
Correct me if I am wrong. What's the proper way to share a broadcast variable among Scala objects? Thanks in advance.
Code in map is serialized and executed on each node. val acc = globalObject.br_value.value uses the node's globalObject.br_value. But of course that's still null; you only assign it on the driver. You could make your code close over the broadcast variable by pulling it out of the lambda:
val br_value = globalObject.br_value
rdd.map(i => {
val acc = br_value.value
if(acc.contains(i)) {
i + 1
} else {
i
}
}).take(100)

Scala-Spark NullPointerError when submitting jar, not in shell

My Spark job raises a null pointer exception that I cannot track down. When I print potentially null variables, they're all populated on every worker. My data does not contain null values, since the same job works within the spark shell.
All helper methods not defined in the function are defined within the body of the Spark job object, so I believe closures are not the problem.
override def execute(sc:SparkContext) = {
def construct_query(targetTypes:List[String]) = Map("query" ->
Map("nested" ->
Map("path"->"annotations.entities.items",
"query"-> Map("terms"->
Map("annotations.entities.items.type"-> targetTypes)))))
val sourceConfig = HashMap(
"es.nodes" -> params.targetClientHost
)
// Base elastic search RDD returning articles which match the above query on entity types
val rdd = EsSpark.esJsonRDD(sc,
params.targetIndex,
toJson(construct_query(params.entityTypes)),
sourceConfig
).sample(false,params.sampleRate)
// Mapping ES json into news article object, then extracting the entities list of
// well defined annotations
val objectsRDD = rdd.map(tuple => {
val maybeArticle =
try {
Some(JavaJsonUtils.fromJson(tuple._2, classOf[SearchableNewsArticle]))
}catch {
case e: Exception => None
}
(tuple._1,maybeArticle)
}
).filter(tuple => {tuple._2.isDefined && tuple._2.get.annotations.isDefined &&
tuple._2.get.annotations.get.entities.isDefined}).map(tuple => (tuple._1, tuple._2.get.annotations.get.entities.get))
// flat map the RDD of entities lists into a list of (entity text, (entity type, 1)) tuples
(line 79) val entityDataMap: RDD[(String, (String, Int))] = objectsRDD.flatMap(tuple => tuple._2.items.collect({
case item if (item.`type`.isDefined) && (item.text.isDefined) &&
(line 81)(params.entityTypes.contains(item.`type`.get)) => (cleanUpText(item.text.get), (item.`type`.get, 1))
}))
// bucketize the tuples RDD into entity text, List(entity_type, entity_count) to make count aggregation and file writeouts
// easier to follow
val finalResults: Array[(String, (String, Int))] = entityDataMap.reduceByKey((x, y) => (x._1, x._2+y._2)).collect()
val entityTypeMapping = Map(
"HealthCondition" -> "HEALTH_CONDITION",
"Drug" -> "DRUG",
"FieldTerminology" -> "FIELD_TERMINOLOGY"
)
for (finalTuple <- finalResults) {
val entityText = finalTuple._1
val entityType = finalTuple._2._1
if(entityTypeMapping.contains(entityType))
{
if(!Files.exists(Paths.get(entityTypeMapping.get(entityType).get+".txt"))){
val myFile = new java.io.FileOutputStream(new File(entityTypeMapping.get(entityType).get+".txt"),false)
printToFile(myFile) {p => p.println(entityTypeMapping.get(entityType))}
}
}
val myFile = new java.io.FileOutputStream(new File(entityTypeMapping.get(entityType).get+".txt"),true)
printToFile(myFile) {p => p.println(entityText)}
}
}
And the error message below:
java.lang.NullPointerException
at com.quid.gazetteers.GazetteerGenerator$$anonfun$4$$anonfun$apply$1.isDefinedAt(GazetteerGenerator.scala:81)
at com.quid.gazetteers.GazetteerGenerator$$anonfun$4$$anonfun$apply$1.isDefinedAt(GazetteerGenerator.scala:79)
at scala.collection.TraversableLike$$anonfun$collect$1.apply(TraversableLike.scala:278)
at scala.collection.immutable.List.foreach(List.scala:318)
at scala.collection.TraversableLike$class.collect(TraversableLike.scala:278)
at scala.collection.AbstractTraversable.collect(Traversable.scala:105)
at com.quid.gazetteers.GazetteerGenerator$$anonfun$4.apply(GazetteerGenerator.scala:79)
at com.quid.gazetteers.GazetteerGenerator$$anonfun$4.apply(GazetteerGenerator.scala:79)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:371)
at org.apache.spark.util.collection.ExternalSorter.insertAll(ExternalSorter.scala:189)
at org.apache.spark.shuffle.sort.SortShuffleWriter.write(SortShuffleWriter.scala:64)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
This question has been resolved. The params attribute was not serialized and so was not available to the Spark workers. The solution is to create a Spark broadcast variable in the scope where the params attribute is needed.
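A minimal sketch of that fix, using the question's own flatMap (entityTypesBc is an illustrative name; only the lines shown change):
// on the driver, before the transformations:
val entityTypesBc = sc.broadcast(params.entityTypes)
// inside the flatMap, read the broadcast value instead of touching params:
val entityDataMap: RDD[(String, (String, Int))] = objectsRDD.flatMap(tuple => tuple._2.items.collect({
case item if item.`type`.isDefined && item.text.isDefined &&
entityTypesBc.value.contains(item.`type`.get) => (cleanUpText(item.text.get), (item.`type`.get, 1))
}))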

joda DateTime formatting causes a null pointer error in Spark RDD functions

The exception message is as follows:
User class threw exception: Job aborted due to stage failure: Task 0 in stage 1.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1.0 (TID 11, 10.215.155.82): java.lang.NullPointerException
at org.joda.time.tz.CachedDateTimeZone.getInfo(CachedDateTimeZone.java:143)
at org.joda.time.tz.CachedDateTimeZone.getOffset(CachedDateTimeZone.java:103)
at org.joda.time.format.DateTimeFormatter.printTo(DateTimeFormatter.java:676)
at org.joda.time.format.DateTimeFormatter.printTo(DateTimeFormatter.java:521)
at org.joda.time.format.DateTimeFormatter.print(DateTimeFormatter.java:625)
at org.joda.time.base.AbstractDateTime.toString(AbstractDateTime.java:328)
at com.xxx.ieg.face.demo.DateTimeNullReferenceReappear$$anonfun$3$$anonfun$apply$1.apply(DateTimeNullReferenceReappear.scala:41)
at com.xxx.ieg.face.demo.DateTimeNullReferenceReappear$$anonfun$3$$anonfun$apply$1.apply(DateTimeNullReferenceReappear.scala:41)
at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:328)
at scala.collection.TraversableLike$$anonfun$groupBy$1.apply(TraversableLike.scala:327)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at org.apache.spark.util.collection.CompactBuffer$$anon$1.foreach(CompactBuffer.scala:113)
at scala.collection.IterableLike$class.foreach(IterableLike.scala:72)
at org.apache.spark.util.collection.CompactBuffer.foreach(CompactBuffer.scala:28)
at scala.collection.TraversableLike$class.groupBy(TraversableLike.scala:327)
at org.apache.spark.util.collection.CompactBuffer.groupBy(CompactBuffer.scala:28)
at com.xxx.ieg.face.demo.DateTimeNullReferenceReappear$$anonfun$3.apply(DateTimeNullReferenceReappear.scala:41)
at com.xxx.ieg.face.demo.DateTimeNullReferenceReappear$$anonfun$3.apply(DateTimeNullReferenceReappear.scala:40)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:1081)
at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:1081)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:744)
My code is as follows:
import org.apache.hadoop.conf.Configuration
import org.apache.spark.rdd.RDD
import org.apache.spark.SparkContext._
import org.apache.spark.{ SparkConf, SparkContext }
import org.joda.time.DateTime
import org.joda.time.format.{ DateTimeFormat, DateTimeFormatter }
object DateTimeNullReferenceReappear extends App {
case class Record(uin: String = "", date: DateTime = null, value: Double = 0.0)
val cfg = new Configuration
val sparkConf = new SparkConf()
sparkConf.setAppName("bourne_exception_reappear")
val sc = new SparkContext(sparkConf)
val data = TDWSparkContext.tdwTable( // this function just reads data from a data warehouse
sc,
tdwuser = FaceConf.TDW_USER,
tdwpasswd = FaceConf.TDW_PASSWORD,
dbName = "my_db",
tblName = "my_table",
parts = Array("p_20150323", "p_20150324", "p_20150325", "p_20150326", "p_20150327", "p_20150328", "p_20150329"))
.map(row => {
Record(uin = row(2),
date = DateTimeFormat.forPattern("yyyyMMdd").parseDateTime(row(0)),
value = row(4).toDouble)
}).map(x => (x.uin, (x.date, x.value)))
.groupByKey
.map(x => {
x._2.groupBy(_._1.toString("yyyyMMdd")).mapValues(_.map(_._2).sum) // throw exception here
})
// val data = TDWSparkContext.tdwTable( // It works, as I don't use the DateTime toString in the groupBy
// sc,
// tdwuser = FaceConf.TDW_USER,
// tdwpasswd = FaceConf.TDW_PASSWORD,
// dbName = "hy",
// tblName = "t_dw_cf_oss_tblogin",
// parts = Array("p_20150323", "p_20150324", "p_20150325", "p_20150326", "p_20150327", "p_20150328", "p_20150329"))
// .map(row => {
// Record(uin = row(2),
// date = DateTimeFormat.forPattern("yyyyMMdd").parseDateTime(row(0)),
// value = row(4).toDouble)
// }).map(x => (x.uin, (x.date.toString("yyyyMMdd"), x.value)))
// .groupByKey
// .map(x => {
// x._2.groupBy(_._1).mapValues(_.map(_._2).sum)
// })
data.take(10).map(println)
}
So it seems that calling toString in the groupBy causes the exception. Can anybody explain why?
Thanks
You need to either disable Kryo, use Kryo JodaTime Serializers, or avoid serializing the DateTime object, i.e. pass around Longs.
The problem here is bad serialization of Joda's CachedDateTimeZone - it includes a transient field that doesn't get serialized, remaining null in the deserialized object.
You can create and register your own Serializer that handles this object properly:
import com.esotericsoftware.kryo.Kryo;
import com.esotericsoftware.kryo.Serializer;
import com.esotericsoftware.kryo.io.Input;
import com.esotericsoftware.kryo.io.Output;
import org.joda.time.DateTimeZone;
import org.joda.time.tz.CachedDateTimeZone;
public class JodaCachedDateTimeZoneSerializer extends Serializer<CachedDateTimeZone> {
public JodaCachedDateTimeZoneSerializer() {
setImmutable(true);
}
@Override
public CachedDateTimeZone read(final Kryo kryo, final Input input, final Class<CachedDateTimeZone> type) {
// reconstruct from serialized ID:
final String id = input.readString();
return CachedDateTimeZone.forZone(DateTimeZone.forID(id));
}
@Override
public void write(final Kryo kryo, final Output output, final CachedDateTimeZone cached) {
// serialize ID only:
output.writeString(cached.getID());
}
}
Then, in your class extending KryoRegistrator, add:
kryo.register(classOf[CachedDateTimeZone], new JodaCachedDateTimeZoneSerializer())
This way you don't have to disable Kryo or refrain from using Joda.
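For reference, a minimal sketch of the wiring around that call (the registrator class name and the config values are illustrative, assuming the serializer above is on the classpath):
import com.esotericsoftware.kryo.Kryo
import org.apache.spark.SparkConf
import org.apache.spark.serializer.KryoRegistrator
import org.joda.time.tz.CachedDateTimeZone
class JodaKryoRegistrator extends KryoRegistrator {
  override def registerClasses(kryo: Kryo): Unit = {
    kryo.register(classOf[CachedDateTimeZone], new JodaCachedDateTimeZoneSerializer())
  }
}
// and point Spark at it when building the configuration:
val sparkConf = new SparkConf()
  .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
  .set("spark.kryo.registrator", "JodaKryoRegistrator")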
We don't know much about the "problem", so we can try the following experiment, which will let us see more about it.
Replace the following part:
map(x => {
x._2.groupBy(_._1.toString("yyyyMMdd")).mapValues(_.map(_._2).sum) // throw exception here
})
with this:
map( x => {
x._2.groupBy( t => {
// requires: import scala.util.{ Try, Success, Failure }
val dateStringTry = Try( t._1.toString( "yyyyMMdd" ) )
dateStringTry match {
case Success( dateString ) => Right( dateString )
case Failure( e ) => {
println( "=========== Null Tuple Description ==========" )
println( "Problem Tuple :: [" + t + "]" )
println( "Error Info :: [" + e.getMessage + "]" )
// finally the stack trace, if needed
// e.printStackTrace()
println( "=============================================" )
Left( e )
}
}
} )
} )
Let's check the result of running this experiment.
The issue seems to be that DateTime loses something when it is serialized in Spark (which happens a lot there, I guess). In my case the Chronology was messed up, which caused the same exception.
One really hacky workaround that worked for me is to recreate the DateTime just before using it, e.g.:
date.toMutableDateTime.toDateTime
This seems to restore whatever bits were missing, and everything works after that.
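In the code from the question, that would mean rebuilding the DateTime right inside the groupBy before formatting it; a sketch (untested against the original job):
.map(x => {
x._2.groupBy(_._1.toMutableDateTime.toDateTime.toString("yyyyMMdd")).mapValues(_.map(_._2).sum)
})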
The solution posted by Marius Soutier to disable Kryo also worked for me. This is a less hacky approach.
sparkConf.set("spark.serializer", "org.apache.spark.serializer.JavaSerializer");
Please refer to this -- https://issues.apache.org/jira/browse/SPARK-4170
Basically, you shouldn't extend scala.App in your main class; it may not work correctly in some cases. Use an explicit main() method instead (a sketch follows the warning below).
Here's the documented warning in the Spark 1.6.1 code (in the SparkSubmit class):
// SPARK-4170
if (classOf[scala.App].isAssignableFrom(mainClass)) {
printWarning("Subclasses of scala.App may not work correctly. Use a main() method instead.")
}
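For the code in the question, a minimal sketch of that change would keep the job body as-is but wrap it in an explicit main() instead of extends App (only the skeleton is shown):
import org.apache.spark.{SparkConf, SparkContext}
object DateTimeNullReferenceReappear {
  def main(args: Array[String]): Unit = {
    val sparkConf = new SparkConf().setAppName("bourne_exception_reappear")
    val sc = new SparkContext(sparkConf)
    // ... the rest of the job body from the question, unchanged ...
  }
}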