scala.MatchError on a tuple - scala

After processing some input data, I have an RDD[(String, String, Long)], say input.
input: org.apache.spark.rdd.RDD[(String, String, Long)] = MapPartitionsRDD[9] at flatMap at <console>:54
The string fields represent the vertices of a graph, and the long field is the weight of the edge.
To create a graph out of this, I first insert each vertex into a map with a unique id if the vertex is not already known. If it was already encountered, I use the vertex id that was assigned previously. Essentially, each vertex is assigned a unique id of type Long, and then I want to create Edges.
Here is what I am doing:
var vertexMap = collection.mutable.Map[String, Long]()
var vid : Long = 0 // global vertex id counter
var srcVid : Long = 0 // source vertex id
var dstVid : Long = 0 // destination vertex id
val graphEdges = input.map {
case Row(src: String, dst: String, weight: Long) => (
if (vertexMap.contains(src)) {
srcVid = vertexMap(src)
if (vertexMap.contains(dst)) {
dstVid = vertexMap(dst)
} else {
vid += 1 // pick a new vertex id
vertexMap += (dst -> vid)
dstVid = vid
}
Edge(srcVid, dstVid, weight)
} else {
vid += 1
vertexMap(src) = vid
srcVid = vid
if (vertexMap.contains(dst)) {
dstVid = vertexMap(dst)
} else {
vid += 1
vertexMap(dst) = vid
dstVid = vid
}
Edge(srcVid, dstVid, weight)
}
}
val graph = Graph.fromEdges(graphEdges, 0)
println("num edges = " + graph.numEdges);
println("num vertices = " + graph.numVertices);
What I see is
graphEdges is of type RDD[org.apache.spark.graphx.Edge[Long]] and graph is of type Graph[Int,Long]
graphEdges: org.apache.spark.rdd.RDD[org.apache.spark.graphx.Edge[Long]] = MapPartitionsRDD[10] at map at <console>:64
graph: org.apache.spark.graphx.Graph[Int,Long] = org.apache.spark.graphx.impl.GraphImpl@1b48170a
but I get the following error, while printing the graph's edge and vertex count.
org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 8.0 failed 1 times, most recent failure: Lost task 1.0 in stage 8.0 (TID 9, localhost, executor driver): scala.MatchError: (vertexA, vertexN, 2000) (of class scala.Tuple3)
at $anonfun$1.apply(<console>:64)
at $anonfun$1.apply(<console>:64)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:105)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
at org.apache.spark.rdd.RDD$$anonfun$mapPartitionsWithIndex$1$$anonfun$apply$26.apply(RDD.scala:844)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:336)
at org.apache.spark.rdd.RDD$$anonfun$8.apply(RDD.scala:334)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1038)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:969)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:1029)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:760)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
I don't understand where the mismatch is.
Thanks @Joe K for the helpful tip. I started using zipWithIndex and the code looks compact now, but graph instantiation still fails. Here is the updated code:
val vertices = input.map(r => r._1).union(input.map(r => r._2)).distinct.zipWithIndex
val graphEdges = input.map {
case (src, dst, weight) =>
Edge(vertices.lookup(src)(0), vertices.lookup(dst)(0), weight)
}
val graph = Graph.fromEdges(graphEdges, 0)
println("num edges = " + graph.numEdges);
So, from the original 3-tuples, I form the union of the first and second fields (which are the vertices), then assign a unique id to each after deduplicating them. I then use those ids when creating edges. However, it fails with the following exception:
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 15.0 failed 1 times, most recent failure: Lost task 0.0 in stage 15.0 (TID 23, localhost, executor driver): org.apache.spark.SparkException: This RDD lacks
a SparkContext. It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations; for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed
inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
at org.apache.spark.rdd.RDD.org$apache$spark$rdd$RDD$$sc(RDD.scala:89)
at org.apache.spark.rdd.RDD.withScope(RDD.scala:362)
at org.apache.spark.rdd.PairRDDFunctions.lookup(PairRDDFunctions.scala:937)
at $anonfun$1.apply(<console>:55)
at $anonfun$1.apply(<console>:53)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.graphx.EdgeRDD$$anonfun$1.apply(EdgeRDD.scala:107)
Any thoughts?

This specific error is coming from trying to match a tuple as a Row, which it is not.
Change:
case Row(src: String, dst: String, weight: Long) => {
to just:
case (src, dst, weight) => {
Also, your larger plan for generating vertex ids will not work. All of the logic inside the map will happen in parallel in different executors, which will have different copies of the mutable map.
You should flatMap your edges to get a list of all vertexes, then call .distinct.zipWithIndex to assign each vertex a single unique long value. You would then need to re-join with the original edges.
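The approach described above can be sketched roughly as follows. This is only a sketch, assuming input is the RDD[(String, String, Long)] from the question; VertexIdSketch and buildGraph are hypothetical names, and keeping the vertex names as vertex attributes is a choice made here for illustration. Two joins replace the per-record lookup, so no RDD is referenced inside another RDD's closure and no mutable state is shared between executors.

```scala
import org.apache.spark.graphx.{Edge, Graph}
import org.apache.spark.rdd.RDD

object VertexIdSketch {
  // Hypothetical helper: turn (src, dst, weight) triples into a GraphX graph.
  def buildGraph(input: RDD[(String, String, Long)]): Graph[String, Long] = {
    // One (name, id) pair per distinct vertex name.
    // cache() so the ids are normally computed once rather than re-derived per join.
    val vertexIds: RDD[(String, Long)] =
      input.flatMap { case (src, dst, _) => Seq(src, dst) }
        .distinct()
        .zipWithIndex()
        .cache()

    // Join on the source name first, then on the destination name.
    val edges: RDD[Edge[Long]] = input
      .map { case (src, dst, weight) => (src, (dst, weight)) }
      .join(vertexIds) // (src, ((dst, weight), srcId))
      .map { case (_, ((dst, weight), srcId)) => (dst, (srcId, weight)) }
      .join(vertexIds) // (dst, ((srcId, weight), dstId))
      .map { case (_, ((srcId, weight), dstId)) => Edge(srcId, dstId, weight) }

    // Keep the original names as vertex attributes (an illustrative choice).
    val vertices = vertexIds.map { case (name, id) => (id, name) }
    Graph(vertices, edges)
  }
}
```

If the set of distinct vertices is small enough to fit in memory, collecting it to the driver as a map and broadcasting it would be another option; the join version above makes no such assumption.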

Related

NullPointerException on DataFrame where

I have the following method written in Scala:
def fillEmptyCells: Unit = {
val hourIndex = _weather.schema.fieldIndex("Hour")
val dateIndex = _weather.schema.fieldIndex("Date")
val groundSurfaceIndex = _weather.schema.fieldIndex("GroundSurface")
val snowyGroundIndex = _weather.schema.fieldIndex("SnowyGroundSurface")
val precipitationIndex = _weather.schema.fieldIndex("catPrec")
val snowDepthIndex = _weather.schema.fieldIndex("catSnowDepth")
var resultDf : DataFrame = sparkSession.createDataFrame(sparkSession.sparkContext.emptyRDD[Row],_weather.schema)
val days = _weather.select("Date").distinct().rdd
_weather.where("Date = '2014-08-01'").show()
days.foreach(x => {
println(s"Date = '${x.getDate(0)}'")
_weather.where(s"Date = '${x.getDate(0)}'").show()
val day = _weather.where(s"Date = '${x.getDate(0)}'")
val dayValues = day.where("Hour = 6").first()
val grSur = dayValues.getString(groundSurfaceIndex)
val snSur = dayValues.getString(snowyGroundIndex)
val prec = dayValues.getString(precipitationIndex)
val snowDepth = dayValues.getString(snowDepthIndex)
val dayRddMapped = day.rdd.map(y => (y(0), y(1), grSur, snSur, y(4), y(5), y(6), y(7), prec, snowDepth))
.foreach(z => {
resultDf = resultDf.union(Seq(z).toDF())
})
})
resultDf.show(20)
Unit
}
The problem is the line _weather.where(s"Date = '${x.getDate(0)}'").show(), where the NullPointerException occurs. As the line above it shows, I print the where clause to the console (it looks like Date = '2014-06-03'), and the equivalent call just before the foreach, with one of those dates hard-coded, works fine. _weather is a class variable and does not change while this method is running. The debugger shows even stranger things: _weather becomes null after the first iteration.
What is the source of this magic and how can I avoid it?
Suggestions on architecture and code quality are also welcome.
Stacktrace:
java.lang.NullPointerException
at org.apache.spark.sql.Dataset.where(Dataset.scala:1344)
at org.[package].WeatherHelper$$anonfun$fillEmptyCells$1.apply(WeatherHelper.scala:148)
at org.[package].WeatherHelper$$anonfun$fillEmptyCells$1.apply(WeatherHelper.scala:146)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1336)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.rdd.RDD$$anonfun$foreach$1$$anonfun$apply$28.apply(RDD.scala:918)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.SparkContext$$anonfun$runJob$5.apply(SparkContext.scala:2062)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
at org.apache.spark.scheduler.Task.run(Task.scala:108)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:335)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
19/01/10 13:39:35 ERROR Executor: Exception in task 6.0 in stage 10.0 (TID 420)
The class name is WeatherHelper. This is just one part of the whole stack trace, which repeats about 20 times.
You cannot use DataFrames inside RDD code (you use _weather inside days.foreach). The DataFrame is null there because it only lives on the driver, not on the executors.
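The smallest change is to bring the distinct dates to the driver with collect() and loop over them there. A join avoids the per-day loop entirely; the following is a rough sketch of that idea, not the asker's code. It assumes the column names from the schema lookups in the question, assumes at most one Hour = 6 row per date, and assumes the positions overwritten in the original map correspond to these four named columns. WeatherFillSketch, fillFromSixAm and the _at6 suffix are hypothetical names.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

object WeatherFillSketch {
  // Copy the 06:00 values of four columns onto every row of the same date.
  def fillFromSixAm(weather: DataFrame): DataFrame = {
    val fillCols = Seq("GroundSurface", "SnowyGroundSurface", "catPrec", "catSnowDepth")

    // One row per date holding the 06:00 values, renamed to avoid name clashes.
    val sixAm = weather
      .where(col("Hour") === 6)
      .select((col("Date") +: fillCols.map(c => col(c).as(s"${c}_at6"))): _*)

    // Left join: dates without a 06:00 row end up with nulls in the filled columns.
    val joined = weather.join(sixAm, Seq("Date"), "left")

    // Overwrite each original column with its 06:00 value, then drop the helper column.
    fillCols.foldLeft(joined) { (df, c) =>
      df.withColumn(c, col(s"${c}_at6")).drop(s"${c}_at6")
    }
  }
}
```

Everything here is a DataFrame transformation planned on the driver, so no DataFrame is captured in a closure that runs on an executor.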

Spark java.lang.NullPointerException when using tuples

I am using the GraphX API for Spark to build a graph and process it with the Pregel API. The error does not happen if I return the argument tuple from the vprog function, but if I return a new tuple built from the same tuple, I get a NullPointerException.
Here is the relevant code:
val verticesRDD = cleanDtaDF.select("ChildHash", "DN").rdd.map(row => (row(0).toString.toLong, (row(1).toString.toDouble,row(0).toString.toLong)))
val edgesRDD = (rawDtaDF.select("ChildHash", "ParentHash", "dealer_code", "dealer_customer_number", "parent_dealer_cust_number").rdd
.map(row => Edge(row.get(0).toString.toLong, row.get(1).toString.toLong, (row(3) + " is a child of " + row(4), " when dealer is " + row.get(2)))))
val myGraph = Graph(verticesRDD, edgesRDD)
def vprog(vertexId: VertexId, vertexDTA:(Double, Long), msg: Double): (Double, Long) = {
(vertexDTA._1, vertexDTA._2)
}
val result = myGraph.pregel(0.0, 1, activeDirection = EdgeDirection.Out)(vprog,t => Iterator((t.dstId, t.srcAttr._2)),(x, y) => x + y)
The error does not happen if I make a simple change to vprog(...)--not access the tuples' members:
def vprog(vertexId: VertexId, vertexDTA:(Double, Long), msg: Double): (Double, Long) = {
vertexDTA
}
The error is
[Stage 101:> (0 + 0) / 200][Stage 102:> (0 + 4) / 200]18/03/10 20:43:16 WARN scheduler.TaskSetManager: Lost task 1.0 in stage 102.0 (TID 5959, ue1lslaved25.na.aws.cat.com, executor 146): java.lang.NullPointerException
at $line69.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw.vprog(<console>:60)
at $line70.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(<console>:75)
at $line70.$read$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$iw$$anonfun$2.apply(<console>:75)
at org.apache.spark.graphx.Pregel$$anonfun$1.apply(Pregel.scala:125)
at org.apache.spark.graphx.Pregel$$anonfun$1.apply(Pregel.scala:125)
at org.apache.spark.graphx.impl.VertexPartitionBaseOps.map(VertexPartitionBaseOps.scala:61)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$5.apply(GraphImpl.scala:129)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$5.apply(GraphImpl.scala:129)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:409)
at org.apache.spark.storage.memory.MemoryStore.putIteratorAsValues(MemoryStore.scala:216)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:988)
at org.apache.spark.storage.BlockManager$$anonfun$doPutIterator$1.apply(BlockManager.scala:979)
at org.apache.spark.storage.BlockManager.doPut(BlockManager.scala:919)
at org.apache.spark.storage.BlockManager.doPutIterator(BlockManager.scala:979)
at org.apache.spark.storage.BlockManager.getOrElseUpdate(BlockManager.scala:697)
at org.apache.spark.rdd.RDD.getOrCompute(RDD.scala:334)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:285)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:96)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:53)
at org.apache.spark.scheduler.Task.run(Task.scala:99)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:322)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
This issue has a simple explanation. It is not related to Spark or GraphX.
Consider the function (with the irrelevant parameters stripped from the original):
def vprog(vertexDTA:(Double, Long)): (Double, Long) = {
(vertexDTA._1, vertexDTA._2)
}
If the arg vertexDTA is null, both vertexDTA._1 and vertexDTA._2 will throw NullPointerException.
If we change the function to
def vprog(vertexDTA:(Double, Long)): (Double, Long) = {
vertexDTA
}
then a null argument is simply returned: no tuple member is accessed, so there is no NPE.
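The difference can be reproduced without Spark. The sketch below uses hypothetical names (NullTupleSketch, rebuild, passThrough, rebuildOrDefault) and adds a defensive variant that is not part of the original answer. One way a null attribute can arise in GraphX is a vertex id that appears only in the edge RDD: it receives the default vertex attribute, which is null unless one is passed to Graph(...).

```scala
object NullTupleSketch {
  // Rebuilds the tuple, so it dereferences the argument and fails on null.
  def rebuild(vertexDTA: (Double, Long)): (Double, Long) =
    (vertexDTA._1, vertexDTA._2)

  // Returns the argument untouched, so a null passes straight through.
  def passThrough(vertexDTA: (Double, Long)): (Double, Long) =
    vertexDTA

  // Defensive variant (an addition for illustration): use a default when the attribute is null.
  def rebuildOrDefault(vertexDTA: (Double, Long), default: (Double, Long)): (Double, Long) =
    Option(vertexDTA).map(v => (v._1, v._2)).getOrElse(default)
}
```

Calling rebuild with a null argument throws a NullPointerException, while passThrough returns the null and only defers the problem to whatever reads the attribute next.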

GraphX VertexRDD NullPointerException

I am trying to do some message passing on a graph to calculate recursive features.
I get an error when I define a graph whose vertices are the output of aggregateMessages. Code for context:
> val newGraph = Graph(newVertices, edges)
newGraph: org.apache.spark.graphx.Graph[List[Double],Int] = org.apache.spark.graphx.impl.GraphImpl@2091594b
//This is the RDD that causes the problem
> val result = newGraph.aggregateMessages[List[Double]](
{triplet => triplet.sendToDst(triplet.srcAttr)},
{(a,b) => a.zip(b).map { case (x, y) => x + y }},
{TripletFields.Src})
result: org.apache.spark.graphx.VertexRDD[List[Double]] = VertexRDDImpl[1990] at RDD at VertexRDD.scala:57
> result.take(1)
res121: Array[(org.apache.spark.graphx.VertexId, List[Double])] = Array((1944425548,List(0.0, 0.0, 137.0, 292793.0)))
So far no problem, but when I try
> val newGraph2 = Graph(result, edges)
newGraph2: org.apache.spark.graphx.Graph[List[Double],Int] = org.apache.spark.graphx.impl.GraphImpl@710919e1
> val result2 = newGraph2.aggregateMessages[List[Double]](
{triplet => triplet.sendToDst(triplet.srcAttr)},
{(a,b) => a.zip(b).map { case (x, y) => x + y }},
{TripletFields.Src})
> result2.count
I get the following (trimmed) error:
result2: org.apache.spark.graphx.VertexRDD[List[Double]] = VertexRDDImpl[2009] at RDD at VertexRDD.scala:57
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 4839.0 failed 4 times, most recent failure: Lost task 0.3 in stage 4839.0 (TID 735, 10.0.2.15): java.lang.NullPointerException
at $anonfun$2.apply(<console>:62)
at $anonfun$2.apply(<console>:62)
at org.apache.spark.graphx.impl.AggregatingEdgeContext.send(EdgePartition.scala:536)
at org.apache.spark.graphx.impl.AggregatingEdgeContext.sendToDst(EdgePartition.scala:531)
at $anonfun$1.apply(<console>:61)
at $anonfun$1.apply(<console>:61)
at org.apache.spark.graphx.impl.EdgePartition.aggregateMessagesEdgeScan(EdgePartition.scala:409)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$13$$anonfun$apply$3.apply(GraphImpl.scala:237)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$13$$anonfun$apply$3.apply(GraphImpl.scala:207)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
...
Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1450)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1438)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1437)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
...
Caused by: java.lang.NullPointerException
at $anonfun$2.apply(<console>:62)
at $anonfun$2.apply(<console>:62)
at org.apache.spark.graphx.impl.AggregatingEdgeContext.send(EdgePartition.scala:536)
at org.apache.spark.graphx.impl.AggregatingEdgeContext.sendToDst(EdgePartition.scala:531)
at $anonfun$1.apply(<console>:61)
at $anonfun$1.apply(<console>:61)
at org.apache.spark.graphx.impl.EdgePartition.aggregateMessagesEdgeScan(EdgePartition.scala:409)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$13$$anonfun$apply$3.apply(GraphImpl.scala:237)
at org.apache.spark.graphx.impl.GraphImpl$$anonfun$13$$anonfun$apply$3.apply(GraphImpl.scala:207)
at scala.collection.Iterator$$anon$12.nextCur(Iterator.scala:434)
at scala.collection.Iterator$$anon$12.hasNext(Iterator.scala:440)
at org.apache.spark.shuffle.sort.BypassMergeSortShuffleWriter.write(BypassMergeSortShuffleWriter.java:125)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:79)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:47)
at org.apache.spark.scheduler.Task.run(Task.scala:85)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:274)
... 3 more
I don't think this is a type mismatch error, because aggregateMessages returns a VertexRDD. Any ideas why I am getting this problem?
aggregateMessages does not return every node in the graph, only the ones that receive a message. The edges in newGraph2 still point at the nodes that are missing from result, and because no default vertex attribute is given in the graph definition, those nodes get a null attribute. That null is what causes the NullPointerException.
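Two possible ways around this are sketched below, assuming the List[Double] vertex attributes and Int edge attributes from the question. MissingVertexSketch, withDefault, keepPrevious and the width parameter are hypothetical names; whether a list of zeros or the previous attribute is the right fallback depends on the features being computed.

```scala
import org.apache.spark.graphx.{Edge, Graph, VertexId}
import org.apache.spark.rdd.RDD

object MissingVertexSketch {
  // Option 1: pass a default attribute, used for vertices that appear only in the edges.
  def withDefault(
      messages: RDD[(VertexId, List[Double])],
      edges: RDD[Edge[Int]],
      width: Int): Graph[List[Double], Int] = {
    val zeros = List.fill(width)(0.0) // assumed neutral value for the sum in the merge function
    Graph(messages, edges, zeros)
  }

  // Option 2: keep the previous graph and only overwrite vertices that received a message.
  def keepPrevious(
      graph: Graph[List[Double], Int],
      messages: RDD[(VertexId, List[Double])]): Graph[List[Double], Int] =
    graph.outerJoinVertices(messages) { (_, previous, received) =>
      received.getOrElse(previous)
    }
}
```

With the names from the question, this would be used as MissingVertexSketch.withDefault(result, edges, 4) in place of Graph(result, edges), given that the sample output shows four values per vertex.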

Can't access broadcast variable in transformation

I'm having problems accessing a variable from inside a transformation function. Could someone help me out?
Here are my relevant classes and functions.
@SerialVersionUID(889949215L)
object MyCache extends Serializable {
@transient lazy val logger = Logger(getClass.getName)
@volatile var cache: Broadcast[Map[UUID, Definition]] = null
def getInstance(sparkContext: SparkContext) : Broadcast[Map[UUID, Definition]] = {
if (cache == null) {
synchronized {
val map = sparkContext.cassandraTable("keyspace", "table")
.collect()
.map(m => m.getUUID("id") ->
Definition(m.getString("c1"), m.getString("c2"), m.getString("c3"),
m.getString("c4"))).toMap
cache = sparkContext.broadcast(map)
}
}
cache
}
}
In a different file:
object Processor extends Serializable {
@transient lazy val logger = Logger(getClass.getName)
def processData[T: ClassTag](rawStream: DStream[(String, String)], ssc: StreamingContext,
processor: (String, Broadcast[Map[UUID, Definition]]) => T): DStream[T] = {
MyCache.getInstance(ssc.sparkContext)
var newCacheValues = Map[UUID, Definition]()
rawStream.cache()
rawStream
.transform(rdd => {
val array = rdd.collect()
array.foreach(r => {
val value = getNewCacheValue(r._2, rdd.context)
if (value.isDefined) {
newCacheValues = newCacheValues + value.get
}
})
rdd
})
if (newCacheValues.nonEmpty) {
logger.info(s"Rebroadcasting. There are ${newCacheValues.size} new values")
logger.info("Destroying old cache")
MyCache.cache.destroy()
// this is probably wrong here, destroying object, but then referencing it. But I haven't gotten to this part yet.
MyCache.cache = ssc.sparkContext.broadcast(MyCache.cache.value ++ newCacheValues)
}
rawStream
.map(r => {
println("######################")
println(MyCache.cache.value)
r
})
.map(r => processor(r._2, MyCache.cache.value))
.filter(r => null != r)
}
}
Every time I run this I get SparkException: Failed to get broadcast_1_piece0 of broadcast_1 when trying to access cache.value.
When I add println(MyCache.cache.value) right after the .getInstance call, I am able to access the broadcast variable. But when I deploy to a Mesos cluster, I again cannot access the broadcast values; this time it fails with a NullPointerException.
Update:
The error I'm seeing is on println(MyCache.cache.value). I shouldn't have added this if statement containing the destroy, because my tests are never hitting that.
The basics of my application are as follows. I have a table in Cassandra that won't be updated very often, but I need to validate some streaming data against it. So I want to pull all the data from this rarely updated table into memory. getInstance pulls the whole table in on startup, and then I check all my streaming data to see if I need to pull from Cassandra again (which I will have to do very rarely). The transform and collect is where I check whether I need to pull new data in. Since there is a chance that my table will be updated, I will need to update the broadcast occasionally. So my idea was to destroy it and then rebroadcast. I will update that once I get the other stuff working.
I get the same error if I comment out the destroy and rebroadcast.
Another update:
I need to access the broadcast variable in processor, on this line: .map(r => processor(r._2, MyCache.cache.value)).
I'm able to broadcast the variable in the transform, and if I do println(MyCache.cache.value) in the transform, then all my tests pass and I'm then able to access the broadcast in processor.
Update:
rawStream
.map(r => {
println("$$$$$$$$$$$$$$$$$$$")
println(metrics.value)
r
})
This is the stack trace I get when it hits this line.
ERROR org.apache.spark.executor.Executor - Exception in task 0.0 in stage 135.0 (TID 114)
java.io.IOException: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1222)
at org.apache.spark.broadcast.TorrentBroadcast.readBroadcastBlock(TorrentBroadcast.scala:165)
at org.apache.spark.broadcast.TorrentBroadcast._value$lzycompute(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast._value(TorrentBroadcast.scala:64)
at org.apache.spark.broadcast.TorrentBroadcast.getValue(TorrentBroadcast.scala:88)
at org.apache.spark.broadcast.Broadcast.value(Broadcast.scala:70)
at com.uptake.readings.ingestion.StreamProcessors$$anonfun$processIncomingKafkaData$4.apply(StreamProcessors.scala:160)
at com.uptake.readings.ingestion.StreamProcessors$$anonfun$processIncomingKafkaData$4.apply(StreamProcessors.scala:158)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:370)
at scala.collection.Iterator$$anon$13.hasNext(Iterator.scala:414)
at org.apache.spark.storage.MemoryStore.unrollSafely(MemoryStore.scala:284)
at org.apache.spark.CacheManager.putInBlockManager(CacheManager.scala:171)
at org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:78)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:268)
at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:38)
at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:306)
at org.apache.spark.rdd.RDD.iterator(RDD.scala:270)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:73)
at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
at org.apache.spark.scheduler.Task.run(Task.scala:89)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
at java.lang.Thread.run(Thread.java:745)
Caused by: org.apache.spark.SparkException: Failed to get broadcast_1_piece0 of broadcast_1
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1$$anonfun$2.apply(TorrentBroadcast.scala:138)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply$mcVI$sp(TorrentBroadcast.scala:137)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$org$apache$spark$broadcast$TorrentBroadcast$$readBlocks$1.apply(TorrentBroadcast.scala:120)
at scala.collection.immutable.List.foreach(List.scala:381)
at org.apache.spark.broadcast.TorrentBroadcast.org$apache$spark$broadcast$TorrentBroadcast$$readBlocks(TorrentBroadcast.scala:120)
at org.apache.spark.broadcast.TorrentBroadcast$$anonfun$readBroadcastBlock$1.apply(TorrentBroadcast.scala:175)
at org.apache.spark.util.Utils$.tryOrIOException(Utils.scala:1219)
... 24 more
[Updated answer]
You're getting an error because the code inside rawStream.map, i.e. MyCache.cache.value, is executed on one of the executors, and there MyCache.cache is still null!
When you called MyCache.getInstance, it created the MyCache.cache value on the driver and broadcast it correctly. But you're not referring to that same object in your map function, so it doesn't get sent over to the executors. Instead, since you refer to MyCache directly, the executors read MyCache.cache on their own copy of the MyCache object, and there it is null.
You can get this to work as expected by first obtaining the broadcast object on the driver and then using that object in the map. The following code should work for you:
val cache = MyCache.getInstance(ssc.sparkContext)
rawStream.map(r => {
println(cache.value)
r
})

Bigram frequency of query logs events with Apache Spark

I would like to study user actions within sessions extracted from search engine query logs. I first define two kinds of actions: queries and clicks.
sealed trait Action{}
case class Query(val input:String) extends Action
case class Click(val link:String) extends Action
Suppose that the first action in the query log is given by the following timestamp in milliseconds:
val t0 = 1417444964686L // 2014-12-01 15:42:44
Let's define a corpus of temporally ordered actions associated with session ids.
val query_log:Array[(String, (Action, Long))] = Array (
("session1",(Query("query1"),t0)),
("session1",(Click("link1") ,t0+1000)),
("session1",(Click("link2") ,t0+2000)),
("session1",(Query("query2"),t0+3000)),
("session1",(Click("link3") ,t0+4000)),
("session2",(Query("query3"),t0+5000)),
("session2",(Click("link4") ,t0+6000)),
("session2",(Query("query4"),t0+7000)),
("session2",(Query("query5"),t0+8000)),
("session2",(Click("link5") ,t0+9000)),
("session2",(Click("link6") ,t0+10000)),
("session3",(Query("query6"),t0+11000))
)
And we create an RDD for this query_log:
import org.apache.spark.rdd.RDD
var logs:RDD[(String, (Action, Long))] = sc.makeRDD(query_log)
The logs are then grouped by session ids
val sessions_groups:RDD[(String, Iterable[(Action, Long)])] = logs.groupByKey().cache()
Now, we want to study Action cooccurrences within a session, for example, the number of rewritings in a session. We then define the class Cooccurrences, which will be initialized from session actions.
case class Cooccurrences(
var numQueriesWithClicks:Int = 0,
var numQueries:Int = 0,
var numRewritings:Int = 0,
var numQueriesBeforeClicks:Int = 0
) {
// The cooccurrence object is initialized from a list of timestamped action in order to catch a session group
def initFromActions(actions:Iterable[(Action, Long)]) = {
// 30 seconds is the maximal time (in milliseconds) between two queries (q1, q2) to consider q2 a rewriting of q1
var thirtySeconds = 30000
var hasClicked = false
var hasRewritten = false
// in the observed action sequence, we extract consecutive (sliding(2)) actions sorted by timestamp
// for each bigram in the sequence we want to count and modify the cooccurrence object
actions.toSeq.sortBy(_._2).sliding(2).foreach{
// case Seq(l0) => // session with only one Action
case Seq((e1:Click, t0)) => { // click without any query
numQueries = 0
}
case Seq((e1:Query, t0)) => { // query without any click
numQueries = 1
numQueriesBeforeClicks = 1
}
// case Seq(l0, l1) => // session with at least two Actions
case Seq((e1:Click, t0), (e2:Query, t1)) => { // a click followed by a query
if(! hasClicked)
numQueriesBeforeClicks = numQueries
hasClicked = true
}
case Seq((e1:Click, t0), (e2:Click, t1)) => { // two consecutive clicks
if(! hasClicked)
numQueriesBeforeClicks = numQueries
hasClicked = true
}
case Seq((e1:Query, t0), (e2:Click, t1)) => { // a query followed by a click
numQueries += 1
if(! hasClicked)
numQueriesBeforeClicks = numQueries
hasClicked = true
numQueriesWithClicks +=1
}
case Seq((e1:Query, t0), (e2:Query, t1)) => { // two consecutive queries
val dt = t1 - t0
numQueries += 1
if(dt < thirtySeconds && e1.input != e2.input){
hasRewritten = true
numRewritings += 1
}
}
}
}
}
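The single-element patterns in initFromActions only make sense given how sliding(2) treats short sequences. The following plain-Scala sketch illustrates that behaviour; SlidingSketch and the string stand-ins for timestamped actions are hypothetical and not part of the original code.

```scala
object SlidingSketch {
  // Hypothetical stand-ins for the timestamped actions of one session.
  val oneAction: Seq[String] = Seq("query6")
  val threeActions: Seq[String] = Seq("query1", "link1", "link2")

  // A single-element sequence yields exactly one window, of length 1,
  // which is what the `case Seq(single)` patterns are there to catch.
  val shortWindows: List[Seq[String]] = oneAction.sliding(2).toList

  // A longer sequence yields overlapping pairs (bigrams).
  val pairWindows: List[Seq[String]] = threeActions.sliding(2).toList
}
```

Here shortWindows holds the single window Seq("query6"), and pairWindows holds the two bigrams (query1, link1) and (link1, link2).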
Now, let's try to compute an RDD of Cooccurrences for each session:
val session_cooc_stats:RDD[Cooccurrences] = sessions_groups.map{
case (sessionId, actions) => {
var coocs = Cooccurrences()
coocs.initFromActions(actions)
coocs
}
}
Unfortunately, it raises the following MatchError:
scala> session_cooc_stats.take(2)
15/02/06 22:50:08 ERROR Executor: Exception in task 0.0 in stage 1.0 (TID 4) scala.MatchError: List((Query(query3),1417444969686), (Click(link4),1417444970686)) (of class scala.collection.immutable.$colon$colon) at $line25.$read$$iwC$$iwC$Cooccurrences$$anonfun$initFromActions$2.apply(<console>:29)
at $line25.$read$$iwC$$iwC$Cooccurrences$$anonfun$initFromActions$2.apply(<console>:29)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at $line25.$read$$iwC$$iwC$Cooccurrences.initFromActions(<console>:29)
at $line28.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:31)
at $line28.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:28)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
at scala.collection.generic.Growable$class.$plus$plus$eq(Growable.scala:48)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:103)
at scala.collection.mutable.ArrayBuffer.$plus$plus$eq(ArrayBuffer.scala:47)
at scala.collection.TraversableOnce$class.to(TraversableOnce.scala:273)
at scala.collection.AbstractIterator.to(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toBuffer(TraversableOnce.scala:265)
at scala.collection.AbstractIterator.toBuffer(Iterator.scala:1157)
at scala.collection.TraversableOnce$class.toArray(TraversableOnce.scala:252)
at scala.collection.AbstractIterator.toArray(Iterator.scala:1157)
at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:1081)
at org.apache.spark.rdd.RDD$$anonfun$26.apply(RDD.scala:1081)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.SparkContext$$anonfun$runJob$4.apply(SparkContext.scala:1314)
at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:61)
at org.apache.spark.scheduler.Task.run(Task.scala:56)
at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:196)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
at java.lang.Thread.run(Thread.java:745)
15/02/06 22:50:08 WARN TaskSetManager: Lost task 0.0 in stage 1.0 (TID 4, localhost): scala.MatchError: List((Query(query3),1417444969686), (Click(link4),1417444970686)) (of class scala.collection.immutable.$colon$colon)
at $line25.$read$$iwC$$iwC$Cooccurrences$$anonfun$initFromActions$2.apply(<console>:29)
at $line25.$read$$iwC$$iwC$Cooccurrences$$anonfun$initFromActions$2.apply(<console>:29)
...
If I build my own action list equivalent to the first group in session_cooc_stats RDD
val actions:Iterable[(Action, Long)] = Array(
(Query("query1"),t0),
(Click("link1") ,t0+1000),
(Click("link2") ,t0+2000),
(Query("query2"),t0+3000),
(Click("link3") ,t0+4000)
)
I get the expected result
var c = Cooccurrences()
c.initFromActions(actions)
// c == Cooccurrences(2,2,0,1)
Something seems wrong when I build a Cooccurrences object from an RDD.
It seems linked to the CompactBuffer built by groupByKey().
What is missing?
I am new to Spark and Scala.
Thanks in advance for your help.
Thomas
As you advised, I rewrote the code in IntelliJ and created a companion object for the main function.
Surprisingly, the code compiles (with sbt) and runs flawlessly.
However, I don't really understand why the compiled code runs whereas it doesn't work in spark-shell.
Thank you for your answer!
I set up your code in IntelliJ.
I created one class each for Action, Query, Click and Cooccurrences,
and put the rest of your code in a main method.
val sessions_groups:RDD[(String, Iterable[(Action, Long)])] = logs.groupByKey().cache()
val session_cooc_stats:RDD[Cooccurrences] = sessions_groups.map{
case (sessionId, actions) => {
val coocs = Cooccurrences()
coocs.initFromActions(actions)
coocs
}
}
session_cooc_stats.take(2).foreach(println(_))
I only changed var coocs to val coocs.
I guess that is the point.
Cooccurrences(0,1,0,1)
Cooccurrences(2,3,1,1)