insert sequence into dataframe returning NullPointerException [duplicate] - scala

sessionIdList is of type :
scala> sessionIdList
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
When I try to run below code :
val x = sc.parallelize(List(1,2,3))
val cartesianComp = x.cartesian(x).map(x => (x))
val kDistanceNeighbourhood = sessionIdList.map(s => {
cartesianComp.filter(v => v != null)
})
kDistanceNeighbourhood.take(1)
I receive exception :
14/05/21 16:20:46 ERROR Executor: Exception in task ID 80
java.lang.NullPointerException
at org.apache.spark.rdd.RDD.filter(RDD.scala:261)
at $line94.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:38)
at $line94.$read$$iwC$$iwC$$iwC$$iwC$$anonfun$1.apply(<console>:36)
at scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
at scala.collection.Iterator$$anon$10.next(Iterator.scala:312)
at scala.collection.Iterator$class.foreach(Iterator.scala:727)
However if I use :
val l = sc.parallelize(List("1","2"))
val kDistanceNeighbourhood = l.map(s => {
cartesianComp.filter(v => v != null)
})
kDistanceNeighbourhood.take(1)
Then no exception is displayed
The difference between the two code snippets is that in first snippet sessionIdList is of type :
res19: org.apache.spark.rdd.RDD[String] = MappedRDD[17] at distinct at <console>:30
and in second snippet "l" is of type
scala> l
res13: org.apache.spark.rdd.RDD[String] = ParallelCollectionRDD[32] at parallelize at <console>:12
Why is this error occurring?
Do I need to convert sessionIdList to ParallelCollectionRDD in order to fix this ?

Spark doesn't support nesting of RDDs (see https://stackoverflow.com/a/14130534/590203 for another occurrence of the same problem), so you can't perform transformations or actions on RDDs inside of other RDD operations.
In the first case, you're seeing a NullPointerException thrown by the worker when it tries to access a SparkContext object that's only present on the driver and not the workers.
In the second case, my hunch is the job was run locally on the driver and worked purely by accident.

It's a reasonable question, and I have heard it asked enough times that I'm going to take a stab at explaining why this is true, because it might help.
Nested RDDs will always throw an exception in production. Nested function calls as I think you are describing them here, if that means calling an RDD operation inside an RDD operation, will also cause failures, since it is actually the same thing. (RDDs are immutable, so performing an RDD operation such as a "map" is equivalent to creating a new RDD.) The inability to create nested RDDs is a necessary consequence of the way an RDD is defined and the way the Spark application is set up.
An RDD is a distributed collection of objects, split into chunks (called partitions) that live on the Spark executors. Spark executors cannot communicate with each other, only with the Spark driver. The RDD operations are all computed in pieces on these partitions. Because the RDD's executor environment isn't recursive (i.e. you cannot configure a Spark driver to run on a Spark executor with sub-executors), neither is an RDD.
In your program, you have created a distributed collection of partitions of integers. You are then performing a mapping operation. When the Spark driver sees a mapping operation, it sends the instructions to do the mapping to the executors, which perform the transformation on each partition in parallel. But your mapping cannot be done, because on each partition you are trying to call the "whole RDD" to perform another distributed operation. This cannot be done, because each partition does not have access to the information on the other partitions; if it did, the computation couldn't run in parallel.
What you can do instead, because the data you need in the map is probably small (since you are doing a filter, and the filter does not require any information about sessionIdList) is to first filter the session ID list. Then collect that list to the driver. Then broadcast it to the executors, where you can use it in the map. If the sessionID list is too large, you will probably need to do a join.
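The collect-then-broadcast approach described above can be sketched roughly as follows. This is a minimal sketch, not the asker's actual job: it assumes a local master so it runs standalone, and `sessionIds`, `pairs`, and `pairsBc` are illustrative names with made-up data.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastInsteadOfNesting {
  def run(): Map[String, Int] = {
    // Assumption: local master, used only so the sketch is self-contained.
    val spark = SparkSession.builder().master("local[2]").appName("broadcast-sketch").getOrCreate()
    val sc = spark.sparkContext
    try {
      val x = sc.parallelize(List(1, 2, 3))
      // Filter first, then bring the (small) result back to the driver.
      val pairs: Array[(Int, Int)] = x.cartesian(x).filter(v => v != null).collect()
      // Ship one read-only copy of the array to every executor.
      val pairsBc = sc.broadcast(pairs)

      // Hypothetical stand-in for sessionIdList.
      val sessionIds = sc.parallelize(List("a", "b", "a")).distinct()

      // The closure now touches a plain array, not an RDD.
      sessionIds.map(s => (s, pairsBc.value.length)).collect().toMap
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

If the collected data does not fit comfortably in driver and executor memory, a join is the safer route, as noted above.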

Related

checkpointing / persisting / shuffling does not seem to 'short circuit' the lineage of an rdd as detailed in 'learning spark' book

In learning Spark, I read the following:
In addition to pipelining, Spark’s internal scheduler may truncate the lineage of the RDD graph if an existing RDD has already been persisted in cluster memory or on disk. Spark can “short-circuit” in this case and just begin computing based on the persisted RDD. A second case in which this truncation can happen is when an RDD is already materialized as a side effect of an earlier shuffle, even if it was not explicitly persist()ed. This is an under-the-hood optimization that takes advantage of the fact that Spark shuffle outputs are written to disk, and exploits the fact that many times portions of the RDD graph are recomputed.
So, I decided to try to see this in action with a simple program (below):
val pairs = spark.sparkContext.parallelize(List((1,2)))
val x = pairs.groupByKey()
x.toDebugString // before collect
x.collect()
x.toDebugString // after collect
spark.sparkContext.setCheckpointDir("/tmp")
// try both checkpointing and persisting to disk to cut lineage
x.checkpoint()
x.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
x.collect()
x.toDebugString // after checkpoint
I did not see what I expected after reading the above paragraph from the Spark book. I saw the exact same output of toDebugString each time I invoked this method -- each time indicating two stages (where I would have expected only one stage after the checkpoint was supposed to have truncated the lineage.) like this:
scala> x.toDebugString // after collect
res5: String =
(8) ShuffledRDD[1] at groupByKey at <console>:25 []
+-(8) ParallelCollectionRDD[0] at parallelize at <console>:23 []
I am wondering if the key thing that I overlooked might be the word "may", as in "the scheduler MAY truncate the lineage". Is this truncation something that might happen with the same program that I wrote above, under other circumstances? Or is the little program that I wrote not doing the right thing to force the lineage truncation? Thanks in advance for any insight you can provide!
I think you should call persist/checkpoint before the first collect.
The output you get looks correct to me for that code: when Spark runs the first collect, it does not yet know that it should persist or save anything.
You probably also need to keep the result of x.persist and use that afterwards.
I suggest trying this:
val pairs = spark.sparkContext.parallelize(List((1,2)))
val x = pairs.groupByKey()
// the checkpoint directory must be set before checkpoint() is called
spark.sparkContext.setCheckpointDir("/tmp")
// try both checkpointing and persisting to disk to cut lineage
x.checkpoint()
x.persist(org.apache.spark.storage.StorageLevel.DISK_ONLY)
// **Also maybe do val xx = x.persist(...) and use xx later.**
x.toDebugString // before collect
x.collect()
x.toDebugString // after collect
x.collect()
x.toDebugString // after checkpoint
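To check the ordering suggested above in isolation, here is a small sketch that only checkpoints (no persist) and compares the debug strings around the first action. It assumes a local master and a temporary checkpoint directory; the object and method names are made up for illustration. After the action the second string should show a checkpoint RDD in place of the original parent, but the exact text may differ between Spark versions.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

object CheckpointBeforeAction {
  // Returns (isCheckpointed, lineage before the action, lineage after the action).
  def run(): (Boolean, String, String) = {
    // Assumption: local master, used only so the sketch is self-contained.
    val spark = SparkSession.builder().master("local[2]").appName("checkpoint-sketch").getOrCreate()
    val sc = spark.sparkContext
    try {
      // The checkpoint directory has to be set before checkpoint() is called.
      sc.setCheckpointDir(Files.createTempDirectory("spark-checkpoint").toString)
      val x = sc.parallelize(List((1, 2))).groupByKey()
      x.checkpoint()                 // only marks the RDD; nothing is written yet
      val before = x.toDebugString
      x.collect()                    // first action: runs the job, then writes the checkpoint
      val after = x.toDebugString
      (x.isCheckpointed, before, after)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = {
    val (done, before, after) = run()
    println(s"checkpointed: $done\n--- before ---\n$before\n--- after ---\n$after")
  }
}
```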

Spark aggregateByKey reduceByKey - aggregation (e.g collection) must be thread safe?

This probably sounds basic. If I do aggregateByKey or reduceByKey, and I aggregate a specific implementation of a collection. Do I need to use a thread safe collection during this aggregation ?
Is this OK ?
val sc: SparkContext = ???
val notAggregated = Seq(((1), 100),((1), 200),((1), 300),((2), 100),((2), 200))
sc.parallelize(notAggregated)
.aggregateByKey(mutable.HashSet.empty[Int])(
seqOp = (set, member) => set += member,
combOp = (set1, set2) => set1 ++= set2)
.foreach(println(_))
It doesn't have to be thread-safe.
It uses combineByKey in the background. If you look at the Spark source code, for example PairDStreamFunctions.groupByKeyAndWindow, it uses an ArrayBuffer as a combiner. See also the comment in the source, which says why this is thread-safe.
Why?
You are not handing over the zero value to be used directly; in effect you are telling Spark how to create a combiner. Spark then creates a combiner in each partition (it clones the value for each partition). A combiner in one partition is used like a normal object, without parallel access from many threads in the application, because the objects in one partition are processed sequentially.
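A runnable variant of the question's snippet, as a rough sketch: it assumes a local master and three partitions (both are choices made here, not part of the question) and converts the mutable sets to immutable ones before collecting so the result is easy to compare.

```scala
import scala.collection.mutable
import org.apache.spark.sql.SparkSession

object AggregateIntoMutableSet {
  def run(): Map[Int, Set[Int]] = {
    // Assumption: local master, used only so the sketch is self-contained.
    val spark = SparkSession.builder().master("local[2]").appName("aggregate-sketch").getOrCreate()
    val sc = spark.sparkContext
    try {
      val notAggregated = Seq((1, 100), (1, 200), (1, 300), (2, 100), (2, 200))
      sc.parallelize(notAggregated, numSlices = 3)
        // A plain, non-thread-safe mutable set is used as the zero value.
        .aggregateByKey(mutable.HashSet.empty[Int])(
          (set, member) => set += member,   // seqOp: runs within one partition
          (set1, set2) => set1 ++= set2)    // combOp: merges per-partition sets
        .mapValues(_.toSet)
        .collect()
        .toMap
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```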

Why does RDD.foreach fail with "SparkException: This RDD lacks a SparkContext"?

I have a dataset (as an RDD) that I divide into 4 RDDs by using different filter operators.
val RSet = datasetRdd.
flatMap(x => RSetForAttr(x, alLevel, hieDict)).
map(x => (x, 1)).
reduceByKey((x, y) => x + y)
val Rp:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rp"))
val Rc:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("Rc"))
val RpSv:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RpSv"))
val RcSv:RDD[(String, Int)] = RSet.filter(x => x._1.split(",")(0).equals("RcSv"))
I sent Rp and RpSV to the following function calculateEntropy:
def calculateEntropy(Rx: RDD[(String, Int)], RxSv: RDD[(String, Int)]): Map[Int, Map[String, Double]] = {
RxSv.foreach{item => {
val string = item._1.split(",")
val t = Rx.filter(x => x._1.split(",")(2).equals(string(2)))
.
.
}
}
I have two questions:
1- When I loop operation on RxSv as:
RxSv.foreach{item=> { ... }}
it collects all items of all the partitions, but I only want the partition I am currently in. You might suggest using the map function, but I am not changing anything in the RDD.
So when I run the code on a cluster with 4 workers and a driver, the dataset is divided into 4 partitions and each worker runs the code. But when I use the foreach loop as shown in the code, the driver collects all the data from the workers.
2- I have encountered a problem with this code
val t = Rx.filter(x => x._1.split(",")(2).equals(abc(2)))
The error :
org.apache.spark.SparkException: This RDD lacks a SparkContext.
It could happen in the following cases:
(1) RDD transformations and actions are NOT invoked by the driver, but inside of other transformations;
for example, rdd1.map(x => rdd2.values.count() * x) is invalid because the values transformation and count action cannot be performed inside of the rdd1.map transformation. For more information, see SPARK-5063.
(2) When a Spark Streaming job recovers from checkpoint, this exception will be hit if a reference to an RDD not defined by the streaming job is used in DStream operations. For more information, See SPARK-13758.
First of all, I'd highly recommend caching the first RDD using cache operator.
RSet.cache
That will avoid scanning and transforming your dataset every time you filter for the other RDDs: Rp, Rc, RpSv and RcSv.
Quoting the scaladoc of cache:
cache() Persist this RDD with the default storage level (MEMORY_ONLY).
Performance should increase.
Secondly, I'd be very careful using the term "partition" to refer to a filtered RDD since the term has a special meaning in Spark.
Partitions say how many tasks Spark executes for an action. They are hints for Spark so you, a Spark developer, could fine-tune your distributed pipeline.
The pipeline is distributed across cluster nodes with one or many Spark executors per the partitioning scheme. If you decide to have one partition in an RDD, once you execute an action on that RDD, you'll have one task on one executor.
The filter transformation does not change the number of partitions (in other words, it preserves partitioning). The number of partitions, i.e. the number of tasks, is exactly the number of partitions of RSet.
1- When I loop operation on RxSv it collects all items of all the partitions, but I only want the partition I am currently in
You are. Don't worry about it as Spark will execute the task on executors where the data lives. foreach is an action that does not collect items but describes a computation that runs on executors with the data distributed across the cluster (as partitions).
If you want to process all items at once per partition use foreachPartition:
foreachPartition Applies a function f to each partition of this RDD.
2- I have encountered a problem with this code
In the following lines of the code:
RxSv.foreach{item => {
val string = item._1.split(",")
val t = Rx.filter(x => x._1.split(",")(2).equals(string(2)))
you are executing the foreach action, which in turn uses Rx, which is an RDD[(String, Int)]. This is not allowed (and, ideally, it should not even compile).
The reason for the behaviour is that an RDD is a data structure that just describes what happens with the dataset when an action is executed and lives on the driver (the orchestrator). The driver uses the data structure to track the data sources, transformations and the number of partitions.
An RDD as an entity is gone (= disappears) when the driver spawns tasks on executors.
And when the tasks run nothing is available to help them to know how to run RDDs that are part of their work. And hence the error. Spark is very cautious about it and checks such anomalies before they could cause issues after tasks are executed.
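One way to avoid touching Rx inside RxSv.foreach is to key both RDDs by the field being compared and join them on the driver side of the API. The following is only a sketch under assumptions: the sample rows are invented, the local master is there so it runs standalone, and summing the matching counts is a placeholder for whatever the elided body of calculateEntropy actually computes.

```scala
import org.apache.spark.sql.SparkSession

object JoinInsteadOfNestedFilter {
  def run(): Map[String, Int] = {
    // Assumption: local master, used only so the sketch is self-contained.
    val spark = SparkSession.builder().master("local[2]").appName("join-sketch").getOrCreate()
    val sc = spark.sparkContext
    try {
      // Invented sample data in the "name,attr,key" shape used in the question.
      val rx = sc.parallelize(Seq(("Rp,a,k1", 3), ("Rp,b,k1", 4), ("Rp,c,k2", 5)))
      val rxSv = sc.parallelize(Seq(("RpSv,a,k1", 1), ("RpSv,b,k2", 2)))

      // Key both sides by the third comma-separated field.
      val rxByKey = rx.map { case (s, n) => (s.split(",")(2), n) }
      val rxSvByKey = rxSv.map { case (s, _) => (s.split(",")(2), s) }

      // Each RxSv item is paired with the Rx rows sharing its key;
      // the sum is a placeholder aggregation.
      rxSvByKey.join(rxByKey)
        .map { case (_, (svItem, n)) => (svItem, n) }
        .reduceByKey(_ + _)
        .collect()
        .toMap
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

If Rx is small, collecting it and broadcasting the result to the executors is an alternative to the join.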

Using Futures within Spark

A Spark job makes a remote web service call for every element in an RDD. A simple implementation might look something like this:
def webServiceCall(url: String) = scala.io.Source.fromURL(url).mkString
val rdd2 = rdd1.map(x => webServiceCall(x.field1))
(The above example has been kept simple and does not handle timeouts).
There is no interdependency between any of the results for different elements of the RDD.
Would the above be improved by using Futures to optimise performance by making parallel calls to the web service for each element of the RDD? Or does Spark itself have that level of optimization built in, so that it will run the operations on each element in the RDD in parallel?
If the above can be optimized by using Futures, does anyone have some code examples showing the correct way to use Futures within a function passed to a Spark RDD.
Thanks
Or does Spark itself have that level of optimization built in, so that it will run the operations on each element in the RDD in parallel?
It doesn't. Spark parallelizes tasks at the partition level but by default every partition is processed sequentially in a single thread.
Would the above be improved by using Futures
It could be an improvement, but it is quite hard to get right. In particular:
every Future has to be completed in the same stage, before any reshuffle takes place.
given the lazy nature of the Iterators used to expose partition data, you cannot do it with high-level primitives like map (see for example Spark job with Async HTTP call).
you can build your own logic using mapPartitions, but then you have to deal with all the consequences of non-lazy partition evaluation.
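One possible middle ground between fully lazy and fully eager evaluation is to consume the partition iterator in bounded batches: start the Futures for one batch, wait for them, then pull the next batch. The helper below is a plain-Scala sketch of that idea; `mapAsync`, `fakeCall`, the batch size, and the timeout are all assumptions made for illustration, not Spark API.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._

object BatchedAsync {
  // Runs f on up to batchSize elements at a time and waits for each batch
  // before pulling more input, so the iterator is consumed in chunks.
  def mapAsync[A, B](it: Iterator[A], batchSize: Int, timeout: Duration)(f: A => Future[B])
                    (implicit ec: ExecutionContext): Iterator[B] =
    it.grouped(batchSize).flatMap { batch =>
      val inFlight = Future.sequence(batch.map(f)) // starts every call in the batch
      Await.result(inFlight, timeout)              // blocks until the batch is done
    }

  def demo(): List[Int] = {
    import ExecutionContext.Implicits.global
    // Hypothetical stand-in for a web service call.
    def fakeCall(s: String): Future[Int] = Future(s.length)
    mapAsync(Iterator("a", "bb", "ccc"), batchSize = 2, timeout = 10.seconds)(fakeCall).toList
  }

  def main(args: Array[String]): Unit = println(demo())
}
```

Inside Spark, the same helper could be passed the iterator that mapPartitions hands to its function, with the ExecutionContext obtained on the executor side rather than captured from the driver. The batch size then caps how many calls are in flight per task.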
I couldn't find an easy way to achieve this, but after several iterations this is what I did, and it is working for a huge list of queries. Basically, we used this to run a batch operation that splits one huge query into multiple subqueries.
// Break down your huge workload into smaller chunks, in this case huge query string is broken
// down to a small set of subqueries
// Here if needed to optimize further down, you can provide an optimal partition when parallelizing
val queries = sqlContext.sparkContext.parallelize[String](subQueryList.toSeq)
// Then map each one those to a Spark Task, in this case its a Future that returns a string
val tasks: RDD[Future[String]] = queries.map(query => {
val task = makeHttpCall(query) // Method returns http call response as a Future[String]
// Log failures; the original Future is returned unchanged
task.failed.foreach(t => logger.error("execution failed: " + t.getMessage))
task
})
// Note: the HTTP call is still not invoked; so far it is only part of the lineage
// Then in each partition you combine all Futures (there could be several tasks in each partition) and sequence them,
// and Await the result; this blocks until all the Futures in that sequence are resolved
val contentRdd = tasks.mapPartitions[String] { f: Iterator[Future[String]] =>
val searchFuture: Future[Iterator[String]] = Future sequence f
Await.result(searchFuture, threadWaitTime.seconds)
}
// Note: At this point, you can do any transformations on this rdd and it will be appended to the lineage.
// When you perform any action on that Rdd, then at that point,
// those mapPartition process will be evaluated to find the tasks and the subqueries to perform a full parallel http requests and
// collect those data in a single rdd.
I'm reposting it from my original answer here

parallelize list.map for concurrent rdd jobs

according to a posting
https://groups.google.com/forum/#!topic/spark-users/3QIn42VbQe0
This code will submit all jobs directly to Spark's scheduler, and you get a list of "Future"s back.
val rdds: List[RDD[T]] = ...
val futures = rdds.map { rdd =>
rdd.map(...).reduceByKey(...).collect()
}
I am wondering whether adding .par would speed this up, as in
rdds.par.map
or, maybe the author meant that each map entry would be just a spark job submission, and running them in sequence would be just as fast.
In the code provided, the RDDs will be evaluated sequentially. When you call .collect on an RDD, the RDD transformation is evaluated and the results are collected in the driver. The driver is blocked while the results are collected.
If you were to change it to rdds.par.map { ... }, then the .collects would be called in parallel and all RDDs would be evaluated at the same time. This then leaves it to the Spark scheduling mechanism to decide how to share the cluster between the RDDs.
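The same effect can be had with explicit Futures on the driver, which avoids depending on parallel collections (in Scala 2.13, .par lives in the separate scala-parallel-collections module). The sketch below assumes a local master and two tiny made-up RDDs; a word count stands in for the elided map(...).reduceByKey(...) steps.

```scala
import scala.concurrent.{Await, ExecutionContext, Future}
import scala.concurrent.duration._
import org.apache.spark.sql.SparkSession

object ConcurrentJobs {
  def run(): List[Map[String, Int]] = {
    import ExecutionContext.Implicits.global
    // Assumption: local master, used only so the sketch is self-contained.
    val spark = SparkSession.builder().master("local[2]").appName("concurrent-jobs-sketch").getOrCreate()
    val sc = spark.sparkContext
    try {
      val rdds = List(
        sc.parallelize(Seq("a", "b", "a")),
        sc.parallelize(Seq("c", "c", "c")))

      // Each Future calls collect() on its own driver thread, so the jobs
      // are submitted to the scheduler without waiting for one another.
      val futures = rdds.map { rdd =>
        Future(rdd.map(w => (w, 1)).reduceByKey(_ + _).collect().toMap)
      }
      Await.result(Future.sequence(futures), 2.minutes)
    } finally spark.stop()
  }

  def main(args: Array[String]): Unit = println(run())
}
```

How the concurrently submitted jobs share the cluster is still up to Spark's scheduler configuration.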