Referring to this question: NullPointerException in Scala Spark, appears to be caused by collection type?
The answer states "Spark doesn't support nesting of RDDs (see https://stackoverflow.com/a/14130534/590203 for another occurrence of the same problem), so you can't perform transformations or actions on RDDs inside of other RDD operations."
This code:
val x = sc.parallelize(List(1, 2, 3))
def fun1(n: Int) = {
  fun2(n)
}
def fun2(n: Int) = {
  n + 1
}
x.map(v => fun1(v)).take(1)
prints:
Array[Int] = Array(2)
This is correct.
But doesn't this disagree with "can't perform transformations or actions on RDDs inside of other RDD operations", since a nested action is occurring on an RDD?
No. In the linked question d.filter(...) returns an RDD, so the type of
d.distinct().map(x => d.filter(_.equals(x)))
is RDD[RDD[String]]. That isn't allowed, but it doesn't happen in your code. If I understand the answer correctly, you also can't refer to d or any other RDD inside map, even when you don't end up with an RDD[RDD[SomeType]].
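For illustration, a minimal sketch of the difference (assuming a SparkContext named sc; the names are only for this example):
val d = sc.parallelize(List("a", "b", "c"))
// Fine: the closure only calls ordinary functions on plain values
d.map(s => s.toUpperCase).collect()
// Not allowed: the closure references the RDD d itself, so an RDD operation
// would have to run inside another RDD operation; this fails at runtime
// d.map(s => d.filter(_ == s).count()).collect()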
I've added the following code:
var counters: Map[String, Int] = Map()
val results = rdd.filter(l => l.contains("xyz")).map(l => mapEvent(l)).filter(r => r.isDefined).map(
  i => {
    val date = i.get.getDateTime.toString.substring(0, 10)
    counters = counters.updated(date, counters.getOrElse(date, 0) + 1)
  }
)
I want to get counts for different dates in the RDD in one single iteration. But when I run this I get a message saying:
No implicits found for parameters evidence$6: Encoder[Unit]
So I added this line:
implicit val myEncoder: Encoder[Unit] = org.apache.spark.sql.Encoders.kryo[Unit]
But then I get this error.
Exception in thread "main" java.lang.ExceptionInInitializerError
at com.xyz.SparkBatchJob.main(SparkBatchJob.scala)
Caused by: java.lang.UnsupportedOperationException: Primitive types are not supported.
at org.apache.spark.sql.Encoders$.genericSerializer(Encoders.scala:200)
at org.apache.spark.sql.Encoders$.kryo(Encoders.scala:152)
How do I fix this? Or is there a better way to get the counts I want in a single iteration (O(N) time)?
A Spark RDD is a representation of a distributed collection. When you apply a map function to an RDD, the function you use to manipulate the collection is executed across the cluster, so it makes no sense to mutate a variable created outside the scope of that map function.
In your code, the problem is that you don't return any value; instead you try to mutate a structure, and for that reason the compiler infers that the RDD created by the transformation is an RDD[Unit].
If you need to create a Map as the result of a Spark action, you should create a pair RDD and then apply a reduce operation.
If you include the type of the rdd and the mapEvent function, I can show how it could be done.
Spark builds a DAG from the transformations and the action; it does not process the data twice.
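As a hedged sketch of that suggestion, assuming rdd, mapEvent and getDateTime are as in your snippet and rdd really is an RDD (with a Dataset you would use groupBy("date").count() instead), the pair-RDD version could look like this; reduceByKey does the per-date counting in one distributed pass and collect() only brings back the small per-date totals:
val countsByDate: Map[String, Int] =
  rdd.filter(l => l.contains("xyz"))
     .map(l => mapEvent(l))
     .filter(r => r.isDefined)
     .map(i => (i.get.getDateTime.toString.substring(0, 10), 1)) // (date, 1) pairs
     .reduceByKey(_ + _)  // one (date, count) pair per date
     .collect()
     .toMap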
I am attempting to build a Spark function that recursively re-writes ArrayType columns:
import org.apache.spark.sql.{DataFrame, Column}
import org.apache.spark.sql.functions._
val arrayHead = udf((sequence: Seq[String]) => sequence.head)
val arrayTail = udf((sequence: Seq[String]) => sequence.tail)
// re-produces the ArrayType column recursively
val rewriteArrayCol = (c: Column) => {
  def helper(elementsRemaining: Column, outputAccum: Column): Column = {
    when(size(elementsRemaining) === lit(0), outputAccum)
      .otherwise(helper(arrayTail(elementsRemaining), concat(outputAccum, array(arrayHead(elementsRemaining)))))
  }
  helper(c, array())
}
// Test
val df =
  Seq("100" -> Seq("a", "b", "b", "b", "b", "b", "c", "c", "d"))
    .toDF("id", "sequence")
    // .withColumn("test_tail", arrayTail($"sequence"))   // head & tail udfs work
    // .withColumn("test", rewriteArrayCol($"sequence"))  // stackoverflow if uncommented
display(df)
Unfortunately, I keep getting a stack overflow. One area where I believe the function is lacking is that it's not tail-recursive; i.e. the whole 'when().otherwise()' block is not the same as an 'if else' block. That being said, the function currently throws a stack overflow when applied to even tiny dataframes, so I figure there must be more wrong with it than just not being tail-recursive.
I have not been able to find any examples of a similar function online, so I thought I'd ask here. The only implementations of Column => Column functions that I've been able to find are very, very simple ones, which were not helpful to this use case.
Note: I am able to achieve the functionality of the above by using a UDF. The reason I am attempting to make a Column => Column function is that Spark is better able to optimize these compared to UDFs (as far as I am aware).
That's not going to work, because there is no meaningful stop condition here. when / otherwise are not language-level control flow blocks (hence they cannot break execution), and the function will simply recurse forever.
In fact it won't stop even for an empty array, outside any SQL evaluation context:
rewriteArrayCol(array())
Furthermore, your assumption is incorrect. Skipping over the fact that your code deserializes the data twice (once for each of arrayHead and arrayTail), which is far worse than just calling a udf once (though it could be avoided with slice), very complex expressions come with their own issues, one of which is the code generation size limit.
Don't despair though: there is already a valid solution out there, which is transform. See How to use transform higher-order function?
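As a hedged sketch against the df from your test (Spark 2.4+, via a SQL expression; the output column name is just for this example), the identity rewrite your helper was aiming for can be expressed with the transform higher-order function; in Spark 3.0+ there is also a native org.apache.spark.sql.functions.transform(column, fn) variant:
import org.apache.spark.sql.functions.expr

// transform applies the lambda x -> x to every element of the array column,
// reproducing it without UDFs, recursion, or huge generated expressions
val rewritten = df.withColumn("rewritten_sequence", expr("transform(sequence, x -> x)"))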
I am using Scala Spark API. In my code, I have an RDD of the following structure:
Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])]
I need to process (perform validations and modify values) the second element of the RDD. I am using the map function to do that:
myRDD.map(line => mappingFunction(line))
Unfortunately, the mappingFunction is not invoked. This is the code of the mapping function:
def mappingFunction(line: Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])] ): Tuple2[String, (Iterable[Array[String]], Option[Iterable[Array[String]]])] = {
println("Inside mappingFunction")
return line
}
When my program ends, there are no printed messages in stdout.
In order to investigate the problem, I implemented a code snippet that worked:
val x = List.range(1, 10)
val mappedX = x.map(i => callInt(i))
And the following mapping function was invoked:
def callInt(i: Int) = {
println("Inside callInt")
}
Please assist in getting the RDD mapping function mappingFunction invoked. Thank you.
x is a List, so there is no laziness there; that's why your function is being invoked even though you are not calling an action.
myRDD is an RDD. RDDs are lazy: you don't actually execute your transformations (map, flatMap, filter) until you need to.
That means that you are not running your map function until you perform an action. An action is an operation that triggers the precedent operations (called transformations) to be executed.
Some examples of actions are collect or count.
If you do this:
myRDD.map(line => mappingFunction(line)).count()
You'll see your prints. Anyway, there is no problem with your code at all; you just need to take into consideration the lazy nature of RDDs.
There is a good answer about this topic here.
You can also find more info and a whole list of transformations and actions here.
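A minimal, self-contained sketch of that laziness (assuming a SparkContext named sc); note that in cluster mode the printlns end up in the executor logs rather than the driver console:
val nums = sc.parallelize(List(1, 2, 3))
val doubled = nums.map { i =>
  println(s"mapping $i")  // nothing is printed yet: map is only a transformation
  i * 2
}
doubled.count()           // the action triggers the map, and the printlns run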
I have the following code, where rddMap is of type org.apache.spark.rdd.RDD[(String, (String, String))] and myHashMap is a scala.collection.mutable.HashMap.
I did .saveAsTextFile("temp_out") to force the evaluation of rddMap.map.
However, even though println(" t " + t) prints things, myHashMap afterwards still has only the one element I manually put in at the beginning: ("test1", ("10", "20")).
Nothing from rddMap is put into myHashMap.
Code snippet:
val myHashMap = new HashMap[String, (String, String)]
myHashMap.put("test1", ("10", "20"))
rddMap.map { t =>
  println(" t " + t)
  myHashMap.put(t._1, t._2)
}.saveAsTextFile("temp_out")
println(rddMap.count)
println(myHashMap.toString)
Why can't I put the elements from rddMap into myHashMap?
Here is a working example of what you want to accomplish.
val rddMap = sc.parallelize(Map("A" -> ("v", "v"), "B" -> ("d","d")).toSeq)
// Collects all the data in the RDD and converts the data to a Map
val myMap = rddMap.collect().toMap
myMap.foreach(println)
Output:
(A,(v,v))
(B,(d,d))
Here is similar code to what you've posted
rddMap.map { t =>
  println("t" + t)
  newHashMap.put(t._1, t._2)
  println(newHashMap.toString)
}.collect
Here is the output to the above code from the Spark shell
t(A,(v,v))
Map(A -> (v,v), test1 -> (10,20))
t(B,(d,d))
Map(test1 -> (10,20), B -> (d,d))
To me it looks like Spark copies your HashMap and does add the element to the copied map.
What you are trying to do is not really supported in Spark today.
Note that every user-defined function (e.g., what you add inside a map()) is a closure that gets serialized and pushed to each executor.
Therefore everything you have inside this map() gets serialized and transferred around:
.map { t =>
  println(" t " + t)
  myHashMap.put(t._1, t._2)
}
Essentially, your myHashMap will be copied to each executor, and each executor will be updating its own version of that HashMap. This is why at the end of the execution the myHashMap you have in your driver will never get changed. (The driver is the JVM that manages/orchestrates your Spark jobs. It's the place where you define your SparkContext.)
In order to push structures defined in the driver to all executors you need to broadcast them (see link here). Note that broadcast variables are read-only, so again, using broadcasts will not help you here.
Another way is to use accumulators, but I feel that these are more tuned towards summarizing numeric values, like doing sum, max, min, etc. Maybe you can take a look at creating a custom accumulator that extends AccumulatorParam. See link here.
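A rough sketch of that last idea against the old AccumulatorParam API mentioned above (deprecated in Spark 2.x); HashMapParam is a hypothetical name, and the merge is a plain map union:
import org.apache.spark.AccumulatorParam

object HashMapParam extends AccumulatorParam[Map[String, (String, String)]] {
  def zero(initial: Map[String, (String, String)]): Map[String, (String, String)] =
    Map.empty
  def addInPlace(m1: Map[String, (String, String)], m2: Map[String, (String, String)]): Map[String, (String, String)] =
    m1 ++ m2
}

val acc = sc.accumulator(Map.empty[String, (String, String)])(HashMapParam)
rddMap.foreach(t => acc += Map(t._1 -> t._2))  // only update accumulators inside actions
println(acc.value)                             // read back on the driver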
Coming back to the original question: if you want to collect values to your driver, currently the best way to do this is to transform your RDDs until they become a small and manageable collection of elements and then collect() this final, small RDD.
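For completeness, a minimal sketch of that collect()-based route against the rddMap from the question:
import scala.collection.mutable.HashMap

val myHashMap = new HashMap[String, (String, String)]
myHashMap.put("test1", ("10", "20"))
// collect() materialises the RDD as a local Array[(String, (String, String))]
// on the driver, where mutating myHashMap is safe
rddMap.collect().foreach { case (k, v) => myHashMap.put(k, v) }
println(myHashMap)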
When I call the map function of an RDD it is not being applied. It works as expected for a scala.collection.immutable.List but not for an RDD. Here is some code to illustrate:
val list = List("a", "d", "c", "d")
list.map(l => {
  println("mapping list")
})
val tm = sc.parallelize(list)
tm.map(m => {
  println("mapping RDD")
})
The result of the above code is:
mapping list
mapping list
mapping list
mapping list
But notice that "mapping RDD" is not printed to the screen. Why is this occurring?
This is part of a larger issue where I am trying to populate a HashMap from an RDD:
def getTestMap(dist: RDD[(String)]) = {
  var testMap = new java.util.HashMap[String, String]()
  dist.map(m => {
    println("populating map")
    testMap.put(m, m)
  })
  testMap
}
val testM = getTestMap(tm)
println(testM.get("a"))
This code prints null.
Is this due to lazy evaluation?
Lazy evaluation might be part of this, if map is the only operation you are executing. Spark will not schedule execution until an action (in Spark terms) is requested on the RDD lineage.
When you execute an action, the println will happen, but not on the driver where you are expecting it; rather, on the worker executing that closure. Try looking into the logs of the workers.
A similar thing is happening with the hashMap population in the second part of the question. The same piece of code will be executed on each partition, on separate workers, and will be serialized back to the driver. Given that closures are 'cleaned' by Spark, testMap is probably being removed from the serialized closure, resulting in a null. Note that if it were only due to the map not being executed, the hashmap would be empty, not null.
If you want to transfer the data of the RDD to another structure, you need to do that in the driver. Therefore you need to force Spark to deliver all the data to the driver. That's the function of rdd.collect().
This should work for your case. Be aware that all the RDD data should fit in the memory of your driver:
import scala.collection.JavaConverters._
def getTestMap(dist: RDD[(String)]) = dist.collect.map(m => (m, m)).toMap.asJava
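Usage against the tm RDD from the question would then look something like this:
val tm = sc.parallelize(List("a", "d", "c", "d"))
val testM = getTestMap(tm)
println(testM.get("a"))  // prints "a" instead of null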