I have two RDDs that are pulled in with the following code:
val fileA = sc.textFile("fileA.txt")
val fileB = sc.textFile("fileB.txt")
I then map and reduce them by key:
val countsB = fileB.flatMap(line => line.split("\n"))
.map(word => (word, 1))
.reduceByKey(_+_)
val countsA = fileA.flatMap(line => line.split("\n"))
.map(word => (word, 1))
.reduceByKey(_+_)
I now want to find and remove all keys in countsB that also exist in countsA.
I have tried something like:
countsB.keys.foreach(b => {
if(countsB.collect().exists(_ == b)){
countsB.collect().drop(countsB.collect().indexOf(b))
}
})
but it doesn't seem to remove them by key.
There are 3 issues with your suggested code:
You are collecting the RDDs, which means they aren't RDDs anymore: they are copied into the driver application's memory as plain Scala collections, so you lose Spark's parallelism and risk OutOfMemory errors if your dataset is large
When calling drop on an immutable Scala collection (or an RDD), you don't change the original collection; you get a new collection with those records dropped, so you can't expect the original collection to change
You cannot access an RDD within a function passed to any of the RDD's higher-order methods (e.g. foreach in this case) - any function passed to these methods is serialized and sent to the workers, and RDDs are (intentionally) not serializable - it makes no sense to fetch them into driver memory, serialize them, and send them back to the workers - the data is already distributed on the workers!
To solve all these - when you want to use one RDD's data to transform/filter another one, you usually want to use some type of join. In this case you can do:
// left join, and keep only records for which there was NO match in countsA:
countsB.leftOuterJoin(countsA).collect { case (key, (valueB, None)) => (key, valueB) }
NOTE that the collect I'm using here isn't the collect() you used - this one takes a PartialFunction as an argument, behaves like a combination of map and filter, and, most importantly, doesn't copy all the data into driver memory.
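For example, a minimal sketch with made-up data showing what that expression keeps:
val demoA = sc.parallelize(Seq(("a", 1)))
val demoB = sc.parallelize(Seq(("a", 2), ("b", 3)))
// "a" joins to Some(1), so it is dropped; only ("b", 3) remains, still as an RDD
demoB.leftOuterJoin(demoA).collect { case (key, (valueB, None)) => (key, valueB) }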
EDIT: as The Archetypal Paul commented - you have a much shorter and nicer option - subtractByKey:
countsB.subtractByKey(countsA)
I'm new to using spark and scala but I have to solve the following problem:
I have one ORC file containing rows which I have to check against a certain condition coming from a hash map.
I build the hash map (filename,timestamp) with 120,000 entries this way (getTimestamp returns an Option[Long] type):
val tgzFilesRDD = sc.textFile("...")
val fileNameTimestampRDD = tgzFilesRDD.map(itr => {
(itr, getTimestamp(itr))
})
val fileNameTimestamp = fileNameTimestampRDD.collect.toMap
And retrieve the RDD with 6 million entries like this:
val sessionDataDF = sqlContext.read.orc("...")
case class SessionEvent(archiveName: String, eventTimestamp: Long)
val sessionEventsRDD = sessionDataDF.as[SessionEvent].rdd
And do the check:
val sessionEventsToReport = sessionEventsRDD.filter(se => {
val timestampFromFile = fileNameTimestamp.getOrElse(se.archiveName, None)
se.eventTimestamp < timestampFromFile.getOrElse[Long](Long.MaxValue)
})
Is this the right and performant way to do it? Is there a caching recommended?
Will the Map fileNameTimestamp get shuffled out to the cluster nodes where the partitions are processed?
fileNameTimestamp will get serialized for each task, and with 120,000 entries, it may be quite expensive. You should broadcast large objects and reference the broadcast variables:
val fileNameTimestampBC = sc.broadcast(fileNameTimestampRDD.collect.toMap)
Now only one copy of the map will be shipped to each worker, instead of one copy per task. There is also no need to drop down to the RDD API, as the Dataset API has a filter method:
val sessionEvents = sessionDataDF.as[SessionEvent]
val sessionEventsToReport = sessionEvents.filter(se => {
val timestampFromFile = fileNameTimestampBC.value.getOrElse(se.archiveName, None)
se.eventTimestamp < timestampFromFile.getOrElse[Long](Long.MaxValue)
})
The fileNameTimestamp Map you collected exists on the Spark driver. In order to be referenced efficiently like this in a query, the worker nodes need to have access to it. This is done by broadcasting.
In essence, you rediscovered Broadcast Hash Join: You are left joining sessionEventsRDD with tgzFilesRDD to gain access to the optional timestamp and then you filter accordingly.
When using RDDs, you need to explicitly code the joining strategy. Dataframes/Datasets API has a query optimiser that can make the choice for you. You can also explicitly ask the API to use the above broadcast join technique behind the scenes. You can find examples for both approaches here.
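For example, here's a minimal sketch of asking the Dataset API for the broadcast join explicitly, assuming you first turn the (filename, timestamp) pairs into a DataFrame named fileNameTimestampDF with columns archiveName and timestamp (both the name and the schema are made up here):
import org.apache.spark.sql.functions.{broadcast, coalesce, col, lit}
// broadcast() hints the optimiser to ship the small side to every executor instead of shuffling it
val joined = sessionDataDF.join(broadcast(fileNameTimestampDF), Seq("archiveName"), "left_outer")
val toReport = joined.filter(col("eventTimestamp") < coalesce(col("timestamp"), lit(Long.MaxValue)))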
Let me know if this is clear enough :)
I have the following code where rddMap is of org.apache.spark.rdd.RDD[(String, (String, String))], and myHashMap is scala.collection.mutable.HashMap.
I did .saveAsTextFile("temp_out") to force the evaluation of rddMap.map.
However, even though println(" t " + t) does print things, myHashMap afterwards still contains only the one element I manually put in at the beginning, ("test1", ("10", "20")).
Nothing from rddMap ends up in myHashMap.
Code snippet:
val myHashMap = new HashMap[String, (String, String)]
myHashMap.put("test1", ("10", "20"))
rddMap.map { t =>
println(" t " + t)
myHashMap.put(t._1, t._2)
}.saveAsTextFile("temp_out")
println(rddMap.count)
println(myHashMap.toString)
Why can't I put the elements from rddMap into myHashMap?
Here is a working example of what you want to accomplish.
val rddMap = sc.parallelize(Map("A" -> ("v", "v"), "B" -> ("d","d")).toSeq)
// Collects all the data in the RDD and converts the data to a Map
val myMap = rddMap.collect().toMap
myMap.foreach(println)
Output:
(A,(v,v))
(B,(d,d))
Here is similar code to what you've posted
rddMap.map { t=>
println("t" + t)
newHashMap.put(t._1, t._2)
println(newHashMap.toString)
}.collect
Here is the output to the above code from the Spark shell
t(A,(v,v))
Map(A -> (v,v), test1 -> (10,20))
t(B,(d,d))
Map(test1 -> (10,20), B -> (d,d))
To me it looks like Spark copies your HashMap and does add the element to the copied map.
What you are trying to do is not really supported in Spark today.
Note that every user-defined function (e.g., what you add inside a map()) is a closure that gets serialized and pushed to each executor.
Therefore everything you have inside this map() gets serialized and transferred around:
.map{ t =>
println(" t " + t)
myHashMap.put(t._1, t._2)
}
Essentially, your myHashMap will be copied to each executor and each executor will be updating its own version of that HashMap. This is why at the end of the execution the myHashMap you have in your driver will never get changed. (The driver is the JVM that manages/orchestrates your Spark jobs. It's the place where you define your SparkContext.)
In order to push structures defined in the driver to all executors you need to broadcast them (see link here). Note that broadcast variables are read-only, so, again, using broadcasts will not help you here.
Another way is to use accumulators, but I feel that these are more tuned towards summarizing numeric values, like doing sum, max, min, etc. Maybe you can take a look at creating a custom accumulator that extends AccumulatorParam. See link here.
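For completeness, a rough, untested sketch of what such a custom accumulator could look like with the legacy AccumulatorParam API (the names here are made up):
import org.apache.spark.AccumulatorParam
// merges per-task maps into one map on the driver
object MapAccumulatorParam extends AccumulatorParam[Map[String, (String, String)]] {
  def zero(initialValue: Map[String, (String, String)]) = Map.empty[String, (String, String)]
  def addInPlace(m1: Map[String, (String, String)], m2: Map[String, (String, String)]) = m1 ++ m2
}
val acc = sc.accumulator(Map.empty[String, (String, String)])(MapAccumulatorParam)
rddMap.foreach { case (k, v) => acc += Map(k -> v) }  // foreach is an action, so this actually runs
println(acc.value)  // reading .value is only meaningful on the driver, after the action has run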
Coming back to the original question, if you want to collect values to your driver, currently the best way to do this is to transform your RDDs until they become a small and manageable collection of elements and then you collect() this final/small RDD.
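With your rddMap, assuming it's already small enough to fit in driver memory, that boils down to something like:
val driverMap = rddMap.collect().toMap   // Array[(String, (String, String))] -> immutable Map
// or, since rddMap is a pair RDD:
val driverMap2 = rddMap.collectAsMap()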
I am new to Scala and use Spark to process data. Why does the following code fail to change the categoryMap?
import scala.collection.mutable.LinkedHashMap
val catFile=sc.textFile(inputFile);
var categoryMap=LinkedHashMap[Int,Tuple2[String,Int]]()
catFile.foreach(line => {
val strs=line.split("\001");
categoryMap += (strs(0).toInt -> (strs(2),strs(3).toInt));
})
It's a good practice to try to stay away from both mutable data structures and vars. Sometimes they are needed, but mostly this kind of processing is easy to do by chaining transformation operations on collections. Also, .toMap is handy to convert a Seq containing Tuple2's to a Map.
Here's one way (that I didn't test properly):
val categoryMap = catFile
  .map(_.split("\001"))
  .map(array => (array(0).toInt, (array(2), array(3).toInt)))
  .collect()
  .toMap
Note that if there is more than one record for a given key, only the last one will be present in the resulting map.
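If you do need to keep all the records for a key, a sketch in the same spirit (also untested) would group by the key before collecting:
val categoryMultiMap = catFile
  .map(_.split("\001"))
  .map(array => (array(0).toInt, (array(2), array(3).toInt)))
  .groupByKey()
  .collectAsMap()   // Map[Int, Iterable[(String, Int)]]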
Edit: I didn't actually answer your original question - based on a quick test it results in a similar map to what my code above produces. Blind guess, you should make sure that your catFile actually contains data to process.
I have a situation where an underlying function operates significantly more efficiently when given batches to work on. I have existing code like this:
// subjects: RDD[Subject]
val subjects = Subject.load(job, sparkContext, config)
val classifications = subjects.flatMap(subject => classify(subject)).reduceByKey(_ + _)
classifications.saveAsTextFile(config.output)
The classify method works on single elements but would be more efficient operating on groups of elements. I considered using coalesce to split the RDD into chunks and act on each chunk as a group; however, there are two problems with this:
I'm not sure how to return the mapped RDD.
classify doesn't know in advance how big the groups should be and it varies based on the contents of the input.
Sample code on how classify could be called in an ideal situation (the output is kludgey since it can't spill for very large inputs):
def classifyRdd (subjects: RDD[Subject]): RDD[(String, Long)] = {
val classifier = new Classifier
subjects.foreach(subject => classifier.classifyInBatches(subject))
classifier.classifyRemaining
classifier.results
}
This way classifyInBatches can have code like this internally:
def classifyInBatches(subject: Subject) {
if (!internals.canAdd(subject)) {
partialResults.add(internals.processExisting)
}
internals.add(subject) // Assumption: at least one will fit.
}
What can I do in Apache Spark that will allow behavior somewhat like this?
Try using the mapPartitions method, which allows your map function to consume a partition as an iterator and produce an iterator of output.
You should be able to write something like this:
subjectsRDD.mapPartitions { subjects =>
val classifier = new Classifier
subjects.foreach(subject => classifier.classifyInBatches(subject))
classifier.classifyRemaining
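// mapPartitions has to return an Iterator, so results must be (or be converted to) an Iterator[(String, Long)]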
classifier.results
}
When I call the map function of an RDD it is not being applied. It works as expected for a scala.collection.immutable.List but not for an RDD. Here is some code to illustrate:
val list = List ("a" , "d" , "c" , "d")
list.map(l => {
println("mapping list")
})
val tm = sc.parallelize(list)
tm.map(m => {
println("mapping RDD")
})
Result of above code is :
mapping list
mapping list
mapping list
mapping list
But notice that "mapping RDD" is not printed to the screen. Why is this occurring?
This is part of a larger issue where I am trying to populate a HashMap from an RDD:
def getTestMap( dist: RDD[(String)]) = {
var testMap = new java.util.HashMap[String , String]();
dist.map(m => {
println("populating map")
testMap.put(m , m)
})
testMap
}
val testM = getTestMap(tm)
println(testM.get("a"))
This code prints null.
Is this due to lazy evaluation?
Lazy evaluation might be part of this, if map is the only operation you are executing. Spark will not schedule execution until an action (in Spark terms) is requested on the RDD lineage.
When you execute an action, the println will happen, but not on the driver where you are expecting it; it happens on the worker executing that closure. Try looking into the logs of the workers.
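A quick way to see this (a sketch): add an action so the lazy map actually runs; in local mode the output then appears on your console, on a cluster it lands in the executor logs.
tm.map { m => println("mapping RDD"); m }.count()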
A similar thing is happening with the hashMap population in the 2nd part of the question. The same piece of code is executed on each partition, on separate workers, and the closure containing testMap is serialized and shipped to those workers. Given that closures are 'cleaned' by Spark, probably testMap is being removed from the serialized closure, resulting in a null. Note that if it was only due to the map not being executed, the hashmap should be empty, not null.
If you want to transfer the data of the RDD to another structure, you need to do that in the driver. Therefore you need to force Spark to deliver all the data to the driver. That's the function of rdd.collect().
This should work for your case. Be aware that all the RDD data should fit in the memory of your driver:
import scala.collection.JavaConverters._
def getTestMap(dist: RDD[(String)]) = dist.collect.map(m => (m , m)).toMap.asJava
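A quick check against the original snippet (a sketch):
val testM = getTestMap(tm)
println(testM.get("a"))   // now prints "a" instead of null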