Merging two RDDs based on common key and then outputting - scala

I'm fairly new to Scala and Spark and functional programming in general, so forgive me if this is a pretty basic question.
I'm merging two CSV files, so I got a lot of inspiration from this: Merge the intersection of two CSV files with Scala
although that is just Scala code and I wanted to write it in Spark to handle much larger CSV files.
This part of the code I think I've got right:
val csv1 = sc.textFile(Csv1Location).cache()
val csv2 = sc.textFile(Csv2Location).cache()
def GetInput1Key(input: String): Key = Key(getAtIndex(input.split(SplitByCommas, -1), Csv1KeyLocation))
def GetInput2Key(input: String): Key = Key(getAtIndex(input.split(SplitByCommas, -1), Csv2KeyLocation))
val intersectionOfKeys = csv1 map GetInput1Key intersection(csv2 map GetInput2Key)
val map1 = csv1 map (input => GetInput1Key(input) -> input)
val map2 = csv2 map (input => GetInput2Key(input) -> input)
val broadcastedIntersection = sc.broadcast(intersectionOfKeys.collect.toSet)
And this is where I'm a little lost. I have a set of keys (intersectionOfKeys) that are present in both of my RDDs, and I have two RDDs that contain [Key, String] maps. If they were plain maps I could just do:
val output = broadcastedIntersection.value map (key => map1(key) + ", " + map2(key))
but that syntax isn't working.
Please let me know if you need any more information about the CSV files or what I'm trying to accomplish. Also, I'd love any syntactical and/or idiomatic comments on my code as well if you all notice anything incorrect.
Update:
val csv1 = sc.textFile(Csv1Location).cache()
val csv2 = sc.textFile(Csv2Location).cache()
def GetInput1Key(input: String): Key = Key(getAtIndex(input.split(SplitByCommas, -1), Csv1KeyLocation))
def GetInput2Key(input: String): Key = Key(getAtIndex(input.split(SplitByCommas, -1), Csv2KeyLocation))
val intersectionOfKeys = csv1 map GetInput1Key intersection(csv2 map GetInput2Key)
val map1 = csv1 map (input => GetInput1Key(input) -> input)
val map2 = csv2 map (input => GetInput2Key(input) -> input)
val intersections = map1.join(map2)
intersections take NumOutputs foreach println
This code worked and did what I needed it to do, but I was wondering whether there are any modifications I should make, or performance implications of using join. I remember reading somewhere that join is typically expensive and time-consuming because the data has to be shuffled across the distributed workers.

I think hveiga is correct, a join would be simpler:
val csv1KV = csv1.map(line=>(GetInput1Key(line), line))
val csv2KV = csv2.map(line=>(GetInput2Key(line), line))
val joined = csv1KV join csv2KV
joined.mapValues(lineTuple => lineTuple._1 + ", " + lineTuple._2)
This is both more performant and more readable as far as I can see: you need to join the two sets together at some point anyway, and your way relies on a single-machine mentality, where you would have to pull each collection in locally to look up the matching line from every partition. Note that I used mapValues, which keeps the result hash-partitioned and cuts down on network traffic.
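If the full shuffle join ever becomes a bottleneck, one hedged variation (it assumes the set of shared keys is small enough to collect to the driver, as your broadcast already implied) is to filter both sides with the broadcast key set before joining, so only matching rows get shuffled:
// Sketch only: reuses csv1KV/csv2KV from above and the key intersection idea from the question.
val sharedKeys = sc.broadcast(csv1KV.keys.intersection(csv2KV.keys).collect().toSet)
val joinedSmall = csv1KV
  .filter { case (k, _) => sharedKeys.value.contains(k) }
  .join(csv2KV.filter { case (k, _) => sharedKeys.value.contains(k) })
  .mapValues { case (line1, line2) => line1 + ", " + line2 }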

Related

Using Group By with Array[(String, Int)]

I'm working in Scala and I have the following:
val file1 = Array(("test",2),("other",5));
val file2 = Array(("test",3),("boom",4));
Then I join the two arrays:
val toGether = file1.union(file2);
Finally, I want to produce a group-by that yields the following:
Array(("test",(2,3)),("other",(5,0)),("boom",(0,4)))
Is this possible?
If I understand your requirements, then what you want can be done with the following piece of code:
val file1 = Array(("test", 2), ("other", 5))
val file2 = Array(("test", 3), ("boom", 4))
val map1 = file1.toMap
val map2 = file2.toMap
val allKeys = map1.keySet ++ map2.keySet
val result: Array[(String, (Int, Int))] = allKeys.map(k => (k, (map1.getOrElse(k, 0), map2.getOrElse(k, 0))))(scala.collection.breakOut)
println(result.mkString)
The idea is simple: convert both arrays to maps and build the resulting array by iterating over the union of the key sets. Note that this code does not preserve any order, but I'm not sure whether that matters, and it keeps the code much simpler. Note also that it requires the collections to fit into memory, several times over in fact. If file1 and file2 are actually the contents of big files that do not fit into memory, a much more complicated algorithm would be needed.
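For what it's worth, if the data really were too big for memory, a rough Spark sketch of the same idea would use fullOuterJoin on pair RDDs (this assumes a SparkContext sc is already available and that the arrays stand in for the real file contents):
// Sketch: a full outer join keeps keys present on either side; a missing side defaults to 0.
val rdd1 = sc.parallelize(file1)
val rdd2 = sc.parallelize(file2)
val joinedRdd = rdd1.fullOuterJoin(rdd2).mapValues { case (a, b) => (a.getOrElse(0), b.getOrElse(0)) }
// joinedRdd.collect() gives e.g. (test,(2,3)), (other,(5,0)), (boom,(0,4)) in some order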

Spark - create list of words from text file and the word that comes immediately after it

I'm trying to create a pair RDD of every word from a text file paired with the word that follows it.
So for instance,
("I'm", "trying"), ("trying", "to"), ("to", "create") ...
It seems like I could almost use the zip function here, if I were able to start with an offset of 1 on the second copy.
How can I do this, or is there a better way?
I'm still not quite used to thinking in terms of functional programming here.
You can manipulate the index, then join on the initial pair RDD:
val rdd = sc.parallelize("I'm trying to create a".split(" "))
val el1 = rdd.zipWithIndex().map(l => (-1+l._2, l._1))
val el2 = rdd.zipWithIndex().map(l => (l._2, l._1))
el2.join(el1).map(l => l._2).collect()
Which outputs:
Array[(String, String)] = Array((I'm,trying), (trying,to), (to,create), (create,a))
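As a possible alternative to the index join, spark-mllib ships a sliding helper over RDDs. This is only a sketch, and it assumes the spark-mllib artifact is on the classpath:
// Sketch: sliding(2) yields every window of two consecutive elements, in RDD order.
import org.apache.spark.mllib.rdd.RDDFunctions._
val pairs = rdd.sliding(2).map { case Array(a, b) => (a, b) }
pairs.collect()
// Array((I'm,trying), (trying,to), (to,create), (create,a))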

How to split 1 RDD into 6 parts in a performant manner?

I have built a Spark RDD where each element of this RDD is a JAXB Root Element representing an XML Record.
I want to split this RDD so as to produce 6 RDDs from this set. Essentially this job simply converts the hierarchical XML structure into 6 sets of flat CSV records. I am currently passing over the same RDD six times to do this.
xmlRdd.cache()
val rddofTypeA = xmlRdd.map {iterate over XML Object and create Type A}
rddOfTypeA.saveAsTextFile("s3://...")
val rddofTypeB = xmlRdd.map { iterate over XML Object and create Type B}
rddOfTypeB.saveAsTextFile("s3://...")
val rddofTypeC = xmlRdd.map { iterate over XML Object and create Type C}
rddOfTypeC.saveAsTextFile("s3://...")
val rddofTypeD = xmlRdd.map { iterate over XML Object and create Type D}
rddOfTypeD.saveAsTextFile("s3://...")
val rddofTypeE = xmlRdd.map { iterate over XML Object and create Type E}
rddOfTypeE.saveAsTextFile("s3://...")
val rddofTypeF = xmlRdd.map { iterate over XML Object and create Type F}
rddOfTypeF.saveAsTextFile("s3://...")
My input dataset is 35 million records split into 186 files of 448 MB each, stored in Amazon S3. My output directory is also on S3. I am using EMR Spark.
With a six-node m4.4xlarge cluster it takes 38 minutes to finish this splitting and write the output.
Is there an efficient way to achieve this without walking over the RDD six times?
The easiest solution (from a Spark developer's perspective) is to do the map and saveAsTextFile per RDD on a separate thread.
What's not widely known (and hence rarely exploited) is the fact that SparkContext is thread-safe and so can be used to submit jobs from separate threads.
With that said, you could do the following (using the simplest Scala solution with Future, though not necessarily the best one, since a Future starts its computation at instantiation time, not when you tell it to):
xmlRdd.cache()
import scala.concurrent.ExecutionContext.Implicits.global
val f1 = Future {
val rddofTypeA = xmlRdd.map { map xml to csv}
rddOfTypeA.saveAsTextFile("s3://...")
}
val f2 = Future {
val rddofTypeB = xmlRdd.map { map xml to csv}
rddOfTypeB.saveAsTextFile("s3://...")
}
...
Future.sequence(Seq(f1,f2)).onComplete { ... }
That could cut the time for doing the mapping and saving, but it would not cut the number of scans over the dataset. That should not be a big deal anyway since the dataset is cached and hence in memory and/or on disk (note that RDD.cache defaults to MEMORY_ONLY, while Spark SQL's Dataset.cache uses MEMORY_AND_DISK).
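If you want to cap how many of these save jobs run in parallel (the global execution context sizes itself to the number of CPU cores), one option, sketched here as an assumption about your setup rather than a requirement, is a dedicated thread pool:
// Sketch: a fixed pool of 6 threads, one per output type; the size is an assumption, tune as needed.
import java.util.concurrent.Executors
import scala.concurrent.ExecutionContext
implicit val saveEc: ExecutionContext = ExecutionContext.fromExecutorService(Executors.newFixedThreadPool(6))
// The Future blocks above would then pick up saveEc instead of the global context.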
Depending on your requirements regarding output paths, you can solve it using a simple partitionBy clause with the standard DataFrameWriter.
Instead of multiple maps, design a function which takes an element of xmlRdd and returns a Seq of tuples. The general structure would be like this:
def extractTypes(value: T): Seq[(String, String)] = {
val a: String = extractA(value)
val b: String = extractB(value)
...
val f: String = extractF(value)
Seq(("A", a), ("B", b), ..., ("F", f))
}
flatMap, convert to a DataFrame with toDF (which needs spark.implicits._ in scope), and write:
xmlRdd.flatMap(extractTypes _).toDF("id", "value").write
.partitionBy("id")
.option("escapeQuotes", "false")
.csv(...)
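With partitionBy("id") the writer lays the output out as id=A/, id=B/, ... subdirectories under the target path, so each record type ends up in its own folder of CSV part files.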

Get all possible combinations of 3 values from n possible elements

I'm trying to get the list of all the possible combinations of 3 elements from a list of 30 items. I tried to use the following code, but it fails throwing an OutOfMemoryError. Is there any alternative approach which is more efficient than this?
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).
select("item_id").distinct.cache
items.take(1) // Compute cache before join
val itemCombinations = items.select($"item_id".alias("id_A")).
join(
items.select($"item_id".alias("id_B")), $"id_A".lt($"id_B")).
join(
items.select($"item_id".alias("id_C")), $"id_B".lt($"id_C"))
The approach seems OK but might generate quite a bit of overhead at the query execution level. Given that n is a fairly small number, we could do it using the Scala collections implementation directly:
val localItems = items.collect
val combinations = localItems.combinations(3)
The result is an iterator that can be consumed one element at a time, without significant memory overhead.
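For example, the iterator can be traversed lazily (a small usage sketch reusing localItems from above):
// Only the combinations actually requested get computed.
localItems.combinations(3).take(5).foreach(c => println(c.mkString(", ")))
// or count them without keeping them around:
val total = localItems.combinations(3).size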
Spark Version (edit)
Given the desire to make a Spark version of this, it could be possible to avoid the query planner (assuming that the issue is there) by dropping to the RDD level. This is basically the same expression as the join in the question:
// Assumes item_id is an Int so the comparisons below work; adjust the extraction to the column's actual type.
val items = sqlContext.table(SOURCE_DB + "." + SOURCE_TABLE).select("item_id").rdd.map(_.getAs[Int]("item_id"))
val combinations = items.cartesian(items).filter{case(x,y) => x<y}.cartesian(items).filter{case ((x,y),z) => y<z}
Running equivalent code on my local machine:
val data = List.fill(1000)(scala.util.Random.nextInt(999))
val rdd = sparkContext.parallelize(data)
val combinations = rdd.cartesian(rdd).filter{case(x,y) => x<y}.cartesian(rdd).filter{case ((x,y),z) => y<z}
combinations.count
// res5: Long = 165623528

Merge multiple RDD generated in loop

I am calling a function in Scala which gives an RDD[(Long, Long, Double)] as its output.
def helperfunction(): RDD[(Long, Long, Double)]
I call this function in a loop in another part of the code, and I want to merge all the generated RDDs. The loop calling the function looks something like this:
for (i <- 1 to n){
val tOp = helperfunction()
// merge the generated tOp
}
What I want to do is something similar to what a StringBuilder would do for you in Java when you want to merge strings. I have looked at techniques for merging RDDs, which mostly point to using the union function, like this:
RDD1.union(RDD2)
But this requires both RDDs to be generated before taking their union. I thought of initializing a var RDD1 outside the for loop to accumulate the results, but I am not sure how I can initialize a blank RDD of type (Long, Long, Double). Also, I am just starting out with Spark, so I am not even sure whether this is the most elegant way to solve this problem.
Instead of using vars, you can use functional programming paradigms to achieve what you want:
val rdd = (1 to n).map(_ => helperfunction()).reduce(_ union _)
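One side note, offered as a hedged suggestion rather than a requirement: reduce(_ union _) nests the unions and builds a fairly deep lineage when n is large, whereas SparkContext.union takes a whole sequence of RDDs and flattens it in one step:
// Same merged result, but a single flat union over all generated RDDs.
val merged = sc.union((1 to n).map(_ => helperfunction()))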
Also, if you still need to create an empty RDD, you can do it using:
val empty = sc.emptyRDD[(Long, Long, Double)]
You're correct that this might not be the optimal way to do this, but we would need more info on what you're trying to accomplish with generating a new RDD with each call to your helper function.
You could define one RDD prior to the loop, assign it to a var, and then build it up through your loop. Here's an example:
val rdd = sc.parallelize(1 to 100)
val rdd_tuple = rdd.map(x => (x.toLong, (x*10).toLong, x.toDouble))
var new_rdd = rdd_tuple
println("Initial RDD count: " + new_rdd.count())
for (i <- 2 to 4) {
new_rdd = new_rdd.union(rdd_tuple)
}
println("New count after loop: " + new_rdd.count())