Using Group By with Array[(String, Int)] - scala

I am working in Scala and I have the following:
val file1 = Array(("test",2),("other",5));
val file2 = Array(("test",3),("boom",4));
Then I join the two arrays:
val toGether = file1.union(file2);
Finally I want to produce a GroupBy that will produce the following:
Array(("test",(2,3)),("other",(5,0)),("boom",(0,4)))
Is this possible?

If I understand your requirements correctly, then what you want can be done with the following piece of code:
val file1 = Array(("test", 2), ("other", 5))
val file2 = Array(("test", 3), ("boom", 4))
val map1 = file1.toMap
val map2 = file2.toMap
val allKeys = map1.keySet ++ map2.keySet
val result: Array[(String, (Int, Int))] = allKeys.map(k => (k, (map1.getOrElse(k, 0), map2.getOrElse(k, 0))))(scala.collection.breakOut)
println(result.mkString)
The idea is simple: convert both arrays to maps and build the resulting array by iterating over the union of their key sets. Note that this code does not preserve any order, but I'm not sure whether that matters, and it makes the code much simpler. Note also that this code requires the collections to fit into memory, several times over in fact. If file1 and file2 are actually the contents of big files that do not fit into memory, a much more complicated algorithm would be needed.
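If you are on Scala 2.13 or later, where scala.collection.breakOut no longer exists, a minimal sketch of the same idea simply builds the array explicitly:
// a 2.13-friendly variant of the line above (no breakOut)
val result: Array[(String, (Int, Int))] =
  allKeys.iterator
    .map(k => (k, (map1.getOrElse(k, 0), map2.getOrElse(k, 0))))
    .toArray
println(result.mkString)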

Related

Filter a Scala Seq[(String, String)] using a Seq[String]

I have this Seq[(String, String)]:
val tupleSeq: Seq[(String, String)] = Seq(
("aaa", "A_A_A"),
("bbb", "B_B_B"),
("ccc", "C_C_C")
)
I want to use the given seqA on tupleSeq:
val seqA: Seq[String] = Seq("aaa", "bbb")
In order to obtain :
val seqB: Seq[String] = Seq("A_A_A", "B_B_B")
Any ideas?
One approach is to use the data unaltered.
// The size of `data` is M
// The size of `query` is N
val data: Seq[(String, String)] = Seq(
("aaa", "A_A_A"),
("bbb", "B_B_B"),
("ccc", "C_C_C")
)
val query: Seq[String] = Seq("aaa", "bbb")
// Use the data as is
// O(M * N)
for {
(key, value) <- data
lookup <- query
if key == lookup
} yield value
The problem with this approach is that the overall complexity is O(M * N), where M and N are the sizes of the data and query collections. This might be completely acceptable if either M or N is known to be very small, and it can be further improved in practical terms by using functions that can terminate early (like find, exemplified in another answer).
If M and N are reasonably large, you might want to spend the time necessary to convert one of them into a more appropriate data structure (which costs time and space linear in the size of the collection).
Depending on which side you expect to be larger, you can either turn the data into a map and look up the relevant keys, or turn the query into a set and iterate over the data to find which entries are relevant.
In most cases I would expect the data being queried to be larger than the query, so you probably want to turn the data into a map. Keeping the map around also lets you query it multiple times without paying again for the conversion into a structure better suited to querying.
// Turn the query into a set and iterate the data
// O(M)
val lookups = query.toSet
data.collect {
case (key, value) if lookups.contains(key) => value
}
// Turn the data into a map and iterate the query
// O(N)
val map = data.toMap
query.collect(map)
Your tupleSeq naturally looks like a Map of key-to-value pairs, so you should treat it like one. The code becomes very simple with this observation:
val myMap = tupleSeq.toMap
val seqB = seqA.collect(myMap) // List(A_A_A, B_B_B)
At the cost of some additional space, you get O(1) amortized lookup time for your query, which is an acceptable trade-off and arguably a better solution than a linear search through the sequence.
Note the use of collect instead of map: collect skips keys that have no mapping in your Map, whereas map would throw for them.
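As a small hedged illustration (not from the original answer), this works because a Map is a PartialFunction that is only defined on its keys:
val m = Map("aaa" -> "A_A_A", "bbb" -> "B_B_B")
Seq("aaa", "zzz").collect(m) // List(A_A_A) -- "zzz" is silently skipped
// Seq("aaa", "zzz").map(m)  // would throw NoSuchElementException: key not found: zzz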
val tupleSeq: Seq[(String, String)] = Seq(
("aaa", "A_A_A"),
("bbb", "B_B_B"),
("ccc", "C_C_C")
)
val seqA: Seq[String] = Seq("aaa", "bbb")
// List(A_A_A, B_B_B)
val seqB = for {
key <- seqA
value <- tupleSeq.find(_._1 == key).map(_._2)
} yield value
You can try something like this:
val seqB = tupleSeq.filter{x => seqA.contains(x._1)}.map(x => x._2)
It filters the sequence, keeping the tuples whose first element is contained in seqA, and then maps each tuple to its second element.
seqB.foreach(println) then outputs this:
A_A_A
B_B_B
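A hedged aside: contains on a Seq is a linear scan, so this approach is O(M * N), just like the for-comprehension above. Converting seqA to a Set first keeps the same one-liner shape while making the membership check effectively constant time:
val keys = seqA.toSet
val seqB = tupleSeq.filter(x => keys(x._1)).map(x => x._2)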

Sorting and Merging Using spark-shell

I have a string array in Scala:
Array[String] = Array(apple, banana, oranges, grapes, lichi, anar)
I have converted it into a format like this:
Array[(Int, String)] = Array((5,apple), (6,banana), (7,oranges), (6,grapes), (5,lichi), (4,anar))
and I want output like this:
Array[(Int, String)] = Array((4,anar), (5,applelichi), (6,bananagrapes), (7,oranges))
meaning that after sorting I want to concatenate the words that have the same key.
I have done the sorting. Here's my code:
val a = sc.parallelize(List("apple","banana","oranges","grapes","lichi","anar"))
val b = a.map(x =>(x.length,x))
val c = b.sortBy(_._2)
You can use groupByKey() to do this and then merge the lists you get with mkString. A small example using what you have (a and b are the same):
val c = b.groupByKey().map{case (key, list) => (key, list.toList.sorted.mkString)}.sortBy(_._1)
c.collect() foreach println
Which will give you:
(4,anar)
(5,applelichi)
(6,bananagrapes)
(7,oranges)
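As a hedged variation (not part of the original answer): if the per-key groups could get large, aggregateByKey builds the per-key lists while combining map-side before the shuffle, and produces the same output here:
val c = b.aggregateByKey(List.empty[String])(
    (acc, word) => word :: acc, // add each word to its key's per-partition list
    (l1, l2) => l1 ++ l2        // merge lists coming from different partitions
  )
  .map { case (key, words) => (key, words.sorted.mkString) }
  .sortBy(_._1)
c.collect() foreach println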

How do I split a Spark rdd Array[(String, Array[String])]?

I'm practicing doing sorts in the Spark shell. I have an rdd with about 10 columns/variables. I want to sort the whole rdd on the values of column 7.
rdd
org.apache.spark.rdd.RDD[Array[String]] = ...
From what I gather the way to do that is by using sortByKey, which in turn only works on pairs. So I mapped it so I'd have a pair consisting of column 7 (the string values) and the full original row (the array of strings):
rdd2 = rdd.map(c => (c(7),c))
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
I then apply sortByKey, still no problem...
rdd3 = rdd2.sortByKey()
rdd3: org.apache.spark.rdd.RDD[(String, Array[String])] = ...
But now how do I split off, collect and save that sorted original rdd from rdd3 (Array[String])? Whenever I try a split on rdd3 it gives me an error:
val rdd4 = rdd3.map(_.split(',')(2))
<console>:33: error: value split is not a member of (String, Array[String])
What am I doing wrong here? Are there other, better ways to sort an rdd on one of its columns?
What you did with rdd2 = rdd.map(c => (c(7),c)) is map each record to a tuple.
rdd2: org.apache.spark.rdd.RDD[(String, Array[String])]
exactly as it says :).
Now, if you want to split the record, you need to get it out of this tuple.
You can map again, taking only the second part of the tuple (which is the Array[String]), like so: rdd3.map(_._2)
But I would strongly suggest trying rdd.sortBy(_(7)) or something of that sort. That way you do not need to bother with tuples at all.
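A hedged sketch of that suggestion, assuming rdd is the RDD[Array[String]] from the question and that "column 7" means index 7, as in the asker's own c(7):
// sort directly on the column, no (key, row) tuples needed
val sortedRdd = rdd.sortBy(_(7))
// rebuild comma-separated lines and save them (the output path is just an example)
sortedRdd.map(_.mkString(",")).saveAsTextFile("/tmp/sorted-by-column7")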
If you want to sort the rdd using the 7th string in the array, you can just do it directly:
rdd.sortBy(_(6)) // arrays start at 0, not 1
or
rdd.sortBy(arr => arr(6))
That will save you all the hassle of doing multiple transformations. The reason why rdd.sortBy(_._7) or rdd.sortBy(x => x._7) won't work is because that's not how you access an element inside an Array. To access the 7th element of an array, say arr, you should do arr(6).
To test this, I did the following:
val rdd = sc.parallelize(Array(Array("ard", "bas", "wer"), Array("csg", "dip", "hwd"), Array("asg", "qtw", "hasd")))
// I want to sort it using the 3rd String
val sorted_rdd = rdd.sortBy(_(2))
sorted_rdd.collect()
Here's the result, ordered by the 3rd string ("hasd" < "hwd" < "wer"):
Array(Array("asg", "qtw", "hasd"), Array("csg", "dip", "hwd"), Array("ard", "bas", "wer"))
just do this:
val rdd4 = rdd3.map(_._2)
In case you are not familiar with Scala, the following should help you understand the structure a bit more:
rdd3.map(kv => {
println(kv._1) // this represents the String key
println(kv._2) // this represents the Array[String] value
})
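To tie this back to the original goal of collecting and saving the sorted data, a hedged sketch (the output path is just an example):
val rdd4 = rdd3.map(_._2)               // drop the key, back to RDD[Array[String]]
rdd4.map(_.mkString(","))               // one comma-separated line per record
  .saveAsTextFile("/tmp/sorted-output") // example path, adjust as needed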

How to generate key-value format using Scala in Spark

I am studying Spark on VirtualBox. I use ./bin/spark-shell to open Spark and use Scala. Now I am confused about the key-value format in Scala.
I have a txt file in /home/feng/spark/data, which looks like:
panda 0
pink 3
pirate 3
panda 1
pink 4
I use sc.textFile to get this txt file. If I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7")
Then I can use rdd.collect() to show rdd on the screen:
scala> rdd.collect()
res26: Array[String] = Array(panda 0, pink 3, pirate 3, panda 1, pink 4)
However, if I do
val rdd = sc.textFile("/home/feng/spark/data/rdd4.7.txt")
which no ".txt" here. Then when I use rdd.collect(), I got a mistake:
org.apache.hadoop.mapred.InvalidInputException: Input path does not exist: file:/home/feng/spark/A.txt
at org.apache.hadoop.mapred.FileInputFormat.singleThreadedListStatus(FileInputFormat.java:285)
......
But I saw other examples. All of them have ".txt" at the end. Is there something wrong with my code or my system?
Another thing is when I tried to do:
scala> val rddd = rdd.map(x => (x.split(" ")(0),x))
rddd: org.apache.spark.rdd.RDD[(String, String)] = MapPartitionsRDD[2] at map at <console>:29
scala> rddd.collect()
res0: Array[(String, String)] = Array((panda,panda 0), (pink,pink 3), (pirate,pirate 3), (panda,panda 1), (pink,pink 4))
I intended to select the first column of the data and use it as the key. But the output of rddd.collect() does not look right, since each word appears twice. I cannot keep doing the rest of the operations like mapbykey, reducebykey or others. Where did I go wrong?
Just as an example, I create a String with your dataset, then split it by line and use SparkContext's parallelize method to create an RDD. Notice that after creating the RDD I use its map method to split the String stored in each record and convert it to a Row.
import org.apache.spark.sql.Row
val text = "panda 0\npink 3\npirate 3\npanda 1\npink 4"
val rdd = sc.parallelize(text.split("\n")).map(x => Row(x.split(" "):_*))
rdd.take(3)
The output from the take method is:
res4: Array[org.apache.spark.sql.Row] = Array([panda,0], [pink,3], [pirate,3])
About your first question: there is no need for the file to have any extension, because in this case it is read as plain text; the path just has to match the actual file name on disk.
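On the second question, if the intended value is the number rather than the whole line, a hedged sketch (assuming each line is "<word> <number>" as shown in the question) would keep only the second field as the value; after that, pair operations behave as expected:
val pairs = rdd.map { line =>
  val parts = line.split(" ")
  (parts(0), parts(1).toInt) // e.g. ("panda", 0)
}
val summed = pairs.reduceByKey(_ + _) // e.g. ("panda", 1), ("pink", 7), ("pirate", 3)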

Merging two RDDs based on common key and then outputting

I'm fairly new to Scala and Spark and functional programming in general, so forgive me if this is a pretty basic question.
I'm merging two CSV files, so I got a lot of inspiration from this: Merge the intersection of two CSV files with Scala
although that is just Scala code and I wanted to write it in Spark to handle much larger CSV files.
This part of the code I think I've got right:
val csv1 = sc.textFile(Csv1Location).cache()
val csv2 = sc.textFile(Csv2Location).cache()
def GetInput1Key(input: String): Key = Key(getAtIndex(input.split(SplitByCommas, -1), Csv1KeyLocation))
def GetInput2Key(input: String): Key = Key(getAtIndex(input.split(SplitByCommas, -1), Csv2KeyLocation))
val intersectionOfKeys = csv1 map GetInput1Key intersection(csv2 map GetInput2Key)
val map1 = csv1 map (input => GetInput1Key(input) -> input)
val map2 = csv2 map (input => GetInput2Key(input) -> input)
val broadcastedIntersection = sc.broadcast(intersectionOfKeys.collect.toSet)
And this is where I'm a little lost. I have a set of keys (intersectionOfKeys) that are present in both of my RDDs, and I have two RDDs of (Key, String) pairs. If they were plain maps I could just do:
val output = broadcastedIntersection.value map (key => map1(key) + ", " + map2(key))
but that syntax isn't working.
Please let me know if you need any more information about the CSV files or what I'm trying to accomplish. Also, I'd love any syntactical and/or idiomatic comments on my code as well if you all notice anything incorrect.
Update:
val csv1 = sc.textFile(Csv1Location).cache()
val csv2 = sc.textFile(Csv2Location).cache()
def GetInput1Key(input: String): Key = Key(getAtIndex(input.split(SplitByCommas, -1), Csv1KeyLocation))
def GetInput2Key(input: String): Key = Key(getAtIndex(input.split(SplitByCommas, -1), Csv2KeyLocation))
val intersectionOfKeys = csv1 map GetInput1Key intersection(csv2 map GetInput2Key)
val map1 = csv1 map (input => GetInput1Key(input) -> input)
val map2 = csv2 map (input => GetInput2Key(input) -> input)
val intersections = map1.join(map2)
intersections take NumOutputs foreach println
This code worked and did what I needed it to do, but I was wondering whether there are any improvements to make, or performance implications of using join. I remember reading somewhere that join is typically expensive and time consuming because the data has to be shuffled between the distributed workers.
I think hveiga is correct, a join would be simpler:
val csv1KV = csv1.map(line=>(GetInput1Key(line), line))
val csv2KV = csv2.map(line=>(GetInput2Key(line), line))
val joined = csv1KV join csv2KV
joined.mapValues(lineTuple => lineTuple._1 + ", " + lineTuple._2)
This is more performant AND more readable as far as I can see, since you need to join the two sets together at some point anyway, and your way relies on a single-machine mentality where you would pull each collection in to make sure you are requesting the line from all partitions. Note that I used mapValues, which keeps the result hash-partitioned and cuts down on network traffic.
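On the cost concern raised in the question: join does shuffle both RDDs by key, but it does not send all the data to every worker. When one side is small enough to collect, a hedged alternative is a map-side (broadcast) join, sketched here with the map1/map2 names from the question:
// assumes map2 (the smaller side) fits in driver and executor memory
val small = sc.broadcast(map2.collectAsMap())
val joined = map1.flatMap { case (key, line1) =>
  small.value.get(key).map(line2 => line1 + ", " + line2) // keep only keys present in both
}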