Sorting and Merging Using spark-shell - scala

I have a string array in scala
Array[String] = Array(apple, banana, oranges, grapes, lichi, anar)
I have converted it into a format like this:
Array[(Int, String)] = Array((5,apple), (6,banana), (7,oranges), (6,grapes), (5,lichi), (4,anar))
and I want output like this:
Array[(Int, String)] = Array((4,anar), (5,applelichi), (6,bananagrapes), (7,oranges))
meaning that after sorting I want to concatenate the words that share the same key.
I have done the sorting. Here's my code:
val a = sc.parallelize(List("apple","banana","oranges","grapes","lichi","anar"))
val b = a.map(x =>(x.length,x))
val c = b.sortBy(_._2)

You can use groupByKey() to do this and then merge the lists you get with mkString. Small example using what you have (a,b are the same):
val c = b.groupByKey().map{case (key, list) => (key, list.toList.sorted.mkString)}.sortBy(_._1)
c.collect() foreach println
Which will give you:
(4,anar)
(5,applelichi)
(6,bananagrapes)
(7,oranges)
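If the per-key lists could get large, a hedged alternative is reduceByKey, which combines values on the map side instead of materialising whole groups; note that the concatenation order within a key is not guaranteed (a key could come out as lichiapple rather than applelichi) unless the values are sorted first. A minimal sketch, reusing a and b from above:
// Alternative sketch: concatenate the words per key with reduceByKey.
// Caveat: the order of concatenation within a key is not deterministic.
val merged = b.reduceByKey(_ + _).sortBy(_._1)
merged.collect() foreach println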

Related

Using Group By with Array[(String, Int)]

Working on Scala and I have the following:
val file1 = Array(("test",2),("other",5));
val file2 = Array(("test",3),("boom",4));
Then I join the two arrays:
val toGether = file1.union(file2);
Finally, I want to produce a groupBy result that looks like the following:
Array(("test",(2,3)),("other",(5,0)),("boom",(0,4)))
Is this possible?
If I understand your requirements, then what you want can be done with the following piece of code:
val file1 = Array(("test", 2), ("other", 5))
val file2 = Array(("test", 3), ("boom", 4))
val map1 = file1.toMap
val map2 = file2.toMap
val allKeys = map1.keySet ++ map2.keySet
val result: Array[(String, (Int, Int))] = allKeys.map(k => (k, (map1.getOrElse(k, 0), map2.getOrElse(k, 0))))(scala.collection.breakOut)
println(result.mkString)
The idea is simple: convert both arrays to maps and create the resulting array by iterating over the joined key set. Note that this code does not preserve any order, but I'm not sure that matters, and it makes the code much simpler. Note also that this code requires the collections to fit into memory, several times over in fact. If file1 and file2 are actually the contents of big files that do not fit into memory, a much more complicated algorithm would be needed.
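If the order of the expected output matters, a hypothetical order-preserving variant (an assumption on my part, not part of the original answer) is to iterate over the keys in the order they first appear, reusing map1 and map2 from above:
// Keep keys in first-appearance order (file1 first, then file2).
val orderedKeys = (file1.map(_._1) ++ file2.map(_._1)).distinct
val ordered: Array[(String, (Int, Int))] =
  orderedKeys.map(k => (k, (map1.getOrElse(k, 0), map2.getOrElse(k, 0))))
println(ordered.mkString)
// (test,(2,3))(other,(5,0))(boom,(0,4))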

Spark scala filter multiple rdd based on string length

I am trying to solve a quiz question, which is as follows:
Write the missing code in the given program to display the expected output, identifying animals that have names with four letters.
Output: Array((4,lion))
Program
val a = sc.parallelize(List("dog","tiger","lion","cat","spider","eagle"),2)
val b = a.keyBy(_.length)
val c = sc.parallelize(List("ant","falcon","squid"),2)
val d = c.keyBy(_.length)
I have tried to write the code in spark-shell but am stuck on the syntax for combining the four RDDs and applying a filter.
How about using the PairRDD lookup method:
b.lookup(4).toArray
// res1: Array[String] = Array(lion)
d.lookup(4).toArray
// res2: Array[String] = Array()
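If the quiz really expects the (key, value) pair rather than just the name, a hedged sketch that sticks with the filter idea (an assumption about what the quiz intends, not necessarily the official answer) is to union the keyed RDDs and filter on the key:
// Union the keyed RDDs and keep entries whose key (the name length) is 4.
val e = b.union(d).filter { case (length, _) => length == 4 }
e.collect()
// Array((4,lion))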

Spark RDD map internal object to Row

My initial data from a CSV file is:
1 ,21658392713 ,21626890421
1 ,21623461747 ,21626890421
1 ,21623461747 ,21626890421
The data I have after a few transformations and grouping based on business logic is:
scala> val sGrouped = grouped
sGrouped: org.apache.spark.rdd.RDD[(String, Iterable[(String,
(Array[String], String))])] = ShuffledRDD[85] at groupBy at <console>:51
scala> sGrouped.foreach(f=>println(f))
(21626890421,CompactBuffer((21626890421,
([Ljava.lang.String;@62ac8444,21626890421)),
(21626890421,([Ljava.lang.String;@59d80fe,21626890421)),
(21626890421,([Ljava.lang.String;@270042e8,21626890421)),
From this I want to get a map that yields something like the following format:
[String, Row[String]]
so the data may look like:
[ 21626890421 , Row[(1 ,21658392713 ,21626890421)
, (1 ,21623461747 ,21626890421)
, (1 ,21623461747,21626890421)]]
I would really appreciate any guidance on moving forward with this.
I found an answer, but I am not sure it is efficient; any better approaches are appreciated, as this feels more like a hack.
scala> import org.apache.spark.sql.Row
scala> val grouped = cToP.groupBy(_._1)
grouped: org.apache.spark.rdd.RDD[(String, Iterable[(String,
(Array[String], String))])]
scala> val sGrouped = grouped.map(f => f._2.toList)
sGrouped: org.apache.spark.rdd.RDD[List[(String, (Array[String],
String))]]
scala> val tGrouped = sGrouped.map(f =>f.map(_._2).map(c =>
Row(c._1(0), c._1(12), c._1(18))))
tGrouped: org.apache.spark.rdd.RDD[List[org.apache.spark.sql.Row]] =
MapPartitionsRDD[42] a
scala> tGrouped.foreach(f => println(f))
yields
List([1,21658392713,21626890421], [1,21623461747,21626890421],
[1,21623461747,21626890421])
scala> tGrouped.count()
res6: Long = 1
The answer I am getting is correct, and even the count is correct. However, I do not understand why the count is 1.
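For what it's worth, the count is 1 because groupBy produces one element per distinct key, and the sample data contains a single key (21626890421). A hedged sketch of a variant that keeps the key alongside its Rows (assuming cToP has the RDD[(String, (Array[String], String))] shape shown above, and that the Array[String] has enough columns for indices 0, 12 and 18, as in the original code):
import org.apache.spark.sql.Row
// groupByKey keeps the key, so the result pairs each key with its list of Rows.
val rowsByKey = cToP
  .groupByKey()
  .mapValues(_.map { case (arr, _) => Row(arr(0), arr(12), arr(18)) }.toList)
rowsByKey.count()   // still one element per distinct key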

How to unpack a map/list in scala to tuples for a variadic function?

I'm trying to create a PairRDD in spark. For that I need a tuple2 RDD, like RDD[(String, String)]. However, I have an RDD[Map[String, String]].
I can't work out how to get rid of the iterable so I'm just left with RDD[(String, String)] rather than e.g. RDD[List[(String, String)]].
A simple demo of what I'm trying to make work is this broken code:
val lines = sparkContext.textFile("data.txt")
val pairs = lines.map(s => Map(s -> 1))
val counts = pairs.reduceByKey((a, b) => a + b)
The last line doesn't work because pairs is an RDD[Map[String, Int]] when it needs to be an RDD[(String, Int)].
So how can I get rid of the iterable in pairs above to convert the Map to just a tuple2?
You can actually just run:
val counts = pairs.flatMap(identity).reduceByKey(_ + _)
Note the use of the identity function, which replicates the functionality of flatten on an RDD; also, reduceByKey() has a nifty underscore notation for conciseness.
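A minimal end-to-end sketch in spark-shell (assuming sc is the shell's SparkContext and that "data.txt" exists, as in the question):
val lines  = sc.textFile("data.txt")
val pairs  = lines.map(s => Map(s -> 1))   // RDD[Map[String, Int]], one single-entry map per line
val counts = pairs.flatMap(identity)       // flatten each Map into its (String, Int) entries
                  .reduceByKey(_ + _)      // RDD[(String, Int)]
counts.collect().foreach(println)
That said, if each map only ever holds a single entry, mapping straight to a tuple with lines.map(s => (s, 1)) avoids the intermediate Map entirely.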

Summing items within a Tuple

Below is a data structure, a list of tuples of type List[(String, String, Int)]:
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurrences of the Int value associated with each id, so the above data structure should be converted to List((id1,a,3), (id2,a,1)).
This is what I have come up with, but I'm unsure how to group similar items within a tuple:
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD; I'm using a List in this example for local testing, but the same solution should be compatible with an RDD.
Update: based on the following code provided by maasg:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get the format I expect, which is of type RDD[(String, Seq[(String, Int)])], corresponding to RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is slightly better when handling large sets, as it does not replicate the ids in the accumulated value.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (i.e. all the items sharing the same id) to know the count.
@vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs partition f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the function below, when used with groupBy, will give you a map whose keys are the ids.
(Sorry, I don't have access to a Scala compiler, so I can't test.)
def f(tuple: (String, String, Int)): String = tuple._1
Then you will have to iterate through the List for each id in the Map and sum up the number of integer occurrences. That is straightforward, but if you still need help, ask in the comments.
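A hedged sketch of that plain-Scala route against data3 (a local List, not an RDD), grouping on both the id and the name so the second field is preserved, and summing the Int column (which for this data equals the group size mentioned above):
val counted = data3
  .groupBy { case (id, name, _) => (id, name) }                      // Map[(String, String), List[(String, String, Int)]]
  .map { case ((id, name), rows) => (id, name, rows.map(_._3).sum) } // sum the Int per group
  .toList
// List((id1,a,3), (id2,a,1))  -- Map iteration order is not guaranteed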
The following is the most readable, efficient, and scalable approach:
data.map {
case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give an RDD[((String, String), Int)]. Using reduceByKey means the summation is parallelized, i.e. for very large groups it will be distributed and the summation will happen on the map side. Think about the case where there are only 10 groups but billions of records: using .sum won't scale, as it can only distribute the work across 10 cores.
A few more notes about the other answers:
Using head here is unnecessary: since the key already holds the id and the name, .mapValues(v => (v.head._1, v.head._2, v.map(_._3).sum)) can be replaced with .map { case ((id, name), v) => (id, name, v.map(_._3).sum) }
Using a foldLeft here is really horrible when the above shows .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
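Putting the reduceByKey suggestion together against the question's data3, a minimal spark-shell sketch (assuming sc is available):
val data3 = List(("id1", "a", 1), ("id1", "a", 1), ("id1", "a", 1), ("id2", "a", 1))
val rdd   = sc.parallelize(data3)
val summed = rdd
  .map { case (id, name, n) => ((id, name), n) }   // key by (id, name)
  .reduceByKey(_ + _)                              // map-side combining, scales to large groups
  .map { case ((id, name), total) => (id, name, total) }
summed.collect()
// Array((id1,a,3), (id2,a,1))  -- array ordering is not guaranteed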