Spark Split RDD into chunks and concatenate - scala

I have a relatively simple problem.
I have a large Spark RDD[String] (containing JSON). In my use case I want to group (concatenate) N strings together into a new RDD[String], so that it has roughly oldRDD.size/N elements (rounded up).
pseudo example:
val oldRDD : RDD[String] = ['{"id": 1}', '{"id": 2}', '{"id": 3}', '{"id": 4}']
val newRDD : RDD[String] = someTransformation(oldRDD, ",", 2)
newRDD = ['{"id": 1},{"id": 2}','{"id": 3},{"id": 4}']
val anotherRDD : RDD[String] = someTransformation(oldRDD, ",", 3)
anotherRDD = ['{"id": 1},{"id": 2},{"id": 3}','{"id": 4}']
I already looked for a similar case, but couldn't find anything.
Thanks!

You have to use the zipWithIndex function and then compute each element's group from its index.
For example, index = 3 and n (the group size) = 2 give the second group: 3 / 2 = 1 (integer division), which is the second group when counting from 0.
val n = 3 // number of strings per group
val newRDD1 = oldRDD.zipWithIndex() // creates tuples (element, index)
// map to tuple (group, content)
.map(x => (x._2 / n, x._1))
// merge
.reduceByKey(_ + ", " + _)
// remove key
.map(x => x._2)
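One note: zipWithIndex numbers the elements in the RDD's internal order (by partition, then by position within each partition). That order may not be meaningful for your business logic, so check whether it is acceptable in your case. If not, sort the RDD first and then use zipWithIndex.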
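The index / n arithmetic can be tried on a plain Scala collection before moving to an RDD. The following is a minimal sketch, not Spark code: chunkConcat is a hypothetical helper name, and it sorts explicitly by index so the result does not depend on merge order. On an RDD the same care may be needed, since reduceByKey expects a commutative and associative function and string concatenation is not commutative.

```scala
// Local-collection sketch of the index / n grouping; no Spark required.
// chunkConcat is a hypothetical helper, not part of any library.
val old = Seq("""{"id": 1}""", """{"id": 2}""", """{"id": 3}""", """{"id": 4}""")

def chunkConcat(items: Seq[String], sep: String, n: Int): Seq[String] =
  items.zipWithIndex
    .groupBy { case (_, idx) => idx / n }            // integer division picks the chunk
    .toSeq
    .sortBy { case (chunk, _) => chunk }             // restore chunk order
    .map { case (_, members) =>
      members.sortBy(_._2).map(_._1).mkString(sep)   // keep original order inside a chunk
    }

val byTwo = chunkConcat(old, ",", 2)
val byThree = chunkConcat(old, ",", 3)
```

With n = 3 the last chunk holds only the leftover element, which matches the anotherRDD example in the question.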

Related

SPARK: sum of elements with this same indexes from RDD[Array[Int]] in spark-rdd

I have three files like:
file1: 1,2,3,4,5
6,7,8,9,10
file2: 11,12,13,14,15
16,17,18,19,20
file3: 21,22,23,24,25
26,27,28,29,30
I have to find the sum of rows from each file:
1+2+3+4+5 + 11+12+13+14+15 + 21+22+23+24+25
6+7+8+9+10 + 16+17+18+19+20 + 26+27+28+29+30
I have written the following code in spark-scala to get an array of the row sums for each file:
val filesRDD = sc.wholeTextFiles("path to folder\\numbers\\*")
// creating RDD[Array[String]]
val linesRDD = filesRDD.map(elem => elem._2.split("\\n"))
// creating RDD[Array[Array[Int]]]
val rdd1 = linesRDD.map(line => line.map(str => str.split(",").map(_.trim.toInt)))
// creating RDD[Array[Int]]
val rdd2 = rdd1.map(elem => elem.map(e => e.sum))
rdd2.collect.foreach(elem => println(elem.mkString(",")))
the output I am getting is:
15,40
65,90
115,140
What I want is to sum 15+65+115 and 40+90+140
Any help is appreciated!
PS:
The files can have different numbers of lines (some with 3 lines, others with 4), and there can be any number of files.
I want to do this using RDDs only, not DataFrames.
You can use reduce to sum up the arrays:
val result = rdd2.reduce((x,y) => (x,y).zipped.map(_ + _))
// result: Array[Int] = Array(195, 270)
And if the files have different numbers of lines (e.g. file 3 has only one line, 21,22,23,24,25), pad the shorter array with zeros using zipAll:
val result = rdd2.reduce((x,y) => x.zipAll(y,0,0).map{case (a, b) => a + b})
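To see what the zero padding does, the same reduce function can be run on local arrays. This is a small sketch with made-up per-file row sums, assuming file 3 has only its first line; rowSums and totals are illustrative names.

```scala
// Per-file row sums; the third file has one line only (assumed for illustration).
val rowSums = Seq(Array(15, 40), Array(65, 90), Array(115))

// zipAll pads the shorter array with 0 so no position is dropped.
val totals = rowSums.reduce((x, y) => x.zipAll(y, 0, 0).map { case (a, b) => a + b })
// totals contains 195 (15 + 65 + 115) and 130 (40 + 90 + 0)
```

A plain zip would truncate to the shorter array and silently lose the second position as soon as the one-line file is merged in.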

Spark - calculate max occurrence per day-event

I have the following RDD[String]:
1:AAAAABAAAAABAAAAABAAABBB
2:BBAAAAAAAAAABBAAAAAAAAAA
3:BBBBBBBBAAAABBAAAAAAAAAA
The first number is supposed to be days and the following characters are events.
I have to calculate the day where each event has the maximum occurrence.
The expected result for this dataset should be:
{ "A" -> Day2 , "B" -> Day3 }
(A occurs 20 times on day 2 and B occurs 10 times on day 3)
I am splitting the original dataset
val foo = rdd.map(_.split(":")).map(x => (x(0), x(1).split("")) )
What could be the best implementation for count and aggregation?
Any help is appreciated.
This should do the trick:
import org.apache.spark.rdd.RDD
val rdd = sqlContext.sparkContext.makeRDD(Seq(
"1:AAAAABAAAAABAAAAABAAABBB",
"2:BBAAAAAAAAAABBAAAAAAAAAA",
"3:BBBBBBBBAAAABBAAAAAAAAAA"
))
val keys = Seq("A", "B")
// (day label, event -> count) for each input line
val seqOfMaps: RDD[(String, Map[String, Int])] = rdd.map{str =>
val split = str.split(":")
(s"Day${split.head}", split(1).groupBy(a => a.toString).map{case (k, v) => k -> v.length})
}
// for each event, pick the day with the highest count (0 if the event is absent that day)
keys.map{key => {
key -> seqOfMaps.mapValues(_.getOrElse(key, 0)).sortBy(a => -a._2).first._1
}}.toMap
The processing consists of transforming the data into an RDD shape on which simple functions, such as finding the maximum of a list, are easy to apply.
I will try to explain step by step.
I've used dummy data for the "A" and "B" chars.
The foo RDD is the first step; it gives you an RDD[(String, Array[String])].
Let's count each char in the Array[String]:
val res3 = foo.map{case (d,s)=> (d, s.toList.groupBy(c => c).map{case (x, xs) => (x, xs.size)}.toList)}
(1,List((A,18), (B,6)))
(2,List((A,20), (B,4)))
(3,List((A,14), (B,10)))
Next we flatMap over the values to expand the RDD by char:
res3.flatMapValues(list => list)
(3,(A,14))
(3,(B,10))
(1,(A,18))
(2,(A,20))
(2,(B,4))
(1,(B,6))
Rearrange the RDD so that it is easier to read:
res5.map{case (d, (s, c)) => (s, c, d)}
(A,20,2)
(B,4,2)
(A,18,1)
(B,6,1)
(A,14,3)
(B,10,3)
Now we group by char:
res7.groupBy(_._1)
(A,CompactBuffer((A,18,1), (A,20,2), (A,14,3)))
(B,CompactBuffer((B,6,1), (B,4,2), (B,10,3)))
Finally we take the maximum count for each char:
res9.map{case (s, list) => (s, list.maxBy(_._2))}
(B,(B,10,3))
(A,(A,20,2))
Hope this helps.
Previous answers are good, but I prefer this solution:
val data = Seq(
"1:AAAAABAAAAABAAAAABAAABBB",
"2:BBAAAAAAAAAABBAAAAAAAAAA",
"3:BBBBBBBBAAAABBAAAAAAAAAA"
)
val initialRDD = sparkContext.parallelize(data)
// to tuples like (1,'A',18)
val charCountRDD = initialRDD.flatMap(s => {
val parts = s.split(":")
val charCount = parts(1).groupBy(i => i).mapValues(_.length)
charCount.map(i => (parts(0), i._1, i._2))
})
// group by character, and take max value from grouped collection
val result = charCountRDD.groupBy(i => i._2).map(k => k._2.maxBy(z => z._3))
result.foreach(println(_))
Result is:
(3,B,10)
(2,A,20)
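For checking the expected result without a cluster, the same count-then-max logic can be written against plain Scala collections. This is a sketch under that assumption (local Seq instead of an RDD); the names counts and busiestDay are illustrative only.

```scala
val lines = Seq(
  "1:AAAAABAAAAABAAAAABAAABBB",
  "2:BBAAAAAAAAAABBAAAAAAAAAA",
  "3:BBBBBBBBAAAABBAAAAAAAAAA"
)

// (event, (day label, count)) for every event seen on every day
val counts: Seq[(Char, (String, Int))] = lines.flatMap { line =>
  val parts = line.split(":", 2)
  val (day, events) = (parts(0), parts(1))
  events.groupBy(identity).map { case (event, occurrences) =>
    (event, (s"Day$day", occurrences.length))
  }
}

// for each event keep the day label with the highest count
val busiestDay: Map[Char, String] = counts
  .groupBy(_._1)
  .map { case (event, rows) => event -> rows.map(_._2).maxBy(_._2)._1 }
```

Here busiestDay maps A to Day2 and B to Day3, the result the question asks for.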

how to merge 2 different rdd in spark using scala

I'm trying to merge 2 RDDs into one. Suppose rdd1 consists of 2 records of 2 elements each, both strings, e.g.:
key_A:value_A and Key_B:value_B
rdd2 consists of 1 record of 2 elements, both of which are strings:
key_C:value_c
My final RDD would look like this:
key_A :value_A , Key_B :value_B , key_C :value_c
We could use the union method of RDD, but it's not working. Please help.
When using union on 2 RDDs, must the rows of the 2 different RDDs contain the same number of elements, or can their sizes differ?
Try with join:
join(otherDataset, [numTasks])
When called on datasets of type (K, V) and (K, W), returns a dataset of (K, (V, W)) pairs with all pairs of elements for each key. Outer joins are supported through leftOuterJoin, rightOuterJoin, and fullOuterJoin.
See the associated section of the docs
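The (K, V) and (K, W) to (K, (V, W)) behaviour described above can be illustrated on local pair sequences. This is only a sketch of the semantics: innerJoin is a hypothetical helper written for this example, not Spark's implementation, and the sample data is made up.

```scala
// Hypothetical helper: an inner join over local pair sequences.
def innerJoin[K, V, W](left: Seq[(K, V)], right: Seq[(K, W)]): Seq[(K, (V, W))] =
  for {
    (k, v) <- left
    (k2, w) <- right
    if k == k2                      // keep only pairs whose keys match
  } yield (k, (v, w))

val ages = Seq("ann" -> 31, "bob" -> 45, "ann" -> 32)
val cities = Seq("ann" -> "Oslo", "cat" -> "Rome")

val joined = innerJoin(ages, cities)
// joined == Seq(("ann", (31, "Oslo")), ("ann", (32, "Oslo")))
```

Keys present on only one side ("bob", "cat") drop out of an inner join, whereas union simply appends the records of both inputs, so it suits the case where all records should be kept.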
union does work.
Sample code is:
val rdd = sparkContext.parallelize(1 to 10, 3)
val pairRDD = rdd.map { x => (x, x) }
val rdd1 = sparkContext.parallelize(11 to 20, 3)
val pairRDD1 = rdd1.map { x => (x, x) }
pairRDD.union(pairRDD1).foreach(tuple => {
println(tuple._1)
println(tuple._2)
})

How to do data cleansing in Scala

I just started Scala on Spark, so I am not sure whether my question is workable or whether I should turn to another solution/tool:
I have a text file for word counting and sorting, here is the file.
I load the file into HDFS
I then use the following code in Scala to do the counting:
val file = sc.textFile("hdfs://localhost:9000/Peter")
val counts = file.flatMap(line => line.split(" ")).map(p => (p,1)).reduceByKey(_+_).sortByKey(true,1)
counts.saveAsTextFile("Peter_SortedOutput6")
I checked the result on hdfs by hdfs dfs -cat hdfs://localhost:9000/user/root/Peter_SortedOutput5/part-00000
Part of the result is posted here for the convenience of reading:
((For,1)
((not,1)
(1,8)
(10,8)
(11,8)
(12,8)
(13,8)
(14,8)
(15,7)
(16,7)
(17,7)
(18,7)
(19,6)
(2,8)
(20,5)
(21,5)
(22,4)
(23,2)
(24,2)
(25,2)
(3,8)
(4,8)
(5,8)
(6,8)
(7,8)
(8,8)
(9,8)
(Abraham,,1)
(According,1)
(Amen.,4)
(And,19)
(As,5)
(Asia,,1)
(Babylon,,1)
(Balaam,1)
(Be,2)
(Because,1)
First, this is not what I expect: I want the result shown in descending order of count.
Second, there are results like the following:
(God,25)
(God's,1)
(God,,9)
(God,),1)
(God.,6)
(God:,2)
(God;,2)
(God?,1)
How can I do some cleansing in the split so that these occurrences are grouped into one (God, 47)?
Thank you very much.
There is a course, BerkeleyX: CS105x Introduction to Apache Spark, on edx.org by Berkeley & Databricks. One of the assignments is a word count.
The steps are:
remove punctuation, by replacing "[^A-Za-z0-9\s]+" with "" (or "[^A-Za-z\s]+" to drop numbers as well)
trim all spaces
lowercase all words
We can add an extra step:
remove stop words
Code as follows:
import org.apache.spark.ml.feature.StopWordsRemover
import org.apache.spark.sql.functions.{explode, split}
import spark.implicits._ // assumes `spark` is the SparkSession, as in spark-shell
// val reg = raw"[^A-Za-z0-9\s]+" // with numbers
val reg = raw"[^A-Za-z\s]+" // no numbers
val lines = sc.textFile("peter.txt").
map(_.replaceAll(reg, "").trim.toLowerCase).toDF("line")
val words = lines.select(split($"line", " ").alias("words"))
val remover = new StopWordsRemover()
.setInputCol("words")
.setOutputCol("filtered")
val noStopWords = remover.transform(words)
// explode to one word per row, then drop to an RDD so reduceByKey is available
val counts = noStopWords.select(explode($"filtered")).as[String].rdd
.filter(_.nonEmpty).map(word => (word, 1))
.reduceByKey(_+_)
// from word -> num to num -> word
val mostCommon = counts.map(p => (p._2, p._1)).sortByKey(false, 1)
mostCommon.take(5)
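The cleaning steps listed above can be tried on a handful of tokens without Spark. This is a minimal sketch on a local Seq: normalize is a hypothetical helper, the sample tokens are taken from the question's output, and stripping a trailing 's is an extra assumption so that "God's" counts as "god".

```scala
val tokens = Seq("God", "God's", "God,", "God,)", "God.", "God:", "God;", "God?", "Amen.", "And")

// Hypothetical helper: lowercase, drop a trailing 's, then drop everything that is not a letter.
def normalize(token: String): String =
  token.toLowerCase.replaceAll("'s\\b|[^a-z]", "")

val counts: Seq[(String, Int)] = tokens
  .map(normalize)
  .filter(_.nonEmpty)                          // tokens that were only punctuation or digits
  .groupBy(identity)
  .map { case (word, ws) => (word, ws.size) }
  .toSeq
  .sortBy { case (word, n) => (-n, word) }     // descending by count, then alphabetical
```

All eight "God" variants collapse into one entry, which sorts first because the sort key negates the count.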
Clean the data with replaceAll, applied to each token (an Array has no replaceAll, so map over the split result):
val counts = file.flatMap(line => line.trim.toLowerCase.split(" ").map(_.replaceAll("'s\\b|[\\W\\d]", "")))
sort by value in scala API:
.map(item => item.swap) // interchanges position of entries in each tuple
.sortByKey(true, 1) // 1st arg configures ascending sort, 2nd arg configures one task
.map(item => item.swap)
sort by value in python API:
.map(lambda kv: (kv[1], kv[0])) \
.sortByKey(True, 1) \
.map(lambda kv: (kv[1], kv[0]))
# sortByKey: 1st arg configures ascending sort, 2nd configures 1 task
Code should look like this:
val file = sc.textFile("hdfs://localhost:9000/Peter")
val counts = file.flatMap(line => line.trim.toLowerCase.split(" ").map(_.replaceAll("'s\\b|[\\W\\d]", "")))
.filter(_.nonEmpty) // drop tokens that were only punctuation or digits
.map(p => (p,1))
.reduceByKey(_+_)
.map(rec => rec.swap)
.sortByKey(false, 1) // false = descending, so the most frequent words come first
.map(rec => rec.swap)
counts.saveAsTextFile("Peter_SortedOutput6")
See a Scala regular-expressions reference for what \\W, \\d, and character classes such as [;:',.?] mean.

Summing items within a Tuple

Below is a data structure, a List of tuples of type List[(String, String, Int)]:
val data3 = (List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1)) )
//> data3 : List[(String, String, Int)] = List((id1,a,1), (id1,a,1), (id1,a,1),
//| (id2,a,1))
I'm attempting to count the occurrences of each Int value associated with each id, so the above data structure should be converted to List((id1,a,3) , (id2,a,1)).
This is what I have come up with, but I'm unsure how to group similar items within a tuple:
data3.map( { case (id,name,num) => (id , name , num + 1)})
//> res0: List[(String, String, Int)] = List((id1,a,2), (id1,a,2), (id1,a,2), (i
//| d2,a,2))
In practice data3 is a Spark RDD; I'm using a List in this example for local testing, but the same solution should work with an RDD.
Update: based on the following code provided by maasg:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
I needed to amend it slightly to get the format I expect, which is of type
.RDD[(String, Seq[(String, Int)])]
which corresponds to .RDD[(id, Seq[(name, count-of-names)])]:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => ((id1),(id2,values.sum))}
val counted = result.groupByKey
In Spark, you would do something like this: (using Spark Shell to illustrate)
val l = List( ("id1" , "a", 1), ("id1" , "a", 1), ("id1" , "a", 1) , ("id2" , "a", 1))
val rdd = sc.parallelize(l)
val grouped = rdd.groupBy{case (id1,id2,v) => (id1,id2)}
val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}
Another option would be to map the rdd into a PairRDD and use groupByKey:
val byKey = rdd.map({case (id1,id2,v) => (id1,id2)->v})
val byKeyGrouped = byKey.groupByKey
val result = byKeyGrouped.map{case ((id1,id2),values) => (id1,id2,values.sum)}
Option 2 is slightly better when handling large data sets, as it does not replicate the ids in the accumulated values.
This seems to work when I use scala-ide:
data3
.groupBy(tupl => (tupl._1, tupl._2))
.mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum))
.values.toList
And the result is the same as required by the question
res0: List[(String, String, Int)] = List((id1,a,3), (id2,a,1))
You should look into List.groupBy.
You can use the id as the key, and then use the length of your values in the map (ie all the items sharing the same id) to know the count.
@vptheron has the right idea.
As can be seen in the docs
def groupBy[K](f: (A) ⇒ K): Map[K, List[A]]
Partitions this list into a map of lists according to some discriminator function.
Note: this method is not re-implemented by views. This means when applied to a view it will always force the view and return a new list.
K the type of keys returned by the discriminator function.
f the discriminator function.
returns
A map from keys to lists such that the following invariant holds:
(xs partition f)(k) = xs filter (x => f(x) == k)
That is, every key k is bound to a list of those elements x for which f(x) equals k.
So something like the function below, when used with groupBy, will give you a Map whose keys are the ids.
(Sorry, I don't have access to a Scala compiler, so I can't test.)
def f(tuple: (String, String, Int)): String = {
tuple._1
}
Then you will have to iterate through the List for each id in the Map and sum up the integer values. That is straightforward, but if you still need help, ask in the comments.
The following is the most readable, efficient and scalable:
data.map {
case (key1, key2, value) => ((key1, key2), value)
}
.reduceByKey(_ + _)
which will give an RDD[((String, String), Int)]; add a final .map { case ((key1, key2), sum) => (key1, key2, sum) } if you need an RDD[(String, String, Int)]. Using reduceByKey means the summation parallelizes: even for very large groups the work is distributed, with partial sums computed on the map side. Think about the case where there are only 10 groups but billions of records: calling .sum on the grouped values won't scale, as it can only be distributed across 10 cores.
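The per-key merging that reduceByKey performs can be mimicked on a local list, which is handy for testing the reshaping step. This is a sketch only: reduceByKeyLocal is a hypothetical helper written for this example and does nothing distributed.

```scala
// Hypothetical helper mimicking reduceByKey on a local sequence.
def reduceByKeyLocal[K, V](pairs: Seq[(K, V)])(f: (V, V) => V): Map[K, V] =
  pairs.foldLeft(Map.empty[K, V]) { case (acc, (k, v)) =>
    // merge v into the running value for k, or start a new entry
    acc.updated(k, acc.get(k).map(f(_, v)).getOrElse(v))
  }

val data3 = List(("id1", "a", 1), ("id1", "a", 1), ("id1", "a", 1), ("id2", "a", 1))

val summed: List[(String, String, Int)] =
  reduceByKeyLocal(data3.map { case (id, name, n) => ((id, name), n) })(_ + _)
    .toList
    .map { case ((id, name), total) => (id, name, total) }   // flatten the key back out
    .sortBy(_._1)
// summed == List(("id1", "a", 3), ("id2", "a", 1))
```

Values are folded into a running total one at a time, so no per-key list is ever materialised; that is the property that lets the Spark version combine partial sums before the shuffle.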
A few more notes about the other answers:
Using head here is unnecessary: .mapValues(v =>(v.head._1,v.head._2, v.map(_._3).sum)) followed by .values can be replaced with .map { case ((k1, k2), v) => (k1, k2, v.map(_._3).sum) }, since the grouping key already holds both fields.
Using a foldLeft here is really horrible when the above shows .map(_._3).sum will do: val result = grouped.map{case ((id1,id2),values) => (id1,id2,values.foldLeft(0){case (cumm, tuple) => cumm + tuple._3})}