How do I use reduceByKey instead of groupBy for data stored as an RDD?
The purpose is to group by key and then sum the values.
I have a working Scala process to find the Odds Ratio.
Problem:
The data we are ingesting into the script has grown drastically, and the job started failing because of memory/disk issues. The main problem is the large amount of shuffling caused by the "GROUP BY".
Sample Data:
(543040000711860,543040000839322,0,0,0,0)
(543040000711860,543040000938728,0,0,1,1)
(543040000711860,543040000984046,0,0,1,1)
(543040000711860,543040001071137,0,0,1,1)
(543040000711860,543040001121115,0,0,1,1)
(543040000711860,543040001281239,0,0,0,0)
(543040000711860,543040001332995,0,0,1,1)
(543040000711860,543040001333073,0,0,1,1)
(543040000839322,543040000938728,0,1,0,0)
(543040000839322,543040000984046,0,1,0,0)
(543040000839322,543040001071137,0,1,0,0)
(543040000839322,543040001121115,0,1,0,0)
(543040000839322,543040001281239,1,0,0,0)
(543040000839322,543040001332995,0,1,0,0)
(543040000839322,543040001333073,0,1,0,0)
(543040000938728,543040000984046,0,0,1,1)
(543040000938728,543040001071137,0,0,1,1)
(543040000938728,543040001121115,0,0,1,1)
(543040000938728,543040001281239,0,0,0,0)
(543040000938728,543040001332995,0,0,1,1)
Here is the code to transform my data:
var groupby = flags.groupBy(item => (item._1, item._2))
var counted_group = groupby.map(item =>
  (item._1,
   item._2.map(_._3).sum,
   item._2.map(_._4).sum,
   item._2.map(_._5).sum,
   item._2.map(_._6).sum))
Result:
((3900001339662,3900002247644),6,12,38,38)
((543040001332995,543040001352893),112,29,57,57)
((3900001572602,543040001071137),1,0,1,1)
((3900001640810,543040001281239),2,1,0,0)
((3900001295323,3900002247644),8,21,8,8)
I need to convert this to use reduceByKey so that the data is reduced within each partition before it is shuffled. I am starting from an RDD of plain tuples, so there is no direct reduceByKey method until I turn it into a key-value pair RDD.
I think I solved the problem by using aggregateByKey.
First, I remapped the RDD to generate key-value pairs:
val rddPair = flags.map(item => ((item._1, item._2), (item._3, item._4, item._5, item._6)))
Then I applied aggregateByKey to the result. Now each partition returns the aggregated result rather than the grouped data:
rddPair.aggregateByKey((0, 0, 0, 0))(
(iTotal, oisubtotal) => (iTotal._1 + oisubtotal._1, iTotal._2 + oisubtotal._2, iTotal._3 + oisubtotal._3, iTotal._4 + oisubtotal._4 ),
(fTotal, iTotal) => (fTotal._1 + iTotal._1, fTotal._2 + iTotal._2, fTotal._3 + iTotal._3, fTotal._4 + iTotal._4)
)
reduceByKey requires an RDD[(K, V)], i.e. a key-value pair RDD, so you should create an RDD of pairs first:
val rddPair = flags.map(item => ((item._1, item._2), (item._3, item._4, item._5, item._6)))
Then you can use reduceByKey on the rddPair above:
rddPair.reduceByKey((x, y)=> (x._1+y._1, x._2+y._2, x._3+y._3, x._4+y._4))
I hope the answer is helpful.
Related
I have a streaming app that takes a DStream, runs an SQL manipulation over the DStream, and dumps the result to a file:
dstream.foreachRDD { rdd =>
  spark.read.json(rdd)
    .select("col")
    .filter("value = 1")
    .write.csv("s3://..")
}
Now I need to be able to take into account the previous calculation (from an earlier batch) in my calculation, something like the following:
dstream.foreachRDD { rdd =>
  val df = spark.read.json(rdd)
  val prev_df = read_prev_calc()
  df.join(prev_df, "id")
    .select("col")
    .filter(prev_df("value").equalTo(1))
    .write.csv("s3://..")
}
Is there a way to write the calculation result to memory somehow and use it as an input to the next calculation?
Have you tried using the persist() method on a DStream? It will automatically persist every RDD of that DStream in memory.
Note that, by default, all input data and persisted RDDs generated by DStream transformations are automatically cleared.
Also, DStreams generated by window-based operations are automatically persisted in memory.
For more details, you can check https://spark.apache.org/docs/latest/streaming-programming-guide.html#caching--persistence
https://spark.apache.org/docs/0.7.2/api/streaming/spark/streaming/DStream.html
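For reference, a minimal sketch of what that could look like with the names from the question (dstream and spark are assumed from the original code; this only keeps the stream's RDDs cached in memory, it does not by itself join against an earlier batch):
import org.apache.spark.storage.StorageLevel

// cache every RDD generated by this DStream in memory
dstream.persist(StorageLevel.MEMORY_ONLY)

dstream.foreachRDD { rdd =>
  spark.read.json(rdd)
    .select("col")
    .filter("value = 1")
    .write.csv("s3://..")
}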
If you only need one or two previously calculated DataFrames, you should look into Spark Streaming window operations.
The snippet below is from the Spark documentation:
val windowedStream1 = stream1.window(Seconds(20))
val windowedStream2 = stream2.window(Minutes(1))
val joinedStream = windowedStream1.join(windowedStream2)
Or, even simpler: if we want to generate a word count over the last 20 seconds of data, every 10 seconds, we have to apply the reduceByKey operation on the DStream of (word, 1) pairs over the last 20 seconds of data. This is done using the operation reduceByKeyAndWindow.
// Reduce last 20 seconds of data, every 10 seconds
val windowedWordCounts = pairs.reduceByKeyAndWindow((a:Int,b:Int) => (a + b), Seconds(20), Seconds(10))
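The same documentation page also describes an overload that takes an inverse reduce function, so Spark can subtract the data leaving the window instead of recomputing the whole window (it requires checkpointing to be enabled); a sketch based on the same pairs stream:
// incrementally maintain the windowed counts: add entering data, subtract leaving data
val windowedWordCounts = pairs.reduceByKeyAndWindow(
  (a: Int, b: Int) => a + b,   // add counts of words entering the window
  (a: Int, b: Int) => a - b,   // subtract counts of words leaving the window
  Seconds(20), Seconds(10))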
More details and examples:
https://spark.apache.org/docs/latest/streaming-programming-guide.html#window-operations
I have a CSV table stored at location "/user/root/sqoopImport/orders":
val orders = sc.textFile("/user/root/sqoopImport/orders")
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).countByKey().foreach(println)
Here I am getting the result unsorted with respect to the key (String, String):
((2014-03-19 00:00:00.0,PENDING),9)
((2014-04-18 00:00:00.0,ON_HOLD),11)
((2013-09-17 00:00:00.0,ON_HOLD),8)
((2014-07-10 00:00:00.0,COMPLETE),57)
I want to sort it, so I have tried:
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).countByKey().sortBy(_._1).foreach(println)
<console>:30: error: value sortBy is not a member of scala.collection.Map[(String, String),Long]
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).countByKey().sortBy(_._1).foreach(println)
countByKey() is an action. It finishes Spark calculation and gives you a normal Scala Map. Since Map is unordered, it makes no sense to sort it: you need to convert it to Seq first, using toSeq. If you want to stay in Spark land, you should use a transformation instead, in this case reduceByKey():
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).reduceByKey(_ + _).sortBy(_._1).foreach(println)
Also, please note that foreach(println) will only work as you expect in local mode: https://spark.apache.org/docs/latest/programming-guide.html#printing-elements-of-an-rdd.
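If the job does run on a cluster, one pattern suggested there is to bring a bounded sample back to the driver before printing; a small sketch (the limit of 20 is arbitrary):
orders.map(_.split(","))
  .map(x => ((x(1), x(3)), 1))
  .reduceByKey(_ + _)
  .sortBy(_._1)
  .take(20)                 // bring a bounded number of rows to the driver
  .foreach(println)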
A Map is an unordered collection. You would need to convert that map into a collection that maintains order and sort it by key. ex:
val sorted = map.toSeq.sortBy{
case (key,_) => key
}
This is because the
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).countByKey()
returns a Map[(String, String), Long], on which we cannot apply the sortBy() function.
What you can do is
val result = orders.map(_.split(","))
  .map(x => ((x(1), x(3)), 1))
  .countByKey()
  .toSeq
// and apply the sortBy function on a new RDD built from the result
sc.parallelize(result).sortBy(_._1).collect().foreach(println)
Hope this helps!
I have the following RDD
val reducedListOfCalls: RDD[(String, List[Row])]
Its contents, grouped by key, are:
[(923066800846, List[2016072211,1,923066800846])]
[(923027659472, List[2016072211,1,92328880275]),
923027659472, List[2016072211,1,92324440275])]
[(923027659475, List[2016072211,1,92328880275]),
(923027659475, List[2016072211,1,92324430275]),
(923027659475, List[2016072211,1,92334340275])]
As shown above, the first group has 1 (key, value) pair, the second has 2, and the third has 3 pairs.
I want to remove all groups that have fewer than 2 key-value pairs. The expected result is:
[(923027659472, List[2016072211,1,92328880275]),
923027659472, List[2016072211,1,92324440275])]
[(923027659475, List[2016072211,1,92328880275]),
(923027659475, List[2016072211,1,92324430275]),
(923027659475, List[2016072211,1,92334340275])]
I have tried the following:
val reducedListOfCalls = listOfMappedCalls.filter(f => f._1.size >1)
but it still gives the original list; the filter does not seem to have made any difference.
Is it possible to count the number of keys in a mapped RDD, and then filter based on the count of keys?
You can use aggregateByKey in Spark to count the number of occurrences of each key.
You should create a Tuple2(count, List[List[Row]]) in your combine function; see the sketch below. The same can be achieved with reduceByKey.
Read this post comparing these two functions.
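Here is a minimal sketch of that idea, assuming listOfMappedCalls: RDD[(String, List[Row])] as in the question; the threshold of 2 and the final flatMap back to (key, value) pairs are my additions:
import org.apache.spark.sql.Row

val filtered = listOfMappedCalls
  .aggregateByKey((0, List.empty[List[Row]]))(
    (acc, value) => (acc._1 + 1, value :: acc._2),   // count and collect this key's values within a partition
    (a, b) => (a._1 + b._1, a._2 ::: b._2))          // merge partial counts and lists across partitions
  .filter { case (_, (count, _)) => count >= 2 }     // keep only keys that appear at least twice
  .flatMap { case (key, (_, values)) => values.map(v => (key, v)) }  // back to (key, List[Row]) pairs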
I have an array Array[(Int, String)] which consists of the key-value pairs for the entire dataset, where the key is the column number and the value is the column's value.
So, I want to use reduceByKey to perform certain operations like max,min,mean,median,quartile calculations by key.
How can I achieve this using reduceByKey, given that groupByKey spills a lot of data to disk? How can I pass a custom function to reduceByKey?
Or is there a better way to do this?
Thanks !!
You can use combineByKey to track sum, count, min, max values, all in the same transformation. For that you need 3 functions:
create combiner function - that will initialize the 'combined value' consisting of min, max etc
merge values function - that will add another value to the 'combined value'
merge combiners - that will merge two 'combined values' together
The second approach would be to use an Accumulable object, or several Accumulators.
Please, check the documentation for those. I can provide some examples, if necessary.
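For what it is worth, here is a rough sketch of the accumulator idea using Spark's built-in accumulators; note it yields global statistics over the whole RDD, not per-key ones, so for per-key statistics the combineByKey example in the update below is the better fit (sc and rdd are assumed names):
// accumulate as a side effect of an action; rdd is assumed to be an RDD[(Int, Double)]
val valuesAcc = sc.doubleAccumulator("values")
rdd.foreach { case (_, v) => valuesAcc.add(v) }

// driver-side reads of the accumulated values
val globalSum   = valuesAcc.sum
val globalCount = valuesAcc.count
val globalMean  = valuesAcc.avg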
Update:
Here is an example to calculate average by key. You can expand it to calculate min and max, too:
def createComb = (v:Double) => (1, v)
def mergeVal:((Int,Double),Double)=>(Int,Double) =
{case((c,s),v) => (c+1, s+v)}
def mergeComb:((Int,Double),(Int,Double))=>(Int,Double) =
{case((c1,s1),(c2,s2)) => (c1+c2, s1+s2)}
val avgrdd = rdd.combineByKey(createComb, mergeVal, mergeComb,
new org.apache.spark.HashPartitioner(rdd.partitions.size))
.mapValues({case(x,y)=>y/x})
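If it helps, here is one way the same combiner tuple could be extended to also carry min and max per key; this is a sketch under the same assumptions as above (an rdd with Double values):
// combiner is (count, sum, min, max) per key
def createStats = (v: Double) => (1, v, v, v)

def mergeStatsVal: ((Int, Double, Double, Double), Double) => (Int, Double, Double, Double) = {
  case ((c, s, mn, mx), v) => (c + 1, s + v, math.min(mn, v), math.max(mx, v))
}

def mergeStatsComb: ((Int, Double, Double, Double), (Int, Double, Double, Double)) => (Int, Double, Double, Double) = {
  case ((c1, s1, mn1, mx1), (c2, s2, mn2, mx2)) =>
    (c1 + c2, s1 + s2, math.min(mn1, mn2), math.max(mx1, mx2))
}

val statsRdd = rdd.combineByKey(createStats, mergeStatsVal, mergeStatsComb)
  .mapValues { case (c, s, mn, mx) => (s / c, mn, mx) }  // (mean, min, max) per key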
I want to create a parallel scanLeft function (which computes prefix sums for an associative operator) for Hadoop (Scalding in particular; see below for how this is done).
Given a sequence of numbers in a hdfs file (one per line) I want to calculate a new sequence with the sums of consecutive even/odd pairs. For example:
input sequence:
0,1,2,3,4,5,6,7,8,9,10
output sequence:
0+1, 2+3, 4+5, 6+7, 8+9, 10
i.e.
1,5,9,13,17,10
I think that in order to do this I need to write InputFormat and InputSplit classes for Hadoop, but I don't know how to do this.
See section 3.3 here. Below is an example algorithm in Scala:
// for simplicity assume the input length is a power of 2
def scanadd(input: IndexedSeq[Int]): IndexedSeq[Int] =
  if (input.length == 1)
    input
  else {
    // collapse the sequence into the sums of sequential even/odd pairs
    val collapsed = IndexedSeq.tabulate(input.length / 2)(i => input(2 * i) + input(2 * i + 1))
    // recursively scan the collapsed values
    val scancollapse = scanadd(collapsed)
    // use the scan of the collapsed seq to build the full prefix-sum sequence:
    // an odd index lands exactly on a collapsed prefix sum,
    // an even index takes the previous collapsed prefix sum plus the current value
    val output = IndexedSeq.tabulate(input.length) { i =>
      if (i % 2 == 1) scancollapse(i / 2)
      else if (i == 0) input(0)
      else scancollapse(i / 2 - 1) + input(i)
    }
    output
  }
I understand that this might need a fair bit of optimization to work nicely with Hadoop. Translating it directly would, I think, lead to pretty inefficient Hadoop code; for example, you obviously can't use an IndexedSeq in Hadoop. I would appreciate any specific problems you see. I think it can probably be made to work well, though.
Superfluous. Did you mean this code?
val vv = (0 to 1000000).grouped(2).toVector
vv.par.foldLeft((0L, 0L, false))((a, v) =>
if (a._3) (a._1, a._2 + v.sum, !a._3) else (a._1 + v.sum, a._2, !a._3))
This was the best tutorial I found for writing an InputFormat and RecordReader. I ended up reading the whole split as one ArrayWritable record.