Custom function inside reduceByKey in Spark - Scala

I have an array Array[(Int, String)] consisting of the key-value pairs for the entire dataset, where the key is the column number and the value is the column's value.
So I want to use reduceByKey to perform certain operations, like max, min, mean, median, and quartile calculations, by key.
How can I achieve this using reduceByKey, given that groupByKey spills a lot of data to disk? How can I pass a custom function inside reduceByKey?
Or is there a better way to do this?
Thanks!

You can use combineByKey to track sum, count, min, and max values, all in the same transformation. For that you need 3 functions:
a create-combiner function, which initializes the 'combined value' consisting of min, max, etc.
a merge-values function, which adds another value to the 'combined value'
a merge-combiners function, which merges two 'combined values' together
The second approach would be to use an Accumulable object, or several Accumulators.
Please check the documentation for those. I can provide some examples, if necessary.
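For the accumulator route, here is a minimal sketch only, assuming a hypothetical columnRdd: RDD[Double] holding one column's values and using the Spark 1.x accumulator API; it tracks a global sum and count for that column, with one such pair of accumulators per column:
val sumAcc   = sc.accumulator(0.0)   // running sum for the column
val countAcc = sc.accumulator(0L)    // running count for the column

// foreach is an action, so the accumulator updates are applied here
columnRdd.foreach { v =>
  sumAcc   += v
  countAcc += 1L
}

val mean = sumAcc.value / countAcc.value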
Update:
Here is an example to calculate average by key. You can expand it to calculate min and max, too:
def createComb = (v: Double) => (1, v)

def mergeVal: ((Int, Double), Double) => (Int, Double) =
  { case ((c, s), v) => (c + 1, s + v) }

def mergeComb: ((Int, Double), (Int, Double)) => (Int, Double) =
  { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 + s2) }

val avgrdd = rdd.combineByKey(createComb, mergeVal, mergeComb,
    new org.apache.spark.HashPartitioner(rdd.partitions.size))
  .mapValues { case (x, y) => y / x }
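To expand this to min and max as the answer suggests, one possible sketch (assuming the same rdd of Double values per key) carries a (count, sum, min, max) tuple through combineByKey:
// Sketch: track (count, sum, min, max) per key in one pass
val statsRdd = rdd.combineByKey(
  (v: Double) => (1, v, v, v),
  (acc: (Int, Double, Double, Double), v: Double) =>
    (acc._1 + 1, acc._2 + v, math.min(acc._3, v), math.max(acc._4, v)),
  (a: (Int, Double, Double, Double), b: (Int, Double, Double, Double)) =>
    (a._1 + b._1, a._2 + b._2, math.min(a._3, b._3), math.max(a._4, b._4)))
  .mapValues { case (count, sum, min, max) => (sum / count, min, max) }  // (mean, min, max)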

Related

PySpark - Split the combined key

I have an RDD that is structured in this format:
(MAC_address, dst_ip_address, 1)
Here, 1 means the machine with the MAC_address has accessed the dst_ip_address once. I need to count how many times a specific machine with a MAC_address has reached a specific dst_ip_address.
I created an RDD with the combined MAC_address and dst_ip_address as the key, and applied reduceByKey to count the occurrences.
def processJson(data):
    # MAC_address and dst_ip_address come from the parsed JSON record
    return ((MAC_address, dst_ip_address), 1)

def countreducer(a, b):
    return a + b

tt = df.map(processJson).reduceByKey(countreducer)
I am able to get an RDD of the form ((MAC_address, dst_ip_address), 52).
I need to write the RDD into a JSON format like this:
MAC_address_1:
    [dst_ip_1: 52],
    [dst_ip_2: 38]
MAC_address_2:
    [dst_ip_1: 12]
My intuition is to split the combined key first, but there is no function to flatten a combined key, so I wonder whether the above approach is on the right track.

How to sort Spark countByKey() result, which is in scala.collection.Map[(String, String), Long], based on value?

CSV table stored in location "/user/root/sqoopImport/orders"
val orders = sc.textFile("/user/root/sqoopImport/orders")
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).countByKey().foreach(println)
Here I am getting this result, unsorted by key (String, String):
((2014-03-19 00:00:00.0,PENDING),9)
((2014-04-18 00:00:00.0,ON_HOLD),11)
((2013-09-17 00:00:00.0,ON_HOLD),8)
((2014-07-10 00:00:00.0,COMPLETE),57)
I want to sort it, so I have tried:
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).countByKey().sortBy(_._1).foreach(println)
<console>:30: error: value sortBy is not a member of scala.collection.Map[(String, String),Long]
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).countByKey().sortBy(_._1).foreach(println)
countByKey() is an action. It finishes Spark calculation and gives you a normal Scala Map. Since Map is unordered, it makes no sense to sort it: you need to convert it to Seq first, using toSeq. If you want to stay in Spark land, you should use a transformation instead, in this case reduceByKey():
orders.map(_.split(",")).map(x=>((x(1),x(3)),1)).reduceByKey(_ + _).sortBy(_._1).foreach(println)
Also, please note that foreach(println) will only work as you expect in local mode: https://spark.apache.org/docs/latest/programming-guide.html#printing-elements-of-an-rdd.
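For completeness, a small sketch of the collect-then-print variant that note points to (this assumes the sorted result is small enough to fit in driver memory):
// Bring the result back to the driver so the printing happens there,
// not on the executors
orders.map(_.split(","))
  .map(x => ((x(1), x(3)), 1))
  .reduceByKey(_ + _)
  .sortBy(_._1)
  .collect()
  .foreach(println)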
A Map is an unordered collection. You would need to convert that map into a collection that maintains order and sort it by key, e.g.:
val sorted = map.toSeq.sortBy { case (key, _) => key }
This is because
orders.map(_.split(",")).map(x => ((x(1), x(3)), 1)).countByKey()
returns a Map[(String, String), Long], on which the sortBy() function cannot be applied.
What you can do is:
val result = orders.map(_.split(",")).
  map(x => ((x(1), x(3)), 1)).countByKey().toSeq

// and apply the sortBy function on a new RDD built from that Seq
sc.parallelize(result).sortBy(_._1).collect().foreach(println)
Hope this helps!

How to filter RDDs using count of keys in a map

I have the following RDD
val reducedListOfCalls: RDD[(String, List[Row])]
The RDDs are:
[(923066800846, List[2016072211,1,923066800846])]
[(923027659472, List[2016072211,1,92328880275]),
(923027659472, List[2016072211,1,92324440275])]
[(923027659475, List[2016072211,1,92328880275]),
(923027659475, List[2016072211,1,92324430275]),
(923027659475, List[2016072211,1,92334340275])]
As shown above, the first RDD has 1 (key, value) pair, the second has 2, and the third has 3 pairs.
I want to remove all RDDs that have fewer than 2 key-value pairs. The expected result RDD is:
[(923027659472, List[2016072211,1,92328880275]),
(923027659472, List[2016072211,1,92324440275])]
[(923027659475, List[2016072211,1,92328880275]),
(923027659475, List[2016072211,1,92324430275]),
(923027659475, List[2016072211,1,92334340275])]
I have tried the following:
val reducedListOfCalls = listOfMappedCalls.filter(f => f._1.size >1)
but it still gives only the original list; the filter seems not to have made any difference.
Is it possible to count the number of keys in a mapped RDD, and then filter based on the count of keys?
You can use aggregateByKey in Spark to count the number of occurrences of each key.
You should create a Tuple2(count, List[List[Row]]) in your combine function. The same can be achieved with reduceByKey.
Read this post comparing these two functions.
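As a hedged illustration of that idea (a sketch only, assuming reducedListOfCalls: RDD[(String, List[Row])] as in the question), aggregateByKey can carry a (count, values) pair, and a filter can then drop keys seen fewer than 2 times:
// Sketch: count occurrences per key while collecting the values,
// then keep only keys that occur at least twice
val counted = reducedListOfCalls.aggregateByKey((0, List.empty[List[Row]]))(
  (acc, v) => (acc._1 + 1, v :: acc._2),    // fold one value into the accumulator
  (a, b) => (a._1 + b._1, a._2 ::: b._2))   // merge accumulators across partitions

val filtered = counted
  .filter { case (_, (count, _)) => count >= 2 }
  .flatMap { case (key, (_, values)) => values.map(v => (key, v)) }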

Index-wise most frequently occurring element

I have an array of the form
val array: Array[(Int, (String, Int))] = Array(
  (idx1, (word1, count1)),
  (idx2, (word2, count2)),
  (idx1, (word1, count1)),
  (idx3, (word3, count1)),
  (idx4, (word4, count4)))
I want to get the top 10 and bottom 10 elements from this array for each index (idx1, idx2, ...). Basically, I want the top 10 most-occurring and bottom 10 least-occurring elements for each index value.
Please suggest how to achieve this in Spark in the most efficient way.
I have tried using for loops over each index, but this makes the program too slow and it runs sequentially.
An example would be this:
(0,("apple",1))
(0,("peas",2))
(0,("banana",4))
(1,("peas",2))
(1,("banana",1))
(1,("apple",3))
(2,("NY",3))
(2,("London",5))
(2,("Zurich",6))
(3,("45",1))
(3,("34",4))
(3,("45",6))
Suppose I do top 2 on this set; the output would be:
(0,("banana",4))
(0,("peas",2))
(1,("apple",3))
(1,("peas",2))
(2,("Zurich",6))
(2,("London",5))
(3,("45",6))
(3,("34",4))
I also need the bottom 2 in the same way.
I understand this is equivalent to producing the entire per-index list by using groupByKey on (K, V) pairs and then sorting it. Although that operation is correct, in a typical Spark environment groupByKey involves a lot of shuffle output, which may make it inefficient.
I'm not sure about Spark, but I think you can go with something like:
def f(array: Array[(Int, (String, Int))], n: Int) =
  array.groupBy(_._1)
    .map(pair => (
      pair._1,
      pair._2.sortBy(_._2._2).toList       // sort each group's entries by count (ascending)
    ))
    .map(pair => (
      pair._1,
      (
        pair._2.take(Math.min(n, pair._2.size)),     // the n least frequent entries
        pair._2.drop(Math.max(0, pair._2.size - n))  // the n most frequent entries
      )
    ))
The groupBy returns a map from index to a list of entries sorted by frequency. After this, you map these entries to a pair of lists, one containing the bottom n elements and the other containing the top n elements. Note that you can replace all the named parameters with _; I kept them for clarity.
This version assumes that you are always interested in computing both the top and bottom n elements, and thus does both in a single pass. If you usually only need one of the two, it is more efficient to add the .take or .drop immediately after the toList.
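For a Spark-side variant (a sketch only, assuming a hypothetical rdd: RDD[(Int, (String, Int))] built from the same pairs), aggregateByKey can keep just the n most and n least frequent entries per key while aggregating, avoiding the full groupByKey shuffle the question worries about:
// Sketch: per key, maintain two bounded lists (top n and bottom n by count)
def topBottomByKey(rdd: org.apache.spark.rdd.RDD[(Int, (String, Int))], n: Int) =
  rdd.aggregateByKey((List.empty[(String, Int)], List.empty[(String, Int)]))(
    { case ((top, bottom), v) =>
        ((v :: top).sortBy(_._2).takeRight(n),    // keep the n largest counts
         (v :: bottom).sortBy(_._2).take(n)) },   // keep the n smallest counts
    { case ((t1, b1), (t2, b2)) =>
        ((t1 ::: t2).sortBy(_._2).takeRight(n),
         (b1 ::: b2).sortBy(_._2).take(n)) })
Calling topBottomByKey(rdd, 2) would then yield, for each index, its two most frequent and two least frequent entries.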

Sum of elements based on grouping by an element using Scala?

I have the following Scala list:
List((192.168.1.1,8590298237), (192.168.1.1,8590122837), (192.168.1.1,4016236988),
(192.168.1.1,1018539117), (192.168.1.1,2733649135), (192.168.1.2,16755417009),
(192.168.1.1,3315423529), (192.168.1.2,1523080027), (192.168.1.1,1982762904),
(192.168.1.2,6148851261), (192.168.1.1,1070935897), (192.168.1.2,276531515092),
(192.168.1.1,17180030107), (192.168.1.1,8352532280), (192.168.1.3,8590120563),
(192.168.1.3,24651063), (192.168.1.3,4431959144), (192.168.1.3,8232349877),
(192.168.1.2,17493253102), (192.168.1.2,4073818556), (192.168.1.2,42951186251))
I want the following output:
List((192.168.1.1, sum of all values of 192.168.1.1),
(192.168.1.2, sum of all values of 192.168.1.2),
(192.168.1.3, sum of all values of 192.168.1.3))
How do I get the sum of the second elements of the list, grouped by the first element, using Scala?
Here you can use Scala's groupBy function. Your input data does have an issue, though: the IP addresses must be Strings and the numbers Longs. Here is an example of the groupBy function:
val data = ??? // Your list
val sumList = data.groupBy(_._1).map(x => (x._1, x._2.map(_._2).sum)).toList
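For instance, a small self-contained sketch (using just a few of the pairs above, with the IPs as Strings and the values as Longs) would be:
val data: List[(String, Long)] = List(
  ("192.168.1.1", 8590298237L),
  ("192.168.1.2", 16755417009L),
  ("192.168.1.1", 1018539117L),
  ("192.168.1.3", 8590120563L))

val sumList: List[(String, Long)] =
  data.groupBy(_._1)                                          // group the pairs by IP
      .map { case (ip, pairs) => (ip, pairs.map(_._2).sum) }  // sum the second elements per IP
      .toList
// e.g. List((192.168.1.1,9608837354), (192.168.1.2,16755417009), (192.168.1.3,8590120563)); order is not guaranteed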
If the answer is correct, accept it, or comment and I'll explain some more.