PySpark - Split the combined key - pyspark

I have a RDD that is structured in this format:
(MAC_address, dst_ip_address, 1)
Here, 1 means the machine with the MAC_address has accessed the dst_ip_address once. I need to count how many times a specific machine with MAC_address has reached a specific dst_ip_address.
I created a rdd with a combined MAC_address and dst_ip_address as key, and applied reduceByKey to count the times.
def processJson(data):
return ((MAC_address, dst_ip_address), 1)
def countreducer(a,b):
return a+b
tt = df.map(processJson).reduceByKey(countreducer)
I am able to get a RDD ((MAC_address, dst_ip_address), 52)
I need to write the RDD into a Json format like this:
MAC_address_1:
[dst_ip_1: 52],
[dst_ip_2: 38]
MAC_address_2:
[dst_ip_1: 12]
My intuition is to split the combined key first but there is no function to flat a combined key. Thus, I wonder whether the above approach is on the right track.

Related

How to filter RDDs using count of keys in a map

I have the following RDD
val reducedListOfCalls: RDD[(String, List[Row])]
The RDDs are:
[(923066800846, List[2016072211,1,923066800846])]
[(923027659472, List[2016072211,1,92328880275]),
923027659472, List[2016072211,1,92324440275])]
[(923027659475, List[2016072211,1,92328880275]),
(923027659475, List[2016072211,1,92324430275]),
(923027659475, List[2016072211,1,92334340275])]
As shown above first RDD has 1 (key,value) pair, second has 2, and third has 3 pairs.
I want to remove all RDDs that has less than 2 key-value pairs. The result RDD expected is:
[(923027659472, List[2016072211,1,92328880275]),
923027659472, List[2016072211,1,92324440275])]
[(923027659475, List[2016072211,1,92328880275]),
(923027659475, List[2016072211,1,92324430275]),
(923027659475, List[2016072211,1,92334340275])]
I have tried the following:
val reducedListOfCalls = listOfMappedCalls.filter(f => f._1.size >1)
but it still given the original list only. The filter seems to have not made any difference.
Is it possible to count the number of keys in a mapped RDD, and then filter based on the count of keys?
You can use aggregateByKey in Spark to count the no of keys.
You should create a Tuple2(count, List[List[Row]]) in your combine function. The same can be achieved by reduceByKey.
Read this post comparing these two functions.

Running sum on pairRDD/ a Map data structure until a threshold

I have a dataset from which I created a pairRDD[K,V]
v = number of datapoints under each key)]
val loadInfoRDD = inputRDD.map(a => (a._1.substring(0,variabelLength),a._2)).reduceByKey(_+_)
(dr5n,108)
(dr5r4,67)
(dr5r5,1163)
(dr5r6,121)
(dr5r7,1103)
(dr5rb,93)
(dr5re8,11)
(dr5re9,190)
(dr5reb,26)
(dr5rec,38088)
(dr5red,36713)
(dr5ree,47316)
(dr5ref,131353)
(dr5reg,121227)
(dr5reh,264)
(dr5rej,163)
(dr5rek,163)
(dr5rem,229)
I need to allocate each Key to an RDD partition, after this stage, I zipWithIndex the keys of this RDD
val partitioner = loadTree.coalesce(1).sortByKey().keys.zipWithIndex
(dr5n,0)
(dr5r4,1)
(dr5r5,2)
(dr5r6,3)
(dr5r7,4)
(dr5rb,5)
(dr5re8,6)
(dr5re9,7)
(dr5reb,8)
(dr5rec,9)
(dr5red,10)
(dr5ree,11)
(dr5ref,12)
(dr5reg,13)
(dr5reh,14)
(dr5rej,15)
(dr5rek,16)
(dr5rem,17)
But in order to get better load distribution in each partition, I need to run through values, starting from key1(in the sorted order), and calculate a running sum on values until a Threshold value and set all the keys to a same value (partition number in this case, starting from 0)
Say, threshold = 10000, then
(dr5n,0)
(dr5r4,0)
(dr5r5,0)
(dr5r6,0)
(dr5r7,0)
(dr5rb,0)
(dr5re8,0)
(dr5re9,0)
(dr5reb,0)
(dr5rec,1)
(dr5red,2)
(dr5ree,3)
(dr5ref,4)
(dr5reg,5)
(dr5reh,6)
(dr5rej,6)
(dr5rek,6)
(dr5rem,6)
I tried creating a new map, creating a set of keys which could be grouped and inserted them into the new map.
Is there any expert way to achieve the same ? Thanks!

Implement a MergeSort like feature in spark with scala

Spark Version 1.2.1
Scala Version 2.10.4
I have 2 SchemaRDD which are associated by a numeric field:
RDD 1: (Big table - about a million records)
[A,3]
[B,4]
[C,5]
[D,7]
[E,8]
RDD 2: (Small table < 100 records so using it as a Broadcast Variable)
[SUM, 2]
[WIN, 6]
[MOM, 7]
[DOM, 9]
[POM, 10]
Result
[C,5, WIN]
[D,7, MOM]
[E,8, DOM]
[E,8, POM]
I want the max(field) from RDD1 which is <= the field from RDD2.
I am trying to approach this using Merge by:
Sorting RDD by a key (sort within a group will have not more than 100 records in that group. In the above example is within a group)
Performing the merge operation similar to mergesort. Here I need to keep a track of the previous value as well to find the max; still I traverse the list only once.
Since there are too may variables here I am getting "Task not serializable" exception. Is this implementation approach Correct? I am trying to avoid the Cartesian Product here. Is there a better way to do it?
Adding the code -
rdd1.groupBy(itm => (itm(2), itm(3))).mapValues( itmorg => {
val miorec = itmorg.toList.sortBy(_(1).toString)
for( r <- 0 to miorec.length) {
for ( q <- 0 to rdd2.value.length) {
if ( (miorec(r)(1).toString > rdd2.value(q).toString && miorec(r-1)(1).toString <= rdd2.value(q).toString && r > 0) || r == miorec.length)
org.apache.spark.sql.Row(miorec(r-1)(0),miorec(r-1)(1),miorec(r-1)(2),miorec(r-1)(3),rdd2.value(q))
}
}
}).collect.foreach(println)
I would not do a global sort. It is an expensive operation for what you need. Finding the maximum is certainly cheaper than getting a global ordering of all values. Instead, do this:
For each partition, build a structure that keeps the max on RDD1 for each row on RDD2. This can be trivially done using mapPartitions and normal scala data structures. You can even use your one-pass merge code here. You should get something like a HashMap(WIN -> (C, 5), MOM -> (D, 7), ...)
Once this is done locally on each executor, merging these resulting data structures should be simple using reduce.
The goal here is to do little to no shuffling an keeping the most complex operation local, since the result size you want is very small (it would be easier in code to just create all valid key/values with RDD1 and RDD2 then aggregateByKey, but less efficient).
As for your exception, you woudl need to show the code, "Task not serializable" usually means you are passing around closures which are not, well, serializable ;-)

Custom function inside reduceByKey in spark

I have an array Array[(Int, String)] which consists of the key-value pairs for the entire dataset where key is the column number and value is column's value.
So, I want to use reduceByKey to perform certain operations like max,min,mean,median,quartile calculations by key.
How can I achieve this using reduceByKey as groupByKey spills a lot of data to the disk. How can I pass a custom function inside reduceByKey.
Or is there a better way to do this.
Thanks !!
You can use combineByKey to track sum, count, min, max values, all in the same transformation. For that you need 3 functions:
create combiner function - that will initialize the 'combined value' consisting of min, max etc
merge values function - that will add another value to the 'combined value'
merge combiners - that will merge two 'combined values' together
The second approach would be to use an Accumulable object, or several Accumulators.
Please, check the documentation for those. I can provide some examples, if necessary.
Update:
Here is an example to calculate average by key. You can expand it to calculate min and max, too:
def createComb = (v:Double) => (1, v)
def mergeVal:((Int,Double),Double)=>(Int,Double) =
{case((c,s),v) => (c+1, s+v)}
def mergeComb:((Int,Double),(Int,Double))=>(Int,Double) =
{case((c1,s1),(c2,s2)) => (c1+c2, s1+s2)}
val avgrdd = rdd.combineByKey(createComb, mergeVal, mergeComb,
new org.apache.spark.HashPartitioner(rdd.partitions.size))
.mapValues({case(x,y)=>y/x})

Index wise most frequently occuring element

I have a an array of the form
val array: Array[(Int, (String, Int))] = Array(
(idx1,(word1,count1)),
(idx2,(word2,count2)),
(idx1,(word1,count1)),
(idx3,(word3,count1)),
(idx4,(word4,count4)))
I want to get the top 10 and bottom 10 elements from this array for each index (idx1,idx2,....). Basically I want the top 10 most occuring and bottom 10 least occuring elements for each index value.
Please suggest how to acheive in spark in most efficient way.
I have tried it using the for loops for each index but this makes the program too slow and runs sequentially.
An example would be this :
(0,("apple",1))
(0,("peas",2))
(0,("banana",4))
(1,("peas",2))
(1,("banana",1))
(1,("apple",3))
(2,("NY",3))
(2,("London",5))
(2,("Zurich",6))
(3,("45",1))
(3,("34",4))
(3,("45",6))
Suppose I do top 2 on this set output would be
(0,("banana",4))
(0,("peas",2))
(1,("apple",3))
(1,("peas",2))
(2,("Zurich",6))
(2,("London",5))
(3,("45",6))
(3,("34",4))
I also need bottom 2 in the same way
I understand this is equivalent to producing the entire column list by using groupByKey on (K,V) pairs and then doing sort operation on it. Although the operation is right but in a typical spark environment the groupByKey operation will involve a lot of shuffle output and this may lead to inefficient operation.
Not sure about spark, but I think you can go with something like:
def f(array: Array[(Int, (String, Int))], n:Int) =
array.groupBy(_._1)
.map(pair => (
pair._1,
pair._2.sortBy(_._2).toList
)
)
.map(pair => (
pair._1,
(
pair._2.take(Math.min(n, pair._2.size)),
pair._2.drop(Math.max(0, pair._2.size - n))
)
)
)
The groupBy returns a map of index into a sorted list of entries by frequenct. After this, you map these entries to a pair of lists, one containing the top n elements and the other containing the bottom n elements. Note that you can replace all named parameters with _, I did it for clarity.
This version assumes that you always are interested in computing both the top and bot n elements, and thus does both in a single pass. If you usually only need one of the two, it's more efficient to add the .take or .drop immediately after the toList.