Running sum on a pairRDD / a Map data structure until a threshold - Scala

I have a dataset from which I created a pairRDD[K, V] (K = key, V = number of data points under each key):
val loadInfoRDD = inputRDD.map(a => (a._1.substring(0,variabelLength),a._2)).reduceByKey(_+_)
(dr5n,108)
(dr5r4,67)
(dr5r5,1163)
(dr5r6,121)
(dr5r7,1103)
(dr5rb,93)
(dr5re8,11)
(dr5re9,190)
(dr5reb,26)
(dr5rec,38088)
(dr5red,36713)
(dr5ree,47316)
(dr5ref,131353)
(dr5reg,121227)
(dr5reh,264)
(dr5rej,163)
(dr5rek,163)
(dr5rem,229)
I need to allocate each key to an RDD partition. After this stage I zipWithIndex the keys of this RDD:
val partitioner = loadTree.coalesce(1).sortByKey().keys.zipWithIndex
(dr5n,0)
(dr5r4,1)
(dr5r5,2)
(dr5r6,3)
(dr5r7,4)
(dr5rb,5)
(dr5re8,6)
(dr5re9,7)
(dr5reb,8)
(dr5rec,9)
(dr5red,10)
(dr5ree,11)
(dr5ref,12)
(dr5reg,13)
(dr5reh,14)
(dr5rej,15)
(dr5rek,16)
(dr5rem,17)
But in order to get a better load distribution across partitions, I need to run through the values in sorted key order, keep a running sum of the values, and whenever that sum exceeds a threshold, start a new partition number (beginning at 0) and assign the same number to all keys in the current group.
Say threshold = 10000; then:
(dr5n,0)
(dr5r4,0)
(dr5r5,0)
(dr5r6,0)
(dr5r7,0)
(dr5rb,0)
(dr5re8,0)
(dr5re9,0)
(dr5reb,0)
(dr5rec,1)
(dr5red,2)
(dr5ree,3)
(dr5ref,4)
(dr5reg,5)
(dr5reh,6)
(dr5rej,6)
(dr5rek,6)
(dr5rem,6)
I tried creating a new map, building sets of keys that could be grouped together and inserting them into the new map.
Is there a cleaner way to achieve this? Thanks!
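A possible direction, sketched below under the assumption that the sorted per-key counts are small enough to collect to the driver (the names threshold and partitionOfKey are illustrative, not from the original code):

// Bring the sorted (key, count) pairs to the driver; assumed to be small.
val sortedCounts = loadInfoRDD.sortByKey().collect().map { case (k, v) => (k, v.toLong) }
val threshold = 10000L

// Walk the keys in sorted order keeping a running sum; whenever adding the next
// count would exceed the threshold, open a new partition number and reset the sum.
val partitionOfKey = sortedCounts
  .scanLeft(("", 0L, -1)) { case ((_, runningSum, part), (key, count)) =>
    if (runningSum + count > threshold) (key, count, part + 1) // start a new group with this key
    else (key, runningSum + count, math.max(part, 0))          // stay in the current group
  }
  .drop(1)                                                     // drop the scanLeft seed
  .map { case (key, _, part) => (key, part) }
  .toMap

With threshold = 10000 this reproduces the assignment shown above, and the resulting map can be broadcast and used inside a custom Partitioner.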

Related

What kind of variable to select for incrementing node labels in a community detection algorithm

I am working on a community detection algorithm that propagates labels to nodes, and I have a problem selecting the right type for the Label_counter variable.
There is an algorithm called LPA (label propagation algorithm) which propagates labels to nodes through iterations; think of labels as a node property. The initial label of each node is its node id, and in each iteration nodes update their label to the most frequent label among their neighbors. The algorithm I am working on is similar to LPA. At first every node has an initial label equal to 0, and then nodes receive new labels. As nodes update and get new labels, based on some conditions the Label_counter should be incremented by one so that its value can be used as the label for other nodes, for example label = 1, label = 2, and so on. For example, the Zachary karate club dataset has 34 nodes and 2 communities.
The initial state is like this:
(1,0)
(2,0)
.
.
.
(34,0)
The first number is the node id and the second is the label.
As nodes get new labels, Label_counter increments, and in later iterations other nodes get new labels and Label_counter increments again.
(1,1)
(2,1)
(3,1)
.
.
.
(33,3)
(34,3)
Nodes with the same label belong to the same community.
The problem I have is this:
Because the nodes in the RDD and the variables are distributed across the machines (each machine has its own copy of the variables), and executors have no connection with each other, if one executor updates Label_counter the other executors will not see the new value and nodes may get wrong labels. Is it correct to use an Accumulator as the label counter in this case, since Accumulators are shared variables across machines, or is there another way to handle this problem?
In Spark it is always complicated to compute index-like values because they depend on data that is not present in all the partitions. I can propose the following idea.
Compute the number of times the condition is met per partition.
Compute the cumulative increment per partition so that we know the initial increment of each partition.
Increment the values of each partition based on that initial increment.
Here is what the code could look like. Let me start by setting up a few things.
// Let's define some condition
def condition(node: Long) = node % 10 == 1

// step 0, generate the data
val rdd = spark.range(34)
  .select('id + 1).repartition(10).rdd
  .map(r => (r.getAs[Long](0), 0))
  .sortBy(_._1).cache()

rdd.collect
Array[(Long, Int)] = Array((1,0), (2,0), (3,0), (4,0), (5,0), (6,0), (7,0), (8,0),
(9,0), (10,0), (11,0), (12,0), (13,0), (14,0), (15,0), (16,0), (17,0), (18,0),
(19,0), (20,0), (21,0), (22,0), (23,0), (24,0), (25,0), (26,0), (27,0), (28,0),
(29,0), (30,0), (31,0), (32,0), (33,0), (34,0))
Then the core of the solution:
// step 1 and 2
val partIncrInit = rdd
  // to each partition, we associate the number of times we need to increment
  .mapPartitionsWithIndex { case (i, p) =>
    Iterator(i -> p.map(_._1).count(condition))
  }
  .collect.sorted   // sort by partition index
  .map(_._2)        // we don't need the index anymore
  .scanLeft(0)(_+_) // cumulated sum

// step 3, we increment each partition based on this initial increment.
val result = rdd
  .mapPartitionsWithIndex { case (i, p) =>
    var incr = 0
    p.map { case (node, value) =>
      if (condition(node)) incr += 1
      (node, partIncrInit(i) + value + incr)
    }
  }
result.collect
Array[(Long, Int)] = Array((1,1), (2,1), (3,1), (4,1), (5,1), (6,1), (7,1), (8,1),
(9,1), (10,1), (11,2), (12,2), (13,2), (14,2), (15,2), (16,2), (17,2), (18,2),
(19,2), (20,2), (21,3), (22,3), (23,3), (24,3), (25,3), (26,3), (27,3), (28,3),
(29,3), (30,3), (31,4), (32,4), (33,4), (34,4))

PySpark - Split the combined key

I have an RDD that is structured in this format:
(MAC_address, dst_ip_address, 1)
Here, 1 means the machine with the MAC_address has accessed the dst_ip_address once. I need to count how many times a specific machine with MAC_address has reached a specific dst_ip_address.
I created an RDD with the combined MAC_address and dst_ip_address as the key, and applied reduceByKey to count the occurrences.
def processJson(data):
    return ((MAC_address, dst_ip_address), 1)

def countreducer(a, b):
    return a + b

tt = df.map(processJson).reduceByKey(countreducer)
I am able to get an RDD of the form ((MAC_address, dst_ip_address), 52).
I need to write the RDD out in a JSON format like this:
MAC_address_1:
[dst_ip_1: 52],
[dst_ip_2: 38]
MAC_address_2:
[dst_ip_1: 12]
My intuition is to split the combined key first, but there is no function to flatten a combined key, so I wonder whether this approach is on the right track.

How to filter RDDs using count of keys in a map

I have the following RDD
val reducedListOfCalls: RDD[(String, List[Row])]
The RDDs are:
[(923066800846, List[2016072211,1,923066800846])]
[(923027659472, List[2016072211,1,92328880275]),
(923027659472, List[2016072211,1,92324440275])]
[(923027659475, List[2016072211,1,92328880275]),
(923027659475, List[2016072211,1,92324430275]),
(923027659475, List[2016072211,1,92334340275])]
As shown above, the first RDD has 1 (key, value) pair, the second has 2, and the third has 3 pairs.
I want to remove all RDDs that have fewer than 2 key-value pairs. The expected result RDD is:
[(923027659472, List[2016072211,1,92328880275]),
(923027659472, List[2016072211,1,92324440275])]
[(923027659475, List[2016072211,1,92328880275]),
(923027659475, List[2016072211,1,92324430275]),
(923027659475, List[2016072211,1,92334340275])]
I have tried the following:
val reducedListOfCalls = listOfMappedCalls.filter(f => f._1.size >1)
but it still gives the original list only; the filter seems to have made no difference.
Is it possible to count the number of keys in a mapped RDD, and then filter based on the count of keys?
You can use aggregateByKey in Spark to count the number of values per key.
You should create a Tuple2(count, List[List[Row]]) as your combined value. The same can be achieved with reduceByKey.
Read this post comparing these two functions.
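A minimal sketch of that aggregateByKey approach, assuming listOfMappedCalls: RDD[(String, List[Row])] as in the question (the cutoff of 2 and the variable names counted and kept are illustrative):

import org.apache.spark.rdd.RDD
import org.apache.spark.sql.Row

// Combine each key's values into (occurrence count, all value lists)...
val counted: RDD[(String, (Int, List[List[Row]]))] = listOfMappedCalls
  .aggregateByKey((0, List.empty[List[Row]]))(
    (acc, rows) => (acc._1 + 1, rows :: acc._2), // fold one value into the per-partition accumulator
    (a, b) => (a._1 + b._1, a._2 ::: b._2)       // merge accumulators from different partitions
  )

// ...then keep only keys that occur at least twice and flatten back to (key, List[Row]) pairs.
val kept: RDD[(String, List[Row])] = counted
  .filter { case (_, (count, _)) => count >= 2 }
  .flatMap { case (key, (_, lists)) => lists.map(l => (key, l)) }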

Finding maximum edge weight in Spark GraphX

Let's say I have a graph with Double values for edge attributes and I want to find the maximum edge weight of my graph. If I do this:
val max = sc.accumulator(0.0) // max holds the maximum edge weight
g.edges.distinct.collect.foreach { e => if (e.attr > max.value) max.value = e.attr }
I want to ask how much work is done on the master and how much on the executors, because I know that the collect() method brings the entire RDD to the master. Does any parallelism happen? Is there a better way to find the maximum edge weight?
NOTE:
g.edges.distinct.foreach { e => if (e.attr > max.value) max.value = e.attr } // does not work without the collect() method
// I use an accumulator because I want to use the max edge weight later
And if I want to apply some averaging function to the attributes of edges that have the same srcId and dstId between two graphs, what is the best way to do it?
You can either aggregate:
graph.edges.aggregate(Double.NegativeInfinity)(
  (m, e) => e.attr.max(m),
  (m1, m2) => m1.max(m2)
)
or map and take max:
graph.edges.map(_.attr).max
Regarding your attempts:
If you collect, all the data is processed sequentially on the driver, so there is no reason to use an accumulator.
Without collect it doesn't work because accumulators are write-only from a worker's perspective.

Custom function inside reduceByKey in spark

I have an array Array[(Int, String)] which consists of the key-value pairs for the entire dataset, where the key is the column number and the value is the column's value.
I want to use reduceByKey to perform operations such as max, min, mean, median and quartile calculations by key.
How can I achieve this using reduceByKey, given that groupByKey spills a lot of data to disk? And how can I pass a custom function inside reduceByKey?
Or is there a better way to do this?
Thanks!
You can use combineByKey to track sum, count, min, max values, all in the same transformation. For that you need 3 functions:
a create-combiner function, which initializes the 'combined value' consisting of min, max, etc.
a merge-value function, which adds another value to the 'combined value'
a merge-combiners function, which merges two 'combined values' together
The second approach would be to use an Accumulable object, or several Accumulators.
Please check the documentation for those; I can provide some examples if necessary.
Update:
Here is an example to calculate average by key. You can expand it to calculate min and max, too:
def createComb = (v: Double) => (1, v)

def mergeVal: ((Int, Double), Double) => (Int, Double) =
  { case ((c, s), v) => (c + 1, s + v) }

def mergeComb: ((Int, Double), (Int, Double)) => (Int, Double) =
  { case ((c1, s1), (c2, s2)) => (c1 + c2, s1 + s2) }

val avgrdd = rdd.combineByKey(createComb, mergeVal, mergeComb,
    new org.apache.spark.HashPartitioner(rdd.partitions.size))
  .mapValues { case (count, sum) => sum / count }
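For completeness, a possible sketch of the same pattern extended to track count, sum, min and max in a single pass, following the three functions described above (rdd is the same assumed RDD of (key, Double) pairs; the name statsrdd is illustrative, not from the original answer):

val statsrdd = rdd.combineByKey(
    (v: Double) => (1, v, v, v),                                // create combiner: (count, sum, min, max)
    (acc: (Int, Double, Double, Double), v: Double) =>
      (acc._1 + 1, acc._2 + v, acc._3 min v, acc._4 max v),     // merge a value into the combined value
    (a: (Int, Double, Double, Double), b: (Int, Double, Double, Double)) =>
      (a._1 + b._1, a._2 + b._2, a._3 min b._3, a._4 max b._4)  // merge two combined values
  )
  .mapValues { case (count, sum, min, max) => (sum / count, min, max) } // (mean, min, max) per key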