Function "shared neighbors distribution" in igraph - networkx

I wonder if someone could give me a clue about writing a function like "shared neighbors distribution" for given clusters. I found this feature in Cytoscape's NetworkAnalyzer very informative for my research. Since I have many clusters to analyze, it would be handy to write a script for this job. Suggestions using igraph, networkx, etc. are welcome. Thank you very much!
For example:
edgelist <- read.table(text = "
E B
E A
B D
B F
B C
A C
A F
A D")
library(igraph)
graph <- graph.data.frame(edgelist,directed=F)
plot(graph)
We would see a graph like this:
[figure: plot of the graph]
Any two of the nodes C, D, E, F share nodes A and B as common neighbors; there are 6 such pairs.
Nodes A and B share the nodes C, D, E, F, i.e. 4 common neighbors.
In total, the summary should look like this:
[figure: summary table of the shared neighbors distribution]
Instead of writing a loop (to get the neighbors of each vertex and compare them), I wonder if there are better solutions.

NetworkX has no built-in function for this problem, so you have to do it manually. Moreover, even if it had one, that function would loop over nodes internally, so a node loop is actually an optimal or near-optimal solution. For your code you can use a Python defaultdict to make it simpler:
import networkx as nx
from collections import defaultdict
G = nx.Graph()
G.add_edges_from([
("E", "B"),
("E", "A"),
("B", "D"),
("B", "F"),
("B", "C"),
("A", "C"),
("A", "F"),
("A", "D")
])
snd = defaultdict(int)
for n1 in G.nodes:
    for n2 in G.nodes:
        len_nbrs = len(set(G.neighbors(n1)) & set(G.neighbors(n2)))
        if len_nbrs:
            snd[len_nbrs] += 1
snd
So snd will look like this:
defaultdict(int, {2: 16, 4: 4})

Thank you so much @vurmux for the framework and idea. I just switched to pair combinations to avoid self-pairs, duplicates, etc. Then we get the correct answer. Great. Cheers!
import networkx as nx
from collections import defaultdict
from itertools import combinations
G = nx.Graph()
G.add_edges_from([
("E", "B"),
("E", "A"),
("B", "D"),
("B", "F"),
("B", "C"),
("A", "C"),
("A", "F"),
("A", "D")
])
snd = defaultdict(int)
l = G.nodes()
comb = combinations(l, 2)  # combinations('ABCD', 2) --> AB AC AD BC BD CD
for n1, n2 in comb:
    len_nbrs = len(set(G.neighbors(n1)) & set(G.neighbors(n2)))
    if len_nbrs:
        snd[len_nbrs] += 1
snd
Now we have:
defaultdict(int, {2: 6, 4: 1})

Related

Spark RDD filter after groupByKey

//create RDD
val rdd = sc.makeRDD(List(("a", (1, "m")), ("b", (1, "m")),
("a", (1, "n")), ("b", (2, "n")), ("c", (1, "m")),
("c", (5, "m")), ("d", (1, "m")), ("d", (1, "n"))))
val groupRDD = rdd.groupByKey()
After groupByKey I want to filter out the keys whose value tuples all have 1 as the first element, and get:
("b", (1, "m")),("b", (2, "n")), ("c", (1, "m")), ("c", (5, "m"))
Using groupByKey() is necessary. Could you help me? Thanks a lot.
Addendum:
But if the first element of the value tuple is a String, I want to filter out the keys whose values all have the same first element (all "x" here), like:
("a",("x","m")), ("a",("x","n")), ("b",("x","m")), ("b",("y","n")), ("c",("x","m")), ("c",("z","m")), ("d",("x","m")), ("d",("x","n"))
and still get the same kind of result: ("b",("x","m")), ("b",("y","n")), ("c",("x","m")), ("c",("z","m"))
You could do:
val groupRDD = rdd
.groupByKey()
.filter(value => value._2.map(tuple => tuple._1).sum != value._2.size)
.flatMapValues(list => list) // to get the result as you like, because right now, they are, e.g. (b, Seq((1, m), (1, n)))
What this does: we first group by key with groupByKey, then we filter by summing the first elements of each group's tuples and checking whether the sum equals the number of grouped entries. For example:
(a, Seq((1, m), (1, n))) -> grouped by key
(a, Seq((1, m), (1, n))), 2 (the sum of 1 + 1), 2 (size of the sequence)
2 == 2, so this row is filtered out
The final result:
(c,(1,m))
(b,(1,m))
(c,(5,m))
(b,(2,n))
Good luck!
EDIT
Under the assumption that the first element of the value tuple can be any string, and assuming rdd is your data containing:
(a,(x,m))
(c,(x,m))
(c,(z,m))
(d,(x,m))
(b,(x,m))
(a,(x,n))
(d,(x,n))
(b,(y,n))
Then we can construct uniqueCount as:
val uniqueCount = rdd
  // re-key by (key, first element); we want to count combinations like (a, x), (b, x), (b, y), (c, x), etc.
  .map(entry => ((entry._1, entry._2._1), entry._2._2))
  // count by key, so (a, x) gives us 2, (b, x) gives us 1, (b, y) gives us 1, etc.
  .countByKey()
  // keep only the counts equal to 1, dropping the duplicated combinations
  .filter(a => a._2 == 1)
  // extract the original keys, so we can filter below
  .map(a => a._1._1)
  .toList
Then this:
val filteredRDD = rdd.filter(a => uniqueCount.contains(a._1))
Gives this output:
(b,(y,n))
(c,(x,m))
(c,(z,m))
(b,(x,m))
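As a side note, if both the numeric and the string case boil down to dropping the keys whose grouped values all share the same first element, a shorter variant (a sketch under that assumption, not taken from the answer above) is to compare the set of first elements against one:
val filtered = rdd
  .groupByKey()
  // keep only keys whose values do not all share the same first element
  .filter { case (_, values) => values.map(_._1).toSet.size > 1 }
  // flatten back to one (key, value) pair per original record
  .flatMapValues(identity)
This keeps exactly the b and c records in both versions of the question.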

PySpark: Count pair frequency occurrences

Let's say I have a dataset as follows:
1: a, b, c
2: a, d, c
3: c, d, e
I want to write PySpark code to count the occurrences of each pair, such as (a,b), (a,c), (b,c), etc.
Expected output:
(a,b) 1
(b,c) 1
(c,d) 2
etc.
Note that (c,d) and (d,c) should be counted as the same pair.
How should I go about it?
So far, I have written the following code to read the data from the text file:
from pyspark import SparkContext
from pyspark.sql import SparkSession

sc = SparkContext("local", "bp")
spark = SparkSession(sc)
data = sc.textFile('doc.txt')
dataFlatMap = data.flatMap(lambda x: x.split(" "))
Any pointers would be appreciated.
I relied on the answer in this question - How to create a Pyspark Dataframe of combinations from list column
Below is code that creates a udf in which the itertools.combinations function is applied to the list of items. The combinations are sorted inside the udf to avoid double-counting occurrences such as ("a", "b") and ("b", "a"). Once you have the combinations, you can groupBy them and count rows. You may want to count distinct rows in case list elements repeat, like ("a", "a", "b"), but this depends on your requirements.
import pyspark.sql.functions as F
import itertools
from pyspark.sql.types import *
data = [(1, ["a", "b", "c"]), (2, ["a", "d", "c"]), (3, ["c", "d", "e"])]
df = spark.createDataFrame(data, schema = ["id", "arr"])
# df is
# id arr
# 1 ["a", "b", "c"]
# 2 ["a", "d", "c"]
# 3 ["c", "d", "e"]
@F.udf(returnType=ArrayType(ArrayType(StringType())))
def combinations_udf(arr):
    x = list(itertools.combinations(arr, 2))
    return [sorted([y[0], y[1]]) for y in x]
df1 = df.withColumn("combinations", F.explode(combinations_udf("arr")))
df_ans = (df1
          .groupBy("combinations")
          .agg(F.countDistinct("id").alias("count"))
          .orderBy(F.desc("count")))
For the given dataframe df, df_ans is:
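Worked out by hand from the three input rows (the display format and the ordering of tied counts may differ):
[a, c]: 2
[c, d]: 2
[a, b]: 1
[b, c]: 1
[a, d]: 1
[c, e]: 1
[d, e]: 1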

Scala number of occurrences of element in list of lists

To make it simple, let's imagine I have the following input:
List(List("A", "A"), List("A", "B"), List("B", "C"), List("B", "C"))
How would it be possible to group the elements inside the lists in such a way that I would know how many of the lists each element is in? For example (using the output of a mapValues call just to illustrate what I mean), the result for the previous input should be something like:
Map("A" -> 2, "B" -> 3, "C" -> 2)
Just to be sure I made clear what I mean: one way to interpret the result would be to say that "A" is present in 2 of the sub-lists (regardless of how many times it appears inside a particular sub-list), "B" is present in 3 of the sub-lists, and "C" is present in 2. I just want a way to map how many different sub-lists each individual element is present in.
Disregarding performance, this would work:
val list = List(List("A", "A"), List("A", "B"), List("B", "C"), List("B", "C"))
val elements = list.flatten.distinct
elements.map(el => el -> list.count(_.contains(el))).toMap
You can also use a fold operation, deduplicating each sub-list first so that an element is counted at most once per sub-list:
list.flatMap(_.distinct).foldLeft(Map.empty[String, Int])((map, word) => map + (word -> (map.getOrElse(word, 0) + 1)))
//scala> res2: scala.collection.immutable.Map[String,Int] = Map(A -> 2, B -> 3, C -> 2)
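Along the same lines, a sketch of another variant (not from the answers above) that deduplicates each sub-list, flattens, and groups:
val list = List(List("A", "A"), List("A", "B"), List("B", "C"), List("B", "C"))
// one entry per (element, sub-list) pair, then count entries per element
val counts = list
  .flatMap(_.distinct)   // List(A, A, B, B, C, B, C)
  .groupBy(identity)
  .map { case (element, occurrences) => element -> occurrences.size }
// counts: Map(A -> 2, B -> 3, C -> 2)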

Counting number of occurrences of Array element in an RDD

I have an RDD, RDD1, with key-value pairs of type (String, Array[String]) (I will refer to them as (X, Y)), and an Array[String] Z.
For every element Z(i) of Z and every key X, I'm trying to count how many records with key X have Z(i) in their Y array. I want my output as ((X, Z(i)), #ofinstances).
RDD1 = ((A, (2, 3, 4)), (B, (4, 4, 4)), (A, (4, 5)))
Z = (1, 4)
then I want to get:
(((A, 4), 2), ((B, 4), 1))
Hope that made sense.
As you can see above, I only want a pair in the output if there is at least one occurrence.
I have tried this so far:
val newRDD = RDD1.map{case(x, y) => for(i <- 0 to (z.size-1)){if(y.contains(z(i))) {((x, z(i)), 1)}}}
My output here is an RDD[Unit].
I'm not sure if what I'm asking for is even possible, or if I have to do it another way.
So it is just another word count
val rdd = sc.parallelize(Seq(
("A", Array("2", "3", "4")),
("B", Array("4", "4", "4")),
("A", Array("4", "5"))))
val z = Array("1", "4")
To make lookups efficient, convert z to a Set:
val zs = z.toSet
val result = rdd
.flatMapValues(_.filter(zs contains _).distinct)
.map((_, 1))
.reduceByKey(_ + _)
where
_.filter(zs contains _).distinct
keeps only the values that occur in z and deduplicates them.
result.take(2).foreach(println)
// ((B,4),1)
// ((A,4),2)

How to transform RDD[(Key, Value)] into Map[Key, RDD[Value]]

I have searched for a solution for a long time but haven't found a correct algorithm.
Using Spark RDDs in Scala, how could I transform an RDD[(Key, Value)] into a Map[Key, RDD[Value]], knowing that I can't use collect or other methods which may load the data into memory?
In fact, my final goal is to loop over the Map[Key, RDD[Value]] by key and call saveAsNewAPIHadoopFile for each RDD[Value].
For example, if I get :
RDD[("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)]
I'd like :
Map[("A" -> RDD[1, 2, 3]), ("B" -> RDD[4, 5]), ("C" -> RDD[6])]
I wonder whether it would cost too much to do this by calling filter on the RDD[(Key, Value)] for each key A, B, C, but I don't know if calling filter as many times as there are distinct keys would be efficient (of course not, but maybe using cache would help?).
Thank you
You can use code like this (Python):
rdd = sc.parallelize( [("A", 1), ("A", 2), ("A", 3), ("B", 4), ("B", 5), ("C", 6)] ).cache()
keys = rdd.keys().distinct().collect()
for key in keys:
    out = rdd.filter(lambda x: x[0] == key).map(lambda kv: kv[1])
    out.saveAsNewAPIHadoopFile(...)
One RDD cannot be part of another RDD, and there is no option to just collect the keys and transform their related values into separate RDDs. In my example you would iterate over the cached RDD, which is OK and will work fast.
It sounds like what you really want is to save your KV RDD to a separate file for each key. Rather than creating a Map[Key, RDD[Value]] consider using a MultipleTextOutputFormat similar to the example here. The code is pretty much all there in the example.
The benefit of this approach is that you're guaranteed to take only one pass over the RDD after the shuffle, and you get the same result you wanted. If you did this by filtering and creating several RDDs as suggested in the other answer (unless your source supports pushdown filters), you would end up taking one pass over the dataset for each individual key, which would be way slower.
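For reference, a minimal sketch of that approach, assuming String keys and values; the class name KeyBasedOutput and the output directory /tmp/out are placeholders, not taken from the linked example:
import org.apache.hadoop.io.NullWritable
import org.apache.hadoop.mapred.lib.MultipleTextOutputFormat
import org.apache.spark.HashPartitioner

// writes each record into a file named after its key
class KeyBasedOutput extends MultipleTextOutputFormat[Any, Any] {
  // use the key as the output file name
  override def generateFileNameForKeyValue(key: Any, value: Any, name: String): String =
    key.asInstanceOf[String]
  // drop the key from the written lines, keeping only the value
  override def generateActualKey(key: Any, value: Any): Any =
    NullWritable.get()
}

val kv = sc.parallelize(Seq(("A", "1"), ("A", "2"), ("A", "3"), ("B", "4"), ("B", "5"), ("C", "6")))

kv.partitionBy(new HashPartitioner(3))   // one shuffle that co-locates each key
  .saveAsHadoopFile("/tmp/out", classOf[String], classOf[String], classOf[KeyBasedOutput])
Each key then ends up in its own file under /tmp/out, and the job makes a single pass over the data after the shuffle.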
This is my simple test code.
val test_RDD = sc.parallelize(List(("A",1),("A",2), ("A",3),("B",4),("B",5),("C",6)))
val groupby_RDD = test_RDD.groupByKey()
val result_RDD = groupby_RDD.map { v =>
  var result_list: List[Int] = Nil
  for (i <- v._2) {
    result_list ::= i
  }
  (v._1, result_list)
}
The result is below
result_RDD.take(3)
>> res86: Array[(String, List[Int])] = Array((A,List(1, 3, 2)), (B,List(5, 4)), (C,List(6)))
Or you can do it like this
val test_RDD = sc.parallelize(List(("A",1),("A",2), ("A",3),("B",4),("B",5),("C",6)))
val nil_list:List[Int] = Nil
val result2 = test_RDD.aggregateByKey(nil_list)(
(acc, value) => value :: acc,
(acc1, acc2) => acc1 ::: acc2 )
The result is this
result2.take(3)
>> res209: Array[(String, List[Int])] = Array((A,List(3, 2, 1)), (B,List(5, 4)), (C,List(6)))