NetworkX - Connected Columns - visualization

I'm trying to create a data visualization, and maybe NetworkX isn't the best tool, but I would like to have parallel columns of nodes (two separate groups) which connect to each other. I can't figure out how to place the two groups of nodes in this layout; the different options I have tried always default to a more 'web-like' layout. I'm trying to create a visualization where customers/companies (the keys in a dictionary) have edges drawn to product nodes (the values in the same dictionary).
For example:
d = {"A":[ 1, 2, 3], "B": [2,3], "C": [1,3]
From dictionary 'd', we would have a column of nodes ["A", "B", "C"], a second column [1, 2, 3], and edges drawn between the two:
A 1
B 2
C 3
UPDATE:
So the suggested 'pos' argument helped, but I was having difficulties using it on multiple groups of nodes. Here is the method I came up with:
nodes = ["A", "B", "C", "D"]
nodes2 = ["X", "Y", "Z"]
edges = [("A","Y"),("C","X"), ("C","Z")]
#This function will take a list of values we want to turn into nodes
# Then it assigns a y-value for a specific value of X creating columns
def create_pos(column, node_list):
pos = {}
y_val = 0
for key in node_list:
pos[key] = (column, y_val)
y_val = y_val+1
return pos
G.add_nodes_from(nodes)
G.add_nodes_from(nodes2)
G.add_edges_from(edges)
pos1 = create_pos(0, nodes)
pos2 = create_pos(1, nodes2)
pos = {**pos1, **pos2}
nx.draw(G, pos)

Here is the code I wrote, with the help of #wolfevokcats, to create connected columns of nodes.
import networkx as nx
import matplotlib.pyplot as plt

G = nx.Graph()
nodes = ["A", "B", "C", "D"]
nodes2 = ["X", "Y", "Z"]
edges = [("A", "Y"), ("C", "X"), ("C", "Z")]

# This function will take a list of values we want to turn into nodes
# Then it assigns a y-value for a specific value of x, creating columns
def create_pos(column, node_list):
    pos = {}
    y_val = 0
    for key in node_list:
        pos[key] = (column, y_val)
        y_val = y_val + 1
    return pos

G.add_nodes_from(nodes)
G.add_nodes_from(nodes2)
G.add_edges_from(edges)
pos1 = create_pos(0, nodes)
pos2 = create_pos(1, nodes2)
pos = {**pos1, **pos2}
nx.draw(G, pos, with_labels=True)
plt.show()

Related

scala array.product function in 2d array

I have a 2D array:
val arr = Array(Array(2, 1), Array(3, 1), Array(4, 1))
I want to multiply all the inner first elements and sum all the inner second elements, to get as a result:
Array(24,3)
I'm looking for a way to use map here, something like:
arr.map(a=>Array(a(1stElemnt).product , a(2ndElemnt).sum ))
Any suggestions?
Regards.
The following works, but note that it is not safe: it will throw an exception if arr contains any element that does not have exactly two elements. You should add the missing cases to the pattern match as your use case requires (see the sketch after the snippet).
val result = arr.fold(Array(1, 0)) {
  case (Array(x1, x2), Array(y1, y2)) => Array(x1 * y1, x2 + y2)
}
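For instance, a minimal sketch of such an extra case could simply keep the accumulator unchanged when an inner array does not have exactly two elements (whether that is the right behaviour depends on your use case):

val safeResult = arr.fold(Array(1, 0)) {
  case (Array(x1, x2), Array(y1, y2)) => Array(x1 * y1, x2 + y2)
  case (acc, _)                       => acc // fallback for malformed inner arrays; adjust as needed
}

With the example input this still yields Array(24, 3).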
Update
As #Luis suggested, if you convert your original Array[Array] to an Array of tuples, another implementation could look like this:
val arr = Array((2, 1), (3, 1), (4, 1))
val (prdArr, sumArr) = arr.unzip
val result = (prdArr.product, sumArr.sum)
println(result) // (24, 3)

Adding Sparse Vectors 3.0.0 Apache Spark Scala

I am trying to create a function like the following to add two org.apache.spark.ml.linalg.Vectors, i.e. two sparse vectors.
Such a vector could look like the following:
(28,[1,2,3,4,7,11,12,13,14,15,17,20,22,23,24,25],[0.13028398104008743,0.23648605632753023,0.7094581689825907,0.13028398104008743,0.23648605632753023,0.0,0.14218861229025295,0.3580566057240087,0.14218861229025295,0.13028398104008743,0.26056796208017485,0.0,0.14218861229025295,0.06514199052004371,0.13028398104008743,0.23648605632753023])
For example:
def add_vectors(x: org.apache.spark.ml.linalg.Vector, y: org.apache.spark.ml.linalg.Vector): org.apache.spark.ml.linalg.Vector = {
}
Let's look at a use case
val x = Vectors.sparse(2, List(0), List(1)) // [1, 0]
val y = Vectors.sparse(2, List(1), List(1)) // [0, 1]
I want to output to be
Vectors.sparse(2, List(0,1), List(1,1))
Here's another case where they share the same indices
val x = Vectors.sparse(2, List(1), List(1))
val y = Vectors.sparse(2, List(1), List(1))
This output should be
Vectors.sparse(2, List(1), List(2))
I've realized doing this is harder than it seems. I looked into one possible solution: converting the vectors to Breeze, adding them in Breeze, and then converting the result back to a Vector, e.g. "Addition of two RDD[mllib.linalg.Vector]'s". So I tried implementing this:
def add_vectors(x: org.apache.spark.ml.linalg.Vector, y: org.apache.spark.ml.linalg.Vector) = {
  val dense_x = x.toDense
  val dense_y = y.toDense
  val bv1 = new DenseVector(dense_x.toArray)
  val bv2 = new DenseVector(dense_y.toArray)
  val vectout = Vectors.dense((bv1 + bv2).toArray)
  vectout
}
However, this gave me an error on the following line:
val vectout = Vectors.dense((bv1 + bv2).toArray)
Cannot resolve the overloaded method 'dense'.
I'm wondering why this error is occurring and how to fix it.
To answer my own question, I had to think about how sparse vectors are represented. A sparse vector requires three arguments: the number of dimensions, an array of indices, and an array of values. For example:
val indices: Array[Int] = Array(1,2)
val norms: Array[Double] = Array(0.5,0.3)
val num_int = 4
val vector: Vector = Vectors.sparse(num_int, indices, norms)
If I convert this SparseVector to an Array, I get the following.
code:
val choiced_array = vector.toArray
choiced_array.foreach(element => print(element + " "))
Output:
0.0 0.5 0.3 0.0
This is the dense representation of the vector. So once you convert the two vectors to arrays, you can add them with the following code:
val add: Array[Double] = (vector.toArray, vector_2.toArray).zipped.map(_ + _)
This gives you another array containing the element-wise sums. Next, to create your new sparse vector, you need to build an indices array like the one shown in the constructor above:
var i = -1
val new_indices_pre = add.map { (element: Double) =>
  i = i + 1
  if (element > 0.0) i
  else -1
}
Then let's filter out all the -1 entries, which indicate a zero at that index:
val new_indices = new_indices_pre.filter(element => element != -1)
Remember to also filter out the zero values from the array that holds the sum of the two vectors, keeping only the non-zero entries:
val final_add = add.filter(element => element > 0.0)
Lastly, we can build the new sparse vector:
Vectors.sparse(num_int, new_indices, final_add)
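Putting the steps above together, a minimal sketch of the whole function could look like this (assuming the spark.ml imports shown; like the steps above, it keeps only strictly positive sums, so genuinely negative results would be dropped):

import org.apache.spark.ml.linalg.{Vector, Vectors}

// Sketch assembling the steps above; assumes both vectors have the same size.
def add_vectors(x: Vector, y: Vector): Vector = {
  val added = (x.toArray, y.toArray).zipped.map(_ + _)              // element-wise sum of the dense forms
  val indices = added.indices.filter(i => added(i) > 0.0).toArray   // positions of the (positive) non-zero sums
  val values = indices.map(i => added(i))                           // the corresponding values
  Vectors.sparse(x.size, indices, values)
}

For the first use case above, the result should correspond to Vectors.sparse(2, Array(0, 1), Array(1.0, 1.0)).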

Not getting correct label Graphframe LPA

I am using GraphFrames LPA to find communities, but somehow it's not giving me the expected result:
graph_data = spark.createDataFrame([
    ("a", "d", "friend"),
    ("b", "d", "friend"),
    ("c", "d", "friend")
], ["src", "dst", "relationship"])
Here my requirement is to get a single community id for all of the vertices a, b, c, and d, but I am getting two different community ids: one for a, b, c and one for d.
code:
df1 = graph_data.selectExpr('src AS id')
df2 = graph_data.selectExpr('dst AS id')
vertices = df1.union(df2)
vertices = vertices.distinct()
edges = graph_data
g = GraphFrame(vertices, edges)
communities = g.labelPropagation(maxIter=5)
Given that d is a root, it gets a separate label. To end up with a single label, I recommend using connected components instead; see the docs.
communities = g.connectedComponents()
Note: this requires you to set a checkpoint directory beforehand:
sc.setCheckpointDir("some_path")

Checking specific key/value exists in custom sequences

I am mind-boggled right now as to how to get this to work.
I have a case class such as:
case class Visitors(start: Int, end: Int, visitor_num: Int)
Now say I create two separate Sequences of type Visitors:
val visitors_A = Seq(Visitors(start = 1, end = 1, visitor_num = 2),Visitors(start = 2, end = 2, visitor_num = 129),Visitors(start = 3, end = 3, visitor_num = 90))
val visitors_B = Seq(Visitors(start = 1, end = 1, visitor_num = 0),Visitors(start = 2, end = 2, visitor_num = 0))
I want to create a separate Visitors sequence containing the Visitors that have the same start times in both visitors_A and visitors_B.
The output should be:
visitors_Same = Seq(Visitors(start = 1, end = 1, visitor_num = 2), Visitors(start = 2, end = 2, visitor_num = 129))
It should check whether the start times are present in both sequences; if they are, it should take the values from visitors_A and append them to the list.
What confuses me is that I am working with a "custom" Visitors type, and I cannot seem to run intersect or contains calls for visitors_A against visitors_B. I understand I probably need to check whether a start value from A exists in B and then map (?) the output to a new sequence of type Visitors?
If you want a one-liner (but probably a very inefficient one), here it is:
val r1 = visitors_A.filter(va => visitors_B.exists(vb => vb.start == va.start))
You can gain a bit more speed if you convert visitors_B to a Map first (logically the map is start -> visitor):
val vbm = visitors_B.map(vb => (vb.start, vb)).toMap
val r2 = visitors_A.filter(va => vbm.contains(va.start))
Edit
Actually, since the values in the Map are not used at all, you can use a Set instead, which will be a bit more efficient than a Map:
val vbs = visitors_B.map(vb => vb.start).toSet
val r3 = visitors_A.filter(va => vbs.contains(va.start))
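Putting it together as a small, self-contained sketch with the sample data from the question (all three variants should give the same result):

case class Visitors(start: Int, end: Int, visitor_num: Int)

val visitors_A = Seq(Visitors(1, 1, 2), Visitors(2, 2, 129), Visitors(3, 3, 90))
val visitors_B = Seq(Visitors(1, 1, 0), Visitors(2, 2, 0))

// Collect the start times that occur in visitors_B, then keep the visitors_A
// entries whose start also appears there.
val vbs = visitors_B.map(_.start).toSet
val visitors_Same = visitors_A.filter(va => vbs.contains(va.start))
// visitors_Same == Seq(Visitors(1, 1, 2), Visitors(2, 2, 129))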

in scala how do we aggregate an Array to determine the count per Key and the percentage vs total

I am trying to find an efficient way to compute the following:
Int1 = 1 or 0, Int2 = 1..k (where k = 3), and Double = 1.0.
I want to find how many 1s and 0s there are for every k.
I need to find the percentage of the result for 3 relative to the total size of the Array.
Input is:
val clusterAndLabel = sc.parallelize(Array((0, 0), (0, 0), (1, 0), (1, 1), (2, 1), (2, 1), (2, 0)))
So in this example:
I have: (0,0) = 2, (0,1) = 0
I have: (1,0) = 1, (1,1) = 1
I have: (2,1) = 2, (2,0) = 1
Total is 7 instances.
I was thinking of doing some aggregation, but I am stuck on the thought that they are both considered a 2-key join.
If you want to find how many 1s and 0s there are, you can do:
val rdd = clusterAndLabel.map(x => (x,1)).reduceByKey(_+_)
this will give you an RDD[((Int,Int),Int)] containing exactly what you described, i.e.: [((0,0),2), ((1,0),1), ((1,1),1), ((2,1),2), ((2,0),1)]. If you really want them grouped by their first key, you can add this line:
val rdd2 = rdd.map(x => (x._1._1, (x._1._2, x._2))).groupByKey()
this will yield an RDD[(Int, Iterable[(Int,Int)])] which will look like what you described, i.e.: [(0, [(0,2)]), (1, [(0,1),(1,1)]), (2, [(1,2),(0,1)])].
If you need the number of instances, it looks like (at least in your example) clusterAndLabel.count() should do the work.
I don't really understand question 3; I can see two possible readings:
You want to know how many keys have 3 occurrences. To do so, you can start from the object I called rdd (no need for the groupByKey line) and do:
val rdd3 = rdd.map(x => (x._2,1)).reduceByKey(_+_)
this will yield an RDD[(Int,Int)] which is kind of a frequency RDD: the key is the number of occurrences and the value is how many times this key is hit. Here it would look like: [(1,3),(2,2)]. So if you want to know how many pairs occur 3 times, you just do rdd3.filter(_._1==3).collect() (which will be an array of size 0, but if it's not empty then it'll have one value, and that will be your answer).
You want to know how many times the first key 3 occurs (once again, 0 in your example). Then you start from rdd2 and do:
val rdd3 = rdd2.map(x=>(x._1,x._2.size)).filter(_._1==3).collect()
Once again, this will yield either an empty array or an array of size 1 containing how many elements have 3 as their first key. Note that you can do it directly if you don't need to display rdd2; you can just do:
val rdd4 = rdd.map(x => (x._1._1,1)).reduceByKey(_+_).filter(_._1==3).collect()
(For performance, you might also want to do the filter before the reduceByKey!)
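If question 3 really is about a percentage of the total, a hedged sketch building on the rdd and the count above could look like this:

// Share of the total instance count for each (cluster, label) pair,
// assuming clusterAndLabel and rdd are defined as above.
val total = clusterAndLabel.count().toDouble
val percentages = rdd.mapValues(count => 100.0 * count / total)
// e.g. ((0,0), ~28.6), ((2,1), ~28.6), ((1,0), ~14.3), ...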