GraphFrames Pregel doesn't converge - pyspark

I have a relatively shallow, directed, acyclic graph represented in GraphFrames (a large number of nodes, mostly in disjoint subgraphs). I want to propagate the id of the root nodes (nodes without incoming edges) to all nodes downstream. To achieve this, I chose the Pregel algorithm. This process should converge once the passed messages stop changing; however, the process keeps going until the max iteration is reached.
This is a model of the problem:
# Imports used by this snippet (GraphFrames' Pregel lives in graphframes.lib):
import pyspark.sql.functions as f
from graphframes import GraphFrame
from graphframes.lib import Pregel

data = [
('v1', 'v1'),
('v3', 'v1'),
('v2', 'v1'),
('v4', 'v2'),
('v4', 'v5'),
('v5', 'v5'),
('v6', 'v4'),
]
df = spark.createDataFrame(data, ['variantId', 'explained']).persist()
# Create nodes:
nodes = (
df.select(
f.col('variantId').alias('id'),
f.when(f.col('variantId') == f.col('explained'), f.col('variantId')).alias('origin_root')
)
.distinct()
)
# Create edges:
edges = (
df
.filter(f.col('variantId')!=f.col('explained'))
.select(
f.col('variantId').alias('dst'),
f.col('explained').alias('src'),
f.lit('explains').alias('edgeType')
)
.distinct()
)
# Converting into a graphframe graph:
graph = GraphFrame(nodes, edges)
The graph will look like this:
I want to propagate the roots as follows:
[v1] => v2 and v3,
[v1, v5] => v4 and v6.
To do this I wrote the following code:
maxiter = 3
(
graph.pregel
.setMaxIter(maxiter)
# New column for the resolved roots:
.withVertexColumn(
"resolved_roots",
# The value is initialized by the original root value:
f.when(
f.col('origin_root').isNotNull(),
f.array(f.col('origin_root'))
).otherwise(f.array()),
# When new value arrives to the node, it gets merged with the existing list:
f.when(
Pregel.msg().isNotNull(),
f.array_union(Pregel.msg(), f.col('resolved_roots'))
).otherwise(f.col("resolved_roots"))
)
# Send the accumulated list of resolved roots downstream to the destination nodes:
.sendMsgToDst(Pregel.src("resolved_roots"))
# All messages arriving at a node are collected and flattened into a single list:
.aggMsgs(f.flatten(f.collect_list(Pregel.msg())))
.run()
.orderBy( 'id')
.show()
)
It returns:
+---+-----------+--------------+
| id|origin_root|resolved_roots|
+---+-----------+--------------+
| v1| v1| [v1]|
| v2| null| [v1]|
| v3| null| [v1]|
| v4| null| [v1, v5]|
| v5| v5| [v5]|
| v6| null| [v1, v5]|
+---+-----------+--------------+
Although all the nodes now have root information, and that information no longer changes, the process just keeps going if we increase the max iteration number to 100.
The questions:
Why doesn't this process converge?
What can I do to make sure it converges?
Is this the right approach to achieve this goal?
Any helpful comment is highly appreciated; I'm absolutely new to graphs.

Related

pyspark SparseVectors dataframe columns .dot product or any other vectors type column computation using @udf or @pandas_udf

I am trying to compute the .dot product between two columns of a given dataframe. SparseVectors already have this ability in Spark, so I am trying to do it in an easy and scalable way without converting to RDDs or to DenseVectors, but I'm stuck. I've spent the past three days trying to find an approach and it keeps failing: no computation is returned for the two vector columns passed from the dataframe. I'm looking for guidance on this, because I'm clearly missing something here and I'm not sure what the root cause is.
This approach works for separate vectors and RDD vectors, but it fails when passing dataframe column vectors. To replicate the flow and the issues, please see below; ideally this computation should happen in parallel, since the real data has billions of rows or more (dataframe observations):
from pyspark.ml.linalg import Vectors, SparseVector
from pyspark.sql import Row
from pyspark.sql import functions as F
from pyspark.sql.functions import udf, col
from pyspark.sql.types import ArrayType, FloatType
df = spark.createDataFrame(
[
[["a","b","c"], SparseVector(4527, {0:0.6363067860791387, 1:1.0888040725098247, 31:4.371858972705023}),SparseVector(4527, {0:0.6363067860791387, 1:2.0888040725098247, 31:4.371858972705023})],
[["d"], SparseVector(4527, {8: 2.729945780576634}), SparseVector(4527, {8: 4.729945780576634})],
], ["word", "i", "j"])
# # dataframe content
df.show()
+---------+--------------------+--------------------+
| word| i| j|
+---------+--------------------+--------------------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...|
+---------+--------------------+--------------------+
@udf(returnType=ArrayType(FloatType()))
def sim_cos(v1, v2):
if v1 is not None and v2 is not None:
return float(v1.dot(v2))
# # calling udf
df = df.withColumn("dotP", sim_cos(df.i, df.j))
# # output after udf
df.show()
+---------+--------------------+--------------------+----------+
| word| i| j| dotP|
+---------+--------------------+--------------------+----------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...| null|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...| null|
+---------+--------------------+--------------------+----------+
Rewriting the udf as a lambda does work on Spark 2.4.5. Posting in case anyone is interested in this approach for PySpark dataframes:
# # rewrite udf as lambda function:
sim_cos = F.udf(lambda x,y : float(x.dot(y)), FloatType())
# # executing udf on dataframe
df = df.withColumn("similarity", sim_cos(col("i"),col("j")))
# # end result
df.show()
+---------+--------------------+--------------------+----------+
| word| i| j|similarity|
+---------+--------------------+--------------------+----------+
|[a, b, c]|(4527,[0,1,31],[0...|(4527,[0,1,31],[0...| 21.792336|
| [d]|(4527,[8],[2.7299...|(4527,[8],[4.7299...| 12.912496|
+---------+--------------------+--------------------+----------+
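For completeness, the decorator form above most likely returned null because the declared return type, ArrayType(FloatType()), does not match the scalar float the function actually returns; when the returned value cannot be converted to the declared type, Spark silently yields null. A sketch of the decorator variant with a matching return type (assuming the same df as above):
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

@udf(returnType=FloatType())
def sim_cos(v1, v2):
    # dot product of two SparseVector columns; returns None (null) if either side is null
    if v1 is not None and v2 is not None:
        return float(v1.dot(v2))

df = df.withColumn("dotP", sim_cos(df.i, df.j))
df.show()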

PySpark: remove rows which derivate from others

I have the following dataframe, which contains all the paths within a tree after going through all nodes. For each jump between nodes, a row is created where "dist" is the number of nodes so far, "node" is the current node, and "path" is the path so far.
dist | node | path
0 | 1 | [1]
1 | 2 | [1,2]
1 | 5 | [1,5]
2 | 3 | [1,2,3]
2 | 4 | [1,2,4]
At the end I just want to have a dataframe containing the complete paths without the intermediate steps:
dist | node | path
1 | 5 | [1,5]
2 | 3 | [1,2,3]
2 | 4 | [1,2,4]
I also tried having the path column as a string ("1;2;3") and checking which rows are substrings of each other; however, I could not find a way to do that.
I found my old code and created an adapted example for your problem. I used the Spark graph library GraphFrames for this. The paths can be determined by a Pregel-like message aggregation loop.
Here is the code.
First import all modules
from pyspark import SparkContext, SparkConf
from pyspark.sql import SparkSession
from pyspark.sql import SQLContext
import pyspark.sql.functions as f
from graphframes import GraphFrame
from pyspark.sql.types import *
from graphframes.lib import *
# shortcut for the aggregate message object from the graphframes.lib
AM=AggregateMessages
# to plot the graph
import numpy as np
import networkx as nx
import matplotlib.pyplot as plt
spark = (SparkSession
.builder
.appName("PathReduction")
.getOrCreate()
)
sc=spark.sparkContext
Then create a sample dataset
# create dataframe
raw_data = [
("0","1"),
("1","2"),
("1","5"),
("2","3"),
("2","4"),
("a","b"),
("b","c"),
("c","d")]
schema = ["src","dst"]
data = spark.createDataFrame(data=raw_data, schema = schema)
data.show()
+---+---+
|src|dst|
+---+---+
| 0| 1|
| 1| 2|
| 1| 5|
| 2| 3|
| 2| 4|
| a| b|
| b| c|
| c| d|
+---+---+
For visualisation run
plotData_1 = data.select("src","dst").rdd.collect()
plotData_2 = np.array(plotData_1)
plotData_3=[]
for row in plotData_2:
plotData_3.append((row[0],row[1]))
G=nx.DiGraph(directed=True)
G.add_edges_from(plotData_3)
options = {
'node_color': 'orange',
'node_size': 500,
'width': 2,
'arrowstyle': '-|>',
'arrowsize': 20,
}
nx.draw(G, arrows=True, **options,with_labels=True)
With this message aggregation algorithm you find the paths you are looking for. If you set the flag show_steps to True, the results of each step are shown, which helps to understand the algorithm.
# if the flag is true, print the results within the loop for debugging
show_steps=False
# max iterations of the loop; should be larger than the longest expected path
max_iter=10
# create vertices from edge data set
vertices=(data.select("src").union(data.select("dst")).distinct().withColumnRenamed('src', 'id'))
edges=data
# create graph to get in and out degrees
gx = GraphFrame(vertices, edges)
# calculate in and out degrees of each node
inDegrees=gx.inDegrees
outDegrees=gx.outDegrees
if(show_steps==True):
print("in and out degrees")
inDegrees.show()
outDegrees.show()
# create initial vertices
init_vertices=(vertices
# join out degrees on vertices
.join(outDegrees,on="id",how="left")
# join in degree on vertices
.join(inDegrees,on="id",how="left")
# define root, children in the middle and leaves of the path in order to distinguish full paths later on
.withColumn("nodeType",f.when(f.col("inDegree").isNull(),"root").otherwise(f.when(f.col("outDegree").isNull(),"leaf").otherwise("child")))
# define the message with all information [array(id) and array(nodeType)] to be sent to the next node
.withColumn("message",f.array_union(f.array(f.array(f.col("id"))),f.array(f.array(f.col("nodeType")))))
# remove columns that are not used anymore
.drop("inDegree","outDegree")
)
if(show_steps==True):
print("init vertices")
init_vertices.show()
# update graph object with init vertices
gx = GraphFrame(init_vertices, edges)
# define an empty dataframe to append found paths to
results = spark.createDataFrame(
sc.emptyRDD(),
StructType([StructField("paths",ArrayType(StringType()),True)])
)
# start the loop for message aggregation. Set a max_iter value larger than the longest expected path
for iter_ in range(max_iter):
if(show_steps==True):
print("iteration step=" + str(iter_))
print("##################################################")
# define the message that should be sent. Here we take the message column of the destination node and send it backward to the source node
msgToSrc = AM.dst["message"]
agg = gx.aggregateMessages(
f.collect_set(AM.msg).alias("aggMess"), # the aggregation function collects the messages into an array (attention: this can be an expensive operation in terms of shuffle)
sendToSrc=msgToSrc,
sendToDst=None
)
if(show_steps==True):
print("aggregated message")
agg.show(truncate=False)
# stop loop if no more agg messages collected
if(len(agg.take(1))==0):
print("All paths found in " + str(iter_) + " iterations")
break
# get new vertices to send into the next round. Here we prepare the next message; all _column_names are temporary columns for calculation purposes only
vertices_update=(agg
# join the initial data to the aggregation in order to have the nodeType of the vertex
.join(init_vertices,on="id",how="left")
# explode the nested array with the path and the nodeType
.withColumn("_explode_to_flatten_array",f.explode(f.col("aggMess")))
# put the path array into a separate column
.withColumn("_dataMsg",f.col("_explode_to_flatten_array")[0])
# put the node type into a separate column
.withColumn("_typeMsg",f.col("_explode_to_flatten_array")[1][0])
# decide if a path is complete. A path is complete if the vertex type is root and the message type is leaf
.withColumn("pathComplete",f.when(((f.col("nodeType")=="root") & (f.col("_typeMsg")=="leaf")),True).otherwise(False))
# append the current vertex id to the path array that is sent forward
.withColumn("_message",f.array_union(f.array(f.col("id")),f.col("_dataMsg")))
# merge together the path array and the nodeType array for the new message object
.withColumn("message",f.array_union(f.array(f.col("_message")),f.array(f.array(f.col("_typeMsg")))))
)
if(show_steps==True):
print("new vertices with all temp columns")
vertices_update.show()
# add complete paths to the result dataframe
results=(
results
.union(
vertices_update
.where(f.col("pathComplete")==True)
.select(f.col("_message"))
)
)
# cache the vertices for the next iteration and only push forward the two relevant columns in order to reduce data shuffling between Spark executors
cachedNewVertices = AM.getCachedDataFrame(vertices_update.select("id","message"))
# create new updated graph object for next iteration
gx = GraphFrame(cachedNewVertices, gx.edges)
print("##################################################")
print("Collecting result set")
results.show()
It then shows the correct results:
All paths found in 3 iterations
##################################################
Collecting result set
+------------+
| paths|
+------------+
| [0, 1, 5]|
|[0, 1, 2, 3]|
|[0, 1, 2, 4]|
|[a, b, c, d]|
+------------+
To get your final dataframe you can join it back, or take the first and last element of the array into separate columns:
result2=(results
.withColumn("dist",f.element_at(f.col("paths"), 1))
.withColumn("node",f.element_at(f.col("paths"), -1))
)
result2.show()
+------------+----+----+
| paths|dist|node|
+------------+----+----+
| [0, 1, 5]| 0| 5|
|[0, 1, 2, 3]| 0| 3|
|[0, 1, 2, 4]| 0| 4|
|[a, b, c, d]| a| d|
+------------+----+----+
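If you want dist to mean the number of hops, as in the question's desired output, a small variation (a sketch, reusing the results dataframe and the f alias from above) derives it from the path length instead of taking the first element:
result3=(results
.withColumn("dist",f.size(f.col("paths"))-1)
.withColumn("node",f.element_at(f.col("paths"),-1))
)
result3.show()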
You could presumably write the same algorithm with the GraphFrames Pregel API.
P.S.: The algorithm in this form might cause problems if the graph has loops or backward-directed edges. I had another algorithm to first clean up loops and cycles.

How to find membership of vertices using Graphframes or igraph or networx in pyspark

My input dataframe is df:
valx valy
1: 600060 09283744
2: 600131 96733110
3: 600194 01700001
and I want to create a graph treating the above two columns as an edge list; my output should then have the list of all vertices of the graph with their membership.
I have tried GraphFrames in pyspark and the networkx library too, but I am not getting the desired results.
My output should look like below (it's basically all valx and valy values under V1, as vertices, and their membership info under V2):
V1 V2
600060 1
96733110 1
01700001 3
I tried the following:
import networkx as nx
import pandas as pd
filelocation = r'Pathtodataframe df csv'
Panda_edgelist = pd.read_csv(filelocation)
g = nx.from_pandas_edgelist(Panda_edgelist,'valx','valy')
g2 = g.to_undirected(g)
list(g.nodes)
I'm not sure if you are violating any rules here by asking the same question two times.
To detect communities with GraphFrames, you first have to create a GraphFrame object. Given your example dataframe, the following code snippet shows the necessary transformations:
from graphframes import *
sc.setCheckpointDir("/tmp/connectedComponents")
l = [
( '600060' , '09283744'),
( '600131' , '96733110'),
( '600194' , '01700001')
]
columns = ['valx', 'valy']
#this is your input dataframe
edges = spark.createDataFrame(l, columns)
#graphframes requires two dataframes: an edge and a vertex dataframe.
#the edge dataframe has to have at least two columns labeled with src and dst.
edges = edges.withColumnRenamed('valx', 'src').withColumnRenamed('valy', 'dst')
edges.show()
#the vertex dataframe requires at least one column labeled with id
vertices = edges.select('src').union(edges.select('dst')).withColumnRenamed('src', 'id')
vertices.show()
g = GraphFrame(vertices, edges)
Output:
+------+--------+
| src| dst|
+------+--------+
|600060|09283744|
|600131|96733110|
|600194|01700001|
+------+--------+
+--------+
| id|
+--------+
| 600060|
| 600131|
| 600194|
|09283744|
|96733110|
|01700001|
+--------+
You wrote in the comments of your other question that the community detection algorithm doesn't matter to you currently. Therefore I will pick connected components:
result = g.connectedComponents()
result.show()
Output:
+--------+------------+
| id| component|
+--------+------------+
| 600060|163208757248|
| 600131| 34359738368|
| 600194|884763262976|
|09283744|163208757248|
|96733110| 34359738368|
|01700001|884763262976|
+--------+------------+
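If you need membership labels in the shape of the desired V1/V2 output (small integers instead of the raw component ids), one option is to rank the component ids. A sketch, assuming the result dataframe from connectedComponents above; note that the concrete label values are arbitrary, only equality between them is meaningful:
from pyspark.sql import Window, functions as F
membership = (result
# ranking without partitionBy moves all rows to one partition; fine for small results
.withColumn('V2', F.dense_rank().over(Window.orderBy('component')))
.selectExpr('id as V1', 'V2')
)
membership.show()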
Other community detection algorithms (like LPA) can be found in the user guide.
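For example, label propagation is a single call as well (a sketch; like the component ids, the returned labels are arbitrary identifiers):
communities = g.labelPropagation(maxIter=5)
communities.show()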

pyspark dataframe complex calculation with previous row

I am working with PySpark and trying to figure out how to do a complex calculation with previous columns. I think there are generally two ways to do calculations with previous columns: window functions and mapPartitions. I think my problem is too complex to solve with windows, and I want the result as a separate row, not a column, so I am trying to use mapPartitions. I am having trouble with the syntax.
For instance, here is a rough draft of the code:
def change_dd(rows):
prev_rows = []
prev_rows.append(rows)
for row in rows:
new_row=[]
for entry in row:
# Testing to figure out syntax, things would get more complex
new_row.append(entry + prev_rows[0])
yield new_row
updated_rdd = select.rdd.mapPartitions(change_dd)
However, I can't access the individual rows of prev_rows. It seems prev_rows[0] is an itertools.chain. How do I iterate over prev_rows[0]?
edit
neighbor = sc.broadcast(df_sliced.where(df_sliced.id == neighbor_idx).collect()[0][:-1]).value
current = df_sliced.where(df_sliced.id == i)
def oversample_dt(dataframe):
for row in dataframe:
new_row = []
for entry, neigh in zip(row, neighbor):
if isinstance(entry, str):
if scale < 0.5:
new_row.append(entry)
else:
new_row.append(neigh)
else:
if isinstance(entry, int):
new_row.append(int(entry + (neigh - entry) * scale))
else:
new_row.append(entry + (neigh - entry) * scale)
yield new_row
sttt = time.time()
sample = current.rdd.mapPartitions(oversample_dt).toDF(schema)
In the end, I ended up doing it like this for now, but I really don't want to use collect in the first line. If someone knows how to fix this or can point out any problem with my use of PySpark, please tell me.
edit2
-- Suppose Alice and its neighbor Alice_2
scale = 0.4
+---+-------+--------+
|age| name | height |
+---+-------+--------+
| 10| Alice | 170 |
| 11|Alice_2| 175 |
+---+-------+--------+
Then, I want a row
+----------+---------+--------------+
|       age|     name|        height|
+----------+---------+--------------+
| 10+1*0.4 | Alice_2 | 170 + 5*0.4  |
+----------+---------+--------------+
Why not use dataframes?
Add a column to the dataframe with the previous values using window functions like this:
from pyspark.sql import SparkSession, functions
from pyspark.sql.window import Window
spark_session = SparkSession.builder.getOrCreate()
df = spark_session.createDataFrame([{'name': 'Alice', 'age': 1}, {'name': 'Alice_2', 'age': 2}])
df.show()
+---+-------+
|age| name|
+---+-------+
| 1| Alice|
| 2|Alice_2|
+---+-------+
window = Window.partitionBy().orderBy('age')
df = df.withColumn("age-1", functions.lag(df.age).over(window))
df.show()
You can use this function for every column
+---+-------+-----+
|age| name|age-1|
+---+-------+-----+
| 1| Alice| null|
| 2|Alice_2| 1|
+---+-------+-----+
And then just do your calculation.
And if you want to use an RDD, just use df.rdd.
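To illustrate with the age/name/height example from the question (a sketch only; the scale value, the list of numeric columns, and the interpolation formula are assumptions based on the desired output above): lag each numeric column once, then compute the interpolated row from the current and previous values.
from pyspark.sql import SparkSession, functions
from pyspark.sql.window import Window

spark_session = SparkSession.builder.getOrCreate()
scale = 0.4
df = spark_session.createDataFrame(
    [(10, 'Alice', 170), (11, 'Alice_2', 175)], ['age', 'name', 'height'])
window = Window.partitionBy().orderBy('age')
# lag every numeric column once
for c in ['age', 'height']:
    df = df.withColumn(c + '_prev', functions.lag(c).over(window))
# keep only rows that have a previous row and interpolate between the two
interpolated = df.where(functions.col('age_prev').isNotNull()).select(
    (functions.col('age_prev') + (functions.col('age') - functions.col('age_prev')) * scale).alias('age'),
    'name',
    (functions.col('height_prev') + (functions.col('height') - functions.col('height_prev')) * scale).alias('height'),
)
interpolated.show()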

Spark columnar performance

I'm a relative beginner in all things Spark. I have a wide dataframe (1000 columns) to which I want to add columns based on whether a corresponding column has missing values,
so
+----+
| A |
+----+
| 1 |
+----+
|null|
+----+
| 3 |
+----+
becomes
+----+-------+
| A | A_MIS |
+----+-------+
| 1 | 0 |
+----+-------+
|null| 1 |
+----+-------+
| 3 | 1 |
+----+-------+
This is part of a custom ml transformer but the algorithm should be clear.
override def transform(dataset: org.apache.spark.sql.Dataset[_]): org.apache.spark.sql.DataFrame = {
var ds = dataset
dataset.columns.foreach(c => {
if (dataset.filter(col(c).isNull).count() > 0) {
ds = ds.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
}
})
ds.toDF()
}
Loop over the columns; if a column has > 0 nulls, create a new column.
The dataset passed in is cached (using the .cache method) and the relevant config settings are the defaults.
This is running on a single laptop for now, and it takes on the order of 40 minutes for the 1000 columns, even with a minimal number of rows.
I thought the problem was due to hitting a database, so I tried with a parquet file instead with the same result. Looking at the jobs UI it appears to be doing filescans in order to do the count.
Is there a way I can improve this algorithm to get better performance, or tune the caching in some way? Increasing spark.sql.inMemoryColumnarStorage.batchSize just got me an OOM error.
Remove the condition:
if (dataset.filter(col(c).isNull).count() > 0)
and leave only the internal expression. As written, Spark requires one data scan per column (#columns scans).
If you want to prune columns, compute the statistics once, as outlined in Count number of non-NaN entries in each column of Spark dataframe with Pyspark, and use a single drop call.
Here's the code that fixes the problem.
override def transform(dataset: Dataset[_]): DataFrame = {
var ds = dataset
val rowCount = dataset.count()
val exprs = dataset.columns.map(count(_))
val colCounts = dataset.agg(exprs.head, exprs.tail: _*).toDF(dataset.columns: _*).first()
dataset.columns.foreach(c => {
if (colCounts.getAs[Long](c) > 0 && colCounts.getAs[Long](c) < rowCount ) {
ds = ds.withColumn(c + "_MIS", when(col(c).isNull, 1).otherwise(0))
}
})
ds.toDF()
}