Not getting the correct label from GraphFrames LPA - PySpark

I am using GraphFrames LPA to find communities, but somehow it's not giving me the expected result:
graph_data = spark.createDataFrame([
("a", "d", "friend"),
("b", "d", "friend"),
("c", "d", "friend")
], ["src", "dst", "relationship"])
My requirement here is to get a single community id for all vertices a, b, c and d, but I am getting two different community ids: one for a, b, c and one for d.
code:
from graphframes import GraphFrame

df1 = graph_data.selectExpr('src AS id')
df2 = graph_data.selectExpr('dst AS id')
vertices = df1.union(df2).distinct()
edges = graph_data
g = GraphFrame(vertices, edges)
communities = g.labelPropagation(maxIter=5)

Given that d is a root, it gets a separate label. To get a single label for all vertices, I recommend using connected components instead; see the docs.
communities = g.connectedComponents()
Note: this requires that you set a checkpoint directory beforehand:
sc.setCheckpointDir("some_path")
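A minimal end-to-end sketch of that suggestion, reusing the vertices and edges built above (the checkpoint path here is just a placeholder):
from graphframes import GraphFrame
# connectedComponents requires a checkpoint directory; this path is a placeholder
sc.setCheckpointDir("/tmp/graphframes-checkpoints")
g = GraphFrame(vertices, edges)
components = g.connectedComponents()
# a, b, c and d are all connected through d, so they share a single component id
components.select("id", "component").show()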

Related

NetworkX - Connected Columns

I'm trying to create a data visualization, and maybe NetworkX isn't the best tool, but I would like to have parallel columns of nodes (two separate groups) which connect to each other. I can't figure out how to place the two groups of nodes in this layout; the different options I have tried always default to a more 'web-like' layout. I'm trying to create a visualization where customers/companies (the keys in a dictionary) have edges drawn to product nodes (the values in the same dictionary).
For example:
d = {"A":[ 1, 2, 3], "B": [2,3], "C": [1,3]
From dictionary 'd', we would have a column of nodes ["A", "B", "C"] and second column [1, 2, 3] and between the two edges would be drawn.
A 1
B 2
C 3
UPDATE:
The suggested 'pos' argument helped, but I was having difficulties using it on multiple groups of nodes. Here is the method I came up with:
nodes = ["A", "B", "C", "D"]
nodes2 = ["X", "Y", "Z"]
edges = [("A","Y"),("C","X"), ("C","Z")]
#This function will take a list of values we want to turn into nodes
# Then it assigns a y-value for a specific value of X creating columns
def create_pos(column, node_list):
pos = {}
y_val = 0
for key in node_list:
pos[key] = (column, y_val)
y_val = y_val+1
return pos
G.add_nodes_from(nodes)
G.add_nodes_from(nodes2)
G.add_edges_from(edges)
pos1 = create_pos(0, nodes)
pos2 = create_pos(1, nodes2)
pos = {**pos1, **pos2}
nx.draw(G, pos)
Here is the code I wrote, with the help of @wolfevokcats, to create columns of nodes which are connected.
import networkx as nx
import matplotlib.pyplot as plt
G = nx.Graph()
nodes = ["A", "B", "C", "D"]
nodes2 = ["X", "Y", "Z"]
edges = [("A", "Y"), ("C", "X"), ("C", "Z")]
# This function will take a list of values we want to turn into nodes
# Then it assigns a y-value for a specific value of X creating columns
def create_pos(column, node_list):
    pos = {}
    y_val = 0
    for key in node_list:
        pos[key] = (column, y_val)
        y_val = y_val + 1
    return pos
G.add_nodes_from(nodes)
G.add_nodes_from(nodes2)
G.add_edges_from(edges)
pos1 = create_pos(0, nodes)
pos2 = create_pos(1, nodes2)
pos = {**pos1, **pos2}
nx.draw(G, pos, with_labels=True)
plt.show()
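For completeness, here is a small sketch (variable names illustrative) that builds the two columns directly from the dictionary in the original question, reusing create_pos from above:
d = {"A": [1, 2, 3], "B": [2, 3], "C": [1, 3]}
products = sorted({v for vals in d.values() for v in vals})
G2 = nx.Graph()
G2.add_nodes_from(d.keys())
G2.add_nodes_from(products)
# One edge per (customer, product) pair in the dictionary
G2.add_edges_from((k, v) for k, vals in d.items() for v in vals)
# Customers in the left column (x=0), products in the right column (x=1)
pos = {**create_pos(0, sorted(d.keys())), **create_pos(1, products)}
nx.draw(G2, pos, with_labels=True)
plt.show()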

How can I group RDD by key then count per unique string?

I have an RDD like:
[(1, "Western"),
(1, "Western")
(1, "Drama")
(2, "Western")
(2, "Romance")
(2, "Romance")]
I wish to count, per userID, the occurrences of each movie genre, resulting in
(1, [("Western", 2), ("Drama", 1)]), ...
After that, my intention is to pick the genre with the largest count, thus obtaining the most popular genre per user.
I have tried userGenre.sortByKey().countByValue(), but to no avail; I have no clue how to perform this task. I'm using PySpark in a Jupyter notebook.
EDIT:
I have tried the following and it seems to have worked; could someone confirm?
userGenreRDD.map(lambda x: (x, 1)).aggregateByKey(
    0,                       # initial value for an accumulator
    lambda r, v: r + v,      # function that adds a value to an accumulator
    lambda r1, r2: r1 + r2   # function that merges/combines two accumulators
)
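To check what this returns, here is a minimal sketch, assuming userGenreRDD holds (userID, genre) pairs like the sample above; the map keys each record by the full (userID, genre) tuple, so aggregateByKey sums the 1s into a per-user genre count:
userGenreRDD = sc.parallelize([(1, "Western"), (1, "Western"), (1, "Drama"),
                               (2, "Western"), (2, "Romance"), (2, "Romance")])
genre_counts = userGenreRDD.map(lambda x: (x, 1)).aggregateByKey(
    0, lambda r, v: r + v, lambda r1, r2: r1 + r2)
print(genre_counts.collect())
# e.g. [((1, 'Western'), 2), ((1, 'Drama'), 1), ((2, 'Western'), 1), ((2, 'Romance'), 2)] (order may vary)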
Here is one way of doing it.
rdd = sc.parallelize([('u1', "Western"),('u2', "Western"),('u1', "Drama"),('u1', "Western"),('u2', "Romance"),('u2', "Romance")])
The occurrence count of each movie genre per user could be computed as:
>>> counts = sc.parallelize(rdd.countByValue().items())
>>> counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).groupByKey().mapValues(list).collect()
[('u1', [('Western', 2), ('Drama', 1)]), ('u2', [('Western', 1), ('Romance', 2)])]
Most popular genre
>>> counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).reduceByKey(lambda a, b: a if a[1] >= b[1] else b).collect()
[('u1', ('Western', 2)), ('u2', ('Romance', 2))]
Now one could ask: what should the most popular genre be if more than one genre has the same popularity count? One illustrative way to make that choice deterministic is sketched below.
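A sketch (the tie-break rule is an arbitrary convention, not something prescribed above): prefer the higher count and, on ties, the alphabetically first genre; output order may vary.
>>> counts.map(lambda kv: (kv[0][0], (kv[0][1], kv[1]))).reduceByKey(lambda a, b: min(a, b, key=lambda t: (-t[1], t[0]))).collect()
[('u1', ('Western', 2)), ('u2', ('Romance', 2))]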

How to embed spark dataframe columns to a map column?

I have a Spark dataframe with many columns. Now, I want to combine them into a map and build a new column.
e.g.
col1:String col2:String col3:String... coln:String =>
col: Map(colname -> colval)
One way to do this is to:
df.withColumn("newcol", struct(df.columns.head, df.columns.tail: _*))
However, I still have to convert df to a Dataset, and I have no idea how to define a case class that matches this struct type.
Another option is to embed the columns in a Map type, but I do not know how to express this.
For performance reasons, you can avoid rolling your own UDF by using the existing Spark function:
org.apache.spark.sql.functions.map
Here is a fully worked example:
import org.apache.spark.sql.functions._

val mydata = Seq(("a", "b", "c"), ("d", "e", "f"), ("g", "h", "i"))
  .toDF("f1", "f2", "f3")
// Interleave each column name (as a literal) with its value column
val colnms_n_vals = mydata.columns.flatMap { c => Array(lit(c), col(c)) }
// display() is Databricks-specific; use .show(false) elsewhere
display(mydata.withColumn("myMap", map(colnms_n_vals: _*)))
Results in this:
f1 f2 f3 myMap
a b c {"f1":"a","f2":"b","f3":"c"}
d e f {"f1":"d","f2":"e","f3":"f"}
g h i {"f1":"g","f2":"h","f3":"i"}
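For reference, a rough PySpark equivalent of the same idea (a sketch assuming the same toy data; create_map is the Python counterpart of functions.map):
from itertools import chain
from pyspark.sql.functions import create_map, lit, col
mydata = spark.createDataFrame([("a", "b", "c"), ("d", "e", "f"), ("g", "h", "i")], ["f1", "f2", "f3"])
# Interleave literal column names with the column values, then build the map
kvs = list(chain.from_iterable((lit(c), col(c)) for c in mydata.columns))
mydata.withColumn("myMap", create_map(*kvs)).show(truncate=False)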
If you want to build a new column from all existing columns, here is one simple solution.
import scala.collection.mutable
import org.apache.spark.sql.functions._

val columnsName = ds.columns
// Zip the column names with the row values collected into an array column
val mkMap = udf((values: mutable.WrappedArray[Int]) => columnsName.zip(values).toMap)
ds.withColumn("new_col", mkMap(array(columnsName.head, columnsName.tail: _*)))

Scala Random forest feature importance extraction with names (labels)

Is there any way to extract the feature importances from a model and attach the featureCols names for easier analysis?
I have something like:
import org.apache.spark.ml.feature.{StringIndexer, VectorAssembler}
import org.apache.spark.ml.classification.RandomForestClassifier

val featureCols = Array("a", "b", "c" /* ... like 67 more */)
val assembler = new VectorAssembler().setInputCols(featureCols).setOutputCol("features")
val df2 = assembler.transform(modeling_db)
val labelIndexer = new StringIndexer().setInputCol("def").setOutputCol("label")
val df3 = labelIndexer.fit(df2).transform(df2)
val splitSeed = 5043
val Array(trainingData, testDataCE) = df3.randomSplit(Array(0.7, 0.3), splitSeed)
val classifier = new RandomForestClassifier().setImpurity("gini").setMaxDepth(19).setNumTrees(57).setFeatureSubsetStrategy("auto").setSeed(5043)
val model = classifier.fit(trainingData)
After that, we try to extract the importance with:
model.featureImportances
and the answer is really hard to analyze:
res14: org.apache.spark.mllib.linalg.Vector = (71,[0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,20,23,25,27,33,34,35,38,39,41,42,45,47,48,49,50,51,52,53,54,55,56,57,58,60,61,62,63,64,65,66,67,68,69,70],[0.22362951804309808,0.1830148359365108,0.10246542303449771,0.1699399958851977,0.06486419413350401,0.05187244974385025,0.02627047699833213,0.014498050071723645,0.026182513062665076,0.007126662761055224,0.012375060477018274,0.004354513006816487,0.004361008357237427,0.008435852744278544,0.003195472326415685,0.0023071401643885753,0.004602370417578224,0.0030394399903992345,6.92408316823549E-4,0.011207695216651398,7.609910745572573E-4,8.316382113306638E-4,0.0021506289318167916,0.0013468620354363688,0.006968754359778437,0.018796331618729723,0.0024516591941419444,0.005980997035580654,0.0027983...
Is there a way to "unpack" this answer and append it to the original label names?
You have the original column names in featureCols, and nothing here appears to reorder the feature vector, therefore you can simply zip the two arrays together. For input data like this:
val featureCols = Array("a", "b", "c", "d", "e")
val featureImportance = Vectors.dense(Array(0.15, 0.25, 0.1, 0.35, 0.15)).toSparse
Simply do
val res = featureCols.zip(featureImportance.toArray).sortBy(-_._2)
which, when printed with res.foreach(println), will result in
(d,0.35)
(b,0.25)
(a,0.15)
(e,0.15)
(c,0.1)
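For PySpark users, the same zip-and-sort idea looks roughly like this (a sketch assuming model is a fitted RandomForestClassificationModel and featureCols is the list given to the VectorAssembler):
importances = sorted(zip(featureCols, model.featureImportances.toArray()), key=lambda t: -t[1])
for name, score in importances:
    print(name, score)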

Spark migrate sql window function to RDD for better performance

A function should be executed for multiple columns in a data frame
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

def handleBias(df: DataFrame, colName: String, target: String = target) = {
  val w1 = Window.partitionBy(colName)
  val w2 = Window.partitionBy(colName, target)
  df.withColumn("cnt_group", count("*").over(w2))
    .withColumn("pre2_" + colName, mean(target).over(w1))
    .withColumn("pre_" + colName, coalesce(min(col("cnt_group") / col("cnt_foo_eq_1")).over(w1), lit(0D)))
    .drop("cnt_group")
}
This can be written nicely as shown above in Spark SQL and a for loop. However, this causes a lot of shuffles (see spark apply function to columns in parallel).
A minimal example:
val df = Seq(
  (0, "A", "B", "C", "D"),
  (1, "A", "B", "C", "D"),
  (0, "d", "a", "jkl", "d"),
  (0, "d", "g", "C", "D"),
  (1, "A", "d", "t", "k"),
  (1, "d", "c", "C", "D"),
  (1, "c", "B", "C", "D")
).toDF("TARGET", "col1", "col2", "col3TooMany", "col4")
val columnsToDrop = Seq("col3TooMany")
val columnsToCode = Seq("col1", "col2")
val target = "TARGET"
val targetCounts = df.filter(df(target) === 1).groupBy(target)
  .agg(count(target).as("cnt_foo_eq_1"))
val newDF = df.join(broadcast(targetCounts), Seq(target), "left")
val result = (columnsToDrop ++ columnsToCode).toSet.foldLeft(newDF) {
  (currentDF, colName) => handleBias(currentDF, colName)
}
result.drop(columnsToDrop: _*).show
How can I formulate this more efficiently using the RDD API? aggregateByKey should be a good fit, but it is still not clear to me how to apply it here to substitute for the window functions.
(A bit more context / a bigger example is provided at https://github.com/geoHeil/sparkContrastCoding.)
edit
Initially, I started with the approach from Spark dynamic DAG is a lot slower and different from hard coded DAG, which is shown below. The good thing is that each column seems to run independently/in parallel. The downside is that the joins (even for a small dataset of 300 MB) get "too big" and lead to an unresponsive Spark.
handleBiasOriginal("col1", df)
.join(handleBiasOriginal("col2", df), df.columns)
.join(handleBiasOriginal("col3TooMany", df), df.columns)
.drop(columnsToDrop: _*).show
def handleBiasOriginal(col: String, df: DataFrame, target: String = target): DataFrame = {
val pre1_1 = df
.filter(df(target) === 1)
.groupBy(col, target)
.agg((count("*") / df.filter(df(target) === 1).count).alias("pre_" + col))
.drop(target)
val pre2_1 = df
.groupBy(col)
.agg(mean(target).alias("pre2_" + col))
df
.join(pre1_1, Seq(col), "left")
.join(pre2_1, Seq(col), "left")
.na.fill(0)
}
This image is with Spark 2.1.0; the images from Spark dynamic DAG is a lot slower and different from hard coded DAG are with 2.0.2.
The DAG will be a bit simpler when caching is applied:
df.cache
handleBiasOriginal("col1", df). ...
What possibilities other than window functions do you see to optimize the SQL?
Ideally it would be great if the SQL were generated dynamically.
The main point here is to avoid unnecessary shuffles. Right now your code shuffles twice for each column you want to include and the resulting data layout cannot be reused between columns.
For simplicity I assume that target is always binary ({0, 1}) and all remaining columns you use are of StringType. Furthermore I assume that the cardinality of the columns is low enough for the results to be grouped and handled locally. You can adjust these methods to handle other cases but it requires more work.
RDD API
Reshape data from wide to long:
import org.apache.spark.sql.functions._

val exploded = explode(array(
  (columnsToDrop ++ columnsToCode).map(c =>
    struct(lit(c).alias("k"), col(c).alias("v"))): _*
)).alias("level")
val long = df.select(exploded, $"TARGET")
aggregateByKey, reshape and collect:
import org.apache.spark.util.StatCounter

val lookup = long.as[((String, String), Int)].rdd
  // You can use prefix partitioner (one that depends only on _._1)
  // to avoid reshuffling for groupByKey
  .aggregateByKey(StatCounter())(_ merge _, _ merge _)
  .map { case ((c, v), s) => (c, (v, s)) }
  .groupByKey
  .mapValues(_.toMap)
  .collectAsMap
You can use lookup to get statistics for individual columns and levels. For example:
lookup("col1")("A")
org.apache.spark.util.StatCounter =
(count: 3, mean: 0.666667, stdev: 0.471405, max: 1.000000, min: 0.000000)
Gives you data for col1, level A. Based on the binary TARGET assumption this information is complete (you get count / fractions for both classes).
You can use lookup like this to generate SQL expressions or pass it to udf and apply it on individual columns.
DataFrame API
Convert the data to long format as for the RDD API.
Compute aggregates based on levels:
val stats = long
  .groupBy($"level.k", $"level.v")
  .agg(mean($"TARGET"), sum($"TARGET"))
Depending on your preferences you can reshape this to enable efficient joins or convert to a local collection and similarly to the RDD solution.
Using aggregateByKey
A simple explanation of aggregateByKey can be found here. Basically, you use two functions: one which works inside a partition and one which works across partitions.
You would need to do something like aggregating by the first column and internally building a data structure with a map for every element of the second column, in order to aggregate and collect data there (of course you could do two aggregateByKey passes if you want).
This will not solve the case of doing multiple passes over the data for each column you want to work with (you can use aggregate, as opposed to aggregateByKey, to work on all the data and put it in a map, but that will probably give you even worse performance). The result would then be one line per key; if you want to move back to the original records (as a window function does), you would actually need to either join this value with the original RDD or save all values internally and flatMap. A minimal sketch of the two-function structure follows.
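A minimal PySpark sketch of that seqOp/combOp pair (the column names and sample rows are made up for illustration): seq_func folds a value into the per-partition accumulator and comb_func merges accumulators across partitions.
rows = sc.parallelize([("col1", "A", 1), ("col1", "A", 0), ("col1", "d", 1), ("col2", "B", 0)])
pairs = rows.map(lambda r: ((r[0], r[1]), r[2]))  # key: (column, level), value: target
def seq_func(acc, target):   # acc = (sum_of_targets, row_count)
    return (acc[0] + target, acc[1] + 1)
def comb_func(a, b):         # merge two accumulators
    return (a[0] + b[0], a[1] + b[1])
stats = pairs.aggregateByKey((0, 0), seq_func, comb_func)
print(stats.collect())
# e.g. [(('col1', 'A'), (1, 2)), (('col1', 'd'), (1, 1)), (('col2', 'B'), (0, 1))]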
I do not believe this would provide you with any real performance improvement. You would be doing a lot of work to reimplement things that are done for you in SQL, and while doing so you would be losing most of the advantages of SQL (Catalyst optimization, Tungsten memory management, whole-stage code generation, etc.).
Improving the SQL
What I would do instead is attempt to improve the SQL itself.
For example, the result of the window function column appears to be the same for all rows in a partition. Do you really need a window function? You can do a groupBy instead of a window function (and if you really need this per record, you can join the results back; this might provide better performance, as it would not necessarily mean shuffling everything twice on every step). A rough sketch of this groupBy-plus-join pattern is shown below.
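A minimal PySpark sketch of that suggestion (column and output names are illustrative, not the asker's exact schema): compute the per-group aggregate once with groupBy, then join it back only if a per-record value is really needed.
from pyspark.sql import functions as F
df = spark.createDataFrame([(0, "A"), (1, "A"), (1, "d"), (0, "d")], ["TARGET", "col1"])
# Equivalent of mean("TARGET").over(Window.partitionBy("col1")), computed once per group
pre2 = df.groupBy("col1").agg(F.mean("TARGET").alias("pre2_col1"))
# Join back only if the value is needed on every record
result = df.join(pre2, on="col1", how="left")
result.show()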