Improve PySpark implementation for finding connected components in a graph - pyspark

I am currently working on an implementation of this paper, which describes a MapReduce algorithm for finding connected components:
As a beginner in the Big Data world, I started by implementing the CCF-Iterate (with secondary sorting) algorithm on a small graph: 6 edges and 8 nodes. I'm running this code on the free version of Databricks.
It takes 1 minute to give a result, which seems too long for such a small example. How can I reduce this time? What kind of optimization is possible? Any advice would be really appreciated.
The idea is to test this algorithm on large graphs.
PySpark code:
graph = sc.parallelize([(2, 3), (1, 2), (2, 4), (3, 5), (6, 7), (7, 8)])
counter_new_pair = sc.accumulator(1)
while counter_new_pair.value > 0:
    counter_new_pair = sc.accumulator(0)
    # CCF-Iterate with sorting
    mapping_1 = graph.map(lambda x: (x[0], x[1]))
    mapping_2 = graph.map(lambda x: (x[1], x[0]))
    fusion = mapping_1.union(mapping_2)
    fusion = fusion.groupByKey().map(lambda x: (x[0], list(x[1])))
    fusion = fusion.map(lambda x: (x[0], sorted(x[1])))
    values = fusion.filter(lambda x: x[1][0] < x[0])
    key_min_value = values.map(lambda x: (x[0], x[1][0]))
    values = values.map(lambda x: (x[1][0], x[1][1:]))
    values = values.filter(lambda x: len(x[1]) != 0)
    values = values.flatMap(lambda x: [(val, x[0]) for val in x[1]])
    values.foreach(lambda x: counter_new_pair.add(1))
    joined = values.union(key_min_value)
    # CCF-Dedup
    mapping = joined.map(lambda x: ((x[0], x[1]), None))
    graph = mapping.groupByKey().map(lambda x: (x[0][0], x[0][1]))
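For checking results on small graphs, the same iterate-and-dedup loop can be sketched in plain Python without Spark. This is only a reference sketch under my own assumptions: the function names `ccf_iterate` and `connected_components` are made up for illustration, and a Python set stands in for the CCF-Dedup step. On a graph this small, most of the minute on a cluster is probably job scheduling overhead rather than computation, since every pass of the loop triggers at least one Spark job.

```python
from collections import defaultdict


def ccf_iterate(pairs):
    """One CCF-Iterate pass; returns (deduplicated pairs, number of new pairs)."""
    adjacency = defaultdict(set)
    for a, b in pairs:
        # Emit both directions, like mapping_1.union(mapping_2) in the Spark code.
        adjacency[a].add(b)
        adjacency[b].add(a)

    new_pairs = 0
    out = set()  # using a set plays the role of CCF-Dedup
    for key, neighbours in adjacency.items():
        smallest = min(neighbours)
        if smallest < key:
            out.add((key, smallest))
            for value in neighbours:
                if value != smallest:
                    out.add((value, smallest))
                    new_pairs += 1  # counted before deduplication
    return out, new_pairs


def connected_components(edges):
    """Repeat CCF-Iterate until no new pair is produced; returns node -> component id."""
    pairs = set(edges)
    while True:
        pairs, new_pairs = ccf_iterate(pairs)
        if new_pairs == 0:
            return dict(pairs)


components = connected_components([(2, 3), (1, 2), (2, 4), (3, 5), (6, 7), (7, 8)])
print(components)
```

The smallest node of each component acts as the component id and does not appear as a key in the result.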


Using the GPU with Lux and NeuralPDE Julia

I am trying to run a model on the GPU; it runs without problems on the CPU. I think using measured boundary conditions is somehow causing the issue, but I am not sure. I am following this example for using measured boundary conditions:
using Random
using NeuralPDE, Lux, CUDA, Random
using Optimization
using OptimizationOptimisers
using NNlib
import ModelingToolkit: Interval
using Interpolations
# Measured Boundary Conditions (Arbitrary For Example)
bc1 = 1.0:1:1001.0 .|> Float32
bc2 = 1.0:1:1001.0 .|> Float32
ic1 = zeros(101) .|> Float32
ic2 = zeros(101) .|> Float32;
# Interpolation Functions Registered as Symbolic
itp1 = interpolate(bc1, BSpline(Cubic(Line(OnGrid()))))
up_cond_1_f(t::Float32) = itp1(t)
@register_symbolic up_cond_1_f(t)
itp2 = interpolate(bc2, BSpline(Cubic(Line(OnGrid()))))
up_cond_2_f(t::Float32) = itp2(t)
@register_symbolic up_cond_2_f(t)
itp3 = interpolate(ic1, BSpline(Cubic(Line(OnGrid()))))
init_cond_1_f(x::Float32) = itp3(x)
@register_symbolic init_cond_1_f(x)
itp4 = interpolate(ic2, BSpline(Cubic(Line(OnGrid()))))
init_cond_2_f(x::Float32) = itp4(x)
@register_symbolic init_cond_2_f(x);
# Parameters and differentials
@parameters t, x
@variables u1(..), u2(..)
Dt = Differential(t)
Dx = Differential(x);
# Arbitrary Equations
eqs = [Dt(u1(t, x)) + Dx(u2(t, x)) ~ 0.,
Dt(u1(t, x)) * u1(t,x) + Dx(u2(t, x)) + 9.81 ~ 0.]
# Boundary Conditions with Measured Data
bcs = [
    u1(t,1) ~ up_cond_1_f(t),
    u2(t,1) ~ up_cond_2_f(t),
    u1(1,x) ~ init_cond_1_f(x),
    u2(1,x) ~ init_cond_2_f(x)
]
# Space and time domains
domains = [t ∈ Interval(1.0,1001.0),
x ∈ Interval(1.0,101.0)];
# Neural network
input_ = length(domains)
n = 10
chain = Chain(Dense(input_,n,NNlib.tanh_fast),Dense(n,n,NNlib.tanh_fast),Dense(n,4))
strategy = GridTraining(.25)
ps = Lux.setup(Random.default_rng(), chain)[1]
ps = ps |> Lux.ComponentArray |> gpu .|> Float32
discretization = PhysicsInformedNN(chain, strategy; init_params = ps)
# Model Setup
@named pdesystem = PDESystem(eqs,bcs,domains,[t,x],[u1(t, x),u2(t, x)])
prob = discretize(pdesystem,discretization);
sym_prob = symbolic_discretize(pdesystem,discretization);
# Losses and Callbacks
pde_inner_loss_functions = sym_prob.loss_functions.pde_loss_functions
bcs_inner_loss_functions = sym_prob.loss_functions.bc_loss_functions
callback = function (p, l)
    println("loss: ", l)
    println("pde_losses: ", map(l_ -> l_(p), pde_inner_loss_functions))
    println("bcs_losses: ", map(l_ -> l_(p), bcs_inner_loss_functions))
    return false
end
# Train Model (Throws Error)
res = Optimization.solve(prob,Adam(0.01); callback = callback, maxiters=5000)
phi = discretization.phi;
I get the following error:
GPU broadcast resulted in non-concrete element type Union{}.
This probably means that the function you are broadcasting contains an error or type instability.
Please Advise.

Implement Louvain in pyspark using dataframes

I'm trying to implement the Louvain algorithm in pyspark using dataframes. The problem is that my implementation is really slow. This is how I do it:
1. I collect all vertices and communityIds into simple Python lists.
2. For each vertex-communityId pair I calculate the modularity gain using dataframes (just a fancy formula involving sums and differences of edge weights).
3. Repeat until no change.
What am I doing wrong?
I suppose that if I could somehow parallelize the for each loop the performance would increase, but how can I do that?
I could use vertices.foreach(changeCommunityId) instead of the for each loop, but then I'd have to compute the modularity gain (that fancy formula) without dataframes.
See the code sample below:
def louvain(self):
    oldModularity = 0  # since initially each node represents a community
    graph = self.graph
    # retrieve graph vertices and edges dataframes
    vertices = verticesDf = self.graph.vertices
    aij = edgesDf = self.graph.edges
    canOptimize = True
    allCommunityIds = [row['communityId'] for row in vertices.select('communityId').distinct().collect()]
    verticesIdsCommunityIds = [(row['id'], row['communityId']) for row in vertices.select('id', 'communityId').collect()]
    allEdgesSum = self.graph.edges.groupBy().sum('weight').collect()
    m = allEdgesSum[0]['sum(weight)'] / 2
    def computeModularityGain(vertexId, newCommunityId):
        # the sum of all weights of the edges within C
        sourceNodesNewCommunity = vertices.join(aij, vertices.id == aij.src) \
            .select('weight', 'src', 'communityId') \
            .where(vertices.communityId == newCommunityId)
        destinationNodesNewCommunity = vertices.join(aij, vertices.id == aij.dst) \
            .select('weight', 'dst', 'communityId') \
            .where(vertices.communityId == newCommunityId)
        k_in = sourceNodesNewCommunity.join(destinationNodesNewCommunity, sourceNodesNewCommunity.communityId == destinationNodesNewCommunity.communityId)
        # the rest of the formula computation goes here, I just wanted to show you an example
        # just return some value for the modularity
        return 0.9
    def changeCommunityId(vertexId, currentCommunityId):
        maxModularityGain = 0
        maxModularityGainCommunityId = None
        for newCommunityId in allCommunityIds:
            if newCommunityId != currentCommunityId:
                modularityGain = computeModularityGain(vertexId, newCommunityId)
                if modularityGain > maxModularityGain:
                    maxModularityGain = modularityGain
                    maxModularityGainCommunityId = newCommunityId
        if maxModularityGain > 0:
            return maxModularityGainCommunityId
        return currentCommunityId
    while canOptimize:
        while self.changeInModularity:
            self.changeInModularity = False
            for vertexCommunityIdPair in verticesIdsCommunityIds:
                vertexId = vertexCommunityIdPair[0]
                currentCommunityId = vertexCommunityIdPair[1]
                newCommunityId = changeCommunityId(vertexId, currentCommunityId)
            self.changeInModularity = False
        canOptimize = False
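One way to avoid a separate dataframe query per vertex-community pair is to compute all candidate gains from a single pass over the edge list. The sketch below does this in plain Python, and the same aggregation could presumably be expressed as a groupBy on the edges dataframe. It rests on several assumptions of mine: the gain formula is the usual Louvain gain for moving an isolated vertex i into community C, k_i,in / m - Σtot * k_i / (2 * m^2), so it ignores the cost of leaving the current community and is exact only when each vertex is alone in its community (the initial state). The names `modularity_gain` and `candidate_gains` are hypothetical, and each undirected edge is assumed to be listed once with no self-loops.

```python
from collections import defaultdict


def modularity_gain(k_i_in, sigma_tot, k_i, m):
    """Gain from moving an isolated vertex i into community C.

    k_i_in:    total weight of edges between i and vertices of C
    sigma_tot: total weighted degree of the vertices in C (without i)
    k_i:       weighted degree of i
    m:         total weight of all edges in the graph
    """
    return k_i_in / m - sigma_tot * k_i / (2.0 * m * m)


def candidate_gains(edges, community):
    """Gains for every (vertex, neighbouring community) pair in one pass.

    edges:     (src, dst, weight) tuples, each undirected edge listed once
    community: dict mapping vertex -> community id
    """
    m = float(sum(w for _, _, w in edges))
    degree = defaultdict(float)  # k_i per vertex
    k_in = defaultdict(float)    # (vertex, community) -> weight from vertex into community
    for u, v, w in edges:
        degree[u] += w
        degree[v] += w
        k_in[(u, community[v])] += w
        k_in[(v, community[u])] += w

    sigma_tot = defaultdict(float)  # total degree per community
    for vertex, k in degree.items():
        sigma_tot[community[vertex]] += k

    gains = {}
    for (vertex, target), weight_into_target in k_in.items():
        if target != community[vertex]:
            gains[(vertex, target)] = modularity_gain(
                weight_into_target, sigma_tot[target], degree[vertex], m)
    return gains


# Toy example: a triangle where every vertex starts in its own community.
triangle = [("a", "b", 1.0), ("b", "c", 1.0), ("a", "c", 1.0)]
start = {"a": "a", "b": "b", "c": "c"}
gains = candidate_gains(triangle, start)
```

Only communities that a vertex actually touches get a gain entry, which also removes the loop over all community ids.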

SSP Algorithm minimal subset of length k

Suppose S is a set of t elements modulo n. There are 2^t subsets of S in total, of all lengths. I would like to see a PARI/GP program that finds the smallest subset U (in terms of length) of distinct elements such that the sum of all elements in U is 0 modulo n. It is easy to write a program that searches by brute force, but brute force becomes infeasible as t and n get larger, so I would appreciate help writing a program that doesn't use brute force to solve this instance of the subset sum problem.
Dynamic Approach:
def isSubsetSum(st, n, sm):
    # subset[i][j] will be True if there is a subset of
    # st[0..i-1] with sum equal to j.
    # Each row is built separately: [[...]] * (n + 1) would alias a single row.
    subset = [[False] * (sm + 1) for _ in range(n + 1)]
    # If sum is 0, then answer is true
    for i in range(0, n + 1):
        subset[i][0] = True
    # If sum is not 0 and set is empty,
    # then answer is false
    for i in range(1, sm + 1):
        subset[0][i] = False
    # Fill the subset table in bottom-up manner
    for i in range(1, n + 1):
        for j in range(1, sm + 1):
            if j < st[i-1]:
                subset[i][j] = subset[i-1][j]
            if j >= st[i-1]:
                subset[i][j] = subset[i-1][j] or subset[i-1][j-st[i-1]]
    # Uncomment this code to print the table
    # for i in range(0, n + 1):
    #     for j in range(0, sm + 1):
    #         print(subset[i][j], end=" ")
    #     print()
    return subset[n][sm]
I got this code from here; I don't know whether it works.
function getSummingItems(a,t){
  return a.reduce((h,n) => Object.keys(h)
                                 .reduceRight((m,k) => +k+n <= t ? (m[+k+n] = m[+k+n] ? m[+k+n].concat(m[k].map(sa => sa.concat(n)))
                                                                                        : m[k].map(sa => sa.concat(n)),m)
                                                                   : m, h), {0:[[]]})[t];
}
var arr = Array(20).fill().map((_,i) => i+1), // [1,2,..,20]
    tgt = 42,
    res = [];
res = getSummingItems(arr,tgt);
console.log("found",res.length,"subsequences summing to",tgt);
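For the modular version of the question, a dynamic-programming table indexed by residue rather than by sum may be enough. The sketch below is in Python rather than PARI/GP and is only an illustration: the function name `smallest_zero_sum_subset` is made up, and "smallest subset" is taken to mean the smallest non-empty one, since the empty set trivially sums to 0. For each residue r it keeps one shortest subset found so far whose sum is r modulo n, so there are at most n table entries and roughly t * n update steps instead of 2^t subsets.

```python
def smallest_zero_sum_subset(elements, n):
    """Return a smallest non-empty sub-list of `elements` summing to 0 mod n, or None."""
    best = {}  # residue -> shortest subset (as a list) seen so far with that sum mod n
    for value in elements:
        # Subsets that use `value`: the singleton, plus every earlier subset extended by it.
        candidates = {value % n: [value]}
        for residue, subset in best.items():
            new_residue = (residue + value) % n
            extended = subset + [value]
            if (new_residue not in candidates
                    or len(extended) < len(candidates[new_residue])):
                candidates[new_residue] = extended
        # Merge after the scan so that `value` is used at most once per subset.
        for residue, subset in candidates.items():
            if residue not in best or len(subset) < len(best[residue]):
                best[residue] = subset
    return best.get(0)


result = smallest_zero_sum_subset([3, 5, 8, 11], 7)
print(result)
```

Storing whole subsets keeps the sketch short. Storing only the length and a back-pointer per residue would use less memory for large t and n.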

pyspark randomForest feature importance: how to get column names from the column numbers

I am using the standard (string indexer + one hot encoder + randomForest) pipeline in spark, as shown below
labelIndexer = StringIndexer(inputCol = class_label_name, outputCol="indexedLabel").fit(data)
string_feature_indexers = [
    StringIndexer(inputCol=x, outputCol="int_{0}".format(x)).fit(data)
    for x in char_col_toUse_names
]
onehot_encoder = [
    OneHotEncoder(inputCol="int_"+x, outputCol="onehot_{0}".format(x))
    for x in char_col_toUse_names
]
all_columns = num_col_toUse_names + bool_col_toUse_names + ["onehot_"+x for x in char_col_toUse_names]
assembler = VectorAssembler(inputCols=[col for col in all_columns], outputCol="features")
rf = RandomForestClassifier(labelCol="indexedLabel", featuresCol="features", numTrees=100)
labelConverter = IndexToString(inputCol="prediction", outputCol="predictedLabel", labels=labelIndexer.labels)
pipeline = Pipeline(stages=[labelIndexer] + string_feature_indexers + onehot_encoder + [assembler, rf, labelConverter])
crossval = CrossValidator(estimator=pipeline,
cvModel =
Now, after the fit, I can get the random forest and the feature importance using cvModel.bestModel.stages[-2].featureImportances, but this does not give me feature/column names, just the feature numbers.
What I get is below:
How can I map it back to some column names or column name + value format?
Basically to get the feature importance of random forest along with the column names.
The transformed dataset metadata has the required attributes. Here is an easy way to do it:
Create a pandas DataFrame (generally the feature list will not be huge, so there are no memory issues in storing a pandas DF):
pandasDF = pd.DataFrame(dataset.schema["features"].metadata["ml_attr"]
Then create a broadcast dictionary to map. broadcast is necessary in a distributed environment.
feature_dict = dict(zip(pandasDF["idx"],pandasDF["name"]))
feature_dict_broad = sc.broadcast(feature_dict)
You can also look here and here
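A minimal sketch of that index-to-name mapping in plain Python, assuming the `ml_attr` metadata is a dict whose `attrs` entry groups attributes by type (for example `numeric` and `binary`), each group being a list of records with `idx` and `name` keys. The helper name `importances_by_name` and the sample metadata and scores are made up for illustration. With a fitted model one would presumably pass `dataset.schema["features"].metadata["ml_attr"]` and `featureImportances.toArray()` instead.

```python
def importances_by_name(ml_attr, importances):
    """Pair each feature importance with its column name, highest first."""
    index_to_name = {}
    for attr_group in ml_attr["attrs"].values():  # e.g. "numeric", "binary"
        for attr in attr_group:
            index_to_name[attr["idx"]] = attr["name"]
    # Fall back to the raw index if a feature has no metadata entry.
    pairs = [(index_to_name.get(i, str(i)), score)
             for i, score in enumerate(importances)]
    return sorted(pairs, key=lambda pair: pair[1], reverse=True)


# Hypothetical metadata shaped like the "ml_attr" entry of an assembled features column.
sample_ml_attr = {
    "attrs": {
        "numeric": [{"idx": 0, "name": "age"}],
        "binary": [
            {"idx": 1, "name": "onehot_color_red"},
            {"idx": 2, "name": "onehot_color_blue"},
        ],
    },
    "num_attrs": 3,
}
ranked = importances_by_name(sample_ml_attr, [0.3, 0.6, 0.1])
for name, score in ranked:
    print(name, score)
```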
Why not just map it back to the original columns with a list comprehension? Here is an example:
# in your case: trainingData.columns
data_frame_columns = ["A", "B", "C", "D", "E", "F"]
# in your case: print(cvModel.bestModel.stages[-2].featureImportances)
feature_importance = (1, [1, 3, 5], [0.5, 0.5, 0.5])
rf_output = [(data_frame_columns[i], feature_importance[2][j]) for i, j in zip(feature_importance[1], range(len(feature_importance[2])))]
print(dict(rf_output))
# {'B': 0.5, 'D': 0.5, 'F': 0.5}
I was not able to find any way to get the true initial list of the columns back after the ml algorithm, I am using this as the current workaround.
for x in cols_now:
FEATURE_COLS+=[list(x)[0] for x in temp_list]
I have kept a consistent suffix naming across all the indexer (_tmp) & encoder (_catVar) like:
column_vec_in = str_col
column_vec_out = [col+"_catVar" for col in str_col]
indexers = [StringIndexer(inputCol=x, outputCol=x+'_tmp')
for x in column_vec_in ]
encoders = [OneHotEncoder(dropLast=False, inputCol=x+"_tmp", outputCol=y)
for x,y in zip(column_vec_in, column_vec_out)]
tmp = [[i,j] for i,j in zip(indexers, encoders)]
tmp = [i for sublist in tmp for i in sublist]
This can be further improved and generalized, but for now this tedious workaround works best.

Exclude items from training set data

I have my data in two tables, colors and excluded_colors.
colors contains all colors.
excluded_colors contains some colors that I wish to exclude from my training set.
I am trying to split the data into a training set and a testing set, ensuring that the colors in excluded_colors are not in my training set but do exist in the testing set.
In order to achieve the above, I did this
var colors = spark.sql("""
   select colors.*
   from colors
   LEFT JOIN excluded_colors
   ON excluded_colors.color_id = colors.color_id
   where excluded_colors.color_id IS NULL
""")
val trainer: (Int => Int) = (arg:Int) => 0
val sqlTrainer = udf(trainer)
val tester: (Int => Int) = (arg:Int) => 1
val sqlTester = udf(tester)
val splits = colors.randomSplit(Array(0.7, 0.3))
val train_colors = splits(0).select("color_id").withColumn("test",sqlTrainer(col("color_id")))
val test_colors = splits(1).select("color_id").withColumn("test",sqlTester(col("color_id")))
However, I'm realizing that by doing the above the colors in excluded_colors are completely ignored. They are not even in my testing set.
How can I split the data 70/30 while also ensuring that the colors in excluded_colors are not in training but are present in testing?
What we want to do is remove the "excluded colors" from the training set but have them in the testing and have a training/test split of 70/30.
What we need is a bit of math.
Given the total dataset (TD) and the excluded colors dataset (E), the train dataset (Tr) and test dataset (Ts) satisfy:
|Tr| = x * (|TD| - |E|)
|Ts| = |E| + (1 - x) * (|TD| - |E|)
We also want |Tr| = 0.7 * |TD|.
Hence x = 0.7 * |TD| / (|TD| - |E|)
Now that we know the sampling factor x, we can say:
Tr = (TD-E).sample(withReplacement = false, fraction = x)
// where (TD - E) is the result of the SQL expr above
Ts = TD.sample(withReplacement = false, fraction = 0.3)
// we sample the test set from the original dataset
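As a quick numeric check of the sampling factor, here is a small sketch in Python (not Scala) with made-up dataset sizes. The helper name `train_fraction` is hypothetical. Note that x only makes sense as a sampling fraction when it is at most 1, which requires the excluded set to hold no more than 30% of the data for a 70/30 split.

```python
def train_fraction(total, excluded, train_share=0.7):
    """Sampling fraction x to apply to (TD - E) so that |Tr| = train_share * |TD|."""
    eligible = total - excluded  # |TD| - |E|
    if eligible <= 0:
        raise ValueError("no rows left to train on after exclusion")
    x = train_share * total / eligible
    if x > 1:
        raise ValueError("excluded set is too large for the requested train share")
    return x


# Made-up sizes: 1000 rows in total, 100 of them with excluded colors.
x = train_fraction(1000, 100)
expected_train_rows = x * (1000 - 100)
print(x, expected_train_rows)
```

Because `sample` draws rows at random, the actual training set size will only be approximately this value.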