Algorithm to determine clusters of linked nodes in Spark - scala

Say I have a dataset of 9 unique nodes that are connected in the following way:
start end
----------
1 2
2 3
4 5
6 7
7 8
7 1
4 9
9 5
The dataset represents a graph of nodes and the links between them. For instance, the links above form two clusters: one with 6 nodes and one with 3 nodes.
CLUSTER 1          CLUSTER 2
1 --- 2 --- 3      4 --- 5
|                   \___ |
|                       \|
7 --- 6                  9
|
|
8
I want an efficient algorithm that clusters the nodes together like so:
node cluster
-------------
1 1
2 1
3 1
4 2
5 2
6 1
7 1
8 1
9 2
The problem is that I have a lot of these edges, and my current algorithm is pretty slow. Assuming that these datasets are represented as DataFrames in Spark, is there a more SQL-like way of achieving this besides stripping them down to RDDs and iterating over them like lists?

Spark's GraphX library comes with a connected components method: https://spark.apache.org/docs/latest/graphx-programming-guide.html#connected-components
However, I tried it in the past and found it a bit slow, so I ended up implementing connected components myself using the algorithm described here: http://mmds-data.org/presentations/2014/vassilvitskii_mmds14.pdf
It is very fast but takes some work to implement, probably beyond the scope of a Stack Overflow answer (and the code I wrote previously is proprietary). It's also possible that GraphX has improved its implementation in the years since.
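For illustration only, here is a minimal single-machine sketch of what "connected components" computes, written in Python rather than Spark. It uses union-find, which is neither the GraphX implementation nor the algorithm from the linked slides. The function name connected_components and the rule of numbering clusters by their smallest node are assumptions, chosen so the output matches the table in the question.

```python
def connected_components(edges):
    """Map each node to a cluster id using union-find (disjoint sets)."""
    parent = {}

    def find(x):
        # Walk up to the root, halving the path as we go.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in edges:
        ra, rb = find(a), find(b)
        if ra != rb:
            # Keep the smaller node as root, so each root is its cluster's minimum.
            parent[max(ra, rb)] = min(ra, rb)

    # Assumption: number clusters 1, 2, ... in order of their smallest node.
    roots = sorted({find(n) for n in list(parent)})
    cluster_id = {root: i for i, root in enumerate(roots, start=1)}
    return {n: cluster_id[find(n)] for n in sorted(parent)}


edges = [(1, 2), (2, 3), (4, 5), (6, 7), (7, 8), (7, 1), (4, 9), (9, 5)]
clusters = connected_components(edges)
for node, cluster in clusters.items():
    print(node, cluster)
```

This runs on one machine, so it only helps when the edge list fits in memory. For larger data the distributed approaches mentioned above are the relevant ones.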

Related

How to add an iterative id column which goes up when a value in another column resets to 1 in Postgresql

I have a SQL table which has two columns called seq and sub_seq as seen below. I would like to add a third column called id, which goes up by 1 every time the sub_seq starts again at 1 as shown in the table below.
seq sub_seq id
--------------
1 1 1
2 2 1
3 3 1
4 4 1
5 5 1
6 1 2
7 2 2
8 3 2
9 1 3
10 2 3
11 3 3
12 4 3
13 5 3
14 6 3
15 7 3
I could write a solution using plpgsql, however I would like to know if there is a way of doing this in standard SQL. Any help would be greatly appreciated.
If sub_seq is always a running sequence that increases by 1 in step with seq, you can use the DENSE_RANK function ordered by the difference of the two columns. Within each run, seq - sub_seq stays constant, and it increases every time sub_seq resets to 1.
SELECT seq, sub_seq, DENSE_RANK() OVER (ORDER BY seq - sub_seq) AS id
FROM tableDemo;
This solution is based on the sample data you provided; more sample data would help to check whether it covers every case.
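To see why ordering by the difference works, here is a small Python check against the sample rows from the question. It is not SQL, and the helper name dense_rank_ids is hypothetical. It emulates the dense rank over seq - sub_seq and compares the result with a plain running count of how often sub_seq restarts at 1.

```python
def dense_rank_ids(rows):
    """Emulate DENSE_RANK() OVER (ORDER BY seq - sub_seq) for (seq, sub_seq) rows."""
    diffs = sorted({seq - sub_seq for seq, sub_seq in rows})
    rank = {d: i for i, d in enumerate(diffs, start=1)}
    return [(seq, sub_seq, rank[seq - sub_seq]) for seq, sub_seq in rows]


# Sample data from the question: seq runs 1..15, sub_seq restarts at 1 twice.
sub_seqs = [1, 2, 3, 4, 5, 1, 2, 3, 1, 2, 3, 4, 5, 6, 7]
rows = list(enumerate(sub_seqs, start=1))  # (seq, sub_seq) pairs
result = dense_rank_ids(rows)

# Cross-check: count how many times sub_seq has restarted at 1 so far.
resets = 0
expected_ids = []
for _, sub_seq in rows:
    if sub_seq == 1:
        resets += 1
    expected_ids.append(resets)

print([row_id for _, _, row_id in result])
print(expected_ids)
```

The two lists agree on this data. If seq had gaps, the difference would no longer be constant within a run, and the dense rank would split that run into several ids.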

How can I cluster 3-dimensional data?

For a project in my data mining class, I am to perform fuzzy c-means clustering on a data set where each data point has 3 axes (I googled and that's apparently the correct way to pluralize 'axis'). I'm not exactly sure how I would do so, especially given the clustering algorithm I'm using. Here's an example of the data set I'm using:
-        x y z
--------------
apple    2 5 5
banana   3 2 5
carrot   1 4 4
durian   6 7 1
eggplant 0 3 6
Any help or resources would be greatly appreciated!

Genetic Algorithm for Flow shop Scheduling

I need help in MATLAB: I need to find out how to cross over any two sequences for a genetic algorithm in flow shop scheduling, e.g.
1st sequence = 1 5 4 7 3 2 9 8 10 6
2nd sequence = 7 8 9 10 5 4 2 1 3 6
After crossover, the offspring should be
offspring 1 = 1 5 4 7 3 2 8 9 10 6
offspring 2 = 7 8 9 10 1 5 4 3 2 6
Crossover should be such that each number doesn't repeat itself in the offspring sequence. Can anyone tell me how to do this?
There are a number of existing crossovers defined for permutation encodings. Among them the following would be useful for you:
Cyclic Crossover
Partially Matched Crossover
Uniform-like Crossover
Position-based Crossover
These crossovers aim to preserve the position of the job in the permutation. You can find implementations in C# in the PermutationEncoding plugin of HeuristicLab. Browse the source files and you can also find references to scientific articles that describe these crossovers.
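As a rough illustration of the first operator in that list, here is a minimal Python sketch of cycle crossover. It is not the HeuristicLab implementation, and the function name cycle_crossover is an assumption. Each offspring position takes its job from one of the two parents at that same position, so no job is ever repeated.

```python
def cycle_crossover(parent1, parent2):
    """Cycle crossover (CX) for two permutations of the same items."""
    size = len(parent1)
    index_in_p1 = {job: i for i, job in enumerate(parent1)}
    child1, child2 = [None] * size, [None] * size
    take_from_p1 = True  # alternate the source parent for each cycle

    for start in range(size):
        if child1[start] is not None:
            continue  # position already filled by an earlier cycle
        i = start
        while child1[i] is None:
            if take_from_p1:
                child1[i], child2[i] = parent1[i], parent2[i]
            else:
                child1[i], child2[i] = parent2[i], parent1[i]
            # Follow the cycle: where does parent2's job at i sit in parent1?
            i = index_in_p1[parent2[i]]
        take_from_p1 = not take_from_p1
    return child1, child2


p1 = [1, 5, 4, 7, 3, 2, 9, 8, 10, 6]
p2 = [7, 8, 9, 10, 5, 4, 2, 1, 3, 6]
child1, child2 = cycle_crossover(p1, p2)
print(child1)
print(child2)
```

The offspring differ from the ones listed in the question, since the question does not say which operator produced them. Both are valid permutations of the jobs 1 to 10.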

Splitting of data in two clusters using kmeans matlab

I have a large dataset that I want to divide into two clusters using the k-means algorithm in MATLAB. My problem is that the output should list the actual rows of the dataset that belong to each cluster. How can I do that in MATLAB?
For example:
1 2 3
4 5 6
6 3 5
1 1 2
....
The output should be in this format:
cluster 1:
...
1 2 3
1 1 2
cluster 2:
4 5 6
6 3 5
idx = kmeans(dataset, k);
% dataset: the data matrix to cluster (one row per observation)
% k: the number of clusters
Then, to see which rows were assigned to each cluster, index the dataset with idx:
cluster1data = dataset(idx==1,:)
cluster2data = dataset(idx==2,:)
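The same idea, clustering first and then selecting rows by their cluster index, can be sketched in plain Python. This is a bare-bones k-means for illustration, not MATLAB's kmeans: the function below and its seeding with the first k rows are assumptions, made so that the run is deterministic.

```python
def kmeans(rows, k, max_iter=100):
    """Plain Lloyd's k-means; returns a 1-based cluster index per row."""
    centroids = [list(r) for r in rows[:k]]  # assumption: seed with the first k rows
    idx = None
    for _ in range(max_iter):
        # Assignment step: nearest centroid by squared Euclidean distance.
        new_idx = [
            min(range(k),
                key=lambda c: sum((x - m) ** 2 for x, m in zip(row, centroids[c]))) + 1
            for row in rows
        ]
        if new_idx == idx:
            break  # assignments stopped changing
        idx = new_idx
        # Update step: move each centroid to the mean of its rows.
        for c in range(k):
            members = [row for row, label in zip(rows, idx) if label == c + 1]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return idx


dataset = [[1, 2, 3], [4, 5, 6], [6, 3, 5], [1, 1, 2]]
idx = kmeans(dataset, 2)

# Same selection as dataset(idx==1,:) and dataset(idx==2,:) in MATLAB.
cluster1data = [row for row, label in zip(dataset, idx) if label == 1]
cluster2data = [row for row, label in zip(dataset, idx) if label == 2]
print(cluster1data)
print(cluster2data)
```

Which group ends up labelled 1 or 2 depends on the starting centroids, so the labels themselves carry no meaning beyond grouping the rows.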

Strange behaviour of MATLAB combnk function

I am trying to generate all combinations of 2 elements from a given range of numbers. I am using the 'combnk' function as follows.
combnk(1:4,2)
ans =
3 4
2 4
2 3
1 4
1 3
1 2
combnk(1:6,2)
ans =
1 2
1 3
1 4
1 5
1 6
2 3
2 4
2 5
2 6
3 4
3 5
3 6
4 5
4 6
5 6
The order of the returned combinations appears to change. I need to know the order in advance for my program to work properly.
Is there any way to make sure I get the combinations in a consistent order?
Also, why is MATLAB showing this strange behavior?
The only solution I can think of so far is to check the first entry of the result matrix and, if needed, flip the matrix upside down using the 'flipud' function.
Update: With a little experimenting I noticed that the reverse order occurs only when the length of the set of numbers is less than 6. This is why combnk(1:6,2) produces the 'correct' order, whereas combnk(1:5,2) produces the results backwards. This is still a big problem.
You could try nchoosek instead of combnk. I don't have the MATLAB Statistics Toolbox (only Octave), so I don't know if nchoosek has any significant disadvantages.
This will solve the ordering issue:
a=combnk(1:4,2);
[~,idx]=sortrows(a);
aNew=a(idx,:);
I don't know why MATLAB is showing this behavior.