Say I have a dataset of 9 unique nodes that are connected in the following way:
start end
----------
1 2
2 3
4 5
6 7
7 8
7 1
4 9
9 5
The dataset represents a graph of nodes and the links between them. For instance, the links above form two clusters: one with 6 nodes and one with 3 nodes.
CLUSTER 1              CLUSTER 2

1 --- 2 --- 3          4 --- 5
|                       \___ |
|                           \|
7 --- 6                      9
|
|
8
I want an efficient algorithm that assigns each node to its cluster, like so:
node cluster
-------------
1 1
2 1
3 1
4 2
5 2
6 1
7 1
8 1
9 2
The problem is that I have a lot of these edges, and my current algorithm is pretty slow. Assuming that these datasets are represented as DataFrames in Spark, is there a more SQL-like way of achieving this besides stripping them down to RDDs and iterating over them like lists?
Spark's GraphX library comes with a connected components method: https://spark.apache.org/docs/latest/graphx-programming-guide.html#connected-components
However, I tried it in the past and found it a bit slow, so I ended up implementing connected components myself using the algorithm described here: http://mmds-data.org/presentations/2014/vassilvitskii_mmds14.pdf
It is very fast but takes some work to implement, probably beyond the scope of a Stack Overflow answer (and the code I wrote previously is proprietary). It is also possible that GraphX has improved its implementation in the years since.
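For reference, here is a minimal single-machine sketch of what "connected components" means for the sample edges in the question. It uses a plain union-find (disjoint-set) structure, which is not the distributed algorithm from the linked slides and not what GraphX does internally. The function name `connected_components` and the convention of numbering clusters by their smallest node are assumptions made for illustration.

```python
def connected_components(edges):
    """Label every node with a cluster number using union-find.

    Clusters are numbered 1, 2, ... in order of their smallest node
    (an illustrative convention, chosen to match the question's table).
    """
    parent = {}

    def find(x):
        # Register unseen nodes as their own root, then walk up to the root,
        # shortening the path along the way (path halving).
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            # Keep the smaller node as the root so the result is deterministic.
            if root_a < root_b:
                parent[root_b] = root_a
            else:
                parent[root_a] = root_b

    for a, b in edges:
        union(a, b)

    cluster_of_root = {}
    result = {}
    for node in sorted(parent):
        root = find(node)
        if root not in cluster_of_root:
            cluster_of_root[root] = len(cluster_of_root) + 1
        result[node] = cluster_of_root[root]
    return result


edges = [(1, 2), (2, 3), (4, 5), (6, 7), (7, 8), (7, 1), (4, 9), (9, 5)]
clusters = connected_components(edges)
for node, cluster in clusters.items():
    print(node, cluster)
```

This only works when the whole edge list fits in memory on one machine; for large DataFrames a distributed approach such as the ones mentioned above is still needed.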
I have a SQL table which has two columns called seq and sub_seq as seen below. I would like to add a third column called id, which goes up by 1 every time the sub_seq starts again at 1 as shown in the table below.
seq  sub_seq  id
-----------------
1    1        1
2    2        1
3    3        1
4    4        1
5    5        1
6    1        2
7    2        2
8    3        2
9    1        3
10   2        3
11   3        3
12   4        3
13   5        3
14   6        3
15   7        3
I could write a solution using plpgsql, however I would like to know if there is a way of doing this in standard SQL. Any help would be greatly appreciated.
If sub_seq always increases by 1 within a group, you can use the DENSE_RANK function ordered by the difference of the two columns. Within a group, seq - sub_seq stays constant, and it grows each time sub_seq restarts at 1.
SELECT seq, sub_seq, DENSE_RANK() OVER (ORDER BY seq - sub_seq) AS id
FROM tableDemo;
This solution is based on the sample data you have provided. More sample data would help to check whether it covers the whole scenario.
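To see why ordering by the difference works on the sample data, here is a small Python sketch that mimics the dense rank by hand. It is only an illustration of the idea, not a replacement for the SQL; the function name `dense_rank_ids` is made up for this example, and it assumes seq and sub_seq both increase by 1 from row to row within a group.

```python
# (seq, sub_seq) pairs from the question's table.
rows = [
    (1, 1), (2, 2), (3, 3), (4, 4), (5, 5),
    (6, 1), (7, 2), (8, 3),
    (9, 1), (10, 2), (11, 3), (12, 4), (13, 5), (14, 6), (15, 7),
]


def dense_rank_ids(rows):
    """Dense-rank each row by seq - sub_seq, like DENSE_RANK() OVER (ORDER BY ...)."""
    diffs = [seq - sub_seq for seq, sub_seq in rows]
    # Each distinct difference gets the next rank, in ascending order, with no gaps.
    rank_of = {d: rank for rank, d in enumerate(sorted(set(diffs)), start=1)}
    return [rank_of[d] for d in diffs]


ids = dense_rank_ids(rows)
for (seq, sub_seq), id_ in zip(rows, ids):
    print(seq, sub_seq, id_)
```

The differences are 0 for the first five rows, 5 for the next three, and 8 for the rest, so the ranks come out as 1, 2 and 3.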
For a project in my data mining class, I am to perform fuzzy c-means clustering on a data set where each data point has 3 axes (I googled and that's apparently the correct way to pluralize 'axis'). I'm not exactly sure how I would do so, especially given the clustering algorithm I'm using. Here's an example of the data set I'm using:
-         x  y  z
------------------
apple     2  5  5
banana    3  2  5
carrot    1  4  4
durian    6  7  1
eggplant  0  3  6
Any help or resources would be greatly appreciated!
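As a starting point, here is a rough sketch of the standard fuzzy c-means update loop in plain Python, run on the five sample points. Several things here are assumptions and not part of the question: the number of clusters (2), the fuzzifier m = 2, the deterministic initial memberships, and the function name `fuzzy_c_means`. Three axes need no special handling, because the algorithm only uses Euclidean distances between points and centers.

```python
import math


def fuzzy_c_means(points, c=2, m=2.0, max_iter=100, tol=1e-6):
    """Return (centers, memberships) for fuzzy c-means with c >= 2 clusters.

    memberships[i][j] is the degree to which point i belongs to cluster j.
    """
    n, dim = len(points), len(points[0])
    # Assumed deterministic start: point i leans toward cluster i % c.
    u = [[0.9 if j == i % c else 0.1 / (c - 1) for j in range(c)] for i in range(n)]
    centers = []
    for _ in range(max_iter):
        # Centers are means of the points, weighted by membership ** m.
        centers = []
        for j in range(c):
            weights = [u[i][j] ** m for i in range(n)]
            total = sum(weights)
            centers.append(
                [sum(weights[i] * points[i][d] for i in range(n)) / total for d in range(dim)]
            )
        # Memberships are proportional to distance ** (-2 / (m - 1)).
        new_u = []
        for p in points:
            dists = [math.dist(p, center) for center in centers]
            if any(d == 0 for d in dists):
                # The point sits exactly on a center: give it full membership there.
                row = [1.0 if d == 0 else 0.0 for d in dists]
            else:
                row = [d ** (-2.0 / (m - 1)) for d in dists]
            s = sum(row)
            new_u.append([v / s for v in row])
        change = max(abs(new_u[i][j] - u[i][j]) for i in range(n) for j in range(c))
        u = new_u
        if change < tol:
            break
    return centers, u


names = ["apple", "banana", "carrot", "durian", "eggplant"]
data = [(2, 5, 5), (3, 2, 5), (1, 4, 4), (6, 7, 1), (0, 3, 6)]
centers, memberships = fuzzy_c_means(data, c=2)
for name, row in zip(names, memberships):
    print(name, [round(v, 3) for v in row])
```

Each printed row gives one item's membership in every cluster, and the values in a row add up to 1. Unlike k-means, no item is assigned to a single cluster unless you pick the largest membership yourself.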
I need help in MATLAB: I need to find out how to cross over any two sequences for a genetic algorithm in FlowShop, e.g.
1st sequence = 1 5 4 7 3 2 9 8 10 6
2nd sequence = 7 8 9 10 5 4 2 1 3 6
After crossover, the offspring should be
offspring 1 = 1 5 4 7 3 2 8 9 10 6
offspring 2 = 7 8 9 10 1 5 4 3 2 6
Crossover should be such that each number doesn't repeat itself in the offspring sequence. Can anyone tell me how to do this?
There are a number of existing crossovers defined for permutation encodings. Among them the following would be useful for you:
Cyclic Crossover
Partially Matched Crossover
Uniform-like Crossover
Position-based Crossover
These crossovers aim to preserve the position of the job in the permutation. You can find implementations in C# in the PermutationEncoding plugin of HeuristicLab. Browse the source files and you can also find references to scientific articles that describe these crossovers.
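To give an idea of how one of these works, here is a short Python sketch of cycle (cyclic) crossover, in the variant that alternates the source parent from one cycle to the next. It is not taken from HeuristicLab, and the function name `cycle_crossover` is made up for this example. Every position in a child takes its value from one of the parents at that same position, so no job is ever repeated.

```python
def cycle_crossover(p1, p2):
    """Cycle crossover (CX) for two permutations of the same items.

    Positions are partitioned into cycles; children copy whole cycles,
    alternating which parent each cycle is taken from.
    """
    n = len(p1)
    pos_in_p1 = {gene: i for i, gene in enumerate(p1)}
    child1, child2 = [None] * n, [None] * n
    visited = [False] * n
    take_from_p1 = True
    for start in range(n):
        if visited[start]:
            continue
        i = start
        # Follow the cycle: from position i, jump to where p2[i] sits in p1.
        while not visited[i]:
            visited[i] = True
            if take_from_p1:
                child1[i], child2[i] = p1[i], p2[i]
            else:
                child1[i], child2[i] = p2[i], p1[i]
            i = pos_in_p1[p2[i]]
        take_from_p1 = not take_from_p1
    return child1, child2


parent1 = [1, 5, 4, 7, 3, 2, 9, 8, 10, 6]
parent2 = [7, 8, 9, 10, 5, 4, 2, 1, 3, 6]
offspring1, offspring2 = cycle_crossover(parent1, parent2)
print(offspring1)
print(offspring2)
```

The offspring produced this way differ from the ones listed in the question, because that example appears to use a different operator, but both are valid permutations of the jobs 1 to 10.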
I have a large dataset that I want to divide into two clusters using the kmeans algorithm in MATLAB. My problem is that the two clusters should contain the actual rows of the dataset, not just cluster indices. How can I do that in MATLAB?
For example:
1 2 3
4 5 6
6 3 5
1 1 2
....
In the output I should get this format:
cluster 1:
...
1 2 3
1 1 2
cluster 2:
4 5 6
6 3 5
idx = kmeans(dataset, k);
% dataset: the data to cluster (one observation per row)
% k: the number of clusters
Then, to see which rows were assigned to each cluster, index the dataset with idx:
cluster1data = dataset(idx == 1, :)
cluster2data = dataset(idx == 2, :)
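The same idea, sketched in Python for anyone who wants to see the steps spelled out: compute a cluster index per row, then select the rows whose index matches. The tiny k-means below is only an illustration. It seeds the centroids with the first k rows, which is an assumption made to keep the run deterministic and is not how MATLAB's kmeans chooses its starting points. The function name `kmeans_first_k` is made up.

```python
def kmeans_first_k(rows, k, max_iter=100):
    """Plain Lloyd's k-means; returns a 1-based cluster index for every row."""
    # Assumed initialization: the first k rows serve as the starting centroids.
    centroids = [[float(v) for v in row] for row in rows[:k]]
    idx = [0] * len(rows)
    for _ in range(max_iter):
        # Assign each row to the nearest centroid (squared Euclidean distance).
        new_idx = [
            min(range(k), key=lambda j: sum((a - b) ** 2 for a, b in zip(row, centroids[j]))) + 1
            for row in rows
        ]
        if new_idx == idx:
            break  # assignments stopped changing
        idx = new_idx
        # Move each centroid to the mean of the rows assigned to it.
        for j in range(k):
            members = [row for row, label in zip(rows, idx) if label == j + 1]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return idx


dataset = [[1, 2, 3], [4, 5, 6], [6, 3, 5], [1, 1, 2]]
idx = kmeans_first_k(dataset, 2)

# Equivalent of dataset(idx == 1, :) and dataset(idx == 2, :).
cluster1data = [row for row, label in zip(dataset, idx) if label == 1]
cluster2data = [row for row, label in zip(dataset, idx) if label == 2]
print("cluster 1:", cluster1data)
print("cluster 2:", cluster2data)
```

With a random initialization the cluster numbers may swap between runs, so do not rely on cluster 1 always being the same group.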
I am trying to generate all combinations of 2 elements from a given range of numbers. I am using the 'combnk' function as follows.
combnk(1:4,2)
ans =
3 4
2 4
2 3
1 4
1 3
1 2
combnk(1:6,2)
ans =
1 2
1 3
1 4
1 5
1 6
2 3
2 4
2 5
2 6
3 4
3 5
3 6
4 5
4 6
5 6
The order of combinations returned appears to change. I need to know the order in advance for my program to work properly.
Is there any solution to make sure I get the combinations in a consistent order?
Also, why is MATLAB showing this strange behavior?
The only solution I can think of so far is to check the first entry of the result matrix and, if needed, flip the matrix upside down using the 'flipud' function.
Update: With a little experimenting I noticed that the reverse order occurs only when the set of numbers has fewer than 6 elements. This is why combnk(1:6,2) produces the 'correct' order, whereas combnk(1:5,2) produces the results backwards. This is still a big problem.
You could try nchoosek instead of combnk. I don't have the MATLAB Statistics Toolbox (only Octave), so I don't know whether nchoosek has any significant disadvantages.
This will solve the ordering issue:
a=combnk(1:4,2);
[~,idx]=sortrows(a);
aNew=a(idx,:);
I don't know why MATLAB is showing this behavior.
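The point of sorting the rows is that the result no longer depends on the order in which the combinations were generated. A quick Python sketch of the same idea follows. The reversed list is only a stand-in for combnk's backwards output, and the variable names are made up for this example.

```python
from itertools import combinations

# Stand-in for a generator whose output order cannot be relied on:
# the 2-element combinations of 1..4, deliberately reversed.
unordered = list(combinations(range(1, 5), 2))[::-1]

# Sorting the rows restores one canonical (lexicographic) order,
# which is what sortrows does to the combnk result above.
ordered = sorted(unordered)

for pair in ordered:
    print(pair)
```

Whatever order the combinations arrive in, sorting them gives the same sequence, so downstream code can depend on it.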