I am trying to use precomputed distances with ELKI, but for some reason cannot get it to work. I have read the instructions here: http://elki.dbs.ifi.lmu.de/wiki/HowTo/PrecomputedDistances and this question on SO: ELKI - input distance matrix.
Unfortunately I am still unable to get ELKI working.
This is the command I am running in a bash shell:
java -jar elki.jar -verbose -dbc.filter FixedDBIDsFilter -dbc.startid 0 -dbc.in elki_dummy_ids -algorithm clustering.kmeans.KMeansLloyd -algorithm.distancefunction external.FileBasedDoubleDistanceFunction -distance.matrix elki_sample_dist_ut.txt -kmeans.k 3
And these are the contents of the files in the parameters:
$cat elki_dummy_ids
0
1
2
$cat elki_sample_dist_ut.txt
0 0 0.0000
0 1 0.8876
0 2 0.8571
1 1 0.0
1 2 0.9059
2 2 0.0
I tried with a lower-triangular distance matrix too:
$cat elki_sample_dist_lt.txt
0 0 0.0000
1 0 0.8876
1 1 0.0
2 0 0.8571
2 1 0.9059
2 2 0.0
but no luck with that either.
I keep getting this error (truncated; let me know if you need the full error message):
The following parameters were not processed:
[external.FileBasedDoubleDistanceFunction, -distance.matrix,
elki_sample_dist_ut.txt] Task is not completely configured:
Wrong value of parameter algorithm.distancefunction. Read:
de.lmu.ifi.dbs.elki.distance.distancefunction.external.FileBasedDoubleDistanceFunction.
Expected: Distance function to determine the distance between database
objects. Implementing
de.lmu.ifi.dbs.elki.distance.distancefunction.PrimitiveDistanceFunction
Known classes (default package
de.lmu.ifi.dbs.elki.distance.distancefunction):
I am using OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1) and ELKI 0.6.0.
Can someone please point out what I am missing here? Thanks in advance!
k-means cannot be used with precomputed distances, because it computes distances from points to centroids, which are not known beforehand and therefore cannot be precomputed.
Furthermore, k-means should only be used on numerical data with squared Euclidean distance; otherwise it may fail to converge. The mean minimizes the sum of squared deviations, not arbitrary distances.
You might be looking for PAM, k-medoids, DBSCAN, OPTICS, HAC, etc. These algorithms do work with other distances and only need pairwise distances.
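For example, here is a hedged sketch of running DBSCAN on the same precomputed matrix. It reuses the parameter names from your own command; -dbscan.epsilon and -dbscan.minpts are standard ELKI option names, but I have not re-verified this exact invocation against 0.6.0, and the epsilon value below is just a placeholder you would need to tune:
java -jar elki.jar -verbose \
  -dbc.filter FixedDBIDsFilter -dbc.startid 0 -dbc.in elki_dummy_ids \
  -algorithm clustering.DBSCAN \
  -algorithm.distancefunction external.FileBasedDoubleDistanceFunction \
  -distance.matrix elki_sample_dist_ut.txt \
  -dbscan.epsilon 0.9 -dbscan.minpts 2
DBSCAN only issues pairwise distance queries, so a file-based distance function can satisfy its parameter requirements; k-means, in contrast, requires a PrimitiveDistanceFunction it can evaluate on arbitrary centroid vectors, which is exactly what the error message complains about.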
I want to calculate L = laplacian(G) from a graph dataset. I imported the dataset, which contains two columns: FromNodeId and ToNodeId:
# Nodes: 3997962 Edges: 34681189
# FromNodeId ToNodeId
0 1
0 2
0 31
0 73
0 80
0 113619
0 2468556
0 2823829
0 2823833
0 2846857
0 2947898
0 3011654
0 3701688
0 3849377
0 4036524
0 4036525
0 4036527
0 4036529
0 4036531
0 4036533
0 4036534
0 4036536
0 4036537
1 2
1 3
1 4
1 5
1 6
1 7
1 8
1 9
1 10
1 11
To do so, I need to find G first, so I use G = graph(FromNodeId, ToNodeId). When I did that, I got this error:
>> G = graph(fromNodeId,toNodeId)
Error using matlab.internal.graph.MLGraph
Source must be a dense double array of node indices.
Error in matlab.internal.graph.constructFromEdgeList (line 125)
G = underlyingCtor(double(s), double(t), totalNodes);
Error in graph (line 264)
matlab.internal.graph.constructFromEdgeList(...
I don't know why! Can I get a solution for this? Thank you.
Turns out the problem lies in the fact that no zeros are allowed as node indices when using the graph function in this manner. (See: Target must be a dense double array of node indices. How to Solve?)
I downloaded the dataset and ran it successfully with the following code. Note that this code uses a system command and is not compatible with all operating systems, but it should be simple enough to rewrite for whatever operating system you use. It also assumes the .txt file is in the working directory.
% Removes first lines with comments in them; this system command was tested on Linux Ubuntu 14.04 and is probably not portable to Windows.
% If this system command doesn't work, manually remove the first four lines from the text file.
system('tail -n +5 com-lj.ungraph.txt > delimitedFile.txt');
% Read the newly created delimited file and add 1 to all nodes.
edges=dlmread('delimitedFile.txt')+1;
% Build the graph
G=graph(edges(:,1),edges(:,2));
Assuming you've built your arrays similarly to how I did it, adding 1 to FromNodeIdFull and ToNodeIdFull should resolve your problem. In other words, the following code snippet should solve it; if it doesn't, I advise you to rewrite based on the code presented above.
G=graph(FromNodeIdFull+1,ToNodeIdFull+1);
Leaving my old answer here, as deleting it may cause confusion for others reading both this answer and the comments to it. Note that the answer below did NOT resolve the issue.
Just putting the comments by myself and NKN into an answer:
The problem lies in the fact that the arrays are sparse but graph() seems to expect full arrays. The following should work:
FromNodeIdFull=full(double(FromNodeId));
ToNodeIdFull=full(double(ToNodeId));
G=graph(FromNodeIdFull,ToNodeIdFull);
Depending on whether your input arrays are already doubles or not you may be able to remove the double() from the first two lines.
Suppose a dataset comprises independent variables that are a mix of continuous and binary variables. Usually the label/outcome column is converted to a one-hot vector, whereas continuous variables can be normalized. But what needs to be applied to binary variables?
AGE RACE GENDER NEURO EMOT
15.95346 0 0 3 1
14.57084 1 1 0 0
15.8193 1 0 0 0
15.59754 0 1 0 0
How does this apply for logistic regression and neural networks?
If the range of the continuous variable is small, encode it in binary form and use each bit of that binary form as a predictor.
For example, the number 2 is 10 in binary.
Therefore
predictor_bit_0 = 0
predictor_bit_1 = 1
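A minimal MATLAB sketch of this encoding, using the NEURO column from the question's table as an example (variable names are illustrative; bitget requires non-negative integer values):
neuro = [3; 0; 0; 0];                        % NEURO column from the sample table
nbits = max(1, ceil(log2(max(neuro) + 1)));  % number of bits needed to cover the range
bits  = zeros(numel(neuro), nbits);
for b = 1:nbits
    bits(:, b) = bitget(neuro, b);           % bit b becomes its own predictor column
end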
Try it and see if it works. Just to warn you, this method is very subjective and may or may not yield good results for your data. I'll keep you posted if I find a better solution.
So I have two vectors:
>> [phase exp_phase]
ans =
0.2266 0
-0.0702 0
-0.0070 0
-0.0854 0
0.0888 0
3.1403 -3.1416
-2.9571 -3.1416
-0.1441 0
-0.2660 0
2.8749 -3.1416
0.0126 0
-2.9309 -3.1416
0.0064 0
phase is obtained by atan2(b,a). I want to figure out the phase difference. The problem is that I obviously want the difference between -3.00 and +3.00 to be roughly 0.28, but at the same time I want the difference between -2.72 and +3.00 to be the same.
It's probably trivial but I can't figure out a good way to do it :(
Say you have two angles, w1 = +3 and w2 = -3 (both in radians).
To find the smallest angular difference, do the following:
atan2(sin(w1-w2),cos(w1-w2))
ans =
-0.2832
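Applied elementwise to the vectors from the question (a one-line sketch, assuming phase and exp_phase as shown above):
d = atan2(sin(phase - exp_phase), cos(phase - exp_phase));
Because sin and cos are 2*pi-periodic, the wrap-around at +/-pi is handled automatically, and d always lands in (-pi, pi].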
If you are finding the phase with arctan, then I assume you already have two vectors in Cartesian coordinates, and you used atan2 to get the phase value of the polar coordinates. In that case you can find the phase difference in Cartesian coordinates directly. I might be messing up this formula (my trig is rusty, so google it), but it is something like acos((a . b)/(|a||b|)). If this is the correct formula, it gives the phase difference between two vectors; in any case, there is a formula for this calculation in Cartesian coordinates. You might be able to avoid calling atan2 twice (unless you also need the actual phases).
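For the record, that formula is the standard unsigned angle between two vectors; a minimal MATLAB sketch, assuming a and b are the Cartesian vectors (note it returns a value in [0, pi], so the sign of the difference is lost):
theta = acos(dot(a, b) / (norm(a) * norm(b)));  % angle between Cartesian vectors a and b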
I'm trying to make an ANN which could tell me if there is causality between my input and output data. The data is as follows:
My input are measured values of pesticides (19 total) in an area eg:
-1.031413662 -0.156086316 -1.079232918 -0.659174849 -0.734577317 -0.944137546 -0.596917991 -0.282641072 -0.023508282 3.405638835 -1.008434997 -0.102330305 -0.65961995 -0.687140701 -0.167400684 -0.4387984 -0.855708613 -0.775964435 1.283238514
And the output is the measured value of plant-something in the same area (55 total), e.g.:
0.00 0.00 0.00 13.56 0 13.56 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13.56 0 0 0 1.69 0 0 0 0 0 0 0 0 0 0 1.69 0 0 0 0 13.56 0 0 0 0 13.56 0 0 0 0 0 0
Values for input are in the range from -2.5 to 10, and for output from 0 to 100.
So the question I'm trying to answer is: to what extent does pesticide A affect plant-somethings?
What are good ways to model (represent) input/output neurons to be able to process the mentioned input/output data? And how to scale/convert input/output data to be useful for NN?
Is there a book/paper that I should look at?
First, a neural network cannot find the causality between output and input, but only the correlation (just like every other probabilistic method). Causality can only be derived logically from reasoning (and even then, it's not always clear; it all depends on your axioms).
Secondly, about how to design a neural network to model your data, here is a pretty simple rule that can be generally applied to make a first working draft:
set the number of input neurons = the number of input variables for one sample
set the number of output neurons = the number of output variables for one sample
then play with the number of hidden layers and the number of hidden neurons per hidden layer. In practice, you want to use the fewest hidden layers/neurons that model your data correctly, but enough that the function approximated by your neural network fits the data well (otherwise the output error will be huge compared to the real output dataset).
Why do you need just enough neurons but not too many? Because if you use a lot of hidden neurons, you are sure to overfit your data: you will make a perfect prediction on your training dataset, but not in the general case on unseen data. Theoretically, this is because a neural network is a function approximator, so it can approximate any function, but using too high-order a function will lead to overfitting. See PAC learning for more info on this.
So, in your precise case, the first thing to do is to clarify how many variables you have in input and in output for each sample. If it's 19 in input, create 19 input neurons, and if you have 55 output variables, create 55 output neurons.
About scaling and pre-processing: yes, you should normalize your data to the range 0 to 1 (or -1 to 1; it's up to you, and it depends on the activation function). A very good place to start is to watch the videos of the machine learning course by Andrew Ng on Coursera; this should get you kickstarted quickly and correctly (you'll be taught the tools to check that your neural network is working correctly, which is immensely important and useful).
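A minimal column-wise min-max rescaling sketch in MATLAB (assuming X is an n-samples-by-19 input matrix with nonzero range in each column; variable names are illustrative):
Xmin  = min(X, [], 1);                 % per-column minimum
Xmax  = max(X, [], 1);                 % per-column maximum
Xnorm = (X - Xmin) ./ (Xmax - Xmin);   % each column rescaled to [0, 1] (implicit expansion, R2016b+)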
Note: you should check your output variables. From the sample you gave, it seems they take discrete values: if so, you can use discrete output variables, which will be a lot more precise and predictive than using real, floating-point values (e.g., instead of having [0, 1.69, 13.56] as the possible output values, you'll have [0, 1, 2]; this is called "binning" or multi-class categorization). In practice, this means you have to change the way your network works, by using a classification neural network (with an output activation function such as sigmoid) instead of a regression neural network (with a linear or rectified-linear output).
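A hedged sketch of the binning step in MATLAB (the sample values below are taken from the question's output row; variable names are illustrative):
y = [0 1.69 13.56 0 13.56];        % raw output values
[classes, ~, labels] = unique(y);  % classes = [0 1.69 13.56]
labels = labels - 1;               % zero-based class labels: 0 1 2 0 2
Each label can then be one-hot encoded as the target of a classification network.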
I'm learning neural networks (started today) and have finished a 2x2x1 network (forward data feeding and backward error propagation) that can learn the AND operation for one set of inputs. It also dodges local minima using randomized parameters. My first source for this is: http://www.codeproject.com/Articles/14342/Designing-And-Implementing-A-Neural-Network-Librar
The problem is: it learns 0 AND 0 using inputs (0,0), but when I give it (0,1) it forgets 0 AND 0 and then learns 0 AND 1. Is this a common newbie bug?
What I tried:
loop for 10000 times
learn 0 and 0
end loop
loop for 10000 times
learn 0 and 1 (forgets 0 and 0)
end loop
loop for 10000 times
learn 1 and 0 (forgets 0 and 1)
end loop
loop for 10000 times
learn 1 and 1 (forgets 1 and 0)
end loop
only one set is learned
fail
Trial 2:
loop for 10000 times
learn 0 and 0
learn 0 and 1
learn 1 and 0
learn 1 and 1
end loop
gives the same result for all input combinations.
fail.
Activation function for each neuron: hyperbolic tangent
2x2 structure: all-pairs
2x1 structure: all-pairs
Randomized learning rate: yes (re-randomized per iteration), small enough to stay far from explosive divergence
Randomized bias per neuron: yes, between -0.5 and +0.5 (just at start)
Randomized weighting: yes, between -0.5 and +0.5 (just at start)
Edit: Bias and weight updates are done for all pairs in the hidden and output layers.
Edit: All neurons (hidden + output) use the same activation function.
Without specific code it is hard to say for sure, but I think the issue is that you are only giving it one case to learn at a time. You should give it a matrix of your different learning examples, with an expected result vector. Then, when you update your weights and biases, you find the values that minimize the error between your network's output for all cases and the expected output for all cases.
For an AND gate, your input would be (in MATLAB code, not sure what language you are using but that syntax is easy to understand):
input = [0, 0;
0, 1;
1, 0;
1, 1];
And your expected output would be:
output = [0;
0;
0;
1];
I think what you are doing now is basically finding the weights and biases that minimize the error between the network output and the expected output for just one input case, then re-training those weights and biases to minimize the error for the second case, then the third, then the fourth. If you put the cases in arrays like this, the training should minimize the overall error across all of them. This is just my best guess, though, without any code to go on.
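To make the idea concrete, here is a minimal, self-contained MATLAB sketch (my own illustration, not the asker's code) of a 2x2x1 tanh network trained in batch mode: the gradient is summed over all four cases before each weight update, so no single case overwrites the others:
input  = [0 0; 0 1; 1 0; 1 1];
target = [0; 0; 0; 1];
rng(0);                                        % fixed seed for reproducibility
W1 = rand(2,2) - 0.5;  b1 = rand(2,1) - 0.5;   % hidden layer (2 neurons)
W2 = rand(1,2) - 0.5;  b2 = rand(1,1) - 0.5;   % output layer (1 neuron)
lr = 0.1;
for epoch = 1:10000
    dW1 = zeros(size(W1)); db1 = zeros(size(b1));
    dW2 = zeros(size(W2)); db2 = zeros(size(b2));
    for i = 1:4                                % accumulate gradient over ALL cases
        x  = input(i,:)';
        h  = tanh(W1*x + b1);                  % forward pass
        y  = tanh(W2*h + b2);
        e  = y - target(i);                    % backward pass (squared-error loss)
        d2 = e * (1 - y^2);                    % tanh'(z) = 1 - tanh(z)^2
        d1 = (W2' * d2) .* (1 - h.^2);
        dW2 = dW2 + d2*h';  db2 = db2 + d2;
        dW1 = dW1 + d1*x';  db1 = db1 + d1;
    end
    W2 = W2 - lr*dW2;  b2 = b2 - lr*db2;       % one batch update per epoch
    W1 = W1 - lr*dW1;  b1 = b1 - lr*db1;
end
% Check: should print values rounding to 0 0 0 1 for most random initializations
% (the expansion of b1 across columns needs R2016b+; use bsxfun on older versions)
round(tanh(W2*tanh(W1*input' + b1) + b2))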