Using a Geo Distance Function on ELKI - cluster-analysis

I am using ELKI to mine some geospatial data (lat, long pairs) and I am quite concerned about using the right data types and algorithms. In the parameterizer of my algorithm, I tried to change the default distance function to a geodetic one (LngLatDistanceFunction, as I am using x,y data), as below:
params.addParameter (DISTANCE_FUNCTION_ID, geo.LngLatDistanceFunction.class);
However, the results are quite surprising: it creates clusters of a single repeated point, such as the example below:
(2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN), (2.17199922, 41.38190043, NaN)]
This is an image of this example.
Whereas if I use a non-geo distance (for instance Manhattan):
params.addParameter (DISTANCE_FUNCTION_ID, geo.minkowski.ManhattanDistanceFunction.class);
the output is much more reasonable.
I wonder if there is something wrong with my code.
I am running the algorithm directly on the db, like this:
Clustering<Model> result = dbscan.run(db);
And then iterating over the results in a loop, while I construct the convex hulls:
for (de.lmu.ifi.dbs.elki.data.Cluster<?> cl : result.getAllClusters()) {
    if (!cl.isNoise()) {
        Coordinate[] ptList = new Coordinate[cl.size()];
        int ct = 0;
        for (DBIDIter iter = cl.getIDs().iter(); iter.valid(); iter.advance()) {
            ptList[ct] = dataMap.get(DBIDUtil.toString(iter));
            ++ct;
        }
        GeoPolygon poly = getBoundaryFromCoordinates(ptList);
        if (poly.getCoordinates().getGeometryType().equals("Polygon")) {
            out.write(poly.coordinates.toText() + "\n");
        }
    }
}
To map each ID to a point, I use a HashMap that I initialize when reading the database.
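Roughly, the map is built like this (a simplified sketch, not my exact code; rel stands for the Relation<NumberVector> obtained when loading the database, and Coordinate is the JTS class):
Map<String, Coordinate> dataMap = new HashMap<>();
for (DBIDIter it = rel.iterDBIDs(); it.valid(); it.advance()) {
    NumberVector v = rel.get(it);
    // dimension indexing may be 0- or 1-based depending on the ELKI version
    dataMap.put(DBIDUtil.toString(it), new Coordinate(v.doubleValue(0), v.doubleValue(1)));
}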
The reason I am adding this code is that I suspect I may be doing something wrong with the structures I pass to and read from the algorithm.
Thank you in advance for any comments that could help me solve this. I find ELKI a very efficient and sophisticated library, but I have trouble finding examples that illustrate simple cases like mine.

What is your epsilon value?
Geographic distance is in meters in ELKI (if I recall correctly); Manhattan distance would be in latitude + longitude degrees. For obvious reasons, these live on very different scales, and therefore you need to choose a different epsilon value.
In your previous questions, you used epsilon = 0.008. For geodetic distance, 0.008 meters = 8 millimeters.
At an epsilon of 8 millimeters, I am not surprised if the clusters you get consist only of duplicated coordinates. Any chance that the coordinates above exist multiple times in your data set?
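For illustration, a sketch of what a meter-scale parameterization could look like (the epsilon/minpts values and the DBSCAN parameter IDs here are only examples, not taken from your code):
params.addParameter(DBSCAN.Parameterizer.EPSILON_ID, 500.0); // about 500 meters, not 0.008
params.addParameter(DBSCAN.Parameterizer.MINPTS_ID, 20);
params.addParameter(DISTANCE_FUNCTION_ID, geo.LngLatDistanceFunction.class);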

Related

Problem understanding loss function behavior using Flux.jl in Julia

First of all, I am new to neural networks (NN).
As part of my PhD, I am trying to solve a problem through NN.
For this, I have created a program that generates a data set made of a collection of input vectors (each with 63 elements) and their corresponding output vectors (each with 6 elements).
So, my program looks like this:
Nₜᵣ = 25; # number of inputs in the data set
xtrain, ytrain = dataset_generator(Nₜᵣ); # generates In/Out vectors: xtrain/ytrain
datatrain = zip(xtrain,ytrain); # assemble my data
Now, both xtrain and ytrain are of type Array{Array{Float64,1},1}, meaning that if (say) Nₜᵣ = 2, they look like:
julia> xtrain #same for ytrain
2-element Array{Array{Float64,1},1}:
[1.0, -0.062, -0.015, -1.0, 0.076, 0.19, -0.74, 0.057, 0.275, ....]
[0.39, -1.0, 0.12, -0.048, 0.476, 0.05, -0.086, 0.85, 0.292, ....]
The first 3 elements of each vector are normalized to unity (they represent x,y,z coordinates), and the following 60 numbers are also normalized to unity and correspond to some measurable attributes.
The program continues like:
layer1 = Dense(length(xtrain[1]),46,tanh); # setting 6 layers
layer2 = Dense(46,36,tanh) ;
layer3 = Dense(36,26,tanh) ;
layer4 = Dense(26,16,tanh) ;
layer5 = Dense(16,6,tanh) ;
layer6 = Dense(6,length(ytrain[1])) ;
m = Chain(layer1,layer2,layer3,layer4,layer5,layer6); # composing the layers
squaredCost(ym,y) = (1/2)*norm(y - ym).^2;
loss(x,y) = squaredCost(m(x),y); # define loss function
ps = Flux.params(m); # initializing mod.param.
opt = ADAM(0.01, (0.9, 0.8)); #
and finally:
trainmode!(m,true)
itermax = 700; # set max number of iterations
losses = [];
for iter in 1:itermax
Flux.train!(loss,ps,datatrain,opt);
push!(losses, sum(loss.(xtrain,ytrain)));
end
It runs perfectly; however, it has come to my attention that as I train my model with an increasing data set (Nₜᵣ = 10, 15, 25, etc...), the loss function seems to increase. See the image below:
Where: y1: Nₜᵣ=10, y2: Nₜᵣ=15, y3: Nₜᵣ=25.
So, my main question:
Why is this happening? I cannot see an explanation for this behavior. Is this somehow expected?
Remarks: Note that
All elements from the training data set (input and output) are normalized to [-1,1].
I have not tried changing the activation functions
I have not tried changing the optimization method
Considerations: I need a training data set of nearly 10000 input vectors, so I am expecting an even worse scenario...
Some personal thoughts:
Am I arranging my training data set correctly? Say, if every single data vector is made of 63 numbers, is it correct to group them in an array and then pile them into an Array{Array{Float64,1},1}? I have no experience using NN and Flux. How could I build a data set of 10000 I/O vectors differently? Can this be the issue? (I am very inclined to this.)
Can this behavior be related to the chosen activation functions? (I am not inclined to this.)
Can this behavior be related to the optimization algorithm? (I am not inclined to this.)
Am I training my model wrong? Is the iteration loop really iterations, or are they epochs? I am struggling to put (differentiate) the concepts of "epochs" and "iterations" into practice.
loss(x,y) = squaredCost(m(x),y); # define loss function
Your losses aren't normalized, so adding more data can only increase this cost function. However, the cost per data point doesn't seem to be increasing. To get rid of this effect, you might want to use a normalized cost function, for instance the mean squared cost.
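A minimal sketch of that change, reusing the names from your code (the helper name meanSquaredCost is only illustrative):
using Statistics, LinearAlgebra
meanSquaredCost(ym,y) = norm(y - ym)^2 / length(y);  # average squared error per output element
loss(x,y) = meanSquaredCost(m(x),y);
# ...and track the mean over the training set instead of the sum:
push!(losses, mean(loss.(xtrain,ytrain)));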

MATLAB plotting in a parfor-loop

I need to plot figures with subplots inside a parfor-loop, similar to this question (which deals more with the quality of the plots).
My code looks something like this:
parfor idx=1:numel(A)
N = A(idx);
fig = figure();
ax = subplot(3,1,1);
plot(ax, ...);
...
saveas(fig,"...",'fig');
saveas(fig,"...",'png');
end
This gives a weird error:
Data must be numeric, datetime, duration or an array convertible to double.
I am sure that the problem does not lie in non-numeric data, as the same code without parallelization works.
At this point I expected an error, because threads will concurrently create and access figure and axes objects, and I do not think it is ensured that the handles always correspond to the right object (threads are "cross-plotting", so to speak).
If I pre-initialize the objects and then access them like this,
ax = cell(1,numel(A)); % or ax = zeros(1,numel(A));
ax(idx) = subplot(3,1,1);
I get even weirder errors somewhere in the fit-calls I use:
Error using curvefit.ensureLogical>iConvertSubscriptIndexToLogical (line 26)
Excluded indices must be nonnegative integers that reference the fit's input data points
Error in curvefit.ensureLogical (line 18)
exclude = iConvertSubscriptIndexToLogical(exclude, nPoints);
Error in cfit/plot (line 46)
outliers = curvefit.ensureLogical( outliers, numel( ydata ) );
I have the feeling it has to do with some sort of variable slicing described in the documentation; I just can't quite figure out how.
I was able to narrow the issue down to a fit routine I was using.
TL;DR: Do not use fit objects (cfit or sfit) for plots in a parfor-loop!
Solutions:
Use wrappers like nlinfit() or lsqcurvefit() instead of fit(). They give you the fit parameters directly, so you can call your fit function with them when plotting (a sketch follows after the code below).
If you have to use fit() (for some reason it was the only one able to fit my data more or less consistently), extract the fit parameters and then call your fit function using cell expansion:
fitfunc = @(a,b,c,d,e,x) ( ... );
[fitobject,gof,fitinfo] = fit(x,y,fitfunc,fitoptions(..));
vFitparam = coeffvalues(fitobject);
vFitparam_cell = num2cell(vFitparam);
plot(ax,x,fitfunc(vFitparam_cell{:},x), ... );
As far as I know, fit() requires the function handle to take separate parameters (not a vector), so by using a cell you can avoid bloated code like this:
plot(ax,x,fitfunc(vFitparam(1),vFitparam(2),vFitparam(3),vFitparam(4),vFitparam(5),x), ... );
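For the first option above, a minimal sketch of the lsqcurvefit() route (p0 and the vector-parameter wrapper are illustrative, not values from my data):
fitfunc_vec = @(p,x) fitfunc(p(1),p(2),p(3),p(4),p(5),x);  % wrap the scalar parameters into one vector
p0 = [1 1 1 1 1];                                          % initial guess (example only)
vFitparam = lsqcurvefit(fitfunc_vec, p0, x, y);            % parameters are returned directly
plot(ax, x, fitfunc_vec(vFitparam, x));                    % no fit object needed for plotting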

Convergence when using the scipy.odr module to find best-fit parameters when there are only horizontal error bars

I am trying to fit a piecewise (otherwise linear) function to a set of experimental data. The form of the data is such that there are only horizontal error bars and no vertical error bars. I am familiar with the scipy.optimize.curve_fit module, but that works when there are only vertical error bars corresponding to the dependent variable y. After searching for my specific need, I came across the following post, which explains the possibility of using the scipy.odr module when the error bars are those of the independent variable x. (Correct fitting with scipy curve_fit including errors in x?)
Attached is my version of the code, which tries to find the best-fit parameters using the ODR methodology. It actually draws the best-fit function and it seems to be working. However, after changing the initial (educated-guess) values and trying to extract the best-fit parameters, I get back the same guessed parameters I inserted initially. This means that the method is not converging, and you can verify this by printing output.stopreason and getting
['Numerical error detected']
So, my question is whether this methodology is consistent with my function being piecewise, and if not, whether there is another correct methodology to adopt in such cases.
from numpy import *
import matplotlib.pyplot as plt
from matplotlib.ticker import MaxNLocator
from scipy.odr import ODR, Model, Data, RealData
x_array=array([8.2,8.6,9.,9.4,9.8,10.2,10.6,11.,11.4,11.8])
x_err_array=array([0.2]*10)
y_array=array([-2.05179545,-1.64998354,-1.49136169,-0.94200805,-0.60205999,0.,0.,0.,0.,0.])
y_err_array=array([0]*10)
# Linear Fitting Model
def func(beta, x):
    return piecewise(x, [x < beta[0]], [lambda x: beta[1]*x - beta[1]*beta[0], lambda x: 0.0])
data = RealData(x_array, y_array, x_err_array, y_err_array)
model = Model(func)
odr = ODR(data, model, [10.1,1.02])
odr.set_job(fit_type=0)
output = odr.run()
f, (ax1) = plt.subplots(1, sharex=True, sharey=True, figsize=(10,10))
ax1.errorbar(x_array, y_array, xerr = x_err_array, yerr = y_err_array, ecolor = 'blue', elinewidth = 3, capsize = 3, linestyle = '')
ax1.plot(x_array, func(output.beta, x_array), 'blue', linestyle = 'dotted', label='Best-Fit')
ax1.legend(loc='lower right', ncol=1, fontsize=12)
ax1.set_xlim([7.95, 12.05])
ax1.set_ylim([-2.1, 0.1])
ax1.yaxis.set_major_locator(MaxNLocator(prune='upper'))
ax1.set_ylabel('$y$', fontsize=12)
ax1.set_xlabel('$x$', fontsize=12)
ax1.set_xscale("linear", nonposx='clip')
ax1.set_yscale("linear", nonposy='clip')
ax1.get_xaxis().tick_bottom()
ax1.get_yaxis().tick_left()
f.subplots_adjust(top=0.98,bottom=0.14,left=0.14,right=0.98)
plt.setp([a.get_xticklabels() for a in f.axes[:-1]], visible=True)
plt.show()
An error of 0 for y is causing problems. Make it small but not zero, e.g. 1e-16. Doing so, the fit converges. It also converges if you omit y_err_array when defining RealData, but I am not sure what happens internally in that case.
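In code, the fix is minimal (1e-16 is just an example of a tiny nonzero value):
y_err_array = array([1e-16]*10)   # tiny but nonzero instead of array([0]*10)
data = RealData(x_array, y_array, x_err_array, y_err_array)
# or simply omit the y errors:
# data = RealData(x_array, y_array, x_err_array)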

how to convert delaunay triangulation to .stl (stereolithography) format?

I have found several tools which convert isosurface-class or meshgrid data in MATLAB to an STL format. Examples include stlwrite and surf2stl. What I can't quite figure out is how to take a delaunayTriangulation object and either use it to create an STL file or convert it into an isosurface object.
The root problem is that I'm starting with an N-by-2 array of boundary points for irregular polygons, so I don't have any simple way to generate an xyz meshgrid. If there's a way to convert the boundary list into an isosurface of the interior region (constant Z-height is all I need), that would also solve my problem.
Otherwise, I need some way to convert the delaunayTriangulation object into something the referenced MATLAB FEX tools can handle.
Edit, to respond to Ander B's suggestion:
I verified that my triangulated set inside MATLAB is a 2-D sector of a circle. But when I feed the data to stlwrite and import it into Cura, I get a disaster: triangles at right angles or rotated pi from the desired orientation, or worse. Whether this is the fault of stlwrite, of Cura being sensitive to some unexpected value, or both, I can't tell. Here's what started out as a disc:
As an example, here's a set of points which define a sector of a circle. I can successfully create a delaunayTriangulation object from these data.
>> [fcx1',fcy1']
ans =
100.4563 26.9172
99.9712 28.6663
99.4557 30.4067
98.9099 32.1378
98.3339 33.8591
97.7280 35.5701
97.0924 37.2703
96.4271 38.9591
95.7325 40.6360
95.0087 42.3006
94.2560 43.9523
93.4746 45.5906
92.6647 47.2150
91.8265 48.8250
90.9604 50.4202
90.0666 52.0000
89.1454 53.5640
88.1970 55.1116
87.2217 56.6425
86.2199 58.1561
85.1918 59.6519
84.1378 61.1297
83.0581 62.5888
81.9531 64.0288
80.8232 65.4493
79.6686 66.8499
78.4898 68.2301
77.2871 69.5896
76.0608 70.9278
74.8113 72.2445
73.5391 73.5391
72.2445 74.8113
70.9278 76.0608
69.5896 77.2871
68.2301 78.4898
66.8499 79.6686
65.4493 80.8232
64.0288 81.9531
62.5888 83.0581
61.1297 84.1378
59.6519 85.1918
58.1561 86.2199
56.6425 87.2217
55.1116 88.1970
53.5640 89.1454
52.0000 90.0666
50.4202 90.9604
48.8250 91.8265
47.2150 92.6647
45.5906 93.4746
43.9523 94.2560
42.3006 95.0087
40.6360 95.7325
38.9591 96.4271
37.2703 97.0924
35.5701 97.7280
33.8591 98.3339
32.1378 98.9099
30.4067 99.4557
28.6663 99.9712
26.9172 100.4563
25.1599 100.9108
23.3949 101.3345
21.6228 101.7274
19.8441 102.0892
18.0594 102.4200
16.2692 102.7196
14.4740 102.9879
12.6744 103.2248
10.8710 103.4303
9.0642 103.6042
7.2547 103.7467
5.4429 103.8575
3.6295 103.9366
1.8151 103.9842
0 104.0000
-1.8151 103.9842
-3.6295 103.9366
-5.4429 103.8575
-7.2547 103.7467
-9.0642 103.6042
-10.8710 103.4303
-12.6744 103.2248
-14.4740 102.9879
-16.2692 102.7196
-18.0594 102.4200
-19.8441 102.0892
-21.6228 101.7274
-23.3949 101.3345
-25.1599 100.9108
-26.9172 100.4563
0 0
Building on Ander B's answer, here is the complete sequence. These steps ensure that even concave polygons are properly handled.
Start with two vectors containing all the x and the y coordinates. Then:
% build the constraint list
constr=[ (1:(numel(x)-1))' (2:numel(x))' ; numel(x) 1;];
foodel = delaunayTriangulation(x',y',constr);
% get logical indices of interior triangles
inout = isInterior(foodel);
% if desired, plot the triangles and the original points to verify.
% triplot(foodel.ConnectivityList(inout,:), foodel.Points(:,1), foodel.Points(:,2), 'r')
% hold on
% plot(foodel.Points(:,1), foodel.Points(:,2), 'g')
% now solidify
% need to create dummy 3rd column of points for a solid
point3 = [foodel.Points,ones(numel(foodel.Points(:,1)),1)];
% pick any negative 'elevation' to make the area into a solid
[solface,solvert] = surf2solid(foodel.ConnectivityList(inout,:),...
point3, 'Elevation', -10);
stlwrite('myfigure.stl',solface,solvert);
I've successfully turned some 'ugly' concave polygons into STLs that Cura is happy to turn into gCode.
STL is just a format for storing mesh information; if you have a mesh, you already have the data, and you only need to write it out in the right format.
It appears that you input the vertices and faces to the stlwrite function as
stlwrite(FILE, FACES, VERTICES);
and the delaunayTriangulation output gives you an object with easy access to this data: for an object DT, DT.Points contains the vertices and DT.ConnectivityList the faces.
You can read more about it in the documentation you linked.
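Assuming that stlwrite signature, a minimal sketch (adding a dummy z column, since a delaunayTriangulation built from 2-D points has only x and y coordinates):
DT = delaunayTriangulation(x', y');                    % x, y as in the question
verts3 = [DT.Points, zeros(size(DT.Points,1), 1)];     % constant-z third column for the STL
stlwrite('boundary.stl', DT.ConnectivityList, verts3);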

Shift a semi-log chart

There are two related things I would like to ask help with.
1) I'm trying to shift a "semi-log" chart (using semilogy) such that the new line passes through a given point on the chart, but still appears to be parallel to the original.
2) Shift the "line" exactly as in 1), but then also invert the slope.
I think that the desired results are best illustrated with an actual chart.
Given the following code:
x = [50 80];
y = [10 20];
all_x = 1:200;
P = polyfit(x, log10(y),1);
log_line = 10.^(polyval(P,all_x));
semilogy(all_x,log_line)
I obtain the following chart:
For 1), let's say I want to move the line such that it passes through point (20,10). The desired result would look something like the orange line below (please note that I added a blue dot at the (20,10) point only for reference):
For 2), I want to take the line from 1) and take an inverse of the slope, so that the final result looks like the orange line below:
Please let me know if any clarifications are needed.
EDIT: Based on Will's answer (below), the solution is as follows:
%// to shift to point (40, 10^1.5)
%// solution to 1)
log_line_offset = (10^1.5).^(log10(log_line)/log10(10^1.5) + 1-log10(log_line(40))/log10(10^1.5));
%// solution to 2)
log_line_offset_inverted = (10^1.5).^(1 + log10(log_line(40))/log10(10^1.5) - log10(log_line)/log10(10^1.5));
To do transformations described as linear operations on logarithmic axes, perform those linear transformations on the logarithm of the values and then reapply the exponentiation. So for 1):
log_line_offset = 10.^(log10(log_line) + 1-log10(log_line(20)));
And for 2):
log_line_offset_inverted = 10.^(2*log10(log_line_offset(20)) - log10(log_line_offset));
or:
log_line_offset_inverted = 10.^(1 + log10(log_line(20)) - log10(log_line));
These can then be plot with semilogy in the same way:
semilogy(all_x,log_line,all_x, log_line_offset, all_x,log_line_offset_inverted)
I can't guarantee that this is a sensible solution for the application for which you're creating these plots and their underlying data, though. It seems an odd way to describe the problem, so you might be better off creating these offsets further up the chain of calculation.
For example, log_line_offset can just as easily be calculated using your original code but for an x value of [20 50], but whether that is a meaningful way to treat the data may depend on what it's supposed to represent.
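For instance, a sketch of that alternative (refitting with the first x moved to 20 keeps the same spacing and y values, hence the same slope, and the line then passes through (20,10)):
x_shifted = [20 50];
y = [10 20];
P_shifted = polyfit(x_shifted, log10(y), 1);
log_line_offset = 10.^(polyval(P_shifted, all_x));
semilogy(all_x, log_line, all_x, log_line_offset)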