I have a dataset with discontinuous displacements across a few surfaces (so far I am working in 2D, so these would be lines, 1D).
It is loaded through a PVDReader (mentioned just in case it makes any difference to the question, which I doubt).
Is there any way to programmatically create a new Source with the displacement jumps along those lines?
Note that this new field would probably have to be defined on a domain of lower dimension, as noted above.
So far, I have created PlotOverLine filters on lines slightly below and slightly above the prescribed line.
But I do not know how to subtract the two and place the result in a single field over the line.
Notes:
So far I am working in 2D (the domain where the discontinuity is defined is 1D). I also mean to handle discontinuities in fields on 3D domains (where the domain of discontinuity is 2D).
As a simple example (see below) I take the x axis (y = 0) as the domain of discontinuity. I actually mean to have an arbitrary line (in 2D) or plane (in 3D).
I would provide an extra reward if the domain of discontinuity can be an arbitrary curve (in 2D) or surface (in 3D).
Mathematical description:
Discontinuous field: u(x,y)
Domain of discontinuity: x axis, y=0
Value of the field on one side of the domain of discontinuity: u(x, 0+)
Value of the field on the other side of the domain of discontinuity: u(x, 0-)
Jump in the field: d(x) = u(x, 0+) - u(x, 0-)
The field u is defined on a 2D domain; the field d is defined on a 1D domain (the x axis). For instance, u(x, y) = x sign(y) has the jump d(x) = 2x across y = 0.
Resample With Dataset should do the trick, as far as I understand:
First generate a line using the Line source.
Open your 2D dataset containing the discontinuous u field.
Apply Resample With Dataset to your dataset, using the line source as the Source input.
You now have a line that carries the field defined on your dataset.
If you need to do some computation on it, you can then use a Calculator or a Programmable Filter.
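For reference, here is a minimal pvpython sketch of that pipeline (the file name, line endpoints, and offset are placeholders; note that in recent ParaView versions the two connections of Resample With Dataset are named SourceDataArrays/DestinationMesh rather than Input/Source):

from paraview.simple import *

# Load the dataset with the discontinuous field (placeholder file name).
reader = PVDReader(FileName='results.pvd')

# A probe line slightly above the discontinuity at y = 0.
eps = 1.0e-6
line = Line(Point1=[0.0, eps, 0.0], Point2=[1.0, eps, 0.0], Resolution=100)

# Sample the dataset at the points of the line.
resampled = ResampleWithDataset(Input=reader, Source=line)
Show(resampled)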
If I understood correctly, you want to represent the discontinuity in u(x, y), a field on a 2D domain, as a line, and associate to it the jump d(x) = u(x, 0+) - u(x, 0-) as a field.
Theoretically speaking, it should be possible to use a Programmable Filter (or to develop a dedicated ParaView filter) that takes advantage of your PlotOverLine filters, using the lines slightly above and below as its inputs and the line of discontinuity as its output (or creating that line as output).
The inputs are exposed in the Programmable Filter script through the inputs array. The question of their order can be addressed with a special field, 'LineType' for example, indicating which line is above and which is below.
This way, maybe you could explicitly compute the fields u(x, 0+) and u(x, 0-), then the jump field d(x), and finally associate it to the output line.
I'm well aware this is more a suggestion of how it could be done than a real answer, but I wanted to share ideas.
I put together a Programmable Filter.
I do not know if this is the most efficient way to do it, but it works.
programmableFilter1 = ProgrammableFilter(Input=[plotOverLineBot, plotOverLineTop])
programmableFilter1.Script = """
import vtk.util.numpy_support as ns
import numpy as np

# The two inputs arrive in the order they were passed to the filter.
input_bot = self.GetInputDataObject(0, 0)
input_top = self.GetInputDataObject(0, 1)
output = self.GetPolyDataOutput()

# Copy the geometry of one input so the jump field has a line to live on.
output.ShallowCopy(input_bot)

npts_bot = input_bot.GetNumberOfPoints()
npts_top = input_top.GetNumberOfPoints()
if npts_bot == npts_top:
    print("Number of points: " + str(npts_bot))
    u_bot = input_bot.GetPointData().GetArray("displacement")
    u_top = input_top.GetPointData().GetArray("displacement")
    u_bot_np = ns.vtk_to_numpy(u_bot)
    u_top_np = ns.vtk_to_numpy(u_top)
    # The jump across the discontinuity: d = u(top) - u(bot).
    u_jump_np = np.subtract(u_top_np, u_bot_np)
    u_jump = ns.numpy_to_vtk(u_jump_np, deep=1)  # deep copy so VTK owns the data
    u_jump.SetName("displacement jump")
    output.GetPointData().AddArray(u_jump)
else:
    print("Point counts of the two lines differ; no jump computed.")
"""
programmableFilter1.RequestInformationScript = ''
programmableFilter1.RequestUpdateExtentScript = ''
programmableFilter1.PythonPath = ''
RenameSource("Programmable Filter - Jumps", programmableFilter1)
I have an adjacency matrix adj and a cell array nodeNames that contains the names to be given to the graph G constructed from adj.
So I use G = digraph(adj, nodeNames); and I get the following graph:
Now, I want to find the strongly connected components in G and do a graph condensation, so I use the following:
C = condensation(G);
p2 = plot(C);
and get this result:
So I have 6 strongly connected components, but my problem is that I lost the node names. I want to get something like:
Is there any way to keep the node names in the result of the condensation?
I think the official documentation can take you to the right place:
Output Arguments
C - Condensation Graph
Condensation graph, returned as a digraph object. C is a directed
acyclic graph (DAG), and is topologically sorted. The node numbers in
C correspond to the bin numbers returned by conncomp.
Let's take a look at conncomp:
conncomp(G) returns the connected components of graph G as bins. The
bin numbers indicate which component each node in the graph belongs to
Look at the examples... I think that if you use conncomp on your graph before calling condensation, you will be able to rebuild the node names on your new graph with a little effort.
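For example, here is a minimal sketch using the question's variable names (G, nodeNames). Since the node numbers in C correspond to the bin numbers from conncomp, you can group the original names by bin and attach the joined names as node labels:

bins = conncomp(G);              % bins(i) = strongly connected component of node i
C = condensation(G);
labels = cell(max(bins), 1);
for k = 1:max(bins)
    % join the names of the original nodes that fell into component k
    labels{k} = strjoin(nodeNames(bins == k), ',');
end
p2 = plot(C, 'NodeLabel', labels);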
So I have a sort of strange problem. I have a dataset with 240 points and I'm trying to use k-means to cluster it into 100 clusters. I'm using Matlab but I don't have access to the statistics toolbox, so I had to write my own k-means function. It's pretty simple, so that shouldn't be too hard, right? Well, it seems something is wrong with my code:
function result = Kmeans(X, c)
[N, n] = size(X);
index = randperm(N);
ctrs = X(index(1:c), :);
old_label = zeros(1, N);
label = ones(1, N);
iter = 0;
while ~isequal(old_label, label)
    old_label = label;
    label = assign_labels(X, ctrs);
    for i = 1:c
        ctrs(i, :) = mean(X(label == i, :));
        if sum(isnan(ctrs(i, :))) ~= 0
            ctrs(i, :) = zeros(1, n);
        end
    end
    iter = iter + 1;
end
result = ctrs;

function label = assign_labels(X, ctrs)
[N, ~] = size(X);
[c, ~] = size(ctrs);
dist = zeros(N, c);
for i = 1:c
    dist(:, i) = sum((X - repmat(ctrs(i, :), [N, 1])).^2, 2);
end
[~, label] = min(dist, [], 2);
It seems what happens is that when I go to recompute the centroids, some centroids have no data points assigned to them, so I'm not really sure what to do with that. After doing some research on this, I found that this can happen if you supply arbitrary initial centroids, but in this case the initial centroids are taken from the data points themselves, so this doesn't really make sense.

I've tried re-assigning these centroids to random data points, but that causes the code to not converge (or at least, after letting it run all night, it never converged). Basically they get re-assigned, but that causes other centroids to get marginalized, and repeat.

I'm not really sure what's wrong with my code, but I ran this same dataset through R's k-means function for k=100 for 1000 iterations and it managed to converge. Does anyone know what I'm messing up here? Thank you.
Let's step through your code one piece at a time and discuss what you're doing with respect to what I know about the k-means algorithm.
function result = Kmeans(X, c)
[N, n] = size(X);
index = randperm(N);
ctrs = X(index(1:c), :);
old_label = zeros(1, N);
label = ones(1, N);
This looks like a function that takes in a data matrix of size N x n, where N is the number of points in your dataset and n is the dimension of a point. The function also takes in c, the desired number of output clusters. index is a random permutation of the integers 1 to N, and the first c entries of that permutation select the c data points used to initialize your cluster centres.
iter = 0;
while ~isequal(old_label, label)
    old_label = label;
    label = assign_labels(X, ctrs);
    for i = 1:c
        ctrs(i, :) = mean(X(label == i, :));
        if sum(isnan(ctrs(i, :))) ~= 0
            ctrs(i, :) = zeros(1, n);
        end
    end
    iter = iter + 1;
end
result = ctrs;
For k-means, we keep iterating until the cluster membership of each point in the previous iteration matches the current iteration, which is what you have going with your while loop. label holds the cluster membership of each point in your dataset. For each cluster, you compute the mean of its member points and assign that mean as the new cluster centre. For some reason, should you encounter NaN in any dimension of a cluster centre, you set that centre to all zeroes instead; this looks very abnormal to me, and I'll provide a suggestion later. Edit: Now I understand why you did this. Should a cluster be empty, you cannot take the mean of an empty set of points, so you make its centre all zeroes. This can be solved with my suggestion about duplicate initial centres towards the end of this post.
function label = assign_labels(X, ctrs)
[N, ~] = size(X);
[c, ~] = size(ctrs);
dist = zeros(N, c);
for i = 1:c
    dist(:, i) = sum((X - repmat(ctrs(i, :), [N, 1])).^2, 2);
end
[~, label] = min(dist, [], 2);
This function takes in a dataset X and the current cluster centres for this iteration, and it returns a list of labels indicating which cluster each point belongs to. This also looks correct: the ith column of dist holds the squared distance from every point to the ith cluster centre. One optimization trick is to avoid repmat and use bsxfun, which handles the replication internally. Therefore, do this instead:
function label = assign_labels(X, ctrs)
[N, ~] = size(X);
[c, ~] = size(ctrs);
dist = zeros(N, c);
for i = 1:c
    dist(:, i) = sum(bsxfun(@minus, X, ctrs(i, :)).^2, 2);
end
[~, label] = min(dist, [], 2);
Now, this all looks correct. I also ran some tests myself and it all works out, provided that the initial cluster centres are unique. One small problem with k-means is that we implicitly assume all cluster centres are unique. Should they not be, two (or more) clusters will have exactly the same initial centre... so which cluster should a data point be assigned to? When you do the min in your assign_labels function, with two identical centres the point gets assigned to the one with the smaller index. This is why you can end up with a cluster with no points in it: all of the points that should have been assigned to it get assigned to the duplicate instead.
As such, you may have two (or more) initial cluster centres that are identical after randomization: even though the permutation indices are unique, the selected data points themselves may not be. One thing you can do is loop over the permutation until you get a set of initial centres without repeats. As such, try doing this at the beginning of your code instead:
[N, n] = size(X);
index = randperm(N);
ctrs = X(index(1:c), :);
while size(unique(ctrs, 'rows'), 1) ~= c
    index = randperm(N);
    ctrs = X(index(1:c), :);
end
old_label = zeros(1, N);
label = ones(1, N);
iter = 0;
%// While loop appears here
This will ensure that you have a unique set of initial centres before you continue on in your code. Now, about the NaN check inside the for loop: I honestly don't see how any dimension could come out NaN after computing the mean, unless your data contains NaN to begin with, so I would suggest removing it. Edit: you can now remove the NaN check, as the initial cluster centres should now be unique.
This should hopefully fix the problems you're experiencing. Good luck!
"Losing" a cluster is not half as special as one may think, due to the nature of k-means.
Consider duplicates. Let's assume that your first k points are all identical; what would happen in your code? There is a reason you need to handle this case carefully. The simplest solution is to leave the centroid as it was before and live with degenerate clusters, as sketched below.
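A minimal sketch of that fix in terms of the question's update loop (the explicit dim argument to mean also guards the single-member case, where mean of a 1 x n row would otherwise average across the row and collapse it to a scalar):

for i = 1:c
    members = (label == i);
    if any(members)
        ctrs(i, :) = mean(X(members, :), 1);
    end
    % else: cluster i is empty and ctrs(i, :) is simply left unchanged
end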
Given that you only have 240 points but want k=100, don't expect too good results. Most objects will end up in clusters of their own... choosing a much too large k is probably why you see this degeneration effect a lot. And if fewer than 100 of these 240 points are unique, you cannot have 100 non-empty clusters at all. Plus, I would consider this kind of result "overfitting" anyway.
If you don't have the toolboxes you need in Matlab, maybe you should move on to free software. Octave, R, Weka, ELKI, ... there is plenty of software, some of which is much more powerful when it comes to clustering than pure Matlab (in particular if you don't have the toolboxes).
Also, benchmark: you will be surprised by the performance differences.
I am running kmeans in Mahout and as an output I get folders clusters-x, clusters-x-final and clusteredPoints.
If I understood correctly, clusters-x holds the centroid locations at each iteration, clusters-x-final holds the final centroid locations, and clusteredPoints should contain the clustered points, each with a cluster id and a weight representing the probability of belonging to that cluster (depending on the distance between the point and its centroid). In addition to the cluster centroids, clusters-x and clusters-x-final contain the number of elements, the feature values of the centroid, and the radius of the cluster (the distance between the centroid and its farthest point).
How do I examine these outputs?
I used the cluster dumper successfully on clusters-x and clusters-x-final from the terminal, but when I used it on clusteredPoints I got an empty file. What seems to be the problem?
And how can I get these values from code? I mean the centroid values and the points belonging to each cluster.
For clusteredPoints I used IntWritable as the key and WeightedPropertyVectorWritable as the value in a while loop, but it skips the loop as if there were no elements in clusteredPoints.
This is all the stranger because the file that I get with clusterDumper is empty as well.
What could be the problem?
Any help would be greatly appreciated!
I believe your interpretation of the data is correct (I've only been working with Mahout for ~3 weeks, so someone more seasoned should probably weigh in on this).
As far as linking points back to the input that created them, I've used NamedVector, where the name is the key for the vector. When you read one of the generated points files (clusteredPoints), you can convert each row (point vector) back into a NamedVector and retrieve the name using .getName().
Update in response to comment
When you initially read your data into Mahout, you convert it into a collection of vectors which you then write to a file (points) for use in the clustering algorithms later. Mahout gives you several Vector types to use, but it also gives you a Vector wrapper class called NamedVector that lets you identify each vector.
For example, you could create each NamedVector as follows:
NamedVector nVec = new NamedVector(
    new SequentialAccessSparseVector(vectorDimensions),
    vectorName
);
Then you write your collection of NamedVectors to file with something like:
SequenceFile.Writer writer = new SequenceFile.Writer(...);
VectorWritable writable = new VectorWritable();
// the next two lines will be in a loop, but I'm omitting it for clarity
writable.set(nVec);
writer.append(new Text(nVec.getName()), writable); // append the wrapper, not the raw vector
You can now use this file as input to one of the clustering algorithms.
After having run one of the clustering algorithms with your points file, it will have generated yet another points file, but it will be in a directory named clusteredPoints.
You can then read in this points file and extract the name you associated to each vector. It'll look something like this:
IntWritable clusterId = new IntWritable();
WeightedPropertyVectorWritable vector = new WeightedPropertyVectorWritable();
while (reader.next(clusterId, vector))
{
    NamedVector nVec = (NamedVector) vector.getVector();
    // you now have access to the original name using nVec.getName()
}
Check the parameter named "clusterClassificationThreshold"; it should be 0.
You can check this thread: http://mail-archives.apache.org/mod_mbox/mahout-user/201211.mbox/%3C50B62629.5020700#windwardsolutions.com%3E
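For reference, when driving k-means from code the threshold is one of the arguments to KMeansDriver.run. Treat the following as a sketch of the 0.7-era API with placeholder paths; the exact signature has changed between Mahout releases, so check the version you are on:

// run k-means, then classify points into clusteredPoints with threshold 0
KMeansDriver.run(
    conf,                   // Hadoop Configuration
    new Path("points"),     // input vectors
    new Path("clusters-0"), // initial cluster seeds
    new Path("output"),     // output directory
    0.001,                  // convergence delta
    10,                     // max iterations
    true,                   // runClustering: write clusteredPoints
    0.0,                    // clusterClassificationThreshold
    false);                 // runSequential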
I have a 3D dataset containing multiple connected components. Using MATLAB, I would like to compute a certain metric for each of these components (the metric is not included in the regionprops function). My question is: what is the best way to do it?
The metric I would like to compute is the surface area. I know how to do this for one connected component, but I'm looking for an efficient way to do it for all components that meet a certain volume criterion.
What I have so far:
cc = bwconncomp(data, 26);      % find connected components
L = labelmatrix(cc);            % label each component
stats = regionprops(data, 'area');
for i = 1:length(cc.PixelIdxList)
    if stats(i,1).Area > threshold
        a = (L == i);
        surfaceArea(i,1) = compute_surface_area(a);
    end
end
I'm sure there is a better way to do this!
Thanks in advance, N
You may want to use arrayfun to compute the surface area of each connected component whose area is above the threshold:
idx = find([stats.Area] > threshold);
arrayfun(@(ii) compute_surface_area(L == ii), idx, 'UniformOutput', 0)
Here, the for loop is condensed into a single line of code.
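If compute_surface_area returns a scalar, you can drop 'UniformOutput', 0 and scatter the results back to the component indices, e.g. (a sketch with the question's variable names):

idx = find([stats.Area] > threshold);
vals = arrayfun(@(ii) compute_surface_area(L == ii), idx);
surfaceArea = zeros(cc.NumObjects, 1);
surfaceArea(idx) = vals;   % components below the threshold keep zero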