I am trying to perform an interpolation/fit (preferably non-linear, but linear should also be fine) on 4D data. My data has a form of:
[a,b,c] = func(input)
obviously, func is unknown and ultimately data looks like (input, a, b, c):
0 -0.1253 0.0341 0.01060
35 -0.0985 0.0176 0.02060
50 -0.0315 -0.0533 0.1118
60 -0.0518 -0.0327 0.03020
80 0.2939 -0.0713 0.05670
100 0.3684 -0.0765 0.06740
I take observations at e.g. input = [0, 35, 50, 60, 80, 100] (0 being min and 100 being max; I take 6 samples in between min and max) and then I get corresponding a, b and c values (I understand that 6 sample points are a bad design of experiment so I will extend it in future).
I am trying to guess the value of a, b and c at say input = 19? Any pointers?
How to estimate goodness of fit in such scenario?
This is not 4D interpolation; it is 1D interpolation done three times. You interpolate a with interp1([0 35],[-0.1253 -0.0985],19), or more generally interp1(input,a,19), and do the same for b and c.
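As a rough sketch of the above, using the six samples from the question (the variable names abc, xq and loo_rmse are made up for illustration): interp1 interpolates each column of a matrix independently, so all three outputs can be handled in one call. The leave-one-out loop at the end is only one possible way to get a feel for the error, not a formal goodness-of-fit measure.

```matlab
% Samples from the question: input in column 1, then a, b, c
input = [0 35 50 60 80 100]';
abc = [-0.1253  0.0341 0.0106;
       -0.0985  0.0176 0.0206;
       -0.0315 -0.0533 0.1118;
       -0.0518 -0.0327 0.0302;
        0.2939 -0.0713 0.0567;
        0.3684 -0.0765 0.0674];
xq = 19;                                        % query point

est_linear = interp1(input, abc, xq);           % 1x3: [a b c], piecewise linear
est_pchip  = interp1(input, abc, xq, 'pchip');  % 1x3: shape-preserving cubic

% Leave-one-out check on the interior samples: drop one sample,
% predict it from the remaining ones, and record the error.
n = numel(input);
loo_err = zeros(n - 2, 3);
for k = 2:n-1
    keep = true(n, 1);
    keep(k) = false;
    pred = interp1(input(keep), abc(keep, :), input(k));
    loo_err(k - 1, :) = pred - abc(k, :);
end
loo_rmse = sqrt(mean(loo_err.^2, 1));           % one RMSE per output column
```

With only six samples the leave-one-out errors will be large and noisy, so treat them as a sanity check rather than a real estimate.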
Note that for the most basic 1D interpolation in a mesh grid (not what you have), you need 2 data points in general. For the most basic 2D interpolation, you need 4 data points. For 3D interpolation, 8 minimum, 4D, 16.... (2^d in general).
Also note that 1D interpolation uses 2 "dims": one guides the interpolation and the other one is interpolated. In general, with [v,a,b,c] data you would use 3D interpolation.
All that said, this is not your case: you have scattered data, not a grid, so the problem becomes considerably more complicated.
In case you can generate a few more points (not necessarily 16), you can use the function griddatan to interpolate scattered data. Note that you cannot just say "give me [a,b,c] for input=19", because there could be infinitely many combinations of a, b and c that satisfy that condition. In any case, you always need to supply dim-1 coordinates per sample point and get the last one interpolated. A word of advice: this function is very expensive in both computation and memory. Do not use it on large data sets, because it may crash your PC.
If you want to find a set of parameters that gives input=19, then you are getting into a more complicated area. You want to find the x=[a,b,c] for which f(x) is as close as possible to input.
In math terms:
\hat{x} = \arg\min_{x} \| f(x) - \mathrm{input} \|^2
This is a harder problem, and arguably more a mathematics question than a programming one. Perhaps an N-D B-spline fit of your data would be a good f.
I have a set of ages (over 10000 of them) and I want to plot a graph with the age from 20 to 100 on the x axis and then the number of times each of those ages appears in the data on the y axis. I have tried several ways to do this and I can't figure it out. I also have some other data which requires me to plot values vs how many times they occur so any advice on how to do this would be much appreciated.
I'm quite new to Matlab so it would be great if you could explain how things in your answer work rather than just typing out some code.
Thanks.
EDIT:
So I typed histogram(Age, 80) because as I understand that will plot the values in Age on a histogram split up into 80 bars (1 for each age). Instead I get this:
The bars aren't aligned and it's clearly not 1 per age nor has it plotted the number of times each age occurs on the y axis.
You have to use histogram(), and that's correct.
Let's see with an example.
I extract 100 ages between 20 and 100:
ages=randsample([20:100],100,true);
Now I call histogram() in this manner:
h=histogram(ages,[20:100]);
where h is a histogram object; this will also show the following plot:
However, this looks easy only because my ages vector lies in the range 20:100, so it contains no other values. If your vector also contains ages outside 20:100, you can pass the additional 'BinLimits' option to histogram() like this:
h=histogram(ages,80,'BinLimits',[20 100]);
'BinLimits' takes a two-element vector [min max], and with this option the histogram uses only the values in ages that fall between 20 and 100 inclusive.
Note: by inspecting h you can see and/or edit some properties of your histogram. One property of this object you might be interested in is Values. This is a vector of length 80 (in our case, since we work with 80 bins) in which the i-th element is the number of items in the i-th bin. This will help you count the occurrences (just in case you need them to go on with your analysis).
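If you want exactly one bar per integer age, one option (a sketch, not the only way) is to put the bin edges halfway between the integers, so every age from 20 to 100 gets its own bin. Note that this gives 81 bins, since 20:100 contains 81 distinct ages. The ages vector below is random stand-in data, and the names edges and counts are just for illustration.

```matlab
% Stand-in data: 10000 random integer ages between 20 and 100
ages = randi([20 100], 1, 10000);

% Edges at 19.5, 20.5, ..., 100.5 so that each integer age sits
% in the middle of its own bin
edges = 19.5:1:100.5;

% counts(i) is the number of times age (19 + i) occurs
counts = histcounts(ages, edges);

% Plot occurrences against age
bar(20:100, counts);
xlabel('Age');
ylabel('Number of occurrences');
```

The same counts vector can be reused for further analysis, in the same way as the Values property of the histogram object.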
Like Luis said in the comments, hist is the way to go. You should pass a vector of bin centers, rather than the number of bins:
ages = randi([20 100], [1 10000]);
hist(ages, [20:100])
Is this what you were looking for?
I have a 128 x 1 input to the 'Local maxima' block. I want the output to be the 4 maximum values of the input. I set Maximum number of local maxima: 4 and Neighborhood size: [1 1]. I expect to get a single 2x4 matrix that has the values I want in its first row. However, this block outputs 2 matrices of size 2x4. Why does this happen?
EDIT: I use 'simout' to inspect the output of the 'Local maxima' block.
Thanks in advance!
As I mentioned in the comments, the output of the block is probably a 2x4 matrix, but at each time step. If you have, say 101 time steps (from 0 to 10s in steps of 0.1), then the input signal is not 128x1, but 128x1x101, and so the output that is stored in simout will be 2x4x101.
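To make the shapes concrete, here is a small sketch of how the logged array can be sliced afterwards in MATLAB. The simout below is random stand-in data with the 2x4x101 shape described above, not real block output, and the names lastStep and firstRow are made up.

```matlab
% Stand-in for the logged output: 2x4 matrix at each of 101 time steps
simout = rand(2, 4, 101);

% The 2x4 result at the final time step
lastStep = simout(:, :, end);

% First row of the block output at every time step, as a 101x4 matrix
% (one row per time step, one column per local maximum)
firstRow = squeeze(simout(1, :, :))';
```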
I think what he is trying to do is:
To produce a 2-dimensional matrix/array directly from Simulink. In other words, when the data is exported, the third dimension (time) should be omitted. Can this be done?
I understand that taking the output and editing it in MATLAB so that it goes from a 3-dimensional to a 2-dimensional array is trivial. But is the above possible?
I am currently working in MATLAB to design a way to reconstruct 3D data. For this I have two pictures with black points. The difference in the number of points per frame is key for the reconstruction, but MATLAB gives an error when the matrices are not the same size. This is happening because the code is not doing what I want it to do, so can anyone help me with the following?
I have two columns of Xdata: XLI and XRI
What MATLAB does when I do XLI-XRI is subtract the pairs, i.e. XLI(1)-XRI(1) etc., but I want to subtract each value of XRI from every value of XLI, i.e.
XLI(1)-XRI(1,2,3,4 etc)
XLI(2)-XRI(1 2 3 4 etc)
and so on
Can anyone help?
I think you are looking for a way to subtract all combinations from each other. Here is an example of how you can do that with bsxfun:
xLI = [1 2 3]
xRI = [1 2]
bsxfun(@minus,xLI ,xRI')
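For reference, a small self-contained sketch of the all-pairs subtraction, oriented so that row i holds xLI(i) minus every element of xRI as asked in the question. The second form relies on implicit expansion, which is available in MATLAB R2016b and later; the names D and D2 are just for illustration.

```matlab
xLI = [1 2 3];
xRI = [1 2];

% Column vector minus row vector: D(i,j) = xLI(i) - xRI(j)
D = bsxfun(@minus, xLI', xRI);

% Same result using implicit expansion (R2016b or later)
D2 = xLI' - xRI;
```

Because the two vectors are expanded against each other, they do not need to have the same length, which is the situation with a different number of points per frame.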
I cannot comment on Dennis's post (not enough points on this website): his solution should work, but depending on your version of MATLAB you might get an "Error using ==> bsxfun" and need to transpose either xLI or xRI for it to work:
bsxfun(@minus,xLI' ,xRI)
Best,
Tepp
I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data each having 1000 attributes; 3 of them should be alike or similar in a way. Applying clustering will reveal the clusters. Since I know the number of clusters, I only need to find similar rows. Hamming distance tells us the similarity between rows, and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always find the requested number of clusters]
I want to take that knowledge and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data while my question concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear, ask about anything you are missing.
For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(x, 'average');
dendrogram(l);
and got the following plot:
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the Hamming distance instead of the Euclidean distance (which linkage uses by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
You can start by visualizing your data with a 'barcode' plot and then labeling the rows with the cluster group they belong to:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates (A is the 6x1000 binary data matrix)
[r,c] = find(A);
Y = bsxfun(@minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(@minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
The Hamming distance between binary strings a and b is equal to the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, you could create a 6 by 6 matrix filled with the Hamming distances. The matrix would be symmetric (the distance from a to b is the same as the distance from b to a) and the diagonal is 0 (the distance from a to itself is zero).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i=1:6,
for j=1:6,
hamming_dist(i,j) = sum(xor(x(i,:),x(j,:)));
end
end
(And yes, this code is redundant given the symmetry and zero diagonal, but the computation is minimal and optimizing is not worth the effort.)
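For what it is worth, the same matrix can also be computed without loops. The sketch below is one possible vectorized form, shown on a tiny made-up 3x4 binary matrix rather than the real data: for 0/1 entries, x*(1-x)' counts the positions where row i has a 1 and row j has a 0, and adding the transposed term counts the opposite case.

```matlab
% Tiny stand-in data set: 3 binary strings of length 4
x = double([1 0 1 1;
            1 0 0 1;
            0 1 0 0]);

% D(i,j) = number of positions where rows i and j differ
D = x * (1 - x)' + (1 - x) * x';

% Reference computation with the loop-based approach
n = size(x, 1);
Dloop = zeros(n);
for i = 1:n
    for j = 1:n
        Dloop(i, j) = sum(xor(x(i, :), x(j, :)));
    end
end
```

Dividing D by the number of columns gives the fraction of differing positions, which is easier to compare across data sets of different lengths.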
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description regarding the problem helped shaping this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (herein a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1, this is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distances are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not part of a cluster has ~50% dissimilarity (which is random given binary data).
Presentation
I recommend a Kohonen network or a similar technique to present your data in, say, 2 dimensions. In general this area is called dimensionality reduction.
You can also go a simpler way, e.g. Principal Component Analysis, but there's no guarantee you can effectively remove 998 dimensions :P
scikit-learn is a good Python package to get you started; similar packages exist in MATLAB, Java, etc. I can assure you it's rather easy to implement some of these algorithms yourself.
Concerns
I have a concern over your data set though. 6 data points is really a small number. Moreover, your attributes seem boolean at first glance; if that's the case, Manhattan distance is what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if the attributes are actually a 1000-bit long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points, you have only 2 ** 6 = 64 combinations, which means 936 out of the 1000 attributes you have are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test significance of your clusters, run K-means several times with different initial conditions and check if you get same clusters. If you get different clusters every time or even from time to time, you cannot really trust your result.
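A rough sketch of such a stability check is below. It assumes the Statistics and Machine Learning Toolbox is available for kmeans, and it uses synthetic stand-in data (three noisy pairs of binary rows) rather than the real data set. Since cluster labels are arbitrary and can be permuted between runs, the sketch compares runs through a co-assignment matrix: co(i,j) is the fraction of runs in which rows i and j ended up in the same cluster. The names nRuns, labels and co are made up for illustration.

```matlab
% Synthetic stand-in: 3 random binary patterns, each duplicated,
% with about 9% of the bits flipped in every row
base = rand(3, 1000) > 0.5;
x = base([1 1 2 2 3 3], :);
x = double(xor(x, rand(6, 1000) < 0.09));

nRuns = 20;
n = size(x, 1);
labels = zeros(n, nRuns);
for r = 1:nRuns
    % Each call starts from a different random initialisation
    labels(:, r) = kmeans(x, 3, 'Distance', 'hamming', ...
                          'EmptyAction', 'singleton');
end

% co(i,j): fraction of runs where rows i and j share a cluster
co = zeros(n);
for r = 1:nRuns
    co = co + bsxfun(@eq, labels(:, r), labels(:, r)');
end
co = co / nRuns;
```

Entries of co close to 0 or 1 suggest the grouping is stable across initialisations; values in between indicate rows that switch clusters from run to run.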
I used a barcode type visualization for my data. The code which was posted here earlier by Oleg was too heavy for my solution (image files were over 500 kb) so I used image() to make the figures
function barcode(A)
B = (A+1)*2;
image(B);
colormap flag;
set(gca,'Ydir','Normal')
axis([0 size(B,2) 0 size(B,1)]);
ax = gca;
ax.TickDir = 'out';
end