Would greatly appreciate some help with the following challenge:
I am importing a fact table from a database into a Matlab table. The fact table consists of a sequence of observations across several categories as follows:
SeqNo Cat Observation
1 A 0.3
1 B 0.5
1 C 0.6
2 B 0.9
2 C 1.0
3 A 1.2
3 C 1.5
I need now to delinearize the fact table and create a matrix (or another table) with the categories representing columns, i.e. something like this:
Seq A B C
1 0.3 0.5 0.6
2 NaN 0.9 1.0
3 1.2 NaN 1.5
I played around with findgroups and the split-apply-combine workflow, but no luck. In the end I had to resort to SPSS Modeler to create a properly structured csv file for import, but I need to achieve this fully in Matlab or Simulink.
Any help would be most welcome.
%Import table
T=readtable('excelTable.xlsx');
obs_Array=T.Observation;
%Extract unique elements from SeqNo column
seqNo_values=(unique(T.SeqNo));
%Extract unique elements from Cat column
cat_values=(unique(T.Cat));
%Notice that the elements in seqNo_values
%already specify the row of your new matrix
%The index of each element in cat_values
%does the same thing for the columns of your new matrix.
numRows=numel(seqNo_values);
numCols=numel(cat_values);
%Initialize a new, NaN matrix:
reformatted_matrix=NaN(numRows,numCols);
for i=1:numel(obs_Array)
%Look up the row by position in seqNo_values,
%so SeqNo values that do not run 1,2,3,... still work
target_row=find(seqNo_values == T.SeqNo(i));
%strcmp also handles category names longer than one character
target_col=find(strcmp(cat_values, T.Cat{i}));
reformatted_matrix(target_row,target_col)=obs_Array(i);
end
reformatted_matrix
Output:
reformatted_matrix =
0.3000 0.5000 0.6000
NaN 0.9000 1.0000
1.2000 NaN 1.5000
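As an alternative to the loop, the same reshaping can be sketched with the built-in unstack function for tables. This is a minimal sketch, assuming the table has exactly the three columns from the question; the example data is rebuilt inline rather than read from a file, and the names W and M are only for illustration.

```matlab
% Rebuild the example fact table (same columns as in the question)
SeqNo = [1;1;1;2;2;3;3];
Cat = {'A';'B';'C';'B';'C';'A';'C'};
Observation = [0.3;0.5;0.6;0.9;1.0;1.2;1.5];
T = table(SeqNo, Cat, Observation);

% Spread Observation into one column per category;
% SeqNo/Cat pairs that never occur are filled with NaN
W = unstack(T, 'Observation', 'Cat');

% Plain numeric matrix, if a table is not wanted
M = W{:, {'A','B','C'}};
```

W keeps SeqNo as its first variable, so the row labels stay attached to the data.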
Overview
I am currently working with a series of .txt files that I am importing into MATLAB. For simplicity, I'll show my problem conceptually. Obligatory disclaimer: I'm new to MATLAB (and to programming in general).
These .txt files contain data from tracking a ROI in a video (frame-by-frame) with time ('t') in the first column and velocity ('v') in the second as shown below;
T1 = T2 = etc.
t v t v
0 NaN 0 NaN
0.1 100 0.1 200
0.2 200 0.2 500
0.3 400 0.3 NaN
0.4 150
0.5 NaN
Problem
Files differ in their size, the columns remain fixed but the rows vary from trial to trial as shown in T1 and T2.
The time column is the same for each of these files so I wanted to organise data in a table as follows;
time v1 v2 etc.
0 NaN NaN
0.1 100 200
0.2 200 500
0.3 400 NaN
0.4 150 0
0.5 NaN 0
Note that I want to add 0s (or NaN) to end of shorter trials to fix the issue of size differences.
Edit
Both solutions worked well for my dataset. I appreciate all the help!
You could import each file into a table using readtable and then use outerjoin to combine the tables in the way that you would expect. This works whether or not all data starts at t = 0.
To create a table from a file:
T1 = readtable('filename1.dat');
T2 = readtable('filename2.dat');
Then to perform the outerjoin (pseudo data created for demonstration purposes).
t1 = table((1:4)', (5:8)', 'VariableNames', {'t', 'v'});
%// t v
%// _ _
%// 1 5
%// 2 6
%// 3 7
%// 4 8
% t2 is missing row 2
t2 = table([1;3;4], [1;3;4], 'VariableNames', {'t', 'v'});
%// t v
%// _ _
%// 1 1
%// 3 3
%// 4 4
%// Now perform an outer join and merge the key column
t3 = outerjoin(t1, t2, 'Keys', 't', 'MergeKeys', true)
%// t v_t1 v_t2
%// _ ____ ____
%// 1 5 1
%// 2 6 NaN
%// 3 7 3
%// 4 8 4
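With more than two trials, the join can be repeated in a loop. The following is a rough sketch with made-up data; the cell array trials and the column names v1, v2, ... are assumptions for illustration. Each velocity column is renamed before joining so that the names stay unique.

```matlab
% Hypothetical trials, each with a time column t and a velocity column v
trials = { table([0;0.1;0.2;0.3], [NaN;100;200;400], 'VariableNames', {'t','v'}), ...
           table([0;0.1;0.2],     [NaN;200;500],     'VariableNames', {'t','v'}), ...
           table([0;0.1],         [NaN;50],          'VariableNames', {'t','v'}) };

merged = trials{1};
merged.Properties.VariableNames{2} = 'v1';
for k = 2:numel(trials)
    next = trials{k};
    % Give each trial's velocity column its own name (v2, v3, ...)
    next.Properties.VariableNames{2} = sprintf('v%d', k);
    % Rows missing from a shorter trial are filled with NaN
    merged = outerjoin(merged, next, 'Keys', 't', 'MergeKeys', true);
end
```

In practice the tables would come from readtable rather than being typed in.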
I would suggest the padarray and horzcat functions (padarray is part of the Image Processing Toolbox). Respectively, they:
Pad a matrix or vector with extra data, effectively adding extra 0's or any specified value (NaNs work too).
Concatenate matrices or vectors horizontally.
First, obtain the length of the longest vector you have to concatenate. Let's call this value max_len. Once you have that, you can pad each vector like this:
v1 = padarray(v1, max_len - length(v1), 0, 'post');
% You can replace the '0' by any value you want !
Finally, once you have vectors of the same size, you can concatenate them using horzcat:
big_table = horzcat(v1, v2, ... , vn);
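If padarray is not available, the same padding can be sketched in base MATLAB by preallocating a NaN matrix and copying each vector into its own column. The cell array vs and its contents are made up for illustration and mirror the v columns of T1 and T2 from the question.

```matlab
% Hypothetical velocity columns of different lengths
vs = { [NaN;100;200;400;150;NaN], [NaN;200;500;NaN] };

max_len = max(cellfun(@numel, vs));      % length of the longest vector
big_table = NaN(max_len, numel(vs));     % use zeros(...) instead to pad with 0
for k = 1:numel(vs)
    % Copy the k-th vector into the top of column k; the rest stays NaN
    big_table(1:numel(vs{k}), k) = vs{k};
end
```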
I have read a lot about this subject, but I can't find a solution to my problem.
First, I have a file with 3 columns: X Y Z
In MATLAB, I did this:
data = load('data.txt');
X = data(:,1);
Y = data(:,2);
Z = data(:,3);
This file is like this:
7037 6032 3
7036 6028 5
7037 6029 4
7037 6030 3
7038 6031 6
7039 6031 2
7037 6033 7
And I want to obtain the following matrix from the above matrix:
5 NaN NaN NaN NaN NaN
NaN 4 3 NaN 3 7
NaN NaN NaN 6 NaN NaN
NaN NaN NaN 2 NaN NaN
The rule is that the first column corresponds to Y(1) = min(Y), the second column to Y(2) = Y(1) + 1, and so on.
Likewise, the first row corresponds to X(1) = min(X), the second to X(2) = X(1) + 1. Essentially, the first column of the file acts as a row index, the second column acts as a column index, and for each row and column pair, the third column gets mapped to that location in the matrix. As such, the output matrix looks like this: out(1,1) holds the value at (X(1), Y(1)); out(1,2) holds the value at (X(1), Y(2)).
At first, I thought about creating a matrix out like so:
xr = sort(unique(X));
yr = sort(unique(Y));
a = length(xr);
b = length(yr);
out = NaN(a,b);
Afterwards, I tried to place the data into this out matrix with a loop, but this obviously doesn't work.
For more information on an Esri grid, here's a Wikipedia article about it. The example grid in that page is what I desire. http://en.wikipedia.org/wiki/Esri_grid
I now understand what you want. The link that you posted from Wikipedia is very useful. You are trying to build what is known as an Esri grid. Here is a pictorial representation found on Wikipedia:
What you are given is an N x 3 matrix where the first column denotes the row IDs of this matrix, the second column denotes the column IDs of this matrix, and the third column denotes the values at each pair of IDs. So for example, given the example above - specifically looking at the right of the figure, your text file could look like:
275 125 5
275 175 2
...
...
25 75 5
25 125 1
Each row consists of a row index, a column index and a value that maps to this location in the grid. You had the right approach in that you should use unique - specifically the third output. We need to obtain a unique ID for the first two columns of your data independently. Once we do this, I'm going to show you the very powerful accumarray function. We are basically going to use the unique IDs found in the previous step, and we use these to index into our grid and place each value that corresponds to each unique pair of row and column IDs into this grid. Therefore, your code is very simply:
data = load('data.txt');
%// Or you can do this for reproducing the results
%data = [7037 6032 3;
%7036 6028 5;
%7037 6029 4;
%7037 6030 3;
%7038 6031 6;
%7039 6031 2;
%7037 6033 7];
[~,~,id1] = unique(data(:,1));
[~,~,id2] = unique(data(:,2));
out = accumarray([id1 id2], data(:,3), [], [], NaN);
out produces the desired Esri grid, and we get:
out =
5 NaN NaN NaN NaN NaN
NaN 4 3 NaN 3 7
NaN NaN NaN 6 NaN NaN
NaN NaN NaN 2 NaN NaN
So how does this work? accumarray accepts a matrix of row and column locations that you want to use to access the output. At each of the corresponding row and column locations, you provide a value that gets mapped to this bin. By default, accumarray sums up the values that get mapped to each bin, but I'm going to assume that the values in your text file are all unique in that only one value gets mapped to each row and column index. Therefore, we can get away with the default behaviour, which you select by specifying [] as the fourth input. We use the last column of your matrix as the values that get put into the output, use [] as the third input so that accumarray infers the size of the output, and fill any locations that don't get mapped to anything with NaN (fifth input). Since each bin receives exactly one value, nothing is actually summed.
With the above explanation, the code follows.
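One caveat: indices taken from unique are consecutive, so a coordinate value that never appears in the file gets no row or column of its own. If the grid should follow the question's rule literally (each row is the previous X plus 1), a minimal sketch, assuming integer coordinates, is to offset each coordinate by its minimum instead. The names r and c are for illustration only.

```matlab
data = [7037 6032 3;
        7036 6028 5;
        7037 6029 4;
        7037 6030 3;
        7038 6031 6;
        7039 6031 2;
        7037 6033 7];

% Offset each coordinate so that the smallest value maps to index 1
r = data(:,1) - min(data(:,1)) + 1;
c = data(:,2) - min(data(:,2)) + 1;

% Same call as before: default function, inferred size, NaN for empty cells
out = accumarray([r c], data(:,3), [], [], NaN);
```

For this example file there are no gaps in X or Y, so the result is the same as with the unique-based indices.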
I am trying to find the mean of a column however I am having trouble getting an output for a function I created. My code is below, I cannot see what mistake I have made.
for j=1:48;
C_f2 = V(V(:,3) == j,:);
C_f2(C_f2==0)=NaN;
m=mean(C_f2(:,4));
s=std(C_f2(:,4));
row=[j,m,s];
s1=[s1;row];
end
I have checked the matrix, C_f2 and that is full of values so should not be returning NaN. However my output for the matrix s1 is
1 NaN NaN
2 NaN NaN
3 NaN NaN
. ... ...
48 NaN NaN
Can anyone see my issue? Help would be much appreciated!
The matrix C_f2 looks like,
1 185 01 5003
1 185 02 5009
. ... .. ....
1 259 48 5001
On line 3 you set all values which are zero to NaN. The mean function will return NaN as mean if any element is NaN. If you want to ignore the NaN values, you have to use the nanmean function, which comes with the Statistics toolbox. See the following example:
a = [1 NaN 2 3];
mean(a)
ans =
NaN
nanmean(a)
ans =
2
If you don't have the Statistics toolbox, you can exclude NaN elements with logical indexing
mean(a(~isnan(a)))
ans =
2
or, possibly easiest of all, directly exclude all elements which are zero instead of replacing them with NaN.
mean(a(a~=0))
Your line C_f2(C_f2==0)=NaN; will put NaNs into C_f2. Then, your mean and std operations will see those NaNs and output NaNs themselves.
To have the mean and std ignore NaN, you need to use the alternate version nanmean and nanstd.
These are part of a toolbox, however, so you might not have them if you just have the base Matlab installation.
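In more recent MATLAB releases (R2015a and later, as far as I know), the base mean and std functions accept an 'omitnan' flag, so no toolbox is needed. A minimal sketch with the same example vector:

```matlab
a = [1 NaN 2 3];

% Ignore NaN entries when computing the statistics
m = mean(a, 'omitnan');   % mean of [1 2 3]
s = std(a, 'omitnan');    % standard deviation of [1 2 3]
```

In the loop from the question, this would amount to calling mean(C_f2(:,4), 'omitnan') and std(C_f2(:,4), 'omitnan').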
Don't set it to NaN; any computation involving NaN will return NaN unless extra rules are applied.
Use find to index the nonzero part of your column.
Say column n is your input:
N = n(find(n~=0))
Now do your mean calculation on N.
To compute the mean and standard deviation of each column excluding zeros:
A = [1 2;
3 0;
4 5;
6 7;
0 0]; %// example data
den = sum(A~=0); %// number of nonzero values in each column
mean_nz = bsxfun(@rdivide, sum(A), den);
mean2_nz = bsxfun(@rdivide, sum(A.^2), den);
std_nz = sqrt(bsxfun(@times, mean2_nz-mean_nz.^2, den./(den-1)));
The results for the example are
mean_nz =
3.5000 4.6667
std_nz =
2.0817 2.5166
The above uses the "corrected" definition of standard deviation (which divides by n-1, where n is the number of values). If you want the "uncorrected" version (i.e. divide by n):
std_nz = sqrt(mean2_nz-mean_nz.^2);
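As a sanity check on the vectorized formulas, the same numbers can be reproduced column by column with plain mean and std on the nonzero entries. This is only a verification sketch; the names chk_mean and chk_std are made up here.

```matlab
A = [1 2;
     3 0;
     4 5;
     6 7;
     0 0];

chk_mean = zeros(1, size(A,2));
chk_std  = zeros(1, size(A,2));
for c = 1:size(A,2)
    col = A(A(:,c) ~= 0, c);   % nonzero entries of column c
    chk_mean(c) = mean(col);
    chk_std(c)  = std(col);    % "corrected" version, divides by n-1
end
```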
Given two vectors containing numerical values, say for example
a=1.:0.1:2.;
b=a+0.1;
I would like to select only the differing values. For this Matlab provides the function setdiff. In the above example it is obvious that setdiff(a,b) should return 1. and setdiff(b,a) gives 2.1. However, due to computational precision (see the questions here or here) the result differs. I get
>> setdiff(a,b)
ans =
1.0000 1.2000 1.4000 1.7000 1.9000
Matlab provides a function which returns a lower limit to this precision error, eps. This allows us to estimate a tolerance like tol = 100*eps;
My question now, is there an intelligent and efficient way to select only those values whose difference is below tol? Or said differently: How do I write my own version of setdiff, returning both values and indexes, which includes a tolerance limit?
I don't like the way it is answered in this question, since matlab already provides part of the required functionality.
Introduction and custom function
In a general case with floating point precision issues, one would be advised to use a tolerance value for comparisons against suspected zero values, and that tolerance must be a very small value. A slightly more robust method would build that tolerance from eps. Now, since setdiff effectively checks whether the difference between two elements is zero, you can use eps directly here: treat any absolute difference less than or equal to eps as zero.
This forms the basis of a modified setdiff for floating point numbers shown here -
function [C,IA] = setdiff_fp(A,B)
%//SETDIFF_FP Set difference for floating point numbers.
%// C = SETDIFF_FP(A,B) for vectors A and B, returns the values in A that
%// are not in B with no repetitions. C will be sorted.
%//
%// [C,IA] = SETDIFF_FP(A,B) also returns an index vector IA such that
%// C = A(IA). If there are repeated values in A that are not in B, then
%// the index of the first occurrence of each repeated value is returned.
%// Get 2D matrix of absolute difference between each element of A against
%// each element of B
abs_diff_mat = abs(bsxfun(@minus,A,B.')); %//'
%// Compare each element against eps to "negate" the floating point
%// precision issues. Thus, we have a binary array of true comparisons.
abs_diff_mat_epscmp = abs_diff_mat<=eps;
%// Find indices of A that are exclusive to it
A_ind = ~any(abs_diff_mat_epscmp,1);
%// Get unique (to account for no repetitions and being sorted) exclusive
%// A elements for the final output along with the indices
[C,IA] = intersect(A,unique(A(A_ind)));
return;
Example runs
Case1 (With integers)
This will verify that setdiff_fp works with integer arrays just the way setdiff does.
A = [2 5];
B = [9 8 8 1 2 1 1 5];
[C_setdiff,IA_setdiff] = setdiff(B,A)
[C_setdiff_fp,IA_setdiff_fp] = setdiff_fp(B,A)
Output
A =
2 5
B =
9 8 8 1 2 1 1 5
C_setdiff =
1 8 9
IA_setdiff =
4
2
1
C_setdiff_fp =
1 8 9
IA_setdiff_fp =
4
2
1
Case2 (With floating point numbers)
This is to show that setdiff_fp produces the correct results, while setdiff doesn't. Additionally, this will also test out the output indices.
A=1.:0.1:1.5
B=[A+0.1 5.5 5.5 2.6]
[C_setdiff,IA_setdiff] = setdiff(B,A)
[C_setdiff_fp,IA_setdiff_fp] = setdiff_fp(B,A)
Output
A =
1.0000 1.1000 1.2000 1.3000 1.4000 1.5000
B =
1.1000 1.2000 1.3000 1.4000 1.5000 1.6000 5.5000 5.5000 2.6000
C_setdiff =
1.2000 1.4000 1.6000 2.6000 5.5000
IA_setdiff =
2
4
6
9
7
C_setdiff_fp =
1.6000 2.6000 5.5000
IA_setdiff_fp =
6
9
7
For a tolerance of 1 epsilon, this should work:
a=1.0:0.1:2.0;
b=a+0.1;
b=[b b-eps b+eps];
c=setdiff(a,b)
The idea is to expand b to include also its closest values.
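Newer MATLAB releases (R2015a and later, as far as I know) also ship ismembertol, which does the tolerant membership test directly. A minimal sketch on the vectors from the question follows; the tolerance 1e-10 is an arbitrary choice for illustration, and note that ismembertol by default scales the tolerance by the largest absolute value in the inputs.

```matlab
a = 1.0:0.1:2.0;
b = a + 0.1;

tol = 1e-10;                      % assumed tolerance, scaled internally by max(abs([a b]))
keep = ~ismembertol(a, b, tol);   % true where a has no match in b within tol

c  = a(keep);     % values of a that are not in b
ia = find(keep);  % their indices into a
```

Unlike setdiff, this keeps the original order of a and does not remove repeated values; wrap the result in unique if that behaviour is needed.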
I have an n x 2 matrix which has been formed by appending many matrices together. Column 1 of the matrix consists of numbers that indicate item_ids and column 2 consists of similarity values. Since this matrix has been formed by concatenating many matrices together, there might be duplicate values in column 1, which I do not want. For any value X that appears more than once in column 1, I would like to remove all rows where column 1 = X except the one row whose column 2 value is the maximum among all the values for X in the matrix.
Example:
1 0.85
1 0.5
1 0.95
2 0.5
result required:
1 0.95
2 0.5
obtained by removing all the rows in the n X 2 matrix where the duplicate values in column 1 did not have the maximum value in column 2.
If you might have gaps in the index, use sparse output:
>> result = accumarray( M(:,1), M(:,2), [], @max, 0, true)
>> uMat = [find(result) nonzeros(result)]
uMat =
1.0000 0.9500
2.0000 0.5000
This also simplifies creation of the first column of the output.
A couple of other ways to do it with unique.
First way, use sort with 'descend' ordering:
>> [~,IS] = sort(M(:,2),'descend');
>> [C,ia] = unique(M(IS,1));
>> M(IS(ia),:)
ans =
1.0000 0.9500
2.0000 0.5000
Second, use sortrows (ascending sort by second column), and unique with the 'last' occurrence option:
>> [Ms,IS] = sortrows(M,2)
>> [~,ia] = unique(Ms(:,1),'last')
>> M(IS(ia),:)
ans =
1.0000 0.9500
2.0000 0.5000
You can try
result = accumarray( M(:,1), M(:,2), [max(M(:,1)) 1], @max);
According to the documentation, that should work.
Apologies I can't try it out right now...
update - I did try the above, and it gave me the max values correctly. However it doesn't give you the indices corresponding to the max values. For that, you need to do a bit more work (since the identifiers probably aren't sorted).
result = accumarray( M(:,1), M(:,2), [], @max, 0, true); % fill value 0, then true to create a sparse matrix
c1 = find(result); % to get the indices of nonzero values
c2 = full(result(c1)); % to get the values corresponding to the indices
answer = [c1 c2]; % to put them side by side
result = accumarray( M(:,1), M(:,2), [max(M(:,1)) 1], @max);
finalResult = [sort(unique(M(:,1))),nonzeros(result)]
This basically reattaches the required item_ids in sorted order to the corresponding max_similarity values in the second column. As a result in the finalResult matrix, each value in column 1 is unique and the corresponding value in column 2 is the maximum similarity value for that item_id.
@Floris, thanks for your help, I couldn't have solved this without it.
Yet another approach: use sortrows and then diff to select the last row for each value of the first column:
M2 = sortrows(M);
result = M2(diff([M2(:,1); inf])>0,:);
This works also if the indices in the first column have gaps.
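If the item_ids are not small positive integers (large, negative, or non-integer values), one more variant is to let unique produce compact group numbers first and feed those to accumarray. A minimal sketch on the example data, with ids and grp as illustrative names:

```matlab
M = [1 0.85;
     1 0.5;
     1 0.95;
     2 0.5];

% ids: sorted distinct item_ids; grp: group number (1..numel(ids)) for each row
[ids, ~, grp] = unique(M(:,1));

% Maximum similarity per group, paired with its item_id
result = [ids accumarray(grp, M(:,2), [], @max)];
```

Because the group numbers are always 1..numel(ids), there are no empty bins and no fill value or sparse output is needed.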