flatten matlab table by key - matlab

I have a large table whose entries are
KEY_A,KEY_B,VAL
where KEY_A and KEY_B are finite sets of keys. For argument's sake, we'll have 4 different KEY_B values and 4 different KEY_A values. An example table:
KEY_A KEY_B KEY_C
_____ _____ _________
1 1 0.45054
1 2 0.083821
1 3 0.22898
1 4 0.91334
2 1 0.15238
2 2 0.82582
2 3 0.53834
2 4 0.99613
3 1 0.078176
3 2 0.44268
3 3 0.10665
3 4 0.9619
4 1 0.0046342
4 2 0.77491
4 3 0.8173
4 4 0.86869
4 5 1
I want to elegantly flatten the table into
KEY_A KEY_B_1 KEY_B_2 KEY_B_3 KEY_B_4 KEY_B_5
_____ _________ ________ _______ _______ _______
1 0.45054 0.083821 0.22898 0.91334 -1
2 0.15238 0.82582 0.53834 0.99613 -1
3 0.078176 0.44268 0.10665 0.9619 -1
4 0.0046342 0.77491 0.8173 0.86869 1
I'd like to be able to handle missing B values (set them to a default like -1), but I think if I get an elegant way to do this to start then such things will fall into place.
The actual table has millions of records, so I do want to use a vectorized call.
The line I've got (which doesn't handle the missing KEY_B = 5 entries) is:
cell2mat(arrayfun(@(x)[x,testtable{testtable.KEY_A==x,3}'],unique(testtable{:,1}),'UniformOutput',false))
But it doesn't output a new table, and if there are missing keys in the table, it doesn't handle that.
I would think that this isn't that uncommon of an activity...has anyone done something like this before?

If the input table is T, then you could try this for the given case -
KEY_B_ =-1.*ones(max(T.KEY_A),max(T.KEY_B))
KEY_B_(sub2ind(size(KEY_B_),T.KEY_A,T.KEY_B)) = T.KEY_C
T1 = array2table(KEY_B_)
Output for the edited input -
T1 =
KEY_B_1 KEY_B_2 KEY_B_3 KEY_B_4 KEY_B_5
_________ ________ _______ _______ _______
0.45054 0.083821 0.22898 0.91334 -1
0.15238 0.82582 0.53834 0.99613 -1
0.078176 0.44268 0.10665 0.9619 -1
0.0046342 0.77491 0.8173 0.86869 1
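A minimal sketch of a variant that keeps KEY_A as the first column and does not require the keys to be contiguous positive integers. It swaps sub2ind for unique plus accumarray, and the small table and the names a_vals, b_vals, M and OT are made up for illustration. It assumes numeric integer keys and that each (KEY_A, KEY_B) pair occurs at most once:

```matlab
% Hypothetical sample data with gaps in both keys
T = table([1;1;2;2;4], [1;2;1;3;5], [0.1;0.2;0.3;0.4;1], ...
    'VariableNames', {'KEY_A','KEY_B','KEY_C'});

% Map each key to its position among the sorted unique key values
[a_vals, ~, a_idx] = unique(T.KEY_A);
[b_vals, ~, b_idx] = unique(T.KEY_B);

% Scatter the values into a matrix; cells with no entry get the default -1
M = accumarray([a_idx, b_idx], T.KEY_C, [numel(a_vals), numel(b_vals)], [], -1);

% Build column names such as KEY_B_1, KEY_B_2, ... (assumes integer keys)
names = arrayfun(@(b) sprintf('KEY_B_%d', b), b_vals, 'UniformOutput', false);
OT = [table(a_vals, 'VariableNames', {'KEY_A'}), ...
      array2table(M, 'VariableNames', names')];
```

If a pair can occur more than once, accumarray sums the duplicates by default, so a different accumulation function would be needed.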
Edit by MadScienceDreams: This answer led me to write the following function, which will smash together pretty much any table based on the input keys. Enjoy!
function OT = flatten_table(T, primary_keys, secondary_keys, value_key, default_value)
%FLATTEN_TABLE Spread the values of a long table into one column per secondary key.
%   The output has one row per unique primary key and one column per unique
%   secondary key. Missing combinations get DEFAULT_VALUE (NaN if omitted).
if nargin < 5
    default_value = {NaN};
end
% Wrap non-cell arguments in cells so they can be handled uniformly
if ~iscell(default_value)
    default_value = {default_value};
end
if ~iscell(primary_keys)
    primary_keys = {primary_keys};
end
if ~iscell(secondary_keys)
    secondary_keys = {secondary_keys};
end
if ~iscell(value_key)
    value_key = {value_key};
end
% Map every row of T to the index of its primary and secondary key
primary_key_values = unique(T(:,primary_keys));
num_primary = size(primary_key_values,1);
[~,primary_key_map] = ismember(T(:,primary_keys),primary_key_values);
secondary_key_values = unique(T(:,secondary_keys));
num_secondary = size(secondary_key_values,1);
[~,secondary_key_map] = ismember(T(:,secondary_keys),secondary_key_values);
% Collect the values as one cell per row
try
    values = num2cell(T{:,value_key},2);
catch
    values = num2cell(table2cell(T(:,value_key)),2);
end
if ~iscell(values)
    values = num2cell(values);
end
% Start from the default value and fill in the entries that exist
OT = repmat(default_value,num_primary,num_secondary);
OT(sub2ind(size(OT),primary_key_map,secondary_key_map)) = values;
% Build column labels of the form <secondary key name>_<key value>
label_array = num2cell(cellfun(@(x,y)[x '_' mat2str(y)],...
    repmat(secondary_keys,size(secondary_key_values,1),1),...
    table2cell(secondary_key_values),'UniformOutput',false),1);
label_array = strcat(label_array{:});
OT = [primary_key_values,cell2table(OT,'VariableNames',label_array)];
end

Related

How do I expand a range of numbers in MATLAB

Let's say I have this range of numbers and I want to expand these intervals. What is going wrong with my code here? The answer I am getting isn't correct :(
intervals are only represented with -
each 'thing' is separated by ;
I would like the output to be:
-6 -3 -2 -1 3 4 5 7 8 9 10 11 14 15 17 18 19 20
range_expansion('-6;-3--1;3-5;7-11;14;15;17-20 ')
function L=range_expansion(S)
% Range expansion
if nargin < 1;
S='[]';
end
if all(isnumeric(S) | (S=='-') | (S==',') | isspace(S))
error 'invalid input';
end
ixr = find(isnumeric(S(1:end-1)) & S(2:end) == '-')+1;
S(ixr)=':';
S=['[',S,']'];
L=eval(S) ;
end
ans =
-6 -2 -2 -4 14 15 -3
You can use regexprep to replace ; by , and the - characters that define ranges by :. Those - characters are identified by being preceded by a digit. The result is a string that can be transformed into the desired output using str2num. However, since this function evaluates the string, for safety it is first checked that the string only contains the allowed characters:
in = '-6;-3--1;3-5;7-11;14;15;17-20 '; % example
assert(all(ismember(in, '0123456789 ,;-')), 'Characters not allowed') % safety check
out = str2num(regexprep(in, {'(?<=\d)-' ';'}, {':' ','})); % replace and evaluate
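If evaluating the string is a concern, the same expansion can be sketched without str2num or eval. This is an alternative technique, not the approach above: each ;-separated item is parsed with sscanf, which reads one integer for a single value or two integers for a range. The names parts and bounds are illustrative, and the sketch assumes every item is well formed:

```matlab
in = '-6;-3--1;3-5;7-11;14;15;17-20 ';   % example
parts = strsplit(strtrim(in), ';');       % one item per cell
out = [];
for k = 1:numel(parts)
    % '%d-%d' yields [lo; hi] for a range such as '-3--1',
    % and a single number for an item such as '14' or '-6'
    bounds = sscanf(parts{k}, '%d-%d');
    out = [out, bounds(1):bounds(end)]; %#ok<AGROW>
end
```

For the example string this produces the sequence requested in the question, from -6 up to 20.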

Table sort by month

I have a table in MATLAB with attributes in the first three columns and data from the fourth column onwards. I was trying to sort the entire table based on the first three columns. However, one of the columns (Column C) contains months ('January', 'February' ...etc). The sortrows function would only let me choose 'ascend' or 'descend' but not a custom option to sort by month. Any help would be greatly appreciated. Below is the code I used.
sortrows(Table, {'Column A','Column B','Column C'} , {'ascend' , 'ascend' , '???' } )
As @AnonSubmitter85 suggested, the best thing you can do is to convert your month names to numeric values from 1 (January) to 12 (December) as follows:
c = {
7 1 'February';
1 0 'April';
2 1 'December';
2 1 'January';
5 1 'January';
};
t = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
t.ColumnC = month(datenum(t.ColumnC,'mmmm'));
This lets you sort ColumnC with a standard criterion too (ascending, in this example):
t = sortrows(t,{'ColumnA' 'ColumnB' 'ColumnC'},{'ascend', 'ascend', 'ascend'});
If, for any reason that is unknown to us, you are forced to keep your months as literals, you can use a workaround that consists in sorting a clone of the table using the approach described above, and then applying to it the resulting indices:
c = {
7 1 'February';
1 0 'April';
2 1 'December';
2 1 'January';
5 1 'January';
};
t_original = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
t_clone = t_original;
t_clone.ColumnC = month(datenum(t_clone.ColumnC,'mmmm'));
[~,idx] = sortrows(t_clone,{'ColumnA' 'ColumnB' 'ColumnC'},{'ascend', 'ascend', 'ascend'});
t_original = t_original(idx,:);
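Another way to keep the month names visible is to store them as an ordinal categorical whose category order is the calendar order, so that sortrows follows that order directly. This is a sketch of a different technique from the one above, assuming R2013b or newer; the variable name monthNames is made up for illustration:

```matlab
monthNames = {'January','February','March','April','May','June', ...
              'July','August','September','October','November','December'};
c = {
    7 1 'February';
    1 0 'April';
    2 1 'December';
    2 1 'January';
    5 1 'January';
    };
t = cell2table(c,'VariableNames',{'ColumnA' 'ColumnB' 'ColumnC'});
% Categories are ordered as listed in monthNames, not alphabetically
t.ColumnC = categorical(t.ColumnC, monthNames, 'Ordinal', true);
t = sortrows(t,{'ColumnA' 'ColumnB' 'ColumnC'});
```

The column still displays as month names, and cellstr(t.ColumnC) converts it back to a cell array of strings if needed.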

Finding Duplicate string values in two cell array 22124x1

I have a 22124x1 cell array that contains duplicate values. I want to know how many times each value is duplicated, and the indices of the duplicates.
The first cell array, Datacell, contains these values:
'221853_s_at'
'221971_x_at'
'221971_x_at'
'221971_x_at'
'221971_x_at'
'222031_at'
'222031_at'
'31637_s_at'
'37796_at'
'38340_at'
symbol cell:
'OR1D4 '
' OR1D5'
' UTP14C'
'GTF2H2 '
'ZNF324B '
' LOC644504'
'JMJD7 '
'ZNF324B '
'JMJD7-PLA2G4B'
' OR2A5 '
'OR1D4 '
For example, I want the output for the first cell array to look like this:
ID duplicated index
'221853_s_at' 1 1
'221971_x_at' 4 {2:5,1}
I tried to use unique but it does not work. Any help will be highly appreciated
Generating the indices in a visually pleasing manner isn't necessarily a trivial exercise. It's made simpler if you assume d is sorted.
An alternative utilizing accumarray:
d = {'221853_s_at'; '221971_x_at'; '221971_x_at'; '221971_x_at'; '221971_x_at'; ...
     '222031_at'; '222031_at'; '31637_s_at'; '37796_at'; '38340_at' ...
    };
d = sort(d); % Sort to make indices easier
% Find unique strings and their locations
[uniquestrings, ~, stringbin] = unique(d);
counts = accumarray(stringbin, 1);
repeatidx = find(counts - 1 > 0);
repeatedstrings = uniquestrings(repeatidx);
repeatcounts = counts(repeatidx) - 1;
% Find where each string's run starts; the trailing true adds numel(d)+1 as a
% sentinel so the end index also works when the last string is a repeat
startidx = find([true; diff(stringbin) > 0; true]);
repeatstart = startidx(repeatidx);
repeatend = startidx(repeatidx + 1) - 1;
% Generate table, requires R2013b or newer
t = table(repeatedstrings, repeatcounts, repeatstart, repeatend, ...
    'VariableNames', {'ID', 'Duplicated', 'StringStart', 'StringEnd'} ...
    );
Which yields:
t =
ID Duplicated StringStart StringEnd
_____________ __________ ___________ _________
'221971_x_at' 3 2 5
'222031_at' 1 6 7
d = { '221853_s_at'
'221971_x_at'
'221971_x_at'
'221971_x_at'
'221971_x_at'
'222031_at'
'222031_at'
'31637_s_at'
'37796_at'
'38340_at'};
[ids,ia,ic]=unique(d);
ids has the unique strings
ia has an index corresponding to an instance of the unique string within d
ic has an index corresponding to which entry in ids is in that index within d
[ncnt] = hist(ic,1:numel(ids)) - 1; % minus 1 since you only want duplicates
ncnt =
0 3 1 0 0 0
Gets you the number of duplicates for
ids =
'221853_s_at'
'221971_x_at'
'222031_at'
'31637_s_at'
'37796_at'
'38340_at'
ic serves as the lookup table for the indices: use find or logical indexing on it. For example, find(ic == 2) returns the positions in d of the second unique string.
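To collect the indices of every unique string in one pass, accumarray can group the row numbers by ic. This is a sketch building on the unique call above; the names counts, idx and result are illustrative, and the table call assumes R2013b or newer:

```matlab
d = {'221853_s_at'; '221971_x_at'; '221971_x_at'; '221971_x_at'; '221971_x_at'; ...
     '222031_at'; '222031_at'; '31637_s_at'; '37796_at'; '38340_at'};
[ids, ~, ic] = unique(d);
% Number of occurrences of each unique string (subtract 1 for duplicates only)
counts = accumarray(ic, 1);
% For each unique string, the row numbers in d where it occurs
idx = accumarray(ic, (1:numel(d))', [], @(v) {sort(v)'});
result = table(ids, counts, idx, 'VariableNames', {'ID', 'Count', 'Index'});
```

Here idx{2} holds the rows of '221971_x_at' and idx{3} those of '222031_at'. Unlike the sorted approach, the indices refer to the original order of d.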

Detect cell entries in MATLAB Table

I have a Matlab table (the new 'Table' class), let's call it A:
A=table([1;2;3],{'A';'B';'C'})
As you can see, some of the columns are double, some are cell.
I'm trying to figure out which ones are cells.
For some reason, there is no A.Properties.class I can use, and I can't seem to call iscell on it.
What's the "Matlab" way of doing this? Do I have to loop through each column of the table to figure out its class?
One approach -
out = cellfun(@(x) iscell(getfield(A,x)),A.Properties.VariableNames)
Or, a better way would be to access the fields (variables) dynamically like so -
out = cellfun(@(x) iscell(A.(x)), A.Properties.VariableNames)
Sample runs:
Run #1 -
A=table([1;2;3],{4;5;6})
A =
Var1 Var2
____ ____
1 [4]
2 [5]
3 [6]
out =
0 1
Run #2 -
>> A=table([1;2;3],{'A';'B';'C'})
A =
Var1 Var2
____ ____
1 'A'
2 'B'
3 'C'
out =
0 1
Run #3 -
>> A=table([1;2;3],{4;5;6},{[99];'a';'b'},{'m';'n';'p'})
A =
Var1 Var2 Var3 Var4
____ ____ ____ ____
1 [4] [99] 'm'
2 [5] 'a' 'n'
3 [6] 'b' 'p'
>> out
out =
0 1 1 1
You could test with iscell(A.Var2) if the second variable is of type cell. More generally, you could reference columns by their index:
for k = 1 : width(A)
disp(iscell(A.(k)))
end
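The per-column check can also be written without an explicit loop using varfun, which applies a function to every table variable. A minimal sketch, assuming R2013b or newer; the names is_cell and classes are illustrative:

```matlab
A = table([1;2;3],{'A';'B';'C'});
% One logical per variable: true where the column is a cell array
is_cell = varfun(@iscell, A, 'OutputFormat', 'uniform');
% Class name of every variable, returned as a cell array of strings
classes = varfun(@class, A, 'OutputFormat', 'cell');
```

The 'uniform' output format requires the function to return a scalar for each variable, which iscell does.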

'load data' issue in winbugs (bayesian hierarchical)

I have a hierarchical linear model in WinBUGS. The data are longitudinal and fall into three categories (red = 1, blue = 2, white = 3).
k (total number of observations) = 280
Structure of the data is as follows:
N[] T[] logs[] logp[] cat[] rank[]
1 1 4.2 8.9 1 1
1 2 4.2 8.1 1 2
1 3 3.5 9.2 1 1
2 1 4.1 7.5 1 2
2 2 4.5 6.5 1 2
3 1 5.1 6.6 2 4
3 2 6.2 6.8 3 7
#N = school
#T = time
#logs = log(score)
#logp = log(average hours of inputs)
#rank - rank of school
#cat = section red, section blue, section white in school
My model is syntactically correct, but when I try to load the data, I get the error 'expected square bracket at the end]'.
model {
# N brands
# T time periods
for (k in 1:K){
for (i in 1:N) {
for (t in 1:T) {
logs[k,i,t] ~ dnorm(mu[k,i,t], tau)
mu[k,i,t] <- bcons +bprice*(logp[t] - logpricebar)
+ brank[cat[t]]*(rank[t] - rankbar)
}
}
}
# C categories
for (c in 1:C) {
brank[c] ~ dnorm(beta, taub)}
# priors
bcons ~ dnorm(0,1.0E-6)
bprice ~ dnorm(0,1.0E-6)
bad ~ dnorm(0,1.0E-6)
beta ~ dnorm(0,1.0E-6)
tau ~ dgamma(0.001,0.001)
taub ~dgamma(0.001,0.001)
}
I follow the standard process for loading data: I select N and then press 'load data' in the dialogue box.
Can someone help me figure out the issue here?