Multiple queries (OR statements) using find - matlab

Currently to perform multiple queries using find, I invoke each query individually separated by |
index = find(strcmp(data{:,{'type'}},'A') | strcmp(data{:,{'type'}},'B') | strcmp(data{:,{'type'}},'C') | strcmp(data{:,{'type'}},'D'));
This finds all rows where the field 'type' contains either A, B, C or D.
data is held in a table, hence the usage of {}.
Is there a more concise way of doing this without the need to specify the query in full each time?

You could use ismember instead of multiple uses of strcmp.
index = find(ismember(data{:,{'type'}}, {'A','B','C','D'}));
An alternative (because ismember will probably be slower than multiple uses of strcmp) would be to factor out the repeated code:
x = data{:, {'type'}}; % extract the column once
index = find(strcmp(x,'A') | strcmp(x,'B') | strcmp(x,'C') | strcmp(x,'D'));
You could also use multiple lines for readability:
x = data{:, {'type'}}; % extract the column once
index = find(strcmp(x,'A') ...
| strcmp(x,'B') ...
| strcmp(x,'C') ...
| strcmp(x,'D'));

Related

Linear Regression with Apache Beam

How might one go about fitting a large number of linear regressions in a beam pipeline? I have a large csv, and I want to normalize every column (about 500) according to two columns A and B. That is, I would like to get standard residuals for X ~ A + B for each column in the csv X.
That's an interesting use case. You can do something like so:
import apache_beam as beam

INDEX_A = ...  # Something (placeholder: column index of A)
INDEX_B = ...  # Something else (placeholder: column index of B)

parsed_rows = (pipeline
               | beam.io.ReadFromText(my_csv)
               | beam.Map(parse_each_line))

def column_paired_rows(row):
    # enumerate gives (column index, value) for a parsed row
    for idx, val in enumerate(row):
        if idx in (INDEX_A, INDEX_B):
            continue
        # Yield the values keyed with the independent + dependent variable indices
        yield ((INDEX_A, idx), {'independent_var_value': row[INDEX_A],
                                'independent_var_idx': INDEX_A,
                                'dependent_var_value': val,
                                'dependent_var_idx': idx})
        yield ((INDEX_B, idx), {'independent_var_value': row[INDEX_B],
                                'independent_var_idx': INDEX_B,
                                'dependent_var_value': val,
                                'dependent_var_idx': idx})

column_pairs = parsed_rows | beam.FlatMap(column_paired_rows) | beam.GroupByKey()
The column_pairs PCollection will group all your elements by independent, dependent variable pairs, and then you can run the analysis.
def perform_linear_regression(elm):
    key = elm[0]  # KEY is a tuple with (independent variable index, dependent variable index)
    values = elm[1]  # This is an iterable with the data points that you need.
    pairs = [(v['independent_var_value'], v['dependent_var_value']) for v in values]
    model = linear_regression(pairs)
    return (key, model)

models = column_pairs | beam.Map(perform_linear_regression)
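The answer leaves `linear_regression` undefined. A minimal sketch of what it could look like is below, assuming plain ordinary least squares on one predictor at a time (which matches the pairwise keys used above, not a joint X ~ A + B fit). The function body and the returned dictionary layout are assumptions for illustration, not part of the original answer.

```python
def linear_regression(pairs):
    """Ordinary least squares fit of y = intercept + slope * x.

    `pairs` is a list of (x, y) tuples. The return format is a
    hypothetical choice made for this sketch.
    """
    n = len(pairs)
    if n < 2:
        raise ValueError("need at least two data points")
    mean_x = sum(x for x, _ in pairs) / n
    mean_y = sum(y for _, y in pairs) / n
    sxx = sum((x - mean_x) ** 2 for x, _ in pairs)
    if sxx == 0:
        raise ValueError("all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in pairs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residuals are what the question asks for when normalizing a column.
    residuals = [y - (intercept + slope * x) for x, y in pairs]
    return {"slope": slope, "intercept": intercept, "residuals": residuals}


# Small check on points that lie exactly on y = 1 + 2x.
model = linear_regression([(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)])
```

For a joint fit on both A and B, the grouping key would have to be the dependent column alone, so that each group carries the A, B and X values of every row.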
LMK if you'd like me to add further detail

Left Join tables in MATLAB

I tried to find something relevant but to no avail, except this post, which I can't say was helpful.
I have two tables A and B:
A has dimensions 5x5 and non-unique values in LastName:
LastName = {'Smith';'Johnson';'Williams';'Smith';'Williams'};
YearNow= [2010;2010;2010;2010;2010];
Height = [71;69;64;67;64];
Weight = [176;163;131;133;119];
BloodPressure = [124; 109; 125; 117; 122];
A = table(LastName,YearNow,Height,Weight,BloodPressure);
and B has dimensions 3x2 and unique values in LastName:
LastName = {'Smith';'Johnson';'Williams'};
YearBorn= [1950;1975;1965];
B = table(LastName,YearBorn);
I want to create a new column in table A that contains each person's age, obtained by subtracting the matching B.YearBorn from each A.YearNow, so the last column will have the form
A.Age = [60,35,45,60,45];
When I try to use [detect,pos] = ismember(A,B(:,1)); I get an error:
A and B must contain the same variables.
Any help would be appreciated.
Instead of using ismember, which can be quite error-prone as you have to put things in the right order, you could also use Matlab's outerjoin instead:
A = outerjoin(A,B,'Type','Left','MergeKeys',true);
A.Age = A.YearNow - A.YearBorn;
Note that outerjoin changes the row ordering: the result is sorted by the key variable rather than kept in the original order of A. See the official Matlab documentation for all the input arguments.
An additional advantage of outerjoin over ismember: if not all LastNames in table A exist in table B, the ismember approach requires you to pre-allocate the output and to use the first output argument as well.
You need to extract the LastName columns and pass them to ismember. Then you can use the index vector it returns to compute the Age column as follows:
[~, index] = ismember(A.LastName, B.LastName);
A.Age = A.YearNow-B.YearBorn(index);

Fastest type to use for comparing hashes in matlab

I have a table in Matlab with some columns representing 128 bit hashes.
I would like to match rows, to one or more rows, based on these hashes.
Currently, the hashes are represented as hexadecimal strings, and compared with strcmp(). Still, it takes many seconds to process the table.
What is the fastest way to compare two hashes in matlab?
I have tried turning them into categorical variables, but that is much slower. Matlab as far as I know does not have a 128 bit numerical type. nominal and ordinal types are deprecated.
Are there any others that could work?
The code below is analogous to what I am doing:
nodetype = { 'type1'; 'type2'; 'type1'; 'type2' };
hash = {'d285e87940fb9383ec5e983041f8d7a6'; 'd285e87940fb9383ec5e983041f8d7a6'; 'ec9add3cf0f67f443d5820708adc0485'; '5dbdfa232b5b61c8b1e8c698a64e1cc9' };
entries = table(categorical(nodetype),hash,'VariableNames',{'type','hash'});
%nodes to match. filter by type or some other way so rows don't match to
%themselves.
A = entries(entries.type=='type1',:);
B = entries(entries.type=='type2',:);
%pick a node/row with a hash to find all counterparts of
row_to_match_in_A = A(1,:);
matching_rows_in_B = B(strcmp(B.hash,row_to_match_in_A.hash),:);
% do stuff with matching rows...
disp(matching_rows_in_B);
The hash strings are faithful representations of what I am using, but they are not necessarily read or stored as strings in the original source. They are just converted for this purpose because it's the fastest way to do the comparison.
Optimization is nice, if you need it. Try it out yourself and measure the performance gain for relevant test cases.
Some suggestions:
Sorted arrays are easier/faster to search
Matlab's default numbers are double, but you can also construct integers. Why not use 2 uint64's instead of the 128bit column? First search for the upper 64bit, then for the lower; or even better: use ismember with the row option and put your hashes in rows:
A = uint64([0 0;
0 1;
1 0;
1 1;
2 0;
2 1]);
srch = uint64([1 1;
0 1]);
[ismatch, loc] = ismember(srch, A, 'rows')
> loc =
4
2
Look into the compare functions you use (e.g. edit ismember) and strip out unnecessary operations (e.g. sort) and safety checks that you know in advance won't pose a problem, like this solution does. Or, if you intend to call a search function multiple times, sort in advance and skip the check/sort in the search function later on.
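The idea of converting each hash once into two 64-bit integers can be sketched as follows. This is written in Python rather than MATLAB, purely to illustrate the representation; the helper names `split_hash` and `build_index` are hypothetical, and the dictionary lookup is a stand-in for `ismember(..., 'rows')`, not a reproduction of it.

```python
from collections import defaultdict


def split_hash(hex_hash):
    """Split a 32-character hex string into (upper, lower) 64-bit integers."""
    if len(hex_hash) != 32:
        raise ValueError("expected a 128-bit hash as 32 hex characters")
    return int(hex_hash[:16], 16), int(hex_hash[16:], 16)


def build_index(hashes):
    """Map each (upper, lower) pair to the list of row numbers holding it."""
    index = defaultdict(list)
    for row, hex_hash in enumerate(hashes):
        index[split_hash(hex_hash)].append(row)
    return index


# Hashes of the two 'type2' rows from the question's example table.
hashes_b = ['d285e87940fb9383ec5e983041f8d7a6',
            '5dbdfa232b5b61c8b1e8c698a64e1cc9']
index_b = build_index(hashes_b)

# Look up the hash of the first 'type1' row (0-based row numbers in B).
matches = index_b.get(split_hash('d285e87940fb9383ec5e983041f8d7a6'), [])
```

The conversion cost is paid once per row; whether that beats strcmp in MATLAB has to be measured on realistic data, as the answer says.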

if-statement: how to rewrite in matlab

I'm a newbie in Matlab. I have taken over some working code with a complex if-statement condition and need to rewrite it. This code should prepare some initial data to solve an optimization task. The if-statement condition looks like:
x=[784.8 959.2 468 572 279 341 139.5 170.5 76.5 93.5 45 55];
a=nchoosek(x,6); % all possible combinations from 6 elements of x
n=length(a);
q=[];
for i=1:n
    if( ((a(i,1)==x(1))  & (a(i,2)==x(2)))  | ...
        ((a(i,1)==x(3))  & (a(i,2)==x(4)))  | ...
        ((a(i,1)==x(5))  & (a(i,2)==x(6)))  | ...
        ((a(i,1)==x(7))  & (a(i,2)==x(8)))  | ...
        ((a(i,2)==x(3))  & (a(i,3)==x(4)))  | ...
        ((a(i,2)==x(5))  & (a(i,3)==x(6)))  | ...
        ((a(i,2)==x(7))  & (a(i,3)==x(8)))  | ...
        ((a(i,3)==x(3))  & (a(i,4)==x(4)))  | ...
        ((a(i,3)==x(5))  & (a(i,4)==x(6)))  | ...
        ((a(i,3)==x(7))  & (a(i,4)==x(8)))  | ...
        ((a(i,3)==x(9))  & (a(i,4)==x(10))) | ...
        ((a(i,4)==x(5))  & (a(i,5)==x(6)))  | ...
        ((a(i,4)==x(7))  & (a(i,5)==x(8)))  | ...
        ((a(i,4)==x(9))  & (a(i,5)==x(10))) | ...
        ((a(i,5)==x(5))  & (a(i,6)==x(6)))  | ...
        ((a(i,5)==x(7))  & (a(i,6)==x(8)))  | ...
        ((a(i,5)==x(9))  & (a(i,6)==x(10))) | ...
        ((a(i,5)==x(11)) & (a(i,6)==x(12))) )
q(i,:)=a(i,:);
end;
end;
q;
R1=a-q;
R1(~any(R1,2),:) = [];
R1(:, ~any(R1)) = [];
Question: Could anyone give an idea how to rewrite if-statement to improve readability of code?
If I understood you correctly, what the convoluted if statement basically says is:
If "x(1) x(2)" or "x(3) x(4)" or ... "x(11) x(12)" appears anywhere consecutively in row i
Think about it:
((a(i,1)==x(1)) & (a(i,2)==x(2))) | ((a(i,1)==x(3)) & (a(i,2)==x(4))) |
((a(i,1)==x(5)) & (a(i,2)==x(6))) | ((a(i,1)==x(7)) & (a(i,2)==x(8)))
is no different from:
((a(i,1)==x(1)) & (a(i,2)==x(2))) | ((a(i,1)==x(3)) & (a(i,2)==x(4))) |
((a(i,1)==x(5)) & (a(i,2)==x(6))) | ((a(i,1)==x(7)) & (a(i,2)==x(8))) |
((a(i,1)==x(9)) & (a(i,2)==x(10))) | ((a(i,1)==x(11)) & (a(i,2)==x(12)))
since [x(9) x(10)] and [x(11) x(12)] will never appear at a(i, 1:2) (each row of a keeps the elements in the order of x, and four more elements would have to follow), so the line I added is always false and does not change the result of the chain of ORs. But it makes the logic much easier to understand. The same logic applies to a(i,2:3), a(i,3:4), and so on; complete those cases too and you get the statement I made at the start of this answer.
Then, instead of generating a directly from x, you should generate a from the INDEX of x, i.e. [1:12], as such:
a = nchoosek(1:length(x), 6);
Why? You said x consists of real numbers, and using == on real numbers does not guarantee success, and is a very bad practice in general.
Then, your target becomes:
find if sequence `[1 2]` or `[3 4]` or `[5 6]` ... exists in each line of `a`
which is equivalent to:
find if there is any odd number n followed by n+1
This logic can be represented as:
success = any (mod(a(:,1:end-1), 2) & diff(a,1,2)==1, 2)
Now success(i) is true exactly when your if statement is true for row a(i,:). This method is better than your statement because it is very concise, automatically adapts to different sizes of x and does not need to run in a loop.
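The claimed equivalence can be cross-checked by brute force. The sketch below is in Python rather than MATLAB and works on indices 1 to 12 instead of the values of x. The table `ORIGINAL_CLAUSES` is a transcription of the 18 clauses in the question's if statement, and both function names are made up for this illustration.

```python
from itertools import combinations

# Clauses of the original if statement, written as
# position k -> pair numbers m, meaning the clause tests
# a(i,k) == x(2m-1) and a(i,k+1) == x(2m) (1-based, as in MATLAB).
ORIGINAL_CLAUSES = {
    1: [1, 2, 3, 4],
    2: [2, 3, 4],
    3: [2, 3, 4, 5],
    4: [3, 4, 5],
    5: [3, 4, 5, 6],
}


def original_condition(row):
    """Evaluate the question's chain of ORs on one row of indices."""
    return any(
        row[k - 1] == 2 * m - 1 and row[k] == 2 * m
        for k, pair_numbers in ORIGINAL_CLAUSES.items()
        for m in pair_numbers
    )


def odd_followed_by_next(row):
    """True if some odd n is immediately followed by n + 1."""
    return any(n % 2 == 1 and nxt - n == 1 for n, nxt in zip(row, row[1:]))


# All 6-element combinations of the indices 1..12, like nchoosek(1:12, 6).
rows = list(combinations(range(1, 13), 6))
mismatches = [r for r in rows
              if original_condition(r) != odd_followed_by_next(r)]
```

An empty `mismatches` list means the short rule and the long condition agree on every one of the 924 combinations.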
And if you want to get the actual combination of x values, just do
x(a(i,:)); % Get the ith combination of x
x(a); % Get all combinations of x
x(a(success,:)); % Get all combinations of x that satisfy the requirement.
EDIT:
q = x(a); % q is basically a copy of the values selected by a
q(~success,:) = 0; % except the `non-success` rows are zero
R1 = x(a) - q; % a stores indices, so this gives the same subtraction as a-q in your code

Vectorize filtering on matlab cell structures

I have a huge cell vector cc (size: 1xN) of the form:
cc{1} = {'indexString1', 'str_row1col1', 'str_row1col2' }
cc{2} = {'indexString2', 'str_row2col1', 'bighello', 'str_row1col3' }
cc{3} = {'indexString3','str_row3col1'}
cc{4} = {'indexString4','str_row3col1', 'helloWorld'}
I want to traverse each cell and remove specific cells that contain the word "hello", e.g. cc{4}{3}. Can we do that without for loops while keeping the final structure of cc?
Best,
Thoth.
EDIT: From the answers and comments I have seen that the structure of the cell imposes some limitations, so any other suggestions for storing my data are welcome. I just want to keep together all the cells (e.g. 'str_row1col1', 'str_row1col2') that correspond to the same indexString*n* (e.g. indexString1). I made this edit in case it helps with some final reshape.
Using regular expressions, you can obtain a logical array in which zeros represent occurrences of the word 'hello' somewhere in the nested cell. As @LuisMendo pointed out, it would be much easier to delete the unwanted cells if they were not nested:
clc
clear
cc{1} = {'str_row1col1', 'str_row1col2' };
cc{2} = {'str_row2col1', 'bighello', 'str_row1col3' };
cc{3} = {'str_row3col1'};
cc{4} = {'str_row3col1', 'helloWorld'};
A = (cellfun(@isempty,regexp([cc{:}],'(\w*hello|hello\w*)','match')))
Gives the following array:
A =
1 1 1 0 1 1 1 0
For the rest I think you would need a loop since the nested cells are not all of the same size. Anyhow I hope it helps you a bit.
EDIT Here is what you can do using a for loop. In order to identify words of interest (earth and water as in your comment below), simply add them to the expression passed to regexp. The character | acts as an alternation (OR) operator, so Matlab checks each of the expressions inside the parentheses.
Please refer to this page for more information on regular expressions. It is also possible to control case sensitivity when matching.
Sample code, in which I added strings containing earth and water:
cc{1} = {'str_row1col1', 'earth!superman' 'str_row1col2' 'DummyString'};
cc{2} = {'str_row2col1', 'bighello', 'str_row1col3' };
cc{3} = {'str_row3col1' 'str_row3col3' 'water_batman'};
cc{4} = {'str_row3col1' 'str_row4col2' 'helloWorld'};
cc{5} = {'str_row5_LegoMan' 'str_row5col2' 'AnotherDummyString' 'Useless String' 'BonjourWorld'};
% With a for loop, for example:
FinalCell = cell(size(cc,2),1);
for k = 1:size(cc,2)
    DummyCell = cc{k}; % Use dummy cell for easier indexing
    % This is where you tell Matlab what words/expressions you are looking for
    A = cellfun(@isempty,regexp(cc{k},'(\w*hello|hello\w*|earth|water)','match'));
    DummyCell(~A) = []; % Remove the cells containing the strings/words of interest
    FinalCell{k} = DummyCell;
end
Then you're good to go. Hope that helps!
The closest thing possible I found is:
clear all
cc{1} = {'str_row1col1', 'str_row1col2' };
cc{2} = {'str_row2col1', 'bighello', 'str_row1col3' };
cc{3} = {'str_row3col1'};
cc{4} = {'str_row3col1', 'helloWorld'};
cc1 = [cc{:}];
cc1 = cc1(~strcmp('bighello',cc1));
This reorganizes your array into a one-dimensional array, and it cannot match regular expressions, only whole words.
For a better job I am afraid you have to use for loops.