Deleting rows with specific rules - matlab

I got a 20*3 cell array and I need to delete the rows contains "137", "2" and "n:T"
Origin data:
'T' '' ''
'NP(*)' '' ''
[ 137] '' ''
[ 2] '' ''
'ARE' 'and' 'NP(FCC_A1#1)'
'' '' '1:T'
[ 1200] [0.7052] ''
[1.2051e+03] [0.7076] ''
'ARE' 'and' 'NP(FCC_A1#3)'
'' '' '2:T'
[ 1200] [0.0673] ''
[1.2051e+03] [0.0671] ''
'ARE' 'and' 'NP(M23C6)'
'' '' '3:T'
[ 1200] [0.2275] ''
[1.2051e+03] [0.2253] ''
[ 137] '' ''
[ 2] '' ''
And I want it to be like
'T' '' ''
'NP(*)' '' ''
'ARE' 'and' 'NP(FCC_A1#1)'
[ 1200] [0.7052] ''
[1.2051e+03] [0.7076] ''
'ARE' 'and' 'NP(FCC_A1#3)'
[ 1200] [0.0673] ''
[1.2051e+03] [0.0671] ''
'ARE' 'and' 'NP(M23C6)'
[ 1200] [0.2275] ''
[1.2051e+03] [0.2253] ''
I've tried regexp and strcmp and they don't work well. Plus the cell array also hard to deal with. Can anyone help?
Thank you in advance.

If you can somehow read your original data so that all cells are strings or empty arrays (not numeric values), you can do it with strcmp and regexprep:
% The variable 'data' is a 2D-cell array of strings or empty arrays
datarep = regexprep(data,'^\d+:T','2'); % replace 'n:T' with '2' for convenience
remove1 = strcmp('2',datarep); % this takes care of '2' and 'n:T'
remove2 = strcmp('137',datarep); % this takes care of '137'
rows_keep = find(~sum(remove1|remove2,2)); % rows that will be kept
solution = data(rows_keep,:)
For example, with this data
'aa' 'bb' 'cc'
'dd' 'dd' '2'
'137' 'dd' 'dd'
'dd' 'dd' '11:T'
'1:T' '1:137' 'dd'
'dd' '' []
the result in the variable solution is
'aa' 'bb' 'cc'
'dd' '' []

I just tried the following codes on my desktop and it seems to do the trick. I made a as the cell array you had.
L = size(a, 1);
mask = false(L, 1);
for ii = 1:L
if isnumeric(a{ii, 1}) && (a{ii, 1} == 137 || a{ii, 1} == 2)
mask(ii) = true;
elseif ~isempty(a{ii, 3}) && strcmp(a{ii, 3}(end-1:end), ':T')
mask(ii) = true;
end
end
b = a(~mask, :)
Now, b should be the cell array you wanted. Basically, I created a logical mask that indicates the position of rows that satisfy the rules, then use the inverse of it to call out the rows.

Here is another simple option:
%Anonymous function that checks if a cell is equal to 173 or to 2 or fits the '*:T*' pattern
Eq137or2 = #(x) sum(x == 137 | x == 2) | sum(strfind(num2str(x), ':T') > 1)
%Use the anonymous functions to find the rows you don't want
mask = = sum(cellfun(Eq137or2, a),2)
%Remove the unwanted rows
a(~mask, :)

Related

Vectorize IF statement

I am trying to vectorize an if statement in Matlab and I am not sure how to do it. I want to assign a 'N' for positive values and 'S' for negative values. I want to avoid a for loop but here is my code:
LatDD = [23.0,12.3,-43.2,9.9,-40.7];
LatDir = ['' '' '' '' ''];
if (LatDD < 0)
LatDir = 'S'
else
LatDir = 'N'
end
Obviously this fails to do what I want because it really only checks the first element of LatDD. I could easily do a for loop but I want it to be vectorized. I tried logical indexing but all that got me was another vector with zeroes or ones which I would have to check with a for loop anyway.
You can use logical indexing here, you just have to do it twice
LatDD = [23.0,12.3,-43.2,9.9,-40.7];
LatDir = ['' '' '' '' ''];
LatDir(LatDD < 0) = 'S';
LatDir(LatDD >= 0) = 'N';
Since you have a binary choice here, you could even skip a step by prefilling LatDir with all 'N' and just changing the ones corresponding to negative LatDD values to 'S'
LatDD = [23.0,12.3,-43.2,9.9,-40.7];
LatDir = ['N' 'N' 'N' 'N' 'N'];
LatDir(LatDD < 0) = 'S';
Here's a one-liner -
char('S'*(LatDD<0) + 'N'*(~(LatDD<0)))
Sample run -
>> LatDD = [23.0,12.3,-43.2,9.9,-40.7];
>> LatDir = ['' '' '' '' ''];
>> char('S'*(LatDD<0) + 'N'*(~(LatDD<0)))
ans =
NNSNS

extracting data from excel to matlab

Suppose i have an excel file (data.xlsx) , which contains the following data.
Name age
Tom 43
Dick 24
Harry 32
Now i want to extract the data from it and make 2 cell array (or matrix) which shall contain
name = ['Tom' ; 'Dick';'Harry'] age = [43;24;32]
i have used xlsread(data.xlsx) , but its only extracting the numerical values ,but i want to obtain both as mentioned above . Please help me out
You have to use additional output arguments from xlread in order to get the text.
I created a dummy Excel file with your data and here is the output (nevermind about the NaNs):
[ndata, text, alldata] = xlsread('DummyExcel.xlsx')
ndata =
43
24
32
text =
'Name' 'Age'
'Tom' ''
'Dick' ''
'Harry' ''
alldata =
[NaN] 'Name' 'Age'
[NaN] 'Tom' [ 43]
[NaN] 'Dick' [ 24]
[NaN] 'Harry' [ 32]
Now if you use this:
text{2:end,1}
you get
ans =
Tom
ans =
Dick
ans =
Harry
You can use the function called importdata.
Example:
%Import Data
filename = 'yourfilename.xlsx';
delimiterIn = ' ';
headerlinesIn = 1;
A = importdata(filename,delimiterIn,headerlinesIn);
This will help to take both the text data and numerical data. Textdata will be under A.textdata and numerical data will be under A.data.

MATLAB: textscan to parse irregular text, trouble debugging format specifier

I've been browsing stack overflow and the mathworks website trying to come up with a solution for reading an irregularly formatted text file into MATLAB using textscan but have yet to figure out a good solution.
The format of the text file looks as such:
// Reference=MNI
// Citation=Beauregard M, 1998
// Condition=Primed - Unprimed Semantic Category Decision
// Domain=Semantics
// Modality=Visual
// Subjects=13
-55 -25 -23
33 -9 -20
// Citation=Beauregard M, 1998
// Condition=Unprimed Semantic Category Decision - Baseline
// Domain=Semantics
// Modality=Visual
// Subjects=13
0 -73 9
-25 -59 47
0 -14 59
8 -18 63
-21 -90 -11
-24 -4 62
24 -93 -6
-21 15 47
-35 -26 -21
9 13 44
// Citation=Binder J R, 1996
// Condition=Words > Tones - Passive
// Domain=Language Perception
// Modality=Auditory
// Subjects=12
-58.73 -12.05 -4.61
I would like to end up with a cell array that looks like this
{nx3 double} {nx1 cellstr} {nx1 cellstr} {nx1 cellstr} {nx1 double}
Where the first element in the array are the 3d coordinates, the second element the citation, the third element the condition, the fourth element the domain, the fifth element the modality and the sixth element the number of subjects.
I would then like to use these cell array to organize the data into a structure to allow for easy indexing of the coordinates by each of the features I extracted from the text file.
I've tried a bunch of things but have only been able to extract out the coordinates as a string and the feature as a single cell array.
Here is how far I have gotten after searching through stack overflow and the mathworks website:
fid = fopen(fullfile(path2proj,path2loc),'r');
data = textscan(fid,'%s %s %s','HeaderLines',1,...
'delimiter',{...
sprintf('// '),...
'Citation=',...
'Condition=',...
'Domain=',...
'Modality=',...
'Subjects='});
I get the following output with this code:
data =
{16470x1 cell} {16470x1 cell} {16470x1 cell}
data{1}(1:20)
ans =
''
''
''
''
''
'-55 -25 -23'
'33 -9 -20'
''
''
''
''
''
'0 -73 9'
'-25 -59 47'
'0 -14 59'
'8 -18 63'
'-21 -90 -11'
'-24 -4 62'
'24 -93 -6'
'-21 15 47'
data{2}(1:20)
ans =
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
data{3}(1:20)
ans =
'Beauregard M, 1998'
'Primed - Unprimed Semantic Category Decision'
'Semantics'
'Visual'
'13'
''
''
'Beauregard M, 1998'
'Unprimed Semantic Category Decision - Baseline'
'Semantics'
'Visual'
'13'
''
''
''
''
''
''
''
''
Although I can work with the data in this format, it would be nice to understand how to correctly right a format specifier to extract out piece of data into it's own cell array. Does anyone have any dieas?
Assuming that Reference is only in the first line, you could do the following to obtained the values you want from each section Citation section.
% read the file and split it into sections based on Citation
filecontents = strsplit(fileread('data.txt'), '// Citation');
% iterate through section and extract desired info from each
% section. We start from i=2, as for i=1 we have 'Reference' line.
for i = 2:numel(filecontents)
lines = regexp(filecontents{i}, '\n', 'split');
% remove empty lines
lines(find(strcmp(lines, ''))) = [];
% get values of the fields
citation = lines{1};
condition = get_value(lines{2}, 'Condition');
domain = get_value(lines{3}, 'Domain');
modality = get_value(lines{4}, 'Modality');
subjects = get_value(lines{5}, 'Subjects');
coordinates = cellfun(#str2num, lines(6:end), 'UniformOutput', 0)';
% now you can save in some global cell,
% display or process the extracted values as you please.
end
where get_value is:
function value = get_value(line, search_for)
[tokens, ~] = regexp(line, [search_for, '=(.+)'],'tokens','match');
value = tokens{1};
Hope this helps.

Extract Cumulative N-grams Matlab

I have an array of words:
x=['ae' ; 'be' ; 'ce' ; 'de' ; 'ee' ; 'fe']
I would like to extract sets of characters. So assume each set has N = 2 words, how can I go about getting return values that look like this
'ae' 'be'
'be' 'ce'
'ce' 'de'
'de' 'ee'
'ee' 'fe'
So if N = 2, I get back a matrix where each row contains pairs of the current and previous characters. If N=3 i will get back current and previous 2 chars for each row. I want to avoid loops if possible.
Any ideas?
You can use the Circulant Matrix Maltlab provides, truncate it as needed and use it as an index vector:
x = {'ae' ; 'be' ; 'ce' ; 'de' ; 'ee' ; 'fe'}
N = 3;
n = numel(x);
A = gallery('circul',n:-1:1)
B = fliplr( A(1:n-N+1,n-N+1:end) )
result = x(B)
or a little shorter:
A = fliplr( gallery('circul',n:-1:1) )
result = x( A(1:n-N+1,1:n-N) )
or another option using the hankel-Matrix:
A = hankel(1:n,1:N)
result = x( A(1:n-N+1,:) )
gives:
result =
'ae' 'be' 'ce'
'be' 'ce' 'de'
'ce' 'de' 'ee'
'de' 'ee' 'fe'

removing duplicates - ** only when the duplicates occur in sequence

I would like to do something similar to the following, except I would only like to remove 'g' and'g' because they are the duplicates that occur one after each other. I would also like to keep the sequence the same.
Any help would be appreciated!!!
I have this cell array in MATLAB:
y = { 'd' 'f' 'a' 'g' 'g' 'w' 'a' 'h'}
ans =
'd' 'f' 'a' 'w' 'a' 'h'
There was an error in my first answer (below) when used on multiple duplicates (thanks grantnz). Here's an updated version:
>> y = { 'd' 'f' 'a' 'g' 'g' 'w' 'a' 'h' 'h' 'i' 'i' 'j'};
>> i = find(diff(char(y)) == 0);
>> y([i; i+1]) = []
y =
'd' 'f' 'a' 'w' 'a' 'j'
OLD ANSWER
If your "cell vector" always contains only single character elements you can do the following:
>> y = { 'd' 'f' 'a' 'g' 'g' 'w' 'a' 'h'}
y =
'd' 'f' 'a' 'g' 'g' 'w' 'a' 'h'
>> y(find(diff(char(y)) == 0) + [0 1]) = []
y =
'd' 'f' 'a' 'w' 'a' 'h'
Look at it like this: you want to keep an element if and only if either (1) it's the first element or (2) its predecessor is different from it and either (3) it's the last element or (4) its successor is different from it. So:
y([true ~strcmp(y(1:(end-1)),y(2:end))] & [~strcmp(y(1:(end-1)),y(2:end)) true])
or, perhaps better,
different = ~strcmp(y(1:(end-1)),y(2:end));
result = y([true different] & [different true]);
This should work:
y([ diff([y{:}]) ~= 0 true])
or slightly more compactly
y(diff([y{:}]) == 0) = []
Correction : The above wont remove both the duplicates
ind = diff([y{:}]) == 0;
y([ind 0] | [0 ind]) = []
BTW, this works even if there are multiple duplicate sequences
eg,
y = { 'd' 'f' 'a' 'g' 'g' 'w' 'a' 'h' 'h'};
ind = diff([y{:}]) == 0;
y([ind 0] | [0 ind]) = []
y =
'd' 'f' 'a' 'w' 'a'