MATLAB: textscan to parse irregular text, trouble debugging format specifier - matlab

I've been browsing Stack Overflow and the MathWorks website trying to find a way to read an irregularly formatted text file into MATLAB using textscan, but I have yet to find a good solution.
The text file looks like this:
// Reference=MNI
// Citation=Beauregard M, 1998
// Condition=Primed - Unprimed Semantic Category Decision
// Domain=Semantics
// Modality=Visual
// Subjects=13
-55 -25 -23
33 -9 -20
// Citation=Beauregard M, 1998
// Condition=Unprimed Semantic Category Decision - Baseline
// Domain=Semantics
// Modality=Visual
// Subjects=13
0 -73 9
-25 -59 47
0 -14 59
8 -18 63
-21 -90 -11
-24 -4 62
24 -93 -6
-21 15 47
-35 -26 -21
9 13 44
// Citation=Binder J R, 1996
// Condition=Words > Tones - Passive
// Domain=Language Perception
// Modality=Auditory
// Subjects=12
-58.73 -12.05 -4.61
I would like to end up with a cell array that looks like this:
{nx3 double} {nx1 cellstr} {nx1 cellstr} {nx1 cellstr} {nx1 cellstr} {nx1 double}
Here the first element of the array holds the 3D coordinates, the second the citation, the third the condition, the fourth the domain, the fifth the modality, and the sixth the number of subjects.
I would then like to use this cell array to organize the data into a structure, so that the coordinates can easily be indexed by each of the features I extracted from the text file.
I've tried a bunch of things but have only been able to extract the coordinates as strings and the features as a single cell array.
Here is how far I have gotten after searching through Stack Overflow and the MathWorks website:
fid = fopen(fullfile(path2proj,path2loc),'r');
data = textscan(fid,'%s %s %s','HeaderLines',1,...
'delimiter',{...
sprintf('// '),...
'Citation=',...
'Condition=',...
'Domain=',...
'Modality=',...
'Subjects='});
I get the following output with this code:
data =
{16470x1 cell} {16470x1 cell} {16470x1 cell}
data{1}(1:20)
ans =
''
''
''
''
''
'-55 -25 -23'
'33 -9 -20'
''
''
''
''
''
'0 -73 9'
'-25 -59 47'
'0 -14 59'
'8 -18 63'
'-21 -90 -11'
'-24 -4 62'
'24 -93 -6'
'-21 15 47'
data{2}(1:20)
ans =
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
''
data{3}(1:20)
ans =
'Beauregard M, 1998'
'Primed - Unprimed Semantic Category Decision'
'Semantics'
'Visual'
'13'
''
''
'Beauregard M, 1998'
'Unprimed Semantic Category Decision - Baseline'
'Semantics'
'Visual'
'13'
''
''
''
''
''
''
''
''
Although I can work with the data in this format, it would be nice to understand how to correctly write a format specifier that extracts each piece of data into its own cell array. Does anyone have any ideas?

Assuming that Reference appears only in the first line, you could do the following to obtain the values you want from each Citation section.
% read the file and split it into sections based on Citation
filecontents = strsplit(fileread('data.txt'), '// Citation=');
% iterate through the sections and extract the desired info from each.
% We start from i=2, because for i=1 we only have the 'Reference' line.
for i = 2:numel(filecontents)
    % split into lines and trim whitespace (including any trailing \r)
    lines = strtrim(regexp(filecontents{i}, '\n', 'split'));
    % remove empty lines
    lines(cellfun(@isempty, lines)) = [];
    % get values of the fields
    citation = lines{1};
    condition = get_value(lines{2}, 'Condition');
    domain = get_value(lines{3}, 'Domain');
    modality = get_value(lines{4}, 'Modality');
    subjects = str2double(get_value(lines{5}, 'Subjects'));
    % one row per coordinate triple (nx3 double)
    coordinates = cell2mat(cellfun(@str2num, lines(6:end), 'UniformOutput', false)');
    % now you can save in some global cell,
    % display or process the extracted values as you please.
end
where get_value is:
function value = get_value(line, search_for)
% return the text that follows 'search_for=' on the given line
tokens = regexp(line, [search_for, '=(.+)'], 'tokens', 'once');
value = strtrim(tokens{1});
Hope this helps.
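Since the question asks for a structure that can be indexed by feature, here is a minimal sketch of one way to collect the sections into a struct array. It is not part of the answer above: the variable name experiments and the field names are assumptions chosen for illustration, and the sample text stands in for the real file (in practice it would come from fileread).

```matlab
% Sample text standing in for the file contents (assumed layout, copied
% from the question); replace with fileread(...) for a real file.
txt = sprintf(['// Reference=MNI\n' ...
    '// Citation=Beauregard M, 1998\n' ...
    '// Condition=Primed - Unprimed Semantic Category Decision\n' ...
    '// Domain=Semantics\n' ...
    '// Modality=Visual\n' ...
    '// Subjects=13\n' ...
    '-55 -25 -23\n' ...
    '33 -9 -20\n' ...
    '// Citation=Binder J R, 1996\n' ...
    '// Condition=Words > Tones - Passive\n' ...
    '// Domain=Language Perception\n' ...
    '// Modality=Auditory\n' ...
    '// Subjects=12\n' ...
    '-58.73 -12.05 -4.61\n']);

% Split into trimmed, non-empty lines.
rows = strtrim(regexp(txt, '\r?\n', 'split'));
rows(cellfun(@isempty, rows)) = [];

% Empty struct array with hypothetical field names.
experiments = struct('Citation', {}, 'Condition', {}, 'Domain', {}, ...
                     'Modality', {}, 'Subjects', {}, 'Coordinates', {});

for k = 1:numel(rows)
    tok = regexp(rows{k}, '^//\s*(\w+)=(.*)$', 'tokens', 'once');
    if ~isempty(tok)
        key = tok{1};
        val = tok{2};
        switch key
            case 'Citation'
                % every Citation line starts a new record
                experiments(end+1).Citation = val; %#ok<SAGROW>
                experiments(end).Coordinates = zeros(0, 3);
            case {'Condition', 'Domain', 'Modality'}
                experiments(end).(key) = val;
            case 'Subjects'
                experiments(end).Subjects = str2double(val);
            % other keys (such as Reference) are ignored
        end
    else
        % a coordinate line: append one 1x3 row to the current record
        experiments(end).Coordinates(end+1, :) = sscanf(rows{k}, '%f').';
    end
end

% Example of indexing by a feature: all records with visual modality.
visual = experiments(strcmp({experiments.Modality}, 'Visual'));
```

With the records in a struct array, logical indexing on any field (as in the last line) selects the matching coordinate sets.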

Related

How do I expand a range of numbers in MATLAB

Let's say I have this range of numbers and I want to expand the intervals. What is going wrong with my code here? The answer I am getting isn't correct :(
intervals are only represented with -
each 'thing' is separated by ;
I would like the output to be:
-6 -3 -2 -1 3 4 5 7 8 9 10 11 14 15 17 18 19 20
range_expansion('-6;-3--1;3-5;7-11;14;15;17-20 ')
function L=range_expansion(S)
% Range expansion
if nargin < 1;
S='[]';
end
if all(isnumeric(S) | (S=='-') | (S==',') | isspace(S))
error 'invalid input';
end
ixr = find(isnumeric(S(1:end-1)) & S(2:end) == '-')+1;
S(ixr)=':';
S=['[',S,']'];
L=eval(S) ;
end
ans =
-6 -2 -2 -4 14 15 -3
You can use regexprep to replace ; with , and to replace the - characters that define ranges with :. Those - characters are identified by being preceded by a digit. The result is a string that can be transformed into the desired output using str2num. However, since that function evaluates the string, the input is first checked for safety to make sure it contains only the allowed characters:
in = '-6;-3--1;3-5;7-11;14;15;17-20 '; % example
assert(all(ismember(in, '0123456789 ,;-')), 'Characters not allowed') % safety check
out = str2num(regexprep(in, {'(?<=\d)-' ';'}, {':' ','})); % replace and evaluate
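If you would rather not evaluate the string at all, the same expansion can be sketched with str2double and the colon operator. This is an alternative sketch, not the method above, and it assumes every item is either a single integer or two integers joined by a dash:

```matlab
in = '-6;-3--1;3-5;7-11;14;15;17-20 ';      % example from the question
items = strsplit(strtrim(in), ';');
out = [];
for k = 1:numel(items)
    item = strtrim(items{k});
    % index of the range dash: a '-' that directly follows a digit
    sep = regexp(item, '(?<=\d)-', 'once');
    if isempty(sep)
        vals = str2double(item);            % single number
    else
        vals = str2double(item(1:sep-1)) : str2double(item(sep+1:end));
    end
    % str2double returns NaN for anything that is not a number
    assert(~isempty(vals) && ~any(isnan(vals)), 'Invalid item: %s', item);
    out = [out, vals]; %#ok<AGROW>
end
disp(out);
```

Because nothing is evaluated, malformed input fails at the assert instead of being executed.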

MATLAB fprintf Increase Number of Digits of Exponent

If we have
A=[100 -0.1 0];
B=[30 0.2 -2]; t1='text 1'; t2='text 2';
How can I use fprintf so that the output saved in a file looks like this?
100 -1.000E-0001 0.000E-0000 'text 1'
30 2.000E-0001 -2.000E-0000 'text 2'
I put together a "one-liner" (spread across several lines for better readability) that takes an array, a single number format, and a delimiter and returns the desired string. And while you found the leading blank-space flag, I prefer the + flag, though the function will work with both:
A=[-0.1 0];
B=[0.2 -2];
minLenExp = 4;
extsprintf = @(num,fmt,delim) ...
strjoin(cellfun(...
@(toks)[toks{1},repmat('0',1,max([0,minLenExp-length(toks{2})])),toks{2}],...
regexp(sprintf(fmt,num),'([+-\s][\.\d]+[eE][+-])(\d+)','tokens'),...
'UniformOutput',false),delim);
Astr = extsprintf(A,'%+.4E',' ');
Bstr = extsprintf(B,'%+.4E',' ');
disp([Astr;Bstr]);
Running this yields:
>> foo
-1.0000E-0001 +0.0000E+0000
+2.0000E-0001 -2.0000E+0000
(foo is just what the script file is called.)
Here's a more general approach that searches for the exponential format instead of assuming it:
A=[100 -0.1 0].';
B=[30 0.2 -2];
extsprintf = @(fmt,arr) ...
regexprep(...
sprintf(fmt,arr),...
regexprep(regexp(sprintf(fmt,arr),'([+-\s][\.\d]+[eE][+-]\d+)','match'),'([\+\.])','\\$1'),...
cellfun(@(match)...
cellfun(...
@(toks)[toks{1},repmat('0',1,max([0,minLenExp-length(toks{2})])),toks{2}],...
regexp(match,'([+-\s][\.\d]+[eE][+-])(\d+)','tokens'),...
'UniformOutput',false),...
regexp(sprintf(fmt,arr),'([+-\s][\.\d]+[eE][+-]\d+)','match')));
fmt = '%3d %+.4E %+.4e';
disp(extsprintf(fmt,A));
disp(extsprintf(fmt,B));
Outputs
>> foo
100 -1.0000E-0001 +0.0000e+0000
30 +2.0000E-0001 -2.0000e+0000
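For comparison, here is a rough sketch that avoids post-processing the sprintf output by computing the mantissa and exponent directly. It is not taken from the answer above. The name nDigits is an assumption, the sign of the exponent is always printed (so zero comes out as E+0000), and the three-decimal mantissa is hard-coded for brevity:

```matlab
x = [100 -0.1 0];
nDigits = 4;                                   % digits wanted in the exponent
% Build a format such as '%.3fE%+05d'; the field width includes the sign.
fmt = sprintf('%%.3fE%%+0%dd', nDigits + 1);
out = cell(size(x));
for k = 1:numel(x)
    if x(k) == 0
        e = 0;                                 % log10(0) is -Inf, so handle zero separately
    else
        e = floor(log10(abs(x(k))));           % decimal exponent
    end
    m = x(k) / 10^e;                           % mantissa, nominally 1 <= |m| < 10
    if round(abs(m) * 1000) / 1000 >= 10       % mantissa would print as 10.000
        m = m / 10;
        e = e + 1;
    end
    out{k} = sprintf(fmt, m, e);
end
disp(strjoin(out, ' '));
```

The resulting strings can be written to a file with fprintf(fid, '%s\n', ...) like any other text.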

What's the best way to build a dictionary (word count) for NLP in MATLAB?

I have a frequency count dictionary, and I want to be able to look up the frequency count for a given word in my dictionary.
For example, if my input word is 'about', the output should be the count of 'about' in my dictionary, which is 139, so that I can calculate the probability.
139 about
133 according
163 accusing
244 actually
567 afternoon
175 again
156 ah
167 a-ha
165 ahh
I tried to do this with fopen, but I am not getting the result I want.
fid = fopen('dictionary.txt');
words = textscan(fid, '%s');
fclose(fid);
words = words{1};
I tried this as well, but I am getting a different result:
countfunction = @(word) nnz(strcmp(word, words));
count = cellfun(countfunction, words);
tally = [words num2cell(count)];
sortrows(tally, 2);
The problem is that you're running countfunction for each instance of each word in the dictionary, rather than each unique word in the dictionary.
Here's how to incrementally improve your code:
words = {'hi' 'hi' 'the' 'hi' 'the' 'a'};
unique_words = unique(words(:));
countfunction = @(word) nnz(strcmp(word, words));
count = cellfun(countfunction, unique_words);
tally = [unique_words, num2cell(count)];
disp(sortrows(tally, 2));
'a' [1]
'the' [2]
'hi' [3]
However, I'd recommend using grpstats instead:
words = {'hi' 'hi' 'the' 'hi' 'the' 'a'};
[unique_words, count] = grpstats(ones(size(words)), words(:), {'gname', 'numel'});
tally = [unique_words, num2cell(count)];
disp(sortrows(tally, 2));
'a' [1]
'the' [2]
'hi' [3]
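If the dictionary file already stores one count and one word per line, as in the listing in the question, the counts do not need to be recomputed. A minimal sketch of a lookup under that assumption follows. The sample text stands in for dictionary.txt, and the names freq, n, and p are made up for illustration:

```matlab
% Sample text standing in for the file contents; with a real file, pass the
% file identifier from fopen to textscan instead of the text.
txt = sprintf(['139 about\n133 according\n163 accusing\n244 actually\n' ...
               '567 afternoon\n175 again\n156 ah\n167 a-ha\n165 ahh\n']);

C = textscan(txt, '%f %s');        % column 1: counts, column 2: words
counts = C{1};
words = C{2};

% Map from word to count for direct lookup.
freq = containers.Map(words, counts);

word = 'about';
if isKey(freq, word)
    n = freq(word);
else
    n = 0;                         % word not in the dictionary
end
p = n / sum(counts);               % relative frequency within this dictionary
fprintf('%s: count = %d, relative frequency = %.4f\n', word, n, p);
```

Here the probability is taken relative to the total of all counts in the file, which may or may not be the normalization you need.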

Extract numbers from string in MATLAB

I'm working with sscanf to extract a number from a string. The strings are usually in the form of:
'44 ppm'
'10 gallons'
'23.4 inches'
but occasionally they are in the form of:
'<1 ppm'
If I use the following code:
x = sscanf('1 ppm','%f')
I get an output of
1
But if I add the less than sign in front of the one:
x = sscanf('<1 ppm','%f')
I get:
[]
How can I write this code so this actually produces a number? I'm not sure yet what number it should print...but let's just say it should print 1 for the moment.
You can use regexp:
s= '<1 ppm';
x=regexp(s, '.*?(\d+(\.\d+)*)', 'tokens' )
x{1}
Demo :
>> s= {'44 ppm', '10 gallons', '23.4 inches', '<1 ppm' } ;
>> x = regexp(s, '.*?(\d+(\.\d+)*)', 'tokens' );
>> cellfun( @(x) disp(x{1}), x ) % Demo for all
'44'
'10'
'23.4'
'1'
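The tokens above are still text. As a small sketch of the remaining step, the first number in each string can be matched and converted with str2double. The variable names numText and isLessThan are assumptions, and how a value like '<1' should be treated is left open, as in the question:

```matlab
s = {'44 ppm', '10 gallons', '23.4 inches', '<1 ppm'};

% First run of digits, optionally followed by a decimal part.
numText = regexp(s, '\d+(\.\d+)?', 'match', 'once');

% Convert the matched text to doubles (NaN where nothing matched).
x = str2double(numText);

% Remember which entries were given as an upper bound ('<...').
isLessThan = strncmp(s, '<', 1);

disp(x);
disp(isLessThan);
```

Keeping isLessThan alongside x lets you decide later what number a '<1' reading should map to.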

extracting data from excel to matlab

Suppose I have an Excel file (data.xlsx) which contains the following data.
Name age
Tom 43
Dick 24
Harry 32
Now I want to extract the data from it and make two cell arrays (or matrices) which contain
name = {'Tom'; 'Dick'; 'Harry'} age = [43; 24; 32]
I have used xlsread('data.xlsx'), but it only extracts the numerical values, and I want to obtain both as mentioned above. Please help me out.
You have to use the additional output arguments of xlsread in order to get the text.
I created a dummy Excel file with your data and here is the output (never mind the NaNs):
[ndata, text, alldata] = xlsread('DummyExcel.xlsx')
ndata =
43
24
32
text =
'Name' 'Age'
'Tom' ''
'Dick' ''
'Harry' ''
alldata =
[NaN] 'Name' 'Age'
[NaN] 'Tom' [ 43]
[NaN] 'Dick' [ 24]
[NaN] 'Harry' [ 32]
Now if you use this:
text{2:end,1}
you get
ans =
Tom
ans =
Dick
ans =
Harry
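To collect the two outputs into the variables from the question, something along these lines should work. The values are copied from the output shown above as stand-ins for a real xlsread call, and txt is used in place of text to avoid shadowing the built-in text function:

```matlab
% Stand-ins for the xlsread outputs shown above.
ndata = [43; 24; 32];
txt = {'Name' 'Age'; 'Tom' ''; 'Dick' ''; 'Harry' ''};

name = txt(2:end, 1);   % 3x1 cell array of names, header row skipped
age = ndata;            % 3x1 numeric column

disp(name);
disp(age);
```

A cell array is used for name because the names differ in length and cannot be stacked into a char matrix without padding.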
You can also use the importdata function.
Example:
%Import Data
filename = 'yourfilename.xlsx';
delimiterIn = ' ';
headerlinesIn = 1;
A = importdata(filename,delimiterIn,headerlinesIn);
This reads both the text data and the numerical data. The text data will be under A.textdata and the numerical data will be under A.data.