Matlab strsplit by space and quotes - matlab

I have a string of variable names as shown below:
{'"NORM TIME SEC, SEC, 9999997" "ROD FORCE, LBS, 3000118" "ROD POS, DEG, 3000216" P_ext_chamber_press P_ret_chamber_press "GEAR#1 POS INCH" 388821 Q_valve_gpm P_return 3882992 "COMMAND VOLTAGE VOLT"'}
the double quotes are for the variable names with spaces or special characters between the words" and a single word variable doesn't have any quotes around them. The variables are separated by one space. Some variable names are just numbers.
At the end, I want to create a cell with strings as follows:
{'NORM_TIME_SEC_SEC_9999997','ROD_FORCE_LBF_3000118','ROD_POS_DEG_3000216','P_ext_chamber_press','P_ret_chamber_press','GEAR#1_POS_INCH','3388821','Q_valve_gpm','P_return','3882992','COMMAND_VOLTAGE_VOLT'}

You can use regexp to first split it into groups and then replace all space with _
data = {'"abc def ghi" "jkl mno pqr" "stu vwx" yz"'};
% Get each piece within the " "
pieces = regexp(data{1}, '(?<=\"\s*)([A-Za-z0-9]+\s*)*(?=\"\s*)', 'match');
% 'abc def ghi' 'jkl mno pqr'
% Replace any space with _
names = regexprep(pieces, '\s+', '_');
% 'abc_def_ghi' 'jkl_mno_pqr'
Update
Since your last variable isn't surrounded by quotes, you could do something like the following
pieces = strtrim(regexp(data, '[^\"]+(?=\")', 'match'));
pieces = pieces{1};
pieces = pieces(~cellfun(#isempty, pieces));
% Replace spaces with _
regexprep(pieces, '\s+', '_')

I ended up forcing myself to study little bit on regular expression and the following seems to be working well for what I'm trying to do:
regexp(str,'(\"[\w\s\,\.\#]+\"|\w+)','match')
Probably not as robust as I want since I'm specifically singling out a certain set of special characters only, but so far, I haven't seen other special characters other than those in the data sets I have.

str = {'"abc def ghi" "jkl mno pqr" "stu vwx" yz"'};
Then
str_u = strrep(str,' ','_');
[str_q rest] = strtok(str_u,'"');
str_u = rest;
while ~strcmp(rest,'')
[token rest] = strtok(str_u,'"');
if ~(strcmp(token,'_')||strcmp(token,''))
if strcmp(token{1,1}(1),'_')
token{1,1} = strrep(token{1,1},'_','');
end
str_q = [str_q, token];
end
str_u = rest;
end
The resultant cell array is str_q, which will give the names of the variables
str_q = 'abc_def_ghi' 'jkl_mno_pqr' 'stu_vwx' 'yz'

Related

How to calculate the number of appearance of each letter(A-Z ,a-z as well as '.' , ',' and ' ' ) in a text file in matlab?

How can I go about doing this? So far I've opened the file like this
fileID = fopen('hamlet.txt'.'r');
[A,count] = fscanf(fileID, '%s');
fclose(fileID);
Getting spaces from the file
First, if you want to capture spaces, you'll need to change your format specifier. %s reads only non-whitespace characters.
>> fileID = fopen('space.txt','r');
>> A = fscanf(fileID, '%s');
>> fclose(fileID);
>> A
A = Thistexthasspacesinit.
Instead, we can use %c:
>> fileID = fopen('space.txt','r');
>> A = fscanf(fileID, '%c');
>> fclose(fileID);
>> A
A = This text has spaces in it.
Mapping between characters and values (array indices)
We could create a character array that contains all of the target characters to look for:
search_chars = ['A':'Z', 'a':'z', ',', '.', ' '];
That would work, but to map the character to a position in the array you'd have to do something like:
>> char_pos = find(search_chars == 'q')
char_pos = 43
You could also use containters.Map, but that seems like overkill.
Instead, let's use the ASCII value of each character. For convenience, we'll use only values 1:126 (0 is NUL, and 127 is DEL. We should never encounter either of those.) Converting from characters to their ASCII code is easy:
>> c = 'q'
c = s
>> a = uint8(c) % MATLAB actually does this using double(). Seems wasteful to me.
a = 115
>> c2 = char(a)
c2 = s
Note that by doing this, you're counting characters that are not in your desired list like ! and *. If that's a problem, then use search_chars and figure out how you want to map from characters to indices.
Looping solution
The most intuitive way to count each character is a loop. For each character in A, find its ASCII code and increment the counter array at that index.
char_count = zeros(1, 126);
for current_char = A
c = uint8(current_char);
char_count(c) = char_count(c) + 1;
end
Now you've got an array of counts for each character with ASCII codes from 1 to 126. To find out how many instances of 's' there are, we can just use its ASCII code as an index:
>> char_count(115)
ans = 4
We can even use the character itself as an index:
>> char_count('s')
ans = 4
Vectorized solution
As you can see with that last example, MATLAB's weak typing makes characters and their ASCII codes pretty much equivalent. In fact:
>> 's' == 115
ans = 1
That means that we can use implicit broadcasting and == to create a logical 2D array where L(c,a) == 1 if character c in our string A has an ASCII code of a. Then we can get the count for each ASCII code by summing along the columns.
L = (A.' == [1:126]);
char_count = sum(L, 1);
A one-liner
Just for fun, I'll show one more way to do this: histcounts. This is meant to put values into bins, but as we said before, characters can be treated like values.
char_count = histcounts(uint8(A), 1:126);
There are dozens of other possibilities, for instance you could use the search_chars array and ismember(), but this should be a good starting point.
With [A,count] = fscanf(fileID, '%s'); you'll only count all string letters, doesn't matter which one. You can use regexp here which search for each letter you specify and will put it in a cell array. It consists of fields which contains the indices of your occuring letters. In the end you only sum the number of indices and you have the count for each letter:
fileID = fopen('hamlet.txt'.'r');
A = fscanf(fileID, '%s');
indexCellArray = regexp(A,{'A','B','C','D',... %I'm too lazy to add the other letters now^^
'a','b','c','d',...
','.' '};
letterCount = cellfun(#(x) numel(x),indexCellArray);
fclose(fileID);
Maybe you put the cell array in a struct where you can give fieldnames for the letters, otherwise you might loose track which count belongs to which number.
Maybe there's much easier solution, cause this one is kind of exhausting to put all the letters in the regexp but it works.

Capitalize the first and last letter of three letter words in a string

I am trying to capitalize the first and last letter of only the three letter words in a string. So far, I have tried
spaces = strfind(str, ' ');
spaces = [0 spaces];
lw = diff(spaces);
lw3 = find(lw ==4);
a3 = lw-1;
b3 = spaces(a3+1);
b4 = b3 + 2 ;
str(b3) = upper(str(b3));
str(b4) = upper(str(b4);
we had to find where the 3 letter words were first so that is what the first 4 lines of code are and then the others are trying to get it so that it will find where the first and last letters are and then capitalize them?
I would use regular expressions to identity the 3-letter words and then use regexprep combined with an anonymous function to perform the case-conversion.
str = 'abcd efg hijk lmn';
% Custom function to capitalize the first and last letter of a word
f = #(x)[upper(x(1)), x(2:end-1), upper(x(end))];
% This will match 3-letter words and apply function f to them
out = regexprep(str, '\<\w{3}\>', '${f($0)}')
% abcd EfG hijk LmN
Regular expressions are definitely the way to go. I am going to suggest a slightly different route, and that is to return the indices using the tokenExtents flag for regexpi:
str = 'abcd efg hijk lmn';
% Tokenize the words and return the first and last index of each
idx = regexpi(str, '(\<w{3}\>)', 'tokenExtents');
% Convert those indices to upper case
str([idx{:}]) = upper(str([idx{:}]));
Using the matlab ipusum function from the File Exchange, I generated a 1000 paragraph random text string with mean word length 4 +/- 2.
str = lower(matlab_ipsum('WordLength', 4, 'Paragraphs', 1000));
The result was a 177,575 character string with 5,531 3-letter words. I used timeit to check the execution time of using regexprep and regexpi with tokenExtents. Using regexpi is an order of magnitude faster:
regexpi = 0.013979s
regexprep = 0.14401s

addressing array in the function return

I would like to make the following code simpler.
files=dir('~/some*.txt');
numFiles=length(files);
for i = 1:numFiles
name=files(i).name;
name=strsplit(name,'.');
name=name{1};
name=strsplit(name, '_');
name=name(2);
name = str2num(name{1});
disp(name);
end
I am a begginer in Matlab, in general I would love something like:
name = str2num(strsplit(strsplit(files(i).name,'.')(1),'_')(2));
but matlab does not like this.
Another issue of the approach above is that matlab keeps giving cell type even for something like name(2) but this may be just the problem with my syntax.
Example file names:
3000_0_100ms.txt
3000_0_5s.txt
3000_110_5s.txt
...
Let's say I want to select all files ending in '5s' then I need to split them (after removing the extension) by '_' and return the second part, in the case of the three filenames above, that would be 0, 0, 110.
But I am in general curious how to do this simple operation in matlab without the complicated code that I have above.
Because your filenames follow a specific pattern, they're a prime candidate for a regular expression. While regular expressions can be confusing to learn at the outset they are very powerful tools.
Consider the following example, which pulls out all numbers that have both leading and trailing underscores:
filenames = {'3000_0_100ms.txt', '3000_0_5s.txt', '3000_110_5s.txt'};
strs = regexp(filenames, '(?<=\_)(\d+)(?=\_)', 'match');
strs = [strs{:}]; % Denest one layer of cells
nums = str2double(strs);
Which returns:
nums =
0 0 110
Being used here are what's called lookbehind (?<=...) and lookahead (?=...) operators. As their names suggest, they look in their respective directions related to the expression they're part of, (\d+) in our case, which looks for one or more digits. Though this approach requires more steps than the simple '\_(\d+)\_' expression, the latter requires you either utilize MATLAB's 'tokens' regex operator, which adds another layer of cells and that annoys me, or use the 'match' operator and strip the underscores from the match prior to converting to a numeric value.
Approach 2:
filenames = {'3000_0_100ms.txt', '3000_0_5s.txt', '3000_110_5s.txt'};
strs = regexp(filenames, '\_(\d+)\_', 'tokens');
strs = [strs{:}]; % Denest one layer of cells
strs = [strs{:}]; % Denest another layer of cells
nums = str2double(strs);
Approach 3:
filenames = {'3000_0_100ms.txt', '3000_0_5s.txt', '3000_110_5s.txt'};
strs = regexp(filenames, '\_(\d+)\_', 'match');
strs = [strs{:}]; % Denest one layer of cells
strs = regexprep(strs, '\_', '');
nums = str2double(strs);
You can use regexp to do a regular expression matching and obtain the numbers in the second place directly. This is an explanation of the regular expression I am using.
>>names = regexp({files(:).name},'\d*_(\d*)_\d*m?s\.txt$','tokens')
>>names = [names{:}]; % Get names out of their cells
>>names = [names{:}]; % Break cells one more time
>> nums = str2double(names); % Convert to double to obtain numbers

How to get rid of the punctuation? and check the spelling error

eliminate punctuation
words split when meeting new line and space, then store in array
check the text file got error or not with the function of checkSpelling.m file
sum up the total number of error in that article
no suggestion is assumed to be no error, then return -1
sum of error>20, return 1
sum of error<=20, return -1
I would like to check spelling error of certain paragraph, I face the problem to get rid of the punctuation. It may have problem to the other reason, it return me the error as below:
My data2 file is :
checkSpelling.m
function suggestion = checkSpelling(word)
h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
suggestion = []; %return empty if spelled correctly
else
%If incorrect and there are suggestions, return them in a cell array
if h.GetSpellingSuggestions(word).count > 0
count = h.GetSpellingSuggestions(word).count;
for i = 1:count
suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
end
else
%If incorrect but there are no suggestions, return this:
suggestion = 'no suggestion';
end
end
%Quit Word to release the server
h.Quit
f19.m
for i = 1:1
data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
word_punctuation=regexprep(CharData,'[`~!##$%^&*()-_=+[{]}\|;:\''<,>.?/','')
word_newLine = regexp(word_punctuation, '\n', 'split')
word = regexp(word_newLine, ' ', 'split')
[sizeData b] = size(word)
suggestion = cellfun(#checkSpelling, word, 'UniformOutput', 0)
A19(i)=sum(~cellfun(#isempty,suggestion))
feature19(A19(i)>=20)=1
feature19(A19(i)<20)=-1
end
Substitute your regexprep call to
word_punctuation=regexprep(CharData,'\W','\n');
Here \W finds all non-alphanumeric characters (inclulding spaces) that get substituted with the newline.
Then
word = regexp(word_punctuation, '\n', 'split');
As you can see you don't need to split by space (see above). But you can remove the empty cells:
word(cellfun(#isempty,word)) = [];
Everything worked for me. However I have to say that you checkSpelling function is very slow. At every call it has to create an ActiveX server object, add new document, and delete the object after check is done. Consider rewriting the function to accept cell array of strings.
UPDATE
The only problem I see is removing the quote ' character (I'm, don't, etc). You can temporary substitute them with underscore (yes, it's considered alphanumeric) or any sequence of unused characters. Or you can use list of all non-alphanumeric characters to be remove in square brackets instead of \W.
UPDATE 2
Another solution to the 1st UPDATE:
word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');

How to add \ before all special characters in MATLAB?

I am trying to add '\' before all special characters in a string in MATLAB, could anyone please help me out. Here is the example:
tStr = 'Hi, I'm a Big (Not So Big) MATLAB addict; Since my school days!';
I want this string to be changed to:
'Hi\, I\'m a Big \(Not so Big \) MATLAB addict\; Since my school days\!'
The escape character in Matlab is the single quote ('), not the backslash (\), like in C language. Thus, your string must be like this:
tStr = 'Hi\, I\''m a Big (Not so Big ) MATLAB addict\; Since my school days!'
I took the list of special charecters defined on the Mathworks webpage to do this:
special = '[]{}()=''.().....,;:%%{%}!#';
tStr = 'Hi, I''m a Big (Not So Big) MATLAB addict; Since my school days!';
outStr = '';
for l = tStr
if (length(find(special == l)) > 0)
outStr = [outStr, '\', l];
else
outStr = [outStr, l];
end
end
which will automatically add those \s. You do need to use two single quotes ('') in place of the apostrophe in your input string. If tStr is obtained with the function input(), or something similar, this will procedure will still work.
Edited:
Or using regular expressions:
regexprep(tStr,'([[\]{}()=''.(),;:%%{%}!#])','\\$1')