strsplit excluding one, but including another - matlab

A very short question. I have a string
str = 'var(:,1),var(:,2),var(:,3)';
I need to split it with strsplit by ',' but not by ':,' so that I will end up with a cell array
cel = {'var(:,1)','var(:,2)','var(:,3)'};
I am not good with regular expression at all and I tried ,^(:,) but this fails. I thought ^ is not () is group.
How can it be done?

Use a regular expression with negative lookbehind:
cel = regexp(str, '(?<!:),', 'split');

Related

addressing array in the function return

I would like to make the following code simpler.
files=dir('~/some*.txt');
numFiles=length(files);
for i = 1:numFiles
name=files(i).name;
name=strsplit(name,'.');
name=name{1};
name=strsplit(name, '_');
name=name(2);
name = str2num(name{1});
disp(name);
end
I am a begginer in Matlab, in general I would love something like:
name = str2num(strsplit(strsplit(files(i).name,'.')(1),'_')(2));
but matlab does not like this.
Another issue of the approach above is that matlab keeps giving cell type even for something like name(2) but this may be just the problem with my syntax.
Example file names:
3000_0_100ms.txt
3000_0_5s.txt
3000_110_5s.txt
...
Let's say I want to select all files ending in '5s' then I need to split them (after removing the extension) by '_' and return the second part, in the case of the three filenames above, that would be 0, 0, 110.
But I am in general curious how to do this simple operation in matlab without the complicated code that I have above.
Because your filenames follow a specific pattern, they're a prime candidate for a regular expression. While regular expressions can be confusing to learn at the outset they are very powerful tools.
Consider the following example, which pulls out all numbers that have both leading and trailing underscores:
filenames = {'3000_0_100ms.txt', '3000_0_5s.txt', '3000_110_5s.txt'};
strs = regexp(filenames, '(?<=\_)(\d+)(?=\_)', 'match');
strs = [strs{:}]; % Denest one layer of cells
nums = str2double(strs);
Which returns:
nums =
0 0 110
Being used here are what's called lookbehind (?<=...) and lookahead (?=...) operators. As their names suggest, they look in their respective directions related to the expression they're part of, (\d+) in our case, which looks for one or more digits. Though this approach requires more steps than the simple '\_(\d+)\_' expression, the latter requires you either utilize MATLAB's 'tokens' regex operator, which adds another layer of cells and that annoys me, or use the 'match' operator and strip the underscores from the match prior to converting to a numeric value.
Approach 2:
filenames = {'3000_0_100ms.txt', '3000_0_5s.txt', '3000_110_5s.txt'};
strs = regexp(filenames, '\_(\d+)\_', 'tokens');
strs = [strs{:}]; % Denest one layer of cells
strs = [strs{:}]; % Denest another layer of cells
nums = str2double(strs);
Approach 3:
filenames = {'3000_0_100ms.txt', '3000_0_5s.txt', '3000_110_5s.txt'};
strs = regexp(filenames, '\_(\d+)\_', 'match');
strs = [strs{:}]; % Denest one layer of cells
strs = regexprep(strs, '\_', '');
nums = str2double(strs);
You can use regexp to do a regular expression matching and obtain the numbers in the second place directly. This is an explanation of the regular expression I am using.
>>names = regexp({files(:).name},'\d*_(\d*)_\d*m?s\.txt$','tokens')
>>names = [names{:}]; % Get names out of their cells
>>names = [names{:}]; % Break cells one more time
>> nums = str2double(names); % Convert to double to obtain numbers

Function to split string in matlab and return second number

I have a string and I need two characters to be returned.
I tried with strsplit but the delimiter must be a string and I don't have any delimiters in my string. Instead, I always want to get the second number in my string. The number is always 2 digits.
Example: 001a02.jpg I use the fileparts function to delete the extension of the image (jpg), so I get this string: 001a02
The expected return value is 02
Another example: 001A43a . Return values: 43
Another one: 002A12. Return values: 12
All the filenames are in a matrix 1002x1. Maybe I can use textscan but in the second example, it gives "43a" as a result.
(Just so this question doesn't remain unanswered, here's a possible approach: )
One way to go about this uses splitting with regular expressions (MATLAB's strsplit which you mentioned):
str = '001a02.jpg';
C = strsplit(str,'[a-zA-Z.]','DelimiterType','RegularExpression');
Results in:
C =
'001' '02' ''
In older versions of MATLAB, before strsplit was introduced, similar functionality was achieved using regexp(...,'split').
If you want to learn more about regular expressions (abbreviated as "regex" or "regexp"), there are many online resources (JGI..)
In your case, if you only need to take the 5th and 6th characters from the string you could use:
D = str(5:6);
... and if you want to convert those into numbers you could use:
E = str2double(str(5:6));
If your number is always at a certain position in the string, you can simply index this position.
In the examples you gave, the number is always the 5th and 6th characters in the string.
filename = '002A12';
num = str2num(filename(5:6));
Otherwise, if the formating is more complex, you may want to use a regular expression. There is a similar question matlab - extracting numbers from (odd) string. Modifying the code found there you can do the following
all_num = regexp(filename, '\d+', 'match'); %Find all numbers in the filename
num = str2num(all_num{2}) %Convert second number from str

How to get rid of the punctuation? and check the spelling error

eliminate punctuation
words split when meeting new line and space, then store in array
check the text file got error or not with the function of checkSpelling.m file
sum up the total number of error in that article
no suggestion is assumed to be no error, then return -1
sum of error>20, return 1
sum of error<=20, return -1
I would like to check spelling error of certain paragraph, I face the problem to get rid of the punctuation. It may have problem to the other reason, it return me the error as below:
My data2 file is :
checkSpelling.m
function suggestion = checkSpelling(word)
h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
suggestion = []; %return empty if spelled correctly
else
%If incorrect and there are suggestions, return them in a cell array
if h.GetSpellingSuggestions(word).count > 0
count = h.GetSpellingSuggestions(word).count;
for i = 1:count
suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
end
else
%If incorrect but there are no suggestions, return this:
suggestion = 'no suggestion';
end
end
%Quit Word to release the server
h.Quit
f19.m
for i = 1:1
data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
word_punctuation=regexprep(CharData,'[`~!##$%^&*()-_=+[{]}\|;:\''<,>.?/','')
word_newLine = regexp(word_punctuation, '\n', 'split')
word = regexp(word_newLine, ' ', 'split')
[sizeData b] = size(word)
suggestion = cellfun(#checkSpelling, word, 'UniformOutput', 0)
A19(i)=sum(~cellfun(#isempty,suggestion))
feature19(A19(i)>=20)=1
feature19(A19(i)<20)=-1
end
Substitute your regexprep call to
word_punctuation=regexprep(CharData,'\W','\n');
Here \W finds all non-alphanumeric characters (inclulding spaces) that get substituted with the newline.
Then
word = regexp(word_punctuation, '\n', 'split');
As you can see you don't need to split by space (see above). But you can remove the empty cells:
word(cellfun(#isempty,word)) = [];
Everything worked for me. However I have to say that you checkSpelling function is very slow. At every call it has to create an ActiveX server object, add new document, and delete the object after check is done. Consider rewriting the function to accept cell array of strings.
UPDATE
The only problem I see is removing the quote ' character (I'm, don't, etc). You can temporary substitute them with underscore (yes, it's considered alphanumeric) or any sequence of unused characters. Or you can use list of all non-alphanumeric characters to be remove in square brackets instead of \W.
UPDATE 2
Another solution to the 1st UPDATE:
word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');

Separating file name in parts by identifier

This may be a very simple task for many but I could not find anything appropriate for me.
I have a file name: filenm_A006.2011.269.10.47.G25_2010
I want to separate all its parts (separated by . and _) to use them separately. How can I do it with simple matlab commands?
Kind Regards,
Mushi
I recommend regexp:
fname = 'filenm_A006.2011.269.10.47.G25_2010';
parts = regexp(fname, '[^_.]+', 'match');
parts =
'filenm' 'A006' '2011' '269' '10' '47' 'G25' '2010'
You can now refer to parts{1} through parts{8} for the pieces. Explanation: the regexp pattern [^_.] means all characters not equal to _ or ., and the + means you want groups of at least 1 character. Then 'match' asks the regexp function to return a cell array of the strings of all the matches of that pattern. There are other regexp modes; for example, the indices of each piece of the file.
Use the command
strsplit.
cellArrayOfParts = strsplit(fileName,{'.' '_'});
You can use strsplit to split it:
strsplit('filenm_A006.2011.269.10.47.G25_2010',{'_','.'})
ans =
'filenm' 'A006' '2011' '269' '10' '47' 'G25' '2010'
Another option is to use regexp, like Peter suggested.

Working string in MATLAB

I have the following string in MATLAB, for example
##%%F1_USA(40)_u
and I want
F1_USA_40__u
Does it has any function for this?
Your best bet is probably regexprep which allows you to replace parts of a string using regular expressions:
s_new = regexprep(regexprep(s, '[()]', '_'), '[^A-Za-z0-9_]', '')
Update: based on your updated comment, this is probably what you want:
s_new = regexprep(regexprep(s, '^[^A-Za-z0-9_]*', ''), '[^A-Za-z0-9_]', '')
or:
s_new = regexprep(regexprep(s, '[^A-Za-z0-9_]', '_'), '^_*', '')
One way to do this is to use the function ISSTRPROP to find the indices of alphanumeric characters and replace or remove the others accordingly:
>> str = '##%%F1_USA(40)_u'; %# Sample string
>> index = isstrprop(str,'alphanum'); %# Find indices of alphanumeric characters
>> str(~index) = '_'; %# Set non-alphanumeric characters to '_'
>> str = str(find(index,1):end) %# Remove any leading '_'
str =
F1_USA_40__u %# Result
If you want to use regular expressions (which can get a little more complicated) then the last suggestion from Tamas will work. However, it can be greatly simplified to the following:
str = regexprep(str,{'\W','^_*'},{'_',''});