Regexp to find a matching condition in a string - matlab

Hi need help in using regexp for condition matching.
ex.my file has the following content
{hello.program='function'`;
bye.program='script'; }
I am trying to use regexp to match the string that has .program='function' in them:
pattern = '[.program]+\=(function)'
also tried pattern='[^\n]*(.hello=function)[^\n]*';
pattern_match = regexp(myfilename,pattern , 'match')
but this returns me pattern_match={} while i expect the result to be hello.program='function'`;

If 'function' comes with string-markers, you need to include these in the match. Also, you need to escape the dot (otherwise, it's considered "any character"). [.program]+ looks for one or several letters contained in the square brackets - but you can just look for program instead. Also, you don't need to escape the =-sign (which is probably what messed up the match).
cst = {'hello.program=''function''';'bye.program=''script'''; };
pat = 'hello\.program=''function''';
out = regexp(cst,pat,'match');
out{1}{1} %# first string from list, first match
hello.program='function'
EDIT
In response to the comment
my file contains
m2 = S.Parameter;
m2.Value = matlabstandard;
m2.Volatility = 'Tunable';
m2.Configurability = 'None';
m2.ReferenceInterfaceFile ='';
m2.DataType = 'auto';
my objective is to find all the lines that match, .DataType='auto'
Here's how you find the matching lines with regexp
%# read the file with textscan into a variable txt
fid = fopen('myFile.m');
txt = textscan(fid,'%s');
fclose(fid);
txt = txt{1};
%# find the match. Allow spaces and equal signs between DataType and 'auto'
match = regexp(txt,'\.DataType[ =]+''auto''','match')
%# match is not empty only if a match was found. Identify the non-empty match
matchedLine = find(~cellfun(#isempty,match));

Try this as it matches .program='function' exactly:
(\.)program='function'
I think this did not work:
'[.program]+\=(function)'
because of how the []'s work. Here is a link explaining why I say that: http://www.regular-expressions.info/charclass.html

Related

addressing array in the function return

I would like to make the following code simpler.
files=dir('~/some*.txt');
numFiles=length(files);
for i = 1:numFiles
name=files(i).name;
name=strsplit(name,'.');
name=name{1};
name=strsplit(name, '_');
name=name(2);
name = str2num(name{1});
disp(name);
end
I am a begginer in Matlab, in general I would love something like:
name = str2num(strsplit(strsplit(files(i).name,'.')(1),'_')(2));
but matlab does not like this.
Another issue of the approach above is that matlab keeps giving cell type even for something like name(2) but this may be just the problem with my syntax.
Example file names:
3000_0_100ms.txt
3000_0_5s.txt
3000_110_5s.txt
...
Let's say I want to select all files ending in '5s' then I need to split them (after removing the extension) by '_' and return the second part, in the case of the three filenames above, that would be 0, 0, 110.
But I am in general curious how to do this simple operation in matlab without the complicated code that I have above.
Because your filenames follow a specific pattern, they're a prime candidate for a regular expression. While regular expressions can be confusing to learn at the outset they are very powerful tools.
Consider the following example, which pulls out all numbers that have both leading and trailing underscores:
filenames = {'3000_0_100ms.txt', '3000_0_5s.txt', '3000_110_5s.txt'};
strs = regexp(filenames, '(?<=\_)(\d+)(?=\_)', 'match');
strs = [strs{:}]; % Denest one layer of cells
nums = str2double(strs);
Which returns:
nums =
0 0 110
Being used here are what's called lookbehind (?<=...) and lookahead (?=...) operators. As their names suggest, they look in their respective directions related to the expression they're part of, (\d+) in our case, which looks for one or more digits. Though this approach requires more steps than the simple '\_(\d+)\_' expression, the latter requires you either utilize MATLAB's 'tokens' regex operator, which adds another layer of cells and that annoys me, or use the 'match' operator and strip the underscores from the match prior to converting to a numeric value.
Approach 2:
filenames = {'3000_0_100ms.txt', '3000_0_5s.txt', '3000_110_5s.txt'};
strs = regexp(filenames, '\_(\d+)\_', 'tokens');
strs = [strs{:}]; % Denest one layer of cells
strs = [strs{:}]; % Denest another layer of cells
nums = str2double(strs);
Approach 3:
filenames = {'3000_0_100ms.txt', '3000_0_5s.txt', '3000_110_5s.txt'};
strs = regexp(filenames, '\_(\d+)\_', 'match');
strs = [strs{:}]; % Denest one layer of cells
strs = regexprep(strs, '\_', '');
nums = str2double(strs);
You can use regexp to do a regular expression matching and obtain the numbers in the second place directly. This is an explanation of the regular expression I am using.
>>names = regexp({files(:).name},'\d*_(\d*)_\d*m?s\.txt$','tokens')
>>names = [names{:}]; % Get names out of their cells
>>names = [names{:}]; % Break cells one more time
>> nums = str2double(names); % Convert to double to obtain numbers

Function to split string in matlab and return second number

I have a string and I need two characters to be returned.
I tried with strsplit but the delimiter must be a string and I don't have any delimiters in my string. Instead, I always want to get the second number in my string. The number is always 2 digits.
Example: 001a02.jpg I use the fileparts function to delete the extension of the image (jpg), so I get this string: 001a02
The expected return value is 02
Another example: 001A43a . Return values: 43
Another one: 002A12. Return values: 12
All the filenames are in a matrix 1002x1. Maybe I can use textscan but in the second example, it gives "43a" as a result.
(Just so this question doesn't remain unanswered, here's a possible approach: )
One way to go about this uses splitting with regular expressions (MATLAB's strsplit which you mentioned):
str = '001a02.jpg';
C = strsplit(str,'[a-zA-Z.]','DelimiterType','RegularExpression');
Results in:
C =
'001' '02' ''
In older versions of MATLAB, before strsplit was introduced, similar functionality was achieved using regexp(...,'split').
If you want to learn more about regular expressions (abbreviated as "regex" or "regexp"), there are many online resources (JGI..)
In your case, if you only need to take the 5th and 6th characters from the string you could use:
D = str(5:6);
... and if you want to convert those into numbers you could use:
E = str2double(str(5:6));
If your number is always at a certain position in the string, you can simply index this position.
In the examples you gave, the number is always the 5th and 6th characters in the string.
filename = '002A12';
num = str2num(filename(5:6));
Otherwise, if the formating is more complex, you may want to use a regular expression. There is a similar question matlab - extracting numbers from (odd) string. Modifying the code found there you can do the following
all_num = regexp(filename, '\d+', 'match'); %Find all numbers in the filename
num = str2num(all_num{2}) %Convert second number from str

How to get rid of the punctuation? and check the spelling error

eliminate punctuation
words split when meeting new line and space, then store in array
check the text file got error or not with the function of checkSpelling.m file
sum up the total number of error in that article
no suggestion is assumed to be no error, then return -1
sum of error>20, return 1
sum of error<=20, return -1
I would like to check spelling error of certain paragraph, I face the problem to get rid of the punctuation. It may have problem to the other reason, it return me the error as below:
My data2 file is :
checkSpelling.m
function suggestion = checkSpelling(word)
h = actxserver('word.application');
h.Document.Add;
correct = h.CheckSpelling(word);
if correct
suggestion = []; %return empty if spelled correctly
else
%If incorrect and there are suggestions, return them in a cell array
if h.GetSpellingSuggestions(word).count > 0
count = h.GetSpellingSuggestions(word).count;
for i = 1:count
suggestion{i} = h.GetSpellingSuggestions(word).Item(i).get('name');
end
else
%If incorrect but there are no suggestions, return this:
suggestion = 'no suggestion';
end
end
%Quit Word to release the server
h.Quit
f19.m
for i = 1:1
data2=fopen(strcat('DATA\PRE-PROCESS_DATA\F19\',int2str(i),'.txt'),'r')
CharData = fread(data2, '*char')'; %read text file and store data in CharData
fclose(data2);
word_punctuation=regexprep(CharData,'[`~!##$%^&*()-_=+[{]}\|;:\''<,>.?/','')
word_newLine = regexp(word_punctuation, '\n', 'split')
word = regexp(word_newLine, ' ', 'split')
[sizeData b] = size(word)
suggestion = cellfun(#checkSpelling, word, 'UniformOutput', 0)
A19(i)=sum(~cellfun(#isempty,suggestion))
feature19(A19(i)>=20)=1
feature19(A19(i)<20)=-1
end
Substitute your regexprep call to
word_punctuation=regexprep(CharData,'\W','\n');
Here \W finds all non-alphanumeric characters (inclulding spaces) that get substituted with the newline.
Then
word = regexp(word_punctuation, '\n', 'split');
As you can see you don't need to split by space (see above). But you can remove the empty cells:
word(cellfun(#isempty,word)) = [];
Everything worked for me. However I have to say that you checkSpelling function is very slow. At every call it has to create an ActiveX server object, add new document, and delete the object after check is done. Consider rewriting the function to accept cell array of strings.
UPDATE
The only problem I see is removing the quote ' character (I'm, don't, etc). You can temporary substitute them with underscore (yes, it's considered alphanumeric) or any sequence of unused characters. Or you can use list of all non-alphanumeric characters to be remove in square brackets instead of \W.
UPDATE 2
Another solution to the 1st UPDATE:
word_punctuation=regexprep(CharData,'[^A-Za-z0-9''_]','\n');

ignore spaces and cases MATLAB

diary_file = tempname();
diary(diary_file);
myFun();
diary('off');
output = fileread(diary_file);
I would like to search a string from output, but also to ignore spaces and upper/lower cases. Here is an example for what's in output:
the test : passed
number : 4
found = 'thetest:passed'
a = strfind(output,found )
How could I ignore spaces and cases from output?
Assuming you are not too worried about accidentally matching something like: 'thetEst:passed' here is what you can do:
Remove all spaces and only compare lower case
found = 'With spaces'
found = lower(found(found ~= ' '))
This will return
found =
withspaces
Of course you would also need to do this with each line of output.
Another way:
regexpi(output(~isspace(output)), found, 'match')
if output is a single string, or
regexpi(regexprep(output,'\s',''), found, 'match')
for the more general case (either class(output) == 'cell' or 'char').
Advantages:
Fast.
robust (ALL whitespace (not just spaces) is removed)
more flexible (you can return starting/ending indices of the match, tokenize, etc.)
will return original case of the match in output
Disadvantages:
more typing
less obvious (more documentation required)
will return original case of the match in output (yes, there's two sides to that coin)
That last point in both lists is easily forced to lower or uppercase using lower() or upper(), but if you want same-case, it's a bit more involved:
C = regexpi(output(~isspace(output)), found, 'match');
if ~isempty(C)
C = found; end
for single string, or
C = regexpi(regexprep(output, '\s', ''), found, 'match')
C(~cellfun('isempty', C)) = {found}
for the more general case.
You can use lower to convert everything to lowercase to solve your case problem. However ignoring whitespace like you want is a little trickier. It looks like you want to keep some spaces but not all, in which case you should split the string by whitespace and compare substrings piecemeal.
I'd advertise using regex, e.g. like this:
a = regexpi(output, 'the\s*test\s*:\s*passed');
If you don't care about the position where the match occurs but only if there's a match at all, removing all whitespaces would be a brute force, and somewhat nasty, possibility:
a = strfind(strrrep(output, ' ',''), found);

Separating an array based on whether it contains a phrase or not

I am really just a noob at Matlab, so please don't get upset if I use wrong syntax. I am currently writing a small program in which I put all .xlsx file names from a certain directory into an array. I now want to separate the files into two different arrays based on their name. This is what I tried:
files = dir('My_directory\*.xlsx')
file_number = 1;
file_amount = length(files);
while file_number <= file_amount;
file_name = files(file_number).name;
filescs = [];
filescwf = [];
if strcmp(file_name,'*cs.xlsx') == 1;
filescs = [filescs,file_name];
else
filescwf = [filescwf,file_name];
end
file_number = file_number + 1
end
The idea here is that strcmp(file_name,'*cs.xlsx') checks if file_name contains 'cs' at the end. If it does, it is put into filescs, if it doesn't it is put into filescwf. However, this does not seem to work...
Any thoughts?
strcmp(file_name,'*cs.xlsx') checks whether file_name is identical to *cs.xlsx. If there is no file by that name (hint: few file systems allow '*' as part of a file name), it will always be false. (btw: there is no need for the '==1' comparison or the semicolon on the respective line)
You can use array indexing here to extract the relevant part of the filename you want to compare. file_name(1:5), will give you the first 5 characters, file_name(end-5:end) will give you the last 6, for example.
strcmp doesn't work with wildcards such as *cs.xlsx. See this question for an alternative approach.
You can use regexp to check the final letters of each of your files, then cellfun to apply regexp to all your filenames.
Here, getIndex will have 1's for all the files ending with cs.xlsx. The (?=$) part make sure that cs.xlsx is at the end.
files = dir('*.xlsx')
filenames = {files.name}'; %get filenames
getIndex = cellfun(#isempty,regexp(filenames, 'cs.xlsx(?=$)'));
list1 = filenames(getIndex);
list2 = filenames(~getIndex);