MatLab Regexprep Syntax with $ sign - matlab

Here is a line using regexprep.
line = regexprep(line,'(,([^0-9])',' , $1');
What does the $1 syntax mean?

The $1 in the replacement string provided to regexprep references the first matched token in your regular expression.
So for example, if we match two tokens, we can replace the matched string with either the first token
regexprep('abcdefgh', '(ab)(cd)', '$1')
% abefgh
Second token
regexprep('abcdefgh', '(ab)(cd)', '$2')
% cdefgh
Or both tokens
regexprep('abcdefgh', '(ab)(cd)', '$1$2')
% abcdefgh
In your example, the part matched by([^0-9]) is the token referenced by $1. The code that you have posted, removes the (, from a string and replaces it with , and the $1 keeps the rest of the match the same.
line = 'abcd(,abcd';
regexprep(line,'(,([^0-9])',' , $1')
% abcd , abcd

Related

How to match exact string in perl

I am trying to parse all the files and verify if any of the file content has strings TESTDIR or TEST_DIR
Files contents might look something like:-
TESTDIR = foo
include $(TESTDIR)/chop.mk
...
TEST_DIR := goldimage
MAKE_TESTDIR = var_make
NEW_TEST_DIR = tesing_var
Actually I am only interested in TESTDIR ,$(TESTDIR),TEST_DIR but in my case last two lines should be ignored. I am new to perl , Can anyone help me out with re-rex.
/\bTEST_?DIR\b/
\b means a "word boundary", i.e. the place between a word character and a non-word character. "Word" here has the Perl meaning: it contains characters, numbers, and underscores.
_? means "nothing or an underscore"
Look at "characterset".
Only (space) surrounding allowed:
/^(.* )?TEST_?DIR /
^ beginning of the line
(.* )? There may be some content .* but if, its must be followed by a space
at the and says that a whitespace must be there. Otherwise use ( .*)?$ at the end.
One of a given characterset is allowed:
Should the be other characters then a space be possible you can use a character class []:
/^(.*[ \t(])?TEST_?DIR[) :=]/
(.*[ \t(])? in front of TEST_?DIR may be a (space) or a \t (tab) or ( or nothing if the line starts with itself.
afterwards there must be one of (space) or : or = or ). Followd by anything (to "anything" belongs the "=" of ":=" ...).
One of a given group is allowed:
So you need groups within () each possible group in there devided by a |:
/^(.*( |\t))?TEST_?DIR( | := | = )/
In this case, at the beginning is no change to [ \t] because each group holds only one character and \t.
At the end, there must be (single space) or := (':=' surrounded by spaces) or = ('=' surrounded by spaces), following by anything...
You can use any combination...
/^(.*[ \t(])?TEST_?DIR([) =:]| :=| =|)/
Test it on Debuggex.com. (Use 'PCRE')

Avoiding duplicate items in a comma-separated list of two-letter words

I need to write a regex which allows a group of 2 chars only once. This is my current regex :
^([A-Z]{2},)*([A-Z]{2}){1}$
This allows me to validate something like this :
AL,RA,IS,GD
AL
AL,RA
The problem is that it validates also AL,AL and AL,RA,AL.
EDIT
Here there are more details.
What is allowed:
AL,RA,GD
AL
AL,RA
AL,IS,GD
What it shouldn't be allowed:
AL,RA,AL
AL,AL
AL,RA,RA
AL,IS,AL
IS,IS,AL
IS,GD,GD
IS,GD,IS
I need that every group of two characters appears only once in the sequence.
Try something like this expression:
/^(?:,?(\b\w{2}\b)(?!.*\1))+$/gm
I have no knowledge of swift, so take it with a grain of salt. The idea is basically to only match a whole line while making sure that no single matched group occurs at a later point in the line.
First of all, let's shorten your pattern. It can be easily achieved since the length of each comma-separated item is fixed and the list items are only made up of uppercase ASCII letters. So, your pattern can be written as ^(?:[A-Z]{2}(?:,\b)?)+$. See this regex demo.
Now, you need to add a negative lookahead that will check the string for any repeating two-letter sequence at any distance from the start of string, and within any distance between each. Use
^(?!.*\b([A-Z]{2})\b.*\b\1\b)(?:[A-Z]{2}(?:,\b)?)+$
See the regex demo
Possible implementation in Swift:
func isValidInput(Input:String) -> Bool {
return Input.range(of: #"^(?!.*\b([A-Z]{2})\b.*\b\1\b)(?:[A-Z]{2}(?:,\b)?)+$"#, options: .regularExpression) != nil
}
print(isValidInput(Input:"AL,RA,GD")) // true
print(isValidInput(Input:"AL,RA,AL")) // false
Details
^ - start of string
(?!.*\b([A-Z]{2})\b.*\b\1\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there is:
.* - any 0+ chars other than line break chars, as many as possible
\b([A-Z]{2})\b - a two-letter word as a whole word
.* - any 0+ chars other than line break chars, as many as possible
\b\1\b - the same whole word as in Group 1. NOTE: The word boundaries here are not necessary in the current scenario where the word length is fixed, it is two, but if you do not know the word length, and you have [A-Z]+, you will need the word boundaries, or other boundaries depending on the situation
(?:[A-Z]{2}(?:,\b)?)+ - 1 or more sequences of:
[A-Z]{2} - two uppercase ASCII letters
(?:,\b)? - an optional sequence: , only if followed with a word char: letter, digit or _. This guarantees that , won't be allowed at the end of the string
$ - end of string.
You can use a negative lookahead with a back-reference:
^(?!.*([A-Z]{2}).*\1).*
if, as in the all the examples in the question, it is known that the string contains only comma-separated pairs of capital letters. I will relax that assumption later in my answer.
Demo
The regex performs the following operations:
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ characters (1+ OK)
([A-Z]{2}) # match 2 uppercase letters in capture group 1
.* # match 0+ characters (1+ OK)
\1 # match the contents of capture group 1
) # end negative lookahead
.* # match 0+ characters (the entire string)
Suppose now that one or more capital letters may appear between each pair of commas, or before the first comma or after the last comma, but it is only strings of two letters that cannot be repeated. Moreover, I assume the regex must confirm the regex has the desired form. Then the following regex could be used:
^(?=[A-Z]+(?:,[A-Z]+)*$)(?!.*(?:^|,)([A-Z]{2}),(?:.*,)?\1(?:,|$)).*
Demo
The regex performs the following operations:
^ # match beginning of line
(?= # begin pos lookahead
[A-Z]+ # match 1+ uc letters
(?:,[A-Z]+) # match ',' then by 1+ uc letters in a non-cap grp
* # execute the non-cap grp 0+ times
$ # match the end of the line
) # end pos lookahead
(?! # begin neg lookahead
.* # match 0+ chars
(?:^|,) # match beginning of line or ','
([A-Z]{2}) # match 2 uc letters in cap grp 1
, # match ','
(?:.*,) # match 0+ chars, then ',' in non-cap group
? # optionally match non-cap grp
\1 # match the contents of cap grp 1
(?:,|$) # match ',' or end of line
) # end neg lookahead
.* # match 0+ chars (entire string)
If there is no need check that the string contains only comma-separated strings of one or more upper case letters the postive lookahead at the beginning can be removed.

How to recognize ID, Literals and Comments in Lex file

I have to write a lex program that has these rules:
Identifiers: String of alphanumeric (and _), starting with an alphabetic character
Literals: Integers and strings
Comments: Start with ! character, go to until the end of the line
Here is what I came up with
[a-zA-Z][a-zA-Z0-9]+ return(ID);
[+-]?[0-9]+ return(INTEGER);
[a-zA-Z]+ return ( STRING);
!.*\n return ( COMMENT );
However, I still get a lot of errors when I compile this lex file.
What do you think the error is?
It would have helped if you'd shown more clearly what the problem was with your code. For example, did you get an error message or did it not function as desired?
There are a couple of problems with your code, but it is mainly correct. The first issue I see is that you have not divided your lex program into the necessary parts with the %% divider. The first part of a lex program is the declarations section, where regular expression patterns are specified. The second part is where the action that match patterns are specified. The (optional) third section is where any code (for the compiler) is placed. Code for the compiler can also be placed in the declaration section when delineated by %{ and %} at the start of a line.
If we put your code through lex we would get this error:
"SoNov16.l", line 1: bad character: [
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: ]
"SoNov16.l", line 1: bad character: +
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: (
"SoNov16.l", line 1: unknown error processing section 1
"SoNov16.l", line 1: bad character: )
"SoNov16.l", line 1: bad character: ;
Did you get something like that? In your example code you are specifying actions (the return(ID); is an example of an action) and thus your code is for the second section. You therefore need to put a %% line ahead of it. It will then be a valid lex program.
You code is dependant on (probably) a parser, which consumes (and declares) the tokens. For testing purposes it is often easier to just print the tokens first. I solved this problem by making a C macro which will do the print and can be redefined to do the return at a later stage. Something like this:
%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
[a-zA-Z][a-zA-Z0-9]+ TOKEN(ID);
[+-]?[0-9]+ TOKEN(INTEGER);
[a-zA-Z]+ TOKEN (STRING);
!.*\n TOKEN (COMMENT);
If we build and test this, we get the following:
abc
String: abc Matched: ID
abc123
String: abc123 Matched: ID
! comment text
String: ! comment text
Matched: COMMENT
Not quite correct. We can see that the ID rule is matching what should be a string. This is due to the ordering of the rules. We have to put the String rule first to ensure it matches first - unless of course you were supposed to match strings inside some quotes? You also missed the underline from the ID pattern. Its also a good idea to match and discard any whitespace characters:
%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
[a-zA-Z]+ TOKEN (STRING);
[a-zA-Z][a-zA-Z0-9_]+ TOKEN(ID);
[+-]?[0-9]+ TOKEN(INTEGER);
!.*\n TOKEN (COMMENT);
[ \t\r\n]+ ;
Which when tested shows:
abc
String: abc Matched: STRING
abc123_
String: abc123_ Matched: ID
-1234
String: -1234 Matched: INTEGER
abc abc123 ! comment text
String: abc Matched: STRING
String: abc123 Matched: ID
String: ! comment text
Matched: COMMENT
Just in case you wanted strings in quotes, that is easy too:
%{
#define TOKEN(t) printf("String: %s Matched: " #t "\n",yytext)
%}
%%
\"[^"]+\" TOKEN (STRING);
[a-zA-Z][a-zA-Z0-9_]+ TOKEN(ID);
[+-]?[0-9]+ TOKEN(INTEGER);
!.*\n TOKEN (COMMENT );
[ \t\r\n] ;
"abc"
String: "abc" Matched: STRING

How to read a specific number (or word) from an answer

I have an .nc file I'm reading in matlab, and getting info out of the time variable.
the code looks like this
>> ncreadatt(model_list{3},'T','units')
ans =
'months since 1850-01-01'
what I want to do is get just the '1850' out of the answer.
Regular expression is a very powerful tool to parse and manipulate strings.
Matlab has regexp command:
line = 'months since 1850-01-01';
res = regexp( line, '\s(\d+)-', 'tokens', 'once');
year = str2double(res{1})
And the results is:
year =
1850
The regular expression used '\s(\d+)-' means:
\s - look for a single white space character (the space before 1850).
'(\d+)' - look for one or more digit ('\d+'), the parentheses means that all charcters matching here will be saved as a "token".
'-' - look for a single '-' after the digits.
You can play with it on ideone.

Matlab: how to convert character array or string into a formatted output OR parse a string

Could someone please tell me how to convert character array into a formatted output using Matlab?
I am expecting data like this:
CHAR (1 x 29) : 0.050822999 3.141592979 ; (1)
OR
CELL (1 x 1) or string: '0.050822999 3.141592979 ; (1)'
I am looking for output like this:
d1 = 0.050822999; %double
d2 = 3.141592979; %double
index = 1; % integer
I tried transposing and then using str2num(Str'); but, it's returning me 0x 0 double.
Any help would be appreciated.
Regards,
DK
you can use regexp to parse the string
c = { '0.050822999 3.141592979 ; (1)' };
p = regexp( c{1}, '^(\d+\.\d+)\s(\d+\.\d+)\s*;\s*\((\d+)\)$', 'tokens', 'once' ); %//parse the input string
numbers = str2mat(p); %// convert extracted strings to numerical values
Example result
ans =
0.050822999
3.141592979
1
Explaining the regexp pattern:
^ - pattern starts at the beginning of the input string
(\d+\.\d+) - parentheses ('()') enclosing this sub-pattern indicates it as a single token
\d+ matches one or more digits, then expecting \. a dot (notice the \, since . alone in regexp acts as a wildcard) and after the dot \d+ one or more digits are expected.
This token should correspond to the first number, e.g., 0.050822999
\s expecting a single space
(\d+\.\d+) - again, expecting another decimal fraction as the second token.
\s* - expecting white space (zero or more).
; - capture the ; in the expression, but not as a token.
\s+ - expecting white space (zero or more).
\( - expecting an open parenthesis, note the \ since parentheses in regexp are used to denote tokens.
(\d+) - expecting one or more digits as the third token, only integer numbers are expected here. no decimal point.
\) - expecting a closing parenthesis.
$ - pattern should reach the end of the input string.
You can use something like this (if I understood you correctly)
function str_dump(var)
info = whos;
disp([info.class ' ' mat2str(info.size) ' : ' var]);
end
This just shows information about the string. If you want to parse it and convert to another Matlab's structure, you have to explain it more carefully.
%// Input
a = [0.050822999 3.141592979];
n = 1;
%// Output
str = [num2str(a,'%0.9f ') ' ; (' num2str(n) ')']
Result:
str =
0.050822999 3.141592979 ; (1)