Could I specify pattern match priority in lex code?

I have a related thread on this site (My lex pattern doesn't work to match my input file, how to correct it?).
The problem I ran into concerns how greedily lex matches patterns. For example, here is my lex file:
$ cat b.l
%{
#include<stdio.h>
%}
%%
"12" {printf("head\n");}
"34" {printf("tail\n");}
.* {printf("content\n");}
%%
What I want is: when the scanner meets "12", print "head"; when it meets "34", print "tail"; otherwise print "content" for the longest match that contains neither "12" nor "34".
But in practice ".*" is a greedy match, so whatever I input, it prints "content".
My requirement is that when I use
12sdf2dfsd3sd34
as input, the output should be
head
content
tail
So there seem to be 2 possible ways:
1. Specify a match priority for ".*", so that it applies only when neither "12" nor "34" matches. Does lex support "priority"?
2. Change the 3rd expression to match any contiguous string that doesn't contain the substring "12" or "34". But how do I write this regular expression?

Does (f)lex support priority?
(F)lex always produces the longest possible match. If more than one rule matches the same longest match, the first one is chosen, so in that case it supports priority. But it does not support priority for shorter matches, nor does it implement non-greedy matching.
How to match a string which does not contain one or more sequences?
You can, with some work, create a regular expression which matches a string not containing specified substrings, but it is not particularly easy and (f)lex does not provide a simple syntax for such regular expressions.
A simpler (but slightly less efficient) solution is to match the string in pieces. As a rough outline, you could do the following:
"12"    { return HEAD; }
"34"    { if (yyleng > 2) {
            yyless(yyleng - 2);
            return CONTENT;
          }
          else
            return TAIL;
        }
.|\n    { yymore(); }
This could be made more efficient by matching multiple characters when there is no chance of skipping a delimiter; change the last rule to:
.|[^13]+ { yymore(); }
yymore() causes the current token to be retained, so that the next match appends to the current token rather than starting a new token. yyless(x) returns all but the first x characters to the input stream; in this case, that is used to cause the end delimiter 34 to be rescanned after the CONTENT token is identified.
(That assumes you actually want to tokenize the input stream, rather than just print a debugging message, which is why I called it an outline solution.)
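The piecewise idea can be illustrated outside of (f)lex. The following is a minimal Python sketch, not a translation of the flex rules: the function name scan and the token labels are assumptions chosen for illustration. It accumulates ordinary characters (the role yymore() plays above) and, on reaching a delimiter with content pending, emits the content first and leaves the delimiter to be scanned again (the role of yyless()).

```python
def scan(text, head="12", tail="34"):
    """Split text into HEAD / CONTENT / TAIL tokens, scanning left to right."""
    tokens = []
    pending = []  # characters accumulated so far, like text kept by yymore()
    i = 0
    while i < len(text):
        at_head = text.startswith(head, i)
        at_tail = text.startswith(tail, i)
        if at_head or at_tail:
            if pending:
                # Emit the accumulated content first. The delimiter is not
                # consumed here, so the next iteration rescans it, which is
                # what yyless() achieves in the flex outline.
                tokens.append(("CONTENT", "".join(pending)))
                pending = []
                continue
            delimiter = head if at_head else tail
            tokens.append(("HEAD" if at_head else "TAIL", delimiter))
            i += len(delimiter)
        else:
            pending.append(text[i])
            i += 1
    if pending:
        tokens.append(("CONTENT", "".join(pending)))
    return tokens


for kind, value in scan("12sdf2dfsd3sd34"):
    print(kind, value)
```

For the input from the question this yields a HEAD token, one CONTENT token, and a TAIL token, in that order.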

Related

Parsing Infix Mathematical Expressions in Swift Using Regular Expressions

I would like to convert a string that is formatted as an infix mathematical expression to an array of tokens, using regular expressions. I'm very new to regular expressions, so forgive me if the answer to this question turns out to be too trivial.
For example:
"31+2--3*43.8/1%(1*2)" -> ["31", "+", "2", "-", "-3", "*", "43.8", "/", "1", "%", "(", "1", "*", "2", ")"]
I've already implemented a method that achieves this task, however, it consists of many lines of code and a few nested loops. I figured that when I define more operators/functions that may even consist of multiple characters, such as log or cos, it would be easier to edit a regex string rather than adding many more lines of code to my working function. Are regular expressions the right job for this, and if so, where am I going wrong? Or am I better off adding to my working parser?
I've already referred to the following SO posts:
How to split a string, but also keep the delimiters?
This one was very helpful, but I don't believe I'm using 'lookahead' correctly.
Validate mathematical expressions using regular expression?
The solution to the question above doesn't convert the string into an array of tokens. Rather, it checks to see if the given string is a valid mathematical expression.
My code is as follows:
func convertToInfixTokens(expression: String) -> [String]?
{
    do
    {
        let pattern = "^(((?=[+-/*]))(-)?\\d+(\\.\\d+)?)*"
        let regex = try NSRegularExpression(pattern: pattern)
        let results = regex.matches(in: expression, range: NSRange(expression.startIndex..., in: expression))
        return results.map
        {
            String(expression[Range($0.range, in: expression)!])
        }
    }
    catch
    {
        return nil
    }
}
When I do pass a valid infix expression to this function, it returns nil. Where am I going wrong with my regex string?
NOTE: I haven't even gotten to the point of trying to parse parentheses as individual tokens. I'm still figuring out why it won't work on this expression:
"-99+44+2+-3/3.2-6"
Any feedback is appreciated, thanks!
Your pattern does not work because it only matches text at the start of the string (see the ^ anchor). The (?=[+-/*]) positive lookahead then requires the next char to be an operator from the specified set, but the only operator that you consume is an optional -. So, when * tries to match the enclosed pattern sequence a second time in -99+44+2+-3/3.2-6, it sees +44, and -?\d fails to match it (as -? cannot match +).
You may tokenize the expression using
let pattern = "(?<!\\d)-?\\d+(?:\\.\\d+)?|[-+*/%()]"
Details
(?<!\d) - there should be no digit immediately to the left of the current position
-? - an optional -
\d+ - 1 or more digits
(?:\.\d+)? - an optional sequence of . and 1+ digits
| - or
[-+*/%()] - a single operator or parenthesis character.
Output using your function:
Optional(["31", "+", "2", "-", "-3", "*", "43.8", "/", "1", "%", "(", "1", "*", "2", ")"])
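The same pattern can be tried quickly outside of Swift. This is a minimal Python sketch, assuming Python's re module treats the pattern the same way as NSRegularExpression for these inputs; the names TOKEN_RE and tokenize are illustrative only.

```python
import re

# Same pattern as above: a number (with an optional leading minus that is not
# preceded by a digit), or a single operator / parenthesis character.
TOKEN_RE = re.compile(r"(?<!\d)-?\d+(?:\.\d+)?|[-+*/%()]")


def tokenize(expression):
    """Return the list of number and operator tokens found in expression."""
    return TOKEN_RE.findall(expression)


print(tokenize("31+2--3*43.8/1%(1*2)"))
print(tokenize("-99+44+2+-3/3.2-6"))
```

The lookbehind is what separates a binary minus from a sign: in 2--3 the first - follows a digit and becomes an operator token, while the second does not and is absorbed into -3.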

How can I obtain only word without All Punctuation Marks when I read text file?

The text file abc.txt is an arbitrary article that has been scraped from the web. For example, it is as follows:
His name is "Donald" and he likes burger. On December 11, he married.
I want to extract only the words (with sentence-initial words converted to lower case) and numbers, dropping all periods, quotes, and other punctuation. For the example above:
{his, name, is, Donald, and, he, likes, burger, on, December, 11, he, married}
My code is as follows:
filename = 'abc.txt';
fileID = fopen(filename,'r');
C = textscan(fileID,'%s','delimiter',{',','.',':',';','"',''''});
fclose(fileID);
Cstr = C{:};
Cstr = Cstr(~cellfun('isempty',Cstr));
Is there any simple code to extract only alphabet words and numbers except all symbols?
Two steps are necessary as you want to convert certain words to lowercase.
regexprep converts words, which are either at the start of the string or follow a full stop and whitespace, to lower case.
In the regexprep function, we use the following pattern:
(?<=^|\. )([A-Z])
to indicate that:
(?<=^|\. ) We want to assert that before the word of interest either the start of string (^), or (|) a full stop (.) followed by whitespace are found. This type of construct is called a lookbehind.
([A-Z]) This part of the expression matches and captures (stores the match) an upper case letter (A-Z).
The ${lower($0)} component in the regex is called a dynamic expression. It converts the matched text ($0 is the whole match, which here is just the letter captured by ([A-Z])) to lower case. This syntax is specific to the MATLAB language.
Once the lower case conversions have occurred, regexp finds all occurrences of one or more digits, lower case and upper case letters.
The pattern [a-zA-Z0-9]+ matches lower case letters, upper case letters and digits.
text = fileread('abc.txt')
data = {regexp(regexprep(text,'(?<=^|\. )([A-Z])','${lower($0)}'),'[a-zA-Z0-9]+','match')'}
>>data{1}
13×1 cell array
{'his' }
{'name' }
{'is' }
{'Donald' }
{'and' }
{'he' }
{'likes' }
{'burger' }
{'on' }
{'December'}
{'11' }
{'he' }
{'married' }
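For comparison, the same two steps can be sketched in Python. This is an illustration under stated assumptions, not the MATLAB solution itself: Python's re module requires fixed-width lookbehinds, so the (?<=^|\. ) construct is rewritten as (?:^|(?<=\. )), and the function name extract_words is hypothetical.

```python
import re


def extract_words(text):
    """Lower-case sentence-initial letters, then collect alphanumeric runs."""
    # Step 1: lower-case a capital letter at the start of the string or
    # directly after a full stop followed by a space.
    lowered = re.sub(
        r"(?:^|(?<=\. ))[A-Z]",
        lambda m: m.group(0).lower(),
        text,
    )
    # Step 2: every run of letters and digits is a word; punctuation is skipped.
    return re.findall(r"[a-zA-Z0-9]+", lowered)


sample = 'His name is "Donald" and he likes burger. On December 11, he married.'
print(extract_words(sample))
```

Proper nouns such as Donald and December keep their capitals because they follow neither the start of the string nor a full stop.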

Match "com.project.name" but not when it contains something else

I have the following code:
var i = "test"
and
var i = "com.project.name.test"
print("something else")
fatalError("some error")
I have a regex:
"((?!com\.project\.name).)*"
to match any string that does NOT contain "com.project.name".
However, I want to modify it to keep the above condition but also not match if the line contains print\(.*?\) or fatalError\(.*?\).
Why do I want to do this? Because I can only use regex for SwiftLint custom rules, and right now my regex is greedy and matches every single string in the project that the developers forgot to localize.
What I've tried:
"((?!com\\.project\\.name).)*(?!print)(?!fatalError)"
but it does not work and instead matches the same as the original expression.
You may use this regex with a negative lookahead assertion:
^(?!.*(?:com\.project\.name|print\(|fatalError\()).*
This negative lookahead assertion uses alternation to fail the match if any of 3 different substrings occurs anywhere in the input:
com\.project\.name
print\(
fatalError\(
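The behaviour of this pattern on the four sample lines from the question can be checked with a short script. This is a minimal Python sketch, assuming Python's re module agrees with SwiftLint's regex engine for this pattern; the names LINE_RE and matching_lines are illustrative.

```python
import re

# Matches a whole line only if none of the three substrings occurs in it.
LINE_RE = re.compile(r"^(?!.*(?:com\.project\.name|print\(|fatalError\()).*")


def matching_lines(lines):
    """Return the lines that the pattern accepts."""
    return [line for line in lines if LINE_RE.match(line)]


samples = [
    'var i = "test"',
    'var i = "com.project.name.test"',
    'print("something else")',
    'fatalError("some error")',
]
print(matching_lines(samples))
```

Only the first sample line survives, since each of the other three contains one of the excluded substrings.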

Is it possible to match any character that is not ']' in PATINDEX?

I need to find the index of the first character that is not ]. Normally to match any character except X, you use the pattern [^X]. The problem is that [^]] simply closes the first bracket too early. The first part, [^], will match any character.
In the documentation for the LIKE operator, if you scroll down to the section "Using Wildcard Characters As Literals" it shows a table of methods to indicate literal characters like [ and ] inside a pattern. It makes no mention of using [ or ] inside double brackets. If the pattern is being used with the LIKE operator, you would use the ESCAPE clause. LIKE doesn't return an index and PATINDEX doesn't seem to have a parameter for an escape clause.
Is there no way to do this?
(This may seem arbitrary. To put some context around it, I need to match ] immediately followed by a character that is not ] in order to locate the end of a quoted identifier. ]] is the only character escape inside a quoted identifier.)
This isn't possible. The Connect item PATINDEX Missing ESCAPE Clause is closed as won't fix.
I'd probably use CLR and regular expressions.
A simple implementation might be
using System.Data.SqlTypes;
using System.Text.RegularExpressions;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlInt32 PatIndexCLR(SqlString pattern, SqlString expression)
    {
        if (pattern.IsNull || expression.IsNull)
            return new SqlInt32();

        Match match = Regex.Match(expression.ToString(), pattern.ToString());
        if (match.Success)
        {
            return new SqlInt32(match.Index + 1);
        }
        else
        {
            return new SqlInt32(0);
        }
    }
}
With example usage
SELECT [dbo].[PatIndexCLR] ( N'[^]]', N']]]]]]]]ABC[DEF');
If that is not an option a possible flaky workaround might be to substitute a character unlikely to be in the data without this special significance in the grammar.
WITH T(Value) AS
(
SELECT ']]]]]]]]ABC[DEF'
)
SELECT PATINDEX('%[^' + char(7) + ']%', REPLACE(Value,']', char(7)))
FROM T
(Returns 9)
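Both ideas are easy to prototype outside the database. The following is a minimal Python sketch, not SQL Server behaviour: the function names are hypothetical, the return values follow PATINDEX's convention (1-based position, 0 when nothing matches), and the second function mirrors the substitution workaround with the same placeholder, char(7).

```python
import re


def first_non_bracket(value):
    """1-based position of the first character that is not ']', else 0."""
    match = re.search(r"[^\]]", value)  # the bracket can be escaped in a regex
    return match.start() + 1 if match else 0


def first_non_bracket_via_substitution(value, placeholder="\a"):
    """Same result via substitution; placeholder "\\a" is char(7)."""
    swapped = value.replace("]", placeholder)
    for position, character in enumerate(swapped, start=1):
        if character != placeholder:
            return position
    return 0


sample = "]" * 8 + "ABC[DEF"
print(first_non_bracket(sample))
print(first_non_bracket_via_substitution(sample))
```

As with the SQL version, the substitution variant gives wrong answers if the placeholder character already occurs in the data.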

ignore spaces and cases MATLAB

diary_file = tempname();
diary(diary_file);
myFun();
diary('off');
output = fileread(diary_file);
I would like to search a string from output, but also to ignore spaces and upper/lower cases. Here is an example for what's in output:
the test : passed
number : 4
found = 'thetest:passed'
a = strfind(output,found )
How could I ignore spaces and cases from output?
Assuming you are not too worried about accidentally matching something like: 'thetEst:passed' here is what you can do:
Remove all spaces and only compare lower case
found = 'With spaces'
found = lower(found(found ~= ' '))
This will return
found =
withspaces
Of course you would also need to do this with each line of output.
Another way:
regexpi(output(~isspace(output)), found, 'match')
if output is a single string, or
regexpi(regexprep(output,'\s',''), found, 'match')
for the more general case (either class(output) == 'cell' or 'char').
Advantages:
fast
robust (ALL whitespace (not just spaces) is removed)
more flexible (you can return starting/ending indices of the match, tokenize, etc.)
will return original case of the match in output
Disadvantages:
more typing
less obvious (more documentation required)
will return original case of the match in output (yes, there's two sides to that coin)
That last point in both lists is easily forced to lower or uppercase using lower() or upper(), but if you want same-case, it's a bit more involved:
C = regexpi(output(~isspace(output)), found, 'match');
if ~isempty(C)
    C = found;
end
for single string, or
C = regexpi(regexprep(output, '\s', ''), found, 'match')
C(~cellfun('isempty', C)) = {found}
for the more general case.
You can use lower to convert everything to lowercase to solve your case problem. However ignoring whitespace like you want is a little trickier. It looks like you want to keep some spaces but not all, in which case you should split the string by whitespace and compare substrings piecemeal.
I'd advise using regex, e.g. like this:
a = regexpi(output, 'the\s*test\s*:\s*passed');
If you don't care about the position where the match occurs but only whether there's a match at all, removing all whitespace would be a brute-force, and somewhat nasty, possibility:
a = strfind(strrep(output, ' ',''), found);
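The strip-then-search approach from the answers above can also be sketched in Python for comparison. This is an illustration only, not MATLAB code: the function name find_ignoring_space_and_case is hypothetical, and re.escape is used on the assumption that the search text should be treated literally rather than as a pattern.

```python
import re


def find_ignoring_space_and_case(output, found):
    """Return the match with its original case, or None if not present."""
    compact = re.sub(r"\s+", "", output)  # drop ALL whitespace, not just spaces
    match = re.search(re.escape(found), compact, flags=re.IGNORECASE)
    return match.group(0) if match else None


output = "the test : passed\nnumber : 4\n"
print(find_ignoring_space_and_case(output, "thetest:passed"))
```

Because whitespace is removed from the whole text, a match may span what were originally separate lines, which is the same trade-off as in the MATLAB versions.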