How to match exact string in perl - perl

I am trying to parse all the files and verify if any of the file content has strings TESTDIR or TEST_DIR
Files contents might look something like:-
TESTDIR = foo
include $(TESTDIR)/chop.mk
...
TEST_DIR := goldimage
MAKE_TESTDIR = var_make
NEW_TEST_DIR = tesing_var
Actually I am only interested in TESTDIR ,$(TESTDIR),TEST_DIR but in my case last two lines should be ignored. I am new to perl , Can anyone help me out with re-rex.

/\bTEST_?DIR\b/
\b means a "word boundary", i.e. the place between a word character and a non-word character. "Word" here has the Perl meaning: it contains characters, numbers, and underscores.
_? means "nothing or an underscore"

Look at "characterset".
Only (space) surrounding allowed:
/^(.* )?TEST_?DIR /
^ beginning of the line
(.* )? There may be some content .* but if, its must be followed by a space
at the and says that a whitespace must be there. Otherwise use ( .*)?$ at the end.
One of a given characterset is allowed:
Should the be other characters then a space be possible you can use a character class []:
/^(.*[ \t(])?TEST_?DIR[) :=]/
(.*[ \t(])? in front of TEST_?DIR may be a (space) or a \t (tab) or ( or nothing if the line starts with itself.
afterwards there must be one of (space) or : or = or ). Followd by anything (to "anything" belongs the "=" of ":=" ...).
One of a given group is allowed:
So you need groups within () each possible group in there devided by a |:
/^(.*( |\t))?TEST_?DIR( | := | = )/
In this case, at the beginning is no change to [ \t] because each group holds only one character and \t.
At the end, there must be (single space) or := (':=' surrounded by spaces) or = ('=' surrounded by spaces), following by anything...
You can use any combination...
/^(.*[ \t(])?TEST_?DIR([) =:]| :=| =|)/
Test it on Debuggex.com. (Use 'PCRE')

Related

Can Sed match matching brackets?

My code has a ton of occurrences of something like:
idof(some_object)
I want to replace them with:
some_object["id"]
It sounds simple:
sed -i 's/idof(\([^)]\+\))/\1["id"]/g' source.py
The problem is that some_object might be something like idof(get_some_object()), or idof(my_class().get_some_object()), in which case, instead of getting what I want (get_some_object()["id"] or my_class().get_some_object()["id"]), I get get_some_object(["id"]) or my_class(["id"].get_some_object()).
Is there a way to have sed match closing bracket, so that it internally keeps track of any opening/closing brackets inside my (), and ignores those?
It needs to keep everything that's between those brackets: idof(ANYTHING) becomes ANYTHING["id"].
Using sed
$ sed -E 's/idof\(([[:alpha:][:punct:]]*)\)/\1["id"]/g' input_file
Using ERE, exclude idof and the first opening parenthesis.
As a literal closing parenthesis is also excluded, everything in-between the capture parenthesis including additional parenthesis will be captured.
[[:alpha:]] will match all alphabetic characters including upper and lower case while [[:punct:]] will capture punctuation characters including ().-{} and more.
The g option will make the substitution as many times as the pattern is found.
Theoretically, you can write a regex that will handle all combinations of idof(....) up to some limit of nested () calls inside ..... Such regex would have to list with all possible combinations of calls, like idof(one(two(three))) or idof(one(two(three)four(five)) you can match with an appropriate regex like idof([^()]*([^()]*([^()]*)[^()]*)[^()]*) or idof([^()]*([^()]*([^()]*)[^()]*([^()]*)[^()]*) respectively.
The following regex handles only some cases, but shows the complexity and general path. Writing a regex to handle all possible cases to "eat" everything in front of the trailing ) is left to OP as an exercise why it's better to use something else. Note that handling string literals ")" becomes increasingly complex.
The following Bash code:
sed '
: begin
# No idof? Just print the line!
/^\(.*\)idof(\([^)]*)\)/!n
# Note: regex is greedy - we start from the back!
# Note: using newline as a stack separator.
s//\1\n\2/
# hold the front
{ h ; x ; s/\n.*// ; x ; s/[^\n]*\n// ; }
: handle_brackets
# Eat everything before final ) up to some number of nested ((())) calls.
# Insert more jokes here.
: eat_brackets
/^[^()]*\(([^()]*\(([^()]*\(([^()]*\(([^()]*\(([^()]*\(([^()]*)\)\?[^()]*)\)\?[^()]*)\)\?[^()]*)\)\?[^()]*)\)\?[^()]*)\)/{
s//&\n/
# Hold the front.
{ H ; x ; s/\n\([^\n]*\)\n.*/\1/ ; x ; s/[^\n]*\n// ; }
b eat_brackets
}
/^\([^()]*\))/!{
s/^/ERROR: eating brackets did not work: /
q1
}
# Add the id after trailing ) and remove it.
s//\1["id"]/
# Join with hold space and clear the hold space for next round
{ H ; s/.*// ; x ; s/\n//g ; }
# Restart for another idof if in input.
b begin
' <<EOF
before idof(some_object) after
before idof(get_some_object()) after
before idof(my_class().get_some_object()) after
before idof(one(two(three)four)five) after
before idof(one(two(three)four)five) between idof(one(two(three)four)five) after
before idof( one(two(three)four)five one(two(three)four)five ) after
before idof(one(two(three(four)five)six(seven(eight)nine)ten) between idof(one(two(three(four)five)six(seven(eight)nine)ten) after
EOF
Will output:
before some_object["id"] after
before get_some_object()["id"] after
before my_class().get_some_object()["id"] after
before one(two(three)four)five["id"] after
before one(two(three)four)five["id"] between one(two(three)four)five["id"] after
before one(two(three)four)five one(two(three)four)five ["id"] after
ERROR: eating brackets did not work: one(two(three(four)five)six(seven(eight)nine)ten) after
The last line is not handled correctly, because (()()) case is not correctly handled. One would have to write a regex to match it.

Avoiding duplicate items in a comma-separated list of two-letter words

I need to write a regex which allows a group of 2 chars only once. This is my current regex :
^([A-Z]{2},)*([A-Z]{2}){1}$
This allows me to validate something like this :
AL,RA,IS,GD
AL
AL,RA
The problem is that it validates also AL,AL and AL,RA,AL.
EDIT
Here there are more details.
What is allowed:
AL,RA,GD
AL
AL,RA
AL,IS,GD
What it shouldn't be allowed:
AL,RA,AL
AL,AL
AL,RA,RA
AL,IS,AL
IS,IS,AL
IS,GD,GD
IS,GD,IS
I need that every group of two characters appears only once in the sequence.
Try something like this expression:
/^(?:,?(\b\w{2}\b)(?!.*\1))+$/gm
I have no knowledge of swift, so take it with a grain of salt. The idea is basically to only match a whole line while making sure that no single matched group occurs at a later point in the line.
First of all, let's shorten your pattern. It can be easily achieved since the length of each comma-separated item is fixed and the list items are only made up of uppercase ASCII letters. So, your pattern can be written as ^(?:[A-Z]{2}(?:,\b)?)+$. See this regex demo.
Now, you need to add a negative lookahead that will check the string for any repeating two-letter sequence at any distance from the start of string, and within any distance between each. Use
^(?!.*\b([A-Z]{2})\b.*\b\1\b)(?:[A-Z]{2}(?:,\b)?)+$
See the regex demo
Possible implementation in Swift:
func isValidInput(Input:String) -> Bool {
return Input.range(of: #"^(?!.*\b([A-Z]{2})\b.*\b\1\b)(?:[A-Z]{2}(?:,\b)?)+$"#, options: .regularExpression) != nil
}
print(isValidInput(Input:"AL,RA,GD")) // true
print(isValidInput(Input:"AL,RA,AL")) // false
Details
^ - start of string
(?!.*\b([A-Z]{2})\b.*\b\1\b) - a negative lookahead that fails the match if, immediately to the right of the current location, there is:
.* - any 0+ chars other than line break chars, as many as possible
\b([A-Z]{2})\b - a two-letter word as a whole word
.* - any 0+ chars other than line break chars, as many as possible
\b\1\b - the same whole word as in Group 1. NOTE: The word boundaries here are not necessary in the current scenario where the word length is fixed, it is two, but if you do not know the word length, and you have [A-Z]+, you will need the word boundaries, or other boundaries depending on the situation
(?:[A-Z]{2}(?:,\b)?)+ - 1 or more sequences of:
[A-Z]{2} - two uppercase ASCII letters
(?:,\b)? - an optional sequence: , only if followed with a word char: letter, digit or _. This guarantees that , won't be allowed at the end of the string
$ - end of string.
You can use a negative lookahead with a back-reference:
^(?!.*([A-Z]{2}).*\1).*
if, as in the all the examples in the question, it is known that the string contains only comma-separated pairs of capital letters. I will relax that assumption later in my answer.
Demo
The regex performs the following operations:
^ # match beginning of line
(?! # begin negative lookahead
.* # match 0+ characters (1+ OK)
([A-Z]{2}) # match 2 uppercase letters in capture group 1
.* # match 0+ characters (1+ OK)
\1 # match the contents of capture group 1
) # end negative lookahead
.* # match 0+ characters (the entire string)
Suppose now that one or more capital letters may appear between each pair of commas, or before the first comma or after the last comma, but it is only strings of two letters that cannot be repeated. Moreover, I assume the regex must confirm the regex has the desired form. Then the following regex could be used:
^(?=[A-Z]+(?:,[A-Z]+)*$)(?!.*(?:^|,)([A-Z]{2}),(?:.*,)?\1(?:,|$)).*
Demo
The regex performs the following operations:
^ # match beginning of line
(?= # begin pos lookahead
[A-Z]+ # match 1+ uc letters
(?:,[A-Z]+) # match ',' then by 1+ uc letters in a non-cap grp
* # execute the non-cap grp 0+ times
$ # match the end of the line
) # end pos lookahead
(?! # begin neg lookahead
.* # match 0+ chars
(?:^|,) # match beginning of line or ','
([A-Z]{2}) # match 2 uc letters in cap grp 1
, # match ','
(?:.*,) # match 0+ chars, then ',' in non-cap group
? # optionally match non-cap grp
\1 # match the contents of cap grp 1
(?:,|$) # match ',' or end of line
) # end neg lookahead
.* # match 0+ chars (entire string)
If there is no need check that the string contains only comma-separated strings of one or more upper case letters the postive lookahead at the beginning can be removed.

Get Event Log Message content in a Variable

I want to get the the first "WDS.Device.ID" (00-15-5D-8A-44-25) (without the [] brackets) into a variable.
I tried some RegEx things but without success as I lack the knowledge for it.
PS C:\Windows\system32> $result | fl
Message : A device query was successfully processed (status 0x0):
Input:
WDS.Request.Type='Deployment'
WDS.Client.Property.Architecture.Process='X64'
WDS.Client.Property.Architecture.Native='X64'
WDS.Client.Property.Firmware.Type='BIOS'
WDS.Client.Property.SMBIOS.Manufacturer='Microsoft Corporation'
WDS.Client.Property.SMBIOS.Model='Virtual Machine'
WDS.Client.Property.SMBIOS.Vendor='American Megatrends Inc.'
WDS.Client.Property.SMBIOS.Version='090008 '
WDS.Client.Property.SMBIOS.ChassisType='Desktop'
WDS.Client.Property.SMBIOS.UUID={CCD695BE-20AB-48CC-8F01-319B498F7A69}
WDS.Client.Request.Version=1.0.0.0
WDS.Client.Version=10.0.18362.1
WDS.Client.Host.Version=10.0.18362.1
WDS.Client.DDP.Default.Match=FALSE
WDS.Device.ID=[00-15-5D-8A-44-25]
WDS.Device.ID=[BE-95-D6-CC-AB-20-CC-48-8F-01-31-9B-49-8F-7A-69]
Output:
WDS.Client.Property.Architecture.Process='X64'
WDS.Client.Property.Architecture.Native='X64'
WDS.Client.Property.Firmware.Type='BIOS'
WDS.Client.Property.SMBIOS.Manufacturer='Microsoft Corporation'
WDS.Client.Property.SMBIOS.Model='Virtual Machine'
WDS.Client.Property.SMBIOS.Vendor='American Megatrends Inc.'
WDS.Client.Property.SMBIOS.Version='090008 '
WDS.Client.Property.SMBIOS.ChassisType='Desktop'
WDS.Client.Property.SMBIOS.UUID={CCD695BE-20AB-48CC-8F01-319B498F7A69}
WDS.Client.Request.Version=1.0.0.0
WDS.Client.Version=10.0.18362.1
WDS.Client.Host.Version=10.0.18362.1
WDS.Client.DDP.Default.Match=FALSE
WDS.Client.Request.ResendAuthenticated=TRUE
Turning my comment into an answer.
If the message you show is inside a string variable (let's call it $message), then you can use regex to get the value for the WDS.Device.ID without the brackets like this:
$devideID = ([regex]'(?i)WDS\.Device\.ID=\[((?:[0-9a-f]{2}-){5}[0-9a-f]{2})\]').Match($message).Groups[1].Value
Result:
00-15-5D-8A-44-25
Regex details:
WDS Match the characters “WDS” literally
\. Match the character “.” literally
Device Match the characters “Device” literally
\. Match the character “.” literally
ID= Match the characters “ID=” literally
\[ Match the character “[” literally
( Match the regular expression below and capture its match into backreference number 1
(?: Match the regular expression below
[0-9a-f] Match a single character present in the list below
A character in the range between “0” and “9”
A character in the range between “a” and “f”
{2} Exactly 2 times
- Match the character “-” literally
){5} Exactly 5 times
[0-9a-f] Match a single character present in the list below
A character in the range between “0” and “9”
A character in the range between “a” and “f”
{2} Exactly 2 times
)
] Match the character “]” literally
The (?i) in the regex makes it case-insensitive
here's another way to go about it. this presumes the $Result variable holds one multiline string AND that the 1st [ & the 1st ] are "bracketing" your target data. [grin]
$Result.Split('[')[1].Split(']')[0]
output = 00-15-5D-8A-44-25

How can I obtain only word without All Punctuation Marks when I read text file?

The text file abc.txt is an arbitrary article that has been scraped from the web. For example, it is as follows:
His name is "Donald" and he likes burger. On December 11, he married.
I want to extract only words in lower case and numbers except for all kinds of periods and quotes in the above article. In the case of the above example:
{his, name, is, Donald, and, he, likes, burger, on, December, 11, he, married}
My code is as follows:
filename = 'abc.txt';
fileID = fopen(filename,'r');
C = textscan(fileID,'%s','delimiter',{',','.',':',';','"','''});
fclose(fileID);
Cstr = C{:};
Cstr = Cstr(~cellfun('isempty',Cstr));
Is there any simple code to extract only alphabet words and numbers except all symbols?
Two steps are necessary as you want to convert certain words to lowercase.
regexprep converts words, which are either at the start of the string or follow a full stop and whitespace, to lower case.
In the regexprep function, we use the following pattern:
(?<=^|\. )([A-Z])
to indicate that:
(?<=^|\. ) We want to assert that before the word of interest either the start of string (^), or (|) a full stop (.) followed by whitespace are found. This type of construct is called a lookbehind.
([A-Z]) This part of the expression matches and captures (stores the match) a upper case letter (A-Z).
The ${lower($0)} component in the regex is called a dynamic expression, and replaces the contents of the captured group (([A-Z])) to lower case. This syntax is specific to the MATLAB language.
You can check the behaviour of the above expression here.
Once the lower case conversions have occurred, regexp finds all occurrences of one or more digits, lower case and upper case letters.
The pattern [a-zA-Z0-9]+ matches lower case letters, upper case letters and digits.
You can check the behavior of this regex here.
text = fileread('abc.txt')
data = {regexp(regexprep(text,'(?<=^|\. )([A-Z])','${lower($0)}'),'[a-zA-Z0-9]+','match')'}
>>data{1}
13×1 cell array
{'his' }
{'name' }
{'is' }
{'Donald' }
{'and' }
{'he' }
{'likes' }
{'burger' }
{'on' }
{'December'}
{'11' }
{'he' }
{'married' }

How do I print a tab character in Pascal?

I'm trying to figure out in all the Internets what's the special character for printing a simple tab in Pascal. I have to format a table in a CLI program and that would be handy.
Single non printable characters can be constructed using their ascii code prefixed with #
Since the ascii value for tab is 9, a tab is then #9. Characters such constructed must be outside literals, but don't need + to concatenate:
E.g.
const
sometext = 'firstfield'#9'secondfield'#13#10;
contains two fields separated by a tab, ended by a carriage return (#13) + a linefeed #10
The ' character can be made both via this route, or shorter by just ending the literal and reopening it:
const
some2 = '''bla'''; // will contain 'bla' with the ticks.
some3 = 'start''bla''end'; // will contain start'bla'end
write( ^i );
:-)