Why can the order of characters in a regex affect sed?

The tv.txt file is as follows:
mms://live21.gztv.com/gztv_gz 广州台[可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3]
mms://live21.gztv.com/gztv_news 广州新闻台·直播广州(可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3)
mms://live21.gztv.com/gztv_kids 广州少儿台(可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3)
mms://live21.gztv.com/gztv_econ 广州经济台
I want to split each line into three groups.
sed -r 's/([^ ]*)\s([^][()]*)((\(.+\))*|(\[.+\])*)/\3/' tv.txt
I got this result:
[可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3]
(可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3)
(可于Totem/VLC/MPlayer播放,记得把高宽比设置成4:3)
When I rewrite it as
sed -r 's/([^ ]*)\s([^[]()]*)((\(.+\))*|(\[.+\])*)/\3/' tv.txt
it doesn't work.
The only difference is [^][()] versus [^[]()]. Escaping the brackets, as in [^\[\]()], does not make it work either.
I want to know the reason.

The POSIX rules for getting ] into a character class are a little arcane, but they make sense when you think about it hard.
For a positive (non-negated) character class, the ] must be the first character:
[]and]
This recognizes any character a, n, d or ] as part of the character class.
For a negated character class, the ] must be the first character after the ^:
[^]and]
This recognizes any character except a, n, d or ] as part of the character class.
Otherwise, the first ] after the [ marks the end of the character class. Inside a character class, most of the normal regex special characters lose their special meaning, and others (notably - minus) acquire special meanings. (If you want a - in a character class, it has to be 'first' or last, where 'first' means 'after the optional ^ and only if ] is not present'.)
In your examples:
[^][()] is a negated character class that matches any character except [, ], ( or ), but
[^[]()] is a negated character class that matches any character except [, followed by whatever () symbolizes in the regex family you're using, followed by a ] that represents itself.
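The two parses can be checked quickly. The sketch below uses Python's re module rather than sed, purely as an assumption of convenience: Python is not a POSIX engine, but it treats a ] placed first in a class, and an unescaped ] outside a class, the same way in these two patterns. The sample string and function names are made up for illustration.

```python
import re

# "]" placed first after "^" is a literal member of the class, so this
# class excludes all four bracket characters.
GOOD = re.compile(r'[^][()]+')

# Here the class ends at the first "]": the pattern is "[^[]" (anything
# but "["), then "()" (an empty group), then a literal "]".
BAD = re.compile(r'[^[]()]')

sample = 'ab[cd](ef)'  # made-up input


def runs_without_brackets(text):
    """Return the maximal runs of characters that are not [ ] ( )."""
    return GOOD.findall(text)


def bad_pattern_matches(text):
    """Return every substring matched by the misordered pattern."""
    return [m.group(0) for m in BAD.finditer(text)]


print(runs_without_brackets(sample))
print(bad_pattern_matches(sample))
```

The second pattern only ever matches a two-character sequence ending in ], which is why it cannot serve as "any run of non-bracket characters" in the sed command.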

Can Sed match matching brackets?

My code has a ton of occurrences of something like:
idof(some_object)
I want to replace them with:
some_object["id"]
It sounds simple:
sed -i 's/idof(\([^)]\+\))/\1["id"]/g' source.py
The problem is that some_object might be something like idof(get_some_object()), or idof(my_class().get_some_object()), in which case, instead of getting what I want (get_some_object()["id"] or my_class().get_some_object()["id"]), I get get_some_object(["id"]) or my_class(["id"].get_some_object()).
Is there a way to have sed match closing bracket, so that it internally keeps track of any opening/closing brackets inside my (), and ignores those?
It needs to keep everything that's between those brackets: idof(ANYTHING) becomes ANYTHING["id"].
Using sed
$ sed -E 's/idof\(([[:alpha:][:punct:]]*)\)/\1["id"]/g' input_file
With ERE (-E), idof and the opening parenthesis are matched literally outside the capture group, so they are dropped from the replacement.
The literal closing parenthesis after the group is dropped as well, so everything in between, including any nested parentheses, is captured.
[[:alpha:]] matches all alphabetic characters, upper and lower case, while [[:punct:]] matches punctuation characters including ().-{} and more. Digits and whitespace are in neither class.
The g flag makes the substitution apply to every match on the line.
Theoretically, you can write a regex that handles all combinations of idof(....) up to some limit of nested () calls inside the ..... Such a regex would have to enumerate every possible combination of calls: for example, idof(one(two(three))) or idof(one(two(three)four(five)) can be matched with a regex like idof([^()]*([^()]*([^()]*)[^()]*)[^()]*) or idof([^()]*([^()]*([^()]*)[^()]*([^()]*)[^()]*) respectively.
The following regex handles only some cases, but it shows the complexity and the general approach. Writing a regex that handles every possible case and "eats" everything in front of the trailing ) is left to the OP as an exercise in why it is better to use something else. Note that handling string literals such as ")" makes this increasingly complex.
The following Bash code:
sed '
: begin
# No idof? Just print the line!
/^\(.*\)idof(\([^)]*)\)/!n
# Note: regex is greedy - we start from the back!
# Note: using newline as a stack separator.
s//\1\n\2/
# hold the front
{ h ; x ; s/\n.*// ; x ; s/[^\n]*\n// ; }
: handle_brackets
# Eat everything before final ) up to some number of nested ((())) calls.
# Insert more jokes here.
: eat_brackets
/^[^()]*\(([^()]*\(([^()]*\(([^()]*\(([^()]*\(([^()]*\(([^()]*)\)\?[^()]*)\)\?[^()]*)\)\?[^()]*)\)\?[^()]*)\)\?[^()]*)\)/{
s//&\n/
# Hold the front.
{ H ; x ; s/\n\([^\n]*\)\n.*/\1/ ; x ; s/[^\n]*\n// ; }
b eat_brackets
}
/^\([^()]*\))/!{
s/^/ERROR: eating brackets did not work: /
q1
}
# Add the id after trailing ) and remove it.
s//\1["id"]/
# Join with hold space and clear the hold space for next round
{ H ; s/.*// ; x ; s/\n//g ; }
# Restart for another idof if in input.
b begin
' <<EOF
before idof(some_object) after
before idof(get_some_object()) after
before idof(my_class().get_some_object()) after
before idof(one(two(three)four)five) after
before idof(one(two(three)four)five) between idof(one(two(three)four)five) after
before idof( one(two(three)four)five one(two(three)four)five ) after
before idof(one(two(three(four)five)six(seven(eight)nine)ten) between idof(one(two(three(four)five)six(seven(eight)nine)ten) after
EOF
Will output:
before some_object["id"] after
before get_some_object()["id"] after
before my_class().get_some_object()["id"] after
before one(two(three)four)five["id"] after
before one(two(three)four)five["id"] between one(two(three)four)five["id"] after
before one(two(three)four)five one(two(three)four)five ["id"] after
ERROR: eating brackets did not work: one(two(three(four)five)six(seven(eight)nine)ten) after
The last line is not handled correctly because the (()()) case is not covered by the regex. One would have to extend the regex to match it.

How to match exact string in perl

I am trying to parse all the files and check whether any file's content contains the string TESTDIR or TEST_DIR.
The file contents might look something like this:
TESTDIR = foo
include $(TESTDIR)/chop.mk
...
TEST_DIR := goldimage
MAKE_TESTDIR = var_make
NEW_TEST_DIR = tesing_var
Actually, I am only interested in TESTDIR, $(TESTDIR) and TEST_DIR; in my case the last two lines should be ignored. I am new to Perl. Can anyone help me out with a regex?
/\bTEST_?DIR\b/
\b means a "word boundary", i.e. the place between a word character and a non-word character. "Word" here has the Perl meaning: it consists of letters, digits, and underscores.
_? means "nothing or an underscore"
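To see why the underscore matters, the pattern can be tried on the sample lines from the question. The sketch below uses Python's re module instead of Perl as an assumption of convenience; \b and _? behave the same way for this ASCII input. The function name matching_lines is hypothetical.

```python
import re

# Same pattern as above: TESTDIR or TEST_DIR standing as a whole word.
PATTERN = re.compile(r'\bTEST_?DIR\b')

lines = [
    'TESTDIR = foo',
    'include $(TESTDIR)/chop.mk',
    'TEST_DIR := goldimage',
    'MAKE_TESTDIR = var_make',
    'NEW_TEST_DIR = tesing_var',
]


def matching_lines(candidates):
    """Keep only the lines in which the pattern is found."""
    return [line for line in candidates if PATTERN.search(line)]


for line in matching_lines(lines):
    print(line)
```

The last two lines are skipped because the character in front of TEST is an underscore, which counts as a word character, so there is no word boundary at that position.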
Look at "characterset".
Only a surrounding space is allowed:
/^(.* )?TEST_?DIR /
^ beginning of the line
(.* )? There may be some content .* but if there is, it must be followed by a space
The trailing space says that a whitespace character must follow. Otherwise use ( .*)?$ at the end.
One of a given character set is allowed:
Should characters other than a space be possible, you can use a character class []:
/^(.*[ \t(])?TEST_?DIR[) :=]/
(.*[ \t(])? In front of TEST_?DIR there may be a space, a \t (tab) or a (, or nothing if the line starts with TEST_?DIR itself.
Afterwards there must be one of space, :, = or ), followed by anything (the "=" of ":=" belongs to "anything").
One of a given group is allowed:
For this you need groups within (), with the alternatives separated by |:
/^(.*( |\t))?TEST_?DIR( | := | = )/
In this case ( |\t) at the beginning is equivalent to [ \t], because each alternative is only a single character (a space or \t).
At the end there must be a single space, or := surrounded by spaces, or = surrounded by spaces, followed by anything...
You can use any combination...
/^(.*[ \t(])?TEST_?DIR([) =:]| :=| =|)/
Test it on Debuggex.com. (Use 'PCRE')
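As a quick offline check of the combined pattern, here is a small sketch using Python's re module (an assumption of convenience; the syntax used is common to Perl and Python). The function name is_hit and the line TESTDIRS = x are made up for illustration.

```python
import re

# The combined pattern from above, copied as-is.
COMBINED = re.compile(r'^(.*[ \t(])?TEST_?DIR([) =:]| :=| =|)')


def is_hit(line):
    """True if the combined pattern is found in the line."""
    return COMBINED.search(line) is not None


for line in ['TESTDIR = foo',
             'include $(TESTDIR)/chop.mk',
             'TEST_DIR := goldimage',
             'MAKE_TESTDIR = var_make',
             'NEW_TEST_DIR = tesing_var',
             'TESTDIRS = x']:          # hypothetical extra line
    print(is_hit(line), line)
```

Note that the last alternative in the trailing group is empty, so nothing is actually required after DIR: the hypothetical line TESTDIRS = x is accepted too. Dropping the final | would make one of the listed suffixes mandatory.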

Get Event Log Message content in a Variable

I want to get the first "WDS.Device.ID" (00-15-5D-8A-44-25), without the [] brackets, into a variable.
I tried some RegEx things but without success as I lack the knowledge for it.
PS C:\Windows\system32> $result | fl
Message : A device query was successfully processed (status 0x0):
Input:
WDS.Request.Type='Deployment'
WDS.Client.Property.Architecture.Process='X64'
WDS.Client.Property.Architecture.Native='X64'
WDS.Client.Property.Firmware.Type='BIOS'
WDS.Client.Property.SMBIOS.Manufacturer='Microsoft Corporation'
WDS.Client.Property.SMBIOS.Model='Virtual Machine'
WDS.Client.Property.SMBIOS.Vendor='American Megatrends Inc.'
WDS.Client.Property.SMBIOS.Version='090008 '
WDS.Client.Property.SMBIOS.ChassisType='Desktop'
WDS.Client.Property.SMBIOS.UUID={CCD695BE-20AB-48CC-8F01-319B498F7A69}
WDS.Client.Request.Version=1.0.0.0
WDS.Client.Version=10.0.18362.1
WDS.Client.Host.Version=10.0.18362.1
WDS.Client.DDP.Default.Match=FALSE
WDS.Device.ID=[00-15-5D-8A-44-25]
WDS.Device.ID=[BE-95-D6-CC-AB-20-CC-48-8F-01-31-9B-49-8F-7A-69]
Output:
WDS.Client.Property.Architecture.Process='X64'
WDS.Client.Property.Architecture.Native='X64'
WDS.Client.Property.Firmware.Type='BIOS'
WDS.Client.Property.SMBIOS.Manufacturer='Microsoft Corporation'
WDS.Client.Property.SMBIOS.Model='Virtual Machine'
WDS.Client.Property.SMBIOS.Vendor='American Megatrends Inc.'
WDS.Client.Property.SMBIOS.Version='090008 '
WDS.Client.Property.SMBIOS.ChassisType='Desktop'
WDS.Client.Property.SMBIOS.UUID={CCD695BE-20AB-48CC-8F01-319B498F7A69}
WDS.Client.Request.Version=1.0.0.0
WDS.Client.Version=10.0.18362.1
WDS.Client.Host.Version=10.0.18362.1
WDS.Client.DDP.Default.Match=FALSE
WDS.Client.Request.ResendAuthenticated=TRUE
Turning my comment into an answer.
If the message you show is inside a string variable (let's call it $message), then you can use regex to get the value for the WDS.Device.ID without the brackets like this:
$deviceID = ([regex]'(?i)WDS\.Device\.ID=\[((?:[0-9a-f]{2}-){5}[0-9a-f]{2})\]').Match($message).Groups[1].Value
Result:
00-15-5D-8A-44-25
Regex details:
WDS Match the characters “WDS” literally
\. Match the character “.” literally
Device Match the characters “Device” literally
\. Match the character “.” literally
ID= Match the characters “ID=” literally
\[ Match the character “[” literally
( Match the regular expression below and capture its match into backreference number 1
(?: Match the regular expression below
[0-9a-f] Match a single character present in the list below
A character in the range between “0” and “9”
A character in the range between “a” and “f”
{2} Exactly 2 times
- Match the character “-” literally
){5} Exactly 5 times
[0-9a-f] Match a single character present in the list below
A character in the range between “0” and “9”
A character in the range between “a” and “f”
{2} Exactly 2 times
)
\] Match the character “]” literally
The (?i) in the regex makes it case-insensitive
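The same pattern can be exercised outside PowerShell. The sketch below uses Python's re module as an assumption of convenience; the pattern is copied unchanged, the message is a trimmed-down stand-in for the event text in the question, and the function name first_device_id is hypothetical.

```python
import re

# Trimmed-down stand-in for the event message shown in the question.
message = """A device query was successfully processed (status 0x0):
Input:
WDS.Client.DDP.Default.Match=FALSE
WDS.Device.ID=[00-15-5D-8A-44-25]
WDS.Device.ID=[BE-95-D6-CC-AB-20-CC-48-8F-01-31-9B-49-8F-7A-69]
"""

# Six two-digit hex pairs separated by "-", enclosed in literal brackets.
DEVICE_ID = re.compile(
    r'(?i)WDS\.Device\.ID=\[((?:[0-9a-f]{2}-){5}[0-9a-f]{2})\]'
)


def first_device_id(text):
    """Return the first six-pair device ID without brackets, or None."""
    match = DEVICE_ID.search(text)
    return match.group(1) if match else None


print(first_device_id(message))
```

Because the closing \] must follow exactly six pairs, the longer second WDS.Device.ID value can never be picked up by this pattern.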
here's another way to go about it. this presumes the $Result variable holds one multiline string AND that the 1st [ & the 1st ] are "bracketing" your target data. [grin]
$Result.Split('[')[1].Split(']')[0]
output = 00-15-5D-8A-44-25
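For comparison, the split idea looks like this as a Python sketch (Python is used only for illustration, and the function name between_first_brackets is hypothetical). It relies on the same presumption stated above: the first [ and the first ] in the text enclose the wanted value.

```python
# Trimmed-down stand-in for the event message shown in the question.
message = """Input:
WDS.Client.Property.SMBIOS.UUID={CCD695BE-20AB-48CC-8F01-319B498F7A69}
WDS.Device.ID=[00-15-5D-8A-44-25]
WDS.Device.ID=[BE-95-D6-CC-AB-20-CC-48-8F-01-31-9B-49-8F-7A-69]
"""


def between_first_brackets(text):
    """Return the text between the first '[' and the next ']'.

    Raises IndexError when the text contains no '['.
    """
    after_open = text.split('[', 1)[1]   # everything after the first '['
    return after_open.split(']', 1)[0]   # everything before the next ']'


print(between_first_brackets(message))
```

This is shorter than the regex but does no validation: any earlier [ in the message would change the result.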

Why does C11 define character constants recursively?

Character constants are defined in C11 as:
 Syntax
  character-constant:
   ' c-char-sequence '
   L' c-char-sequence '
   u' c-char-sequence '
   U' c-char-sequence '
  c-char-sequence:
   c-char
   c-char-sequence c-char
  c-char:
   any member of the source character set except the single-quote ', backslash \, or new-line character
   escape-sequence
It is defined recursively, so inside the single quotes there are one or more c-chars, like 'abc'.
However, as far as I know, a character constant contains only one c-char, like 'a', doesn't it?
"as I know, a character constant contains only one c-char, like 'a', doesn't it?"
No, 'abcd' is also a character constant. Its value is technically implementation-defined, but everywhere I've looked it was formed from the values of the individual chars in big-endian order (in this case, 0x61626364).
The C side of cppreference has a discussion of various character constants
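The arithmetic behind 0x61626364 can be spelled out. The sketch below is written in Python purely to show the packing. It models only the big-endian behaviour reported in the answer, assuming 8-bit characters; since the value is implementation-defined, a given compiler may do something else. The function name multichar_value is hypothetical.

```python
def multichar_value(chars: str) -> int:
    """Pack character codes into one integer, first character most significant.

    Models the big-endian packing described above; assumes each
    character code fits in 8 bits.
    """
    value = 0
    for ch in chars:
        value = (value << 8) | ord(ch)   # shift earlier chars up one byte
    return value


print(hex(multichar_value('abcd')))
```

With 'a' = 0x61, 'b' = 0x62, 'c' = 0x63 and 'd' = 0x64, each step shifts the running value left by one byte and appends the next code, giving 0x61626364.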

Is it possible to match any character that is not ']' in PATINDEX?

I need to find the index of the first character that is not ]. Normally to match any character except X, you use the pattern [^X]. The problem is that [^]] simply closes the first bracket too early. The first part, [^], will match any character.
In the documentation for the LIKE operator, if you scroll down to the section "Using Wildcard Characters As Literals", it shows a table of methods to indicate literal characters like [ and ] inside a pattern. It makes no mention of using [ or ] inside double brackets. If the pattern is being used with the LIKE operator, you would use the ESCAPE clause. LIKE doesn't return an index and PATINDEX doesn't seem to have a parameter for an escape clause.
Is there no way to do this?
(This may seem arbitrary. To put some context around it, I need to match ] immediately followed by a character that is not ] in order to locate the end of a quoted identifier. ]] is the only character escape inside a quoted identifier.)
This isn't possible. The Connect item PATINDEX Missing ESCAPE Clause is closed as won't fix.
I'd probably use CLR and regular expressions.
A simple implementation might be
using System.Data.SqlTypes;
using System.Text.RegularExpressions;

public partial class UserDefinedFunctions
{
    [Microsoft.SqlServer.Server.SqlFunction]
    public static SqlInt32 PatIndexCLR(SqlString pattern, SqlString expression)
    {
        // NULL in, NULL out
        if (pattern.IsNull || expression.IsNull)
            return new SqlInt32();

        Match match = Regex.Match(expression.ToString(), pattern.ToString());
        if (match.Success)
        {
            // Match.Index is 0-based; PATINDEX is 1-based
            return new SqlInt32(match.Index + 1);
        }
        else
        {
            return new SqlInt32(0);
        }
    }
}
With example usage
SELECT [dbo].[PatIndexCLR] ( N'[^]]', N']]]]]]]]ABC[DEF');
If that is not an option, a possible (flaky) workaround is to replace ] with a character that is unlikely to appear in the data and has no special meaning in the pattern grammar.
WITH T(Value) AS
(
SELECT ']]]]]]]]ABC[DEF'
)
SELECT PATINDEX('%[^' + char(7) + ']%', REPLACE(Value,']', char(7)))
FROM T
(Returns 9)
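Both ideas can be traced step by step outside SQL Server. The sketch below is in Python as an assumption of convenience: pat_index mirrors the logic of the CLR function using Python's re instead of .NET's Regex, and first_non_bracket_index mirrors the REPLACE workaround. Both function names are hypothetical.

```python
import re

PLACEHOLDER = '\x07'  # same stand-in character as char(7) above


def pat_index(pattern, expression):
    """1-based index of the first regex match, 0 if none, None for NULL input."""
    if pattern is None or expression is None:
        return None
    match = re.search(pattern, expression)
    return match.start() + 1 if match else 0


def first_non_bracket_index(value):
    """1-based index of the first character that is not ']', 0 if none."""
    if PLACEHOLDER in value:
        # This is what makes the workaround flaky: data that already
        # contains the placeholder would be misread as containing ']'.
        raise ValueError('placeholder character already present in data')
    swapped = value.replace(']', PLACEHOLDER)
    for position, ch in enumerate(swapped, start=1):
        if ch != PLACEHOLDER:
            return position
    return 0


print(pat_index(r'[^]]', ']]]]]]]]ABC[DEF'))
print(first_non_bracket_index(']]]]]]]]ABC[DEF'))
```

Both calls report position 9, matching the result quoted above. The explicit check for the placeholder makes the workaround's failure mode visible instead of silently returning a wrong index.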