What does Perl's modifier /m = multiline mean?

What does Perl's /m modifier mean in this example?
For example, let's say I have the following information in the text file Example.txt. Each line ends with a newline character, and the data record is marked by __Data__.
The input record separator is set to:
$/="__Data__";
Example.txt
__Data__
This is test A.\n
This is test B.\n
This is test C.\n
This is test D.\n
Question 1: after changing the input record separator to __Data__, would the ^ and $ characters be positioned as follows?
^__Data__
This is test A.\n
This is test B.\n
This is test C.\n
This is test D.\n$
Question 2: if I use the /m modifier with the input record separator still set to __Data__, would the ^ and $ characters be positioned as follows?
^__Data__$
^This is test A.\n$
^This is test B.\n$
^This is test C.\n$
^This is test D.\n$
if(/__Data__/m)
{
print;
}

/$/ is not affected by $/.
Without /m,
/^/ matches at the start of the string. (/(?-m:^)/ ⇔ /\A/)
/$/ matches at the end of the string, and before a newline at the end of the string. (/(?-m:$)/ ⇔ /\Z/ ⇔ /(?=\n\z)|\z/)
^__Data__\n "^" denotes where /(?-m:^)/ can match
This is test A.\n "$" denotes where /(?-m:$)/ can match
This is test B.\n
This is test C.\n
This is test D.$\n$
With /m,
/^/ matches at the start of the string and after a "\n", unless that "\n" is the last character of the string. (/(?m:^)/ ⇔ /\A|(?<=\n)(?!\z)/)
/$/ matches before a newline and at the end of the string. (/(?m:$)/ ⇔ /(?=\n)|\z/)
^__Data__$\n "^" denotes where /(?m:^)/ can match
^This is test A.$\n "$" denotes where /(?m:$)/ can match
^This is test B.$\n
^This is test C.$\n
^This is test D.$\n$
I was asked about
...$\n$
First, let's demonstrate:
>perl -E"say qq{abc\n} =~ /abc$/ ? 1 : 0"
1
>perl -E"say qq{abc\n} =~ /abc\n$/ ? 1 : 0"
1
The point is to allow /^abc$/ to match both "abc\n" and "abc".
>perl -E"say qq{abc\n} =~ /^abc$/ ? 1 : 0"
1
>perl -E"say qq{abc} =~ /^abc$/ ? 1 : 0"
1
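To see the positions for yourself, here is a minimal sketch that lists the offsets at which each anchor matches. The helper name match_positions and the sample string are my own, chosen only for illustration.

```perl
use strict;
use warnings;

# Return the offsets at which a (zero-width) pattern matches in $str.
sub match_positions {
    my ($re, $str) = @_;
    my @offsets;
    push @offsets, $-[0] while $str =~ /$re/g;
    return @offsets;
}

# Hypothetical sample: two lines, with a trailing newline.
# Offsets: 0 => 'a', 1 => "\n", 2 => 'b', 3 => "\n", 4 => end of string.
my $str = "a\nb\n";

my @cases = (
    [ '/^/',  qr/^/  ],    # 0
    [ '/$/',  qr/$/  ],    # 3,4
    [ '/^/m', qr/^/m ],    # 0,2
    [ '/$/m', qr/$/m ],    # 1,3,4
);

for my $case (@cases) {
    my ($label, $re) = @$case;
    printf "%-5s %s\n", $label, join ',', match_positions($re, $str);
}
```

Changing $/ before running this makes no difference to the offsets, since $/ only affects how readline and chomp split and trim input.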

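Your assumption is essentially correct: without /m, ^ and $ match only at the beginning and end of the string (with $ also matching before a final newline), whereas with /m they also match at the start and end of each line, i.e. after and before the embedded newlines.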

Related

Can somebody explain this obfuscated perl regexp script?

This code is taken from the HackBack DIY guide to rob banks by Phineas Fisher. It outputs a long text (The Sixth Declaration of the Lacandon Jungle). Where does it fetch it? I don't see any alphanumeric characters at all. What is going on here? And what does the -r switch do? It seems undocumented.
perl -Mre=eval <<\EOF
''
=~(
'(?'
.'{'.(
'`'|'%'
).("\["^
'-').('`'|
'!').("\`"|
',').'"(\\$'
.':=`'.(('`')|
'#').('['^'.').
('['^')').("\`"|
',').('{'^'[').'-'.('['^'(').('{'^'[').('`'|'(').('['^'/').('['^'/').(
'['^'+').('['^'(').'://'.('`'|'%').('`'|'.').('`'|',').('`'|'!').("\`"|
'#').('`'|'%').('['^'!').('`'|'!').('['^'+').('`'|'!').('['^"\/").(
'`'|')').('['^'(').('['^'/').('`'|'!').'.'.('`'|'%').('['^'!')
.('`'|',').('`'|'.').'.'.('`'|'/').('['^')').('`'|"\'").
'.'.('`'|'-').('['^'#').'/'.('['^'(').('`'|('$')).(
'['^'(').('`'|',').'-'.('`'|'%').('['^('(')).
'/`)=~'.('['^'(').'|</'.('['^'+').'>|\\'
.'\\'.('`'|'.').'|'.('`'|"'").';'.
'\\$:=~'.('['^'(').'/<.*?>//'
.('`'|"'").';'.('['^'+').('['^
')').('`'|')').('`'|'.').(('[')^
'/').('{'^'[').'\\$:=~/('.(('{')^
'(').('`'^'%').('{'^'#').('{'^'/')
.('`'^'!').'.*?'.('`'^'-').('`'|'%')
.('['^'#').("\`"| ')').('`'|'#').(
'`'|'!').('`'| '.').('`'|'/')
.'..)/'.('[' ^'(').'"})')
;$:="\."^ '~';$~='@'
|'(';$^= ')'^'[';
$/='`' |'.';
$,= '('
EOF
The basic idea of the code you posted is that each alphanumeric character has been replaced by a bitwise operation between two non-alphanumeric characters. For instance,
'`'|'%'
(the 5th line of the "star" in your code)
is a bitwise OR of the backquote and percent-sign characters, whose code points are 96 and 37 respectively; their OR is 101, which is the code point of the letter "e". The following few lines all print the same thing:
say '`' | '%' ;
say chr( ord('`' | '%') );
say chr( ord('`') | ord('%') );
say chr( 96 | 37 );
say chr( 101 );
say "e"
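The same check works for the next few pairs in the listing. This is a small sketch (the helper name combine is mine) that decodes the first four pairs of the obfuscated string:

```perl
use strict;
use warnings;

# Combine two one-character strings with the string-wise | or ^ operator.
# Perl uses the string-wise form because both operands are strings that
# have never been used as numbers.
sub combine {
    my ($left, $op, $right) = @_;
    return $op eq '|' ? ($left | $right) : ($left ^ $right);
}

# The first four pairs from the listing, in order.
my @pairs = (
    [ '`', '|', '%' ],    # 96 | 37 = 101
    [ '[', '^', '-' ],    # 91 ^ 45 = 118
    [ '`', '|', '!' ],    # 96 | 33 = 97
    [ '`', '|', ',' ],    # 96 | 44 = 108
);

my $decoded = join '', map { combine(@$_) } @pairs;
print "$decoded\n";    # prints "eval"
```

Those four characters are the start of the eval that shows up once the whole string is printed.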
Your code starts with (ignore whitespaces which don't matter):
'' =~ (
The corresponding closing bracket is 28 lines later:
^'(').'"})')
(C-f this pattern to see it on the web-page; I used my editor's matching parenthesis highlighting to find it)
We can assign everything in between the opening and closing parenthesis to a variable that we can then print:
$x = '(?'
.'{'.(
'`'|'%'
).("\["^
'-').('`'|
'!').("\`"|
',').'"(\\$'
.':=`'.(('`')|
'#').('['^'.').
('['^')').("\`"|
',').('{'^'[').'-'.('['^'(').('{'^'[').('`'|'(').('['^'/').('['^'/').(
'['^'+').('['^'(').'://'.('`'|'%').('`'|'.').('`'|',').('`'|'!').("\`"|
'#').('`'|'%').('['^'!').('`'|'!').('['^'+').('`'|'!').('['^"\/").(
'`'|')').('['^'(').('['^'/').('`'|'!').'.'.('`'|'%').('['^'!')
.('`'|',').('`'|'.').'.'.('`'|'/').('['^')').('`'|"\'").
'.'.('`'|'-').('['^'#').'/'.('['^'(').('`'|('$')).(
'['^'(').('`'|',').'-'.('`'|'%').('['^('(')).
'/`)=~'.('['^'(').'|</'.('['^'+').'>|\\'
.'\\'.('`'|'.').'|'.('`'|"'").';'.
'\\$:=~'.('['^'(').'/<.*?>//'
.('`'|"'").';'.('['^'+').('['^
')').('`'|')').('`'|'.').(('[')^
'/').('{'^'[').'\\$:=~/('.(('{')^
'(').('`'^'%').('{'^'#').('{'^'/')
.('`'^'!').'.*?'.('`'^'-').('`'|'%')
.('['^'#').("\`"| ')').('`'|'#').(
'`'|'!').('`'| '.').('`'|'/')
.'..)/'.('[' ^'(').'"})';
print $x;
This will print:
(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})
The remainder of the code is a bunch of assignments to some variables, probably there only to complete the pattern. The end of the star is:
$:="\."^'~';
$~='@'|'(';
$^=')'^'[';
$/='`'|'.';
$,='(';
Which just assigns simple one-character strings to some variables.
Back to the main code:
(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})
This code is inside a regex which is matched against an empty string (don't forget that we first had '' =~ (...)). (?{...}) inside a regex runs the code in the .... Adding some whitespace and unwrapping the string passed to eval gives us:
# fetch a URL from the web using curl _quietly_ (-s)
($: = `curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)
# replace end of paragraphs with newlines in the HTML fetched
=~ s|</p>|\n|g;
# Remove all HTML tags
$: =~ s/<.*?>//g;
# Print everything between SEXTA and Mexicano (+2 chars)
print $: =~ /(SEXTA.*?Mexicano..)/s
You can automate this deobfuscation process by using B::Deparse: running
perl -MO=Deparse yourcode.pl
Will produce something like:
'' =~ m[(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})];
$: = 'P';
$~ = 'h';
$^ = 'r';
$/ = 'n';
$, = '(';

Find and print two blocks of lines from file with sed in one pass

I am trying to come up with an sed command to find and print two blocks of variable number of lines from a text file that look like this:
...
INFO first block to match
id: "value"
...
last line of the first block
INFO next irrelevant block
id: "different value"
...
INFO second block to match
id: "value"
...
last line of the second block
...
I only have prior knowledge of the id value and the fact that each block starts with a line that has "INFO". I want to match each block from that first line and not include the first line of the next block in the output:
INFO first block to match
id: "value"
...
last line of the first block
INFO second block to match
id: "value"
...
last line of the second block
Ideally I would prefer to do it in a single pass, not have the file scanned multiple times from top to bottom. Currently I have this (it only matches the first block, and I need both):
sed -n -e "/INFO/{"'$!'"{N;/INFO.*id: \"value\"/{:l;p;n;/^[^\\[]/bl;}}}" file.log
EDIT
Linebreak between blocks is certainly nice, but entirely optional.
EDIT 2
Please note that INFO and id: "value" do not have to be in the beginning of the line, and all other words in my example are arbitrary and not known in advance. There can be any number of blocks (including 0) between and around the ones I need to match.
sed is powerful, terse, and dumb. awk is smarter!
awk '/^INFO/{f = /match/? 1: 0} f'
edit: I see you want a linebreak between each "block"; will update if I find a tighter way:
awk '/^INFO/{f = /match/? 1: 0; if(i++) $0 = RS $0} f'
/^INFO/{action}: Execute {action} only on lines beginning with "INFO"
variable = if ? then : else: Conditional Expression (ternary operator)
if(i++): The first time this is evaluated, i will be zero, thus the expression will be false. This prevents an extra line break at the first block.
$0 = RS $0: Prepend a Record Separator (newline) to $0 (entire record)
f: If f is nonzero, {print $0} is implied.
This might work for you (GNU sed):
sed -nE ':a;/^INFO/{N;/^id: "value"/M!D;:b;H;$!{n;/^INFO/!bb};x;s/^/x/;/^x{2}/{s/^x*.//p;q};x;ba}' file
This solution stores the required blocks in the hold space, prefixed by a counter. Once the required number of blocks are stored the counters are removed, the blocks printed and the process quit.
The solution (based only on the input provided) supposes that an id (if it exists) always follows the INFO line.
Here is an alternative solution using a combination of sed and awk. It allows you to parse the input blockwise or recordwise. This approach relies on setting awk record separator (RS) to the empty string which makes awk read a full block in at a time.
So there are 2 steps:
Make the input record-parsable.
Process each record.
For your example this could be something like this:
sed '1!s/^INFO/\n&/' infile | awk '/id: "value"/' RS= ORS='\n\n'
Output:
INFO first block to match
id: "value"
...
last line of the first block
INFO second block to match
id: "value"
...
last line of the second block
awk is good for this, and if you could set RS to a multi-character expression it would be ideal. (gnu awk allows this, but why bother with gnu awk when there is perl?)
perl -wnle 'BEGIN{$/="INFO"; undef $\} print "$/$_" if m/id: \"value\"/' input
Basically, this sets the record separator ($/) to the string "INFO" (so now each of your "records" is a "line" to perl). If the record matches the pattern id: "value", it is printed with "INFO" prepended to the start. (Without -l, perl would retain the record separator at the end of each record, which is not quite what you want.) By omitting the "undef $\", you can get an extra newline between records. Some code golf could probably cut the length of this in half, but my perl is a bit rusty. Waiting for the shorter version in comments.
This may or may not be what you want depending on what your real data looks like:
$ awk '/INFO/{info=$0; f=0} /id: "value"/{print info; f=1} f' file
INFO first block to match
id: "value"
...
last line of the first block
INFO second block to match
id: "value"
...
last line of the second block
or if you want to do more with each block than just print it as you go then some variation of this is better:
$ awk '
/INFO/ { prt() }
{ block = block $0 ORS }
END { prt() }
function prt() {
if (block ~ /id: "value"/) {
printf "%s", block
}
block=""
}
' file
INFO first block to match
id: "value"
...
last line of the first block
INFO second block to match
id: "value"
...
last line of the second block
The above will behave the same using any awk in any shell on any UNIX box.
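For comparison, here is roughly the same block-buffering idea sketched in Perl. The function name matching_blocks and the sample log are made up for illustration, and the sketch assumes that lines before the first INFO line can be dropped.

```perl
use strict;
use warnings;

# Split $text into blocks that each start at a line containing "INFO",
# then return only the blocks that match $wanted.
# Assumption of this sketch: lines before the first INFO line are dropped.
sub matching_blocks {
    my ($text, $wanted) = @_;
    my @blocks;
    for my $line (split /^/m, $text) {           # keeps each line's newline
        push @blocks, '' if $line =~ /INFO/;     # a new block starts here
        $blocks[-1] .= $line if @blocks;
    }
    return grep { $_ =~ $wanted } @blocks;
}

# Hypothetical sample input modelled on the question.
my $log = <<'END';
preamble
INFO first block to match
id: "value"
last line of the first block
INFO next irrelevant block
id: "different value"
INFO second block to match
id: "value"
last line of the second block
END

print matching_blocks($log, qr/id: "value"/);
```

Like the awk version, it walks the input once; the only state kept is the list of blocks seen so far.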

Use sed to replace word in 2-line pattern

I try to use sed to replace a word in a 2-line pattern with another word. When in one line the pattern 'MACRO "something"' is found then in the next line replace 'BLOCK' with 'CORE'. The "something" is to be put into a reference and printed out as well.
My input data:
MACRO ABCD
CLASS BLOCK ;
SYMMETRY X Y ;
Desired outcome:
MACRO ABCD
CLASS CORE ;
SYMMETRY X Y ;
My attempt in sed so far:
sed 's/MACRO \([A-Za-z0-9]*\)/,/ CLASS BLOCK ;/MACRO \1\n CLASS CORE ;/g' input.txt
The above did not work giving message:
sed: -e expression #1, char 30: unknown option to `s'
What am I missing?
I'm open to one-liner solutions in perl as well.
Thanks,
Gert
Using a perl one-liner in slurp mode:
perl -0777 -pe 's/MACRO \w+\n CLASS \KBLOCK ;/CORE ;/g' input.txt
Or using a streaming example:
perl -pe '
s/^\s*\bCLASS \KBLOCK ;/CORE ;/ if $prev;
$prev = $_ =~ /^MACRO \w+$/
' input.txt
Explanation:
Switches:
-0777: Slurp files whole
-p: Creates a while(<>){...; print} loop for each line in your input file.
-e: Tells perl to execute the code on command line.
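If the streaming one-liner is hard to read, the same logic can be spelled out as a small script. This is only a sketch: the sub name replace_after_macro and the extra fourth input line are mine, and \K needs perl 5.10 or later.

```perl
use strict;
use warnings;

# Replace BLOCK with CORE on a CLASS line only when the previous line
# was a "MACRO <name>" line. Returns modified copies of the lines.
sub replace_after_macro {
    my @lines = @_;
    my $prev_was_macro = 0;
    for my $line (@lines) {
        my $is_macro = $line =~ /^MACRO \w+$/;
        # \K keeps everything matched so far, so only "BLOCK ;" is replaced.
        $line =~ s/^\s*CLASS \KBLOCK ;/CORE ;/ if $prev_was_macro;
        $prev_was_macro = $is_macro;
    }
    return @lines;
}

print replace_after_macro(
    "MACRO ABCD\n",
    " CLASS BLOCK ;\n",
    " SYMMETRY X Y ;\n",
    " CLASS BLOCK ;\n",    # hypothetical: not preceded by MACRO, left alone
);
```

The flag is recomputed on every line, so a CLASS BLOCK line is only touched when it directly follows a MACRO line.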
When in one line the pattern 'MACRO "something"' is found then in the
next line replace 'BLOCK' with 'CORE'.
sed works on lines of input. If you want to perform substitution on the next line of a specified pattern, then you need to add that to the pattern space before being able to do so.
The following might work for you:
sed '/MACRO/{N;s/\(CLASS \)BLOCK/\1CORE/;}' filename
Quoting from the documentation:
`N'
Add a newline to the pattern space, then append the next line of
input to the pattern space. If there is no more input then sed
exits without processing any more commands.
If you want to make use of address range as in your attempt, then you need:
sed '/MACRO/,/CLASS BLOCK/{s/\(CLASS\) BLOCK/\1 CORE/}' filename
I'm not sure why you need a backreference for substituting the macro name.
You could try this awk command also,
awk '{print}/MACRO/ {getline; sub (/BLOCK/,"CORE");{print}}' file
It prints every line as is, and performs the replacement on the line that follows one containing the word MACRO.
Since getline has so many pitfalls I try not to use it, so:
awk 'f {sub(/BLOCK/,"CORE"); f=0} /MACRO/ {f=1} 1' file
MACRO ABCD
CLASS CORE ;
SYMMETRY X Y ;
This could do it
#!awk -f
BEGIN {
RS = ";"
}
/MACRO/ {
sub("BLOCK", "CORE")
}
{
printf "%s", (s++ ? ";" $0 : $0)
}
"line" ends with ;
substitute CORE for BLOCK in "lines" with MACRO
print ; followed by "line" unless first line

Check if string starts with given string

I am trying to check if the message after "5:16:51:209|INFO| " starts with "Marker". I need to add the string "|ICD" after the timestamp.
Input is: " 05:16:51:209|INFO|Markerprocedure Magnet "
I tried this regex, but it's not working. Please help me get it right.
if ( $lines[$i] =~ m/(\d{2}:\d{2}:\d{2}:\d{3})|(\w+)|^Marker/)
{
$lines[$i] =~ s/(\d{2}:\d{2}:\d{2}:\d{3})(.*)/$1|ICD$2/ ;
}
I am trying to check if message after "5:16:51:209|INFO| " starts with "Marker"
It seems you want to check whether Marker immediately follows 5:16:51:209|INFO|. In that case ^ is the wrong tool, because it asserts that the start of the string occurs at that position (which, of course, it doesn't). Remove the ^ character and Perl will check whether Marker immediately follows.
Also, you need to escape the | characters like this: \| to prevent it being treated as an alternation command in the regex. Then you can do the test and replace in a single substitution command:
if ( $lines[$i] =~ s/(\d{2}:\d{2}:\d{2}:\d{3})(\|\w+\|Marker)/$1|ICD$2/ )
{
# Line contained "Marker" and "|ICD" inserted
}
Example:
$ echo '15:16:51:209|INFO|Marker blah' | perl -ple 's/(\d{2}:\d{2}:\d{2}:\d{3})(\|\w+\|Marker)/$1|ICD$2/'
Output is:
15:16:51:209|ICD|INFO|Marker blah
Edit: @Prix has pointed out in the comments that if the timestamp is meant to appear at the start of the string, then the ^ start-marker should be at the start of the regex to prevent accidental matches in other parts of the string (and for performance):
s/^(\d{2}:\d{2}:\d{2}:\d{3})(\|\w+\|Marker)/$1|ICD$2/
↑
Use ^ here to anchor the search to the beginning of the string.
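Here is a short, self-contained sketch of the anchored substitution. The sub name tag_marker_line and the second sample line are made up for illustration.

```perl
use strict;
use warnings;

# Insert "|ICD" after the timestamp, but only when the field after the
# level (e.g. INFO) starts with "Marker". Other lines come back unchanged.
sub tag_marker_line {
    my ($line) = @_;
    $line =~ s/^(\d{2}:\d{2}:\d{2}:\d{3})(\|\w+\|Marker)/$1|ICD$2/;
    return $line;
}

print tag_marker_line("05:16:51:209|INFO|Markerprocedure Magnet"), "\n";
# prints 05:16:51:209|ICD|INFO|Markerprocedure Magnet

print tag_marker_line("05:16:51:209|INFO|Other message"), "\n";
# prints 05:16:51:209|INFO|Other message
```

Note that with the ^ anchor a line with leading whitespace, like the sample input in the question, would not match. If that space is really present in your data, you could start the pattern with ^\s* instead.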

Perl - partial pattern matching in a sequence of letters

I am trying to find a pattern using perl, but I am only interested in the beginning and the end of the pattern. To be more specific, I have a sequence of letters and I would like to see if the following pattern exists. There are 23 characters, and I'm only interested in the beginning and the end of the sequence.
For example I would like to extract anything that starts with ab and ends with zt. There is always
So it can be
abaaaaaaaaaaaaaaaaaaazt
So that it detects this match
but not
abaaaaaaaaaaaaaaaaaaazz
So far I tried
if ($line =~ /ab[*]zt/) {
print "found pattern ";
}
thanks
* is a quantifier and meta character. Inside a character class bracket [ .. ] it just means a literal asterisk. You are probably thinking of .* which is a wildcard followed by the quantifier.
Matching entire string, e.g. "abaazt".
/^ab.*zt$/
Note the anchors ^ and $, and the wildcard character . followed by the zero or more * quantifier.
Match substrings inside another string, e.g. "a b abaazt c d"
/\bab\S*zt\b/
Using word boundary \b to denote beginning and end instead of anchors. You can also be more specific:
/(?<!\S)ab\S*zt(?!\S)/
Using a double negation to assert that no non-whitespace characters follow or precede the target text.
It is also possible to use the substr function
if (substr($string, 0, 2) eq "ab" and substr($string, -2) eq "zt")
You mention that the string is 23 characters, and if that is a fixed length, you can get even more specific, for example
/^ab.{19}zt$/
Which matches exactly 19 wildcards. The syntax for the {} quantifier is {min,max}, and leaving max blank means no upper limit, i.e. {1,} is the same as + and {0,} is the same as *, meaning one or more and zero or more matches, respectively.
A * by itself won't match arbitrary text (in your pattern, [*] matches only a literal *); if you want to match anything you need to use .*.
if ($line =~ /^ab.*zt$/) {
print "found pattern ";
}
If you really want to capture the match, wrap the whole pattern in a capture group:
if (my ($string) = $line =~ /^(ab.*zt)$/) {
print "found pattern $string";
}
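To compare the approaches from the answers side by side, here is a small sketch. The sub name classify and the test strings are my own; the strings are built with the x operator so the lengths are explicit.

```perl
use strict;
use warnings;

# Report which of the three checks accept the string.
sub classify {
    my ($s) = @_;
    return {
        any_length => ($s =~ /^ab.*zt$/)    ? 1 : 0,
        exactly_23 => ($s =~ /^ab.{19}zt$/) ? 1 : 0,
        substr     => (substr($s, 0, 2) eq 'ab' && substr($s, -2) eq 'zt') ? 1 : 0,
    };
}

my %samples = (
    good  => 'ab' . ('a' x 19) . 'zt',    # 23 characters, ends in zt
    bad   => 'ab' . ('a' x 19) . 'zz',    # 23 characters, ends in zz
    short => 'abzt',                      # right ends, wrong length
);

for my $name (sort keys %samples) {
    my $r = classify($samples{$name});
    print "$name: ", join(' ', map { "$_=$r->{$_}" } sort keys %$r), "\n";
}
```

Only the {19} version rejects the short string, so use it if the 23-character length really is fixed.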