perl one liner to search for a pattern

perl one liner to search for a pattern - perl

I have a text file like
Hi
how are you
<blank>
<blank>
abcd
<blank>
defgh
opqr
<blank>
I want to print all lines that have the pattern like "some text"blankblank"some text"
like
how are you
<blank>
<blank>
abcd
I am thinking about using join and then search for the pattern. But I am not sure how to do it. (By blank I mean empty line)

Blank line: /^(?:(?!\n)\s)*\n/m
Non-blank line: /^.*\S.*\n/m
So you want to print all instances of:
/
^
(?:
.*\S.*\n
(?: (?:(?!\n)\s)*\n ){2}
)+
.*\S.*\n
/mx
As a lone liner:
perl -0777ne'print /^(?:.*\S.*\n(?:(?:(?!\n)\s)*\n){2})+.*\S.*\n/mg' file
If all your blank lines contain no whitespace, you can simplify some:
Blank line: /^\n/m
Non-blank line: /^.+\n/m
perl -0777ne'print /^(?:.+\n\n\n)+.+\n/mg' file

Perhaps I'm not understanding the question. What I think you're asking is how you can match 2 consecutive lines that have the same text ("some text") and print those.
to do that you could do something like this
assume that the file is stored as a string in $file
print "$1\n$1" while ($file =~ /(.*)(?=\n\1(?:\n|$))/mg);
.* = matches anything, grabs up as much is it can
() = capturing group, stores .* to $1 in this case
(?= ... ) = look ahead, so that that part of the string can be used in the next match
\1 = whatever was captured in the first capturing group (i.e $1)
(?: ... ) = non-capturing group

Related

Can somebody explain this obfuscated perl regexp script?

This code is taken from the HackBack DIY guide to rob banks by Phineas Fisher. It outputs a long text (The Sixth Declaration of the Lacandon Jungle). Where does it fetch it? I don't see any alphanumeric characters at all. What is going on here? And what does the -r switch do? It seems undocumented.
perl -Mre=eval <<\EOF
''
=~(
'(?'
.'{'.(
'`'|'%'
).("\["^
'-').('`'|
'!').("\`"|
',').'"(\\$'
.':=`'.(('`')|
'#').('['^'.').
('['^')').("\`"|
',').('{'^'[').'-'.('['^'(').('{'^'[').('`'|'(').('['^'/').('['^'/').(
'['^'+').('['^'(').'://'.('`'|'%').('`'|'.').('`'|',').('`'|'!').("\`"|
'#').('`'|'%').('['^'!').('`'|'!').('['^'+').('`'|'!').('['^"\/").(
'`'|')').('['^'(').('['^'/').('`'|'!').'.'.('`'|'%').('['^'!')
.('`'|',').('`'|'.').'.'.('`'|'/').('['^')').('`'|"\'").
'.'.('`'|'-').('['^'#').'/'.('['^'(').('`'|('$')).(
'['^'(').('`'|',').'-'.('`'|'%').('['^('(')).
'/`)=~'.('['^'(').'|</'.('['^'+').'>|\\'
.'\\'.('`'|'.').'|'.('`'|"'").';'.
'\\$:=~'.('['^'(').'/<.*?>//'
.('`'|"'").';'.('['^'+').('['^
')').('`'|')').('`'|'.').(('[')^
'/').('{'^'[').'\\$:=~/('.(('{')^
'(').('`'^'%').('{'^'#').('{'^'/')
.('`'^'!').'.*?'.('`'^'-').('`'|'%')
.('['^'#').("\`"| ')').('`'|'#').(
'`'|'!').('`'| '.').('`'|'/')
.'..)/'.('[' ^'(').'"})')
;$:="\."^ '~';$~='#'
|'(';$^= ')'^'[';
$/='`' |'.';
$,= '('
EOF

The basic idea of the code you posted is that each alphanumeric character has been replaced by a bitwise operation between two non-alphanumeric characters. For instance,
'`'|'%'
(5th line of the "star" in your code)
Is a bitwise or between backquote and modulo, whose codepoints are respectively 96 and 37, whose "or" is 101, which is the codepoint of the letter "e". The following few lines all print the same thing:
say '`' | '%' ;
say chr( ord('`' | '%') );
say chr( ord('`') | ord('%') );
say chr( 96 | 37 );
say chr( 101 );
say "e"
Your code starts with (ignore whitespaces which don't matter):
'' =~ (
The corresponding closing bracket is 28 lines later:
^'(').'"})')
(C-f this pattern to see it on the web-page; I used my editor's matching parenthesis highlighting to find it)
We can assign everything in between the opening and closing parenthesis to a variable that we can then print:
$x = '(?'
.'{'.(
'`'|'%'
).("\["^
'-').('`'|
'!').("\`"|
',').'"(\\$'
.':=`'.(('`')|
'#').('['^'.').
('['^')').("\`"|
',').('{'^'[').'-'.('['^'(').('{'^'[').('`'|'(').('['^'/').('['^'/').(
'['^'+').('['^'(').'://'.('`'|'%').('`'|'.').('`'|',').('`'|'!').("\`"|
'#').('`'|'%').('['^'!').('`'|'!').('['^'+').('`'|'!').('['^"\/").(
'`'|')').('['^'(').('['^'/').('`'|'!').'.'.('`'|'%').('['^'!')
.('`'|',').('`'|'.').'.'.('`'|'/').('['^')').('`'|"\'").
'.'.('`'|'-').('['^'#').'/'.('['^'(').('`'|('$')).(
'['^'(').('`'|',').'-'.('`'|'%').('['^('(')).
'/`)=~'.('['^'(').'|</'.('['^'+').'>|\\'
.'\\'.('`'|'.').'|'.('`'|"'").';'.
'\\$:=~'.('['^'(').'/<.*?>//'
.('`'|"'").';'.('['^'+').('['^
')').('`'|')').('`'|'.').(('[')^
'/').('{'^'[').'\\$:=~/('.(('{')^
'(').('`'^'%').('{'^'#').('{'^'/')
.('`'^'!').'.*?'.('`'^'-').('`'|'%')
.('['^'#').("\`"| ')').('`'|'#').(
'`'|'!').('`'| '.').('`'|'/')
.'..)/'.('[' ^'(').'"})';
print $x;
This will print:
(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})
The remaining of the code is a bunch of assignments into some variables; probably here only to complete the pattern: the end of the star is:
$:="\."^'~';
$~='#'|'(';
$^=')'^'[';
$/='`'|'.';
$,='(';
Which just assigns simple one-character strings to some variables.
Back to the main code:
(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})
This code is inside a regext which is matched against an empty string (don't forget that we had first '' =~ (...)). (?{...}) inside a regex runs the code in the .... With some whitespaces, and removing the string within the eval, this gives us:
# fetch an url from the web using curl _quitely_ (-s)
($: = `curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)
# replace end of paragraphs with newlines in the HTML fetched
=~ s|</p>|\n|g;
# Remove all HTML tags
$: =~ s/<.*?>//g;
# Print everything between SEXTA and Mexicano (+2 chars)
print $: =~ /(SEXTA.*?Mexicano..)/s
You can automate this unobfuscation process by using B::Deparse: running
perl -MO=Deparse yourcode.pl
Will produce something like:
'' =~ m[(?{eval"(\$:=`curl -s https://enlacezapatista.ezln.org.mx/sdsl-es/`)=~s|</p>|\\n|g;\$:=~s/<.*?>//g;print \$:=~/(SEXTA.*?Mexicano..)/s"})];
$: = 'P';
$~ = 'h';
$^ = 'r';
$/ = 'n';
$, = '(';

Eliminate duplicate words across lines

I'd like a sed script that eliminates repeated words in a text file on one or more lines. For example:
this is is is a text file file it is littered with duplicate words
words words on one or more lines lines
lines
lines
should transform to:
this is a text file it is littered with duplicate words
on one or more lines
This awk script produces the correct output:
{
for (i = 1; i <= NF; i++) {
word = $i
if (word != last) {
if (i < NF) {
next_word = $(i+1)
if (word != next_word) {
printf("%s ", word)
}
} else {
printf("%s\n", word)
}
}
}
last = word
}
but I'd really like a sed "one-liner".

This works with GNU sed, at least for the example input:
$ sed -Ez ':a;s/(\<\S+)(\s+)\1\s+/\1\2/g;ta' infile
This is a text file and is littered with duplicate words
on one or more lines
The -E option is just there to avoid having to escape the capture group parentheses and + quantifiers.
-z treats the input as null byte separated, i.e., as a single line.
The commmand is then structured as
:a # label
s///g # substitution
ta # jump to label if substitution did something
And the substitution is this:
s/(\<\S+)(\s+)\1\s+/\1\2/g
First capture group: (\<\S+) – a complete word (start of word boundary, one or more non-space characters
Second capture group: (\s+) – any number of blanks after that first word
\1\s+ – the first word again plus whatever blanks follow it
This preserves the whitespace after the first word and discards the whitespace after the duplicate.
Note that -E, -z, \<, \S and \s are all GNU extensions to POSIX sed.

With sed, you can use
sed -E 's/([a-z]+) +\1/\1/g'
Note that it works for duplicates. Not for triplicates or line breaks.
This can be fixed, by joining all the lines and looping.
sed -E ':a;N;s/(\b[a-z]+\b)([ \n])[ \n]*\b\1\b */\1\2/g;ba'

sed -En '
H
${
g
s/^\n//
s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
p
}
' file
This is a text file with duplicate words
on one or more lines
where
H -- append each line to the hold space
${...} -- on the last line, perform the enclosed commands
g -- replace pattern space with the contents of the hold space
s/^\n// -- remove leading newline (side-effect of H on first line)
s/(\<([[:alnum:]]+)[[:space:]]+)(\2([[:space:]]+|$))+/\1/g
..1..2............2............1..........................
the key here is to capture the text and the spaces separately so that the back reference can match with differing whitespace.
captured expression #1 is the first word and it's whitespace (which can contain newlines), and the capture #2 is just the word.

How to insert a colon between word and number

I want to insert a colon between word and number then add a new line after a number.
For example:
"cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011";
my expected output:
cat:11052000
cow_and_owner_:01011999 12031981
dog:22032011
My attempt :
$Bday=~ /^([a-z]||\_)/:/^([0-9])/
print "\n";

#!/usr/bin/perl
use warnings;
use strict;
my $str = "cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011";
$str =~ s/\s*([a-z_]+)((?: \d+)+)/$1:$2\n/g;
print $str;
produces your desired output from your sample input.
Edit: Note the use of the s operator for regular expression substitution. One of the many problems with your code is that you're not using that (IF your intent is to modify the string in place and not extract bits from it for further processing)

One more variant -
> cat test_perl.pl
#!/usr/bin/perl
use strict;
use warnings;
while ( "cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011" =~ m/([a-z_]+)\s+([0-9 ]+)/g )
{
print "$1:$2\n";
}
> test_perl.pl
cat:11052000
cow_and_owner_:01011999 12031981
dog:22032011
>

The original code $Bday=~ /^([a-z]||\_)/:/^([0-9])/ doesn't make much sense. Apart from missing a semicolon and having too many delimiters (matching patterns are of the format /.../ or m/.../ and replacing ones s/.../.../), it could never match anything.
([a-z]||\_) would match:
one lowercase ASCII letter (a through z);
an empty string (the space between the two |s; or
one underscore (escape with a backslash is superfluous).
To get it (or the corresponding subexpression for numbers) to match a sequence of one
or more of the characters, you need to follow it with a +.
^([0-9]) would fail to match unless it was at the beginning of the string. There it would match a single digit.
My solution (taking into account the later comments by the OP about having input such as cat[1] or dog3):
use strict;
use warnings;
my $bday = "cat 11052000 cow_and_owner_ 01011999 12031981 dog 22032011 cat[1] 01012018 dog3 02012018";
# capture groups:
# $1------------------------\ $2-------------\
$bday =~ s/([A-Za-z][A-Za-z0-9_\[\]]*)\h+(\d+(?:\h+\d+)*)(?!\S)\s*/$1:$2\n/g;
print $bday;
will print out:
cat:11052000
cow_and_owner_:01011999 12031981
dog:22032011
cat[1]:01012018
dog3:02012018
Breakdown:
[A-Za-z]: Begin with a letter.
[A-Za-z0-9_\[\]]*: Follow with zero or more letters, numbers, underscores and square brackets.
\h+: Separate with one or more horizontal whitespace.
\d+(?:\h+\d+)*: One sequence of digits (\d+) followed by zero or more sequences of horizontal whitespace and digits.
(?!\S): Can't be followed by non-whitespace.
\s*: Consume following whitespace (including line feeds; this allows the input to be separated on multiple lines, as long as a single entry is not spread on multiple lines. To get that, replace all the \h+ with \s+.).
The replace pattern will repeat (the /g modifier) sequentially in the source string as long as it matches, placing each heading-date record on its own line and then proceeding with the rest of the string.
Note that if your headers (dog etc.) might contain non-ASCII letters, use \pL or \p{XPosixAlpha} instead of [A-Za-z]:
$bday =~ s/\pL[\pL0-9_\[\]]*)\h+(\d+(?:\h+\d+)*)(?!\S)\s*/$1:$2\n/g;

Add a new line in a variable using perl

I am trying to add a new line in a variable after certain number of words. For example: If we have a variable:
$x = "This a variable, start a new line here, This is a new line.";
If I print the above variable
print $x;
I should get the below output:
This is a variable,
start a new line here,
This is a new line.
How can I achieve this in Perl from the variable itself?

I do not agree to the formula "after certain number of words".
Note that the first target line has 4 words, whereas remaining 2 have
5 words each.
Actually you need to replace each comma and following sequence of
spaces (if any) with a comma and \n.
So the intuitive way to do it is:
$x =~ s/,\s*/,\n/g;

The simplest way is to split the string on comma followed by a space and then
join the word groups with a comma followed by a newline.
my $x = "This a variable, start a new line here, This is a new line.";
print join(",\n", split /, /, $x) . "\n";
output
This a variable,
start a new line here,
This is a new line.
For solving the general, how do I reformat this string with line breaks after n-columns? problem, use the Text::Wrap library (as suggested by #ikegami):
use Text::Wrap;
my $x = "The quick brown fox jumped over the lazy dog.";
$Text::Wrap::columns = 15;
# wrap() needs an array of words
my #words = split /\s+/, $x;
# Initial tab, subsequent tab values set to '' (think indent amount)
print wrap('', '', #words) . "\n";
output
The quick
brown fox
jumped over
the lazy dog.

You probably want to use regular expressions. You can do this:
$x =~ s/^(\S+\s+){3}\K/\n/;
Or if this is about the commas and not the spaces:
$x =~ s/^([^,]+,+){2}\s*\K/\n/;
(in this case I also remove any potential space that would be after the comma)
You can also configure separately how many words or comma you want, by putting this in a variable:
my $nbwords = 7; # add a line after the 7th word
$x =~ s/^(\S+\s+){$nbwords}\K/\n/;
Now, that would keep the last space so you may want to do this:
my $nbwords = 7; # add a line after the 7th word
$nbwords--; # becomes 6 because there is another word after that we match as well
$x =~ s/^(\S+\s+){$nbwords}\S+\K\s+/\n/;
You should probably learn to use Regexps but just to explain the above:
\s is any space character (like space, tab, line feed, etc)
\S (uppercase) is any character except a space character
+ means any number of characters of that type described with what is before. So \s+ means any number of consecutive space characters.
{123} means 123 times that type of character ...
{3,80} means 3 to 80 times. So + is equivalent to {1,} (one to unlimited)
\K means that whatever is before will not be replaced, only what is after.

Removing double quotes in beginning & end of string in a pipe delimited line in perl

I have a Line that is pipe delimited:
John |DEMME|"9 Snowy "" Court"|WERRIBEE|""VIC""
I split my line to each fields
#fields = split (/\|/, $_);
what I want is to remove the double quotes in the beginning/end of each fields
but it should retain the double quotes that are in between.
expected output
John |DEMME|9 Snowy "" Court|WERRIBEE|VIC
I also tried this
s/^\"|"$//g;
but what it does is it reads by line not but fields, so it will only remove the double qoutes which are at the beginnig and end of the line.
another scenario:
John |DEMME| "Shop 6A ""Atlantic on Coolum""|WERRIBEE|VIC
output should be
John |DEMME| Shop 6A "Pacific on Coolum"|WERRIBEE|VIC
I hope you guys can help me with this.
thank you very much

If it's reading by lines, then you could write:
s/(?:^|(?<=\|))"|"(?=$|\|)//g;
to remove " at the start of a line or after a |, or at the end of a line or before a |.
(The (?<=...) notation creates a zero-width positive "lookbehind" assertion, which in this case checks to see if there's a | preceding; the (?=...) notation creates a zero-width positive "lookahead" assertion, which in this case checks to see if there's a | or end-of-file following.)

try this one dude
my $_='John |DEMME|"9 Snowy "" Court"|WERRIBEE|""VIC""';
my #fields = split (/\|/, $_);
foreach my $item(#fields){
$item=~s/^\"+//g;
$item=~s/\"+$//g;
print "$item";
}

The quotes could also be removed from each field using a map expression:
#fields = map { s/^"(.*)"$/$1/; $_ } split (/\|/, $_);
split breaks the line apart into a list, while the map applies the substitution to each member of the list. To join them back up:
print join('|', #fields), "\n";