How to find patterns across multiple lines using perl

How to find patterns across multiple lines using perl - perl

I want to grep some string spread along multiple lines withing some begin and end pattern
Example:
MediaHelper->fetchStrings( names => [ //Here new line may or many not be
**'ubp-firstrun_heading',
'firstrun_text',
'_firstrun-or-start_search',
'installed'** //may end here also );
]);
using perl or grap how I can get list 4 strings here begin pattern is MediaHelper->fetchStrings(names => [ and end pattern is );
Or any other suggesting using other commands like grep or sed or awk ?

Try this:
sed -n '/MediaHelper->fetchStrings( names =>/,/);/ p' <yourfile>
Or, if you want to skip the delimiting lines, this:
sed -n '/MediaHelper->fetchStrings( names =>/,/);/ {/MediaHelper->fetchStrings( names =>/b; /^);/b; p}' <yourfile>

If I understand your question, you need to match all strings in all lines (and not just the MediaHelper thing).
If this is the case, then sed is the right tool, because it is by default line-oriented.
In our case, if you want to match the string in every line:
sed "s/.*\('.*'\).*/\1/" <your_file>
Hope it helps
Edit: To be more descriptive, first we need to match the whole line (that's the first and the last .*) and then we enclose in parenthesis the part of the line we want to print, which in our case is everything inside single quotes. The number 1 before the last delimiter denotes that we want to print the first (in our case it is the last also) parenthesis.

Just process the file in slurp mode instead of line by line:
perl -0777 -ne 'print $1 while m{MediaHelper->fetchStrings(names\s*=>\s*\[(.*?)\]}g' file
Explanation:
Switches:
-0777: Slurp mode instead of line by line
-n: Creates a while(<>){..} loop for each line in your input file.
-e: Tells perl to execute the code on command line.

Related

GREP Print Blank Lines For Non-Matches

I want to extract strings between two patterns with GREP, but when no match is found, I would like to print a blank line instead.
Input
This is very new
This is quite old
This is not so new
Desired Output
is very
is not so
I've attempted:
grep -o -P '(?<=This).*?(?=new)'
But this does not preserve the second blank line in the above example. Have searched for over an hour, tried a few things but nothing's worked out.
Will happily used a solution in SED if that's easier!

You can use
#!/bin/bash
s='This is very new
This is quite old
This is not so new'
sed -En 's/.*This(.*)new.*|.*/\1/p' <<< "$s"
See the online demo yielding
is very
is not so
Details:
E - enables POSIX ERE regex syntax
n - suppresses default line output
s/.*This(.*)new.*|.*/\1/ - finds any text, This, any text (captured into Group 1, \1, and then any text again, or the whole string (in sed, line), and replaces with Group 1 value.
p - prints the result of the substitution.
And this is what you need for your actual data:
sed -En 's/.*"user_ip":"([^"]*).*|.*/\1/p'
See this online demo. The [^"]* matches zero or more chars other than a " char.

With your shown samples, please try following awk code.
awk -F'This\\s+|\\s+new' 'NF==3{print $2;next} NF!=3{print ""}' Input_file
OR
awk -F'This\\s+|\\s+new' 'NF==3{print $2;next} {print ""}' Input_file
Explanation: Simple explanation would be, setting This\\s+ OR \\s+new as field separators for all the lines of Input_file. Then in main program checking condition if NF(number of fields) are 3 then print 2nd field (where next will take cursor to next line). In another condition checking if NF(number of fields) is NOT equal to 3 then simply print a blank line.

sed:
sed -E '
/This.*new/! s/.*//
s/.*This(.*)new.*/\1/
' file
first line: lines not matching "This.*new", remove all characters leaving a blank line
second lnie: lines matching the pattern, keep only the "middle" text
this is not the pcre non-greedy match: the line
This is new but that is not new
will produce the output
is new but that is not
To continue to use PCRE, use perl:
perl -lpe '$_ = /This(.*?)new/ ? $1 : ""' file

This might work for you:
sed -E 's/.*This(.*)new.*|.*/\1/' file
If the first match is made, the line is replace by everything between This and new.
Otherwise the second match will remove everything.
N.B. The substitution will always match one of the conditions. The solution was suggested by Wiktor Stribiżew.

Printing all words that start with "#" using sed in BASH

I have a file with a lot of text, but I want to print only words that contain "#" at the beginning. Ex:
My name is #Laura and I live in #London. Name=#Laura. City=#London
How can I print all words that start with #?.I did this the following and it worked, but I want to do it using sed. I tried several patters, but I cannot make it print anything.
grep -o -E "#\w+" file.txt
Thanks

Use this sed command:
sed 's/[^#]*\(#[^ .]*\)/\1\n/g' file.txt
Explanation: we invoke the substitution command of sed. This has following structure: sed 's/regex/replace/options'. We will search for a regex and replace it using the g option. g makes sure the match is made multiple times per line.
We look for a series of non at chars followed by an # and a number of non-spaces #[^ ]*. We put this last part in a group \(\) and sub it during the replacement \1.
Note that we add a newline at the end of each match, you can also get the output on a single line by omitting the \n.

Using command line to remove text?

I have a huge file that contains lines that follow this format:
New-England-Center-For-Children-L0000392290
Southboro-Housing-Authority-L0000392464
Crew-Star-Inc-L0000391998
Saxony-Ii-Barber-Shop-L0000392491
Test-L0000392334
What I'm trying to do is narrow it down to just this:
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Test
Can anyone help with this?

Using GNU awk:
awk -F\- 'NF--' OFS=\- file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Set the input and output field separator to -.
NF contains number of fields. Reduce it by 1 to remove the last field.
Using sed:
sed 's/\(.*\)-.*/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Simple greedy regex to match up to the last hyphen.
In replacement use the captured group and discard the rest.

Version 1 of the Question
The first version of the input was in the form of HTML and parts had to be removed both before and after the desired text:
$ sed -r 's|.*[A-Z]/([a-zA-Z-]+)-L0.*|\1|' input
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
Version 2 of the Question
In the revised question, it is only necessary to remove the text that starts with -L00:
$ sed 's|-L00.*||' input2
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
Both of these commands use a single "substitute" command. The command has the form s|old|new|.

The perl code for this would be: perl -nle'print $1 if(m{-.*?/(.*?-.*?)-})
We can break the Regex down to matching the following:
- for that's between the city and state
.*? match the smallest set of character(s) that makes the Regex work, i.e. the State
/ matches the slash between the State and the data you want
( starts the capture of the data you are interested in
.*?-.*? will match the data you care about
) will close out the capture
- will match the dash before the L####### to give the regex something to match after your data. This will prevent the minimal Regex from matching 0 characters.
Then the print statement will print out what was captured (your data).

awk likes these things:
$ awk -F[/-] -v OFS="-" '{print $(NF-3), $(NF-2)}' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
This sets / and - as possible field separators. Based on them, it prints the last_field-3 and last_field-2 separated by the delimiter -. Note that $NF stands for last parameter, hence $(NF-1) is the penultimate, etc.
This sed is also helpful:
$ sed -r 's#.*/(\w*-\w*)-\w*\.\w*</loc>$#\1#' file
Special-Restaurant
Eliot-Cleaning
Kennedy-Plumbing
It selects the block word-word after a slash / and followed with word.word</loc> + end_of_line. Then, it prints back this block.
Update
Based on your new input, this can make it:
$ sed -r 's/(.*)-L\w*$/\1/' file
New-England-Center-For-Children
Southboro-Housing-Authority
Crew-Star-Inc
Saxony-Ii-Barber-Shop
Test
It selects everything up to the block -L + something + end of line, and prints it back.
You can use also another trick:
rev file | cut -d- -f2- | rev
As what you want is every slice of - separated fields, let's get all of them but last one. How? By reversing the line, getting all of them from the 2nd one and then reversing back.

Here's how I'd do it with Perl:
perl -nle 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && print $2' filename
Note: the original question was matching input lines like this:
<loc>http://www.example.com/bp/Lowell-MA/Special-Restaurant-L0000423916.htm</loc>
<loc>http://www.example.com/bp/Houston-TX/Eliot-Cleaning-L0000422797.htm</loc>
<loc>http://www.example.com/bp/New-Orleans-LA/Kennedy-Plumbing-L0000423121.htm</loc>
The -n option tells Perl to loop over every line of the file (but not print them out).
The -l option adds a newline onto the end of every print
The -e 'perl-code' option executes perl-code for each line of input
The pattern:
/regex/ && print
Will only print if the regex matches. If the regex contains capture parentheses you can refer to the first captured section as $1, the second as $2 etc.
If your regex contains slashes, it may be cleaner to use a different regex delimiter ('m' stands for 'match'):
m{regex} && print
If you have a modern Perl, you can use -E to enable modern feature and use say instead of print to print with a newline appended:
perl -nE 'm{example[.]com/bp/(.*?)/(.*?)-L\d+[.]htm} && say $2' filename

This is very concise in Perl
perl -i.bak -lpe's/-[^-]+$//' myfile
Note that this will modify the input file in-place but will keep a backup of the original data in called myfile.bak

Change formatting on paragraphs, with perl

I have a number of paragraphs that have returns at the end of a line. I do not want returns at the end of lines, I will let the layout program take care of that. I would like to remove the returns, and replace them with spaces.
The issue is that I do want returns in between paragraphs. So, if there is more than one return in a row (2, 3, etc) I would like to keep two returns.
This would allow for there to be paragraphs, with one blank line between then, but all other formatting for lines would be removed. This would allow the layout program to worry about the line breaks, and not the have the breaks determined by a set number of characters, as they are now.
I would like to use Perl to accomplish this change, but am open to other methods.
example text:
This is a test.
This is just a test.
This too is a test.
This too is just a test.
would become:
This is a test. This is just a test.
This too is a test. This too is just a test.
Can this be done easily?

Using a perl one-liner. Replace 2 or more newlines with just 2. Strip all single newlines:
perl -0777 -pe 's{(\n{2})\n*|\n}{$1//" "}eg' file.txt > newfile.txt
Switches:
-0777: Slurps the entire file
-p: Creates a while(<>){...; print} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.

I came up with another solution and also wanted to explain what your regex was matching.
Matt#MattPC ~/perl/testing/8
$ cat input.txt
This is a test.
This is just a test.
This too is a test.
This too is just a test.
another test.
test.
Matt#MattPC ~/perl/testing/8
$ perl -e '$/ = undef; $_ = <>; s/(?<!\n)\n(?!\n)/ /g; s/\n{2,}/\n\n/g; print' input.txt
This is a test. This is just a test.
This too is a test. This too is just a test.
another test. test.
I basically just wrote a perl program and mashed it into a one-liner. It would normally look like this.
# First two lines read in the whole file
$/ = undef;
$_ = <>;
# This regex replaces every `\n` by a space
# if it is not preceded or followed by a `\n`
s/(?<!\n)\n(?!\n)/ /g;
# This replaces every two or more \n by two \n
s/\n{2,}/\n\n/g;
# finally print $_
print;
perl -p -i -e 's/(\w+|\s+)[\r\n]/$1 /g' abc.txt
Part of the problem here is what you are matching. (\w+|\s+) matches one of more word characters, which is the same as [a-zA-Z0-9_], OR one or more whitespace characters, which is the same as [\t\n\f\r ].
This wouldn't match your input, since you aren't matching periods, and no line consists of only white space or only characters (even the blank lines would need two whitespace characters to match it, since we have [\r\n] at the end). Plus, neither would match a period.

sed: replace pattern only if followed by empty line

I need to replace a pattern in a file, only if it is followed by an empty line. Suppose I have following file:
test
test
test
...
the following command would replace all occurrences of test with xxx
cat file | sed 's/test/xxx/g'
but I need to only replace test if next line is empty. I have tried matching a hex code, but that doesn ot work:
cat file | sed 's/test\x0a/xxx/g'
The desired output should look like this:
test
xxx
xxx
...

Suggested solutions for sed, perl and awk:
sed
sed -rn '1h;1!H;${g;s/test([^\n]*\n\n)/xxx\1/g;p;}' file
I got the idea from sed multiline search and replace. Basically slurp the entire file into sed's hold space and do global replacement on the whole chunk at once.
perl
$ perl -00 -pe 's/test(?=[^\n]*\n\n)$/xxx/m' file
-00 triggers paragraph mode which makes perl read chunks separated by one or several empty lines (just what OP is looking for). Positive look ahead (?=) to anchor substitution to the last line of the chunk.
Caveat: -00 will squash multiple empty lines into single empty lines.
awk
$ awk 'NR==1 {l=$0; next}
/^$/ {gsub(/test/,"xxx", l)}
{print l; l=$0}
END {print l}' file
Basically store previous line in l, substitute pattern in l if current line is empty. Print l. Finally print the very last line.
Output in all three cases
test
xxx
xxx
...

This might work for you (GNU sed):
sed -r '$!N;s/test(\n\s*)$/xxx\1/;P;D' file
Keep a window of 2 lines throughout the length of the file and if the second line is empty and the first line contains the pattern then make a substitution.

Using sed
sed -r ':a;$!{N;ba};s/test([^\n]*\n(\n|$))/xxx\1/g'
explanation
:a # set label a
$ !{ # if not end of file
N # Add a newline to the pattern space, then append the next line of input to the pattern space
b a # Unconditionally branch to label. The label may be omitted, in which case the next cycle is started.
}
# simply, above command :a;$!{N;ba} is used to read the whole file into pattern.
s/test([^\n]*\n(\n|$))/xxx\1/g # replace the key word if next line is empty (\n\n) or end of line ($)

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

How to find patterns across multiple lines using perl - perl

Try this: sed -n '/MediaHelper->fetchStrings( names =>/,/);/ p' <yourfile> Or, if you want to skip the delimiting lines, this: sed -n '/MediaHelper->fetchStrings( names =>/,/);/ {/MediaHelper->fetchStrings( names =>/b; /^);/b; p}' <yourfile>

Related

GREP Print Blank Lines For Non-Matches

Printing all words that start with "#" using sed in BASH

Using command line to remove text?

Change formatting on paragraphs, with perl

sed: replace pattern only if followed by empty line

Categories

Resources