How to remove non-numeric characters from a numeric string? - perl

I have the following data. I would like to print the last column with the non-numeric characters removed from each string. Kindly help me.
N THR K 149A
CA THR K 149A
C THR K 149A
O THR K 149A
CB THR K 149A
OG1 THR K 149A
CG2 THR K 149A
N SER K 149B
CA SER K 149B
C SER K 149B
O SER K 149B
CB SER K 149B
To solve the above problem I tried the following program.
#!/usr/bin/perl -w
open(F1, "$ARGV[0]") or die;
chomp(@arr=<F1>);
close F1;
for($i=0;$i<=$#arr;$i++)
{
@pdb=split(/\h/,$arr[$i]);
if($pdb[3] =~ /[A-Z]/*$);{
$pdb[3] =~ s/\D//g;
print "$pdb[1] $pdb[2] $pdb[3]\n";
}
}

OK, unless this is a typo, this is what is wrong with your code:
if($pdb[3] =~ /[A-Z]/*$);{
In this code, you have placed the slash / in the middle of your regex, and also placed a semi-colon there which does not belong anywhere on that line. Also, you are using * as the quantifier, which will not work as intended, because it will allow a match on the empty string (zero matches), which will match all strings. The correct line is:
if($pdb[3] =~ /[A-Z]+$/) {
However, this entire line is incorrect, when taken in context:
if($pdb[3] =~ /[A-Z]*$/) {
$pdb[3] =~ s/\D//g;
Here you only remove non-digits if upper case letters are found. Besides the fact that you are checking for two different things, you do not need to check before substituting, because a substitution will not do anything if it does not match. So... something like this:
if ($foo =~ /A/) {
$foo =~ s/A//g;
is completely redundant, because s/A//g will not do anything unless there is already an A in the string.
Also, a few more things you should know:
Always use
use strict;
use warnings;
As it will help you prevent a lot of simple mistakes.
Use three argument open, with lexical file handle, and check the return value including the error:
open my $fh, "<", $file or die "Cannot open $file: $!";
You do not need to quote variables, such as with "$ARGV[0]". You can leave out the quotes: $ARGV[0].
You are using a C-style for loop. Using a Perl-style loop is preferred, in my opinion:
for my $i (0 .. $#arr)
But you should not be using array indexes unless you need the indexes themselves, so the better loop is:
for my $line (@arr)
But again, as a general rule, it is better to read a file line-by-line than slurping it into an array. For this purpose you would use a while loop, which iterates over the file handle instead of exhausting it all at once:
while (<$fh>) {
# process line $_
}
Using /\h/ as the field delimiter for split is wrong, unless you intended that consecutive whitespace indicates empty fields. The default split is ' ', which splits on multiple whitespace /\s+/, and also strips leading whitespace. With CSV data, it is possibly correct to split on single delimiters, but in that case you should use the specific delimiter, and not a character class like \h.
Like I said before, using the * quantifier in a regex match is horribly wrong. You might notice that a regex such as /[A-Z]*/ matches anything if you try it out: perl -lnwe 'print /[A-Z]*/ ? "match!" : "no match";' That is because it is allowed to match the empty string, and all strings match the empty string.
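A small sketch of the difference between the two quantifiers, using values like those in the question's last column. The sub names matches_star and matches_plus are made up for this illustration:

```perl
use strict;
use warnings;

# Hypothetical helpers, named only for this illustration.
sub matches_star { return $_[0] =~ /[A-Z]*$/ ? 1 : 0 }   # zero or more letters at the end
sub matches_plus { return $_[0] =~ /[A-Z]+$/ ? 1 : 0 }   # at least one letter at the end

# The * version accepts every value, including the empty string;
# the + version only accepts values that really end in a letter.
for my $value ('149A', '149', '') {
    printf "%-6s star=%d plus=%d\n", "'$value'",
        matches_star($value), matches_plus($value);
}
```

Only the + version tells '149A' apart from '149'.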
And like I also said, you do not need to check before you substitute. At least not for the same thing. So, when simplified, your code becomes:
open my $fh, "<", $ARGV[0] or die "Cannot open $ARGV[0]: $!";
while (<$fh>) { # short for while ($_ = <$fh>)
chomp; # short for chomp($_)
my @fields = split; # short for split(' ', $_)
$fields[3] =~ s/\D//g;
print "@fields[1,2,3]\n"; # quoting an array inserts spaces between elements
}
Note that I used an array slice, which selects only the elements at the indicated indexes. You can also write this as:
print join(" ", $fields[1], $fields[2], $fields[3]), "\n";
You might note also that this is possible to achieve using a one-liner:
perl -anlwe '$F[3] =~ s/\D//g; print "@F[1,2,3]"'
The -a switch autosplits the line on whitespace, storing the fields in @F. The -l switch chomps the line and adds newline to print. And the -n switch reads input from STDIN or argument files, whichever is supplied.

Try this
perl -ne 'print "$1\n" if m/(\d+)\D$/' datafile

Related

Extract preceding and trailing characters to a matched string from file in awk

I have a large string file seq.txt of letters, unwrapped, with over 200,000 characters. No spaces, numbers etc, just a-z.
I have a second file search.txt which has lines of 50 unique letters which will match once in seq.txt. There are 4000 patterns to match.
I want to be able to find each of the patterns (lines in file search.txt), and then get the 100 characters before and 100 characters after the pattern match.
I have a script which uses grep and works, but this runs very slowly, only does the first 100 characters, and is written out with echo. I am not knowledgeable enough in awk or perl to interpret scripts online that may be applicable, so I am hoping someone here is!
cat search.txt | while read p; do echo "grep -zoP '.{0,100}$p' seq.txt | sed G"; done > do.grep
Easier example with desired output:
>head seq.txt
abcdefghijklmnopqrstuvwxyz
>head search.txt
fgh
pqr
uvw
>head desiredoutput.txt
cdefghijk
mnopqrstu
rstuvwxyz
Best outcome would be a tab separated file of the 100 characters before \t matched pattern \t 100 characters after. Thank you in advance!
One way
use warnings;
use strict;
use feature 'say';
my $string;
# Read submitted files line by line (or STDIN if @ARGV is empty)
while (<>) {
chomp;
$string = $_;
last; # just in case, as we need ONE line
}
# $string = q(abcdefghijklmnopqrstuvwxyz); # test
my $padding = 3; # for the given test sample
my @patterns = do {
my $search_file = 'search.txt';
open my $fh, '<', $search_file or die "Can't open $search_file: $!";
<$fh>;
};
chomp @patterns;
# my @patterns = qw(bcd fgh pqr uvw); # test
foreach my $patt (@patterns) {
if ( $string =~ m/(.{0,$padding}) ($patt) (.{0,$padding})/x ) {
say "$1\t$2\t$3";
# or
# printf "%-3s\t%3s%3s\n", $1, $2, $3;
}
}
Run as program.pl seq.txt, or pipe the content of seq.txt to it.†
The pattern .{0,$padding} matches any character (.), up to $padding times (3 above), which I used in case the pattern $patt is found at a position closer to the beginning of the string than $padding (like the first one, bcd, that I added to the example provided in the question). The same goes for the padding after the $patt.
For your problem, set $padding to 100. With the 100-wide "padding" before and after each pattern, when a pattern is found closer to the beginning than 100 characters the desired \t alignment could break, if the position is less than 100 by more than the tab width (typically 8).
That's what the line with the formatted print (printf) is for, to ensure the width of each field regardless of the length of the string being printed. (It is commented out since we are told that no pattern ever gets into the first or last 100 chars.)
If there is indeed never a chance that a matched pattern breaches the first or the last 100 positions then the regex can be simplified to
/(.{$padding}) ($patt) (.{$padding})/x
Note that if a $patt is within the first/last $padding chars then this just won't match.
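Since the search strings here are plain letters rather than regular expressions, the same extraction can also be sketched with index and substr. This is an alternative to the regex above, not the approach used in this answer, and the sub name context_around is made up for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: returns (before, match, after) for the first
# occurrence of $patt in $string, or an empty list if it is absent.
sub context_around {
    my ($string, $patt, $padding) = @_;
    my $pos = index($string, $patt);
    return if $pos < 0;

    # Clamp the start so a match near the beginning yields a shorter prefix.
    my $start  = $pos > $padding ? $pos - $padding : 0;
    my $before = substr($string, $start, $pos - $start);
    # substr returns fewer characters when the match is near the end.
    my $after  = substr($string, $pos + length($patt), $padding);
    return ($before, $patt, $after);
}

my $string = 'abcdefghijklmnopqrstuvwxyz';
for my $patt (qw(bcd fgh pqr uvw)) {
    my @parts = context_around($string, $patt, 3) or next;
    print join("\t", @parts), "\n";
}
```

Like .{0,$padding}, this keeps a shortened context when a pattern sits near either end of the string, and it prints the three parts tab-separated.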
The program starts the regex engine for each of @patterns, which in principle may raise performance issues (not for one run with the tiny number of 4000 patterns, but such requirements tend to change and generally grow). But this is by far the simplest way to go since
we have no clue how the patterns may be distributed in the string, and
one match may be inside the 100-char buffer of another (we aren't told otherwise)
If there is a performance problem with this approach please update.
† The input (and output) of the program can be organized in a better way using named command-line arguments via Getopt::Long, for an invocation like
program.pl --sequence seq.txt --search search.txt --padding 100
where each argument may be optional here, with defaults set in the file, and argument names may be shortened and/or given additional names, etc. Let me know if that is of interest
One in awk. -v b=3 is the before-context length, -v a=3 is the after-context length, and -v n=3 is the match length, which is always constant. It hashes all the substrings of seq.txt into memory, for example abcdefghij -> s["def"]="abcdefghi", s["efg"]="bcdefghij" etc., so memory use depends on the size of seq.txt and you might want to follow the consumption with top.
$ awk -v b=3 -v a=3 -v n=3 '
NR==FNR {
e=length()-(n+a-1)
for(i=1;i<=e;i++) {
k=substr($0,(i+b),n)
s[k]=s[k] (s[k]==""?"":ORS) substr($0,i,(b+n+a))
}
next
}
($0 in s) {
print s[$0]
}' seq.txt search.txt
Output:
cdefghijk
mnopqrstu
rstuvwxyz
You can tell grep to search for all the patterns in one go.
sed 's/.*/.{0,100}&.{0,100}/' search.txt |
grep -zoEf - seq.txt |
sed G >do.grep
4000 patterns should be easy peasy, though if you get to hundreds of thousands, maybe you will want to optimize.
There is no Perl regex here, so I switched from the nonstandard grep -P to the POSIX-compatible and probably more efficient grep -E.
The surrounding context will consume any text it prints, so any match within 100 characters from the previous one will not be printed.
You can try following approach to your problem:
load string input data
load the patterns into an array
loop through each pattern and look for it in the string
form an array from found matches
loop through matches array and print result
NOTE: the code is not tested due to the absence of input data
use strict;
use warnings;
use feature 'say';
my $fname_s = 'seq.txt';
my $fname_p = 'search.txt';
open my $fh, '<', $fname_s
or die "Couldn't open $fname_s";
my $data = do { local $/; <$fh> };
close $fh;
open $fh, '<', $fname_p
or die "Couldn't open $fname_p";
my @patterns = <$fh>;
close $fh;
chomp @patterns;
for ( @patterns ) {
my @found = $data =~ /(.{100}$_.{100})/g;
s/(.{100})(.{50})(.{100})/$1 $2 $3/ && say for @found;
}
Test code for the provided test data (added later)
use strict;
use warnings;
use feature 'say';
my @pat = qw/fgh pqr uvw/;
my $data = do { local $/; <DATA> };
for( @pat ) {
say $1 if $data =~ /(.{3}$_.{3})/;
}
__DATA__
abcdefghijklmnopqrstuvwxyz
Output
cdefghijk
mnopqrstu
rstuvwxyz

Perl with FASTA sequence extraction has problems (only) with first sequence

I am using a function/subroutine extract_seq available on internet to extract sequences in FASTA files. Briefly:
A sequence begins with first line identified by '>', followed by ID and other information separated by spaces
Subsequent lines (not beginning with '>') have multiple strings
A FASTA file can have 1 or more sequences
Bug is that the output has additional '>' character for first sequence (only) causing consistency problems.
Program works fine in extracting sequences based on ID except for additional '>' in case of first sequence. Could you please suggest a solution as well as reason for the bug? A simple regex would fix the problem but I do not feel good about fixing bugs that I cannot understand.
The Perl script is:
#!/usr/bin/perl -w
use strict;
my $seq_all = "seq_all.fa"; # all proteins in fasta format
foreach my $q_seq ("A0A1D8PC43","A0A1D8PC38") {
print "Querying $q_seq\n";
&extract_seq($seq_all, $q_seq);
}
exit 0;
sub extract_seq
{
open(my $fh, ">query.seq");
my $seq_all = $_[0];
my $lookup = $_[1];
local $/ = "\n>";
@ARGV = ($seq_all);
while (my $seq = <>) {
chomp $seq;
my ($id) = $seq =~ /^>*(\S+)/;
if ($id eq $lookup) {
print "$seq\n";
last;
}
}
}
The FASTA file is:
>A0A1D8PC43 A0A1D8PC43_CANAL Diphosphomevalonate decarboxylase
MYSASVTAPVNIATLKYWGKRDKSLNLPTNSSISVTLSQDDLRTLTTASASESFEKDQLW
LNGKLESLDTPRTQACLADLRKLRASIEQSPDTPKLSQMKLHIVSENNFPTAAGLASSAA
GFAALVSAIAKLYELPQDMSELSKIARKGSGSACRSLFGGFVAWEMGTLPDGQDSKAVEI
APLEHWPSLRAVILVVSDDKKDTPSTTGMQSTVATSDLFAHRIAEVVPQRFEAMKKAILD
KDFPKFAELTMKDSNSFHAVCLDSYPPIFYLNDTSKKIIKMVETINQQEVVAAYTFDAGP
NAVIYYDEANQDKVLSLLYKHFGHVPGWKTHYTAETPVAGVSRIIQTSIGPGPQETSESL
TK
>A0A1D8PC56 A0A1D8PC56_CANAL Uncharacterized protein OS=Candida
MSDTKKTTETDSEVGYLDIYLRFNDDMEKDYCFQVKTTTVFKDLYKVFRTLPISLRPSVF
YHAQPIGFKKSVSPGYLTQDGNFIFDEDSQKQAVPVNDNDLINETVWPGQLILPVWQFND
FGFYSFLAFLACWLYTDLPDFISPTPGICLTNQMTKLMAWVLVQFGKDRFAETLLADLYD
TVGVGAQCVFFGFHIIKCLFIFGFLYTGVFNPMRVFRLTPRSVKLDVTKEELVKLGWTGT
RKATIDEYKEYYREFKINQHGGMIQAHRAGLFNTLRNLGVQLESGEGYNTPLTEENKLRT
MRQIVEDAKKPDFKLKLSYEYFAELGYVFATNAENKEGSELAQLIKQYRRYGLLVSDQRI
KTVVRARKGETDEEKPKVEEVVEE
>A0A1D8PC67 A0A1D8PC67_CANAL Bfa1p OS=Candida albicans (strain
MVSDKLTLLRQFSEEDELFGDIEGIDYHDGETLKINKFSFPSSASSPSFAITGQSPNMRS
INGKRITRETLSEYSEENETDLTSEFSDQEFEWDGFNKNQSIYQQMNQRLIATKVAKQRE
AEREQRELMQKRHKDYDPNQTLRLKDFNKLTNENLTLLDQLDDEKTVNYEYVRDDVEDFA
QGFDKDFETKLRIQPSMPTLRSNAPTLKKYKSYGEFKCDNRVKQKLDRIPSFYNKNQLLS
KFKETKSYHPHHKKMGTVRCLNNNSEVPVTYPSISNMKLNKEKNRWEGNDIDLIRFEKPS
LITHKENKTKKRQGNMVYDEQNLRWINIESEHDVFDDIPDLAVKQLQSPVRGLSQFTQRT
TSTTATATAPSKNNETQHSDFEISRKLVDKFQKEQAKIEKKINHWFIDTTSEFNTDHYWE
IRKMIIEE
>A0A1D8PC38 A0A1D8PC38_CANAL Cta2p OS=Candida albicans (strain
MPENLQTRLHNSLDEILKSSGYIFEVIDQNRKQSNVITSPNNELIQKSITQSLNGEIQNF
HAILDQTVSKLNDAEWCLGVMVEKKKKHDELKVKEEAARKKREEEAKKKEEEAKKKAEEA
KKKEEEAKKAEEAKKAEEAKKVEEAAKKAEEAKKAEEEARKKAETAPQKFDNFDDFIGFD
INDNTNDEDMLSNMDYEDLKLDDKVPATTDNNLDMNNILENDESILDGLNMTLLDNGDHV
NEEFDVDSFLNQFGN
Edit:
The problem, as explained above, I face is that the output has additional '>' character for first sequence (only). I do not see the reason for the same and this is causing a lot of trouble. Output is:
Querying A0A1D8PC43
>A0A1D8PC43 A0A1D8PC43_CANAL Diphosphomevalonate decarboxylase
MYSASVTAPVNIATLKYWGKRDKSLNLPTNSSISVTLSQDDLRTLTTASASESFEKDQLW
LNGKLESLDTPRTQACLADLRKLRASIEQSPDTPKLSQMKLHIVSENNFPTAAGLASSAA
GFAALVSAIAKLYELPQDMSELSKIARKGSGSACRSLFGGFVAWEMGTLPDGQDSKAVEI
APLEHWPSLRAVILVVSDDKKDTPSTTGMQSTVATSDLFAHRIAEVVPQRFEAMKKAILD
KDFPKFAELTMKDSNSFHAVCLDSYPPIFYLNDTSKKIIKMVETINQQEVVAAYTFDAGP
NAVIYYDEANQDKVLSLLYKHFGHVPGWKTHYTAETPVAGVSRIIQTSIGPGPQETSESL
TK
Querying A0A1D8PC38
A0A1D8PC38 A0A1D8PC38_CANAL Cta2p OS=Candida albicans (strain
MPENLQTRLHNSLDEILKSSGYIFEVIDQNRKQSNVITSPNNELIQKSITQSLNGEIQNF
HAILDQTVSKLNDAEWCLGVMVEKKKKHDELKVKEEAARKKREEEAKKKEEEAKKKAEEA
KKKEEEAKKAEEAKKAEEAKKVEEAAKKAEEAKKAEEEARKKAETAPQKFDNFDDFIGFD
INDNTNDEDMLSNMDYEDLKLDDKVPATTDNNLDMNNILENDESILDGLNMTLLDNGDHV
NEEFDVDSFLNQFGN
$/ is the input record separator. Setting local $/ = "\n>"; means the input is split into records that each end with \n>, and chomp then removes that ending. The leading > of every record except the first is therefore consumed as part of the previous record's separator, so only the first record still starts with >, and print "$seq\n" shows it.
According to the Wikipedia article on FASTA, a line beginning with > is a comment and may not always be an ID. If it always is in your data, the following may fix it.
my ($id,$seq) = $seq =~ /^>*(.*)\n(\S+)/;
You set the record separator to \n>. This does not apply to the first sequence.
Fixed code sequence:
...
chomp $seq;
# for first sequence
$seq =~ s/^>//;
my ($id) = $seq =~ /^(\S+)/;
if ($id eq $lookup) {
...
Please note that your implementation is extremely inefficient, because it reads & parses the file contents for each query. How about splitting loading/parsing and querying into separate functions?
Alternative solution: give the full list of lookup values to the loader. It would then fill an answer array as it encounters the matches during reading the file.
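A minimal sketch of that alternative, reading the file once and collecting every requested ID along the way. The sub name extract_seqs, the returned hash layout, and the tiny in-memory sample are assumptions made for illustration, not part of the original script:

```perl
use strict;
use warnings;

# Hypothetical helper: reads FASTA records from an open filehandle once and
# returns a hash of id => full record for the requested IDs.
sub extract_seqs {
    my ($fh, @wanted) = @_;
    my %want = map { $_ => 1 } @wanted;
    my %found;

    local $/ = "\n>";                 # records are separated by "\n>"
    while (my $record = <$fh>) {
        chomp $record;                # removes a trailing "\n>" if present
        $record =~ s/^>//;            # only the first record still starts with ">"
        $record =~ s/\n\z//;          # the last record ends with a plain newline
        my ($id) = $record =~ /^(\S+)/ or next;
        $found{$id} = ">$record\n" if $want{$id};
        last if keys(%found) == keys(%want);   # stop once everything is found
    }
    return %found;
}

# Small in-memory sample instead of seq_all.fa, just to exercise the sub.
my $fasta = ">id1 first\nAAAA\nCCCC\n>id2 second\nGGGG\n>id3 third\nTTTT\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory file: $!";
my %seqs = extract_seqs($fh, qw(id1 id3));
close $fh;
print $seqs{$_} for sort keys %seqs;
```

Every returned record is normalised to start with > and end with a single newline, so the first sequence no longer looks different from the others.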

Perl: Find a match, remove the same lines, and to get the last field

Being a Perl newbie, please pardon me for asking this basic question.
I have a text file @server1 that contains a number of sentences (whitespace is the field separator) on many lines.
I needed to match lines with my keyword, remove the same lines, and extract only the last field, so I have tried with:
my @allmatchedlines;
open(output1, "ssh user1@server1 cat /tmp/myfile.txt |");
while(<output1>) {
chomp;
@allmatchedlines = $_ if /mysearch/;
}
close(output1);
my @uniqmatchedline = split(/ /, @allmatchedlines);
my $lastfield = $uniqmatchedline[-1]\n";
print "$lastfield\n";
and it gives me the output showing:
1
I don't know why it's giving me just "1".
Could someone please explain why I'm getting "1" and how I can get the last field of the matched line correctly?
Thank you!
my @uniqmatchedline = split(/ /, @allmatchedlines);
You're getting "1" because split takes a scalar, not an array. An array in scalar context returns the number of elements.
You need to split on each individual line. Something like this:
my @uniqmatchedline = map { split(/ /, $_) } @allmatchedlines;
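A short sketch of both points: what an array yields in scalar context, and taking the last field one line at a time. The sample lines and the sub name last_field are made up for illustration:

```perl
use strict;
use warnings;

my @lines = ('alpha beta gamma', 'delta epsilon');

# An array in scalar context yields its element count, which is
# what split ends up seeing when it is handed an array.
my $count = @lines;
print "count: $count\n";

# Hypothetical helper: last whitespace-separated field of one line.
sub last_field {
    my ($line) = @_;
    my @fields = split ' ', $line;
    return $fields[-1];
}

print last_field($_), "\n" for @lines;
```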
There are two issues with your code:
split is expecting a scalar value (string) to split on; if you are passing an array, it will convert the array to scalar (which is just the array length)
You did not have a way to remove same lines
To address these, the following code should work (not tested as no data):
my @allmatchedlines;
open(output1, "ssh user1\@server1 cat /tmp/myfile.txt |");
while(<output1>) {
chomp;
push @allmatchedlines, $_ if /mysearch/;
}
close(output1);
my %existing;
my @uniqmatchedline = grep !$existing{$_}++, @allmatchedlines; # this will return the unique lines
my @lastfields = map { ((split / /, $_)[-1]) . "\n" } @uniqmatchedline; # this maps the last field in each line into an array
print for @lastfields;
Apart from two errors in the code, I find the statement "remove the same lines and extract only the last field" unclear. Once duplicate matching lines are removed, there may still be multiple distinct sentences with the pattern.
Until a clarification comes, here is code that picks the last field from the last such sentence.
use warnings 'all';
use strict;
use List::MoreUtils qw(uniq);
my $file = '/tmp/myfile.txt';
my $cmd = "ssh user1\@server1 cat $file";
open my $fh, '-|', $cmd or die "Error opening $cmd: $!";
my @allmatchedlines;
while (<$fh>) {
chomp;
push @allmatchedlines, $_ if /mysearch/;
}
close $fh;
my @unique_matched_lines = uniq @allmatchedlines;
my $lastfield = ( split ' ', $unique_matched_lines[-1] )[-1];
print $lastfield, "\n";
I changed to the three-argument open with a lexical filehandle, and added error checking. Recall that open for a process involves a fork and returns the pid, so an "error" here does not relate to what happened with the command itself. See open. Also note that @ inside "..." indicates an array and thus needs to be escaped.
The (default) pattern ' ' used in split splits on any amount of whitespace. The regex / / turns off this behavior and splits on a single space. You most likely want to use ' '.
For more comments please see the original post below.
The statement @allmatchedlines = $_ if /mysearch/; on every iteration assigns to the array, overwriting whatever has been in it. So you end up with only the last line that matched mysearch. You want push @allmatchedlines, $_ ... to get all those lines.
Also, as shown in the answer by Justin Schell, split needs a scalar so it is taking the length of @allmatchedlines – which is 1 as explained above. You should have
my @words_in_matched_lines = map { split } @allmatchedlines;
When all this is straightened out, you'll have words in the array @uniqmatchedline and if that is the intention then its name is misleading.
To get unique elements of the array you can use the module List::MoreUtils
use List::MoreUtils qw(uniq);
my @unique_elems = uniq @whole_array;

What's wrong with this code to read a file?

I have been trying to read a file called "perlthisfile.txt" which is basically the output of nmap on my computer.
I want to get only the IP addresses printed out, so I wrote the following code, but it is not working:
#!/usr/bin/perl
use strict;
use warnings;
use Scalar::Util qw(looks_like_number);
print"\n running \n";
open (MYFILE, 'perlthisfile.txt') or die "Cannot open file\n";
while(<MYFILE>) {
chomp;
my @value = split(' ', <MYFILE>);
print"\n before foreach \n";
foreach my $val (@value) {
if (looks_like_number($val)) {
print "\n looks like number block \n";
if ($val == /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,5})/) {
print "\n$val\n";
}
}
}
}
close(MYFILE);
exit 0;
And when I ran this code the output was:
running
before foreach
before foreach
looks like number block
before foreach
looks like number block
before foreach
looks like number block
My perlthisfile.txt:
Starting Nmap 6.00 ( http://nmap.org ) at 2013-10-16 22:59 EST
Nmap scan report for BoB2.iiNet (10.1.1.1)
Nmap scan report for android-fbff3c3812154cdc (10.1.1.3)
All 1000 scanned ports on android-fbff3c3812154cdc (10.1.1.3) are closed
Nmap scan report for 10.1.1.5
All 1000 scanned ports on 10.1.1.5 are open|filtered
Nmap scan report for 10.1.1.6
All 1000 scanned ports on 10.1.1.6 are closed
Several issues here. As @toolic said, calling <MYFILE> inside the split is probably not what you want: it will read the next record from the file. Use $_ instead.
Also, you are using == with a regex, you should use the binding operator, =~ (== is only used for numeric comparisons in Perl):
if ($val =~ /(\d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3}\:\d{1,5})/){
I suggest that looks_like_number is redundant if the regex works. I suspect that you are using it because == gives something like isn't numeric in numeric eq (==) depending on the version of perl you are using.
You had a few errors, one of which is the regex, which should make the port number part (: and the following \d{1,5}) optional.
#!/usr/bin/perl
use strict;
use warnings;
open (my $MYFILE, '<', 'perlthisfile.txt') or die $!;
my $looks_like_ip = qr/( \d{1,3}\.\d{1,3}\.\d{1,3}\.\d{1,3} (?: : \d{1,5})? )/x;
while (<$MYFILE>) {
chomp;
my @value = split;
print"\n before foreach \n";
foreach my $val (@value) {
if (my ($match) = $val =~ /$looks_like_ip/){
print "\n$match\n";
}
# else { print "$val doesn't contain IP\n" }
}
}
close($MYFILE) or warn $!;
If this is what it looks to be, which is a quick hack to extract IPs, you might get away with something simple such as:
perl -nlwe '/((?:\d+\.)+\d+)/ && print $1' perlthisfile.txt
Which is to say, not a very strict regex by any means, it just matches numbers joined by periods. If you'd like to only print unique IPs, you can make use of a hash to dedupe:
perl -nlwe '/((?:\d+\.)+\d+)/ && !$seen{$1}++ && print $1' perlthisfile.txt
With a slightly tighter regex that also matches port numbers:
perl -nlwe '/((?:\d+[\.:]){3,4}\d+)/ && print $1' perlthisfile.txt
This will disallow shorter chains of numbers, and allow for a port number.
This last regex explained:
/( # opening parenthesis, starts a string capture
(?: # a non-capturing parenthesis
\d+ # match a number, repeated one or more times
[\.:] # [ ... ] is a character class, it matches one of the literal
# characters inside it, and only one time
){3,4} # closing the non-capturing parenthesis, adding a quantifier
# that says this parenthesis can match 3 or 4 times
\d+ # match one or more numbers
)/x # close capturing parenthesis (added `/x` switch)
The /x switch is just so that you can use the above regex as-is, with comments and whitespace.
The logic behind this is simply: We want a string consisting of a number followed by a period or a colon. We want this string 3 or 4 times. End with another number.
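As a rough check of that logic, the same regex can be wrapped in a small sub and run against a few lines like the ones in the question. The sub name extract_addresses is made up for the example:

```perl
use strict;
use warnings;

# Hypothetical helper: returns every address-like token in a line of text,
# using the "slightly tighter" regex explained above.
sub extract_addresses {
    my ($text) = @_;
    return $text =~ /((?:\d+[\.:]){3,4}\d+)/g;
}

my @sample = (
    'Nmap scan report for BoB2.iiNet (10.1.1.1)',
    'All 1000 scanned ports on 10.1.1.5 are open|filtered',
    'Starting Nmap 6.00 ( http://nmap.org ) at 2013-10-16 22:59 EST',
);
print "$_\n" for map { extract_addresses($_) } @sample;
```

The third line yields nothing, because neither 6.00 nor 22:59 has the three or four number-plus-separator groups that the quantifier requires.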
The + and {3,4} are quantifiers, they dictate how many times the item to the left of it is supposed to match. By default, every item matches one time, but by using a quantifier you can change that. + is shorthand for {1,}, and you also have:
? -> {0,1}
* -> {0,}
The syntax is {min,max}, and when max is missing, there is no upper limit, so the item matches as many times as possible.

Extracting specific lines with Perl

I am writing a Perl program to extract the lines that are between two patterns I am matching. For example, the text file below has 6 lines. I am matching load balancer and end. I want to get the 4 lines that are in between.
**load balancer**
new
old
good
bad
**end**
My question is how do you extract lines in between load balancer and end into an array. Any help is greatly appreciated.
You can use the flip-flop operator to tell you when you are between the markers. It will also include the actual markers, so you'll need to exclude them from the data collection.
Note that this will mash together all the records if you have several, so if you do, you need to store and reset @array somehow.
use strict;
use warnings;
my @array;
while (<DATA>) {
if (/^load balancer$/ .. /^end$/) {
push @array, $_ unless /^(load balancer|end)$/;
}
}
print @array;
__DATA__
load balancer
new
old
good
bad
end
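One possible way to store and reset the array when there are several records is to push a copy of it onto a list of blocks each time the end marker is seen. This is only a sketch: the sub name collect_blocks and the sample input are made up, and it assumes every block is properly closed by an end line:

```perl
use strict;
use warnings;

# Hypothetical helper: returns a list of array refs, one per
# "load balancer" ... "end" block found in the given (chomped) lines.
sub collect_blocks {
    my @lines = @_;
    my (@blocks, @current);
    for (@lines) {
        # The flip-flop returns a sequence number while inside a block;
        # the number for the closing line has "E0" appended.
        my $range = /^load balancer$/ .. /^end$/ or next;
        next if $range == 1;                 # the start marker itself
        if ($range =~ /E0$/) {               # the end marker closes the block
            push @blocks, [@current];        # store a copy of this block
            @current = ();                   # reset for the next block
            next;
        }
        push @current, $_;
    }
    return @blocks;
}

my @blocks = collect_blocks(
    'load balancer', 'new', 'old', 'end',
    'noise',
    'load balancer', 'good', 'bad', 'end',
);
print join(',', @$_), "\n" for @blocks;
```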
You can use the flip-flop operator.
Additionally, you can also use the return value of the flipflop to filter out the boundary lines. The return value is a sequence number (starting with 1) and the last number has the string E0 appended to it.
# Define the marker regexes separately, cuz they're ugly and it's easier
# to read them outside the logic of the loop.
my $start_marker = qr{^ \s* \*\*load \s balancer\*\* \s* $}x;
my $end_marker = qr{^ \s* \*\*end\*\* \s* $}x;
while( <DATA> ) {
# False until the first regex is true.
# Then it's true until the second regex is true.
next unless my $range = /$start_marker/ .. /$end_marker/;
# Flip-flop likes to work with $_, but it's bad form to
# continue to use $_
my $line = $_;
print $line if $range !~ /^1$|E/;
}
__END__
foo
bar
**load balancer**
new
old
good
bad
**end**
baz
biff
Outputs:
new
old
good
bad
If you prefer a command line variation:
perl -ne 'print if m{\*load balancer\*}..m{\*end\*} and !m{\*load|\*end}' file
For files like this, I often use a change in the Record Separator ( $/ or $RS from English )
use English qw<$RS>;
local $RS = "\nend\n";
my $record = <$open_handle>;
When you chomp it, you get rid of that line.
chomp( $record );
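A self-contained sketch of the record separator approach, using an in-memory filehandle in place of a real file. It assumes the data starts with the load balancer line, and it uses $/ directly, which is the same variable as $RS:

```perl
use strict;
use warnings;

my $text = "load balancer\nnew\nold\ngood\nbad\nend\nafter\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

my @lines;
{
    local $/ = "\nend\n";          # same variable as $RS from English
    my $record = <$fh>;            # everything up to and including "\nend\n"
    chomp $record;                 # strips the "\nend\n" terminator
    @lines = split /\n/, $record;
    shift @lines;                  # drop the "load balancer" marker line
}
close $fh;
print "$_\n" for @lines;
```

The bare block keeps the changed separator local, so later reads in the program go back to line-by-line input.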