Perl: How to print next line after matching a pattern? - perl

I would like to print specific data after matching a pattern or line. I have a file like this:
#******************************
List : car
Design: S
Date: Sun 10:10
#******************************
b-black
g-green
r-red
Car Type No. color
#-------------------------------------------
N17 bg099 g
#-------------------------------------------
Total 1 car
#******************************
List : car
Design: L
Date: Sun 10:20
#******************************
b-black
g-green
r-red
Car Type No. color
#-------------------------------------------
A57 ft2233 b
#-------------------------------------------
Total 1 car
#******************************
List : car
Design: M
Date: Sun 12:10
#******************************
b-black
g-green
r-red
Car Type No. color
#-------------------------------------------
L45 nh669 g
#-------------------------------------------
Total 1 car
#. .
#. .
#.
#.
For example, I want to print the data that follows the "Car Type..." header and the dashed "------" line, i.e. N17 and bg099. I have tried this, but it does not work.
my @array;
While (@array = <FILE>) {
    foreach my $line (@array) {
        if ($line =~ m/(Car)((.*))/) {
            my $a = $array[$i+2];
            push (@array, $a);
        }
        if ($array[$i+2] =~ m/(.*)\s+(.*)\s+(.*)/) {
            my $car_type = "$1";
            print "$car_type\n";
        }
    }
}
Expected Output:
Car Type No.
N17 bg099
A57 ft2233
L45 nh669
.. ..
. .

while (<FILE>) {                             # read line by line
    if ($_ =~ /^Car/) {                      # if the line starts with 'Car'
        # Read the first line after a Car line (the '---' line) to get to the next one.
        <FILE> or die "Bad car file format";
        # The second line after Car is the one we're interested in.
        my $model = <FILE>;
        # Capture the first two words (no need for an if, assuming correct file format).
        # You can replace [^\s] with \w, but I prefer the first option.
        $model =~ /^([^\s]+)\s+([^\s]+)/;
        print "$1 $2\n";
    }
}
Or if you prefer a more compact solution:
while (<FILE>) {
    if ($_ =~ /^Car/) {
        <FILE> or die "Bad car file format";
        print join(" ",(<FILE> =~ /(\w+)\s+(\w+)/))."\n";
    }
}
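Both loops above assume that the bareword handle FILE is already open. As a minimal, self-contained sketch of the same idea, the version below uses a lexical filehandle and stops cleanly if the file ends early. The sub name car_rows and the in-memory sample (two blocks copied from the question) are assumptions made for illustration only.

```perl
use strict;
use warnings;

# Hypothetical helper: returns "Type No." pairs found two lines below each
# line that starts with "Car".
sub car_rows {
    my ($fh) = @_;
    my @rows;
    while (my $line = <$fh>) {
        next unless $line =~ /^Car/;      # header line of a block
        my $rule = <$fh>;                 # the dashed line that follows
        last unless defined $rule;        # file ended early
        my $data = <$fh>;                 # the line we actually want
        last unless defined $data;
        push @rows, "$1 $2" if $data =~ /^(\S+)\s+(\S+)/;
    }
    return @rows;
}

# Sample input taken from the question, read through an in-memory filehandle.
my $sample = <<'END';
Car Type No. color
#-------------------------------------------
N17 bg099 g
#-------------------------------------------
Total 1 car
Car Type No. color
#-------------------------------------------
A57 ft2233 b
#-------------------------------------------
Total 1 car
END

open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
print "$_\n" for car_rows($fh);
close $fh;
```

To run it against a real file, open $fh on the file name instead of on \$sample.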

Here's another option:
use strict;
use warnings;
print "Car Type\tNo.\n";
while (<>) {
    if (/#-{32}/) {
        print "$1\t$2\n" if <> =~ /(\S+)\s+(\S+)/;
        <>;
    }
}
Output:
Car Type No.
N17 bg099
A57 ft2233
L45 nh669
Usage: perl script.pl inFile [>outFile]
Edit: Simplified

I got your code to work with a couple small tweaks.
It's still not perfect but it works.
"while" should be lower case.
You never increment $i.
The way you reuse @array is confusing at best, but if you just output $a you'll get your car data.
Code:
$file_to_get = "input_file.txt";
open (FILE, $file_to_get) or die $!;
my @array;
while (@array = <FILE>) {
    $i = 0;
    foreach my $line (@array) {
        if ($line =~ m/(Car)((.*))/) {
            my $a = $array[$i+2];
            push (@array, $a);
            print $a;
        }
        $i++;
    }
}
close(FILE);

Something like this:
while (my $line = <>) {
    next unless $line =~ /Car\s+Type/;
    next unless $line = <> and $line =~ /^#----/;
    next unless $line = <>;
    my @fields = split ' ', $line;
    print "@fields[0,1]\n";
}

A shell one-liner to do the same thing:
echo "Car Type No. "; \
grep -A 2 Type data.txt \
| grep -v -E '(Type|-)' \
| grep -o -E '(\w+ *\w+)'

perl -lne 'if(/Type/){$a=<>;$a=<>;$a=~m/^([^\s]*)\s*([^\s]*)\s/g; print $1." ".$2}' your_file
tested:
> perl -lne 'if(/Type/){$a=<>;$a=<>;$a=~m/^([^\s]*)\s*([^\s]*)\s/g; print $1." ".$2}' temp
N17 bg099
A57 ft2233
L45 nh669
If you want to use awk, you can do it like this:
> awk '/Type/{getline;if($0~/^#---*/){getline;print $1,$2}}' your_file
N17 bg099
A57 ft2233
L45 nh669

A solution using Perl's flip-flop (range) operator. It assumes, based on the input, that a Total line always ends the block of interest:
perl -ne '$a=/^#--/;$b=/^Total/;print if(($a .. $b) && !$a && !$b);' file

Related

Pick up the longest peptide using perl

I want to find out the longest possible protein sequence translated from cds in 6 forward and reverse frame.
This is the example input format:
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
I would like to find all the substrings that start at "M" and stop at "X", compute the length of each, and select the longest.
For example, in the case above:
the script will find,
>111 has two matches:
MGFSOX
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222 has one match:
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX
Then it counts each match's length and prints the longest match together with its length, which is the result I want:
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX 38
But my script below prints no result. Does anyone know how to fix it? Any suggestion would be helpful.
#!/usr/bin/perl -w
use strict;
use warnings;
my @pep=();
my $i=();
my @Xnum=();
my $n=();
my %hash=();
my @k=();
my $seq=();
$n=0;
open(IN, "<$ARGV[0]");
while(<IN>){
    chomp;
    if($_=~/^[^\>]/){
        @pep=split(//, $_);
        if($_ =~ /(X)/){
            push(@Xnum, $1);
            if($n >= 0 && $n <= $#Xnum){
                if(@pep eq "M"){
                    for($i=1; $i<=$#pep; $i++){
                        $seq=join("",@pep);
                        $hash{$i}=$seq;
                        push(@k, $i);
                    }
                }
                elsif(@pep eq "X"){
                    $n=$n+1;
                }
                foreach (sort {$a cmp $b} @k){
                    print "$hash{$k[0]}\t$k[0]";
                }
            }
        }
    }
    elsif($_=~/^\>/){
        print "$_\n";
    }
}
close IN;
Check out this Perl one-liner
$ cat iris.txt
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
$ perl -ne ' if(!/^>/) { print "$p"; while(/(M[^M]+?X)/g ) { if(length($1)>length($x)) {$x=$1 } } print "$x ". length($x)."\n";$x="" } else { $p=$_ } ' iris.txt
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPX 7
$
There's more than one way to do it!
Try this too:
print and next if /^>/;
chomp and my @z = $_ =~ /(M[^X]*X)/g;
my $m = "";
for my $s (@z) {
    $m = $s if length $s > length $m
}
say "$m\t" . length $m
Output:
>111
MJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX 32
>222
MPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPX 38
This uses Perl >= 5.14; make sure to run the script with perl -n.
As a one-liner:
perl -E 'print and next if /^>/; chomp and my @z = $_ =~ /(M[^X]*X)/g; my $m = ""; for my $s (@z) { $m = $s if length $s > length $m } say "$m\t" . length $m' -n data.txt
Here is solution using reduce from List::Util.
Edit: mistakenly used maxstr which gave results but is not what was needed. Have reedited this post to use reduce (correctly) instead.
#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw/reduce/;
open my $fh, '<', \<<EOF;
>111
KKKKKKKMGFSOXLKPXLLLLLLLLLLLLLLLLLMJJJJJJJJJJJJJJJJJJJJJJJJJJJJJJX
>222
WWWMPPPPPPPPPPPPPPPPPPPPPPPPPPPPPPMPPPPPXKKKKKK
EOF
my $id;
while (<$fh>) {
    chomp;
    if (/^>/) {
        $id = $_;
    }
    else {
        my $data = reduce {length($a) > length($b) ? $a : $b} /M[^X]*X/g;
        print "$id\n$data\t" . length($data) . "\n" if $data;
    }
}
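The same selection can also be written without List::Util. The following is only a minimal sketch: the sub name longest_m_to_x and the two short sequences are made up for illustration, and the pattern M[^X]*X is the one used in the answers above (it runs from an M to the first X after it, so an inner M stays part of the match).

```perl
use strict;
use warnings;

# Hypothetical helper: longest substring that starts at an "M" and ends at the
# first "X" after it. Ties keep the earliest match; no match returns ''.
sub longest_m_to_x {
    my ($seq) = @_;
    my $best = '';
    while ($seq =~ /(M[^X]*X)/g) {
        $best = $1 if length($1) > length($best);
    }
    return $best;
}

# Made-up short sequences, not the data from the question.
for my $seq ('AAMBXCCMDDDDXEE', 'WWMPPMPPXKK') {
    my $hit = longest_m_to_x($seq);
    print "$hit\t", length($hit), "\n";
}
```

For the first sequence the candidates are MBX and MDDDDX, so the longer one is reported; the second sequence has a single candidate, MPPMPPX.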
Here's my take on it.
I like fasta files tucked into a hash, with the fasta name as the key. This way you can just add descriptions to it, e.g. base composition etc...
#!/usr/local/ActivePerl-5.20/bin/env perl
use strict;
use warnings;
my %prot;
open (my $fh, '<', '/Users/me/Desktop/fun_prot.fa') or die $!;
my $string = do { local $/; <$fh> };
close $fh;
chomp $string;
my @fasta = grep {/./} split (">", $string);
for my $aa (@fasta){
    my ($key, $value) = split ("\n", $aa);
    $value =~ s/[A-Z]*(M.*M)[A-Z]/$1/;
    $prot{$key}->{'len'} = length($value);
    $prot{$key}->{'prot'} = $value;
}
for my $sequence (sort { $prot{$b}->{'len'} <=> $prot{$a}->{'len'} } keys %prot){
    print ">" . $sequence, "\n", $prot{$sequence}->{'prot'}, "\t", $prot{$sequence}->{'len'}, "\n";
    last;
}
__DATA__
>1232
ASDFASMJJJJJMFASDFSDAFSDDFSA
>2343
AASFDFASMJJJJJJJJJJJJJJMRGQEGDAGDA
Output
>2343
MJJJJJJJJJJJJJJM 16

how to count a repeating string in a line using perl

I have the below file
file1:
abc def host 123 host 869 host
I wrote below script to count the occurrence of a "host" keyword in each line.
I tried several approaches (see the commented-out lines), but none of them seems to work. The sed command worked on the command line but not inside the Perl script.
#!/usr/bin/perl
open(SOURCE,"</home/amp/surevy01/file1");
open(DESTINATION,"</home/amp/surevy01/file2");
while(my $line = <SOURCE>)
{
while(my $line1 = <DESTINATION>)
{
#chomp($line);
#chomp($line1);
if ($line =~ "host")
{
#my $count = grep {host} $line;
#my $count = `sed -i {s/host/host\n/g} $line1 | grep -c {host}`;
#my $count = `perl -pi -e 's/host/host\n/g' $line1 | grep -c host`;
#my $count grep ("host" ,$line);
print "$count";
print "match found \n";
next;
}
else
{
print "match not found \n";
exit;
}
}
}
I'm a beginner in Perl and am looking for your suggestions.
Your own solution will also match words like hostages and Shostakovich.
grep is the canonical way to count elements of a list, and split will turn your line into a list of words, giving
my $count = grep { $_ eq 'host' } split ' ', $line;
I don't know why you're looping through two files in your example, but you can use the /g (global) flag:
my $line = "abc def host 123 host 869 host";
my $x = 0;
while ($line =~ /host/g){
$x++;
}
print "$x\n"; # 3
When you run a regex with /g in scalar context (as in the condition of the while statement), it keeps track of the position of the last match and restarts from there. Therefore, /host/g in a loop as above will find each occurrence of host. You can also use /g in list context:
my $line = "abc def host 123 host 869 host";
my @matches = $line =~ /host/g;
print scalar @matches; # 3 again
In this case, @matches will contain all matches of the regexp against the string, which will be ('host', 'host', 'host') since the pattern is a simple string. Then, scalar(@matches) yields the length of the list.
This produces the number of instances of host in $line:
my $count = () = $line =~ /host/g;
But that also matches hosting. To avoid that, the following will probably do the trick:
my $count = () = $line =~ /\bhost\b/g;
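The =()= construct is informally known as Perl's "Goatse" secret operator: it forces list context on the match, and the assignment then yields the number of matches.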
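Putting these pieces together, here is a small sketch that wraps the counting idiom in a sub. The name count_word is just for illustration; \Q...\E is added on the assumption that the word should be matched literally even if it contains regex metacharacters.

```perl
use strict;
use warnings;

# Hypothetical helper: counts whole-word occurrences of $word in $line.
sub count_word {
    my ($line, $word) = @_;
    # The empty-list assignment forces list context on the match;
    # assigning that to a scalar yields the number of matches.
    my $count = () = $line =~ /\b\Q$word\E\b/g;
    return $count;
}

print count_word("abc def host 123 host 869 host", "host"), "\n";   # 3
print count_word("hosting hostages host", "host"), "\n";            # 1
```

The second call shows why the \b anchors matter: hosting and hostages are not counted.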

comparing hash element to table element perl

I have a program that compares two files line by line; each line contains one word. The simple approach reads both files, stores the data in arrays, and compares the elements of the two arrays.
The first file contains:
straight
work
week
belief time
saturday
wagon
australia
sunday
french
...
and the second file contains:
firepower
malaise
bryson
wagon
dalglish
french
...
That takes a long time to compare the files, so I tried another solution using a hash, but it doesn't work:
#!/usr/bin/perl
use strict;
use warnings;
open( FIC, $ARGV[0] );
open( FICC, $ARGV[1] );
print "choose the name of the file\n";
chomp( my $fic2 = <STDIN> );
open( FIC2, ">$fic2" );
my $i=0;
my $j=0;
my @b=();
my %stops;
while (<FIC>) #read each line into $_
{
    # Remove newline from $_
    chomp;
    $_ =~ s/\s+$//;
    $stops{$_} = $i; # add the line to
    $i++;
}
close FIC;
while (<FICC>) {
    my $ligne = $_;
    $ligne =~ s/\s+$//;
    $b[$i] = lc($ligne);
    # $b contain the data
    $i++;
}
foreach my $che (@b) {
    chomp($che);
    print FIC2 $che;
    print FIC2 " ";
    print FIC2 $stops{"$che"}; print FIC2 "\n";
    #this returns nothing
}
The problem is with the expression $stops{"$che"}: in the case where the element doesn't exist in the hash %stops, it returns an integer and an error:
Use of uninitialized value in print at c:/ats2/hash.pl line 44, line 185B2
Does this do what you want?
join <(sort file1) <(sort file2) >result
Works in bash.
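If you would rather stay in Perl, here is a minimal sketch of the hash approach that tests with exists before using a value, which avoids the uninitialized-value warning. The sub name lookup_words and the sample words are assumptions for illustration; unlike the script in the question, it lowercases and trims both lists, and it keeps its own counter for each list instead of reusing $i.

```perl
use strict;
use warnings;

# Hypothetical helper: for each word in @$second, returns [word, index], where
# index is the 0-based position of the word in @$first, or undef if absent.
sub lookup_words {
    my ($first, $second) = @_;
    my %index;
    my $i = 0;
    for my $word (@$first) {
        (my $key = lc $word) =~ s/^\s+|\s+$//g;          # normalise the word
        $index{$key} = $i unless exists $index{$key};   # keep first position
        $i++;
    }
    my @result;
    for my $word (@$second) {
        (my $key = lc $word) =~ s/^\s+|\s+$//g;
        push @result, [ $key, exists $index{$key} ? $index{$key} : undef ];
    }
    return @result;
}

my @first  = ("straight", "work", "week", "wagon", "french");
my @second = ("firepower", "wagon", "French ");
for my $pair (lookup_words(\@first, \@second)) {
    my ($word, $pos) = @$pair;
    print defined $pos ? "$word $pos\n" : "$word (not found)\n";
}
```

Each lookup is a single hash access, so the cost grows with the total number of lines rather than with the product of the two file sizes.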

Discovering duplicate lines

I've got a file of CSS elements, and I'm trying to check for any duplicate CSS elements, then output the lines that contain the duplicates.
###Test
###ABC
###test
##.hello
##.ABC
##.test
bob.com###Test
~qwerty.com###Test
~more.com##.ABC
###Test and ##.ABC already exist in the list, and I'd like a way to output the lines where they are used in the file; basically duplication checking (case sensitive). So using the above list, I would generate something like this:
Line 1: ###Test
Line 7: bob.com###Test
Line 8: ~qwerty.com###Test
Line 5: ##.ABC
Line 9: ~more.com##.ABC
Something in bash, or maybe perl?
Thanks :)
Your problem challenged me, so I wrote you a script. Hope you like it. :)
#!/usr/bin/perl
use strict;
use warnings;
sub loadf($);
{
    my @file = loadf("style.css");
    my @inner = @file;
    my $l0 = 0; my $l1 = 0; my $l2 = 0; my $dc = 0; my $tc;
    foreach my $line (@file) {
        $l1++;
        $line =~ s/^\s+//;
        $line =~ s/\s+$//;
        foreach my $iline (@inner) {
            $l2++;
            $iline =~ s/^\s+//;
            $iline =~ s/\s+$//;
            next if ($iline eq $line);
            if ($iline =~ /\b$line\b/) {
                $dc++;
                if ($dc > 0) {
                    if ($l0 == 0) {
                        print "Line " . $l1 . ": " . $line . "\n";
                        $l0++;
                    }
                    print "Line " . $l2 . ": " . $iline . "\n";
                }
            }
        }
        print "\n" unless($dc == 0);
        $dc = 0; $l0 = 0; $l2 = 0;
    }
}
sub loadf($) {
    my @file = ( );
    open(FILE, $_[0] . "\n") or die("Couldn't Open " . $_[0] . "\n");
    @file = <FILE>;
    close(FILE);
    return @file;
}
__END__
This does exactly what you need. And sorry if it's a bit messy.
This seems to work:
sort -t '#' -k 2 inputfile
It groups them by the part after the # characters:
##.ABC
~more.com##.ABC
###ABC
##.hello
##.test
###test
bob.com###Test
~qwerty.com###Test
###Test
If you only want to see the unique values:
sort -t '#' -k 2 -u inputfile
Result:
##.ABC
###ABC
##.hello
##.test
###test
###Test
This pretty closely duplicates the example output in the question (it relies on some possibly GNU-specific features):
cat -n inputfile |
sed 's/^ *\([0-9][0-9]*\)/Line \1:/' |
sort -t '#' -k 2 |
awk -F '#+' '{if (! seen[$2]) { \
if ( count > 1) printf "%s\n", lines; \
count = 0; \
lines = "" \
}; \
seen[$2] = 1; \
lines = lines "\n" $0; ++count}
END {if (count > 1) print lines}'
Result:
Line 5: ##.ABC
Line 9: ~more.com##.ABC
Line 1: ###Test
Line 7: bob.com###Test
Line 8: ~qwerty.com###Test
I'd recommend using the uniq function if you can install List::MoreUtils; see the related question "How do I print unique elements in Perl array?"
Here is one way to do it, which is fairly easy to extend to multiple files if need be.
With this file find_dups.pl:
use warnings;
use strict;
my @lines;
while (<>) {                                      # read input lines
    s/^\s+//; s/\s+$//;                           # trim whitespace
    push @lines, {data => $_, line => $.} if $_   # store useful data
}
@lines = sort {length $$a{data} <=> length $$b{data}} @lines;  # shortest first
while (@lines) {
    my ($line, @found) = shift @lines;
    my $re = qr/\Q$$line{data}\E$/;               # search token
    @lines = grep {                               # extract matches from @lines
        not $$_{data} =~ $re && push @found, $_
    } @lines;
    if (@found) {                                 # write the report
        print "line $$_{line}: $$_{data}\n" for $line, @found;
        print "\n";
    }
}
then perl find_dups.pl input.css prints:
line 5: ##.ABC
line 9: ~more.com##.ABC
line 1: ###Test
line 7: bob.com###Test
line 8: ~qwerty.com###Test
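A single-pass alternative, as a rough sketch only: group the lines in a hash keyed by everything from the first "#" to the end of the line, and report the groups with more than one member. The sub name duplicate_groups is hypothetical, and the assumption that the key starts at the first "#" comes from the sample data in the question, not from any CSS or filter-list rule.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the lines of the file in order and returns one
# array ref per duplicated selector, each holding "Line N: text" strings.
sub duplicate_groups {
    my (@lines) = @_;
    my (%group, @order);
    my $n = 0;
    for my $text (@lines) {
        $n++;
        (my $trimmed = $text) =~ s/^\s+|\s+$//g;   # trim whitespace
        next unless $trimmed =~ /(#.*)\z/;         # key: first "#" to end of line
        my $key = $1;                              # case-sensitive on purpose
        push @order, $key unless exists $group{$key};
        push @{ $group{$key} }, "Line $n: $trimmed";
    }
    # Keep first-seen order and drop selectors that occur only once.
    return map { $group{$_} } grep { @{ $group{$_} } > 1 } @order;
}

my @css = ('###Test', '###ABC', '###test', '##.hello', '##.ABC', '##.test',
           'bob.com###Test', '~qwerty.com###Test', '~more.com##.ABC');
for my $g (duplicate_groups(@css)) {
    print "$_\n" for @$g;
    print "\n";
}
```

With the sample list this reports the ###Test group (lines 1, 7, 8) followed by the ##.ABC group (lines 5, 9).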

While and foreach mixed loop issue

#!C:\Perl\bin\perl.exe
use strict;
use warnings;
my $numArgs = $#ARGV + 1;
print "thanks, you gave me $numArgs command-line arguments.\n";
while (my $line = <DATA> ) {
    foreach my $argnum (0 .. $#ARGV) {
        if ($line =~ /$ARGV[$argnum]/)
        {
            print $line;
        }
    }
}
__DATA__
A
B
Hello World :-)
Hello World !
When I pass one argument, it works well, for example when I run test.pl A, test.pl B, or test.pl Hello.
When I pass two arguments, it works only some of the time.
Successful: when I run test.pl A B, test.pl A Hello, or test.pl B Hello.
Failed: when I run test.pl Hello World.
It produces duplicate lines in the output:
D:\learning\perl>t.pl Hello World
thanks, you gave me 2 command-line arguments.
Hello World :-)
Hello World :-)
Hello World !
Hello World !
D:\learning\perl>
How can I fix it? Thank you for reading and for your replies.
[update]
I don't want to print duplicate lines.
I don't see the problem, your script processes the __DATA__ and tests all input words against it: since "Hello" and "World" match twice each, it prints 4 rows.
If you don't want it to write multiple lines, just add last; after the print statement.
The reason you're getting the duplicate output is because the regex $line =~ /Hello/ matches both "Hello World" lines and $line =~ /World/ also matches both "Hello World" lines. To prevent that, you'll need to add something to remember which lines from the __DATA__ section have already been printed so that you can skip printing them if they match another argument.
Also, some very minor stylistic cleanup:
#!C:\Perl\bin\perl.exe
use strict;
use warnings;
my $numArgs = @ARGV;
print "thanks, you gave me $numArgs command-line arguments.\n";
while (my $line = <DATA> ) {
    foreach my $arg (@ARGV) {
        if ($line =~ /$arg/)
        {
            print $line;
        }
    }
}
__DATA__
A
B
Hello World :-)
Hello World !
Using an array in scalar context returns its size, so $size = @arr is preferred over $size = $#arr + 1.
If you're not going to use a counter for anything other than indexing through an array (for $i (0..$#arr) { $elem = $arr[$i]; ... }), then it's simpler and more straightforward to just loop over the array instead (for $elem (@arr) { ... }).
Your foreach loop could also be replaced with a grep statement, but I'll leave that as an exercise for the reader.
Assuming you want to print each line from DATA only once if one or more patterns match, you can use grep. Note the use of \Q to quote regex metacharacters in the command line arguments and the use of the @patterns array to precompile the patterns.
Read if grep { $line =~ $_ } @patterns out loud: If $line matches one or more patterns ;-)
#!/usr/bin/perl
use strict; use warnings;
printf "Thanks, you gave me %d command line arguments.\n", scalar @ARGV;
my @patterns = map { qr/\Q$_/ } @ARGV;
while ( my $line = <DATA> ) {
    print $line if grep { $line =~ $_ } @patterns;
}
__DATA__
A
B
Hello World :-)
Hello World !
Here are some comments on your script to help you learn:
my $numArgs = $#ARGV + 1;
print "thanks, you gave me $numArgs command-line arguments.\n";
The command line arguments are in @ARGV (please do read the documentation). In scalar context, @ARGV evaluates to the number of elements in that array. Therefore, you can simply use:
printf "Thanks, you gave me %d command line arguments.\n", scalar @ARGV;
Further, you can iterate directly over the elements of @ARGV in your foreach loop instead of indexed access.
while (my $line = <DATA> ) {
    foreach my $arg ( @ARGV ) {
        if ( $line =~ /$arg/ ) {
            print $line;
        }
    }
}
Now, what happens to your program if I pass ( to it on the command line? Or, even World? What should happen?
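One possible answer, sketched under the assumption that the arguments are meant as literal text rather than as regular expressions: escape each argument with \Q...\E and print a line at most once. The sub name matching_lines and the in-memory data are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the lines that contain at least one of the
# given words as a literal substring, each line at most once.
sub matching_lines {
    my ($words, $lines) = @_;
    my @patterns = map { qr/\Q$_\E/ } @$words;   # \Q escapes regex metacharacters
    return grep {
        my $line = $_;
        scalar grep { $line =~ $_ } @patterns;   # true if any pattern matches
    } @$lines;
}

my @data = ("A\n", "B\n", "Hello World :-)\n", "Hello World !\n");
print matching_lines([ "Hello", "World" ], \@data);   # each line printed once
print matching_lines([ "(" ], \@data);                # "(" is taken literally, no match
```

Without \Q, a pattern built from a lone ( fails to compile with an "Unmatched (" error; with it, the parenthesis is simply searched for as a character.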