My file looks like this:
1 15
2 16
3 18
4 19
5 25
6 30
7 55
8 45
9 34
10 52
If the matched pattern is 30 in line 6, I would like to grab N lines before and M lines after line 6. For example, if N=3 and M=4, the expected result is:
3 18
4 19
5 25
6 30
7 55
8 45
9 34
10 52
I am a very new beginner in Perl and any advice would be appreciated.
UPDATE
Many thanks for all the helpful advice below; I really appreciate it.
Here is my updated code, and any suggestions are welcome!
my $num;
my $file = $ARGV[0];
open(my $in, '<', $file) or die("Can't open the file: $!");
while (<$in>) {
    $num = $. if /pattern/;
}
close($in);
# Second pass: print the window around the matched line.
# $N and $M hold the before/after counts.
open($in, '<', $file) or die("Can't open the file: $!");
while (<$in>) {
    if ($. >= $num - $N and $. <= $num + $M) {
        print;
    }
}
close($in);
Maintain an array (I'll call it @preceding) of the last N lines read. When the pattern is matched, stop updating this array and start inserting lines into another array (@following). Do this until @following has M lines in it.
It should look something like this (fixed now thanks to ikegami):
my $matched = 0;
my @preceding;
my @following;
while (<>) {
    if ($matched) {
        push(@following, $_);
        last if @following == M;
        next;
    }
    else {
        push(@preceding, $_);
        shift(@preceding) if @preceding > N;
    }
    $matched = 1 if /pattern/;
}
my @lines = <>;
foreach $idx (grep { $lines[$_] =~ /pattern/ } 0..$#lines) {
    # $B = lines before, $A = lines after
    print join('', map { $lines[$_] } grep { $_ >= $idx - $B && $_ <= $idx + $A } 0..$#lines), "\n";
}
You can also use the GNU grep command, with -A,-B flags for that exact purpose.
-A NUM, --after-context=NUM
Print NUM lines of trailing context after matching lines.
Places a line containing -- between contiguous groups of
matches.
-B NUM, --before-context=NUM
Print NUM lines of leading context before matching lines.
Places a line containing -- between contiguous groups of
matches.
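For the sample file at the top of this question, the equivalent grep invocation would look something like this (assuming the data is in a file named data.txt):

```shell
# N=3 lines of leading context, M=4 lines of trailing context
# around the line matching "30" (line 6 in the sample file).
grep -B 3 -A 4 '30' data.txt
```

With the sample data this prints lines 3 through 10, matching the expected output in the question.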
This is a sample file, tab-separated.
2000 46 26
3000 52 25
5149 4 3
10000 104 32
10500 20 12
13397 0 3
20000 20 12
24489 8 0
I tried this with my Perl code, and it works fine for one condition, but I am unable to do the same when the conditions increase to 10001 to 20000, 30001 to 40000, and so on, until the end of the file.
I want output as :-
1 10000 102 54
10001 20000 124 47
20001 30000 28 12
...and so on.
#! /usr/bin/perl
my $file = "$ARGV[0]";
open (f, $file);
my @f = <f>;
foreach $F1 (@f) {
    ($a, $b, $c) = split(/\t/, $F1);
    $x = "1";
    $y = "10000";
    if ( ( $a > $x ) && ( $a <= $y ) ) {
        $total += $b;
        $total_1 += $c;
    }
    #$x = $y;
    #$y = $y*2;
}
print "$x\t$y\t$total\t$total_1\n";
By starting with the simple case and then trying to build on that, you're actually making things harder than they need to be. This is one example where seeing the bigger picture helps to simplify the code.
You're splitting your data into "buckets" - using the first column to determine which bucket the record should go into and then summing the second and third columns within a bucket.
I would write it something like this.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';

# Store bucket totals here
my @totals;

# Read from STDIN
while (<>) {
    # Skip blank lines
    next unless /\S/;
    # Split the data on whitespace
    my @cols = split;
    # Calculate the bucket.
    # 1 - 10,000 is bucket 0
    # 10,001 - 20,000 is bucket 1
    # etc...
    my $bucket = int($cols[0] / 10_000);
    # Each element in @totals is a two-element array.
    # The first element is the sum of column two.
    # The second element is the sum of column three.
    $totals[$bucket][0] += $cols[1];
    $totals[$bucket][1] += $cols[2];
}

# Walk the @totals array and display the results.
for (0 .. $#totals) {
    my $start = ($_ * 10_000) + 1;
    my $end   = ($_ + 1) * 10_000;
    say "$start $end $totals[$_][0] $totals[$_][1]";
}
As we read from <>, there is no need to bother with opening filehandles.
I put this in a file called sum and called it like this:
$ ./sum in.txt
And the result I got was:
1 10000 102 54
10001 20000 124 47
20001 30000 28 12
Which looks correct to me. Let me know if you have any questions.
#!/usr/bin/perl
my $file = $ARGV[0];
my $value = $ARGV[1];
my @grabbed;
open (FILE, $file);
while (<FILE>) {
    if (/alignment# residue#/) {
        push @grabbed, $_;
        while (<FILE>) {
            last if /^$/;
            push @grabbed, $_;
        }
    }
}
close (FILE);
my $line = `awk ' {if(\$2==$value)} ' @grabbed`;
print $line;
Problems:
1. First, I don't know if it's possible to run awk on an array.
2. I am trying to match a value in the second column of the 2-D array (@grabbed). @grabbed will look like this:
7 1 M 1.000 6 .VPMLG 66.63
8 2 S 1.000 10 .QINTSARKG 66.63
9 3 V 1.000 13 .KTAVFPRGQMSL 66.63
10 4 L 1.000 7 .SLAKFT 66.63
11 5 L 1.000 14 .ALSVQWIKMRYPF 66.63
12 6 R 1.000 16 .DERSAVGTNQLYMIP 66.63
13 7 S 1.000 18 .GDTHPKRSALFCIQVYN 66.63
14 8 G 1.000 17 .DRFLENGAQPSTYCHM 66.63
15 9 L 1.000 19 .NDHPELASVKRCWFGTQI 66.63
16 10 G 1.000 18 .RLDPEGFTYAVCIKNMH 66.63
I am trying to match and grab the line in which column 2 is of value "9".
No need to switch to awk when that job can be done with perl too.
my @line;
for ( @grabbed ) {
    my @f = split;
    if ( $f[1] == $value ) {
        push @line, $_;
    }
}
It appears that by "2D Array" you mean an array of strings, each string being a whitespace-delimited list of values.
Perl is made for this sort of thing. You could use the other answer's suggestion of splitting each line and looking at each value; however, a simple regular expression would be faster. Replace your awk line with something like this:
foreach (@grabbed)
{
    # Match the beginning of the line, possibly some whitespace,
    # then some digits, then more whitespace, then the contents of $value
    # (the \b keeps 9 from matching 90, etc.)
    if (/^\s*\d+\s+$value\b/)
    {
        # The line matched: do stuff
    }
}
Also, will you ever need to look at the lines that don't match? If not, it would be much more efficient not to put the whole file into an array; instead, just do all of your processing in the while loop.
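A minimal sketch of that approach (the input is inlined here via a scalar filehandle; in the real script it would be the FILE handle from the code above, and the column test is an assumption based on the question):

```perl
use strict;
use warnings;

# Sketch: do the column test inside the read loop itself, so
# non-matching lines are never stored in an array.
my $value = 9;
my $data = <<'END';
14 8 G 1.000 17 .DRFLENGAQPSTYCHM 66.63
15 9 L 1.000 19 .NDHPELASVKRCWFGTQI 66.63
16 10 G 1.000 18 .RLDPEGFTYAVCIKNMH 66.63
END
my @hits;
open my $fh, '<', \$data or die $!;
while (my $line = <$fh>) {
    my @f = split ' ', $line;             # split on whitespace
    push @hits, $line if $f[1] == $value; # column 2 is index 1
}
close $fh;
print @hits;
```

This keeps memory use flat no matter how large the grabbed block is.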
my file is like
line 1
line 2
line 3
target
line 5
line 6
line 7
I can write a regex that matches the target. All I need is to grab lines 2, 3, 5, and 6.
Is there any way to do it?
If you're not determined to use perl you can easily extract the context you want with grep and Context Line Control options
grep -A 2 -B 2 target filename | grep -v target
Of course target will need to be replaced by a suitable regex.
Robert is on the right path. You have to use a multiline regex and match the 2 previous and 2 following lines:
#!/usr/bin/perl -w
my $lines = <<EOF;
line 1
line 2
line 3
target
line 5
line 6
line 7
EOF
# Match a new line, then 2 lines, then target, then 2 lines.
# { $1 } { $3 }
my $re = qr/^.*\n((.*?\n){2})target\n((.*?\n){2}).*$/m;
(my $res = $lines) =~ s/$re/$1$3/;
print $res;
@lines = ('line 1', 'line 2', 'line 3', 'target', 'line 5', 'line 6', 'line 7');
my %answer;
$regex = 'target';
for my $idx (0..$#lines) {
if ($lines[$idx] =~ /$regex/) {
for $ii (($idx - 2)..($idx + 2)){
unless ($lines[$ii] =~ /^$regex$/) {$answer{$ii} = $lines[$ii];}
}
}
}
foreach $key (sort keys %answer) { print "$answer{$key}\n" }
Which yields...
[mpenning@Bucksnort ~]$ perl search.pl
line 2
line 3
line 5
line 6
[mpenning@Bucksnort ~]$
EDIT
Fixed for @leonbloy's comment about multiple target strings in the file
slurp the file to a list / array, find the index of the matching line, and use this index to get the desired values (using offsets)
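A sketch of that recipe (the data is inlined here for brevity; with a real file you would slurp it via my @lines = <$fh>, and near the edges of the file you would also want to clamp the offsets to the array bounds):

```perl
use strict;
use warnings;

# Slurp into an array, locate the matching index, slice with offsets.
my @lines = ("line 1\n", "line 2\n", "line 3\n", "target\n",
             "line 5\n", "line 6\n", "line 7\n");
my ($idx) = grep { $lines[$_] =~ /target/ } 0 .. $#lines;
my ($n, $m) = (2, 2);    # lines before / after
my @context = (@lines[ $idx - $n .. $idx - 1 ],
               @lines[ $idx + 1 .. $idx + $m ]);
print @context;          # line 2, line 3, line 5, line 6
```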
Although this was asked 8 months ago, I had to rethink this question, since none of the solutions I could find met my aims. My goal was a script that examines many huge log files and makes extracts from them, containing only the wanted lines, with an optional number of lines before and after the line containing the searched pattern(s), WITHOUT any redundancy. I tried to reuse some of the code found here, but none of it was good enough for me. So finally I created my own, which is probably not the most beautiful, but seems useful, so I'd like to share it with you:
use strict;

my @findwhat = ('x');
my $extraLines = 3;
my @cache = ('') x ($extraLines);
my @stack;
my $lncntr = 0;
my $hit = 0;
my $nextHitWatch = 0;
my $shift = 1;

open (IN, "<test1.log");
while (my $line = <IN>) {
    $lncntr++;
    chomp $line;
    foreach my $what (@findwhat) { if ($line =~ m/$what/i) { $hit = 1; last } }
    if ($hit && !$nextHitWatch) {
        @stack = @cache;
        $hit = 0;
        $nextHitWatch++;
    }
    if (!$hit && $nextHitWatch && $nextHitWatch < ($extraLines * 2) + 2) {
        @stack = (@stack, $line);
        $nextHitWatch++;
    }
    if (!$hit && $nextHitWatch && $nextHitWatch == ($extraLines * 2) + 2) {
        @stack = (@stack, $line);
        for (my $i = 0; $i <= ($#stack - ($extraLines + $shift)); $i++) {
            print $stack[$i] . "\n" if $stack[$i];
        }
        $nextHitWatch = 0;
        $shift = 1;
        @stack = ();
    }
    if ($nextHitWatch >= 1 && eof) {
        if (!$hit) {
            my $upValue = 3 + $#stack - ($nextHitWatch - $extraLines + $shift);
            $upValue = ($upValue > $#stack) ? $#stack : $upValue;
            for (my $i = 0; $i <= $upValue; $i++) {
                print $stack[$i] . "\n";
            }
        } else {
            foreach (@stack) { print "$_\n" }
        }
    }
    shift(@cache);
    push(@cache, $line);
}
close (IN);
Probably you will only have to change the values of the list @findwhat and the scalar $extraLines. I hope my code will be usable. (Sorry for my poor English)
multiline the regex, eg: /((?:.*\n){3})foo\n((?:.*\n){3})/m
edit
/((?:.*\n)*)foo\n((?:.*\n)*)/m works in the general case
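A sketch of that idea with explicit line-matching groups (the two-line context width and the sample text are assumptions):

```perl
use strict;
use warnings;

# Capture the 2 lines before and the 2 lines after "target" in one match.
# Each (?:.*\n) matches exactly one line, since . does not cross newlines.
my $text = "line 1\nline 2\nline 3\ntarget\nline 5\nline 6\nline 7\n";
if ( $text =~ /((?:.*\n){2})target\n((?:.*\n){2})/ ) {
    print $1, $2;    # line 2, line 3, line 5, line 6
}
```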
One-liner version (where -l adds chomp and -n wraps the code in while(<>){ }; see perldoc perlrun for more options):
perl -lnE '$h{$.}=$_; END {
  for ( grep { $h{$_} eq "target" } sort { $a <=> $b } keys %h ) {
    say for @h{ $_-2 .. $_-1, $_+1 .. $_+2 } } }' data.txt
Script with explanation:
#!perl
use feature 'say';
while (<DATA>) {
    chomp;
    $hash{$.} = $_; # hash entry with line number as key; line contents as value
}
# find the target in the hash and sort its keys (line numbers) into an array
@matches = sort { $a <=> $b } grep { $hash{$_} eq 'target' } keys %hash;
for (@matches) {
    say "before\n";
    say for @hash{ $_-2 .. $_-1 };   # print the context lines as a hash slice
    say ">>>>\" $hash{$_} \"<<<< ";
    say "after\n";
    say for @hash{ $_+1 .. $_+2 };
    say "";
}
__DATA__
line 1
line 2
line 3
target
line 5
line 6
line 7
target
line of context1
line of context2
target
Output:
before
line 2
line 3
>>>>" target "<<<<
after
line 5
line 6
before
line 6
line 7
>>>>" target "<<<<
after
line of context1
line of context2
before
line of context1
line of context2
>>>>" target "<<<<
after
A simpler version using only arrays and with output that excludes the target as the OP question requested:
#!perl -l
chomp( my @lines = <DATA> );
my $n = 2; # context range before/after
my @indexes = grep { $lines[$_] =~ m/target/ } 0 .. $#lines;
foreach my $i (@indexes) {
    print for @lines[ $i-$n .. $i-1 ], @lines[ $i+1 .. $i+$n ], "";
}
__DATA__
line 1
line 2
line 3
target
line 5
line 6
line 7
target
line of context1
line of context2
target
This avoids constructing the hash but may be slower on very large files/arrays.
On CPAN, List::MoreUtils has indexes(), and there is always splice(), but I'm not sure these would make things simpler.
I would like to filter out particular columns with a regex and discard others. For example, if I had the following column names:
date
mem_total
cpu.usagemhz.average_0
cpu.usagemhz.average_1
cpu.usagemhz.average_2
I would like to capture only columns that begin with "cpu.usagemhz.average".
Is there a particular function of Text::CSV that will help me do a quick check of the column names?
Thanks!
JD
UPDATE
I tried jimtut's answer and it is extremely close to what I am looking for. Thanks again, everyone!
Here is the code from jimtut with one small edit to the print statement at the bottom. I added printing $colCount just to see what was going on with the data.
use Text::CSV;
my $file = "foo.csv";
my $pattern = ".*In";
open(F, $file) or warn "Warning! Unable to open $file\n";
my $lineCount = 0;
my %desiredColumns;
while(<F>) {
    $lineCount++;
    my $csv = Text::CSV->new();
    my $status = $csv->parse($_); # should really check this!
    my @fields = $csv->fields();
    my $colCount = 0;
    if ($lineCount == 1) {
        # Let's look at the column headings.
        foreach my $field (@fields) {
            $colCount++;
            if ($field =~ m/$pattern/) {
                # This heading matches, save the column #.
                $desiredColumns{$colCount} = 1;
            }
        }
    }
    else {
        # Not the header row. Parse the body of the file.
        foreach my $field (@fields) {
            $colCount++;
            if (exists $desiredColumns{$colCount}) {
                # This is one of the desired columns.
                # Do whatever you want to do with this column!
                print "$colCount\t$field\n";
            }
        }
    }
}
close(F);
Here are the results:
colCount | $field
12 565
13 73
14 36
15 32
16 127
17 40
18 32
19 42
20 171
12 464
13 62
14 32
15 24
16 109
17 21
18 19
19 39
20 150
12 515
13 76
14 28
15 30
16 119
17 15
18 25
19 46
20 169
12 500
13 71
14 30
15 28
16 111
17 20
18 18
19 40
20 167
I would like to add this data to individual arrays or hashes. What do you think? Something like...
foreach column {
    check to see if a hash already exists with that column number. If not, then create one.
}
Then go through each field and add the field data to the appropriate hash.
Do you think this is the right way to go about solving this?
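That plan amounts to a hash of arrays keyed by column number, and Perl's autovivification means the "create if missing" check isn't even needed. A rough sketch with made-up row data:

```perl
use strict;
use warnings;

# Accumulate each column's fields in a hash of arrays.
# push autovivifies $columns{$colCount} on first use, so no
# existence check is needed.
my %columns;
my @rows = ( [565, 73], [464, 62], [515, 76] );
for my $row (@rows) {
    my $colCount = 0;
    for my $field (@$row) {
        $colCount++;
        push @{ $columns{$colCount} }, $field;
    }
}
print "Column $_: @{ $columns{$_} }\n" for sort { $a <=> $b } keys %columns;
```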
No, not a specific function in Text::CSV. I would do something like this:
use Text::CSV;
my $file = "foo.csv";
my $pattern = "cpu.usagemhz.average.*";
open(F, $file) or die "Unable to open $file: $!\n";
my $lineCount = 0;
my %desiredColumns;
my %columnContents;
while(<F>) {
    $lineCount++;
    my $csv = Text::CSV->new();
    my $status = $csv->parse($_); # should really check this!
    my @fields = $csv->fields();
    my $colCount = 0;
    if ($lineCount == 1) {
        # Let's look at the column headings.
        foreach my $field (@fields) {
            $colCount++;
            if ($field =~ m/$pattern/) {
                # This heading matches, save the column #.
                $desiredColumns{$colCount} = 1;
            }
        }
    }
    else {
        # Not the header row. Parse the body of the file.
        foreach my $field (@fields) {
            $colCount++;
            if (exists $desiredColumns{$colCount}) {
                # This is one of the desired columns.
                # Do whatever you want to do with this column!
                push(@{$columnContents{$colCount}}, $field);
            }
        }
    }
}
close(F);
foreach my $key (sort keys %columnContents) {
    print "Column $key: " . join(",", @{$columnContents{$key}}) . "\n\n";
}
Hope that helps! I'm sure someone can write that in a Perl one-liner, but that's easier (for me) to read...
Since your fields of interest are at indexes 2-4, we'll just pluck those out of the field array returned by getline(). This sample code prints them, but you can do whatever you like with them.
use Text::CSV;                               # load the module
my $csv = Text::CSV->new ();                 # instantiate
open my $fh, "<", "somefile";                # open the input
while ( my $fields = $csv->getline($fh) ) {  # read a line, and parse it into fields
    print "I got @{$fields}[2..4]\n";        # print the fields of interest
}
close ($fh);                                 # close when done
WHY are you trying to do this? Is it to minimize storage? Eliminate processing costs for parsing many un-needed columns?
If the latter, you can't avoid that processing cost. Any solution you come up with would STILL read and parse 100% of the file.
If the former, there are many methods, some are more efficient than the others.
Also, what exactly do you mean by "help me do a quick check of the column names"? If you want to get the column names, there's the column_names() method, provided you previously set them using column_names($csv->getline($fh)).
If you want to return only specific columns in a hash, to avoid wasting memory on un-needed columns, there's no clear-cut API for that. You can roll your own, or abuse a "bug/feature" of the getline_hr() method:
For the former (roll your own), you can do something like:
my $headers = $csv->getline( $fh ); # First line is headers.
my @headers_keep = map { /^cpu\.usagemhz\.average/ ? 1 : 0 } @$headers;
while ( my $row = $csv->getline( $fh ) ) {
    my $i = 0;
    my @row_new = grep { $headers_keep[$i++] } @$row;
    push @rows, \@row_new;
}
For the latter, you can abuse a "feature" of getline_hr(), which doesn't assign values into the hash if the column name is a duplicate (only the LAST occurrence gets assigned).
In your case, for the column names date, mem_total, cpu.usagemhz.average_0, cpu.usagemhz.average_1, cpu.usagemhz.average_2, merely set the column_names array to contain the value "cpu.usagemhz.average_0" in the first 2 elements of the array - those columns will then NOT be saved by getline_hr().
In general, you can go over the list of columns, find each consecutive range of "not needed" columns, and replace their names with the name of the first needed column following that range. The only sticking point is if the "un-needed" range is at the very end of the columns - replace those with "JUNK" or something.
I have an output file that is a two-dimensional array (generated by a script written to produce a 2D array), and I have to read the information under a particular column, say column 1. In other words, how do I read and print out the information, across all rows, under column 1?
Any suggestions?
__DATA__
1 2 3 4 5 6 7 8 9
A B C D E F G H I
93 48 57 66 52 74 33 22 91
From the above data I want to extract information column-wise; say if I want information from column 1, I should be able to list only the following output.
OUTPUT:
1
A
93
Final version after all corrections:
#!/usr/bin/perl
use strict;
use warnings;
my $column_to_show = 0;
while ( <DATA> ) {
last unless /\S/;
print +(split)[$column_to_show], "\n";
}
__DATA__
1 2 3 4 5 6 7 8 9
A B C D E F G H I
93 48 57 66 52 74 33 22 91
Output:
C:\Temp> u
1
A
93
Explanation of print +(split)[$column_to_show], "\n";:
perldoc -f split:
Splits the string EXPR into a list of strings and returns that list.
...
If EXPR is omitted, splits the $_ string. If PATTERN is also omitted,
splits on whitespace (after skipping any leading whitespace).
So: (split)[3] selects the fourth element of the list returned by split. The + in front of (split) is necessary to help perl parse the expression correctly. See perldoc -f print:
Also be careful not to follow the
print keyword with a left parenthesis
unless you want the corresponding
right parenthesis to terminate the
arguments to the print — interpose a +
or put parentheses around all the
arguments.
I thoroughly recommend that every Perl programmer occasionally skim through all of the documentation: perldoc perltoc. It is on your computer.
my $line;
foreach $line (<DATA>) {
    my @DATA1 = split( / +/, $line );
    print "first element of array is $DATA1[0]\n";
}
__DATA__
1 2 3 4 5 6 7 8 9
A B C D E F G H I
93 48 57 66 52 74 33 22 91
OUTPUT:-
1
A
93
Try playing with this code. Basically, I load the data into an array of arrays; each line is a reference to a row.
#!/usr/bin/perl
use strict;
use warnings;
my $TwoDimArray;
while (my $line = <DATA>) {
    push @$TwoDimArray, [ split(/,/, $line) ];
}
for my $column (0..2) {
print "[$column,0] : " . $TwoDimArray->[0]->[$column] ."\n";
print "[$column,1] : " . $TwoDimArray->[1]->[$column] ."\n";
print "\n";
}
__DATA__
1,2,3,04,05,06
7,8,9,10,11,12
The map function is your friend:
open FILE, "data.txt";
while ($line = <FILE>) {
    chomp($line);
    push @data, [ split /[, ]+/, $line ];
}
close FILE;
@column1 = map { $$_[0] } @data;
print "@column1\n";
And in data.txt something like:
1, 2, 3, 4
5, 6, 7, 8
9, 10, 11, 12
13, 14, 15, 16
perl -lne '@F = split /\s+/ and print $F[1]'
This might be what you want:
use English qw<$OS_ERROR>; # Or just use $!
use IO::Handle;
my @columns;
open my $fh, '<', 'columns.dat' or die "I'm dead. $OS_ERROR";
while ( my $line = <$fh> ) {
    my @cols = split /\s+/, $line;
    $columns[$_][ $fh->input_line_number() - 1 ] = $cols[$_] foreach 0 .. $#cols;
}
$fh->close();
You can access them directly by element.
$arrays[0][0] = 1;
$arrays[0][1] = 2;
$arrays[1][0] = 3;
$arrays[1][1] = 4;
for (my $i = 0; $i <= $#{$arrays[1]}; $i++) {
print "row for $i\n";
print "\tfrom first array: " . $arrays[0][$i] . "\n";
print "\tfrom second array: " . $arrays[1][$i] . "\n";
}
prints
row for 0
from first array: 1
from second array: 3
row for 1
from first array: 2
from second array: 4