Awk's output in Perl doesn't seem to be working properly

I'm writing a simple Perl script which is meant to output the second column of an external text file (columns one and two are separated by a comma).
I'm using AWK because I'm familiar with it.
This is my script:
use v5.10;
use File::Copy;
use POSIX;
$s = `awk -F ',' '\$1==500 {print \$2}' STD`;
say $s;
The contents of the local file "STD" are:
CIR,BS
60,90
70,100
80,120
90,130
100,175
150,120
200,260
300,500
400,600
500,850
600,900
My output is very strange: it prints out the desired "850", but it also prints a trailing newline!
ka#man01:$ ./test.pl
850
ka#man01:$
The problem isn't just the printing. I need to use the variable generated by awk (i.e. the $s variable), but the variable is also being stored with a trailing newline!
Could you guys help?
Thank you.

I'd suggest that you're going down a dirty road by trying to inline awk into perl in the first place. Why not instead:
open ( my $input, '<', 'STD' ) or die $!;
while ( <$input> ) {
    s/\s+\z//;
    my @fields = split /,/;
    print $fields[1], "\n" if $fields[0] == 500;
}
But the likely problem is that you're not handling linefeeds, and say is adding an extra one. Try using print instead, or chomp on the resultant string.

perl can do many of the things that awk can do. Here's something similar that replaces your entire Perl program:
$ perl -naF, -le 'chomp; print $F[1] if $F[0]==500' STD
850
The -n creates a while loop around your argument to -e.
The -a splits up each line into @F, and -F lets you specify the separator. Since you want to separate the fields on a comma, you use -F,.
The -l adds a newline each time you call print.
The -e argument is the program to run (wrapped in the while loop from -n). The chomp removes the newline from the input line; without it you would get an extra newline in your output, because you happen to use the last field on the line. The -l adds a newline each time you print; that matters when you want to extract a field from the middle of the line.

The reason you get 2 newlines:
the backtick operator does not remove the trailing newline from the awk output, so $s contains "850\n";
the say function appends another newline, so you effectively have say "850\n", which is the same as print "850\n\n".

Related

perl: print to console all the matched pattern

I have multiple lines:
QQQQl123
hsdhjhksd
QQQQl234
ajkdkjsdh
QQQQl564
I want to print all matches of QQQQl[0-9]+, like:
QQQQl123
QQQQl234
QQQQl564
How can I do this using Perl?
I tried:
$ perl -0777pe '/QQQQl[0-9]+/' filename
it shows nothing
perl -we 'while(<>){ next unless $_=~/QQQQl[0-9]+/; print $_; }' < filename
perl -ne 'print if /QQQQl[0-9]+/' filename
Or, if, for some reason, you insist on using -0777, you could do
perl -0777nE 'say for /QQQQl[0-9]+/g' filename
(or print "$_\n" instead of say)
Your code doesn't work because /QQQQl[0-9]+/ returns true because $_ indeed contains that pattern, but you never asked Perl to do anything based on that return value.
-n is preferable to -p in that case, since you don't want to print every line but only some (-p automatically prints every line, and there is very little you can do about it).

Extract everything between first and last occurrence of the same pattern in a single iteration

This question is very much the same as this except that I am looking to do this as fast as possible, doing only a single pass of the (unfortunately gzip compressed) file.
Given the pattern CAPTURE and input
1:.........
...........
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
...........
1000:......
Print:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
Can this be accomplished with a regular expression?
I vaguely remember that this kind of grammar cannot be captured by a regular expression, but I'm not quite sure, since regular expressions these days provide look-aheads, etc.
You can buffer the lines until you see a line that contains CAPTURE, treating the first occurrence of the pattern specially.
#!/usr/bin/env perl
use warnings;
use strict;
my $first = 1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    if ( $line =~ /CAPTURE/ ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
}
Feed the input into this program via zcat file.gz | perl script.pl.
Which can of course be jammed into a one-liner, if need be...
zcat file.gz | perl -ne '$x&&push@b,$_;if(/CAPTURE/){$x||=@b=$_;print@b;@b=()}'
Can this be accomplished with a regular expression?
You mean in a single pass, in a single regex? If you don't mind reading the entire file into memory, sure... but this is obviously not a good idea for large files.
zcat file.gz | perl -0777ne '/((^.*CAPTURE.*$)(?s:.*)(?2)(?:\z|\n))/m and print $1'
I would write
gunzip -c file.gz | sed -n '/CAPTURE/,$p' | tac | sed -n '/CAPTURE/,$p' | tac
Find the first CAPTURE and look back for the last one.
echo "/CAPTURE/,?CAPTURE? p" | ed -s <(gunzip -c inputfile.gz)
EDIT: Answer to comment and second (better?) solution.
When your input doesn't end with a newline, ed will complain, as shown by these tests.
# With newline
printf "1,$ p\n" | ed -s <(printf "%s\n" test)
# Without newline
printf "1,$ p\n" | ed -s <(printf "%s" test)
# message removed
printf "1,$ p\n" | ed -s <(printf "%s" test) 2> /dev/null
I do not know what memory complications this will cause for a large file; you would prefer a streaming solution.
You can use sed for the next approach.
Keep reading lines until you find the first match. During this time only remember the last line read (by putting it in a Hold area).
Now change your tactics.
Append each line to the Hold area. You do not know when to flush until the next match.
When you have the next match, recall the Hold area and print this.
I needed some tweaking to prevent the second match from being printed twice. I solved this by reading the next line and replacing the Hold area with that line.
The total solution is
gunzip -c inputfile.gz | sed -n '1,/CAPTURE/{h;n};H;/CAPTURE/{x;p;n;h};'
When you don't like the sed holding space, you can implement the same approach with awk:
gunzip -c inputfile.gz |
awk '/CAPTURE/{capt=1} capt==1{a[i++]=$0} /CAPTURE/{for(j=0;j<i;j++) print a[j]; i=0}'
I don't think regex will be faster than double scan...
Here is an awk solution (double scan)
$ awk '/pattern/ && NR==FNR {a[++f]=NR; next} a[1]<=FNR && FNR<=a[f]' file{,}
Alternatively if you have any a priori information on where the patterns appear on the file you can have heuristic approaches which will be faster on those special cases.
Here is one more example with regex (the cons is that if files are large, it will consume a large memory)
#!/usr/bin/perl
{
local $/ = undef;
open FILE, $ARGV[0] or die "Couldn't open file: $!";
binmode FILE;
$string = <FILE>;
close FILE;
}
print $1 if $string =~ /([^\n]+(CAPTURE).*\2.*?)\n/s;
Or with one liner:
cat file.tmp | perl -ne '$/=undef; print $1 if <STDIN> =~ /([^\n]+(CAPTURE).*\2.*?)\n/s'
result:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
This might work for you (GNU sed):
sed '/CAPTURE/!d;:a;n;:b;//ba;$d;N;bb' file
Delete all lines until the first containing the required string. Print the line containing the required string. Replace the pattern space with the next line. If this line contains the required string, repeat the previous two sentences. If it is the last line of the file, delete the pattern space. Otherwise, append the next line and repeat the previous three sentences.
Having studied the test files used for haukex's benchmark, it would seem that sed is not the tool to extract this file. Using a mixture of csplit, grep and sed presents a reasonably fast solution as follows:
lines=$(grep -nTA1 --no-group-separator CAPTURE oldFile |
sed '1s/\t.*//;1h;$!d;s/\t.*//;H;x;s/\n/ /')
csplit -s oldFile $lines && rm xx0{0,2} && mv xx01 newFile
Split the original file into three files: a file preceding the first occurrence of CAPTURE, a file from the first CAPTURE to the last CAPTURE, and a file containing the remainder. The first and third files are discarded and the second file renamed.
csplit can use line numbers to split the original file. grep is extremely fast at filtering patterns and can return the line numbers of all patterns that match CAPTURE and the following context line. sed can manipulate the results of grep into two line numbers which are supplied to the csplit command.
When run against the test files (as above) I get timings around 10 seconds.
While posting this question, the problem I had at hand was that I had several huge gzip compressed log files generated by a java application.
The log lines were of the following format:
[Timestamp] (AppName) {EventId} [INFO]: Log text...
[Timestamp] (AppName) {EventId} [EXCEPTION]: Log text...
at com.application.class(Class.java:154)
caused by......
[Timestamp] (AppName) {EventId} [LogLevel]: Log text...
Given an EventId, I needed to extract all the lines corresponding to that event from these files. A trivial grep for the EventId could not solve the problem, because the exception lines can be arbitrarily long and do not contain the EventId.
Unfortunately I forgot to consider the edge case where the last log line for an EventId could be the exception and the answers posted here would not print the stacktrace lines. However it wasn't hard to modify haukex's solution to cover these cases as well:
#!/usr/bin/env perl
use warnings;
use strict;
my $first = 1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    if ( $line =~ /EventId/ or ($first == 0 and $line !~ /\(AppName\)/) ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
    else {
        $first = 1;
    }
}
I am still wondering whether the faster solutions (mainly walter's sed solution or haukex's in-memory Perl solution) could be modified to do the same.

How to remove a specific word from a file in perl

A file contains:
rhost=localhost
ruserid=abcdefg_xxx
ldir=
lfile=
rdir=p01
rfile=
pgp=none
mainframe=no
ftpmode=binary
ftpcmd1=
ftpcmd2=
ftpcmd3=
ftpcmd1a=
ftpcmd2a=
notifycc=no
firstfwd=Yes
NOTIFYNYL=
decompress=no
compress=no
I want to write a simple script that removes the "_xxx" in that second line. Keep in mind that there will never be a file that contains the string "_xxx" more than once, so that should make it much easier. I'm just not too familiar with the syntax. Thanks!
The short answer:
Here's how you can remove just the literal '_xxx'.
perl -pli.bak -e 's/_xxx$//' filename
The detailed explanation:
Since Perl has a reputation for code that is indistinguishable from line noise, here's an explanation of the steps.
-p creates an implicit loop that looks something like this:
while( <> ) {
# Your code goes here.
}
continue {
print or die;
}
-l sort of acts like "auto-chomp", but also places the line ending back on the line before printing it again. It's more complicated than that, but in its simplest use, it changes your implicit loop to look like this:
while( <> ) {
chomp;
# Your code goes here.
}
continue {
print $_, $/;
}
-i tells Perl to "edit in place." Behind the scenes it creates a separate output file and at the end it moves that temporary file to replace the original.
.bak tells Perl that it should create a backup named 'originalfile.bak' so that if you make a mistake it can be reversed easily enough.
Inside the substitution:
s/
_xxx$ # Match (but don't capture) the final '_xxx' in the string.
//x;  # Replace the entire match with nothing.
The reference material:
For future reference, information on the command line switches used in Perl "one-liners" can be obtained in Perl's documentation at perlrun. A quick introduction to Perl's regular expressions can be found at perlrequick. And a quick overview of Perl's syntax is found at perlintro.
This overwrites the original file, getting rid of _xxx in the 2nd line:
use warnings;
use strict;
use Tie::File;
my $filename = shift;
tie my @lines, 'Tie::File', $filename or die $!;
$lines[1] =~ s/_xxx//;
untie @lines;
Maybe this can help
perl -ple 's/_.*// if /^ruserid/' < file
This will remove anything after the first '_' (inclusive) on the line that starts with "ruserid".
One way using perl. In second line ($. == 2), delete from last _ until end of line:
perl -lpe 's/_[^_]*\Z// if $. == 2' infile

Perl substitute with regex

When I run this command through a Perl one-liner, it picks up the regular expression, so that can't be bad.
more tagcommands | perl -nle 'print /(\d{8}_\d{9})/' | sort
12012011_000005769
12012011_000005772
12162011_000005792
12162011_000005792
But when I run the script below over the same input, it does not pick up the regex.
#!/usr/bin/perl
use strict;
my $switch="12012011_000005777";
open (FILE, "more /home/shortcasper/work/tagcommands|");
my @array_old = (<FILE>);
my @array_new = @array_old;
foreach my $line (@array_new) {
    $line =~ s/\d{8}_\d{9}/$switch/g;
    print $line;
    sleep 1;
}
This is the data that I am feeding into the script
/CASPERBOT/START URL=simplefile:///data/tag/squirrels/squirrels /12012011_000005777N.dart.gz CASPER=SeqRashMessage
/CASPERBOT/ADDSERVER simplefile:///data/tag/squirrels/12012011_0000057770.dart.trans.gz
/CASPERRIP/newApp multistitch CASPER_BIN
/CASPER_BIN/START URLS=simplefile:///data/tag/squirrels /12012011_000005777R.rash.gz?exitOnEOF=false;binaryfile:///data/tag/squirrels/12162011_000005792D.binaryBlob.gz?exitOnEOF=false;simplefile:///data/tag/squirrels/12012011_000005777E.bean.trans.gz?exitOnEOF=false EXTRACTORS=rash;island;rash BINARY=T
You should study your one-liner to see how it works. First check perl -h to learn about the switches used:
-l[octal] enable line ending processing, specifies line terminator
-n assume "while (<>) { ... }" loop around program
The first one is not exactly self-explanatory, but what -l actually does is chomp each line, and then change $\ and $/ to newline. So, your one-liner:
perl -nle 'print /(\d{8}_\d{9})/'
Actually does this:
$\ = "\n";
while (<>) {
chomp;
print /(\d{8}_\d{9})/;
}
A very easy way to see this is to use the Deparse command:
$ perl -MO=Deparse -nle 'print /(\d{8}_\d{9})/'
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
print /(\d{8}_\d{9})/;
}
-e syntax OK
So, that's how you transform that into a working script.
I have no idea how you went from that to this:
use strict;
my $switch="12012011_000005777";
open (FILE, "more /home/shortcasper/work/tagcommands|");
my @array_old = (<FILE>);
my @array_new = @array_old;
foreach my $line (@array_new) {
    $line =~ s/\d{8}_\d{9}/$switch/g;
    print $line;
    sleep 1;
}
First of all, why are you opening a pipe from the more command to read a text file? That is like calling a tow truck to fetch you a cab. Just open the file. Or better yet, don't. Just use the diamond operator, like you did the first time.
You don't need to first copy the lines of a file to an array, and then use the array. while(<FILE>) is a simple way to do it.
In your one-liner, you print the regex. Well, you print the return value of the regex. In this script, you print $line. I'm not sure how you thought that would do the same thing.
Your regex here will replace every matching set of numbers with the one in your script. Nothing else.
You may also be aware that sleep 1 will not do what you think. Try this one-liner, for example:
perl -we 'for (1 .. 10) { print "line $_\n"; sleep 1; }'
As you will notice, it will simply wait 10 seconds and then print everything at once. That's because Perl's standard output is buffered by default, and the buffer is not written out until it fills up or is flushed (at the latest when the program exits). So it's a perception problem: everything works like it should, you just don't see it yet.
If you absolutely want to have a sleep statement in your script, you'll probably want to autoflush, e.g. STDOUT->autoflush(1);
However, why are you doing that? Is it so you will have time to read the numbers? If so, put that more statement at the end of your one-liner instead:
perl ...... | more
That will pipe the output into the more command, so you can read it at your own pace. Now, for your one-liner:
Always also use -w, unless you specifically want to avoid getting warnings (which basically you never should).
Your one-liner will only print the first match. If you want to print all the matches on a new line:
perl -wnle 'print for /(\d{8}_\d{9})/g'
If you want to print all the matches, but keep the ones from the same line on the same line:
perl -wnle 'print "@a" if @a = /(\d{8}_\d{9})/g'
Well, that should cover it.
Your open call may be failing (you should always check the result of an open to make sure it succeeded if the rest of the program depends on it) but I believe your problem is in complicating things by opening a pipe from a more command instead of simply opening the file itself. Change the open to simply
open FILE, "/home/shortcasper/work/tagcommands" or die $!;
and things should improve.

Perl program to replace tabs with spaces

I'd like to write a Perl one-liner that replaces all tabs '\t' in a batch of text files in the current directory with spaces, with no effect on the visible spacing.
Can anyone show me how to do this?
This is in the Perl FAQ:
1 while $string =~ s/\t+/' ' x (length($&) * 8 - length($`) % 8)/e;
Or you can just use the Text::Tabs module (part of the standard Perl distribution).
use Text::Tabs;
@expanded_lines = expand(@lines_with_tabs);
You don't need a Perl one-liner for this, you could use expand instead:
The expand utility shall write files or the standard input to the standard output with characters replaced with one or more characters needed to pad to the next tab stop.
The expand utility will even take care of managing tab stops for you, and that seems to be part of your "with no effect on the visible spacing" requirement; a Perl one-liner probably wouldn't (but I bet someone here could provide a one-liner that would).
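For instance, with the default 8-column tab stops (expand is specified by POSIX, so it should be available on any Unix-like system):

```shell
printf 'a\tbb\tccc\n' | expand
```

This prints a at column 0, bb at column 8 and ccc at column 16.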
Use Text::Tabs. The following is adapted only very slightly from the documentation:
perl -MText::Tabs -n -i.orig -e 'print expand $_' *
perl -p -i -e 's/\t/ /g' file.txt would be one way to do this
$ perl -wp -i.backup -e 's/\t/ /g' *
You can use s/// to achieve this. Perhaps you have a line of text stored in $line:
$line =~ s/\t/    /g;
This will replace each tab (\t) with four spaces. It just depends on how many spaces one tab represents in your file.
Here's something that should do it pretty quickly for you; edit it how you will.
open(FH, '<', 'tabz.txt') or die $!;
my @new;
foreach my $line (<FH>) {
    $line =~ s/\t/ /g; # Add your spaces here!
    push(@new, $line);
}
close(FH);
open(FH, '>', 'new.txt') or die $!;
print FH $_ foreach (@new);
close(FH);