How to print without duplicates with perl?

My assignment is a little more in depth than the title but in the title is my main question. Here is the assignment:
Write a perl script that will grep for all occurrences of the regular expression in all regular files in the file/directory list as well as all regular files under the directories in the file/directory list. *If a file is not a TEXT file then the file should first be operated on by the unix command strings (no switches) and the resulting lines searched.* If the -l switch is given only the file name of the files containing the regular expression should be printed, one per line. A file name should occur a maximum of one time in this case. If the -l switch is not given then all matching lines should be printed, each preceded on the same line by the file name and a colon. An example invocation from the command line:
plgrep 'ba+d' file1 dir1 dir2 file2 file3 dir3
Here is my code:
#!/usr/bin/perl -w
use Getopt::Long;
my $fname = 0;
GetOptions ('l' => \$fname);
$pat = shift @ARGV;
while (<>) {
    if (/$pat/) {
        $fname ? print "$ARGV\n" : print "$ARGV:$_";
    }
}
So far that code does everything it's supposed to, except that it doesn't handle non-text files and it prints duplicate file names when using the -l switch. Here is an example of my output after entering the following on the command line: plgrep 'ba+d' file1 file2
file1:My dog is bad.
file1:My dog is very baaaaaad.
file2:I am bad at the guitar.
file2:Even though I am bad at the guitar, it is still fun to play!
Which is PERFECT!
But when I use the -l switch to print out only the file names this is what I get after entering the following on the command line: plgrep -l 'ba+d' file1 file2
file1
file1
file2
file2
How do I get rid of those duplicates so it only prints:
file1
file2
I have tried:
$pat = shift @ARGV;
while (<>) {
    if (/$pat/) {
        $seen{$ARGV}++;
        $fname ? print "$ARGV\n" unless ($seen{$ARGV} > 1); : print "$ARGV:$_";
    }
}
But when I try to run it without the -l switch I only get:
file1:My dog is bad.
file2:I am bad at the guitar.
I also tried:
$fname ? print "$ARGV\n" unless ($ARGV > 1) : print "$ARGV:$_";
But I keep getting syntax error at plgrep line 17, near ""$ARGV\n" unless"
If someone could help me out with my duplicates issue as well as the italicized part of the assignment I would truly appreciate it. I don't even know where to start on that italicized part.

If you're printing only file names, you can exit the loop (using the last command) after the first match, since you already know the file matches. By not scanning the rest of the file, this will also prevent the name from being printed repeatedly.
Edited to add: In order to do it this way, you'll also need to switch from using <> to read the files, instead getting the names from @ARGV and opening each file normally.
If you want to continue using <>, you'll instead need to watch $ARGV to see when it changes (indicating that you've started reading a new file) and keep a flag indicating whether the current file has produced any matches yet. However, this approach requires you to read every file in its entirety, which is less efficient than reading only enough of each file to know whether it contains at least one match (i.e., skipping to the next file after the first match), so I would recommend switching to open instead.
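To make that concrete, here is a minimal sketch of the open-per-file approach for the -l case, assuming the $pat variable from the script above (the error handling is illustrative, not part of the original answer):
for my $file (@ARGV) {
    open my $fh, '<', $file or do { warn "Can't open $file: $!"; next };
    while (my $line = <$fh>) {
        if ($line =~ /$pat/) {
            print "$file\n";
            last;    # first match is enough when only file names are wanted
        }
    }
    close $fh;
}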

The first syntax problem is simply an extra semicolon.
The second is that you may only use if/unless as a statement modifier at the end of a statement - you can't embed it in the middle of a conditional that way.
$fname ? print "$ARGV\n" unless ($seen{$ARGV} > 1); : print "$ARGV:$_";
Becomes:
next if $seen{$ARGV} > 1;
print $fname ? "$ARGV\n" : "$ARGV:$_";
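Assembled back into the original loop, a sketch of the corrected version might look like this (my adjustment: the duplicate check applies only when -l is given, so plain invocations still print every matching line; $pat and $fname come from the asker's script):
my %seen;
while (<>) {
    if (/$pat/) {
        $seen{$ARGV}++;
        next if $fname and $seen{$ARGV} > 1;   # suppress duplicate names under -l
        print $fname ? "$ARGV\n" : "$ARGV:$_";
    }
}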

Related

How to split a file (with sed) into numerous files according to a value found on each line?

I have several Company_***.csv files (although the separator's a tab, not a comma; hence they should be *.tsv, but never mind) which contain a header plus numerous data lines, e.g.
1stHeader 2ndHeader DateHeader OtherHeaders...
111111111 SOME STRING 2020-08-01 OTHER STRINGS..
222222222 ANOT STRING 2020-08-02 OTHER STRINGS..
I have to split them according to the 3rd column (a date).
Each file should be named like e.g. Company_2020_08_01.csv, Company_2020_08_02.csv and so on,
each containing the same header on the 1st line plus the matching rows as the following lines.
At first I thought about saving (once) the header in a single file e.g.
sed -n '1w Company_header.csv' Company_*.csv
then parsing the files with a pattern for the date (hence the headers would be skipped) e.g.
sed -n '/\t2020-[01][0-9]-[0-3][0-9]\t/w somefilename.csv' Company_*.csv
... and at last, insert the (missing) header in each generated file.
But I'm stuck at step 2: I can't find how to generate (dynamically) the filename expected by the w command, nor how to capture the date in the search pattern (because apparently this is just an address, not a search-replace field as in the s/regexp/replacement/[flags] command, so you can't have capturing groups ( ) in there).
So I wonder if this is actually doable with sed? Or should I look upon other tools e.g. awk?
Disclaimer: I'm quite a n00b with these commands so I'm just learning/starting from scratch...
Perl to the rescue!
perl -e 'while (<>) {
    $h = $_, next if $. == 1;
    $. = 0 if eof;
    @c = split /\t/;
    open my $out, ">>", "Company_" . $c[2] =~ tr/-/_/r . ".csv" or die $!;
    print {$out} $h unless tell $out;
    print {$out} $_;
}' -- Company_*.csv
The diamond operator <> in scalar context reads a line from the input.
The first line of each file is stored in the variable $h, see $. and eof
split populates the @c array with the column values for each line
$c[2] contains the date, using tr we translate dashes to underscores to create a filename from it. open opens the file for appending.
print prints the header if the file is empty (see tell)
and prints the current line, too.
Note that it only appends to the files, so don't forget to delete any output files before running the script again.
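If reopening the output file on every line bothers you, one possible variant (a sketch, not from the original answer) caches one filehandle per date in a %fh hash; because each output file is then opened exactly once per run with '>' (truncate), the delete-before-rerunning caveat also goes away:
perl -e 'while (<>) {
    $h = $_, next if $. == 1;
    $. = 0 if eof;
    @c = split /\t/;
    my $name = "Company_" . $c[2] =~ tr/-/_/r . ".csv";
    if (!$fh{$name}) {
        open $fh{$name}, ">", $name or die $!;   # first time we see this date
        print {$fh{$name}} $h;                   # so write the header once
    }
    print {$fh{$name}} $_;
}' -- Company_*.csv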

Extract everything between first and last occurrence of the same pattern in a single iteration

This question is very much the same as this except that I am looking to do this as fast as possible, doing only a single pass of the (unfortunately gzip compressed) file.
Given the pattern CAPTURE and input
1:.........
...........
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
...........
1000:......
Print:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
Can this be accomplished with a regular expression?
I vaguely remember that this kind of grammar cannot be captured by a regular expression, but I'm not quite sure, as regular expressions these days provide lookaheads, etc.
You can buffer the lines until you see a line that contains CAPTURE, treating the first occurrence of the pattern specially.
#!/usr/bin/env perl
use warnings;
use strict;
my $first = 1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    if ( $line =~ /CAPTURE/ ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
}
Feed the input into this program via zcat file.gz | perl script.pl.
Which can of course be jammed into a one-liner, if need be...
zcat file.gz | perl -ne '$x&&push@b,$_;if(/CAPTURE/){$x||=@b=$_;print@b;@b=()}'
Can this be accomplished with a regular expression?
You mean in a single pass, in a single regex? If you don't mind reading the entire file into memory, sure... but this is obviously not a good idea for large files.
zcat file.gz | perl -0777ne '/((^.*CAPTURE.*$)(?s:.*)(?2)(?:\z|\n))/m and print $1'
I would write
gunzip -c file.gz | sed -n '/CAPTURE/,$p' | tac | sed -n '/CAPTURE/,$p' | tac
Find the first CAPTURE and look back for the last one.
echo "/CAPTURE/,?CAPTURE? p" | ed -s <(gunzip -c inputfile.gz)
EDIT: Answer to comment and second (better?) solution.
When your input doesn't end with a newline, ed will complain, as shown by these tests.
# With newline
printf "1,$ p\n" | ed -s <(printf "%s\n" test)
# Without newline
printf "1,$ p\n" | ed -s <(printf "%s" test)
# message removed
printf "1,$ p\n" | ed -s <(printf "%s" test) 2> /dev/null
I do not know the memory complications this will give for a large file, but you would prefer a streaming solution.
You can use sed for the next approach.
Keep reading lines until you find the first match. During this time only remember the last line read (by putting it in a Hold area).
Now change your tactics.
Append each line to the Hold area. You do not know when to flush until the next match.
When you have the next match, recall the Hold area and print this.
I needed some tweaking to prevent the second match from being printed twice. I solved this by reading the next line and replacing the Hold area with that line.
The total solution is
gunzip -c inputfile.gz | sed -n '1,/CAPTURE/{h;n};H;/CAPTURE/{x;p;n;h};'
If you don't like the sed holding space, you can implement the same approach with awk:
gunzip -c inputfile.gz |
awk '/CAPTURE/{capt=1} capt==1{a[i++]=$0} /CAPTURE/{for(j=0;j<i;j++) print a[j]; i=0}'
I don't think regex will be faster than double scan...
Here is an awk solution (double scan)
$ awk '/pattern/ && NR==FNR {a[++f]=NR; next} a[1]<=FNR && FNR<=a[f]' file{,}
Alternatively if you have any a priori information on where the patterns appear on the file you can have heuristic approaches which will be faster on those special cases.
Here is one more example with regex (the downside is that if files are large, it will consume a lot of memory):
#!/usr/bin/perl
{
    local $/ = undef;
    open FILE, $ARGV[0] or die "Couldn't open file: $!";
    binmode FILE;
    $string = <FILE>;
    close FILE;
}
print $1 if $string =~ /([^\n]+(CAPTURE).*\2.*?)\n/s;
Or with a one-liner:
cat file.tmp | perl -ne '$/=undef; print $1 if <STDIN> =~ /([^\n]+(CAPTURE).*\2.*?)\n/s'
result:
100:CAPTURE
...........
150:CAPTURE
...........
200:CAPTURE
This might work for you (GNU sed):
sed '/CAPTURE/!d;:a;n;:b;//ba;$d;N;bb' file
Delete all lines until the first containing the required string. Print the line containing the required string. Replace the pattern space with the next line. If this line contains the required string, repeat the last two previous sentences. If it is the last line of the file, delete the pattern space. Otherwise, append the next line and repeat the last three previous sentences.
Having studied the test files used for haukex's benchmark, it would seem that sed is not the tool to extract this file. Using a mixture of csplit, grep and sed presents a reasonably fast solution as follows:
lines=$(grep -nTA1 --no-group-separator CAPTURE oldFile |
sed '1s/\t.*//;1h;$!d;s/\t.*//;H;x;s/\n/ /')
csplit -s oldFile $lines && rm xx0{0,2} && mv xx01 newFile
Split the original file into three files: a file preceding the first occurrence of CAPTURE, a file from the first CAPTURE to the last CAPTURE, and a file containing the remainder. The first and third files are discarded and the second file renamed.
csplit can use line numbers to split the original file. grep is extremely fast at filtering patterns and can return the line numbers of all patterns that match CAPTURE and the following context line. sed can manipulate the results of grep into two line numbers which are supplied to the csplit command.
When run against the test files (as above) I get timings around 10 seconds.
While posting this question, the problem I had at hand was that I had several huge gzip compressed log files generated by a java application.
The log lines were of the following format:
[Timestamp] (AppName) {EventId} [INFO]: Log text...
[Timestamp] (AppName) {EventId} [EXCEPTION]: Log text...
at com.application.class(Class.java:154)
caused by......
[Timestamp] (AppName) {EventId} [LogLevel]: Log text...
Given an EventId, I needed to extract all the lines corresponding to the event from these files. The problem became unsolvable with a trivial grep for EventId just due to the fact that the exception lines could be of arbitrary length and do not contain the EventId.
Unfortunately I forgot to consider the edge case where the last log line for an EventId could be the exception and the answers posted here would not print the stacktrace lines. However it wasn't hard to modify haukex's solution to cover these cases as well:
#!/usr/bin/env perl
use warnings;
use strict;
my $first = 1;
my @buf;
while ( my $line = <> ) {
    push @buf, $line unless $first;
    if ( $line =~ /EventId/ or ($first == 0 and $line !~ /\(AppName\)/) ) {
        if ($first) {
            @buf = ($line);
            $first = 0;
        }
        print @buf;
        @buf = ();
    }
    else {
        $first = 1;
    }
}
I am still wondering if the faster solutions (mainly walter's sed solution or haukex's in-memory perl solution) could be modified to do the same.

Deleting a line from a huge file in Perl

I have a huge text file whose first five lines read as below:
This is first line
This is second line
This is third line
This is fourth line
This is fifth line
Now, I want to write something at a random position of the third line of that file which will replace the characters in that line with the new string I am writing. I am able to achieve that with the below code:
use strict;
use warnings;
my @pos = (0);
open my $fh, "+<", "text.txt" or die "Can't open text.txt: $!";
while (<$fh>) {
    push @pos, tell($fh);
}
seek $fh, $pos[2]+1, 0;
print $fh "HELLO";
close($fh);
However, I am not able to figure out with the same kind of approach how can I delete the entire third line from that file so that the texts reads below :
This is first line
This is second line
This is fourth line
This is fifth line
I do not want to read the entire file into an array, neither do I want to use Tie::File. Is it possible to achieve my requirement using seek and tell? A solution will be very helpful.
A file is a sequence of bytes. We can replace (overwrite) some of them, but how would we remove them? Once a file is written its bytes cannot be 'pulled out' of the sequence or 'blanked' in any way. (The ones at the end of the file can be dismissed, by truncating the file as needed.)
The rest of the content has to move 'up', so that what follows the text to be removed overwrites it. We have to rewrite the rest of the file. In practice it is often far simpler to rewrite the whole file.
As a very basic example
use warnings 'all';
use strict;
use File::Copy qw(move);
my $file_in  = '...';
my $file_out = '...';  # best use `File::Temp`
open my $fh_in,  '<', $file_in  or die "Can't open $file_in: $!";
open my $fh_out, '>', $file_out or die "Can't open $file_out: $!";
# Remove a line with $pattern
my $pattern = qr/this line goes/;
while (<$fh_in>)
{
    print $fh_out $_ unless /$pattern/;
}
close $fh_in;
close $fh_out;
# Rename the new file into the original one, thus replacing it
move ($file_out, $file_in) or die "Can't move $file_out to $file_in: $!";
This writes every line of the input file into the output file, unless a line matches the given pattern. Then that file is renamed, replacing the original (which does not involve copying data). See this topic in perlfaq5.
Since we really use a temporary file I'd recommend the core module File::Temp for that.
This may be made more efficient, but far more complicated, by opening in update '+<' mode so as to overwrite only a portion of the file. You iterate until the line with the pattern, record (tell) its position and the line length, then copy all remaining lines in memory. Then seek back to the position minus the length of that line, and dump the copied rest of the file, overwriting the line and all that follows it.
Note that now the data for the rest of the file is copied twice, albeit one copy is in memory. Going to this trouble may make sense if the line to be removed is far down a very large file. If there are more lines to remove this gets messier.
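A sketch of that update-mode variant, under the assumption that exactly one line matches ($file and $pattern are placeholders, not from the original answer):
my $file    = 'text.txt';              # placeholder
my $pattern = qr/This is third line/;  # placeholder
open my $fh, '+<', $file or die "Can't open $file: $!";
my ($pos, @rest);
while (<$fh>) {
    if (/$pattern/) {
        $pos  = tell($fh) - length($_);  # start of the line to remove
        @rest = <$fh>;                   # slurp everything after it into memory
        last;
    }
}
if (defined $pos) {
    seek $fh, $pos, 0;          # seek back to the start of the doomed line
    print $fh @rest;            # shift the remaining lines up over it
    truncate $fh, tell($fh);    # chop off the leftover bytes at the end
}
close $fh;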
Writing out a new file and moving it over the original changes the file's inode number. That may be a problem for some tools or procedures, and if it is you can instead update the original by either of the following:
Once the new file is written out, open it for reading and open the original for writing. This clobbers the original file. Then read from the new file and write to the original one, thus copying the content back to the same inode. Remove the new file when done.
Open the original file in read-write mode ('+<') to start with. Once the new file is written, seek to the beginning of the original (or to the place from which to overwrite) and write to it the content of the new file. Remember to also set the end-of-file if the new file is shorter,
truncate $fh, tell($fh);
after copying is done. This requires some care and the first way is probably generally safer.
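A hedged sketch of the first option (copy the new file's content back over the original so the inode is preserved; $file_in and $file_out are the variables from the basic example above):
open my $in,  '<', $file_out or die "Can't open $file_out: $!";
open my $out, '>', $file_in  or die "Can't open $file_in: $!";  # clobbers, but keeps the inode
print $out $_ while <$in>;   # copy the content back line by line
close $in;
close $out;
unlink $file_out or warn "Can't remove $file_out: $!";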
If the file weren't huge the new "file" can be "written" in memory, as an array or a string.
Use the sed command from the Linux command line in Perl:
my $return = `sed -i '3d' text.txt`;
Where "3d" means delete the 3rd row.
It is useful to look at perlrun and see how perl itself modifies a file 'in-place.'
Given:
$ cat text.txt
This is first line
This is second line
This is third line
This is fourth line
This is fifth line
You can apparently 'modify in-place', sed-like, by using the -i and -p switches to invoke Perl:
$ perl -i -pe 's/This is third line\s*//' text.txt
$ cat text.txt
This is first line
This is second line
This is fourth line
This is fifth line
But if you consult the Perl Cookbook recipe 7.9 (or look at perlrun) you will see that this:
$ perl -i -pe 's/This is third line\s*//' text.txt
is equivalent to:
while (<>) {
    if ($ARGV ne $oldargv) {        # are we at the next file?
        rename($ARGV, $ARGV . '.bak');
        open(ARGVOUT, ">$ARGV");    # plus error check
        select(ARGVOUT);
        $oldargv = $ARGV;
    }
    s/This is third line\s*//;
}
continue {
    print;
}
select (STDOUT);                    # restore default output
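Since the question was really about deleting the third line by position, a similar one-liner keyed on the line number rather than the line's content might look like this (a sketch for the single-file case):
$ perl -i -ne 'print unless $. == 3' text.txt
Like the s/// version, this rewrites the whole file behind the scenes; $. holds the current input line number.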

What's the use of <> in Perl?

What's the use of <> in Perl? How do I use it?
If we simply write
<>;
and
while(<>)
what is the program doing in both cases?
The answers above are all correct, but it might come across more plainly if you understand general UNIX command line usage. It is very common to want a command to work on multiple files. E.g.
ls -l *.c
The command line shell (bash et al) turns this into:
ls -l a.c b.c c.c ...
in other words, ls never sees '*.c' unless the pattern doesn't match. Try this at a command prompt (not perl):
echo *
you'll notice that you do not get an *.
So, if the shell is handing you a bunch of file names, and you'd like to go through each one's data in turn, perl's <> operator gives you a nice way of doing that...it puts the next line of the next file (or stdin if no files are named) into $_ (the default scalar).
Here is a poor man's grep:
while (<>) {
    print if m/pattern/;
}
Running this script:
./t.pl *
would print out all of the lines of all of the files that match the given pattern.
cat /etc/passwd | ./t.pl
would use cat to generate some lines of text that would then be checked for the pattern by the loop in perl.
So you see, while(<>) gets you a very standard UNIX command line behavior...process all of the files I give you, or process the thing I piped to you.
<>;
is a short way of writing
readline();
or if you add in the default argument,
readline(*ARGV);
readline is an operator that reads a line from the specified file handle. Reading from the special file handle ARGV will read from STDIN if @ARGV is empty or from the concatenation of the files named by @ARGV if it's not.
As for
while (<>)
It's a syntax error. If you had
while (<>) { ... }
it gets rewritten to
while (defined($_ = <>)) { ... }
And as previously explained, that means the same as
while (defined($_ = readline(*ARGV))) { ... }
That means it will read lines from ARGV (as previously explained) until there are no more lines to read.
It is called the diamond operator and feeds data either from stdin, if @ARGV is empty, or from each line of the files named in @ARGV. This webpage http://docstore.mik.ua/orelly/perl/learn/ch06_02.htm explains it very well.
In many cases of programming with syntactical sugar like this, the Deparse backend of the O module is helpful to find out what's happening:
$ perl -MO=Deparse -e 'while(<>){print 42}'
while (defined($_ = <ARGV>)) {
    print 42;
}
-e syntax OK
Quoting perldoc perlop:
The null filehandle <> is special: it can be used to emulate the
behavior of sed and awk, and any other Unix filter program that takes
a list of filenames, doing the same to each line of input from all of
them. Input from <> comes either from standard input, or from each
file listed on the command line.
It reads from STDIN (standard input):
> cat temp.pl
#!/usr/bin/perl
use strict;
use warnings;
my $count=<>;
print "$count"."\n";
>
below is the execution:
> temp.pl
3
3
>
So as soon as you execute the script, it will wait till the user gives some input.
After 3 is given as input, it stores that value in $count and prints it in the next statement.

Perl: How to get filename when using <> construct?

Perl offers this very nice feature:
while ( <> )
{
    # do something
}
...which allows the script to be used as script.pl <filename> as well as cat <filename> | script.pl.
Now, is there a way to determine if the script has been called in the former way, and if yes, what the filename was?
I know I knew this once, and I know I even used the construct, but I cannot remember where / how. And it proved very hard to search the 'net for this ("perl stdin filename"? No...).
Help, please?
The variable $ARGV holds the current file being processed.
$ echo hello1 > file1
$ echo hello2 > file2
$ echo hello3 > file3
$ perl -e 'while(<>){s/^/$ARGV:/; print;}' file*
file1:hello1
file2:hello2
file3:hello3
The I/O Operators section of perlop is very informative about this.
Essentially, the first time <> is executed, - is added to @ARGV if it started out empty. Opening - has the effect of cloning the STDIN file handle, and the variable $ARGV is set to the current element of @ARGV as it is processed.
Here's the full clip.
The null filehandle "<>" is special: it can be used to emulate the
behavior of sed and awk, and any other Unix filter program that takes a
list of filenames, doing the same to each line of input from all of
them. Input from "<>" comes either from standard input, or from each
file listed on the command line. Here's how it works: the first time
"<>" is evaluated, the @ARGV array is checked, and if it is empty,
$ARGV[0] is set to "-", which when opened gives you standard input. The
@ARGV array is then processed as a list of filenames. The loop
    while (<>) {
        ...  # code for each line
    }
is equivalent to the following Perl-like pseudo code:
    unshift(@ARGV, '-') unless @ARGV;
    while ($ARGV = shift) {
        open(ARGV, $ARGV);
        while (<ARGV>) {
            ...  # code for each line
        }
    }
except that it isn't so cumbersome to say, and will actually work. It
really does shift the @ARGV array and put the current filename into the
$ARGV variable. It also uses filehandle ARGV internally. "<>" is just
a synonym for "<ARGV>", which is magical. (The pseudo code above doesn't
work because it treats "<ARGV>" as non-magical.)
If you care to know about when <> switches to a new file (e.g. in my case - I wanted to record the new filename and line number), then the eof() function documentation offers a trick:
# reset line numbering on each input file
while (<>) {
    next if /^\s*#/;  # skip comments
    print "$.\t$_";
} continue {
    close ARGV if eof;  # Not eof()!
}