Seeking explanation of Magic Perl Shared Lines Oneliner - perl

I found this (here if you must know), and it caught my attention.
$ perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' file1 file2
I do know perl. But I do not know how this does what it does.
$ perl -ne 'print if ($seen{$_} .= @ARGV) =~ /10$/' <(echo 'zz\nabc\n3535\ndef') <(echo 'abc\ndef\nff')
abc
def
Seems like it just spits out the lines of the input files that are shared. Now putting every line into a hash as key, or something, I can see how it can help achieve that task, but... What the hell is going on with that regex?
Thinking about it some more, nothing about the use of .= is obvious either.

The expression $seen{$_} .= @ARGV appends the number of elements in @ARGV to $seen{$_}
While the first file is being read, @ARGV contains only one element -- the second file name
While the second file is being read, @ARGV is empty
The value of $_, which is used as the key for the %seen hash, is the latest line read from either file
If any given line appears only in the first file, only a 1 will be appended to the hash element
If any given line appears only in the second file, only a 0 will be appended to the hash element
If any given line appears in both files, a 1 and then a 0 will be appended to the hash element, leaving it set to 10
When reading through the second file, if the appended 0 character results in a value of 10 then the line is printed
This results in all lines that appear in both files being printed to the output
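The steps above can be traced with a small sketch that needs no input files. The sub name shared_lines and the counter $remaining are made up for illustration; $remaining stands in for the number of elements left in @ARGV (1 during the first file, 0 during the second).

```perl
use strict;
use warnings;

# Hypothetical helper: mimic the one-liner on two in-memory "files".
# $remaining plays the role of scalar(@ARGV): 1 while reading the
# first file, 0 while reading the second.
sub shared_lines {
    my ($first, $second) = @_;
    my (%seen, @out);
    my $remaining = 1;
    for my $file ($first, $second) {
        for my $line (@$file) {
            # Append "1" or "0", then test whether the value ends in "10".
            push @out, $line if ($seen{$line} .= $remaining) =~ /10$/;
        }
        $remaining--;
    }
    return @out;
}

my @common = shared_lines([qw(zz abc 3535 def)], [qw(abc def ff)]);
print "$_\n" for @common;   # prints abc, then def
```

A line repeated in the second file is still printed only once: the second append turns "10" into "100", which no longer matches /10$/.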

Related

How to split a file (with sed) into numerous files according to a value found on each line?

I have several Company_***.csv files (although the separator is a tab, not a comma; hence they should be *.tsv, but never mind) which contain a header plus numerous data lines, e.g.
1stHeader 2ndHeader DateHeader OtherHeaders...
111111111 SOME STRING 2020-08-01 OTHER STRINGS..
222222222 ANOT STRING 2020-08-02 OTHER STRINGS..
I have to split them according to the 3rd column here, it's a date.
Each file should be named like e.g. Company_2020_08_01.csv, Company_2020_08_02.csv and so on
and containing: same header on the 1st line + matching rows as the following lines.
At first I thought about saving (once) the header in a single file e.g.
sed -n '1w Company_header.csv' Company_*.csv
then parsing the files with a pattern for the date (hence the headers would be skipped) e.g.
sed -n '/\t2020-[01][0-9]-[0-3][0-9]\t/w somefilename.csv' Company_*.csv
... and at last, insert the (missing) header in each generated file.
But I'm stuck at step 2: I can't find how I could (dynamically) generate the "filename" expected by the w command, nor how to capture the date in the search pattern (because apparently this is just an address, not a search-replace "field" as in the s/regexp/replacement/[flags] command, so you can't have capturing groups ( ) in there).
So I wonder if this is actually doable with sed? Or should I look at other tools, e.g. awk?
Disclaimer: I'm quite a n00b with these commands so I'm just learning/starting from scratch...
Perl to the rescue!
perl -e 'while (<>) {
$h = $_, next if $. == 1;
$. = 0 if eof;
@c = split /\t/;
open my $out, ">>", "Company_" . $c[2] =~ tr/-/_/r . ".csv" or die $!;
print {$out} $h unless tell $out;
print {$out} $_;
}' -- Company_*.csv
The diamond operator <> in scalar context reads a line from the input.
The first line of each file is stored in the variable $h, see $. and eof
split populates the @c array with the column values of each line
$c[2] contains the date, using tr we translate dashes to underscores to create a filename from it. open opens the file for appending.
print prints the header if the file is empty (see tell)
and prints the current line, too.
Note that it only appends to the files, so don't forget to delete any output files before running the script again.
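The file-name step can be tried on its own with a minimal sketch like the following. The sub name output_name is hypothetical, and the sample row is the one from the question with tabs written as \t. The /r flag of tr needs Perl 5.14 or later.

```perl
use strict;
use warnings;

# Hypothetical helper: derive the output file name from one tab-separated row.
sub output_name {
    my ($row) = @_;
    my @c = split /\t/, $row;
    # tr/-/_/r returns the translated copy and leaves $c[2] untouched;
    # =~ binds tighter than ".", so only the date is translated.
    return "Company_" . $c[2] =~ tr/-/_/r . ".csv";
}

print output_name("111111111\tSOME STRING\t2020-08-01\tOTHER STRINGS.."), "\n";
# prints Company_2020_08_01.csv
```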

Awk's output in Perl doesn't seem to be working properly

I'm writing a simple Perl script which is meant to output the second column of an external text file (columns one and two are separated by a comma).
I'm using AWK because I'm familiar with it.
This is my script:
use v5.10;
use File::Copy;
use POSIX;
$s = `awk -F ',' '\$1==500 {print \$2}' STD`;
say $s;
The contents of the local file "STD" is:
CIR,BS
60,90
70,100
80,120
90,130
100,175
150,120
200,260
300,500
400,600
500,850
600,900
My output is very strange and it prints out the desired "850" but it also prints a trailer of the line and a new line too!
ka@man01:$ ./test.pl
850
ka@man01:$
The problem isn't just printing. I need to use the variable generated by awk (i.e. the $s variable), but the variable also ends up holding a long string and a new line!
Could you guys help?
Thank you.
I'd suggest that you're going down a dirty road by trying to inline awk into perl in the first place. Why not instead:
open ( my $input, '<', 'STD' ) or die $!;
while ( <$input> ) {
s/\s+\z//;
my @fields = split /,/;
print $fields[1], "\n" if $fields[0] == 500;
}
But the likely problem is that you're not handling linefeeds, and say is adding an extra one. Try using print instead, or chomp on the resultant string.
perl can do many of the things that awk can do. Here's something similar that replaces your entire Perl program:
$ perl -naF, -le 'chomp; print $F[1] if $F[0]==500' STD
850
The -n creates a while loop around your argument to -e.
The -a splits up each line into @F and -F lets you specify the separator. Since you want to separate the fields on a comma you use -F,.
The -l adds a newline each time you call print.
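The -e argument is the program to run (with the added while from -n). The chomp removes the trailing newline from the input line; because -l is combined with -n, Perl already chomps each line automatically, so the explicit chomp is redundant but harmless. Without that, you would get an extra newline in your output because you happen to use the last field in the line. The -l adds a newline when you print; that's important when you want to extract a field in the middle of the line.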
The reason you get 2 newlines:
the backtick operator does not remove the trailing newline from the awk output. $s contains "850\n"
the say function appends a newline to the string. You have say "850\n" which is the same as print "850\n\n"
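If you keep the backticks, a chomp before using the value is enough. Here is a small sketch of that, where the sub name trimmed is made up and the string "850\n" stands in for what the awk command returns (the command itself is not run here):

```perl
use strict;
use warnings;

# Hypothetical helper: return command output without its trailing newline.
sub trimmed {
    my ($output) = @_;
    chomp $output;        # removes one trailing "\n", if present
    return $output;
}

# Stand-in for what the backticks return; the real awk call is not run here.
my $s = trimmed("850\n");
print "[$s]\n";           # prints [850]
```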

How to print without duplicates with perl?

My assignment is a little more in depth than the title but in the title is my main question. Here is the assignment:
Write a perl script that will grep for all occurrences of the regular expression in all regular files in the file/directory list as well as all regular files under the directories in the file/directory list. If a file is not a TEXT file then the file should first be operated on by the unix command strings (no switches) and the resulting lines searched. If the -l switch is given only the file name of the files containing the regular expression should be printed, one per line. A file name should occur a maximum of one time in this case. If the -l switch is not given then all matching lines should be printed, each preceded on the same line by the file name and a colon. An example invocation from the command line:
plgrep 'ba+d' file1 dir1 dir2 file2 file3 dir3
Here is my code:
#!/usr/bin/perl -w
use Getopt::Long;
my $fname = 0;
GetOptions ('l' => \$fname);
$pat = shift @ARGV;
while (<>) {
if (/$pat/) {
$fname ? print "$ARGV\n" : print "$ARGV:$_";
}
}
So far that code does everything it's supposed to except for reading non-text files and printing out duplicates of file names when using the -l switch. Here is an example of my output after entering the following on the command line: plgrep 'ba+d' file1 file2
file1:My dog is bad.
file1:My dog is very baaaaaad.
file2:I am bad at the guitar.
file2:Even though I am bad at the guitar, it is still fun to play!
Which is PERFECT!
But when I use the -l switch to print out only the file names this is what I get after entering the following on the command line: plgrep -l 'ba+d' file1 file2
file1
file1
file2
file2
How do I get rid of those duplicates so it only prints:
file1
file2
I have tried:
$pat = shift @ARGV;
while (<>) {
if (/$pat/) {
$seen{$ARGV}++;
$fname ? print "$ARGV\n" unless ($seen{$ARGV} > 1); : print "$ARGV:$_";
}
}
But when I try to run it without the -l switch I only get:
file1:My dog is bad.
file2:I am bad at the guitar.
I also tried:
$fname ? print "$ARGV\n" unless ($ARGV > 1) : print "$ARGV:$_";
But I keep getting syntax error at plgrep line 17, near ""$ARGV\n" unless"
If someone could help me out with my duplicates issue as well as the italicized part of the assignment I would truly appreciate it. I don't even know where to start on that italicized part.
If you're printing only file names, you can exit the loop (using the last command) after the first match, since you already know the file matches. By not scanning the rest of the file, this will also prevent the name from being printed repeatedly.
Edited to add: In order to do it this way, you'll also need to switch from using <> to read the files to instead getting the names from @ARGV and opening them normally.
If you want to continue using <>, you'll instead need to watch $ARGV to see when it changes (indicating that you've started reading a new file) and keep a flag to indicate whether the current file has found any matches yet or not. However, this approach would require you to read every file in its entirety, which will be less efficient than only reading enough of each file to know whether it contains at least one match or not (i.e., skipping to the next file after the first match), so I would recommend switching to open instead.
The first syntax problem is simply an extra semicolon.
The second is that you may only use if/unless as a statement modifier at the end of a statement - you can't embed it in the middle of a conditional that way.
$fname ? print "$ARGV\n" unless ($seen{$ARGV} > 1); : print "$ARGV:$_";
Becomes:
next if $fname and $seen{$ARGV} > 1;
print $fname ? "$ARGV\n" : "$ARGV:$_";
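For the -l case, the approach of opening each file yourself and leaving the loop with last could look roughly like this. It is only a sketch: the sub name files_with_match is made up, and it covers neither directories nor the strings requirement of the assignment.

```perl
use strict;
use warnings;

# Hypothetical helper: return the names of files containing a match,
# each name at most once, stopping at the first match in each file.
sub files_with_match {
    my ($pat, @files) = @_;
    my @hits;
    for my $file (@files) {
        open my $fh, '<', $file or do { warn "$file: $!\n"; next };
        while (my $line = <$fh>) {
            if ($line =~ /$pat/) {
                push @hits, $file;
                last;            # one match is enough for -l
            }
        }
        close $fh;
    }
    return @hits;
}

# Usage: perl sketch.pl 'ba+d' file1 file2
if (@ARGV) {
    my $pat = shift @ARGV;
    print "$_\n" for files_with_match($pat, @ARGV);
}
```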

Perl: How to get filename when using <> construct?

Perl offers this very nice feature:
while ( <> )
{
# do something
}
...which allows the script to be used as script.pl <filename> as well as cat <filename> | script.pl.
Now, is there a way to determine if the script has been called in the former way, and if yes, what the filename was?
I know I knew this once, and I know I even used the construct, but I cannot remember where / how. And it proved very hard to search the 'net for this ("perl stdin filename"? No...).
Help, please?
The variable $ARGV holds the current file being processed.
$ echo hello1 > file1
$ echo hello2 > file2
$ echo hello3 > file3
$ perl -e 'while(<>){s/^/$ARGV:/; print;}' file*
file1:hello1
file2:hello2
file3:hello3
The I/O Operators section of perlop is very informative about this.
Essentially, the first time <> is executed, - is added to @ARGV if it started out empty. Opening - has the effect of cloning the STDIN file handle, and the variable $ARGV is set to the current element of @ARGV as it is processed.
Here's the full clip.
The null filehandle "<>" is special: it can be used to emulate the
behavior of sed and awk, and any other Unix filter program that takes a
list of filenames, doing the same to each line of input from all of
them. Input from "<>" comes either from standard input, or from each
file listed on the command line. Here's how it works: the first time
"<>" is evaluated, the @ARGV array is checked, and if it is empty,
$ARGV[0] is set to "-", which when opened gives you standard input. The
@ARGV array is then processed as a list of filenames. The loop
while (<>) {
... # code for each line
}
is equivalent to the following Perl-like pseudo code:
unshift(@ARGV, '-') unless @ARGV;
while ($ARGV = shift) {
open(ARGV, $ARGV);
while (<ARGV>) {
... # code for each line
}
}
except that it isn't so cumbersome to say, and will actually work. It
really does shift the @ARGV array and put the current filename into the
$ARGV variable. It also uses filehandle ARGV internally. "<>" is just
a synonym for "<ARGV>", which is magical. (The pseudo code above doesn't
work because it treats "<ARGV>" as non-magical.)
If you care to know about when <> switches to a new file (e.g. in my case - I wanted to record the new filename and line number), then the eof() function documentation offers a trick:
# reset line numbering on each input file
while (<>) {
next if /^\s*#/; # skip comments
print "$.\t$_";
} continue {
close ARGV if eof; # Not eof()!
}

Need more explanation about what's being done here?

print reverse <>;
print sort <>;
What exact steps does perl take for these operations?
It seems that for the 1st one, perl not only reverses the order of the invocation parameters, but also the contents of each file...
print reverse <>;
<> is evaluated in list context, meaning it "slurps" the file: it reads the entire file. In the case of the magic file represented by the files named in @ARGV, it will read the contents of all files in the order given by the command line arguments (@ARGV).
reverse then reverses the order of the list, meaning the last line from the last file comes first, and the first line from the first file comes last.
print then prints the array.
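The three steps can be reproduced without touching real files. In this sketch the sub name slurp_all is made up, and two in-memory strings stand in for the files named on the command line.

```perl
use strict;
use warnings;

# Hypothetical helper: read every "file" (here: in-memory strings) in
# order and return all lines as one flat list, as <> does in list context.
sub slurp_all {
    my @contents = @_;
    my @lines;
    for my $content (@contents) {
        open my $fh, '<', \$content or die $!;
        push @lines, <$fh>;     # list context: all lines at once
        close $fh;
    }
    return @lines;
}

my @lines = slurp_all("a1\na2\n", "b1\nb2\n");
print reverse @lines;   # prints b2, b1, a2, a1 (one per line)
```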
From your notes, you might want something like this:
perl -e 'sub BEGIN { @ARGV=reverse @ARGV; } print <>;' /etc/motd /etc/passwd
This is described in the docs for I/O operators. Here's an excerpt from the docs:
The null filehandle <> is special: it can be used to emulate the behavior of sed and awk. Input from <> comes either from standard input, or from each file listed on the command line. Here's how it works: the first time <> is evaluated, the @ARGV array is checked, and if it is empty, $ARGV[0] is set to "-", which when opened gives you standard input. The @ARGV array is then processed as a list of filenames.
It's worth reading the entire doc, as it provides equivalent "non-magical" Perl code equivalent to <> in various use cases.