Perl-script to read and print lines from multiple txt files? - perl

We have 300+ txt files, which are basically replicates of an email. Each txt file has the following format:
To: blabla@hotmail.com
Subject: blabla
From: bla1@hotmail.com
Message: Hello World!
The platform I am to run the script on is Windows, and everything is local (including the Perl instance). The aim is to write a script which crawls through each file (all located within the same directory) and prints out a list of each unique email address in the From field. The concept is very easy.
Can anyone point me in the right direction here? I know how to start off a Perl script, and I am able to read a single file and print all details:
#!/usr/local/bin/perl
open (MYFILE, 'emails/email_id_1.txt');
while (<MYFILE>) {
chomp;
print "$_\n";
}
close (MYFILE);
So now, I need to be able to read and print line 3 of this file, but perform this activity not just once, but for all of the files. I've looked into the File::Find module, could this be of any use?

What platform? If Linux then it's simple:
foreach my $f (@ARGV) {
# Do stuff
}
and then call with:
perl mything.pl *.txt
In Windows you'll need to expand the wildcard first as cmd.exe doesn't expand wildcards (unlike Linux shells):
@ARGV = map glob, @ARGV;
foreach my $f (@ARGV) {
# Do stuff
}
Then extracting the third line is just a simple case of reading each line in and counting until you get to line 3, so you know when to print the result.
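The counting step described above can be sketched as follows. This is only a minimal illustration: the helper name nth_line is made up for this example, and it relies on Perl's built-in $. variable to do the counting.

```perl
use strict;
use warnings;

# Hypothetical helper: return line $n (1-based) of $file without its line
# ending, or undef if the file has fewer than $n lines.
sub nth_line {
    my ($file, $n) = @_;
    open my $fh, '<', $file or die "$file: $!\n";
    while (my $line = <$fh>) {
        next unless $. == $n;      # $. is the number of the line just read
        $line =~ s/\r?\n\z//;      # strip LF or CR-LF
        close $fh;
        return $line;
    }
    close $fh;
    return undef;
}

# Expand wildcards ourselves so the same command also works under cmd.exe.
for my $file (map { glob($_) } @ARGV) {
    my $third = nth_line($file, 3);
    print "$third\n" if defined $third;
}
```

Called as perl mything.pl *.txt, this prints the third line of every matching file and silently skips files that are too short.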

The glob() builtin can give you a list of files in a directory:
chdir $dir or die $!;
my @files = glob('*');
You can use Tie::File to access the 3rd line of a file:
use Tie::File;
for (@files) {
tie my @lines, 'Tie::File', $_ or die $!;
print $lines[2], "\n";
}

Perl one-liner, windows-version:
perl -wE "@ARGV = glob '*.txt'; while (<>) { say $1 if /^From:\s*(.*)/ }"
It will check all the lines, but only print if it finds a valid From: tag.

Are you using a Unix-style shell? You can do this in the shell without even using Perl.
grep -h "^From:" ./* | sort | uniq -c
The breakdown is as follows:
grep will grab every line that starts with "From:" (the "-h" part suppresses the file-name prefix, so identical lines from different files compare equal), and send it to...
sort, which will alpha sort those lines, then...
uniq, which will filter out dupe lines. The "-c" part will count the occurrences.
Your output would look like:
3 From: dave@example.com
5 From: foo@bar.example.com
etc...
Possible issues:
I'm not sure how complex your "From" lines will be, e.g. multiple addresses, different formats, etc.
You could enhance that grep step in a few ways, or replace it with a Perl script that has less-broad functionality than your proposed all-in-one script.
Please comment if anything isn't clear.
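If the From lines do turn out to vary, the grep step could be swapped for a small Perl filter along these lines. It is a sketch under assumptions, not a full address parser: it assumes each From line holds a single address, either bare or in the "Name <address>" form, and the helper names from_address and count_senders are invented for this example.

```perl
use strict;
use warnings;

# Hypothetical helper: reduce a "From:" line to a bare, lower-cased address.
# Returns undef for lines that are not From: lines or carry no value.
sub from_address {
    my ($line) = @_;
    return undef unless $line =~ /^From:\s*(.*?)\s*$/i;
    my $value = $1;
    $value = $1 if $value =~ /<([^<>]+)>/;   # "Name <addr>" form
    return length($value) ? lc $value : undef;
}

# Count how often each normalized sender appears in the given files.
sub count_senders {
    my %count;
    for my $file (@_) {
        open my $fh, '<', $file or die "$file: $!\n";
        while (my $line = <$fh>) {
            my $addr = from_address($line);
            $count{$addr}++ if defined $addr;
        }
        close $fh;
    }
    return %count;
}

my %count = count_senders(map { glob($_) } @ARGV);
printf "%7d %s\n", $count{$_}, $_ for sort keys %count;
```

The output has the same shape as the uniq -c listing, but "Dave <Dave@Example.com>" and "dave@example.com" are counted as the same sender.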

Here's my solution (I hope this isn't homework).
It checks all files in the current directory whose names end with ".txt", case-insensitive (e.g., it will find "foo.TXT", which is probably what you want under Windows). It also allows for possible variations in line terminators (at least CR-LF and LF), and searches for the From: prefix case-insensitively, and allows arbitrary whitespace after the :.
#!/usr/bin/perl
use strict;
use warnings;
opendir my $DIR, '.' or die "opendir .: $!\n";
my @files = grep /\.txt$/i, readdir $DIR;
closedir $DIR;
# print "Got ", scalar @files, " files\n";
my %seen = ();
foreach my $file (@files) {
open my $FILE, '<', $file or die "$file: $!\n";
while (<$FILE>) {
if (/^From:\s*(.*?)\r?$/i) {
$seen{$1} = 1;
}
}
close $FILE;
}
foreach my $addr (sort keys %seen) {
print "$addr\n";
}

Related

Perl wildcards in the file paths

I am working on my project where the GNU Makefile should automatically test my Perl program with different input files. I have this code which reads only one file from inputs directory, searches stop words and outputs out.txt file with frequency table of non-stop words.
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::StopWords qw(getStopWords);
my %found;
my $src = '/programu-testavimas/1-dk/trunk/tests/inputs/test.txt';
my $des = '/programu-testavimas/1-dk/trunk/tests/outputs/out.txt';
open(SRC,'<',$src) or die $!;
open(DES,'>',$des) or die $!;
my $stopwords = getStopWords('en');
while( my $line = <SRC>){
++$found{$_} for grep { !$stopwords->{$_} }
split /\s+/, lc $line;
}
print DES $_, "\t\t", $found{$_}, $/ for sort keys %found;
close(SRC);
close(DES);
My goal is to test many files with separate case.sh scripts, where the input file is different in each case. This is one of the cases:
#!/bin/sh
perl /programu-testavimas/1-dk/trunk/scripts/test.pl /programu-testavimas/1-dk/trunk/tests/inputs/test.txt > /home/aleksandra/programų-testavimas/1-dk/trunk/tests/outputs/out.txt
Then my Makefile should test the program with the different inputs of all cases at once. Right now I'm struggling with my Perl code, which reads only one hard-coded input file; I need to make it read different files from the inputs directory. How can I change the path so that each case's shell script can supply its own input file?
EDIT: I tried this with glob function but it outputs empty file
open(DES,'>',$des) or die $!;
my $stopwords = getStopWords('en');
for my $file ( glob $src ) {
open(SRC,'<',$file) or die "$! opening $file";
while( my $line = <SRC>){
++$found{$_} for grep { !$stopwords->{$_} }
split /\s+/, lc $line;
}
print DES $_, "\t\t", $found{$_}, $/ for sort keys %found;
close(SRC);
}
close(DES);
Correct me if I'm wrong, but to me it sounds like you have different shell scripts, each calling your perl script with a different input, and redirecting your perl's script output to a new file.
You don't need to glob anything in your perl script. It already has all the information it needs: which file to read. Your shell script/Makefile is handling the rest.
So given the shell script
#!/bin/sh
perl /path/to/test.pl /path/to/input.txt > /path/to/output.txt
Then in your perl script, simply read from the file provided via the first positional parameter:
#!/usr/bin/perl
use strict;
use warnings;
use Lingua::StopWords qw(getStopWords);
my %found;
my $stopwords = getStopWords('en');
while(my $line = <>) {
++$found{$_} for grep { !$stopwords->{$_} }
split /\s+/, lc $line;
}
print $_, "\t\t", $found{$_}, $/ for sort keys %found;
while(<>) will read from STDIN or from ARGV.
Your shell script could then call your perl script with different inputs and define outputs:
#!/bin/sh
for input in /path/to/*.txt; do
perl /path/to/test.pl "$input" > "$input.out"
done
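To see how while(<>) walks through the files named in @ARGV, here is a small sketch. The helper name lines_per_file is hypothetical; it sets @ARGV locally so that the diamond operator can be exercised from inside a function.

```perl
use strict;
use warnings;

# Hypothetical helper: count the lines of each named file via <>.
sub lines_per_file {
    my @files = @_;
    my %lines;
    return %lines unless @files;   # with an empty @ARGV, <> would read STDIN
    local @ARGV = @files;          # <> shifts names off @ARGV as it goes
    local $_;                      # keep the caller's $_ intact
    while (<>) {
        $lines{$ARGV}++;           # $ARGV holds the name of the current file
    }
    return %lines;
}

my %lines = lines_per_file(@ARGV);
print "$_: $lines{$_}\n" for sort keys %lines;
```

Run with one or more file names on the command line, it prints a line count per file; the script never opens a file explicitly.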

foreach and special variable $_ not behaving as expected

I'm learning Perl and wrote a small script to open perl files and remove the comments
# Will remove this comment
my $name = ""; # Will not remove this comment
#!/usr/bin/perl -w <- wont remove this special comment
The name of files to be edited are passed as arguments via terminal
die "You need to give at least one file name as an argument\n" unless (@ARGV);
foreach (@ARGV) {
$^I = "";
(-w && open FILE, $_) || die "Oops: $!";
/^\s*#[^!]/ || print while(<>);
close FILE;
print "Done! Please see file: $_\n";
}
Now when I ran it via Terminal:
perl removeComments file1.pl file2.pl file3.pl
I got the output:
Done! Please see file:
This script is working EXACTLY as I'm expecting but
Issue 1 : Why $_ didn't print the name of the file?
Issue 2 : Since the loop runs for 3 times, why Done! Please see file: was printed only once?
How would you write this script in as few lines as possible?
Please comment on my code as well, if you have time.
Thank you.
The while stores the lines read by the diamond operator <> into $_, so you're writing over the variable that stores the file name.
On the other hand, you open the file with open but don't actually use the handle to read; the loop uses the empty diamond operator instead. The empty diamond operator makes an implicit loop over the files in @ARGV, removing file names as it goes, so the foreach runs only once.
To fix the second issue you could use while(<FILE>), or rewrite the loop to take advantage of the implicit loop in <> and write the entire program as:
$^I = "";
/^\s*#[^!]/ || print while(<>);
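The first issue can be reproduced in isolation. The sketch below uses made-up function names and reads from an in-memory string instead of real files; it shows that a while (<$fh>) loop nested in a foreach over $_ leaves the aliased list element undefined, while a named line variable does not.

```perl
use strict;
use warnings;

# Reads through $_: foreach aliases $_ to each element, and every read in
# the while loop assigns to that same $_ (ending with undef at end of file).
sub names_after_implicit_read {
    my @names = @_;
    for (@names) {
        my $text = "first line\nsecond line\n";
        open my $fh, '<', \$text or die $!;
        while (<$fh>) { }          # overwrites $_, i.e. the current element
        close $fh;
    }
    return @names;
}

# Same loop with named variables: the file names survive.
sub names_after_named_read {
    my @names = @_;
    for my $name (@names) {
        my $text = "first line\nsecond line\n";
        open my $fh, '<', \$text or die $!;
        while (my $line = <$fh>) { }
        close $fh;
    }
    return @names;
}

for my $name (names_after_implicit_read('file1.pl', 'file2.pl')) {
    print defined $name ? "$name\n" : "(undef)\n";
}
```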
Here's a more readable approach.
#!/usr/bin/perl
# always!!
use warnings;
use strict;
use autodie;
use File::Copy;
# die with some usage message
die "usage: $0 [ files ]\n" if @ARGV < 1;
for my $filename (@ARGV) {
# create tmp file name that we are going to write to
my $new_filename = "$filename\.new";
# open $filename for reading and $new_filename for writing
open my $fh, "<", $filename;
open my $new_fh, ">", $new_filename;
# Iterate over each line in the original file: $filename,
# if our regex matches, we bail out. Otherwise we print the line to
# our temporary file.
while(my $line = <$fh>) {
next if $line =~ /^\s*#[^!]/;
print $new_fh $line;
}
close $fh;
close $new_fh;
# use File::Copy's move function to rename our files.
move($filename, "$filename\.bak");
move($new_filename, $filename);
print "Done! Please see file: $filename\n";
}
Sample output:
$ ./test.pl a.pl b.pl
Done! Please see file: a.pl
Done! Please see file: b.pl
$ cat a.pl
#!/usr/bin/perl
print "I don't do much\n"; # comments dont' belong here anyways
exit;
print "errrrrr";
$ cat a.pl.bak
#!/usr/bin/perl
# this doesn't do much
print "I don't do much\n"; # comments dont' belong here anyways
exit;
print "errrrrr";
It's not safe to nest loops and expect $_ to keep the right value. The while loop is overwriting your $_. Give the file name its own variable inside that loop. You can do it like this:
foreach my $filename (@ARGV) {
$^I = ""; # only affects <>; with an explicit handle, print goes to STDOUT
(-w $filename && open my $FILE, '<', $filename) || die "Oops: $!";
/^\s*#[^!]/ || print while (<$FILE>);
close $FILE;
print "Done! Please see file: $filename\n";
}
or that way:
foreach (@ARGV) {
my $filename = $_;
$^I = "";
(-w $filename && open my $FILE, '<', $filename) || die "Oops: $!";
/^\s*#[^!]/ || print while (<$FILE>);
close $FILE;
print "Done! Please see file: $filename\n";
}
Please never use barewords for filehandles, and do use the 3-argument open.
open my $FILE, '<', $filename; # good
open FILE, $filename; # bad
Simpler solution: Don't use $_.
When Perl was first written, it was conceived as a replacement for Awk and shell, and Perl heavily borrowed from that syntax. Perl also for readability created the special variable $_ which allowed you to use various commands without having to create variables:
while ( <INPUT> ) {
next if /foo/;
print OUTPUT;
}
The problem is that if everything is using $_, then everything will affect $_, with many unpleasant side effects.
Now, Perl is a much more sophisticated language, and has things like locally scoped variables (hint: you don't use local to create these variables; that merely gives package variables (aka global variables) a local value).
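The difference between my and local can be seen in a few lines. This is only an illustrative sketch; the variable and function names are invented for the example.

```perl
use strict;
use warnings;

our $level = 'global';             # a package (global) variable

sub show { return $level }         # always reads the package variable

sub with_local {
    local $level = 'localized';    # temporary value for the package variable
    return show();                 # callees see the temporary value
}

sub with_my {
    my $level = 'lexical';         # a new variable, visible only in this block
    return show();                 # show() still sees the package variable
}

print with_local(), "\n";          # prints "localized"
print with_my(), "\n";             # prints "global"
print show(), "\n";                # prints "global": the local value is gone
```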
Since you're learning Perl, you might as well learn Perl correctly. The problem is that there are too many books that are still based on Perl 3.x. Find a book or web page that incorporates modern practice.
In your program, $_ switches from the file name to the line in the file and back to the next file. That's what's confusing you. If you used named variables, you could distinguish between files and lines.
I've rewritten your program using more modern syntax, but your same logic:
use strict;
use warnings;
use autodie;
use feature qw(say);
if ( not $ARGV[0] ) {
die "You need to give at least one file name as an argument\n";
}
for my $file ( @ARGV ) {
# Remove suffix and copy file over
if ( $file !~ /\.[^.]+$/ ) {
die qq(File "$file" doesn't have a suffix);
}
( my $output_file = $file ) =~ s/\.[^.]+$//; #Remove suffix for output
open my $input_fh, "<", $file;
open my $output_fh, ">", $output_file;
while ( my $line = <$input_fh> ) {
print {$output_fh} $line unless $line =~ /^\s*#[^!]/;
}
close $input_fh;
close $output_fh;
}
This is a bit more typing than your version of the program, but it's easier to see what's going on and maintain.

Find file which content not match a string pattern in Perl

I'm writing code to find the files which do not contain a string pattern. Given a list of files, I have to look into the content of each file, and I would like to get the file name if the string pattern "clean" does not appear inside the file. Please help.
Here is the scenario:
I have a list of files, each containing numerous lines. If a file is clean, it will contain the word "clean". If a file is dirty, the word "clean" is absent, and there is no other clear indication that the file is dirty. So if the word "clean" is not detected anywhere in a file, I'll categorize it as a dirty file, and I would like to record its file name.
You can use a simple one-liner:
perl -0777 -nlwE 'say $ARGV if !/clean/i' *.txt
Slurping the file with -0777, making the regex check against the entire file. If the match is not found, we print the file name.
For perl versions lower than 5.10 that do not support -E you can substitute -E with -e and say $ARGV with print "$ARGV\n":
perl -0777 -nlwe 'print "$ARGV\n" if !/clean/i' *.txt
If you need to generate the list within Perl, the File::Finder module will make life easy.
Untested, but should work:
use File::Finder;
my @wanted = File::Finder # finds all ..
->type( 'f' ) # .. files ..
->name( '*.txt' ) # .. ending in .txt ..
->not # .. that do not ..
->contains( qr/clean/ ) # .. contain "clean" ..
->in( '.' ); # .. in current dir
print $_, "\n" for @wanted;
Neat stuff!
EDIT:
Now that I have a clearer picture of the problem, I don't think any module is necessary here:
use strict;
use warnings;
my @files = glob '*.txt'; # Dirty & clean laundry
my @dirty;
foreach my $file ( @files ) { # For each file ...
local $/ = undef; # Slurps the file in
open my $fh, '<', $file or die $!;
unless ( <$fh> =~ /clean/ ) { # if the file isn't clean ..
push @dirty, $file; # .. it's dirty
}
close $fh;
}
print $_, "\n" for @dirty; # Dirty laundry list
Once you get the mechanics, this can be simplified a la grep, etc.
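One possible grep-style simplification is sketched below. The helper name lacks_pattern is made up for this example, and the file names are taken from the command line rather than hard-coded.

```perl
use strict;
use warnings;

# Hypothetical helper: true if the whole content of $file does not match
# $pattern (a qr// object).
sub lacks_pattern {
    my ($file, $pattern) = @_;
    open my $fh, '<', $file or die "$file: $!\n";
    local $/;                          # slurp mode: read the file in one go
    my $content = <$fh>;
    close $fh;
    $content = '' unless defined $content;
    return $content !~ $pattern;
}

# grep keeps exactly those files for which the block returns true.
my @dirty = grep { lacks_pattern($_, qr/clean/) } map { glob($_) } @ARGV;
print "$_\n" for @dirty;
```

Called as perl dirty.pl *.txt (the script name is arbitrary), it prints the dirty laundry list directly.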
One way like this:
ls *.txt | grep -v "$(grep -l clean *.txt)"
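#!/usr/bin/perl
use strict;
use warnings;
# <file_list_file> is a placeholder for a file holding one file name per line
open(FILE, '<', "<file_list_file>") or die $!;
while(<FILE>)
{
chomp; # strip the newline, otherwise the open below fails
my $flag=0;
my $filename=$_;
open(TMPFILE, '<', $filename) or die "$filename: $!";
while(<TMPFILE>)
{
$flag=1 if(/<your_string>/);
}
close(TMPFILE);
if(!$flag)
{
print "$filename\n";
}
}
close(FILE);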

Find Particular String in File and Count How many Times it is repeated using perl

I have a long file, say 10000 lines.
It is the same set of data repeated, e.g. 10 lines, and the next ten lines will be the same.
I want to find a string, say "ObjectName", in that file and count how many times it appears in that file.
Can anyone post detailed code? I am new to Perl.
Using Perl:
perl -ne '$x+=s/objname//g;END{print $x,"\n";}' file
Updated:
Since the OP wants a solution using file handles:
#!/usr/bin/perl
use warnings;
use strict;
open my $fh , '<' , 'f.txt' or die 'Cannot open file';
my $x=0;
while (<$fh>){
chomp;
$x+=s/objname//g;
}
close $fh;
print "$x";
Here's another option that also addresses your comment about searching in a whole directory:
#!/usr/bin/env perl
use warnings;
use strict;
my $dir = '.';
my $count = 0;
my $find = 'ObjectName';
for my $file (<$dir/*.txt>) {
open my $fh, '<', $file or die $!;
while (<$fh>) {
$count += () = /\Q$find\E/g; # list assignment in scalar context counts every match
}
close $fh;
}
print $count;
The glob denoted by <$dir/*.txt> will non-recursively get the names of all text files in the directory $dir. If you want all files, change it to <$dir/*>. Each file is opened and read, line-by-line. The regex /\Q$find\E/g globally matches the contents of $find against each line. The \Q ... \E notation escapes any meta-characters in the string you're looking for, else those characters may interfere with the matching.
Hope this helps!
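To count every occurrence on a line rather than just the lines that match, the match can be forced into list context. A minimal sketch follows; the function name count_occurrences is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: number of non-overlapping occurrences of the literal
# string $find in $text.
sub count_occurrences {
    my ($text, $find) = @_;
    # In list context //g returns all matches; assigning that list to ()
    # and then to a scalar yields the number of matches.
    my $count = () = $text =~ /\Q$find\E/g;
    return $count;
}

print count_occurrences('ObjectName = a; ObjectName = b', 'ObjectName'), "\n";   # prints 2
```

Because of \Q ... \E, a search string such as a.b only matches the literal text a.b, not a+b or axb.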
This could be a one-liner in bash:
grep -o "ObjectName" <filename> | wc -l

Using Perl to rename files in a directory

I'd like to take a directory and for all email (*.msg) files, remove the 'RE ' at the beginning. I have the following code but the rename fails.
opendir(DIR, 'emails') or die "Cannot open directory";
@files = readdir(DIR);
closedir(DIR);
for (@files){
next if $_ !~ m/^RE .+msg$/;
$old = $_;
s/RE //;
rename($old, $_) or print "Error renaming: $old\n";
}
If your ./emails directory contains these files:
1.msg
2.msg
3.msg
then your @files will look something like ('.', '..', '1.msg', '2.msg', '3.msg'), but your rename wants names like 'emails/1.msg', 'emails/2.msg', etc. So you can chdir before renaming:
chdir('emails');
for (@files) {
#...
}
You'd probably want to check the chdir return value too.
Or add the directory names yourself:
rename('emails/' . $old, 'emails/' . $_) or print "Error renaming $old: $!\n";
# or rename("emails/$old", "emails/$_") if you like string interpolation
# or you could use map if you like map
You might want to combine your directory reading and filtering using grep:
my @files = grep { /^RE .+msg$/ } readdir(DIR);
or even this:
opendir(DIR, 'emails') or die "Cannot open directory";
for (grep { /^RE .+msg$/ } readdir(DIR)) {
(my $new = $_) =~ s/^RE //;
rename("emails/$_", "emails/$new") or print "Error renaming $_ to $new: $!\n";
}
closedir(DIR);
You seem to be assuming glob-like behavior rather than readdir-like behavior.
The underlying readdir system call returns just the filenames within the directory, and will include the two entries . and .. (the directory itself and its parent). This carries through to the readdir function in Perl, just to give a bit more detail on mu's answer.
Alternatively, there's not much point in using readdir if you're collecting all the results in an array anyway.
@files = glob('emails/*');
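If you do want to stay with readdir, the bare names can be turned into usable paths in one pass. The following is a sketch, with a made-up helper name paths_in, that drops the . and .. entries and prefixes the directory.

```perl
use strict;
use warnings;
use File::Spec;

# Hypothetical helper: list the entries of $dir as paths that include $dir,
# without the special entries "." and "..".
sub paths_in {
    my ($dir) = @_;
    opendir my $dh, $dir or die "$dir: $!\n";
    my @names = grep { $_ ne '.' && $_ ne '..' } readdir $dh;
    closedir $dh;
    my @paths = sort map { File::Spec->catfile($dir, $_) } @names;
    return @paths;
}

print "$_\n" for map { paths_in($_) } @ARGV;
```

The returned paths can be handed straight to rename or open, regardless of the current working directory.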
As already mentioned, your script fails because the path you expect and the path the script uses are not the same.
I would suggest a more transparent usage. Hardcoding a directory is not a good idea, IMO. I learned that one day when I made a script to alter some original files, with the hardcoded path, and a colleague of mine thought this would be a nice script to borrow to alter his copies. Ooops!
Usage:
perl script.pl "^RE " *.msg
i.e. regex, then a file glob list, where the path is denoted in relation to the script, e.g. *.msg, emails/*.msg or even /home/pat/emails/*.msg /home/foo/*.msg. (multiple globs possible)
Using the absolute paths will leave the user with no doubt as to which files he'll be affecting, and it will also make the script reusable.
Code:
use strict;
use warnings;
use v5.14; # s///r needs Perl 5.14 or later
use File::Copy qw(move);
my $rx = shift; # e.g. "^RE "
if ($^O eq 'MSWin32') { # Patch for Windows' lack of shell globbing
@ARGV = map glob, @ARGV;
}
for (@ARGV) {
if (/$rx/) {
my $new = s/$rx//r; # Using non-destructive substitution
say "Moving $_ to $new ...";
move($_, $new) or die $!;
}
}
I don't know if the regex fits the specific names of the files, but in one line this could be done with:
perl -E'for (</path/to/emails*.*>){ ($new = $_) =~ s/(^RE)(.*$)/$2/; say $_." -> ".$new}'
(say ... is nice for testing, just replace it with rename $_,$new or rename($_,$new) )
<*.*> read every file in the current directory
($new = $_) =~ saves the following substitution in $new and leaves $_ as intact
(^RE) save this match in $1 (optional) and just match files with "RE" at the beginning
(.*$) save everything until and including the end ($) of the line -> into $2
substitute the match with the string in $2