Perl: split a text file into chunks

I have a large txt file made up of thousands of articles, and I am trying to split it into individual files, one for each article, which I'd like to save as article_1, article_2, etc. Each article begins with a line containing the word /DOCUMENTS/.
I am totally new to Perl and any insight would be great! (Even advice on good documentation websites.) Thanks a lot.
So far, what I have tried looks like:
#!/usr/bin/perl
use warnings;
use strict;

my $id          = 0;
my $source      = "2010_FTOL_GRbis.txt";
my $destination = "file$id.txt";

open IN, $source or die "can t read $source: $!\n";
while (<IN>) {
    {
        open OUT, ">$destination" or die "can t write $destination: $!\n";
        if (/DOCUMENTS/) {
            close OUT;
            $id++;
        }
    }
}
close IN;

Let's say that /DOCUMENTS/ appears by itself on a line. Thus you can make that the record separator.
use English qw<$RS>;
use File::Slurp qw<write_file>;

my $id = 0;
my $source = "2010_FTOL_GRbis.txt";

{
    local $RS = "\n/DOCUMENTS/\n";
    open my $in, $source or die "can t read $source: $!\n";
    while ( <$in> ) {
        chomp;    # removes the trailing "\n/DOCUMENTS/\n"
        write_file( 'file' . ( ++$id ) . '.txt', $_ );
    }
    # being scoped by the surrounding brackets (my "local block"),
    close $in;    # an explicit close is not necessary
}
NOTES:
use English declares the global variable $RS. The "messy name" for it is $/. See perldoc perlvar
A line separator is the default record separator. That is, the standard unit of file reading is a record. Which is only, by default, a "line".
As you will find in the linked documentation, $RS only takes literal strings. So, using the idea that the division between articles was '/DOCUMENTS/' all by itself on a line, I specified newline + '/DOCUMENTS/' + newline. If this is part of a path that occurs somewhere on the line, then that particular value will not work for the record separator.
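For reference, here is the same approach using the punctuation variable $/ directly, with no helper modules (a minimal sketch under the same assumption about the separator):

my $id = 0;
my $source = "2010_FTOL_GRbis.txt";
{
    local $/ = "\n/DOCUMENTS/\n";    # $/ is the native spelling of $RS
    open my $in, '<', $source or die "can't read $source: $!";
    while ( my $article = <$in> ) {
        chomp $article;              # strips the trailing separator, if present
        open my $out, '>', 'file' . ++$id . '.txt' or die "can't write: $!";
        print $out $article;
        close $out;
    }
}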

Did you read Programming Perl? It is the best book for beginners!
I don't entirely understand what you are trying to do. I assume you have a text file containing many articles and want to write each article to a separate file.
use warnings;
use strict;
use autodie qw(:all);

my $id = 0;
my $source = "2010_FTOL_GRbis.txt";
my $destination = "file$id.txt";

open my $IN, '<', $source;

# open the first output file
open my $OUT, '>', $destination;

while (<$IN>) {
    chomp;                            # kill \n at the end
    if ($_ eq '/DOCUMENTS/') {        # not sure if this is right, or what you are looking for
        close $OUT;
        $id++;
        $destination = "file$id.txt";
        open $OUT, '>', $destination; # reuse the same lexical handle
    } else {
        print {$OUT} $_, "\n";        # print into the file named for the current $id
    }
}
close $IN;
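Note that use autodie qw(:all) is doing the error handling here: if any open or close fails, it throws an exception automatically, which is why none of the open calls above carries an explicit or die clause.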


Open two text files, process them and write to separate files

I'm using Perl to open two text files, process them, and then write the output to another file.
I have a file INPUT where every line is a customer. I will process each line into variables that will be used to substitute text in another file, TEMP. The result should be written into an individual file for each customer, OUTPUT.
My program seems to work for only the first customer. The rest of the output files remain empty, with no output.
#!/usr/bin/perl -w
if ( $#ARGV < 0 ) {
    print "Usage: proj5.pl <mm/dd/yyyy>\n";
    exit;
}
my $date = $ARGV[0];

open(INFO, "p5Customer.txt") or die("Could not open p5Customer.txt file\n");
open(TEMP, "template.txt")   or die("Could not open template.txt file\n");

my $directory = "Emails";
mkdir $directory unless(-e $directory);

foreach $info (<INFO>) {
    ($email, $fullname, $title, $payed, $owed) = split /,/, $info;
    next if($owed < $payed);
    chomp($owed);
    $filepath = "$directory/$email";
    unless(open OUTPUT, '>>'.$filepath){
        die "Unable to create '$filepath'\n";
    }
    foreach $detail (<TEMP>){
        $detail =~ s/EMAIL/$email/g;
        $detail =~ s/(NAME|FULLNAME)/$fullname/g;
        $detail =~ s/TITLE/$title/g;
        $detail =~ s/AMOUNT/$owed/g;
        $detail =~ s{DATE}{$date}g;
        print OUTPUT $detail;
    }
    close(OUTPUT);
}
close(INFO);
close(TEMP);
As has been said, you need to open your template file again each time you read from it. There's a bunch of other issues with your code too
Always use strict and use warnings 'all' and declare every variable with my as close as possible to where it is first used
$#ARGV is the index of the last element of @ARGV, so $#ARGV < 0 is much better written as @ARGV < 1
You should use lexical file handles, and the three-parameter form of open, so open(INFO, "p5Customer.txt") should be open my $info_fh, '<', "p5Customer.txt"
You should use while instead of for to read from a file
It is easier to use the default variable $_ for short loops
It is pointless to capture a substring in a regular expression if you're not going to use it, so (NAME|FULLNAME) should be NAME|FULLNAME
There is no point in closing input files before the end of your program
It is also much better to use an existing template system, such as Template::Toolkit
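For illustration, here is a minimal Template::Toolkit sketch; the template.tt file name, the [% ... %] placeholder names, and the sample values are hypothetical stand-ins, not taken from the question:

use strict;
use warnings;
use Template;

# template.tt would contain placeholders such as:
#   Dear [% fullname %], our records show that you owe [% amount %].
my $tt = Template->new or die Template->error;
$tt->process(
    'template.tt',
    { fullname => 'Jo Smith', amount => 42 },   # per-customer values
    'Emails/jo@example.com',                    # output file name
) or die $tt->error;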
This should work for you
#!/usr/bin/perl
use strict;
use warnings 'all';

if ( @ARGV < 1 ) {
    print "Usage: proj5.pl <mm/dd/yyyy>\n";
    exit;
}
my $date = $ARGV[0];

open my $info_fh, '<', 'p5Customer.txt' or die qq{Could not open "p5Customer.txt" file: $!};

my $directory = "Emails";
mkdir $directory unless -e $directory;

while ( <$info_fh> ) {
    chomp;
    my ($email, $fullname, $title, $payed, $owed) = split /,/;
    next if $owed < $payed;

    open my $template_fh, '<', 'template.txt' or die qq{Could not open "template.txt" file: $!};

    my $filepath = "$directory/$email";
    open my $out_fh, '>', $filepath or die qq{Unable to create "$filepath": $!};

    while ( <$template_fh> ) {
        s/EMAIL/$email/g;
        s/FULLNAME|NAME/$fullname/g;
        s/TITLE/$title/g;
        s/AMOUNT/$owed/g;
        s/DATE/$date/g;
        print $out_fh $_;
    }
    close($out_fh);
}
Your problem is that the TEMP loop is inside the INPUT loop, so the template file is exhausted while the INPUT loop is still on its first line; every later customer finds nothing left to read from TEMP.
It is best to read the TEMP file's contents into a data structure once, up front, and work on that stored copy inside the INPUT loop, as sketched below.
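A minimal sketch of that idea, using an array of template lines ($email, $fullname, and $out_fh are assumed to come from the customer loop, as in the corrected code above):

# read the template once, before the customer loop
open my $template_fh, '<', 'template.txt' or die qq{Could not open "template.txt" file: $!};
my @template = <$template_fh>;
close $template_fh;

# then, once per customer, substitute into a copy of each stored line
for my $line (@template) {
    (my $detail = $line) =~ s/EMAIL/$email/g;    # copy, so @template stays pristine
    $detail =~ s/FULLNAME|NAME/$fullname/g;
    print $out_fh $detail;
}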
Good luck.

Search string with multiple words in the pattern

My program is trying to search for a string in multiple files in a directory. The code finds single-word patterns like perl but fails to find a longer string like Status Code 1.
Can you please let me know how to search for strings with multiple words?
#!/usr/bin/perl
my @list = `find /home/ad -type f -mtime -1`;
# printf("Lsit is $list[1]\n");
foreach (@list) {
    # print("Now is : $_");
    open(FILE, $_);
    $_ = <FILE>;
    close(FILE);
    unless ($_ =~ /perl/) {    # works, but fails to find string "Status Code 1"
        print "found\n";
        my $filename = 'report.txt';
        open(my $fh, '>>', $filename) or die "Could not open file '$filename' $!";
        say $fh "My first report generated by perl";
        close $fh;
    }    # end unless
}    # end For
There are a number of problems with your code
You must always use strict and use warnings at the top of every Perl program. There is little point in declaring anything with my without strict in place
The lines returned by the find command will have a newline at the end which must be removed before Perl can find the files
You should use lexical file handles (my $fh instead of FILE) and the three-parameter form of open as you do with your output file
$_ = <FILE> reads only the first line of the file into $_
unless ($_ =~ /perl/) is inverted logic, and there's no need to specify $_ as it is the default. You should write if ( /perl/ )
You can't use say unless you have use feature 'say' at the top of your program (or use 5.010, which adds all features available in Perl v5.10)
It is also best to avoid using shell commands as Perl is more than able to do anything that you can using command line utilities. In this case -f $file is a test that returns true if the file is a plain file, and -M $file returns the (floating point) number of days since the file's modification time
This is how I would write your program
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

for my $file ( glob '/home/ad/*' ) {
    next unless -f $file and -M $file < 1;    # plain files modified within the last day

    open my $fh, '<', $file or die $!;
    while ( <$fh> ) {
        if ( /perl/ ) {
            print "found\n";
            my $filename = 'report.txt';
            open my $out_fh, '>>', $filename or die "Could not open file '$filename': $!";
            say $out_fh "My first report generated by perl";
            close $out_fh;
            last;
        }
    }
}
It should have matched, unless $_ contains the text in a different case. Try this, where /i makes the match case-insensitive and \s+ allows any amount of whitespace between the words:
unless($_ =~ /Status\s+Code\s+1/i) {
Change
unless ($_ =~ /perl/) {
to:
unless ($_ =~ /(Status Code 1)/) {
I am certain the above works, except it's case sensitive.
Since you question it, I rewrote your script to make more sense of what you're trying to accomplish and to implement the above suggestion. Correct me if I am wrong, but you're trying to make a script which matches "Status Code 1" in a bunch of files last modified within the past day, and prints each matching filename to a text file.
Anyways, below is what I recommend:
#!/usr/bin/perl
use strict;
use warnings;

my $output_file = 'report.txt';
my @list = `find /home/ad -type f -mtime -1`;

foreach my $filename (@list) {
    print "PROCESSING: $filename";
    open (INCOMING, "<$filename") || die "FATAL: Could not open '$filename' $!";
    foreach my $line (<INCOMING>) {
        if ($line =~ /(Status Code 1)/) {
            open( FILE, ">>$output_file") or die "FATAL: Could not open '$output_file' $!";
            print FILE sprintf ("%s\n", $filename);
            close(FILE) || die "FATAL: Could not CLOSE '$output_file' $!";
            # Bail when we get the first match
            last;
        }
    }
    close(INCOMING) || die "FATAL: Could not close '$filename' $!";
}

In Perl, how can I filter all log files in a directory and extract interesting lines?

I'm trying to select only the .log files in my directory and then search in those files for the word "unbound" and print the entire line into a new output file with the same name as the log file (number###.log) but with a .txt extension. This is what I have so far:
#!/usr/bin/perl
use strict;
use warnings;

my $path = $ARGV[0];
my $outpath = $ARGV[1];
my @files;
my $files;
opendir(DIR, $path) or die "$!";
@files = grep { /\.log$/ } readdir(DIR);
my @out;
my $out;
opendir(OUT, $outpath) or die "$!";
my $line;
foreach $files (@files) {
    open (FILE, "$files");
    my @line = <FILE>;
    my $regex = Unbound;
    open (OUT, ">>$out");
    print grep { $line =~ /$regex/ } <>;
}
close OUT;
close FILE;
closedir(DIR);
closedir (OUT);
I'm a beginner, and I don't really know how to create a new text file with the acquired output.
Few things I'd suggest to improve this code:
declare your loop iterators within the loop. foreach my $file ( @files ) {
use 3 arg open: open ( my $input_fh, "<", $filename );
use glob rather than opendir then grep. foreach my $file ( <$path/*.txt> ) {
grep is good for extracting things into arrays (see the one-liner after this list). Your grep reads the whole file to print it, which isn't necessary. It doesn't matter much if the file is short, though.
perltidy is great for reformatting code.
you're opening 'OUT' on a directory path (I think?), which isn't going to work: $outpath is a directory, not a file. You need to do something different to send output to different files; opendir doesn't give you anything you can write to.
because you're using opendir that's actually giving you filenames - not full paths. So you might be in the wrong place to actually open the files. Prepending the path name, doing a chdir are possible solutions. But that's one of the reasons I like glob because it returns a path as well.
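For instance, a one-line sketch of the grep point above, assuming $input_fh is an already-open filehandle on one of the log files:

my @unbound_lines = grep { /Unbound/ } <$input_fh>;    # keep only matching lines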
So with that in mind - how about:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;

# Extract paths
my $input_path  = $ARGV[0];
my $output_path = $ARGV[1];

# Error if paths are invalid.
unless ( defined $input_path
    and -d $input_path
    and defined $output_path
    and -d $output_path )
{
    die "Usage: $0 <input_path> <output_path>\n";
}

foreach my $filename (<$input_path/*.log>) {

    # extract the 'name' bit of the filename.
    # be slightly careful with this - it's based
    # on an assumption which isn't always true.
    # File::Spec is a more powerful way of accomplishing this,
    # but should grab 'number####' from /path/to/file/number####.log
    my $output_file = basename( $filename, '.log' );

    # open input and output filehandles.
    open( my $input_fh,  "<", $filename ) or die $!;
    open( my $output_fh, ">", "$output_path/$output_file.txt" ) or die $!;
    print "Processing $filename -> $output_path/$output_file.txt\n";

    # iterate input, extracting into $line
    while ( my $line = <$input_fh> ) {

        # check if $line matches your RE.
        if ( $line =~ m/Unbound/ ) {

            # write it to output.
            print {$output_fh} $line;
        }
    }

    # tidy up our filehandles. Although technically, they'll
    # close automatically because they leave scope
    close($output_fh);
    close($input_fh);
}
Here is a script that takes advantage of Path::Tiny. Now, at this stage of your learning process, you are probably better off understanding @Sobrique's solution, but using modules such as Path::Tiny or Path::Class will make it easier to write these one-off scripts quickly and correctly.
Also, I didn't really test this script, so watch out for bugs.
#!/usr/bin/env perl

use strict;
use warnings;

use Path::Tiny;

run(\@ARGV);

sub run {
    my $argv = shift;

    unless (@$argv == 2) {
        die "Need source and destination paths\n";
    }

    my $it = path($argv->[0])->realpath->iterator({
        recurse         => 0,
        follow_symlinks => 0,
    });
    my $outdir = path($argv->[1])->realpath;

    while (my $path = $it->()) {
        next unless -f $path;
        next unless $path =~ /[.]log\z/;

        my $logfh   = $path->openr;
        my $outfile = $outdir->child($path->basename('.log') . '.txt');
        my $outfh;

        while (my $line = <$logfh>) {
            next unless $line =~ /Unbound/;
            unless ($outfh) {
                $outfh = $outfile->openw;
            }
            print $outfh $line;
        }

        # only close the output handle if a match caused it to be opened
        if ($outfh) {
            close $outfh
                or die "Cannot close output '$outfile': $!";
        }
    }
}
Notes
realpath will croak if the path provided does not exist.
Similarly for openr and openw.
I am reading input files line-by-line to keep the memory footprint of the program independent of the sizes of input files.
I do not open the output file until I know I have a match to print to.
When matching a file extension using a regular expression pattern, keep in mind that \n is a valid character in Unix file names, and the $ anchor will match it.
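A quick demonstration of that last point:

my $name = "sneaky.log\n";                    # a newline is legal in a Unix file name
print "matched \$\n"  if $name =~ /[.]log$/;  # matches: $ also matches just before a trailing newline
print "matched \\z\n" if $name =~ /[.]log\z/; # no match: \z anchors at the absolute end of the string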

Select rows based on text pattern

I want to extract rows from a file that match a particular pattern and I want to do this for over 500 files. It should have the ability to retain the unique name of the file as well.
I used awk, but then I have to do each file individually.
c:\>gawk "/S1901/" Census_Tract_*.csv > Census_Tract_*.csv
In the example shown in the link here (http://bit.ly/nMX8qh), I want to retain only those records that have S1901 in them. Apologies for the external link, but I am not able to retain the formatting of the table.
I found some Perl code that I used to write it, but it retains all the rows and does not select only those rows/records where the pattern matches. Any tips would be much appreciated. The Perl code is below:
#perl -w
$pattern = "Subject_Census*.csv"; # process only those files that match pattern
while (defined ($in = glob($pattern))) {
    ($out = $in) =~ s/\.csv$/.outcsv/; # read from "xyz.in" and write to "xyz.out"
    open (IN, "<", $in) or die "Can't open $in for reading: $!";
    open (OUT, ">>", $out) or die "Can't open $out for writing: $!";
    while (<IN>) {
        $mystring =~ /S1901/;
        print OUT $_ if $mystring == 0;
    }
    close (IN) or die "Can't close $in: $!"; # good idea to do some housekeeping
    close (OUT) or die "Can't close $out: $!";
}
Untested:
use strict;
use warnings;
use autodie;

my $files_list_filename = 'files.txt';
open my $fl, '<', $files_list_filename;
my @list_of_files = <$fl>;
chomp @list_of_files;
close $fl;

foreach my $file ( @list_of_files ) {
    open my $test_fh, '<', $file;
    while ( my $line = <$test_fh> ) {
        if ( $line =~ m/S1901/ ) {
            print "$file at $.: $line";
        }
    }
    close $test_fh;
}
Is that sort of what you had in mind? It opens a file named files.txt and reads in a list of however many filenames you want to give it. Then it iterates over that list, opening and scanning each file in turn; if a line containing the trigger text is found, it prints the filename and line number, as well as the matching line itself. Then it moves on to the next file.
perl -ni.bak -e 'print if /S1901/' Subject_Census*.csv
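Here -n wraps the code in an implicit while (<>) loop over every line of the named files, and -i.bak rewrites each file in place, keeping the original with a .bak extension, so each CSV ends up containing only its matching rows.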

help merging perl code routines together for file processing

I need some Perl help in putting these two processes/pieces of code to work together. I was able to get them working individually for testing, but I need help bringing them together, especially with the loop constructs. I'm not sure if I should go with foreach; anyway, the code is below.
Also, any best practices would be great too, as I'm learning this language. Thanks for your help.
Here's the process flow I am looking for:
read a directory
look for a particular file
use the file name to strip out some key information to create a newly processed file
process the input file
create the newly processed file for each input file read (if I read in 10, I create 10 new files)
Part 1:
my $target_dir = "/backups/test/";
opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";
while (defined(my $file = readdir($dh))) {
    next if ($file =~ /^\.+$/);
    # Get filename attributes
    if ($file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/) {
        print "$1\n";
        print "$2\n";
        print "$3\n";
    }
    print "$file\n";
}
Part 2:
use strict;
use Digest::MD5 qw(md5_hex);

# Create new file
open (NEWFILE, ">/backups/processed/foo$1.name.$2-foo_p$3.out") || die "cannot create file";

my $data = '';
my $line1 = <>;
chomp $line1;
my @heading = split /,/, $line1;
my ($sep1, $sep2, $eorec) = ( "^A", "^E", "^D");

while (<>)
{
    my $digest = md5_hex($data);
    chomp;
    my (@values) = split /,/;
    my $extra = "__mykey__$sep1$digest$sep2" ;
    $extra .= "$heading[$_]$sep1$values[$_]$sep2" for (0..scalar(@values));
    $data .= "$extra$eorec";
    print NEWFILE "$data";
}
#print $data;
close (NEWFILE);
You are using an old style of Perl programming. I recommend using functions and CPAN modules (http://search.cpan.org). Perl pseudocode:
use Modern::Perl;
# use...

sub get_input_files {
    # return an array of files (@)
}
sub extract_file_info {
    # takes the file name and returns an array of values (filename attrs)
}
sub process_file {
    # reads the input file, takes the previous attribs, and builds the output file
}

my @ifiles = get_input_files;
foreach my $ifile (@ifiles) {
    my @attrs = extract_file_info($ifile);
    process_file($ifile, @attrs);
}
Hope it helps
I've bashed your two code fragments together (making the second a sub that the first calls for each matching file) and, if I understood your description of the objective correctly, this should do what you want. Comments on style and syntax are inline:
#!/usr/bin/env perl

# - Never forget these!
use strict;
use warnings;

use Digest::MD5 qw(md5_hex);

my $target_dir = "/backups/test/";

opendir my $dh, $target_dir or die "can't opendir $target_dir: $!";
while (defined(my $file = readdir($dh))) {
    # Parens on postfix "if" are optional; I prefer to omit them
    next if $file =~ /^\.+$/;
    if ($file =~ /^foo(\d{3})\.name\.(\w{3})-foo_p(\d{1,4})\.\d+.csv$/) {
        process_file($file, $1, $2, $3);
    }
    print "$file\n";
}

sub process_file {
    my ($orig_name, $foo_x, $name_x, $p_x) = @_;
    my $new_name = "/backups/processed/foo$foo_x.name.$name_x-foo_p$p_x.out";

    # - From your description of the task, it sounds like we actually want to
    #   read from the found file, not from <>, so opening it here to read
    # - Better to use lexical ("my") filehandle and three-arg form of open
    # - "or" has lower operator precedence than "||", so less chance of
    #   things being grouped in the wrong order (though either works here)
    # - Including $! in the error will tell why the file open failed
    open my $in_fh, '<', $orig_name or die "cannot read $orig_name: $!";
    open my $out_fh, '>', $new_name or die "cannot create $new_name: $!";

    my $data = '';
    my $line1 = <$in_fh>;
    chomp $line1;
    my @heading = split /,/, $line1;
    my ($sep1, $sep2, $eorec) = ("^A", "^E", "^D");
    while (<$in_fh>) {
        chomp;
        my $digest = md5_hex($data);
        my (@values) = split /,/;
        my $extra = "__mykey__$sep1$digest$sep2";
        # - $#values is the last index of @values; the original's
        #   scalar(@values) ran one element past the end of both arrays
        $extra .= "$heading[$_]$sep1$values[$_]$sep2"
            for (0 .. $#values);
        # - Useless use of double quotes removed on next two lines
        $data .= $extra . $eorec;
        #print $out_fh $data;
    }

    # - Moved print to output file to here (where it will print the complete
    #   output all at once) rather than within the loop (where it would print
    #   all previous lines each time a new line is read in) to prevent
    #   duplicate output records. This could also be achieved by printing
    #   $extra inside the loop. Printing $data at the end will be slightly
    #   faster, but requires more memory; printing $extra within the loop and
    #   getting rid of $data entirely would require less memory, so that may
    #   be the better option if you find yourself needing to read huge input
    #   files.
    print $out_fh $data;

    # - $in_fh and $out_fh will be closed automatically when they go out of
    #   scope at the end of the sub, so there's no real point in explicitly
    #   closing them unless you're going to check whether the close succeeded
    #   or failed (which can happen in odd cases, usually involving full or
    #   failing disks when writing; I'm not aware of any way that closing a
    #   file open for reading can fail, so that one is being left implicit)
    close $out_fh or die "Failed to close file: $!";
}
Disclaimer: perl -c reports that this code is syntactically valid, but it is otherwise untested.