Reading and Writing line by line from/to the same file - perl

I'm working with xml-files which I need to manipulate in my script. My first approach on this was:
qx(perl export_xml.pl $export_params > $path$prefix\investment.xml); # Create the xml-file
open DERI, '+<'.$path.$prefix.'investment.xml' or die 'Can\'t open investment.xml: '.$!;
my @derivative_xml = <DERI>;
seek(DERI, 0, 0);
foreach (@derivative_xml) {
    $_ =~ s/^\s*$//g;
    $_ =~ s/^.*detected on Server.*$//g;
    $_ = encode('utf8', $_);
}
print DERI join('', @derivative_xml);
This is working for testing purposes, but unfortunately the real files are just too big for that (up to 6GB).
Is there a way to read the file line by line and then modify the input through the filehandle? Something like
foreach (<DERI>) {   # instead of looping over @derivative_xml
    $_ =~ s/^\s*$//g;
    $_ =~ s/^.*detected on Server.*$//g;
    $_ = encode('utf8', $_);
}
I can't really test that in a non-ridiculous amount of time, so it would be pretty nice if I didn't have to resort to trial and error here.
Thanks in advance!

This should work. No need for another script file.
perl -MEncode -pi -e 's/^\s*$//; s/^.*detected on Server.*$//; $_ = encode("utf8", $_)' investment.xml
I did not test it with a huge file of up to 6 GB, though. Test this and check how long it takes.
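For reference, here is a minimal in-script sketch of the same idea (untested on files that large; the $path and $prefix from the question are omitted): it reads line by line and uses Perl's in-place edit variable $^I, the same machinery behind -i/-pi, so only one line is held in memory at a time.

#!/usr/bin/perl
use strict;
use warnings;
use Encode;

{
    local @ARGV = ('investment.xml');   # $path and $prefix from the question omitted
    local $^I   = '.bak';               # in-place edit; original kept as investment.xml.bak
    while (<>) {
        s/^\s*$//;
        s/^.*detected on Server.*$//;
        print encode('utf8', $_);       # print goes to the replacement file
    }
}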

Related

Avoiding regex match variable being reused

Basically, I'm looping through HTML files and looking for a couple of regexes. They match, which is fine, but I don't expect every file to contain matches; yet when the loop runs, every iteration reports the same match (despite it not being in that file). I assume that by using $1 the value is persisting through each iteration.
I've tried using an arbitrary regex straight after each real match to reset it, but that doesn't seem to work. The thread I got that idea from had a lot of argument about best practice and the original question's problem, so I thought it would be worth asking for advice specific to my code. It's likely not written in a great way either:
# array of diff filenames
opendir(TDIR, "$folder/diff/$today") || die "can't opendir $today: $!";
@diffList = grep !/^\.\.?$/, readdir(TDIR);
closedir TDIR;

# List of diff files
print "List of Diff files:\n" . join("\n", @diffList) . "\n\n";

for($counter = 0; $counter < scalar(@diffList); $counter++) {
    # Open diff file, read in to string
    $filename = $diffList[$counter];
    open FILE, "<", "$folder/diff/$today/$filename";
    while(<FILE>) {
        $lines .= $_;
    }
    close FILE or warn "$0: close today/$filename: $!";

    # Use regular expressions to extract the found differences
    if($lines =~ m/$plus1(.*?)$span/s) {
        $plus = $1;
        "a" =~ m/a/;
    } else { $plus = "0"; }
    if($lines =~ m/$minus1(.*?)$span/s) {
        $minus = $1;
        "a" =~ m/.*/;
    } else { $minus = "0"; }

    # If changes were found, send them to the database
    if($plus ne "0" && $minus ne "0") {
        # Do stuff
    }
    $plus = "0";
    $minus = "0";
}
If I put a print inside the "do stuff" if, it's always true and always shows the same two values that are found in one of the files.
Hopefully I've explained my situation well enough. Any advice is appreciated, thanks.
It may be that your code appends lines from newly-read files onto $lines. Have you tried explicitly clearing it after each iteration?
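For example, at the top of the loop body (a one-line sketch reusing the question's variable):

$lines = "";   # reset the accumulator before reading the next diff file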
It's already been answered, but you could also consider a different syntax for reading the file. It can be noticeably quicker and helps you avoid little bugs like this.
Just add this to read the file between the open/close:
local $/ = undef;
$lines = <FILE>;
That'll temporarily unset the line separator so it reads the whole file at once. Just enclose it in a { } block if you need to read another file in the same scope.
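Applied to the loop from the question, that might look like this (a sketch, untested; note that the plain assignment also replaces the .= append, so $lines no longer accumulates across files):

open FILE, "<", "$folder/diff/$today/$filename";
{
    local $/ = undef;   # slurp mode, restored at the end of this block
    $lines = <FILE>;    # read the whole file at once, overwriting the previous contents
}
close FILE or warn "$0: close today/$filename: $!";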

Reading file line by line iteration issue

I have the following simple piece of code (identified as the problem piece of code and extracted from a much larger program).
Is it me, or can you see an obvious error in this code that is stopping it from matching against $variable and printing "FOUND" when it definitely should?
Nothing is printed when I try to print $variable, and there are definitely matching lines in the file I am using.
The code:
if (defined $var) {
    open (MESSAGES, "<$messages") or die $!;
    my $theText = $mech->content( format => 'text' );
    print "$theText\n";
    foreach my $variable (<MESSAGES>) {
        chomp ($variable);
        print "$variable\n";
        if ($theText =~ m/$variable/) {
            print "FOUND\n";
        }
    }
}
I have located this as the point at which the error is occurring but cannot understand why.
There may be something I am totally overlooking, as it's very late.
Update: I have since realised that I misread your question and this probably doesn't solve the problem. However, the points are valid, so I am leaving them here.
You probably have regular expression metacharacters in $variable. The line
if ($theText =~ m/$variable/) { ... }
should be
if ($theText =~ m/\Q$variable/) { ... }
to escape any that there are.
But are you sure you don't just want eq?
In addition, you should read from the file using
while (my $variable = <MESSAGES>) { ... }
as a for loop will unnecessarily read the entire file into memory. And please use a better name than $variable.
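Putting both suggestions together, the loop from the question might look like this (a sketch, untested, reusing the question's MESSAGES filehandle and $theText):

while (my $line = <MESSAGES>) {
    chomp $line;
    print "FOUND\n" if $theText =~ /\Q$line/;   # \Q escapes regex metacharacters in the line
}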
This works for me. Am I missing the question at hand? You're just trying to match $theText against each line in the file, right?
#!/usr/bin/perl
use warnings;
use strict;

my $filename = $ARGV[0] or die "$0 filename\n";
open my $fh, "<", $filename or die "$filename: $!";

my $match_text = "whatever";
my $matched = '';

# I would use a while loop here, out of habit:
#while (my $line = <$fh>) {
foreach my $line (<$fh>) {
    $matched = $line =~ m/$match_text/ ? "Matched" : "Not matched";
    print $matched . ": " . $line;
}
close $fh;
./test.pl testfile
Not matched: this is some textfile
Matched: with a bunch of lines or whatever and
Not matched: whatnot....
Edit: Ah, I see. Why don't you try printing before and after the chomp() and see what you get? That shouldn't be the issue, but it doesn't hurt to test each case.

Read the last line of file with data in Perl

I have a text file to parse in Perl. I parse it from the start of file and get the data that is needed.
After all that is done I want to read the last line in the file with data. The problem is that the last two lines are blank. So how do I get the last line that holds any data?
If the file is relatively short, just read on from where you finished getting the data, keeping the last non-blank line:
use autodie ':io';
open(my $fh, '<', 'file_to_read.txt');
# get the data that is needed, then:
my $last_non_blank_line;
while (my $line = readline $fh) {
    # choose one of the following two lines, depending what you meant
    if ( $line =~ /\S/ )   { $last_non_blank_line = $line }   # line isn't all whitespace
    # if ( $line !~ /^$/ ) { $last_non_blank_line = $line }   # line has no characters before the newline
}
If the file is longer, or you may have passed the last non-blank line in your initial data gathering step, reopen it and read from the end:
use File::ReadBackwards;

my $backwards = File::ReadBackwards->new( 'file_to_read.txt' );
my $last_non_blank_line;
do {
    $last_non_blank_line = $backwards->readline;
} until ! defined $last_non_blank_line || $last_non_blank_line =~ /\S/;
perl -e 'while (<>) { if (/\S/) { $last = $_; } } print $last;' < my_file.txt
You can use the module File::ReadBackwards in the following way:
use File::ReadBackwards;

$bw = File::ReadBackwards->new('filepath')
    or die "can't read file";
while( defined( $log_line = $bw->readline ) ) {
    print $log_line;
    exit 0;
}
If the trailing lines may be blank, just add a check that $log_line contains more than a "\n" before you print and exit.
If the file is small, I would store it in an array and read from the end. If its large, use File::ReadBackwards module.
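For the small-file case, that could look something like this (a sketch; the filename is assumed):

open my $fh, '<', 'file_to_read.txt' or die $!;
my @lines = <$fh>;
close $fh;
my ($last_non_blank) = grep { /\S/ } reverse @lines;   # first non-blank line counting from the end
print $last_non_blank;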
Here's my variant of command line perl solution:
perl -ne 'END {print $last} $last= $_ if /\S/' file.txt
No one mentioned Path::Tiny. If the file size is relativity small you can do this:
use Path::Tiny;
my $file = path($file_name);
my ($last_line) = $file->lines({count => -1});
CPAN page.
Just remember that for a large file, as @ysth said, it's better to use File::ReadBackwards. The difference can be substantial.
Sometimes it is more comfortable for me to run shell commands from Perl code, so I'd prefer the following to resolve the case:
$result = `tail -n 1 /path/file`;

Perl script (or anything) to total up CSV column

I wrote (with lots of help from others) an awk command to total up a column in a CSV file. Unfortunately, I learned after some Googling that awk isn't great at handling CSV files because the separator is not always the same (i.e. commas should be ignored when surrounded by quotes).
It seems that perhaps a Perl script could do better. Would it be possible to have a one-line Perl script (or something nearly as succinct) that achieves the same thing as this awk command that totals up the 5th column of a CSV file?
cat file.csv | awk -F "\"*,\"*" '{s+=$5} END {printf("%01.2f\n", s)}'
I'm not married to Perl in particular but I was hoping to avoid writing a full-blown PHP script. By this time I could have easily written a PHP script, but now that I've come this far, I want to see if I can follow it through.
You need to use a decent CSV parser to deal with all the complexities of the CSV format. Text::CSV_XS (or Text::CSV if that's not available) is one of the preferred ones.
perl -e '{use Text::CSV_XS; my $csv=Text::CSV_XS->new(); open my $fh, "<", "file.csv" or die "file.csv: $!"; my $sum = 0; while (my $row = $csv->getline ($fh)) {$sum += $row->[4]}; close $fh; print "$sum\n";}'
Here's the actual Perl code, for better readability
use Text::CSV_XS;                                      # use the parser library
my $csv = Text::CSV_XS->new();                         # Create parser object
open my $fh, "<", "file.csv" or die "file.csv: $!";    # Open the file.
my $sum = 0;
while (my $row = $csv->getline($fh)) {                 # $row is an array ref of field values now
    $sum += $row->[4];
}
close $fh;
print "$sum\n";
The above could be shortened by using slightly lesser quality but denser Perl:
cat file.csv | perl -MText::CSV_XS -nae '$csv=Text::CSV_XS->new();
$csv->parse($_); @f=$csv->fields(); $s+=$f[4]} { print "$s\n"'
Are you opposed to using a Perl module? You can use Text::CSV to do this easily without rolling your own parser.
Tutorial snippet changed to perform total:
# ... some tutorial code omitted
while (<CSV>) {
    if ($csv->parse($_)) {
        my @columns = $csv->fields();
        $total += $columns[4];
    } else {
        my $err = $csv->error_input;
        print "Failed to parse line: $err";
    }
}
print "total: $total\n";
Python
import csv
with open( "some_file.csv", "rb" ) as source:
    rdr = csv.reader( source )
    col_5 = 0
    for row in rdr:
        col_5 += float(row[4])   # 5th column is index 4; convert from string before adding
    print col_5
Not a one-liner, but pretty terse.
There are a number of tools that do this. A quick search for 'cli csvparser' led me to several tools (which I apparently can't link to--possibly to prevent spamming).
I installed the first one I found--csvtool--and was able to do a similar command line as yours and get a total.
Pretty short (and fast) solution:
perl -MText::CSV_XS -E'$c=new Text::CSV_XS;$s+=$r->[4]while$r=$c->getline(*ARGV);say$s' file.csv

How can I search multiple files for a string in Perl?

My question is probably simple but I'm a complete newbie. I want to search the contents of multiple text files for a particular phrase and then display the lines of the finds on screen. I've already learnt how to deal with a single file. For example, if I want to search for a word, say "Okay" in a text file named "wyvern.txt" in the root directory of F. The following code works:
#!/usr/bin/perl
$file = 'F:\wyvern.txt';
open(txt, $file);
while($line = <txt>) {
print "$line" if $line =~ /Okay/;
}
close(txt);
But what should I do if I want to search for the same phrase in two text files, say "wyvern" and "casanova" respectively? Or how about all the files in the directory "novels" in the root directory of F?
Any help would be greatly appreciated.
Thanks in advance
Mike
Edit:
Haha, I finally figured out how to search all the files in a directory for a pattern match:)
The following code works great:
#!/usr/bin/perl
@files = <F:/novels/*>;
foreach $file (@files) {
    open (FILE, "$file");
    while($line = <FILE>) {
        print "$line" if $line =~ /Okay/;
    }
    close FILE;
}
Extending the good answer provided by Jonathan Leffler:
The filename where the match was found is in $ARGV, and with a small change, the line number can be found in $.. Example:
while (<>) {
    print "$ARGV:$.:$_" if /pattern/;
} continue {
    close ARGV if eof;  # Reset $. at the end of each file.
}
Furthermore, if you have a list of filenames and they're not on the commandline, you can still get the magic ARGV behavior. Watch:
{
    local @ARGV = ('one.txt', 'two.txt');
    while (<>) {
        print "$ARGV:$.:$_" if /Okay/;
    } continue {
        close ARGV if eof;
    }
}
Which is a generally useful pattern for doing line-by-line processing on a series of files, whatever it is -- even if I might recommend File::Grep or App::Ack for this specific problem :)
On a system where command line arguments are properly expanded, you can use:
[sinan@host:~/test]$ perl -ne 'print "$.:$_" if /config/' *
1:$(srcdir)/config/override.m4
The problem with Windows is:
C:\Temp> perl -ne "print if /perl/" *.txt
Can't open *.txt: Invalid argument.
On Windows, you could do:
C:\Temp> for %f in (*.txt) do perl -ne "print if /perl/" %f
But, you might just want to use cmd.exe builtin findstr or the grep command line tool.
The easiest way is to list the files on the command line, and then simply use:
while (<>)
{
print if m/Okay/;
}
File::Grep is what you need here
Just a tweak on your line <F:/novels/*>: I prefer to use the glob keyword - it works the same in this context and avoids the chance of confusing the many different uses of angle brackets in Perl, i.e.:
#files = glob "F:/novels/*";
See perldoc glob for more.
put the files in a for loop, or something along those lines:
i.e.
for $file ('F:\wyvern.txt', 'F:\casanova.txt') {
    open(TXT, $file);
    while($line = <TXT>) {
        print "$line" if $line =~ /Okay/;
    }
    close TXT;
}
Okay, I'm a complete dummy. But to sum up, I can now search one single text file or multiple text files for a specified string. I'm still trying to figure out how to deal with all the files in one folder (see the sketch at the very end).
The following code works.
Code 1:
#!/usr/bin/perl
$file = 'F:\one.txt';
open(txt, $file);
while($line = <txt>) {
print "$line" if $line =~ /Okay/;
}
close(txt);
Code 2:
#!/usr/bin/perl
{
    local @ARGV = ('F:\wyvern.txt', 'F:\casanova.txt');
    while (<>) {
        print "$ARGV:$.:$_" if /Okay/;
    } continue {
        close ARGV if eof;
    }
}
Thanks again for your help. I really appreciate it.
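For the folder case, a minimal sketch (untested; the folder path is assumed from the earlier Edit) that combines the glob line from above with the @ARGV trick:

#!/usr/bin/perl
use strict;
use warnings;

{
    local @ARGV = glob "F:/novels/*";   # every file in the folder (path assumed)
    while (<>) {
        print "$ARGV:$.:$_" if /Okay/;
    } continue {
        close ARGV if eof;              # reset $. for each file
    }
}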