Read & seek in gzip files in Perl

I am trying to read a given set of gzip/plain XML files and print some portions of these files into output XML files, based on given offset and length values.
The offset values are the keys of the hash %offhash and the corresponding values are the lengths.
Here is the function I use to generate the output files:
sub fileproc {
    my $infile  = shift;
    my $outfile = shift;
    my $FILEH;
    $| = 1;
    $outfile =~ s/.gz$//;

    if ($infile =~ m/\.gz$/i) {
        open( $FILEH, "gunzip -c $infile | ") or die "Could not open input $infile";
    }
    else {
        open( $FILEH, "<", $infile) or die "Could not open input $infile";
    }
    open(my $OUTH, ">", $outfile) or die "Couldn't open file, $!";

    foreach my $offset (sort { $a <=> $b } keys %offhash)
    {
        my $record = "";
        seek ($FILEH, $offset, 0);
        read ($FILEH, $record, $offhash{$offset}, 0);
        print $OUTH "$record";
    }
    close $FILEH;
    close $OUTH;
}
This function works correctly for plain XML input files, but there seems to be a buffering issue when some (or all) of the inputs are .xml.gz files: in that case the output file contains data from previously read (.gz) input files.
The problem seems to be in this line:
open( $FILEH,"gunzip -c $infile | ") or die "Could not open input $infile";
Can anyone help me resolve this issue?
Thanks in advance.

You can only seek in regular files, not in the output of programs, STDIN, etc. If you want to do this, you need to add a buffering layer yourself, but note that you might need to buffer the whole uncompressed file just to be able to seek in it.
Even if you don't gunzip with an external program but use something like IO::Uncompress::Gunzip, you will not be able to seek, because of the way gzip (and other compression formats) inherently work: you need to read all of the previous data to be able to decompress the data at the current file position. There are ways around this that limit the amount of previous data required, but then you would need to prepare your gzip file specially and it would grow bigger. I'm not aware of any module that currently implements this, but I did a proof of concept once, so I know it works.
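Here is a minimal sketch of that "buffer it yourself" approach, assuming %offhash is available as in the question: decompress (or slurp) the whole file into a scalar once, then use substr with the offsets instead of seek/read. IO::Uncompress::Gunzip has been a core module since Perl 5.10.

use IO::Uncompress::Gunzip qw(gunzip $GunzipError);

sub fileproc_buffered {
    my ($infile, $outfile) = @_;
    $outfile =~ s/\.gz$//;

    # Buffer the entire (uncompressed) content in memory.
    my $data;
    if ($infile =~ m/\.gz$/i) {
        gunzip $infile => \$data
            or die "gunzip failed for $infile: $GunzipError";
    }
    else {
        open my $in, '<', $infile or die "Could not open input $infile: $!";
        local $/;              # slurp mode
        $data = <$in>;
        close $in;
    }

    open my $out, '>', $outfile or die "Couldn't open $outfile: $!";
    for my $offset (sort { $a <=> $b } keys %offhash) {
        # substr replaces the seek/read pair from the original function.
        print $out substr($data, $offset, $offhash{$offset});
    }
    close $out or die "Couldn't close $outfile: $!";
}

This keeps the whole uncompressed file in memory, which is exactly the trade-off described above.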

Related

Open a file and overwrite the file with adjustments and no backup

I have the following three lines:
rename($file_path, $file_path.'.bak');
open( my $file_IN_fh, '<' , $file_path.'.bak') || die "die message";
open( my $file_OUT_fh, '>' , $file_path) || die "die message";
It works great. It allows me to go through the input file with while (<$file_IN_fh>), make a bunch of changes with the script (s///g, if() to decide whether a line stays or not, etc.), and write to the output file. In the end I get my edited file and the file name is unchanged.
My issue is that I no longer (currently) want the backup files, so I want to replace the code with something similar that won't create the backup file, and comment the three lines back and forth over the years if my needs change.
How do I do this kind of in-place editing, not from the command line?
One basic way is to read the file line by line and write the desired output lines to a temporary file, which is then renamed so that it overwrites the original.
use File::Copy qw(move);

open my $fh,     '<', $file    or die "Can't open $file: $!";
open my $fh_out, '>', $outfile or die "Can't open $outfile: $!";

while (<$fh>) {
    next if /line_to_skip/;
    s/patt/repl/g;
    print $fh_out $_;
}
close $_ for ($fh, $fh_out);

move($outfile, $file) or die "Can't move $outfile to $file: $!";
This is what tools that edit files "in place" normally do (with additional safety, checks, and flexibility). Since $outfile is temporary, consider creating it with File::Temp.
Add checks when close-ing files.
Note that this changes the file's inode number, which may matter for some applications.†
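Here is a hedged sketch of the same idea using File::Temp, as suggested above; the only assumption beyond the snippet is that the temporary file is created in the same directory as $file, which keeps the final move on one filesystem (and therefore an atomic rename).

use File::Basename qw(dirname);
use File::Copy     qw(move);
use File::Temp     qw(tempfile);

# Create the temporary file next to the original; UNLINK => 0 because we
# rename it over the original ourselves.
my ($fh_out, $outfile) = tempfile('tmpXXXXXX', DIR => dirname($file), UNLINK => 0);

open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) {
    next if /line_to_skip/;
    s/patt/repl/g;
    print $fh_out $_;
}
close $fh     or die "Can't close $file: $!";
close $fh_out or die "Can't close $outfile: $!";

move($outfile, $file) or die "Can't move $outfile to $file: $!";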
If the file isn't huge you can simplify this and read it in first:
open my $fh, '<', $file or die "Can't open $file: $!";
my @lines = <$fh>;

open $fh, '>', $file or die "Can't open $file for writing: $!";

for (@lines) {
    next if /line_to_skip/;
    s/patt/repl/g;
    print $fh $_;
}
close $fh;
This preserves the inode number, since opening with > truncates the data in the existing inode.
† If this is indeed a problem, you can still keep the same inode: after the temporary file is written, open it for reading and open the original file for writing, which truncates the contents of that inode. Then copy the temporary file over the original, close the handles, and delete the temporary file.
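A short sketch of that copy-back variant, reusing $file and $outfile from above (File::Copy's copy opens the destination for writing, so the original inode is kept):

use File::Copy qw(copy);

copy($outfile, $file) or die "Can't copy $outfile to $file: $!";
unlink $outfile       or die "Can't remove $outfile: $!";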
If the file is huge, then I'd question why you'd want to avoid the temporary file. Otherwise, I'd suggest just loading the file into memory, making the modifications, then writing it back out.
use File::Slurp qw( read_file write_file );

my $in = read_file($qfn, array_ref => 1);

my @out;
while (defined( $_ = shift(@$in) )) {
    s/a/b/g;                 # For example.
    push @out, $_ if /c/;    # For example.
}

write_file($qfn, \@out);
I avoided using expensive splice by using two arrays.
Note that using Tie::File might save one line of code, but this will be 30x faster[1], and probably use less memory (despite memory-saving being Tie::File's goal). Tie::File is never the answer!!!
[1] This is not necessarily representative of all Tie::File uses, but I have indeed timed Tie::File taking 30x longer than the alternative at some basic task. That means that 2 seconds' worth of work would have taken a minute with Tie::File!
Take a look at the Tie::File module. It is a core module and so shouldn't need installing, and the code is as simple as
use Tie::File;
tie my @file, 'Tie::File', $filepath or die $!;
Thereafter the array @file will hold the contents of the file, one line per element, and any changes to the array will be reflected in the file. All array operations such as push, splice, etc. will work fine.
Note that line one of the file is in element zero of the array, etc.
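A hedged sketch of the question's edits done through the tied array; /line_to_skip/ and s/patt/repl/ are the same placeholders used in the earlier answer:

use Tie::File;

tie my @file, 'Tie::File', $filepath or die $!;

@file = grep { !/line_to_skip/ } @file;   # drop unwanted lines
s/patt/repl/g for @file;                  # edit the remaining lines in place

untie @file;                              # flush the changes to disk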

Reading a CSV file and writing to a CSV file

use Text::CSV;
$csv = Text::CSV->new;
open(HIGH, "+>Hardtest.csv") || die "Cannot open ticket $!\n"; #reads the high file
while(<HIGH>)
{
    print "Printing High Priority Tickets ...\n";
    sleep(1);
    print <HIGH>;
}
close(HIGH);
Here is my code. I am trying to read a CSV file and write to it; however, I can't seem to read the CSV file. Any help would be appreciated, thanks!
OK, lots of things here.
Always use strict and use warnings.
You're opening the CSV file in read/write clobber mode (+>). Don't do that if you're just reading from it.
Don't use || die, use or die.
Finally, don't print <HIGH>; print $_ instead.
I've modified your code a bit:
#!/usr/bin/perl -w
use strict;
use Text::CSV;

my $csv = Text::CSV->new;

open(HIGH, "+<Hardtest.csv") or die "Cannot open ticket $!\n"; # reads the high file
while (<HIGH>)
{
    print "Printing High Priority Tickets ...\n";
    print $_;
    sleep(1);
}
print HIGH "9,10,11,12\n";
close(HIGH);
Let me explain:
1. "+>" opens the file in read/write mode BUT also truncates (overwrites) the existing file; hence, in your code, the while loop is never entered. I've changed that to "+<", which opens the file for read/write without truncating it, so the existing contents can be read first.
2. The second-to-last statement in the code above appends new content to the CSV file: after the read loop the file pointer is at the end of the file, so the print lands there.
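Also note that the code above loads Text::CSV but never actually uses it. A minimal sketch of parsing each row with it (the file name Hardtest.csv is taken from the question; the field handling is just an illustration):

use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1, auto_diag => 1 });

open my $fh, '<', 'Hardtest.csv' or die "Cannot open ticket: $!";
while (my $row = $csv->getline($fh)) {
    print join(' | ', @$row), "\n";   # do something with the parsed fields
}
close $fh;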

In Perl - Need to modify a script to parse all files in a directory

I have little to no Perl experience, so any assistance is much appreciated. I'm sorry if I'm not giving clear information in the question as I do not have a programming background.
I have a script that parses a text file, checks for a certain number of data points in it, then outputs "# of data points = X". I can get this to run on a single text file, and I can get it to output to a text file, which is great.
However, there are 138 text files, all in one directory, that I need to parse and analyze. I'm wondering whether, rather than running this script 138 times, I can modify it to go to the directory, run on each file in it, and write the combined results to one text file.
I didn't write the original script; I inherited it and just barely managed to figure out how to get it to run on a single text file.
You can also do a glob, like so:
my @files = </path/where/files/are/*>;

foreach my $file (@files) {
    print "working on $file...\n";
    # do stuff with $file;
}
If your problem is how to open 138 files in the same directory, you can open them one by one using the opendir function. Here is an example of a script that opens every file and prints all of its lines:
#!/usr/bin/perl
use strict;
use warnings;

my $directory = '/tmp';

opendir (DIR, $directory) or die $!;
while (my $file = readdir(DIR)) {
    next unless -f "$directory/$file";   # skip . , .. and subdirectories
    print "$file\n";
    # readdir returns bare names, so prefix the directory when opening
    open (FILE, '<', "$directory/$file") or die $!;
    while (<FILE>) {
        print $_;
    }
    close(FILE);
}
closedir(DIR);
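And a hedged sketch of what the question actually asks for: run the per-file analysis on every text file in a directory and collect the results in one output file. count_data_points() is a hypothetical stand-in for the logic in the inherited script, and the directory and glob pattern are assumptions to adjust:

use strict;
use warnings;

my $dir = '/path/to/text/files';    # adjust to your directory

open my $report, '>', 'results.txt' or die "Cannot open results.txt: $!";

for my $file (glob "$dir/*.txt") {
    open my $fh, '<', $file or die "Cannot open $file: $!";
    my $count = count_data_points($fh);   # hypothetical: your existing parsing logic
    close $fh;
    print $report "$file: # of data points = $count\n";
}

close $report;

# Hypothetical placeholder; replace with the parsing from the inherited script.
sub count_data_points {
    my ($fh) = @_;
    my $n = 0;
    while (<$fh>) {
        $n++ if /\S/;                     # e.g. count non-blank lines
    }
    return $n;
}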

How to read and write a file, syntax wrong

My script ends up appending the new changes I wanted to make to the end of the file, instead of modifying the file's contents in place.
open (INCONFIG, "+<$text") or die $!;
@config = <INCONFIG>;

foreach(@config)
{
    if ( $_ =~ m/$checker/ )
    {
        $_ = $somethingnew;
    }
    print INCONFIG $_;
}
close INCONFIG or die;
close INCONFIG or die;
Ultimately I want to rewrite the whole file, but with certain strings modified where they match the search criterion. But so far it only appends ANOTHER COPY of the entire file (with changes) to the bottom of the old file.
I know that I can just close the file and write the result out through a second, write-only filehandle, but I was hoping to learn what I did wrong and how to fix it.
As I understand open, using read/write access on a text file isn't a good idea. After all, a file is just a byte stream: updating part of the file with something of a different length is the stuff headaches are made of ;-)
Here is my approach: try to emulate Perl's -i "in-place" switch. So essentially we write to a backup file, which we will later rename. (On *nix systems there is some magic with open filehandles keeping deleted files available, so we don't strictly have to create a new file. Let's do it anyway.)
my $filename = ...;
my $tempfile = "$filename.tmp";

open my $inFile,  '<', $filename or die $!;
open my $outFile, '>', $tempfile or die $!;

while (my $line = <$inFile>) {
    $line = doWeirdSubstitutions($line);
    print $outFile $line;
}

close $inFile  or die $!;
close $outFile or die $!;

rename $tempfile, $filename
    or die "rename failed: $!"; # will break under weird circumstances.
# no unlink needed: the rename has already moved the temp file over the original
Untested, but obvious code. Does this help with your problem?
Your problem is a misunderstanding of what +< ("open for update") does. It is discussed in the Perl Tutorial under "Mixing Reads and Writes".
What you really want to do is copy the old file to a new file and then rename it after the fact. This is discussed in perlfaq5, as mentioned by daxim. There are also entire modules dedicated to doing this safely, such as File::AtomicWrite. These help with the issue of your program aborting and leaving you with a clobbered file.
As others have pointed out, there are better ways :)
But if you really want to read and write using +<, you should remember that after reading the file you're at the end of it. That explains why your output is appended after the original content.
What you need to do is reset the file pointer to the beginning of the file, using seek:
seek(INCONFIG, 0, 0);
Then start writing...
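A minimal sketch of that approach, reusing the question's variables. The truncate() is worth adding: if the rewritten content is shorter than the original, leftover bytes would otherwise remain at the end of the file.

open(INCONFIG, '+<', $text) or die $!;
my @config = <INCONFIG>;

seek(INCONFIG, 0, 0);                           # back to the beginning
for (@config) {
    $_ = $somethingnew if /$checker/;
    print INCONFIG $_;
}
truncate(INCONFIG, tell(INCONFIG)) or die $!;   # drop any leftover tail
close INCONFIG or die $!;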
perlopentut says this about mixing reads and writes:
In fact, when it comes to updating a file, unless you're working on a binary file as in the WTMP case above, you probably don't want to use this approach for updating. Instead, Perl's -i flag comes to the rescue.
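Since the original question asks how to do this kind of editing in place without the command line, here is a hedged sketch of driving the same -i machinery from inside a script, via $^I and a localized @ARGV ($text, $checker, and $somethingnew are the question's variables):

{
    local $^I   = '';        # in-place edit; '' means no backup file
    local @ARGV = ($text);   # the file(s) to edit
    while (<>) {
        s/$checker/$somethingnew/g;
        print;               # printed lines go back into the file
    }
}

On platforms where in-place editing without a backup is not supported, set $^I to a backup extension such as '.bak' instead.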
Another way is to use the Tie::File module. The code reduces to just this:
tie my @config, 'Tie::File', $text or die $!;
s/$checker/$somethingnew/g for @config;
But remember to back the file up before you modify it until you have debugged your program.

Perl appending issues

I have some code that appends to several files inside nested for loops. After exiting the loops, I want to append ".end" to all of the files.
foreach my $file (@SPICE_FILES)
{
    open(FILE1, ">>$file") or die "[ERROR $0] cannot append to file : $file\n";
    print FILE1 "\n.end\n";
    close FILE1;
}
I noticed that in some strange cases the ".end" is appended into the middle of the files!
How do I resolve this?
Since I do not yet have the comment privilege, I'll have to write this as an 'answer'.
Do you use any dodgy modules?
I have run into issues where (obviously) broken Perl modules have done something to the output buffering. For me, placing
$| = 1;
in the code has helped. That statement turns off Perl's output buffering for the currently selected handle (AFAIK). It might have had other effects too, but I have not seen anything negative come out of it.
I guess you've got data buffered in some previously opened file descriptors. Try closing them before re-opening:
open my $fd, ">>", $file or die "Can't open $file: $!";
print $fd $data;
close $fd or die "Can't close: $!";
Better yet, you can push those filehandles onto an array (or into a hash) and write to them during cleanup:
push @handles, $fd;
# later
print {$_} "\n.end\n" for @handles;
Here's a case to reproduce the "impossible" append in the middle:
#!/usr/bin/perl -w
use strict;
my $file = "file";
open my $fd, ">>", $file;
print $fd "begin"; # no \n -- write buffered
open my $fd2, ">>", $file;
print $fd2 "\nend\n";
close $fd2; # file flushed on close
# program ends here -- $fd finally closed
# you're left with "end\nbegin"
It’s not possible to append something to the middle of the file. The O_APPEND flag guarantees that each write(2) syscall will place its contents at the old EOF and update the st_size field by incrementing it by however many bytes you just wrote.
Therefore if you find that your own data is not showing up at the end when you go to look at it, then another agent has written more data to it afterwards.
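For the original question, then, a practical precaution is to make sure every handle that writes to the SPICE files is flushed (or closed) before the final ".end" pass. Here is a minimal sketch of per-handle autoflush, using the question's append step:

use IO::Handle;   # provides the autoflush method on lexical filehandles

open my $fh, '>>', $file or die "Cannot append to $file: $!";
$fh->autoflush(1);               # flush each print immediately
print {$fh} "\n.end\n";
close $fh or die "Cannot close $file: $!";

Apply the same autoflush (or an explicit close) to the handles opened inside the nested loops, so their buffered output cannot land after the ".end" lines.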