How do I access gzipped files without creating additional processes?

My application reads and writes a lot of medium to large files. I would like to store these in gzipped format to save disk space and network time.
One way to do it is with this:
sub fopen {
my $mode = shift;
my $filename = shift;
if ($filename =~ /\.gz$/) {
if ($mode eq "<") {
open(my $fp, "-|", "/usr/bin/gzcat $filename");
#my $fp = gzopen($filename, "rb") ;
return $fp;
}
if ($mode eq ">") {
open(my $fp, "|-", "/usr/bin/gzip > $filename");
#my $fp = gzopen($filename, "wb") ;
return $fp;
}
} else {
open(my $fp, $mode, $filename);
return $fp;
}
}
I can then change my existing code simply by swapping the calls to open.
As is apparent from the function, I've also thought of using the zlib/compress library. The problem is that the result can't be passed around as a file pointer.
Is there a way to do this that doesn't involve creating a bunch of extra processes?

From the documentation of IO::Uncompress::Gunzip:
use IO::Uncompress::Gunzip qw($GunzipError);
my $z = IO::Uncompress::Gunzip->new( $input )
or die "IO::Uncompress::Gunzip failed: $GunzipError\n";
The variable $z is now a file handle that you can use as usual.
while (<$z>) {...}
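As a rough sketch of how the question's fopen could be written without spawning gzcat or gzip, the version below returns IO::Uncompress::Gunzip and IO::Compress::Gzip objects, which can be used with readline, print and close like ordinary file handles. Using IO::Compress::Gzip for the write side is an assumption (the answer above only shows reading), and the error messages are illustrative.

```perl
use strict;
use warnings;
use IO::Compress::Gzip     qw($GzipError);
use IO::Uncompress::Gunzip qw($GunzipError);

# Open $filename for reading ('<') or writing ('>').
# Names ending in .gz are compressed or decompressed in-process.
sub fopen {
    my ($mode, $filename) = @_;

    if ($filename =~ /\.gz\z/) {
        if ($mode eq '<') {
            my $fh = IO::Uncompress::Gunzip->new($filename)
                or die "Cannot read $filename: $GunzipError\n";
            return $fh;
        }
        if ($mode eq '>') {
            my $fh = IO::Compress::Gzip->new($filename)
                or die "Cannot write $filename: $GzipError\n";
            return $fh;
        }
        die "Unsupported mode '$mode' for $filename\n";
    }

    # Plain files go through the built-in open.
    open(my $fh, $mode, $filename)
        or die "Cannot open $filename: $!\n";
    return $fh;
}
```

Existing callers should keep working as long as they only read lines from, print to, and close the returned handle; code that relies on seek or fileno may behave differently on the compressed handles.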

Just to add some information to the previous answers: in an old benchmark I ran, PerlIO::gzip was faster than IO::Uncompress::Gunzip.

Look at the IO::* namespace in your version of Perl.
For example, the Perl shipped with Debian old-stable (5, Lenny) and later versions includes IO::Uncompress::Gunzip and IO::Uncompress::AnyUncompress.
#!/usr/bin/perl
use strict ;
use warnings ;
use IO::Uncompress::Gunzip qw(gunzip $GunzipError);
my $input = "file1.txt.gz";
my $output = "file1.txt";
gunzip $input => $output
or die "gunzip failed: $GunzipError\n";

Related

redirect output files in different directory

I'm new to Perl and trying to put output files in a different directory. The relevant piece of code is below:
use File::Basename;
use File::Copy;
use File::Path;
foreach my $file (sort({uc($a) cmp uc($b)} keys(%$ch_ref))) {
my $num = keys(%{$$ch_ref{$file}});
print "\n -> $string $file ($num):\n";
foreach my $sid (keys(%{$$ch_ref{$file}})) {
if ($type == $PRINT_OLD) {
open ( my $output, '>>',$file );
print {$output} " something";
close ( $output ) or die $!;
}

The third argument to open() is the full path to the file that you want to open. Currently you're giving it just the filename, but you can expand that to include the directory as well.
Something like this:
my $dir = '/path/to/some/directory';
open my $output, '>>', $dir . '/' . $string . '_' . $file;
You should really be checking the success of the open() call, and it's a bit easier to give a sensible error message if you build the filename into a variable first.
my $dir = '/path/to/some/directory';
my $filename = "$dir/${string}_$file";
open my $output, '>>', $filename
or die "Can't open $filename: $!";
Note that using ${string} instead of $string means that you can use it directly in the string without the name getting tangled up with the following _ character.
I'd also strongly recommend dropping your use of prototypes on your subroutine. Perl prototypes are often far more trouble than they are worth.
Also, there's no need to open() and close() your file so many times. Just open it at the top of the loop (it will be automatically closed at the end as $output goes out of scope).
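A minimal sketch of that last suggestion, combined with the directory prefix: the output file is opened once before the loop and written to inside it. The temporary directory, the file name and the loop data are hypothetical stand-ins for the question's variables.

```perl
use strict;
use warnings;
use File::Spec;
use File::Temp qw(tempdir);

# Stand-in for the real output directory.
my $dir      = tempdir(CLEANUP => 1);
my $filename = File::Spec->catfile($dir, 'report_example.txt');

# Open once, in append mode, before the loop starts.
open my $output, '>>', $filename
    or die "Can't open $filename: $!";

for my $sid (qw(a b c)) {    # hypothetical loop data
    print {$output} "something for $sid\n";
}

# Closing explicitly lets us report write errors.
close $output
    or die "Can't close $filename: $!";
```

File::Spec->catfile inserts the directory separator, so it does not matter whether the directory name ends in a slash.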

What produces the white space in my Perl program?

As the title says, I have a program, or rather two functions, that read a file into an array and write an array to a file. The main reason I am writing: each time I run the test program that exercises these functions, it produces more and more white space. Could somebody explain my mistake and correct me?
My code:
Helper.pm:
#!/usr/bin/env perl
package KconfCtl::Helper;
sub file_to_array($) {
my $file = shift();
my ( $filestream, $string );
my @rray;
open( $filestream, $file ) or die("cant open $file: $!");
@rray = <$filestream>;
close($filestream);
return @rray;
}
sub array_to_file($$;$) {
my @rray = @{ shift() };
my $file = shift();
my $mode = shift();
$mode='>' if not $mode;
my $filestream;
if ( not defined $file ) {
$filestream = STDOUT;
}
else {
open( $filestream, $mode, $file ) or die("cant open $file: $!");
}
my $l = @rray; print $l,"\n";
foreach my $line (@rray) {
print $filestream "$line\n";
}
close($filestream);
}
1;
test_helper.pl:
use KconfCtl::Helper;
use strict;
my @t;
@t= KconfCtl::Helper::file_to_array("kconf.test");
#print @t;
my $t_index=@t;
@t[$t_index]="n";
KconfCtl::Helper::array_to_file(\@t, "kconf.test", ">");
The result after the first run:
n
and the 2nd run:
n
n

When you read from a file, the data includes the newline characters at the end of each line. You're not stripping those off, but you are adding an additional newline when you output your data again. That means your file gains additional blank lines each time you read and write it.
Also, you must always use strict and use warnings 'all' at the top of every Perl script; you should avoid using subroutine prototypes; and you should declare all of your variables as late as possible.
Here's a more idiomatic version of your module code which removes the newlines on input using chomp. Note that you don't need the #! line on the module file as it won't be run from the command line, but you may want it on the program file. It's also more normal to export symbols from a module using the Exporter module so that you don't have to qualify the subroutine names by prefixing them with the full package name.
use strict;
use warnings 'all';
package KconfCtl::Helper;
sub file_to_array {
my ($file) = @_;
open my $fh, '<', $file or die qq{Can't open "$file" for input: $!}; #'
chomp(my @array = <$fh>);
return @array;
}
sub array_to_file {
my ($array, $file, $mode) = @_;
$mode //= '>';
my $fh;
if ( $file ) {
open $fh, $mode, $file or die qq{Can't open "$file" for output: $!}; #'
}
else {
$fh = \*STDOUT;
}
print $fh $_, "\n" for @$array;
}
1;
and your test program would be like this
#!/usr/bin/env perl
use strict;
use warnings 'all';
use KconfCtl::Helper;
use constant FILE => 'kconf.test';
my @t = KconfCtl::Helper::file_to_array(FILE);
push @t, 'n';
KconfCtl::Helper::array_to_file(\@t, FILE);

When you read in from your file, you need to chomp() the lines, or else the \n at the end of the line is included.
Try this and you'll see what's happening:
use Data::Dumper; ## add this line
sub file_to_array($) {
my $file = shift();
my ( $filestream, $string );
my @rray;
open( $filestream, '<', $file ) or die("cant open $file: $!");
@rray = <$filestream>;
close($filestream);
print Dumper( \@rray ); ### add this line
return @rray;
}
You can add
foreach(@rray){
chomp();
}
into your module to stop this happening.
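To see the effect of chomp over repeated runs, here is a small self-contained sketch. The helper names read_lines and write_lines and the temporary file are hypothetical; they only mirror the shape of the question's two functions.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Read a file into a list of lines with the trailing newlines removed.
sub read_lines {
    my ($file) = @_;
    open my $fh, '<', $file or die "Can't open $file: $!";
    chomp(my @lines = <$fh>);
    return @lines;
}

# Write the lines back, adding exactly one newline per line.
sub write_lines {
    my ($file, @lines) = @_;
    open my $fh, '>', $file or die "Can't open $file: $!";
    print {$fh} "$_\n" for @lines;
    close $fh or die "Can't close $file: $!";
}

my ($tmp_fh, $tmp_file) = tempfile(UNLINK => 1);
close $tmp_fh;

# Simulate two runs of the test program: each run appends one "n".
for my $run (1 .. 2) {
    my @t = read_lines($tmp_file);
    push @t, 'n';
    write_lines($tmp_file, @t);
}

my @final = read_lines($tmp_file);
print scalar(@final), " lines\n";    # prints "2 lines"
```

Without the chomp, the second run would write three lines: the two "n" entries plus a blank line left over from the first run's newline.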

Can't find error "Global symbol @xx requires explicit package name"

I have checked the questions that may already have an answer and none of them have helped.
This is for my semester project for Unix Programming. I have created a script that compares HTML files from a website to one another.
The script worked perfectly as expected until I tried to implement the second website, so in turn I deleted the added code for the second website and now I get the errors
Global symbol "@master" requires explicit package name
Global symbol "@child" requires explicit package name
within the csite_md5 subroutine. I have gone through the code many times over and cannot see the problem.
I am looking for another set of eyes to see if I'm just missing something simple, which usually is the case.
Also I am new to Perl as this is my first time using the language.
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Basename;
# Path to the c-site download root directory
my $csite_dir = '/root/websites/c-site/wget/';
opendir my $dh, $csite_dir or die $!;
# Finds the sub directories c-site_'date +%F' where the c-site download is located
my @wget_subdir_csite = sort grep /^[^.]/, readdir $dh;
# Creates the absolute path to the c-site download
my $csite_master_dir = "$csite_dir$wget_subdir_csite[0]/dayzunderground.webs.com";
my $csite_child_dir = "$csite_dir$wget_subdir_csite[1]/dayzunderground.webs.com";
# Call to subroutine to append the .html file name to the absolute path
my @master_csite = &gethtml_master_csite($csite_master_dir);
my @child_csite = &gethtml_child_csite($csite_child_dir);
&csite_md5(\@master_csite, \@child_csite);
sub gethtml_master_csite{
my ($master_path) = @_;
opendir (DIR, $master_path) or die $!;
# Ends with .html and is a file
my @html_master = sort grep {m/\.html$/i && -f "$master_path/$_"} readdir(DIR);
my @files_master = ("$master_path/$html_master[0]","$master_path/$html_master[1]","$master_path/$html_master[2]","$master_path/$html_master[3]");
return @files_master
}
sub gethtml_child_csite{
my ($child_path) = @_;
opendir (DIR, $child_path) or die $!;
# Ends with .html and is a file
my @html_child = sort grep {m/\.html$/i && -f "$child_path/$_"} readdir(DIR);
my @files_child = ("$child_path/$html_child[0]","$child_path/$html_child[1]","$child_path/$html_child[2]","$child_path/$html_child[3]");
return @files_child
}
sub csite_md5{
my ($master, $child) = @_;
if(&md5sum($master[0]) ne &md5sum($child[0])){
my $filename = basename($master[0]);
system("diff -u -d -t --width=100 $master[0] $child[0] > ~/websites/c-site/diff/c-site-$filename-`date +%F`");
#print "1"
}
if(&md5sum($master[1]) ne &md5sum($child[1])){
my $filename2 = basename($master[1]);
system("diff -u -d -t --width=100 $master[1] $child[1] > ~/websites/c-site/diff/c-site-$filename2-`date +%F`");
#print "2"
}
if(&md5sum($master[2]) ne &md5sum($child[2])){
my $filename3 = basename($master[2]);
system("diff -u -d -t --width=100 $master[2] $child[2] > ~/websites/c-site/diff/c-site-$filename3-`date +%F`");
#print "3"
}
if(&md5sum($master[3]) ne &md5sum($child[3])){
my $filename4 = basename($master[3]);
system("diff -u -d -t --width=100 $master[3] $child[3] > ~/websites/c-site/diff/c-site-$filename4-`date +%F`");
#print "4"
}
}
sub md5sum{
my $file = shift;
my $digest = "";
eval{
open(FILE, $file) or die "Can't find file $file\n";
my $ctx = Digest::MD5->new;
$ctx->addfile(*FILE);
$digest = $ctx->hexdigest;
close(FILE);
};
if($@){
print $@;
return "";
}
return $digest
}

$master and $child are array references; use them like $master->[0]. $master[0] uses the array @master, which is a completely separate variable.
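A minimal sketch of the difference, using a hypothetical two-element list in place of the real file names:

```perl
use strict;
use warnings;

my @files  = ('a.html', 'b.html');
my $master = \@files;    # $master holds a reference to @files

# The arrow dereferences the scalar $master.
my $first = $master->[0];

# The whole array behind the reference:
my @copy = @$master;

# $master[0] would instead index a separate array named @master.
# None is declared here, so "use strict" rejects it at compile time
# with the "Global symbol ... requires explicit package name" error.

print "$first (", scalar(@copy), " files)\n";    # prints "a.html (2 files)"
```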

I thought it may help to go through your program and point out some practices that are less than optimal
You shouldn't use an ampersand & when calling a Perl subroutine. That was required in Perl 4, which was superseded about 22 years ago.
It is preferable to use the File::Spec module to manipulate file paths, both to handle cases like multiple path separators and for portability. File::Spec will also do the job of File::Basename.
It is unnecessary to use the shell to create a date string. Use the Time::Piece module: localtime->ymd generates the same string as date +%F.
It is neater and more concise to use map where appropriate instead of writing multiple identical assignments.
The gethtml_master_csite and gethtml_child_csite subroutines are identical except that they use different variable names internally. They can be replaced by a single gethtml_csite subroutine.
You should use lexical file and directory handles throughout, as you have done with the first opendir. You should also use the three-parameter form of open (with the open mode as the second parameter).
If an open fails then you should include the variable $! in the die string so that you know why it failed. Also, if you end the string with a newline then Perl won't append the source file and line number to the string when it is printed.
As you have read, csite_md5 attempts to use arrays @master and @child which don't exist. You have array references $master and $child instead. Also, the subroutine lends itself to a loop structure instead of writing the four comparisons explicitly.
In md5sum you have used an eval to catch the die when the open call fails. It is nicer to check for this explicitly.
The standard way of returning a false value from a subroutine is a bare return. If you return '' then it will evaluate as true in list context.
With those changes in place your code looks like this. Please ask if you have any problem understanding it. Note that I haven't been able to test it, but it does compile.
#!/usr/bin/perl
use strict;
use warnings;
use Digest::MD5 qw(md5_hex);
use File::Spec::Functions qw/ catdir catfile splitpath /;
use Time::Piece 'localtime';
my $csite_dir = '/root/websites/c-site/wget/';
opendir my $dh, $csite_dir or die qq{Unable to open "$csite_dir": $!};
my @wget_subdir_csite = sort grep /^[^.]/, readdir $dh;
my ($csite_master_dir, $csite_child_dir) = map
catdir($csite_dir, $_, 'dayzunderground.webs.com'),
@wget_subdir_csite[0,1];
my @master_csite = gethtml_csite($csite_master_dir);
my @child_csite = gethtml_csite($csite_child_dir);
csite_md5(\@master_csite, \@child_csite);
sub gethtml_csite {
my ($path) = @_;
opendir my $dh, $path or die qq{Unable to open "$path": $!};
my @files = sort grep { /\.html$/i and -f } map catfile($path, $_), readdir $dh;
return @files;
}
sub csite_md5 {
my ($master_list, $child_list) = @_;
for my $i ( 0 .. $#$master_list ) {
my ($master, $child) = ($master_list->[$i], $child_list->[$i]);
if ( md5sum($master) ne md5sum($child) ) {
my $filename = (splitpath($master))[-1]; # Returns (volume, path, file)
my $date = localtime->ymd;
system("diff -u -d -t --width=100 $master $child > ~/websites/c-site/diff/c-site-$filename-$date");
}
}
}
sub md5sum {
my ($file) = @_;
my $digest = "";
open my $fh, '<', $file or do {
warn qq{Can't open file "$file": $!}; # '
return;
};
my $ctx = Digest::MD5->new;
$ctx->addfile($fh);
return $ctx->hexdigest;
}
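The remark in the list above about a bare return versus returning '' can be checked with a small sketch. The two subroutine names are hypothetical and exist only for this comparison.

```perl
use strict;
use warnings;

sub empty_string { return '' }    # a one-element list in list context
sub bare_return  { return }       # an empty list in list context

my @from_string = empty_string();
my @from_bare   = bare_return();

print scalar(@from_string), "\n";    # prints 1
print scalar(@from_bare),   "\n";    # prints 0

# An array is true when it has at least one element,
# even if that element is the empty string.
print "empty_string looks true in list context\n" if @from_string;
print "bare_return looks true in list context\n"  if @from_bare;
```

Only the first of the two final messages is printed, which is why a caller that collects the result into an array cannot detect failure when the subroutine returns ''.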

In Perl, how can I filter all log files in a directory, and extract interesting lines?

I'm trying to select only the .log files in my directory and then search in those files for the word "unbound" and print the entire line into a new output file with the same name as the log file (number###.log) but with a .txt extension. This is what I have so far:
#!/usr/bin/perl
use strict;
use warnings;
my $path = $ARGV[0];
my $outpath = $ARGV[1];
my @files;
my $files;
opendir(DIR,$path) or die "$!";
@files = grep { /\.log$/} readdir(DIR);
my @out;
my $out;
opendir(OUT,$outpath) or die "$!";
my $line;
foreach $files (@files) {
open (FILE, "$files");
my @line = <FILE>;
my $regex = Unbound;
open (OUT, ">>$out");
print grep {$line =~ /$regex/ } <>;
}
close OUT;
close FILE;
closedir(DIR);
closedir (OUT);
I'm a beginner, and I don't really know how to create a new text file with the acquired output.

A few things I'd suggest to improve this code:
Declare your loop iterators within the loop: foreach my $file ( @files ) {
Use three-argument open with a lexical handle: open ( my $input_fh, "<", $filename );
Use glob rather than opendir followed by grep: foreach my $file ( <$path/*.log> ) {
grep is good for extracting things into arrays. Your grep reads the whole file just to print it, which isn't necessary, although it doesn't matter much if the file is short.
perltidy is great for reformatting code.
You're opening OUT on a directory path (I think?), which isn't going to work. opendir reads a directory listing; it can't be used for output. You need to open a separate output file for each input file.
Because you're using opendir, you get bare filenames, not full paths, so you might be in the wrong place to actually open the files. Prepending the path name or doing a chdir are possible solutions, but that's one of the reasons I like glob: it returns the path as well.
So with that in mind - how about:
#!/usr/bin/perl
use strict;
use warnings;
use File::Basename;
#Extract paths
my $input_path = $ARGV[0];
my $output_path = $ARGV[1];
#Error if paths are invalid.
unless (defined $input_path
and -d $input_path
and defined $output_path
and -d $output_path )
{
die "Usage: $0 <input_path> <output_path>\n";
}
foreach my $filename (<$input_path/*.log>) {
# extract the 'name' bit of the filename.
# be slightly careful with this - it's based
# on an assumption which isn't always true.
# File::Spec is a more powerful way of accomplishing this.
# but should grab 'number####' from /path/to/file/number####.log
my $output_file = basename ( $filename, '.log' );
#open input and output filehandles.
open( my $input_fh, "<", $filename ) or die $!;
open( my $output_fh, ">", "$output_path/$output_file.txt" ) or die $!;
print "Processing $filename -> $output_path/$output_file.txt\n";
#iterate input, extracting into $line
while ( my $line = <$input_fh> ) {
#check if $line matches your RE.
if ( $line =~ m/Unbound/ ) {
#write it to output.
print {$output_fh} $line;
}
}
#tidy up our filehandles. Although technically, they'll
#close automatically because they leave scope
close($output_fh);
close($input_fh);
}

Here is a script that takes advantage of Path::Tiny. Now, at this stage of your learning process, you are probably better off understanding @Sobrique's solution, but using modules such as Path::Tiny or Path::Class will make it easier to write these one-off scripts more quickly, and correctly.
Also, I didn't really test this script, so watch out for bugs.
#!/usr/bin/env perl
use strict;
use warnings;
use Path::Tiny;
run(\@ARGV);
sub run {
my $argv = shift;
unless (@$argv == 2) {
die "Need source and destination paths\n";
}
my $it = path($argv->[0])->realpath->iterator({
recurse => 0,
follow_symlinks => 0,
});
my $outdir = path($argv->[1])->realpath;
while (my $path = $it->()) {
next unless -f $path;
next unless $path =~ /[.]log\z/;
my $logfh = $path->openr;
my $outfile = $outdir->child($path->basename('.log') . '.txt');
my $outfh;
while (my $line = <$logfh>) {
next unless $line =~ /Unbound/;
unless ($outfh) {
$outfh = $outfile->openw;
}
print $outfh $line;
}
# Only close the output handle if a match caused it to be opened.
if ($outfh) {
close $outfh
or die "Cannot close output '$outfile': $!";
}
}
}
Notes
realpath will croak if the path provided does not exist.
Similarly for openr and openw.
I am reading input files line-by-line to keep the memory footprint of the program independent of the sizes of input files.
I do not open the output file until I know I have a match to print to.
When matching a file extension using a regular expression pattern, keep in mind that \n is a valid character in Unix file names, and the $ anchor also matches just before a trailing newline. That is why the pattern above uses \z.
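A small sketch of that last note, using a hypothetical file name that ends in a newline:

```perl
use strict;
use warnings;

# A legal, if unusual, Unix file name with a trailing newline.
my $odd_name = "notes.log\n";

# $ matches at the end of the string or just before a final newline;
# \z matches only at the very end of the string.
my $dollar_matches = $odd_name =~ /[.]log$/  ? 1 : 0;
my $z_matches      = $odd_name =~ /[.]log\z/ ? 1 : 0;

print "\$ anchor:  $dollar_matches\n";    # prints 1
print "\\z anchor: $z_matches\n";         # prints 0
```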

Perl - empty rows while writing CSV from Excel

I want to convert Excel files to CSV files with Perl. For convenience I like to use the module File::Slurp for read/write operations. I need it in a subroutine.
When printing to the screen, the program generates the desired output, but the generated CSV files unfortunately contain just one row of semicolons; the fields are empty.
Here is the code:
#!/usr/bin/perl
use File::Copy;
use v5.14;
use Cwd;
use File::Slurp;
use Spreadsheet::ParseExcel;
sub xls2csv {
my $currentPath = getcwd();
my @files = <$currentPath/stage0/*.xls>;
for my $sourcename (@files) {
print "Now working on $sourcename\n";
my $outFile = $sourcename;
$outFile =~ s/xls/csv/g;
print "Output CSV-File: ".$outFile."\n";
my $source_excel = new Spreadsheet::ParseExcel;
my $source_book = $source_excel->Parse($sourcename)
or die "Could not open source Excel file $sourcename: $!";
foreach my $source_sheet_number ( 0 .. $source_book->{SheetCount} - 1 )
{
my $source_sheet = $source_book->{Worksheet}[$source_sheet_number];
next unless defined $source_sheet->{MaxRow};
next unless $source_sheet->{MinRow} <= $source_sheet->{MaxRow};
next unless defined $source_sheet->{MaxCol};
next unless $source_sheet->{MinCol} <= $source_sheet->{MaxCol};
foreach my $row_index (
$source_sheet->{MinRow} .. $source_sheet->{MaxRow} )
{
foreach my $col_index (
$source_sheet->{MinCol} .. $source_sheet->{MaxCol} )
{
my $source_cell =
$source_sheet->{Cells}[$row_index][$col_index];
if ($source_cell) {
print $source_cell->Value, ";"; # correct output!
write_file( $outFile, { binmode => ':utf8' }, $source_cell->Value, ";" ); # only one row of semicolons with empty fields!
}
}
print "\n";
}
}
}
}
xls2csv();
I know it has something to do with the parameter passing in the write_file function, but couldn't manage to fix it.
Has anybody an idea?
Thank you very much in advance.

write_file will overwrite the file unless the append => 1 option is given. So this:
write_file( $outFile, { binmode => ':utf8' }, $source_cell->Value, ";" );
Will write a new file for each new cell value. It does however not match your description of "only one row of semi-colons of empty fields", as it should only be one semi-colon, and one value.
I am doubtful towards this sentiment from you: "For convenience I like to use the module File::Slurp". While the print statement works as it should, using File::Slurp does not. So how is that convenient?
What you should do, if you still want to use write_file is to gather all the lines to print, and then print them all at once at the end of the loop. E.g.:
$line .= $source_cell->Value . ";"; # use concatenation to build the line
...
push @out, "$line\n"; # store in array
...
write_file(...., \@out); # print the array
Another simple option would be to use join, or to use the Text::CSV module.
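A rough sketch of the join variant, writing the file only once. The @rows data is a hypothetical stand-in for the cell values read from a worksheet, the output goes to a temporary file, and plain print is used in place of write_file. Fields that themselves contain semicolons or quotes are not escaped here; that is the job Text::CSV handles.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical stand-in for the cell values of one worksheet.
my @rows = (
    [ 'id', 'name',  'city'   ],
    [ 1,    'Alice', 'Berlin' ],
    [ 2,    undef,   'Paris'  ],    # an empty cell
);

# Build one text line per row; undefined cells become empty fields.
my @out;
for my $row (@rows) {
    my @fields = map { defined $_ ? $_ : '' } @$row;
    push @out, join(';', @fields) . "\n";
}

# Open, write and close the output file exactly once.
my ($fh, $out_file) = tempfile(SUFFIX => '.csv', UNLINK => 1);
binmode $fh, ':encoding(UTF-8)';
print {$fh} @out;
close $fh or die "Cannot close $out_file: $!";
```

Collecting the lines first also keeps the row structure intact, because the newline is added once per row and not once per cell.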

Well, in this particular case, File::Slurp was indeed complicating this for me. I just wanted to avoid to repeat myself, which I did in the following clumsy working solution:
#!/usr/bin/perl
use warnings;
use strict;
use File::Copy;
use v5.14;
use Cwd;
use File::Basename;
use File::Slurp;
use Tie::File;
use Spreadsheet::ParseExcel;
use open qw/:std :utf8/;
# ... other functions
sub xls2csv {
my $currentPath = getcwd();
my @files = <$currentPath/stage0/*.xls>;
my $fh;
for my $sourcename (@files) {
say "Now working on $sourcename";
my $outFile = $sourcename;
$outFile =~ s/xls/csv/gi;
if ( -e $outFile ) {
unlink($outFile) or die "Error: $!";
print "Old $outFile deleted.";
}
my $source_excel = new Spreadsheet::ParseExcel;
my $source_book = $source_excel->Parse($sourcename)
or die "Could not open source Excel file $sourcename: $!";
foreach my $source_sheet_number ( 0 .. $source_book->{SheetCount} - 1 )
{
my $source_sheet = $source_book->{Worksheet}[$source_sheet_number];
next unless defined $source_sheet->{MaxRow};
next unless $source_sheet->{MinRow} <= $source_sheet->{MaxRow};
next unless defined $source_sheet->{MaxCol};
next unless $source_sheet->{MinCol} <= $source_sheet->{MaxCol};
foreach my $row_index (
$source_sheet->{MinRow} .. $source_sheet->{MaxRow} )
{
foreach my $col_index (
$source_sheet->{MinCol} .. $source_sheet->{MaxCol} )
{
my $source_cell =
$source_sheet->{Cells}[$row_index][$col_index];
if ($source_cell) {
print $source_cell->Value, ";";
open( $fh, '>>', $outFile ) or die "Error: $!";
print $fh $source_cell->Value, ";";
close $fh;
}
}
print "\n";
open( $fh, '>>', $outFile ) or die "Error: $!";
print $fh "\n";
close $fh;
}
}
}
}
xls2csv();
I'm actually NOT happy with it, since I'm opening and closing the files so often (I have many files with many lines). That's not very clever in terms of performance.
Currently I still don't know how to use split or Text::CSV in this case in order to put everything into an array and to open, write and close each file only once.
Thank you for your answer TLP.
