I would like to use
myscript.pl targetfolder/*
to read some numbers from ASCII files.
myscript.pl
@list = <@ARGV>;
# Is the whole file or only 1st line is loaded?
foreach $file ( @list ) {
open (F, $file);
}
# is this correct to judge if there is still file to load?
while ( <F> ) {
match_replace()
}
sub match_replace {
# if I want to read the 5th line in downward, how to do that?
# if I would like to read multi lines in multi array[row],
# how to do that?
if ( /^\sName\s+/ ) {
$name = $1;
}
}
I would recommend a thorough read of perlintro - it will give you a lot of the information you need. Additional comments:
Always use strict and warnings. The first will enforce some good coding practices (like for example declaring variables), the second will inform you about potential mistakes. For example, one warning produced by the code you showed would be readline() on unopened filehandle F, giving you the hint that F is not open at that point (more on that below).
@list = <@ARGV>;: This is a bit tricky, I wouldn't recommend it - you're essentially using glob, and expanding targetfolder/* is something your shell should be doing, and if you're on Windows, I'd recommend Win32::Autoglob instead of doing it manually.
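If you do need to expand the wildcard inside the script (for example on Windows, where the shell won't do it for you), a minimal sketch of the manual approach would be to glob each argument yourself; the @list name here just mirrors the question's code:
my @list = map { glob } @ARGV;   # expands e.g. "targetfolder/*" into the matching file names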
foreach ... { open ... }: You're not doing anything with the files once you've opened them - the loop to read from the files needs to be inside the foreach.
"Is the whole file or only 1st line is loaded?" open doesn't read anything from the file, it just opens it and provides a filehandle (which you've named F) that you then need to read from.
I'd strongly recommend you use the more modern three-argument form of open and check it for errors, as well as use lexical filehandles since their scope is not global, as in open my $fh, '<', $file or die "$file: $!";.
"is this correct to judge if there is still file to load?" Yes, while (<$filehandle>) is a good way to read a file line-by-line, and the loop will end when everything has been read from the file. You may want to use the more explicit form while (my $line = <$filehandle>), so that your variable has a name, instead of the default $_ variable - it does make the code a bit more verbose, but if you're just starting out that may be a good thing.
match_replace(): You're not passing any parameters to the sub. Even though this code might still "work", it's passing the current line to the sub through the global $_ variable, which is not a good practice because it will be confusing and error-prone once the script starts getting longer.
if (/^\sName\s+/){$name = $1;}: Since you've named the sub match_replace, I'm guessing you want to do a search-and-replace operation. In Perl, that's called s/search/replacement/, and you can read about it in perlrequick and perlretut. As for the code you've shown, you're using $1, but you don't have any "capture groups" ((...)) in your regular expression - you can read about that in those two links as well.
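To make that concrete, here is a minimal, hypothetical example of a capture group and a substitution; the pattern and the names are only illustrations, not what your data actually needs:
my $line = "  Name   Alice";
if ( $line =~ /^\s*Name\s+(\S+)/ ) {   # (\S+) is a capture group; its match lands in $1
    my $name = $1;                     # $name is now "Alice"
}
$line =~ s/Alice/Bob/;                 # s/search/replacement/ edits $line in place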
"if I want to read the 5th line in downward , how to do that ?" As always in Perl, There Is More Than One Way To Do It (TIMTOWTDI). One way is with the range operator .. - you can skip the first through fourth lines by saying next if 1..4; at the beginning of the while loop, this will test those line numbers against the special $. variable that keeps track of the most recently read line number.
"and if I would like to read multi lines in multi array[row], how to do that ?" One way is to use push to add the current line to the end of an array. Since keeping the lines of a file in an array can use up more memory, especially with large files, I'd strongly recommend making sure you think through the algorithm you want to use here. You haven't explained why you would want to keep things in an array, so I can't be more specific here.
So, having said all that, here's how I might have written that code. I've added some debugging code using Data::Dumper - it's always helpful to see the data that your script is working with.
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper; # for debugging
$Data::Dumper::Useqq=1;
for my $file (@ARGV) {
print Dumper($file); # debug
open my $fh, '<', $file or die "$file: $!";
while (my $line = <$fh>) {
next if 1..4;
chomp($line); # remove line ending
match_replace($line);
}
close $fh;
}
sub match_replace {
my ($line) = @_; # get argument(s) to sub
my $name;
if ( $line =~ /^\sName\s+(.*)$/ ) {
$name = $1;
}
print Data::Dumper->Dump([$line,$name],['line','name']); # debug
# ... do more here ...
}
The above code is explicitly looping over @ARGV and opening each file, and I did say above that more verbose code can be helpful in understanding what's going on. I just wanted to point out a nice feature of Perl, the "magic" <> operator (discussed in perlop under "I/O Operators"), which will automatically open the files in @ARGV and read lines from them. (There's just one small thing, if I want to use the $. variable and have it count the lines per file, I need to use the continue block I've shown below, this is explained in eof.) This would be a more "idiomatic" way of writing that first loop:
while (<>) { # reads line into $_
next if 1..4;
chomp; # automatically uses $_ variable
match_replace($_);
} continue { close ARGV if eof } # needed for $. (and range operator)
[I've changed the code below to reflect what I'm currently running after having implemented people's suggestions]
Let me preface this by saying that I'm not a programmer, but just someone who uses Perl to get certain text processing things done the best I can.
I've got a script that produces frequency lists. It essentially does the following:
Reads in lines from a file having the format $frequency \t $item. Any given $item may occur multiple times, with different values for $frequency.
Eliminates certain lines depending on the content of $item.
Sums the frequencies of all identical $items, regardless of their case, and merges these entries into one.
Performs a reverse natural sort on the resulting array.
Prints the results to an output file.
The script works perfectly well on input files of up to about 1 GB in size. However, I have files of up to 6 GB that I need to process, and this has proven impossible due to memory use. Though my machine has 32 GB of RAM, uses zRam, and has 64 GB of swap on SSD just for this purpose, the script will inevitably be killed by the Linux OOM service when combined memory use hits something around 70 GB (of the 92 GB total).
The real issue, of course, is the vast amount of memory my script is using. I could try adding even more swap, but I've increased it twice now and it just gets eaten up.
So I need to somehow optimize the script. And that's what I'm here asking for help with.
Below is the actual version of the script that I'm running now, with some hopefully useful comments retained.
I'd be enormously grateful if your comments and suggestions contained enough code to actually allow me to more or less drop it into the existing script, as I'm not a programmer by trade, as I said above, and even something so apparently simple as piping the text being processed through some module or another would throw me for a serious curve.
Thanks in advance!
(By the way, I'm using Perl 5.22.1 x64 on Ubuntu 16.04 LTS x64.)
#!/usr/bin/env perl
use strict;
use warnings;
use warnings qw(FATAL utf8);
use Getopt::Long qw(:config no_auto_abbrev);
# DEFINE VARIABLES
my $delimiter = "\t";
my $split_char = "\t";
my $input_file_name = "";
my $output_file_name = "";
my $in_basename = "";
my $frequency = 0;
my $item = "";
# READ COMMAND LINE OPTIONS
GetOptions (
"input|i=s" => \$input_file_name,
"output|o=s" => \$output_file_name,
);
# ENSURE AN INPUT FILE IS SPECIFIED
if ( $input_file_name eq "" ) {
die
"\nERROR: You must provide the name of the file to be processed with the -i switch.\n";
}
# IF NO OUTPUT FILE NAME IS SPECIFIED, GENERATE ONE AUTOMATICALLY
if ( $output_file_name eq "" ) {
# STRIP EXTENSION FROM INPUT FILE NAME
$in_basename = $input_file_name;
$in_basename =~ s/(.+)\.(.+)/$1/;
# GENERATE OUTPUT FILE NAME FROM INPUT BASENAME
$output_file_name = "$in_basename.output.txt";
}
# READ INPUT FILE
open( INPUTFILE, '<:encoding(utf8)', $input_file_name )
or die "\nERROR: Can't open input file ($input_file_name): $!";
# PRINT INPUT AND OUTPUT FILE INFO TO TERMINAL
print STDOUT "\nInput file:\t$input_file_name";
print STDOUT "\nOutput file:\t$output_file_name";
print STDOUT "\n\n";
# PROCESS INPUT FILE LINE BY LINE
my %F;
while (<INPUTFILE>) {
chomp;
# PUT FREQUENCY IN $frequency AND THEN PUT ALL OTHER COLUMNS INTO $item
( $frequency, $item ) = split( /$split_char/, $_, 2 );
# Skip lines with empty or undefined content, or spaces only in $item
next if not defined $frequency or $frequency eq '' or not defined $item or $item =~ /^\s*$/;
# PROCESS INPUT LINES
$F{ lc($item) } += $frequency;
}
close INPUTFILE;
# OPEN OUTPUT FILE
open( OUTPUTFILE, '>:encoding(utf8)', "$output_file_name" )
|| die "\nERROR: The output file \($output_file_name\) couldn't be opened for writing!\n";
# PRINT OUT HASH WITHOUT SORTING
foreach my $item ( keys %F ) {
print OUTPUTFILE $F{$item}, "\t", $item, "\n";
}
close OUTPUTFILE;
exit;
Below is some sample input from the source file. It's tab-separated, and the first column is $frequency, while all the rest together is $item.
2 útil volver a valdivia
8 útil volver la vista
1 útil válvula de escape
1 útil vía de escape
2 útil vía fax y
1 útil y a cabalidad
43 útil y a el
17 útil y a la
1 útil y a los
21 útil y a quien
1 útil y a raíz
2 útil y a uno
UPDATE In my tests a hash takes about 2.5 times the memory that its data "alone" would take. However, the program size for me is consistently 3-4 times as large as its variables. This would turn a 6.3GB data file into a ~15GB hash, and thus a ~60GB program, just as reported in comments.
So 6.3GB of data ends up as roughly a 60GB process, so to say. This still improved the starting situation enough to work for the current problem, but it is clearly not a solution. See the (updated) Another approach below for a way to run this processing without loading the whole hash.
There is nothing obvious to lead to an order-of-magnitude memory blowup. However, small errors and inefficiencies can add up, so let's first clean up. See other approaches at the end.
Here is a simple re-write of the core of the program, to try first.
# ... set filenames, variables
open my $fh_in, '<:encoding(utf8)', $input_file_name
or die "\nERROR: Can't open input file ($input_file_name): $!";
my %F;
while (<$fh_in>) {
chomp;
s/^\s*//; # trim leading space
my ($frequency, $item) = split /$split_char/, $_, 2;
# Skip lines with empty or undefined content, or spaces only in $item
next if not defined $frequency or $frequency eq ''
or not defined $item or $item =~ /^\s*$/;
# ... increment counters and aggregates and add to hash
# (... any other processing?)
$F{ lc($item) } += $frequency;
}
close $fh_in;
# Sort and print to file
# (Or better write: "value key-length key" and sort later. See comments)
open my $fh_out, '>:encoding(utf8)', $output_file_name
or die "\nERROR: Can't open output file ($output_file_name\: $!";
foreach my $item ( sort {
$F{$b} <=> $F{$a} || length($b) <=> length($a) || $a cmp $b
} keys %F )
{
print $fh_out $F{$item}, "\t", $item, "\n";
}
close $fh_out;
A few comments, let me know if more is needed.
Always add $! to error-related prints, to see the actual error. See perlvar.
Use lexical filehandles (my $fh rather than INPUTFILE), it's better.
If layers are specified in the three-argument open then layers set by open pragma are ignored, so there should be no need for use open ... (but it doesn't hurt either).
The sort here has to at least copy its input, and with multiple conditions more memory is needed.
That should take no more memory than 2-3 times the hash size. While initially I suspected a memory leak (or excessive data copying), by reducing the program to basics it was shown that the "normal" program size is the (likely) culprit. This can be tweaked by devising custom data structures and packing the data economically.
Of course, all this is fiddling if your files are going to grow larger and larger, as they tend to do.
Another approach is to write out the file unsorted and then sort using a separate program. That way you don't combine the possible memory swelling from processing with final sorting.
But even this pushes the limits, due to the greatly increased memory footprint as compared to the data, since the hash takes 2.5 times the data size and the whole program is yet 3-4 times as large.
Then find an algorithm to write the data line-by-line to the output file. That is simple to do here since, per the shown processing, we only need to accumulate frequencies for each item.
open my $fh_out, '>:encoding(utf8)', $output_file_name
or die "\nERROR: Can't open output file ($output_file_name\: $!";
my $cumulative_freq;
while (<$fh_in>) {
chomp;
s/^\s*//; # trim leading space only
my ($frequency, $item) = split /$split_char/, $_, 2;
# Skip lines with empty or undefined content, or spaces only in $item
next if not defined $frequency or $frequency eq ''
or not defined $item or $item =~ /^\s*$/;
$cumulative_freq += $frequency; # would-be hash value
# Add a sort criterion, $item's length, helpful for later sorting
say $fh_out $cumulative_freq, "\t", length $item, "\t", lc($item); # say needs: use feature 'say'; (or use 5.010;)
#say $fh_out $cumulative_freq, "\t", lc($item);
}
close $fh_out;
Now we can use the system's sort, which is optimized for very large files. Since we wrote a file with all sorting columns, value key-length key, run in a terminal
sort -nr -k1,1 -k2,2 output_file_name | cut -f1,3- > result
The command sorts numerically by the first field and then by the second (with the rest of the line as a final tie-breaker), in reverse order. This is piped into cut, which pulls out the first and third fields from STDIN (with tab as the default delimiter), which is the needed result.
A systemic solution is to use a database, and a very convenient one is DBD::SQLite.
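I haven't benchmarked this against your data, but a minimal sketch of the DBD::SQLite idea might look like the following; it reuses $input_file_name and $output_file_name from your script, while the database file freq.sqlite and the table/column names are assumptions made here for illustration:
use DBI;
my $dbh = DBI->connect("dbi:SQLite:dbname=freq.sqlite", "", "",
    { RaiseError => 1, AutoCommit => 0, sqlite_unicode => 1 });
$dbh->do("CREATE TABLE IF NOT EXISTS freq (item TEXT, frequency INTEGER)");
open my $fh_in, '<:encoding(utf8)', $input_file_name or die "$input_file_name: $!";
my $ins = $dbh->prepare("INSERT INTO freq (item, frequency) VALUES (?, ?)");
while (<$fh_in>) {
    chomp;
    my ($frequency, $item) = split /\t/, $_, 2;
    next if not defined $frequency or $frequency eq ''
        or not defined $item or $item =~ /^\s*$/;
    $ins->execute(lc $item, $frequency);   # raw rows go to disk, not into a Perl hash
}
close $fh_in;
$dbh->commit;
# Let the database do the aggregation and the sorting
open my $fh_out, '>:encoding(utf8)', $output_file_name or die "$output_file_name: $!";
my $sth = $dbh->prepare(
    "SELECT SUM(frequency) AS f, item FROM freq GROUP BY item ORDER BY f DESC");
$sth->execute;
while (my ($f, $item) = $sth->fetchrow_array) {
    print $fh_out "$f\t$item\n";
}
close $fh_out;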
I used Devel::Size to see memory used by variables.
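For reference, that check is only a couple of lines (Devel::Size is a CPAN module; %F is the hash from the code above):
use Devel::Size qw(total_size);
print "hash takes ", total_size(\%F), " bytes\n";   # keys, values and per-entry overhead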
Sorting input requires keeping all input in memory, so you can't do everything in a single process.
However, sorting can be factored: you can easily sort your input into sortable buckets, then process the buckets, and produce the correct output by combining the outputs in reversed-sorted bucket order. The frequency counting can be done per bucket as well.
So just keep the program you have, but add something around it:
partition your input into buckets, e.g. by the first character or the first two characters
run your program on each bucket
concatenate the output in the right order
Your maximum memory consumption will be slightly more than what your original program takes on the biggest bucket. So if your partitioning is well chosen, you can arbitrarily drive it down.
You can store the input buckets and per-bucket outputs to disk, but you can even connect the steps directly with pipes (creating a subprocess for each bucket processor) - this will create a lot of concurrent processes, so the OS will be paging like crazy, but if you're careful, it won't need to write to disk.
A drawback of this way of partitioning is that your buckets may end up being very uneven in size. An alternative is to use a partitioning scheme that is guaranteed to distribute the input equally (e.g. by putting every nth line of input into the nth bucket) but that makes combining the outputs more complex.
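Here is a rough sketch of the disk-based variant of the first step, which is the simplest to follow; the buckets directory, the one-character partitioning key and the file naming are all assumptions made for illustration:
use strict;
use warnings;
my $bucket_dir = "buckets";
mkdir $bucket_dir unless -d $bucket_dir;
# Route each input line to a bucket file keyed by the item's first character,
# lower-cased so that case-insensitive duplicates land in the same bucket.
my %bucket_fh;
open my $in, '<:encoding(utf8)', $ARGV[0] or die "$ARGV[0]: $!";
while (my $line = <$in>) {
    my ($freq, $item) = split /\t/, $line, 2;
    next unless defined $item;
    my $key = lc substr $item, 0, 1;
    $key =~ s/[^a-z0-9]/_/;                 # keep the file name safe
    unless ($bucket_fh{$key}) {
        open $bucket_fh{$key}, '>:encoding(utf8)', "$bucket_dir/$key.txt"
            or die "bucket $key: $!";
    }
    print { $bucket_fh{$key} } $line;
}
close $_ for $in, values %bucket_fh;
# Then run the original script on each bucket file and concatenate the outputs
# in the order you want; peak memory is only that of the largest bucket.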
I have a large data set (around 90GB) to work with. There are data files (tab delimited) for each hour of each day and I need to perform operations in the entire data set. For example, get the share of OSes which are given in one of the columns. I tried merging all the files into one huge file and performing the simple count operation but it was simply too huge for the server memory.
So, I guess I need to perform the operation one file at a time and then add up the results in the end. I am new to Perl and am especially naive about performance issues. How do I do such operations in a case like this?
As an example, two columns of the file are:
ID OS
1 Windows
2 Linux
3 Windows
4 Windows
Let's do something simple: counting the share of the OSes in the data set. So, each .txt file has millions of these lines and there are many such files. What would be the most efficient way to operate on the entire set of files?
Unless you're reading the entire file into memory, I don't see why the size of the file should be an issue.
my %osHash;
while (<>)
{
my ($id, $os) = split("\t", $_);
if (!exists($osHash{$os}))
{
$osHash{$os} = 0;
}
$osHash{$os}++;
}
foreach my $key (sort(keys(%osHash)))
{
print "$key : ", $osHash{$key}, "\n";
}
While Paul Tomblin's answer dealt with filling the hash, here's the same plus opening the files:
use strict;
use warnings;
use 5.010;
use autodie;
my @files = map { "file$_.txt" } 1..10;
my %os_count;
for my $file (@files) {
open my $fh, '<', $file;
while (<$fh>) {
my ($id, $os) = split /\t/;
... #Do something with %os_count and $id/$os here.
}
}
We just open each file serially -- since you need to read all lines from all files, there isn't much more you can do about it. Once you have the hash, you could store it somewhere and load it when the program starts, then skip all lines until the last one you read, or simply seek there, if your records permit, which doesn't look like it.
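For the "store it somewhere and load it when the program starts" part, the core Storable module is the usual tool; a minimal sketch (the file name os_count.stor is just an assumption):
use Storable qw(store retrieve);
store \%os_count, 'os_count.stor';              # save the hash after a run
my %restored = %{ retrieve('os_count.stor') };  # reload it on the next start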
I have a couple of text files (A.txt and B.txt) which look like this (might have ~10000 rows each)
processa,id1=123,id2=5321
processa,id1=432,id2=3721
processa,id1=3,id2=521
processb,id1=9822,id2=521
processa,id1=213,id2=1
processc,id1=822,id2=521
I need to check if every row in file A.txt is present in B.txt as well (B.txt might have more too, that is okay).
The thing is that rows can be in any order in the two files, so I am thinking I will sort both files in some particular order in O(n log n) and then match each line in A.txt to the next lines in B.txt in O(n). I could implement a hash, but the files are big and this comparison happens only once, after which these files are regenerated, so I don't think that is a good idea.
What is the best way to sort the files in Perl? Any ordering would do, it just needs to be some ordering.
For example, in dictionary ordering, this would be
processa,id1=123,id2=5321
processa,id1=213,id2=1
processa,id1=3,id2=521
processa,id1=432,id2=3721
processb,id1=9822,id2=521
processc,id1=822,id2=521
As I mentioned before, any ordering would be just as fine, as long as Perl is fast in doing it.
I want to do it from within Perl code, after opening the file like so
open (FH, "<A.txt");
Any comments, ideas etc would be helpful.
To sort the file in your script, you will still have to load the entire thing into memory. If you're doing that, I'm not sure what the advantage of sorting it is over just loading it into a hash.
Something like this would work:
my %seen;
open(A, "<A.txt") or die "Can't read A: $!";
while (<A>) {
$seen{$_}=1;
}
close A;
open(B, "<B.txt") or die "Can't read B: $!";
while(<B>) {
delete $seen{$_};
}
close B;
print "Lines found in A, missing in B:\n";
join "\n", keys %seen;
Here's another way to do it. The idea is to create a flexible data structure that allows you to answer many kinds of questions easily with grep.
use strict;
use warnings;
my ($fileA, $fileB) = @ARGV;
# Load all lines: $h{LINE}{FILE_NAME} = TALLY
my %h;
$h{$_}{$ARGV} ++ while <>;
# Do whatever you need.
my @all_lines = keys %h;
my @in_both = grep { keys %{$h{$_}} == 2 } keys %h;
my @in_A = grep { exists $h{$_}{$fileA} } keys %h;
my @only_in_A = grep { not exists $h{$_}{$fileB} } @in_A;
my @in_A_mult = grep { $h{$_}{$fileA} > 1 } @in_A;
Well, I routinely parse very large (600MB) daily Apache log files with Perl, and to store the information I use a hash. I also go through about 30 of these files in one script instance, using the same hash. It's not a big deal, assuming you have enough RAM.
May I ask why you must do this in native Perl? If the cost of calling a system call or 3 is not an issue (e.g. you do this infrequently and not in a tight loop), why not simply do:
my $cmd = "sort $file1 > $file1.sorted";
$cmd .= "; sort $file2 > $file2.sorted";
$cmd .= "; comm -23 $file1.sorted $file2.sorted |wc -l";
my $count = `$cmd`;
$count =~ s/\s+//g;
if ($count != 0) {
print "Stuff in A exists that aren't in B\n";
}
Please note that the comm parameters might be different, depending on what exactly you want.
As usual, CPAN has an answer for this. Either Sort::External or File::Sort looks like it would work. I've never had occasion to try either, so I don't know which would be better for you.
Another possibility would be to use AnyDBM_File to create a disk-based hash that can exceed available memory. Without trying it, I couldn't say whether using a DBM file would be faster or slower than the sort, but the code would probably be simpler.
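A minimal sketch of that AnyDBM_File idea (seen.db is a placeholder file name); the tied hash behaves like the %seen hash in the earlier answer but lives on disk:
use AnyDBM_File;
use Fcntl;                      # for O_RDWR and O_CREAT
tie my %seen, 'AnyDBM_File', 'seen.db', O_RDWR|O_CREAT, 0640
    or die "Can't tie seen.db: $!";
# ... fill and query %seen exactly as with an ordinary in-memory hash ...
untie %seen;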
Test if A.txt is a subset of B.txt
open my $fh_b, '<', 'B.txt' or die "Can't open B.txt: $!";
open my $fh_a, '<', 'A.txt' or die "Can't open A.txt: $!";
my %bFile;
while (<$fh_b>) {
chomp;
my ($process, $id1, $id2) = split /,/;
$bFile{$process}{$id1}{$id2}++;
}
my $missingRows = 0;
while (<$fh_a>) {
chomp;
my ($process, $id1, $id2) = split /,/;
$missingRows++ unless $bFile{$process}{$id1}{$id2};
last if $missingRows; # One miss means A is not a subset of B
}
my $is_Atxt_Subset_Btxt = $missingRows ? 0 : 1;
That will give you a test for all rows in A being in B, while only reading all of B into memory and then checking each row of A against it as A is read.
I'm working on a project that involves parsing a large csv formatted file in Perl and am looking to make things more efficient.
My approach has been to split() the file by lines first, and then split() each line again by commas to get the fields. But this is suboptimal, since at least two passes on the data are required (once to split by lines, then once again for each line). This is a very large file, so cutting processing in half would be a significant improvement to the entire application.
My question is, what is the most time efficient means of parsing a large CSV file using only built in tools?
note: Each line has a varying number of tokens, so we can't just ignore lines and split by commas only. Also we can assume fields will contain only alphanumeric ASCII data (no special characters or other tricks). Also, I don't want to get into parallel processing, although it might work effectively.
edit
It can only involve built-in tools that ship with Perl 5.8. For bureaucratic reasons, I cannot use any third party modules (even if hosted on cpan)
another edit
Let's assume that our solution is only allowed to deal with the file data once it is entirely loaded into memory.
yet another edit
I just grasped how stupid this question is. Sorry for wasting your time. Voting to close.
The right way to do it -- by an order of magnitude -- is to use Text::CSV_XS. It will be much faster and much more robust than anything you're likely to do on your own. If you're determined to use only core functionality, you have a couple of options depending on speed vs robustness.
About the fastest you'll get for pure-Perl is to read the file line by line and then naively split the data:
my $file = 'somefile.csv';
my @data;
open(my $fh, '<', $file) or die "Can't read file '$file' [$!]\n";
while (my $line = <$fh>) {
chomp $line;
my @fields = split(/,/, $line);
push @data, \@fields;
}
This will fail if any fields contain embedded commas. A more robust (but slower) approach would be to use Text::ParseWords. To do that, replace the split with this:
my @fields = Text::ParseWords::parse_line(',', 0, $line);
Here is a version that also respects quotes (e.g. foo,bar,"baz,quux",123 -> "foo", "bar", "baz,quux", "123").
sub csvsplit {
my $line = shift;
my $sep = (shift or ',');
return () unless $line;
my @cells;
$line =~ s/\r?\n$//;
my $re = qr/(?:^|$sep)(?:"([^"]*)"|([^$sep]*))/;
while($line =~ /$re/g) {
my $value = defined $1 ? $1 : $2;
push @cells, (defined $value ? $value : '');
}
return @cells;
}
Use it like this:
while(my $line = <FILE>) {
my @cells = csvsplit($line); # or csvsplit($line, $my_custom_separator)
}
As other people mentioned, the correct way to do this is with Text::CSV, and either the Text::CSV_XS back end (for FASTEST reading) or Text::CSV_PP back end (if you can't compile the XS module).
If you're allowed to get extra code locally (eg, your own personal modules) you could take Text::CSV_PP and put it somewhere locally, then access it via the use lib workaround:
use lib '/path/to/my/perllib';
use Text::CSV_PP;
Additionally, if there's no alternative to having the entire file read into memory and (I assume) stored in a scalar, you can still read it like a file handle, by opening a handle to the scalar:
my $data = stupid_required_interface_that_reads_the_entire_giant_file();
open my $text_handle, '<', \$data
or die "Failed to open the handle: $!";
And then read via the Text::CSV interface:
my $csv = Text::CSV->new ( { binary => 1 } )
or die "Cannot use CSV: ".Text::CSV->error_diag ();
while (my $row = $csv->getline($text_handle)) {
...
}
or the sub-optimal split on commas:
while (my $line = <$text_handle>) {
my @csv = split /,/, $line;
... # regular work as before.
}
With this method, the data is only copied a bit at a time out of the scalar.
You can do it in one pass if you read the file line by line. There is no need to read the whole thing into memory at once.
#(no error handling here!)
open FILE, $filename;
while (<FILE>) {
my @csv = split /,/;
# now parse the csv however you want.
}
Not really sure if this is significantly more efficient though; Perl is pretty fast at string processing.
YOU NEED TO BENCHMARK YOUR IMPORT to see what is causing the slowdown. If for example, you are doing a db insertion that takes 85% of the time, this optimization won't work.
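The core Benchmark module is enough for that kind of check; here is a minimal sketch comparing the naive split against Text::ParseWords on one sample line (both approaches are shown earlier in this thread):
use Benchmark qw(cmpthese);
use Text::ParseWords;
my $line = 'foo,bar,"baz,quux",123';
cmpthese(-2, {                                                # run each for ~2 CPU seconds
    naive_split => sub { my @f = split /,/, $line },
    parse_words => sub { my @f = parse_line(',', 0, $line) },
});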
Edit
Although this feels like code golf, the general algorithm is to read the whole file or part of the file into a buffer.
Iterate byte by byte through the buffer until you find a CSV delimiter or a newline.
When you find a delimiter, increment your column count.
When you find a newline increment your row count.
If you hit the end of your buffer, read more data from the file and repeat.
That's it. But reading a large file into memory is really not the best way, see my original answer for the normal way this is done.
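For completeness, a rough sketch of that buffer-scanning loop ($filename is the same variable as above; the 64 KB chunk size is an arbitrary choice):
open my $fh, '<', $filename or die "$filename: $!";
my ($rows, $fields) = (0, 0);
while (read($fh, my $buf, 64 * 1024)) {    # refill the buffer from the file
    for my $ch (split //, $buf) {          # walk it byte by byte
        if    ($ch eq ',')  { $fields++ }  # delimiter found: one more column
        elsif ($ch eq "\n") { $rows++ }    # newline found: one more row
    }
}
close $fh;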
Assuming that you have your CSV file loaded into the $csv variable and that you do not need the text in this variable after you have successfully parsed it:
my $result=[[]];
while($csv=~s/(.*?)([,\n]|$)//s) {
push @{$result->[-1]}, $1;
push @$result, [] if $2 eq "\n";
last unless $2;
}
If you need to have $csv untouched:
local $_;
my $result=[[]];
foreach($csv=~/(?:(?<=[,\n])|^)(.*?)(?:,|(\n)|$)/gs) {
next unless defined $_;
if($_ eq "\n") {
push @$result, []; }
else {
push @{$result->[-1]}, $_; }
}
Answering within the constraints imposed by the question, you can still cut out the first split by slurping your input file into an array rather than a scalar:
open(my $fh, '<', $input_file_path) or die;
my @all_lines = <$fh>;
for my $line (@all_lines) {
chomp $line;
my @fields = split ',', $line;
process_fields(@fields);
}
And even if you can't install (the pure-Perl version of) Text::CSV, you may be able to get away with pulling up its source code on CPAN and copy/pasting the code into your project...