I'm doing some simple parsing on text files (which could get up into the 1GB range). How would I go about skipping the first N rows, and more importantly, the last (different) N rows? I'm sure I could open the file and count the rows, and do something with $_ < total_row_count -N, but that seems incredibly inefficient.
I'm pretty much a perl newb, btw.
A file is a sequence of bytes, without the notion of "lines." Some of those bytes are considered as "line" separators (linefeeds), which is how software gives us our "logical" lines to work with. So there is no way to know how many lines there are in a file -- without having read it and counted them, that is.
A simple and naive way is to read line-by-line and count
open my $fh, '<', $file or die "Can't open $file: $!";
my $cnt;
++$cnt while <$fh>;
with a slightly faster version using the $. variable
1 while <$fh>;
my $cnt = $.;
These take between 2.5 and 3 seconds for a 1.1 GB text file on a reasonable desktop.
We can speed this up a lot by reading in larger chunks and counting newline characters
open my $fh, '<', $file or die "Can't open $file: $!";
my $cnt;
NUM_LINES: {
    my $len = 64_000;
    my $buf;

    $cnt += $buf =~ tr/\n//
        while read $fh, $buf, $len;

    seek $fh, 0, 0;
};
This goes in barely over half a second, on the same hardware and Perl version.
I've put it in a block to scope variables that aren't needed elsewhere, but it really belongs in a sub, where you can record where the filehandle is when you get it and return it there after counting (so we can count the "rest" of the lines from some point in the file and processing can then continue). The sub should also check every read operation for errors.
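For example, such a sub might look like this (a sketch of my own; the name count_remaining_lines is made up, and it assumes a seekable filehandle):
sub count_remaining_lines {
    my ($fh) = @_;
    my $pos = tell $fh;                   # remember where the handle was

    my ($buf, $cnt) = ('', 0);
    while (1) {
        my $bytes = read $fh, $buf, 64_000;
        die "Error reading: $!" if not defined $bytes;
        last if $bytes == 0;              # end of file
        $cnt += $buf =~ tr/\n//;          # count newlines in this chunk
    }

    seek $fh, $pos, 0 or die "Can't seek back: $!";
    return $cnt;
}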
I'd think that a half a second overhead on a Gb large file isn't that bad at all.
Still, you can go for faster yet, at the expense of it being far messier. Get the file size (metadata, so no reading involved) and seek to a position estimated to be the wanted number of lines before the end (no reading involved). That most likely won't hit the right spot so read to the end to count lines and adjust, seeking back (further or closer). Repeat until you reach the needed place.
open my $fh, "<", $file;
my $size = -s $file;
my $estimated_line_len = 80;
my $num_last_lines = 100;
my $pos = $size - $num_last_lines*$estimated_line_len;
seek $fh, $pos, 0;
my $cnt;
++$cnt while <$fh>;
say "There are $cnt lines from position $pos to the end";
# likely need to seek back further/closer ...
I'd guess that this should get you there in under 100 ms. Note that $pos is likely inside a line.
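One simple way to handle that, as a sketch of my own, is to throw away the partial line right after seeking, before you start counting:
seek $fh, $pos, 0;
<$fh>;                       # discard the (probably partial) line we landed in
my $cnt = 0;
++$cnt while <$fh>;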
Then once you know the number of lines (or the position for desired number of lines before the end) do seek $fh, 0, 0 and process away. Or really have this in a sub which puts the filehandle back where it was before returning, as mentioned.
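For example, once the total is known, skipping the first and last few lines might look like this (a sketch; count_remaining_lines is the sub sketched earlier, called from position 0, and the skip counts are made up):
my ($skip_head, $skip_tail) = (5, 2);            # illustrative values
my $total = count_remaining_lines($fh);          # counts from the start, since nothing has been read yet
seek $fh, 0, 0;

my $line_num = 0;
while (my $line = <$fh>) {
    ++$line_num;
    next if $line_num <= $skip_head;             # skip the first N lines
    last if $line_num >  $total - $skip_tail;    # stop before the last N lines
    print $line;                                 # ... process the line ...
}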
I think you need a circular buffer to avoid reading the entire file into memory.
skip-first-last.pl
#!/usr/bin/perl
use strict;
use warnings;

my ($first, $last) = @ARGV;

my @buf;
while (<STDIN>) {
    my $mod = $. % $last;
    print $buf[$mod] if defined $buf[$mod];
    $buf[$mod] = $_ if $. > $first;
}

1;
Skip first 5 lines and last 2 lines:
$ cat -n skip-first-last.pl | ./skip-first-last.pl 5 2
6
7 my @buf;
8 while (<STDIN>) {
9     my $mod = $. % $last;
10     print $buf[$mod] if defined $buf[$mod];
11     $buf[$mod] = $_ if $. > $first;
12 }
Related
I have a file that has a fixed length record with no newline characters.
Example: A file with 100 characters that has a fixed length record of 25 characters. (total of 4 records)
How can I read the file per record without having to store the data in a variable. (please see example below)
open my $fh, "<", "inputfile.txt" or die "can't open file\n";
my $data = <$fh>; # I would like to avoid storing the file contents in a variable
for (my $j = 0; $j < length $data; $j += 25) {
    my $record = substr($data, $j, 25);  # Get one record
    print "$record\n";
}
2nd option:
I can also use $_ to capture the data in the while loop. Am I doing the same thing as above in terms of consuming additional memory?
open my $fh, "<", "inputfile.txt" or die "can't open file\n";
while ( <$fh> ) {
    for (my $j = 0; $j < length $_; $j += 25) {
        my $record = substr($_, $j, 25);  # Get one record
        print "$record\n";
    }
}
The reason that I do not want to store it in a variable is that I am concerned that, if I am dealing with a very large file, it would consume twice as much space as just opening the file.
Am I making the correct assumption that I would be taking twice the space in memory as I did when I opened the file?
What would be the most efficient way to read the file without having to consume a lot of memory?
Please correct me if my question does not make sense.
Thanks :)
You can use read to read a specific number of characters from a file handle.
Attempts to read LENGTH characters of data into variable SCALAR from the specified FILEHANDLE. Returns the number of characters actually read, 0 at end of file, or undef if there was an error (in the latter case $! is also set). SCALAR will be grown or shrunk so that the last character actually read is the last character of the scalar after the read.
Here's a short example.
while (read(\*DATA, my $record, 3)) {
print $record, "\n";
}
__DATA__
foobarbazqrr
This will output
foo
bar
baz
qrr
If you read the whole file (as one line) at once, the space you would take up in memory would be the size of the entire file. It would only be double the size of reading one record at a time if the file had only two very long records.
As it's not mentioned yet - check out $/ - the record separator.
By default, it's linefeed "\n" and you read a file line by line.
However, you can set it to a reference to a numeric value - it has to be a reference, so it doesn't treat the literal string '25' as the delimiter.
Like so:
#!/usr/bin/env perl
use strict;
use warnings;
local $/ = \25;
while ( <DATA> ) {
print;
print "\n-- end of record --\n";
}
__DATA__
1234567890123456
12345636734345345345q34523 3 2134234213 35r25253 25252 2524gfartw345sadgw54723wqu745ewsdf
Your assumption is partly true. Reading the whole file into memory will require as much memory as the file itself uses. For example, if your file is 100 MB, reading it into memory will increase your memory use by 100 MB. This does not mean twice, because just opening the file does not require 100 MB.
As for the best way of reading the file record-by-record, this is it:
my $record_size = 25;
open my $fh, "<", "inputfile.txt" or die "can't open file\n";
while (read($fh, my $record, $record_size)) {
    print $record . "\n";
}
Also, consider opening your file in binary mode if it contains anything other than text.
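For example, a minimal sketch reusing the 25-byte records and file name from the question:
open my $fh, '<:raw', 'inputfile.txt' or die "can't open file: $!";
# or, equivalently, call binmode $fh; after a plain open

while (read($fh, my $record, 25)) {
    print "$record\n";
}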
I'm working on a 16GB file and a small file.
I tried to load both files into memory. Then, I looped over each line in the big file and validated something against the small file (for each line in the big file I iterated over the small one).
This is my code
local $/ = undef;
open my $fh1, '<', $in or die "error opening $in: $!";
my $input_file = do { local $/; <$fh1> };
local $/ = undef;
open my $fh2, '<', $handle or die "error opening $handle: $!";
my $handle_file = do { local $/; <$fh2> };
my $counter_yes = 0;
my $counter_no = 0;
my $flag = 0;
my @lines1 = split /\n/, $input_file;
foreach my $line ( @lines1 ) {
    my @f = split('\t', $line); # $f[0] and $f[1]
    print "f0 and f1 are: $f[0] and $f[1]\n";
    my @lines2 = split /\n/, $handle_file;
    foreach my $input ( @lines2 ) {
        #print "line2 is: $input\n";
        my @sp = split /:/, $input; # $sp[0] and $sp[1]
        if ( $sp[0] eq $f[0] ) {
            my @r = split /-/, $sp[1];
            if ( ($f[1] >= $r[0]) && ($f[1] <= $r[1]) ) {
                $flag = 1;
                $counter_yes = $counter_yes;
                last;
            }
        }
    }
    if ( $flag == 0 ) {
        $counter_no = $counter_no;
    }
}
While running it I get the error
Split loop at script.pl line 30, <$fh2> chunk 1
What can be the reason?
You can run perldoc perldiag to learn what some built in errors and warnings mean.
Split loop
(P) The split was looping infinitely. (Obviously, a split
shouldn't iterate more times than there are characters of input,
which is what happened.) See "split" in perlfunc.
The string you're splitting on is so large that Perl thought it was iterating infinitely. When Perl has split a string more times than the length of the string + 10, it gives this error, assuming it's in an infinite loop. Unfortunately for you, it stored that number as a 32-bit integer, which can only hold up to 2 billion and change. Your string is over 16 billion characters, so the result will be unpredictable.
This was recently fixed in 5.20, along with many other related problems with strings over 2 GB in size. So if you upgrade Perl, your code will "work".
However, your code is hideously inefficient and will crush the memory of most machines causing it to slow down terribly as it swaps to disk. At minimum you should only slurp in the small file and read the 16 gig file line by line.
my @small_data = <$small_fh>;
chomp @small_data;

while( my $big = <$big_fh> ) {
    chomp $big;

    for my $small (@small_data) {
        ...
    }
}
But even that is going to be terribly inefficient, if your small file contains 1000 lines then that loop will run 16 trillion times!
Since it seems like you're checking to see if entries in the big file are in the small file, you're better off turning the entries in the small file into a hash table.
my %fields;
while( my $line = <$small_fh> ) {
    chomp $line;
    my @sp = split /:/, $line;
    $fields{$sp[0]} = $sp[1];
}
Now you can iterate through the big file and just do a hash lookup.
while( my $line = <$big_fh> ) {
    chomp $line;
    my @f = split('\t', $line);
    if( defined $fields{$f[0]} ) {
        ...
    }
}
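Spelled out in full, that loop could look like the following sketch. I've assumed the small file's values still look like "start-end", and used ++ where the original assigned the counters to themselves:
while ( my $line = <$big_fh> ) {
    chomp $line;
    my @f = split /\t/, $line;

    my $matched = 0;
    if ( defined $fields{$f[0]} ) {
        my ($lo, $hi) = split /-/, $fields{$f[0]};
        $matched = 1 if $f[1] >= $lo && $f[1] <= $hi;
    }
    $matched ? $counter_yes++ : $counter_no++;
}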
Why are you reading the whole file into one big string and splitting it into an array of lines, when you could be reading it into an array of lines to begin with? And why do you do it over and over again for the second file? You can just
chomp(my @lines1 = <$fh1>);
chomp(my @lines2 = <$fh2>);
at the top of your program and eliminate $input_file and $handle_file which are otherwise unused, and all of the $/ nonsense. This could very well be the source of the problem, since the error message indicates that split is producing "too many" fields.
I'm working on a 16GB file and a small file.
I tried to load both files into memory.
Do you have 16GB of memory? Actually, your code requires more than 32GB of memory: you slurp the 16GB file into $input_file, and then split it into @lines1, which holds another 16GB worth of lines.
Split loop at script.pl line 30, chunk 1
I can't duplicate that error. Perl errors are usually pretty descriptive, yet that isn't even comprehensible.
Next, if you had this in your code:
my $x = 10;
#nothing changes $x
#in these
#lines
$x = 10;
What would be the purpose of the last line? Yet, you did this:
$/ = undef;
#Nothing changes $/
#in these lines
$/ = undef;
Next, all perl programs should start with the following lines:
<guess>
If you don't know, then you need to buy a beginning perl book.
Having this snippet:
my $file = "input.txt"; # let's assume that this is an ascii file
my $size1 = -s $file;
print "$size1\n";
$size2 = 0;
open F, $file;
$size2 += length($_) while (<F>);
close F;
print "$size2\n";
when can one assert that it is true that $size1 equals $size2?
If you don't specify an encoding that supports multibyte characters, it should hold. Otherwise, the result can be different:
$ cat 1.txt
žluťoučký kůň
$ perl -E 'say -s "1.txt";
open my $FH, "<:utf8", "1.txt";
my $f = do { local $/; <$FH> };
say length $f;'
20
14
You cannot, because the input layer may perform some conversion on the input, for example translating CRLF to LF, which may change the length of a line.
In addition, length $line counts how many characters are in $line; with a multi-byte encoding, as in the example given by @choroba, one character may occupy more than one byte.
See perlio for further details.
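To see the line-ending point concretely, here is a small self-contained demo (the file name is made up): write CRLF bytes raw, then read them back through the :crlf layer.
use strict;
use warnings;

my $file = 'crlf-demo.txt';

open my $out, '>:raw', $file or die $!;
print {$out} "abc\r\n";                    # 5 bytes on disk
close $out;

print -s $file, "\n";                      # 5

open my $in, '<:crlf', $file or die $!;    # translates CRLF to "\n" on read
my $line = <$in>;
close $in;
print length($line), "\n";                 # 4: the "\r" is gone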
No, as Lee Duhem says, the two numbers may be different because of Perl's end-of-line processing, or because length reports the size of the string in characters, which will throw the numbers out if there are any wide characters in the text.
However, the tell function will report the exact position in bytes that you have read up to, so an equivalent to your program for which the numbers are guaranteed to match is this
use strict;
use warnings;
my $file = 'input.txt';
my $size1 = -s $file;
print "$size1\n";
open my $fh, '<', $file or die $!;
my $size2 = 0;
while (<$fh>) {
$size2 = tell $fh;
}
close $fh;
print "$size2\n";
Please note the use of use strict and use warnings, the lexical file handle, the three-parameter form of open, and the check that it succeeded. All of these are best practices for Perl programs and should be used in everything you write.
You're simply missing binmode(F); or the :raw IO layer. These cause Perl to return the file exactly as it appears on disk. No line ending translation. No decoding of character encodings.
open(my $fh, '<:raw', $file)
    or die "open $file: $!\n";
Then your code works fine.
my $size = 0;
$size += length while <$fh>;
That's not particularly good because it could read the entire file at once for binary files. So let's read fixed-sized blocks instead.
local $/ = \(64*1024);
my $size = 0;
$size += length while <$fh>;
That's basically the same as using read, which reads 4K or 8K (in newer Perls) at a time. There are performance benefits to reading more than that at a time, and we can use sysread to do that.
my $size = 0;
while (my $bytes_read = sysread($fh, my $buf, 64*1024)) {
$size += $bytes_read;
}
Reading the whole file is silly, though. You could just seek to the end of the file.
use Fcntl qw( SEEK_END );
my $size = sysseek($fh, 0, SEEK_END);
But then again, you might as well just use -s.
my $size = -s $fh;
I have a CSV file which has more than 10 lakh (1,000,000) rows of data. I want to use binary::tree to keep memory usage low.
The main purpose of this program is to take the first 5 digits (of the second column), create a new file named after those five digits, and store in it the data for that five-digit prefix.
My code works, but it uses a lot of memory.
Right now I am using this code:
my $file = "my_csv_file.csv";
open (my $data, '<', $file) or die "Could not open '$file' $!\n";
while (my $lines = <$data>) {
    my @fields = split "," , $lines unless $. == 1;
    my $first_five = substr ($fields[1], 0, 5,);
    if (-e "$first_five.csv" ) {
        open my $fh, '>>', "$first_five.csv" or die $!;
        print { $fh } $lines;
    } else {
        open my $fh, '>>', "$first_five.csv" or die $!;
        print $fh "Title\n";
    }
    close $fh;
}
close $data;
I believe the performance bottleneck in your script is not memory usage at all, but rather that you open and close a file for every record. If I understood the units correctly, 10 lakh is 1,000,000, so that's quite a lot of opens and closes.
One solution would be to process the data in batches, particularly if you have many repeated keys in the "first 5" that you extract as the key.
I benchmarked your program against the one below on a synthetic file containing 100 unique 5-digit keys in that second field, but 10,000,000 records (10x the size of your file). The rows looked like this:
1,9990001,----------------------------------------------------------------------------------------------------
2,9990002,----------------------------------------------------------------------------------------------------
3,9990003,----------------------------------------------------------------------------------------------------
I did this to simulate a moderately large amount of data in the input. It should be about 10x the number of records as your input file.
Your original script took over 2 minutes to process this input on my computer. The following script, using batches of 10,000 records, took 24 seconds. That's over 5x as fast.
my $file = "my_csv_file.csv";
open (my $data, '<', $file) or die "Could not open '$file' $!\n";
sub write_out
{
    my $batch = shift;
    for my $first_five (keys %$batch)
    {
        my $file_name = $first_five . ".csv";
        my $need_title = ! -e $file_name;

        open my $fh, '>>', $file_name or die $!;
        print $fh "Title\n" if $need_title;
        print $fh @{ $batch->{$first_five} };
        close $fh;
    }
}

my ($line, $batch, $count);

$batch = { };
$count = 0;

while ($line = <$data>)
{
    next if $. == 1;
    if ($line =~ /^[^,]*,(.....)/)
    {
        push @{ $batch->{$1} }, $line;
        if (++$count > 10000) # adjust as needed
        {
            write_out $batch;
            $batch = { };
            $count = 0;
        }
    }
}

write_out $batch if $count; # write final batch
close $data;
Now, I did notice one difference between my script's output and yours: Yours seems to drop the first line of output for each destination .csv file, putting the word Title in its place. I assume that was an error. My script above adds a row named Title, without dropping the first instance of a given "first five."
If you want the previous behavior, you can change it in sub write_out.
I did some additional experiments. I changed the batch size to 10,000,000, so that write_out only gets called once. The memory usage did grow quite a bit, and the run time only came down to 22 seconds. I also tried changing the batch size down to 100. The memory usage dropped dramatically, but the run time went up to around 30 seconds. This suggests that file open/close are the true bottleneck.
So, by changing the batch size, you can control the memory footprint vs. the run time. In any case, the batch-oriented code should be much faster than your current approach.
Edit: I did a further benchmark using a second 10 million record input, this time fully randomizing the 5-digit keys. The resulting output writes 100,000 files named 00000.csv through 99999.csv. The original script takes around 3 minutes to run, and my script above (with a batch size of 1000000) takes about 1:26, so approximately twice as fast.
The bottleneck is not the script itself, but filesystem operations. Creating / updating 100,000 files is inherently expensive.
I have tab delimited data with multiple columns.
I have OS names in column 31 and data bytes in columns 6 and 7. What I want to do is count the total volume of each unique OS.
So, I did something in Perl like this:
#!/usr/bin/perl
use warnings;
my @hhfilelist = glob "*.txt";
my %count = ();

for my $f (@hhfilelist) {
    open F, $f || die "Cannot open $f: $!";
    while (<F>) {
        chomp;
        my @line = split /\t/;
        # counting volumes in col 6 and 7 for 31
        $count{$line[30]} = $line[5] + $line[6];
    }
    close (F);
}

my $w = 0;
foreach $w (sort keys %count) {
    print "$w\t$count{$w}\n";
}
So, the result would be something like
Windows 100000
Linux 5000
Mac OSX 15000
Android 2000
But there seems to be some error in this code because the resulting values I get aren't as expected.
What am I doing wrong?
It seems like you're not actually ADDING up the counts - you overwrite the last count for any OS with the count from the last line for that OS.
$count{$line[30]} = $line[5] + $line[6];
Should be
$count{$line[30]} += $line[5] + $line[6];
As additional considerations that can improve your code overall but don't affect the correctness of it:
Please use the 3-argument form of open and lexical filehandles:
open(my $filehandle, "<", $f) || die "Cannot open $f: $!";
If you're 100% sure your file doesn't contain quoted field values or tabs in the field contents, your split-based logic is OK. For really complicated X-separated files, I would strongly recommend the Text::CSV_XS/Text::CSV CPAN module (see the sketch after this list).
You don't need to initialize the %count or $w variables: the hash is automatically initialized to an empty hash, and $w is assigned to as the loop variable. You may want to actually declare it in the loop itself: foreach my $w (sort keys %count) {
Please don't use 1-letter variables. $w is meaningless in the last loop, whereas $os_name is clear.
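Here is a minimal sketch of the Text::CSV approach mentioned above (the file name data.txt is made up; the column indexes are the ones from the question, and += incorporates the fix discussed earlier):
use strict;
use warnings;
use Text::CSV;    # or Text::CSV_XS, which is faster

# Treat the input as tab-separated
my $csv = Text::CSV->new({ binary => 1, sep_char => "\t" })
    or die "Cannot use Text::CSV: " . Text::CSV->error_diag;

my %count;
open my $fh, '<', 'data.txt' or die "Cannot open data.txt: $!";
while (my $row = $csv->getline($fh)) {
    # columns 31, 6 and 7 are indexes 30, 5 and 6
    $count{ $row->[30] } += $row->[5] + $row->[6];
}
close $fh;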
Your expression
open F, $f || die "Cannot open $f: $!";
has a subtle bug in it that will eventually bite you, though probably not today.
The || operator has higher precedence than the comma-operator to its left and so this expression actually gets parsed as
open F, ($f || die "Cannot open $f: $!")
which is to say, you will die when $f has a false (0, "", or undef) value, not when the open statement fails to open the file with the name given by $f.
To do what you mean, you could either use parentheses:
open (F, $f) || die ...
or use the alternate low-precedence or operator
open F, $f or die ...
(At times I have been bitten by this myself)
$count{$line[30]} = $line[5] + $line[6];
should use the += operator to add the row's sum to the total, rather than set it as the total:
$count{$line[30]} += $line[5] + $line[6];