I have a program in Perl that reads one line at a time from a data file and computes certain statistics for each line of data. Every now and then, while the program reads through my dataset, I get a warning about an ...uninitialized value... and I would like to know which line of data generates this warning.
Is there any way I can tell Perl to print (to screen or file) the data point that is generating the error?
If your script prints one line for each input line, it would be simpler to see when the error occurs by flushing the standard error along with the output (making the message appear at the "right" point):
$| = 1;
That is, turn on the perl autoflush feature, as discussed in How to flush output to the console?
What (auto)flushing does:
error messages are written to the predefined STDERR stream, normal printf's go to the (default) predefined STDOUT.
data written on these streams is saved up by the system (under Perl's control) to write out in chunks (called "buffers") to improve efficiency.
STDERR and STDOUT buffers are independent, and could be written line-by-line or by buffers (many characters, not necessarily lines).
using autoflush tells Perl to modify its scheme for writing buffers so that their content is written via the operating system at the end of a print/printf call.
normally STDERR is written line-by-line. The command tells Perl to enable this feature for the current/default stream, i.e., STDOUT.
doing that makes both of them write line-by-line, so that messages sent close in time via either appears close together in the output of your script.
Perl usually includes the file handle and line number in warnings by default; i.e.
>echo hello | perl -lnwe 'print $x'
Name "main::x" used only once: possible typo at -e line 1.
Use of uninitialized value $x in print at -e line 1, <> line 1.
So if you're doing the computation while reading, you get the appropriate warning.
I'm having an issue with Perl and I'm hoping someone here can help me figure out what's going on. I have about 130,000 .txt files in a directory called RawData and I have a Perl program that loads them into an array, then loops through this array, loading each .txt file. For simplicity, suppose I have four text files I'm looping through
File1.txt
File2.txt
File3.txt
File4.txt
The contents of each .txt file look something like this:
007 C03XXYY ZZZZ
008 A01XXYY ZZZZ
009 A02XXYY ZZZZ
where X,Y,Z are digits. In my simplified code below, the program then pulls out just line 007 in each .txt file, saves XX as ID, ignores YY and grabs the variable data ZZZZ that I've called VarVal. Then it writes everything to a file with a header specified in the code below:
#!/usr/bin/perl
use warnings;
use strict;
open(OUTFILE, "> ../Data/OutputFile.csv") or die $!;
opendir(MYDIR,"../RawData")||die $!;
my #txtfiles=grep {/\.txt$/} readdir(MYDIR);
closedir(MYDIR);
print OUTFILE "ID,VarName,VarVal\n";
foreach my $txtfile (#txtfiles){
#Prints to the screen so I can see where I am in the loop.
print $txtfile","\n";
open(INFILE, "< ../RawData/$txtfile") or die $!;
while(<INFILE>){
if(m{^007 C03(\d{2})(\d+)(\s+)(.+)}){
print OUTFILE "$1,VarName,$4\n"
}
}
}
The issue I'm having is that the contents of, for example File3.txt, don't show up in OutputFile.csv. However, it's not an issue with Perl not finding a match because I checked that the if statement is being executed by deleting OUTFILE and looking at what the code prints to the terminal screen. What shows up is exactly what should be there.
Furthermore, If I just run the problematic file (File3.txt) through the loop itself by commenting out the opendir and closedir stuff and doing something like my #textfile = "File3.txt";. Then when I run the code, the only data that shows up in the OutputFile.csv IS what's in File3.txt. But when it goes through the loop, it won't show up in OutputFile.csv. Plus, I know that File3.txt is being sent to into the loop because I can see it being printed on the screen with print $txtfile","\n";. I'm at a loss as to what is going on here.
The other issue is that I don't think it's something specific to this one particular file (maybe it is) but I can't just troubleshoot this one file because I have 130,000 files and I just happened to stumble across the fact that this one wasn't being written to the output file. So there may be other files that also aren't getting written, even though there is no obvious reason they shouldn't be just like the case of File3.txt.
Perhaps because I'm doing so many files in rapid succession, looping 130,000 files, causes some sort of I/O issues that randomly fails every so often to write the contents in memory to the output file? That's my best guess but I have not idea how to diagnose or fix this.
This is kind of a difficult question to debug, but I'm hoping someone on here has some insight or has seen similar problems that would provide me with a solution.
Thanks
There's nothing obviously wrong that I can see in your code. It is a little outdated as using autodie and lexical filehandles would be better.
However, I would recommend that you make your regex slightly less restrictive by making the spacing variable length after the first value and making the last variable optionally of 0 length. I'd also output the filename as well. Then you can see which other files aren't being caught for whatever reason:
if (m{^007\s+C03(\d{2})\d+\s+(.*)}){
print OUTFILE "$txtfile $1,VarName,$2\n";
last;
}
Finally, assuming there is only a single 007 C03 in each file, you could throw in a last call after one is found.
You may want to try sorting the #txtfiles list, then trying to systematically look through the output to see what is or isn't there. With 130k files in random order, it would be pretty difficult to be certain that you missed one. Perl should be giving you the files in the actual order they appear in the directory, which is different that user level commands like ls, so it may be different that you'd expect.
Perl 5.14 from stock Ubuntu Precise repos. Trying to write a simple wrapper to monitor progress on copying from one stream to another:
use IO::Handle;
while ($bufsize = read (SOURCE, $buffer, 1048576)) {
STDERR->printflush ("Transferred $xferred of $sendsize bytes\n");
$xferred += $bufsize;
print TARGET $buffer;
}
This does not perform as expected (writing a line each time the 1M buffer is read). I end up seeing the first line (with a blank value of $xferred), and then nothing until everything flushes on the 7th and 8th lines (on an 8MB transfer). Been pounding my brains out on this for hours - I've read the perldocs, I've read the classic "Suffering from Buffering" article, I've tried everything from select and $|++ to IO::Handle to binmode (STDERR, "::unix") to you name it. I've also tried flushing TARGET with each line using IO::Handle (TARGET->flush). No dice.
Has anybody else ever encountered this? I don't have any ideas left. Sleeping one second "fixes" the problem, but obviously I don't want to sleep a second every time I read a buffer just so my progress will output on the screen!
FWIW, the problem is exactly the same whether I'm outputting to STDERR or STDOUT.
Perl read calls fread(3), not read(2).
This means that it goes through libc and may be using an internal buffer larger than yours; i.e., it gets all the data there is to be received and then quickly throws it at you in 1MB increments.
If this conjecture is correct, the solution might be to use sysread, which calls read(2), instead of read.
I have a Perl script which reads three files and writes new files after reading each one of them. Everything is one thread.
In this script, I open and work with three text files and store the contents in a hash. The files are large (close to 3 MB).
I am using a loop to go through each of the files (open -> read -> Do some action (hash table) -> close)
I am noticing that the whenever I am scanning through the first file, the Perl terminal window in my Cygwin shell gets stuck. The moment I hit the enter key I can see the script process the rest of the files without any issues.
It's very odd as there is no read from STDIN in my script. Moreover, the same logic applies to all the three files as everything is in the same loop.
Has anyone here faced a similar issue? Does this usually happen when dealing with large files or big hashes?
I can't post the script here, but there is not much in it to post anyway.
Could this just be a problem in my Cygwin shell?
If this problem does not go away, how can I circumvent it? Like providing the enter input when the script is in progress? More importantly, how can I debug such a problem?
sub read_set
{
#lines_in_set = ();
push #lines_in_set , $_[0];
while (<INPUT_FILE>)
{ $line = $_;
chomp($line);
if ($line=~ /ENDNEWTYPE/i or $line =~ /ENDSYNTYPE/ or eof())
{
push #lines_in_set , $line;
last;
}
else
{
push #lines_in_set , $line;
}
}
return #lines_in_set;
}
--------> I think i found the problem :- or eof() call was ensuring that the script would be stuck !! Somehow happening only at the first time. I have no idea why though
The eof() call is the problem. See perldoc -f eof.
eof with empty parentheses refers to the pseudo file accessed via while (<>), which consists of either all the files named in #ARGV, or to STDIN if there are none.
And in particular:
Note that this function actually reads a character and then "ungetc"s it, so isn't useful in an interactive context.
But your loop reads from another handle, one called INPUT_FILE.
It would make more sense to call eof(INPUT_FILE). But even that probably isn't necessary; your outer loop will terminate when it reaches the end of INPUT_FILE.
Some more suggestions, not related to the symptoms you're seeing:
Add
use strict;
use warnings;
near the top of your script, and correct any error messages this produces (perl -cw script-name does a compile-only check). You'll need to declare your variables using my (perldoc -f my). And use consistent indentation; I recommend the same style you'll find in most Perl documentation.
I have a 10GB file with 200 million lines. I need to get unique lines of this file.
My code:
while(<>) {
chomp;
$tmp{$_}=1;
}
#print...
I only have 2GB memory. How can I solve this problem?
As I commented on David's answer, a database is the way to go, but a nice one might be DBM::Deep since its pure-Perl and easy to install and use; its essentially a Perl hash tied to a file.
use DBM::Deep;
tie my %lines, 'DBM::Deep', 'data.db';
while(<>) {
chomp;
$lines{$_}=1;
}
This is basically what you already had, but the hash is now a database tied to a file (here data.db) rather than kept in memory.
If you don't care about preserving order, I bet the following is faster than the previously posted solutions (e.g. DBM::Deep):
sort -u file
In most cases, you could store the line as a key in a hash. However, when you get this big, this really isn't very efficient. In this case, you'd be better off using a database.
One thing to try is the Berkeley Database that use to be included in Unix (BDB). Now, it's apparently owned by Oracle.
Perl can use the BerkeleyDB module to talk with a BDB database. In fact, you can even tie a Perl hash to a BDB database. Once this is done, you can use normal Perl hashes to access and modify the database.
BDB is pretty robust. Bitcoins uses it, and so does SpamAssassin, so it is very possible that it can handle the type of database you have to create in order to find duplicate lines. If you already have DBD installed, writing a program to handle your task shouldn't take that long. If it doesn't work, you wouldn't have wasted too much time with this.
The only other thing I can think of is using an SQL database which would be slower and much more complex.
Addendum
Maybe I'm over thinking this...
I decided to try a simple hash. Here's my program:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;
use constant DIR => "/usr/share/dict";
use constant WORD_LIST => qw(words web2a propernames connectives);
my %word_hash;
for my $count (1..100) {
for my $file (WORD_LIST) {
open my $file_fh, "<", DIR . "/$file";
while (my $word = <$file_fh>) {
chomp $word;
$word_hash{"$file-$word-$count"} = $word;
}
}
}
The files read in contain a total of about 313,000 lines. I do this 100 times to get a hash with 31,300,000 keys in it. It is about as inefficient as it can be. Each and every key will be unique. The amount of memory will be massive. Yet...
It worked. It took about 10 minutes to run despite the massive inefficiencies in the program, and it maxed out at around 6 gigabytes. However, most of that was in virtual memory. Strangely, even though it was running, gobbling memory, and taking 98% of the CPU, my system didn't really slow down all that much. I guess the question really is what type of performance are you expecting? If taking about 10 minutes to run isn't that much of an issue for you, and you don't expect this program to be used that often, then maybe go for simplicity and use a simple hash.
I'm now downloading DBD from Oracle, compiling it, and installing it. I'll try the same program using DBD and see what happens.
Using a BDB Database
After doing the work, I think if you have MySQL installed, using Perl DBI would be easier. I had to:
Download Berkeley DB from Oracle, and you need an Oracle account. I didn't remember my password, and told it to email me. Never got the email. I spent 10 minutes trying to remember my email address.
Once downloaded, it has to be compiled. Found directions for compiling for the Mac and it seemed pretty straight forward.
Running CPAN crashed. Ends up that CPAN is looking for /usr/local/BerkeleyDB and it was installed as /usr/local/BerkeleyDB.5.3. Creating a link fixed the issue.
All told, about 1/2 an hour getting BerkeleyDB installed. Once installed, modifying my program was fairly straight forward:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;
use BerkeleyDB;
use constant {
DIR => "/usr/share/dict",
BDB_FILE => "bdb_file",
};
use constant WORD_LIST => qw(words web2a propernames connectives);
unlink BDB_FILE if -f BDB_FILE;
our %word_hash;
tie %word_hash, "BerkeleyDB::Hash",
-Filename => BDB_FILE,
-Flags => DB_CREATE
or die qq(Cannot create DBD_Database file ") . BDB_FILE . qq("\n);
for my $count (1..10) {
for my $file (WORD_LIST) {
open my $file_fh, "<", DIR . "/$file";
while (my $word = <$file_fh>) {
chomp $word;
$word_hash{"$file-$word-$count"} = $word;
}
}
}
All I had to do was add a few lines.
Running the program was a disappointment. It wasn't faster, but much, much slower. It took over 2 minutes while using a pure hash took a mere 13 seconds.
However, it used a lot less memory. While the old program gobbled gigabytes, the BDB version barely used a megabyte. Instead, it created a 20MB database file.
But, in these days of VM and cheap memory, did it accomplish anything? In the old days before virtual memory and good memory handling, a program would crash your computer if it used all of the memory (and memory was measured in megabytes and not gigabytes). Now, if your program wants more memory than is available, it simply is given virtual memory.
So, in the end, using a Berkeley database is not a good solution. Whatever I saved in programming time by using tie was wasted with the installation process. And, it was slow.
Using BDB simply used a DBD file instead of memory. A modern OS will do the same, and is faster. Why do the work when the OS will handle it for you?
The only reason to use a database is if your system really doesn't have the required resources. 200 million lines is a big file, but a modern OS will probably be okay with it. If your system really doesn't have the resource, use a SQL database on another system, and not a DBD database.
You might consider calculating a hash code for each line, and keeping track of (hash, position) mappings. You wouldn't need a complicated hash function (or even a large hash) for this; in fact, "smaller" is better than "more unique", if the primary concern is memory usage. Even a CRC, or summing up the chars' codes, might do. The point isn't to guarantee uniqueness at this stage -- it's just to narrow the candidate matches down from 200 million to a few dozen.
For each line, calculate the hash and see if you already have a mapping. If you do, then for each position that maps to that hash, read the line at that position and see if the lines match. If any of them do, skip that line. If none do, or you don't have any mappings for that hash, remember the (hash, position) and then print the line.
Note, i'm saying "position", not "line number". In order for this to work in less than a year, you'd almost certainly have to be able to seek right to a line rather than finding your way to line #1392499.
If you don't care about time/IO constraints, nor disk constraints (e.g. you have 10 more GB space), you can do the following dumb algorithm:
1) Read the file (which sounds like it has 50 character lines). While scanning it, remember the longest line length $L.
2) Analyze the first 3 characters (if you know char #1 is identical - say "[" - analyze the 3 characters in position N that is likely to have more diverse ones).
3) For each line with 3 characters $XYZ, append that line to file 3char.$XYZ and keep the count of how many lines in that file in a hash.
4) When your entire file is split up that way, you should have a whole bunch (if the files are A-Z only, then 26^3) of smaller files, and at most 4 files that are >2GB each.
5) Move the original file into "Processed" directory.
6) For each of the large files (>2GB), choose the next 3 character positions, and repeat steps #1-#5, with new files being 6char.$XYZABC
7) Lather, rinse, repeat. You will end up with one of the 2 options eventually:
8a) A bunch of smaller files each of which is under 2GB, all of which have mutually different strings, and each (due to its size) can be processed individually by standard "stash into a hash" solution in your question.
8b) Or, most of the files being smaller, but, you have exausted all $L characters while repeating step 7 for >2GB files, and you still have between 1-4 large files. Guess what - since
those up-to-4 large files have identical characters within a file in positions 1..$L, they can ALSO be processed using the "stash into a hash" method in your question, since they are not going to contain more than a few distinct lines despite their size!
Please note that this may require - at the worst possible distributions - 10GB * L / 3 disk space, but will ONLY require 20GB disk space if you change step #5 from "move" to "delete".
Voila. Done.
As an alternate approach, consider hashing your lines. I'm not a hashing expert but you should be able to compress a line into a hash <5 times line size IMHO.
If you want to be fancy about this, you will do a frequency analysis on character sequences on the first pass, and then do compression/encoding this way.
If you have more processor and have at least 15GB free space and your storage fast enough, you could try this out. This will process it in paralel.
split --lines=100000 -d 4 -d input.file
find . -name "x*" -print|xargs -n 1 -P10 -I SPLITTED_FILE sort -u SPLITTED_FILE>unique.SPLITTED_FILE
cat unique.*>output.file
rm unique.* x*
You could break you file into 10 1 Gbyte files Then reading in one file at a time, sorting lines from that file and writing it back out after they are sorted. Opening all of the 10 files and merge them back into one file (making sure you you merge them in the correct order). Open an output file to save the unique lines. Then read the merge file one line at a time, keeping the last line for comparison. If the last line and the current line are not a match, write out the last line and save the current line as the last line for comparison. Otherwise get the next line from the merged file. That will give you a file which has all of the unique lines.
It may take a while to do this, but if you are limited on memory, then breaking the file up and working on parts of it will work.
It may be possible to do the comparison when writing out the file, but that would be a bit more complicated.
Why use perl for this at all? posix shell:
sort | uniq
done, let's go drink beers.