I want to parse an 8 GB file to find some information. This takes more than 4 hours to finish. I have tried the Perl Parallel::ForkManager module for this, but it doesn't make much difference. What is a better way to implement this?
The following is the part of the code used to parse this jumbo file. I have a list of domains that I have to look up in an 8 GB zone file to find out which company each one is hosted with.
unless(open(FH, $file)) {
print $LOG "Can't open '$file' $!";
die "Can't open '$file' $!";
}
### Reading Zone file : $file
DOMAIN: while(my $line = <FH> ){
#domain and the dns with whom he currently hosted
my($domain, undef, $new_host) = split(/\s|\t/, $line);
next if $seen{$domain};
$seen{$domain} =1;
$domain.=".$domain_type";
$domain = lc ($domain);
#already in?
if($moved_domains->{$domain}){
#Get the next domain if this on the same host, there is nothing to record
if($new_host eq $moved_domains->{$domain}->{PointingHost}){
next DOMAIN;
}
#movedout
else{
@INSERTS = ($domain, $data_date, $new_host, $moved_domains->{$domain}->{Host});
log_this($data_date, $populate, @INSERTS);
}
delete $moved_domains->{$domain};
}
#new to MovedDomain
else{
#is this any of our interested HOSTS
my ($interested) = grep{$new_host =~/\b$_\b/i} keys %HOST;
#if not any of our interested DNS, NEXT!
next DOMAIN if not $interested;
@INSERTS = ($domain, $data_date, $new_host, $HOST{$interested});
log_this($data_date, $populate, @INSERTS);
}
next DOMAIN;
}
A basic line-by-line parsing pass through a 1GB file -- for example, running a regex or something -- takes just a couple of minutes on my 5-year-old Windows box. Even if the parsing work is more extensive, 4 hours sounds like an awfully long time for 8GB of data.
Are you sure that your code does not have a glaring inefficiency? Are you storing a lot of information during the parsing and bumping up against your RAM limits? CPAN has tools that will allow you to profile your code, notably Devel::NYTProf.
Before going through the hassle of parallelizing your code, make sure that you understand where the bottleneck is. If you explain what you are doing or, even better, provide code that illustrates the problem in a compact way, you might get better answers.
With the little information you've given: Parallel::ForkManager sounds like an appropriate tool. But you're likely to get better help if you give more detail about your problem.
Parallelizing is always a difficult problem. How much you can hope to gain depends a lot on the nature of the task. For example, are you looking for a specific line in the file? Or a specific fixed-size record? Or all the chunks that match a particular bit pattern? Do you process the file from beginning to end, or can you skip some parts, or do you do a lot of shuffling back and forth? etc.
Also is the 8GB file an absolute constraint, or might you be able to reorganize the data to make the information easier to find?
With the speeds you're giving, if you're just going through the file once, I/O is not the bottleneck, but it's close. It could be the bottleneck if other processes are accessing the disk at the same time. It may be worth fine-tuning your disk access patterns; this would be somewhat OS- and filesystem-dependent.
What is the best way to make this script faster, for example by running it in parallel?
#!/usr/bin/perl
use warnings;
use strict;
use threads;
open(R1 ,"<$ARGV[0]") || die " problem in opening $ARGV[0]: $!\n";
my %dict1 : shared;
my $i=0;
while (my $l = <R1>){
chomp($l);
$l=~ s/\s$//;
$l=~ s/^\s//;
if ($l=~ /(.*)\s(.*)/){
$i++;
#print $1,"\n";
#my $t = threads->create($dict1{$1}++);
$dict1{$1}++;
}
}
print $i, "\n";
close R1;
You can build an array of $N elements that hold the starting offsets of $N equal parts of the file:
my $N = 6;
my $step = int( $file_size/$N );
my @arr = map { ($_-1) * $step } 1 .. $N;
Then correct these numbers so that each one falls on a line boundary: seek to each file position (perldoc -f seek), read the rest of the line (perldoc -f readline), and record the corrected file position (perldoc -f tell).
Start $N threads, each of which already knows where to start its extraction job, and join their results at the end. However, you may find that reading from the media is the actual bottleneck, as @ikegami already pointed out.
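The seek / readline / tell correction described above could look roughly like the following. This is only a sketch: chunk_starts() is a made-up helper name, and it assumes a plain text file with newline-terminated lines.

```perl
use strict;
use warnings;

# Return up to $n byte offsets into $path, each one at the start of a line.
# chunk_starts() is a hypothetical helper, not part of any module.
sub chunk_starts {
    my ($path, $n) = @_;
    open my $fh, '<', $path or die "Can't open '$path': $!";
    my $size = -s $fh;
    my $step = int($size / $n);

    my @starts = (0);
    for my $i (1 .. $n - 1) {
        my $guess = $i * $step;
        next if $guess <= $starts[-1];       # chunk swallowed by a long line
        # Step back one byte so that a line beginning exactly at $guess is kept.
        seek $fh, $guess - 1, 0 or die "Can't seek in '$path': $!";
        scalar <$fh>;                        # read to the end of the current line
        my $pos = tell $fh;                  # first byte of the next full line
        push @starts, $pos if $pos > $starts[-1] && $pos < $size;
    }
    close $fh;
    return @starts;
}
```

Worker number $i would then process everything from $starts[$i] up to $starts[$i+1] (or to the end of the file for the last worker). Fewer than $N offsets can come back if the file has very long lines.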
The most likely case is that you're being limited by the speed that your data can be read from the disk ("I/O bound"), not by processing time ("CPU bound"). If that's the case, then there is nothing you can do with threads or parallel execution to speed this up - if it has any effect at all, parallelization would slow you down by forcing the disk to jump back and forth between the parts of the file being read by the different processes/threads.
An easy way to test whether this is the case would be to open a shell and run the command
cat my_data_file > /dev/null
This should tell you how long it takes to just read the file from disk, without actually doing anything to it. If it's roughly the same as the time it takes to run your program on my_data_file, then don't bother trying to optimize or speed it up. You can't. There are only two ways to improve performance in that situation:
Change the way your code works so that you don't need to read the entire file. If you're dealing with something that will run multiple times, indexing records in the file or using a database may help, but that doesn't do any good if it's a one-shot operation (since you'd still need to read the whole file once to create the index/database).
Use faster storage media.
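If cat is not available (on Windows, say), the read-only timing test described above can be approximated from Perl itself. A minimal sketch, with raw_read_time() as a made-up name and the 1 MB block size chosen arbitrarily:

```perl
use strict;
use warnings;
use Time::HiRes qw(time);

# Read $path in 1 MB blocks and throw the data away, like `cat file > /dev/null`.
# Returns (bytes read, elapsed seconds). raw_read_time() is a hypothetical name.
sub raw_read_time {
    my ($path) = @_;
    open my $fh, '<:raw', $path or die "Can't open '$path': $!";
    my ($bytes, $buf) = (0, '');
    my $t0 = time;
    while (my $n = sysread $fh, $buf, 1 << 20) {
        $bytes += $n;
    }
    close $fh;
    return ($bytes, time - $t0);
}

if (@ARGV) {
    my ($bytes, $secs) = raw_read_time($ARGV[0]);
    printf "%d bytes in %.1f s\n", $bytes, $secs;
}
```

Keep in mind that a second run over the same file may be served from the operating system's file cache and look much faster than a cold read.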
If you're not I/O bound, the next most likely case is that you're memory bound - the data won't all fit in memory at once, causing the disk to thrash as it moves chunks of data into and out of virtual memory. Once again, parallelizing the process would make things worse, not better.
The solutions in this case are similar to before:
Change what you're doing so that you don't need all the data in memory at once. In this case, indexing or a database would probably be beneficial even for a one-off operation.
Buy more RAM.
Unless you're doing much heavier processing of the data than just the few regexes and stuffing it in a hash that you've shown, though, you're definitely not CPU bound and parallelization will not provide any benefit.
I have a 10GB file with 200 million lines. I need to get unique lines of this file.
My code:
while(<>) {
chomp;
$tmp{$_}=1;
}
#print...
I only have 2GB memory. How can I solve this problem?
As I commented on David's answer, a database is the way to go, but a nice one might be DBM::Deep since it's pure Perl and easy to install and use; it's essentially a Perl hash tied to a file.
use DBM::Deep;
tie my %lines, 'DBM::Deep', 'data.db';
while(<>) {
chomp;
$lines{$_}=1;
}
This is basically what you already had, but the hash is now a database tied to a file (here data.db) rather than kept in memory.
If you don't care about preserving order, I bet the following is faster than the previously posted solutions (e.g. DBM::Deep):
sort -u file
In most cases, you could store the line as a key in a hash. However, when you get this big, this really isn't very efficient. In this case, you'd be better off using a database.
One thing to try is the Berkeley Database (BDB) that used to be included in Unix. Now, it's apparently owned by Oracle.
Perl can use the BerkeleyDB module to talk with a BDB database. In fact, you can even tie a Perl hash to a BDB database. Once this is done, you can use normal Perl hashes to access and modify the database.
BDB is pretty robust. Bitcoin uses it, and so does SpamAssassin, so it is very possible that it can handle the type of database you have to create in order to find duplicate lines. If you already have BDB installed, writing a program to handle your task shouldn't take that long. If it doesn't work, you wouldn't have wasted too much time with this.
The only other thing I can think of is using an SQL database which would be slower and much more complex.
Addendum
Maybe I'm over thinking this...
I decided to try a simple hash. Here's my program:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;
use constant DIR => "/usr/share/dict";
use constant WORD_LIST => qw(words web2a propernames connectives);
my %word_hash;
for my $count (1..100) {
for my $file (WORD_LIST) {
open my $file_fh, "<", DIR . "/$file";
while (my $word = <$file_fh>) {
chomp $word;
$word_hash{"$file-$word-$count"} = $word;
}
}
}
The files read in contain a total of about 313,000 lines. I do this 100 times to get a hash with 31,300,000 keys in it. It is about as inefficient as it can be. Each and every key will be unique. The amount of memory will be massive. Yet...
It worked. It took about 10 minutes to run despite the massive inefficiencies in the program, and it maxed out at around 6 gigabytes. However, most of that was in virtual memory. Strangely, even though it was running, gobbling memory, and taking 98% of the CPU, my system didn't really slow down all that much. I guess the question really is what type of performance are you expecting? If taking about 10 minutes to run isn't that much of an issue for you, and you don't expect this program to be used that often, then maybe go for simplicity and use a simple hash.
I'm now downloading BDB from Oracle, compiling it, and installing it. I'll try the same program using BDB and see what happens.
Using a BDB Database
After doing the work, I think if you have MySQL installed, using Perl DBI would be easier. I had to:
Download Berkeley DB from Oracle, and you need an Oracle account. I didn't remember my password, and told it to email me. Never got the email. I spent 10 minutes trying to remember my email address.
Once downloaded, it has to be compiled. I found directions for compiling on the Mac and it seemed pretty straightforward.
Running CPAN crashed. It turned out that CPAN was looking for /usr/local/BerkeleyDB and it was installed as /usr/local/BerkeleyDB.5.3. Creating a link fixed the issue.
All told, it took about half an hour to get BerkeleyDB installed. Once installed, modifying my program was fairly straightforward:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;
use BerkeleyDB;
use constant {
DIR => "/usr/share/dict",
BDB_FILE => "bdb_file",
};
use constant WORD_LIST => qw(words web2a propernames connectives);
unlink BDB_FILE if -f BDB_FILE;
our %word_hash;
tie %word_hash, "BerkeleyDB::Hash",
-Filename => BDB_FILE,
-Flags => DB_CREATE
or die qq(Cannot create BDB database file ") . BDB_FILE . qq("\n);
for my $count (1..10) {
for my $file (WORD_LIST) {
open my $file_fh, "<", DIR . "/$file";
while (my $word = <$file_fh>) {
chomp $word;
$word_hash{"$file-$word-$count"} = $word;
}
}
}
All I had to do was add a few lines.
Running the program was a disappointment. It wasn't faster, but much, much slower. It took over 2 minutes while using a pure hash took a mere 13 seconds.
However, it used a lot less memory. While the old program gobbled gigabytes, the BDB version barely used a megabyte. Instead, it created a 20MB database file.
But, in these days of VM and cheap memory, did it accomplish anything? In the old days before virtual memory and good memory handling, a program would crash your computer if it used all of the memory (and memory was measured in megabytes and not gigabytes). Now, if your program wants more memory than is available, it simply is given virtual memory.
So, in the end, using a Berkeley database is not a good solution. Whatever I saved in programming time by using tie was wasted with the installation process. And, it was slow.
Using BDB simply used a BDB file instead of memory. A modern OS will do the same, and is faster. Why do the work when the OS will handle it for you?
The only reason to use a database is if your system really doesn't have the required resources. 200 million lines is a big file, but a modern OS will probably be okay with it. If your system really doesn't have the resources, use a SQL database on another system, and not a BDB database.
You might consider calculating a hash code for each line, and keeping track of (hash, position) mappings. You wouldn't need a complicated hash function (or even a large hash) for this; in fact, "smaller" is better than "more unique", if the primary concern is memory usage. Even a CRC, or summing up the chars' codes, might do. The point isn't to guarantee uniqueness at this stage -- it's just to narrow the candidate matches down from 200 million to a few dozen.
For each line, calculate the hash and see if you already have a mapping. If you do, then for each position that maps to that hash, read the line at that position and see if the lines match. If any of them do, skip that line. If none do, or you don't have any mappings for that hash, remember the (hash, position) and then print the line.
Note, I'm saying "position", not "line number". In order for this to work in less than a year, you'd almost certainly have to be able to seek right to a line rather than finding your way to line #1392499.
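A rough Perl sketch of the (hash, position) idea follows. The function names are invented for illustration, the byte-sum hash and the 16-bit fold are arbitrary choices taken from the "summing up the chars' codes" suggestion, and a second filehandle is used for the re-reads so the sequential scan is not disturbed.

```perl
use strict;
use warnings;

# Cheap, deliberately small hash: the byte sum of the line, folded to 16 bits.
sub cheap_hash {
    my ($line) = @_;
    return unpack('%32C*', $line) % 65536;
}

# Print each distinct line of $path to $out once, keeping only
# (hash => [byte positions]) in memory. print_unique() is a hypothetical name.
sub print_unique {
    my ($path, $out) = @_;
    open my $scan, '<', $path or die "Can't open '$path': $!";   # sequential pass
    open my $peek, '<', $path or die "Can't open '$path': $!";   # random re-reads
    my %positions;
    while (1) {
        my $pos  = tell $scan;            # where the next line starts
        my $line = <$scan>;
        last unless defined $line;
        chomp $line;
        my $key  = cheap_hash($line);
        my $seen = 0;
        # Compare against every earlier line that produced the same hash.
        for my $old (@{ $positions{$key} || [] }) {
            seek $peek, $old, 0 or die "Can't seek in '$path': $!";
            chomp(my $candidate = <$peek>);
            if ($candidate eq $line) { $seen = 1; last }
        }
        next if $seen;
        push @{ $positions{$key} }, $pos;
        print {$out} "$line\n";
    }
    close $scan;
    close $peek;
}
```

The stored positions still cost memory for every distinct line, and a weak hash means more seeks per line, so the hash width is something to tune against the real data.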
If you don't care about time/IO constraints, nor disk constraints (e.g. you have 10 more GB space), you can do the following dumb algorithm:
1) Read the file (which sounds like it has 50 character lines). While scanning it, remember the longest line length $L.
2) Analyze the first 3 characters (if you know char #1 is identical - say "[" - analyze the 3 characters in position N that is likely to have more diverse ones).
3) For each line with 3 characters $XYZ, append that line to file 3char.$XYZ and keep the count of how many lines in that file in a hash.
4) When your entire file is split up that way, you should have a whole bunch (if the files are A-Z only, then 26^3) of smaller files, and at most 4 files that are >2GB each.
5) Move the original file into "Processed" directory.
6) For each of the large files (>2GB), choose the next 3 character positions, and repeat steps #1-#5, with new files being 6char.$XYZABC
7) Lather, rinse, repeat. You will end up with one of the 2 options eventually:
8a) A bunch of smaller files each of which is under 2GB, all of which have mutually different strings, and each (due to its size) can be processed individually by standard "stash into a hash" solution in your question.
8b) Or most of the files are smaller, but you have exhausted all $L characters while repeating step 7 for the >2GB files, and you still have between 1 and 4 large files. Guess what: since those up-to-4 large files have identical characters within a file in positions 1..$L, they can ALSO be processed using the "stash into a hash" method in your question, since they are not going to contain more than a few distinct lines despite their size!
Please note that this may require - at the worst possible distributions - 10GB * L / 3 disk space, but will ONLY require 20GB disk space if you change step #5 from "move" to "delete".
Voila. Done.
As an alternate approach, consider hashing your lines. I'm not a hashing expert but you should be able to compress a line into a hash <5 times line size IMHO.
If you want to be fancy about this, you will do a frequency analysis on character sequences on the first pass, and then do compression/encoding this way.
If you have more than one processor, at least 15GB of free space, and fast enough storage, you could try this out. It sorts the chunks in parallel.
split --lines=100000 -a 4 -d input.file
# sort each chunk on its own, up to 10 at a time
ls x* | xargs -P 10 -I SPLITTED_FILE sh -c 'sort -u "$1" > "unique.$1"' _ SPLITTED_FILE
# merge the sorted chunks, dropping duplicates that span chunks
sort -m -u unique.* > output.file
rm unique.* x*
You could break your file into ten 1 GB files. Read in one file at a time, sort its lines, and write it back out sorted. Then open all ten files and merge them back into one file (making sure you merge them in the correct order). Open an output file to save the unique lines, then read the merged file one line at a time, keeping the last line for comparison. If the last line and the current line do not match, write out the last line and save the current line as the last line for comparison. Otherwise get the next line from the merged file. When the input ends, write out the final saved line. That will give you a file which has all of the unique lines.
It may take a while to do this, but if you are limited on memory, then breaking the file up and working on parts of it will work.
It may be possible to do the comparison when writing out the file, but that would be a bit more complicated.
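Doing the comparison during the merge is not much extra work. Here is a sketch of the merge step, assuming the chunk files have already been sorted with Perl's default string comparison and that every line ends with a newline; merge_unique() is a made-up name.

```perl
use strict;
use warnings;

# Merge already-sorted chunk files and print each distinct line once.
# merge_unique() is a hypothetical helper; the chunks must be sorted with the
# same string comparison used here (Perl's lt / eq).
sub merge_unique {
    my ($out, @paths) = @_;
    my @fhs;
    for my $path (@paths) {
        open my $fh, '<', $path or die "Can't open '$path': $!";
        push @fhs, $fh;
    }
    my @heads = map { scalar readline($_) } @fhs;   # current line of each chunk
    my $last;
    while (1) {
        # Pick the chunk whose current line sorts first.
        my $min;
        for my $i (0 .. $#heads) {
            next unless defined $heads[$i];
            $min = $i if !defined $min || $heads[$i] lt $heads[$min];
        }
        last unless defined $min;                   # every chunk is exhausted
        my $line = $heads[$min];
        $heads[$min] = readline($fhs[$min]);
        next if defined $last && $line eq $last;    # same as the previous output
        print {$out} $line;
        $last = $line;
    }
    close $_ for @fhs;
}
```

Only one line per chunk is held in memory at a time. The linear scan for the smallest head is fine for ten chunks; with thousands of chunks a heap would be the better choice.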
Why use perl for this at all? posix shell:
sort | uniq
done, let's go drink beers.
I've been given the task of writing a webapp that analyzes text files given a single regular expression. The text files I am given range anywhere from 500MB to 3GB. I am currently using Perl as my parsing engine. I've been reading about MapReduce and Hadoop, but it seems like the setup is only worth it given very, very large amounts of data, much larger than the amounts I am parsing.
What would be a good way to go about this? Right now a 500MB file takes anywhere from 4 to 6 minutes to parse, which isn't too bad, but the 3GB files take forever, and the webserver usually times out before it can get output from the Perl script and generate a report.
Let's partition your file into 100 chunks, and use seek to let an arbitrary process work on an arbitrary part of the file.
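my $chunk = $ARGV[0]; # a user input, from 0 to 99
my $size = -s $THE_FILE;
my $startByte = int($chunk * $size / 100);
my $endByte = int(($chunk + 1) * $size / 100);
open my $fh, '<', $THE_FILE or die "Can't open '$THE_FILE': $!";
if ($startByte > 0) {
    # step back one byte and discard through the next newline: a line that
    # begins exactly at $startByte is kept, a partial line is skipped
    seek $fh, $startByte - 1, 0;
    scalar <$fh>;
}
# a line belongs to the chunk in which its first byte falls
while (tell($fh) < $endByte and defined(my $line = <$fh>)) {
    # ... process $line from this section of the file ...
}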
Now run this program 100 times on whatever machines you have available, passing the arguments 0 to 99 once to each program.
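On a single Unix-like machine, the same partitioning can be driven from one script with fork. The sketch below is only an illustration: it counts lines matching a regex, the names count_in_range() and parallel_count() are invented, and each child reports its count back through a pipe.

```perl
use strict;
use warnings;
use POSIX ();

# Count lines matching $regex among the lines whose first byte is in [$start, $end).
sub count_in_range {
    my ($path, $regex, $start, $end) = @_;
    open my $fh, '<', $path or die "Can't open '$path': $!";
    if ($start > 0) {
        seek $fh, $start - 1, 0 or die "Can't seek in '$path': $!";
        scalar <$fh>;    # discard up to the first line that starts in range
    }
    my $count = 0;
    while (tell($fh) < $end and defined(my $line = <$fh>)) {
        $count++ if $line =~ $regex;
    }
    close $fh;
    return $count;
}

# Fork $workers children, one per byte range, and sum the counts they report.
sub parallel_count {
    my ($path, $regex, $workers) = @_;
    my $size = -s $path;
    my @children;
    for my $i (0 .. $workers - 1) {
        my $start = int($i * $size / $workers);
        my $end   = int(($i + 1) * $size / $workers);
        pipe(my $reader, my $writer) or die "pipe failed: $!";
        my $pid = fork();
        die "fork failed: $!" unless defined $pid;
        if ($pid == 0) {                  # child: report one number and leave
            close $reader;
            print {$writer} count_in_range($path, $regex, $start, $end), "\n";
            close $writer;
            POSIX::_exit(0);
        }
        close $writer;
        push @children, [$pid, $reader];
    }
    my $total = 0;
    for my $child (@children) {
        my ($pid, $reader) = @$child;
        my $n = <$reader>;
        close $reader;
        waitpid $pid, 0;
        die "worker $pid reported nothing\n" unless defined $n;
        chomp $n;
        $total += $n;
    }
    return $total;
}
```

For example, parallel_count($file, qr/ERROR/, 8) would scan the file with eight processes. Whether that is faster than a single pass depends on whether the job is CPU bound or limited by the disk.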
Actually Hadoop is surprisingly easy to install and use (especially if you don't have huge data and don't need to optimize it). I had a similar task a while ago (processing logs in the range of about 5GB) and it took me no more than a couple of hours to install it on 5 machines, just using the tutorial and docs on their site. Then the programming is really easy: just read from STDIN and write to STDOUT!
Probably making your own split and distribute script (even if you make it on top of something like Gearman) will take more than installing hadoop.
I have a very simple dictionary application that does search and display. It's built with the Win32::GUI module. I put all the plain text data needed for the dictionary under the __DATA__ section. The script itself is very small, but with everything under the __DATA__ section its size reaches 30 MB. In order to share the work with my friends, I then packed the script into a stand-alone executable using the pp utility of the PAR::Packer module with the highest compression level (9), and now I have a single-file dictionary app of about 17MB.
But although I'm very comfortable with the idea of a single-file script, placing such a huge amount of text data under the script's DATA section does not feel right. For one thing, when I try opening the script in Padre (Notepad++ is okay), I receive an error like this:
Can't open my script as the script is over the arbitrary file size limit which is currently 500000.
My questions:
Does it bring me any extra benefits except for the eliminating of Padre's file opening issue if I move everything under the DATA section to a separate text file?
If I do so, What should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
How do people normally format the text data needed for a dictionary application?
Any comments, ideas or suggestions? Thanks like always :)
If I do so, What should I do to reduce the size of the separate file? Zip it and uncompress it while doing search and display?
Well, it depends on WHY you want to reduce the size. If it is to minimize disk space usage (rather weird goal most of the time these days), then the zip/unzip is the way to go.
However if the goal is to minimize memory usage, then a better approach is to split up the dictionary data into smaller chunks (for example indexed by a first letter), and only load needed chunks.
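The "indexed by first letter" idea could be sketched like this. It assumes a separate dictionary file with one entry per line, grouped by first letter (for example sorted case-insensitively); the helper names are made up for illustration.

```perl
use strict;
use warnings;

# Build { first letter => byte offset of its first entry } for a dictionary
# file with one entry per line, grouped by first letter. Hypothetical helper.
sub build_letter_index {
    my ($path) = @_;
    open my $fh, '<', $path or die "Can't open '$path': $!";
    my %offset;
    while (1) {
        my $pos  = tell $fh;
        my $line = <$fh>;
        last unless defined $line;
        my $letter = lc(substr($line, 0, 1));
        $offset{$letter} = $pos unless exists $offset{$letter};
    }
    close $fh;
    return \%offset;
}

# Load only the entries that start with the same letter as $word.
sub entries_for {
    my ($path, $index, $word) = @_;
    my $letter = lc(substr($word, 0, 1));
    return () unless exists $index->{$letter};
    open my $fh, '<', $path or die "Can't open '$path': $!";
    seek $fh, $index->{$letter}, 0 or die "Can't seek in '$path': $!";
    my @entries;
    while (defined(my $line = <$fh>)) {
        last if lc(substr($line, 0, 1)) ne $letter;   # reached the next letter
        chomp $line;
        push @entries, $line;
    }
    close $fh;
    return @entries;
}
```

The index itself is tiny (one offset per letter) and could be built once at startup or stored next to the data file.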
How do people normally format the text data needed for a dictionary application?
IMHO the usual approach is what you get as the logical end of an approach mentioned above (partitioned and indexed data): using a back-end database, which allows you to only retrieve the data which is actually needed.
In your case probably something simple like SQLite or Berkeley DB/DBM files should be OK.
Does it bring me any extra benefits except for the eliminating of Padre's file opening issue if I move everything under the DATA section to a separate text file?
This depends somewhat on your usage... if it's a never-changing script used by 3 people, there may be no tangible benefit.
In general, it will make maintenance much easier (you can change the dictionary and the code logic independently - think virus definitions file vs. antivirus executable for real world example).
It will also decrease the process memory consumption if you go with the approaches I mentioned above.
Since you are using PAR::Packer already, why not move it to a separate file or module and include it in the PAR file?
The easy way (no extra commandline options to pp, it will see the use statement and do the right thing):
words.pl
#!/usr/bin/perl
use strict;
use warnings;
use Words;
for my $i (1 .. 2) {
print "Run $i\n";
while (defined(my $word = Words->next_word)) {
print "\t$word\n";
}
}
Words.pm
package Words;
use strict;
use warnings;
my $start = tell DATA
or die "could not find current position: $!";
sub next_word {
if (eof DATA) {
seek DATA, $start, 0
or die "could not seek: $!";
return undef;
}
chomp(my $word = scalar <DATA>);
return $word;
}
1;
__DATA__
a
b
c
We have a script on an FTP endpoint that monitors the FTP logs spewed out by our FTP daemon.
Currently we have a Perl script that essentially runs tail -F on the file and sends every single line to a remote MySQL database, with slightly different column content based on the record type.
This database has tables for content of both the tarball names/content, as well as FTP user actions with said packages; Downloads, Deletes, and everything else VSFTPd logs.
I see this as particularly bad, but I'm not sure what's better.
The goal of a replacement is to still get log file content into a database as quickly as possible. I'm thinking of doing something like making a FIFO/pipe file in place of the FTP log file, so I can read it in once periodically, ensuring I never read the same thing in twice. That assumes VSFTPd will play nice with that (I'm thinking it won't, insight welcome!).
The FTP daemon is VSFTPd. I'm at least fairly sure the extent of its logging capabilities is: xfer style log, vsftpd style log, both, or no logging at all.
The question is, what's better than what we're already doing, if anything?
Honestly, I don't see much wrong with what you're doing now. tail -f is very efficient. The only real problem with it is that it loses state if your watcher script ever dies (which is a semi-hard problem to solve with rotating logfiles). There's also a nice File::Tail module on CPAN that saves you from the shellout and has some nice customization available.
Using a FIFO as a log can work (as long as vsftpd doesn't try to unlink and recreate the logfile at any point, which it may do) but I see one major problem with it. If no one is reading from the other end of the FIFO (for instance if your script crashes, or was never started), then a short time later all of the writes to the FIFO will start blocking. And I haven't tested this, but it's pretty likely that having logfile writes block will cause the entire server to hang. Not a very pretty scenario.
Your problem with reading a continually updated file is you want to keep reading, even after the end of file is reached. The solution to this is to re-seek to your current position in the file:
seek FILEHANDLE, 0, 1;
Here is my code for doing this sort of thing:
open(FILEHANDLE,'<', '/var/log/file') || die 'Could not open log file';
seek(FILEHANDLE, 0, 2) || die 'Could not seek to end of log file';
for (;;) {
while (<FILEHANDLE>) {
if ( $_ =~ /monitor status down/ ) {
print "Gone down\n";
}
}
sleep 1;
seek FILEHANDLE, 0, 1; # clear eof
}
You should look into inotify (a Linux kernel facility, so this assumes you are on Linux) so you can run your Perl script whenever the logfile is updated. If this level of IO causes problems you could always keep the logfile on a RAM disk so IO is very fast.
This should help you set this up:
http://www.cyberciti.biz/faq/linux-inotify-examples-to-replicate-directories/
You can open the file as an in pipe.
open(my $file, '-|', 'tail', '-F', '/ftp/file/path') or die "Can't start tail: $!";
while (<$file>) {
# you know the rest
}
File::Tail does this, plus heuristic sleeping and nice error handling and recovery.
Edit: On second thought, a real system pipe is better if you can manage it. If not, you need to find the last thing you put in the database, and spin through the file until you get to the last thing you put in the database whenever your process starts. Not that easy to accomplish, and potentially impossible if you have no way of identifying where you left off.
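One way to remember where you left off is to store the byte offset of the last complete line you handled. The following is a sketch under assumptions: read_new_lines() is an invented name, the state file is a plain text file holding one number, and a log that has shrunk is treated as rotated.

```perl
use strict;
use warnings;

# Return the lines appended to $log since the offset stored in $state_file,
# then store the new offset. Hypothetical helper; both paths are placeholders.
sub read_new_lines {
    my ($log, $state_file) = @_;
    my $offset = 0;
    if (open my $st, '<', $state_file) {
        my $saved = <$st>;
        $offset = $1 if defined $saved && $saved =~ /^(\d+)/;
        close $st;
    }
    open my $fh, '<', $log or die "Can't open '$log': $!";
    $offset = 0 if $offset > (-s $fh);    # file shrank: assume it was rotated
    seek $fh, $offset, 0 or die "Can't seek in '$log': $!";
    my @lines;
    while (defined(my $line = <$fh>)) {
        last unless $line =~ /\n\z/;      # half-written line: pick it up next time
        push @lines, $line;
        $offset = tell $fh;
    }
    close $fh;
    open my $st, '>', $state_file or die "Can't write '$state_file': $!";
    print {$st} "$offset\n";
    close $st or die "Can't save '$state_file': $!";
    return @lines;
}
```

This does not detect a rotation where the new file has already grown past the saved offset, and if the process dies after inserting into the database but before saving the offset, those lines will be read again on restart.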