I have an application in which scripts will be run that need to be able to access stored data. I want to run a script (main.pl) which will create an array. Later, if I run A.pl or B.pl, I want those scripts to be able to access the previously created array and change values within it. What do I need to code in main.pl, A.pl, and B.pl to achieve that?
Normally, one Perl instance cannot access the variables of another instance. The question then becomes: "what can one do that is almost like sharing variables?"
One approach is to store the data somewhere it can persist, such as in a database or a CSV file on disk. This means reading the data at the beginning of the program and writing or updating it at the end, and it naturally leads to questions about race conditions, locking, etc., which greatly expands the scope any answer would need to cover.
Another approach is to write your programs to use CSV or YAML or some other format easily read and written by libraries from CPAN, and to use STDIN and STDOUT for input and output. This decouples storage, and also lets you chain several tools together with a pipe at the shell prompt.
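For instance, a sketch of that style using the YAML module (the script names and data are illustrative, not from the original question):

# writer.pl - build the data and dump it as YAML on STDOUT
use strict;
use warnings;
use YAML qw(Dump);

my @array = (10, 20, 30);
print Dump(\@array);

# reader.pl - load YAML from STDIN, modify it, dump it again
use strict;
use warnings;
use YAML qw(Dump Load);

my $array = Load(do { local $/; <STDIN> });
$array->[0]++;
print Dump($array);

These chain naturally at the shell prompt: writer.pl | reader.pl | reader.pl.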
For an in-memory solution for tying hashes to shared memory, you can check out IPC::Shareable:
http://metacpan.org/pod/IPC::Shareable
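A minimal sketch of the idea, assuming the glue-key form of the IPC::Shareable tie interface ('data' is an arbitrary key that all scripts must agree on):

# main.pl - create the shared array
use IPC::Shareable;
tie my @array, 'IPC::Shareable', 'data', { create => 1 };
@array = (1, 2, 3);

# A.pl / B.pl - attach to the same shared segment via the same key
use IPC::Shareable;
tie my @array, 'IPC::Shareable', 'data';
$array[0]++;   # the change is visible to every process tied to this key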
Perl memory structures can't be stored and then accessed later by other Perl scripts. However, you can write those memory structures out to a file. This can be done by hand-rolling the serialization, or by using a wide variety of Perl modules. Storable is a standard Perl module and has been around for quite a while.
Since all you're storing is an array, you could have one program write the array to a file, and then have the other program read the array back in.
use strict;
use warnings;
use autodie;
use feature qw(say);

use constant {
    ARRAY_FILE => "$ENV{HOME}/perl_array.txt",
};

my @array;
# ... build the array here ...

open my $output_fh, ">", ARRAY_FILE;
for my $item ( @array ) {
    say {$output_fh} $item;     # one array element per line
}
close $output_fh;
Now, have your second program read in this array:
use strict;
use warnings;
use autodie;

use constant {
    ARRAY_FILE => "$ENV{HOME}/perl_array.txt",
};

my @new_array;
open my $input_fh, "<", ARRAY_FILE;
while ( my $item = <$input_fh> ) {
    chomp $item;                # strip the newline added by say
    push @new_array, $item;
}
close $input_fh;
More complex data can be stored with Storable, but it's pretty much the same thing: you write the structure out to a physical file and then reopen that file to pull your data back in.
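A minimal sketch of the Storable round trip (the filename is illustrative):

use strict;
use warnings;
use Storable qw(nstore retrieve);

my %data = ( list => [ 1, 2, 3 ] );         # any nested structure
nstore \%data, '/tmp/data.stor';            # write it out to disk

my $data_ref = retrieve('/tmp/data.stor');  # read it back in
print $data_ref->{list}[0], "\n";           # prints 1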
I've written a script that needs to store a small string between runs. Which CPAN module could I use to make the process as simple as possible? Ideally I'd want something like:
use That::Module;
my $static_data = read_config( 'script-name' ); # read from e.g. ~/.script-name.data
$static_data++;
write_config( 'script-name', $static_data ); # write to e.g. ~/.script-name.data
I don't need any parsing of the file, just storage. There are a lot of different OSes and places out there to store these files, which is why I don't want to do that part myself.
Just use Storable for portable persistence of Perl data structures and File::HomeDir for portable "general config place" finding:
use File::HomeDir;
use File::Spec;
use FindBin qw($Script);
use Storable qw(nstore retrieve);

my %table = ( count => 1 );    # the data you want to persist

# Generate an absolute path like:
# /home/stas/.local/share/script.pl.data
my $file = File::Spec->catfile(File::HomeDir->my_data(), "$Script.data");

# nstore writes in network byte order for better endianness compatibility
nstore \%table, $file;

my $hashref = retrieve($file);
If it's just a single string (e.g., 'abcd1234'), just use a normal file and write to it with open.
If you're looking for something a bit more advanced, take a look at Config::Simple or JSON::XS. Config::Simple has its own function to write out to a file, and JSON can just use a plain open.
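For instance, a sketch of the counter from the question using Config::Simple (the file path and key name are assumptions):

use strict;
use warnings;
use Config::Simple;

my $file = "$ENV{HOME}/.script-name.data";

# read the existing file, or start fresh if it doesn't exist yet
my $cfg = Config::Simple->new($file) || Config::Simple->new(syntax => 'ini');

my $static_data = $cfg->param('static_data') || 0;
$static_data++;

$cfg->param('static_data', $static_data);
$cfg->write($file);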
Maybe this can help you: http://www.stonehenge.com/merlyn/UnixReview/col53.html. But I think you cannot avoid working with files and directories.
The easiest way that I know how to do this (rather than rolling by hand) is to use DBM::Deep.
Every time I post about this module I get hate posts responding that it's too slow, so please don't do that.
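The usage is about as simple as it gets; a sketch for the counter in the question (the database filename is arbitrary):

use strict;
use warnings;
use DBM::Deep;

my $db = DBM::Deep->new("$ENV{HOME}/.script-name.db");
$db->{static_data}++;              # persists across runs automatically
print $db->{static_data}, "\n";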
I have a Perl script which reads two files and processes them.
The first file, the info file, I store as a hash (3.5 GB).
The second file, the target file, I process using information from the info file and other subroutines as designed. (This target file ranges from 30-60 GB.)
So far, the following are working:
reading the info file into a hash
breaking the target file into chunks
I want to run on all chunks in parallel:
while (chunks) {
    # do something
    sub a { }
    sub b { }
}
So basically, I want to read a chunk, write its output, and do this for multiple chunks at the same time. The while loop reads each line of a chunk file and calls various subroutines for processing.
Is there a way I can read chunks in the background?
I don't want to read the info file for every chunk, as it is 3.5 GB and reading it into a hash takes 3.5 GB of memory every time.
Right now the script takes 1-2 hours to run for 30-60 GB.
You can try using Perl threads if the parallel tasks are independent.
A 3.5GB hash is very big, you should consider using a database instead. Depending on how you do this, you can keep accessing the database via the hash.
If memory were a non-issue, forking would be the easiest solution. However, forking duplicates the process, including the hash, and would only result in unnecessary swapping.
If you cannot free some memory, you should consider using threads. Perl threads only live inside the interpreter and are invisible to the OS. These threads have a similar feel to forking; however, you can declare variables as :shared (you have to use threads::shared).
See the official Perl threading tutorial
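A minimal sketch of that shape, assuming the chunks are already on disk and process_chunk stands in for your own per-chunk logic (both names are illustrative):

use strict;
use warnings;
use threads;
use threads::shared;

my %info :shared;    # the big lookup hash, shared rather than copied per thread
# ... load %info from the info file once, before spawning threads ...

my @workers = map {
    my $chunk_file = $_;
    threads->create( sub { process_chunk($chunk_file, \%info) } );
} glob 'chunk.*';

$_->join for @workers;

sub process_chunk {
    my ($file, $info) = @_;
    # read $file line by line, consult %$info, write this chunk's output ...
}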
What about the File::Map module (memory mapping)? It can easily read big files.
use strict;
use warnings;
use File::Map qw(map_file);

map_file my $map, $ARGV[0];   # $ARGV[0] - path to your file
# Do something with $map
I have a 10GB file with 200 million lines. I need to get the unique lines of this file.
My code:
my %tmp;
while (<>) {
    chomp;
    $tmp{$_} = 1;
}
# print ...
I only have 2 GB of memory. How can I solve this problem?
As I commented on David's answer, a database is the way to go, but a nice one might be DBM::Deep, since it's pure Perl and easy to install and use; it's essentially a Perl hash tied to a file.
use DBM::Deep;

tie my %lines, 'DBM::Deep', 'data.db';

while (<>) {
    chomp;
    $lines{$_} = 1;
}
This is basically what you already had, but the hash is now a database tied to a file (here data.db) rather than kept in memory.
If you don't care about preserving order, I bet the following is faster than the previously posted solutions (e.g. DBM::Deep):
sort -u file
In most cases, you could store the line as a key in a hash. However, when you get this big, this really isn't very efficient. In this case, you'd be better off using a database.
One thing to try is the Berkeley Database (BDB) that used to be included in Unix. Now, it's apparently owned by Oracle.
Perl can use the BerkeleyDB module to talk with a BDB database. In fact, you can even tie a Perl hash to a BDB database. Once this is done, you can use normal Perl hashes to access and modify the database.
BDB is pretty robust. Bitcoin uses it, and so does SpamAssassin, so it is very possible that it can handle the type of database you have to create in order to find duplicate lines. If you already have BDB installed, writing a program to handle your task shouldn't take that long. If it doesn't work, you won't have wasted too much time on this.
The only other thing I can think of is using an SQL database, which would be slower and much more complex.
Addendum
Maybe I'm over thinking this...
I decided to try a simple hash. Here's my program:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;

use constant DIR       => "/usr/share/dict";
use constant WORD_LIST => qw(words web2a propernames connectives);

my %word_hash;
for my $count (1..100) {
    for my $file (WORD_LIST) {
        open my $file_fh, "<", DIR . "/$file";
        while (my $word = <$file_fh>) {
            chomp $word;
            $word_hash{"$file-$word-$count"} = $word;
        }
    }
}
The files read in contain a total of about 313,000 lines. I do this 100 times to get a hash with 31,300,000 keys in it. It is about as inefficient as it can be. Each and every key will be unique. The amount of memory will be massive. Yet...
It worked. It took about 10 minutes to run despite the massive inefficiencies in the program, and it maxed out at around 6 gigabytes. However, most of that was in virtual memory. Strangely, even though it was running, gobbling memory, and taking 98% of the CPU, my system didn't really slow down all that much. I guess the question really is what type of performance are you expecting? If taking about 10 minutes to run isn't that much of an issue for you, and you don't expect this program to be used that often, then maybe go for simplicity and use a simple hash.
I'm now downloading BDB from Oracle, compiling it, and installing it. I'll try the same program using BDB and see what happens.
Using a BDB Database
After doing the work, I think if you have MySQL installed, using Perl DBI would be easier. I had to:
Download Berkeley DB from Oracle, for which you need an Oracle account. I didn't remember my password and told it to email me. I never got the email. I spent 10 minutes trying to remember my email address.
Once downloaded, it has to be compiled. I found directions for compiling on the Mac, and it seemed pretty straightforward.
Running CPAN crashed. It turns out that CPAN looks for /usr/local/BerkeleyDB, and it was installed as /usr/local/BerkeleyDB.5.3. Creating a link fixed the issue.
All told, it took about half an hour to get BerkeleyDB installed. Once installed, modifying my program was fairly straightforward:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;
use BerkeleyDB;

use constant {
    DIR      => "/usr/share/dict",
    BDB_FILE => "bdb_file",
};
use constant WORD_LIST => qw(words web2a propernames connectives);

unlink BDB_FILE if -f BDB_FILE;

our %word_hash;
tie %word_hash, "BerkeleyDB::Hash",
    -Filename => BDB_FILE,
    -Flags    => DB_CREATE
    or die qq(Cannot create BerkeleyDB file ") . BDB_FILE . qq("\n);

for my $count (1..10) {
    for my $file (WORD_LIST) {
        open my $file_fh, "<", DIR . "/$file";
        while (my $word = <$file_fh>) {
            chomp $word;
            $word_hash{"$file-$word-$count"} = $word;
        }
    }
}
All I had to do was add a few lines.
Running the program was a disappointment. It wasn't faster; it was much, much slower. It took over 2 minutes, while using a pure hash took a mere 13 seconds.
However, it used a lot less memory. While the old program gobbled gigabytes, the BDB version barely used a megabyte. Instead, it created a 20MB database file.
But, in these days of VM and cheap memory, did it accomplish anything? In the old days before virtual memory and good memory handling, a program would crash your computer if it used all of the memory (and memory was measured in megabytes, not gigabytes). Now, if your program wants more memory than is available, it is simply given virtual memory.
So, in the end, using a Berkeley database is not a good solution. Whatever I saved in programming time by using tie was wasted with the installation process. And, it was slow.
Using BDB simply used a database file instead of memory. A modern OS will do the same via virtual memory, and faster. Why do the work when the OS will handle it for you?
The only reason to use a database is if your system really doesn't have the required resources. 200 million lines is a big file, but a modern OS will probably be okay with it. If your system really doesn't have the resources, use an SQL database on another system, not a BDB database.
You might consider calculating a hash code for each line, and keeping track of (hash, position) mappings. You wouldn't need a complicated hash function (or even a large hash) for this; in fact, "smaller" is better than "more unique", if the primary concern is memory usage. Even a CRC, or summing up the chars' codes, might do. The point isn't to guarantee uniqueness at this stage -- it's just to narrow the candidate matches down from 200 million to a few dozen.
For each line, calculate the hash and see if you already have a mapping. If you do, then for each position that maps to that hash, read the line at that position and see if the lines match. If any of them do, skip that line. If none do, or you don't have any mappings for that hash, remember the (hash, position) and then print the line.
Note, I'm saying "position", not "line number". In order for this to work in less than a year, you'd almost certainly have to be able to seek right to a line rather than reading your way to line #1392499.
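A sketch of that bookkeeping, using a deliberately weak 32-bit checksum from unpack (the file handling is illustrative):

use strict;
use warnings;

my %seen;    # checksum => list of byte offsets where candidate lines start

open my $fh, '<', $ARGV[0] or die "open: $!";
while (1) {
    my $pos  = tell $fh;
    my $line = <$fh>;
    last unless defined $line;

    # weak hash: it only narrows the candidates, it never proves uniqueness
    my $sum = unpack '%32C*', $line;

    my $dup = 0;
    for my $offset ( @{ $seen{$sum} || [] } ) {
        my $here = tell $fh;      # remember where we are
        seek $fh, $offset, 0;     # jump to the candidate line
        my $candidate = <$fh>;
        seek $fh, $here, 0;       # and come back
        if ($candidate eq $line) {
            $dup = 1;
            last;
        }
    }
    next if $dup;

    push @{ $seen{$sum} }, $pos;
    print $line;                  # first occurrence: keep it
}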
If you don't care about time/IO constraints or disk constraints (e.g. you have 10 GB more space), you can do the following dumb algorithm:
1) Read the file (which sounds like it has 50 character lines). While scanning it, remember the longest line length $L.
2) Analyze the first 3 characters (if you know char #1 is identical, say "[", analyze instead the 3 characters at a position N likely to have more diverse values).
3) For each line with the 3 characters $XYZ, append that line to file 3char.$XYZ and keep a count in a hash of how many lines went into each file.
4) When your entire file is split up that way, you should have a whole bunch (if the files are A-Z only, then 26^3) of smaller files, and at most 4 files that are >2GB each.
5) Move the original file into "Processed" directory.
6) For each of the large files (>2GB), choose the next 3 character positions, and repeat steps #1-#5, with new files being 6char.$XYZABC
7) Lather, rinse, repeat. You will eventually end up with one of two options:
8a) A bunch of smaller files each of which is under 2GB, all of which have mutually different strings, and each (due to its size) can be processed individually by standard "stash into a hash" solution in your question.
8b) Or, most of the files are smaller, but you have exhausted all $L characters while repeating step 7 for the >2GB files, and you still have between 1 and 4 large files. Guess what: since those up-to-4 large files have identical characters within a file in positions 1..$L, they can ALSO be processed using the "stash into a hash" method from your question, since they are not going to contain more than a few distinct lines despite their size!
Please note that this may require, under the worst possible distributions, 10GB * $L / 3 of disk space, but it will ONLY require 20GB of disk space if you change step #5 from "move" to "delete".
Voila. Done.
As an alternate approach, consider hashing your lines. I'm not a hashing expert, but you should be able to compress a line into a hash under a fifth of the line's size, IMHO.
If you want to be fancy about this, you will do a frequency analysis on character sequences on the first pass, and then do compression/encoding this way.
If you have more processor cores, at least 15 GB of free space, and storage fast enough, you could try this out. It will process the file in parallel. Note that the final step must merge with sort -m -u rather than a plain cat, since duplicate lines can still span different chunks:

split --lines=100000 -a 4 -d input.file
printf '%s\n' x* | xargs -n 1 -P 10 -I{} sh -c 'sort -u {} > unique.{}'
sort -m -u unique.* > output.file
rm unique.* x*
You could break your file into ten 1 GB files. Then read in one file at a time, sort its lines, and write them back out after they are sorted. Then open all ten files and merge them back into one file (making sure you merge them in the correct order). Then open an output file to save the unique lines, read the merged file one line at a time, and keep the last line for comparison. If the last line and the current line are not a match, write out the last line and save the current line as the last line for comparison. Otherwise, get the next line from the merged file. That will give you a file which has all of the unique lines.
It may take a while to do this, but if you are limited on memory, then breaking the file up and working on parts of it will work.
It may be possible to do the comparison when writing out the file, but that would be a bit more complicated.
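For reference, the final dedup pass over the sorted, merged file might look something like this (the filenames are illustrative):

use strict;
use warnings;

open my $merged_fh, '<', 'merged.txt' or die "open: $!";
open my $out_fh,    '>', 'unique.txt' or die "open: $!";

my $last;
while (my $line = <$merged_fh>) {
    # the input is sorted, so duplicate lines are always adjacent
    print {$out_fh} $line if !defined $last || $line ne $last;
    $last = $line;
}

close $out_fh;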
Why use Perl for this at all? POSIX shell:
sort | uniq
Done; let's go drink beers.
I am generating relatively large files using Perl. The files I am generating are of two kinds:
Table files, i.e. textual files I print line by line (row by row), which contain mainly numbers. A typical line looks like:
126891 126991 14545 12
Serialized objects I create and then store into a file using Storable::nstore. These objects usually contain some large hash with numeric values. The values in the object might have been packed to save space (and the object unpacks each value before using it).
Currently I'm usually doing the following:
use IO::Compress::Gzip qw(gzip $GzipError);
# create normal, uncompressed file ($out_file)
# ...
# compress file using gzip
my $gz_out_file = "$out_file.gz";
gzip $out_file => $gz_out_file or die "gzip failed: $GzipError";
# delete uncompressed file
unlink($out_file) or die "can't unlink file $out_file: $!";
This is quite inefficient, since I first write the large file to disk and then gzip reads it back in and compresses it. So my questions are as follows:
Can I create a compressed file without first writing a file to disk? Is it possible to create a compressed file sequentially, i.e. printing line-by-line like in scenario (1) described earlier?
Does gzip sound like an appropriate choice? Are there any other recommended compressors for the kind of data I have described?
Does it make sense to pack values in an object that will later be stored and compressed anyway?
My considerations are mainly saving on disk space and allowing fast decompression later on.
You can use IO::Zlib or PerlIO::gzip to tie a file handle to compress on the fly.
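For example, with the PerlIO::gzip layer the compression happens transparently as you print, line by line (a sketch; the filename is illustrative):

use strict;
use warnings;
use PerlIO::gzip;

open my $fh, '>:gzip', 'table.gz' or die "open: $!";
print {$fh} "126891 126991 14545 12\n";   # written compressed as you go
close $fh;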
As for what compressors are appropriate, just try several and see how they do on your data. Also keep an eye on how much CPU/memory they use for compression and decompression.
Again, test to see how much pack helps with your data, and how much it affects your performance. In some cases, it may be helpful. In others, it may not. It really depends on your data.
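As a rough illustration of what pack can buy, using the sample row from the question (the chosen field widths are assumptions about your value ranges):

use strict;
use warnings;

# the four numbers as text take ~23 bytes; packed big-endian: 4+4+2+1 = 11 bytes
my $packed = pack 'N N n C', 126891, 126991, 14545, 12;
my @values = unpack 'N N n C', $packed;    # (126891, 126991, 14545, 12)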
You can also open() a filehandle to a scalar instead of a real file, and use that filehandle with IO::Compress::Gzip. I haven't actually tried it, but it should work. I use something similar with Net::FTP to avoid creating files on disk.
Since v5.8.0, Perl has been built with PerlIO by default. Unless you've changed this (i.e., Configure -Uuseperlio), you can open filehandles directly to Perl scalars via:
open($fh, '>', \$variable) || ..
(from the documentation for open())
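In fact, IO::Compress::Gzip will also accept a scalar reference as its output directly, so the two ideas combine into something like this sketch:

use strict;
use warnings;
use IO::Compress::Gzip qw(gzip $GzipError);

my $uncompressed = "126891 126991 14545 12\n";
my $compressed;
gzip \$uncompressed => \$compressed
    or die "gzip failed: $GzipError";
# $compressed now holds the gzip stream, entirely in memory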
IO::Compress::Gzip has an OO interface that can be used for this:
use strict;
use warnings;
use IO::Compress::Gzip;

my $z = IO::Compress::Gzip->new('out.gz');
$z->print($_, "\n") for 0 .. 10;
$z->close;    # flush and finish the gzip stream