Serialize data to binary using multicore - perl

I'm using the store function from the Storable module to get a binary representation of my hash. The hash is big enough that the process takes 20 minutes. Is there a function similar to store that works across multiple cores, so that it runs faster?
I've been searching for a while and couldn't find anything relevant, even using BSON for the storage.

Finally I decided to split the data I want to store into as many pieces as the computer has cores. That way I can run store in threads, each writing a different output file, as ikegami suggested in the comments.
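The split-and-store approach described above can be sketched roughly as follows. This is only an illustration under assumptions: it uses fork rather than threads (fork works on any Unix-like Perl build, threaded or not), and the sample hash, the worker count of 4, and the part file names are all hypothetical.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);
use POSIX ();

# Hypothetical data and worker count; in practice use the real hash and
# the number of cores on the machine.
my %big     = map { ("key$_" => $_ * 2) } 1 .. 1000;
my $workers = 4;
my $dir     = tempdir(CLEANUP => 1);

# Deal the keys round-robin into $workers chunks.
my @keys = sort keys %big;
my @chunks;
for my $i (0 .. $#keys) {
    $chunks[ $i % $workers ]{ $keys[$i] } = $big{ $keys[$i] };
}

# One child process per chunk, each writing its own output file.
my @pids;
for my $n (0 .. $#chunks) {
    my $pid = fork();
    die "fork failed: $!" unless defined $pid;
    if ($pid == 0) {
        my $ok = store($chunks[$n], "$dir/part$n.sto");
        # _exit skips END blocks, so the child never cleans up $dir.
        POSIX::_exit($ok ? 0 : 1);
    }
    push @pids, $pid;
}

# Wait for every child and fail loudly if one of them did.
for my $pid (@pids) {
    waitpid($pid, 0);
    die "child $pid failed" if $? != 0;
}

# Reading back: merge the parts into a single hash again.
my %restored;
for my $n (0 .. $#chunks) {
    my $part = retrieve("$dir/part$n.sto");
    @restored{ keys %$part } = values %$part;
}
```

Whether this is actually faster depends on whether serialization or disk throughput is the bottleneck, so it is worth timing against a single store call.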

Related

How to efficiently store and update binary data in Mongodb?

I am storing a large binary array within a document. I wish to continually add bytes to this array and sometimes change the value of existing bytes.
I was looking for some $append_bytes and $replace_bytes type of modifiers, but it appears that the best I can do is $push for arrays. It seems like this would be doable by performing seek-write type operations if I somehow had access to the underlying BSON on disk, but it does not appear that there is any way to do this in MongoDB (and probably for good reason).
If I were instead to just query this binary array, edit or add to it, and then update the document by rewriting the entire field, how costly would this be? Each binary array will be on the order of 1-2MB, and updates occur once every 5 minutes across 1000s of documents. Worse yet, there is no easy way to spread these out in time, and they will usually happen close to one another on the 5 minute intervals. Does anyone have a good feel for how disastrous this will be? It seems like it would be problematic.
An alternative would be to store this binary data as separate files on disk, implement a thread pool to efficiently manipulate the files on disk, and reference the filename from my mongodb document. (I'm using python and pymongo so I was looking at pytables). I'd prefer to avoid this though if possible.
Is there any other alternative that I am overlooking here?
Thanks in advance.
EDIT
After some work writing some tests for my use cases I have decided to use a separate filesystem for the binary data objects (specifically hdf5 using either pytables or h5py). I will still use mongo for everything except the persistence of these binary data objects. In this manner I can decouple the performance related to append and update type operations away from my base mongo performance.
One of the mongo developers did point out that I can set internal array elements using dot notation and $set (see ref in comment below), but there is no way at this time to do a range of sets in an array atomically.
Moreover - if I have 1,000s of 2MB binary data fields within my mongo documents and I am updating and growing them often (as in at least once every 5 minutes) - my gut tells me that mongo is going to have to manage a lot of allocation/growth issues within its file(s) on disk - and that ultimately this will lead to performance problems. I would rather off load that to a separate filesystem at the OS level to handle.
Finally - I will be manipulating and performing computation on my data using numpy - both the pytables and the h5py modules allow nice integration between numpy behavior and the store.
As you mentioned, you edit your binary data very frequently. GridFS is another option I would suggest.
"When to use GridFS" might be useful to you.

Perl DBM vs. Storable

For my current project I need to store a small database on disk, which I read once when my program starts and write once.
I have looked into Perl's DBM functionality, and from what I understand it merely provides a hash that is stored on disk, with every read and write going directly to disk.
My question is: could I not simply use Storable or any of the related modules to achieve the same thing (a persistent hash) with far less file I/O overhead? (The hashes will never be too large to fit into memory easily.)
Regards
Nick
SQLite is fast becoming the standard for simple on-disk databases. And in Perl you can just use DBD::SQLite and you're good to go.
Since the previous answers didn't really answer your actual question: yes, you can, with the following caveats:
Storable isn't really suited to concurrent access.
You will need to roll your own "atomic" update (i.e. you will need to write to a tmp file, then rename).
If performance isn't really an issue, you could also use Data::Dumper (with the resulting file being somewhat human readable).
You could splat the contents to CSV.
I often use Dumper when there is only going to be a single task accessing the file - and it gives me a way to read/modify the contents if I see fit.
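The write-to-a-temp-file-then-rename caveat above might look like the following minimal sketch. The helper name store_atomic, the ".tmp.$$" suffix, and the sample data are assumptions made for illustration; the rename is only atomic when the temporary file and the target are on the same filesystem.

```perl
use strict;
use warnings;
use Storable qw(store retrieve);
use File::Temp qw(tempdir);
use File::Spec;

# Hypothetical helper: store $data next to $path under a temporary name,
# then rename it into place so readers never see a half-written file.
sub store_atomic {
    my ($data, $path) = @_;
    my $tmp = "$path.tmp.$$";    # $$ is the current process id
    store($data, $tmp) or die "store to $tmp failed";
    rename($tmp, $path) or die "rename $tmp -> $path failed: $!";
    return 1;
}

my $dir  = tempdir(CLEANUP => 1);
my $file = File::Spec->catfile($dir, 'db.sto');

my %db = (alice => 30, bob => 25);
store_atomic(\%db, $file);

# Read the whole hash back in one go.
my $loaded = retrieve($file);
```

This protects readers from partial files, but it does not coordinate two writers; concurrent updates would still need a lock.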

What is the Storable module used for?

I am having a hard time understanding what Storable does.
I know that it "stores" a variable into your disk, but why would I need to do that? What would I use this module for, and how would I do it?
Reasons that spring to mind:
Persist memory across script calls
Sharing variables across different processes (sometimes it isn't possible to pipe stuff)
Of course, that's not all that Storable does. It also:
Makes it possible to create deep clones of data structures
Serializes the data structure stored, which implies a smaller file footprint than output from Data::Dump
Is optimized for speed (so it's faster to retrieve than to require a file containing Data::Dump output)
One example:
Your program spends a long time populating your data structure, a graph, or trie, and if the program crashes then you'd lose it all and have to start again from square one. To avoid losing this data and be able to continue where it stopped last time you can save a snapshot of the data to a file manually or just simply use Storable.
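As a rough illustration of the uses listed above, here is a small sketch that snapshots a structure to disk, reads it back, and makes a deep clone. The graph data and the file name graph.sto are made up for the example.

```perl
use strict;
use warnings;
use Storable qw(store retrieve dclone);
use File::Temp qw(tempdir);

my $dir      = tempdir(CLEANUP => 1);
my $snapshot = "$dir/graph.sto";    # hypothetical snapshot file

# A small adjacency-list graph standing in for an expensive structure.
my %graph = (a => ['b', 'c'], b => ['c'], c => []);

# Persist a snapshot; a later run (or another process) can pick it up.
store(\%graph, $snapshot);
my $restored = retrieve($snapshot);

# dclone makes a deep copy: changing the copy leaves the original alone.
my $copy = dclone(\%graph);
push @{ $copy->{a} }, 'd';
```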

Efficient disk access of large number of small .mat files containing objects

I'm trying to determine the best way to store large numbers of small .mat files, around 9000 objects with sizes ranging from 2k to 100k, for a total of around half a gig.
The typical use case is that I only need to pull a small number (say 10) of the files from disk at a time.
What I've tried:
Method 1: If I save each file individually, I get performance problems (very slow save times and system sluggishness for some time after) as Windows 7 has difficulty handling so many files in a folder (and I think my SSD is having a rough time of it, too). However, the end result is fine: I can load what I need very quickly. This is using '-v6' save.
Method 2: If I save all of the files in one .mat file and then load just the variables I need, access is very slow (loading takes around three quarters of the time it takes to load the whole file, with small variation depending on the ordering of the save). This is using '-v6' save, too.
I know I could split the files up into many folders but it seems like such a nasty hack (and won't fix the SSD's dislike of writing many small files), is there a better way?
Edit:
The objects consist mainly of a numeric matrix of double data and an accompanying vector of uint32 identifiers, plus a bunch of small identifying properties (char and numeric).
Five ideas to consider:
Try storing in an HDF5 object - take a look at http://www.mathworks.com/help/techdoc/ref/hdf5.html - you may find that this solves all of your problems. It will also be compatible with many other systems (e.g. Python, Java, R).
A variation on your method #2 is to store them in one or more files, but to turn off compression.
Different datatypes: It may also be the case that you have some objects that compress or decompress inexplicably poorly. I have had such issues with either cell arrays or struct arrays. I eventually found a way around it, but it's been awhile & I can't remember how to reproduce this particular problem. The solution was to use a different data structure.
@SB proposed a database. If all else fails, try that. I don't like building external dependencies and additional interfaces, but it should work (the primary problem is that if the DB starts to groan or corrupts your data, then you're back at square one). For this purpose consider SQLite, which doesn't require a separate server/client framework. There is an interface available on Matlab Central: http://www.mathworks.com/matlabcentral/linkexchange/links/1549-matlab-sqlite
(New) Considering that the objects are less than 1GB, it may be easier to just copy the entire set to a RAM disk and then access through that. Just remember to copy from the RAM disk if anything is saved (or wrap save to save objects in two places).
Update: The OP has mentioned custom objects. There are two methods to consider for serializing these:
Two serialization programs from Matlab Central: http://www.mathworks.com/matlabcentral/fileexchange/29457 - which was inspired by: http://www.mathworks.com/matlabcentral/fileexchange/12063-serialize
Google's Protocol Buffers. Take a look here: http://code.google.com/p/protobuf-matlab/
Try storing them as blobs in a database.
I would also try the multiple folders method as well - it might perform better than you think. It might also help with organization of the files if that's something you need.
The solution I have come up with is to save object arrays of around 100 of the objects each. These files tend to be 5-6 meg so loading is not prohibitive and access is just a matter of loading the right array(s) and then subsetting them to the desired entry(ies). This compromise avoids writing too many small files, still allows for fast access of single objects and avoids any extra database or serialization overhead.

looking for light-weight data persistence solution in perl

In my app I need to store some simple data both in memory and on disk. A real database would be overkill in my case, so I need something lighter to handle this simple data persistence requirement. I did some Google searching myself and found some interesting things like DBM and DBI CSV, but there are so many options that it is difficult for me to make the actual choice. So I'd like to ask here for the "best-practice" light-weight data persistence solution in Perl.
You have several options:
Storable is a core module and is very efficient. It has some problems with portability; for example, someone using an older version of Storable may not be able to read your data. Also, the endianness of the system creating and retrieving that data is important. The network order storage options help reduce the portability issues. You can store an arbitrary nested data structure to a file or string and restore it. Storable is supported only by Perl.
YAML is a text based format that works like storable--you can store and restore arbitrary structures to/from YAML files. YAML is nice because there are YAML libraries for several languages. It's not quite as speedy or space efficient as Storable.
JSON is a popular data exchange format with support in many languages. It is very much like YAML in both its strengths and weaknesses.
DBD::SQLite is a database driver for the DBI that allows you to keep a whole relational database in a single file. It is powerful and allows you to work with many of the persistence tools that are aimed at other databases like MySQL and Postgres.
DBM::Deep is a convenient and powerful Perl-only module that allows efficient retrieval and modification of small parts of a large persistent data structure. Almost as easy to use as Storable, but far more efficient when dealing with only small portions of a large data structure.
Update: I realized that I should mention that I have used all of these modules and depending on your particular needs, any of them could be "the right choice".
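For the network order option mentioned under Storable above, a minimal sketch might look like this. nstore writes in network byte order and retrieve detects the format when reading; the sample structure and the file name config.sto are assumptions for illustration.

```perl
use strict;
use warnings;
use Storable qw(nstore retrieve);
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/config.sto";    # hypothetical file name

# An arbitrary nested structure.
my $config = {
    name   => 'demo',
    sizes  => [1, 2, 3],
    nested => { ok => 1 },
};

# nstore is the network-byte-order variant of store, which makes the
# file readable on machines with a different endianness.
nstore($config, $file);

# retrieve handles both native and network order files.
my $back = retrieve($file);
```

This helps with endianness only; the caveat about older Storable versions reading newer files still applies.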
You might want to try Tie::Storable. Then it's as simple as addressing a hash.
If you're not looking to store a ton of data and you're OK loading everything all at once at program startup, it might be the way to go.
If you're looking for something more sophisticated but still light weight, a lot of people (including myself) swear by SQLite.
If I had to do this I would probably go with DBI and DBD::SQLite, since it does not involve reading all the data into memory, but I'd just like to mention a few other ways, because "there's more than one way to do it":
The old way to do this was with DB_File and its cousins. It still works with modern versions of Perl. The drawback is that it is only useful for storing a one-dimensional hash (a hash which doesn't have any references in it). The advantage is that you can find nice books about it which don't cost very much money, and also online articles, and I believe it doesn't involve reading the whole file into memory.
Another method is to print the contents of Data::Dumper to a file to store, and eval the contents of the file to read the data.
Yet another thing which hasn't been mentioned is KiokuDB, which looks like the cutting-edge Moose-based module, if you want to be trendy.
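The Data::Dumper approach from the list above could be sketched as follows. The file name data.pl and the sample data are hypothetical, and because the file is read back with a string eval, this should only be used on files you trust.

```perl
use strict;
use warnings;
use Data::Dumper;
use File::Temp qw(tempdir);

my $dir  = tempdir(CLEANUP => 1);
my $file = "$dir/data.pl";    # hypothetical file name

my $data = { users => ['ann', 'bob'], count => 2 };

# Store: print the Dumper output, which is Perl source of the form
# "$VAR1 = { ... };", to a file.
open my $out, '>', $file or die "cannot write $file: $!";
print {$out} Dumper($data);
close $out or die "cannot close $file: $!";

# Read: slurp the file and eval it. The lexical $VAR1 is declared so
# the dumped assignment compiles under strict; eval returns the value
# of that assignment, which is the restored reference.
open my $in, '<', $file or die "cannot read $file: $!";
my $text = do { local $/; <$in> };
close $in;

my $VAR1;
my $loaded = eval $text;
die "could not parse $file: $@" if $@;
```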
Do you want your data to be transparently persisted, i.e. you won't have to worry about doing a commit()-type operation after every write? I just asked a very similar question: Simple, modern, robust, transparent persistence of data structures for Perl, and listed all the solutions I found.
If you do want transparent persistence (autocommit), then DBM::Deep may be easier to use than Storable. Here is example code that works out of the box:
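use strict;
use warnings;
use DBM::Deep;

# Tie the hash to a file; every change is written through to file.db.
tie my %db, 'DBM::Deep', 'file.db';
if ( exists $db{foo}->{bar} ) {
    # Value persisted by an earlier run.
    print $db{foo}->{bar}, "\n";
} else {
    $db{foo}->{bar} = 'baz';
}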
Look into Tie::File and submodules like Tie::File::AsHash, or Tie::Handle::CSV. All available on CPAN, fast and easy to use.
Storable lets you serialize any Perl data structure and read it back in. For in-memory storage, just use IO::Scalar to store into a string; that way you only need to write the code once, and for writing to disk you just pass in another I/O handle.
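As an alternative to going through an I/O handle, Storable itself exports freeze and thaw, which serialize to and from an in-memory string. A minimal sketch with made-up sample data:

```perl
use strict;
use warnings;
use Storable qw(freeze thaw);

my $data = { id => 7, tags => ['x', 'y'] };

# freeze returns the serialized structure as a byte string that can be
# kept in memory, sent over a socket, or written to a file later.
my $bytes = freeze($data);

# thaw rebuilds an independent copy of the structure from that string.
my $copy = thaw($bytes);
```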