I have 2 million records in the database. Is it possible to fetch them all and store them in a Perl hash reference without running out of memory?
What is your reason for reading them all into memory? Speed, or ease of coding (i.e. treating the whole thing as a hashref)?
If it's the former, then sure, I think; you just need a ton of RAM.
If it's the latter, then there are interesting options. For example, there are tied interfaces for databases that look like native Perl hashes but in reality query and return data as needed. A quick search of CPAN shows Tie::DBI, Tie::Hash::DBD, and several tied interfaces for specific databases, flat-file DBs, and CSV files, including my own Tie::Array::CSV.
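For example, a minimal sketch with Tie::DBI; the DSN, table name, and key column below are invented, so adjust them for your schema:

    use DBI;
    use Tie::DBI;

    # Hypothetical connection details; Tie::DBI accepts a DBI data
    # source string (or an already-open handle) plus table and key.
    tie my %record, 'Tie::DBI', {
        db    => 'dbi:SQLite:dbname=app.db',
        table => 'records',
        key   => 'id',
    };

    # Looks like a plain hash, but each access queries the database,
    # so the 2 million rows never have to fit in memory at once.
    print $record{42}{name}, "\n" if exists $record{42};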
Processing two million elements in a hash isn't unheard of, but we don't know how big your records are. In any case, this sounds like an XY problem: loading everything into a hash may not be the best solution for the problem you're actually facing.
Why not use DBIx::Class so that your tables can be treated like Perl classes (which are themselves glorified data structures)? There's a ton of documentation at DBIx::Class::Manual::DocMap. This is really what DBIx::Class is all about: letting you abstract away the SQL details of the database and treat it as a series of classes.
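A rough sketch of what that looks like; MyApp::Schema and the Record source (with its id, name, and status columns) are hypothetical stand-ins for your own generated schema classes:

    use MyApp::Schema;

    my $schema = MyApp::Schema->connect('dbi:SQLite:dbname=app.db');

    # A resultset is lazy: rows are fetched as you iterate,
    # not slurped into one giant hash up front.
    my $rs = $schema->resultset('Record')->search( { status => 'active' } );
    while ( my $row = $rs->next ) {
        printf "%d: %s\n", $row->id, $row->name;
    }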
That depends entirely on how much data your records hold. Perl hashes and arrays take up more memory than you'd think, although it's not crazy. It all comes down to what your data looks like and how much RAM you have; Perl itself won't have any problem with it if you have the RAM.
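If you want a number before committing, Devel::Size can estimate the in-memory footprint of a sample shaped like your records (the record layout below is invented):

    use Devel::Size qw(total_size);

    # Build 10k fake records shaped roughly like yours, then extrapolate.
    # This is only a rough estimate; measure with your real data.
    my %sample = map { $_ => { name => 'x' x 32, value => rand } } 1 .. 10_000;

    printf "sample: %.1f MB; extrapolated to 2M records: ~%.1f GB\n",
        total_size( \%sample ) / 2**20,
        total_size( \%sample ) * 200 / 2**30;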
I may be way off base here, but my general understanding of relational databases is that breaking information into tables makes for faster querying. So I designed a somewhat complicated data structure. I had no trouble implementing it with migrations and model relations (has_many, belongs_to, etc.), and I thought that it would allow for easy querying, but I seem to be missing something! I have spent the better part of a week trying to build some sort of query (or several), and I'm getting nowhere.
I need bits and pieces of about 5 tables for my output. Additionally, I need to find unique entries based on a combination of 3 fields (think first, middle, and last name, where last name comes from another table). To further complicate things, I need to find arrays of objects (think pens owned by each person).
Rather than get bogged down with the specifics of the code, I was wondering if people might recommend some good videos or blogs that might help me understand what I'm doing. I've tried to wade my way through RailsGuides, and a few of the doc sites, but my understanding is spotty, and I'm having a hard time pulling it all together.
I also think that starting off this way might show me that I've made conceptual and design errors that might not be apparent if I were to just post hundreds of lines of code.
My requirement is to maintain a simple data store with some rows (~1000) and columns (6).
Over a period of time (2 years) I expect the data to grow to 1000-1500 rows.
I would like to query, insert, and update the data store.
I need this data store because it will be processed by another script.
I am using Perl for programming.
I have seen some threads on Stack Overflow about this (e.g. "looking for light-weight data persistence solution in perl"), but I cannot make a decision.
Is anyone using a lightweight data store in Perl with query, insert, and update capabilities?
Go for SQLite. It is powerful, tunable, and lightweight.
You've already accepted an answer, but in your case I might just go with hashes and use Storable to write the structure to disk. That is, if you don't have multiple people using the data at once.
The advantage is that it's all standard Perl, so it will work with almost any Perl installation. You can't get any lighter weight than this.
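A sketch of that approach (the filename is invented); nstore writes in network byte order, which helps portability:

    use Storable qw(nstore retrieve);

    my %store = ( 1 => { name => 'widget', qty => 10 } );

    nstore( \%store, 'store.dat' );        # write the whole structure out
    my $loaded = retrieve('store.dat');    # read it all back at startup
    print $loaded->{1}{name}, "\n";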
Probably the simplest lightweight solution would be to use DBI with DBD::SQLite.
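A minimal sketch of that combination; the schema and filename here are invented:

    use DBI;

    my $dbh = DBI->connect( 'dbi:SQLite:dbname=store.db', '', '',
        { RaiseError => 1 } );

    $dbh->do('CREATE TABLE IF NOT EXISTS items
              (id INTEGER PRIMARY KEY, name TEXT, qty INTEGER)');

    # Insert, update, and query: the three operations asked for.
    $dbh->do( 'INSERT INTO items (name, qty) VALUES (?, ?)', undef, 'widget', 10 );
    $dbh->do( 'UPDATE items SET qty = ? WHERE name = ?',     undef, 12, 'widget' );

    my $rows = $dbh->selectall_arrayref('SELECT id, name, qty FROM items');
    print "@$_\n" for @$rows;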
If your data is relational and you are comfortable with SQL then I vote DBD::SQLite.
However, if your data is more like documents (each entry is self-contained), or if you are not comfortable with SQL, then I recommend DBM::Deep. Its interface is as easy to use as regular Perl variables.
Finally, if you want to be really modern, MongoDB is very easy to install and the new Mango Perl module is very cool, just saying :-)
Can I use Redis for big databases instead of an SQL-based DB?
I really like the Redis architecture, and I know that Redis supports virtual memory.
But I don't know how fast it is.
PS: When I say big database, I don't mean REALLY big, just the DB of some site that can't fit in memory.
It all depends on your application. Redis is a key-value store. If your data is stored in a relational format and does not lend itself well to key-based retrieval, then it is not really the solution for you. Also, Redis performs best when the dataset is held in memory. If transactions are important to you, it is not a good solution either. I use Redis and love it, but it seems many people try to shoehorn it into applications it is not suited for. If you are storing things like web pages, comments, tweets, etc., I would also consider a document store like MongoDB. You would have to add more details about your implementation.
Redis virtual memory performance depends how you use it - if your frequently used values all fit in memory, with occasional disk reads for less popular content, there won't be that much difference from having the full dataset in memory. If your access requirements are completely random and a value only gets read once after being loaded from disk it won't perform any better than reading files from disk. Of course any caching solution will have problems in that scenario.
There are a couple of things to consider when designing for Redis VM, though:
All keys need to fit in memory, even if their values don't. This works well if your value is a large serialized object, but not so well if your value is a single number or short string and may actually be smaller than the key.
The save process works a little differently when VM is used. I don't recall the details, but you may need to consider what is guaranteed to be written to disk when.
In my app I need to store some simple data both in memory and on disk. A real database would be overkill in my case, so I need something lighter to handle the simple data persistence requirement. I did some searching on my own and found interesting options like DBM and DBD::CSV, etc., but since there are so many choices it is difficult for me to decide, so I'd like to ask here for a "best-practice" lightweight data persistence solution in Perl.
You have several options:
Storable is a core module and is very efficient. It has some portability problems: for example, someone using an older version of Storable may not be able to read your data, and the endianness of the systems writing and reading the data matters. The network-order storage options help reduce these portability issues. You can store an arbitrarily nested data structure to a file or string and restore it. Storable is supported only by Perl.
YAML is a text-based format that works like Storable: you can store and restore arbitrary structures to and from YAML files. YAML is nice because there are YAML libraries for several languages. It's not quite as fast or space-efficient as Storable, though.
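A short sketch using YAML::XS (one of several CPAN YAML modules); the filename is invented:

    use YAML::XS qw(DumpFile LoadFile);

    my $data = { users => [ { name => 'ana' }, { name => 'bob' } ] };

    DumpFile( 'data.yml', $data );      # human-readable text file
    my $copy = LoadFile('data.yml');
    print $copy->{users}[1]{name}, "\n";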
JSON is a popular data exchange format with support in many languages. It is very much like YAML in both its strengths and weaknesses.
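The code looks much the same; a sketch with JSON::PP, which ships with Perl (core since 5.14):

    use JSON::PP qw(encode_json decode_json);

    my $data = { counts => { apples => 3 } };

    # Serialize to a file...
    open my $out, '>', 'data.json' or die $!;
    print {$out} encode_json($data);
    close $out;

    # ...and read it back.
    open my $in, '<', 'data.json' or die $!;
    my $copy = decode_json( do { local $/; <$in> } );
    print $copy->{counts}{apples}, "\n";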
DBD::SQLite is a database driver for the DBI that allows you to keep a whole relational database in a single file. It is powerful and allows you to work with many of the persistence tools that are aimed at other databases like MySQL and Postgres.
DBM::Deep is a convenient and powerful pure-Perl module that allows efficient retrieval and modification of small parts of a large persistent data structure. It is almost as easy to use as Storable, but far more efficient when you only touch small portions of a large structure.
Update: I should mention that I have used all of these modules and, depending on your particular needs, any of them could be "the right choice".
You might want to try Tie::Storable. Then it's as simple as addressing a hash.
If you're not looking to store a ton of data and you're OK loading everything all at once at program startup, it might be the way to go.
If you're looking for something more sophisticated but still lightweight, a lot of people (including myself) swear by SQLite.
If I had to do this I would probably go with DBI and DBD::SQLite, since it does not involve reading all the data into memory, but I'd just like to mention a few other ways, because "there's more than one way to do it":
The old way to do this was with DB_File and its cousins. They still work with modern versions of Perl. The drawback is that they are only useful for storing a one-dimensional hash (a hash which doesn't have any references in it). The advantages are that you can find nice books about them which don't cost very much money, as well as online articles, and, I believe, they don't involve reading the whole file into memory.
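A minimal sketch (the filename is invented); note that values really are limited to flat strings and numbers:

    use DB_File;
    use Fcntl;

    # Tie a hash to a Berkeley DB file; reads and writes go to disk.
    tie my %h, 'DB_File', 'legacy.db', O_CREAT | O_RDWR, 0644, $DB_HASH
        or die "Cannot open legacy.db: $!";

    $h{answer} = 42;          # written through to the file
    print $h{answer}, "\n";
    untie %h;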
Another method is to print the output of Data::Dumper to a file to store the data, and eval the contents of that file to read it back.
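A sketch of that approach; since eval-ing a file executes whatever code it contains, only do this with files you trust:

    use Data::Dumper;

    my $original = { name => 'widget', qty => 10 };

    # Write the structure out as Perl source.
    open my $out, '>', 'data.dump' or die $!;
    local $Data::Dumper::Purity = 1;    # emit code that rebuilds shared refs
    print {$out} Data::Dumper->Dump( [$original], ['data'] );
    close $out;

    # Read it back by eval-ing the file's contents.
    my $restored = do {
        open my $in, '<', 'data.dump' or die $!;
        local $/;
        our $data;
        eval <$in>;
        die $@ if $@;
        $data;
    };
    print $restored->{qty}, "\n";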
Yet another thing which hasn't been mentioned is KiokuDB, which looks like the cutting-edge Moose-based module, if you want to be trendy.
Do you want your data to be transparently persisted, i.e. so you won't have to worry about doing a commit()-type operation after every write? I just asked a very similar question: Simple, modern, robust, transparent persistence of data structures for Perl, and listed all the solutions I found.
If you do want transparent persistence (autocommit), then DBM::Deep may be easier to use than Storable. Here is example code that works out of the box:
    use DBM::Deep;

    tie my %db, 'DBM::Deep', 'file.db';

    if ( exists $db{foo}{bar} ) {
        print $db{foo}{bar}, "\n";
    } else {
        $db{foo}{bar} = 'baz';
    }
Look into Tie::File and related modules like Tie::File::AsHash, or Tie::Handle::CSV. All are available on CPAN, fast, and easy to use.
Storable lets you serialize any Perl data structure and read it back in. For in-memory storage, just use IO::Scalar to write into a string; that way you only have to write the code once, and for writing to disk you simply pass in another I/O handle.
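As a side note, Storable's freeze and thaw serialize to and from a plain in-memory string directly, so you may not even need IO::Scalar for the in-memory half:

    use Storable qw(freeze thaw);

    my $frozen = freeze( { colors => [ 'red', 'blue' ] } );   # to a string
    my $copy   = thaw($frozen);                               # and back
    print $copy->{colors}[1], "\n";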
Object databases like MongoDB and db4o are getting lots of publicity lately. Everyone that plays with them seems to love it. I'm guessing that they are dealing with about 640K of data in their sample apps.
Has anyone tried to use an object database with a large amount of data (say, 50GB or more)? Are you able to still execute complex queries against it (like from a search screen)? How does it compare to your usual relational database of choice?
I'm just curious. I want to take the object database plunge, but I need to know if it'll work on something more than a sample app.
Someone just went into production with 12 terabytes of data in MongoDB. The largest I knew of before that was 1 TB. Lots of people are keeping really large amounts of data in Mongo.
It's important to remember that Mongo works a lot like a relational database: you need the right indexes to get good performance. You can use explain() on queries and contact the user list for help with this.
When I started db4o back in 2000, I didn't have huge databases in mind. The key goal was to store any complex object very simply with one line of code, and to do that well and fast with low resource consumption, so it could run embedded and on mobile devices.
Over time we had many users who used db4o for web apps and with quite large amounts of data, going close to today's maximum database file size of 256 GB (with a configured block size of 127 bytes). So to answer your question: yes, db4o will work with 50 GB, but you shouldn't plan to use it for terabytes of data (unless you can nicely split your data over multiple db4o databases; the setup cost for a single database is negligible, you can just call #openFile()).
db4o was acquired by Versant in 2008, because its capabilities (embedded, low resource consumption, lightweight) make it a great complementary product to Versant's high-end object database VOD. VOD scales to huge amounts of data, and it does so much better than relational databases. I think it will merely chuckle at 50 GB.
MongoDB powers SourceForge, The New York Times, and several other large databases...
You should read the MongoDB use cases. People who are just playing with technology are often only looking at how it works and are not at the point where they understand the limitations. For the right sorts of datasets and access patterns, 50 GB is nothing for MongoDB running on the right hardware.
These non-relational systems look at the trade-offs RDBMSs made and change them a bit. Consistency is not as important as other things in some situations, so these solutions let you trade it away for something else. The trade-off is still relatively minor: milliseconds, or maybe seconds, in some situations.
It is worth reading about the CAP theorem too.
I was looking at moving the API for the Stack Overflow iPhone app I wrote a while back from the MySQL database where it currently sits to MongoDB. In raw form the SO CC dump is in the multi-gigabyte range, and the way I constructed the documents for MongoDB resulted in a 10 GB+ database. It is arguable that I didn't construct the documents well, but I didn't want to spend a ton of time on it.
One of the very first things you will run into if you start down this path is the lack of 32-bit support. Of course everything is moving to 64-bit now, but it's something to keep in mind. I don't think any of the major document databases support paging in 32-bit mode, and that is understandable from a code-complexity standpoint.
To test what I wanted to do, I used a 64-bit EC2 instance. The second thing I ran into is that even though this machine had 7 GB of memory, once physical memory was exhausted things went from fast to not so fast. I'm not sure I didn't have something set up incorrectly at that point, because the lack of 32-bit support had already killed what I wanted to use it for, but I still wanted to see what it looked like. Loading the same data dump into MySQL takes about 2 minutes on a much less powerful box, but the scripts I used to load the two databases work differently, so I can't make a good comparison. Running only a subset of the data into MongoDB was much faster, as long as the resulting database was smaller than 7 GB.
I think my takeaway was that large databases will work just fine, but you may have to think about how the data is structured more than you would with a traditional database if you want to maintain high performance. I see a lot of people using MongoDB for logging, and I can imagine many of those databases are massive, but at the same time they may not be doing much random access, so that may mask what performance would look like for more traditional applications.
A recent resource that might be helpful is the visual guide to nosql systems. There are a decent number of choices outside of MongoDB. I have used Redis as well although not with as large of a database.
Here are some benchmarks on db4o:
http://www.db4o.com/about/productinformation/benchmarks/
I think it ultimately depends on a lot of factors, including the complexity of the data, but db4o certainly seems to hang with the best of them.
Perhaps worth a mention.
The European Space Agency's Planck mission is running on the Versant Object Database.
http://sci.esa.int/science-e/www/object/index.cfm?fobjectid=46951
It is a satellite with 74 onboard sensors, launched last year, which is mapping the infrared spectrum of the universe and storing the information in a map segment model. It has been getting a ton of hype these days because it's producing some of the coolest images of the universe ever seen.
Anyway, it has generated 25 TB of information, stored in Versant and replicated across 3 continents. When the mission is complete next year, it will be a total of 50 TB.
It is probably also worth noting that object databases tend to be a lot smaller when holding the same information. That is because they are truly normalized: no data duplication for joins, no empty wasted column space, and a few indexes rather than hundreds of them. You can find public information about the testing ESA did when deciding between storage in a multi-column relational database format and a proper object model stored in the Versant object database. They found they could save 75% disk space by using Versant.
Here is the implementation:
http://www.planck.fr/Piodoc/PIOlib_Overview_V1.0.pdf
Here they talk about the 3 TB vs. 12 TB found in the testing:
http://newscenter.lbl.gov/feature-stories/2008/12/10/cosmic-data/
Also, there are benchmarks which show Versant orders of magnitude faster on the analysis side of the mission.
Cheers,
-Robert