Can CouchDB handle 15 million records daily? - nosql

I'm relatively new to NoSQL databases and I have to evaluate different NoSQL-Solutions for a monitoring tool.
The situation is the following:
One datum is just about 100 Bytes big, but there are really a lot of them. During a day we get about 15 million records... So I'm currently testing with 900 million records (about 15GB as SQL-Insert Script)
My question is: Does Couchdb fit my needs? I need to do range querys (on the date the records were created) and sum up some of the columns acoording to groups definied by "secondary indexes" stored in the datum.)
I know that MapReduce is probably the best solution to calculate that, but is the JavaScript of CouchDB able to do this in an acceptable time?
I already tried MongoDB but it's really poor MapReduce made a crappy job... I also read about HBase and Cassandra. But maybee CouchDB is also a good possibility
I hope I gave you all the needed information... Thank you for your help!
andy

Frankly, at this time, unless you have very good hardware, Apache CouchDB may run into problems. Map/reduce will probably be fine. CouchDB's incremental map/reduce is ideal for your requirements.
As a developer, you will love it! Unfortunately as a sysadmin, you may notice more disk usage and i/o than expected.
I suggest to try it. Being HTTP and Javascript, it's easy to do a feasibility test. Just remember, the initial view build will take a long time (let's assume for argument it takes longer than every other competing database). But that time will never be spent again. Map/reduce runs only once per document (actually per document update).

Related

DynamoDB vs MongoDB NoSQL [closed]

Closed. This question is opinion-based. It is not currently accepting answers.
Want to improve this question? Update the question so it can be answered with facts and citations by editing this post.
Closed 4 years ago.
Improve this question
I'm trying to figure out what can I use for a future project, we plan to store about 500k records per month in the first year and maybe more for the next years this is a vertical application so there's no need to use a database for this, that's the reason why I decided to choose a NoSQL data storage.
The first option that came to my mind was mongo DB since is a very mature product with a lot of support from the community but on the other hand, we got a brand new product that offers a managed service at top performance, I'll develop this application but there's no maintenance plan (at least for now) so I think that will be a huge advantage since amazon provides an elastic way to scale.
My major concern is about the query structure, I haven't looked at the dynamo DB query capabilities yet but since is a k/v data storage I feel that this could be more limited than mongo DB.
If someone had the experience of moving a project from MongoDB to DynamoDB, any advice will be totally appreciated.
I know this is old, but it still comes up when you search for the comparison. We were using Mongo, have moved almost entirely to Dynamo, which is our first choice now. Not because it has more features, it doesn't. Mongo has a better query language, you can index within a structure, there's lots of little things. The superiority of Dynamo is in what the OP stated in his comment: it's easy. You don't have to take care of any servers. When you start to set up a Mongo sharded solution, it gets complicated. You can go to one of the hosting companies, but that's not cheap either. With Dynamo, if you need more throughput, you just click a button. You can write scripts to scale automatically. When it's time to upgrade Dynamo, it's done for you. That is all a lot of precious stress and time not spent. If you don't have dedicated ops people, Dynamo is excellent.
So we are now going on Dynamo by default. Mongo maybe, if the data structure is complicated enough to warrant it, but then we'd probably go back to a SQL database. Dynamo is obtuse, you really need to think about how you're going to build it, and likely you'll use Redis in Elasticcache to make it work for complex stuff. But it sure is nice to not have to take care of it. You code. That's it.
With 500k documents, there is no reason to scale whatsoever. A typical laptop with an SSD and 8GB of ram can easily do 10s of millions of records, so if you are trying to pick because of scaling your choice doesn't really matter. I would suggest you pick what you like the most, and perhaps where you can find the most online support with.
For quick overview comparisons, I really like this website, that has many comparison pages, eg AWS DynamoDB vs MongoDB; http://db-engines.com/en/system/Amazon+DynamoDB%3BMongoDB
Short answer: Start with SQL and add NoSQL only when/if needed. (unless you don't need anything beyond very simple queries)
My personal experience: I haven't used MongoDB for queries but as of April 2015 DynamoDB is still very crippled when it comes to anything beyond the most basic key/value queries. I love it for the basic stuff but if you want query language then look to a real SQL database solution.
In DynamoDB you can query on a hash or on a hash and range key, and you can have multiple secondary global indexes. I'm doing queries on a single table with 4 possible filter parameters and sorting the results, this is supported (barely) through the use of the global secondary indexes with filter expressions. The problem comes in when you try to get the total results matching the filter, you can't just search for the first 10 items matching the filter, but rather it checks 10 items and you may get 0 valid results forcing you to keep re-scanning from the continue key - pain in the neck and consumes too much of your table read quota for a simple scenario.
To be specific about the limit problem with filters in the query, this is from the docs (http://docs.aws.amazon.com/amazondynamodb/latest/developerguide/QueryAndScan.html#ScanQueryLimit):
In a response, DynamoDB returns all the matching results within
the scope of the Limit value. For example, if you issue a Query
or a Scan request with a Limit value of 6 and without a filter
expression, the operation returns the first six items in the
table that match the request parameters. If you also supply a
FilterExpression, the operation returns the items within the
first six items in the table that match the filter requirements.
My conclusion is that queries involving FilterExpressions are only usable on very rare occasions and are not scalable because each query can easily read most or all of your of your table which consumes far too many DynamoDB read units. Once you use too many read units you'll get throttled and see poor performance.
Expert opinion: In the AWS summit on Apr 9, 2015 Brett Hollman, Manager, Solutions Architecture, AWS in his talk on scalling to your first 10 million users advocates starting with a SQL database and then using NoSQL only when and if it makes sense. Because sooner or later you'll probably need a SQL server somewhere in your stack. His slides are here: http://www.slideshare.net/AmazonWebServices/deep-dive-scaling-up-to-your-first-10-million-users
See slide 28.
We chose a combination of Mongo/Dynamo for a healthcare product. Basically mongo allows better searching, but the hosted Dynamo is great because its HIPAA compliant without any extra work. So we host the mongo portion with no personal data on a standard setup and allow amazon to deal with the HIPAA portion in terms of infrastructure. We can query certain items from mongo which bring up documents with pointers (ID's) of the relatable Dynamo document.
The main reason we chose to do this using mongo instead of hosting the entire application on dynamo was for 2 reasons. First, we needed to preform location based searches which mongo is great at and at the time, Dynamo was not, but they do have an option now.
Secondly was that some documents were unstructured and we did not know ahead of time what the data would be, so for example lets say user a inputs a document in the "form" collection like this: {"username": "user1", "email": "me#me.com"}. And another user puts this in the same collection {"phone": "813-555-3333", "location": [28.1234,-83.2342]}. With mongo we can search any of these dynamic and unknown fields at any time, with Dynamo, you could do this but would have to make a index every time a new field was added that you wanted searchable. So if you have never had a phone field in your Dynamo document before and then all of the sudden, some one adds it, its completely unsearchable.
Now this brings up another point in which you have mentioned. Sometimes choosing the right solution for the job does not always mean choosing the best product for the job. For example you may have a client who needs and will use the system you created for 10+ years. Going with a SaaS/IaaS solution that is good enough to get the job done may be a better option as you can rely on amazon to have up-kept and maintained their systems over the long haul.
I have worked on both and kind of fan of both.
But you need to understand when to use what and for what purpose.
I don't think It's a great idea to move all your database to DynamoDB, reason being querying is difficult except on primary and secondary keys, Indexing is limited and scanning in DynamoDB is painful.
I would go for a hybrid sort of DB, where extensive query-able data should be there is MongoDB, with all it's feature you would never feel constrained to provide enhancements or modifications.
DynamoDB is lightning fast (faster than MongoDB) so DynamoDB is often used as an alternative to sessions in scalable applications. DynamoDB best practices also suggests that if there are plenty of data which are less being used, move it to other table.
So suppose you have a articles or feeds. People are more likely to look for last week stuff or this month's stuff. chances are really rare for people to visit two year old data. For these purposes DynamoDB prefers to have data stored by month or years in different tables.
DynamoDB is seemlessly scalable, something you will have to do manually in MongoDB. however you would lose on performance of DynamoDB, if you don't understand about throughput partition and how scaling works behind the scene.
DynamoDB should be used where speed is critical, MongoDB on the other hand has too many hands and features, something DynamoDB lacks.
for example, you can have a replica set of MongoDB in such a way that one of the replica holds data instance of 8(or whatever) hours old. Really useful, if you messed up something big time in your DB and want to get the data as it is before.
That's my opinion though.
Bear in mind, I've only experimented with MongoDB...
From what I've read, DynamoDB has come a long way in terms of features. It used to be a super-basic key-value store with extremely limited storage and querying capabilities. It has since grown, now supporting bigger document sizes + JSON support and global secondary indices. The gap between what DynamoDB and MongoDB offers in terms of features grows smaller with every month. The new features of DynamoDB are expanded on here.
Much of the MongoDB vs. DynamoDB comparisons are out of date due to the recent addition of DynamoDB features. However, this post offers some other convincing points to choose DynamoDB, namely that it's simple, low maintenance, and often low cost. Another discussion here of database choices was interesting to read, though slightly old.
My takeaway: if you're doing serious database queries or working in languages not supported by DynamoDB, use MongoDB. Otherwise, stick with DynamoDB.

What is the write performance of Cassandra, HBase, MongoDB or Elastic Search

I was looking for a real test result of write performances of Cassandra, HBase, Mongodb and Elastic Search. I want to collect log data from our servers and this data is really huge. Over 10gb for an hour.
I know everyone is talking about cassandra's write performance and these products have different areas of usage. K/V, Document Oriented or search. But I am insterested in not their storage types right now, only the write performance. for example 50K transactions per second.
Also I want to search on top of this data.
What is your suggestions to handle over 50K trans per second.
Cassandra is pretty fast indeed.
Voldemort is very fast too. I know at least one massive website that chose it over all other options.
But, really, 50K/sec is nothing special. I did it with MySQL on one machine.
Note, however, that besides how quickly you can write data, you should be also interested in what can you with that data later. Otherwise you can just pipe all your writes to /dev/null. It will be insanely fast.
Also, any "benchmark" you find in the internet is useless. It either uses data with certain characteristics or synthesized data. And this can make all the difference. Nobody will do a benchmark of your scenarios, using your data, except for you.
You might be interested in Netflix's benchmark of Cassandra up to one million writes per second on EC2: http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html

NoSQL & AdHoc Queries - Millions of Rows

I currently run a MySQL-powered website where users promote advertisements and gain revenue every time someone completes one. We log every time someone views an ad ("impression"), every time a user clicks an add ("click"), and every time someone completes an ad ("lead").
Since we get so much traffic, we have millions of records in each of these respective tables. We then have to query these tables to let users see how much they have earned, so we end up performing multiple queries on tables with millions and millions of rows multiple times in one request, hundreds of times concurrently.
We're looking to move away from MySQL and to a key-value store or something along those lines. We need something that will let us store all these millions of rows, query them in milliseconds, and MOST IMPORTANTLY, use adhoc queries where we can query any single column, so we could do things like:
FROM leads WHERE country = 'US' AND user_id = 501 (the NoSQL equivalent, obviously)
FROM clicks WHERE ad_id = 1952 AND user_id = 200 AND country = 'GB'
etc.
Does anyone have any good suggestions? I was considering MongoDB or CouchDB but I'm not sure if they can handle querying millions of records multiple times a second and the type of adhoc queries we need.
Thanks!
With those requirements, you are probably better off sticking with SQL and setting up replication/clustering if you are running into load issues. You can set up indexing on a document database so that those queries are possible, but you don't really gain anything over your current system.
NoSQL systems generally improve performance by leaving out some of the more complex features of relational systems. This means that they will only help if your scenario doesn't require those features. Running ad hoc queries on tabular data is exactly what SQL was designed for.
CouchDB's map/reduce is incremental which means it only processes a document once and stores the results.
Let's assume, for a moment, that CouchDB is the slowest database in the world. Your first query with millions of rows takes, maybe, 20 hours. That sounds terrible. However, your second query, your third query, your fourth query, and your hundredth query will take 50 milliseconds, perhaps 100 including HTTP and network latency.
You could say CouchDB fails the benchmarks but gets honors in the school of hard knocks.
I would not worry about performance, but rather if CouchDB can satisfy your ad-hoc query requirements. CouchDB wants to know what queries will occur, so it can do the hard work up-front before the query arrives. When the query does arrive, the answer is already prepared and out it goes!
All of your examples are possible with CouchDB. A so-called merge-join (lots of equality conditions) is no problem. However CouchDB cannot support multiple inequality queries simultaneously. You cannot ask CouchDB, in a single query, for users between age 18-40 who also clicked fewer than 10 times.
The nice thing about CouchDB's HTTP and Javascript interface is, it's easy to do a quick feasibility study. I suggest you try it out!
Most people would probably recommend MongoDB for a tracking/analytic system like this, for good reasons. You should read the „MongoDB for Real-Time Analytics” chapter from the „MongoDB Definitive Guide” book. Depending on the size of your data and scaling needs, you could get all the performance, schema-free storage and ad-hoc querying features. You will need to decide for yourself if issues with durability and unpredictability of the system are risky for you or not.
For a simpler tracking system, Redis would be a very good choice, offering rich functionality, blazing speed and real durability. To get a feel how such a system would be implemented in Redis, see this gist. The downside is, that you'd need to define all the „indices” by yourself, not gain them for „free”, as is the case with MongoDB. Nevertheless, there's no free lunch, and MongoDB indices are definitely not a free lunch.
I think you should have a look into how ElasticSearch would enable you:
Blazing speed
Schema-free storage
Sharding and distributed architecture
Powerful analytic primitives in the form of facets
Easy implementation of „sliding window”-type of data storage with index aliases
It is in heart a „fulltext search engine”, but don't get yourself confused by that. Read the „Data Visualization with ElasticSearch and Protovis“ article for real world use case of ElasticSearch as a data mining engine.
Have a look on these slides for real world use case for „sliding window” scenario.
There are many client libraries for ElasticSearch available, such as Tire for Ruby, so it's easy to get off the ground with a prototype quickly.
For the record (with all due respect to #jhs :), based on my experience, I cannot imagine an implementation where Couchdb is a feasible and useful option. It would be an awesome backup storage for your data, though.
If your working set can fit in the memory, and you index the right fields in the document, you'd be all set. Your ask is not something very typical and I am sure with proper hardware, right collection design (denormalize!) and indexing you should be good to go. Read up on Mongo querying, and use explain() to test the queries. Stay away from IN and NOT IN clauses that'd be my suggestion.
It really depends on your data sets. The number one rule to NoSQL design is to define your query scenarios first. Once you really understand how you want to query the data then you can look into the various NoSQL solutions out there. The default unit of distribution is key. Therefore you need to remember that you need to be able to split your data between your node machines effectively otherwise you will end up with a horizontally scalable system with all the work still being done on one node (albeit better queries depending on the case).
You also need to think back to CAP theorem, most NoSQL databases are eventually consistent (CP or AP) while traditional Relational DBMS are CA. This will impact the way you handle data and creation of certain things, for example key generation can be come trickery.
Also remember than in some systems such as HBase there is no indexing concept. All your indexes will need to be built by your application logic and any updates and deletes will need to be managed as such. With Mongo you can actually create indexes on fields and query them relatively quickly, there is also the possibility to integrate Solr with Mongo. You don’t just need to query by ID in Mongo like you do in HBase which is a column family (aka Google BigTable style database) where you essentially have nested key-value pairs.
So once again it comes to your data, what you want to store, how you plan to store it, and most importantly how you want to access it. The Lily project looks very promising. The work I am involved with we take a large amount of data from the web and we store it, analyse it, strip it down, parse it, analyse it, stream it, update it etc etc. We dont just use one system but many which are best suited to the job at hand. For this process we use different systems at different stages as it gives us fast access where we need it, provides the ability to stream and analyse data in real-time and importantly, keep track of everything as we go (as data loss in a prod system is a big deal) . I am using Hadoop, HBase, Hive, MongoDB, Solr, MySQL and even good old text files. Remember that to productionize a system using these technogies is a bit harder than installing MySQL on a server, some releases are not as stable and you really need to do your testing first. At the end of the day it really depends on the level of business resistance and the mission-critical nature of your system.
Another path that no one thus far has mentioned is NewSQL - i.e. Horizontally scalable RDBMSs... There are a few out there like MySQL cluster (i think) and VoltDB which may suit your cause.
Again it comes to understanding your data and the access patterns, NoSQL systems are also Non-Rel i.e. non-relational and are there for better suit to non-relational data sets. If your data is inherently relational and you need some SQL query features that really need to do things like Cartesian products (aka joins) then you may well be better of sticking with Oracle and investing some time in indexing, sharding and performance tuning.
My advice would be to actually play around with a few different systems. However for your use case I think a Column Family database may be the best solution, I think there are a few places which have implemented similar solutions to very similar problems (I think the NYTimes is using HBase to monitor user page clicks). Another great example is Facebook and like, they are using HBase for this. There is a really good article here which may help you along your way and further explain some points above. http://highscalability.com/blog/2011/3/22/facebooks-new-realtime-analytics-system-hbase-to-process-20.html
Final point would be that NoSQL systems are not the be all and end all. Putting your data into a NoSQL database does not mean its going to perform any better than MySQL, Oracle or even text files... For example see this blog post: http://mysqldba.blogspot.com/2010/03/cassandra-is-my-nosql-solution-but.html
I'd have a look at;
MongoDB - Document - CP
CouchDB - Document - AP
Redis - In memory key-value (not column family) - CP
Cassandra - Column Family - Available & Partition Tolerant (AP)
HBase - Column Family - Consistent & Partition Tolerant (CP)
Hadoop/Hive - Also have a look at Hadoop streaming...
Hypertable - Another CF CP DB.
VoltDB - A really good looking product, a relation database that is distributed and might work for your case (may be an easier move). They also seem to provide enterprise support which may be more suited for a prod env (i.e. give business users a sense of security).
Any way thats my 2c. Playing around with the systems is really the only way your going to find out what really works for your case.

MongoDB vs. Redis vs. Cassandra for a fast-write, temporary row storage solution

I'm building a system that tracks and verifies ad impressions and clicks. This means that there are a lot of insert commands (about 90/second average, peaking at 250) and some read operations, but the focus is on performance and making it blazing-fast.
The system is currently on MongoDB, but I've been introduced to Cassandra and Redis since then. Would it be a good idea to go to one of these two solutions, rather than stay on MongoDB? Why or why not?
Thank you
For a harvesting solution like this, I would recommend a multi-stage approach. Redis is good at real time communication. Redis is designed as an in-memory key/value store and inherits some very nice benefits of being a memory database: O(1) list operations. For as long as there is RAM to use on a server, Redis will not slow down pushing to the end of your lists which is good when you need to insert items at such an extreme rate. Unfortunately, Redis can't operate with data sets larger than the amount of RAM you have (it only writes to disk, reading is for restarting the server or in case of a system crash) and scaling has to be done by you and your application. (A common way is to spread keys across numerous servers, which is implemented by some Redis drivers especially those for Ruby on Rails.) Redis also has support for simple publish/subscribe messenging, which can be useful at times as well.
In this scenario, Redis is "stage one." For each specific type of event you create a list in Redis with a unique name; for example we have "page viewed" and "link clicked." For simplicity we want to make sure the data in each list is the same structure; link clicked may have a user token, link name and URL, while the page viewed may only have the user token and URL. Your first concern is just getting the fact it happened and whatever absolutely neccesary data you need is pushed.
Next we have some simple processing workers that take this frantically inserted information off of Redis' hands, by asking it to take an item off the end of the list and hand it over. The worker can make any adjustments/deduplication/ID lookups needed to properly file the data and hand it off to a more permanent storage site. Fire up as many of these workers as you need to keep Redis' memory load bearable. You could write the workers in anything you wish (Node.js, C#, Java, ...) as long as it has a Redis driver (most web languages do now) and one for your desired storage (SQL, Mongo, etc.)
MongoDB is good at document storage. Unlike Redis it is able to deal with databases larger than RAM and it supports sharding/replication on it's own. An advantage of MongoDB over SQL-based options is that you don't have to have a predetermined schema, you're free to change the way data is stored however you want at any time.
I would, however, suggest Redis or Mongo for the "step one" phase of holding data for processing and use a traditional SQL setup (Postgres or MSSQL, perhaps) to store post-processed data. Tracking client behavior sounds like relational data to me, since you may want to go "Show me everyone who views this page" or "How many pages did this person view on this given day" or "What day had the most viewers in total?". There may be even more complex joins or queries for analytic purposes you come up with, and mature SQL solutions can do a lot of this filtering for you; NoSQL (Mongo or Redis specifically) can't do joins or complex queries across varied sets of data.
I currently work for a very large ad network and we write to flat files :)
I'm personally a Mongo fan, but frankly, Redis and Cassandra are unlikely to perform either better or worse. I mean, all you're doing is throwing stuff into memory and then flushing to disk in the background (both Mongo and Redis do this).
If you're looking for blazing fast speed, the other option is to keep several impressions in local memory and then flush them disk every minute or so. Of course, this is basically what Mongo and Redis do for you. Not a real compelling reason to move.
All three solutions (four if you count flat-files) will give you blazing fast writes. The non-relational (nosql) solutions will give you tunable fault-tolerance as well for the purposes of disaster recovery.
In terms of scale, our test environment, with only three MongoDB nodes, can handle 2-3k mixed transactions per second. At 8 nodes, we can handle 12k-15k mixed transactions per second. Cassandra can scale even higher. 250 reads is (or should be) no problem.
The more important question is, what do you want to do with this data? Operational reporting? Time-series analysis? Ad-hoc pattern analysis? real-time reporting?
MongoDB is a good option if you want the ability to do ad-hoc analysis based on multiple attributes within a collection. You can put up to 40 indexes on a collection, though the indexes will be stored in-memory, so watch for size. But the result is a flexible analytical solution.
Cassandra is a key-value store. You define a static column or set of columns that will act as your primary index right up front. All queries run against Cassandra should be tuned to this index. You can put a secondary on it, but that's about as far as it goes. You can, of course, use MapReduce to scan the store for non-key attribution, but it will be just that: a serial scan through the store. Cassandra also doesn't have the notion of "like" or regex operations on the server nodes. If you want to find all customers where the first name starts with "Alex", you'll have to scan through the entire collection, pull the first name out for each entry and run it through a client-side regex.
I'm not familiar enough with Redis to speak intelligently about it. Sorry.
If you are evaluating non-relational platforms, you might also want to consider CouchDB and Riak.
Hope this helps.
Just found this: http://blog.axant.it/archives/236
Quoting the most interesting part:
This second graph is about Redis RPUSH vs Mongo $PUSH vs Mongo insert, and I find this graph to be really interesting. Up to 5000 entries mongodb $push is faster even when compared to Redis RPUSH, then it becames incredibly slow, probably the mongodb array type has linear insertion time and so it becomes slower and slower. mongodb might gain a bit of performances by exposing a constant time insertion list type, but even with the linear time array type (which can guarantee constant time look-up) it has its applications for small sets of data.
I guess everything depends at least on data type and volume. Best advice probably would be to benchmark on your typical dataset and see yourself.
According to the Benchmarking Top NoSQL Databases (download here)
I recommend Cassandra.
If you have the choice (and need to move away from flat fies) I would go with Redis. Its blazingly fast, will comfortably handle the load you're talking about, but more importantly you won't have to manage the flushing/IO code. I understand its pretty straight forward but less code to manage is better than more.
You will also get horizontal scaling options with Redis that you may not get with file based caching.
I can get around 30k inserts/sec with MongoDB on a simple $350 Dell. If you only need around 2k inserts/sec, I would stick with MongoDB and shard it for scalability. Maybe also look into doing something with Node.js or something similar to make things more asynchronous.
The problem with inserts into databases is that they usually require writing to a random block on disk for each insert. What you want is something that only writes to disk every 10 inserts or so, ideally to sequential blocks.
Flat files are good. Summary statistics (eg total hits per page) can be obtained from flat files in a scalable manner using merge-sorty map-reducy type algorithms. It's not too hard to roll your own.
SQLite now supports Write Ahead Logging, which may also provide adequate performance.
I have hand-on experience with mongodb, couchdb and cassandra. I converted a lot of files to base64 string and insert these string into nosql.
mongodb is the fastest. cassandra is slowest. couchdb is slow too.
I think mysql would be much faster than all of them, but I didn't try mysql for my test case yet.

Has anyone used an object database with a large amount of data?

Object databases like MongoDB and db4o are getting lots of publicity lately. Everyone that plays with them seems to love it. I'm guessing that they are dealing with about 640K of data in their sample apps.
Has anyone tried to use an object database with a large amount of data (say, 50GB or more)? Are you able to still execute complex queries against it (like from a search screen)? How does it compare to your usual relational database of choice?
I'm just curious. I want to take the object database plunge, but I need to know if it'll work on something more than a sample app.
Someone just went into production with a 12 terabytes of data in MongoDB. The largest I knew of before that was 1 TB. Lots of people are keeping really large amounts of data in Mongo.
It's important to remember that Mongo works a lot like a relational database: you need the right indexes to get good performance. You can use explain() on queries and contact the user list for help with this.
When I started db4o back in 2000 I didn't have huge databases in mind. The key goal was to store any complex object very simply with one line of code and to do that good and fast with low ressource consumption, so it can run embedded and on mobile devices.
Over time we had many users that used db4o for webapps and with quite large amounts of data, going close to todays maximum database file size of 256GB (with a configured block size of 127 bytes). So to answer your question: Yes, db4o will work with 50GB, but you shouldn't plan to use it for terabytes of data (unless you can nicely split your data over multiple db4o databases, the setup costs for a single database are negligible, you can just call #openFile() )
db4o was acquired by Versant in 2008, because it's capabilites (embedded, low ressource-consumption, lightweight) make it a great complimentary product to Versant's high-end object database VOD. VOD scales for huge amounts of data and it does so much better than relational databases. I think it will merely chuckle over 50GB.
MongoDB powers SourceForge, The New York Times, and several other large databases...
You should read the MongoDB use cases. People who are just playing with technology are often just looking at how does this work and are not at the point where they can understand the limitations. For the right sorts of datasets and access patterns 50GB is nothing for MongoDB running on the right hardware.
These non-relational systems look at the trade-offs which RDBMs made, and changed them a bit. Consistency is not as important as other things in some situations so these solutions let you trade that off for something else. The trade-off is still relatively minor ms or maybe secs in some situations.
It is worth reading about the CAP theorem too.
I was looking at moving the API I have for sure with the stack overflow iphone app I wrote a while back to MongoDB from where it currently sits in a MySQL database. In raw form the SO CC dump is in the multi-gigabyte range and the way I constructed the documents for MongoDB resulted in a 10G+ database. It is arguable that I didn't construct the documents well but I didn't want to spend a ton of time doing this.
One of the very first things you will run into if you start down this path is the lack of 32 bit support. Of course everything is moving to 64 bit now but just something to keep in mind. I don't think any of the major document databases support paging in 32 bit mode and that is understandable from a code complexity standpoint.
To test what I wanted to do I used a 64 bit instance EC2 node. The second thing I ran into is that even though this machine had 7G of memory when the physical memory was exhausted things went from fast to not so fast. I'm not sure I didn't have something set up incorrectly at this point because the non-support of 32 bit system killed what I wanted to use it for but I still wanted to see what it looked like. Loading the same data dump into MySQL takes about 2 minutes on a much less powerful box but the script I used to load the two database works differently so I can't make a good comparison. Running only a subset of the data into MongoDB was much faster as long as it resulted in a database that was less than 7G.
I think my take away from it was that large databases will work just fine but you may have to think about how the data is structured more than you would with a traditional database if you want to maintain the high performance. I see a lot of people using MongoDB for logging and I can imagine that a lot of those databases are massive but at the same time they may not be doing a lot of random access so that may mask what performance would look like for more traditional applications.
A recent resource that might be helpful is the visual guide to nosql systems. There are a decent number of choices outside of MongoDB. I have used Redis as well although not with as large of a database.
Here's some benchmarks on db4o:
http://www.db4o.com/about/productinformation/benchmarks/
I think it ultimately depends on a lot of factors, including the complexity of the data, but db4o seems to certainly hang with the best of them.
Perhaps worth a mention.
The European Space Agency's Planck mission is running on the Versant Object Database.
http://sci.esa.int/science-e/www/object/index.cfm?fobjectid=46951
It is a satelite with 74 onboard sensors launched last year which is mapping the infrarred spectrum of the universe and storing the information in a map segment model. It has been getting a ton of hype these days because of it's producing some of the coolest images ever seen of the universe.
Anyway, it has generated 25T of information stored in Versant and replicated across 3 continents. When the mission is complete next year, it will be a total of 50T
Probably also worth noting, object databases tend to be a lot smaller to hold the same information. It is because they are truly normalized, no data duplication for joins, no empty wasted column space and few indexes rather than 100's of them. You can find public information about testing ESA did to consider storage in multi-column relational database format -vs- using a proper object model and storing in the Versant object database. THey found they could save 75% disk space by using Versant.
Here is the implementation:
http://www.planck.fr/Piodoc/PIOlib_Overview_V1.0.pdf
Here they talk about 3T -vs- 12T found in the testing
http://newscenter.lbl.gov/feature-stories/2008/12/10/cosmic-data/
Also ... there are benchmarks which show Versant orders of magnitude faster on the analysis side of the mission.
CHeers,
-Robert