How does MongoDB stack up for very large data sets where only some of the data is volatile - mongodb

I'm working on a project where we periodically collect large quantities of e-mail via IMAP or POP, perform analysis on it (such as clustering into conversations, extracting important sentences etc.), and then present views via the web to the end user.
The main view will be a Facebook-like profile page for each contact, showing the most recent (20 or so) conversations that each of them has had in the e-mail we capture.
For us, it's important to be able to retrieve the profile page and recent 20 items frequently and quickly. We may also be frequently inserting recent e-mails into this feed. For this, document storage and MongoDB's low-cost atomic writes seem pretty attractive.
However we'll also have a LARGE volume of old e-mail conversations that won't be frequently accessed (since they won't appear in the most recent 20 items, folks will only see them if they search for them, which will be relatively rare). Furthermore, the size of this data will grow more quickly than the contact store over time.
From what I've read, MongoDB seems to more or less require the entire data set to remain in RAM, and the only way to work around this is to use virtual memory, which can carry a significant overhead. This could end up being quite nasty, particularly if Mongo isn't able to differentiate between the volatile data (profiles/feeds) and the non-volatile data (old e-mails). Since it seems to delegate virtual memory allocation to the OS, I don't see how Mongo could do this.
It would seem that the only choices are to either (a) buy enough RAM to store everything, which is fine for the volatile data, but hardly cost efficient for capturing TB of e-mails, or (b) use virtual memory and see reads/writes on our volatile data slow to a crawl.
Is this correct, or am I missing something? Would MongoDB be a good fit for this particular problem? If so, what would the configuration look like?

MongoDB does not "require the entire data set to remain in RAM". See http://www.mongodb.org/display/DOCS/Caching for an explanation as to why/how it uses virtual memory the way it does.
It would be fine for this application. If your sorting and filtering were more complex you might, for example, want to use a Map-Reduce operation to create a collection that's "display ready" but for a simple date ordered set the existing indexes will work just fine.

MongoDB uses mmap to map documents into virtual memory (not physical RAM). Mongo does not require the entire dataset to be in RAM but you will want your 'working set' in memory (working set should be a subset of your entire dataset).
If you want to avoid mapping large amounts of email into virtual memory you could have your profile document include an array of ObjectIds that refer to the emails stored in a separate collection.
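A minimal sketch of that layout, using plain Python dicts as stand-ins for the two collections. The field name `recent_email_ids` and the helpers `record_email` and `recent_emails` are hypothetical, chosen only for illustration. In MongoDB itself the capped array could be maintained with a `$push` update combined with `$each` and `$slice`, but that is not shown here.

```python
from itertools import count

RECENT_LIMIT = 20  # the "20 or so" conversations shown on the profile page

# Stand-ins for two collections. In a real deployment these would be
# MongoDB collections and the ids would be ObjectIds.
profiles = {}  # contact_id -> small, frequently read profile document
emails = {}    # email_id -> full e-mail document (large, rarely read)
_next_id = count(1)


def record_email(contact_id, subject, body):
    """Store the e-mail separately and keep only its id on the profile."""
    email_id = next(_next_id)
    emails[email_id] = {"_id": email_id, "contact_id": contact_id,
                        "subject": subject, "body": body}
    profile = profiles.setdefault(
        contact_id, {"_id": contact_id, "recent_email_ids": []})
    profile["recent_email_ids"].append(email_id)
    # Keep only the newest RECENT_LIMIT references on the hot document.
    del profile["recent_email_ids"][:-RECENT_LIMIT]
    return email_id


def recent_emails(contact_id):
    """Resolve the profile's references, newest first."""
    ids = profiles[contact_id]["recent_email_ids"]
    return [emails[i] for i in reversed(ids)]
```

The profile document stays small no matter how much old mail accumulates, so only the profiles and the e-mails they reference need to be part of the working set.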

#Andrew J
Typically you need enough RAM to hold your working set; this is as true for MongoDB as it is for an RDBMS. So if you want to hold the last 20 emails for all users without going to disk, then you need that much memory. If this exceeds the memory on a single system, then you can use MongoDB's sharding feature to spread data across multiple machines, thereby aggregating the memory, CPU and IO bandwidth of the machines in the cluster.
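As a rough back-of-the-envelope sketch of that sizing: the contact count, average e-mail size and per-node RAM below are made-up figures for illustration, not numbers from the question, and index size is ignored entirely.

```python
import math

GIB = 1024 ** 3


def working_set_bytes(contacts, items_per_contact, avg_item_bytes):
    """Bytes needed to keep the recent items of every contact in RAM."""
    return contacts * items_per_contact * avg_item_bytes


def nodes_needed(working_set, ram_per_node_bytes):
    """Smallest number of shards whose combined RAM covers the working set."""
    return max(1, math.ceil(working_set / ram_per_node_bytes))


# Assumed figures: 1 million contacts, 20 items each, 4 KiB per item.
ws = working_set_bytes(contacts=1_000_000, items_per_contact=20,
                       avg_item_bytes=4 * 1024)
print(f"working set: {ws / GIB:.1f} GiB")
print("nodes with 32 GiB each:", nodes_needed(ws, 32 * GIB))
```

The old, rarely read e-mails do not enter this calculation at all, which is the point: disk has to hold everything, RAM only the hot part.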
#mP
MongoDB allows you, as the application developer, to specify the durability of your writes, from a single node in memory to multiple nodes on disk. The choice is yours, depending on what your needs are and how critical the data is; not all data is created equal. In addition, in MongoDB 1.8 you can specify --dur, which writes a journal file for all the writes. This further improves the durability of writes and speeds up recovery if there is a crash.

And what happens to all the stuff Mongo had in memory if your computer crashes? I'm guessing that it has no logs, so the answer is probably bad luck.

Related

understand MongoDB cache system

This is a basic question, but a very important one, and I am not sure I really get the point.
In the official documentation we can read:
MongoDB keeps all of the most recently used data in RAM. If you have created indexes for your queries and your working data set fits in RAM, MongoDB serves all queries from memory.
The part I am not sure I understand is:
If you have created indexes for your queries and your working data set fits in RAM
What does "indexes" mean here?
For example, if I update a model and then query it, it is now in RAM because I have just updated it, so it will come from memory. But this is not very clear in my mind.
How can we be sure whether the data we query will come from memory or not? I understand that MongoDB uses whatever memory is free at the moment to cache data, but could someone explain the overall behavior further?
In which cases would it be better to store data in a variable in our Node server than to trust the MongoDB cache system?
How do you advise using MongoDB in general for huge traffic?
Note: This was written back in 2013, when MongoDB was still quite young and didn't have the features it does today. While this answer still holds true for mmap, it does not for the other storage technologies MongoDB now implements, such as WiredTiger or Percona.
A good place to start to understand exactly what an index is: http://docs.mongodb.org/manual/core/indexes/
After you have brushed up on that you will understand why they are so good. Skipping forward, however, to some of the more intricate questions:
How can we be sure whether the data we query will come from memory or not?
One way is to look at the yields field on any query explain(). This will tell you how many times the reader yielded its lock because data was not in RAM.
Another, more in-depth way is to use tools such as mongostat. These will tell you which page faults (when data needs to be paged into RAM from disk) are happening on your mongod.
I understand that MongoDB uses whatever memory is free at the moment to cache data, but could someone explain the overall behavior further?
This is actually incorrect. It is easier to just say that MongoDB does this, but in reality it does not. It is in fact the OS and its own paging algorithms, usually LRU, that do this for MongoDB. MongoDB does cache query plans for a certain period of time, though, so that it doesn't have to keep testing which index to use.
In which cases would it be better to store data in a variable in our Node server than to trust the MongoDB cache system?
Not sure how you expect that to work... The two do quite different things, and if you intend to read your data from MongoDB into that variable on application startup, then I definitely would not recommend it.
Besides, OS algorithms for memory management are extremely mature and fast, so relying on them is fine.
How do you advise using MongoDB in general for huge traffic?
Hmm, this is such a huge question. I would really recommend you Google the subject a little, but for one thing, as the documentation states, you need to ensure your working set fits into RAM.
Here is a good starting point: What does it mean to fit "working set" into RAM for MongoDB?
MongoDB attempts to keep entire collections in memory: it memory-maps each collection page. For everything to be in memory, both the data pages, and the indices that reference them, must be kept in memory.
If MongoDB returns a record, you can rest assured that it is now in memory (whether it was before your query or not).
MongoDB doesn't keep a "cache" of records in the same way that, say, a web browser does. When you commit a change, both the memory and the disk are updated.
Mongo is great when matched to the appropriate use cases. It is very high performance if you have sufficient server memory to cache everything, and declines rapidly past that point. Many, many high-volume websites use MongoDB: it's a good thing that memory is so cheap, now.

Is MongoDB usable as shared memory for a parallel processing / multiple-instances application?

I'm planning a product that will process updates from multiple data feeds. Input data is guesstimated to be a 100 Mbps stream in total, made up of 100-byte messages. These messages contain several data fields that need to be checked for correlation with the existing data set within the application. If an input message correlates with an existing data record, it will update that record; if not, it will create a new record. It is assumed that data are updated every 3 seconds on average.
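To put those figures in perspective, here is a small worked calculation. It assumes the link is fully used by message payload with no framing overhead, and it reads "updated every 3 seconds" as "each live record receives one message per 3 seconds"; both are interpretations, not statements from the question.

```python
def messages_per_second(link_bits_per_second, message_bytes):
    """Message rate if the whole link carries fixed-size messages."""
    return link_bits_per_second / 8 / message_bytes


def implied_live_records(message_rate, seconds_between_updates):
    """Records that can each be refreshed once per update interval."""
    return message_rate * seconds_between_updates


rate = messages_per_second(100_000_000, 100)  # 100 Mbps, 100-byte messages
records = implied_live_records(rate, 3)
print(f"{rate:,.0f} messages/s, about {records:,.0f} live records")
```

Under these assumptions the store has to absorb on the order of a hundred thousand correlate-then-upsert operations per second, which is the number to keep in mind when reading the answers about locking and sharding.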
The correlation process is assumed to be a bottleneck, and thus I intend to make our product able to run balanced in multiple processes if needed (most likely on a separate hardware or VM). Somewhat in the vicinity of Space-based architecture. I'd then like a shared storage between my processes so that all existing data records are visible to all the running processes. The shared storage will have to fetch possible candidates for correlation through a query/search based on some attributes (e.g. elevation). It will have to offer configuring warm redundancy, and a possibility to store snapshots every 5 minutes for logging.
Everything seems to be pointing towards MongoDB, but I'd like a confirmation from you that MongoDB will meet my needs. So do you think it is a go?
-Thank you
NB: I am not considering a relational database because we want to focus all coding in our application, instead of having to make 'stored procedures'/'functions' in a separate environment to optimize the performance of our system. Further, the data is diverse and I don't want to try normalize it into a schema.
Yes, MongoDB will meet your needs. I think the following aspects of your description are particularly relevant in your DB selection decision:
1. An update happens every 3 seconds
MongoDB has a database-level write lock (usually short-lived) that blocks read operations. This means that you will want to ensure you have enough memory to fit your working set; if you do, you will generally not run into any write-lock issues. Note that bulk inserts will hold the write lock for longer.
If you are sharding, you will want to consider shard keys that allow for write scaling i.e. distribute writes on different shards.
2. Shared storage for multiple processes
This is a pretty common scenario; in fact, many MongoDB deployments are expected to be accessed from multiple processes concurrently. Unlike the write lock, the read lock does not block other reads.
3. Warm redundancy
Supported through MongoDB replication. If you'd like to read from secondary server(s) you will need to set the Read Preference to secondaryPreferred in your driver.

messaging service: redis or mongodb?

I am working on a messaging system that is a bit more advanced than simply sending and receiving messages; it is something that looks like Facebook chat/messaging: it has chat aspects but also messaging ones, like group messages, read/unread messages, and so on.
On redis, I would simply use lists to store received messages, for example like this:
myID = [ "amy|how are you?", "frank|long time no see!" ]
amyID = [ "john|I'm good! you?" ]
(I have simplified it all a lot for easier reading.)
But this way I would not be able to keep track of single conversations, as they will always be flushed once the messages are received (so basically no "inbox" feature).
On the other hand, if I use mongodb, I could use something like this: How to keep track of a private messaging system using MongoDB?
I thought of the following benefits/disadvantages:
MONGODB
advantages:
can see inbox view
can check read/unread messages on each conversation
disadvantages
not as fast as redis
storage size increases a lot
REDIS
advantages:
easy to pick up new messages
no storage problems (messages are flushed)
disadvantages:
once messages are sent to the client they are lost, so no read/unread features and no inbox
Any ideas?
Thanks in advance.
I cannot answer for Redis because I don't use it and never have so I won't pretend I have.
However, if for some reason you are not using something like an XMPP client for chat, as Facebook does (http://www.ibm.com/developerworks/xml/tutorials/x-realtimeXMPPtut/section3.html, aka Jabber), then I will describe a pure MongoDB solution for this situation.
MongoDB relies on the OS's LRU paging to cache documents and queries. Fair enough, it provides no direct query cache, but if you are smart you will not need one; instead you just serve all your queries directly from RAM. With this in mind, MongoDB can be just as fast as Redis, since Redis uses the computer's RAM too.
The speed difference between the two on an optimised query is negligible, I would think. The true measure of speed comes from your schema, indexes, cluster setup and the queries you perform.
A note about storage size here, taking your comment into consideration:
the problem with flushing mongodb is bigger than I initially thought: apparently when you delete something on mongo you only delete its reference, so if you delete 4mb of documents, it won't free up that much space. the only way to actually free up that memory is to run a dbRepair (or something along those lines) that basically blocks the db while running....
You seem to have some misconceptions about exactly how MongoDB works.
This link will be of help to you: http://www.10gen.com/presentations/storage-engine-internals it will describe some of the reasons why excessive disk space is used and will also explain some of the misconceptions you have about how a computer works and how MongoDB frees space and reuses it.
MongoDB does not free space on a record level. Instead it takes that "empty" record (record and document are two different things, as the presentation will tell you), puts it on a deleted bucket list and then reuses that space when a new document (or an updated document that has been moved) comes along and fits in it.
It is true that if you are not careful and do not understand how MongoDB works at this level, you will probably be forced to run repairDatabase fairly regularly to keep any sort of performance after fragmentation.
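The reuse behaviour described above can be pictured with a toy model. This is only a sketch of the general free-list idea, not MongoDB's actual allocator: the class `RecordStore`, its first-fit policy and the single flat free list are all assumptions made for illustration.

```python
class RecordStore:
    """Toy model of record-level space reuse (not MongoDB's real code)."""

    def __init__(self):
        self.file_size = 0   # bytes allocated in the data file
        self.records = {}    # doc_id -> (offset, record_size)
        self.free = []       # deleted records: (offset, record_size)

    def insert(self, doc_id, size):
        # First fit: reuse the first deleted record that is large enough.
        for i, (offset, free_size) in enumerate(self.free):
            if free_size >= size:
                del self.free[i]
                self.records[doc_id] = (offset, free_size)
                return offset
        # Nothing fits, so the data file has to grow.
        offset = self.file_size
        self.file_size += size
        self.records[doc_id] = (offset, size)
        return offset

    def delete(self, doc_id):
        # Space is not handed back to the OS, only marked as reusable.
        self.free.append(self.records.pop(doc_id))


store = RecordStore()
first = store.insert("a", 100)
store.insert("b", 200)
store.delete("a")                   # file size stays at 300 bytes
reused = store.insert("c", 80)      # lands in the slot "a" left behind
print(store.file_size, reused == first)
```

Deleting never shrinks the file in this model, and a new document that is larger than every free slot still forces growth, which is where fragmentation comes from.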
As for memory handling, the OS takes care of this, as I said. A good explanation of when the OS will free memory is on Wikipedia: http://en.wikipedia.org/wiki/Paging
Until there is not enough RAM to store all the data needed, the process of obtaining an empty page frame does not involve removing another page from RAM.
So the OS will handle removing pages for you and you shouldn't concern yourself with that part, instead you should be concerned with making your working set fit into RAM.
If you are worried about storing messages and don't really want to, i.e. you want them to be "flushed", you can use the TTL feature that comes with later MongoDB versions: http://docs.mongodb.org/manual/tutorial/expire-data/ which will basically allow you to set a time-out for when a message should be deleted from the collection.
So personally I think that, if set up right, MongoDB could do messaging and chat the way Facebook does. Of course, they use the XMPP protocol and then archive messages into Cassandra for search, but you don't have to do it like they do; that is just one way to achieve the same goal.
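The effect of such a time-out can be sketched in plain Python. In MongoDB the TTL index option is called `expireAfterSeconds` and a background task removes expired documents periodically, so removal is not instantaneous. The function `purge_expired` and the `createdAt` field name below are hypothetical and only imitate that behaviour on an in-memory list.

```python
from datetime import datetime, timedelta, timezone


def purge_expired(messages, expire_after_seconds, now=None):
    """Return only the messages younger than the time-out."""
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=expire_after_seconds)
    return [m for m in messages if m["createdAt"] > cutoff]


now = datetime(2013, 1, 1, 12, 0, tzinfo=timezone.utc)
inbox = [
    {"from": "amy", "text": "how are you?",
     "createdAt": now - timedelta(hours=2)},
    {"from": "frank", "text": "long time no see!",
     "createdAt": now - timedelta(minutes=5)},
]
# With a one-hour time-out only the five-minute-old message survives.
inbox = purge_expired(inbox, expire_after_seconds=3600, now=now)
print([m["from"] for m in inbox])
```

Choosing the time-out per collection gives a middle ground between Redis-style "flush on delivery" and keeping every message forever.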
Hope this makes sense and I haven't gone round in circles, it is a bit of a long answer.
I think the big point here is the storage problem. You would need a lot of machines, or a good system for flushing some conversations, to use MongoDB. Despite wanting a sort of "inbox" system... I think Redis would be more conducive to a well-working chat system - you just need to come up with some very creative workaround... or give up that design goal.
We use a mixed design: when we need snappy performance, as in messages, queues and caches, it's on Redis, and when we need to search on secondary indexes or update whole documents, we use MongoDB.
You can also try Riak, which can grow more linearly and smoothly than MongoDB.

mongoDB vs relational databases when data can't fit into memory?

First of all, I apologize for my potentially shallow understanding of NoSQL architecture (and databases in general) so try to bear with me.
I'm thinking of using mongoDB to store resources associated with an UUID. The resources can be things such as large image files (tens of megabytes) so it makes sense to store them as files and store just links in my database along with the associated metadata. There's also the added flexibility of decoupling the actual location of the resource files, so I can use a different third party to store the files if I need to.
Now, one document which describes a resource would be about 1 kB. At first I expect a couple of hundred thousand resource documents, which would equal some hundreds of megabytes in database size, easily fitting into server memory. But in the future I might have to scale this to the order of tens of MILLIONS of documents. This would be tens of gigabytes, which I can't squeeze into server memory anymore.
Only the index could still fit in memory being around a gigabyte or two. But if I understand correctly, I'd have to read from disk every time I did a lookup on an UUID. Is there a substantial speed benefit from mongoDB over a traditional relational database in such a situation?
BONUS QUESTION: is there an existing, established way of doing what I'm trying to achieve? :)
MongoDB doesn't suddenly become slow the second the entire database no longer fits into physical memory. MongoDB currently uses a storage engine based on memory-mapped files. This means data that is accessed often will usually be in memory (OS managed, but assume an LRU scheme or something similar).
As such, it may not slow down at all at that point, or only slightly; it really depends on your data access patterns. It is a similar story with indexes: if you (right-)balance your index appropriately, and if your use case allows it, you can have a huge index with only a fraction of it in physical memory and still get very decent performance, with the majority of index hits happening in physical memory.
Because you're talking about UUIDs, this might all be a bit hard to achieve, since there's no guarantee that the same limited group of users is generating the vast majority of throughput. In those cases sharding really is the most appropriate way to maintain quality of service.
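The dependence on access patterns can be illustrated with a small simulation. It models the OS page cache as a strict LRU over fixed-size pages, which is a simplification (real kernels use approximations of LRU), and the page counts and hot/cold split are invented for the example.

```python
import random
from collections import OrderedDict


def hit_rate(num_pages, cache_pages, hot_pages, hot_fraction,
             accesses, seed=0):
    """Fraction of page accesses served from a simulated LRU cache.

    A share `hot_fraction` of accesses goes to the first `hot_pages`
    pages; the rest is spread evenly over the remaining cold pages.
    """
    rng = random.Random(seed)
    cache = OrderedDict()
    hits = 0
    for _ in range(accesses):
        if rng.random() < hot_fraction:
            page = rng.randrange(hot_pages)
        else:
            page = rng.randrange(hot_pages, num_pages)
        if page in cache:
            hits += 1
            cache.move_to_end(page)        # mark as most recently used
        else:
            cache[page] = True             # simulated page fault
            if len(cache) > cache_pages:
                cache.popitem(last=False)  # evict least recently used
    return hits / accesses


# The cache holds only 10% of the data set in both runs.
skewed = hit_rate(10_000, 1_000, hot_pages=500, hot_fraction=0.95,
                  accesses=50_000)
uniform = hit_rate(10_000, 1_000, hot_pages=500, hot_fraction=0.05,
                   accesses=50_000)
print(f"skewed: {skewed:.2f}  uniform: {uniform:.2f}")
```

With the same amount of RAM, the skewed workload is served almost entirely from memory while the uniform one mostly faults, which is why randomly distributed UUID lookups are the hard case.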
This would be tens of gigabytes which I can't squeeze into server memory anymore.
That's why MongoDB gives you sharding to partition your data across multiple mongod instances (or replica sets).
In addition to considering sharding, or maybe even before, you should also try to use covered indexes as much as possible, especially if they fit your use cases.
This way you do not HAVE to load entire documents into memory. Your indexes can help out.
http://www.mongodb.org/display/DOCS/Retrieving+a+Subset+of+Fields#RetrievingaSubsetofFields-CoveredIndexes
If you have to display your entire document all the time based on the id, then the general rule of thumb is to attempt to keep the working set in memory.
http://blog.boxedice.com/2010/12/13/mongodb-monitoring-keep-in-it-ram/
This is one of the resources that talks about that. There is a video on mongodb's site too that speaks about this.
If you size the RAM so that the working set is in memory, and keep sharding in mind, you do not have to shard right away; you can always add sharding later. This will improve the scalability of your app over time.
Again, these are not absolute statements; they are general guidelines, and you should think through your usage patterns and make sure that the guidelines are relevant to what you are doing.
Personally, I have not had the need to fit everything in ram.

Main Memory DB vs Object DB

I'm currently trying to pick a database vendor.
I'm just seeking some personal opinions from fellow database developers out there.
My question is especially targeted towards people who:
1) have used Main Memory DB (MMDB) that supports replicating to disk (hybrid) before (i.e. ExtremeDB)
or
2) have used Versant Object Database and/or Objectivity Database and/or Progress ObjectStore
and the question is really: if you could recommend a database vendor, based on your experience, that would suit my application.
My application is a commercial real-time (read: high-performance) object-oriented C++ GIS kind of app, where we need to do a lot of lat/lon search (i.e. given an area, find all matching targets within the area...R-Tree index).
The types of data that I would like to store in the database are all modeled as objects, and they make use of std::list and std::vector, so naturally an object database seems to make sense. I have read through enough articles to convince myself that a traditional RDBMS probably isn't what I'm really looking for in terms of
performance (joins or multiple tables for dynamic-length data like list/vector)
ease of programming (impedance mismatch)
However, in terms of performance,
Input data is being fed into the system at about 40 MB/s.
Hence, the system will also be inserting into the database at a rate of roughly 350 inserts per second (where each object varies from 64KB to 128KB),
Database will consistently be searched and updated via multiple threads.
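As a quick consistency check on those figures, the sustained write volume implied by 350 inserts per second can be worked out directly. The sketch assumes KB means 1024 bytes and ignores any index or log overhead the database adds on top.

```python
KIB = 1024
MIB = 1024 * KIB


def write_rate_mib_per_s(inserts_per_second, object_bytes):
    """Raw payload written per second, in MiB."""
    return inserts_per_second * object_bytes / MIB


low = write_rate_mib_per_s(350, 64 * KIB)    # every object at 64 KB
high = write_rate_mib_per_s(350, 128 * KIB)  # every object at 128 KB
print(f"{low:.3f} to {high:.3f} MiB/s of payload")
```

The upper bound sits close to the stated 40 MB/s input rate, so the two numbers agree if most objects are near the 128 KB end of the range. Any candidate database has to sustain that as a continuous write load while also serving concurrent searches.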
From my understanding, all of the Object DBs I have listed here use cache for storing database objects. ExtremeDB claims that since it's designed especially for memory, it can avoid overhead of caching logic, etc. See more by googling: Main Memory vs. RAM-Disk Databases: A Linux-based Benchmark
So... I'm just a bit confused. Can Object DBs be used in a real-time system? Are they as "fast" as an MMDB?
Fundamentally, the difference between an MMDB and an OODB is that the MMDB has the expectation that all of its data is based in RAM but persisted to disk at some point, whereas an OODB is more conventional in that there's no expectation of the entire DB fitting into RAM.
The MMDB can leverage this by giving up on the concept that the persisted data necessarily has to "match" the in-RAM data.
The way anything with persistence is going to work, is that it has to write the data to disk on update in some fashion.
Almost all DBs use some kind of log for this. These logs are basically "raw" pages of data, or perhaps individual transactions, appended to a file. When the file gets "too big", a new file is started.
Once the logs are properly consolidated in to the main store, the logs are discarded (or reused).
Now, a crude, in RAM DB can exist simply by appending transactions to a log file, and when it's restarted, it just loads the log in to RAM. So, in essence, the log file IS the database.
The downside of this technique is the longer and more transactions you have, the bigger your log/DB is, and thus the longer the DB startup time. But, ideally, you can also "snapshot" the current state, which eliminates all of the logs up to date, and effectively compresses them.
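The "log file IS the database" idea, including the snapshot step, can be sketched in a few lines. This is a deliberately crude illustration with a hypothetical class `LogBackedStore` and a JSON-lines log format of my own choosing; it is not how ExtremeDB or any particular product works.

```python
import json
import os


class LogBackedStore:
    """In-RAM key/value store whose only persistence is an append-only log."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        if os.path.exists(path):
            # Startup cost grows with the length of the log being replayed.
            with open(path, encoding="utf-8") as f:
                for line in f:
                    self._apply(json.loads(line))
        self._log = open(path, "a", encoding="utf-8")

    def _apply(self, entry):
        if entry["op"] == "set":
            self.data[entry["key"]] = entry["value"]
        else:
            self.data.pop(entry["key"], None)

    def _append(self, entry):
        # Routine operations only ever append; no other pages are updated.
        self._log.write(json.dumps(entry) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())
        self._apply(entry)

    def set(self, key, value):
        self._append({"op": "set", "key": key, "value": value})

    def delete(self, key):
        self._append({"op": "del", "key": key})

    def snapshot(self):
        """Rewrite the log as one 'set' per live key, dropping history."""
        tmp = self.path + ".tmp"
        with open(tmp, "w", encoding="utf-8") as f:
            for key, value in self.data.items():
                entry = {"op": "set", "key": key, "value": value}
                f.write(json.dumps(entry) + "\n")
            f.flush()
            os.fsync(f.fileno())
        self._log.close()
        os.replace(tmp, self.path)
        self._log = open(self.path, "a", encoding="utf-8")

    def close(self):
        self._log.close()
```

Reopening the store on the same path replays the log and rebuilds the dictionary; calling `snapshot()` during a quiet period shortens that replay without changing what the store contains.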
In this manner, all the routine operations of the DB have to manage is appending pages to logs, rather than updating other disk pages, index pages, etc. Since, ideally, most systems don't need to "Start up" that often, perhaps start up time is less of an issue.
So, in this way, an MMDB can be faster than an OODB, which has a different contract with the disk, maintaining both logs and disk pages. An OODB can therefore be slower even if the entire DB fits into RAM and is properly cached, simply because it incurs disk operations outside of the log operations during normal operation, whereas in an MMDB these operations happen as a "maintenance" task, which can be scheduled during down time and/or quiet time.
As to whether either of these systems can meet your actual performance needs, I can't say.
The back ends of databases (reader and writer processes, caching, lock managing, txn log files, ACID semantics) are the same, so RDBs and OODBs are actually very similar here. The difference is the interface to the application programmer. Is your data model complicated, consisting of lots of classes with real inheritance relationships? Then OO is good. Is it relatively flat and simple? Then go RDB. What is the nature of the relationships? Is it pointer-like and set-like? Then go RDB. Is it more complicated, like (ordered) list, array, map? Then you should go OO. Also, do you have a stand-alone application with no need to integrate with other apps? Then OO is ok. Do you have to share data with other apps (i.e. several apps access the same database)? Then that's a deal-breaker for OO, and you should stick with RDB. Is the schema of your database stable or do you expect it to evolve frequently? OODBs are bad at schema evolution, so if you expect frequent changes, stick with RDBs.