Storing very large documents in MongoDB

In short: If you have a large number of documents with varying sizes, where relatively few documents hit the maximum object size, what are the best practices to store those documents in MongoDB?
I have a set of documents like:
{_id: ...,
values: [12, 13, 434, 5555 ...]
}
The length of the values list varies hugely from one document to another. For the majority of documents it will have a few elements; for a few it will have tens of millions of elements, and I will hit the maximum object size limit in MongoDB. The trouble is that any special solution I come up with for those very large (and relatively few) documents might have an impact on how I store the small documents, which would otherwise live happily in a MongoDB collection.
As far as I see, I have the following options. I would appreciate any input on pros and cons of those, and any other option that I missed.
1) Use another datastore: That seems too drastic. I like MongoDB, and it's not like I hit the size limit for many objects. In the worst case, my application could treat the very large objects and the rest differently. It just doesn't seem elegant.
2) Use GridFS to store the values: Like a blob in a traditional DB, I could keep the first few thousand elements of values in the document, and if there are more elements in the list, I could keep the rest in a GridFS object as a binary file. I wouldn't be able to search in this part, but I can live with that.
3) Abuse GridFS: I could keep every document in gridFS. For the majority of the (small) documents the binary chunk would be empty because the files collection would be able to keep everything. For the rest I could keep the excess elements in the chunks collection. Does that introduce an overhead compared to option #2?
4) Really abuse GridFS: I could use the optional fields in the files collection of GridFS to store all elements in the values. Does GridFS do smart chunking also for the files collection?
5) Use an additional "relational" collection to store the one-to-many relation (a rough sketch of this layout follows below), but the number of documents in that collection would easily exceed a hundred billion.
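
For concreteness, here is a minimal sketch of what option 5 might look like with pymongo; the collection and field names (values, parent_id, idx) are purely illustrative, not a prescribed design:

from pymongo import MongoClient, ASCENDING

client = MongoClient()
db = client["mydb"]

# Compound index so lookups by parent (and by position) stay cheap even with
# a very large number of element documents.
db.values.create_index([("parent_id", ASCENDING), ("idx", ASCENDING)])

def append_values(parent_id, start_idx, batch):
    # One small document per element of the original "values" array.
    docs = [{"parent_id": parent_id, "idx": start_idx + i, "value": v}
            for i, v in enumerate(batch)]
    db.values.insert_many(docs)

def read_values(parent_id):
    # Stream the elements of one parent back in their original order.
    cursor = db.values.find({"parent_id": parent_id}).sort("idx", ASCENDING)
    return [d["value"] for d in cursor]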

If you have large documents, try to store some metadata about them in MongoDB, and put the rest of the data (the part you will not be querying on) outside.
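
A rough sketch of that split (essentially option 2 above), assuming pymongo and GridFS; the inline limit, collection and field names, and packing format are arbitrary choices for illustration:

import struct
import gridfs
from pymongo import MongoClient

client = MongoClient()
db = client["mydb"]
fs = gridfs.GridFS(db)

INLINE_LIMIT = 1000  # elements kept inside the document itself

def store(doc_id, values):
    head, tail = values[:INLINE_LIMIT], values[INLINE_LIMIT:]
    doc = {"_id": doc_id, "values": head, "overflow": None}
    if tail:
        # Pack the overflow as little-endian 64-bit ints; not searchable, but compact.
        blob = struct.pack("<%dq" % len(tail), *tail)
        doc["overflow"] = fs.put(blob, filename=str(doc_id))
    db.docs.replace_one({"_id": doc_id}, doc, upsert=True)

def load(doc_id):
    doc = db.docs.find_one({"_id": doc_id})
    values = list(doc["values"])
    if doc.get("overflow") is not None:
        blob = fs.get(doc["overflow"]).read()
        values.extend(struct.unpack("<%dq" % (len(blob) // 8), blob))
    return values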

Related

Database for filtering XML documents

I would like to hear some suggestions on implementing a database solution for the problem below.
1) There are 100 million XML documents saved to the database per day.
2) The database holds a maximum of 3 days of data.
3) There are 1 million query requests per day.
4) The values by which the documents are filtered are stored in a separate table and mapped to the corresponding XML document ID.
5) The documents are requested by date range, by a list of IDs, as the top 10 newest documents, and as records that are new since the previous request.
Here is what I have done so far
1) Checked whether I could use Redis. It is limited to a few data types and also cannot use multiple where-conditions to filter a hash in Redis. Indexing is based on date and lots of other fields, and I am unable to choose the right data structure to store this in a hash.
2) Investigated DynamoDB. It is again a key-value store where all the filter conditions would have to be stored as one value. I am not sure it would be efficient to query a JSON document to filter out the right XML document.
3) Investigated Cassandra, which looks like it may fit my requirements, but with the limitation that read operations might be slow. Cassandra has the advantage of faster write operations over changing data. This looks like the best possible solution found so far.
Currently we are using SQL Server and there is a performance problem, so we are looking for a better solution.
Please suggest, thanks.
It's not that reads in Cassandra might be slow, but rather that it's hard to guarantee an SLA for reads (usually they will be fast, but some of them will be slow).
Cassandra doesn't have the search capabilities which you may need in the future (ordering, searching by many fields, ranked searching). You can probably achieve that with Cassandra, but with an obviously greater amount of effort than with a database built for search operations.
I suggest looking at Lucene/Elasticsearch. Let me quote the features of Lucene from their main website:
Scalable
High-Performance Indexing
over 150GB/hour on modern hardware
small RAM requirements -- only 1MB heap
incremental indexing as fast as batch indexing
index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
ranked searching -- best results returned first
many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g. title, author, contents)
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
flexible faceting, highlighting, joins and result grouping
fast, memory-efficient and typo-tolerant suggesters
pluggable ranking models, including the Vector Space Model and Okapi BM25
configurable storage engine (codecs)
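
For illustration, a hedged sketch of that approach using the Elasticsearch 8.x Python client; the index name, field names, and layout are assumptions, and the raw XML itself would live outside the search index (e.g. in object storage), with only the filterable metadata indexed:

from datetime import datetime, timezone
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def index_doc(doc_id, xml_location, tags):
    # One small metadata record per XML document.
    es.index(index="xml-docs", id=doc_id, document={
        "ingested_at": datetime.now(timezone.utc),
        "xml_location": xml_location,  # pointer to the raw XML, not the XML itself
        "tags": tags,
    })

def newest_matching(id_list, since, limit=10):
    # Date-range + ID-list filter, newest first (covers the query patterns above).
    return es.search(
        index="xml-docs",
        query={"bool": {"filter": [
            {"ids": {"values": id_list}},
            {"range": {"ingested_at": {"gte": since}}},
        ]}},
        sort=[{"ingested_at": "desc"}],
        size=limit,
    )["hits"]["hits"]

Dropping data older than 3 days then becomes a matter of deleting (or rolling over) daily indices rather than running large delete queries.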

Storing user activity in embedded array versus separate collection?

Say I want to mirror a social media site's news feed by storing it in a Mongo collection, and then periodically sync it to fetch updates.
Multiple users will then be able to interact with this feed at a time (both reads and writes).
Also, let's assume that I initially will be storing between 500 and 1000 entries, but that I might consider increasing this later on.
My question is: would I be better off storing these activities in an embedded array or in a separate collection?
As I understand it, storing them in an embedded array will allow for quick access, but can quickly hurt performance due to memory allocation.
On the other hand, storing each entry as its own document means I'll have to go fetch every single one of them, which will slow down read performance.
Any suggestion as to what might fit my use case best is much appreciated.
Thanks
Use a collection. Queries return matching documents, not matching array elements, so the things you are searching for should logically be your collection documents. You can reshape a document to contain just the first matching array element when a query matches against elements of an array, but not, e.g. the first 4 matching elements. You would need to use aggregation for simple queries, which would hurt performance.
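
To make that concrete, a small sketch of the separate-collection layout with pymongo (collection and field names are illustrative):

from pymongo import MongoClient, ASCENDING, DESCENDING

db = MongoClient()["social"]
db.activities.create_index([("feed_id", ASCENDING), ("created_at", DESCENDING)])

def latest(feed_id, limit=20):
    # A plain indexed query returns exactly the matching activity documents;
    # with an embedded array you would need projection tricks or an aggregation
    # pipeline to get "the first N matching elements".
    return list(db.activities
                .find({"feed_id": feed_id})
                .sort("created_at", DESCENDING)
                .limit(limit))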

Best way to organize subdocuments in Mongo?

I have a project that I'm working on that will require me to store a large number of objects in an array linked to a parent object, akin to storing social media comments under their original post. What is the best way for me to organize the data for the array of child documents/comments?
Is it considered best practice to have the child objects under a different collection and reference to their parent or would it be more ideal just to put them all within the parent object directly?
I discuss this a little here, read this first:
https://stackoverflow.com/a/27285313/68567
For your case, Option 3 (keeping some of the data in your primary model) is probably the best. The key is to avoid unbounded array growth.
This has to do with how MongoDB allocates documents. http://docs.mongodb.org/manual/core/storage/
"Every document in MongoDB is stored in a record which contains the document itself and extra space, or padding, which allows the document to grow as the result of updates."
When mongod allocates new documents, it allocates space based on the size of the inserted document and the sizes of documents already in your collection. (Read more in the link above.) If you have some documents that are orders of magnitude larger than others, this will likely lead to fragmentation.
The way to avoid having too many documents in your 'comments' sub-document array is with the $push and $slice commands.
http://docs.mongodb.org/manual/reference/operator/update/slice/
So store the 'most recent 5' and display those when the item first loads. (Or oldest, or whatever other sorting criteria you want to use.) Then provide a way for the user to load more which will do a separate round-trip to the collection that has all of them.
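
A minimal sketch of that pattern with pymongo, assuming a posts collection with an embedded recent_comments array and a separate comments collection holding the full history (names are illustrative):

from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient()["blog"]

def add_comment(post_id, comment):
    comment["created_at"] = datetime.now(timezone.utc)
    # Full history lives in its own collection for the "load more" round-trip.
    db.comments.insert_one({**comment, "post_id": post_id})
    # Keep only the five most recent comments embedded in the post itself.
    db.posts.update_one(
        {"_id": post_id},
        {"$push": {"recent_comments": {
            "$each": [comment],
            "$slice": -5,   # retain just the last 5 elements of the array
        }}},
    )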

MongoDB GridFS Size Limit

I am using MongoDB as a convenient way of storing a dataset as a series of columns, where there is a document that stores the values for a given column and another document that stores the details of the dataset and a mapping to the other documents with the associated column values. The issue I'm now facing as things get bigger is that I can no longer store the entire column in a single document.
I'm aware that there is also the GridFS option. The only downside is that I believe it stores the files as blobs, meaning I would lose random access to a chunk of the column, or the value at a specified index, something that was incredibly useful from the document store; however, I may not have any other option.
So my question is: does GridFS also impose an upper limit on the size of documents, and if so, does anyone know what it is? I've looked in the docs and haven't found anything, but it may be that I'm not looking in the correct place, or that there is a limit but it's not well documented.
Thanks,
Vackar
GridFS
Per the GridFS documentation:
Instead of storing a file in a single document, GridFS divides a file
into parts, or chunks, and stores each of those chunks as a separate
document. By default GridFS limits chunk size to 256k. GridFS uses
two collections to store files. One collection stores the file chunks,
and the other stores file metadata.
GridFS will allow you to store arbitrarily large files however this really won't help your use case. A file in GridFS will effectively be a large binary blob and you will not get any of the benefits of structured documents and indexing.
Schema Design
The fundamental challenge you have is your approach to schema design. If you are creating documents that are likely to grow beyond the 16Mb document limit, these will also have a significant impact on your database storage and fragmentation as the documents grow in size.
The appropriate solution would be to rethink your schema approach so that you do not have unbounded document growth. This probably means flattening the array of "columns" that you are growing so it is represented by a collection of documents rather than an array.
A better (and separate) question to ask would be how to refactor your schema given the expected data growth patterns.
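
As one possible shape for such a refactor (the names and chunk size are illustrative, not prescriptive), each column could be split into fixed-size chunk documents so that no single document grows without bound while element-level random access is preserved:

from pymongo import MongoClient, ASCENDING

db = MongoClient()["datasets"]
CHUNK = 10_000  # elements per chunk document

db.column_chunks.create_index(
    [("dataset_id", ASCENDING), ("column", ASCENDING), ("chunk_no", ASCENDING)],
    unique=True)

def write_column(dataset_id, column, values):
    docs = [{"dataset_id": dataset_id, "column": column,
             "chunk_no": i // CHUNK, "values": values[i:i + CHUNK]}
            for i in range(0, len(values), CHUNK)]
    db.column_chunks.insert_many(docs)

def value_at(dataset_id, column, index):
    # Random access: fetch only the chunk containing the requested element,
    # and project just that one element out of it.
    chunk = db.column_chunks.find_one(
        {"dataset_id": dataset_id, "column": column, "chunk_no": index // CHUNK},
        {"values": {"$slice": [index % CHUNK, 1]}})
    return chunk["values"][0] if chunk else None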

Is mongoDB inefficient for storing many arrays of integers?

All the documents in my mongoDB collection will have an array of integers. I don't need more than 32 bits for each integer, and the length of the integer array will be identical for each document.
The clients of my application will frequently be updating individual fields within the arrays.
If I have 5000 to 10000 documents with arrays of 256 integers, will mongo db waste space because it needs to be prepared for me to change the contents of my arrays to non-integer datatypes, OR change the length of the array?
Will the design of mongoDB make updating individual integers within my arrays very inefficient when compared to a traditional relational database?
Presume I'm using the update array syntax described here:
http://docs.mongodb.org/manual/applications/update/#update-arrays
will mongo db waste space because it needs to be prepared for me to change the contents of my arrays to non-integer datatypes OR change the length of the array?
No it will not waste space. Rather than thinking about this in terms of the ability to change data types or changing the array length, I would concentrate on MongoDB's padding factor whereby it adaptively learns whether documents tend to grow. Since your document sizes will be very similar, your padding factor will tend towards 1 (i.e. almost no additional padding added on the document size).
Will the design of mongoDB make updating individual integers within my arrays very inefficient when compared to a traditional relational database?
Since embedded arrays don't have an exact relational equivalent, the comparison is not obvious. You might assume the relational equivalent to be a JOIN. In this case, I believe MongoDB will work out to be faster, since a JOIN has a cost of its own.
As an additional note, 5,000 to 10,000 documents is minuscule given the volume of data MongoDB can handle. As long as you are specifying an indexed criterion on the update (such as _id), you really don't have any space or performance considerations to worry about here. However, since your documents are not tiny, the one thing I would watch for is trying to load the entire document at once in a find query; you might prefer to project find queries onto specific fields only, and when querying the array you might want to consider $slice.
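
For reference, a small sketch of both patterns with pymongo: updating a single array element in place via its index, and projecting only a slice of the array on reads (collection and field names are illustrative):

from pymongo import MongoClient

coll = MongoClient()["mydb"]["vectors"]

def set_element(doc_id, index, value):
    # Dot notation "values.<index>" targets exactly one array element in place.
    coll.update_one({"_id": doc_id}, {"$set": {"values.%d" % index: value}})

def read_slice(doc_id, start, count):
    # $slice in the projection returns only the requested elements instead of
    # pulling the whole 256-element array back to the client.
    doc = coll.find_one({"_id": doc_id},
                        {"values": {"$slice": [start, count]}})
    return doc["values"] if doc else None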