Is mongoDB inefficient for storing many arrays of integers? - mongodb

All the documents in my mongoDB collection will have an array of integers. I don't need more than 32 bits for each integer, and the length of the integer array will be identical for each document.
The clients of my application will frequently be updating individual fields within the arrays.
If I have 5000 to 10000 documents with arrays of 256 integers, will mongo db waste space because it needs to be prepared for me to change the contents of my arrays to non-integer datatypes, OR change the length of the array?
Will the design of mongoDB make updating individual integers within my arrays very inefficient when compared to a traditional relational database?
Presume I'm using the update array syntax described here:
http://docs.mongodb.org/manual/applications/update/#update-arrays
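To make that concrete, the updates would look something like this (the collection and field names are just placeholders):
// set the element at index 17 of the integer array to 42, for one document
db.mydata.update(
    { _id: 123 },
    { $set: { "values.17": 42 } }
)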

will mongo db waste space because it needs to be prepared for me to change the contents of my arrays to non-integer datatypes OR change the length of the array?
No, it will not waste space. Rather than thinking about this in terms of the ability to change data types or change the array length, I would concentrate on MongoDB's padding factor, whereby it adaptively learns whether documents tend to grow. Since your document sizes will be very similar, your padding factor will tend towards 1 (i.e. almost no additional padding added on top of the document size).
Will the design of mongoDB make updating individual integers within my arrays very inefficient when compared to a traditional relational database?
Since embedded arrays don't have an exact relational equivalent, the comparison is not obvious. You might assume the relational equivalent to be a JOIN. In this case, I believe MongoDB will work out to be faster, since a JOIN has a cost of its own.
As an additional note, 5,000 to 10,000 documents is minuscule given the volume of data MongoDB can handle. As long as you are specifying an indexed criterion on the update (such as _id), you really don't have any space or performance considerations to worry about here. However, since your documents are not tiny, the one thing I would watch for is loading the entire document in a find query; you might prefer to project your find queries to specific fields only, and when querying the array you might want to consider $slice.
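As a rough sketch of that last point (collection and field names here are assumptions, not taken from the question), a projected read of part of the array would look like:
// return only the first 16 elements of the array instead of the whole document
db.mydata.find(
    { _id: 123 },
    { values: { $slice: [0, 16] } }
)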

Related

How to correctly build indexes on MongoDB when every field will be searchable and sortable?

I am designing a MongoDB collection that will have 50 million documents, and every field in the document will be searchable and sortable. The search and sort logic will be sent from the frontend, so there could be many combinations of fields to search and sort on. I've run some tests and concluded that when searching and sorting only on indexed fields the query runs very fast, but when searching or sorting on non-indexed fields the query runs very slowly.
Considering that there will be a lot of possible searching/sorting combinations, how can I build indexes on this collection to get better performance?
Indexing comes at the cost of extra storage space and possibly increased execution time for database write (insert and update) operations. However, as you rightly pointed out, indexing makes database reads (and sorting) very fast.
Creating indexes is easy and straightforward; however, you need to consider the tradeoffs, which most of the time come down to the read-write ratio of the fields in your documents.
If you frequently read (or sort) documents from a very large collection (like the 50 million documents you mentioned), it makes a lot of sense to index all the fields you use to identify (or sort) your documents; you just need to ensure you don't run out of storage space in the DB. Not indexing those fields would be very frustrating: just imagine needing to get the last document by a field that is not indexed and having to scan through 49,999,999 documents to find it.
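As a sketch (the field names below are made up; substitute your own), creating the indexes themselves is straightforward:
// single-field indexes for fields that are searched or sorted on their own
db.mycollection.createIndex({ name: 1 })
db.mycollection.createIndex({ createdAt: -1 })

// a compound index for a combination that is commonly filtered and sorted together;
// it also serves queries on its prefix (here: category alone)
db.mycollection.createIndex({ category: 1, price: -1 })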
I hope this helps.

How much space does it take to store data in MongoDB?

I have a MongoDB database with approximately 50 collections, but that number can increase in the future. Each collection will have between 5 and 11 fields.
My question is: how do I optimize MongoDB so that I do not use up storage space because of a superLongCollectionFieldName? How are characters/words accounted for when storing the data?
Let's say I have a field called userID and another field called IP: do they both take up the same full-size block?
The overall storage required for your data will depend on many use case specific factors including schema, indexes, how compressible the data is, and your data update/deletion patterns. The length of field names does not significantly affect index size (since indexes only store key values and document locations), but long names may have some impact on storage usage. The best way to guesstimate storage usage would be to generate some representative test data using a data generator or by extrapolating from existing data.
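For example (collection and field names here are hypothetical), you could load a batch of representative documents and then compare the sizes the server reports:
// insert some representative test documents
for (var i = 0; i < 10000; i++) {
    db.testdata.insert({ userID: i, IP: "10.0.0." + (i % 255), superLongCollectionFieldName: "value" + i })
}

// compare the uncompressed data size, the on-disk (compressed) size, and the index size
var s = db.testdata.stats()
print("avgObjSize:     " + s.avgObjSize)
print("size:           " + s.size)
print("storageSize:    " + s.storageSize)
print("totalIndexSize: " + s.totalIndexSize)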
MongoDB (as at 4.0) does not maintain a central catalog of field names: field names are stored in each document so documents are self-describing in a distributed deployment. In all modern versions of MongoDB (3.2+) data is compressed by default so the size of field names is not a typical concern for most use cases.
You could implement a mapping to shorter names via application code, but that will add translation overhead and reduce clarity of the documents stored in the server. For more discussion, see: SERVER-863: Tokenize the field names.
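If you did decide to do that, the mapping could be as simple as a lookup table in application code; a minimal sketch (names are made up):
// hypothetical application-side mapping of long field names to short stored names
var fieldMap = { superLongCollectionFieldName: "slc", userID: "uid", IP: "ip" }

function shorten(doc) {
    var out = {}
    for (var key in doc) {
        out[fieldMap[key] || key] = doc[key]
    }
    return out
}

db.testdata.insert(shorten({ userID: 42, IP: "10.0.0.1", superLongCollectionFieldName: "hello" }))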

Optimizing for random reads

First of all, I am using MongoDB 3.0 with the new WiredTiger storage engine. Also using snappy for compression.
The use case I am trying to understand and optimize for from a technical point of view is the following:
I have a fairly large collection, with about 500 million documents that takes about 180 GB including indexes.
Example document:
{
    _id: 123234,
    type: "Car",
    color: "Blue",
    description: "bla bla"
}
Queries consist of finding documents with a specific field value, like so:
thing.find( { type: "Car" } )
In this example the type field should obviously be indexed. So far so good. However, the access pattern for this data will be completely random. At any given time I have no idea what range of documents will be accessed. I only know that they will be queried on indexed fields, returning at most 100,000 documents at a time.
What this means in my mind is that the caching in MongoDB/WiredTiger is pretty much useless. The only thing that needs to fit in the cache is the indexes. An estimation of the working set is hard, if not impossible?
What I am looking for is mostly tips on what kinds of indexes to use and how to configure MongoDB for this kind of use case. Would other databases work better?
Currently I find MongoDB to work quite well on somewhat limited hardware (16 GB RAM, non-SSD disk). Queries return in decent time, and obviously instantly if the result set is already in the cache. But as already stated, this will most likely not be the typical case. It is not critical that the queries are lightning fast, more that they are dependable and that the database runs in a stable manner.
EDIT:
Guess I left out some important things. The database will be mostly for archival purposes. As such, data arrives from another source in bulk, say once a day. Updates will be very rare.
The example I used was a bit contrived but in essence that is what queries look like. When I mentioned multiple indexes I meant the type and color fields in that example. So documents will be queried on using these fields. As it is now, we only care about returning all documents that have a specific type, color etc. Naturally, the plan we have is to only query on fields that we have an index for. So ad-hoc queries are off the table.
Right now the index sizes are quite manageable. For the 500 million documents, each of these indexes is about 2.5 GB and fits easily in RAM.
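For reference, the indexes in question are just single-field ones on the queried fields (the collection name below is assumed from the example), and their sizes come from the collection stats:
// single-field indexes on the fields used in queries
db.things.createIndex({ type: 1 })
db.things.createIndex({ color: 1 })

// per-index sizes in bytes, to confirm they fit comfortably in RAM
db.things.stats().indexSizes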
Regarding average data size of an operation, I can only speculate at this point. As far as I know, typical operations return about 20k documents, with an average object size in the range of 1200 bytes. This is the stat reported by db.stats() so I guess it is for the compressed data on disc, and not how much it actually takes once in RAM.
Hope this bit of extra info helped!
Basically, if you have a consistent rate of reads that are uniformly random over type (which is what I'm taking "I have no idea what range of documents will be accessed" to mean), then you will see stable performance from the database. It will be doing some stable proportion of reads from cache, just by good luck, and another stable proportion by reading from disk, especially if the number and size of documents are about the same between different type values. I don't think there's a special index or anything else to help you besides just better hardware. Indexes should remain in RAM because they'll constantly be in use.
I suppose more information would help, as you mention only one simple query on type but then talk about having multiple indexes to worry about keeping in RAM. How much data does the average operation return? Do you ever care to return a subset of docs of certain type or only all of them? What do inserts and updates to this collection look like?
Also, if the documents being read are truly completely random over the dataset, then the working set is all of the data.

Storing very large documents in MongoDB

In short: If you have a large number of documents with varying sizes, where relatively few documents hit the maximum object size, what are the best practices to store those documents in MongoDB?
I have set of documents like:
{
    _id: ...,
    values: [12, 13, 434, 5555 ...]
}
The length of the values list varies hugely from one document to another. For the majority of documents it will have a few elements; for a few it will have tens of millions of elements, and I will hit the maximum object size limit in MongoDB. The trouble is that any special solution I come up with for those very large (and relatively few) documents might have an impact on how I store the small documents, which would otherwise live happily in a MongoDB collection.
As far as I can see, I have the following options. I would appreciate any input on the pros and cons of these, and any other option that I missed.
1) Use another datastore: That seems too drastic. I like MongoDB, and it's not like I hit the size limit for many objects. In the worst case, my application could treat the very large objects and the rest differently. It just doesn't seem elegant.
2) Use GridFS to store the values: Like a blob in a traditional DB, I could keep the first few thousand elements of values in the document, and if there are more elements in the list, I could keep the rest in a GridFS object as a binary file. I wouldn't be able to search in this part, but I can live with that.
3) Abuse GridFS: I could keep every document in GridFS. For the majority of the (small) documents the binary chunk would be empty, because the files collection would be able to keep everything. For the rest, I could keep the excess elements in the chunks collection. Does that introduce an overhead compared to option #2?
4) Really abuse GridFS: I could use the optional fields in the files collection of GridFS to store all elements of values. Does GridFS do smart chunking for the files collection as well?
5) Use an additional "relational" collection to store the one-to-many relation, but the number of documents in this collection would easily exceed a hundred billion rows.
If you have large documents, try to store some metadata about them in MongoDB, and put the rest of the data (the part you will not be querying on) outside.
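A minimal sketch of that shape, assuming the bulk of the list is written to GridFS (or any external store) and only searchable metadata plus a reference is kept in the main collection (field names here are hypothetical):
// main collection: small, queryable metadata plus a pointer to the full data
db.docs.insert({
    _id: 1,
    count: 25000000,                    // total number of elements in the full list
    values_head: [12, 13, 434, 5555],   // optionally keep the first few elements inline
    values_file_id: ObjectId()          // _id of the GridFS file (or external blob) holding the rest
})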

server side set intersection in mongodb

In an application I am working on, a requirement is to do massive set intersection, to the tune of 10-1,000,000 items or so. The items that we are intersecting are simply ObjectIds.
So for instance there is a boxes collection, and inside each box document there is an item_ids array. This item_ids array for each box holds 10-1,000,000 ObjectIds.
The end goal here is to say, given box A with ObjectId 4d3dc3898951498107000005, and box B with ObjectId 4d3dc3898951498107000002, which item_ids do they have in common?
Here is how I'm doing it:
db.boxes.distinct("item_ids", {'_id' : {$in : [ObjectId("4d3dc3898951498107000005"), ObjectId("4d3dc3898951498107000002")]}})
Firstly just curious if this seems like a sane approach. In my research so far it seems like map reduce is a common suggestion for large intersections, but that it is not recommended for realtime queries.
Secondly, curious how this would behave in a sharded environment? Will mongos run a chunk of the query on the mongod's it needs to and aggregate my result magically?
Lastly, if the above is sane, is it also sane to do:
db.items.find({'_id' : { $in : db.eval(function() {return db.boxes.distinct("item_ids", {_id:{$in:[ObjectId("4d3dc3898951498107000005"), ObjectId("4d3dc3898951498107000002")]}}); }) }})
Which would basically be finding which items both box A and box B have in common, and then materializing them into objects, all in one server-side query. This also appears to work with .limit and .skip to effectively implement paging of the data set.
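For comparison, the same two steps could be done without db.eval by intersecting in the shell/client and then paging through the matching items; a rough sketch (only the box _ids are from the example above, the rest is made up):
// pull each box's array, intersect client-side, then page through the shared items
var a = db.boxes.findOne({ _id: ObjectId("4d3dc3898951498107000005") }).item_ids
var b = db.boxes.findOne({ _id: ObjectId("4d3dc3898951498107000002") }).item_ids

var inA = {}
a.forEach(function (id) { inA[id.str] = true })
var common = b.filter(function (id) { return inA[id.str] })

// materialize the shared items, 100 at a time
db.items.find({ _id: { $in: common } }).skip(0).limit(100)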
Anyhow, any feedback is valuable, thanks!
I think you may want to reconsider your schema. If you have 1,000,000 ObjectIDs in an array at 12 bytes each, that is 12MB, not even counting the BSON overhead, which can be significant for large arrays* (probably another 8MB or so). In 1.8 we are raising the max document size from 4MB to 16MB, but even that won't be enough for the objects you are looking to store.
*For historical reasons we store the stringified index for each element in the array, which is fine when you have <100 elements, but adds up when you need 6 or 7 digits.