How efficient are MongoDB projections?

Is there a lot of overhead in excluding nearly all of the data in a document when querying a mongo database?
For example, in the case where I only want field1 and field2, for a collection with a document structure of:
{
    "field1" : 1,
    "field2" : true,
    "field3" : ["big","array",...],
    "field4" : ["another","big","array",...]
}
would I benefit more from:
1. Creating a separate collection alongside this collection containing only field1 and field2, or
2. Using .find() on the original documents with inclusion/exclusion parameters?
Note: The inefficiency of saving the same data twice isn't a concern for me as much as the efficiency of querying the data
Many thanks!

Projection is somewhat similar to using column names explicitly in SQL, so it seems a little counter-intuitive to ask if returning smaller amount of data would incur overhead over returning larger amount of data (full document).
You still have to find the document (depending on how you .find() it, that part may be fast or slow), but returning only the first two fields rather than the complete document will make the query faster, not slower.
Having a second collection may only benefit you if you are concerned about your collection fitting into RAM. If the documents in the duplicate collection are much smaller, they can presumably fit into a smaller amount of total RAM, decreasing the chance that a page will need to be swapped in from disk. However, if you are writing to this collection as well as the original collection, you will need considerably more data in RAM than if you just had the original collection.
So while the intricate details may depend on your individual set-up, the general answer is probably option 2: you will benefit more from using projection and returning only the two fields you need.
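For example (a minimal sketch; the collection name mycollection is hypothetical), the projection form of the query looks like this:
db.mycollection.find({}, { field1: 1, field2: 1, _id: 0 })
The big arrays in field3 and field4 are never sent back to the client, although unless the query is covered by an index the server still loads the whole document and applies the projection before returning results.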

Related

What does nscannedObjects = 0 actually mean?

As far as I understood, the nscannedObjects entry in the explain() output means the number of documents that MongoDB needed to fetch from disk.
My question is: when this value is 0, what does that actually mean beyond the explanation above? Does MongoDB keep a cache with some documents stored in it?
nscannedObjects=0 means that there was no fetching or filtering needed to satisfy your query; the query was resolved solely based on indexes. So for example if you were to query for {_id:10} and there were no matching documents, you would get nscannedObjects=0.
It has nothing to do with the data being in memory, there is no such distinction with the query plan.
Note that in MongoDB 3.0 and later nscanned and nscannedObjects are now called totalKeysExamined and totalDocsExamined, which is a little more self-explanatory.
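For example (a sketch; the collection name is hypothetical and the exact output shape depends on your server version), you can look at these counters via explain:
// MongoDB 3.0+: check executionStats.totalKeysExamined and totalDocsExamined
db.mycollection.find({ _id: 10 }).explain("executionStats")
// Older versions: the same counters appear as nscanned and nscannedObjects
db.mycollection.find({ _id: 10 }).explain()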
Mongo is a document database, which means that it can interpret the structure of the stored documents (unlike for example key-value stores).
One particular advantage of that approach is that you can build indices on the documents in the database.
An index is a data structure (usually a variant of a b-tree) which allows for fast searching of documents based on some of their attributes (for example an application-level id (!= _id) or some other distinctive feature). Indexes are usually stored in memory, allowing very fast access to them.
When you search for documents based on indexed attributes (let's say id > 50), Mongo doesn't need to fetch the document from memory/disk/whatever; it can see which documents match the criteria based solely on the index (note that fetching something from disk is several orders of magnitude slower than a memory lookup, even with no cache). The only time it actually goes to the disk is when you need to fetch the document for further processing, which is not covered by the statistic you cited.
Indices are crucial to achieving high performance, but they also have drawbacks (for example a rarely used index can slow down inserts without being worth it, since after each insertion the index has to be updated).
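As a concrete illustration (a sketch only, with a hypothetical users collection and an application-level id field), building such an index lets a query like id > 50 be resolved without touching the documents, as long as nothing outside the index is asked for:
// index the application-level id field (distinct from _id)
db.users.ensureIndex({ id: 1 })
// condition and projection both lie inside the index, so no documents are fetched
// (explain reports nscannedObjects / totalDocsExamined of 0)
db.users.find({ id: { $gt: 50 } }, { _id: 0, id: 1 })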

MongoDB - How much data is too much data?

I'm building an application that uses MongoDB as a database. I have a lot of products, and I want to log what products a user looks at to the user's database entry. For instance, a user profile looks like this:
{
    "email" : "foo@bar.com",
    "name" : "John Snow",
    "_id" : ObjectId("51ecbcc6896652a008000001"),
    "productsViewed" : [
        product1,
        product2,
        product3,
        product4
    ]
}
I have two options here. I can log just the _id of each product, or I could log entire objects representing the product (name, price, ~100 word description, categories, that sort of thing). The difference in object size is 1 line of text per product vs about 30 lines per product.
I realise that this is probably a trivial amount of data to be concerned about, but if a user has 10,000 productsViewed entries, will the ~30x larger difference make any sort of impact? Logging more data is far more useful for my purposes but I'd like to avoid my database calls lagging if the user profile becomes quite large.
Question is: At what point (in character length, I guess?) is too much data to store with one MongoDB record?
16 MB is the limit for the entire document; this means that all strings etc. have to fit within 16 MB. However, before that there are more limitations on your schema, which you yourself hint at:
but if a user has 10,000 productsViewed entries, will the ~30x larger difference make any sort of impact?
And the answer is yes. First off, with the added data on the root user you will probably go over the 16 MB limit; beyond that, the in-memory $pull, $push and other sub-document operators might have a hard time keeping performance up. You can somewhat mitigate that problem by batching your subdocuments into groups of 100.
However, you have an even bigger problem: fragmentation. Since MongoDB stores each record in a single contiguous space on disk (hence it has settings like padding), you could see considerable fragmentation from odd-sized record objects not being reused here.
I would personally say that you should factor off this relation to a separate collection.
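A possible shape for that separate collection (purely a sketch; the productViews collection and its field names are hypothetical) is one small document per view, indexed by user:
// one document per product view instead of an ever-growing array on the user
db.productViews.insert({
    userId: ObjectId("51ecbcc6896652a008000001"),
    productId: ObjectId("51ecbcc6896652a008000002"),   // the viewed product's _id (hypothetical value)
    viewedAt: new Date()
})
db.productViews.ensureIndex({ userId: 1, viewedAt: -1 })
// "everything this user looked at", newest first, as a cheap indexed query
db.productViews.find({ userId: ObjectId("51ecbcc6896652a008000001") }).sort({ viewedAt: -1 })
This keeps the user document small and fixed in size, and each view record can hold as much product detail as you like without pushing the parent document towards the 16 MB limit.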

Does providing a projection argument to find() limit the data that is added to Mongo's working set?

In Mongo, suppose I have a collection mycollection that has fields a, b, and huge. I very frequently want to perform queries, mapreduce, updates, etc. on a and b, and very occasionally want to return huge in query results as well.
I know that db.mycollection.find() will scan the entire collection and result in Mongo attempting to add the whole collection to the working set, which may exceed the amount of RAM I have available.
If I instead call db.mycollection.find({}, { a : 1, b : 1 }), will this still result in the whole collection being added to the working set or only the terms of my projection?
MongoDB can use something called covered queries ( http://docs.mongodb.org/manual/applications/indexes/#create-indexes-that-support-covered-queries ): these allow you to return all the values from the index itself rather than fetching the documents from disk (or from memory, if they happen to be resident at the time).
Be warned that you cannot use covered queries on a full table scan; the condition, projection and sort must all be within the index. For example:
db.col.ensureIndex({a:1, b:1});
db.col.find({a:1}, {_id:0, a:1, b:1}).sort({b:1});
would work (the sort is optional and not strictly needed here). You can add _id to your index if you intend to return that too.
Map Reduce does not support covered queries; there is no way to project only a certain set of fields into the MR, as far as I know (maybe there is some hack I do not know of). Map Reduce only supports a $match-like operator for its input query, with a separate parameter for sorting the incoming data ( http://docs.mongodb.org/manual/applications/map-reduce/ ).
Note that for updates, I believe only in-place atomic operations ( http://docs.mongodb.org/manual/tutorial/isolate-sequence-of-operations/ ), excluding findAndModify, avoid loading the document into your working set; however, "believe" is the keyword there.
Considering you need to do MR as well as normal finds and updates on these records, I would strongly recommend you look into why you are paging in so much data and whether you really need to do it that often. It seems like you are trying to do too much processing in too short and frequent a time frame.
On the other hand, if this is a script which runs every night or so (e.g. a scoreboard recalculation script), then I would not worry too much about its excessive working set.
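If you want to verify what the server actually does (a sketch using the field names from the question; the compound index is an assumption, create it only if it fits your workload), explain() will tell you whether a find stayed entirely within the index:
db.mycollection.ensureIndex({ a: 1, b: 1 })
// covered: condition and projection both lie inside the index, so no documents are fetched
// (older servers report indexOnly: true, 3.0+ reports totalDocsExamined: 0)
db.mycollection.find({ a: 1 }, { _id: 0, a: 1, b: 1 }).explain()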

Mongodb : multiple specific collections or one "store-it-all" collection for performance / indexing

I'm logging the different actions users make on our website. Each action can be of a different type: a comment, a search query, a page view, a vote, etc. Each of these types has its own schema plus some common fields. For instance:
comment : {"_id": (mongoId), "type": "comment", "date": 4/7/2012, "user": "Franck", "text": "This is a sample comment"}
search : {"_id": (mongoId), "type": "search", "date": 4/6/2012, "user": "Franck", "query": "mongodb"}
etc...
Basically, in OOP or RDBMS, I would design an Action class / table and a set of inherited classes / tables (Comment, Search, Vote).
As MongoDB is schema-less, I'm inclined to set up a single collection ("Actions") where I would store these objects, instead of multiple collections (an Actions collection plus a Comments collection with a link key to its parent Action, etc.).
My question is: what about performance / response time if I try to search by specific columns?
As I understand indexing best practices, if I want "every user searching for mongodb", I would index the columns "type" + "query". But that index will not concern the whole set of data, only the documents of type "search".
Will MongoDb engine scan the whole table or merely focus on data having this specific schema ?
If you create sparse indexes, Mongo will ignore any documents that don't have the key. Note, though, the specific limitation of sparse indexes: they can only index one field.
However, if you are only going to query using common fields there's absolutely no reason not to use a single collection.
I.e. if an index on user+type (or date+user+type) will satisfy all your querying needs - there's no reason to create multiple collections
Tip: use date objects for dates, use object ids not names where appropriate.
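For instance (a sketch; the actions collection name is hypothetical), a sparse index on the search-only field indexes nothing but the documents that actually carry that field:
// only "search" actions have a query field, so only they enter this index
db.actions.ensureIndex({ query: 1 }, { sparse: true })
// "every user searching for mongodb"
db.actions.find({ query: "mongodb" })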
Here is some useful information from MongoDB's Best Practices
Store all data for a record in a single document. MongoDB provides atomic operations at the document level. When data for a record is stored in a single document the entire record can be retrieved in a single seek operation, which is very efficient. In some cases it may not be practical to store all data in a single document, or it may negatively impact other operations. Make the trade-offs that are best for your application.
Avoid large documents. The maximum size for documents in MongoDB is 16MB. In practice most documents are a few kilobytes or less. Consider documents more like rows in a table than the tables themselves. Rather than maintaining lists of records in a single document, instead make each record a document. For large media documents, such as video, consider using GridFS, a convention implemented by all the drivers that stores the binary data across many smaller documents.

server side set intersection in mongodb

In an application I am working on, a requirement is to do massive set intersection, to the tune of 10-1,000,000 items or so. The items that we are intersecting are simply ObjectId's.
So for instance there is a boxes document and inside the boxes document there is an item_ids Array. This item_ids array for each box holds 10-1,000,000 ObjectId's.
The end goal here is to say, given box A with ObjectId 4d3dc3898951498107000005, and box B with ObjectId 4d3dc3898951498107000002, which item_ids do they have in common?
Here is how I'm doing it:
db.boxes.distinct("item_ids", {'_id' : {$in : [ObjectId("4d3dc3898951498107000005"), ObjectId("4d3dc3898951498107000002")]}})
Firstly, I'm just curious if this seems like a sane approach. In my research so far it seems like map reduce is a common suggestion for large intersections, but that it is not recommended for realtime queries.
Secondly, I'm curious how this would behave in a sharded environment. Will mongos run a chunk of the query on the mongods it needs to and aggregate my result magically?
Lastly, if the above is sane, is it also sane to do:
db.items.find({'_id' : { $in : db.eval(function() {return db.boxes.distinct("item_ids", {_id:{$in:[ObjectId("4d3dc3898951498107000005"), ObjectId("4d3dc3898951498107000002")]}}); }) }})
Which would basically be finding which items both box A and box B have in common, and then materializing them into objects all in one server side query. This appears to also work with .limit and .skip to effectively implement a paging of the data set.
Anyhow, any feedback is valuable, thanks!
I think you may want to reconsider your schema. If you have 1,000,000 ObjectIDs in an array at 12 bytes each, that is 12MB, not even counting the BSON overhead, which can be significant for large arrays* (probably another 8MB or so). In 1.8 we are raising the max document size from 4MB to 16MB, but even that won't be enough for the objects you are looking to store.
*For historical reasons we store the stringified index of each element in the array, which is fine when you have <100 elements but adds up when the index needs 6 or 7 digits.
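One way to restructure (purely a sketch; the box_items collection and its field names are hypothetical) is to store one small document per box/item pair instead of a huge array, and, on servers new enough to have the aggregation framework (2.2+), compute the intersection with a $group:
// membership as individual documents rather than a multi-megabyte array
db.box_items.insert({ box_id: ObjectId("4d3dc3898951498107000005"),
                      item_id: ObjectId("4d3dc38989514981070000aa") })   // hypothetical item _id
db.box_items.ensureIndex({ item_id: 1, box_id: 1 })
// items present in both boxes: group per item and keep those seen twice
db.box_items.aggregate([
    { $match: { box_id: { $in: [ ObjectId("4d3dc3898951498107000005"),
                                 ObjectId("4d3dc3898951498107000002") ] } } },
    { $group: { _id: "$item_id", boxes: { $sum: 1 } } },
    { $match: { boxes: 2 } }
])
With result sets of the size you describe you would still want to page through the output (or fall back to map-reduce) rather than pull everything back at once.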