Server side set intersection in MongoDB

In an application I am working on, a requirement is to do massive set intersection, to the tune of 10-1,000,000 items or so. The items that we are intersecting are simply ObjectIds.
So, for instance, there is a boxes document, and inside each boxes document there is an item_ids array. This item_ids array for each box holds 10-1,000,000 ObjectIds.
The end goal here is to say, given box A with ObjectId 4d3dc3898951498107000005, and box B with ObjectId 4d3dc3898951498107000002, which item_ids do they have in common?
Here is how I'm doing it:
db.boxes.distinct("item_ids", {'_id' : {$in : [ObjectId("4d3dc3898951498107000005"), ObjectId("4d3dc3898951498107000002")]}})
Firstly, I'm just curious whether this seems like a sane approach. In my research so far, map-reduce seems to be a common suggestion for large intersections, but it is not recommended for real-time queries.
Secondly, I'm curious how this would behave in a sharded environment. Will mongos run the relevant part of the query on each mongod it needs to and aggregate my result magically?
Lastly, if the above is sane, is it also sane to do:
db.items.find({'_id' : { $in : db.eval(function() {return db.boxes.distinct("item_ids", {_id:{$in:[ObjectId("4d3dc3898951498107000005"), ObjectId("4d3dc3898951498107000002")]}}); }) }})
This would basically find which items box A and box B have in common, and then materialize them into objects, all in one server-side query. It also appears to work with .limit and .skip to effectively implement paging of the data set.
Anyhow, any feedback is valuable, thanks!

I think you may want to reconsider your schema. If you have 1,000,000 ObjectIDs in an array at 12 bytes each, that is 12MB, not even counting the BSON overhead, which can be significant for large arrays* (probably another 8MB or so). In 1.8 we are raising the max document size from 4MB to 16MB, but even that won't be enough for the objects you are looking to store.
*For historical reasons we store the stringified index for each element in the array, which is fine when you have <100 elements, but adds up when you need 6 or 7 digits.
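If you do restructure, one possible direction (just a sketch of an assumed inverted schema, not something from the question) is to store box membership on each item and let an indexed query do the intersection, which also pages naturally:
// Hypothetical inverted schema: each item lists the boxes it belongs to
db.items.insert({ _id: ObjectId(), box_ids: [ObjectId("4d3dc3898951498107000005"), ObjectId("4d3dc3898951498107000002")] })
db.items.ensureIndex({ box_ids: 1 })
// Items present in both box A and box B, paged 100 at a time
db.items.find({ box_ids: { $all: [ObjectId("4d3dc3898951498107000005"), ObjectId("4d3dc3898951498107000002")] } }).skip(0).limit(100)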

Optimizing for random reads

First of all, I am using MongoDB 3.0 with the new WiredTiger storage engine. Also using snappy for compression.
The use case I am trying to understand and optimize for, from a technical point of view, is the following:
I have a fairly large collection, with about 500 million documents that takes about 180 GB including indexes.
Example document:
{
    _id: 123234,
    type: "Car",
    color: "Blue",
    description: "bla bla"
}
Queries consist of finding documents with a specific field value, like so:
thing.find( { type: "Car" } )
In this example the type field should obviously be indexed. So far so good. However, the access pattern for this data will be completely random: at any given time I have no idea what range of documents will be accessed. I only know that they will be queried on indexed fields, returning at most 100,000 documents at a time.
What this means, in my mind, is that the caching in MongoDB/WiredTiger is pretty much useless; the only things that need to fit in the cache are the indexes. An estimation of the working set is hard, if not impossible?
What I am looking for is mostly tips on what kinds of indexes to use and how to configure MongoDB for this kind of use case. Would other databases work better?
Currently I find MongoDB to work quite well on somewhat limited hardware (16 GB RAM, non-SSD disk). Queries return in decent time, and obviously instantly if the result set is already in the cache. But as already stated, this will most likely not be the typical case. It is not critical that the queries are lightning fast, more that they are dependable and that the database runs in a stable manner.
EDIT:
Guess I left out some important things. The database will be mostly for archival purposes. As such, data arrives from another source in bulk, say once a day. Updates will be very rare.
The example I used was a bit contrived, but in essence that is what queries look like. When I mentioned multiple indexes, I meant the type and color fields in that example. So documents will be queried using these fields. As it is now, we only care about returning all documents that have a specific type, color, etc. Naturally, the plan is to only query on fields we have an index for, so ad-hoc queries are off the table.
Right now the index sizes are quite manageable. For the 500 million documents, each of these indexes is about 2.5GB and fits easily in RAM.
Regarding the average data size of an operation, I can only speculate at this point. As far as I know, typical operations return about 20k documents, with an average object size in the range of 1200 bytes. This is the stat reported by db.stats(), so I guess it is for the compressed data on disk, and not how much it actually takes once in RAM.
Hope this bit of extra info helped!
Basically, if you have a consistent rate of reads that are uniformly random over type (which is what I'm taking "I have no idea what range of documents will be accessed" to mean), then you will see stable performance from the database. It will be doing some stable proportion of reads from cache, just by good luck, and another stable proportion by reading from disk, especially if the number and size of documents are about the same across different type values. I don't think there's a special index or anything to help you besides just better hardware. Indexes should remain in RAM because they'll constantly be in use.
I suppose more information would help, as you mention only one simple query on type but then talk about having multiple indexes to worry about keeping in RAM. How much data does the average operation return? Do you ever care to return a subset of docs of a certain type, or only all of them? What do inserts and updates to this collection look like?
Also, if the documents being read are truly completely random over the dataset, then the working set is all of the data.
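If it helps, here is a rough sketch of how you could keep an eye on the numbers this discussion hinges on (the collection name thing is taken from your example query; treat it as an assumption):
// Per-index sizes in bytes -- these are what you want to keep in RAM
db.thing.stats().indexSizes
// Average object size and total data size as reported for the collection
db.thing.stats().avgObjSize
db.thing.stats().size
// How much the WiredTiger cache currently holds, server-wide
db.serverStatus().wiredTiger.cache["bytes currently in the cache"]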

Any way to make Mongo filtering dynamic records faster?

Scenario
I have a Mongo collection Items that has dynamic item objects in it. Currently I have over 3 million records. I'm using C# with MongoSharp, but I don't think that has anything to do with my problem.
Here is an example Item (it has a lot more fields than just 3):
{
    _id : "1234567890",
    Code : 888596937,
    RefNumber : "GHTZKL",
    ...
}
AFAIK there is no point in using TextSearch, since these are not really words, just codes, so it won't give me anything beneficial. I also cannot index them all, since that would mean indexing every single field.
Problem
Right now when I filter the data it takes about 1-3 seconds (on SSD). Is there any way I can make it filter my items faster, or is this as fast as it gets?
You don't mention which field you want to search on, but it sounds like you want to search on any arbitrary attribute. This is a common design and borders on an antipattern for MongoDB. The only way to avoid the collection scan you're getting now is to index the fields you want to search on, but indexing every field when you don't know what the fields will be ahead of time isn't possible. The solution is to name only the common fields (and index on them), then group the other fields into name/value pairs in an array in the document. You can then index that array to get your fast searches.
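A rough sketch of that pattern, assuming Code and RefNumber are the kind of fields you would move into the array (the k/v field names are just a convention, not required):
// Keep well-known fields top-level; push the dynamic ones into an attrs array
db.Items.insert({
    _id: "1234567890",
    attrs: [
        { k: "Code", v: 888596937 },
        { k: "RefNumber", v: "GHTZKL" }
    ]
})
// A single multikey index then serves searches on any attribute name/value pair
db.Items.ensureIndex({ "attrs.k": 1, "attrs.v": 1 })
db.Items.find({ attrs: { $elemMatch: { k: "RefNumber", v: "GHTZKL" } } })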
A word of caution on NVP arrays: if your array gets very large (hundreds of attributes), your index size will blow up spectacularly. It's best to keep the array fairly small.
For more information on this design pattern, see Asya's great writeup.

Partial doc updates to a large mongo collection - how to not lock up the database?

I've got a MongoDB instance with a collection in it which has around 17 million records.
I wish to alter the document structure (to add a new attribute to the document) of all 17 million documents, so that I don't have to programmatically deal with different structures, as well as to make queries easier to write.
I've been told though that if I run an update script to do that, it will lock the whole database, potentially taking down our website.
What is the easiest way to alter the document without this happening? (I don't mind if the update happens slowly, as long as it eventually happens)
The query I'm attempting to do is:
db.history.update(
    { type: { $exists: false } },
    { $set: { type: 'PROGRAM' } },
    { multi: true }
)
You can update the collection in batches (say, half a million per batch); this will distribute the load.
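A minimal sketch of that batching idea from the shell (the batch size, the sleep, and the per-document updates are all assumptions to tune to your load):
// Repeatedly grab up to half a million docs still missing `type` and update them,
// pausing between batches so other readers/writers get their share of the lock
var batchSize = 500000;
var seen = batchSize;
while (seen === batchSize) {
    seen = 0;
    db.history.find({ type: { $exists: false } }, { _id: 1 }).limit(batchSize).forEach(function (doc) {
        db.history.update({ _id: doc._id }, { $set: { type: 'PROGRAM' } });
        seen++;
    });
    sleep(1000); // give other operations some breathing room
}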
I created a collection with 20,000,000 records and ran your query on it. It took ~3 minutes to update on a virtual machine, and I could still read from the db in a separate console.
> for(var i=0;i<20000000;i++){db.testcoll.insert({"somefield":i});}
The locking in Mongo is quite lightweight, and the lock is not going to be held for the whole duration of the update. Think of it as 20,000,000 separate updates. You can read more here:
http://docs.mongodb.org/manual/faq/concurrency/
You do actually care if your update query is slow: because of the write lock issue you are aware of, the two are tightly linked. This is not a simple read query here; you really want this write query to be as fast as possible.
Optimizing the "find" part of the update is key here. First, since your collection has millions of documents, it's a good idea to keep field names as short as possible (ideally a single character: type => t). This helps because of the schemaless nature of MongoDB collections: every document stores its own field names.
Second, and more importantly, you need to make your query use a proper index. For that you need to work around the $exists operator, which is not optimized (there are actually several ways to do it).
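One possible workaround, as a sketch (it assumes you never store an explicit null in type, since { type: null } matches both missing and null values):
// An index on type stores missing values as null, so { type: null } can use it,
// whereas { type: { $exists: false } } generally cannot
db.history.ensureIndex({ type: 1 })
db.history.update(
    { type: null },
    { $set: { type: 'PROGRAM' } },
    { multi: true }
)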
Third, you can work on the field values themselves. Use http://bsonspec.org/#/specification to estimate the size of the value you want to store, and possibly pick a better choice (in your case, you could replace the 'PROGRAM' string with a numeric constant, for example, and gain a few bytes per document, multiplied by the number of documents touched by each multi-update). The smaller the data you want to write, the faster the operation will be.
A few links to other questions which may inspire you:
Can MongoDB use an index when checking for existence of a field with $exists operator?
Improve querying fields exist in MongoDB

Does providing a projection argument to find() limit the data that is added to Mongo's working set?

In Mongo, suppose I have a collection mycollection that has fields a, b, and huge. I very frequently want to perform queries, mapreduce, updates, etc. on a and b, and only very occasionally want to return huge in query results as well.
I know that db.mycollection.find() will scan the entire collection and result in Mongo attempting to add the whole collection to the working set, which may exceed the amount of RAM I have available.
If I instead call db.mycollection.find({}, { a : 1, b : 1 }), will this still result in the whole collection being added to the working set or only the terms of my projection?
MongoDB can use something called covered queries: http://docs.mongodb.org/manual/applications/indexes/#create-indexes-that-support-covered-queries These allow the query to be answered entirely from the index, rather than having to fetch the documents from disk (or from memory, if those documents happen to be cached at the time).
Be warned that you cannot use covered queries on a full table scan; the condition, projection and sort must all be within the index. For example:
db.col.ensureIndex({a:1,b:1});
db.col.find({a:1}, {_id:0, a:1, b:1}).sort({b:1});
would work (the sort is optional; it is not strictly needed). You can add _id to your index if you intend to return that too.
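One way to confirm that a query really is covered is to check its explain output; a quick sketch (the exact field names differ between server versions):
// Older servers: look for indexOnly: true in the default explain output
db.col.find({a:1}, {_id:0, a:1, b:1}).explain()
// 3.0+: look for executionStats.totalDocsExamined: 0
db.col.find({a:1}, {_id:0, a:1, b:1}).explain("executionStats")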
Map Reduce does not support covered queries; as far as I know there is no way to project only a certain set of fields into the MR (maybe there is some hack I do not know of). Map Reduce only supports a $match-like operator as its input query, with a separate parameter for sorting the incoming documents ( http://docs.mongodb.org/manual/applications/map-reduce/ ).
Note that for updates, I believe only atomic operations ( http://docs.mongodb.org/manual/tutorial/isolate-sequence-of-operations/ ), excluding findAndModify, avoid loading the document into your working set; however, "believe" is the keyword there.
Considering you need to do both MR and normal find and update on these records, I would strongly recommend you look into why you are paging in so much data and whether you really need to do it that often. It seems like you are trying to do too much processing, too frequently.
On the other hand, if this is a script which runs every night or so (e.g. a scoreboard recalculation script), then I would not worry too much about its excessive working set.

How complete should MongoDB indexes be?

For example, I have documents with only three fields: user, date, status. Since I select by user and sort by date, I have those two fields as an index. That is the proper thing to do. However, since each date only has one status, I am essentially indexing everything. Is it okay to not index all fields in a query? Where do you draw the line?
What makes this question more difficult is the complete opposite approach to indexes between read-heavy and write-heavy collections. If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Is it okay to not index all fields in a query?
Yes, but you'll want to avoid this for frequently used queries. Anything not indexed will imply a "table scan". This means accessing each possible document individually, which will be slow.
Where do you draw the line?
Also note that if you sort by an un-indexed field, MongoDB will "yell at you" if you're trying to sort too much data. So you need some awareness of how much data is "outside of" the index.
If yours is somewhere in between, how do you determine the proper approach when it comes to indexes?
Monitoring, instrumenting, experimenting and experience.
There is no hard and fast rule here, it's all going to be about trade-offs. CPU vs. RAM vs. Disk IO vs. Responsiveness, etc.
The perfect situation is to store everything in a single index. By "everything" I mean all the fields you query on, sort by and retrieve. This will ensure maximum performance (if the index fits in RAM).
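As a sketch of that ideal, using the user/date/status fields from the question (the collection name and the user value are made up here):
// One compound index covers the filter (user), the sort (date) and the returned field (status)
db.mycoll.ensureIndex({ user: 1, date: 1, status: 1 })
// Excluding _id keeps the query covered, i.e. answerable from the index alone
db.mycoll.find({ user: "alice" }, { _id: 0, user: 1, date: 1, status: 1 }).sort({ date: 1 })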
This situation is not always possible, so you'll have to make choices.
Here are 3 tips to keep the index size as small as possible:
Does each of your queries return a lot of results or only a few? => A few: you do not have to index all the fields you retrieve (only the query and sort fields, because few results mean few disk accesses).
Are your query results often the same (i.e. your working set is small)? => Don't index the fields you retrieve, because the results are cached by MongoDB.
Is one query field more selective than another? => Index the more selective field only.