What is the preferred way to add many fields to all documents in a MongoDB collection? - mongodb

I have have a Python application that is iteratively going through every document in a MongoDB (3.0.2) collection (typically between 10K and 1M documents), and adding new fields (probably doubling/tripling the number of fields in the document).
My initial thought was that I would use upsert the entire of the revised documents (using pyMongo) - now I'm questioning that:
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
Also, is it better to perform a write to the collection on a document by document basis or in bulk?

this is actually a great question that can be solved a few different ways depending on how you are managing your data.
if you are upserting additional fields does this mean your data is appending additional fields at a later point in time with the only changes being the addition of the additional fields? if so you could set the ttl on your documents so that the old ones drop off over time. keep in mind that if you do this you will want to set an index that sorts your results by descending _id so that the most recent additions are selected before the older ones.
the benefit of this of doing it this way is that your are continually writing data as opposed to seeking and updating data so it is faster.
in regards to upserts vs bulk inserts. bulk inserts are always faster than upserts since bulk upserting requires you to find the original document first.
Given that the revised documents are significantly bigger should I be inserting only the new fields, or just replacing the document?
you really need to understand your data fully to determine what is best but if only change to the data is additional fields or changes that only need to be considered from that point forward then bulk inserting and setting a ttl on your older data is the better method from the stand point of write operations as opposed to seek, find and update. when using this method you will want to db.document.find_one() as opposed to db.document.find() so that only your current record is returned.
Also, is it better to perform a write to the collection on a document by document basis or in bulk?
bulk inserts will be faster than inserting each one sequentially.

Related

Mongo DB update query performance

I would like to understand which of the below queries would be faster, while doing updates, in mongo db? I want to update few thousands of records at one stretch.
Accumulating the object ids of those records and firing them using $in or using bulk update?
Using one or two fields in the collection which are common for those few thousand records - akin to "where" in sql and firing an update using those fields. These fields might or might not be indexed.
I know that query will be much smaller in the 2nd case as every single "_id" (oid) is not accumulated. Does accumulating _ids and using those to update documents offer any practical performance advantages?
Does accumulating _ids and using those to update documents offer any practical performance advantages?
Yes because MongoDB will certainly use the _id index (idhack).
In the second method - as you observed - you can't tell whether or not an index will be used for a certain field.
So the answer will be: it depends.
If your collection has million of documents or more, and / or the number of search fields is quite large you should prefer the first search method. Especially if the id list size is not small and / or the id values are adjacent.
If your collection is pretty small and you can tolerate a full scan you may prefer the second approach.
In any case, you should testify both methods using explain().

MongoDB 3.X : Does it make sense to have only one collection per database

Since MongoDB 3.x introduces lock per record and not on collection or database, does it make sense to write all of your data to single collection with one extra identifier field "documentType".
It will help simulate "join" through map-reduce operation.
Couchbase does the same thing with "buckets" instead of collection.
Does anybody see any disadvatanges with this approach ?
There's one big general-case disadvantage: indexes.
With Mongo, you generally want to set up indexes so that most, if not all, queries you make, use them. So in addition to the one on _id, you'll set up indexes on the primary fields you search by (often compounded with those you sort by).
If you're storing everything in one single collection, that means you need to have all those indexes on that collection. Which means two things:
The indexes are be bigger, since there's more documents to index. Granted, this can be somewhat mitigated by using sparse indexes.
Inserting or modifying documents in the collection requires Mongo to update all these indexes (where it'd just update the relevant indexes in the standard use-many-collections approach). This kills your write performance.
Furthermore, if you have in your application a query that somehow doesn't use one of those many indexes, it needs to scan through the entire collection, which is O(n) where n is the number of documents in the collection -- in your case, that means the number of documents in the entire database.
Collections are cheap. Use them ;)

Slow Upserts with PyMongoDB

I'm trying to insert ~800 million records into MongoDB using PyMongo on a macbook air 1.7GHz i7 with no multi-threading, the documents are structured as below:
Records I'm reading are the following tuple:
(user_id,imp_date,imp_creative,imp_pid,geo_id)
I'm creating my own _id field based on the user_id in the file I'm reading from.
{_id:user_id,
'imp_date':[array of dates],
'imp_creative':[array of numeric ids],
'imp_pid':[array of numeric ids],
'geo_id':numeric id}
I'm using an upsert with $push to append date, creative id, and pid for the corresponding arrays
self.collection.update({'_id':uid},
{"$push":{'imp_date':<datevalue>,
'imp_creative':<creative_id>,
'imp_pid':<pid>}},safe=True,upsert=True)
I'm using an upsert with $set to overwrite the geographic location (only care about most recent.
self.collection.update({'_id':uid},
{"$set":{'geo_id':<geo id>}},safe=True,upsert=True)
I'm only writing about 1,500 records per second (8,000 if I set safe=False). My question is: what can I do to speed this up further (ideally 20k/second or faster)?
Ideas I can't find a definitive recommendation on:
-Using multiple threads to insert data
-Sharding
-Padding arrays (my arrays grow very slowly, each document array will have an average length of ~4 at the end of the file)
-Turning journaling off
Apologies if I've left out any required information, this is my first post.
1- You could add an index to speed it up, and index would help you to find the documents faster although the inserts would be slower (you have to update the index as well). If the improvement in the retrieving phase compensates the extra time to update the index depends on how many records you have in the collections, how many indexes you have and how complicated those indexes are.
However, in your case you are only querying with the _id so there's no much more you can do with indexes.
2- Are you using two consecutive updates? I mean, one for the $set and one for the $push?
If that's true, then you should definetelly use just one:
self.collection.update({'_id':uid},
{"$push":{'imp_date':<datevalue>,
'imp_creative':<creative_id>,
'imp_pid':<pid>},
"$set":{'geo_id':<geo id>}},
safe=True,upsert=True)
3- The update operation is an atomic operation which might locks other queries. If the document you are about to update is not already in RAM but it is in the disk, mongo will have to first fetch it from the disk and then update it. If you do a find operation first (which doesn't block as it's a read-only operation) the document will be in RAM for sure so the update operation (the locking one) will be faster:
self.collection.findOne({'_id':uid})
self.collection.update({'_id':uid},
{"$push":{'imp_date':<datevalue>,
'imp_creative':<creative_id>,
'imp_pid':<pid>},
"$set":{'geo_id':<geo id>}},
safe=True,upsert=True)
4-If your documents don't grow too much as you have said, it won't be necessary to bother about padding factor and reallocation issues. Furthermore, in some recent versions (can't remember if it was since 2.2 or 2.4) collections are created with the powerOfTwo option enabled by default.

How to do global search in MongoDB?

What I mean as global search is searching for documents in specified collections, for example, searching for a name in both User and Organization collections and will return both user and organization documents that match the criteria.
Is it possible to simply copy the documents in User and Organization into another collection and do a search in it?
No, it is not possible to do a multi-collection search automatically. There's no reason however that you couldn't perform the same query on multiple collections and combine the results.
While you could duplicate the data into another collection for query purposes, if you need to be guaranteed that the source collection's values matches identically with the "index" collection, you'll need to implement your own multi-phase transaction (example) as MongoDb doesn't have a multi-collection atomic commit. Or, you can accept the fact that the "index" table may be out of sync. Of course, it could be periodically updated through custom code. Further, it means your working set has increased as you're double storing data. Also, if you then need to grab data from individual collections (to grab more of the source document), you've likely not gained anything and made things worse when compared to doing multiple queries in the first place.
You could store related documents in the same collection and take advantage of the built-in indexing offered. Of course, this comes with the caveat that if your documents are now typed, you may find it more challenging to build MongoDb indexes that are efficient. Every changing/new document must go through the indexing pipeline, which may introduce significant overhead.
If it's only a few collections, I'd just do multiple searches without understanding more deeply your requirements. If not, the second best would be to combine documents into a single collection. Last choice would be to copy the data.

mongodb document structure

My database has users collection,
each user has multiple documents,
each document has multiple sections
each section has multiple works
Users work with works collection very often (add new work, update works, delete works). So my question is what structure of collections should I make? works collection is 100-200 records per section.
Should I make work collection for all users with user _id or there is best solution?
Depends on what kind of queries you have. The guideline is to arrange documents so that you can fetch all you need in ideally one query.
On the other hand, what you probably want to avoid is to have mongo reallocate documents because there's not enough space for a in-place update. You can do that by preallocating enough space, or extracting that frequently changing part into its own collection.
As you can read in MongoDB docs,
Generally, for "contains" relationships between entities, embedding should be be chosen. Use linking when not using linking would result in duplication of data.
So if each user has only access to his documents, I think you're good. Just keep in mind there's a limitation on size (16MB I think) for documents which you should be careful about, since you're embedding lots of stuff.