How to insert quickly to a very large collection - mongodb

I have a collection of over 70 million documents. Whenever I add new documents in batches (lets say 2K), the insert operation is really slow. I suspect that is because, the mongo engine is comparing the _id's of all the new documents with all the 70 million to find out any _id duplicate entries. Since the _id based index is disk-resident, it'll make the code a lot slow.
Is there anyway to avoid this. I just want mongo to take new documents and insert it as they are, without doing this check. Is it even possible?

Diagnosing "Slow" Performance
Your question includes a number of leading assumptions about how MongoDB works. I'll address those below, but I'd advise you to try to understand any performance issues based on facts such as database metrics (i.e. serverStatus, mongostat, mongotop), system resource monitoring, and information in the MongoDB log on slow queries. Metrics need to be monitored over time so you can identify what is "normal" for your deployment, so I would strongly recommend using a MongoDB-specific monitoring tool such as MMS Monitoring.
A few interesting presentations that provide very relevant background material for performance troubleshooting and debugging are:
William Zola: The (Only) Three Reasons for Slow MongoDB Performance
Aska Kamsky: Diagnostics and Debugging with MongoDB
Improving efficiency of inserts
Aside from understanding where your actual performance challenges lie and tuning your deployment, you could also improve efficiency of inserts by:
removing any unused or redundant secondary indexes on this collection
using the Bulk API to insert documents in batches
Assessing Assumptions
Whenever I add new documents in batches (lets say 2K), the insert operation is really slow. I suspect that is because, the mongo engine is comparing the _id's of all the new documents with all the 70 million to find out any _id duplicate entries. Since the _id based index is disk-resident, it'll make the code a lot slow.
If a collection has 70 million entries, that does not mean that an index lookup involves 70 million comparisons. The indexed values are stored in B-trees which allow for a small number of efficient comparisons. The exact number will depend on the depth of the tree and how your indexes are built and the value you're looking up .. but will be on the order of 10s (not millions) of comparisons.
If you're really curious about the internals, there are some experimental storage & index stats you can enable in a development environment: Storage-viz: Storage Visualizers and Commands for MongoDB.
Since the _id based index is disk-resident, it'll make the code a lot slow.
MongoDB loads your working set (portion of data & index entries recently accessed) into available memory.
If you are able to create your ids in an approximately ascending order (for example, the generated ObjectIds) then all the updates will occur at the right side of the B-tree and your working set will be much smaller (FAQ: "Must my working set fit in RAM").
Yes, I can let mongo use the _id for itself, but I don't want to waste a perfectly good index for it. Moreover, even if I let mongo generate _id for itself won't it need to compare still for duplicate key errors?
A unique _id is required for all documents in MongoDB. The default ObjectId is generated based on a formula that should ensure uniqueness (i.e. there is an extremely low chance of returning a duplicate key exception, so your application will not get duplicate key exceptions and have to retry with a new _id).
If you have a better candidate for the unique _id in your documents, then feel free to use this field (or collection of fields) instead of relying on the generated _id. Note that the _id is immutable, so you shouldn't use any fields that you might want to modify later.

Related

How to correctly build indexes on MongoDB when every field will be searchable and sortable?

I am designing a MongoDB collection that will have 50 million documents and every field in the document will be searchable and sortable. The searching and sorting logics will be sent from the frontend so could be a lot of fields searchings and sorting combinations. I've made some tests and concluded that when there is searching and sorting only in indexed fields the query runs very fast, but when searching or sorting non-indexed fields the query runs very slow.
Considering that will have a lot of possible searching/sorting combinations, how can I build indexes in this collection in this case to get a better performance?
Indexing comes at a cost of extra memory space and possible increased execution time of database write(insert and update) operations. However, like you rightly pointed out, indexing makes database reads(and sorting) super fast.
Creating indexes is easy and straight forward, however, you need to consider the tradeoffs, most times, this is usually the read-write ration of the fields in your documents.
If you frequently read(or sort) documents from a very large collection(like the 50million examples you mentioned), it makes a lot of sense to add indexing to all the fields you use to identify(or sort) your documents, you just need to ensure you don't run out of memory space in the DB. Not indexing the fields would be very frustrating, just imagine if you need to get the last document by a field that is not indexed, you would have to search through 49,999,999 documents to find it.
I hope this helps.

Is it worth splitting one collection into many in MongoDB to speed up querying records?

I have a query for a collection. I am filtering by one field. I thought, I can speed up query, if based on this field I make many separate collections, which collection's name would contain that field name, in previous approach I filtered with. Practically I could remove filter component in a query, because I need only pick the right collection and return documents in it as response. But in this way ducoments will be stored redundantly, a document earlier was stored only once, now document might be stored in more collections. Is this approach worth to follow? I use Heroku as cloud provider. By increasing of the number of dynos, it is easy to serve more user request. As I know read operations in MongoDB are highly mutual, parallel executed. Locking occure on document level. Is it possible gain any advantage by increasing redundancy? Of course index exists for that field.
If it's still within the same server, I believe there may be little parallelization gain (from the database side) in doing it this way, because for a single server, it matters little how your document is logically structured.
All the server cares about is how many collection and indexes you have, since it stores those collections and associated indexes in a number of files. It will need to load these files as the collection is accessed.
What could potentially be an issue is if you have a massive number of collections as a result, where you could hit the open file limit. Note that the open file limit is also shared with connections, so with a lot of collections, you're indirectly reducing the number of possible connections.
For illustration, let's say you have a big collection with e.g. 5 indexes on them. The WiredTiger storage engine stores the collection as:
1 file containing the collection data
1 file containing the _id index
5 files containing the 5 secondary indexes
Total = 7 files.
Now you split this one collection across e.g. 100 collections. Assuming the collections also requires 5 secondary indexes, in total they will need 700 files in WiredTiger (vs. of the original 7). This may or may not be desirable from your ops point of view.
If you require more parallelization if you're hitting some ops limit, then sharding is the recommended method. Sharding the busy collection across many different shards (servers) will immediately give you better parallelization vs. a single server/replica set, given a properly chosen shard key designed to maximize parallelization.
Having said that, sharding also requires more infrastructure and may complicate your backup/restore process. It will also require considerable planning and testing to ensure your design is optimal for your use case, and will scale well into the future.

MongoDB Internal implementation of indexing?

I've learned a lot of things about indexing and finding some stuff from
here.
Indexes support the efficient execution of queries in MongoDB. Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the
query statement. If an appropriate index exists for a query, MongoDB can use the index to limit the number of documents it must inspect.
But i still have some questions:
While Creating index using (createIndex), is the Record always stored in
RAM?
Is every time need to create Index Whenever My application
is going to restart ?
What will Happen in the case of default id (_id). Is always Store in RAM.
_id Is Default Index, That means All Records is always Store in RAM for particular collections?
Please help me If I am wrong.
Thanks.
I think, you are having an idea that indexes are stored in RAM. What if I say they are not.
First of all we need to understand what are indexes, indexes are basically a pointer to tell where on disk that document is. Just like we have indexing in book, for faster access we can see what topic is on which page number.
So when indexes are created, they also are stored in the disk, But when an application is running, based on the frequent use and even faster access they get loaded into RAM but there is a difference between loaded and created.
Also loading an index is not same as loading a collection or records into RAM. If we have index loaded we know what all documents to pick up from disk, unlike loading all document and verifying each one of them. So indexes avoid collection scan.
Creation of indexes is one time process, but each write on the document can potentially alter the indexing, so some part might need to be recalculating because records might get shuffled based on the change in data. that's why indexing makes write slow and read fast.
Again think of as a book, if you add a new topic of say 2 pages in between the book, all the indexes after that topic number needs to be recalculated. accordingly.
While Creating index Using (createIndex),Is Record always store in RAM
?.
No, records are not stored in RAM, while creating it sort of processes all the document in the collection and create an index sheet, this would be time consuming understandably if there are too many documents, that's why there is an option to create index in background.
Is every time need to create Index Whenever My application is going to
restart ?
Index are created one time, you can delete it and create again, but it won't recreated on the application or DB restart. that would be insane for huge collection in sharded environment.
What will Happen in the case of default id (_id). Is always Store in
RAM.
Again that's not true. _id comes as indexed field, so index is already created for empty collection, as when you do a write , it would recalculate the index. Since it's a unique index, the processing would be faster.
_id Is Default Index, That means All Records is always Store in RAM for particular collections ?
all records would only be stored in RAM when you are using in-memory engine of MongoDB, which I think comes as enterprise edition. Due to indexing it would not automatically load the record into RAM.
To answer the question from the title:
MongoDB indexes use a B-tree data structure.
source: https://docs.mongodb.com/manual/indexes/index.html#b-tree

Slow Upserts with PyMongoDB

I'm trying to insert ~800 million records into MongoDB using PyMongo on a macbook air 1.7GHz i7 with no multi-threading, the documents are structured as below:
Records I'm reading are the following tuple:
(user_id,imp_date,imp_creative,imp_pid,geo_id)
I'm creating my own _id field based on the user_id in the file I'm reading from.
{_id:user_id,
'imp_date':[array of dates],
'imp_creative':[array of numeric ids],
'imp_pid':[array of numeric ids],
'geo_id':numeric id}
I'm using an upsert with $push to append date, creative id, and pid for the corresponding arrays
self.collection.update({'_id':uid},
{"$push":{'imp_date':<datevalue>,
'imp_creative':<creative_id>,
'imp_pid':<pid>}},safe=True,upsert=True)
I'm using an upsert with $set to overwrite the geographic location (only care about most recent.
self.collection.update({'_id':uid},
{"$set":{'geo_id':<geo id>}},safe=True,upsert=True)
I'm only writing about 1,500 records per second (8,000 if I set safe=False). My question is: what can I do to speed this up further (ideally 20k/second or faster)?
Ideas I can't find a definitive recommendation on:
-Using multiple threads to insert data
-Sharding
-Padding arrays (my arrays grow very slowly, each document array will have an average length of ~4 at the end of the file)
-Turning journaling off
Apologies if I've left out any required information, this is my first post.
1- You could add an index to speed it up, and index would help you to find the documents faster although the inserts would be slower (you have to update the index as well). If the improvement in the retrieving phase compensates the extra time to update the index depends on how many records you have in the collections, how many indexes you have and how complicated those indexes are.
However, in your case you are only querying with the _id so there's no much more you can do with indexes.
2- Are you using two consecutive updates? I mean, one for the $set and one for the $push?
If that's true, then you should definetelly use just one:
self.collection.update({'_id':uid},
{"$push":{'imp_date':<datevalue>,
'imp_creative':<creative_id>,
'imp_pid':<pid>},
"$set":{'geo_id':<geo id>}},
safe=True,upsert=True)
3- The update operation is an atomic operation which might locks other queries. If the document you are about to update is not already in RAM but it is in the disk, mongo will have to first fetch it from the disk and then update it. If you do a find operation first (which doesn't block as it's a read-only operation) the document will be in RAM for sure so the update operation (the locking one) will be faster:
self.collection.findOne({'_id':uid})
self.collection.update({'_id':uid},
{"$push":{'imp_date':<datevalue>,
'imp_creative':<creative_id>,
'imp_pid':<pid>},
"$set":{'geo_id':<geo id>}},
safe=True,upsert=True)
4-If your documents don't grow too much as you have said, it won't be necessary to bother about padding factor and reallocation issues. Furthermore, in some recent versions (can't remember if it was since 2.2 or 2.4) collections are created with the powerOfTwo option enabled by default.

Mongodb model for Uniqueness

Scenario:
10.000.000 record/day
Records:
Visitor, day of visit, cluster (Where do we see it), metadata
What we want to know with this information:
Unique visitor on one or more clusters for a given range of dates.
Unique Visitors by day
Grouping metadata for a given range (Platform, browser, etc)
The model i stick with in order to easily query this information is:
{
VisitorId:1,
ClusterVisit: [
{clusterId:1, dates:[date1, date2]},
{clusterId:2, dates:[date1, date3]}
]
}
Index:
by VisitorId (to ensure Uniqueness)
by ClusterVisit.ClusterId-ClusterVisit.dates (for searching)
by IdUser-ClusterVisit.IdCluster (for updating)
I also have to split groups of clusters into different collections in order to ease to access the data more efficiently.
Importing:
First we search for a combination of VisitorId - ClusterId and we addToSet the date.
Second:
If first doesn't match, we upsert:
$addToSet: {VisitorId:1,
ClusterVisit: [{clusterId:1, dates:[date1]}]
}
With First and Second importing i cover if the clusterId doesn't exists or if VisitorId doesn´t exists.
Problems:
totally inefficient (near impossible) on update / insert / upsert when the collection grows, i guess because of the document size getting bigger when adding a new date.
Difficult to maintain (unset dates mostly)
i have a collection with more than 50.000.000 that i can't grow any more. It updates only 100 ~ records/sec.
I think the model i'm using is not the best for this size of information. What do you think will be best to get more upsert/sec and query the information FAST, before i mess with sharding, which is going to take more time while i learn and get confident with it.
I have a x1.large instance on AWS
RAID 10 with 10 disks
Arrays are expensive on large collections: mapreduce, aggregate...
Try .explain():
MongoDB 'count()' is very slow. How do we refine/work around with it?
Add explicit hints for index:
Simple MongoDB query very slow although index is set
A full heap?:
Insert performance of node-mongodb-native
The end of memory space for collection:
How to improve performance of update() and save() in MongoDB?
Special read clustering:
http://www.colinhowe.co.uk/2011/02/23/mongodb-performance-for-data-bigger-than-memor/
Global write lock?:
mongodb bad performance
Slow logs performance track:
Track MongoDB performance?
Rotate your logs:
Does logging output to an output file affect mongoDB performance?
Use profiler:
http://www.mongodb.org/display/DOCS/Database+Profiler
Move some collection caches to RAM:
MongoDB preload documents into RAM for better performance
Some ideas about collection allocation size:
MongoDB data schema performance
Use separate collections:
MongoDB performance with growing data structure
A single query can only use one index (better is a compound one):
Why is this mongodb query so slow?
A missing key?:
Slow MongoDB query: can you explain why?
Maybe shards:
MongoDB's performance on aggregation queries
Improving performance stackoverflow links:
https://stackoverflow.com/a/7635093/602018
A good point for further sharding replica education is:
https://education.10gen.com/courses