I have a MongoDB collection that contains around 10M documents.
I'm trying to update (upsert) around ~2500 documents; each update is around 1 KB.
I tried bulkWrite with ordered=false.
It took around 10 seconds, which is about 3-4 ms per document.
So I tried applying the ~2500 upserts with individual updateOne calls instead (iteratively).
I measured the average time per document and it was about 3.5 ms per update.
Why don't I get a better result with bulkWrite, and how can I improve the bulkWrite update time?
Example of a bulkWrite with one document:
db.collections.bulkWrite( [
  { updateOne :
    {
      "filter": { "Name": "someName", "Namespace": "someNs", "Node": "someNode", "Date": 0 },
      "update": { "$addToSet": { "Data": { "$each": ["1", "2"] } } },
      "upsert": true
    }
  }
] )
Document Example:
{
"Name": "someName",
"Namespace": "someNs",
"Node": "SomeNode",
"Date": 23245,
"Data" : ["a", "b"]
}
I have a compound index that contains: Name, Namespace, Node, Date.
When I try to find documents I get good results
TL;DR: play around with your batch sizes to find the sweet spot.
bulkWrite or updateMany will be faster than single updates, simply because there is less chatter going on (fewer round trips). The amount of data being transferred is exactly the same.
What you need to do is find the optimum number that gives you the highest throughput based on your network, cluster config, etc.
Typically, if the batch size is too small you are not using the batching to its full potential, and if it is too big you spend too much time just transferring the package to the database.
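For example, here is a rough sketch (not the asker's actual code) that chunks the same upserts and times a given batch size; the field names come from the question, and the batch sizes to try are arbitrary:
// Sketch only: chunk the ~2500 upserts and time each batch size to find the sweet spot.
var docs = [
    // the ~2500 update payloads, shaped like the document example in the question
    { Name: "someName", Namespace: "someNs", Node: "someNode", Date: 0, Data: ["1", "2"] }
];
var batchSize = 500;    // try e.g. 100, 500, 1000, 2500 and compare timings
var start = new Date();
for (var i = 0; i < docs.length; i += batchSize) {
    var ops = docs.slice(i, i + batchSize).map(function (d) {
        return { updateOne: {
            filter: { Name: d.Name, Namespace: d.Namespace, Node: d.Node, Date: d.Date },
            update: { $addToSet: { Data: { $each: d.Data } } },
            upsert: true
        } };
    });
    db.collections.bulkWrite(ops, { ordered: false });
}
print("batchSize " + batchSize + ": " + (new Date() - start) + " ms");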
I have a MongoDB 4.4 cluster and a database with a collection of 200k documents and 55 indexes for different queries.
The following query:
db.getCollection('tasks').find({
"customer": "gZuu5ZptDEtC6dq2Z",
"finished": true,
"$or": [
{
"scoreCalculated": {
"$exists": true
},
},
{
"workflowProcessed": {
"$exists": true
},
}
]
}).sort({
"scoreCalculated": -1,
"workflowProcessed": -1,
"createdAt": -1
})
executes in less than 1 second on average. Explain.
But if I change sort direction to
.sort({
"scoreCalculated": 1,
"workflowProcessed": 1,
"createdAt": 1
})
the execution time grows to several seconds (up to 10). Explain.
The first explain shows that the apiGetTasks index is used. But it has an ascending sort, and I don't get why it is not used when I switch the sort direction to ascending. Adding the same index with a descending sort doesn't change anything.
Could you please help me understand why the second query is so slow?
Please share the indexes.
55 indexes is way too many; you should reduce that number because it can harm performance (every index you use has to be loaded into RAM, instead of letting Mongo use that RAM to optimize queries).
Moreover, the total number of examined docs is 210780, which is your whole collection. So you need to rethink how to build efficient indexes that help you optimize queries.
Read the Mongo indexes docs here.
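As a starting point, here is a minimal sketch of what you could run and share; the query and collection name are copied from the question, nothing else is assumed:
// List the current indexes on the collection
db.getCollection('tasks').getIndexes()
// Run the slow ascending-sort query with execution statistics, then compare
// totalKeysExamined / totalDocsExamined and any SORT stage with the fast descending run
db.getCollection('tasks').find({
    "customer": "gZuu5ZptDEtC6dq2Z",
    "finished": true,
    "$or": [
        { "scoreCalculated": { "$exists": true } },
        { "workflowProcessed": { "$exists": true } }
    ]
}).sort({
    "scoreCalculated": 1,
    "workflowProcessed": 1,
    "createdAt": 1
}).explain("executionStats")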
I need a compound index for my collection, but I'm not sure about the key order.
My item:
{
_id,
location: {
type: "Point",
coordinates: [<lng>, <lat>]
},
isActive: true,
till: ISODate("2016-12-29T22:00:00.000Z"),
createdAt : ISODate("2016-10-31T12:02:51.072Z"),
...
}
My main query is:
db.collection.find({
$and: [
{
isActive: true
}, {
'till': {
$gte: new Date()
}
},
{
'location': { $geoWithin: { $box: [ [ SWLng,SWLat], [ NELng, NELat] ] } }
}
]
}).sort({'createdAt': -1 })
In human terms: I need all active, non-expired items on the visible part of my map, newly added first.
Is it normal to create this index:
db.collection.createIndex( { "isActive": 1, "till": -1, "location": "2dsphere", "createdAt": -1 } )
And what is the best order for performance and disk usage? Or maybe I should create several indexes...
Thank you!
The order of fields in an index should be:
fields on which you will query for exact values.
fields on which you will sort.
fields on which you will query for a range of values.
In your case it would be:
db.collection.createIndex( { "isActive": 1, "createdAt": -1, "till": -1, "location": "2dsphere" } )
However, indexes on boolean fields are often not very useful, as on average MongoDB will still need to access half of your documents. So I would advise you to do the following:
duplicate the collection for test purposes
create the index you would like to test (e.g. {"isActive": 1, "createdAt": -1, "till": -1, "location": "2dsphere" })
in the mongo shell create variable:
var exp = db.testCollection.explain('executionStats')
execute exp.find({<your query>}); it will return statistics describing the execution of the winning plan
analyze the keys like: "nReturned", "totalKeysExamined","totalDocsExamined"
drop the index, create a new one (e.g. {"createdAt": -1, "till": -1, "location": "2dsphere"}), execute exp.find({<your query>}) again, and compare the result to the previous one
In Mongo, many things depend on the data and its access patterns. There are a few things to consider when creating an index on your collection:
How the data will be accessed from the application. (You already know the main query, so this part is almost done.)
The data size, cardinality, and span.
Operations on the data (how often reads and writes will happen, and in what pattern).
A particular query can use only one index at a time.
Index usage is not static. Mongo keeps changing the index it uses based on heuristics and tries to do so in an optimized way. So if you see index1 being used at some point, Mongo may switch to index2 later, once enough data of a different type/cardinality has been entered.
Indexes can help as well as hurt your application's performance. It is best to test them via the shell/Compass before using them in production.
var ex = db.<collection>.explain("executionStats")
The above line, when entered in the mongo shell, gives you an explainable object that can be used to investigate performance issues.
ex.find(<Your query>).sort(<sort predicate>)
Points to note in above output are
"executionTimeMillis"
"totalKeysExamined"
"totalDocsExamined"
"stage"
"nReturned"
We strive for a minimum on the first three items (executionTimeMillis, totalKeysExamined and totalDocsExamined), and "stage" is the important one for telling what is happening. If the stage is "COLLSCAN", every document is being scanned to fulfil the query; if the stage is "SORT", an in-memory sort is being done. Neither is good.
Coming to your query, there are a few things to consider:
If "till" is going to have a fixed value, like the end-of-month date for all items entered during a month, then it's not a good idea to index it. The DB will have to scan many documents even with this index. Moreover, there will be only 12 distinct values per year for this index if it is a month-end date.
If "till" is a fixed offset after "createdAt", then it is not useful to index both.
Indexing "isActive" is not good because there are only two values it can take.
So please try with the actual data, create the indexes below, and determine which one fits best, considering time, number of docs examined, etc.
1. {"location": "2dsphere" , "createdAt": -1}
2. {"till":1, "location": "2dsphere" , "createdAt": -1}
Apply both indexes to the collection and execute ex.find().sort(), where ex is the explainable object. Then analyze both outputs and compare them to decide which is best (see the sketch below).
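For instance, a possible test run might look like this (a sketch only; SWLng/SWLat/NELng/NELat are the same placeholders used in the question, and the results will depend entirely on your data):
// Candidate 1
db.collection.createIndex({ "location": "2dsphere", "createdAt": -1 })
// Candidate 2
db.collection.createIndex({ "till": 1, "location": "2dsphere", "createdAt": -1 })

var ex = db.collection.explain("executionStats")
ex.find({
    isActive: true,
    till: { $gte: new Date() },
    location: { $geoWithin: { $box: [ [ SWLng, SWLat ], [ NELng, NELat ] ] } }
}).sort({ createdAt: -1 })
// Compare executionTimeMillis, totalKeysExamined and totalDocsExamined for each run;
// you can force a particular candidate with .hint(...) to get stats for both plans.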
I have a collection
orders
{
"_id": "abcd",
"last_modified": ISODate("2016-01-01T00:00:00Z"),
"suborders": [
{
"suborder_id": "1",
"last_modified: ISODate("2016-01-02T00: 00: 00Z")
}, {
"suborder_id":"2",
"last_modified: ISODate("2016-01-03T00:00:00Z")
}
]
}
I have two indexes on this collection:
{"last_modified":1}
{"suborders.last_modified": 1}
When I use range queries on last_modified, indexes are properly used and results are returned instantly, e.g.: db.orders.find({"last_modified":{$gt:ISODate("2016-09-15"), $lt:ISODate("2016-09-16")}});
However, when I query on suborders.last_modified, the query takes too long to execute, e.g.: db.orders.find({"suborders.last_modified":{$gt:ISODate("2016-09-15"), $lt:ISODate("2016-09-16")}});
Please help debug this.
The short answer is to use min and max to set the index bounds correctly. For how to approach debugging, read on.
A good place to start for query performance issues is to attach .explain() at the end of your queries. I made a script to generate documents like yours and execute the queries you provided.
I used mongo 3.2.9 and both queries do use the created indices with this setup. However, the second query was returning many more documents (approximately 6% of all the documents in the collection). I suspect that is not your intention.
To see what is happening, let's consider a small example in the mongo shell:
> db.arrayFun.insert({
orders: [
{ last_modified: ISODate("2015-01-01T00:00:00Z") },
{ last_modified: ISODate("2016-01-01T00:00:00Z") }
]
})
WriteResult({ "nInserted" : 1 })
then query between May and July of 2015:
> db.arrayFun.find({"orders.last_modified": {
$gt: ISODate("2015-05-01T00:00:00Z"),
$lt: ISODate("2015-07-01T00:00:00Z")
}}, {_id: 0})
{ "orders" : [ { "last_modified" : ISODate("2015-01-01T00:00:00Z") }, { "last_modified" : ISODate("2016-01-01T00:00:00Z") } ] }
Although neither object in the array has last_modified between May and July, it found the document. This is because it is looking for one object in the array with last_modified greater than May and one object with last_modified less than July. The bounds of these two predicates cannot be intersected on a multikey index, which is what happens in your case. You can see this in the indexBounds field of the explain("allPlansExecution") output; in particular, either the lower bound or the upper bound Date will not be what you specified. This means that a large number of documents may need to be scanned to complete the query, depending on your data.
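If you want to see this yourself, something like the following should show the one-sided bounds (the exact path into the explain document can vary a little between server versions):
// Inspect the winning plan's index bounds for the original range query
var plan = db.orders.find({
    "suborders.last_modified": {
        $gt: ISODate("2016-09-15T00:00:00Z"),
        $lt: ISODate("2016-09-16T00:00:00Z")
    }
}).explain("allPlansExecution")
// Look for the IXSCAN stage and its indexBounds: one side will be the date you
// asked for and the other will be an open-ended bound
printjson(plan.queryPlanner.winningPlan)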
To find objects in the array that have last_modified between two bounds, I tried using $elemMatch.
db.orders.find({"suborders": {
$elemMatch:{
last_modified:{
"$gt":ISODate("2016-09-15T00:00:00Z"),
"$lt":ISODate("2016-09-16T00:00:00Z")
}
}
}})
In my test this returned about 0.5% of all documents. However, it was still running slowly. The explain output showed it was still not setting the index bounds correctly (only using one bound).
What ended up working best was to manually set the index bounds with min and max.
db.subDocs.find()
.min({"suborders.last_modified":ISODate("2016-09-15T00:00:00Z")})
.max({"suborders.last_modified":ISODate("2016-09-16T00:00:00Z")})
Which returned the same documents as $elemMatch but used both bounds on the index. It ran in 0.021s versus 2-4s for elemMatch and the original find.
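One caveat from my side, not part of the original answer: on newer MongoDB versions (4.2+), min() and max() have to be paired with an explicit hint() naming the index, so the same query would look roughly like this:
db.subDocs.find()
  .min({ "suborders.last_modified": ISODate("2016-09-15T00:00:00Z") })
  .max({ "suborders.last_modified": ISODate("2016-09-16T00:00:00Z") })
  .hint({ "suborders.last_modified": 1 })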
Imagine a collection with about 5,000,000 documents. I need to do a basicCursor query to select ~100 documents based on too many fields to index. Let's call this the basicCursorMatch. This will be immensely slow.
I can, however, do a bTreeCursor query on a few indexes that will limit my search to ~500 documents. Let's call this query the bTreeCursorMatch.
Is there a way I can do this basicCursorMatch directly on the cursor or collection resulting from the bTreeCursorMatch?
Intuitively I tried
var cursor = collection.find(bTreeCursorMatch);
var results = cursor.find(basicCursorMatch);
similar to collection.find(bTreeCursorMatch).find(basicCursorMatch), which doesn't seem to work.
Alternatively, I was hoping I could do something like this:
collection.aggregate([
{$match: bTreeCursorMatch}, // Uses index 5,000,000 -> 500 fast
{$match: basicCursorMatch}, // No index, 500 -> 100 'slow'
{$sort}
]);
...but it seems that I cannot do this either. Is there an alternative to do what I want?
The reason I am asking is because this second query will differ a lot and there is no way I can index all the fields. But I do want to make that first query using a bTreeCursor, otherwise querying the whole collection will take forever using a basicCursor.
update
Also, through user input the subselection of 500 documents will be queried in different ways during a session with an unpredictable basicCursor query, using multiple $in $eq $gt $lt. But during this, the bTreeCursor subselection remains the same. Should I just keep doing both queries for every user query, or is there a more efficient way to keep a reference to this collection?
In practice, you rarely need to run second queries on a cursor. In particular, you don't need to break MongoDB's work into separate indexable / non-indexable chunks.
If you pass a query to MongoDB's find method that can be partially fulfilled by a look-up in an index, MongoDB will do that look-up first, and then do a full scan on the remaining documents.
For instance, I have a collection users with documents like:
{ _id : 4, gender : "M", ... }
There is an index on _id, but not on gender. There are ~200M documents in users.
To get an idea of what MongoDB is doing under the hood, add explain() to your cursor (in the Mongo shell):
> db.users.find( { _id : { $gte : 1, $lt : 10 } } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 9,
"nscannedObjects" : 9
}
I have cut out some of the fields returned by explain. Basically, cursor tells you if it's using an index, n tells you the number of documents returned by the query, and nscannedObjects is the number of objects scanned during the query. In this case, MongoDB was able to scan exactly the right number of objects.
What happens if we now query on gender as well?
> db.users.find( { _id : { $gte : 1, $lt : 10 }, gender : "F" } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 5,
"nscannedObjects" : 9
}
find returns 5 objects, but had to scan 9 documents. It was therefore able to isolate the correct 9 documents using the _id field. It then went through all 9 documents and filtered them by gender.
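Applied to your case, this simply means putting both parts into one filter (a sketch only; the field names below are made up, since the real bTreeCursorMatch/basicCursorMatch criteria aren't shown):
// One find() with everything: MongoDB narrows the result with the index first,
// then filters the remaining documents by the un-indexed conditions itself
db.collection.find({
    // the indexed part (your "bTreeCursorMatch")
    indexedField: "someValue",
    // the un-indexed, user-driven part (your "basicCursorMatch")
    otherField:   { $gt: 10 },
    anotherField: { $in: ["x", "y"] }
})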
Let's say I have a collection with documents that look like this (just a simplified example, but it should show the schema):
> db.data.find()
{ "_id" : ObjectId("4e9c1f27aa3dd60ee98282cf"), "type" : "A", "value" : 11 }
{ "_id" : ObjectId("4e9c1f33aa3dd60ee98282d0"), "type" : "A", "value" : 58 }
{ "_id" : ObjectId("4e9c1f40aa3dd60ee98282d1"), "type" : "B", "value" : 37 }
{ "_id" : ObjectId("4e9c1f50aa3dd60ee98282d2"), "type" : "B", "value" : 1 }
{ "_id" : ObjectId("4e9c1f56aa3dd60ee98282d3"), "type" : "A", "value" : 85 }
{ "_id" : ObjectId("4e9c1f5daa3dd60ee98282d4"), "type" : "B", "value" : 12 }
Now I need to collect some statistics on that collection. For example:
db.data.mapReduce(function(){
emit(this.type,this.value);
},function(key,values){
var total = 0;
for(i in values) {total+=values[i]};
return total;
},
{out:'stat'})
will collect totals in 'stat' collection.
> db.stat.find()
{ "_id" : "A", "value" : 154 }
{ "_id" : "B", "value" : 50 }
At this point everything is perfect, but I'm stuck on the next move:
The 'data' collection is constantly updated with new data (old documents stay unchanged; only inserts, no updates).
I would like to periodically update the 'stat' collection, but I do not want to query the whole 'data' collection every time, so I chose to run an incremental mapReduce.
It may seem better to just update the 'stat' collection on every insert into 'data' and not use mapReduce at all, but the real case is more complex than this example, and I would like to get statistics only on demand.
To do this I need to be able to query only the documents that were added after my last mapReduce.
As far as I understand, I cannot rely on the ObjectId (i.e. just store the last one and later select every document with ObjectId > stored), because ObjectIds are not like auto-increment ids in SQL databases (for example, different shards will produce different ObjectIds).
I could change the ObjectId generator, but I'm not sure how to do that well in a sharded environment.
So the question is:
Is there any way to select only the documents added after the last mapReduce, so that I can run an incremental mapReduce? Or maybe there is another strategy for updating statistics on a constantly growing collection?
You can cache the time and use it as a barrier for your next incremental map-reduce.
We're testing this at work and it seems to be working. Correct me if I'm wrong, but you can't safely do map-reduce while an insert is happening across shards. The versions become inconsistent and your map-reduce operation will fail. (If you find a solution to this, please do let me know! :)
We use bulk-inserts instead, once every 5 minutes. Once all the bulk inserts are done, we run the map-reduce like this (in Python):
m = Code(<map function>)
r = Code(<reduce function>)
# pseudo code
end = last_time + 5 minutes
# Use time and optionally any other keys you need here
q = bson.SON([("date", {"$gte": last_time, "$lt": end})])
collection.map_reduce(m, r, out={"reduce": <output_collection>}, query=q)
Note that we used reduce and not merge, because we don't want to override what we had before; we want to combine the old results and the new result with the same reduce function.
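For reference, here is the same incremental pattern in the mongo shell, applied to the 'data'/'stat' collections from the question (a sketch only; it assumes the documents carry a queryable insert-time field, here called "date", which the example documents above don't actually have):
// lastTime / end form the time barrier; only documents inserted in that window
// are mapped, and their totals are folded into the existing 'stat' collection
var lastTime = ISODate("2016-01-01T00:00:00Z");   // cached from the previous run
var end = new Date();
db.data.mapReduce(
    function () { emit(this.type, this.value); },
    function (key, values) { return Array.sum(values); },
    {
        query: { date: { $gte: lastTime, $lt: end } },
        out: { reduce: "stat" }    // re-reduce into the existing totals, don't overwrite them
    }
)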
You can get just the time portion of the ID using _id.getTime() (from: http://api.mongodb.org/java/2.6/org/bson/types/ObjectId.html). That should be sortable across all shards.
EDIT: Sorry, that was the Java docs... The Ruby version appears to be _id.generation_time.in_time_zone(Time.zone), from http://mongotips.com/b/a-few-objectid-tricks/
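In the mongo shell itself the equivalent appears to be getTimestamp() (my addition, not part of the original answer):
// Returns the creation time embedded in the ObjectId as a Date
ObjectId("4e9c1f27aa3dd60ee98282cf").getTimestamp()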
I wrote up a complete pymongo-based solution that uses incremental map-reduce and caches the time, and expects to run in a cron job. It locks itself so two can't run concurrently:
https://gist.github.com/2233072
""" This method performs an incremental map-reduce on any new data in 'source_table_name'
into 'target_table_name'. It can be run in a cron job, for instance, and on each execution will
process only the new, unprocessed records.
The set of data to be processed incrementally is determined non-invasively (meaning the source table is not
written to) by using the queued_date field 'source_queued_date_field_name'. When a record is ready to be processed,
simply set its queued_date (which should be indexed for efficiency). When incremental_map_reduce() is run, any documents
with queued_dates between the counter in 'counter_key' and 'max_datetime' will be map/reduced.
If reset is True, it will drop 'target_table_name' before starting.
If max_datetime is given, it will only process records up to that date.
If limit_items is given, it will only process (roughly) that many items. If multiple
items share the same date stamp (as specified in 'source_queued_date_field_name') then
it has to fetch all of those or it'll lose track, so it includes them all.
If unspecified/None, counter_key defaults to counter_table_name:LastMaxDatetime.
"""
We solved this issue using 'normalized' ObjectIds. The steps we take:
Normalize the id - take the timestamp from the current/stored/last processed id and set the other parts of the id to their minimum values. C# code: new ObjectId(objectId.Timestamp, 0, short.MinValue, 0)
Run map-reduce over all items that have an id greater than our normalized id, skipping already processed items.
Store the last processed id, and mark all processed items.
Note: some boundary items will be processed several times. To fix this we set a flag on the processed items.
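A rough mongo shell translation of the same normalization step (my sketch, not the poster's code; the "processed" flag name is a placeholder):
// Keep only the 4-byte timestamp of the last processed id and zero out the rest,
// so the result is a clean lower bound regardless of which shard produced the ids
var lastProcessedId = ObjectId("4e9c1f5daa3dd60ee98282d4");     // stored from the previous run
var seconds = Math.floor(lastProcessedId.getTimestamp().getTime() / 1000);
var normalized = ObjectId(seconds.toString(16) + "0000000000000000");

// Query (or map-reduce) only the items at or after the barrier, skipping ones
// already marked with the flag mentioned in the note above
db.data.find({ _id: { $gte: normalized }, processed: { $ne: true } })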