I have a single standalone mongo installation on a Linux machine.
The database contains a collection with 181 million documents. This collection is by far the largest in the database (approx. 90%).
The size of the collection is currently 3.5 TB.
I'm running MongoDB version 4.0.10 (WiredTiger).
The collection has 2 indexes:
One on id
One on 2 fields and it is used when deleting documents (see those in the snippet below).
When benchmarking bulk deletion on this collection we used the following snippet:
db.getCollection('Image').deleteMany(
    {$and: [
        {"CameraId" : 1},
        {"SequenceNumber" : { $lt: 153000000 }}]})
To see the state of the deletion operation I ran a simple test of deleting 1000 documents while looking at the operation using currentOp(). It shows the following.
"command" : {
"q" : {
"$and" : [
{
"CameraId" : 1.0
},
{
"SequenceNumber" : {
"$lt" : 153040000.0
}
}
]
},
"limit" : 0
},
"planSummary" : "IXSCAN { CameraId: 1, SequenceNumber: 1 }",
"numYields" : 876,
"locks" : {
"Global" : "w",
"Database" : "w",
"Collection" : "w"
},
"waitingForLock" : false,
"lockStats" : {
"Global" : {
"acquireCount" : {
"r" : NumberLong(877),
"w" : NumberLong(877)
}
},
"Database" : {
"acquireCount" : {
"w" : NumberLong(877)
}
},
"Collection" : {
"acquireCount" : {
"w" : NumberLong(877)
}
}
}
It seems to be using the correct index, but the number and type of locks worry me. As I interpret this, it acquires 1 global lock for each deleted document from a single collection.
When using this approach it has taken over a week to delete 40 million documents. This cannot be expected performance.
I realise other designs exist, such as bulking documents into larger chunks and storing them using GridFS, but the current design is what it is, and I want to make sure that what I see is expected before changing my design, restructuring the data or even considering clustering.
Any suggestions on how to increase performance on bulk deletions, or is this expected?
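One way to sanity-check the reading of the lock counters above is to compare them with numYields from the same currentOp() document. The sketch below is plain JavaScript runnable in Node.js; summarizeLocks is a hypothetical helper, not a MongoDB API, and the input object is a trimmed copy of the output shown earlier with the NumberLong values written as plain numbers.

```javascript
// Hypothetical helper: compare lock acquisitions with yields for one
// operation document as returned by db.currentOp().
function summarizeLocks(op) {
  const result = { numYields: op.numYields, levels: {} };
  for (const level of Object.keys(op.lockStats)) {
    const counts = op.lockStats[level].acquireCount || {};
    for (const mode of Object.keys(counts)) {
      // Difference between acquisitions in this mode and the yield count.
      result.levels[level + "." + mode] = {
        acquired: counts[mode],
        minusYields: counts[mode] - op.numYields
      };
    }
  }
  return result;
}

// Trimmed copy of the currentOp() output from the question
// (assumption: NumberLong values replaced by plain numbers).
const op = {
  numYields: 876,
  lockStats: {
    Global: { acquireCount: { r: 877, w: 877 } },
    Database: { acquireCount: { w: 877 } },
    Collection: { acquireCount: { w: 877 } }
  }
};

const summary = summarizeLocks(op);
console.log(JSON.stringify(summary, null, 2));
```

For this sample every counter is exactly numYields + 1. That would be consistent with the locks being reacquired after each yield rather than once per deleted document, but this is a reading of one snapshot, not something the output states directly.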
In my use case, I want to search a document by a given unique string in MongoDB. However, I want my queries to be fast and searching by _id will add some overhead. I want to know if there are any benefits in MongoDB to search a document by _id over any other unique value?
To my knowledge, ObjectIds are similar to any other unique value in a document [point made for the case of searching only].
As for the overhead, you can assume I am caching the string-to-ObjectId mapping and the cache is very small and in memory [almost negligible], though the DB is large.
Analyzing your query performance
I advise you to use .explain(), provided by MongoDB, to analyze your query performance.
Let's say we are trying to execute this query:
db.inventory.find( { quantity: { $gte: 100, $lte: 200 } } )
This would be the result of the query execution
{ "_id" : 2, "item" : "f2", "type" : "food", "quantity" : 100 }
{ "_id" : 3, "item" : "p1", "type" : "paper", "quantity" : 200 }
{ "_id" : 4, "item" : "p2", "type" : "paper", "quantity" : 150 }
If we call .explain() this way:
db.inventory.find(
{ quantity: { $gte: 100, $lte: 200 } }
).explain("executionStats")
It will return the following result:
{
"queryPlanner" : {
"plannerVersion" : 1,
...
"winningPlan" : {
"stage" : "COLLSCAN",
...
}
},
"executionStats" : {
"executionSuccess" : true,
"nReturned" : 3,
"executionTimeMillis" : 0,
"totalKeysExamined" : 0,
"totalDocsExamined" : 10,
"executionStages" : {
"stage" : "COLLSCAN",
...
},
...
},
...
}
More details about this can be found here
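To read such a result quickly, it can help to reduce it to a few numbers. The following is a minimal sketch in plain JavaScript (runnable in Node.js); summarizeExplain is a hypothetical helper, not part of MongoDB, and the input object contains only the fields visible in the output above, with the elided parts left out.

```javascript
// Hypothetical helper: condense an explain("executionStats") result
// into a few numbers that are easy to compare between queries.
function summarizeExplain(explain) {
  const stats = explain.executionStats;
  const stage = explain.queryPlanner.winningPlan.stage;
  return {
    stage: stage,
    isCollectionScan: stage === "COLLSCAN",
    nReturned: stats.nReturned,
    totalKeysExamined: stats.totalKeysExamined,
    totalDocsExamined: stats.totalDocsExamined,
    // Documents read per document returned; 1 would mean no wasted reads.
    docsExaminedPerReturned:
      stats.nReturned > 0 ? stats.totalDocsExamined / stats.nReturned : null
  };
}

// Only the fields shown in the explain output above.
const explainResult = {
  queryPlanner: { plannerVersion: 1, winningPlan: { stage: "COLLSCAN" } },
  executionStats: {
    executionSuccess: true,
    nReturned: 3,
    executionTimeMillis: 0,
    totalKeysExamined: 0,
    totalDocsExamined: 10,
    executionStages: { stage: "COLLSCAN" }
  }
};

const explainSummary = summarizeExplain(explainResult);
console.log(explainSummary);
```

Here 10 documents were examined to return 3 and no index keys were touched, which is the pattern of a collection scan.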
How efficient is search by _id and indexes
To answer your question, a query that can use an index is almost always more efficient than one that cannot. Indexes are special data structures that store a small portion of the collection's data set in an easy-to-traverse form. MongoDB creates a unique index on _id by default, so lookups by _id are served from that index.
Without indexes, MongoDB must perform a collection scan, i.e. scan every document in a collection, to select those documents that match the query statement.
So, YES, using indexes like _id is better!
You can also create your own indexes by using createIndex()
db.collection.createIndex( <key and index type specification>, <options> )
Optimize your MongoDB query
In case you want to optimize your query, there are multiple ways to do that.
Creating custom indexes to support your queries
Limit the Number of Query Results to Reduce Network Demand
db.posts.find().sort( { timestamp : -1 } ).limit(10)
Use Projections to Return Only Necessary Data
db.posts.find( {}, { timestamp : 1 , title : 1 , author : 1 , abstract : 1} ).sort( { timestamp : -1 } )
Use hint() to Select a Particular Index
db.users.find().hint( { age: 1 } )
Short answer, yes _id is the primary key and it's indexed. Of course it's fast.
But you can use an index on the other fields too and get more efficient queries.
I have a collection User in MongoDB. When I do a count on this collection I get 13204951 documents:
> db.User.count()
13204951
But when I tried to find the count of non-stale documents like this, I got a count of 13208778:
> db.User.find({"_id": {$exists: true, $ne: null}}).count()
13208778
> db.User.find({"UserId": {$exists: true, $ne: null}}).count()
13208778
I even tried to get the count of this collection using MongoEngine
user_list = set(User.objects().values_list('UserId'))
len(user_list)
13208778
Here are the indexes of this User collection
> db.User.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "user_db.User"
},
{
"v" : 1,
"unique" : true,
"key" : {
"UserId" : 1
},
"name" : "UserId_1",
"ns" : "user_db.User",
"sparse" : false,
"background" : true
}
]
Any pointers on how to debug the mismatch in counts from different queries?
refer to this document
On a sharded cluster, db.collection.count() can result in an inaccurate count if orphaned documents exist or if a chunk migration is in progress.
Also, refer to this question.
If you are not using a sharded cluster, you can refer to this question.
The basic idea is that db.{collection}.count() without a query may take shortcuts to return a count quickly (such as reading collection metadata), so it might not be accurate; a count() with a query should be accurate.
I use MongoDB for an internal ADMIN type of application used by my team.
MongoDB is installed on 1 box with no replica sets.
The ADMIN application inserts 70K to 100K documents per day and we maintain 4 months of data. The DB has ~100 million documents at any given time.
When the application was deployed, everything ran fine for a few days. As the data accumulated towards the 4-month limit, I started seeing severe performance issues with MongoDB.
I installed MongoDB 3.0.4 as-is on a Linux box and did not fine tune any optimization settings.
Are there any optimization settings I need to adjust?
The ADMIN application has schedulers which run every half hour to insert and purge outdated data. Given the collection below, with indexes defined on createdDate, env, messageId and sourceSystem, I see a few queries taking 30 minutes to respond.
Sample query: a count of documents with a given env and sourceSystem, within a given range of dates. The ADMIN app uses Grails, and the above query is created using GORM. It used to work fine in the beginning, but over time performance degraded. I tried restarting the application as well; it didn't help. I believe using MongoDB as-is (like a dev mode) might be causing the performance issue. Any suggestions on what to tweak in the settings (perhaps CPU/memory limits, etc.)?
{
"_id" : ObjectId("5575e388e4b001976b5e570f"),
"createdDate" : ISODate("2015-06-07T05:00:34.040Z"),
"env" : "prod",
"messageId" : "f684b34d-a480-42a0-a7b8-69d6d18f39e5",
"payload" : "JSON or XML DATA",
"sourceSystem" : "sourceModule"
}
Update:
Indices:
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "admin.Message"
},
{
"v" : 1,
"key" : {
"messageId" : 1
},
"name" : "messageId_1",
"ns" : "admin.Message"
},
{
"v" : 1,
"key" : {
"createdDate" : 1
},
"name" : "createdDate_1",
"ns" : "admin.Message"
},
{
"v" : 1,
"key" : {
"sourceSystem" : 1
},
"name" : "sourceSystem_1",
"ns" : "admin.Message"
},
{
"v" : 1,
"key" : {
"env" : 1
},
"name" : "env_1",
"ns" : "admin.Message"
}
]
I have a MongoDB collection with a lot of indexes.
Would it bring any benefits to delete indexes that are barely used?
Is there any way or tool which can tell me (in numbers) how often an index is used?
EDIT: I'm using version 2.6.4
EDIT2: I'm now using version 3.0.3
Right, so this is how I would do it.
First you need a list of all your indexes for a certain collection (this will be done collection by collection). Let's say we are monitoring the user collection to see which indexes are useless.
So I run a db.user.getIndexes() and this results in a parsable output of JSON (you can run this via command() from the client side as well to integrate with a script).
So you now have a list of your indexes. It is merely a case of understanding which queries use which indexes. If that index is not hit at all you know it is useless.
Now you need to run every query with explain(). From that output you can judge which index is used and match it to an index returned by getIndexes().
So here is a sample output:
> db.user.find({religion:1}).explain()
{
"queryPlanner" : {
"plannerVersion" : 1,
"namespace" : "meetapp.user",
"indexFilterSet" : false,
"parsedQuery" : {
"religion" : {
"$eq" : 1
}
},
"winningPlan" : {
"stage" : "FETCH",
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"religion" : NumberLong(1)
},
"indexName" : "religion_1",
"isMultiKey" : false,
"direction" : "forward",
"indexBounds" : {
"religion" : [
"[1.0, 1.0]"
]
}
}
},
"rejectedPlans" : [ ]
},
"serverInfo" : {
"host" : "ip-172-30-0-35",
"port" : 27017,
"version" : "3.0.0",
"gitVersion" : "a841fd6394365954886924a35076691b4d149168"
},
"ok" : 1
}
The queryPlanner field follows a set of rules that you will need to discover and write code for, but this first case is simple enough.
As you can see, the winning plan (in winningPlan) is a single IXSCAN (index scan); remember there could be multiple, which you will need to code around. The key pattern for the index used is:
"keyPattern" : {
"religion" : NumberLong(1)
},
Great, now we can match that to the key in the output of getIndexes():
{
"v" : 1,
"key" : {
"religion" : NumberLong(1)
},
"name" : "religion_1",
"ns" : "meetapp.user"
},
which tells us that the religion index is not useless and is in fact used.
Unfortunately this is the best way I can see. MongoDB used to have an index stat for the number of times an index was hit, but it seems that data has been removed.
So you would just rinse and repeat this process for every collection you have until you have removed the indexes that are useless.
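The matching step described above can be sketched in a few lines. This is plain JavaScript (runnable in Node.js) working on already-collected getIndexes() and explain() output; collectIndexNames and unusedIndexes are hypothetical helpers, and the assumption that child stages live under inputStage or inputStages is mine, so other plan shapes would need extra handling. The _id_ entry in the sample data is added for illustration only.

```javascript
// Hypothetical helper: walk a winningPlan tree and collect the names of all
// indexes used by IXSCAN stages. Child stages are assumed to sit under
// inputStage (single child) or inputStages (array of children).
function collectIndexNames(stage, found = new Set()) {
  if (!stage) return found;
  if (stage.stage === "IXSCAN" && stage.indexName) found.add(stage.indexName);
  collectIndexNames(stage.inputStage, found);
  (stage.inputStages || []).forEach((child) => collectIndexNames(child, found));
  return found;
}

// Hypothetical helper: given getIndexes() output and a list of explain()
// results, return the names of indexes that none of the queries used.
function unusedIndexes(indexes, explains) {
  const used = new Set();
  for (const e of explains) {
    collectIndexNames(e.queryPlanner.winningPlan, used);
  }
  return indexes.map((ix) => ix.name).filter((name) => !used.has(name));
}

// Sample data modelled on the output above (the _id_ entry is assumed).
const indexes = [
  { v: 1, key: { _id: 1 }, name: "_id_", ns: "meetapp.user" },
  { v: 1, key: { religion: 1 }, name: "religion_1", ns: "meetapp.user" }
];
const explains = [
  {
    queryPlanner: {
      winningPlan: {
        stage: "FETCH",
        inputStage: {
          stage: "IXSCAN",
          keyPattern: { religion: 1 },
          indexName: "religion_1"
        }
      }
    }
  }
];

const unused = unusedIndexes(indexes, explains);
console.log(unused); // names of indexes not hit by the sampled queries
```

With this sample only _id_ is reported, and that index cannot be dropped anyway, so a real script would skip it. The result is only as good as the list of queries fed in: an index used by a query that was never explained will look unused.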
One other way of doing this, of course, is to remove all indexes and then re-add indexes as you test your queries. Though that might be bad if you do need to do this in production.
On a side note: the best way to fix this problem is to not have it at all.
I make this easier for myself by using an indexing function within my active record. Every so often I run (from PHP) something like ./yii index/rebuild, which goes through my active record models, detects which indexes I no longer use and have removed from my app, and removes them in turn. It will, of course, also create new indexes.
I have a collection named App and need to query those active (active: true) apps that belong to a particular user (user_id) or are available to all users (by their _id). I use a query like this:
{
"active" : true,
"$or" : [
{
"user_id" : "111111111111111111111111"
},
{
"_id" : {
"$in" : [
ObjectId("222222222222222222222222"),
ObjectId("333333333333333333333333"),
ObjectId("444444444444444444444444")
]
}
}
]
}
However in db.currentOp(true) I see that this query is running very slowly: lockStats.timeLockedMicros.r is about 3000.
How can I optimize performance of this query? I already have the following indexes on App:
> db.App.getIndexes()
[
{
"v" : 1,
"key" : {
"_id" : 1
},
"name" : "_id_",
"ns" : "mydb.App"
},
{
"v" : 1,
"key" : {
"active" : 1,
"created_at" : -1
},
"name" : "active_1_created_at_-1",
"ns" : "mydb.App",
"background" : true
},
{
"v" : 1,
"key" : {
"active" : 1,
"user_id" : 1
},
"name" : "active_1_user_id_1",
"ns" : "mydb.App",
"background" : true
}
]
Two issues I see here:
1) You do not need an index on the boolean field active, as it has low selectivity and does not benefit query performance.
"If overall selectivity is low, and if MongoDB must read a number of documents to return results, then some queries may perform faster without indexes." source
2) You need an index for user_id because user_id cannot use the compound index you created for active_1_user_id_1
Edit: You can always check index efficiency by running explain(true) and looking at which indexes are used for that query.
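Point 2) rests on the prefix rule for compound indexes: such an index can support queries on a leading subset of its keys, but not on a later key alone. A minimal sketch of that check in plain JavaScript (runnable in Node.js) follows; isIndexPrefix is a hypothetical helper and deliberately ignores everything else the query planner considers, such as sort order, $or branches or index intersection.

```javascript
// Hypothetical helper: true when the queried fields are exactly a leading
// prefix of the index key pattern (field order in the index matters).
function isIndexPrefix(keyPattern, queriedFields) {
  const indexFields = Object.keys(keyPattern);
  const prefix = indexFields.slice(0, queriedFields.length);
  return (
    queriedFields.length > 0 &&
    queriedFields.length <= indexFields.length &&
    prefix.every((field) => queriedFields.includes(field))
  );
}

// Key pattern of active_1_user_id_1 from the question.
const compound = { active: 1, user_id: 1 };

const onActive = isIndexPrefix(compound, ["active"]);
const onBoth = isIndexPrefix(compound, ["active", "user_id"]);
const onUserIdOnly = isIndexPrefix(compound, ["user_id"]);

console.log({ onActive, onBoth, onUserIdOnly });
```

Only the user_id-only case fails the check, which is why a separate index on user_id is suggested. Whether the planner actually picks an index for the full query is still best confirmed with explain().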
I would try to do the following:
Remove all your indexes: your active field has low cardinality (boolean) and does not help you at all, and you are not using created_at, so there is no reason for that index either.
Add an index on the user_id key only.
Change numbers stored as strings to actual numbers.