I have a very large collection (millions of documents) with data which looks like:
{
u'timestamp': 1481454871423.0,
u'_id': ObjectId('584d351772c4d8106cc43116'),
u'data': {
...
},
u'geocode': [{u'providerId': 2, u'placeId': 97459515},
{u'providerId': 3, u'placeId': 237},
{u'providerId': 3, u'placeId': 3}]
}
I want a query that targets a providerId and placeId pair within a timestamp range, and returns only 10 records.
To that end I perform queries like:
{'geocode.providerId': 3,
 'geocode.placeId': 3,
 'timestamp': {'$gte': 1481454868360L,
               '$lt': 1481454954839L}}
And I provide a hint to target the index, which looks like:
[('geocode.providerId', 1), ('geocode.placeId', 1), ('timestamp', 1)]
where 1 is ascending. Before iterating over the returned cursor, it is limited to 10 records and sorted ascending on timestamp (which should be its default order due to the index).
A pymongo query looks like:
collection.find(findDic).hint(hint).sort([('timestamp', pymongo.ASCENDING)]).skip(0).limit(10)
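For completeness, here is roughly how findDic and hint are built for that call (a sketch; the variable names follow the snippet above and the timestamp bounds are the example values from the query):
import pymongo

# 'collection' is the pymongo Collection object from the question.
findDic = {
    'geocode.providerId': 3,
    'geocode.placeId': 3,
    'timestamp': {'$gte': 1481454868360, '$lt': 1481454954839},
}
hint = [('geocode.providerId', pymongo.ASCENDING),
        ('geocode.placeId', pymongo.ASCENDING),
        ('timestamp', pymongo.ASCENDING)]

cursor = (collection.find(findDic)
          .hint(hint)
          .sort([('timestamp', pymongo.ASCENDING)])
          .skip(0)
          .limit(10))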
The query explain comes back looking like:
{
u'executionStats': {
u'executionTimeMillis': 1270,
u'nReturned': 10,
u'totalKeysExamined': 568686,
u'allPlansExecution': [],
u'executionSuccess': True,
u'executionStages': {
u'needYield': 0,
u'saveState': 4442,
u'memUsage': 54359,
u'restoreState': 4442,
u'memLimit': 33554432,
u'isEOF': 1,
u'inputStage': {
u'needYield': 0,
u'saveState': 4442,
u'restoreState': 4442,
u'isEOF': 1,
u'inputStage': {
u'needYield': 0,
u'docsExamined': 284964,
u'saveState': 4442,
u'restoreState': 4442,
u'isEOF': 1,
u'inputStage': {
u'saveState': 4442,
u'isEOF': 1,
u'seenInvalidated': 0,
u'keysExamined': 568686,
u'nReturned': 284964,
u'invalidates': 0,
u'keyPattern': {u'geocode.providerId': 1,
u'timestamp': 1,
u'geocode.placeId': 1},
u'isUnique': False,
u'needTime': 283722,
u'isMultiKey': True,
u'executionTimeMillisEstimate': 360,
u'dupsTested': 568684,
u'restoreState': 4442,
u'direction': u'forward',
u'indexName': u'geocode.providerId_1_geocode.placeId_1_timestamp_1',
u'isSparse': False,
u'advanced': 284964,
u'stage': u'IXSCAN',
u'dupsDropped': 283720,
u'needYield': 0,
u'isPartial': False,
u'indexBounds': {u'geocode.providerId': [u'[3, 3]'],
u'timestamp': [u'[-inf.0, 1481455513378)'],
u'geocode.placeId': [u'[MinKey, MaxKey]']},
u'works': 568687,
u'indexVersion': 1,
},
u'nReturned': 252823,
u'needTime': 315863,
u'filter': {u'$and': [{u'geocode.placeId': {u'$eq': 3}},
{u'timestamp': {u'$gte': 1481405886510L}}]},
u'executionTimeMillisEstimate': 970,
u'alreadyHasObj': 0,
u'invalidates': 0,
u'works': 568687,
u'advanced': 252823,
u'stage': u'FETCH',
},
u'nReturned': 0,
u'needTime': 315864,
u'executionTimeMillisEstimate': 1150,
u'invalidates': 0,
u'works': 568688,
u'advanced': 0,
u'stage': u'SORT_KEY_GENERATOR',
},
u'nReturned': 10,
u'needTime': 568688,
u'sortPattern': {u'timestamp': 1},
u'executionTimeMillisEstimate': 1200,
u'limitAmount': 10,
u'invalidates': 0,
u'works': 568699,
u'advanced': 10,
u'stage': u'SORT',
},
u'totalDocsExamined': 284964,
},
u'queryPlanner': {
u'parsedQuery': {u'$and': [{u'geocode.placeId': {u'$eq': 3}},
{u'geocode.providerId': {u'$eq': 3}},
{u'timestamp': {u'$lt': 1481455513378L}},
{u'timestamp': {u'$gte': 1481405886510L}}]},
u'rejectedPlans': [],
u'namespace': u'mxp957.tweet_244de17a-aa75-4da9-a6d5-97b9281a3b55',
u'winningPlan': {
u'sortPattern': {u'timestamp': 1},
u'inputStage': {u'inputStage': {u'filter': {u'$and': [{u'geocode.placeId': {u'$eq': 3}},
{u'timestamp': {u'$gte': 1481405886510L}}]},
u'inputStage': {
u'direction': u'forward',
u'indexName': u'geocode.providerId_1_geocode.placeId_1_timestamp_1',
u'isUnique': False,
u'isSparse': False,
u'isPartial': False,
u'indexBounds': {u'geocode.providerId': [u'[3, 3]'],
u'timestamp': [u'[-inf.0, 1481455513378)'],
u'geocode.placeId': [u'[MinKey, MaxKey]']},
u'isMultiKey': True,
u'stage': u'IXSCAN',
u'indexVersion': 1,
u'keyPattern': {u'geocode.providerId': 1,
u'timestamp': 1,
u'geocode.placeId': 1},
}, u'stage': u'FETCH'},
u'stage': u'SORT_KEY_GENERATOR'},
u'limitAmount': 10,
u'stage': u'SORT',
},
u'indexFilterSet': False,
u'plannerVersion': 1,
},
u'ok': 1.0,
u'serverInfo': {
u'host': u'rabbit',
u'version': u'3.2.11',
u'port': 27017,
u'gitVersion': u'009580ad490190ba33d1c6253ebd8d91808923e4',
},
}
I don't understand why all of these keys and documents need to be examined. In the case above, the collection only contains 284587 documents, which means every record was effectively looked at twice! I want totalKeysExamined to be only 10, but I am struggling to see how to achieve this.
I am using MongoDB version 3.2.11 with pymongo.
As Astro mentioned, the issue is that MongoDB was not using the index effectively.
The MongoDB team says that this is resolved in later versions:
https://jira.mongodb.org/browse/SERVER-27386
Another option is to remove providerId from the index. In my use case, providerId takes one of only two values and is almost always the same one. It represents which API was used to geocode; my system only supports two, and only one is enabled at any given time.
See the commit that would resolve this:
https://github.com/watfordxp/GeoTweetSearch/commit/420536e4a138fb22e0dd0e61ef9c83c23a9263c1
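For reference, a minimal pymongo sketch of the reduced index (the index name is illustrative; 'collection' is the same handle as in the query above):
import pymongo

# Equality on placeId followed by the timestamp range; dropping providerId
# keeps the bounds on this query tight instead of scanning [MinKey, MaxKey].
collection.create_index(
    [('geocode.placeId', pymongo.ASCENDING),
     ('timestamp', pymongo.ASCENDING)],
    name='geocode.placeId_1_timestamp_1')  # illustrative name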
Related
We set up a MongoDB replica set with 3 nodes (version 3.6). Now we're seeing this exception thrown by the MongoDB client application:
<< ErrorHandlerProcessor >> Query failed with error code 96 and error message 'Executor error during find command: InterruptedDueToReplStateChange: operation was interrupted' on server mongodb-0.mongodb-internal.dp-common-database.svc.cluster.local:27017; nested exception is com.mongodb.MongoQueryException: Query failed with error code 96 and error message 'Executor error during find command: InterruptedDueToReplStateChange: operation was interrupted' on server mongodb-0.mongodb-internal.dp-common-database.svc.cluster.local:27017
After checking the MongoDB server logs, we noticed that a new primary was elected around that time, but we cannot find any errors. Can anyone help point me to the cause, or suggest how to troubleshoot this issue?
Thanks.
And below are the MongoDB logs from the 2 nodes for that period:
mongodb-0
2020-09-30T06:19:32.238+0000 I NETWORK [conn38321] end connection 172.28.42.10:58362 (115 connections now open)
2020-09-30T06:40:15.730+0000 I COMMAND [PeriodicTaskRunner] task: UnusedLockCleaner took: 259ms
2020-09-30T06:40:15.757+0000 I COMMAND [conn38197] command admin.$cmd command: isMaster { ismaster: 1, $db: "admin" } numYields:0 reslen:793 locks:{} protocol:op_msg 107ms
2020-09-30T06:40:15.849+0000 I REPL [replexec-2645] Member mongodb-1.mongodb-internal.dp-common-database.svc.cluster.local:27017 is now in state RS_DOWN
2020-09-30T06:40:15.854+0000 I REPL [replexec-2645] Member mongodb-2.mongodb-internal.dp-common-database.svc.cluster.local:27017 is now in state RS_DOWN
2020-09-30T06:40:15.854+0000 I REPL [replexec-2645] can't see a majority of the set, relinquishing primary
2020-09-30T06:40:15.854+0000 I REPL [replexec-2645] Stepping down from primary in response to heartbeat
2020-09-30T06:40:15.865+0000 I REPL [replexec-2648] Member mongodb-2.mongodb-internal.dp-common-database.svc.cluster.local:27017 is now in state SECONDARY
2020-09-30T06:40:15.873+0000 I REPL [replexec-2649] Member mongodb-1.mongodb-internal.dp-common-database.svc.cluster.local:27017 is now in state SECONDARY
2020-09-30T06:40:15.885+0000 E QUERY [conn38282] Plan executor error during find command: DEAD, stats: { stage: "LIMIT", nReturned: 1, executionTimeMillisEstimate: 20, works: 1, advanced: 1, needTime: 0, needYield: 0, saveState: 0, restoreState: 0, isEOF: 1, invalidates: 0, limitAmount: 1, inputStage: { stage: "FETCH", nReturned: 1, executionTimeMillisEstimate: 20, works: 1, advanced: 1, needTime: 0, needYield: 0, saveState: 0, restoreState: 0, isEOF: 0, invalidates: 0, docsExamined: 1, alreadyHasObj: 0, inputStage: { stage: "IXSCAN", nReturned: 1, executionTimeMillisEstimate: 20, works: 1, advanced: 1, needTime: 0, needYield: 0, saveState: 0, restoreState: 0, isEOF: 0, invalidates: 0, keyPattern: { consumer: 1.0, channel: 1.0, externalTransactionId: -1.0 }, indexName: "lastsequence", isMultiKey: false, multiKeyPaths: { consumer: [], channel: [], externalTransactionId: [] }, isUnique: false, isSparse: false, isPartial: false, indexVersion: 2, direction: "forward", indexBounds: { consumer: [ "["som", "som"]" ], channel: [ "["normalChannelMobile", "normalChannelMobile"]" ], externalTransactionId: [ "[MaxKey, MinKey]" ] }, keysExamined: 1, seeks: 1, dupsTested: 0, dupsDropped: 0, seenInvalidated: 0 } } }
2020-09-30T06:40:15.887+0000 I REPL [replexec-2645] transition to SECONDARY from PRIMARY
mongodb-1
September 30th 2020, 14:38:40.871 2020-09-30T06:38:40.871+0000 I REPL [replication-343] Canceling oplog query due to OplogQueryMetadata. We have to choose a new sync source. Current source: mongodb-0.mongodb-internal.dp-common-database.svc.cluster.local:27017, OpTime { ts: Timestamp(1601448015, 5), t: 19 }, its sync source index:-1
2020-09-30T06:38:40.871+0000 I REPL [replication-343] Choosing new sync source because our current sync source, mongodb-0.mongodb-internal.dp-common-database.svc.cluster.local:27017, has an OpTime ({ ts: Timestamp(1601448015, 5), t: 19 }) which is not ahead of ours ({ ts: Timestamp(1601448015, 5), t: 19 }), it does not have a sync source, and it's not the primary (sync source does not know the primary)
[replexec-7147] Starting an election, since we've seen no PRIMARY in the past 10000ms
This post describes my problem too, but no one has answered it. I have a model in MongoDB with several fields, one of which is named synced and is a boolean field.
With the code below I want to update one document's "synced" field (or, with some minor changes, more than one):
db.bulk_write([UpdateOne({'content_id': '6101-1514301051'}, {'$set': {'synced': True}})]).bulk_api_result
But the result says that one matching doc was found and nothing was updated:
{'writeErrors': [],
'writeConcernErrors': [],
'nInserted': 0,
'nUpserted': 0,
'nMatched': 1,
'nModified': 0,
'nRemoved': 0,
'upserted': []}
And when I do the same with another field, I get the expected result and the update is applied successfully:
db.bulk_write([UpdateOne({'content_id': '6101-1514301051'}, {'$set': {'visit': 14201}})]).bulk_api_result
{'writeErrors': [],
'writeConcernErrors': [],
'nInserted': 0,
'nUpserted': 0,
'nMatched': 1,
'nModified': 1,
'nRemoved': 0,
'upserted': []}
What is wrong with my code, and how can I apply the changes to the synced field?
MORE INFO:
I also tried the same thing in the mongo shell:
> db.my_model_name.updateOne({"content_id": "6101-1514301051"}, {$set: {"synced": true}})
{ "acknowledged" : true, "matchedCount" : 1, "modifiedCount" : 0 }
or with
> db.telegram_post_model.update({"content_id": "6101-1514301051"}, {$set: {"synced": true}})
WriteResult({ "nMatched" : 1, "nUpserted" : 0, "nModified" : 0 })
But I still cannot update the document.
The following code works for me:
from pymongo import MongoClient
db = MongoClient("mongodb://login:password@server:port/database_name").get_default_database()
db.telegram_post_model.update_one({"content_id": "6101-1514301051"}, {"$set": {"synced": True}})
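If the goal is to flip synced on more than one document at once (as the question hints), a bulk UpdateMany along these lines should also work; the connection string and the filter are illustrative:
from pymongo import MongoClient, UpdateMany

db = MongoClient("mongodb://login:password@server:port/database_name").get_default_database()

# nModified counts only documents whose stored value actually changed;
# documents that already have synced == True are matched but not modified.
result = db.telegram_post_model.bulk_write(
    [UpdateMany({"synced": {"$ne": True}}, {"$set": {"synced": True}})])
print(result.bulk_api_result)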
{
"_id" : ObjectId("5488303649f2012be0901e97"),
"user_id":3,
"my_shopping_list" : {
"books" : [ ]
},
"my_library" : {
"books" : [
{
"date_added" : ISODate("2014-12-10T12:03:04.062Z"),
"tag_text" : [
"english"
],
"bdata_product_identifier" : "a1",
"tag_id" : [
"fa7ec571-4903-4aed-892a-011a8a411471"
]
},
{
"date_added" : ISODate("2014-12-10T12:03:08.708Z"),
"tag_text" : [
"english",
"hindi"
],
"bdata_product_identifier" : "a2",
"tag_id" : [
"fa7ec571-4903-4aed-892a-011a8a411471",
"60733993-6b54-420c-8bc6-e876c0e196d6"
]
}
]
},
"my_wishlist" : {
"books" : [ ]
},
}
Here I would like to remove only "english" from every tag_text array under my_library, using only user_id and tag_text. This document belongs to user_id: 3. I have tried some queries, but they delete an entire book sub-document. Thank you.
Since you are using pymongo, and MongoDB doesn't provide a nice way of doing this (the positional $ operator will only pull "english" from the first matching subdocument), why not write a script that removes "english" from every tag_text and then updates your document.
Demo:
>>> doc = yourcollection.find_one(
...     {'user_id': 3, 'my_library.books': {'$exists': True}},
...     {'_id': 0, 'user_id': 0})
>>> books = doc['my_library']['books']  # the books array in your doc
>>> new_books = []
>>> for book in books:
...     if 'tag_text' in book and 'english' in book['tag_text']:
...         book['tag_text'].remove('english')
...     new_books.append(book)
...
>>> new_books
[{'date_added': datetime.datetime(2014, 12, 10, 12, 3, 4, 62000), 'tag_text': [], 'bdata_product_identifier': 'a1', 'tag_id': ['fa7ec571-4903-4aed-892a-011a8a411471']}, {'date_added': datetime.datetime(2014, 12, 10, 12, 3, 8, 708000), 'tag_text': ['hindi'], 'bdata_product_identifier': 'a2', 'tag_id': ['fa7ec571-4903-4aed-892a-011a8a411471', '60733993-6b54-420c-8bc6-e876c0e196d6']}]
>>> yourcollection.update_one({'user_id': 3}, {'$set': {'my_library.books': new_books}})
Check that everything worked fine:
>>> yourcollection.find_one({'user_id' : 3})
{'user_id': 3.0, '_id': ObjectId('5488303649f2012be0901e97'), 'my_library': {'books': [{'date_added': datetime.datetime(2014, 12, 10, 12, 3, 4, 62000), 'tag_text': [], 'bdata_product_identifier': 'a1', 'tag_id': ['fa7ec571-4903-4aed-892a-011a8a411471']}, {'date_added': datetime.datetime(2014, 12, 10, 12, 3, 8, 708000), 'tag_text': ['hindi'], 'bdata_product_identifier': 'a2', 'tag_id': ['fa7ec571-4903-4aed-892a-011a8a411471', '60733993-6b54-420c-8bc6-e876c0e196d6']}]}, 'my_shopping_list': {'books': []}, 'my_wishlist': {'books': []}}
One possible solution could be to repeat
db.collection.update({user_id: 3, "my_library.books.tag_text": "english"}, {$pull: {"my_library.books.$.tag_text": "english"}})
until MongoDB can no longer match a document to update.
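A minimal pymongo sketch of that repeat-until-no-match loop (connection details are illustrative):
from pymongo import MongoClient

yourcollection = MongoClient()['test']['yourcollection']  # names are illustrative

# Each pass pulls "english" from the first book whose tag_text still contains
# it; stop once no document matches any more.
while True:
    result = yourcollection.update_one(
        {'user_id': 3, 'my_library.books.tag_text': 'english'},
        {'$pull': {'my_library.books.$.tag_text': 'english'}})
    if result.matched_count == 0:
        break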
I have about 1000000 documents in a collection (randomly generated).
Sample document:
{
"loc": {
"lat": 39.828475,
"lon": 116.273542
},
"phone": "",
"keywords": [
"big",
"small",
"biggest",
"smallest"
],
"prices": [
{
"category": "uRgpiamOVTEQ",
"list": [
{
"price": 29,
"name": "ehLYoPpntlil"
}
]
},
{
"category": "ozQNmdwpwhXPnabZ",
"list": [
{
"price": 96,
"name": "ITTaLHf"
},
{
"price": 88,
"name": "MXVgJFBgtwLYk"
}
]
},
{
"category": "EDkfKGZSou",
"list": [
{
"price": 86,
"name": "JluoCLnenOqDllaEX"
},
{
"price": 35,
"name": "HbmgMDfxCOk"
},
{
"price": 164,
"name": "BlrUD"
},
{
"price": 106,
"name": "LOUcsMDeaqVm"
},
{
"price": 14,
"name": "rDkwhN"
}
]
}
],
}
Search without indexes
> db.test1.find({"prices.list.price": { $gt: 190 } }).explain()
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 541098,
"nscannedObjects" : 1005584,
"nscanned" : 1005584,
"nscannedObjectsAllPlans" : 1005584,
"nscannedAllPlans" : 1005584,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 8115,
"nChunkSkips" : 0,
**"millis" : 13803,**
"server" : "localhost:27017",
"filterSet" : false
}
With indexes:
> db.test1.ensureIndex({"prices.list.price":1,"prices.list.name":1})
{
"createdCollectionAutomatically" : false,
"numIndexesBefore" : 1,
"numIndexesAfter" : 2,
"ok" : 1
}
> db.test1.find({"prices.list.price": { $gt: 190 } }).explain()
{
"cursor" : "BtreeCursor prices.list.price_1_prices.list.name_1",
"isMultiKey" : true,
"n" : 541098,
"nscannedObjects" : 541098,
"nscanned" : 868547,
"nscannedObjectsAllPlans" : 541098,
"nscannedAllPlans" : 868547,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 16852,
"nChunkSkips" : 0,
**"millis" : 66227,**
"indexBounds" : {
"menu.list.price" : [
[
190,
Infinity
]
],
"menu.list.name" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "localhost:27017",
"filterSet" : false
}
Any idea why the indexed search is slower than the search without an index?
I will also use:
db.test1.find( { loc : { $near : [39.876045, 32.862245]}}) (needs a 2d index)
db.test1.find({ keywords:{$in: [ "small", "another" ] }}) (uses an index on keywords)
db.test1.find({"prices.list.name":/.s./ }) (no need for an index because I will use a regex)
An index allows faster access to the locations of the documents that satisfy the query.
In your example, the query selects half of all the documents in the collection. So even though the index scan is quicker at identifying which documents will match the query predicate, it actually creates a lot more work overall.
In a collection scan, the query scans all of the documents and checks the queried field to see whether it matches. Half the time it ends up selecting the document.
In an index scan, the query traverses roughly half of all the index entries and then jumps from them directly to the documents that satisfy the query predicate. That's more operations in your case.
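Concretely, from the two explain outputs above: the collection scan examines 1,005,584 documents, while the indexed plan examines 868,547 index keys and then fetches 541,098 documents, roughly 1.4 million operations in total.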
In addition, while doing this, the operations yield the read mutex whenever they need to wait for a document to be brought into RAM, or when a write is waiting to go, and the index scan shows double the number of yields of the collection scan. If you don't have enough RAM for your working set, then adding an index puts more pressure on the existing resources and makes things slower rather than faster.
Try the same query with price compared to a much larger number, like 500 (or whatever would be a lot more selective in your data set). If the query is still slower with an index, then you are likely seeing a lot of page faulting on the system. But if there is enough RAM for the index, then the indexed query will be a lot faster while the unindexed query will be just as slow.
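For example, the pymongo equivalent of that suggestion (the 500 threshold is just the illustrative figure from above):
from pymongo import MongoClient

db = MongoClient()['test']  # database name is illustrative

# With the more selective predicate the index scan touches far fewer keys,
# so the indexed plan should now beat the collection scan.
print(db.test1.find({'prices.list.price': {'$gt': 500}}).explain())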
First, as a suggestion: you will get faster results when querying arrays with $elemMatch.
http://docs.mongodb.org/manual/reference/operator/query/elemMatch/
In your case
db.test1.find({"prices.list.price":{ $elemMatch: { $gte: 190 }} })
The second thing is:
To index a field that holds an array value, MongoDB adds index items
for each item in the array. These multikey indexes allow MongoDB to
return documents from queries using the value of an array. MongoDB
automatically determines whether to create a multikey index if the
indexed field contains an array value; you do not need to explicitly
specify the multikey type.
Consider the following illustration of a multikey index:
[Diagram of a multikey index on the addr.zip field: the addr field contains an array of address documents, and each address document contains a zip field.]
Multikey indexes support all operations supported by other MongoDB
indexes; however, applications may use multikey indexes to select
documents based on ranges of values for the value of an array.
Multikey indexes support arrays that hold both values (e.g. strings,
numbers) and nested documents.
from http://docs.mongodb.org/manual/core/index-multikey/
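In pymongo terms, a minimal sketch (the field name follows the sample document; MongoDB marks the index as multikey automatically because prices.list.price sits inside arrays):
import pymongo
from pymongo import MongoClient

db = MongoClient()['test']  # database name is illustrative

# No special multikey flag is needed: MongoDB detects the array values under
# prices.list and creates one index key per array element.
db.test1.create_index([('prices.list.price', pymongo.ASCENDING)])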
Please look at the following example. It seems to me that the query should be covered by the index {a: 1}, however explain() gives me indexOnly: false. What am I doing wrong?
> db.foo.save({a: 1, b: 2});
> db.foo.save({a: 2, b: 3});
> db.foo.ensureIndex({a: 1});
> db.foo.find({a: 1}).explain();
{
"cursor" : "BtreeCursor a_1",
"nscanned" : 6,
"nscannedObjects" : 6,
"n" : 6,
"millis" : 0,
"nYields" : 0,
"nChunkSkips" : 0,
"isMultiKey" : false,
"indexOnly" : false,
"indexBounds" : {
"a" : [
[
1,
1
]
]
}
}
indexOnly denotes a covered query (http://docs.mongodb.org/manual/applications/indexes/#indexes-covered-queries), whereby the query, its sort, and the returned data can all be satisfied from a single index.
The problem with your query:
db.foo.find({a: 1}).explain();
is that it must retrieve the full document, which means it cannot find all of the data within the index. Instead you can use:
db.foo.find({a: 1}, {_id:0,a:1}).explain();
which projects only the a field (and excludes _id), so the entire query fits into the index and indexOnly becomes true.
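A pymongo equivalent, in case that is how you are running it (database and collection names are illustrative):
from pymongo import MongoClient

foo = MongoClient()['test']['foo']  # database/collection names are illustrative

# Project only the indexed field and exclude _id so the query can be answered
# entirely from the {a: 1} index, i.e. a covered query.
print(foo.find({'a': 1}, {'_id': 0, 'a': 1}).explain())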