Speeding up $or query in pymongo - mongodb

I have a collection of 1.8 billion records stored in mongodb, where each record looks like this:
{
"_id" : ObjectId("54c1a013715faf2cc0047c77"),
"service_type" : "JE",
"receiver_id" : NumberLong("865438083645"),
"time" : ISODate("2012-12-05T23:07:36Z"),
"duration" : 24,
"service_description" : "NQ",
"receiver_cell_id" : null,
"location_id" : "658_55525",
"caller_id" : NumberLong("475035504705")
}
I need to get all the records for 2 million specific users (I have the users of interest id in a text file) and process it before I write the results to a database. I have indices on the receiver_id and on caller_id (each is part of a single index).
The current procedure I have is as the following:
for user in list_of_2million_users:
user_records = collection.find({ "$or" : [ { "caller_id": user }, { "receiver_id" : user } ] })
for record in user_records:
process(record)
However, it takes 15 seconds on average to consume the user_records cursor (the process function is very simple with low running time). This will not be feasible to process 2 million users. Any suggestions to speed up the $or query? as it seems to be the most time-consuming step.
db.call_records.find({ "$or" : [ { "caller_id": 125091840205 }, { "receiver_id" : 125091840205 } ] }).explain()
{
"clauses" : [
{
"cursor" : "BtreeCursor caller_id_1",
"isMultiKey" : false,
"n" : 401,
"nscannedObjects" : 401,
"nscanned" : 401,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"caller_id" : [
[
125091840205,
125091840205
]
]
}
},
{
"cursor" : "BtreeCursor receiver_id_1",
"isMultiKey" : false,
"n" : 383,
"nscannedObjects" : 383,
"nscanned" : 383,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"receiver_id" : [
[
125091840205,
125091840205
]
]
}
}
],
"cursor" : "QueryOptimizerCursor",
"n" : 784,
"nscannedObjects" : 784,
"nscanned" : 784,
"nscannedObjectsAllPlans" : 784,
"nscannedAllPlans" : 784,
"scanAndOrder" : false,
"nYields" : 753,
"nChunkSkips" : 0,
"millis" : 31057,
"server" : "some_server:27017",
"filterSet" : false
}
And this is the collection stats:
db.call_records.stats()
{
"ns" : "stc_cdrs.call_records",
"count" : 1825338618,
"size" : 438081268320,
"avgObjSize" : 240,
"storageSize" : 468641284752,
"numExtents" : 239,
"nindexes" : 3,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 165290709024,
"indexSizes" : {
"_id_" : 73450862016,
"caller_id_1" : 45919923504,
"receiver_id_1" : 45919923504
},
"ok" : 1
}
I am running Ubuntu server with 125GB of RAM.
Note that I will run this analysis only once (not periodic thing I will do).

If the indices on caller_id and receiver_id are a single compound index, this query will do a collection scan instead of an index scan. Make sure they are both part of a separate index, i.e.:
db.user_records.ensureIndex({caller_id:1})
db.user_records.ensureIndex({receiver_id:1})
You can confirm that your query is doing an index scan in the mongo shell:
db.user_records.find({'$or':[{caller_id:'example'},{receiver_id:'example'}]}).explain()
If the explain plan returns its cursor type as BTreeCursor, you're using an index scan. If it says BasicCursor, you're doing a collection scan which is not good.
It would also be interesting to know the size of each index. For best query performances, both indices should be completely loaded into RAM. If the indices are so large that only one (or neither!) of them fit into RAM, you will have to page them in from disk to look up the results. If they're too big to fit in your RAM, your options are not too great, basically either splitting up your collection in some manner and re-indexing it, or getting more RAM. You could always get an AWS RAM-heavy instance just for the purpose of this analysis, since this is a one-off thing.

I am no expert in MongoDB, though I had the similar problem & following solutions helped me tackle the problem. Hope it helps you too.
Query is using indexes and scanning exact documents, so there are no issues with your indexing, though I'll suggest you to:
First of all try to see the status of command: mongostat --discover
See for the parameters such as page faults & index miss.
Have you tried warming up (performance of query after executing query for first)? What's the performance after warming up? If it's same as the previous one there might be page faults.
If you are going to run it as an analysis I think warming up the database might help you.

I don't know why your approach is so slow.
But you might want to try these alternative approaches:
Use $in with many ids at once. I'm not sure if mongodb handles millions of values well, but if it does not, sort the list of IDs and then split it into batches.
Do a collection scan in the application and check each entry against a hashset containing the interesting IDs. Should have acceptable performance for a one-off script, especially since you're interested in so many IDs.

Related

Getting rid of _id in mongodb collection

I know it is not possible to remove the _id field in a mongodb collection. However, the size of my collections is large, that the index on the _id field prevents me from loading the other indices in the RAM. My machine has 125GB of RAM and my collection stats is as follows:
db.call_records.stats()
{
"ns" : "stc_cdrs.call_records",
"count" : 1825338618,
"size" : 438081268320,
"avgObjSize" : 240,
"storageSize" : 468641284752,
"numExtents" : 239,
"nindexes" : 3,
"lastExtentSize" : 2146426864,
"paddingFactor" : 1,
"systemFlags" : 0,
"userFlags" : 1,
"totalIndexSize" : 165290709024,
"indexSizes" : {
"_id_" : 73450862016,
"caller_id_1" : 45919923504,
"receiver_id_1" : 45919923504
},
"ok" : 1
}
When I do a query like the following:
db.call_records.find({ "$or" : [ { "caller_id": 125091840205 }, { "receiver_id" : 125091840205 } ] }).explain()
{
"clauses" : [
{
"cursor" : "BtreeCursor caller_id_1",
"isMultiKey" : false,
"n" : 401,
"nscannedObjects" : 401,
"nscanned" : 401,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"caller_id" : [
[
125091840205,
125091840205
]
]
}
},
{
"cursor" : "BtreeCursor receiver_id_1",
"isMultiKey" : false,
"n" : 383,
"nscannedObjects" : 383,
"nscanned" : 383,
"scanAndOrder" : false,
"indexOnly" : false,
"nChunkSkips" : 0,
"indexBounds" : {
"receiver_id" : [
[
125091840205,
125091840205
]
]
it takes more than 15 seconds on average to return the results. The indices for both caller_id and receiver_id should be around 90GB, which is OK. However, the 73GB index on the _id makes this query very slow.
You correctly told that you can not remove _id field from your document. You also can not remove an index from this field, so this is something you have to live with.
For some reason you start with the assumption that _id index makes your query slow, which is completely unjustifiable and most probably is wrong. This index is not used and just stays there untouched.
Few things I would try to do in your situation:
You have 400 billion documents in your collection, have you thought that this is a right time to start sharding your database? In my opinion you should.
use explain with your query to actually figure out what slows it down.
Looking at your query, I would also try to do the following:
change your document from
{
... something else ...
receiver_id: 234,
caller_id: 342
}
to
{
... something else ...
participants: [342, 234]
}
where your participants are [caller_id, receiver_id] in this order, then you can put only one index on this field. I know that it will not make your indices smaller, but I hope that because you will not use $or clause, you will get results faster. P.S. if you will do this, do not do this in production, test whether it give you a significant improvement and only then change in prod.
There are a lot of potential issues here.
The first is that your indexes do not include all of the data returned. This means Mongo is getting the _id from the index and then using the _id to retrieve and return the document in question. So removing the _id index, even if you could, would not help.
Second, the query includes an OR. This forces Mongo to load both indexes so that it can read them and then retrieve the documents in question.
To improve performance, I think you have just a few choices:
Add the additional elements to the indexes and restrict the data returned to what is available in the index (this would change indexOnly = true in the explain results)
Explore sharding as Skooppa.com mentioned.
Rework the query and/or the document to eliminate the OR condition.

MongoDB fulltext search not using index

We use mongoDB fulltext search to find products in our database.
Unfortunately it is incredible slow.
The collection contains 89.114.052 documents and I have the suspicion, that the full text index is not used.
Performing a search with explain(), nscannedObjects returns 133212.
Shouldn't this be 0 if an index is used?
My index:
{
"v" : 1,
"key" : {
"_fts" : "text",
"_ftsx" : 1
},
"name" : "textIndex",
"ns" : "search.products",
"weights" : {
"brand" : 1,
"desc" : 1,
"ean" : 1,
"name" : 3,
"shop_product_number" : 1
},
"default_language" : "german",
"background" : false,
"language_override" : "language",
"textIndexVersion" : 2
}
The complete test search:
> db.products.find({ $text: { $search: "playstation" } }).limit(100).explain()
{
"cursor" : "TextCursor",
"n" : 100,
"nscannedObjects" : 133212,
"nscanned" : 133212,
"nscannedObjectsAllPlans" : 133212,
"nscannedAllPlans" : 133212,
"scanAndOrder" : false,
"nYields" : 1041,
"nChunkSkips" : 0,
"millis" : 105,
"server" : "search2:27017",
"filterSet" : false
}
Please have a look at the question you asked:
".... The collection contains 89.114.052 documents and I have the suspicion, that the full text index is not used ...."
You are only "nScanned" for 133212 documents. Of course the index is used. If it was not then 89,114,052 documents ( because this is English locale and not German ) would have otherwise been reported in "nScanned" which means an index is not used.
Your query is slow. Well it seems your hardware is not up to the task of keeping 1333212 documents in memory or otherwise having the super fast disk to "page" effectively. But this is not a MongoDB problem but yours.
You have over 100,000 documents that match your query and even if you just want 100 then you need to accept this is how this works and MongoDB does not "give up" once you have matched 100 documents and yield control. The query pattern here finds all of the matches and then applies the "limit" to the cursor in order just to return the most recent.
Maybe some time in the future the "text" functionality might allow you do do things like you can do in the aggregate version of $geoNear and specify "minimum" and "maximum" values for a "score" in order to improve results. But right now it does not.
So either upgrade your hardware or use an external text search solution if your problem is the slow results on matching over 100,000 documents out of over 89,000,000 documents.

Incorrect items count with specific index usage

I'm using MongoDB, version 2.4.8 on windows server 2008 R2 and I have strange index behaviour which I can't explain. Here example of structure that I have in my collection:
{
"_id" : NUUID("67070100-4627-4aa5-8ab9-45624e5b82ad"),
"PropertyType" : "Cooperative",
"Address" : {
"Street" : "aaaaaaaaa",
"HouseNo" : "165",
"PostalCode" : 2860,
"City" : "bbbbb",
"Floor" : "1",
"DoorNumber" : ""
},
"Sales" : {
"Price" : 425000,
"Payout" : 0,
"AreaPrice" : 9042,
"GrossPrice" : 2340,
"NetPrice" : 800,
},
"WithdrawnFromSale" : true,
"UnitData" : {
"UnitType" : "aaaaa",
"Area" : 400,
"LivingArea" : 50,
"UnitArea" : 50,
"Rooms" : 2,
"BuildYear" : 1948,
"GroundArea" : 203,
"NoiseLevel" : 5
}
}
Also, I've created index for that collection:
db["UnitModel"].ensureIndex({ "Sales": 1, "PropertyType": 1, "UnitData.Rooms": 1, "UnitData.NoiseLevel": 1 })
The problem with that index is that I get wrong count of items when using this index.
When I issue this request:
db.UnitModel.find({Sales: {$ne: null}, WithdrawnFromSale: false}).explain({verbose: true})
I get following results:
{
"cursor" : "BtreeCursor Sales_1_PropertyType_1_UnitData.Rooms_1_UnitData.NoiseLevel_1 multi",
"isMultiKey" : false,
"n" : 19368,
"nscannedObjects" : 42875,
"nscanned" : 42876,
"nscannedObjectsAllPlans" : 43274,
"nscannedAllPlans" : 43276,
"scanAndOrder" : false,
"indexOnly" : false,
....
}
Here we can see that index has been used, but the number of items returned is "n" : 19368. which is wrong.
It should be 70986 items in collection with that criteria.
Why am I sure that it should be more records? Well, here the code:
var totalCount = 0;
db.UnitModel.find({WithdrawnFromSale: false}).forEach(
function (e) {
if(e.hasOwnProperty('Sales') && e.Sales != null)
totalCount++;
}
)
totalCount;
totalCount = 70986
To be sure that query above do not use any indexes let's check it out:
db.UnitModel.find({WithdrawnFromSale: false}).explain({verbose: true})
And result:
{
"cursor" : "BasicCursor",
"isMultiKey" : false,
"n" : 70986,
"nscannedObjects" : 3204212,
"nscanned" : 3204212,
"nscannedObjectsAllPlans" : 3204212,
"nscannedAllPlans" : 3204212,
"scanAndOrder" : false,
"indexOnly" : false,
....
}
So, for UnitModel collection I'm using, for criteria: Sales: {$ne: null}, WithdrawnFromSale: false it should be 70986 records returned by mongo. But as you can see I get it wrong.
Can someone explain me why? What can be the reason?
BTW. When I drop that index and use following index:
db["UnitModel"].ensureIndex({ "WithdrawnFromSale": 1})
it works as expected. But I do not need that index, it's not optimzal for my case.
As at MongoDB 2.4, the maximum size of an indexed value is 1024 bytes. The current behaviour for a key too large to index is to log a warning on the server side -- but this does not throw an exception. In this case, documents with excessively long keys will not be included in the index when the key is too long, but will be included in other indexes. This can lead to inconsistencies in results such as incorrect counts and "missing documents" that cannot be found by one index but may be available in another index or with a $natural search.
In the MongoDB 2.5 development/unstable branch (which will culminate in the MongoDB 2.6 production release later this year) this behaviour has changed. As at MongoDB 2.5.5, an exception will now be raised if a insert/update includes an index update where the keys would be too large. See SERVER-5290 in the MongoDB issue tracker for more details.
Figure out what the reason of the issue. When I look in log files for monogodb I have seen tons of following messages:
HBReadModel.system.indexes Btree::insert: key too large to index, skipping HBReadModel.UnitModel.$Sales_1_WithdrawnFromSale_1_PropertyType_1_UnitData.Rooms_1_UnitData.NoiseLevel_1
I was trying to create index on sales field which in actually document and not field. To avoid this I just re-created index and specify field inside Sales document. Log is clear, query returns records as expected.

Efficiently sorting the results of a mongodb geospatial query

I have a very large collection of documents like:
{ loc: [10.32, 24.34], relevance: 0.434 }
and want to be able efficiently do a query like:
{ "loc": {"$geoWithin":{"$box":[[-103,10.1],[-80.43,30.232]]}} }
with arbitrary boxes.
Adding an 2d index on loc makes this very fast and efficient. However, I want to now also just get the most relevant documents:
.sort({ relevance: -1 })
Which causes everything to grind to a crawl (there can be huge amount of results in any particular box, and I just need the top 10 or so).
Any advise or help greatly appreciated!!
Have you tried using the aggregation framework?
A two stage pipeline might work:
a $match stage that uses your existing $geoWithin query.
a $sort stage that sorts by relevance: -1
Here's an example of what it might look like:
db.foo.aggregate(
{$match: { "loc": {"$geoWithin":{"$box":[[-103,10.1],[-80.43,30.232]]}} }},
{$sort: {relevance: -1}}
);
I'm not sure how it will perform. However, even if it's poor with MongoDB 2.4, it might be dramatically different in 2.6/2.5, as 2.6 will include improved aggregation sort performance.
When there is a huge result matching particular box, sort operation is really expensive so that you definitely want to avoid it.
Try creating separate index on relevance field and try using it (without 2d index at all): the query will be executed much more efficiently that way - documents (already sorted by relevance) will be scanned one by one matching the given geo box condition. When top 10 are found, you're good.
It might not be that fast if geo box matches only small subset of the collection, though. In worst case scenario it will need to scan through the whole collection.
I suggest you to create 2 indexes (loc vs. relevance) and run tests on queries which are common in your app (using mongo's hint to force using needed index).
Depending on your tests results, you may even want to add some app logic so that if you know the box is huge you can run the query with relevance index, otherwise use loc 2d index. Just a thought.
You cannot have the scan and order value as 0 when you trying to use to have sorting on the part of a compound key. Unfortunately currently there is no solution for your problem which is not related to the phenomenon that you are using a 2d index or else.
When you run an explain command on your query the value of "scanAndOrder" show weather it was needed to have a sorting phase after collecting the result or not.If it is true a sorting after the querying was needed, if it is false sorting was not needed.
To test the situation i created a collection called t2 in a sample db this way:
db.createCollection('t2')
db.t2.ensureIndex({a:1})
db.t2.ensureIndex({b:1})
db.t2.ensureIndex({a:1,b:1})
db.t2.ensureIndex({b:1,a:1})
for(var i=0;i++<200;){db.t2.insert({a:i,b:i+2})}
While you can use only 1 index to support a query i did the following test with the results included:
mongos> db.t2.find({a:{$gt:50}}).sort({b:1}).hint("b_1").explain()
{
"cursor" : "BtreeCursor b_1",
"isMultiKey" : false,
"n" : 150,
"nscannedObjects" : 200,
"nscanned" : 200,
"nscannedObjectsAllPlans" : 200,
"nscannedAllPlans" : 200,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"b" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "localhost:27418",
"millis" : 0
}
mongos> db.t2.find({a:{$gt:50}}).sort({b:1}).hint("a_1_b_1").explain()
{
"cursor" : "BtreeCursor a_1_b_1",
"isMultiKey" : false,
"n" : 150,
"nscannedObjects" : 150,
"nscanned" : 150,
"nscannedObjectsAllPlans" : 150,
"nscannedAllPlans" : 150,
"scanAndOrder" : true,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 1,
"indexBounds" : {
"a" : [
[
50,
1.7976931348623157e+308
]
],
"b" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
]
},
"server" : "localhost:27418",
"millis" : 1
}
mongos> db.t2.find({a:{$gt:50}}).sort({b:1}).hint("a_1").explain()
{
"cursor" : "BtreeCursor a_1",
"isMultiKey" : false,
"n" : 150,
"nscannedObjects" : 150,
"nscanned" : 150,
"nscannedObjectsAllPlans" : 150,
"nscannedAllPlans" : 150,
"scanAndOrder" : true,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 1,
"indexBounds" : {
"a" : [
[
50,
1.7976931348623157e+308
]
]
},
"server" : "localhost:27418",
"millis" : 1
}
mongos> db.t2.find({a:{$gt:50}}).sort({b:1}).hint("b_1_a_1").explain()
{
"cursor" : "BtreeCursor b_1_a_1",
"isMultiKey" : false,
"n" : 150,
"nscannedObjects" : 150,
"nscanned" : 198,
"nscannedObjectsAllPlans" : 150,
"nscannedAllPlans" : 198,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 0,
"nChunkSkips" : 0,
"millis" : 0,
"indexBounds" : {
"b" : [
[
{
"$minElement" : 1
},
{
"$maxElement" : 1
}
]
],
"a" : [
[
50,
1.7976931348623157e+308
]
]
},
"server" : "localhost:27418",
"millis" : 0
}
The indexes on individual fields does not help much so a_1 (not support sorting) and b_1 (not support queryin) is out . The index on a_1_b_1 also not fortunate while it will perform worse than the single a_1, mongoDB engine will not utilize the situation that the part related to one 'a' value stored ordered this way. What is worth to try is a compound index b_1_a_1 which in your case relevance_1_loc_1 while it will return the results in ordered manner so scanAndOrder will be false and i have not tested for 2d index but i assume it will exclude scanning some documents based on just the index value (that is why in the test in that case the nscanned is higher than nscannedObjects). The index unfortunately will be huge but still smaller than the docs.
This solution is valid if you need to search inside a box(rectangle).
The problem with geospatial index is that you can only place it in the front of a Compound Index (at least it is so for mongo 3.2)
So I thought why not to create my own "geospatial" index? All I need is to create a Compound Index on Lat, Lgn (X,Y) and add the sort field at the first place. Then I'll need to implement the logic of searching inside the box boundaries and specifically instruct mongo to use it (hint).
Translating to your problem:
db.collection.createIndex({ "relevance": 1, "loc_x": 1, "loc_y": 1 }, { "background": true } )
Logic:
db.collection.find({
"loc_x": { "$gt": -103, "$lt": -80.43 },
"loc_y": { "$gt": 10.1, "$lt": 30.232 }
}).hint("relevance_1_loc_x_1_loc_y_1") // or whatever name you gave it
Use $gte and $lte if you need inclusive results.
And you don't need to use .sort() since it's already sorted, or you can do a reverse sort on relevance if you need.
The only issue that I encountered with it is when the box area is small. It took more time to find small areas than large ones. That is why I kept the geospatial index for small area searches.

Indexing with mongodb: bad performance / indexOnly=false

I have a mongodb on a 8GB linux machine running. Currently it's in test-mode so there are very few other requests coming in if any at all.
I have a colelction items with 1 million documents in it. I am creating an index on the fields: PeerGroup and CategoryIds (which is an array of 3-6 elements which will yield in an multi key): db.items.ensureIndex({PeerGroup:1, CategoryIds:1}.
When I am querying
db.items.find({"CategoryIds" : new BinData(3,"xqScEqwPiEOjQg7tzs6PHA=="), "PeerGroup" : "anonymous"}).explain()
I have the following results:
{
"cursor" : "BtreeCursor PeerGroup_1_CategoryIds_1",
"isMultiKey" : true,
"n" : 203944,
"nscannedObjects" : 203944,
"nscanned" : 203944,
"nscannedObjectsAllPlans" : 203944,
"nscannedAllPlans" : 203944,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 1,
"nChunkSkips" : 0,
"millis" : 680,
"indexBounds" : {
"PeerGroup" : [
[
"anonymous",
"anonymous"
]
],
"CategoryIds" : [
[
BinData(3,"BXzpwVQozECLaPkJy26t6Q=="),
BinData(3,"BXzpwVQozECLaPkJy26t6Q==")
]
]
},
"server" : "db02:27017"
}
I think 680ms is not that very fast. Or is this acceptable?
Also, why does it say "indexOnly:false" ?
I think 680ms is not that very fast. Or is this acceptable?
That kind of depends on how big these objects are and whether this was a first run. Assuming the whole data set (including the index) you are returning fits into memory, then they next time you run this it will be an in-memory query and will then return basically as fast as possible. The nscanned is high meaning that this query is not very selective, are most records going to have an "anonymous" value in PeerGroup? If so, and the CategoryId is more selective then you might try an index on {CategoryIds:1, PeerGroup:1} instead (use hint() to try out one versus the other).
Also, why does it say "indexOnly:false"
This simply indicates that all the fields you wish to return are not in the index, the BtreeCursor indicates that the index was used for the query (a BasicCursor would mean it had not). For this to be an indexOnly query, you would need to be returning only the two fields in the index (that is: {_id : 0, PeerGroup:1, CategoryIds:1}) in your projection. That would mean that it would never have to touch the data itself and could return everything you need from the index alone.