mongodb: non-performant indexes - mongodb

After playing with
db.largecollection.find( { $or : [ { identifierX : "sha1_hash123" } , { identifierY : "md5_hash456" } , { identifierZ : "another_hash789" } ] } )
I checked the indexes that mongodb prepared automatically. in addition to the "single" ensureIndex for the identifiers x/y/z, there is a identifierX_1_identifierY_1_identifierZ_1 now and performance is down :-(
Do you have an idea or tip how to explain to mongodb that it's faster to use the indexes for the single identifiers because i do not have $and, but $or queries?
Thx

MongoDB doesn't create indexes on its own. It's something that an application, user, or framework does. For your query, MongoDB could only use an index for either of identifierX, identifierY or identifierZ. However, if you don't have such an index then of course none will be used. The identifierX_1_identifierY_1_identifierZ_1 index can not be used for this query.
In this case, you will probably need to make an index for all of this identifiers:
db.ensureIndex( { 'identifierX' : 1 } );
db.ensureIndex( { 'identifierY' : 1 } );
db.ensureIndex( { 'identifierZ' : 1 } );
MongoDB can only use one index at a time, and it will try to pick the "best" one. Try using explain to see which indexed is being picked:
db.largecollection.find( { $or : [
{ identifierX : "sha1_hash123" },
{ identifierY : "md5_hash456" },
{ identifierZ : "another_hash789" }
] } ).explain();
That should give you some ideas on which index is being used.
There is an exception for $or though, where MongoDB can use a different index for each of the parts and de-dup for you. It's here in the docs. It would (of course) still not use the compound index, and you need the indexes that I've written here above.

Related

Matching elements in array documents sometimes gets very slow

I have a mongodb collection with about 100.000 documents.
Each document has an array with about ~ 100 elements. Is an array of strings like this:
features: [
"0_Toyota",
"29776_Grey",
"101037_Hybrid",
"240473_Iron Gray",
"46290_Aluminium,Magnesium",
"2787_14",
"9350_1920 x 1080",
"36303_Y",
"310870_N",
"57721_Y"
...
Making queries like this, are very fast. But sometimes gets very slow, including an specific extra condition inside $and. I have no idea why this happens. When gets slow, it takes more than 40 seconds. Always happens with the same extra condition. It is very possible that it happens with other conditions.
db.products.find({
$and:[
{
"features" : {
"$eq" : "36303_N"
}
},
{
"features" : {
"$eq" : "91135_IPS"
}
},
{
"features" : {
"$eq" : "9350_1366 x 768"
}
},
{
"features" : {
"$eq" : "178874_Y"
}
},
{
"features" : {
"$eq" : "43547_Y"
}
}
...
I'm running the same mongodb in my unix laptop and on a linux server instance.
Also trying indexing the field "features" with the same results.
use $all in your mongo query with your data helps you to query for an array
first create index on features
use this query may helps to you
db.products.find( { features: { $all: ["36303_N", "91135_IPS","others..."] } } )
by the way ,
if your query is very slow ,get the slow operation from your mongod log
show your mongodb version .
any writing when query (write will blocking read in some version)
I have realized that order inside $all matters. I change the order of elements by its number of documents that exists inside the collection, ascending. Making the query more selective.
Before, the query takes ~ 40 seconds to execute, now, with elements ordered, it takes ~ 22 seconds.
Still many seconds anyway.

MongoDB- Query Embedded Document, but Projection Showing Path to key/value?

So basically I want
db.scoreFacts.find(
{"instrumentRanges.flute.minPitch": {$gte: 0, $lte:56}},
{"instrumentRanges.flute.minPitch": 1})
to return
{ "_id" : "Bach_Brandenburg5_Mov1.xml", "minPitch" : 50 }
but instead I get:
{ "_id" : "Bach_Brandenburg5_Mov1.xml", "instrumentRanges" : { "flute" : { "minPitch" : 50 } } }
Essentially the path to "minPitch" is returned, which is not what I need. How can I achieve my desired output with only .find() (no map, etc)? Thanks.
You can't do this with a standard .find() query. If you wish to alter the document structure, look into using an aggregate() call. You can then use projection to define the resulting field(s) you desire.
For example:
db.scoreFacts.aggregate([
{ $match: {"instrumentRanges.flute.minPitch": {$gte: 0, $lte:56}} },
{ $project: {"minPitch": "$instrumentRanges.flute.minPitch"} }
]);
For more information, please see the relevant documentation. Additionally, take a look at the prerequisite aggregation pipeline section.
Note: I have not tested the above query myself, so you may need to alter it somewhat to get the behavior you want.

MongoDB querying by sub-document value not working as expected

I have two schemas, a Profile and a LevelOfNeed.
Profile
{
"_id" : ObjectId("56d35960a695dfa140137fca"),
. . .
"levelOfNeedServiced" : ObjectId("56d35828a695dfa140137fc7")
}
Level of Need
{
"_id" : ObjectId("56d35828a695dfa140137fc7"),
"sortOrder" : 2,
"description" : "Moderate Needs",
"additionalCost" : 3,
"__v" : 0
}
I currently have 4 documents for LevelOfNeed. What I need to do is select all of the Profile documents where the levelOfNeedServiced.sortOrder is >= a value.
Example:
db.getCollection('profiles').find({
'levelOfNeedServiced.sortOrder': { $gte: 2 }
})
Given my data, I would expect to see the example Profile, but this returns no results. What am I doing wrong?
Update 1
Previously, I was running MongoDB 3.0.9. I've since upgraded to 3.2.3, however I'm still getting the same results. According to the docs, I should be able to query on an embedded document field value.
Update 2
The aggregate function solution works as expected, but since I already had an array of LevelOfNeed objects, I was able to use that to get to the related documents I needed using the $in operator.
Unfortunately mongodb does not support joins until version 3.2. In version 3.2 it provides the $lookup aggregation operator to lookup referenced documents across collections.
You could use it as below:
db.Profile.aggregate([
{
$lookup:{
"from":"LevelOfNeed",
"localField":"levelOfNeedServiced",
"foreignField":"_id",
"as":"joined"
}
},
{
$match:{
"joined.sortOrder":{$gte:2}
}
},
{
$project:{"levelOfNeedServiced":1,...} //include things you want to project.
}
])
Your code:
db.getCollection('profiles').find({
'levelOfNeedServiced.sortOrder': { $gte: 2 }
})
does not work as intended because, the field levelOfNeedServiced is identified as a field containing an ObjectId and not the resolved LevelOfNeed document.

how to remove all documents from a collection except one in MongoDB

Is there any way to remove all the documents except one from a collection based on condition.
I am using MongoDB version 2.4.9
You can do this in below way,
db.inventory.remove( { type : "food" } )
Above query will remove documents with type equals to "food"
To remove document that not matches condition you can do,
db.inventory.remove( { type : { $ne: "food" } } )
or
db.inventory.remove( { type : { $nin: ["Apple", "Mango"] } } )
Check here for more info.
To remove all documents except one, we can use the query operator $nin (not in) over a specified array containing the values related to the documents that we want to keep.
db.collections.remove({"field_name":{$nin:["valueX"]}})
The advantage of $nin array is that we can use it to delete all documents except one or two or even many other documents.
To delete all documents except two:
db.collections.remove({"field_name":{$nin:["valueX", "valueY"]}})
To delete all documents except three:
db.collections.remove({"field_name":{$nin:["valueX", "valueY", "valueZ"]}})
Query
db.collection.remove({ "fieldName" : { $ne : "value"}})
As stated above by Taha EL BOUFFI, the following worked for me.
db.collection.remove({"fieldName" : { $nin: ["value"]}});

Fast way to find duplicates on indexed column in mongodb

I have a collection of md5 in mongodb. I'd like to find all duplicates. The md5 column is indexed. Do you know any fast way to do that using map reduce.
Or should I just iterate over all records and check for duplicates manually?
My current approach using map reduce iterates over the collection almost twice (assuming that there is very small amount of duplicates):
res = db.files.mapReduce(
function () {
emit(this.md5, 1);
},
function (key, vals) {
return Array.sum(vals);
}
)
db[res.result].find({value: {$gte:1}}).forEach(
function (obj) {
out.duplicates.insert(obj)
});
I personally found that on big databases (1TB and more) accepted answer is terribly slow. Aggregation is much faster. Example is below:
db.places.aggregate(
{ $group : {_id : "$extra_info.id", total : { $sum : 1 } } },
{ $match : { total : { $gte : 2 } } },
{ $sort : {total : -1} },
{ $limit : 5 }
);
It searches for documents whose extra_info.id is used twice or more times, sorts results in descending order of given field and prints first 5 values of it.
The easiest way to do it in one pass is to sort by md5 and then process appropriately.
Something like:
var previous_md5;
db.files.find( {"md5" : {$exists:true} }, {"md5" : 1} ).sort( { "md5" : 1} ).forEach( function(current) {
if(current.md5 == previous_md5){
db.duplicates.update( {"_id" : current.md5}, { "$inc" : {count:1} }, true);
}
previous_md5 = current.md5;
});
That little script sorts the md5 entries and loops through them in order. If an md5 is repeated, then they will be "back-to-back" after sorting. So we just keep a pointer to previous_md5 and compare it current.md5. If we find a duplicate, I'm dropping it into the duplicates collection (and using $inc to count the number of duplicates).
This script means that you only have to loop through the primary data set once. Then you can loop through the duplicates collection and perform clean-up.
You can do a group by that field and then query to get the duplicated (having a count > 1). http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
Although, the fastest thing might be to just do a query which only returns that field and then to do the aggregation in the client. Group/Map-Reduce need to provide access to the whole document which is much more costly than just providing the data from the index (which is now covered in 1.7.3+).
If this is a general problem you need to run periodically, you might want to keep a collection which is just {md5:value, count:value} so you can skip the aggregation, and it will be extremely fast when you need to cull duplicates.