Can we apply index to match a certain value in mongoDB - mongodb

i have a collection named users as shown below .
db.users.find().pretty()
{
"_id" : ObjectId("512efc206074b0e4bbdce792"),
"login_id" : "dutchuser",
"isBroker" : false
}
I want to apply index for this users collection with the login_id and isBroker field also .
db.users.ensureIndex( { "login_id": 1, "isBroker": 1 }, { unique: false } )
My concern is that most of the isBroker field has got a value of false .
So is there any possibility that i can apply index in that way ??

You cannot conditionally apply a filter to an index in MongoDB. While you could potentially restructure your data or introduce additional, potentially duplicate fields in your schema, I'm not convinced it's a reasonable "optimization."
Use db.stats() to actually measure the size of the database and db.{collectionname}.totalIndexSize() to see what the impact of having the index you proposed really is.
By using this index:
db.users.ensureIndex( { "login_id": 1, "isBroker": 1 }, { unique: false } )
You can only use queries that involve login_id and isBroker or just login_id. Depending on the types of queries that you run, you may also run into this currently open issue that might make a simple grouping/sorting on isBroker inefficient (or if at some point it becomes broker_type for example).

Related

Best way to count documents in mongoDB

we have a collection with big amount of documents, lets say around 100k. We now want to count the number of documents which has the key x set.
If I try it with Collection.countDocuments({ x: { $exists: true }}) I get the result, but it creates instantly a warning in the console: Query Targeting: Scanned Objects / Returned has gone above 1000.
So, is there a better way to count the documents? There is a Index on the field, is it possible to get the length of the index?
Thanks
Theres no real way of viewing the index trees in Mongo, what other people have linked you just returns the size of the tree, I'm not sure how useful that information is in this context.
Now to your question is this the best way to count?.
The answer is Yes ... -ish.
countDocuments is a wrapper function, it just simulates the following pipeline:
db.collection.aggregate([
{ $match: <query> },
{ $group: { _id: null, n: { $sum: 1 } } } )
])
This pipeline is the most efficient way to go, but the difference between running this aggregation and using the wrapper function is about 100-200 milliseconds, depending on your machine spec.
Meaning if you're looking for "way" better performance you're not going to find it.
With that said this warning is stupid, it just means you have more than 1000 documents with that field. The true purpose of it is to alert you in the case you're trying to query 1-20 documents without a proper index.
You can use the indexSizes field returned by the stats() method.
The stats() method "Returns statistics about the collection".
See example here :
https://docs.mongodb.com/manual/reference/method/db.collection.stats/#basic-stats-lookup
{
...,
"indexSizes" : {
"_id_" : 237568,
"cuisine_1" : 143360,
"borough_1_cuisine_1" : 151552,
"borough_1_address.zipcode_1" : 151552
},
...
}
indexSize key return size as in space used in storing not count
Check With Explain if index getting used or not . (Update in question Also)
can use hint option to check the performance after specifying index
Or precalculate count by $inc operator might good option if possible in you use case
try cursor.count if its faster countDocument should been faster but no harm in checking
https://docs.mongodb.com/manual/reference/method/cursor.count/

How to store an ordered set of documents in MongoDB without using a capped collection

What's a good way to store a set of documents in MongoDB where order is important? I need to easily insert documents at an arbitrary position and possibly reorder them later.
I could assign each item an increasing number and sort by that, or I could sort by _id, but I don't know how I could then insert another document in between other documents. Say I want to insert something between an element with a sequence of 5 and an element with a sequence of 6?
My first guess would be to increment the sequence of all of the following elements so that there would be space for the new element using a query something like db.items.update({"sequence":{$gte:6}}, {$inc:{"sequence":1}}). My limited understanding of Database Administration tells me that a query like that would be slow and generally a bad idea, but I'm happy to be corrected.
I guess I could set the new element's sequence to 5.5, but I think that would get messy rather quickly. (Again, correct me if I'm wrong.)
I could use a capped collection, which has a guaranteed order, but then I'd run into issues if I needed to grow the collection. (Yet again, I might be wrong about that one too.)
I could have each document contain a reference to the next document, but that would require a query for each item in the list. (You'd get an item, push it onto the results array, and get another item based on the next field of the current item.) Aside from the obvious performance issues, I would also not be able to pass a sorted mongo cursor to my {#each} spacebars block expression and let it live update as the database changed. (I'm using the Meteor full-stack javascript framework.)
I know that everything has it's advantages and disadvantages, and I might just have to use one of the options listed above, but I'd like to know if there is a better way to do things.
Based on your requirement, one of the approaches could be to design your schema, in such a way that each document has the capability to hold more than one document and in itself act as a capped container.
{
"_id":Number,
"doc":Array
}
Each document in the collection will act as a capped container, and the documents will be stored as array in the doc field. The doc field being an array, will maintain the order of insertion.
You can limit the number of documents to n. So the _id field of each container document will be incremental by n, indicating the number of documents a container document can hold.
By doing these you avoid adding extra fields to the document, extra indices, unnecessary sorts.
Inserting the very first record
i.e when the collection is empty.
var record = {"name" : "first"};
db.col.insert({"_id":0,"doc":[record]});
Inserting subsequent records
Identify the last container document's _id, and the number of
documents it holds.
If the number of documents it holds is less than n, then update the
container document with the new document, else create a new container
document.
Say, that each container document can hold 5 documents at most,and we want to insert a new document.
var record = {"name" : "newlyAdded"};
// using aggregation, get the _id of the last inserted container, and the
// number of record it currently holds.
db.col.aggregate( [ {
$group : {
"_id" : null,
"max" : {
$max : "$_id"
},
"lastDocSize" : {
$last : "$doc"
}
}
}, {
$project : {
"currentMaxId" : "$max",
"capSize" : {
$size : "$lastDocSize"
},
"_id" : 0
}
// once obtained, check if you need to update the last container or
// create a new container and insert the document in it.
} ]).forEach( function(check) {
if (check.capSize < 5) {
print("updating");
// UPDATE
db.col.update( {
"_id" : check.currentMaxId
}, {
$push : {
"doc" : record
}
});
} else {
print("inserting");
//insert
db.col.insert( {
"_id" : check.currentMaxId + 5,
"doc" : [ record ]
});
}
})
Note that the aggregation, runs on the server side and is very efficient, also note that the aggregation would return you a document rather than a cursor in versions previous to 2.6. So you would need to modify the above code to just select from a single document rather than iterating a cursor.
Inserting a new document in between documents
Now, if you would like to insert a new document between documents 1 and 2, we know that the document should fall inside the container with _id=0 and should be placed in the second position in the doc array of that container.
so, we make use of the $each and $position operators for inserting into specific positions.
var record = {"name" : "insertInMiddle"};
db.col.update(
{
"_id" : 0
}, {
$push : {
"doc" : {
$each : [record],
$position : 1
}
}
}
);
Handling Over Flow
Now, we need to take care of documents overflowing in each container, say we insert a new document in between, in container with _id=0. If the container already has 5 documents, we need to move the last document to the next container and do so till all the containers hold documents within their capacity, if required at last we need to create a container to hold the overflowing documents.
This complex operation should be done on the server side. To handle this, we can create a script such as the one below and register it with mongodb.
db.system.js.save( {
"_id" : "handleOverFlow",
"value" : function handleOverFlow(id) {
var currDocArr = db.col.find( {
"_id" : id
})[0].doc;
print(currDocArr);
var count = currDocArr.length;
var nextColId = id + 5;
// check if the collection size has exceeded
if (count <= 5)
return;
else {
// need to take the last doc and push it to the next capped
// container's array
print("updating collection: " + id);
var record = currDocArr.splice(currDocArr.length - 1, 1);
// update the next collection
db.col.update( {
"_id" : nextColId
}, {
$push : {
"doc" : {
$each : record,
$position : 0
}
}
});
// remove from original collection
db.col.update( {
"_id" : id
}, {
"doc" : currDocArr
});
// check overflow for the subsequent containers, recursively.
handleOverFlow(nextColId);
}
}
So that after every insertion in between , we can invoke this function by passing the container id, handleOverFlow(containerId).
Fetching all the records in order
Just use the $unwind operator in the aggregate pipeline.
db.col.aggregate([{$unwind:"$doc"},{$project:{"_id":0,"doc":1}}]);
Re-Ordering Documents
You can store each document in a capped container with an "_id" field:
.."doc":[{"_id":0,","name":"xyz",...}..]..
Get hold of the "doc" array of the capped container of which you want
to reorder items.
var docArray = db.col.find({"_id":0})[0];
Update their ids so that after sorting the order of the item will change.
Sort the array based on their _ids.
docArray.sort( function(a, b) {
return a._id - b._id;
});
update the capped container back, with the new doc array.
But then again, everything boils down to which approach is feasible and suits your requirement best.
Coming to your questions:
What's a good way to store a set of documents in MongoDB where order is important?I need to easily insert documents at an arbitrary
position and possibly reorder them later.
Documents as Arrays.
Say I want to insert something between an element with a sequence of 5 and an element with a sequence of 6?
use the $each and $position operators in the db.collection.update() function as depicted in my answer.
My limited understanding of Database Administration tells me that a
query like that would be slow and generally a bad idea, but I'm happy
to be corrected.
Yes. It would impact the performance, unless the collection has very less data.
I could use a capped collection, which has a guaranteed order, but then I'd run into issues if I needed to grow the collection. (Yet
again, I might be wrong about that one too.)
Yes. With Capped Collections, you may lose data.
An _id field in MongoDB is a unique, indexed key similar to a primary key in relational databases. If there is an inherent order in your documents, ideally you should be able to associate a unique key to each document, with the key value reflecting the order. So while preparing your document for insertion, explicitly add an _id field as this key (if you do not, mongo creates it automatically with a BSON objectid).
As far as retrieving the results are concerned, MongoDB does not guarantee the order of return documents unless you explicitly use .sort() . If you do not use .sort(), the results are usually returned in natural order (order of insertion).Again, there is no guarantee on this behavior.
I'd advise you to override _id with your order while inserting, and use a sort while retrieving. Since _id is a necessary and auto-indexed entity, you will not be wasting any space defining a sort key, and storing the index for it.
For abitrary sorting of any collection, you'll need a field to sort it on. I call mine "sequence".
schema:
{
_id: ObjectID,
sequence: Number,
...
}
db.items.ensureIndex({sequence:1});
db.items.find().sort({sequence:1})
Here is a link to some general sorting database answers that may be relevant:
https://softwareengineering.stackexchange.com/questions/195308/storing-a-re-orderable-list-in-a-database/369754
I suggest going with Floating point solution - adding a position column:
Use a floating-point number for the position column.
You can then reorder the list changing only the position column in the "moved" row.
If your user wants to position "red" after "blue" but before "yellow" Then you just need to calculate
red.position = ((yellow.position - blue.position) / 2) + blue.position
After a few re-positions in the same place (Cuttin in half every time) - you might reach a wall - it's better that if you reach a certain threshold - to resort the list.
When retrieving it you can simply say col.sort() to get it sorted and no need for any client-side code (Like in the case of a Linked list solution)

MongoDB "filtered" index: is it possible?

Is it possible to index some documents of the collection "only if" one of the fields to be indexed has a particular value?
Let me explain with an example:
The collection "posts" has millions of documents, ALL defined as follows:
{
    "network": "network_1",
    "blogname": "blogname_1",
    "post_id": 1234,
    "post_slug": "abcdefg"
}
Let's assume that the distribution of the post is equally split on network_1 and network_2
My application OFTEN select the type of query based on the value of "network" (although sometimes I need the data from both networks):
For example:
www.test.it/network_1/blog_1/**postid**/1234/
-> db.posts.find ({network: "network_1" blogname "blog_1", post_id: 1234})
www.test.it/network_2/blog_4/**slug**/aaaa/
-> db.posts.find ({network: "network_2" blogname "blog_4" post_slug: "yyyy"})
I could create two separate indexes (network / blogname / post_id and network / blogname / post_slug) but I would get a huge waste of RAM, since 50% of the data in the index will never be used.
Is there a way to create an index "filtered"?
Example:
(Note the WHERE parameter)
db.posts.ensureIndex ({network: 1 blogname: 1, post_id: 1}, {where: {network: "network_1"}})
db.posts.ensureIndex ({network: 1 blogname: 1, post_slug: 1}, {where: {network: "network_2"}})
Indeed it's possible in MongoDB 3.2+ They call it partialFilterExpression where you can set a condition based on which index will be created.
Example
db.users.createIndex({ "userId": 1, "project": 1 },
{ unique: true, partialFilterExpression:{
userId: { $exists: true, $gt : { $type : 10 } } } })
Please see Partial Index documentation
As of MongoDB v3.2, partial indexes are supported. Documentation: https://docs.mongodb.org/manual/core/index-partial/
It's possible, but it requires a workaround which creates redundancy in your documents, requires you to rewrite your find-queries and limits find-queries to exact matches.
MongoDB supports sparse indexes which only index the documents where the given field exists. You can use this feature to only index a part of the collection by adding this field only to those documents you want to index.
The bad news is that sparse indexes can only include a single field. But the good news is, that this field can also contain an object with multiple fields, so you can still store all the data you want to search for in this field.
To do this, add a new field to the included documents which includes an object with the fields you search for:
{
"network": "network_1",
"blogname": "blogname_1",
"post_id": 1234,
"post_slug": "abcdefg"
"network_1_index_key": {
"blogname": "blogname_1",
"post_id": 1234
}
}
Your ensureIndex command would index the field network_1_index_key:
db.posts.ensureIndex( { network_1_index_key: 1 }, { sparse: true } )
A find-query which is supposed to use this index, must now query for the exact object of the field network_1_index_key:
db.posts.find ({
network_1_index_key: {
blogname: "blogname_1",
post_id: 1234
}
})
Doing this would likely only make sense when the documents you want to index are a very small part of the collection. When its about half, I would just create a regular index and live with it because the larger document-size could mitigate the gains from the reduced index size.
You can try create index on all field (network / blogname / post_id / post_slug)

How do you get around missing values in a unique index using mongo db?

The mongo documentation states that "When a document is saved to a collection with unique indexes, any missing indexed keys will be inserted with null values. Thus, it won't be possible to insert multiple documents missing the same indexed key."
So is it impossible to create a unique index on an optional field? Should I create a compound index with say a userId as well to solve this? In my specific case I have a user collection that has an optional embedded oauth object.
e.g.
>db.users.ensureIndex( { "name":1, "oauthConnections.provider" : 1, "oauthConnections.providerId" : 1 } );
My sample user
{ name: "Bob"
,pwd: "myPwd"
,oauthConnections [
{
"provider":"Facebook",
"providerId" : "12345",
"key":"blah"
}
,{
"provider":"Twitter",
"providerId" : "67890",
"key":"foo"
}
]
}
I believe that this is possible: You can have an index that is sparse and unique. This way, non-existant values never make it to the index, hence they can't be duplicate.
Caveat: This is not possible with compound indexes. I'm not quite sure about your question. Your citing a part of the documentation that concerns compound indexes -- there, missing values will be inserted, but from your question I guess you're not looking for a solution w/ compound indexes?
Here's a sample:
> db.Test.insert({"myId" : "1234", "string": "foo"});
> show collections
Test
system.indexes
>
> db.Test.find();
{ "_id" : ObjectId("4e56e5260c191958ad9c7cb1"), "myId" : "1234", "string" : "foo" }
>
> db.Test.ensureIndex({"myId" : 1}, {sparse: true, unique: true});
>
> db.Test.insert({"myId" : "1234", "string": "Bla"});
E11000 duplicate key error index: test.Test.$myId_1 dup key: { : "1234" }
>
> db.Test.insert({"string": "Foo"});
> db.Test.insert({"string": "Bar"});
> db.Test.find();
{ "_id" : ObjectId("4e56e5260c191958ad9c7cb1"), "myId" : "1234", "string" : "foo" }
{ "_id" : ObjectId("4e56e5c30c191958ad9c7cb4"), "string" : "Foo" }
{ "_id" : ObjectId("4e56e5c70c191958ad9c7cb5"), "string" : "Bar" }
Also note that compound indexes can't be sparse
It is not impossible to index an optional field. The docs are talking about a unique index. Once you've specified a unique index, you can only insert one document per value for that field, even if that value is null.
If you want a unique index on an optional field but still allow multiple nulls, you could try making the index both unique and sparse, although I have no idea if that's possible. I couldn't find an answer in the documentation.
There's no good way to uniquely index an optional field. You can either fill it with a default (the _id on the user would work), let your access layer enforce uniqueness, or change your "schema" a bit.
We have a separate collection for oauth login tokens, partially for this reason. We never really need to access those in a context where having them as embedded docs is an obvious win. If this is a relatively easy change to make, it's probably your best bet.
----edit----
As the other answers points, you can achieve this with a sparse index. It's even a documented use. You should probably accept one of those answers instead of mine.
http://www.mongodb.org/display/DOCS/Indexes#Indexes-SparseIndexes

Fast way to find duplicates on indexed column in mongodb

I have a collection of md5 in mongodb. I'd like to find all duplicates. The md5 column is indexed. Do you know any fast way to do that using map reduce.
Or should I just iterate over all records and check for duplicates manually?
My current approach using map reduce iterates over the collection almost twice (assuming that there is very small amount of duplicates):
res = db.files.mapReduce(
function () {
emit(this.md5, 1);
},
function (key, vals) {
return Array.sum(vals);
}
)
db[res.result].find({value: {$gte:1}}).forEach(
function (obj) {
out.duplicates.insert(obj)
});
I personally found that on big databases (1TB and more) accepted answer is terribly slow. Aggregation is much faster. Example is below:
db.places.aggregate(
{ $group : {_id : "$extra_info.id", total : { $sum : 1 } } },
{ $match : { total : { $gte : 2 } } },
{ $sort : {total : -1} },
{ $limit : 5 }
);
It searches for documents whose extra_info.id is used twice or more times, sorts results in descending order of given field and prints first 5 values of it.
The easiest way to do it in one pass is to sort by md5 and then process appropriately.
Something like:
var previous_md5;
db.files.find( {"md5" : {$exists:true} }, {"md5" : 1} ).sort( { "md5" : 1} ).forEach( function(current) {
if(current.md5 == previous_md5){
db.duplicates.update( {"_id" : current.md5}, { "$inc" : {count:1} }, true);
}
previous_md5 = current.md5;
});
That little script sorts the md5 entries and loops through them in order. If an md5 is repeated, then they will be "back-to-back" after sorting. So we just keep a pointer to previous_md5 and compare it current.md5. If we find a duplicate, I'm dropping it into the duplicates collection (and using $inc to count the number of duplicates).
This script means that you only have to loop through the primary data set once. Then you can loop through the duplicates collection and perform clean-up.
You can do a group by that field and then query to get the duplicated (having a count > 1). http://www.mongodb.org/display/DOCS/Aggregation#Aggregation-Group
Although, the fastest thing might be to just do a query which only returns that field and then to do the aggregation in the client. Group/Map-Reduce need to provide access to the whole document which is much more costly than just providing the data from the index (which is now covered in 1.7.3+).
If this is a general problem you need to run periodically, you might want to keep a collection which is just {md5:value, count:value} so you can skip the aggregation, and it will be extremely fast when you need to cull duplicates.