mongo db count differs from aggregate sum - mongodb

I have a query, and when I validate it I see that the count command returns a different result from the aggregate.
I have an array of sub-documents like so:
{
...
wished: [{'game':'dayz','appid':'1234'}, {'game':'half-life','appid':'1234'}]
...
}
I am trying to count all games in the collection and return each game name along with the number of times it was found.
If I run
db.user_info.count({'wished.game':'dayz'})
it returns 106 as the value and
db.user_info.aggregate([{'$unwind':'$wished'},{'$group':{'_id':'$wished.game','total':{'$sum':1}}},{'$sort':{'total':-1}}])
returns 110
I don't understand why my counts are different. The only thing I can think of is that it has to do with the data being in an array of sub-documents, as opposed to being in a plain array or directly in the document.

The $unwind statement will cause one user with multiple wished games to appear as several users. Imagine this data:
{
_id: 1,
wished: [{game:'a'}, {game:'b'}]
}
{
_id: 2,
wished: [{game:'a'}, {game:'c'}, {game:'a'}]
}
count() returns the number of matching documents, so with this data the count can NEVER be more than 2.
But with this same data, an $unwind will give you 5 different documents. Summing them up will then give you a:3, b:1, c:1.
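To see the difference without a database, here is a minimal in-memory sketch in plain JavaScript. The helper names (countDocs, unwindAndGroup, countDocsPerGame) are made up for illustration; they only mimic what count() and $unwind + $group do and are not MongoDB APIs.

```javascript
// Sample data from the answer above.
const users = [
  { _id: 1, wished: [{ game: 'a' }, { game: 'b' }] },
  { _id: 2, wished: [{ game: 'a' }, { game: 'c' }, { game: 'a' }] }
];

// Mimics count({'wished.game': game}):
// a document counts once if ANY array element matches.
function countDocs(docs, game) {
  return docs.filter(d => d.wished.some(w => w.game === game)).length;
}

// Mimics $unwind + $group with {$sum: 1}: every array element counts.
function unwindAndGroup(docs) {
  const totals = {};
  for (const d of docs) {
    for (const w of d.wished) {
      totals[w.game] = (totals[w.game] || 0) + 1;
    }
  }
  return totals;
}

// Counts each game at most once per document, so the totals agree with count().
function countDocsPerGame(docs) {
  const totals = {};
  for (const d of docs) {
    for (const game of new Set(d.wished.map(w => w.game))) {
      totals[game] = (totals[game] || 0) + 1;
    }
  }
  return totals;
}

console.log(countDocs(users, 'a'));   // 2
console.log(unwindAndGroup(users));   // { a: 3, b: 1, c: 1 }
console.log(countDocsPerGame(users)); // { a: 2, b: 1, c: 1 }
```

In a real pipeline, one way to get the per-document totals would be to $group on both _id and wished.game right after the $unwind, and then $group again on the game name alone.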

Related

Mongo Query Optimization for collection over 10 Million records

I have a collection with over 10 million records. I need to match on a particular field and get
the distinct _ids of the record set.
After the $match stage the result set is less than 5 million records.
If I group by id to get the unique ids, the execution time on my local environment is over 20 seconds.
db.getCollection('viewscounts').aggregate(
[
{
$match: {
MODULE_ID: 4,
}
},
{
$group: {
_id: '$ITEM_ID',
}
}
], { allowDiskUse: true })
If I remove either $match or $group, so that the pipeline has only one stage, the execution time is less than 0.1 seconds.
I'm okay with limiting the _ids, but they should be unique.
Can anyone suggest a better way to get the results faster?
You have already implemented the best aggregation pipeline possible for the output you want.
Your query is faster with only one of the stages because it then returns partial output instead of the entire 5 million records. When you add both stages, the entire output of the $match stage has to be processed by the $group stage, which takes more time.
The only way to optimize your aggregation query is to create indexes on the MODULE_ID and ITEM_ID keys:
db.viewscounts.createIndex({MODULE_ID: 1}, { sparse: true })
db.viewscounts.createIndex({ITEM_ID: 1})
It should be faster after you create the two indexes above on your viewscounts collection.
Additionally, you can get your desired output from the MongoDB distinct command. Give the query below a try and see if it helps.
db.getCollection('viewscounts').distinct("ITEM_ID", {"MODULE_ID": 4})
Note: The above query returns an array of unique key-values instead of objects like in the aggregation query
Hope this helps
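To illustrate the difference in result shape mentioned in the note, here is a small in-memory sketch in plain JavaScript. The sample rows and the helper names (matchAndGroup, distinctItemIds) are assumptions for illustration only; they imitate the two queries and do not talk to MongoDB.

```javascript
// Hypothetical sample rows; the field names follow the question.
const viewscounts = [
  { MODULE_ID: 4, ITEM_ID: 10 },
  { MODULE_ID: 4, ITEM_ID: 11 },
  { MODULE_ID: 4, ITEM_ID: 10 },
  { MODULE_ID: 7, ITEM_ID: 12 }
];

// Mimics $match + $group: one object per distinct ITEM_ID.
function matchAndGroup(docs, moduleId) {
  const seen = new Set();
  for (const d of docs) {
    if (d.MODULE_ID === moduleId) seen.add(d.ITEM_ID);
  }
  return [...seen].map(id => ({ _id: id }));
}

// Mimics distinct("ITEM_ID", {MODULE_ID: moduleId}): a plain array of values.
function distinctItemIds(docs, moduleId) {
  return matchAndGroup(docs, moduleId).map(g => g._id);
}

console.log(matchAndGroup(viewscounts, 4));   // [ { _id: 10 }, { _id: 11 } ]
console.log(distinctItemIds(viewscounts, 4)); // [ 10, 11 ]
```

The order of the values here follows insertion order of the sample array; the real $group stage makes no promise about output order unless you add a $sort.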

Find current and previous documents from mongo db

I have to return 2 documents from a single query. The first is the one whose value I give in the query, and the second is the previous one (sorted).
I am able to write both queries separately. The code below gives separate outputs.
db.collection.find({'_id':'value1'})
db.collection.find({'_id': {'$lt': 'value1'}}).sort({'_id':-1}).limit(1)
How do I combine them, so that when I execute the query from my application it returns both documents?
I also want to fetch only a specific key instead of the entire document.
You can use $lte instead of $lt and limit with 2 - logically it will be the same operation
db.collection.find({ _id: { $lte: 'value1' } }, { _id: 1, yourKey: 1 }).sort({_id: -1}).limit(2)
EDIT: to get specific keys you need to specify them as second argument of .find()
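As a rough illustration of why $lte with limit(2) returns the current and the previous document, here is an in-memory sketch in plain JavaScript. The sample documents and the helper name currentAndPrevious are hypothetical; string _ids are compared lexicographically here, which matches how the query above would compare them.

```javascript
// Hypothetical documents; yourKey stands in for whatever field you project.
const docs = [
  { _id: 'value0', yourKey: 'x', other: 1 },
  { _id: 'value1', yourKey: 'y', other: 2 },
  { _id: 'value2', yourKey: 'z', other: 3 }
];

// Mimics find({_id: {$lte: id}}, {_id: 1, yourKey: 1}).sort({_id: -1}).limit(2).
function currentAndPrevious(collection, id) {
  return collection
    .filter(d => d._id <= id)                                   // $lte
    .sort((a, b) => (a._id < b._id ? 1 : a._id > b._id ? -1 : 0)) // descending
    .slice(0, 2)                                                // limit(2)
    .map(d => ({ _id: d._id, yourKey: d.yourKey }));            // projection
}

console.log(currentAndPrevious(docs, 'value1'));
// [ { _id: 'value1', yourKey: 'y' }, { _id: 'value0', yourKey: 'x' } ]
```

One thing to keep in mind: if no document with the given _id exists, this approach returns the two preceding documents instead, so you may want to check the _id of the first result.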

How to efficiently page batches of results with MongoDB

I am using the query below on my MongoDB collection, and it takes more than an hour to complete.
db.collection.find({language:"hi"}).sort({_id:-1}).skip(5000).limit(1)
I am trying to get the results in batches of 5000, to process in either ascending or descending order, for documents with "hi" as the value of the language field. So I am using this query, in which I skip the already processed documents by incrementing the "skip" value every time.
The document count in this collection is just above 20 million.
An index on the field "language" is already created.
The MongoDB version I am using is 2.6.7.
Is there a more appropriate index for this query which can get the result faster?
When you want to sort descending, you should create a multi-field index which uses the field(s) you sort on as descending field(s). You do that by setting those field(s) to -1.
This index should greatly increase the performance of your sort:
db.collection.ensureIndex({ language: 1, _id: -1 });
When you also want to speed up the other case - retrieving sorted in ascending order - create a second index like this:
db.collection.ensureIndex({ language: 1, _id: 1 });
Keep in mind that when you do not sort your results, you receive them in natural order. Natural order is often insertion order, but there is no guarantee for that. There are various events which can cause the natural order to change, so when you care about the order you should always sort explicitly. The only exception to this rule is capped collections, which always maintain insertion order.
In order to efficiently "page" through results in the way that you want, it is better to use a "range query" and keep the last value you processed.
Your desired "sort key" here is _id, so that makes things simple.
First you want your index in the correct order, which is done with .createIndex() rather than the deprecated .ensureIndex():
db.collection.createIndex({ "language": 1, "_id": -1 })
Then you want to do some simple processing, from the start:
var lastId = null;
var cursor = db.collection.find({language:"hi"});
cursor.sort({_id:-1}).limit(5000).forEach(function(doc) {
// do something with your document. But always set the next line
lastId = doc._id;
})
That's the first batch. Now when you move on to the next one:
var cursor = db.collection.find({ "language":"hi", "_id": { "$lt": lastId } });
cursor.sort({_id:-1}).limit(5000).forEach(function(doc) {
// do something with your document. But always set the next line
lastId = doc._id;
})
So that the lastId value is always considered when making the selection. You store this between each batch, and continue on from the last one.
That is much more efficient than processing with .skip(), which regardless of the index will "still" need to "skip" through all data in the collection up to the skip point.
Using the $lt operator here "filters" all the results you already processed, so you can move along much more quickly.
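The two snippets above can be folded into one loop. The following is a minimal in-memory sketch in plain JavaScript, not shell code: the array stands in for the collection, and fetchBatch and processAll are hypothetical helpers that imitate the find/sort/limit calls so the control flow can be run anywhere.

```javascript
// In-memory stand-in for the collection: 12 docs with numeric _ids (made-up data).
const collection = [];
for (let i = 1; i <= 12; i++) {
  collection.push({ _id: i, language: i % 4 === 0 ? 'en' : 'hi' });
}

// Mimics find({language, _id: {$lt: lastId}}).sort({_id: -1}).limit(size).
// lastId === null means "no lower bound yet", i.e. the first batch.
function fetchBatch(docs, language, lastId, size) {
  return docs
    .filter(d => d.language === language && (lastId === null || d._id < lastId))
    .sort((a, b) => b._id - a._id)
    .slice(0, size);
}

// Keeps asking for the next batch until one comes back empty.
function processAll(docs, language, size) {
  const batches = [];
  let lastId = null;
  while (true) {
    const batch = fetchBatch(docs, language, lastId, size);
    if (batch.length === 0) break;
    batches.push(batch.map(d => d._id)); // "do something with your document"
    lastId = batch[batch.length - 1]._id; // remember where this batch ended
  }
  return batches;
}

console.log(processAll(collection, 'hi', 4));
// [ [ 11, 10, 9, 7 ], [ 6, 5, 3, 2 ], [ 1 ] ]
```

Each batch starts right below the last _id seen, so no batch has to walk over documents that were already processed.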

MongoDB select subdocument with aggregation function

I have a mongo DB collection that looks something like this:
{
{
_id: objectId('aabbccddeeff'),
objectName: 'MyFirstObject',
objectLength: 0xDEADBEEF,
objectSource: 'Source1',
accessCounter: {
'firstLocationCode' : 283,
'secondLocationCode' : 543,
'ThirdLocationCode' : 564,
'FourthLocationCode' : 12,
}
}
...
}
Now, assuming that this is not the only record in the collection and that most or all of the documents contain the accessCounter subdocument, how do I select the first x documents with the most accesses from a specific location?
A sample "query" would be something like:
"Select the first 10 documents From myCollection where the accessCounter.firstLocationCode are the highest"
So a sample result would be X documents where the accessCounter value for the chosen location is the greatest in the database.
Thank you for taking the time to read my question.
No need for an aggregation, that is a basic query:
db.collection.find().sort({"accessCounter.firstLocationCode":-1}).limit(10)
In order to speed this up, you should create an index on the field you sort on:
db.collection.ensureIndex({'accessCounter.firstLocationCode':-1})
An index on the accessCounter subdocument as a whole does not help with this sort, because it indexes the entire subdocument value rather than its individual fields. If you want to run the same query for other locations, create an index on each location field you sort on.
You can speed this up further in case you only need the accessCounter value by making this a so-called covered query, a query whose returned values come from the index itself. For example, when you have accessCounter.secondLocationCode indexed and you query for the top values of that field, you should be able to do a covered query with:
db.collection.find({},{_id:0,"accessCounter.secondLocationCode":1})
.sort({"accessCounter.secondLocationCode":-1}).limit(10)
which translates to "Get all documents ('{}'), don't return the _id field as you do by default ('_id:0'), get only the 'accessCounter.secondLocationCode' field ('accessCounter.secondLocationCode:1'). Sort the returned values in descending order and give me the first ten."
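For readers who want to check the expected result by hand, here is an in-memory sketch in plain JavaScript of "sort descending on one location counter, take the top n". The sample objects and the helper name topByLocation are hypothetical; the field names follow the sample document in the question. Documents without the counter are sorted last, which is the assumption made here for missing values.

```javascript
// Hypothetical sample data shaped like the question's documents.
const objects = [
  { objectName: 'A', accessCounter: { firstLocationCode: 283, secondLocationCode: 543 } },
  { objectName: 'B', accessCounter: { firstLocationCode: 900, secondLocationCode: 12 } },
  { objectName: 'C', accessCounter: { firstLocationCode: 5, secondLocationCode: 700 } },
  { objectName: 'D' } // no accessCounter at all
];

// Reads the counter for one location; missing values sort below everything else.
function counterOf(doc, location) {
  const value = doc.accessCounter ? doc.accessCounter[location] : undefined;
  return typeof value === 'number' ? value : -Infinity;
}

// Mimics find().sort({'accessCounter.<location>': -1}).limit(n).
function topByLocation(docs, location, n) {
  return [...docs] // copy so the caller's array is not reordered
    .sort((a, b) => counterOf(b, location) - counterOf(a, location))
    .slice(0, n);
}

console.log(topByLocation(objects, 'firstLocationCode', 2).map(d => d.objectName));
// [ 'B', 'A' ]
console.log(topByLocation(objects, 'secondLocationCode', 2).map(d => d.objectName));
// [ 'C', 'A' ]
```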

How do I do a "NOT IN" query in Mongo?

This is my document:
{
title:"Happy thanksgiving",
body: "come over for dinner",
blocked:[
{user:333, name:'john'},
{user:994, name:'jessica'},
{user:11, name: 'matt'},
]
}
What is the query to find all documents that do not have user 11 in "blocked"?
You can use $nin for "not in" ($in is its counterpart for "in").
Example:
> db.people.find({ crowd : { $nin: ["cool"] }});
I put a bunch more examples here: http://learnmongo.com/posts/being-part-of-the-in-crowd/
Since you are comparing against a single value, your example actually doesn't need a NOT IN operation. This is because Mongo will apply its search criteria to every element of an array subdocument. You can use the NOT EQUALS operator, $ne, to get what you want as it takes the value that cannot turn up in the search:
db.myCollection.find({'blocked.user': {$ne: 11}});
However if you have many things that it cannot equal, that is when you would use the NOT IN operator, which is $nin. It takes an array of values that cannot turn up in the search:
db.myCollection.find({'blocked.user': {$nin: [11, 12, 13]}});
Try the following:
db.stack.find({"blocked.user":{$nin:[11]}})
This worked for me.
See http://docs.mongodb.org/manual/reference/operator/query/nin/#op._S_nin
db.inventory.find( { qty: { $nin: [ 5, 15 ] } } )
This query will
select all documents in the inventory collection where the qty field
value does not equal 5 nor 15. The selected documents will include
those documents that do not contain the qty field.
If the field holds an array, then the $nin operator selects the
documents whose field holds an array with no element equal to a value
in the specified array (e.g. <value1>, <value2>, etc.).
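The array behaviour described in these answers can be tried out with a small in-memory sketch in plain JavaScript. The sample posts and the helper name notBlocked are made up for illustration; the function only imitates how {'blocked.user': {$nin: [...]}} selects documents (with a single value it behaves like $ne).

```javascript
// Hypothetical documents modelled on the question.
const posts = [
  { title: 'Happy thanksgiving', blocked: [{ user: 333 }, { user: 994 }, { user: 11 }] },
  { title: 'Game night', blocked: [{ user: 333 }] },
  { title: 'No blocks' } // the blocked field is missing entirely
];

// Mimics find({'blocked.user': {$nin: excluded}}):
// a document matches only if NO array element has an excluded user.
// Documents without the field match too, as the quoted docs say.
function notBlocked(docs, excluded) {
  return docs.filter(d =>
    (d.blocked || []).every(b => !excluded.includes(b.user))
  );
}

console.log(notBlocked(posts, [11]).map(d => d.title));
// [ 'Game night', 'No blocks' ]
console.log(notBlocked(posts, [333, 11]).map(d => d.title));
// [ 'No blocks' ]
```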