I'm trying to use the aggregation framework to group a large number of strings together to identify the unique ones. I must also keep some information from the rest of the fields. This would be analogous to using the * operator in MySQL with a GROUP BY statement.
SELECT *
FROM my_table
GROUP BY field1
I have tried using the aggregation framework, and it works fine just to get unique fields.
db.mycollection.aggregate({
    $group : { _id : "$field1" }
})
What if I want the other fields that went with it? MySQL would only give me the first one that appeared in the group (which I'm fine with). That's what I thought the $first operator did.
db.mycollection.aggregate({
    $group : {
        _id : "$field1",
        another_field : { $first : "$field2" }
    }
})
This way it groups by field1 but still gives me back the other fields attached to the document. When I try this I get:
exception: aggregation result exceeds maximum document size (16MB)
I have a feeling this is because it returns the whole aggregation result as one document. Can I get it back as a JSON array instead?
thanks in advance
You're doing the aggregation correctly, but as the error message indicates, the full result of the aggregate call cannot be larger than 16 MB.
Work-arounds would be to either add a filter to reduce the size of the result or use map-reduce instead and output the result to another collection.
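For intuition about what the pipeline in the question computes, here is a minimal plain-JavaScript sketch of grouping with a "first value wins" accumulator. It runs over an in-memory array rather than a real collection; the function name groupFirst and the sample documents are made up for illustration, and it mirrors the idea behind $group with $first rather than the server's actual implementation.

```javascript
// Hypothetical helper: imitates
//   { $group: { _id: "$<keyField>", <outField>: { $first: "$<valueField>" } } }
// on a plain array. Illustration only, not how the server executes the stage.
function groupFirst(docs, keyField, valueField, outField) {
  const groups = new Map(); // a Map keeps the insertion order of the group keys
  for (const doc of docs) {
    const key = doc[keyField];
    if (!groups.has(key)) {
      // Only the first document seen for each key contributes its value.
      groups.set(key, { _id: key, [outField]: doc[valueField] });
    }
  }
  return Array.from(groups.values());
}

// Made-up sample data.
const sampleDocs = [
  { field1: "a", field2: 1 },
  { field1: "b", field2: 2 },
  { field1: "a", field2: 3 },
];
const grouped = groupFirst(sampleDocs, "field1", "field2", "another_field");
console.log(grouped);
```

As in the sketch, which document counts as "first" depends on the order in which documents reach the grouping step, so a sort before the group is needed if a particular one is wanted.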
If the number of unique values in the result does not exceed 20,000, you could use the group() function like this:
db.mycollection.group({ key : {field1 : 1, field2 : 1}, reduce : function(curr, result) {}, initial : {} })
Last option would be map reduce:
db.mycollection.mapReduce( function() { emit( {field1 : this.field1, field2 : this.field2}, 1); }, function(key, values) { return 1; }, {out: {replace: "unique_field1_field2"}})
Your result will then be in the "unique_field1_field2" collection.
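To see why emitting the pair of field values as the key yields one output document per unique combination, here is a small in-memory imitation of the map and reduce steps in plain JavaScript. The function uniquePairs and the sample data are hypothetical; this is only a sketch of the idea, not the server's map-reduce engine.

```javascript
// Hypothetical in-memory imitation of the map and reduce steps; illustration only.
function uniquePairs(docs) {
  const emitted = new Map();
  for (const doc of docs) {
    // "map": emit the pair of field values as the key, with a dummy value of 1.
    const key = { field1: doc.field1, field2: doc.field2 };
    const keyId = JSON.stringify(key);
    // "reduce": all values emitted for the same key collapse to a single 1,
    // so each distinct key ends up with exactly one output document.
    emitted.set(keyId, { _id: key, value: 1 });
  }
  return Array.from(emitted.values());
}

// Made-up sample data with one duplicated pair.
const pairs = uniquePairs([
  { field1: "a", field2: 1 },
  { field1: "a", field2: 1 },
  { field1: "a", field2: 2 },
]);
console.log(pairs.length);
```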
Another alternative is to use the distinct function:
db.mycollection.distinct('field1')
This function accepts a second argument, a query, with which you can filter the documents.
I have to return 2 documents from a single query: the document whose value I give in the query, and the one immediately before it in sorted order.
I am able to write both queries separately. The code below gives separate outputs.
db.collection.find({'_id':'value1'})
db.collection.find({'_id': {'$lt': 'value1'}}).sort({'_id':-1}).limit(1)
How do I combine them, so that when I execute the query from my application it returns both documents?
I also want to fetch only a specific key instead of the entire document.
You can use $lte instead of $lt and limit the result to 2; logically it is the same operation:
db.collection.find({ _id: { $lte: 'value1' } }, { _id: 1, yourKey: 1 }).sort({_id: -1}).limit(2)
EDIT: to get specific keys, you need to specify them as the second argument of .find().
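As a rough check of the "$lte plus limit 2" idea, the following plain-JavaScript sketch applies the same three steps (filter, descending sort, limit) to an in-memory array. The function name targetAndPrevious and the sample _id values are invented for illustration.

```javascript
// Hypothetical stand-in for
//   find({ _id: { $lte: target } }).sort({ _id: -1 }).limit(2)
// applied to a plain array instead of a collection.
function targetAndPrevious(docs, target) {
  return docs
    .filter((doc) => doc._id <= target) // $lte keeps the target itself
    .sort((a, b) => (a._id < b._id ? 1 : a._id > b._id ? -1 : 0)) // descending by _id
    .slice(0, 2); // limit(2)
}

// Made-up sample data, deliberately unsorted.
const found = targetAndPrevious(
  [{ _id: "value2" }, { _id: "value0" }, { _id: "value1" }, { _id: "valu" }],
  "value1"
);
console.log(found);
```

One caveat the sketch makes visible: if no document has the target _id, the result is the two closest smaller documents, which differs from what the two separate queries would return.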
My Query below:
db.chats.find({ bid: 'someID' }).sort({start_time: 1}).limit(10).skip(82560).pretty()
I have an index on the chats collection on these fields, in this order:
{
    "cid" : 1,
    "bid" : 1,
    "start_time" : 1
}
I am trying to perform a sort, but when I write the query and check the result of explain(), I still get this winningPlan:
{
    "stage":"SKIP",
    "skipAmount":82560,
    "inputStage":{
        "stage":"SORT",
        "sortPattern":{
            "start_time":1
        },
        "limitAmount":82570,
        "inputStage":{
            "stage":"SORT_KEY_GENERATOR",
            "inputStage":{
                "stage":"COLLSCAN",
                "filter":{
                    "ID":{
                        "$eq":"someID"
                    }
                },
                "direction":"forward"
            }
        }
    }
}
I was expecting not to have a SORT stage in the winning plan, since I have indexes created for that collection.
Having no usable index results in the following error:
MongoError: OperationFailed: Sort operation used more than the maximum 33554432 bytes of RAM
I managed to make the sort work by increasing the RAM allocation for sorts from 32 MB to 64 MB, but I am looking for help in adding the indexes properly.
The order of fields in an index matters. To sort query results by a field that is not at the start of the index key pattern the query must include equality conditions on all the prefix keys that precede the sort keys. The cid field is not in the query nor used for sorting, so you must leave it out. Then you put the bid field first in the index definition as you use it in the equality condition. The start_time goes after that to be used in sorting. Finally, the index must look like this:
{"bid" : 1, "start_time" : 1}
See the documentation for further reference.
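The prefix rule described above can be written down as a small check. The following plain-JavaScript sketch is a simplification under stated assumptions: it only handles a single sort field, ignores sort direction and range predicates, and the function name indexSupportsSort is hypothetical, not part of any MongoDB API.

```javascript
// Hypothetical, simplified check of the rule: an index can provide the sort
// order only if every index key that precedes the sort key has an equality
// condition in the query. Sort direction and range predicates are ignored.
function indexSupportsSort(indexFields, equalityFields, sortField) {
  const sortPos = indexFields.indexOf(sortField);
  if (sortPos === -1) return false; // the sort key is not in the index at all
  return indexFields
    .slice(0, sortPos)
    .every((field) => equalityFields.includes(field));
}

// The index from the question: cid precedes start_time but is not in the query.
console.log(indexSupportsSort(["cid", "bid", "start_time"], ["bid"], "start_time")); // false
// The suggested index: bid is matched by equality, start_time follows it.
console.log(indexSupportsSort(["bid", "start_time"], ["bid"], "start_time")); // true
```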
Imagine a collection with about 5,000,000 documents. I need to do a basicCursor query to select ~100 documents based on too many fields to index. Let's call this the basicCursorMatch. This will be immensely slow.
I can, however, do a bTreeCursor query on a few indexes that will limit my search to ~500 documents. Let's call this query the bTreeCursorMatch.
Is there a way I can do this basicCursorMatch directly on the cursor or collection resulting from the bTreeCursorMatch?
Intuitively I tried
var cursor = collection.find(bTreeCursorMatch);
var results = cursor.find(basicCursorMatch);
similar to collection.find(bTreeCursorMatch).find(basicCursorMatch), which doesn't seem to work.
Alternatively, I was hoping I could do something like this:
collection.aggregate([
{$match: bTreeCursorMatch}, // Uses index 5,000,000 -> 500 fast
{$match: basicCursorMatch}, // No index, 500 -> 100 'slow'
{$sort}
]);
.. but it seems that I cannot do this either. Is there an alternative to do what I want?
The reason I am asking is because this second query will differ a lot and there is no way I can index all the fields. But I do want to make that first query using a bTreeCursor, otherwise querying the whole collection will take forever using a basicCursor.
update
Also, through user input the subselection of 500 documents will be queried in different ways during a session with an unpredictable basicCursor query, using multiple $in $eq $gt $lt. But during this, the bTreeCursor subselection remains the same. Should I just keep doing both queries for every user query, or is there a more efficient way to keep a reference to this collection?
In practice, you rarely need to run second queries on a cursor. In particular, you don't need to break MongoDB's work into separate indexable / non-indexable chunks.
If you pass a query to MongoDB's find method that can be partially fulfilled by a look-up in an index, MongoDB will do that look-up first, and then do a full scan on the remaining documents.
For instance, I have a collection users with documents like:
{ _id : 4, gender : "M", ... }
There is an index on _id, but not on gender. There are ~200M documents in users.
To get an idea of what MongoDB is doing under the hood, add explain() to your cursor (in the Mongo shell):
> db.users.find( { _id : { $gte : 1, $lt : 10 } } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 9,
"nscannedObjects" : 9
}
I have cut out some of the fields returned by explain. Basically, cursor tells you whether it's using an index, n tells you the number of documents returned by the query, and nscannedObjects is the number of objects scanned during the query. In this case, MongoDB was able to scan exactly the right number of objects.
What happens if we now query on gender as well?
> db.users.find( { _id : { $gte : 1, $lt : 10 }, gender : "F" } ).explain()
{
"cursor" : "BtreeCursor oldId_1_state_1",
"n" : 5,
"nscannedObjects" : 9
}
find returns 5 objects, but had to scan 9 documents. It was therefore able to isolate the correct 9 documents using the _id field. It then went through all 9 documents and filtered them by gender.
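The two phases described above can be imitated in plain JavaScript to show where the two counters come from. This is only a sketch: the array filter stands in for the index range scan, the function name indexedThenFiltered and the generated users are made up, and the numbers it prints belong to this sample data, not to the explain output quoted above.

```javascript
// Hypothetical imitation of "narrow by index first, then filter the rest".
function indexedThenFiltered(docs, minId, maxId, predicate) {
  // Phase 1: stand-in for the index range scan on _id ($gte minId, $lt maxId).
  const candidates = docs.filter((d) => d._id >= minId && d._id < maxId);
  // Phase 2: the non-indexed condition is checked only against the candidates.
  const matches = candidates.filter(predicate);
  return { n: matches.length, nscannedObjects: candidates.length, docs: matches };
}

// Made-up data: 20 users, even ids are "F", odd ids are "M".
const users = [];
for (let id = 1; id <= 20; id++) {
  users.push({ _id: id, gender: id % 2 === 0 ? "F" : "M" });
}

const stats = indexedThenFiltered(users, 1, 10, (d) => d.gender === "F");
console.log(stats.n, stats.nscannedObjects);
```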
I have a MongoDB query where I want to get documents if a field has a particular value.
db.collection.find({key:{$in:['value1','value2']}})
If I run the above command, I get documents containing either 'value1' or 'value2'. But let's say there are no values, and I search with
db.collection.find({key:{$in:[]}})
Nothing is displayed. And
db.collection.find({key:{$in:[*]}})
gives "unexpected token *". Which wildcard do I use in $in to show all results?
I think this is logically consistent behavior for $in. The query
db.collection.find({ "key" : { "$in" : [] } })
could be translated as "find all the documents where the value of key is one of the values contained in the array []". Since there are no values in the array [], there are no matching documents. If you want to find all of the extant values for key, use .distinct to return them as an array:
db.collection.distinct("key")
.distinct will use an index if possible.
If you want a query to match all documents, omit the query selector from .find:
db.collection.find()
as suggested in the comments.
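If the intent is "an empty list means no filter", one option is to build the selector conditionally in application code. The sketch below assumes that intent; buildSelector is a hypothetical helper name, and the returned object would simply be passed as the first argument to .find().

```javascript
// Hypothetical helper: returns an empty selector (matches every document)
// when there are no values, and an $in selector otherwise.
function buildSelector(field, values) {
  if (!Array.isArray(values) || values.length === 0) {
    return {}; // no filter at all, same as calling find() with no selector
  }
  return { [field]: { $in: values } };
}

console.log(JSON.stringify(buildSelector("key", [])));
console.log(JSON.stringify(buildSelector("key", ["value1", "value2"])));
// Intended use in the shell (not executed here):
//   db.collection.find(buildSelector("key", selectedValues))
```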
I have a MongoDB collection that looks something like this:
{
{
_id: objectId('aabbccddeeff'),
objectName: 'MyFirstObject',
objectLength: 0xDEADBEEF,
objectSource: 'Source1',
accessCounter: {
'firstLocationCode' : 283,
'secondLocationCode' : 543,
'ThirdLocationCode' : 564,
'FourthLocationCode' : 12,
}
}
...
}
Now, assuming that this is not the only record in the collection and that most or all of the documents contain the accessCounter subdocument, how would I go about selecting the first x documents that have the most accesses from a specific location?
A sample "query" will be something like:
"Select the first 10 documents from myCollection where accessCounter.firstLocationCode is the highest"
So a sample result would be X documents in which the chosen accessCounter value is the greatest in the database.
Thank you for taking the time to read my question.
No need for an aggregation, that is a basic query:
db.collection.find().sort({"accessCounter.firstLocationCode":-1}).limit(10)
In order to speed this up, you should create a subdocument index on accessCounter first:
db.collection.ensureIndex({'accessCounter':-1})
assuming that you want to run the same query for all locations. If you only want to query firstLocationCode, create the index on accessCounter.firstLocationCode instead.
You can speed this up further, if you only need the accessCounter value, by making this a so-called covered query: a query whose returned values come from the index itself. For example, when you have the subdocument indexed and you query for the top secondLocationCode values, you should be able to do a covered query with:
db.collection.find({},{_id:0,"accessCounter.secondLocationCode":1})
.sort({"accessCounter.secondLocationCode":-1}).limit(10)
which translates to "Get all documents ('{}'), don't return the _id field as you do by default ('_id:0'), get only the 'accessCounter.secondLocationCode' field ('accessCounter.secondLocationCode:1'). Sort the returned values in descending order and give me the first ten."
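For readers who want to see the "sort descending on a nested field, take the first N" logic without a database, here is a plain-JavaScript sketch over an in-memory array. The helper topByPath and the sample documents are invented for illustration; unlike a real MongoDB sort, it simply drops documents that lack a numeric value at the path.

```javascript
// Hypothetical helper: top `limit` documents by a dotted field path, descending.
// Simplification: documents without a numeric value at the path are skipped.
function topByPath(docs, path, limit) {
  const read = (doc) =>
    path.split(".").reduce((value, key) => (value == null ? undefined : value[key]), doc);
  return docs
    .filter((doc) => typeof read(doc) === "number")
    .sort((a, b) => read(b) - read(a)) // descending, like sort({ path: -1 })
    .slice(0, limit); // like limit(n)
}

// Made-up sample data shaped like the documents in the question.
const objects = [
  { objectName: "A", accessCounter: { firstLocationCode: 283 } },
  { objectName: "B", accessCounter: { firstLocationCode: 900 } },
  { objectName: "C", accessCounter: { firstLocationCode: 12 } },
  { objectName: "D" }, // no accessCounter at all
];
const top2 = topByPath(objects, "accessCounter.firstLocationCode", 2);
console.log(top2.map((doc) => doc.objectName));
```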