MongoDB aggregation framework to count distinct array items - mongodb

I have some docs like:
{ tags: { first_cat: ["a", "b", "c"], second_cat : ["1","2","3"]}}
{ tags: { first_cat: ["d", "b", "a"], second_cat : ["1"]}}
I need something like this:
{ first_cat: [{"a" : 2}, {"b" : 2}, {"c" : 1}, {"d" : 1}], second_cat: [{"1" : 2}, {"2" : 1}, {"3" : 1}] }
With map/reduce it's quite easy to do (but slow); is it possible to get a similar result with the aggregation framework?

You cannot do this with the Aggregation Framework, as there is no way to convert an arbitrary value "a" into a key like { "a": 2 }. You will need to redesign your schema.
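As a client-side workaround (a plain-JavaScript sketch, not the Aggregation Framework), the counts can be computed in application code after fetching the documents:

```javascript
// Count occurrences of each tag value per category, client-side.
function countTags(docs) {
  const result = {};
  for (const doc of docs) {
    for (const [cat, values] of Object.entries(doc.tags)) {
      result[cat] = result[cat] || {};
      for (const v of values) {
        result[cat][v] = (result[cat][v] || 0) + 1;
      }
    }
  }
  return result;
}

const docs = [
  { tags: { first_cat: ["a", "b", "c"], second_cat: ["1", "2", "3"] } },
  { tags: { first_cat: ["d", "b", "a"], second_cat: ["1"] } },
];
const counts = countTags(docs);
// counts.first_cat → { a: 2, b: 2, c: 1, d: 1 }
```

This trades server-side processing for a full fetch of the documents, so it only makes sense for modest result sets.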

Related

MongoDB Aggregation - Does $unwind order documents the same way as the nested array order

I am wondering whether using the $unwind operator in an aggregation pipeline on a document with a nested array will return the deconstructed documents in the same order as the items in the array.
Example:
Suppose I have the following documents
{ "_id" : 1, "item" : "foo", values: [ "foo", "foo2", "foo3"] }
{ "_id" : 2, "item" : "bar", values: [ "bar", "bar2", "bar3"] }
{ "_id" : 3, "item" : "baz", values: [ "baz", "baz2", "baz3"] }
I would like to use paging over all values in all documents in my application code. So, my idea is to use the MongoDB aggregation framework to:
sort the documents by _id
use $unwind on values attribute to deconstruct the documents
use $skip and $limit to simulate paging
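As a sketch, the full pipeline (with hypothetical page and pageSize parameters) would be:

```javascript
const page = 0, pageSize = 10; // hypothetical paging parameters

const pipeline = [
  { $sort: { _id: 1 } },       // sort the documents by _id
  { $unwind: "$values" },      // deconstruct the values array
  { $skip: page * pageSize },  // paging: skip the previous pages
  { $limit: pageSize },        // paging: return one page of results
];
```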
So the question using the example described above is:
Is it guaranteed that the following aggregation pipeline:
[
{$sort: {"_id": 1}},
{$unwind: "$values"}
]
will always yield the following documents in exactly this order?
{ "_id" : 1, "item" : "foo", values: "foo" }
{ "_id" : 1, "item" : "foo", values: "foo2" }
{ "_id" : 1, "item" : "foo", values: "foo3" }
{ "_id" : 2, "item" : "bar", values: "bar" }
{ "_id" : 2, "item" : "bar", values: "bar2" }
{ "_id" : 2, "item" : "bar", values: "bar3" }
{ "_id" : 3, "item" : "baz", values: "baz" }
{ "_id" : 3, "item" : "baz", values: "baz2" }
{ "_id" : 3, "item" : "baz", values: "baz3" }
I also asked the same question in the MongoDB community forum. An answer confirming my assumption was posted by a member of MongoDB staff.
Briefly:
Yes, the order of the returned documents in the example above will always be the same: it follows the order of the array field.
If you do run into issues with order, you could use includeArrayIndex to guarantee it.
[
{$unwind: {
    path: '$values',
    includeArrayIndex: 'arrayIndex'
}},
{$sort: {
    _id: 1,
    arrayIndex: 1
}},
{$project: {
    arrayIndex: 0
}}
]
From what I see at https://github.com/mongodb/mongo/blob/0cee67ce6909ca653462d4609e47edcc4ac5c1a9/src/mongo/db/pipeline/document_source_unwind.cpp
the cursor iterator uses the getNext() method to unwind an array:
DocumentSource::GetNextResult DocumentSourceUnwind::doGetNext() {
    auto nextOut = _unwinder->getNext();
    while (nextOut.isEOF()) {
        .....
        // Try to extract an output document from the new input document.
        _unwinder->resetDocument(nextInput.releaseDocument());
        nextOut = _unwinder->getNext();
    }
    return nextOut;
}
And the getNext() implementation relies on the array's index:
DocumentSource::GetNextResult DocumentSourceUnwind::Unwinder::getNext() {
    ....
    // Set field to be the next element in the array. If needed, this will automatically
    // clone all the documents along the field path so that the end values are not shared
    // across documents that have come out of this pipeline operator. This is a partial deep
    // clone. Because the value at the end will be replaced, everything along the path
    // leading to that will be replaced in order not to share that change with any other
    // clones (or the original).
    _output.setNestedField(_unwindPathFieldIndexes, _inputArray[_index]);
    indexForOutput = _index;
    _index++;
    _haveNext = _index < length;
    .....
    return _haveNext ? _output.peek() : _output.freeze();
}
So unless there is anything upstream that messes with the documents' order, the cursor should return the unwound docs in the same order as the subdocuments were stored in the array.
I don't recall how the merger works for sharded collections, and I imagine there might be a case where documents from other shards are returned between two consecutive unwound documents. What this snippet of code guarantees is that an unwound document with the next item from the array will never be returned before the unwound document with the previous item from the array.
As a side note, having a million items in an array is quite an extreme design: even 20-byte items in the array will exceed the 16MB document limit.
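To illustrate the behaviour, the unwinding step can be mimicked in plain JavaScript (a sketch, not the server implementation): each output document takes the next element of the array, in array order.

```javascript
// Mimic $unwind with includeArrayIndex: one output document per array
// element, emitted in array order.
function unwind(doc, field, indexField) {
  return doc[field].map((value, i) => ({
    ...doc,
    [field]: value,
    ...(indexField ? { [indexField]: i } : {}),
  }));
}

const out = unwind(
  { _id: 1, item: "foo", values: ["foo", "foo2", "foo3"] },
  "values", "arrayIndex"
);
// out[0].values → "foo", out[1].values → "foo2", out[2].values → "foo3"
```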

Is two-way referencing more efficient in Mongo for a 1 to N relationship?

I have a discussion at work about two-way referencing in a 1 to N relationship. According to this post on the MongoDB blog, you can do it. We wouldn't need atomic updates at all, so no problem there. Following the example in the article, in our case you can only create or delete a task, not change the task's owner.
My argument is that two-way referencing is probably more efficient for fetching data from both sides, as we will need to display more often the owner with their tasks and less often just the tasks, in different parts of the program. My colleague says there won't be an efficiency gain and the data duplication is not worth it.
Do you have any info about the efficiency of this approach?
De-normalizing and duplicating the data helps when we have fewer writes and more reads. Here the efficiency depends on how the data is retrieved: if retrieving data from the collections requires two-way referencing and we already have it, then it certainly improves the efficiency of our queries.
Student collection
{ _id:1, name: "Joseph", courses:[1, 3, 4]}
{ _id:2, name: "Mary", courses:[1, 3]}
{ _id:3, name: "Catherine", courses:[1, 2, 4]}
{ _id:4, name: "Robert", courses:[2, 4]}
Course Collection
{ _id:1, name: "Math101", students: [1, 2, 3]}
{ _id:2, name: "Science101", students: [3, 4]}
{ _id:3, name: "History101", students: [1, 2]}
{ _id:4, name: "Astronomy101", students: [1, 3, 4]}
Consider the above example of Students and Courses, where two-way referencing is used: the courses array in the Students collection gives us the courses taken by each student, and likewise the students array in the Courses collection gives us the students taking each course.
If we want to list the students who are studying Math101, the query would be:
db.courses.aggregate([
    {$match: {name: "Math101"}},
    {$unwind: "$students"},
    {$lookup: {
        from: "students",
        localField: "students",
        foreignField: "_id",
        as: "result"
    }}
])
$match, $unwind, and $lookup are used in the aggregation pipeline to achieve the result: $match to reduce the data (it is good to use this operator at the start of the pipeline), $unwind to deconstruct the students array in the Courses collection, and $lookup to look into the Students collection and fetch the student details.
The result after executing the above aggregation query on our sample collections is
{
"_id" : 1,
"name" : "Math101",
"students" : 1,
"result" : [
{
"_id" : 1,
"name" : "Joseph",
"courses" : [
1,
3,
4
]
}
]
}
{
"_id" : 1,
"name" : "Math101",
"students" : 2,
"result" : [
{
"_id" : 2,
"name" : "Mary",
"courses" : [
1,
3
]
}
]
}
{
"_id" : 1,
"name" : "Math101",
"students" : 3,
"result" : [
{
"_id" : 3,
"name" : "Catherine",
"courses" : [
1,
2,
4
]
}
]
}
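The match/unwind/lookup sequence above can be mimicked in plain JavaScript (a sketch of the semantics, not the server implementation), using the sample collections:

```javascript
// Sketch of the $match / $unwind / $lookup sequence from the pipeline above.
const students = [
  { _id: 1, name: "Joseph",    courses: [1, 3, 4] },
  { _id: 2, name: "Mary",      courses: [1, 3] },
  { _id: 3, name: "Catherine", courses: [1, 2, 4] },
];
const courses = [
  { _id: 1, name: "Math101", students: [1, 2, 3] },
];

const result = courses
  .filter(c => c.name === "Math101")                           // $match
  .flatMap(c => c.students.map(s => ({ ...c, students: s })))  // $unwind
  .map(c => ({ ...c,                                           // $lookup
    result: students.filter(s => s._id === c.students),
  }));
// result[0].result[0].name → "Joseph"
```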
The efficiency of two-way referencing depends purely on what we retrieve, so design your schema to align closely with your expected queries.

MongoDB: Find documents with key in subdocument

The $in operator works with arrays.
Is there an equivalent for dictionaries?
The following code creates two test documents and finds the ones containing one of the listed values in the lst array, but doesn't find the ones containing the same values as keys of the dic sub-document.
> use test
> db.stuff.drop()
> db.stuff.insertMany([{lst:['a','b'],dic:{a:1,b:2}},{lst:['a','c'],dic:{a:3,c:4}}])
{
"acknowledged" : true,
"insertedIds" : [
ObjectId("595bbe8b3b0518bcca4b1530"),
ObjectId("595bbe8b3b0518bcca4b1531")
]
}
> db.stuff.find({lst:{$in:['b','c']}},{_id:0})
{ "lst" : [ "a", "b" ], "dic" : { "a" : 1, "b" : 2 } }
{ "lst" : [ "a", "c" ], "dic" : { "a" : 3, "c" : 4 } }
> db.stuff.find({dic:{$in:['b','c']}},{_id:0})
>
EDIT (in response to the answer below)
Using an array as suggested in the answer below prevents me from accessing the desired element directly. For example, after executing both the insertMany above in this question and the one below in the answer, the following can be done with a dictionary but not with an array (or am I missing something?):
> x=db.stuff.findOne({lst:{$in:['b','c']}},{_id:0})
{ "lst" : [ "a", "b" ], "dic" : { "a" : 1, "b" : 2 } }
> x
{ "lst" : [ "a", "b" ], "dic" : { "a" : 1, "b" : 2 } }
> x.dic.a
1
> x.dic.b
2
For subdocuments, there's no exact equivalent to $in. You could use the $exists query operator combined with $or:
db.stuff.find({$or:[
{'dic.b': {$exists: true}},
{'dic.c': {$exists: true}}
]})
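If the key list is dynamic, the $or clause can be built programmatically (a sketch, assuming the same 'dic' field name as above):

```javascript
// Build a {$or: [...]} query matching documents whose 'dic' subdocument
// contains any of the given keys.
function keysToOrQuery(keys) {
  return { $or: keys.map(k => ({ ["dic." + k]: { $exists: true } })) };
}

const query = keysToOrQuery(["b", "c"]);
// query → { $or: [ { "dic.b": { $exists: true } },
//                  { "dic.c": { $exists: true } } ] }
```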
The recommended approach, however, is to change your schema, so that the keys and values are changed into an array of {key: "key", value: 123} subdocuments:
db.stuff.insertMany([
{dic: [{key: 'a', value: 1}, {key: 'b', value: 2}]},
{dic: [{key: 'a', value: 3}, {key: 'c', value: 4}]}
])
Then you can use $in to find documents with certain keys:
db.stuff.find({'dic.key': {$in: ['a', 'b']}})
The especially good thing about this new schema is you can use an index for the $in query:
db.stuff.createIndex({'dic.key': 1})
A disadvantage, as you point out above, is that simple element access like x.dic.a no longer works. You need to do a bit of coding in your language, e.g. in JavaScript:
> var doc = {dic: [{key: 'a', value: 3}, {key: 'c', value: 4}]}
> function getValue(doc, key) {
... return doc.dic.filter(function(elem) {
... return elem.key == key;
... })[0].value;
... }
> getValue(doc, "a")
3
> getValue(doc, "c")
4

Finding two documents in MongoDB that share a key value

I have a large collection of documents in MongoDB, each one of those documents has a key called "name", and another key called "type". I would like to find two documents with the same name and different types, a simple MongoDB counterpart of
SELECT ...
FROM table AS t1, table AS t2
WHERE t1.name = t2.name AND t1.type <> t2.type
I can imagine one could do this using aggregation; however, the collection is very large, processing it all would take time, and I'm looking for just one pair of such documents.
While I stand by my comments that I don't think the way you are phrasing your question is actually related to the specific problem you have, I will go some way towards explaining the idiomatic SQL approach as a MongoDB-style solution. I maintain that your actual solution would be different, but you haven't presented us with that problem, only the SQL.
So consider the following documents as a sample set, removing _id fields in this listing for clarity:
{ "name" : "a", "type" : "b" }
{ "name" : "a", "type" : "c" }
{ "name" : "b", "type" : "c" }
{ "name" : "b", "type" : "a" }
{ "name" : "a", "type" : "b" }
{ "name" : "b", "type" : "c" }
{ "name" : "f", "type" : "e" }
{ "name" : "z", "type" : "z" }
{ "name" : "z", "type" : "z" }
If we ran the SQL presented over the same data we would get this result:
a|b
a|c
a|c
b|c
b|a
b|a
a|b
b|c
We can see that 2 documents do not match, and can then work out the logic of the SQL operation. Put another way: "Which documents, for a given value of "name", have more than one possible value of "type"?"
Given that, taking a mongo approach, we can query for the items that do not match the given condition. So effectively the reverse of the result:
db.sample.aggregate([
// Store unique documents grouped by the "name"
{$group: {
_id: "$name",
comp: {
$addToSet: {
name:"$name",
type: "$type"
}
}
}},
// Unwind the "set" results
{$unwind: "$comp"},
// Push the results back to get the unique count
// *note* you could not have done this alongside $addToSet
{$group: {
_id: "$_id",
comp: {
$push: {
name: "$comp.name",
type: "$comp.type"
}
},
count: {$sum: 1}
}},
// Match only what was counted once
{$match: {count: 1}},
// Unwind the array
{$unwind: "$comp"},
// Clean up to "name" and "type" only
{$project: { _id: 0, name: "$comp.name", type: "$comp.type"}}
])
This operation will yield the results:
{ "name" : "f", "type" : "e" }
{ "name" : "z", "type" : "z" }
Now in order to get the same result as the SQL query we would take those results and channel them into another query:
db.sample.find({$nor: [{ name: "f", type: "e"},{ name: "z", type: "z"}] })
Which arrives as the final matching result:
{ "name" : "a", "type" : "b" }
{ "name" : "a", "type" : "c" }
{ "name" : "b", "type" : "c" }
{ "name" : "b", "type" : "a" }
{ "name" : "a", "type" : "b" }
{ "name" : "b", "type" : "c" }
So this will work; however, the one thing that may make it impractical is when the number of documents being compared is very large, as we hit a working limit when compacting those results down to an array.
It also suffers a bit from the use of a negative in the final find operation, which forces a scan of the collection. But in all fairness, the same could be said of the SQL query, which uses the same negative premise.
Edit
Of course what I did not mention is that if the result set goes the other way around and you are matching more results in the excluded items from the aggregate, then just reverse the logic to get the keys that you want. Simply change $match as follows:
{$match: {count: {$gt: 1}}}
And that will be the result, maybe not the actual documents but it is a result. So you don't need another query to match the negative cases.
And ultimately, this was my fault, because I was so focused on the idiomatic translation that I did not read the last line of your question, where you do say that you are looking for just one document.
Of course, currently, if that result size is larger than 16MB then you are stuck. At least until the 2.6 release, where the results of aggregation operations are returned as a cursor, so you can iterate them like a .find().
Also introduced in 2.6 is the $size operator, which is used to find the size of an array in a document. This helps to remove the intermediate $unwind and $group that were used only to get the length of the set, altering the query to a faster form:
db.sample.aggregate([
{$group: {
_id: "$name",
comp: {
$addToSet: {
name:"$name",
type: "$type"
}
}
}},
{$project: {
comp: 1,
count: {$size: "$comp"}
}},
{$match: {count: {$gt: 1}}},
{$unwind: "$comp"},
{$project: { _id: 0, name: "$comp.name", type: "$comp.type"}}
])
And MongoDB 2.6.0-rc0 is currently available if you are doing this just for personal use, or development/testing.
Moral of the story: yes, you can do it. But do you really want or need to do it that way? Probably not; and if you asked a different question about the specific business case, you might get a different answer. But then again, this may be exactly right for what you want.
Note
Worth mentioning that when you look at the results from the SQL, several items are erroneously duplicated due to the other available type options, unless you used DISTINCT for those values or essentially another grouping. But that is the result being reproduced by this process using MongoDB.
For Alexander
This is the output of the aggregate in the shell from current 2.4.x versions:
{
"result" : [
{
"name" : "f",
"type" : "e"
},
{
"name" : "z",
"type" : "z"
}
],
"ok" : 1
}
So do this to get a variable to pass as the argument to the $nor condition in the second find, like this:
var cond = db.sample.aggregate([ .....
db.sample.find({$nor: cond.result })
And you should get the same results. Otherwise consult your driver.
There is a very simple aggregation that gets you the names that occur with more than one type:
db.collection.aggregate([
{ $group: { _id : "$name",
count:{$sum:1},
types:{$addToSet:"$type"}}},
{$match:{"types.1":{$exists:true}}}
])
This works in all versions that support aggregation framework.
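For reference, the grouping logic of that pipeline can be mimicked in plain JavaScript (a sketch): collect the distinct types per name and keep the names with more than one.

```javascript
// Names that occur with more than one distinct type.
function namesWithMultipleTypes(docs) {
  const typesByName = new Map();
  for (const { name, type } of docs) {
    if (!typesByName.has(name)) typesByName.set(name, new Set());
    typesByName.get(name).add(type);
  }
  return [...typesByName]
    .filter(([, types]) => types.size > 1)
    .map(([name]) => name);
}

const sample = [
  { name: "a", type: "b" }, { name: "a", type: "c" },
  { name: "b", type: "c" }, { name: "b", type: "a" },
  { name: "f", type: "e" }, { name: "z", type: "z" },
  { name: "z", type: "z" },
];
// namesWithMultipleTypes(sample) → ["a", "b"]
```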

Mongodb: find embedded element missing some key

I have a document with an embedded collection, but few elements are missing a key and I have to find all those elements. Here is an example:
var foo = {name: 'foo', embedded: [{myKey: "1", value: 3}, {myKey: "2", value: 3}]}
db.example.insert(foo)
var bar = {name: 'bar', embedded: [{value: 4}, {myKey: "3", value: 1}]}
db.example.insert(bar)
I need a query that returns the 'bar' object because one of its embedded doesn't have the key 'myKey'.
I tried to use $exists, but it only matches when ALL embedded elements are missing the key:
db.example.find({'embedded.myKey': {$exists: true}}).size()
// -> 2
db.example.find({'embedded.myKey': {$exists: false}}).size()
// -> 0
How can I find the documents that at least one embedded element is missing the key 'myKey'?
If 'value' is always present, then you can try this command
db.example.find({ embedded : { $elemMatch : { value : {$exists : true}, myKey : {$exists : false}} }})
{ "_id" : ObjectId("518bbccbc9e49428608691b0"), "name" : "bar", "embedded" : [ { "value" : 4 }, { "myKey" : "3", "value" : 1 } ] }
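The $elemMatch condition above corresponds to this plain-JavaScript predicate (a sketch): keep the documents where at least one embedded element has 'value' but lacks 'myKey'.

```javascript
// Documents where at least one embedded element is missing 'myKey'.
function missingMyKey(docs) {
  return docs.filter(d =>
    d.embedded.some(e => "value" in e && !("myKey" in e)));
}

const docs = [
  { name: "foo", embedded: [{ myKey: "1", value: 3 }, { myKey: "2", value: 3 }] },
  { name: "bar", embedded: [{ value: 4 }, { myKey: "3", value: 1 }] },
];
// missingMyKey(docs) → [ the 'bar' document ]
```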