Comparing documents between two MongoDB collections

I have two existing collections and need to populate a third collection based on the comparison between the two existing.
The two collections that need to be compared have the following schema:
// Settings collection:
{
  "Identifier": "ABC123",
  "C": "1",
  "U": "V",
  "Low": 116,
  "High": 124,
  "ImportLogId": 1
}
// Data collection:
{
  "Identifier": "ABC123",
  "C": "1",
  "U": "V",
  "Date": "11/6/2013 12AM",
  "Value": 128,
  "ImportLogId": 1
}
I am new to MongoDB and NoSQL in general so I am having a tough time grasping how to do this. The SQL would look something like this:
SELECT s.Identifier, r.ReadValue, r.U, r.C, r.Date
FROM Settings s
JOIN Reads r
ON s.Identifier = r.Identifier
AND s.C = r.C
AND s.U = r.U
WHERE (r.Value <= s.Low OR r.Value >= s.High)
In this case using the sample data, I would want to return a record because the value from the Data collection is greater than the high value from the setting collection. Is this possible using Mongo queries or map reduce, or is this bad collection structure (i.e. maybe all of this should be in one collection)?
A few more additional notes:
The Settings collection should really only have 1 record per "Identifier". The Data collection will have many records per "Identifier". This process could potentially be scanning hundreds of thousands of documents at one time, so resource consideration is somewhat important

There is no good way of performing an operation like this in MongoDB. If you want a bad way, you can use code like this:
db.settings.find().forEach(
  function(doc) {
    var data = db.data.find({
      Identifier: doc.Identifier,
      C: doc.C,
      U: doc.U,
      $or: [{Value: {$lte: doc.Low}}, {Value: {$gte: doc.High}}]
    }).toArray();
    // Do what you need
  }
)
but don't expect it to perform even remotely as well as any decent RDBMS.
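If you do go this route anyway, a compound index on the data collection should at least keep each inner query from scanning everything. A minimal sketch only (the field order here is my suggestion, not something from the original answer):
// Hypothetical supporting index: equality fields first, then the range field used in the $or
db.data.createIndex({ Identifier: 1, C: 1, U: 1, Value: 1 })
With the equality fields leading, each find() in the loop can be resolved from the index prefix, and Value is available for the range checks.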
You could rebuild your schema and embed documents from data collection like this:
{
  "_id" : ObjectId("527a7f4b07c17a1f8ad009d2"),
  "Identifier" : "ABC123",
  "C" : "1",
  "U" : "V",
  "Low" : 116,
  "High" : 124,
  "ImportLogId" : 1,
  "Data" : [
    {
      "Date" : ISODate("2013-11-06T00:00:00Z"),
      "Value" : 128
    },
    {
      "Date" : ISODate("2013-10-09T00:00:00Z"),
      "Value" : 99
    }
  ]
}
It may work if the number of embedded documents is low, but to be honest, working with arrays of documents is far from a pleasant experience. Not to mention that you can easily hit the document size limit as the Data array grows.
If this kind of operation is typical for your application, I would consider using a different solution. As much as I like MongoDB, it works well only with certain types of data and access patterns.

Without the concept of JOIN, you must change your approach and denormalize.
In your case, it looks like you're doing data log validation. My advice is to loop over the settings collection and, for each settings document, use the findAndModify operator to set a validation flag on the matching data collection records; after that, you can just use find on the data collection, filtering by the new flag.
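A rough sketch of that idea (the OutOfRange flag name is made up here, and I'm using a multi update rather than findAndModify, since many data records can match a single settings document):
db.settings.find().forEach(function(s) {
  // Flag every data record for this setting whose Value falls outside [Low, High]
  db.data.update(
    {
      Identifier: s.Identifier,
      C: s.C,
      U: s.U,
      $or: [ { Value: { $lte: s.Low } }, { Value: { $gte: s.High } } ]
    },
    { $set: { OutOfRange: true } },
    { multi: true }
  );
});
// The flagged records can then be read back (or copied to a third collection) with:
db.data.find({ OutOfRange: true })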

Starting in Mongo 4.4, we can achieve this type of "join" with the new $unionWith aggregation stage coupled with a classic $group stage:
// > db.settings.find()
// { "Identifier" : "ABC123", "C" : "1", "U" : "V", "Low" : 116 }
// { "Identifier" : "DEF456", "C" : "1", "U" : "W", "Low" : 416 }
// { "Identifier" : "GHI789", "C" : "1", "U" : "W", "Low" : 142 }
// > db.data.find()
// { "Identifier" : "ABC123", "C" : "1", "U" : "V", "Value" : 14 }
// { "Identifier" : "GHI789", "C" : "1", "U" : "W", "Value" : 43 }
// { "Identifier" : "ABC123", "C" : "1", "U" : "V", "Value" : 45 }
// { "Identifier" : "DEF456", "C" : "1", "U" : "W", "Value" : 8 }
db.data.aggregate([
  { $unionWith: "settings" },
  { $group: {
    _id: { Identifier: "$Identifier", C: "$C", U: "$U" },
    Values: { $push: "$Value" },
    Low: { $mergeObjects: { v: "$Low" } }
  }},
  { $match: { "Low.v": { $lt: 150 } } },
  { $out: "result-collection" }
])
// > db.result-collection.find()
// { _id: { Identifier: "ABC123", C: "1", U: "V" }, Values: [14, 45], Low: { v: 116 } }
// { _id: { Identifier: "GHI789", C: "1", U: "W" }, Values: [43], Low: { v: 142 } }
This:
Starts with a union of both collections into the pipeline via the new $unionWith stage.
Continues with a $group stage that:
Groups records based on Identifier, C and U
Accumulates Values into an array
Accumulates Lows via a $mergeObjects operation in order to get a value of Low that isn't null. Using a $first wouldn't work since this could potentially take null first (for elements from the data collection). Whereas $mergeObjects discards null values when merging an object containing a non-null value.
Then discards joined records whose Low value is bigger than let's say 150.
And finally output resulting records to a third collection via an $out stage.
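Adapting the same idea to the original question's out-of-range check (Value <= Low or Value >= High) is mostly a matter of carrying both bounds through the $group and filtering the accumulated values. This is only a sketch along those lines, and the output collection name "out_of_range" is made up:
db.data.aggregate([
  { $unionWith: "settings" },
  { $group: {
    _id: { Identifier: "$Identifier", C: "$C", U: "$U" },
    Values: { $push: "$Value" },                               // only data documents contribute a Value
    Limits: { $mergeObjects: { Low: "$Low", High: "$High" } }  // only settings documents contribute bounds
  }},
  { $project: {
    Limits: 1,
    OutOfRange: { $filter: {
      input: "$Values",
      as: "v",
      cond: { $or: [
        { $lte: [ "$$v", "$Limits.Low" ] },
        { $gte: [ "$$v", "$Limits.High" ] }
      ]}
    }}
  }},
  { $match: { "OutOfRange.0": { $exists: true } } },  // keep only groups with at least one offending value
  { $out: "out_of_range" }
])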

A feature we've developed called Data Compare & Sync might be able to help here.
It lets you compare two MongoDB collections and see the differences (e.g. spot the same, missing, or different fields).
You can then export these comparison results to a CSV file, and use that to create your new, third collection.
Disclosure: We are the creators of the MongoDB GUI, Studio 3T.

Related

MongoDB Aggregation - Does $unwind order documents the same way as the nested array order

I am wondering whether using the $unwind operator in an aggregation pipeline on a document with a nested array will return the deconstructed documents in the same order as the items in the array.
Example:
Suppose I have the following documents
{ "_id" : 1, "item" : "foo", values: [ "foo", "foo2", "foo3"] }
{ "_id" : 2, "item" : "bar", values: [ "bar", "bar2", "bar3"] }
{ "_id" : 3, "item" : "baz", values: [ "baz", "baz2", "baz3"] }
I would like to use paging for all values in all documents in my application code. So, my idea is to use mongo aggregation framework to:
sort the documents by _id
use $unwind on values attribute to deconstruct the documents
use $skip and $limit to simulate paging
So the question using the example described above is:
Is it guaranteed that the following aggregation pipeline:
[
{$sort: {"_id": 1}},
{$unwind: "$values"}
]
will always result to the following documents with exactly the same order?:
{ "_id" : 1, "item" : "foo", values: "foo" }
{ "_id" : 1, "item" : "foo", values: "foo2" }
{ "_id" : 1, "item" : "foo", values: "foo3" }
{ "_id" : 2, "item" : "bar", values: "bar" }
{ "_id" : 2, "item" : "bar", values: "bar2" }
{ "_id" : 2, "item" : "bar", values: "bar3" }
{ "_id" : 3, "item" : "baz", values: "baz" }
{ "_id" : 3, "item" : "baz", values: "baz2" }
{ "_id" : 3, "item" : "baz", values: "baz3" }
I also asked the same question in the MongoDB community forum. An answer that confirms my assumption was posted by a member of MongoDB staff.
Briefly:
Yes, the order of the returned documents in the example above will always be the same. It follows the order from the array field.
In case you do run into issues with order, you could use includeArrayIndex to guarantee it:
[
  {$unwind: {
    path: '$values',
    includeArrayIndex: 'arrayIndex'
  }},
  {$sort: {
    _id: 1,
    arrayIndex: 1
  }},
  {$project: {
    arrayIndex: 0
  }}
]
From what I see at https://github.com/mongodb/mongo/blob/0cee67ce6909ca653462d4609e47edcc4ac5c1a9/src/mongo/db/pipeline/document_source_unwind.cpp
The cursor iterator uses getNext() method to unwind an array:
DocumentSource::GetNextResult DocumentSourceUnwind::doGetNext() {
auto nextOut = _unwinder->getNext();
while (nextOut.isEOF()) {
.....
// Try to extract an output document from the new input document.
_unwinder->resetDocument(nextInput.releaseDocument());
nextOut = _unwinder->getNext();
}
return nextOut;
}
And the getNext() implementation relies on the array's index:
DocumentSource::GetNextResult DocumentSourceUnwind::Unwinder::getNext() {
....
// Set field to be the next element in the array. If needed, this will automatically
// clone all the documents along the field path so that the end values are not shared
// across documents that have come out of this pipeline operator. This is a partial deep
// clone. Because the value at the end will be replaced, everything along the path
// leading to that will be replaced in order not to share that change with any other
// clones (or the original).
_output.setNestedField(_unwindPathFieldIndexes, _inputArray[_index]);
indexForOutput = _index;
_index++;
_haveNext = _index < length;
.....
return _haveNext ? _output.peek() : _output.freeze();
}
So unless there is anything upstream that messes with the documents' order, the cursor should yield the unwound docs in the same order as the subdocuments were stored in the array.
I don't recall how the merger works for sharded collections, and I imagine there might be a case where documents from other shards are returned in between two consecutive unwound documents. What the snippet of code does guarantee is that an unwound document with the next item from the array will never be returned before the unwound document with the previous item from the array.
As a side note, having a million items in an array is quite an extreme design. Even 20-byte items would make such an array exceed the 16MB document limit.

Mongo Query display all documents with matching key values [duplicate]

I have a large collection of documents in MongoDB, each one of those documents has a key called "name", and another key called "type". I would like to find two documents with the same name and different types, a simple MongoDB counterpart of
SELECT ...
FROM table AS t1, table AS t2
WHERE t1.name = t2.name AND t1.type <> t2.type
I can imagine that one can do this using aggregation: however, the collection is very large, processing it will take time and I'm looking just for one pair of such documents.
While I stand by my comments that I don't think the way you are phrasing your question is actually related to a specific problem you have, I will go some way toward explaining the idiomatic SQL approach in a MongoDB type of solution. I maintain that your actual solution would be different, but you haven't presented us with that problem, only the SQL.
So consider the following documents as a sample set, removing _id fields in this listing for clarity:
{ "name" : "a", "type" : "b" }
{ "name" : "a", "type" : "c" }
{ "name" : "b", "type" : "c" }
{ "name" : "b", "type" : "a" }
{ "name" : "a", "type" : "b" }
{ "name" : "b", "type" : "c" }
{ "name" : "f", "type" : "e" }
{ "name" : "z", "type" : "z" }
{ "name" : "z", "type" : "z" }
If we ran the SQL presented over the same data we would get this result:
a|b
a|c
a|c
b|c
b|a
b|a
a|b
b|c
We can see that 2 documents do not match, and from that we can work out the logic of the SQL operation. Another way of saying it is: "Which documents, given a key of "name", have more than one possible value in the key "type"?"
Given that, taking a mongo approach, we can query for the items that do not match the given condition. So effectively the reverse of the result:
db.sample.aggregate([
  // Store unique documents grouped by the "name"
  {$group: {
    _id: "$name",
    comp: {
      $addToSet: {
        name: "$name",
        type: "$type"
      }
    }
  }},
  // Unwind the "set" results
  {$unwind: "$comp"},
  // Push the results back to get the unique count
  // *note* you could not have done this alongside $addToSet
  {$group: {
    _id: "$_id",
    comp: {
      $push: {
        name: "$comp.name",
        type: "$comp.type"
      }
    },
    count: {$sum: 1}
  }},
  // Match only what was counted once
  {$match: {count: 1}},
  // Unwind the array
  {$unwind: "$comp"},
  // Clean up to "name" and "type" only
  {$project: { _id: 0, name: "$comp.name", type: "$comp.type"}}
])
This operation will yield the results:
{ "name" : "f", "type" : "e" }
{ "name" : "z", "type" : "z" }
Now in order to get the same result as the SQL query we would take those results and channel them into another query:
db.sample.find({$nor: [{ name: "f", type: "e"},{ name: "z", type: "z"}] })
Which arrives as the final matching result:
{ "name" : "a", "type" : "b" }
{ "name" : "a", "type" : "c" }
{ "name" : "b", "type" : "c" }
{ "name" : "b", "type" : "a" }
{ "name" : "a", "type" : "b" }
{ "name" : "b", "type" : "c" }
So this will work; however, the one thing that may make it impractical is where the number of documents being compared is very large, as we hit a working limit when compacting those results down into an array.
It also suffers a bit from the use of a negative in the final find operation which would force a scan of the collection. But in all fairness the same could be said of the SQL query that uses the same negative premise.
Edit
Of course what I did not mention is that if the result set goes the other way around and you are matching more results in the excluded items from the aggregate, then just reverse the logic to get the keys that you want. Simply change $match as follows:
{$match: {count: {$gt: 1}}}
And that will be the result, maybe not the actual documents but it is a result. So you don't need another query to match the negative cases.
And, ultimately, this was my fault because I was so focused on the idiomatic translation that I did not read the last line in your question, where you do say that you were looking for just one document.
Of course, currently if that result size is larger than 16MB then you are stuck. At least until the 2.6 release, where the results of aggregation operations are a cursor, so you can iterate that like a .find().
Also introduced in 2.6 is the $size operator which is used to find the size of an array in the document. So this would help to remove the second $unwind and $group that are used in order to get the length of the set. This alters the query to a faster form:
db.sample.aggregate([
{$group: {
_id: "$name",
comp: {
$addToSet: {
name:"$name",
type: "$type"
}
}
}},
{$project: {
comp: 1,
count: {$size: "$comp"}
}},
{$match: {count: {$gt: 1}}},
{$unwind: "$comp"},
{$project: { _id: 0, name: "$comp.name", type: "$comp.type"}}
])
And MongoDB 2.6.0-rc0 is currently available if you are doing this just for personal use, or development/testing.
Moral of the story: yes, you can do it. But do you really want or need to do it that way? Probably not, and if you asked a different question about the specific business case, you might get a different answer. But then again this may be exactly right for what you want.
Note
Worthwhile to mention that when you look at the results from the SQL, it will erroneously duplicate several items due to the other available type options if you didn't use a DISTINCT for those values or essentially another grouping. But that is the result that was being produced by this process using MongoDB.
For Alexander
This is the output of the aggregate in the shell from current 2.4.x versions:
{
"result" : [
{
"name" : "f",
"type" : "e"
},
{
"name" : "z",
"type" : "z"
}
],
"ok" : 1
}
So do this to get a var to pass as the argument to the $nor condition in the second find, like this:
var cond = db.sample.aggregate([ .....
db.sample.find({$nor: cond.result })
And you should get the same results. Otherwise consult your driver.
There is a very simple aggregation that gets you the names that occur with more than one type:
db.collection.aggregate([
  { $group: { _id: "$name",
              count: { $sum: 1 },
              types: { $addToSet: "$type" } } },
  { $match: { "types.1": { $exists: true } } }
])
This works in all versions that support aggregation framework.
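As a hedged follow-up (not part of the original answer), on MongoDB 2.6+ where aggregate() returns a cursor, those grouped names can be fed straight back into a find to fetch the actual documents:
var dupNames = db.collection.aggregate([
  { $group: { _id: "$name", types: { $addToSet: "$type" } } },
  { $match: { "types.1": { $exists: true } } }
]).toArray().map(function(doc) { return doc._id; });  // just the names that occur with more than one type

db.collection.find({ name: { $in: dupNames } })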

MongoDB Calculate Values from Two Arrays, Sort and Limit

I have a MongoDB database storing float arrays. Assume a collection of documents in the following format:
{
"id" : 0,
"vals" : [ 0.8, 0.2, 0.5 ]
}
Having a query array, e.g., with values [ 0.1, 0.3, 0.4 ], I would like to compute for all elements in the collection a distance (e.g., sum of differences; for the given document and query it would be computed by abs(0.8 - 0.1) + abs(0.2 - 0.3) + abs(0.5 - 0.4) = 0.9).
I tried to use the aggregation function of MongoDB to achieve this, but I can't work out how to iterate over the array. (I am not using the built-in geo operations of MongoDB, as the arrays can be rather long)
I also need to sort the results and limit to the top 100, so calculation after reading the data is not desired.
Current Processing is mapReduce
If you need to execute this on the server and sort the top results and just keep the top 100, then you could use mapReduce for this like so:
db.test.mapReduce(
  function() {
    var input = [0.1,0.3,0.4];
    var value = Array.sum(this.vals.map(function(el,idx) {
      return Math.abs( el - input[idx] )
    }));
    emit(null,{ "output": [{ "_id": this._id, "value": value }]});
  },
  function(key,values) {
    var output = [];
    values.forEach(function(value) {
      value.output.forEach(function(item) {
        output.push(item);
      });
    });
    output.sort(function(a,b) {
      return b.value - a.value;   // numeric comparator, sorts descending by value
    });
    return { "output": output.slice(0,100) };
  },
  { "out": { "inline": 1 } }
)
So the mapper function does the calculation and outputs everything under the same key, so all results are sent to the reducer. The end output is going to be contained in an array in a single output document, so it is important both that all results are emitted with the same key value and that the output of each emit is itself an array, so mapReduce can work properly.
The sorting and reduction is done in the reducer itself; as each emitted document is inspected, the elements are put into a single temporary array, sorted, and the top results are returned.
That is important, and it is exactly why the emitter produces its output as an array, even if it holds a single element at first. MapReduce works by processing results in "chunks", so even if all emitted documents have the same key, they are not all processed at once. Rather, the reducer puts its results back into the queue of emitted results to be reduced until there is only a single document left for that particular key.
I'm restricting the "slice" output here to 10 for brevity of listing, and including the stats to make a point, as the 100 reduce cycles called on this 10000 sample can be seen:
{
"results" : [
{
"_id" : null,
"value" : {
"output" : [
{
"_id" : ObjectId("56558d93138303848b496cd4"),
"value" : 2.2
},
{
"_id" : ObjectId("56558d96138303848b49906e"),
"value" : 2.2
},
{
"_id" : ObjectId("56558d93138303848b496d9a"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d93138303848b496ef2"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497861"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497b58"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497ba5"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497c43"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d95138303848b49842b"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d96138303848b498db4"),
"value" : 2.1
}
]
}
}
],
"timeMillis" : 1758,
"counts" : {
"input" : 10000,
"emit" : 10000,
"reduce" : 100,
"output" : 1
},
"ok" : 1
}
So this is a single document output, in the specific mapReduce format, where the "value" contains an element which is an array of the sorted and limited results.
Future Processing is Aggregate
As of writing, the current latest stable release of MongoDB is 3.0, and this lacks the functionality to make your operation possible. But the upcoming 3.2 release introduces new operators that make this possible:
db.test.aggregate([
{ "$unwind": { "path": "$vals", "includeArrayIndex": "index" }},
{ "$group": {
"_id": "$_id",
"result": {
"$sum": {
"$abs": {
"$subtract": [
"$vals",
{ "$arrayElemAt": [ { "$literal": [0.1,0.3,0.4] }, "$index" ] }
]
}
}
}
}},
{ "$sort": { "result": -1 } },
{ "$limit": 100 }
])
Also limiting to the same 10 results for brevity, you get output like this:
{ "_id" : ObjectId("56558d96138303848b49906e"), "result" : 2.2 }
{ "_id" : ObjectId("56558d93138303848b496cd4"), "result" : 2.2 }
{ "_id" : ObjectId("56558d96138303848b498e31"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497c43"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497861"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499037"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b498db4"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496ef2"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496d9a"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499182"), "result" : 2.1 }
This is made possible largely due to $unwind being modified to project a field in results that contains the array index, and also due to $arrayElemAt which is a new operator that can extract an array element as a singular value from a provided index.
This allows the "look-up" of values by index position from your input array in order to apply the math to each element. The input array is facilitated by the existing $literal operator so $arrayElemAt does not complain and recongizes it as an array, ( seems to be a small bug at present, as other array functions don't have the problem with direct input ) and gets the appropriate matching index value by using the "index" field produced by $unwind for comparison.
The math is done by $subtract and of course another new operator in $abs to meet your functionality. Also since it was necessary to unwind the array in the first place, all of this is done inside a $group stage accumulating all array members per document and applying the addition of entries via the $sum accumulator.
Finally all result documents are processed with $sort and then the $limit is applied to just return the top results.
Summary
Even with the new functionality about to be available to the aggregation framework for MongoDB, it is debatable which approach is actually more efficient. This is largely due to there still being a need to $unwind the array content, which effectively produces a copy of each document per array member in the pipeline to be processed, and that generally causes overhead.
So whilst mapReduce is the only present way to do this until a new release, it may actually outperform the aggregation statement depending on the amount of data to be processed, and despite the fact that the aggregation framework works on native coded operators rather than translated JavaScript operations.
As with all things, testing is always recommended to see which case suits your purposes better and which gives the best performance for your expected processing.
Sample
Of course the expected result for the sample document provided in the question is 0.9 by the math applied. But just for my testing purposes, here is a short listing used to generate some sample data that I wanted to at least verify the mapReduce code was working as it should:
var bulk = db.test.initializeUnorderedBulkOp();
var x = 10000;
while ( x-- ) {
var vals = [0,0,0];
vals = vals.map(function(val) {
return Math.round((Math.random()*10),1)/10;
});
bulk.insert({ "vals": vals });
if ( x % 1000 == 0) {
bulk.execute();
bulk = db.test.initializeUnorderedBulkOp();
}
}
The arrays are totally random single decimal point values, so there is not a lot of distribution in the listed results I gave as sample output.

Mongo does not have a max() function, how do I work around this?

I have a MongoDB collection and need to find the max() value of a certain field across all docs. This value is the timestamp and I need to find the latest doc by finding the largest timestamp. Sorting it and getting the first one gets inefficient really fast. Shall I just maintain a 'maxval' separately and update it whenever a doc arrives with a larger value for that field? Any better suggestions?
Thanks much.
If you have an index on the timestamp field, finding the highest value is efficient; something like:
db.things.find().sort({ts:-1}).limit(1)
but if having an index is too much overhead, storing the max in a separate collection might be a good option.
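For completeness, a minimal sketch of creating that index (assuming the same "things" collection and a field called ts, as in the query above; older shells use ensureIndex() instead):
db.things.createIndex({ ts: -1 })
With this index in place, the sort({ts:-1}).limit(1) query above can be answered from the index without scanning the collection.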
If it will be a big collection and you always need to display the max timestamp, you may want to create a separate collection and store statistics data there instead of sorting the big collection each time.
// statistic collection
{
  "_id" : 1,
  "id_from_time_stamp_collection" : "xxx",
  "max_timestamp" : value
}
And whenever a new doc comes in, just update the statistic collection document with _id = 1, with a condition in the query so that max_timestamp is only updated when the new timestamp is greater than the stored one; otherwise nothing changes.
You can probably also store and update other statistics data within the statistic collection.
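A minimal sketch of that update, assuming the statistic collection above and a hypothetical newTimestamp value taken from the incoming document:
// The query only matches when the stored max is smaller than the new value,
// so the $set is a no-op otherwise. The _id: 1 document must be seeded once beforehand.
var newTimestamp = ISODate("2013-11-06T00:00:00Z");
db.statistic.update(
  { _id: 1, max_timestamp: { $lt: newTimestamp } },
  { $set: { max_timestamp: newTimestamp } }
);
// Reading the current max is then a single _id lookup:
db.statistic.findOne({ _id: 1 }).max_timestamp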
Try with db.collection.group
For example, with this collection:
> db.foo.find()
{ "_id" : ObjectId("..."), "a" : 1 }
{ "_id" : ObjectId("..."), "a" : 200 }
{ "_id" : ObjectId("..."), "a" : 230 }
{ "_id" : ObjectId("..."), "a" : -2230 }
{ "_id" : ObjectId("..."), "a" : 5230 }
{ "_id" : ObjectId("..."), "a" : 530 }
{ "_id" : ObjectId("..."), "a" : 1530 }
You can use group using
> db.foo.group({
    initial: { },
    reduce: function(doc, acc) {
      if(acc.hasOwnProperty('max')) {
        if(acc.max < doc.a)
          acc.max = doc.a;
      } else {
        acc.max = doc.a
      }
    }
  })
[ { "max" : 5230 } ]
Since there is no key value in group, all the documents are aggregated into a single result.
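On versions with the aggregation framework, a $max accumulator in a $group stage does the same thing; a minimal sketch against the same foo collection:
db.foo.aggregate([
  { $group: { _id: null, max: { $max: "$a" } } }   // _id: null groups the whole collection
])
// { "_id" : null, "max" : 5230 }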