Is there a way to query an embedded document in an embedded document? - mongodb

I have a weird mongodb document, but still need to query on it. Is it possible?
For example: I need every player within a certain radius.
{
    "_id" : ObjectId("55d89c63c746230c200c528e"),
    "speler_id" : 12,
    "naam" : "Arjen Robben",
    "seconds" : [
        [
            {
                "locatie" : [ 8.7173307286181370, 33.2784843816214250 ],
                "timestamp" : ISODate("1970-01-01T19:00:01.000Z")
            },
            {
                "locatie" : [ -45.8853075448968970, 138.1526615469845800 ],
                "timestamp" : ISODate("1970-01-01T19:00:02.000Z")
            },
            {
                "locatie" : [ 80.5503710377444690, 10.0500048843973580 ],
                "timestamp" : ISODate("1970-01-01T19:00:03.000Z")
            }
        ]
    ]
}

Well, you can always use $geoWithin with $center or $centerSphere (depending on whether these are global geometry coordinates or just a flat plane, for distance calculation purposes) after processing with $unwind in the aggregation framework:
db.collection.aggregate([
    { "$unwind": "$seconds" },
    { "$unwind": "$seconds" },
    { "$match": {
        "seconds.locatie": {
            "$geoWithin": {
                "$centerSphere": [
                    [ 8.7173307286181370, 33.2784843816214250 ],
                    100
                ]
            }
        }
    }}
])
Which on the presented data would return:
{
    "_id" : ObjectId("55d89c63c746230c200c528e"),
    "speler_id" : 12,
    "naam" : "Arjen Robben",
    "seconds" : {
        "locatie" : [ 8.717330728618137, 33.278484381621425 ],
        "timestamp" : ISODate("1970-01-01T19:00:01Z")
    }
}
{
    "_id" : ObjectId("55d89c63c746230c200c528e"),
    "speler_id" : 12,
    "naam" : "Arjen Robben",
    "seconds" : {
        "locatie" : [ 80.55037103774447, 10.050004884397358 ],
        "timestamp" : ISODate("1970-01-01T19:00:03Z")
    }
}
Since $geoWithin does not "require" a geospatial index, it is fine to use at later aggregation stages than the initial match. The $centerSphere in this case defines a point to query from and the "radius" extending from that point (measured in radians for legacy coordinate pairs). This is really just a geometry "shortcut", as you can alternately provide a GeoJSON polygon of your own definition.
But it's not really great, mostly because it cannot use an index and is therefore pretty much a brute-force pass over the whole collection. It also means you cannot do nice things like return the distance from the queried point, as you can with $geoNear.
Therefore, while you can do things like this, most geospatial queries with MongoDB are best served by keeping the location data at the top level of the document rather than embedded within arrays. Such modelling usually means having separate collection objects rather than embedded ones for the best results.
If you want an aggregated array in your response, then it is better to do this in aggregation after the initial geospatial query is made.
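For instance, a minimal sketch of rebuilding the matched entries into an array per player is the same pipeline as above with a trailing $group (the field names are taken from the sample document):

db.collection.aggregate([
    { "$unwind": "$seconds" },
    { "$unwind": "$seconds" },
    { "$match": {
        "seconds.locatie": {
            "$geoWithin": {
                "$centerSphere": [
                    [ 8.7173307286181370, 33.2784843816214250 ],
                    100
                ]
            }
        }
    }},
    // Rebuild the array from only the entries that matched
    { "$group": {
        "_id": "$_id",
        "speler_id": { "$first": "$speler_id" },
        "naam": { "$first": "$naam" },
        "seconds": { "$push": "$seconds" }
    }}
])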

Related

Is there a way to query for dict of arrays of array in MongoDB

My mongo db collection contains documents with the following structure:
{
    "_id" : ObjectId("5889ce0d2e9bfa938c49208d"),
    "filewise_word_freq" : {
        "33236365" : [
            [ "cluster", 4 ],
            [ "question", 2 ],
            [ "differ", 2 ],
            [ "come", 1 ]
        ],
        "33204685" : [
            [ "node", 6 ],
            [ "space", 4 ],
            [ "would", 3 ],
            [ "templat", 1 ]
        ]
    },
    "file_root" : "socialcast",
    "main_cluster_name" : "node",
    "most_common_words" : [
        [ "node", 16 ],
        [ "cluster", 7 ],
        [ "n't", 3 ]
    ]
}
I want to search for the value "node" inside the arrays of arrays under each file name key (in my case "33236365", "33204685", and so on) of the filewise_word_freq dict.
If the value "node" is present inside any one of the arrays of arrays for a file name (e.g. 33204685), then the query should return that file name (33204685).
I tried an approach from another Stack Overflow answer, but when I executed it for my use case it didn't work. And above all this, I don't know how to return only the file name rather than the entire object or document.
db.frequencydist.find({
    "file_root": 'socialcast',
    "main_cluster_name": "node",
    "filewise_word_freq": {
        $elemMatch: { $elemMatch: { $elemMatch: { $in: ["node"] } } }
    }
}).pretty()
It returned nothing.
Kindly help me.
The data model you have chosen makes it extremely difficult to query, even with aggregation. I would suggest revising your document model. However, I think you can use $where:
db.collection.find({
    "file_root": 'socialcast',
    "main_cluster_name": "node",
    // Note: the string literal must use single quotes around 'node'
    $where: "for (var i in this.filewise_word_freq) { for (var j in this.filewise_word_freq[i]) { if (this.filewise_word_freq[i][j].indexOf('node') >= 0) { return true } } }"
})
Yes, this will return the whole document, and from your application you may need to filter the file names out.
You might also want to look at the map-reduce functionality, though that's not generally recommended.
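For completeness, a minimal map-reduce sketch under that caveat (it simply emits each file name whose [word, count] pairs contain "node"; the query portion reuses the fields from your find):

db.frequencydist.mapReduce(
    function() {
        // Emit each file name whose [word, count] pairs contain "node"
        for (var file in this.filewise_word_freq) {
            var pairs = this.filewise_word_freq[file];
            for (var i = 0; i < pairs.length; i++) {
                if (pairs[i][0] === "node") {
                    emit(file, 1);
                    break;
                }
            }
        }
    },
    function(key, values) { return Array.sum(values); },
    {
        "query": { "file_root": "socialcast", "main_cluster_name": "node" },
        "out": { "inline": 1 }
    }
)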
One other way is to do it through stored JavaScript functions; these run on the mongo server and are saved in a special collection.
Still, going back to the db model: do revise it if that's a possibility. Maybe something like:
{
    "_id" : ObjectId("5889ce0d2e9bfa938c49208d"),
    "filewise_word_freq" : [
        {
            "fileName" : "33236365",
            "word_counts" : {
                "cluster" : 4,
                "question" : 2,
                "differ" : 2,
                "come" : 1
            }
        },
        {
            "fileName" : "33204685",
            "word_counts" : {
                "node" : 6,
                "space" : 4,
                "would" : 3,
                "template" : 1
            }
        }
    ],
    "file_root" : "socialcast",
    "main_cluster_name" : "node",
    "most_common_words" : [
        { "node" : 16 },
        { "cluster" : 7 },
        { "n't" : 3 }
    ]
}
It would be a lot easier to run aggregation on these.
For this model, the aggregation would be something like
db.collection.aggregate([
    { $unwind : "$filewise_word_freq" },
    { $match : { "filewise_word_freq.word_counts.node" : { $gte : 0 } } },
    { $group : { _id : 1, fileNames : { $addToSet : "$filewise_word_freq.fileName" } } },
    { $project : { _id : 0 } }
])
This will produce a single document with a single field, fileNames, containing a list of all the matching file names:
{
fileNames : ["33204685"]
}
You can try something like this. It matches "node" as part of the query and returns filewise_word_freq.33204685 as part of the projection.
db.collection.find({
    "file_root": 'socialcast',
    "main_cluster_name": "node",
    "filewise_word_freq.33204685": {
        $elemMatch: { $elemMatch: { $in: ["node"] } }
    }
}, {
    "filewise_word_freq.33204685": 1
}).pretty();
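If you can use a later release (MongoDB 3.4.4 or above), a hedged sketch with $objectToArray can search across all the dynamic file-name keys and return only the matching names, without changing the model:

db.frequencydist.aggregate([
    { "$match": { "file_root": "socialcast", "main_cluster_name": "node" } },
    // Convert the dynamic file-name keys into an array of { k, v } pairs
    { "$project": { "files": { "$objectToArray": "$filewise_word_freq" } } },
    { "$unwind": "$files" },
    // Keep only entries whose [word, count] pairs contain "node"
    { "$match": { "files.v": { "$elemMatch": { "$elemMatch": { "$in": ["node"] } } } } },
    // Collect just the matching file names
    { "$group": { "_id": null, "fileNames": { "$addToSet": "$files.k" } } },
    { "$project": { "_id": 0, "fileNames": 1 } }
])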

MongoDB Calculate Values from Two Arrays, Sort and Limit

I have a MongoDB database storing float arrays. Assume a collection of documents in the following format:
{
    "id" : 0,
    "vals" : [ 0.8, 0.2, 0.5 ]
}
Having a query array, e.g., with values [ 0.1, 0.3, 0.4 ], I would like to compute for all elements in the collection a distance (e.g., sum of differences; for the given document and query it would be computed by abs(0.8 - 0.1) + abs(0.2 - 0.3) + abs(0.5 - 0.4) = 0.9).
I tried to use the aggregation function of MongoDB to achieve this, but I can't work out how to iterate over the array. (I am not using the built-in geo operations of MongoDB, as the arrays can be rather long)
I also need to sort the results and limit to the top 100, so calculation after reading the data is not desired.
Current Processing is mapReduce
If you need to execute this on the server and sort the top results and just keep the top 100, then you could use mapReduce for this like so:
db.test.mapReduce(
    function() {
        var input = [0.1,0.3,0.4];
        var value = Array.sum(this.vals.map(function(el,idx) {
            return Math.abs( el - input[idx] );
        }));
        emit(null,{ "output": [{ "_id": this._id, "value": value }] });
    },
    function(key,values) {
        var output = [];
        values.forEach(function(value) {
            value.output.forEach(function(item) {
                output.push(item);
            });
        });
        // The comparator must return a number, not a boolean;
        // sort descending by value
        output.sort(function(a,b) {
            return b.value - a.value;
        });
        return { "output": output.slice(0,100) };
    },
    { "out": { "inline": 1 } }
)
So the mapper function does the calculation and outputs everything under the same key, so all results are sent to the reducer. The end output is going to be contained in an array in a single output document, so it is important both that all results are emitted with the same key value and that the output of each emit is itself an array, so mapReduce can work properly.
The sorting and reduction is done in the reducer itself: as each emitted document is inspected, the elements are put into a single temporary array, sorted, and the top results are returned.
That is important, and is just the reason why the emitter produces its output as an array, even if it holds a single element at first. MapReduce works by processing results in "chunks", so even if all emitted documents have the same key, they are not all processed at once. Rather, the reducer puts its results back into the queue of emitted results to be reduced, until there is only a single document left for that particular key.
I'm restricting the "slice" output here to 10 for brevity of listing, and including the stats to make a point, as the 100 reduce cycles called on this 10000 sample can be seen:
{
    "results" : [
        {
            "_id" : null,
            "value" : {
                "output" : [
                    { "_id" : ObjectId("56558d93138303848b496cd4"), "value" : 2.2 },
                    { "_id" : ObjectId("56558d96138303848b49906e"), "value" : 2.2 },
                    { "_id" : ObjectId("56558d93138303848b496d9a"), "value" : 2.1 },
                    { "_id" : ObjectId("56558d93138303848b496ef2"), "value" : 2.1 },
                    { "_id" : ObjectId("56558d94138303848b497861"), "value" : 2.1 },
                    { "_id" : ObjectId("56558d94138303848b497b58"), "value" : 2.1 },
                    { "_id" : ObjectId("56558d94138303848b497ba5"), "value" : 2.1 },
                    { "_id" : ObjectId("56558d94138303848b497c43"), "value" : 2.1 },
                    { "_id" : ObjectId("56558d95138303848b49842b"), "value" : 2.1 },
                    { "_id" : ObjectId("56558d96138303848b498db4"), "value" : 2.1 }
                ]
            }
        }
    ],
    "timeMillis" : 1758,
    "counts" : {
        "input" : 10000,
        "emit" : 10000,
        "reduce" : 100,
        "output" : 1
    },
    "ok" : 1
}
So this is a single document output, in the specific mapReduce format, where the "value" contains an element which is an array of the sorted and limited result.
Future Processing is Aggregate
As of writing, the current latest stable release of MongoDB is 3.0, which lacks the functionality to make your operation possible. But the upcoming 3.2 release introduces new operators that make this possible:
db.test.aggregate([
    { "$unwind": { "path": "$vals", "includeArrayIndex": "index" } },
    { "$group": {
        "_id": "$_id",
        "result": {
            "$sum": {
                "$abs": {
                    "$subtract": [
                        "$vals",
                        { "$arrayElemAt": [ { "$literal": [0.1,0.3,0.4] }, "$index" ] }
                    ]
                }
            }
        }
    }},
    { "$sort": { "result": -1 } },
    { "$limit": 100 }
])
Also limiting to the same 10 results for brevity, you get output like this:
{ "_id" : ObjectId("56558d96138303848b49906e"), "result" : 2.2 }
{ "_id" : ObjectId("56558d93138303848b496cd4"), "result" : 2.2 }
{ "_id" : ObjectId("56558d96138303848b498e31"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497c43"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497861"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499037"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b498db4"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496ef2"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496d9a"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499182"), "result" : 2.1 }
This is made possible largely due to $unwind being modified to project a field in results that contains the array index, and also due to $arrayElemAt which is a new operator that can extract an array element as a singular value from a provided index.
This allows the "look-up" of values by index position from your input array in order to apply the math to each element. The input array is wrapped in the existing $literal operator so that $arrayElemAt does not complain and recognizes it as an array (this seems to be a small bug at present, as other array functions don't have the problem with direct input), and the appropriate matching index value is retrieved by using the "index" field produced by $unwind for comparison.
The math is done by $subtract and of course another new operator in $abs to meet your functionality. Also since it was necessary to unwind the array in the first place, all of this is done inside a $group stage accumulating all array members per document and applying the addition of entries via the $sum accumulator.
Finally all result documents are processed with $sort and then the $limit is applied to just return the top results.
Summary
Even with the new functionality about to be available to the aggregation framework for MongoDB, it is debatable which approach is actually more efficient for results. This is largely due to there still being a need to $unwind the array content, which effectively produces a copy of each document per array member in the pipeline to be processed, and that generally causes an overhead.
So whilst mapReduce is the only present way to do this until a new release, it may actually outperform the aggregation statement depending on the amount of data to be processed, and despite the fact that the aggregation framework works on native coded operators rather than translated JavaScript operations.
As with all things, testing is always recommended to see which case suits your purposes better and which gives the best performance for your expected processing.
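As a forward-looking aside, releases after 3.2 (3.4 and later) also add operators such as $zip, which pairs arrays by index and would avoid the $unwind copy altogether; a minimal sketch, assuming such a release:

db.test.aggregate([
    { "$project": {
        // Pair each element of "vals" with the query array by index,
        // then sum the absolute differences of each pair
        "result": {
            "$sum": {
                "$map": {
                    "input": { "$zip": { "inputs": [ "$vals", { "$literal": [0.1,0.3,0.4] } ] } },
                    "as": "pair",
                    "in": { "$abs": { "$subtract": [
                        { "$arrayElemAt": [ "$$pair", 0 ] },
                        { "$arrayElemAt": [ "$$pair", 1 ] }
                    ]}}
                }
            }
        }
    }},
    { "$sort": { "result": -1 } },
    { "$limit": 100 }
])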
Sample
Of course the expected result for the sample document provided in the question is 0.9 by the math applied. But just for my testing purposes, here is a short listing used to generate some sample data, so I could at least verify that the mapReduce code was working as it should:
var bulk = db.test.initializeUnorderedBulkOp();
var x = 10000;
while ( x-- ) {
    var vals = [0,0,0];
    vals = vals.map(function(val) {
        // Math.round takes a single argument; this yields one decimal place
        return Math.round(Math.random()*10)/10;
    });
    bulk.insert({ "vals": vals });
    if ( x % 1000 == 0) {
        bulk.execute();
        bulk = db.test.initializeUnorderedBulkOp();
    }
}
The arrays are totally random single decimal point values, so there is not a lot of distribution in the listed results I gave as sample output.
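For reference, a minimal sketch of the same "sum of differences" math in plain client-side JavaScript, which is what both server-side versions above compute:

// Sum of absolute differences between two equal-length arrays
function distance(a, b) {
    return a.reduce(function(sum, val, idx) {
        return sum + Math.abs(val - b[idx]);
    }, 0);
}

distance([0.8, 0.2, 0.5], [0.1, 0.3, 0.4]);   // 0.9 (within floating point error)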

Update an array element with inc mongo update

Hi all, I have this data in mongo:
{
    "articleId" : [
        {
            "articleId" : "9514666",
            "articleCount" : 1
        }
    ],
    "count" : NumberLong(1),
    "timeStamp" : NumberLong("1416634200000"),
    "interval" : 1,
    "tags" : "famous"
}
I want to update it using this new data
{
    "articleId" : [
        {
            "articleId" : "9514666",
            "articleCount" : 3
        },
        {
            "articleId" : "9514667",
            "articleCount" : 3
        }
    ],
    "count" : NumberLong(6),
    "timeStamp" : NumberLong("1416634200000"),
    "interval" : 1,
    "tags" : "famous"
}
What i need in the output is
{
    "articleId" : [
        {
            "articleId" : "9514666",
            "articleCount" : 4
        },
        {
            "articleId" : "9514667",
            "articleCount" : 3
        }
    ],
    "count" : NumberLong(7),
    "timeStamp" : NumberLong("1416634200000"),
    "interval" : 1,
    "tags" : "famous"
}
Could you please suggest how I can achieve this using an update operation?
My update query will have the tags field as its query parameter.
You'll never get this in a single update operation, as presently there is no way for MongoDB updates to refer to the existing values of fields. The exception of course is operators such as $inc, but this case has a bit more going on than that can handle.
You need multiple updates, but there is a consistent model to follow and the Bulk Operations API can at least help with sending all of those updates in a single request:
var updoc = {
    "articleId" : [
        {
            "articleId" : "9514666",
            "articleCount" : 3
        },
        {
            "articleId" : "9514667",
            "articleCount" : 3
        }
    ],
    "count" : NumberLong(6),
    "timeStamp" : NumberLong("1416634200000"),
    "interval" : 1,
    "tags" : "famous"
};

var bulk = db.collection.initializeOrderedBulkOp();

// Inspect the document variable for update.
// For each array entry:
updoc.articleId.forEach(function(doc) {
    // First try to match the document and array entry to update
    bulk.find({
        "tags": updoc.tags,
        "articleId.articleId": doc.articleId
    }).update({
        "$inc": { "articleId.$.articleCount": doc.articleCount }
    });

    // Then try to "push" the array entry where it does not exist
    bulk.find({
        "tags": updoc.tags,
        "articleId.articleId": { "$ne": doc.articleId }
    }).update({
        "$push": { "articleId": doc }
    });
});

// Finally increment the overall count
bulk.find({ "tags": updoc.tags }).update({
    "$inc": { "count": updoc.count }
});

bulk.execute();
Now that is not "truly" atomic, and there is a very small chance that the modified document could be read without all of the modifications in place. But since the Bulk API sends these over to the server to process all at once, this is a lot better than individual operations between the client and server, where the chance of the document being read in an inconsistent state would be much higher.
So for each array member in the document to "merge", you want to both try to $inc where the member is matched in the query and to $push a new member where it was not. Finally you just want to $inc again for the total count on the merged document with the existing one.
For this sample that is a total of 5 update operations, but all sent in one package. Note though that the response will confirm that only 3 operations were applied, as 2 of the operations would not actually match a document due to the conditions specified:
BulkWriteResult({
"writeErrors" : [ ],
"writeConcernErrors" : [ ],
"nInserted" : 0,
"nUpserted" : 0,
"nMatched" : 3,
"nModified" : 3,
"nRemoved" : 0,
"upserted" : [ ]
})
So that is one way to handle it. Another may be to just submit each document individually and then periodically "merge" the data into grouped documents using the aggregation framework. It depends on how "real time" you want to do this. The above is as close to "real time" updates as you can generally get.
Delayed Processing
As mentioned, there is another approach to this where you can consider a "delayed" processing of this "merging" where you do not need the data to be updated in real time. The approach considers the use of the aggregation framework to perform the "merge", and you could even use the aggregation as the general query for the data, but you probably want to accumulate in a collection instead.
The basic premise of the aggregation is that you store each "change" document as a separate document in the collection, rather than merge in real time. So two documents in the collection would be represented like this:
{
    "_id" : ObjectId("548fe1c78ad2c25d4c952eee"),
    "articleId" : [
        {
            "articleId" : "9514666",
            "articleCount" : 1
        }
    ],
    "count" : NumberLong(1),
    "timeStamp" : NumberLong("1416634200000"),
    "interval" : 1,
    "tags" : "famous"
},
{
    "_id" : ObjectId("548fe2286032bac607405eb3"),
    "articleId" : [
        {
            "articleId" : "9514666",
            "articleCount" : 3
        },
        {
            "articleId" : "9514667",
            "articleCount" : 3
        }
    ],
    "count" : NumberLong(6),
    "timeStamp" : NumberLong("1416634200000"),
    "interval" : 1,
    "tags" : "famous"
}
In order to "merge" these results for a given "tags" value, you want an aggregation pipeline like this:
db.collection.aggregate([
    // Unwind the array members to de-normalize
    { "$unwind": "$articleId" },
    // Group the elements by "tags" value and "articleId"
    { "$group": {
        "_id": {
            "tags": "$tags",
            "articleId": "$articleId.articleId"
        },
        "articleCount": { "$sum": "$articleId.articleCount" },
        "timeStamp": { "$max": "$timeStamp" },
        "interval": { "$max": "$interval" }
    }},
    // Now group again, creating the array of "merged" items
    { "$group": {
        "_id": "$tags",
        "articleId": {
            "$push": {
                "articleId": "$_id.articleId",
                "articleCount": "$articleCount"
            }
        },
        "count": { "$sum": "$articleCount" },
        "timeStamp": { "$max": "$timeStamp" },
        "interval": { "$max": "$interval" }
    }}
])
So using "tags" and "articleId" (the inner value) you group the results together, taking the $sum of the "articleCount" fields where both of those fields are the same, and the $max value for the rest of the fields, which makes sense.
In a second $group pass you then just break the result documents down to "tags", pushing each matching "articleId" value under that into an array. To avoid any duplication the document "count" is summed at this stage and the other values are just taken from the same groupings.
The result is the same "merged" document, which you could either use the above aggregation query to simply return your results from such a collection, or use those results to either just create a new collection for the merged documents ( see the $out operator for one option ) or use a similar process to the first example to "merge" these "merged" results with an existing "merged" collection.
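For instance, a minimal sketch of the $out option is the same pipeline as above with a final stage appended (the output collection name "merged_tags" is just an assumption for illustration):

db.collection.aggregate([
    { "$unwind": "$articleId" },
    { "$group": {
        "_id": {
            "tags": "$tags",
            "articleId": "$articleId.articleId"
        },
        "articleCount": { "$sum": "$articleId.articleCount" },
        "timeStamp": { "$max": "$timeStamp" },
        "interval": { "$max": "$interval" }
    }},
    { "$group": {
        "_id": "$tags",
        "articleId": {
            "$push": {
                "articleId": "$_id.articleId",
                "articleCount": "$articleCount"
            }
        },
        "count": { "$sum": "$articleCount" },
        "timeStamp": { "$max": "$timeStamp" },
        "interval": { "$max": "$interval" }
    }},
    // Write the merged results to a separate collection
    { "$out": "merged_tags" }
])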
Accumulating data like this is generally a wide topic, even though a common use case for many. There is a reference project maintained by MongoDB solutions architecture called HVDF, or High Volume Data Feed. It is aimed at providing a framework, or at least a reference example, of handling volume feeds (for which change-document accumulation is one case) and aggregating these in a series manner for analysis.
The actual approach depends on the overall needs of your application. Concepts such as these are employed internally by a framework like HVDF; it's just a matter of how much complexity you need and which approach best suits how you need to access the data.

$and with $nearSphere in mongodb

I have a collection holding from and to point locations. Now I wish to find documents that have both their to and from locations near the given source and destination points.
Here's the setup:
collection: db.t2.find():
{
    "_id" : ObjectId("5..4"),
    "uid" : "sdrr",
    "valid_upto" : 122334,
    "loc" : {
        "from" : {
            "type" : "Point",
            "coordinates" : [ 77.206672, 28.543347 ]
        },
        "to" : {
            "type" : "Point",
            "coordinates" : [ 77.1997687, 28.5567278 ]
        }
    }
}
Indices: db.t2.getIndices():
{
    "v" : 1,
    "name" : "_id_",
    "key" : {
        "_id" : 1
    },
    "ns" : "mydb.t2"
},
{
    "v" : 1,
    "name" : "uid_1_loc.from_2dsphere_loc.to_2dsphere_valid_upto_1",
    "key" : {
        "uid" : 1,
        "loc.from" : "2dsphere",
        "loc.to" : "2dsphere",
        "valid_upto" : 1
    },
    "ns" : "mydb.t2"
}
Single queries for either to or from work fine with the current settings and give nice results. However, when I use to and from together in a single query with an $and clause:
db.t2.find({
    "$and" : [
        {
            "loc.from" : {
                "$nearSphere" : [ 77.5454589, 28.4621213 ],
                "$maxDistance" : 0.18
            }
        },
        {
            "loc.to" : {
                "$nearSphere" : [ 77.206672, 28.543347 ],
                "$maxDistance" : 0.18
            }
        }
    ]
})
it throws the following error:
error: {
"$err" : "can't find any special indices: 2d (needs index), 2dsphere (needs index), for: { $and: [ { loc.from: { $nearSphere: [ 77.5454589, 28.4621213 ], $maxDistance: 0.18 } }, { loc.to: { $nearSphere: [ 77.206672, 28.543347 ], $maxDistance: 0.18 } } ] }",
"code" : 13038
}
I suppose the data has been indexed, as evident from getIndices(), but it's still unable to find the indices! Where is the problem then, and how can I fix it to get the effect of an $and-ed operation?
The error appears to be present from MongoDB 2.4, where there indeed was a bug that would not allow a $near type of query within an $and operation that accessed another field.
But your particular problem here is that you just cannot do this.
The code and comments to test this can be viewed on GitHub, but essentially:
// There can only be one NEAR. If there is a NEAR, it must be either the root or the root
// must be an AND and its child must be a NEAR.
size_t numGeoNear = countNodes(root, MatchExpression::GEO_NEAR);
if (numGeoNear > 1) {
    return Status(ErrorCodes::BadValue, "Too many geoNear expressions");
}
So that is the error that would be emitted from MongoDB 2.6 if you tried to do this.
A brief look at all the surrounding code within the method will show you that "geo" queries are not alone in this: the other "special" index type, "text", is included in the same rules.
Part of the reason for this is the $meta "scoring" that is required, which in this case is $maxDistance. There really is no valid way to combine or discern which value would actually apply in combined results such as this.
On a bit more of a technical note, the other issue is with being able to "intersect" indexes in a query such as this. The required fuzzy matching makes this a very different prospect to something like the basic "Btree" index intersection.
For now at least, your best approach is to perform each query by itself and manually "union/intersect" your results in code, with of course your own tagging as to which results are for your origin and which are for your destination.
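For illustration, a minimal sketch of that manual intersection in the shell, reusing the query values from the question (the variable names are just placeholders):

// Run each $nearSphere query on its own (each can use the geospatial index),
// then intersect the two result sets on _id in client code.
var fromIds = {};
db.t2.find(
    { "loc.from": { "$nearSphere": [ 77.5454589, 28.4621213 ], "$maxDistance": 0.18 } },
    { "_id": 1 }
).forEach(function(doc) { fromIds[doc._id.str] = true; });

var both = db.t2.find(
    { "loc.to": { "$nearSphere": [ 77.206672, 28.543347 ], "$maxDistance": 0.18 } }
).toArray().filter(function(doc) { return fromIds[doc._id.str]; });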
This was a known issue in version 2.4 and prior of MongoDB, fixed in version 2.5.5:
https://jira.mongodb.org/browse/SERVER-4572
SERVER-4572: Geospatial index cannot be used in $and criteria of a query
Should be fixed as of 2.6 - if you're running 2.4 or earlier I'd upgrade; if you're running 2.6.x I'd report it as a bug.

mongodb 2.4.9 $geoWithin query on very simple dataset returning no results. Why?

Here is the output from my mongodb shell of a very simple example of a $geoWithin query. As you can see, I have only a single GeoJSON Polygon in my collection, and each of its coordinates lies within the described $box. Furthermore, the GeoJSON seems valid, as the 2dsphere index was created without error.
> db.Townships.find()
{ "_id" : ObjectId("5310f13c9f3a313af872530c"), "geometry" : { "type" : "Polygon", "coordinates" : [ [ [ -96.74084500000001, 36.99911500000002 ], [ -96.74975600000002, 36.99916100000001 ], [ -96.74953099999998, 36.99916000000002 ], [ -96.74084500000001, 36.99911500000002 ] ] ] }, "type" : "Feature" }
> db.Townships.ensureIndex( { "geometry" : "2dsphere"})
> db.Townships.find( { "geometry" : { $geoWithin : { "$box" : [[-97, 36], [-96, 37]] } } } ).count()
0
Thanks for any advice.
From the documentation:
The $box operator specifies a rectangle for a geospatial $geoWithin query. The query returns documents that are within the bounds of the rectangle, according to their point-based location data. The $box operator returns documents based on grid coordinates and does not query for GeoJSON shapes.
If you insert this document...
db.Townships.insert({
    "geometry" : [ -96.74084500000001, 36.99911500000002 ],
    "type" : "Feature"
})
...your query will find it (but without index support).
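Alternatively, if you want to match the stored GeoJSON Polygon itself, a hedged sketch is to express the same box as a GeoJSON Polygon via $geometry, which does query GeoJSON shapes and can use the 2dsphere index:

// The same box as a closed GeoJSON Polygon; this matches the stored
// Polygon document without changing its format
db.Townships.find({
    "geometry": {
        "$geoWithin": {
            "$geometry": {
                "type": "Polygon",
                "coordinates": [[
                    [ -97, 36 ], [ -96, 36 ], [ -96, 37 ], [ -97, 37 ], [ -97, 36 ]
                ]]
            }
        }
    }
}).count()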