Optimizing MongoDB Query on GEO fence

I am using MongoDB to store a lot of GPS data (2 million documents with about 1000 GPS points per document). The data looks like the following:
{
    "Data": [
        {
            "latitude" : XXXXXX,
            "longitude" : XXXXXX,
            "speed" : XXXXXXX
        }
    ],
    "_id": ID,
    "StartDayOfWeek" : X
}
As you can see, Data is an array of GPS points plus some additional information.
I query for documents on specific days of the week (sometimes I query for multiple days, in which case I use $or).
Note that latitude/longitude are stored as integers.
The following is an example of my query:
db.testNew2.aggregate(
{ $sort : {"StartDayOfWeek" : 1 }},
{ $match: {"$and" : [
{ "StartDayOfWeek" : 1},
{ 'Data' :
{$elemMatch:
{ "latitude" : {$gt:48143743, $lt:48143843}}}},
{ 'Data' :
{$elemMatch:
{ "longitude" : {$gt:11554706,$lt:11554806}}}}
]}},
{ $unwind : "$Data"},
{ $match: {"$and" : [
{ 'Data.latitude' : {$gt:48143743, $lt:48143843}},
{ 'Data.longitude' : {$gt:11554706, $lt:11554806}}
]}},
{$group: {
"_id" : "$_id",
"Traces" : { $push :"$Data"}
}}
)
As you can see, I am filtering out the GPS points that are not within the geofence.
This query works, but it seems very slow and I do not know why. There is an index on StartDayOfWeek, and the machine is more than capable (24 GB of RAM and two 7200 rpm SATA drives in RAID 0).
The collection size is about 130 GB and the query takes about 3 to 5 minutes.
In the Java program that uses this query I also enable allowDiskUse, since the results can be larger than 16 MB.
Is there any way to optimize this?
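One thing worth checking (a sketch, not a verified fix): move the $match to the front of the pipeline and drop the leading $sort, which does nothing for the match. The optimizer may already reorder this, but writing it explicitly makes the intent clear, and it also shows how allowDiskUse is passed from the shell:
db.testNew2.aggregate([
    // Match first so the StartDayOfWeek index can be used; the leading $sort
    // from the original pipeline is redundant and has been removed.
    { $match: { "$and" : [
        { "StartDayOfWeek" : 1 },
        { "Data" : { $elemMatch: { "latitude" : { $gt: 48143743, $lt: 48143843 } } } },
        { "Data" : { $elemMatch: { "longitude" : { $gt: 11554706, $lt: 11554806 } } } }
    ]}},
    { $unwind : "$Data" },
    { $match: { "$and" : [
        { "Data.latitude" : { $gt: 48143743, $lt: 48143843 } },
        { "Data.longitude" : { $gt: 11554706, $lt: 11554806 } }
    ]}},
    { $group: { "_id" : "$_id", "Traces" : { $push : "$Data" } } }
], { allowDiskUse: true })  // shell equivalent of the Java driver's allowDiskUse option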

Related

How to improve aggregate pipeline

I have this pipeline:
[
{'$match':{templateId:ObjectId('blabla')}},
{
"$sort" : {
"_id" : 1
}
},
{
"$facet" : {
"paginatedResult" : [
{
"$skip" : 0
},
{
"$limit" : 100
}
],
"totalCount" : [
{
"$count" : "count"
}
]
}
}
]
Index:
"key" : {
"templateId" : 1,
"_id" : 1
}
The collection has 10.6M documents, 500k of which have the needed templateId.
The aggregation uses the index:
"planSummary" : "IXSCAN { templateId: 1, _id: 1 }",
But the request takes 16 seconds. What did I do wrong? How can I speed it up?
For a start, you should get rid of the $sort stage. The documents already come back sorted by _id, since they are guaranteed to be read in order from the { templateId: 1, _id: 1 } index. The outcome is sorting 500k documents that are already sorted anyway.
Next, you shouldn't use the $skip approach. For high page numbers you will skip a large number of documents, up to almost 500k (index entries, rather, but still).
I suggest an alternative approach:
For the first page, pick an _id that you know for sure falls off the left side of the index. Say, if you know that you don't have entries dated 2019 or earlier, you can compute a starting point like this:
var pageStart = ObjectId.fromDate(new Date("2020/01/01"))
Then your $match stage should look like this:
{'$match' : {templateId:ObjectId('blabla'), _id: {$gt: pageStart}}}
For subsequent pages, keep track of the last document of the previous page: if the rightmost _id on a given page is x, then pageStart should be x for the next page.
So your pipeline may look like this:
[
{'$match' : {templateId:ObjectId('blabla'), _id: {$gt: pageStart}}},
{
"$facet" : {
"paginatedResult" : [
{
"$limit" : 100
}
]
}
}
]
Note that the $skip stage is now missing from the $facet as well.
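To make the page-to-page handoff concrete, here is a minimal shell sketch of that loop (db.collection and the template id are placeholders):
// Keyset-pagination sketch; relies on the { templateId: 1, _id: 1 } index for ordering.
var templateId = ObjectId('blabla');                        // placeholder id
var pageStart = ObjectId.fromDate(new Date("2020/01/01")); // anything before the first real _id

while (true) {
    var result = db.collection.aggregate([
        {'$match' : {templateId: templateId, _id: {$gt: pageStart}}},
        {"$facet" : {"paginatedResult" : [ {"$limit" : 100} ]}}
    ]).toArray()[0];

    var page = result.paginatedResult;
    if (page.length === 0) break;               // no more pages

    // ... process page ...

    // the rightmost _id of this page becomes pageStart for the next page
    pageStart = page[page.length - 1]._id;
}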

Poor $lookup/$sort performance on small collections WITH indexes

I've gone through almost 10 similar posts here on SO, and I'm still confused by the results I am getting: 5+ seconds for a sort on a foreign field in a single $lookup aggregation between a 42K-document collection and a 19-record collection, i.e. a total cross product of 798K.
Unfortunately, denormalization is not a great option here as the documents in the 'to' collection are heavily shared and would require a massive amount of updates across the database when changes are made.
That being said, I can't seem to understand why the following would take this long regardless. I feel like I must be doing something wrong.
The context:
A 4 vCPU, 16 GB RAM VM is running Debian 10 / MongoDB 4.4 as a single-node replica set and nothing else. The .NET MongoDB driver is fully updated (I updated a moment ago and re-tested).
There is one lookup in the aggregation, with the 'from' collection having 42K documents and the 'to' collection having 19 documents.
All aggregations, indexes, and collections use the default collation.
The foreign field in the 'to' collection has an index. Yes, just for those 19 records, in case it makes a difference.
One of the posts about slow $lookup performance mentioned that if $eq is not used within the nested pipeline of the $lookup stage, it won't use the index. So I made sure that the aggregation pipeline uses an $eq operator.
Here's the pipeline:
[{ "$lookup" :
{ "from" : "4",
"let" : { "key" : "$1" },
"pipeline" :
[{ "$match" :
{ "$expr" :
{ "$eq" : ["$id", { "$substrCP" : ["$$key", 0, { "$indexOfCP" : ["$$key", "_"] }] }] } } },
{ "$unwind" : { "path" : "$RF2" } },
{ "$match" : { "$expr" : { "$eq" : ["$RF2.id", "$$key"] } } },
{ "$replaceRoot" : { "newRoot" : "$RF2" } }],
"as" : "L1" } },
{ "$sort" : { "L1.5" : 1 } },
{ "$project" : { "L1" : 0 } },
{ "$limit" : 100 }]
Taking out the nested $unwind/$match/$replaceRoot combo removes about 30% of the run time, bringing it down to 3.5 seconds; however, those stages are necessary to look up the proper subdocument. Sorts on the 'from' collection that require no lookup complete within 0.5 seconds.
What am I doing wrong here? Thanks in advance!
Edit:
I've just tested the same thing with a larger set of records (38K records in the 'from' collection, 26K records in the 'to' collection, one-to-one relationship). It took over 7 minutes to complete the sort. I checked in Compass and saw that the index on "id" was actually being used (I kept refreshing during the 7 minutes and saw its usage count rise; I'm currently the only user of the database).
Here's the pipeline, which is simpler than the first:
[{ "$lookup" :
{ "from" : "1007",
"let" : { "key" : "$13" },
"pipeline" :
[{ "$match" :
{ "$expr" : { "$eq" : ["$id", "$$key"] } } }],
"as" : "L1" } },
{ "$sort" : { "L1.1" : -1 } },
{ "$project" : { "L1" : 0 } },
{ "$limit" : 100 }]
Does 7 minutes sound reasonable given the above info?
Edit 2:
Shell code to create two 40k-record collections (prod and prod2) with two fields (name: string, uid: integer):
var randomName = function() {
return (Math.random()+1).toString(36).substring(2);
}
for (var i = 1; i <= 40000; ++i) {
db.test.insert({
name: randomName(),
uid: i });
}
I created an index on the 'uid' field on prod2, increased the sample document limit of Compass to 50k, then did just the following lookup, which took two full minutes to compute:
{ from: 'prod2',
localField: 'uid',
foreignField: 'uid',
as: 'test' }
Edit 3:
I've also just run the aggregation pipeline directly from the shell and got results within a few seconds instead of two minutes:
db.test1.aggregate([{ $lookup:
{ from: 'test2',
localField: 'uid',
foreignField: 'uid',
as: 'test' } }]).toArray()
What's causing the discrepancy between the shell and both Compass and the .NET driver?
For anyone stumbling upon this post, the following worked for me: using the localField/foreignField version of the $lookup stage.
When monitoring the index in Compass, both the let/pipeline and the localField/foreignField versions hit the appropriate index, but the let/pipeline version was orders of magnitude slower.
I restructured my query-building logic to only use localField/foreignField and it has made a world of difference.
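As an illustration, here is a sketch of what that rewrite looks like for the simpler pipeline from the first edit above (db.collection is a placeholder; the field names "13", "id" and "L1" are taken from the question):
db.collection.aggregate([
    { "$lookup" :
        { "from" : "1007",
          "localField" : "13",      // was: "let" : { "key" : "$13" }
          "foreignField" : "id",    // was: "$expr" : { "$eq" : ["$id", "$$key"] }
          "as" : "L1" } },
    { "$sort" : { "L1.1" : -1 } },
    { "$project" : { "L1" : 0 } },
    { "$limit" : 100 }
])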

Project data set into new objects

I have a really simple question which has troubled me for some time. I have a list of objects containing an array of Measurements, where each of these contains a time and multiple values like below:
{
    "_id" : ObjectId("5710ed8129c7f31530a537bc"),
    "Measurements" : [
        {
            "_t" : "Measurement",
            "_time" : ISODate("2016-04-14T12:31:52.584Z"),
            "Measurement1" : 1,
            "Measurement2" : 2,
            "Measurement3" : 3
        },
        {
            "_t" : "DataType",
            "_time" : ISODate("2016-04-14T12:31:52.584Z"),
            "Measurement1" : 4,
            "Measurement2" : 5,
            "Measurement3" : 6
        },
        {
            "_t" : "DataType",
            "_time" : ISODate("2016-04-14T12:31:52.584Z"),
            "Measurement1" : 7,
            "Measurement2" : 8,
            "Measurement3" : 9
        }
    ]
},
{
    "_id" : ObjectId("5710ed8129c7f31530a537cc"),
    "Measurements" : [
        {
            "_t" : "Measurement",
            "_time" : ISODate("2016-04-14T12:31:52.584Z"),
            "Measurement1" : 0
            ....
I want to create a query that projects the data set above into the one below. For example, query for Measurement1 and create an array of objects containing the time and value of Measurement1 (see below) via the MongoDB aggregation framework.
{ "Measurement": [
{
"Time": ISODate("2016-04-14T12:31:52.584Z"),
"Value": 1
}
{
"Time": ISODate("2016-04-14T12:31:52.584Z"),
"Value": 4
}
{
"Time": ISODate("2016-04-14T12:31:52.584Z"),
"Value": 7
}
]}
Seems like a pretty standard operation, so I hope you guys can shed some light on this.
You can do this by first unwinding the Measurements array for each doc, then projecting the fields you need, and then grouping them back together:
db.test.aggregate([
// Duplicate each doc, once per Measurements array element
{$unwind: '$Measurements'},
// Include and rename the desired fields
{$project: {
'Measurements.Time': '$Measurements._time',
'Measurements.Value': '$Measurements.Measurement1'
}},
// Group the docs back together to reassemble the Measurements array field
{$group: {
_id: '$_id',
Measurements: {$push: '$Measurements'}
}}
])

MongoDB Calculate Values from Two Arrays, Sort and Limit

I have a MongoDB database storing float arrays. Assume a collection of documents in the following format:
{
"id" : 0,
"vals" : [ 0.8, 0.2, 0.5 ]
}
Having a query array, e.g. with values [ 0.1, 0.3, 0.4 ], I would like to compute a distance for every document in the collection (e.g. the sum of absolute differences; for the given document and query it would be abs(0.8 - 0.1) + abs(0.2 - 0.3) + abs(0.5 - 0.4) = 0.9).
I tried to use the aggregation framework of MongoDB to achieve this, but I can't work out how to iterate over the array. (I am not using the built-in geo operations of MongoDB, as the arrays can be rather long.)
I also need to sort the results and limit to the top 100, so calculation after reading the data is not desired.
Current Processing is mapReduce
If you need to execute this on the server, sort the results, and keep only the top 100, then you could use mapReduce like so:
db.test.mapReduce(
function() {
var input = [0.1,0.3,0.4];
var value = Array.sum(this.vals.map(function(el,idx) {
return Math.abs( el - input[idx] )
}));
emit(null,{ "output": [{ "_id": this._id, "value": value }]});
},
function(key,values) {
var output = [];
values.forEach(function(value) {
value.output.forEach(function(item) {
output.push(item);
});
});
output.sort(function(a,b) {
// sort descending by value; the comparator must return a number
return b.value - a.value;
});
return { "output": output.slice(0,100) };
},
{ "out": { "inline": 1 } }
)
So the mapper function does the calculation and outputs everything under the same key, so that all results are sent to the reducer. The end output is going to be contained in an array in a single output document, so it is important both that all results are emitted with the same key value and that the output of each emit is itself an array, so mapReduce can work properly.
The sorting and reduction are done in the reducer itself: as each emitted document is inspected, its elements are put into a single temporary array, sorted, and the top results are returned.
That is important, and is precisely why the emitter produces this as an array, even if it holds a single element at first. mapReduce works by processing results in "chunks", so even if all emitted documents have the same key, they are not all processed at once. Rather, the reducer puts its results back into the queue of emitted results to be reduced, until there is only a single document left for that particular key.
I'm restricting the "slice" output here to 10 for brevity, and including the stats to make a point, since the 100 reduce cycles invoked on this 10,000-document sample can be seen:
{
"results" : [
{
"_id" : null,
"value" : {
"output" : [
{
"_id" : ObjectId("56558d93138303848b496cd4"),
"value" : 2.2
},
{
"_id" : ObjectId("56558d96138303848b49906e"),
"value" : 2.2
},
{
"_id" : ObjectId("56558d93138303848b496d9a"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d93138303848b496ef2"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497861"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497b58"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497ba5"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d94138303848b497c43"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d95138303848b49842b"),
"value" : 2.1
},
{
"_id" : ObjectId("56558d96138303848b498db4"),
"value" : 2.1
}
]
}
}
],
"timeMillis" : 1758,
"counts" : {
"input" : 10000,
"emit" : 10000,
"reduce" : 100,
"output" : 1
},
"ok" : 1
}
So this is a single document output, in the specific mapReduce format, where the "value" contains an element which is an array of the sorted and limited results.
Future Processing is Aggregate
At the time of writing, the latest stable release of MongoDB is 3.0, which lacks the functionality to make your operation possible. But the upcoming 3.2 release introduces new operators that make it possible:
db.test.aggregate([
{ "$unwind": { "path": "$vals", "includeArrayIndex": "index" }},
{ "$group": {
"_id": "$_id",
"result": {
"$sum": {
"$abs": {
"$subtract": [
"$vals",
{ "$arrayElemAt": [ { "$literal": [0.1,0.3,0.4] }, "$index" ] }
]
}
}
}
}},
{ "$sort": { "result": -1 } },
{ "$limit": 100 }
])
Also limiting to the same 10 results for brevity, you get output like this:
{ "_id" : ObjectId("56558d96138303848b49906e"), "result" : 2.2 }
{ "_id" : ObjectId("56558d93138303848b496cd4"), "result" : 2.2 }
{ "_id" : ObjectId("56558d96138303848b498e31"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497c43"), "result" : 2.1 }
{ "_id" : ObjectId("56558d94138303848b497861"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499037"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b498db4"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496ef2"), "result" : 2.1 }
{ "_id" : ObjectId("56558d93138303848b496d9a"), "result" : 2.1 }
{ "_id" : ObjectId("56558d96138303848b499182"), "result" : 2.1 }
This is made possible largely due to $unwind being modified to project a field in results that contains the array index, and also due to $arrayElemAt which is a new operator that can extract an array element as a singular value from a provided index.
This allows the "look-up" of values by index position from your input array in order to apply the math to each element. The input array is wrapped in the existing $literal operator so $arrayElemAt does not complain and recognizes it as an array (this seems to be a small bug at present, as other array functions don't have the problem with direct input), and the appropriate matching value is picked using the "index" field produced by $unwind for the position.
The math is done by $subtract and, of course, another new operator, $abs, to meet the requirement. Also, since it was necessary to unwind the array in the first place, all of this is done inside a $group stage that accumulates all array members per document and applies the addition of entries via the $sum accumulator.
Finally all result documents are processed with $sort and then the $limit is applied to just return the top results.
Summary
Even with the new functionality about to become available in the aggregation framework, it is debatable which approach is actually more efficient. This is largely because there is still a need to $unwind the array content, which effectively produces a copy of each document per array member in the pipeline to be processed, and that generally causes overhead.
So whilst mapReduce is the only present way to do this until a new release, it may actually outperform the aggregation statement depending on the amount of data to be processed, despite the fact that the aggregation framework works on natively coded operators rather than translated JavaScript operations.
As with all things, testing is always recommended to see which case suits your purposes better and which gives the best performance for your expected processing.
Sample
Of course the expected result for the sample document provided in the question is 0.9 by the math applied. But just for my own testing, here is a short listing used to generate some sample data, so I could at least verify that the mapReduce code was working as it should:
var bulk = db.test.initializeUnorderedBulkOp();
var x = 10000;
while ( x-- ) {
var vals = [0,0,0];
vals = vals.map(function(val) {
return Math.round(Math.random()*10)/10;
});
bulk.insert({ "vals": vals });
if ( x % 1000 == 0) {
bulk.execute();
bulk = db.test.initializeUnorderedBulkOp();
}
}
The arrays are totally random single-decimal-place values, so there is not a lot of spread in the results I listed as sample output.

Mongodb : count array values with mapreduce / aggregation

I have documents with the following structure :
{
"name" : "John",
"items" : [
{"key1" : "value1"},
{"key1" : "value1"}
]
}
And I have built a simple function to count the total number of "items":
var count = 0;
db.collection.find({},{items:1}).limit(10000).forEach(
function (doc) {
if(doc.items){
count += doc.items.length;
}
}
)
print(count);
But after ~1 million items my function breaks and Mongo exits. I've looked at the new aggregation framework as well as mapReduce, and I'm not sure which would be the better fit for a simple count like this.
Suggestions welcome! Thanks.
It becomes very easy when you use the aggregation framework: http://docs.mongodb.org/manual/core/aggregation-pipeline/
db.collection.aggregate(
{ $unwind : "$items" },
{ $group : {_id:null, items_count : {$sum:1} }}
)
To return the count of items for each document instead, use:
{ $group : {_id:"$_id", items_count : {$sum:1} }}
You can store the length of doc.items as a field of the document. This approach adds some redundancy on disk, but it is a fast and easy way to deal with large collections (a sketch of keeping such a counter in sync follows the example document below).
{
"name" : "John",
"itemsLength" : 2,
"items" : [
{"key1" : "value1"},
{"key1" : "value1"}
]
}
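A minimal sketch of keeping that counter in sync, assuming items are added with $push (itemsLength is the hypothetical counter field from the example above):
// Increment the counter whenever an item is pushed, so the stored
// itemsLength stays equal to items.length.
db.collection.update(
    { "name" : "John" },
    {
        $push : { "items" : { "key2" : "value2" } },
        $inc  : { "itemsLength" : 1 }
    }
)

// The total across the collection is then a cheap $group over the counters:
db.collection.aggregate(
    { $group : { _id : null, items_count : { $sum : "$itemsLength" } } }
)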
Another option may be to use mapReduce but, I think, without sharding mapReduce would be slow.
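For completeness, a mapReduce version of the same total count might look roughly like this (a sketch, not benchmarked):
// Sketch: total item count via mapReduce; likely slower than the aggregation above.
db.collection.mapReduce(
    function() { emit(null, this.items ? this.items.length : 0); },  // one count per document
    function(key, values) { return Array.sum(values); },             // sum the per-chunk counts
    { out: { inline: 1 } }
)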