Can't find a good index for IpIntervals in MongoDB

I have a collection called Cities with city info in it. Each city document has an inner IpIntervals array ({StartNum, EndNum}) that contains the IP intervals for the city. Each interval boundary is calculated using the formula 256 * 256 * 256 * a + 256 * 256 * b + 256 * c + d, where "a.b.c.d" is the IP address. To find the location for an IP address I'm using the query:
{IpIntervals: {$elemMatch: {"StartNum": {$lte: <<my_ip_num>>}, "EndNum": {$gte: <<my_ip_num>>}}}}
which works great but it takes about 270 ms, so I want to use some index with it. I've tried different indexes like:
{"IpIntervals.StartNum": 1, "IpIntervals.EndNum": 1}, {"IpIntervals.StartNum": -1, "IpIntervals.EndNum": 1}, {"IpIntervals.StartNum": 1, "IpIntervals.EndNum": -1}, {"IpIntervals.StartNum": 1}
But nothing seems to work: it is always a BasicCursor and about 270 ms, which is not good. Any ideas about what index is appropriate in this situation?
Thanks.
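(For reference, the candidate indexes above would be created roughly like this; a sketch, using the Cities collection name that appears in the workaround below:)

db.Cities.ensureIndex({"IpIntervals.StartNum": 1, "IpIntervals.EndNum": 1})
db.Cities.ensureIndex({"IpIntervals.StartNum": 1})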
Sample data:
{
"_id" : { "$oid" : "51015e8bd246e8e455ee027d" },
"Name" : "SomeCity",
"Latitude" : 28.755787,
"Longitude" : 37.617634,
"IpIntervals" : [
{ "StartNum" : 2457360384, "EndNum" : 2457360639 },
{ "StartNum" : 2457361408, "EndNum" : 2457362431 },
{ "StartNum" : 2457364480, "EndNum" : 2457366527 },
{ "StartNum" : 2461648896, "EndNum" : 2461650943 }
]
}

Finally found a workaround with the query
db.Cities.find({"IpIntervals.StartNum": {$lte: <<my_ip_num>>}}).limit(1).sort({"IpIntervals.StartNum": -1}) and the index {"IpIntervals.StartNum": 1}, which takes about 2 ms.
Since the IP intervals are not overlapping, I can order the StartNums and take the closest one to my_ip_num. Still, I haven't found any good index for the query {IpIntervals: {$elemMatch: {"StartNum": {$lte: <<my_ip_num>>}, "EndNum": {$gte: <<my_ip_num>>}}}}
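A minimal sketch of the whole lookup in the mongo shell, assuming a hypothetical ipToNum helper for the formula above and a made-up example address:

function ipToNum(ip) {
    // "a.b.c.d" -> 256 * 256 * 256 * a + 256 * 256 * b + 256 * c + d
    var p = ip.split(".");
    return 256 * 256 * 256 * parseInt(p[0], 10) + 256 * 256 * parseInt(p[1], 10) +
           256 * parseInt(p[2], 10) + parseInt(p[3], 10);
}

var ipNum = ipToNum("146.123.45.67");   // made-up example address
// Intervals don't overlap, so the interval with the largest StartNum <= ipNum is the
// only candidate; the {"IpIntervals.StartNum": 1} index makes this a cheap lookup.
// It is still worth checking EndNum >= ipNum on the returned interval, since the
// address could fall into a gap between two intervals.
db.Cities.find({"IpIntervals.StartNum": {$lte: ipNum}})
         .sort({"IpIntervals.StartNum": -1})
         .limit(1)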

Related

Query performing faster without the index

Below is a simplified version of a document in my database:
{
_id : 1,
main_data : 100,
sub_docs: [
{
_id : a,
data : 22
},
{
_id: b,
data : 859
},
{
_id: c,
data: 151
},
... snip ...
{
_id: m,
data: 721
},
{
_id: n,
data: 111
}
]
}
So imagine I have a million of these documents with varied data values (say 0 - 1000). Currently my query is something like:
db.myDb.find(
{ sub_docs: { $elemMatch: { data: { $gte: 110, $lt: 160 } } } }
)
Also say the query above will only match around 0.001% of the data (so around 10 documents are returned in total).
And I have an index set using:
db.myDb.ensureIndex({ "sub_docs.data": 1 })
Performing a timed test on this data seems to show it's quicker without any index set on sub_docs.data.
I'm using Mongo 3.2.8.
Edit - Additional information:
My timed test is a Perl script which queries the server and then pulls back the relevant data. I ran this test first with the index in place; however, the slow query times forced me to do a bit of digging. I wanted to see how bad the query times would get if I dropped the index, yet dropping it actually improved the response time of the query!
I went a bit further and plotted the query response time vs. the total number of documents in the DB. Both graphs show a linear increase in query time, but the query with the index increases at a much faster rate.
Throughout testing I've been keeping an eye on the server memory usage (which is low), as my first thought was that the index might not fit in memory.
So overall my question is: why does this particular query perform better without an index?
And is there any way to improve the speed of this query with a better index?
Update
Ok so it's been a while and I've narrowed it down to the index not constraining both sides of the query search parameters.
The query above shows an index bound of [-inf, 160] rather than 110 to 160.
I can resolve this problem by using the index min and max functions as follows:
db.myDb.find(
{ sub_docs: { $elemMatch: { data: { $gte: 110, $lt: 160 } } } }
).min({'sub_docs.data': 110}).max({'sub_docs.data': 160})
However (if possible) I would prefer a different way of doing this, as I would like to make use of the aggregation framework (which doesn't seem to support the min/max index functions).
Ok so I managed to sort this in the end. For whatever reason the index doesn't limit the query as I expected.
Running this:
db.myDb.find({ sub_docs: { $elemMatch: { data: { $gte: 110, $lt: 160 } } } }).explain()
Snippet of what the index is doing is below:
"inputStage" : {
"stage" : "IXSCAN",
"keyPattern" : {
"sub_docs.data" : 1
},
"indexName" : "sub_docs.data_1",
"isMultiKey" : true,
"isUnique" : false,
"isSparse" : false,
"isPartial" : false,
"indexVersion" : 1,
"direction" : "forward",
"indexBounds" : {
"sub_docs.data" : [
"[-inf.0, 160.0)"
]
}
}
Instead of limiting the index scan to between 110 and 160, it's scanning every index key less than 160.
I've not included it but the other rejected plan was an index scan of 110 to inf+.
You can get around this issue with the min/max limits I mentioned above, however this means you can't use the aggregation framework, which sucks.
So the solution I found was to pull out all the data I wanted to index on into an array:
{
_id : 1,
main_data : 100,
index_values : [
22,
859,
151,
...snip...
721,
111
],
sub_docs: [
{
_id : a,
data : 22
},
{
_id: b,
data : 859
},
{
_id: c,
data: 151
},
... snip ...
{
_id: m,
data: 721
},
{
_id: n,
data: 111
}
]
}
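To populate that array on documents that already exist, a one-off script along these lines could work (a sketch, not part of the original solution; the field names follow the example above):

db.myDb.find({ index_values: { $exists: false } }).forEach(function (doc) {
    // Copy every sub_docs.data value into a flat index_values array.
    var values = [];
    doc.sub_docs.forEach(function (sd) { values.push(sd.data); });
    db.myDb.update({ _id: doc._id }, { $set: { index_values: values } });
});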
And then I create the index:
db.myDb.ensureIndex({index_values : 1})
And then query on that instead:
db.myDb.find({ index_values : { $elemMatch: { $gte: 110, $lt: 160 } } }).explain()
Which produces:
"indexBounds" : {
"index_values" : [
"[110.0, 160.0]"
]
}
So a lot fewer documents to check now!
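One caveat worth noting (not covered above): whenever sub_docs changes, index_values has to change in the same update so the two arrays stay in sync. A hedged sketch, with made-up values:

db.myDb.update(
    { _id: 1 },
    // Push the new sub-document and its data value in a single operation.
    { $push: { sub_docs: { _id: "o", data: 42 }, index_values: 42 } }
);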

Optimizing MongoDB Query on GEO fence

I am using MongoDB to store a lot of GPS data (2 million documents with about 1000 GPS points per document). The data looks like the following:
{
"Data": [
{
latitude : XXXXXX,
longitude : XXXXXX,
speed : XXXXXXX
}],
"_id": ID,
" StartDayOfWeek" : X
}
As you can see, Data is an array of GPS points plus additional information.
I query for documents on specific days of the week (sometimes I query for multiple days, in which case I use $or).
Oh, and lat/lon are stored as integers.
The following is my query (an example):
db.testNew2.aggregate(
{ $sort : {"StartDayOfWeek" : 1 }},
{ $match: {"$and" : [
{ "StartDayOfWeek" : 1},
{ 'Data' :
{$elemMatch:
{ "latitude" : {$gt:48143743, $lt:48143843}}}},
{ 'Data' :
{$elemMatch:
{ "longitude" : {$gt:11554706,$lt:11554806}}}}
]}},
{ $unwind : "$Data"},
{ $match: {"$and" : [
{ 'Data.latitude' : {$gt:48143743, $lt:48143843}},
{ 'Data.longitude' : {$gt:11554706, $lt:11554806}}
]}},
{$group: {
"_id" : "$_id",
"Traces" : { $push :"$Data"}
}}
)
As you can see I am sorting out the GPS-Points that are not within the GEO fence.
This query works fine, but it seems very slow and I do not know why. There is an index on StartDayOfWeek and the machine is more than capable (24 GB of RAM and two 7200 rpm SATA drives in RAID 0).
The collection size is about 130 GB and the query takes about 3-5 minutes.
In the Java program that uses this query I also set allowDiskUse, since the return values can be larger than 16 MB.
Is there any way to optimize this?

How to sort before querying in an embedded document

I know how to sort the embedded document after the find returns results, but how do I sort before the find so that the query itself runs on the sorted array? I know this must be possible with aggregate, but I'd really like to know whether it is possible without that, so that I understand better how it works.
This is my embedded document:
"shipping_charges" : [
{
"region" : "region1",
"weight" : 500,
"rate" : 10
},
{
"region" : "Bangalore HQ",
"weight" : 200,
"rate" : 40
},
{
"region" : "region2",
"weight" : 1500,
"rate" : 110
},
{
"region" : "region3",
"weight" : 100,
"rate" : 50
},
{
"region" : "Bangalore HQ",
"weight" : 100,
"rate" : 150
}
]
This is the query I use to match the 'region' and the 'weight' to get the pricing for that match:
db.clients.find( { "shipping_charges.region" : "Bangalore HQ" , "shipping_charges.weight" : { $gte : 99 } }, { "shipping_charges.$" : 1 } ).pretty()
This query currently returns:
{
"shipping_charges" : [
{
"region" : "Bangalore HQ",
"weight" : 200,
"rate" : 40
}
]
}
The reason it returns this element is presumably the order in which it appears (and matches) in the embedded document.
But I want this to return the element that best matches the closest slab of the weight (100 grams).
What changes are required in my existing query so that the embedded document is sorted before the find runs on it, giving the results I want?
If for any reason you are sure this can't be done without map-reduce, let me know so that I can stay away from this method and focus only on map-reduce to get the desired results.
You can use an aggregation pipeline instead of map-reduce:
db.clients.aggregate([
// Filter the docs to what we're looking for.
{$match: {
'shipping_charges.region': 'Bangalore HQ',
'shipping_charges.weight': {$gte: 99}
}},
// Duplicate the docs, once per shipping_charges element
{$unwind: '$shipping_charges'},
// Filter again to get the candidate shipping_charges.
{$match: {
'shipping_charges.region': 'Bangalore HQ',
'shipping_charges.weight': {$gte: 99}
}},
// Sort those by weight, ascending.
{$sort: {'shipping_charges.weight': 1}},
// Regroup and take the first shipping_charge which will be the one closest to 99
// because of the sort.
{$group: {_id: '$_id', shipping_charges: {$first: '$shipping_charges'}}}
])
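For the sample document in the question this returns the 100-gram slab (the closest weight at or above 99) rather than the 200-gram one, i.e. something along these lines (the _id is whatever the client document's _id happens to be):

{
    "_id" : <the client document's _id>,
    "shipping_charges" : { "region" : "Bangalore HQ", "weight" : 100, "rate" : 150 }
}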
You could also use find, but you'd need to pre-sort the shipping_charges array by weight in the documents themselves. You can do that by using a $push update with the $sort modifier:
db.clients.update({}, {
$push: {shipping_charges: {$each: [], $sort: {weight: 1}}}
}, {multi: true})
After doing that, your existing query will return the right element:
db.clients.find({
"shipping_charges.region" : "Bangalore HQ",
"shipping_charges.weight" : { $gte : 99 }
}, { "shipping_charges.$" : 1 } )
You would, of course, need to consistently include the $sort modifier on any further updates to your docs' shipping_charges array to ensure it stays sorted.
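For example, an update that adds a new slab later on could look like this (the selector and the new element's values are made up for illustration):

db.clients.update(
    { "shipping_charges.region": "Bangalore HQ" },   // hypothetical selector for the client to update
    { $push: { shipping_charges: {
        $each: [ { region: "region4", weight: 250, rate: 60 } ],
        $sort: { weight: 1 }                          // keeps the array sorted by weight
    } } }
)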

Can I group floating point numbers by range in MongoDB?

I have a MongoDB set up with documents like this
{
"_id" : ObjectId("544ced7b9f40841ab8afec4e"),
"Measurement" : {
"Co2" : 38,
"Humidity" : 90
},
"City" : "Antwerp",
"Datum" : ISODate("2014-10-01T23:13:00.000Z"),
"BikeId" : 26,
"GPS" : {
"Latitude" : 51.20711593206187,
"Longitude" : 4.424424413969158
}
}
Now I try to aggregate them by date and location and also add the average of the measurement to the result. So far my code looks like this:
db.stadsfietsen.aggregate([
{$match: {"Measurement.Co2": {$gt: 0}}},
{
$group: {
_id: {
hour: {$hour: "$Datum"},
Location: {
Long: "$GPS.Longitude",
Lat: "$GPS.Latitude"
}
},
Average: {$avg: "$Measurement.Co2"}
}
},
{$sort: {"_id": 1}},
{$out: "Co2"}
]);
which gives me a nice list of all the possible combinations of hour and GPS coordinates, in this form:
{
"_id" : {
"hour" : 0,
"Location" : {
"Long" : 3.424424413969158,
"Lat" : 51.20711593206187
}
},
"Average" : 82
}
The problem is that there are so many unique coordinates that it's not useful.
Can I group the documents together when the values are close together? Say from longitude 51.207 to longitude 51.209?
There is no standard support for ranges in $group.
Mathematically
You could calculate a new value that will be the same for several geolocations. For example, you could simulate a floor method with $subtract and $mod:
_id: {
    hour: {$hour: "$Datum"},
    Location: {
        Long: {$subtract: ["$GPS.Longitude", {$mod: ["$GPS.Longitude", 0.01]}]},
        Lat: {$subtract: ["$GPS.Latitude", {$mod: ["$GPS.Latitude", 0.01]}]}
    }
}
Geospatial Indexing
You could restructure your application to use a geospatial index and search for all locations in a given range. Whether this is applicable depends very much on your use case.
Map-Reduce
Map-reduce is more powerful than the aggregation framework. You can definitely use it for these calculations, but it's more complex, and I can't present a ready-made solution without spending another hour on it.

MongoDB Query with sum

I have a simple document setup:
{
VG: "East",
Artikellist: {
Artikel1: "Sprite",
Amount1: 1,
Artikel2: "Fanta",
Amount2: 3
}
}
Actually I just want to query these documents to get a list of the articles sold in each VG (or maybe town, it doesn't matter). In addition, the query should sum the amount of each product and give it back to me.
I know I'm thinking in SQL terms, but that's essentially what I need.
My idea was this:
db.collection.aggregate([{
$group: {
_id: {
VG: "$VG",
Artikel1: "$Artikellist.Artikel1",
Artikel2: "$Artikellist.Artikel2",
$sum: "$Artikellist.Amount1",
$sum: "$Artikellist.Amount2"
},
}
}]);
The hardest part here is that I have 5 different values for VG, and there can be a maximum of 5 Artikel (with corresponding amounts) in one list.
So hopefully you can help me here. Sorry for my bad English and my even worse Mongo skills.
If Artikel1 is always "Sprite" and Artikel2 is always "Fanta", then you can try this one:
db.test.aggregate({
    $group: {
        _id: {
            VG: "$VG",
            Artikel1: "$Artikellist.Artikel1",
            Artikel2: "$Artikellist.Artikel2"
        },
        Amount1: {$sum: "$Artikellist.Amount1"},
        Amount2: {$sum: "$Artikellist.Amount2"}
    }
});
If the values of Artikel1 and Artikel2 can vary, I suggest changing the structure of the document, say to:
{
    VG: "East",
    Artikellist: [
        { Artikel: "Sprite", Amount: 1 },
        { Artikel: "Fanta", Amount: 3 }
    ]
}
and then use the following approach:
db.test.aggregate(
    {$unwind: "$Artikellist"},
    {$group: {
        _id: {VG: "$VG", Artikel: "$Artikellist.Artikel"},
        Amount: {$sum: "$Artikellist.Amount"}
    }}
);
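For the sample document above, this produces one result per (VG, Artikel) pair, along the lines of:

{ "_id" : { "VG" : "East", "Artikel" : "Sprite" }, "Amount" : 1 }
{ "_id" : { "VG" : "East", "Artikel" : "Fanta" }, "Amount" : 3 }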