Geospatial search returns fewer results than available - mongodb

I have a collection (users) with 356 documents and an index "2dsphere" on the field geodata. 120 documents have a field geodata:{type:"Point", coordinates:[X,Y]} where X and Y are coordinates within Germany.
If I execute the following aggregation:
db.users.aggregate(
[
{
"$geoNear":{
near: { type: "Point", coordinates: [48.783469, 9.181842] },
distanceField: "distanceCalculated",
spherical: true
}
},
{
"$sort":{"distanceCalculated":1}
}
]
)
I get 46 results, but as I understand it, I should get at least 100 (the default for the limit parameter).
As far as I can see, all results are within ~210 kilometers. Is there an undocumented default maxDistance? I also tried different maxDistance values but never got more than 48 results.
My question is: what do I have to do to get all 120 results ordered by distance?

Found the pitfall:
All 120 documents had geodata.coordinates stored as an object instead of an array. Unfortunately I can't say why MongoDB nevertheless returned some results instead of nothing. I converted all entries using the following command, and now it works:
db.users.find({ "geodata": { $exists: true, $ne: null }}).forEach((user)=> {
var coordinates = user.geodata.coordinates;
user.geodata.coordinates = [coordinates["0"], coordinates["1"]];
db.users.save(user);
});
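The conversion itself can be checked outside the database. The following is only a sketch in plain JavaScript (no MongoDB calls); toCoordinateArray is a made-up helper name, and it assumes the malformed value is an object with the keys "0" and "1", as described above:

```javascript
// Hypothetical helper: turn a coordinates value into a real array.
// Assumption: malformed values look like { "0": x, "1": y }.
function toCoordinateArray(coordinates) {
  if (Array.isArray(coordinates)) {
    return coordinates; // already an array, nothing to do
  }
  if (coordinates !== null && typeof coordinates === "object" &&
      "0" in coordinates && "1" in coordinates) {
    return [coordinates["0"], coordinates["1"]];
  }
  throw new TypeError("Unsupported coordinates value");
}

// Sample values are illustrative only.
const fixed = toCoordinateArray({ "0": 9.181842, "1": 48.783469 });
console.log(Array.isArray(fixed), fixed);
```

Inside the forEach above you could call such a helper instead of indexing by hand, which also leaves documents that are already correct untouched.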

Related

Geonear and more than one 2dsphere indexes

I have a question about using $near vs. geoNear to return the distance from points stored in the database to a point of interest entered by the user, when the schema storing the points has more than one 2dsphere index.
The use case is below.
In my schema I have a source and a destination location as below. The query using Intracity.find works properly and gives me sorted entries from an entered point of interest.
var baseShippingSchema = new mongoose.Schema({
startDate : Date,
endDate : Date,
locSource: {
type: [Number],
index: '2dsphere'
},
locDest: {
type: [Number],
index: '2dsphere'
}
});
var search_begin = moment(request.body.startDate0, "DD-MM-YYYY").toDate();
var search_end = moment(request.body.endDate1, "DD-MM-YYYY").toDate();
var radius = 7000;
Intracity.find({
locSource: {
$near:{$geometry: {type: "Point",
coordinates: [request.body.lng0,request.body.lat0]},
$minDistance: 0,
$maxDistance: radius
}
}
}).where('startDate').gte(search_begin)
.where('endDate').lte(search_end)
.limit(limit).exec(function(err, results)
{
// pass the callback's err through to the template
response.render('test.html', {results : results, error: err});
});
However, I also want to return the distance of each stored point from the point of interest, which, as far as I know, is not possible using $near but is possible using the geoNear API.
However, the documentation of geonear says the following.
geoNear requires a geospatial index. However, the geoNear command requires that a collection have at most only one 2d index and/or only one 2dsphere.
Since my schema has two 2dsphere indexes, the following geoNear call fails with the error "more than one 2d index, not sure which to run geoNear on":
var point = { name: 'locSource', type : "Point",
coordinates : [request.body.lng0 , request.body.lat0] };
Intracity.geoNear(point, { limit: 10, spherical: true, maxDistance:radius, startDate:{ $gte: search_begin}, endDate:{ $lte:search_end}}, function(err, results, stats) {
if (err){return done(err);}
response.render('test.html', {results : results, error: error});
});
So my question is how can I also get the distance for each of these stored points, from entered point of interest using the schema described above.
Any help would be really great, as my Internet search is not going anywhere.
Thank you
Mrunal
As you noted, the MongoDB docs state that
The geoNear command and the $geoNear pipeline stage require that a collection have at most only one 2dsphere index and/or only one 2d index
On the other hand, calculating distances inside MongoDB is only possible with the aggregation framework, as it is a specialized projection. If you do not want to take the
relational DB approach: maintaining a separate distance table between all items
then your other option is the
document store approach: calculate distances in your server-side JS code. You would have to cover memory limits by paginating results.
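For the server-side route, a minimal sketch using the haversine formula could look like this. It assumes the points are stored as [lng, lat] pairs as in the schema above; the function names and the radius constant are my own choices, not anything Mongoose provides:

```javascript
// Assumed Earth radius in metres (equatorial value); swap in another if you prefer.
const EARTH_RADIUS_M = 6378100;

function toRadians(degrees) {
  return degrees * Math.PI / 180;
}

// Great-circle distance in metres between two [lng, lat] pairs (haversine formula).
function haversineMeters([lng1, lat1], [lng2, lat2], radius = EARTH_RADIUS_M) {
  const dLat = toRadians(lat2 - lat1);
  const dLng = toRadians(lng2 - lng1);
  const a = Math.sin(dLat / 2) ** 2 +
    Math.cos(toRadians(lat1)) * Math.cos(toRadians(lat2)) * Math.sin(dLng / 2) ** 2;
  // Math.min guards against rounding pushing the value slightly above 1.
  return 2 * radius * Math.asin(Math.min(1, Math.sqrt(a)));
}

// Hypothetical usage: add a distance property to plain result objects from a $near query.
function withDistance(docs, point, field) {
  return docs.map((doc) =>
    Object.assign({}, doc, { distance: haversineMeters(point, doc[field]) }));
}
```

You would run the existing $near query first (it already sorts by proximity) and then call something like withDistance(results, [lng0, lat0], 'locSource') on plain objects, e.g. after .lean().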

Filter Documents by Distance Stored in Document with $near

I am using the following example to better explain my need.
I have a set of points (users) on a map, and the collection schema is as below:
{
location:{
latlong:[long,lat]
},
maxDistance:Number
}
I have another collection with events happening in the area. Its schema is given below:
{
eventLocation:{
latlong:[long,lat]
}
}
Now users can add their location and the maximum distance they want to travel to attend an event, and save it.
Whenever a new event is posted, all the users whose preferences it satisfies will get a notification. How do I query that? I tried the following query on the user schema:
{
$where: {
'location.latlong': {
$near: {
$geometry: {
type: "Point",
coordinates: [long,lat]
},
$maxDistance: this.distance
}
}
}
}
and got an error:
error: {
"$err" : "Can't canonicalize query: BadValue $where got bad type",
"code" : 17287
}
How do I query the above case, given that maxDistance is defined by each user and is not fixed? I am using a 2dsphere index.
Presuming you have already worked out how to act on the event data as you receive it and have it in hand (if you have not, then that is another question, but look at tailable cursors), then you should have an object with that data with which to query the users.
This is therefore not a case for JavaScript evaluation with $where, as it cannot access the query data returned from a $near operation anyway. What you want instead is $geoNear from the aggregation framework. This can project the "distance" found from the query, and allow a later stage to "filter" the results against the user stored value for the maximum distance they want to travel to published events:
// Represent retrieved event data
var eventData = {
eventLocation: {
latlong: [long,lat]
}
};
// Find users near that event within their stored distance
User.aggregate(
[
{ "$geoNear": {
"near": {
"type": "Point",
"coordinates": eventData.eventLocation.latlong
},
"distanceField": "eventDistance",
"limit": 100000,
"spherical": true
}},
{ "$redact": {
"$cond": {
"if": { "$lt": [ "$eventDistance", "$maxDistance" ] },
"then": "$$KEEP",
"else": "$$PRUNE"
}
}}
],
function(err,results) {
// Work with results in here
}
)
Now you do need to be careful with the returned number: since you appear to be storing "legacy coordinate pairs" instead of GeoJSON, the distance returned from this operation will be in radians and not a standard distance. So presuming you are storing "miles" or "kilometers" on the user objects, you need to convert via the formula given in the manual under "Calculate Distances Using Spherical Geometry".
The basics are that you need to divide by the equatorial radius of the earth, being either 3,963.2 miles or 6,378.1 kilometers to convert for a comparison to what you have stored.
The alternate is to store in GeoJSON instead, where there is a consistent measurement in meters.
Assuming "kilometers", that "if" line becomes:
"if": { "$lt": [
"$eventDistance",
{ "$divide": [ "$maxDistance", 6378.1 ] }
]},
This reliably compares your stored kilometer value to the radian result returned.
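To sanity-check the conversion outside the pipeline, here is a small sketch in plain JavaScript. The helper names are hypothetical, and it assumes maxDistance is stored in kilometers and eventDistance comes back in radians, as discussed above:

```javascript
// Equatorial radius in kilometers, the value quoted above.
const EARTH_RADIUS_KM = 6378.1;

function kmToRadians(km) {
  return km / EARTH_RADIUS_KM;
}

function radiansToKm(radians) {
  return radians * EARTH_RADIUS_KM;
}

// Plain-JavaScript equivalent of the $redact condition for one user document.
// Assumes user.eventDistance is in radians and user.maxDistance is in kilometers.
function withinTravelDistance(user) {
  return user.eventDistance < kmToRadians(user.maxDistance);
}

console.log(withinTravelDistance({ eventDistance: 0.001, maxDistance: 10 }));
```

The same arithmetic is what the $divide expression does inside the $redact stage, so this is handy for unit-testing edge cases before running the aggregation.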
The other thing to be aware of is that $geoNear has a default "limit" of 100 results, so you need to "pump up" the "limit" argument there to the number of users you expect could match. You might even want to do this in "range lists" of user ids for a really large system, but you can go as big as memory allows within a single aggregation operation and possibly add allowDiskUse where needed.
If you don't tune that parameter, then only the nearest 100 results (the default) will be returned, which may well not even suit your next operation of filtering those "near" the event to start with. Use common sense though, as you surely have a max distance to even filter out potential users, and that can be added to the query as well.
As stated, the point here is returning the distance for comparison, so the next stage is the $redact operation, which can filter the user's own "travel distance" value against the returned distance from the event. The end result gives only those users that fall within their own distance constraint from the event and who will qualify for notification.
That's the logic. You project the distance from the user to the event and then compare to the user stored value for what distance they are prepared to travel. No JavaScript, and all native operators that make it quite fast.
Also as noted in the options and the general commentary, I really do suggest you use a "2dsphere" index for accurate spherical distance calculation as well as converting to GeoJSON storage for your coordinate storage in your database Objects, as they are both general standards that produce consistent results.
Try it without embedding your query in $where: {. The $where operator is for passing a javascript function to the database, which you don't seem to want to do here (and is in fact something you should generally avoid for performance and security reasons). It has nothing to do with location.
{
'location.latlong': {
$near: {
$geometry: {
type: "Point",
coordinates: [long,lat]
},
$maxDistance: this.distance
}
}
}

Find largest document size in MongoDB

Is it possible to find the largest document size in MongoDB?
db.collection.stats() shows average size, which is not really representative because in my case sizes can differ considerably.
You can use a small shell script to get this value.
Note: this will perform a full table scan, which will be slow on large collections.
let max = 0, id = null;
db.test.find().forEach(doc => {
const size = Object.bsonsize(doc);
if(size > max) {
max = size;
id = doc._id;
}
});
print(id, max);
Note: this will attempt to store the whole result set in memory (from .toArray) . Careful on big data sets. Do not use in production! Abishek's answer has the advantage of working over a cursor instead of across an in memory array.
If you also want the _id, try this. Given a collection called "requests" :
// Creates a sorted list, then takes the max
db.requests.find().toArray().map(function(request) { return {size:Object.bsonsize(request), _id:request._id}; }).sort(function(a, b) { return a.size-b.size; }).pop();
// { "size" : 3333, "_id" : "someUniqueIdHere" }
Starting Mongo 4.4, the new aggregation operator $bsonSize returns the size in bytes of a given document when encoded as BSON.
Thus, in order to find the bson size of the document whose size is the biggest:
// { "_id" : ObjectId("5e6abb2893c609b43d95a985"), "a" : 1, "b" : "hello" }
// { "_id" : ObjectId("5e6abb2893c609b43d95a986"), "c" : 1000, "a" : "world" }
// { "_id" : ObjectId("5e6abb2893c609b43d95a987"), "d" : 2 }
db.collection.aggregate([
{ $group: {
_id: null,
max: { $max: { $bsonSize: "$$ROOT" } }
}}
])
// { "_id" : null, "max" : 46 }
This:
$groups all items together
computes the $max of the documents' $bsonSize within that group
$$ROOT represents the current document, whose BSON size is measured
Finding the largest documents in a MongoDB collection can be ~100x faster than the other answers using the aggregation framework and a tiny bit of knowledge about the documents in the collection. Also, you'll get the results in seconds, vs. minutes with the other approaches (forEach, or worse, getting all documents to the client).
You need to know which field(s) in your document might be the largest ones - which you almost always will know. There are only two practical [1] MongoDB types that can have variable sizes:
arrays
strings
The aggregation framework can calculate the length of each. Note that you won't get the size in bytes for arrays, but the length in elements. However, what matters more typically is which the outlier documents are, not exactly how many bytes they take.
Here's how it's done for arrays. As an example, let's say we have a collection of users in a social network and we suspect the array friends.ids might be very large (in practice you should probably keep a separate field like friendsCount in sync with the array, but for the sake of example, we'll assume that's not available):
db.users.aggregate([
{ $match: {
'friends.ids': { $exists: true }
}},
{ $project: {
sizeLargestField: { $size: '$friends.ids' }
}},
{ $sort: {
sizeLargestField: -1
}},
])
The key is to use the $size aggregation pipeline operator. It only works on arrays though, so what about text fields? We can use the $strLenBytes operator. Let's say we suspect the bio field might also be very large:
db.users.aggregate([
{ $match: {
bio: { $exists: true }
}},
{ $project: {
sizeLargestField: { $strLenBytes: '$bio' }
}},
{ $sort: {
sizeLargestField: -1
}},
])
You can also combine $size and $strLenBytes using $sum to calculate the size of multiple fields. In the vast majority of cases, 20% of the fields will take up 80% of the size (if not 10/90 or even 1/99), and large fields must be either strings or arrays.
[1] Technically, the rarely used binData type can also have variable size.
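Combining $size and $strLenBytes with $sum, as mentioned above, might look like the sketch below. buildSizeEstimatePipeline is a made-up helper that only builds the pipeline array, so it runs in plain JavaScript; the $ifNull guards are my addition to cope with missing fields, and it assumes each listed field holds the stated type when present:

```javascript
// Builds a pipeline that ranks documents by a rough size score:
// element count of each array field plus byte length of each string field.
// Note the score mixes elements and bytes, so it only helps to spot outliers.
function buildSizeEstimatePipeline(arrayFields, stringFields, limit = 10) {
  const parts = [
    // $ifNull substitutes an empty array / empty string when the field is missing.
    ...arrayFields.map((f) => ({ $size: { $ifNull: ["$" + f, []] } })),
    ...stringFields.map((f) => ({ $strLenBytes: { $ifNull: ["$" + f, ""] } })),
  ];
  return [
    { $project: { sizeEstimate: { $sum: parts } } },
    { $sort: { sizeEstimate: -1 } },
    { $limit: limit },
  ];
}

const pipeline = buildSizeEstimatePipeline(["friends.ids"], ["bio"]);
console.log(JSON.stringify(pipeline, null, 2));
// In the shell you would then run: db.users.aggregate(pipeline)
```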
Well.. this is an old question.. but - I thought to share my cent about it
My approach - use Mongo mapReduce function
First - let's get the size for each document
db.myCollection.mapReduce
(
function() { emit(this._id, Object.bsonsize(this)) }, // map the result to be an id / size pair for each document
function(key, val) { return val }, // val = document size value (single value for each document)
{
query: {}, // query all documents
out: { inline: 1 } // just return result (don't create a new collection for it)
}
)
This will return the sizes of all documents, although it is worth mentioning that saving the output as a collection is a better approach (the result is an array of results inside the results field).
Second - let's get the max size of document by manipulating this query
db.metadata.mapReduce
(
function() { emit(0, Object.bsonsize(this))}, // mapping a fake id (0) and use the document size as value
function(key, vals) { return Math.max.apply(Math, vals) }, // use Math.max function to get max value from vals (each val = document size)
{ query: {}, out: { inline: 1 } } // same as first example
)
This gives you a single result whose value equals the max document size.
In short:
you may want to use the first example and save its output as a collection (change the out option to the name of the collection you want), then apply further aggregations to it (max size, min size, etc.)
-OR-
you may want to use a single query (the second option) for getting a single stat (min, max, avg, etc.)
If you're working with a huge collection, loading it all at once into memory will not work, since you'll need more RAM than the size of the entire collection for that to work.
Instead, you can process the entire collection in batches using the following package I created:
https://www.npmjs.com/package/mongodb-largest-documents
All you have to do is provide the MongoDB connection string and collection name. The script will output the top X largest documents when it finishes traversing the entire collection in batches.
Inspired by Elad Nana's package, but usable in a MongoDB console :
function biggest(collection, limit=100, sort_delta=100) {
var documents = [];
// smallest size kept after the most recent sort; -1 means "keep everything" until the first sort
var threshold = -1;
var cursor = collection.find().readPref("nearest");
while (cursor.hasNext()) {
var doc = cursor.next();
var size = Object.bsonsize(doc);
if (documents.length < limit || size > threshold) {
documents.push({ id: doc._id.toString(), size: size });
}
if (documents.length > (limit + sort_delta) || !cursor.hasNext()) {
documents.sort(function (first, second) {
return second.size - first.size;
});
documents = documents.slice(0, limit);
if (documents.length === limit) { threshold = documents[limit-1].size; }
}
}
return documents;
}; biggest(db.collection)
Uses cursor
Gives a list of the limit biggest documents, not just the biggest
Sort & cut output list to limit every sort_delta
Use nearest as read preference (you might also want to use rs.slaveOk() on the connection to be able to list collections if you're on a slave node)
As Xavier Guihot already mentioned, a new $bsonSize aggregation operator was introduced in Mongo 4.4, which can give you the size of the object in bytes. In addition to that just wanted to provide my own example and some stats.
Usage example:
// I had an `orders` collection in the following format
[
{
"uuid": "64178854-8c0f-4791-9e9f-8d6767849bda",
"status": "new",
...
},
{
"uuid": "5145d7f1-e54c-44d9-8c10-ca3ce6f472d6",
"status": "complete",
...
},
...
];
// and I've run the following query to get documents' size
db.getCollection("orders").aggregate(
[
{
$match: { status: "complete" } // pre-filtered only completed orders
},
{
$project: {
uuid: 1,
size: { $bsonSize: "$$ROOT" } // added object size
}
},
{
$sort: { size: -1 }
},
],
{ allowDiskUse: true } // required as I had huge amount of data
);
as a result, I received a list of documents by size in descending order.
Stats:
For the collection of ~3M records and ~70GB size in total, the query above took ~6.5 minutes.

Mongoose limit/offset and count query

Bit of an odd one on query performance... I need to run a query which does a total count of documents, and can also return a result set that can be limited and offset.
So, I have 57 documents in total, and the user wants 10 documents offset by 20.
I can think of 2 ways of doing this, first is query for all 57 documents (returned as an array), then using array.slice return the documents they want. The second option is to run 2 queries, the first one using mongo's native 'count' method, then run a second query using mongo's native $limit and $skip aggregators.
Which do you think would scale better? Doing it all in one query, or running two separate ones?
Edit:
// 1 query
var limit = 10;
var offset = 20;
Animals.find({}, function (err, animals) {
if (err) {
return next(err);
}
res.send({count: animals.length, animals: animals.slice(offset, limit + offset)});
});
// 2 queries
Animals.find({}, null, {limit: 10, skip: 20}, function (err, animals) {
if (err) {
return next(err);
}
Animals.count({}, function (err, count) {
if (err) {
return next(err);
}
res.send({count: count, animals: animals});
});
});
I suggest you use two queries:
db.collection.count() will return the total number of items. Without a query filter, this value is read from collection metadata rather than calculated by scanning documents.
db.collection.find().skip(20).limit(10): here I assume you would sort by some field, so do not forget to add an index on that field. This query will be fast too.
I think you shouldn't query all items and then perform the skip and take in application code, because later, when you have a lot of data, you will have problems with data transfer and processing.
Instead of using 2 separate queries, you can use aggregate() in a single query:
The "$facet" stage can fetch both the total count and the data (with skip and limit) in one pass:
db.collection.aggregate([
//{$sort: {...}}
//{$match:{...}}
{$facet:{
"stage1" : [ {"$group": {_id:null, count:{$sum:1}}} ],
"stage2" : [ { "$skip": 0}, {"$limit": 2} ]
}},
{$unwind: "$stage1"},
//output projection
{$project:{
count: "$stage1.count",
data: "$stage2"
}}
]);
Output as follows:
[{
count: 50,
data: [
{...},
{...}
]
}]
Also, have a look at https://docs.mongodb.com/manual/reference/operator/aggregation/facet/
db.collection_name.aggregate([
{ '$match' : { } },
{ '$sort' : { '_id' : -1 } },
{ '$facet' : {
metadata: [ { $count: "total" } ],
data: [ { $skip: 1 }, { $limit: 10 },{ '$project' : {"_id":0} } ] // add a projection here if you wish to re-shape the docs
} }
] )
Instead of using two queries to find the total count and fetch the matched page of records, $facet is the best and most optimized way. It lets you:
Match the records
Find the total count
Skip and limit the records
Reshape the data according to your needs in the query.
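The steps above can be sketched as a pair of helpers. Both names are hypothetical and the code only builds and unpacks plain objects, so it runs without a database. One detail worth handling: when nothing matches, the $count sub-pipeline produces no document, so metadata comes back as an empty array:

```javascript
// Builds a $facet pagination pipeline: one branch counts, the other pages.
function buildPagePipeline(filter, sort, skip, limit) {
  return [
    { $match: filter },
    { $sort: sort },
    { $facet: {
        metadata: [{ $count: "total" }],
        data: [{ $skip: skip }, { $limit: limit }],
    } },
  ];
}

// $facet yields a single result document; metadata is [] when nothing matched.
function unpackPage(facetResult) {
  const [first] = facetResult;
  const total = first.metadata.length > 0 ? first.metadata[0].total : 0;
  return { total: total, data: first.data };
}

console.log(JSON.stringify(buildPagePipeline({}, { _id: -1 }, 20, 10)));
// Usage idea: unpackPage(await Model.aggregate(buildPagePipeline({}, { _id: -1 }, 20, 10)))
```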
There is a library that will do all of this for you, check out mongoose-paginate-v2
After having to tackle this issue myself, I would like to build upon user854301's answer.
With Mongoose ^4.13.8 I was able to use a function called toConstructor(), which allowed me to avoid building the query multiple times when filters are applied. I know this function is available in older versions too, but you'll have to check the Mongoose docs to confirm this.
The following uses Bluebird promises:
const Promise = require('bluebird'); // Promise.join comes from Bluebird
let schema = Query.find({ name: 'bloggs', age: { $gt: 30 } });
// save the query as a 'template'
let query = schema.toConstructor();
return Promise.join(
schema.count().exec(),
query().limit(limit).skip(skip).exec(),
function (total, data) {
return { data: data, total: total }
}
);
Now the count query will return the total records it matched and the data returned will be a subset of the total records.
Please note the () around query() which constructs the query.
You don't have to use two queries or one complicated query with aggregate and such.
You can use one query
example:
const getNames = async (queryParams) => {
const cursor = db.collection.find(queryParams).skip(20).limit(10);
return {
count: await cursor.count(),
data: await cursor.toArray()
}
}
Mongo returns a cursor that has predefined functions such as count, which will return the full count of the queried results regardless of skip and limit.
So in the count property you will get the total number of matching documents, and in data you will get just the chunk at offset 20 with a limit of 10 documents.
Thanks Igor Igeto Mitkovski, the best solution is to use the native connection.
The documentation is here: https://docs.mongodb.com/manual/reference/method/cursor.count/#mongodb-method-cursor.count
Mongoose doesn't support it ( https://github.com/Automattic/mongoose/issues/3283 ),
so we have to use the native connection.
const query = StudentModel.collection.find(
{
age: 13
},
{
projection:{ _id:0 }
}
).sort({ time: -1 })
const count = await query.count()
const records = await query.skip(20)
.limit(10).toArray()

Update with expression instead of value

I am totally new to MongoDB... I am missing a "newbie" tag, so the experts would not have to see this question.
I am trying to update all documents in a collection using an expression. The query I was expecting to solve this was:
db.QUESTIONS.update({}, { $set: { i_pp : i_up * 100 - i_down * 20 } }, false, true);
That, however, results in the following error message:
ReferenceError: i_up is not defined (shell):1
At the same time, the database did not have any problem with eating this one:
db.QUESTIONS.update({}, { $set: { i_pp : 0 } }, false, true);
Do I have to do this one document at a time or something? That just seems excessively complicated.
Update
Thank you Sergio Tulentsev for telling me that it does not work. Now, I am really struggling with how to do this. I offer 500 Profit Points to the helpful soul, who can write this in a way that MongoDB understands. If you register on our forum I can add the Profit Points to your account there.
I just came across this while searching for the MongoDB equivalent of SQL like this:
update t
set c1 = c2
where ...
Sergio is correct that you can't reference another property as a value in a straight update. However, db.c.find(...) returns a cursor and that cursor has a forEach method:
Queries to MongoDB return a cursor, which can be iterated to retrieve
results. The exact way to query will vary with language driver.
Details below focus on queries from the MongoDB shell (i.e. the
mongo process).
The shell find() method returns a cursor object which we can then iterate to retrieve specific documents from the result. We use
hasNext() and next() methods for this purpose.
for( var c = db.parts.find(); c.hasNext(); ) {
print( c.next());
}
Additionally in the shell, forEach() may be used with a cursor:
db.users.find().forEach( function(u) { print("user: " + u.name); } );
So you can say things like this:
db.QUESTIONS.find({}, {_id: true, i_up: true, i_down: true}).forEach(function(q) {
db.QUESTIONS.update(
{ _id: q._id },
{ $set: { i_pp: q.i_up * 100 - q.i_down * 20 } }
);
});
to update them one at a time without leaving MongoDB.
If you're using a driver to connect to MongoDB then there should be some way to send a string of JavaScript into MongoDB; for example, with the Ruby driver you'd use eval:
connection.eval(%q{
db.QUESTIONS.find({}, {_id: true, i_up: true, i_down: true}).forEach(function(q) {
db.QUESTIONS.update(
{ _id: q._id },
{ $set: { i_pp: q.i_up * 100 - q.i_down * 20 } }
);
});
})
Other languages should be similar.
// the only difference is to make it look like an aggregation pipeline
db.table.updateMany({}, [{
$set: {
col3:{"$sum":["$col1","$col2"]}
},
}]
)
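For the expression in the original question, the pipeline form of update (available from MongoDB 4.2) might look like the sketch below. The names scorePipeline and expectedScore are made up for illustration; only the final commented line talks to the database, the rest runs in plain JavaScript:

```javascript
// Update pipeline computing i_pp = i_up * 100 - i_down * 20 from each document's own fields.
const scorePipeline = [
  { $set: {
      i_pp: { $subtract: [
        { $multiply: ["$i_up", 100] },
        { $multiply: ["$i_down", 20] },
      ] },
  } },
];

// Plain-JavaScript reference for the same arithmetic, handy for spot checks.
function expectedScore(doc) {
  return doc.i_up * 100 - doc.i_down * 20;
}

console.log(expectedScore({ i_up: 3, i_down: 2 })); // 260
// In the shell: db.QUESTIONS.updateMany({}, scorePipeline)
```

Note the square brackets around the update argument: they are what makes the server treat it as a pipeline, so that "$i_up" is read as a field reference rather than a literal string.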
You can't use expressions in a classic (non-pipeline) update document. Or, rather, you can't use expressions that depend on fields of the document. Simple self-contained math expressions are fine (e.g. 2 * 2).
If you want to set a new field for all documents that is a function of other fields, you have to loop over them and update manually. Multi-update won't help here.
Rha7 gave a good idea, but the code above does not work without defining a temporary variable.
This sample code computes an approximate age (ignoring leap years) based on the 'birthday' field and inserts the value into the age field for all documents that do not yet have one:
db.employers.find({age: {$exists: false}}).forEach(function(doc){
var new_age = parseInt((ISODate() - doc.birthday)/(3600*1000*24*365));
db.employers.update({_id: doc._id}, {$set: {age: new_age}});
});
Example to remove "00" from the beginning of a caller id:
db.call_detail_records_201312.find(
{ destination: /^001/ },
{ "destination": true }
).forEach(function(row){
db.call_detail_records_201312.update(
{ _id: row["_id"] },
{ $set: {
destination: row["destination"].replace(/^001/, '1')
}
}
)
});