Meteor, Mongo query find every nth document - mongodb

I use time stamps in my collection, so every document has a time stamp. The user wants to get documents from "ts1" (time stamp 1) to "ts2" (time stamp 2), but there are too many documents in that interval, so I want to return only every nth one. For example, if there are 100,000 documents and I need to display 1,000, then 100000/1000 = 100, so every 100th document.
Is this possible, and how could I achieve it?
PS. I need to run this query inside a Meteor publish method.
Here's what I've got so far:
Meteor.publish('documents-chunk', function (from, to) {
    // get the document count for the range and compute nth
    var count = Documents.find({time: {$gte: from, $lte: to}}).count();
    if (count > 2000) {
        var nth = Math.round(count / 1000);
        return Documents.find(/*query every nth*/);
    }
    return Documents.find({time: {$gte: from, $lte: to}});
});
SOLUTION:
I solved this problem using the answer from Matt K.
This is what I've done: first I modified my collection and added an additional "id" field:
1.
Document.find({}, {sort: {time: 1}}).forEach(function (c, i) {
    Document.update(c, {$set: {id: i + 1}});
    console.log(i + 1);
});
This collection had a little less than 1.5M records, so it took some time. (Also to note: I had to add an index on {time: 1} to this collection, otherwise it would crash the database.)
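For reference, a minimal sketch of adding that index from the mongo shell (the underlying collection name documents is an assumption here):
// Sketch only: create the {time: 1} index the queries above rely on.
db.documents.createIndex({time: 1})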
2.
Meteor.publish('documents-chunk', function (from, to) {
    var nth = Math.round(Documents.find({time: {$gte: from, $lte: to}}, {sort: {time: 1}}).count() / 1000);
    // apply $mod to the new "id" field added in step 1
    return Documents.find({time: {$gte: from, $lte: to}, id: {$mod: [nth, 0]}}, {sort: {time: 1}});
});
This worked for me, and now I get the result I needed.
I've read at http://docs.mongodb.org/manual/tutorial/create-an-auto-incrementing-field/ that this kind of approach is not recommended, but at this point I could not find any other solution to this problem. I did find that this feature is requested at https://jira.mongodb.org/browse/SERVER-2397, so maybe in the future there will be a cleaner solution, but for now it works.

You can't, at least not to my knowledge. You've got four options:
1. Publish and subscribe to all 100,000, then display every 1000th. Logically speaking, your query depends on the number of results returned by another query, so this is a 2-step process no matter how you look at it.
2. If you wanted to be cute, you could have _id (or another field) be an auto-incrementing number. Then set var qCount = cursor.count() and query for documents where _id % qCount === 0.
3. Add a sample field to every 1000th record when it's created, then query for {sample: {$exists: true}}; see the sketch after this list.
4. Rethink the business logic. What's the added value of every 1000th record? If it's to "eyeball the data" you should probably be using an aggregate on the data anyway to get rid of outliers. (This is the right choice, but convincing the client is another story...)
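A minimal sketch of option 3, assuming documents pass through a single insert helper and flagging every 1000th insert is acceptable (the insertCount variable and the insertDocument helper are illustrative, not part of the original code):
// Sketch only: flag every 1000th document at write time.
var insertCount = 0;

function insertDocument(doc) {
    insertCount++;
    if (insertCount % 1000 === 0) {
        doc.sample = true;
    }
    Documents.insert(doc);
}

// The publication then only sends the flagged documents in the requested range:
Meteor.publish('documents-sample', function (from, to) {
    return Documents.find({time: {$gte: from, $lte: to}, sample: {$exists: true}});
});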

If you believe that MongoDB _id values are truly randomly assigned, then you could simply order by _id and pick the first N of a set. This would give you N random values from the interval.
Meteor.publish('documents-chunk', function (from, to) {
    return Documents.find({time: {$gte: from, $lte: to}}, {sort: {_id: 1}, limit: 1000});
});
I'd recommend running some statistics on the randomness of what you get back.
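For example, a rough spread check in the mongo shell (a sketch only; the collection name documents is an assumption, and from and to must already be defined):
// Bucket the 1000 sampled timestamps into 10 equal-width bins and print the counts;
// a roughly even spread suggests the sample covers the whole interval.
var docs = db.documents.find({time: {$gte: from, $lte: to}})
                       .sort({_id: 1}).limit(1000).toArray();
var bins = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0];
docs.forEach(function (d) {
    var i = Math.min(9, Math.floor(((d.time - from) / (to - from)) * 10));
    bins[i]++;
});
printjson(bins);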

Related

MongoDB fast count of subdocuments - maybe through index

I'm using MongoDB 4.0 on a MongoDB Atlas cluster (3 replicas, 1 shard).
Assume I have a collection that contains multiple documents.
Each of these documents holds an array of subdocuments that represent cities in a certain year, with additional information. An example document would look like this (I removed unnecessary information to simplify the example):
{
    _id: 123,
    cities: [
        {name: "vienna", year: 1985},
        {name: "berlin", year: 2001},
        {name: "vienna", year: 1985}
    ]
}
I have a compound index on name and year. What is the fastest way to count the occurrences of name and year combinations?
I already tried the following aggregation:
[
    {$unwind: {path: '$cities'}},
    {$group: {
        _id: {name: '$cities.name', year: '$cities.year'},
        count: {$sum: 1}
    }},
    {$project: {
        count: 1,
        name: '$_id.name',
        year: '$_id.year',
        _id: 0
    }}
]
Another approach I tried was a map-reduce in the following form; the map-reduce performed a bit better (~30% less time needed).
map function:
function m() {
    for (var i in this.cities) {
        emit({name: this.cities[i].name, year: this.cities[i].year}, 1);
    }
}
reduce function (I also tried replacing sum with length, but surprisingly sum is faster):
function r(id, counts) {
    return Array.sum(counts);
}
function call in the mongo shell:
db.test.mapReduce(m, r, {out: "mr_test"})
Now I was asking myself: is it possible to access the index directly? As far as I know it is a B+ tree that holds pointers to the relevant documents on disk, so from a technical point of view I think it should be possible to iterate through all the leaves of the index tree and just count the pointers. Does anybody know if this is possible?
Does anybody know another way to solve this in a highly performant way? (It is not possible to change the design because of other dependencies in the software, and we are running this on a very big dataset.) Does anybody have experience solving such a task via shards?
The index will not be very helpful in this situation.
MongoDB indexes were designed for identifying documents that match a given criteria.
If you create an index on {"cities.name": 1, "cities.year": 1}, then this document:
{
    _id: 123,
    cities: [
        {name: "vienna", year: 1985},
        {name: "berlin", year: 2001},
        {name: "vienna", year: 1985}
    ]
}
will have 2 entries in the b-tree that refer to it:
vienna|1985
berlin|2001
Even if it were possible to count the incidence of a specific key in the index, this would not necessarily correspond to the number of matching array elements (note that vienna|1985 appears only once in the index even though it occurs twice in the array).
MongoDB does not provide a method to examine the raw entries in an index, and it explicitly refuses to use an index on a field containing an array for counting.
The MongoDB count command and helper functions all count documents, not elements inside of them. As you noticed, you can unwind the array and count the items in an aggregation pipeline, but at that point you've already loaded all of the documents into memory, so it's too late to make use of an index.
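To make the last point concrete, a quick shell check (a sketch; it reuses the test collection name from the map-reduce call above):
// With the example document above in the collection, this returns 1:
// one matching document, even though vienna/1985 appears twice in its cities array.
db.test.countDocuments({"cities.name": "vienna", "cities.year": 1985})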

Does indexing small arrays of subdocuments in MongoDB affect performance?

My MongoDB collection has this document structure:
{
    _id: 1,
    my_dict: {
        my_key: [
            {id: x, other_fields: other_values},
            ...
        ]
    },
    ...
}
I need to update the array subdocuments very often, so an index on the id field seems like a good idea. Still, I have many documents (millions), but the arrays inside them are small (max ~20 elements). Would indexing still improve performance a lot, compared to the cost of indexing?
PS: I'm not using the id as a key (a dict instead of an array), as I also often need to get the number of elements in "the array" ($size only works on arrays). I cannot use count as I am using MongoDB 3.2.
Follow-up question: If it would make a very big difference, I could instead use a dict like so:
{id: {other_fields: other_values}}
and store the size myself in a field. What I dislike about this is that I would need another field and maintain it myself (with possible errors, since I would need to use $inc each time I add or delete an item; see the sketch after the structure below) instead of relying on "real" values. I would also have to handle the possibility that a key could be called _my_size, which would conflict with my logic. It would then look like this:
{
    _id: 1,
    my_dict: {
        my_key: {
            id: {other_fields: other_values},
            _my_size: 1
        }
    }
}
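A minimal sketch of maintaining _my_size by hand under that alternative schema (the collection name items and the key some_new_id are placeholders, not from the question):
// Sketch only: add one entry under my_dict.my_key and bump the counter in the same update.
db.items.updateOne(
    {_id: 1},
    {
        $set: {"my_dict.my_key.some_new_id": {other_fields: "other_values"}},
        $inc: {"my_dict.my_key._my_size": 1}
    }
);

// Reading the size back then becomes a plain projection instead of a $size aggregation:
db.items.findOne({_id: 1}, {"my_dict.my_key._my_size": 1});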
Still not sure which is best for performance. I will need to update the subdocuments (with the id field) a lot, as well as compute the $size a lot (maybe 1/10 as often as the updates).
Which schema/strategy would give me better performance? Or, even more important, would it actually make a big difference? (Possibly thousands of calls per second.)
Update example:
update(
    {_id: 1, "my_dict.my_key.id": update_data_id},
    {$set: {"my_dict.my_key": update_data}}
)
Getting the size example:
aggregate(
    {$match: {_id: 1}},
    {$project: {_id: 0, nb_of_sub_documents: {$size: "$my_dict.my_key"}}}
)

MongoDB: What is the fastest / is there a way to get the 200 documents with a closest timestamp to a specified list of 200 timestamps, say using a $in [duplicate]


Count fields in a MongoDB Collection

I have a collection of documents like this one:
{
    "_id": ObjectId("..."),
    "field1": "some string",
    "field2": "another string",
    "field3": 123
}
I'd like to be able to iterate over the entire collection and find the total number of fields there are. In this example document there are 3 (I don't want to include _id), but it ranges from 2 to 50 fields per document. Ultimately, I'm just looking for the average number of fields per document.
Any ideas?
Iterate over the entire collection, and find the total number of fields there are
You can use the aggregation operator $objectToArray (SERVER-23310) to turn keys into values and count them. This operator is available in MongoDB v3.4.4+.
For example:
db.collection.aggregate([
    {"$project": {"numFields": {"$size": {"$objectToArray": "$$ROOT"}}}},
    {"$group": {"_id": null, "fields": {"$sum": "$numFields"}, "docs": {"$sum": 1}}},
    {"$project": {"total": {"$subtract": ["$fields", "$docs"]}, "_id": 0}}
])
The first $project stage turns each document's keys into an array in order to count the fields. The second stage, $group, sums the number of keys/fields in the collection and also counts the number of documents processed. The third $project stage subtracts the total number of documents from the total number of fields (as you don't want to count _id).
You can easily add $avg in the last stage to get the average.
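For instance, a variation that returns the average number of fields per document, excluding _id (a sketch; db.collection is a placeholder as above):
db.collection.aggregate([
    // subtracting 1 drops the _id field from each document's key count
    {"$project": {"numFields": {"$subtract": [{"$size": {"$objectToArray": "$$ROOT"}}, 1]}}},
    {"$group": {"_id": null, "avgFields": {"$avg": "$numFields"}}}
])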
PRIMARY> var count = 0;
PRIMARY> db.my_table.find().forEach( function(d) { for(f in d) { count++; } });
PRIMARY> count
1074942
This is the simplest way I could figure out to do this. On really large datasets it probably makes sense to go the map-reduce path, but while your set is small enough, this'll do.
This is O(n^2), but I'm not sure there is a better way.
You could create a map-reduce job. In the map step, iterate over the properties of each document as a JavaScript object and output the count; then reduce to get the total. A sketch follows.
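A minimal sketch of that map-reduce approach (it reuses the my_table collection name from the snippet above; the inline output is an assumption):
// Map: emit the per-document field count (excluding _id); Reduce: sum the counts.
var map = function () {
    var n = 0;
    for (var key in this) {
        if (key !== "_id") n++;
    }
    emit("fields", n);
};
var reduce = function (key, values) {
    return Array.sum(values);
};
db.my_table.mapReduce(map, reduce, {out: {inline: 1}});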
For a simple way, just find() all documents and, for each result, get the number of fields:
db.getCollection(<collection>).find(<condition>)
then, for each document in the result set, count its keys.

mongodb - Find document with closest integer value

Let's assume I have a collection with documents with a ratio attribute that is a floating point number.
{'ratio':1.437}
How do I write a query to find the single document with the value closest to a given integer, without loading them all into memory using a driver and finding the one with the smallest value of abs(x - ratio)?
Interesting problem. I don't know if you can do it in a single query, but you can do it in two:
var x = 1; // given integer
closestBelow = db.test.find({ratio: {$lte: x}}).sort({ratio: -1}).limit(1);
closestAbove = db.test.find({ratio: {$gt: x}}).sort({ratio: 1}).limit(1);
Then you just check which of the two docs has the ratio closest to the target integer.
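For completeness, the final comparison could look something like this in the shell (a sketch; it also handles the case where one of the two queries returns nothing):
// closestBelow and closestAbove are the cursors from the two queries above.
var below = closestBelow.hasNext() ? closestBelow.next() : null;
var above = closestAbove.hasNext() ? closestAbove.next() : null;
var closest;
if (below === null) {
    closest = above;
} else if (above === null) {
    closest = below;
} else {
    closest = (Math.abs(x - below.ratio) <= Math.abs(above.ratio - x)) ? below : above;
}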
MongoDB 3.2 Update
The 3.2 release adds support for the $abs absolute value aggregation operator which now allows this to be done in a single aggregate query:
var x = 1;
db.test.aggregate([
// Project a diff field that's the absolute difference along with the original doc.
{$project: {diff: {$abs: {$subtract: [x, '$ratio']}}, doc: '$$ROOT'}},
// Order the docs by diff
{$sort: {diff: 1}},
// Take the first one
{$limit: 1}
])
I have another idea, but it is very tricky and requires changing your data structure.
You can use a geospatial index, which is supported by MongoDB.
First, change your data to this structure and keep the second value at 0:
{'ratio': [1.437, 0]}
Then you can use the $near operator to find the closest ratio value, and because the operator returns results sorted by distance from the point you give, you can use limit to get only the closest value:
db.places.find({ratio: {$near: [50, 0]}}).limit(1)
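Note that $near requires a geospatial index on the field, so this trick also needs one (a sketch, using the legacy 2d index type since the values are plain coordinate pairs):
db.places.createIndex({ratio: "2d"})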
If you don't want to do this, I think you can just use @JohnnyHK's answer :)