MongoDB map reduce producing strange result - mongodb

I am implementing a simple 'get max value' map-reduce in MongoDB (C# driver).
For my tests I have 10 items in a collection with int _id = 1 to 10.
My map and reduce are as follows:
var map = "function() {emit('_id', this.Id);}";
var reduce = "function(key, values) {var max = 1; for (id in values) {if(id>max) {max=id;}} return max;}";
When I run it, however, I get the result 9. Strange!!
I think the map is outputting a string, and thus the comparison is not working as desired.
Any help would be great

The reduce function won't run if the values for a key contain only one item. If all the ids are unique and your map key is just that id, the reduce phase won't run at all; this is by design, for performance. If you need to change the format of your output in that case, you should use a finalize function. Or take a look at the aggregation framework, which provides quite useful tools for working with data.
Check the JIRA ticket: jira.mongodb.org/browse/SERVER-5818
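For illustration, here is a minimal fixed version of the job plus the aggregation-framework equivalent. This is only a sketch; the collection name items and the out: { inline: 1 } option are assumptions:
var map = function () { emit('max', this._id); }; // constant key groups all values into one reduce call
var reduce = function (key, values) { return Math.max.apply(Math, values); }; // numeric, not string, comparison
db.items.mapReduce(map, reduce, { out: { inline: 1 } });
// Aggregation framework equivalent:
db.items.aggregate([{ $group: { _id: null, max: { $max: '$_id' } } }]);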
If you are just trying to get familiar with map-reduce, I would suggest trying different scenarios where using it really makes sense.
Cheers

Related

How to retrieve the count and range in the same gremlin query

g.V().hasLabel('Person').count() gives me the number of Person vertices in my database, and g.V().hasLabel('Person').range(10,15) gives me a range of Person vertices.
But is there a way to combine these two into a single Gremlin query?
This is just a simplified version of my query; my actual query is quite complex, and repeating all those traversals just to get the count seems inefficient!
I just realized I could use Groovy with Gremlin to achieve what I want. Not sure how elegant this is!
def lst = g.V().hasLabel('Person').toList()
def result = [count: 0, data: []]
result.count = lst.size()
result.data = lst[2..3]
result
This works great even in my complex case.
I wouldn't recommend doing a full count() in every query. Rather, count the total once and cache it in your application.
That said, here's how you could do it anyway:
g.V().hasLabel('Person').aggregate('x').cap('x').
project('page3', 'total').by(range(local, 10, 15)).by(count(local))
UPDATE:
In older versions you can try this:
g.V().hasLabel('Person').aggregate('x').cap('x').as('page3', 'total').
select('page3', 'total').by(range(local, 10, 15)).by(count(local))

Mongo aggregation limit while grouping

I have a collection which I'm aggregating and grouping by the field "type". The final result should be at most five documents per type. But if I limit before the group, only the first five documents overall get grouped; if I limit after the group, only the first five types are returned.
Is there a way to do this without doing a find() for each type, limiting to 5, and merging all the results?
If you can use C# (which, judging by a quick Google search about MongoDB, you do), you can do this with one of the GroupBy overloads that takes a result-selector function, like this:
var groups = Enumerable.Range(0, 1000)
    .GroupBy(
        x => x / 10,
        (key, elements) => new { Key = key, Elements = elements.Take(5) }
    );
About the speed of this code: I believe each group is completely built before the result selector is invoked, so a custom foreach over the input sequence that builds the groups by hand might be faster (if you can somehow determine when you are done).
P.S.: On second thought, I doubt my answer is the one you want. I had a look at the MongoDB documentation, and "map" in combination with a suitable "reduce" function might be exactly what you want.
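For completeness, on newer MongoDB servers (3.2+) the aggregation pipeline can express "first five per group" directly. A rough sketch, where the collection name myCollection is an assumption and type is the grouping field from the question:
db.myCollection.aggregate([
    { $group: { _id: '$type', docs: { $push: '$$ROOT' } } }, // collect all docs per type
    { $project: { docs: { $slice: ['$docs', 5] } } }          // keep only the first five
]);
Note that $push accumulates every document of a type before $slice trims the array, so memory use grows with group size, but it avoids the per-type round trips.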

MongoDB: Order by computed property

In my DB I have a collection where documents have a field score, which is a float (-1..1). I can query the DB to return the first 20 results ordered by score.
My problem is that I want to modify the score of a doc with a time penalty based on the field time_updated: the older the doc is, the lower the score should be. And the big problem is that I have to do this at runtime. I could iterate over all documents, update the score, and then order by score, but this would cost too much time, since there is a huge number of documents in the collection.
So my question is: with MongoDB, can I order by a computed property? Is there any way to do that? Or is there such a feature planned for upcoming versions of MongoDB?
Exactly how is the score updated?
If it's simple and can be expressed in $add, $multiply, etc. terms, then the aggregation pipeline will work well. Otherwise you'll need a simple MapReduce for the score computation:
var mapFunction = function() {
    emit(this._id, <compute score here from this.score and this.time_updated>);
};
var reduceFunction = function(key, values) {
    return values[0]; // trivial reduce function, since incoming ids are unique
};
For 10,000 rows, either the aggregation pipeline or a simple MapReduce will probably be sufficiently performant.
For much bigger datasets you may need to use a more complex MapReduce (that actually does a reduce) to be memory efficient. You might also want to take advantage of Incremental MapReduce.
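As a concrete illustration of the pipeline route, here is a hedged sketch of a linear time penalty. The field names score and time_updated come from the question; the collection name docs, the penalty factor (0.01 per day of age), and $$NOW (MongoDB 4.2+) are assumptions:
db.docs.aggregate([
    { $addFields: {
        adjusted_score: { $subtract: [
            '$score',
            { $multiply: [
                0.01, // assumed penalty per day of age
                { $divide: [
                    { $subtract: ['$$NOW', '$time_updated'] }, // age in milliseconds
                    1000 * 60 * 60 * 24                        // milliseconds per day
                ] }
            ] }
        ] }
    } },
    { $sort: { adjusted_score: -1 } },
    { $limit: 20 }
]);
This computes the penalized score per document at query time, so nothing has to be written back to the collection.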

Slow pagination over tons of records in mongodb

I have over 300k records in one collection in Mongo.
When I run this very simple query:
db.myCollection.find().limit(5);
It takes only a few milliseconds.
But when I use skip in the query:
db.myCollection.find().skip(200000).limit(5)
It won't return anything... it runs for minutes and returns nothing.
How can I make it better?
One approach to this problem, if you have large quantities of documents and you are displaying them in sorted order (I'm not sure how useful skip is if you're not), would be to use the key you're sorting on to select the next page of results.
So if you start with
db.myCollection.find().limit(100).sort({ created_date: 1 });
and then extract the created date of the last document returned by the cursor into a variable max_created_date_from_last_result, you can get the next page with the far more efficient query (presuming you have an index on created_date):
db.myCollection.find({ created_date: { $gt: max_created_date_from_last_result } }).limit(100).sort({ created_date: 1 });
From MongoDB documentation:
Paging Costs
Unfortunately skip can be (very) costly and requires the server to walk from the beginning of the collection, or index, to get to the offset/skip position before it can start returning the page of data (limit). As the page number increases skip will become slower and more cpu intensive, and possibly IO bound, with larger collections.
Range based paging provides better use of indexes but does not allow you to easily jump to a specific page.
You have to ask yourself a question: how often do you need the 40,000th page? Also see this article.
I found it performant to combine the two concepts (both a skip+limit and a find+limit). The problem with skip+limit is poor performance when you have a lot of docs (especially larger docs). The problem with find+limit is that you can't jump to an arbitrary page. I want to be able to paginate without doing it sequentially.
The steps I take are:
Create an index based on how you want to sort your docs, or just use the default _id index (which is what I used)
Know the starting value, page size and the page you want to jump to
Project + skip + limit the value you should start from
Find + limit the page's results
It looks roughly like this if I want to get page 5432 with 16 records per page (in JavaScript):
let page = 5432;
let page_size = 16;
let skip_size = page * page_size;

let retval = await db.collection(...)
    .find()
    .sort({ "_id": 1 })
    .project({ "_id": 1 })
    .skip(skip_size)
    .limit(1)
    .toArray();
let start_id = retval[0]._id; // the projection keeps _id, not id

retval = await db.collection(...)
    .find({ "_id": { "$gte": new mongo.ObjectID(start_id) } })
    .sort({ "_id": 1 })
    .project(...)
    .limit(page_size)
    .toArray();
This works because a skip on a projected index is very fast even if you are skipping millions of records (which is what I'm doing). If you run explain("executionStats"), it still shows a large number for totalDocsExamined, but because of the projection on an index it's extremely fast (essentially, the data blobs are never examined). Then, with the value for the start of the page in hand, you can fetch the next page very quickly.
I combined two answers.
The problem is that when you use skip and limit without a sort, the results are paginated in the order the data was written to the collection, so the engine first needs to build a temporary ordering. It is better to use the ready-made _id index: sort by _id. Then it is very quick even with large collections, like:
db.myCollection.find().skip(4000000).limit(1).sort({ "_id": 1 });
In PHP it would be:
$manager = new \MongoDB\Driver\Manager("mongodb://localhost:27017", []);
$options = [
'sort' => array('_id' => 1),
'limit' => $limit,
'skip' => $skip,
];
$where = [];
$query = new \MongoDB\Driver\Query($where, $options );
$get = $manager->executeQuery("namedb.namecollection", $query);
I'm going to suggest a more radical approach: combine skip/limit (as an edge case, really) with range-based buckets on your sort key, and base the top-level pages not on a fixed number of documents but on a range of time (or whatever your sort key is). So you have top-level pages that each cover a range of time, and sub-pages within that range if you still need skip/limit, but I suspect the buckets can be made small enough not to need skip/limit at all. By using the sort index, this avoids the cursor traversing the entire collection to reach the final page.
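A hedged sketch of the bucket idea in the mongo shell, where the collection and field names are assumptions and one-day buckets are chosen arbitrarily:
// Top-level page = one time bucket; the index seek on created_date lands
// directly in the bucket, so no long skip is needed.
var bucketStart = ISODate('2023-05-01T00:00:00Z');
var bucketEnd = ISODate('2023-05-02T00:00:00Z');
db.myCollection.find({ created_date: { $gte: bucketStart, $lt: bucketEnd } })
    .sort({ created_date: 1 })
    .limit(100); // sub-page within the bucket; add a small skip only if a bucket is large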
My collection has around 1.3M documents (not that big), properly indexed, but it still takes a big performance hit from this issue.
After reading the other answers, the way forward is clear: the paginated collection must be sorted by a counting integer, similar to SQL's auto-increment value, instead of a time-based value.
The problem is with skip; there is no way around it: if you use skip, you are bound to hit the issue when your collection grows.
Using a counting integer with an index allows you to jump using the index instead of skip. This won't work with a time-based value, because you can't calculate where to jump based on time, so skipping is the only option in that case.
On the other hand, by assigning a counting number to each document, the write performance takes a hit, because all documents must be inserted sequentially. This is fine for my use case, but I know the solution is not for everyone.
The most upvoted answer doesn't seem applicable to my situation, but this one does. (I need to be able to seek forward by arbitrary page number, not just one at a time.)
Plus, it is also hard if you are dealing with deletes, but it is still possible, because MongoDB supports $inc with a negative value for batch updating. Luckily I don't have to deal with deletion in the app I am maintaining.
Just writing this down as a note to my future self. It is probably too much hassle to fix this issue in the current application, but next time I'll build a better one if I encounter a similar situation.
If you have Mongo's default ObjectId _id, use it instead. This is probably the most viable option for most projects anyway.
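A minimal sketch of the counting-integer jump described above, where the seq field and the collection name are assumptions; seq must be assigned 0, 1, 2, ... at insert time and indexed:
var page = 5432, pageSize = 16;
db.myCollection.find({ seq: { $gte: page * pageSize } }) // range seek on the index, no skip
    .sort({ seq: 1 })
    .limit(pageSize);
Because $gte on the indexed seq field is a range seek, the cost no longer grows with the page number the way skip does.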
As stated in the official Mongo docs:
The skip() method requires the server to scan from the beginning of the input results set before beginning to return results. As the offset increases, skip() will become slower.
Range queries can use indexes to avoid scanning unwanted documents, typically yielding better performance as the offset grows compared to using skip() for pagination.
Descending order (example):
function printStudents(startValue, nPerPage) {
    let endValue = null;
    db.students.find({ _id: { $lt: startValue } })
        .sort({ _id: -1 })
        .limit(nPerPage)
        .forEach(student => {
            print(student.name);
            endValue = student._id;
        });
    return endValue;
}
Ascending order example here.
If you know the ID of the element from which you want to continue:
db.myCollection.find({_id: {$gt: id}}).limit(5)
This is a little genius solution which works like a charm.
For faster pagination, don't use the skip() function. Use limit() and find(), where you query past the last id of the preceding page.
Here is an example where I'm querying over tons of documents using Spring Boot:
Long totalElements = mongockTemplate.count(new Query(), "product");
int page = 0;
Long pageSize = 20L;
String lastId = "5f71a7fe1b961449094a30aa"; // the last id of the preceding page

for (int i = 0; i < (totalElements / pageSize); i++) {
    page += 1;
    Aggregation aggregation = Aggregation.newAggregation(
        Aggregation.match(Criteria.where("_id").gt(new ObjectId(lastId))),
        Aggregation.sort(Sort.Direction.ASC, "_id"),
        new CustomAggregationOperation(queryOffersByProduct),
        Aggregation.limit((long) pageSize)
    );
    List<ProductGroupedOfferDTO> productGroupedOfferDTOS =
        mongockTemplate.aggregate(aggregation, "product", ProductGroupedOfferDTO.class).getMappedResults();
    lastId = productGroupedOfferDTOS.get(productGroupedOfferDTOS.size() - 1).getId();
}

MongoDB, return recent document for each user_id in collection

Looking for similar functionality to Postgres' Distinct On.
I have a collection of documents {user_id, current_status, date}, where status is just text and date is a Date. I'm still in the early stages of wrapping my head around Mongo and getting a feel for the best way to do things.
Would map-reduce be the best solution here (map emits all, reduce keeps the latest record), or is there a built-in solution without pulling out MR?
There is a distinct command, however I'm not sure that's what you need. Distinct is kind of a "query" command and with lots of users, you're probably going to want to roll up data not in real-time.
Map-Reduce is probably one way to go here.
Map phase: your key would simply be the user ID. Your value would be something like {current_status: 'blah', date: 1234}.
Reduce phase: given an array of values, you would grab the most recent and return only it.
To make this work optimally, you'll probably want to look at a new feature in 1.8.0, the "re-reduce" feature, which will allow you to process only new data instead of re-processing the whole status collection.
The other way to do this is to build a "most-recent" collection and tie the status insert to that collection. So when you insert a new status for the user, you update their "most-recent".
Depending on the importance of this feature, you could possibly do both things.
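A rough sketch of the "most-recent" collection approach, using the upsert form of update that was current at the time; the collection name most_recent and the variables userId, status, and now are assumptions:
// On every status insert, also upsert the user's row in the side collection.
db.statuses.insert({ user_id: userId, current_status: status, date: now });
db.most_recent.update(
    { _id: userId },                                  // one row per user
    { $set: { current_status: status, date: now } },
    { upsert: true }                                  // create the row if it doesn't exist yet
);
Reads for "latest status per user" then become a plain find() on most_recent.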
Current solution that seems to be working well.
map = function() { emit(this.user.id, this.created_at); };
// We call new Date() just in case something's not being stored as a date and is
// instead just a string, because my date gathering/inserting function is kind of stupid atm
reduce = function(key, values) {
    return new Date(Math.max.apply(Math, values.map(function(x) { return new Date(x); })));
};
res = db.statuses.mapReduce(map, reduce);
Another way to achieve the same result would be to use the group command, which is a kind of map-reduce shortcut that lets you aggregate on a specific key or set of keys.
In your case it would read like this:
db.coll.group({
    key: { user_id: true },
    reduce: function(obj, prev) {
        // keep the newest status and date seen so far for each user
        if (new Date(obj.date) > prev.date) {
            prev.status = obj.status;
            prev.date = new Date(obj.date);
        }
    },
    initial: { status: "", date: new Date(0) }
})
However, unless you have a rather small, fixed number of users, I strongly believe a better solution would be, as previously suggested, to keep a separate collection containing only the latest status message for each user.
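For reference, a hedged sketch of how the same "latest per user" read looks in the aggregation framework (available from MongoDB 2.2): sort newest first, then take the first document per user.
db.statuses.aggregate([
    { $sort: { user_id: 1, date: -1 } },          // newest status first within each user
    { $group: {
        _id: '$user_id',
        current_status: { $first: '$current_status' },
        date: { $first: '$date' }
    } }
]);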