MongoDB - Can I have multiple sort fields in a map reduce? - mongodb

I have a map/reduce that I noticed was taking nearly 10 seconds to run, even though no results were being returned. Using the mongo shell, I was able to determine that my initial query was the culprit. I was able to add a 2nd sort field to the query, which sped it up drastically.
However, when I try to add that 2nd sort field to my map reduce, I get the error "could not create cursor over [collection] for query ...". Is there any way for me to add a 2nd sort field to the map reduce?
Edit: The goal of my query is to find the first record created by each user/day. So the key that I am emitting is the user's id + created on day, ignoring any time. That way I am grouping all records created by a user on a given day. In my reduce, I am then taking the record that was created first. I have actually ditched using map/reduce and am now doing essentially the same thing, but with a normal find() and then some javascript to group and reduce.

Related

Aggregate do not return records in order as they are inserted

I am facing one issue with aggregate query.
When I am trying to retrieve records using aggregate ($match), I am not receiving records in same order they are inserted.
but when I am trying to query using find, then I am getting data in same order data inserted.
In Mongo, the default internal sort order is an "unknown" implementation detail.
If a certain order was required then you should use the $sort stage otherwise it's considered overhead operation for the storage engine
Without any query criteria, results will be returned by the storage engine in their "natural order" or in layman term the order which they were found, however we don't really know what that order is and we shouldn't rely on it.
So obviously the simple option would be to add a $sort stage on _id for example, if for some reason you don't have a field you can sort on you can use $natural that will return results in the order you expect.

What does the distinct on clause mean in cloud datastore and how does it effect the reads?

This is what the cloud datastore doc says but I'm having a hard time understanding what exactly this means:
A projection query that does not use the distinct on clause is a small operation and counts as only a single entity read for the query itself.
Grouping
Projection queries can use the distinct on clause to ensure that only the first result for each distinct combination of values for the specified properties will be returned. This will return only the first result for entities which have the same values for the properties that are being projected.
Let's say i have a table for questions and i only want to get the question text sorted by the created date would this be counted as a single read and rest as small operations?
If your goal is to just project the date and text fields, you can create a composite index on those two fields. When you query, this is a small operation with all the results as a single read. You are not trying to de-duplicate (so no distinct/on) in this case and so it is a small operation with a single read.

Sort collection by insertion datetime using only id field

I have a collection of data and I want to get it sorted by insertion time. I have not any additional fields to store the insert time. But as I found out I can get this time from Id.
I have tried this code:
return bookmarks.find({}, {sort: {_id.getTimestamp(): 1}, limit: 10});
or
return bookmarks.find({}, {sort: {ObjectId(_id).getTimestamp(): 1}, limit: 10});
but get the error message:
=> Your application has errors. Waiting for file change.
Is there any way to sort collection by insertion datetime using only id field ?
At the moment this isn't possible with Meteor, even if it is with MongoDB. The ObjectID's created with meteor don't bear a timestamp. See http://docs.meteor.com/#collection_object_id
The reason for this is client side code can insert code and it can arrive late on the server, hence there is no guarantee the timestamp portion of the ObjectID will be accurate. In addition to the latency the client side's date is used meaning if they're off it's going to get you incorrect data. I think this is the reason they use an ObjectID but it is completely random.
If you want to sort by date you have to store the time/date separately.
The part what i striked out is not accurate. Meteor use it is own id generation which is based on a random string that is while does not apply the doc what i linked before. Check sasha.sochka's comment under.
It is nearly but not 100% good if you just sort for the _id field . While as it is constructed the first 4 byte is the timestamp in secs (so sorting for the getTimestamps value is not better). Under one second resolution you cannot get the exact order, as it is mentioned in the documentation: http://docs.mongodb.org/manual/reference/object-id/#objectid
It is still true that you can try to check the exact order of the insert/update ops against your collection in the oplog, if you have an oplog, but as it is a capped collection anyway you will see the recent operations only. http://docs.mongodb.org/manual/core/replica-set-oplog/.

MongoDB $in not only one result in case of repeated elements

I need to get the users whose ids are contained in an array. For this i'm using the $in operator, however being this inside an aggregate operation, i'd like to get back a specific user all the time it's id is present in the array, not just one. For example:
The ids array is A=[a,b,c,b] and U(x) is user with id x
with users.find({_id:{$in:A}}) i get these users as result: U(a),U(b),U(c)
instead i'd like to get back the result: U(a),U(b),U(c),U(b)
so get the user back every time it's id appears.
I understand that $in is working as expected but does anyone have an idea on how can i achieve this?
Thanks
This isn't possible using a MongoDB query.
MongoDB's query engine iterates over the documents in a collection (or over an index if there's a useful one) and returns to you any documents that match your query, in the order it finds them. Whether b appears once, twice, or a hundred times in your query makes no difference: the document with _id of b matches the query and is returned once, when MongoDB finds it.
You can do a post-processing step in your programming language to repeat documents as many times as you want.

what is the Recommended max emits in map function?

I am new to mongoDb and planning to use map reduce for computing large amount of data.
As you know we have map function to match the criteria and then emit the required data for a given filed. In my map function I have multiple emits. As of now I have 50 Fields emitted from a single document. That means from a single document in a collection explodes to 40 document in temp table. So if I have 1 million documents to be processed it will 1million * 40 documents in temp table by end of map function.
The next step is to sort on this collection. (I haven't used sort param of map will it help?)
Thought of splitting the map function into two….but then one more problem …while performing map function if by chance I ran into an exception thought of skipping entire document data (I.e not to emit any data from that document) but if I split I won't be able to….
In mongoDB.org i found a comment which said..."When I run MR job, with sort - it takes 1.5 days to reach 23% at first stage of MR. When I run MR job, without sort, it takes about 24-36 hours for all job.Also when turn off jsMode is speed up my MR twice ( before i turn off sorting )"
Will enabling sort help? or Will turning OFF jsmode help? i am using mongo 2.0.5
Any suggestion?
Thanks in advance .G
The next step is to sort on this collection. (I haven't used sort param of map will it help?)
Don't know what you mean, MR's don't have sort params, only the incoming query has a sort param. The sort param of the incoming query only sorts the data going in. Unless you are looking for some specific behaviour that will avoid sorting the final output using an incoming sort you don't normally need to sort.
How are you looking to use this MR. Obviusly it won't be in realtime else you would just kill your servers so Ima guess it is a background process that runs and formats data to the way you want. I would suggest looking into incremental MRs so that you do delta updates throughout the day to limit the amount of resources used at any given time.
So if I have 1 million documents to be processed it will 1million * 40 documents in temp table by end of map function.
Are you emiting multiple times? If not then the temp table should have only one key per row with documents of the format:
{
_id: emitted_id
[{ //each of your docs that you emit }]
}
This is shown: http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/
or Will turning OFF jsmode help? i am using mongo 2.0.5
Turning off jsmode is unlikely to do anything significant and results from it have varied.