MongoDB - Get last document based on multiple fields

I've got a special case of query that I need to perform as efficiently as possible.
Problem:
I have a collection of messages, and each message has a groupID field. I need to get the last message of each groupID, but I don't want to run one query per group; instead I want to pass an array of groupIDs and get back an array of messages.
Solutions so far:
I managed to come up with two solutions:
1. Perform a query for each groupID, which works fine and takes about 200 ms to complete, but issues many requests against MongoDB.
2. Use Aggregate to group messages by GroupID and then select the first of each group, which is really slow and takes about 4000 ms.
Code
// Fetch the newest message for a single groupID (solution 1, run once per group)
var filter = Builders<Message>.Filter.Eq("GroupID", groupID);
var sort = Builders<Message>.Sort.Descending("Date");
return await MessagesCollection.Find(filter).Sort(sort).Limit(1).FirstAsync();
What I'm looking for is a way to batch the queries, or a single query that can return the latest message for each of the given groupIDs. Any ideas?
Any help will be appreciated.
Thanks in advance.
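For illustration, here is a minimal sketch of the single-query idea in pymongo (the question's code uses the .NET driver, and the database/collection names below are assumptions): match only the requested groupIDs, sort newest first, then take the $first document per GroupID. A compound index on (GroupID, Date) lets the $match and $sort stages run off the index instead of sorting in memory, which is usually what makes a plain $group over the whole collection slow.

# Hypothetical pymongo sketch; collection and database names are assumptions.
from pymongo import MongoClient, DESCENDING

messages = MongoClient().mydb.messages
messages.create_index([("GroupID", 1), ("Date", DESCENDING)])

def last_message_per_group(group_ids):
    pipeline = [
        # Restrict to the requested groups instead of grouping the whole collection.
        {"$match": {"GroupID": {"$in": group_ids}}},
        # Newest first, so $first picks the latest message of each group.
        {"$sort": {"GroupID": 1, "Date": -1}},
        {"$group": {"_id": "$GroupID", "lastMessage": {"$first": "$$ROOT"}}},
    ]
    return list(messages.aggregate(pipeline))

The same pipeline stages translate to the .NET driver's Aggregate() with BsonDocument stages; alternatively, solution 1 can simply be fired once per group in parallel and awaited together to cut the wall-clock time.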

Related

Couchbase N1QL Query getting distinct on the basis of particular fields

I have a document structure which looks something like this:
{
...
"groupedFieldKey": "groupedFieldVal",
"otherFieldKey": "otherFieldVal",
"filterFieldKey": "filterFieldVal"
...
}
I am trying to fetch all documents which are unique with respect to groupedFieldKey. I also want to fetch otherFieldKey from ANY of these documents. otherFieldKey has minor variations from one document to another, but I am comfortable with getting ANY of these values.
SELECT DISTINCT groupedFieldKey, otherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal";
This query fetches all the documents because of the minor variations.
SELECT groupedFieldKey, maxOtherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal"
GROUP BY groupedFieldKey
LETTING maxOtherFieldKey = MAX(otherFieldKey);
This query works as expected, but takes a long time due to the GROUP BY step. As this query is used to show products in the UI, this is not desirable. I have tried applying indexes, but they have not given fast results.
Actual details of the records:
Number of records = 100,000
Size per record = Approx 10 KB
Time taken to load the first 10 records: 3s
Is there a better way to do this? A way of getting DISTINCT only on particular fields will be good.
EDIT 1:
You can follow this discussion thread in Couchbase forum: https://forums.couchbase.com/t/getting-distinct-on-the-basis-of-a-field-with-other-fields/26458
GROUP BY must materialize all the documents. You can try a covering index:
CREATE INDEX ix1 ON bucket(filterFieldKey, groupedFieldKey, otherFieldKey);

Limit the select query batch size

I am using the MongoTool runner to import data from MongoDB into Hadoop MapReduce jobs. Due to the size of the data I am getting an OutOfMemoryError, so I want to limit the number of records I fetch, in a batched fashion.
MongoConfigUtil.setQuery()
can only set the query, but I cannot set a size to limit the number of records fetched. What I am looking for is something like
MongoConfigUtil.setBatchSize()
and then
MongoConfigUtil.getNextBatch()
something like that.
Kindly suggest.
You can use the setLimit method of the MongoInputSplit class, passing the number of documents that you want to fetch.
myMongoInputSplitObj = new MongoInputSplit(*param*)
myMongoInputSplitObj.setLimit(100)
MongoConfigUtil setLimit
Allow users to set the limit on MongoInputSplits (HADOOP-267).

How to obtain adjacent records

For example, to obtain the post before and the post after a given record, based on its created time field.
I tried to use the following statements to obtain the articles:
# created is the creation time of the current article
# The previous post
prev_post = db.Post.find({'created': {'$lt': created}}, sort=[('created', -1)], limit=1)
# The next post
next_post = db.Post.find({'created': {'$gt': created}}, sort=[('created', 1)], limit=1)
The results turn out to be discontinuous and sometimes skip several records. I don't know why; maybe I misunderstand find()?
Help please.
It may seem like strange behaviour indeed, but MongoDB does not guarantee the order of stored records unless you're querying an array (in which records are kept in insertion order). I believe what MongoDB does is reach the first document that matches your query and return it.
Bottom line: if the logic requires neighbouring records, use arrays.
I think the problem is your created field. If two records have the same time, the system cannot tell which one to choose. You can add _id to the sort; give it a try.
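To illustrate the tie-break suggested in the last answer, here is a small pymongo sketch (current_id, the _id of the current post, is introduced here for illustration; the collection and created field come from the question). Adding _id to both the filter and the sort makes the neighbour lookup deterministic even when several posts share the same created value:

# Previous post: strictly older, or same created but smaller _id.
prev_post = db.Post.find_one(
    {'$or': [{'created': {'$lt': created}},
             {'created': created, '_id': {'$lt': current_id}}]},
    sort=[('created', -1), ('_id', -1)],
)
# Next post: strictly newer, or same created but larger _id.
next_post = db.Post.find_one(
    {'$or': [{'created': {'$gt': created}},
             {'created': created, '_id': {'$gt': current_id}}]},
    sort=[('created', 1), ('_id', 1)],
)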

What is the recommended maximum number of emits in a map function?

I am new to MongoDB and planning to use map-reduce for computing a large amount of data.
As you know, the map function matches the criteria and then emits the required data for a given field. In my map function I have multiple emits; as of now, 50 fields are emitted from a single document. That means a single document in the collection explodes into 40 documents in the temp table, so if I have 1 million documents to process, there will be 1 million * 40 documents in the temp table by the end of the map function.
The next step is to sort this collection. (I haven't used the sort param of map-reduce; will it help?)
I thought of splitting the map function into two, but then there is one more problem: while performing the map function, if I run into an exception I want to skip the entire document's data (i.e. not emit anything from that document), but if I split the function I won't be able to do that.
On mongodb.org I found a comment which said: "When I run MR job, with sort - it takes 1.5 days to reach 23% at first stage of MR. When I run MR job, without sort, it takes about 24-36 hours for all job. Also when turn off jsMode is speed up my MR twice (before i turn off sorting)"
Will enabling sort help? Or will turning off jsMode help? I am using Mongo 2.0.5.
Any suggestions?
Thanks in advance. G
The next step is to sort on this collection. (I haven't used sort param of map will it help?)
I don't know what you mean: MRs don't have sort params, only the incoming query has one, and that sort only orders the data going in. Unless you are looking for some specific behaviour, such as avoiding a sort of the final output by sorting the input, you don't normally need to sort.
How are you looking to use this MR? Obviously it won't be in real time, or you would just kill your servers, so I'm guessing it is a background process that runs and formats data the way you want. I would suggest looking into incremental MRs so that you do delta updates throughout the day to limit the amount of resources used at any given time (a rough sketch follows at the end of this answer).
So if I have 1 million documents to be processed it will 1million * 40 documents in temp table by end of map function.
Are you emitting multiple times? If not, then the temp table should have only one key per row, with documents of the format:
{
    _id: emitted_id,
    value: [ /* each of the docs that you emit */ ]
}
This is shown: http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/
or Will turning OFF jsmode help? i am using mongo 2.0.5
Turning off jsmode is unlikely to do anything significant and results from it have varied.
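To make the incremental-MR suggestion above concrete, here is a rough pymongo sketch (the events collection, groupKey field and created bookmark are assumptions, not the asker's schema). Each run maps only the documents created since the previous run and re-reduces them into a permanent output collection, and the map function emits once per document so the intermediate data stays at one value per key:

from datetime import datetime, timezone
from bson.code import Code
from pymongo import MongoClient

db = MongoClient().mydb

# One emit per document keeps the temp data small
# (one key/value pair, not 40-50 emits per document).
map_js = Code("function () { emit(this.groupKey, {count: 1}); }")
reduce_js = Code("""
function (key, values) {
    var total = 0;
    values.forEach(function (v) { total += v.count; });
    return {count: total};
}
""")

def incremental_mr(last_run):
    """Fold only the documents created since the previous run into the results."""
    db.command(
        'mapReduce', 'events',
        map=map_js,
        reduce=reduce_js,
        query={'created': {'$gt': last_run}},   # delta only
        out={'reduce': 'events_summary'},       # re-reduce into the existing results
    )
    return datetime.now(timezone.utc)           # bookmark for the next run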

Make a join like SQL Server in MongoDB

For example, we have two collections
users {userId, firstName, lastName}
votes {userId, voteDate}
I need a report of the names of all users who have more than 20 votes a day.
How can I write a query to get this data from MongoDB?
The easiest way to do this is to cache the number of votes for each user in the user documents. Then you can get the answer with a single query.
If you don't want to do that, then map-reduce the results into a results collection and query that collection. You can then run incremental map-reduces that only calculate new votes to keep your results up to date: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-IncrementalMapreduce
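A small pymongo sketch of the cached-counter idea above (the per-day votes subdocument and field names are assumptions): increment a per-user, per-day counter whenever a vote is recorded, and the report becomes a single indexed query:

from datetime import datetime, timezone
from pymongo import MongoClient

db = MongoClient().mydb

def record_vote(user_id):
    # One counter per user per day, e.g. votes.2012-06-01
    day = datetime.now(timezone.utc).strftime('%Y-%m-%d')
    db.users.update_one({'userId': user_id}, {'$inc': {'votes.' + day: 1}})

def heavy_voters(day):
    # Single query: users with more than 20 votes on the given day.
    return list(db.users.find({'votes.' + day: {'$gt': 20}},
                              {'firstName': 1, 'lastName': 1}))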
You shouldn't really be trying to do joins with Mongo. If you are you've designed your schema in a relational manner.
In this instance I would store the vote as an embedded document on the user.
In some scenarios using embedded documents isn't feasible, and in that situation I would do two database queries and join the results at the client rather than using MapReduce.
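And a sketch of that two-query, client-side join in pymongo (db is a pymongo database handle; the voteDate range arguments are assumptions): count the day's votes per userId in the client, then fetch the matching user documents with a single $in query:

from collections import Counter

def daily_report(db, day_start, day_end):
    # Query 1: pull the day's votes and count them per user on the client.
    counts = Counter(v['userId'] for v in
                     db.votes.find({'voteDate': {'$gte': day_start, '$lt': day_end}},
                                   {'userId': 1}))
    heavy = [uid for uid, n in counts.items() if n > 20]
    # Query 2: fetch the names for those users and join in memory.
    return list(db.users.find({'userId': {'$in': heavy}},
                              {'firstName': 1, 'lastName': 1}))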
I can't provide a fuller answer right now, but you should be able to achieve this using MapReduce. The map step would emit the userIds of the users who have more than 20 votes, and the reduce step would return the firstName and lastName, I think... have a look here.