I am using the MongoTool runner to import data from MongoDB into Hadoop MapReduce jobs. Due to the size of the data I am getting an OutOfMemoryError, so I want to limit the number of records fetched, in a batch fashion.
MongoConfigUtil.setQuery()
can set only the query; it cannot limit the number of records fetched. What I am looking for is something like
MongoConfigUtil.setBatchSize()
and then
MongoConfigUtil.getNextBatch()
something like that.
Kindly suggest.
You can use the setLimit method of the MongoInputSplit class, passing the number of documents that you want to fetch:
myMongoInputSplitObj = new MongoInputSplit(*param*)
myMongoInputSplitObj.setLimit(100)
See also MongoConfigUtil setLimit: "Allow users to set the limit on MongoInputSplits" (HADOOP-267).
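Independent of the connector API, the batch-style fetching the question asks for boils down to a skip/limit loop. A minimal in-memory sketch (plain Python; `fetch_batch` is a hypothetical stand-in for a query with skip and limit applied server-side):

```python
def fetch_batch(records, skip, limit):
    """Hypothetical stand-in for a query with skip/limit applied server-side."""
    return records[skip:skip + limit]

def process_in_batches(records, batch_size, handle):
    """Fetch and process records batch by batch, so only one batch
    is ever held in memory at a time."""
    skip = 0
    while True:
        batch = fetch_batch(records, skip, batch_size)
        if not batch:
            break
        handle(batch)
        skip += batch_size

# usage: process ten records in batches of three
seen = []
process_in_batches(list(range(10)), 3, seen.extend)
```

Each iteration holds only one batch, which is what keeps the memory footprint bounded regardless of the total collection size.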
Related
I am trying to delete documents with _id less than 10 from a collection, but I want to remove them in sets of 3, so I tried using limit. However, remove still deletes all the matching documents and ignores the limit.
Query query = new Query();
query.addCriteria(Criteria.where("_id").lt(id)).limit(3);
mongoTemplate.remove(query, TestCollection.class);
When I perform mongoTemplate.find(query, TestCollection.class); the limit works fine and returns 3 elements at a time, but with remove it doesn't work.
Is there any other way to delete them in a single query?
To achieve this, do it in two passes:
Find the 3 ids to delete, as you are doing currently.
Do a collection.remove with Criteria.where("_id").in(id1, id2, id3).
I would also add a sort criterion before applying the limit; otherwise the result of the deletion might depend on the index used.
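A minimal sketch of that two-pass loop, using an in-memory list as a stand-in for the collection (plain Python; in the real code, pass 1 is the limited, sorted find and pass 2 is a remove with an `in` criteria):

```python
def find_ids(docs, max_id, limit):
    """Pass 1: up to `limit` ids with _id < max_id (sorted, so the
    batch does not depend on whatever index the query would use)."""
    matching = sorted(d["_id"] for d in docs if d["_id"] < max_id)
    return matching[:limit]

def remove_by_ids(docs, ids):
    """Pass 2: remove exactly the documents whose _id is in `ids`."""
    ids = set(ids)
    return [d for d in docs if d["_id"] not in ids]

# delete documents with _id < 10 in sets of 3
docs = [{"_id": i} for i in range(12)]
while True:
    batch = find_ids(docs, 10, 3)
    if not batch:
        break
    docs = remove_by_ids(docs, batch)
```

Because pass 2 deletes only the explicit id list from pass 1, each round removes at most 3 documents, which is the behaviour the limit on remove was supposed to give.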
I have a document structure which looks something like this:
{
...
"groupedFieldKey": "groupedFieldVal",
"otherFieldKey": "otherFieldVal",
"filterFieldKey": "filterFieldVal"
...
}
I am trying to fetch all documents which are unique with respect to groupedFieldKey. I also want to fetch otherFieldKey from ANY of these documents. otherFieldKey has minor changes from one document to another, but I am comfortable with getting ANY one of these values.
SELECT DISTINCT groupedFieldKey, otherField
FROM bucket
WHERE filterFieldKey = "filterFieldVal";
This query fetches all the documents because of the minor variations.
SELECT groupedFieldKey, maxOtherFieldKey
FROM bucket
WHERE filterFieldKey = "filterFieldVal"
GROUP BY groupedFieldKey
LETTING maxOtherFieldKey = MAX(otherFieldKey);
This query works as expected, but takes a long time because of the GROUP BY step. As this query is used to show products in the UI, that is not acceptable. I have tried applying indexes, but they have not made it fast enough.
Actual details of the records:
Number of records = 100,000
Size per record = Approx 10 KB
Time taken to load the first 10 records: 3s
Is there a better way to do this? A way of getting DISTINCT only on particular fields will be good.
EDIT 1:
You can follow this discussion thread in Couchbase forum: https://forums.couchbase.com/t/getting-distinct-on-the-basis-of-a-field-with-other-fields/26458
GROUP BY must materialize all the qualifying documents. You can try a covering index:
CREATE INDEX ix1 ON bucket(filterFieldKey, groupedFieldKey, otherFieldKey);
I've got a special kind of query that I need to perform as efficiently as possible.
Problem:
I have a collection of messages; each message has a groupID field, and I need to get the last message of each groupID. I don't want to perform one query per group; instead I want to pass an array of groupIDs and get back an array of messages.
Solutions so far:
I managed to come up with two solutions:
1. Perform a query for each groupID, which works fine and takes about 200 ms to complete, but performs many requests against MongoDB.
2. Use an aggregate to group messages by GroupID and then select the first message of each group, which is really slow and takes about 4000 ms.
Code
var filter = Builders<Message>.Filter.Eq("GroupID", groupID);
var sort = Builders<Message>.Sort.Descending("Date");
return await MessagesCollection.Find(filter).Sort(sort).Limit(1).FirstAsync();
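For reference, solution 2's aggregate corresponds to matching on the groupIDs, sorting by Date descending, and grouping by GroupID while taking the first (newest) document of each group. A minimal in-memory sketch of that logic (plain Python; field names taken from the snippet above):

```python
def last_message_per_group(messages, group_ids):
    """Equivalent of: match GroupID in group_ids, sort by Date
    descending, group by GroupID taking the first (newest) message."""
    matched = [m for m in messages if m["GroupID"] in group_ids]
    matched.sort(key=lambda m: m["Date"], reverse=True)
    latest = {}
    for m in matched:
        latest.setdefault(m["GroupID"], m)  # first seen = newest
    return latest

msgs = [
    {"GroupID": "a", "Date": 1, "Text": "old"},
    {"GroupID": "a", "Date": 5, "Text": "new"},
    {"GroupID": "b", "Date": 2, "Text": "only"},
]
result = last_message_per_group(msgs, {"a", "b"})
```

Whether the server-side version of this is fast depends heavily on having an index that supports the match and the sort; without one, the group stage has to scan and sort everything.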
What I'm looking for is a way to batch the queries, or a single query that returns the first message for each of the given groupIDs. Any ideas?
Any help will be appreciated.
Thanks in advance.
I am new to MongoDB and planning to use map-reduce to compute a large amount of data.
As you know, we have a map function to match the criteria and then emit the required data for a given field. In my map function I have multiple emits; as of now 50 fields are emitted from a single document. That means a single document in the collection explodes into 40 documents in the temp table, so if I have 1 million documents to be processed there will be 1 million * 40 documents in the temp table by the end of the map function.
The next step is to sort on this collection. (I haven't used the sort param of map; will it help?)
I thought of splitting the map function into two, but then there is one more problem: while performing the map function, if I happen to run into an exception I want to skip the entire document (i.e. not emit any data from that document), and if I split the function I won't be able to.
On mongoDB.org I found a comment which said: "When I run the MR job with sort, it takes 1.5 days to reach 23% at the first stage of the MR. When I run the MR job without sort, the whole job takes about 24-36 hours. Also, turning off jsMode sped up my MR twice (before I turned off sorting)."
Will enabling sort help? Or will turning off jsMode help? I am using Mongo 2.0.5.
Any suggestions?
Thanks in advance. G
The next step is to sort on this collection. (I haven't used the sort param of map; will it help?)
I don't know what you mean; MRs don't have sort params, only the incoming query has a sort param, and that only sorts the data going in. Unless you are looking for some specific behaviour that avoids sorting the final output by sorting the input, you don't normally need to sort.
How are you looking to use this MR? Obviously it won't be in real time, else you would just kill your servers, so I'm guessing it is a background process that runs and formats data the way you want. I would suggest looking into incremental MRs, so that you do delta updates throughout the day to limit the amount of resources used at any given time.
So if I have 1 million documents to be processed, there will be 1 million * 40 documents in the temp table by the end of the map function.
Are you emitting multiple times? If not, then the temp table should have only one key per row, with documents of the format:
{
    _id: emitted_id,
    value: [ { /* each of the docs that you emit */ } ]
}
This is shown: http://kylebanker.com/blog/2009/12/mongodb-map-reduce-basics/
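The one-key-per-row shape described above can be sketched in plain Python (no MongoDB required; `map_fn` and the field-per-emit scheme are illustrative assumptions, not the asker's actual map function):

```python
from collections import defaultdict

def run_map_phase(docs, map_fn):
    """Group everything emitted by map_fn under its key, the way the
    MR temp collection holds one row per emitted key with all the
    values emitted under it."""
    grouped = defaultdict(list)
    for doc in docs:
        for key, value in map_fn(doc):
            grouped[key].append(value)
    return dict(grouped)

# a map function that emits one (key, value) pair per field of the document
def map_fn(doc):
    return [(field, doc[field]) for field in doc]

docs = [{"x": 1, "y": 2}, {"x": 3}]
out = run_map_phase(docs, map_fn)
```

This is why emitting many pairs per document multiplies the temp-table size: each emit lands as a value under some key, even though the number of distinct keys (rows) can stay small.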
Or will turning off jsMode help? I am using Mongo 2.0.5.
Turning off jsMode is unlikely to do anything significant, and results from it have varied.
Let's say I put a limit and a skip on a MongoDB query; I want to know the total number of results as if the limit were not there.
Of course, I could do this the wasteful way, which is to query twice.
In MongoDB, the default behavior of count() is to ignore skip and limit and count the number of results in the entire original query, so running count will give you exactly what you want.
Passing a Boolean true to count, or calling size instead, gives you a count WITH skip and limit applied.
There is no way to get the count without executing the query twice.
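The distinction can be sketched with an in-memory stand-in (plain Python; the slicing plays the role of skip/limit on the cursor):

```python
def paged_query(results, skip, limit):
    """Return (total_without_limit, page_with_skip_and_limit).
    total mirrors count(), which ignores skip/limit;
    the page mirrors size() / count(true), which honour them."""
    total = len(results)
    page = results[skip:skip + limit]
    return total, page

results = list(range(25))
total, page = paged_query(results, skip=20, limit=10)
# total is 25; the last page holds only the 5 remaining items
```

So one query object can yield both numbers: the full match count for pagination controls, and the page slice actually displayed.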