MongoDB query optimization for large documents - dataset - query stuck - mongodb

I would like an opinion from MongoDB experts on which MongoDB query optimizations or feature options I can apply to make a read query faster when a large collection of documents (a dataset) has to be returned in one go (a single find query).
Maybe someone in the MongoDB community has faced a similar issue and has better thoughts on how to resolve it.
I am using MongoDB 2.6.
The requirement is to fetch all the records in one query, since these documents are written into an Excel sheet for users to download.
I have a large collection of users, somewhere around 2,000,000 documents in the user collection.
User Collection fields:
{
_id: ObjectId("54866f133345a61b6b9a701"),
appId: "20fabc6",
firstName: "FN",
lastName: "LN",
email: "ln#1.com",
userType: "I",
active: "true",
lastUpdatedBy: "TN",
lastUpdatedDate: ISODate("2013-01-24T05:00:00Z"),
createdTime: ISODate("2011-01-24T05:00:00Z"),
}
I have a find query which is currently trying to fetch 900,000 documents from the User collection. It looks like the query gets stuck while fetching that many documents.
Below is the find query:
db.test.User.find({"appId": "20fabc6"}).sort({"appId" : 1});
Query Function:
// Spring Data MongoDB: find all users whose appId is in the given array
List<Users> findByAppId(Object[] appIds) {
    Query query = new Query();
    query.addCriteria(Criteria.where("appId").in(appIds));
    return mongoTemplate.find(query, Users.class);
}
I have placed an index on the appId field, but the query is still taking too long.
I ran a count for the query and I can see 900,000 records matching the appId:
db.test.User.find({"appId": "20fabc6"}).count();
900000
Below are some of the options I can think of which could reduce the number of documents:
1) Adding more fields to filter the records - which still leaves a large number:
db.test.User.find({"appId": "20fabc6", "active": "true"}).count();
700000
2) Limiting the number of records using the MongoDB limit operation for range queries - which would conflict with our requirement to download all the user data into the Excel sheet in one go.
Would aggregation with a cursor help, or would sharding the cluster help, if we have to execute the above find query and fetch that many documents (900,000) in one go?
I would appreciate any help or pointers to resolve this.
Thanks.

Your sort() is unnecessary: you are only finding documents with an appId of 20fabc6, so there is no point sorting on that same appId when it will be identical for every record returned.
Create an index on the appId field:
db.test.User.ensureIndex({"appId":1})
With that index in place, your query should only scan the 900,000 matching documents. You can double-check this by looking at the performance metadata from the .explain() method on your find:
db.test.User.find({"appId": "20fabc6"}).explain()
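If the export still stalls once the index is in place, it usually helps to stream the result set in batches instead of materializing all 900,000 documents in memory at once. Here is a minimal sketch of that idea using PyMongo (the question uses Spring Data, so the client setup, the test.User namespace, the batch size, and the CSV output are only illustrative assumptions):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
coll = client["test"]["User"]

# Project only the fields needed for the export so the server returns
# smaller documents, and pull them from the cursor in batches of 1000.
cursor = coll.find(
    {"appId": "20fabc6"},
    {"_id": 0, "firstName": 1, "lastName": 1, "email": 1, "active": 1},
).batch_size(1000)

with open("users.csv", "w") as out:
    for doc in cursor:
        # One row per user; a real export would use a proper CSV/Excel writer.
        out.write(",".join(str(doc.get(f, "")) for f in
                           ("firstName", "lastName", "email", "active")) + "\n")

The same approach works from Java: iterate the cursor and write rows as you go rather than building the whole List<Users> in memory before generating the Excel file.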

Related

MongoDB get list of documents which don't exist in the collection from my query

I have a collection with a large number of documents (~100 million). Given a list of ids, I want to find which of them do not have a document in the collection. Example:
query_user_ids = ["32432", "32433", "32434", "32435"]
document = {"_id": "xxxx", "user_id": "32433", "details": "xxxx"}
user_id has a unique index on it.
I want to know which user_ids are not present in the collection. So, assuming user_ids 32434 and 32435 do not exist, when I query the collection with this list of ids I should get the response ["32434", "32435"].
Right now I am just looping over the user_ids and calling find_one to check whether a document with that user_id exists, but I suspect this is slowing down the operation. Is there a way I can do it by passing in the list of ids directly?
I am using PyMongo for querying.
Query
You can do a single find for many ids, as below:
get all the documents that do exist,
and then in Python, for example with a loop over the list, keep the ids from the list that are not in the results.
This way you send only one query, and the Python part costs almost nothing even if the list is fairly big.
I am not sure this is the optimal way, but I think it should be faster than sending many queries; try it, and if you can, give some feedback.
Playmongo
find({"user_id": {"$in": ["32432", "32433", "32434", "32435"]}},
{"projection": {"_id": 0, "user_id": 1}})

large field in mongodb document is slowing down aggregate query

I have a collection named "questions" with around 15 fields. There is an indexed field called "user". Another field is "response_api" which is a subdocument of around 60KB. And there are around 40000 documents in this collection.
When I run an aggregate query with only a $match stage on the user field, it is very slow and takes around 11 seconds to complete.
The query is:
db.questions.aggregate([{$match: {user: ObjectId("5c9a19abc89b2d09740ccd1d")} }])
But when I run the same query with a $project stage added, it returns pretty fast, in less than 10 ms. The query is:
db.questions.aggregate([{$match: {user: ObjectId("5c9a19abc89b2d09740ccd1d")} }, {$project: {_id: 1, user: 1, subject: 1}}])
Note: This particular user has 5000 documents in the questions collection.
When I copied this collection without the "response_api" field and created an index on the user field, both of these queries on the copied collection were pretty fast and took less than 10 ms.
Can somebody explain what's going on here? Is it because of that large field?
According to this documentation provided by MongoDB, you should keep a check on the size of your indexes. Basically, if the index size is more than what your RAM can accommodate, then MongoDB will be reading the indexes from disk and your queries will be a lot slower.
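As a quick way to check that, you can compare the collection's index sizes (and average document size) against the available RAM. A minimal PyMongo sketch, with the database name assumed and the collection name taken from the question:

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
db = client["mydb"]                                 # assumed database name

# collStats reports per-index sizes in bytes, plus the average document
# size, which hints at how much data an unprojected query must return.
stats = db.command("collStats", "questions")
print("average document size:", stats["avgObjSize"], "bytes")
for name, size in stats["indexSizes"].items():
    print("index", name, ":", size, "bytes")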

mongodb mapreduce groupby twice

I am new to MongoDB and am trying to count how many distinct users log in per day from an existing collection. The data in the collection looks like the following:
[{
_id: xxxxxx,
properties: {
uuid: '4b5b5c2e208811e3b5a722000a97015e',
time: ISODate("2014-12-13T00:00:00Z"),
type: 'login'
}
}]
With my limited knowledge, what I have figured out so far is to group by day first, output the data to a tmp collection, then use this tmp collection for another map-reduce and output the result to a final collection. This solution makes my collections bigger, which I do not really like. Can anyone help me out, or point me to good/more advanced tutorials I can follow? Thanks.
Rather than a map-reduce, I would suggest an aggregation. You can think of an aggregation as somewhat like a Linux pipe, in that you can pass the results of one operation to the next. With this strategy, you can perform two consecutive $group stages and never have to write anything back to the database.
Take a look at this question for more details on the specifics.
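As a rough sketch of what those two consecutive $group stages could look like in PyMongo (the database and collection names are assumptions; the field paths come from the example document above):

from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # assumed connection string
coll = client["mydb"]["events"]                     # assumed db/collection names

pipeline = [
    {"$match": {"properties.type": "login"}},
    # First group: one document per (day, uuid) pair, i.e. each distinct
    # user is counted at most once per day.
    {"$group": {"_id": {
        "year": {"$year": "$properties.time"},
        "month": {"$month": "$properties.time"},
        "day": {"$dayOfMonth": "$properties.time"},
        "uuid": "$properties.uuid",
    }}},
    # Second group: count the distinct uuids for each day.
    {"$group": {
        "_id": {"year": "$_id.year", "month": "$_id.month", "day": "$_id.day"},
        "distinctLogins": {"$sum": 1},
    }},
    {"$sort": {"_id": 1}},
]

for row in coll.aggregate(pipeline):
    print(row["_id"], row["distinctLogins"])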

Query MongoDB from a Redis list

If, for example, I keep lists of user posts in Redis (say a user has 1000 posts), and the post documents are stored in MongoDB but the link between the user and the posts is stored in Redis, I can retrieve the array containing all the ids of a user's posts from Redis, but what is the efficient way of retrieving them from MongoDB?
Do I pass the array of ids as a parameter to MongoDB, and Mongo will fetch those for me?
I can't seem to find any documentation on this, so if anyone is willing to help me out,
thanks in advance!
To retrieve a set of documents by id, you can use the $in operator to build the MongoDB query. See the following section of the documentation:
http://docs.mongodb.org/manual/reference/operator/query/in/#op._S_in
For instance you can build a query such as:
db.mycollection.find( { _id : { $in: [ id1, id2, id3, .... ] } } )
Depending on how many ids Redis returns, you may want to group them into batches of n items (n = 100, for instance) and run several MongoDB queries. IMO, it is bad practice to build a single query containing more than a few thousand ids; it is better to use smaller queries and accept paying for the extra round trips.
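A minimal sketch of that batching idea in Python (the Redis key, database, and collection names are made up for illustration, and the post ids are assumed to be stored as plain strings):

import redis
from pymongo import MongoClient

r = redis.Redis()                                   # assumed local Redis
coll = MongoClient()["mydb"]["posts"]               # assumed db/collection names

# The ids of one user's posts, kept in a Redis list (key name is an assumption).
post_ids = [pid.decode() for pid in r.lrange("user:42:posts", 0, -1)]

BATCH = 100
posts = []
for i in range(0, len(post_ids), BATCH):
    batch = post_ids[i:i + BATCH]
    # One $in query per batch of 100 ids instead of one huge query
    # or one round trip per id. Convert to ObjectId here if needed.
    posts.extend(coll.find({"_id": {"$in": batch}}))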

make a join like SQL server in MongoDB

For example, we have two collections
users {userId, firstName, lastName}
votes {userId, voteDate}
I need a report of the names of all users who have more than 20 votes in a day.
How can I write a query to get this data from MongoDB?
The easiest way to do this is to cache the number of votes for each user in the user documents. Then you can get the answer with a single query.
If you don't want to do that, then map-reduce the results into a results collection and query that collection. You can then run incremental map-reduces that only calculate new votes to keep your results up to date: http://www.mongodb.org/display/DOCS/MapReduce#MapReduce-IncrementalMapreduce
You shouldn't really be trying to do joins with Mongo. If you are, you've designed your schema in a relational manner.
In this instance I would store the vote as an embedded document on the user.
In some scenarios using embedded documents isn't feasible, and in that situation I would do two database queries and join the results at the client rather than using MapReduce.
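A rough sketch of that two-query, client-side join in Python (the collection and field names follow the question; reading "more than 20 votes a day" as a per-user, per-day count is an assumption):

from pymongo import MongoClient

db = MongoClient()["mydb"]                          # assumed database name

# Query 1: count votes per user per day and keep users with more than 20.
heavy_voters = db.votes.aggregate([
    {"$group": {
        "_id": {
            "userId": "$userId",
            "year": {"$year": "$voteDate"},
            "month": {"$month": "$voteDate"},
            "day": {"$dayOfMonth": "$voteDate"},
        },
        "count": {"$sum": 1},
    }},
    {"$match": {"count": {"$gt": 20}}},
])
user_ids = {doc["_id"]["userId"] for doc in heavy_voters}

# Query 2: fetch the matching user names and "join" them in the client.
for user in db.users.find({"userId": {"$in": list(user_ids)}},
                          {"firstName": 1, "lastName": 1}):
    print(user["firstName"], user["lastName"])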
I can't provide a fuller answer now, but you should be able to achieve this using MapReduce. The Map step would return the userIds of the users who have more than 20 votes, the reduce step would return the firstName and lastName, I think...have a look here.