I am trying to retrieve 100,000 documents from MongoDB as shown below, and it is taking a very long time to return the collection.
var query = Query.EQ("Status", "E");
var items = collection.Find(query).SetLimit(100000).ToList();
Or
var query = Query.GT("_id", idValue);
var items = collection.Find(query).SetLimit(100000).ToList();
Explain:
{
    "cursor" : "BtreeCursor _id_",
    "nscanned" : 1,
    "nscannedObjects" : 1,
    "n" : 1,
    "millis" : 0,
    "nYields" : 0,
    "nChunkSkips" : 0,
    "isMultiKey" : false,
    "indexOnly" : false,
    "indexBounds" : {
        "_id" : [[ObjectId("4f79a64eca98b5fc0e5ae35a"),
                  ObjectId("4f79a64eca98b5fc0e5ae35a")]]
    }
}
Any suggestions to improve query performance? My collection has 2 million documents.
-Venkat
This question was also asked on Google Groups:
https://groups.google.com/forum/?fromgroups#!topicsearchin/mongodb-user/100000/mongodb-user/a6FHFp5aOnA
As I responded on the Google Groups question, I tried to reproduce this and was unable to observe any slowness. I was able to read 100,000 documents in 2-3 seconds, depending on whether the documents were near the beginning or near the end of the collection (because I didn't create an index).
My answer to the Google Groups question has more details and a link to the test program I used to try to reproduce this.
Given the information you have provided, my best guess is that your documents are too large, and that the delay is not necessarily on the mongo server but in the transmission of the result set back to your app machine. Take a look at the average document size in the collection. Do you have large embedded arrays, for example?
Compare the response time when selecting only one field using the .SetFields method (see the example in "How to retrieve a subset of fields using the C# MongoDB driver?"). If the response time is significantly faster, then you know that this is the issue.
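To get a feel for how much data the result set represents, here is a rough sketch in plain JavaScript. The two average document sizes are hypothetical; the real figure can be read from the avgObjSize field of db.collection.stats() in the mongo shell.

```javascript
// Rough size of a result set: document count times average document size.
// Both average sizes used below are hypothetical, for illustration only.
function resultSetMegabytes(docCount, avgDocBytes) {
  return (docCount * avgDocBytes) / (1024 * 1024);
}

const smallDocsMb = resultSetMegabytes(100000, 2 * 1024);   // assumed 2 KB documents
const largeDocsMb = resultSetMegabytes(100000, 50 * 1024);  // assumed 50 KB documents

console.log(smallDocsMb.toFixed(1) + " MB vs " + largeDocsMb.toFixed(1) + " MB");
```

If the second case is closer to yours, transfer and deserialization time on the client is a more likely culprit than the query itself.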
Have you defined indices?
http://www.mongodb.org/display/DOCS/Indexes
There are several things to check:
Is your query correctly indexed?
If your query is indexed, what are the odds that the data itself is in memory? If you have 20GB of data and 4GB of RAM, then most of your data is not in memory which means that your disks are doing a lot of work.
How much data do 100k documents represent? If your documents are really big, they could be using up all of the available disk IO or possibly the network bandwidth. Do you have enough RAM on the client to hold them?
You can check disk usage using iostat (a common Linux tool) or perfmon (on Windows). If you run these while your query is running, you should get some idea about what's happening with your disks.
Otherwise, you will have to do some reasoning about how much data is moving around here. In general, queries that return 100k objects are not intended to be really fast (not in MongoDB or in SQL). That's more data than humans typically consume in one screen, so you may want to make smaller batches and read 10k objects 10 times instead of 100k objects once.
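The batching idea above can be sketched as follows. This is an in-memory stand-in, not driver code: the docs array and the fetchBatch helper are hypothetical and only imitate a query of the form find({_id: {$gt: lastId}}) sorted by _id with a limit.

```javascript
// In-memory stand-in for a collection sorted by _id (hypothetical data).
const docs = Array.from({ length: 100 }, (_, i) => ({ _id: i + 1 }));

// Imitates find({_id: {$gt: lastId}}).sort({_id: 1}).limit(batchSize).
function fetchBatch(lastId, batchSize) {
  return docs.filter(d => d._id > lastId).slice(0, batchSize);
}

// Read everything in batches, resuming after the last _id seen each time.
function readInBatches(batchSize) {
  const batches = [];
  let lastId = 0; // smaller than every _id in the sample data
  for (;;) {
    const batch = fetchBatch(lastId, batchSize);
    if (batch.length === 0) break;
    batches.push(batch);
    lastId = batch[batch.length - 1]._id;
  }
  return batches;
}

const batches = readInBatches(10);
console.log(batches.length + " batches, first has " + batches[0].length + " docs");
```

Resuming from the last _id seen (rather than using skip) keeps each batch an index range scan on _id.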
If you don't create indexes for your collection, MongoDB will do a full collection scan, which is the slowest possible method.
You can run explain() for your query. Explain will tell you which indexes (if any) are used for the query, the number of scanned documents and the total query duration.
If your query uses an index and its execution is still slow, then you probably have a problem with the size of the collection relative to RAM.
MongoDB is fastest when the collection data plus indexes fit in memory. If your collection is larger than the available RAM, the performance drop is very large.
You can check the size of your collection with totalSize(), totalIndexSize() or validate() (these are shell commands).
Related
I am considering MongoDB to hold data of our campaign logs,
{
    "domain" : "",
    "log_time" : "",
    "email" : "",
    "event_type" : "",
    "data" : {
        "campaign_id" : "",
        "campaign_name" : "",
        "message" : "",
        "subscriber_id" : ""
    }
}
The above is our event structure. Each event is associated with one domain.
One domain can contain any number of events, and there is no relation between one domain and another.
Most of our queries are specific to one domain at a time.
For quick query responses, I'm planning to create one collection per domain, so that I can query a particular domain's collection instead of querying the whole data set, which contains all domains' data.
We will have at least 100k+ domains in the future, so I would need to create 100k+ collections.
We are expecting 1 million+ documents per collection.
Our main intention is to index only the required collections; we don't want to index the whole data set. That is why we are planning to have one collection per domain.
Which approach is better for my case?
1. Storing all domains' events in one collection
(or)
2. Storing each domain's events in a separate collection
I have seen some questions on the maximum number of collections that MongoDB can support, but I didn't get clarity on this topic. As far as I know, we can extend the default limit of 24k, but if I create 100k+ collections, will performance be affected?
Is this solution (using a very large number of collections) the right approach for my case?
Please advise on my approach. Thanks in advance.
Without some hard numbers, this question would probably be just opinion-based.
However, if you do some calculations with the numbers you provided, you will get to a solution.
So your total document count is:
100 K collections x 1M documents = 100 G (100.000.000.000) documents.
From your document structure, I'm going to do a rough estimate and say that the average size for each document will be 240 bytes (it may be even higher).
Multiplying those two numbers, you get roughly 21.8 TB of data. You can't store this amount of data on just one server, so you will have to split your data across multiple servers.
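The estimate can be reproduced in a few lines of plain JavaScript. The 240-byte average is the rough guess from this answer, not a measured value, and the total is expressed in binary units (1 TB taken as 1024^4 bytes), which gives roughly 21.8.

```javascript
const collectionCount = 100000;      // 100K domains, one collection each
const docsPerCollection = 1000000;   // 1M documents per collection
const avgDocBytes = 240;             // rough guess, not a measured value

const totalDocs = collectionCount * docsPerCollection;
const totalBytes = totalDocs * avgDocBytes;
const totalTebibytes = totalBytes / Math.pow(1024, 4);

console.log(totalDocs + " documents, about " + totalTebibytes.toFixed(1) + " TB");
```

Changing avgDocBytes to a measured average is the quickest way to see how sensitive the total is to document size.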
With this amount of data, your problem isn't anymore one collection vs multiple collections, but rather, how do I store all of this data in MongoDB on multiple servers, so I can efficiently do my queries.
If you have 100K collections, you can probably do some manual work and store e.g. 10 K collections per MongoDB server. But there's a better way.
You can use sharding and let MongoDB do the hard work of splitting your data across servers. With sharding, you will have one collection for all domains and then shard that collection across multiple servers.
I would strongly recommend that you read all of the documentation on sharding before trying to deploy a system of this size.
We've recently passed 2 million records in one of our main collections, and now we have started to suffer from major performance issues on that collection.
The documents in the collection have about 8 fields that you can filter by in the UI, and the results are supposed to be sorted by a timestamp field recording when the record was processed.
I've added several compound indexes with the filtered fields and the timestamp
e.g:
db.events.ensureIndex({somefield: 1, timestamp:-1})
I've also added a couple of indexes for using several filters at once, hoping to achieve better performance. But some filters still take an awfully long time to perform.
I've made sure, using explain, that the queries do use the indexes I've created, but performance is still not good enough.
I was wondering if sharding is the way to go now, but we will soon start to have about 1 million new records per day in that collection, so I'm not sure it will scale well.
EDIT: example for a query:
> db.audit.find({'userAgent.deviceType': 'MOBILE', 'user.userName': {$in: ['nickey#acme.com']}}).sort({timestamp: -1}).limit(25).explain()
{
"cursor" : "BtreeCursor user.userName_1_timestamp_-1",
"isMultiKey" : false,
"n" : 0,
"nscannedObjects" : 30060,
"nscanned" : 30060,
"nscannedObjectsAllPlans" : 120241,
"nscannedAllPlans" : 120241,
"scanAndOrder" : false,
"indexOnly" : false,
"nYields" : 1,
"nChunkSkips" : 0,
"millis" : 26495,
"indexBounds" : {
"user.userName" : [
[
"nickey#acme.com",
"nickey#acme.com"
]
],
"timestamp" : [
[
{
"$maxElement" : 1
},
{
"$minElement" : 1
}
]
]
},
"server" : "yarin:27017"
}
Please note that deviceType has only 2 values in my collection.
This is like searching for a needle in a haystack. We'd need some output of explain() for those queries that don't perform well. Unfortunately, even that would fix the problem only for that particular query, so here's a strategy on how to approach this:
Ensure it's not because of insufficient RAM and excessive paging
Enable the DB profiler (using db.setProfilingLevel(1, timeout), where timeout is a threshold in milliseconds; any query or command slower than that will be logged)
Inspect the slow queries in db.system.profile and run the queries manually using explain()
Try to identify the slow operations in the explain() output, such as scanAndOrder or large nscanned, etc.
Reason about the selectivity of the query and whether it's possible to improve the query using an index at all. If not, consider disallowing the filter setting for the end user, or show a warning dialog that the operation might be slow.
A key problem is that you're apparently allowing your users to combine filters at will. Without index intersection, that will blow up the number of required indexes dramatically.
Also, blindly throwing an index at every possible query is a very bad strategy. It's important to structure the queries and make sure the indexed fields have sufficient selectivity.
Let's say you have a query for all users with status "active" and some other criteria. But of the 5 million users, 3 million are active and 2 million aren't, so across 5 million entries there are only two different values. Such an index doesn't usually help. It's better to search for the other criteria first, then scan the results. On average, when returning 100 documents, you'll have to scan about 167 documents (100 / 0.6, since 60% of users are active), which won't hurt performance too badly. But it's not that simple. If the primary criterion is the joined_at date of the user and the likelihood of users discontinuing use over time is high, you might end up having to scan thousands of documents before finding a hundred matches.
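The 167 figure comes from a simple expectation, sketched below in plain JavaScript. The expectedScanned helper is a hypothetical name, and it assumes matches are spread evenly through the scanned documents, which is exactly the assumption the joined_at example breaks.

```javascript
// Expected number of documents scanned to collect `wanted` matches when a
// fraction `matchFraction` of the scanned documents pass the remaining filter.
// Assumes matches are spread evenly; correlated data can be far worse.
function expectedScanned(wanted, matchFraction) {
  return Math.ceil(wanted / matchFraction);
}

const activeFraction = 3000000 / 5000000; // 3 of the 5 million users are active
const scannedForHundred = expectedScanned(100, activeFraction);

console.log(scannedForHundred + " documents scanned for 100 matches");
```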
So the optimization depends very much on the data (not only its structure, but also the data itself), its internal correlations and your query patterns.
Things get worse when the data is too big for the RAM, because then, having an index is great, but scanning (or even simply returning) the results might require fetching a lot of data from disk randomly which takes a lot of time.
The best way to control this is to limit the number of different query types, disallow queries on low selectivity information and try to prevent random access to old data.
If all else fails and if you really need that much flexibility in filters, it might be worthwhile to consider a separate search DB that supports index intersections, fetch the mongo ids from there and then get the results from mongo using $in. But that is fraught with its own perils.
-- EDIT --
The explain you posted is a beautiful example of the problem with scanning low-selectivity fields. Apparently, there are a lot of documents for "nickey#acme.com". Now, finding those documents and sorting them descending by timestamp is pretty fast, because it's supported by high-selectivity indexes. Unfortunately, since there are only two device types, mongo had to scan all 30060 of that user's documents looking for ones that match 'MOBILE' (and, with n at 0, found none).
I assume this is some kind of web tracking, and the user's usage pattern makes the query slow (if he switched between mobile and web on a daily basis, the query would be fast).
Making this particular query faster could be done using a compound index that contains the device type, e.g. using
a) ensureIndex({'user.userName' : 1, 'userAgent.deviceType' : 1, 'timestamp' : -1})
or
b) ensureIndex({'userAgent.deviceType' : 1, 'user.userName' : 1, 'timestamp' : -1})
Unfortunately, that means that queries like find({"user.userName" : "foo"}).sort({"timestamp" : -1}); can't use the same index anymore, so, as described, the number of indexes will grow very quickly.
I'm afraid there's no very good solution for this using mongodb at this time.
Mongo only uses 1 index per query.
So if you want to filter on 2 fields, mongo will use the index with one of the fields, but still needs to scan the entire subset.
This means that basically you'll need an index for every type of query in order to achieve the best performance.
Depending on your data, it might not be a bad idea to run one query per field and combine the results in your app.
This way you'll only need an index on each field, but it may be too much data to process.
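A minimal sketch of the "one query per field, combine in the app" idea, in plain JavaScript. The two id lists are hypothetical stand-ins for the results of two separate single-field queries, and intersectIds is an illustrative helper, not part of any driver.

```javascript
// Hypothetical _id lists, as if returned by two separate single-field queries.
const idsByUser = [1, 4, 7, 9, 12];
const idsByDevice = [2, 4, 9, 10, 12, 15];

// Intersect in the application: keep only ids present in every result list.
function intersectIds(...lists) {
  if (lists.length === 0) return [];
  const [first, ...rest] = lists;
  const sets = rest.map(list => new Set(list));
  return first.filter(id => sets.every(s => s.has(id)));
}

const matchingIds = intersectIds(idsByUser, idsByDevice);
console.log(matchingIds);
```

The cost is that every per-field result has to be transferred to the app before intersecting, which is the "too much data to process" risk mentioned above.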
Your $in contains a single value, so you can replace it with a plain equality match on user.userName. Note, though, that the explain() output above shows the user.userName index being used even with $in, so this change alone is unlikely to fix the slowness.
http://docs.mongodb.org/manual/core/query-optimization/
I'm building an application that uses MongoDB as a database. I have a lot of products, and I want to log what products a user looks at to the user's database entry. For instance, a user profile looks like this:
{
"email" : "foo#bar.com",
"name" : "John Snow",
"_id" : ObjectId("51ecbcc6896652a008000001"),
"productsViewed" : [
product1,
product2,
product3,
product4
]
}
I have two options here. I can log just the _id of each product, or I could log entire objects representing the product (name, price, ~100 word description, categories, that sort of thing). The difference in object size is 1 line of text per product vs about 30 lines per product.
I realise that this is probably a trivial amount of data to be concerned about, but if a user has 10,000 productsViewed entries, will the ~30x larger difference make any sort of impact? Logging more data is far more useful for my purposes but I'd like to avoid my database calls lagging if the user profile becomes quite large.
Question is: At what point (in character length, I guess?) is there too much data to store in one MongoDB record?
16 MB is the limit for the entire document. This means that all strings etc. have to fit within 16 MB. However, before you reach that, there are other limitations on your schema, which you yourself hint at:
but if a user has 10,000 productsViewed entries, will the ~30x larger difference make any sort of impact?
And the answer is yes. First off, with the added data of the root user, you will probably be over the 16 MB limit. Beyond that, the in-memory $pull, $push and other subdocument operators might have a hard time keeping performance up. You can sort of mitigate that problem by batching your subdocuments into groups of 100.
However, yet again, you have an even bigger problem: fragmentation. MongoDB stores each record in a single contiguous space on disk (which is why it has settings like padding), so you could see considerable fragmentation from odd-sized record slots not being reused.
I would personally say that you should factor this relation out into a separate collection.
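The batching into groups of 100 mentioned above could look something like this sketch in plain JavaScript. The toBuckets helper and the bucket/products field names are hypothetical; each resulting object would be stored as its own document instead of one ever-growing array on the user.

```javascript
// Split a long productsViewed list into fixed-size buckets.
// The field names `bucket` and `products` are made up for this sketch.
function toBuckets(items, bucketSize) {
  const buckets = [];
  for (let i = 0; i < items.length; i += bucketSize) {
    buckets.push({
      bucket: buckets.length,                  // 0, 1, 2, ...
      products: items.slice(i, i + bucketSize) // at most bucketSize entries
    });
  }
  return buckets;
}

// 250 hypothetical product ids for one user.
const viewed = Array.from({ length: 250 }, (_, i) => "product" + (i + 1));
const viewedBuckets = toBuckets(viewed, 100);

console.log(viewedBuckets.map(b => b.products.length));
```

Only the last bucket for a user ever grows, so the documents that are updated stay small.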
Is there a lot of overhead in excluding nearly all of the data in a document when querying a mongo database?
For example, in the case where I only want field1 and field2, for a collection with a document structure of:
{
    "field1" : 1,
    "field2" : true,
    "field3" : ["big","array",...],
    "field4" : ["another","big","array",...]
}
would I benefit more from:
1. Creating a separate collection alongside this collection containing only field1 and field2, or
2. Using .find() on the original documents with inclusion/exclusion parameters
Note: The inefficiency of saving the same data twice isn't a concern for me as much as the efficiency of querying the data
Many thanks!
Projection is somewhat similar to using column names explicitly in SQL, so it seems a little counter-intuitive to ask if returning smaller amount of data would incur overhead over returning larger amount of data (full document).
So you have to find the document (depending on how you .find() it may be fast or slow) but returning only first two fields of the document rather than all the fields (complete document) would make it faster not slower.
Having a second collection may only be of benefit if you are concerned about your collection fitting into RAM. If the documents in the duplicate collection are much smaller, then they can presumably fit into a smaller amount of total RAM, decreasing the chance that a page will need to be swapped in from disk. However, if you are writing to this collection as well as the original collection, then you have to keep a lot more data in RAM than if you just had the original collection.
So while the intricate details may depend on your individual set-up, the general answer would probably be option 2: you will benefit more from using projection and only returning the two fields you need.
I read that MongoDB documents are limited to 4 MB in size. I also read that when you insert a document, MongoDB puts some padding in so that if you add something to the document, the entire document doesn't have to be moved and reindexed.
So I was wondering, does it store documents in 4MB chunks on disk?
Thanks
As of 1.8, individual documents are limited to 16MB in size (previously 4MB). This is an arbitrary limitation, imposed because when you read a document off disk, the whole document is read into RAM. So I think the intention is that this limitation is there to try to safeguard memory and make you think about your schema design.
Data is then stored across multiple data files on disk. I forget the initial file size, but every time the database grows, a new file is created to expand into, where each new file is created bigger than the previous file until a single file size of 2GB is reached. From this point on, if the database continues to grow, subsequent 2GB data files are created for documents to be inserted into.
"Chunks" has a specific meaning in the sharding aspect of MongoDB: documents are grouped into "chunks" of a configurable size, and when balancing needs to be done, it's these chunks of data (n documents) that are moved around.
The simple answer is "no." The actual space a document takes up in Mongo's files is variable, but it isn't the maximum document size. The DB engine watches to see how much your documents tend to change after insertion and calculates the padding factor based on that. So it changes all the time.
If you're curious, you can see the actual padding factor and storage space of your data using the .stats() function on a collection in the mongo shell. Here's a real-world example (with some names changed to protect the innocent clients):
{14:42} ~/my_directory ➭ mongo
MongoDB shell version: 1.8.0
connecting to: test
> show collections
schedule_drilldown
schedule_report
system.indexes
> db.schedule_report.stats()
{
"ns" : "test.schedule_report",
"count" : 16749,
"size" : 60743292,
"avgObjSize" : 3626.681712341035,
"storageSize" : 86614016,
"numExtents" : 10,
"nindexes" : 3,
"lastExtentSize" : 23101696,
"paddingFactor" : 1.4599999999953628,
"flags" : 1,
"totalIndexSize" : 2899968,
"indexSizes" : {
"_id_" : 835584,
"WeekEnd_-1_Salon_1" : 925696,
"WeekEnd_-1_AreaCode_1" : 1138688
},
"ok" : 1
}
So my test collection has 16,749 records in it, with an average size of about 3.6 KB ("avgObjSize") and a total data size of about 60 MB ("size"). However, it turns out they actually take up about 86 MB on disk ("storageSize") because of the padding factor. That padding factor has varied over time as the collection's documents have been updated, but if I inserted a new document right now, it'd allocate 1.46 times as much space as the document needs ("paddingFactor") to avoid having to move things around if I change it later. To me that's a fair size/speed tradeoff.
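The arithmetic behind those figures can be checked with a short sketch in plain JavaScript. The numbers are copied from the stats() output above; the variable names are only for illustration. Note that the overall storageSize/size ratio need not equal the current paddingFactor, since the padding factor has varied over time.

```javascript
// Figures copied from the db.schedule_report.stats() output above.
const stats = {
  count: 16749,
  size: 60743292,
  storageSize: 86614016,
  avgObjSize: 3626.681712341035,
  paddingFactor: 1.4599999999953628
};

// Average document size recomputed from the totals; should match avgObjSize.
const avgFromTotals = stats.size / stats.count;

// How much bigger the on-disk allocation is than the raw data.
const storageRatio = stats.storageSize / stats.size;

// Bytes that would be allocated for an average-sized insert at the current
// padding factor.
const newDocAllocation = stats.avgObjSize * stats.paddingFactor;

console.log(avgFromTotals.toFixed(2), storageRatio.toFixed(2), Math.round(newDocAllocation));
```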