I have made a test with 10 M rows of data. Each row has 3 integer and 2 string columns. First I import this data to mongoDB which is a single shard. I do a simple "where" query with db.table.find() on a non-index columns. The query fetches a single row which takes roughly in 7 seconds.
On the same hardware I load the same data to a c# list which is in memory. I do a while loop to scan all 10M data and do a simple equal control to emulate where query. It takes only around 650 ms which is much more faster than MongoDB.
I have a 32 GB machine so mongodb is having no problem to memory map the table.
Why mongoDB is much slower? Is it because the mongoDB is keeping the data in a data structure which is hard to full scan or is it because memory mapping in not the same as keeping a data in a variable.
As Remon pointed out you are definitely comparing apples to oranges in this test.
To understand a bit more on what is happening behind the scenes in that table scan, read through the MongoDB internals here. (Look under the Storage model)
There is the concept of extents which represents a contiguous disk space.
Each extent points to a linked list of docs.
The doc contains the data in BSON format. So now you can imagine how we would retrieve data.
Now the beauty of having an index is aptly shown at the right top corner. MongoDB uses a BTree structure to navigate which is pretty fast.
Try changing your test to have some warm up runs and use an index.
UPDATE : I have done some testing as a part of my day job to compare the performance of JBoss Cache (an in memory Java Cache) with MongoDB as an application cache (queries against _id). The results are quite comparable.
Where to start..
First of all the test is completely apples and oranges. Loading a dataset into memory and doing a completely in-memory scan of it is in no way equal to a table scan on any database.
I'm also willing to bet you're doing your test on cold data and MongoDB performance improves dramatically as it swaps hot data into memory. Please note that MongoDB doesn't preemptively swap data into memory. It does so if, and only if, the data is accessed frequently (or at all, depending). Actually it's more accurate to say the OS does since MongoDB's storage engine is built on top of MMFs (memory mapped files).
So in short, your test isn't a good test and the way you're testing MongoDB isn't producing accurate results. You're testing a theoretical best case with your C# equivalent that on top of that is considerably less complex than the database code.
Related
First of all, I am using MongoDB 3.0 with the new WiredTiger storage engine. Also using snappy for compression.
The use case I am trying to understand and optimize for from a technical point of view is the following;
I have a fairly large collection, with about 500 million documents that takes about 180 GB including indexes.
Example document:
{
_id: 123234,
type: "Car",
color: "Blue",
description: "bla bla"
}
Queries consist of finding documents with a specific field value. Like so;
thing.find( { type: "Car" } )
In this example the type field should obviously be indexed. So far so good. However the access pattern for this data will be completely random. At a given time I have no idea what range of documents will be accessed. I only know that they will be queried on indexed fields, returning at the most 100000 documents at a time.
What this means in my mind is that the caching in MongoDB/WiredTiger is pretty much useless. The only thing that needs to fit in the cache are the indexes. An estimation of the working set is hard if not impossible?
What I am looking for is mostly tips on what kinds of indexes to use and how to configure MongoDB for this kind of use case. Would other databases work better?
Currently I find MongoDB to work quite well on somewhat limited hardware (16 GB RAM, non SSD disc). Queries return in decent time and obviously instantly if the result set is already in the cache. But as already stated this will most likely not be the typical case. It is not critical that the queries are lightning fast, more so that they are dependable and that the database will run in a stable manner.
EDIT:
Guess I left out some important things. The database will be mostly for archival purposes. As such, data arrives from another source in bulk, say once a day. Updates will be very rare.
The example I used was a bit contrived but in essence that is what queries look like. When I mentioned multiple indexes I meant the type and color fields in that example. So documents will be queried on using these fields. As it is now, we only care about returning all documents that have a specific type, color etc. Naturally, the plan we have is to only query on fields that we have an index for. So ad-hoc queries are off the table.
Right now the index sizes are quite manageable. For the 500 million documents each of these indexes are about 2.5GB and fit easily in RAM.
Regarding average data size of an operation, I can only speculate at this point. As far as I know, typical operations return about 20k documents, with an average object size in the range of 1200 bytes. This is the stat reported by db.stats() so I guess it is for the compressed data on disc, and not how much it actually takes once in RAM.
Hope this bit of extra info helped!
Basically, if you have a consistent rate of reads that are uniformly at random over type (which is what I'm taking
I have no idea what range of documents will be accessed
to mean), then you will see stable performance from the database. It will be doing some stable proportion of reads from cache, just by good luck, and another stable proportion by reading from disk, especially if the number and size of documents are about the same between different type values. I don't think there's a special index or anything to help you besides just better hardware. Indexes should remain in RAM because they'll constantly be being used.
I suppose more information would help, as you mention only one simple query on type but then talk about having multiple indexes to worry about keeping in RAM. How much data does the average operation return? Do you ever care to return a subset of docs of certain type or only all of them? What do inserts and updates to this collection look like?
Also, if the documents being read are truly completely random over the dataset, then the working set is all of the data.
I'm a beginner with a non SQL structure like here with MongoDB and I don't find somebody talk about a collection with lots of data, like 1.000.000 entries ? and more ?
I saw a company page on the official site. But nothing with large data companies.
I heard about a combo with SQL : Large data are stocked on SQL tables, and only the "cache" are on MongoDB, but it's the only one solution for MongoDB and large data ?
We're using MongoDB to power Where's it Up, and the api behind it. We're currently pushing in >3 million documents per day. MongoDB is the only storage engine in use. We were keeping a bunch around for a while, but we're now using TTL to delete old records.
Things are going super well, just make sure you have all the indexes you need. Querying a million+ records without an index is bad, regardless of your storage engine. Auto-failover has been super helpful.
Something to watch out for is updating records to include more information, it can be pretty expensive if the document grows past pre-allocated space. We ended up changing how we stored data to avoid updates, and create new documents instead.
MongoDB in it's current incarnation is explicitly designed to make it easy to scale out.
As for the numbers: one of my test databases has 10M records and runs easily on my MacBook Air, which is 4 years old now.
So what you can do when your current cluster can not handle the data stored (either because the indices are too big for your RAM or because of processing the queries takes too long): add another node to your MongoDB cluster. Your performance gain should be something between slightly below linear (if your cluster was in perfect condition otherwise) up to several orders of magnitude (when indices didn't fit into RAM and/or IO was pushed to it's limits before and that situation changed after scaling out).
A word of warning: you should have somebody who knows about MongoDB administration in case you want to put you deployment into production. Though MongoDB administration seems to be easy, it is by no means something to be done by a layman. Especially not for production use.
There is a set of registrators, say 100k. Every registrator 24 times a day gives value smth like 23.123. I need to save this value and time. Then I need to calculate how value changes for some period, e.g. 4jun2014 - 19jul2014: In order to do this I have to find last value of 3jun2014 and last value of 19jul2014.
First I am trying to estimate size of data stored by one registrator. Time+value must be lower than 100 bytes. 1 year is < 100*24*365 = 720kB of data, so I can easily store 10 years of data (since 7.2M < 16M limit) at my document. I decided not to store registered data at registeredData collection but to store registrator data embedded in registrator object as a tree timedata->year->month->day:
{
code: '3443-12',
timedata: {
2013: {
6: {
13: [
{t:1391345679, d:213.12},
{t:1391349679, d:213.14},
]
}
}
}
}
So it is easy to get values of the day: just get find({code: "3443-12"})[0].timedata[2013][6][13].
When I get new data, I just push it into array of existing document and it eventually grows from zero to 7Mb.
Questions
What is the stored size of {t:1391345679, d:213.12} line, is it less than 100bytes?
Is it right way to organize database for such purposes?
100k documents with 5Mb size = 500G. Does MongoDB deal fast with database size much more than RAM size?
Update
I decided to store time not as a timestamp but as time in seconds from the start of a day: 0 - 86399: {t: 86123, d: 213.12}.
Regarding your last question, " Does MongoDB deal fast with database size much more than RAM size?" the answer is it can, but it depends on a number of factors.
MongoDB works best when the working set fits within the memory available to MongoDB. When it does not you tend to see rather rapid performance declines. How big that working set is a function of database schema, indexes built and your data access patterns.
Let's say you have a years worth of data in your database, but regularly only touch the last few days of data. Then your working set is likely to be composed of the memory required to keep the last few days of data in memory, plus enough of the indexes in memory for you to properly update and read from them.
Alternatively, if you are randomly accessing data across a year and have high and update volume you may have a significantly larger working set to deal with.
As a point of comparison, I've got a production MongoDB instance that has around 500M documents in it, taking up around 2 TB of disk storage. Total memory on the primary of the replica set is 128GB (1/16th the total storage) and we're not experiencing any performance problems.
The key for all of it though is how much data do you access over time. The killer for MongoDB performance is memory contention, when you are paging out data to service a new request only to re-page that old data right back in. And it gets far worse if you cannot keep your indexes in memory.
I've tested it and it is less than 100 B, in deed, it is 48 B:
var num=100000;
for(i=0;i<num;i++){
db.foo.insert({t:1391345679, d:213.12})
};
db.foo.stats().avgObjSize // => Outputs 48
It looks like what you are doing is kind of a hack to avoid normalising your data (m.b. for transaction purposes?) and sooner or later you may run into problems (e.g. requirements change, size of your data changes, new fields are introduced etc.) I do not know your schema and domain, but if you go with denomarmalized model as you are doing you must be sure that documents will not exceed the size limit of 16MB. That being said, I would recommend schema design article.
Answers:
The previous answer gives a hint about the document size. You can use it as a starting point.
Choosing an effective data models depends on your application needs. The main question is the decision to denormalize or use linking. Note, generally with denormalized data you achieve better performance for read operations, as well as the ability to request and retrieve related data in a single database operation. Embedding makes it possible to update a document in a single atomic write operation (transactionally). So, when to use embedded (denormalized):
you have “contains” relationships between entities. See Model
One-to-One Relationships with Embedded Documents.
you have one-to-many relationships between entities. In these relationships the “many” or
child documents always appear with or are viewed in the context of the
“one” or parent documents. See Model One-to-Many Relationships with
Embedded Documents.
In your situation your documents will grow after creation which can impact write performance and lead to data fragmentation. You can control this with padding factor.
- About the performance: it depends on how you create your indexes. More importantly, on your access patterns. For each query executed often, check out the output from explain() to see how many documents have been checked.
In doing some preliminary tests of MongoDB sharding, I hoped and expected that the time to execute queries that hit only a single chunk of data on one shard/machine would remain relatively constant as more data was loaded. But I found a significant slowdown.
Some details:
For my simple test, I used two machines to shard and tried queries on similar collections with 2 million rows and 7 million rows. These are obviously very small collections that don’t even require sharding, yet I was surprised to already see a significant consistent slowdown for queries hitting only a single chunk. Queries included the sharding key, were for result sets ranging from 10s to 100000s of rows, and I measured the total time required to scroll through the entire result sets. One other thing: since my application will actually require much more data than can fit into RAM, all queries were timed based on a cold cache.
Any idea why this would be? Has anyone else observed the same or contradictory results?
Further details (prompted by Theo):
For this test, the rows were small (5 columns including _id), and the key was not based on _id, but rather on a many-valued text column that almost always appears in queries.
The command db.printShardingStatus() shows how many chunks there are as well as the exact key values used to split ranges for chunks. The average chunk contains well over 100,000 rows for this dataset and inspection of key value splits verifies that the test queries are hitting a single chunk.
For the purpose of this test, I was measuring only reads. There were no inserts or updates.
Update:
Upon some additional research, I believe I determined the reason for the slowdown: MongoDB chunks are purely logical, and the data within them is NOT physically located together (source: "Scaling MongoDB" by Kristina Chodorow). This is in contrast to partitioning in traditional databases like Oracle and MySQL. This seems like a significant limitation, as sharding will scale horizontally with the addition of shards/machines, but less well in the vertical dimension as data is added to a collection with a fixed number of shards.
If I understand this correctly, if I have 1 collection with a billion rows sharded across 10 shards/machines, even a query that hits only one shard/machine is still querying from a large collection of 100 million rows. If values for the sharding key happen to be located contiguously on disk, then that might be OK. But if not and I'm fetching more than a few rows (e.g. 1000s), then this seems likely to lead to lots of I/O problems.
So my new question is: why not organize chunks in MongoDB physically to enable vertical as well as horizontal scalability?
What makes you say the queries only touched a single chunk? If the result ranged up to 100 000 rows it sounds unlikely. A chunk is max 64 Mb, and unless your objects are tiny that many won't fit. Mongo has most likely split your chunks and distributed them.
I think you need to tell us more about what you're doing and the shape of your data. Were you querying and loading at the same time? Do you mean shard when you say chunk? Is your shard key something else than _id? Do you do any updates while you query your data?
There are two major factors when it comes to performance in Mongo: the global write lock and it's use of memory mapped files. Memory mapped files mean you really have to think about your usage patterns, and the global write lock makes page faults hurt really badly.
If you query for things that are all over the place the OS will struggle to page things in and out, this can be especially hurting if your objects are tiny because whole pages have to be loaded just to access a small pieces, lots of RAM will be wasted. If you're doing lots of writes that will lock reads (but usually not that badly since writes happen fairly sequentially) -- but if you're doing updates you can forget about any kind of performance, the updates block the whole database server for significant amounts of time.
Run mongostat while you're running your tests, it can tell you a lot (run mongostat --discover | grep -v SEC to see the metrics for all you shard masters, don't forget to include --port if your mongos is not running on 27017).
Addressing the questions in your update: it would be really nice if Mongo did keep chunks physically together, but it is not the case. One of the reasons is that sharding is a layer on top of mongod, and mongod is not fully aware of it being a shard. It's the config servers and mongos processes that know of shard keys and which chunks that exist. Therefore, in the current architecture, mongod doesn't even have the information that would be required to keep chunks together on disk. The problem is even deeper: Mongo's disk format isn't very advanced. It still (as of v2.0) does not have online compaction (although compaction got better in v2.0), it can't compact a fragmented database and still serve queries. Mongo has a long way to go before it's capable of what you're suggesting, sadly.
The best you can do at this point is to make sure you write the data in order so that chunks will be written sequentially. It probably helps if you create all chunks beforehand too, so that data will not be moved around by the balancer. Of course this is only possible if you have all your data in advance, and that seems unlikely.
Disclaimer: I work at Tokutek
So my new question is: why not organize chunks in MongoDB physically to enable vertical as well as horizontal scalability?
This is exactly what is done in TokuMX, a replacement server for MongoDB. TokuMX uses Fractal Tree indexes which have high write throughput and compression, so instead of storing data in a heap, data is clustered with the index. By default, the shard key is clustered, so it does exactly what you suggest, it organizes the chunks physically, by ensuring all documents are ordered by the shard key on disk. This makes range queries on the shard key fast, just like on any clustered index.
I run a single mongodb instance which is getting inserted with logs from an app server. the current rate of insert in production is 10 inserts per second. And its a capped collection. i DONT USE ANY INDEXES . Queries were running faster when there were small number of records. only one collection has that amount of data. even querying from collection that has very few rows has become very slow. IS there any means to improve the performance.
-Avinash
This is a very difficult question to answer because we dont know much about your configuration or your document structure.
One thing that immediately pops into my head is that you are running out of memory. 10 inserts per second doesn't mean much because we do not know how big the inserted documents are.
If you are inserting larger documents at 10 per second, you could be eating up memory, causing the operating system to push some of your records to disk.
When you query without using an index, you are forced to scan every document. If your documents have been pushed to disk by the OS, you will begin having page faults. Mongo will need to fetch pages of data off the hard disk, and load them into memory so that they can be scanned. Before doing this, the operating system will need to make room for that data in memory by flushing other parts of memory out to disk.
It sounds like you are are I/O bound and the two biggest things you can do to fix this are
Add more memory to the machine running mongod
Start using indexes so that the database does not need to do full collection scans
Use proper indexes, though that will have some effect on the efficiency of insertion in a capped collection.
It would be better if you can share the collection structure and the query you are using.