Are there any tools to estimate index size in MongoDB?

I'm looking for a tool to get a decent estimate of how large a MongoDB index will be based on a few signals like:
How many documents in my collection
The size of the indexed field(s)
The size of the _id I'm using if not ObjectId
Geo/Non-geo
Has anyone stumbled across something like this? I can imagine it would be extremely useful given Mongo's performance degradation once it hits the memory wall and documents start getting paged out to disk. If I have a functioning database and want to add another index, the only way I'll know if it will be too big is to actually add it.
It wouldn't need to be accurate down to the bit, but with some assumptions about B-Trees and the index implementation I'm sure it could be reasonable enough to be helpful.
If this doesn't exist already I'd like to build and open source it, so if I've missed any required parameters for this calculation please include in your answer.

I just spoke with some of the 10gen engineers and there isn't a tool but you can do a back of the envelope calculation that is based on this formula:
2 * [ n * ( 18 bytes overhead + avg size of indexed field + 5 or so bytes of conversion fudge factor ) ]
Where n is the number of documents you have.
The overhead and conversion padding are Mongo-specific, but the 2x comes from the B-tree data structure being roughly half full (while having allocated 100% of the space a full tree would require) in the worst case.
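As a rough worked example (the document count and average field size below are purely illustrative, not from the question):

var n = 10 * 1000 * 1000;                          // 10 million documents
var avgFieldBytes = 10;                            // average size of the indexed field
var estimateBytes = 2 * (n * (18 + avgFieldBytes + 5));
print(estimateBytes / (1024 * 1024) + " MB");      // roughly 630 MB for this example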
I'd explain more but I'm learning about it myself at the moment. This presentation will have more details: http://www.10gen.com/presentations/mongosp-2011/mongodb-internals

You can check the sizes of the indexes on a collection by using the command:
db.collection.stats()
More details here: http://docs.mongodb.org/manual/reference/method/db.collection.stats/#db.collection.stats
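For example, assuming a collection named users with an index on email (names are illustrative):

var s = db.users.stats();
print(s.totalIndexSize);     // total size of all indexes on the collection, in bytes by default
printjson(s.indexSizes);     // per-index sizes, keyed by index name (e.g. "email_1")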

Another way to calculate is to ingest ~1000 or so documents into every collection. In other words, build a small-scale model of what you're going to end up with in production, create the indexes (or whatever else you need), and calculate the final numbers based on the db.collection.stats() averages.
Edit (from a comment):
Tyler's answer describes the original MMAP storage engine circa MongoDB 2.0, but this formula definitely isn't applicable to modern versions of MongoDB. WiredTiger, the default storage engine in MongoDB 3.2+, uses index prefix compression, so index sizes will vary based on the distribution of key values. There are also a variety of index types and options which might affect sizing. The best approach for a reasonable estimate is empirical estimation with representative test data for your projected growth.
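A minimal sketch of that empirical approach, assuming a sample collection named events_sample, a candidate index on a ts field, and a projected 50 million documents (all of these are illustrative; prefix compression means the linear extrapolation is only approximate):

db.events_sample.createIndex({ ts: 1 });              // build the candidate index on the sample
var s = db.events_sample.stats();
var perDoc = s.indexSizes["ts_1"] / s.count;          // index bytes per sampled document
var projectedBytes = perDoc * 50 * 1000 * 1000;       // extrapolate to the projected document count
print("projected index size: ~" + Math.round(projectedBytes / (1024 * 1024)) + " MB");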

The best option is to test in a non-prod deployment!
Insert 1,000 documents and check the index sizes, then insert 100,000 documents and check again, and so on.
An easy way to loop over all collections and sum up their total index sizes:
var y = 0;
db.adminCommand("listDatabases").databases.forEach(function(d) {
    var mdb = db.getSiblingDB(d.name);
    mdb.getCollectionNames().forEach(function(c) {
        var s = mdb[c].stats(1024 * 1024).totalIndexSize;   // totalIndexSize, scaled to MB
        y = y + s;
        print("db.Collection: " + d.name + "." + c + " totalIndexSize: " + s + " MB");
    });
});
print("============================");
print("Instance totalIndexSize: " + y + " MB");

Related

Mongodb Timeseries Data Schema Design

Have been reading a few blogs related to this topic, like https://www.mongodb.com/blog/post/time-series-data-and-mongodb-part-1-introduction. It seems storing related time-series data within a few documents (or a single document) would be a better approach than storing each data point as its own document. But I am just wondering whether storing it in a single doc (forget the size-bucket approach for a moment) fits my use case:
a) Regarding update of the data: occasionally need to replace the entire time-series
b) Potentially need to read the sorted [<date1,value1>,<date2,value2>,....] by date range.
A few questions on my mind right now:
1) The common suggestion I saw is: don't embed a large array in a single doc if the size of the array is unbounded, because MongoDB may need to reallocate space for the document on update. I understand that with the old MMAPv1 storage engine that would be an issue. But as I saw from other answers in WiredTiger and in-place updates and Does performing a partial update on a MongoDb document in WiredTiger provide any advantage over a full document update?, this doesn't seem to be a problem, because WiredTiger reconstructs the entire document on update and flushes it to disk anyway, and it is said to be optimized for large documents. So I am just wondering whether a large array in a single doc would still be an issue with WiredTiger.
2) [Given each array may contain 5000+ key-value pairs, each doc is around 0.5 - 1 MB.] It seems the query to retrieve the sorted time series by date range would be less complicated (fewer aggregation stages, since I need to unwind the subdocuments to sort and filter) if I store each data point as a single doc. But in terms of disk and memory usage, there would definitely be an advantage to embedding, as I am retrieving 1 doc vs. n docs. So I am just thinking about where to draw the line here in terms of performance.
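For what it's worth, a hedged sketch of the date-range read against the embedded-array layout (the metrics collection and the sensorId/series/ts/val field names are assumptions for illustration):

db.metrics.aggregate([
    { $match: { sensorId: "sensor-1" } },
    { $unwind: "$series" },
    { $match: { "series.ts": { $gte: ISODate("2019-06-04"), $lte: ISODate("2019-07-19") } } },
    { $sort: { "series.ts": 1 } },
    { $project: { _id: 0, ts: "$series.ts", val: "$series.val" } }
]);

With one document per data point, the same read collapses to a find() with a range filter and a sort on an indexed timestamp field.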
3) There are 3 approaches I could go for in scenario A.
Replace the whole docs (replaceOne/ replaceOneModel)
Update only the time-series part of the doc (updateOne/ updateOneModel)
Use bulk update or insert to update each array element
Also I am thinking about index rebuild issues. Given that the index is not linked directly to the data files in WiredTiger, it seems approaches 1 and 2 are both acceptable, with 2 being better as it updates only the target part (see the sketch below). Am I missing anything here?
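A minimal sketch of approaches 1 and 2 (the sensors collection and the series field are assumptions, not from the question):

var sensorId = "sensor-1";                                        // hypothetical identifier
var newSeries = [ { ts: ISODate("2019-06-04"), val: 23.1 } ];     // hypothetical replacement series

// 1) Replace the whole document
db.sensors.replaceOne({ _id: sensorId }, { _id: sensorId, series: newSeries });

// 2) Update only the time-series part of the document
db.sensors.updateOne({ _id: sensorId }, { $set: { series: newSeries } });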
4) Just thinking about what scenarios would actually call for the single-data-point-per-doc approach.
I am not too familiar with how the storage engine or MongoDB itself works under the hood; I would appreciate it if someone could shed some light on this. Thanks in advance.

Database for filtering XML documents

I would like to hear some suggestions on implementing a database solution for the problem below:
1) There are 100 million XML documents saved to the database per day.
2) The database holds a maximum of 3 days of data.
3) There are 1 million query requests per day.
4) The values by which the documents are filtered are stored in a separate table and mapped to the corresponding XML document ID.
5) The documents are requested by date range, by a list of IDs, as the top 10 newest documents, and as records that are new since the previous request.
Here is what I have done so far:
1) Checked whether I can use Redis. It is limited to a few data types and cannot apply multiple where-conditions to filter a hash, and I need indexing based on date and lots of other fields. I was unable to choose the right data structure to store this in a hash.
2) Investigated DynamoDB. It is again a key-value store where all the filter conditions would have to be stored as one value. I am not sure it would be efficient to query a JSON document to filter the right XML document.
3) Investigated Cassandra, and it looks like it may fit my requirements, but it has the limitation that read operations might be slow. Cassandra has the advantage of faster write operations over changing data. This looks like the best possible solution so far.
Currently we are using SQL Server and there is a performance problem, so we are looking for a better solution.
Please suggest, thanks.
It's not that reads in Cassandra might be slow, but that it's hard to guarantee an SLA for reads (usually they will be fast, but some of them will be slow).
Cassandra doesn't have the search capabilities you may need in the future (ordering, searching by many fields, ranked searching). You can probably achieve that with Cassandra, but with an obviously greater amount of effort than with a database suited for search operations.
I suggest looking at Lucene/Elasticsearch. Let me quote the features of Lucene from their main website:
Scalable
High-Performance Indexing
over 150GB/hour on modern hardware
small RAM requirements -- only 1MB heap
incremental indexing as fast as batch indexing
index size roughly 20-30% the size of text indexed
Powerful, Accurate and Efficient Search Algorithms
ranked searching -- best results returned first
many powerful query types: phrase queries, wildcard queries, proximity queries, range queries and more
fielded searching (e.g. title, author, contents)
sorting by any field
multiple-index searching with merged results
allows simultaneous update and searching
flexible faceting, highlighting, joins and result grouping
fast, memory-efficient and typo-tolerant suggesters
pluggable ranking models, including the Vector Space Model and Okapi BM25
configurable storage engine (codecs)

Optimizing for random reads

First of all, I am using MongoDB 3.0 with the new WiredTiger storage engine. Also using snappy for compression.
The use case I am trying to understand and optimize for from a technical point of view is the following;
I have a fairly large collection, with about 500 million documents that take up about 180 GB including indexes.
Example document:
{
    _id: 123234,
    type: "Car",
    color: "Blue",
    description: "bla bla"
}
Queries consist of finding documents with a specific field value, like so:
thing.find( { type: "Car" } )
In this example the type field should obviously be indexed. So far so good. However, the access pattern for this data will be completely random. At a given time I have no idea what range of documents will be accessed; I only know that they will be queried on indexed fields, returning at most 100,000 documents at a time.
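For reference, the single-field index implied here could be created like this (the collection name thing follows the example query above):

db.thing.createIndex({ type: 1 });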
What this means in my mind is that the caching in MongoDB/WiredTiger is pretty much useless. The only thing that needs to fit in the cache are the indexes. An estimation of the working set is hard if not impossible?
What I am looking for is mostly tips on what kinds of indexes to use and how to configure MongoDB for this kind of use case. Would other databases work better?
Currently I find MongoDB to work quite well on somewhat limited hardware (16 GB RAM, non SSD disc). Queries return in decent time and obviously instantly if the result set is already in the cache. But as already stated this will most likely not be the typical case. It is not critical that the queries are lightning fast, more so that they are dependable and that the database will run in a stable manner.
EDIT:
Guess I left out some important things. The database will be mostly for archival purposes. As such, data arrives from another source in bulk, say once a day. Updates will be very rare.
The example I used was a bit contrived but in essence that is what queries look like. When I mentioned multiple indexes I meant the type and color fields in that example. So documents will be queried on using these fields. As it is now, we only care about returning all documents that have a specific type, color etc. Naturally, the plan we have is to only query on fields that we have an index for. So ad-hoc queries are off the table.
Right now the index sizes are quite manageable. For the 500 million documents each of these indexes are about 2.5GB and fit easily in RAM.
Regarding average data size of an operation, I can only speculate at this point. As far as I know, typical operations return about 20k documents, with an average object size in the range of 1200 bytes. This is the stat reported by db.stats() so I guess it is for the compressed data on disc, and not how much it actually takes once in RAM.
Hope this bit of extra info helped!
Basically, if you have a consistent rate of reads that are uniformly at random over type (which is what I'm taking
I have no idea what range of documents will be accessed
to mean), then you will see stable performance from the database. It will be doing some stable proportion of reads from cache, just by good luck, and another stable proportion by reading from disk, especially if the number and size of documents are about the same between different type values. I don't think there's a special index or anything else to help you besides just better hardware. The indexes should remain in RAM because they'll constantly be in use.
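One hedged way to sanity-check the "indexes fit in RAM" assumption is to compare the collection's index sizes against the configured WiredTiger cache; the field names below come from stats() and serverStatus() output, but treat this as a sketch:

var idxBytes = db.thing.stats().totalIndexSize;
var cacheBytes = db.serverStatus().wiredTiger.cache["maximum bytes configured"];
print("indexes: " + (idxBytes / 1e9).toFixed(1) + " GB, WiredTiger cache: " + (cacheBytes / 1e9).toFixed(1) + " GB");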
I suppose more information would help, as you mention only one simple query on type but then talk about having multiple indexes to worry about keeping in RAM. How much data does the average operation return? Do you ever care to return a subset of docs of certain type or only all of them? What do inserts and updates to this collection look like?
Also, if the documents being read are truly completely random over the dataset, then the working set is all of the data.

How to organize mongodb database for a huge set of time-value pairs for a lot of documents?

There is a set of registrators, say 100k. Every registrator gives a value 24 times a day, something like 23.123. I need to save this value and the time. Then I need to calculate how the value changes over some period, e.g. 4jun2014 - 19jul2014: in order to do this I have to find the last value of 3jun2014 and the last value of 19jul2014.
First I am trying to estimate the size of the data stored by one registrator. Time + value must be less than 100 bytes. One year is < 100*24*365 = 876 kB of data, so I can easily store 10 years of data (since ~8.8 MB < the 16 MB limit) in my document. I decided not to store registered data in a separate registeredData collection, but to store it embedded in the registrator object as a tree timedata->year->month->day:
{
    code: '3443-12',
    timedata: {
        2013: {
            6: {
                13: [
                    {t: 1391345679, d: 213.12},
                    {t: 1391349679, d: 213.14}
                ]
            }
        }
    }
}
So it is easy to get values of the day: just get find({code: "3443-12"})[0].timedata[2013][6][13].
When I get new data, I just push it into the array of the existing document, which eventually grows from zero to 7 MB.
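A hedged sketch of that push, assuming the registrator documents live in a registrators collection (the collection name is an assumption; the values are illustrative):

db.registrators.updateOne(
    { code: "3443-12" },
    { $push: { "timedata.2013.6.13": { t: 1391353679, d: 213.17 } } }
);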
Questions
What is the stored size of the {t:1391345679, d:213.12} entry? Is it less than 100 bytes?
Is this the right way to organize the database for such purposes?
100k documents of 5 MB each = 500 GB. Does MongoDB stay fast when the database size is much larger than RAM?
Update
I decided to store time not as a timestamp but as time in seconds from the start of a day: 0 - 86399: {t: 86123, d: 213.12}.
Regarding your last question, " Does MongoDB deal fast with database size much more than RAM size?" the answer is it can, but it depends on a number of factors.
MongoDB works best when the working set fits within the memory available to MongoDB. When it does not, you tend to see rather rapid performance declines. How big that working set is depends on your database schema, the indexes you have built, and your data access patterns.
Let's say you have a year's worth of data in your database, but regularly only touch the last few days of data. Then your working set is likely to be composed of the memory required to keep the last few days of data in memory, plus enough of the indexes in memory for you to properly update and read from them.
Alternatively, if you are randomly accessing data across the whole year and have a high update volume, you may have a significantly larger working set to deal with.
As a point of comparison, I've got a production MongoDB instance that has around 500M documents in it, taking up around 2 TB of disk storage. Total memory on the primary of the replica set is 128GB (1/16th the total storage) and we're not experiencing any performance problems.
The key for all of it though is how much data do you access over time. The killer for MongoDB performance is memory contention, when you are paging out data to service a new request only to re-page that old data right back in. And it gets far worse if you cannot keep your indexes in memory.
I've tested it and it is less than 100 B; indeed, it is 48 B:
var num = 100000;
for (var i = 0; i < num; i++) {
    db.foo.insert({t: 1391345679, d: 213.12});
}
db.foo.stats().avgObjSize  // => Outputs 48
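Alternatively, the legacy mongo shell can report the BSON size of a single document directly; note this counts only the fields you pass in, not the _id the server adds on insert:

Object.bsonsize({ t: 1391345679, d: 213.12 })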
It looks like what you are doing is kind of a hack to avoid normalising your data (maybe for transactional purposes?), and sooner or later you may run into problems (e.g. requirements change, the size of your data changes, new fields are introduced, etc.). I do not know your schema and domain, but if you go with the denormalized model as you are doing, you must be sure that documents will not exceed the 16 MB size limit. That being said, I would recommend the schema design article.
Answers:
The previous answer gives a hint about the document size. You can use it as a starting point.
Choosing an effective data model depends on your application needs. The main question is the decision to denormalize or use linking. Note that, generally, with denormalized data you achieve better performance for read operations, as well as the ability to request and retrieve related data in a single database operation. Embedding also makes it possible to update a document in a single atomic write operation. So, use embedding (denormalization) when:
- you have “contains” relationships between entities. See Model One-to-One Relationships with Embedded Documents.
- you have one-to-many relationships between entities, where the “many” or child documents always appear with, or are viewed in the context of, the “one” or parent documents. See Model One-to-Many Relationships with Embedded Documents.
In your situation your documents will grow after creation, which can impact write performance and lead to data fragmentation. You can control this with the padding factor.
- About the performance: it depends on how you create your indexes and, more importantly, on your access patterns. For each frequently executed query, check the output of explain() to see how many documents were examined.
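For example (a hedged sketch; the collection and filter are illustrative):

db.registrators.find({ code: "3443-12" }).explain("executionStats").executionStats.totalDocsExamined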

Best Mongodb Data Model for Response time statistic website

In my project, I have servers that send ping requests to websites, measure their response time, and store it every minute.
I'm going to use MongoDB and I'm searching for the best data model.
Which data model is better?
1- a collection for each website, with each request as a document
(1000 collections)
or
2- one collection for all websites, with each website as a document and each request as a sub-document.
Both solutions run into a particular limitation of MongoDB. With the first one, where each website gets its own collection, the limitation is the number of collections: each collection needs a namespace entry, and the namespace file is 16 MB, so around 16,000 entries fit (the namespace size can be increased). In my opinion this is the much better solution, since you said 1000 collections are expected and that can be handled. (Keep in mind that indexes have their own namespace entries and count toward the 16,000.) In this case you store the entries as documents, which are generally much easier to work with afterwards than an embedded array.
Embedded array limitation. The limitation in the second case is a hard one: your documents cannot grow bigger than 16 MB. That is the BSON size limit, and it can hold quite a lot, but if you use huge documents that vary in size and change size over time, your storage will become fragmented. The reason will be clear if you watch this webinar. Basically this is about the worst thing you can do in terms of storage usage.
If you are likely to use the aggregation framework for further analysis, it will also be harder with the embedded-array approach.
You could do either, but I think you will have to factor in periodic growth of the database in either case. During the expansion of data files the database will be slow/unresponsive. (There might be a setting so this happens in the background; I forget.)
A related question - MongoDB performance with growing data structure, specifically the "Padding Factor"
With the first approach, there is an upper limit to the number of websites you can store, imposed by the maximum number of collections. You can do the calculations based on http://docs.mongodb.org/manual/reference/limits/.
With the second approach, while the number of collections doesn't matter as much, the growth of the database is something you will want to consider.
One approach is to initialize documents with empty data, so they last longer before expanding.
For instance:
{
    website: name,
    responses: [
        { time: Jan 1, 2013, 0:1, ... },
        { time: Jan 1, 2013, 0:2, ... },
        ... and so on for each minute/interval you expect
    ]
}
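A hedged sketch of generating such a pre-allocated document in the shell (the websites collection, the field names, and the null placeholder values are assumptions):

var slots = [];
for (var m = 0; m < 24 * 60; m++) {
    // Date.UTC normalizes minute overflow into hours, so this walks through the whole day
    slots.push({ time: new Date(Date.UTC(2013, 0, 1, 0, m)), value: null });
}
db.websites.insert({ website: "example.com", responses: slots });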
The downside is that it might take you longer to initialize, but you won't have to worry about growth later.
Either way, it is a cost you will have to pay. The only question is when: now or later?
Consider reading their use cases, particularly http://docs.mongodb.org/manual/use-cases/hierarchical-aggregation/