I am designing a MongoDB database that looks something like this:
registry:{
id:1,
duration:123,
score:3,
text:"aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
The text field is very large compared to the rest. I sometimes need to run analytics queries that average the duration or the score but never use the text.
I also have more specific queries that retrieve all the information about a single document. For these queries I could afford to spend more time and make two queries to retrieve all the data.
My question is, if I make a query like this:
db.registries.aggregate( [
{
$group: {
_id: null,
averageDuration: { $avg: "$duration" },
}
}
] )
Would it need to read the data from the text field? That would make the query much slower and use a lot of RAM. If that is the case, it would be better to split the records in two and have something like this, right?
registry:{
id:1,
duration:123,
score:3,
}
registry_text:{
id:1,
text:"aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
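A minimal sketch of that split in plain JavaScript, so it runs without a server. The helper name splitRegistry and the two-part result shape are hypothetical; they only illustrate keeping the small numeric fields apart from the large text under a shared id.

```javascript
// Hypothetical helper: split one registry document into a small "metrics"
// part and a large "text" part that share the same id.
function splitRegistry(doc) {
  const { text, ...metrics } = doc;
  return {
    registry: metrics,                      // id, duration, score
    registryText: { id: doc.id, text },     // id, text
  };
}

const original = { id: 1, duration: 123, score: 3, text: "aaaaaaaaaaaaaaaaaaaaaaaaaaaa" };
const parts = splitRegistry(original);
console.log(parts.registry);      // { id: 1, duration: 123, score: 3 }
console.log(parts.registryText);
```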
Thanks a lot!
I don't know how the server works in this case, but I expect that, for caching reasons, the server will load complete documents into memory when it reads them from disk. Disk reads are very slow (= expensive in time taken), and I expect the server will aggressively use memory, if it can, to avoid reads.
An important note here is that the documents are stored on disk as lists of key-value pairs comprising their contents. To not load a field from disk the server would have to rebuild the document in question as part of reading it since there are length fields involved. I don't see this happening in practice.
So, once the documents are in memory I assume they are there with all of their fields and I don't expect you can tune this.
When you are querying, the server may or may not drop individual fields but this would only change the memory requirements for the particular query. Generally these memory requirements are dwarfed by the overall database cache size and aggregation pipelines. So I don't think it really matters at what point a large field is dropped from a document during query processing (assuming you project it out in the query).
I think this isn't a worthwhile matter to try to ponder/optimize. If you have a real system with real workloads, you'll be much more pressed to optimize something else.
If you are concerned with memory usage when the amount of available memory is consumer-sized (say, under 16 GB), just get more memory - it's insanely cheap given how much time you'd spend working around the lack of it (whether we are talking about provisioning bigger AWS instances or buying more sticks of RAM).
You should be able to use $project to limit the fields read.
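As a sketch of what that could look like for the averaging query in the question: the pipeline below keeps only duration before grouping. It is built as a plain array so it can be inspected without a server; the collection name registries is taken from the question. It shows the shape of the stages only and makes no claim about how much data the server reads from disk.

```javascript
// Pipeline sketch: drop everything except `duration` before grouping.
const pipeline = [
  { $project: { _id: 0, duration: 1 } },
  { $group: { _id: null, averageDuration: { $avg: "$duration" } } },
];

// In the mongo shell this would be run as:
//   db.registries.aggregate(pipeline)
console.log(JSON.stringify(pipeline, null, 2));
```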
As general advice, don't try to normalize the data with MongoDB as you would with SQL. Also, it's often more performant to read plain documents from the DB and do the processing on your server.
I have found this answer, which seems to indicate that a projection still has to fetch the whole document on the database server; it only reduces bandwidth:
When using projection to remove unused fields, the MongoDB server will
have to fetch each full document into memory (if it isn't already
there) and filter the results to return. This use of projection
doesn't reduce the memory usage or working set on the MongoDB server,
but can save significant network bandwidth for query results depending
on your data model and the fields projected.
https://dba.stackexchange.com/questions/198444/how-mongodb-projection-affects-performance
Related
I am planning to use a nested document structure for my MongoDB schema design. I don't want a flat schema design because in my case I need to fetch my result in a single query.
However, MongoDB has a size limit for a document.
MongoDB Limits and Thresholds
A MongoDB document has a size limit of 16 MB. If your subcollection can grow without limits, go flat.
I don't need to fetch my nested data; I only need it for filtering and querying.
I want to know whether I will still be bound by the MongoDB size limit even if I use my embedded data only for querying and filtering and never fetch it, because as per my understanding, in this case, MongoDB won't load the complete document in memory but only the selected fields?
Nested schema design example
{
clinicName: "XYZ Hospital",
clinicAddress: "ABC place.",
"doctorsWorking":{
"doctorId1":{
"doctorJoined": ISODate("2017-03-15T10:47:47.647Z")
},
"doctorId2":{
"doctorJoined": ISODate("2017-04-15T10:47:47.647Z")
},
"doctorId3":{
"doctorJoined": ISODate("2017-05-15T10:47:47.647Z")
},
...
...
//upto 30000-40000 more records suppose
}
}
I don't think your understanding is correct when you say "because as per my understanding, in this case, MongoDB won't load the complete document in memory but only the selected fields?".
The MongoDB documentation reads:
The maximum BSON document size is 16 megabytes. The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API.
So the limit is clearly 16 MB on document size. MongoDB should stop you from saving a document larger than this.
Suppose for a moment that your understanding were right: MongoDB would allow a document of any size to be saved, but no more than 16 MB of it could be in RAM. While storing the data, though, the server can't know what queries will be run on it later. You would end up inserting big documents that can't be used afterwards (at insert time we don't declare the query pattern, and we could later try to fetch the full document in a single shot).
If the limit were on transmission (hypothetically), there are plenty of ways developers could bring data into RAM in chunks without ever crossing 16 MB at once (that's how I/O on large files is done). The limit would be trivially sidestepped and become useless. I hope the MongoDB creators knew this and didn't want it to happen.
Also, if the limit were on transmission, there would be no need for separate collections. We could put everything in a single collection, write smart queries, and fetch the data in parts whenever a result crossed 16 MB. But it doesn't work that way.
So the limit must be on document size; otherwise it would create many issues.
In my opinion, if you only need the "doctorsWorking" data for filtering or querying (and you think "doctorsWorking" could push the document past the 16 MB limit), it's better to keep it in a separate collection.
Ultimately everything depends on the query and data patterns. If a doctor can serve in multiple hospitals in shifts, keeping doctors in a separate collection makes even more sense.
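One way the separate collection could be shaped, as a rough sketch: each entry of the embedded doctorsWorking map becomes its own small document that points back at the clinic. The helper name flattenDoctors, the clinicId value and the output field names are all hypothetical.

```javascript
// Hypothetical helper: turn the embedded `doctorsWorking` map into one
// small document per (clinic, doctor) pair, ready for a separate collection.
function flattenDoctors(clinic, clinicId) {
  return Object.entries(clinic.doctorsWorking).map(([doctorId, info]) => ({
    clinicId,
    doctorId,
    doctorJoined: info.doctorJoined,
  }));
}

const clinic = {
  clinicName: "XYZ Hospital",
  clinicAddress: "ABC place.",
  doctorsWorking: {
    doctorId1: { doctorJoined: new Date("2017-03-15T10:47:47.647Z") },
    doctorId2: { doctorJoined: new Date("2017-04-15T10:47:47.647Z") },
  },
};

const rows = flattenDoctors(clinic, "clinic-1");
console.log(rows.length); // 2
console.log(rows[0]);
```

Each resulting document stays tiny no matter how many doctors a clinic has, so the 16 MB limit stops being a concern for this data.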
First of all, I am using MongoDB 3.0 with the new WiredTiger storage engine. Also using snappy for compression.
The use case I am trying to understand and optimize for from a technical point of view is the following;
I have a fairly large collection, with about 500 million documents that takes about 180 GB including indexes.
Example document:
{
_id: 123234,
type: "Car",
color: "Blue",
description: "bla bla"
}
Queries consist of finding documents with a specific field value. Like so;
thing.find( { type: "Car" } )
In this example the type field should obviously be indexed. So far so good. However the access pattern for this data will be completely random. At a given time I have no idea what range of documents will be accessed. I only know that they will be queried on indexed fields, returning at the most 100000 documents at a time.
What this means in my mind is that the caching in MongoDB/WiredTiger is pretty much useless. The only thing that needs to fit in the cache are the indexes. An estimation of the working set is hard if not impossible?
What I am looking for is mostly tips on what kinds of indexes to use and how to configure MongoDB for this kind of use case. Would other databases work better?
Currently I find MongoDB to work quite well on somewhat limited hardware (16 GB RAM, non-SSD disk). Queries return in decent time, and obviously instantly if the result set is already in the cache. But as already stated, this will most likely not be the typical case. It is not critical that the queries are lightning fast; it matters more that they are dependable and that the database runs in a stable manner.
EDIT:
Guess I left out some important things. The database will be mostly for archival purposes. As such, data arrives from another source in bulk, say once a day. Updates will be very rare.
The example I used was a bit contrived but in essence that is what queries look like. When I mentioned multiple indexes I meant the type and color fields in that example. So documents will be queried on using these fields. As it is now, we only care about returning all documents that have a specific type, color etc. Naturally, the plan we have is to only query on fields that we have an index for. So ad-hoc queries are off the table.
Right now the index sizes are quite manageable. For the 500 million documents, each of these indexes is about 2.5 GB and fits easily in RAM.
Regarding the average data size of an operation, I can only speculate at this point. As far as I know, typical operations return about 20k documents, with an average object size in the range of 1200 bytes. This is the stat reported by db.stats(), so I guess it is for the compressed data on disk, and not how much it actually takes once in RAM.
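Putting those figures together gives a rough sense of scale. This is only back-of-the-envelope arithmetic on the numbers quoted in this question; the count of two indexed fields (type and color) is taken from the example, and the variable names are made up.

```javascript
// All inputs are the figures quoted in the question.
const docsPerTypicalOperation = 20000;
const docsPerLargestOperation = 100000;
const avgObjectBytes = 1200;
const bytesPerIndex = 2.5e9;
const indexedFields = 2; // type and color, as in the example

const typicalBytes = docsPerTypicalOperation * avgObjectBytes;
const largestBytes = docsPerLargestOperation * avgObjectBytes;
const totalIndexBytes = bytesPerIndex * indexedFields;

console.log(typicalBytes / 1e6);    // 24  (MB per typical operation)
console.log(largestBytes / 1e6);    // 120 (MB for the largest operation)
console.log(totalIndexBytes / 1e9); // 5   (GB for both indexes)
```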
Hope this bit of extra info helped!
Basically, if you have a consistent rate of reads that are uniformly at random over type (which is what I'm taking
I have no idea what range of documents will be accessed
to mean), then you will see stable performance from the database. It will be doing some stable proportion of reads from cache, just by good luck, and another stable proportion by reading from disk, especially if the number and size of documents are about the same between different type values. I don't think there's a special index or anything to help you besides just better hardware. Indexes should remain in RAM because they'll be in constant use.
I suppose more information would help, as you mention only one simple query on type but then talk about having multiple indexes to worry about keeping in RAM. How much data does the average operation return? Do you ever care to return a subset of docs of certain type or only all of them? What do inserts and updates to this collection look like?
Also, if the documents being read are truly completely random over the dataset, then the working set is all of the data.
There is a set of registrators, say 100k. Each registrator reports a value 24 times a day, something like 23.123. I need to save this value and its time. Then I need to calculate how the value changes over some period, e.g. 4jun2014 - 19jul2014. To do this I have to find the last value of 3jun2014 and the last value of 19jul2014.
First I am trying to estimate the size of the data stored by one registrator. Time+value must be lower than 100 bytes. 1 year is < 100*24*365 = 876 kB of data, so I can easily store 10 years of data (since 8.76 MB < 16 MB limit) in one document. I decided not to store registered data in a registeredData collection but to embed it in the registrator object as a tree timedata->year->month->day:
{
code: '3443-12',
timedata: {
2013: {
6: {
13: [
{t:1391345679, d:213.12},
{t:1391349679, d:213.14},
]
}
}
}
}
So it is easy to get the values of a day: just get find({code: "3443-12"})[0].timedata[2013][6][13].
When I get new data, I just push it into the array of the existing document, which eventually grows from zero to 7 MB.
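The size estimate can be worked through explicitly. The sketch below uses the 100-byte upper bound per reading and 24 readings a day from the description above; 100 × 24 × 365 comes to 876,000 bytes a year, which still leaves ten years comfortably under the 16 MB document limit. Variable names are made up for illustration.

```javascript
const bytesPerReading = 100;  // upper bound assumed above
const readingsPerDay = 24;

const bytesPerYear = bytesPerReading * readingsPerDay * 365;
const bytesPerTenYears = bytesPerYear * 10;
const documentLimit = 16 * 1024 * 1024; // 16 MB BSON document limit

console.log(bytesPerYear);                     // 876000
console.log(bytesPerTenYears);                 // 8760000
console.log(bytesPerTenYears < documentLimit); // true
```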
Questions
What is the stored size of a {t:1391345679, d:213.12} entry; is it less than 100 bytes?
Is this the right way to organize the database for such purposes?
100k documents with 5 MB size = 500 GB. Does MongoDB deal fast with database size much more than RAM size?
Update
I decided to store time not as a timestamp but as time in seconds from the start of a day: 0 - 86399: {t: 86123, d: 213.12}.
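A rough sketch of how a reading could be filed into that tree with the seconds-of-day encoding, and how the last value of a day could be read back. The helper names addReading and lastValueOfDay are hypothetical, days are assumed to be UTC days, and readings are assumed to arrive in time order.

```javascript
// Hypothetical helper: file a reading under timedata[year][month][day],
// storing the time as seconds since the start of the (UTC) day.
function addReading(doc, unixSeconds, value) {
  const date = new Date(unixSeconds * 1000);
  const year = date.getUTCFullYear();
  const month = date.getUTCMonth() + 1; // 1-12
  const day = date.getUTCDate();
  const secondsOfDay = unixSeconds % 86400; // 0 - 86399

  const timedata = (doc.timedata = doc.timedata || {});
  const months = (timedata[year] = timedata[year] || {});
  const days = (months[month] = months[month] || {});
  const readings = (days[day] = days[day] || []);
  readings.push({ t: secondsOfDay, d: value });
  return doc;
}

// Last value recorded on a given day, or undefined if there is none.
function lastValueOfDay(doc, year, month, day) {
  const readings = (((doc.timedata || {})[year] || {})[month] || {})[day] || [];
  return readings.length ? readings[readings.length - 1].d : undefined;
}

const registrator = { code: "3443-12" };
addReading(registrator, Date.UTC(2013, 5, 13, 22, 0, 0) / 1000, 213.12);
addReading(registrator, Date.UTC(2013, 5, 13, 23, 55, 23) / 1000, 213.14);

console.log(registrator.timedata[2013][6][13]);
console.log(lastValueOfDay(registrator, 2013, 6, 13)); // 213.14
```

The change over a period is then the difference between two lastValueOfDay results, one for the day before the period starts and one for its last day.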
Regarding your last question, "Does MongoDB deal fast with database size much more than RAM size?", the answer is that it can, but it depends on a number of factors.
MongoDB works best when the working set fits within the memory available to MongoDB. When it does not, you tend to see rather rapid performance declines. How big that working set is depends on the database schema, the indexes built, and your data access patterns.
Let's say you have a year's worth of data in your database, but regularly only touch the last few days of data. Then your working set is likely to be composed of the memory required to keep the last few days of data in memory, plus enough of the indexes in memory for you to properly update and read from them.
Alternatively, if you are randomly accessing data across a year and have a high update volume, you may have a significantly larger working set to deal with.
As a point of comparison, I've got a production MongoDB instance that has around 500M documents in it, taking up around 2 TB of disk storage. Total memory on the primary of the replica set is 128GB (1/16th the total storage) and we're not experiencing any performance problems.
The key to all of it, though, is how much data you access over time. The killer for MongoDB performance is memory contention, when you are paging out data to service a new request only to page that old data right back in. And it gets far worse if you cannot keep your indexes in memory.
I've tested it and it is indeed less than 100 B; it is 48 B:
var num = 100000;
for (var i = 0; i < num; i++) {
  // each insert creates a top-level document, so an _id is added automatically
  db.foo.insert({t: 1391345679, d: 213.12});
}
db.foo.stats().avgObjSize // => Outputs 48
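The measurement can be cross-checked by counting bytes by hand from the BSON layout (a 4-byte length, then one type byte, a null-terminated field name and the value per element, then a terminating zero byte). The sketch below assumes the shell stored both numbers as 8-byte doubles and added a 12-byte ObjectId _id; the function name bsonSize is made up. By this count the top-level document is 44 bytes, so the reported 48 presumably includes some storage padding, which is an assumption and not something measured here.

```javascript
// Value sizes in bytes for the BSON types used in this example.
const VALUE_BYTES = { double: 8, int32: 4, objectId: 12 };

// Size of a flat BSON document: int32 length prefix + elements + 0x00 terminator.
// Each element is 1 type byte + field name as a C string + the value.
function bsonSize(fields) {
  let size = 4 + 1;
  for (const [name, type] of Object.entries(fields)) {
    size += 1 + Buffer.byteLength(name, "utf8") + 1 + VALUE_BYTES[type];
  }
  return size;
}

// Top-level document as inserted in the test above (with the automatic _id).
const topLevel = bsonSize({ _id: "objectId", t: "double", d: "double" });
// The same reading as an embedded sub-document has no _id.
const embedded = bsonSize({ t: "double", d: "double" });

console.log(topLevel); // 44
console.log(embedded); // 27
```

Inside an array each embedded reading also carries a type byte and its array index as a key, so the per-reading cost in the nested design is a little above 27 bytes but still well under the 100-byte budget.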
It looks like what you are doing is a kind of hack to avoid normalising your data (maybe for transaction purposes?), and sooner or later you may run into problems (e.g. requirements change, the size of your data changes, new fields are introduced, etc.). I do not know your schema and domain, but if you go with the denormalized model as you are doing, you must be sure that documents will not exceed the size limit of 16MB. That being said, I would recommend the schema design article.
Answers:
The previous answer gives a hint about the document size. You can use it as a starting point.
Choosing an effective data model depends on your application's needs. The main question is whether to denormalize or to use linking. Note that with denormalized data you generally achieve better performance for read operations, as well as the ability to request and retrieve related data in a single database operation. Embedding also makes it possible to update a document in a single atomic write operation (transactionally). So, when to use embedded (denormalized) models:
- you have “contains” relationships between entities. See Model One-to-One Relationships with Embedded Documents.
- you have one-to-many relationships between entities. In these relationships the “many” or child documents always appear with or are viewed in the context of the “one” or parent documents. See Model One-to-Many Relationships with Embedded Documents.
In your situation your documents will grow after creation, which can impact write performance and lead to data fragmentation. You can control this with the padding factor.
- About the performance: it depends on how you create your indexes and, more importantly, on your access patterns. For each frequently executed query, check the output of explain() to see how many documents were examined.
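As an illustration of what to look for, the sketch below compares documents examined with documents returned. The sample object is a made-up excerpt, not real output, and the field names (executionStats.nReturned, totalDocsExamined, totalKeysExamined) are those used by explain("executionStats") in MongoDB 3.0 and later; older versions report the same idea under different names. The helper name is hypothetical.

```javascript
// Made-up excerpt in the shape of explain("executionStats") output.
const sampleExplain = {
  executionStats: {
    nReturned: 100,
    totalKeysExamined: 100,
    totalDocsExamined: 25000,
  },
};

// Hypothetical helper: how many documents were examined per document returned.
// A value near 1 suggests the index is doing the work; a large value suggests scanning.
function docsExaminedPerReturned(explainOutput) {
  const stats = explainOutput.executionStats;
  if (stats.nReturned === 0) return Infinity;
  return stats.totalDocsExamined / stats.nReturned;
}

console.log(docsExaminedPerReturned(sampleExplain)); // 250
```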
In my project, I have servers that send ping requests to websites, measure their response time, and store it every minute.
I'm going to use MongoDB and I'm searching for the best data model.
Which data model is better?
1- have a collection for each website and each request as a document.
(1000 collections)
or
2- have one collection for all websites, with each website as a document and each request as a sub-document.
Both solutions run into a limitation of MongoDB. With the first one, a collection per website, the limitation is the number of collections: each one has a namespace entry, and the namespace file is 16 MB, so around 24,000 entries fit in it (the size of the namespace file can be increased). In my opinion this is the much better solution, since you said 1000 collections are expected and that can be handled. (Note that indexes have their own namespace entries and count toward the 24,000.) In this case you store the entries as documents, which are generally much easier to handle afterwards than an embedded array.
Embedded array limitation. The limitation in the second case is a hard one: your documents cannot grow bigger than 16 MB. This is the BSON size limit, and a document can store quite a lot, but if you use huge documents that vary in size and change size over time, your storage will get fragmented. The reason will be clear if you watch this webinar. Basically this is the worst you can do in terms of storage usage.
If you are likely to use the aggregation framework for further analysis, that will also be harder with the embedded array approach.
You could do either, but I think you will have to factor in periodic growth of the database in either case. During the expansion of data files the database will be slow or unresponsive. (There might be a setting so this happens in the background; I forget.)
A related question - MongoDB performance with growing data structure, specifically the "Padding Factor"
With first approach, there is an upper limit to number of websites you can store imposed by max number of collections. You can do the calculations based on http://docs.mongodb.org/manual/reference/limits/.
In the second approach, the number of collections doesn't matter as much, but growth of the database is something you will want to consider.
One approach is to initialize it with empty data, so it lasts longer before expanding.
For instance:
{
website: name,
responses: [{
time: Jan 1, 2013, 0:1, ...
},
{
time: Jan 1, 2013, 0:2, ...
}
... and so for each minute/interval you expect.
]
}
The downside is that it might take you longer to initialize, but you won't have to worry about this later.
Either way, it is a cost you will have to pay. The only question is when: now or later?
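A minimal sketch of that pre-allocation for one day of per-minute slots. The helper name preallocateDay, the responseMs field and the -1 placeholder are hypothetical; the placeholder is a number so that it has the same type as the measurement that will later overwrite it.

```javascript
// Hypothetical helper: build a website document with one placeholder slot per
// minute of a day, so later writes overwrite slots instead of growing the array.
function preallocateDay(website, dayStart) {
  const MINUTES_PER_DAY = 24 * 60;
  const responses = [];
  for (let i = 0; i < MINUTES_PER_DAY; i++) {
    responses.push({
      time: new Date(dayStart.getTime() + i * 60000),
      responseMs: -1, // placeholder, to be overwritten by the real measurement
    });
  }
  return { website, responses };
}

const siteDoc = preallocateDay("example.com", new Date(Date.UTC(2013, 0, 1)));
console.log(siteDoc.responses.length);                // 1440
console.log(siteDoc.responses[1].time.toISOString()); // 2013-01-01T00:01:00.000Z
```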
Consider reading their use cases, particularly http://docs.mongodb.org/manual/use-cases/hierarchical-aggregation/
I have statistical data in a MongoDB collection, saved for each record per day.
For example my collection looks roughly like
{ record_id: 12345, date: Date(2011,12,13), stat_value_1:12345, stat_value_2:98765 }
Each record_id/date combo is unique. I query the collection to get statistics per record for a given date range using map-reduce.
As far as read query performance goes, is this strategy superior to storing one document per record_id containing an array of statistical data, just like the above dict:
{ _id: record_id, stats: [
{ date: Date(2011,12,11), stat_value_1:39884, stat_value_2:98765 },
{ date: Date(2011,12,12), stat_value_1:38555, stat_value_2:4665 },
{ date: Date(2011,12,13), stat_value_1:12345, stat_value_2:265 },
]}
On the pro side, I will need one query to get the entire stat history of a record without resorting to the slower map-reduce method. On the con side, I'll have to sum up the stats for a given date range in my application code, and if a record outgrows its current padding size-wise there's some disk reallocation that will go on.
I think this depends on the usage scenario. If the data set for a single aggregation is small like those 700 records and you want to do this in real-time, I think it's best to choose yet another option and query all individual records and aggregate them client-side. This avoids the Map/Reduce overhead, it's easier to maintain and it does not suffer from reallocation or size limits. Index use should be efficient and connection-wise, I doubt there's much of a difference: most drivers batch transfers anyway.
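A minimal sketch of that client-side aggregation, using the sample values from the question. The helper names sumStats and day are made up; the input is simply an array of per-day records, as a query on record_id and a date range would return them.

```javascript
// Hypothetical helper: sum each stat over an inclusive date range.
function sumStats(records, from, to) {
  const totals = { stat_value_1: 0, stat_value_2: 0 };
  for (const r of records) {
    if (r.date >= from && r.date <= to) {
      totals.stat_value_1 += r.stat_value_1;
      totals.stat_value_2 += r.stat_value_2;
    }
  }
  return totals;
}

// Build a UTC date from a 1-based month.
const day = (y, m, d) => new Date(Date.UTC(y, m - 1, d));

const records = [
  { record_id: 12345, date: day(2011, 12, 11), stat_value_1: 39884, stat_value_2: 98765 },
  { record_id: 12345, date: day(2011, 12, 12), stat_value_1: 38555, stat_value_2: 4665 },
  { record_id: 12345, date: day(2011, 12, 13), stat_value_1: 12345, stat_value_2: 265 },
];

console.log(sumStats(records, day(2011, 12, 12), day(2011, 12, 13)));
// { stat_value_1: 50900, stat_value_2: 4930 }
```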
The added flexibility might come in handy, for instance if you want to know the stat value for a single day across all records (if that ever makes sense for your application). Should you ever need to store more stat_values, your maximum number of dates per records would go down in the subdocument approach. It's also generally easier to work with db documents rather than subdocuments.
Map/Reduce really shines if you're aggregating huge amounts of data across multiple servers, where otherwise bandwidth and client concurrency would be bottlenecks.
I think you can refer to this here, and also see how foursquare solves this kind of problem here. Both are valuable.