Mongodb data storage performance - one doc with items in array vs multiple docs per item

Mongodb data storage performance - one doc with items in array vs multiple docs per item - mongodb

I have statistical data in a Mongodb collection saved for each record per day.
For example my collection looks roughly like
{ record_id: 12345, date: Date(2011,12,13), stat_value_1:12345, stat_value_2:98765 }
Each record_id/date combo is unique. I query the collection to get statistics per record for a given date range using map-reduce.
As far as read query performance, is this strategy superior than storing one document per record_id containing an array of statistical data just like the above dict:
{ _id: record_id, stats: [
{ date: Date(2011,12,11), stat_value_1:39884, stat_value_2:98765 },
{ date: Date(2011,12,12), stat_value_1:38555, stat_value_2:4665 },
{ date: Date(2011,12,13), stat_value_1:12345, stat_value_2:265 },
]}
On the pro side I will need one query to get the entire stat history of a record without resorting to the slower map-reduce method, and on the con side I'll have to sum up the stats for a given date range in my application code and if a record outgrows is current padding size-wise there's some disc reallocation that will go on.

I think this depends on the usage scenario. If the data set for a single aggregation is small like those 700 records and you want to do this in real-time, I think it's best to choose yet another option and query all individual records and aggregate them client-side. This avoids the Map/Reduce overhead, it's easier to maintain and it does not suffer from reallocation or size limits. Index use should be efficient and connection-wise, I doubt there's much of a difference: most drivers batch transfers anyway.
The added flexibility might come in handy, for instance if you want to know the stat value for a single day across all records (if that ever makes sense for your application). Should you ever need to store more stat_values, your maximum number of dates per records would go down in the subdocument approach. It's also generally easier to work with db documents rather than subdocuments.
Map/Reduce really shines if you're aggregating huge amounts of data across multiple servers, where otherwise bandwidth and client concurrency would be bottlenecks.

I think you can reference to here, and also see foursquare how to solve this kind of problem here . They are both valuable.

Related

MongoDB: reduce read size and RAM needed with project?

I am designing a MongoDB database that looks something like this:
registry:{
id:1,
duration:123,
score:3,
text:"aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
The text field is very big compared to the rest. I sometimes need to perform analytics queries that average the duration or the score, but never use the text.
I have queries that are more specific, and retrieve all the information about a single document. But in this queries I could spend more time making two queries to retrieve all the data.
My question is, if I make a query like this:
db.registries.aggregate( [
{
$group: {
_id: null,
averageDuration: { $avg: "$duration" },
}
}
] )
Would it need to read the data from the transcript field? That would make the query much slower and it would take a lot of RAM. If that is the case it would be better to split the records in two and have something like this right?:
registry:{
id:1,
duration:123,
score:3,
}
registry_text:{
id:1,
text:"aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
Thanks a lot!

I don't know how the server works in this case but I expect that, for caching reasons, the server will load complete documents into memory when it reads them from disk. Disk reads are very slow (= expensive in time taken) and I expect server will aggressively use memory if it can to avoid reads.
An important note here is that the documents are stored on disk as lists of key-value pairs comprising their contents. To not load a field from disk the server would have to rebuild the document in question as part of reading it since there are length fields involved. I don't see this happening in practice.
So, once the documents are in memory I assume they are there with all of their fields and I don't expect you can tune this.
When you are querying, the server may or may not drop individual fields but this would only change the memory requirements for the particular query. Generally these memory requirements are dwarfed by the overall database cache size and aggregation pipelines. So I don't think it really matters at what point a large field is dropped from a document during query processing (assuming you project it out in the query).
I think this isn't a worthwhile matter to try to ponder/optimize. If you have a real system with real workloads, you'll be much more pressed to optimize something else.
If you are concerned with memory usage when the amount of available memory is consumer-sized (say, under 16 gb), just get more memory - it's insanely cheap given how much time you'd spend working around lack of it (whether we are talking about provisioning bigger AWS instances or buying more sticks of RAM).

You should be able to use $project to limit the fields read.
As a general advice, don't try to normalize the data with MongoDB as you would with SQL. Also, it's often more performant to read documents plain from DB and do the processing on your server.

I have found this answer that seems to indicate that project needs to fetch all document in the database server, it only reduces bandwith
When using projection to remove unused fields, the MongoDB server will
have to fetch each full document into memory (if it isn't already
there) and filter the results to return. This use of projection
doesn't reduce the memory usage or working set on the MongoDB server,
but can save significant network bandwidth for query results depending
on your data model and the fields projected.
https://dba.stackexchange.com/questions/198444/how-mongodb-projection-affects-performance

Mongodb Index or not to index

quick question on whether to index or not. There are frequent queries to a collection that looks for a specific 'user_id' within an array of a doc. See below -
_id:"bQddff44SF9SC99xRu",
participants:
[
{
type:"client",
user_id:"mi7x5Yphuiiyevf5",
screen_name:"Bob",
active:false
},
{
type:"agent",
user_id:"rgcy6hXT6hJSr8czX",
screen_name:"Harry",
active:false
}
]
}
Would it be a good idea to add an index to 'participants.user_id'? The array is added to frequently and occasionally items are removed.
Update
I've added the index after testing locally with the same set of data and this certainly seems to have decreased the high CPU usage on the mongo process. As there are only a small number of updates to these documents I think it was the right move. I'm looking at more possible indexes and optimisation now.

Why do you want to index? Do you have significant latency problems when querying? Or are you trying to optimise in advance?
Ultimately there are lots of variables here which make it hard to answer. Including but not limited to:
how often is the query made
how many documents in the collection
how many users are in each document
how often you add/remove users from the document after the document is inserted.
do you need to optimise inserts/updates to the collection
It may be that indexing isn't the answer, but rather how you have structured you data.

How to organize mongodb database for a huge set of time-value pairs for a lot of documents?

There is a set of registrators, say 100k. Every registrator 24 times a day gives value smth like 23.123. I need to save this value and time. Then I need to calculate how value changes for some period, e.g. 4jun2014 - 19jul2014: In order to do this I have to find last value of 3jun2014 and last value of 19jul2014.
First I am trying to estimate size of data stored by one registrator. Time+value must be lower than 100 bytes. 1 year is < 100*24*365 = 720kB of data, so I can easily store 10 years of data (since 7.2M < 16M limit) at my document. I decided not to store registered data at registeredData collection but to store registrator data embedded in registrator object as a tree timedata->year->month->day:
{
code: '3443-12',
timedata: {
2013: {
6: {
13: [
{t:1391345679, d:213.12},
{t:1391349679, d:213.14},
]
}
}
}
}
So it is easy to get values of the day: just get find({code: "3443-12"})[0].timedata[2013][6][13].
When I get new data, I just push it into array of existing document and it eventually grows from zero to 7Mb.
Questions
What is the stored size of {t:1391345679, d:213.12} line, is it less than 100bytes?
Is it right way to organize database for such purposes?
100k documents with 5Mb size = 500G. Does MongoDB deal fast with database size much more than RAM size?
Update
I decided to store time not as a timestamp but as time in seconds from the start of a day: 0 - 86399: {t: 86123, d: 213.12}.

Regarding your last question, " Does MongoDB deal fast with database size much more than RAM size?" the answer is it can, but it depends on a number of factors.
MongoDB works best when the working set fits within the memory available to MongoDB. When it does not you tend to see rather rapid performance declines. How big that working set is a function of database schema, indexes built and your data access patterns.
Let's say you have a years worth of data in your database, but regularly only touch the last few days of data. Then your working set is likely to be composed of the memory required to keep the last few days of data in memory, plus enough of the indexes in memory for you to properly update and read from them.
Alternatively, if you are randomly accessing data across a year and have high and update volume you may have a significantly larger working set to deal with.
As a point of comparison, I've got a production MongoDB instance that has around 500M documents in it, taking up around 2 TB of disk storage. Total memory on the primary of the replica set is 128GB (1/16th the total storage) and we're not experiencing any performance problems.
The key for all of it though is how much data do you access over time. The killer for MongoDB performance is memory contention, when you are paging out data to service a new request only to re-page that old data right back in. And it gets far worse if you cannot keep your indexes in memory.

I've tested it and it is less than 100 B, in deed, it is 48 B:
var num=100000;
for(i=0;i<num;i++){
db.foo.insert({t:1391345679, d:213.12})
};
db.foo.stats().avgObjSize // => Outputs 48

It looks like what you are doing is kind of a hack to avoid normalising your data (m.b. for transaction purposes?) and sooner or later you may run into problems (e.g. requirements change, size of your data changes, new fields are introduced etc.) I do not know your schema and domain, but if you go with denomarmalized model as you are doing you must be sure that documents will not exceed the size limit of 16MB. That being said, I would recommend schema design article.
Answers:
The previous answer gives a hint about the document size. You can use it as a starting point.
Choosing an effective data models depends on your application needs. The main question is the decision to denormalize or use linking. Note, generally with denormalized data you achieve better performance for read operations, as well as the ability to request and retrieve related data in a single database operation. Embedding makes it possible to update a document in a single atomic write operation (transactionally). So, when to use embedded (denormalized):
you have “contains” relationships between entities. See Model
One-to-One Relationships with Embedded Documents.
you have one-to-many relationships between entities. In these relationships the “many” or
child documents always appear with or are viewed in the context of the
“one” or parent documents. See Model One-to-Many Relationships with
Embedded Documents.
In your situation your documents will grow after creation which can impact write performance and lead to data fragmentation. You can control this with padding factor.
- About the performance: it depends on how you create your indexes. More importantly, on your access patterns. For each query executed often, check out the output from explain() to see how many documents have been checked.

Best Mongodb Data Model for Response time statistic website

In my project, I have servers that will send ping request to websites, measuring their response time and store it every minute.
I'm going to use Mongodb and i'm searching for best data model.
which data model is better?
1- have a collection for each website and each request as a document.
(1000 collection)
or
2- have a collection for all websites and each website as a document and each request as sub-document.

Both solutions should face of one certain limitation of mongodb. With the first one, that you said each website a collection, the limitation is in the number of the collections while each one will have a namespace entry and the namespace size is 16MB so around 16.000 entries can fit in. (the size of the namespace can be increased) In my opinion this is a much better solution while you said 1000 collections are expected and it can be handled. (Should be considered that indexes has their own namespace entries and count in the 16.000). In this case you can store the entries as documents you can handle them after generally much easier than with the embedded array.
Embedded array limitation. This limitation in the second case is a hard one. Your documents cannot grow bigger than 16MB. This one is BSON size and it can store quite many things inside documents but if you use huge documents which varies in size , and change size in time your storage will get fragmented. The reason is that will be clear if you watch this webinar . Basically this is the worth what you can do in terms of storage usage.
If you likely to use aggregation framework for further analysis it will be also harder with the embedded array concept.

You could do either, but I think you will have to factor in periodic growth in database for either case. During the expansion of datafiles database will be slow/unresponsive. (There might be a setting so this happens in the background - I forget ).
A related question - MongoDB performance with growing data structure, specifically the "Padding Factor"
With first approach, there is an upper limit to number of websites you can store imposed by max number of collections. You can do the calculations based on http://docs.mongodb.org/manual/reference/limits/.
In second approach, while #of collection don't matter as much, but growth of database is something you will want to consider.
One approach is to initialize it with empty data, so it takes lasts longer before expanding.
For instance.
{
website: name,
responses: [{
time: Jan 1, 2013, 0:1, ...
},
{
time: Jan 1, 2013, 0:2, ...
}
... and so for each minute/interval you expect.
]
}
The downside is, it might take you longer to initialize but you will have to worry about this later.
Either ways, it is a cost you will have to pay. The only question is when? Now? or later?
Consider reading their usecases, particularly - http://docs.mongodb.org/manual/use-cases/hierarchical-aggregation/

general questions about using mongodb

I'm thinking about trying MongoDB to use for storing our stats but have some general questions about whether I'm understanding it correctly before I actually start learning it.
I understand the concept of using documents, what I'm not too clear about is how much data can be stored inside each document. The following diagram explains the layout I'm thinking of:
Website (document)
- some keys/values about the particular document
- statistics (tree)
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
What got me excited about mongodb was the grouping functions such as:
http://www.mongodb.org/display/DOCS/Aggregation
db.test.group(
{ cond: {"invoked_at.d": {$gte: "2009-11", $lt: "2009-12"}}
, key: {http_action: true}
, initial: {count: 0, total_time:0}
, reduce: function(doc, out){ out.count++; out.total_time+=doc.response_time }
, finalize: function(out){ out.avg_time = out.total_time / out.count }
} );
But my main concern is how hard would that command for example be on the server if there is say 10's of millions of records across dozens of documents on a 512-1gb ram server on rackspace for example? Would it still run low load?
Is there any limit to the number of documents MongoDB can have (seperate databases)? Also, is there any limit to the number of records in a tree I explained above? Also, does that query I showed above run instantly or is it some sort of map/reduce query? Not very sure if I can execute that upon page load in our control panel to get those stats instantly.
Thanks!

Every document has a size limit of 4MB (which in text is A LOT).
It's recommended to run MongoDB in replication mode or to use sharding as you otherwise will have problems with single-server durability. Single-server durability is not given because MongoDB only fsync's to the disk every 60 seconds, so if your server goes down between two fsync's the data that got inserted/updated in that time will be lost.
There is no limit of documents other than your disk space in mongodb.
You should try to import a dataset that matches your data (or generate some test data) to MongoDB and analyse how fast your query executes. Remember to set indexes on those fields that you use heavily in your queries. Your above query should work pretty well even with a lot of data.
In order to analyze the speed of your query use the database profiler MongoDB comes with. On the mongo shell do:
db.setProfilingLevel(2); // to set the profiling level
[your query]
db.system.profile.find(); // to see the results
Remember to turn off profiling once you're finished (log will get pretty huge otherwise).
Regarding your database layout I suggest to change the "schema" (yeah yeah, schema less..) to:
website (collection):
- some keys/values about the particular document
statistics (collection)
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
+ DBRef to website
See Database References

Documents in MongoDB are limited to a size of 4MB. Let's say a single page view results in 32 bytes being stored. Then you'll be able to store about 130,000 page views in a single document.
Basically the amount of page views a page can generate is infinite, and you indicated that you expect millions of them, so I suggest you store the log entries as separate documents. Each log entry should contain the _id of the parent document.
The number of documents in a database is limited to 2GB of total space on 32-bit systems. 64-bit systems don't have this limitation.
The group() function is a map-reduce query under the hood. The documentation recommends you use a map-reduce query instead of group(), because it has some limitations with large datasets and sharded environments.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse