Using nested document structure in MongoDB

I am planning to use a nested document structure for my MongoDB schema rather than a flat one, because I need to fetch my result in a single query.
However, MongoDB has a size limit for a document.
MongoDB Limits and Threshold
A MongoDB document has a size limit of 16MB. If your subcollection can grow without limits, go flat.
I don't need to fetch the nested data; I only need it for filtering and querying.
I want to know whether I will still be bound by the MongoDB size limit even if I use my embedded data only for querying and filtering and never fetch it, because as per my understanding, in this case, MongoDB won't load the complete document in memory but only the selected fields?
Nested schema design example
{
clinicName: "XYZ Hospital",
clinicAddress: "ABC place.",
"doctorsWorking":{
"doctorId1":{
"doctorJoined": ISODate("2017-03-15T10:47:47.647Z")
},
"doctorId2":{
"doctorJoined": ISODate("2017-04-15T10:47:47.647Z")
},
"doctorId3":{
"doctorJoined": ISODate("2017-05-15T10:47:47.647Z")
},
...
...
// suppose up to 30000-40000 more records
}
}

I don't think your understanding is correct when you say "because as per my understanding, in this case, MongoDB won't load the complete document in memory but only the selected fields?".
The MongoDB documentation reads:
The maximum BSON document size is 16 megabytes. The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API.
So the clear limit is 16 MB on document size. MongoDB should stop you from saving a document larger than this.
Suppose for a moment that your understanding were right: MongoDB allows saving a document of any size, but more than 16 MB in RAM is not allowed. When storing the data, it doesn't know what queries will be run on it. So ultimately you would be inserting big documents which can't be used later (because while inserting we don't declare the query pattern; we could even try to fetch the full document in a single shot later).
If the limit were on transmission (hypothetically), then there are lots of ways (via code) software developers could bring data into RAM in chunks without ever crossing the 16 MB limit (that's how they do I/O operations on large files). They would make fun of this limit and render it useless. I hope MongoDB's creators knew this and didn't want it to happen.
Also, if the limit were on transmission then there wouldn't be any need for separate collections. We could put everything in a single collection, write smart queries and fetch the data. If the fetched data crossed 16 MB, fetch it in parts and forget the limit. But it doesn't work this way.
So the limit must be on document size, or else it would create many issues.
In my opinion, if you just need the "doctorsWorking" data for filtering or querying purposes (and if you also think that "doctorsWorking" will cause the document to cross the 16 MB limit), then it's good to keep it in a separate collection.
Ultimately everything depends on the query and data patterns. If a doctor can serve in multiple hospitals in shifts, then it would be great to keep doctors in a separate collection.
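As a rough sketch of what keeping the doctors in a separate collection could look like: the collection names `clinics` and `clinicDoctors` and the `clinicId`/`doctorId` fields are hypothetical, and the arrays below are only an in-memory stand-in for real collections so the shape can be tried without a server.

```javascript
// Hypothetical shape: one small document per clinic ...
const clinics = [
  { _id: "clinic1", clinicName: "XYZ Hospital", clinicAddress: "ABC place." },
  { _id: "clinic2", clinicName: "Another Clinic", clinicAddress: "DEF place." },
];

// ... and one document per (clinic, doctor) pair in a second collection,
// so no single document grows with the number of doctors.
const clinicDoctors = [
  { clinicId: "clinic1", doctorId: "doctorId1", doctorJoined: new Date("2017-03-15T10:47:47.647Z") },
  { clinicId: "clinic1", doctorId: "doctorId2", doctorJoined: new Date("2017-04-15T10:47:47.647Z") },
  { clinicId: "clinic2", doctorId: "doctorId3", doctorJoined: new Date("2017-05-15T10:47:47.647Z") },
];

// In-memory stand-in for a query such as
//   db.clinicDoctors.find({ doctorId: "doctorId2" })
// followed by a lookup of the matching clinics by _id.
function clinicsForDoctor(doctorId) {
  const clinicIds = new Set(
    clinicDoctors.filter((d) => d.doctorId === doctorId).map((d) => d.clinicId)
  );
  return clinics.filter((c) => clinicIds.has(c._id));
}

console.log(clinicsForDoctor("doctorId2").map((c) => c.clinicName));
```

With this layout the filter runs against small per-doctor documents, at the cost of a second lookup to get the clinic itself.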

Related

MongoDB: reduce read size and RAM needed with project?

I am designing a MongoDB database that looks something like this:
registry:{
id:1,
duration:123,
score:3,
text:"aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
The text field is very big compared to the rest. I sometimes need to perform analytics queries that average the duration or the score, but never use the text.
I have queries that are more specific, and retrieve all the information about a single document. But in these queries I could spend more time making two queries to retrieve all the data.
My question is, if I make a query like this:
db.registries.aggregate( [
{
$group: {
_id: null,
averageDuration: { $avg: "$duration" },
}
}
] )
Would it need to read the data from the text field? That would make the query much slower and it would take a lot of RAM. If that is the case, it would be better to split the records in two and have something like this, right?
registry:{
id:1,
duration:123,
score:3,
}
registry_text:{
id:1,
text:"aaaaaaaaaaaaaaaaaaaaaaaaaaaa"
}
Thanks a lot!
I don't know how the server works in this case but I expect that, for caching reasons, the server will load complete documents into memory when it reads them from disk. Disk reads are very slow (= expensive in time taken) and I expect server will aggressively use memory if it can to avoid reads.
An important note here is that the documents are stored on disk as lists of key-value pairs comprising their contents. To not load a field from disk the server would have to rebuild the document in question as part of reading it since there are length fields involved. I don't see this happening in practice.
So, once the documents are in memory I assume they are there with all of their fields and I don't expect you can tune this.
When you are querying, the server may or may not drop individual fields but this would only change the memory requirements for the particular query. Generally these memory requirements are dwarfed by the overall database cache size and aggregation pipelines. So I don't think it really matters at what point a large field is dropped from a document during query processing (assuming you project it out in the query).
I think this isn't a worthwhile matter to try to ponder/optimize. If you have a real system with real workloads, you'll be much more pressed to optimize something else.
If you are concerned with memory usage when the amount of available memory is consumer-sized (say, under 16 GB), just get more memory. It's insanely cheap given how much time you'd spend working around the lack of it (whether we are talking about provisioning bigger AWS instances or buying more sticks of RAM).
You should be able to use $project to limit the fields read.
As general advice, don't try to normalize the data with MongoDB as you would with SQL. Also, it's often more performant to read plain documents from the DB and do the processing on your server.
I have found this answer, which seems to indicate that projection needs to fetch the whole document on the database server; it only reduces bandwidth:
When using projection to remove unused fields, the MongoDB server will
have to fetch each full document into memory (if it isn't already
there) and filter the results to return. This use of projection
doesn't reduce the memory usage or working set on the MongoDB server,
but can save significant network bandwidth for query results depending
on your data model and the fields projected.
https://dba.stackexchange.com/questions/198444/how-mongodb-projection-affects-performance

How to organize mongodb database for a huge set of time-value pairs for a lot of documents?

There is a set of registrators, say 100k. Every registrator reports a value such as 23.123 24 times a day. I need to save this value and its time. Then I need to calculate how the value changes over some period, e.g. 4 Jun 2014 - 19 Jul 2014. To do this I have to find the last value of 3 Jun 2014 and the last value of 19 Jul 2014.
First I am trying to estimate the size of the data stored by one registrator. Time + value must be lower than 100 bytes. One year is < 100*24*365 = 876 kB of data, so I can easily store 10 years of data (since 8.76 MB < 16 MB limit) in one document. I decided not to store registered data in a registeredData collection but to embed it in the registrator object as a tree timedata->year->month->day:
{
code: '3443-12',
timedata: {
2013: {
6: {
13: [
{t:1391345679, d:213.12},
{t:1391349679, d:213.14},
]
}
}
}
}
So it is easy to get values of the day: just get find({code: "3443-12"})[0].timedata[2013][6][13].
When I get new data, I just push it into the array of the existing document, which eventually grows from zero to about 7 MB.
Questions
What is the stored size of a {t:1391345679, d:213.12} entry? Is it less than 100 bytes?
Is this the right way to organize the database for such purposes?
100k documents of 5 MB each = 500 GB. Does MongoDB deal fast with database size much more than RAM size?
Update
I decided to store time not as a timestamp but as time in seconds from the start of a day: 0 - 86399: {t: 86123, d: 213.12}.
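A small sketch of how a raw Unix timestamp could be turned into the year/month/day bucket plus a seconds-since-midnight offset described here. The helper name `toBucket` is hypothetical, and it assumes days are cut in UTC, which the question does not specify.

```javascript
// Split a Unix timestamp (seconds) into the year/month/day bucket and
// the offset in seconds from the start of that day (0 - 86399), in UTC.
function toBucket(unixSeconds) {
  const date = new Date(unixSeconds * 1000);
  const startOfDay =
    Date.UTC(date.getUTCFullYear(), date.getUTCMonth(), date.getUTCDate()) / 1000;
  return {
    year: date.getUTCFullYear(),
    month: date.getUTCMonth() + 1, // getUTCMonth() is zero-based
    day: date.getUTCDate(),
    t: unixSeconds - startOfDay,
  };
}

// Sample reading taken on 13 Jun 2013 at 23:55:23 UTC.
const timestamp = Date.UTC(2013, 5, 13, 23, 55, 23) / 1000;
const bucket = toBucket(timestamp);

// Dotted path of the day array inside the registrator document.
const path = `timedata.${bucket.year}.${bucket.month}.${bucket.day}`;
const entry = { t: bucket.t, d: 213.12 };

// An update could then use something like { $push: { [path]: entry } }.
console.log(path, entry);
```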
Regarding your last question, " Does MongoDB deal fast with database size much more than RAM size?" the answer is it can, but it depends on a number of factors.
MongoDB works best when the working set fits within the memory available to MongoDB. When it does not, you tend to see rather rapid performance declines. How big that working set is depends on the database schema, the indexes built and your data access patterns.
Let's say you have a year's worth of data in your database, but regularly only touch the last few days of data. Then your working set is likely to be composed of the memory required to keep the last few days of data in memory, plus enough of the indexes in memory for you to properly update and read from them.
Alternatively, if you are randomly accessing data across a year and have a high update volume, you may have a significantly larger working set to deal with.
As a point of comparison, I've got a production MongoDB instance that has around 500M documents in it, taking up around 2 TB of disk storage. Total memory on the primary of the replica set is 128GB (1/16th the total storage) and we're not experiencing any performance problems.
The key for all of it though is how much data do you access over time. The killer for MongoDB performance is memory contention, when you are paging out data to service a new request only to re-page that old data right back in. And it gets far worse if you cannot keep your indexes in memory.
I've tested it and it is less than 100 B; indeed, it is 48 B:
var num=100000;
for(i=0;i<num;i++){
db.foo.insert({t:1391345679, d:213.12})
};
db.foo.stats().avgObjSize // => Outputs 48
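To relate a per-entry figure to the 16 MB document limit, the arithmetic can be sketched as below. This is only a rough estimate under stated assumptions: 24 readings a day, 365-day years, and a flat per-entry size. The measured 48 B is for standalone documents (which carry their own `_id`), so entries embedded in nested arrays will have a different overhead. The helper name `estimateBytes` is made up for illustration.

```javascript
// Rough size of one registrator's embedded data after a number of years.
function estimateBytes(bytesPerReading, readingsPerDay, years) {
  return bytesPerReading * readingsPerDay * 365 * years;
}

const LIMIT_BYTES = 16 * 1024 * 1024; // 16 MB BSON document limit

for (const bytesPerReading of [48, 100]) {
  const tenYears = estimateBytes(bytesPerReading, 24, 10);
  console.log(
    `${bytesPerReading} B/reading -> ${tenYears} bytes in 10 years,`,
    tenYears < LIMIT_BYTES ? "under the limit" : "over the limit"
  );
}
```

Both the measured 48 B and the pessimistic 100 B figure stay under the limit for ten years, but the margin shrinks quickly if the reporting rate or the retention period grows.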
It looks like what you are doing is a kind of hack to avoid normalising your data (maybe for transaction purposes?), and sooner or later you may run into problems (e.g. requirements change, the size of your data changes, new fields are introduced, etc.). I do not know your schema and domain, but if you go with a denormalized model as you are doing, you must be sure that documents will not exceed the size limit of 16MB. That being said, I would recommend the schema design article.
Answers:
The previous answer gives a hint about the document size. You can use it as a starting point.
Choosing an effective data model depends on your application needs. The main question is whether to denormalize or use linking. Note that with denormalized data you generally achieve better performance for read operations, as well as the ability to request and retrieve related data in a single database operation. Embedding makes it possible to update a document in a single atomic write operation (transactionally). So, when to use embedded (denormalized):
- you have “contains” relationships between entities. See Model One-to-One Relationships with Embedded Documents.
- you have one-to-many relationships between entities. In these relationships the “many” or child documents always appear with or are viewed in the context of the “one” or parent documents. See Model One-to-Many Relationships with Embedded Documents.
In your situation your documents will grow after creation which can impact write performance and lead to data fragmentation. You can control this with padding factor.
- About the performance: it depends on how you create your indexes. More importantly, on your access patterns. For each query executed often, check out the output from explain() to see how many documents have been checked.

Aggregate collection that has an aggregate collection

I am having trouble deciding which schema design to pick. I have a document which holds user info, and each user has a very big set of items, up to 20k items.
An item has a date, an id, 19 other fields and an internal array which can have 20-30 items. It can be modified, deleted and of course newly inserted, and queried by any property that it holds.
So I came up with 2 possible schemas.
1. Putting everything into a single document
{_id:ObjectId(""), type:'user', name:'xxx', items:[{.......,internalitems:[]},{.......,internalitems:[]},...]}
{_id:ObjectId(""), type:'user', name:'yyy', items:[{.......,internalitems:[]},{.......,internalitems:[]},...]}
2. Separating the items from the user and letting each item have its own document
{_id:ObjectId(""), type:'user', username:'xxx'}
{_id:ObjectId(""), type:'user', username:'yyy'}
{_id:ObjectId(""), type:'useritem', username:'xxx', item:{.......,internalitems:[]}}
{_id:ObjectId(""), type:'useritem', username:'xxx', item:{.......,internalitems:[]}}
{_id:ObjectId(""), type:'useritem', username:'yyy', item:{.......,internalitems:[]}}
{_id:ObjectId(""), type:'useritem', username:'yyy', item:{.......,internalitems:[]}}
As I explained before, a single user can have thousands of items and I have tens of users; internalitems can have 20-30 items, each with 9 fields.
Consider that a single item can be queried by different users and can be modified only by the owner and another process.
If performance is really important, which design would you pick?
If you pick neither of them, what schema can you suggest?
On a side note, I will be sharding and I have a single collection for everything.
I wouldn't recommend the first approach, there is a limit to the maximum document size:
"The maximum BSON document size is 16 megabytes.
The maximum document size helps ensure that a single document cannot use excessive amount of RAM or, during transmission, excessive amount of bandwidth. To store documents larger than the maximum size, MongoDB provides the GridFS API. See mongofiles and the documentation for your driver for more information about GridFS."
Source: http://docs.mongodb.org/manual/reference/limits/
There is also a performance implication if you exceed the current allocated document space when updating (http://docs.mongodb.org/manual/core/write-performance/ "Document Growth").
Your first solution is susceptible to both of these issues.
The second one (disclaimer: in the case of 20-30 internal items) is less susceptible to reaching the limit but still might require reallocation when doing updates. I haven't had this issue with a similar scenario, so this might be the way to go. You might also want to look into Record Padding (http://docs.mongodb.org/manual/core/record-padding/) for some more details.
And, if all else fails, you can always split the internal items out as well.
Hope this helps!

Mapping datasets to NoSql (MongoDB) collection

What do I have?
I have data of 'n' departments
each department has more than 1000 datasets
each dataset has more than 10,000 CSV files (size greater than 10MB), each with a different schema.
This data will grow even more in the future
What do I want to do?
I want to map this data into MongoDB
What approaches have I considered?
I can't map each dataset to a document in MongoDB since it has a limit of 4-16MB
I cannot create a collection for each dataset as the max number of collections is also limited (<24000)
So finally I thought to create a collection for each department, and in that collection one document for each record in the CSV files belonging to that department.
I want to know from you:
will there be a performance issue if we map each record to document?
is there any max limit for number of documents?
is there any other design i can do?
will there be a performance issue if we map each record to document?
Mapping each record to a document in MongoDB is not a bad design. You can have a look at the FAQ on the MongoDB site
http://docs.mongodb.org/manual/faq/fundamentals/#do-mongodb-databases-have-tables .
It says,
...Instead of tables, a MongoDB database stores its data in collections,
which are the rough equivalent of RDBMS tables. A collection holds one
or more documents, which corresponds to a record or a row in a
relational database table....
Along with the BSON document size limit (16MB), it also has a max limit of 100 levels of document nesting
http://docs.mongodb.org/manual/reference/limits/#BSON Document Size
...Nested Depth for BSON Documents Changed in version 2.2.
MongoDB supports no more than 100 levels of nesting for BSON document...
So it's better to go with one document for each record
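A minimal sketch of the one-document-per-record mapping, assuming simple comma-separated files with a header row and no quoted fields. The function name `csvToDocuments` and the `dataset`/`file` fields are hypothetical additions so that each record still knows where it came from.

```javascript
// Turn the text of one CSV file into an array of plain objects,
// one per record, ready to be inserted as individual documents.
// Naive parser: it does not handle quoted fields or embedded commas.
function csvToDocuments(csvText, dataset, file) {
  const [headerLine, ...rows] = csvText.trim().split(/\r?\n/);
  const headers = headerLine.split(",").map((h) => h.trim());
  return rows.map((row) => {
    const values = row.split(",");
    const doc = { dataset, file }; // provenance fields (assumed names)
    headers.forEach((header, i) => {
      doc[header] = values[i] === undefined ? "" : values[i].trim();
    });
    return doc;
  });
}

const sampleCsv = "id,value\n1,23.5\n2,24.1\n";
const docs = csvToDocuments(sampleCsv, "datasetA", "file1.csv");
console.log(docs);
```

Because every file can have a different schema, the header row of each file decides the field names of its documents; the resulting array could then be passed to an insert call on the department's collection.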
is there any max limit for number of documents?
No, it's mentioned in the MongoDB reference manual
...Maximum Number of Documents in a Capped Collection Changed in version 2.4.
If you specify a maximum number of documents for a capped collection
using the max parameter to create, the limit must be less than 2^32
documents. If you do not specify a maximum number of documents when
creating a capped collection, there is no limit on the number of
documents ...
is there any other design i can do?
If your document is too large then you can think of document partitioning at application level. But it will have high computation requirement at application layer.
will there be a performance issue if we map each record to document?
That depends entirely on how you search them. When you use a lot of queries which affect only one document, it is likely even faster that way. When a higher document-granularity results in a lot of document-spanning queries, it will get slower because MongoDB can't do that itself.
is there any max limit for number of documents?
No.
is there any other design i can do?
Maybe, but that depends on how you want to query your data. When you are content with treating files as a BLOB which is retrieved as a whole but not searched or analyzed on the database level, you could consider storing them on GridFS. It's a way to store files larger than 16MB on MongoDB.
In general, MongoDB database design doesn't depend so much on what and how much data you have, but rather on how you want to work with it.

general questions about using mongodb

I'm thinking about trying MongoDB to use for storing our stats but have some general questions about whether I'm understanding it correctly before I actually start learning it.
I understand the concept of using documents, what I'm not too clear about is how much data can be stored inside each document. The following diagram explains the layout I'm thinking of:
Website (document)
- some keys/values about the particular document
- statistics (tree)
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
What got me excited about mongodb was the grouping functions such as:
http://www.mongodb.org/display/DOCS/Aggregation
db.test.group(
{ cond: {"invoked_at.d": {$gte: "2009-11", $lt: "2009-12"}}
, key: {http_action: true}
, initial: {count: 0, total_time:0}
, reduce: function(doc, out){ out.count++; out.total_time+=doc.response_time }
, finalize: function(out){ out.avg_time = out.total_time / out.count }
} );
But my main concern is how hard that command would be on the server if there are, say, tens of millions of records across dozens of documents on a server with 512MB-1GB of RAM on Rackspace, for example? Would it still run at low load?
Is there any limit to the number of documents MongoDB can have (separate databases)? Also, is there any limit to the number of records in the tree I explained above? Also, does the query I showed above run instantly or is it some sort of map/reduce query? I'm not sure I can execute it on page load in our control panel to get those stats instantly.
Thanks!
Every document has a size limit of 4MB (which in text is A LOT); later MongoDB releases raised this limit to 16MB.
It's recommended to run MongoDB in replication mode or to use sharding as you otherwise will have problems with single-server durability. Single-server durability is not given because MongoDB only fsync's to the disk every 60 seconds, so if your server goes down between two fsync's the data that got inserted/updated in that time will be lost.
There is no limit of documents other than your disk space in mongodb.
You should try to import a dataset that matches your data (or generate some test data) to MongoDB and analyse how fast your query executes. Remember to set indexes on those fields that you use heavily in your queries. Your above query should work pretty well even with a lot of data.
In order to analyze the speed of your query use the database profiler MongoDB comes with. On the mongo shell do:
db.setProfilingLevel(2); // to set the profiling level
[your query]
db.system.profile.find(); // to see the results
Remember to turn off profiling once you're finished (log will get pretty huge otherwise).
Regarding your database layout I suggest to change the "schema" (yeah yeah, schema less..) to:
website (collection):
- some keys/values about the particular document
statistics (collection)
- millions of rows where each record is inserted from a pageview (key/value array containing data such as timestamp, ip, browser, etc)
+ DBRef to website
See Database References
Documents in MongoDB are limited to a size of 4MB (16MB in later releases). Let's say a single page view results in 32 bytes being stored. Then you'll be able to store about 130,000 page views in a single document at the 4MB limit.
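The page-view estimate is simple division. A quick sketch, assuming a flat 32 bytes per entry with no other fields in the document (the helper name `maxEntries` is made up for illustration):

```javascript
// How many fixed-size entries fit under a document size limit.
function maxEntries(limitBytes, bytesPerEntry) {
  return Math.floor(limitBytes / bytesPerEntry);
}

const MB = 1024 * 1024;
console.log(maxEntries(4 * MB, 32));  // 131072, the "about 130,000" figure
console.log(maxEntries(16 * MB, 32)); // 524288 with the 16MB limit
```

Either way the count is bounded, which is why a stream of page views that can reach millions does not fit the embedded layout.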
Basically the amount of page views a page can generate is infinite, and you indicated that you expect millions of them, so I suggest you store the log entries as separate documents. Each log entry should contain the _id of the parent document.
The number of documents in a database is limited to 2GB of total space on 32-bit systems. 64-bit systems don't have this limitation.
The group() function is a map-reduce query under the hood. The documentation recommends you use a map-reduce query instead of group(), because it has some limitations with large datasets and sharded environments.
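To see what the `reduce` and `finalize` functions from the question actually compute, they can be run over a plain array. This is only an in-memory illustration, not how the server executes `group()`: the `groupInMemory` helper and the sample documents are made up, and the `cond` filter is left out.

```javascript
// The reduce/finalize pair from the question, unchanged.
const reduce = function (doc, out) { out.count++; out.total_time += doc.response_time; };
const finalize = function (out) { out.avg_time = out.total_time / out.count; };

// Hypothetical helper: group docs by one key field, fold each doc into
// its group's accumulator with reduce, then post-process with finalize.
function groupInMemory(docs, keyField, initial, reduceFn, finalizeFn) {
  const groups = new Map();
  for (const doc of docs) {
    const key = doc[keyField];
    if (!groups.has(key)) {
      groups.set(key, { [keyField]: key, ...initial });
    }
    reduceFn(doc, groups.get(key));
  }
  const results = [...groups.values()];
  results.forEach(finalizeFn);
  return results;
}

// Made-up sample data standing in for logged requests.
const sample = [
  { http_action: "GET /", response_time: 10 },
  { http_action: "GET /", response_time: 30 },
  { http_action: "POST /login", response_time: 50 },
];

const result = groupInMemory(
  sample, "http_action", { count: 0, total_time: 0 }, reduce, finalize
);
console.log(result);
```

Every matching document has to pass through `reduce` once, so the cost grows with the number of page views scanned, which is worth keeping in mind before running such a query on every page load.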