I have an app in which customers can partially define their schema by adding custom fields to various domain objects we have. We are looking at doing trending data for these custom fields. I've been thinking about storing the data in a format which has the changes listed on the object.
{
_id: "id",
custom1: 2,
changes: {
'2011-3-25': { custom1: 1 }
}
}
This would obviously have to stay below the max document size (16 MB), which I think is well above the amount of changes we'd have.
Another alternative would be to have multiple records, one for every object change:
{
_id: "id",
custom1: 1,
changeTime: '2011-3-25'
}
This doesn't have the document restriction, but there would be more records, requiring more work to get the full change set for a record.
Which would you choose?
I think I'd be looking to go down the single document route if it will remain within the 16 MB limit. That way, it's just a single read to load a record and all of its changes, which should be pretty darn quick. Having the changes listed within the document feels like a natural way to model the data.
In situations like this, especially if it's not something I've done before, I'd try to knock up a benchmark to test the approaches. That way, the pros, cons and performance of each approach present themselves to you and (hopefully) give you confidence in the approach you choose.
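As a starting point for such a comparison, here is a minimal sketch of the two layouts using plain Python dicts instead of a real MongoDB connection. The helper names (`record_change_embedded`, `record_change_separate`, `history_for`) and the `object_id` field are hypothetical, and it assumes the `changes` map stores the value that was set on each date. The size check uses JSON length as a rough stand-in for the real BSON size.

```python
import json

MAX_DOC_BYTES = 16 * 1024 * 1024  # the 16 MB document limit mentioned above


def record_change_embedded(doc, field, value, date):
    """Layout 1: keep the change history inside the document itself."""
    doc[field] = value
    doc.setdefault("changes", {}).setdefault(date, {})[field] = value
    return doc


def record_change_separate(change_log, object_id, field, value, date):
    """Layout 2: append one record per change to a separate list."""
    change_log.append({"object_id": object_id, field: value, "changeTime": date})


def history_for(change_log, object_id):
    """Layout 2 has to filter many records to rebuild one object's history."""
    return [c for c in change_log if c["object_id"] == object_id]


# Layout 1: one document carries its own history.
doc = {"_id": "id"}
record_change_embedded(doc, "custom1", 1, "2011-3-25")
record_change_embedded(doc, "custom1", 2, "2011-3-26")

# Layout 2: the history lives in separate records.
log = []
record_change_separate(log, "id", "custom1", 1, "2011-3-25")
record_change_separate(log, "id", "custom1", 2, "2011-3-26")

# Rough size estimate only: JSON bytes, not the actual BSON encoding.
approx_size = len(json.dumps(doc).encode("utf-8"))
print(doc, len(history_for(log, "id")), approx_size < MAX_DOC_BYTES)
```

Timing both helpers against realistic numbers of changes per object would show when the single-read advantage of the embedded layout starts to be outweighed by document growth.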
Related
So I have this kind of structure in the firestore database
{
issues: {
issueName1: "sample opinion",
issueName2: "sample opinion",
...
}
}
The issueName1 and issueName2 represent social issues, and sample opinion signifies a specific person's opinion on them. Now, these issues will change over time. Is there a way to change these field names in the firestore? Or is there a better way to structure this? I'm new to NoSQL and having a hard time grasping its concept. Thank you for any help!
If you are referring to how to update older documents with the new issues, you wouldn't need to do that, as these issues would not have an opinion associated with them. In NoSQL databases you don't need to have a consistent schema among different entries, so you can just add the issues that do have an opinion.
On a side note, depending on how you are planning to retrieve the issues per person, you might want to store them all together in a map instead of individual fields:
{
issues: [{
issueName1: "sample opinion",
issueName3: "sample opinion"
}]
}
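To illustrate the point about inconsistent schemas, here is a small sketch with plain Python dicts standing in for Firestore documents (no Firestore client is used). The helper names `opinion_on` and `all_issue_names` are hypothetical. Older documents simply lack the newer issue and are never rewritten.

```python
# Each dict stands in for one person's document.
people = [
    {"issues": {"issueName1": "sample opinion", "issueName2": "sample opinion"}},
    {"issues": {"issueName3": "sample opinion"}},  # newer document, newer issue
]


def opinion_on(person, issue):
    """Return the stored opinion, or None when this person has none."""
    return person.get("issues", {}).get(issue)


def all_issue_names(docs):
    """Collect every issue name that appears in at least one document."""
    return sorted({name for d in docs for name in d.get("issues", {})})


print(opinion_on(people[0], "issueName3"))
print(all_issue_names(people))
```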
I'm designing my first database for MongoDB, and I'm wondering if I'm going in the right direction with it.
It's basically a mock reservation system for a theatre. Is there anything inherently wrong about going 2 or 3 levels deep with nesting?
Will it create problems with queries later on?
What would be performance and usability wise the best solution here?
Should I perhaps use references like I did with clients that made reservations?
Here is what I have so far:
//shows
{
_id: 2132131,
name: 'something',
screenplay: ['author1', 'author2'],
show_type: 'children',
plays: {
datetime: "00:00:00 0000-00-00",
price: 120,
seats: [
{
_id: ['a', 1],
status: 'reserved',
client: 1
},
{
_id: ['a', 2],
status: 'reserved',
client: 1
}
]
}
}
//clients
{
_id:1,
name: 'Julius Cesar',
email: 'julius@rome.com'
}
You might hear different opinions on this one but let me share my views on this with you.
First of all, your schema does not seem correct for your use case. You most likely want "plays" to be an array rather than an object, so:
{
"_id":2132131,
"name":"something",
"screenplay":[
"author1",
"author2"
],
"show_type":"children",
"plays":[
{
"datetime":"00:00:00 0000-00-00",
"price":120,
"seats":[
{
"_id":[
"a",
1
],
"status":"reserved",
"client":1
},
{
"_id":[
"a",
2
],
"status":"reserved",
"client":1
}
]
}
]
}
If my assumption is correct, you now have double nested arrays, which is an extremely impractical schema since you cannot use more than one positional operator in a query or update.
Despite what most of the NoSQL crowd seems to think, there are actually only a few valid use cases for embedding collections in a document. The following conditions need to be met:
The embedded collection has very clear upper limits in terms of size. This limit should not be higher than a couple of dozen before this becomes unwieldy/inefficient.
The embedded collection should not grow regularly (this causes the document to move around on disk which dramatically reduces performance)
The elements in the embedded collection should not contain array attributes (the current query language does not allow you to modify specific elements of a double nested array)
You should never need the elements of the nested collection without also querying the root document that contains that embedded collection.
You'll find that not many situations meet all the criteria above. Some of those criteria are somewhat subjective, but lightweight referencing is not actually that much more complicated. Actually, not being able to "atomically" modify documents in different collections is the only complication, and you'll find that isn't as big a problem as it sounds in most cases.
TL;DR: Don't double nest arrays; stick "plays" in a separate collection
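A minimal sketch of what the separate "plays" collection could look like, using plain Python dicts instead of MongoDB collections. The `show_id` reference field, the play id `1`, the sample datetime, the "free" status and the `reserve_seat` helper are all hypothetical. The point is that once plays are top-level documents, a seat sits only one array level deep.

```python
# "shows" no longer embeds plays; each play references its show instead.
shows = {
    2132131: {"name": "something", "screenplay": ["author1", "author2"],
              "show_type": "children"},
}

plays = {
    1: {
        "show_id": 2132131,                 # hypothetical reference field
        "datetime": "2024-01-01T19:00:00",  # hypothetical sample value
        "price": 120,
        "seats": [
            {"_id": ["a", 1], "status": "free", "client": None},
            {"_id": ["a", 2], "status": "free", "client": None},
        ],
    },
}


def reserve_seat(plays, play_id, seat_id, client_id):
    """Reserve a seat; only a single array has to be searched."""
    for seat in plays[play_id]["seats"]:
        if seat["_id"] == seat_id:
            if seat["status"] == "reserved":
                return False  # already taken
            seat["status"] = "reserved"
            seat["client"] = client_id
            return True
    raise KeyError(seat_id)


first = reserve_seat(plays, 1, ["a", 1], client_id=1)
second = reserve_seat(plays, 1, ["a", 1], client_id=2)
print(first, second)
```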
You should usually prefer embedding over referencing in MongoDB, so you are already heading in the right direction.
The only reason to use referencing for a 1:n relation is when you have growing objects, because an update which causes a document to grow in size can be slow. However, it seems like you are working with data which isn't going to grow very frequently (maybe a few times a day), which likely means that you shouldn't run into performance problems.
Coming from a MySQL background, I've been questioning some of the design patterns when working with Mongo. One question I keep asking myself is when should I create a new collection vs creating a property of an array type? My current situation goes as follows:
I have a collection of Users who all have at least 1 Inbox
Each inbox has 0 or more messages
Each message can have 0 or more comments
My current structure looks like this:
{
username:"danramosd",
inboxes:[
{
name:"inbox1",
messages:[
{
message:"this is my message",
comments:[
{
comment:"this is a great message"
}
]
}
]
}
]
}
For simplicity I only listed 1 inbox, 1 message and 1 comment. Realistically though there could be many more.
An approach that I believe would work better is to use 4 collections:
Users - stores just the username
Inboxes - name of the inbox, along with the UID of User it belongs to
Messages - content of the message, along with the UID of inbox it belongs to
Comments - content of the comment, along with the UID of the message it belongs to.
So which one would be the better approach?
No one can help you with this question, because it is highly dependent on your application:
how many inboxes/messages/comments do you have on average
how often do you write/modify/delete these elements
how often do you read them
a lot of other things that I forgot to mention
When you select one approach over another, you are making trade-offs.
If you store everything together (in one collection, as in your first case), you make it super easy to get everything for a particular user. Leaving aside the fact that you most probably do not need all the information at once, you also make it very hard to update individual parts of the document (try to write a query that adds a comment or removes the third comment). Even if this were easy, MongoDB does not handle growing documents well: whenever a document exceeds its padding, it is moved to another location (which is expensive) and the padding factor is increased. Also keep in mind that this can potentially hit MongoDB's limit on the size of a document.
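The "remove the third comment" difficulty can be sketched with plain Python data structures (no MongoDB involved). The helper names and the `_id` and `message_id` fields in the flat layout are hypothetical. The embedded version has to walk through every nesting level to reach the comment, while the flat version only needs the comment's own id.

```python
# Embedded layout: comments live three levels deep inside the user document.
user = {
    "username": "danramosd",
    "inboxes": [{
        "name": "inbox1",
        "messages": [{
            "message": "this is my message",
            "comments": [{"comment": "first"}, {"comment": "second"},
                         {"comment": "third"}],
        }],
    }],
}


def remove_comment_embedded(user, inbox_name, message_index, comment_index):
    """Navigate inbox -> message -> comment before anything can be removed."""
    for inbox in user["inboxes"]:
        if inbox["name"] == inbox_name:
            del inbox["messages"][message_index]["comments"][comment_index]
            return
    raise KeyError(inbox_name)


# Flat layout: one record per comment, referencing its message.
comments = [
    {"_id": 1, "message_id": 10, "comment": "first"},
    {"_id": 2, "message_id": 10, "comment": "second"},
    {"_id": 3, "message_id": 10, "comment": "third"},
]


def remove_comment_flat(comments, comment_id):
    """Removing a comment only needs the comment's own id."""
    return [c for c in comments if c["_id"] != comment_id]


remove_comment_embedded(user, "inbox1", 0, 2)
comments = remove_comment_flat(comments, 3)
print(user["inboxes"][0]["messages"][0]["comments"], comments)
```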
It is always a good idea to read through the MongoDB use cases before trying to design any storage schema. Not surprisingly, they have a comprehensive overview of your case as well.
Problem
Starting out with NoSQL document databases, I have discovered lots of new possibilities. However, I also see some pitfalls, and I would like to know how I can deal with them.
Suppose I have a product, and this product can be sold in many regions. There is one person responsible for each region (with access to the CMS). Each of them modifies the products according to regional laws and rules.
Since joins aren't supported the way we know them from relational databases, the document should be designed so that it contains all the information needed to build our selection statements and results, to avoid round trips to the database.
So my first thought was to design a document that follows more or less this structure:
{
type : "product",
id : "product_id",
title : "title",
allowedAge : 12,
regions : {
'TX' : {
title : "overriden title",
allowedAge : 13
},
'FL' : {
title : "still another title"
}
}
}
But I have the impression that this approach will generate conflicts when updating the document. Suppose we have a lot of users updating lots of documents through a CMS. When the same document is updated concurrently, the last update overwrites the updates made before it, even though the users are only able to modify fragments of the document (in this case each responsible person should be able to modify just their regional data).
How to deal with this situation?
One possible solution I can think of would be partial document updates. Positive: fewer overwrites between different operations. Negative: losing the optimistic lock feature, since locking is done over a whole document, not a fragment of it.
Is there another approach for the problem?
In this case you can use 3 solutions:
Leave the current document structure and always check the CAS value on update operations. If the CAS doesn't match, call the store function again. (But as you say, if you have a lot of users it can be very slow.)
Split the document into several parts that can be updated independently, and then combine them on the app side. This will increase the number of view calls (one to get the main doc, another to get e.g. the regions, etc.). If you have many "parts" it will also reduce performance.
See this doc. It's about simulating joins in Couchbase. There is also a good example written by @Tug Grall.
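The first option (check the CAS value and retry) can be sketched as follows. This is a self-contained illustration with a hypothetical in-memory store; `InMemoryStore`, `CasMismatch` and `update_with_retry` are made-up names and not the Couchbase SDK API. A concurrent edit to the FL region is simulated during the first attempt so that the retry path is actually exercised.

```python
import copy


class CasMismatch(Exception):
    """Raised when the document changed since it was read."""


class InMemoryStore:
    """Hypothetical stand-in for a key-value store with CAS values."""

    def __init__(self):
        self._data = {}  # key -> (value, cas)

    def insert(self, key, value):
        self._data[key] = (copy.deepcopy(value), 1)

    def get(self, key):
        value, cas = self._data[key]
        return copy.deepcopy(value), cas

    def replace(self, key, value, cas):
        _, current_cas = self._data[key]
        if cas != current_cas:
            raise CasMismatch(key)
        self._data[key] = (copy.deepcopy(value), current_cas + 1)


def update_with_retry(store, key, mutate, max_attempts=5):
    """Read, modify, and write back; start over if someone else wrote first."""
    for _ in range(max_attempts):
        value, cas = store.get(key)
        mutate(value)
        try:
            store.replace(key, value, cas)
            return value
        except CasMismatch:
            continue
    raise RuntimeError("too many conflicting updates")


store = InMemoryStore()
store.insert("product_id", {"title": "title", "regions": {"TX": {}, "FL": {}}})
attempts = []


def set_tx_title(doc):
    # Simulate another editor saving the FL region during our first attempt.
    if not attempts:
        other, other_cas = store.get("product_id")
        other["regions"]["FL"]["title"] = "still another title"
        store.replace("product_id", other, other_cas)
    attempts.append(1)
    doc["regions"]["TX"]["title"] = "overriden title"


result = update_with_retry(store, "product_id", set_tx_title)
print(len(attempts), result["regions"])
```

Because the second attempt re-reads the document, the FL change made by the other editor is preserved rather than overwritten.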
If you're not bound to using Couchbase (it's not clear from your question whether it's general or specific to it), look also into MongoDB. It supports partial updates on documents and also other atomic operations (like increments and array operations), so it might suit your use case better (check out the possible update operations on mongo - http://docs.mongodb.org/manual/core/update/ )
I am still getting used to using a schema-less document-oriented database, and I am wondering what the generally accepted practice is regarding schema design within an application model.
Specifically, I'm wondering whether it is good practice to enforce a schema within the application model when saving to mongodb, like this:
{
_id: "foobar",
name: "John",
billing: {
address: "8237 Landeau Lane",
city: "Eden Prairie",
state: "MN",
postal: null
},
balance: null,
last_activity: null
}
versus only storing the fields that are used like this:
{
_id: "foobar",
name: "John",
billing: {
address: "8237 Landeau Lane",
city: "Eden Prairie",
state: "MN"
}
}
The former is self-descriptive which I like, while the latter makes no assumptions on the mutability of the model schema.
I like the first option because it makes it easy to see at a glance what fields are used by the model yet currently unspecified, but it seems like it would be a hassle to update every document to reflect a new schema design if I wanted to add an extra field, like favorite_color.
How do most veteran mongodb users handle this?
I would suggest the second approach.
You can always see the intended structure if you look at your entity class in the source code. Or do you use a dynamic language, and don't create an entity?
You save a lot of space per record, because you don't have to store the names of null fields. This may not matter on small collections. But on large ones, with millions of records, I would even go as far as shortening the field names.
As you already mentioned, by specifying optional field names you create a pattern which, if you want to follow it, forces you to update all existing records when you add a new field. This is, again, a bad idea for a big DB.
In any case, it all comes down to your DB size. If you don't aim for many GBs or TBs of data, then both approaches are fine. But if you predict that your DB may grow really large, I would do anything to cut the size. Spending 30-40% of storage on field names is a bad idea.
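To get a feel for how much of a record is spent on field names, here is a rough sketch. It measures compact JSON rather than MongoDB's actual BSON encoding, so the numbers are only indicative, and the helper names `json_size` and `field_name_share` are hypothetical.

```python
import json

with_nulls = {
    "_id": "foobar", "name": "John",
    "billing": {"address": "8237 Landeau Lane", "city": "Eden Prairie",
                "state": "MN", "postal": None},
    "balance": None, "last_activity": None,
}
sparse = {
    "_id": "foobar", "name": "John",
    "billing": {"address": "8237 Landeau Lane", "city": "Eden Prairie",
                "state": "MN"},
}


def json_size(doc):
    """Size in bytes of the compact JSON form (an approximation, not BSON)."""
    return len(json.dumps(doc, separators=(",", ":")).encode("utf-8"))


def key_bytes(value):
    """Total bytes used by field names, including nested documents."""
    if isinstance(value, dict):
        return sum(len(k.encode("utf-8")) + key_bytes(v) for k, v in value.items())
    if isinstance(value, list):
        return sum(key_bytes(v) for v in value)
    return 0


def field_name_share(doc):
    """Fraction of the serialized size taken up by field names."""
    return key_bytes(doc) / json_size(doc)


for label, doc in (("with nulls", with_nulls), ("sparse", sparse)):
    print(label, json_size(doc), round(field_name_share(doc), 2))
```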
I prefer the first option. It is easier to code within the application and requires far fewer state holders and functions to understand how things should work.
As for adding a new field over time, you don't need to update all your records to support it like you would in SQL. All you need to do is write the new field into your model on the application side and support this field being null if it is not returned from MongoDB.
A good example is in PHP.
I have a class of user at first with only one field, name
class User{
public $name;
}
6 months down the line and 60,000 users later I want to add, say, address. All I do is add that variable to my application model:
class User{
public $name;
public $address = array();
}
This now works exactly like adding a new null field to SQL without having to actually add it to every row on-demand.
It is a very reactive design: don't update what you don't need to. If that row gets used it will get updated; if not, then who cares.
So eventually your rows actually become a mix and match between option 1 and 2 but it is really a reactive option 1.
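The same "reactive" idea as the PHP class above, sketched in Python with a dataclass. The `from_document` helper is hypothetical, and the documents are plain dicts standing in for what the database would return. Old documents without `address` load fine, and the field only gets stored once a record is actually updated.

```python
from dataclasses import asdict, dataclass, field


@dataclass
class User:
    name: str
    # Added months later; older stored documents simply do not have it.
    address: dict = field(default_factory=dict)

    @classmethod
    def from_document(cls, doc):
        """Build the model, falling back to defaults for missing fields."""
        return cls(name=doc["name"], address=doc.get("address", {}))


old_doc = {"name": "John"}                             # written before the change
new_doc = {"name": "Jane", "address": {"state": "MN"}}

old_user = User.from_document(old_doc)
new_user = User.from_document(new_doc)

# Only when the old record is edited and saved does it gain the new field.
old_user.address = {"city": "Eden Prairie"}
saved = asdict(old_user)
print(saved)
```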
Edit
On the storage side you have also got to think of pre-allocation and movement of documents.
Say the fields currently set on a record take up only a third of the full document, and then the user suddenly updates the document with all of the fields: you now have extra fragmentation from your documents being moved.
Normally when you are defining a schema like this you are defining one that will eventually grow and apply to that user in most cases (much like an SQL schema does).
This is something to take into consideration: even though storage might be lower in the short term, it could cause fragmentation and slow querying due to that fragmentation, and you could easily find yourself having to run compacts or repairDbs because of the problems you now face.
I should mention that neither of those functions is designed to be run regularly, and both carry a significant performance cost while they run in a production environment.
So really with the structure above you don't need to add a new field across all documents and you will most likely get less movement and problems in the long run.
You can fix the performance problems of consistently growing documents by using power of 2 sizes padding, but this is collection-wide, which means that even your fully filled documents will use up at least double their previous space, and your small documents will probably use as much space as your full documents would have on a padding factor of 1.
In other words, you lose space rather than gain it.
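The space cost of rounding allocations up to powers of two can be sketched with a little arithmetic. This only illustrates the rounding itself; MongoDB's real allocation strategy may differ in its details (minimum sizes, handling of very large documents), and the document sizes below are made-up examples.

```python
def power_of_two_allocation(size_bytes):
    """Smallest power of two that is greater than or equal to size_bytes."""
    if size_bytes < 1:
        raise ValueError("size must be positive")
    return 1 << (size_bytes - 1).bit_length()


# Hypothetical document sizes in bytes: a sparse record and a fully filled one.
for size in (300, 900, 1025):
    allocated = power_of_two_allocation(size)
    print(size, allocated, allocated - size)  # size, allocation, unused slack
```

A 300-byte document gets a 512-byte slot and a 1025-byte document gets a 2048-byte slot, so the slack is paid by every document in the collection, not only by the ones that grow.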