NoSQL best practice: should I save derivative (calculated) data as it is used in the app? - mongodb

I have nested objects several levels deep that include some original data. After some user input, calculations are made on the original dataset, and the results are kept alongside the original data.
Angular also stores some other redundant data in the objects. All this extra data would be easy to reconstruct programmatically from just the original dataset, the user inputs and a few ids.
The easiest (but least economical) option is to store the data as is. This would mean roughly 2-3x larger objects, more storage and bandwidth used, etc.
The other version is to store the minimum required data and reconstruct the objects on each query.
The app is not huge (though it may grow in the future), and neither are the objects (around 200 keys).
So I'm curious what is best practice to follow in general?

It's a balance between over-denormalizing and keeping an efficient structure in terms of space and complexity (i.e. the cost of keeping everything in sync if you do denormalize).
Start with your user stories and the query patterns they imply; these will dictate what information needs to live in a single document.
It sounds like this is how you've done it anyway: embedded sub-docs that you run calculations on during entry. Keep the calculated values in the parent doc and make sure they're updated along with the child records. Using sub-docs means you can update the calculated values and the embedded sub-docs atomically, too.
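For illustration, a minimal sketch of that kind of single-document update with the Node.js driver (the orders/items/calculatedTotal names are placeholders, not from the question): the embedded sub-doc and the derived value on the parent change in one atomic write.

```typescript
import { MongoClient, ObjectId } from "mongodb";

const orders = new MongoClient("mongodb://localhost:27017").db("app").collection("orders");

// Change one embedded sub-doc and the derived total kept on the parent,
// in a single atomic update on one document.
async function setQuantity(orderId: ObjectId, sku: string, quantity: number, newTotal: number) {
  await orders.updateOne(
    { _id: orderId, "items.sku": sku },   // match the parent and the embedded item
    {
      $set: {
        "items.$.quantity": quantity,     // positional update of the matched sub-doc
        calculatedTotal: newTotal,        // recalculated value stored alongside it
      },
    }
  );
}
```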

Related

Exceeding NoSQL Item Size limit while preserving read/write performance

I need a way to exceed the NoSQL document item size limitations while preserving read/write performance.
All NoSQL implementations impose item/document size limits (e.g. 400 KB per item in AWS DynamoDB, 16 MB per document in MongoDB). Moreover, the larger the document, the slower its retrieval becomes. Read, write, update and delete performance is important in my implementation.
I'm already using a NoSQL database, and the documents I work with have a tree structure with four levels (each document describes a vector drawing of varying size and complexity). Each node of the tree is a component that has an ID and a body that refers to the components below it. Reads and updates can be performed on the full tree or on parts of it.
I've considered switching to a graph database, but those are optimized for querying relationships in graphs. What I need is a simple, high-performance way to store expanding tree documents rather than to handle general graphs.
I'm thinking that I need an implementation of a closure-table-like structure in NoSQL to split the tree nodes into separate items and thus break the document size barrier. On the client side, an ORM (object-relational mapper) with lazy loading would retrieve the tree nodes as necessary.
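Roughly, the per-node layout I have in mind might look something like this in MongoDB terms (field names are only illustrative): every tree node becomes its own small document carrying its parent and root ids, so a subtree can be lazily loaded level by level or fetched in bulk.

```typescript
import { Collection, ObjectId } from "mongodb";

// Hypothetical per-node document: the tree is split so no single document
// ever approaches the size limit.
interface TreeNode {
  _id: ObjectId;
  rootId: ObjectId;          // id of the drawing this node belongs to
  parentId: ObjectId | null; // null for the root node
  body: unknown;             // only this component's own data
}

// Lazy-load one level of children on demand.
async function loadChildren(nodes: Collection<TreeNode>, parentId: ObjectId) {
  return nodes.find({ parentId }).toArray();
}

// Or pull the whole (four-level) tree with a single indexed query.
async function loadTree(nodes: Collection<TreeNode>, rootId: ObjectId) {
  return nodes.find({ rootId }).toArray();
}
```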
My questions are:
Has this been implemented before?
Is there another way to get the functionality described above?
If you need such functionality, upvote this so that more experts will answer the question.
DynamoDB does not have a speed penalty for item size; transferring a larger item over the network simply takes longer, that's it. Also, if you have such a big payload and are trying to push it into DynamoDB, that's a design issue to begin with. Store the big file in S3 and keep the metadata in DynamoDB.
With DynamoDB you pay for read and write units by the KB, so why waste money on big data when S3 is so cheap for it and has a 5 TB object size limit instead of 400 KB? Also, if your app generates those big JSON structures, you could save money over time by automating S3 storage classes (lifecycle rules), which would move unused JSON to a "colder" storage class.
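Something along these lines with the AWS SDK v3 (the bucket and table names are made up): the heavy JSON goes to S3 and only small, queryable metadata goes to DynamoDB.

```typescript
import { S3Client, PutObjectCommand } from "@aws-sdk/client-s3";
import { DynamoDBClient, PutItemCommand } from "@aws-sdk/client-dynamodb";

const s3 = new S3Client({});
const ddb = new DynamoDBClient({});

async function saveDrawing(drawingId: string, treeJson: string) {
  const key = `drawings/${drawingId}.json`;

  // Heavy payload goes to S3 (object size limit 5 TB, not 400 KB).
  await s3.send(new PutObjectCommand({
    Bucket: "drawings-bucket",   // assumed bucket name
    Key: key,
    Body: treeJson,
    ContentType: "application/json",
  }));

  // Only small, queryable metadata goes to DynamoDB.
  await ddb.send(new PutItemCommand({
    TableName: "DrawingsMeta",   // assumed table name
    Item: {
      drawingId: { S: drawingId },
      s3Key: { S: key },
      updatedAt: { S: new Date().toISOString() },
      sizeBytes: { N: String(Buffer.byteLength(treeJson)) },
    },
  }));
}
```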

Model a large MongoDB collection that rarely updates

Context:
I'm currently modeling data that follows a deep tree pattern consisting of 4 layers (categories, subcategories, subsubcategories, subsubsubcategories... the last two are of course not the real names I'll be using).
This collection is meant to grow larger and larger over time, and each layer will contain a list of dozens of elements.
Problem:
Modeling a fully embedded collection like that raises a big problem: MongoDB's 16 MB document limit is not really ideal in this context, because the document size will slowly approach that limit.
But at the same time, this data is not meant to be updated very often (at most a few times a day). Client-side, the API needs to return a fully-constructed big JSON file made of all those layers nested together. It can be easily made in such a way that every time a layer is updated, the full JSON result is updated too and stored in RAM, ready to be sent.
I was wondering whether splitting a 4-layer tree like that across different collections would be a better idea: it would mean more queries, but it would be far more scalable and easier to understand. I don't really know whether that's how MongoDB documents are meant to be modeled, though. This is my first time using MongoDB, so I may be doing something wrong, and I want to be sure I'm going about this the right way.
I suggest you take a look at the official MongoDB advice on tree structures, and especially the solution with parent references. It will let you keep your structure without struggling against the 16 MB maximum size, and you can use $graphLookup aggregation stages to perform your further queries on the tree's subdocuments.
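Roughly, with parent references each category document carries only its parent's _id, and a $graphLookup stage walks the tree in one aggregation (the collection and field names below are assumptions):

```typescript
import { MongoClient, ObjectId } from "mongodb";

const categories = new MongoClient("mongodb://localhost:27017").db("app").collection("categories");

// Each document: { _id, name, parentId } where parentId is null for top-level categories.
// Fetch a category plus all of its descendants in a single aggregation.
async function subtree(rootId: ObjectId) {
  return categories.aggregate([
    { $match: { _id: rootId } },
    {
      $graphLookup: {
        from: "categories",
        startWith: "$_id",          // begin from the matched category
        connectFromField: "_id",
        connectToField: "parentId", // follow parent links downwards
        as: "descendants",
        depthField: "depth",        // 0 = direct children, 1 = grandchildren, ...
      },
    },
  ]).toArray();
}
```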

Parse / MongoDB / NoSQL: Is it better to have larger (fewer) or smaller (more) objects / records?

I am currently designing an app where a user uploads several pieces of information daily.
I see two different ways to accomplish this:
Fewer, larger objects: Save all information in an array in the _User class of Parse.
More, smaller objects: Create a new class Information and save each piece of information there, with a pointer to the _User class.
Let's assume the Information class contains 1,000,000 records:
Which option is preferable? Are there actually any CPU / RAM differences regarding queries? In case 1 all the information is stored in the PFUser.currentUser() object, so there is actually no need for a query.
You can't realistically store more than a few hundred items in an array on an object before you'll start seeing issues (depending on the item size; if it's a simple number the count could be a lot higher), and eventually the record will be so big you can't actually save any changes to it. So, using pointers is better.
If you're going to have a huge dataset then it's more important to consider the technology you're using first, and its ability to scale and index the data. After that you can decide how best to structure the data for that platform.
On Parse, use more, smaller objects with pointers and relationships.
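As a rough sketch with the Parse JavaScript SDK (keys, URLs and field names are placeholders): each Information object points back at its user, so reading is a paged, indexed query instead of one enormous array on _User.

```typescript
import Parse from "parse/node";

// Hypothetical setup; the keys and server URL are placeholders.
Parse.initialize("APP_ID", "JS_KEY");
Parse.serverURL = "https://example.com/parse";

const Information = Parse.Object.extend("Information");

// Write: one small object per entry, pointing back at the user.
async function addEntry(user: Parse.User, payload: Record<string, unknown>) {
  const info = new Information();
  info.set("user", user);   // pointer to _User
  info.set("data", payload);
  return info.save();
}

// Read: page through the user's entries instead of loading one huge array.
async function latestEntries(user: Parse.User) {
  const query = new Parse.Query(Information);
  query.equalTo("user", user);
  query.descending("createdAt");
  query.limit(100);
  return query.find();
}
```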

When should I create a new collections in MongoDB?

So just a quick best practice question here. How do I know when I should create new collections in MongoDB?
I have an app that queries TV show data. Should each show have its own collection, or should they all be stored in one collection, with the relevant data in the same document? Please explain why you chose the approach you did. (I'm still very new to MongoDB; I'm used to MySQL.)
The Two Most Popular Approaches to Schema Design in MongoDB
Embed data into documents and store them in a single collection.
Normalize data across multiple collections.
Embedding Data
There are several reasons why MongoDB doesn't support joins across collections, and I won't get into all of them here. But the main reason why we don't need joins is because we can embed relevant data into a single hierarchical JSON document. We can think of it as pre-joining the data before we store it. In the relational database world, this amounts to denormalizing our data. In MongoDB, this is about the most routine thing we can do.
Normalizing Data
Even though MongoDB doesn't support joins, we can still store related data across multiple collections and still get to it all, albeit in a roundabout way. This requires us to store a reference to a key from one collection inside another collection. It sounds similar to relational databases, but MongoDB doesn't enforce key constraints for us like most relational databases do. Enforcing key constraints is left entirely up to us. We're good enough to manage it though, right?
Accessing all related data in this way means we're required to make at least one query for every collection the data is stored across. It's up to each of us to decide if we can live with that.
When to Embed Data
Embed data when that embedded data will be accessed at the same time as the rest of the document. Pre-joining data that is frequently used together reduces the amount of code we have to write to query across multiple collections. It also reduces the number of round trips to the server.
Embed data when that embedded data only pertains to that single document. Like most rules, we need to give this some thought before blindly following it. If we're storing an address for a user, we don't need to create a separate collection to store addresses just because the user might have a roommate with the same address. Remember, we're not normalizing here, so duplicating data to some degree is ok.
Embed data when you need "transaction-like" writes. Prior to v4.0, MongoDB did not support transactions, though it does guarantee that a single-document write is atomic: it'll write the document or it won't. Writes across multiple collections could not be made atomic, and update anomalies could occur in any number of scenarios we can imagine. This is no longer the case since v4.0; however, it is still more typical to denormalize data and avoid the need for transactions.
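To make the embedding case concrete, here's a small illustrative document (names invented for the example): the address and preferences live inside the one user document, so a single write covers all of them.

```typescript
import { MongoClient } from "mongodb";

const users = new MongoClient("mongodb://localhost:27017").db("app").collection("users");

// One hierarchical document: the address and preferences are always read with
// the user, belong only to this user, and are written in a single atomic insert.
async function createUser() {
  await users.insertOne({
    name: "Ada Lovelace",
    address: {                          // embedded, not a separate collection
      street: "12 St James's Square",
      city: "London",
    },
    preferences: { theme: "dark", newsletter: true },
  });
}
```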
When to Normalize Data
Normalize data when data that applies to many documents changes frequently. So here we're talking about "one to many" relationships. If we have a large number of documents that have a city field with the value "New York" and all of a sudden the city of New York decides to change its name to "New-New York", well then we have to update a lot of documents. Got anomalies? In cases like this where we suspect other cities will follow suit and change their name, then we'd be better off creating a cities collection containing a single document for each city.
Normalize data when data grows frequently. When documents grow, they have to be moved on disk. If we're embedding data that frequently grows beyond its allotted space, that document will have to be moved often. Since these documents are bigger each time they're moved, the process only grows more complex and won't get any better over time. By normalizing those embedded parts that grow frequently, we eliminate the need for the entire document to be moved.
Normalize data when the document is expected to grow larger than 16MB. Documents have a 16MB limit in MongoDB. That's just the way things are. We should start breaking them up into multiple collections if we ever approach that limit.
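And a normalized counterpart, sketched with an invented cities collection: each user document stores only a reference, so renaming the city touches a single document.

```typescript
import { MongoClient } from "mongodb";

const db = new MongoClient("mongodb://localhost:27017").db("app");

async function renameCity() {
  // One document per city; users hold only the reference.
  const { insertedId: cityId } = await db.collection("cities")
    .insertOne({ name: "New York" });

  await db.collection("users").insertOne({ name: "Fry", cityId });

  // The rename touches exactly one document instead of millions.
  await db.collection("cities").updateOne(
    { _id: cityId },
    { $set: { name: "New-New York" } }
  );
}
```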
The Most Important Consideration to Schema Design in MongoDB is...
How our applications access and use data. This requires us to think? Ugh! What data is used together? What data is mostly read-only? What data is written frequently? Let your application's data access patterns drive your schema, not the other way around.
The scope you've described is definitely not too much for "one collection". In fact, being able to store everything in a single place is the whole point of a MongoDB collection.
For the most part, you don't want to be thinking about querying across combined tables as you would in SQL. Unlike SQL, MongoDB lets you avoid thinking in terms of "JOINs"--historically it didn't support them at all, and even now the closest native feature is the limited $lookup aggregation stage.
See this slideshare:
http://www.slideshare.net/mongodb/migrating-from-rdbms-to-mongodb?related=1
Specifically look at slides 24 onward. Note how a MongoDB schema is meant to replace the multi-table schemas customary to SQL and RDBMS.
In MongoDB a single document holds all information regarding a record. All records are stored in a single collection.
Also see this question:
MongoDB query multiple collections at once

Are there any advantages to using a custom _id for documents in MongoDB?

Let's say I have a collection called Articles. If I insert a new document into that collection without providing a value for the _id field, MongoDB will generate one for me based on the machine and the time of the operation (e.g. sdf4sd89fds78hj).
However, I do have the ability to pass a value for MongoDB to use as the value of the _id key (e.g. 1).
My question is, are there any advantages to using my own custom _ids, or is it best to just let Mongo do its thing? In what scenarios would I need to assign a custom _id?
Update
For anyone else who may find this: the general idea (as I understand it) is that there's nothing wrong with assigning your own _ids, but it forces you to maintain unique values within your application layer, which is a PITA, and requires an extra query before every insert to make sure you don't accidentally duplicate a value.
Sammaye provides an excellent answer here:
Is it bad to change _id type in MongoDB to integer?
Advantages with generating your own _ids:
You can make them more human-friendly, by assigning incrementing numbers: 1, 2, 3, ...
Or you can make them more human-friendly, using random strings: t3oSKd9q
(That doesn't take up too much space on screen, could be picked out from a list, and could potentially be copied manually if needed. However you do need to make it long enough to prevent collisions.)
If you use randomly generated strings, they will have an approximately even sharding distribution, unlike standard Mongo ObjectIds, which tend to group records created around the same time onto the same shard. (Whether that is helpful or not really depends on your sharding strategy.)
Or you may like to generate your own custom _ids that will group related objects onto one shard, e.g. by owner, or geographical region, or a combination. (Again, whether that is desirable or not depends on how you intend to query the data, and/or how rapidly you are producing and storing it. You can also do this by specifying a shard key, rather than the _id itself. See the discussion below.)
Advantages to using ObjectIds:
ObjectIds are very good at avoiding collisions. If you generate your own _ids randomly or concurrently, then you need to manage the collision risk yourself.
ObjectIds contain their creation time within them. That can be a cheap and easy way to retain the creation date of a document, and to sort documents chronologically. (On the other hand, if you don't want to expose/leak the creation date of a document, then you must not expose its ObjectId!)
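For example, with the Node.js driver the creation time can be read straight off an ObjectId:

```typescript
import { ObjectId } from "mongodb";

const id = new ObjectId();
console.log(id.getTimestamp()); // a Date built from the seconds-precision timestamp embedded in the id
// Sorting on _id therefore gives a rough chronological ordering for free.
```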
The nanoid module can help you to generate short random ids. They also provide a calculator which can help you choose a good id length, depending on how many documents/ids you are generating each hour.
Alternatively, I wrote mongoose-generate-unique-key for generating very short random ids (provided you are using the mongoose library).
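A tiny sketch of using nanoid output as the _id (the length of 10 is just an example; their calculator can suggest one for your volume):

```typescript
import { nanoid } from "nanoid";
import { MongoClient } from "mongodb";

const articles = new MongoClient("mongodb://localhost:27017")
  .db("app")
  .collection<{ _id: string; title: string }>("articles");

async function createArticle(title: string) {
  const _id = nanoid(10); // short, URL-friendly, e.g. "V1StGXR8_Z"
  await articles.insertOne({ _id, title });
  return _id;
}
```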
Sharding strategies
Note: Sharding is only needed if you have a huge number of documents (or very heavy documents) that cannot be managed by one server. It takes quite a bit of effort to set up, so I would not recommend worrying about it until you are sure you actually need it.
I won't claim to be an expert on how best to shard data, but here are some situations we might consider:
An astronomical observatory or particle accelerator handles gigabytes of data per second. When an interesting event is detected, they may want to store a huge amount of data in only a few seconds. In this case, they probably want an even distribution of documents across the shards, so that each shard will be working equally hard to store the data, and no one shard will be overwhelmed.
You have a huge amount of data and you sometimes need to process all of it at once. In this case (but depending on the algorithm) an even distribution might again be desirable, so that all shards can work equally hard on processing their chunk of the data, before combining the results at the end. (Although in this scenario, we may be able to rely on MongoDB's balancer, rather than our shard key, for the even distribution. The balancer runs in the background after data has been stored. After collecting a lot of data, you may need to leave it to redistribute the chunks overnight.)
You have a social media app with a large amount of data, but this time many different users are making many light queries related mainly to their own data, or their specific friends or topics. In this case, it doesn't make sense to involve every shard whenever a user makes a little query. It might make sense to shard by userId (or by topic or by geographical region) so that all documents belonging to one user will be stored on one shard, and when that user makes a query, only one shard needs to do work. This should leave the other shards free to process queries for other users, so many users can be served at once.
Sharding documents by creation time (which the default ObjectIds will give you) might be desirable if you have lots of light queries looking at data for similar time periods. For example many different users querying different historical charts.
But it might not be so desirable if most of your users are querying only the most recent documents (a common situation on social media platforms) because that would mean one or two shards would be getting most of the work. Distributing by topic or perhaps by region might provide a flatter overall distribution, whilst also allowing related documents to clump together on a single shard.
You may like to read the official docs on this subject:
https://docs.mongodb.com/manual/sharding/#shard-key-strategy
https://docs.mongodb.com/manual/core/sharding-choose-a-shard-key/
I can think of one good reason to generate your own ID up front: idempotency. For example, it makes it possible to tell whether something worked or not after a crash. This approach works well when combined with retry logic.
Let me explain the reason people might consider retry logic:
Inter-app communication can sometimes fail for various reasons (especially in a microservice architecture). An app is more resilient and self-healing if it is coded to retry rather than give up right away. This rides over odd blips that might occur, without the consumer ever being affected.
For example, when dealing with Mongo, a request is sent to the DB to store some object; the DB saves it, but just as it is about to respond to the client to say everything worked fine, there is a network blip for whatever reason and the “OK” is never received. The app assumes the write didn't work, so it may end up retrying the same data and storing it twice, or worse, it just blows up.
Creating the ID up front is an easy, low-overhead way to make retry logic safe. Of course, one could think of other schemes too.
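A hedged sketch of that pattern with the Node.js MongoDB driver: the _id is generated once before the first attempt, so a retry after a lost “OK” just hits a duplicate-key error (code 11000) and can be treated as success.

```typescript
import { Collection, ObjectId, MongoServerError } from "mongodb";

async function saveWithRetry(orders: Collection, payload: Record<string, unknown>) {
  const _id = new ObjectId(); // generated up front, reused on every retry

  for (let attempt = 1; attempt <= 3; attempt++) {
    try {
      await orders.insertOne({ _id, ...payload });
      return _id; // stored on this attempt
    } catch (err) {
      if (err instanceof MongoServerError && err.code === 11000) {
        return _id; // an earlier attempt actually succeeded; nothing to redo
      }
      if (attempt === 3) throw err; // give up after the last retry
      // otherwise: transient blip, loop and retry with the same _id
    }
  }
  return _id;
}
```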
Although this sort of resiliency may be overkill in some types of projects, it really just depends.
I have used custom ids a couple of times and it was quite useful.
In particular, I had a collection where I stored stats by date, so the _id was actually a date in a specific format. I did that mostly because I would always query by date. Keep in mind that this approach can also simplify your indexes: no extra index is needed, because the default _id index already covers the date lookups.
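For instance (the date format and counter name here are invented), a daily stats document can be keyed and updated directly by its date string:

```typescript
import { Collection } from "mongodb";

// _id doubles as the date key, so no extra index is needed for date lookups.
async function recordPageView(stats: Collection<{ _id: string; pageViews?: number }>, day: string) {
  await stats.updateOne(
    { _id: day },                 // e.g. "2015-07-14"
    { $inc: { pageViews: 1 } },   // create-or-increment the day's counter
    { upsert: true }
  );
}
```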
Sometimes the ID is something more meaningful than a randomly generated one. For example, a user collection may use the email address as the _id instead. In my project I generate IDs that are much shorter than the ones MongoDB uses, so that the IDs shown in URLs are much shorter.
I'll use an example: I created a property management tool that had multiple collections. For simplicity, some fields were duplicated across collections, for example the payment. When I needed to update those records, the change had to happen simultaneously in every collection they appeared in, so I assigned them a custom payment id; when a delete/update/query action is performed, it changes every instance of that payment database-wide.