NoSQL Database Design with multiple indexes - mongodb

I have a DynamoDB/NoSQL/MongoDB question. I am from an RDBMS background and struggling to get a design right for a NoSQL DB, if anyone can help!
I have the following objects (tables in my terms):
Organisation
Users
Courses
Units
I want the following access points, of which most are achievable:
Get/Create/Update and Delete Organisation
Get/Create/Update and Delete Users
Get/Create/Update and Delete Courses
Which I can achieve.
The issue is that the Users and Courses objects have many ways to retrieve data:
email
username
For example: List Users on course.
List users for Org.
List courses for Org.
List users in org.
List users in unit.
All these use secondary indexes, which I semi-understand, but I also seem to need tertiary-ish indexes, though that is probably down to my design.
Coming from a relational methodology, I am not sure about reporting: how would it work if I wanted to search for all users on a course who have not completed it (call it a status flag)?
From what I understand, I need indexes for everything I want to search by?
AWS DynamoDB is my preference, but I'll happily consider another NoSQL database. I realise that I need more education regarding NoSQL, so please, if anyone can provide good documentation and examples which help the learning process, that would be awesome.
Regards
Richard
I have watched a few Udemy videos and been Googling for many weeks (oh, and checked here, obviously).

Things to keep in mind
Partitions
In DynamoDB everything is organized in partitions that give you hash-based access to elements. This is very powerful in terms of performance, but each partition has limits, so, similarly to the hash function in a hash map, the partition key should distribute the elements as evenly as possible.
Single Table Design
Don't split the data into multiple tables. This makes everything harder and actually limits the capabilities of the DB. Store everything in a single table.
Keys
Keys in dynamo have to be designed around your access patterns. This is the hardest part.
You have the Partition Key (Hash Key) -> this key has to be specified exactly every time. You can't perform a query without knowing the PK. This is why placing things like timestamps into the PK is a really bad idea.
Sort (Range) keys -> these are used for querying as specified in the AWS docs.
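To make that concrete, here is a minimal query sketch (Python/boto3; the table name, key names and key values are illustrative assumptions, not anything from the question):
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")  # hypothetical table name

response = table.query(
    KeyConditionExpression=(
        Key("PK").eq("Org#acme")              # the partition key must be given exactly
        & Key("SK").begins_with("User#")      # the sort key condition is optional (begins_with, between, <, > ...)
    )
)
items = response["Items"]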
Attribute names
DB migrations are really hard in NoSQL, so you have to use generic names for the attributes; they should not have any meaning.
For example, "UserID" is a bad name for a partition key and "PK" is a good name for a partition key; the same goes for all keys.
Indexes
You have two types of indexes, local and global.
Local Indexes are created once when you create the table and can't (easily) be changed afterward. You can only have a few of them. They give you an extra sort key to work with. The main benefit is that they are strongly consistent.
Global Indexes can be created at any time. They give you both a new partition key and a new sort key to work with, but are eventually consistent. Go with global indexes unless you have a good reason to use local ones.
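As a sketch of what that looks like in practice (boto3; the attribute and index names here are just one common convention, not required by DynamoDB):
import boto3

dynamodb = boto3.client("dynamodb")
dynamodb.create_table(
    TableName="AppTable",
    BillingMode="PAY_PER_REQUEST",
    AttributeDefinitions=[
        {"AttributeName": "PK", "AttributeType": "S"},
        {"AttributeName": "SK", "AttributeType": "S"},
        {"AttributeName": "GSI1PK", "AttributeType": "S"},
        {"AttributeName": "GSI1SK", "AttributeType": "S"},
    ],
    KeySchema=[
        {"AttributeName": "PK", "KeyType": "HASH"},    # partition key
        {"AttributeName": "SK", "KeyType": "RANGE"},   # sort key
    ],
    GlobalSecondaryIndexes=[
        {
            "IndexName": "GSI1",
            "KeySchema": [
                {"AttributeName": "GSI1PK", "KeyType": "HASH"},
                {"AttributeName": "GSI1SK", "KeyType": "RANGE"},
            ],
            "Projection": {"ProjectionType": "ALL"},
        }
    ],
)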
Going back to your problem, let's focus on one of the tables as an example - Users.
The user can be inserted like this (for example):
PK                    SK                     GSI1PK                 GSI1SK
Username#john123      Email#john@gmail.com   Email#john@gmail.com   Username#john123     <User Data>
This way you can query users by email and username. Keep in mind that PK and SK have to be unique pairs. SK in this case is free and can be used for other access patterns (which you didn't provide)
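With that layout, both lookups become straightforward queries. A hedged boto3 sketch, reusing the table and index names assumed above:
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("AppTable")

# By username -> hits the base table
by_username = table.query(
    KeyConditionExpression=Key("PK").eq("Username#john123")
)

# By email -> hits the global secondary index
by_email = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("Email#john@gmail.com"),
)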
Another way might be to copy the data
PK                     SK
Username#john123       Email#john@gmail.com   <user data>
Email#john@gmail.com   Username#john123       <user data>
This way you avoid having to deal with indexes (which can sometimes be expensive), but you have to keep the copies of the user data consistent manually.
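If you go the duplication route, a transactional write is one way to keep both copies in step (boto3 sketch; the item shape is assumed):
import boto3

client = boto3.client("dynamodb")
client.transact_write_items(
    TransactItems=[
        {
            "Put": {
                "TableName": "AppTable",
                "Item": {
                    "PK": {"S": "Username#john123"},
                    "SK": {"S": "Email#john@gmail.com"},
                    "Name": {"S": "John"},
                },
            }
        },
        {
            "Put": {
                "TableName": "AppTable",
                "Item": {
                    "PK": {"S": "Email#john@gmail.com"},
                    "SK": {"S": "Username#john123"},
                    "Name": {"S": "John"},
                },
            }
        },
    ]
)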
Further reading
-> https://www.alexdebrie.com/posts/dynamodb-single-table/
-> my medium post

Related

Storing primary key of Relational database in document of MongoDB

Suppose I have a table (customers) in Oracle with columns customer_id (PK), customer_name, customer_email, and customer_address, and I have a collection (products) in MongoDB which stores customer_id as one of its fields. Below is a sample document in the products collection, which stores the customer_id "customer123", the primary key of the customers table in the Oracle database.
{
  _id: "product124",
  customer_id: "customer123",
  product_name: "hairdryer"
}
My question is: is it a good idea to use different types of databases when a field like customer_id here is shared between them? Is it good practice in enterprise-level development?
Please ignore the use case, as I am just trying to give a simple example to provide better understanding of the problem.
I would say it is acceptable to use different databases in distributed systems and keep references between entities, but it really depends on the use case. If you plan to perform frequent and heavy joins between these 2 entities, then storing them in separate databases (especially of different types) might dramatically affect your performance. However, if your use case does not require frequently resolving relations, this approach could work. But bear in mind that you need to consider the future scale of your application and how this architectural decision would affect potential growth.
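For illustration, resolving such a cross-database reference usually ends up as a "join in application code", roughly like this sketch (sqlite3 stands in for Oracle here, and the connection details and names are placeholders):
import sqlite3
from pymongo import MongoClient

# look up the customer in the relational database
sql = sqlite3.connect("app.db")
customer = sql.execute(
    "SELECT customer_id, customer_name FROM customers WHERE customer_id = ?",
    ("customer123",),
).fetchone()

# then fetch the related documents from MongoDB using the shared key
mongo = MongoClient()
products = list(mongo.shop.products.find({"customer_id": customer[0]}))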

How to design the tables/entities in NoSQL DB?

I am taking my first steps with NoSQL databases, so I would like to hear best practices for implementing the following requirement.
Let's suppose I have a messages database powered by the MongoDB engine. This DB contains a collection of documents, where each document has the following fields:
timestamp;
message author/source;
message content.
Now, I want to build a list of authors/sources in order to add some metadata about each source. In the case of the classical RDBMS, I would define a table tblSources where I would store the names of the message sources and all additional meta-data (or links to the relevant tables) for each author.
What is the right approach to such task in NoSQL/MongoDB world?
It really depends on how you want to use the data. NoSQL dbs are generally not designed with fast joins in mind but they are still capable of doing joins and storing foreign keys.
Your options here are really (both sketched below):
Duplicate the data, i.e. store the author metadata in every document. This might be better when you are really trying to optimize lookups and use Mongo as a key-value store.
Join on a foreign key - this is pretty similar to how you would use an RDBMS.
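Both options from the list above, sketched with pymongo (the collection and field names are made up for the example):
from pymongo import MongoClient

db = MongoClient().messaging

# Option 1: duplicate the author metadata inside every message document
db.messages.insert_one({
    "ts": "2023-01-01T12:00:00Z",
    "content": "hello",
    "author": {"name": "alice", "affiliation": "example.org"},  # embedded copy
})

# Option 2: keep a separate sources collection and join on a foreign key
db.sources.insert_one({"_id": "alice", "affiliation": "example.org"})
db.messages.insert_one({"ts": "2023-01-02T09:00:00Z", "content": "hi", "author_id": "alice"})

joined = db.messages.aggregate([
    {"$lookup": {
        "from": "sources",
        "localField": "author_id",
        "foreignField": "_id",
        "as": "author",
    }}
])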

MongoDB beginner - to normalize or not to normalize?

I'm going to try and make this as straight-forward as I can.
Coming from MySQL and thinking in terms of tables, let's use the following example:
Let's say that we have a real-estate website and we're displaying a list of houses
normally, I'd use the following tables:
houses - the real estate asset at hand
owners - the owner of the house (one-to-many relationship with houses)
agencies - the real-estate broker agency (many-to-many relationship with houses)
images - many-to-one relationship with houses
reviews - many-to-one relationship with houses
I understand that MongoDB gives you the flexibility to design your web-app in different collections with unique IDs much like a relational database (normalized), and to enjoy quick selections, you can nest within a collection, related objects and data (un-normalized).
Back to our real-estate houses list: the query used to populate it is quite expensive in a normal relational DB. For each house you need to query its images, reviews, owner and agencies; each entity resides in a different table with its own fields, so you'd probably use joins and combine multiple queries into one - expensive!
Enter MongoDB, where you don't need joins and you can store all the related data of a house in a single item in the houses collection. Selection was never faster - it's DB heaven!
But what happens when you need to add/update/delete related reviews/agencies/owner/images?
This is a mystery to me. If I had to guess, each related entity exists in its own collection on top of its copy within the houses collection, and once one of these pieces of related data is added/updated/deleted you'll have to update it in its own collection as well as in the houses collection. Upon this update, do I need to query the other collections as well to make sure I'm updating the house record with all the updated related data?
I'm just guessing here and would really appreciate your feedback.
Thanks,
Ajar
Try this approach:
Work out which entity (or entities) are the hero(s)
With 'hero', I mean the entity(s) that the database is centered around. Let's take your example. The hero of the real-estate example is the house*.
Work out the ownerships
Go through the other entities, such as the owner, agency, images and reviews and ask yourself whether it makes sense to place their information together with the house. Would you have a cascading delete on any of the foreign keys in your relational database? If so, then that implies ownership.
Work out whether it actually matters that data is de-normalised
You will have agency (and probably owner) details spread across multiple houses. Does that matter?
Your house collection will probably look like this:
house: {
  owner,
  agency,
  images[],   // recommend references to GridFS here
  reviews[]   // you probably won't get too many of these for a single house
}
*Actually, it's probably the ad of the house (since houses are typically advertised on a real-estate website and that's probably what you're really interested in) so just consider that
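On the de-normalisation question: if agency details are embedded in every house document, correcting them later becomes a multi-document update rather than a single row update. A pymongo sketch (the field names are assumptions):
from pymongo import MongoClient

db = MongoClient().realestate

# rename the agency everywhere it was copied into a house document
db.houses.update_many(
    {"agency.agency_id": "acme-realty"},
    {"$set": {"agency.name": "ACME Realty Group"}},
)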
Sarah Mei wrote an informative article about the kinds of issues that can arise with data integrity in NoSQL DBs: the choice between duplicating data or using IDs, code-based joins, and the challenges of keeping data integrity. Her take is that any NoSQL DB with code-based joins will lose data integrity at some point. IMHO the article's comments are as valuable as the article itself for understanding these issues and possible resolutions.
Link: http://www.sarahmei.com/blog/2013/11/11/why-you-should-never-use-mongodb/comment-page-1/
I would just like to give a normalization refresher from MongoDB's perspective:
What are the goals of normalization?
Frees the database from modification anomalies - for MongoDB, it looks like embedding data would mostly cause this. In fact, we should try to avoid embedding data in documents in MongoDB in ways that could create these anomalies. Occasionally we might need to duplicate data in the documents for performance reasons; however, that's not the default approach. The default is to avoid it.
Should minimize re-design when extending - MongoDB is flexible enough because it allows the addition of keys without re-designing all the documents.
Avoid bias toward any particular access pattern - this is something we're not going to worry about when describing a schema in MongoDB. One of the ideas behind MongoDB is to tune your database to the application we're trying to write and the problem we're trying to solve.

MongoDB Schema Design ordering service

I have the following objects: Company, User and Order (which contains order lines). Users place orders with one or more order lines, and these relate to a Company. The time period for which orders can be placed for this Company is only a week.
What I'm not sure about is where to place the orders array: should it be a collection of its own containing a link to the User and a link to the Company, should it sit under the Company, or finally should the orders sit under the User?
Numbers wise I need to plan for 50k+ in orders.
Query-wise, I'll mainly be looking at Orders by Company, but I would also need to find Orders by Company for a specific user.
1) For folks coming from the SQL world (such as myself), one of the hardest things to learn about MongoDB is the new style of schema design. In the SQL world, everything goes into third normal form. Folks come to think that there is a single right way to design their schema, because there typically is one.
In the MongoDB world, there is no one best schema design. More accurately, in MongoDB schema design depends on how the application is going to access the data.
2) Here are the key questions that you need to have answered in order to design a good schema for MongoDB:
How much data do you have?
What are your most common operations? Will you be mostly inserting new data, updating existing data, or doing queries?
What are your most common queries?
How many I/O operations do you expect per second?
What you're talking about here is modeling Many-to-One relationships:
Company -> User
User -> Order
Order -> Order Lines
Company -> Order
Using SQL you would create a pair of master/detail tables with a primary key/foreign key relationship. In MongoDB, you have a number of choices: you can embed the data, you can create a linked relationship, you can duplicate and denormalize the data, or you can use a hybrid approach.
The correct approach would depend on a lot of details about the use case of your application, many of which you haven't provided.
3) This is my best guess - and it's only a guess - as to a good schema for you.
a) Have separate collections for Users, Companies, and Orders
If you're looking at 50k+ orders, there are too many to embed in a single document. Having them as a separate collection will allow you to reference them from both the Company and the User documents.
b) Have an array of references to the Order documents in both the Company and the User documents. This makes the query "Find all Orders for this Company" a single-document query
c) If your query pattern supports it, you might also have a duplicate link from Orders back to the owning Company and/or User.
d) Assuming that the order lines are unique to the individual Order, you would embed the Order Lines in an array within the Order documents.
e) If your order lines refer back to individual Products, you might want to have a separate Product collection, and include a reference to the Product document in the order line sub-document
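A pymongo sketch of points (a)-(e) above: separate collections, references in both directions, and order lines embedded in the Order document (all names are illustrative):
from pymongo import MongoClient

db = MongoClient().ordering

order_id = db.orders.insert_one({
    "company_id": "company42",
    "user_id": "user7",
    "lines": [  # order lines embedded in the order document
        {"product_id": "sku1", "qty": 2},
        {"product_id": "sku9", "qty": 1},
    ],
}).inserted_id

# keep back-references on the owning documents
db.companies.update_one({"_id": "company42"}, {"$push": {"order_ids": order_id}})
db.users.update_one({"_id": "user7"}, {"$push": {"order_ids": order_id}})

# "Orders by Company for a specific user" stays a single indexed query
orders = db.orders.find({"company_id": "company42", "user_id": "user7"})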
4) Here are some good general references on MongoDB schema design.
MongoDB presentations:
http://www.10gen.com/presentations/mongosf2011/schemabasics
http://www.10gen.com/presentations/mongosv-2011/schema-design-by-example
http://www.10gen.com/presentations/mongosf2011/schemascale
Here are a couple of books about MongoDB schema design that I think you would find useful:
http://www.manning.com/banker/ (MongoDB in Action)
http://shop.oreilly.com/product/0636920018391.do
Here are some sample schema designs:
http://docs.mongodb.org/manual/use-cases/
Note that the "MongoDB in Action" book includes a sample schema for an e-commerce application, which is very similar to what you're trying to build -- I recommend you check it out.

Am I missing something about Document Databases?

I've been looking at the rise of the NoSql movement and the accompanying rise in popularity of document databases like mongodb, ravendb, and others. While there are quite a few things about these that I like, I feel like I'm not understanding something important.
Let's say that you are implementing a store application, and you want to store in the database products, all of which have a single, unique category. In Relational Databases, this would be accomplished by having two tables, a product and a category table, and the product table would have a field (called perhaps "category_id") which would reference the row in the category table holding the correct category entry. This has several benefits, including non-repetition of data.
It also means that if you misspelled the category name, for example, you could update the category table and then it's fixed, since that's the only place that value exists.
In document databases, though, this is not how it works. You completely denormalize, meaning that in the "products" document you would actually have a value holding the actual category string, leading to lots of repetition of data, and errors are much more difficult to correct. Thinking about this more, doesn't it also mean that running queries like "give me all products with this category" can lead to results that do not have integrity?
Of course the way around this is to re-implement the whole "category_id" thing in the document database, but when I get to that point in my thinking, I realize I should just stay with relational databases instead of re-implementing them.
This leads me to believe I'm missing some key point about document databases that leads me down this incorrect path. So I wanted to put it to stack-overflow, what am I missing?
You completely denormalize, meaning in the "products" document, you would actually have a value holding the actual category string, leading to lots of repetition of data [...]
True, denormalizing means storing additional data. It also means fewer collections (tables in SQL), thus resulting in fewer relations between pieces of data. Each single document can contain the information that would otherwise come from multiple SQL tables.
Now, if your database is distributed across multiple servers, it's more efficient to query a single server instead of multiple servers. With the denormalized structure of document databases, it's much more likely that you only need to query a single server to get all the data you need. With a SQL database, chances are that your related data is spread across multiple servers, making queries very inefficient.
[...] and errors are much more difficult to correct.
Also true. Most NoSQL solutions don't guarantee things such as referential integrity, which are common in SQL databases. As a result, your application is responsible for maintaining relations between data. However, as the number of relations in a document database is very small, it's not as hard as it may sound.
One of the advantages of a document database is that it is schema-less. You're completely free to define the contents of a document at all times; you're not tied to a predefined set of tables and columns as you are with a SQL database.
Real-world example
If you're building a CMS on top of a SQL database, you'll either have a separate table for each CMS content type, or a single table with generic columns in which you store all types of content. With separate tables, you'll have a lot of tables. Just think of all the join tables you'll need for things like tags and comments for each content type. With a single generic table, your application is responsible for correctly managing all of the data. Also, the raw data in your database is hard to update and quite meaningless outside of your CMS application.
With a document database, you can store each type of CMS content in a single collection, while maintaining a strongly defined structure within each document. You could also store all tags and comments within the document, making data retrieval very efficient. This efficiency and flexibility comes at a price: your application is more responsible for managing the integrity of the data. On the other hand, the price of scaling out with a document database is much less, compared to a SQL database.
Advice
As you can see, both SQL and NoSQL solutions have advantages and disadvantages. As David already pointed out, each type has its uses. I recommend analyzing your requirements and creating two data models, one for a SQL solution and one for a document database. Then choose the solution that fits best, keeping scalability in mind.
I'd say that the number one thing you're overlooking (at least based on the content of the post) is that document databases are not meant to replace relational databases. The example you give does, in fact, work really well in a relational database. It should probably stay there. Document databases are just another tool to accomplish tasks in another way, they're not suited for every task.
Document databases were made to address the problem that (looking at it the other way around), relational databases aren't the best way to solve every problem. Both designs have their use, neither is inherently better than the other.
Take a look at the Use Cases on the MongoDB website: http://www.mongodb.org/display/DOCS/Use+Cases
A document db gives a feeling of freedom when you start. You no longer have to write create table and alter table scripts. You simply embed details in the master 'records'.
But after a while you realize that you are locked in a different way. It becomes less easy to combine or aggregate the data in a way that you didn't think was needed when you stored the data. Data mining/business intelligence (searching for the unknown) becomes harder.
That means that it is also harder to check if your app has stored the data in the db in a correct way.
For instance, you have two collections, each with approximately 10,000 'records'. Now you want to know which IDs are present in 'table' A but not present in 'table' B.
Trivial with SQL, a lot harder with MongoDB.
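One way to express it with MongoDB is to pull the IDs of B and exclude them from A in application code (pymongo sketch with placeholder names; for larger collections an aggregation with $lookup may be preferable):
from pymongo import MongoClient

db = MongoClient().mydb  # placeholder database name

ids_in_b = db.B.distinct("_id")                                  # all ids present in 'table' B
only_in_a = db.A.find({"_id": {"$nin": ids_in_b}}, {"_id": 1})   # ids in A that are not in B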
But I like MongoDB !!
OrientDB, for example, supports schema-less, schema-full or mixed mode. In some contexts you need constraints, validation, etc., but you still want the flexibility to add fields without touching the schema. This is the mixed schema mode.
Example:
{
  '#rid': '10:3',
  '#class': 'Customer',
  '#ver': 3,
  'name': 'Jay',
  'surname': 'Miner',
  'invented': [ 'Amiga' ]
}
In this example the fields "name" and "surname" are mandatory (by defining them in the schema), but the field "invented" has been created only for this document. Your app doesn't need to know about it in advance, but you can execute queries against it:
SELECT FROM Customer WHERE invented IS NOT NULL
It will return only the documents with the field "invented".