I am currently trying to model a MongoDB database structure where the entities are very complex in relation to each other.
In my current collections, MongoDB queries are difficult or impossible to put into a single aggregation. Incidentally, I'm not a database specialist and have been working with MongoDB for only about half a year.
To keep it as simple as possible, here is my challenge:
I have newspaper articles that contain simple keywords, works (oeuvres, books, movies), persons and linked combinations of works and persons. In addition, the same people appear under different names in different articles.
Later, on the person view I want to show the following:
the links of the person with name and work and the respective articles
the articles in which the person appears without a work (by name)
the other keywords that are still in the article
In my structure I want to avoid storing entities such as people more than once. So these are my current collections:
Article
  id
  title
  keywordRelations
KeywordRelation
  id
  type (single or combination)
  simpleKeywordId (optional)
  personNameConnectionIds (optional)
  workIds (optional)
SimpleKeyword
  id
  value
PersonNameConnection
  id
  personId
  nameInArticleId
Person
  id
  firstname
  lastname
NameInArticle
  id
  name
  type (e.g. abbreviation, synonym)
Work
  id
  title
To meet the requirements, I would always have to create queries that range over 3 to 4 tables. Is that possible and useful with MongoDB?
Or is there an easier way and structure to achieve that?
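To make the structure above concrete, here is a minimal in-memory sketch in Python of the listed collections and of the lookups the person view needs. All sample values (names, titles, ids) are hypothetical, and plain dictionaries stand in for MongoDB collections; it only illustrates how many hops the current layout requires, not how the queries would be written in MongoDB.

```python
# Hypothetical sample documents mirroring the collections listed above.
persons = {"p1": {"firstname": "Jane", "lastname": "Austen"}}
names_in_article = {
    "n1": {"name": "J. Austen", "type": "abbreviation"},
    "n2": {"name": "Jane Austen", "type": "synonym"},
}
person_name_connections = {
    "c1": {"personId": "p1", "nameInArticleId": "n1"},
    "c2": {"personId": "p1", "nameInArticleId": "n2"},
}
works = {"w1": {"title": "Emma"}}
simple_keywords = {"k1": {"value": "literature"}}
keyword_relations = {
    "r1": {"type": "combination", "personNameConnectionIds": ["c1"], "workIds": ["w1"]},
    "r2": {"type": "single", "personNameConnectionIds": ["c2"]},
    "r3": {"type": "single", "simpleKeywordId": "k1"},
}
articles = {
    "a1": {"title": "Review of Emma", "keywordRelations": ["r1", "r3"]},
    "a2": {"title": "Authors of the century", "keywordRelations": ["r2"]},
}


def person_view(person_id):
    """Collect the three lists the person view should show."""
    view = {"with_work": [], "without_work": [], "other_keywords": []}
    for article in articles.values():
        relations = [keyword_relations[r] for r in article["keywordRelations"]]
        mentioned = False
        for rel in relations:
            for conn_id in rel.get("personNameConnectionIds", []):
                conn = person_name_connections[conn_id]
                if conn["personId"] != person_id:
                    continue
                mentioned = True
                name = names_in_article[conn["nameInArticleId"]]["name"]
                titles = [works[w]["title"] for w in rel.get("workIds", [])]
                if titles:
                    view["with_work"].append((name, titles, article["title"]))
                else:
                    view["without_work"].append((name, article["title"]))
        if mentioned:
            # Simple keywords that share an article with the person.
            for rel in relations:
                if "simpleKeywordId" in rel:
                    keyword = simple_keywords[rel["simpleKeywordId"]]["value"]
                    view["other_keywords"].append(keyword)
    return view
```

Every entry in the view touches Article, KeywordRelation, PersonNameConnection and NameInArticle (plus Work or SimpleKeyword), which is the 3 to 4 collection span described above.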
So in a traditional database I might have 2 tables like users, company
users:
id | username | companyid | email
1 | j23 | 1 | something#gmail.com
2 | fj222 | 1 | james#aol.com

company:
id | ownerid | company_name
1 | 1 | A Really boring company
This is to say that users 1 and 2 are a part of company 1 (a really boring company) and user 1 is the owner of this company.
I could easily issue an update statement in MySQL or PostgreSQL to update the company name.
But how could I model the same data from a NoSQL perspective, in something like DynamoDB or MongoDB?
Would each user record (document in NoSQL) contain the same company table data (id, ownerid or an is-owner true/false flag, and company name)? I'm unclear how I would then update the record for all users containing this data if the company name needed to be updated.
If you want to save the company object as JSON in each user document (for performance reasons), then yes, you have to update a lot of documents.
But the best way to achieve this is to keep a structure similar to the one you have above in MySQL. A NoSQL schema depends a lot on the queries you will be making.
For example, the schema above is great for:
Find a particular user by username, along with his company name. First you need to query User by username (you can add an index), get the companyId and do another query on Company to fetch the name.
Let's assume company name changes often
In this case company name update is easy. To execute the read query, you need 2 queries to get your result (but they should execute fast)
Embedded company JSON would work better for:
Find all users from a specific city and show their company name
Let's assume company name changes very rarely
In this case, the "relational" approach is a poor fit, because we would run 1 query to fetch Users by city and then another query for each user found to fetch the company name
Using embedded approach, we need only 1 query
To update a company name, a full (expensive) scan is needed, but should be ok if done rarely
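The trade-off between the two layouts can be sketched in a few lines of Python. This is an in-memory illustration only: lists and dictionaries stand in for collections, the `city` values are made up (the original tables have no city column), and the function names are hypothetical.

```python
# Referenced layout: each user points to a company by id.
companies = {1: {"ownerid": 1, "company_name": "A Really boring company"}}
users_ref = [
    {"id": 1, "username": "j23", "companyid": 1, "city": "Berlin"},
    {"id": 2, "username": "fj222", "companyid": 1, "city": "Paris"},
]

# Embedded layout: each user carries a copy of the company data.
users_emb = [
    {"id": 1, "username": "j23", "city": "Berlin",
     "company": {"id": 1, "name": "A Really boring company"}},
    {"id": 2, "username": "fj222", "city": "Paris",
     "company": {"id": 1, "name": "A Really boring company"}},
]


def company_name_for(username):
    """Referenced layout: two lookups (user, then company)."""
    user = next(u for u in users_ref if u["username"] == username)
    return companies[user["companyid"]]["company_name"]


def users_in_city(city):
    """Embedded layout: one pass returns users together with company names."""
    return [(u["username"], u["company"]["name"]) for u in users_emb if u["city"] == city]


def rename_referenced(company_id, new_name):
    """Referenced layout: exactly one document changes."""
    companies[company_id]["company_name"] = new_name
    return 1


def rename_embedded(company_id, new_name):
    """Embedded layout: every user of the company must be rewritten."""
    touched = 0
    for user in users_emb:
        if user["company"]["id"] == company_id:
            user["company"]["name"] = new_name
            touched += 1
    return touched
```

The return values of the two rename functions show the cost difference: one write in the referenced layout, one write per user in the embedded layout.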
What if the company name changes often and I want to get users by city?
This becomes tricky. NoSQL is not a replacement for SQL; it has its shortcomings. The solution may be a platform-dependent feature (from Mongo, DynamoDB, Firestore etc.), an additional layer on top (Elasticsearch) or no solution at all (consider not using key-value NoSQL)
Depending on the programming language used to handle NoSQL objects/documents, you have a variety of ORM libraries to model your schema. E.g. for MongoDB plus JS/TypeScript I recommend Mongoose and its subdocuments. Here is more about it:
https://mongoosejs.com/docs/subdocs.html
I am new to the NoSQL world. I am building a serverless app with DynamoDB. In a relational DB, when I have 3 entities like post, post_likes and post_tags, I would have a few tables and use joins to fetch data. But I wonder how one should build a NoSQL structure for a scenario where a post has a one-to-many relationship with likes and a many-to-many relationship with tags.
Post model:
user_id <string>
attachment_url <string>
description <string>
public <boolean>
Like model:
user_id <string>
post_id <string>
type <string>
Tag model:
name <string>
I have a few access patterns:
Get all public posts
Get all posts filtered by a single tag and public status
Get all posts by user id
Get a single post by post id
And each time a post should be fetched with tags data, and likes data including user data that is attached to a like.
In a relational DB I would create a post_tags table and fetch all posts by tag. But how can I do this with DynamoDB?
I am struggling to figure out what my table should look like and what to set as the partition and sort keys among the post_id, user_id, tag_name or public fields for this case.
My initial thought was to build a table with entity that would look like this:
Partition key | Sort key | data attributes
tag_name | post_id | public | user_id | likes[] | other post attributes...
Then this table would look something like this:
I have set the 2 Global secondary indexes.
First Global secondary index:
partition key set to public and sort key to post_id
Second Global secondary index:
partition key set to user_id and sort key to post_id
That way for each tag a post has, I would have a duplicate of that post in the table. I thought by having a tag as a first filter, that way I could query efficiently posts if I need to query them by a tag.
But, if I do a query by just a public status or user_id, I would get all the duplicates of posts for each tag they belong to.
Or should I have 3 separate entities in the table (tags, posts and likes)? If I fetch a post by a tag, I would first do one query to find all post_ids by tag, then a second query to fetch the posts and their like ids, and then a third query to fetch the likes array.
I don't know what the best practice is when it comes to these things, since I have only just started using DynamoDB.
What should this DB structure look like then?
You're off to a great start by thinking deeply about your access patterns and defining your entities (Posts, Users, Likes, etc). As you know, having a thorough understanding of your access patterns is critical to storing your data in DynamoDB.
While reviewing my answer, keep in mind that this is only one solution. DynamoDB gives you a ton of flexibility when defining your data model, which can be both a blessing and a curse! This answer is not meant to be the way to model these access patterns. Instead, it's one way that these access patterns can be implemented. Let's get into it!
I like to start by listing the entities we need to model, as well as the Primary key for each. Throughout this post, I'll be using composite primary keys, which are keys made up of a Partition Key (PK) and a Sort Key (SK). Let's start out with a blank table and fill it out as we go.
Entity | Partition Key | Sort Key
User | |
Post | |
Tag | |
Users
Users are central to your application, so I'll start there.
Let's start by defining a User model that lets us identify a User by ID. I'll use the pattern USER#<user_id> for the PK and SK of the User entity.
This supports the following access patterns (examples in pseudocode for simplicity):
Fetch User by ID
ddbClient.query(PK = USER#1, SK = USER#1)
I'll update the table with the new PK/SK pattern for Users
Entity | Partition Key | Sort Key
User | USER#<user_id> | USER#<user_id>
Post | |
Tag | |
Posts
I'll start modeling Posts by focusing on the one-to-many relationship between Users and their Posts.
You have an access pattern to fetch All Posts by UserId, so I'll start by adding the Post model to the User partition. I'll do this by defining a PK of USER#<user_id> and an SK of POST#<post_id>.
This supports the following access patterns:
Fetch User and all Posts
ddbClient.query(PK = USER#<user_id>)
Fetch User Posts
ddbClient.query(PK = USER#<user_id>, SK begins_with "POST#")
You may wonder about the odd-looking Post IDs. When fetching Posts, you'll probably want to get the most recent Posts first. You also want to be able to uniquely identify Posts by ID. When you have this sort of requirement, you can use a KSUID as your unique identifier. Explaining KSUIDs is a bit out of scope for your question, but know that they are unique and sortable by the time they were created. Since DynamoDB sorts results by the Sort Key, your query for a user's posts will automatically be sorted by creation date!
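To illustrate why a time-prefixed ID sorts by creation time, here is a small Python sketch. It is not the real KSUID encoding; it is a simplified stand-in (a fixed-width hexadecimal timestamp followed by random bytes), and the function name `sortable_id` is made up for this example.

```python
import os
import time


def sortable_id(timestamp=None):
    """Return a 40-character id: 8 hex chars of seconds, then 32 random hex chars."""
    seconds = int(time.time() if timestamp is None else timestamp)
    # Fixed-width hex keeps string order identical to numeric order.
    return f"{seconds:08x}{os.urandom(16).hex()}"


# Two ids created one minute apart (timestamps passed explicitly for the demo).
older = sortable_id(timestamp=1_700_000_000)
newer = sortable_id(timestamp=1_700_000_060)
```

Because the timestamp comes first and has a fixed width, comparing the strings compares the creation times, which is the property the sort key relies on. Note that a DynamoDB query returns sort keys in ascending order by default, so reading newest-first means reading the range in reverse (the `ScanIndexForward` query parameter set to false).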
Updating the PK/SK patterns for your application, we now have
Entity | Partition Key | Sort Key
User | USER#<user_id> | USER#<user_id>
Post | USER#<user_id> | POST#<post_id>
Tag | |
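The User and Post key patterns can be tried out without a database. The Python sketch below simulates a table as a list of items and a query as a filter on the partition key plus an optional sort key prefix; the item attributes and the `query` helper are hypothetical and only mimic the pseudocode queries used in this answer.

```python
# Items keyed with the PK/SK patterns defined so far (sample values are made up).
table = [
    {"PK": "USER#1", "SK": "USER#1", "name": "hypothetical user"},
    {"PK": "USER#1", "SK": "POST#0001", "description": "first post"},
    {"PK": "USER#1", "SK": "POST#0002", "description": "second post"},
    {"PK": "USER#2", "SK": "USER#2", "name": "another user"},
]


def query(items, pk, sk_prefix=""):
    """Return the items in one partition, optionally filtered by SK prefix, sorted by SK."""
    matches = [i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)]
    return sorted(matches, key=lambda i: i["SK"])


user_and_posts = query(table, "USER#1")        # the User item and all of its Posts
only_posts = query(table, "USER#1", "POST#")   # just the Posts
```

Both results come from a single partition, which is what makes the "Fetch User and all Posts" pattern a single query.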
Tags
We have a few options on how to model the one-to-many relationship between Posts and Tags. You could include a list attribute on your Post item, which simply lists the tags on the item. This approach is perfectly fine. However, looking at your other access patterns, I'm going to take a different approach for now (it will be apparent why later).
I will model tags with a PK of POST#<post_id> and an SK of TAG#<tag_name>
Since Primary Keys are unique, modeling tags in this way will ensure that no Post is tagged with the same Tag twice. Additionally, it allows us to have an unbounded number of Tags on a Post.
Updating our PK/SK table for Tag, we have
Entity | Partition Key | Sort Key
User | USER#<user_id> | USER#<user_id>
Post | USER#<user_id> | POST#<post_id>
Tag | POST#<post_id> | TAG#<tag_name>
At this point we've modeled Users, Posts and Tags. However, we've only addressed one of your four access patterns. Let's see how we can use secondary indexes to support your access patterns.
Note: You could also model Likes in the exact same way.
Defining A Secondary Index
Secondary indexes allow you to support additional access patterns within your data. Let's define a very simple secondary index and see how it supports your various access patterns.
I'm going to create a secondary index that swaps the PK/SK patterns in your base table. This pattern is called an inverted index, and would look like this:
All we've done here is swapped the PK/SK pattern of your base table, which has given us access to two additional access patterns:
Fetch Post by ID
ddbClient.query(IndexName = InvertedIndex, PK = POST#<post_id>)
Fetch Posts by Tag
ddbClient.query(IndexName = InvertedIndex, PK = TAG#<tag_name>)
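Here is a rough Python sketch of what the inverted index does, again with a list of items standing in for the table. The sample keys and the `query_inverted` helper are hypothetical; a real secondary index is maintained by DynamoDB itself rather than computed at query time.

```python
# Base table items using the PK/SK patterns from the table above.
base_table = [
    {"PK": "USER#1", "SK": "USER#1"},
    {"PK": "USER#1", "SK": "POST#0001"},
    {"PK": "USER#2", "SK": "POST#0002"},
    {"PK": "POST#0001", "SK": "TAG#dynamodb"},
    {"PK": "POST#0002", "SK": "TAG#dynamodb"},
    {"PK": "POST#0002", "SK": "TAG#nosql"},
]


def query_inverted(items, index_pk):
    """Query the inverted index: its partition key is the base table's SK,
    and its sort key is the base table's PK."""
    matches = [i for i in items if i["SK"] == index_pk]
    return sorted(matches, key=lambda i: i["PK"])


post_by_id = query_inverted(base_table, "POST#0001")       # finds the Post item
posts_by_tag = query_inverted(base_table, "TAG#dynamodb")  # finds the Tag items
```

In this sketch the tag query returns the Tag items, whose base PK carries the post ID. Getting the full Post attributes from there would take a follow-up read, or copying the needed attributes onto the Tag item; which one fits depends on your application.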
Fetch All Posts by Public/Private status
You wanted to fetch posts by public/private status, as well as fetching all Posts. One way to fetch all Posts is to put them in a single partition. We can put the public/private status in the sort key to separate the public and private Posts.
To do this, I'll create two new attributes on the Post item: _type and publicPostId. These fields will serve as the PK/SK patterns for the secondary index I'm calling PostByStatus.
After doing this, your base table would look like this:
and your new secondary index would look like this
This secondary index would enable the following access patterns
Fetch All Posts
ddbClient.query(IndexName = PostByStatus, PK = POST)
Fetch All Private Posts
ddbClient.query(IndexName = PostByStatus, PK = POST, SK begins_with "PRIVATE#")
Fetch All Public Posts
ddbClient.query(IndexName = PostByStatus, PK = POST, SK begins_with "PUBLIC#")
Remember, Post IDs are KSUIDs, so they will naturally be sorted in your results by the date the Post was made.
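The PostByStatus index can be simulated the same way. In the Python sketch below, the `_type` and `publicPostId` attribute names come from the text above, while the value format (`PUBLIC#<post_id>` or `PRIVATE#<post_id>`) is an assumption inferred from the `begins_with` queries, and the helper name is made up.

```python
# Post items carry the two extra attributes; the User item does not.
items = [
    {"PK": "USER#1", "SK": "POST#0001", "_type": "POST", "publicPostId": "PUBLIC#0001"},
    {"PK": "USER#1", "SK": "POST#0002", "_type": "POST", "publicPostId": "PRIVATE#0002"},
    {"PK": "USER#2", "SK": "POST#0003", "_type": "POST", "publicPostId": "PUBLIC#0003"},
    {"PK": "USER#1", "SK": "USER#1"},  # lacks the index keys, so it is not in the index
]


def query_post_by_status(all_items, pk, sk_prefix=""):
    """Query the simulated index: PK is `_type`, SK is `publicPostId`."""
    matches = [
        i for i in all_items
        if i.get("_type") == pk and i.get("publicPostId", "").startswith(sk_prefix)
    ]
    return sorted(matches, key=lambda i: i["publicPostId"])


all_posts = query_post_by_status(items, "POST")
public_posts = query_post_by_status(items, "POST", "PUBLIC#")
private_posts = query_post_by_status(items, "POST", "PRIVATE#")
```

Items that lack the index key attributes simply do not appear in the index, which is why the User item never shows up in these results.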
A Word on Hot Partitions
Storing all your Posts in a single partition will likely result in a hot partition as your application scales. One way to address this is by distributing your Post items across multiple partitions. How you do that is entirely up to you and specific to your application.
One strategy to avoid the single POST partition could involve grouping Posts by creation day/week/month/etc. For example, instead of using POST as your PK in the PostByStatus secondary index, you could use POSTS#<month>-<year> instead, which would look like this:
Your application would need to take this pattern into account when fetching Posts (e.g. start at the current month and go backwards until enough results are fetched), but you'd be spreading the load across multiple partitions.
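The "start at the current month and go backwards" logic might look roughly like the following Python sketch. The partition key format (`POSTS#MM-YYYY`), the sample data, the `max_months` cut-off and the function names are all assumptions for illustration; a dictionary stands in for the secondary index.

```python
from datetime import date

# Simulated index: one partition per month, each holding sort key values.
index = {
    "POSTS#03-2024": ["PUBLIC#0007", "PUBLIC#0008"],
    "POSTS#02-2024": ["PUBLIC#0005"],
    "POSTS#01-2024": ["PUBLIC#0001", "PUBLIC#0002"],
}


def previous_month(d):
    """Return the first day of the month before d."""
    return date(d.year - 1, 12, 1) if d.month == 1 else date(d.year, d.month - 1, 1)


def fetch_recent(partitions, start, wanted, max_months=12):
    """Walk month partitions backwards from `start` until `wanted` posts are found."""
    results = []
    month = date(start.year, start.month, 1)
    for _ in range(max_months):
        pk = f"POSTS#{month.month:02d}-{month.year}"
        # Newest first within each partition.
        results.extend(sorted(partitions.get(pk, []), reverse=True))
        if len(results) >= wanted:
            break
        month = previous_month(month)
    return results[:wanted]


latest_three = fetch_recent(index, date(2024, 3, 15), wanted=3)
```

Each loop iteration corresponds to one query against a different partition, so quiet months cost an extra round trip; the `max_months` limit keeps the walk from running indefinitely when there are fewer posts than requested.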
Wrapping Up
I hope this exercise gives you some ideas on how to model your data to support specific access patterns. Data modeling in DynamoDB takes time to get right, and will likely require multiple iterations to make work for your specific application. It can be a steep learning curve, but the payoff is a solution that brings scale and speed to your application.
I am building a Postgres database which has the following two tables:
Projects (id, startDate, etc...)
and
Employees (id, name, etc...)
I want to keep track of the types of contributions that an employee adds to a project. For example, employee #1 might be an "engineer" on project 1 and a "manager" on project 2. I also don't want to restrict the number of contributions an employee can make to a certain project. So employee #1 could be both an "engineer" and a "manager" for a single project.
My first instinct was just to have a many-to-many relation between the two, titled ProjectEmployees or something, and store the projectId, employeeId, and a contributionType as a string which would only take on values from an enum, so as not to have to deal with misspellings or any related issues.
My main question is just whether or not this is a bad practice. My other thought was to split up each contribution type into its own table. So instead of an EmployeeProjects table, there would be tables such as ProjectEngineers, ProjectManagers, etc., and instead of storing the contributionType as a column, it would be implicit in the table I'm using, and the table only has to store projectId and employeeId. There are many more tables in this database which have a similar sort of relationship, where there are many-to-many relations between different tables but each relation could be one of many "types" of relations. Is it wiser to split these all into separate tables for each type of relation? Or is it better to just keep track of the relation type in a more general table like my first idea?
My desired result is just to be able to efficiently see all the project contributions (and their types) an employee has made, as well as all contributors and contributor types for a project.
Use the many to many relation as in your first idea, which in my opinion is a good practice.
Avoid creating one table per contribution type, as it is neither scalable nor flexible. For example, if one day you have a new contribution type, with the second option you will each time need to:
create a new table
write the new table's management logic
deploy a new release of your software
As for storing the contribution types in a table (with id and description) versus as a constraint enumerating the allowed contribution type strings, in my opinion both are valid solutions.
But if you plan to manage contribution types in your software (in a first release or in the future), a lookup table of contribution types may be better. It depends on your design and requirements.
Make a table to store contribution types as strings (manager, engineer, etc) and contribution type id (numeric id). This prevents misspellings.
Make a table to store contributions with columns: employee id, project id, contribution type id (you may want other columns there, but it should be unique on the combination of these 3 columns). Do not store contribution types as strings in a table like this, since, as you correctly mentioned, this may allow misspellings. Another reason is to save disk space. An extra join with a small table of contribution types is a small price to pay.
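As a rough illustration of this layout, the sketch below builds the tables in an in-memory SQLite database from Python. SQLite is only a stand-in here so the example runs without a server; the table and column names are hypothetical, and the Postgres DDL would differ in details such as column types.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite only enforces foreign keys when asked
conn.executescript("""
CREATE TABLE employees (id INTEGER PRIMARY KEY, name TEXT NOT NULL);
CREATE TABLE projects (id INTEGER PRIMARY KEY, start_date TEXT);
CREATE TABLE contribution_types (id INTEGER PRIMARY KEY, name TEXT NOT NULL UNIQUE);
CREATE TABLE contributions (
    employee_id INTEGER NOT NULL REFERENCES employees(id),
    project_id INTEGER NOT NULL REFERENCES projects(id),
    contribution_type_id INTEGER NOT NULL REFERENCES contribution_types(id),
    UNIQUE (employee_id, project_id, contribution_type_id)
);
INSERT INTO employees VALUES (1, 'Employee One');
INSERT INTO projects VALUES (1, '2024-01-01'), (2, '2024-02-01');
INSERT INTO contribution_types VALUES (1, 'engineer'), (2, 'manager');
INSERT INTO contributions VALUES (1, 1, 1), (1, 1, 2), (1, 2, 2);
""")


def contributions_for_employee(employee_id):
    """List (project_id, contribution type) pairs for one employee."""
    rows = conn.execute(
        """SELECT c.project_id, t.name
           FROM contributions c
           JOIN contribution_types t ON t.id = c.contribution_type_id
           WHERE c.employee_id = ?
           ORDER BY c.project_id, t.name""",
        (employee_id,),
    )
    return rows.fetchall()


def try_insert(employee_id, project_id, type_id):
    """Insert a contribution; return False if a constraint rejects it."""
    try:
        with conn:
            conn.execute(
                "INSERT INTO contributions VALUES (?, ?, ?)",
                (employee_id, project_id, type_id),
            )
        return True
    except sqlite3.IntegrityError:
        return False
```

The unique constraint rejects a repeated (employee, project, type) row, and the foreign key on `contribution_type_id` rejects a type that is not in the lookup table, which is what rules out misspellings.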
I am trying to come up with a MongoDB document model and would like others' opinions. I want to have a document that represents an Employee. This document will contain all attributes of an employee (i.e. firstName, lastName). Now, where I am stuck, coming from the relational realm, is the need to store a list of employees an employee can access. In other words, let's say Employee A is a manager. I need to store the direct reports that he manages, in order to use this in various applications. In a relational database I would have a mapping table that tied an employee to many employees. With Mongo not being able to join documents, do you think I should use an embedded sub-document to store the list of accessible employees as part of the Employee document? Any other ideas?
Unless you're using employee groups (Accounting, HR, etc.), you'll probably be fine adding the employee name, Mongo ObjectId, and any other information unique to that manager/employee relationship as a sub-document to the manager's document.
With that in place you could probably do your reporting on these relationships through a simple aggregation.
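For what it's worth, here is roughly what such documents could look like, sketched as plain Python dictionaries. The field names (`directReports`, `employeeId`) and the sample employees are hypothetical, and the "aggregation" is just a dictionary comprehension standing in for whatever the database would run.

```python
# Each manager document embeds a small sub-document per direct report.
employees = [
    {"_id": "e1", "firstName": "Ann", "lastName": "Lee",
     "directReports": [
         {"employeeId": "e2", "name": "Bob Ray"},
         {"employeeId": "e3", "name": "Cy Poe"},
     ]},
    {"_id": "e2", "firstName": "Bob", "lastName": "Ray", "directReports": []},
    {"_id": "e3", "firstName": "Cy", "lastName": "Poe", "directReports": []},
]


def accessible_ids(manager_id):
    """Return the ids of the employees a manager can access."""
    manager = next(e for e in employees if e["_id"] == manager_id)
    return [report["employeeId"] for report in manager["directReports"]]


def report_counts():
    """Count direct reports per manager, skipping employees with none."""
    return {e["_id"]: len(e["directReports"]) for e in employees if e["directReports"]}
```

The copied `name` inside each sub-document saves a second lookup when listing reports, at the cost of having to update it if the employee's name changes.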
This is all IMHO, and begs the question; Is simple aggregation another oxymoron like military intelligence?
I'm willing to give MongoDB and CouchDB a serious try. So far I've worked a bit with Mongo, but I'm also intrigued by Couch's RESTful approach.
Having worked for years with relational DBs, I still don't get what is the best way to get some things done with non relational databases.
For example, if I have 1000 car shops and 1000 car types, I want to specify what kind of cars each shop sells. Each car has 100 features. Within a relational database I'd make a middle table to link each car shop with the car types it sells via IDs. What is the NoSQL approach? If every car shop sells 50 car types, it means replicating a huge amount of data if I have to store all the features of all the car types it sells within the car shop!
Any help appreciated.
I can only speak to CouchDB.
The best way to stick your data in the db is to not normalize it at all beyond converting it to JSON. If that data is "cars" then stick all the data about every car in the database.
You then use map/reduce to create a normalized index of the data. So, if you want an index of every car, sorted first by shop, then by car-type you would emit each car with an index of [shop, car-type].
Map reduce seems a little scary at first, but you don't need to understand all the complicated stuff or even btrees, all you need to understand is how the key sorting works.
http://wiki.apache.org/couchdb/View_collation
With that alone you can create amazing normalized indexes over differing documents with the map reduce system in CouchDB.
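The key-sorting idea can be tried out in a few lines of Python. This is only a stand-in: real CouchDB map functions are written in JavaScript and call `emit(key, value)`, and CouchDB's collation rules differ from Python's list comparison in some cases. The documents, field names and helper functions below are made up for the example.

```python
# Hypothetical car documents, stored without any normalization.
cars = [
    {"_id": "c1", "shop": "shop-b", "car_type": "sedan"},
    {"_id": "c2", "shop": "shop-a", "car_type": "suv"},
    {"_id": "c3", "shop": "shop-a", "car_type": "sedan"},
]


def map_car(doc):
    """Emit one row per car, keyed by [shop, car-type]."""
    yield ([doc["shop"], doc["car_type"]], doc["_id"])


def build_view(docs):
    """Run the map function over every document and sort the rows by key."""
    rows = [row for doc in docs for row in map_car(doc)]
    return sorted(rows, key=lambda row: row[0])


def range_query(view, startkey, endkey):
    """Return the values whose keys fall between startkey and endkey."""
    return [value for key, value in view if startkey <= key <= endkey]


view = build_view(cars)
# All cars sold by shop-a: every key that starts with "shop-a".
shop_a_cars = range_query(view, ["shop-a"], ["shop-a", "\uffff"])
```

Because the rows are sorted first by shop and then by car type, everything for one shop sits in a contiguous range, so a start key and an end key are enough to pull it out.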
In MongoDB an often-used approach would be to store a list of _ids of car types in each car shop. So there is no separate join table, but you are still basically doing a client-side join.
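A minimal Python sketch of that client-side join, with dictionaries standing in for the two collections (the field name `car_type_ids` and the sample data are hypothetical):

```python
# Car types are stored once; shops only hold references to them.
car_types = {
    "t1": {"name": "sedan", "features": ["abs", "airbags"]},
    "t2": {"name": "suv", "features": ["abs", "4wd"]},
    "t3": {"name": "coupe", "features": ["abs"]},
}
car_shops = {
    "s1": {"name": "Shop One", "car_type_ids": ["t1", "t2"]},
    "s2": {"name": "Shop Two", "car_type_ids": ["t2", "t3"]},
}


def car_types_sold_by(shop_id):
    """Client-side join: read the shop, then look up each referenced car type."""
    shop = car_shops[shop_id]                                    # first read
    return [car_types[t]["name"] for t in shop["car_type_ids"]]  # second read


def shops_selling(type_id):
    """Reverse direction: find shops whose id list contains the car type."""
    return sorted(s for s, shop in car_shops.items() if type_id in shop["car_type_ids"])
```

The 100 features of each car type live in one place only, so a shop that sells 50 car types stores 50 ids rather than 50 copies of the feature data. In MongoDB the second read would typically be a single query matching the list of ids (for example with the `$in` operator) rather than one lookup per id.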
Embedded documents become more relevant for cases that aren't many-to-many like this.
Coming from a HBase/BigTable point of view, typically you would completely denormalize your data, and use a "list" field, or multidimensional map column (see this link for a better description).
The word "column" is another loaded word like "table" and "base" which carries the emotional baggage of years of RDBMS experience.
Instead, I find it easier to think about this like a multidimensional map - a map of maps if you will.
For your example for a many-to-many relationship, you can still create two tables, and use your multidimensional map column to hold the relationship between the tables.
See the FAQ question 20 in the Hadoop/HBase FAQ:
Q: [Michael Dagaev] How would you design an Hbase table for many-to-many association between two entities, for example Student and Course?
I would define two tables:
Student: student id, student data (name, address, ...), courses (use course ids as column qualifiers here)
Course: course id, course data (name, syllabus, ...), students (use student ids as column qualifiers here)
Does it make sense?
A: [Jonathan Gray] Your design does make sense. As you said, you'd probably have two column-families in each of the Student and Course tables. One for the data, another with a column per student or course. For example, a student row might look like:
Student: id/row/key = 1001
data:name = Student Name
data:address = 123 ABC St
courses:2001 = (If you need more information about this association, for example, if they are on the waiting list)
courses:2002 = ...
This schema gives you fast access to the queries, show all classes for a student (student table, courses family), or all students for a class (courses table, students family).
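Thinking of it as a map of maps, the Student/Course design from the FAQ can be sketched with nested Python dictionaries. This only models the shape of the data (row key, then column family, then column qualifier, then value); the second student, the course names and the helper functions are hypothetical.

```python
# row key -> column family -> column qualifier -> value
students = {}
courses = {}


def add_student(student_id, name, address):
    students[student_id] = {"data": {"name": name, "address": address}, "courses": {}}


def add_course(course_id, name):
    courses[course_id] = {"data": {"name": name}, "students": {}}


def enroll(student_id, course_id, note=""):
    """Record the association on both rows; the value can hold extra info."""
    students[student_id]["courses"][course_id] = note
    courses[course_id]["students"][student_id] = note


def classes_for(student_id):
    """All classes for a student: read one row's `courses` family."""
    return sorted(students[student_id]["courses"])


def students_in(course_id):
    """All students for a class: read one row's `students` family."""
    return sorted(courses[course_id]["students"])


add_student("1001", "Student Name", "123 ABC St")
add_student("1002", "Another Student", "456 DEF Ave")
add_course("2001", "Databases")
add_course("2002", "Algorithms")
enroll("1001", "2001")
enroll("1001", "2002", note="waiting list")
enroll("1002", "2001")
```

Each direction of the relationship is answered from a single row, at the price of writing every enrollment twice.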
In a relational database, the concept is very clear: one table for cars with columns like "car_id, car_type, car_name, car_price", and another table for shops with columns "shop_id, car_id, shop_name, sale_count"; the "car_id" column links the two tables together for data operations. All the columns must be well defined when creating the database.
NoSQL database systems do not require you to pre-define these columns and tables. You just construct your records in a certain format, say JSON, like:
{"car": {"id": 1, "type": "auto", "name": "ford"}, "shop": {"id": 100, "name": "some_shop"}},
{"car": {"id": 2, "type": "auto", "name": "benz"}, "shop": {"id": 105, "name": "my_shop"}},
.....
After your system is online and in use, you may find some flaws in the design of your db structure and want to add an "employee" field to "shop" for future records. Your new records then look like:
{"car": {"id": 3, "type": "auto", "name": "RR"}, "shop": {"id": 108, "name": "other_shop", "employee": "Bill"}},
NoSQL systems allow you to do so without any schema change, whereas a relational database would first need its schema altered (for example, by adding a column).