I want to implement content tagging using MongoDB. In a relational database, the best approach would be to have a many-to-many relation between the content (say, "products") and tags tables. But what is best approach with NoSQL databases?
Would it be better to put every tag in a tags array of the "content" document, or put references to tags in a string?
In most cases where you have an n:m relation in MongoDB, you should use embedding instead of referencing. So I would recommend having a "tags" array with the tag names in each product. I assume that looking at a single product will be the most frequent use case in your system. This design lets you show the user a product with its list of tag names in a single database query.
When you need additional metadata about the tags which you don't want to bind to a product (like a long-text description of a tag), you could create an additional tags collection, where the name field gets a unique index for fast lookup and to avoid duplicates. When the user clicks on or hovers over a tag name, you can use an additional query to get the tag details.
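As a rough illustration of the two shapes described above, here is a minimal sketch using plain Python dictionaries rather than a live database. The field names (`name`, `tags`, `description`), the helper functions and the sample values are assumptions made for the example.

```python
# Product document with the tag names embedded directly:
# a single read is enough to render the product page.
product = {
    "_id": 1,
    "name": "Trail running shoe",  # hypothetical sample data
    "tags": ["running", "outdoor"],
}

# Optional separate "tags" collection holding metadata per tag.
# Keying the dict by name stands in for a unique index on the name field.
tags_by_name = {
    "running": {"name": "running", "description": "Gear for runners."},
    "outdoor": {"name": "outdoor", "description": "For use outside."},
}


def add_tag(tags, name, description):
    """Insert a tag, rejecting duplicates the way a unique index would."""
    if name in tags:
        raise ValueError(f"duplicate tag name: {name}")
    tags[name] = {"name": name, "description": description}


def tag_details(tags, name):
    """Second lookup, only needed when the user asks for tag details."""
    return tags.get(name)
```

With PyMongo, the unique index itself would typically be created with something like `create_index("name", unique=True)` on the tags collection.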
A problematic case in this design is when you want to delete or rename a tag: you then have to edit every product which includes the tag. But because MongoDB has no foreign keys with ON DELETE CASCADE like SQL databases do, you will always have that problem when documents reference one another.
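To make the cost of renaming or deleting concrete, here is a small in-memory sketch of what "edit every product which includes the tag" amounts to. The function names are hypothetical, and the products are plain dictionaries rather than real MongoDB documents.

```python
def rename_tag(products, old, new):
    """Rename a tag in every product that carries it; return how many changed."""
    changed = 0
    for product in products:
        if old in product["tags"]:
            renamed = [new if t == old else t for t in product["tags"]]
            # dict.fromkeys drops a duplicate if `new` was already present,
            # while keeping the original order.
            product["tags"] = list(dict.fromkeys(renamed))
            changed += 1
    return changed


def delete_tag(products, name):
    """Remove a tag from every product; return how many products changed."""
    changed = 0
    for product in products:
        if name in product["tags"]:
            product["tags"] = [t for t in product["tags"] if t != name]
            changed += 1
    return changed
```

Against a real database this would be a multi-document update, probably along the lines of `update_many({"tags": old}, {"$set": {"tags.$": new}})` for the rename and `update_many({"tags": name}, {"$pull": {"tags": name}})` for the delete.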
Renaming tags could be made easier by storing objectIDs instead of names in the tag array of the product. But IDs have the disadvantage that they are useless for the user. You need to get the names of the tags to show a product page. That means that you have to request every single one from the tags collection, which requires an additional database query.
I am new to the NoSQL world. I am building a serverless app with DynamoDB. In a relational DB, if I had 3 entities like post, post_likes and post_tags, I would have a few tables and use joins to fetch data. But I wonder how one should structure NoSQL data for a scenario where a post has a one-to-many relationship with likes and a many-to-many relationship with tags.
Post model:
user_id <string>
attachment_url <string>
description <string>
public <boolean>
Like model:
user_id <string>
post_id <string>
type <string>
Tag model:
name <string>
I have a few access patterns:
Get all public posts
Get all posts filtered by a single tag and public status
Get all posts by user id
Get a single post by post id
And each time a post should be fetched with tags data, and likes data including user data that is attached to a like.
In relational DB I would create post_tags table and fetch all post by tags. But, how can I do this with dynamodb?
I am struggling to figure out what my table should look like and which of post_id, user_id, tag_name or public to use as the partition and sort keys in this case.
My initial thought was to build a table with entity that would look like this:
Partition key | Sort key | data attributes
tag_name | post_id | public | user_id | likes[] | other post attributes...
Then this table would look something like this:
I have set the 2 Global secondary indexes.
First Global secondary index:
partition key set to public and sort key to post_id
Second Global secondary index:
partition key set to user_id and sort key to post_id
That way for each tag a post has, I would have a duplicate of that post in the table. I thought by having a tag as a first filter, that way I could query efficiently posts if I need to query them by a tag.
But, if I do a query by just a public status or user_id, I would get all the duplicates of posts for each tag they belong to.
Or should I have 3 separate entities in the table, tags, posts and likes and if I fetch a post by a tag, I would first do one query to find all post_ids by a tag, then do the second query to fetch posts and their likes id, and then do the third query to fetch the likes array.
I don't know what the best practice is when it comes to these things, since I have only just started using DynamoDB.
What should this DB structure look like then?
You're off to a great start by thinking deeply about your access patterns and defining your entities (Posts, Users, Likes, etc). As you know, having a thorough understanding of your access patterns is critical to storing your data in DynamoDB.
While reviewing my answer, keep in mind that this is only one solution. DynamoDB gives you a ton of flexibility when defining your data model, which can be both a blessing and a curse! This answer is not meant to be the way to model these access patterns. Instead, it's one way that these access patterns can be implemented. Let's get into it!
I like to start by listing the entities we need to model, as well as the Primary key for each. Throughout this post, I'll be using composite primary keys, which are keys made up of a Partition Key (PK) and a Sort Key (SK). Let's start out with a blank table and fill it out as we go.
Entity | Partition Key | Sort Key
User | |
Post | |
Tag | |
Users
Users are central to your application, so I'll start there.
Let's start by defining a User model that lets us identify a User by ID. I'll use the pattern USER#<user_id> for the PK and SK of the User entity.
This supports the following access patterns (examples in pseudocode for simplicity):
Fetch User by ID
ddbClient.query(PK = USER#1, SK = USER#1)
I'll update the table with the new PK/SK pattern for Users
Entity | Partition Key | Sort Key
User | USER#<user_id> | USER#<user_id>
Post | |
Tag | |
Posts
I'll start modeling Posts by focusing on the one-to-many relationship between Users and their Posts.
You have an access pattern to fetch All Posts by UserId, so I'll start by adding the Post model to the User partition. I'll do this by defining a PK of USER#<user_id> and an SK of POST#<post_id>.
This supports the following access patterns:
Fetch User and all Posts
ddbClient.query(PK = USER#<user_id>)
Fetch User Posts
ddbClient.query(PK = USER#<user_id>, SK begins_with "POST#")
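To see why putting Posts in the User partition supports both queries above, here is a minimal in-memory sketch. It is not the DynamoDB SDK: `query` is a hypothetical stand-in that mimics a Query on one partition with an optional begins_with condition on the sort key, and the sample items are invented.

```python
def user_key(user_id):
    """PK/SK pair for a User item."""
    return {"PK": f"USER#{user_id}", "SK": f"USER#{user_id}"}


def post_key(user_id, post_id):
    """PK/SK pair for a Post item stored in its author's partition."""
    return {"PK": f"USER#{user_id}", "SK": f"POST#{post_id}"}


def query(items, pk, sk_prefix=""):
    """Mimic a Query: one partition, optional begins_with on SK, sorted by SK."""
    hits = [i for i in items if i["PK"] == pk and i["SK"].startswith(sk_prefix)]
    return sorted(hits, key=lambda i: i["SK"])


table = [
    {**user_key("1"), "name": "alice"},
    {**post_key("1", "p2"), "description": "second post"},
    {**post_key("1", "p1"), "description": "first post"},
    {**post_key("2", "p3"), "description": "someone else's post"},
]

user_and_posts = query(table, "USER#1")        # User item plus all of its Posts
only_posts = query(table, "USER#1", "POST#")   # just the Posts
```

Because everything for one user shares a partition key, both access patterns are a single query against one partition.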
You may wonder about the odd-looking Post IDs. When fetching Posts, you'll probably want to get the most recent Posts first. You also want to be able to uniquely identify Posts by ID. When you have this sort of requirement, you can use a KSUID as your unique identifier. Explaining KSUIDs is a bit out of scope for your question, but know that they are unique and sortable by the time they were created. Since DynamoDB sorts results by the Sort Key, your query for a user's posts will automatically be sorted by creation date!
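The property that matters here is that the ID starts with the creation time, so sorting the IDs as strings also sorts them by age. The sketch below shows that idea only. It is a simplified stand-in and not the real KSUID encoding, and `sortable_id` is a hypothetical helper.

```python
import secrets


def sortable_id(created_at_seconds):
    """Timestamp-first ID: lexicographic order matches creation order.

    Simplified stand-in for illustration, NOT the real KSUID format.
    """
    # Zero-padding keeps string comparison consistent with numeric comparison;
    # the random suffix keeps IDs created in the same second distinct.
    return f"{int(created_at_seconds):010d}{secrets.token_hex(8)}"


older = sortable_id(1_600_000_000)
newer = sortable_id(1_700_000_000)
ordered = sorted([newer, older])  # oldest first
```

A Query returns items in ascending sort key order by default, so to get the newest Posts first you would ask for descending order (the `ScanIndexForward` parameter set to false).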
Updating the PK/SK patterns for your application, we now have
Entity | Partition Key | Sort Key
User | USER#<user_id> | USER#<user_id>
Post | USER#<user_id> | POST#<post_id>
Tag | |
Tags
We have a few options on how to model the relationship between Posts and Tags. You could include a list attribute on your Post item, which simply lists the tags on the item. This approach is perfectly fine. However, looking at your other access patterns, I'm going to take a different approach for now (it will be apparent why later).
I will model tags with a PK of POST#<post_id> and an SK of TAG#<tag_name>
Since Primary Keys are unique, modeling tags in this way will ensure that no Post is tagged with the same Tag twice. Additionally, it allows us to have an unbounded number of Tags on a Post.
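Here is a small sketch of why the primary key prevents duplicate tags. A dictionary keyed by (PK, SK) stands in for the table, and `put_item` is a hypothetical helper that behaves like a plain put: writing the same key again replaces the item instead of adding a second one.

```python
def tag_item(post_id, tag_name):
    """Tag item stored in the Post's partition."""
    return {"PK": f"POST#{post_id}", "SK": f"TAG#{tag_name}"}


def put_item(table, item):
    """Write an item; an existing item with the same (PK, SK) is replaced."""
    table[(item["PK"], item["SK"])] = item


table = {}
put_item(table, tag_item("p1", "travel"))
put_item(table, tag_item("p1", "food"))
put_item(table, tag_item("p1", "travel"))  # same key: overwrites, no duplicate
```

After the three writes the table holds two tag items for the post, one per distinct tag name.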
Updating our PK/SK table for Tag, we have
Entity | Partition Key | Sort Key
User | USER#<user_id> | USER#<user_id>
Post | USER#<user_id> | POST#<post_id>
Tag | POST#<post_id> | TAG#<tag_name>
At this point we've modeled Users, Posts and Tags. However, we've only addressed one of your four access patterns. Let's see how we can use secondary indexes to support your access patterns.
Note: You could also model Likes in the exact same way.
Defining A Secondary Index
Secondary indexes allow you to support additional access patterns within your data. Let's define a very simple secondary index and see how it supports your various access patterns.
I'm going to create a secondary index that swaps the PK/SK patterns in your base table. This pattern is called an inverted index, and would look like this:
All we've done here is swapped the PK/SK pattern of your base table, which has given us access to two additional access patterns:
Fetch Post by ID
ddbClient.query(IndexName = InvertedIndex, PK = POST#<post_id>)
Fetch Posts by Tag
ddbClient.query(IndexName = InvertedIndex, PK = TAG#<tag_name>)
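The two index queries above can be sketched in memory as follows. `query_inverted` is a hypothetical stand-in for querying the inverted index: the base table's SK is treated as the partition key and the base table's PK as the sort key. The sample items are invented.

```python
def query_inverted(items, key, sk_prefix=""):
    """Query the inverted index: the base table's SK acts as the partition key."""
    hits = [i for i in items if i["SK"] == key and i["PK"].startswith(sk_prefix)]
    return sorted(hits, key=lambda i: i["PK"])


table = [
    {"PK": "USER#1", "SK": "USER#1"},
    {"PK": "USER#1", "SK": "POST#p1", "description": "first post"},
    {"PK": "USER#2", "SK": "POST#p2", "description": "second post"},
    {"PK": "POST#p1", "SK": "TAG#travel"},
    {"PK": "POST#p2", "SK": "TAG#travel"},
    {"PK": "POST#p2", "SK": "TAG#food"},
]

# Fetch Post by ID: the Post item is found without knowing its author.
post = query_inverted(table, "POST#p1")

# Fetch Posts by Tag: the matching tag items carry the post IDs in their PK.
tagged = [i["PK"] for i in query_inverted(table, "TAG#travel")]
```

Note that in this sketch the tag query returns tag items rather than full Posts, so the post attributes would need either a follow-up read per post or to be copied onto the tag items.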
Fetch All Posts by Public/Private status
You wanted to fetch posts by public/private status, as well as fetching all Posts. One way to fetch all Posts is to put them in a single partition. We can put the public/private status in the sort key to separate the public and private Posts.
To do this, I'll create two new attributes on the Post item: _type and publicPostId. These fields will serve as the PK/SK patterns for the secondary index I'm calling PostByStatus.
After doing this, your base table would look like this:
and your new secondary index would look like this
This secondary index would enable the following access patterns
Fetch All Posts
ddbClient.query(IndexName = PostByStatus, PK = POST)
Fetch All Private Posts
ddbClient.query(IndexName = PostByStatus, PK = POST, SK begins_with "PRIVATE#")
Fetch All Public Posts
ddbClient.query(IndexName = PostByStatus, PK = POST, SK begins_with "PUBLIC#")
Remember, Post IDs are KSUIDs, so they will naturally be sorted in your results by the date the Post was made.
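As a sketch of how the two extra attributes drive the PostByStatus queries, the snippet below builds Post items with `_type` and `publicPostId` and filters them in memory. The value format `PUBLIC#<post_id>` / `PRIVATE#<post_id>` is inferred from the begins_with conditions above, and both helper functions are hypothetical.

```python
def post_item(user_id, post_id, public):
    """Post item carrying the two attributes used by the PostByStatus index."""
    status = "PUBLIC" if public else "PRIVATE"
    return {
        "PK": f"USER#{user_id}",
        "SK": f"POST#{post_id}",
        "_type": "POST",                        # index partition key
        "publicPostId": f"{status}#{post_id}",  # index sort key
    }


def query_post_by_status(items, prefix=""):
    """Mimic querying PostByStatus with PK = POST and begins_with on the SK."""
    hits = [
        i for i in items
        if i.get("_type") == "POST" and i["publicPostId"].startswith(prefix)
    ]
    return sorted(hits, key=lambda i: i["publicPostId"])


table = [
    post_item("1", "p1", public=True),
    post_item("1", "p2", public=False),
    post_item("2", "p3", public=True),
    {"PK": "USER#1", "SK": "USER#1"},  # lacks the index attributes: not indexed
]

public_posts = query_post_by_status(table, "PUBLIC#")
private_posts = query_post_by_status(table, "PRIVATE#")
all_posts = query_post_by_status(table)
```

Items that lack the index key attributes, such as the User item here, simply do not appear in the index.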
A Word on Hot Partitions
Storing all your Posts in a single partition will likely result in a hot partition as your application scales. One way to address this is by distributing your Post items across multiple partitions. How you do that is entirely up to you and specific to your application.
One strategy to avoid the single POST partition could involve grouping Posts by creation day/week/month/etc. For example, instead of using POST as your PK in the PostByStatus secondary index, you could use POSTS#<month>-<year> instead, which would look like this:
Your application would need to take this pattern into account when fetching Posts (e.g. start at the current month and go backwards until enough results are fetched), but you'd be spreading the load across multiple partitions.
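One way the "start at the current month and go backwards" logic could look is sketched below. The zero-padded `MM-YYYY` formatting and both function names are assumptions. The only part taken from the text above is the `POSTS#<month>-<year>` shape of the key.

```python
from datetime import date


def month_partition(day):
    """Partition key in the POSTS#<month>-<year> form for a given date."""
    return f"POSTS#{day.month:02d}-{day.year}"


def recent_partitions(today, months):
    """Yield partition keys from the current month backwards."""
    year, month = today.year, today.month
    for _ in range(months):
        yield month_partition(date(year, month, 1))
        month -= 1
        if month == 0:  # wrap from January to December of the previous year
            year, month = year - 1, 12


keys = list(recent_partitions(date(2024, 2, 10), 3))
```

The application would query each of these partitions in turn and stop once it has collected enough Posts.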
Wrapping Up
I hope this exercise gives you some ideas on how to model your data to support specific access patterns. Data modeling in DynamoDB takes time to get right, and will likely require multiple iterations to make work for your specific application. It can be a steep learning curve, but the payoff is a solution that brings scale and speed to your application.
I am currently trying to model a MongoDB database structure where the entities are very complex in relation to each other.
In my current collections, MongoDB queries are difficult or impossible to put into a single aggregation. Incidentally, I'm not a database specialist and have been working with MongoDB for only about half a year.
To keep it as simple as possible but necessary, this is my challenge:
I have newspaper articles that contain simple keywords, works (oeuvres, books, movies), persons and linked combinations of works and persons. In addition, the same people appear under different names in different articles.
Later, on the person view I want to show the following:
the links of the person with name and work and the respective articles
the articles in which the person appears without a work (by name)
the other keywords that are still in the article
In my structure I want to avoid that entities such as people occur multiple times. So these are my current collections:
Article
id
title
keywordRelations
KeywordRelation
id
type (single or combination)
simpleKeywordId (optional)
personNameConnectionIds (optional)
workIds (optional)
SimpleKeyword
id
value
PersonNameConnection
id
personId
nameInArticleId
Person
id
firstname
lastname
NameInArticle
id
name
type (e.g. abbreviation, synonym)
Work
id
title
To meet the requirements, I would always have to create queries that range over 3 to 4 tables. Is that possible and useful with MongoDB?
Or is there an easier way and structure to achieve that?
I must design an API to manage a Document entity. What is unusual about this entity is that it can have two different ids:
id1 (number, e.g. 1234)
id2 (number, e.g. 89)
For each document, one and only one id is available (id1 or id2, not both).
Usually I solve this issue by using query parameters to perform some kind of "search" feature:
GET /documents?id1=1234
GET /documents?id2=89
But it works only if there is no sub-entity...
Let's say I want to get the authors of the documents :
GET /documents/1234/authors
Impossible because I can't know what type of id I get: is it id1 or id2 ?
GET /documents/authors?id1=1234
Not really REST I think because id1 then refers to the "Author" entity, not "Document" anymore...
GET /id1-documents/1234/authors
GET /id2-documents/1234/authors
Here you create two URIs that return the same entity (/author), which is not really REST compliant.
GET /documents/id1=1234/authors
GET /documents/id2=89/authors
It looks like a composite key created only for the API, it has no "backend" meaning. For me it sounds strange to create a "composite" key on the fly.
GET /document-authors?id1=1234
GET /document-authors?id2=89
In this case you completely lose the notion of tree... You end up with an API that contains only root entities.
Do you see another alternative ?
Which one looks the best ?
Thank you very much.
It seems to me that you're conflating two different resources here: documents and authors. A document has a relationship with an author, but they should be separate resources because authors exist independently of any individual document. With that in mind you need to ask whether your clients are searching for authors or documents. If it's authors, then they should be querying an authors API rather than a documents API.
For example, to get all the authors of documents with id1 89, id1 1234, or id2 4444, you might query like this:
GET /authors?docId1=89&docId1=1234&docId2=4444
That should return a list of author representations. If people care about the documents themselves, the author representations could contain links to the documents.
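The query above repeats a parameter to pass several values for it. Here is a short sketch of building and parsing such a query string with Python's standard library. The parameter names come from the example URL, and everything else is illustrative.

```python
from urllib.parse import parse_qs, urlencode

# Client side: repeat a parameter to pass several values for it.
query = urlencode([("docId1", 89), ("docId1", 1234), ("docId2", 4444)])
url = f"/authors?{query}"

# Server side: parse_qs collects repeated parameters into lists of strings.
params = parse_qs(query)
```

The server can then treat each list as an "any of these ids" filter for the corresponding id type.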
Alternatively, if you're looking for documents then you should be querying that directly...
GET /documents?id1=89&id1=1234&id2=4444
What you're modelling as a sub-resource isn't really a subresource. It's a relationship between 2 independent resources and should be modelled as a set of links. Each document returned from the documents api should contain a set of authors links (if people really care about the authors) and vice versa from the authors to the documents.
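To illustrate the "set of links" idea, here is a minimal sketch of a document representation that points at its authors instead of nesting them. The `links` field, the `title` field and the `/authors/<id>` URI shape are assumptions for the example, not part of any particular standard.

```python
def document_representation(doc, author_ids):
    """Build a document representation that links to its authors."""
    return {
        "id1": doc.get("id1"),  # exactly one of id1/id2 is set per document
        "id2": doc.get("id2"),
        "title": doc["title"],
        "links": {
            "authors": [f"/authors/{author_id}" for author_id in author_ids],
        },
    }


rep = document_representation({"id1": 1234, "title": "Example"}, [7, 9])
```

An author representation would carry the reverse links back to its documents in the same way.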
Here's an opinionated solution from SlashDB, which allows for record filtering and traversing to related resources at the same time.
The example is similar to yours - two entities Artist and Album.
Let's identify the Artist first.
Artist by ID:
https://demo.slashdb.com/db/Chinook/Artist/ArtistId/2
Artist by Name:
https://demo.slashdb.com/db/Chinook/Artist/Name/Accept
An Artist may have issued Albums. The two entities are related. We allow extending the URL with the name of the related entity, like so:
https://demo.slashdb.com/db/Chinook/Artist/Name/Accept/Album
You can keep "going", say to get to the Tracks from those albums
https://demo.slashdb.com/db/Chinook/Artist/Name/Accept/Album/Track
And even continue filtering, e.g. only tracks that are shorter than 300000 milliseconds:
https://demo.slashdb.com/db/Chinook/Artist/Name/Accept/Album/Track/Milliseconds/..300000
The objective:
Having a many-to-many relation be displayed as a dynamic list of select inputs (single-choice dropdown lists).
User arrives on page with a single select field (multiple = false) populated with persisted entities and add/remove buttons. By clicking the add button, a new select field with the same options appears below the first, which adds a new entry in the M2M relation. By clicking remove the field disappears and the entry should be removed.
The model:
Two entities: User & Manager. A User has exactly one "special" Manager and unlimited normal Managers.
Managers manage unlimited users. To model this I have created two relationships for which the user is the "owner" (not sure how to translate this)
ManyToOne specialManager
ManyToMany normalManagers
I haven't created a many to many relationship with attribute "special" because the requirement is exactly one special manager and I wasn't sure if Symfony/Doctrine would cause problems down the line.
What I have:
I can display a multiple select field with the existing entities using Entity field type, as per the documentation. Functionally this is what I need, visually it is not.
I can also use the Collection field type to display a single text field, and add or remove more with JS, as per the documentation. Visually this is what I need, but the text fields (entity attribute) need to be replaced by choice fields.
The question:
Before I continue digging, is there a simple way to achieve this list of select tags?
For anyone else who may eventually need a dynamic list of select fields:
I initially solved this issue by detaching the field(s) in event listeners, and handling the display/submission manually in the controller.
However I wasn't satisfied with this clunky solution and when I encountered the same need I used a second solution: creating an intermediary entity xxxChoice (in this case ManagerChoice) which is Mto1 inversed related to User and Mto1 related to Manager. Then by creating a ManagerChoiceType form with "Manager" entity field type I was able to easily display my collection of dropdown select lists.
I'm trying to develop a many-to-many relationship between tags (in the tags table) and items (in the items table) using a field of type integer[] on each item.
I know that Rails 4 (and Rails 3 via postgres_ext) has support for Postgres' arrays feature through the :array => true parameter, but I can't figure out how to combine them with Active Record associations.
Does has_many have an option for this? Is there a gem for this? Should I give up and just create a has_many :through relationship (though with the amount of relations I'm expecting this is probably unmanageable)?
At this point, there isn't a way to use relationships with arrays in Rails. Using the selected answer though, you will run into the N+1 select issue. Say you get your posts and then the tags for it on each post with "tags" method defined in the class. For each post you call the tags on, you will incur another database hit.
Hopefully, this will change in the future and we can get rid of the join table (especially if Postgres gains support for foreign keys on array elements, which has been proposed).
All you really need to do is
def tags
  Tag.where(id: tag_ids)
end

def add_tag(tag)
  self.tag_ids += [tag.id] unless tag_ids.include?(tag.id)
end
At least that's what I do at the moment. I do some pretty cool stuff with hashes (hstore) as well, for permissions. One way of handling tags is to create the has_many :through and also persist the tag names in a string array column as they are added, for convenience and performance (you don't have to query the two related tables just to get the names out). You don't necessarily have to use Active Record to do cool stuff with the database.