I worked with relational databases for a long time, and now I am moving to DynamoDB. Coming from a relational background, I am struggling to model some of our current SQL tables in DynamoDB, especially when it comes to choosing partition and sort keys. I will try to explain with an example:
Current Tables:
Student: StudentId(PK), Email, First name, Last name, Password, SchoolId(FK)
School: SchoolId(PK), Name, Description
I was thinking of merging these tables in DynamoDB, using SchoolId as the partition key and StudentId as the sort key. However, I have seen some similar examples use StudentId as the partition key.
And then I realized that we use "username" in every login flow, so the application will query by "username" (sometimes with a password or an auth token) a lot. That makes me consider SchoolId as the partition key and Username as the sort key.
I need some ideas about what the best practice would be in this case, and some suggestions to give me a better understanding of NoSQL and DynamoDB concepts.
In NoSQL you should list all of your access patterns first and then model the table schema around them.
Below are the use cases that I see in your application:
Get user info for one user by userId (password, age, name, ...)
Get school info for a user by userId (className, schoolName)
Get all the students in one school.
Get all the students in one class of one school.
Based on these access patterns, this is how I would design the schema:
| pk    | sk         | GSI1 PK | GSI1 SK    | Other attributes                            |
|:-----:|:----------:|:-------:|:----------:|:--------------------------------------------|
| 12345 | metadata   |         |            | Age: 13, Last name: Singh, Name: Rohan, ... |
| 12345 | schoolMeta | DPS     | DPS#class5 | SchoolName: DPS, className: 5               |
With the above schema you can solve the identified use cases as follows:
Get user info for one user by userId:
Select * where pk=userId and sk=metadata
Get school info for a user by userId:
Select * where pk=userId and sk=schoolMeta
Get all the students in one school:
Select * where pk=SchoolId from table=GSI1
Get all the students in one class:
Select * where pk=SchoolId and sk startswith SchoolId#className from table=GSI1
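In concrete terms these map almost one-to-one onto PartiQL for DynamoDB. A minimal sketch, assuming the table is named Students, the index GSI1, and the index key attributes GSI1PK/GSI1SK (none of those names are fixed by the schema above):

-- Get user info for one user
SELECT * FROM "Students" WHERE "pk" = '12345' AND "sk" = 'metadata';
-- Get school info for that user
SELECT * FROM "Students" WHERE "pk" = '12345' AND "sk" = 'schoolMeta';
-- Get all the students in one school (query the GSI)
SELECT * FROM "Students"."GSI1" WHERE "GSI1PK" = 'DPS';
-- Get all the students in one class of that school
SELECT * FROM "Students"."GSI1"
WHERE "GSI1PK" = 'DPS' AND begins_with("GSI1SK", 'DPS#class5');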
But the given schema suffers from a drawback: if you want to change the school name, you will have to update many rows.
I'm trying to implement row-level security in Postgres. In reality, I have many roles, but for the sake of this question, there are four roles: executive, director, manager, and junior. I have a table that looks like this:
SELECT * FROM ex_schema.residence; --as superuser
primary_key | residence | security_level
------------------+---------------+--------------
5 | Time-Share | executive
1 | Single-Family | junior
2 | Multi-Family | director
4 | Condominium | manager
6 | Modular | junior
3 | Townhouse | director
I've written a policy to enable row-level security that looks like this:
CREATE POLICY residence_policy
ON ex_schema.residence
FOR ALL
USING (security_level = CURRENT_USER)
WITH CHECK (primary_key IS NOT NULL AND security_level = CURRENT_USER);
As expected, when the executive connects to the database and selects the table, that role only sees rows that have executive in the security_level column. What I'd like to do is enable the row-level security so that higher security roles can see rows that match their security level as well as rows that have lower security privileges. The hierarchy would look like this:
ROW ACCESS PER ROLE
executive: executive, director, manager, junior
director: director, manager, junior
manager: manager, junior
junior: junior
I'm wondering how to implement this type of row-level policy so that a specific role can access multiple types of security levels. There's flexibility in changing the security_level column structure and data type.
One thing you can do is define an enum type for your levels:
CREATE TYPE sec_level AS ENUM
('junior', 'manager', 'director', 'executive');
Then you can use that type for the security_level column and write your policy as
CREATE POLICY residence_policy ON ex_schema.residence
FOR ALL
USING (security_level <= CURRENT_USER::sec_level);
There is no need to check that the primary key is not NULL; inserting a NULL primary key would generate an error anyway.
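Enum values compare according to the order in which they were declared, which is exactly what the policy relies on. A quick sanity check:

SELECT 'junior'::sec_level <= 'executive'::sec_level;  -- true: executives see junior rows
SELECT 'director'::sec_level <= 'manager'::sec_level;  -- false: managers cannot see director rows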
Use an enum type only if you know that these levels won't change, particularly that no level will ever be removed.
Alternatively, you could use a lookup table:
CREATE TABLE sec_level (
name text PRIMARY KEY,
rank double precision UNIQUE NOT NULL
);
The column security_level would then be a foreign key to sec_level(rank), and you can compare the values in the policy like before. You will need an extra lookup against sec_level to translate CURRENT_USER into a rank, but in exchange you can add and remove levels later.
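A sketch of what that policy could look like, assuming security_level now stores the numeric rank:

CREATE POLICY residence_policy ON ex_schema.residence
FOR ALL
USING (security_level <=
       (SELECT rank FROM sec_level WHERE name = CURRENT_USER));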
I currently have a SQL Server database with a table containing 400,000 movies. I have another table containing thousands of users.
CREATE TABLE [movie].[Header]
(
[Id] [int] IDENTITY(1,1) NOT NULL,
[SourceId] [int] NOT NULL,
[ReleaseDate] [Date] NOT NULL,
[Title] [nvarchar](500) NOT NULL
)
CREATE TABLE [account].[Registration]
(
[Id] [int] IDENTITY(1,1) NOT NULL,
[Username] [varchar](50) NOT NULL,
[PasswordHash] [varchar](1000) NOT NULL,
[Email] [varchar](100) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[UpdatedAt] [datetime] NOT NULL
)
CREATE TABLE [movie].[Likes]
(
[Id] [uniqueidentifier] NOT NULL,
[HeaderId] [int] NOT NULL,
[UserId] [int] NOT NULL,
[CreatedAt] [datetime] NOT NULL
)
CREATE TABLE [movie].[Dislikes]
(
[Id] [uniqueidentifier] NOT NULL,
[HeaderId] [int] NOT NULL,
[UserId] [int] NOT NULL,
[CreatedAt] [datetime] NOT NULL
)
Each user is shown 100 movies starting from two weeks into the future. They can then perform an action such as like, dislike, recommend etc.
I'm in the process of moving the entire application into a serverless architecture. I have the APIs running in AWS via Lambda + API Gateway and now I'm looking at using DynamoDB for the database. I don't think I have anything super crazy that would prevent me from storing the data in Dynamo and their pricing/consumption model seems like it would be substantially cheaper than SQL Server (currently hosted in Azure).
The one thing I'm having issues with is understanding how I would model users performing an action on a movie. If they "like" a movie, it goes into a likes list that they can go back and visit. There, I present them with the entire movie record (which actually consists of more data such as cast/crew/ratings etc.; I just truncated the table to simplify it). If I stored each "Like" as an item in DynamoDB, along with the entire movie as an attribute, I'd think the user's document would get very large.
I also need to continue to show users movies, starting two weeks out, that they have not performed any actions on; movies they have acted on need to be removed from the query. Today I just join the movies table to the user actions table, removing movies from the query that already exist in the user's actions table. How would I model this in NoSQL with the same end result?
I can consolidate the likes/dislikes into a single document with an action type attribute (representing like/dislike etc.) and an array of movies that the action has been performed on. I'm still not sure how I would go about filtering the [Header] query so that the movies in the user's document don't come back.
I figured I would set my movies hash key to the release date for sharding, since there are roughly 10 movies per release date on average; that gives a nice distribution. I figured I'd use the userid as the hash key for the document containing all of the movies a user has performed an action on, though I'm not sure that's the right path.
I've never dealt with NoSQL, so I wanted to ask for input. I am not sure how best to design something that is essentially one-to-many, but where the movies per user could number in the tens of thousands.
So, based on your comments, I am going to throw in a suggestion. That doesn't mean it's the right answer; I could be wrong or missing a point.
First of all, please read every segment of the Best Practices over and over again. There are patterns you might never have thought of that are still possible with a NoSQL approach. It is very helpful and educational (considering you say you are new to NoSQL). There are similarities to your case, and you might build your own answer based on the best practices.
What I can suggest is:
NoSQL is very bad at querying for 'not existing'. The big trick of NoSQL is that it knows exactly where to find the data you are looking for, not where data is missing. So it is a bit hard to find the movies a user hasn't performed any action on yet. If you can use a side DB such as Redis, you can pull this off very easily: with Redis data structures you can track which movies a user hasn't liked/disliked yet and get the rest of the movie data from DynamoDB. But let's put the side database, Redis, aside for now and go with a DynamoDB-only approach.
One approach could be: when each new movie arrives in the DB, add it to every user with the action type not-actioned-yet. Now you can query these for any user very easily and very fast. (Now it knows where the data is ;) ) But this isn't right, because if there are 10,000 users then every new movie costs 10,000 writes.
Another approach: imagine an item in a table that holds the date of the user's last 'get list of not-yet-actioned movies' query. Some time later the user comes back with the same query; you read that date and fetch all the movies added to your DB after it. With datetimes as sort keys you can query movies starting from that date. Say 10 movies were added after the user's last query; these are definitely movies the user hasn't acted on yet. You then write these 10 movies to the user actions table as not-actioned-yet items ('not-actioned-yet' being an action type just like 'like' or 'dislike'). After this you will have all the movies the user hasn't acted on yet, and from then on you can query for them easily.
Example table structures:
You can use either sparse indexes or a time-series table approach to separate new movies (those in the next 2 weeks) from the others; this way you query or scan only them, efficiently. Going with sparse indexes here.
Movies table
| Id (Hash Key / Primary Key) | StartingDateUnix (GSI SK) | IsIn2Weeks (GSI PK) |
|:---------------------------:|--------------------------:|:-------------------:|
| MovieId1                    |                   1234567 | 1                   |
| MovieId2                    |                   1234568 | 1                   |
| MovieId3                    |                    001123 | null                |
To get movies after unix time 1234567, you query the GSI with a sort key condition greater than that unix time.
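For example, in PartiQL for DynamoDB (the table name Movies and index name IsIn2WeeksIndex are assumptions):

-- the sparse GSI contains only the items where IsIn2Weeks is set
SELECT * FROM "Movies"."IsIn2WeeksIndex"
WHERE "IsIn2Weeks" = 1 AND "StartingDateUnix" > 1234567;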
User Actions Table
| UserId (Hash Key) | ActionType_ForMovie(Sort Key) | CreatedAt (LSI) |
|:-----------------:|:-----------------------------:|:---------------:|
| UserId1 | no-action::MovieId1 | 1234567 |
| UserId1 | no-action::MovieId2 | 1234568 |
| UserId1 | like::MovieId3 | 1234569 |
| UserId1 | like::MovieId4 | 1234561 |
| UserId1 | dislike::MovieId5 | 1234562 |
Using sort keys you can query for all the likes, dislikes, not-yet-actioned items, and so on, and you can sort them by date. You can also paginate.
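For example, in PartiQL for DynamoDB (the table name UserActions is an assumption):

-- all movies UserId1 has liked
SELECT * FROM "UserActions"
WHERE "UserId" = 'UserId1' AND begins_with("ActionType_ForMovie", 'like::');
-- all movies UserId1 has not acted on yet
SELECT * FROM "UserActions"
WHERE "UserId" = 'UserId1' AND begins_with("ActionType_ForMovie", 'no-action::');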
I have spent some time on this problem because it is also a good challenge for me, and I would appreciate feedback. Hope it helps in some way.
Completely hypothetical question to compare the performance of hstore in Postgres.
Let's say each user has a list of followers. There are two ways to implement it:
Many-to-Many relationship with a 'follower' table ( user_id, follower_id )
An hstore column where the values are the ids of the followers. (with a GiST index)
If I want to find all the users that follow a certain user, which version would perform faster?
SELECT follower_id FROM follower WHERE user_id = '1234'
SELECT user_id FROM "user" WHERE data @> 'followers=>1234'
In real life, for option b we would probably also maintain a list of all users the user follows - for the sake of the question let's assume we don't do that.
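For context, a minimal sketch of what option (b)'s setup might look like (table and column names are assumptions, and note that hstore stores each key at most once per row):

CREATE EXTENSION IF NOT EXISTS hstore;
-- assumed shape for option (b): follower ids stored inside an hstore column
CREATE TABLE "user" (
    user_id bigint PRIMARY KEY,
    data    hstore NOT NULL
);
-- GiST index that supports the @> containment operator used in the query above
CREATE INDEX user_data_gist ON "user" USING GIST (data);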
I wanted to know if there is a good way to work timestamps into junction tables using Entity Framework (4.0). An example would be:
Clients
--------
ID | Uniqueidentifier
Name | varchar(64)
Products
-----------
ID | uniqueidentifier
Name | varchar(64)
Purchases
--------
Client | uniqueidentifier
Product | uniqueidentifier
This works smoothly for joining the two together, but I'd like to add a timestamp. Whenever I do that, I'm forced to go through the middle table in my code. I don't think I can add the timestamp field to the junction table, but is there a different method that might be usable?
Well, your question says it all. You must either have a middle "Purchases" entity or not have the timestamp on Purchases. Actually, you can have the column on the table if you don't map it, but if you want it on your entity model then these are the only two choices.
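For reference, a sketch of the junction table once the timestamp is added (the PurchasedAt column name is an assumption). In EF 4, a junction table with any column beyond the two foreign keys is mapped as its own entity rather than a pure many-to-many association:

CREATE TABLE [Purchases]
(
    [Client]      [uniqueidentifier] NOT NULL REFERENCES Clients (ID),
    [Product]     [uniqueidentifier] NOT NULL REFERENCES Products (ID),
    [PurchasedAt] [datetime]         NOT NULL DEFAULT GETDATE(),
    PRIMARY KEY ([Client], [Product])
);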
I have a table with some duplicate rows that I want to normalize into 2 tables.
user | url | keyword
-----|-----|--------
fred | foo | kw1
fred | bar | kw1
sam | blah| kw2
I'd like to start by normalizing this into two tables (users and url_keyword). Is there a query I can run to normalize this, or do I need to loop through the table with a script to build the tables?
You can do it with a few queries, but I'm not familiar with PostgreSQL. First create a table users with an identity column, and also add a column userID to the existing table:
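Something like this (a sketch in PostgreSQL syntax, since the steps are described above but not shown):

CREATE TABLE users (
    ID       serial PRIMARY KEY,   -- the identity column
    userName text   NOT NULL UNIQUE
);
ALTER TABLE url_keyword ADD COLUMN userID integer;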
Then something along these lines:
-- "user" must be double-quoted here: it is a reserved word in PostgreSQL
INSERT INTO users (userName)
SELECT DISTINCT "user" FROM url_keyword;

UPDATE url_keyword
SET userID = (SELECT ID FROM users WHERE userName = "user");
Then you can drop the old user column, create the foreign key constraint, etc.
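Those last steps would look roughly like this (the constraint name is an assumption):

ALTER TABLE url_keyword DROP COLUMN "user";
ALTER TABLE url_keyword
    ADD CONSTRAINT url_keyword_userid_fkey FOREIGN KEY (userID) REFERENCES users (ID);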