I currently have a SQL Server database with a table containing 400,000 movies. I have another table containing thousands of users.
CREATE TABLE [movie].[Header]
(
[Id] [int] IDENTITY(1,1) NOT NULL,
[SourceId] [int] NOT NULL,
[ReleaseDate] [Date] NOT NULL,
[Title] [nvarchar](500) NOT NULL
)
CREATE TABLE [account].[Registration]
(
[Id] [int] IDENTITY(1,1) NOT NULL,
[Username] [varchar](50) NOT NULL,
[PasswordHash] [varchar](1000) NOT NULL,
[Email] [varchar](100) NOT NULL,
[CreatedAt] [datetime] NOT NULL,
[UpdatedAt] [datetime] NOT NULL
)
CREATE TABLE [movie].[Likes]
(
[Id] [uniqueidentifier] NOT NULL,
[HeaderId] [int] NOT NULL,
[UserId] [int] NOT NULL,
[CreatedAt] [datetime] NOT NULL
)
CREATE TABLE [movie].[Dislikes]
(
[Id] [uniqueidentifier] NOT NULL,
[HeaderId] [int] NOT NULL,
[UserId] [int] NOT NULL,
[CreatedAt] [datetime] NOT NULL
)
Each user is shown 100 movies starting from two weeks into the future. They can then perform an action such as like, dislike, recommend etc.
I'm in the process of moving the entire application into a serverless architecture. I have the APIs running in AWS via Lambda + API Gateway and now I'm looking at using DynamoDB for the database. I don't think I have anything super crazy that would prevent me from storing the data in Dynamo and their pricing/consumption model seems like it would be substantially cheaper than SQL Server (currently hosted in Azure).
The one thing I'm having issues with is understanding how I would model the users performing an action on a movie. If they "like" a movie, it goes into a likes list that they can go back and visit. There, I present them the entire movie record (which actually consists of more data such as cast/crew/ratings etc.; I just truncated the table to simplify it). If I stored each "Like" as an item in Dynamo, along with the entire movie as an attribute, I'd think that the user's document would get very large.
I also need to continue to show users movies, starting two weeks out, that they have not performed any actions on. Movies that they have performed actions on I need to remove from the query. Today I'm just joining on the movies table and the user's actions table, removing movies from the query that already exist in the user's action table. How would I model this in NoSQL with the same end result?
I can consolidate the likes/dislikes into a single document with an action-type attribute (representing like/dislike etc.) and an array of movies that the action has been performed on. I'm still not sure how I would go about filtering the [Header] query so that the movies in the user's document don't come back.
I figured I would set my movies hash key to the release date for sharding, since there's roughly 10 movies per release date on average. That gives a nice distribution. I figured I'd use the userid as the hash key for the document containing all of the movies that a user has performed an action on; not sure if that's the right path though.
I've never dealt with NoSql so I wanted to ask for input. I am not sure how best to design something that is essentially one-to-many, but with the potential for the movies-per-user being in the tens of thousands.
So, based on your comments, I am going to throw in a suggestion. It doesn't mean it's the right answer; I could be wrong or missing a point.
First of all, please read every segment of the DynamoDB Best Practices documentation over and over again. There are patterns you might never have thought of that are still possible with a NoSQL approach. It's very helpful and educational (considering you say you are new to NoSQL). There are similarities to your case, and you might create your own answer based on the best practices.
What I can suggest is:
NoSQL is very bad at querying for 'not existing'. A big trick of NoSQL is that it knows exactly where to find the data you are looking for, not where not to find it. So it's a bit hard to find the movies a user hasn't performed any action on yet. If you can use a side DB such as Redis, you can pull this off very easily: with Redis data structures you can query which movies a user hasn't liked/disliked yet and get the rest of the movie data from DynamoDB. But let's put the side database, Redis, aside for now and go with a DynamoDB-only approach.
One approach could be: when each new movie arrives in the DB, you add it to each user with the action type not-actioned-yet. Now, for every user, you can query these items very easily and very fast. (Now it knows where the data is ;) ) But this isn't right, because if there are 10,000 users then for every movie you make 10,000 writes.
Another approach: imagine you have an item in a table that holds the date of the user's last 'get list of not-yet-actioned' query. After some time the user comes back for the same query; you read that date and get all the movies added to your DB after it. With datetimes as sort keys, you can query movies starting from that date. Let's say 10 movies were added after the user's last query (these are definitely movies the user hasn't actioned yet). You now add these 10 movies to the table as not-actioned-yet items. After this, you have all the movies the user hasn't actioned yet; 'not-actioned-yet' is just another action type like 'like' or 'dislike'. From now on you can query for them easily.
Example table structures:
You can use either sparse indexes or a time-series table approach to separate new movies (those releasing in the next 2 weeks) from the others. This way you query or scan only those efficiently. I'm going with sparse indexes here.
Movies table
| Id (Hash Key / Primary Key) | StartingDateUnix (GSI Sort Key) | IsIn2Weeks (GSI Hash Key) |
|:---------------------------:|--------------------------------:|:-------------------------:|
| MovieId1                    |                          1234567 | 1                         |
| MovieId2                    |                          1234568 | 1                         |
| MovieId3                    |                           001123 | null                      |
To get movies after Unix time 1234567, you query the GSI with a sort-key condition greater than that Unix time.
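For illustration, this is roughly how that query could be expressed in PartiQL for DynamoDB. This is only a sketch: the index name is an assumption, and the equivalent Query API call with a key condition expression works the same way.
-- query the sparse GSI: partition key IsIn2Weeks = 1,
-- sort key StartingDateUnix greater than the last-seen Unix time
SELECT *
FROM "Movies"."IsIn2Weeks-StartingDateUnix-index"
WHERE IsIn2Weeks = 1 AND StartingDateUnix > 1234567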
User Actions Table
| UserId (Hash Key) | ActionType_ForMovie(Sort Key) | CreatedAt (LSI) |
|:-----------------:|:-----------------------------:|:---------------:|
| UserId1 | no-action::MovieId1 | 1234567 |
| UserId1 | no-action::MovieId2 | 1234568 |
| UserId1 | like::MovieId3 | 1234569 |
| UserId1 | like::MovieId4 | 1234561 |
| UserId1 | dislike::MovieId5 | 1234562 |
Using sort keys, you can query for all the likes, dislikes, not-yet-actioned items, etc., and you can sort them by date. You can also paginate.
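As a sketch, the same kind of query in PartiQL for DynamoDB, using the composite sort-key prefix (the table name is assumed):
-- all of UserId1's likes; swap the prefix for 'dislike::' or 'no-action::'
SELECT *
FROM "UserActions"
WHERE UserId = 'UserId1' AND begins_with("ActionType_ForMovie", 'like::')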
I have spent some time on this problem because it's also a good challenge for me, and I would appreciate feedback. Hope it helps in some way.
Related
Suppose I have a psql table with a primary key and some data:
pkey | price
----------------------+-------
0075QlyLvw8bi7q6XJo7 | 20
(1 row)
However, I would like to save historical updates on it without losing the functionality that comes from referencing its key in other tables as foreign keys.
I am thinking of doing some kind of revision_number + timestamp approach where each "update" would be a new row, example:
pkey | price | rev_no
----------------------+-------+--------
0075QlyLvw8bi7q6XJo7 | 20 | 0
----------------------+-------+--------
0075QlyLvw8bi7q6XJo7 | 15 | 1
(2 rows)
Then create a view that always takes the highest revision number of the table and reference keys from that view.
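As a sketch (assuming the table is called prices; the name is not in the question), such a view could look like:
CREATE VIEW current_prices AS
SELECT DISTINCT ON (pkey) pkey, price, rev_no
FROM prices
ORDER BY pkey, rev_no DESC;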
However, to me this workaround seems a bit too heavy for a task that, in my opinion, should be fairly common. Is there something I'm missing? Do you have a better solution, or is there a well-known paradigm for these types of problems which I don't know about?
Assuming pkey is actually the defined primary key, you cannot do the revision scheme you outlined without creating a history table and moving old data to it: the primary key must be unique across all revisions. But if you have a properly normalized table there are several valid methods; the following is one:
1. Review the other attributes and identify the candidate business keys (columns with business meaning that could be defined unique -- perhaps the item name).
2. If not already present, add 2 columns: an effective timestamp and a superseded timestamp.
3. Create a partial unique index on the column(s) identified in #1, filtered on the superseded timestamp, so that only the currently active version is covered.
4. Create a simple view as SELECT * FROM the table. Since this is a simple view it is fully updatable; use it for SELECT, INSERT and DELETE.
5. For UPDATE, create an INSTEAD OF trigger on the view. This trigger sets the superseded timestamp on the currently active row and inserts a new row with the update applied and the revision number incremented.
With the above you keep a unique key on the currently active revision. Further, you maintain the history of all relationships at each version. (See demo, including a couple of useful functions.)
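A hedged sketch of those steps, using an illustrative item table (names and columns are assumptions, not the asker's schema; the trigger uses PostgreSQL 11+ syntax):
CREATE TABLE item (
    item_id       serial PRIMARY KEY,
    item_name     text        NOT NULL,   -- the business key from step 1
    price         numeric     NOT NULL,
    rev_no        int         NOT NULL DEFAULT 0,
    effective_ts  timestamptz NOT NULL DEFAULT now(),
    superseded_ts timestamptz             -- NULL = currently active row
);

-- Step 3: only one active row per business key.
CREATE UNIQUE INDEX item_active_uq ON item (item_name)
    WHERE superseded_ts IS NULL;

-- Step 4: a simple, fully updatable view used for all DML.
CREATE VIEW item_v AS SELECT * FROM item;

-- Step 5: UPDATEs through the view close the old row and insert a new one.
CREATE FUNCTION item_v_update() RETURNS trigger AS $$
BEGIN
    UPDATE item SET superseded_ts = now() WHERE item_id = OLD.item_id;
    INSERT INTO item (item_name, price, rev_no)
         VALUES (NEW.item_name, NEW.price, OLD.rev_no + 1);
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER item_v_update_trg
    INSTEAD OF UPDATE ON item_v
    FOR EACH ROW EXECUTE FUNCTION item_v_update();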
The title isn't very specific, so I'll elaborate.
I'm working on a database system in which users can add data to a postgres database through a watered-down API.
So far, all the user's data is compiled into one table, structured similar to this:
CREATE TABLE UserData (
    userId int NOT NULL,
    dataId int NOT NULL PRIMARY KEY,
    key varchar(255) NOT NULL,
    data json NOT NULL
);
However, I am thinking that it may be more efficient (and a faster query) to instead give each userId its own table:
CREATE TABLE UserData_{userId} (
    dataId int NOT NULL PRIMARY KEY,
    key varchar(255) NOT NULL,
    data json NOT NULL
);
CREATE TABLE UserData_{anotherUserId} ();
etc...
I am worried that this will clog up the database, however.
What are the pros and cons for each? Under what load/speed requirements would each serve well? And which of these do you think would be better for a high-load, high-speed scenario?
What you are suggesting is essentially partitioning, so I suggest reading the docs about that. It's mainly advantageous when your operations each cover most of one partition (i.e. select all data for one user, or delete all data for one user).
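For example, a rough sketch using declarative partitioning (an assumption on my part; it requires PostgreSQL 11+ for hash partitioning, and note that the partition key has to be part of the primary key):
CREATE TABLE userdata (
    userId int NOT NULL,
    dataId int NOT NULL,
    key    varchar(255) NOT NULL,
    data   json NOT NULL,
    PRIMARY KEY (userId, dataId)
) PARTITION BY HASH (userId);

CREATE TABLE userdata_p0 PARTITION OF userdata FOR VALUES WITH (MODULUS 4, REMAINDER 0);
CREATE TABLE userdata_p1 PARTITION OF userdata FOR VALUES WITH (MODULUS 4, REMAINDER 1);
CREATE TABLE userdata_p2 PARTITION OF userdata FOR VALUES WITH (MODULUS 4, REMAINDER 2);
CREATE TABLE userdata_p3 PARTITION OF userdata FOR VALUES WITH (MODULUS 4, REMAINDER 3);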
Most use cases, however, are better served by having one properly indexed table. It's a much simpler structure, and can be very performant. If all of your queries are for a single user, then you'll want all of the indexes to start with the userId column, and postgres will use them to efficiently reach only the relevant rows. And if a day comes when you want to query data across multiple users, it will be much easier to do that.
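A minimal sketch of that, using the single UserData table from the question (the key value in the lookup is just an example):
-- index leads with userId, so per-user queries touch only the relevant rows
CREATE INDEX idx_userdata_user_key ON UserData (userId, key);

-- typical per-user lookup that can use the index
SELECT data
FROM UserData
WHERE userId = 42 AND key = 'some-key';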
I advise you not to take my word for it, though. Create both structures, generate fake data to fill them up, and see how they behave!
Consider:
- You might end up with X tables if you have one per user. How many "users" do you expect?
- The json data is unbounded and might grow as your solution/app grows. How will you handle missing keys/values?
- With a table per user, the database grows horizontally (more tables), whereas you should always aim to grow vertically (more rows).
A better solution would be to hold your data in tables related to the user_id.
ie. a "keys" table which holds the key, date_added, active and foreign key (user_id)
This will also solve saving your data as a json which, in you example, will be difficult to maintain. Rather open that json up into a table where you can benefit from indexes and clustering.
If you reference your user_id in separate tables as a foreign key, you can partition or cluster these tables on that key to significantly increase speed and compensate for growth. Which means you have a single table for users (id, name, active, created_at, ...) and lots of tables linked to that user, eg.
subscriptions (id, user_id, ...), items (id, user_id, ...), things (id,user_id, ...)
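A rough sketch of that layout (table and column names are illustrative, not prescriptive):
CREATE TABLE users (
    id         serial PRIMARY KEY,
    name       text NOT NULL,
    active     boolean NOT NULL DEFAULT true,
    created_at timestamptz NOT NULL DEFAULT now()
);

CREATE TABLE keys (
    id         serial PRIMARY KEY,
    user_id    int NOT NULL REFERENCES users (id),
    key        varchar(255) NOT NULL,
    date_added timestamptz NOT NULL DEFAULT now(),
    active     boolean NOT NULL DEFAULT true
);

-- index the foreign key so per-user lookups and joins stay fast
CREATE INDEX idx_keys_user_id ON keys (user_id);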
I will explain the problem with an example:
I am designing a specific case of referential integrity in a table. In the model there are two tables, enterprise and document. We register the companies and then someone inserts the documents associated with them. The name of the enterprise is variable. When it comes to retrieving the documents, I need the name of the enterprise to be the same as it was when the document was registered, not the value it currently has. The solution I thought of was to register the company again on each change, with the same code but the updated name; that way the result would be as expected, but I am not sure it is the best solution. Can someone make a suggestion?
There are several possible solutions and it is hard to determine which one will exactly be the easiest.
Side comment: your question is limited to managing names efficiently, but I would like to point out that your DB is sensitive to files being moved, renamed or deleted. Your database will not be able to keep records up to date if anything happens at the OS level. You should consider doing something about that too.
Amongst the few solutions I considered, the one that is best normalized is the schema below:
CREATE TABLE Enterprise
(
IdEnterprise SERIAL PRIMARY KEY
, Code VARCHAR(4) UNIQUE
, IdName INTEGER DEFAULT -1 /* This will be used to get a single active name */
);
CREATE TABLE EnterpriseName (
IDName SERIAL PRIMARY KEY
, IdEnterprise INTEGER NOT NULL REFERENCES Enterprise(IdEnterprise) ON UPDATE NO ACTION ON DELETE CASCADE
, Name TEXT NOT NULL
);
ALTER TABLE Enterprise ADD FOREIGN KEY (IdName) REFERENCES EnterpriseName(IdName) ON UPDATE NO ACTION ON DELETE NO ACTION DEFERRABLE INITIALLY DEFERRED;
CREATE TABLE Document
(
IdDocument SERIAL PRIMARY KEY
, IdName INTEGER NOT NULL REFERENCES EnterpriseName(IDName) ON UPDATE NO ACTION ON DELETE NO ACTION
, FilePath TEXT NOT NULL
, Description TEXT
);
Using a flag and/or timestamps, or moving the enterprise name to the document table, are appealing solutions, but only at first glance.
In particular, the part where you have to ensure a company always has one, and only one, "active" name is no easy thing to do.
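With the schema above, retrieving a document together with the name the enterprise had at the time is a simple join through EnterpriseName, for example:
SELECT d.IdDocument, d.FilePath, en.Name
FROM Document d
JOIN EnterpriseName en ON en.IdName = d.IdName;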
Add a date range to your enterprise: valid_from, valid_to. Initialise them to -infinity, +infinity. When you change the name of an enterprise, instead update the existing rows where valid_to = +infinity to be now(), and insert the new name with valid_from = now(), valid_to = +infinity.
Add a date field to the document, something like create_date. Then when joining to enterprise you join on ID and d.create_date between e.valid_from and e.valid_to.
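A rough sketch of that flow (table and column names are assumed for illustration):
-- close the currently valid name ...
UPDATE enterprise
SET valid_to = now()
WHERE id = 42 AND valid_to = 'infinity';

-- ... and open a row for the new name
INSERT INTO enterprise (id, code, name, valid_from, valid_to)
VALUES (42, 'ACME', 'New Name S.A.', now(), 'infinity');

-- documents join on the name that was valid when they were created
SELECT d.*, e.name
FROM document d
JOIN enterprise e
  ON e.id = d.enterprise_id
 AND d.create_date BETWEEN e.valid_from AND e.valid_to;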
This is a simplistic approach and breaks things like uniqueness for your id and code. To handle that you could record the name in a separate table with the id, from, to, and name, leaving your original table with just the id and code for uniqueness.
A completely hypothetical question to compare the performance of hstore in Postgres.
Let's say each user has a list of followers. There are 2 ways to implement it:
a) A many-to-many relationship with a 'follower' table (user_id, follower_id)
b) An hstore column where the values are the ids of the followers (with a GiST index)
If I want to find all the users that follow a certain user, which version would perform faster?
SELECT follower_id from follower where user_id = '1234'
SELECT user_id FROM "user" WHERE data @> 'followers=>1234'
In real life, for option b we would probably also maintain a list of all users the user follows - for the sake of the question let's assume we don't do that.
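For concreteness, a setup sketch of the two designs being compared (names are illustrative, and the hstore extension must be installed):
-- (a) plain join table
CREATE TABLE follower (
    user_id     int NOT NULL,
    follower_id int NOT NULL,
    PRIMARY KEY (user_id, follower_id)
);

-- (b) hstore column with a GiST index, as in the question
CREATE EXTENSION IF NOT EXISTS hstore;
CREATE TABLE "user" (
    user_id int PRIMARY KEY,
    data    hstore
);
CREATE INDEX user_data_gist ON "user" USING gist (data);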
I'm working on an API that needs to return a list of financial transactions. These records are held in 6 different tables, but all have 3 common fields:
transaction_id int NOT NULL,
account_id bigint NOT NULL,
created timestamptz NOT NULL
(Note: this might actually have been a good use of table inheritance in PostgreSQL, but it wasn't done like that.)
The business requirement is to return all transactions for a given account_id in one list sorted by created in descending order (similar to an online banking page where your last transaction is at the top). Originally they wanted to paginate in groups of 50 records, but I've got them to do it on date ranges (believing that I can do that more efficiently in the database than using offsets and limits).
My intent is to create an index on each of these tables like this:
CREATE INDEX idx_table_1_account_created ON table_1(account_id, created desc);
ALTER TABLE table_1 CLUSTER ON idx_table_1_account_created;
Then, finally, I'll create a view to union all of the records from the 6 tables into one list; obviously the records from the 6 tables will need to be re-sorted to come up with a unified list (in the correct order). The call will look like:
SELECT * FROM vw_all_transactions
WHERE account_id = 12345678901234
AND created >= '2014-01-01' AND created < '2014-02-01'
ORDER BY created desc;
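For reference, a minimal sketch of what vw_all_transactions could look like, assuming the six tables are named table_1 through table_6 and share the three common columns above:
CREATE VIEW vw_all_transactions AS
SELECT transaction_id, account_id, created FROM table_1
UNION ALL SELECT transaction_id, account_id, created FROM table_2
UNION ALL SELECT transaction_id, account_id, created FROM table_3
UNION ALL SELECT transaction_id, account_id, created FROM table_4
UNION ALL SELECT transaction_id, account_id, created FROM table_5
UNION ALL SELECT transaction_id, account_id, created FROM table_6;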
My question is related to creating the indexing and clustering scheme. Since the records are going to have to be re-sorted by the view anyway, is there any reason to specify the individual indexes as created desc? And does sorting this way have any penalties when periodically calling CLUSTER?
I've done some googling and reading but can't really seem to find any information that answers how this clustering is going to work.
Using PostgreSQL 9.2 on Heroku.