Let's say I have an app where users can make posts. I store these in a single DynamoDB table using the following design:
+--------+--------+---------------------------+
| PK     | SK     | (Attributes)              |
+--------+--------+---------------------------+
| UserId | UserId | username, profile, etc... | <-- user item
| UserId | PostId | body, timestamp, etc...   | <-- post item
+--------+--------+---------------------------+
When a user makes a post, my Lambda function receives the following data:
{
  "userId": <UserId>,
  "body": <Body>,
  etc...
}
My question is, should I first verify that the user exists before adding the post to the table, by using dynamodb.get({PK: userId, SK: userId})? This would make sure there won't be any orphaned posts, but the function would then consume both a read unit and a write unit.
One idea I have is to just write the post, potentially allowing orphaned posts. Then, I could have another Lambda function that runs periodically to find and remove any orphans.
This is obviously a simple case, but imagine a more complex system where objects have multiple relationships. It seems it could easily get very costly to check for relationship existence in these cases.
"Then, I could have another Lambda function that runs periodically to find and remove any orphans." <-- This could get very expensive over time, especially if you plan to do this by scanning the table.
I develop a system built on DynamoDB that has similar relationships, and I validate relationships before saving data because I do not want to have garbage data in my tables.
One option to consider is implicitly testing for the existence of a valid user via authentication & authorization. If a user has passed your auth tests, then you know that they exist, so you can add their posts with confidence.
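If you do decide to validate in the write path, one way to do the existence check and the write atomically is a DynamoDB transaction with a ConditionCheck on the user item. A minimal sketch with the JavaScript DocumentClient (the table name and the attribute_exists condition are assumptions based on the design above); keep in mind that transactional writes consume roughly twice the write capacity of plain puts:

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function createPost(userId, postId, body) {
  await docClient.transactWrite({
    TransactItems: [
      {
        // Fails the whole transaction if the user item is missing,
        // so an orphaned post can never be written.
        ConditionCheck: {
          TableName: 'AppTable', // assumed table name
          Key: { PK: userId, SK: userId },
          ConditionExpression: 'attribute_exists(PK)',
        },
      },
      {
        Put: {
          TableName: 'AppTable',
          Item: { PK: userId, SK: postId, body, timestamp: Date.now() },
        },
      },
    ],
  }).promise();
}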
I'm building an application where I would like to use the multi-tenant strategy of creating a schema for each client. Would it be appropriate to store all users in a users table within a single schema that includes a reference column for their respective schemas?
Db_app_01
/schema_public
/schema_public/table_users
/schema_client_1
Where in table_users I have:
| user_id | username | password | schema_id |
---------------------------------------------
| 1       | user1    | *        | 1         |
I was thinking that, with this, I could easily query the correct schema, since the schema_id would be available in the main users table, which is used for authentication.
Your approach looks fine to me, as long as there are not too many different users. When the number of tables and schemas goes into the 10000s, metadata queries will become sluggish, and it won't be much fun any more.
I wouldn't construct dynamic queries from the schema_id that explicitly reference the appropriate schema.
Rather, I would set search_path appropriately.
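A minimal sketch of that, assuming node-postgres (the pool, the function name, and the reset step are my own; the schema name should come from your trusted users table, not raw user input):

const { Pool } = require('pg');
const pool = new Pool();

async function queryForTenant(schemaName, sql, params) {
  const client = await pool.connect();
  try {
    // Identifiers can't be bound as query parameters, so quote the schema name defensively.
    await client.query(`SET search_path TO "${schemaName.replace(/"/g, '""')}", public`);
    return await client.query(sql, params);
  } finally {
    // Reset so the pooled connection doesn't carry this tenant's search_path.
    await client.query('SET search_path TO public');
    client.release();
  }
}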
Let's say I have 3 tables:
session
-------------------------
id | name | date
-------------------------
speaker
-------------------------
id | name
-------------------------
session_speaker
-------------------------
session_id | speaker_id
-------------------------
I already have endpoints in place to do the insertion:
POST /session
POST /speaker
What kind of REST request should I create to insert into the JOIN table: POST /session, or some other method (passing session_id and speaker_id)?
Note: I already have a PATCH request in place to activate or deactivate a session.
Question:
Basically, I'm seeking an ideal REST-based solution to handle CRUD operations for the JOIN table. Please advise.
You could use the following REST operation for creating the relationship:
PUT speakers/speaker/{speakerId}/session/{sessionId}/
I don't advise using plural names in URLs (e.g. speakers); I'd recommend a singular name such as SessionSpeaker. But since you can't change it from "speakers", I've used it as requested.
You should also use PUT instead of POST for inserting this data, as PUT is idempotent, i.e. it guards against inserting the same speaker at a session more than once.
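As a hypothetical sketch of what that handler could look like (Express and node-postgres are assumptions, as is a unique constraint on (session_id, speaker_id); the route, table, and column names follow the question):

const express = require('express');
const { Pool } = require('pg');

const app = express();
const pool = new Pool();

app.put('/speakers/speaker/:speakerId/session/:sessionId', async (req, res) => {
  // ON CONFLICT DO NOTHING makes the insert idempotent: repeating the same
  // PUT still leaves exactly one (session_id, speaker_id) row.
  await pool.query(
    `INSERT INTO session_speaker (session_id, speaker_id)
     VALUES ($1, $2)
     ON CONFLICT (session_id, speaker_id) DO NOTHING`,
    [req.params.sessionId, req.params.speakerId]
  );
  res.sendStatus(204);
});

app.listen(3000);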
To then retrieve speakers information you could use:
GET speakers/session/{sessionId}
GET speakers/speaker/{speakerId}
Another good answer regarding REST and entity multiplicity is here.
I'm looking for some best practices when it comes to modeling confidential hierarchical data in general and specifically with DynamoDB.
The scenario is best explained with an example:
Let's say we have a number of users. Each user has a number of products. Each product consists of a number of parts.
Typical use cases:
List all products for a given user
List all parts for a given product
So far I have modeled this in DynamoDB like this:
Users
----------------
HashKey: UserId
Products
-------------------
HashKey: UserId
RangeKey: ProductId
Parts
-------------------
HashKey: ProductId
RangeKey: PartId
The data is confidential and accessed through authenticated REST endpoints where an authentication token can be mapped to a UserId. Each user may be allowed to view other users' data through some group concept.
Listing all products for a given user is simple since UserId is a key in the products table:
GET /users/111/products becomes a simple Query(Table=Products, UserId=111)
But consider the case of listing all parts for a given product:
GET /users/111/products/222/parts
If I simply do a Query(Table=Parts, ProductId=222) then I will get the desired data fast, but I am not protecting against other users querying for data belonging to user 111, provided they somehow know about ProductId 222 (in reality, the IDs will of course be UUIDs or similar, so not easily guessable):
GET /users/119/products/222/parts
... would result in malicious user 119 retrieving data that doesn't belong to him, provided nothing is done to address this.
So here I imagine I need to do something like one of these:
1. First make another query to make sure product 222 in fact belongs to the given user
2. Duplicate the UserId in the Parts table and include it in the query condition (which basically means it will match either all rows or no rows when scanning through the set identified by ProductId): Query(Table=Parts, ProductId=222, UserId=111) (see the sketch after this list)
3. Use UserId as the hash key also in the Parts table and instead keep ProductId as a secondary index
4. Use a composite HashKey such as UserId_ProductId ("111_222") on the Parts table
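For concreteness, option 2 might look roughly like this with the JavaScript DocumentClient (a sketch only; the table and attribute names follow the model above, and UserId is assumed to be duplicated onto every Parts item):

const AWS = require('aws-sdk');
const docClient = new AWS.DynamoDB.DocumentClient();

async function listParts(userId, productId) {
  const result = await docClient.query({
    TableName: 'Parts',
    KeyConditionExpression: 'ProductId = :p',
    FilterExpression: 'UserId = :u',
    ExpressionAttributeValues: { ':p': productId, ':u': userId },
  }).promise();
  return result.Items; // empty if the product does not belong to userId
}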
If I need to return a 401 as opposed to just empty data, option 1 seems like the only approach. But if we imagine a deeper hierarchy of data, e.g. "users having inboxes having messages having parts having attachments" it seems this approach could eventually be expensive (listing all attachments for part P might result in a query to check that part P belongs to message M, that message M belongs to inbox I and that inbox I belongs to user U, and so on).
Does anyone have any good arguments for which approach is most favorable? Or am I doing something stupid and should be modeling my data in some other way completely?
We have a people table; each person has a gender defined by a gender_id referencing a genders table:
| people |
|-----------|
| id |
| name |
| gender_id |
| genders |
|---------|
| id |
| name |
Now, we want to allow people to create forms by themselves using a nice form builder. One of the elements we want to add is a select list with user-defined options:
| lists |
|-------|
| id |
| name |
| list_options |
|--------------|
| id |
| list_id |
| label |
| value |
However, they can't use the genders as a dropdown list because those live in a different table. They could create a new list with the same options as genders, but this isn't very nice, and if a new gender is added they'd need to add it in multiple places.
So we want to move the gender options into a list that the user can edit at will and will be reflected when a new person is created too.
What's the best way to move the genders into a list and list_options while still having a gender_id (or similar) column in the people table? Thoughts I've had so far include:
Create a 'magic' list with a known id and always assume that this contains the gender options.
I'm not a great fan of this because it sounds like using 'magic' numbers. The code will need some kind of 'map' between system-level select boxes and what they mean.
Instead of having a 'magic' list, move it out into an option that the user can choose, so they have a choice of which list contains the genders.
This isn't really much different, but the ID wouldn't be hardcoded. It would require more work looking through DB tables, though.
Have some kind of column(s) on the lists table that would mark it as pulling its options from another table.
This would likely require a lot more (and more complex) code to make it work.
Some kind of polymorphic table that I'm not sure how it would work, but I've just thought about it and wanted to write it down before I forget.
No idea how this would work because I've only just had the idea.
The easiest solution would be to change your list_options table to a view. If you have multiple tables that you need a dropdown list for, just UNION their result sets together in the view:
SELECT
    (your list id here) AS list_id, -- make this part of the primary key
    id,                             -- and this part of the primary key
    Name
FROM dbo.Genders
UNION
SELECT
    (your list id here) AS list_id,
    id,
    Name
FROM dbo.SomeOtherTable
This way it's automatically updated any time the data changes. You are going to want to test this, though: if the view gets big it might get slow. You can get around that by only pulling this information once in your application (or, say, caching it for 30 minutes and then refreshing, just in case).
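That application-side cache could be as simple as this sketch (Node.js; makeCachedListOptions is a hypothetical name, and queryFn stands for whatever runs the SELECT against the view):

// Returns a getter that re-runs queryFn only when the cached copy is older than ttlMs.
function makeCachedListOptions(queryFn, ttlMs = 30 * 60 * 1000) {
  let cache = null;
  let cachedAt = 0;
  return async function getListOptions() {
    if (!cache || Date.now() - cachedAt > ttlMs) {
      cache = await queryFn();
      cachedAt = Date.now();
    }
    return cache;
  };
}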
Your second option is to create a real list_options table and then create a procedure (or similar) which goes through all the other lookup tables and compiles their information into it. This will be faster for application performance, but it will require you to keep everything in sync. The easiest way to handle that is to create a series of triggers which rebuild portions of (or the entire) list_options table when something in the lookup tables is changed. In this case, I would suggest moving away from an automatically generated primary key and towards a composite key, like I mentioned with the views. Since the table is going to be rebuilt, the id will change, so it's best not to have anything assume that value is stable. With the composite key (list_id, lookup_id) it should always be the same no matter how many times that row is inserted into the table.
Let me explain my problem, and hopefully someone can offer some good advice.
I am currently working on a web app that stores information and metadata for a large number of applications. For each application there could be anywhere from ten to hundreds of comments that are tied to the application and an application version id. I am using MongoDB because of a need for easy future scalability and speed. I have read that comments should be embedded in the parent document for read-performance reasons, but I'm not sure that this works in my case. I read on another post:
In general, if you need to work with a given data set on its own, make it a collection.
By: #kb
In my case, however, I don't need to work with the comments on their own. Let me explain further. I will have a table of apps (that can be filtered) and will dynamically load entries as you scroll or filter through the list of apps. If I embed the comments within each application document, I am sending ALL the comments when I dynamically load the application entry into the table. However, I would like to do "lazy loading", in that I only want to load the comments when the user requests to see them (by clicking on the entry in the table).
As an example, my table might look like the following
| app name | version | rating | etc. | view comments |
-------------------------------------------------------
| app1     | v.1.0   | 4 star | etc. | click me!     |
| app2     | v.2.4.5 | 3 star | etc. | click me!     |
| ...
My question is: which would be more efficient? Are reads fast enough in MongoDB that it really doesn't matter that I am pulling all the comments with each application? If a user did not filter any of the applications and scrolled all the way to the bottom, they might load somewhere between 125k and 250k entries/applications.
I would suggest being more specific in your query: you can specify which parts of a document you'd like returned. This should allow you to avoid the overhead of fetching a bunch of embedded comments when you're only interested in displaying a few specific bits of information about the application.
You can do something like: db.collection.find({ appName : 'Foo'}, {comments : 0 }); to retrieve the application document with appName Foo, but specifically exclude the comments field (more likely an array of objects) embedded within it.
From the MongoDB docs
Retrieving a Subset of Fields
By default on a find operation, the entire document/object is returned. However we may also request that only certain fields are returned. Note that the _id field is always returned automatically.
// select z from things where x=3
db.things.find( { x : 3 }, { z : 1 } );
You can also remove specific fields that you know will be large:
// get all posts about mongodb without comments
db.posts.find( { tags : 'mongodb' }, { comments : 0 } );
EDIT
Also remember the limit(n) function to retrieve only n apps at a time. For instance, getting n=50 apps without their comments would be:
db.collection.find({}, {comments : 0 }).limit(50);
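And for the lazy-loading step itself, the inverse projection can fetch just the comments for the one app the user clicked (a sketch; the collection and field names are assumed to match the examples above):

// Fetch only the comments array for a single application when the user
// clicks "view comments"; all other fields except _id are excluded.
db.collection.find({ appName : 'app1' }, { comments : 1 });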