Two different approaches to structure my NoSQL database: what to choose?

I currently get to work with DynamoDB and I have a question regarding the structure I should choose.
I set up Twilio to receive WhatsApp messages from guests in a restaurant. Guests can send their feedback directly to my Twilio WhatsApp number. I receive that feedback via a webhook and save it in DynamoDB. The restaurant manager gets a dashboard (a React application) where they can monitor the feedback. While I am starting with one restaurant and one WhatsApp number, I will add more users and restaurants over time.
Now I have one of the following two structures in mind. With the first idea, I would always create a new item when a new message from a guest is sent to the restaurant.
With the second idea, I would (most of the time) update an existing entry. Only if the receiver / the restaurant doesn't exist yet, a new item is created. Every other message to that restaurant will just update the existing item.
Do you have any advice on what's the best way forward?
First idea:
PK (primary key), Created (Epoch time), Receiver/Restaurant (phone number), Sender/Guest (phone number), Body (String)
Sample data:
1, 1574290885, 4917123525993, 4916034325342, "Example Message 1" # Restaurant McDonalds (4917123525993)
2, 1574291036, 4917123525993, 4917542358273, "Example Message 2" # different sender (4917542358273)
3, 1574291044, 4917123525993, 4916034325342, "Example Message 3" # same sender as pk 1 (4916034325342)
4, 1574291044, 4913423525123, 4916034325342, "Example Message 4" # Restaurant Burger King (4913423525123)
Second idea:
{
Receiver (primary key),
Messages: {
{
id,
Created,
From,
Body
}
}
}
Sample data (same data as for the first idea, but structured differently):
{
Receiver: 4917123525993,
Messages: {
{
Created: 1574290885,
Sender: 4916034325342,
Body: "Example Message 1"
},
{
Created: 1574291036,
Sender: 4917542358273,
Body: "Example Message 2"
},
{
Created: 1574291044,
Sender: 4916034325342,
Body: "Example Message 3"
}
}
}
{
Receiver: 4913423525123,
Messages: {
{
Created: 1574291044,
Sender: 4916034325342,
Body: "Example Message 4"
}
}
}

If I read this correctly, in both approaches, the proposal is to save all messages received by a restaurant as a nested list (the Messages property looks like an object in the samples you've shared, but I assume it is an array since that would make more sense).
One potential problem I foresee is that DynamoDB limits the size of an item to 400 KB. That sounds like a lot, but you are bound to reach the limit fairly quickly if you use this application for something like a food order delivery system.
Another potential issue is that attributes nested inside a list or map cannot serve as table or index keys in DynamoDB, so most filtering on the proposed structure would require table scans, greatly increasing operational costs.
Unlike with relational DBs, the structure of your data in document DBs depends heavily on the questions you want to answer most frequently. In fact, you should avoid designing your NoSQL schema until you know what questions you want to answer, your access patterns, and your data volumes.
To come up with a data model, I will assume you want to answer the following questions with your table:
Get all messages received by a restaurant, ordered by timestamp (ascending or descending order can be chosen in the query by setting ScanIndexForward to true or false)
Get all messages sent by a user ordered by timestamp
Get all messages sent by a user to a restaurant, ordered by timestamp
Consider the following record structure :
{
pk : <restaurant id>, // Partition key of the main table
sk : "<user id>:<timestamp>", // Synthetic (generated) range key of the main table
messageBody : <message content>,
timestamp: <timestamp> // Local secondary index (LSI) on this field
}
You insert a new record of this structure for each new message that comes into your system. This structure allows you to :
Efficiently query all messages received by a restaurant ID using only the partition key
Efficiently retrieve all messages received by a restaurant and sent by a user using pk = <restaurant id> and begins_with(sk, <user id>)
The LSI on timestamp allows for efficiently filtering messages based on creation time.
However, this by itself does not allow you to query all messages sent by a user (to any restaurant, or to a specific restaurant). To do that, we can create a global secondary index (GSI) that uses the user ID as its partition key and a synthetic range key consisting of the restaurant ID and timestamp separated by a ':'. Because the table's sk holds the user ID combined with a timestamp, the user ID has to be stored in an attribute of its own (gsi_pk below) for the index to use it.
GSI structure
{
gsi_pk: <user Id>,
gsi_sk: "<restaurant Id>:<timestamp>",
messageBody : <message content>
}
messageBody is a non-key attribute projected onto the GSI.
The synthetic SK of the GSI lets you use the different key-matching modes that DynamoDB provides for sort keys (less than, greater than, begins_with, between).
This GSI allows us to answer the following questions:
Get all messages by a user (using only gsi_pk)
Get all messages by a user, sent to a particular restaurant, ordered by timestamp (gsi_pk = <user Id> and begins_with(gsi_sk, <restaurant Id>))
The system has some duplication of data, but that is in line with one of the core ideas of DynamoDB and of most NoSQL databases. I hope this helps!
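The key layout described above can be tried out without a DynamoDB table. What follows is a minimal in-memory sketch, not a DynamoDB client: the Message class and the query helper are hypothetical names made up for this illustration, and query only imitates a Query call (exact match on the partition key, optional begins_with on the sort key, results ordered by sort key). The sample values are the ones from the question.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    restaurant_id: str
    user_id: str
    timestamp: int  # epoch seconds
    body: str

    def to_item(self) -> dict:
        """Build one item carrying both the table keys and the GSI keys."""
        return {
            "pk": self.restaurant_id,
            "sk": f"{self.user_id}:{self.timestamp}",
            "gsi_pk": self.user_id,
            "gsi_sk": f"{self.restaurant_id}:{self.timestamp}",
            "timestamp": self.timestamp,
            "messageBody": self.body,
        }


def query(items, pk_name, pk_value, sk_name, sk_prefix=""):
    """Imitate a Query: partition key equality plus begins_with on the sort key."""
    hits = [
        item for item in items
        if item[pk_name] == pk_value and item[sk_name].startswith(sk_prefix)
    ]
    return sorted(hits, key=lambda item: item[sk_name])


TABLE = [
    m.to_item()
    for m in (
        Message("4917123525993", "4916034325342", 1574290885, "Example Message 1"),
        Message("4917123525993", "4917542358273", 1574291036, "Example Message 2"),
        Message("4917123525993", "4916034325342", 1574291044, "Example Message 3"),
        Message("4913423525123", "4916034325342", 1574291044, "Example Message 4"),
    )
]

# Main table: messages one guest sent to one restaurant.
by_restaurant_and_user = query(TABLE, "pk", "4917123525993", "sk", "4916034325342:")

# GSI: every message one guest sent, to any restaurant.
by_user = query(TABLE, "gsi_pk", "4916034325342", "gsi_sk")
```

Including the ':' separator in the prefix is a small design choice: it keeps one user ID from matching another ID that merely starts with the same digits.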

Storing multiple messages in a single record has several issues:
The size of each write to the database grows over time, which translates into cost and response time; in the worst case you hit the 400 KB item limit.
Race conditions between concurrent writes.
No way to aggregate messages by user, or to support other access patterns.
And the worst part is that I don't see any benefit in storing multiple messages together, other than perhaps being able to query all of them at once, which becomes a drawback as the size grows: you cannot ask for just the last 10 reviews, you always have to fetch everything and then pick the last 10.
Hence, go for the option where each message is stored as a separate item.
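The race condition mentioned above can be illustrated with a small simulation. This is a sketch under stated assumptions: store is a plain dictionary standing in for the table, read_item, write_item and put_message are hypothetical helpers, and the update is assumed to be a naive read-modify-write of the whole item.

```python
import copy

# In-memory stand-in for the single-item-per-restaurant table (hypothetical).
store = {"4917123525993": {"Receiver": "4917123525993", "Messages": []}}


def read_item(receiver):
    """Return a private copy, as a client would get from a read."""
    return copy.deepcopy(store[receiver])


def write_item(item):
    """Overwrite the whole item, as a naive read-modify-write would."""
    store[item["Receiver"]] = item


# Two webhook calls read the same item before either one writes it back.
seen_by_a = read_item("4917123525993")
seen_by_b = read_item("4917123525993")
seen_by_a["Messages"].append("Example Message 1")
seen_by_b["Messages"].append("Example Message 2")
write_item(seen_by_a)
write_item(seen_by_b)  # silently discards the message written by A

lost_update_result = store["4917123525993"]["Messages"]

# With one item per message, the two writes never touch the same item.
message_items = []


def put_message(receiver, created, sender, body):
    message_items.append(
        {"Receiver": receiver, "Created": created, "Sender": sender, "Body": body}
    )


put_message("4917123525993", 1574290885, "4916034325342", "Example Message 1")
put_message("4917123525993", 1574291036, "4917542358273", "Example Message 2")
```

In the first variant only one of the two messages survives; in the second both are kept, and the newest ones can be selected without loading the rest.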

Related

Using childByAutoId On Single Value?

I am pretty new to both Swift and Firebase, and I am attempting to make a simple app using Firebase as the backend. As far as I know, there is no memory-efficient way to use the numChildren() function without loading every single child into memory for counting, so I am implementing my own simple counter for the number of "Events" that have been created in my app.
The documentation for Firebase states that the childByAutoId() method should be used for updating lists in multi-user applications. I am assuming it adds a timestamp to the requested update and does them in order.
My question is whether it is necessary to use childByAutoId() when only updating a SINGLE field in a multi-user application. That is, will there be conflicts on my numEvents field if I do:
dbRef = FIRDatabase.database().reference()
dbRef.child("numEvents").setValue(num)
Or must I do:
dbRef = FIRDatabase.database().reference()
dbRef.child("numEvents").childByAutoId().setValue(num)
In order to avoid write conflicts? My only real confusion is that the documentation for childByAutoId stresses that it is useful when the children are a list of items, but mine is only a single item.
If you are only updating a single field you should not be using childByAutoId. To update a child value for an object, you need to obtain a reference to that object somehow, perhaps by a query of some sort (in many cases you will naturally already have a reference to the object if it needs to be changed) and you can change the value like this:
dbRef.child("events").child(objectToUpdateId).child(fieldToUpdateKey).setValue(newValue)
childByAutoId in this context would be used to create a new field like:
dbRef.child("events").childByAutoId().setValue(newObject)
I'm not exactly sure how this applies to your situation, but those are some descriptions of how to update a field, and use childByAutoId.
childByAutoId creates a unique key for a node, so that the same key is never used twice. That prevents data conflicts such as inconsistent or overwritten entries, but it does not prevent write conflicts; to avoid write conflicts, use transaction blocks.
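Why a counter needs a transaction rather than a plain set can be shown with a language-agnostic simulation. This is only a sketch of the general read, compute, compare-and-set pattern, not Firebase code: the Node class, its version field and run_transaction are hypothetical names invented for this illustration.

```python
class Node:
    """Hypothetical stand-in for a single database value with a version."""

    def __init__(self, value):
        self.value = value
        self.version = 0

    def read(self):
        return self.value, self.version

    def compare_and_set(self, expected_version, new_value):
        """Write only if nobody else has written since expected_version."""
        if self.version != expected_version:
            return False
        self.value = new_value
        self.version += 1
        return True


def run_transaction(node, update):
    """Retry the update until it is applied to an unchanged value."""
    while True:
        current, version = node.read()
        if node.compare_and_set(version, update(current)):
            return node.value


# Blind overwrite: both clients read 0, so both write 1.
blind = Node(0)
seen_by_a, _ = blind.read()
seen_by_b, _ = blind.read()
blind.value = seen_by_a + 1
blind.value = seen_by_b + 1  # one increment is lost

# Transaction: a write based on stale data is refused and must be retried.
safe = Node(0)
_, stale_version = safe.read()
run_transaction(safe, lambda n: n + 1)  # another client commits first
stale_write_refused = not safe.compare_and_set(stale_version, 1)
run_transaction(safe, lambda n: n + 1)  # the retry sees the fresh value
```

After both clients have run, the blind counter holds 1 while the transactional counter holds 2.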
The best way to learn is to try it out.
If num == 1, the result of the first example will be
dbRef:{
numEvents:1
}
While the second will be
dbRef:{
numEvents:{
//The auto-generated key
KLBHJBjhbjJBJHB:1
}
}
childByAutoId is useful if you want to save multiple children of the same type under one node; that way each child has its own unique identifier.
For example
pet:{
KJHBJJHB:{
name:fluffy,
owner:John Smith,
},
KhBHJBJjJ:{
name:fluffy,
owner:Jane Foster,
}
}
This way you have a unique identifier for cases where there is no clear way with the item data to guarantee it will be unique (in this case the pet's name)
A few things here:
childByAutoId is not a timestamp. It is used to create a unique child node under any given node.
Use case of childByAutoId:
Say you have a messages node that stores messages from multiple users in a group chat. Each user can add messages to the chat, so you would do something like this each time a user sends a message:
dbRef = FIRDatabase.database().reference()
dbRef.child("messages").childByAutoId().setValue(messageText)
This creates a unique message ID for each message, whichever user sends it. It acts somewhat like the primary key of a message in a relational database.
The structure of database will be something like this:
messages: {
"randomIdGenerated-12asd12" : "hello",
"randomIdGenerated-12323D123" : "Hi, HOw are you",
}
So in your case your first approach is good enough, since you don't need a unique node for counting the number of events added.

How to form an unordered key with many elements in mongodb

I'm attempting to use mongodb to implement a simple messaging system between two users in mongo. I want to be able to take two users, user0 and user1, and search for their entry in a collection. If the entry for those two users doesn't exist I want to create it and then add the message that was sent to its message field. If it does exist I just want to push the message to the message field.
I'm not really sure the best way to implement this.
db.privateChat.update(
{between:{$all:['user0', 'user1']}},
{$push:{message:'text'}}, {upsert:true}
)
I have tried this and other similar schemes, but they don't work. They produce the error:
"Cannot create base during insert of update. Caused by :ConflictingUpdateOperators Cannot update 'between' and 'between' at the same time"
I can think of other ways to do this, such as producing a symmetric key (where the order of the users doesn't matter for the purposes of the search) by, say, adding the hashes together, or writing a query that checks whether each of messenger0 and messenger1 is either user0 or user1, but these don't seem like great ways of doing it. Is this totally the wrong approach?
Thanks.
I think this could be solved by design.
Let's say that we have a document like this in a chats collection:
chat: {
_id,
between: [arrayOfIds],
startTime,
events: [
{ message: { fromUserId, timeStamp, data } }
]
}
Messages will then be stored as message objects inside the chat's events array.
The app will know the chat _id, so there will be no issues when you have a group chat between more than two users.
This approach also helps you stay under the document size limit, as you could start a new chat entry every week, day, etc.
Have fun!
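One way to combine the symmetric key from the question with the idea of starting a new chat entry every week is sketched below. This is an assumption-laden illustration rather than MongoDB code: chat_key, add_message and the in-memory chats dictionary are hypothetical, and the key format (sorted user IDs joined with '|', then the ISO year and week) is just one possible convention that assumes user IDs contain neither '|' nor '#'.

```python
from datetime import datetime, timezone


def chat_key(user_ids, sent_at):
    """Order-independent key for a group of users, bucketed by ISO week."""
    members = "|".join(sorted(set(user_ids)))
    year, week, _ = sent_at.isocalendar()
    return f"{members}#{year}-W{week:02d}"


chats = {}  # stand-in for the chats collection, keyed by chat_key


def add_message(from_user, to_user, text, sent_at):
    """Create the chat entry on first use, then append the message."""
    key = chat_key([from_user, to_user], sent_at)
    chat = chats.setdefault(
        key, {"between": sorted([from_user, to_user]), "events": []}
    )
    chat["events"].append(
        {"message": {"fromUserId": from_user, "timeStamp": sent_at, "data": text}}
    )
    return key


when = datetime(2019, 11, 20, tzinfo=timezone.utc)
first = add_message("user0", "user1", "text", when)
second = add_message("user1", "user0", "reply", when)
```

Both calls end up in the same entry because the IDs are sorted before the key is built, so no query over both orderings is needed.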

Modeling hierarchical data with authentication using DynamoDB

I'm looking for some best practices when it comes to modeling confidential hierarchical data in general and specifically with DynamoDB.
The scenario is best explained with an example:
Let's say we have a number of users. Each user has a number of products. Each product consists of a number of parts.
Typical use cases:
List all products for a given user
List all parts for a given product
So far I have modeled this in DynamoDB like this:
Users
----------------
HashKey: UserId
Products
-------------------
HashKey: UserId
RangeKey: ProductId
Parts
-------------------
HashKey: ProductId
RangeKey: PartId
The data is confidential and accessed through authenticated REST endpoints where an authentication token can be mapped to a UserId. Each user may be allowed to view other users' data through some group concept.
Listing all products for a given user is simple since UserId is a key in the products table:
GET /users/111/products becomes a simple Query(Table=Products, UserId=111)
But consider the case of listing all parts for a given product:
GET /users/111/products/222/parts
If I simply do a Query(Table=Parts, ProductId=222) then I will get the desired data fast, but I am not protecting against other users querying for data belonging to user 111, provided they somehow know about ProductId 222 (in reality, IDs will of course be UUIDs or similar, so not so easily guessable):
GET /users/119/products/222/parts
... would result in malicious user 119 retrieving data that doesn't belong to him, provided nothing is done to address this.
So here I imagine I need to do something like one of these:
First make another query to make sure product 222 in fact belongs to the given user
Duplicate the UserId in the Parts table and include it in the query condition (which basically means it will match either all rows or no rows when scanning through the set identified by ProductId): Query(Table=Parts, ProductId=222, UserId=111)
Use UserId as the hash key also in the Parts table and instead keep ProductId as a secondary index
Use a composite HashKey such as UserId_ProductId ("111_222") on the Parts table
If I need to return a 401 as opposed to just empty data, option 1 seems like the only approach. But if we imagine a deeper hierarchy of data, e.g. "users having inboxes having messages having parts having attachments" it seems this approach could eventually be expensive (listing all attachments for part P might result in a query to check that part P belongs to message M, that message M belongs to inbox I and that inbox I belongs to user U, and so on).
Does anyone have any good arguments for which approach is most favorable? Or am I doing something stupid and should be modeling my data in some other way completely?
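To make the last option in the list above concrete, here is a minimal in-memory sketch of the composite hash key idea. It is not DynamoDB code: parts_table, parts_key, put_part and list_parts are hypothetical names, and the sketch assumes the user ID is taken from the authentication token rather than from the URL.

```python
# Stand-in for the Parts table: composite hash key -> {PartId: part data}.
parts_table = {}


def parts_key(user_id, product_id):
    """Composite hash key such as "111_222"."""
    return f"{user_id}_{product_id}"


def put_part(user_id, product_id, part_id, data):
    parts_table.setdefault(parts_key(user_id, product_id), {})[part_id] = data


def list_parts(authenticated_user_id, product_id):
    """A single lookup: a product owned by someone else yields no rows."""
    partition = parts_table.get(parts_key(authenticated_user_id, product_id), {})
    return sorted(partition.items())


put_part("111", "222", "p1", {"name": "wheel"})
put_part("111", "222", "p2", {"name": "frame"})

owner_view = list_parts("111", "222")
intruder_view = list_parts("119", "222")
```

As noted in the question, this returns empty data instead of an explicit error for the wrong user, and sharing through groups would still need a separate check.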

Correct REST endpoint design when dealing with messaging

I'm struggling to work out the correct REST endpoints for a certain situation. On my website it's possible for users to send messages to each other. One user is able to send messages to multiple recipients.
I think that /v1/users/123/messages would return all messages that have been sent to user 123
What endpoint should I use for messages that user 123 has sent?
My database structure is as follows...
accounts table
id INT
username VARCHAR(64)
messages table
id INT
account_id INT -- this is the sender's account ID
subject VARCHAR(128)
message TEXT
messagerecipients table
id INT
message_id INT
account_id INT -- this is the recipient's account ID
The messages table gives each message exactly one sender (a many-to-one relationship between messages and senders).
The messagerecipients table defines a many-to-many relationship between messages and their recipients.
Also I'm reading through a PDF on API design at the moment which seems to suggest I should hide this kind of complexity behind the query string.
For instance....
/v1/emails?filter=author_id(123)
/v1/emails?filter=recipient_id(123)
Thoughts?
I would expect
/v1/users/123/messages
to return all the messages that belong to this user. This means received, sent, deleted, tagged, drafted, etc.
To specify a subset of a resource, you can go two ways, as you and bertvh stated:
Querystring:
I find it perfectly valid for filtering as e.g.
/v1/users/123/messages?type=received&folder=important
Or as a subresource:
Use this if you expect to have a lot of filter options on a higher level e.g.
/v1/users/123/messages/received?folder=important
As you can see, this removes one filter parameter from the query string.
And like bertvh stated, the underlying database schema is irrelevant for serving the responses.
I would do something like this.
To get all messages sent by a user:
/v1/users/123/sent
To get all messages received by a user:
/v1/users/123/inbox
Your database structure is irrelevant for the resource scheme, but it can influence the payload structure. If you want to use JSON, a message could look something like this:
{
"sender": 123,
"receivers": [124, 125],
"content": "My message content"
}
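How the sent and received endpoints discussed in these answers map onto one message collection can be sketched as a plain filter function. This is an illustration only: MESSAGES, messages_for_user and the kind parameter are hypothetical names, the second sample message is invented for the example, and no web framework is involved.

```python
# Sample payloads in the shape shown above; the "id" field is an assumption.
MESSAGES = [
    {"id": 1, "sender": 123, "receivers": [124, 125], "content": "My message content"},
    {"id": 2, "sender": 124, "receivers": [123], "content": "A reply"},
]


def messages_for_user(user_id, kind=None):
    """Back /v1/users/<id>/messages, optionally narrowed to sent or received."""
    if kind not in (None, "sent", "received"):
        raise ValueError(f"unknown message type: {kind}")
    result = []
    for message in MESSAGES:
        sent = message["sender"] == user_id
        received = user_id in message["receivers"]
        if (kind is None and (sent or received)) \
                or (kind == "sent" and sent) \
                or (kind == "received" and received):
            result.append(message)
    return result


all_ids = [m["id"] for m in messages_for_user(123)]
sent_ids = [m["id"] for m in messages_for_user(123, kind="sent")]
received_ids = [m["id"] for m in messages_for_user(123, kind="received")]
```

The same function can serve the query-string form (type=received) and the subresource form (/messages/received), which is why the choice between them does not depend on the database schema.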

NOSQL Table Schema

I'm trying to plan a NOSQL table schema. There are relationships in my data, but they are mostly what would be N:N in a relational db; there are very few normal 1:N relationships.
So in this case, I'm trying to create implicit relationships that will allow me to browse from both ends of the relationship. I'm using Azure Table Storage, so I understand that full-text searching isn't available; I can only retrieve an "object" by its Partition Key + Row Key combination.
So imagine I have a table called "People" and a table called "Hamburgers" and each object in the tables can be related to multiple objects in the other table. Hamburgers are eaten by many people, people each eat many hamburgers.
Since the relationship is probably weighted to the people side - i.e. there are more people per hamburger than vice-versa, I would handle this in the tables like this:
Hamburger Table
Partition Key: Only 1 partition
Row Key: Unique ID
People Table
Partition Key: Only 1 partition
Row Key: Unique ID
"Columns": an extra value for every hamburger the person eats
Hamburger-People Table
Partition Key: Hamburger Row Key
Row Key: People Row Key
This way, if I'm looking at a hamburger and want to see all the people that eat it, I can go to the Hamburger-People table and use my Hamburger's Row Key to get the partition of all the people that eat the hamburger.
If I'm at a person and want to see all the hamburgers he/she eats, I have the extra values with the Row Keys of the hamburgers the person eats.
When inserting data into the tables, if the data involves a hamburger/person relationship, I would insert both values in the proper tables, then create the Hamburger-People table. If I was trying to keep a duplicate-free list of hamburgers, I would need to search the Hamburger table first to make sure the hamburger wasn't already in there (like "Whopper" - if it's in there, I wouldn't insert it again). Then, I would need to go insert a row in the hamburger's existing partition in Hamburger-People table.
But for the most part, the no-duplicate requirement doesn't exist.
Is this a good best-practices approach to NOSQL schema, or am I going to run into problems later?
UPDATE
Also, I would like to be able to partition the data tables later, but I'm not sure how to do so with this structure; adding a 2nd partition to the hamburger table would require me to store an extra value in the hamburger-People table, and I'm not sure if that would start to be too complex.
OK, nice questions. I think most of them are ones that every RDBMS developer faces as soon as they enter the NoSQL world:
1. How to group the partitions?
To get the most out of partitions, keep in mind that the load on your database should be distributed across your servers. Let's see what will happen with your approach:
Suppose a person with key "A" enters the restaurant. You save the person and their burger, a Classic Tasty (key "T"): the person record goes to server X and the burger goes to server Y. Now a new customer with key "B" comes in and wants something different, a burger "W": again the person goes to server X, and this time the burger goes to server X as well. Server X is now taking most of the load. If you repeat this, server X becomes a bottleneck, because 75% of the records go there (all of the people and 50% of the burgers). And the problem gets even worse when you query, because all the queries hit server X.
To solve this, you could use the person's key as part of the partition key for the relationship, so that a person is stored on the same server as their burger relationships. Your workload will then be balanced, and you won't have problems if one of the servers goes down: the person and their hamburgers will be "lost" together, which is at least a consistent "inconsistency".
2. Should I use a "relationship" in a NoSQL database?
Remember that NoSQL means you are allowed to duplicate information whenever your problem calls for it, in order to avoid "overquerying". If you store together the information that is commonly queried together, you avoid a round trip to the database. So if you store a "transaction" instead of "person and burgers", you get better performance and avoid some hits to the database. Let's take an example with real data and compare your approach with "my" approach:
Joe Black comes to the restaurant and asks for a Tasty. With your approach, you will do the following writes:
Create a Joe Black record
Create a burger transaction record
If you want to list your daily transactions, you will need to:
Get all the day's records from the person-burger "table", then go to the person "table" to retrieve the customers' names, and then go to the hamburger records to retrieve their names. (You won't be able to do cross-table queries, because some records could be on one server and others on a second server.)
OK, what if you create a "transactions" table and store the following JSON in it:
{ custid: "AAABCCC",
name: "Joe", lastName: "Black",
date: "2012/07/07",
order: {
code: "Burger0001",
name: "Tasty",
price: 3.5
}
}
I know you will have several records with the same "Tasty" description. That is denormalization, which is very useful when you approach this type of problem with a NoSQL solution. Now, how many writes did you need to store the information in the database? Just one. And how many queries will you need to retrieve the information at the end of the day? Again, just one. It will create some problems, but it will save you a lot of work too. For example: could you reprint the order easily? (Yes, you can!) What if the name of the customer changes? Is that even possible?
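The one-write, one-query claim can be checked with a tiny sketch. It uses a plain list in place of a real table: transactions, record_order and orders_on are hypothetical names, and the second order is invented sample data.

```python
transactions = []  # stand-in for the "transactions" table


def record_order(custid, name, last_name, date, code, burger_name, price):
    """One write stores the customer and burger details together."""
    transactions.append({
        "custid": custid,
        "name": name,
        "lastName": last_name,
        "date": date,
        "order": {"code": code, "name": burger_name, "price": price},
    })


def orders_on(date):
    """One query returns everything needed to list the day's orders."""
    return [t for t in transactions if t["date"] == date]


record_order("AAABCCC", "Joe", "Black", "2012/07/07", "Burger0001", "Tasty", 3.5)
record_order("DDDEFFF", "Ann", "White", "2012/07/08", "Burger0001", "Tasty", 3.5)

daily = orders_on("2012/07/07")
report = [f'{t["name"]} {t["lastName"]}: {t["order"]["name"]}' for t in daily]
```

The burger name and price are repeated in every record, which is the duplication that denormalization accepts in exchange for avoiding lookups in other tables.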
I hope this helps in some way.
I'm the creator of http://djondb.com, so my inside knowledge gives me a different perspective on these problems, based on what a database is able to do. I don't know how Azure handles queries if you can only query by row keys and not by document values, but I hope this gives you some insight.