Instant messaging schema design advice - PostgreSQL

I'm trying to build instant messaging functionality in my app as part of a bigger project.
Chats can have more than 2 participants (group chats).
If participant A deletes a message, it should still be visible to participant B (that's why I used the Message Participants table).
The same applies to conversations.
By the same logic, if all participants delete the conversation/message, it should be erased from the DB.
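For reference, here is a rough sketch of the tables I'm describing, created via psycopg2 (table and column names are illustrative, not my exact DDL):

# A rough sketch of the tables described above.
# Table and column names are illustrative, not the exact DDL from my diagram.
import psycopg2

DDL = """
CREATE TABLE IF NOT EXISTS conversations (
    id          BIGSERIAL PRIMARY KEY,
    created_at  TIMESTAMPTZ NOT NULL DEFAULT now()
);

CREATE TABLE IF NOT EXISTS conversation_participants (
    conversation_id BIGINT NOT NULL REFERENCES conversations(id) ON DELETE CASCADE,
    user_id         BIGINT NOT NULL,
    deleted_at      TIMESTAMPTZ,   -- set when this participant "deletes" the conversation
    PRIMARY KEY (conversation_id, user_id)
);

CREATE TABLE IF NOT EXISTS messages (
    id              BIGSERIAL PRIMARY KEY,
    conversation_id BIGINT NOT NULL REFERENCES conversations(id) ON DELETE CASCADE,
    sender_id       BIGINT NOT NULL,
    body            TEXT NOT NULL,
    sent_at         TIMESTAMPTZ NOT NULL DEFAULT now()
);

-- One row per (message, participant): A can delete a message while B still sees it.
-- When every row for a message has deleted_at set, the message itself can be purged.
CREATE TABLE IF NOT EXISTS message_participants (
    message_id  BIGINT NOT NULL REFERENCES messages(id) ON DELETE CASCADE,
    user_id     BIGINT NOT NULL,
    deleted_at  TIMESTAMPTZ,
    PRIMARY KEY (message_id, user_id)
);
"""

with psycopg2.connect("dbname=im_app") as conn:   # placeholder connection string
    with conn.cursor() as cur:
        cur.execute(DDL)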
Questions:
I'm afraid that this schema is too cumbersome, meaning that the queries will become too slow once the app passes a certain traffic mark (1k active users? I'm guessing).
Message Participants will have multiple records for each message - one for each participant in the chat. Instant messaging means those writes will happen with very tight timing. Wouldn't that be a problem?
Should I add a Redis layer to manage messaging for a chat's active sessions? It would store the recent messages and actively sync the PostgreSQL DB with them (perhaps with the async transaction functionality that PostgreSQL has?).
UPDATED schema:
I would also gladly hear ideas for a "read" status feature. I'm assuming it's much more complex with group chats, so at least offering it for 1:1 chats would be nice.

I am a little confused by your diagram. Shouldn't the Conversation Participants be linked to the Conversations instead of the Message? The FKs look all right, just the lines appear wrong.
I wouldn't be worried about performance yet. The Premature Optimization Anti-Pattern warns us not to give up a clean design for performance reasons until we know whether we are going to have a performance problem. You anticipate 1000 users - that's not much for a modern information system. Even if they are all active at the same time and enter a message every 10 seconds, this will just mean 100 transactions per second, which is nothing to be afraid of. Of course, I don't know the platform on which you are going to run this. But it should be an easy task to set up those tables and write a simple test program that inserts those records as fast as possible.
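For example, a throwaway script along these lines (connection string made up, table names assumed to match the sketch above) will quickly tell you what insert rate your platform sustains:

# Quick-and-dirty insert benchmark: one transaction per posted message,
# plus one message_participants row per chat member.
import time
import psycopg2

N_MESSAGES = 10_000
PARTICIPANTS = [1, 2, 3]   # a pretend group chat with three members

conn = psycopg2.connect("dbname=im_app")   # placeholder connection string
cur = conn.cursor()

cur.execute("INSERT INTO conversations DEFAULT VALUES RETURNING id")
conversation_id = cur.fetchone()[0]
conn.commit()

start = time.perf_counter()
for i in range(N_MESSAGES):
    cur.execute(
        "INSERT INTO messages (conversation_id, sender_id, body) "
        "VALUES (%s, %s, %s) RETURNING id",
        (conversation_id, PARTICIPANTS[i % len(PARTICIPANTS)], f"message {i}"),
    )
    message_id = cur.fetchone()[0]
    for user_id in PARTICIPANTS:
        cur.execute(
            "INSERT INTO message_participants (message_id, user_id) VALUES (%s, %s)",
            (message_id, user_id),
        )
    conn.commit()   # commit per message, like a real posting
elapsed = time.perf_counter() - start
print(f"{N_MESSAGES / elapsed:.0f} messages/second")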
Your second question makes me wonder how "instant" you expect your message passing to be. Shall all viewers of a message receive each keystroke of the text within a millisecond? Or do they just need to see each message appear right after it was posted? Anyway, the limiting factor for user responsiveness will probably be the network, not the database.
Maybe this is not mainly a database design issue. Let's assume you will have a tremendous rate of postings and viewings. But not all conversations will be busy all the time. If the need arises - but not earlier - it might be necessary to hold the currently busy conversations in memory and use the database just as a backup for future times when they aren't busy any more.
Concerning your additional comments:
100k users: This is a topic not for this forum, but concerning business development of a startup. Many founders of startup companies imagine huge masses of users being attracted to their site, while in reality most startups just fail or only reach very few. So beware of investments (in money, but also in design and implementation effort) that will only pay in the highly improbable case that your company will be the next Whatsapp.
In case you don't really anticipate such masses of users but just want to imagine this as a programming exercise, you still have a difficult task. You won't have the platform to simulate the traffic, so there is no way to make measurements on where you actually have a performance problem to solve. That's one of the reasons for the Premature Optimization warning: Unless you know positively where you have a bottleneck, you - and all of us - will be just guessing and probably make the wrong decisions.
Marking a message as read is easy: introduce a boolean attribute read on Message Participants, and set it to true as soon as, well, the user has read the message. It's up to your business requirements in which cases and to whom you show this.
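A minimal sketch, assuming the illustrative table names from earlier:

# Add the per-participant read flag and mark a whole conversation as read for one user.
# Table/column names match the earlier sketch; READ is a non-reserved word in PostgreSQL.
import psycopg2

conn = psycopg2.connect("dbname=im_app")   # placeholder connection string

# One-off migration: the boolean attribute described above.
with conn.cursor() as cur:
    cur.execute(
        "ALTER TABLE message_participants "
        "ADD COLUMN IF NOT EXISTS read BOOLEAN NOT NULL DEFAULT FALSE"
    )
conn.commit()

def mark_conversation_read(conversation_id: int, user_id: int) -> None:
    """Mark every message in a conversation as read for one user, e.g. when they open the chat."""
    with conn.cursor() as cur:
        cur.execute(
            """
            UPDATE message_participants mp
               SET read = TRUE
              FROM messages m
             WHERE m.id = mp.message_id
               AND m.conversation_id = %s
               AND mp.user_id = %s
               AND NOT mp.read
            """,
            (conversation_id, user_id),
        )
    conn.commit()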

Related

NoSQL database design using single table

How satisfied are you with the statement "You should maintain as few tables as possible in a DynamoDB application. Most well designed applications require only one table." ?
I have listed my use cases for a NoSQL design, but restricting myself to a single-table design makes the design complex and requires every developer working with the NoSQL table to understand the logical complexity of adhering to the principles of partitioning and performance. To give a few examples, my app does the following things:
Register a user account with a mobile device.
Allow multiple users to check their account on any mobile device that has the app installed.
Log user activities, which could range from 10 to 1,000 per user per day.
A batch job periodically checks whether any user account has updates and then sends a notification to all devices the user has logged in on (as the last logged-in user on the device).
Of course these FCM notifications are based on user preferences for notifications.
And lastly, the user account is keyed on a unique email address, and the user can update all other attributes of their account from any device.
I found that the batch job, which periodically scans the whole table, forced me to create a second table so that I can spawn as many threads as needed to process the rows.
How can I make all of this fit in a single table?
So this is all Rick Houlihan's fault. He's a Principal Engineer at AWS, focusing on DynamoDB.
The opening statement is that most well-designed applications only require one table; what is missing is the design guidance on how you should structure your table to be effective.
For a single-table design in your example, you would want a partition per device and a partition per user, to which you would write the relevant information per device or per user. You would mix and match this all in your single table. You would use global secondary indexes to retrieve the relevant data, but that index design would depend on your access patterns.
Essentially the way you model your data for a single table design is completely different to an RDBMS, so you need to throw everything you knew out the window and learn it all again.
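As a very rough illustration (the generic PK/SK key convention, attribute names and index name below are just one common example, not a prescription), user, activity and device items can all live in the one table:

# Illustrative single-table layout using generic PK/SK keys and one GSI (boto3).
# The key convention, attribute names and index name are an example, not a prescription.
import boto3
from boto3.dynamodb.conditions import Key

table = boto3.resource("dynamodb").Table("app_table")   # PK + SK, both strings

# User profile item, keyed by the unique e-mail address
table.put_item(Item={
    "PK": "USER#alice@example.com",
    "SK": "PROFILE",
    "name": "Alice",
    "notification_prefs": {"push": True},
})

# Activity log items live in the same user partition, sorted by timestamp
table.put_item(Item={
    "PK": "USER#alice@example.com",
    "SK": "ACTIVITY#2024-01-01T12:00:00Z",
    "action": "login",
})

# Device item recording the last logged-in user; GSI1 inverts the relationship so the
# notification batch job can ask "which devices did this user last log in on?"
table.put_item(Item={
    "PK": "DEVICE#abc123",
    "SK": "LAST_LOGIN",
    "GSI1PK": "USER#alice@example.com",
    "GSI1SK": "DEVICE#abc123",
})

# One partition query returns a user's profile and recent activity together
profile_and_activity = table.query(
    KeyConditionExpression=Key("PK").eq("USER#alice@example.com")
)["Items"]

# The same kind of query against the GSI drives the per-user device notifications
devices = table.query(
    IndexName="GSI1",
    KeyConditionExpression=Key("GSI1PK").eq("USER#alice@example.com"),
)["Items"]

The point is that items read together live in the same partition, and the secondary index answers the reverse lookup that the batch job needs, instead of a full table scan.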
I recommend reading these blog posts
https://www.trek10.com/blog/dynamodb-single-table-relational-modeling/
https://www.jeremydaly.com/takeaways-from-dynamodb-deep-dive-advanced-design-patterns-dat403/
and watching Rick's reInvent session on it multiple times... typically by about the 10th time light bulbs start going off...
Rick's DynamoDB Advanced Design Patterns talk - https://www.youtube.com/watch?v=6yqfmXiZTlM
There is a fair amount of depth to it. Good luck!

Stuck with understanding how to build a scalable system

I need some guidance on how to properly build out a system that will be able to scale. I will give you some information about what I am trying to do and then ask my specific question.
I have a site where I want visitors to send some data to be processed. They input the data into a textarea or upload it in a file. Simple. The data is somewhat preprocessed on the client side before a POST request is made to a REST endpoint.
What I am stuck on is this: what is a good way to take this posted data, store it, and associate an id with it that references the user, given that I cannot process the data fast enough to return it to the user in a reasonable amount of time?
This question is a bit vague and open to opinion, I admit it. I just need a push in the right direction to keep moving. What I have been considering is throwing the data into a message queue, having some workers process the data elsewhere, and, when the data is processed, alerting the user as to where to find it with some sort of link to an S3 bucket or just a URL to a file. The other idea was to just run a request for each item to be processed against another endpoint that already processes individual records, in some sort of loop on the client side. The problem with this idea is as follows:
Processing the data may take somewhere from 30 minutes to 2 hours depending on the amount they want processed. It's not ideal for them to just sit there and wait for that to finish, depending on the number of records they need processed, so I have mostly ruled this out.
Any guidance would be very much appreciated as I don't have any coworkers to bounce things off of, nor do I know many people with the domain knowledge that I could freely ask. If this isn't the right place to ask this, could you point me in the right direction as to where it should be asked?
Chris
If I've got you right, your pipeline is:
Accept item from user
Possibly preprocess/validate it (?)
Put into some queue
Process data
Return result.
You may use one or several queues at stage (3). An entity from the user gets added to one of the queues. If it's big enough, it could be stored in S3 or similar storage, with only info about it put into the queue: a link, the add date, a user id (or email or the like). Processors can pull items from the queue and give feedback to users.
If you have no strict requirements on ordering, things get much simpler: you don't need any synchronisation between them. Treat all the components (upload acceptors, queues, storage and processors) as independent pools of processes. Monitor each pool separately. If there is a bottleneck, add machines to that pool.
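A minimal sketch of that flow, assuming SQS for the queue and S3 for the payloads (queue URL, bucket name and the notification step are placeholders):

# Web tier enqueues a small reference; workers pull, process, publish the result, notify.
import json
import boto3

sqs = boto3.client("sqs")
s3 = boto3.client("s3")
QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"   # placeholder
BUCKET = "my-results-bucket"                                          # placeholder

def process(data: bytes) -> bytes:
    """Placeholder for the 30-minute-to-2-hour processing step."""
    return data

def notify_user(user_id: str, result_key: str) -> None:
    """Placeholder: e-mail or in-app message containing a presigned S3 link."""
    url = s3.generate_presigned_url("get_object", Params={"Bucket": BUCKET, "Key": result_key})
    print(f"notify {user_id}: {url}")

def accept_upload(user_id: str, job_id: str, raw_data: bytes) -> None:
    """REST endpoint: stash the payload in S3 and put only a reference on the queue."""
    key = f"uploads/{job_id}"
    s3.put_object(Bucket=BUCKET, Key=key, Body=raw_data)
    sqs.send_message(
        QueueUrl=QUEUE_URL,
        MessageBody=json.dumps({"user_id": user_id, "job_id": job_id, "key": key}),
    )

def worker_loop() -> None:
    """Independent worker process: pull a job, process it, store the result, notify."""
    while True:
        resp = sqs.receive_message(QueueUrl=QUEUE_URL, MaxNumberOfMessages=1, WaitTimeSeconds=20)
        for msg in resp.get("Messages", []):
            job = json.loads(msg["Body"])
            data = s3.get_object(Bucket=BUCKET, Key=job["key"])["Body"].read()
            result_key = f"results/{job['job_id']}"
            s3.put_object(Bucket=BUCKET, Key=result_key, Body=process(data))
            notify_user(job["user_id"], result_key)
            sqs.delete_message(QueueUrl=QUEUE_URL, ReceiptHandle=msg["ReceiptHandle"])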

Converting resources in a RESTful manner

I'm currently stuck with designing my endpoints in a way that conforms to REST principles but also ensures the integrity of the underlying data.
I have two resources, ShadowUser and RealUser, where the first one only has a first name, last name and an e-mail.
The second one has many more properties, such as an Id under which the real user can be addressed at other places in the system.
My use case is to convert specific ShadowUsers into RealUsers.
In my head the flow seems pretty simple:
get the shadow users: GET /api/ShadowUsers?someProperty=someValue
create new real users with the data fetched: POST /api/RealUsers
delete the shadow users: DELETE /api/ShadowUsers?someProperty=someValue
But what happens when there is a problem between the creation of the new users and the deletion of the shadow ones? The data would now be inconsistent.
The example is even easier when there is only one single user, but the issue stays the same: something could go wrong between steps 2 and 3, leaving the user existing as both shadow and real.
So the question is, how this can be done in a "transactional" manner where anything is good and persisted or something went wrong and nothing has been changed in the underlying data-store?
Are there any "best practices" or "design-patterns" which can be used?
Perhaps this is a job for the RESTful API GETting and POSTing those users in batch (I asked a question some weeks ago about a related issue: Updating RESTful resources against aggregate roots only).
On the API side, POSTed users wouldn't be handled directly; instead they would be enqueued in a reliable messaging queue (for example RabbitMQ). A background process would be subscribed to the queue and would process both the creation of the real users and the removal of the shadow users.
The point of using a reliable messaging system is that you can implement retry policies. If the operation is interrupted in the middle of finishing its work, you can retry it and detect which changes are still pending to complete the task.
In summary, using this approach you can implement that operation in a transactional way.
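A minimal sketch of that idea with RabbitMQ via pika (the queue name, payload shape and the two repository calls are made up):

# POST handler enqueues the conversion; a background consumer performs both steps
# idempotently, so a message redelivered after a crash does no harm.
import json
import pika

QUEUE = "shadow_user_promotions"

def create_real_user_if_absent(shadow: dict) -> None:
    """Hypothetical repository call, e.g. INSERT ... ON CONFLICT DO NOTHING keyed on the e-mail."""

def delete_shadow_user_if_present(shadow: dict) -> None:
    """Hypothetical repository call; deleting an already-deleted shadow user is a no-op."""

def enqueue_promotion(shadow_user: dict) -> None:
    """Called from the POST endpoint: persist the intent, then answer 202 Accepted."""
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_publish(
        exchange="",
        routing_key=QUEUE,
        body=json.dumps(shadow_user),
        properties=pika.BasicProperties(delivery_mode=2),   # persist the message to disk
    )
    conn.close()

def handle_promotion(ch, method, properties, body) -> None:
    shadow = json.loads(body)
    create_real_user_if_absent(shadow)
    delete_shadow_user_if_present(shadow)
    ch.basic_ack(delivery_tag=method.delivery_tag)   # ack only after both steps succeeded

def run_consumer() -> None:
    conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = conn.channel()
    channel.queue_declare(queue=QUEUE, durable=True)
    channel.basic_consume(queue=QUEUE, on_message_callback=handle_promotion)
    channel.start_consuming()

Because both steps are idempotent and the message is only acknowledged after they succeed, a retry after an interruption simply repeats harmless work, which is what makes the approach effectively transactional.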

Thoughts on high-concurrent entities and race conditions in JPA...Transactional?

I am working on a project, and have come to a sort of "design issue". I have an entity that represents a "Competition". Users are given votes, and they vote on the competition. Each competition has 10 items, and only 10 votes total can be cast on each. When a user tries to use their votes, I do a quick search for open competitions they haven't already voted on.
The problem comes when I think of multiple users all vying for the same competition. I don't care if two users are voting at once, but I want to prevent errors where a competition is recently closed, but there are still user(s) voting on it because they selected prior to closing...what I consider a "race condition" as the voting comes to a close.
The way I see it, there are only a few options:
Option 1: Turn on READ_COMMITTED transactions, but my understanding is that this locks the row on read, so no other queries will finish until the lock is released. Since this is a JSP app, does the read lock end when the JSP is finished rendering? Seems like I could still have the same problem.
Option 2: Write out how many users have checked out the competition. This seems to follow the Database-as-IPC anti-pattern, though, and I could see where monitoring and maintaining the counts would be tricky at best.
Option 3: Don't worry about it. If the user takes too long voting, then just throw an error and make them move on to the next one.
Option 4: Lean on AJAX heavily, maybe messaging using Atmosphere, to keep a live vote count on the competition page. Not sure how you handle browser timeouts or when the user simply leaves in the middle...perhaps some sort of cleanup timer?
Right now I am leaning on Option 4, as it seems to strike a nice balance between ease-of-implementation and ease-of-use from the user's perspective, but I want to be sure I am not missing any angles here.
How have others handled similar situations?
You might consider using optimistic or pessimistic locking. That is the normal way to resolve concurrency issues.
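For the optimistic flavour, the idea is a guarded update against a version column; JPA gives you this with a @Version field, but the underlying SQL looks roughly like this (sketched with psycopg2 and made-up table names):

# Optimistic locking sketch: the UPDATE only succeeds if nobody changed the row since we read it.
# JPA's @Version annotation generates the same kind of guarded UPDATE; names here are made up.
import psycopg2

class StaleVoteError(Exception):
    """Another transaction changed or closed the competition since we read it."""

def cast_vote(conn, competition_id: int, user_id: int) -> None:
    with conn.cursor() as cur:
        cur.execute(
            "SELECT version, votes_cast FROM competitions WHERE id = %s",
            (competition_id,),
        )
        version, votes_cast = cur.fetchone()
        if votes_cast >= 10:
            raise StaleVoteError("competition already closed")

        cur.execute(
            "UPDATE competitions "
            "   SET votes_cast = votes_cast + 1, version = version + 1 "
            " WHERE id = %s AND version = %s AND votes_cast < 10",
            (competition_id, version),
        )
        if cur.rowcount == 0:
            # Someone else voted (or closed it) in the meantime: ask the user to pick another.
            conn.rollback()
            raise StaleVoteError("competition changed concurrently")

        cur.execute(
            "INSERT INTO votes (competition_id, user_id) VALUES (%s, %s)",
            (competition_id, user_id),
        )
    conn.commit()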

Third party data delivery of lots of data

Does anyone know how sites that have a real-time feed of a lot of data work? I am referring to something like a stock site, where they can tell you in real time (well, mostly with a 20-minute delay, but still "real time minus 20 minutes" as I understand it).
They have thousands of data pieces delivered to them every second, I would imagine: MSFT 25.00 +.23 VOL 12000 and so on, for each stock that had a change during some interval.
So, is there just a constant feed of small pushes going on? Or do you think a site will pull from the place that has the real data and say "give me all changes since 12:23:45 CST to now" type query?
I ask this because at work we might have a situation where we need to have at our application's fingertips real time information like this, and it won't make sense to hit our third party provider over and over and over again every second...
Generally there is a server/client protocol defined between the two parties. In the company I work for, the connection is maintained at all times.
Here is some info on real-time data feeds to go with your stock example: NYSE, NASDAQ.
It is common for data providers to also have FTP sites with (delayed) batched data. One that comes to mind is the NWS EMWIN
Sites like Twitter feed data to certain approved sites in real-time via XMPP (Wiki link).
In the broadest terms, a push model is going to be the best way of achieving "real time" transfer, particularly if you're talking about a large amount of data.
However, with a purely push model you always have the problem of how to recover from missed data.
Depending on the nature of your data, that may not be a problem (think of video delivery as an analogue, where the amount of data is huge but there is sufficient redundancy to recover from missing data). And if you have any control over the data, you may be able to build some redundancy in. For example, on every change event you can provide absolute values rather than changes, or the previous value and the new value.
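For example, a change event that carries absolute values (field names invented here) lets a consumer that missed an update resynchronise from the next one:

# Illustrative self-contained change event: absolute values plus the previous value,
# and a sequence number so clients can detect gaps. Field names are made up.
import json

event = {
    "symbol": "MSFT",
    "last": 25.00,      # absolute price, not a delta
    "prev": 24.77,
    "change": 0.23,
    "volume": 12000,
    "seq": 184221,      # lets a consumer notice a missed message and request a snapshot
}
print(json.dumps(event))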
I've done this by attempting to retrieve the stock quote from the source and falling back to a timestamped on-disk cache of the quote when the main source fails or times out.
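Something along these lines, with a made-up endpoint URL and cache path:

# Try the live source first; on failure or timeout, serve the last cached quote with its timestamp.
import json
import time
import urllib.request
from pathlib import Path

CACHE = Path("quotes_cache.json")

def get_quote(symbol: str, timeout: float = 2.0) -> dict:
    url = f"https://example.com/quotes/{symbol}"   # placeholder third-party endpoint
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            quote = json.load(resp)
    except (OSError, ValueError):
        # Source failed or timed out: fall back to the on-disk cache, flagged as stale.
        cached = json.loads(CACHE.read_text())[symbol]
        quote = cached["quote"]
        quote["stale_since"] = cached["cached_at"]
        return quote
    # Refresh the cache on every successful fetch.
    cache = json.loads(CACHE.read_text()) if CACHE.exists() else {}
    cache[symbol] = {"quote": quote, "cached_at": time.time()}
    CACHE.write_text(json.dumps(cache))
    return quote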