Thoughts on high-concurrent entities and race conditions in JPA...Transactional? - jpa

I am working on a project, and have come to a sort of "design issue". I have an entity that represents a "Competition". Users are given votes, and they vote on the competition. Each competition has 10 items, and only 10 votes total can be cast on each. When a user tries to use their votes, I do a quick search for open competitions they haven't already voted on.
The problem comes when I think of multiple users all vying for the same competition. I don't care if two users are voting at once, but I want to prevent errors where a competition is recently closed, but there are still user(s) voting on it because they selected prior to closing...what I consider a "race condition" as the voting comes to a close.
The way I see it, there are only a few options:
Option 1: Turn on READ_COMMITTED transactions, but my understanding is that this locks the row on read, so no other queries will finish until the lock is returned. Since this is a JSP app, does the read lock end when the JSP is finished rendering? Seems like I could still have the same problem.
Option 2: Write out how many users have checked out the competition. This seems to follow the Database-as-IPC anti-pattern, though, and I could see where monitoring and maintaining the counts would be tricky at best.
Option 3: Don't worry about it. If the user takes too long voting, then just throw an error and make them move to the next one.
Option 4: Lean on AJAX heavily, maybe messaging using Atmosphere, to keep a live vote count on the competition page. Not sure how you handle browser timeouts or when the user simply leaves in the middle...perhaps some sort of cleanup timer?
Right now I am leaning toward Option 4, as it seems to strike a nice balance between ease of implementation and ease of use from the user's perspective, but I want to be sure I am not missing any angles here.
How have others handled similar situations?

You might consider using optimistic or pessimistic locking. That is the normal way to resolve concurrency issues.
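As a rough illustration of the optimistic variant, here is a minimal sketch using Python's built-in sqlite3 module. The table layout, the `version` column, and the names `create_schema`, `read_competition`, `cast_vote`, and `StaleCompetitionError` are assumptions made for this example, not anything from the question. In JPA the analogous mechanism is a field annotated with `@Version`, where a stale write surfaces as an `OptimisticLockException`.

```python
import sqlite3

MAX_VOTES = 10  # from the question: only 10 votes can be cast per competition


class StaleCompetitionError(Exception):
    """Hypothetical error: the competition changed or closed since it was read."""


def create_schema(conn):
    # Assumed schema: a vote counter plus a version number bumped on every write.
    conn.execute(
        "CREATE TABLE competition ("
        "id INTEGER PRIMARY KEY, votes INTEGER NOT NULL, version INTEGER NOT NULL)"
    )
    conn.execute("INSERT INTO competition (id, votes, version) VALUES (1, 0, 0)")
    conn.commit()


def read_competition(conn, comp_id):
    # Returns (votes, version) as seen when the voting page was rendered.
    return conn.execute(
        "SELECT votes, version FROM competition WHERE id = ?", (comp_id,)
    ).fetchone()


def cast_vote(conn, comp_id, seen_version):
    # The UPDATE matches only if nobody changed the row since it was read
    # and the competition still has room for another vote.
    cur = conn.execute(
        "UPDATE competition SET votes = votes + 1, version = version + 1 "
        "WHERE id = ? AND version = ? AND votes < ?",
        (comp_id, seen_version, MAX_VOTES),
    )
    conn.commit()
    if cur.rowcount == 0:
        raise StaleCompetitionError("competition changed or closed; reload and retry")
```

No lock is held while the user is looking at the page. A voter who loses the race gets an error at submit time and can be sent to the next open competition, which is essentially Option 3 with the check enforced by the database.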

Related

postgresql concurrent transactions

I'm working on a project that uses a background thread to insert data into the database from a different source. One problem I occasionally run into: once the user interacts with the database, a "could not serialize access due to concurrent update" error is triggered.
Because my data source generates a lot of data my background thread is always busy keeping up and inserting the data and I've written the algorithm to basically "retry" whenever a row is locked, so I can live with the occasional error that occurs on the background thread.
The real problem is that my users basically interact with the database via a website that has ORM mapping in various layers and it seems that occasionally the user might be modifying the same record that is being worked on by the background thread.
Is there any recommendation for what I could do so that my users are not confronted with the serialization errors?
Presumably your users have been presented with some data from the database, and based on what they saw have decided to change something. If the data they looked at is obsolete, why would you expect that decision based on obsolete information to still stand? In general, they need to look at the more-current data and decide if they still want to make the change.
If those decisions are "easy" to make, then why do you have humans making them in the first place? And if they are hard to make, what choice is there but to have the people re-make them? There may be particular answers to this, but I don't see any general answers, and you haven't given us any particular details to work with.
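For the background thread, the "retry" approach mentioned in the question can be sketched roughly as follows. `SerializationFailure` is a hypothetical stand-in for whatever exception your driver raises for SQLSTATE 40001 (PostgreSQL's serialization_failure), and `run_with_retry` is an illustrative name. For user-facing edits, a blind retry like this is exactly what the answer above argues against, since the user's decision was based on stale data.

```python
import random
import time


class SerializationFailure(Exception):
    """Hypothetical stand-in for the driver error raised on SQLSTATE 40001."""


def run_with_retry(transaction, max_attempts=5, base_delay=0.01):
    # Re-run the whole transaction from the start: after a serialization
    # failure, the rows it read earlier can no longer be trusted.
    for attempt in range(1, max_attempts + 1):
        try:
            return transaction()
        except SerializationFailure:
            if attempt == max_attempts:
                raise
            # Exponential backoff with jitter so competing writers spread out.
            time.sleep(base_delay * (2 ** (attempt - 1)) * random.random())
```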

Instant Messaging Schema design advice

I'm trying to build an Instant Messaging functionality in my app as part a bigger project.
Chats can have more than 2 participants (group chats)
If participant A deletes a message, it should still be visible to participant B (that's why I used the Message Participants table)
Same applies to Conversation.
By same logic, if all participants delete the conversation/message, it should be erased from DB.
Questions :
I'm afraid that this schema is too cumbersome, meaning that the queries will be too slow once the app reaches a certain traffic mark (1k active users? I'm guessing)
Message Participants will have multiple records for each message, one for each participant in the chat. Instant Messaging means those writes will happen with very tight timings. Wouldn't that be a problem?
Should I add a layer of Redis DB, to manage a chat's active session's messaging? it will store the recent messages, and actively sync the PostgreSQL db with those messages (perhaps with Async transactions functionality that postgresql has?)
UPDATED schema :
I would also gladly hear ideas for having a "read" status functionality. I'm assuming it's much more complex with Group chats, so at least offering that for 1:1 chats would be nice.
I am a little confused by your diagram. Shouldn't the Conversation Participants be linked to the Conversations instead of the Message? The FKs look all right, just the lines appear wrong.
I wouldn't be worried about performance yet. The Premature Optimization Anti-Pattern warns us not to give up a clean design for performance reasons until we know whether we are going to have a performance problem. You anticipate 1000 users - that's not much for a modern information system. Even if they are all active at the same time and enter a message every 10 seconds, this will just mean 100 transactions per second, which is nothing to be afraid of. Of course, I don't know the platform on which you are going to run this. But it should be an easy task to set up those tables and write a simple test program that inserts those records as fast as possible.
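A simple test program of the kind suggested here might look like the sketch below. It uses an in-memory SQLite database purely as a stand-in so that it runs anywhere; the `message` table and the function name `measure_insert_rate` are assumptions for illustration, and the numbers only become meaningful once you point it at your actual PostgreSQL setup.

```python
import sqlite3
import time


def measure_insert_rate(n_messages=10_000):
    # In-memory SQLite is only a stand-in; swap in the real database
    # connection to get numbers that mean something for your platform.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE message ("
        "id INTEGER PRIMARY KEY, conversation_id INTEGER, body TEXT)"
    )
    start = time.perf_counter()
    for i in range(n_messages):
        conn.execute(
            "INSERT INTO message (conversation_id, body) VALUES (?, ?)",
            (i % 100, "hello"),
        )
        conn.commit()  # one transaction per message, as in live chat
    elapsed = time.perf_counter() - start
    count = conn.execute("SELECT COUNT(*) FROM message").fetchone()[0]
    conn.close()
    return count, count / elapsed


# Load from the estimate above: 1000 users, one message every 10 seconds each.
required_tps = 1000 / 10
```

Comparing the measured rate against `required_tps` tells you whether there is a problem to solve at all.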
Your second question makes me wonder how "instant" you expect your message passing to be. Shall all viewers of a message receive each keystroke of the text within a millisecond? Or do they just need to see each message appear right after it was posted? Anyway, the limiting factor for user responsiveness will probably be the network, not the database.
Maybe this is not mainly a database design issue. Let's assume you will have a tremendous rate of postings and viewings. But not all conversations will be busy all the time. If the need arises, but not earlier, it might be necessary to hold the currently busy conversations in memory and use the database just as a backup for when they are no longer busy.
Concerning your additional comments:
100k users: This is less a topic for this forum than a question of business development for a startup. Many founders of startup companies imagine huge masses of users being attracted to their site, while in reality most startups just fail or only reach very few. So beware of investments (in money, but also in design and implementation effort) that will only pay off in the highly improbable case that your company becomes the next WhatsApp.
In case you don't really anticipate such masses of users but just want to imagine this as a programming exercise, you still have a difficult task. You won't have the platform to simulate the traffic, so there is no way to make measurements on where you actually have a performance problem to solve. That's one of the reasons for the Premature Optimization warning: Unless you know positively where you have a bottleneck, you - and all of us - will be just guessing and probably make the wrong decisions.
Marking a message as read is easy: Introduce a boolean attribute read at Message Participants, and set it to true as soon as, well, the user has read the message. It's up to your business requirements in which cases and to whom you show this.
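A minimal sketch of that flag, again with sqlite3 so it is runnable as-is. The table and column names (`message_participant`, `is_read`) and the helper functions are assumptions for illustration; the question's actual schema was only shown as a diagram. For group chats, a message counts as "read by all" here once no participant row is still unread.

```python
import sqlite3


def create_table(conn):
    # One row per (message, participant), as in the Message Participants table.
    conn.execute(
        "CREATE TABLE message_participant ("
        "message_id INTEGER NOT NULL, user_id INTEGER NOT NULL, "
        "is_read INTEGER NOT NULL DEFAULT 0, "
        "PRIMARY KEY (message_id, user_id))"
    )


def deliver(conn, message_id, user_ids):
    # Every recipient starts with the message unread.
    conn.executemany(
        "INSERT INTO message_participant (message_id, user_id) VALUES (?, ?)",
        [(message_id, uid) for uid in user_ids],
    )
    conn.commit()


def mark_read(conn, message_id, user_id):
    conn.execute(
        "UPDATE message_participant SET is_read = 1 "
        "WHERE message_id = ? AND user_id = ?",
        (message_id, user_id),
    )
    conn.commit()


def read_by_all(conn, message_id):
    # True once no participant row for this message is still unread.
    unread = conn.execute(
        "SELECT COUNT(*) FROM message_participant "
        "WHERE message_id = ? AND is_read = 0",
        (message_id,),
    ).fetchone()[0]
    return unread == 0
```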

Multiple Siddhi apps or one big one

We are building an application on top of Siddhi (using the Java library) that allows users to dynamically add rules and have all incoming information going forward be run against those rules. My question is whether it's better to have one large app with many queries, streams, windows, and partitions, or to break up each query into its own application.
We have been including everything in one single Siddhi app (SiddhiAppRuntime), but this is starting to become large and I fear things may start interacting with each other in unintended ways. We are also snapshotting the SiddhiAppRuntime and restoring state whenever our application gets restarted. This could likely lead to massive restores if we have hundreds of pattern queries to re-run.
I am considering making a separate SiddhiAppRuntime from a single SiddhiManager for each query. The benefits (as I see them) would be to reduce the risk of unintentional interactions, to make each query able to function on its own, and to simplify restoring after a shutdown, since each runtime would only need to restore a single query. A potential downside could be the increased overhead of having potentially hundreds of SiddhiAppRuntimes.
What is considered best practice for our scenario? What will offer better performance, both for running data through the rules and for restoring the rules in the case we have to restart.
(If this is too broad or any clarification is needed I will do my best to update this question accordingly)
From the lengthy description that you have given, I assume the rules that users add do not interact with each other, meaning rules added by user1 will not interact with rules added by user2.
In such a case it is recommended to use a different Siddhi app (SiddhiAppRuntime) for each user. This won't add much performance overhead, as the apps won't be interacting with each other. It will also improve the snapshotting process, since a separate snapshot is taken for each app.
Also this will make sure you will have clear separation between each collection of rules and will be easily manageable.

Stuck with understanding how to build a scalable system

I need some guidance on how to properly build out a system that will be able to scale. I will give you some information about what I am trying to do and then ask my specific question.
I have a site where I want visitors to send some data to be processed. They input the data into a textarea or upload it in a file. Simple. The data is somewhat preprocessed on the client side before a POST request is made to a REST endpoint.
What I am stuck on is finding a good way to take this posted data, store it, and associate it with an id that references the user, since I cannot process the data fast enough to return it to the user in a reasonable amount of time.
This question is a bit vague and open to opinion, I admit it. I just need a push in the right direction to keep moving. What I have been considering is throwing the data into a message queue and then having some workers process the data elsewhere and when the data is processed alert the user as to where to find it with some sort of link to an S3 bucket or just a URL to a file. The other idea was to just run the request for each item to be processed against another end-point that already processes individual records in some sort of loop client side. The problem is as follows with this idea:
To process the data it may take somewhere from 30 minutes to 2 hours depending upon the amount that they want processed. It's not ideal for them to just sit there and wait for that to finish depending on the amount of records they need processed, so I have ruled this out mostly.
Any guidance would be very much appreciated as I don't have any coworkers to bounce things off of, nor do I know many people with the domain knowledge that I could freely ask. If this isn't the right place to ask this, could you point me in the right direction as to where it should be asked?
Chris
If I've got you right, your pipeline is:
1. Accept item from user
2. Possibly preprocess/validate it (?)
3. Put into some queue
4. Process data
5. Return result.
You may use one or several queues at stage (3). Each entity from a user gets added to one of the queues. If it's big enough, it could be stored in S3 or similar storage, with only the info about it put into the queue: link, add date, user id (or email or the like). Processors can pull items from the queue and give feedback to users.
If you have no strict requirements on order, things get much simpler: you don't need any sync between them. Treat all the components (upload acceptors, queues, storages and processors) as independent pools of processes. Monitor each pool separately. If there are bottlenecks, add machines to that pool.
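The queue-and-processors part of this pipeline can be sketched in a single process with Python's standard `queue` and `threading` modules. This is only an illustration of the shape: in production the queue would be an external broker and the workers separate machines. The names `process`, `worker`, and `run_pool` and the fields `job_id`, `user_id`, and `payload` are assumptions for this example.

```python
import queue
import threading


def process(item):
    # Placeholder for the real, slow processing step.
    return item["payload"].upper()


def worker(jobs, results):
    # Each worker pulls items independently; no ordering is guaranteed.
    while True:
        item = jobs.get()
        if item is None:  # sentinel: shut this worker down
            jobs.task_done()
            break
        # Keep the user id with the result so the user can be notified later.
        results[item["job_id"]] = {
            "user_id": item["user_id"],
            "result": process(item),
        }
        jobs.task_done()


def run_pool(items, n_workers=4):
    jobs = queue.Queue()
    results = {}  # each job writes its own key, so a plain dict suffices here
    threads = [
        threading.Thread(target=worker, args=(jobs, results))
        for _ in range(n_workers)
    ]
    for t in threads:
        t.start()
    for item in items:
        jobs.put(item)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker
    jobs.join()
    for t in threads:
        t.join()
    return results
```

Scaling the pool is then a matter of changing `n_workers`, which mirrors adding machines to whichever pool is the bottleneck.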

Converting resources in a RESTful manner

I'm currently stuck on designing my endpoints so that they conform to REST principles but also ensure the integrity of the underlying data.
I have two resources, ShadowUser and RealUser whereas the first one only has a first name, last name and an e-mail.
The second has many more properties, such as an Id under which the real user can be addressed elsewhere in the system.
My use case is to convert specific ShadowUsers into real users.
In my head the flow seems pretty simple:
get the shadow users /GET api/ShadowUsers?somePropery=someValue
create new real users with the data fetched /POST api/RealUsers
delete the shadow users /DELETE api/ShadowUsers?somePropery=someValue
But what happens when there is a problem between the creation of new users and the deletion of the shadow ones? The data would now be inconsistent.
The example is even easier when there is only one single user, but the issue stays the same as there could be something between step 2 and 3 leaving the user existing as shadow and real.
So the question is: how can this be done in a "transactional" manner, where either everything succeeds and is persisted, or something went wrong and nothing has been changed in the underlying data store?
Are there any "best practices" or "design-patterns" which can be used?
Perhaps the role of the RESTful API should be limited to GETting and POSTing those real users in batch (I asked a question some weeks ago about a related issue: Updating RESTful resources against aggregate roots only).
On the API side, POSTed users wouldn't be handled directly; instead they would be enqueued in a reliable messaging queue (for example RabbitMQ). A background process would be subscribed to the queue and would handle both the creation of the real users and the removal of the shadow users.
The point of using a reliable messaging system is that you can implement retry policies. If the operation is interrupted in the middle of finishing its work, you can retry it and detect which changes are still pending to complete the task.
In summary, using this approach you can implement that operation in a transactional way.
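A minimal sketch of what that background process might do for each message, assuming both kinds of user live in the same database so that a local transaction can cover both steps. The table names, columns, and the function `convert_shadow_user` are hypothetical. The handler is written to be idempotent, so a retried or redelivered message completes the work instead of duplicating it.

```python
import sqlite3


def create_tables(conn):
    # Assumed schema: shadow users are keyed by e-mail, real users get an id.
    conn.execute(
        "CREATE TABLE shadow_user ("
        "email TEXT PRIMARY KEY, first_name TEXT, last_name TEXT)"
    )
    conn.execute(
        "CREATE TABLE real_user ("
        "id INTEGER PRIMARY KEY, email TEXT UNIQUE, "
        "first_name TEXT, last_name TEXT)"
    )
    conn.commit()


def convert_shadow_user(conn, email):
    # Safe to call repeatedly for the same message: it checks what is still
    # pending, so a retry after a crash never creates a duplicate real user.
    with conn:  # commits on success, rolls back if an exception escapes
        row = conn.execute(
            "SELECT first_name, last_name FROM shadow_user WHERE email = ?",
            (email,),
        ).fetchone()
        if row is None:
            return False  # nothing pending: already converted or never existed
        conn.execute(
            "INSERT OR IGNORE INTO real_user (email, first_name, last_name) "
            "VALUES (?, ?, ?)",
            (email, row[0], row[1]),
        )
        conn.execute("DELETE FROM shadow_user WHERE email = ?", (email,))
    return True
```

If the two resources live in different stores, the single transaction is no longer available, and the same "check what is pending, then finish it" logic applied on each retry is what keeps the data eventually consistent.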