Stuck on understanding how to build a scalable system - REST

I need some guidance on how to properly build out a system that will be able to scale. I will give you some information about what I am trying to do and then ask my specific question.
I have a site where I want visitors to send some data to be processed. They input the data into a textarea or upload it in a file. Simple. The data is somewhat preprocessed on the client side before a POST request is made to a REST endpoint.
What I am stuck on is this: what is a good way to take this posted data, store it, and associate an ID with it that references the user, given that I cannot process the data fast enough to return it to the user in a reasonable amount of time?
This question is a bit vague and open to opinion, I admit it. I just need a push in the right direction to keep moving. What I have been considering is throwing the data into a message queue, having some workers process the data elsewhere, and, when the data is processed, alerting the user as to where to find it with some sort of link to an S3 bucket or just a URL to a file. The other idea was to run each item to be processed against another endpoint that already processes individual records, in some sort of loop on the client side. The problem with this idea is as follows:
Processing the data may take anywhere from 30 minutes to 2 hours, depending on the amount they want processed. It's not ideal for users to just sit there and wait for that to finish, so I have mostly ruled this option out.
Any guidance would be very much appreciated as I don't have any coworkers to bounce things off of, nor do I know many people with the domain knowledge that I could freely ask. If this isn't the right place to ask this, could you point me in the right direction as to where it should be asked?
Chris

If I've got you right, your pipeline is:
1. Accept item from user
2. Possibly preprocess/validate it (?)
3. Put into some queue
4. Process data
5. Return result.
You may use one or several queues at stage (3). The item from the user gets added to one of the queues. If it is big enough, it can be stored in S3 or similar storage, with only information about it put into the queue: a link, the upload date, a user ID (or email or the like). Processors then pull items from the queue and give feedback to the users.
If you have no strict requirements on ordering, things get much simpler: you don't need any synchronization between them. Treat all the components (upload acceptors, queues, storage and processors) as independent pools of processes. Monitor each pool separately. If a pool becomes a bottleneck, add machines to it.
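A minimal sketch of what the intake endpoint could look like, assuming Python with Flask and boto3, and an S3 bucket plus SQS queue you have already provisioned (the bucket name, queue URL and header are invented for illustration):

    import json
    import uuid

    import boto3
    from flask import Flask, jsonify, request

    app = Flask(__name__)
    s3 = boto3.client("s3")
    sqs = boto3.client("sqs")

    BUCKET = "my-upload-bucket"                                          # hypothetical
    QUEUE_URL = "https://sqs.us-east-1.amazonaws.com/123456789012/jobs"  # hypothetical

    @app.route("/jobs", methods=["POST"])
    def submit_job():
        payload = request.get_data()        # raw text or file content from the client
        job_id = str(uuid.uuid4())          # the id handed back to the user

        # Store the (possibly large) payload in S3 and only enqueue metadata about it.
        s3.put_object(Bucket=BUCKET, Key=f"uploads/{job_id}", Body=payload)
        sqs.send_message(
            QueueUrl=QUEUE_URL,
            MessageBody=json.dumps({"job_id": job_id,
                                    "user_id": request.headers.get("X-User-Id")}),
        )

        # Respond immediately; workers pick the job up and publish the result later.
        return jsonify({"job_id": job_id, "status": "queued"}), 202

A separate pool of workers would poll the queue, fetch the payload from S3 by job_id, process it, and store the result somewhere the job_id can later be resolved to a download link or a notification to the user.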

Related

Microservices - Storing user data in separate database

I am building a microservice-based application that has two separate services: a user service and a comments service. The user service stores user details like email, first/last name, job title, etc., and the comments service stores all comments made by the user.
In the UI, I need to populate the comments (via a REST API) and show the first/last name, email, and job title of the user.
Is it recommended that we store all these user details in the comments database?
If yes, then every time a user changes their details (first/last name or job title), I will have to update those details in all their comments (I don't think this is a good idea).
If no, then if I store just the userid in the comments DB, how am I supposed to get the user details for each comment? Let's say we want to show 20 comments per page in the UI.
First, challenge the architecture. Let's assume that both services in the question are part of a larger ecosystem of microservices that all make use of the user information; otherwise the separation will most certainly be overengineered. But from the word "comments" we can at least guess that there is at least one other class of objects, namely the things being commented on. So let's assume a "user service" is a meaningful crumb to break out into a microservice, because enough other crumbs carry the weight needed to justify the microservice breakup.
In that case I suggest the following strategy:
Second, implement an abstraction layer in your comments service right away so that most of the code does not have to care about where the user data comes from (i.e. don't join or $lookup). This is also a great opportunity for local testing, because you can just create a collection with the data you need and run service-level integration tests against it.
Third, for integration with the user service, get the data from there via its API (which should support bulk data selection in any case) every time you need it. Because you have the abstraction layer, you can add caching, cache timeout and displacement strategies, and whatever else you may need below this abstraction, without the main portion of the code having to care. Add these on an as-needed basis. Keep it simple.
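As a rough illustration of steps two and three, assuming a Python comments service and a hypothetical bulk endpoint GET /users?ids=1,2,3 on the user service (the interface and field names are invented):

    import time
    import requests

    class UserProvider:
        """The rest of the comments service only talks to this interface."""
        def get_users(self, user_ids):
            raise NotImplementedError

    class ApiUserProvider(UserProvider):
        def __init__(self, base_url, cache_ttl=60):
            self.base_url = base_url
            self.cache_ttl = cache_ttl
            self._cache = {}            # user_id -> (fetched_at, user_dict)

        def get_users(self, user_ids):
            now = time.time()
            missing = [uid for uid in user_ids
                       if uid not in self._cache
                       or now - self._cache[uid][0] > self.cache_ttl]
            if missing:
                # One bulk call instead of one call per comment.
                resp = requests.get(f"{self.base_url}/users",
                                    params={"ids": ",".join(map(str, missing))})
                resp.raise_for_status()
                for user in resp.json():
                    self._cache[user["id"]] = (now, user)
            return {uid: self._cache[uid][1] for uid in user_ids if uid in self._cache}

When rendering a page of 20 comments, collect the distinct user IDs and call get_users() once; switching later to a locally replicated copy or a different cache only means adding another UserProvider implementation behind the same interface.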
Fourth, when things really go heavyweight and you have to deal with tens of thousands of users, tons of comments and many requests per second, the comments service could, still below the abstraction, implement an upfront replication pattern to get the full user database locally. This is usually done based on an asynchronous message sent by the user service to all subscribers when something changes in the user base. When it suits the subscribers (i.e. the comments service), they can trigger full or (from time to time) delta replication of the changes. Suitable collections will already be in place from what you did for caching. And the comments service will probably need considerably less information than is stored in the user service (let alone the hashed password, other login options or accounting information).
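What the subscriber side of that replication might look like, independent of the broker you choose (the event shape and field names are invented):

    # Hypothetical handler for a "user changed" event published by the user service;
    # how the event is delivered (RabbitMQ, Kafka, ...) depends on your broker.
    local_users = {}   # stands in for the comments service's local user collection

    def on_user_changed(event):
        # Keep only the fields the comments service actually displays.
        local_users[event["id"]] = {
            "id": event["id"],
            "first_name": event["first_name"],
            "last_name": event["last_name"],
            "job_title": event.get("job_title"),
        }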
Fifth, should you still hit performance challenges, you can break the abstraction for the few cases you need to and do the join or $lookup.
Follow the steps in order, and stop as soon as the overall assembly works fine. Every step adds considerable complexity, and when you don't need it, don't implement it.

Instant Messaging Schema design advice

I'm trying to build instant messaging functionality in my app as part of a bigger project.
Chats can have more than 2 participants (group chats)
If participant A deletes a message, it should still be visible to participant B (that's why I used the Message Participants table)
The same applies to Conversations.
By the same logic, if all participants delete the conversation/message, it should be erased from the DB.
Questions:
I'm afraid that this schema is too cumbersome, meaning that the queries will be too slow once the app reaches a certain traffic mark (1k active users? I'm guessing)
Message Participants will have multiple records for each message - one for each participant in the chat. Instant messaging means those writes will have to happen with very tight timing. Wouldn't that be a problem?
Should I add a Redis layer to manage a chat's active-session messaging? It would store the recent messages and actively sync the PostgreSQL DB with those messages (perhaps with the asynchronous transaction functionality that PostgreSQL has?)
UPDATED schema:
I would also gladly hear ideas for having a "read" status functionality. I'm assuming it's much more complex with Group chats, so at least offering that for 1:1 chats would be nice.
I am a little confused by your diagram. Shouldn't the Conversation Participants be linked to the Conversations instead of the Message? The FKs look all right, just the lines appear wrong.
I wouldn't be worried about performance yet. The Premature Optimization Anti-Pattern warns us not to give up a clean design for performance reasons until we know whether we are going to have a performance problem. You anticipate 1000 users - that's not much for a modern information system. Even if they are all active at the same time and enter a message every 10 seconds, this will just mean 100 transactions per second, which is nothing to be afraid of. Of course, I don't know the platform on which you are going to run this. But it should be an easy task to set up those tables and write a simple test program that inserts those records as fast as possible.
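A throwaway test program along these lines is usually enough to put a number on it; this sketch assumes psycopg2 and a message_participants table whose column names are invented:

    import time
    import psycopg2

    conn = psycopg2.connect("dbname=chat user=chat")   # hypothetical connection string
    cur = conn.cursor()

    # 1000 messages x 5 participants per message = 5000 rows
    rows = [(m, p, False) for m in range(1000) for p in range(5)]

    start = time.perf_counter()
    cur.executemany(
        "INSERT INTO message_participants (message_id, participant_id, is_deleted)"
        " VALUES (%s, %s, %s)",
        rows,
    )
    conn.commit()
    elapsed = time.perf_counter() - start
    print(f"{len(rows)} rows in {elapsed:.2f}s ({len(rows) / elapsed:.0f} rows/s)")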
Your second question makes me wonder how "instant" you expect your message passing to be. Shall all viewers of a message receive each keystroke of the text within a millisecond? Or do they just need to see each message appear right after it was posted? Anyway, the limiting factor for user responsiveness will probably be the network, not the database.
Maybe this is not mainly a database design issue. Let's assume you will have a tremendous rate of postings and viewings. But not all conversations will be busy all the time. If the need arises - but not earlier - it might be necessary to hold the currently busy conversations in memory and use the database just as a backup for the times when they aren't busy any more.
Concerning your additional comments:
100k users: This is a topic not for this forum, but for the business development of a startup. Many founders of startup companies imagine huge masses of users being attracted to their site, while in reality most startups just fail or only reach very few. So beware of investments (in money, but also in design and implementation effort) that will only pay off in the highly improbable case that your company becomes the next WhatsApp.
In case you don't really anticipate such masses of users but just want to imagine this as a programming exercise, you still have a difficult task. You won't have the platform to simulate the traffic, so there is no way to make measurements on where you actually have a performance problem to solve. That's one of the reasons for the Premature Optimization warning: Unless you know positively where you have a bottleneck, you - and all of us - will be just guessing and probably make the wrong decisions.
Marking a message as read is easy: introduce a boolean attribute read on Message Participants, and set it to true as soon as, well, the user has read the message. It's up to your business requirements in which cases and to whom you show this.
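In SQL terms (table and column names are assumptions based on the schema described above), that is a single update per participant, for example via psycopg2:

    def mark_read(cur, message_id, participant_id):
        # Flip the read flag for one participant of one message.
        cur.execute(
            "UPDATE message_participants SET read = TRUE"
            " WHERE message_id = %s AND participant_id = %s",
            (message_id, participant_id),
        )

For 1:1 chats, showing the "read" status is then just checking the other participant's flag; for group chats you would have to decide how to aggregate the flags (e.g. "read by all" vs. "read by some").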

Converting resources in a RESTful manner

I'm currently stuck with designing my endpoints in a way that conforms to REST principles but also ensures the integrity of the underlying data.
I have two resources, ShadowUser and RealUser, where the first one only has a first name, last name and an e-mail.
The second one has many more properties, such as an Id under which the real user can be addressed at other places in the system.
My use case is to convert specific ShadowUsers into real users.
In my head the flow seems pretty simple:
get the shadow users: GET /api/ShadowUsers?someProperty=someValue
create new real users with the data fetched: POST /api/RealUsers
delete the shadow users: DELETE /api/ShadowUsers?someProperty=someValue
But what happens when there is a problem between the creation of new users and the deletion of the shadow ones? The data would now be inconsistent.
The example is even simpler when there is only one single user, but the issue stays the same: something could go wrong between steps 2 and 3, leaving the user existing as both shadow and real.
So the question is: how can this be done in a "transactional" manner, where either everything succeeds and is persisted, or something goes wrong and nothing is changed in the underlying data store?
Are there any "best practices" or "design-patterns" which can be used?
Perhaps the role of the RESTful API is just GETting and POSTing those real users in batch (I asked a question some weeks ago about a related issue: Updating RESTful resources against aggregate roots only).
On the API side, POSTed users wouldn't be handled directly; they would be enqueued in a reliable message queue (for example RabbitMQ). A background process subscribed to the queue would handle both the creation of the real users and the removal of the shadow ones.
The point of using a reliable messaging system is that you can implement retry policies. If the operation is interrupted in the middle of finishing its work, you can retry it and detect which changes are still pending to complete the task.
In summary, using this approach you can implement that operation in a transactional way.
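A sketch of that shape with pika/RabbitMQ (the queue name, payload format and the two helper functions are invented; the important point is that both steps are idempotent, so a message retried after a crash is harmless). The producer and consumer would normally live in separate processes:

    import json
    import pika

    connection = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
    channel = connection.channel()
    channel.queue_declare(queue="user_conversions", durable=True)

    def enqueue_conversion(shadow_user):
        """Called by the POST handler instead of converting the user directly."""
        channel.basic_publish(
            exchange="",
            routing_key="user_conversions",
            body=json.dumps(shadow_user),
            properties=pika.BasicProperties(delivery_mode=2),   # persist the message
        )

    def create_real_user_if_absent(shadow):
        ...   # hypothetical: create the RealUser, skip if it already exists (idempotent)

    def delete_shadow_user_if_present(email):
        ...   # hypothetical: remove the ShadowUser, ignore if already gone (idempotent)

    def on_conversion(ch, method, properties, body):
        shadow = json.loads(body)
        create_real_user_if_absent(shadow)
        delete_shadow_user_if_present(shadow["email"])
        ch.basic_ack(delivery_tag=method.delivery_tag)   # ack only after both steps succeed

    channel.basic_consume(queue="user_conversions", on_message_callback=on_conversion)
    channel.start_consuming()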

Mobile: one single request or multiple smaller requests?

On an iPhone app (or mobile in general) that constantly needs to send requests to a web service, is it better to work with one single request that fetches a large amount of data, or multiple (possibly simultaneous) requests, one for each element, each fetching a smaller amount of data?
Example
I want to load a list of elements in a node. I have the node's ID. The 2 ways I can fetch the elements are the following:
send a single request with the node ID and get all the information about the n first elements in the node in a single response;
send a first request with the node ID to get the IDs of the n first elements in the node, then for each one send another request to have one response per element.
I'm torn between the two:
the heavyweight single response may cause more lag and timeouts because of the very unstable and slow mobile internet connection;
the phone may have trouble handling too many responses at the same time.
What's your opinion?
Since there is overhead for every request, one large request is generally faster than several small ones adding up to the same size. This applies to high-speed networks too, but on mobile networks the ratio between transfer speed and latency is even bigger.
I don't think the phone will have any problem handling the responses, so the multiple requests approach seems better for large requests/answers. However, depending on the size of your requests/responses, it may actually be faster to do it in a single request, in order to reduce the delay associated with multiple requests. The single request approach will also need to transfer slightly less data than the multiple request one.
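For illustration, the two strategies from the question look roughly like this (shown in Python with requests rather than iOS code, against an invented JSON API); the per-element version pays the round-trip latency n + 1 times:

    import requests

    BASE = "https://api.example.com"   # hypothetical service

    def fetch_in_one_request(node_id, n):
        # One round trip: the server returns the first n elements fully populated.
        resp = requests.get(f"{BASE}/nodes/{node_id}/elements", params={"limit": n})
        resp.raise_for_status()
        return resp.json()

    def fetch_one_by_one(node_id, n):
        # n + 1 round trips: one for the IDs, then one request per element.
        ids = requests.get(f"{BASE}/nodes/{node_id}/element-ids",
                           params={"limit": n}).json()
        return [requests.get(f"{BASE}/elements/{eid}").json() for eid in ids]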
Every call will have its overhead (i.e. network load), and the number of connections might also be limited.
You might or might not be able to update your user interface during download, depending on how often your callbacks are called - you may be able to process partial data as it arrives.
If your data is easy to compress (typically text data), then using a single call might reduce your total network usage footprint even more.
If the chunks of data are large, I'd go with several individual requests. This will also make things easier in case of network errors. The bottom line for me is to get the right balance: make the packets reasonably sized and don't flood the server.
This depends on the situation. If you don't want to bother your user with waiting all the time throughout the app, then you can use a single request to load all the data at once.
If you don't mind letting the user wait, then you can use multiple requests on demand. For example, if you just want to show titles in a table view and details when the user taps on a title, you can first fetch the titles only, and then when the user taps you can fetch the details for that title by ID. That would be a pretty good way to request data on demand only.
Sometimes the situation merits single requests for, say, a certain category. Say you have a Twitter app and the tweets are separated into categories. Someone who has the app but only cares about sports may only look at the sports section, which could be a single AJAX call. Another user may only be interested in two out of 15 categories. This means the user doesn't have to load unnecessary data. The important thing you need to determine is this:
Does all of the data need to be loaded at once for the app to work correctly, and are your users generally going to want all that data in the first place?

Third party data delivery of lots of data

Does anyone know how sites that have a real-time feed of a lot of data work? I am referring to something like a stock site, where they can tell you in real time (well, mostly with a 20-minute delay, but still real time minus 20 minutes, as I understand it).
They have thousands of data pieces delivered to them every second, I would imagine: MSFT 25.00 +.23 VOL 12000 ???? for each stock that had a change during some interval.
So, is there just a constant feed of small pushes going on? Or do you think a site will pull from the place that has the real data with a "give me all changes from 12:23:45 CST to now" type query?
I ask this because at work we might have a situation where we need to have real-time information like this at our application's fingertips, and it won't make sense to hit our third-party provider over and over again every second...
Generally there is a server/client protocol defined between the two parties. In the company I work for, the connection is maintained at all times.
Here is info on real-time data feeds to go with your stock example:
NYSE, NASDAQ
It is common for data providers to also have FTP sites with (delayed) batched data. One that comes to mind is the NWS EMWIN.
Sites like Twitter feed data to certain approved sites in real-time via XMPP (Wiki link).
In the broadest terms, a push model is going to be the best way of achieving "real time" transfer, particularly if you're talking about a large amount of data.
However you do always have a problem when using a purely push model of how to recover from missed data.
Depending on the nature of your data that may not be a problem (thinking of video delivery as an analogue, where the amount of data is huge but there is sufficient redundancy for it to recover from missing data). And if you have any control over the data you may be able to build some redundancy in. For example, on every change event you can provide absolute values rather than changes, or previous value and new value.
I've done this by attempting to retrieve the stock quote from the source and falling back to a timestamped on-disk cache of the quote when the main source fails or times out.
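A sketch of that fallback (the quote URL and cache location are invented for illustration):

    import json
    import os
    import time

    import requests

    def get_quote(symbol):
        cache_path = f"/tmp/quote_{symbol}.json"    # hypothetical cache location
        try:
            resp = requests.get(f"https://quotes.example.com/{symbol}", timeout=2)
            resp.raise_for_status()
            quote = resp.json()
            # Refresh the on-disk cache with a timestamp for later staleness checks.
            with open(cache_path, "w") as f:
                json.dump({"fetched_at": time.time(), "quote": quote}, f)
            return quote
        except (requests.RequestException, ValueError):
            # Main source failed or timed out: serve the cached quote, if any,
            # and let the caller see how stale it is.
            if os.path.exists(cache_path):
                with open(cache_path) as f:
                    cached = json.load(f)
                print(f"serving quote cached at {time.ctime(cached['fetched_at'])}")
                return cached["quote"]
            raise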