How to ensure that parallel queries to ext. system are executed only once and then cached - scala

Server frameworks: Scala, Play 2.2, ReactiveMongo, Heroku
I think I have quite interesting brain teaser for you:
In my trip-planning application I want to display weather forecast on a map(similar to this). I'm using a paid REST service to query weather data. To speed up user experience and reduce costs I plan to cache weather data for each location for one hour.
There are a few not-so obvious things to consider:
It might require to query up to 100 location for weather to display one weather map
Weather must be queried in parallel because it would take too long to query it in serial fashion considering network latency
However launching 100 threads for each user request is not an option as well (imagine just 5 users looking at a map at one time)
The solution is to have let's say 50 workers that query weather for user requests
Multiple users might be viewing the same portion of map
There is a possible racing condition where one location is queried multiple times.
However it should be queried only once and then cached.
The application is running in clustered environment meaning there will be several play instances.
Coming from a Java EE background I can come up with a pretty good solution using the Java EE stack.
However I wonder how to do this using something more natural to Scala/Play stack: Akka. There is an example (google "heroku scala akka") for similar problem but it doesn't solve one issue: Racing condition when multiple users query the same data at once.
How would you implement this?
EDIT: I have decided that the requirement to ensure that weather data is updated only once is not necessary. The situation would happen far too infrequently to be a real problem and all proposed solutions would bring too much overhead and complexity to the system to be viable.
Thanks everyone for your time and effort. I hope answers to this question will help someone in the future with similar problem.

In Akka you can choose from multiple routing strategies. ConsistentHashingRoutingLogic could serve you well in this situation. Since actors are single-threaded you can easily maintain a cache in each actor. This routing logic will assure that two equal messages will always hit the same actor.
Each actor can work in the following way:
1. check local cache (for example apache commons LRUMap)
- if found, return
2. check global cache (distributed memcache or any other key-value store)
- if found, store the result in the local cache and return
3. query the REST service
4. store the result in the global and local caches
You can have a look at this question, which I based my answer on.

I decided that I'll post my JMS solution as well.
Controller that processes the request for weather does following:
Query the DB for weather data. If there are NO locations with out-of-date data reply immediately. Otherwise continue:
Start listening on a topic (explained later).
For each location: Check whether the weather for the location isn't being updated.
If not send a weather update request message to queue.
Certain amount of workers (50?) listen to that queue.
Worker first marks the location weather as being updated
Worker retrieves updated weather and updates the DB.
Worker sends a message to a topic with weather data for that location.
When controller receives (via topic) weather updates for all out-of-date locations, combine it with up-to-date locations and reply.

Related

How can reactive programming react to database changes?

I am new in the topic reactive programming and therefore have some questions.
I am developing a small software.
I would like to take the opportunity to get to know reactive programming better.
So I looked at Spring's project-reactor.
I also use R2DBC to reactively access the database.
I would like to know if there is any way that database responds to changes.
Or rather: If a user saves an entry in the database, then servers (for example, RestController) should be notified.
How could I go about doing that?
Enresponding controllers, configuration, entities, etc. I have already implemented.
Sorry for spelling mistakes.
Complement: The updates to the frontend are then made by Server Sent Events.
Basically, what Nick Tsitlakidis mentioned. Let me add a couple of things here.
The typical database query pattern is to query for a number of records. Databases respond with their results and indicate that the query is complete once a server has sent all records to your application. If new records arrive while the query is active or after the query is complete, you do not see these changes immediately because the of isolation and in case the query is complete, then you no longer have a reference to the query.
The feature you're asking is event-driven consumption of data. Databases call this feature continuous queries. Some stores (such as MongoDB with Tailable cursors or Postgres Logical Decoding) come with features that allow keeping a cursor/query open and your client is able to receive continuous updates.
Kafka and JMS also follow the idea of sending (messages) that are consumed typically by listeners or even through a reactive stream.
So it all boils down to the technology that you're using.
My understanding is that reactor can't solve this problem for you on its own. If you want your application to respond (react) on some database change, then you need to identify who's making this change and implement some kind of integration there.
Example, if you have Service1 updating the database, and Service2 needs to respond then Service1 can either call Service2, or, you can emit an event from Service1 and listen for the event from Service2.
The first approach is simpler and easier to implement but it has the disantvantage that is couples the two services. The second is trickier to implement but services are decoupled.
Reactor can help you in both cases :
For events, reactor can give you a way to listen to the events. For example using the reactor-rabbitmq module or the reactor-kafka.
For service-to-service calls, reactor can help you if you use Spring Webflux.
Perhaps you can tell us more about your case so we can provide a more specific solution?

design question: best way to aggregate data from several microservices and show in UI

we have a scenario where we need to aggregate data from several services and show in UI. The current scenario is when an agent logins in, we need to show cases assigned to that agent. Case information needs to be aggregated from several microservices. There would be around 1K cases assigned to agent at a time and all of the needs to be shown to agent so that he can perform sorting based on certain case data.
What be best approach to show data in this scenario? should we do API calls to several services for each case and aggregate and show ? Or there are better approaches to achieve this.
No. You'll certainly not call multiple APIs to aggregate data on runtime. Even if you call the apis parallely, it will be a huge latency.
You need to pre-aggregate the case details and cache them in a distributed caching system (e.g. Redis or memcached) using a streaming platform (e.g. Kafka). Also, store the pre-aggregated case details in a persistent database. Basically, it's a kind of materialized views.
Caching will enable you to serve the case details fast to the user without any noticeable latency. And streaming will help you to keep the cache and DB aggregations updated in a near-real time. Storing the materialized view in database will save you from storing everything in memory. You can use an LRU cache. Only the recently used data will be in cache. If you need to show any case data that is not in cache, you'd read it from database and store it in cache for future requests.
I recommend you read these two Martin Kleppmann articles here and here

Stuck with understanding how to build a scalable system

I need some guidance on how to properly build out a system that will be able to scale. I will give you some information about what I am trying to do and then ask my specific question.
I have a site where I want visitors to send some data to be processed. They input the data into a textarea or upload it in a file. Simple. The data is somewhat preprocessed on the client side before a POST request is made to a REST endpoint.
What I am stuck on is what is a good way to take this posted data store it and then associate an id with it that references the user since I cannot process the data fast enough for it to be returned to the user in a reasonable amount of time?
This question is a bit vague and open to opinion, I admit it. I just need a push in the right direction to keep moving. What I have been considering is throwing the data into a message queue and then having some workers process the data elsewhere and when the data is processed alert the user as to where to find it with some sort of link to an S3 bucket or just a URL to a file. The other idea was to just run the request for each item to be processed against another end-point that already processes individual records in some sort of loop client side. The problem is as follows with this idea:
To process the data it may take somewhere from 30 minutes to 2 hours depending upon the amount that they want processed. It's not ideal for them to just sit there and wait for that to finish depending on the amount of records they need processed, so I have ruled this out mostly.
Any guidance would be very much appreciated as I don't have any coworkers to bounce things off of, nor do I know many people with the domain knowledge that I could freely ask. If this isn't the right place to ask this, could you point me in the right direction as to where it should be asked?
Chris
If I've got you right, your pipeline is:
Accept item from user
Possibly preprocess/validate it (?)
Put into some queue
Process data
Return result.
You man use one or several queues on stage (3). Entity from user gets added to one of the queues. If it's big enough, it could be stored in S3 or storage alike, and only info about it put into the queue: link, add date, user id (or email of alike). Processors can pull items from queue and give feedback to users.
If you have no strict requirements on order, things get much simpler: you don't need any sync between them. Treat all the components: upload acceptors, queues, storages and processors as independent pools of processes. Monitor each pool separately. If there's some bottlenecks - add machines to that pool.

Converting resources in a RESTful manner

I'm currently stuck with designing my endpoints in a way so that they are conform with the REST principles but also ensure the integrity of the underlying data.
I have two resources, ShadowUser and RealUser whereas the first one only has a first name, last name and an e-mail.
The second user has much more properties such like an Id under which the real user can be addressed at other place in the system.
My use-case it to convert specific ShadowUsers into real users.
In my head the flow seems pretty simple:
get the shadow users /GET api/ShadowUsers?somePropery=someValue
create new real users with the data fetched /POST api/RealUsers
delete the shadow-users /DELETE api/ShadowUSers?somePropery=someValue
But what happens when there is a problem between the creation of new users and the deletion of the shadow ones? The data would now be inconsistent.
The example is even easier when there is only one single user, but the issue stays the same as there could be something between step 2 and 3 leaving the user existing as shadow and real.
So the question is, how this can be done in a "transactional" manner where anything is good and persisted or something went wrong and nothing has been changed in the underlying data-store?
Are there any "best practices" or "design-patterns" which can be used?
Perhaps the role of the RESTful API GETting and POSTing those real users in batch (I asked a question some weeks ago about a related issue: Updating RESTful resources against aggregate roots only).
In the API side, POSTed users wouldn't be handled directly but they would be enqueued in a reliable messaging queue (for example RabbitMQ). A background process would be subscribed to the whole queue and it would process both the creation and removal of real and shadow users respectively.
The point of using a reliable messaging system is that you can implement retry policies. If the operation is interrupted in the middle of finishing its work, you can retry it and detect which changes are still pending to complete the task.
In summary, using this approach you can implement that operation in a transactional way.

High Volume MongoDB with Twitter Streaming API, Ruby on Rails, Heroku setup

I'm looking to re-code an application to better handle spikes in tweets. I'm moving to Heroku and MongoDB (either MongoLab or MongoHQ) for the database solution.
During certain news events, tweet volume might spike to 15,000 / second. Typically with each tweet, I parse the tweet and store various pieces of data such as user data, etc. My idea is to store the raw tweets in a separate collection, and have a separate process grab raw tweets and parse them. The goal here is when there is a massive spike in tweets, my application isn't trying to parse all of these, but is essentially backlogging the raw tweets in another collection. As the volume slows, the process can take care of the backlog over time.
My question is three fold:
Can MongoDB handle this type of volume with regards to inserts into a collection at a rate of 15,000 tweets per second?
Any idea on the better setup: MongoHQ or MongoLab?
Any feedback on the overall setup?
Thanks!
The write volume that it will handle depends on lots of factors - hardware, indexes, size of each document, etc. Your best bet is to test it in the environment you're planning to use. If the demands of the write load exceed the capacity of a single mongo server, you can always use just multiple shards.
They are very similar, but there are some differences in pricing and the actual site design has a bunch of differences. There's a thread of discussion about it here: https://webmasters.stackexchange.com/questions/20782/mongodb-hosting-mongolab-vs-mongohq-vs-mongomachine
Overall it seems to make sense. Sounds like you will probably want to flesh out some details about how you will be processing the backlog. Will you be polling it by querying periodically, deleting tweets from the backlog as it processes them, etc.
Completely agree on the need to test this. In general, mongo can handle that many writes, but in practice it depends on the size of your set up, other operations, indexes, etc.
I had to do a similar approach for collecting tons of metrics data. I used a lightweight event-machine process to accept incoming requests in parallel, and store them in a simple format, then another process would take those requests and send them up to a central server. The main goal was to make sure no data was lost if the central server was down, but it also allowed me to put in some throttling logic so that the spikes in data wouldn't overwhelm the system.
I'd be interested to see how this works out for you price-wise, vs. a vps like linode. (I'm a huge Heroku fan, but with certain architectures it can get pricey quickly)