Optimal ways to rate limit a spark stream [closed] - scala

I have a Spark stream, sourced from blob storage, that transforms and enriches data in a batch process written in Scala 2.12. The enrichment step calls an external service to fetch additional information that is added before the data is dropped in a sink.
The web service is rate limited, so I'm looking for ways to control the request rate to the service. These are the approaches I have thought of so far:
Use Spark's partitioning to process the data in smaller chunks.
Load the Spark stream into an Akka stream and throttle the requests there (a sketch of what I have in mind is below). The disadvantage of this approach is that I'll end up loading a lot of data into memory, though I could mitigate that by splitting the input into smaller blob files that are processed one after another.
Look for an HTTP client library that takes care of throttling and retrying for me.
Implement some kind of circuit breaker that pauses when it encounters HTTP 429 and resumes later.
What's the best way to solve this?
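For reference, a minimal sketch of the Akka option, assuming the akka-stream artifact is on the classpath; `enrich`, the rate, and the parallelism below are placeholders for the real service call and its limits:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.Future
import scala.concurrent.duration._

object ThrottledEnrichSketch extends App {
  implicit val system: ActorSystem = ActorSystem("enrich")
  import system.dispatcher

  // Placeholder for the real rate-limited web service call.
  def enrich(record: String): Future[String] = Future(s"$record-enriched")

  // Stand-in for records collected from a Spark partition.
  val records = Source(List("a", "b", "c"))

  records
    .throttle(100, 1.second)           // at most 100 requests per second
    .mapAsync(parallelism = 4)(enrich) // bound the number of in-flight calls
    .runWith(Sink.foreach(println))
}
```

`throttle` backpressures upstream rather than buffering everything, which is what keeps memory bounded when the source is pulled incrementally.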

Related

Which Messaging System to be used? [closed]

I would like to transfer data from one database system to other database systems. Which messaging system (Kafka, ActiveMQ, RabbitMQ, or similar) would be better for achieving this with high throughput and performance?
I guess the answer to this type of question is "it depends".
You can probably find plenty of comparisons between these message brokers online.
From our experience and knowledge, Kafka and its ecosystem tools, such as Kafka Connect, provide the behavior you're asking for: source connectors and sink connectors, with Kafka in the middle.
Kafka Connect is a framework that allows adding plugins called connectors:
Sink connectors read from Kafka and send that data to a target system.
Source connectors read from a source store and write to Kafka.
Using Kafka Connect is "no code": you call a REST API to set the configuration of the connectors.
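For illustration, a minimal sketch of registering a connector through that REST API; the host, connector class, and connection details are placeholders to adapt to your environment:

```scala
import java.net.URI
import java.net.http.{HttpClient, HttpRequest, HttpResponse}

object RegisterConnector extends App {
  // Example config for a JDBC source connector; every value here
  // is a placeholder for your database and topics.
  val config =
    """{
      |  "name": "my-source-connector",
      |  "config": {
      |    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
      |    "connection.url": "jdbc:postgresql://db-host:5432/mydb",
      |    "mode": "incrementing",
      |    "incrementing.column.name": "id",
      |    "topic.prefix": "mydb-"
      |  }
      |}""".stripMargin

  // POST the config to the Kafka Connect REST API.
  val request = HttpRequest.newBuilder(URI.create("http://connect-host:8083/connectors"))
    .header("Content-Type", "application/json")
    .POST(HttpRequest.BodyPublishers.ofString(config))
    .build()

  val response = HttpClient.newHttpClient()
    .send(request, HttpResponse.BodyHandlers.ofString())
  println(s"${response.statusCode}: ${response.body}")
}
```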
Kafka is a distributed system that supports very high throughput with low latency, and it supports near-real-time streaming of data.
Kafka is widely adopted by some of the biggest companies in the world.
There are many tools and vendors that support your use case; they vary in price and support. It depends on which sources you need to read from, which targets you want to write to, and whether it should be a CDC/near-real-time or "batch" copy.

Logging microservice requests and responses to MongoDB asynchronously using AOP or Solace? [closed]

I have multiple Java REST microservice APIs and I would like to log their requests and responses to MongoDB. Do I create a separate logging API with an asynchronous service method and call it from all other microservice controller classes using AOP? Or Do I use event brokers like Solace/Kafka where the microservices publish the logs to a topic and a separate service picks and stores in MongoDB?
Which is the better way? I can afford to lose some logs before they reach MongoDB, but I cannot afford to affect the performance of my microservices.
There are definitely advantages to using an event broker to handle log data, since it can serve as a buffer during times when the logging API is unavailable or slow. Note that AOP could also be used with an event broker; the aspect would just publish to an event endpoint rather than an HTTPS endpoint.
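For illustration, a minimal sketch of that pattern, assuming Kafka as the broker; the topic name and connection details are placeholders. `send()` buffers and returns immediately, so the request thread never waits on MongoDB:

```scala
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

// Fire-and-forget publishing of request/response logs. A separate
// consumer service reads the topic and writes to MongoDB, so the
// microservice's request path is never blocked by the database.
object LogPublisher {
  private val props = new Properties()
  props.put("bootstrap.servers", "broker-host:9092")
  props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  props.put("acks", "0") // don't wait for acknowledgement; losing some logs is acceptable

  private val producer = new KafkaProducer[String, String](props)

  def log(serviceName: String, payload: String): Unit =
    producer.send(new ProducerRecord("microservice-logs", serviceName, payload))
}
```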
A couple of other related points:
Have you considered persistence layers other than MongoDB? OpenTelemetry backends are built to address exactly the sort of use case you have, and they provide some very useful tooling for auditing and troubleshooting microservices.
Rather than using REST, how about connecting the microservices themselves through an event broker? It could provide some very nice performance benefits and make your microservices more agile.
Best,
Jesse

IoT Streaming Architecture [closed]

I just started learning about IoT and data streaming. Apologies if this question seems too obvious or generic.
I am working on a school project that involves streaming data from hundreds (maybe thousands) of IoT sensors, storing that data in a database, then retrieving it for display on a web-based UI.
Things to note are:
fault-tolerance and the ability to accept incomplete data entries
the database has to have the ability to load and query data by stream
I've looked around on Google for some ideas on how to build an architecture that can support these requirements. Here's what I have in mind:
Sensor data is collected by FluentD and converted into a stream
Apache Spark manages a cluster of MongoDB servers
a. the MongoDB servers are connected to the same storage
b. Spark will handle fault-tolerance and load balancing between MongoDB servers
BigQuery will be used for handling queries from UI/web application.
That's my current idea of an IoT streaming architecture.
The question now is whether this architecture is feasible, or whether it would work at all. I'm open to any ideas and suggestions.
Thanks in advance!
Note that you could stream your device data directly into BigQuery and avoid an intermediate buffering step.
See:
https://cloud.google.com/bigquery/streaming-data-into-bigquery
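For illustration, a minimal sketch of such a streaming insert using the Java BigQuery client from Scala; the dataset, table, and field names are placeholders:

```scala
import com.google.cloud.bigquery.{BigQueryOptions, InsertAllRequest, TableId}

object SensorStreamInsert extends App {
  // Uses application-default credentials for the current project.
  val bigquery = BigQueryOptions.getDefaultInstance.getService

  // One sensor reading; field names must match the table schema.
  val row = new java.util.HashMap[String, Any]()
  row.put("sensor_id", "sensor-42")
  row.put("temperature", 21.5)
  row.put("ts", java.time.Instant.now.toString)

  val response = bigquery.insertAll(
    InsertAllRequest.newBuilder(TableId.of("iot_dataset", "sensor_readings"))
      .addRow(row)
      .build()
  )
  if (response.hasErrors) println(s"Insert errors: ${response.getInsertErrors}")
}
```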

Scala batch processing triggered by size or time [closed]

I'd like to batch up some events for processing, and have the triggering of the processing for a batch be based on the number of events reaching a certain threshold OR a time interval expiring (whichever happens first). What should I consider? Futures? Akka? Some more special-purpose library that might exist?
Two options come to mind:
Using Akka
Using Quartz
Which to choose depends on your specific architecture, but you can use any form of scheduling. You can use the Akka scheduler to run a task regularly and keep your own internal queue, triggering the batch job early if the queue fills up. You can do very similar things with Quartz, though you might have to write more boilerplate code, with the benefit of having more flexibility.
If you don't wish to bring in a fairly heavyweight library, I suppose you could implement something yourself, but you would be reinventing the wheel.
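If Akka is already in play, here is a minimal sketch with Akka Streams' `groupedWithin`, which emits a batch when either the size threshold or the time window is reached, whichever happens first; the source and thresholds are placeholders:

```scala
import akka.actor.ActorSystem
import akka.stream.scaladsl.{Sink, Source}
import scala.concurrent.duration._

object BatchBySizeOrTime extends App {
  implicit val system: ActorSystem = ActorSystem("batching")

  // Stand-in for the real event source.
  Source(1 to 10000)
    // Emit a batch at 100 events OR after 1 second, whichever comes first.
    .groupedWithin(100, 1.second)
    .runWith(Sink.foreach(batch => println(s"processing batch of ${batch.size} events")))
}
```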

Akka message throttling [closed]

I am trying to implement the following scenario with Akka but am hitting heap limitations (out-of-memory errors):
A user uploads a text file (approx. 25 MB) containing around 1,000,000 lines.
After the file is uploaded, HTTP 200 OK is sent back to the client and file processing starts in the background.
Each line should be processed: saved to the database, then an external web service call made to look up the contents of the line, with a database update if the lookup returns results.
Please suggest the approach/pattern.
Many thanks in advance!
There are several ways to achieve this, for example:
1) Use a bounded mailbox for some of your actors; code that sends messages to such an actor will block if the target mailbox is full.
2) Use a work-pulling model, in which some of your actors "ask" for more work when idle.
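For illustration, a minimal work-pulling sketch with classic Akka actors; the file name, worker count, and per-line processing are placeholders. Because each worker pulls one line at a time, only a bounded number of lines is in flight regardless of file size:

```scala
import akka.actor.{Actor, ActorRef, ActorSystem, Props}

object WorkPulling extends App {
  case object GimmeWork
  final case class Work(line: String)
  case object Done

  // Hands out one line per request; the iterator is only ever
  // touched inside this actor, so access is single-threaded.
  class Master(lines: Iterator[String]) extends Actor {
    def receive: Receive = {
      case GimmeWork =>
        if (lines.hasNext) sender() ! Work(lines.next())
        else sender() ! Done
    }
  }

  class Worker(master: ActorRef) extends Actor {
    override def preStart(): Unit = master ! GimmeWork
    def receive: Receive = {
      case Work(line) =>
        // Save to the database and call the external web service here.
        println(s"processing: $line")
        master ! GimmeWork // pull the next line only when done
      case Done =>
        context.stop(self)
    }
  }

  val system = ActorSystem("upload-processing")
  val lines  = scala.io.Source.fromFile("upload.txt").getLines()
  val master = system.actorOf(Props(new Master(lines)), "master")
  (1 to 4).foreach(i => system.actorOf(Props(new Worker(master)), s"worker-$i"))
}
```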