Throttle concurrent HTTP requests from Spark executors - scala

I want to make HTTP requests from inside a Spark job to a rate-limited API. To keep track of the number of concurrent requests in a non-distributed system (in Scala), the following work:
a throttling actor which maintains a semaphore (counter) that increments when a request starts and decrements when it completes. Although Akka is distributed, there are issues with (de)serializing the ActorSystem in a distributed Spark context.
using parallel streams with fs2 (https://fs2.io/concurrency-primitives.html), which likewise cannot be distributed.
I suppose I could also just collect the dataframes to the Spark driver and handle throttling there with one of the above options, but I would like to keep this distributed.
How are such things typically handled?

You shouldn't try to synchronise requests across Spark executors/partitions; that works against Spark's concurrency model. Instead, divide the global rate limit R by executors * cores and use mapPartitions to send requests from each partition within its own R/(e*c) rate limit.
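A minimal sketch of that approach, assuming a DataFrame df, known executor/core counts, and a hypothetical callApi function standing in for the actual HTTP client:

// Pace each partition so the aggregate request rate stays below the global limit R.
val globalRatePerSec = 100.0                              // R: the API's global limit (assumed)
val numExecutors     = 4                                  // e
val coresPerExecutor = 4                                  // c
val partitions       = numExecutors * coresPerExecutor
val minIntervalMs    = (1000.0 * partitions / globalRatePerSec).toLong  // per-request spacing per partition

val results = df.rdd
  .repartition(partitions)                                // one partition per concurrently running task
  .mapPartitions { rows =>
    rows.map { row =>
      val start    = System.currentTimeMillis()
      val response = callApi(row)                         // your HTTP request (hypothetical helper)
      val elapsed  = System.currentTimeMillis() - start
      if (elapsed < minIntervalMs) Thread.sleep(minIntervalMs - elapsed)
      response
    }
  }

Sleeping is a naive pacing mechanism; a per-partition token bucket would smooth bursts, but dividing the global limit is the essential idea.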

Related

(Scala, Akka) Effect of setting the dispatcher throughput=1 on Akka Flow buffer

I have a pipeline with an Akka flow that uses a custom dispatcher.
Akka Flow code:
// Pulls in data via network requests
source
  // .async introduces an asynchronous boundary: downstream is sent to a
  // different actor, with a "buffer" of 16 elements between those actors.
  .async
  .via(writeToDatabase())
  .toMat(Sink.ignore)(Keep.right)
  .withAttributes(ActorAttributes.dispatcher("custom-dispatcher")) // must match the config name below
  .run()
Dispatcher configuration:
custom-dispatcher {
  type = Dispatcher
  executor = "thread-pool-executor"
  thread-pool-executor {
    fixed-pool-size = 16
  }
  throughput = 1
}
My understanding is that .async will insert a buffer of 16 elements by default:
"When an asynchronous boundary is introduced, the Akka Streams API inserts a buffer between every asynchronous processing stage."
(Source: https://blog.colinbreck.com/maximizing-throughput-for-akka-streams/)
Question: Will setting the throughput=1 cause the buffer to always have only 1 document?
No.
An Akka dispatcher is responsible for (among other things) scheduling an actor on a thread pool. The throughput setting is the maximum number of messages an actor will process from its mailbox before another actor can be scheduled on that thread. Lower throughput values mean more context switches and thus lower performance; the benefit in exchange is that an actor which takes a long time to process messages is less able to hog a thread.
The buffer introduced at an async boundary in Akka Streams is different and basically part of the backpressure protocol. You can set the default size of these buffers with the akka.stream.materializer.max-input-buffer-size configuration option, or programmatically per async boundary by adding the Attributes.inputBuffer attribute to the boundary.
Akka Streams can be affected by dispatcher throughput, though the effect would be most pronounced when there are more stream actors than dispatcher threads. Since the internal messages that signal demand, as well as those that move stream elements from actor to actor, would both be affected, the overall effect is rather unpredictable.
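For the per-boundary override mentioned above, a minimal runnable sketch (Akka 2.6; the names here are illustrative):

import akka.actor.ActorSystem
import akka.stream.Attributes
import akka.stream.scaladsl.{Sink, Source}

object InputBufferExample extends App {
  implicit val system: ActorSystem = ActorSystem("demo")

  Source(1 to 100)
    .map(_ * 2)
    .async // asynchronous boundary, with the default 16-element buffer...
    // ...unless overridden for this boundary only:
    .addAttributes(Attributes.inputBuffer(initial = 1, max = 1))
    .runWith(Sink.foreach(println))
}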

Is `BoundedSourceQueue` from `Source.queue` ok with concurrent producers?

Source.queue recently added an overload which specializes to OverflowStrategy.dropNew and avoids the async mechanism. The result of materializing this is a BoundedSourceQueue[T] (compared to SourceQueueWithComplete[T] in the older version). The docs for the SourceQueueWithComplete variants of Source.queue make it clear that the materialized queue may only be used by a bounded number of concurrent producers:
The materialized SourceQueue may be used by up to maxConcurrentOffers concurrent producers.
The docs for the BoundedSourceQueue don't say anything about this. Is this constraint lifted for BoundedSourceQueue? Can it be used by any number of concurrent producers?
Technically, the SourceQueueWithComplete variant doesn't have the maxConcurrentOffers restriction if OverflowStrategy.dropNew is in effect.
However, because the result of offering an element to the SourceQueueWithComplete is communicated asynchronously, a producer that produces faster than it handles those futures may overwhelm memory. Asynchrony removes backpressure unless some other mechanism reintroduces it, after all.
Because with dropNew it's possible to know immediately that an element was dropped, the result of offering can be communicated synchronously (i.e. the producer is blocked until it handles or throws away the result). This allows arbitrarily many producers without OOM risk. For this reason the BoundedSourceQueue version is recommended whenever the dropNew strategy is used (i.e. only use SourceQueueWithComplete if some other strategy is needed), and the recommendation becomes stronger as load increases.
So yes: for the BoundedSourceQueue variant, the only practical limit on the number of concurrent producers is the number of running threads.
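A sketch of that synchronous behaviour (assuming Akka 2.6.10+, where this overload exists; sizes and thread counts are arbitrary):

import akka.actor.ActorSystem
import akka.stream.QueueOfferResult
import akka.stream.scaladsl.{Sink, Source}

object BoundedQueueExample extends App {
  implicit val system: ActorSystem = ActorSystem("demo")

  // This overload materializes a BoundedSourceQueue: offer() returns a
  // QueueOfferResult synchronously, so any number of threads may call it.
  val queue = Source
    .queue[Int](1000)
    .to(Sink.foreach(println))
    .run()

  // Several producer threads, each learning immediately whether its element was accepted.
  for (_ <- 1 to 4) new Thread(() => {
    for (i <- 1 to 10000) queue.offer(i) match {
      case QueueOfferResult.Enqueued => () // accepted
      case QueueOfferResult.Dropped  => () // buffer full: dropped, as per dropNew
      case other                     => println(s"queue closed or failed: $other")
    }
  }).start()
}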

Storm+Kafka not parallelizing as expected

We are having an issue with the parallelism of tasks inside a single topology: we cannot manage to get a good, steady processing rate.
We are using Kafka and Storm to build a system with different topologies, where data is processed following a chain of topologies connected using Kafka topics.
We are using Kafka 1.0.0 and Storm 1.2.1.
The load is small in number of messages, about 2000 per day, however each task can take quite some time. One topology in particular can take a variable amount of time to process each task, usually between 1 and 20 minutes. If processed sequentially, the throughput is not enough to process all incoming messages. All topologies and Kafka system are installed in a single machine (16 cores, 16 GB of RAM).
As messages are independent and can be processed in parallel, we are trying to use Storm concurrent capabilities to improve the throughput.
For that the topology has been configured as follows:
4 workers
parallelism hint set to 10
Message size when reading from Kafka is large enough to read about 8 tasks in each message.
Kafka topics use replication-factor = 1 and partitions = 10.
With this configuration, we observe the following behavior in this topology.
About 7-8 tasks are read in a single batch from Kafka by the Storm topology (task size is not fixed), with a max message size of 128 kB.
About 4-5 tasks are computed concurrently. Work is more-or-less evenly distributed among workers: some workers take 1 task, others take 2 and process them concurrently.
As tasks are being finished, the remaining tasks start.
A starvation problem is reached when only 1-2 tasks remain to be processed. Other workers wait idle until all tasks are finished.
When all tasks are finished, the message is confirmed and sent to the next topology.
A new batch is read from Kafka and the process starts again.
We have two main issues. First, even with 4 workers and a parallelism hint of 10, only 4-5 tasks are started. Second, no more batches are started while there is work pending, even if it is just 1 task.
It is not a problem of not having enough work to do, as we tried inserting 2000 tasks at the beginning, so there is plenty of work to do.
We have tried increasing the maxSpoutPending parameter, expecting that the topology would read more batches and queue them at the same time, but it seems they are pipelined internally rather than processed concurrently.
Topology is created using the following code:
private static StormTopology buildTopologyOD() {
    // BrokerHosts is a marker interface; ZkHosts is its ZooKeeper implementation.
    BrokerHosts hosts = new ZkHosts(configuration.getProperty(ZKHOSTS));
    TridentKafkaConfig tridentConfigCorrelation = new TridentKafkaConfig(hosts, configuration.getProperty(TOPIC_FROM_CORRELATOR_NAME));
    tridentConfigCorrelation.scheme = new RawMultiScheme();
    tridentConfigCorrelation.fetchSizeBytes = Integer.parseInt(configuration.getProperty(MAX_SIZE_BYTES_CORRELATED_STREAM));

    OpaqueTridentKafkaSpout spoutCorrelator = new OpaqueTridentKafkaSpout(tridentConfigCorrelation);

    TridentTopology topology = new TridentTopology();
    Stream existingObject = topology.newStream("kafka_spout_od1", spoutCorrelator)
            .shuffle()
            .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
            .parallelismHint(Integer.parseInt(configuration.getProperty(PARALLELISM_HINT)));

    // Create a state factory to produce outputs to Kafka topics.
    TridentKafkaStateFactory stateFactory = new TridentKafkaStateFactory()
            .withProducerProperties(kafkaProperties)
            .withKafkaTopicSelector(new ODTopicSelector())
            .withTridentTupleToKafkaMapper(new ODTupleToKafkaMapper());
    existingObject.partitionPersist(stateFactory, new Fields(RESULT_FIELD, OBJECT_FIELD), new TridentKafkaUpdater(), new Fields(OBJECT_FIELD));

    return topology.build();
}
and config created as:
private static Config createConfig(boolean local) {
    Config conf = new Config();
    conf.setMaxSpoutPending(1); // Also tried 2..6
    conf.setNumWorkers(4);
    return conf;
}
Is there anything we can do to improve the performance, either by increasing the number of parallel tasks or/and avoiding starvation while finishing to process a batch?
I found an old post on storm-users by Nathan Marz regarding setting parallelism for Trident:

I recommend using the "name" function to name portions of your stream so that the UI shows you what bolts correspond to what sections. Trident packs operations into as few bolts as possible. In addition, it never repartitions your stream unless you've done an operation that explicitly involves a repartitioning (e.g. shuffle, groupBy, partitionBy, global aggregation, etc). This property of Trident ensures that you can control the ordering/semi-ordering of how things are processed. So in this case, everything before the groupBy has to have the same parallelism or else Trident would have to repartition the stream. And since you didn't say you wanted the stream repartitioned, it can't do that. You can get a different parallelism for the spout vs. the each's following by introducing a repartitioning operation, like so:

stream.parallelismHint(1).shuffle().each(…).each(…).parallelismHint(3).groupBy(…);
I think you might want to set parallelismHint for your spout as well as your .each.
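Applied to the topology in the question, that would look roughly like this (a sketch in Scala-style syntax against the Trident API; the parallelism values are illustrative):

// Give the spout section its own parallelism, then use shuffle() as an explicit
// repartitioning boundary so the processing section can run at a higher parallelism.
topology.newStream("kafka_spout_od1", spoutCorrelator)
  .parallelismHint(4)   // parallelism of the spout section
  .shuffle()            // repartition: allows a different parallelism downstream
  .each(new Fields("bytes"), new ProcessTask(), new Fields(RESULT_FIELD, OBJECT_FIELD))
  .parallelismHint(10)  // parallelism of the processing section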
Regarding processing multiple batches concurrently, you are right that that is what maxSpoutPending is for in Trident. Try checking in the Storm UI that your max spout pending value is actually picked up. Also try enabling debug logging for the MasterBatchCoordinator; from that logging you should be able to tell whether multiple batches are in flight at the same time or not.
When you say that multiple batches are not processed concurrently, do you mean by ProcessTask? Keep in mind that one of the properties of Trident is that state updates are ordered with regard to batches. If you have e.g. maxSpoutPending=3 and batches 1, 2 and 3 in flight, Trident won't emit more batches for processing until batch 1 is written, at which point it will emit one more. So slow batches can block emitting more: even if batches 2 and 3 are fully processed, they have to wait for batch 1 to finish and be written.
If you don't need the batching and ordering behavior of Trident, you could try regular Storm instead.
More of a side note, but you might want to consider migrating off storm-kafka to storm-kafka-client. It's not important for this question, but you won't be able to upgrade to Kafka 2.x without doing it, and it's easier before you get a bunch of state to migrate.

Need tips on how to achieve a thread-safe queue in fs2 (Scala)

I need to implement a microservice that loads a ton of data into memory at startup and makes that data available via HTTP GET.
I have been looking at fs2 as an option to make the data available to the web layer via an fs2.Queue.
My concern is that if I use the synchronous queue from fs2, the performance of serving the data might suffer because of the blocking nature of the synchronous queue (on the enqueue operation).
Is this a valid concern?
Also, which Queue abstractions in fs2 are thread-safe? I.e., can I pass any Queue around to multiple threads, and can they all safely take items out of the queue without more than one of them taking the same element?
EDIT:
Use case: 10 million records served by the Stream -> many workers (threads) picking work from the Stream via an HTTP GET endpoint
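For reference, a minimal sketch of the bounded Queue the question refers to (assuming fs2 2.x with cats-effect IO; sizes and counts are arbitrary), with several concurrent consumers each receiving distinct elements:

import cats.effect.{ExitCode, IO, IOApp}
import cats.syntax.all._
import fs2.Stream
import fs2.concurrent.Queue

object QueueSketch extends IOApp {
  def run(args: List[String]): IO[ExitCode] =
    for {
      // Bounded (not synchronous): enqueue1 only waits semantically when the
      // buffer is full, and never blocks an OS thread.
      q <- Queue.bounded[IO, Int](1024)
      producer = Stream.range(0, 10000).covary[IO].through(q.enqueue).compile.drain
      // dequeue1 hands each element to exactly one consumer: safe under concurrency.
      consumer = q.dequeue1.flatMap(n => IO(println(n))).replicateA(2500).void
      _ <- (producer, consumer, consumer, consumer, consumer).parTupled
    } yield ExitCode.Success
}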

Do I need to take care of producer-consumer rate-matching when using Akka 1.3's actors?

When using Akka 1.3, do I need to worry about what happens when the actors producing messages produce them faster than the actors consuming them can process them?
Without any mechanism, in a long running process, the queue sizes would grow to consume all available memory.
The doc says the default dispatcher is the ExecutorBasedEventDrivenDispatcher.
This dispatcher has five queue configurations:
Bounded LinkedBlockingQueue
Unbounded LinkedBlockingQueue
Bounded ArrayBlockingQueue
Unbounded ArrayBlockingQueue
SynchronousQueue
and four overload policies:
CallerRuns
Abort
Discard
DiscardOldest
Is this the right mechanism to be looking at? If so, what are this dispatchers' default settings?
The dispatcher has a task queue, but that is unrelated to your problem; in fact, you want as many mailboxes as possible to be enqueued there. What you might be looking for is a bounded mailbox: http://doc.akka.io/docs/akka/1.3.1/scala/dispatchers.html#Making_the_Actor_mailbox_bounded
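The exact keys for Akka 1.3 are best taken from that page; for comparison, in later Akka 2.x releases a bounded mailbox is declared with a configuration block like this (illustrative values):

bounded-mailbox {
  mailbox-type = "akka.dispatch.BoundedMailbox"
  mailbox-capacity = 1000
  mailbox-push-timeout-time = 10s
}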