Problem Statement: Consume a Million Records from Kafka & Spin Up Parallel API Calls (120 TPS)
I'm using Project Reactor Kafka for Kafka message consumption (2 million records per hour). Once I receive the Kafka messages, I need to spin up parallel API calls (10 TPS) to "abc.com/actuator". I have tested the Kafka part and can consume a million records in 20 minutes (with 4 Kubernetes pods). But when I spin up the API calls, everything runs sequentially rather than in parallel. Also, the API takes 1000 ms to return a response (which adds waiting time). Can someone help me understand what's wrong with the parallel API calls? Thanks in advance.
ReceiverOptions<Integer, String> options =
    receiverOptions
        .subscription(Collections.singleton(topic))
        .addAssignListener(partitions -> log.debug("onPartitionsAssigned {}", partitions))
        .addRevokeListener(partitions -> log.debug("onPartitionsRevoked {}", partitions));
final Flux<ReceiverRecord<Integer, String>> messages = Flux.defer(() -> {
    final Flux<ReceiverRecord<Integer, String>> receiver =
        KafkaReceiver.create(options).receive();
    return Flux.<ReceiverRecord<Integer, String>>create(emitter -> {
        receiver.doOnNext(record -> {
            ReceiverOffset offset = record.receiverOffset();
            offset.acknowledge();
            emitter.next(record);
        }).blockLast();
    });
});
WebClient wc = WebClient.create("abc.com:8443");
Flux.from(messages)
    .flatMap(event -> wc.get().uri("/actuator").retrieve().bodyToMono(String.class))
    .parallel(10)
    .runOn(Schedulers.parallel())
    .subscribe();
Kubernetes configuration:
CPU : 300m
Memory : 10Gi
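For comparison, here is a minimal sketch of driving the same calls concurrently through flatMap's concurrency argument instead of applying parallel()/runOn() afterwards; wc, messages and log are the names used above, and the 10-in-flight limit is an assumption matching the 10 TPS target:
// Sketch only: flatMap subscribes to up to 10 inner Monos at a time, so up to
// 10 HTTP calls can be in flight concurrently; parallel()/runOn() is not needed for I/O-bound work.
messages
    .flatMap(event -> wc.get()
            .uri("/actuator")
            .retrieve()
            .bodyToMono(String.class), 10) // concurrency argument: at most 10 in-flight requests
    .subscribe(body -> log.debug("response: {}", body));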
Related
I need to poll Kafka and process events in bulk. With Reactor Kafka, since it is a streaming API, I get the events as a stream. Is there a way to combine them and get a fixed maximum number of events per batch?
This is what I am doing currently.
final Flux<Flux<ConsumerRecord<String, String>>> receive = KafkaReceiver.create(eventReceiverOptions)
.receiveAutoAck();
receive
.concatMap(r -> r)
.doOnEach(listSignal -> log.info("got one message"))
.map(consumerRecords -> consumerRecords.value())
.collectList()
.flatMap(strings -> {
log.info("Read messages of size {}", strings.size());
return processBulkMessage(strings)
.doOnSuccess(aBoolean -> log.info("Processed records"))
.thenReturn(strings);
}).subscribe();
But the code just hangs after collectList and never reaches the last flatMap.
Thanks in advance.
Your plain .concatMap(r -> r) just flattens the stream, so you completely discard the batching that receiveAutoAck() originally built. To get a stream of lists for your processBulkMessage() to work on, move all of the batch logic into that concatMap():
.concatMap(batch -> batch
.doOnEach(listSignal -> log.info("got one message"))
.map(ConsumerRecord::value)
.collectList())
.flatMap(strings -> {
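The rest of the chain stays as in the question. Putting it together, a minimal sketch of the full pipeline could look like the following; eventReceiverOptions and processBulkMessage are the names from the question and are assumed to exist:
// Sketch only: each inner Flux from receiveAutoAck() is one consumer poll, so every
// collected list is bounded by max.poll.records instead of the whole (infinite) stream.
KafkaReceiver.create(eventReceiverOptions)
        .receiveAutoAck()                              // Flux<Flux<ConsumerRecord<String, String>>>
        .concatMap(batch -> batch
                .doOnEach(signal -> log.info("got one message"))
                .map(ConsumerRecord::value)
                .collectList())                        // one List<String> per poll
        .flatMap(strings -> {
            log.info("Read messages of size {}", strings.size());
            return processBulkMessage(strings)
                    .doOnSuccess(ok -> log.info("Processed records"))
                    .thenReturn(strings);
        })
        .subscribe();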
I built a kafka-streams application with 20 stream threads a month ago. The app calculates how much different people spend in a fixed time interval. Recently I found that the money spent, as queried from the local state store, is less than the real amount. I have read the official documentation and any other documents I could find, but have not found a solution.
I use Kafka version 0.11.0.3: the Kafka server is 0.11.0.3 and the Kafka Streams API is also 0.11.0.3. There is only one application, with 20 stream threads.
Some important info:
Kafka streams config:
replication.factor 3
num.stream.threads 20
commit.interval.ms 1000
partition.assignment.strategy StickyAssignor.class.getName()
fetch.max.wait.ms 500
max.poll.records 5000
max.poll.interval.ms 300000
heartbeat.interval.ms 3000
session.timeout.ms 30000
auto.offset.reset latest
Kafka message structure:
key = the person's name
value = the money they spent
time = the time at which this message was created
Kafka streams build code:
KStreamBuilder kStreamBuilder = new KStreamBuilder();
KStream<String, Double> peopleSpendStream = kStreamBuilder.stream(topic);
peopleSpendStream.groupByKey()// group by people's name
.aggregate(() -> new HashMap<String, Double>(8192),
(key, value, aggregate) -> {
aggregate.merge(key, value, Double::sum);
return aggregate;
},
TimeWindows.of(ONE_MINUTE).until(ONE_HOUR * 10), // 1-min window, retained for 10 hours
new HashMapSerde<>(), // serialize and deserialize by jackson actually
PEOPLE_SPEND_STORE_NAME);
Query code:
long time = System.currentTimeMillis();
for (String name : names) { // query by people's name
try (WindowStoreIterator<HashMap<String, Double>> iterator = store.fetch(name, time - TEN_MINUTE_MILLES, time)) {
iterator.forEachRemaining(kv -> log.info("name = {}, time = {}, cost = {}", name, kv.key, kv.value));
}
}
Did I get anything wrong? I would really appreciate your help.
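For reference, this is a minimal sketch of how the store above is totalled for one person over the queried interval; everything except the summing comes from the query code in the question (reading each window's HashMap via getOrDefault is only an assumption about its contents):
// Sketch only: sums the per-window aggregates fetched for one person over the last ten minutes.
long now = System.currentTimeMillis();
double total = 0.0;
try (WindowStoreIterator<HashMap<String, Double>> iterator =
         store.fetch(name, now - TEN_MINUTE_MILLES, now)) {
    while (iterator.hasNext()) {
        KeyValue<Long, HashMap<String, Double>> kv = iterator.next();
        total += kv.value.getOrDefault(name, 0.0); // each window's map holds that person's spend
    }
}
log.info("name = {}, spend over the last 10 minutes = {}", name, total);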
I'm writing a spark streaming job that reads data from Kafka, makes some changes to the records and sends the results to another Kafka cluster.
The performance of the job seems very slow: the processing rate is about 70,000 records per second. Sampling shows that 30% of the time is spent reading and processing the data, and the remaining 70% is spent sending the data to Kafka.
I've tried tweaking the Kafka configuration, adding memory, and changing batch intervals, but the only change that helps is adding more cores.
Profiler: (screenshot not included)
Spark job details:
max.cores 30
driver memory 6G
executor memory 16G
batch.interval 3 minutes
ingress rate 180,000 messages per second
Producer properties (I've tried different variations):
def buildProducerKafkaProperties: Properties = {
val producerConfig = new Properties
producerConfig.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, destKafkaBrokers)
producerConfig.put(ProducerConfig.ACKS_CONFIG, "all")
producerConfig.put(ProducerConfig.BATCH_SIZE_CONFIG, "200000")
producerConfig.put(ProducerConfig.LINGER_MS_CONFIG, "2000")
producerConfig.put(ProducerConfig.COMPRESSION_TYPE_CONFIG, "gzip")
producerConfig.put(ProducerConfig.RETRIES_CONFIG, "0")
producerConfig.put(ProducerConfig.BUFFER_MEMORY_CONFIG, "13421728")
producerConfig.put(ProducerConfig.SEND_BUFFER_CONFIG, "13421728")
producerConfig
}
Sending code
stream
  .foreachRDD(rdd => {
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
      .map(consumerRecord => doSomething(consumerRecord))
      .foreachPartition(partitionIter => {
        val producer = kafkaSinkBroadcast.value
        partitionIter.foreach(row => {
          producer.send(kafkaTopic, row)
          producedRecordsAcc.add(1)
        })
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      })
  })
Versions
Spark Standalone cluster 2.3.1
Destination Kafka cluster 1.1.1
Kafka topic has 120 partitions
Can anyone suggest how to increase sending throughput?
Update Jul 2019
size: 150k messages per second, each message has about 100 columns.
main settings:
spark.cores.max = 30 # the cores balanced between all the workers.
spark.streaming.backpressure.enabled = true
ob.ingest.batch.duration= 3 minutes
I've tried to use rdd.repartition(30), but it made the execution slower by ~10%
Thanks
Try using repartition as below:
val numPartitons = numberOfExecutors * coresPerExecutor // number of executors * number of executor cores
stream
  .repartition(numPartitons)
  .foreachRDD(rdd => {
    val offsetRanges = rdd.asInstanceOf[HasOffsetRanges].offsetRanges
    rdd
      .map(consumerRecord => doSomething(consumerRecord))
      .foreachPartition(partitionIter => {
        val producer = kafkaSinkBroadcast.value
        partitionIter.foreach(row => {
          producer.send(kafkaTopic, row)
          producedRecordsAcc.add(1)
        })
        stream.asInstanceOf[CanCommitOffsets].commitAsync(offsetRanges)
      })
  })
This should spread the producing work evenly across all the executor cores and improve sending throughput.
Hope this helps.
I am trying to aggregate a large amount of data using time windows of different sizes with Kafka Streams.
I increased the cache size to 2 GB, but when I set the window size to 1 hour, the CPU load reaches 100% and the application starts to slow down.
My code looks like this:
val tradeStream = builder.stream<String, Trade>(configuration.topicNamePattern, Consumed.with(Serdes.String(), JsonSerde(Trade::class.java)))
tradeStream
.groupBy(
{ _, trade -> trade.pair },
Serialized.with(JsonSerde(TokensPair::class.java), JsonSerde(Trade::class.java))
)
.windowedBy(TimeWindows.of(windowDuration).advanceBy(windowHop).until(windowDuration))
.aggregate(
{ Ticker(windowDuration) },
{ _, newValue, aggregate -> aggregate.add(newValue) },
Materialized.`as`<TokensPair, Ticker>(storeByPairs)
.withKeySerde(JsonSerde(TokensPair::class.java))
.withValueSerde(JsonSerde(Ticker::class.java))
)
.toStream()
.filter { tokensPair, _ -> filterFinishedWindow(tokensPair.window(), windowHop) }
.map { tokensPair, ticker -> KeyValue(
TickerKey(ticker.tokensPair!!, windowDuration, Timestamp(tokensPair.window().start())),
ticker.calcPrice()
)}
.to(topicName, Produced.with(JsonSerde(TickerKey::class.java), JsonSerde(Ticker::class.java)))
In addition, before the aggregated data is sent to the Kafka topic, it is filtered by the end time of the window so that only finished windows are sent to the topic.
Are there better approaches for implementing this kind of aggregation?
Without knowing a bit more about the system it's hard to diagnose.
How many partitions are present in your cluster?
How many stream applications are you running?
Are the stream applications running on the same machine?
Are you using compression for the payload?
Does it work for smaller intervals?
Hope that helps.
After developing and executing my Storm (1.0.1) topology with a KafkaSpout and a couple of Bolts, I noticed huge network traffic even when the topology is idle (no messages on Kafka, no processing done in the bolts). So I started commenting out my topology piece by piece to find the cause, and now I have only the KafkaSpout in my main:
....
final SpoutConfig spoutConfig = new SpoutConfig(
new ZkHosts(zkHosts, "/brokers"),
"files-topic", // topic
"/kafka", // ZK chroot
"consumer-group-name");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
spoutConfig.startOffsetTime = OffsetRequest.LatestTime();
topologyBuilder.setSpout(
    "kafka-spout-id",
    new KafkaSpout(spoutConfig),
    1);
....
When this (useless) topology executes, even in local mode, even the very first time, the network traffic always grows a lot: I see (in my Activity Monitor)
An average of 432 KB of data received/sec
After the topology has been running (idle) for a couple of hours, data received is 1.26 GB and data sent is 1 GB
(Important: Kafka is not running in a cluster; it is a single instance running on the same machine, with a single topic and a single partition. I just downloaded Kafka onto my machine, started it and created a simple topic. When I put a message on the topic, everything in the topology works without any problem at all.)
Obviously, the cause is in the KafkaSpout.nextTuple() method (below), but I don't understand why, without any messages in Kafka, I should see such traffic. Is there something I didn't consider? Is this the expected behaviour? I looked at the Kafka logs and the ZK logs: nothing. I cleaned up the Kafka and ZK data: nothing, still the same behaviour.
@Override
public void nextTuple() {
List<PartitionManager> managers = _coordinator.getMyManagedPartitions();
for (int i = 0; i < managers.size(); i++) {
try {
// in case the number of managers decreased
_currPartitionIndex = _currPartitionIndex % managers.size();
EmitState state = managers.get(_currPartitionIndex).next(_collector);
if (state != EmitState.EMITTED_MORE_LEFT) {
_currPartitionIndex = (_currPartitionIndex + 1) % managers.size();
}
if (state != EmitState.NO_EMITTED) {
break;
}
} catch (FailedFetchException e) {
LOG.warn("Fetch failed", e);
_coordinator.refresh();
}
}
long diffWithNow = System.currentTimeMillis() - _lastUpdateMs;
/*
As far as the System.currentTimeMillis() is dependent on System clock,
additional check on negative value of diffWithNow in case of external changes.
*/
if (diffWithNow > _spoutConfig.stateUpdateIntervalMs || diffWithNow < 0) {
commit();
}
}
Put a sleep of one second (1000 ms) in the nextTuple() method and observe the traffic. For example:
@Override
public void nextTuple() {
    try {
        Thread.sleep(1000);
    } catch (Exception ex) {
        log.error("Exception while sleeping...", ex);
    }
List<PartitionManager> managers = _coordinator.getMyManagedPartitions();
for (int i = 0; i < managers.size(); i++) {
...
...
...
...
}
The reason is that a Kafka consumer works on a pull basis, which means consumers pull data from the Kafka brokers. So from the consumer's (Kafka spout's) point of view, it continuously issues fetch requests to the Kafka broker, and each of those is a TCP network request. That is why you see such large numbers in the packets sent/received statistics: even though the consumer doesn't consume any messages, the pull requests and the empty responses still count toward the network traffic. The higher your sleep time, the lower your network traffic will be. There are also some network-related configurations for the brokers and for the consumer; researching those settings may help. Hope this helps.
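As a side note, here is a minimal sketch of the configuration-based alternative to hard-coding the sleep: Storm's default SleepSpoutWaitStrategy already waits when nextTuple() emits nothing, and the wait duration is controlled by Config.TOPOLOGY_SLEEP_SPOUT_WAIT_STRATEGY_TIME_MS (the 1000 ms value below is only an example, tune it to your latency needs):
// Sketch only: lengthen the wait between empty nextTuple() calls instead of sleeping inside the spout.
Config conf = new Config();
conf.put(Config.TOPOLOGY_SLEEP_SPOUT_WAIT_STRATEGY_TIME_MS, 1000);
// Submit the topology with this conf as usual, e.g.
// StormSubmitter.submitTopology("topology-name", conf, topologyBuilder.createTopology());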
Is your bolt receiving messages? Does your bolt inherit BaseRichBolt?
Comment out the line m.fail(id.offset) in KafkaSpout and check again. If your bolt doesn't ack, your spout assumes the message failed and tries to replay it.
public void fail(Object msgId) {
    KafkaMessageId id = (KafkaMessageId) msgId;
    PartitionManager m = _coordinator.getManager(id.partition);
    if (m != null) {
        //m.fail(id.offset);
    }
}
Also try halting nextTuple() for a few milliseconds and check again.
Let me know if it helps.