Run a Alpakka Kafka Consumer on Demand in Scala - scala

When all the messages have been handled, I need to shut off the Kafka consumer until the next call. I obviously won't be able to use what I wrote. Any assistance would be greatly valued.
val consumerDrainingControl = Consumer
.mapAsync(1) { msg =>
// Process
Future.successful(msg.committableOffset)
}
.via(Committer.flow(committerDefaults.withMaxBatch(1)))
.toMat(Sink.ignore)(DrainingControl.apply)
.run()
consumerDrainingControl.drainAndShutdown()

Related

How to send Kafka message from one topic to another topic?

suppose my producer is writing the message to Topic A...once the message is in Topic A, i want to copy the same message to Topic B. Is this possible in kafka?
If I understand correctly, you just want stream.to("topic-b"), although, that seems strange without doing something to the data.
Note:
The specified topic should be manually created before it is used
I am not clear about what use case you are exactly trying to achieve by simply copying data from one topic to another topic. If both the topics are in the same Kafka cluster then it is never a good idea to have two topics with the same message/content.
I believe the gap here is that probably you are not clear about the concept of the Consumer group in Kafka. Probably you have two action items to do by consuming the message from the Kafka topic. And you are believing that if the first application consumes the message from the Kafka topic, will it be available for the second application to consume the same message or not. Kafka allows you to solve this kind of common use case with the help of the consumer group.
Let's try to differentiate between other message queue and Kafka and you will understand that you do not need to copy the same data/message between two topics.
In other message queues, like SQS(Simple Queue Service) where if the message is consumed by a consumer, the same message is not available to get consumed by other consumers. It is the responsibility of the consumer to delete the message safely once it has processed the message. By doing this we guarantee that the same message should not get processed by two consumers leading to inconsistency.
But, In Kafka, it is totally fine to have multiple sets of consumers consuming from the same topic. The set of consumers form a group commonly termed as the consumer group. Here one of the consumers from the consumer group can process the message based on the partition of the Kafka topic the message is getting consumed from.
Now the catch here is that we can have multiple consumer groups consuming from the same Kafka topic. Each consumer group will process the message in the way they want to do. There is no interference between consumers of two different consumer groups.
To fulfill your use case I believe you might need two consumer groups that can simply process the message in the way they want. You do not essentially have to copy the data between two topics.
Hope this helps.
There are two immediate options to forward the contents of one topic to another:
by using the stream feature of Kafka to create a forwarding link
between the two topics.
by creating a consumer / producer pair
and using those to receive and then forward on messages
I have a short piece of code that shows both (in Scala):
def topologyPlan(): StreamsBuilder = {
val builder = new StreamsBuilder
val inputTopic: KStream[String, String] = builder.stream[String, String]("topic2")
inputTopic.to("topic3")
builder
}
def run() = {
val kafkaStreams = createStreams(topologyPlan())
kafkaStreams.start()
val kafkaConsumer = createConsumer()
val kafkaProducer = createProducer()
kafkaConsumer.subscribe(List("topic1").asJava)
while (true) {
val record = kafkaConsumer.poll(Duration.ofSeconds(5)).asScala
for (data <- record.iterator) {
kafkaProducer.send(new ProducerRecord[String, String]("topic2", data.value()))
}
}
}
Looking at the run method, the first two lines set up a streams object to that uses the topologyPlan() to listen for messages in 'topic2' and forward then to 'topic3'.
The remaining lines show how a consumer can listen to a 'topic1' and use a producer to send them onward to 'topic2'.
The final point of the example here is Kafka is flexible enough to let you mix options depending on what you need, so the code above will take messages in 'topic1', and send them to 'topic3' via 'topic2'.
If you want to see the code that sets up consumer, producer and streams, see the full class here.

Independent Kafka Spark Sinks (multiple producers and brokers)

So I have a problem with Kafka Sinks in Spark Streaming while sending JSONs to multiple topics and unreliable kafka brokers. Here are some parts of code:
val kS = KafkaUtils.createDirectStream[String, TMapRecord]
(ssc,
PreferConsistent,
Subscribe[String, TMapRecord](topicsSetT, kafkaParamsInT))
Then I iterate over RDD's
kSMapped.foreachRDD {
rdd: RDD[TMsg] => {
rdd.foreachPartition {
part => {
part.foreach { ...........
And inside foreach I do
kafkaSink.value.send(kafkaTopic, strJSON)
kafkaSinkMirror.value.send(kafkaTopicMirrorBroker, strJSON)
When Mirror broker is down the entire Streaming Application is waiting for it and we are not sending anything to the main broker.
How would you handle it?
For the easiest solution you propose, imagine that me just skip messages that were meant to be sent to a broker that went down (say, that's CASE 1)
for the CASE 2 we'd do some buffering.
P.S. Later on I will use Kafka Mirror, but currently I don't have such an option so I need to make some solution in my code.
I've found several decisions of this problem:
You may use throwing any timeout exception on worker and checkpoints. Spark tries to restart bad task several times described in spark.task.maxFailures property. It is possible to increase number of retries. If streaming job fails after max retries, just will restart the job from checkpoint when broker is available. Or you could manually stop the job when it fails.
You could configure backpressure spark.streaming.backpressure.enabled=true that allow to receive data only as fast as it can process it.
You could send you two results back to your technical Kafka topic and handle it later with another streaming job.
You could make Hive or Hbase buffer for this cases and send unhandled data later in batch mode.

Using Kafka to communicate between long running Spark jobs

I am new to Apache Spark and have a need to run several long-running processes (jobs) on my Spark cluster at the same time. Often, these individual processes (each of which is its own job) will need to communicate with each other. Tentatively, I'm looking at using Kafka to be the broker in between these processes. So the high-level job-to-job communication would look like:
Job #1 does some work and publishes message to a Kafka topic
Job #2 is set up as a streaming receiver (using a StreamingContext) to that same Kafka topic, and as soon as the message is published to the topic, Job #2 consumes it
Job #2 can now do some work, based on the message it consumed
From what I can tell, streaming contexts are blocking listeners that run on the Spark Driver node. This means that once I start the streaming consumer like so:
def createKafkaStream(ssc: StreamingContext,
kafkaTopics: String, brokers: String): DStream[(String,
String)] = {
// some configs here
KafkaUtils.createDirectStream[String, String, StringDecoder,
StringDecoder](ssc, props, topicsSet)
}
def consumerHandler(): StreamingContext = {
val ssc = new StreamingContext(sc, Seconds(10))
createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD(rdd => {
rdd.collect().foreach { msg =>
// Now do some work as soon as we receive a messsage from the topic
}
})
ssc
}
StreamingContext.getActive.foreach {
_.stop(stopSparkContext = false)
}
val ssc = StreamingContext.getActiveOrCreate(consumerHandler)
ssc.start()
ssc.awaitTermination()
...that there are now 2 implications:
The Driver is now blocking and listening for work to consume from Kafka; and
When work (messages) are received, they are sent to any available Worker Nodes to actually be executed upon
So first, if anything that I've said above is incorrect or is misleading, please begin by correcting me! Assuming I'm more or less correct, then I'm simply wondering if there is a more scalable or performant way to accomplish this, given my criteria. Again, I have two long-runnning jobs (Job #1 and Job #2) that are running on my Spark nodes, and one of them needs to be able to 'send work to' the other one. Any ideas?
From what I can tell, streaming contexts are blocking listeners that
run on the Spark Driver node.
A StreamingContext (singular) isn't a blocking listener. It's job is to create the graph of execution for your streaming job.
When you start reading from Kafka, you specify that you want to fetch new records every 10 seconds. What happens from now on depends on which Kafka abstraction you're using for Kafka, either the Receiver approach via KafkaUtils.createStream, or the Receiver-less approach via KafkaUtils.createDirectStream.
In both approaches in general, data is being consumed from Kafka and then dispatched to each Spark worker to process in parallel.
then I'm simply wondering if there is a more scalable or performant
way to accomplish this
This approach is highly scalable. When using the receiver-less approach, each Kafka partition maps to a Spark partition in a given RDD. You can increase parallelism by either increasing the amount of partitions in Kafka, or by re-partitions the data inside Spark (using DStream.repartition). I suggest testing this setup to determine if it suits your performance requirements.

Serial consumption of Kafka topics from Spark

Given the following code:
def createKafkaStream(ssc: StreamingContext,
kafkaTopics: String, brokers: String): DStream[(String, String)] = {
// some configs here
KafkaUtils.createDirectStream[String, String, StringDecoder,
StringDecoder](ssc, props, topicsSet)
}
def consumerHandler(): StreamingContext = {
val ssc = new StreamingContext(sc, Seconds(10))
createKafkaStream(ssc, "someTopic", "my-kafka-ip:9092").foreachRDD(rdd => {
rdd.foreach { msg =>
// Now do some DataFrame-intensive work.
// As I understand things, DataFrame ops must be run
// on Workers as well as streaming consumers.
}
})
ssc
}
StreamingContext.getActive.foreach {
_.stop(stopSparkContext = false)
}
val ssc = StreamingContext.getActiveOrCreate(consumerHandler)
ssc.start()
ssc.awaitTermination()
My understanding is that Spark and Kafka will automagically work together to figure out how many consumer threads to deploy to available Worker Nodes, which likely results in parallel processing of messages off a Kafka topic.
But what if I don't want multiple, parallel consumers? What if want 1-and-only-1 consumer reading the next message from a topic, processing it completely, and then starting back over again and polling for the next message.
Also, when I call:
val ssc = new StreamingContext(sc, Seconds(10))
Does this mean:
That a single consumer thread will receive all messages that were published to the topic in the last 10 seconds; or
That a single consumer thread will receive the next (single) message from the topic, and that it will poll for the next message every 10 seconds?
But what if I don't want multiple, parallel consumers? What if want
1-and-only-1 consumer reading the next message from a topic,
processing it completely, and then starting back over again and
polling for the next message.
If that is your use-case, I'd say why use Spark at all? Its entire advantage is that you can read in parallel. The only hacky workaround I can think of is creating a Kafka topic with a single partition, which would make Spark assign the entire offset range to a single worker, but that is ugly.
Does that mean that a single consumer thread will receive all messages that were
published to the topic in the last 10 seconds or that a single
consumer thread will receive the next (single) message from the topic,
and that it will poll for the next message every 10 seconds?
Neither. Since you're using direct (receiverless) stream approach, it means that every 10 seconds, your driver will ask Kafka to give him the offset ranges that have changed since the last batch, for each partition of the said topic. Then, Spark will take each such offset range, and send it to one of the workers to consume directly from Kafka. This means that with the direct stream approach, there is a 1:1 correspondence between Kafka partitions and Spark partitions.

Akka, Camel and ActiveMQ: throttling consumers

I've got a very basic skeleton Scala application (with Akka, Camel and ActiveMQ) where I want to publish onto an ActiveMQ queue as quickly as possible, but then only consume from that queue at a particular rate (eg. 1 per second).
Here's some code to illustrate that:
MyProducer.scala
class Producer extends Actor with Producer with Oneway {
def endpointUri = "activemq:myqueue"
}
MyConsumer.scala
class MyConsumer extends Actor with Consumer {
def endpointUri = "activemq:myqueue"
def receive = {
case msg: CamelMessage => println("Ping!")
}
}
In my main method, I then have all the boilerplate to set up Camel and get it talking to ActiveMQ, and then I have:
// Start the consumer
val consumer = system.actorOf(Props[MyConsumer])
val producer = system.actorOf(Props[MyProducer])
// Imagine I call this line 100+ times
producer ! "message"
How can I make it so that MyProducer sends things to ActiveMQ as quickly as possible (ie. no throttling) whilst making sure that MyConsumer only reads a message every x seconds? I'd like each message to stay on the ActiveMQ queue until the last possible moment (ie. when it's read by MyConsumer).
So far, I've managed to use a TimerBasedThrottler to consume at a certain rate, but this still consumes all of the messages in one big go.
Apologies if I've missed something along the way, I'm relatively new to Akka/Camel.
How many consumers comprise "MyConsumer"?
a) If it were only one, then it is unclear why a simple sleep between reading/consuming messages would not work.
If there are multiple consumers, which behavior are you requiring:
each consumer is throttled to the specified consumption rate. In that case each Consumer thread still behaves as mentioned in a)
the overall pool of consumers is throttled to the consumption rate. In that case a central Throttler would need to retain the inter-message delay and block each consumer until the required delay were met . There would be the complexity of managing when there were backlogs - to allow "catch-up". You probably get the drift here.
It may be you were looking for something else /more specific in this question. If so then please elaborate.