KafkaIO with Apache Beam stuck in infinite loop on DirectRunner - apache-beam

I'm trying to run this simple example where data from a Kafka topic are filtered out: https://www.talend.com/blog/2018/08/07/developing-data-processing-job-using-apache-beam-streaming-pipeline/
I have a similar setup with a localhost broker using default settings, but I can't even read from the topic.
When running the application it gets stuck in an infinite loop and nothing happens. I've tried giving a gibberish URL for my broker to see if it is even able to reach it - it isn't. The cluster is up and running and I'm able to add messages to the topic. Here is where I specify the broker and the topic:
pipeline
    .apply(
        KafkaIO.<Long, String>read()
            .withBootstrapServers("localhost:9092")
            .withTopic("BEAM_IN")
            .withKeyDeserializer(LongDeserializer.class)
            .withValueDeserializer(StringDeserializer.class)
    )
I don't see any errors and there is nothing written to the output topic.
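For context, the full pipeline I'm aiming for follows the blog post: read from Kafka, filter, and write back out. A minimal sketch, where the "BEAM_OUT" topic name, the filter predicate and the class name are placeholders for my real setup:

import org.apache.beam.sdk.Pipeline;
import org.apache.beam.sdk.io.kafka.KafkaIO;
import org.apache.beam.sdk.options.PipelineOptionsFactory;
import org.apache.beam.sdk.transforms.Filter;
import org.apache.beam.sdk.transforms.Values;
import org.apache.kafka.common.serialization.LongDeserializer;
import org.apache.kafka.common.serialization.StringDeserializer;
import org.apache.kafka.common.serialization.StringSerializer;

public class KafkaFilterPipelineSketch {
    public static void main(String[] args) {
        Pipeline pipeline = Pipeline.create(PipelineOptionsFactory.fromArgs(args).create());

        pipeline
            .apply(KafkaIO.<Long, String>read()
                .withBootstrapServers("localhost:9092")
                .withTopic("BEAM_IN")
                .withKeyDeserializer(LongDeserializer.class)
                .withValueDeserializer(StringDeserializer.class)
                .withoutMetadata())                               // -> PCollection<KV<Long, String>>
            .apply(Values.<String>create())                       // keep only the payload strings
            .apply(Filter.by((String v) -> !v.isEmpty()))         // placeholder filter predicate
            .apply(KafkaIO.<Void, String>write()
                .withBootstrapServers("localhost:9092")
                .withTopic("BEAM_OUT")                            // placeholder output topic
                .withValueSerializer(StringSerializer.class)
                .values());                                       // write values only, no keys

        pipeline.run().waitUntilFinish();
    }
}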
When debugging, I see it's stuck in this loop:
while (Instant.now().isBefore(completionTime)) {
    ExecutorServiceParallelExecutor.VisibleExecutorUpdate update = this.visibleUpdates.tryNext(Duration.millis(25L));
    if (update == null && ((State) this.pipelineState.get()).isTerminal()) {
        return (State) this.pipelineState.get();
    }
    if (update != null) {
        if (this.isTerminalStateUpdate(update)) {
            return (State) this.pipelineState.get();
        }
        if (update.thrown.isPresent()) {
            Throwable thrown = (Throwable) update.thrown.get();
            if (thrown instanceof Exception) {
                throw (Exception) thrown;
            }
            if (thrown instanceof Error) {
                throw (Error) thrown;
            }
            throw new Exception("Unknown Type of Throwable", thrown);
        }
    }
}
In the isKeyed(PValue pvalue) method within the ExecutorServiceParallelExecutor class.
What am I missing?

Related

Kafka Consumer Poll runs indefinitely and doesn't return anything

I am facing difficulty with KafkaConsumer.poll(Duration timeout): it runs indefinitely and never comes out of the method. I understand that this could be related to the connection, and I have seen it behave a bit inconsistently at times. How do I handle the case where poll stops responding? Given below is the snippet from KafkaConsumer.poll():
public ConsumerRecords<K, V> poll(final Duration timeout) {
    return poll(time.timer(timeout), true);
}
and I am calling the above from here:
Duration timeout = Duration.ofSeconds(30);

while (true) {
    final ConsumerRecords<String, GenericRecord> records = consumer.poll(timeout);
    System.out.println("record count is " + records.count());
}
I am getting the below error:
org.apache.kafka.common.errors.SerializationException: Error
deserializing key/value for partition at offset 2. If
needed, please seek past the record to continue consumption.
I stumbled upon some useful information while trying to fix the problem I was facing above. I will provide the piece of code that should be able to handle this, but before that it is important to know what causes it.
When producing or consuming messages with Apache Kafka, the message needs a schema - in my case an Avro schema. If a message produced to Kafka conflicts with that schema, it will affect consumption.
Add the code below to your consumer, in the method where it consumes records.
Remember to import the following packages:
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.errors.SerializationException;
try {
    while (true) {
        ConsumerRecords<String, GenericRecord> records = null;
        try {
            records = consumer.poll(10000);
        } catch (SerializationException e) {
            // Parse the topic, partition and offset out of the exception message,
            // then seek past the corrupt record.
            String s = e.getMessage()
                    .split("Error deserializing key/value for partition ")[1]
                    .split(". If needed, please seek past the record to continue consumption.")[0];
            String topics = s.split("-")[0];
            int offset = Integer.valueOf(s.split("offset ")[1]);
            int partition = Integer.valueOf(s.split("-")[1].split(" at")[0]);
            TopicPartition topicPartition = new TopicPartition(topics, partition);
            // log.info("Skipping " + topics + "-" + partition + " offset " + offset);
            consumer.seek(topicPartition, offset + 1);
        }
        if (records == null) {
            continue;   // poll failed with a SerializationException, try again
        }
        for (ConsumerRecord<String, GenericRecord> record : records) {
            System.out.printf("value = %s \n", record.value());
        }
    }
} finally {
    consumer.close();
}
I ran into this while setting up a test environment.
Running the following command on the broker printed out the stored records as one would expect:
bin/kafka-console-consumer.sh --bootstrap-server="localhost:9092" --topic="foo" --from-beginning
It turned out that the Kafka server was misconfigured. To connect from an external IP address, listeners must have a valid value in kafka/config/server.properties, e.g.
# The address the socket server listens on. It will get the value returned from
# java.net.InetAddress.getCanonicalHostName() if not configured.
# FORMAT:
# listeners = listener_name://host_name:port
# EXAMPLE:
# listeners = PLAINTEXT://your.host.name:9092
listeners=PLAINTEXT://:9092

Return from Kafka consumer when there is no message

I want to process a topic at application startup using the Confluent dotnet client. Assume the following example:
while (true)
{
    try
    {
        var cr = c.Consume();
        Console.WriteLine($"Consumed message '{cr.Value}' at: '{cr.TopicPartitionOffset}'.");
    }
    catch (ConsumeException e)
    {
        Console.WriteLine($"Error occurred: {e.Error.Reason}");
    }
}
When there is no new message in Kafka, c.Consume blocks. Because I want to use this at application startup (like a cache warm-up), I want my code to proceed once I find there is no new message.
I know there is an overload for setting a timeout, c.Consume(timeout), but the problem with this approach is that if there is a message in the topic and reading it takes longer than your timeout, you receive a null result, which is not desirable.
The consumer(s) are not supposed to be aware of the producer(s).
Now if you want to know that you have read everything in the topic from the moment you start to consume, you can:
Load the newest offset before starting to consume.
Then start consuming messages.
If the message's offset is the same as the newest offset you loaded before, stop consuming.
I'm not a C# developer, but from what I read in the Confluent dotnet docs you can call QueryWatermarkOffsets on the consumer to get the oldest and newest offsets: https://docs.confluent.io/current/clients/confluent-kafka-dotnet/api/Confluent.Kafka.Consumer.html#Confluent_Kafka_Consumer_QueryWatermarkOffsets_Confluent_Kafka_TopicPartition_
Then, on the Message class, you have an Offset accessor, so the whole thing should not be too hard to achieve:
https://docs.confluent.io/current/clients/confluent-kafka-dotnet/api/Confluent.Kafka.Message.html#Confluent_Kafka_Message_Offset
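Since I'm more familiar with the Java client, here is the same idea sketched with the Java consumer; the topic name, group id, deserializers and class name are placeholders, and the Confluent dotnet client exposes equivalent calls:

import java.time.Duration;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.TopicPartition;

public class WarmUpReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "warmup");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            List<TopicPartition> partitions = consumer.partitionsFor("my_topic").stream()
                    .map(p -> new TopicPartition(p.topic(), p.partition()))
                    .collect(Collectors.toList());
            consumer.assign(partitions);
            consumer.seekToBeginning(partitions);

            // Snapshot the newest offsets once, before consuming anything.
            Map<TopicPartition, Long> endOffsets = consumer.endOffsets(partitions);

            boolean caughtUp = false;
            while (!caughtUp) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    System.out.println("warm-up record: " + record.value());
                }
                // Stop once every partition has reached the end offset captured at startup.
                caughtUp = partitions.stream()
                        .allMatch(tp -> consumer.position(tp) >= endOffsets.get(tp));
            }
        }
    }
}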
You can use the OnPartitionEOF event, which indicates that you have reached the end of a partition.
CancellationTokenSource source = new CancellationTokenSource();
bool isContinue = true;

c.OnPartitionEOF += (o, e) =>
{
    Console.WriteLine($"You have reached end of partition");
    isContinue = false;
    source.Cancel();
};

while (isContinue)
{
    try
    {
        var cr = c.Consume(source.Token);
        Console.WriteLine($"Consumed message '{cr.Value}' at: '{cr.TopicPartitionOffset}'.");
    }
    catch (ConsumeException e)
    {
        Console.WriteLine($"Error occurred: {e.Error.Reason}");
    }
}
I found the Consumer.IsPartitionEOF useful.

Kafka Consumer with limited number of retries when processing messages

I'm having a hard time figuring simple patterns for handling exceptions in the consumer of a Kafka topic.
Scenario is as follows: in the consumer I call an external service. If the service is unavailable I want to retry a few times and then stop consuming.
The simplest pattern seems to be a blocking, synchronous way of dealing with it, something like this in Java:
ConsumerRecords<String, String> records = consumer.poll(100);

for (ConsumerRecord<String, String> record : records) {
    boolean processed = false;
    int count = 0;
    while (!processed) {
        try {
            callService(..);
            processed = true;   // success, move on to the next record
        } catch (Exception e) {
            if (count++ < 3) {
                Thread.sleep(5000);
                continue;
            } else throw new RuntimeException();
        }
    }
}
However, I have the feeling there must be a simpler approach (without using third party libraries), and one that avoids blocking the thread.
Seems like a common thing we would like to have, yet I could not find a simple example for this pattern.
There is no such retry mechanism provided by Kafka out of the box. I'm drawing on my experience with RabbitMQ, where the broker provides a retry exchange; these exchanges are called Dead-Letter-Exchanges in RabbitMQ.
https://www.rabbitmq.com/dlx.html
You can apply the same pattern in the case of Kafka.
On message processing failure we can publish a copy of the message to another topic and wait for the next message. Let's call the new topic the ‘retry_topic’. The consumer of the ‘retry_topic’ will receive the message from Kafka and then wait some predefined time, for example one hour, before starting the message processing. This way we can postpone the next attempts at processing the message without any impact on the ‘main_topic’ consumer. If processing in the ‘retry_topic’ consumer fails, we just have to give up and store the message in the ‘failed_topic’ for further manual handling of this problem. The ‘main_topic’ consumer code may look like this:
Pushing message to retry_topic on failure/exception
void consumeMainTopicWithPostponedRetry() {
    while (true) {
        Message message = takeNextMessage("main_topic");
        try {
            process(message);
        } catch (Exception ex) {
            publishTo("retry_topic");
            LOGGER.warn("Message processing failure. Will try once again in the future.", ex);
        }
    }
}
Consumer of the retry topic
void consumeRetryTopic() {
    while (true) {
        Message message = takeNextMessage("retry_topic");
        try {
            process(message);
            waitSomeLongerTime();
        } catch (Exception ex) {
            publishTo("failed_topic");
            LOGGER.warn("Message processing failure. Will skip it.", ex);
        }
    }
}
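For illustration only, here is a minimal sketch of what the ‘main_topic’ consumer could look like with the plain Kafka clients; the topic names, class name and the process() helper are placeholders, and offset management, retry counting and producer flushing are left out:

import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class MainTopicConsumerSketch {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "main-consumer");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {
            consumer.subscribe(Collections.singletonList("main_topic"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    try {
                        process(record.value());   // placeholder business logic
                    } catch (Exception ex) {
                        // Park the failed message on the retry topic and keep going.
                        producer.send(new ProducerRecord<>("retry_topic", record.key(), record.value()));
                    }
                }
            }
        }
    }

    private static void process(String value) { /* placeholder */ }
}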
The above strategy and examples are taken from the link below; full credit goes to the author of the blog post.
https://blog.pragmatists.com/retrying-consumer-architecture-in-the-apache-kafka-939ac4cb851a
A non-blocking way of doing the above is explained in the full blog post. Hope this helps.

KafkaSpout (idle) generates huge network traffic

After developing and executing my Storm (1.0.1) topology with a KafkaSpout and a couple of Bolts, I noticed huge network traffic even when the topology is idle (no messages on Kafka, no processing done in the bolts). So I started commenting out my topology piece by piece in order to find the cause, and now I have only the KafkaSpout in my main:
....
final SpoutConfig spoutConfig = new SpoutConfig(
        new ZkHosts(zkHosts, "/brokers"),
        "files-topic",           // topic
        "/kafka",                // ZK chroot
        "consumer-group-name");
spoutConfig.scheme = new SchemeAsMultiScheme(new StringScheme());
spoutConfig.startOffsetTime = OffsetRequest.LatestTime();

topologyBuilder.setSpout(
        "kafka-spout-id",
        new KafkaSpout(spoutConfig),
        1);
....
When this (useless) topology executes, even in local mode, even the very first time, the network traffic always grows a lot. I see (in my Activity Monitor):
An average of 432 KB of data received/sec
After a couple of hours of the topology running (idle), data received is 1.26 GB and data sent is 1 GB
(Important: Kafka is not running as a cluster; it is a single instance running on the same machine, with a single topic and a single partition. I just downloaded Kafka on my machine, started it and created a simple topic. When I put a message in the topic, everything in the topology works without any problem at all.)
Obviously, the reason is in the KafkaSpout.nextTuple() method (below), but I don't understand why, without any message in Kafka, I should have such traffic. Is there something I didn't consider? Is that the expected behaviour? I had a look at Kafka logs, ZK logs, nothing, I have cleaned up Kafka and ZK data, nothing, still the same behaviour.
@Override
public void nextTuple() {
    List<PartitionManager> managers = _coordinator.getMyManagedPartitions();
    for (int i = 0; i < managers.size(); i++) {
        try {
            // in case the number of managers decreased
            _currPartitionIndex = _currPartitionIndex % managers.size();
            EmitState state = managers.get(_currPartitionIndex).next(_collector);
            if (state != EmitState.EMITTED_MORE_LEFT) {
                _currPartitionIndex = (_currPartitionIndex + 1) % managers.size();
            }
            if (state != EmitState.NO_EMITTED) {
                break;
            }
        } catch (FailedFetchException e) {
            LOG.warn("Fetch failed", e);
            _coordinator.refresh();
        }
    }

    long diffWithNow = System.currentTimeMillis() - _lastUpdateMs;

    /*
      As far as the System.currentTimeMillis() is dependent on System clock,
      additional check on negative value of diffWithNow in case of external changes.
    */
    if (diffWithNow > _spoutConfig.stateUpdateIntervalMs || diffWithNow < 0) {
        commit();
    }
}
Put a sleep of one second (1000 ms) in the nextTuple() method and observe the traffic now. For example:
@Override
public void nextTuple() {
    try {
        Thread.sleep(1000);
    } catch (Exception ex) {
        log.error("Exception while sleeping...", ex);
    }
    List<PartitionManager> managers = _coordinator.getMyManagedPartitions();
    for (int i = 0; i < managers.size(); i++) {
        ...
        ...
        ...
        ...
    }
}
The reason is that the Kafka consumer works on a pull model: consumers pull data from the Kafka brokers. So from the consumer's point of view, the Kafka spout continuously sends fetch requests to the Kafka broker, each of which is a TCP network request. That is why you see such large numbers for data sent/received: even though the consumer doesn't consume any messages, the pull requests and the empty responses still count towards the network statistics. Your network traffic will be lower if your sleep time is higher. There are also some network-related configurations for the brokers and for the consumer; researching those settings may help. Hope this helps.
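As an illustration of the consumer-side settings mentioned above: if you were on the plain Kafka consumer (rather than the old SimpleConsumer-based storm-kafka spout), the usual knobs for reducing empty-fetch chatter are fetch.min.bytes and fetch.max.wait.ms. A minimal sketch with illustrative values; the group id and class name are placeholders:

import java.util.Properties;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class QuietFetchConfigSketch {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");
        props.put("group.id", "my-group");
        props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        // Ask the broker to hold a fetch until at least 1 KB is available,
        // or until 500 ms have passed, instead of answering every poll immediately.
        props.put("fetch.min.bytes", "1024");
        props.put("fetch.max.wait.ms", "500");

        KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props);
        // ... subscribe and poll as usual ...
        consumer.close();
    }
}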
Is your bolt receiving messages? Does your bolt inherit BaseRichBolt?
Comment out the line m.fail(id.offset) in KafkaSpout and check it out. If your bolt doesn't ack, then your spout assumes the message has failed and tries to replay the same message.
public void fail(Object msgId) {
    KafkaMessageId id = (KafkaMessageId) msgId;
    PartitionManager m = _coordinator.getManager(id.partition);
    if (m != null) {
        //m.fail(id.offset);
    }
}
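For reference, a minimal sketch of a bolt that extends BaseRichBolt and acks every tuple; the class name is illustrative and the actual processing is elided:

import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Tuple;

public class AckingBolt extends BaseRichBolt {
    private OutputCollector collector;

    @Override
    public void prepare(Map stormConf, TopologyContext context, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        // ... do the actual work here ...
        collector.ack(tuple);   // without this ack the tuple eventually times out and the spout replays it
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // this bolt does not emit anything
    }
}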
Also try halting nextTuple() for a few millis and check it out.
Let me know if it helps

Apache Kafka with High Level Consumer: Skip corrupted messages

I'm facing an issue with the high level Kafka consumer (0.8.2.0): after consuming some amount of data, one of our consumers stops. After a restart it consumes some messages and stops again with no error/exception or warning.
After some investigation I found that the problem with consumer was this exception:
ERROR c.u.u.e.impl.kafka.KafkaConsumer - Error consuming message stream:
kafka.message.InvalidMessageException: Message is corrupt (stored crc = 3801080313, computed crc = 2728178222)
Any ideas how I can simply skip such messages altogether?
So, answering my own question. After some debugging of Kafka Consumer, I found one possible solution:
Create a subclass of kafka.consumer.ConsumerIterator
Override the makeNext method. In this method, catch InvalidMessageException and return a dummy placeholder.
In your while loop you have to convert the kafka.consumer.ConsumerIterator to your implementation. Unfortunately, all fields of kafka.consumer.ConsumerIterator are private, so you have to use reflection.
So this is the code example:
val skipIt = createKafkaSkippingIterator(ks.iterator())
while (skipIt.hasNext()) {
  val messageAndTopic = skipIt.next()
  if (messageNotCorrupt(messageAndTopic)) {
    consumeFn(messageAndTopic)
  }
}
The messageNotCorrupt method simply checks whether the argument is equal to the dummy message.
Another solution, possibly easier, using the Kafka 0.8.2 client:
try {
  val m = it.next()
  //...
} catch {
  case e: kafka.message.InvalidMessageException ⇒
    log.warn("Corrupted message. Skipping.", e)
    resetIteratorState(it)
}

//...

def resetIteratorState(it: ConsumerIterator[Array[Byte], Array[Byte]]): Unit = {
  val method = classOf[IteratorTemplate[_]].getDeclaredMethod("resetState")
  method.setAccessible(true)
  method.invoke(it)
}