Alpakka Kafka consumer offset - Scala

I am using Alpakka-kafka in scala to consume a Kafka topic. Here's my code:
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.CommonClientConfigs
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

val kafkaConsumerSettings: ConsumerSettings[String, String] =
  ConsumerSettings(actorSystem, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers(kafkaConfig.server)
    .withGroupId(kafkaConfig.group)
    .withProperties(
      ConsumerConfig.MAX_POLL_RECORDS_CONFIG -> "100",
      ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest",
      CommonClientConfigs.SECURITY_PROTOCOL_CONFIG -> "SSL"
    )

Consumer
  .plainSource(kafkaConsumerSettings, Subscriptions.topics(kafkaConfig.topic))
  .runWith(Sink.foreach(println))
However, the consumer only starts polling from the first uncommitted message in the topic. I would like to always start from offset 0, regardless of which offsets have been committed.
With the Alpakka consumer, how do I specify the offset manually?

I think you want to add a couple of config entries:
ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false", so your job never saves any offsets
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest", so your job starts from the beginning
If your job has already committed offsets in the past, you may have to reset its offsets to earliest first.
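A minimal sketch, applying those two entries to the ConsumerSettings from the question (Kafka client property values are plain strings, hence "false" rather than a boolean):

// Sketch only: reuses kafkaConsumerSettings and kafkaConfig from the question.
val replaySettings: ConsumerSettings[String, String] =
  kafkaConsumerSettings.withProperties(
    ConsumerConfig.ENABLE_AUTO_COMMIT_CONFIG -> "false",  // never save offsets
    ConsumerConfig.AUTO_OFFSET_RESET_CONFIG -> "earliest" // start from the beginning
  )

Consumer
  .plainSource(replaySettings, Subscriptions.topics(kafkaConfig.topic))
  .runWith(Sink.foreach(println))

If you would rather pin the starting position explicitly than rely on auto.offset.reset, Alpakka's Subscriptions.assignmentWithOffset is also worth a look: it lets plainSource start given partitions at a given offset.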

Related

How to set specific offset number while consuming message from Kafka topic through Spark streaming Scala

I am using the Spark Streaming Scala code below to consume real-time Kafka messages from a producer topic.
The issue is that my job sometimes fails due to server connectivity or some other reason, and since the auto-commit property is set to true in my code, some messages are lost and never stored in my database.
So I want to know whether there is a way to pull old Kafka messages from a specific offset number.
I tried setting "auto.offset.reset" to earliest or latest, but it only fetches new messages that have not yet been committed.
For example, say my current offset number is 1060 and the auto offset reset property is earliest; when I restart my job, it starts reading messages from 1061. But if I want to read old Kafka messages from offset number 1020, is there any property that we can use to start consuming messages from a specific offset number?
import org.apache.spark._
import org.apache.spark.streaming._
import org.apache.spark.streaming.kafka010.KafkaUtils
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.kafka.common.serialization.StringDeserializer
import io.confluent.kafka.serializers.KafkaAvroDeserializer

val conf = new SparkConf().setMaster("local[2]").setAppName("NetworkWordCount")
val ssc = new StreamingContext(conf, Seconds(1))

val topic = "test123"
val kafkaParams = Map[String, Object](
  "bootstrap.servers" -> "localhost:9092",
  "key.deserializer" -> classOf[StringDeserializer],
  "value.deserializer" -> classOf[KafkaAvroDeserializer],
  "schema.registry.url" -> "http://abc.test.com:8089",
  "group.id" -> "spark-streaming-notes",
  "auto.offset.reset" -> "earliest",
  "enable.auto.commit" -> (true: java.lang.Boolean)
)

val stream = KafkaUtils.createDirectStream[String, Object](
  ssc,
  PreferConsistent,
  Subscribe[String, Object](Set(topic), kafkaParams)
)

stream.print()
ssc.start()
ssc.awaitTermination()
From Spark Streaming, you can't. You'd need to use the kafka-consumer-groups CLI to commit offsets specific to your group id, or manually construct a KafkaConsumer instance and invoke commitSync before starting the Spark context.
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition

val c = new KafkaConsumer[String, Object](...)
val toCommit: java.util.Map[TopicPartition, OffsetAndMetadata] = ...
c.commitSync(toCommit) // but don't do this on every run of your app

ssc.start()
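For illustration, here is a minimal sketch of what constructing toCommit might look like for the question's scenario (topic test123, group spark-streaming-notes, and target offset 1020 come from the question; partition 0 is an assumption for the example):

import java.util.Collections
import org.apache.kafka.clients.consumer.{KafkaConsumer, OffsetAndMetadata}
import org.apache.kafka.common.TopicPartition
import org.apache.kafka.common.serialization.ByteArrayDeserializer

val props = new java.util.Properties()
props.put("bootstrap.servers", "localhost:9092")
props.put("group.id", "spark-streaming-notes") // must match the streaming job's group.id

// Commit offset 1020 for partition 0 (assumed) of test123, then let Spark start;
// the streaming job will resume from the committed position.
val rewinder = new KafkaConsumer(props, new ByteArrayDeserializer, new ByteArrayDeserializer)
try {
  rewinder.commitSync(Collections.singletonMap(
    new TopicPartition("test123", 0),
    new OffsetAndMetadata(1020L)
  ))
} finally rewinder.close()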
Alternatively, Structured Streaming does offer a startingOffsets option.
auto.offset.reset only applies to group.ids with no committed offsets.
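For completeness, a hedged Structured Streaming sketch of startingOffsets (it assumes an existing SparkSession named spark; the topic and offset are carried over from the question, and partition 0 is an assumption). Note that startingOffsets only takes effect when a query starts without an existing checkpoint:

// startingOffsets accepts a per-topic, per-partition JSON map of offsets.
val df = spark.readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("subscribe", "test123")
  .option("startingOffsets", """{"test123":{"0":1020}}""")
  .load()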

Flink Kafka Sink org.apache.kafka.common.errors.UnsupportedVersionException ERROR

Versions: Flink 1.11.3, Kafka 2.1.1.
My Flink data pipeline is kafka (source) -> flink -> kafka (sink).
When I first submit the job, it works well, but after the jobmanager or taskmanagers fail and are restarted, these exceptions occur:
2020-12-31 10:35:23.831 [objectOperator -> Sink: objectSink (1/1)] WARN o.apache.flink.streaming.connectors.kafka.FlinkKafkaProducer - Encountered error org.apache.kafka.common.errors.InvalidTxnStateException: The producer attempted a transactional operation in an invalid state. while recovering transaction KafkaTransactionState [transactionalId=objectOperator -> Sink: objectSink-bcabd9b643c47ab46ace22db2e1285b6-3, producerId=14698, epoch=7]. Presumably this transaction has been already committed before
2020-12-31 10:35:23.919 [userOperator -> Sink: userSink (1/1)] WARN org.apache.flink.runtime.taskmanager.Task - userOperator -> Sink: userSink (1/1) (2a5a171aa335f444740b4acfc7688d7c) switched from RUNNING to FAILED.
org.apache.kafka.common.errors.InvalidPidMappingException: The producer attempted to use a producer id which is not currently assigned to its transactional id.
2020-12-31 10:35:24.131 [objectOperator -> Sink: objectSink (1/1)] WARN org.apache.flink.runtime.taskmanager.Task - objectOperator -> Sink: objectSink (1/1) (07fe747a81b31e016e88ea6331b31433) switched from RUNNING to FAILED.
org.apache.kafka.common.errors.UnsupportedVersionException: Attempted to write a non-default producerId at version 1
I don't know why this error occurs.
My Kafka producer code:
Properties props = new Properties();
props.setProperty("bootstrap.servers", servers);
props.setProperty("transaction.timeout.ms", "30000");

FlinkKafkaProducer<CountModel> producer = new FlinkKafkaProducer<CountModel>(
    topic,
    (record, timestamp) -> new ProducerRecord<>(
        topic,
        Longs.toByteArray(record.getUserInKey()),
        JsonUtils.toJsonBytes(record)),
    props,
    FlinkKafkaProducer.Semantic.EXACTLY_ONCE);
I don't think it's a version issue. It seems that no one else has run into this exact error.
Each producer is assigned a unique PID when it is initialized. This PID is transparent to the application and is not exposed to the user at all. For a given PID, sequence numbers increase from 0, and each topic-partition has an independent sequence number. When the producer sends data, it attaches a sequence number to each message, and the server uses it to check whether the data is duplicated. The PID is globally unique, and a new PID is assigned when a producer restarts after a failure, which is one of the reasons idempotence cannot be achieved across sessions.
If you resume from a savepoint, the previous producerId will be used, while a new session would generate 1000 new producerIds (these ids run through the entire session, equivalent to the default value), so the resumed producerId will be non-default.

Kafka streams fail on decoding timestamp metadata inside StreamTask

We got strange errors from Kafka Streams while starting the app:
java.lang.IllegalArgumentException: Illegal base64 character 7b
at java.base/java.util.Base64$Decoder.decode0(Base64.java:743)
at java.base/java.util.Base64$Decoder.decode(Base64.java:535)
at java.base/java.util.Base64$Decoder.decode(Base64.java:558)
at org.apache.kafka.streams.processor.internals.StreamTask.decodeTimestamp(StreamTask.java:985)
at org.apache.kafka.streams.processor.internals.StreamTask.initializeTaskTime(StreamTask.java:303)
at org.apache.kafka.streams.processor.internals.StreamTask.initializeMetadata(StreamTask.java:265)
at org.apache.kafka.streams.processor.internals.AssignedTasks.initializeNewTasks(AssignedTasks.java:71)
at org.apache.kafka.streams.processor.internals.TaskManager.updateNewAndRestoringTasks(TaskManager.java:385)
at org.apache.kafka.streams.processor.internals.StreamThread.runOnce(StreamThread.java:769)
at org.apache.kafka.streams.processor.internals.StreamThread.runLoop(StreamThread.java:698)
at org.apache.kafka.streams.processor.internals.StreamThread.run(StreamThread.java:671)
and, as a result, an error about the failed stream: ERROR KafkaStreams - stream-client [xxx] All stream threads have died. The instance will be in error state and should be closed.
According to the code inside org.apache.kafka.streams.processor.internals.StreamTask, the failure happened due to an error decoding the timestamp metadata (StreamTask.decodeTimestamp()). It happened on prod, and we can't reproduce it on stage.
What could be the root cause of such errors?
Extra info: our app uses Kafka Streams and consumes messages from several Kafka brokers using the same application.id and state.dir (we are actually switching from one broker to another, but during some period we were connected to both brokers, so we had two Kafka Streams instances, one per broker). As I understand it, the consumer group lives on the broker side (so that shouldn't be a problem), but the state dir is on the client side. Maybe some race condition occurred due to using the same state.dir for two Kafka Streams instances? Could that be the root cause?
We use kafka-streams v.2.4.0, kafka-clients v.2.4.0, Kafka Broker v.1.1.1, with the following configs:
default.key.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
default.value.serde: org.apache.kafka.common.serialization.Serdes$StringSerde
default.timestamp.extractor: org.apache.kafka.streams.processor.WallclockTimestampExtractor
default.deserialization.exception.handler: org.apache.kafka.streams.errors.LogAndContinueExceptionHandler
commit.interval.ms: 5000
num.stream.threads: 1
auto.offset.reset: latest
Finally, we figured out the root cause of the metadata corrupted for some consumer groups.
It was one of our internal monitoring tools (written with pykafka) that corrupted the metadata of temporarily inactive consumer groups.
The metadata was unencrypted and contained invalid data like the following: {"consumer_id": "", "hostname": "monitoring-xxx"}.
In order to understand what exactly we have in consumer metadata, we could use the following code:
import java.util.Map;
import java.util.Set;
import java.util.stream.Collectors;
import org.apache.kafka.clients.consumer.*;
import org.apache.kafka.common.TopicPartition;
import org.apache.kafka.common.serialization.ByteArrayDeserializer;

Map<String, Object> config = Map.of("group.id", "...", "bootstrap.servers", "...");
String topicName = "...";
Consumer<byte[], byte[]> kafkaConsumer =
    new KafkaConsumer<>(config, new ByteArrayDeserializer(), new ByteArrayDeserializer());
Set<TopicPartition> topicPartitions = kafkaConsumer.partitionsFor(topicName).stream()
    .map(partitionInfo -> new TopicPartition(topicName, partitionInfo.partition()))
    .collect(Collectors.toSet());
kafkaConsumer.committed(topicPartitions).forEach((key, value) ->
    System.out.println("Partition: " + key + " metadata: " + (value != null ? value.metadata() : null)));
Several options to fix the already corrupted metadata:
Change the consumer group to a new one. Be careful: you might lose or duplicate messages depending on the latest or earliest offset reset policy, so for some cases this option might not be acceptable.
Overwrite the metadata manually (the timestamp is encoded according to the logic inside StreamTask.decodeTimestamp()):
Map<TopicPartition, OffsetAndMetadata> updatedTopicPartitionToOffsetMetadataMap =
    kafkaConsumer.committed(topicPartitions).entrySet().stream()
        .collect(Collectors.toMap(
            Map.Entry::getKey,
            entry -> new OffsetAndMetadata(entry.getValue().offset(), "AQAAAXGhcf01")));
kafkaConsumer.commitSync(updatedTopicPartitionToOffsetMetadataMap);
Or specify the metadata as Af//////////, which means NO_TIMESTAMP in Kafka Streams.
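For reference, a small sketch of the encoding behind those strings (assuming the layout used by StreamTask.decodeTimestamp() in kafka-streams 2.4: Base64 over one version byte followed by a big-endian long):

import java.nio.ByteBuffer
import java.util.Base64

// Decode commit metadata the way StreamTask.decodeTimestamp does:
// Base64 -> 1 version byte -> 8-byte big-endian timestamp.
def decodeTimestamp(metadata: String): Long = {
  val buf = ByteBuffer.wrap(Base64.getDecoder.decode(metadata))
  buf.get()     // version byte (1 for this format version)
  buf.getLong() // the timestamp; -1L encodes NO_TIMESTAMP
}

decodeTimestamp("AQAAAXGhcf01") // a concrete epoch-millis timestamp
decodeTimestamp("Af//////////") // -1L, i.e. NO_TIMESTAMP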

Spark consumer not receiving Kafka messages

I have a Spark Scala consumer that connects to my Kafka brokers on another cluster (the Kafka cluster is separate from the CDH cluster). params are my Kafka params and are being picked up correctly.
val incomingstream = KafkaUtils.createDirectStream[String, String](
  streamingContext, .....](topicSet, params))

print(incomingstream)
I am able to produce and consume on the console of my Kafka cluster. But when running the Spark consumer with the above code, it just keeps waiting, and even though I send messages through the Kafka console producer, nothing shows up in the log prints; incomingstream doesn't get printed.
I have connectivity from the node where the Spark job is running to the Kafka cluster. Submitting in YARN mode. The logs show a connection to the Kafka brokers. (Not sure if the issue is because of Kerberos... the logs don't say so.)
Using CDH 5.10
Spark 2.2
Kafka 0.10
Scala 2.11.8
EDIT: Kafka params passed in as below. Connecting fine to my Kafka brokers from the Spark job - the logs confirm it:
"bootstrap.servers" -> "<domain>:9092",
"key.deserializer" -> classOf[StringDeserializer],
"value.deserializer" -> classOf[StringDeserializer],
"auto.offset.reset" -> "latest",
"enable.auto.commit" -> (false: java.lang.Boolean),
"security.protocol" -> "PLAINTEXT"
My Kafka listener is configured as PLAINTEXT (not SSL), but if I pass the above, it complains:
Selector:375 - Connection with 10.18.63.18 disconnected
java.io.EOFException
at org.apache.kafka.common.network.NetworkReceive.readFromReadableChannel(NetworkReceive.java:83)
at org.apache.kafka.common.network.NetworkReceive.readFrom(NetworkReceive.java:71)

Consuming from the beginning of a kafka topic with Flink

How do I make sure I always consume from the beginning of a Kafka topic with Flink?
With the Kafka 0.9.x consumer that is part of Flink 1.0.2, it appears that it's no longer Kafka but Flink that controls the offset:
Flink snapshots the offsets internally as part of its
distributed checkpoints. The offsets committed to Kafka / ZooKeeper
are only to bring the outside view of progress in sync with Flink's
view of the progress. That way, monitoring and other jobs can get a
view of how far the Flink Kafka consumer has consumed a topic.
This is as far as I got, but my Flink program always starts where it left off and doesn't return to the beginning as the configuration instructs it to:
val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092")
props.setProperty("group.id", "myflinkservice")
props.setProperty("auto.offset.reset", "earliest")

val incomingData = env.addSource(
  new FlinkKafkaConsumer09[IncomingDataRecord](
    "my.topic.name",
    new IncomingDataSchema,
    props
  )
)
Use:
consumer.setStartFromEarliest();
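Expanded into the question's code, a minimal sketch (setStartFromEarliest is available on Flink's Kafka consumers from Flink 1.3 onward; it ignores committed group offsets on a fresh start, though positions restored from a checkpoint or savepoint still take precedence):

val consumer = new FlinkKafkaConsumer09[IncomingDataRecord](
  "my.topic.name",
  new IncomingDataSchema,
  props
)
consumer.setStartFromEarliest() // read from offset 0, ignoring committed offsets

val incomingData = env.addSource(consumer)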
I think you can get around this by specifying a random group.id:
val props = new Properties()
props.setProperty("bootstrap.servers", "localhost:9092")
props.setProperty("group.id", s"myflinkservice_${UUID.randomUUID}")
props.setProperty("auto.offset.reset", "smallest") // "smallest", not "earliest"
auto.offset.reset only works when there's no initial offset available in ZooKeeper.