Kafka Streaming Reset Issues - scala

I have been attempting to build a Kafka Streams application for use with Spark. I have a static dataset for testing. After running my code once, Kafka sets the current offset such that I cannot re-process the data on a second run. Running kafka-streams-application-reset supposedly resets the offsets; however, re-running my code results in an empty GlobalKTable. The only way I have been able to re-analyze the data is by changing the application ID in my Kafka connection. Here is what I'm doing.
Setup the sample data in Kafka:
kafka-console-producer --broker-list localhost:9092 \
--topic testTopic \
--property "parse.key=true" \
--property "key.separator=:"
1:abcd
2:bcde
3:cdef
4:defg
5:efgh
6:fghi
7:ghij
8:hijk
9:ijkl
10:jklm
Scala code:
//Streams imports - need to update Kafka
import org.apache.kafka.common.serialization.Serdes
//import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams._
import org.apache.kafka.streams.kstream.{GlobalKTable, KStream, KTable, Materialized, Produced, KStreamBuilder}
import org.apache.kafka.streams.StreamsConfig
import org.apache.kafka.streams.state.{KeyValueIterator, QueryableStoreTypes, ReadOnlyKeyValueStore, KeyValueStore}
import org.apache.kafka.streams.state.Stores
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import java.util.{Properties}
val kafkaServer = "127.0.0.1:9092"
val p = new Properties()
p.put(StreamsConfig.APPLICATION_ID_CONFIG, "testStream")
p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaServer)
p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass())
p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass())
p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
p.put(StreamsConfig.CLIENT_ID_CONFIG, "test-consumer-stream")
val config = new StreamsConfig(p)
val builder: StreamsBuilder = new StreamsBuilder()
val imkvs = Stores.inMemoryKeyValueStore("testLookup-stream")
val sBuilder = Stores.keyValueStoreBuilder(imkvs, Serdes.String, Serdes.String).withLoggingDisabled().withCachingEnabled()
val gTable: GlobalKTable[String, String] = builder.globalTable("testTopic", Materialized.as(imkvs).withKeySerde(Serdes.String()).withValueSerde(Serdes.String()).withCachingDisabled())
val streams: KafkaStreams = new KafkaStreams(builder.build(), config)
streams.start()
val read: ReadOnlyKeyValueStore[String, String] = streams.store(gTable.queryableStoreName(), QueryableStoreTypes.keyValueStore[String, String]())
val hexLookup = "2"
println(read.get(hexLookup))
val iter: KeyValueIterator[String, String] = read.all()
while(iter.hasNext) {
val next = iter.next()
println(next.key + ": " + next.value)
}
Streams Reset command:
kafka-streams-application-reset --application-id testStream \
--bootstrap-servers localhost:9092 \
--to-earliest
1) Am I coding something wrong, or is kafka-streams-application-reset not functioning correctly?
2) I had hoped that using an inMemoryKeyValueStore would result in Kafka not keeping track of the current offset; is there a way to force a GlobalKTable to not keep the current offset? I want to always search the entire dataset.
Software Versions:
Kafka 1.1.1-1
Confluent 4.1.1-1
Spark-Scala 2.3.1
kafka-clients 1.1.0
kafka-streams 1.1.0

If you want to restart an application from an empty internal state and re-process the data from offset 0, you have to provide the "--input-topics" parameter with a comma-separated list of topics.
bin/kafka-streams-application-reset.sh --application-id testApplication1 --input-topics demoTopic1
You can find more details here : https://kafka.apache.org/10/documentation/streams/developer-guide/app-reset-tool
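Applied to the setup in the question (application id testStream, input topic testTopic), the reset would look something like:
kafka-streams-application-reset --application-id testStream \
--bootstrap-servers localhost:9092 \
--input-topics testTopic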
Regarding GlobalKTable: it is essentially a materialized view on top of a stream/topic, just like any other queryable store.
Also, a GlobalKTable always applies the "auto.offset.reset" strategy "earliest", regardless of the value specified in StreamsConfig, so it should allow you to query the entire table at any time.

Related

write into kafka topic using spark and scala

I am reading data from Kafka topic and write back the data received into another Kafka topic.
Below is my code ,
import org.apache.spark.sql.types._
import org.apache.spark.sql.functions._
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.ForeachWriter
//loading data from kafka
val data = spark.readStream.format("kafka")
.option("kafka.bootstrap.servers", "*******:9092")
.option("subscribe", "PARAMTABLE")
.option("startingOffsets", "latest")
.load()
//Extracting value from Json
val schema = new StructType().add("PARAM_INSTANCE_ID",IntegerType).add("ENTITY_ID",IntegerType).add("PARAM_NAME",StringType).add("VALUE",StringType)
val df1 = data.selectExpr("CAST(value AS STRING)")
val dataDF = df1.select(from_json(col("value"), schema).as("data")).select("data.*")
//Insert into another Kafka topic
val topic = "SparkParamValues"
val brokers = "********:9092"
val writer = new KafkaSink(topic, brokers)
val query = dataDF.writeStream
.foreach(writer)
.outputMode("update")
.start().awaitTermination()
I am getting the below error,
<console>:47: error: not found: type KafkaSink
val writer = new KafkaSink(topic, brokers)
I am very new to Spark. Can someone suggest how to resolve this, or verify whether the above code is correct? Thanks in advance.
In Spark Structured Streaming, you can write to a Kafka topic after reading from another topic by using the existing DataStreamWriter for Kafka, or you can create your own sink by extending the ForeachWriter class.
Without using a custom sink:
You can use the code below to write a dataframe to Kafka, assuming df is the dataframe generated by reading from the Kafka topic.
The dataframe should have at least one column named value. If you have multiple columns, you should merge them into a single column named value. If the key column is not specified, the key will be null in the destination topic.
df.select("key", "value")
.writeStream
.format("kafka")
.option("kafka.bootstrap.servers", "host1:port1,host2:port2")
.option("topic", "<topicName>")
.start()
.awaitTermination()
Using a custom sink:
If you want to implement your own Kafka sink, you need to create a class that extends ForeachWriter, override its methods, and pass an instance of that class to the foreach() method.
// Using an anonymous class that extends ForeachWriter
df.writeStream.foreach(new ForeachWriter[Row] {
  // If you are writing a Dataset[String], use new ForeachWriter[String] instead
  def open(partitionId: Long, version: Long): Boolean = {
    // open the connection; return true to process this partition
    true
  }
  def process(record: Row): Unit = {
    // write the row to the connection
  }
  def close(errorOrNull: Throwable): Unit = {
    // close the connection
  }
}).start()
You can check this Databricks notebook for the implemented code (scroll down to the code under the Kafka Sink heading); I think that is the page you are referring to. To solve the issue, you need to make sure the KafkaSink class is available to your Spark code. You can keep both the Spark code file and the class file in the same package. If you are running in spark-shell, paste the KafkaSink class before pasting the Spark code, for example something like the sketch below.
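For reference, a minimal sketch of what such a KafkaSink class could look like. This is an assumption based on the usage new KafkaSink(topic, brokers) in the question, not the exact class from the notebook, and it assumes the dataframe has a single string column named value:
import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.spark.sql.{ForeachWriter, Row}

// Hypothetical minimal sink: one producer per partition, writing the "value" column to the topic
class KafkaSink(topic: String, servers: String) extends ForeachWriter[Row] {
  private var producer: KafkaProducer[String, String] = _

  def open(partitionId: Long, version: Long): Boolean = {
    val props = new Properties()
    props.put("bootstrap.servers", servers)
    props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
    producer = new KafkaProducer[String, String](props)
    true // returning true means this partition should be processed
  }

  def process(row: Row): Unit = {
    // assumes a single string column named "value"; adjust if your schema differs
    producer.send(new ProducerRecord[String, String](topic, row.getAs[String]("value")))
  }

  def close(errorOrNull: Throwable): Unit = {
    if (producer != null) producer.close()
  }
}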
Read the Structured Streaming Kafka integration guide to explore more.

How to read values from kafka topic using consumer through akka-streams/alpakka-kafka?

Running with a Consumer.plainSource, nothing happens. Isn't this one way of reading from a Kafka topic through streams?
val consumerSettings2 = ConsumerSettings(system,new StringDeserializer,new StringDeserializer)
.withBootstrapServers("localhost:3333")
.withGroupId("ssss")
.withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,"earliest")
val source: Source[ConsumerRecord[String, String], Consumer.Control] =
Consumer.plainSource(consumerSettings2, Subscriptions.topics("candy"))
val sink =
Sink.foreach[ConsumerRecord[String,String]](x=>println("consumed "+x))
source.runWith(sink)

Apache Flink: Kafka Producer in IDE execution not working as expected

I have a sample streaming WordCount example written in Flink (Scala). In it, I want to put the result in Kafka using Flink-Kafka producer. But it is not working as expected.
My code is as follows:
object WordCount {
def main(args: Array[String]) {
// set up the execution environment
val env = StreamExecutionEnvironment
.getExecutionEnvironment
.setStateBackend(new RocksDBStateBackend("file:///path/to/checkpoint", true))
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000)
// set mode to exactly-once (this is the default)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig.setCheckpointTimeout(60000)
// prevent the tasks from failing if an error happens in their checkpointing, the checkpoint will just be declined.
env.getCheckpointConfig.setFailOnCheckpointingErrors(false)
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
// prepare Kafka consumer properties
val kafkaConsumerProperties = new Properties
kafkaConsumerProperties.setProperty("zookeeper.connect", "localhost:2181")
kafkaConsumerProperties.setProperty("group.id", "flink")
kafkaConsumerProperties.setProperty("bootstrap.servers", "localhost:9092")
// set up Kafka Consumer
val kafkaConsumer = new FlinkKafkaConsumer[String]("input", new SimpleStringSchema, kafkaConsumerProperties)
println("Executing WordCount example.")
// get text from Kafka
val text = env.addSource(kafkaConsumer)
val counts: DataStream[(String, Int)] = text
// split up the lines in pairs (2-tuples) containing: (word,1)
.flatMap(_.toLowerCase.split("\\W+"))
.filter(_.nonEmpty)
.map((_, 1))
// group by the tuple field "0" and sum up tuple field "1"
.keyBy(0)
.mapWithState((in: (String, Int), count: Option[Int]) =>
count match {
case Some(c) => ((in._1, c), Some(c + in._2))
case None => ((in._1, 1), Some(in._2 + 1))
})
// emit result
println("Printing result to stdout.")
counts.map(_.toString()).addSink(new FlinkKafkaProducer[String]("output", new SimpleStringSchema,
kafkaConsumerProperties))
// execute program
env.execute("Streaming WordCount")
}
}
The data I sent to Kafka input topic is:
hi
hello
I don't get any output in the Kafka topic "output". Since I am a newbie to Apache Flink, I don't know how to achieve the expected result. Can anyone help me get the correct behavior?
I ran your code in my local environment and everything is OK. I think you can try the command below:
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic output --from-beginning

Kafka consumer not returning any events

The below Scala kafka consumer is not returning any events from the poll call.
However, the topic is correct, and I can see events being sent to the topic using the console consumer:
/opt/kafka_2.11-0.10.1.0/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic my_topic --from-beginning
I also see the topic in my Scala code sample below when I step through it with a debugger and invoke kafkaConsumer.listTopics()
Also, this is called from a single unit test, so I'm only creating one instance of this trait and consumer (i.e. another consumer instance can't be consuming the messages). I'm also using a random group_id.
Is there anything wrong with the below code/configuration?
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import scala.util.Random
trait KafkaTest {
val kafkaConsumerProperties = new Properties()
kafkaConsumerProperties.put("bootstrap.servers", "kafka:9092")
kafkaConsumerProperties.put("group.id", Random.alphanumeric.take(10).mkString)
kafkaConsumerProperties.put("key.deserializer", classOf[ByteArrayDeserializer])
kafkaConsumerProperties.put("value.deserializer", classOf[StringDeserializer])
val kafkaConsumer = new KafkaConsumer[String, String](kafkaConsumerProperties)
kafkaConsumer.subscribe(java.util.Collections.singletonList("my_topic"))
def checkKafkaHasReceivedEvent(): Assertion = {
val kafkaEvents = kafkaConsumer.poll(2000) // Always returns 0 events?
...
}
}
Increasing the poll timeout doesn't help either.
To read from the beginning, the AUTO_OFFSET_RESET_CONFIG property has to be set to "earliest"; by default it is "latest".
kafkaConsumerProperties.put(
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,
OffsetResetStrategy.EARLIEST.toString().toLowerCase())

Apache Kafka: How to receive latest message from Kafka?

I am consuming and processing messages in a Kafka consumer application using Spark in Scala. Sometimes it takes a little more time than usual to process messages from the Kafka message queue. At that point I need to consume the latest message, ignoring the earlier ones that have been published by the producer but are yet to be consumed.
Here is my consumer code:
object KafkaSparkConsumer extends MessageProcessor {
def main(args: scala.Array[String]): Unit = {
val properties = readProperties()
val streamConf = new SparkConf().setMaster("local[*]").setAppName("Kafka-Stream")
val ssc = new StreamingContext(streamConf, Seconds(1))
val group_id = Random.alphanumeric.take(4).mkString("dfhSfv")
val kafkaParams = Map("metadata.broker.list" -> properties.getProperty("broker_connection_str"),
"zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
"group.id" -> group_id,
"auto.offset.reset" -> properties.getProperty("offset_reset"),
"zookeeper.session.timeout" -> properties.getProperty("zookeeper_timeout"))
val msgStream = KafkaUtils.createStream[scala.Array[Byte], String, DefaultDecoder, StringDecoder](
ssc,
kafkaParams,
Map("moved_object" -> 1),
StorageLevel.MEMORY_ONLY_SER
).map(_._2)
msgStream.foreachRDD { x =>
x.foreach {
msg => println("Message: "+msg)
processMessage(msg)
}
}
ssc.start()
ssc.awaitTermination()
}
}
Is there any way to make sure the consumer always gets the most recent message in the consumer application? Or do I need to set any property in Kafka configuration to achieve the same?
Any help on this would be greatly appreciated. Thank you
The Kafka consumer API includes the method
void seekToEnd(Collection<TopicPartition> partitions)
So you can get the assigned partitions from the consumer and seek all of them to the end; there is a similar seekToBeginning method.
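A rough sketch with a plain KafkaConsumer (the consumer setup is an assumption, not the Spark DStream consumer from the question; the topic name comes from the question's code):
import java.util.Collections

// consumer: an already configured KafkaConsumer[String, String]
consumer.subscribe(Collections.singletonList("moved_object"))
consumer.poll(100)                          // first poll triggers partition assignment
consumer.seekToEnd(consumer.assignment())   // jump every assigned partition to its end
val latest = consumer.poll(1000)            // subsequent polls only return newly arriving records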
You can leverage two KafkaConsumer APIs to get the very last message from a partition (assuming log compaction won't be an issue):
public Map<TopicPartition, Long> endOffsets(Collection<TopicPartition> partitions): This gives you the end offset of the given partitions. Note that the end offset is the offset of the next message to be delivered.
public void seek(TopicPartition partition, long offset): Run this for each partition, providing its end offset from the above call minus 1 (assuming it's greater than 0).
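A sketch of that approach, again assuming a plain KafkaConsumer with partitions already assigned:
import scala.collection.JavaConverters._

val partitions = consumer.assignment()
// endOffsets returns the offset of the *next* message for each partition
val endOffsets = consumer.endOffsets(partitions).asScala

// seek each partition to (end - 1) so the next poll returns only its last record
endOffsets.foreach { case (tp, end) =>
  if (end > 0) consumer.seek(tp, end - 1)
}
val lastRecords = consumer.poll(1000)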
You can always generate a new (random) group id when connecting to Kafka - that way you will start consuming new messages when you connect.
Yes, you can set startingOffsets to latest to consume the latest messages.
val spark = SparkSession
.builder
.appName("kafka-reading")
.getOrCreate()
import spark.implicits._
val df = spark
.readStream
.format("kafka")
.option("kafka.bootstrap.servers", "localhost:9092")
.option("startingOffsets", "latest")
.option("subscribe", topicName)
.load()