The Scala Kafka consumer below is not returning any events from the poll call.
However, the topic is correct, and I can see events being sent to the topic using the console consumer:
/opt/kafka_2.11-0.10.1.0/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic my_topic --from-beginning
I also see the topic in my Scala code sample below when I step through it with a debugger and invoke kafkaConsumer.listTopics().
Also, this is called from a single unit test, so I'm only creating one instance of this trait and consumer (i.e. another consumer instance can't be consuming the messages). I'm also using a random group_id.
Is there anything wrong with the below code/configuration?
import java.util.Properties
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}
import scala.util.Random
trait KafkaTest {
val kafkaConsumerProperties = new Properties()
kafkaConsumerProperties.put("bootstrap.servers", "kafka:9092")
kafkaConsumerProperties.put("group.id", Random.alphanumeric.take(10).mkString)
kafkaConsumerProperties.put("key.deserializer", classOf[ByteArrayDeserializer])
kafkaConsumerProperties.put("value.deserializer", classOf[StringDeserializer])
val kafkaConsumer = new KafkaConsumer[String, String](kafkaConsumerProperties)
kafkaConsumer.subscribe(java.util.Collections.singletonList("my_topic"))
def checkKafkaHasReceivedEvent(): Assertion = {
val kafkaEvents = kafkaConsumer.poll(2000) // Always returns 0 events?
...
}
}
Increasing the poll timeout doesn't help either.
To read from the beginning of the topic, the AUTO_OFFSET_RESET_CONFIG property (auto.offset.reset) has to be set to "earliest"; by default it is "latest".
kafkaConsumerProperties.put(
ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,
OffsetResetStrategy.EARLIEST.toString().toLowerCase())
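As a minimal sketch of the setting above, the helper below builds the consumer configuration with plain string keys, so it needs nothing beyond the JDK. The names `ConsumerProps` and `earliestConsumerProps` and the `test-` group prefix are made up for illustration; the assumption is that the resulting Properties would then be passed to `new KafkaConsumer[String, String](props)`.

```scala
import java.util.{Properties, UUID}

object ConsumerProps {
  // Hypothetical helper, not part of Kafka: builds a consumer config for tests.
  def earliestConsumerProps(bootstrapServers: String): Properties = {
    val props = new Properties()
    props.put("bootstrap.servers", bootstrapServers)
    // A fresh group id has no committed offsets, so auto.offset.reset decides where reading starts.
    props.put("group.id", s"test-${UUID.randomUUID()}")
    // "earliest" starts from the oldest available record; the default is "latest".
    props.put("auto.offset.reset", "earliest")
    props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
    props
  }
}
```

Using the same deserializer type for keys as the consumer's type parameters (String here) avoids a mismatch between the declared key type and the bytes actually decoded.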
I'm new to Kafka and want to use Kafka 2.3 to implement a producer/consumer app.
I have downloaded and installed Kafka 2.3 on my Ubuntu server.
I found some code online and built it on my laptop in IDEA, but the consumer can't get any data.
I checked on my server that the topic exists.
I used kafka-console-consumer to check this topic and got its values successfully, but my own consumer gets nothing.
So what's wrong with my consumer?
Producer
package com.phitrellis.tool
import java.util.Properties
import java.util.concurrent.{Future, TimeUnit}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.clients.producer._
object MyKafkaProducer extends App {
def createKafkaProducer(): Producer[String, String] = {
val props = new Properties()
props.put("bootstrap.servers", "*:9092")
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("producer.type", "async")
props.put("acks", "all")
new KafkaProducer[String, String](props)
}
def writeToKafka(topic: String): Unit = {
val producer = createKafkaProducer()
val record = new ProducerRecord[String, String](topic, "key", "value22222222222")
println("start")
producer.send(record)
producer.close()
println("end")
}
writeToKafka("phitrellis")
}
Consumer
package com.phitrellis.tool
import java.util
import java.util.Properties
import java.time.Duration
import scala.collection.JavaConverters._
import org.apache.kafka.clients.consumer.KafkaConsumer
object MyKafkaConsumer extends App {
def createKafkaConsumer(): KafkaConsumer[String, String] = {
val props = new Properties()
props.put("bootstrap.servers", "*:9092")
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
// props.put("auto.offset.reset", "latest")
props.put("enable.auto.commit", "true")
props.put("auto.commit.interval.ms", "1000")
props.put("group.id", "test")
new KafkaConsumer[String, String](props)
}
def consumeFromKafka(topic: String) = {
val consumer: KafkaConsumer[String, String] = createKafkaConsumer()
consumer.subscribe(util.Arrays.asList(topic))
while (true) {
val records = consumer.poll(Duration.ofSeconds(2)).asScala.iterator
println("true")
for (record <- records){
print(record.value())
}
}
}
consumeFromKafka("phitrellis")
}
Two lines in your Consumer code are crucial:
props.put("auto.offset.reset", "latest")
props.put("group.id", "test")
To read from the beginning of the topic you have to set auto.offset.reset to earliest (latest causes the consumer to skip messages produced before it started).
group.id is responsible for group management. If you start processing data with some group.id and then restart your application, or start a new one with the same group.id, only new messages will be read, because the group resumes from its committed offsets and auto.offset.reset applies only when no committed offset exists.
For your tests I would suggest setting auto.offset.reset to earliest and changing group.id:
props.put("auto.offset.reset", "earliest")
props.put("group.id", "test123")
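The interaction between group.id and auto.offset.reset can be sketched as a small model. This is not Kafka code: `OffsetModel` and `startOffset` are hypothetical names, and the model deliberately ignores details such as the "none" policy and out-of-range committed offsets.

```scala
object OffsetModel {
  // Simplified model of where a consumer group starts reading one partition.
  // committed: the offset the group committed earlier, if any
  // beginning / end: the oldest and the next-to-be-written offsets of the partition
  def startOffset(committed: Option[Long], autoOffsetReset: String, beginning: Long, end: Long): Long =
    committed match {
      // The group has a committed offset: resume there, auto.offset.reset is not consulted.
      case Some(offset) => offset
      // No committed offset (for example a new group.id): the reset policy decides.
      case None =>
        autoOffsetReset match {
          case "earliest" => beginning
          case "latest"   => end
          case other      => throw new IllegalArgumentException(s"unsupported policy: $other")
        }
    }
}
```

In this model a new group with "earliest" starts at the beginning, a new group with "latest" starts at the end and sees only new records, and a group that already committed an offset resumes from it regardless of the policy.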
Additionally:
You have to remember that KafkaProducer::send returns Future<RecordMetadata> and messages are sent asynchronously; if your program finishes before the Future completes, the messages might not be sent.
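One way to make sure a send has completed before the program exits is to block on the returned future. The helper below is a sketch that depends only on java.util.concurrent; `SendAwait` and `awaitResult` are hypothetical names, and with Kafka the `future` argument is assumed to be the value returned by `producer.send(record)`.

```scala
import java.util.concurrent.{Future, TimeUnit}

object SendAwait {
  // Blocks until the future completes or the timeout expires.
  // Throws TimeoutException on timeout and ExecutionException if the send failed.
  def awaitResult[T](future: Future[T], timeoutMs: Long): T =
    future.get(timeoutMs, TimeUnit.MILLISECONDS)
}
```

A call such as `SendAwait.awaitResult(producer.send(record), 5000L)` would then surface send failures as exceptions instead of losing them silently. Calling `producer.flush()`, or `producer.close()` without a timeout, also waits for previously sent records to complete.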
There are two parts here: the producer side and the consumer side.
You don't say anything about the producer, so we're assuming it worked. However, did you check on the servers? You could check the Kafka log files to see whether there is any data on those particular topics/partitions.
On the consumer side, to validate, you should try to consume from that same topic using the command line, to make sure the data is in there. Look for "Kafka Consumer Console" at the following link, and follow those steps.
http://cloudurable.com/blog/kafka-tutorial-kafka-from-command-line/index.html
If there is data on the topic, then running that command should get you data. If there is not, it will just "hang" because it is waiting for data to be written to the topic.
In addition, you can try producing to the same topic using those command-line tools, to make sure your cluster is configured correctly, you have the right addresses and ports, the ports are not blocked, etc.
I am trying to use Apache Kafka through a Vagrant machine to run a simple Kafka consumer program. The program gets stuck before the for loop when it calls the .poll(100) method.
I have done a lot of digging into the deeper classes while debugging, but have not found much.
val TOPIC="testTopic"
val props = new Properties()
props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "192.168.56.10:9092")
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
props.put(ConsumerConfig.GROUP_ID_CONFIG, UUID.randomUUID().toString());
props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer")
val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(util.Collections.singletonList(TOPIC))
while(true) {
println("Test")
val records = consumer.poll(100)
for (record <- records.asScala) {
println(record)
}
println("Test2")
}
}
It currently outputs Test and then gets stuck with no error message. It is expected to output the contents of the Kafka topic.
You need to upgrade your kafka-clients version to 2.0.0 or above. The poll(long) method can get stuck in an internal loop, for example when the Kafka server is down, waiting for the broker to become available again.
According to KIP-266:
ConsumerRecords poll(long timeout)
Deprecated. Since 2.0. Use poll(Duration), which does not block beyond the timeout awaiting partition assignment. See KIP-266 for more information.
In your case:
import java.time.Duration
import org.apache.kafka.clients.consumer.KafkaConsumer
// ...
// poll(Duration) takes a java.time.Duration, not a scala.concurrent.duration.Duration
val timeout = Duration.ofMillis(100)
while(true) {
println("Test")
val records = consumer.poll(timeout)
for (record <- records.asScala) {
println(record)
}
println("Test2")
}
//...
In conclusion, you just need to use the new version of the KafkaConsumer class and pass the timeout to the new poll method as a java.time.Duration instance.
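Note that poll(Duration) expects a java.time.Duration. If timeouts elsewhere in the code are expressed as scala.concurrent.duration values, a small conversion along these lines may help; `PollTimeout` and `toJava` are hypothetical names used only for this sketch.

```scala
import java.time.{Duration => JDuration}
import scala.concurrent.duration.FiniteDuration

object PollTimeout {
  // Converts a Scala FiniteDuration into the java.time.Duration that poll(Duration) accepts.
  def toJava(d: FiniteDuration): JDuration = JDuration.ofNanos(d.toNanos)
}
```

With `import scala.concurrent.duration._` in scope, the call would then look like `consumer.poll(PollTimeout.toJava(100.millis))`.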
I have a sample streaming WordCount example written in Flink (Scala). In it, I want to put the result in Kafka using Flink-Kafka producer. But it is not working as expected.
My code is as follows:
object WordCount {
def main(args: Array[String]) {
// set up the execution environment
val env = StreamExecutionEnvironment
.getExecutionEnvironment
.setStateBackend(new RocksDBStateBackend("file:///path/to/checkpoint", true))
// start a checkpoint every 1000 ms
env.enableCheckpointing(1000)
// set mode to exactly-once (this is the default)
env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
// make sure 500 ms of progress happen between checkpoints
env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
// checkpoints have to complete within one minute, or are discarded
env.getCheckpointConfig.setCheckpointTimeout(60000)
// prevent the tasks from failing if an error happens in their checkpointing, the checkpoint will just be declined.
env.getCheckpointConfig.setFailOnCheckpointingErrors(false)
// allow only one checkpoint to be in progress at the same time
env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)
// prepare Kafka consumer properties
val kafkaConsumerProperties = new Properties
kafkaConsumerProperties.setProperty("zookeeper.connect", "localhost:2181")
kafkaConsumerProperties.setProperty("group.id", "flink")
kafkaConsumerProperties.setProperty("bootstrap.servers", "localhost:9092")
// set up Kafka Consumer
val kafkaConsumer = new FlinkKafkaConsumer[String]("input", new SimpleStringSchema, kafkaConsumerProperties)
println("Executing WordCount example.")
// get text from Kafka
val text = env.addSource(kafkaConsumer)
val counts: DataStream[(String, Int)] = text
// split up the lines in pairs (2-tuples) containing: (word,1)
.flatMap(_.toLowerCase.split("\\W+"))
.filter(_.nonEmpty)
.map((_, 1))
// group by the tuple field "0" and sum up tuple field "1"
.keyBy(0)
.mapWithState((in: (String, Int), count: Option[Int]) =>
count match {
case Some(c) => ((in._1, c), Some(c + in._2))
case None => ((in._1, 1), Some(in._2 + 1))
})
// emit result
println("Printing result to stdout.")
counts.map(_.toString()).addSink(new FlinkKafkaProducer[String]("output", new SimpleStringSchema,
kafkaProperties))
// execute program
env.execute("Streaming WordCount")
}
}
The data I sent to Kafka input topic is:
hi
hello
I don't get any output in Kafka topic output. Since I am a newbie to Apache Flink, I don't know how to achieve the expected result. Can anyone help me achieve the correct behavior?
I ran your code in my local environment, and everything worked. I think you can check the output topic with the command below:
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic output --from-beginning
I'm trying to create a simple producer which creates a topic with a number of partitions provided by configuration.
According to the Alpakka Producer Settings doc, any property from org.apache.kafka.clients.producer.ProducerConfig can be set in the kafka-clients section. And there is a num.partitions property, as documented in the Producer API doc.
Thus, I added that property to my application.conf file as given below:
topic = "topic"
topic = ${?TOPIC}
# Properties for akka.kafka.ProducerSettings can be
# defined in this section or a configuration section with
# the same layout.
akka.kafka.producer {
# Tuning parameter of how many sends that can run in parallel.
parallelism = 100
parallelism = ${?PARALLELISM}
# Duration to wait for `KafkaConsumer.close` to finish.
close-timeout = 20s
# Fully qualified config path which holds the dispatcher configuration
# to be used by the producer stages. Some blocking may occur.
# When this value is empty, the dispatcher configured for the stream
# will be used.
use-dispatcher = "akka.kafka.default-dispatcher"
# The time interval to commit a transaction when using the `Transactional.sink` or `Transactional.flow`
eos-commit-interval = 100ms
# Properties defined by org.apache.kafka.clients.producer.ProducerConfig
# can be defined in this configuration section.
kafka-clients {
bootstrap.servers = "my-kafka:9092"
bootstrap.servers = ${?BOOTSTRAPSERVERS}
num.partitions = "3"
num.partitions = ${?NUM_PARTITIONS}
}
}
The producer application code is also given below:
object Main extends App {
val config = ConfigFactory.load()
implicit val system: ActorSystem = ActorSystem("producer")
implicit val materializer: Materializer = ActorMaterializer()
val producerConfigs = config.getConfig("akka.kafka.producer")
val producerSettings = ProducerSettings(producerConfigs, new StringSerializer, new StringSerializer)
val topic = config.getString("topic")
val done: Future[Done] =
Source(1 to 100000)
.map(_.toString)
.map(value => new ProducerRecord[String, String](topic, value))
.runWith(Producer.plainSink(producerSettings))
implicit val ec: ExecutionContextExecutor = system.dispatcher
done onComplete {
case Success(_) => println("Done"); system.terminate()
case Failure(err) => println(err.toString); system.terminate()
}
}
But, this doesn't work. Producer creates a topic with a single partition instead of 3 partitions as I've set by configuration:
num.partitions = "3"
Finally, Kafkacat output is given below:
~$ kafkacat -b my-kafka:9092 -L
Metadata for all topics (from broker -1: my-kafka:9092/bootstrap):
3 brokers:
broker 2 at my-kafka-2.my-kafka-headless.default:9092
broker 1 at my-kafka-1.my-kafka-headless.default:9092
broker 0 at my-kafka-0.my-kafka-headless.default:9092
1 topics:
topic "topic" with 1 partitions:
partition 0, leader 2, replicas: 2, isrs: 2
What is wrong? Is it possible to set properties from Kafka Producer API in kafka-clients section using Alpakka?
# Properties defined by org.apache.kafka.clients.producer.ProducerConfig
# can be defined in this configuration section.
As this says, ProducerConfig is for producer settings, not broker settings, which is what num.partitions is (I think you got lost in which table the property was shown in on the Apache Kafka docs; scroll to the top of it to see the proper header).
There is no way to set the partitions of a topic from the producer. You would need to use the AdminClient class to create the topic, and the number of partitions is a parameter there, not a configuration property.
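Sample code
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import org.apache.kafka.common.config.TopicConfig
val props = new Properties()
props.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
val adminClient = AdminClient.create(props)
val numPartitions = 3
val replicationFactor = 3.toShort
val newTopic = new NewTopic("new-topic-name", numPartitions, replicationFactor)
// set some topic-level configs
val configs = Map(TopicConfig.COMPRESSION_TYPE_CONFIG -> "gzip")
newTopic.configs(configs.asJava)
// createTopics is asynchronous; block until the request has completed
adminClient.createTopics(List(newTopic).asJavaCollection).all().get()
adminClient.close()
And then you can start the producer.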
It appears that the topic is being created automatically, which is the default behavior for Kafka. If that is the case, you need to define the default number of partitions in the server.properties file for your broker.
# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=3
I have been attempting to build a Kafka Streaming application for use with Spark. I have a static dataset for testing. After running my code once through, Kafka sets the current offset such that I cannot re-process the data upon a second run. Running kafka-streams-application-reset supposedly resets the offsets. However, re-running my code results in an empty GlobalKTable. The only way I have been able to re-analyze the data is by changing my ID in my Kafka connection. Here is what I'm doing.
Set up the sample data in Kafka:
kafka-console-producer --broker-list localhost:9092 \
--topic testTopic \
--property "parse.key=true" \
--property "key.separator=:"
1:abcd
2:bcde
3:cdef
4:defg
5:efgh
6:fghi
7:ghij
8:hijk
9:ijkl
10:jklm
Scala code:
//Streams imports - need to update Kafka
import org.apache.kafka.common.serialization.Serdes
//import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams._
import org.apache.kafka.streams.kstream.{GlobalKTable, KStream, KTable, Materialized, Produced, KStreamBuilder}
import org.apache.kafka.streams.StreamsConfig
import org.apache.kafka.streams.state.{KeyValueIterator, QueryableStoreTypes, ReadOnlyKeyValueStore, KeyValueStore}
import org.apache.kafka.streams.state.Stores
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import java.util.{Properties}
val kafkaServer = "127.0.0.1:9092"
val p = new Properties()
p.put(StreamsConfig.APPLICATION_ID_CONFIG, "testStream")
p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaServer)
p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass())
p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass())
p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
p.put(StreamsConfig.CLIENT_ID_CONFIG, "test-consumer-stream")
val config = new StreamsConfig(p)
val builder: StreamsBuilder = new StreamsBuilder()
val imkvs = Stores.inMemoryKeyValueStore("testLookup-stream")
val sBuilder = Stores.keyValueStoreBuilder(imkvs, Serdes.String, Serdes.String).withLoggingDisabled().withCachingEnabled()
val gTable: GlobalKTable[String, String] = builder.globalTable("testTopic", Materialized.as(imkvs).withKeySerde(Serdes.String()).withValueSerde(Serdes.String()).withCachingDisabled())
val streams: KafkaStreams = new KafkaStreams(builder.build(), config)
streams.start()
val read: ReadOnlyKeyValueStore[String, String] = streams.store(gTable.queryableStoreName(), QueryableStoreTypes.keyValueStore[String, String]())
val hexLookup = "2"
println(read.get(hexLookup))
val iter: KeyValueIterator[String, String] = read.all()
while(iter.hasNext) {
val next = iter.next()
println(next.key + ": " + next.value)
}
Streams Reset command:
kafka-streams-application-reset --application-id testStream \
--bootstrap-servers localhost:9092 \
--to-earliest
1) Am I coding something wrong, or is kafka-streams-application-reset not functioning correctly?
2) I had hoped that using an inMemoryKeyValueStore would result in Kafka not keeping track of the current offset; is there a way to force a GlobalKTable to not keep the current offset? I want to always search the entire dataset.
Software Versions:
Kafka 1.1.1-1
Confluent 4.1.1-1
Spark-Scala 2.3.1
kafka-clients 1.1.0
kafka-streams 1.1.0
If you want to restart an application from an empty internal state and re-process the data from offset 0, you have to provide the "--input-topics" parameter with a comma-separated list of topics.
bin/kafka-streams-application-reset.sh --application-id testApplication1 --input-topics demoTopic1
You can find more details here: https://kafka.apache.org/10/documentation/streams/developer-guide/app-reset-tool
Regarding GlobalKTable, ideally it is a materialized view on top of a stream/topic, just like any other queryable store.
Also, GlobalKTable always applies the "auto.offset.reset" strategy "earliest", regardless of the value specified in StreamsConfig.
So it should allow you to query the entire table at any time.