How to create a topic with several partitions using Alpakka - apache-kafka

I'm trying to create a simple producer that creates a topic with a number of partitions provided by configuration.
According to the Alpakka Producer Settings documentation, any property from org.apache.kafka.clients.producer.ProducerConfig can be set in the kafka-clients section. And there is a num.partitions property, as described in the Producer API docs.
Thus, I added that property to my application.conf file as shown below:
topic = "topic"
topic = ${?TOPIC}
# Properties for akka.kafka.ProducerSettings can be
# defined in this section or a configuration section with
# the same layout.
akka.kafka.producer {
  # Tuning parameter of how many sends that can run in parallel.
  parallelism = 100
  parallelism = ${?PARALLELISM}

  # Duration to wait for `KafkaConsumer.close` to finish.
  close-timeout = 20s

  # Fully qualified config path which holds the dispatcher configuration
  # to be used by the producer stages. Some blocking may occur.
  # When this value is empty, the dispatcher configured for the stream
  # will be used.
  use-dispatcher = "akka.kafka.default-dispatcher"

  # The time interval to commit a transaction when using the `Transactional.sink` or `Transactional.flow`
  eos-commit-interval = 100ms

  # Properties defined by org.apache.kafka.clients.producer.ProducerConfig
  # can be defined in this configuration section.
  kafka-clients {
    bootstrap.servers = "my-kafka:9092"
    bootstrap.servers = ${?BOOTSTRAPSERVERS}
    num.partitions = "3"
    num.partitions = ${?NUM_PARTITIONS}
  }
}
The producer application code is also given below:
object Main extends App {
  val config = ConfigFactory.load()
  implicit val system: ActorSystem = ActorSystem("producer")
  implicit val materializer: Materializer = ActorMaterializer()

  val producerConfigs = config.getConfig("akka.kafka.producer")
  val producerSettings = ProducerSettings(producerConfigs, new StringSerializer, new StringSerializer)
  val topic = config.getString("topic")

  val done: Future[Done] =
    Source(1 to 100000)
      .map(_.toString)
      .map(value => new ProducerRecord[String, String](topic, value))
      .runWith(Producer.plainSink(producerSettings))

  implicit val ec: ExecutionContextExecutor = system.dispatcher
  done onComplete {
    case Success(_)   => println("Done"); system.terminate()
    case Failure(err) => println(err.toString); system.terminate()
  }
}
But this doesn't work. The producer creates a topic with a single partition instead of the 3 partitions I set in the configuration:
num.partitions = "3"
Finally, the kafkacat output is given below:
~$ kafkacat -b my-kafka:9092 -L
Metadata for all topics (from broker -1: my-kafka:9092/bootstrap):
3 brokers:
broker 2 at my-kafka-2.my-kafka-headless.default:9092
broker 1 at my-kafka-1.my-kafka-headless.default:9092
broker 0 at my-kafka-0.my-kafka-headless.default:9092
1 topics:
topic "topic" with 1 partitions:
partition 0, leader 2, replicas: 2, isrs: 2
What is wrong? Is it possible to set properties from the Kafka Producer API in the kafka-clients section using Alpakka?

# Properties defined by org.apache.kafka.clients.producer.ProducerConfig
# can be defined in this configuration section.
As this says, ProducerConfig is for producer settings, not broker settings, which is what num.partitions is (I think you got lost in which table the property was shown in on the Apache Kafka docs... scroll to the top of it to see the proper header).
There is no way to set the partitions of a topic from the producer... You would need to use the AdminClient class to create a topic, and the number of partitions is a parameter there, not a configuration property.
Sample code:
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import org.apache.kafka.common.config.TopicConfig
import scala.collection.JavaConverters._

val props = new Properties()
props.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
val adminClient = AdminClient.create(props)

val numPartitions = 3
val replicationFactor = 3.toShort
val newTopic = new NewTopic("new-topic-name", numPartitions, replicationFactor)

// setting some topic-level configs
val configs = Map(TopicConfig.COMPRESSION_TYPE_CONFIG -> "gzip")
newTopic.configs(configs.asJava)

adminClient.createTopics(List(newTopic).asJavaCollection)
And then you can start the producer.
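For example, a minimal sketch (hedged; it reuses adminClient and newTopic from the sample above and producerSettings from the question) of waiting for the topic to exist before running the Alpakka producer stream:
// Block until the broker has actually created the topic; createTopics is
// asynchronous and only its returned futures complete when the operation is done.
adminClient.createTopics(List(newTopic).asJavaCollection).all().get()
adminClient.close()

// Only now start the Alpakka producer stream against the freshly created topic.
val done: Future[Done] =
  Source(1 to 100000)
    .map(_.toString)
    .map(value => new ProducerRecord[String, String]("new-topic-name", value))
    .runWith(Producer.plainSink(producerSettings))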

It appears that the topic is getting created automatically by the broker (auto topic creation), which is the default behavior for Kafka. If that is the case, you need to define the default number of partitions in the server.properties file for your broker.
# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=3
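If you cannot easily inspect server.properties on the brokers, one way to confirm the effective default is to ask a broker for its configuration. A hedged sketch (assuming the AdminClient API from kafka-clients, and that one of your brokers has id 0 as in the kafkacat output):
import java.util.{Collections, Properties}
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig}
import org.apache.kafka.common.config.ConfigResource

val props = new Properties()
props.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "my-kafka:9092")
val adminClient = AdminClient.create(props)

// Describe broker 0 and print its num.partitions setting, which is what
// auto-created topics will use.
val broker0 = new ConfigResource(ConfigResource.Type.BROKER, "0")
val configs = adminClient.describeConfigs(Collections.singleton(broker0)).all().get()
println(configs.get(broker0).get("num.partitions"))

adminClient.close()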

Related

How to read values from kafka topic using consumer through akka-streams/alpakka-kafka?

Running with a Consumer.plainSource, nothing happens. Isn't this one way of reading from a Kafka topic through streams?
val consumerSettings2 = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
  .withBootstrapServers("localhost:3333")
  .withGroupId("ssss")
  .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

val source: Source[ConsumerRecord[String, String], Consumer.Control] =
  Consumer.plainSource(consumerSettings2, Subscriptions.topics("candy"))

val sink =
  Sink.foreach[ConsumerRecord[String, String]](x => println("consumed " + x))

source.runWith(sink)
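For reference, this snippet only runs once it is materialized inside a running ActorSystem; a minimal, hypothetical harness around it (same settings as above, object and system names are purely illustrative) might look like:
import akka.actor.ActorSystem
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.kafka.scaladsl.Consumer
import akka.stream.ActorMaterializer
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

object ConsumerMain extends App {
  // ConsumerSettings(system, ...) and runWith(...) both need these in scope.
  implicit val system: ActorSystem = ActorSystem("consumer-test")
  implicit val materializer: ActorMaterializer = ActorMaterializer()

  val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:3333")
    .withGroupId("ssss")
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

  // The stream keeps polling until the ActorSystem is terminated, so the
  // program must stay alive for the "consumed ..." lines to appear.
  Consumer
    .plainSource(settings, Subscriptions.topics("candy"))
    .runWith(Sink.foreach(record => println("consumed " + record)))
}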

Flink crash with “java.lang.IllegalArgumentException” after I fetch some data to kafka topic

I have a Flink program consuming a Kafka topic. I use Spark to send some messages (a JSON string I copied from the topic) to the topic (what I want to do is manually trigger the Flink computation). Then Flink crashed instantly with the following error:
java.lang.IllegalArgumentException
at java.nio.Buffer.limit(Buffer.java:275)
at kafka.message.Message.sliceDelimited(Message.scala:236)
at kafka.message.Message.payload(Message.scala:218)
at org.apache.flink.streaming.connectors.kafka.internals.SimpleConsumerThread.run(SimpleConsumerThread.java:338)
Can anyone tell me why this happened and how to resolve it?
Here is the Spark code I use to write the JSON string to Kafka:
// Connect Kafka
println("Connecting kafka")
val KAFKA_QUEUE_TIME = 5000
val KAFKA_BATCH_SIZE = 16384
val brokerList = "kafka05broker01.cnsuning.com:9092,kafka05broker02.cnsuning.com:9092,kafka05broker03.cnsuning.com:9092"

val props = new Properties
props.put("serializer.class", "kafka.serializer.StringEncoder")
props.put("partitioner.class", "utils.SimplePartitioner")
props.put("metadata.broker.list", brokerList)
props.put("producer.type", "async")
props.put("queue.time", "5000")
props.put("batch.size", "16384")

val config = new ProducerConfig(props)
val producer = new Producer[AnyRef, AnyRef](config)

// Send Kafka
println("Sending msg to kafka")
val topic = "xxxxxx"
val msg = "xxx"
for (i <- 0 to 1000) {
  println(i)
  val randomPartition = "" + new Random().nextInt(255)
  val message = new KeyedMessage[AnyRef, AnyRef](topic, randomPartition, msg)
  producer.send(message)
}
Here is how I consume it in Flink:
val allActProperties = kafkaPropertiesGen(GroupId, BrokerServer, ZKConnect)
val streamConsumer = new FlinkKafkaConsumer08[TraitRecord](topic, new TraitRecordSchema(), allActProperties)
val stream: DataStream[TraitRecord] = env.addSource(streamConsumer).setParallelism(12)
Which version of Kafka are you using? It seems the Kafka jar version does not correspond to your Kafka version, or FlinkKafkaConsumer08 does not correspond to your Kafka version. Visit:
java.lang.IllegalArgumentException kafka console consumer
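As an illustration only, version alignment usually happens in the build definition; a hedged sbt sketch (the version number and connector choice below are placeholders, pick the artifact that matches your actual broker and Flink versions):
// build.sbt (illustrative): use the flink-connector-kafka-* artifact whose suffix
// matches the broker version, together with the corresponding FlinkKafkaConsumerXX class.
val flinkVersion = "1.4.2" // placeholder

libraryDependencies ++= Seq(
  "org.apache.flink" %% "flink-streaming-scala"     % flinkVersion,
  // e.g. a 0.9+ broker pairs with the 0.9 connector and FlinkKafkaConsumer09,
  // not with FlinkKafkaConsumer08 from flink-connector-kafka-0.8.
  "org.apache.flink" %% "flink-connector-kafka-0.9" % flinkVersion
)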

Apache Flink: Kafka Producer in IDE execution not working as expected

I have a sample streaming WordCount example written in Flink (Scala). In it, I want to put the result into Kafka using the Flink Kafka producer. But it is not working as expected.
My code is as follows:
object WordCount {
  def main(args: Array[String]) {
    // set up the execution environment
    val env = StreamExecutionEnvironment
      .getExecutionEnvironment
      .setStateBackend(new RocksDBStateBackend("file:///path/to/checkpoint", true))

    // start a checkpoint every 1000 ms
    env.enableCheckpointing(1000)
    // set mode to exactly-once (this is the default)
    env.getCheckpointConfig.setCheckpointingMode(CheckpointingMode.EXACTLY_ONCE)
    // make sure 500 ms of progress happen between checkpoints
    env.getCheckpointConfig.setMinPauseBetweenCheckpoints(500)
    // checkpoints have to complete within one minute, or are discarded
    env.getCheckpointConfig.setCheckpointTimeout(60000)
    // prevent the tasks from failing if an error happens in their checkpointing, the checkpoint will just be declined.
    env.getCheckpointConfig.setFailOnCheckpointingErrors(false)
    // allow only one checkpoint to be in progress at the same time
    env.getCheckpointConfig.setMaxConcurrentCheckpoints(1)

    // prepare Kafka consumer properties
    val kafkaConsumerProperties = new Properties
    kafkaConsumerProperties.setProperty("zookeeper.connect", "localhost:2181")
    kafkaConsumerProperties.setProperty("group.id", "flink")
    kafkaConsumerProperties.setProperty("bootstrap.servers", "localhost:9092")

    // prepare Kafka producer properties (used by the sink below)
    val kafkaProperties = new Properties
    kafkaProperties.setProperty("bootstrap.servers", "localhost:9092")

    // set up Kafka Consumer
    val kafkaConsumer = new FlinkKafkaConsumer[String]("input", new SimpleStringSchema, kafkaConsumerProperties)

    println("Executing WordCount example.")
    // get text from Kafka
    val text = env.addSource(kafkaConsumer)

    val counts: DataStream[(String, Int)] = text
      // split up the lines in pairs (2-tuples) containing: (word,1)
      .flatMap(_.toLowerCase.split("\\W+"))
      .filter(_.nonEmpty)
      .map((_, 1))
      // group by the tuple field "0" and sum up tuple field "1"
      .keyBy(0)
      .mapWithState((in: (String, Int), count: Option[Int]) =>
        count match {
          case Some(c) => ((in._1, c), Some(c + in._2))
          case None    => ((in._1, 1), Some(in._2 + 1))
        })

    // emit result
    println("Printing result to stdout.")
    counts.map(_.toString()).addSink(new FlinkKafkaProducer[String]("output", new SimpleStringSchema,
      kafkaProperties))

    // execute program
    env.execute("Streaming WordCount")
  }
}
The data I sent to the Kafka input topic is:
hi
hello
I don't get any output in the Kafka output topic. Since I am a newbie to Apache Flink, I don't know how to achieve the expected result. Can anyone help me achieve the correct behavior?
I ran your code in my local environment, and everything is OK. I think you can try the command below:
./kafka-console-consumer.sh --bootstrap-server localhost:9092 --topic output --from-beginning

Kafka consumer not returning any events

The Scala Kafka consumer below is not returning any events from the poll call.
However, the topic is correct, and I can see events being sent to the topic using the console consumer:
/opt/kafka_2.11-0.10.1.0/bin/kafka-console-consumer.sh --bootstrap-server kafka:9092 --topic my_topic --from-beginning
I also see the topic in my Scala code sample below when I step through it with a debugger and invoke kafkaConsumer.listTopics().
Also, this is called from a single unit test, so I'm only creating one instance of this trait and consumer (i.e. another consumer instance can't be consuming the messages). I'm also using a random group_id.
Is there anything wrong with the below code/configuration?
import java.util.Properties

import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.{ByteArrayDeserializer, StringDeserializer}

import scala.util.Random

trait KafkaTest {
  val kafkaConsumerProperties = new Properties()
  kafkaConsumerProperties.put("bootstrap.servers", "kafka:9092")
  kafkaConsumerProperties.put("group.id", Random.alphanumeric.take(10).mkString)
  kafkaConsumerProperties.put("key.deserializer", classOf[ByteArrayDeserializer])
  kafkaConsumerProperties.put("value.deserializer", classOf[StringDeserializer])

  val kafkaConsumer = new KafkaConsumer[String, String](kafkaConsumerProperties)
  kafkaConsumer.subscribe(java.util.Collections.singletonList("my_topic"))

  def checkKafkaHasReceivedEvent(): Assertion = {
    val kafkaEvents = kafkaConsumer.poll(2000) // Always returns 0 events?
    ...
  }
}
Increasing the poll timeout doesn't help either.
To read from the beginning, the AUTO_OFFSET_RESET_CONFIG property has to be set to earliest; by default it is "latest":
kafkaConsumerProperties.put(
  ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,
  OffsetResetStrategy.EARLIEST.toString().toLowerCase())
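Put together with the question's setup, a minimal sketch (broker address and topic name taken from the question; the key deserializer is simplified to StringDeserializer for consistency with the consumer's type parameters) could look like this:
import java.util.{Collections, Properties}

import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer, OffsetResetStrategy}
import org.apache.kafka.common.serialization.StringDeserializer

import scala.util.Random

val props = new Properties()
props.put("bootstrap.servers", "kafka:9092")
props.put("group.id", Random.alphanumeric.take(10).mkString)
props.put("key.deserializer", classOf[StringDeserializer])
props.put("value.deserializer", classOf[StringDeserializer])
// A brand-new (random) group id has no committed offsets, so without this it
// starts at "latest" and poll() sees nothing that was produced earlier.
props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG,
  OffsetResetStrategy.EARLIEST.toString.toLowerCase)

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Collections.singletonList("my_topic"))

val records = consumer.poll(2000)
println(s"fetched ${records.count()} records")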

Kafka Streaming Reset Issues

I have been attempting to build a Kafka Streams application for use with Spark. I have a static dataset for testing. After running my code once through, Kafka sets the current offset such that I cannot re-process the data on a second run. Running kafka-streams-application-reset supposedly resets the offsets. However, re-running my code results in an empty GlobalKTable. The only way I have been able to re-analyze the data is by changing the ID in my Kafka connection. Here is what I'm doing.
Set up the sample data in Kafka:
kafka-console-producer --broker-list localhost:9092 \
--topic testTopic \
--property "parse.key=true" \
--property "key.separator=:"
1:abcd
2:bcde
3:cdef
4:defg
5:efgh
6:fghi
7:ghij
8:hijk
9:ijkl
10:jklm
Scala code:
//Streams imports - need to update Kafka
import org.apache.kafka.common.serialization.Serdes
//import org.apache.kafka.common.utils.Bytes
import org.apache.kafka.streams._
import org.apache.kafka.streams.kstream.{GlobalKTable, KStream, KTable, Materialized, Produced, KStreamBuilder}
import org.apache.kafka.streams.StreamsConfig
import org.apache.kafka.streams.state.{KeyValueIterator, QueryableStoreTypes, ReadOnlyKeyValueStore, KeyValueStore}
import org.apache.kafka.streams.state.Stores
import org.apache.kafka.clients.consumer.{ConsumerConfig, KafkaConsumer}
import java.util.{Properties}

val kafkaServer = "127.0.0.1:9092"

val p = new Properties()
p.put(StreamsConfig.APPLICATION_ID_CONFIG, "testStream")
p.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, kafkaServer)
p.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass())
p.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass())
p.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")
p.put(StreamsConfig.CLIENT_ID_CONFIG, "test-consumer-stream")
val config = new StreamsConfig(p)

val builder: StreamsBuilder = new StreamsBuilder()
val imkvs = Stores.inMemoryKeyValueStore("testLookup-stream")
val sBuilder = Stores.keyValueStoreBuilder(imkvs, Serdes.String, Serdes.String).withLoggingDisabled().withCachingEnabled()

val gTable: GlobalKTable[String, String] = builder.globalTable("testTopic", Materialized.as(imkvs).withKeySerde(Serdes.String()).withValueSerde(Serdes.String()).withCachingDisabled())

val streams: KafkaStreams = new KafkaStreams(builder.build(), config)
streams.start()

val read: ReadOnlyKeyValueStore[String, String] = streams.store(gTable.queryableStoreName(), QueryableStoreTypes.keyValueStore[String, String]())

val hexLookup = "2"
println(read.get(hexLookup))

val iter: KeyValueIterator[String, String] = read.all()
while (iter.hasNext) {
  val next = iter.next()
  println(next.key + ": " + next.value)
}
Streams Reset command:
kafka-streams-application-reset --application-id testStream \
--bootstrap-servers localhost:9092 \
--to-earliest
1) Am I coding something wrong, or is kafka-streams-application-reset not functioning correctly?
2) I had hoped that using an inMemoryKeyValueStore would result in Kafka not keeping track of the current offset; is there a way to force a GlobalKTable to not keep the current offset? I want to always search the entire dataset.
Software Versions:
Kafka 1.1.1-1
Confluent 4.1.1-1
Spark-Scala 2.3.1
kafka-clients 1.1.0
kafka-streams 1.1.0
If you want to restart an application from an empty internal state and re-process the data from offset 0, you have to provide the "--input-topics" parameter with a comma-separated list of topics.
bin/kafka-streams-application-reset.sh --application-id testApplication1 --input-topics demoTopic1
You can find more details here: https://kafka.apache.org/10/documentation/streams/developer-guide/app-reset-tool
Regarding GlobalKTable: ideally it is a materialized view on top of a stream/topic, just like any other queryable store.
Also, GlobalKTable always applies the "auto.offset.reset" strategy "earliest", regardless of the value specified in StreamsConfig.
So it should allow you to query the entire table at any time.
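Not part of the original answer, just a hedged sketch reusing the names from the question: the global store is only fully populated after the KafkaStreams instance has finished restoring, so it can help to wait for the RUNNING state before querying it.
import org.apache.kafka.streams.KafkaStreams
import org.apache.kafka.streams.state.QueryableStoreTypes

streams.start()

// The GlobalKTable is bootstrapped from the earliest offset at startup; wait until
// the instance reports RUNNING so the store reflects the whole topic.
while (streams.state() != KafkaStreams.State.RUNNING) {
  Thread.sleep(100)
}

val read = streams.store(gTable.queryableStoreName(),
  QueryableStoreTypes.keyValueStore[String, String]())
println(s"entries in store: ${read.approximateNumEntries()}")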