spark-streaming read from specific event hub partition - scala

The Azure Event Hub "my_event_hub" has a total of 5 partitions ("0", "1", "2", "3", "4").
The readStream should only read events from partitions "0" and "4".
Event Hub configuration as streaming source:
import org.apache.spark.eventhubs.{EventHubsConf, EventPosition, NameAndPartition}

val name = "my_event_hub"
val connectionString = "my_event_hub_connection_string"
val max_events = 50
val positions = Map(
  new NameAndPartition(name, 0) -> EventPosition.fromEndOfStream,
  new NameAndPartition(name, 4) -> EventPosition.fromEndOfStream
)
val eventHubsConf = EventHubsConf(connectionString)
  .setStartingPositions(positions)
  .setMaxEventsPerTrigger(max_events)
Official doc for the structured-streaming-eventhubs-integration: https://github.com/Azure/azure-event-hubs-spark/blob/master/docs/structured-streaming-eventhubs-integration.md
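For reference, a minimal sketch of how the EventHubsConf above is wired into a readStream, following the linked integration doc (a SparkSession named spark is assumed to be in scope):

// Sketch per the linked doc: the configured EventHubsConf is passed to the "eventhubs" source as options.
val df = spark
  .readStream
  .format("eventhubs")
  .options(eventHubsConf.toMap)
  .load()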
Using the above configuration, the streaming application reads from all 5 partitions of the event hub.
Can we read from specific partitions only?
For example, read events only from the 2 partitions "0" and "4", with the checkpoint and offsets pointing to those specific partitions.

Related

Kafka as readstream source always returns 0 messages in the first iteration

I have a Structured Streaming job which has Kafka as source and Delta as sink. Each of the batches is processed inside a foreachBatch.
The problem I am facing is that I need this Structured Streaming job configured to trigger just once, but in that initial run Kafka always returns no records.
This is how I have configured the Structured Streaming process:
var kafka_stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_bootstrap_config)
  .option("subscribe", kafka_topic)
  .option("startingOffsets", "latest")
  .option("kafka.group.id", my_group_id) // Kafka client properties need the "kafka." prefix; "groupid" is not a recognized option
  .option("minOffsetsPerTrigger", "20")
  .load()

val kafka_stream_payload = kafka_stream.selectExpr("CAST(value AS STRING) AS msg")

kafka_stream_payload
  .writeStream
  .format("console")
  .queryName("my_query")
  .outputMode("append")
  .foreachBatch { (batchDF: DataFrame, batchId: Long) => process_micro_batch(batchDF) }
  .trigger(Trigger.AvailableNow())
  .start()
  .awaitTermination()
I tried to configure the Kafka readStream to pick up a minimum of 20 new messages by using "minOffsetsPerTrigger", "20". However, on every first iteration it keeps returning 0 new messages.
If I remove the .trigger(Trigger.AvailableNow()) option, during the second (and following) iterations the process reads an average of 200 new Kafka messages.
Is there a reason why I am getting 0 records during the first iteration, and how can I configure the readStream to enforce a minimum number of new messages?
Since you configured .option("startingOffsets", "latest"), it is possible to get 0 messages in the first iteration if no messages arrive in the Kafka topic in time. Try .option("startingOffsets", "earliest") instead, as in the sketch below,
or add ("auto.offset.reset", "earliest"),
or make sure data is being published to the Kafka topic continuously before you start your consumer.
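For reference, a minimal sketch of the readStream from the question with earliest starting offsets (variable names reused from the question):

val kafka_stream = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", kafka_bootstrap_config)
  .option("subscribe", kafka_topic)
  .option("startingOffsets", "earliest") // read the existing backlog on the first run instead of only new data
  .option("minOffsetsPerTrigger", "20")
  .load()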

How to read values from kafka topic using consumer through akka-streams/alpakka-kafka?

Running with a Consumer.plainSource, nothing happens. Isn't this one way of reading from a Kafka topic through streams?
val consumerSettings2 = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
  .withBootstrapServers("localhost:3333")
  .withGroupId("ssss")
  .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

val source: Source[ConsumerRecord[String, String], Consumer.Control] =
  Consumer.plainSource(consumerSettings2, Subscriptions.topics("candy"))

val sink =
  Sink.foreach[ConsumerRecord[String, String]](x => println("consumed " + x))

source.runWith(sink)
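For reference, a minimal self-contained sketch of consuming with Consumer.plainSource. It assumes Akka 2.6+ (where an implicit ActorSystem also provides the materializer), a broker reachable at localhost:9092, and that the "candy" topic has data; adjust those to your setup:

import akka.actor.ActorSystem
import akka.kafka.scaladsl.Consumer
import akka.kafka.{ConsumerSettings, Subscriptions}
import akka.stream.scaladsl.Sink
import org.apache.kafka.clients.consumer.ConsumerConfig
import org.apache.kafka.common.serialization.StringDeserializer

object PlainSourceExample extends App {
  // An implicit ActorSystem also provides the stream materializer on Akka 2.6+.
  implicit val system: ActorSystem = ActorSystem("consumer-example")

  val settings = ConsumerSettings(system, new StringDeserializer, new StringDeserializer)
    .withBootstrapServers("localhost:9092") // assumption: broker address; adjust to your setup
    .withGroupId("example-group")           // hypothetical group id
    .withProperty(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest")

  Consumer
    .plainSource(settings, Subscriptions.topics("candy"))
    .runWith(Sink.foreach(record => println(s"consumed ${record.value}")))
  // The stream keeps running on the ActorSystem's threads until system.terminate() is called.
}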

Flink crashes with "java.lang.IllegalArgumentException" after I send some data to a Kafka topic

I have a Flink program consuming a Kafka topic. I use Spark to send some messages (a JSON string I copied from the topic) to the topic (what I want to do is manually trigger the Flink computation). Then Flink crashes instantly with the following error:
java.lang.IllegalArgumentException
    at java.nio.Buffer.limit(Buffer.java:275)
    at kafka.message.Message.sliceDelimited(Message.scala:236)
    at kafka.message.Message.payload(Message.scala:218)
    at org.apache.flink.streaming.connectors.kafka.internals.SimpleConsumerThread.run(SimpleConsumerThread.java:338)
Can anyone tell me why this happened and how to resolve it?
Here is the Spark code I use to write the JSON string to Kafka:
// Connect Kafka
println("Connecting kafka")
val KAFKA_QUEUE_TIME = 5000
val KAFKA_BATCH_SIZE = 16384
val brokerList = "kafka05broker01.cnsuning.com:9092,kafka05broker02.cnsuning.com:9092,kafka05broker03.cnsuning.com:9092"
val props = new Properties
props.put("serializer.class", "kafka.serializer.StringEncoder")
props.put("partitioner.class", "utils.SimplePartitioner")
props.put("metadata.broker.list", brokerList)
props.put("producer.type", "async")
props.put("queue.time", KAFKA_QUEUE_TIME.toString)
props.put("batch.size", KAFKA_BATCH_SIZE.toString)
val config = new ProducerConfig(props)
val producer = new Producer[AnyRef, AnyRef](config)

// Send Kafka
println("Sending msg to kafka")
val topic = "xxxxxx"
val msg = "xxx"
for (i <- 0 to 1000) {
  println(i)
  val randomPartition = "" + new Random().nextInt(255)
  val message = new KeyedMessage[AnyRef, AnyRef](topic, randomPartition, msg)
  producer.send(message)
}
Here is how I consume it in Flink:
val allActProperties = kafkaPropertiesGen(GroupId, BrokerServer, ZKConnect)
val streamConsumer = new FlinkKafkaConsumer08[TraitRecord](topic, new TraitRecordSchema(), allActProperties)
val stream: DataStream[TraitRecord] = env.addSource(streamConsumer).setParallelism(12)
Which version of Kafka are you using? It seems the Kafka jar version does not correspond to your Kafka broker version, or FlinkKafkaConsumer08 does not correspond to your Kafka version. See:
java.lang.IllegalArgumentException kafka console consumer
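If the old Scala producer API is part of the mismatch, here is a minimal sketch of the same send loop using the newer org.apache.kafka.clients.producer API (this is an assumption, not a verified fix; brokerList, topic, and msg are reused from the question code above):

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}

val props = new Properties
props.put("bootstrap.servers", brokerList)
props.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")

val producer = new KafkaProducer[String, String](props)
for (i <- 0 to 1000) {
  // Let the default partitioner pick the partition instead of using a random string key.
  producer.send(new ProducerRecord[String, String](topic, msg))
}
producer.close()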

How to create several partitions with Alpakka

I'm trying to create a simple producer which creates a topic with a number of partitions provided by configuration.
According to the Alpakka Producer Settings doc, any property from org.apache.kafka.clients.producer.ProducerConfig can be set in the kafka-clients section. And there is a num.partitions property, as described in the Producer API doc.
Thus, I added that property to my application.conf file as given below:
topic = "topic"
topic = ${?TOPIC}
# Properties for akka.kafka.ProducerSettings can be
# defined in this section or a configuration section with
# the same layout.
akka.kafka.producer {
# Tuning parameter of how many sends that can run in parallel.
parallelism = 100
parallelism = ${?PARALLELISM}
# Duration to wait for `KafkaConsumer.close` to finish.
close-timeout = 20s
# Fully qualified config path which holds the dispatcher configuration
# to be used by the producer stages. Some blocking may occur.
# When this value is empty, the dispatcher configured for the stream
# will be used.
use-dispatcher = "akka.kafka.default-dispatcher"
# The time interval to commit a transaction when using the `Transactional.sink` or `Transactional.flow`
eos-commit-interval = 100ms
# Properties defined by org.apache.kafka.clients.producer.ProducerConfig
# can be defined in this configuration section.
kafka-clients {
bootstrap.servers = "my-kafka:9092"
bootstrap.servers = ${?BOOTSTRAPSERVERS}
num.partitions = "3"
num.partitions = ${?NUM_PARTITIONS}
}
}
The producer application code is also given below:
object Main extends App {
  val config = ConfigFactory.load()
  implicit val system: ActorSystem = ActorSystem("producer")
  implicit val materializer: Materializer = ActorMaterializer()

  val producerConfigs = config.getConfig("akka.kafka.producer")
  val producerSettings = ProducerSettings(producerConfigs, new StringSerializer, new StringSerializer)
  val topic = config.getString("topic")

  val done: Future[Done] =
    Source(1 to 100000)
      .map(_.toString)
      .map(value => new ProducerRecord[String, String](topic, value))
      .runWith(Producer.plainSink(producerSettings))

  implicit val ec: ExecutionContextExecutor = system.dispatcher
  done onComplete {
    case Success(_)   => println("Done"); system.terminate()
    case Failure(err) => println(err.toString); system.terminate()
  }
}
But this doesn't work. The producer creates a topic with a single partition instead of the 3 partitions I've set in the configuration:
num.partitions = "3"
Finally, the kafkacat output is given below:
~$ kafkacat -b my-kafka:9092 -L
Metadata for all topics (from broker -1: my-kafka:9092/bootstrap):
 3 brokers:
  broker 2 at my-kafka-2.my-kafka-headless.default:9092
  broker 1 at my-kafka-1.my-kafka-headless.default:9092
  broker 0 at my-kafka-0.my-kafka-headless.default:9092
 1 topics:
  topic "topic" with 1 partitions:
    partition 0, leader 2, replicas: 2, isrs: 2
What is wrong? Is it possible to set properties from Kafka Producer API in kafka-clients section using Alpakka?
# Properties defined by org.apache.kafka.clients.producer.ProducerConfig
# can be defined in this configuration section.
As this says, ProducerConfig is for producer settings, not broker settings, which is what num.partitions is (I think you got lost in which table the property was shown in on the Apache Kafka docs; scroll to the top of it to see the proper header).
There is no way to set the number of partitions of a topic from the producer. You would need to use the AdminClient class to create the topic; the number of partitions is a parameter there, not a configuration property.
Sample code
import java.util.Properties
import scala.collection.JavaConverters._
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import org.apache.kafka.common.config.TopicConfig

val props = new Properties()
props.setProperty(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
val adminClient = AdminClient.create(props)

val numPartitions = 3
val replicationFactor = 3.toShort
val newTopic = new NewTopic("new-topic-name", numPartitions, replicationFactor)
val configs = Map(TopicConfig.COMPRESSION_TYPE_CONFIG -> "gzip")
// setting some topic-level configs
newTopic.configs(configs.asJava)

adminClient.createTopics(List(newTopic).asJavaCollection)
And then you can start the producer
It appears that the topic is getting auto-created, which is the default behavior for Kafka. If that is the case, you need to define the default number of partitions in the server.properties file on your broker.
# The default number of log partitions per topic. More partitions allow greater
# parallelism for consumption, but this will also result in more files across
# the brokers.
num.partitions=3

Apache Kafka: How to receive latest message from Kafka?

I am consuming and processing messages in a Kafka consumer application using Spark in Scala. Sometimes it takes a little more time than usual to process messages from the Kafka message queue. At that point I need to consume only the latest message, ignoring the earlier ones which have been published by the producer but are yet to be consumed.
Here is my consumer code:
object KafkaSparkConsumer extends MessageProcessor {

  def main(args: scala.Array[String]): Unit = {
    val properties = readProperties()

    val streamConf = new SparkConf().setMaster("local[*]").setAppName("Kafka-Stream")
    val ssc = new StreamingContext(streamConf, Seconds(1))

    val group_id = Random.alphanumeric.take(4).mkString("dfhSfv")
    val kafkaParams = Map(
      "metadata.broker.list" -> properties.getProperty("broker_connection_str"),
      "zookeeper.connect" -> properties.getProperty("zookeeper_connection_str"),
      "group.id" -> group_id,
      "auto.offset.reset" -> properties.getProperty("offset_reset"),
      "zookeeper.session.timeout" -> properties.getProperty("zookeeper_timeout"))

    val msgStream = KafkaUtils.createStream[scala.Array[Byte], String, DefaultDecoder, StringDecoder](
      ssc,
      kafkaParams,
      Map("moved_object" -> 1),
      StorageLevel.MEMORY_ONLY_SER
    ).map(_._2)

    msgStream.foreachRDD { x =>
      x.foreach { msg =>
        println("Message: " + msg)
        processMessage(msg)
      }
    }

    ssc.start()
    ssc.awaitTermination()
  }
}
Is there any way to make sure the consumer always gets the most recent message in the consumer application? Or do I need to set any property in Kafka configuration to achieve the same?
Any help on this would be greatly appreciated. Thank you
The Kafka consumer API includes the method
void seekToEnd(Collection<TopicPartition> partitions)
So you can get the assigned partitions from the consumer and seek all of them to the end. There is a similar method, seekToBeginning.
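A rough sketch of that approach with a plain KafkaConsumer (broker address and group id are assumptions; the topic name is reused from the question; an initial poll is needed so the consumer actually has an assignment):

import java.time.Duration
import java.util.{Arrays, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.StringDeserializer

val props = new Properties
props.put("bootstrap.servers", "localhost:9092")             // assumption: broker address
props.put("group.id", "latest-only-group")                   // hypothetical group id
props.put("key.deserializer", classOf[StringDeserializer].getName)
props.put("value.deserializer", classOf[StringDeserializer].getName)

val consumer = new KafkaConsumer[String, String](props)
consumer.subscribe(Arrays.asList("moved_object"))            // topic name reused from the question
consumer.poll(Duration.ofMillis(0))                          // join the group so partitions get assigned
consumer.seekToEnd(consumer.assignment())                    // lazily evaluated; applied on the next poll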
You can leverage two KafkaConsumer APIs to get the very last message from a partition (assuming log compaction won't be an issue):
public Map<TopicPartition, Long> endOffsets(Collection<TopicPartition> partitions): This gives you the end offset of the given partitions. Note that the end offset is the offset of the next message to be delivered.
public void seek(TopicPartition partition, long offset): Run this for each partition and provide its end offset from the above call minus 1 (assuming it's greater than 0).
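A sketch of that endOffsets/seek variant, continuing from the consumer set up in the earlier sketch (assumes the partitions are non-empty):

import scala.collection.JavaConverters._

// Position each assigned partition on its last existing record.
val assigned = consumer.assignment()
val endOffsets = consumer.endOffsets(assigned)   // offset of the *next* record per partition
endOffsets.asScala.foreach { case (tp, end) =>
  if (end > 0) consumer.seek(tp, end - 1)        // last record = end offset minus one
}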
You can always generate a new (random) group id when connecting to Kafka - that way you will start consuming new messages when you connect.
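A sketch of that idea (broker address is an assumption; note it relies on auto.offset.reset defaulting to "latest", so a brand-new group starts at the end of each partition):

import java.util.{Properties, UUID}

val props = new Properties
props.put("bootstrap.servers", "localhost:9092")        // assumption: broker address
props.put("group.id", "consumer-" + UUID.randomUUID())  // fresh group id on every start
// auto.offset.reset defaults to "latest", so the new group only sees messages published after it connects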
Yes, you can set startingOffsets to latest to consume the latest messages.
val spark = SparkSession
  .builder
  .appName("kafka-reading")
  .getOrCreate()

import spark.implicits._

val df = spark
  .readStream
  .format("kafka")
  .option("kafka.bootstrap.servers", "localhost:9092")
  .option("startingOffsets", "latest")
  .option("subscribe", topicName)
  .load()