Spark Kafka Batch write with SASL mechanism throwing Timeout Exception with topic not present in metadata - scala

I am reading data from a Cassandra database, applying a few transformations to it, and sending the result to Kafka with the batch .save() method. I am also creating a KafkaProducer to set the properties. But every time I get the error below:
Caused by: org.apache.kafka.common.errors.TimeoutException: Topic XXXXXXXXXXXXX not present in metadata after 60000 ms.
All configurations and credentials are set.
The same code works locally, where no SASL mechanism is configured, but on the cluster I get the exception above.
Please help.
System.setProperty("java.security.auth.login.config","/apps/xxxx/jaas.conf")
val props = new Properties()
props.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer","org.apache.kafka.common.serialization.StringSerializer")
props.put("acks","all")
props.put("bootstrap.servers","xxxxxx1:9095,xxxxxx2:9095,xxxxx3:9095")
props.put("ssl.truststore.location","/home/xxxxx/ocrptrust.jks")
props.put("ssl.truststore.password","xxxxxxxxx")
props.put("sasl.mechanism","SCRAM-SHA-512")
props.put("sasl.jaas.config","org.apache.kafka.common.security.scram.ScramLoginModule required username=\"username\" password=\"password\";")
props.put("security.protocol","SASL_SSL")
val producerConfig = new KafkaProducer[String,String](props)
jsonRead.selectExpr("CAST(householdID AS STRING) AS key", "to_json(struct(*)) AS value")
  .write.format("kafka")
  .option("key.serializer","org.apache.kafka.common.serialization.StringSerializer")
  .option("value.serializer","org.apache.kafka.common.serialization.StringSerializer")
  .option("acks","all")
  .option("ssl.truststore.location","/home/xxxxx/ocrptrust.jks")
  .option("ssl.truststore.password","xxxxxx")
  .option("sasl.mechanism","SCRAM-SHA-512")
  .option("sasl.jaas.config","org.apache.kafka.common.security.scram.ScramLoginModule required username=\"username\" password=\"password\";")
  .option("security.protocol","SASL_SSL")
  .option("kafka.bootstrap.servers","xxxxxx1:9095,xxxxxx2:9095,xxxxx3:9095")
  .option("topic", "xxxxxxxxxxx").save()
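One thing worth checking, offered as a sketch rather than a confirmed diagnosis: Spark's Kafka sink forwards a producer setting to the underlying producer only when the option key starts with kafka., and it configures the serializers itself, so key.serializer and value.serializer should not be passed at all. The separately created KafkaProducer is also not the producer that .save() uses. In the code above only kafka.bootstrap.servers carries the prefix, so the SASL and SSL settings may never reach the producer, which would be consistent with a metadata timeout that appears only on the secured cluster. The helper below adds the prefix and drops the serializer keys; KafkaSinkOptions and forSparkSink are hypothetical names, not part of Spark.

```scala
object KafkaSinkOptions {
  // Spark sets the serializers itself, so these keys are dropped.
  private val managedBySpark = Set("key.serializer", "value.serializer")

  /** Returns the settings with every key prefixed by "kafka." (added once). */
  def forSparkSink(producerSettings: Map[String, String]): Map[String, String] =
    producerSettings
      .map { case (key, value) => key.stripPrefix("kafka.") -> value }
      .filter { case (key, _) => !managedBySpark.contains(key) }
      .map { case (key, value) => s"kafka.$key" -> value }
}

// Possible usage with the DataFrame from the question (the "topic" option
// is a Spark option and keeps its unprefixed name):
//   jsonRead.selectExpr(...)
//     .write.format("kafka")
//     .options(KafkaSinkOptions.forSparkSink(settings))
//     .option("topic", topicName)
//     .save()
```

Here settings would be a plain Map holding security.protocol, sasl.mechanism, sasl.jaas.config, the truststore entries and bootstrap.servers; both settings and topicName are placeholder names.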

Related

Kafka TimeoutException: Topic not present in metadata after 60000 ms

I'm trying out some Kafka basics, following the examples at https://kafka.apache.org/quickstart. After starting ZooKeeper and Kafka, I tried producing and consuming with the included Kafka shell scripts and it all worked without issue.
When I try to produce a message from a simple Scala application, I get the following error: org.apache.kafka.common.errors.TimeoutException: Topic quickstart-events not present in metadata after 60000 ms.
I ensured the topic has been created and can telnet to localhost:9092 as well.
Here's the code I'm using for producer:
val props = new Properties()
props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")
props.put(ProducerConfig.CLIENT_ID_CONFIG, "test")
props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, classOf[StringSerializer].getName)
val producer = new KafkaProducer[String, String](props)
producer.send(new ProducerRecord[String, String]("quickstart-events", "1", "some event")).get()
I am running this on a Mac; the code above is part of a test case executed in IntelliJ.
Solved. I was using kafka-clients library version 2.6.0 against a Kafka server running version 3.2.0. Matching the library version to the server fixed the issue.
I ran into this problem as well, but the versions matched in my case.
I figured out that the cause was missing SASL configuration.
Try:
// set SASL configuration here
props.put(CommonClientConfigs.SECURITY_PROTOCOL_CONFIG, "SASL_PLAINTEXT");
props.put(SaslConfigs.SASL_MECHANISM, "PLAIN");
props.put("sasl.jaas.config",
    "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"alice\" password=\"123456\";");

java.lang.RuntimeException for Flink consumer connecting to Kafka cluster with multiple partitions

Flink Version 1.9.0
Scala Version 2.11.12
Kafka Cluster Version 2.3.0
I am trying to connect a Flink job I wrote to a Kafka cluster where the topic has 3 partitions. I have tested the job against a single-partition topic on a Kafka cluster running on my localhost, and reading from and writing to the local Kafka works. When I attempt to connect to a topic that has multiple partitions I get the following error (topicName is the name of the topic I am trying to consume). Weirdly, I don't have any issues when I produce to a multi-partition topic.
java.lang.RuntimeException: topicName
at org.apache.flink.streaming.connectors.kafka.internal.KafkaPartitionDiscoverer.getAllPartitionsForTopics(KafkaPartitionDiscoverer.java:80)
at org.apache.flink.streaming.connectors.kafka.internals.AbstractPartitionDiscoverer.discoverPartitions(AbstractPartitionDiscoverer.java:131)
at org.apache.flink.streaming.connectors.kafka.FlinkKafkaConsumerBase.open(FlinkKafkaConsumerBase.java:508)
at org.apache.flink.api.common.functions.util.FunctionUtils.openFunction(FunctionUtils.java:36)
at org.apache.flink.streaming.api.operators.AbstractUdfStreamOperator.open(AbstractUdfStreamOperator.java:102)
at org.apache.flink.streaming.runtime.tasks.StreamTask.openAllOperators(StreamTask.java:529)
at org.apache.flink.streaming.runtime.tasks.StreamTask.invoke(StreamTask.java:393)
at org.apache.flink.runtime.taskmanager.Task.doRun(Task.java:705)
at org.apache.flink.runtime.taskmanager.Task.run(Task.java:530)
at java.lang.Thread.run(Thread.java:748)
My consumer code looks like this:
def defineKafkaDataStream[A: TypeInformation](topic: String,
    env: StreamExecutionEnvironment,
    SASL_username: String,
    SASL_password: String,
    kafkaBootstrapServer: String = "localhost:9092",
    zookeeperHost: String = "localhost:2181",
    groupId: String = "test"
    )(implicit c: JsonConverter[A]): DataStream[A] = {
  val properties = new Properties()
  properties.setProperty("bootstrap.servers", kafkaBootstrapServer)
  properties.setProperty("security.protocol", "SASL_SSL")
  properties.setProperty("sasl.mechanism", "PLAIN")
  val jaasTemplate = "org.apache.kafka.common.security.plain.PlainLoginModule required username=\"%s\" password=\"%s\";"
  val jaasConfig = String.format(jaasTemplate, SASL_username, SASL_password)
  properties.setProperty("sasl.jaas.config", jaasConfig)
  properties.setProperty("group.id", "MyConsumerGroup")
  env
    .addSource(new FlinkKafkaConsumer(topic, new JSONKeyValueDeserializationSchema(true), properties))
    .map(x => x.convertTo[A](c))
}
Is there another property I should be setting to allow for a single job to consume from multiple partitions?
After digging around and questioning everything in my process I found the issue.
I looked at the Java code of the KafkaPartitionDiscoverer class, which threw the runtime exception.
One section I noticed throws the RuntimeException:
if (kafkaPartitions == null) {
    throw new RuntimeException("Could not fetch partitions for %s. Make sure that the topic exists.".format(topic));
}
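That snippet also seems to explain why the exception message was only the topic name. In Java, calling .format(topic) on a string literal resolves to the static String.format, so topic becomes the format string and the literal text is discarded. A small Scala illustration of what that call evaluates to (FormatPitfall and its method names are made up for this sketch):

```scala
object FormatPitfall {
  private val template =
    "Could not fetch partitions for %s. Make sure that the topic exists."

  // Equivalent of the Java expression template.format(topic): the static
  // String.format is invoked with `topic` as the format string, so the
  // template text never appears in the result.
  def asEvaluatedInJava(topic: String): String = String.format(topic)

  // What was presumably intended: the template is the format string.
  def asIntended(topic: String): String = String.format(template, topic)
}
```

With a topic called topicName, the first method yields just topicName, matching the "java.lang.RuntimeException: topicName" line in the stack trace.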
I was working against a Kafka cluster that I don't maintain, with a topic name that had been given to me and that I had not verified first. When I did verify it using:
kafka-topics --describe --zookeeper serverIP:2181 --topic topicName
it returned:
Error while executing topic command : Topics in [] does not exist
ERROR java.lang.IllegalArgumentException: Topics in [] does not exist
at kafka.admin.TopicCommand$.kafka$admin$TopicCommand$$ensureTopicExists(TopicCommand.scala:435)
at kafka.admin.TopicCommand$ZookeeperTopicService.describeTopic(TopicCommand.scala:350)
at kafka.admin.TopicCommand$.main(TopicCommand.scala:66)
at kafka.admin.TopicCommand.main(TopicCommand.scala)
After I got the correct topic name, everything worked.

Spark structured streaming with kafka throwing error after running for a while

I am observing weird behaviour while running a Spark Structured Streaming program. I am using an S3 bucket for metadata checkpointing.
The Kafka topic has 310 partitions.
When I start the streaming job for the first time, after every batch completes Spark creates a new file named after the batch_id in the offset directory at the checkpointing location. After successful
completion of few batches, spark job fails after few retries giving warning "WARN KafkaMicroBatchReader:66 - Set(logs-2019-10-04-77, logs-2019-10-04-85, logs-2019-10-04-71, logs-2019-10-04-93, logs-2019-10-04-97, logs-2019-10-04-101, logs-2019-10-04-89, logs-2019-10-04-81, logs-2019-10-04-103, logs-2019-10-04-104, logs-2019-10-04-102, logs-2019-10-04-98, logs-2019-10-04-94, logs-2019-10-04-90, logs-2019-10-04-74, logs-2019-10-04-78, logs-2019-10-04-82, logs-2019-10-04-86, logs-2019-10-04-99, logs-2019-10-04-91, logs-2019-10-04-73, logs-2019-10-04-79, logs-2019-10-04-87, logs-2019-10-04-83, logs-2019-10-04-75, logs-2019-10-04-92, logs-2019-10-04-70, logs-2019-10-04-96, logs-2019-10-04-88, logs-2019-10-04-95, logs-2019-10-04-100, logs-2019-10-04-72, logs-2019-10-04-76, logs-2019-10-04-84, logs-2019-10-04-80) are gone. Some data may have been missed.
Some data may have been lost because they are not available in Kafka any more; either the
data was aged out by Kafka or the topic may have been deleted before all the data in the
topic was processed. If you want your streaming query to fail on such cases, set the source
option "failOnDataLoss" to "false"."
The weird thing here is that the previous batch's offset file contains partition info for all 310 partitions, but the current batch is reading only selected partitions (see the warning message above).
I reran the job with .option("failOnDataLoss", false) but got the same warning, this time without the job failing. I observed that Spark was processing correct offsets for a few partitions, while for the rest of the partitions it was reading from the starting offset (0).
There were no connection issues between Spark and Kafka while this error was occurring (we checked the Kafka logs as well).
Could someone help with this? Am I doing something wrong or missing something?
Below is the read and write stream code snippet.
val kafkaDF = ss.readStream.format("kafka")
  .option("kafka.bootstrap.servers", kafkaBrokers /*"localhost:9092"*/)
  .option("subscribe", logs)
  .option("fetchOffset.numRetries", 5)
  .option("maxOffsetsPerTrigger", 30000000)
  .load()
val query = logDS
  .writeStream
  .foreachBatch {
    (batchDS: Dataset[Row], batchId: Long) =>
      batchDS.repartition(noofpartitions, batchDS.col("abc"), batchDS.col("xyz"))
        .write.mode(SaveMode.Append)
        .partitionBy("date", "abc", "xyz")
        .format("parquet")
        .saveAsTable(hiveTableName /*"default.logs"*/)
  }
  .trigger(Trigger.ProcessingTime(1800 + " seconds"))
  .option("checkpointLocation", s3bucketpath)
  .start()
Thanks in advance.

Configure Apache Kafka sink jdbc connector

I want to send the data published to the topic to a PostgreSQL database. So I followed this guide and configured the properties file like this:
name=transaction-sink
connector.class=io.confluent.connect.jdbc.JdbcSinkConnector
tasks.max=1
topics=transactions
connection.url=jdbc:postgresql://localhost:5432/db
connection.user=db-user
connection.password=
auto.create=true
insert.mode=insert
table.name.format=transaction
pk.mode=none
I start the connector with
./bin/connect-standalone etc/schema-registry/connect-avro-standalone.properties etc/kafka-connect-jdbc/sink-quickstart-postgresql.properties
The sink-connector is created but does not start due to this error:
Caused by: org.apache.kafka.common.errors.SerializationException: Error deserializing Avro message for id -1
Caused by: org.apache.kafka.common.errors.SerializationException: Unknown magic byte!
The schema is in Avro format and registered, and I can send (produce) messages to the topic and read (consume) from it. But I can't seem to send them to the database.
This is my ./etc/schema-registry/connect-avro-standalone.properties
key.converter=io.confluent.connect.avro.AvroConverter
key.converter.schema.registry.url=http://localhost:8081
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
This is a producer feeding the topic using the java-api:
properties.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class);
properties.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, KafkaAvroSerializer.class);
properties.put(AbstractKafkaAvroSerDeConfig.SCHEMA_REGISTRY_URL_CONFIG, "http://localhost:8081");
try (KafkaProducer<String, Transaction> producer = new KafkaProducer<>(properties)) {
    Transaction transaction = new Transaction();
    transaction.setFoo("foo");
    transaction.setBar("bar");
    UUID uuid = UUID.randomUUID();
    final ProducerRecord<String, Transaction> record = new ProducerRecord<>(TOPIC, uuid.toString(), transaction);
    producer.send(record);
}
I'm verifying data is properly serialized and deserialized using
./bin/kafka-avro-console-consumer --bootstrap-server localhost:9092 \
--property schema.registry.url=http://localhost:8081 \
--topic transactions \
--from-beginning --max-messages 1
The database is up and running.
This is not correct:
"The unknown magic byte can be due to an id field not being part of the schema"
What that error means is that the message on the topic was not serialised using the Schema Registry Avro serialiser.
How are you putting data on the topic?
Maybe all the messages have the problem, maybe only some—but by default this will halt the Kafka Connect task.
You can set
"errors.tolerance":"all",
to get it to ignore messages that it can't deserialise. But if none of them are correctly Avro-serialised this won't help, and you need to serialise them correctly or choose a different converter (e.g. if they're actually JSON, use the JsonConverter).
These references should help you more:
https://www.confluent.io/blog/kafka-connect-deep-dive-converters-serialization-explained
https://www.confluent.io/blog/kafka-connect-deep-dive-error-handling-dead-letter-queues
http://rmoff.dev/ksldn19-kafka-connect
Edit :
If you are serialising the key with StringSerializer then you need to use this in your Connect config:
key.converter=org.apache.kafka.connect.storage.StringConverter
You can set it at the worker (global property, applies to all connectors that you run on it), or just for this connector (i.e. put it in the connector properties itself, it will override the worker settings)
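Put together, a connector-level override could look roughly like the fragment below. It assumes keys were written with StringSerializer and values with KafkaAvroSerializer, as in the producer from the question, and it reuses the Schema Registry URL from the worker file. The errors.tolerance line is optional and only matters if some records still fail to deserialise.

```properties
# String keys, Avro values (assumption: matches how the producer serialises)
key.converter=org.apache.kafka.connect.storage.StringConverter
value.converter=io.confluent.connect.avro.AvroConverter
value.converter.schema.registry.url=http://localhost:8081
# Optional: skip records that cannot be deserialised instead of halting the task
errors.tolerance=all
```

Placed in the connector properties file, these lines override the worker settings for this connector only.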

Apache Spark: Getting a InstanceAlreadyExistsException when running the Kafka producer

I have a small app in Scala that creates a Kafka producer and runs with Apache Spark.
When I run the command
spark-submit --master local[2] --deploy-mode client <into the jar file> <app Name> <kafka broker> <kafka in queue> <kafka out queue> <interval>
I am getting this WARN:
WARN AppInfoParser: Error registering AppInfo mbean
javax.management.InstanceAlreadyExistsException: kafka.producer:type=app-info,id=
The code is not relevant, because I get this exception as soon as Scala creates the KafkaProducer: val producer = new KafkaProducer[Object, Object](...)
Does anybody have a solution for this?
Thank you!
When a Kafka Producer is created, it attempts to register an MBean using the client.id as its unique identifier.
There are two possibilities of why you are getting the InstanceAlreadyExistsException warning:
You are attempting to initialize more than one Producer at a time with the same client.id property on the same JVM.
You are not calling close() on an existing Producer before initializing another Producer. Calling close() unregisters the MBean.
If you leave the client.id property blank when initializing the producer, a unique one will be created for you. Giving your producers unique client.id values or allowing them to be auto-generated would resolve this problem.
In the case of Kafka, MBeans can be used for tracking statistics.
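A minimal sketch of the unique client.id approach, assuming a per-application prefix is acceptable. ProducerProps, uniqueClientId and newProducerProps are illustrative names, not part of the Kafka API. The code only builds the Properties, which would then be passed to the KafkaProducer constructor.

```scala
import java.util.{Properties, UUID}

object ProducerProps {
  /** A client.id that differs on every call, e.g. "myapp-<random uuid>". */
  def uniqueClientId(prefix: String): String =
    s"$prefix-${UUID.randomUUID()}"

  /** Base producer properties with a fresh client.id for each producer. */
  def newProducerProps(bootstrapServers: String, clientIdPrefix: String): Properties = {
    val props = new Properties()
    props.setProperty("bootstrap.servers", bootstrapServers)
    props.setProperty("client.id", uniqueClientId(clientIdPrefix))
    props
  }
}
```

If the same client.id has to be reused, calling close() on the old producer before creating its replacement remains the way to avoid the duplicate MBean registration.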