I'm using a Kafka source in Spark Streaming to receive records generated with Datagen in Confluent Cloud, and I intend to use the Confluent Schema Registry.
Currently, this is the exception I am facing:
Exception in thread "main"
io.confluent.kafka.schemaregistry.client.rest.exceptions.RestClientException:
Unauthorized; error code: 401
The Confluent Cloud Schema Registry requires authentication data, and I don't know how to pass it:
basic.auth.credentials.source=USER_INFO
schema.registry.basic.auth.user.info=secret: secret
I think I have to pass this authentication data to CachedSchemaRegistryClient, but I'm not sure whether that's the right approach or how to do it.
// Set up the Avro deserialization UDF
schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 128)
kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
spark.udf.register("deserialize", (bytes: Array[Byte]) =>
  kafkaAvroDeserializer.deserialize(bytes)
)
If I try to pass the authentication data to the Schema Registry like this:
val restService = new RestService(schemaRegistryURL)
val props = Map(
  "basic.auth.credentials.source" -> "USER_INFO",
  "schema.registry.basic.auth.user.info" -> "secret:secret"
).asJava
var schemaRegistryClient = new CachedSchemaRegistryClient(restService, 100, props)
I get
Cannot resolve overloaded constructor CachedSchemaRegistryClient
so it seems that only two parameters can be passed to CachedSchemaRegistryClient. How do I fix this?
I came across this post, but it doesn't apply any authentication to the Schema Registry in Confluent Cloud.
This piece of code worked for me:
private val schemaRegistryUrl = "<schemaregistryURL>"
val props = Map(
  "basic.auth.credentials.source" -> "USER_INFO",
  "schema.registry.basic.auth.user.info" -> "<api-key>:<api-secret>"
).asJava
private val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 100, props)
Make sure you use the correct import for the Scala-to-Java map conversion:
import scala.collection.JavaConverters.mapAsJavaMapConverter
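For completeness, here is a minimal sketch of wiring the authenticated client into the deserialization UDF from the question. AvroDeserializer is the question's own wrapper class (assumed to delegate to KafkaAvroDeserializer), and the placeholders are the same ones used above:

import io.confluent.kafka.schemaregistry.client.CachedSchemaRegistryClient
import scala.collection.JavaConverters.mapAsJavaMapConverter

val schemaRegistryUrl = "<schemaregistryURL>"
val registryProps = Map(
  "basic.auth.credentials.source" -> "USER_INFO",
  "schema.registry.basic.auth.user.info" -> "<api-key>:<api-secret>"
).asJava

// The three-argument constructor accepts the extra config map carrying the auth settings.
val schemaRegistryClient = new CachedSchemaRegistryClient(schemaRegistryUrl, 100, registryProps)

// Reuse the authenticated client in the UDF shown in the question.
val kafkaAvroDeserializer = new AvroDeserializer(schemaRegistryClient)
spark.udf.register("deserialize", (bytes: Array[Byte]) =>
  kafkaAvroDeserializer.deserialize(bytes)
)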
Related
I'm using Flink streaming to read events from a Kafka source topic and, after de-duplication, writing them to a separate Kafka topic in Avro format.
Flow:
Kafka topic (JSON format) -> Flink streaming (de-duplication) -> Scala case class objects -> Kafka topic (Avro format)
val sink = sinkProvider.getKafkaSink(brokerURL, targetTopic, kafkaTransactionMaxTimeoutMs, kafkaTransactionTimeoutMs)
messageStream
  .map {
    record =>
      convertJsonToExample(record)
  }
  .sinkTo(sink)
  .name("Example Kafka Avro Sink")
  .uid("Example-Kafka-Avro-Sink")
Here are the steps I followed:
I created an Avro schema for my output:
{
  "type": "record",
  "name": "Example",
  "namespace": "ca.ix.dcn.test",
  "fields": [
    {"name": "x", "type": "string"},
    {"name": "y", "type": "long"}
  ]
}
From the Avro schema I generated a case class with avrohugger (version 1.2.1) for SpecificRecord.
I used Flink's AvroSerializationSchema.forSpecific because the Flink Kafka Avro sink lets you use either the specific-record or the generic-record constructor for serialization to Avro.
def getKafkaSink(brokers: String, targetTopic: String, transactionMaxTimeoutMs: String, transactionTimeoutMs: String) = {
  val schema = ReflectData.get.getSchema(classOf[Example])
  val sink = KafkaSink.builder()
    .setBootstrapServers(brokers)
    .setProperty("transaction.max.timeout.ms", transactionMaxTimeoutMs)
    .setProperty("transaction.timeout.ms", transactionTimeoutMs)
    .setRecordSerializer(KafkaRecordSerializationSchema.builder()
      .setTopic(targetTopic)
      .setValueSerializationSchema(AvroSerializationSchema.forSpecific[Example](classOf[Example]))
      .setPartitioner(new FlinkFixedPartitioner())
      .build()
    )
    .setDeliveryGuarantee(DeliveryGuarantee.EXACTLY_ONCE)
    .build()
  sink
}
Now when I run it I get the exception:
Caused by: org.apache.avro.AvroRuntimeException: java.lang.IllegalAccessException: Class org.apache.avro.specific.SpecificData can not access a member of class ca.ix.dcn.test with modifiers "private final"
at org.apache.avro.specific.SpecificData.createSchema(SpecificData.java:405)
at org.apache.avro.reflect.ReflectData.createSchema(ReflectData.java:734)
I saw there is a bug open on Flink for this:
https://issues.apache.org/jira/browse/FLINK-18478
But I didn't find any workaround for it.
Is there any workaround for this? Also, are there detailed examples that explain how to use the Flink streaming sink (for Avro) with AvroSerializationSchema (specific/generic)?
Appreciate the help on this.
In the Flink ticket that you're linking to, there's a comment noting that avrohugger is not really compatible with the Apache Avro Java library; see https://issues.apache.org/jira/browse/FLINK-18478?focusedCommentId=17164456&page=com.atlassian.jira.plugin.system.issuetabpanels%3Acomment-tabpanel#comment-17164456
The solution would be to generate Avro Java POJOs and use them in your Scala application.
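If regenerating the classes with the Avro Java compiler is not practical, another possible workaround (not from the answer above, just a sketch) is to bypass SpecificRecord entirely and write GenericRecord values with AvroSerializationSchema.forGeneric, reusing the schema from the question. Example and convertJsonToExample are the question's own types; the delivery-guarantee and transaction settings shown earlier would carry over unchanged:

import org.apache.avro.Schema
import org.apache.avro.generic.{GenericData, GenericRecord}
import org.apache.flink.api.common.typeinfo.TypeInformation
import org.apache.flink.connector.kafka.sink.{KafkaRecordSerializationSchema, KafkaSink}
import org.apache.flink.formats.avro.AvroSerializationSchema
import org.apache.flink.formats.avro.typeutils.GenericRecordAvroTypeInfo

// The same schema as in the question, parsed at runtime instead of via a generated class.
val avroSchema: Schema = new Schema.Parser().parse(
  """{"type":"record","name":"Example","namespace":"ca.ix.dcn.test",
    |"fields":[{"name":"x","type":"string"},{"name":"y","type":"long"}]}""".stripMargin)

// Convert the Scala case class to a GenericRecord by hand.
def toGenericRecord(e: Example): GenericRecord = {
  val rec = new GenericData.Record(avroSchema)
  rec.put("x", e.x)
  rec.put("y", e.y)
  rec
}

val genericSink = KafkaSink.builder[GenericRecord]()
  .setBootstrapServers(brokerURL)
  .setRecordSerializer(KafkaRecordSerializationSchema.builder[GenericRecord]()
    .setTopic(targetTopic)
    .setValueSerializationSchema(AvroSerializationSchema.forGeneric(avroSchema))
    .build())
  .build()

// Assuming the Scala DataStream API: provide Avro type information implicitly so that
// GenericRecord is not handled by Kryo.
implicit val genericRecordTypeInfo: TypeInformation[GenericRecord] =
  new GenericRecordAvroTypeInfo(avroSchema)

messageStream
  .map(record => toGenericRecord(convertJsonToExample(record)))
  .sinkTo(genericSink)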
I am trying to write Parquet to S3 in my Testcontainers LocalStack setup and get this error:
org.apache.hadoop.fs.s3a.RemoteFileChangedException: open `s3a://***.snappy.parquet': Change reported by S3 during open at position ***. ETag *** was unavailable
It works with real S3, and it worked with Spark 2.4 and Hadoop 2.7.
I am using: Scala 2.12.15, Spark 3.2.1, hadoop-aws 3.3.1, testcontainers-scala-localstack 0.40.8
The code is very simple; it just writes a DataFrame to an S3 location:
val path = "s3a://***"
import spark.implicits._
val df = Seq(UserRow("1", List("10", "20"))).toDF()
df.write.parquet(path)
You can disable bucket versioning at the moment you create the bucket.
Here is an example:
//create an S3 client using localstack container
S3Client s3Client = S3Client.builder ()
.endpointOverride (localStackContainer.getEndpointOverride (LocalStackContainer.Service.S3))
.credentialsProvider (StaticCredentialsProvider.create (AwsBasicCredentials
.create (localStackContainer.getAccessKey (), localStackContainer.getSecretKey ())))
.region (Region.of (localStackContainer.getRegion ()))
.build ();
// create desired bucket
s3Client.createBucket (builder -> builder.bucket (<your-bucket-name>));
//disable versioning on your bucket
s3Client.putBucketVersioning (builder -> builder
.bucket (<your-bucket-name>)
.versioningConfiguration (builder1 -> builder1
.status (BucketVersioningStatus.SUSPENDED)));
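If recreating the bucket is not an option, relaxing S3A's change detection on the Spark side may also avoid the RemoteFileChangedException. This is not from the answer above, only a sketch based on Hadoop 3.3's fs.s3a.change.detection.* options plus a typical LocalStack endpoint setup (the endpoint and credentials below are assumptions for a local test):

import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder()
  .appName("localstack-parquet-test")
  .master("local[*]")
  // Point S3A at the LocalStack container instead of real S3.
  .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:4566")
  .config("spark.hadoop.fs.s3a.path.style.access", "true")
  .config("spark.hadoop.fs.s3a.access.key", "test")
  .config("spark.hadoop.fs.s3a.secret.key", "test")
  // Change detection compares ETags/version IDs, which LocalStack may not report consistently.
  .config("spark.hadoop.fs.s3a.change.detection.mode", "warn")
  .config("spark.hadoop.fs.s3a.change.detection.version.required", "false")
  .getOrCreate()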
I am reading data from a Cassandra DB, applying a few transformations, and sending the data to Kafka via the .save() batch method. I am also creating a KafkaProducer to set the properties. But every time I run it I get the error below:
Caused by: org.apache.kafka.common.errors.TimeoutException: Topic XXXXXXXXXXXXX not present in metadata after 60000 ms.
All configurations and credentials are set.
The same code works fine locally, where there is no SASL mechanism; on the cluster I get the exception above.
Please help.
System.setProperty("java.security.auth.login.config","/apps/xxxx/jaas.conf")
val props = new Properties()
props.put("key.serializer","org.apache.kafka.common.serialization.StringSerializer")
props.put("value.serializer","org.apache.kafka.common.serialization.StringSerializer")
props.put("acks","all")
props.put("bootstrap.servers","xxxxxx1:9095,xxxxxx2:9095,xxxxx3:9095")
props.put("ssl.truststore.location","/home/xxxxx/ocrptrust.jks")
props.put("ssl.truststore.password","xxxxxxxxx")
props.put("sasl.mechanism","SCRAM-SHA-512")
props.put("sasl.jaas.config","org.apache.kafka.common.security.scram.ScramLoginModule required username=\"username\" password=\"password\";")
props.put("security.protocol","SASL_SSL")
val producerConfig = new KafkaProducer[String,String](props)
jsonRead.selectExpr("CAST(householdID AS STRING) AS key", "to_json(struct(*)) AS value")
  .write.format("kafka")
  .option("key.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  .option("value.serializer", "org.apache.kafka.common.serialization.StringSerializer")
  .option("acks", "all")
  .option("ssl.truststore.location", "/home/xxxxx/ocrptrust.jks")
  .option("ssl.truststore.password", "xxxxxx")
  .option("sasl.mechanism", "SCRAM-SHA-512")
  .option("sasl.jaas.config", "org.apache.kafka.common.security.scram.ScramLoginModule required username=\"username\" password=\"password\";")
  .option("security.protocol", "SASL_SSL")
  .option("kafka.bootstrap.servers", "xxxxxx1:9095,xxxxxx2:9095,xxxxx3:9095")
  .option("topic", "xxxxxxxxxxx").save()
Using the 1.0.0 Kafka admin client, I wish to programmatically create a topic on the broker. I happen to be using Scala. I've tried using the following code to either create a topic on the Kafka broker or simply list the available topics:
import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, ListTopicsOptions, NewTopic}
import scala.collection.JavaConverters._
val zkServer = "localhost:2181"
val topic = "test1"
val zookeeperConnect = zkServer
val sessionTimeoutMs = 10 * 1000
val connectionTimeoutMs = 8 * 1000
val partitions = 1
val replication:Short = 1
val topicConfig = new Properties() // add per-topic configurations settings here
import org.apache.kafka.clients.admin.AdminClientConfig
val config = new Properties
config.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, zkServer)
val admin = AdminClient.create(config)
val existing = admin.listTopics(new ListTopicsOptions().timeoutMs(500).listInternal(true))
val nms = existing.names()
nms.get().asScala.foreach(nm => println(nm)) // nms.get() fails
val newTopic = new NewTopic(topic, partitions, replication)
newTopic.configs(Map[String,String]().asJava)
val ret = admin.createTopics(List(newTopic).asJavaCollection)
ret.all().get() // Also fails
admin.close()
With either command, the ZooKeeper (3.4.10) side throws an EOFException and closes the connection. Debugging the ZooKeeper side itself, it seems it is unable to deserialize the message that the admin client is sending (it runs out of bytes while reading).
Anyone able to make the 1.0.0 Kafka admin client work for creating or listing topics?
The AdminClient directly connects to Kafka and does not need access to Zookeeper.
You need to set AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG to point to your Kafka brokers (for example localhost:9092) instead of Zookeeper.
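A minimal corrected sketch of the code above, assuming a broker listening on localhost:9092 (the address is an assumption; adjust it to your setup):

import java.util.Properties
import org.apache.kafka.clients.admin.{AdminClient, AdminClientConfig, NewTopic}
import scala.collection.JavaConverters._

val config = new Properties()
// Bootstrap servers must point at the Kafka broker(s), not at ZooKeeper.
config.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092")

val admin = AdminClient.create(config)
try {
  // Create the topic with 1 partition and replication factor 1, as in the question.
  val newTopic = new NewTopic("test1", 1, 1.toShort)
  admin.createTopics(List(newTopic).asJavaCollection).all().get()

  // List all topics, including internal ones.
  admin.listTopics().names().get().asScala.foreach(println)
} finally {
  admin.close()
}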
I have a problem connecting to my PostgreSQL 8.4 DB using the Apache Spark service on Bluemix.
My code is:
%AddJar https://jdbc.postgresql.org/download/postgresql-8.4-703.jdbc4.jar -f
val sqlContext = new org.apache.spark.sql.SQLContext(sc)
sqlContext.load("jdbc", Map(
  "url" -> "jdbc:postgresql://<ip_address>:5432/postgres?user=postgres&password=<password>",
  "dbtable" -> "table_name"))
And I get the error:
Name: java.sql.SQLException
Message: No suitable driver found for jdbc:postgresql://:5432/postgres?user=postgres&password=
I've read around and it seems I need to add the JDBC driver to the Spark class path. I've no idea how to do this in the Bluemix Apache Spark service.
There is currently an issue with adding JDBC drivers to Bluemix Apache Spark. The team is working to resolve it. You can follow the progress here:
https://developer.ibm.com/answers/questions/248803/connecting-to-postgresql-db-using-jdbc-from-bluemi.html
Possibly have a look here? I believe the load() function is deprecated in Spark 1.4 [source].
You could try this instead:
val url = "jdbc:postgresql://:5432/postgres"
val prop = new java.util.Properties
prop.setProperty("user","postgres")
prop.setProperty("password","xxxxxx")
val table = sqlContext.read.jdbc(url,"table_name",prop)
The url may or may not require the completed version, i.e.
jdbc:postgresql://:5432/postgres?user=postgres&password=password
This worked for me on Bluemix:
%AddJar https://jdbc.postgresql.org/download/postgresql-9.4.1208.jar -f
val sqlContext = new org.apache.spark.sql.SQLContext(sc);
val df = sqlContext.read.format("jdbc").options(Map(
  "url" -> "jdbc:postgresql://:/",
  "user" -> "",
  "password" -> "",
  "dbtable" -> "",
  "driver" -> "org.postgresql.Driver")).load()