Spark Structured Streaming: read messages from Kafka with varying schemas (Scala)

I am trying to read messages from a Kafka topic in a streaming manner. The messages in the topic have two types of schema.
1. { "Request": { <some nested json> } }
2. { "Response": { <different nested json> } }
One message has the Request schema and another the Response schema. How can I read these JSON messages in Spark, identify whether each one is a Request or a Response message, and then act accordingly? I need to do this in Scala.
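One way to approach this (a minimal sketch, not a definitive answer): parse every message with a single envelope schema that contains both Request and Response as optional nested structs, then check which branch is non-null. The nested field names, topic and broker address below are placeholders; substitute your real schemas.

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types._

object RequestResponseStream {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("request-response-stream").getOrCreate()
    import spark.implicits._

    // Hypothetical nested schemas -- replace with the real Request/Response structures.
    val requestSchema  = new StructType().add("id", StringType).add("payload", StringType)
    val responseSchema = new StructType().add("id", StringType).add("result", StringType)

    // One envelope schema that contains both optional branches.
    val envelopeSchema = new StructType()
      .add("Request", requestSchema)
      .add("Response", responseSchema)

    val raw = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "localhost:9092") // assumed broker address
      .option("subscribe", "my-topic")                     // assumed topic name
      .load()

    val parsed = raw
      .select(from_json($"value".cast("string"), envelopeSchema).as("msg"))
      .withColumn("msgType",
        when($"msg.Request".isNotNull, lit("request"))
          .when($"msg.Response".isNotNull, lit("response"))
          .otherwise(lit("unknown")))

    // Branch on msgType downstream, e.g. parsed.filter($"msgType" === "request") ...
    val query = parsed.writeStream.format("console").start()
    query.awaitTermination()
  }
}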

Related

Extract all schemaIDs of all messages in a kafka topic without the need to consume all messages

I'm wondering if there's a way (Kafka API/tool) to return the list of schema ids used by messages under a topic, from within Kafka and/or the Schema Registry.
I have a quick solution that consumes all messages and extracts the ids outside of Kafka. However, it's rather time- and resource-consuming.
Is there a way (Kafka API/tool) to return the list of schema ids used by messages under a topic from within Kafka?
Yes.
kafka-avro-console-consumer ... --property print.schema.ids=true
https://github.com/confluentinc/schema-registry/pull/901
Without the need to consume
No; you need to deserialize at least the first 5 bytes of each record.
The other answer shows what is in the registry, which is not necessarily what currently exists in the topic.
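For reference, here is a minimal Scala sketch of what "deserialize the first 5 bytes" means under the Confluent wire format (magic byte 0x0 followed by a 4-byte big-endian schema id). The topic name, broker address and the bounded poll loop are assumptions for illustration.

import java.nio.ByteBuffer
import java.time.Duration
import java.util.{Collections, Properties}
import org.apache.kafka.clients.consumer.KafkaConsumer
import org.apache.kafka.common.serialization.ByteArrayDeserializer
import scala.collection.JavaConverters._
import scala.collection.mutable

object SchemaIdScanner {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("group.id", "schema-id-scanner")
    props.put("auto.offset.reset", "earliest")
    props.put("key.deserializer", classOf[ByteArrayDeserializer].getName)
    props.put("value.deserializer", classOf[ByteArrayDeserializer].getName)

    val consumer = new KafkaConsumer[Array[Byte], Array[Byte]](props)
    consumer.subscribe(Collections.singletonList("my-avro-topic")) // assumed topic

    val schemaIds = mutable.Set[Int]()
    for (_ <- 1 to 10) { // poll a bounded number of times for this sketch
      val records = consumer.poll(Duration.ofSeconds(1))
      for (record <- records.asScala) {
        val bytes = record.value()
        // Confluent wire format: magic byte 0x0, then a 4-byte big-endian schema id.
        if (bytes != null && bytes.length >= 5 && bytes(0) == 0) {
          schemaIds += ByteBuffer.wrap(bytes, 1, 4).getInt
        }
      }
    }
    consumer.close()
    println(s"Schema ids seen: $schemaIds")
  }
}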
First solution:
First, you can request all schemas via:
/schemas
The response is an array of objects, each of which contains a subject field that typically represents your topic name.
{
    "subject": "your target topic",
    "version": "version number",
    "id": "id",
    "schema": {} // the schema you are looking for
}
Second solution:
/subjects/${your topic name}/versions/
The response is an array of version ids, like:
[1,2,3,...]
Then you have to fetch, for each version, the schema you want:
/subjects/${your topic name}/versions/${version} // 1, 2, 3, etc.
Check the Schema Registry REST API docs.
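As a rough Scala sketch of the second solution, assuming the registry runs at http://localhost:8081 and the common default subject naming of <topic>-value:

import scala.io.Source

object SchemaRegistryLookup {
  // Assumed registry URL and subject name (topic name + "-value" is the common default).
  val registryUrl = "http://localhost:8081"
  val subject     = "orders-value"

  def main(args: Array[String]): Unit = {
    // List every registered version for the subject, e.g. [1,2,3]
    val versions = Source.fromURL(s"$registryUrl/subjects/$subject/versions").mkString
    println(s"Versions: $versions")

    // Fetch one concrete version; the JSON response includes subject, id, version and schema.
    val latest = Source.fromURL(s"$registryUrl/subjects/$subject/versions/latest").mkString
    println(s"Latest schema: $latest")
  }
}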

How to update data in Kafka/Kafka stream?

Let's suppose there is a Kafka topic orders. Data is stored in JSON format:
{
    "order_id": 1,
    "status": 1
}
status defines the status of the order (1 = pending, 2 = completed).
How do I change it to completed when the order is finished?
As far as I know, a Kafka topic is immutable and I cannot change the message JSON, only create a new message with the changed value, right?
Correct. If your order changes state, the process that changes the state should produce a new message with the new state to the topic. A Kafka Streams application can react to the new messages, perform transformations, aggregations or similar, and output the modified/aggregated messages to new topics. So you need a Kafka producer that, whenever the order state changes, produces a message to the orders topic.
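As a minimal Scala sketch of such a producer (the broker address and the choice of order_id as key are assumptions): when the order is finished, publish a new message for the same order_id with status 2, and let consumers treat the latest message per key as the current state.

import java.util.Properties
import org.apache.kafka.clients.producer.{KafkaProducer, ProducerRecord}
import org.apache.kafka.common.serialization.StringSerializer

object OrderStatusProducer {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address
    props.put("key.serializer", classOf[StringSerializer].getName)
    props.put("value.serializer", classOf[StringSerializer].getName)

    val producer = new KafkaProducer[String, String](props)

    // The order id is used as the key so that later state changes for the same
    // order land in the same partition and consumers see them in order.
    val completedEvent = """{"order_id": 1, "status": 2}"""
    producer.send(new ProducerRecord[String, String]("orders", "1", completedEvent))

    producer.flush()
    producer.close()
  }
}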

INCOMPLETE_SOURCE_TOPIC_METADATA error in KafkaStreams

I'm writing a Kafka Streams application and have set the maximum number of stream threads to one. I have three source topics with 6, 8 and 8 partitions respectively. I'm currently running this stream topology with 4 instances, so there are 4 running stream threads.
I'm getting INCOMPLETE_SOURCE_TOPIC_METADATA for one of my Kafka topics. I found the code below on GitHub that throws this error and I'm trying to understand it:
final Map<String, InternalTopicConfig> repartitionTopicMetadata = new HashMap<>();
for (final InternalTopologyBuilder.TopicsInfo topicsInfo : topicGroups.values()) {
    for (final String topic : topicsInfo.sourceTopics) {
        if (!topicsInfo.repartitionSourceTopics.keySet().contains(topic) &&
            !metadata.topics().contains(topic)) {
            log.error("Missing source topic {} during assignment. Returning error {}.",
                      topic, AssignorError.INCOMPLETE_SOURCE_TOPIC_METADATA.name());
            return new GroupAssignment(
                errorAssignment(clientMetadataMap, topic,
                    AssignorError.INCOMPLETE_SOURCE_TOPIC_METADATA.code())
            );
        }
    }
    for (final InternalTopicConfig topic : topicsInfo.repartitionSourceTopics.values()) {
        repartitionTopicMetadata.put(topic.name(), topic);
    }
}
My questions:
Does this error come from a partition mismatch on the Kafka topics, or from TopicsInfo not being available at the time (e.g., the Kafka group lost its access to the Kafka topic)?
What is meant by the topicsInfo.repartitionSourceTopics call?
The error means that Kafka Streams could not fetch all required metadata for the topics it uses and thus cannot proceed. This can happen if a topic does not exist (i.e., it has not been created yet or is misspelled), or if the corresponding metadata has not been broadcast to all brokers yet (in that case, the error should be transient). Note that metadata is updated asynchronously, so there may be some delay until all brokers learn about a new topic.
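If you want to rule out the "topic does not exist" case before starting the application, a small AdminClient check like the following sketch can help (topic names and broker address are placeholders):

import java.util.Properties
import org.apache.kafka.clients.admin.AdminClient
import scala.collection.JavaConverters._

object SourceTopicCheck {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put("bootstrap.servers", "localhost:9092") // assumed broker address

    val admin = AdminClient.create(props)
    val existing = admin.listTopics().names().get().asScala

    // Topics the topology reads from (assumed names).
    val sourceTopics = Seq("topic-a", "topic-b", "topic-c")
    val missing = sourceTopics.filterNot(existing.contains)

    if (missing.nonEmpty)
      println(s"Missing source topics; Streams would fail with INCOMPLETE_SOURCE_TOPIC_METADATA: $missing")
    else
      println("All source topics exist.")

    admin.close()
  }
}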

How to read from a string delimited kafka topic then convert it to json?

I'm currently working on a POC project that uses Kafka Connect to read from a Kafka topic and write to a database. I am aware that the JDBC sink connector requires a schema to work. However, all of our Kafka topics are string-delimited. Apart from creating a new topic with JSON or Avro, I'm planning to create an API that converts the strings to JSON. Are there any other solutions I can try?
If I understood your use case, I would definitely try Kafka Streams to process your delimited String topic and sink it into a JSON topic. Then, you can use this JSON topic as input to your database.
With Kafka Streams, you can map each String record of your topic and apply logic to it to generate a JSON with your data. The processed records could be written to your results topic either as Strings in JSON format or even as a JSON type using the Kafka JSON Serdes (Kafka Streams Data Types).
This approach also adds flexibility with your input data, you can process, transform or ignore each delimited field of your string.
If you want to get started, check out the Kafka Demo and the Confluent Kafka-Streams examples for more complex use cases.
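As a rough Scala sketch of that idea, using the Kafka Streams Java API to map each delimited String record to a JSON String (the topic names, the pipe-delimited layout and the field names are assumptions):

import java.util.Properties
import org.apache.kafka.common.serialization.Serdes
import org.apache.kafka.streams.{KafkaStreams, StreamsBuilder, StreamsConfig}
import org.apache.kafka.streams.kstream.{Consumed, Produced, ValueMapper}

object DelimitedToJsonApp {
  def main(args: Array[String]): Unit = {
    val props = new Properties()
    props.put(StreamsConfig.APPLICATION_ID_CONFIG, "delimited-to-json") // assumed application id
    props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092") // assumed broker address

    val builder = new StreamsBuilder()

    // Assumed layout: "orderId|status|amount", pipe-delimited.
    val toJson = new ValueMapper[String, String] {
      override def apply(line: String): String = {
        val Array(orderId, status, amount) = line.split("\\|", -1)
        s"""{"order_id": "$orderId", "status": "$status", "amount": "$amount"}"""
      }
    }

    builder
      .stream[String, String]("delimited-input", Consumed.`with`(Serdes.String(), Serdes.String()))
      .mapValues[String](toJson)
      .to("json-output", Produced.`with`(Serdes.String(), Serdes.String()))

    val streams = new KafkaStreams(builder.build(), props)
    streams.start()
    sys.addShutdownHook(streams.close())
  }
}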
Let's create a Kafka Streams application. You will consume the data as a KStream<String, String> (assuming the key is also a String), then transform it in the Kafka Streams app to a KStream<String, YourAvroClass> and produce the messages to the target Avro topic.
KStream<String,YourAvroClass> avroKStream ....
avroKStream.to("avro-output-topic", Produced.with(Serdes.String(), yourAvroClassSerde));
The solution to your problem would be creating a custom Converter. Unfortunately, there is no blog post or online resource on how to do that step by step, but you can have a look at how the StringConverter is built. It's a pretty straightforward API that uses serializers and deserializers to convert the data.
Later edit:
A converter uses serializers and deserializers, so the only thing you need to do is create your own:
import org.apache.kafka.common.errors.SerializationException;
import org.apache.kafka.connect.data.Schema;
import org.apache.kafka.connect.data.SchemaAndValue;
import org.apache.kafka.connect.errors.DataException;
import org.apache.kafka.connect.storage.Converter;

public class CustomStringConverter implements Converter {
    // Your own (de)serializer implementations, modeled on StringSerializer/StringDeserializer.
    private final CustomStringSerializer serializer = new CustomStringSerializer();
    private final CustomStringDeserializer deserializer = new CustomStringDeserializer();

    // --- boilerplate code resulting from implementing the Converter interface (configure, etc.) ---

    @Override
    public byte[] fromConnectData(String topic, Schema schema, Object value) {
        try {
            return serializer.serialize(topic, value == null ? null : value.toString());
        } catch (SerializationException e) {
            throw new DataException("Failed to serialize to a string: ", e);
        }
    }

    @Override
    public SchemaAndValue toConnectData(String topic, byte[] value) {
        try {
            return new SchemaAndValue(createYourOwnSchema(topic, value), deserializer.deserialize(topic, value));
        } catch (SerializationException e) {
            throw new DataException("Failed to deserialize string: ", e);
        }
    }
}
Notice that toConnectData returns a SchemaAndValue object, and that is exactly what you need to customize. You can create your own createYourOwnSchema method that builds the schema based on the topic (and possibly the value), and then the deserializer deserializes your data.

Read XML message from Kafka topic in Spark Streaming

I need to consume XML messages from a Kafka topic and load them into a Spark DataFrame within the foreachRDD block of my streaming job. How can I do that? I am able to consume JSON messages in my streaming job by doing spark.sqlContext.read.json(rdd); what is the analogous code for reading XML-format messages from Kafka? I am using Spark 2.2, Scala 2.11.8 and Kafka 0.10.
My XML messages will have about 400 fields (heavily nested), so I want to dynamically store them in a DataFrame inside the stream.foreachRDD { rdd => ... } block and then operate on the DataFrame.
Also, should I convert the XML into JSON or Avro before sending it to the topic at the producer end? Is it too heavy to send XML as-is, and is it better to send JSON instead?
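There is no built-in spark.read.xml on an RDD in Spark 2.2, but as a minimal sketch you can parse each XML payload yourself inside foreachRDD and build a DataFrame from the extracted fields. The example below uses scala-xml (a separate dependency) and a small hypothetical Order record with two fields; with ~400 nested fields you would instead want a dedicated XML data-source library, or a conversion to JSON/Avro on the producer side, which avoids this parsing work downstream.

import org.apache.kafka.common.serialization.StringDeserializer
import org.apache.spark.sql.SparkSession
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka010._
import org.apache.spark.streaming.kafka010.LocationStrategies.PreferConsistent
import org.apache.spark.streaming.kafka010.ConsumerStrategies.Subscribe

object XmlKafkaStream {
  // Hypothetical flat record extracted from the XML; a real message with ~400 nested
  // fields would need a generated schema or an XML data-source library instead.
  case class Order(id: String, status: String)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder.appName("xml-kafka-stream").getOrCreate()
    val ssc = new StreamingContext(spark.sparkContext, Seconds(10))

    val kafkaParams = Map[String, Object](
      "bootstrap.servers" -> "localhost:9092",            // assumed broker address
      "key.deserializer" -> classOf[StringDeserializer],
      "value.deserializer" -> classOf[StringDeserializer],
      "group.id" -> "xml-stream",
      "auto.offset.reset" -> "earliest")

    val stream = KafkaUtils.createDirectStream[String, String](
      ssc, PreferConsistent, Subscribe[String, String](Seq("xml-topic"), kafkaParams))

    stream.foreachRDD { rdd =>
      import spark.implicits._
      // Parse each XML payload with scala-xml and pull out the fields we care about.
      val df = rdd.map(_.value).map { xmlString =>
        val root = scala.xml.XML.loadString(xmlString)
        Order((root \ "id").text, (root \ "status").text)
      }.toDF()
      df.show()
    }

    ssc.start()
    ssc.awaitTermination()
  }
}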