Storm: Routing bolt to get schema from the Kafka spout

Storm - Conditionally consuming stream from kafka spout?
How do I get the schema of the data inside the Split Bolt when I declare its output using declareOutputFields()?
Fields schema = new Fields(?)
How do I get the schema of all the fields in the data inside this bolt without reparsing all the data and recreating it?

You need to know the schema beforehand, i.e., before you process the first tuple. The method declareOutputFields() is called during deployment, before the first call to execute().
Storm cannot handle a variable schema. If you have JSON data with unknown structure, you could declare new Fields("json") and put the whole JSON object into a single field.

Related

Is it possible to extract the Schema ID when using KStream processing?

I am processing messages from sourceTopic to a targetTopic using KStream (using the map method). In the map method, I generate a new schema for the target topic from the incoming messages, since I need to extract explicit fields. Because the KStream operation runs per message, I want to avoid regenerating the schema for every message. Instead, I would like to cache the schema ID of the incoming messages (for both key and value) and generate a new target schema only if the source schema changes.
Is there a way to do this via the KStream object or from the key/value objects used in the map method?
Update:
I was not able to get the schema ID for the use case above. As a workaround, I cached the schema in a local variable, checked on each iteration whether it had changed, and processed further as required.
You will only have access to the ID if you use Serdes.Bytes(); after the records are deserialized, you'll only have access to the Schema.
The Avro serdes from Confluent already cache the IDs, though.
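If you do read the raw bytes, the schema ID can be pulled out of each payload yourself. Below is a minimal sketch, assuming the records were written by Confluent's Avro serializer, whose wire format is one magic byte (0) followed by a 4-byte big-endian schema ID and then the Avro data. The class SchemaIdCache, its method names, and the use of a plain String as the "target schema" are hypothetical stand-ins for illustration, not part of any Kafka or Confluent API:

```java
import java.nio.ByteBuffer;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical helper: caches one derived target schema per source schema ID.
final class SchemaIdCache {
    // Assumption: Confluent wire format = magic byte 0x0, then a 4-byte schema ID.
    private static final byte MAGIC_BYTE = 0x0;
    private static final int HEADER_LENGTH = 5;

    // Source schema ID -> target schema (a String here purely as a placeholder type).
    private final Map<Integer, String> targetSchemaBySourceId = new ConcurrentHashMap<>();

    /** Extracts the schema ID from a serialized key or value. */
    static int schemaIdOf(byte[] payload) {
        if (payload == null || payload.length < HEADER_LENGTH) {
            throw new IllegalArgumentException("Payload too short to carry a schema ID");
        }
        ByteBuffer buffer = ByteBuffer.wrap(payload); // ByteBuffer is big-endian by default
        if (buffer.get() != MAGIC_BYTE) {
            throw new IllegalArgumentException("Unexpected magic byte");
        }
        return buffer.getInt();
    }

    /** Builds the target schema only the first time a given source schema ID is seen. */
    String targetSchemaFor(byte[] payload, Function<Integer, String> buildTargetSchema) {
        return targetSchemaBySourceId.computeIfAbsent(schemaIdOf(payload), buildTargetSchema);
    }
}
```

In a topology this would sit inside the map step of a stream that is consumed as raw bytes, so the expensive schema generation runs once per distinct source schema ID rather than once per message. The trade-off is that you then have to deserialize the Avro payload yourself.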

Apache Nifi: Is there a way to publish messages to kafka with a message key as combination of multiple attributes?

I have a requirement where I need to read a CSV and publish to Kafka topic in Avro format. During the publish, I need to set the message key as the combination of two attributes. Let's say I have an attribute called id and an attribute called group. I need my message key to be id+"-"+group. Is there a way I can achieve this in Apache nifi flow? Setting the message key to a single attribute works fine for me.
Yes, in PublishKafka_2_0 (or whichever version you're using), set the Kafka Key property to construct your message key using NiFi Expression Language. For your example, the expression ${id}-${group} will form it (e.g. id=myId and group=myGroup -> myId-myGroup).
If you don't populate this property explicitly, the processor looks for the attribute kafka.key, so if you had previously set that value, it would be passed through.
Additional information after comment 2020-06-15 16:49
Ah, so the PublishKafkaRecord will publish multiple messages to Kafka, each correlating with a record in the single NiFi flowfile. In this case, the property is asking for a field (a record term meaning some element of the record schema) to use to populate that message key. I would suggest using UpdateRecord before this processor to add a field called messageKey (or whatever you like) to each record using Expression Language, then reference this field in the publishing processor property.
Notice the (?)s on each property, which indicate what is or isn't allowed:
When a field doesn't accept Expression Language, use an UpdateAttribute processor to set the combined value you need. Then you can use the combined value downstream.
Thank you for your input. I had to change my initial design from producing with a combined key to partitioning the file on a specific field using the PartitionRecord processor. I have a date field in my CSV file and there can be multiple records per date. I partition on this date field and produce to the Kafka topics using the id field as the key per partition. The Kafka topic name is dynamic and is suffixed with the date value. Since I plan to use Kafka Streams to read data from these topics, this is a much better design than the initial one.

using existing kafka topic with avro in ksqldb

Suppose I have a topic, let's say 'some_topic'. Data in this topic is serialized with Avro using Schema Registry.
The schema subject name is the same as the name of the topic, 'some_topic', without the '-value' suffix.
What I want to do is create a new topic, let's say 'some_topic_new', where data will be serialized with the same schema, but some fields will be 'nullified'.
I'm trying to evaluate whether this could be done using ksqlDB and have two questions:
1. Is it possible to create a stream/table based on an existing topic and using the existing schema? Maybe something like create table some_topic (*) with (schema_subject=some_topic, ...), so that the fields for the new table would be taken from the existing schema automatically.
2. Could the creation of a new schema with the '-value' suffix be avoided when creating a new stream/table?
When you create a stream in ksqlDB based on another one, you can have it inherit the schema. Note that the two streams won't share a single schema, though; the new stream gets its own schema, but the definition will be the same.
CREATE STREAM my_stream
WITH (KAFKA_TOPIC='some_topic', VALUE_FORMAT='AVRO');
CREATE STREAM my_stream_new
WITH (KAFKA_TOPIC='some_topic_new', VALUE_FORMAT='AVRO') AS
SELECT * FROM my_stream;

Transform Kafka Topic Data while saving on DB connector

I'm pushing data to a Kafka topic, say TEST, and then using a Confluent sink connector to save it to an Oracle DB. Now I need to transform the data while saving it.
I have a request object and a transaction ID. I'm saving the transaction ID as the key and the request object as the value. I need to convert this into the following columns of an Oracle DB row:
{transaction id, request object, timestamp}. How can we configure this?
I think you could use Kafka Connect transformations (Single Message Transforms) here: define a custom transformation class to change your input records before they are stored in the DB.
Take a look here: http://kafka.apache.org/documentation.html#connect_transforms
and here:
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/

How to find out Avro schema from binary data that comes in via Spark Streaming?

I set up a Spark-Streaming pipeline that gets measuring data via Kafka. This data was serialized using Avro. The data can be of two types - EquidistantData and DiscreteData. I created these using an avdl file and the sbt-avrohugger plugin. I use the variant that generates Scala case classes that inherit from SpecificRecord.
In my receiving application, I can get the two schemas by querying EquidistantData.SCHEMA$ and DiscreteData.SCHEMA$.
Now, my Kafka stream gives me RDDs whose value class is Array[Byte]. So far so good.
How can I find out from the byte array which schema was used when serializing it, i.e., whether to use EquidistantData.SCHEMA$ or DiscreteData.SCHEMA$?
I thought of sending the appropriate information in the message key. Currently, I don't use the message key. Would this be a feasible approach, or can I somehow get the schema from the serialized byte array I received?
Followup:
Another possibility would be to use separate topics for discrete and equidistant data. Would this be feasible?
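To make the message-key idea concrete, here is a minimal sketch of what the consumer side could look like, assuming the producer writes a type tag into the key. The class KeyDispatch, the tag values "equidistant" and "discrete", and the String result type are all hypothetical; in the real pipeline each decoder would be an Avro reader built from EquidistantData.SCHEMA$ or DiscreteData.SCHEMA$:

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical helper: picks a decoder for the value bytes based on the message key.
final class KeyDispatch {
    // Assumed tag values; the producer would have to write exactly these as the key.
    static final String EQUIDISTANT = "equidistant";
    static final String DISCRETE = "discrete";

    /** Looks up the decoder registered for the key and applies it to the value bytes. */
    static String decode(String key, byte[] value, Map<String, Function<byte[], String>> decoders) {
        Function<byte[], String> decoder = decoders.get(key);
        if (decoder == null) {
            // Fail loudly rather than guess which schema the bytes were written with.
            throw new IllegalArgumentException("No decoder registered for key: " + key);
        }
        return decoder.apply(value);
    }
}
```

With separate topics the same lookup would simply be keyed on the topic name instead of the message key.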