Transform Kafka Topic Data while saving on DB connector - apache-kafka

I'm pushing data on Kafka Topic Say TEST and then using confluent sink i'm saving it on oracle db. Now i need to transform the data while saving it .
I have a request object and a transaction id ... i'm saving that transaction id as key and Request object as value. I need to convert it to following columns of oracle db row :
{transaction id , request object , timestamp}.. how can we configure it ?

I think, you could use KafkaStreams Transformations here: define your custom transformer class to change your input records before storing to DB.
Take a look here: http://kafka.apache.org/documentation.html#connect_transforms
and here:
https://www.confluent.io/blog/simplest-useful-kafka-connect-data-pipeline-world-thereabouts-part-3/

Related

Is it possible to extract the Schema ID when using KStream processing?

I am processing messages from sourceTopic to a targetTopic using KStream (using map method). In the map method, I am generating a new schema (since i need to extract explicit fields) for the targettopic using the incoming messages, but since the KStream operation is per message, i wish to avoid regenerating the schema for every message and would instead want to cache the schemaID of the incoming messages (for both Key and Value) and generate new targetschema only if the source Schema changes.
Is there a way to do this via the KStream object or from the Key/Value objects used in the map method
Update:
I was not able to get the schema ID for my above use case, as a workaround I cached the schema into a local variable and checked on each iteration if it changed and process further as required.
You only will have access to the ID if you use Serde.Bytes; after the records are deserialized, you'll only have access to the Schema.
The AvroSerdes from Confluent already cache the ids, though.

using existing kafka topic with avro in ksqldb

suppose i have a topic, lets say 'some_topic'. data in this topic is serialized with avro using schema registry.
schema subject name is the same as the name of the topic - 'some_topic', without '-value' postfix
what i want to do is to create a new topic, lets say 'some_topic_new', where data will be serialized with the same schema, but some fields will be 'nullified'
i'm trying to evaluate if this could be done using ksqldb and have two questions:
is it possible to create stream/table based on existing topic and using existing schema?
maybe something like create table some_topic (*) with (schema_subject=some_topic, ...).
so fields for new table would be taken from existing schema automatically
could creating of new schema with '-value' postfix be avoided when creating new stream/table?
When you create a stream in ksqlDB based on another you can have it inherit the schema. Note that it won't share the same schema though, but the definition will be the same.
CREATE STREAM my_stream
WITH (KAFKA_TOPIC='some_topic', VALUE_FORMAT='AVRO');
CREATE STREAM my_stream_new
WITH (KAFKA_TOPIC='some_topic_new', VALUE_FORMAT='AVRO') AS
SELECT * FROM my_stream;

Possible option for PrimayKey in Table creation with KSQL?

I've started working with KSQL and quite living the experience. I'm trying to work with Table and Stream join and the scenario is as below.
I have a sample data set like this:
"0117440512","0134217727","US","United States","VIRGINIA","Vienna","DoD Network Information Center"
"0134217728","0150994943","US","United States","MASSACHUSETTS","Woburn","Genuity"
in my kafka topic-1. Is a static data set loaded to Table and might get updated once in a month or so.
I have one more data set like:
{"state":"AD","id":"020","city":"Andorra","port":"02","region":"Canillo"}
{"state":"GD","id":"024","city":"Arab","port":"29","region":"Ordino"}
in kafka topic-2. Is a stream of data being loaded to streams.
Since Table cant be created without specifying the Key, my data don't have a unique column to do so. So while loading data from topic-1 to Table, what exactly should my key be? Remember my Table might get populated/updated once in a month or so with same data and new once too. With new data being loaded I can replace them with the key.
I tried to find if there's something like incremental value as we call PrimaryKey in SQL, but didn't find any.
Can someone help me in correcting my approach towards the implementation or a query to create a PrimaryKey if exists. Thanks
No, KSQL doesn't have the concept of a self-incrementing key. You have to define the key when you produce the data into the topic on which the KSQL Table is defined.
--- EDIT
If you want to set the key on a message as it's ingested through Kafka Connect you can use Single Message Transform (SMT).
"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"
See here for more details.

How to find out Avro schema from binary data that comes in via Spark Streaming?

I set up a Spark-Streaming pipeline that gets measuring data via Kafka. This data was serialized using Avro. The data can be of two types - EquidistantData and DiscreteData. I created these using an avdl file and the sbt-avrohugger plugin. I use the variant that generates Scala case classes that inherit from SpecificRecord.
In my receiving application, I can get the two schemas by querying EquidistantData.SCHEMA$ and DiscreteData.SCHEMA$.
Now, my Kafka stream gives me RDDs whose value class is Array[Byte]. So far so good.
How can I find out from the byte array which schema was used when serializing it, i.e., whether to use EquidistantData.SCHEMA$ or DiscreteData.SCHEMA$?
I thought of sending an appropriate info in the message key. Currently, I don't use the message key. Would this be a feasible way or can I get the schema somehow from the serialized byte array I received?
Followup:
Another possibility would be to use separate topics for discrete and equidistant data. Would this be feasible?

Storm- Routing bolt to get schema from the kafka spout

Storm - Conditionally consuming stream from kafka spout?
How do i get the schema of the data inside the Split Bolt when I try to output it using the declareOutputFields().
Fields schema = new Fields(?)
How do i get the schema of the all the fields in the data inside this bolt without basically reparsing all the data and recreating it?
You need to know the schema beforehand, ie, before you process the first tuples. The method declareOutputFields() is called during deployment before the first call to execute().
Storm cannot handle a variable schema. If you have JSON data with unknown structure, you could declare new Fields("json") and put the whole JSON object into a single field.