How to transform and extract fields in Kafka sink JDBC connector - apache-kafka

I am using a 3rd party CDC tool that replicates data from a source database into Kafka topics. An example row is shown below:
{
"data":{
"USER_ID":{
"string":"1"
},
"USER_CATEGORY":{
"string":"A"
}
},
"beforeData":{
"Data":{
"USER_ID":{
"string":"1"
},
"USER_CATEGORY":{
"string":"B"
}
}
},
"headers":{
"operation":"UPDATE",
"timestamp":"2018-05-03T13:53:43.000"
}
}
What configuration is needed in the sink file in order to extract all the (sub)fields under data and headers and ignore those under beforeData so that the target table in which the data will be transferred by Kafka Sink will contain the following fields:
USER_ID, USER_CATEGORY, operation, timestamp
I went through the transformation list in confluent's docs but I was not able to find how to use them in order to achieve the aforementioned target.

I think you want ExtractField, and unfortunately, it's a Map.get operation, so that means 1) nested fields cannot be gotten in one pass 2) multiple fields need multiple transforms.
That being said, you might to attempt this (untested)
transforms=ExtractData,ExtractHeaders
transforms.ExtractData.type=org.apache.kafka.connect.transforms.ExtractField$Value
transforms.ExtractData.field=data
transforms.ExtractHeaders.type=org.apache.kafka.connect.transforms.ExtractField$Value
transforms.ExtractHeaders.field=headers
If that doesn't work, you might be better off implementing your own Transformations package that can at least drop values from the Struct / Map.

If you're willing to list specific field names, you can solve this by:
Using a Flatten transform to collapse the nesting (which will convert the original structure's paths into dot-delimited names)
Using a Replace transform with rename to make the field names be what you want the sink to emit
Using another Replace transform with whitelist to limit the emitted fields to those you select
For your case it might look like:
"transforms": "t1,t2,t3",
"transforms.t1.type": "org.apache.kafka.connect.transforms.Flatten$Value",
"transforms.t2.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.t2.renames": "data.USER_ID:USER_ID,data.USER_CATEGORY:USER_CATEGORY,headers.operation:operation,headers.timestamp:timestamp",
"transforms.t3.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.t3.whitelist": "USER_ID,USER_CATEGORY,operation,timestamp",

Related

How to map Data Flow parameters to Sink SQL Table

I need to store/map one or more data flow parameters to my Sink (Azure SQL Table).
I can fetch other data from a REST Api and is able to map these to my Sink columns (see below). I also need to generate some UUID's as key fields and add these to the same table.
I would like my EmployeeId column to contain my Data Flow Input parameter, e.g. named param_test. In addition to this I need to insert UUID's to other columns which are not part of my REST input fields.
How to I acccomplish that?
You need to use a derived column transformation, and there edit the expression to include the parameters.
derived column transformation
expression builder
Adding to #Chen Hirsh, use the same derived column to get uuid values to the columns after REST API Source.
They will come into sink mapping:
Output:

vega: Can I create marks using information coming from two datasets?

I would like to create some marks, where the information of the size comes from one dataset and the information of the color comes from another dataset.
Is this possible?
Or can I update created marks (created with dataset 1) by using information from a second dataset?
Yes, you can do it.
You can use lookup transform provided there is a lookup key in both datasets.
In this example, 'category' is the key that performs lookup transform

Flush size when using kafka-connect-transform-archive with HdfsSinkConnector

I have data in a Kafka topic which I want to preserve on my data lake.
Before worrying about the keys, I was able to save the Avro values in files on the datalake using HdfsSinkConnector. The number of message values in each file was determined by the "flush.size" property of the HdfsSinkConnector.
All good. Next I wanted to preserve the keys as well. To do this I used the kafka-connect-transform-archive which wraps the String key and Avro value into a new Avro schema.
This works great ... except that the flush.size for the HdfsSinkConnector is now being ignored. Each file saved in the data lake has exactly 1 message only.
So, the two cases are 1) save values only, with the number of values in each file determined by the flush.size and 2) save keys and values with each file containing exactly one message and flush.size being ignored.
The only difference between the two situations is the configuration for the HdfsSinkConnector which specifies the archive transform.
"transforms": "tran",
"transforms.tran.type": "com.github.jcustenborder.kafka.connect.archive.Archive"
Does the kafka-connect-transform-archive ignore flush size by design, or is there some additional configuration that I need in order to be able to save multiple key, value messages per file on the data lake?
i had the same problem when using kafka gcs sink connector.
In com.github.jcustenborder.kafka.connect.archive.Archive code, a new Schema is created per message.
private R applyWithSchema(R r) {
final Schema schema = SchemaBuilder.struct()
.name("com.github.jcustenborder.kafka.connect.archive.Storage")
.field("key", r.keySchema())
.field("value", r.valueSchema())
.field("topic", Schema.STRING_SCHEMA)
.field("timestamp", Schema.INT64_SCHEMA);
Struct value = new Struct(schema)
.put("key", r.key())
.put("value", r.value())
.put("topic", r.topic())
.put("timestamp", r.timestamp());
return r.newRecord(r.topic(), r.kafkaPartition(), null, null, schema, value, r.timestamp());
}
If you look at kafka transform InsertField$Value method, you will see that it use a SynchronizedCache in order to retreive the same schema every time.
https://github.com/axbaretto/kafka/blob/ba633e40ea77f28d8f385c7a92ec9601e218fb5b/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/InsertField.java#L170
So, you just need to create a schema (outside the apply function) or use the same SynchronizedCache code.

How to track rows (by id) with a specific column value using Kafka JDBC Connector?

I have a table containing a large number of records. There's a column defining a type of the record. I'd like to collect records with a specific value in that column. Kind of:
Select * FROM myVeryOwnTable WHERE type = "VERY_IMPORTANT_TYPE"
What I've noticed I can't use WHERE clause in a custom query when I choose incremental(+timestamp) mode, otherwise I'd need to take care if filtering on my own.
The background of that I'd like to achieve is that I use Logstash to transfer some type of data from MySQL to ES. That's easily achievable there by using query that can contain where clause. However, with Kafka I can transfer my data much quicker (almost instantly) after inserting new rows in DB.
Thank you for any hints or advices.
Thanks to #wardziniak I was able to set it up.
query=select * from (select * from myVeryOwnTable p where type = 'VERY_IMPORTANT_TYPE') p
topic.prefix=test-mysql-jdbc-
incrementing.column.name=id
however, I was expecting a topic test-mysql-jdbc-myVeryOwnTable so I've registered my consumer to that. However, using the query shown above table name is skipped so my topic was named exactly as prefix defined above. So I've just updated my properties topic.prefix=test-mysql-jdbc-myVeryOwnTable and it seems to be working just fine.
You can use subquery in your Jdbc Source Connector query property.
Sample JDBC Source Connector configuration:
{
...
"query": "select * from (select * from myVeryOwnTable p where type = 'VERY_IMPORTANT_TYPE') p",
"incrementing.column.name": "id",
...
}

Pivot data in Talend

I have some data which I need to pivot in Talend. This is a sample:
brandname,metric,value
A,xyz,2
B,xyz,2
A,abc,3
C,def,1
C,ghi,6
A,ghi,1
Now I need this data to be pivoted on the metric column like this:
brandname,abc,def,ghi,xyz
A,3,null,1,2
B,null,null,null,2
C,null,1,6,null
Currently I am using tPivotToColumnsDelimited to pivot the data to a file and reading back from that file. However having to store data on an external file and reading back is messy and unnecessary overhead.
Is there a way to do this with Talend without writing to an external file? I tried to use tDenormalize but as far as I understand, it will return the rows as 1 column which is not what I need. I also looked for some 3rd party component in TalendExchange but couldn't find anything useful.
Thank you for your help.
Assuming that your metrics are fixed, you can use their names as columns of the output. The solution to do the pivot has two parts: first, a tMap that transposes the value of each input-row in into the corresponding column in the output-row out and second, a tAggregate that groups the map's output-rows according to the brandname.
For the tMap you'd have to fill the columns conditionally like this, example for output colum named "abc":
out.abc = "abc".equals(in.metric)?in.value:null
In the tAggregate you'd have to group by out.brandname and aggregate each column as sum ignoring nulls.