One topic maps to two different DB tables - Kafka Sink Connector (PostgreSQL)

I am currently having trouble mapping my Kafka topic st_record to two separate database tables: 1) gt_school.strecord_1week and 2) gt_school.strecord_1semester. My Kafka sink configuration is:
"tasks.max": "1",
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"connection.url": " "'"$URL"'",
"topics":"st_record",
"table.name.format": "gt_school.strecord_1week, gt_school.strecord_1semester",
"table.whitelist": "gt_school.strecord_1week, gt_school.strecord_1semester",
"transforms":"route",
"transforms.route.type":"org.apache.kafka.connect.transforms.RegexRouter,
"transforms.route.regex":"st_record",
"transforms.route.replacement":"gt_school.strecord_1week, gt_school.strecord_1semester"
I tried table.name.format, table.whitelist, and the route transform; however, every time I received the following error saying that the tables cannot be found:
io.confluent.connect.jdbc.sink.TableAlterOrCreateException: Table "gt_school"."strecord_1week, gt_school"."strecord_1semester" is missing and auto-creation is disabled"
Which is true; it treats the whole string "gt_school.strecord_1week, gt_school.strecord_1semester" as a single table name.
Does anyone know which field should map the two tables from one topic name? Am I supposed to use table.name.format? I know that by default the topic and table name are supposed to be the same, but I routed it and it still errors.

The error only says one table isn't found, not two: the comma is inside the quoted name. The JDBC sink only writes to one table per topic. Plus, table names cannot contain commas, as far as I know.
RegexRouter doesn't split your topic into two. It only renames the topic to a static string.
If you want to write to two distinct tables, create two separate connectors. The first:
"topics":"st_record",
...
"transforms":"route",
"transforms.route.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex":".*",
"transforms.route.replacement":"$0_1week"
And the second:
"topics":"st_record",
...
"transforms":"route",
"transforms.route.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex":".*",
"transforms.route.replacement":"$0_1semester"
However, this will obviously duplicate the data in the database, so I'd recommend creating one table with the data from the topic, then two VIEWs on top of it to run the different week/semester queries instead.
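For example, a rough sketch of those views in PostgreSQL, assuming the sink writes to a single gt_school.st_record table and that the records carry some date column to filter on (record_date is a made-up name here, since the record schema isn't shown):

-- Hypothetical sketch: one physical table fed by the sink connector,
-- plus two views for the weekly/semester slices. "record_date" is a guess.
CREATE VIEW gt_school.strecord_1week AS
SELECT * FROM gt_school.st_record
WHERE record_date >= now() - interval '1 week';

CREATE VIEW gt_school.strecord_1semester AS
SELECT * FROM gt_school.st_record
WHERE record_date >= now() - interval '6 months';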

Related

How can I change the database schema depending on the message value in Kafka Connect?

I want to change the schema on insert into PostgreSQL depending on the message value when I produce a message to Kafka Connect.
How can I do that?
I tried using a separate topic for each schema type, but this is not what I want.
E.g., topic name: country_city
{ cityId: 1 }
will insert into schema: country_1
{ cityId: 2 }
will insert into schema: country_2
The schema/table that is used depends exclusively on the topic name, not values within each record.
You will need to use Kafka Connect Transforms to override the "outgoing topic name" to control where data is written in a database.
However, if you have numbered tables in your database, that is likely an anti-pattern and you should be using relations instead.
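To illustrate the transform approach above, a minimal sketch of a sink-connector fragment using the built-in RegexRouter (the replacement value is just an example):

"transforms": "route",
"transforms.route.type": "org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex": "country_city",
"transforms.route.replacement": "country_1"

Note this renames the topic statically, so every record from country_city lands in one schema regardless of the cityId value; routing per field value would require a custom transform.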

How to ensure that in one Kafka topic same key goes to same partition for multiple tables

I have a requirement to produce data from multiple MongoDB tables and push it to the same Kafka topic using the mongo-kafka connector. I also have to ensure that data for the same table key column values always goes to the same partition, to preserve message ordering.
For example :
tables --> customer, address
table key columns --> CustomerID (for table customer), AddressID (for table address)
For CustomerID =12345 , it will always go to partition 1
For AddressID = 54321 , it will always go to partition 2
For a single table, the second requirement is easy to achieve using chained transformations. However, for multiple tables -> one topic, I am finding it difficult to achieve since each of these tables has a different key column name.
Is there any way available to fulfil both requirements using the Kafka connector?
If you use the ExtractField$Key transform and the IntegerConverter, all matching IDs should go to the same partition.
If you have two columns and one table, or end up with keys like {"CustomerID": 12345} then you have a composite/object key, meaning the whole key will be hashed when used to compute partitioning, not the ID itself.
You cannot set the partition for specific fields within a record without setting producer.override.partitioner.class in the connector config. In other words, you would need to implement a partitioner that deserializes your data, parses the values, then computes and returns the respective partition.
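As a sketch of the ExtractField$Key + IntegerConverter approach above, assuming one connector per collection so the key field name is known, and that the key struct holds a numeric CustomerID (names here are illustrative):

"key.converter": "org.apache.kafka.connect.converters.IntegerConverter",
"transforms": "extractKey",
"transforms.extractKey.type": "org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractKey.field": "CustomerID"

With the key reduced to the bare integer, the default partitioner hashes the same ID to the same partition every time.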

Kafka Connect - Transforms: rename field only if it exists

I have an S3 sink connector for multiple topics (topic_a, topic_b, topic_c); topic_a has the field created_date, while topic_b and topic_c have creation_date. I used the below transforms.RenameField.renames to rename the field (created_date:creation_date), but since only topic_a has created_date and the others don't, the connector is failing.
I want to move all the messages (from all topics, with a single connector) into S3 with creation_date (renaming created_date to creation_date if it exists), but I am not able to figure out the regex or transform to rename the field (if it exists) only for the specific topic.
"config":{
"connector.class":"io.confluent.connect.s3.S3SinkConnector",
"errors.log.include.messages":"true",
"s3.region":"eu-west-1",
"topics.dir":"dir",
"flush.size":"5",
"tasks.max":"2",
"s3.part.size":"5242880",
"timezone":"UTC",
"locale":"en",
"format.class":"io.confluent.connect.s3.format.json.JsonFormat",
"errors.log.enable":"true",
"s3.bucket.name":"bucket",
"topics": "topic_a, topic_b, topic_c",
"s3.compression.type":"gzip",
"partitioner.class":"io.confluent.connect.storage.partitioner.DailyPartitioner",
"name":"NAME",
"storage.class":"io.confluent.connect.s3.storage.S3Storage",
"key.converter.schemas.enable":"true",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter.schemas.enable":"true",
"value.converter":"io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url":"https://schemaregistry.com",
"enhanced.avro.schema.support": "true",
"transforms": "RenameField",
"transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.RenameField.renames": "created_date:creation_date"
}
only topic_a has created_date and the others don't
Then you would use separate connectors: one with the transform, covering all the topics that have the field, and another without the transform.
from all topics, with a single connector
This doesn't scale very well. You're giving a limited number of consumer threads and one consumer group the job of reading many topics at once. Multiple connectors would be better to distribute the load.
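As a sketch of the separate-connector approach, only the settings that differ between the two connectors are shown; everything else in the config above stays the same. The connector for the topic that has created_date keeps the rename:

"topics": "topic_a",
"transforms": "RenameField",
"transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.RenameField.renames": "created_date:creation_date"

And the connector for the topics that already have creation_date simply drops the transform:

"topics": "topic_b,topic_c"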

Flush size when using kafka-connect-transform-archive with HdfsSinkConnector

I have data in a Kafka topic which I want to preserve on my data lake.
Before worrying about the keys, I was able to save the Avro values in files on the datalake using HdfsSinkConnector. The number of message values in each file was determined by the "flush.size" property of the HdfsSinkConnector.
All good. Next I wanted to preserve the keys as well. To do this I used the kafka-connect-transform-archive which wraps the String key and Avro value into a new Avro schema.
This works great ... except that the flush.size for the HdfsSinkConnector is now being ignored. Each file saved in the data lake has exactly 1 message only.
So, the two cases are 1) save values only, with the number of values in each file determined by the flush.size and 2) save keys and values with each file containing exactly one message and flush.size being ignored.
The only difference between the two situations is the configuration for the HdfsSinkConnector which specifies the archive transform.
"transforms": "tran",
"transforms.tran.type": "com.github.jcustenborder.kafka.connect.archive.Archive"
Does the kafka-connect-transform-archive ignore flush size by design, or is there some additional configuration that I need in order to be able to save multiple key, value messages per file on the data lake?
I had the same problem when using the Kafka GCS sink connector.
In the com.github.jcustenborder.kafka.connect.archive.Archive code, a new Schema is created per message:
private R applyWithSchema(R r) {
  final Schema schema = SchemaBuilder.struct()
      .name("com.github.jcustenborder.kafka.connect.archive.Storage")
      .field("key", r.keySchema())
      .field("value", r.valueSchema())
      .field("topic", Schema.STRING_SCHEMA)
      .field("timestamp", Schema.INT64_SCHEMA);
  Struct value = new Struct(schema)
      .put("key", r.key())
      .put("value", r.value())
      .put("topic", r.topic())
      .put("timestamp", r.timestamp());
  return r.newRecord(r.topic(), r.kafkaPartition(), null, null, schema, value, r.timestamp());
}
If you look at Kafka's InsertField$Value transform, you will see that it uses a SynchronizedCache in order to retrieve the same schema every time:
https://github.com/axbaretto/kafka/blob/ba633e40ea77f28d8f385c7a92ec9601e218fb5b/connect/transforms/src/main/java/org/apache/kafka/connect/transforms/InsertField.java#L170
So you just need to create the schema once (outside the apply function), or use the same SynchronizedCache code.
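For illustration, a sketch of what a cached version of applyWithSchema could look like (my own adaptation, not the project's code; it assumes the wrapper schema only depends on the record's key/value schemas and reuses the same cache classes as InsertField):

// Field and method inside the transformation class. Needs imports from
// org.apache.kafka.common.cache (Cache, LRUCache, SynchronizedCache)
// and org.apache.kafka.connect.data (Schema, SchemaBuilder, Struct).
private final Cache<Schema, Schema> schemaCache =
    new SynchronizedCache<>(new LRUCache<Schema, Schema>(16));

private R applyWithSchema(R r) {
  // Reuse the wrapper schema already built for this value schema instead of
  // constructing a brand-new Schema object for every record.
  Schema schema = schemaCache.get(r.valueSchema());
  if (schema == null) {
    schema = SchemaBuilder.struct()
        .name("com.github.jcustenborder.kafka.connect.archive.Storage")
        .field("key", r.keySchema())
        .field("value", r.valueSchema())
        .field("topic", Schema.STRING_SCHEMA)
        .field("timestamp", Schema.INT64_SCHEMA)
        .build();
    schemaCache.put(r.valueSchema(), schema);
  }
  Struct value = new Struct(schema)
      .put("key", r.key())
      .put("value", r.value())
      .put("topic", r.topic())
      .put("timestamp", r.timestamp());
  return r.newRecord(r.topic(), r.kafkaPartition(), null, null, schema, value, r.timestamp());
}

The idea is that the sink then sees the same Schema instance for every record of a given input schema, so it shouldn't roll a new file per message anymore.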

using different avro schema for new columns

I am using Flume + Kafka to sink the log data to HDFS. My sink data type is Avro. In the Avro schema (.avsc), there are 80 fields as columns.
So I created an external table like this:
CREATE external TABLE pgar.tiz_biaws_fraud
PARTITIONED BY(partition_date INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/datapool/flume/biaws/fraud'
TBLPROPERTIES ('avro.schema.url'='hdfs://xxxx-ns/data/datapool/flume/biaws/fraud.avsc')
Now I need to add 25 more columns to the Avro schema. In that case:
If I create a new table with the new schema, which has 105 columns, I will have two tables for one project. And if I add or remove some columns in the coming days, I will have to create yet another table. I am afraid of ending up with a lot of tables that use different schemas for the same project.
If I swap the old schema for the new schema in the current table, I will have only one table for the project, but I can't read the old data anymore because of the schema conflict.
What is the best way to use avro schema in case like that?
This is indeed challenging. The best way is to make sure all schema changes you make are compatible with the old data - so only remove columns that have defaults, and make sure you give defaults to the columns you are adding. This way you can safely swap out the schemas without a conflict and keep reading the old data. Avro is pretty clever about that; it's called "schema evolution" (in case you want to Google a bit more) and it allows reader and writer schemas to be a bit different.
As an aside, I want to mention that Kafka has a native HDFS connector (i.e. without Flume) that uses Confluent's schema registry to handle these kinds of schema changes automatically - you can use the registry to check if the schemas are compatible, and if they are - simply write data using the new schema and the Hive table will automatically evolve to match.
I added new columns to the Avro schema like this:
{"name":"newColumn1", "type": "string", "default": ""},
{"name":"newColumn2", "type": "string", "default": ""},
{"name":"newColumn3", "type": "string", "default": ""},
When I use the default property, if that column doesn't exist in the current data it returns the default value, and if the column does exist it returns the data value as expected.
For setting null as the default value, null has to be the first type in the union and the default has to be a JSON null, not the string "null":
{ "name": "newColumn4", "type": [ "null", "string" ], "default": null },
or, if you don't need a default at all:
{ "name": "newColumn5", "type": [ "null", "string" ] },
When a union field has a default, the default value must match the first type listed in the union, which is why null comes first here.