I have an S3 sink connector for multiple topics (topic_a, topic_b, topic_c). topic_a has the field created_date, while topic_b and topic_c have creation_date. I used the transforms.RenameField.renames setting below to rename the field (created_date:creation_date), but since only topic_a has created_date and the others don't, the connector is failing.
I want to move all the messages (from all topics, with a single connector) into S3 with creation_date, renaming created_date to creation_date if it exists, but I can't figure out the regex or transform that renames the field (if it exists) for a specific topic.
"config":{
"connector.class":"io.confluent.connect.s3.S3SinkConnector",
"errors.log.include.messages":"true",
"s3.region":"eu-west-1",
"topics.dir":"dir",
"flush.size":"5",
"tasks.max":"2",
"s3.part.size":"5242880",
"timezone":"UTC",
"locale":"en",
"format.class":"io.confluent.connect.s3.format.json.JsonFormat",
"errors.log.enable":"true",
"s3.bucket.name":"bucket",
"topics": "topic_a, topic_b, topic_c",
"s3.compression.type":"gzip",
"partitioner.class":"io.confluent.connect.storage.partitioner.DailyPartitioner",
"name":"NAME",
"storage.class":"io.confluent.connect.s3.storage.S3Storage",
"key.converter.schemas.enable":"true",
"key.converter":"org.apache.kafka.connect.storage.StringConverter",
"value.converter.schemas.enable":"true",
"value.converter":"io.confluent.connect.avro.AvroConverter",
"value.converter.schema.registry.url":"https://schemaregistry.com",
"enhanced.avro.schema.support": "true",
"transforms": "RenameField",
"transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.RenameField.renames": "created_date:creation_date"
}
Regarding "only topic_a have created_date and others don't": in that case, use separate connectors. One has the transform and reads all the topics that contain the field; the other has no transform and reads the rest.
Regarding "from all topics with single connector": this doesn't scale very well. You get a limited number of consumer threads and a single consumer group reading from many topics at once. Multiple connectors would distribute the load better.
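A rough sketch of that split, in case it helps: the Python below derives two connector definitions from one shared base and attaches the rename transform only to the connector that reads topic_a. The connector names and the abbreviated base settings are placeholders of mine, not something from the question. Each printed JSON document would be submitted to Kafka Connect as its own connector.

```python
import json

# Settings shared by both connectors (abbreviated from the config in the question).
BASE_CONFIG = {
    "connector.class": "io.confluent.connect.s3.S3SinkConnector",
    "s3.bucket.name": "bucket",
    "format.class": "io.confluent.connect.s3.format.json.JsonFormat",
    "flush.size": "5",
}

# The rename transform from the question, used only where created_date exists.
RENAME_TRANSFORM = {
    "transforms": "RenameField",
    "transforms.RenameField.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
    "transforms.RenameField.renames": "created_date:creation_date",
}


def build_connectors(topics_with_old_field, topics_with_new_field):
    """Return two connector definitions: one that renames, one that does not."""
    renaming = {
        "name": "s3-sink-rename",  # hypothetical connector name
        "config": {
            **BASE_CONFIG,
            **RENAME_TRANSFORM,
            "topics": ",".join(topics_with_old_field),
        },
    }
    passthrough = {
        "name": "s3-sink-plain",  # hypothetical connector name
        "config": {**BASE_CONFIG, "topics": ",".join(topics_with_new_field)},
    }
    return [renaming, passthrough]


connectors = build_connectors(["topic_a"], ["topic_b", "topic_c"])
for connector in connectors:
    print(json.dumps(connector, indent=2))
```

Keeping the shared settings in one place means the two connectors only differ in their topics and in whether the transform is present.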
I’m having trouble getting a connector to stream data from Postgres to the topics. Looking for your input.
If I make a simple connector for just our 'alerts' table, the logs make it appear that all is well, as shown here:
[kafkaconnect-62]2022-11-13T22:58:05.824715[kafka-connect][2022-11-13 22:58:05,824] INFO [pg-staging|task-0] Exporting data from table 'public.alerts' (1 of 1 tables) (io.debezium.relational.RelationalSnapshotChangeEventSource:339)
[kafkaconnect-62]2022-11-13T22:58:05.826304[kafka-connect][2022-11-13 22:58:05,826] INFO [pg-staging|task-0] For table 'public.alerts' using select statement: 'SELECT "id", "team_id", "webhook_id", "event_id", "attempt_id", "sent_at", "updated_at", "created_at" FROM "public"."alerts"' (io.debezium.relational.RelationalSnapshotChangeEventSource:347)
[kafkaconnect-62]2022-11-13T22:58:05.933326[kafka-connect][2022-11-13 22:58:05,933] INFO [pg-staging|task-0] Finished exporting 154 records for table 'public.alerts'; total duration '00:00:00.109' (io.debezium.relational.RelationalSnapshotChangeEventSource:393)
[kafkaconnect-62]2022-11-13T22:58:05.950241[kafka-connect][2022-11-13 22:58:05,950] INFO [pg-staging|task-0] Snapshot - Final stage (io.debezium.pipeline.source.AbstractSnapshotChangeEventSource:88)
However, the offset for the alerts topic remains at zero, and there are no messages in the topic.
If I perform the same test as above but with only the 'quota_excesses' table, it works fine: all rows appear in the topic.
I'm at a loss as to why this is. What am I missing here?
I am currently having trouble mapping my Kafka topic st_record to two separate database tables: 1) gt_school.strecord_1week 2) gt_school.strecord_1semester. My Kafka sink configuration is:
"tasks.max": "1",
"connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
"connection.url": " "'"$URL"'",
"topics":"st_record",
"table.name.format": "gt_school.strecord_1week, gt_school.strecord_1semester",
"table.whitelist": "gt_school.strecord_1week, gt_school.strecord_1semester",
"transforms":"route",
"transforms.route.type":"org.apache.kafka.connect.transforms.RegexRouter,
"transforms.route.regex":"st_record",
"transforms.route.replacement":"gt_school.strecord_1week, gt_school.strecord_1semester"
I tried table.name.format, table.whitelist, and the route transform, but every time I received the following error saying that both tables could not be found:
io.confluent.connect.jdbc.sink.TableAlterOrCreateException: Table "gt_school"."strecord_1week, gt_school"."strecord_1semester" is missing and auto-creation is disabled"
Which is true: it should be returned in this format, "gt_school.strecord_1week, gt_school.strecord_1semester".
Does anyone know which setting maps the two tables to one topic name? Am I supposed to use table.name.format? I know that by default the topic and table names are supposed to be the same, but I routed the topic and still get errors.
The error says that only one table isn't found, not two: the comma is inside the quotes. The JDBC sink writes to only one table per topic. Plus, table names cannot contain commas, as far as I know.
RegexRouter doesn't split your topic into two. It only renames the topic to a static string.
If you want to write to two distinct tables, create two separate connectors, one with each of the following configs:
"topics":"st_record",
...
"transforms":"route",
"transforms.route.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex":".*",
"transforms.route.replacement":"$0_1week"
"topics":"st_record",
...
"transforms":"route",
"transforms.route.type":"org.apache.kafka.connect.transforms.RegexRouter",
"transforms.route.regex":".*",
"transforms.route.replacement":"$0_1semester"
However, this will obviously duplicate data in the database, so I'd recommend creating one table with the data from the topic and then two VIEWs on top of it for the week and semester queries.
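To see why the original route config produced a single table name with a comma in it, here is a rough Python emulation of what RegexRouter does, as I understand it: if the regex matches the whole topic name, the first match is replaced; otherwise the topic is left alone. The helper names are mine, and the translation of Java-style $N references is a simplification that only covers single-digit groups.

```python
import re


def java_to_python_replacement(replacement):
    """Translate Java-style group references ($0, $1, ...) into Python's \\g<N>.

    Simplification: single-digit groups only, escaped dollars are not handled.
    """
    return re.sub(r"\$(\d)", lambda m: "\\g<" + m.group(1) + ">", replacement)


def route(topic, regex, replacement):
    """Rewrite the topic name only when the regex matches all of it."""
    if re.fullmatch(regex, topic) is None:
        return topic  # no match: the record keeps its original topic
    return re.sub(regex, java_to_python_replacement(replacement), topic, count=1)


# The replacement is one static string, comma included, not a list of targets.
original = route(
    "st_record",
    "st_record",
    "gt_school.strecord_1week, gt_school.strecord_1semester",
)
print(original)

# One connector per target: $0 stands for the whole matched topic name.
week = route("st_record", ".*", "$0_1week")
semester = route("st_record", ".*", "$0_1semester")
print(week, semester)
```

Note that the last two calls give st_record_1week and st_record_1semester, so to land in gt_school.strecord_1week and gt_school.strecord_1semester the replacement (or table.name.format) would still need to be adjusted to the real table names.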
I'm trying to apply Debezium's New Record State Extraction SMT using the following configuration:
"transforms": "unwrap",
"transforms.unwrap.type": "io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones": true,
"transforms.unwrap.delete.handling.mode": "rewrite",
"transforms.unwrap.add.fields": "db,schema,table,txId,ts_ms"
For INSERT and UPDATE operations I get the messages as expected, but for DELETE I get the following payload:
"payload": {
"id": 2,
"first_name": "",
"last_name": "",
"__db": "postgres",
"__schema": "schema1",
"__table": "user_details",
"__txId": 5145,
"__ts_ms": 1638760801510,
"__deleted": "true"
}
As you can see above, both the first_name and last_name fields are empty, even though the record I deleted had non-empty values for both. What I expect to see in those two fields is their value at the moment of deletion, as shown in the before section of Debezium's payload when the New Record State Extraction SMT is not applied.
The reason for the empty values in all columns except the PK is not related to the New Record State Extraction SMT at all. For Postgres, there is a REPLICA IDENTITY table-level parameter that controls the information written to the WAL to identify tuple data that is being deleted or updated.
This parameter has 4 modes:
DEFAULT
USING INDEX index
FULL
NOTHING
In the case of DEFAULT, old tuple data is only identified with the primary key of the table. Columns that are not part of the primary key do not have their old value written.
In the case of FULL, all the column values of the old tuple are always written to the WAL. Hence, executing the following command for the target table will make the old record values appear in the Debezium message:
ALTER TABLE some_table REPLICA IDENTITY FULL;
Note: FULL is the most verbose and also the most resource-consuming mode. Be careful with it, particularly on heavily updated tables.
I am using a 3rd party CDC tool that replicates data from a source database into Kafka topics. An example row is shown below:
{
"data":{
"USER_ID":{
"string":"1"
},
"USER_CATEGORY":{
"string":"A"
}
},
"beforeData":{
"Data":{
"USER_ID":{
"string":"1"
},
"USER_CATEGORY":{
"string":"B"
}
}
},
"headers":{
"operation":"UPDATE",
"timestamp":"2018-05-03T13:53:43.000"
}
}
What configuration is needed in the sink file to extract all the (sub)fields under data and headers and ignore those under beforeData, so that the target table written by the Kafka sink contains the following fields:
USER_ID, USER_CATEGORY, operation, timestamp
I went through the transformation list in Confluent's docs, but I was not able to find how to use them to achieve this.
I think you want ExtractField, and unfortunately, it's a Map.get operation, which means 1) nested fields cannot be retrieved in one pass, and 2) multiple fields need multiple transforms.
That being said, you might want to attempt this (untested):
transforms=ExtractData,ExtractHeaders
transforms.ExtractData.type=org.apache.kafka.connect.transforms.ExtractField$Value
transforms.ExtractData.field=data
transforms.ExtractHeaders.type=org.apache.kafka.connect.transforms.ExtractField$Value
transforms.ExtractHeaders.field=headers
If that doesn't work, you might be better off implementing your own Transformations package that can at least drop values from the Struct / Map.
If you're willing to list specific field names, you can solve this by:
1) Using a Flatten transform to collapse the nesting (which converts the original structure's paths into dot-delimited names)
2) Using a ReplaceField transform with renames to give the fields the names you want the sink to emit
3) Using another ReplaceField transform with whitelist to limit the emitted fields to those you select
For your case it might look like:
"transforms": "t1,t2,t3",
"transforms.t1.type": "org.apache.kafka.connect.transforms.Flatten$Value",
"transforms.t2.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.t2.renames": "data.USER_ID:USER_ID,data.USER_CATEGORY:USER_CATEGORY,headers.operation:operation,headers.timestamp:timestamp",
"transforms.t3.type": "org.apache.kafka.connect.transforms.ReplaceField$Value",
"transforms.t3.whitelist": "USER_ID,USER_CATEGORY,operation,timestamp",
I am using Flume + Kafka to sink log data to HDFS. My sink data type is Avro. The Avro schema (.avsc) has 80 fields, which become columns.
So I created an external table like this:
CREATE external TABLE pgar.tiz_biaws_fraud
PARTITIONED BY(partition_date INT)
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe'
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat'
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION '/data/datapool/flume/biaws/fraud'
TBLPROPERTIES ('avro.schema.url'='hdfs://xxxx-ns/data/datapool/flume/biaws/fraud.avsc')
Now I need to add 25 more columns to the Avro schema. In that case:
If I create a new table with the new schema, which has 105 columns, I will have two tables for one project. And if I add or remove some columns in the coming days, I will have to create yet another table. I am afraid of ending up with a lot of tables that use different schemas for the same project.
If I swap the old schema for the new schema in the current table, I will have only one table for the project, but I won't be able to read the old data anymore because of the schema conflict.
What is the best way to use an Avro schema in a case like this?
This is indeed challenging. The best way is to make sure every schema change you make is compatible with the old data: only remove columns that have defaults, and give defaults to the columns you add. This way you can safely swap out the schemas without a conflict and keep reading old data. Avro is pretty clever about this; it's called "schema evolution" (in case you want to google a bit more), and it allows reader and writer schemas to differ somewhat.
As an aside, I want to mention that Kafka has a native HDFS connector (i.e. without Flume) that uses Confluent's schema registry to handle these kinds of schema changes automatically - you can use the registry to check if the schemas are compatible, and if they are - simply write data using the new schema and the Hive table will automatically evolve to match.
I added new columns to the Avro schema like this:
{"name":"newColumn1", "type": "string", "default": ""},
{"name":"newColumn2", "type": "string", "default": ""},
{"name":"newColumn3", "type": "string", "default": ""},
When I use the default property, a column that doesn't exist in the current data returns the default value, and a column that does exist in the current data returns the data value, as expected.
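That behaviour can be imitated in a few lines of Python. This is only a simplified model of how a reader schema fills in defaults, not the Avro library itself, and existingColumn is a made-up stand-in for one of the original 80 fields.

```python
# Reader schema fields: one pre-existing column plus one newly added column.
READER_FIELDS = [
    {"name": "existingColumn", "type": "string"},  # hypothetical original field
    {"name": "newColumn1", "type": "string", "default": ""},
]


def read_with_reader_schema(record, reader_fields):
    """Return the record as the reader schema sees it, filling in defaults."""
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in record:
            # The data has the field: its stored value wins.
            resolved[name] = record[name]
        elif "default" in field:
            # Old data lacks the field: fall back to the declared default.
            resolved[name] = field["default"]
        else:
            raise ValueError(f"field {name!r} is missing and has no default")
    return resolved


old_row = read_with_reader_schema({"existingColumn": "a"}, READER_FIELDS)
new_row = read_with_reader_schema(
    {"existingColumn": "b", "newColumn1": "x"}, READER_FIELDS
)
print(old_row)
print(new_row)
```

The ValueError branch is the case to avoid: a field that was added without a default cannot be filled in when old files are read.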
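To use null as the default, declare the type as a union with "null" first and give the unquoted JSON literal null as the default:
{ "name": "newColumn4", "type": [ "null", "string" ], "default": null },
If "string" comes first in the union, the default has to be a string instead:
{ "name": "newColumn5", "type": [ "string", "null" ], "default": "" },
Note that a quoted "null" is the four-character string, not a null value, and that the Avro specification has traditionally required a union field's default to match the first type in the union. Either way, keep the default property on new columns, since that is what lets old data be read.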