Guaranteeing event order in Debezium

In Debezium's documentation, the structure of a change event key looks like this:
{
"schema": {
"type": "struct",
"name": "mysql-server-1.inventory.customers.Key",
"optional": false,
"fields": [
{
"field": "id",
"type": "int32",
"optional": false
}
]
},
"payload": {
"id": 1001
}
}
Question 1: Events for the row in table customers with id=1001 always have the same key, right?
Question 2: Since Kafka sends records with the same key to the same partition, the events for customers.id=1001 will be consumed in order, right?
Question 3: If I alter the primary key to varchar, the key will change, so the partition number may change. In that case, how can I guarantee that the events are always consumed in order?

1: Yes.
2: Yes.
3: If you change the primary key -- either just its value, or even its type -- you won't have any ordering guarantees between events before and after that change.
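To illustrate answers 2 and 3, here is a minimal sketch of key-based partitioning. It is not Kafka's actual partitioner: Kafka's default partitioner hashes the serialized key bytes with murmur2, while this sketch substitutes zlib.crc32 to stay dependency-free. The function name partition_for, the JSON serialization of the key, and the partition count of 6 are all assumptions made for the example.

```python
import json
import zlib


def partition_for(key: dict, num_partitions: int) -> int:
    """Map a key to a partition by hashing its serialized bytes.

    Assumption: crc32 stands in for the murmur2 hash used by Kafka's
    default partitioner; only the principle (hash of key bytes modulo
    partition count) is the same.
    """
    key_bytes = json.dumps(key, sort_keys=True).encode("utf-8")
    return zlib.crc32(key_bytes) % num_partitions


int_key = {"id": 1001}    # key while the primary key is an int
str_key = {"id": "1001"}  # key after the primary key becomes a varchar

# The same key bytes always hash to the same partition.
same_partition_each_time = partition_for(int_key, 6) == partition_for(int_key, 6)

# The serialized bytes differ after the type change, so the hash (and
# possibly the partition) can differ as well.
key_bytes_differ = json.dumps(int_key) != json.dumps(str_key)
```

Since Kafka only orders records within a single partition, events written under the old key and events written under the new key have no guaranteed relative order once they can land in different partitions.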

Related

When I use PARTITION BY in KSQL, the field used by PARTITION BY is missing from the value

I have a topic test_partition_key_stream whose messages have a null key and a value like this:
{ "id": 1, "age": 18, "name": "lisa" }
Then I did this:
CREATE STREAM TEST_STREAM_JSON (id INT ,age INT ,name VARCHAR) WITH (KAFKA_TOPIC = 'test_partition_key_stream', VALUE_FORMAT = 'JSON');
CREATE STREAM TEST_STREAM_AVRO WITH (PARTITIONS=3, VALUE_FORMAT='AVRO') AS SELECT * FROM TEST_STREAM_JSON PARTITION BY ID;
But when I use PARTITION BY, the ID field is missing from the topic's value.
The value schema generated for the new topic is:
{ "fields": [ { "default": null, "name": "AGE", "type": [ "null", "int" ] }, { "default": null, "name": "NAME", "type": [ "null", "string" ] } ], "name": "KsqlDataSourceSchema", "namespace": "io.confluent.ksql.avro_schemas", "type": "record" }
I want the new topic to be partitioned by ID, but I don't want to lose ID from the value.
Resolved.
The PARTITION BY clause moves the columns into the key. If you want them in the value as well, you must copy them using the AS_VALUE function.
See the docs: https://docs.ksqldb.io/en/latest/developer-guide/joins/partition-data/

ksqlDB keeps saying: VALUE_FORMAT should support schema inference when VALUE_SCHEMA_ID is provided. Current format is JSON

I'm trying to create a stream in ksqlDB on a Kafka topic using an Avro schema.
The command looks like this:
CREATE STREAM customer_stream WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON', VALUE_SCHEMA_ID=1);
The topic customers looks like this (output of the command print 'customers';):
Key format: ¯_(ツ)_/¯ - no data processed
Value format: JSON or KAFKA_STRING
rowtime: 2022/09/29 12:34:53.440 Z, key: , value: {"Name":"John Smith","PhoneNumbers":["212 555-1111","212 555-2222"],"Remote":false,"Height":"62.4","FicoScore":" > 640"}, partition: 0
rowtime: 2022/09/29 12:34:53.440 Z, key: , value: {"Name":"Jane Smith","PhoneNumbers":["269 xxx-1111","269 xxx-2222"],"Remote":false,"Height":"69.9","FicoScore":" > 690"}, partition: 0
An Avro schema has been added to this topic:
{
"type": "record",
"name": "Customer",
"namespace": "com.acme.avro",
"fields": [{
"name": "ficoScore",
"type": ["null", "string"],
"default": null
}, {
"name": "height",
"type": ["null", "double"],
"default": null
}, {
"name": "name",
"type": ["null", "string"],
"default": null
}, {
"name": "phoneNumbers",
"type": ["null", {
"type": "array",
"items": ["null", "string"]
}
],
"default": null
}, {
"name": "remote",
"type": ["null", "boolean"],
"default": null
}
]
}
When I run the command below, I get this reply:
CREATE STREAM customer_stream WITH (KAFKA_TOPIC='customers', VALUE_FORMAT='JSON', VALUE_SCHEMA_ID=1);
VALUE_FORMAT should support schema inference when VALUE_SCHEMA_ID is provided. Current format is JSON.
Any suggestions?
JSON doesn't use schema IDs. The JSON_SR format does, but if you want Avro, you need to use AVRO as the format.
You don't "add schemas" to topics. You can only register them in the registry.
Example of converting JSON to Avro with kSQL:
CREATE STREAM sensor_events_json (sensor_id VARCHAR, temperature INTEGER, ...)
WITH (KAFKA_TOPIC='events-topic', VALUE_FORMAT='JSON');
CREATE STREAM sensor_events_avro WITH (VALUE_FORMAT='AVRO') AS SELECT * FROM sensor_events_json;
Notice that you don't need to refer to any ID, as the serializer will auto-register the necessary schema.
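One way to see why plain JSON has no schema ID: messages written by the Schema Registry serializers (AVRO, JSON_SR, PROTOBUF) start with a magic byte 0 followed by a 4-byte big-endian schema ID, while plain JSON is just the text of the document. The following is a rough sketch for checking a raw message value; the function name schema_id_of and the sample payloads are made up for illustration.

```python
import struct
from typing import Optional


def schema_id_of(value: bytes) -> Optional[int]:
    """Return the schema ID if the value carries Schema Registry framing.

    Framing: 1 magic byte (0x00) + 4-byte big-endian schema ID + payload.
    Returns None for values without that prefix, such as plain JSON.
    """
    if len(value) >= 5 and value[0] == 0:
        return struct.unpack(">I", value[1:5])[0]
    return None


# Plain JSON, like the records in the customers topic above.
plain_json = b'{"Name":"John Smith","Remote":false}'

# Hypothetical framed value: magic byte, schema ID 1, then placeholder bytes.
framed = b"\x00" + struct.pack(">I", 1) + b"serialized-payload"

plain_id = schema_id_of(plain_json)   # None: no schema ID in plain JSON
framed_id = schema_id_of(framed)      # 1
```

The records printed from the customers topic start with `{`, so they are plain JSON and there is no ID for VALUE_SCHEMA_ID to match against.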

Default value for a record in AVRO?

I want to add a new field of type "record" to an Avro schema. The field cannot be null and therefore has a default value. The topic is set to compatibility type "Full_Transitive".
The schema did not change from the last version, except that the last field, produktType, was added:
{
"type": "record",
"name": "Finished",
"namespace": "com.domain.finishing",
"doc": "Schema to indicate the end of the ongoing saga...",
"fields": [
{
"name": "numberOfAThing",
"type": [
"null",
{
"type": "string",
"avro.java.string": "String"
}
],
"default": null
},
{
"name": "previousNumbersOfThings",
"type": {
"type": "array",
"items": {
"type": "string",
"avro.java.string": "String"
}
},
"default": []
},
{
"name": "produktType",
"type": {
"type": "record",
"name": "ProduktType",
"fields": [
{
"name": "art",
"type": "int",
"default": 1
},
{
"name": "code",
"type": "int",
"default": 10003
}
]
},
"default": { "art": 1, "code": 10003 }
}
]
}
I've checked with the schema-registry that the new version of the schema is compatible.
But when we try to read old messages that do not contain the new field using the new schema (where the defaults are), we get an EOF exception and it does not seem to work.
The part that causes headaches is the newly added field "produktType". It cannot be null, so we tried adding defaults, which is possible for primitive type fields ("int" and so on). The line "default": { "art": 1, "code": 10003 } seems to be OK with the schema registry but does not seem to have an effect when we read messages from the topic that do not contain this field.
The schema registry also marks the schema as not compatible when the last "default": { "art": 1, "code": 10003 } line is missing (but "default": true also works regarding schema compatibility...).
The AVRO specification for complex types contains an example for type "record" with default {"a": 1}, so that is where we got the idea from. But since it's not working, something is still wrong.
There are similar questions like this one claiming records can only have null as a default or this un-answered one.
Is this supposed to work? And if so how can defaults for these "type": "record" fields be defined? Or is it still true that records can only have null as default?
Thanks!
Update on the compatibility cases:
Schema V1 (old one without the new field): can read v1 and v2 records.
Schema V2 (new field added): cannot read v1 records, can read v2 records.
The case where a consumer using schema v2 encounters records written with v1 is the surprising one, as I thought the defaults were for that purpose.
Even weirder: when I don't set the new field's values at all, the v2 record still contains some values:
I have no idea where the value for code is from. The schema uses other numbers for its defaults:
So one of them seems to work, the other does not.
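For reference, the rule from the Avro specification's schema resolution section is: if the reader's record schema has a field with a default value and the writer's schema has no field of that name, the reader uses the default. The sketch below is a toy model of that rule over plain dicts, not the Avro library. The function resolve_record and the simplified field lists are invented for this illustration. Note that the rule needs the writer's schema (here, the set of fields actually present in the v1 record), and nothing in this post establishes whether that is related to the EOF exception.

```python
import copy


def resolve_record(writer_record: dict, reader_fields: list) -> dict:
    """Toy model of Avro schema resolution for a single record.

    writer_record: field values decoded with the writer's (v1) schema.
    reader_fields: reader (v2) fields as dicts with "name" and an
                   optional "default".
    """
    resolved = {}
    for field in reader_fields:
        name = field["name"]
        if name in writer_record:
            # Field exists in the written data: take the written value.
            resolved[name] = writer_record[name]
        elif "default" in field:
            # Field is missing from the writer's data: fall back to the
            # reader's default (copied so records do not share state).
            resolved[name] = copy.deepcopy(field["default"])
        else:
            raise ValueError(f"no value and no default for field {name!r}")
    return resolved


# Simplified view of the v2 reader schema from the question.
reader_fields = [
    {"name": "numberOfAThing", "default": None},
    {"name": "previousNumbersOfThings", "default": []},
    {"name": "produktType", "default": {"art": 1, "code": 10003}},
]

# A record written with v1, which has no produktType field.
v1_record = {"numberOfAThing": "42", "previousNumbersOfThings": []}

v2_view = resolve_record(v1_record, reader_fields)
```

Under this rule, a v1 record read with the v2 schema would end up with produktType set to {"art": 1, "code": 10003}.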

Kafka JDBC Source connector Insert or Update

I configured a Kafka JDBC Source connector to push changed records (inserts or updates) from a PostgreSQL database to a Kafka topic.
I use "timestamp+incrementing" mode, which seems to work fine.
I didn't configure the JDBC Sink connector because I'm using a Kafka consumer that listens on the topic.
The message on the topic is JSON. Here is an example:
{
"schema": {
"type": "struct",
"fields": [
{
"type": "int64",
"optional": false,
"field": "id"
},
{
"type": "int64",
"optional": true,
"name": "org.apache.kafka.connect.data.Timestamp",
"version": 1,
"field": "entity_create_date"
},
{
"type": "int64",
"optional": true,
"name": "org.apache.kafka.connect.data.Timestamp",
"version": 1,
"field": "entity_modify_date"
},
{
"type": "int32",
"optional": true,
"field": "entity_version"
},
{
"type": "string",
"optional": true,
"field": "firstname"
},
{
"type": "string",
"optional": true,
"field": "lastname"
}
],
"optional": false,
"name": "author"
},
"payload": {
"id": 1,
"entity_create_date": 1600287236682,
"entity_modify_date": 1600287236682,
"entity_version": 1,
"firstname": "George",
"lastname": "Orwell"
}
}
As you can see, there is no information about whether the Source connector captured this change because of an insert or an update.
I need this information. How can I solve this?
You can't get that information using the JDBC Source connector, unless you do something bespoke in the source schema and triggers.
This is one of the reasons why log-based CDC is generally a better way to get events from the source database. Its advantages include:
capturing deletes
capturing the type of operation
capturing all events, not just what's there at the time when the connector polls.
For more details on the nuances of this, see this blog or a talk based on the same.
Using a CDC-based approach, as suggested by @Robin Moffatt, may be the proper way to handle your requirement. Check out https://debezium.io/
However, looking at your table data, you could use "entity_create_date" and "entity_modify_date" in your consumer to determine whether the message is an insert or an update. If "entity_create_date" = "entity_modify_date", it's an insert; otherwise it's an update.
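The timestamp comparison could look roughly like this in a consumer. This is only a sketch: the function name classify_change is made up, the sample messages omit the "schema" part for brevity, and it assumes the application writes identical create and modify timestamps on insert. A row that is inserted and then updated between two connector polls would show up once and be classified as an update.

```python
import json


def classify_change(message_value: str) -> str:
    """Guess whether a JDBC Source record stems from an insert or an update.

    Assumption: on insert, the application sets entity_create_date and
    entity_modify_date to the same value.
    """
    payload = json.loads(message_value)["payload"]
    created = payload["entity_create_date"]
    modified = payload["entity_modify_date"]
    return "insert" if created == modified else "update"


# Sample values shaped like the message above ("schema" omitted).
inserted = json.dumps({"payload": {
    "id": 1,
    "entity_create_date": 1600287236682,
    "entity_modify_date": 1600287236682,
}})
updated = json.dumps({"payload": {
    "id": 1,
    "entity_create_date": 1600287236682,
    "entity_modify_date": 1600290000000,  # hypothetical later modification
}})

first_kind = classify_change(inserted)
second_kind = classify_change(updated)
```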

Kafka Connect produces CDC events with null value when reading from MongoDB with Debezium

I'm reading a Kafka topic that contains lots of CDC events produced by Kafka Connect using Debezium. The data source is a MongoDB collection with a TTL. Some of the CDC events are null, and they appear in between the deletion events. What does this mean?
As I understand it, all CDC events should have the CDC event structure, including deletion events. Why are there events with a null value?
null,
{
"after": null,
"patch": null,
"source": {
"version": "0.9.3.Final",
"connector": "mongodb",
"name": "test",
"rs": "rs1",
"ns": "testestest",
"sec": 1555060472,
"ord": 297,
"h": 1196279425766381600,
"initsync": false
},
"op": "d",
"ts_ms": 1555060472177
},
null,
{
"after": null,
"patch": null,
"source": {
"version": "0.9.3.Final",
"connector": "mongodb",
"name": "test",
"rs": "rs1",
"ns": "testestest",
"sec": 1555060472,
"ord": 298,
"h": -2199232943406075600,
"initsync": false
},
"op": "d",
"ts_ms": 1555060472177
}
I use https://debezium.io/docs/connectors/mongodb/ without flattening any event, and use the config as follows:
{
"connector.class": "io.debezium.connector.mongodb.MongoDbConnector",
"mongodb.hosts": "live.xxx.xxx:27019",
"mongodb.name": "testmongodb",
"collection.whitelist": "testest",
"tasks.max": 4,
"snapshot.mode": "never",
"poll.interval.ms": 15000
}
These are so-called tombstone events, which let log compaction correctly remove deleted records. See https://kafka.apache.org/documentation/#compaction:
Compaction also allows for deletes. A message with a key and a null payload will be treated as a delete from the log. This delete marker will cause any prior message with that key to be removed (as would any new message with that key), but delete markers are special in that they will themselves be cleaned out of the log after a period of time to free up space. The point in time at which deletes are no longer retained is marked as the "delete retention point" in the above diagram.
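On the consumer side, this means the record value can be null and has to be checked before parsing. A minimal sketch, assuming the value arrives as raw bytes (or None for a tombstone) and that events are unflattened Debezium envelopes with an "op" field as shown in the question; the function name handle_record is made up for this example.

```python
import json
from typing import Optional

# Debezium operation codes: create, update, delete, read (snapshot).
OP_NAMES = {"c": "create", "u": "update", "d": "delete", "r": "read"}


def handle_record(value: Optional[bytes]) -> str:
    """Classify a record value from a Debezium topic."""
    if value is None:
        # Tombstone: same key as the preceding delete event, null value.
        # It exists only so log compaction can drop the key.
        return "tombstone"
    event = json.loads(value)
    return OP_NAMES.get(event["op"], "unknown")


delete_event = b'{"after": null, "patch": null, "op": "d", "ts_ms": 1555060472177}'

kinds = [handle_record(delete_event), handle_record(None)]
```

In the topic from the question, each delete event followed by a null corresponds to one such delete/tombstone pair for the same key.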