ksql what is the difference between the primary key and with KEY - apache-kafka

ksql: what is the difference between declaring a PRIMARY KEY and using the WITH (KEY=...) property when creating a KTable?
Should they be applied simultaneously?

In both scenarios, they represent the message key, which makes sense when you think about how Kafka handles "tables".
But there is also a difference between the two: in the second scenario, the KEY field is more of an optimization hint. As you can see, the same field is also present in the message value, so it can be extracted from there.
Quoting from the official documentation page:
If the Kafka message key is also present as a field/column in the Kafka message value, you may set this property to associate the corresponding field/column with the implicit ROWKEY column (message key)
Regarding your second question, if you check the latest version of the documentation you'll notice that the second query is no longer valid.
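To make the two scenarios concrete, here is a minimal sketch of both styles (topic and column names are assumptions). The legacy WITH (KEY=...) form associates a value column with the implicit ROWKEY, while the current form (ksqlDB 0.10+) declares the key column directly:

-- Legacy syntax: the key lives in ROWKEY; KEY='id' tells ksqlDB that the
-- value column "id" duplicates it.
CREATE TABLE users_legacy (id VARCHAR, name VARCHAR)
  WITH (KAFKA_TOPIC='users', VALUE_FORMAT='JSON', KEY='id');

-- Current syntax: the key column is declared explicitly with PRIMARY KEY,
-- and no KEY property is needed.
CREATE TABLE users (id VARCHAR PRIMARY KEY, name VARCHAR)
  WITH (KAFKA_TOPIC='users', VALUE_FORMAT='JSON');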

Related

Using ksqlDB to implement CDC using multiple event types in a single topic?

I have the following situation where I have an Apache Kafka topic containing numerous record types.
For example:
UserCreated
UserUpdated
UserDeleted
AnotherRecordType
...
I wish to implement CDC on the three listed User* record types such that at the end, I have an up-to-date KTable with all user information.
How can I do this in ksqlDB? Since, as far as I know, Debezium and other CDC connectors also source their data from a single topic, I at least know it should be possible.
I've been reading through the Confluent docs for a while now, but I can't seem to find anything quite pertinent to my use case (CDC using existing topic). If there is anything I've overlooked, I would greatly appreciate a link to the relevant documentation as well.
I assume that, at the very least, the records must have the same key for ksqlDB to be able to match them. So my questions boil down to:
How would I teach ksqlDB which record is an insert, which an update, and which a delete?
Is the key matching a hard requirement, or are there other join/match predicates that we can use?
One possibility that I can think of is basically how CDC already does it: treat each incoming record as a new entry so that I can have something like a slowly changing dimension in the KTable, grouping on the key and selecting entries with e.g. the latest timestamp.
So, is something like the following:
CREATE TABLE users AS
  SELECT user.user_id,
         latest_by_offset(user.name) AS name,
         latest_by_offset(user.email) AS email,
         CASE WHEN record.key = 'UserDeleted' THEN true ELSE false END AS deleted,
         user.timestamp,
         ...
  FROM users
  GROUP BY user.user_id
  EMIT CHANGES;
possible (using e.g. ROWKEY for record.key)? If not, how does e.g. Debezium do it?
The general pattern is to not have different schema types, just User. Then the first record for any unique key (user_id, for example) is an insert. Afterwards, any non-null value for the same key is an update (generally requiring all fields to be part of the value, effectively performing a "replace" operation on the table). Deletes are expressed by sending a null value for the key (a tombstone event).
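For example, for a single key the record sequence on the topic would be interpreted like this (values are illustrative):

key=42, value={"name":"a","email":"a@x"}   <- first record for key 42: insert
key=42, value={"name":"b","email":"b@x"}   <- non-null value for an existing key: update (full replace)
key=42, value=null                         <- tombstone: delete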
If you have multiple schemas, it might be better to create a new stream that nulls out the delete events, unifies the creates and updates into a common schema holding the information you want, and filters out the event types you want to ignore.
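A minimal ksqlDB sketch of such a unifying stream, assuming a source stream user_events with hypothetical columns user_id, record_type, name, and email:

CREATE STREAM user_changes AS
  SELECT user_id,
         -- null out the payload of delete events
         CASE WHEN record_type = 'UserDeleted' THEN CAST(NULL AS VARCHAR) ELSE name END AS name,
         CASE WHEN record_type = 'UserDeleted' THEN CAST(NULL AS VARCHAR) ELSE email END AS email
  FROM user_events
  -- keep only the User* event types; everything else is ignored
  WHERE record_type = 'UserCreated'
     OR record_type = 'UserUpdated'
     OR record_type = 'UserDeleted'
  EMIT CHANGES;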
how does e.g. Debezium do it?
For consuming data coming from Debezium topics, you can use a transform to "extract the new record state". It doesn't create any tables for you.
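For reference, a hedged sketch of the corresponding Kafka Connect configuration, using Debezium's ExtractNewRecordState transform (the drop.tombstones value shown is an assumption; set it according to whether you want tombstones passed through):

"transforms":"unwrap",
"transforms.unwrap.type":"io.debezium.transforms.ExtractNewRecordState",
"transforms.unwrap.drop.tombstones":"false"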

Allow negative primary keys in SailsJS

Is there any way to allow primary keys to be negative integers in Sails?
I ran across the following error when testing some older software:
{
  "code": "E_INVALID_VALUES_TO_SET",
  "details": "Could not use specified `org`. Expecting an id representing the associated record, or `null` to indicate there will be no associated record. But the specified value is not a valid `org`. Cannot use a negative number (-1) as a primary key value.",
  "message": "The server could not fulfill this request (`PATCH /user/1402`) due to a problem with the parameters that were sent. See the `details` for more info. **The following additional tip will not be shown in production**: Tip: Check your client-side code to make sure that the request data it sends matches the expectations of the corresponding attribues in your model. Also check that your client-side code sends data for every required attribute."
}
I've checked the Sails documentation and can't find any place which mentions that negative primary keys are not allowed.
I've also checked the schema definitions for both tables, and neither specifies the relevant field as unsigned.
Is there any workaround other than changing the relevant row to some different id and updating every other row which references it?
Here is a workaround: you could change your primary key to a string.
https://sailsjs.com/documentation/concepts/models-and-orm/model-settings
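For instance, a minimal sketch of a Sails model declaring a string primary key (the model file and attribute definitions are assumptions, not the asker's actual schema):

// api/models/User.js
module.exports = {
  attributes: {
    // Overriding the default auto-incrementing numeric id with a string
    // sidesteps the validation that rejects negative numbers.
    id: { type: 'string', required: true },
    // ...other attributes
  },
};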

Apache Nifi: Is there a way to publish messages to kafka with a message key as combination of multiple attributes?

I have a requirement where I need to read a CSV and publish to a Kafka topic in Avro format. During the publish, I need to set the message key as the combination of two attributes. Let's say I have an attribute called id and an attribute called group. I need my message key to be id+"-"+group. Is there a way I can achieve this in an Apache NiFi flow? Setting the message key to a single attribute works fine for me.
Yes, in the PublishKafka_2_0 (or whatever version you're using), set the Kafka Key property to construct your message key using NiFi Expression Language. For your example, the expression ${id}-${group} will form it (e.g. id=myId & group=MyGroup -> myId-myGroup).
If you don't populate this property explicitly, the processor looks for the attribute kafka.key, so if you had previously set that value, it would be passed through.
Additional information after comment 2020-06-15 16:49
Ah, so PublishKafkaRecord will publish multiple messages to Kafka, each corresponding to a record in the single NiFi flowfile. In this case, the property is asking for a field (a record term meaning some element of the record schema) to use to populate that message key. I would suggest using UpdateRecord before this processor to add a field called messageKey (or whatever you like) to each record using Expression Language, then reference this field in the publishing processor property, as sketched below.
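A hedged sketch of that configuration, assuming id and group are flowfile attributes as described in the question (the field name messageKey is arbitrary):

UpdateRecord
  Record Reader:               your configured reader
  Record Writer:               your configured writer
  Replacement Value Strategy:  Literal Value
  /messageKey:                 ${id}-${group}

PublishKafkaRecord_2_0
  Message Key Field:           messageKey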
Notice the (?) help icon on each property, which indicates what is or isn't allowed.
When a field doesn't accept Expression Language, use an UpdateAttribute processor to set the combined value you need. Then you can use the combined value downstream.
Thank you for your input. I had to change my initial design of producing with a combined key to instead partitioning the file based on a specific field using the PartitionRecord processor. I have a date field in my CSV file, and there can be multiple records per date. I partition based on this date field and produce to the Kafka topics using the id field as the key for each partition. The Kafka topic name is dynamic and is suffixed with the date value. Since I plan to use Kafka Streams to read data from these topics, this is a much better design than the initial one.

Achieving tombstoning in Kafka

I have a KStream with key-value pairs that are grouped by key. Every key should be unique, and the only reason it might not be is when the same key is streamed again with null as its value.
In my streams application I need to filter out all records for a key if the value of one of that key's records is null (a tombstone). How do I get started?
KStream<Key, Value> table = builder.stream(kafkaProperties.getTopicName());
// If key exists multiple times, check for null value and if found
// remove / ignore record
So a key whose values are all non-null needs to stay, but when a null value arrives for a key, the complete key with all of its values needs to be thrown away.
This is quite tricky to achieve. Data is processed linearly, and thus you would need to buffer all key-value pairs in a state store, e.g., using a transform(). You would insert each input key-value pair into the key-value store, and if you receive a null value, delete it from the store.
The difficult part is deciding/knowing that there won't be any null value for a key in the future. How to determine this depends on your overall setup, and there is no generic answer. If you can decide at some point that a key-value pair in the store cannot receive a future tombstone, you can send it downstream and also delete it from the store.
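A minimal Java sketch of this buffering approach, assuming String keys and values with default String serdes; the topic names, the store name, and in particular the punctuation-based flush policy are placeholders, since deciding when a key is "safe" to emit is application-specific:

import java.time.Duration;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.Topology;
import org.apache.kafka.streams.kstream.Transformer;
import org.apache.kafka.streams.processor.ProcessorContext;
import org.apache.kafka.streams.processor.PunctuationType;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.Stores;

public class TombstoneFilter {

    public static Topology buildTopology() {
        StreamsBuilder builder = new StreamsBuilder();

        // Register the store in which incoming key-value pairs are buffered.
        builder.addStateStore(Stores.keyValueStoreBuilder(
                Stores.persistentKeyValueStore("buffer"),
                Serdes.String(), Serdes.String()));

        builder.<String, String>stream("input-topic")
                .transform(() -> new Transformer<String, String, KeyValue<String, String>>() {
                    private ProcessorContext context;
                    private KeyValueStore<String, String> store;

                    @Override
                    @SuppressWarnings("unchecked")
                    public void init(ProcessorContext context) {
                        this.context = context;
                        this.store = (KeyValueStore<String, String>) context.getStateStore("buffer");
                        // Placeholder policy: periodically forward whatever is still
                        // buffered. Deciding when a key can no longer receive a
                        // tombstone is application-specific, as noted above.
                        context.schedule(Duration.ofMinutes(5),
                                PunctuationType.WALL_CLOCK_TIME, timestamp -> {
                            try (KeyValueIterator<String, String> it = store.all()) {
                                while (it.hasNext()) {
                                    KeyValue<String, String> entry = it.next();
                                    context.forward(entry.key, entry.value);
                                    store.delete(entry.key);
                                }
                            }
                        });
                    }

                    @Override
                    public KeyValue<String, String> transform(String key, String value) {
                        if (value == null) {
                            store.delete(key); // tombstone: drop the buffered pair for this key
                        } else {
                            store.put(key, value);
                        }
                        return null; // records are only emitted from the punctuator
                    }

                    @Override
                    public void close() { }
                }, "buffer")
                .to("output-topic");

        return builder.build();
    }
}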

Possible option for PrimaryKey in Table creation with KSQL?

I've started working with KSQL and am quite enjoying the experience. I'm trying to work with a Table and Stream join, and the scenario is as below.
I have a sample data set like this:
"0117440512","0134217727","US","United States","VIRGINIA","Vienna","DoD Network Information Center"
"0134217728","0150994943","US","United States","MASSACHUSETTS","Woburn","Genuity"
in my Kafka topic-1. It is a static data set loaded into a Table and might get updated once a month or so.
I have one more data set like:
{"state":"AD","id":"020","city":"Andorra","port":"02","region":"Canillo"}
{"state":"GD","id":"024","city":"Arab","port":"29","region":"Ordino"}
in Kafka topic-2. It is a stream of data being loaded into a Stream.
Since a Table can't be created without specifying the key, and my data doesn't have a unique column to use, what exactly should my key be while loading data from topic-1 into the Table? Remember that my Table might get populated/updated once a month or so, with the same data and new data too. With new data being loaded, I can replace existing rows using the key.
I tried to find whether there's something like an auto-incrementing value, like what we call a PRIMARY KEY in SQL, but didn't find any.
Can someone help me correct my approach to the implementation, or suggest a query to create a PrimaryKey if one exists? Thanks.
No, KSQL doesn't have the concept of a self-incrementing key. You have to define the key when you produce the data into the topic on which the KSQL Table is defined.
--- EDIT
If you want to set the key on a message as it's ingested through Kafka Connect, you can use a Single Message Transform (SMT):
"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"
See here for more details.
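Once the messages are keyed (for example via the SMT above), the table can be declared over the topic. A minimal sketch for the first sample data set, with purely illustrative column names and the legacy KEY property syntax from the era of this question:

CREATE TABLE ip_ranges (start_ip VARCHAR, end_ip VARCHAR, country_code VARCHAR,
                        country VARCHAR, state VARCHAR, city VARCHAR, org VARCHAR)
  WITH (KAFKA_TOPIC='topic-1', VALUE_FORMAT='DELIMITED', KEY='start_ip');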