Possible option for PrimaryKey in Table creation with KSQL? - apache-kafka

I've started working with KSQL and am quite loving the experience. I'm trying to join a Table with a Stream, and the scenario is as follows.
I have a sample data set like this:
"0117440512","0134217727","US","United States","VIRGINIA","Vienna","DoD Network Information Center"
"0134217728","0150994943","US","United States","MASSACHUSETTS","Woburn","Genuity"
in my Kafka topic-1. It is a static data set loaded into a Table and might get updated once a month or so.
I have one more data set like:
{"state":"AD","id":"020","city":"Andorra","port":"02","region":"Canillo"}
{"state":"GD","id":"024","city":"Arab","port":"29","region":"Ordino"}
in Kafka topic-2. It is a stream of data being loaded into a Stream.
A Table can't be created without specifying a key, but my data doesn't have a unique column. So when loading data from topic-1 into the Table, what exactly should my key be? Remember that my Table might get populated/updated once a month or so with the same data and new records too. When new data is loaded, I can replace the existing rows by key.
I tried to find something like the auto-incrementing value that we use as a PrimaryKey in SQL, but didn't find any.
Can someone help me correct my approach to the implementation, or share a query to create a PrimaryKey if one exists? Thanks

No, KSQL doesn't have the concept of a self-incrementing key. You have to define the key when you produce the data into the topic on which the KSQL Table is defined.
--- EDIT
If you want to set the key on a message as it's ingested through Kafka Connect, you can use a Single Message Transform (SMT).
"transforms":"createKey,extractInt",
"transforms.createKey.type":"org.apache.kafka.connect.transforms.ValueToKey",
"transforms.createKey.fields":"id",
"transforms.extractInt.type":"org.apache.kafka.connect.transforms.ExtractField$Key",
"transforms.extractInt.field":"id"
See here for more details.
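As a rough illustration of what those two transforms do to each record, here is a plain-Java model that uses maps instead of Kafka Connect's own record types. The class and method names are hypothetical and this is not Connect's implementation; it only mirrors the idea that ValueToKey builds a key from selected value fields and ExtractField$Key then unwraps a single field.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

final class KeyFromValueSketch {
    // Models ValueToKey: build a key holding only the listed value fields.
    static Map<String, Object> valueToKey(Map<String, Object> value, List<String> fields) {
        Map<String, Object> key = new LinkedHashMap<>();
        for (String field : fields) {
            key.put(field, value.get(field));
        }
        return key;
    }

    // Models ExtractField$Key: unwrap one field so the key becomes a plain value.
    static Object extractField(Map<String, Object> key, String field) {
        return key.get(field);
    }

    public static void main(String[] args) {
        Map<String, Object> value = new LinkedHashMap<>();
        value.put("state", "AD");
        value.put("id", "020");
        value.put("city", "Andorra");

        Map<String, Object> structKey = valueToKey(value, List.of("id"));
        Object primitiveKey = extractField(structKey, "id");
        System.out.println(structKey + " -> " + primitiveKey);
    }
}
```

With the key reduced to the plain `id` value, a table defined on the topic can treat later records with the same `id` as replacements.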

Related

Using ksqlDB to implement CDC using multiple event types in a single topic?

I have the following situation where I have an Apache Kafka topic containing numerous record types.
For example:
UserCreated
UserUpdated
UserDeleted
AnotherRecordType
...
I wish to implement CDC on the three listed User* record types such that at the end, I have an up-to-date KTable with all user information.
How can I do this in ksqlDB? Since, as far as I know, Debezium and other CDC connectors also source their data from a single topic, I at least know it should be possible.
I've been reading through the Confluent docs for a while now, but I can't seem to find anything quite pertinent to my use case (CDC using existing topic). If there is anything I've overlooked, I would greatly appreciate a link to the relevant documentation as well.
I assume that, at the very least, the records must have the same key for ksqlDB to be able to match them. So my questions boil down to:
How would I teach ksqlDB which is an insert, an update and a delete?
Is the key matching a hard requirement, or are there other join/match predicates that we can use?
One possibility that I can think of is basically how CDC already does it: treat each incoming record as a new entry so that I can have something like a slowly changing dimension in the KTable, grouping on the key and selecting entries with e.g. the latest timestamp.
So, is something like the following:
CREATE TABLE users AS
SELECT user.user_id,
latest_by_offset(user.name) AS name,
latest_by_offset(user.email),
CASE WHEN record.key = UserDeleted THEN true ELSE FALSE END,
user.timestamp,
...
FROM users
GROUP BY user.user_id
EMIT CHANGES;
possible (using e.g. ROWKEY for record.key)? If not, how does e.g. Debezium do it?
The general pattern is to not have different schema types; just User. Then, the first record for any unique key (userid, for example) is an insert. Afterwards, any non-null values for the same key are updates (generally requiring all fields to be part of the value, effectively doing a "replace" operation in the table). Deletes are caused by sending null values for the key (tombstone events).
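The insert/update/delete rules described above can be modelled in a few lines of plain Java. This is only a sketch of the changelog semantics with a hypothetical class name and string values; it does not use ksqlDB or Kafka Streams.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ChangelogTableSketch {
    // Current table state: exactly one row per key.
    private final Map<String, String> rows = new LinkedHashMap<>();

    // Apply one changelog record. A null value is a tombstone (delete);
    // any other value inserts the key or fully replaces its previous row.
    void apply(String key, String value) {
        if (value == null) {
            rows.remove(key);
        } else {
            rows.put(key, value);
        }
    }

    // Defensive copy of the current rows.
    Map<String, String> snapshot() {
        return new LinkedHashMap<>(rows);
    }

    public static void main(String[] args) {
        ChangelogTableSketch users = new ChangelogTableSketch();
        users.apply("u1", "{\"name\":\"Ann\"}");  // first record for u1: insert
        users.apply("u1", "{\"name\":\"Anna\"}"); // same key again: update (replace)
        users.apply("u2", "{\"name\":\"Bob\"}");  // insert
        users.apply("u2", null);                  // tombstone: delete
        System.out.println(users.snapshot());
    }
}
```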
If you have multiple schemas, it might be better to create a new stream that nulls out the values of any delete events, unifies the creates and updates into a common schema containing the information you want, and filters out the event types that you want to ignore.
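A minimal sketch of that normalisation step, written as plain Java rather than as a ksqlDB query. The event type names come from the question; the method name, the string payload, and the use of `Optional` to express "ignore this event" are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleImmutableEntry;
import java.util.Map;
import java.util.Optional;

final class EventUnifierSketch {
    // Turn one typed event into a changelog record (key, value), or into
    // Optional.empty() when the event type should be ignored.
    static Optional<Map.Entry<String, String>> normalize(
            String eventType, String userId, String payload) {
        switch (eventType) {
            case "UserCreated":
            case "UserUpdated":
                // Creates and updates share one schema: the full user payload.
                return Optional.of(new SimpleImmutableEntry<>(userId, payload));
            case "UserDeleted":
                // Deletes become tombstones: same key, null value.
                return Optional.of(new SimpleImmutableEntry<>(userId, (String) null));
            default:
                // Unrelated record types are filtered out.
                return Optional.empty();
        }
    }

    public static void main(String[] args) {
        System.out.println(normalize("UserCreated", "u1", "{\"name\":\"Ann\"}"));
        System.out.println(normalize("UserDeleted", "u1", "{\"name\":\"Ann\"}"));
        System.out.println(normalize("AnotherRecordType", "x", "{}"));
    }
}
```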
how does e.g. Debezium do it?
For consuming data coming from Debezium topics, you can use a transform to "extract the new record state". It doesn't create any tables for you.

Is there any way to check for a duplicate message control ID (MSH:10) in the MSH segment using Mirth Connect?

MSH|^~&|sss|xxx|INSTANCE2|KKLIU 0063/2021|20190905162034||ADT^A28^ADT_A05|Zx20190905162034|P|2.4|||NE|NE|||||
Whenever a message arrives, it needs to be validated: has a message with the same control ID (here Zx20190905162034) already been processed or not?
Mirth will not do this for you, but you can write your own JavaScript transformer to check a database or your own set of previously encountered control ids.
Your JavaScript can make use of any appropriate Java classes.
The database check (which you can implement using a code template) is the easier way out. You might want to designate the column storing MSH:10 values as a primary key or define an index on it; queries against unique entries would be faster. Other alternatives include periodically redeploying the channel and, on deploy, reading all MSH:10 values already in the database into a global map variable, or maintaining them behind an API that you can make a GET request to when processing every message. Which option fits best depends on the number of records involved.
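Since a Mirth JavaScript transformer can call Java classes, the core of the check can be sketched in Java. The class below is hypothetical and keeps the seen IDs in an in-memory set purely for illustration; a real channel would more likely back this with the database table or global map mentioned above. It assumes the message starts with the MSH segment.

```java
import java.util.HashSet;
import java.util.Set;
import java.util.regex.Pattern;

final class ControlIdCheckSketch {
    // Control IDs encountered so far (in memory only, for illustration).
    private final Set<String> seen = new HashSet<>();

    // Extract MSH-10 from a raw HL7 v2 message.
    static String controlId(String hl7Message) {
        String msh = hl7Message.split("[\\r\\n]+")[0];
        if (!msh.startsWith("MSH") || msh.length() < 4) {
            throw new IllegalArgumentException("Message does not start with an MSH segment");
        }
        // MSH-1 is the field separator itself, so after splitting on it
        // index 1 holds MSH-2 and index 9 holds MSH-10.
        String separator = String.valueOf(msh.charAt(3));
        String[] fields = msh.split(Pattern.quote(separator), -1);
        if (fields.length < 10) {
            throw new IllegalArgumentException("MSH segment has no MSH-10 field");
        }
        return fields[9];
    }

    // Returns true if this control ID was already processed; records it otherwise.
    boolean isDuplicate(String hl7Message) {
        return !seen.add(controlId(hl7Message));
    }

    public static void main(String[] args) {
        String msg = "MSH|^~&|sss|xxx|INSTANCE2|KKLIU 0063/2021|20190905162034||"
                + "ADT^A28^ADT_A05|Zx20190905162034|P|2.4|||NE|NE|||||";
        ControlIdCheckSketch check = new ControlIdCheckSketch();
        System.out.println(controlId(msg));
        System.out.println(check.isDuplicate(msg)); // first time seen
        System.out.println(check.isDuplicate(msg)); // second time seen
    }
}
```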

How to deserialize Avro schema and then abandon schema before write to ES Sink Connector using SMT

Use Case and Description
My use case is described more here, but the gist of the issue is:
I am making a custom SMT and want to make sure the Elasticsearch sink connector deserializes incoming records properly, but after that I don't need any sort of schema at all. Each record has a dynamic number of fields set, so I don't want to have any makeUpdatedSchema step (e.g., this code) at all. This both keeps the code simpler and, I would assume, improves performance since I don't have to recreate schemas for each record.
What I tried
I tried doing something like the applySchemaless code as shown here even when the record has a schema by returning something like this, with null for schema:
return newRecord(record, null, updatedValue);
However, at runtime it errors out, saying I have an incompatible schema.
Key Question
I might be misunderstanding the role of the schema at this point in the process (is it needed at all once we're in the Elasticsearch sink connector?) or how it works, and if so that would be helpful to know as well. But is there some way to write a custom SMT like this?

Ideal way to perform lookup on a stream of Kafka topic

I have the following use-case:
There is a stream of records on a Kafka topic. I have another set of unique IDs. For each record in the stream, I need to check whether the record's ID is present in the set of unique IDs I have. Basically, this should serve as a filter for my Kafka Streams app, i.e., only records from the Kafka topic that match my set of unique IDs should be written to another topic.
Our current application is based on Kafka Streams. I looked at KStreams and KTables. Looks like they're good for enrichments. Now, I don't need any enrichments to the data. As for using state stores, I'm not sure how good they are as a scalable solution.
I would like to do something like this:
kStream.filter((k, v) -> {
valueToCheckInKTable = v.get(FIELD_NAME);
if (kTable.containsKey(valueToCheckInKTable)) return record
else ignore
});
The lookup data can be pretty huge. Can someone suggest the best way to do this?
You can read the reference IDs into a table via builder.table("id-topic") with the ID as the primary key. Note that the value must be non-null, otherwise the record would be interpreted as a delete; if there is no actual value, just write any non-null dummy value for each record when you write the IDs into the id-topic. To load the full table on startup, you might want to provide a custom timestamp extractor that always returns 0 via the Consumed parameter on the table() operator (records are processed in timestamp order, and returning 0 ensures that the records from the id-topic are processed first to load the table).
To do the filtering you do a stream-table join:
KStream stream = builder.stream(...);
// the key in the stream must be the ID; if this is not the case, you can use `selectKey()` to set a new key
KStream filteredStream = stream.join(table,...);
As you don't want to do any enrichment, the provided Joiner function can just return the left (stream side) value unmodified and can ignore the right-hand side table value.
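To make the filtering effect of that inner join concrete, here is a plain-Java model that uses a list for the stream and a map for the table. It is a sketch of the semantics only, with hypothetical names, and does not use the Kafka Streams API: a record passes through only when its key has a non-null entry in the table, and the "joiner" keeps the stream value unchanged.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class JoinFilterSketch {
    // Inner stream-table join whose joiner returns the left (stream) value.
    static List<Map.Entry<String, String>> joinKeepLeft(
            List<Map.Entry<String, String>> stream, Map<String, String> table) {
        List<Map.Entry<String, String>> out = new ArrayList<>();
        for (Map.Entry<String, String> record : stream) {
            // A missing table entry means "no match", so the record is dropped.
            if (table.get(record.getKey()) != null) {
                out.add(record); // table value is ignored; stream value is kept as-is
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Table of allowed IDs; the values are non-null dummies.
        Map<String, String> ids = Map.of("020", "x", "031", "x");
        List<Map.Entry<String, String>> stream = List.of(
                Map.entry("020", "Andorra"),
                Map.entry("024", "Arab"));
        System.out.println(joinKeepLeft(stream, ids));
    }
}
```

Because the table lookup is by key, the stream must already be keyed by the ID for the join to act as the intended filter.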

How to use hbase coprocessor to implement groupby?

I recently learned about HBase coprocessors and used an endpoint to sum one column of an HBase table. For example, for the table named "pendings" with the column family "asset", I sum all the values of "asset:amount". The table has other columns, such as "asset:customer_name". What I want to do now is sum the values of "asset:amount" grouped by "asset:customer_name". But I found there is no API for group-by, or at least I did not find one. Do you know how to implement GROUPBY, or how to use an API that HBase provides for it?
You should use an endpoint to do this work.
You have a sum example in this article: https://blogs.apache.org/hbase/entry/coprocessor_introduction.
What you basically need to add is to append your row key and the customer name to form your new key "MyKey". You should keep a variable holding the last seen MyKey; when the current MyKey differs from the previous one, emit the previous one along with its sum and overwrite the previous MyKey with the current one.
You have to make sure to perform the aggregation on the client side as it is done in the example provided in the URL because you may have a customer at the edges of two different regions.
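The two-level aggregation can be sketched in plain Java without any HBase classes. The names below are hypothetical, and the region-side step uses a map keyed by customer name instead of the "last seen key" bookkeeping described above; the point is only to show per-region partial sums being merged on the client.

```java
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

final class GroupBySumSketch {
    // Region-side step (what an endpoint would compute): sum of amounts per
    // customer over the rows of a single region. Each row is (customer, amount).
    static Map<String, Long> partialSums(List<Map.Entry<String, Long>> regionRows) {
        Map<String, Long> sums = new TreeMap<>();
        for (Map.Entry<String, Long> row : regionRows) {
            sums.merge(row.getKey(), row.getValue(), Long::sum);
        }
        return sums;
    }

    // Client-side step: combine the per-region results, because one customer
    // may have rows in more than one region.
    static Map<String, Long> mergePartials(List<Map<String, Long>> partials) {
        Map<String, Long> total = new TreeMap<>();
        for (Map<String, Long> partial : partials) {
            partial.forEach((customer, sum) -> total.merge(customer, sum, Long::sum));
        }
        return total;
    }

    public static void main(String[] args) {
        Map<String, Long> region1 = partialSums(List.of(
                Map.entry("acme", 10L), Map.entry("acme", 5L), Map.entry("bolt", 3L)));
        Map<String, Long> region2 = partialSums(List.of(
                Map.entry("acme", 7L)));
        System.out.println(mergePartials(List.of(region1, region2)));
    }
}
```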
An endpoint coprocessor can do this. First define the relevant interface (a reduce protocol) that extends CoprocessorProtocol, then write an implementation of it, and finally code the client-side logic.