Delete a record in a sink table with ksqlDB

I successfully created a Debezium source connector for Oracle and pulled some data into Kafka topics.
After that I created a stream on the existing topic generated by the Debezium connector:
CREATE STREAM "MySourceTable" WITH (KAFKA_TOPIC='server1.DEBEZIUM.MySourceTable', VALUE_FORMAT='AVRO');
--Stream was created successfully.
Then I try to create a new stream with some of the data extracted:
CREATE OR REPLACE STREAM "CustomTable1" WITH (KAFKA_TOPIC='server2.DEBEZIUM.CustomTable1', VALUE_FORMAT='AVRO', PARTITIONS=1, KEY_FORMAT='AVRO') AS
SELECT after->customTableId AS "key_id",
AS_VALUE(after->customTableId) AS "id",
after->customTableHref AS "href",
after->customTableName AS "name",
after->customTableType AS "type",
after->id AS "mySourceTableId"
FROM "MySourceTable"
WHERE (after->customTableId IS NOT NULL) EMIT CHANGES;
--Stream was created successfully.
Then I delete the value in the customTableId column of MySourceTable, so it becomes null in my source DB. I can capture changes to other columns, like customTableHref and customTableName, and see the result in the sink DB. But the records whose id is now null in the source DB still carry their old, non-null id in the sink DB: they still exist in my sink table (id is the primary key in the sink table). How can I solve this problem?
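One hedged idea (an assumption on my part, not something confirmed by the setup above, and relying on Debezium's default before/after envelope): route the rows whose customTableId was just cleared into a separate stream, carrying the old id taken from the before image, so a downstream consumer or connector can issue the matching delete against the sink table. The stream and topic names below are made up for illustration:
-- Sketch only: captures rows where customTableId went from non-null to null,
-- keeping the old id from the Debezium "before" image. Producing an actual
-- tombstone (null-valued record) for the sink still needs support on the
-- connector side, e.g. delete handling in the sink connector.
CREATE OR REPLACE STREAM "CustomTable1Removed" WITH (KAFKA_TOPIC='server2.DEBEZIUM.CustomTable1.removed', VALUE_FORMAT='AVRO', PARTITIONS=1) AS
SELECT before->customTableId AS "id",
before->id AS "mySourceTableId"
FROM "MySourceTable"
WHERE after->customTableId IS NULL AND before->customTableId IS NOT NULL
EMIT CHANGES;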

Related

KSQL Windowed Aggregation Stream

I am trying to group events by one of their properties and over time, using KSQL windowed aggregation, specifically the session window.
I have a STREAM built from a Kafka topic with the TIMESTAMP property properly specified.
When I try to create a STREAM with session windowing using a query like:
CREATE STREAM SESSION_STREAM AS
SELECT ...
FROM EVENT_STREAM
WINDOW SESSION (5 MINUTES)
GROUP BY ...;
I always get the error:
Your SELECT query produces a TABLE. Please use CREATE TABLE AS SELECT statement instead.
Is it possible to create a STREAM with a Windowed Aggregation?
When I try, as suggested, to create a TABLE and then a STREAM containing all the session-start events, with a query like:
CREATE STREAM SESSION_START_STREAM AS
SELECT *
FROM SESSION_TABLE
WHERE WINDOWSTART=WINDOWEND;
KSQL informs me that:
KSQL does not support persistent queries on windowed tables
How can I create a STREAM of the events that start a session window in KSQL?
Your CREATE STREAM statement, if switched to a CREATE TABLE statement, will create a table that is constantly being updated. The sink topic SESSION_STREAM will contain the stream of changes to the table, i.e. its changelog.
ksqlDB models this as a TABLE, because it has TABLE semantics, i.e. only a single row can exist in the table with any specific key. However, the changelog will contain the STREAM of changes that have been applied to the table.
If what you want is a topic containing all the sessions, then something like this will create it:
-- create a stream with a new 'data' topic:
CREATE STREAM DATA (USER_ID INT)
WITH (kafka_topic='data', value_format='json');
-- create a table that tracks user interactions per session:
CREATE TABLE SESSIONS AS
SELECT USER_ID, COUNT(USER_ID) AS COUNT
FROM DATA
WINDOW SESSION (5 SECONDS)
GROUP BY USER_ID;
This will create a SESSIONS topic that contains the changes to the SESSIONS table: i.e. its changelog.
If you want to convert this into a stream of session-start events, then unfortunately ksqlDB doesn't yet allow you to create a stream directly from the table, but you can create a stream over the table's changelog:
-- Create a stream over the existing `SESSIONS` topic.
-- Note it states the window_type is 'Session'.
CREATE STREAM SESSION_STREAM (ROWKEY INT KEY, COUNT BIGINT)
WITH (kafka_topic='SESSIONS', value_format='JSON', window_type='Session');
-- Create a stream of window start events:
CREATE STREAM SESSION_STARTS AS
SELECT * FROM SESSION_STREAM
WHERE WINDOWSTART = WINDOWEND;
Note, with the upcoming 0.10 release you'll be able to name the key column in the SESSION_STREAM correctly:
CREATE STREAM SESSION_STREAM (USER_ID INT KEY, COUNT BIGINT)
WITH (kafka_topic='SESSIONS', value_format='JSON', window_type='Session');
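A quick usage sketch to tie this together (my addition, not part of the original answer): push a couple of events through the DATA stream and query SESSION_STARTS; the row emitted for the first event of each session is the one where WINDOWSTART equals WINDOWEND.
-- Insert a few test events (values are illustrative only):
INSERT INTO DATA (USER_ID) VALUES (1);
INSERT INTO DATA (USER_ID) VALUES (1);
-- Watch the session-start events arrive:
SELECT * FROM SESSION_STARTS EMIT CHANGES;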

Keeping the KEY of all streams the same as the source stream

Data arrives in the source topic with keys and values in Avro format. There are two more streams downstream in the pipeline before the data is sent to MySQL. But because I apply KSQL functions, the key from the source is not carried over into the downstream streams.
The keys are composite. Is there any way to apply transformations to the key values and still keep the keys in the downstream streams?
I have tried multiple approaches and know that a plain SELECT ... AS inside a CREATE STREAM ... AS SELECT statement keeps the KEY intact, but any transformation removes the KEY from the stream.
Example:
-- source stream
CREATE STREAM TEST_1
(COL1 STRING, COL2 STRING, COL3 STRING)
WITH (KAFKA_TOPIC='TEST_1', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='AVRO');
-- Second stream
CREATE STREAM TEST_2
WITH (KAFKA_TOPIC='TEST_2', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='AVRO') AS
SELECT CLEAR(COL1) AS COL1, CLEAR(COL2) AS COL2, CLEAR(COL3) AS COL3
FROM TEST_1;
Here TEST_1 comes from a database. It has a composite key (COL1, COL2).
But when I create stream TEST_2, the composite key from TEST_1 is not carried over.
I need the keys to produce tombstone records for delete operations in MySQL, and some cleaning operations are performed on the key columns.
I am using Confluent Kafka for this. Inserts and updates are not a problem, because I use a value-to-key SMT in the sink connectors.
Here CLEAR() is a custom UDF. Is there a way to achieve this?
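For what it's worth, a minimal sketch of one possible direction, assuming a newer ksqlDB release that supports multi-column keys and PARTITION BY with multiple expressions (check your version supports both), and assuming CLEAR() is registered as a UDF; this is not a confirmed answer:
-- Sketch only: declare the composite key on the source stream ...
CREATE STREAM TEST_1 (COL1 STRING KEY, COL2 STRING KEY, COL3 STRING)
WITH (KAFKA_TOPIC='TEST_1', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='AVRO', KEY_FORMAT='AVRO');
-- ... then re-key the derived stream on the cleaned key values. The
-- PARTITION BY expressions become the key columns of TEST_2; their names may
-- be system-generated unless the same expressions are aliased in the projection.
CREATE STREAM TEST_2
WITH (KAFKA_TOPIC='TEST_2', PARTITIONS=1, REPLICAS=1, VALUE_FORMAT='AVRO', KEY_FORMAT='AVRO') AS
SELECT CLEAR(COL3) AS COL3
FROM TEST_1
PARTITION BY CLEAR(COL1), CLEAR(COL2);
Since PARTITION BY sets the actual Kafka record key, the downstream streams and the sink connector would then see the cleaned key rather than losing it.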

Kafka sink connector creation for a table with a three-column primary key

I have created a JDBC source connector for a table that has no primary key (the table has columns a, b, c, d, e) and is part of an external database. I have a replica table in my own database, where I created a primary key on columns a, b and c, since those three combined are unique and can serve as the primary key. I am trying to create an upsert sink connector and set pk.fields to a,b,c, but when I launch the sink connector it goes into a degraded state, and I cannot find any useful error in connect.log either. I set pk.mode to record_value and pk.fields to a,b,c. Can someone please let me know if anything is missing in the setup?
Note: it works if I change the mode to insert and remove pk.fields; pk.mode stays record_value.
Update:
Hi Robin, the source table, AccountDetails, has columns accNumber, bankABA, bankOrigAccNumber, SpendingLimit and ExpirationDate, and there is no primary key on this table. The target table is AccountInformation and has the same columns, but with a primary key on (accNumber, bankABA, bankOrigAccNumber), since we need a primary key at the target for use by a different application. The source connector works fine and pulls the data once every 24 hours. I am trying to create a sink connector with the mode set to upsert for pushing the data from the topic into the table, the primary key mode set to record_value, and the primary key fields set to "accNumber,bankABA,bankOrigAccNumber". When I launch the sink, it goes into a degraded state.
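For reference, a hedged sketch of what the target table described above might look like; the column types are my assumption, since they are not given in the post:
-- Assumed DDL for the target table; types are illustrative only.
CREATE TABLE AccountInformation (
  accNumber VARCHAR(32) NOT NULL,
  bankABA VARCHAR(16) NOT NULL,
  bankOrigAccNumber VARCHAR(32) NOT NULL,
  SpendingLimit DECIMAL(18,2),
  ExpirationDate DATE,
  PRIMARY KEY (accNumber, bankABA, bankOrigAccNumber)
);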

Data not syncing from MySQL to Elasticsearch after processing through Kafka

We are trying to send data from MySQL to Elasticsearch (ETL) through Kafka.
In MySQL we have multiple tables that we need to aggregate into a specific format before we can send them to Elasticsearch.
For that we used Debezium to connect to MySQL and Elasticsearch, and transformed the data through ksqlDB.
We created streams for both tables, repartitioned them, and created a table for one entity, but after joining we don't get the data from both tables.
We are trying to join two MySQL tables through ksqlDB and send the result to Elasticsearch using Debezium.
Table 1: items
Table 2: item_images
CREATE STREAM items_from_debezium (id integer, tenant_id integer, name string, sku string, barcode string, qty integer, type integer, archived integer)
WITH (KAFKA_TOPIC='mySqlTesting.orderhive.items',VALUE_FORMAT='json');
CREATE STREAM images_from_debezium (id integer,item_id integer,image string, thumbnail string)
WITH (KAFKA_TOPIC='mySqlTesting.orderhive.item_images',VALUE_FORMAT='json');
CREATE STREAM items_flat
WITH (KAFKA_TOPIC='ITEMS_REPART',VALUE_FORMAT='json',PARTITIONS=1) as SELECT * FROM items_from_debezium PARTITION BY id;
CREATE STREAM images_flat
WITH (KAFKA_TOPIC='IMAGES_REPART',VALUE_FORMAT='json',PARTITIONS=1) as SELECT * FROM images_from_debezium PARTITION BY item_id;
CREATE TABLE item_images (id integer,item_id integer,image string, thumbnail string)
WITH (KAFKA_TOPIC='IMAGES_REPART',VALUE_FORMAT='json',KEY='item_id');
SELECT item_images.id,item_images.image,item_images.thumbnail,items_flat.id,items_flat.name,items_flat.sku,items_flat.barcode,items_flat.type,items_flat.archived,items_flat.qty
FROM items_flat left join item_images on items_flat.id=item_images.item_id
limit 10;
We expect data from both tables, but we are getting nulls from the item_images table.
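One hedged thing to try (my assumption, not from the original post): instead of declaring the item_images table over the repartitioned topic and relying on KEY='item_id' matching the actual record key, build the lookup table with a CREATE TABLE ... AS SELECT so ksqlDB sets the message key itself. This assumes a ksqlDB version with LATEST_BY_OFFSET, and the table name item_images_tbl is made up for illustration:
-- Sketch only: a table keyed by item_id, keeping the latest image per item.
CREATE TABLE item_images_tbl AS
SELECT item_id,
LATEST_BY_OFFSET(image) AS image,
LATEST_BY_OFFSET(thumbnail) AS thumbnail
FROM images_from_debezium
GROUP BY item_id;
Joining items_flat to this table on items_flat.id = item_images_tbl.item_id should then find matching rows, provided both sides carry the same integer key.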

Apache Spark - Error persisting DataFrame to MemSQL database using JDBC driver

I'm currently facing an issue while trying to save an Apache Spark DataFrame loaded from an Apache Spark temp table to a distributed MemSQL database.
The catch is that I cannot use the MemSQLContext connector for the moment, so I'm using the JDBC driver.
Here is my code:
//store suppliers data from temp table into a dataframe
val suppliers = sqlContext.read.table("tmp_SUPPLIER")
//append data to the target table
suppliers.write.mode(SaveMode.Append).jdbc(url_memsql, "R_SUPPLIER", prop_memsql)
Here is the error message (occurring during the suppliers.write statement):
java.sql.SQLException: Distributed tables must either have a PRIMARY or SHARD key.
Note:
The R_SUPPLIER table has exactly the same fields and data types as the temp table, and it has a primary key set.
FYI, here are some clues:
R_SUPPLIER script:
CREATE TABLE R_SUPPLIER
(
  SUP_ID INT NOT NULL PRIMARY KEY,
  SUP_CAGE_CODE CHAR(5) NULL,
  SUP_INTERNAL_SAP_CODE CHAR(5) NULL,
  SUP_NAME VARCHAR(255) NULL,
  SHARD KEY(SUP_ID)
);
The suppliers.write statement did work once, but in that case the data had been loaded into the DataFrame with a sqlContext.read.jdbc call and not sqlContext.sql (the data was stored in a remote database, not in an Apache Spark local temp table).
Has anyone faced the same issue?
Are you getting that error when you run the create table, or when you run the suppliers.write code? That is an error that you should only get when creating a table. Therefore if you are hitting it when running suppliers.write, your code is probably trying to create and write to a new table, not the one you created before.
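If it helps, a quick hedged check along those lines (assuming direct SQL access to the MemSQL cluster) is to look at which tables exist after the failed write and how R_SUPPLIER is actually defined:
-- See whether the write created a second, differently named table,
-- and confirm the definition of the one created by hand.
SHOW TABLES;
SHOW CREATE TABLE R_SUPPLIER;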