Kafka Connect and Custom Queries - apache-kafka

I'm interested in using the Kafka JDBC source connector to publish to Kafka when an Invoice gets created. On the source side, the data is split across two tables: Invoice and InvoiceLine.
Is this possible using custom queries? What would the query look like?
Also, since the connector polls, could what gets published to the topic contain one or more invoices?
Thanks

Yes, you can use custom queries. From the docs:
Custom Query: The source connector supports using custom queries instead of copying whole tables. With a custom query, one of the other automatic update modes can be used as long as the necessary WHERE clause can be correctly appended to the query. Alternatively, the specified query may handle filtering to new updates itself; however, note that no offset tracking will be performed (unlike the automatic modes where incrementing and/or timestamp column values are recorded for each record), so the query must track offsets itself.
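What the query looks like depends on your schema, but here is a rough sketch (invoice_id, modified_ts, and the line-item columns below are assumptions; adjust them to your tables). Wrapping the join in a derived table is a common pattern, because it lets the connector append its own WHERE clause when you use timestamp and/or incrementing mode:
-- no trailing semicolon and no WHERE clause of your own: in timestamp/incrementing mode
-- the connector appends something like "WHERE modified_ts > ? AND modified_ts < ? ORDER BY modified_ts"
SELECT * FROM (
    SELECT i.invoice_id,
           i.invoice_date,
           i.modified_ts,
           l.line_no,
           l.product_id,
           l.quantity,
           l.amount
    FROM Invoice i
    JOIN InvoiceLine l ON l.invoice_id = i.invoice_id
) invoice_with_lines
Note that with a query like this, each Kafka record is one joined row (one invoice line with its invoice columns repeated), and a single poll can pick up rows belonging to several invoices, so consumers should expect one or more invoices' worth of records per poll.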

Related

Is it the right use case for KSql?

I am using Kafka Streams (KStreams) and need to de-duplicate the data. The source ingests duplicated data for several reasons, e.g., the data itself is duplicated, or re-partitioning.
I am currently using Redis for this use case, where the data is stored roughly as:
id#object -> list of applications that have processed this id and this object
As KSQL is implemented on top of RocksDB, which is also a key-value database, can I use KSql for this use case?
On successful processing, I would add an entry to KSQL. On reception, I would have to check whether the id already exists in KSQL.
Is this a correct use case per KSql's design in the event-processing world?
If you want to use ksqlDB as a cache, you can create a TABLE using the topic as the data source. Note that a CREATE TABLE statement by itself only declares a schema (it does not pull any data into ksqlDB yet).
CREATE TABLE inputTable <schemaDefinition> WITH(kafka_topic='...');
Check out the docs for more details: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/create-table/
To pull in the data, you can create a second table via:
CREATE TABLE cache AS SELECT * FROM inputTable;
This will run a query in the background that reads the input data and puts the result into the ksqlDB server. Because the query is a simple SELECT *, it effectively pulls in all data from the topic. You can now issue "pull queries" (i.e., lookups) against the result to use TABLE cache as desired: https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/select-pull-query/
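For example, assuming the table's primary key column is named id (an assumption; use whatever key your schema defines), the existence check is a simple pull query:
-- point lookup against the materialized table; 'id' is an assumed key column name
SELECT * FROM cache WHERE id = 'some-key';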
Future work:
We are currently working on adding "source tables" (cf https://github.com/confluentinc/ksql/pull/7474) that will make this setup simpler. If you declare a source table, you can do the same with a single statement instead of two:
CREATE SOURCE TABLE cache <schemaDefinition> WITH(kafka_topic='...');

ksqlDB: Best way to create Tables from Debezium source topics?

I would like to create Tables in ksqlDB from Debezium source topics, with the ultimate aim of performing a left join on these tables and efficiently outputting materialized views to a downstream database using the JDBC sink connector.
The Debezium source topics have not had any transforms applied (such as ExtractNewRecordState), so contain a 'before' and 'after' property, as described in the Debezium documentation here.
The reason for not applying the ExtractNewRecordState transform (which would presumably simplify matters) is that the source CDC topics may be used for various purposes and it does not appear possible to create multiple topics off the same source database table (since topic names are automatically determined by Debezium and depend on the database server, schema and table name as described here).
The best approach I have found so far is to:
create a stream in ksqlDB from the raw Debezium input, e.g.:
CREATE STREAM user_stream WITH (KAFKA_TOPIC='mssql.dbo.user', VALUE_FORMAT='AVRO');
create a second stream selecting the required fields from the 'after' property of the first stream, e.g.:
CREATE STREAM user_stream2 AS SELECT AFTER->user_id, AFTER->username, AFTER->email FROM user_stream EMIT CHANGES;
finally, convert the second stream to a table as described here, namely:
CREATE TABLE user_table AS
  SELECT user_id,
         LATEST_BY_OFFSET(username) AS username,
         LATEST_BY_OFFSET(email) AS email
  FROM user_stream2
  GROUP BY user_id
  EMIT CHANGES;
These steps must be repeated to generate each Table, at which point a join can be performed on the Tables to produce an output.
This seems quite long-winded, with a lot of intermediate steps, and performance also seems sluggish. Is there a better and/or more direct way to generate materialized views using ksqlDB and Debezium? Can any of the steps be cut out, and/or should I be using a different approach in step 3 (such as a windowing function)?
I'm particularly keen to ensure that the approach taken is the most efficient from a performance and resource usage perspective.

Saving JDBC DB data as shared state in Spark

I have an MSSQL table as a data source, and I would like to save some kind of processing offset in the form of a timestamp (it is one of the table's columns), so that it is possible to process the data from the latest offset onward. I would like to save it as some kind of shared state between Spark sessions. I have researched shared state in Spark sessions; however, I did not find a way to store this offset in shared state. Is it possible to use existing Spark constructs to perform this task?
As far as I know, there is no official built-in feature in Spark for passing data between sessions. As alternatives, I would consider the following options/suggestions:
First, the offset column should be an indexed field in MSSQL so that it can be queried quickly.
If there is already an in-memory system (e.g., Redis, Apache Ignite) installed and used by your project, I would store the offset there.
I wouldn't use a message queue system such as Kafka, because once you consume a message you would need to resend it, so that wouldn't make sense.
As a solution, I would prefer to save it in the filesystem or in Hive, even if that adds some extra overhead, since you will have only one value in that table. In the case of the filesystem, of course, the performance would be better.
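As a minimal sketch of the Hive option (processing_offset, last_ts, source_ts, and processed_batch are placeholder names, not anything standard), the offset can live in a single-row table that you overwrite after each run and read back at startup:
-- one-row table holding the last processed timestamp
CREATE TABLE IF NOT EXISTS processing_offset (last_ts TIMESTAMP);

-- after a successful batch, overwrite the stored offset with the new high-water mark
-- (processed_batch stands in for whatever table/view holds the rows you just processed)
INSERT OVERWRITE TABLE processing_offset
SELECT MAX(source_ts) FROM processed_batch;

-- at the start of the next session, read it back to build the lower bound of the JDBC query
SELECT last_ts FROM processing_offset;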
Let me know if further information is needed.

Is there a way to update side inputs in Apache Beam?

I am developing a data transformation pipeline in Apache Beam, where I need a lookup table to help with transforming each incoming record.
I can pass the lookup table in as a side input, but the caveat is that an incoming record could update the lookup table I am using.
Is there a way to update the lookup table and then broadcast the update to every other worker?
Update: One possibility is to use a data-driven trigger to signal an end to the current window when an incoming record updates the side input. Does the side input get automatically refreshed at the start of the next window? In my case I am retrieving the side input from an external source.
If the lookup table is related to the key and window, you can try using state:
https://beam.apache.org/blog/2017/02/13/stateful-processing.html
Otherwise you may need external data storage, such as a database or an in-memory cache. Just be aware that DoFns are serializable, so opening and closing connections needs to be done carefully.

How should I keep track of delete operations in a database without using triggers?

The application polls the database at certain intervals. On each poll, the application reads all the tables.
As part of an optimization, we want the application to read a table only if any INSERT/UPDATE/DELETE has happened, so I want to use a timestamp-based approach.
Having a separate timestamp column can help me track row modifications.
While querying a table, I can check whether the in-memory stored timestamp is less than the maximum timestamp in the table. If so, it means that some row has been modified.
But if a certain row gets deleted, the latest timestamp associated with that row is no longer present. Hence the above algorithm fails in this case, since the maximum timestamp does not give the correct value.
Is there a way in which I can track the delete operations as well, without using triggers?
Any help would be highly appreciated.
I am using a Sybase ASA database.
Maybe you could implement logical deletion: instead of removing a record, you simply mark it as deleted with a specific flag, for example.
You still have the max timestamp, and you can exclude the flagged records from selection queries (maybe create some views on top of the table to do the job automatically).
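A minimal sketch of that idea in SQL (my_table, last_modified, is_deleted, and id are placeholder names; the exact syntax may vary slightly in Sybase ASA):
-- add a deletion flag next to the existing modification timestamp column
ALTER TABLE my_table ADD is_deleted BIT DEFAULT 0;

-- instead of DELETE, flag the row; updating last_modified keeps the max-timestamp check working
UPDATE my_table
SET is_deleted = 1,
    last_modified = CURRENT TIMESTAMP
WHERE id = 42;

-- a view that hides logically deleted rows from normal reads
CREATE VIEW my_table_active AS
    SELECT * FROM my_table WHERE is_deleted = 0;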