KSQL - Non-streaming query

Is there a way to query all current entries in a KTABLE? I'm trying to execute an HTTP request against the REST API with the payload
{
  "ksql": "SELECT * FROM MY_KTABLE;",
  "streamsProperties": {
    "auto.offset.reset": "earliest"
  }
}
and the stream hangs indefinitely. The documentation says
It is the equivalent of a traditional database table but enriched by
streaming semantics such as windowing.
So is it possible to make regular, non-streaming queries when you just need all of the current data, and to treat a KTABLE as a regular cache table?

A KSQL table uses Kafka Streams' KTable, so in order to access the current values of the KTable you would need to access the state stores in all instances of the streams job. In Kafka Streams you can do this using interactive queries; however, we don't support interactive queries in KSQL yet.
One workaround to see the current state of a table in KSQL is to use Kafka Connect to push the Kafka topic corresponding to the table into an external store such as a Postgres or Cassandra table. That external table will then hold the latest values of the KSQL table.
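For reference, this is roughly what the interactive-query approach mentioned above looks like in a plain Kafka Streams application (outside of KSQL). It is a minimal sketch using the classic store() lookup; the store name my-ktable-store is a made-up example:

import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.KeyValue;
import org.apache.kafka.streams.state.KeyValueIterator;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class CurrentStateLookup {

    // 'streams' is an already-running KafkaStreams instance whose topology
    // materializes a state store named "my-ktable-store" (hypothetical name).
    public static void printCurrentState(KafkaStreams streams) {
        ReadOnlyKeyValueStore<String, String> store =
                streams.store("my-ktable-store", QueryableStoreTypes.keyValueStore());

        // Iterate over every entry currently held by *this* instance. Other
        // instances of the same application own other partitions, so a complete
        // view requires querying them as well (e.g. over RPC).
        try (KeyValueIterator<String, String> it = store.all()) {
            while (it.hasNext()) {
                KeyValue<String, String> entry = it.next();
                System.out.println(entry.key + " -> " + entry.value);
            }
        }
    }
}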

Related

Can we use kafka JDBC source connector to pull data from multiple databases and put it into one input topic?

We have a use case wherein the business logic requires us to join tables from different databases and push the end result to an input topic.
table1 from schema1 in database1
table2 from schema2 in database2
Business logic
SELECT a,b FROM table1 INNER JOIN table2 ON table1.c = table2.d;
Here a is from table1 and b is from table2, and the value of the message in the input topic looks like { "payload": { "a": xyz, "b": xyz } }.
Is there any way to achieve this requirement with a single jdbc source connector?
PS:
I have referred to Can the JDBC Kafka Connector pull data from multiple databases?, but in the accepted answer the messages are pushed to the input topic without implementing any business logic. With that implementation we would not be able to push messages to the input topic in the shape our requirement demands.
An alternative would be to use Kafka Streams, i.e. push the messages from each table to its own topic and handle the joining logic in a Kafka Streams application. But we are looking for a solution where we could implement the logic at the connector level itself.
Short answer: No, you cannot use the JDBC Source connector in this way.
Longer answer: The JDBC source connector can connect to one database per connector instance. You have a few options:
Stream the contents of both tables into Kafka, and use ksqlDB (or Kafka Streams if you prefer) to join them and push the resulting data to a new Kafka topic.
Write a new connector plugin yourself that connects to both databases and does the join (this sounds like an awful idea)
If the database supports it, use a remote join (e.g. Oracle's DB Link) and the JDBC source connector's query option.
Depending on data volumes and query complexity, I'd personally go for option 1; ksqlDB is a perfect fit here.
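If you take the Kafka Streams variant of option 1, the join could look roughly like the sketch below. The topic names db1.schema1.table1 and db2.schema2.table2 are placeholders for whatever the source connectors produce, and it assumes both topics are already keyed by the join column (table1.c / table2.d) with simple string values; in practice you would first re-key and extract fields from the connector's Avro/JSON records.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KTable;

public class CrossDatabaseJoin {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Placeholder topics fed by the two source connectors, each keyed by the
        // join column (table1.c and table2.d respectively).
        KTable<String, String> table1 = builder.table("db1.schema1.table1");
        KTable<String, String> table2 = builder.table("db2.schema2.table2");

        // Rough equivalent of:
        //   SELECT a, b FROM table1 INNER JOIN table2 ON table1.c = table2.d
        // Here the whole value is treated as the 'a' / 'b' field for brevity.
        KTable<String, String> joined = table1.join(table2,
                (a, b) -> "{\"payload\":{\"a\":\"" + a + "\",\"b\":\"" + b + "\"}}");

        // Publish the joined result to the single input topic the question asks for.
        joined.toStream().to("joined-input-topic");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "cross-db-join");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}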
If both databases are on the same database server, you can use the connector's custom query option and write the SQL joining the tables across the databases.

KSQL query and tables storage

I was looking for documentation about where KSQL stores queries and tables. For example, since KSQL was built to work with Kafka, when I create a table from a topic, or when I write a query, where are the tables or the query results stored? More specifically, does KSQL use some kind of pointers to the events inside the segments of the topic partitions, or does it duplicate the events when I create a table from a topic, for example?
The queries that have been run or are active are persisted back into a Kafka topic.
A SELECT statement has no persistent state; it acts as a consumer.
A CREATE STREAM/TABLE command will potentially create several topics, as the input topic is duplicated, manipulated, and filtered out to a given destination topic. For any stateful operations, results are stored in a RocksDB instance on the KSQL server(s).
Since KSQL is built on Kafka Streams, you can refer to the wiki on Kafka Streams Internal Data Management
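To make the last two points concrete, here is a minimal Kafka Streams sketch of the kind of topology a stateful KSQL statement compiles down to (the topic and store names are made up). The materialized count lives in a local RocksDB store on the server running the job, and Kafka Streams also creates an internal changelog topic for that store so it can be rebuilt after a restart; this is where the duplication of data comes from.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;

public class StatefulCountExample {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // Stateful operation: count events per key from a (hypothetical) source topic.
        // "clicks-by-user-store" becomes a RocksDB store on local disk, and an internal
        // "<application.id>-clicks-by-user-store-changelog" topic is created in Kafka
        // so the store can be restored after a crash or rebalance.
        builder.<String, String>stream("clicks")
               .groupByKey()
               .count(Materialized.<String, Long, KeyValueStore<Bytes, byte[]>>as("clicks-by-user-store"));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "ksql-storage-demo");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}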

Is it possible to use Kafka Connect to mirror an RDBMS table to a Kafka Stream?

I know it's possible to push updates from a database to a Kafka stream using Kafka Connect. My question is, can I create a consumer to write changes from that same stream back into the table without creating an infinite loop?
I'm assuming that if I create a consumer that writes updates into the database table, it would trigger Connect to push that update to the stream again, and so on. Is there a way around this so I can mirror a database table to a stream?
You can stream from a Kafka topic to a database using the JDBC Sink connector for Kafka Connect.
You'd need to code in your business logic for avoiding an infinite replication loop into either the connectors or your consumer. For example:
JDBC Source connector uses a WHERE clause to only pull records with a flag set to indicate they are the original record
Custom Single Message Transform in the source connector to drop records with a flag set to indicate they are not the original record
Stream application (e.g. KSQL / Kafka Streams) processes the inbound stream of all database changes to filter out only those with a flag set to indicate they are the original record (see the sketch after this list)
Inefficient because then you're still streaming everything from the database
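For the stream-processing option, the filtering could look roughly like the sketch below. The topic names and the is_original flag are assumptions, and the value is treated as a plain JSON string for brevity.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;

public class OriginalRecordFilter {
    public static void main(String[] args) {
        StreamsBuilder builder = new StreamsBuilder();

        // All database changes captured by Kafka Connect land here (hypothetical name).
        builder.<String, String>stream("db-changes")
               // Keep only the records flagged as originals so that writes which
               // came *from* Kafka are not replicated back into Kafka again.
               .filter((key, value) -> value != null && value.contains("\"is_original\":true"))
               .to("db-changes-originals");

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "original-record-filter");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        new KafkaStreams(builder.build(), props).start();
    }
}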
Yes. It is possible to configure synchronisation/replication.

Read data from KSQL tables

Maybe this is a beginner question, but what is the recommended way to read data produced by KSQL?
Let's assume I do some stream processing and write the data to a KSQL table. Now I want to access this data from a Spring application (e.g. to fan out some live data via a websocket). My first guess here was to use Spring Kafka and just subscribe to the underlying topic. Or should I use Kafka Streams?
Another use-case could be to do stream processing and write the results to a Redis store (e.g. for a webservice which always returns current values). What would be the approach here?
Thanks!
The results of KSQL queries are stored in Kafka topics, so you can access the results from third-party applications by reading from the result topic.
If the query result is a table, the resulting Kafka topic is a changelog topic, meaning that you can read it into a table in a third-party system such as Cassandra or Redis. That table will always have the latest results, and you can query it from web services.
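For the Spring / websocket case, the simplest approach is exactly that: subscribe to the table's underlying topic. A plain Java consumer sketch is below (Spring Kafka's @KafkaListener would do the same job); the topic name MY_KSQL_TABLE and String serialization are assumptions, and in the Redis use case the in-memory map would be replaced by writes to Redis.

import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerConfig;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.common.serialization.StringDeserializer;

public class TableChangelogReader {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ConsumerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(ConsumerConfig.GROUP_ID_CONFIG, "table-changelog-reader");
        props.put(ConsumerConfig.AUTO_OFFSET_RESET_CONFIG, "earliest");
        props.put(ConsumerConfig.KEY_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());
        props.put(ConsumerConfig.VALUE_DESERIALIZER_CLASS_CONFIG, StringDeserializer.class.getName());

        // Latest value per key; for the Redis use case these puts would become
        // SET commands against Redis instead of a local map.
        Map<String, String> latest = new HashMap<>();

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("MY_KSQL_TABLE")); // hypothetical result topic
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // A table's changelog carries the newest state per key, so simply
                    // overwriting the previous value keeps the view up to date.
                    latest.put(record.key(), record.value());
                }
            }
        }
    }
}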
Check out our Clickstream demo, where we push the results into Elastic for visualization. The visualized values are the latest values in the corresponding tables.
https://github.com/confluentinc/ksql/tree/master/ksql-clickstream-demo#clickstream-analysis

Querying MySQL tables using Apache Kafka

I am trying to use Kafka Streams for achieving a use-case.
I have two tables in MySQL - User and Account. And I am getting events from MySQL into Kafka using a Kafka MySQL connector.
I need to get all user-IDs within an account from within Kafka itself.
So I was planning to use a KStream on the MySQL output topic, process it, and publish the output to a topic with the account-id as the key and the comma-separated user-ids as the value.
Then I can use an interactive query to get all user-ids for an account-id, via the get() method of the ReadOnlyKeyValueStore class.
Is this the right way to do this? Is there a better way?
Can KSQL be used here?
You can use Kafka Connect to stream data in from MySQL, e.g. using Debezium. From there you can use Kafka Streams, or KSQL, to transform the data, including re-keying (which I think is what you're looking to do here), as well as joining it to other streams.
If you ingest the data from MySQL into a topic with log compaction set then you are guaranteed to always have the latest value for every key in the topic.
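A rough Kafka Streams sketch of the plan described in the question (group by account-id, collect the user-ids, then read them back with an interactive query via ReadOnlyKeyValueStore.get()) is below. The topic name mysql.users is a placeholder, and the records are assumed to already be keyed by account-id with the user-id as the value; in practice you would re-key and extract these fields from the connector's records first.

import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.common.utils.Bytes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.KeyValueStore;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class UsersPerAccount {
    public static void main(String[] args) throws Exception {
        StreamsBuilder builder = new StreamsBuilder();

        // Hypothetical topic from the MySQL connector: key = account-id, value = user-id.
        builder.<String, String>stream("mysql.users")
               .groupByKey()
               .aggregate(
                   () -> "",
                   (accountId, userId, agg) -> agg.isEmpty() ? userId : agg + "," + userId,
                   Materialized.<String, String, KeyValueStore<Bytes, byte[]>>as("users-by-account"));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "users-per-account");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
        Thread.sleep(10_000); // crude wait for the store to be ready; use a state listener in real code

        // Interactive query: fetch the comma-separated user-ids for one account from the local store.
        ReadOnlyKeyValueStore<String, String> store =
                streams.store("users-by-account", QueryableStoreTypes.keyValueStore());
        System.out.println(store.get("some-account-id"));
    }
}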
I would take a look at Striim if you want built-in CDC and interactive continuous SQL queries on streaming data in one UI. More info here:
http://www.striim.com/blog/2017/08/making-apache-kafka-processing-preparation-kafka/