KSQL query and table storage - apache-kafka

I was looking for documentation about where KSQL stores queries and tables. For example, since KSQL was built to work with Kafka, when I create a table from a topic, or when I write a query, where are the tables or the query results stored? More specifically, does KSQL use some kind of pointers to events inside segments inside the topic partitions, or does it duplicate the events when I create a table from a topic, for example?

The queries that have been run, or are active, are persisted back into a Kafka topic.
A SELECT statement has no persistent state - it acts as a consumer.
A CREATE STREAM/TABLE command can create potentially many topics, resulting in duplication, manipulation, and filtering of the input topic out to a given destination topic. For any stateful operations, results are stored in a RocksDB instance on the KSQL server(s).
Since KSQL is built on Kafka Streams, you can refer to the wiki on Kafka Streams Internal Data Management.
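As a rough sketch of the two cases (all topic, stream, and column names here are made up, and exact syntax differs a little between classic KSQL and newer ksqlDB):

-- Source stream over an existing topic.
CREATE STREAM pageviews (user_id VARCHAR, url VARCHAR)
  WITH (KAFKA_TOPIC='pageviews', VALUE_FORMAT='JSON');

-- Stateless persistent query: filters the input and writes (duplicates) the
-- matching events into a new backing topic.
CREATE STREAM pageviews_home AS
  SELECT * FROM pageviews WHERE url = '/home';

-- Stateful persistent query: the running counts live in RocksDB on the KSQL
-- server(s), backed by an internal changelog topic, and the results are also
-- written to the table's output topic.
CREATE TABLE pageviews_per_user AS
  SELECT user_id, COUNT(*) AS views
  FROM pageviews
  GROUP BY user_id;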

Related

What is the best practice to enrich data when synchronizing data with Kafka Connect

I am thinking about solutions to enrich data from Kafka.
Currently I am using the Mongo Kafka Connect (source) connector to sync all changes to Kafka. The connector uses the change stream to watch the oplog and publishes changes to Kafka. The relationship between a Mongo collection and a Kafka topic is 1:1.
On the consumer side, when it pulls data, it gets a reference id that we need to join against another collection to get the related data.
To join data between collections, I see the two solutions below.
When pulling data, the consumer goes back to the Mongo database to fetch the data or join collections according to the reference key.
With this approach, I am concerned about the number of connections back to the Mongo database.
Using Kafka Streams to join data among topics.
For the second solution, I would like to know how to keep the master data in the topics forever, and how to maintain the records in the topics like database tables, so that each row has a unique index and, when data changes arrive in the topic, we can update the records.
If you have any other solutions, please let me know.
Your consumer can do whatever it wants. You may need to increase various Kafka timeout configs depending on your database lookups, though.
Kafka topics can be infinitely retained with retention.ms=-1, or by using compaction. With compaction, the topic acts similarly to a KV store (but as a log). To get an actual lookup store, you can build a KTable, then join a topic's stream against it.
This page covers various join patterns in Kafka Streams - https://developer.confluent.io/learn-kafka/kafka-streams/joins/
You can also use ksqlDB
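As a hedged sketch of the ksqlDB approach (topic and column names are made up, syntax follows recent ksqlDB): model the master data coming from a compacted topic as a table and the change events as a stream, then join them.

-- Master data from a compacted topic: the table holds the latest value per key.
CREATE TABLE customers (customer_id VARCHAR PRIMARY KEY, name VARCHAR, tier VARCHAR)
  WITH (KAFKA_TOPIC='mongo.customers', VALUE_FORMAT='JSON');

-- Change events as a stream.
CREATE STREAM orders (order_id VARCHAR, customer_id VARCHAR, amount DOUBLE)
  WITH (KAFKA_TOPIC='mongo.orders', VALUE_FORMAT='JSON');

-- Stream-table join: each order is enriched with the current customer record,
-- so the consumer never has to call back into Mongo.
CREATE STREAM orders_enriched AS
  SELECT o.order_id, o.amount, c.name, c.tier
  FROM orders o
  JOIN customers c ON o.customer_id = c.customer_id;

When a customer record changes in the compacted topic, the table's state is updated, and subsequent orders are enriched with the new values.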

Read from single kafka topic and upsert into multiple oracle tables

I'm using Kafka to send data from multiple collections into a single topic using the Mongo source connector, and to upsert the data into different Oracle tables using the JDBC sink connector.
In the Mongo source connector, we append the respective collection name to every record so the sink side can process the information based on the collection name.
Is that possible using the JDBC sink connector? Can we do this via a Node.js / Spring Boot consumer application that splits the topic messages and writes them into the different tables?
E.g.: Collection A, Collection B, Collection C - Mongo source connector
Table A, Table B, Table C - JDBC sink connector
Collection A's data has to map to Table A, and likewise for the remaining.
The JDBC sink will, by default, only write to a table that is named after the topic. You'd need to rename the topic at runtime to write data from one topic to other tables.
Can we do this via a Node.js / Spring Boot consumer application that splits the topic messages and writes them into the different tables?
Sure, you can use Kafka Streams (or Spring Cloud Stream) to branch the data into different topics before the sink connector reads them.
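The same branching idea can also be sketched in ksqlDB, assuming the source connector adds a collection_name field to every record (all stream and topic names below are hypothetical). Each filtered stream gets its own backing topic, which the JDBC sink can then map to its own Oracle table:

CREATE STREAM combined (collection_name VARCHAR, id VARCHAR, payload VARCHAR)
  WITH (KAFKA_TOPIC='mongo.combined', VALUE_FORMAT='JSON');

-- One filtered stream (and backing topic) per target table.
CREATE STREAM table_a WITH (KAFKA_TOPIC='TABLE_A') AS
  SELECT * FROM combined WHERE collection_name = 'A';

CREATE STREAM table_b WITH (KAFKA_TOPIC='TABLE_B') AS
  SELECT * FROM combined WHERE collection_name = 'B';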

What happens internally when we run a kSQL query?

I am entirely new to Apache Kafka and KSQL. I had a question in mind and tried to find the answer, but failed to do so.
My current understanding is that the events generated by the producer are stored internally in Kafka topics in serialized form (0s and 1s). If I create a Kafka stream to consume the data and then run a KSQL query, say one using the COUNT() function, will the output of that query be persisted in Kafka topics?
If that is the case, won't that be a storage cost?
Behind the scenes, it runs a Kafka Streams topology.
Any persisted streams or aggregated tables, such as your COUNT result, do indeed occupy storage.
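For example (stream and column names here are purely illustrative):

CREATE STREAM clicks (user_id VARCHAR, url VARCHAR)
  WITH (KAFKA_TOPIC='clicks', VALUE_FORMAT='JSON');

-- The running counts are kept in a RocksDB state store on the KSQL server,
-- backed by an internal changelog topic, and also written to the table's
-- output topic - so yes, the aggregate consumes additional storage.
CREATE TABLE clicks_per_user AS
  SELECT user_id, COUNT(*) AS click_count
  FROM clicks
  GROUP BY user_id;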

How to store all topics present in the kafka cluster to another topic using KSQL

I'm new to KSQL. I want to store the names of all topics present in a Kafka cluster in another topic using a KSQL query.
SHOW TOPICS; from the KSQL CLI gives me the list of topics. I want to store all this topic information in another topic by creating a stream.
I will be polling this new topic (using a consumer), and whenever a new topic gets created in the cluster, my consumer will receive a message.
I want a KSQL query to accomplish this.
Thanks in advance.
You can't currently achieve what you want using ksqlDB. SHOW TOPICS is a system command, not a SQL statement, so its output can't be piped into a stream.
ksqlDB allows you to process the data within the topics in the Kafka cluster. It doesn't (yet) allow you to process the metadata of the Kafka cluster, e.g. the list of topics, or consumer groups, etc.
It may be worth raising a feature request on GitHub: https://github.com/confluentinc/ksql/issues/new/choose

KSQL - Non-streaming query

Is there a way to query all current entries in a KTABLE? I'm trying to execute an HTTP request against the REST API with the payload
{
  "ksql": "SELECT * FROM MY_KTABLE;",
  "streamsProperties": {
    "auto.offset.reset": "earliest"
  }
}
and the stream hangs indefinitely. The documentation says:
It is the equivalent of a traditional database table but enriched by streaming semantics such as windowing.
So is it possible to make regular queries when you just need all the current data, without streaming, and treat a KTABLE as a regular cache table?
A KSQL table uses Kafka Streams' KTable, so in order to access the current values of the KTable you would need to access the state stores in all instances of the streams job. In Kafka Streams you can do this using interactive queries; however, we don't support interactive queries in KSQL yet.
One workaround to see the current state of a table in KSQL would be to use Kafka Connect to push the Kafka topic corresponding to the table into an external table, such as a Postgres or Cassandra table. This external table will then hold the latest values of the KSQL table.
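As a sketch, newer ksqlDB versions can even declare such a connector from the CLI (classic KSQL cannot; there you would POST the equivalent JSON config to the Kafka Connect REST API). The connector name, connection URL, topic, and key field below are all assumptions:

CREATE SINK CONNECTOR my_table_sink WITH (
  'connector.class' = 'io.confluent.connect.jdbc.JdbcSinkConnector',
  'connection.url'  = 'jdbc:postgresql://postgres:5432/ksql',
  -- topic backing the KSQL table
  'topics'          = 'MY_KTABLE',
  -- upsert keeps only the latest value per key in the external table
  'insert.mode'     = 'upsert',
  'pk.mode'         = 'record_key',
  'pk.fields'       = 'id',
  'auto.create'     = 'true'
);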