What happens internally when we run a kSQL query?


I am entirely new to Apache Kafka and kSQL. I have a question that I tried to answer on my own, without success.
My current understanding is that the events generated by a producer are stored internally in Kafka topics in serialized form (0s and 1s). If I create a Kafka stream to consume that data and then run a kSQL query that uses, say, the COUNT() function, will the output of that query be persisted in Kafka topics?
If that is the case, won't it incur a storage cost?

Behind the scenes, it runs a Kafka Streams topology.
Any persisted streams or aggregated tables, such as the COUNT() in your case, do indeed occupy storage.
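To make the storage cost concrete, here is a toy model in plain Python, not the ksqlDB or Kafka Streams API. The names `run_count_query`, `state_store` and `output_topic` are made up for illustration, and the sketch assumes every update is emitted, whereas a real topology may buffer and merge updates before writing them.

```python
from collections import defaultdict

def run_count_query(events):
    """Toy model of a persistent COUNT(*) ... GROUP BY key query."""
    state_store = defaultdict(int)  # stands in for the local state store
    output_topic = []               # stands in for the query's output topic
    for key in events:
        state_store[key] += 1
        # Each update to a count is appended as a new record, so the
        # output grows with the number of updates, not the number of keys.
        output_topic.append((key, state_store[key]))
    return dict(state_store), output_topic

state, topic = run_count_query(["a", "b", "a", "a"])
print(state)  # {'a': 3, 'b': 1}
print(topic)  # [('a', 1), ('b', 1), ('a', 2), ('a', 3)]
```

Both the current counts and the stream of updates take up space, which is the storage cost the question asks about.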

Related

KTable initialization and persistence

This is more of an architectural question. I'm learning about Event-Driven Architecture and Streaming Systems with Apache Kafka. I've learned about Event Sourcing and CQRS and have some basic questions regarding implementation.
For example, consider a streaming application where we are monitoring vehicular events of drivers registered in our system. These events will be coming in as a KStream. The drivers registered in the system will be in a KTable, and we need to join the events and drivers to derive some output.
Assume that we insert a new driver into the system through a microservice, which writes the data to a Cassandra table and then to the KTable topic via Change Data Capture.
Since Kafka topics have a TTL associated with them, how do we make sure that the driver records are not dropped?
I understand that Kafka has a persistent state store that can maintain the required state, but can I depend on it like a Cassandra table? Is there a size consideration?
If the whole application, including all Kafka brokers and consumer nodes, is terminated, can the application be restarted without loss of driver records in the KTable?
If the streaming application is Kubernetes based, how would I maintain the persistent disk volumes of each container and correctly attach them as containers come and go?
Would it be preferable to join the event stream with the driver table in Cassandra using Spark Streaming or Flink? Can Spark and Flink still maintain data locality, given that their streaming consumers will be distributed by Kafka partitions while the Cassandra data is distributed by some other scheme I am not sure of?
EDIT: I realized Spark and Flink would be pulling data from Cassandra on the respective nodes depending on what keys they have. Kafka Streams has the advantage that the stream and the KTable to join will already be data-local.

KTables don't have a TTL when they are built from compacted topics (cleanup.policy=compact), because compaction retains the latest record for each key indefinitely.
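As a rough illustration of why a compacted topic keeps driver records around, the following plain-Python sketch mimics the end result of compaction. It is a simplification: the function name `compact` is hypothetical, and real brokers compact in the background and keep delete markers for some time before removing them.

```python
def compact(log):
    """Keep only the latest value per key; a None value (tombstone) removes the key."""
    latest = {}
    for offset, (key, value) in enumerate(log):
        latest[key] = (offset, value)  # later records overwrite earlier ones
    survivors = [(off, k, v) for k, (off, v) in latest.items() if v is not None]
    # Preserve the original log order of the surviving records.
    return [(k, v) for off, k, v in sorted(survivors)]

log = [
    ("driver-1", "Ann"),
    ("driver-2", "Bob"),
    ("driver-1", "Ann B."),  # update: replaces the first driver-1 record
    ("driver-2", None),      # tombstone: driver-2 is deleted
]
print(compact(log))  # [('driver-1', 'Ann B.')]
```

A driver record only disappears when a newer record or a tombstone for the same key arrives, not because time has passed.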
Yes, you need to maintain storage directories for persistent Kafka StateStores. Since those stores would be on-disk, no records should be dropped from them upon broker/app restarts until you actively clear the state directories from the app instance hosts.
Spark/Flink do not integrate with Kafka Streams stores, and have their own locality considerations. I believe Flink offers RocksDB-backed state, and both can broadcast data for remote joins. Otherwise, joining on Kafka record keys requires that both topics have matching partition counts, so that the same partitions are assigned to the same instances/executors, similar to Kafka Streams joins.
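The matching-partition-count requirement can be sketched as follows. This is only an illustration: `partition_for` and `co_partitioned` are hypothetical helpers, and `zlib.crc32` is a stand-in hash (Kafka's default partitioner uses a different hash function).

```python
import zlib

def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration only; not the hash Kafka uses.
    return zlib.crc32(key.encode("utf-8")) % num_partitions

def co_partitioned(keys, partitions_a: int, partitions_b: int) -> bool:
    """True if every key maps to the same partition number in both topics."""
    return all(
        partition_for(k, partitions_a) == partition_for(k, partitions_b)
        for k in keys
    )

keys = [f"driver-{i}" for i in range(100)]
print(co_partitioned(keys, 6, 6))  # True: same hash and same partition count
print(co_partitioned(keys, 6, 4))  # differing counts generally break the mapping
```

With equal counts and the same partitioning function, records for a given key land in the same partition number of both topics, so one instance can join them locally.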

How to do batch processing on Kafka Connect generated datasets?

Suppose we have batch jobs producing records into Kafka, and a Kafka Connect cluster consuming those records and moving them to HDFS. We want to be able to run batch jobs later on the same data, but we want to ensure that those batch jobs see all of the records generated by the producers. What is a good design for this?

You can run any MapReduce, Spark, Hive, etc. query on the data, and you will get all records that have thus far been written to HDFS. It will not see data that has not yet been consumed by the Sink from the producers, but this has nothing to do with Connect or HDFS; that is a pure Kafka limitation.
Worth pointing out that Apache Pinot is a better place to combine Kafka streaming data and have batch query support.

How to monitor 'bad' messages written to kafka topic with no schema

I use Kafka Connect to take data from RabbitMQ into a Kafka topic. The data comes without a schema, so in order to associate a schema I use a ksql stream. On top of the stream I create a new topic that now has a defined schema. At the end I load the data into a BQ database. My question is: how do I monitor messages that have not passed the stream stage? Also, do I support schema evolution this way? If not, how can I use the Schema Registry functionality?
Thanks

use Kafka Connect to take data ... data comes without schema
I'm not familiar with the RabbitMQ connector specifically, but if you use the Confluent converter classes that do use schemas, then the data would have one, although maybe only a string or bytes schema.
If ksql is consuming the non-schema topic, then there's a consumer group associated with that process. You can monitor its lag to know how many messages have not yet been processed by ksql. If ksql is unable to parse a message because it's "bad", then I assume it's either skipped or the stream stops consuming completely; this is likely configurable.
If you've set the output topic format to Avro, for example, then the schema will automatically be registered in the Registry. There will be no evolution until you modify the fields of the stream.
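The lag mentioned above is just the difference between the end of each partition and the consumer group's committed position. A minimal sketch of that arithmetic, with made-up offsets and a hypothetical `consumer_lag` helper (it assumes a partition with no committed offset is read from offset 0):

```python
def consumer_lag(end_offsets, committed_offsets):
    """Per-partition lag: log end offset minus the group's committed offset."""
    return {
        partition: end - committed_offsets.get(partition, 0)  # assumption: start at 0
        for partition, end in end_offsets.items()
    }

end_offsets = {0: 120, 1: 95, 2: 40}  # latest offset per partition (made up)
committed = {0: 120, 1: 90}           # group's committed offsets (made up)

lag = consumer_lag(end_offsets, committed)
print(lag)                # {0: 0, 1: 5, 2: 40}
print(sum(lag.values()))  # 45
```

Lag that keeps growing on the input topic suggests the stream is stuck or has stopped, while lag that returns to zero only shows that messages were read, not that they were parsed successfully.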

KSQL query and tables storage

I was looking for documentation about where KSQL stores queries and tables. For example, since KSQL was built to work with Kafka, when I create a table from a topic, or when I write a query, where are the tables or the query results stored? More specifically, does KSQL use some kind of pointers to events inside segments inside the topic partitions, or does it duplicate the events when I create a table from a topic, for example?

The queries that have been run or are active are persisted back into a Kafka topic.
A Select statement has no persistent state; it acts as a consumer.
A Create Stream/Table command will create potentially many topics, resulting in duplication, manipulation, and filtering of the input topic out to a given destination topic. For any stateful operations, results would be stored in a RocksDB instance on the KSQL server(s).
Since KSQL is built on Kafka Streams, you can refer to the wiki on Kafka Streams Internal Data Management.

Where to run the processing code in Kafka?

I am trying to set up a data pipeline using Kafka.
Data goes in (with producers), gets processed, enriched and cleaned, and moves out to different databases or storage (with consumers or Kafka Connect).
But where do you run the actual pipeline processing code to enrich and clean the data? Should it be part of the producers or the consumers? I think I missed something.

In a data pipeline, a Kafka client can serve as both a consumer and a producer.
For example, if raw data is streamed into ClientA, where it is cleaned before being passed to ClientB for enrichment, then ClientA is serving as a consumer (listening to a topic for raw data) and a producer (publishing cleaned data to a topic).
Where you draw those boundaries is a separate question.
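A stripped-down sketch of ClientA's role, using in-memory queues in place of real topics. Everything here is hypothetical (`clean`, `run_client_a`, the sample records), and a real client would use a Kafka consumer and producer instead of `deque`.

```python
from collections import deque

def clean(record: str) -> str:
    # Hypothetical cleaning step: trim whitespace and normalize case.
    return record.strip().lower()

def run_client_a(raw_topic: deque, cleaned_topic: deque) -> None:
    """Consume from the raw topic, clean each record, produce to the cleaned topic."""
    while raw_topic:
        record = raw_topic.popleft()         # consumer side
        cleaned_topic.append(clean(record))  # producer side

raw = deque(["  Alice ", "BOB", " Carol"])
cleaned = deque()
run_client_a(raw, cleaned)
print(list(cleaned))  # ['alice', 'bob', 'carol']
```

ClientB would follow the same pattern, reading from the cleaned topic and writing enriched records to a third one.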

It can be part of either producer or consumer.
Or you could set up an environment dedicated to something like Kafka Streams processes or a KSQL cluster.

It is possible either way. Consider all possible options and choose the one which suits you best. Let's assume you have a source of raw data, in CSV or some DB (Oracle), and you want to do your ETL and load the result into some different datastores.
1) Use Kafka Connect to produce your data to Kafka topics.
Have a consumer which consumes from these topics (it could be Kafka Streams, KSQL, Akka, or Spark).
Produce back to a Kafka topic for further use, or to some datastore; any sink, basically.
This has the benefit of ingesting your data with little or no code, as Kafka Connect source connectors are easy to set up.
2) Write custom producers and do your transformations in the producers before writing to a Kafka topic, or write directly to a sink unless you want to reuse the produced data for some further processing.
Read from the Kafka topic, do some further processing, and write the result back to a persistent store.
It all boils down to your design choice, the throughput you need from the system, and how complicated your data structure is.