How do connectors work in ksqlDB and Kafka?

I don't have a clear picture of how source connectors work in ksqlDB and Kafka together.
How fast is the data populated into Kafka topics?
And what if a ksqlDB stream needs data from the source for a join, but the data is still loading?
Does the source connector send updated/inserted data to the topic, and does that happen instantly?
Could you help me with these questions or advise a good tutorial where I can learn more?

How fast is the data populated into Kafka topics?
Depends on the connector. Some connectors are event driven and some use a polling mechanism. The event-driven connectors are generally going to be more real-time, but often require more DB-side setup, whereas the polling-based connectors generally don't require any DB-side changes. With the polling-based connectors you can increase the polling frequency, trading lower latency for higher DB load.
Look into the documentation of the specific connector for more details.
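To make the polling trade-off concrete, here is a minimal sketch of registering a polling-based JDBC source connector through the Kafka Connect REST API, assuming a local Connect worker on port 8083; the database URL, table, column names, and the `poll.interval.ms` value are illustrative assumptions, not values from the original question.

```python
# Hedged sketch: registering a polling-based JDBC source connector via the
# Kafka Connect REST API. Host, table, and column names are assumptions.
import json
import requests  # pip install requests

connector = {
    "name": "orders-jdbc-source",  # arbitrary connector name
    "config": {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": "jdbc:postgresql://db:5432/shop",
        "connection.user": "kafka",
        "connection.password": "secret",
        "mode": "timestamp+incrementing",   # pick up new and updated rows
        "timestamp.column.name": "updated_at",
        "incrementing.column.name": "id",
        "table.whitelist": "orders",
        "topic.prefix": "db.",              # rows land in topic "db.orders"
        # Lower value = lower latency, but more load on the database.
        "poll.interval.ms": "1000",
    },
}

resp = requests.post("http://localhost:8083/connectors",
                     headers={"Content-Type": "application/json"},
                     data=json.dumps(connector))
resp.raise_for_status()
print(resp.json())
```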
And what if a ksqlDB stream needs data from the source for a join, but the data is still loading?
ksqlDB generally processes your data in time order. When joining two topics, ksqlDB will process the side with the oldest data. This will generally mean the stream data is not processed until the table is bootstrapped.
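As an illustration of a stream-table join where the table side must be bootstrapped before stream records are enriched, here is a hedged sketch that submits the statements to ksqlDB's REST API; the topic names, columns, and the ksqlDB address (localhost:8088) are assumptions.

```python
# A minimal sketch of a stream-table join submitted to ksqlDB's REST API.
# Topic names, columns, and the ksqlDB host are assumptions.
import requests

statements = """
  CREATE TABLE customers (id VARCHAR PRIMARY KEY, name VARCHAR)
    WITH (KAFKA_TOPIC='db.customers', VALUE_FORMAT='JSON');
  CREATE STREAM orders (order_id VARCHAR, customer_id VARCHAR, amount DOUBLE)
    WITH (KAFKA_TOPIC='orders', VALUE_FORMAT='JSON');
  -- Stream records are only enriched once the matching table side has been
  -- loaded; ksqlDB processes both sides in timestamp order.
  CREATE STREAM orders_enriched AS
    SELECT o.order_id, o.amount, c.name
    FROM orders o
    JOIN customers c ON o.customer_id = c.id
    EMIT CHANGES;
"""

resp = requests.post("http://localhost:8088/ksql",
                     json={"ksql": statements, "streamsProperties": {}})
print(resp.status_code, resp.json())
```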
Does the source connector send updated/inserted data to the topic, and does that happen instantly?
Not sure how this question differs from question #1.

Related

ksqlDB for finding the average over the last hour, and storing results back to a Kafka topic?

We have a Redpanda (Kafka-compatible) source with sensor data. Can we do the following:
Every hour, find the average sensor reading over the last hour for each sensor
Store the results back to a topic
You want to create a materialized view over the stream of events that can be queried by other applications. Your source publishes the individual events to Kafka/Redpanda, and another process observes the events and makes them available as queryable "tables" for other applications. Elaborating a few options:
ksqlDB is likely the default choice as it comes as "native" in the Kafka/Confluent stack. Be careful with running it over your production Kafka cluster, as it can have a heavy impact on cluster performance. See the basic tutorial or the advanced tutorial; a sketch of this approach follows after the list of options.
Use an out-of-the-box solution for materialized views such as Materialize. It's the easiest to set up and use and doesn't stress the Kafka broker. However, it is single-node only as of now (06/2022). See the tutorial.
Another popular option is using a stream processor and storing hourly aggregates in an attached database (for example Flink storing data to Redis). This is a do-it-yourself approach. Have a look at Hazelcast: it is one process running both the stream processing services and a queryable store.
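To make the ksqlDB option above concrete, here is a hedged sketch of an hourly tumbling-window average per sensor whose result table is backed by a Kafka topic; the stream name, column names, and ksqlDB address are assumptions.

```python
# Hedged sketch of the ksqlDB option: an hourly tumbling-window average per
# sensor, written back to a Kafka topic backing the resulting table.
# Stream/column names and the ksqlDB address are assumptions.
import requests

statements = """
  CREATE STREAM sensor_readings (sensor_id VARCHAR KEY, reading DOUBLE)
    WITH (KAFKA_TOPIC='sensor-data', VALUE_FORMAT='JSON');
  -- The aggregate is materialized as a table; its changelog is itself a Kafka
  -- topic (named via KAFKA_TOPIC) that other applications can consume.
  CREATE TABLE sensor_hourly_avg
    WITH (KAFKA_TOPIC='sensor-hourly-avg', VALUE_FORMAT='JSON') AS
    SELECT sensor_id, AVG(reading) AS avg_reading
    FROM sensor_readings
    WINDOW TUMBLING (SIZE 1 HOUR)
    GROUP BY sensor_id
    EMIT CHANGES;
"""

resp = requests.post("http://localhost:8088/ksql",
                     json={"ksql": statements, "streamsProperties": {}})
print(resp.status_code)
```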

Kafka and microservices - Architecture questions

In a microservices-based architecture, who writes to Kafka: the services themselves or the microservices' databases? I've been thinking about this and see pros and cons to both approaches, but I'm leaning towards having the database write to Kafka topics because:
Database and data in the Kafka topic won't go out of sync in case the write to Kafka fails for whatever reason
Application teams won't have to have one more step to worry about
Applications can keep focusing on the core function rather than worrying about Kafka.
Thanks for your inputs
As cricket_007 has been saying, databases typically cannot write to Apache Kafka themselves; instead, you'd need a change data capture (CDC) service such as Debezium in order to stream data changes from the database into Kafka (disclaimer: I'm the lead of Debezium).
Such an approach allows you to ensure (eventual) consistency between a service's own database and the Kafka messages sent to other services. One specific CDC application I'd recommend looking into is the outbox pattern. The idea there is to not capture changes to the service's actual business tables, but instead work with a separate "outbox table", into which the service writes specific messages meant for consumption by other services. CDC would then be used to send these events from that table to Kafka.
This approach avoids exposing internal data structures to outside consumers while also avoiding the issue of "dual writes", which a service would suffer from when writing directly to both its database and Kafka. In Debezium there is built-in support for the outbox pattern via a message transformation that helps route the events from the outbox table into event-type-specific Kafka topics.
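As a rough illustration of the write side of the outbox pattern, the sketch below inserts the business row and the outbox row in a single database transaction, assuming PostgreSQL and a generic "outbox" table that Debezium (or another CDC tool) streams into Kafka; table and column names are illustrative assumptions.

```python
# Hedged sketch of the outbox pattern's write side. Table and column names,
# and the connection string, are assumptions.
import json
import uuid
import psycopg2  # pip install psycopg2-binary

conn = psycopg2.connect("dbname=orders user=app password=secret host=db")

def place_order(customer_id: str, amount: float) -> None:
    order_id = str(uuid.uuid4())
    event = {"orderId": order_id, "customerId": customer_id, "amount": amount}
    with conn:                      # one transaction: both rows commit or neither does
        with conn.cursor() as cur:
            # 1) the service's own business table
            cur.execute(
                "INSERT INTO orders (id, customer_id, amount) VALUES (%s, %s, %s)",
                (order_id, customer_id, amount),
            )
            # 2) the outbox table, which CDC turns into a Kafka message
            cur.execute(
                "INSERT INTO outbox (id, aggregate_type, aggregate_id, event_type, payload) "
                "VALUES (%s, %s, %s, %s, %s)",
                (str(uuid.uuid4()), "order", order_id, "OrderPlaced", json.dumps(event)),
            )

place_order("customer-42", 99.90)
```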
Not all services need a database, they just emit data (logs, metrics, sensors, etc)
So, the answer would be either.
Plus, I'm not sure which databases can export directly to Kafka, so you'd have some other service like Debezium deployed, polling those CDC records off the database.
Application developers still have to "worry" about how to deserialize their data, how many partitions are in the topic so they can scale out consumption, how to manage offsets, and other things.
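To illustrate those consumer-side concerns, here is a small sketch using the confluent-kafka client: it deserializes JSON, scales out via a consumer group (at most one consumer per partition is useful), and commits offsets manually; the topic and group names are assumptions.

```python
# Illustrative sketch of the consumer-side concerns mentioned above:
# deserialization, scaling via a consumer group, and manual offset management.
# Topic and group names are assumptions.
import json
from confluent_kafka import Consumer  # pip install confluent-kafka

consumer = Consumer({
    "bootstrap.servers": "localhost:9092",
    "group.id": "order-processors",   # run N copies (N <= partition count) to scale out
    "auto.offset.reset": "earliest",
    "enable.auto.commit": False,      # we commit offsets ourselves after processing
})
consumer.subscribe(["orders"])

try:
    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        order = json.loads(msg.value())      # deserialization is the consumer's job
        print(f"partition={msg.partition()} offset={msg.offset()} order={order}")
        consumer.commit(message=msg)         # record progress only after handling the message
finally:
    consumer.close()
```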

How to efficiently repair data in large Kafka / Kafka Streams applications

Project:
The application I am working on processes financial transaction data (orders and trades), several million records per day.
The data is fed into a Kafka topic.
Kafka Streams microservices aggregate the information (e.g. number of trades per stock), and this data is consumed by other software. In addition, the data is persisted in MongoDB.
Problem:
The data sent to the topic sometimes needs to be modified, e.g. price changes due to a bug or misconfiguration.
Since Kafka is append-only, I do the correction in MongoDB, and after that the corrected data is piped into a new Kafka topic, leading to a complete recalculation of the downstream aggregations.
However, this process causes scalability concerns, as more and more data needs to be replayed over time.
Question:
I am considering splitting the large Kafka topic into daily topics, so that only a single day's topic needs to be replayed in most cases of data repair.
My question is whether this is a plausible way to address the problem, or whether there are better solutions to it.
Data repair, or error handling in general, with Kafka heavily depends on the use case. In our case we built our system based on the CQRS + event sourcing principles (generic description here), and as a result, for data repair we use "compensating events" (i.e. an event that amends the effects of another event); eventually the system becomes consistent.
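As a hedged sketch of what a compensating event could look like in this scenario, the snippet below publishes a price-correction event keyed like the original trade, so downstream aggregations can amend their state instead of replaying history; the topic, key, and payload shape are assumptions, not the original system's format.

```python
# Hedged sketch of a "compensating event": publish a correction keyed like the
# original event so downstream aggregations can amend their state.
# Topic, key, and payload shape are illustrative assumptions.
import json
from confluent_kafka import Producer  # pip install confluent-kafka

producer = Producer({"bootstrap.servers": "localhost:9092"})

correction = {
    "type": "TradePriceCorrected",
    "tradeId": "T-123456",
    "oldPrice": 101.50,
    "newPrice": 101.05,            # downstream consumers apply the delta or replace the value
    "reason": "misconfigured price feed",
}

# Same key as the original trade event, so it lands in the same partition and
# preserves per-key ordering for downstream processors.
producer.produce("trades", key="T-123456", value=json.dumps(correction))
producer.flush()
```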

Is there a way that I can push historical data into Druid over HTTP?

I have an IoT project and want to use Druid as a time-series DBMS. Sometimes the IoT device may lose the network and will re-transfer the historical data along with real-time data when reconnecting to the server. I know that Druid can ingest real-time data over HTTP push/pull and historical data over HTTP pull or KIS (the Kafka indexing service), but I can't find documentation about ingesting historical data over HTTP push.
Is there a way that I can send historical data into Druid over HTTP push?
I see a few options here:
Keep pushing historical data to the same Kafka topic (or other streaming source) and do a rejection based on the message timestamp inside Druid. This simplifies your application architecture and lets Druid handle the rejection of expired events.
Use batch ingestion for historical data. You push the historical data to another Kafka topic, run a Spark/Gobblin/any other indexing job to get the data to HDFS, and then do a batch ingestion into Druid. But remember that Druid overwrites any real-time segments with batch segments for the specified windowPeriod, so if the historical data is not complete you run into data loss. To prevent this, you could always pump real-time data into Hadoop as well, de-duplicate the HDFS data periodically, and ingest it into Druid. As you can see this is a complicated architecture, but it can result in minimal data loss.
If I were you, I would simplify and send all data to the same streaming source like Kafka. I would index segments in Druid based on my message's timestamp and not the current time (which is the default, I believe); a sketch of such a setup follows below.
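Below is a hedged sketch of a Kafka supervisor spec that indexes on the event's own timestamp and rejects very late events; exact field names differ between Druid versions, so treat the names (and the Overlord address) as assumptions to be checked against the Kafka ingestion docs.

```python
# Hedged sketch: a Druid Kafka supervisor spec that uses the event's own
# timestamp column and rejects events arriving too late. Field names and
# addresses are assumptions; verify against the Druid Kafka ingestion docs.
import requests

supervisor_spec = {
    "type": "kafka",
    "dataSchema": {
        "dataSource": "iot_metrics",
        # Index on the message's own timestamp, not the arrival time.
        "timestampSpec": {"column": "event_time", "format": "iso"},
        "dimensionsSpec": {"dimensions": ["device_id"]},
        "granularitySpec": {"segmentGranularity": "hour",
                            "queryGranularity": "minute"},
    },
    "ioConfig": {
        "topic": "iot-readings",
        "inputFormat": {"type": "json"},
        "consumerProperties": {"bootstrap.servers": "kafka:9092"},
        # Drop events whose timestamp is more than 1 day behind the task start.
        "lateMessageRejectionPeriod": "P1D",
    },
    "tuningConfig": {"type": "kafka"},
}

resp = requests.post("http://overlord:8090/druid/indexer/v1/supervisor",
                     json=supervisor_spec)
print(resp.status_code, resp.text)
```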
The Kafka indexing service, released recently, guarantees exactly-once ingestion.
Refer to the link below: http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html
If you still want to ingest over HTTP, you can check out Tranquility Server. It has some mechanisms built in for handling duplicates.

Why do we require Apache Kafka with NoSQL databases?

Apache Kafka is a real-time messaging service. It stores streams of data safely in a distributed and fault-tolerant way. We can filter streaming data as it comes in from the producer. I don't understand why we need NoSQL databases like MongoDB to store the same data that is in Apache Kafka. The real question is: why do we store the same data in both a NoSQL database and Apache Kafka?
I think that if we need a NoSQL database, we could collect the streams of data from clients directly into MongoDB without using Apache Kafka. But most big data architectures prefer using Apache Kafka between the data source and the NoSQL database. (see)
What are the advantages of that for real systems?
This architecture has several advantages:
Kafka as Data Integration Bus
It helps distribute data between several producers and many consumers easily. Here Apache Kafka serves as a data integration message bus.
Kafka as Data Buffer
Putting Kafka in front of your "end" data stores like MongoDB or MySQL acts as a natural data buffer, so you are able to deploy/maintain/redeploy your consumer services independently. While your service is down for maintenance, Kafka keeps storing all incoming data, which is quite useful.
Kafka as Short-Term Data Storage
You don't have to store everything in Kafka: very often you use Kafka topics with retention, meaning all data older than some threshold will be deleted by Kafka automatically. So, for example, you may have a Kafka topic with 1 week retention (so you store only 1 week of data) while at the same time your data lives in long-term storage services like classic SQL databases or Cassandra etc.
Kafka as Long-Term Data Storage
On the other hand, you can use Apache Kafka as a long-term storage system. Using compacted topics enables you to keep only the last value for each key, so your topic becomes a store of the latest state of your app.
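As a small illustration of the two storage modes above, the sketch below creates one topic with one-week retention and one compacted topic using confluent-kafka's AdminClient; the topic names, partition counts, replication factor, and broker address are assumptions.

```python
# Hedged sketch contrasting the two storage modes: a topic with 1-week
# retention vs. a compacted topic keeping the latest value per key.
# Topic names and broker address are assumptions.
from confluent_kafka.admin import AdminClient, NewTopic  # pip install confluent-kafka

admin = AdminClient({"bootstrap.servers": "localhost:9092"})

topics = [
    # Short-term storage: messages older than 7 days are deleted automatically.
    NewTopic("clickstream", num_partitions=6, replication_factor=3,
             config={"cleanup.policy": "delete",
                     "retention.ms": str(7 * 24 * 60 * 60 * 1000)}),
    # Long-term "latest state" storage: compaction keeps the last value per key.
    NewTopic("customer-state", num_partitions=6, replication_factor=3,
             config={"cleanup.policy": "compact"}),
]

for topic, future in admin.create_topics(topics).items():
    try:
        future.result()             # raises if creation failed
        print(f"created {topic}")
    except Exception as exc:
        print(f"failed to create {topic}: {exc}")
```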