Druid.io: update/override existing data via streams from Kafka (Druid Kafka indexing service)

I'm loading streams from Kafka using the Druid Kafka indexing service.
But the data I upload changes over time, so I need to reload it and avoid duplicates and collisions with data that was already loaded.
I have researched the docs about Updating Existing Data in Druid.
But all the information there is about Hadoop batch ingestion and lookups.
Is it possible to update existing Druid data during Kafka streaming?
In other words, I need to overwrite the old values with new ones using the Kafka indexing service (streams from Kafka).
Maybe there is some kind of setting to overwrite duplicates?

Druid is in a way a time-series database where the data gets "finalised" and written to a log every time-interval. It does aggregations and optimises columns for storage and easy queries when it "finalises" the data.
By "finalising", what I mean is that Druid assumes that the data for the specified interval is already present and it can safely do its computations on top of them. So this in effect means that there is no support for you to update the data (like you do in a database). Any data that you write is treated as a new data and it keeps adding to its computations.
But Druid is different in the sense it provides a way to upload historical data for the same time period the real-time indexing has already taken place. This batch upload will overwrite any segments with the new ones and further queries will reflect the latest uploaded batch data.
So I am afraid the only option would be to do batch ingestion. Maybe you could still send the data to Kafka, but have a Spark/Gobblin job that does de-duplication and writes to Hadoop. Then have a simple cron job to re-index these as a batch onto Druid.
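For illustration, here is a minimal sketch of such a de-duplication job using Spark's Java API; the topic name, de-duplication columns, and HDFS path are all hypothetical, and the output location would then be referenced by a cron-driven Druid batch ingestion spec.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class DedupToHdfsJob {
    public static void main(String[] args) {
        SparkSession spark = SparkSession.builder()
                .appName("kafka-dedup-to-hdfs")
                .getOrCreate();

        // Batch-read the raw events from Kafka (topic name is hypothetical).
        Dataset<Row> raw = spark.read()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "raw-events")
                .option("startingOffsets", "earliest")
                .load();

        // Keep only one copy of each logical record; here key + payload
        // are treated as the identity of a record.
        Dataset<Row> deduped = raw
                .selectExpr("CAST(key AS STRING) AS key",
                            "CAST(value AS STRING) AS value",
                            "timestamp")
                .dropDuplicates("key", "value");

        // Write to HDFS, where the periodic Druid batch ingestion picks it up.
        deduped.write()
                .mode("overwrite")
                .parquet("hdfs:///warehouse/deduped-events/");
    }
}
```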

Related

What is the point of using Kafka in this example and why not use DB straightaway?

Here is an example of how Kafka should run for a Social network site.
But it is hard for me to understand the point of Kafka here. We would not want to store posts and likes in Kafka, as they will be destroyed after some time. So Kafka should be an intermediate storage between the view and the DB.
But why would we need it? Wouldn't it be better to use the DB straightaway?
I guess that we could use Kafka as some kind of cache, so the data accumulates in Kafka and then we can insert it into the DB in one big batch query. But I am pretty sure that is not the reason for Kafka here.
What's not shown in the diagram are the processes querying the database (RocksDB, in this case). Without using Kafka Streams, you'd need to write some external service to run GROUP BY / SUM on the database. The "website" box on the left is doing some sort of front-end JavaScript, and it is unclear how the Kafka backend consumer sends data to it (perhaps WebSockets?).
With Kafka Streams Interactive Queries, that logic can be moved closer to the actual event source and is performed in near real time, rather than as a polling batch. In a streaming framework, you could also send out individual event hooks (WebSockets, for example) to dynamically update "likes per post", "shares per post", "trends", etc. without needing the user to refresh the page, or having the page make AJAX calls with large API responses for all rendered items.
More specifically, each Kafka Stream instance serves a specific query, rather than the API hitting one database for all queries. Therefore, load is more distributed and fault tolerant.
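As a rough sketch of that pattern (the topic, store, and key names below are made up), a "likes per post" count with a queryable state store might look like this in the Kafka Streams Java API:

```java
import java.util.Properties;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StoreQueryParameters;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.state.QueryableStoreTypes;
import org.apache.kafka.streams.state.ReadOnlyKeyValueStore;

public class LikesPerPost {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "likes-per-post-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Each "like" event is keyed by post id; count them into a named,
        // queryable state store instead of polling a database with GROUP BY.
        builder.stream("post-likes")
               .groupByKey()
               .count(Materialized.as("likes-per-post-store"));

        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();

        // Interactive query: read the current count for a post directly from
        // this instance's local state store (in a real app, wait for the
        // instance to reach RUNNING state before querying).
        ReadOnlyKeyValueStore<String, Long> store = streams.store(
                StoreQueryParameters.fromNameAndType(
                        "likes-per-post-store",
                        QueryableStoreTypes.keyValueStore()));
        System.out.println("likes for post-42: " + store.get("post-42"));
    }
}
```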
Worth pointing out that Apache Pinot loaded from Kafka is more suited for such real time analytical queries than Kafka Streams.
Also, as you pointed out, Kafka or any message queue would act as a buffer ahead of any database (not a cache, although Redis could be added as a cache, just like the search service mentioned later). And there's nothing preventing you from adding another database connected via a Kafka Connect sink. For instance, a popular design is to write data to an RDBMS as well as to Elasticsearch for text-based search indexing. The producer code only cares about one Kafka topic, not every downstream system where the data is needed.
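A trivial sketch of that decoupling (topic name and payload are made up): the producer writes once to Kafka and knows nothing about the RDBMS or Elasticsearch sinks that Kafka Connect fans the data out to.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class PostEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // The application only writes to one topic; downstream sinks
            // (RDBMS, Elasticsearch, ...) are attached via Kafka Connect.
            producer.send(new ProducerRecord<>("post-events", "post-42",
                    "{\"type\":\"like\",\"user\":\"alice\"}"));
        }
    }
}
```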

Stream CDC change with Kafka and Spark still processes it in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. In my proposed architecture, PostgreSQL provides data for Kafka. The condition is that PostgreSQL is not empty and I want to capture any CDC change in the database. In the end, I want to grab the Kafka messages and process them as a stream with Spark, so I can get analysis of what is happening at the same time the CDC events happen.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the source of data comes from an API that we want to monitor, and there are only limited examples of database-to-database stream processing. I have done this before from Kafka to another database, but I need to transform and aggregate the data (I'm not using Confluent; I rely on the generic Kafka + Debezium + JDBC connectors).
For my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Structured Streaming with Kafka in continuous or non-continuous mode, you will always get a micro-batch. You can still process the individual records, so I am not sure what the issue is.
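For example (broker address and the Debezium-style topic name below are placeholders), a Structured Streaming job in Java still receives micro-batches, but a `ForeachWriter` sink hands you one record at a time:

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.streaming.StreamingQuery;

public class CdcPerRecord {
    public static void main(String[] args) throws Exception {
        SparkSession spark = SparkSession.builder()
                .appName("cdc-per-record")
                .getOrCreate();

        // Read the CDC topic produced by Debezium (topic name is hypothetical).
        Dataset<Row> cdc = spark.readStream()
                .format("kafka")
                .option("kafka.bootstrap.servers", "broker:9092")
                .option("subscribe", "postgres.public.orders")
                .load()
                .selectExpr("CAST(key AS STRING) AS key", "CAST(value AS STRING) AS value");

        // The data still arrives in micro-batches, but the ForeachWriter
        // sink processes every record individually.
        StreamingQuery query = cdc.writeStream()
                .foreach(new ForeachWriter<Row>() {
                    @Override
                    public boolean open(long partitionId, long epochId) {
                        return true; // open per-partition resources here if needed
                    }

                    @Override
                    public void process(Row record) {
                        // Per-record logic: transform, aggregate, forward, etc.
                        System.out.println(record.getString(0) + " -> " + record.getString(1));
                    }

                    @Override
                    public void close(Throwable errorOrNull) {
                        // release per-partition resources
                    }
                })
                .start();

        query.awaitTermination();
    }
}
```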
If you want to process per record, then use a Spring Boot Kafka setup for consuming the Kafka messages; that can work in various ways and fulfill your need. Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
This article discusses the single-message processing approach: https://medium.com/@contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b
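Along the lines of that article, a minimal per-record consumer with spring-kafka might look like this (topic, group id, and deserializer configuration are assumptions; serializers would be set in application properties):

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@SpringBootApplication
public class PerRecordConsumerApplication {
    public static void main(String[] args) {
        SpringApplication.run(PerRecordConsumerApplication.class, args);
    }
}

@Component
class CdcListener {

    // Invoked once per Kafka record, so every CDC event is handled individually.
    @KafkaListener(topics = "postgres.public.orders", groupId = "cdc-consumer")
    public void onMessage(String value) {
        // transform / aggregate / forward the single record here
        System.out.println("received CDC event: " + value);
    }
}
```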

Why do we need a database when using Apache Kafka?

According to the schema, data comes to Kafka, then to a stream, and then to MapR-DB.
After storing the data in the DB, the user can display the data on a map.
The question is: why do we use a DB to display data on the map if Kafka is already a DB?
It seems to me that it is slower to get real-time data from MapR-DB than from Kafka.
What do you think, why does this example use this approach?
The core abstraction Kafka provides for a stream of records is known as a topic. You can imagine topics as the tables in a database. A database (Kafka) can have multiple tables (topics). As in databases, a topic can hold any kind of records, depending on the use case. But note that Kafka is not a database.
Also note that in most cases you would have to configure a retention policy. This means that messages will at some point be deleted, based on a configurable time- or size-based retention policy. Therefore, you need to store the data in a persistent storage system, and in this case that is your database.
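For example, a topic's retention is just a config value. A sketch like the following (broker address and topic name are placeholders) sets a 7-day time-based retention, after which Kafka is free to delete the messages, which is exactly why a persistent store is still needed downstream:

```java
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.Admin;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class SetTopicRetention {
    public static void main(String[] args) throws Exception {
        Properties props = new Properties();
        props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");

        try (Admin admin = Admin.create(props)) {
            ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "events");

            // retention.ms = 7 days: older records may be removed by Kafka.
            AlterConfigOp setRetention = new AlterConfigOp(
                    new ConfigEntry("retention.ms", String.valueOf(7L * 24 * 60 * 60 * 1000)),
                    AlterConfigOp.OpType.SET);

            admin.incrementalAlterConfigs(Map.of(topic, List.of(setRetention))).all().get();
        }
    }
}
```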
You can read more about how Kafka works in this blog post.

Is there any way to connect MongoDB to Druid?

My organisation has MongoDB, which stores application-based time-series data. Now we are trying to create a data pipeline for analytics and visualisation. Because the data is time series, we plan to use Druid as intermediate storage, where we can do the required transformations, and then use Apache Superset to visualise it. Is there any way to migrate the required data (not only updates) from MongoDB to Druid?
I was thinking about Apache Kafka, but from what I have read, I understood that it works best for streaming the changes happening in topics (a topic associated with each table) that already exist in MongoDB and Druid. But what if there is a table of at least 100,000 records that exists only in MongoDB, and I first wish to push the whole table to Druid: will Kafka work in this scenario?

Is there a way that I can push historical data into Druid over HTTP?

I have an IoT project and want to use Druid as a time-series DBMS. Sometimes an IoT device may lose its network connection and will re-transfer the historical data along with the real-time data when reconnecting to the server. I know that Druid can ingest real-time data over HTTP push/pull and historical data over HTTP pull or the Kafka indexing service (KIS), but I can't find documentation about ingesting historical data over HTTP push.
Is there a way that I can send historical data into Druid over HTTP push?
I see a few options here:
Keep pushing historical data to the same Kafka topic (or other streaming source) and do a rejection based on message timestamp inside Druid. This simplifies your application architecture and lets Druid handle the rejection of expired events.
Use batch ingestion for the historical data. You push the historical data to another Kafka topic, run a Spark/Gobblin/any other indexing job to get the data to HDFS, and then do a batch ingestion into Druid. But remember that Druid overwrites any real-time segments with batch segments for the specified windowPeriod. So if the historical data is not complete, you run into data loss. To prevent this, you could always pump the real-time data into Hadoop as well, de-duplicate the HDFS data periodically, and ingest it into Druid. As you can see, this is a more complicated architecture, but it can result in minimal data loss.
If I were you, I would simplify and send all data to the same streaming source, such as Kafka. I would index segments in Druid based on the message's timestamp and not the current time (which is the default, I believe).
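For instance, one way to make the event time travel with the message (rather than the time of re-transmission) is to carry it in the payload and set the Kafka record timestamp explicitly when producing; Druid's timestampSpec would then point at the event-time field. The topic, device id, and payload below are placeholders:

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class EventTimeProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "broker:9092");
        props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
        props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

        // Event time as recorded on the IoT device, not the time of re-transmission.
        long eventTimeMillis = 1_584_000_000_000L;

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // ProducerRecord accepts an explicit timestamp (topic, partition,
            // timestamp, key, value), and the payload also carries the event time.
            producer.send(new ProducerRecord<>(
                    "iot-measurements", null, eventTimeMillis, "device-17",
                    "{\"timestamp\": " + eventTimeMillis + ", \"temperature\": 21.5}"));
        }
    }
}
```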
The recently released Kafka indexing service guarantees exactly-once ingestion.
Refer to the link below: http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html
If you still want to ingest over HTTP, you can check out Tranquility Server. It has some built-in mechanisms for handling duplicates.