Is there a way that I can push historical data into Druid over HTTP?

I have an IoT project and want to use Druid as a time-series DBMS. Sometimes an IoT device may lose the network and will re-transfer the historical data along with the real-time data when it reconnects to the server. I know that Druid can ingest real-time data over HTTP push/pull and historical data over HTTP pull or the Kafka Indexing Service (KIS), but I can't find documentation about ingesting historical data over HTTP push.
Is there a way that I can send historical data into Druid over HTTP push?

I see a few options here:
1. Keep pushing historical data to the same Kafka topic (or other streaming source) and reject events based on the message timestamp inside Druid. This simplifies your application architecture and lets Druid handle the rejection of expired events.
2. Use batch ingestion for the historical data. You push the historical data to another Kafka topic, run a Spark/Gobblin/any other indexing job to get the data onto HDFS, then do a batch ingestion into Druid. But remember that Druid overwrites any real-time segments with batch segments for the specified windowPeriod, so if the historical data is not complete you run into data loss. To prevent this, you could always pump the real-time data into Hadoop as well, periodically de-duplicate the HDFS data, and ingest that into Druid. As you can see this is a more complicated architecture, but it can result in minimal data loss.
If I were you, I would simplify and send all data to the same streaming source such as Kafka. I would index segments in Druid based on my message's timestamp and not the current time (which is the default, I believe).
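A minimal sketch of that first option: when the device reconnects, replay the buffered readings to the same topic, carrying the original event time in the payload (and as the Kafka record timestamp) so that Druid's timestampSpec buckets each event into the interval it belongs to rather than the arrival time. The topic name, field names, and payload shape are made up for illustration.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerConfig;
import org.apache.kafka.clients.producer.ProducerRecord;
import org.apache.kafka.common.serialization.StringSerializer;

public class ReplayHistorical {
  public static void main(String[] args) {
    Properties props = new Properties();
    props.put(ProducerConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
    props.put(ProducerConfig.KEY_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());
    props.put(ProducerConfig.VALUE_SERIALIZER_CLASS_CONFIG, StringSerializer.class.getName());

    try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
      // Suppose the device buffered this reading while it was offline.
      long eventTimeMillis = 1672531200000L;   // when the reading actually happened
      String payload = "{\"deviceId\":\"sensor-42\",\"ts\":" + eventTimeMillis + ",\"value\":23.5}";

      // Set the original event time as the record timestamp instead of letting the
      // producer stamp it with the send time; the supervisor's timestampSpec should
      // point at the "ts" field in the payload. "iot-events" is a placeholder topic.
      producer.send(new ProducerRecord<>("iot-events", null, eventTimeMillis, "sensor-42", payload));
    }
  }
}
```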

The Kafka Indexing Service, released recently, guarantees exactly-once ingestion.
Refer to the documentation: http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html
If you still want to ingest over HTTP, you can check out Tranquility Server. It has some mechanisms built in for handling duplicates.
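If you go the Tranquility Server route, the HTTP push is just a JSON POST. A hedged sketch follows; the /v1/post/<dataSource> path and port 8200 are the Tranquility Server defaults as I recall them, and the host, datasource name, and field names are placeholders. Keep in mind that Tranquility drops events whose timestamps fall outside its windowPeriod, so very old historical data will be rejected rather than ingested.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class TranquilityHttpPush {
  public static void main(String[] args) throws Exception {
    // One event as JSON; the datasource "iot-metrics" and the fields are made up.
    String event = "{\"timestamp\":\"2023-01-01T00:00:00Z\",\"deviceId\":\"sensor-42\",\"value\":23.5}";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://tranquility-host:8200/v1/post/iot-metrics"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(event))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    // The response body reports how many events were received and how many were sent
    // on to Druid; late events outside the window show up as the difference.
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```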

Related

Streaming on-demand data on to Kafka topics based on consumer requests

We are a source system and we have a couple of downstream systems that require our data. Currently we publish events onto Kafka topics whenever there is a change in the source system, and the consumers apply those changes to their tables (all delta updates).
Apart from subscribing to the Kafka topics, our downstream systems currently access our database directly once in a while to do a complete on-demand refresh of their tables and make sure the data is in sync, since a full refresh is occasionally needed when the data appears to be out of sync for some reason.
We are planning to stop giving direct access to our database. How can we achieve this? Is there a way for consumers to request the data they need, for example by sending us a request that triggers us to publish a stream of data for them to consume, so they can sync their tables or load the bulk data into memory to perform their tasks?
We have written RESTful APIs to provide data on request, but they only expose small data volumes; that won't work when we need to send millions of records to consumers. I believe the only way is to stream the data over Kafka, but with Kafka how can we respond to a consumer's request and pump only that specific data onto Kafka topics for them to consume?
You have the option of setting the retention policy on any topic to keep messages forever with:
retention.ms: -1
see the docs
In that case you could store the entire change log in the same manner that you currently are. Then if a consumer needs to re-materialize the entire history, they can start with the first offset and go from there without you having to produce a specialized dataset.
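For reference, here is a minimal sketch of setting that retention on an existing topic with the Kafka AdminClient; the broker address and topic name are placeholders, and the same change can be made with the kafka-configs CLI.

```java
import java.util.Collection;
import java.util.List;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.admin.AdminClient;
import org.apache.kafka.clients.admin.AdminClientConfig;
import org.apache.kafka.clients.admin.AlterConfigOp;
import org.apache.kafka.clients.admin.ConfigEntry;
import org.apache.kafka.common.config.ConfigResource;

public class RetainForever {
  public static void main(String[] args) throws Exception {
    Properties props = new Properties();
    props.put(AdminClientConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");

    try (AdminClient admin = AdminClient.create(props)) {
      // "change-log" is a placeholder for the topic that carries your change events.
      ConfigResource topic = new ConfigResource(ConfigResource.Type.TOPIC, "change-log");
      // retention.ms = -1 tells the broker to keep messages forever.
      AlterConfigOp keepForever = new AlterConfigOp(
          new ConfigEntry("retention.ms", "-1"), AlterConfigOp.OpType.SET);
      Map<ConfigResource, Collection<AlterConfigOp>> change = Map.of(topic, List.of(keepForever));
      admin.incrementalAlterConfigs(change).all().get();
    }
  }
}
```

A consumer that needs a full refresh can then simply start from the first offset (for example with `auto.offset.reset=earliest` in a fresh consumer group, or by seeking to the beginning) and replay the whole change log.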

How does a connector work in ksqlDB and Kafka?

I don't have enough information about how a source connector works in ksqlDB and Kafka altogether.
How fast is the data populated to the Kafka topics?
And what if a ksqlDB stream needs data from the source for a join, but the data is still being loaded?
Does the source connector send updated/inserted data to the topic instantly?
Could you help me with these questions or advise a good tutorial where I can learn more?
How fast is the data populated to the Kafka topics?
Depends on the connector. Some connectors are event driven and some use a polling mechanism. The event-driven connectors are generally going to be more real-time, but often require more DB-side setup, whereas the polling-based connectors generally don't require any DB-side changes. With the polling-based connectors you can increase the polling frequency, trading extra DB load for lower latency.
Look more into the documentation of the connectors for more info.
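To make the polling knob concrete, here is a hedged sketch that registers a JDBC source connector through the Kafka Connect REST API with `poll.interval.ms` set to one second. The connector class and config keys follow the Confluent JDBC source connector, and the connection URL, table, and column names are assumptions; adjust everything to the connector you actually run (ksqlDB's CREATE SOURCE CONNECTOR statement takes the same key/value pairs).

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class RegisterJdbcSource {
  public static void main(String[] args) throws Exception {
    // Illustrative connector config: polls the "orders" table every second and
    // publishes rows whose "updated_at" timestamp has advanced.
    String body = """
        {
          "name": "orders-source",
          "config": {
            "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
            "connection.url": "jdbc:postgresql://db:5432/shop",
            "table.whitelist": "orders",
            "mode": "timestamp",
            "timestamp.column.name": "updated_at",
            "poll.interval.ms": "1000"
          }
        }""";

    HttpRequest request = HttpRequest.newBuilder()
        .uri(URI.create("http://localhost:8083/connectors"))   // Kafka Connect REST API
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(body))
        .build();

    HttpResponse<String> response =
        HttpClient.newHttpClient().send(request, HttpResponse.BodyHandlers.ofString());
    System.out.println(response.statusCode() + " " + response.body());
  }
}
```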
And what if a ksqlDB stream needs data from the source for a join, but the data is still being loaded?
ksqlDB generally processes your data in time order. When joining two topics, ksqlDB will process the side with the oldest data. This will generally mean the stream data is not processed until the table is bootstrapped.
Does the source connector send updated/inserted data to the topic instantly?
Not sure how this question differs from question #1.

Streaming CDC changes with Kafka and Spark still processes them in batches, whereas we wish to process each record

I'm still new to Spark and I want to learn more about it. I want to build a data pipeline architecture with Kafka and Spark. In my proposed architecture, PostgreSQL provides data for Kafka. The PostgreSQL database is not empty, and I want to catch any CDC change in it. In the end, I want to grab the Kafka messages and process them as a stream with Spark, so I can get analysis about what is happening at the same time the CDC events happen.
However, when I try to run a simple stream, it seems Spark receives the data as a stream but processes it in batches, which is not my goal. I have seen some articles where the source of data is an API to be monitored, and there are few examples of database-to-database streaming processing. I have done the process before from Kafka to another database, but I need to transform and aggregate the data (I'm not using Confluent and rely on generic Kafka + Debezium + JDBC connectors).
For my case, can Spark and Kafka meet the requirement? Thank you.
I have designed such pipelines, and if you use Spark Structured Streaming with Kafka in continuous or non-continuous mode, you will always get a micro-batch. You can still process the individual records, so I'm not sure what the issue is.
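To show what per-record processing looks like inside those micro-batches, here is a hedged Java sketch using Structured Streaming's foreach sink, which hands the rows of each micro-batch to your code one at a time. The broker address and topic name are placeholders, and it assumes the spark-sql-kafka connector is on the classpath.

```java
import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.ForeachWriter;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class PerRecordStream {
  public static void main(String[] args) throws Exception {
    SparkSession spark = SparkSession.builder()
        .appName("cdc-per-record")
        .getOrCreate();

    // Read the CDC topic produced by Debezium; "cdc.public.orders" is a placeholder.
    Dataset<Row> cdc = spark.readStream()
        .format("kafka")
        .option("kafka.bootstrap.servers", "localhost:9092")
        .option("subscribe", "cdc.public.orders")
        .load()
        .selectExpr("CAST(value AS STRING) AS json");

    // foreach() invokes process() once per record, even though Spark still
    // pulls the data from Kafka in micro-batches under the hood.
    cdc.writeStream()
        .foreach(new ForeachWriter<Row>() {
          @Override public boolean open(long partitionId, long epochId) { return true; }
          @Override public void process(Row row) {
            // per-record transformation / analysis goes here
            System.out.println(row.getString(0));
          }
          @Override public void close(Throwable errorOrNull) { }
        })
        .start()
        .awaitTermination();
  }
}
```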
If you want to process per record, then use Spring Boot's Kafka support for consuming Kafka messages; it can work in various ways and fulfil your need. Spring Boot offers various modes of consumption.
Of course, Spark Structured Streaming can be done using Scala and has a lot of support, obviating extra work elsewhere.
This article discusses the single-message processing approach: https://medium.com/@contactsunny/simple-apache-kafka-producer-and-consumer-using-spring-boot-41be672f4e2b
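For the Spring Boot route, here is a minimal hedged sketch of a listener that is invoked once per Kafka record. The topic, group id, and class names are placeholders, and it assumes the spring-kafka dependency plus `spring.kafka.bootstrap-servers` (with String deserializers) configured in the application properties.

```java
import org.springframework.boot.SpringApplication;
import org.springframework.boot.autoconfigure.SpringBootApplication;
import org.springframework.kafka.annotation.KafkaListener;
import org.springframework.stereotype.Component;

@SpringBootApplication
public class CdcConsumerApplication {
  public static void main(String[] args) {
    SpringApplication.run(CdcConsumerApplication.class, args);
  }
}

@Component
class CdcListener {
  // spring-kafka calls this method once per record rather than per batch.
  @KafkaListener(topics = "cdc.public.orders", groupId = "cdc-processor")
  public void onMessage(String value) {
    // per-record transformation / aggregation goes here
    System.out.println("CDC event: " + value);
  }
}
```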

Druid Kafka ingestion with read-your-writes

I'm learning Druid now. I read that ingestion via Kafka Indexing Service guarantees exactly-once semantics.
However, I have a problem with determining the consistency model of Druid. Typically streams are asynchronous, but I want to have read-your-writes semantics in my application.
Is there any possibility to check Druid's ingestion status? For example, I send event A and want to check if it was already saved in Druid. If yes, query to Druid should return result with this value.
Maybe there is some other possibility to do real-time ingestion with exactly-once semantics and with read-your-writes?
Druid has separate processes for ingesting and reading the data. Read-your-writes won't be directly possible; however, you can get an acknowledgement that the write succeeded, and then make a separate query to read your write.
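As a hedged sketch of that write-then-read pattern: after producing the event, poll the Broker until the event shows up in a query. This assumes Druid SQL is enabled on the Broker; the host, the datasource `iot_events`, and the `event_id` column are hypothetical.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

public class ReadYourWrite {
  public static void main(String[] args) throws Exception {
    HttpClient http = HttpClient.newHttpClient();

    // Count rows for the event we just wrote; datasource and column are made up.
    String sql = "{\"query\": \"SELECT COUNT(*) AS cnt FROM iot_events"
        + " WHERE event_id = 'abc-123'\"}";

    HttpRequest query = HttpRequest.newBuilder()
        .uri(URI.create("http://broker:8082/druid/v2/sql"))
        .header("Content-Type", "application/json")
        .POST(HttpRequest.BodyPublishers.ofString(sql))
        .build();

    for (int attempt = 0; attempt < 30; attempt++) {
      HttpResponse<String> resp = http.send(query, HttpResponse.BodyHandlers.ofString());
      // Crude visibility check: a non-zero count means the write is now readable.
      if (resp.body().contains("\"cnt\":1")) {
        System.out.println("Event is visible to queries");
        return;
      }
      Thread.sleep(1000);   // give real-time ingestion a moment to catch up
    }
    System.out.println("Event is not visible yet");
  }
}
```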
Check out Tranquility Server, which gives an HTTP-based gateway to write in real time, and it tries to handle exactly-once ingestion too.
Though the best approach to ensure exactly-once ingestion is to re-index via batch ingestion at regular intervals, depending on your use case.

Druid.io: update/override existing data via streams from Kafka (Druid Kafka indexing service)

I'm loading streams from Kafka using the Druid Kafka indexing service.
But the data I upload keeps changing, so I need to reload it and avoid duplicates and collisions with data that was already loaded.
I researched the docs about Updating Existing Data in Druid.
But everything there is about Hadoop batch ingestion and lookups.
Is it possible to update existing Druid data during Kafka streams?
In other words, I need to rewrite old values with new ones using the Kafka indexing service (streams from Kafka).
Maybe there is some kind of setting to rewrite duplicates?
Druid is, in a way, a time-series database where the data gets "finalised" and written out every time interval. It does aggregations and optimises columns for storage and easy querying when it "finalises" the data.
By "finalising", I mean that Druid assumes the data for the specified interval is already present and that it can safely do its computations on top of it. So this in effect means that there is no support for updating the data (like you would in a database). Any data that you write is treated as new data, and it keeps adding to its computations.
But Druid is different in the sense that it provides a way to upload historical data for the same time period for which real-time indexing has already taken place. This batch upload will overwrite the existing segments with the new ones, and further queries will reflect the latest uploaded batch data.
So I am afraid the only option would be to do batch ingestion. Maybe you could still send the data to Kafka, but have a Spark/Gobblin job that de-duplicates it and writes to Hadoop, then a simple cron job to re-index these as a batch into Druid.
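A minimal sketch of that de-duplication step in Spark (Java), assuming the raw Kafka dump already sits on HDFS as Parquet with hypothetical `id` and `ts` columns: keep only the newest row per key, write the result back to HDFS, and point the Druid batch (re)indexing task at that output.

```java
import static org.apache.spark.sql.functions.col;
import static org.apache.spark.sql.functions.row_number;

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;
import org.apache.spark.sql.expressions.Window;
import org.apache.spark.sql.expressions.WindowSpec;

public class DedupForReindex {
  public static void main(String[] args) {
    SparkSession spark = SparkSession.builder()
        .appName("dedup-for-druid-reindex")
        .getOrCreate();

    // Raw events dumped from Kafka to HDFS; paths and column names are placeholders.
    Dataset<Row> raw = spark.read().parquet("hdfs:///data/events/raw/2020-01-01");

    // Keep only the latest record per business key, ordered by event timestamp.
    WindowSpec latestFirst = Window.partitionBy("id").orderBy(col("ts").desc());
    Dataset<Row> deduped = raw
        .withColumn("rn", row_number().over(latestFirst))
        .filter(col("rn").equalTo(1))
        .drop("rn");

    // This directory becomes the input of the Druid batch re-indexing task.
    deduped.write().mode("overwrite").parquet("hdfs:///data/events/deduped/2020-01-01");

    spark.stop();
  }
}
```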