My question is about Druid real-time ingestion

I am able to ingest real-time data into Druid using Tranquility. The only problem I am facing is that I can query rows in Druid while ingestion is running, but when the real-time process is stopped, the newly created datasource is gone and I see 0.0 KB in the data sources view.
What could be the reason?

Related

Publishing custom application metrics from Spark streaming job

I am running a Spark streaming job in which a few million Kafka events are processed in a 24-hour window that slides every 3 minutes. There are multiple steps in the job, including aggregating and filtering the events based on certain fields, joining them with static RDDs loaded from files in S3, and finally running an MLlib transformation on each aggregated row of the window, publishing the results to a Kafka topic.
I need a way to publish a bunch of application metrics, starting with how much time it takes to complete processing for each window, how many raw events are processed, the size in bytes of the data processed, etc.
I've searched through all the events that Spark publishes, and the executor-level events don't give me what I need. I'm trying out Kamon and Spark's MetricsSource and Sink for now.
Any suggestions on the best way to accomplish this?
Also, I'm using Spark 2.4 for now, as the original codebase is pretty old, but will be migrating to Spark 3.x soon.
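Since the job uses the DStream API, one hedged option (not the only one) is Spark's StreamingListener, which receives per-batch timing and record counts straight from the scheduler. A minimal sketch in Scala; the publish callback is a hypothetical stand-in for whatever backend (Kamon, StatsD, a Kafka topic) the metrics are forwarded to:

import org.apache.spark.streaming.scheduler.{StreamingListener, StreamingListenerBatchCompleted}

// Emits per-batch metrics; `publish` is a placeholder for the real metrics backend.
class BatchMetricsListener(publish: (String, Double) => Unit) extends StreamingListener {
  override def onBatchCompleted(event: StreamingListenerBatchCompleted): Unit = {
    val info = event.batchInfo
    publish("batch.numRecords", info.numRecords.toDouble)                          // raw events in the batch
    info.processingDelay.foreach(ms => publish("batch.processingMs", ms.toDouble)) // time to process the batch
    info.schedulingDelay.foreach(ms => publish("batch.schedulingMs", ms.toDouble)) // time spent queued
  }
}

// Registration on an existing StreamingContext `ssc`:
// ssc.addStreamingListener(new BatchMetricsListener((name, value) => println(s"$name=$value")))

Note the listener reports batch-level counts and timings, not byte sizes; data size in bytes would still have to be measured inside the job itself.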

How can I set the micro-batch size in Spark Structured Streaming from a Kafka topic?

I have a Spark Structured Streaming app that reads from Kafka and writes to Elasticsearch and S3. I have enabled checkpointing to an S3 bucket as well (the app runs on AWS EMR). I saw in the S3 bucket that over time the commits become less frequent and there is an ever-growing delay in the data.
So I want Spark to always process batches with the same amount of data. I tried setting .option("maxOffsetsPerTrigger", 100), but the batch size didn't become smaller; there was still a huge amount of time between commits.
As I understood it, this option just tells Spark how much data to consume from Kafka per poll, and Spark simply polls multiple times and then writes, so there is no limit on the batch size.
I also tried to use continuous mode, but the submit failed, I guess because the output sink / foreachBatch doesn't support it.
Any ideas are welcome, I will try everything ^^
Actually, each offset contained so much data that I had to limit maxOffsetsPerTrigger to 50, and I had to delete the old checkpoint folder. I read somewhere that Spark first tries to finish the batch whose offsets are already recorded in the checkpoint, and only then applies maxOffsetsPerTrigger.
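For reference, this is roughly how the rate limit is wired in. A minimal Scala sketch, in which the broker address, topic, and S3 paths are placeholders rather than values from the question; maxOffsetsPerTrigger caps the total number of offsets consumed per micro-batch (split proportionally across partitions), and a restarted query first finishes any batch already planned in the checkpoint before the cap takes effect:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.streaming.Trigger

object BoundedKafkaBatches {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("bounded-kafka-batches").getOrCreate()

    val kafka = spark.readStream
      .format("kafka")
      .option("kafka.bootstrap.servers", "broker1:9092") // placeholder broker
      .option("subscribe", "events")                     // placeholder topic
      .option("maxOffsetsPerTrigger", "50")              // hard cap per micro-batch
      .load()

    kafka.selectExpr("CAST(value AS STRING) AS value")
      .writeStream
      .format("parquet")
      .option("path", "s3://my-bucket/out/events")                       // placeholder path
      .option("checkpointLocation", "s3://my-bucket/checkpoints/events") // placeholder path
      .trigger(Trigger.ProcessingTime("1 minute")) // fire a micro-batch at most once per minute
      .start()
      .awaitTermination()
  }
}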

Druid.io: update/override existing data via streams from Kafka (Druid Kafka indexing service)

I'm loading streams from Kafka using the Druid Kafka indexing service.
But the data I upload keeps changing, so I need to reload it while avoiding duplicates and collisions with data that was already loaded.
I researched the docs about updating existing data in Druid.
But all the info there is about Hadoop batch ingestion and lookups.
Is it possible to update existing Druid data while streaming from Kafka?
In other words, I need to overwrite the old values with new ones using the Kafka indexing service (streams from Kafka).
Maybe there is some kind of setting to overwrite duplicates?
Druid is, in a way, a time-series database where the data gets "finalised" and written to a log at every time interval. It does aggregations and optimises columns for storage and easy querying when it "finalises" the data.
By "finalising", I mean that Druid assumes the data for the specified interval is already present and that it can safely do its computations on top of it. So this in effect means that there is no support for updating the data (like you would in a database). Any data that you write is treated as new data, and it keeps adding to its computations.
But Druid is different in the sense that it provides a way to upload historical data for a time period for which real-time indexing has already taken place. This batch upload will overwrite the existing segments with new ones, and further queries will reflect the latest uploaded batch data.
So I am afraid the only option is batch ingestion. Maybe you could still send the data to Kafka, but have a Spark/Gobblin job that does the de-duplication and writes to Hadoop, as sketched below. Then have a simple cron job to re-index these as a batch onto Druid.
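A minimal sketch of that de-duplication step, assuming the raw Kafka dump already sits on HDFS as JSON and that each event carries an eventId key and a timestamp; all paths and column names here are hypothetical:

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, row_number}

object DedupForDruidReindex {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().appName("dedup-for-druid-reindex").getOrCreate()

    // Raw events previously dumped from Kafka (hypothetical path).
    val raw = spark.read.json("hdfs:///data/events/raw/2018-05-01")

    // Keep only the latest record per event id.
    val latestFirst = Window.partitionBy(col("eventId")).orderBy(col("timestamp").desc)
    val deduped = raw
      .withColumn("rn", row_number().over(latestFirst))
      .where(col("rn") === 1)
      .drop("rn")

    // Write where the Druid batch re-index task can pick it up (hypothetical path).
    deduped.write.mode("overwrite").json("hdfs:///data/events/deduped/2018-05-01")
    spark.stop()
  }
}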

Is there a way that I can push historical data into Druid over HTTP?

I have an IoT project and want to use Druid as a time-series DBMS. Sometimes an IoT device may lose the network connection and will re-transfer both historical and real-time data when it reconnects to the server. I know that Druid can ingest real-time data over HTTP push/pull and historical data over HTTP pull or the Kafka indexing service (KIS), but I can't find any documentation about ingesting historical data over HTTP push.
Is there a way that I can send historical data into Druid over HTTP push?
I see a few options here:
Keep pushing historical data to the same Kafka topic (or other streaming source) and reject events based on the message timestamp inside Druid. This simplifies your application architecture and lets Druid handle the rejection of expired events.
Use batch ingestion for the historical data. You push the historical data to another Kafka topic, then run a Spark/Gobblin/any other indexing job to get the data to HDFS, and finally do a batch ingestion onto Druid. But remember that Druid overwrites any real-time segments with batch segments for the specified windowPeriod, so if the historical data is not complete, you run into data loss. To prevent this, you could always pump the real-time data into Hadoop as well, periodically de-duplicate the HDFS data, and ingest that into Druid. As you can see, this is a complicated architecture, but it can result in minimal data loss.
If I were you, I would simplify and send all data to the same streaming source, like Kafka. I would index segments in Druid based on my message's timestamp and not the current time (which I believe is the default).
The recently released Kafka indexing service guarantees exactly-once ingestion.
Refer to the link below: http://druid.io/docs/latest/development/extensions-core/kafka-ingestion.html
If you still want to ingest over HTTP, you can check out Tranquility Server; it has some mechanisms built in for handling duplicates, and an example request is sketched below.
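For completeness, a minimal sketch of an HTTP push to a running Tranquility Server; the host, default port 8200, datasource name, and event fields are assumptions, and note that Tranquility still drops events whose timestamps fall outside the configured windowPeriod, so this does not help with very old historical data:

import java.net.{HttpURLConnection, URL}
import java.nio.charset.StandardCharsets

object TranquilityHttpPush {
  def main(args: Array[String]): Unit = {
    // Tranquility Server's HTTP endpoint: /v1/post/<datasource> (assumed setup).
    val url = new URL("http://localhost:8200/v1/post/pageviews")
    // One JSON event; the timestamp must be inside the windowPeriod or it is rejected.
    val event = """{"timestamp":"2018-05-01T00:00:00Z","page":"home","views":1}"""

    val conn = url.openConnection().asInstanceOf[HttpURLConnection]
    conn.setRequestMethod("POST")
    conn.setRequestProperty("Content-Type", "application/json")
    conn.setDoOutput(true)
    val out = conn.getOutputStream
    out.write(event.getBytes(StandardCharsets.UTF_8))
    out.close()

    // The response reports how many events were received and sent on to Druid.
    println(s"HTTP ${conn.getResponseCode}")
    scala.io.Source.fromInputStream(conn.getInputStream).getLines().foreach(println)
    conn.disconnect()
  }
}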

How to better process huge historical data in a Kafka topic using Spark Streaming

I am experiencing an issue starting Spark Streaming on a really big Kafka topic; there are around 150 million records in this topic already and the topic is growing super fast.
When I tried to start Spark Streaming and read data from the beginning of this topic by setting the Kafka parameter ("auto.offset.reset" -> "smallest"), it always tries to finish processing all 150 million records in the first batch and returns a "java.lang.OutOfMemoryError: GC overhead limit exceeded" error. There isn't a lot of calculation in this Spark Streaming app though.
Is there a way to process the historical data in this topic over the first several batches rather than all in the first batch?
Thanks a bunch in advance!
James
You can control the Spark Kafka input reading rate with the following Spark configuration: spark.streaming.kafka.maxRatePerPartition.
It sets the maximum number of records read per second from each Kafka partition.
sparkConf.set("spark.streaming.kafka.maxRatePerPartition", "<records-per-second>")
With this setting, each batch processes at most <records-per-second> * <number-of-partitions> * <batch-interval-in-seconds> records.
You can find more info about this config in the Spark configuration documentation.
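Putting it together, a minimal sketch of a throttled direct stream; it uses the Kafka 0.8 direct API to match the "smallest" offset reset in the question, and the broker, topic, rate, and batch interval are assumptions to tune for your cluster:

import kafka.serializer.StringDecoder
import org.apache.spark.SparkConf
import org.apache.spark.streaming.kafka.KafkaUtils
import org.apache.spark.streaming.{Seconds, StreamingContext}

object ThrottledHistoryReplay {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("throttled-history-replay")
      // At most 10,000 records/sec per partition: with a 30s batch and,
      // say, 10 partitions, each batch is bounded at 3,000,000 records.
      .set("spark.streaming.kafka.maxRatePerPartition", "10000")
      // Optionally let Spark adapt the rate to the observed processing speed.
      .set("spark.streaming.backpressure.enabled", "true")

    val ssc = new StreamingContext(conf, Seconds(30))
    val kafkaParams = Map(
      "metadata.broker.list" -> "broker1:9092", // placeholder broker
      "auto.offset.reset" -> "smallest")        // start from the beginning of the topic
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, Set("big-topic"))       // placeholder topic

    // Stand-in for the job's real (light) computation.
    stream.map(_._2).count().print()

    ssc.start()
    ssc.awaitTermination()
  }
}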