How can I complete a Druid Task? - apache-kafka

I am importing some data into my Druid datasource. For that, I use NiFi and Tranquility for streaming ingestion with minute granularity (for my tests).
I use Ambari to check all my tasks and their status.
All my data are imported into my datasource correctly, and I can query them with Hive.
When I look at my tasks in Ambari, all of them are running; they never reach "Complete". If I want to complete one of them, I have to kill it, but then I lose my data and the task status is "FAILED".
I would like to understand what I can do to complete my tasks successfully.
Thanks.

I found the problem.
In my Tranquility config, I had declared a large value for the "windowPeriod".
In fact, the task ends automatically once the "windowPeriod" has elapsed after the segment's time interval closes, so a large value keeps tasks in the running state for a long time.
For example, "windowPeriod":"PT10M" means the task keeps running for about 10 more minutes before handing off its segment and completing.

Glad that you figured it out! Just want to call out for anyone reading this that Tranquility is deprecated. Streaming ingestion services such as Kafka ingestion (https://druid.apache.org/docs/latest/development/extensions-core/kafka-ingestion) should be preferred for anyone starting a new deployment.

Related

Spring Batch + Kafka: KafkaItemReader run forever?

I want to build something that monitors a Kafka topic continuously and then executes some batch job when a message comes in (hitting some REST API and storing the response). I set something up with KafkaItemReader; however, it shuts down if it doesn't receive a message for 30 seconds, based on pollTimeout. How can I make it run indefinitely? Since this doesn't seem to be an obvious option, I'm wondering if I am using the right tool for the job.
Likely answer: you are not supposed to do this.
That's correct. Batch processing is about processing finite data sets. If your data source is an infinite stream of records and you want to monitor it continuously, then a streaming solution is more appropriate for your use case.
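For comparison, a streaming-style consumer just keeps polling and treats an empty poll as "nothing yet" rather than as a signal to stop. A minimal sketch with the plain Kafka client (the topic name, group id and the handle() call are placeholders):

    import java.time.Duration;
    import java.util.Collections;
    import java.util.Properties;
    import org.apache.kafka.clients.consumer.ConsumerRecord;
    import org.apache.kafka.clients.consumer.ConsumerRecords;
    import org.apache.kafka.clients.consumer.KafkaConsumer;

    public class TopicMonitor {
        public static void main(String[] args) {
            Properties props = new Properties();
            props.put("bootstrap.servers", "localhost:9092");
            props.put("group.id", "topic-monitor");
            props.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
            props.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

            try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
                consumer.subscribe(Collections.singletonList("my-topic")); // placeholder topic
                while (true) {
                    // An empty poll just means "no new messages yet"; keep looping instead of stopping.
                    ConsumerRecords<String, String> records = consumer.poll(Duration.ofSeconds(1));
                    for (ConsumerRecord<String, String> record : records) {
                        handle(record.value()); // e.g. call the REST API and store the response
                    }
                }
            }
        }

        private static void handle(String payload) {
            // placeholder for the REST call / storage logic
            System.out.println("received: " + payload);
        }
    }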

How to configure druid properly to fire a periodic kill task

I have been trying to get druid to fire a kill task periodically to clean up unused segments.
These are the configuration variables responsible for it
druid.coordinator.kill.on=true
druid.coordinator.kill.period=PT45M
druid.coordinator.kill.durationToRetain=PT45M
druid.coordinator.kill.maxSegments=10
From the above configuration, my mental model is: once ingested data is marked unused, the kill task will fire and delete segments that are older than 45 minutes while retaining 45 minutes' worth of data. period and durationToRetain are the config vars that are confusing me; I'm not quite sure how to leverage them. Any help would be appreciated.
The caveat for druid.coordinator.kill.on=true is that segments are deleted from whitelisted datasources. The whitelist is empty by default.
To populate the whitelist with all datasources, set killAllDataSources to true. Once I did that, the kill task fired as expected and deleted the segments from s3 (COS). This was tested for Druid version 0.18.1.
Now, while the above configuration properties can be set when you build your image, killAllDataSources needs to be set through an API. It can also be set via the Druid UI.
When you click the option, a modal appears that has Kill All Data Sources. Click on True and you should see a kill task (Ingestion ---> Tasks below) firing at the interval specified. It would be really nice to have this as part of runtime.properties or some sort of common configuration file that we could set when building the Druid image.
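For reference, a minimal sketch of flipping that flag through the coordinator's dynamic config API (host, port and the bare-bones payload are illustrative; in practice you would usually GET the current dynamic config and POST it back with killAllDataSources changed, so the other dynamic settings are preserved):

    import java.net.URI;
    import java.net.http.HttpClient;
    import java.net.http.HttpRequest;
    import java.net.http.HttpResponse;

    public class EnableKillAllDataSources {
        public static void main(String[] args) throws Exception {
            // Coordinator host and port are placeholders; adjust for your cluster.
            String url = "http://coordinator-host:8081/druid/coordinator/v1/config";

            // Minimal payload; GET the existing config first and re-POST the full object
            // with this flag flipped so other dynamic settings are not reset to defaults.
            String body = "{\"killAllDataSources\": true}";

            HttpRequest request = HttpRequest.newBuilder()
                    .uri(URI.create(url))
                    .header("Content-Type", "application/json")
                    .POST(HttpRequest.BodyPublishers.ofString(body))
                    .build();

            HttpResponse<String> response = HttpClient.newHttpClient()
                    .send(request, HttpResponse.BodyHandlers.ofString());
            System.out.println(response.statusCode() + " " + response.body());
        }
    }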
Use crontab; it works quite well for us.
If you want control over segment removal from outside Druid, you can use a scheduled task that runs at your desired interval and registers kill tasks in Druid. This gives you more control over your segments, since once they are gone, you cannot recover them. You can use this script to help you:
https://github.com/mostafatalebi/druid-kill-task
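For reference, what such a scheduled job ends up submitting is an ordinary kill task spec POSTed to the Overlord's /druid/indexer/v1/task endpoint; the datasource name and interval below are placeholders:

    {
      "type": "kill",
      "dataSource": "my_datasource",
      "interval": "2019-01-01/2019-02-01"
    }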

Graceful custom activity timeout in data factory

I'm looking to impose a timeout on custom activities in data factory via the policy.timeout property in the activity json.
However, I haven't seen any documentation explaining how the timeout operates on Azure Batch. I assume the batch task is forcibly terminated somehow.
But is the task -> custom activity informed so it can tidy up?
The reason I ask is that I could be mid-copying to data lake store and I neither want to let it run indefinitely nor stop it without some sort of clean up (I can't see a way of doing transactions as such using the data lake store SDK).
I'm considering putting the timeout within the custom activity, but it would be a shame to have timeouts defined at 2 different levels (I'd probably still want the overall timeout).
I feel your pain.
ADF simply terminates the activity when its own timeout is reached, regardless of what state the invoked service is in.
I have the same issue with U-SQL processing calls. It takes a lot of proactive monitoring via PowerShell to ensure Data Lake or Batch jobs have enough compute to complete jobs with naturally increasing data volumes before the ADF timeout kill occurs.
I'm not aware of any graceful way for ADF to handle this because it would differ for each activity type.
Time to create another feedback article for Microsoft!

Apache Flink: changing state parameters at runtime from outside

I'm currently working on a streaming ML pipeline and need exactly-once event processing. I was interested in Flink, but I'm wondering if there is any way to alter/update the execution state from outside.
The ML algorithm state is kept by Flink and that's fine, but since I'd like to change some execution parameters at runtime, I cannot find a viable solution. Basically, an external webapp (in Go) is used to tune the parameters, and the changes should be reflected in Flink for subsequent events.
I thought about:
a shared Redis with pub/sub (as polling for each event would kill throughput)
writing a custom solution in Go :D
...
The state would be kept by key, related to the source of one of the multiple event streams coming in from Kafka.
Thanks
You could use a CoMapFunction/CoFlatMapFunction to achieve what you described. One of the inputs is the normal data input, and on the other input you receive state-changing commands. The commands are most easily ingested via a dedicated Kafka topic.
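A minimal sketch of that pattern (the tuple types, key fields, default value, and the surrounding Kafka sources are placeholders; the latest parameter is held per key in a ValueState):

    import org.apache.flink.api.common.state.ValueState;
    import org.apache.flink.api.common.state.ValueStateDescriptor;
    import org.apache.flink.api.java.tuple.Tuple2;
    import org.apache.flink.configuration.Configuration;
    import org.apache.flink.streaming.api.functions.co.RichCoFlatMapFunction;
    import org.apache.flink.util.Collector;

    // Input 1: data events as (key, value); input 2: tuning commands as (key, newParameter).
    public class ParameterizedScorer
            extends RichCoFlatMapFunction<Tuple2<String, Double>, Tuple2<String, Double>, String> {

        private transient ValueState<Double> parameter;

        @Override
        public void open(Configuration config) {
            parameter = getRuntimeContext().getState(
                    new ValueStateDescriptor<>("parameter", Double.class));
        }

        // Normal event stream: score each event against the current parameter for its key.
        @Override
        public void flatMap1(Tuple2<String, Double> event, Collector<String> out) throws Exception {
            Double p = parameter.value();
            double threshold = (p != null) ? p : 0.5; // default until the first command arrives
            if (event.f1 > threshold) {
                out.collect(event.f0 + " exceeded " + threshold);
            }
        }

        // Command stream (e.g. from the dedicated Kafka topic): update the parameter for this key.
        @Override
        public void flatMap2(Tuple2<String, Double> command, Collector<String> out) throws Exception {
            parameter.update(command.f1);
        }
    }

The two streams are then connected and keyed on the same field, e.g. events.connect(commands).keyBy(e -> e.f0, c -> c.f0).flatMap(new ParameterizedScorer()), so the parameter state lives alongside the per-key algorithm state.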

how to configure MessageChannelPartitionHandler to poll partition result using job repo

Does anyone know how to configure MessageChannelPartitionHandler to poll partition results from the database instead of aggregating the JMS reply from each slave?
I am trying to use this because the current aggregator gives an out-of-memory error after the job has been running for a while. It looks like the master node is not releasing memory.
After looking at the code of MessageChannelPartitionHandler, all we need is a DataSource and a JobExplorer reference. You can also configure the polling interval; if you don't provide one, the default is 10 seconds.
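A minimal sketch of that wiring in Java config (the step name, grid size and the MessagingTemplate bean are placeholders for whatever your remote-partitioning setup already defines):

    import javax.sql.DataSource;
    import org.springframework.batch.core.explore.JobExplorer;
    import org.springframework.batch.integration.partition.MessageChannelPartitionHandler;
    import org.springframework.context.annotation.Bean;
    import org.springframework.context.annotation.Configuration;
    import org.springframework.integration.core.MessagingTemplate;

    @Configuration
    public class PartitionHandlerConfig {

        @Bean
        public MessageChannelPartitionHandler partitionHandler(MessagingTemplate messagingTemplate,
                                                               DataSource dataSource,
                                                               JobExplorer jobExplorer) {
            MessageChannelPartitionHandler handler = new MessageChannelPartitionHandler();
            handler.setStepName("workerStep"); // placeholder step name
            handler.setGridSize(4);            // placeholder grid size
            handler.setMessagingOperations(messagingTemplate);

            // With a DataSource/JobExplorer set, the handler polls the job repository for the
            // partitioned step executions instead of aggregating the replies from each slave.
            handler.setDataSource(dataSource);
            handler.setJobExplorer(jobExplorer);
            handler.setPollInterval(10000); // milliseconds; 10 seconds is also the default

            return handler;
        }
    }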