Reindex from Druid - task not returning data

We have a supervisor reading stream data from Kafka with one-minute granularity.
Every 15 minutes, we run a “Reindex from Druid” task that aggregates the previous 15 minutes of data at fifteen-minute granularity and writes it into a new datasource.
The task completes successfully, but no data is written to the target datasource,
even though the data is present in the one-minute granularity datasource.
If we wait for an hour and rerun the same task, the fifteen-minute datasource is populated correctly.
Is there any configuration we can add to the reindex-from-Druid task so that it picks up the newly inserted data?
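For reference, each run computes the previous 15-minute window roughly like the sketch below (illustrative only, not our actual scheduling code):

import java.time.ZoneOffset;
import java.time.ZonedDateTime;
import java.time.temporal.ChronoUnit;

public class ReindexWindow {

    public static void main(String[] args) {
        // Previous fully elapsed 15-minute window, e.g. 10:00Z/10:15Z when run at 10:17Z.
        ZonedDateTime now = ZonedDateTime.now(ZoneOffset.UTC);
        ZonedDateTime end = now.truncatedTo(ChronoUnit.HOURS)
                .plusMinutes((now.getMinute() / 15) * 15);
        ZonedDateTime start = end.minusMinutes(15);

        // ISO-8601 interval string of the form start/end that we pass to the reindex task spec.
        String interval = start.toInstant() + "/" + end.toInstant();
        System.out.println(interval);
    }
}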

Related

Future partitions in Postgres 11 on an OLAP DB - job never finishes

My cron job is scheduled to run daily and creates partitions for the next 15 days in Postgres 11.
Because of heavy SELECT queries hitting the table 24/7, the cron job never completes in time (I have set a 10-second timeout, otherwise it blocks everything).
I know this is solved in Postgres 12, but what is the smart way to handle it in Postgres 11?
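For context, the daily job does roughly the following (the events table, daily range partitioning, and connection details below are placeholders, not the real schema):

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.Statement;
import java.time.LocalDate;

public class PartitionMaintenance {

    public static void main(String[] args) throws Exception {
        try (Connection conn = DriverManager.getConnection(
                "jdbc:postgresql://localhost:5432/olap", "app", "secret");
             Statement stmt = conn.createStatement()) {

            // Fail fast instead of queueing behind the long-running SELECTs.
            stmt.execute("SET lock_timeout = '10s'");

            // Create partitions for the next 15 days.
            LocalDate first = LocalDate.now().plusDays(1);
            for (int i = 0; i < 15; i++) {
                LocalDate day = first.plusDays(i);
                String partition = "events_" + day.toString().replace('-', '_');
                stmt.execute(String.format(
                        "CREATE TABLE IF NOT EXISTS %s PARTITION OF events "
                                + "FOR VALUES FROM ('%s') TO ('%s')",
                        partition, day, day.plusDays(1)));
            }
        }
    }
}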

Scheduled processing using Spring Batch

We have a requirement to process millions of records using Spring Batch. The plan is to read the database with JdbcPagingItemReaderBuilder, process the records in chunks, and write them to a Kafka queue. The active consumers of the queue will then process the chunks of data and update the database.
Each consumer task iterates over every item in the chunk and invokes external APIs.
If the external system is down or does not respond with a success response, there should be at least 3 retries; considering that each item in the chunk has to do this, what would be the ideal approach?
Another use case to consider: what happens when the job is running and the system goes down? Say the job has already processed 10,000 records and the remaining records are yet to be processed. After a restart, how do we make sure the execution does not start the entire process from the beginning but resumes from the point of failure?
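For the retry requirement, one option is to wrap each external call in Spring Retry's RetryTemplate; here is a rough sketch (WorkItem and ExternalApiClient are placeholder names, not types from the post):

import java.util.List;
import org.springframework.retry.support.RetryTemplate;

public class ChunkConsumer {

    // Placeholder stand-ins for the real domain type and external API client.
    record WorkItem(long id) { }
    interface ExternalApiClient { String call(WorkItem item); }

    private final ExternalApiClient apiClient;

    public ChunkConsumer(ExternalApiClient apiClient) {
        this.apiClient = apiClient;
    }

    public void process(List<WorkItem> chunk) {
        // The initial call plus up to 3 retries, spaced 2 seconds apart.
        RetryTemplate retryTemplate = RetryTemplate.builder()
                .maxAttempts(4)
                .fixedBackoff(2000)
                .build();

        for (WorkItem item : chunk) {
            retryTemplate.execute(context -> apiClient.call(item));
        }
    }
}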
Spring Batch creates the following tables. You can use them to check the status of your job and customize your scheduler to behave in whatever way you see fit.
I'd use the step execution ID in BATCH_STEP_EXECUTION to check the status that is set, and then retry based on that status, or something along those lines.
BATCH_JOB_EXECUTION
BATCH_JOB_EXECUTION_CONTEXT
BATCH_JOB_EXECUTION_PARAMS
BATCH_JOB_INSTANCE
BATCH_STEP_EXECUTION
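A rough sketch of how that check could look using Spring Batch's JobExplorer instead of querying the tables by hand (the job name recordProcessingJob is just a placeholder):

import java.util.List;
import org.springframework.batch.core.BatchStatus;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.JobInstance;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobLauncher;

public class JobRestarter {

    private final JobExplorer jobExplorer;
    private final JobLauncher jobLauncher;
    private final Job job;

    public JobRestarter(JobExplorer jobExplorer, JobLauncher jobLauncher, Job job) {
        this.jobExplorer = jobExplorer;
        this.jobLauncher = jobLauncher;
        this.job = job;
    }

    public void restartIfFailed() throws Exception {
        // Most recent instance of the job, backed by BATCH_JOB_INSTANCE.
        List<JobInstance> instances = jobExplorer.getJobInstances("recordProcessingJob", 0, 1);
        if (instances.isEmpty()) {
            return;
        }
        // Executions for that instance, backed by BATCH_JOB_EXECUTION / BATCH_STEP_EXECUTION.
        List<JobExecution> executions = jobExplorer.getJobExecutions(instances.get(0));
        JobExecution latest = executions.get(0);
        if (latest.getStatus() == BatchStatus.FAILED) {
            // Relaunching with the same parameters restarts the failed execution; a stateful
            // reader such as JdbcPagingItemReader then resumes from the last committed chunk.
            jobLauncher.run(job, latest.getJobParameters());
        }
    }
}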

Kafka consumer configuration to fetch at an interval

I am new to Kafka and am trying to understand the various configuration properties I need to set for the requirement below.
I have a Kafka consumer that is expected to fetch 'n' records at a time, and only after those records have been processed successfully should the next fetch happen.
Example: if my consumer fetches 10 records at a time and every record takes 5 seconds to process, then the next fetch request should be executed after 50 seconds, and so on.
Considering the above example, could anyone let me know what the values for the configs below should be?
Below is my current configuration. After processing 10 records, it doesn't wait for a minute as I configured; it keeps fetching and polling.
fetchMaxBytes=50000 //approx size for 10 records
fetchWaitMaxMs=60000 //wait for a minute before a next fetch
requestTimeoutMs= //default value
heartbeatIntervalMs=1000 //acknowledgement to avoid rebalancing
maxPollIntervalMs=60000 //assuming the processing takes one minute
maxPollRecords=10 //we need 10 records to be polled at once
sessionTimeoutMs= //default value
I am using the camel-kafka component to implement this.
It would be really great if someone could help me with this. Thanks much.
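For reference, the route is set up roughly like the sketch below (broker, topic and group id are placeholders; the other properties from the list above are set the same way on the endpoint URI):

import org.apache.camel.builder.RouteBuilder;

public class TenRecordsPerPollRoute extends RouteBuilder {

    @Override
    public void configure() {
        from("kafka:reports?brokers=localhost:9092"
                + "&groupId=report-consumers"
                + "&maxPollRecords=10"          // poll at most 10 records at a time
                + "&maxPollIntervalMs=60000")   // allow up to a minute of processing between polls
            // Each record arrives as its own exchange; the 5-second processing would replace this log step.
            .to("log:processed?showBody=true");
    }
}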

How to run a Spring Batch job only after the currently running job completes

I have a list of records to process via a Spring Batch job. Each record has millions of data points; I want to process the records one after another, otherwise the database will not handle the load.
Data will be like this:
artworkList will contain 10 records, and each artwork record will contain 30 million data points.
I am using Spring Batch with the Quartz scheduler.
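One way this is sometimes handled (not from the post, just a sketch) is to have the Quartz job check whether the previous execution is still running before launching the next one; the job name artworkJob and the wiring of the fields are placeholders:

import org.quartz.JobExecutionContext;
import org.quartz.JobExecutionException;
import org.springframework.batch.core.Job;
import org.springframework.batch.core.JobParametersBuilder;
import org.springframework.batch.core.explore.JobExplorer;
import org.springframework.batch.core.launch.JobLauncher;

public class SequentialArtworkJob implements org.quartz.Job {

    // In a real setup these would be injected by the scheduler configuration.
    private JobLauncher jobLauncher;
    private JobExplorer jobExplorer;
    private Job artworkJob;

    @Override
    public void execute(JobExecutionContext context) throws JobExecutionException {
        try {
            // Skip this trigger while a previous run is still active, so only one
            // artwork record's data is loaded into the database at a time.
            if (!jobExplorer.findRunningJobExecutions("artworkJob").isEmpty()) {
                return;
            }
            jobLauncher.run(artworkJob, new JobParametersBuilder()
                    .addLong("run.ts", System.currentTimeMillis())   // unique parameters per launch
                    .toJobParameters());
        } catch (Exception e) {
            throw new JobExecutionException(e);
        }
    }
}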

Handling embargoed content scenario in MarkLogic

I have a MarkLogic 7 database into which several documents are inserted, and every document has its own created-on and released-on timestamps. For example, if a document is inserted into the database at 1400 hrs and its released-on value is 1700 hrs, then I need to POST this document to an external REST service at 1700 hrs.
I have tried the following options:
Configure a CPF pipeline such that whenever a document is inserted, its released-on value is read and a Scheduled Task is created to trigger at the timestamp read from released-on.
Following are the queries/observations for this approach:
Since the admin configuration manipulation APIs are not transactionally protected operations, I need to force a lock on some URI in order to create Scheduled Tasks from within CPF action modules running in parallel.
For details read here
When I insert 1000 documents, it takes around 20 minutes for the CPF action modules to trigger and create 1000 scheduled tasks based on the released-on values read from the inserted documents.
How can I pass the URI of the document that triggered the CPF action module to the Scheduled Task that was created from within that action module based on the document's released-on value?
Configure a CPF pipeline such that whenever a document is inserted, its released-on value is read and xdmp:sleep() is called with the number of milliseconds remaining between the current dateTime and the value of released-on in the document.
Following are the queries/observations for this approach:
The Task Server threads on which the CPF action modules are triggered remain occupied and are not released while xdmp:sleep() runs, so at any given time the CPF action module is triggered for at most 16 documents and the others remain queued.
Is there any way to configure the sleeping thread to become inactive, letting other queued action modules get triggered, and then become active again once the sleep duration has elapsed?
Configure a multi-step CPF pipeline as described here in which the document keeps bouncing between two states until the released-on timestamp has arrived.
Following are the queries/observations for this approach:
Even when only 30 documents were inserted, CPU utilization was observed to be 100%.
In all of these attempts, a lot of system resources (CPU and RAM) are consumed even for as few as 1000 documents. I need to find an approach that can handle even 100K documents.
Please let me know if there are any improvements that can be made to the approaches mentioned above, or whether MarkLogic provides some other way to handle such scenarios efficiently.
Rather than CPF, you could set up a scheduled job that runs, say, every 10 minutes and looks for documents that are ready to be published. That job would look for documents with released-on values between the last time the job ran (which I would save in the database) and fn:current-dateTime().
For each of those documents, you would spawn a task to POST the document, so that an error in one doesn't cause problems for the others. After looping through, save the current time in the database for the next time.
The 10-minute window can be as large or small as you like, depending on your tolerance for a little delay.
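The same windowing idea, sketched in Java only to show the shape of the logic (in MarkLogic this would normally live in a scheduled task module; DocumentStore and Publisher are hypothetical stand-ins for the released-on query and the REST POST):

import java.time.Instant;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ReleaseWindowPublisher {

    // Hypothetical collaborators standing in for the database query and the REST call.
    interface DocumentStore {
        Instant loadLastRun();                                        // last-run timestamp saved in the database
        void saveLastRun(Instant lastRun);
        List<String> findReleasedBetween(Instant from, Instant to);  // URIs with released-on inside the window
    }

    interface Publisher {
        void post(String uri);                                        // POST one document to the external service
    }

    private final DocumentStore store;
    private final Publisher publisher;
    private final ExecutorService pool = Executors.newFixedThreadPool(4);

    public ReleaseWindowPublisher(DocumentStore store, Publisher publisher) {
        this.store = store;
        this.publisher = publisher;
    }

    // Invoked by the scheduler, e.g. every 10 minutes.
    public void run() {
        Instant from = store.loadLastRun();
        Instant to = Instant.now();
        for (String uri : store.findReleasedBetween(from, to)) {
            // One task per document, so a failing POST doesn't affect the others.
            pool.submit(() -> publisher.post(uri));
        }
        store.saveLastRun(to);
    }
}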