Handling embargoed content scenario in MarkLogic - scheduled-tasks

I have a MarkLogic 7 database into which several documents are inserted, and every document has its own created-on and released-on values. For example, if a document is inserted into the database at 1400 hrs and its released-on value is 1700 hrs, then I need to POST this document to an external REST service at 1700 hrs.
I have tried the following options:
Configure a CPF pipeline such that whenever a document is inserted, its released-on value is read and a Scheduled Task is created to trigger at the timestamp read from released-on.
Following are the queries/ observations for this approach:
Since the admin configuration manipulation APIs are not transactionally protected operations, I need to force a lock on some URI in order to create Scheduled Tasks from within CPF action modules running in parallel.
For details, read here.
When I insert 1000 documents it takes around 20 minutes for the CPF action modules to trigger and create 1000 scheduled tasks based on the released-on value read from the inserted document.
How can I pass the URI of the document that triggered the CPF action module to the Scheduled Task that was created from within that action module based on the document's released-on value?
Configure a CPF pipeline such that whenever a document is inserted, its released-on value is read and xdmp:sleep() is called with the number of milliseconds remaining between the current dateTime and the document's released-on value.
Following are the queries/ observations for this approach:
The Task Server threads on which the CPF action modules run remain occupied and are not released while xdmp:sleep() is executing, so at any given time the action module is running for at most 16 documents and the rest remain queued.
Is there any way to make the sleeping thread go inactive so that other queued action modules can run, and have it become active again once the sleep duration has elapsed?
Configure a multi-step CPF pipeline as described here, in which the document keeps bouncing between two states until its released-on timestamp has arrived.
Following are the queries/ observations for this approach:
Even when only 30 documents were inserted, CPU utilization was observed to be 100%.
In all of these attempts a lot of system resources (CPU and RAM) are consumed even for as few as 1000 documents. I need an approach that can handle even 100K documents.
Please let me know if there are any improvements that can be made to the approaches above, or whether MarkLogic provides some other way to handle such scenarios efficiently.

Rather than CPF, you could set up a scheduled job that will run, say, every 10 minutes and look for documents that are ready to be published. That job would look for documents with released-on values between fn:current-dateTime() and the last time the job ran, which I would save in the database.
For each of those documents, you would spawn a task to POST the document, so that an error in one doesn't cause problems for the others. After looping through, save the current time in the database for the next time.
The 10-minute window can be as large or small as you like, depending on your tolerance for a little delay.
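A minimal sketch of that control flow in Java follows. It is deliberately not MarkLogic-specific: the helper methods (loadLastRunTime, findDocsReleasedBetween, spawnPostTask, saveLastRunTime) are hypothetical stand-ins for the actual MarkLogic queries and spawned tasks, and only the shape of the logic matters: read the checkpoint, query the released-on window, spawn one POST task per document, then persist the new checkpoint.

```java
import java.time.Instant;
import java.util.List;

// Hypothetical sketch of the scheduled job's control flow; the helper methods
// below stand in for the real MarkLogic queries/spawns and are not actual API calls.
public class ReleaseEmbargoedDocsJob {

    public void run() {
        Instant lastRun = loadLastRunTime();     // checkpoint stored in the database
        Instant now = Instant.now();

        // Documents whose released-on falls inside the (lastRun, now] window are due.
        List<String> dueUris = findDocsReleasedBetween(lastRun, now);

        for (String uri : dueUris) {
            // One spawned task per document, so a failing POST does not block the others.
            spawnPostTask(uri);
        }

        saveLastRunTime(now);                    // the next run picks up where this one stopped
    }

    // --- hypothetical helpers, to be backed by MarkLogic queries and tasks ---
    Instant loadLastRunTime() { throw new UnsupportedOperationException("stub"); }
    List<String> findDocsReleasedBetween(Instant from, Instant to) { throw new UnsupportedOperationException("stub"); }
    void spawnPostTask(String uri) { throw new UnsupportedOperationException("stub"); }
    void saveLastRunTime(Instant when) { throw new UnsupportedOperationException("stub"); }
}
```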

Related

Resource utilization as per schedule in AnyLogic

I am working on a car wash model and I am running it for 100 hours, but I want to store the data periodically (after every 10 hours). How can we do that? For example, for every service block I have a resource pool, and I want to see the utilization of that resource pool every 10 hours.
Create a cyclic event that triggers every 10 hrs and writes the data as you need it.
One simple way to print to the console: traceln(myResourcePool.utilization())
If you want to write to the database, check this help article.
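For illustration, the Action of such a cyclic event could look something like the line below (a hedged sketch; the pool name washPool is an assumption, and time() is AnyLogic's current model time in the model's time units):

```java
// Cyclic Event: Trigger type = Timeout, Mode = Cyclic, Recurrence = 10 hours.
// "washPool" is a hypothetical ResourcePool name in this model.
traceln("model time = " + time() + " h, washPool utilization = " + washPool.utilization());
```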

Quarkus Scheduled Records Processing mechanism Best Practice

What is the best practice or approach for processing records from a DB on a schedule?
Situation:
A Microservice based on Quarkus - responsible for sending a communication to customers.
A DB table holding customer records (100,000 customers)
Microservice is running on multiple nodes (4 nodes)
Expectation:
There should be a scheduler that runs every 5 seconds.
It fetches the records from the DB where employee status = pending.
It should be a multithreaded architecture.
It sends an email to the employee's email address.
Problem 1:
The same scheduler running on multiple nodes picks up the same records and processes them. How can we avoid this?
Problem 2:
The scheduler picks up 100 records and processing them takes more than 5 seconds, so when the scheduler runs again it picks up some of the same records. How can we avoid that?
If you are planning to run your microservices on Kubernetes, I would suggest using an external component as the scheduler and letting it distribute the work over your microservices using messages or HTTP invocations.
As responses to your questions:
You can use a locking strategy to "reserve" each row, including a field that indicates that the record is being processed and excluding all records containing this flag from your query. This means that when the scheduler fires it will read a set of rows that are not reserved and use a multithreaded approach to process the records; by using a locking strategy (pessimistic or optimistic) you prevent other workers from marking the same row as reserved for themselves. The thread that was able to commit the reservation then processes the record and updates its state or releases the "reserve" so other workers can work on the record if needed. A rough sketch of this reserve-then-process pattern is shown after the two approaches below.
You can always instruct your scheduler not to execute if there is still an execution in progress:
@Scheduled(identity = "ProcessUpdateScheduler", every = "2s", concurrentExecution = Scheduled.ConcurrentExecution.SKIP)
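For context, a minimal Quarkus bean using that annotation could look like the sketch below (the class and method names are made up, and the import of jakarta.enterprise.context.ApplicationScoped assumes a recent Quarkus version; older versions use the javax package):

```java
import io.quarkus.scheduler.Scheduled;
import jakarta.enterprise.context.ApplicationScoped;

@ApplicationScoped
public class PendingCommunicationJob {

    // SKIP tells the scheduler to drop a trigger while the previous execution is
    // still running, so a slow batch cannot overlap with the next run on this node.
    @Scheduled(identity = "ProcessUpdateScheduler", every = "5s",
               concurrentExecution = Scheduled.ConcurrentExecution.SKIP)
    void processPendingCustomers() {
        // fetch pending rows, reserve them, and send the emails here
    }
}
```

Note that SKIP only prevents overlapping runs within a single node; the row reservation described above is still needed to keep the four nodes from picking the same records.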
You mainly have two approaches among other possible ones:
Pulling (distributed mining or work distribution): Each instance of the microservice picks a random pending row and marks it as "processing", committing the transaction. If it is able to commit, this instance holds the right to process the record and continues with its execution; if not, it tries to retrieve a different row or simply exits and waits for the next invocation. This approach scales horizontally because adding more workers increases your processing throughput.
Pushing (central distribution, distributed processing): You have two kinds of components. First, the "Distributor", which is executed by the scheduler and is responsible for picking rows to be processed and marking them as "processing pending"; these rows are then forwarded via a messaging system or HTTP call to the "Processor". The Processor component receives a record as input and is responsible for processing that record completely or releasing the "processing pending" hold.
Choose the one best suited to your scenario. If you go for the second option, you can have one or more Distributors if necessary, but in order to increase your processing throughput you only need to scale the "Processor" workers.
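As promised above, here is a rough JDBC sketch of the reserve-then-process pattern used by the pulling approach. The table and column names (customer, status, reserved_by) are assumptions, not your schema; the key point is that the conditional UPDATE succeeds for exactly one worker per row, so only the worker that sees an update count of 1 goes on to process the record.

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;

public class PendingRecordWorker {

    // Hypothetical schema: customer(id, email, status, reserved_by)
    // with status in ('PENDING', 'PROCESSING', 'DONE').
    // Returns true only if this worker reserved the row; competing workers update 0 rows.
    boolean tryReserve(Connection con, long customerId, String workerId) throws SQLException {
        String sql = "UPDATE customer SET status = 'PROCESSING', reserved_by = ? "
                   + "WHERE id = ? AND status = 'PENDING'";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setString(1, workerId);
            ps.setLong(2, customerId);
            return ps.executeUpdate() == 1;   // 1 = we own it, 0 = another node got there first
        }
    }

    void process(Connection con, long customerId, String workerId) throws SQLException {
        if (!tryReserve(con, customerId, workerId)) {
            return;   // reserved by another node; move on to a different pending row
        }
        // send the email here, then mark the row DONE (or set it back to PENDING on failure)
        try (PreparedStatement ps = con.prepareStatement(
                "UPDATE customer SET status = 'DONE' WHERE id = ?")) {
            ps.setLong(1, customerId);
            ps.executeUpdate();
        }
    }
}
```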

Airflow limit daily trigger

there is a "natural" ( I mean thought parameter) way to limit the number of triggering a dag (let say every 24 hours).
I don't want to schedule it, but some user can trigger the same dag multiple time, and for resources and others reason, I want it only once .
As I see "depends_on_past" depend only against the previous run, but it could be many time a day.
Thx
Not directly, but you could likely implement task_instance_mutation_hook for the first task of the DAG; it could then immediately fail the task if it detects that the DAG has already run that day.
https://airflow.apache.org/docs/apache-airflow/stable/concepts/cluster-policies.html#task-instance-mutation

RDBMS Event-Store: Ensure ordering (single threaded writer)

Short description about the setup:
I'm trying to implement a "basic" event store / event-sourcing application using an RDBMS (in my case Postgres). The events are general-purpose events with only some basic fields like eventtime, location, and action, formatted as XML. Due to this general structure, there is no way of partitioning them in a useful way. The events are captured via a Java application that validates the events and then stores them in an events table. Each event gets a uuid and recordtime when it is captured.
In addition, there can be subscriptions from external applications, which should get all events matching custom criteria. When a new matching event is captured, the event should be PUSHED to the subscriber. To ensure that the subscriber does not miss any event, I'm currently forcing the capture process to be single-threaded. When a new event comes in, a lock is set, the event gets a recordtime assigned to the current time, and the event is finally inserted into the DB table (explicitly waiting for the commit). Then the lock is released. For a subscription which runs on a schedule, for example every 5 seconds, I track the recordtime of the last sent event and execute a query for new events like where recordtime > subscription_recordtime. When the matching events are successfully pushed to the subscriber, the subscription_recordtime is set to the events' max recordtime.
Everything is actually working, but as you can imagine, a single-threaded capture process does not scale very well. Thus the main question is: how can I optimise this and allow, for example, multiple capture processes to run in parallel?
I already thought about setting the recordtime in the DB itself on insert, but since the order of commits cannot be guaranteed (JVM pauses), I think I might lose events when two capture transactions run at nearly the same time. If I understand the DB-generated timestamp correctly, it is set before the actual commit. Thus a transaction with a recordtime t2 can already be visible to the subscription query while another transaction with a recordtime t1 (t1 < t2) is still ongoing and has not been committed. The recordtime for the subscription would be set to t2 and so the event from transaction 1 would be lost...
Is there a way to guarantee the order on the DB level, so that events are visible in the order they are captured/committed? Every newly visible event must have a later timestamp than the event before it (strictly monotonically increasing). I know about a full table lock, but I think then I would have the same performance penalties as before.
Is it possible to set the DB to use a single-threaded writer? Then each capture process would also be waiting for the other write TX to finish, but at the DB level, which would be much better than a single-instance/single-threaded capture application. Or can I use a different field/id for tracking the current state? Normal sequence ids suffer for the same reasons.
Is there a way to guarantee the order on a DB level, so that events are visible in the order they are captured/ committed?
You should not be concerned with global ordering of events. Your events should contain a Version property. When writing events, you should always be inserting monotonically increasing Version numbers for a given Aggregate/Stream ID. That really is the only ordering that should matter when you are inserting. For Customer ABC, with events 1, 2, 3, and 4, you should only write event 5.
A database transaction can ensure the correct order within a stream using the rules above.
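As a concrete illustration of "for Customer ABC with events 1-4 you should only write event 5", a unique key on (stream_id, version) lets the database itself reject a concurrent writer. The following is a hedged JDBC sketch against an assumed events table, not your actual schema:

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.SQLException;
import java.util.UUID;

public class EventAppender {

    // Assumed table: events(id bigserial primary key, stream_id uuid, version int, payload text,
    //                       recordtime timestamptz default now(), unique (stream_id, version)).
    // Returns false if another writer already appended this version for the stream.
    boolean append(Connection con, UUID streamId, int nextVersion, String payloadXml) throws SQLException {
        String sql = "INSERT INTO events (stream_id, version, payload) VALUES (?, ?, ?)";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setObject(1, streamId);
            ps.setInt(2, nextVersion);
            ps.setString(3, payloadXml);
            ps.executeUpdate();
            return true;
        } catch (SQLException e) {
            if ("23505".equals(e.getSQLState())) {   // Postgres unique_violation: concurrent append
                return false;                        // caller can reload the stream and retry
            }
            throw e;
        }
    }
}
```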
For a subscription which runs scheduled for example every 5 seconds, I track the recordtime of the last sent event, and execute a query for new events like where recordtime > subscription_recordtime.
Reading events is a slightly different story. Firstly, you will likely have a serial column to uniquely identify events. That will give you ordering and allow you to determine whether you have read all events. When you read events from the store, you may detect a gap in the sequence. This will happen if an insert was in flight when you read the latest events. In this case, simply re-read the data and see if the gap is gone. This requires your subscription to maintain its position in the index. Alternatively or additionally, you can read only events that are at least N milliseconds old, where N is a threshold high enough to compensate for delays in transactions (e.g. 500 or 1000).
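A sketch of that gap check against a serial id column might look like the following; lastSeenId is the subscription's stored position, and the "return nothing and poll again" behavior implements the re-read advice above (table and column names are assumptions):

```java
import java.sql.Connection;
import java.sql.PreparedStatement;
import java.sql.ResultSet;
import java.sql.SQLException;
import java.util.ArrayList;
import java.util.List;

public class SubscriptionReader {

    // Assumed table: events(id bigserial primary key, ...). Returns the new ids only when they form
    // a contiguous run after lastSeenId; an empty list means "gap seen, poll again shortly".
    List<Long> readContiguousNewIds(Connection con, long lastSeenId) throws SQLException {
        List<Long> ids = new ArrayList<>();
        String sql = "SELECT id FROM events WHERE id > ? ORDER BY id";
        try (PreparedStatement ps = con.prepareStatement(sql)) {
            ps.setLong(1, lastSeenId);
            try (ResultSet rs = ps.executeQuery()) {
                while (rs.next()) {
                    ids.add(rs.getLong(1));
                }
            }
        }
        long expected = lastSeenId + 1;
        for (long id : ids) {
            if (id != expected) {
                return List.of();   // an insert was in flight (or an id was skipped); re-read later
            }
            expected++;
        }
        return ids;
    }
}
```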
Also, bear in mind that there are open source RDBMS event stores that you can either use or leverage in your process.
Marten: http://jasperfx.github.io/marten/documentation/events/
SqlStreamStore: https://github.com/SQLStreamStore/SQLStreamStore

SQL Agent Job runtime alert

I was hoping I could get some help on how I can set up an e-mail alert for a specific agent job, such that it sends an e-mail alert when the run duration exceeds 30 minutes.
Would it be easier to add this step in the job itself? Are there any available methods in the SQL Agent GUI, or do I have to create a new job? I figured creating a new job is less likely, as I would have to query sysjobhistory in msdb; that value is only updated once the job finishes, so that doesn't help... I need it to check the real-time duration of one specific agent job as it's running...
Specifically because it happens that the job runs into a deadlock (that's no longer an issue now), so the job just stays stuck on the table it's locked on, and I only get a notification from the end user that the report doesn't return results :S
The best method outside of 3rd-party monitoring software is to create a high-frequency SQL Agent job that runs a query on active sessions (returned by something like sp_who) for the duration of the spids. This way you can have this monitoring job email you whenever a spid goes over a threshold. Alternatively, you could have it compare the current runtime against a calculated average runtime gleaned from the sysjobhistory table in msdb.