Nested scheduling support in Quartz

I have a business requirement with the following scheduling pattern:
----t1--------ta-------tb---------t2
Between t1 and t2, give a 10% discount on product A.
However, for the nested time window ta - tb, give a discount of 20%.
When tb is reached, go back to the 10% discount on product A until t2.
Can Quartz job scheduling implement this out of the box?
I want to avoid scheduling 3 jobs here - for the intervals (t1, ta), (ta, tb) and (tb, t2).

Quartz is a generic Java scheduling API and as such it does not come with any application-specific business logic "out of the box". The way I would solve the above requirement with Quartz is like so:
Create a generic ProductPriceUpdaterJob Quartz job that will simply update the product price stored in your product store (typically a database). The job would expect a single job data map parameter "discount" with the discount percentage figure (i.e. 0, 10, 20).
Associate the job with 4 Quartz triggers (T1, Ta, Tb, T2) that start your job at t1, ta, tb and t2 respectively. These triggers would specify the desired discount amount in their job data map (T1 has discount=10, Ta has discount=20, Tb has discount=10, T2 has discount=0).
Start Quartz and register the job and triggers with it and you are done.
At t1, Quartz starts your job using trigger T1 and the job applies the 10% discount to the product price. At ta, Quartz starts your job using trigger Ta and your job applies the 20% discount to the product price etc.
Quartz supports 4 different trigger types and I think you can safely use the CronTrigger type for your triggers.
You will probably want to use another job data map parameter in your triggers where you can specify the ID (or IDs) of the product(s) to apply the discount to. This way, your job will be truly generic and usable with all your products.
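For illustration, here is a minimal sketch of this setup. The class, parameter, and trigger names are just examples, and I use one-shot triggers (fire once at a date) for brevity; CronTrigger works equally well for fixed dates:

import java.util.Date;
import java.util.Set;
import org.quartz.*;
import org.quartz.impl.StdSchedulerFactory;

public class ProductPriceUpdaterJob implements Job {
    @Override
    public void execute(JobExecutionContext ctx) throws JobExecutionException {
        JobDataMap data = ctx.getMergedJobDataMap(); // merges job + trigger data maps
        int discount = data.getInt("discount");
        String productId = data.getString("productId");
        // Update the price in your product store (database call goes here).
        System.out.printf("Applying %d%% discount to product %s%n", discount, productId);
    }

    public static void main(String[] args) throws SchedulerException {
        JobDetail job = JobBuilder.newJob(ProductPriceUpdaterJob.class)
                .withIdentity("priceUpdater")
                .build();

        // t1/ta/tb/t2 would be your real dates; offsets from "now" are used here.
        long now = System.currentTimeMillis();
        Trigger t1 = discountTrigger("T1", new Date(now + 1_000), 10);
        Trigger ta = discountTrigger("Ta", new Date(now + 2_000), 20);
        Trigger tb = discountTrigger("Tb", new Date(now + 3_000), 10);
        Trigger t2 = discountTrigger("T2", new Date(now + 4_000), 0);

        Scheduler scheduler = StdSchedulerFactory.getDefaultScheduler();
        scheduler.scheduleJob(job, Set.of(t1, ta, tb, t2), false);
        scheduler.start();
    }

    private static Trigger discountTrigger(String name, Date fireAt, int discount) {
        return TriggerBuilder.newTrigger()
                .withIdentity(name)
                .startAt(fireAt)                 // fires once at the given time
                .usingJobData("discount", discount)
                .usingJobData("productId", "A")  // assumed product-ID parameter
                .build();
    }
}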

Related

Tracking an expected set of Kafka events

Say I have N cities and each will report its temperature for the hour (H) by producing Kafka events. I have a complex model I want to run, but I want to ensure it doesn't attempt to kick off before all N are read.
Say they are being produced in batches. I understand that, to ensure at-least-once consumption, a consumer that fails mid-batch will pick up again at the front of the batch. I have built this into my model by counting unique cities (if a city is sent multiple times, it overwrites the existing record).
My current plan is to set it up as follows:
An application creates an initial event which says "Expect these N cities to report for H o'clock".
The events are persisted (in a DB, Redis, etc.) by another application. After writing, it produces an event which states how many unique cities have reported in total so far for H.
Some process matches the initial "Expect N" events with the "N written" events. When they are equal, it alerts the rest of the system that the data set for H is ready for building the model.
Does this problem have a name and are there common patterns or libraries available to manage it?
Does the solution as outlined have glaring holes or overcomplicate the issue?
What you're describing sounds like an Aggregator, described in Gregor Hohpe and Bobby Woolf's "Enterprise Integration Patterns" as:
a special Filter that receives a stream of messages and identifies messages that are correlated. Once a complete set of messages has been received [...], the Aggregator collects information from each correlated message and publishes a single, aggregated message to the output channel for further processing.
This could be done on top of Kafka Streams, using its built-in aggregation, or with a stateful service like you suggested.
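A rough sketch of the Kafka Streams variant might look like the following. The topic names, the hour-as-key message layout, the hard-coded N, and the toy serde are all assumptions for the sake of the example:

import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.HashSet;
import java.util.Properties;
import org.apache.kafka.common.serialization.Serde;
import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.Grouped;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.Materialized;
import org.apache.kafka.streams.kstream.Produced;

public class CityAggregatorSketch {
    static final int EXPECTED_CITIES = 50; // the "N" from the initial event, hard-coded for the sketch

    public static void main(String[] args) {
        // Toy serde for a HashSet<String> (comma-joined); fine for a sketch only.
        Serde<HashSet<String>> citySetSerde = Serdes.serdeFrom(
                (topic, set) -> String.join(",", set).getBytes(StandardCharsets.UTF_8),
                (topic, bytes) -> new HashSet<>(
                        Arrays.asList(new String(bytes, StandardCharsets.UTF_8).split(","))));

        StreamsBuilder builder = new StreamsBuilder();
        // Assumed layout: key = hour H, value = city name.
        KStream<String, String> reports = builder.stream("city-temperatures");

        reports.groupByKey(Grouped.with(Serdes.String(), Serdes.String()))
               .aggregate(HashSet::new,
                       (hour, city, cities) -> { cities.add(city); return cities; },
                       Materialized.with(Serdes.String(), citySetSerde))
               .toStream()
               // Publish a "ready" event once all N cities have reported for H.
               .filter((hour, cities) -> cities.size() >= EXPECTED_CITIES)
               .mapValues(cities -> "ready")
               .to("hour-ready", Produced.with(Serdes.String(), Serdes.String()));

        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "city-aggregator-sketch");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        new KafkaStreams(builder.build(), props).start();
    }
}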
One other suggestion -- designing processes like this with event-driven choreography can be tricky. I have seen strong engineering teams fail to deliver similar solutions due to diving into the deep end without first learning to swim. If your scale demands it and your organization is already primed for event-driven distributed architecture, then go for it, but if not, consider an orchestration-based alternative (for example, AWS Step Functions, Airflow, or another workflow orchestration tool). These are much easier to reason about and debug.

Kafka KStream-KTable join race condition

I have the following:
KTable<Integer, A> tableA = builder.table("A");
KStream<Integer, B> streamB = builder.stream("B");
Messages in streamB need to be enriched with data from tableA.
Example data:
Topic A: (1, {name=john})
Topic B: (1, {type=create,...}), (1, {type=update,...}), (1, {type=update...})
In a perfect world, I would like to do
streamB.join(tableA, (b, a) -> { b.name = a.name; return b; })
.selectKey((k,b) -> b.name)
.to("C");
Unfortunately this does not work for me because my data is such that every time a message is written to topic A, a corresponding message is also written to topic B (the source is a single DB transaction). After this initial 'creation' transaction, topic B will keep receiving more messages. Sometimes several events per second will show up on topic B, but it is also possible to have consecutive events hours apart for a given key.
The reason the simple solution does not work is that the original 'creation' transaction causes a race condition: topics A and B get their messages almost simultaneously, and if the B message reaches the 'join' part of the topology first (say a few ms before the A message gets there), tableA will not yet contain a corresponding entry. At this point the event is lost. I can see this happening on topic C: some events show up, some don't (if I use a leftJoin, all events show up, but some have a null key, which is equivalent to being lost). This is only a problem for the initial 'creation' transaction; after that, every time an event arrives on topic B, the corresponding entry already exists in tableA.
So my question is: how do you fix this?
My current solution is ugly: I created a 'collection of B' and read topic B using
streamB.groupByKey()
       .aggregate(() -> new CollectionOfB(), (id, b, agg) -> agg.add(b))
       .join(tableA, ...);
Now we have a KTable-KTable join, which is not susceptible to this race condition. The reason I consider this 'ugly' is because after each join, I have to send a special message back to topic B that essentially says "remove the event(s) that I just processed from the collection". If this special message is not sent to topic B, the collection will keep growing and every event in the collection will be reported on every join.
Currently I'm investigating whether a window join would work (read both A and B into KStreams and use a windowed join). I'm not sure that this will work either because there is no upper bound on the size of the window. I want to say, "window starts 1 second 'before' and ends infinity seconds 'after'". Even if I can somehow make this work, I am a bit concerned with the space requirement of having an unbounded window.
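Sketched out, the windowed variant would look something like this (continuing the snippets above; note that the JoinWindows must be bounded, which is exactly my concern):
// Both topics read as KStreams; a stream-stream join requires a bounded window.
KStream<Integer, A> streamA = builder.stream("A");
streamB.join(streamA,
    (b, a) -> { b.name = a.name; return b; },
    JoinWindows.of(Duration.ofSeconds(1))) // 1s of slack both ways; "infinity after" is not expressible
    .to("C");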
Any suggestion would be greatly appreciated.
Not sure what version you are using, but the latest Kafka, 2.1, improves the stream-table join. Even before 2.1, the following holds:
the stream-table join is based on event-time
Kafka Streams processes messages based on event-time, however in offset order (of the two input streams, the one with smaller record timestamps is processed first)
if you want to ensure that the table is updated first, the table update record should have a smaller timestamp than the stream record
Since 2.1:
to allow for some delay, you can set the max.task.idle.ms configuration to delay processing for the case that only one input topic has input data
Event-time processing order is implemented as best effort in 2.0 and earlier versions, which can lead to the race condition you describe. In 2.1, processing order is guaranteed and can only be violated if the max.task.idle.ms timeout is hit.
For details, see https://cwiki.apache.org/confluence/display/KAFKA/KIP-353%3A+Improve+Kafka+Streams+Timestamp+Synchronization
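For example, the idle-time configuration could be set like this (application id, broker address, and the 500 ms value are placeholders):

import java.util.Properties;
import org.apache.kafka.streams.StreamsConfig;

public class IdleConfigSketch {
    static Properties streamsProps() {
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "stream-table-join-app"); // placeholder
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");     // placeholder
        // Let a task wait up to 500 ms for data on an empty input before processing
        // the other input, so the table update can arrive before the stream record.
        props.put(StreamsConfig.MAX_TASK_IDLE_MS_CONFIG, 500L);
        return props;
    }
}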

Can a single-processor environment prevent race conditions?

When multiple processors are working, processes run concurrently. A race condition happens when multiple threads access some common data area and one may overwrite the other's value.
So, in a single-processor, single-core environment, can race conditions be prevented?
Help me clarify this confusion, thank you.
A race condition can happen in a single-processor environment. Per Wikipedia, a race condition occurs when the output is dependent on the sequence or timing of other uncontrollable events.
A single-processor environment can run multiple threads, of the same process or of different processes, that might be waiting for another thread to yield a resource. Deadlocks can happen in single-processor environments too.
Scenario:
T1: Wants to add an employee record to file "employee.txt"
T2: Wants to compute average salary for "legal dept"
T3: Wants to remove an employee who left
T4: Wants to list number of employees working in each dept
If all the above threads are ready at time=0 and submitted to a single processor, it decides which thread goes first, second, and so on. The order in which the threads are prioritised and yielded differs across platforms, scenarios, etc. Thus T2 and T4 might not give consistent results.
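A small demonstration: two threads incrementing a shared counter can lose updates even on a single core, because the scheduler may preempt a thread between its read and its write:

public class RaceDemo {
    private static int counter = 0; // shared, unsynchronized

    public static void main(String[] args) throws InterruptedException {
        Runnable increment = () -> {
            for (int i = 0; i < 100_000; i++) {
                counter++; // read-modify-write: not atomic, interleavings lose updates
            }
        };
        Thread t1 = new Thread(increment);
        Thread t2 = new Thread(increment);
        t1.start();
        t2.start();
        t1.join();
        t2.join();
        // Expected 200000, but typically prints less, even with one core,
        // because preemption can occur between the read and the write.
        System.out.println("counter = " + counter);
    }
}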

Understanding multilevel feedback queue scheduling

I'm trying to understand multilevel feedback queue scheduling and I came across the following example from William Stallings' Operating Systems: Internals and Design Principles (7th ed.).
I have this set of processes:
And the result in the book is this:
I believe I'm doing the first steps right, but when I get to process E's CPU time, my next process is B, not D as in the book's example.
I can't work out whether there are n RQs and, each time a process gets CPU time, it is demoted to a lower-priority RQ, or whether, for example, when process A is in RQ1 and there are no processes in the lower RQ, the process is promoted to that ready queue (this is how I am doing it).
Can someone explain how, in the above example, after E is processed, D (and not B) gets CPU time?
The multilevel feedback algorithm always selects the first job of the lowest queue (i.e., the queue with the highest priority) that is not empty.
When job E leaves RQ1 (time 9), job D is in queue RQ2 but job B is in RQ3. Thus, D is executed. Please consider the modified figure, where the red numbers give the queue in which the job is executed.
As you can see, job B has already left RQ2 at time 9 (more precisely, it left RQ2 at time 6), whereas job D has just entered it.
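To make the selection rule concrete, here is a toy simulation (not from the book; the arrivals and service times are made up) of "run the head of the highest-priority non-empty queue, demote one level after each quantum":

import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public class MlfqSketch {
    static class Job {
        final String name;
        int remaining; // remaining service time
        Job(String name, int remaining) { this.name = name; this.remaining = remaining; }
    }

    public static void main(String[] args) {
        final int levels = 4, quantum = 1;
        List<ArrayDeque<Job>> queues = new ArrayList<>();
        for (int i = 0; i < levels; i++) queues.add(new ArrayDeque<>());

        // Simplification: all jobs arrive at time 0 in RQ0 (the book's example
        // has staggered arrivals, which is what creates the B-vs-D situation).
        queues.get(0).add(new Job("A", 3));
        queues.get(0).add(new Job("B", 6));

        int time = 0;
        while (true) {
            // Selection rule: the head of the highest-priority non-empty queue.
            int level = -1;
            for (int i = 0; i < levels; i++)
                if (!queues.get(i).isEmpty()) { level = i; break; }
            if (level < 0) break; // everything finished

            Job job = queues.get(level).poll();
            int run = Math.min(quantum, job.remaining);
            System.out.printf("t=%d: run %s from RQ%d%n", time, job.name, level);
            time += run;
            job.remaining -= run;
            if (job.remaining > 0)
                // Demotion rule: one level down after using the quantum
                // (round-robin within the last queue).
                queues.get(Math.min(level + 1, levels - 1)).add(job);
        }
    }
}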

Running two instances of the ScheduledThreadPoolExecutor

I have a number of asynchronous tasks to run in parallel. The tasks can be divided into two types; let's call the time-consuming ones type A and the faster, quick-to-execute ones type B.
With a single ScheduledThreadPoolExecutor of pool size x, eventually all threads end up busy executing type A tasks; as a result, type B tasks get blocked and delayed.
What I'm trying to accomplish is to run type A tasks in parallel with type B tasks, and I want the tasks within each type to run in parallel within their own group, for performance.
Do you think it's prudent to have two instances of ScheduledThreadPoolExecutor, one each for type A and type B, with their own thread pools? Do you see any issues with this approach?
No, that seems reasonable.
I am doing something similar, i.e. I need to execute tasks serially depending on some id: e.g. all tasks for the component with id="1" need to be executed serially with respect to each other, and in parallel with all tasks for components with different ids.
So basically I need a separate queue of tasks for each component; the tasks are pulled one after another from each specific queue.
In order to achieve that I use
Executors.newSingleThreadExecutor(new JobThreadFactory(componentId));
for each component.
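Sketched out, the per-component part could look like this (JobThreadFactory is my own class, so a plain thread factory stands in for it here):

import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class PerComponentExecutors {
    private final ConcurrentHashMap<String, ExecutorService> executors = new ConcurrentHashMap<>();

    // Tasks for the same componentId run serially; different ids run in parallel.
    void submit(String componentId, Runnable task) {
        executors.computeIfAbsent(componentId,
                id -> Executors.newSingleThreadExecutor(r -> new Thread(r, "job-" + id)))
                 .submit(task);
    }
}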
Additionally, I need an ExecutorService for a different type of tasks which are not bound to componentIds; for that I create an additional ExecutorService instance:
Executors.newFixedThreadPool(DEFAULT_THREAD_POOL_SIZE, new JobThreadFactory());
This works fine for my case at least.
The only problem I can think of is if there is a need for ordered execution of the tasks, i.e.
task2 NEEDS to be executed after task1 and so on... But I doubt that's the case here...
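For completeness, a minimal sketch of the two-pool approach from the question (pool sizes and scheduling periods are purely illustrative):

import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;

public class TwoPoolScheduler {
    // One pool dedicated to the slow type A tasks, one to the quick type B tasks,
    // so type A can never starve type B of threads.
    private final ScheduledExecutorService typeAPool = Executors.newScheduledThreadPool(4);
    private final ScheduledExecutorService typeBPool = Executors.newScheduledThreadPool(2);

    void scheduleTypeA(Runnable task) {
        typeAPool.scheduleWithFixedDelay(task, 0, 30, TimeUnit.SECONDS);
    }

    void scheduleTypeB(Runnable task) {
        typeBPool.scheduleWithFixedDelay(task, 0, 1, TimeUnit.SECONDS);
    }

    void shutdown() {
        typeAPool.shutdown();
        typeBPool.shutdown();
    }
}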