Spring Batch Conditional Flow splitting twice

In my Spring Boot Batch job, I have to split the flow at certain points based on the exit status of a step. Afterwards, both branches have to continue with the same step, and then the flow has to split again based on the original split decision in order to execute further steps.
The logic should look like this:
        Step A
          |
    -------------
    |           |
A COMPLETED  A FAILED
    |           |
    V           V
 STEP B      STEP C
    |           |
    -------------
          |
          V
       STEP D
          |
    -------------
    |           |
A COMPLETED  A FAILED
    |           |
    V           V
 STEP E      STEP F
    |           |
    -------------
          |
          V
       STEP G
How would I express that in the job definition? Currently, my code looks like this:
return this.jobBuilderFactory.get("SchreibenModifikationsJob")
        .incrementer(new RunIdIncrementer())
        .listener(listener)
        .start(stepA)
        .on(ExitStatus.COMPLETED.getExitCode())
        .to(stepB)
        .from(stepA)
        .on(ExitStatus.FAILED.getExitCode())
        .to(stepC)
        .from(stepB)
        .on(ExitStatus.COMPLETED.getExitCode())
        .to(stepD)
        .from(stepB)
        .on(ExitStatus.FAILED.getExitCode())
        .to(stepD)
        .from(stepC)
        .on(ExitStatus.COMPLETED.getExitCode())
        .to(stepD)
        .from(stepC)
        .on(ExitStatus.FAILED.getExitCode())
        .to(stepD)
        .from(stepD)
        .on(ExitStatus.COMPLETED.getExitCode()) // Need status of stepA here
        .to(stepE)
        .from(stepD)
        .on(ExitStatus.FAILED.getExitCode())    // Need status of stepA here
        .to(stepF)
        .from(stepE)
        .on(ExitStatus.COMPLETED.getExitCode())
        .to(stepG)
        .from(stepE)
        .on(ExitStatus.FAILED.getExitCode())
        .to(stepG)
        .from(stepF)
        .on(ExitStatus.COMPLETED.getExitCode())
        .to(stepG)
        .from(stepF)
        .on(ExitStatus.FAILED.getExitCode())
        .to(stepG)
        .from(stepG)
        .end()
        .build();
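One common way to branch on stepA's status later in the job (not shown in the question, just a sketch) is a JobExecutionDecider that looks up stepA's StepExecution in the current JobExecution and re-emits its exit code as the flow status. A minimal sketch, assuming the step is registered under the name "stepA" and a decider bean called stepADecider:

import org.springframework.batch.core.JobExecution;
import org.springframework.batch.core.StepExecution;
import org.springframework.batch.core.job.flow.FlowExecutionStatus;
import org.springframework.batch.core.job.flow.JobExecutionDecider;

// Sketch only: re-emits stepA's exit code so later transitions can branch on it.
public class StepADecider implements JobExecutionDecider {

    @Override
    public FlowExecutionStatus decide(JobExecution jobExecution, StepExecution stepExecution) {
        // Look up stepA's execution in the current job; "stepA" is an assumed step name.
        return jobExecution.getStepExecutions().stream()
                .filter(se -> "stepA".equals(se.getStepName()))
                .findFirst()
                .map(se -> new FlowExecutionStatus(se.getExitStatus().getExitCode()))
                .orElse(FlowExecutionStatus.UNKNOWN);
    }
}

The decider would then sit between stepD and the second split, so the transitions after stepD branch on the status the decider re-emits rather than on stepD's own exit status, for example:

// Hypothetical wiring: route every outcome of stepD through the decider,
// then branch on the re-emitted status (i.e. stepA's exit code).
.from(stepD).on("*").to(stepADecider)
.from(stepADecider).on(ExitStatus.COMPLETED.getExitCode()).to(stepE)
.from(stepADecider).on(ExitStatus.FAILED.getExitCode()).to(stepF)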

Related

Distribute group tasks evenly using pandas_udf in PySpark

I have a Spark Dataframe which contains groups of training data. Each group is identified by the "group" column.
group | feature_1 | feature_2 | label
--------------------------------------
1     | 123       | 456       | 0
1     | 553       | 346       | 1
...   | ...       | ...       | ...
2     | 623       | 498       | 0
2     | 533       | 124       | 1
...   | ...       | ...       | ...
I want to train a Python ML model (LightGBM in my case) for each group in parallel.
Therefore, I have the following working code:
import pickle

import pandas as pd
from pyspark.sql import functions as F
from pyspark.sql import types as T

schema = T.StructType([T.StructField("group_id", T.IntegerType(), True),
                       T.StructField("model", T.BinaryType(), True)])

@F.pandas_udf(schema, F.PandasUDFType.GROUPED_MAP)
def _fit(pdf):
    group_id = pdf.loc[0, "group"]
    # X_col and y_col are defined elsewhere
    X = pdf.loc[:, X_col]
    y = pdf.loc[:, y_col].values
    # train model
    model = ...
    out_df = pd.DataFrame(
        [[group_id, pickle.dumps(model)]],
        columns=["group_id", "model"]
    )
    return out_df

df.groupby("group").apply(_fit)
I have 10 groups in the dataset and 10 worker nodes.
Most of the time, each group is assigned to its own executor and the processing is very quick.
However, sometimes more than one group is assigned to an executor while other executors are left idle.
This causes the processing to become very slow as the executor has to train multiple models at the same time.
Question: how do I schedule each group to train on a separate executor to avoid this problem?
I think you're going to want to look into setting the following two Spark configuration properties:
spark.task.cpus (the number of cpus per task)
spark.executor.cores (the number of cpus per executor)
I believe setting spark.executor.cores = spark.task.cpus = (cores per worker -1) might solve your problem.
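For reference, a minimal sketch of how those two properties could be set when building the SparkSession (the worker core count and app name are illustrative assumptions, not from the question; in practice these are often passed with spark-submit --conf instead):

from pyspark.sql import SparkSession

# Illustrative numbers only: assumes each worker node has 8 cores, so one
# (cores - 1)-sized task fills an executor and the groups spread out.
cores_per_worker = 8

spark = (SparkSession.builder
         .appName("one-group-per-executor")  # hypothetical app name
         .config("spark.executor.cores", str(cores_per_worker - 1))
         .config("spark.task.cpus", str(cores_per_worker - 1))
         .getOrCreate())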

Querying data with additional column that creates a number for ordering purposes

I am trying to create a "queue" system by adding a computed column that assigns a number based on a condition and a date, to rank the importance of each row.
For example, below is the query result I pulled in Postgres:
Table: task
Result:
description | status/condition | task_created        |
bla         | A                | 2019-12-01 07:00:00 |
pikachu     | A                | 2019-12-01 16:32:10 |
abcdef      | B                | 2019-12-02 18:34:22 |
doremi      | B                | 2019-12-02 15:09:43 |
lalala      | A                | 2019-12-03 22:10:59 |
In the above, each task has a date/timestamp and a status/condition applied to it. I would like to create another column that assigns a number to each row, prioritising the older tasks first, BUT if the condition is B, then the oldest of the B tasks takes first priority.
The expected end result (based on the example) should be:
Table1: task
description | status/condition | task_created        | priority index
bla         | A                | 2019-12-01 07:00:00 | 3
pikachu     | A                | 2019-12-01 16:32:10 | 4
abcdef      | B                | 2019-12-02 18:34:22 | 2
doremi      | B                | 2019-12-02 15:09:43 | 1
lalala      | A                | 2019-12-03 22:10:59 | 5
For the priority number, 1 is the most urgent to do/resolve, while 5 is the least urgent.
How would I go about adding this additional column into the existing query, especially since there's another condition apart from just the task_created date/time?
Any help is appreciated. Many thanks!
You probably want the rank() or dense_rank() window function (depending on your needs).
If you don't need a conditional order on the status, you can use this one:
SELECT *,
       rank() OVER (
           ORDER BY status DESC, task_created
       ) AS priority_index
FROM task
If you need a custom order based on the value of the status:
SELECT *,
       rank() OVER (
           ORDER BY
               CASE status
                   WHEN 'B' THEN 1
                   WHEN 'A' THEN 2
                   WHEN 'C' THEN 3
                   ELSE 4
               END,
               task_created
       ) AS priority_index
FROM task
If you have only a few values, this is good enough, because you can simply spell out the custom order. But if you have a lot of values and the ordering information is fixed, it should live in its own table.
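In that case, a sketch of the lookup-table variant could look like the following (the status_priority table and its columns are made up for illustration; they are not part of the original schema):

-- Hypothetical lookup table holding the fixed ordering per status.
CREATE TABLE status_priority (
    status   text PRIMARY KEY,
    ordering integer NOT NULL
);

INSERT INTO status_priority (status, ordering)
VALUES ('B', 1), ('A', 2), ('C', 3);

-- Same rank() idea as above, but the order now comes from the join.
SELECT t.*,
       rank() OVER (ORDER BY sp.ordering, t.task_created) AS priority_index
FROM task t
JOIN status_priority sp ON sp.status = t.status;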

How to get the latest value within a time window

This is what my streaming data looks like:
time | id | group
---- | ---| ---
1 | a1 | b1
2 | a1 | b2
3 | a1 | b3
4 | a2 | b3
Consider that all of the rows above fall within one window. My use case is to get the latest record for each distinct id.
I need the output to be like below:
time | id | group
---- | ---| ---
3 | a1 | b3
4 | a2 | b3
How can I achieve this in Flink?
I am aware of the WindowFunction interface. However, I cannot wrap my head around how to do this.
I have tried this just to get the distinct ids. How can I extend this function to my use case?
class DistinctGrid extends WindowFunction[UserMessage, String, Tuple, TimeWindow] {
  override def apply(key: Tuple, window: TimeWindow, input: Iterable[UserMessage], out: Collector[String]): Unit = {
    val distinctGeo = input.map(_.id).toSet
    for (i <- distinctGeo) {
      out.collect(i)
    }
  }
}
If you key the stream by the id field, then there's no need to think about distinct ids -- you'll have a separate window for each distinct key. Your window function just needs to iterate over the window contents to find the UserMessage with the largest timestamp, and output that as the result of the window (for that key). However, there's a built-in function that does just that -- look at the documentation for maxBy() -- so there's no need for a window function in this case.
Roughly speaking then, this will look like
stream.keyBy("id")
      .timeWindow(Time.minutes(10))
      .maxBy("time")
      .print()

Add a key element for n rows in PySpark Dataframe

I have a dataframe like the one shown below.
id | run_id
-----------
 4 | 12345
 6 | 12567
10 | 12890
13 | 12450
I wish to add a new column, say key, that will have the value 1 for the first n rows and 2 for the next n rows. The result will look like this:
id | run_id | key
-----------------
 4 | 12345  | 1
 6 | 12567  | 1
10 | 12890  | 2
13 | 12450  | 2
Is it possible to do the same with PySpark? Thanks in advance for the help.
Here is one way to do it using zipWithIndex:
# sample rdd
rdd = sc.parallelize([[4, 12345], [6, 12567], [10, 12890], [13, 12450]])
# group size for key
n = 2
# add a row number and then label in batches of size n
# (tuple unpacking in lambdas is Python 2 only, so index into the pair explicitly)
rdd = rdd.zipWithIndex().map(lambda pair: pair[0] + [pair[1] // n + 1])
# convert to dataframe
df = rdd.toDF(schema=['id', 'run_id', 'key'])
df.show(4)

Complex canvas getting stuck in the middle

Setup:
Celery 4.1.0, broker=RabbitMQ 3.6.5, backend=Redis 3.2.5
Worker invocation:
celery worker -A worker.celeryapp:app -l info -Q default -c 2 -n defaultworker#%h -Ofair
Task definition:
@app.task(name='task_1',
          bind=True,
          base=MyConnectionHolderTask)
def task_1(self, feed_id, flow_id, **kwargs):
    do_something()
Consider the following canvas:
task_1 = t_1.si(feed_id, flow_id)
.
.
task_13 = t_13.si(feed_id, flow_id)

(task_1 |
 group((task_2 | group(task_3, task_4)),
       task_5,
       task_6,
       task_7,
       task_8) |
 task_9 |
 task_10 |
 task_11 |
 task_12 |
 task_13).apply_async(link_error=unlock)
This means I have a chain of tasks, one of which is a group of several tasks, and one of those is itself a chain of size 2 (whose latter task is a group of 2).
Expected behavior
All tasks succeed, so I expect the canvas to run through to task_13.
Actual behavior
task_4 is the last task to run; task_9 and the rest (10..13) don't run.
If I remove the inner group of task_3 & task_4 and chain them instead, it does work (through task_13):
(task_1 |
 group((task_2 | task_3 | task_4),
       task_5,
       task_6,
       task_7,
       task_8) |
 task_9 |
 task_10 |
 task_11 |
 task_12 |
 task_13).apply_async(link_error=unlock)
Ref: Issue on GitHub