How to handle redistribution/allocation algorithm using Spark in Scala - scala

Let's say I have a bunch of penguins around the country and I need to allocate food provisioning (which are distributed around the country as well) to the penguins.
I tried to simplify the problem as solving :
Input
The distribution of the penguins by area, grouped by proximity and prioritized as
+------------+------+-------+--------------------------------------+----------+
| PENGUIN ID | AERA | GROUP | PRIORITY (lower are allocated first) | QUANTITY |
+------------+------+-------+--------------------------------------+----------+
| P1 | A | A1 | 1 | 5 |
| P2 | A | A1 | 2 | 5 |
| P3 | A | A2 | 1 | 5 |
| P4 | B | B1 | 1 | 5 |
| P5 | B | B2 | 1 | 5 |
+------------+------+-------+--------------------------------------+----------+
The distribution of the food by area, also grouped by proximity and prioritized as
+---------+------+-------+--------------------------------------+----------+
| FOOD ID | AERA | GROUP | PRIORITY (lower are allocated first) | QUANTITY |
+---------+------+-------+--------------------------------------+----------+
| F1 | A | A1 | 2 | 5 |
| F2 | A | A1 | 1 | 2 |
| F3 | A | A2 | 1 | 7 |
| F4 | B | B1 | 1 | 7 |
+---------+------+-------+--------------------------------------+----------+
Expected output
The challenge is to allocate the food to the penguins from the same group first, respecting the priority order of both food and penguin and then take the left food to the other area.
So based on above data we would first allocate within same area and group as:
Stage 1: A1 (same area and group)
+------+-------+---------+------------+--------------------+
| AREA | GROUP | FOOD ID | PINGUIN ID | ALLOCATED_QUANTITY |
+------+-------+---------+------------+--------------------+
| A | A1 | F2 | P1 | 2 |
| A | A1 | F1 | P1 | 3 |
| A | A1 | F1 | P2 | 2 |
| A | A1 | X | P2 | 3 |
+------+-------+---------+------------+--------------------+
Stage 1: A2 (same area and group)
+------+-------+---------+------------+--------------------+
| AREA | GROUP | FOOD ID | PINGUIN ID | ALLOCATED_QUANTITY |
+------+-------+---------+------------+--------------------+
| A | A2 | F3 | P3 | 5 |
| A | A2 | F3 | X | 2 |
+------+-------+---------+------------+--------------------+
Stage 2: A (same area, food left from Stage 1:A2 can now be delivered to Stage 1:A1 penguin)
+------+---------+------------+--------------------+
| AREA | FOOD ID | PINGUIN ID | ALLOCATED_QUANTITY |
+------+---------+------------+--------------------+
| A | F2 | P1 | 2 |
| A | F1 | P1 | 3 |
| A | F1 | P2 | 2 |
| A | F3 | P3 | 5 |
| A | F3 | P2 | 2 |
| A | X | P2 | 1 |
+------+---------+------------+--------------------+
and then we continue do the same for Stage 3 (across AERA), Stage 4 (across AERA2 (by train), which is a different geography cut than AERA (by truck) so we can't just re-aggregate), 5...
What I tried
I'm well familiar how to do it efficiently with a simple R code using a bunch of For loop, array pointer and creating output row by row for each allocation. However with Spark/Scala i could only end up with big and none-efficient code for solving such a simple problem and i would like to reach the community because its probably just that i missed a spark functionality.
I can do it using a lot of spark row transformation as [withColumn,groupby,agg(sum),join,union,filters] but the DAG creation end up being so big that it start to slow the DAG build up after 5/6 stages. I can go around that by saving the output as a file after each stage but then i got an IO issue as i have millions of records to save per stage.
I can also do it running a UDAF (using .split() buffer) for each stage, explode result then join back to the original table to update each quantities per stage. It does make the DAG much more simple and fast to build but unfortunately likely due to the string manipulation inside the UDAF it is too slow for few partitions.
In the end both of the above method feel wrong as they are more like hacks and there must be a more simple way to solve this issue. Ideally i would prefer use transformation to not loose the lazy-evaluations as this is just one step among many other transformations
Thanks a lot for your time. I'm happy to discuss any suggested approach.

This is psuedocode/description, but my solution to Stage 1. The problem is pretty interesting, and I thought you described it quite well.
My thought is to use spark's window, struct, collect_list (and maybe a sortWithinPartitions), cumulative sums, and lagging to get to something like this:
C1 C2 C3 C4 C5 C6 C7 | C8
P1 | A | A1 | 5 | 0 | [(F1,2), (F2,7)] | [F2] | 2
P1 | A | A1 | 10 | 5 | [(F1,2), (F2,7)] | [] | -3
C4 = cumulative sum of quantity, grouped by area/group, ordered by priority
C5 = lag of C4 down a row, and null = 0
C6 = structure of food / quantity, with a cumulative sum of food quantity
C7/C8 = remaining food/food ids
Now you can use a plain udf to return the array of food groups that belong to a penguin, since you can find the first instance where C5 < C6.quantity and the first instance where C4 > C6.quantity. Everything in between is returned. If C4 is never larger than C6.quantity, then you can append X. Exploding this result of this array will get you all penguins and if a penguin does not have food.
To determine whether there is extra food, you can have a udf which calculates the amount of "remaining food" for each row and use a window and row_number to get the the last area that is fed. If remaining food > 0, those food ids have left over food, it will be reflected in the array, and you can also make it struct to map to the number of food items left over.
I think in the end I'm still doing a fair number of aggregations, but hopefully grouping some things together into arrays makes it faster to do comparisons across each individual item.

Related

How to Decompose Global System Metrics to a Per Endpoint Basis on a Webserver

I'm implementing a metrics system for a backend API at scale and am running into a dilemma: using statsd, the application itself is logging request metrics on a per endpoint basis, but the CPU metrics are at the global server level. Currently each server has 10 threads, meaning 10 requests can be processed at once (yeah, yeah its actually serial).
For example, if we have two endpoints, /user and /item, the statsd implementation is differentiating statistics (DB/Redis I/O, etc.) per endpoint. However, say we are looking at linux-metrics every N seconds, those statistics do not separate endpoints, inherently.
I believe that it would be possible, assuming that your polling time ("N seconds") is small enough and that you have enough diversity within your requests, to decompose the global system metrics to create an estimate at the endpoint level.
Image a scenario like this:
note: we'll say a represents a GET to /user and b represents a GET to /item
|------|------|------|------|------|------|------|------|------|------|
| t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 | t10 |
|------|------|------|------|------|------|------|------|------|------|
| a | b | b | a | a | b | b | a | b | b |
| b | a | b | | b | a | b | | b | |
| a | b | b | | a | a | b | | a | |
| a | | b | | b | a | a | | a | |
| a | | b | | a | a | b | | | |
| | | | | a | | a | | | |
|------|------|------|------|------|------|------|------|------|------|
At every timestep, t (i.e. t1, t2, etc.), we also take a snapshot of our system metrics. I feel like there should be a way (possibly through a sort of signal decomposition) to estimate the avg load each a/b request takes. Now, in practice I have ~20 routes so it would be far more difficult to get an accurate estimate. But like I said before, provided your requests have enough diversity (but not too much) so that they overlap in certain places like above, it should be at the very least possible to get a rough estimate.
I have to imagine that there is some name for this kind of thing or at the very least some research or naive implementations of this method. In practice, are there any methods that can achieve these kinds of results?
Note: it may be more difficult when considering that requests may bleed over these timesteps, but almost all requests take <250ms. Even if our system stats polling rate is every 5 seconds (which is aggressive), this shouldn't really cause problems. It is also safe to assume that we would be achieving at the very least 50 requests/second on each server, so sparsity of data shouldn't cause problems.
I believe the answer is doing a sum decomposition through linear equations. If we say that a system metric, for example the CPU, is a function CPU(t1), then it would just be a matter of solving the following set of equations for the posted example:
|------|------|------|------|------|------|------|------|------|------|
| t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 | t10 |
|------|------|------|------|------|------|------|------|------|------|
| a | b | b | a | a | b | b | a | b | b |
| b | a | b | | b | a | b | | b | |
| a | b | b | | a | a | b | | a | |
| a | | b | | b | a | a | | a | |
| a | | b | | a | a | b | | | |
| | | | | a | | a | | | |
|------|------|------|------|------|------|------|------|------|------|
4a + b = CPU(t1)
a + 2b = CPU(t2)
5b = CPU(t3)
a = CPU(t4)
3a + 3b = CPU(t5)
4a + b = CPU(t6)
2a + 4b = CPU(t7)
a = CPU(t8)
2a + 2b = CPU(t9)
b = CPU(t10)
Now, there will be more than one way to solve this equation (i.e. a = CPU(t8) and a = CPU(t4)), but if you took the average of a and b (AVG(a)) from their corresponding solutions, you should get a pretty solid metric for this.

LibreOffice calc formula: find the value for a given data based on known data

I have LibreOffice Calc spreadsheet with table that calculates cost for two services. I want calculate cost of service #2 based on known data. The known data are rates (0,80 and 0,68: its permanent) and total incl.VAT 21%. Variable data in column C (unknown): C2 always equal to C3. Based on known data, I want split "Total incl. VAT" amount into a two separate parts, service #1 and service #2 cost. In particular, I want know the 'service #2' amount with VAT. (D3 + VAT) Can someone show formula how to make this?
+---+------------+---------------+-----------------+----------+-----------------+
| | A | B | C | D | E |
+---+------------+---------------+-----------------+----------+-----------------+
| 1 | services | Rate (eur/m3) | volume, m3 | Sum(eur) | service #2 cost |
| 2 | service #1 | 0,80 | 71,00 | 56,80 | |
| 3 | service #2 | 0,68 | 71,00 | 48,28 | |
| 4 | | | Subtotal: | 105,08 | |
| 5 | | | VAT 21% | 22,07 | |
| 6 | | | Total incl. VAT | 127,15 | D3 value + VAT |
+---+------------+---------------+-----------------+----------+-----------------+

Cross tab with a list of values instead of summation

I want a Cross tab that lists field values and counts them instead of just giving a count for the summation. I know I could make this with groups but I cant list the values vertically that way. From my research I believe I have to use a Display String Formula.
SQL Field Data
-------------------------------------------------
| Play # | Formation |Back Set | R/P | PLAY |
-------------------------------------------------
| 1 | TREY | FG | R | TRUCK |
-------------------------------------------------
| 2 | T | FG | R | RHINO |
-------------------------------------------------
| 3 | D | FG | P | 5 STEP |
-------------------------------------------------
| 4 | D | FG | P | 5 STEP |
-------------------------------------------------
| 5 | K JET | NG | R | DOG |
-------------------------------------------------
Desired report structure:
-----------------------------------------------------------
| Backet & Formation | Run | Pass |
-----------------------------------------------------------
| NG K JET | BULLA 1 | |
| | HELL 3 | |
-----------------------------------------------------------
| FG D | | 5 STEP 2 |
-----------------------------------------------------------
| NG K JET | DOG | |
-----------------------------------------------------------
| FG T | RHINO | |
-----------------------------------------------------------
Don't see why a Crosstab is necessary for this - especially if the entire body of the report is just that table.
Group your records by Bracket and Formation - If that's not
something natively configured in your table, make a new Formula field
and group on that.
Drop the 3 relevant fields into whichever section you need to display. (It might be a Footer, based on whether or not you want repeats
Write a formula to determine whether or not Run or Pass are displayed, and place it in their suppression field. (Good luck getting a Crosstab to do that for you! It tends to prefer 0s over blanks.)
If there's more to the report than just this table, you can cheat the system by placing your "table" into a subreport. And of course you can stretch Line objects across the sections and it will stretch to form the table outlines

Which Rule engine to use?

I have a requirement for handling multiple rules and select a value as per the matching criteria.
The rule could be
case-1
----------------------------------------
| A | B | C | D | priority | value |
----------------------------------------
| a1 | b1 | | c1 | 1 | 250 |
----------------------------------------
| | b2 | c2 | d2 | 3 | 200 |
----------------------------------------
| a3 | b3 | c3 | d3 | 2 | 100 |
----------------------------------------
As per the above defined rules, we look for highest number of matching criteria first, and select the value of that rule, (i.e rule with value "100")
case-2
----------------------------------------
| A | B | C | D | priority | value |
----------------------------------------
| a1 | b1 | | c1 | 1 | 100 |
----------------------------------------
| | b2 | c2 | d2 | 2 | 200 |
----------------------------------------
If two conflicting rules found with same number of matching criteria, then look for priority, and select rule with highest priority. In this case (Rule with value "100".
case-3
----------------------------------------
| A | B | C | D | priority | value |
----------------------------------------
| a1 | b1 | | c1 | 3 | 100 |
----------------------------------------
| | b2 | c2 | d2 | 2 | 200 |
----------------------------------------
| a3 | b3 | c3 | d3 | 1 | 300 |
----------------------------------------
| a4 | b4 | c4 | d4 | 1 | 400 |
----------------------------------------
In this case, if more than one rule with same number of matching criteria found and with same priority then select the rule with highest value (i.e Rule4 with value 400).
I know it looks very specific, but i tried to google but couldn't came across any rule engine which can be used in this case.
Please help me out with some pointers and ideas to start with.
Like others have pointed out, any rule engine should do in your case. Since this seems at first glance to be a very lightweight use-case, you can use Rulette to do this almost trivially (Disclosure - I am the author). You could define your rules, and then use the getAllRules API to get the list of applicable rules on which you could do min/max as required.
I am curious, though, to understand why you would want to define conflicting rules and then apply a "priority" on them?

How to set sequence number of sub-elements in TSQL unsing same element as parent?

I need to set a sequence inside T-SQL when in the first column I have sequence marker (which is repeating) and use other column for ordering.
It is hard to explain so I try with example.
This is what I need:
|------------|-------------|----------------|
| Group Col | Order Col | Desired Result |
|------------|-------------|----------------|
| D | 1 | NULL |
| A | 2 | 1 |
| C | 3 | 1 |
| E | 4 | 1 |
| A | 5 | 2 |
| B | 6 | 2 |
| C | 7 | 2 |
| A | 8 | 3 |
| F | 9 | 3 |
| T | 10 | 3 |
| A | 11 | 4 |
| Y | 12 | 4 |
|------------|-------------|----------------|
So my marker is A (each time I met A I must start new group inside my result). All rows before first A must be set to NULL.
I know that I can achieve that with loop but it would be slow solution and I need to update a lot of rows (may be sometimes several thousand).
Is there a way to achive this without loop?
You can use window version of COUNT to get the desired result:
SELECT [Group Col], [Order Col],
COUNT(CASE WHEN [Group Col] = 'A' THEN 1 END)
OVER
(ORDER BY [Order Col]) AS [Desired Result]
FROM mytable
If you need all rows before first A set to NULL then use SUM instead of COUNT.
Demo here