I have a requirement to handle multiple rules and select a value according to the matching criteria.
The rules could look like this:
case-1
+----+----+----+----+----------+-------+
| A  | B  | C  | D  | priority | value |
+----+----+----+----+----------+-------+
| a1 | b1 |    | c1 | 1        | 250   |
|    | b2 | c2 | d2 | 3        | 200   |
| a3 | b3 | c3 | d3 | 2        | 100   |
+----+----+----+----+----------+-------+
As per the rules defined above, we look for the highest number of matching criteria first and select the value of that rule (i.e. the rule with value "100").
case-2
+----+----+----+----+----------+-------+
| A  | B  | C  | D  | priority | value |
+----+----+----+----+----------+-------+
| a1 | b1 |    | c1 | 1        | 100   |
|    | b2 | c2 | d2 | 2        | 200   |
+----+----+----+----+----------+-------+
If two conflicting rules are found with the same number of matching criteria, then look at the priority and select the rule with the highest priority. In this case, that is the rule with value "100".
case-3
+----+----+----+----+----------+-------+
| A  | B  | C  | D  | priority | value |
+----+----+----+----+----------+-------+
| a1 | b1 |    | c1 | 3        | 100   |
|    | b2 | c2 | d2 | 2        | 200   |
| a3 | b3 | c3 | d3 | 1        | 300   |
| a4 | b4 | c4 | d4 | 1        | 400   |
+----+----+----+----+----------+-------+
In this case, if more than one rule is found with the same number of matching criteria and the same priority, then select the rule with the highest value (i.e. rule 4 with value 400).
I know it looks very specific, but I tried searching and couldn't find any rule engine that can be used in this case.
Please help me out with some pointers and ideas to start with.
As others have pointed out, any rule engine should do in your case. Since this seems at first glance to be a very lightweight use case, you can use Rulette to do this almost trivially (disclosure: I am the author). You could define your rules and then use the getAllRules API to get the list of applicable rules, on which you could do min/max as required.
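Independent of which engine you pick, the tie-breaking you describe is just a three-key sort over the applicable rules. A minimal Scala sketch of that selection step (the Rule/Request classes and field names are made up for illustration; this is not Rulette's API):

object RuleSelection {
  // Hypothetical rule representation: None means the criterion is a wildcard (always matches).
  case class Rule(a: Option[String], b: Option[String], c: Option[String],
                  d: Option[String], priority: Int, value: Int)

  case class Request(a: String, b: String, c: String, d: String)

  def pairs(r: Rule, q: Request): Seq[(Option[String], String)] =
    Seq(r.a -> q.a, r.b -> q.b, r.c -> q.c, r.d -> q.d)

  // A rule applies when every non-wildcard criterion matches the request.
  def applies(r: Rule, q: Request): Boolean =
    pairs(r, q).forall { case (crit, v) => crit.forall(_ == v) }

  // Number of non-wildcard criteria that matched (the "specificity" of the rule).
  def matchCount(r: Rule, q: Request): Int =
    pairs(r, q).count { case (crit, v) => crit.contains(v) }

  // Most matching criteria wins; ties broken by priority (1 beats 2), then by highest value.
  def select(rules: Seq[Rule], q: Request): Option[Rule] = {
    val applicable = rules.filter(applies(_, q))
    if (applicable.isEmpty) None
    else Some(applicable.maxBy(r => (matchCount(r, q), -r.priority, r.value)))
  }
}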
I am curious, though, why you would want to define conflicting rules and then apply a "priority" to them?
I have a table like the following one:
+---------+-------+-------+-------------+
| Section | Group | Level | Fulfillment |
+---------+-------+-------+-------------+
| A       | Y     | 1     | 82.2        |
| A       | Y     | 2     | 23.2        |
| A       | M     | 1     | 81.1        |
| A       | M     | 2     | 28.2        |
| B       | Y     | 1     | 89.1        |
| B       | Y     | 2     | 58.2        |
| B       | M     | 1     | 32.5        |
| B       | M     | 2     | 21.4        |
+---------+-------+-------+-------------+
And this would be my desired output:
+---------+-------+--------------------+--------------------+
| Section | Group | Level1_Fulfillment | Level2_Fulfillment |
+---------+-------+--------------------+--------------------+
| A | Y | 82.2 | 23.2 |
| A | M | 81.1 | 28.2 |
| B | Y | 89.1 | 58.2 |
| B | M | 32.5 | 21.4 |
+---------+-------+--------------------+--------------------+
Thus, for each section and group I'd like to obtain their percentages of fulfillment for level 1 and level 2. To achieve this, I've tried crosstab(), but the function returns an error ("The provided SQL must return 3 columns: rowid, category, and values.") because I'm using more than three columns (I need to keep both section and group as identifiers for each row). Is it possible to use crosstab in this case?
Regards.
I find crosstab() unnecessarily complicated to use and prefer conditional aggregation:
select section,
"group",
max(fulfillment) filter (where level = 1) as level_1,
max(fulfillment) filter (where level = 2) as level_2
from the_table
group by section, "group"
order by section;
Online example
I'm implementing a metrics system for a backend API at scale and am running into a dilemma: using statsd, the application itself is logging request metrics on a per-endpoint basis, but the CPU metrics are at the global server level. Currently each server has 10 threads, meaning 10 requests can be processed at once (yeah, yeah, it's actually serial).
For example, if we have two endpoints, /user and /item, the statsd implementation differentiates statistics (DB/Redis I/O, etc.) per endpoint. However, if we look at Linux metrics every N seconds, those statistics inherently do not separate endpoints.
I believe that it would be possible, assuming that your polling time ("N seconds") is small enough and that you have enough diversity within your requests, to decompose the global system metrics to create an estimate at the endpoint level.
Imagine a scenario like this:
note: we'll say a represents a GET to /user and b represents a GET to /item
|------|------|------|------|------|------|------|------|------|------|
| t1 | t2 | t3 | t4 | t5 | t6 | t7 | t8 | t9 | t10 |
|------|------|------|------|------|------|------|------|------|------|
| a | b | b | a | a | b | b | a | b | b |
| b | a | b | | b | a | b | | b | |
| a | b | b | | a | a | b | | a | |
| a | | b | | b | a | a | | a | |
| a | | b | | a | a | b | | | |
| | | | | a | | a | | | |
|------|------|------|------|------|------|------|------|------|------|
At every timestep, t (i.e. t1, t2, etc.), we also take a snapshot of our system metrics. I feel like there should be a way (possibly through a sort of signal decomposition) to estimate the avg load each a/b request takes. Now, in practice I have ~20 routes so it would be far more difficult to get an accurate estimate. But like I said before, provided your requests have enough diversity (but not too much) so that they overlap in certain places like above, it should be at the very least possible to get a rough estimate.
I have to imagine that there is some name for this kind of thing or at the very least some research or naive implementations of this method. In practice, are there any methods that can achieve these kinds of results?
Note: it may be more difficult when considering that requests may bleed over these timesteps, but almost all requests take <250ms. Even if our system stats polling rate is every 5 seconds (which is aggressive), this shouldn't really cause problems. It is also safe to assume that we would be achieving at the very least 50 requests/second on each server, so sparsity of data shouldn't cause problems.
I believe the answer is a sum decomposition through a system of linear equations. If we say that a system metric, for example the CPU, is a function CPU(t), then it is just a matter of solving the following set of equations for the posted example:
4a + b = CPU(t1)
a + 2b = CPU(t2)
5b = CPU(t3)
a = CPU(t4)
4a + 2b = CPU(t5)
4a + b = CPU(t6)
2a + 4b = CPU(t7)
a = CPU(t8)
2a + 2b = CPU(t9)
b = CPU(t10)
Now, there will be more than one way to solve this system (e.g. a = CPU(t8) and a = CPU(t4) give two independent estimates of a), but if you take the average of the a and b values from their corresponding solutions, you should get a pretty solid metric.
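If you prefer to combine all of the equations at once rather than averaging individual solutions, an ordinary least-squares fit does the same job. A minimal plain-Scala sketch for the two-route case (the function name and sample layout are made up for illustration):

// Least-squares fit of per-request cost for two routes.
// Each sample is (number of a requests, number of b requests, observed CPU for that interval).
def fitTwoRoutes(samples: Seq[(Double, Double, Double)]): (Double, Double) = {
  // Normal equations for minimizing sum((a*na + b*nb - cpu)^2) over all samples.
  val saa = samples.map { case (na, _, _)  => na * na }.sum
  val sbb = samples.map { case (_, nb, _)  => nb * nb }.sum
  val sab = samples.map { case (na, nb, _) => na * nb }.sum
  val say = samples.map { case (na, _, y)  => na * y }.sum
  val sby = samples.map { case (_, nb, y)  => nb * y }.sum
  val det = saa * sbb - sab * sab   // needs enough diversity in the request mix (det != 0)
  ((sbb * say - sab * sby) / det, (saa * sby - sab * say) / det)
}

// e.g. fitTwoRoutes(Seq((4.0, 1.0, cpuT1), (1.0, 2.0, cpuT2), (0.0, 5.0, cpuT3), ...))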
I have this table, and I want to do the following:
For an element a(i,j) of the table, its value is the product of the last element in row i times the last element in column j, divided by the last element of the matrix, a(i_max, j_max).
What I have tried so far, to get an idea of where I'm at:
Here is what you need:
@>, or @>$0, refers to the last row, current column.
$>, or @0$>, refers to the current row, last column.
@>$> refers to the last cell in the table.
Documentation here.
| C1    | C2        | C3        | C4        | C5 | C6   |
|-------+-----------+-----------+-----------+----+------|
| sp    | 157.09091 | 264.88201 | 143.30368 |    | 648  |
| pr    | 72.969697 | 123.03933 | 66.565442 |    | 301  |
| rs    | 145.93939 | 246.07866 | 133.13088 |    | 602  |
|-------+-----------+-----------+-----------+----+------|
| total | 376       | 634       | 343       |    | 1551 |
#+TBLFM: @5$6=vsum(@I..@II)::@2$2..@4$4=@0$>*@>$0/@>$>
Let me explain my problem.
I have a table with this shape:
+----+------+-----------+-----------+
| ID | A | B | W |
+----+------+-----------+-----------+
| 1 | 534 | [a,b,c] | [4,6,2] |
| 2 | 534 | [a,b,d,e] | [6,3,6,2] |
| … | … | … | … |
| 54 | 667 | [a,b,r,e] | [4,6,2,3] |
| 55 | 8789 | [d] | [9] |
| 56 | 8789 | [a,b,d] | [7,2,3] |
| 57 | 8789 | [d,e,f,g] | [4,2,2,8] |
| … | … | … | … |
+----+------+-----------+-----------+
The query that I need to perform is the following: given an input with A, B, and W values (e.g. A=8789; B=[a,b]; W=[3,2]), I need to find the "closest" rows in the table that have the same value of A.
I've already defined my custom distance function.
The naive approach would be something like (given the input in the example):
SELECT *
FROM my_table T, dist_function(T.B, T.W, ARRAY['a','b'], ARRAY[3,2]) AS dist
WHERE T.A = 8789
ORDER BY dist ASC
LIMIT 7;
In my understanding this is a classical KNN problem, for which some support already exists:
KNN-GiST
GiST & SP-GiST
SP-Gist example
I'm just not sure about which is the best index to consider.
Thanks.
Let's say I have a bunch of penguins around the country, and I need to allocate food provisions (which are also distributed around the country) to the penguins.
I tried to simplify the problem to solving the following:
Input
The distribution of the penguins by area, grouped by proximity and prioritized as
+------------+------+-------+--------------------------------------+----------+
| PENGUIN ID | AREA | GROUP | PRIORITY (lower are allocated first) | QUANTITY |
+------------+------+-------+--------------------------------------+----------+
| P1 | A | A1 | 1 | 5 |
| P2 | A | A1 | 2 | 5 |
| P3 | A | A2 | 1 | 5 |
| P4 | B | B1 | 1 | 5 |
| P5 | B | B2 | 1 | 5 |
+------------+------+-------+--------------------------------------+----------+
The distribution of the food by area, also grouped by proximity and prioritized as
+---------+------+-------+--------------------------------------+----------+
| FOOD ID | AREA | GROUP | PRIORITY (lower are allocated first) | QUANTITY |
+---------+------+-------+--------------------------------------+----------+
| F1 | A | A1 | 2 | 5 |
| F2 | A | A1 | 1 | 2 |
| F3 | A | A2 | 1 | 7 |
| F4 | B | B1 | 1 | 7 |
+---------+------+-------+--------------------------------------+----------+
Expected output
The challenge is to allocate the food to the penguins from the same group first, respecting the priority order of both food and penguins, and then move the leftover food to the other areas.
So, based on the above data, we would first allocate within the same area and group:
Stage 1: A1 (same area and group)
+------+-------+---------+------------+--------------------+
| AREA | GROUP | FOOD ID | PENGUIN ID | ALLOCATED_QUANTITY |
+------+-------+---------+------------+--------------------+
| A | A1 | F2 | P1 | 2 |
| A | A1 | F1 | P1 | 3 |
| A | A1 | F1 | P2 | 2 |
| A | A1 | X | P2 | 3 |
+------+-------+---------+------------+--------------------+
Stage 1: A2 (same area and group)
+------+-------+---------+------------+--------------------+
| AREA | GROUP | FOOD ID | PENGUIN ID | ALLOCATED_QUANTITY |
+------+-------+---------+------------+--------------------+
| A | A2 | F3 | P3 | 5 |
| A | A2 | F3 | X | 2 |
+------+-------+---------+------------+--------------------+
Stage 2: A (same area, food left from Stage 1:A2 can now be delivered to Stage 1:A1 penguin)
+------+---------+------------+--------------------+
| AREA | FOOD ID | PENGUIN ID | ALLOCATED_QUANTITY |
+------+---------+------------+--------------------+
| A | F2 | P1 | 2 |
| A | F1 | P1 | 3 |
| A | F1 | P2 | 2 |
| A | F3 | P3 | 5 |
| A | F3 | P2 | 2 |
| A | X | P2 | 1 |
+------+---------+------------+--------------------+
and then we continue doing the same for Stage 3 (across AREAs), Stage 4 (across AREA2, which is a different geographic cut than AREA, by train rather than by truck, so we can't simply re-aggregate), Stage 5, and so on.
What I tried
I know how to do this efficiently with simple R code, using a bunch of for loops and array pointers and building the output row by row for each allocation. However, with Spark/Scala I could only come up with big, inefficient code for such a simple problem, and I would like to ask the community, because it's probably just that I missed some Spark functionality.
I can do it using a lot of Spark row transformations [withColumn, groupBy, agg(sum), join, union, filter], but the DAG ends up being so big that it starts to slow down the DAG build-up after 5 or 6 stages. I can work around that by saving the output to a file after each stage, but then I get an IO issue, as I have millions of records to save per stage.
I can also do it by running a UDAF (using a .split() buffer) for each stage, exploding the result, then joining back to the original table to update the quantities for each stage. It makes the DAG much simpler and faster to build, but unfortunately, likely due to the string manipulation inside the UDAF, it is too slow on a few of the partitions.
In the end, both of the above methods feel wrong, more like hacks, and there must be a simpler way to solve this. Ideally I would prefer to use transformations so as not to lose lazy evaluation, as this is just one step among many other transformations.
Thanks a lot for your time. I'm happy to discuss any suggested approach.
This is pseudocode/a description rather than working code, but here is my solution to Stage 1. The problem is pretty interesting, and I thought you described it quite well.
My thought is to use Spark's window functions, struct, collect_list (and maybe a sortWithinPartitions), cumulative sums, and lagging to get to something like this:
+----+----+----+----+----+------------------+------+----+
| C1 | C2 | C3 | C4 | C5 | C6               | C7   | C8 |
+----+----+----+----+----+------------------+------+----+
| P1 | A  | A1 | 5  | 0  | [(F2,2), (F1,7)] | [F1] | 2  |
| P2 | A  | A1 | 10 | 5  | [(F2,2), (F1,7)] | []   | -3 |
+----+----+----+----+----+------------------+------+----+
C4 = cumulative sum of penguin quantity, grouped by area/group, ordered by priority
C5 = C4 lagged down one row, with null replaced by 0
C6 = list of (food id, cumulative food quantity) structs, ordered by food priority
C7/C8 = remaining food ids / remaining quantity (negative when the penguin cannot be fully fed)
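A rough Spark/Scala sketch of how C4-C6 could be built (the DataFrame and column names are mine, assuming penguins and food DataFrames shaped like the question's input tables):

import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val w = Window.partitionBy("area", "group").orderBy("priority")

// C4 / C5: running demand per penguin, and the demand already claimed by earlier penguins.
val penguinsCum = penguins
  .withColumn("cum_qty", sum("quantity").over(w))                          // C4
  .withColumn("prev_cum_qty", coalesce(lag("cum_qty", 1).over(w), lit(0))) // C5

// C6: per area/group, the foods as (cumulative quantity, food id) structs,
// sorted by the running total so they stay in priority order.
val foodSupply = food
  .withColumn("food_cum_qty", sum("quantity").over(w))
  .groupBy("area", "group")
  .agg(sort_array(collect_list(struct(col("food_cum_qty"), col("food_id"))))
    .as("food_supply"))

val stage1 = penguinsCum.join(foodSupply, Seq("area", "group"), "left")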
Now you can use a plain udf to return the array of food ids that belong to a penguin: find the first entry of C6 whose cumulative quantity is greater than C5, and the first entry whose cumulative quantity reaches C4; everything in between belongs to this penguin. If no entry in C6 reaches C4 (the food runs out before the penguin is fully fed), append X. Exploding the resulting array gives you every penguin, including the ones that get no food.
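The core of that udf could look something like this in plain Scala (when wired up as a real Spark UDF the struct array arrives as Seq[Row] and needs unpacking; the names here are made up):

// supply: (food id, cumulative food quantity) pairs in priority order (the C6 column);
// prevCum / cum: this penguin's demand interval [C5, C4).
def allocate(prevCum: Int, cum: Int, supply: Seq[(String, Int)]): Seq[(String, Int)] = {
  // Pair every food item with the start of its cumulative range.
  val withStart = supply.zip(0 +: supply.map(_._2))
  val pieces = withStart.flatMap { case ((foodId, end), start) =>
    val overlap = math.min(cum, end) - math.max(prevCum, start)
    if (overlap > 0) Some(foodId -> overlap) else None
  }
  val covered = pieces.map(_._2).sum
  // Demand not covered by any food becomes an "X" row, as in the expected output.
  if (covered < cum - prevCum) pieces :+ ("X" -> (cum - prevCum - covered)) else pieces
}

// e.g. allocate(0, 5, Seq("F2" -> 2, "F1" -> 7))  returns Seq(("F2", 2), ("F1", 3))
//      allocate(5, 10, Seq("F2" -> 2, "F1" -> 7)) returns Seq(("F1", 2), ("X", 3))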
To determine whether there is extra food, you can have a udf which calculates the amount of "remaining food" for each row, and use a window with row_number to get the last row that is fed in each area/group. If remaining food > 0, those food ids have leftover food; it will be reflected in the array, and you can also make it a struct that maps each food id to the quantity left over.
I think in the end I'm still doing a fair number of aggregations, but hopefully grouping some things together into arrays makes it faster to do comparisons across each individual item.