KSQL return top-n rows - apache-kafka

I'd like to have a table, which only contains the top n rows per group. Consider this table:
Field | Type
--------------------------------------------
a | VARCHAR(STRING) (primary key)
b | VARCHAR(STRING) (primary key)
c | VARCHAR(STRING) (primary key)
d | INTEGER
For every group (denoted by the primary key), I need e. g. the 10 rows with the highest value in column d. d is an aggregation over {a, b, c}, which sums up column c. This works pretty easy in normal SQL with ROW_NUMBER(), as described here: How do I use ROW_NUMBER()? , where you simply assign a number to every row, which depicts the row's placement in descending order depending on the value of column d.
Unfortunately, ksql doesn't support subqueries yet, which you need for ROW_NUMBER(). https://github.com/confluentinc/ksql/issues/745 I'm also not quite sure whether ksql supports ROW_NUMBER(), I think not, since I haven't found anything in the documentation and didn't manage to run it by myself.
I also found the TOPK function in ksql, but that doesn't seem to work as expected. https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/aggregate-functions/#topk
There is an issue about it on GitHub https://github.com/confluentinc/ksql/issues/403 . When I run it by myself, I obtain a column with an array with the top n values for {a, b}, which I can't map to the according values of column c. So that doesn't work as well. Example of what TOPK yields for k = 5:
a | b | d
--------------------------------------------
val1 | val2 | [10, 8, 6, 4, 2]
val1 | val4 | [7, 3, 3, 1, 0]
val1 | val6 | [5, 4, 3, 2, 1]
Here is an example of what I actually need, assuming that {10, 8, 6, 4, 2} are the 5 biggest values of d. For {a, b}, I want the values of c with the 5 biggest values of d.
a | b | c | d
--------------------------------------------
val1 | val2 | val3 | 10
val1 | val2 | val8 | 8
val1 | val2 | val9 | 6
val1 | val2 | val10 | 4
val1 | val2 | val11 | 2
Now, is there any possibility to do a top-n query in ksql? Or is it on the roadmap for future releases? Thanks.

Related

Postgres Unique JSON Array Aggregate Values

I have a table that stores values like this:
| id | thing | values |
|----|-------|--------|
| 1 | a |[1, 2] |
| 2 | b |[2, 3] |
| 3 | a |[2, 3] |
And would like to use an aggregate function to group by thing but store only the unique values of the array such that the result would be:
| thing | values |
|-------|---------|
| a |[1, 2, 3]|
| b |[2, 3] |
Is there a simple and performant way of doing this in Postgres?
First you take the JSON array apart with json_array_elements() - this is a set-returning function with a JOIN LATERAL you get a row with id, thing and a JSON array element for each element.
Then you select DISTINCT records for thing and value, ordered by value.
Finally you aggregate records back together with json_agg().
In SQL that looks like:
SELECT thing, json_agg(value) AS values
FROM (
SELECT DISTINCT thing, value
FROM t
JOIN LATERAL json_array_elements(t.values) AS v(value) ON true
ORDER BY value) x
GROUP BY thing
In general you would want to use the jsonb type as that is more efficient than json. Then you'd have to use the corresponding jsonb_...() functions.

Aggregate all combinations of rows taken k at a time

I am trying to calculate an aggregate function for a field for a subset of rows in a table. The problem is that I'd like to find the mean of every combination of rows taken k at a time --- so for all the rows, I'd like to find (say) the mean of every combination of 10 rows. So:
id | count
----|------
1 | 5
2 | 3
3 | 6
...
30 | 16
should give me
mean of ids 1..10; ids 1, 3..11; ids 1, 4..12, and so so. I know this will yield a lot of rows.
There are SO answers for finding combinations from arrays. I could do this programmatically by taking 30 ids 10 at a time and then SELECTing them. Is there a way to do this with PARTITION BY, TABLESAMPLE, or another function (something like python's itertools.combinations())? (TABLESAMPLE by itself won't guarantee which subset of rows I am selecting as far as I can tell.)
The method described in the cited answer is static. A more convenient solution may be to use recursion.
Example data:
drop table if exists my_table;
create table my_table(id int primary key, number int);
insert into my_table values
(1, 5),
(2, 3),
(3, 6),
(4, 9),
(5, 2);
Query which finds 2 element subsets in 5 element set (k-combination with k = 2):
with recursive recur as (
select
id,
array[id] as combination,
array[number] as numbers,
number as sum
from my_table
union all
select
t.id,
combination || t.id,
numbers || t.number,
sum+ number
from my_table t
join recur r on r.id < t.id
and cardinality(combination) < 2 -- param k
)
select combination, numbers, sum/2.0 as average -- param k
from recur
where cardinality(combination) = 2 -- param k
combination | numbers | average
-------------+---------+--------------------
{1,2} | {5,3} | 4.0000000000000000
{1,3} | {5,6} | 5.5000000000000000
{1,4} | {5,9} | 7.0000000000000000
{1,5} | {5,2} | 3.5000000000000000
{2,3} | {3,6} | 4.5000000000000000
{2,4} | {3,9} | 6.0000000000000000
{2,5} | {3,2} | 2.5000000000000000
{3,4} | {6,9} | 7.5000000000000000
{3,5} | {6,2} | 4.0000000000000000
{4,5} | {9,2} | 5.5000000000000000
(10 rows)
The same query for k = 3 gives:
combination | numbers | average
-------------+---------+--------------------
{1,2,3} | {5,3,6} | 4.6666666666666667
{1,2,4} | {5,3,9} | 5.6666666666666667
{1,2,5} | {5,3,2} | 3.3333333333333333
{1,3,4} | {5,6,9} | 6.6666666666666667
{1,3,5} | {5,6,2} | 4.3333333333333333
{1,4,5} | {5,9,2} | 5.3333333333333333
{2,3,4} | {3,6,9} | 6.0000000000000000
{2,3,5} | {3,6,2} | 3.6666666666666667
{2,4,5} | {3,9,2} | 4.6666666666666667
{3,4,5} | {6,9,2} | 5.6666666666666667
(10 rows)
Of course, you can remove numbers from the query if you do not need them.

Modeling mutual exclusion & co-relation in relational database schema

If A, B & C are attributes with their values as,
A -> {1}
B -> {2,5,9}
C -> {11,12}
A & B are correlated (A cannot exist without B).
When A = 1, B can be 5 or 9, B cannot be 2.
B & C are correlated, when B is 5, C can be 11, C cannot be 12.
Ex: So, when C = 11, then B = 5, A = 1
how do i model this relationship in a relational schema or is there a better way to represent it?
What i have so far as attribute table.
ID | Attribute | value
----------------------
1 | A | 1
2 | B | 2
3 | B | 5
4 | B | 9
5 | C | 11
6 | C | 12
and the correlation table, ID1 & ID2 are foreign keys to the attribute table and are together composite primary key.
ID1 | ID2
---------
1 | 3
1 | 4
3 | 11

Tag unique rows?

I have some data from different systems which can be joined only in a certain case because of different granularity between the data sets.
Given three columns:
call_date, login_id, customer_id
How can I efficiently 'flag' each row which has a unique value across those three values? I didn't want to SELECT DISTINCT because I do not know which of the rows actually matches up with the other. I want to know which records (combination of columns) exist only once in a single date.
For example, if a customer called in 5 times on a single date and ordered a product, I do not know which of those specific call records ties back to the product order (lack of timestamps in the raw data). However, if a customer only called in once on a specific date and had a product order, I know for sure that the order ties back to that call record. (This is just an example - I am doing something similar across about 7 different tables from different source data).
timestamp customer_id login_name score unique
01/24/2017 18:58:11 441987 abc123 .25 TRUE
03/31/2017 15:01:20 783356 abc123 1 FALSE
03/31/2017 16:51:32 783356 abc123 0 FALSE
call_date customer_id login_name order unique
01/24/2017 441987 abc123 0 TRUE
03/31/2017 783356 abc123 1 TRUE
In the above example, I would only want to join rows where the 'uniqueness' is True for both tables. So on 1/24, I know that there was no order for the call which had a score of 0.25.
To find whether the row (or some set of columns) is unique within the list of rows, you need to make use of PostgreSQL window functions.
SELECT *,
(count(*) OVER(PARTITION BY b, c, d) = 1) as unique_within_b_c_d_columns
FROM unnest(ARRAY[
row(1, 2, 3, 1),
row(2, 2, 3, 2),
row(3, 2, 3, 2),
row(4, 2, 3, 4)
]) as t(a int, b int, c int, d int)
Output:
| a | b | c | d | unique_within_b_c_d_columns |
-----------------------------------------------
| 1 | 2 | 3 | 1 | true |
| 2 | 2 | 3 | 2 | false |
| 3 | 2 | 3 | 2 | false |
| 4 | 2 | 3 | 4 | true |
In PARTITION clause you need to specify the list of columns that you want to make comparison on. Note that in the example above a column doesn't take part in comparison.

sql query to break down count of every combination

I need a Postgresql Query that returns the count of every type of combination of record.
For example, I have a table T with columns A, B, C, D, E and other columns that are not of importance:
Table T
--------------
A | B | C | D | E
The query should return a table R with the values from columns A, B, C, D, and a count for how many times each configuration occurs with the specified E value.
Table R
---------------
A | B | C | D | count
When all of the counts for each record are added together, it should equal the total number of records in the original table.
It seems like a very simple problem, but due to my lack of SQL knowledge, I cannot figure out how to do this.
The only solution I can think of is this:
select a, b, c, d, count(*)
from T
where e = 'abc'
group by a, b, c, d
But when adding the counts up from this query, it is way more than the count of the original table. It seems like count(*) shouldn't be used, or i'm just totally going about this the wrong way. I'd really appreciate any advice as to how I should go about this. Thank you all.
NULL values couldn't possibly fool you. Consider this demo:
WITH t(a,b,c,d) AS (
VALUES
(1,2,3,4)
,(1,2,3,NULL)
,(2,2,3,NULL)
,(2,2,3,NULL)
,(2,2,3,4)
,(2,NULL,NULL,NULL)
,(NULL,NULL,NULL,NULL)
)
SELECT a, b, c, d, count(*)
FROM t
GROUP BY a, b, c, d
ORDER BY a, b, c, d;
a | b | c | d | count
---+---+---+---+-------
1 | 2 | 3 | 4 | 1
1 | 2 | 3 | | 1
2 | 2 | 3 | 4 | 1
2 | 2 | 3 | | 2
2 | | | | 1
| | | | 1
There must be some other misunderstanding here.
I figured it out, it was something really stupid. I forgot to specify the where 'E' = 'ABC' clause in the select count(*) when comparing the count. Thanks anyway for your help guys!