Postgres Unique JSON Array Aggregate Values

I have a table that stores values like this:
| id | thing | values |
|----|-------|--------|
| 1  | a     | [1, 2] |
| 2  | b     | [2, 3] |
| 3  | a     | [2, 3] |
And I would like to use an aggregate function to group by thing but store only the unique values of the arrays, such that the result would be:
| thing | values    |
|-------|-----------|
| a     | [1, 2, 3] |
| b     | [2, 3]    |
Is there a simple and performant way of doing this in Postgres?

First you take the JSON array apart with json_array_elements(). This is a set-returning function; combined with a JOIN LATERAL you get a row with id, thing and one JSON array element for each element.
Then you select DISTINCT records for thing and value, ordered by value.
Finally you aggregate records back together with json_agg().
In SQL that looks like:
SELECT thing, json_agg(value) AS values
FROM  (
   SELECT DISTINCT thing, value
   FROM   t
   JOIN   LATERAL json_array_elements(t.values) AS v(value) ON true
   ORDER  BY value
   ) x
GROUP  BY thing;
In general you would want to use the jsonb type: it is more efficient than json and, unlike json, it has an equality operator, which the DISTINCT step needs. Then you'd have to use the corresponding jsonb_...() functions.
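A minimal sketch of that jsonb variant, assuming the values column is of type jsonb (or is cast with ::jsonb):

SELECT thing, jsonb_agg(value) AS values
FROM  (
   SELECT DISTINCT thing, value
   FROM   t
   JOIN   LATERAL jsonb_array_elements(t.values) AS v(value) ON true
   ORDER  BY value
   ) x
GROUP  BY thing;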

Related

pyspark column type casting in pivot

I have a dataframe where I want to create a pivot table from 2 columns. I'm using the question header column, whose values will be pivoted like below: age, age_numeric.
The answer header column holds the value. My problem is that I want to put the values of the answer header in a list, which I'm doing with the collect_list function, but I want a new column like age_numeric to be a list of ints while the column age is a list of strings, based on the question type column. When I try the code it always gives me a list of strings. Any idea how to solve this problem?
This is the code:
y = (output.groupby("sessionId")
     .pivot("questionHeader")
     .agg(collect_list(
         when(col("questionType") == "numericAnswer",
              col("answerHeader").cast("float"))
         .when(col("questionType") != "numericAnswer",
               col("answerHeader")))))
This is what I get:
| session id | Age             | Age_numeric |
| 1          | ["20-25 years"] | ["20"]      |
| 3          | ["20-25 years"] | ["20"]      |
This is what I want:
| session id | Age             | Age_numeric |
| 1          | ["20-25 years"] | [20]        |
| 3          | ["20-25 years"] | [20]        |
If you want the output as in the last two rows, then you do not require a pivot, just groupBy and collect_list on each of the two columns. To get a list of integers for Age_numeric, apply .cast("array<int>") to the collected list, or change the type of the Age_numeric column before collect_list().
Replicate the data
import pyspark.sql.functions as F
data = [(1, "20-25 years", "20"), (3, "20-25 years", "20")]
df = spark.createDataFrame(data, schema=["session_id", "Age", "Age_numeric"])
Replicate the output
df_out = (df.groupBy("session_id")
            .agg(F.collect_list("Age").alias("Age"),
                 F.collect_list("Age_numeric")
                  .cast("array<int>")
                  .alias("Age_numeric")))

KSQL return top-n rows

I'd like to have a table, which only contains the top n rows per group. Consider this table:
Field | Type
--------------------------------------------
a | VARCHAR(STRING) (primary key)
b | VARCHAR(STRING) (primary key)
c | VARCHAR(STRING) (primary key)
d | INTEGER
For every group (denoted by the primary key), I need e.g. the 10 rows with the highest value in column d. d is an aggregation over {a, b, c}, which sums up column c. This works pretty easily in normal SQL with ROW_NUMBER(), as described here: How do I use ROW_NUMBER()?, where you simply assign a number to every row which depicts the row's placement in descending order of the value of column d.
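For reference, this is roughly what that ROW_NUMBER() approach looks like in plain SQL (not ksql). The table name agg_table is a placeholder, and the grouping {a, b} follows the example further down:

SELECT a, b, c, d
FROM (
  SELECT a, b, c, d,
         ROW_NUMBER() OVER (PARTITION BY a, b ORDER BY d DESC) AS rn
  FROM agg_table
) ranked
WHERE rn <= 10;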
Unfortunately, ksql doesn't support subqueries yet, which you need for ROW_NUMBER(): https://github.com/confluentinc/ksql/issues/745. I'm also not quite sure whether ksql supports ROW_NUMBER() at all; I think not, since I haven't found anything in the documentation and didn't manage to run it myself.
I also found the TOPK function in ksql, but that doesn't seem to work as expected. https://docs.ksqldb.io/en/latest/developer-guide/ksqldb-reference/aggregate-functions/#topk
There is an issue about it on GitHub: https://github.com/confluentinc/ksql/issues/403. When I run it myself, I obtain a column with an array of the top n values for {a, b}, which I can't map to the corresponding values of column c. So that doesn't work either. Example of what TOPK yields for k = 5:
a | b | d
--------------------------------------------
val1 | val2 | [10, 8, 6, 4, 2]
val1 | val4 | [7, 3, 3, 1, 0]
val1 | val6 | [5, 4, 3, 2, 1]
Here is an example of what I actually need, assuming that {10, 8, 6, 4, 2} are the 5 biggest values of d. For {a, b}, I want the values of c with the 5 biggest values of d.
a | b | c | d
--------------------------------------------
val1 | val2 | val3 | 10
val1 | val2 | val8 | 8
val1 | val2 | val9 | 6
val1 | val2 | val10 | 4
val1 | val2 | val11 | 2
Now, is there any possibility to do a top-n query in ksql? Or is it on the roadmap for future releases? Thanks.

UPDATE from temp table picking the "last" row per group

Suppose there is a table with data:
+----+-------+
| id | value |
+----+-------+
| 1 | 0 |
| 2 | 0 |
+----+-------+
I need to do a bulk update, and I use COPY FROM STDIN for a fast insert into a temp table. The temp table has no constraints, so it can contain duplicate values in the id column.
Temp table to update from:
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 2 | 1 |
| 1 | 2 |
| 2 | 2 |
+----+-------+
If I simply run a query like this:
UPDATE test target SET value = source.value FROM tmp_test source WHERE target.id = source.id;
I get wrong results:
+----+-------+
| id | value |
+----+-------+
| 1 | 1 |
| 2 | 1 |
+----+-------+
I need the target table to contain the values that appeared last in the temporary table.
What is the most effective way to do this, given that the target table may contain millions of records, and the temporary table may contain tens of thousands?
Assuming you want to take the value from the row that was inserted last into the temp table, physically, you can (ab-)use the system column ctid, signifying the physical location:
UPDATE test AS target
SET    value = source.value
FROM  (
   SELECT DISTINCT ON (id)
          id, value
   FROM   tmp_test
   ORDER  BY id, ctid DESC
   ) source
WHERE  target.id = source.id
AND    target.value <> source.value;  -- skip empty updates
About DISTINCT ON:
Select first row in each GROUP BY group?
This builds on an implementation detail and is not backed by the SQL standard. If some insert method does not write rows in physical sequence (like a future "parallel" INSERT), it breaks. Currently, it should work. About ctid:
How do I decompose ctid into page and row numbers?
If you want a safe way, you need to add a user column that signifies the order of rows, like a serial column. But do you really care? Your tiebreaker seems rather arbitrary. See:
Temporary sequence within a SELECT
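A minimal sketch of that safer variant, assuming you can add a serial column when creating the temp table (the column name load_seq is made up for illustration):

CREATE TEMP TABLE tmp_test (
   load_seq serial
 , id       int
 , value    int
);
-- COPY tmp_test (id, value) FROM STDIN;  -- load_seq is filled automatically

UPDATE test AS target
SET    value = source.value
FROM  (
   SELECT DISTINCT ON (id)
          id, value
   FROM   tmp_test
   ORDER  BY id, load_seq DESC  -- last loaded row wins, no reliance on ctid
   ) source
WHERE  target.id = source.id
AND    target.value <> source.value;  -- see below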
AND target.value <> source.value
skips empty updates - assuming both columns are NOT NULL. Else, use:
AND target.value IS DISTINCT FROM source.value
See:
How do I (or can I) SELECT DISTINCT on multiple columns?

Aggregate all combinations of rows taken k at a time

I am trying to calculate an aggregate function for a field for a subset of rows in a table. The problem is that I'd like to find the mean of every combination of rows taken k at a time --- so for all the rows, I'd like to find (say) the mean of every combination of 10 rows. So:
id | count
----|------
1 | 5
2 | 3
3 | 6
...
30 | 16
should give me
mean of ids 1..10; ids 1, 3..11; ids 1, 4..12, and so on. I know this will yield a lot of rows.
There are SO answers for finding combinations from arrays. I could do this programmatically by taking 30 ids 10 at a time and then SELECTing them. Is there a way to do this with PARTITION BY, TABLESAMPLE, or another function (something like python's itertools.combinations())? (TABLESAMPLE by itself won't guarantee which subset of rows I am selecting as far as I can tell.)
The method described in the cited answer is static. A more convenient solution may be to use recursion.
Example data:
drop table if exists my_table;
create table my_table(id int primary key, number int);
insert into my_table values
(1, 5),
(2, 3),
(3, 6),
(4, 9),
(5, 2);
Query which finds 2-element subsets of the 5-element set (k-combinations with k = 2):
with recursive recur as (
    select
        id,
        array[id] as combination,
        array[number] as numbers,
        number as sum
    from my_table
    union all
    select
        t.id,
        r.combination || t.id,
        r.numbers || t.number,
        r.sum + t.number
    from my_table t
    join recur r on r.id < t.id
        and cardinality(r.combination) < 2  -- param k
)
select combination, numbers, sum/2.0 as average  -- param k
from recur
where cardinality(combination) = 2;  -- param k
 combination | numbers |      average
-------------+---------+--------------------
 {1,2}       | {5,3}   | 4.0000000000000000
 {1,3}       | {5,6}   | 5.5000000000000000
 {1,4}       | {5,9}   | 7.0000000000000000
 {1,5}       | {5,2}   | 3.5000000000000000
 {2,3}       | {3,6}   | 4.5000000000000000
 {2,4}       | {3,9}   | 6.0000000000000000
 {2,5}       | {3,2}   | 2.5000000000000000
 {3,4}       | {6,9}   | 7.5000000000000000
 {3,5}       | {6,2}   | 4.0000000000000000
 {4,5}       | {9,2}   | 5.5000000000000000
(10 rows)
The same query for k = 3 gives:
 combination | numbers |      average
-------------+---------+--------------------
 {1,2,3}     | {5,3,6} | 4.6666666666666667
 {1,2,4}     | {5,3,9} | 5.6666666666666667
 {1,2,5}     | {5,3,2} | 3.3333333333333333
 {1,3,4}     | {5,6,9} | 6.6666666666666667
 {1,3,5}     | {5,6,2} | 4.3333333333333333
 {1,4,5}     | {5,9,2} | 5.3333333333333333
 {2,3,4}     | {3,6,9} | 6.0000000000000000
 {2,3,5}     | {3,6,2} | 3.6666666666666667
 {2,4,5}     | {3,9,2} | 4.6666666666666667
 {3,4,5}     | {6,9,2} | 5.6666666666666667
(10 rows)
Of course, you can remove numbers from the query if you do not need them.

Replacing a comma-separated value in a table with another in a select query (postgres)

I have two tables. Table A has an ID column whose values are comma separated; each of those ID values has a representation in table B.
Table A
+------+-------+
| Name | ID    |
+------+-------+
| A1   | 1,2,3 |
| A2   | 2     |
| A3   | 3,2   |
+------+-------+
Table B
+----+--------+
| ID | Value  |
+----+--------+
| 1  | Apple  |
| 2  | Orange |
| 3  | Mango  |
+----+--------+
I was wondering if there is an efficient way to do a select where the result would be as below:
Name, Value
A1 Apple, Orange, Mango
A2 Orange
A3 Mango, Orange
Any suggestions would be welcome. Thanks.
You need to first "normalize" table_a into a new table using the following:
select name, regexp_split_to_table(id, ',') id
from table_a;
The result of this can be joined to table_b and the result of the join then needs to be grouped in order to get the comma separated list of the names:
select a.name, string_agg(b.value, ',')
from (
    select name, regexp_split_to_table(id, ',') id
    from table_a
) a
join table_b b on b.id = a.id
group by a.name;
SQLFiddle: http://sqlfiddle.com/#!12/77fdf/1
There are two regex related functions that can be useful:
http://www.postgresql.org/docs/current/static/functions-string.html
regexp_split_to_table()
regexp_split_to_array()
Code below is untested, but you'd use something like it to match A and B:
select name, value
from A
join B on B.id = ANY(regexp_split_to_array(A.id, E'\\s*,\\s*')::int[])
You can then use array_agg(value), grouping by name, and format using array_to_string().
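Putting that together, an equally untested sketch of the full query, with the same assumptions about A and B as above:

select A.name,
       array_to_string(array_agg(B.value), ', ') as value
from A
join B on B.id = ANY(regexp_split_to_array(A.id, E'\\s*,\\s*')::int[])
group by A.name;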
Two notes, though:
It won't be as efficient as normalizing things.
The formatting itself ought to be done further down, in your views.