Postgres percent_rank() across rows not column - postgresql

Data:
I have a Postgres table where each row contains a question_id and counts of how often users pressed each button.
+----------+-----------+------+-----+----------+
| Question | Very Good | Good | Bad | Very Bad |
+----------+-----------+------+-----+----------+
| 1 | 23 | 12 | 23 | 67 |
+----------+-----------+------+-----+----------+
| 2 | 56 | 90 | 23 | 18 |
+----------+-----------+------+-----+----------+
Requirement:
I want to convert each value in a row into a percentage of that row's total.
+----------+-----------+-------+-------+----------+
| Question | Very Good | Good | Bad | Very Bad |
+----------+-----------+-------+-------+----------+
| 1 | 18.4 | 9.6 | 18.4 | 53.6 |
+----------+-----------+-------+-------+----------+
| 2 | 29.94 | 48.12 | 12.29 | 9.6 |
+----------+-----------+-------+-------+----------+
Attempt:
I found that percent_rank() will show me a percentage computed down a column. Is there a similar function that works row-wise?
SELECT
question_id,
PERCENT_RANK() OVER (
ORDER BY Very_good
),
PERCENT_RANK() OVER (
ORDER BY Good
),
PERCENT_RANK() OVER (
ORDER BY Bad
),
PERCENT_RANK() OVER (
ORDER BY Very_bad
)
FROM Question_feedback

I'm afraid the only thing that will work is to do this manually:
SELECT
question_id,
100 * Very_good::double precision / (Very_good + Good + Bad + Very_bad) AS very_good_pct,
100 * Good::double precision / (Very_good + Good + Bad + Very_bad) AS good_pct,
100 * Bad::double precision / (Very_good + Good + Bad + Very_bad) AS bad_pct,
100 * Very_bad::double precision / (Very_good + Good + Bad + Very_bad) AS very_bad_pct
FROM Question_feedback
The good news is that this is cheaper than PERCENT_RANK: each result depends only on the values in its own row, so nothing has to be sorted or compared across rows.
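The row-wise arithmetic can be checked outside the database with a minimal Python sketch. The function name row_percentages, the dictionary keys, and the choice to return None for an all-zero row are assumptions made for illustration; they are not part of the original answer.

```python
def row_percentages(counts, ndigits=2):
    """Convert one row of button counts into percentages of the row total."""
    total = sum(counts.values())
    if total == 0:
        # Assumption: an all-zero row has no meaningful percentages.
        return {name: None for name in counts}
    return {name: round(100 * value / total, ndigits)
            for name, value in counts.items()}


# The two rows from the question.
question_1 = {"very_good": 23, "good": 12, "bad": 23, "very_bad": 67}
question_2 = {"very_good": 56, "good": 90, "bad": 23, "very_bad": 18}

print(row_percentages(question_1))
print(row_percentages(question_2))
```

In SQL, the same guard against an all-zero row could be written by wrapping the denominator in NULLIF(..., 0).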

Working Solution
WITH QUESTION_FEEDBACK
AS (SELECT 1 AS QUESTION,
23 VERYGOOD,
12 GOOD,
23 BAD,
67 VERYBAD
UNION ALL
SELECT 2 AS QUESTION,
56 VERYGOOD,
90 GOOD,
23 BAD,
18 VERYBAD
)
SELECT QUESTION,
VERYGOOD,
GOOD,
BAD,
VERYBAD,
(CAST(VERYGOOD AS DECIMAL) / SUM (VERYGOOD + GOOD + BAD + VERYBAD) OVER (PARTITION BY QUESTION))*100 VERYGOODPER,
(CAST(GOOD AS DECIMAL) / SUM (VERYGOOD + GOOD + BAD + VERYBAD) OVER (PARTITION BY QUESTION))*100 GOODPER,
(CAST(BAD AS DECIMAL) / SUM (VERYGOOD + GOOD + BAD + VERYBAD) OVER (PARTITION BY QUESTION) )*100 BADPER,
(CAST(VERYBAD AS DECIMAL) / SUM (VERYGOOD + GOOD + BAD + VERYBAD) OVER (PARTITION BY QUESTION))*100 VERYBADPER
FROM QUESTION_FEEDBACK

Related

PostgresQL for each row, generate new rows and merge

I have a table called example that looks as follows:
ID | MIN | MAX |
1 | 1 | 5 |
2 | 34 | 38 |
I need to take each ID and loop from its min to its max, incrementing by 2, and get the following WITHOUT using INSERT statements, i.e. in a single SELECT:
ID | INDEX | VALUE
1 | 1 | 1
1 | 2 | 3
1 | 3 | 5
2 | 1 | 34
2 | 2 | 36
2 | 3 | 38
Any ideas of how to do this?
The set-returning function generate_series does exactly that:
SELECT
id,
generate_series(1, (max-min)/2+1) AS index,
generate_series(min, max, 2) AS value
FROM
example;
The index can alternatively be generated with RANK() (see also @a_horse_with_no_name's answer) if you don't want to rely on the parallel sets.
Use generate_series() to generate the numbers and a window function to calculate the index:
select e.id,
row_number() over (partition by e.id order by g.value) as index,
g.value
from example e
cross join generate_series(e.min, e.max, 2) as g(value);
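To make the expected expansion concrete, here is a small Python sketch of the same idea: one output row per step between min and max, with a 1-based index per ID. The function name expand and the tuple layout are hypothetical and serve only to illustrate the result the queries above should produce.

```python
def expand(rows, step=2):
    """For each (id, min, max) row, emit (id, index, value) for min, min+step, ... up to max."""
    out = []
    for row_id, lo, hi in rows:
        # range() excludes its upper bound, so add 1 to include max itself.
        for index, value in enumerate(range(lo, hi + 1, step), start=1):
            out.append((row_id, index, value))
    return out


example = [(1, 1, 5), (2, 34, 38)]
for row in expand(example):
    print(row)
```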

kdb+ equivalent of SQL's rank() and dense_rank()

Has anyone ever had to simulate the results of SQL's rank(), dense_rank(), and row_number() in kdb+? Here is some SQL to demonstrate the features. If anyone has a specific solution for the example below, perhaps I could work on generalising it to support multiple partition and order by columns, and post back on this site.
CREATE TABLE student(course VARCHAR(10), mark int, name varchar(10));
INSERT INTO student VALUES
('Maths', 60, 'Thulile'),
('Maths', 60, 'Pritha'),
('Maths', 70, 'Voitto'),
('Maths', 55, 'Chun'),
('Biology', 60, 'Bilal'),
('Biology', 70, 'Roger');
SELECT
RANK() OVER (PARTITION BY course ORDER BY mark DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY course ORDER BY mark DESC) AS dense_rank,
ROW_NUMBER() OVER (PARTITION BY course ORDER BY mark DESC) AS row_num,
course, mark, name
FROM student ORDER BY course, mark DESC;
+------+------------+---------+---------+------+---------+
| rank | dense_rank | row_num | course | mark | name |
+------+------------+---------+---------+------+---------+
| 1 | 1 | 1 | Biology | 70 | Roger |
| 2 | 2 | 2 | Biology | 60 | Bilal |
| 1 | 1 | 1 | Maths | 70 | Voitto |
| 2 | 2 | 2 | Maths | 60 | Thulile |
| 2 | 2 | 3 | Maths | 60 | Pritha |
| 4 | 3 | 4 | Maths | 55 | Chun |
+------+------------+---------+---------+------+---------+
Here is some kdb+ to generate the equivalent student table:
student:([] course:`Maths`Maths`Maths`Maths`Biology`Biology;
mark:60 60 70 55 60 70;
name:`Thulile`Pritha`Voitto`Chun`Bilal`Roger)
Thank you!
If you sort the table initially by course and mark:
student:`course xasc `mark xdesc ([] course:`Maths`Maths`Maths`Maths`Biology`Biology;mark:60 60 70 55 60 70;name:`Thulile`Pritha`Voitto`Chun`Bilal`Roger)
course mark name
--------------------
Biology 70 Roger
Biology 60 Bilal
Maths 70 Voitto
Maths 60 Thulile
Maths 60 Pritha
Maths 55 Chun
Then you can use something like the below to achieve your output:
update rank_sql:first row_num by course,mark from update dense_rank:1+where count each (where differ mark)cut mark,row_num:1+rank i by course from student
course mark name dense_rank row_num rank_sql
------------------------------------------------
Biology 70 Roger 1 1 1
Biology 60 Bilal 2 2 2
Maths 70 Voitto 1 1 1
Maths 60 Thulile 2 2 2
Maths 60 Pritha 2 3 2
Maths 55 Chun 3 4 4
This solution uses rank and the virtual index column if you would like to read up further on these.
For a table that is already sorted by the target columns:
q) dense_sql:{sums differ x}
q) rank_sql:{raze #'[1_deltas b,1+count x;b:1+where differ x]} / group sizes taken from the start positions, so ties in last place are handled too
q) row_sql:{1+til count x}
q) student:`course xasc `mark xdesc ([] course:`Maths`Maths`Maths`Maths`Biology`Biology;mark:60 60 70 55 60 70;name:`Thulile`Pritha`Voitto`Chun`Bilal`Roger)
q)update row_num:row_sql mark,rank_s:rank_sql mark,dense_s:dense_sql mark by course from student
This is what I can think of for now.
Note: the rank function in kdb+ ranks in ascending order, so I created the functions below.
I would not xdesc the table, as I can just take the column vector and sort it with desc:
q)denseF
{((desc distinct x)?x)+1}
q)rankF
{((desc x)?x)+1}
q)update dense_rank:denseF mark,rank_rank:rankF mark,row_num:1+rank i by course from student
course  mark name    dense_rank rank_rank row_num
-------------------------------------------------
Maths   60   Thulile 2          2         1
Maths   60   Pritha  2          2         2
Maths   70   Voitto  1          1         3
Maths   55   Chun    3          4         4
Biology 60   Bilal   2          2         1
Biology 70   Roger   1          1         2
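As a language-neutral reference for what the three SQL functions are expected to return, the following Python sketch computes rank, dense_rank and row_number per partition with a descending sort. The function name sql_ranks and the dict-based row layout are assumptions for illustration; this is not a translation of any of the q answers above.

```python
from itertools import groupby
from operator import itemgetter


def sql_ranks(rows, partition_key, order_key):
    """Add rank, dense_rank and row_num to each row, ordering by order_key DESC per partition."""
    out = []
    ordered = sorted(rows, key=itemgetter(partition_key))
    for _, group in groupby(ordered, key=itemgetter(partition_key)):
        # sorted() is stable, so tied rows keep their input order.
        part = sorted(group, key=itemgetter(order_key), reverse=True)
        rank = dense = 0
        previous = object()  # sentinel that never equals a real value
        for row_num, row in enumerate(part, start=1):
            if row[order_key] != previous:
                rank = row_num   # rank jumps to the row position after ties
                dense += 1       # dense_rank only increases by one
                previous = row[order_key]
            out.append({**row, "rank": rank, "dense_rank": dense, "row_num": row_num})
    return out


student = [
    {"course": "Maths", "mark": 60, "name": "Thulile"},
    {"course": "Maths", "mark": 60, "name": "Pritha"},
    {"course": "Maths", "mark": 70, "name": "Voitto"},
    {"course": "Maths", "mark": 55, "name": "Chun"},
    {"course": "Biology", "mark": 60, "name": "Bilal"},
    {"course": "Biology", "mark": 70, "name": "Roger"},
]

for r in sql_ranks(student, "course", "mark"):
    print(r["rank"], r["dense_rank"], r["row_num"], r["course"], r["mark"], r["name"])
```

Note that row_number for tied rows is arbitrary in SQL; this sketch breaks ties by input order.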

PostgreSQL calculate with calculated value from previous rows

The problem I need to solve:
In order to calculate the number of hours per day that are used for (public) holidays or days of illness, the average working hours are used from the previous 3 months (with a starting value of 8 hours per day).
The tricky part is that the calculated value of the previous month needs to be factored in. If there was a public holiday last month that was assigned a calculated value of 8.5 hours, those calculated hours influence the average working hours per day for that month, which in turn is used to assign working hours to the current month's holidays.
So far I only have come up with the following, which doesn't factor in the row-by-row calculation, yet:
WITH
const (h_target, h_extra) AS (VALUES (8.0, 20)),
monthly_sums (c_month, d_work, d_off, h_work) AS (VALUES
('2018-12', 16, 5, 150.25),
('2019-01', 20, 3, 171.25),
('2019-02', 15, 5, 120.5)
),
calc AS (
SELECT
ms.*,
(ms.d_work + ms.d_off) AS d_total,
(ms.h_work + ms.d_off * const.h_target) AS h_total,
(avg((ms.h_work + ms.d_off * const.h_target) / (ms.d_work + ms.d_off))
OVER (ORDER BY ms.c_month ROWS BETWEEN 2 PRECEDING AND CURRENT ROW))::numeric(10,2)
AS h_off
FROM monthly_sums AS ms
CROSS JOIN const
)
SELECT
calc.c_month,
calc.d_work,
calc.d_off,
calc.d_total,
calc.h_work,
calc.h_off,
(d_off * lag(h_off, 1, const.h_target) OVER (ORDER BY c_month)) AS h_off_sum,
(h_work + d_off * lag(h_off, 1, const.h_target) OVER (ORDER BY c_month)) AS h_sum
FROM calc CROSS JOIN const;
...giving the following result:
c_month | d_work | d_off | d_total | h_work | h_off | h_off_sum | h_sum
---------+--------+-------+---------+--------+-------+-----------+--------
2018-12 | 16 | 5 | 21 | 150.25 | 9.06 | 40.0 | 190.25
2019-01 | 20 | 3 | 23 | 171.25 | 8.77 | 27.18 | 198.43
2019-02 | 15 | 5 | 20 | 120.5 | 8.52 | 43.85 | 164.35
(3 rows)
This calculates correctly for the first row and for the second row for columns that rely on previous row values (lag) but the average hours per day calculation is obviously wrong as I couldn't figure out how to feed the current row value (h_sum) back into the calculation for the new h_off.
The desired result should be as follows:
c_month | d_work | d_off | d_total | h_work | h_off | h_off_sum | h_sum
---------+--------+-------+---------+--------+-------+-----------+--------
2018-12 | 16 | 5 | 21 | 150.25 | 9.06 | 40.0 | 190.25
2019-01 | 20 | 3 | 23 | 171.25 | 8.84 | 27.18 | 198.43
2019-02 | 15 | 5 | 20 | 120.5 | 8.64 | 44.2 | 164.7
(3 rows)
...meaning h_off is used for the next month's h_off_sum and the resulting h_sum, and the h_sum values of the available months (at most three) in turn feed into the calculation of the current month's h_off (essentially avg(h_sum / d_total) over up to three months).
So, actual calculation is:
c_month | calculation | h_off
---------+----------------------------------------------------+-------
| | 8.00 << initial
.---------------------- uses ---------------------^
2018-12 | ((190.25 / 21)) / 1 | 9.06
.------------ uses ---------------^
2019-01 | ((190.25 / 21) + (198.43 / 23)) / 2 | 8.84
.--- uses --------^
2019-02 | ((190.25 / 21) + (198.43 / 23) + (164.7 / 20)) / 3 | 8.64
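The dependency chain in the diagram can be written as a simple loop, sketched here in Python. The function name holiday_hours, the tuple layout, and the use of Decimal with half-up rounding to two places are assumptions for illustration; the loop only restates the calculation described above and is not a SQL solution.

```python
from decimal import Decimal, ROUND_HALF_UP

TWO_PLACES = Decimal("0.01")


def holiday_hours(months, h_target=Decimal("8.00"), window=3):
    """Carry each month's h_off forward into the next month's h_off_sum.

    h_off is the average of h_sum / d_total over the last `window` months.
    """
    ratios = []            # h_sum / d_total for every month seen so far
    previous_h_off = h_target
    result = []
    for c_month, d_work, d_off, h_work in months:
        d_total = d_work + d_off
        h_off_sum = d_off * previous_h_off      # days off valued at last month's h_off
        h_sum = Decimal(h_work) + h_off_sum
        ratios.append(h_sum / d_total)
        recent = ratios[-window:]
        h_off = (sum(recent) / len(recent)).quantize(TWO_PLACES, rounding=ROUND_HALF_UP)
        result.append((c_month, h_off_sum, h_sum, h_off))
        previous_h_off = h_off
    return result


monthly_sums = [
    ("2018-12", 16, 5, "150.25"),
    ("2019-01", 20, 3, "171.25"),
    ("2019-02", 15, 5, "120.5"),
]

for c_month, h_off_sum, h_sum, h_off in holiday_hours(monthly_sums):
    print(c_month, h_off_sum, h_sum, h_off)
```

For the three sample months this yields h_off values of 9.06, 8.84 and 8.64, matching the desired result table.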
P.S.: I am using PostgreSQL 11, so I have the latest features at hand if that makes any difference.
I wasn't able to solve this inter-column and inter-row calculation with window functions alone. I had to fall back to a recursive CTE and introduce special-purpose columns for the days (d_total_1) and hours (h_sum_1) of the third historical month, because the recursive temporary table cannot be joined more than once.
In addition, I added a 4th row to the input data and an index column (row_num) to join on, which would usually be produced by a sub-query like this:
SELECT ROW_NUMBER() OVER (ORDER BY c_month) AS row_num, * FROM monthly_sums
So, here's my take at it:
WITH RECURSIVE calc AS (
SELECT
monthly_sums.row_num,
monthly_sums.c_month,
monthly_sums.d_work,
monthly_sums.d_off,
monthly_sums.h_work,
(monthly_sums.d_off * 8)::numeric(10,2) AS h_off_sum,
monthly_sums.d_work + monthly_sums.d_off AS d_total,
0.0 AS d_total_1,
(monthly_sums.h_work + monthly_sums.d_off * 8)::numeric(10,2) AS h_sum,
0.0 AS h_sum_1,
(
(monthly_sums.h_work + monthly_sums.d_off * 8)
/
(monthly_sums.d_work + monthly_sums.d_off)
)::numeric(10,2) AS h_off
FROM
(
SELECT * FROM (VALUES
(1, '2018-12', 16, 5, 150.25),
(2, '2019-01', 20, 3, 171.25),
(3, '2019-02', 15, 5, 120.5),
(4, '2019-03', 19, 2, 131.75)
) AS tmp (row_num, c_month, d_work, d_off, h_work)
) AS monthly_sums
WHERE
monthly_sums.row_num = 1
UNION ALL
SELECT
monthly_sums.row_num,
monthly_sums.c_month,
monthly_sums.d_work,
monthly_sums.d_off,
monthly_sums.h_work,
lat_off.h_off_sum::numeric(10,2),
lat_days.d_total,
calc.d_total AS d_total_1,
lat_sum.h_sum::numeric(10,2),
calc.h_sum AS h_sum_1,
lat_calc.h_off::numeric(10,2)
FROM
(
SELECT * FROM (VALUES
(1, '2018-12', 16, 5, 150.25),
(2, '2019-01', 20, 3, 171.25),
(3, '2019-02', 15, 5, 120.5),
(4, '2019-03', 19, 2, 131.75)
) AS tmp (row_num, c_month, d_work, d_off, h_work)
) AS monthly_sums
INNER JOIN calc ON (calc.row_num = monthly_sums.row_num - 1),
LATERAL (SELECT monthly_sums.d_work + monthly_sums.d_off AS d_total) AS lat_days,
LATERAL (SELECT monthly_sums.d_off * calc.h_off AS h_off_sum) AS lat_off,
LATERAL (SELECT monthly_sums.h_work + lat_off.h_off_sum AS h_sum) AS lat_sum,
LATERAL (SELECT
(calc.h_sum_1 + calc.h_sum + lat_sum.h_sum)
/
(calc.d_total_1 + calc.d_total + lat_days.d_total)
AS h_off
) AS lat_calc
WHERE
monthly_sums.row_num > 1
)
SELECT c_month, d_work, d_off, d_total, h_work, h_off, h_off_sum, h_sum FROM calc
;
...which gives:
c_month | d_work | d_off | d_total | h_work | h_off | h_off_sum | h_sum
---------+--------+-------+---------+--------+-------+-----------+--------
2018-12 | 16 | 5 | 21 | 150.25 | 9.06 | 40.00 | 190.25
2019-01 | 20 | 3 | 23 | 171.25 | 8.83 | 27.18 | 198.43
2019-02 | 15 | 5 | 20 | 120.5 | 8.65 | 44.15 | 164.65
2019-03 | 19 | 2 | 21 | 131.75 | 8.00 | 17.30 | 149.05
(4 rows)
(The result differs slightly from the one initially expected: this query divides the summed hours by the summed days of up to three months instead of averaging the monthly ratios, and it rounds intermediate values to numeric(10,2).)
Please note that PostgreSQL is generally pretty picky about data types and refuses to process queries like this whenever there is a discrepancy that could potentially lead to loss of precision (e.g. numeric vs. integer), which is why I have used explicit types for the columns in both places.
One of the final pieces of the puzzle was solved by using LATERAL subqueries, which let one calculation reference the result of a previous one and let me reorder columns in the final output independently of the calculation hierarchy.
If anyone can come up with a simpler variant I'd be happy to learn about it.
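For cross-checking the recursive query, here is a Python sketch of the arithmetic it performs: h_off is the sum of h_sum over up to three months divided by the sum of d_total over the same months, with values rounded to two places at each step. The function name pooled_holiday_hours and the half-up rounding mode are assumptions chosen to mirror the numeric(10,2) casts.

```python
from decimal import Decimal, ROUND_HALF_UP


def round2(value):
    """Round to two decimal places, half away from zero."""
    return value.quantize(Decimal("0.01"), rounding=ROUND_HALF_UP)


def pooled_holiday_hours(months, h_target=Decimal("8"), window=3):
    """h_off = sum(h_sum) / sum(d_total) over the last `window` months."""
    history = []           # (h_sum, d_total) per month
    previous_h_off = h_target
    result = []
    for c_month, d_work, d_off, h_work in months:
        d_total = d_work + d_off
        h_sum = round2(Decimal(h_work) + d_off * previous_h_off)
        history.append((h_sum, d_total))
        recent = history[-window:]
        h_off = round2(sum(h for h, _ in recent) / sum(d for _, d in recent))
        result.append((c_month, h_sum, h_off))
        previous_h_off = h_off
    return result


monthly_sums_4 = [
    ("2018-12", 16, 5, "150.25"),
    ("2019-01", 20, 3, "171.25"),
    ("2019-02", 15, 5, "120.5"),
    ("2019-03", 19, 2, "131.75"),
]

for row in pooled_holiday_hours(monthly_sums_4):
    print(*row)
```

With the four sample months this gives the same h_sum and h_off columns as the query output shown above.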

de-aggregate for table columns in Greenplum

I am using Greenplum, and I have data like:
id | val
----+-----
12 | 12
12 | 23
12 | 34
13 | 23
13 | 34
13 | 45
(6 rows)
somehow I want the result like:
id | step
----+-----
12 | 12
12 | 11
12 | 11
13 | 23
13 | 11
13 | 11
(6 rows)
How this comes about:
First, there should be a window function that executes a de-aggregate function, partitioned by id.
The column val is a cumulative value, and what I want to get is the step values.
Maybe I can do it like:
select deagg(val) over (partition by id) from table_name;
So I need the deagg function.
Thanks for your help!
P.S. Greenplum is based on PostgreSQL v8.2.
You can just use the LAG function:
SELECT id,
val - lag(val, 1, 0) over (partition BY id ORDER BY val) as step
FROM yourTable
Note that lag() takes three arguments here. The first is the column to look back on, the second is the offset (1 means the previous row), and the third is the default value that lag returns when there is no previous row (here zero).
Here is the result this query would generate, with the intermediate lag value shown:
id | val | lag(val, 1, 0) | val - lag(val, 1, 0)
----+-----+----------------+----------------------
12 | 12 | 0 | 12
12 | 23 | 12 | 11
12 | 34 | 23 | 11
13 | 23 | 0 | 23
13 | 34 | 23 | 11
13 | 45 | 34 | 11
Second note: This answer assumes that you want to compute your rolling difference in order of val ascending. If you want a different order you can change the ORDER BY clause of the partition.
val seems to be a cumulative sum. You can "unaggregate" it by subtracting the previous val from the current val, e.g., by using the lag function. Just note you'll have to treat the first value in each group specially, as lag will return null:
SELECT id, val - COALESCE(LAG(val) OVER (PARTITION BY id ORDER BY val), 0) AS val
FROM mytable;
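The same "subtract the previous value, defaulting to zero" logic can be sketched in Python to verify the expected steps. The function name steps and the tuple layout are hypothetical; like the queries above, the sketch assumes val is processed in ascending order within each id.

```python
def steps(rows):
    """Turn cumulative (id, val) pairs into per-row steps within each id."""
    previous = {}  # last cumulative value seen per id
    out = []
    for id_, val in sorted(rows):  # sorts by id, then by val ascending
        # Like lag(val, 1, 0): the first row of each id subtracts 0.
        out.append((id_, val - previous.get(id_, 0)))
        previous[id_] = val
    return out


data = [(12, 12), (12, 23), (12, 34), (13, 23), (13, 34), (13, 45)]
for row in steps(data):
    print(row)
```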

PostgreSQL - Pull earliest timestamp per user

I have a table which records each time the user performs a certain behavior, with timestamps for each iteration. I need to pull one row from each user with the earliest timestamp as part of a nested query.
As an example, the table looks like this:
+ row | user_id | timestamp | description
+ 1 | 100 | 02-02-2010| android
+ 2 | 100 | 02-03-2010| ios
+ 3 | 100 | 02-05-2010| windows
+ 4 | 111 | 02-01-2010| ios
+ 5 | 112 | 02-03-2010| android
+ 6 | 112 | 02-04-2010| android
And my query should pull just rows 1, 4 and 5.
Thanks!
This should help. I don't understand the nested-query part of your question.
SELECT user_id, MIN(timestamp) AS min_timestamp
FROM table1
GROUP BY user_id
ORDER BY user_id;
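The query above returns only user_id and the minimum timestamp, whereas the question asks for the whole row (rows 1, 4 and 5). In PostgreSQL this is commonly done with SELECT DISTINCT ON (user_id) ordered by user_id and timestamp. The selection logic is sketched below in Python; the function name earliest_per_user and the month-day-year reading of the sample dates are assumptions.

```python
from datetime import datetime


def earliest_per_user(rows):
    """Return the full row with the earliest timestamp for each user_id."""
    best = {}  # user_id -> (parsed timestamp, row)
    for row in rows:
        # Assumption: the sample dates are written as MM-DD-YYYY.
        ts = datetime.strptime(row["timestamp"], "%m-%d-%Y")
        current = best.get(row["user_id"])
        if current is None or ts < current[0]:
            best[row["user_id"]] = (ts, row)
    return [row for _, row in sorted(best.values(), key=lambda pair: pair[1]["user_id"])]


events = [
    {"row": 1, "user_id": 100, "timestamp": "02-02-2010", "description": "android"},
    {"row": 2, "user_id": 100, "timestamp": "02-03-2010", "description": "ios"},
    {"row": 3, "user_id": 100, "timestamp": "02-05-2010", "description": "windows"},
    {"row": 4, "user_id": 111, "timestamp": "02-01-2010", "description": "ios"},
    {"row": 5, "user_id": 112, "timestamp": "02-03-2010", "description": "android"},
    {"row": 6, "user_id": 112, "timestamp": "02-04-2010", "description": "android"},
]

for row in earliest_per_user(events):
    print(row)
```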