postgresql Subselect Aggregate in larger query - postgresql

I'm working with a gigantic dataset of individuals with demographic information and action tracking. I am trying to get the percentage of people who committed an action, which is simple, but also trying to get average ages of people who fit in a specific subgroup of the original SELECT. The CASE WHEN line works fine alone, and the subquery runs fine in it's own query but I cannot seem to get it integrated into this query as a subquery, it gives me a syntax error on the CASE WHEN statement. Here's a slightly anonymized version of the query. Any help would be VERY appreciated.
SELECT
AVG(ageagg)
FROM
(
SELECT
age AS ageagg
FROM
agetable
WHERE
age>30
AND action_taken=1) AvgAge_30Action,
COUNT(
CASE
WHEN action_taken=1
AND age> 30
THEN 1
ELSE 0 NULL) / COUNT(
CASE
WHEN age>30) AS Over_30_Action
FROM
agetable
WHERE
website_type=3

If I've interpreted your intent correctly, you wish to compute the following:
1) the number of people over the age of 30 that took a specific action as a percentage of the total number of people over the age of 30
2) the average age of the people over the age of 30 that took a specific action
Assuming my interpretation is correct, this query might work for you:
SELECT
100 * over_30_action / over_30_total AS percentage_of_over_30_took_action,
average_age_of_over_30_took_action
FROM (
SELECT
SUM(CASE WHEN action_taken=1 THEN 1 ELSE 0 END) AS over_30_action,
COUNT(*) AS over_30_total,
AVG(CASE WHEN action_taken=1 THEN age ELSE NULL END)
AS average_age_of_over_30_took_action
FROM agetable
WHERE website_type=3 AND age>30
) aggregated;
I created a dummy table and populated it with the following data.
postgres=# select * from agetable order by website_type, action_taken, age;
age | action_taken | website_type
-----+--------------+--------------
33 | 1 | 1
32 | 1 | 2
28 | 1 | 3
29 | 1 | 3
32 | 1 | 3
33 | 1 | 3
34 | 1 | 3
32 | 2 | 3
32 | 3 | 3
33 | 4 | 3
34 | 5 | 3
33 | 6 | 3
34 | 7 | 3
35 | 8 | 3
(14 rows)
Of the 14 rows, 4 rows (the first four in this listing) have either the wrong website_type or have age below 30. Of the ten remaining rows, you can see that 3 of them have an action_taken of 1. So, the query should determine that 30% of folks over the age of 30 took a particular action, and the average age among that particular population should be 33 (ages 32, 33, and 34). The results of the query I posted:
percentage_of_over_30_took_action | average_age_of_over_30_took_action
-----------------------------------+------------------------------------
30 | 33.0000000000000000
(1 row)
Again, all of this is predicated upon my interpretation of your intent actually being accurate. This is of course based on a highly contrived data set, but hopefully it's enough of a functional signpost to get you on the right path.

Related

How to group by counted rows in Postgres?

If I have a table:
id | status
----+--------
2 | 200
1 | 0
4 | 100
3 | 200
5 | 200
I want to count the number of occurrences of each status. I have tried to use the COUNT/OVER function
SELECT status, COUNT(*) OVER () AS all, COUNT(*) OVER (PARTITION by status) as count FROM my_table;
The results are what is expected per the postgres docs on windows "However, window functions do not cause rows to become grouped into a single output row like non-window aggregate calls would. Instead, the rows retain their separate identities"
status | all | count
--------+-------+-------
0 | 5 | 1
100 | 5 | 1
200 | 5 | 3
200 | 5 | 3
200 | 5 | 3
How instead can get an output that combines the rows, so that I only get 1 row per unique status if the partition is required?
status | all | count
--------+-------+-------
0 | 5 | 1
100 | 5 | 1
200 | 5 | 3
No window function necessary in the first stage of the query, i.e. getting the counts per status. Window functions work on the result of the non-windowing part of the query, thus you can have a window function referring the aggregate & non-aggregate columns in a query. To get all_counts, it is sufficient to SUM the status_count over all the rows.
SELECT
status
, COUNT(*) status_count
, SUM(COUNT(*)) OVER () all_count
FROM my_table
GROUP BY status

PostgresQL for each row, generate new rows and merge

I have a table called example that looks as follows:
ID | MIN | MAX |
1 | 1 | 5 |
2 | 34 | 38 |
I need to take each ID and loop from it's min to max, incrementing by 2 and thus get the following WITHOUT using INSERT statements, thus in a SELECT:
ID | INDEX | VALUE
1 | 1 | 1
1 | 2 | 3
1 | 3 | 5
2 | 1 | 34
2 | 2 | 36
2 | 3 | 38
Any ideas of how to do this?
The set-returning function generate_series does exactly that:
SELECT
id,
generate_series(1, (max-min)/2+1) AS index,
generate_series(min, max, 2) AS value
FROM
example;
(online demo)
The index can alternatively be generated with RANK() (example, see also #a_horse_­with_­no_­name's answer) if you don't want to rely on the parallel sets.
Use generate_series() to generate the numbers and a window function to calculate the index:
select e.id,
row_number() over (partition by e.id order by g.value) as index,
g.value
from example e
cross join generate_series(e.min, e.max, 2) as g(value);

kdb+ equivalent of SQL's rank() and dense_rank()

Any one every have to simulate the result of SQL's rank(), dense_rank(), and row_number(), in kdb+? Here is some SQL to demonstrate the features. If anyone has a specific solution below, perhaps I could work on generalising it to support multiple partition and order by columns -- and post back on this site.
CREATE TABLE student(course VARCHAR(10), mark int, name varchar(10));
INSERT INTO student VALUES
('Maths', 60, 'Thulile'),
('Maths', 60, 'Pritha'),
('Maths', 70, 'Voitto'),
('Maths', 55, 'Chun'),
('Biology', 60, 'Bilal'),
('Biology', 70, 'Roger');
SELECT
RANK() OVER (PARTITION BY course ORDER BY mark DESC) AS rank,
DENSE_RANK() OVER (PARTITION BY course ORDER BY mark DESC) AS dense_rank,
ROW_NUMBER() OVER (PARTITION BY course ORDER BY mark DESC) AS row_num,
course, mark, name
FROM student ORDER BY course, mark DESC;
+------+------------+---------+---------+------+---------+
| rank | dense_rank | row_num | course | mark | name |
+------+------------+---------+---------+------+---------+
| 1 | 1 | 1 | Biology | 70 | Roger |
| 2 | 2 | 2 | Biology | 60 | Bilal |
| 1 | 1 | 1 | Maths | 70 | Voitto |
| 2 | 2 | 2 | Maths | 60 | Thulile |
| 2 | 2 | 3 | Maths | 60 | Pritha |
| 4 | 3 | 4 | Maths | 55 | Chun |
+------+------------+---------+---------+------+---------+
Here is some kdb+ to generate the equivalent student table:
student:([] course:`Maths`Maths`Maths`Maths`Biology`Biology;
mark:60 60 70 55 60 70;
name:`Thulile`Pritha`Voitto`Chun`Bilal`Roger)
Thank you!
If you sort the table initially by course and mark:
student:`course xasc `mark xdesc ([] course:`Maths`Maths`Maths`Maths`Biology`Biology;mark:60 60 70 55 60 70;name:`Thulile`Pritha`Voitto`Chun`Bilal`Roger)
course mark name
--------------------
Biology 70 Roger
Biology 60 Bilal
Maths 70 Voitto
Maths 60 Thulile
Maths 60 Pritha
Maths 55 Chun
Then you can use something like the below to achieve your output:
update rank_sql:first row_num by course,mark from update dense_rank:1+where count each (where differ mark)cut mark,row_num:1+rank i by course from student
course mark name dense_rank row_num rank_sql
------------------------------------------------
Biology 70 Roger 1 1 1
Biology 60 Bilal 2 2 2
Maths 70 Voitto 1 1 1
Maths 60 Thulile 2 2 2
Maths 60 Pritha 2 3 2
Maths 55 Chun 3 4 4
This solution uses rank and the virtual index column if you would like to read up further on these.
For table ordered by target columns:
q) dense_sql:{sums differ x}
q) rank_sql:{raze #'[(1_deltas b),1;b:1+where differ x]}
q) row_sql:{1+til count x}
q) student:`course xasc `mark xdesc ([] course:`Maths`Maths`Maths`Maths`Biology`Biology;mark:60 60 70 55 60 70;name:`Thulile`Pritha`Voitto`Chun`Bilal`Roger)
q)update row_num:row_sql mark,rank_s:rank_sql mark,dense_s:dense_sql mark by course from student
I can think of this as of now:
Note: The rank function in kdb works on asc list, so I created below functions.
I would not xdesc the table, as I can just use the vector column and desc it
q)denseF
{((desc distinct x)?x)+1}
q)rankF
{((desc x)?x)+1}
q)update dense_rank:denseF mark,rank_rank:rankF mark,row_num:1+rank i by course from student
course
mark name
dense_rank
rank_rank
row_num
Maths
60 Thulile
2
2
1
Maths
60 Pritha
2
2
2
Maths
70 Voitto
1
1
3
Maths
55 Chun
3
4
4
Biology
60 Bilal
2
2
1
Biology
70 Roger
1
1
2

Take new columns as output table - KDB

I have a query which returns results of data, which runs on a frequent basis. The new table will contain results of the old table as well but I only want to take whatever is in new in the most recent run of the new table and send that as an email. I already have the line for the email and trade data but just need a way to be able to:
display the results of the new table to be emailed
save the complete results of the new table to be used in the next run of the query
e.g.
Old results: tbl
| idx | name | age |
| 0 | Tom | 30 |
| 1 | Jerry | 25 |
| 2 | Bob | 30 |
| 3 | Ken | 45 |
New results: tbl
| idx | name | age |
| 0 | Tom | 30 |
| 1 | Jerry | 25 |
| 2 | Bob | 30 |
| 3 | Ken | 45 |
| 4 | Sam | 40 |
output required:
| 4 | Sam | 40 |
and then save the New results to be used in the next run
Thanks! :)
If the only changes between runs is that records are being appended onto the new table, you could just keep a variable denoting the last index seen and then select only those rows where idx is larger than that.
If the indexes are always increasing, this could be achieved using a query like
lastidx:exec last idx from tbl
select from tbl where idx>lastidx
If the idx values don't always increase monotonically, you could keep a count of the number of rows instead and only
lasti:count tbl
select from tbl where i>=lasti
This doesn't require saving the whole table in memory for use in the next iteration.
E.g to start with the old table had 4 rows so lasti = 4
q)tbl
idx name age
-------------
0 Tom 30
1 Jerry 25
2 Bob 30
3 Ken 45
q)lasti
4
The new table comes in and running the command selects the new row
q)tbl
idx name age
-------------
0 Tom 30
1 Jerry 25
2 Bob 30
3 Ken 45
4 Sam 40
q)select from tbl where i>lasti
idx name age
------------
4 Sam 40
lasti can then be updated to reflect the new count
q)lasti:count tbl
q)lasti
5
One way you can get this done, assuming the idx is the unique key :
q)old:([] idx:0 1 2 3; name:`T`J`B`K; age: 30 25 30 45)
q)new:old,enlist `idx`name`age!(4; `S;40) //new output from your query
q)out:()
q)if[0<count i:new[`idx] except old[`idx] ; out:new i ; old:new]
q)out
idx name age
------------
4 S 40
Another way, if your new records are always added to the last of old records:
q)old:([] idx:0 1 2 3; name:`T`J`B`K; age: 30 25 30 45)
q)i:count old
q)new:old,enlist `idx`name`age!(4; `S;40) //new output from your query
q)out:()
q)if[i<c:count new ; out:(i-c)#new ; old:new; i:c]
q)out
idx name age
------------
4 S 40

de-aggregate for table columns in Greenplum

I am using Greenplum, and I have data like:
id | val
----+-----
12 | 12
12 | 23
12 | 34
13 | 23
13 | 34
13 | 45
(6 rows)
somehow I want the result like:
id | step
----+-----
12 | 12
12 | 11
12 | 11
13 | 23
13 | 11
13 | 11
(6 rows)
How it comes:
First there should be a Window function, which execute a de-aggreagte function based on partition by id
the column val is cumulative value, and what I want to get is the step values.
Maybe I can do it like:
select deagg(val) over (partition by id) from table_name;
So I need the deagg function.
Thanks for your help!
P.S and Greenplum is based on postgresql v8.2
You can just use the LAG function:
SELECT id,
val - lag(val, 1, 0) over (partition BY id ORDER BY val) as step
FROM yourTable
Note carefully that lag() has three parameters. The first is the column for which to find the lag, the second indicates to look at the previous record, and the third will cause lag to return a default value of zero.
Here is a table showing the table this query would generate:
id | val | lag(val, 1, 0) | val - lag(val, 1, 0)
----+-----+----------------+----------------------
12 | 12 | 0 | 12
12 | 23 | 12 | 11
12 | 34 | 23 | 11
13 | 23 | 0 | 23
13 | 34 | 23 | 11
13 | 45 | 34 | 11
Second note: This answer assumes that you want to compute your rolling difference in order of val ascending. If you want a different order you can change the ORDER BY clause of the partition.
val seems to be a cumulative sum. You can "unaggregate" it by subtracting the previous val from the current val, e.g., by using the lag function. Just note you'll have to treat the first value in each group specially, as lag will return null:
SELECT id, val - COALESCE(LAG(val) OVER (PARTITION BY id ORDER BY val), 0) AS val
FROM mytable;