Count with group by on Postgresql - postgresql

I have a PostgreSQL type and a table:
CREATE TYPE mem_status AS ENUM('waiting', 'active', 'expired');
CREATE TABLE mems (
id BIGSERIAL PRIMARY KEY,
status mem_status NOT NULL
);
Dataset:
INSERT INTO mems(id, status) VALUES
(1, 'active'), (2, 'active'), (3, 'expired');
I want to count rows grouped by status, so I tried the query below.
WITH mem_statuses AS (
SELECT unnest(enum_range(NULL::mem_status)) AS status
)
SELECT m.status, count(1)
FROM mems m
RIGHT JOIN mem_statuses ms ON ms.status = m.status
GROUP BY m.status;
But if there are no waiting mems, the result looks like this:
status | count
================
NULL | 1 <- problem
'active' | 2
'expired' | 1
I want to get a result like this:
status | count
================
'waiting' | 0
'active' | 2
'expired' | 1
How can I do that?

Use count(id):
WITH mem_statuses AS (
SELECT unnest(enum_range(NULL::mem_status)) AS status
)
SELECT ms.status, count(id)
FROM mems m
RIGHT JOIN mem_statuses ms ON ms.status = m.status
GROUP BY ms.status;
or:
select status, count(id)
from unnest(enum_range(null::mem_status)) as status
left join mems using(status)
group by status
status | count
---------+-------
waiting | 0
active | 2
expired | 1
(3 rows)
Per the documentation count(expression) gives
number of input rows for which the value of expression is not null
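As a minimal illustration of that rule (the inline VALUES list and the column name v are made up for this sketch), count(*) counts every row while count(v) skips the NULL:

```sql
-- Three input rows, one of which is NULL.
SELECT count(*) AS all_rows,       -- counts every row: 3
       count(v) AS non_null_rows   -- counts rows where v IS NOT NULL: 2
FROM (VALUES (1), (NULL), (3)) AS t(v);
```

This is why count(id) returns 0 for 'waiting': the outer join still produces one row for that status, but its id is NULL.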

You need to modify the join and aggregate a bit -
select ms.status, count(m.status)
from (select unnest(enum_range(null::mem_status))) as ms(status)
left join mems as m
on ms.status = m.status
group by ms.status;

Related

Postgres SQL Query to total column with multiple filters

I have a Postgres table that contains a date and status field. I want to create a query that will return the date, plus the total number of records and then the total number of records for each status on that date.
Source Table:
job_id, process_datetime, process_status
The results I would like:
process_date | total_925_jobs | total_completed_925_jobs
2022-01-02 | 50 | 45
2022-01-03 | 150 | 135
I tried to join to subqueries, but it does not like the calculated date field.
SELECT
date(all_records.create_datetime) AS process_date,
total_jobs.total_925_jobs,
total_completed.total_completed_925_jobs
from "925-FilePreprocessing"
all_records
INNER JOIN
( SELECT
date("925-FilePreprocessing".create_datetime) AS total_process_date,
"925-FilePreprocessing".process_status,
COUNT("925-FilePreprocessing".file_preprocessing_id) as total_925_jobs
FROM
"925-FilePreprocessing"
where
"925-FilePreprocessing".create_datetime > '2022-01-01'
GROUP BY
total_process_date, process_status
) as "total_jobs"
ON date(all_records.create_datetime) = date(total_jobs.total_process_date)
INNER JOIN
(SELECT
date("925-FilePreprocessing".create_datetime) AS completed_process_date,
COUNT("925-FilePreprocessing".file_preprocessing_id) as total_completed_925_jobs
FROM
"925-FilePreprocessing"
where
"925-FilePreprocessing".create_datetime > '2022-01-01'
and ("925-FilePreprocessing".process_status = 'completed'
or "925-FilePreprocessing".process_status = 'completed-duplicated'
or "925-FilePreprocessing".process_status = 'completed-duplicated-published'
or "925-FilePreprocessing".process_status = 'completed-not_a_drawing'
)
GROUP BY
completed_process_date
) as "total_completed"
ON all_records.process_date = total_completed.completed_process_date
ORDER BY
process_date
I get an error:
ERROR: column all_records.process_date does not exist
LINE 42: ON all_records.process_date = total_completed.completed_pro...
^
A conditional count may be useful here.
Old way (using sum), before PostgreSQL 9.4:
select
a.process_datetime::DATE,
count(*) total_925_jobs,
sum ( case when a.process_status in ('completed',
'completed-duplicated',
'completed-duplicated-published',
'completed-not_a_drawing')
then 1
else 0 end) total_completed_925_jobs
from "925-FilePreprocessing" a
where a.process_datetime::DATE >= '2021-01-01'
group by a.process_datetime::DATE
New way (using filter), from PostgreSQL 9.4:
select
a.process_datetime::DATE,
count(*) total_925_jobs,
count(*) filter (where a.process_status in ('completed', 'completed-duplicated', 'completed-duplicated-published', 'completed-not_a_drawing')) total_completed_925_jobs
from "925-FilePreprocessing" a
where a.process_datetime::DATE >= '2021-01-01'
group by a.process_datetime::DATE
Going back to your query: I get the error "column 925-FilePreprocessing.create_datetime does not exist", which is different from yours. Check whether the table definition you provided is complete.
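The error reported in the question itself arises because process_date is only an output alias of the SELECT list, and such aliases are not visible inside a JOIN ... ON condition. A minimal sketch of the workaround, repeating the expression in the ON clause; the inline jobs data and the names job_date and total_jobs are hypothetical stand-ins, not the asker's real schema:

```sql
-- Hypothetical stand-in for the real table: two jobs on the same day.
WITH jobs(create_datetime) AS (
    VALUES (timestamp '2022-01-02 08:00'),
           (timestamp '2022-01-02 09:30')
)
SELECT j.create_datetime::date AS process_date,
       d.total_jobs
FROM jobs j
JOIN (
    SELECT create_datetime::date AS job_date,
           count(*)              AS total_jobs
    FROM jobs
    GROUP BY 1
) d
  -- ON j.process_date = d.job_date would fail: process_date is only an
  -- output alias and does not exist as a column of j.
  ON j.create_datetime::date = d.job_date;
```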
The result you would like:
process_date | total_925_jobs | total_completed_925_jobs
2022-01-02 | 50 | 45
2022-01-03 | 150 | 135
Since total_completed has far fewer rows than total_jobs, there are apparently only two dates later than '2022-01-01'.
The following query should give you that result. I removed a lot of unnecessary code.
For what GROUP BY 1 means, see: https://www.cybertec-postgresql.com/en/postgresql-group-by-expression/
WITH total_jobs AS (
SELECT
create_datetime::date AS total_process_date,
process_status,
COUNT(file_preprocessing_id) AS total_925_jobs
FROM
"925-FilePreprocessing"
WHERE
create_datetime::date > '2022-01-01'::date
GROUP BY
1,
2
),
total_completed AS (
SELECT
date("925-FilePreprocessing".create_datetime) AS completed_process_date,
COUNT(file_preprocessing_id) AS total_completed_925_jobs
FROM
"925-FilePreprocessing"
WHERE
create_datetime::date > '2022-01-01'
AND process_status IN ('completed', 'completed-duplicated', 'completed-duplicated-published', 'completed-not_a_drawing')
GROUP BY
1
)
SELECT
tk.*,
tp.total_completed_925_jobs
FROM
total_jobs tk
JOIN total_completed tp ON tk.total_process_date = tp.completed_process_date
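The positional GROUP BY used in the query above can be sketched in isolation. This is only an illustration on made-up inline data (the jobs CTE is hypothetical and stands in for "925-FilePreprocessing"): GROUP BY 1 groups by the first expression of the SELECT list.

```sql
-- Hypothetical inline data standing in for the real table.
WITH jobs(create_datetime) AS (
    VALUES (timestamp '2022-01-02 08:00'),
           (timestamp '2022-01-02 09:30'),
           (timestamp '2022-01-03 10:00')
)
SELECT create_datetime::date AS process_date,
       count(*)              AS total_jobs
FROM jobs
GROUP BY 1   -- 1 = first output column, i.e. create_datetime::date
ORDER BY 1;
```

The positional form saves repeating a long expression, at the cost of silently changing meaning if the SELECT list is reordered.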

Perform foreach loop on table column in PostgreSQL

I need to iterate through a table column and for each value to execute a simple SELECT statement.
I get the result table with the following statement:
SELECT event_id, count(event_id) as occurence
FROM event
GROUP BY event_id
ORDER BY occurence DESC
LIMIT 50
Output:
event_id | occurence
---------------------
1234567 | 56678
8901234 | 86753
For each event_id from the output table I need to execute a SELECT statement like:
SELECT * FROM event WHERE event_id = 'event_id from result row'
Expected output:
event_id | event_type | event_time
----------------------------
1234567 | ....... | .......
1234567 | ....... | .......
8901234 | ....... | .......
8901234 | ....... | .......
In other words: I need to get the 50 most frequently occurring event_ids from the event table and then retrieve all available data for those specific events.
How can I achieve that?
There are probably a few ways to handle this, but here is one:
SELECT a.*, b.event_type, b.event_time
FROM
(
SELECT event_id, count(event_id) as occurence
FROM event
GROUP BY event_id
ORDER BY occurence DESC
LIMIT 50
) a
JOIN event b ON (b.event_id = a.event_id)
;
Instead of specific columns from what I called 'b' you could select b.* for all columns.
No need to even join - just simply use a window function! See below:
SELECT *
FROM (
SELECT
*,
COUNT(*) OVER(PARTITION BY event_id) AS event_count
FROM event
) A
ORDER BY event_count DESC
LIMIT 50
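Note that LIMIT 50 in a query of this shape caps the number of rows returned, not the number of distinct event_ids. A hedged sketch of a join-free variant that keeps every row of the 50 most frequent event_ids, assuming the event table has the columns event_id, event_type and event_time shown in the question; the aliases event_count and event_rank are made up here:

```sql
SELECT event_id, event_type, event_time
FROM (
    SELECT c.*,
           -- All rows of one event_id share the same (event_count, event_id),
           -- so they get the same rank; each distinct event_id gets its own.
           dense_rank() OVER (ORDER BY event_count DESC, event_id) AS event_rank
    FROM (
        SELECT e.*,
               count(*) OVER (PARTITION BY event_id) AS event_count
        FROM event e
    ) c
) r
WHERE event_rank <= 50
ORDER BY event_count DESC, event_id;
```

Including event_id in the ORDER BY of dense_rank() breaks ties between event_ids with equal counts deterministically.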

Conditional Row_Number() for min and maximum date

I've got a table with data that looks like this:
Table T1
+----+------------+------------+
| ID | Udate | last_code |
+----+------------+------------+
| 1 | 05/11/2018 | ATTEMPT |
| 1 | 03/11/2018 | ATTEMPT |
| 1 | 01/11/2017 | INFO |
| 1 | 25/10/2016 | ARRIVED |
| 1 | 22/9/2016 | ARRIVED |
| 1 | 14/9/2016 | SENT |
| 1 | 1/9/2016 | SENT |
+----+------------+------------+
| 2 | 26/10/2016 | RECEIVED |
| 2 | 19/10/2016 | ARRIVED |
| 2 | 18/10/2016 | ARRIVED |
| 2 | 14/10/2016 | ANNOUNCED |
| 2 | 23/9/2016 | INFO |
| 2 | 14/9/2016 | DAMAGE |
| 2 | 2/9/2016 | SCHEDULED |
+----+------------+------------+
Each id has multiple codes at different dates and there is no pattern for them.
Overall I'm trying to get the last date and code, but if there is an "ATTEMPT" code, I need to get the first date and that code for each individual ID. Based on the table above, I would get:
+----+------------+------------+
| ID | Udate | last_code |
| 1 | 03/11/2018 | ATTEMPT |
| 2 | 26/10/2016 | RECEIVED |
+----+------------+------------+
I've been trying:
ROW_NUMBER() OVER (PARTITION BY ID
ORDER BY
(CASE WHEN code = 'ATTEMPT' THEN u_date END) ASC,
(CASE WHEN code_key <> 'ATTEMPT' THEN u_date END) DESC
) as RN
At the moment I'm stuck after using ROW_NUMBER() twice; I can't think of a way to bring them together in the same table.
,ROW_NUMBER() OVER (PARTITION BY id, code order by udate asc) as RN1
,ROW_NUMBER() OVER (PARTITION BY id order by udate desc) AS RN2
I'm not very familiar with CTEs, and I think this is one of those queries that may require one.
Thanks.
I think you have a couple of options before attempting a CTE.
Give these a try, examples below:
CREATE TABLE #TestData
(
[ID] INT
, [Udate] DATE
, [last_code] NVARCHAR(100)
);
INSERT INTO #TestData (
[ID]
, [Udate]
, [last_code]
)
VALUES ( 1, '11/05/2018', 'ATTEMPT ' )
, ( 1, '11/03/2018', 'ATTEMPT' )
, ( 1, '11/01/2017', 'INFO' )
, ( 1, '10/25/2016', 'ARRIVED' )
, ( 1, '9/22/2016 ', 'ARRIVED' )
, ( 1, '9/14/2016 ', 'SENT' )
, ( 1, '9/1/2016 ', 'SENT' )
, ( 2, '10/26/2016', 'RECEIVED' )
, ( 2, '10/19/2016', 'ARRIVED' )
, ( 2, '10/18/2016', 'ARRIVED' )
, ( 2, '10/14/2016', 'ANNOUNCED' )
, ( 2, '9/23/2016 ', 'INFO' )
, ( 2, '9/14/2016 ', 'DAMAGE' )
, ( 2, '9/2/2016 ', 'SCHEDULED' );
--option 1
--couple of outer apply
--1 - to get the min date for attempt
--2 - to get the max date regardless of the code
--where clause, using coalesce will pick what date. Use the date if I have one for code ='ATTEMPT', if not use the max date.
SELECT [a].*
FROM #TestData [a]
OUTER APPLY (
SELECT [b].[ID]
, MIN([b].[Udate]) AS [AttemptUdate]
FROM #TestData [b]
WHERE [b].[ID] = [a].[ID]
AND [b].[last_code] = 'ATTEMPT'
GROUP BY [b].[ID]
) AS [aa]
OUTER APPLY (
SELECT [c].[ID]
, MAX([c].[Udate]) AS [MaxUdate]
FROM #TestData [c]
WHERE [c].[ID] = [a].[ID]
GROUP BY [c].[ID]
) AS [cc]
WHERE [a].[ID] = COALESCE([aa].[ID], [cc].[ID])
AND [a].[Udate] = COALESCE([aa].[AttemptUdate], [cc].[MaxUdate]);
--use window functions
--Similar in that we are finding the max Udate and also min Udate when last_code='ATTEMPT'
--Then using COALESCE in the where clause to evaluate which one to use.
--Maybe a little cleaner
SELECT [td].[ID]
, [td].[Udate]
, [td].[last_code]
FROM (
SELECT [ID]
, [last_code]
, [Udate]
, MAX([Udate]) OVER ( PARTITION BY [ID] ) AS [MaxUdate]
, MIN( CASE WHEN [last_code] = 'ATTEMPT' THEN [Udate]
ELSE NULL
END
) OVER ( PARTITION BY [ID] ) AS [AttemptUdate]
FROM #TestData
) AS [td]
WHERE [td].[Udate] = COALESCE([td].[AttemptUdate], [td].[MaxUdate]);
To explain a little how I got there, it was primarily based on your requirement:
Overall I'm trying to get the last date and code, but if there is an
"ATTEMPT" code, I need to get the first date and that code for each
individual ID.
So for each ID I needed a way to get:
Minimum Udate for last_code = 'ATTEMPT' per ID - if there was no ATTEMPT we'll get a null
Maximum Udate for all records per ID
If I can determine the above for each record based on ID, then my final result set is basically the records where Udate equals the maximum Udate when the minimum is NULL. If the minimum isn't NULL, use that instead.
The first option, using 2 outer applies is doing each of the points above.
Minimum Udate for last_code = 'ATTEMPT' per ID - if there was no ATTEMPT we'll get a null:
OUTER APPLY (
SELECT [b].[ID]
, MIN([b].[Udate]) AS [AttemptUdate]
FROM #TestData [b]
WHERE [b].[ID] = [a].[ID]
AND [b].[last_code] = 'ATTEMPT'
GROUP BY [b].[ID]
) AS [aa]
Outer Apply as I might not have an ATTEMPT record for a given ID, so in those situations it returns NULL.
Maximum Udate for all records per ID:
OUTER APPLY (
SELECT [c].[ID]
, MAX([c].[Udate]) AS [MaxUdate]
FROM #TestData [c]
WHERE [c].[ID] = [a].[ID]
GROUP BY [c].[ID]
) AS [cc]
Then the where clause compares what was returned by those to return only the records I want:
[a].[Udate] = COALESCE([aa].[AttemptUdate], [cc].[MaxUdate]);
I'm using COALESCE to handle NULLs. COALESCE evaluates its arguments from left to right and returns the first non-NULL value.
Using this with Udate, we can decide which Udate value the filter should use to satisfy the requirement.
If there is an ATTEMPT record, AttemptUdate has a value and is used in the filter first. If there is no ATTEMPT record, AttemptUdate is NULL, so MaxUdate is used.
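The COALESCE choice described above can be sketched on its own. This is only an illustration: the inline values mirror the per-ID dates from the sample data, and the names attempt_udate, max_udate and udate_to_match are made up for the example.

```sql
-- Hypothetical per-ID values mirroring the sample data:
-- ID 1 has an ATTEMPT date, ID 2 does not.
SELECT t.id,
       COALESCE(t.attempt_udate, t.max_udate) AS udate_to_match
FROM (VALUES
        (1, CAST('2018-11-03' AS date), CAST('2018-11-05' AS date)),
        (2, CAST(NULL AS date),         CAST('2016-10-26' AS date))
     ) AS t(id, attempt_udate, max_udate);
```

For ID 1 the first argument is non-NULL and is returned; for ID 2 it is NULL, so the second argument is returned instead.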
Option 2 is similar, just approached a little differently.
Minimum Udate for last_code = 'ATTEMPT' per ID - if there was no ATTEMPT we'll get a null:
MIN( CASE WHEN [last_code] = 'ATTEMPT' THEN [Udate]
ELSE NULL
END
) OVER ( PARTITION BY [ID] ) AS [AttemptUdate]
This takes the MIN of Udate, but uses a CASE expression so that only ATTEMPT records are considered. OVER (PARTITION BY ...) applies it per partition, which I defined as ID.
Maximum Udate for all records per ID:
MAX([Udate]) OVER ( PARTITION BY [ID] ) AS [MaxUdate]
Go get me the maximum Udate based on ID, since that's how I told it to partition it.
I do all that in a sub-query to make the where clause easier to work with. Then it's the same as before when filtering:
[td].[Udate] = COALESCE([td].[AttemptUdate], [td].[MaxUdate]);
Using COALESCE to determine which date I should be using and only return the records I want.
To go a little deeper with the second option: if you run just the subquery, you'll see that each individual record carries the two driving points of the requirement:
What's the max Udate per ID
What's the min Udate of last_code = 'ATTEMPT' per ID
From there I can just filter on those records satisfying what I was originally looking for, using a COALESCE to simplify my filter.
[td].[Udate] = COALESCE([td].[AttemptUdate], [td].[MaxUdate]);
Use AttemptUdate unless it's NULL then use MaxUdate.

How to get id of the row which was selected by aggregate function? [duplicate]

This question already has answers here:
Select first row in each GROUP BY group?
I have the following data:
id | name | amount | datefrom
---------------------------
3 | a | 8 | 2018-01-01
4 | a | 3 | 2018-01-15 10:00
5 | b | 1 | 2018-02-20
I can group the result with the following query:
select name, max(amount) from table group by name
But I need the id of selected row too. Thus I have tried:
select max(id), name, max(amount) from table group by name
As expected, it returns:
id | name | amount
-----------
4 | a | 8
5 | b | 1
But I need the id to have 3 for the amount of 8:
id | name | amount
-----------
3 | a | 8
5 | b | 1
Is this possible?
PS. This is required for a billing task. On 2018-01-15 the configuration of a was changed: the user consumed some resource for 10h with the amount of 8 and for the remaining 14h of the day with the amount of 3. I need to bill such a day by the maximum value, so the row with id = 4 is simply ignored for 2018-01-15. (For the next day, 2018-01-16, I will bill the amount of 3.)
So I take for billing the row:
3 | a | 8 | 2018-01-01
And if something is wrong with it, I must report that the row with id = 3 is wrong.
But when I use an aggregate function, the information about the id is lost.
It would be awesome if this were possible:
select current(id), name, max(amount) from table group by name
select aggregated_row(id), name, max(amount) from table group by name
Here aggregated_row refers to the row that was selected by the aggregate function max.
UPD
I solved the task as:
SELECT
(
SELECT id FROM t2
WHERE id = ANY ( ARRAY_AGG( tf.id ) ) AND amount = MAX( tf.amount )
) id,
name,
MAX(amount) ma,
SUM( ratio )
FROM t2 tf
GROUP BY name
UPD
It would be much better to use window functions
There are at least 3 ways, see below:
CREATE TEMP TABLE test (
id integer, name text, amount numeric, datefrom timestamptz
);
COPY test FROM STDIN (FORMAT csv);
3,a,8,2018-01-01
4,a,3,2018-01-15 10:00
5,b,1,2018-02-20
6,b,1,2019-01-01
\.
Method 1. using DISTINCT ON (PostgreSQL-specific)
SELECT DISTINCT ON (name)
id, name, amount
FROM test
ORDER BY name, amount DESC, datefrom ASC;
Method 2. using window functions
SELECT id, name, amount FROM (
SELECT *, row_number() OVER (
PARTITION BY name
ORDER BY amount DESC, datefrom ASC) AS __rn
FROM test) AS x
WHERE x.__rn = 1;
Method 3. using correlated subquery
SELECT id, name, amount FROM test
WHERE id = (
SELECT id FROM test AS t2
WHERE t2.name = test.name
ORDER BY amount DESC, datefrom ASC
LIMIT 1
);
You need DISTINCT ON which filters the first row per group.
SELECT DISTINCT ON (name)
*
FROM table
ORDER BY name, amount DESC
You need a nested inner join. Try this -
SELECT T.id, T2.name, T2.amount
FROM TABLE T
INNER JOIN (SELECT name, MAX(amount) amount
FROM TABLE
GROUP BY name) T2
ON T.name = T2.name
AND T.amount = T2.amount

PostgreSQL Window Function "column must appear in the GROUP BY clause"

I'm trying to get a leaderboard of summed user scores from a list of user score entries. A single user can have more than one entry in this table.
I have the following table:
rewards
=======
user_id | amount
I want to add up all of the amount values for given users and then rank them on a global leaderboard. Here's the query I'm trying to run:
SELECT user_id, SUM(amount) AS score, rank() OVER (PARTITION BY user_id) FROM rewards;
I'm getting the following error:
ERROR: column "rewards.user_id" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: SELECT user_id, SUM(amount) AS score, rank() OVER (PARTITION...
Isn't user_id already in an "aggregate function" because I'm trying to partition on it? The PostgreSQL manual shows the following entry which I feel is a direct parallel of mine, so I'm not sure why mine's not working:
SELECT depname, empno, salary, avg(salary) OVER (PARTITION BY depname) FROM empsalary;
They're not grouping by depname, so how come theirs works?
For example, for the following data:
user_id | score
===============
1 | 2
1 | 3
2 | 5
3 | 1
I would expect the following output (I have made a "tie" between users 1 and 2):
user_id | SUM(score) | rank
===========================
1 | 5 | 1
2 | 5 | 1
3 | 1 | 3
So user 1 has a total score of 5 and is ranked #1, user 2 is tied with a score of 5 and thus is also rank #1, and user 3 is ranked #3 with a score of 1.
You need to GROUP BY user_id since it's not being aggregated. Then you can rank by SUM(score) descending as you want;
SELECT user_id, SUM(score), RANK() OVER (ORDER BY SUM(score) DESC)
FROM rewards
GROUP BY user_id;
user_id | sum | rank
---------+-----+------
1 | 5 | 1
2 | 5 | 1
3 | 1 | 3
There is a difference between window functions and aggregate functions. Some functions can be used both as a window function and an aggregate function, which can cause confusion. Window functions can be recognized by the OVER clause in the query.
In your case the query is split into two steps: first an aggregate grouped by user_id, then a window function over total_amount.
SELECT user_id, total_amount, RANK() OVER (ORDER BY total_amount DESC)
FROM (
SELECT user_id, SUM(amount) total_amount
FROM rewards
GROUP BY user_id
) q
ORDER BY total_amount DESC
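To illustrate the distinction between the two kinds of function, here is a small sketch in which the same SUM is used first as a window function and then as an aggregate. The sample rows from the question are inlined as a CTE so the statements are self-contained; the alias user_total is made up.

```sql
-- Window function: OVER keeps all 4 input rows and adds the per-user total.
WITH rewards(user_id, amount) AS (
    VALUES (1, 2), (1, 3), (2, 5), (3, 1)
)
SELECT user_id,
       amount,
       SUM(amount) OVER (PARTITION BY user_id) AS user_total
FROM rewards;

-- Aggregate function: GROUP BY collapses the input to one row per user.
WITH rewards(user_id, amount) AS (
    VALUES (1, 2), (1, 3), (2, 5), (3, 1)
)
SELECT user_id,
       SUM(amount) AS user_total
FROM rewards
GROUP BY user_id;
```

The first form needs no GROUP BY because nothing is collapsed; the second fails without it, which is the error shown in the question.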
If you have
SELECT user_id, SUM(amount) ....
^^^
aggregate function (not window function)
....
FROM .....
You need
GROUP BY user_id