PostgreSQL group by error - postgresql

I have a relation in a PostgreSQL database called 'processed_data' having the following schema:
Date -> date type, shop_id -> integer type, item_category_id -> integer type, sum_item_cnt_day -> real type.
Displaying the first 5 rows of the relation is as follows:
date | shop_id | item_category_id | sum_item_cnt_day
------+-----------+--------------------+------------------
2014-12-29 | 49 | 3 | 4
2014-12-29 | 49 | 6 | 1
2014-12-29 | 49 | 7 | 1
2014-12-29 | 49 | 12 | 3
2014-12-29 | 49 | 16 | 1
Now, the 'shop_id' has 60 unique shops ranging from 0-59 where each shop sells some items grouped to a new column 'item_category_id' where 'sum_item_cnt_day' denotes the number of items sold by a shop and it's item_category_id.
I am now trying to further aggregate the data by just trying to get the following columns as final result-
date, shop_id, sum_item_cnt_day
So that, data is aggregated according to number of all items sold in 'item_category_id' per shop (denoted by 'shop_id') and calculating sum of 'sum_item_cnt_day'.
When I try to execute the following SQL command-
select date, shop_id, sum(sum_item_cnt_day) from processed_data group by shop_id;
It gives the error-
ERROR: column "processed_data.date" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select date, shop_id, sum(sum_item_cnt_day) from processed_d...
^
Even the following SQL command-
select date, shop_id, sum(sum_item_cnt_day) from processed_data where date between '2013-01-01' and '2013-01-31' group by shop_id;
Gives the error-
ERROR: column "processed_data.date" must appear in the GROUP BY clause or be used in an aggregate function
LINE 1: select date, shop_id, sum(sum_item_cnt_day) from processed_d...
^
Any suggestions as to what's going wrong and what am I missing?
Thanks!

The simplest fix, which is what I think you want, would be to just add date to the GROUP BY clause:
SELECT date, shop_id, SUM(sum_item_cnt_day)
FROM processed_data
GROUP BY date, shop_id;
If you really don't want sums taken for each shop on each day, but rather for each shop over all days, then you will have to think of which of the many dates you want to display.

Related

ORDER BY MIN date

I need to fetch the number of employees per month, having a first work in a selected period. And I have to display only the month when the employee appears for the first time. My request works fine, but I need to order the result by date. Here is my request:
SELECT TO_CHAR(sub.minStartDate,'mm/YYYY') as date,
COUNT(DISTINCT sub.id) AS nombre
FROM (
SELECT MIN(sw.start_date) as minStartDate,
e.id
FROM employee e
INNER JOIN social_work sw ON e.id = sw.employee_id
GROUP BY e.id
HAVING MIN(sw.start_date) BETWEEN '2020-01-01' AND '2022-12-31'
) sub
GROUP BY date
ORDER BY date
And the result:
date | nombre
--------------
04/2021 | 2
05/2020 | 1
Excepted output:
date | nombre
--------------
05/2020 | 1
04/2021 | 2
I've tried to put sub.minStartDate in the ORDER BY clause but then I also have to put it in GROUP BY clause, what gives me this output :
date | nombre
--------------
05/2020 | 1
04/2021 | 1
04/2021 | 1
And it's not what I want.
You're ordering by date, which is the result of the TO_CHAR() function. The TO_CHAR() function returns a text, so your ORDER BY clause results in an alphanumeric sort.
Since you don't want to ORDER BY sub.minStartDate, you could try changing your format to put the least significant variable of the date (in this case, the month) to the right: TO_CHAR(sub.minStartDate, 'YYYY/mm').
If you can't change your format either, then you'll probably have to resort to grouping and ordering by minStartDate:
SELECT
TO_CHAR(sub.minStartDate,'mm/YYYY') as date,
TO_CHAR(sub.minStartDate,'YYYY/mm') sortingDate,
COUNT(DISTINCT sub.id) AS nombre
FROM
-- omitted for simplicity
GROUP BY date, sortingDate
ORDER BY sortingDate

How to find MAX(date) from BETWEEN(dates) in column 2 with DUPLICATES in column 1?

I have a Database that has product names in column 1 and product release dates in column 2. I want to find 'old' products by their release date. However, I'm only interested in finding 'old' products that released a minimum of 1 year ago. I cannot make any edits to the original database infrastructure.
The table looks like this:
Product| Release_Day
A | 2018-08-23
A | 2017-08-23
A | 2019-08-21
B | 2018-08-22
B | 2016-08-22
B | 2017-08-22
C | 2018-10-25
C | 2016-10-25
C | 2019-08-19
I have already tried multiple versions of DISTINCT, MAX, BETWEEN, >, <, etc.
SELECT DISTINCT product,MAX(release_day) as most_recent_release
FROM Product_Release
WHERE
release_day between '2015-08-22' and '2018-08-22'
and release_day not between '2018-08-23' and '2019-08-22'
GROUP BY 1
ORDER BY MAX(release_day) DESC
The expected results should not contain any products found by this query:
SELECT DISTINCT product,MAX(release_day) as most_recent_release
FROM Product_Release
WHERE
release_day between '2018-08-23' and '2019-08-22'
AND product = A
GROUP BY 1
However, every check I complete returns a product from this date range.
This is the output of the initial query:
Product|Most_Recent_Release
A | 2018-08-23
B | 2018-08-22
C | 2015-10-25
And, for example, if I run the check query on Product A, I get this:
Product|Most_Recent_Release
A | 2019-08-21
Use HAVING to filter on most_recent_release
SELECT product, MAX(release_day) as most_recent_release
FROM Product_Release
GROUP BY product
HAVING most_recent_release < '2018-08-23'
ORDER BY most_recent_release DESC
There's no need to use DISTINCT when you use GROUP BY -- you can't get duplicates if there's only one row per product.

Postgresql: Create a date sequence, use it in date range query

I'm not great with SQL but I have been making good progress on a project up to this point. Now I am completely stuck.
I'm trying to get a count for the number of apartments with each status. I want this information for each day so that I can trend it over time. I have data that looks like this:
table: y_unit_status
unit | date_occurred | start_date | end_date | status
1 | 2017-01-01 | 2017-01-01 | 2017-01-05 | Occupied No Notice
1 | 2017-01-06 | 2017-01-06 | 2017-01-31 | Occupied Notice
1 | 2017-02-01 | 2017-02-01 | | Vacant
2 | 2017-01-01 | 2017-01-01 | | Occupied No Notice
And I want to get output that looks like this:
date | occupied_no_notice | occupied_notice | vacant
2017-01-01 | 2 | 0 | 0
...
2017-01-10 | 1 | 1 | 0
...
2017-02-01 | 1 | 0 | 1
Or, this approach would work:
date | status | count
2017-01-01 | occupied no notice | 2
2017-01-01 | occupied notice | 0
date_occurred: Date when the status of the unit changed
start_date: Same as date_occurred
end_date: Date when status stopped being x and changed to y.
I am pulling in the number of bedrooms and a property id so the second approach of selecting counts for one status at a time would produce a relatively large number of rows vs. option 1 (if that matters).
I've found a lot of references that have gotten me close to what I'm looking for but I always end up with a sort of rolling, cumulative count.
Here's my query, which produces a column of dates and counts, which accumulate over time rather than reflecting a snapshot of counts for a particular day. You can see my references to another table where I'm pulling in a property id. The table schema is Property -> Unit -> Unit Status.
WITH t AS(
SELECT i::date from generate_series('2016-06-29', '2017-08-03', '1 day'::interval) i
)
SELECT t.i as date,
u.hproperty,
count(us.hmy) as count --us.hmy is the id
FROM t
LEFT OUTER JOIN y_unit_status us ON t.i BETWEEN us.dtstart AND
us.dtend
INNER JOIN y_unit u ON u.hmy = us.hunit -- to get property id
WHERE us.sstatus = 'Occupied No Notice'
AND t.i >= us.dtstart
AND t.i <= us.dtend
AND u.hproperty = '1'
GROUP BY t.i, u.hproperty
ORDER BY t.i
limit 1500
I also tried a FOR loop, iterating over the dates to determine cases where the date was between start and end but my logic wasn't working. Thanks for any insight!
You are on the right track, but you'll need to handle NULL values in end_date. If those means that status is assumed to be changed somewhere in the future (but not sure when it will change), the containment operators (#> and <#) for the daterange type are perfect for you (because ranges can be "unbounded"):
with params as (
select date '2017-01-01' date_from,
date '2017-02-02' date_to
)
select date_from + d, status, count(unit)
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') #> date_from + d
group by 1, 2
To achieve the first variant, you can use conditional aggregation:
with params as (
select date '2017-01-01' date_from,
date '2017-02-02' date_to
)
select date_from + d,
count(unit) filter (where status = 'Occupied No Notice') occupied_no_notice,
count(unit) filter (where status = 'Occupied Notice') occupied_notice,
count(unit) filter (where status = 'Vacant') vacant
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') #> date_from + d
group by 1
Notes:
The syntax filter (where <predicate>) is new to 9.4+. Before that, you can use CASE (and the fact that most aggregate functions does not include NULL values) to emulate it.
You can even index the expression daterange(start_date, end_date, '[]') (using gist) for better performance.
http://rextester.com/HWKDE34743

Postgres aggregate sum conditional on row comparison

So, I have data that looks something like this
User_Object | filesize | created_date | deleted_date
row 1 | 40 | May 10 | Aug 20
row 2 | 10 | June 3 | Null
row 3 | 20 | Nov 8 | Null
I'm building statistics to record user data usage to graph based on time based datapoints. However, I'm having difficulty developing a query to take the sum for each row of all queries before it, but only for the rows that existed at the time of that row's creation. Before taking this step to incorporate deleted values, I had a simple naive query like this:
SELECT User_Object.id, User_Object.created, SUM(filesize) OVER (ORDER BY User_Object.created) AS sum_data_used
FROM User_Object
JOIN user ON User_Object.user_id = user.id
WHERE user.id = $1
However, I want to alter this somehow so that there's a conditional for the the window function to only get the sum of any row created before this User Object when that row doesn't have a deleted date also before this User Object.
This incorrect syntax illustrates what I want to do:
SELECT User_Object.id, User_Object.created,
SUM(CASE WHEN NOT window_function_row.deleted
OR window_function_row.deleted > User_Object.created
THEN filesize ELSE 0)
OVER (ORDER BY User_Object.created) AS sum_data_used
FROM User_Object
JOIN user ON User_Object.user_id = user.id
WHERE user.id = $1
When this function runs on the data that I have, it should output something like
id | created | sum_data_used|
1 | May 10 | 40
2 | June 3 | 50
3 | Nov 8 | 30
Something along these lines may work for you:
SELECT a.user_id
,MIN(a.created_date) AS created_date
,SUM(b.filesize) AS sum_data_used
FROM user_object a
JOIN user_object b ON (b.user_id <= a.user_id
AND COALESCE(b.deleted_date, a.created_date) >= a.created_date)
GROUP BY a.user_id
ORDER BY a.user_id
For each row, self-join, match id lower or equal, and with date overlap. It will be expensive because each row needs to look through the entire table to calculate the files size result. There is no cumulative operation taking place here. But I'm not sure there is a way that.
Example table definition:
create table user_object(user_id int, filesize int, created_date date, deleted_date date);
Data:
1;40;2016-05-10;2016-08-29
2;10;2016-06-03;<NULL>
3;20;2016-11-08;<NULL>
Result:
1;2016-05-10;40
2;2016-06-03;50
3;2016-11-08;30

adding missing date in a table in PostgreSQL

I have a table that contains data for every day in 2002, but it has some missing dates. Namely, 354 records for 2002 (instead of 365). For my calculations, I need to have the missing data in the table with Null values
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | 65.6 | 2002-05-09 |
| 103 | 75.9 | 2002-05-10 |
+-----+------------+------------+
you see that 2002-05-08 is missing. I want my final table to be like:
+-----+------------+------------+
| ID | rainfall | date |
+-----+------------+------------+
| 100 | 110.2 | 2002-05-06 |
| 101 | 56.6 | 2002-05-07 |
| 102 | | 2002-05-08 |
| 103 | 65.6 | 2002-05-09 |
| 104 | 75.9 | 2002-05-10 |
+-----+------------+------------+
Is there a way to do that in PostgreSQL?
It doesn't matter if I have the result just as a query result (not necessarily an updated table)
date is a reserved word in standard SQL and the name of a data type in PostgreSQL. PostgreSQL allows it as identifier, but that doesn't make it a good idea. I use thedate as column name instead.
Don't rely on the absence of gaps in a surrogate ID. That's almost always a bad idea. Treat such an ID as unique number without meaning, even if it seems to carry certain other attributes most of the time.
In this particular case, as #Clodoaldo commented, thedate seems to be a perfect primary key and the column id is just cruft - which I removed:
CREATE TEMP TABLE tbl (thedate date PRIMARY KEY, rainfall numeric);
INSERT INTO tbl(thedate, rainfall) VALUES
('2002-05-06', 110.2)
, ('2002-05-07', 56.6)
, ('2002-05-09', 65.6)
, ('2002-05-10', 75.9);
Query
Full table by query:
SELECT x.thedate, t.rainfall -- rainfall automatically NULL for missing rows
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
LEFT JOIN tbl t USING (thedate)
ORDER BY x.thedate
Similar to what #a_horse_with_no_name posted, but simplified and ignoring the pruned id.
Fills in gaps between first and last date found in the table. If there can be leading / lagging gaps, extend accordingly. You can use date_trunc() like #Clodoaldo demonstrated - but his query suffers from syntax errors and can be simpler.
INSERT missing rows
The fastest and most readable way to do it is a NOT EXISTS anti-semi-join.
INSERT INTO tbl (thedate, rainfall)
SELECT x.thedate, NULL
FROM (
SELECT generate_series(min(thedate), max(thedate), '1d')::date AS thedate
FROM tbl
) x
WHERE NOT EXISTS (SELECT 1 FROM tbl t WHERE t.thedate = x.thedate)
Just do an outer join against a query that returns all dates in 2002:
with all_dates as (
select date '2002-01-01' + i as date_col
from generate_series(0, extract(doy from date '2002-12-31')::int - 1) as i
)
select row_number() over (order by ad.date_col) as id,
t.rainfall,
ad.date_col as date
from all_dates ad
left join your_table t on ad.date_col = t.date
order by ad.date_col;
This will not change your table, it will just produce the result as desired.
Note that the generated id column will not contain the same values as the ID column in your table as it is merely a counter in the result set.
You could also replace the row_number() function with extract(doy from ad.date_col)
To fill the gaps. This will not reorder the IDs:
insert into t (rainfall, "date") values
select null, "date"
from (
select d::date as "date"
from (
t
right join
generate_series(
(select date_trunc('year', min("date")) from t)::timestamp,
(select max("date") from t),
'1 day'
) s(d) on t."date" = s.d::date
where t."date" is null
) q
) s
You have to fully re-create your table as indexes haves to change.
The better way to do it is to use your prefered dbi language, make a loop ignoring ID and putting values in a new table with new serialized IDs.
for day in (whole needed calendar)
value = select rainfall from oldbrokentable where date = day
insert into newcleanedtable date=day, rainfall=value, id=serialized
(That's not real code! Just conceptual to be adapted to your prefered scripting language)