Fill in missing rows when aggregating over multiple fields in Postgres - postgresql

I am aggregating sales for a set of products per day using Postgres and need to know not just when sales do happen, but also when they do not for further processing.
SELECT
sd.date,
COUNT(sd.sale_id) AS sales,
sd.product
FROM sales_data sd
-- sales per product, per day
GROUP BY sd.product, sd.date
ORDER BY sd.product, sd.date
This produces the following:
date | sales | product
------------+-------+-------------------
2017-08-17 | 10 | soap
2017-08-19 | 2 | soap
2017-08-20 | 5 | soap
2017-08-17 | 2 | shower gel
2017-08-21 | 1 | shower gel
As you can see - the date ranges per product are not continuous as sales_data just didn't contain any info for these products on some days.
What I'm aiming to do is to add a sales = 0 row for each product that is not sold on any day in a range - for example here, between 2017-08-17 and 2017-08-21 to give something like the the following:
date | sales | product
------------+-------+-------------------
2017-08-17 | 10 | soap
2017-08-18 | 0 | soap
2017-08-19 | 2 | soap
2017-08-20 | 5 | soap
2017-08-21 | 0 | soap
2017-08-17 | 2 | shower gel
2017-08-18 | 0 | shower gel
2017-08-19 | 0 | shower gel
2017-08-20 | 0 | shower gel
2017-08-21 | 1 | shower gel
In a simpler case where there was only a single product, it seems like the solution would be to use generate_series() i.e.:
create a full range of dates using generate_series
LEFT JOIN the already aggregated sales data onto the date series
COALESCE any NULL counts to 0 in the missing rows
The problem I have is that this approach does not seem to work dates repeat in the aggregated data as I'm grouping over not just multiple dates, but multiple products also.
It feels like I should be able to do something cunning with window functions here to solve this e.g. joining onto the full date range over partitions defined by the product name - but I can't see a way of actually getting this to work.

You could use:
WITH cte AS (
SELECT date, s.product
FROM ... -- some way to generate date series
CROSS JOIN (SELECT DISTINCT product FROM sales_data) s
)
SELECT
c.date,
c.product,
COUNT(sd.sale_id) AS sales
FROM cte c
LEFT JOIN sales_data sd
ON c.date = sd.date AND c.product= sd.product
GROUP BY c.date, c.product
ORDER BY c.date, c.product;
First create Cartesian product of dates and products, then LEFT JOIN to actual data and do calculations.
Oracle has great feature for this scenarios called Partitioned Outer Joins:
SELECT times.time_id, product, quantity
FROM inventory PARTITION BY (product)
RIGHT OUTER JOIN times ON (times.time_id = inventory.time_id)
WHERE times.time_id BETWEEN TO_DATE('01/04/01', 'DD/MM/YY')
AND TO_DATE('06/04/01', 'DD/MM/YY')
ORDER BY 2,1;

select
date,
count(sale_id) as sales,
product
from
sales_data
right join (
(
select d::date as date
from generate_series (
(select min(date) from sales_data),
(select max(date) from sales_data),
'1 day'
) gs (d)
) gs
cross join
(select distinct product from sales_data) p
) cj using (product, date)
group by product, date
order by product, date

Related

PostgreSQL Marketing Report

I'm writing out a query that takes ad marketing data from Google Ads, Microsoft, and Taboola and merges it into one table.
The table should have 3 rows, one for each ad company with 4 columns: traffic source (ad company), money spent, sales, and cost per conversion. Right now I'm just dealing with the first 2 till I get those right. The whole table's data should be grouped within that a given month's data.
Right now the results I'm getting are multiple rows from each traffic source, some of them merging months of data into the cost column instead of summing up the costs within a given month.
WITH google_ads AS
( SELECT 'Google' AS traffic_source,
date_trunc('month', "day"::date) AS month,
SUM(cost / 1000000) AS cost
FROM googleads_campaign AS g
GROUP BY month
ORDER BY month DESC),
taboola AS
( SELECT 'Taboola' AS traffic_source,
date_trunc('month', "date"::date) AS month,
SUM(spent) AS cost
FROM taboola_campaign AS t
GROUP BY month
ORDER BY month DESC),
microsoft AS
( SELECT 'Microsoft' AS traffic_source,
date_trunc('month', "TimePeriod"::date) AS month,
SUM("Spend") AS cost
FROM microsoft_campaign AS m
GROUP BY month
ORDER BY month DESC)
SELECT (CASE
WHEN M.traffic_source='Microsoft' THEN M.traffic_source
WHEN T.traffic_source='Taboola' THEN T.traffic_source
WHEN G.traffic_source='Google' THEN G.traffic_source
END) AS traffic_source1,
SUM(CASE
WHEN G.traffic_source='Google' THEN G.cost
WHEN T.traffic_source='Taboola' THEN T.cost
WHEN M.traffic_source='Microsoft' THEN M.cost
END) AS cost,
(CASE
WHEN G.traffic_source='Google' THEN G.month
WHEN T.traffic_source='Taboola' THEN T.month
WHEN M.traffic_source='Microsoft' THEN M.month
END) AS month1
FROM google_ads G
LEFT JOIN taboola T ON G.month = T.month
LEFT JOIN microsoft M ON G.month = M.month
GROUP BY traffic_source1, month1
Here's an example of the results I'm getting. The month column is simply for testing purposes.
| traffic_source1 | cost | month1 |
|:----------------|:-----------|:---------------|
| Google | 210.00 | 01/09/18 00:00 |
| Google | 1,213.00 | 01/10/18 00:00 |
| Google | 2,481.00 | 01/11/18 00:00 |
| Google | 3,503.00 | 01/12/18 00:00 |
| Google | 7,492.00 | 01/01/19 00:00 |
| Microsoft | 22,059.00 | 01/02/19 00:00 |
| Microsoft | 16,958.00 | 01/03/19 00:00 |
| Microsoft | 7,582.00 | 01/04/19 00:00 |
| Microsoft | 76,125.00 | 01/05/19 00:00 |
| Taboola | 37,205.00 | 01/06/19 00:00 |
| Google | 45,910.00 | 01/07/19 00:00 |
| Google | 137,421.00 | 01/08/19 00:00 |
| Google | 29,501.00 | 01/09/19 00:00 |
Instead, it should look like this (Let's say for the month of July this year, for instance):
| traffic_source | cost |
|----------------|-----------|
| Google | 53,901.00 |
| Microsoft | 22,059.00 |
| Taboola | 37,205.00 |
Any help would be greatly appreciated, thanks!
You can try this way:
WITH google_ads AS
( SELECT 'Google' AS traffic_source,
date_trunc('month', "day"::date) AS month,
SUM(cost / 1000000) AS cost
FROM googleads_campaign AS g
GROUP BY month
ORDER BY month DESC),
taboola AS
( SELECT 'Taboola' AS traffic_source,
date_trunc('month', "date"::date) AS month,
SUM(spent) AS cost
FROM taboola_campaign AS t
GROUP BY month
ORDER BY month DESC),
microsoft AS
( SELECT 'Microsoft' AS traffic_source,
date_trunc('month', "TimePeriod"::date) AS month,
SUM("Spend") AS cost
FROM microsoft_campaign AS m
GROUP BY month
ORDER BY month DESC)
SELECT (CASE
WHEN M.traffic_source='Microsoft' THEN M.traffic_source
WHEN T.traffic_source='Taboola' THEN T.traffic_source
WHEN G.traffic_source='Google' THEN G.traffic_source
END) AS traffic_source1,
SUM(CASE
WHEN G.traffic_source='Google' THEN G.cost
WHEN T.traffic_source='Taboola' THEN T.cost
WHEN M.traffic_source='Microsoft' THEN M.cost
END) AS cost,
(CASE
WHEN G.traffic_source='Google' THEN G.month
WHEN T.traffic_source='Taboola' THEN T.month
WHEN M.traffic_source='Microsoft' THEN M.month
END) AS month1
FROM google_ads G
LEFT JOIN taboola T ON G.month = T.month
LEFT JOIN microsoft M ON G.month = M.month
GROUP BY traffic_source1, month1
HAVING EXTRACT(month from month1) = ... desired month (July is 7)
The concept of a different table for each ad source is really a very bad idea. It vastly compounds the complexity of of queries requiring consolidation. It would be much better to have a single table having the source along with the other columns. Consider what happens when marketing decides to use 30-40 or more ad suppliers. If you cannot create a single table then at least standardize column names and types. Also build a view, a materialized view, or a table function (below) which combines all the traffic sources into a single source.
create or replace function consolidated_ad_traffic()
returns table( traffic_source text
, ad_month timestamp with time zone
, ad_cost numeric(11,2)
, ad_sales numeric(11,2)
, conversion_cost numeric(11,6)
)
language sql
AS $$
with ad_sources as
( select 'Google' as traffic_source
, "date" as ad_date
, round(cast (cost AS numeric ) / 1000000.0,2) as cost
, sales
, cost_per_conversion
from googleads_campaign
union all
select 'Taboola'
, "date"
, spent
, sales
, cost_per_conversion
from taboola_campaign
union all
select 'Microsoft'
, "TimePeriod"
, "Spend"
, sales
, cost_per_conversion
from microsoft_campaign
)
select * from ad_sources;
$$;
With a consolidated view of the data you can now write normal selects as though you had a single table. Such as:
select * from consolidated_ad_traffic();
select distinct on( traffic_source, to_char(ad_month, 'mm'))
traffic_source
, to_char(ad_month, 'Mon') "For Month"
, to_char(sum(ad_cost) over(partition by traffic_source, to_char(ad_month, 'Mon')), 'FM99,999,999,990.00') monthly_traffic_cost
, to_char(sum(ad_cost) over(partition by traffic_source), 'FM99,999,999,990.00') total_traffic_cost
from consolidated_ad_traffic();
select traffic_source, sum(ad_cost) ad_cost
from consolidated_ad_traffic()
group by traffic_source
order by traffic_source;
select traffic_source
, to_char(ad_month, 'dd-Mon') "For Month"
, sum(ad_cost) "Monthly Cost"
from consolidated_ad_traffic()
where date_trunc('month',ad_month) = date_trunc('month', date '2019-07-01')
and traffic_source = 'Google'
group by traffic_source, to_char(ad_month, 'dd-Mon') ;
Now this won't do much for updating but will drastically ease selection.

How to find MAX(date) from BETWEEN(dates) in column 2 with DUPLICATES in column 1?

I have a Database that has product names in column 1 and product release dates in column 2. I want to find 'old' products by their release date. However, I'm only interested in finding 'old' products that released a minimum of 1 year ago. I cannot make any edits to the original database infrastructure.
The table looks like this:
Product| Release_Day
A | 2018-08-23
A | 2017-08-23
A | 2019-08-21
B | 2018-08-22
B | 2016-08-22
B | 2017-08-22
C | 2018-10-25
C | 2016-10-25
C | 2019-08-19
I have already tried multiple versions of DISTINCT, MAX, BETWEEN, >, <, etc.
SELECT DISTINCT product,MAX(release_day) as most_recent_release
FROM Product_Release
WHERE
release_day between '2015-08-22' and '2018-08-22'
and release_day not between '2018-08-23' and '2019-08-22'
GROUP BY 1
ORDER BY MAX(release_day) DESC
The expected results should not contain any products found by this query:
SELECT DISTINCT product,MAX(release_day) as most_recent_release
FROM Product_Release
WHERE
release_day between '2018-08-23' and '2019-08-22'
AND product = A
GROUP BY 1
However, every check I complete returns a product from this date range.
This is the output of the initial query:
Product|Most_Recent_Release
A | 2018-08-23
B | 2018-08-22
C | 2015-10-25
And, for example, if I run the check query on Product A, I get this:
Product|Most_Recent_Release
A | 2019-08-21
Use HAVING to filter on most_recent_release
SELECT product, MAX(release_day) as most_recent_release
FROM Product_Release
GROUP BY product
HAVING most_recent_release < '2018-08-23'
ORDER BY most_recent_release DESC
There's no need to use DISTINCT when you use GROUP BY -- you can't get duplicates if there's only one row per product.

selecting records without value

I have a problem when I'm trying to reach the desired result. The task looks simple — make a daily count of occurrences of the event for top countries.
The main table looks like this:
id | date | country | col1 | col2 | ...
1 | 2018-01-01 21:21:21 | US | value 1 | value 2 | ...
2 | 2018-01-01 22:32:54 | UK | value 1 | value 2 | ...
From this table, I want to get daily event counts by the country, which is achieved by
SELECT date::DATE AT TIME ZONE 'UTC', country, COALESCE(count(id),0) FROM tab1
GROUP BY 1, 2
The problem comes when there is no event was made by an UK user on 2 January 2018
country_events
date | country | count
2018-01-01 | US | 23
2018-01-01 | UK | 5
2018-01-02 | US | 30
2018-01-02 | UK | 0 -> is desired result, but row is missing
I've tried to generate date series and series of countries which I'm looking for, then CROSS JOIN these two tables. This helper with columns date and country I've left joined with my result table like
SELECT * FROM helper h
LEFT JOIN country_events c ON c.date::DATE = h.date::DATE AND c.country = h.country
I'm using PostgreSQL.
You need an outer join, not a cross join:
SELECT tab1.date::date, tab1.country, coalesce(count(*), 0)
FROM generate_series(TIMESTAMP '2018-01-01 00:00:00',
TIMESTAMP '2018-01-31 00:00:00',
INTERVAL '1 day') AS ts(d)
LEFT JOIN tab1 ON tab1.date >= ts.d AND tab1.date < ts.d + INTERVAL '1 day'
GROUP BY tab1.date::date, tab1.country
ORDER BY tab1.date::date, tab1.country;
This will give the desired list for January 2018.

Postgresql: Create a date sequence, use it in date range query

I'm not great with SQL but I have been making good progress on a project up to this point. Now I am completely stuck.
I'm trying to get a count for the number of apartments with each status. I want this information for each day so that I can trend it over time. I have data that looks like this:
table: y_unit_status
unit | date_occurred | start_date | end_date | status
1 | 2017-01-01 | 2017-01-01 | 2017-01-05 | Occupied No Notice
1 | 2017-01-06 | 2017-01-06 | 2017-01-31 | Occupied Notice
1 | 2017-02-01 | 2017-02-01 | | Vacant
2 | 2017-01-01 | 2017-01-01 | | Occupied No Notice
And I want to get output that looks like this:
date | occupied_no_notice | occupied_notice | vacant
2017-01-01 | 2 | 0 | 0
...
2017-01-10 | 1 | 1 | 0
...
2017-02-01 | 1 | 0 | 1
Or, this approach would work:
date | status | count
2017-01-01 | occupied no notice | 2
2017-01-01 | occupied notice | 0
date_occurred: Date when the status of the unit changed
start_date: Same as date_occurred
end_date: Date when status stopped being x and changed to y.
I am pulling in the number of bedrooms and a property id so the second approach of selecting counts for one status at a time would produce a relatively large number of rows vs. option 1 (if that matters).
I've found a lot of references that have gotten me close to what I'm looking for but I always end up with a sort of rolling, cumulative count.
Here's my query, which produces a column of dates and counts, which accumulate over time rather than reflecting a snapshot of counts for a particular day. You can see my references to another table where I'm pulling in a property id. The table schema is Property -> Unit -> Unit Status.
WITH t AS(
SELECT i::date from generate_series('2016-06-29', '2017-08-03', '1 day'::interval) i
)
SELECT t.i as date,
u.hproperty,
count(us.hmy) as count --us.hmy is the id
FROM t
LEFT OUTER JOIN y_unit_status us ON t.i BETWEEN us.dtstart AND
us.dtend
INNER JOIN y_unit u ON u.hmy = us.hunit -- to get property id
WHERE us.sstatus = 'Occupied No Notice'
AND t.i >= us.dtstart
AND t.i <= us.dtend
AND u.hproperty = '1'
GROUP BY t.i, u.hproperty
ORDER BY t.i
limit 1500
I also tried a FOR loop, iterating over the dates to determine cases where the date was between start and end but my logic wasn't working. Thanks for any insight!
You are on the right track, but you'll need to handle NULL values in end_date. If those means that status is assumed to be changed somewhere in the future (but not sure when it will change), the containment operators (#> and <#) for the daterange type are perfect for you (because ranges can be "unbounded"):
with params as (
select date '2017-01-01' date_from,
date '2017-02-02' date_to
)
select date_from + d, status, count(unit)
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') #> date_from + d
group by 1, 2
To achieve the first variant, you can use conditional aggregation:
with params as (
select date '2017-01-01' date_from,
date '2017-02-02' date_to
)
select date_from + d,
count(unit) filter (where status = 'Occupied No Notice') occupied_no_notice,
count(unit) filter (where status = 'Occupied Notice') occupied_notice,
count(unit) filter (where status = 'Vacant') vacant
from params
cross join generate_series(0, date_to - date_from) d
left join y_unit_status on daterange(start_date, end_date, '[]') #> date_from + d
group by 1
Notes:
The syntax filter (where <predicate>) is new to 9.4+. Before that, you can use CASE (and the fact that most aggregate functions does not include NULL values) to emulate it.
You can even index the expression daterange(start_date, end_date, '[]') (using gist) for better performance.
http://rextester.com/HWKDE34743

Postgresql: get first item of an ordered group not working [duplicate]

This question already has answers here:
Select first row in each GROUP BY group?
(20 answers)
Closed 6 years ago.
I have this data:
| id | person_id | date |
|--------|-----------|---------------------|
| 313962 | 1111111 | 2016-04-14 16:00:00 | --> this row
| 313946 | 2222222 | 2015-03-13 15:00:00 | --> this row
| 313937 | 1111111 | 2014-02-12 14:00:00 |
| 313944 | 1111111 | 2013-01-11 13:00:00 |
| ...... | ....... | ................... |
-What I would like to select are the indicated rows, i.e. the rows with the most recent date for each person_id.
-Also the output format for the date must be dd-mm-YYYY
So far I was trying with this:
SELECT
l.person_id,
to_char(DATE(l.date), 'dd-mm-YYYY') AS user_date
FROM login l
group by l.person_id
order by l.date desc
I was trying different approaches, but I have all kind of Aggregation error messages such as:
for select distinct order by expressions must appear
And
must appear in the GROUP BY clause or be used in an aggregate function
Any idea?
There are several ways, but the simplest way (and perhaps more efficient - but not SQL standard) is to rely on Postgresql's DISTINCT ON:
SELECT DISTINCT ON (person_id )
id, person_id , date
FROM login
ORDER BY person_id , date desc
The date formatting (do you really want that?) can be done in a outer select:
SELECT id,person_id, to_char(DATE(date), 'dd-mm-YYYY') as date
FROM (
SELECT DISTINCT ON (person_id )
id, person_id , date
FROM login
ORDER BY person_id, date desc )
AS XXX;
You can do it with a subquery, something like this:
SELECT
l.person_id,
to_char(DATE(l.date), 'dd-mm-YYYY') AS user_date
FROM login l
where l.date = (select max(date) from login where person_id = l.person_id)
order by l.person_id
You need something like the following to know which date to grab for each person.
select l.person_id, to_char(DATE(d.maxdate), 'dd-mm-YYYY')
from login l
inner join
(select person_id, max(date) as maxdate
from login group by person_id) d on l.person_id = d.person_id
order by d.maxdate desc