Generate missing data and fill it down - postgresql

I have the dataset:
The problem is that the records are added only if an event happened, e.g. for the row with id 13897, the record was updated on 4/18/2020 and then on 5/1/2020 - the status was changed. What I need is the status of each record at the end of every month.
I was thinking about the below logic:
1. generate the series of dates from the min(date) till now - T1
2. get distinct id from the dataset - T2
3. do cross join between two above tables so that we get a new row for every row in the second table - T3
4. extract the dataset with all required fields - T4
5. merge T3 and T4 by concatenate(date and id) - T5
6. sort T5 by id and d asc - T5
7. fill-down all the fields grouped by id - T5
8. generate the series of dates from min(date) till now with the interval of one month and get the last day of each month - T6
9. merge T5 and T6 by date - right join so that we get only rows with the date = end of month
I am on step 6.
SELECT *
FROM (SELECT d, Concat(dt, t2.id) AS cnct
      FROM (SELECT d, d::date AS dt
            FROM generate_series(
                   (SELECT min(created_at::date) FROM new_table),
                   CURRENT_DATE, interval '1 day') d) t1
      CROSS JOIN
           (SELECT DISTINCT id FROM new_table) t2) t3
-- keep only the latest row per id and day, in case a record with the same id was updated several times throughout the day
LEFT JOIN (WITH cte AS
             (SELECT id, status,
                     created_at at time zone 'eat' at time zone 'utc' AS "created_at",
                     updated_at::date AS date, updated_at::date,
                     row_number() OVER (PARTITION BY id, updated_at::date ORDER BY updated_at DESC) rn
              FROM new_table)
           SELECT cte.*, Concat(updated_at::date, id) AS cnct
           FROM cte
           WHERE rn = 1) t4
ON t3.cnct = t4.cnct
I am stuck on step 7. I found "fill column with last value from partition in postgresql", but it is not what I need. I envision sorting by date block: the dates from the min date to now for one id (13894) form block 1, the dates from the min date to now for another id (13897) form block 2, and so on. The next step, I thought, is to fill down all fields within each block.
And another question: how do you adapt event-based data for use as a time series?
Tried:

You can use PostgreSQL's DISTINCT ON feature to do this. We'll generate a series with the start of every month (you'll need to supply start and end dates here) and put the ID and the date into the DISTINCT ON so that we get only one row of new_table for each distinct ID and month pair. Then we filter and order so that the row we get for each ID and month is the latest row whose date is before the start of the new month.
SELECT DISTINCT ON (new_table.id, month_start) *
FROM new_table, generate_series(start_date, end_date, interval '1 month') month_start
WHERE new_table.date < month_start
ORDER BY new_table.id, month_start ASC, new_table.date DESC;
(If you need your results to have the last day of the month and not the first day of the next month, you can just subtract 1 day from month_start in your select clause.)
EDIT: Running on the data you supplied, I get this:
SELECT DISTINCT ON (new_table.id, month_start) new_table.id, month_start - interval '1 day' as month_end, new_table.status
FROM new_table, generate_series('2020-05-01', '2020-06-01', interval '1 month') month_start
WHERE new_table.date < month_start
ORDER BY new_table.id, month_start ASC, new_table.date DESC;
  id   |       month_end        | status
-------+------------------------+--------
 13894 | 2020-04-30 00:00:00-07 |      5
 13894 | 2020-05-31 00:00:00-07 |      5
 13897 | 2020-04-30 00:00:00-07 |      2
 13897 | 2020-05-31 00:00:00-07 |      5
(4 rows)
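To check the logic outside the database, the DISTINCT ON approach can be mirrored in a few lines of Python. This is only a sketch: the event rows are hypothetical (the date for id 13894 in particular is an assumption), and the helper names `month_starts` and `month_end_status` are made up for illustration.

```python
from datetime import date, timedelta

# Hypothetical event rows (id, date, status); the date for 13894 is an assumption.
events = [
    (13894, date(2020, 4, 10), 5),
    (13897, date(2020, 4, 18), 2),
    (13897, date(2020, 5, 1), 5),
]

def month_starts(first, last):
    """Yield the first day of each month from first's month up to last, inclusive."""
    year, month = first.year, first.month
    while date(year, month, 1) <= last:
        yield date(year, month, 1)
        year, month = (year + 1, 1) if month == 12 else (year, month + 1)

def month_end_status(events, first, last):
    """For each id and month start, keep the latest event strictly before it."""
    rows = []
    for rec_id in sorted({e[0] for e in events}):
        for start in month_starts(first, last):
            earlier = [e for e in events if e[0] == rec_id and e[1] < start]
            if earlier:  # ids with no event yet produce no row, as in the SQL
                latest = max(earlier, key=lambda e: e[1])
                rows.append((rec_id, start - timedelta(days=1), latest[2]))
    return rows

result = month_end_status(events, date(2020, 5, 1), date(2020, 6, 1))
```

With these rows, `result` holds one entry per id and month end, carrying forward the last known status in the same way as the query output.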

Related

Calculations inside window function in PostgreSQL

I have a dataset of sales. To summarize, the structure is
client_id
date_purchase
There might be several purchases done by the same customer on different dates. There can also be several purchases done on the same date (by different or the same customer).
My goal is to get the number of customers, for any given day, that made 2 or more purchases between that day and 90 days prior.
That is, the expected output is
date_purchase | number_of_customers
2022-12-19 | 200
2022-12-18 | 194
(...)
Please note this calculates, for any given date, the number of customers with 2+ purchases between that date and 90 days prior.
I know it has something to do with a window function. But so far I have not found a way to calculate, for every window of 90 days, how many customers have done 2+ purchases.
I've tried several window functions with no success:
partition by date_purchase
range between interval '90 days' preceding and current row
So far I haven't been able to calculate the correct number for each date.
A window function doesn't seem relevant here because there is no relationship between the rows of the same window. A simple query or a self-join should give the expected result.
Assuming that client_id and date_purchase are two columns of my_table :
1. Query for a given date reference_date :
SELECT a.reference_date AS date_purchase, count(*) AS number_of_customers
FROM ( SELECT reference_date AS reference_date, client_id -- reference_date stands for the date of interest
FROM my_table
WHERE date_purchase <= reference_date AND date_purchase >= reference_date - INTERVAL '90 days'
GROUP BY client_id
HAVING count(*) >= 2
) AS a
GROUP BY a.reference_date
2. Query for a given interval of dates reference_date => reference_date + INTERVAL '20 days' :
SELECT a.date AS date_purchase, count(*) AS number_of_customers
FROM ( SELECT ref.date, t.client_id
FROM my_table AS t
INNER JOIN generate_series(reference_date, reference_date + INTERVAL '20 days', '1 day') AS ref(date)
ON t.date_purchase <= ref.date AND t.date_purchase >= ref.date - INTERVAL '90 days'
GROUP BY ref.date, t.client_id
HAVING count(*) >= 2
) AS a
GROUP BY a.date
ORDER BY a.date
3. Query for all the date_purchase in my_table :
SELECT a.date AS date_purchase, count(*) AS number_of_customers
FROM ( SELECT ref.date, t.client_id
FROM my_table AS t
INNER JOIN (SELECT DISTINCT date_purchase AS date FROM my_table) AS ref
ON t.date_purchase <= ref.date AND t.date_purchase >= ref.date - INTERVAL '90 days'
GROUP BY ref.date, t.client_id
HAVING count(*) >= 2
) AS a
GROUP BY a.date
ORDER BY a.date
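The counting rule behind these queries can be checked on a small sample in Python. This is a sketch under assumptions: the purchase rows are hypothetical and `repeat_customers` is an illustrative helper, not part of the answer. Unlike the SQL, which omits dates where no customer qualifies, it reports a zero for them.

```python
from collections import Counter
from datetime import date, timedelta

def repeat_customers(purchases, reference_date, days=90):
    """Count clients with 2+ purchases in [reference_date - days, reference_date]."""
    window_start = reference_date - timedelta(days=days)
    per_client = Counter(
        client_id
        for client_id, purchased in purchases
        if window_start <= purchased <= reference_date
    )
    return sum(1 for n in per_client.values() if n >= 2)

# Hypothetical sample rows (client_id, date_purchase)
purchases = [
    (1, date(2022, 10, 1)),
    (1, date(2022, 12, 18)),
    (2, date(2022, 12, 19)),
    (2, date(2022, 12, 19)),
    (3, date(2022, 6, 1)),
    (3, date(2022, 12, 19)),
]

# One count per distinct purchase date, as in query 3
counts = {d: repeat_customers(purchases, d) for d in sorted({p[1] for p in purchases})}
```

Client 3 never qualifies here because the two purchases are more than 90 days apart, while client 2 qualifies with two purchases on the same day.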

Dynamic value passing in Postgres

Here is a complex query to which I need to pass some dates dynamically. For now I have hardcoded the two dates '2021-08-01' and '2022-07-31'.
But I have to pass these dates dynamically so that for the next month, i.e. 2022-06, the dates passed are '2021-07-01' and '2022-06-30', basically the trailing 12 months of data.
If we take 2022-05, then the dates passed should be '2021-06-01' and '2022-05-31'.
How can we achieve this? Any suggestions or help will be much appreciated.
Below is the query for reference:
WITH base as
(
SELECT created_at as period ,order_number, TRIM(email) as email ,is_first_order
FROM orders
WHERE created_at::DATE BETWEEN '2021-08-01' AND '2022-07-31'
)
,base_agg as
(
select TO_CHAR(period,'YYYY-MM') as period
,COUNT(DISTINCT email)FILTER(WHERE is_first_order IS TRUE) as new_users
,COUNT(DISTINCT order_number)FILTER(WHERE is_first_order IS FALSE) as returning_orders
FROM base
GROUP BY 1
)
,base_cumulative as
(
SELECT ROW_NUMBER() OVER(ORDER BY PERIOD DESC ) as rno
,period
,new_users
,returning_orders
,sum("new_users")over (order by "period" asc rows between unbounded preceding and current row) as "cumulative_total"
from base_agg
)
SELECT
(SELECT period FROM base_cumulative WHERE rno=1) period
,(SELECT cumulative_total FROM base_cumulative WHERE rno=1) as cumulated_customers
,SUM(returning_orders) as returning_orders
,SUM(returning_orders)/NULLIF((SELECT cumulative_total FROM base_cumulative WHERE rno=1),0) as rate
FROM base_cumulative
You can calculate the end of the current month from NOW() with some date arithmetic; the same approach can be applied to the rest of the calculation.
select date_trunc('month', now())::date + interval '1 month - 1 day' end_of_this_month,
date_trunc('month', now())::date + interval '1 month - 1 day'::interval - '1 year'::interval + '1 day'::interval first_day_of_prev_year_month
;
Result
  end_of_this_month  | first_day_of_prev_year_month
---------------------+------------------------------
 2022-08-31 00:00:00 | 2021-09-01 00:00:00
(1 row)
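The same date arithmetic, written for an arbitrary month rather than NOW(), can be sketched in Python to check the boundaries from the question. The function name `trailing_year_window` is hypothetical.

```python
from datetime import date, timedelta

def trailing_year_window(year, month):
    """Return (first_day, last_day) of the 12 months ending with the given month."""
    # The first day of the following month, minus one day, is the last day of this month.
    next_month = date(year + 1, 1, 1) if month == 12 else date(year, month + 1, 1)
    last_day = next_month - timedelta(days=1)
    # The 12-month window starts on that same following-month date, one year earlier.
    first_day = date(next_month.year - 1, next_month.month, 1)
    return first_day, last_day

window = trailing_year_window(2022, 7)
```

For 2022-07 this gives 2021-08-01 through 2022-07-31, the pair that is hardcoded in the original query.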

How to form a dynamic pivot table or return multiple values from GROUP BY subquery

I'm having some major issues with the following query formation:
I have projects with start and end dates
Name       Start       End
---------------------------------------
Project 1  2020-08-01  2020-09-10
Project 2  2020-01-01  2025-01-01
and I'm trying to count the monthly working days within each project with the following subquery
select date_trunc('month', days) as d_month, count(days) as d_count
from generate_series(greatest('2020-08-01'::date, p.start), least('2020-09-14'::date, p.end), '1 day'::interval) days
where extract(DOW from days) not IN (0, 6)
group by d_month
where p.start is from the aliased main query and the dates are hard-coded for now, this correctly gives me the following result:
{"d_month"=>2020-08-01 00:00:00 +0000, "d_count"=>21}
{"d_month"=>2020-09-01 00:00:00 +0000, "d_count"=>10}
However subqueries can't return multiple values. The date range for the query is dynamic, so I would either need to somehow return the query as:
Name       Start       End         2020-08-01  2020-09-01  ...
-------------------------------------------------------------------------
Project 1  2020-08-01  2020-09-10  21          8
Project 2  2020-01-01  2025-01-01  21          10
Or simply return the whole subquery as JSON, but that doesn't seem to be working either.
Any idea on how to achieve this or whether there are simpler solutions for this?
The most correct solution would be to create an actual calendar table that holds every possible day of interest to your business and, at a minimum for your purpose here, marks work days.
Ideally you would have columns to hold fiscal quarters, periods, and weeks to match your industry. You would also mark holidays. Joining to this table makes these kinds of calculations a snap.
create table calendar (
ddate date not null primary key,
is_work_day boolean default true
);
insert into calendar
select ts::date as ddate,
extract(dow from ts) not in (0,6) as is_work_day
from generate_series(
'2000-01-01'::timestamp,
'2099-12-31'::timestamp,
interval '1 day'
) as gs(ts);
Assuming a calendar table is not within scope, you can do this:
with bounds as (
select min(start) as first_start, max("end") as last_end
from my_projects
), cal as (
select ts::date as ddate,
extract(dow from ts) not in (0,6) as is_work_day
from bounds
cross join generate_series(
first_start,
last_end,
interval '1 day'
) as gs(ts)
), bymonth as (
select p.name, p.start, p.end,
date_trunc('month', c.ddate) as month_start,
count(*) as work_days
from my_projects p
join cal c on c.ddate between p.start and p.end
where c.is_work_day
group by p.name, p.start, p.end, month_start
)
select jsonb_object_agg(to_char(month_start, 'YYYY-MM-DD'), work_days)
|| jsonb_object_agg('name', name)
|| jsonb_object_agg('start', start)
|| jsonb_object_agg('end', "end") as result
from bymonth
group by name;
Doing a pivot from rows to columns in SQL is usually a bad idea, so the query produces json for you.
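To sanity-check the per-month work-day counts, the same calculation can be sketched in Python for the two example projects, clamped to the hard-coded range from the question (2020-08-01 to 2020-09-14). The helper `work_days_by_month` is hypothetical, and holidays are ignored, as in the query without a calendar table.

```python
from collections import Counter
from datetime import date, timedelta

def work_days_by_month(start, end, range_start, range_end):
    """Count Monday-Friday days per month within the overlap of both ranges."""
    day = max(start, range_start)
    last = min(end, range_end)
    counts = Counter()
    while day <= last:
        if day.weekday() < 5:  # Monday=0 ... Friday=4
            counts[day.strftime("%Y-%m-01")] += 1
        day += timedelta(days=1)
    return dict(counts)

projects = {
    "Project 1": (date(2020, 8, 1), date(2020, 9, 10)),
    "Project 2": (date(2020, 1, 1), date(2025, 1, 1)),
}
report = {
    name: work_days_by_month(start, end, date(2020, 8, 1), date(2020, 9, 14))
    for name, (start, end) in projects.items()
}
```

Each project maps to a dictionary keyed by month start, which has the same shape as the JSON object the query builds per project.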

First record by month & by year

I have a Rails application with 20+ years of data.
I'm struggling to create two SQLs:
Fetch the first record of each year (based on filters)
Fetch the first record of each month (based on filters)
I made a DBFiddle here: https://www.db-fiddle.com/f/wjQqrrpaJeiYG8zkExbaos/0
For the first query (yearly), the result should be:
a | b_id | created_at
74780 | 82373 | 2020-01-02 01:34:33 +0000
15670 | 16639 | 2019-02-24 14:33:56 +0000
14586 | 87594 | 2018-01-06 09:14:31 +0000
I can fetch the years and months using date_part('year', created_at) and date_part('month', created_at), but didn't find a way to "glue" them with min(created_at).
Try to use window function OVER:
with grouped as(
select *, min(created_at) over(partition by date_trunc('year', created_at))
from z order by date_trunc('year', created_at) desc
)
select a, b_id, created_at from grouped where min = created_at
For the first record by month you can use the same approach by replacing all date_trunc('year', created_at) with date_trunc('month', created_at)
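The idea of keeping the row whose created_at equals the minimum of its period can be sketched in Python as follows. The rows marked hypothetical are invented for illustration, and `first_per_period` is an assumed helper name. One difference from the SQL: if several rows share the minimum timestamp, the query returns all of them, while this sketch keeps only the first one it sees.

```python
from datetime import datetime

def first_per_period(rows, key):
    """Return the earliest row (by created_at) for each period key."""
    first = {}
    for row in rows:
        period = key(row["created_at"])
        if period not in first or row["created_at"] < first[period]["created_at"]:
            first[period] = row
    return first

rows = [
    {"a": 11111, "b_id": 22222, "created_at": datetime(2020, 3, 5, 8, 0, 0)},   # hypothetical
    {"a": 74780, "b_id": 82373, "created_at": datetime(2020, 1, 2, 1, 34, 33)},
    {"a": 33333, "b_id": 44444, "created_at": datetime(2019, 7, 1, 12, 0, 0)},  # hypothetical
    {"a": 15670, "b_id": 16639, "created_at": datetime(2019, 2, 24, 14, 33, 56)},
]

by_year = first_per_period(rows, lambda ts: ts.year)
by_month = first_per_period(rows, lambda ts: (ts.year, ts.month))
```

Switching from yearly to monthly only changes the key function, just as the SQL only changes the date_trunc unit.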

Trouble joining generate_series timestamp without time zone on a field that's timestamp without timezone

I am trying to figure out a way to report how many people are in a location at the same time, down to the second.
I have a table with the id for the person, the date they entered, the time they entered, the date they left and the time they left.
example:
select unique_id, start_date, start_time, end_date, end_time
from My_Table
where start_date between '09/01/2019' and '09/02/2019'
limit 3
"unique_id" "start_date" "start_time" "end_date" "end_time"
989179 "2019-09-01" "06:03:13" "2019-09-01" "06:03:55"
995203 "2019-09-01" "11:29:27" "2019-09-01" "11:30:13"
917637 "2019-09-01" "11:06:46" "2019-09-01" "11:06:59"
I've concatenated start_date and start_time, as well as end_date and end_time, so that they form 2 fields:
select unique_id, ((start_date + start_time)::timestamp without time zone) as start_date,
((end_date + end_time)::timestamp without time zone) as end_date
result example:
"start_date"
"2019-09-01 09:28:54"
So I'm making that a CTE, then using a second CTE that uses generate_series between the dates, down to the second.
The goal is for generate_series to produce a row for every second between the two dates. Then, when I join my data sets, I can count how many records exist in my_table where the start_date (plus time) is equal to or greater than the generate_series date_time field, and the end_date (plus time) is less than or equal to the generate_series date_time field.
I feel that was harder to explain than it needed to be.
In theory, if a person was in the room from 2019-09-01 00:01:01 and left at 2019-09-01 00:01:03, I would count that record in the generate_series rows 2019-09-01 00:01:01, 2019-09-01 00:01:02 and 2019-09-01 00:01:03.
When I look at the data I can see that I should be returning hundreds of people in the room at specific peak periods, but the query returns all 0's.
Is this possibly a field formatting issue I need to adjust?
Here is the query:
with CTE as (
select unique_id, ((start_date+start_time)::timestamp without time zone) as start_date,
((end_date+end_time)::timestamp without time zone) as end_date
from My_table
where start_date between '09/01/2019' and '09/02/2019'
),
time_series as (
select generate_series( (date '2019-09-01')::timestamp, (date '2019-09-02')::timestamp, interval '1 second') as date_time
)
/*FINAL SELECT*/
select date_time, count(B.unique_id) as NumPpl
FROM (
select A.date_time
FROM time_series a
)x
left join CTE b on b.start_date >= x.date_time AND b.end_date <= x.date_time
GROUP BY 1
ORDER BY 1
Thank you in advance
I should also add that I have read-only access to this database, so I'm not able to create functions.
Simple version: the condition b.start_date >= x.date_time AND b.end_date <= x.date_time will never be true, assuming end_date is always after start_date.
Longer version: You also do not need a CTE for the generate_series(), and there is no reason to select all columns and rows of that CTE through a subquery. I would also drop the CTE for your original data and just join the table to the seconds. (Note: this changes the query somewhat, since it now also counts entries whose start_date is earlier than 2019-09-01. If you do not want that, you can add your condition back to the join condition, but I guess this is what you really wanted.) I also removed some casts that were not needed. Try this:
SELECT gs.second, COUNT(my.unique_id)
FROM generate_series('2019-09-01'::timestamp, '2019-09-02'::timestamp, interval '1 second') gs (second)
LEFT JOIN my_table my ON (my.start_date + my.start_time) <= gs.second
AND (my.end_date + my.end_time) >= gs.second
GROUP BY 1
ORDER BY 1
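The corrected overlap condition (start at or before the second, end at or after it) can be checked on a tiny sample in Python. This is a sketch with hypothetical visits modelled on the example in the question; `occupancy` is an assumed helper name, and looping over every second this way is only meant for small ranges.

```python
from datetime import datetime, timedelta

def occupancy(visits, first, last):
    """Count visits covering each second from first to last, inclusive."""
    counts = {}
    second = first
    while second <= last:
        # A visit covers a second when it started at or before it and ended at or after it.
        counts[second] = sum(1 for start, end in visits if start <= second <= end)
        second += timedelta(seconds=1)
    return counts

# Hypothetical visits (start, end)
visits = [
    (datetime(2019, 9, 1, 0, 1, 1), datetime(2019, 9, 1, 0, 1, 3)),
    (datetime(2019, 9, 1, 0, 1, 3), datetime(2019, 9, 1, 0, 1, 5)),
]
counts = occupancy(visits, datetime(2019, 9, 1, 0, 1, 0), datetime(2019, 9, 1, 0, 1, 6))
```

The first visit is counted for seconds 00:01:01 through 00:01:03, which matches the behaviour described in the question; with the original reversed comparison every count would be zero.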