First record by month & by year - postgresql

I have a Rails application with 20+ years of data.
I'm struggling to write two SQL queries:
Fetch the first record of each year (based on filters)
Fetch the first record of each month (based on filters)
I made a DBFiddle here: https://www.db-fiddle.com/f/wjQqrrpaJeiYG8zkExbaos/0
For the first query (yearly), the result should be:
a | b_id | created_at
74780 | 82373 | 2020-01-02 01:34:33 +0000
15670 | 16639 | 2019-02-24 14:33:56 +0000
14586 | 87594 | 2018-01-06 09:14:31 +0000
I can fetch the years and months using date_part('year', created_at) and date_part('month', created_at), but I haven't found a way to combine them with min(created_at).

Try using a window function with OVER:
with grouped as(
select *, min(created_at) over(partition by date_trunc('year', created_at))
from z order by date_trunc('year', created_at) desc
)
select a, b_id, created_at from grouped where min = created_at
For the first record of each month, use the same approach and replace every date_trunc('year', created_at) with date_trunc('month', created_at).
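To make the logic of the window query easier to follow, here is a minimal Python sketch of the same idea: bucket rows by period and keep the row with the smallest created_at in each bucket. The two rows marked as made-up are hypothetical filler, and the helper name first_per_period is only for illustration.

```python
from datetime import datetime

# Rows as (a, b_id, created_at). The first row of each year comes from the
# expected output in the question; the rows marked "made-up" are hypothetical.
rows = [
    (74780, 82373, datetime(2020, 1, 2, 1, 34, 33)),
    (11111, 22222, datetime(2020, 3, 5, 12, 0, 0)),   # made-up later row
    (15670, 16639, datetime(2019, 2, 24, 14, 33, 56)),
    (33333, 44444, datetime(2019, 7, 1, 8, 0, 0)),    # made-up later row
    (14586, 87594, datetime(2018, 1, 6, 9, 14, 31)),
]


def first_per_period(rows, key):
    """Keep the row with the smallest created_at for each period key."""
    first = {}
    for row in rows:
        period = key(row[2])
        if period not in first or row[2] < first[period][2]:
            first[period] = row
    # Newest period first, like ORDER BY ... DESC in the query.
    return [first[p] for p in sorted(first, reverse=True)]


by_year = first_per_period(rows, key=lambda ts: ts.year)
by_month = first_per_period(rows, key=lambda ts: (ts.year, ts.month))
```

One difference worth noting: if several rows share the minimum created_at of a period, the window query returns all of them, while this sketch keeps only one.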

Related

Generate missing data and fill it down - postgresql

I have the dataset:
The problem is that the records are added only if an event happened, e.g. for the row with id 13897, the record was updated on 4/18/2020 and then on 5/1/2020 - the status was changed. What I need is the status of each record at the end of every month.
I was thinking about the below logic:
generate the series of dates from the min(date) till now - T1
get distinct id from the dataset - T2
do cross join between two above tables so that we get a new row for every row in the second table - T3
extract the dataset with all required fields - T4
merge T3 and T4 by concatenate(date and id) - T5
sort T5 by id and d asc - T5
fill-down all the fields grouped by id - T5
generate the series of dates from min(date) till now with the interval of one month and get the last day of each month - T6
merge T5 and T6 by date - right join so that we get only rows with the date = end of month
I am on step 6.
SELECT *
FROM (SELECT d, Concat(dt, t2.id) AS cnct
FROM (SELECT d, d::date AS dt
FROM generate_series(
(SELECT min(created_at::date)
FROM new_table), CURRENT_DATE, interval '1 day') d) t1
CROSS JOIN
(SELECT DISTINCT id FROM new_table) t2) t3
-- keep only the latest row per id and day, in case a record was updated several times during the day
LEFT JOIN (WITH cte AS
(SELECT id, status, created_at at time zone 'eat' at time zone 'utc' AS "created_at", updated_at::date AS date, updated_at::date, row_number() OVER (PARTITION BY id, updated_at::date ORDER BY updated_at DESC) rn
FROM new_table)
SELECT cte.*, Concat(updated_at::date, id) AS cnct
FROM cte
WHERE rn = 1) t4
ON t3.cnct = t4.cnct
I am stuck on step 7. I found "fill column with last value from partition in postgresql", but it is not what I need. I think I need to treat the dates for each id as a block: the dates from the min date to now for id 13894 are block 1, the dates from the min date to now for id 13897 are block 2, and so on. The next step would be to fill down all fields within each block.
And another question: how do you adapt event-based data into a time series?
Tried:
You can use PostgreSQL's DISTINCT ON feature to do this. We'll generate a series with the start of every month (you'll need to supply start and end dates here) and put the ID and the date into the DISTINCT ON so that we get only one row of new_table for each distinct ID and month pair. Then we simply filter and order to ensure that the row we're getting for each ID and month is the latest row whose date is before the new month.
SELECT DISTINCT ON (new_table.id, month_start) *
FROM new_table, generate_series(start_date, end_date, interval '1 month') month_start
WHERE new_table.date < month_start
ORDER BY new_table.id, month_start ASC, new_table.date DESC;
(If you need your results to have the last day of the month and not the first day of the next month, you can just subtract 1 day from month_start in your select clause.)
EDIT: Running on the data you supplied, I get this:
SELECT DISTINCT ON (new_table.id, month_start) new_table.id, month_start - interval '1 day' as month_end, new_table.status
FROM new_table, generate_series('2020-05-01', '2020-06-01', interval '1 month') month_start
WHERE new_table.date < month_start
ORDER BY new_table.id, month_start ASC, new_table.date DESC;
id | month_end | status
-------+------------------------+--------
13894 | 2020-04-30 00:00:00-07 | 5
13894 | 2020-05-31 00:00:00-07 | 5
13897 | 2020-04-30 00:00:00-07 | 2
13897 | 2020-05-31 00:00:00-07 | 5
(4 rows)
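As a rough illustration of what the DISTINCT ON query does, the following Python sketch picks, for every id and month start, the latest event dated strictly before that month start. The question's dataset is not shown, so the event rows here are hypothetical values chosen only to be consistent with its description; the function name is also an assumption.

```python
from datetime import date, timedelta

# Hypothetical event rows (id, date, status); the real dataset is not shown.
events = [
    (13894, date(2020, 4, 10), 5),
    (13897, date(2020, 4, 18), 2),
    (13897, date(2020, 5, 1), 5),
]
month_starts = [date(2020, 5, 1), date(2020, 6, 1)]


def status_at_month_end(events, month_starts):
    result = []
    for rec_id in sorted({e[0] for e in events}):
        for month_start in month_starts:
            # Rows strictly before the month start, as in the WHERE clause.
            earlier = [e for e in events if e[0] == rec_id and e[1] < month_start]
            if not earlier:
                continue  # no event yet, so no row (same as the SQL)
            latest = max(earlier, key=lambda e: e[1])
            result.append((rec_id, month_start - timedelta(days=1), latest[2]))
    return result


month_end_status = status_at_month_end(events, month_starts)
```

Because the latest earlier row is carried forward to every later month, this acts as the "fill down" step without materializing one row per day.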

Gaps and Islands - get a list of dates unemployed over a date range with PostgreSQL

I have a table called Position. Dates are inclusive and formatted yyyy-mm-dd. Below is a simplified view of the employment dates:
id, person_id, start_date, end_date , title
1 , 1 , 2001-12-01, 2002-01-31, 'admin'
2 , 1 , 2002-02-11, 2002-03-31, 'admin'
3 , 1 , 2002-02-15, 2002-05-31, 'sales'
4 , 1 , 2002-06-15, 2002-12-31, 'ops'
I'd like to be able to calculate the gaps in employment, assuming some of the dates overlap to produce the following output for the person with id=1
person_id, start_date, end_date , last_position_id, gap_in_days
1 , 2002-02-01, 2002-02-10, 1 , 10
1 , 2002-06-01, 2002-06-14, 3 , 14
I have looked at numerous solutions, UNIONS, Materialized views, tables with generated calendar date ranges, etc. I really am not sure what is the best way to do this. Is there a single query where I can get this done?
step-by-step demo:db<>fiddle
You just need the lead() window function. It lets you pull a value from the next row (start_date in this case) into the current row.
SELECT
person_id,
end_date + 1 AS start_date,
lead - 1 AS end_date,
id AS last_position_id,
lead - (end_date + 1) AS gap_in_days
FROM (
SELECT
*,
lead(start_date) OVER (PARTITION BY person_id ORDER BY start_date)
FROM
positions
) s
WHERE lead - (end_date + 1) > 0
After getting the next start_date, you can compare it with the current end_date. If the next start_date is later than the day after the current end_date, there is a gap. The WHERE clause keeps only these positive differences.
(If two positions overlap, the difference is negative, so the row is filtered out.)
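The lead() comparison can be sketched in plain Python by pairing each position with the next one in start_date order. This is only an illustration of the idea using the sample rows from the question; the function name employment_gaps is an assumption.

```python
from datetime import date, timedelta

# Sample rows from the question: (id, person_id, start_date, end_date).
positions = [
    (1, 1, date(2001, 12, 1), date(2002, 1, 31)),
    (2, 1, date(2002, 2, 11), date(2002, 3, 31)),
    (3, 1, date(2002, 2, 15), date(2002, 5, 31)),
    (4, 1, date(2002, 6, 15), date(2002, 12, 31)),
]


def employment_gaps(positions):
    one_day = timedelta(days=1)
    gaps = []
    # ORDER BY person_id, start_date
    ordered = sorted(positions, key=lambda p: (p[1], p[2]))
    for current, nxt in zip(ordered, ordered[1:]):
        if current[1] != nxt[1]:
            continue  # next row belongs to another person (PARTITION BY)
        gap_start = current[3] + one_day
        gap_days = (nxt[2] - gap_start).days
        if gap_days > 0:  # zero or negative: ranges touch or overlap
            gaps.append(
                (current[1], gap_start, nxt[2] - one_day, current[0], gap_days)
            )
    return gaps


gaps = employment_gaps(positions)
```

Since only adjacent rows are compared, a long position that fully contains a later, shorter one could produce a false gap; the sample data does not have that case.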
First, find which date ranges overlap (see "Determine Whether Two Date Ranges Overlap").
Then merge the overlapping ranges into a single range and keep the last id.
Finally, calculate the number of days between each end_date and the next start_date - 1.
SQL DEMO
with find_overlap as (
SELECT t1."id" as t1_id, t1."person_id", t1."start_date", t1."end_date",
t2."id" as t2_id, t2."start_date" as t2_start_date, t2."end_date" as t2_end_date
FROM Table1 t1
LEFT JOIN Table1 t2
ON t1."person_id" = t2."person_id"
AND t1."start_date" <= t2."end_date"
AND t1."end_date" >= t2."start_date"
AND t1.id < t2.id
), merge_overlap as (
SELECT
person_id,
start_date,
COALESCE(t2_end_date, end_date) as end_date,
COALESCE(t2_id, t1_id) as last_position_id
FROM find_overlap
WHERE t1_id NOT IN (SELECT t2_id FROM find_overlap WHERE t2_ID IS NOT NULL)
), cte as (
SELECT *,
LEAD(start_date) OVER (partition by person_id order by start_date) next_start
FROM merge_overlap
)
SELECT *,
DATE_PART('day',
(next_start::timestamp - INTERVAL '1 DAY') - end_date::timestamp
) as days
FROM cte
WHERE next_start IS NOT NULL
OUTPUT
| person_id | start_date | end_date | last_position_id | next_start | days |
|-----------|------------|------------|------------------|------------|------|
| 1 | 2001-12-01 | 2002-01-31 | 1 | 2002-02-11 | 10 |
| 1 | 2002-02-11 | 2002-05-31 | 3 | 2002-06-15 | 14 |

PostgreSQL Query: Column Sum for the latest available date of each month

Given a PostgreSQL table which looks like this:
date | data
2015-01-23 | 15
2015-01-23 | 11
2015-02-25 | 15
2015-02-25 | 11
2015-01-25 | 24
2015-01-25 | 2
2015-01-25 | 13
2015-01-29 | 5
2015-02-28 | 12
2015-02-28 | 1
2015-05-15 | 12
2015-05-16 | 1
How can I get the sum of data for the last available date of each month?
Example result:
date | data
2015-01-29 | 5
2015-02-28 | 13
2015-05-16 | 1
This is what I've tried so far:
SELECT year,month,max(day),sum(data) FROM
(
SELECT
date,
date_part('year', date) AS year,
date_part('month', date) AS month,
date_part('day', date) AS day,
sum(data) AS tdata
FROM table a
GROUP BY date, date_part('year', date), date_part('month', date), date_part('day', date)
ORDER BY year ASC, month ASC, day ASC
) dataq
GROUP BY year,month
The sum I get from this appears to be wrong.
Calculate the sums in the inner query, grouping by date. Then select the latest day of each month in the outer query:
select distinct on (year, month)
make_date(year::int, month::int, day::int) as date,
data
from (
select
date_part('year', date) as year,
date_part('month', date) as month,
date_part('day', date) as day,
sum(data) as data
from my_table
group by date
) s
order by year, month, day desc
date | data
------------+------
2015-01-29 | 5
2015-02-28 | 13
2015-05-16 | 1
(3 rows)
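The two stages of that query (sum per day, then keep the latest day per month) can be sketched in Python with the sample rows from the question. The variable names are only for illustration.

```python
from collections import defaultdict
from datetime import date

# Sample rows from the question: (date, data).
rows = [
    (date(2015, 1, 23), 15), (date(2015, 1, 23), 11),
    (date(2015, 2, 25), 15), (date(2015, 2, 25), 11),
    (date(2015, 1, 25), 24), (date(2015, 1, 25), 2), (date(2015, 1, 25), 13),
    (date(2015, 1, 29), 5),
    (date(2015, 2, 28), 12), (date(2015, 2, 28), 1),
    (date(2015, 5, 15), 12), (date(2015, 5, 16), 1),
]

# Inner query: one sum per calendar day (GROUP BY date).
daily = defaultdict(int)
for day, value in rows:
    daily[day] += value

# Outer query: keep the latest day per (year, month), like DISTINCT ON.
latest = {}
for day in daily:
    key = (day.year, day.month)
    if key not in latest or day > latest[key]:
        latest[key] = day

result = [(latest[k], daily[latest[k]]) for k in sorted(latest)]
```

The original attempt went wrong because its outer query summed every daily total in the month instead of keeping only the total for the last day.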
I guess you just need to exclude the days that you don't want to sum, for example with NOT EXISTS:
SELECT year,month,max(day),sum(tdata) tdata FROM
(
SELECT
d,
date_part('year', d) AS year,
date_part('month', d) AS month,
date_part('day', d) AS day,
sum(data) AS tdata
FROM tab a
WHERE NOT EXISTS
(
SELECT *
FROM tab a2
WHERE date_part('year', a.d) = date_part('year', a2.d) AND
date_part('month', a.d) = date_part('month', a2.d) AND
date_part('day', a.d) < date_part('day', a2.d)
)
GROUP BY d, date_part('year', d), date_part('month', d), date_part('day', d)
ORDER BY year ASC, month ASC, day ASC
) dataq
GROUP BY year,month
SQLFiddle

Postgres: Calculating the number of working months in the last X years

I have a users table and a jobs table. A user has many jobs, and each job has a start_date and an end_date:
Column | Type | Modifiers
----------------+-----------------------------+---------------------------------------------------
id | integer | not null default nextval('jobs_id_seq'::regclass)
title | character varying |
employer | character varying |
start_date | date |
end_date | date |
user_id | integer |
I need to calculate the total number of months that a person has spent working within the past X years.
I've looked at OVERLAPS and played with intervals a bit, but I can't quite figure out what I need. I want to make sure that even if the start_date is outside the X-year range, I still count the months that are inside the range.
Here is what I have so far:
select sum(EXTRACT(YEAR FROM months) * 12 + EXTRACT(MONTH FROM months))
as working_months
from (
select CASE current
WHEN true THEN
age(current_date, start_date)
ELSE age(end_date, start_date)
END as months
from jobs inner join users on jobs.user_id = users.id
where users.id = 4
) as employment_time;
with jobs (start_date, end_date, user_id) as ( values
('2000-01-01'::date, '2005-12-31'::date, 1),
('2007-10-01', '2008-09-30', 1),
('2010-09-01', '2014-10-20', 1)
)
select
user_id,
extract(year from work_time) * 12 + extract(month from work_time) as months
from (
select
user_id,
sum(age(upper(period), lower(period))) as work_time
from (
select
user_id,
daterange(start_date, end_date, '[]') *
daterange((current_date - interval '10 years')::date, current_date)
as period
from jobs
) s
group by user_id
) s
;
user_id | months
---------+--------
1 | 70
Documentation: Range types, Range functions
The basic query would be this:
SELECT sum(extract(year from months) * 12 + extract(month from months)) AS working_months
FROM (
SELECT
-- age(later, earlier): pass the end of the period first, then the start
age((CASE current
WHEN true THEN current_date
ELSE end_date
END)::timestamp,
-- use start_date if it lies within the last 5 years, else the start of the window
(CASE (start_date, start_date) OVERLAPS (current_date, interval '-5 years')
WHEN true THEN start_date
ELSE current_date - interval '5 years'
END)::timestamp) AS months
FROM jobs
WHERE user_id = 4) AS employment_time;
You may also put this in a SQL function with parameters for the number of years and user_id. Note that you throw away partial months from individual jobs. You can add extract(day from months) / 30 to the top SELECT to harvest those partial months into full months.
This assumes that jobs cannot overlap. If they do, then the query becomes much more complex.
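The clipping idea shared by both answers (cut each job down to the part that falls inside the window, then count months) might look like this in Python. It is a sketch under assumptions: today is passed in as a fixed, hypothetical date so the result is reproducible, jobs are assumed not to overlap, and whole months are counted per job rather than on the summed interval, so the result can differ slightly from the SQL versions.

```python
from datetime import date, timedelta


def full_months(start, end):
    """Whole months from start up to (not including) end."""
    months = (end.year - start.year) * 12 + (end.month - start.month)
    return months - 1 if end.day < start.day else months


def working_months(jobs, today, years):
    # Start of the window; Feb 29 falls back to Feb 28 in non-leap years.
    try:
        window_start = today.replace(year=today.year - years)
    except ValueError:
        window_start = today.replace(year=today.year - years, day=28)
    total = 0
    for start, end in jobs:
        # end_date is inclusive, so the exclusive upper bound is end + 1 day.
        clipped_start = max(start, window_start)
        clipped_end = min(end + timedelta(days=1), today)
        if clipped_end > clipped_start:  # skip jobs entirely outside the window
            total += full_months(clipped_start, clipped_end)
    return total


# Job rows (start_date, end_date) taken from the first answer's VALUES list.
jobs = [
    (date(2000, 1, 1), date(2005, 12, 31)),
    (date(2007, 10, 1), date(2008, 9, 30)),
    (date(2010, 9, 1), date(2014, 10, 20)),
]
months = working_months(jobs, today=date(2015, 1, 1), years=10)
```

Taking max() of the start dates and min() of the end dates is the manual equivalent of intersecting two date ranges.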

Compare interval date by row

I am trying to group dates that fall within a 1-year interval for a given identifier, labeling the earliest date and the latest date in each group. If there are no other dates within a 1-year interval of a date, its own date is recorded as both the first and the last date. For example, the original data is:
id | date
____________
a | 1/1/2000
a | 1/2/2001
a | 1/6/2000
b | 1/3/2001
b | 1/3/2000
b | 1/3/1999
c | 1/1/2000
c | 1/1/2002
c | 1/1/2003
And the output I want is:
id | first_date | last_date
___________________________
a | 1/1/2000 | 1/2/2001
b | 1/3/1999 | 1/3/2001
c | 1/1/2000 | 1/1/2000
c | 1/1/2002 | 1/1/2003
I have been trying to figure this out all day without success. I can do it for ids with only 2 rows, but not for more. Any help would be great.
SELECT id
, min(min_date) AS min_date
, max(max_date) AS max_date
, sum(row_ct) AS row_ct
FROM (
SELECT id, year, min_date, max_date, row_ct
, year - row_number() OVER (PARTITION BY id ORDER BY year) AS grp
FROM (
SELECT id
, extract(year FROM the_date)::int AS year
, min(the_date) AS min_date
, max(the_date) AS max_date
, count(*) AS row_ct
FROM tbl
GROUP BY id, year
) sub1
) sub2
GROUP BY id, grp
ORDER BY id, grp;
1) Group all rows per (id, year), in subquery sub1. Record min and max of the date. I added a count of rows (row_ct) for demonstration.
2) Subtract the row_number() from the year in the second subquery sub2. Thus, all rows in succession end up in the same group (grp). A gap in the years starts a new group.
3) In the final SELECT, group a second time, this time by (id, grp), and record min, max and row count again. Voilà. This produces exactly the result you are looking for.
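The three steps can be traced in a short Python sketch using the sample rows from the question. It assumes the dates are written as month/day/year, and the variable names are only for illustration.

```python
from collections import defaultdict
from datetime import date

# Sample rows from the question, reading the dates as month/day/year.
rows = [
    ("a", date(2000, 1, 1)), ("a", date(2001, 1, 2)), ("a", date(2000, 1, 6)),
    ("b", date(2001, 1, 3)), ("b", date(2000, 1, 3)), ("b", date(1999, 1, 3)),
    ("c", date(2000, 1, 1)), ("c", date(2002, 1, 1)), ("c", date(2003, 1, 1)),
]

# Step 1: collect the dates per (id, year).
per_year = defaultdict(list)
for rec_id, day in rows:
    per_year[(rec_id, day.year)].append(day)

# Step 2: year minus row number is constant within a run of consecutive years.
# Step 3: aggregate min and max again per (id, grp).
groups = {}
row_number = defaultdict(int)
for rec_id, year in sorted(per_year):
    row_number[rec_id] += 1
    grp = year - row_number[rec_id]
    days = per_year[(rec_id, year)]
    lo, hi = groups.get((rec_id, grp), (min(days), max(days)))
    groups[(rec_id, grp)] = (min(lo, *days), max(hi, *days))

result = [(rec_id, lo, hi) for (rec_id, grp), (lo, hi) in sorted(groups.items())]
```

Note that this groups by consecutive calendar years rather than by a strict 365-day distance, so two dates in adjacent years land in the same group even if they are more than a year apart.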
-> SQLfiddle demo.
Related answers:
Return array of years as year ranges
Group by repeating attribute
select id, min ([date]) first_date, max([date]) last_date
from <yourTbl> group by id
Use this (SQLFiddle Demo):
SELECT id,
min(date) AS first_date,
max(date) AS last_date
FROM mytable
GROUP BY 1
ORDER BY 1