A table with infinite rows? - postgresql

I have a table posts:
Column | Type | Modifiers
-------------------+--------------------------+----------------------------------------------------
body | text | not null
from | character varying(2000) | not null
date | timestamp with time zone | not null
and I'd like to count how many rows a user has in one day, one row for every day in a given month.
In oracle I would "generate" a table with as many days the current month has, and then join the "date" column with the "generated" date.
Something like
> select *
2 from (select sysdate + level l from dual connect by level < 10)
3 /
L
----------
2013-06-07
2013-06-08
2013-06-09
2013-06-10
2013-06-11
2013-06-12
2013-06-13
2013-06-14
2013-06-15
9 rows selected.
Is there something similar in postgres?

http://diethardsteiner.blogspot.com/2012/03/postgresql-auto-generating-sample.html
I found this with just one google hit. U might try using it.
Incase the author removes or web page gets wiped out.
WITH date_series AS (
SELECT
DATE(GENERATE_SERIES(DATE '2012-01-01', DATE '2012-01-10','1 day')) AS generateddate
)
SELECT
generateddate
, EXTRACT(DAY FROM generateddate) AS day
, EXTRACT(MONTH FROM generateddate) AS month
, EXTRACT(QUARTER FROM generateddate) AS quarter
, EXTRACT(YEAR FROM generateddate) AS year
FROM
date_series;

Related

How to Shorten Execution Time for A View

I have 3 tables, a user table, an admin table, and a cust table. Both admin and cust tables are foreign keyed to the user_account table. Basically, every user has a user record, and the type of user they are is determined by if they have a record in the admin or the cust table.
user admin cust
user_id user_id | admin_id user_id | cust_id
--------- ---------|---------- ---------|---------
1 1 | a 2 | dd
2 4 | b 3 | ff
3
4
Then I have a login_history table that records the user_id and login timestamp every time a user logs into the app
login_history
user_id | login_on
---------|---------------------
1 | 2022-01-01 13:22:43
1 | 2022-01-02 16:16:27
3 | 2022-01-05 21:17:52
2 | 2022-01-11 11:12:26
3 | 2022-01-12 03:34:47
I would like to create a view that would contain all dates for the first day of each week in the year starting from jan 1st, and a count column that contains the count of unique admin users that logged in that week and a count of unique cust users that logged in that week. So the resulting view should contain the following 53 records, one for each week.
login_counts_view
week_start_date | admin_count | cust_count
-----------------|-------------|------------
2022-01-01 | 1 | 1
2022-01-08 | 0 | 2
2022-01-15 | 0 | 0
.
.
.
2022-12-31 | 0 | 0
Note that the first week (2022-01-01) only has 1 count for admin_count even though the admin with user_id 1 logged in twice that week.
Below is the current query I have for the view. However, the tables are pretty large and it takes over 10 seconds to retrieve all records from the view, mainly because of the left joined date comparisons.
CREATE VIEW login_counts_view AS
SELECT
week_start_dates.week_start_date::text AS week_start_date,
count(distinct a.user_id) AS admin_count,
count(distinct c.user_id) AS cust_count
FROM (
SELECT
to_char(i::date, 'YYYY-MM-DD') AS week_start_date
FROM
generate_series(date_trunc('year', NOW()), to_char(NOW(), 'YYYY-12-31')::date, '1 week') i
) week_start_dates
LEFT JOIN login_history l ON l.login_on::date BETWEEN week_start_dates.week_start_date::date AND (week_start_dates.week_start_date::date + INTERVAL '6 day')::date
LEFT JOIN admin a ON a.user_id = l.user_id
LEFT JOIN cust c ON c.user_id = l.user_id
GROUP BY week_start_date;
Does anyone have any tips as to how to make this query execute more efficiently?
Idea
Compute the pseudo-week of each login date: partition the year into 7-day slices and number them consecutively. The pseudo-week of a given date would be the ordinal number of the slice it falls into.
Then operate the joins on integers representing the pseudo-weeks instead of date values and comparisons.
Implementation
A view to implement this follows:
CREATE VIEW login_counts_view_fast AS
WITH RECURSIVE Numbers(i) AS ( SELECT 0 UNION ALL SELECT i + 1 FROM Numbers WHERE i < 52 )
SELECT CAST ( date_trunc('year', NOW()) AS DATE) + 7 * n.i week_start_date
, count(distinct lw.admin_id) admin_count
, count(distinct lw.cust_id) cust_count
FROM (
SELECT i FROM Numbers
) n
LEFT JOIN (
SELECT admin_id
, cust_id
, base
, pit
, pit-base delta
, (pit-base) / (3600 * 24 * 7) week
FROM (
SELECT a.user_id admin_id
, c.user_id cust_id
, CAST ( EXTRACT ( EPOCH FROM l.login_on ) AS INTEGER ) pit
, CAST ( EXTRACT ( EPOCH FROM date_trunc('year', NOW()) ) AS INTEGER ) base
FROM login_history l
LEFT JOIN admin a ON a.user_id = l.user_id
LEFT JOIN cust c ON c.user_id = l.user_id
) le
) lw
ON lw.week = n.i
GROUP BY n.i
;
Some remarks:
The epoch values are the number of seconds elapsed since an absolute base datetime (specifically 1/1/1970 0h00).
CASTS are necessary to convert doubles to integers and timestamps to dates as mandated by the signatures of postgresql date functions and in order to enforce integer arithmetics.
The recursive subquery is a generator of consecutive integers. It could possibly be replaced by a generate_series call (untested)
Evaluation
See it in action in this db fiddle
The query plan indicates savings of 50-70% in execution time.

Gaps and Islands - get a list of dates unemployed over a date range with Postgresl

I have a table called Position, in this table, I have the following, dates are inclusive (yyyy-mm-dd), below is a simplified view of the employment dates
id, person_id, start_date, end_date , title
1 , 1 , 2001-12-01, 2002-01-31, 'admin'
2 , 1 , 2002-02-11, 2002-03-31, 'admin'
3 , 1 , 2002-02-15, 2002-05-31, 'sales'
4 , 1 , 2002-06-15, 2002-12-31, 'ops'
I'd like to be able to calculate the gaps in employment, assuming some of the dates overlap to produce the following output for the person with id=1
person_id, start_date, end_date , last_position_id, gap_in_days
1 , 2002-02-01, 2002-02-10, 1 , 10
1 , 2002-06-01, 2002-06-14, 3 , 14
I have looked at numerous solutions, UNIONS, Materialized views, tables with generated calendar date ranges, etc. I really am not sure what is the best way to do this. Is there a single query where I can get this done?
step-by-step demo:db<>fiddle
You just need the lead() window function. With this you are able to get a value (start_date in this case) to the current row.
SELECT
person_id,
end_date + 1 AS start_date,
lead - 1 AS end_date,
id AS last_position_id,
lead - (end_date + 1) AS gap_in_days
FROM (
SELECT
*,
lead(start_date) OVER (PARTITION BY person_id ORDER BY start_date)
FROM
positions
) s
WHERE lead - (end_date + 1) > 0
After getting the next start_date you are able to compare it with the current end_date. If they differ, you have a gap. These positive values can be filtered within the WHERE clause.
(if 2 positions overlap, the diff is negative. So it can be ignored.)
first you need to find what dates overlaps Determine Whether Two Date Ranges Overlap
then merge those ranges as a single one and keep the last id
finally calculate the ranges of days between one end_date and the next start_date - 1
SQL DEMO
with find_overlap as (
SELECT t1."id" as t1_id, t1."person_id", t1."start_date", t1."end_date",
t2."id" as t2_id, t2."start_date" as t2_start_date, t2."end_date" as t2_end_date
FROM Table1 t1
LEFT JOIN Table1 t2
ON t1."person_id" = t2."person_id"
AND t1."start_date" <= t2."end_date"
AND t1."end_date" >= t2."start_date"
AND t1.id < t2.id
), merge_overlap as (
SELECT
person_id,
start_date,
COALESCE(t2_end_date, end_date) as end_date,
COALESCE(t2_id, t1_id) as last_position_id
FROM find_overlap
WHERE t1_id NOT IN (SELECT t2_id FROM find_overlap WHERE t2_ID IS NOT NULL)
), cte as (
SELECT *,
LEAD(start_date) OVER (partition by person_id order by start_date) next_start
FROM merge_overlap
)
SELECT *,
DATE_PART('day',
(next_start::timestamp - INTERVAL '1 DAY') - end_date::timestamp
) as days
FROM cte
WHERE next_start IS NOT NULL
OUTPUT
| person_id | start_date | end_date | last_position_id | next_start | days |
|-----------|------------|------------|------------------|------------|------|
| 1 | 2001-12-01 | 2002-01-31 | 1 | 2002-02-11 | 10 |
| 1 | 2002-02-11 | 2002-05-31 | 3 | 2002-06-15 | 14 |

Postgres: Calculating the number of working months in the last X years

I users table and a jobs. User has many jobs and jobs have a start_date and end_date:
Column | Type | Modifiers
----------------+-----------------------------+---------------------------------------------------
id | integer | not null default nextval('jobs_id_seq'::regclass)
title | character varying |
employer | character varying |
start_date | date |
end_date | date |
user_id | integer |
I need to calculate the total number of months that a person has spent working within the past X years.
I've looked at OVERLAPS and played with intervals a bit but I can't quite figure out what I need. I want to make sure that even it the start_date is outside the X years range that I still count the months that are inside the range.
Here is what I have so far:
select sum(EXTRACT(YEAR FROM months) * 12 + EXTRACT(MONTH FROM months))
as working_months
from (
select CASE current
WHEN true THEN
age(current_date, start_date)
ELSE age(end_date, start_date)
END as months
from jobs inner join users on jobs.user_id = users.id
where users.id = 4
) as employment_time;
with jobs (start_date, end_date, user_id) as ( values
('2000-01-01'::date, '2005-12-31'::date, 1),
('2007-10-01', '2008-09-30', 1),
('2010-09-01', '2014-10-20', 1)
)
select
user_id,
extract(year from work_time) * 12 + extract(month from work_time) as months
from (
select
user_id,
sum(age(upper(period), lower(period))) as work_time
from (
select
user_id,
daterange(start_date, end_date, '[]') *
daterange((current_date - interval '10 years')::date, current_date)
as period
from jobs
) s
group by user_id
) s
;
user_id | months
---------+--------
1 | 70
Range type -
Range functions
The basic query would be this:
SELECT sum(extract(year from months) * 12 + extract(month from months)) AS working_months
FROM (
SELECT
age(CASE (start_date, start_date) OVERLAPS (current_date, interval '-5 years')
WHEN true THEN start_date
ELSE current_date - interval '5 years'
END AS strt::timestamp,
CASE current
WHEN true THEN current_date
ELSE end_date
END AS fin::timestamp) AS months
FROM jobs
WHERE user_id = 4) AS employment_time;
You may also put this in a SQL function with parameters for the number of years and user_id. Note that you throw away partial months from individual jobs. You can add extract(day from months) / 30 to the top SELECT to harvest those partial months into full months.
This assumes that jobs cannot overlap. If they do, then the query becomes much more complex.

Compare interval date by row

I am trying to group dates within a 1 year interval given an identifier by labeling which is the earliest date and which is the latest date. If there are no dates within a 1 year interval from that date, then it will record it's own date as the first and last date. For example originally the data is:
id | date
____________
a | 1/1/2000
a | 1/2/2001
a | 1/6/2000
b | 1/3/2001
b | 1/3/2000
b | 1/3/1999
c | 1/1/2000
c | 1/1/2002
c | 1/1/2003
And the output I want is:
id | first_date | last_date
___________________________
a | 1/1/2000 | 1/2/2001
b | 1/3/1999 | 1/3/2001
c | 1/1/2000 | 1/1/2000
c | 1/1/2002 | 1/1/2003
I have been trying to figure this out the whole day and can't figure it out. I can do it for cases id's with only 2 duplicates, but can't for greater values. Any help would be great.
SELECT id
, min(min_date) AS min_date
, max(max_date) AS max_date
, sum(row_ct) AS row_ct
FROM (
SELECT id, year, min_date, max_date, row_ct
, year - row_number() OVER (PARTITION BY id ORDER BY year) AS grp
FROM (
SELECT id
, extract(year FROM the_date)::int AS year
, min(the_date) AS min_date
, max(the_date) AS max_date
, count(*) AS row_ct
FROM tbl
GROUP BY id, year
) sub1
) sub2
GROUP BY id, grp
ORDER BY id, grp;
1) Group all rows per (id, year), in subquery sub1. Record min and max of the date. I added a count of rows (row_ct) for demonstration.
2) Subtract the row_number() from the year in the second subquery sub2. Thus, all rows in succession end up in the same group (grp). A gap in the years starts a new group.
3) In the final SELECT, group a second time, this time by (id, grp) and record min, max and row count again. Voilá. Produces exactly the result you are looking for.
-> SQLfiddle demo.
Related answers:
Return array of years as year ranges
Group by repeating attribute
select id, min ([date]) first_date, max([date]) last_date
from <yourTbl> group by id
Use this (SQLFiddle Demo):
SELECT id,
min(date) AS first_date,
max(date) AS last_date
FROM mytable
GROUP BY 1
ORDER BY 1

SQL query to convert date ranges to per day records

Requirements
I have data table that saves data in date ranges.
Each record is allowed to overlap previous record(s) (record has a CreatedOn datetime column).
New record can define it's own date range if it needs to hence can overlap several older records.
Each new overlapping record overrides settings of older records that it overlaps.
Result set
What I need to get is get per day data for any date range that uses record overlapping. It should return a record per day with corresponding data for that particular day.
To convert ranges to days I was thinking of numbers/dates table and user defined function (UDF) to get data for each day in the range but I wonder whether there's any other (as in better* or even faster) way of doing this since I'm using the latest SQL Server 2008 R2.
Stored data
Imagine my stored data looks like this
ID | RangeFrom | RangeTo | Starts | Ends | CreatedOn (not providing data)
---|-----------|----------|--------|-------|-----------
1 | 20110101 | 20110331 | 07:00 | 15:00
2 | 20110401 | 20110531 | 08:00 | 16:00
3 | 20110301 | 20110430 | 06:00 | 14:00 <- overrides both partially
Results
If I wanted to get data from 1st January 2011 to 31st May 2001 resulting table should look like the following (omitted obvious rows):
DayDate | Starts | Ends
--------|--------|------
20110101| 07:00 | 15:00 <- defined by record ID = 1
20110102| 07:00 | 15:00 <- defined by record ID = 1
... many rows omitted for obvious reasons
20110301| 06:00 | 14:00 <- defined by record ID = 3
20110302| 06:00 | 14:00 <- defined by record ID = 3
... many rows omitted for obvious reasons
20110501| 08:00 | 16:00 <- defined by record ID = 2
20110502| 08:00 | 16:00 <- defined by record ID = 2
... many rows omitted for obvious reasons
20110531| 08:00 | 16:00 <- defined by record ID = 2
Actually, since you are working with dates, a Calendar table would be more helpful.
Declare #StartDate date
Declare #EndDate date
;With Calendar As
(
Select #StartDate As [Date]
Union All
Select DateAdd(d,1,[Date])
From Calendar
Where [Date] < #EndDate
)
Select ...
From Calendar
Left Join MyTable
On Calendar.[Date] Between MyTable.Start And MyTable.End
Option ( Maxrecursion 0 );
Addition
Missed the part about the trumping rule in your original post:
Set DateFormat MDY;
Declare #StartDate date = '20110101';
Declare #EndDate date = '20110501';
-- This first CTE is obviously to represent
-- the source table
With SampleData As
(
Select 1 As Id
, Cast('20110101' As date) As RangeFrom
, Cast('20110331' As date) As RangeTo
, Cast('07:00' As time) As Starts
, Cast('15:00' As time) As Ends
, CURRENT_TIMESTAMP As CreatedOn
Union All Select 2, '20110401', '20110531', '08:00', '16:00', DateAdd(s,1,CURRENT_TIMESTAMP )
Union All Select 3, '20110301', '20110430', '06:00', '14:00', DateAdd(s,2,CURRENT_TIMESTAMP )
)
, Calendar As
(
Select #StartDate As [Date]
Union All
Select DateAdd(d,1,[Date])
From Calendar
Where [Date] < #EndDate
)
, RankedData As
(
Select C.[Date]
, S.Id
, S.RangeFrom, S.RangeTo, S.Starts, S.Ends
, Row_Number() Over( Partition By C.[Date] Order By S.CreatedOn Desc ) As Num
From Calendar As C
Join SampleData As S
On C.[Date] Between S.RangeFrom And S.RangeTo
)
Select [Date], Id, RangeFrom, RangeTo, Starts, Ends
From RankedData
Where Num = 1
Option ( Maxrecursion 0 );
In short, I rank all the sample data preferring the newer rows that overlap the same date.
Why do it all in DB when you can do it better in memory
This is the solution (I eventually used) that seemed most reasonable in terms of data transferred, speed and resources.
get actual range definitions from DB to mid tier (smaller amount of data)
generate in memory calendar of a certain date range (faster than in DB)
put those DB definitions in (much easier and faster than DB)
And that's it. I realised that complicating certain things in DB is not not worth it when you have executable in memory code that can do the same manipulation faster and more efficient.