postgres generate_series performance slower on server than laptop

I have a series of views that build off of each other like this:
rpt_scn_appl_target_v --> rpt_scn_appl_target_unnest_v --> rpt_scn_appl_target_unnest_timeseries_v --> rpt_scn_appl_target_unnest_timeseries_ftprnt_v
In the view rpt_scn_appl_target_unnest_timeseries_v, I use the generate_series function to generate monthly rows between 1/1/2015 and 12/31/2019.
What I've noticed is this:
This one takes 10 seconds to run:
select * from rpt_scn_appl_target_unnest_timeseries_ftprnt_v where scenario_id = 202
This one takes 9 seconds to run:
select * from rpt_scn_appl_target_unnest_timeseries_v where scenario_id = 202
This one takes 219 ms to run:
select * from rpt_scn_appl_target_unnest_v where scenario_id = 202
This one takes <1 second to run:
select * from rpt_scn_appl_target_v where scenario_id = 202
I've noticed that when I comment out the generate_series code in the view, the query runs in under a second, but with it, it takes 10 seconds to run.
rpt_scn_appl_target_unnest_timeseries_v View:
CREATE OR REPLACE VIEW public.rpt_scn_appl_target_unnest_timeseries_v AS
SELECT a.scenario_id,
a.scenario_desc,
a.scenario_status,
a.scn_appl_lob_ownr_nm,
a.scn_appl_sub_lob_ownr_nm,
a.scenario_asv_id,
a.appl_ci_id,
a.appl_ci_nm,
a.appl_ci_comm_nm,
a.appl_lob_ownr_nm,
a.appl_sub_lob_ownr_nm,
a.cost,
a.agg_complexity,
a.srvc_lvl,
a.dc_loc,
a.start_dt,
a.end_dt,
a.decomm_dt,
a.asv_target_id,
a.asv_target_desc,
a.asv_target_master,
a.prod_qty_main_cloud,
a.prod_cost_main_cloud,
a.non_prod_qty_main_cloud,
a.non_prod_cost_main_cloud,
a.prod_qty_main_onprem,
a.prod_cost_main_onprem,
a.non_prod_qty_main_onprem,
a.non_prod_cost_main_onprem,
a.prod_qty_target_onprem,
a.prod_cost_target_onprem,
a.non_prod_qty_target_onprem,
a.non_prod_cost_target_onprem,
a.prod_qty_target_cloud,
a.prod_cost_target_cloud,
a.non_prod_qty_target_cloud,
a.non_prod_cost_target_cloud,
a.type,
a.cost_main,
a.qty_main,
a.cost_target,
a.qty_target,
a.dt,
a.mth_dt,
CASE
WHEN a.type ~~ '%onprem%'::text THEN 'On-Prem'::text
ELSE 'Cloud'::text
END AS env_stat,
CASE
WHEN a.type ~~ '%non_prod%'::text THEN 'Non-Prod'::text
ELSE 'Prod'::text
END AS env,
CASE
WHEN a.dt <= a.decomm_dt THEN COALESCE(a.cost_main, 0::double precision)
WHEN a.decomm_dt IS NULL AND a.end_dt IS NULL AND a.start_dt IS NULL THEN a.cost_main
ELSE 0::double precision
END AS cost_curr,
CASE
WHEN a.dt <= a.decomm_dt THEN COALESCE(a.qty_main, 0::bigint)
WHEN a.decomm_dt IS NULL AND a.end_dt IS NULL AND a.start_dt IS NULL THEN a.qty_main
ELSE 0::bigint
END AS qty_curr,
CASE
WHEN a.dt < a.start_dt THEN 0::bigint
WHEN a.dt >= a.start_dt AND a.dt < a.end_dt AND a.type ~~ '%non_prod%'::text THEN COALESCE(a.qty_target, 0::bigint)
WHEN a.dt > a.end_dt THEN COALESCE(a.qty_target, 0::bigint)
ELSE 0::bigint
END AS qty_trgt,
CASE
WHEN a.dt < a.start_dt THEN 0::double precision
WHEN a.dt >= a.start_dt AND a.dt < a.end_dt AND a.type ~~ '%non_prod%'::text THEN COALESCE(a.cost_target, 0::double precision)
WHEN a.dt > a.end_dt THEN COALESCE(a.cost_target, 0::double precision)
ELSE 0::double precision
END AS cost_trgt
FROM ( SELECT t1.scenario_id,
t1.scenario_desc,
t1.scenario_status,
t1.scn_appl_lob_ownr_nm,
t1.scn_appl_sub_lob_ownr_nm,
t1.scenario_asv_id,
t1.appl_ci_id,
t1.appl_ci_nm,
t1.appl_ci_comm_nm,
t1.appl_lob_ownr_nm,
t1.appl_sub_lob_ownr_nm,
t1.cost,
t1.agg_complexity,
t1.srvc_lvl,
t1.dc_loc,
t1.start_dt,
t1.end_dt,
t1.decomm_dt,
t1.asv_target_id,
t1.asv_target_desc,
t1.asv_target_master,
t1.prod_qty_main_cloud,
t1.prod_cost_main_cloud,
t1.non_prod_qty_main_cloud,
t1.non_prod_cost_main_cloud,
t1.prod_qty_main_onprem,
t1.prod_cost_main_onprem,
t1.non_prod_qty_main_onprem,
t1.non_prod_cost_main_onprem,
t1.prod_qty_target_onprem,
t1.prod_cost_target_onprem,
t1.non_prod_qty_target_onprem,
t1.non_prod_cost_target_onprem,
t1.prod_qty_target_cloud,
t1.prod_cost_target_cloud,
t1.non_prod_qty_target_cloud,
t1.non_prod_cost_target_cloud,
t1.type,
t1.cost_main,
t1.qty_main,
t1.cost_target,
t1.qty_target,
generate_series('2015-01-01 00:00:00'::timestamp without time zone, '2019-12-31 00:00:00'::timestamp without time zone, '1 mon'::interval)::date AS dt,
to_char(generate_series('2015-01-01 00:00:00'::timestamp without time zone, '2019-12-31 00:00:00'::timestamp without time zone, '1 mon'::interval)::date::timestamp with time zone, 'YYYY-MM'::text) AS mth_dt
FROM rpt_scn_appl_target_unnest_v t1) a;
What I've also noticed is that, with the same data, tables, and views, the queries run faster on my laptop than on the AWS RDS server, even though the laptop has much less RAM and CPU. I'm running Postgres 9.6 on my laptop and 9.6 on AWS RDS. My laptop is a MacBook Pro with 16GB of RAM and a dual-core i7. For RDS, I'm using an m4.4xlarge, which has 16 cores and 64GB of RAM.
Here is the AWS explain plan:
https://explain.depesz.com/s/UGF
My laptop explain plan:
https://explain.depesz.com/s/zaWt
So I guess my questions are:
1.) Why is the query taking longer to run on AWS than on my laptop?
2.) Is there anything one can do to speed up the generate_series function? Would creating a separate calendar table and then joining to that be faster?

1) Your laptop is dealing with fewer rows; compare the row estimates from the two plans:
AWS: (cost=17,814.33..7,931,812.83 rows=158,200,000 width=527)
Laptop: (cost=15,238.52..4,002,252.94 rows=79,700,000 width=2,030)
2) If you are going to use the series several times, it is better to create a calendar table. 10 years of days is only ~3,650 rows; 100 years is ~36k rows.
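As a rough sketch of that approach (the table and column names here are made up for illustration), the month series can be materialized once and the view can cross join to it instead of calling generate_series for every row:
-- Build the month series once; dt and mth_dt mirror the columns the view derives.
CREATE TABLE calendar_mth AS
SELECT d::date AS dt,
       to_char(d, 'YYYY-MM') AS mth_dt
FROM generate_series('2015-01-01'::timestamp,
                     '2019-12-31'::timestamp,
                     '1 mon'::interval) AS d;

CREATE INDEX ON calendar_mth (dt);

-- The inner query of the view would then read something like
--   FROM rpt_scn_appl_target_unnest_v t1
--   CROSS JOIN calendar_mth c
-- and select c.dt and c.mth_dt instead of the two generate_series calls.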

Related

PostgreSQL query optimization when joining a large table to a small table

Hi, I have the following query to make a report from a certain date back to the beginning of the DB.
explain (analyze, buffers)
select
(
select array_to_json(array_agg(row_to_json(v)))
from (
select cl.fkp as fkpueblo,
cr.pk::varchar(255),
(
(
cr.creado at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date || ' ' || cl.n || ' ' || cl.paterno || ' ' || cl.materno
) as nombre,
cr.montocredito,
(
cr.creado at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date as creado,
(
select array_to_json(array_agg(row_to_json(u)))
from (
select h.pk::varchar(255),
h.concepto as nombre,
h.monto,
h.tipo,
(
h.fecha at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date as fecha
from historial h
where h.fkcr = cr.pk
and (
h.fecha at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date < '2022-07-30'
and h.del = 0
and (
select COUNT(h.pk)
from historial h
where h.fkcr = cr.pk
and h.tipo in (3, 6, 9, 10)
and h.del = 0
) > 10
order by h.fkcr,
h.fecha asc
) as u
) as historiales
from credito cr,
cliente cl
where cr.del = 0
and cr.fkc = cl.pk
and cl.fkp = p.pk
and (
cr.creado at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date < '2022-07-30'
and (
cr.fecha_cierre is null
or (
cr.fecha_cierre is not null
and cr.fecha_cierre >= '2022-07-23'
)
)
order by cr.creado asc
) as v
) as creditos,
(
select array_to_json(array_agg(row_to_json(u)))
from (
select csm.tipo_monto
from corte_semanal cs,
corte_semanal_montos csm,
corte_pueblo_padre pp
where pp.del = 0
and p.pk = pp.fkpueblo
and pp.fksupervisor = cs.fksupervisor
and cs.pk = csm.fkcorte_semanal
and csm.del = 0
and cs.del = 0
and p.del = 0
and (
cs.fecha_corte at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date >= '2022-07-23'
and (
cs.fecha_corte at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date < '2022-07-30'
and csm.tipo_monto < 4
order by csm.tipo desc,
csm.tipo_monto asc,
csm.concepto asc
) as u
) as montos
from pueblo p,
ruta r
where p.del = 0
and p.fkr = r.pk
order by r.ruta,
p.pueblo asc
And this is the plan:
Sorry, I had to place it on Pastebin since it was more than the body limit of the post.
https://pastebin.com/d627nxkJ
Using DBeaver I exported the plan as a graph image.
As you can see, this is an expensive query to run, so I set out to optimize it.
When I started a few days ago, I noticed that some nodes had an External Merge on Disk, so I checked work_mem; it was at the default of 4MB, so I changed it to 1GB, but it didn't significantly affect the overall execution time.
I used this to change work_mem:
ALTER ROLE <user> SET work_mem TO '1GB';
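As a side note, ALTER ROLE ... SET only takes effect for sessions opened after the change; the value the session running the query is actually using can be checked (and overridden for just that session) with something like:
SHOW work_mem;          -- value in effect for the current session
SET work_mem = '1GB';   -- session-level override; work_mem does not require superuser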
Indexes
In this query there are two major tables:
historial and corte_semanal_montos: historial has 1,882,074 rows and corte_semanal_montos has 106,705, but at the end of the query there are only about 1,084 rows due to aggregations.
pueblo has 1,141 rows and ruta only 33 rows.
historial is the biggest table in the query by a large amount, so I tried to create some indexes on the WHERE columns:
create index idx_del_zero on historial(del) where del = 0;
create index idx_tipo on historial(tipo) where tipo in (3, 6, 9, 10);
create index idx_fkcr_fecha on historial(fkcr, fecha);
CREATE INDEX idx_expr ON historial ((fecha at TIME zone 'Mexico/General' at TIME zone 'UTC'));
create index idx_del_zero_p on historial(del) where del = 0;
create index idx_del_zero_csm on corte_semanal_montos(del) where del = 0;
But again, it didn't significantly affect the overall execution time.
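One caveat about idx_expr above: PostgreSQL only considers an expression index when the query's expression matches the indexed expression exactly, and the query filters on the value cast to ::date. A hypothetical index matching the filter as written (assuming fecha is timestamp without time zone, so the whole expression stays immutable) might look like:
-- matches (fecha at time zone 'Mexico/General' at time zone 'UTC')::date as used in the WHERE clauses
CREATE INDEX idx_fecha_mex_date
ON historial (((fecha AT TIME ZONE 'Mexico/General' AT TIME ZONE 'UTC')::date));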
Right now I don't know how to proceed with making this query faster.
My PostgreSQL version is:
PostgreSQL 9.5.23 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit

How to get the difference in minutes between two timestamps excluding weekends?

I need to get the difference in minutes between 2 timestamps in Postgres, excluding weekends (Saturday, Sunday), but I'm not getting the expected result.
Examples:
Get diff in minutes; however, weekends are included:
SELECT EXTRACT(EPOCH FROM (NOW() - '2021-08-01 08:00:00') / 60)::BIGINT as diff_in_minutes;
$ diff_in_minutes = 17566
Get diff in weekdays, excluding Saturday and Sunday:
SELECT COUNT(*) as diff_in_days
FROM generate_series('2021-08-01 08:00:00', NOW(), interval '1d') d
WHERE extract(isodow FROM d) < 6;
$ diff_in_days = 10
Expected:
From '2021-08-12 08:00:00' to '2021-08-13 08:00:00' = 1440
From '2021-08-13 08:00:00' to '2021-08-16 08:00:00' = 1440
From '2021-08-13 08:00:00' to '2021-08-17 08:00:00' = 2880
and so on ...
The solution is:
SELECT GREATEST(COUNT(*) - 1, 0)
FROM generate_series(from_ts, to_ts, interval '1 minute') AS x
WHERE extract(isodow FROM x) <= 5
so
SELECT GREATEST(COUNT(*) - 1, 0)
FROM generate_series('2021-08-13 08:00:00'::timestamp, '2021-08-17 08:00:00', '1 minute') AS x
WHERE extract(isodow FROM x) <= 5
returns 2880
This is not an optimal solution, but I will leave finding the optimal one as homework for you.
First, create an SQL function:
CREATE OR REPLACE FUNCTION public.time_overlap (
b_1 timestamptz,
e_1 timestamptz,
b_2 timestamptz,
e_2 timestamptz
)
RETURNS interval AS
$body$
SELECT GREATEST(interval '0 second',e_1 - b_1 - GREATEST(interval '0 second',e_1 - e_2) - GREATEST(interval '0 second',b_2 - b_1));
$body$
LANGUAGE 'sql'
IMMUTABLE
RETURNS NULL ON NULL INPUT
SECURITY INVOKER
PARALLEL SAFE
COST 100;
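For intuition, the function returns how much of the interval [b_1, e_1] falls inside [b_2, e_2]; for example, the overlap of 08:00-12:00 with 10:00-18:00 comes out as two hours:
-- expected result: interval '02:00:00'
SELECT time_overlap('2021-08-13 08:00+00', '2021-08-13 12:00+00',
                    '2021-08-13 10:00+00', '2021-08-13 18:00+00');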
Then, call it like this:
WITH frame AS (SELECT generate_series('2021-08-13 00:00:00', '2021-08-17 23:59:59', interval '1d') AS d)
SELECT SUM(EXTRACT(epoch FROM time_overlap('2021-08-13 08:00:00', '2021-08-17 08:00:00',d,d + interval '1 day'))/60) AS total
FROM frame
WHERE extract(isodow FROM d) < 6
In the CTE you should round down the left/earlier of the 2 timestamps and round up the right/later of the 2 timestamps. The idea is that you should generate the series over whole days, not starting in the middle of a day.
When calling the time_overlap function you should use the exact values of your 2 timestamps so that it properly calculates the overlap in minutes between each day of the generated series and the timeframe between your 2 timestamps.
In the end, when you sum over all the overlaps, you get the total number of minutes excluding the weekends.
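Putting that advice together, the frame could be derived from the two input timestamps rather than hard-coded; a sketch (the bounds CTE merely stands in for wherever the two timestamps come from):
WITH bounds AS (
    SELECT '2021-08-13 08:00:00'::timestamptz AS ts_from,
           '2021-08-17 08:00:00'::timestamptz AS ts_to
), frame AS (
    -- whole-day buckets covering both timestamps
    SELECT generate_series(date_trunc('day', ts_from),
                           date_trunc('day', ts_to),
                           interval '1 day') AS d,
           ts_from,
           ts_to
    FROM bounds
)
SELECT SUM(EXTRACT(epoch FROM time_overlap(ts_from, ts_to, d, d + interval '1 day')) / 60) AS total -- 2880 for this range
FROM frame
WHERE extract(isodow FROM d) < 6;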

Postgres - convert international atomic time to UTC time (for loop with IF inside SQL function body)

I need to store TAI time in a pg database. This requires a custom type,
CREATE TYPE tai AS (
secs int,
nanosecs int
);
which maps 1:1 to a GNU C timespec struct, with the TAI epoch of Jan 1 1958 00:00:00 as its origin and a monotonic clock. A table of leap seconds is the auxiliary data required to convert these to UTC timestamps:
DROP TABLE IF EXISTS leapseconds;
CREATE TABLE IF NOT EXISTS leapseconds (
id serial PRIMARY KEY,
moment TIMESTAMP WITHOUT TIME ZONE NOT NULL,
skew int NOT NULL
);
INSERT INTO leapseconds (moment, skew) VALUES -- note: pg assumes 00:00:00 if no hh:mm:ss given
('1972-Jan-01', 10),
('1972-Jun-30', 1),
('1972-Dec-31', 1),
('1973-Dec-31', 1),
('1974-Dec-31', 1),
('1975-Dec-31', 1),
('1976-Dec-31', 1),
('1977-Dec-31', 1),
('1978-Dec-31', 1),
('1979-Dec-31', 1),
('1981-Jun-30', 1),
('1982-Jun-30', 1),
('1983-Jun-30', 1),
('1985-Jun-30', 1),
('1987-Dec-31', 1),
('1989-Dec-31', 1),
('1990-Dec-31', 1),
('1992-Jun-30', 1),
('1993-Jun-30', 1),
('1994-Jun-30', 1),
('1995-Dec-31', 1),
('1997-Jun-30', 1),
('1998-Dec-31', 1),
('2005-Dec-31', 1),
('2008-Dec-31', 1),
('2012-Jun-30', 1),
('2015-Jun-30', 1),
('2016-Dec-31', 1)
;
I need a function to convert these to UTC timestamps. It would be optimal for this to live in Postgres to avoid latency. The SQL/Python pseudocode to do this is:
# SQL
SELECT moment, skew
FROM leapseconds
ORDER BY moment ASC
AS tuples
# python
def tai_to_utc(tai):
modtime = to_timestamp(tai.seconds) # to_timestamp from pgsql
modtime += tai.nanosecs # timestamp in pg has usec precision, gloss over it
for moment, skew in tuples:
if modtime > moment:
modtime += skew # type mismatch, gloss over it
return modtime
I know how to do the typecasting, but I'm struggling to write this for+if in PL/pgSQL. Is the path of least resistance to learn how to write a stored C procedure and do this in the database? I can also have the client provide the UTC timestamps and do this conversion based on a query to the database, but the chatter to pull data from the database in order to insert into it is going to really hurt ingest speed.
You basically need to use the sum() window function to get the cumulative sum of leap seconds over the moments. Add that to the base timestamp (without the leap seconds) and take the row with the youngest moment that is at or before the base timestamp with the leap seconds of all previous moments added. You can use DISTINCT ON and LIMIT for that.
CREATE FUNCTION tai_to_utc
(_tai tai)
RETURNS timestamptz
AS
$$
SELECT DISTINCT ON (moment)
ts
FROM (SELECT moment AT TIME ZONE 'UTC' AS moment,
skew,
'1958-01-01T00:00:00+00:00'::timestamptz
+ (_tai.secs || ' seconds')::interval
+ (_tai.nanosecs / 1000 || ' microseconds')::interval
+ (sum(skew) OVER (ORDER BY moment) || ' seconds')::interval AS ts
FROM (SELECT moment,
skew
FROM leapseconds
UNION ALL
SELECT '-infinity',
0) AS x) AS y
WHERE moment <= ts - (skew || ' seconds')::interval
ORDER BY moment DESC
LIMIT 1;
$$
LANGUAGE SQL;
However, I'd recommend also changing the type of leapseconds.moment to timestamptz (timestamp with time zone) and inserting the moments with an explicit time zone to ensure things are what they are meant to be. That way the awkward time zone conversion isn't needed in the function.
CREATE FUNCTION tai_to_utc
(_tai tai)
RETURNS timestamptz
AS
$$
SELECT DISTINCT ON (moment)
ts
FROM (SELECT moment,
skew,
'1958-01-01T00:00:00+00:00'::timestamptz
+ (_tai.secs || ' seconds')::interval
+ (_tai.nanosecs / 1000 || ' microseconds')::interval
+ (sum(skew) OVER (ORDER BY moment) || ' seconds')::interval AS ts
FROM (SELECT moment,
skew
FROM leapseconds
UNION ALL
SELECT '-infinity',
0) AS x) AS y
WHERE moment <= ts - (skew || ' seconds')::interval
ORDER BY moment DESC
LIMIT 1;
$$
LANGUAGE SQL;
db<>fiddle
An index on leapseconds (moment, skew) might improve performance, though leapseconds is very small, so it might not do that much.
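A quick way to exercise the function, with a made-up TAI value purely for illustration (ROW(...)::tai builds a value of the composite type):
-- roughly 63 years after the 1958 TAI epoch; the result reflects all accumulated leap seconds
SELECT tai_to_utc(ROW(2000000000, 0)::tai);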

SQL subquery fails in Spark 2 when loading PostgreSQL table

I am facing a very annoying PSQL issue when trying to load part of a PostgreSQL table via a subquery.
The query is :
SELECT
N1,
N2,
N3,
N4
FROM CORR
WHERE CORR_N5 >= (now() - interval '18 year')
AND CORR_N5 <= (now() - interval '18 year' + interval '1 month')
This one works if written directly in pgAdmin. However, when I run it from a Spark 2 job, I get the following error message:
org.postgresql.util.PSQLException: ERROR: subquery in FROM must have an alias
Hint: For example, FROM (SELECT ...) [AS] foo.
Even when I put an alias after all the clauses, the same issue happens.
Any advice?
Thanks in advance
Melvin, have a look at the links below:
https://pganalyze.com/docs/log-insights/app-errors/U115
subquery in FROM must have an alias
SELECT * FROM (
SELECT N1, N2, N3, N4
FROM CORR WHERE CORR_N5 >= (now() - interval '18 year')
AND CORR_N5 <= (now() - interval '18 year' + interval '1 month')
) AS input

Convert date to unix timestamp in postgresql

I have a table with a column abc carrying the Unix timestamp (e.g. 13898161481435) and I want to run a between-dates select.
It would not be efficient to do a
where TO_CHAR(TO_TIMESTAMP(abc / 1000), 'DD/MM/YYYY') > '14/01/2014 00:00:00' and ..;
which would convert every record.
Rather, do something like:
where abc > ('14/01/2014 00:00:00' tobigint()) and abc < ...
But I can't find any reference, only for the reverse case.
Try this
WHERE abc > extract(epoch from timestamp '2014-01-28 00:00:00')
PostgreSQL Docs
You do not need to convert it to char to compare it.
WHERE to_timestamp(abc/1000) > timestamp '2014-01-28 00:00:00'
I don't think that conversion would be very inefficient because timestamps are stored internally in a similar format to epoch secs (admittedly with a different origin and resolution).
If you really want to go the other way:
WHERE abc > extract(epoch from timestamp '2014-01-28 00:00:00')
Interesting observation though, while
select count(*) from cb.logs where to_timestamp(timestmp/1000) > timestamp '2014-01-15 00:00:00' and to_timestamp(timestmp/1000) < timestamp '2014-01-15 23:59:59';
takes almost 10 seconds (my DB has 1.5 million records), while the one below takes only 1.5 seconds:
select count(*) from cb.logs where (timestmp > (select extract(epoch from timestamp '2014-01-15 00:00:00') * 1000) and timestmp < (select extract(epoch from timestamp '2014-01-15 23:59:59') * 1000));
and the one below about 1 second:
select count(*) from cb.logs where (timestmp > extract(epoch from timestamp '2014-01-15 00:00:00') * 1000) and (timestmp < extract(epoch from timestamp '2014-01-15 23:59:59') * 1000);
to count ~40,000 records.
Most likely because of the division, I would say.
1
select count(*) from cb.logs where to_timestamp(timestmp/1000) > timestamp '2014-01-15 00:00:00' and to_timestamp(timestmp/1000) < timestamp '2014-01-15 23:59:59';
8600ms
"Aggregate (cost=225390.52..225390.53 rows=1 width=0)"
" -> Seq Scan on logs (cost=0.00..225370.34 rows=8073 width=0)"
" Filter: ((to_timestamp(((timestmp / 1000))::double precision) > '2014-01-15 00:00:00'::timestamp without time zone) AND (to_timestamp(((timestmp / 1000))::double precision) < '2014-01-15 23:59:59'::timestamp without time zone))"
2
select count(*) from cb.logs where (timestmp > (select extract(epoch from timestamp '2014-01-15 00:00:00') * 1000) and timestmp < (select extract(epoch from timestamp '2014-01-15 23:59:59') * 1000));
1199ms
"Aggregate (cost=209245.94..209245.95 rows=1 width=0)"
" InitPlan 1 (returns $0)"
" -> Result (cost=0.00..0.01 rows=1 width=0)"
" InitPlan 2 (returns $1)"
" -> Result (cost=0.00..0.01 rows=1 width=0)"
" -> Seq Scan on logs (cost=0.00..209225.74 rows=8073 width=0)"
" Filter: (((timestmp)::double precision > $0) AND ((timestmp)::double precision < $1))"