PostgreSQL query optimization when joining a large table to a small table

Hi, I have the following query, which builds a report covering everything from the beginning of the database up to a given date.
explain (analyze, buffers)
select
(
select array_to_json(array_agg(row_to_json(v)))
from (
select cl.fkp as fkpueblo,
cr.pk::varchar(255),
(
(
cr.creado at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date || ' ' || cl.n || ' ' || cl.paterno || ' ' || cl.materno
) as nombre,
cr.montocredito,
(
cr.creado at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date as creado,
(
select array_to_json(array_agg(row_to_json(u)))
from (
select h.pk::varchar(255),
h.concepto as nombre,
h.monto,
h.tipo,
(
h.fecha at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date as fecha
from historial h
where h.fkcr = cr.pk
and (
h.fecha at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date < '2022-07-30'
and h.del = 0
and (
select COUNT(h.pk)
from historial h
where h.fkcr = cr.pk
and h.tipo in (3, 6, 9, 10)
and h.del = 0
) > 10
order by h.fkcr,
h.fecha asc
) as u
) as historiales
from credito cr,
cliente cl
where cr.del = 0
and cr.fkc = cl.pk
and cl.fkp = p.pk
and (
cr.creado at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date < '2022-07-30'
and (
cr.fecha_cierre is null
or (
cr.fecha_cierre is not null
and cr.fecha_cierre >= '2022-07-23'
)
)
order by cr.creado asc
) as v
) as creditos,
(
select array_to_json(array_agg(row_to_json(u)))
from (
select csm.tipo_monto
from corte_semanal cs,
corte_semanal_montos csm,
corte_pueblo_padre pp
where pp.del = 0
and p.pk = pp.fkpueblo
and pp.fksupervisor = cs.fksupervisor
and cs.pk = csm.fkcorte_semanal
and csm.del = 0
and cs.del = 0
and p.del = 0
and (
cs.fecha_corte at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date >= '2022-07-23'
and (
cs.fecha_corte at TIME zone 'Mexico/General' at TIME zone 'UTC'
)::date < '2022-07-30'
and csm.tipo_monto < 4
order by csm.tipo desc,
csm.tipo_monto asc,
csm.concepto asc
) as u
) as montos
from pueblo p,
ruta r
where p.del = 0
and p.fkr = r.pk
order by r.ruta,
p.pueblo asc
And this is the query plan.
Sorry, I had to put it on Pastebin because it exceeded the body limit of the post:
https://pastebin.com/d627nxkJ
Using DBeaver, I also exported the plan as a graph image.
As you can see, this is an expensive query to run, so I tried to optimize it.
When I started a few days ago, I noticed that some nodes used an external merge on disk, so I checked work_mem. It was at the default of 4MB, so I raised it to 1GB, but that did not significantly change the overall execution time.
I used this to change the work_mem
ALTER ROLE <user> SET work_mem TO '1GB';
Indexes
There are two major tables in this query: historial, with 1,882,074 rows, and corte_semanal_montos, with 106,705 rows. After aggregation, the query returns about 1,084 rows.
pueblo has 1,141 rows and ruta only 33 rows.
historial is by far the biggest table in the query, so I tried creating some indexes on the columns used in the WHERE clauses.
create index idx_del_zero on historial(del) where del = 0;
create index idx_tipo on historial(tipo) where tipo in (3, 6, 9, 10);
create index idx_fkcr_fecha on historial(fkcr, fecha);
CREATE INDEX idx_expr ON historial ((fecha at TIME zone 'Mexico/General' at TIME zone 'UTC'));
create index idx_del_zero_p on historial(del) where del = 0;
create index idx_del_zero_csm on corte_semanal_montos(del) where del = 0;
But again, this did not significantly affect the overall execution time.
Right now I don't know how to proceed to make this query faster.
My PostgreSQL version is
PostgreSQL 9.5.23 on x86_64-pc-linux-gnu, compiled by gcc (Ubuntu 5.4.0-6ubuntu1~16.04.12) 5.4.0 20160609, 64-bit

Related

Postgresql generate series with interval '15 minutes' longer than 29092 items

Sut:
create table meter.materialized_quarters
(
id int4 not null generated by default as identity,
tm timestamp without time zone
,constraint pk_materialized_quarters primary key (id)
--,constraint uq_materialized_quarters unique (tm)
);
Then setup data:
insert into meter.materialized_quarters (tm)
select GENERATE_SERIES ('1999-01-01', '2030-10-30', interval '15 minute');
And check data:
select count(*),tm
from meter.materialized_quarters
group by tm
having count(*)> 1
Some results:
count|tm |
-----+-----------------------+
2|1999-10-31 02:00:00.000|
2|1999-10-31 02:15:00.000|
2|1999-10-31 02:30:00.000|
2|1999-10-31 02:45:00.000|
2|2000-10-29 02:00:00.000|
2|2000-10-29 02:15:00.000|
2|2000-10-29 02:30:00.000|
2|2000-10-29 02:45:00.000|
2|2001-10-28 02:00:00.000|
2|2001-10-28 02:15:00.000|
2|2001-10-28 02:30:00.000|
....
Details:
select * from meter.materialized_quarters where tm = '1999-10-31 01:45:00';
Result:
id |tm |
-----+-----------------------+
29092|1999-10-31 01:45:00.000|
As far as I can see, 29092 is the longest run of non-duplicated rows that GENERATE_SERIES produces with a 15-minute interval.
How can I fill the table (meter.materialized_quarters) from 1999 to 2030?
One solution is:
insert into meter.materialized_quarters (tm)
select GENERATE_SERIES ('1999-01-01', '1999-10-31 01:45:00', interval '15 minute');
then:
insert into meter.materialized_quarters (tm)
select GENERATE_SERIES ('1999-10-31 02:00:00.000', '2000-10-29 00:00:00.000', interval '15 minute');
and again, and again.
Or
with bad as (
select count(*),tm
from meter.materialized_quarters
group by tm
having count(*)> 1
)
, ids as (
select mq1.id, mq2.id as iddel
from meter.materialized_quarters mq1 inner join bad on bad.tm = mq1.tm inner join meter.materialized_quarters mq2 on bad.tm = mq2.tm
where mq1.id<mq2.id
)
delete from meter.materialized_quarters
where id in (select iddel from ids);
Is there a more 'elegant' way?
EDIT.
I see the problem.
xxxx-10-29 02:00:00 - summer time becomes winter time.
select GENERATE_SERIES ('1999-10-31 01:45:00', '1999-10-31 02:00:00', interval '15 minute');

Your problem is the conversion from timestamp WITH time zone, which is what generate_series() returns, to your column, which is defined as timestamp WITHOUT time zone.
1999-10-31 is the day on which daylight saving time ends (at least in some countries), so the local hour from 02:00 to 03:00 occurs twice.
If you change your column to timestamp WITH time zone, your code works without any modification.
If you want to stick with timestamp WITHOUT time zone, you need to convert the value returned by generate_series():
insert into materialized_quarters (tm)
select g.tm at time zone 'UTC' --<< change to the time zone you need
from GENERATE_SERIES ('1999-01-01', '2030-10-30', interval '15 minute') as g(tm)
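To see why the duplicates appear, here is a minimal Python sketch (not PostgreSQL code) of the same effect. It assumes a Central European session time zone, which the question does not state, and hard-codes the 1999 switch from UTC+2 to UTC+1 at 01:00 UTC. The helper to_wall_clock is hypothetical and only mimics dropping the offset when a value is stored as timestamp WITHOUT time zone.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone

# Assumption: Central European rules, where clocks fell back from UTC+2
# to UTC+1 at 1999-10-31 01:00 UTC.
FALL_BACK_UTC = datetime(1999, 10, 31, 1, 0, tzinfo=timezone.utc)


def to_wall_clock(instant):
    """Return the local wall-clock time with the offset thrown away."""
    offset = timedelta(hours=2) if instant < FALL_BACK_UTC else timedelta(hours=1)
    return (instant + offset).replace(tzinfo=None)


# Sixteen consecutive, distinct 15-minute instants around the switch.
start = datetime(1999, 10, 30, 23, 0, tzinfo=timezone.utc)
instants = [start + i * timedelta(minutes=15) for i in range(16)]

# Once the offset is dropped, two different instants can share one value.
wall_clock = [to_wall_clock(t) for t in instants]
duplicates = sorted(t for t, n in Counter(wall_clock).items() if n > 1)

for t in duplicates:
    print(t)
```

In this sketch the sixteen distinct instants collapse to twelve distinct wall-clock values, and the four repeated ones are the 02:00 to 02:45 rows for 1999-10-31 reported in the question.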

Postgres - convert international atomic time to UTC time (for loop with IF inside SQL function body)

I need to store TAI time in a pg database. This requires a custom type,
CREATE TYPE tai AS (
secs int,
nanosecs int
);
which maps 1:1 to a GNU C timespec struct, with the TAI epoch of Jan 1 1958 00:00:00 and monotonic clock at its origins. A table of leapseconds is auxiliary data required to convert these to UTC timestamps,
DROP TABLE IF EXISTS leapseconds;
CREATE TABLE IF NOT EXISTS leapseconds (
id serial PRIMARY KEY,
moment TIMESTAMP WITHOUT TIME ZONE NOT NULL,
skew int NOT NULL
);
INSERT INTO leapseconds (moment, skew) VALUES -- note: pg assumes 00:00:00 if no hh:mm:ss given
('1972-Jan-01', 10),
('1972-Jun-30', 1),
('1972-Dec-31', 1),
('1973-Dec-31', 1),
('1974-Dec-31', 1),
('1975-Dec-31', 1),
('1976-Dec-31', 1),
('1977-Dec-31', 1),
('1978-Dec-31', 1),
('1979-Dec-31', 1),
('1981-Jun-30', 1),
('1982-Jun-30', 1),
('1983-Jun-30', 1),
('1985-Jun-30', 1),
('1987-Dec-31', 1),
('1989-Dec-31', 1),
('1990-Dec-31', 1),
('1992-Jun-30', 1),
('1993-Jun-30', 1),
('1994-Jun-30', 1),
('1995-Dec-31', 1),
('1997-Jun-30', 1),
('1998-Dec-31', 1),
('2005-Dec-31', 1),
('2008-Dec-31', 1),
('2012-Jun-30', 1),
('2015-Jun-30', 1),
('2016-Dec-31', 1)
;
I need a function to convert these to UTC timestamps. It would be optimal for this to live in Postgres to avoid latency. The SQL/Python pseudocode to do this is
# SQL
SELECT moment, skew
FROM leapseconds
ORDER BY moment ASC
-- the resulting rows are called `tuples` below
# python
def tai_to_utc(tai):
    modtime = to_timestamp(tai.secs)  # to_timestamp from pgsql
    modtime += tai.nanosecs  # timestamp in pg has usec precision, gloss over it
    for moment, skew in tuples:
        if modtime > moment:
            modtime += skew  # type mismatch, gloss over it
    return modtime
I know how to do the typecasting, but I'm struggling to write this for+if in PL/pgSQL. Is the path of least resistance to learn how to write a stored C procedure and do this in the database? I can also have the client provide the UTC timestamps and do this conversion based on a query to the database, but the chatter to pull data from the database in order to insert into it is going to really hurt ingest speed.

You basically need to use the sum() window function to get the cumulative sum of leap seconds over the moments. Add that sum to the base timestamp (the one without leap seconds), then pick the row with the latest moment that is at or before the base timestamp with the leap seconds of all previous moments added. You can use DISTINCT ON and LIMIT for that.
CREATE FUNCTION tai_to_utc
(_tai tai)
RETURNS timestamptz
AS
$$
SELECT DISTINCT ON (moment)
ts
FROM (SELECT moment AT TIME ZONE 'UTC' AS moment,
skew,
'1958-01-01T00:00:00+00:00'::timestamptz
+ (_tai.secs || ' seconds')::interval
+ (_tai.nanosecs / 1000 || ' microseconds')::interval
+ (sum(skew) OVER (ORDER BY moment) || ' seconds')::interval AS ts
FROM (SELECT moment,
skew
FROM leapseconds
UNION ALL
SELECT '-infinity',
0) AS x) AS y
WHERE moment <= ts - (skew || ' seconds')::interval
ORDER BY moment DESC
LIMIT 1;
$$
LANGUAGE SQL;
However, I'd also recommend changing the type of leapseconds.moment to timestamptz (timestamp with time zone) and inserting the moments with an explicit time zone, to ensure they are what they are meant to be. That way the awkward time zone conversion isn't needed in the function.
CREATE FUNCTION tai_to_utc
(_tai tai)
RETURNS timestamptz
AS
$$
SELECT DISTINCT ON (moment)
ts
FROM (SELECT moment,
skew,
'1958-01-01T00:00:00+00:00'::timestamptz
+ (_tai.secs || ' seconds')::interval
+ (_tai.nanosecs / 1000 || ' microseconds')::interval
+ (sum(skew) OVER (ORDER BY moment) || ' seconds')::interval AS ts
FROM (SELECT moment,
skew
FROM leapseconds
UNION ALL
SELECT '-infinity',
0) AS x) AS y
WHERE moment <= ts - (skew || ' seconds')::interval
ORDER BY moment DESC
LIMIT 1;
$$
LANGUAGE SQL;
An index on leapseconds (moment, skew) might improve performance, though leapseconds is quite small, so it might not help much.
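As an illustration of what sum(skew) OVER (ORDER BY moment) contributes in the functions above, here is a small Python sketch of the same running total. It is not part of the answer's SQL; it uses only the first four rows of the question's leapseconds table, and the variable names are made up for the example.

```python
from datetime import datetime
from itertools import accumulate

# First rows of the leapseconds table from the question: (moment, skew).
leapseconds = [
    (datetime(1972, 1, 1), 10),
    (datetime(1972, 6, 30), 1),
    (datetime(1972, 12, 31), 1),
    (datetime(1973, 12, 31), 1),
]

# Sort by moment, then keep a running total of skew, one value per row.
# This mirrors the window function: each row sees the sum up to itself.
rows = sorted(leapseconds)
running = list(accumulate(skew for _, skew in rows))
cumulative = list(zip((moment for moment, _ in rows), running))

for moment, total in cumulative:
    print(moment.date(), total)
```

Each row carries the total skew accumulated up to and including its own moment, which is why the functions subtract the row's own skew again in the WHERE clause before comparing against moment.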

Selecting by ids with the precalculated order

Here's the raw query I feed to the ORM I'm using:
SELECT
"fightEventId"
FROM
${EDBTableNames.LOCATION_TIME_SLOT} AS lts
WHERE
"fightId" IN (
SELECT
f.id
FROM
${EDBTableNames.FIGHTS} AS f
WHERE
f.status = '${EFightStatus.CONFIRMED}'
)
AND "fightEventId" IN (
SELECT
fe.id
FROM
${EDBTableNames.FIGHT_EVENTS} AS fe
WHERE
${status.includes(EFightEventStatus.ONGOING)}
AND (
NOW() at time zone 'utc' >= fe.from AND NOW() at time zone 'utc' <= fe.to
)
OR ${status.includes(EFightEventStatus.UPCOMING)} AND NOW() at time zone 'utc' <= fe.to
OR ${status.includes(EFightEventStatus.FINISHED)} AND NOW() at time zone 'utc' > fe.to
ORDER BY fe."from" ASC
)
GROUP BY "fightEventId"
HAVING
COUNT("fightId") > ${SHOW_WITH_NUMBER_OF_FIGHTS}
LIMIT ${limit}
OFFSET ${page * limit};
The problem with this query is that even though the fight events are ordered by the "from" date (ORDER BY fe."from" ASC), the order of this subquery is not preserved in the result of the whole query. I need it to be preserved.
What would be the right way to do this? By the "right way" I mean in terms of performance and clarity.
Here is a bunch of options, but I'm a little bit confused as to which one to go for.
ORDER BY the IN value list
P.S.
SHOW_WITH_NUMBER_OF_FIGHTS is an integer and as of now it's required to be equal to 4.
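One point applies whichever option is chosen: IN only tests membership, so an ORDER BY inside the subquery does not determine the order of the outer rows, and the outer query needs its own ORDER BY. The following Python sketch shows that distinction with made-up event ids and dates; every name in it is hypothetical and not taken from the ORM or schema above.

```python
from datetime import date

# Hypothetical fight events: id -> "from" date.
event_from = {
    1: date(2023, 5, 20),
    2: date(2023, 5, 1),
    3: date(2023, 5, 10),
}

# Event ids in the order they happen to come out of the time-slot rows.
slot_event_ids = [1, 3, 2, 3, 1]

# The subquery's result, ordered by "from"; below it is used for membership only.
subquery_ids = sorted(event_from, key=event_from.get)

# Membership filter (like IN): the result follows the slot rows, not the subquery.
filtered = [e for e in dict.fromkeys(slot_event_ids) if e in subquery_ids]

# Explicit sort of the final result (like an outer ORDER BY on the "from" date).
ordered = sorted(filtered, key=event_from.get)

print(filtered, ordered)
```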

Postgres search available time slots with generate_series

I have a table in my Postgres database with a column of dates. I want to find which of those dates are missing. For example:
date
2016-11-09 18:30:00
2016-11-09 19:00:00
2016-11-09 20:15:00
2016-11-09 22:20:00
2016-11-09 23:00:00
Here, 2016-11-09 21:00:00 is missing. After generating the series, if my table has an entry that falls within a slot (slots are 1 hour long), I need to remove that slot.
I want to write a query with generate_series that returns the missing date. Is this possible?
This is the sample query I used to generate the series:
SELECT t
FROM generate_series(
TIMESTAMP WITH TIME ZONE '2016-11-09 18:00:00',
TIMESTAMP WITH TIME ZONE '2016-11-09 23:00:00',
INTERVAL '1 hour'
) t
EXCEPT
SELECT tscol
FROM mytable;
But this query does not remove the slots containing 2016-11-09 18:30:00, 2016-11-09 20:15:00, etc., because I used EXCEPT.

This is not a gaps-and-islands problem. You just want to find the 1-hour intervals for which no record exists in the table.
EXCEPT does not work here because it compares for equality, while you want to check whether a record exists within a range.
A typical solution for this is a left join used as an anti-join:
select dt
from generate_series(
timestamp with time zone '2016-11-09 18:00:00',
timestamp with time zone '2016-11-09 23:00:00',
interval '1 hour'
) d(dt)
left join mytable t
on t.tscol >= dt and t.tscol < dt + interval '1 hour'
where t.tscol is null
You can also use not exists:
select dt
from generate_series(
timestamp with time zone '2016-11-09 18:00:00',
timestamp with time zone '2016-11-09 23:00:00',
interval '1 hour'
) d(dt)
where not exists (
select 1
from mytable t
where t.tscol >= dt and t.tscol < dt + interval '1 hour'
)
In this demo on DB Fiddle, both queries return:
| dt |
| :--------------------- |
| 2016-11-09 21:00:00+00 |
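The difference between the equality comparison of EXCEPT and the range test of the two queries above can be sketched in Python with the sample data from the question. This is only an illustration of the logic, not a replacement for the SQL; the variable names are invented for the example.

```python
from datetime import datetime, timedelta

# Sample rows from the question's table.
existing = [
    datetime(2016, 11, 9, 18, 30),
    datetime(2016, 11, 9, 19, 0),
    datetime(2016, 11, 9, 20, 15),
    datetime(2016, 11, 9, 22, 20),
    datetime(2016, 11, 9, 23, 0),
]

# Hourly slots from 18:00 to 23:00, like the generate_series call.
hour = timedelta(hours=1)
slots = [datetime(2016, 11, 9, 18) + i * hour for i in range(6)]

# Equality comparison (what EXCEPT does): only exact matches are removed.
except_style = [s for s in slots if s not in existing]

# Range test (what NOT EXISTS / the left join does): a slot is free only
# if no row falls inside [slot, slot + 1 hour).
free = [s for s in slots if not any(s <= t < s + hour for t in existing)]

print(except_style)
print(free)
```

With this data the equality version still reports 18:00, 20:00, 21:00 and 22:00, while the range version reports only 21:00, matching the result shown above.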

Get Data From Postgres Table At every nth interval

Below is my table. My Windows .NET application inserts a row into it every second. I want to write a query that fetches rows from the table at every nth interval, for example every 5 seconds. The query I am using is shown after the table, but it does not return the required result. Please help.
CREATE TABLE table_1
(
timestamp_col timestamp without time zone,
value_1 bigint,
value_2 bigint
)
This is the query I am using:
select timestamp_col,value_1,value_2
from (
select timestamp_col,value_1,value_2,
INTERVAL '5 Seconds' * (row_number() OVER(ORDER BY timestamp_col) - 1 )
+ timestamp_col as r
from table_1
) as dt
Where r = 1

Use date_part() function with modulo operator:
select timestamp_col, value_1, value_2
from table_1
where date_part('second', timestamp_col)::int % 5 = 0
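The same filter can be sketched in Python to show which rows survive. The start time and the values are invented sample data, assuming one row per second as described in the question.

```python
from datetime import datetime, timedelta

# Hypothetical sample: one row per second, value_1 counting up from 0.
start = datetime(2021, 1, 1, 0, 0, 0)
rows = [(start + timedelta(seconds=i), i) for i in range(12)]

# Keep rows whose seconds field is a multiple of 5, as the
# date_part('second', ...) % 5 = 0 condition does.
sampled = [(ts, value) for ts, value in rows if ts.second % 5 == 0]

for ts, value in sampled:
    print(ts, value)
```

Because the seconds field restarts at 0 every minute, this kind of filter gives evenly spaced rows only when the interval divides 60.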
