Update performance issues - best practice - postgresql

I've just started working with PostgreSQL; I used to work with SQL Server and I'm currently migrating some of the existing processes.
The issue I'm currently facing is the performance of an UPDATE statement.
I'm trying to update all records in one table (e.g. MyTable_History) and set new values for some columns.
In SQL Server I used the following syntax:
declare @NewEndDate datetime = (select dateadd(minute, -1, getdate()))
update MyTable_History
set isLastestVersion=0, ValidTo=@NewEndDate, ModifiedBy='TestSCriptSql', ModifiedTime=GETDATE()
The code I could come up with for PostgreSQL (since I don't know how to simply use variables, I used a temp table instead) is:
CREATE TEMP TABLE dates AS VALUES (current_timestamp + (-1 ||' minutes')::interval);
with d as (
select th.validto as validto, th.islatestversion as islatestversion,
th.modifiedby as modifiedby, th.modifiedtime as modifiedtime, d.column1 as newvalidto
from MyTable_History th, dates d
)
update MyTable_History
set validto = d.newvalidto, islatestversion=false, modifiedby='test_update_script', modifiedtime=current_timestamp
from d
The SQL Server instance runs locally on my laptop (not a super config) and the PostgreSQL server runs on AWS as RDS (I don't know the exact specs).
My question is: am I doing something wrong in the PostgreSQL update statement? On a sample of 5000+ rows the statement completes instantly on SQL Server, while on PostgreSQL it takes around 50 seconds to finish successfully.
Also, it seems I've over-engineered this: on SQL Server I had 3 lines of code, while on PostgreSQL I'm using a CTE.
Regards,

I don't see why you would need a variable to begin with. current_timestamp returns the same value throughout a transaction as documented in the manual and thus will have the same value for all updated rows.
update mytable_history
set islatestversion = false,
validto = current_timestamp - interval '1 minute',
modifiedby = 'test_update_script',
modifiedtime = current_timestamp;
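If you want to convince yourself that current_timestamp is stable for the whole transaction, a quick check looks like this (a sketch; the comments describe the expected behaviour):
begin;
select current_timestamp;   -- transaction start time
select pg_sleep(2);
select current_timestamp;   -- same value as the first select
select clock_timestamp();   -- actual wall-clock time, about two seconds later
commit;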
But your usage of FROM in the UPDATE statement is wrong. The semantics of FROM in an UPDATE statement are very different between Postgres and SQL Server.
The way you use it creates a cross join between the CTE and mytable_history (so essentially a cross join of the table with itself).
You need to have a join condition in the WHERE clause on the primary key:
with d as (...)
update MyTable_History
set validto = d.newvalidto, islatestversion=false,
modifiedby='test_update_script', modifiedtime=current_timestamp
from d
where d.pk_column = MyTable_History.pk_column;
But if you really want to simulate something like variables, you don't need the CTE:
update mytable_history
set islatestversion = false,
validto = t.newvalidto,
modifiedby = 'test_update_script',
modifiedtime = current_timestamp
from (
values (current_timestamp - interval '1 minute')
) t (newvalidto);
The above still creates a "cross join" but as the joined table (from (values ...)) only contains a single row, it's not really a cross join.

Related

When is it better to use CTE or temp table postgres

I am doing a query on a very large data set and I am using WITH (CTE) syntax. This seems to take a while, and I was reading online that temp tables could be faster to use in these cases. Can someone advise me on which direction to go? In the CTE we join to a lot of tables, then we filter on the CTE result.
Only interested in Postgres answers.
What version of PostgreSQL are you using? CTEs perform differently in PostgreSQL versions 11 and older than in versions 12 and above.
In PostgreSQL 11 and older, CTEs are optimization fences: restrictions from the outer query are not pushed down into the CTE. The database evaluates the query inside the CTE and caches (i.e., materializes) the results, and outer WHERE clauses are only applied later, when the outer query is processed. This means either a full table scan or a full index scan is performed, which results in horrible performance for large tables. To avoid this, apply as many filters as possible in the WHERE clause inside the CTE:
WITH UserRecord AS (SELECT * FROM Users WHERE Id = 100)
SELECT * FROM UserRecord;
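To see whether the CTE is being materialized, you can look at the query plan; a materialized CTE shows up as a CTE Scan node (a sketch; the actual plan output will vary):
EXPLAIN
WITH UserRecord AS (SELECT * FROM Users WHERE Id = 100)
SELECT * FROM UserRecord;
-- A "CTE Scan on userrecord" node means the CTE was materialized;
-- a plain Index Scan / Seq Scan on users means it was inlined.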
PostgreSQL 12 addresses this problem by introducing query optimizer hints that let us control whether the CTE should be materialized or not: MATERIALIZED, NOT MATERIALIZED.
WITH AllUsers AS NOT MATERIALIZED (SELECT * FROM Users)
SELECT * FROM AllUsers WHERE Id = 100;
Note: Text and code examples are taken from my book Migrating your SQL Server Workloads to PostgreSQL
Summary:
PostgreSQL 11 and older: Use a subquery (see the sketch after this summary)
PostgreSQL 12 and above: Use CTE with NOT MATERIALIZED clause
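For PostgreSQL 11 and older, the subquery version of the example above would look like this (a sketch against the same Users table; unlike a CTE, a plain subquery lets the planner push the outer filter down):
SELECT *
FROM (SELECT * FROM Users) AS AllUsers
WHERE Id = 100;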
My follow-up is more than I can fit in a comment... so understand this may not be an answer to the OP per se.
Take the following query, which uses a CTE:
with sales as (
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item
),
inventory as (
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item
)
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item
There are times when I cannot explain why that query runs slower than I would expect. Sometimes, simply materializing the CTEs makes it run better, as expected. Other times it does not, but when I do this:
drop table if exists sales;
drop table if exists inventory;
create temporary table sales as
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item;
create temporary table inventory as
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item;
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item;
Suddenly all is right in the world.
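For comparison, forcing the CTEs to be materialized in place (PostgreSQL 12 and later) would look like this; the body of each CTE is unchanged from the query above:
with sales as materialized (
select item, sum (qty) as sales_qty, sum (revenue) as sales_revenue
from sales_data
where country = 'USA'
group by item
),
inventory as materialized (
select item, sum (on_hand_qty) as inventory_qty
from inventory_data
where country = 'USA' and on_hand_qty != 0
group by item
)
select
a.item, a.description, s.sales_qty, s.sales_revenue,
i.inventory_qty, i.inventory_qty * a.cost as inventory_cost
from
all_items a
left join sales s on
a.item = s.item
left join inventory i on
a.item = i.item;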
In PostgreSQL, temp tables do not persist across sessions: both the structure and the data are private to the session that created them and are dropped automatically when that session ends. Even so, to be safe I always drop them explicitly:
drop table if exists sales;
And use "if exists" to avoid any errors about the object not existing.
I rarely use these in common queries for the simple reason that they are not as portable as a simple SQL statement (you can't give the final query to another user without having the temp tables). My most common use case is when I am processing within a procedure/function:
create procedure sales_and_inventory()
language plpgsql
as
$BODY$
BEGIN
create temp table sales...
insert into sales_inventory
select ...
drop table sales;
END;
$BODY$;
Hopefully this helps.
Also, to answer your question on indexes... typically I don't, but nothing says that's always the right answer. If I put data into a temp table, I assume I'm going to use all or most of it. That said, if you plan to query it multiple times with conditions where an index makes sense, then by all means do it.
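If you do add an index, it's also worth running ANALYZE on the temp table afterwards so the planner has fresh statistics for it; a sketch using the sales temp table from above:
create index on sales (item);
analyze sales;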

Proc is running slow with NOT EXISTS

I'm working on creating a stored procedure; however, I'm running into an issue where the stored procedure runs for over 5 minutes on close to 50k records.
The process seems pretty straightforward; I'm just not sure why it is taking so long.
Essentially I have two tables:
Table_1
ApptDate     ApptName   ApptDoc    ApptReason   ApptType
--------------------------------------------------------
03/15/2021   Physical   Dr Smith   Yearly       Day
03/15/2021   Check In   Dr Doe     Check In     Day
03/15/2021   Appt oth   Dr Dee     Check In     Monthly
Table_2 - this table has the exact same structure as Table_1; what I am trying to achieve is simply to archive the data from Table_1:
DECLARE @Date_1 as DATETIME
SET @Date_1 = GetDate() - 1
INSERT INTO Table_2 (ApptDate, ApptName, ApptDoc, ApptReason)
SELECT ApptDate, ApptName, ApptDoc, ApptReason
FROM Table_1
WHERE ApptType = 'Day' AND ApptDate = @Date_1
AND NOT EXISTS (SELECT 1 FROM Table_2
WHERE ApptType = 'Day' AND ApptDate = @Date_1)
So this stored procedure seems pretty straightforward; however, the NOT EXISTS is causing it to be really slow.
The reason for the NOT EXISTS is that this stored procedure is part of a bigger process that runs multiple times a day (morning, afternoon, night). I'm trying to make sure that I only have one copy of the '03/15/2021' data. I'm basically running an archive process on the previous day's data (@Date_1).
Any thoughts on how this can be sped up?
For this query:
INSERT INTO Table_2 (ApptDate, ApptName, ApptDoc, ApptReason)
SELECT ApptDate, ApptName, ApptDoc, ApptReason
from Table_1 t1
Where ApptType = 'Day' and
ApptDate = @Date_1 and
NOT EXISTS (Select 1
from Table_2 t2
where t2.ApptType = t1.ApptType and
t2.ApptDate = t1.ApptDate
);
You want indexes on Table_1(ApptType) and, more importantly, Table_2(ApptType, ApptDate) or Table_2(ApptDate, ApptType).
Note: I changed the correlation clause to just refer to the values in the outer query. This seems more general than your version, but should have the same performance (in this case).
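The index definitions would look something like this (a sketch; the index names are made up):
CREATE INDEX IX_Table_1_ApptType ON Table_1 (ApptType);
CREATE INDEX IX_Table_2_ApptType_ApptDate ON Table_2 (ApptType, ApptDate);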

PostgreSQL - How to make a condition with records between the current record date and the same date plus 5 min?

I have something like this. With this piece of code I detect whether a vehicle stopped for at least 5 minutes.
It works, but with a large amount of data it starts to get slow.
I did a lot of tests and I'm sure that my problem is in the NOT EXISTS block.
My table:
CREATE TABLE public.messages
(
id bigint PRIMARY KEY DEFAULT nextval('messages_id_seq'::regclass),
messagedate timestamp with time zone NOT NULL,
vehicleid integer NOT NULL,
driverid integer NOT NULL,
speedeffective double precision NOT NULL,
-- ... few nonsense properties
)
WITH (
OIDS=FALSE
);
ALTER TABLE public.messages OWNER TO postgres;
CREATE INDEX idx_messages_1 ON public.messages
USING btree (vehicleid, messagedate);
And my query:
SELECT
*
FROM
messages m
WHERE
m.speedeffective > 0
and m.next_speedeffective = 0
and not exists( -- my problem
select id
from messages
where
vehicleid = m.vehicleid
and speedeffective > 5 -- I forgot this condition
and messagedate > m.messagedate
and messagedate <= m.messagedate + interval '5 minutes'
)
I can't figure out how to build the condition in a more performant way.
Edit DAY2:
I added a preliminary CTE like this to use in the main query:
WITH messagesx as (
SELECT
vehicleid,
messagedate
FROM
messages
WHERE
speedeffective > 5
)
and now it works better. I think I'm just missing a little detail.
Typically, a NOT EXISTS will slow down your query, as it requires a full scan of the table for each of the outer rows. Try to incorporate the same functionality within a join (I'm trying to rewrite the query without knowing the table, so I might make a mistake here):
SELECT
m1.*
FROM
messages m1
LEFT JOIN
messages m2
ON m1.vehicleid = m2.vehicleid AND m2.speedeffective > 5 AND m2.messagedate > m1.messagedate AND m2.messagedate <= m1.messagedate + interval '5 minutes'
WHERE
m1.speedeffective > 0
and m1.next_speedeffective = 0
and m2.vehicleid IS NULL
Take note that the NOT EXISTS is rewritten as the non-hit of the join condition.
Based on this answer: https://stackoverflow.com/a/36445233/5000827
and reading about NOT IN, NOT EXISTS and LEFT JOIN (where the joined side is NULL):
For PostgreSQL, NOT EXISTS and LEFT JOIN ... IS NULL are both executed as anti-joins and work the same way. (This is the reason why @CountZukula's answer performs almost the same as mine.)
The problem was the kind of join operation the planner chose: nested loop or hash.
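You can check which strategy the planner picks with EXPLAIN; a sketch using the original query (the plan node named in the comment is what to look for, the exact output will vary):
EXPLAIN
SELECT *
FROM messages m
WHERE m.speedeffective > 0
and m.next_speedeffective = 0
and not exists (
select id
from messages
where vehicleid = m.vehicleid
and speedeffective > 5
and messagedate > m.messagedate
and messagedate <= m.messagedate + interval '5 minutes'
);
-- Look for a "Hash Anti Join" or "Nested Loop Anti Join" node in the plan output.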
So, based on this: https://www.postgresql.org/docs/9.6/static/routine-vacuuming.html
PostgreSQL's VACUUM command has to process each table on a regular basis for several reasons:
To recover or reuse disk space occupied by updated or deleted rows.
To update data statistics used by the PostgreSQL query planner.
To update the visibility map, which speeds up index-only scans.
To protect against loss of very old data due to transaction ID wraparound or multixact ID wraparound.
I ran VACUUM ANALYZE on the messages table and the same query now runs much faster.
So, with fresh statistics from VACUUM ANALYZE, PostgreSQL can make better decisions.
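For reference, the command is simply:
VACUUM ANALYZE messages;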

Declaring variables in redshift

Background
I have been using Amazon Redshift to execute my queries.
I know there was a question asked earlier regarding this. But I don't understand how to incorporate UDFs.
I want to assign a temporary variable which takes a particular value.
I want to do this to make my script dynamic. For instance, this is my usual way of writing code:
SELECT * FROM transaction_table WHERE invoice_date >= '2013-01-01'
AND invoice_date <= '2013-06-30';
What I want to do is something like what you will see below. I believe SQL Server has DECLARE for variables, which does this sort of thing.
SET start_date TO '2013-01-01';
SET end_date TO '2013-06-30';
SELECT * FROM transaction_table WHERE invoice_date >= start_date
AND invoice_date <= end_date;
This way I don't have to search deep in my script. I can just have a SET statement up top and change just that.
Your feedback is greatly appreciated.
There are no variables in Redshift, unfortunately. You can, however, get a variable-like behaviour by creating a temporary table and referring to it as follows:
CREATE TEMPORARY TABLE _variables AS (
SELECT
'2013-01-01'::date as start_date
, '2013-06-30'::date as end_date
);
SELECT
transaction_table.*
FROM
transaction_table, _variables
WHERE
invoice_date >= start_date
AND
invoice_date <= end_date
;
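If you'd rather not create a temp table, the same one-row trick can be written with a WITH clause, which Redshift also supports (a sketch, untested on Redshift):
WITH _variables AS (
SELECT
'2013-01-01'::date as start_date
, '2013-06-30'::date as end_date
)
SELECT
transaction_table.*
FROM
transaction_table, _variables
WHERE
invoice_date >= start_date
AND
invoice_date <= end_date
;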

Is there a way to find TOP X records with grouped data?

I'm working with a Sybase 12.5 server and I have a table defined as such:
CREATE TABLE SomeTable(
[GroupID] [int] NOT NULL,
[DateStamp] [datetime] NOT NULL,
[SomeName] varchar(100),
PRIMARY KEY CLUSTERED (GroupID,DateStamp)
)
I want to be able to list, per [GroupID], only the latest X records by [DateStamp]. The kicker is X > 1, so plain old MAX() won't cut it. I'm assuming there's a wonderfully nasty way to do this with cursors and what-not, but I'm wondering if there is a simpler way without that stuff.
I know I'm missing something blatantly obvious and I'm gonna kick myself for not getting it, but .... I'm not getting it. Please help.
According to the online manual, Sybase 12.5 supports WINDOW functions and ROW_NUMBER(), though their syntax differs from standard SQL slightly.
Try something like this:
SELECT SP.*
FROM (
SELECT *, ROW_NUMBER() OVER (windowA ORDER BY [DateStamp] DESC) AS RowNum
FROM SomeTable
WINDOW windowA AS (PARTITION BY [GroupID])
) AS SP
WHERE SP.RowNum <= 3
ORDER BY RowNum DESC;
I don't have an instance of Sybase, so I haven't tested this. I'm just synthesizing this example from the doc.
I made a mistake. The doc I was looking at was for Sybase SQL Anywhere 11. It seems that Sybase ASE does not support the WINDOW clause at all, even in the most recent version.
Here's another query that could accomplish the same thing. You can use a self-join to match each row of SomeTable to all rows with the same GroupID and a later DateStamp. If there are fewer than three later rows, then we've got one of the top three.
SELECT s1.[GroupID], s1.[DateStamp], s1.[Foo], s1.[Bar], s1.[Baz]
FROM SomeTable s1
LEFT OUTER JOIN SomeTable s2
ON s1.[GroupID] = s2.[GroupID] AND s1.[DateStamp] < s2.[DateStamp]
GROUP BY s1.[GroupID], s1.[DateStamp], s1.[Foo], s1.[Bar], s1.[Baz]
HAVING COUNT(*) < 3
ORDER BY s1.[DateStamp] DESC;
Note that you must list the same columns in the SELECT list as you list in the GROUP BY clause. Basically, all columns from s1 that you want this query to return.
Here's quite an unscalable way!
SELECT GroupID, DateStamp, SomeName
FROM SomeTable ST1
WHERE X >
(SELECT COUNT(*)
FROM SomeTable ST2
WHERE ST1.GroupID=ST2.GroupID AND ST2.DateStamp > ST1.DateStamp)
Edit: Bill's solution is vastly preferable though.