PostgreSQL questions

I am writing a function in PostgreSQL and wondering if I can do the following:
I have an INSERT statement in every IF branch. Can I pass values like this for formatdate1 and formatdate2?
I am also updating a table. Is this how it is done in PostgreSQL?
CREATE OR REPLACE FUNCTION Check() RETURNS void AS $$
DECLARE
    startDate date;
    formateDate1 date;
    formatdate2 date;
    newDate date;
BEGIN
    SELECT to_date(lastdate::date, 'MM-DD-YYYY') INTO startDate FROM setup;
    FOR i IN 1..3 LOOP
        IF i = 1 THEN
            formateDate1 := startDate - INTERVAL '11 months';
            formatdate2 := to_date(formatdate2::date, 'YYYYMM');
            insert into warehouse.memcnts1
            (select distinct source, formatdate2 as yearmo, to_date(formateDate1, 'MM-DD-YYYY')
             where effdt <= formateDate1 and enddt >= formateDate1);
        ELSIF i = 2 THEN -- this is today's date
            --insert query here
            insert into warehouse.memcnts1
            (select distinct source, formatdate2 as yearmo, to_date(formateDate1, 'MM-DD-YYYY')
             where effdt <= formateDate1 and enddt >= formateDate1);
        ELSIF i = 3 THEN
            formateDate1 := startDate + INTERVAL '1 month';
            newDate := formateDate1;
            update dwset SET lastdate := newDate; -- wonder if this is right?
            formatdate2 := startDate;
        END IF;
    END LOOP;
END
$$ LANGUAGE plpgsql;

In general, any insert/insert/update loop is going to run into problems once the loop gets large enough. I assume you want to do something like
FOR i IN 1 .. n LOOP
    INSERT INTO foo VALUES (...);
    INSERT INTO foo VALUES (...);
    UPDATE bar SET ...;
END LOOP;
I make that assumption because there is no reason to loop and then run different logic on each iteration.
The problem with the above is that as n gets large (say, over a hundred or so) you can start to run into cache management issues, which can suddenly cause a lot of random disk I/O and very long queries. There is nothing like wondering why a given function call runs through 50 iterations of a loop in, say, 100 ms, but takes 1,800,000 ms for 1000 iterations (been there, done that).
In general, you want to avoid loops and perform set operations instead.
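For example, a single set-based INSERT driven by generate_series() can replace a month-by-month loop. This is only a sketch with hypothetical table and column names (monthly_counts, yearmo, cutoff_date), not the poster's actual schema:
-- Hypothetical: build one row per month in a single statement instead of looping.
INSERT INTO monthly_counts (yearmo, cutoff_date)
SELECT to_char(d, 'YYYYMM') AS yearmo,
       d::date              AS cutoff_date
FROM generate_series(current_date - interval '11 months',
                     current_date,
                     interval '1 month') AS g(d);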
I hope this helps. It is really hard to see what you are asking.

Related

Postgres built-in vs user-defined function performance

I have to count the number of set bits in very large integer columns in PostgreSQL, for which I wrote a PostgreSQL function that counts the bits in an integer.
CREATE OR REPLACE FUNCTION bitcount(i integer) RETURNS integer AS $$
DECLARE
    bitCount integer := 0;
BEGIN
    -- Kernighan's method: each pass clears the lowest set bit.
    LOOP
        EXIT WHEN i = 0;
        i := i & (i - 1);
        bitCount := bitCount + 1;
    END LOOP;
    RETURN bitCount;
END
$$ LANGUAGE plpgsql;
But I found another way to do this using Postgres's built-in functions, like:
SELECT char_length( replace(100::bit(31)::TEXT, '0', ''));
So I decided to compare the performance of both approaches, using the queries below.
First
SELECT a.n, bitcount(a.n)
from generate_series(1, 100000) as a(n);
Second
SELECT a.n, char_length( replace(a.n::bit(31)::TEXT, '0', ''))
FROM generate_series(1, 100000) as a(n);
I was expecting the first method to outperform the second one,
but to my surprise both performed almost the same. In fact, on my machine the second one always completed a few seconds faster, even with a large number of integers.
Can anyone explain why the second is almost as fast as the first despite using a cast and a string operation?
I'd say it is because PL/pgSQL is known to be slow.
Write the function in PL/Perl or PL/Python for better performance.
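A middle ground, not from the answer above, is to keep a named function but write it in SQL rather than PL/pgSQL; a single-statement SQL function marked IMMUTABLE can often be inlined by the planner, so you avoid the PL/pgSQL interpreter entirely. A sketch (bitcount_sql is a made-up name):
-- Sketch: the bit(31) trick wrapped in an inlinable SQL-language function.
CREATE OR REPLACE FUNCTION bitcount_sql(i integer) RETURNS integer AS $$
    SELECT char_length(replace(i::bit(31)::text, '0', ''));
$$ LANGUAGE sql IMMUTABLE;
-- Usage, same shape as the benchmark queries above:
SELECT a.n, bitcount_sql(a.n) FROM generate_series(1, 100000) AS a(n);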

PostgreSQL CASE inside function doesn't trigger

So, I have a couple of functions inside my database.
One function needs to run when the data in a specific table is older than 5 minutes.
I've tried doing it with:
PERFORM case when now() - '5 minutes'::interval > (select end_time from x order by end_route desc limit 1) then update_x() else null end;
When I run the command as a regular SELECT query, it runs fine. But when I put it inside another function (the one being called, which returns an updated table that is no older than 5 minutes), it never runs. Also, if I just put update_x() there, it runs, but then it runs every time the function is called.
Does anyone have any idea how I could fix this?
One idea is to just set up a cron job to run the function every 5 minutes independently, but I'd rather keep the server idle, since the function is quite resource-intensive and it doesn't get called that often.
I'm on version 8.4 (due to my ISP, so I can't change it, though I am considering moving to a VPS, so if this is something that only works on 9.5 and newer, I can wait).
The function now() gives the start time of the current transaction and does not change within it. Use clock_timestamp() instead, for example:
do $$
begin
for i in 1..3 loop
perform pg_sleep(1);
raise notice 'now(): % clock_timestamp(): %', now(), clock_timestamp();
end loop;
end $$;
NOTICE: now(): 2017-12-06 10:22:40.422683+01 clock_timestamp(): 2017-12-06 10:22:41.437099+01
NOTICE: now(): 2017-12-06 10:22:40.422683+01 clock_timestamp(): 2017-12-06 10:22:42.452456+01
NOTICE: now(): 2017-12-06 10:22:40.422683+01 clock_timestamp(): 2017-12-06 10:22:43.468124+01
Per the documentation:
clock_timestamp() returns the actual current time, and therefore its value changes even within a single SQL command (...)
now() is a traditional PostgreSQL equivalent to transaction_timestamp().
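Applied to the check in the question, that would look roughly like this (a sketch only; the table and column names are the ones from the question):
-- Same staleness test, but with clock_timestamp() so it is re-evaluated on each call.
PERFORM CASE
            WHEN clock_timestamp() - interval '5 minutes' >
                 (SELECT end_time FROM x ORDER BY end_time DESC LIMIT 1)
            THEN update_x()
        END;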
I'm not sure why, but once I moved the boolean up one level it started working.
So now, instead of having the PERFORM CASE query inside the function, I'm passing a boolean to the function and performing the check in a view above it.
CREATE VIEW x_view AS select * from get_x((clock_timestamp() - '5 minutes'::interval)::timestamp > (select end_route from gps_skole2 order by end_route desc limit 1));
And inside the function:
PERFORM case when $1 then update_x() else null end;

How to generate unique timestamps in PostgreSQL?

My idea is to implement a basic «vector clock», where timestamps are clock-based, always move forward, and are guaranteed to be unique.
For example, in a simple table:
CREATE TABLE IF NOT EXISTS timestamps (
last_modified TIMESTAMP UNIQUE
);
I use a trigger to set the timestamp value before insertion. It basically just goes into the future when two inserts arrive at the same time:
CREATE OR REPLACE FUNCTION bump_timestamp()
RETURNS trigger AS $$
DECLARE
previous TIMESTAMP;
current TIMESTAMP;
BEGIN
previous := NULL;
SELECT last_modified INTO previous
FROM timestamps
ORDER BY last_modified DESC LIMIT 1;
current := clock_timestamp();
IF previous IS NOT NULL AND previous >= current THEN
current := previous + INTERVAL '1 milliseconds';
END IF;
NEW.last_modified := current;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS tgr_timestamps_last_modified ON timestamps;
CREATE TRIGGER tgr_timestamps_last_modified
BEFORE INSERT OR UPDATE ON timestamps
FOR EACH ROW EXECUTE PROCEDURE bump_timestamp();
I then run a massive number of insertions from two separate clients:
DO
$$
BEGIN
FOR i IN 1..100000 LOOP
INSERT INTO timestamps DEFAULT VALUES;
END LOOP;
END;
$$;
As expected, I get collisions:
ERROR: duplicate key value violates unique constraint "timestamps_last_modified_key"
SQL state: 23505
Detail: Key (last_modified)=(2016-01-15 18:35:22.550367) already exists.
Context: SQL statement "INSERT INTO timestamps DEFAULT VALUES"
PL/pgSQL function inline_code_block line 4 at SQL statement
@rach suggested mixing clock_timestamp() with a SEQUENCE object, but that would probably mean giving up the TIMESTAMP type. And I can't really figure out how it would solve the isolation problem...
Is there a common pattern to avoid this?
Thank you for your insights :)
If you have only one Postgres server, as you said, I think that using a timestamp + a sequence can solve the problem, because sequences are non-transactional and respect the insert order.
If you have database shards it will be much more complex, but maybe the distributed sequence from 2ndQuadrant's BDR could help, though I don't think ordinality would be respected. I added some code below in case you have a setup to test it.
CREATE SEQUENCE "timestamps_seq";
-- Let's test first, how to generate id.
SELECT extract(epoch from now())::bigint::text || LPAD(nextval('timestamps_seq')::text, 20, '0') as unique_id ;
unique_id
--------------------------------
145288519200000000000000000010
(1 row)
CREATE TABLE IF NOT EXISTS timestamps (
unique_id TEXT UNIQUE NOT NULL DEFAULT extract(epoch from now())::bigint::text || LPAD(nextval('timestamps_seq')::text, 20, '0')
);
INSERT INTO timestamps DEFAULT VALUES;
INSERT INTO timestamps DEFAULT VALUES;
INSERT INTO timestamps DEFAULT VALUES;
select * from timestamps;
unique_id
--------------------------------
145288556900000000000000000001
145288557000000000000000000002
145288557100000000000000000003
(3 rows)
Let me know if that works. I'm not a DBA, so it may be worth asking on dba.stackexchange.com as well about potential side effects.
My two cents (inspired by http://tapoueh.org/blog/2013/03/15-batch-update):
try adding the following before the massive batch of insertions:
LOCK TABLE timestamps IN SHARE MODE;
Official documentation is here: http://www.postgresql.org/docs/current/static/sql-lock.html
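For completeness, a sketch of how that could be wrapped around the bulk insert from the question. One caveat that is my addition, not part of the answer: SHARE mode does not conflict with itself, so two writers can both acquire it and then block each other's inserts; a self-conflicting mode such as SHARE ROW EXCLUSIVE serializes the batches instead.
-- Sketch: take the lock explicitly, then run the batch inside the same transaction.
BEGIN;
LOCK TABLE timestamps IN SHARE ROW EXCLUSIVE MODE;
DO $$
BEGIN
    FOR i IN 1..100000 LOOP
        INSERT INTO timestamps DEFAULT VALUES;
    END LOOP;
END;
$$;
COMMIT;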

Purge data using a while loop.

I have a database that gets populated daily with incremental data, and then at the end of each month a full download of the month's data is put into the system. Our business wants each day loaded into the system, and then at the end of the month the daily data removed so only the full month's data is left. I have written the query below; if you could help, I'd appreciate it.
DECLARE @looper INT
DECLARE @totalindex int;
select name, (substring(name,17,8)) as Attempt, substring(name,17,4) as [year], substring(name,21,2) as [month], create_date
into #work_to_do_for
from sys.databases d
where name like 'Snapshot%' and
d.database_id > 4 and
(substring(name,21,2) = DATEPART(m, DATEADD(m, -1, getdate()))) AND (substring(name,17,4) = DATEPART(yyyy, DATEADD(m, -1, getdate())))
order by d.create_date asc
SELECT @totalindex = COUNT(*) from #work_to_do_for
SET @looper = 1 -- reset and reuse counter
WHILE (@looper < @totalindex)
BEGIN;
set @looper = @looper + 1
END;
DROP TABLE #work_to_do_for;
I'd need to perform the purge on several tables.
Thanks in advance.
When I delete large numbers of records, I always do it in batches and off-hours so as not to use up resources during production processes. To accomplish this, you incorporate a loop and some testing to find the optimal number to delete at a time.
begin transaction del -- I always use transactions as a safeguard
declare @count int = 1
while @count > 0
begin
    delete top (100000) t
    from dbo.MyTable t -- JOIN if necessary
    -- WHERE if necessary
    set @count = @@ROWCOUNT
end
Run this manually (without the WHILE loop) once with 100000 records in the parentheses and see what your execution time is. Write it down. Run it again with 200000 records; check the time and write it down. Run it with 500000 records. What you're looking for is a trend in the execution time: as long as the time per 100000 records keeps decreasing as you increase the batch size, keep increasing it. You might end up at 500k, but this method will help you find the optimal number to delete per batch. Then run it as a loop.
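If it helps, one way to time a single manual run is to turn on timing statistics for the session and roll the test back so the data is unchanged for the next measurement (a sketch, reusing the placeholder dbo.MyTable from above):
-- Time one candidate batch size before settling on it.
SET STATISTICS TIME ON;
BEGIN TRANSACTION;
DELETE TOP (100000) t
FROM dbo.MyTable t;   -- add the same JOIN/WHERE as in the loop above
ROLLBACK TRANSACTION; -- undo the test delete so the next run starts from the same data
SET STATISTICS TIME OFF;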
That being said, if you are literally deleting millions of records, it might make more sense to drop and recreate the table, as long as you aren't going to interfere with other processes. If you need to save some of the data, you can insert what you need into a new table (e.g. MyTable_New), drop the original table (MyTable), and rename MyTable_New to MyTable, as sketched below.
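A rough sketch of that copy-drop-rename path, using the hypothetical MyTable / MyTable_New names from the paragraph above (the filter column KeepThisRow is also made up):
-- Copy only the rows worth keeping into a new table.
SELECT *
INTO dbo.MyTable_New
FROM dbo.MyTable
WHERE KeepThisRow = 1;  -- hypothetical condition for the data you need to save
-- Swap the tables. Indexes, constraints, triggers, and permissions are NOT copied
-- by SELECT INTO and would have to be recreated on the new table.
DROP TABLE dbo.MyTable;
EXEC sp_rename 'dbo.MyTable_New', 'MyTable';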
The script you've posted iterating through with a while loop to delete the rows should be changed to a set-based operation if at all possible. Relational database engines excel at set-based operations like
Delete dbo.table WHERE yourcolumn = 5
as opposed to iterating through one at a time. Especially if it will be for "several million" rows as you indicated in the comments above.
@rwking, where are you putting the COMMIT for the transaction? I mean, are you keeping all the eligible deletes in a single transaction and doing one final commit?
I have a similar requirement where I have to delete in batches and also track the total number of rows affected at the end.
My sample code is as follows:
Declare @count int
Declare @deletecount int
Declare @batchrows int
set @count = 0
While (1=1)
BEGIN
BEGIN TRY
BEGIN TRAN
DELETE TOP (1000) FROM --CONDITION
SET @batchrows = @@ROWCOUNT -- capture immediately; a later SET would reset @@ROWCOUNT
SET @count = @count + @batchrows
COMMIT
IF @batchrows = 0
Break;
END TRY
BEGIN CATCH
ROLLBACK;
END CATCH
END
set @deletecount = @count
The above code works fine, but how do I keep track of @deletecount if a rollback happens in one of the batches?
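A minimal sketch of one way to handle that (my suggestion, not from the thread): capture @@ROWCOUNT immediately after the DELETE and add it to the running total only after the COMMIT succeeds, so a rolled-back batch is never counted. dbo.MyTable is a placeholder table name.
DECLARE @deletecount int = 0;
DECLARE @batchrows int;
WHILE 1 = 1
BEGIN
    BEGIN TRY
        BEGIN TRAN;
        DELETE TOP (1000) FROM dbo.MyTable;  -- add your WHERE condition here
        SET @batchrows = @@ROWCOUNT;
        COMMIT;
        SET @deletecount = @deletecount + @batchrows;  -- counted only once committed
        IF @batchrows = 0 BREAK;
    END TRY
    BEGIN CATCH
        IF @@TRANCOUNT > 0 ROLLBACK;  -- this batch's rows never reach @deletecount
    END CATCH
END;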

Performance of a stored procedure in SQL Server 2008 R2 (used to purge records from a table)

I have a database purge process that uses a stored procedure to delete records from a huge table based on expire date; it runs every 3 weeks and deletes about 3 million records.
Currently it takes about 5 hours to purge the data, which is causing a lot of problems. I know there are more efficient ways to write the code, but I'm out of ideas; please point me in the right direction.
--Stored Procedure
CREATE PROCEDURE [dbo].[pa_Expire_StoredValue_By_Date]
@ExpireDate DateTime, @NumExpired int OUTPUT, @RunAgain int OUTPUT
AS
-- This procedure expires all the StoredValue records that have an ExpireDate less than or equal to the DeleteDate provided
-- and have QtyUsed<QtyEarned
-- invoked by DBPurgeAgent
declare @NumRows int
set nocount on
BEGIN TRY
BEGIN TRAN T1
set @RunAgain = 1;
select @NumRows = count(*) from StoredValue where ExpireACK = 1;
if @NumRows = 0
begin
set rowcount 1800; -- only delete 1800 records at a time
update StoredValue with (RowLock)
set ExpireACK = 1
where ExpireACK = 0
and ExpireDate < @ExpireDate
and QtyEarned > QtyUsed;
set @NumExpired = @@RowCount;
set rowcount 0
end
else begin
set @NumExpired = @NumRows;
end
if @NumExpired = 0
begin -- stop processing when there are no rows left
set @RunAgain = 0;
end
else begin
Insert into SVHistory (LocalID, ServerSerial, SVProgramID, CustomerPK, QtyUsed, Value, ExternalID, StatusFlag, LastUpdate, LastLocationID, ExpireDate, TotalValueEarned, RedeemedValue, BreakageValue, PresentedCustomerID, PresentedCardTypeID, ResolvedCustomerID, HHID)
select
SV.LocalID, SV.ServerSerial, SV.SVProgramID, SV.CustomerPK,
(SV.QtyEarned-SV.QtyUsed) as QtyUsed, SV.Value, SV.ExternalID,
3 as StatusFlag, getdate() as LastUpdate,
-9 as LocationID, SV.ExpireDate, SV.TotalValueEarned,
0 as RedeemedValue,
((SV.QtyEarned-SV.QtyUsed)*SV.Value*isnull(SVUOM.UnitOfMeasureLimit, 1)),
PresentedCustomerID, PresentedCardTypeID,
ResolvedCustomerID, HHID
from
StoredValue as SV with (NoLock)
Left Join
SVUnitOfMeasureLimits as SVUOM on SV.SVProgramID = SVUOM.SVProgramID
where
SV.ExpireACK = 1
Delete from StoredValue with (RowLock) where ExpireACK = 1;
end
COMMIT TRAN T1;
END TRY
BEGIN CATCH
set @RunAgain = 0;
IF @@TRANCOUNT > 0 BEGIN
ROLLBACK TRAN T1;
END
DECLARE @ErrorMessage NVARCHAR(4000);
DECLARE @ErrorSeverity INT;
DECLARE @ErrorState INT;
SELECT
@ErrorMessage = ERROR_MESSAGE(),
@ErrorSeverity = ERROR_SEVERITY(),
@ErrorState = ERROR_STATE();
RAISERROR (@ErrorMessage, @ErrorSeverity, @ErrorState);
END CATCH
Why you're running with this logic makes no sense to me. It looks like you are batching by rerunning the stored proc over and over again. You really should just do it in a WHILE loop and use smaller batches within a single run of the stored proc. You should also run in smaller transactions; this will speed things up considerably. Arguably, the way this is written you don't need a transaction at all: you can resume safely, since you are flagging every record.
It's also not clear why you are touching the table 3 times. You really shouldn't need to update a flag AND select the rows into a new table AND then delete them. You can just use an output clause to do this in one step if desired, but you need to clarify your logic to get help on that.
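As a sketch of that OUTPUT-clause idea (column list abbreviated; the full expression list from the INSERT above would need to be carried over, and the SVHistory target must not have enabled triggers or foreign keys for OUTPUT ... INTO to be allowed):
-- Move and delete expired rows in one statement instead of UPDATE + INSERT + DELETE.
DELETE TOP (@BatchSize) SV
OUTPUT deleted.LocalID,
       deleted.ServerSerial,
       deleted.SVProgramID,
       deleted.CustomerPK,
       deleted.QtyEarned - deleted.QtyUsed,  -- goes into SVHistory.QtyUsed
       deleted.ExpireDate
INTO SVHistory (LocalID, ServerSerial, SVProgramID, CustomerPK, QtyUsed, ExpireDate)
FROM StoredValue AS SV
WHERE SV.ExpireDate < @ExpireDate
  AND SV.QtyEarned > SV.QtyUsed;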
Also, why are you using ROWLOCK? Lock escalation is fine and makes things run faster (less memory holding locks). Are you running this while the system is live? If it's after hours, use TABLOCK instead.
This is some suggested pseudo-code you can flesh out. I recommend taking @BatchSize as a parameter. Also obviously missing is error handling; this is just the core of the delete logic.
WHILE 1=1
BEGIN
    BEGIN TRAN
    UPDATE TOP (@BatchSize) StoredValue
    SET <whatever>
    INSERT INTO SVHistory <insert statement>
    DELETE FROM StoredValue WHERE ExpireAck = 1
    IF @@ROWCOUNT = 0
    BEGIN
        COMMIT TRAN
        BREAK;
    END
    COMMIT TRAN
END
First, look at what is causing it to be slow by examining the execution plan. Is it the insert statement or the delete?
Due to the calculations in the insert, I would suspect it is the slower part (unless you have triggers or cascading deletes on the table). You could change the history table to have the columns you use in the calculation and allow NULLs in the calculated fields. Now you can insert into that table more quickly and then do the delete. In a separate step in your job, you can then update the calculations. This would at least tie things up for a shorter period, but depending on how the history table is accessed it may or may not be possible.
Another out-of-the-box possibility is to rename your table StoredValue to StoredValueRaw and create a view called StoredValue that only shows the active records (sketched below). Then the job to delete the records could run every fifteen minutes or so and only delete a few records at a time. This might be far less disruptive to the users even if the actual deletes take longer. You would still likely need to put the records into the history table at the time they are identified as expired.
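A sketch of that rename-plus-view arrangement, assuming ExpireACK = 0 is what marks a record as still active (adjust the filter to whatever really distinguishes live rows):
-- One-time setup: move the physical table aside and expose a filtered view in its place.
EXEC sp_rename 'StoredValue', 'StoredValueRaw';
GO
CREATE VIEW StoredValue
AS
SELECT *
FROM StoredValueRaw
WHERE ExpireACK = 0;  -- assumption: 0 means not yet expired
GO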
You should probably also rethink doing this process every three weeks; the fewer records you have to deal with at once, the faster it will go.