I have two questions about using SELECT … FOR UPDATE row-level locking in a Postgres function:
Does it matter which columns I select? Do they have any relation to what data I need to lock and then update?
SELECT * FROM table WHERE x=y FOR UPDATE;
vs
SELECT 1 FROM table WHERE x=y FOR UPDATE;
I can't do a select in a function without saving the data somewhere, so I save to a dummy variable. This seems hacky; is it the right way to do things?
Here is my function:
CREATE OR REPLACE FUNCTION update_message(v_1 INTEGER, v_timestamp INTEGER, v_version INTEGER)
RETURNS void AS $$
DECLARE
v_timestamp_conv TIMESTAMP;
dummy INTEGER;
BEGIN
SELECT timestamp 'epoch' + v_timestamp * interval '1 second' INTO v_timestamp_conv;
SELECT 1 INTO dummy FROM my_table WHERE userid=v_1 LIMIT 1 FOR UPDATE;
UPDATE my_table SET (timestamp) = (v_timestamp_conv) WHERE userid=v_1 AND version < v_version;
END;
$$ LANGUAGE plpgsql;
Does it matter which columns I select?
No, it doesn't matter. Even if SELECT 1 FROM table WHERE ... FOR UPDATE is used, the query locks all rows that satisfy the WHERE conditions.
If the query retrieves rows from a join, and you don't want to lock rows from all tables involved in the join but only rows from specific tables, the SELECT ... FOR UPDATE OF list-of-table-names syntax can be useful:
http://www.postgresql.org/docs/9.0/static/sql-select.html#SQL-FOR-UPDATE-SHARE
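For example, a minimal sketch (the orders and customers table names are made up for illustration):
-- lock only the matching rows of orders, not the joined rows of customers
SELECT o.id
FROM orders o
JOIN customers c ON c.id = o.customer_id
WHERE c.country = 'PL'
FOR UPDATE OF o;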
I can't do a select in a function without saving the data somewhere, so I save to a dummy variable. This seems hacky; is it the right way to do things?
In PL/pgSQL, use the PERFORM command to discard the query result:
http://www.postgresql.org/docs/9.2/static/plpgsql-statements.html#PLPGSQL-STATEMENTS-SQL-NORESULT
Instead of:
SELECT 1 INTO dummy FROM my_table WHERE userid=v_1 LIMIT 1 FOR UPDATE;
use:
PERFORM 1 FROM my_table WHERE userid=v_1 LIMIT 1 FOR UPDATE;
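Applied to the function from the question, the whole thing might look like this (a sketch reusing the original table and column names):
CREATE OR REPLACE FUNCTION update_message(v_1 INTEGER, v_timestamp INTEGER, v_version INTEGER)
RETURNS void AS $$
DECLARE
    v_timestamp_conv TIMESTAMP;
BEGIN
    v_timestamp_conv := timestamp 'epoch' + v_timestamp * interval '1 second';
    -- lock the row without storing anything
    PERFORM 1 FROM my_table WHERE userid = v_1 LIMIT 1 FOR UPDATE;
    UPDATE my_table SET timestamp = v_timestamp_conv WHERE userid = v_1 AND version < v_version;
END;
$$ LANGUAGE plpgsql;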
Related
I have a Postgres database with more than 1000 tables like table1, table2, table3.
I imported all of these tables from a script, and apparently the script had issues during the import.
Many tables have duplicate rows (all values exactly the same).
I am able to go into each table and delete the duplicate rows using DBeaver, but because there are over 1000 tables this is very time consuming.
Example of tables:
table1
name gender age
a m 20
a m 20
b f 21
b f 21
table2
fruit hobby
x running
x running
y stamp
y stamp
How can I do the following:
Identify tables in postgres with duplicate rows.
Delete all duplicate rows, leaving 1 record.
I need to do this on all 1000+ tables at once.
Since you want to automate the deduplication of all tables, you need a plpgsql function in which you can write dynamic queries to achieve it.
Try this function:
create or replace function func_dedup(_schemaname varchar) returns void as
$$
declare
    _rec record;
begin
    for _rec in select table_name from information_schema.tables where table_schema = _schemaname
    loop
        execute format('CREATE TEMP TABLE tab_temp AS SELECT DISTINCT * FROM %I', _rec.table_name);
        execute format('TRUNCATE %I', _rec.table_name);
        execute format('INSERT INTO %I SELECT * FROM tab_temp', _rec.table_name);
        execute 'DROP TABLE tab_temp';
    end loop;
end;
$$
language plpgsql;
Now call your function like below:
select * from func_dedup('your_schema');
Steps:
Get the list of all tables in your schema with the query below and loop over each table.
select table_name from information_schema.tables where table_schema = _schemaname
Insert all distinct records in a TEMP TABLE.
Truncate your main table.
Insert all your data from TEMP TABLE to main table.
Drop the TEMP TABLE. (Dropping the temp table here is important, because it has to be recreated in the next loop iteration.)
Note: if your tables are very large, consider using a regular table instead of a TEMP TABLE.
I'm trying to EXECUTE some SELECTs to use inside a function, my code is something like this:
DECLARE
result_one record;
BEGIN
EXECUTE 'WITH Q1 AS
(
SELECT id
FROM table_two
INNER JOINs, WHERE, etc, ORDER BY... DESC
)
SELECT Q1.id
FROM Q1
WHERE, ORDER BY...DESC';
RETURN final_result;
END;
I know how to do it in MySQL, but in PostgreSQL I'm failing. What should I change or how should I do it?
For a function to be able to return multiple rows it has to be declared as returns table(...) (or returns setof ...).
And to actually return a result from within a PL/pgSQL function you need to use return query (as documented in the manual).
To build dynamic SQL in Postgres it is highly recommended to use the format() function to properly deal with identifiers (and to make the source easier to read).
So you need something like:
create or replace function get_data(p_sort_column text)
returns table (id integer)
as
$$
begin
return query execute
format(
'with q1 as (
select id
from table_two
join table_three on ...
)
select q1.id
from q1
order by %I desc', p_sort_column);
end;
$$
language plpgsql;
Note that the order by inside the CTE is pretty much useless if you are sorting the final query, unless you use a LIMIT or distinct on () inside the CTE.
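For example, a sketch (the created_at column is hypothetical):
with q1 as (
   select id
   from table_two
   order by created_at desc
   limit 10   -- the inner order by only matters because of this LIMIT
)
select q1.id
from q1
order by id desc;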
You can make your life even easier if you use another level of dollar quoting for the dynamic SQL:
create or replace function get_data(p_sort_column text)
returns table (id integer)
as
$$
begin
return query execute
format(
$query$
with q1 as (
select id
from table_two
join table_three on ...
)
select q1.id
from q1
order by %I desc
$query$, p_sort_column);
end;
$$
language plpgsql;
What a_horse said. And:
How to return result of a SELECT inside a function in PostgreSQL?
Plus, to pick a column for ORDER BY dynamically, you have to add that column to the SELECT list of your CTE, which leads to complications if the column can be duplicated (like with passing 'id') ...
Better yet, remove the CTE entirely. There is nothing in your question to warrant its use anyway. (Only use CTEs when needed in Postgres; they are typically slower than equivalent subqueries or simple queries.)
CREATE OR REPLACE FUNCTION get_data(p_sort_column text)
RETURNS TABLE (id integer) AS
$func$
BEGIN
RETURN QUERY EXECUTE format(
$q$
SELECT t2.id -- assuming you meant t2?
FROM table_two t2
JOIN table_three t3 on ...
ORDER BY t2.%I DESC NULLS LAST -- see below!
$q$, $1);
END
$func$ LANGUAGE plpgsql;
I appended NULLS LAST - you'll probably want that, too:
PostgreSQL sort by datetime asc, null first?
If p_sort_column is from the same table all the time, hard-code that table name / alias in the ORDER BY clause. Else, pass the table name / alias separately and auto-quote them separately to be safe:
Define table and column names as arguments in a plpgsql function?
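A hypothetical sketch of that, with the table name passed in as an extra parameter p_table_name:
RETURN QUERY EXECUTE format(
   'SELECT t.id FROM %I t ORDER BY t.%I DESC NULLS LAST'
 , p_table_name, p_sort_column);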
I suggest table-qualifying all column names in a bigger query with multiple joins (t2.id, not just id). This avoids various kinds of surprising results / confusion / abuse.
And you may want to schema-qualify your table names (myschema.table_two) to avoid similar troubles when calling the function with a different search_path:
How does the search_path influence identifier resolution and the "current schema"
I tried to simulate my problem in the code example below. In the code, I am doing a delete from test2 in a trigger function, and this works great.
However, in my case this delete is part of a rather complex CTE with several updates and inserts (there are no selects, so I add a dummy select 1 as the main query). Let's simulate it like this:
with my_cte as(delete from test2) select 1
Now, as we know, we have to use the perform keyword to execute this:
perform (with my_cte as(delete from test2) select 1);
I am getting the following error:
ERROR: WITH clause containing a data-modifying statement must be at the top level
Is this a limitation of plpgsql?
(Please note that this is just an example to explain my problem. I know the queries do not really make any sense.)
create table test
(
key int primary key
);
create table test2
(
key int primary key
);
create function test() returns trigger as
$$
begin
raise notice 'hello there';
-- this does work
delete from test2;
-- this doesn't work
perform (with my_cte as(delete from test2) select 1);
return new;
end;
$$
language plpgsql;
create trigger test after insert on test for each row execute procedure test();
insert into test(key) select 1;
You can use a CTE to combine several DELETE, INSERT and UPDATE ... RETURNING queries, and you don't need perform for it, e.g.:
t=# begin; do $$ begin with d as (delete from s133 returning *) insert into s133 select * from d; raise info '%',(select count(1) from s133);
end; $$; commit;
BEGIN
Time: 0.135 ms
INFO: 4
DO
Time: 0.469 ms
COMMIT
Time: 0.887 ms
t=# select count(1) from s133;
count
-------
4
(1 row)
Here I delete four rows and, in the CTE, insert them back.
As you found out, you can neither nest such a WITH clause in a subselect, nor can you do
WITH cte AS (...)
PERFORM 1;
One solution would be to use SELECT ... INTO dummy instead of PERFORM and ignore the result.
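A minimal sketch of that workaround inside the trigger function (dummy is a variable you would add to the DECLARE section):
-- DECLARE dummy integer;
with my_cte as (delete from test2)
select 1 into dummy;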
But I don't see why you cannot code the DELETEs, UPDATEs and INSERTs in your function with several SQL statements rather than bundling them into CTEs.
If you try to protect yourself from concurrent data modification, use a REPEATABLE READ transaction so that all your statements operate on the same snapshot of the database.
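A sketch of what that could look like from the calling session, using the tables from the question:
BEGIN ISOLATION LEVEL REPEATABLE READ;
INSERT INTO test(key) SELECT 1;  -- the trigger's statements all see the same snapshot
COMMIT;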
My idea is to implement a basic «vector clock», where timestamps are clock-based, always go forward and are guaranteed to be unique.
For example, in a simple table:
CREATE TABLE IF NOT EXISTS timestamps (
last_modified TIMESTAMP UNIQUE
);
I use a trigger to set the timestamp value before insertion. It basically just goes into the future when two inserts arrive at the same time:
CREATE OR REPLACE FUNCTION bump_timestamp()
RETURNS trigger AS $$
DECLARE
previous TIMESTAMP;
current TIMESTAMP;
BEGIN
previous := NULL;
SELECT last_modified INTO previous
FROM timestamps
ORDER BY last_modified DESC LIMIT 1;
current := clock_timestamp();
IF previous IS NOT NULL AND previous >= current THEN
current := previous + INTERVAL '1 milliseconds';
END IF;
NEW.last_modified := current;
RETURN NEW;
END;
$$ LANGUAGE plpgsql;
DROP TRIGGER IF EXISTS tgr_timestamps_last_modified ON timestamps;
CREATE TRIGGER tgr_timestamps_last_modified
BEFORE INSERT OR UPDATE ON timestamps
FOR EACH ROW EXECUTE PROCEDURE bump_timestamp();
I then run a massive amount of insertions in two separate clients:
DO
$$
BEGIN
FOR i IN 1..100000 LOOP
INSERT INTO timestamps DEFAULT VALUES;
END LOOP;
END;
$$;
As expected, I get collisions:
ERROR: duplicate key value violates unique constraint "timestamps_last_modified_key"
SQL state: 23505
Detail: Key (last_modified)=(2016-01-15 18:35:22.550367) already exists.
Context: SQL statement "INSERT INTO timestamps DEFAULT VALUES"
PL/pgSQL function inline_code_block line 4 at SQL statement
@rach suggested mixing clock_timestamp() with a SEQUENCE object, but it would probably imply getting rid of the TIMESTAMP type. Even so, I can't really figure out how it would solve the isolation problem...
Is there a common pattern to avoid this?
Thank you for your insights :)
If you have only one Postgres server, as you said, I think that using timestamp + sequence can solve the problem, because sequences are non-transactional and respect the insert order.
If you have sharded databases it will be much more complex, but maybe the distributed sequences of 2ndQuadrant's BDR could help, though I don't think ordering would be respected. I added some code below in case you have a setup to test it.
CREATE SEQUENCE "timestamps_seq";
-- Let's test first, how to generate id.
SELECT extract(epoch from now())::bigint::text || LPAD(nextval('timestamps_seq')::text, 20, '0') as unique_id ;
unique_id
--------------------------------
145288519200000000000000000010
(1 row)
CREATE TABLE IF NOT EXISTS timestamps (
unique_id TEXT UNIQUE NOT NULL DEFAULT extract(epoch from now())::bigint::text || LPAD(nextval('timestamps_seq')::text, 20, '0')
);
INSERT INTO timestamps DEFAULT VALUES;
INSERT INTO timestamps DEFAULT VALUES;
INSERT INTO timestamps DEFAULT VALUES;
select * from timestamps;
unique_id
--------------------------------
145288556900000000000000000001
145288557000000000000000000002
145288557100000000000000000003
(3 rows)
Let me know if that works. I'm not a DBA, so it might be good to ask on dba.stackexchange.com too about potential side effects.
My two cents (inspired by http://tapoueh.org/blog/2013/03/15-batch-update):
Try adding the following before the massive batch of insertions:
LOCK TABLE timestamps IN SHARE MODE;
Official documentation is here: http://www.postgresql.org/docs/current/static/sql-lock.html
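A sketch of where the lock would go, wrapping the insert loop from the question (LOCK TABLE only takes effect inside a transaction block):
BEGIN;
LOCK TABLE timestamps IN SHARE MODE;
DO
$$
BEGIN
    FOR i IN 1..100000 LOOP
        INSERT INTO timestamps DEFAULT VALUES;
    END LOOP;
END;
$$;
COMMIT;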
I am using PostgreSQL. I want to select data from a table whose name contains the current year, such as abc2013. I have tried:
select * from concat('abc', date_part('year', current_date))
select * from concat('abc', extract(year from current_date))
So how do I fetch data from such a table dynamically?
Please don't do this - look hard at alternatives first, starting with partitioning and constraint exclusion.
If you must use dynamic table names, do it at application level during query generation.
If all else fails you can use a PL/PgSQL procedure like:
CREATE OR REPLACE FUNCTION pleasedont(year int) RETURNS SETOF basetable AS $$
BEGIN
    RETURN QUERY EXECUTE format('SELECT col1, col2, col3 FROM %I', 'basetable_' || year);
END;
$$ LANGUAGE plpgsql;
This will only work if you have a base table that has the same structure as the sub-tables. It's also really painful to work with when you start adding qualifiers (where clause constraints, etc), and it prevents any kind of plan caching or effective prepared statement use.
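A hypothetical call, assuming a sub-table named basetable_2013 exists:
SELECT * FROM pleasedont(2013);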