Slow query due to invalid planner stats - even after analyze

I have 2 tables:
CREATE TABLE sf.dir_current (
id BIGINT primary key,
volume_id INTEGER NOT NULL,
path VARCHAR NOT NULL
);
CREATE INDEX dir_volid_path_indx ON sf.dir_current (volume_id, path);
CREATE TABLE sf.event (
id BIGINT, -- no primary key here!
volume_id INTEGER NOT NULL,
parent_path VARCHAR NOT NULL,
type BIGINT,
depth INTEGER
);
Table dir_current contains ~50 million rows, and in all of them volume_id = 1. Table event contains ~20K rows.
I execute the following query (in a PL/pgSQL function; VOL_ID, MIN_ID, MAX_ID and so on are function parameters):
select dir.id as parent_id, event as event_row
from sf.event as event
left outer join sf.dir_current as dir on dir.volume_id = VOL_ID and parent_path = dir.path
where event.volume_id = VOL_ID
and event.id between MIN_ID and MAX_ID
and (DEPTH_FILTER is null or event.depth = DEPTH_FILTER)
and (TYPE_FILTER is null or event.type = TYPE_FILTER)
order by event.depth;
Everything works fine while all rows in the dir_current table have volume_id = 1. After adding a few thousand rows with volume_id = 2 (and running ANALYZE), this query takes a very long time.
Here is the EXPLAIN output of the long-running query: explain.depesz.com
As is clearly visible, the query planner had no idea that there are so many rows with volume_id = 2 and created a plan far from optimal.
After some debugging I found out that ANALYZE did not find any row with volume_id = 2. I confirmed it with this query:
starfish=# SELECT most_common_vals, n_distinct FROM pg_stats WHERE tablename = 'dir_current' and attname = 'volume_id';
most_common_vals | n_distinct
------------------+------------
{1} | 1
(1 row)
After a few more ANALYZE runs it finally finds some rows with volume_id = 2 and the query gets back to its normal execution time: explain.depesz.com
Question: how can I prevent the extremely long query time? Is there a way to force ANALYZE to find these rows? Or maybe to modify the stats for this column manually (setting n_distinct for the volume_id column does not help)?
I'm using PostgreSQL 9.5.
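One thing that may be worth trying, offered as a sketch rather than a confirmed fix: ANALYZE reads a random sample whose size scales with the column's statistics target (300 rows per target unit, so 30,000 rows at the default target of 100). A few thousand rows out of ~50 million can easily be missed by a sample that small. Raising the target for volume_id enlarges the sample. The value 10000 below is the maximum PostgreSQL accepts and is only an assumption about what this table needs; ANALYZE will take longer with it.

```sql
-- Sketch: enlarge the ANALYZE sample for this one column.
-- 10000 is an assumed value (the maximum); a smaller target may be enough.
ALTER TABLE sf.dir_current ALTER COLUMN volume_id SET STATISTICS 10000;
ANALYZE sf.dir_current;

-- Check whether volume_id = 2 now shows up in the collected statistics.
SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE schemaname = 'sf'
  AND tablename = 'dir_current'
  AND attname = 'volume_id';
```

If volume_id = 2 appears in most_common_vals afterwards, the planner has a row estimate to work with for that value.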

Related

PostgreSQL Transaction to Use Results from Query to Insert and Query another Table then Return Original Query Results

I am writing an application that stores data on file samples and YARA signatures. Essentially, in a single transaction, I need to execute a query, reference those results in an insert and another query, then return the original results. I have three tables that are relevant to this discussion:
samples - this is the table that stores information on files that need to be scanned with the associated YARA signatures.
yararules - the table that stores information on the YARA rules.
yaratracker - a table that tracks the sample/rule pairs that have been processed thus far.
In a single transaction, the application needs to:
1. Get a batch of unique sample/rule pairs that have not yet been processed. Preferably, this query will get all non-processed rules associated with a single sample (i.e. if I'm going to run the YARA rules on a sample, I want to run all of the YARA rules not yet processed on that sample so that I only have to load the sample into memory once).
2. Get a unique list of id,sha256 from the batch found in step 1.
3. Insert the batch from step 1 into the yaraqueue with the matchcount column equal to 0 and complete column set to false.
I can accomplish Step 1 with the query below, but I don't know how to reference those results to accomplish step 2. I've tried looking into variables, but apparently there isn't one that can hold multiple rows. I've looked into using a cursor, but I can't seem to use the cursor with a subsequent command and then return the cursor.
SELECT s.id,r.id
FROM sample s CROSS JOIN yararules r
WHERE r.status = 'Disabled' AND NOT EXISTS(
SELECT 1 FROM yaratracker q
WHERE q.sample_id = s.id AND q.rule_id = r.id
)
ORDER BY s.id
LIMIT 1000;
The relevant database schema looks like this.
CREATE TYPE samplelist AS ENUM ('Whitelist', 'Blacklist', 'Greylist', 'Unknown');
CREATE TABLE samples (
id SERIAL PRIMARY KEY,
md5 CHAR(32) NOT NULL,
sha1 CHAR(40) NOT NULL,
sha256 CHAR(64) NOT NULL,
total INT NOT NULL,
positives INT NOT NULL,
list SAMPLELIST NOT NULL,
filetype VARCHAR(16) NOT NULL,
submitted TIMESTAMP WITH TIME ZONE NOT NULL,
user_id SERIAL REFERENCES users
);
CREATE UNIQUE INDEX md5_idx ON {0} (md5);
CREATE UNIQUE INDEX sha1_idx ON {0} (sha1);
CREATE UNIQUE INDEX sha256_idx ON {0} (sha256);
CREATE TYPE rulestatus AS ENUM ('Enabled', 'Disabled');
CREATE TABLE yararules (
id SERIAL PRIMARY KEY,
name VARCHAR(32) NOT NULL UNIQUE,
description TEXT NOT NULL,
rules TEXT NOT NULL,
lastmodified TIMESTAMP WITH TIME ZONE NOT NULL,
status rulestatus NOT NULL,
user_id SERIAL REFERENCES users ON DELETE CASCADE
);
CREATE TABLE yaratracker (
id SERIAL PRIMARY KEY,
rule_id SERIAL REFERENCES yararules ON DELETE CASCADE,
sample_id SERIAL REFERENCES sample ON DELETE CASCADE,
matchcount INT NOT NULL,
complete BOOL NOT NULL
);
CREATE INDEX composite_idx ON yaratracker (rule_id, sample_id);
CREATE INDEX complete_idx ON yaratracker (complete);

INSERT INTO target_table(a,b,c,...)
SELECT sid, rid, sha, ...
FROM (
SELECT s.id AS sid
,r.id AS rid
, s.sha256 AS sha
, ...
, ROW_NUMBER() OVER (PARTITION BY s.id) AS rn -- <<<--- HERE
FROM sample s CROSS JOIN yararules r
WHERE r.status = 'Disabled' AND NOT EXISTS(
SELECT 1 FROM yaratracker q
WHERE q.sample_id = s.id
AND q.rule_id = r.id
)
ORDER BY s.id
LIMIT 1000
) src
WHERE src.rn = 1; -- <<<--- HERE
The WHERE src.rn = 1 will restrict the cross-join to deliver only one tuple per sample.id (both id and sha256 are unique in the sample table, so picking a unique id has the same effect as picking a unique sha256)
The complete cross-join result will never be generated; the optimiser is smart enough to push down the WHERE rn=1 condition into the subquery.
Note: the LIMIT 1000 should probably be removed (or pulled up to a higher level)
If you REALLY need to save the results from the CROSS JOIN, you could use a chain of CTEs (expect a performance degradation ...)
WITH big AS (
SELECT s.id AS sample_id
,r.id AS rule_id
, s.sha256
-- , ...
, ROW_NUMBER() OVER (PARTITION BY s.id) AS rn -- <<<--- HERE
FROM sample s
CROSS JOIN yararules r
WHERE r.status = 'Disabled' AND NOT EXISTS(
SELECT 1 FROM yaratracker q
WHERE q.sample_id = s.id AND q.rule_id = r.id
)
)
, ins AS (
INSERT INTO target_table(a,b,c,...)
SELECT b.sample_id, b.rule_id, b.sha256 , ...
FROM big b
WHERE b.rn = 1 -- <<<--- HERE
RETURNING *
)
INSERT INTO yaratracker (rule_id, sample_id, matchcount, complete )
SELECT b.rule_id, b.sample_id, 0, False
FROM big b
-- LEFT JOIN ins i ON i.a = b.sample_id AND i.b= b.rule_id
;
NOTE: the yaratracker(rule_id,sample_id) should not be serials but just plain integers, referencing yararules(id) and sample(id)
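The three steps from the question can also be sketched as one statement built on a data-modifying CTE. This is a minimal sketch, not a tested solution: it assumes the table is named sample as in the question's query (the schema calls it samples), that the tracked pairs go into yaratracker, and that the question's r.status = 'Disabled' filter is intended. The CTE names batch and ins are illustrative.

```sql
-- Sketch: select a batch, record it in yaratracker, return the distinct samples.
WITH batch AS (
   SELECT s.id AS sample_id, s.sha256, r.id AS rule_id
   FROM   sample s
   CROSS  JOIN yararules r
   WHERE  r.status = 'Disabled'
   AND    NOT EXISTS (
      SELECT 1
      FROM   yaratracker q
      WHERE  q.sample_id = s.id
      AND    q.rule_id = r.id
      )
   ORDER  BY s.id
   LIMIT  1000
   )
, ins AS (
   -- Data-modifying CTEs run to completion even if the outer query never reads them.
   INSERT INTO yaratracker (rule_id, sample_id, matchcount, complete)
   SELECT rule_id, sample_id, 0, false
   FROM   batch
   )
SELECT DISTINCT sample_id, sha256
FROM   batch;
```

Because batch is referenced twice, it is evaluated once and both the INSERT and the final SELECT see the same rows.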

Function taking forever to run for large number of records

I have created the following function in Postgres 9.3.5:
CREATE OR REPLACE FUNCTION get_result(val1 text, val2 text)
RETURNS text AS
$BODY$
DECLARE
result text;
BEGIN
select min(id) into result from table
where id_used is null and id_type = val2;
update table set
id_used = 'Y',
col1 = val1,
id_used_date = now()
where id_type = val2
and id = result;
RETURN result;
END;
$BODY$
LANGUAGE plpgsql VOLATILE COST 100;
When I run this function in a loop over 1000 or more records, it just freezes and says "query is running". When I check my table, nothing is being updated. When I run it for one or two records, it runs fine.
Example of the function when being run:
select get_result('123','idtype');
table columns:
id character varying(200),
col1 character varying(200),
id_used character varying(1),
id_used_date timestamp without time zone,
id_type character(200)
id is the table index.
Can someone help?

Most probably you are running into race conditions. When you run your function 1000 times in quick succession in separate transactions, something like this happens:
T1                      T2                            T3                            ...
SELECT min(id) -- id 1
                        SELECT min(id) -- id 1
                                                      SELECT min(id) -- id 1
                                                                                    ...
                        Row id 1 locked, wait ...
                                                      Row id 1 locked, wait ...
UPDATE id 1
                                                                                    ...
COMMIT
                        Wake up, UPDATE id 1 again!
                        COMMIT
                                                      Wake up, UPDATE id 1 again!
                                                      COMMIT
                                                                                    ...
Largely rewritten and simplified as SQL function:
CREATE OR REPLACE FUNCTION get_result(val1 text, val2 text)
RETURNS text AS
$func$
UPDATE table t
SET id_used = 'Y'
, col1 = val1
, id_used_date = now()
FROM (
SELECT id
FROM table
WHERE id_used IS NULL
AND id_type = val2
ORDER BY id
LIMIT 1
FOR UPDATE -- lock to avoid race condition! see below ...
) t1
WHERE t.id_type = val2
-- AND t.id_used IS NULL -- repeat condition (not if row is locked)
AND t.id = t1.id
RETURNING t.id; -- qualified: both t and t1 expose a column named id
$func$ LANGUAGE sql;
Related question with a lot more explanation:
Atomic UPDATE .. SELECT in Postgres
Explain
Don't run two separate SQL statements. That is more expensive and widens the time frame for race conditions. One UPDATE with a subquery is much better.
You don't need PL/pgSQL for this simple task. You can still use PL/pgSQL; the UPDATE stays the same.
You need to lock the selected row to defend against race conditions. But you cannot do this with the aggregate function you had because, per the documentation:
The locking clauses cannot be used in contexts where returned rows
cannot be clearly identified with individual table rows; for example
they cannot be used with aggregation.
Bold emphasis mine. Luckily, you can easily replace min(id) with the equivalent ORDER BY / LIMIT 1 I provided above. It can use an index just as well.
If the table is big, you need an index on id at least. Assuming that id is indexed already as PRIMARY KEY, that would help. But this additional partial multicolumn index would probably help a lot more:
CREATE INDEX foo_idx ON table (id_type, id)
WHERE id_used IS NULL;
Alternative solutions
Advisory locks may be the superior approach here:
Postgres UPDATE ... LIMIT 1
Or you may want to lock many rows at once:
How to mark certain nr of rows in table on concurrent access
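A further option exists on newer servers only: PostgreSQL 9.5 added SKIP LOCKED to the locking clause, which lets concurrent callers pick different free rows instead of queueing on the same one. The question targets 9.3.5, where this is not available, so the following is only a sketch for after an upgrade. The table name id_pool is a hypothetical stand-in for the table called table above, and the literals stand in for the function parameters.

```sql
-- Sketch, assuming PostgreSQL 9.5+ and a hypothetical table name id_pool.
UPDATE id_pool t
SET    id_used = 'Y'
     , col1 = '123'              -- val1 in the function above
     , id_used_date = now()
FROM  (
   SELECT id
   FROM   id_pool
   WHERE  id_used IS NULL
   AND    id_type = 'idtype'     -- val2 in the function above
   ORDER  BY id
   LIMIT  1
   FOR    UPDATE SKIP LOCKED     -- skip rows another transaction has locked
   ) t1
WHERE  t.id = t1.id
RETURNING t.id;
```

With SKIP LOCKED the returned id is no longer guaranteed to be the smallest unused one at that instant, only the smallest one that is not currently locked.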

Does postgres postgis ST_makeline have a max number of points it can create a line from?

My database has a table with tons of geometry(PointZ,4326) values. I am doing a lot of my processing on the database side, and I've noticed that when I use ST_MakeLine I seem to be hitting a cap on the number of points it will make a line from. My table and function/query are below.
It works as long as the number of track_points returned from the subquery is less than 97. I know this because the insert puts data in the table for all columns when there are 96 points or fewer. For all records with 97 or more points, all it inserts is the track_id, start_time and end_time.
I'm wondering whether this is a bug in PostGIS's ST_MakeLine function or a setting in Postgres that I need to modify.
CREATE TABLE track_line_strings(
track_id bigint NOT NULL,
linestring geometry(LINESTRINGZ,4326),
start_time bigint NOT NULL,
end_time bigint NOT NULL,
CONSTRAINT track_line_strings_pk PRIMARY KEY (track_id)
);
CREATE OR REPLACE FUNCTION create_track_line_string() RETURNS trigger
LANGUAGE plpgsql
AS $$
DECLARE
TRACKITEMID bigint := new.track_item_id;
TRACKID bigint := track_id from track_item ti where ti.id = TRACKITEMID;
STARTTIME bigint := MIN(ti.item_time) from track_item ti where ti.track_id = TRACKID;
ENDTIME bigint := MAX(ti.item_time) from track_item ti where ti.track_id = TRACKID;
BEGIN
IF EXISTS (SELECT track_id from track_line_strings where track_id = TRACKID)
THEN
UPDATE track_line_strings
SET start_time = STARTTIME, end_time = ENDTIME, linestring = (
SELECT ST_Makeline(e.trackPosition) FROM
(
Select track_id, tp.track_position AS trackPosition
FROM track_point tp JOIN track_item ti ON tp.track_item_id = ti.id
where ti.track_id = TRACKID ORDER BY ti.item_time ASC
) E )
WHERE track_id = TRACKID;
ELSE
INSERT INTO track_line_strings(track_id, linestring, start_time, end_time)
SELECT TRACKID, ST_Makeline(e.trackPosition), STARTTIME, ENDTIME FROM
(
Select track_id, tp.track_position AS trackPosition
FROM track_point tp JOIN track_item ti ON tp.track_item_id = ti.id
where ti.track_id = TRACKID ORDER BY ti.item_time ASC
)E;
END IF;
RETURN new;
END;
$$;

The database limits are pretty high: 1 GB worth of geometry data in a field. It depends on what kind of point geometry, but it will be on the order of tens of millions of point geometries that can be used to construct a LineString.
You will see a proper error message with something about "exceeded size" if it is a limitation.
Apparent empty or missing data with pgAdminIII is a common question, but not related to database limitations:
http://postgis.net/2013/10/05/tip_pgAdmin_shows_no_data
http://postgis.net/docs/manual-dev/PostGIS_FAQ.html#pgadmin_shows_no_data_in_geom

There doesn't appear to be a limit. I was viewing results in pgAdminIII, and there must be a limit on the number of characters its data output can display for each column. I only realized this by copying and pasting the results into a text file, which showed that it did in fact return a value for the lines that have more than 96 points.
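One way to check this without depending on how a client renders long geometry text is to ask PostGIS for the point count directly. A minimal sketch against the track_line_strings table from the question; the column aliases are illustrative:

```sql
-- Sketch: inspect stored linestrings without rendering the full geometry text.
SELECT track_id
     , linestring IS NULL      AS linestring_missing
     , ST_NPoints(linestring)  AS point_count   -- NULL when linestring is NULL
FROM   track_line_strings
ORDER  BY point_count DESC NULLS LAST;
```

Rows that look empty in the client's output grid but report a point_count of 97 or more were stored correctly.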

How to insert values from another table in PostgreSQL?

I have a table which references other tables:
CREATE TABLE scratch
(
id SERIAL PRIMARY KEY,
name TEXT NOT NULL,
rep_id INT NOT NULL REFERENCES reps,
term_id INT REFERENCES terms
);
CREATE TABLE reps (
id SERIAL PRIMARY KEY,
rep TEXT NOT NULL UNIQUE
);
CREATE TABLE terms (
id SERIAL PRIMARY KEY,
terms TEXT NOT NULL UNIQUE
);
I wish to add a new record to scratch given the name, the rep and the terms values, i.e. I have neither corresponding rep_id nor term_id.
Right now the only idea that I have is:
insert into scratch (name, rep_id, term_id)
values ('aaa', (select id from reps where rep='Dracula' limit 1), (select id from terms where terms='prepaid' limit 1));
My problem is this. I am trying to use the parameterized query API (from node using the node-postgres package), where an insert query looks like this:
insert into scratch (name, rep_id, term_id) values ($1, $2, $3);
and then an array of values for $1, $2 and $3 is passed as a separate argument. At the end, when I am comfortable with the parameterized queries the idea is to promote them to prepared statements to utilize the most efficient and safest way to query the database.
However, I am puzzled as to how I can do this with my example, where different tables have to be subqueried.
P.S. I am using PostgreSQL 9.2 and have no problem with a PostgreSQL specific solution.
EDIT 1
C:\Users\markk>psql -U postgres
psql (9.2.4)
WARNING: Console code page (437) differs from Windows code page (1252)
8-bit characters might not work correctly. See psql reference
page "Notes for Windows users" for details.
Type "help" for help.
postgres=# \c dummy
WARNING: Console code page (437) differs from Windows code page (1252)
8-bit characters might not work correctly. See psql reference
page "Notes for Windows users" for details.
You are now connected to database "dummy" as user "postgres".
dummy=# DROP TABLE scratch;
DROP TABLE
dummy=# CREATE TABLE scratch
dummy-# (
dummy(# id SERIAL NOT NULL PRIMARY KEY,
dummy(# name text NOT NULL UNIQUE,
dummy(# rep_id integer NOT NULL,
dummy(# term_id integer
dummy(# );
NOTICE: CREATE TABLE will create implicit sequence "scratch_id_seq" for serial column "scratch.id"
NOTICE: CREATE TABLE / PRIMARY KEY will create implicit index "scratch_pkey" for table "scratch"
NOTICE: CREATE TABLE / UNIQUE will create implicit index "scratch_name_key" for table "scratch"
CREATE TABLE
dummy=# DEALLOCATE insert_scratch;
ERROR: prepared statement "insert_scratch" does not exist
dummy=# PREPARE insert_scratch (text, text, text) AS
dummy-# INSERT INTO scratch (name, rep_id, term_id)
dummy-# SELECT $1, r.id, t.id
dummy-# FROM reps r, terms t
dummy-# WHERE r.rep = $2 AND t.terms = $3
dummy-# RETURNING id, name, $2 rep, $3 terms;
PREPARE
dummy=# DEALLOCATE insert_scratch2;
ERROR: prepared statement "insert_scratch2" does not exist
dummy=# PREPARE insert_scratch2 (text, text, text) AS
dummy-# INSERT INTO scratch (name, rep_id, term_id)
dummy-# VALUES ($1, (SELECT id FROM reps WHERE rep=$2 LIMIT 1), (SELECT id FROM terms WHERE terms=$3 LIMIT 1))
dummy-# RETURNING id, name, $2 rep, $3 terms;
PREPARE
dummy=# EXECUTE insert_scratch ('abc', 'Snowhite', '');
id | name | rep | terms
----+------+-----+-------
(0 rows)
INSERT 0 0
dummy=# EXECUTE insert_scratch2 ('abc', 'Snowhite', '');
id | name | rep | terms
----+------+----------+-------
1 | abc | Snowhite |
(1 row)
INSERT 0 1
dummy=# EXECUTE insert_scratch ('abcd', 'Snowhite', '30 days');
id | name | rep | terms
----+------+----------+---------
2 | abcd | Snowhite | 30 days
(1 row)
INSERT 0 1
dummy=# EXECUTE insert_scratch2 ('abcd2', 'Snowhite', '30 days');
id | name | rep | terms
----+-------+----------+---------
3 | abcd2 | Snowhite | 30 days
(1 row)
INSERT 0 1
dummy=#
EDIT 2
We can utilize the fact that rep_id is required, even though term_id is optional, and use the following version of INSERT-SELECT:
PREPARE insert_scratch (text, text, text) AS
INSERT INTO scratch (name, rep_id, term_id)
SELECT $1, r.id, t.id
FROM reps r
LEFT JOIN terms t ON t.terms = $3
WHERE r.rep = $2
RETURNING id, name, $2 rep, $3 terms;
This version, however, has two problems:
1. No distinction is made between a missing terms value (i.e. '') and an invalid terms value (i.e. a non-empty value missing from the terms table entirely). Both are treated as missing terms. (But the INSERT with two subqueries suffers from the same problem.)
2. The version depends on the fact that the rep is required. But what if rep_id was optional too?
EDIT 3
Found the solution for item 2: eliminating the dependency on rep being required. In addition, using the WHERE clause has the problem that the SQL does not fail if the rep is invalid; it just inserts 0 rows, whereas I want it to fail explicitly in this case. My solution is simply to use a dummy one-row CTE:
PREPARE insert_scratch (text, text, text) AS
WITH stub(x) AS (VALUES (0))
INSERT INTO scratch (name, rep_id, term_id)
SELECT $1, r.id, t.id
FROM stub
LEFT JOIN terms t ON t.terms = $3
LEFT JOIN reps r ON r.rep = $2
RETURNING id, name, rep_id, term_id;
If rep is missing or invalid, this sql will try to insert NULL into the rep_id field and since the field is NOT NULL an error would be raised - precisely what I need. And if further I decide to make rep optional - no problem, the same SQL works for that too.

INSERT into scratch (name, rep_id, term_id)
SELECT 'aaa'
, r.id
, t.id
FROM reps r , terms t -- essentially a cross join
WHERE r.rep = 'Dracula'
AND t.terms = 'prepaid'
;
Notes:
You don't need the ugly LIMITs, since r.rep and t.terms are unique (candidate keys).
You could replace the FROM a, b with FROM a CROSS JOIN b.
The scratch table will probably need a UNIQUE constraint on (rep_id, term_id) (the nullability of term_id is questionable).
UPDATE: the same as a prepared query, as found in the documentation:
PREPARE hoppa (text, text,text) AS
INSERT into scratch (name, rep_id, term_id)
SELECT $1 , r.id , t.id
FROM reps r , terms t -- essentially a cross join
WHERE r.rep = $2
AND t.terms = $3
;
EXECUTE hoppa ('bbb', 'Dracula' , 'prepaid' );
SELECT * FROM scratch;
UPDATE2: test data
DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE reps ( id SERIAL PRIMARY KEY, rep TEXT NOT NULL UNIQUE);
CREATE TABLE terms ( id SERIAL PRIMARY KEY, terms TEXT NOT NULL UNIQUE);
CREATE TABLE scratch ( id SERIAL PRIMARY KEY, name TEXT NOT NULL, rep_id INT NOT NULL REFERENCES reps, term_id INT REFERENCES terms);
INSERT INTO reps(rep) VALUES( 'Dracula' );
INSERT INTO terms(terms) VALUES( 'prepaid' );
Results:
NOTICE: drop cascades to 3 other objects
DETAIL: drop cascades to table tmp.reps
drop cascades to table tmp.terms
drop cascades to table tmp.scratch
DROP SCHEMA
CREATE SCHEMA
SET
CREATE TABLE
CREATE TABLE
CREATE TABLE
INSERT 0 1
INSERT 0 1
INSERT 0 1
PREPARE
INSERT 0 1
id | name | rep_id | term_id
----+------+--------+---------
1 | aaa | 1 | 1
2 | bbb | 1 | 1
(2 rows)

SELECT or INSERT a row in one command

I'm using PostgreSQL 9.0 and I have a table with just an artificial key (auto-incrementing sequence) and another unique key. (Yes, there is a reason for this table. :)) I want to look up an ID by the other key or, if it doesn't exist, insert it:
SELECT id
FROM mytable
WHERE other_key = 'SOMETHING'
Then, if no match:
INSERT INTO mytable (other_key)
VALUES ('SOMETHING')
RETURNING id
The question: is it possible to save a round-trip to the DB by doing both of these in one statement? I can insert the row if it doesn't exist like this:
INSERT INTO mytable (other_key)
SELECT 'SOMETHING'
WHERE NOT EXISTS (SELECT * FROM mytable WHERE other_key = 'SOMETHING')
RETURNING id
... but that doesn't give the ID of an existing row. Any ideas? There is a unique constraint on other_key, if that helps.

Have you tried to union it?
Edit - this requires Postgres 9.1:
create table mytable (id serial primary key, other_key varchar not null unique);
WITH new_row AS (
INSERT INTO mytable (other_key)
SELECT 'SOMETHING'
WHERE NOT EXISTS (SELECT * FROM mytable WHERE other_key = 'SOMETHING')
RETURNING *
)
SELECT * FROM new_row
UNION
SELECT * FROM mytable WHERE other_key = 'SOMETHING';
results in:
id | other_key
----+-----------
1 | SOMETHING
(1 row)

No, there is no special SQL syntax that allows you to do select or insert. You can do what Ilia mentions and create a sproc, which means it will not do a round trip from the client to the server, but it will still result in two queries (three actually, if you count the sproc itself).

Using 9.5, I successfully tried this,
based on Denis de Bernardy's answer:
only 1 parameter
no union
no stored procedure
atomic, thus no concurrency problems (I think...)
The Query:
WITH neworexisting AS (
INSERT INTO mytable(other_key) VALUES('hello 1')
ON CONFLICT(other_key) DO UPDATE SET existed=true -- need some update to return sth
RETURNING *
)
SELECT * FROM neworexisting
first call:
id|other_key|created |existed|
--|---------|-------------------|-------|
6|hello 1 |2019-09-11 11:39:29|false |
second call:
id|other_key|created |existed|
--|---------|-------------------|-------|
6|hello 1 |2019-09-11 11:39:29|true |
First create your table ;-)
CREATE TABLE mytable (
id serial NOT NULL,
other_key text NOT NULL,
created timestamptz NOT NULL DEFAULT now(),
existed bool NOT NULL DEFAULT false,
CONSTRAINT mytable_pk PRIMARY KEY (id),
CONSTRAINT mytable_uniq UNIQUE (other_key) --needed for on conflict
);

You can use a stored procedure:
IF NOT EXISTS (SELECT 1 FROM mytable WHERE other_key = 'SOMETHING') THEN
INSERT INTO mytable (other_key) VALUES ('SOMETHING');
END IF;

I have an alternative to Denis's answer that I think is less database-intensive, although a bit more complex:
create table mytable (id serial primary key, other_key varchar not null unique);
WITH table_sel AS (
SELECT id
FROM mytable
WHERE other_key = 'test'
UNION
SELECT NULL AS id
ORDER BY id NULLS LAST
LIMIT 1
), table_ins AS (
INSERT INTO mytable (id, other_key)
SELECT
COALESCE(id, NEXTVAL('mytable_id_seq'::REGCLASS)),
'test'
FROM table_sel
ON CONFLICT (id) DO NOTHING
RETURNING id
)
SELECT * FROM table_ins
UNION ALL
SELECT * FROM table_sel
WHERE id IS NOT NULL;
In the table_sel CTE I look for the right row. If I don't find it, the UNION with SELECT NULL ensures that table_sel still returns at least one row.
In the table_ins CTE I try to insert the same row I was looking for earlier. COALESCE(id, NEXTVAL('mytable_id_seq'::REGCLASS)) says: if id is defined, use it; if id is null, advance the sequence on id and use the new value to insert a row. The ON CONFLICT clause ensures that if id is already in mytable, nothing is inserted.
At the end I put everything together with a UNION ALL between table_ins and table_sel, so that I am sure to get my sweet id value and that both CTEs are executed.
This query needs to search for the other_key value only once, and it is a "search for this value" lookup rather than a "check that this value does not exist in the table" lookup, which is very heavy; Denis's alternative uses other_key in both types of search. In my query, the "check that a value does not exist" happens only on id, which is an integer primary key and therefore fast by construction.

Minor tweak a decade late to Denis's excellent answer:
-- Create the table with a unique constraint
CREATE TABLE mytable (
id serial PRIMARY KEY
, other_key varchar NOT NULL UNIQUE
);
WITH new_row AS (
-- Only insert when we don't find anything, avoiding a table lock if
-- possible.
INSERT INTO mytable ( other_key )
SELECT 'SOMETHING'
WHERE NOT EXISTS (
SELECT *
FROM mytable
WHERE other_key = 'SOMETHING'
)
RETURNING *
)
(
-- This comes first in the UNION ALL since it'll almost certainly be
-- in the query cache. Marginally slower for the insert case, but also
-- marginally faster for the much more common read-only case.
SELECT *
FROM mytable
WHERE other_key = 'SOMETHING'
-- Don't check for duplicates to be removed
UNION ALL
-- If we reach this point in iteration, we needed to do the INSERT and
-- lock after all.
SELECT *
FROM new_row
) LIMIT 1 -- Just return whatever comes first in the results and allow
-- the query engine to cut processing short for the INSERT
-- calculation.
;
The UNION ALL tells the planner it doesn't have to collect results for de-duplication. The LIMIT 1 at the end allows the planner to short-circuit further processing/iteration once it knows there's an answer available.
NOTE: There is a race condition present here and in the original answer. If two transactions run this concurrently for an entry that does not yet exist, one INSERT will fail with a unique constraint violation. The error can be suppressed with ON CONFLICT DO NOTHING, but the query will then return an empty set instead of the new row. This is a difficult problem because getting that info from another transaction would violate the I in ACID.
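The ON CONFLICT DO NOTHING variant mentioned in the note can be sketched as follows, assuming PostgreSQL 9.5 or later and the mytable definition above. Under concurrent inserts of the same key the statement can still come back empty, because a row committed by the other transaction after this statement started is not visible to its snapshot. A caller would have to detect the empty result and run the statement again.

```sql
-- Sketch, assuming PostgreSQL 9.5+ and the mytable definition above.
WITH ins AS (
   INSERT INTO mytable (other_key)
   VALUES ('SOMETHING')
   ON CONFLICT (other_key) DO NOTHING   -- suppress the unique violation
   RETURNING id
   )
SELECT id FROM ins                      -- the row we just inserted, if any
UNION  ALL
SELECT id FROM mytable                  -- otherwise the pre-existing row
WHERE  other_key = 'SOMETHING'
LIMIT  1;
```

An empty result therefore means "try again", not "the key does not exist".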
