Version: PostgreSQL 9.4.2
  Column   |  Type   |                    Modifiers
-----------+---------+-------------------------------------------------
 id        | integer | not null default nextval('T1_id_seq'::regclass)
 name      | text    |
 value     | text    |
 parent_id | integer |
Indexes:
    "T1_pkey" PRIMARY KEY, btree (id)
    "T1_id_idx" btree (id)
I have two tables like this in PostgreSQL, say T1 and T2, each holding a tree-like data structure that references rows in its own table.
I need to modify some rows from T1 and insert them into T2 in exactly the order in which they appeared in T1. What I have done so far is copy the relevant rows from T1 to a temporary table T3 for modification, and then insert everything from T3 into T2 once the changes are made.
T3 is created using
CREATE TABLE T3 (LIKE T1 INCLUDING ALL);
INSERT INTO T3 SELECT * FROM T1;
The end result is rather strange. All the data from T3 were copied to T2, but the order of the ids seems to be random.
However, the result is correct if I invoke the same script to copy the data from T1 to T3 directly. What is even more bizarre is that it is also correct if I split the above script into two separate scripts:
1. Create T3 from T1 and copy the data from T1 to T3.
2. Copy T3 to T2 using the INSERT method.
Any clues?
You didn't specify an ORDER BY clause. Without one, PostgreSQL might fetch the rows for your SELECT in whatever order happens to be fastest to execute.
Try:
CREATE TABLE T3 (LIKE T1 INCLUDING ALL);
INSERT INTO T3
SELECT * FROM T1 ORDER BY T1.id;
Note that, strictly speaking, there is no guarantee that an INSERT of multiple rows will process them in the order they are read from the SELECT, but in practice PostgreSQL currently does process them in order, and that is not likely to change in a hurry.
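Putting it together, a sketch of the whole copy with the question's T1/T2/T3 names (the same ORDER BY advice applies when T3 is finally copied into T2):
CREATE TABLE T3 (LIKE T1 INCLUDING ALL);

INSERT INTO T3
SELECT * FROM T1
ORDER BY T1.id;

-- ... modify the rows in T3 here ...

INSERT INTO T2
SELECT * FROM T3
ORDER BY T3.id;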
Related
Question
My query (simple, with a few joins) runs extremely slowly when I have a small amount of data (~50k rows) but runs fast when I have a larger amount (~180k rows). The difference is huge: from a few seconds to almost half an hour.
Attempts
I have re-checked the joins and they are all correct. In addition, I ran VACUUM ANALYZE on the table before running the query, but it didn't solve anything. I also checked whether locks were blocking the query or whether the connection was slow, but neither was the cause.
So I looked at the output of EXPLAIN. In the slow case the plan contains unnecessary extra sorts and gets stuck in a nested loop that does not appear at all in the case with more data. I'm not sure how to get Postgres to use the same plan as in the larger-dataset scenario.
Based on a comment, I also tried avoiding the CTEs, but that doesn't help either: the plan still contains the nested loops and the sorts.
Details:
Postgres version: PostgreSQL 12.3
Full query text:
WITH t0 AS (SELECT * FROM original_table WHERE id=0),
t1 AS (SELECT * FROM original_table WHERE id=1),
t2 AS (SELECT * FROM original_table WHERE id=2),
t3 AS (SELECT * FROM original_table WHERE id=3),
t4 AS (SELECT * FROM original_table WHERE id=4)
SELECT
t0.dtime,
t1.dtime,
t3.dtime,
t3.dtime::date,
t4.dtime,
t1.first_id,
t1.field,
t1.second_id,
t1.third_id,
t2.fourth_id,
t4.fourth_id
FROM t1
LEFT JOIN t0 ON t1.first_id=t0.first_id
JOIN t2 ON t1.first_id=t2.first_id AND t1.second_id = t2.second_id AND t1.third_id = t2.third_id
JOIN t3 ON t1.first_id=t3.first_id AND t1.second_id = t3.second_id AND t1.third_id = t3.third_id
JOIN t4 ON t1.first_id=t4.first_id AND t1.second_id = t4.second_id AND t1.fourth_id= t4.third_id
ORDER BY t3.dtime
;
Table definition:
  Column   |            Type
-----------+-----------------------------
 id        | smallint
 dtime     | timestamp without time zone
 first_id  | character varying(10)
 second_id | character varying(10)
 third_id  | character varying(10)
 fourth_id | character varying(10)
 field     | character varying(10)
Cardinality: ~50k rows in the slow case, ~180k rows in the fast case
Query plans: output of EXPLAIN (BUFFERS, ANALYZE) for the two cases - slow case https://explain.depesz.com/s/5JDw, fast case: https://explain.depesz.com/s/JMIL
Additional info: the relevant memory configuration is:
      name       | current_setting |        source
-----------------+-----------------+----------------------
 max_stack_depth | 2MB             | environment variable
 max_wal_size    | 1GB             | configuration file
 min_wal_size    | 80MB            | configuration file
 shared_buffers  | 128MB           | configuration file
This often happens to me on SQL Server.
What usually causes the slowness is that the CTE gets executed once per joined row.
You can prevent that from happening by selecting into temp tables instead of using CTEs.
I assume the same is true for PostgreSQL, but I haven't tested it:
DROP TABLE IF EXISTS tempT0;
DROP TABLE IF EXISTS tempT1;
CREATE TEMP TABLE tempT0 AS SELECT * FROM original_table WHERE id=0;
CREATE TEMP TABLE tempT1 AS SELECT * FROM original_table WHERE id=1;
[... etc]
FROM tempT1 AS t1
LEFT JOIN tempT0 AS t0 ON t1.first_id=t0.first_id
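If you try this in PostgreSQL, it is also worth running ANALYZE on the temp tables before the big query, since temporary tables are not handled by autovacuum and the planner otherwise has no statistics for them. A sketch of the full rewrite, simply reusing the query from the question:
CREATE TEMP TABLE tempT2 AS SELECT * FROM original_table WHERE id=2;
CREATE TEMP TABLE tempT3 AS SELECT * FROM original_table WHERE id=3;
CREATE TEMP TABLE tempT4 AS SELECT * FROM original_table WHERE id=4;

ANALYZE tempT0; ANALYZE tempT1; ANALYZE tempT2; ANALYZE tempT3; ANALYZE tempT4;

SELECT
    t0.dtime, t1.dtime, t3.dtime, t3.dtime::date, t4.dtime,
    t1.first_id, t1.field, t1.second_id, t1.third_id,
    t2.fourth_id, t4.fourth_id
FROM tempT1 AS t1
LEFT JOIN tempT0 AS t0 ON t1.first_id = t0.first_id
JOIN tempT2 AS t2 ON t1.first_id = t2.first_id AND t1.second_id = t2.second_id AND t1.third_id = t2.third_id
JOIN tempT3 AS t3 ON t1.first_id = t3.first_id AND t1.second_id = t3.second_id AND t1.third_id = t3.third_id
JOIN tempT4 AS t4 ON t1.first_id = t4.first_id AND t1.second_id = t4.second_id AND t1.fourth_id = t4.third_id
ORDER BY t3.dtime;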
I am using PostgreSQL. I have a table with about 10 million records. I need to update a column of the table, say 'a', using a sequence, and this column needs to be populated in the order of another column, say 'b'. So, for any two records r1 and r2, if the value of 'a' for r1 is less than the value of 'a' for r2, then the value of 'b' for r1 must be less than the value of 'b' for r2.
I am using something like this:
UPDATE table
SET col1 = nextval('myseq')
WHERE key IN (SELECT key
FROM table
ORDER BY col2);
key is the primary key of the table.
But it is taking too much time. Can anyone help me do this in a more optimized way?
Thanks
Try something like:
UPDATE table t1
SET col1 = t2.new_col1
FROM (SELECT t2.key, nextval('myseq') as new_col1
      FROM table t2
      ORDER BY t2.col2) t2
WHERE t1.key = t2.key;
Or better something like:
UPDATE table t1
SET col1 = t2.new_col1
FROM (SELECT t2.key,
             row_number() OVER (ORDER BY t2.col2) as new_col1
      FROM table t2) t2
WHERE t1.key = t2.key;
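One caveat with the row_number() version: it never touches myseq, so if col1 has to stay consistent with the sequence for future inserts, you'd bump the sequence afterwards. A sketch, using the placeholder table/col1/myseq names from the question:
SELECT setval('myseq', (SELECT max(col1) FROM table));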
Don't use UPDATE at all.
Use a SELECT INTO like this:
SELECT sorted.*, nextval('myseq') AS new_col1
INTO new_table
FROM
(
    SELECT *
    FROM table
    ORDER BY col2
) AS sorted;
Then replace the old table with the new one: drop the old col1, rename new_col1, and swap the tables. You'll have to recreate all your indexes and re-add your primary key.
Postgres doesn't update rows in place: each UPDATE writes a new version of the row and marks the old one as dead. So if you're doing millions of updates, the table bloats and access becomes extremely slow. Replacing the whole table is usually your best option.
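A sketch of that swap, assuming the new_table built above; your_table stands in for the real table name (the question only calls it "table"):
BEGIN;
ALTER TABLE new_table DROP COLUMN col1;                -- discard the old values
ALTER TABLE new_table RENAME COLUMN new_col1 TO col1;  -- keep the freshly numbered ones
ALTER TABLE your_table RENAME TO your_table_old;
ALTER TABLE new_table RENAME TO your_table;
-- recreate indexes and the primary key on your_table here
COMMIT;
-- once everything checks out: DROP TABLE your_table_old;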
I am new to PostgreSQL (and databases in general) and was hoping to get some pointers on improving the efficiency of the following statement.
I am inserting data from one table into another and do not want to insert duplicate values. Each table has a rid column (a unique identifier) that is indexed and is the primary key.
I am currently using the following statement:
INSERT INTO table1 SELECT * FROM table2 WHERE rid NOT IN (SELECT rid FROM table1);
As of now table1 is 200,000 records and table2 is 20,000 records. table1 is going to keep growing (probably to around 2,000,000) while table2 will stay at around 20,000 records. The statement currently takes about 15 minutes to run, and I am concerned that as table1 grows this is going to take way too long. Any suggestions?
This should be more efficient than your current query:
INSERT INTO table1
SELECT *
FROM table2
WHERE NOT EXISTS (
SELECT 1 FROM table1 WHERE table1.rid = table2.rid
);
INSERT INTO table1
SELECT t2.*
FROM table2 t2
LEFT JOIN table1 t1 ON t1.rid = t2.rid
WHERE t1.rid IS NULL;
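Since rid is the primary key of table1, another option, if you are on PostgreSQL 9.5 or later (the question doesn't say), is to let that unique constraint filter out the duplicates; a sketch:
INSERT INTO table1
SELECT *
FROM table2
ON CONFLICT (rid) DO NOTHING;
This silently skips any row whose rid already exists, which is the same behaviour as the NOT EXISTS and anti-join versions above.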
I need to perform a distinct select on a few columns, one of which is non-distinct. Can I specify which columns make up the distinct group in my SQL statement?
Currently I am doing this.
SELECT DISTINCT a, b, c, d FROM TABLE_1 INNER JOIN TABLE_2 ON TABLE_1.a = TABLE_2.a WHERE TABLE_2.d IS NOT NULL;
The problem is that I am getting two rows from the above SQL because column d holds different values. How can I form a distinct group from columns a, b and c, ignoring column d, but still have column d in my select clause as well?
FYI: I am using DB2
Thanks
Sandeep
SELECT a,b,c,MAX(d)
FROM table_1
INNER JOIN table_2 ON table_1.a = table_2.a
GROUP BY a,b,c
Well, your question, even with refinements, is still pretty general. So, you get a general answer.
Without knowing more about your table structure or your desired results, it may be impossible to give a meaningful answer, but here goes:
SELECT a, b, c, d
FROM table_1 as t1
JOIN table_2 as t2
  ON t2.a = t1.a
 AND t2.[some_timestamp_column] = (SELECT MAX(t3.[some_timestamp_column])
                                   FROM table_2 as t3
                                   WHERE t3.a = t2.a)
This assumes that table_1 is populated with single rows to retrieve, and that the one-to-many relationship between table_1 and table_2 arises because different values of d are populated at distinct [some_timestamp_column] times. If this is the case, it will get the most recent table_2 record that matches table_1.
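Under the same assumption about [some_timestamp_column] (kept here as the same placeholder), the "most recent table_2 row per a" idea can also be written with ROW_NUMBER(), which DB2 supports; a sketch, following the question's unqualified b, c and d columns:
SELECT a, b, c, d
FROM (
    SELECT t1.a AS a, b, c, d,
           ROW_NUMBER() OVER (PARTITION BY t1.a
                              ORDER BY t2.[some_timestamp_column] DESC) AS rn
    FROM table_1 AS t1
    JOIN table_2 AS t2 ON t2.a = t1.a
) AS ranked
WHERE rn = 1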
I really need to do something like this:
UPDATE table t1
SET column1=t2.column1
FROM table t2
INNER JOIN table t3
USING (column2)
GROUP BY t1.column2;
But Postgres says I have a syntax error around the GROUP BY clause. What is a different way to do this?
The UPDATE statement does not support GROUP BY; see the documentation. If you're trying to update t1 with the corresponding row from t2, you'd want to use a WHERE clause, something like this:
UPDATE table t1 SET column1=t2.column1
FROM table t2
JOIN table t3 USING (column2)
WHERE t1.column2=t2.column2;
If you need to group the rows from t2/t3 before assigning to t1, you'd need to use a subquery something like this:
UPDATE table t1 SET column1=sq.column1
FROM (
SELECT t2.column1, column2
FROM table t2
JOIN table t3 USING (column2)
GROUP BY column2
) AS sq
WHERE t1.column2=sq.column2;
Although, as formulated, that won't work because t2.column1 isn't included in the GROUP BY clause (it would have to be an aggregate function rather than a simple column reference).
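For example, a version that is at least valid SQL would aggregate column1 inside the subquery; whether min() is the aggregate you actually want depends on your data:
UPDATE table t1 SET column1=sq.column1
FROM (
    SELECT min(t2.column1) AS column1, column2
    FROM table t2
    JOIN table t3 USING (column2)
    GROUP BY column2
) AS sq
WHERE t1.column2=sq.column2;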
Otherwise, what exactly are you trying to do here?
In MariaDB/MySQL this SQL works:
UPDATE table t1 LEFT JOIN (
    SELECT t2.column1, column2
    FROM table t2
    JOIN table t3 USING (column2)
    GROUP BY column2
) AS sq ON t1.column2=sq.column2
SET column1=sq.column1;