Is there any way to update this column faster using PostgreSQL?

I have about 200,000,000 rows and I am trying to update one of the columns. This query seems particularly slow, so I am not sure what exactly is wrong or if it is just slow.
UPDATE table1 p
SET location = a.location
FROM table2 a
WHERE p.idv = a.idv;
I currently have an index on idv for both of the tables. Is there some way to make this faster?

I encountered the same problem several weeks ago, and finally used the following strategy to drastically improve the speed. I guess it is not the best approach, but it may be useful as a reference.
Write a simple function which accepts a range of ids. The function executes the update SQL, but only for that range of ids.
Also add p.location != a.location to the WHERE clause so rows that would not change are skipped. I heard that this helps to keep the table from becoming bloated, which would hurt query performance and require a VACUUM to restore it.
I execute the function concurrently using about 30 threads, which intuitively should reduce the total time required by roughly a factor of 30. You can use an even higher number of threads if you are ambitious enough.
So it executes something like the below concurrently:
update table1 p set location = a.location from table2 a where p.idv = a.idv and p.location != a.location and p.id between 1 and 100000;
update table1 p set location = a.location from table2 a where p.idv = a.idv and p.location != a.location and p.id between 100001 and 200000;
update table1 p set location = a.location from table2 a where p.idv = a.idv and p.location != a.location and p.id between 200001 and 300000;
.....
.....
It also has another advantage: by printing some simple timing statistics in each function call, I can track the update progress and estimate the remaining time.
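Such a function might look like the PL/pgSQL sketch below (the function and parameter names are illustrative, and it assumes table1 has an integer id column to range over; IS DISTINCT FROM is used instead of != so rows with NULL locations are handled as well):
CREATE OR REPLACE FUNCTION update_location_range(from_id bigint, to_id bigint)
RETURNS void AS $$
BEGIN
    UPDATE table1 p
    SET    location = a.location
    FROM   table2 a
    WHERE  p.idv = a.idv
    AND    p.location IS DISTINCT FROM a.location  -- skip rows that would not change
    AND    p.id BETWEEN from_id AND to_id;
    -- simple timing statistic so the progress of each range can be tracked
    RAISE NOTICE 'finished range % - % at %', from_id, to_id, clock_timestamp();
END;
$$ LANGUAGE plpgsql;
Each worker/thread then calls it for its own range, for example SELECT update_location_range(1, 100000); in one session and SELECT update_location_range(100001, 200000); in another.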

Creating a new table can be faster than updating the existing data. So you can try the following:
CREATE TABLE new_table AS
SELECT
a.*, -- here you can set all fields you need
COALESCE(b.location, a.location) AS location -- update location field from table2
FROM table1 a
LEFT JOIN table2 b ON b.idv = a.idv;
Once it has been created, you can drop the old table and rename the new one.
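A sketch of the swap, assuming nothing else (views, foreign keys) depends on table1; note that indexes, constraints, triggers and privileges that existed on the old table have to be recreated on the new one:
BEGIN;
DROP TABLE table1;
ALTER TABLE new_table RENAME TO table1;
COMMIT;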

Related

Refactoring query using DISTINCT and JOINing table with a lot of records

I am using PostgreSQL v11.6. I've read a lot of questions asking how to optimize queries which use DISTINCT. Mine is not that different, but unlike the other questions, where people usually want to keep the rest of the query and just somehow make DISTINCT ON faster, I am willing to rewrite the query with the sole purpose of making it as performant as possible. The current query is this:
SELECT DISTINCT s.name FROM app.source AS s
INNER JOIN app.index_value iv ON iv.source_id = s.id
INNER JOIN app.index i ON i.id = iv.index_id
INNER JOIN app.namespace AS ns ON i.namespace_id=ns.id
WHERE (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
ORDER BY s.name;
The app.source table contains about 800 records. The other tables are under 5,000 records tops, but app.index_value contains 35,420,354 rows (about 35 million records), which I guess causes the overall slow execution of the query.
The EXPLAIN ANALYZE returns this:
I think that all relevant indexes are in place (maybe some small optimization can be made), but to get a significant improvement in execution time I think I need better logic in the query.
The current execution time on a decent machine is 35~38 seconds.
Your query is not using DISTINCT ON. It is merely using DISTINCT, which is quite a different thing.
SELECT DISTINCT is indeed often an indicator of a poorly written query, because DISTINCT is used to remove duplicates, and it is often the case that the query creates those duplicates itself. The same is true for your query. You simply want all names for which certain entries exist. So, use EXISTS (or IN, for that matter).
EXISTS
SELECT s.name
FROM app.source AS s
WHERE EXISTS
(
SELECT NULL
FROM app.index_value iv
JOIN app.index i ON i.id = iv.index_id
JOIN app.namespace AS ns ON i.namespace_id = ns.id
WHERE iv.source_id = s.id
AND (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
)
ORDER BY s.name;
IN
SELECT s.name
FROM app.source AS s
WHERE s.id IN
(
SELECT iv.source_id
FROM app.index_value iv
JOIN app.index i ON i.id = iv.index_id
JOIN app.namespace AS ns ON i.namespace_id = ns.id
WHERE (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
)
ORDER BY s.name;
Thus we avoid creating an unnecessarily large intermediate result.
Update 1
From the database side we can support queries with appropriate indexes. The only criterion in your query that limits the selected rows is the array lookup, though. This is probably slow, because the DBMS cannot use indexes here, as far as I know. And depending on the array content we can end up with zero app.namespace rows, few rows, many rows or even all rows. The DBMS cannot even make proper assumptions about how many. From there we'll retrieve the related index and index_value rows. Again, these can be all or none. The DBMS could use indexes here or not. If it used indexes this would be very fast on small sets of rows and extremely slow on large data sets. And if it used full table scans and joined these via hash joins, for instance, this would be the fastest approach for many rows and rather slow on few rows.
You can create indexes and see whether they get used or not. I suggest:
create index idx1 on app.index (namespace_id, id);
create index idx2 on app.index_value (index_id, source_id);
create index idx3 on app.source (id, name);
Update 2
I am not versed with arrays. But it looks like you want to check whether a matching entry exists. So again EXISTS might be a tad more appropriate:
WHERE EXISTS
(
SELECT NULL
FROM UNNEST(Array['Default']::CITEXT[]) AS nss
WHERE ns.name ILIKE nss
)
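Combining this with the EXISTS rewrite above, the whole query might then look like the following (just a sketch, using the tables and columns from the question):
SELECT s.name
FROM app.source AS s
WHERE EXISTS
(
SELECT NULL
FROM app.index_value iv
JOIN app.index i ON i.id = iv.index_id
JOIN app.namespace AS ns ON i.namespace_id = ns.id
WHERE iv.source_id = s.id
AND EXISTS
(
SELECT NULL
FROM UNNEST(Array['Default']::CITEXT[]) AS nss
WHERE ns.name ILIKE nss
)
)
ORDER BY s.name;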
Update 3
One more idea (I feel stupid now for having missed it): for each source we just look up whether there is at least one match. So maybe the DBMS starts with the source table and goes from that table to the next. For this we'd use the following indexes:
create index idx4 on app.index_value (source_id, index_id);
create index idx5 on app.index (id, namespace_id);
create index idx6 on app.namespace (id, name);
Just add them to your database and see what happens. You can always drop indexes again when you see the DBMS doesn't use them.
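If you want to check afterwards whether the new indexes are actually used, pg_stat_user_indexes keeps a per-index scan counter (a sketch; the counters accumulate since the last statistics reset):
SELECT schemaname, relname, indexrelname, idx_scan
FROM pg_stat_user_indexes
WHERE indexrelname IN ('idx1', 'idx2', 'idx3', 'idx4', 'idx5', 'idx6');

-- an index that stays at idx_scan = 0 can simply be dropped again, e.g.:
DROP INDEX app.idx6;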

How can you use 'For update skip locked' in postgres without locking rows in all tables used in the query?

When you want to use Postgres's SELECT FOR UPDATE SKIP LOCKED functionality to ensure that two different users reading from a table and claiming tasks do not block each other and also do not get tasks already being read by another user:
A join is being used in the query to retrieve tasks. We do not want row-level locking on any table except the one that contains the main info. Sample query below - lock only the rows of the 'task' table:
SELECT v.someid , v.info, v.parentinfo_id, v.stage FROM task v, parentinfo pi WHERE v.stage = 'READY_TASK'
AND v.parentinfo_id = pi.id
AND pi.important_info_number = (
SELECT MAX(important_info_number) FROM parentinfo )
ORDER BY v.id limit 200 for update skip locked;
Now if user A is retrieving some 200 rows of this table, user B should be able to retrieve another set of 200 rows.
EDIT: As per the comment below, the query will be changed to :
SELECT v.someid , v.info, v.parentinfo_id, v.stage FROM task v, parentinfo pi WHERE v.stage = 'READY_TASK'
AND v.parentinfo_id = pi.id
AND pi.important_info_number = (
SELECT MAX(important_info_number) FROM parentinfo) ORDER BY v.id limit 200 for update of v skip locked;
How best to place the ORDER BY such that rows are ordered? While the order would get affected if multiple users invoke this command, some order sanctity should still be maintained in the rows that are being returned.
Also, does this ensure that multiple threads invoking the same SELECT query would retrieve different sets of rows, or is the locking only done for UPDATE commands?
Just experimented with this a little bit - multiple SELECT queries will end up retrieving different sets of rows. Also, ORDER BY ensures the order of the final result obtained.
Yes,
FOR UPDATE OF table_name SKIP LOCKED
will lock only the rows of that table (use the table's alias from the FROM clause if it has one, as with of v above).
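A minimal sketch of how two workers would use this with the query from the question (each session runs the same statement; rows locked by one open transaction are simply skipped by the other, so user A and user B get disjoint sets of up to 200 tasks):
-- session A and session B each run:
BEGIN;
SELECT v.someid, v.info, v.parentinfo_id, v.stage
FROM task v, parentinfo pi
WHERE v.stage = 'READY_TASK'
AND v.parentinfo_id = pi.id
AND pi.important_info_number = (SELECT MAX(important_info_number) FROM parentinfo)
ORDER BY v.id
LIMIT 200
FOR UPDATE OF v SKIP LOCKED;
-- mark the claimed rows before committing, e.g. UPDATE task SET stage = '...' WHERE id IN (...),
-- because the row locks are released at COMMIT
COMMIT;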

Reform a PostgreSQL query using one (or more) index - Example

I am a beginner in PostgreSQL and, after understanding very basic things, I want to find out how I can get better performance (on a query) by using an index (one or more). I have read some documentation, but I would like a specific example so as to "catch" it.
MY EXAMPLE: Let's say I have just a table (MyTable) with three columns (Customer(text), Time(timestamp), Consumption(integer)) and I want to find the customer(s) with the maximum consumption on '2014-07-01 01:00:00'. MY SOLUTION (without index usage):
SELECT Customer FROM MyTable WHERE Time='2013-07-01 02:00:00'
AND Consumption=(SELECT MAX(consumption) FROM MyTable);
----> What would be the exact full code, using at least one index, for the query example above?
The correct query (using a correlated subquery) would be:
SELECT Customer
FROM MyTable
WHERE Time = '2013-07-01 02:00:00' AND
Consumption = (SELECT MAX(t2.consumption) FROM MyTable t2 WHERE t2.Time = '2013-07-01 02:00:00');
The above is very reasonable. An alternative approach if you want exactly one row returned is:
SELECT Customer
FROM MyTable
WHERE Time = '2013-07-01 02:00:00'
ORDER BY Consumption DESC
LIMIT 1;
And the best index is MyTable(Time, Consumption, Customer).
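A sketch of that index (the index name is illustrative):
CREATE INDEX mytable_time_consumption_customer_idx
ON MyTable (Time, Consumption, Customer);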

Why is performance of CTE worse than temporary table in this example

I recently asked a question regarding CTEs and using data with no true root records (i.e. instead of the root record having a NULL parent_Id, it is parented to itself).
The question link is here: Creating a recursive CTE with no rootrecord
The answer has been provided to that question and I now have the data I require; however, I am interested in the difference between the two approaches that I THINK are available to me.
The approach that yielded the data I required was to create a temp table with cleaned-up parenting data and then run a recursive CTE against it. That looked like the below;
Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
INTO #Parties
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1;
WITH linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM #Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM #Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
I also attempted to retrieve the same data by defining two CTEs: one to emulate the creation of the temp table above and the other to do the same recursive work, but referencing the initial CTE rather than a temp table;
WITH Parties
AS
(Select CASE
WHEN Parent_Id = Party_Id THEN NULL
ELSE Parent_Id
END AS Act_Parent_Id
, Party_Id
, PARTY_CODE
, PARTY_NAME
FROM DIMENSION_PARTIES
WHERE CURRENT_RECORD = 1),
linkedParties
AS
(
Select Act_Parent_Id, Party_Id, PARTY_CODE, PARTY_NAME, 0 AS LEVEL
FROM Parties
WHERE Act_Parent_Id IS NULL
UNION ALL
Select p.Act_Parent_Id, p.Party_Id, p.PARTY_CODE, p.PARTY_NAME, Level + 1
FROM Parties p
inner join
linkedParties t on p.Act_Parent_Id = t.Party_Id
)
Select *
FROM linkedParties
Order By Level
Now these two scripts are run on the same server however the temp table approach yields the results in approximately 15 seconds.
The multiple CTE approach takes upwards of 5 minutes (so long in fact that I have never waited for the results to return).
Is there a reason why the temp table approach would be so much quicker?
For what it is worth, I believe it is to do with the record counts. The base table has 200k records in it, and from memory CTE performance is severely degraded when dealing with large data sets, but I cannot seem to prove that, so I thought I'd check with the experts.
Many Thanks
Well, as there appears to be no clear answer for this, some further research into the general subject turned up a number of other threads with similar problems.
This one seems to cover many of the variations between temp tables and CTEs, so it is most useful for people looking to read around their issue:
Which are more performant, CTE or temporary tables?
In my case it would appear that the large amount of data in my CTEs caused the issue, as a CTE is not cached anywhere, and therefore recreating it each time it is referenced later has a large impact.
This might not be exactly the same issue you experienced, but I came across a similar one just a few days ago, and the queries did not even process that many records (a few thousand records).
And yesterday my colleague had a similar problem.
Just to be clear we are using SQL Server 2008 R2.
The pattern that I identified, and which seems to throw the SQL Server optimizer off the rails, is using temporary tables in CTEs that are joined with other temporary tables in the main select statement.
In my case I ended up creating an extra temporary table.
Here is a sample.
I ended up doing this:
SELECT DISTINCT st.field1, st.field2
into #Temp1
FROM SomeTable st
WHERE st.field3 <> 0
select x.field1, x.field2
FROM #Temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I tried the following query but it was a lot slower, if you can believe it.
with temp1 as (
SELECT DISTINCT st.field1, st.field2
FROM SomeTable st
WHERE st.field3 <> 0
)
select x.field1, x.field2
FROM temp1 x inner join #Temp2 o
on x.field1 = o.field1
order by 1, 2
I also tried to inline the first query in the second one and the performance was the same, i.e. VERY BAD.
SQL Server never ceases to amaze me. Once in a while I come across issues like this one that remind me it is a Microsoft product after all, but in the end you can say that other database systems have their own quirks.

PostgreSQL slow COUNT() - is trigger the only solution?

I have a table with posts, which are categorized by:
type
tag
language
All of those "categories" are stored in separate tables (posts_types) and connected via link tables (posts_types_assignment).
COUNTing in PostgreSQL is really slow (I have more than 500k records in that table) and I need to get the number of posts categorized by any combination of type/tag/lang.
If I solved it through triggers, it would be full of many multi-level loops, which really doesn't look nice and is hard to maintain.
Is there any other solution for how to effectively get the actual number of posts categorized in any type/tag/language?
Let me get this straight.
You have a table posts. You have a table posts_types. The two have a many to many join on posts_types_assignment. And you have some query like this that is slow:
SELECT count(*)
FROM posts p
JOIN posts_types_assignment pta1
ON p.id = pta1.post_id
JOIN posts_types pt1
ON pt1.id = pta1.post_type_id
AND pt1.type = 'language'
AND pt1.name = 'English'
JOIN posts_types_assignment pta2
ON p.id = pta2.post_id
JOIN posts_types pt2
ON pt2.id = pta2.post_type_id
AND pt2.type = 'tag'
AND pt2.name = 'awesome'
And you would like to know why it is painfully slow.
My first note is that PostgreSQL would have to do a lot less work if you had the identifiers in the posts table rather than in the joins. But that is a moot issue, the decision has been made.
My more useful note is that I believe that PostgreSQL has a similar query optimizer to Oracle. In which case to limit the combinatorial explosion of possible query plans that it has to consider, it only considers plans that start with some table, and then repeatedly joins on one more data set at a time. However no such query plan will work here. You can start with pt1, get 1 record, then go to pta1, get a bunch of records, join p, wind up with the same number of records, then join pta2, and now you get a huge number of records, then join to pt2, get just a few records. Joining to pta2 is the slow step, because the database has no idea which records you want, and therefore has to create a temporary result set for every combination of a post and a piece of metadata (type, language or tag) on it.
If this is indeed your problem, then the right plan looks like this. Join pt1 to pta1, put an index on it. Join pt2 to pta2, then join to the result of the first query, then join to p. Then count. This means that we don't get huge result sets.
If this is the case, there is no way to tell the query optimizer that this time you want it to think up a new type of execution plan. But there is a way to force it.
CREATE TEMPORARY TABLE t1
AS
SELECT pta.*
FROM posts_types pt
JOIN posts_types_assignment pta
ON pt.id = pta.post_type_id
WHERE pt.type = 'language'
AND pt.name = 'English';
CREATE INDEX idx1 ON t1 (post_id);
CREATE TEMPORARY TABLE t2
AS
SELECT pta.*
FROM posts_types pt
JOIN posts_types_assignment pta
ON pt.id = pta.post_type_id
JOIN t1
ON t1.post_id = pta.post_id
WHERE pt.type = 'tag'
AND pt.name = 'awesome';
SELECT COUNT(*)
FROM posts p
JOIN t2
ON p.id = t2.post_id;
Barring random typos, etc, this is likely to perform somewhat better. If it doesn't, double check the indexes on your tables.
As btilly notes, and if he has correctly guessed the schema, the table design does not help - it seems (at first sight, at least) that, for example, to have three tables posts_tag(post_id, tag), post_lang(post_id, lang), post_type(post_id, type) would be more natural and much more efficient.
Apart from that (or in addition to that), one could think of a table or materialized view that summarizes all the possible counts, with columns (lang, type, tag, nposts). Of course, computing this in full would be VERY slow, but (apart from the first time) it can be done either in full "in the background" at some interval (if the data does not vary much and you don't require exact counts), or eagerly with triggers.
See for example here
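As a sketch of the summary-table idea, and assuming btilly's guess at the schema above, a materialized view could look like the following (the view name is illustrative; with plain inner joins only posts that have all three categories assigned are counted):
CREATE MATERIALIZED VIEW posts_counts AS
SELECT ptl.name AS lang,
       ptt.name AS type,
       ptg.name AS tag,
       COUNT(*) AS nposts
FROM posts p
JOIN posts_types_assignment pal ON pal.post_id = p.id
JOIN posts_types ptl ON ptl.id = pal.post_type_id AND ptl.type = 'language'
JOIN posts_types_assignment pat ON pat.post_id = p.id
JOIN posts_types ptt ON ptt.id = pat.post_type_id AND ptt.type = 'type'
JOIN posts_types_assignment pag ON pag.post_id = p.id
JOIN posts_types ptg ON ptg.id = pag.post_type_id AND ptg.type = 'tag'
GROUP BY ptl.name, ptt.name, ptg.name;

-- refreshed "in background" at some interval:
REFRESH MATERIALIZED VIEW posts_counts;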