Optimizing a query with multiple IN - postgresql

I have a query like this:
SELECT * FROM table
WHERE department='param1' AND type='param2' AND product='param3'
AND product_code IN (10-30 alphanumerics) AND unit_code IN (10+ numerics)
AND first_name || last_name IN (10-20 names)
AND sale_id LIKE ANY(list of regex string)
Runtime was too high so I was asked to optimize it.
The lists for the code columns vary from user to user.
Each user provides their lists of codes, and the query then loops over product.
product used to be an IN-list as well, but it was split up.
Things I tried
By adding an index on (department, type, product) I was able to get a 4x improvement.
Currently some values of product take only 2-3 seconds, while others take 30s.
I tried creating a pre-concatenated column of first_name || last_name, but the runtime improvement was too small to be worth it.
Is there some way I can improve the performance of the other clauses, such as the "IN" clauses or the LIKE ANY clause?

In my experience, replacing large IN lists with a JOIN to a VALUES clause often improves performance.
So instead of:
SELECT *
FROM table
WHERE department='param1'
AND type='param2'
AND product='param3'
AND product_code IN (10-30 alphanumerics)
Use:
SELECT *
FROM table t
JOIN ( values (1),(2),(3) ) as x(code) on x.code = t.product_code
WHERE department='param1'
AND type='param2'
AND product='param3'
But you have to make sure you don't have any duplicates in the VALUES list.
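If the application can't guarantee unique values, one option is to deduplicate inline with DISTINCT; a sketch, reusing the placeholder names from above:
SELECT t.*
FROM table t
JOIN (SELECT DISTINCT code
      FROM (VALUES (1),(2),(3)) AS v(code)) AS x ON x.code = t.product_code
WHERE department='param1'
AND type='param2'
AND product='param3'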
The concatenation is also wrong, because comparing the concatenated value is different from comparing each value individually: e.g. ('alexander', 'son') would be treated as identical to ('alex', 'anderson').
You should use:
and (first_name, last_name) in ( ('fname1', 'lname1'), ('fname2', 'lname2'))
This can also be written as a join:
SELECT *
FROM table t
JOIN ( values (1),(2),(3) ) as x(code) on x.code = t.product_code
JOIN (
values ('fname1', 'lname1'), ('fname2', 'lname2')
) as n(fname, lname) on (n.fname, n.lname) = (t.first_name, t.last_name)
WHERE department='param1'
AND type='param2'
AND product='param3'

You generally don't have to do anything special to get an index used with multiple IN-lists, other than keeping the table well vacuumed and analyzed. A btree index on (department, type, product, product_code, unit_code, (first_name || last_name)) should work well. If it doesn't, please show an EXPLAIN (ANALYZE, BUFFERS) for it, preferably with track_io_timing turned on. If the selectivities of your conditions are not mostly independent of each other, that might lead to planning problems.
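For reference, a sketch of that index (the index name is illustrative, and "table" stands in for the real table name, as elsewhere in this question):
CREATE INDEX table_lookup_idx
ON table (department, type, product, product_code, unit_code, (first_name || last_name));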

Related

Smart way to filter out unnecessary rows from Query

So I have a query that shows a huge number of mutations in Postgres. The data quality is bad and I have "cleaned" it as much as possible.
To make my report as user-friendly as possible, I want to filter out some rows that I know the customer doesn't need.
I have the following columns: id, change_type, atr, module, value_old and value_new.
For change_type = 'update' I always want to show every row.
For the rest of the rows I want to build some kind of logic with a combination of atr and module.
For example, if change_type <> 'update' and the concatenation of atr and module is 'weightperson', then I don't want to show that row.
In this case ids 3 and 11 are worthless and should not be shown.
Is this the best way to solve this, or does anyone have another idea?
select * from t1
where concat(atr,module) not in ('weightperson','floorrentalcontract')
In the end my "not in" part will be filled with over 100 combinations and the query will not look good. Maybe a solution with a CTE would make it look prettier, and I'm also concerned about the performance.
CREATE TABLE t1(id integer, change_type text, atr text, module text, value_old text, value_new text) ;
INSERT INTO t1 VALUES
(1,'create','id','person',null ,'9'),
(2,'create','username','person',null ,'abc'),
(3,'create','weight','person',null ,'60'),
(4,'update','id','order','4231' ,'4232'),
(5,'update','filename','document','first.jpg' ,'second.jpg'),
(6,'delete','id','rent','12' ,null),
(7,'delete','cost','rent','600' ,null),
(8,'create','id','rentalcontract',null ,'110'),
(9,'create','tenant','rentalcontract',null ,'Jack'),
(10,'create','rent','rentalcontract',null ,'420'),
(11,'create','floor','rentalcontract',null ,'1')
You could put the list of combinations in a separate table and join with that table, or have them listed directly in a with-clause like this:
with combinations_to_remove as (
    select *
    from (values
        ('weight', 'person'),
        ('floor', 'rentalcontract')
    ) as t (atr, module)
)
select t1.*
from t1
left join combinations_to_remove using (atr, module)
where combinations_to_remove.atr is null
I guess it would be cleaner and easier to maintain if you put them in a separate table!
Read more on with-queries if that sounds strange to you.
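A minimal sketch of the separate-table variant (names kept from the CTE above; the primary key conveniently also rules out duplicate combinations):
create table combinations_to_remove (
    atr text,
    module text,
    primary key (atr, module)
);
insert into combinations_to_remove values
    ('weight', 'person'),
    ('floor', 'rentalcontract');
select t1.*
from t1
left join combinations_to_remove using (atr, module)
where combinations_to_remove.atr is null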

Refactoring query using DISTINCT and JOINing table with a lot of records

I am using PostgreSQL v11.6. I've read a lot of questions asking how to optimize queries that use DISTINCT. Mine is not that different, but unlike the other questions, where people usually want to keep the rest of the query and just somehow make DISTINCT ON faster, I am willing to rewrite the query with the sole purpose of making it as performant as possible. The current query is this:
SELECT DISTINCT s.name FROM app.source AS s
INNER JOIN app.index_value iv ON iv.source_id = s.id
INNER JOIN app.index i ON i.id = iv.index_id
INNER JOIN app.namespace AS ns ON i.namespace_id=ns.id
WHERE (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
ORDER BY s.name;
The app.source table contains about 800 records. The other tables are under 5,000 records tops, but app.index_value contains 35,420,354 records (about 35 million), which I guess causes the overall slow execution of the query.
The EXPLAIN ANALYZE returns this: [query plan screenshot omitted]
I think that all relevant indexes are in place (maybe some small optimizations can still be made), but I think that in order to get significant improvements in execution time I need better logic for the query.
The current execution time on a decent machine is 35~38 seconds.
Your query is not using DISTINCT ON. It is merely using DISTINCT, which is quite a different thing.
SELECT DISTINCT is indeed often an indicator of a poorly written query, because DISTINCT is used to remove duplicates and it is often the case that the query creates those duplicates itself. The same is true for your query. You simply want all names for which certain entries exist. So, use EXISTS (or IN, for that matter).
EXISTS
SELECT s.name
FROM app.source AS s
WHERE EXISTS
(
    SELECT NULL
    FROM app.index_value iv
    JOIN app.index i ON i.id = iv.index_id
    JOIN app.namespace AS ns ON i.namespace_id = ns.id
    WHERE iv.source_id = s.id
    AND (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
)
ORDER BY s.name;
IN
SELECT s.name
FROM app.source AS s
WHERE s.id IN
(
    SELECT iv.source_id
    FROM app.index_value iv
    JOIN app.index i ON i.id = iv.index_id
    JOIN app.namespace AS ns ON i.namespace_id = ns.id
    WHERE (SELECT TRUE FROM UNNEST(Array['Default']::CITEXT[]) AS nss WHERE ns.name ILIKE nss LIMIT 1)
)
ORDER BY s.name;
Thus we avoid creating an unnecessarily large intermediate result.
Update 1
From the database side we can support queries with appropriate indexes. The only criterion in your query that limits the selected rows, though, is the array lookup. This is probably slow, because the DBMS cannot use database indexes here, as far as I know. And depending on the array content we can end up with zero app.namespace rows, few rows, many rows or even all rows. The DBMS cannot even make proper assumptions about how many. From there we'll retrieve the related index and index_value rows. Again, these can be all or none. The DBMS could use indexes here or not. If it used indexes, this would be very fast on small sets of rows and extremely slow on large data sets. And if it used full table scans and joined these via hash joins, for instance, this would be the fastest approach for many rows and rather slow on few rows.
You can create indexes and see whether they get used or not. I suggest:
create index idx1 on app.index (namespace_id, id);
create index idx2 on app.index_value (index_id, source_id);
create index idx3 on app.source (id, name);
Update 2
I am not well versed with arrays. But it looks like you want to check whether a matching entry exists. So again, EXISTS might be a tad more appropriate:
WHERE EXISTS
(
    SELECT NULL
    FROM UNNEST(Array['Default']::CITEXT[]) AS nss
    WHERE ns.name ILIKE nss
)
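Dropped into the EXISTS query from above, that fragment would look roughly like this:
SELECT s.name
FROM app.source AS s
WHERE EXISTS
(
    SELECT NULL
    FROM app.index_value iv
    JOIN app.index i ON i.id = iv.index_id
    JOIN app.namespace AS ns ON i.namespace_id = ns.id
    WHERE iv.source_id = s.id
    AND EXISTS
    (
        SELECT NULL
        FROM UNNEST(Array['Default']::CITEXT[]) AS nss
        WHERE ns.name ILIKE nss
    )
)
ORDER BY s.name;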
Update 3
One more idea (I feel stupid now to have missed that): For each source we just look up whether there is at least one match. So maybe the DBMS starts with the source table and goes from that table to the next. For this we'd use the following indexes:
create index idx4 on index_value (source_id, index_id);
create index idx5 on index (id, namespace_id);
create index idx6 on namespace (id, name);
Just add them to your database and see what happens. You can always drop indexes again when you see the DBMS doesn't use them.

NOT IN query performance issue with large data

I was trying to get the id and the number from a table, with the condition that the number isn't among the ids.
select id,number from tmp_t where number not in (select id from tmp_t)
I tried the query and it's taking so long... almost 40 minutes, and then I got disconnected from the server.
So what should I do? The data is around 500K rows.
I wanted to show: "here you go, the id and the number, where the number doesn't exist among the ids."
I tried to insert the number, but the number is a FK depending on the id, so I wanted to know the id and the number; that's why I'm using NOT IN.
Maybe someone knows? Btw, I'm using PostgreSQL 13.
You can write it with NOT EXISTS instead, although these queries will have different results if any value of id is NULL. (In that case, NOT IN probably does not yield the answer you want, so NOT EXISTS is better from that perspective as well.)
select id,number from tmp_t where not exists
(select 1 from tmp_t a where a.id=tmp_t.number);
But your formulation is also efficient as long as work_mem is large enough.
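To test that, you can raise work_mem for the current session only and compare the plans; the value below is just illustrative:
set work_mem = '256MB';
explain (analyze, buffers)
select id, number from tmp_t where number not in (select id from tmp_t);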
Typically NOT EXISTS is faster (and doesn't suffer from surprises if NULL values are involved):
select t1.id, t1.number
from tmp_t t1
where not exists (select *
                  from tmp_t t2
                  where t2.id = t1.number)

PostgreSQL - Update rows in table with generate_series()

I have the following table:
create table test(
id serial primary key,
firstname varchar(32),
lastname varchar(64),
id_desc char(8)
);
I need to insert 100 rows of data. Getting the names is no problem: I have two tables, one containing ten first names and the other containing ten last names. With an INSERT ... SELECT query using a CROSS JOIN I am able to get 100 rows of data (a 10x10 cross join).
id_desc consists of eight characters (the fixed size is mandatory). It always starts with the same pattern (e.g. abcde) followed by 001, 002, etc. up to 999. I have tried to achieve this with the following statement:
update test set id_desc = 'abcde' || num.id
from (select * from generate_series(1, 100) as id) as num
where num.id = (select id from test where id = num.id);
The statement executes but affects zero rows. I know that the WHERE clause probably does not make much sense; I have been trying to finally get this to work and just started trying a couple of things. I didn't want to omit it when posting here, though, because I know it is definitely required.
Laurenz's suggestion fits this specific case very well. I recommend using it.
The rest of this is for the more general case where that simplification is not appropriate.
In my tests, this doesn't work this way.
I think you are better off using a WITH clause and a window function.
WITH ranked_ids (id, rank) AS (
    SELECT id, row_number() OVER (ROWS UNBOUNDED PRECEDING)
    FROM test
)
UPDATE test
SET id_desc = 'abcde' || to_char(ranked_ids.rank, 'FM099') -- zero-pad so the value keeps the fixed 8-character width
FROM ranked_ids
WHERE test.id = ranked_ids.id;
It should be as simple as
UPDATE test SET id_desc = 'abcde' || to_char(id, 'FM099');
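The FM (fill mode) modifier matters here: without it, to_char reserves a leading position for the sign, so the result would contain a space and be nine characters long. A quick sanity check:
SELECT 'abcde' || to_char(7, 'FM099');   -- 'abcde007', exactly 8 characters
SELECT 'abcde' || to_char(100, 'FM099'); -- 'abcde100'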

Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094

SELECT DISTINCT tblJobReq.JobReqId
, tblJobReq.JobStatusId
, tblJobClass.JobClassId
, tblJobClass.Title
, tblJobReq.JobClassSubTitle
, tblJobAnnouncement.JobClassDesc
, tblJobAnnouncement.EndDate
, tblJobAnnouncement.AgencyMktgVerbage
, tblJobAnnouncement.SpecInfo
, tblJobAnnouncement.Benefits
, tblSalary.MinRateSal
, tblSalary.MaxRateSal
, tblSalary.MinRateHour
, tblSalary.MaxRateHour
, tblJobClass.StatementEval
, tblJobReq.ApprovalDate
, tblJobReq.RecruiterId
, tblJobReq.AgencyId
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE (tblJobReq.JobClassId in (SELECT JobClassId
from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
When I try to execute the query, it results in the following error.
Cannot sort a row of size 8130, which is greater than the allowable maximum of 8094
I checked and didn't find any solution. The only way I found is to truncate (with substring()) the tblJobAnnouncement.JobClassDesc column in the query, which has a size of around 8000.
Is there any workaround so that I need not truncate the values? Or can this query be optimised? Any setting in SQL Server 2000?
The [non-obvious] reason why SQL needs to SORT is the DISTINCT keyword.
Depending on the data and underlying table structures, you may be able to do away with this DISTINCT, and hence not trigger this error.
You readily found the alternative solution which is to truncate some of the fields in the SELECT list.
Edit: Answering "Can you please explain how DISTINCT would be the reason here?"
Generally, the fashion in which the DISTINCT requirement is satisfied varies with
the data context (expected number of rows, presence/absence of index, size of row...)
the version/make of the SQL implementation (the query optimizer in particular receives new or modified heuristics with each new version, sometimes resulting in alternate query plans for various constructs in various contexts)
Yet, all the possible plans associated with a "DISTINCT query" involve *some form* of sorting of the qualifying records. In its simplest form, the plan first produces the list of qualifying rows (the records which satisfy the WHERE/JOINs/etc. parts of the query) and then sorts this list (which possibly includes some duplicates), only retaining the very first occurrence of each distinct row. In other cases, for example when only a few columns are selected and when some index(es) covering these columns is(are) available, no explicit sorting step is used in the query plan, but the reliance on an index implicitly implies the "sortability" of the underlying columns. In other cases yet, steps involving various forms of merging or hashing are selected by the query optimizer, and these too, eventually, imply the ability of comparing two rows.
Bottom line: DISTINCT implies some sorting.
In the specific case of the question, the error reported by SQL Server and preventing the completion of the query is that "Sorting is not possible on rows bigger than..." AND, the DISTINCT keyword is the only apparent reason for the query to require any sorting (BTW many other SQL constructs imply sorting: for example UNION) hence the idea of removing the DISTINCT (if it is logically possible).
In fact, you should remove it for test purposes, to verify that without DISTINCT the query completes OK (if only including some duplicates). Once this is confirmed, and if the query can indeed produce duplicate rows, look into ways of producing a duplicate-free result without the DISTINCT keyword; constructs involving subqueries can sometimes be used for this purpose.
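For example, if the LEFT JOINs here are one-to-one (at most one announcement and one salary row per job request), both the DISTINCT and the IN subquery look redundant, because tblJobClass is already INNER JOINed and can be filtered directly. A sketch under that assumption:
SELECT tblJobReq.JobReqId -- ...rest of the original column list unchanged, minus DISTINCT
FROM ((tblJobReq
LEFT JOIN tblJobAnnouncement ON tblJobReq.JobReqId = tblJobAnnouncement.JobReqId)
INNER JOIN tblJobClass ON tblJobReq.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary ON tblJobClass.SalaryCode = tblSalary.SalaryCode
WHERE tblJobClass.Title LIKE '%Family Therapist%'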
An unrelated hint is to use table aliases (a short string) to avoid repeating these long table names. For example (I only did a few tables, but you get the idea...):
SELECT DISTINCT JR.JobReqId, JR.JobStatusId,
tblJobClass.JobClassId, tblJobClass.Title,
JR.JobClassSubTitle, JA.JobClassDesc, JA.EndDate, JA.AgencyMktgVerbage,
JA.SpecInfo, JA.Benefits,
S.MinRateSal, S.MaxRateSal, S.MinRateHour, S.MaxRateHour,
tblJobClass.StatementEval,
JR.ApprovalDate, JR.RecruiterId, JR.AgencyId
FROM (
(tblJobReq AS JR
LEFT JOIN tblJobAnnouncement AS JA ON JR.JobReqId = JA.JobReqId)
INNER JOIN tblJobClass ON JR.JobClassId = tblJobClass.JobClassId)
LEFT JOIN tblSalary AS S ON tblJobClass.SalaryCode = S.SalaryCode
WHERE (JR.JobClassId in
(SELECT JobClassId from tblJobClass
WHERE tblJobClass.Title like '%Family Therapist%'))
FYI, running this SQL command on your DB can fix the problem if it is caused by space that needs to be reclaimed after dropping variable-length columns:
DBCC CLEANTABLE (0, 'dbo.TableName')
See: http://msdn.microsoft.com/en-us/library/ms174418.aspx
This is a limitation of SQL Server 2000. You can:
Split it into two queries and combine elsewhere
SELECT ID, ColumnA, ColumnB FROM TableA JOIN TableB
SELECT ID, ColumnC, ColumnD FROM TableA JOIN TableB
Truncate the columns appropriately
SELECT LEFT(LongColumn,2000)...
Remove any redundant columns from the SELECT
SELECT ColumnA, ColumnB, --IDColumnNotUsedInOutput
FROM TableA
Migrate off of SQL Server 2000