How can I optimize a query whose WHERE conditions include a check for user_id = X OR user_id IN (some subquery that might return no results)?
In my example below, queries 1 and 2 are both extremely fast (< 1 ms), but query 3, which is simply an OR of the conditions in queries 1 and 2, is much slower (50 ms).
Can somebody please explain why query 3 is so slow, and in general what types of query optimization strategies should I be pursuing to avoid this problem? I realize the subquery in my example could easily be eliminated, but in real life sometimes subqueries seem like the least complicated way to get the data I want.
Relevant code and data:
Posts data: https://dl.dropbox.com/u/4597000/StackOverflow/sanitized_posts.csv
Users data: https://dl.dropbox.com/u/4597000/StackOverflow/sanitized_users.csv
# from the shell:
# > createdb test

CREATE TABLE posts (
    id integer PRIMARY KEY NOT NULL,
    created_by_id integer NOT NULL,
    created_at integer NOT NULL
);
CREATE INDEX index_posts ON posts (created_by_id, created_at);
CREATE INDEX index_posts_2 ON posts (created_at);

CREATE TABLE users (
    id integer PRIMARY KEY NOT NULL,
    login varchar(50) NOT NULL
);
CREATE INDEX index_users ON users (login);

COPY posts FROM '/path/to/sanitized_posts.csv' DELIMITER ',' CSV;
COPY users FROM '/path/to/sanitized_users.csv' DELIMITER ',' CSV;
-- queries:
-- query 1, fast:
EXPLAIN ANALYZE SELECT * FROM posts WHERE created_by_id = 123 LIMIT 100;
-- query 2, fast:
EXPLAIN ANALYZE SELECT * FROM posts WHERE created_by_id IN (SELECT id FROM users WHERE login = 'nobodyhasthislogin') LIMIT 100;
-- query 3, slow:
EXPLAIN ANALYZE SELECT * FROM posts WHERE created_by_id = 123 OR created_by_id IN (SELECT id FROM users WHERE login = 'nobodyhasthislogin') LIMIT 100;
Split the query:

SELECT * FROM (
    SELECT * FROM posts p WHERE p.created_by_id = 123
    UNION
    SELECT * FROM posts p
    WHERE EXISTS (SELECT TRUE FROM users WHERE id = p.created_by_id AND login = 'nobodyhasthislogin')
) p
LIMIT 100;
How about:

EXPLAIN ANALYZE
SELECT *
FROM posts
WHERE created_by_id IN (
    SELECT 123
    UNION ALL
    SELECT id FROM users WHERE login = 'nobodyhasthislogin'
)
LIMIT 100;
Most of the time in this particular query is spent in an index scan. Here is a query that goes at it from a different angle to avoid that, but it should return equivalent results.
SELECT posts.*
FROM users
JOIN posts ON posts.created_by_id = users.id
WHERE users.id = 123 OR login = 'nobodyhasthislogin';
This selects from the users table, doing the filter once, and then joins posts onto that.
I realize that the question is about tips for optimization, not really this specific query. To answer that, my suggestion is to run EXPLAIN ANALYZE and read up on interpreting the results; this answer was helpful to me.
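For example (a minimal illustration against the posts table above; the BUFFERS option assumes PostgreSQL 9.0 or later):

EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM posts WHERE created_by_id = 123 LIMIT 100;

The actual times and row counts on each plan node show where the time goes; a large gap between estimated and actual row counts usually points at stale or missing statistics.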
Related
I was wondering if it's possible to improve the performance of the query below, which is taking longer than the response timeout:
SELECT * FROM products p
WHERE lower(name) ILIKE 'BL%'
OR lower(name) ILIKE 'Bule%'
AND p.site_id = 123
AND p.product_type = 0
ORDER BY external_id ASC LIMIT 25;
Previously it performed fine, when the DB held less data. (Using PostgreSQL.)
The name column is varchar(255).
I tried SIMILAR TO and the ANY operator, but these did not return the same records and did not perform better.
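One possible angle (an assumption on my part, not something stated in the thread): AND binds more tightly than OR, so as written the site_id and product_type filters only apply to the second pattern, and lower(name) combined with ILIKE is redundant. Parenthesizing the OR and backing the patterns with a trigram index (this assumes the pg_trgm extension is available; the index name is illustrative) may help:

CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX products_name_trgm_idx ON products USING gin (lower(name) gin_trgm_ops);

-- Lower-cased patterns so the expression index on lower(name) can be used.
EXPLAIN ANALYZE
SELECT *
FROM products p
WHERE (lower(name) LIKE 'bl%' OR lower(name) LIKE 'bule%')
  AND p.site_id = 123
  AND p.product_type = 0
ORDER BY external_id ASC LIMIT 25;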
I have a posts table with the following structure:

| id | score | title  | tags                   |
-------------------------------------------------
| 1  | 42    | Travel | <uk><travel><passport> |
For each blog post I want to find relevant posts, tagged with any of the tags corresponding to the current page, in my case: <uk>, <travel> or <passport>. Then, order results by score, limit it to 5 items and display it to the user.
This is the code I came up with so far, but it seems to return results only for the first tag in the query, <uk>.
with tags_string (tag) as (
select unnest(string_to_array('<uk><travel><passport>', '>'))
)
select *
from
(
select distinct *
from posts
cross join tags_string
cross join lateral
(select
(tags ~ tag)::int as match_found
) m
where m.match_found > 0
) t
order by t.score desc
limit 5;
EDIT
After @Mike Organek's comment I changed the query to this, and it's working as I initially expected.
with tags_string (tag) as (
select unnest(string_to_array('<uk><travel><passport>', '>'))
)
select *
from
(
select distinct *
from posts
cross join tags_string
cross join lateral
(select
position(tag in tags) > 0 as match_found
) m
where m.match_found and tag <> ''
) t
order by t.score desc
limit 5;
I would convert the tags into an array, then use array operators to find the relevant posts:

select id, title, score, tags
from posts
where string_to_array(trim(both '<>' from replace(tags, '><', ',')), ',') && array['uk', 'travel', 'passport']
order by score desc
limit 5;
In the long run, storing the tags as an array or a jsonb array is probably a lot more efficient.
If you do that a lot, things might get a bit easier if you create a function for this:
create function tags_array(p_input text)
returns text[]
as
$$
select string_to_array(trim(both '<>' from replace(p_input, '><', ',')), ',');
$$
language sql
immutable;
Then the query is a bit easier to read:
select id, title, score, tags
from posts
where tags_array(tags) && array['uk', 'travel', 'passport']
order by score desc
limit 5;
You can even create an index for that if you want:
create index on posts using gin ( (tags_array(tags)) );
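If you do migrate to a native array column as suggested above, a minimal sketch of the one-time conversion might look like this (the tags_arr column name is illustrative):

ALTER TABLE posts ADD COLUMN tags_arr text[];
UPDATE posts SET tags_arr = tags_array(tags);
CREATE INDEX ON posts USING gin (tags_arr);
-- queries then become: WHERE tags_arr && array['uk', 'travel', 'passport']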
Let's say I have the following schema:
CREATE TABLE author(
id SERIAL PRIMARY KEY,
name TEXT NOT NULL
);
CREATE TABLE article(
id SERIAL PRIMARY KEY,
rating NUMERIC NOT NULL,
author_id INTEGER NOT NULL REFERENCES author
);
CREATE INDEX ON article(author_id);
I would like to fetch all authors and their top 5 articles, provided that at least one of the author's articles has rating > 4.
It was tempting to write this:
SELECT au.id AS author,
json_agg(ar.*) AS articles
FROM
author au
JOIN LATERAL
(SELECT *
FROM article
WHERE author_id = au.id
ORDER BY rating DESC LIMIT 5) ar ON (TRUE)
GROUP BY au.id
HAVING any(ar.rating) > 4;
While any(ar.rating) > 4 looks like a filter expression on each group, any(ar.rating) is not an aggregated value. So, it seems reasonable for Postgres to reject this query. Is it possible to write the query with HAVING?
As an alternative, I wrote this query to fetch the results:
SELECT au.id AS author,
json_agg(ar.*) AS articles
FROM
(SELECT au.*
FROM author au
WHERE EXISTS
(SELECT 1
FROM article
WHERE rating > 4 AND author_id = au.id)) au
JOIN LATERAL
(SELECT *
FROM article
WHERE author_id = au.id
ORDER BY rating DESC LIMIT 5) ar ON (TRUE)
GROUP BY au.id;
This, however, doesn't combine the grouping and the check for an article with rating > 4 into a single step. Is there a better way to write this query?
If you insist on using ANY, you have to use array_agg to aggregate that column into an array:

HAVING
    4 < ANY(array_agg(ar.rating))
But if any value is greater than 4, then the maximum is also greater than 4, so a more readable form is:

HAVING
    4 < max(ar.rating)
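Plugged into the original lateral-join query, that gives (a sketch against the schema above; since the top 5 articles by rating always include the maximum, this is equivalent to the EXISTS check):

SELECT au.id AS author,
       json_agg(ar.*) AS articles
FROM author au
JOIN LATERAL
    (SELECT *
     FROM article
     WHERE author_id = au.id
     ORDER BY rating DESC LIMIT 5) ar ON (TRUE)
GROUP BY au.id
HAVING max(ar.rating) > 4;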
I've got a visits table that looks like this:
id identity(1,1) not null,
visit_date datetime not null,
patient_id int not null,
flag bit not null
For each record, I need to find a matching record that is same time or earlier, has the same patient_id, and has flag set to 1. What I am doing now is:
select parent.id as parent_id,
(
select top 1
child.id as child_id
from
visits as child
where
child.visit_date <= parent.visit_date
and child.patient_id = parent.patient_id
and child.flag = 1
order by
visit_date desc
) as child_id
from
visits as parent
So, this query works correctly, except that it runs too slowly -- I suspect that this is because of the correlated subquery. Is it possible to rewrite it as a joined query?
View the query execution plan. Where you have thick arrows, look at those statements first. Learn the different operators and what they imply, such as Clustered Index Scan and Clustered Index Seek.

Usually, however, when a query runs slowly I find that there are no good indexes. Look at the tables and columns used in joins and filters, and create an index that covers all of these columns; this is usually called a covering index in the forums. It's something you can do for a query that really needs it, but keep in mind that too many indexes will slow down INSERT statements.
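For this particular query, a covering index along these lines might help (a sketch; the index name is illustrative and assumes the table is named visits as in the question):

CREATE NONCLUSTERED INDEX ix_visits_patient_flag_date
    ON visits (patient_id, flag, visit_date)
    INCLUDE (id);

This lets the subquery seek directly to a patient's flagged visits in date order instead of scanning.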
/*
id identity(1,1) not null,
visit_date datetime not null,
patient_id int not null,
flag bit not null
*/
SELECT
    T.parentId,
    T.patientId,
    V.id AS childId
FROM
(
    SELECT
        visits.id AS parentId,
        visits.patient_id AS patientId,
        MAX(previousVisit.visit_date) AS previousVisitDate
    FROM
        visits
        LEFT JOIN visits AS previousVisit ON
            visits.patient_id = previousVisit.patient_id
            AND visits.visit_date >= previousVisit.visit_date
            AND visits.id <> previousVisit.id
            AND previousVisit.flag = 1
    GROUP BY
        visits.id,
        visits.patient_id
) AS T
LEFT JOIN visits AS V ON
    T.patientId = V.patient_id
    AND T.previousVisitDate = V.visit_date;
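Another way to express the rewrite (my sketch, not part of the answer above) is OUTER APPLY, which keeps the "top 1 earlier flagged visit" intent explicit while letting SQL Server plan it as a join:

SELECT parent.id AS parent_id,
       child.id AS child_id
FROM visits AS parent
OUTER APPLY
(
    SELECT TOP 1 c.id
    FROM visits AS c
    WHERE c.visit_date <= parent.visit_date
      AND c.patient_id = parent.patient_id
      AND c.flag = 1
    ORDER BY c.visit_date DESC
) AS child;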
I am new to Firebird and I am messing around in its metadata to get some information about the table structure, etc.
My problem is that I can't seem to find any information about estimated table cardinality. Is there a way to get this information from Firebird?
Edit:
By cardinality I mean the number of rows in a table :) and for my use case SELECT COUNT(*) is not an option.
You can use an approximate method based on the selectivity of the primary key, like this:
SELECT
    R.RDB$RELATION_NAME AS TABLENAME,
    (CASE
        WHEN I.RDB$STATISTICS = 0 THEN 0
        ELSE 1 / I.RDB$STATISTICS
    END) AS COUNTRECORDS
FROM RDB$RELATIONS R
JOIN RDB$RELATION_CONSTRAINTS C
    ON R.RDB$RELATION_NAME = C.RDB$RELATION_NAME AND C.RDB$CONSTRAINT_TYPE = 'PRIMARY KEY'
JOIN RDB$INDICES I
    ON I.RDB$RELATION_NAME = C.RDB$RELATION_NAME AND I.RDB$INDEX_NAME = C.RDB$INDEX_NAME;
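Note that RDB$STATISTICS is only as fresh as the last time the index statistics were recomputed; you can refresh it for a given index (the index name below is illustrative) with:

SET STATISTICS INDEX RDB$PRIMARY1;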
To get the exact number of rows in a table you use the COUNT() function, as in any other SQL database, i.e.:
SELECT count(*) FROM table;
Why not use a query like this:

select count(distinct field_name) / (count(field_name) + 0.0000) from table_name;

The closer the result is to 1, the higher the cardinality of the specified column.