Postgres optimizing in an "A OR exists(B)" query - postgresql

I'm having a lot of trouble with a particular case in my Postgres optimization.
Essentially, I have three tables, which I'll simplify as:
Place: id, name (String), select (Boolean)
Bookmark: id, user (Integer), place (Integer)
User: id, name (String)
The Place table has several million rows (and growing), but only a relatively small number of them have select set to true.
I have several indexes on these tables: obviously on all the id columns, plus a partial one on place where "select" = true, and a unique one on the bookmark (user, place) combination. There are more, but I think they're not relevant here.
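For reference, the setup described above can be sketched in a self-contained way. This uses SQLite through Python purely as a stand-in for PostgreSQL (the index syntax is essentially the same in both); the index names are hypothetical, and the sample rows are invented:

```python
import sqlite3

# Sketch of the schema and indexes described above. SQLite is used here only
# so the example is runnable without a server; PostgreSQL DDL is analogous.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE place (id INTEGER PRIMARY KEY, name TEXT, "select" INTEGER);
CREATE TABLE bookmark (id INTEGER PRIMARY KEY, "user" INTEGER, place INTEGER);

-- Partial index: only rows with "select" = true are indexed.
CREATE INDEX place_selected_idx ON place (id) WHERE "select" = 1;

-- Unique index on the (user, place) combination.
CREATE UNIQUE INDEX bookmark_user_place_idx ON bookmark ("user", place);
""")

conn.executemany('INSERT INTO place (id, name, "select") VALUES (?, ?, ?)',
                 [(1, "a", 1), (2, "b", 0), (3, "c", 1)])
selected = [r[0] for r in
            conn.execute('SELECT id FROM place WHERE "select" = 1 ORDER BY id')]
print(selected)  # only the rows with "select" = true
```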
When I do a query of the type:
SELECT *
FROM place
WHERE "select"
LIMIT 10;
it takes 3ms.
When I do a query of the type:
SELECT *
FROM place
WHERE exists (SELECT id
FROM bookmark
WHERE "user" IN (1,2,3,4)
AND bookmark.place = place.id)
LIMIT 10;
it's also blazing fast.
However, if I do an OR on both conditions, like so:
SELECT *
FROM place
WHERE "select"
OR exists (SELECT id
FROM bookmark
WHERE "user" IN (1,2,3,4)
AND bookmark.place = place.id)
LIMIT 10;
it slows down to over 1s.
Besides doing two queries in my code and combining the results, is there any way I can optimize this?

The old problem: OR is a performance killer.
Use UNION:
(SELECT * FROM place
WHERE "select"
LIMIT 10)
UNION
(SELECT * FROM place
WHERE exists (SELECT 1 FROM bookmark
WHERE "user" IN (1,2,3,4)
AND bookmark.place = place.id)
LIMIT 10)
LIMIT 10;
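A minimal way to see the shape of this rewrite in action, using SQLite through Python purely for illustration (the data is invented, the column names are shortened to avoid quoting reserved words, and SQLite needs the per-branch LIMITs wrapped in subselects, whereas PostgreSQL accepts the parenthesized form above; an outer ORDER BY is added only to make the output deterministic):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE place (id INTEGER PRIMARY KEY, name TEXT, sel INTEGER);
CREATE TABLE bookmark (id INTEGER PRIMARY KEY, usr INTEGER, place INTEGER);
""")
conn.executemany("INSERT INTO place VALUES (?, ?, ?)",
                 [(1, "a", 1), (2, "b", 0), (3, "c", 1), (4, "d", 0)])
# Place 1 is both selected and bookmarked: UNION must not return it twice.
conn.executemany("INSERT INTO bookmark VALUES (?, ?, ?)",
                 [(1, 1, 1), (2, 2, 4)])

rows = conn.execute("""
SELECT id FROM (SELECT id FROM place WHERE sel = 1 LIMIT 10)
UNION
SELECT id FROM (SELECT id FROM place p
                WHERE EXISTS (SELECT 1 FROM bookmark b
                              WHERE b.usr IN (1, 2, 3, 4)
                                AND b.place = p.id)
                LIMIT 10)
ORDER BY id LIMIT 10
""").fetchall()
print([r[0] for r in rows])  # place 1 appears once despite matching both branches
```

Each branch can use its own index (the partial index for the first, the bookmark index for the second), and UNION removes the overlap.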

Related

PG SQL UNION With ORDER BY, LIMIT Performance Optimization

I am trying to execute a query with an ORDER BY clause and a LIMIT clause for performance. Consider the following schema.
ONE
(id, name)
(1 , a)
(2 , b)
(5 , c)
TWO
(id, name)
(3 , d)
(4 , e)
(5 , f)
I want to be able to get a list of people from tables one and two ordered by ID.
The current query I have is as follows.
WITH combined AS (
(SELECT * FROM one ORDER BY id DESC)
UNION ALL
(SELECT * FROM two ORDER BY id DESC)
)
SELECT * FROM combined ORDER BY id LIMIT 5
the output of the table will be
(id, name)
(1 , a)
(2 , b)
(3 , d)
(4 , e)
(5 , c)
You'll notice that last row "c" or "f" will change based on the order of the UNION (one UNION two versus two UNION one). That's not important as I only care about the order for ID.
Unfortunately, this query does a full scan of both tables as per the ORDER BY on "combined". My table one and two are both billions of rows.
I am looking for a query that can search both tables simultaneously, if possible. That is, rather than reading through all of "one" for the entries I need, the query would walk both tables in ID order: whenever the current ID in one table is lower than the current ID in the other, it keeps reading from that table until its ID catches up to or passes the other table's, then switches back.
The correct order of reading the table, given one UNION two would be a, b, d, e, c/f.
Do you just mean this?
WITH combined AS (
(SELECT * FROM one ORDER BY id LIMIT 5)
UNION ALL
(SELECT * FROM two ORDER BY id LIMIT 5)
)
SELECT * FROM combined ORDER BY id LIMIT 5
That will select the 5 "lowest id" rows from each table (which is the minimum you need to guarantee 5 output rows) and then find the lowest of those.
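Using the sample tables from the question, this can be checked end to end (SQLite through Python for illustration; SQLite needs the per-branch LIMITs wrapped in subselects, whereas PostgreSQL accepts the parenthesized form):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE one (id INTEGER, name TEXT);
CREATE TABLE two (id INTEGER, name TEXT);
""")
conn.executemany("INSERT INTO one VALUES (?, ?)", [(1, "a"), (2, "b"), (5, "c")])
conn.executemany("INSERT INTO two VALUES (?, ?)", [(3, "d"), (4, "e"), (5, "f")])

rows = conn.execute("""
WITH combined AS (
    SELECT * FROM (SELECT * FROM one ORDER BY id LIMIT 5)
    UNION ALL
    SELECT * FROM (SELECT * FROM two ORDER BY id LIMIT 5)
)
SELECT id, name FROM combined ORDER BY id LIMIT 5
""").fetchall()
print(rows)  # ids 1..5; the id-5 row is "c" or "f" depending on tie order
```

Each branch only needs its own 5 lowest rows, which an index on id can supply without a full scan.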
Thanks to a_horse_with_no_name's comment on Richard Huxton's answer regarding adding an index, the query runs considerably faster, from indeterminate to under one minute.
In my case, the query was still too slow, and I came across the following solution.
Consider using results from one table to limit results from another table. The following solution, in combination with indexing by id, worked for my tables with billions of rows, but operates on the assumption that table "one" is faster than table "two" to finish the query.
WITH first AS (SELECT * FROM one ORDER BY id LIMIT 5),
filter AS (SELECT max(id) AS id FROM first),
second AS (SELECT * FROM two
WHERE id < (SELECT filter.id FROM filter)
ORDER BY id LIMIT 5),
combined AS (
SELECT * FROM first
UNION ALL
SELECT * FROM second
)
SELECT * FROM combined ORDER BY id LIMIT 5
By using the largest ID from the first (completed) query, I can limit the scope that the database has to scan to complete the second query.
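The pruning idea can be verified on a small example (SQLite through Python for illustration; the data is invented and the CTE is renamed `cutoff` because FILTER is a SQLite keyword). Note that for the pruning to be safe, the cutoff must be the largest id of the first batch, not the smallest: any row of "two" with an id below that bound could still belong in the final five.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE one (id INTEGER, name TEXT);
CREATE TABLE two (id INTEGER, name TEXT);
""")
conn.executemany("INSERT INTO one VALUES (?, ?)",
                 [(i, "one") for i in (10, 20, 30, 40, 50, 60)])
conn.executemany("INSERT INTO two VALUES (?, ?)",
                 [(i, "two") for i in (15, 25, 100)])

rows = conn.execute("""
WITH first AS (SELECT * FROM one ORDER BY id LIMIT 5),
     cutoff AS (SELECT max(id) AS id FROM first),
     second AS (SELECT * FROM two
                WHERE id < (SELECT id FROM cutoff)  -- prune "two" using "one"
                ORDER BY id LIMIT 5),
     combined AS (SELECT * FROM first UNION ALL SELECT * FROM second)
SELECT id FROM combined ORDER BY id LIMIT 5
""").fetchall()
print([r[0] for r in rows])
```

Only the rows of "two" below the cutoff (15 and 25, not 100) are read, and the merged result is still the correct overall five lowest ids.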

postgresql: how to get the last record even with WHERE clause

I have the following PostgreSQL query
SELECT *
FROM (
SELECT *
FROM tablename
ORDER by id DESC
LIMIT 1000
) as t
WHERE t.col1 = 'someval'
Now, along with the above query, I also want to get the last record of
FROM (
SELECT *
FROM tablename
ORDER by id DESC
LIMIT 1000
)
Currently I am doing
SELECT *
FROM (
SELECT *
FROM tablename
ORDER by id DESC
LIMIT 1000
) as t
WHERE t.col1 = 'someval'
UNION ALL
SELECT *
FROM (
SELECT *
FROM tablename
ORDER by id DESC
LIMIT 1000
) as t
ORDER BY id ASC
LIMIT 1
Is this the right way?
I would use UNION rather than UNION ALL in this case, since the final row could also be returned by the first query, and I wouldn't want to have it twice in the result set if that happens. The primary key guarantees that UNION cannot accidentally remove rows that are genuinely different.
I don't understand the query, in particular why there is a WHERE condition at the outside query in the first case, but not in the second. But that is unrelated to the question.
Your current effort is wrong, since the LIMIT 1 applies outside the UNION ALL, so you get only one row as a result. That this is wrong should have been immediately obvious upon testing, so it is baffling that you are asking us if it is right.
You should wrap the whole second SELECT in parentheses, so the LIMIT applies just to it.
Better yet, rather than ordering and taking 1000 rows and then reversing the order and taking the first row, you could just do OFFSET 999 LIMIT 1 to get the 1000th row.
If the 1000th rows matches both conditions, do you want to see it twice?
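The OFFSET shortcut is easy to check on a small scale (SQLite through Python for illustration, with 10 rows standing in for the question's 1000, so OFFSET 4 plays the role of OFFSET 999):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tablename (id INTEGER PRIMARY KEY, col1 TEXT)")
conn.executemany("INSERT INTO tablename VALUES (?, ?)",
                 [(i, "someval" if i % 2 == 0 else "other")
                  for i in range(1, 11)])

# Last row of the newest-5 window, via sort-then-reverse (the question's way):
subq = conn.execute("""
SELECT id FROM (SELECT id FROM tablename ORDER BY id DESC LIMIT 5)
ORDER BY id ASC LIMIT 1
""").fetchone()[0]

# The same row fetched directly with OFFSET, no second sort needed:
direct = conn.execute("""
SELECT id FROM tablename ORDER BY id DESC LIMIT 1 OFFSET 4
""").fetchone()[0]

print(subq, direct)  # both approaches land on the same row
```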

The last updated data shows first in the Postgres select query?

I have simple query that takes some results from User model.
Query 1:
SELECT users.id, users.name, company_id, updated_at
FROM "users"
WHERE (TRIM(telephone) = '8973847' AND company_id = 90)
LIMIT 20 OFFSET 0;
Then I did an update on the customer 341683 and ran the same query again; this time the result order was different, with the last updated row showing first. So is Postgres returning the last updated row first by default, or is something else happening here?
Without an order by clause, the database is free to return rows in any order, and will usually just return them in whichever way is fastest. It stands to reason the row you recently updated will be in some cache, and thus returned first.
If you need to rely on the order of the returned rows, you need to explicitly state it, e.g.:
SELECT users.id, users.name, company_id, updated_at
FROM "users"
WHERE (TRIM(telephone) = '8973847' AND company_id = 90)
ORDER BY id -- Here!
LIMIT 20 OFFSET 0

Reform a PostgreSQL query using one (or more) indexes - Example

I am a beginner in PostgreSQL and, after understanding very basic things, I want to find out how I can get a better performance (on a query) by using an index (one or more). I have read some documentation, but I would like a specific example so as to "catch" it.
MY EXAMPLE: Let's say I have just a table (MyTable) with three columns (Customer (text), Time (timestamp), Consumption (integer)) and I want to find the customer(s) with the maximum consumption at '2013-07-01 02:00:00'. MY SOLUTION (without index usage):
SELECT Customer FROM MyTable WHERE Time='2013-07-01 02:00:00'
AND Consumption=(SELECT MAX(consumption) FROM MyTable);
----> What would be the exact full code, using - at least one - index for the query-example above ?
The correct query (with the MAX restricted to the same timestamp) would be:
SELECT Customer
FROM MyTable
WHERE Time = '2013-07-01 02:00:00' AND
Consumption = (SELECT MAX(t2.consumption) FROM MyTable t2 WHERE t2.Time = '2013-07-01 02:00:00');
The above is very reasonable. An alternative approach if you want exactly one row returned is:
SELECT Customer
FROM MyTable
WHERE Time = '2013-07-01 02:00:00'
ORDER BY Consumption DESC
LIMIT 1;
And the best index is MyTable(Time, Consumption, Customer).
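The difference the restriction makes is easy to reproduce (SQLite through Python for illustration; the data is invented): an unrestricted MAX(consumption) is taken over all timestamps, so if a different hour has a higher peak, the original query returns nothing for the hour you asked about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE MyTable (Customer TEXT, Time TEXT, Consumption INTEGER)")
conn.executemany("INSERT INTO MyTable VALUES (?, ?, ?)", [
    ("a", "2013-07-01 02:00:00", 3),
    ("b", "2013-07-01 02:00:00", 7),
    ("c", "2013-07-01 02:00:00", 5),
    ("d", "2013-07-01 03:00:00", 100),  # a later hour with a higher peak
])

# Unrestricted MAX: compares against 100, so nobody at 02:00 matches.
wrong = conn.execute("""
SELECT Customer FROM MyTable
WHERE Time = '2013-07-01 02:00:00'
  AND Consumption = (SELECT MAX(Consumption) FROM MyTable)
""").fetchall()

# MAX restricted to the same timestamp: finds customer b (consumption 7).
right = conn.execute("""
SELECT Customer FROM MyTable
WHERE Time = '2013-07-01 02:00:00'
  AND Consumption = (SELECT MAX(t2.Consumption) FROM MyTable t2
                     WHERE t2.Time = '2013-07-01 02:00:00')
""").fetchall()

print(wrong, right)
```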

PostgreSQL Aggregate groups with limit performance

I'm a newbie in PostgreSQL. Is there a way to improve execution time of the following query:
SELECT s.id, s.name, s.url,
(SELECT array_agg(p.url)
FROM (
SELECT url
FROM pages
WHERE site_id = s.id ORDER BY created DESC LIMIT 5
) as p
) as last_pages
FROM sites s
I haven't found a way to put a LIMIT clause inside the aggregate call, the way an ORDER BY can go there.
There are indexes by created (timestamp) and site_id (integer) in table pages, but the foreign key from sites.id to pages.site_id is absent, unfortunately. The query is intented to return a list of sites with sublists of 5 most recently created pages.
PostgreSQL version is 9.1.5.
You need to start by thinking like the database management system. You also need to think very carefully about what you are asking from the database.
Your fundamental problem here is that you likely trigger a very large number of separate index scans, when a single sequential scan may be quite a bit faster. Your current query gives the planner very little flexibility, because the subqueries must be correlated.
A much better way to do this would be with a view (inline or not) and a window function:
SELECT s.id, s.name, s.url, array_agg(p.url)
FROM sites s
JOIN (SELECT site_id, url,
             row_number() OVER (PARTITION BY site_id ORDER BY created DESC) AS num
      FROM pages) p ON s.id = p.site_id
WHERE num <= 5
GROUP BY s.id, s.name, s.url;
This will likely change a very large number of index scans to a single large sequential scan.
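The window-function filtering can be demonstrated on a miniature version of the schema (SQLite 3.25+ through Python for illustration; the data is invented, and the array_agg/GROUP BY step is replaced here by a simple count per site, since SQLite has no array type):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sites (id INTEGER PRIMARY KEY, name TEXT, url TEXT);
CREATE TABLE pages (id INTEGER PRIMARY KEY, site_id INTEGER,
                    url TEXT, created INTEGER);
""")
conn.executemany("INSERT INTO sites VALUES (?, ?, ?)",
                 [(1, "s1", "http://s1"), (2, "s2", "http://s2")])
# Site 1 has 7 pages, site 2 has 2; only the newest 5 per site should survive.
conn.executemany("INSERT INTO pages (site_id, url, created) VALUES (?, ?, ?)",
                 [(1, f"p{i}", i) for i in range(1, 8)] +
                 [(2, f"q{i}", i) for i in (1, 2)])

rows = conn.execute("""
SELECT s.id, count(p.url)
FROM sites s
JOIN (SELECT site_id, url,
             row_number() OVER (PARTITION BY site_id
                                ORDER BY created DESC) AS num
      FROM pages) p ON s.id = p.site_id
WHERE p.num <= 5
GROUP BY s.id
ORDER BY s.id
""").fetchall()
print(rows)  # at most 5 pages kept per site
```

The row_number() call numbers each site's pages from newest to oldest in one pass, so the `num <= 5` filter keeps exactly the five most recent per site without a per-site subquery.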