Is it true? Does PostgreSQL LIMIT throw away all other instances?

I've been looking for a straight, clean answer to this question. Let's say I have a photo table.
Now this table has 1,000,000 rows. Let's do the following query:
SELECT * FROM photos ORDER BY creation_time LIMIT 10;
Will this query grab all 1,000,000 rows and then give me 10? Or does it just grab the latest 10? I'm quite curious as to how this works, because if it does grab all 1,000,000 (mind you, this table is constantly growing) then it's a wasteful query. You're basically throwing away 999,990 rows. Is there a more efficient way to do this?

Whether the database has to scan the whole table or not depends on a number of
factors - in the case you describe the main factors are whether there is an ORDER BY
clause and whether there is an index on the sort field(s).
All is revealed by looking at the query plan, and the cost approximations on each
of the operations. Consider the case where there is no ordering clause:
testdb=> explain select * from bigtable limit 10;
QUERY PLAN
---------------------------------------------------------------------------
Limit (cost=0.00..0.22 rows=10 width=39)
-> Seq Scan on bigtable (cost=0.00..6943.06 rows=314406 width=39)
(2 rows)
The planner has decided that a sequential scan is the way to go. The expected cost
already gives us a clue. It is expressed as a range, 0.00..6943.06. The first number
(0.00) is the amount of work the database expects to have to do before it can deliver
any rows, while the second number is an estimate of the work required to deliver
the whole scan.
Thus, the input to the 'Limit' clause is going to start straight away, and it will
not have to process the full output of the sequential scan (since the total cost
is only 0.22, not 6943.06). So it definitely will not have to read the whole table
and discard most of it.
Now let's see what happens if you add an ORDER BY clause, using a column that is not
indexed.
testdb=> explain select * from bigtable ORDER BY title limit 10;
QUERY PLAN
---------------------------------------------------------------------------------
Limit (cost=13737.26..13737.29 rows=10 width=39)
-> Sort (cost=13737.26..14523.28 rows=314406 width=39)
Sort Key: title
-> Seq Scan on bigtable (cost=0.00..6943.06 rows=314406 width=39)
(4 rows)
We have a similar plan, but there is a 'Sort' operation in between the seq scan
and the limit. It has to scan the complete table, sort the full content of it,
and only then can it start delivering rows to the Limit clause. It makes sense
when you think about it - LIMIT is supposed to apply after ORDER BY, so it has
to be sure it has found the top 10 rows in the whole table.
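In practice the sort does not have to keep the whole sorted table around: for a small LIMIT, PostgreSQL typically uses a top-N heapsort that only retains the current best 10 rows while it reads the rest. A quick way to check this on your own data (a sketch, reusing the bigtable example from above):
EXPLAIN ANALYZE SELECT * FROM bigtable ORDER BY title LIMIT 10;
-- The ANALYZE output usually includes a line such as
--   Sort Method: top-N heapsort  Memory: ...kB
-- confirming that only the top 10 rows are kept even though the whole table is read.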
Now what happens when an index is used? Suppose we have a 'time' column which is
indexed:
testdb=> explain select * from bigtable ORDER BY time limit 10;
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Limit (cost=0.00..0.35 rows=10 width=39)
-> Index Scan using bigtable_time_idx on bigtable (cost=0.00..10854.96 rows=314406 width=39)
(2 rows)
An index scan, using the time index, is able to start delivering rows in already
sorted order (cost starts at 0.00). The LIMIT can cut the query short after
only 10 rows, so the overall cost is very small.
The moral of the story is to choose carefully which columns or combinations of
columns you index. You can't add indexes indiscriminately, because an index
has a cost of its own - it makes it more expensive to insert, update or
delete records.
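Applied to the photos table from the original question, a minimal sketch (assuming the sort column is creation_time, as in the query at the top):
-- Index to support ORDER BY creation_time ... LIMIT n without a full sort
CREATE INDEX photos_creation_time_idx ON photos (creation_time);

-- With the index in place this should become a cheap index scan that stops
-- after delivering 10 rows, instead of sorting all 1,000,000:
SELECT * FROM photos ORDER BY creation_time LIMIT 10;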

Related

Postgres "Index Scan Backward" slower than "Index Scan"

I have the following basic table in PostgreSQL 12.13:
Record:
database_id: FK to database table
updated_at: timestamp
I've created an index on both the database_id and updated_at fields.
I have a query that fetches the most recent 100 records for a given database id:
SELECT * FROM record WHERE database_id='fa9bcfa6-8d89-4c95-b04a-24c85b169066'
ORDER BY record.updated_at DESC
LIMIT 100;
This query is EXTREMELY slow (recently took about 6 min to run). Here is the query plan:
Limit (cost=0.09..1033.82 rows=100 width=486)
-> Index Scan Backward using record_updated_at on record (cost=0.09..8149369.05 rows=788343 width=486)
Filter: (database_id = 'fa9bcfa6-8d89-4c95-b04a-24c85b169066'::uuid)
If I change ORDER BY DESC to ORDER BY ASC then the query takes milliseconds, even though the query plan looks about the same:
SELECT * FROM record WHERE database_id='fa9bcfa6-8d89-4c95-b04a-24c85b169066'
ORDER BY record.updated_at
LIMIT 100;
Limit (cost=0.09..1033.86 rows=100 width=486)
-> Index Scan using record_updated_at on record (cost=0.09..8149892.78 rows=788361 width=486)
Filter: (database_id = 'fa9bcfa6-8d89-4c95-b04a-24c85b169066'::uuid)
If I remove the ORDER BY completely then the query is also fast:
SELECT * FROM record WHERE database_id='fa9bcfa6-8d89-4c95-b04a-24c85b169066'
LIMIT 100;
Limit (cost=0.11..164.75 rows=100 width=486)
-> Index Scan using record_database_id on record (cost=0.11..1297917.10 rows=788366 width=486)
Index Cond: (database_id = 'fa9bcfa6-8d89-4c95-b04a-24c85b169066'::uuid)
Few questions:
Why is the first query so much slower than the other two? I understand why the last one is faster, but I don't understand why changing the ORDER BY from DESC to ASC makes such a difference. Am I missing an index?
How can I speed up the initial query?
The plan follows the index to read records in order of updated_at DESC, testing each one for the desired database_id, and then stops as soon as it has found 100 with that database_id. But that specific value of database_id is much more common at one end of the index than at the other. There is nothing magical about the DESC here; presumably there is some other value of database_id for which it works the opposite way, finding 100 of them much faster when descending than when ascending. If you had run EXPLAIN (ANALYZE), this would have been immediately clear from the very different reported values for "Rows Removed by Filter".
If you have a multicolumn index on (database_id,updated_at), then it can immediately jump to the desired spot of the index and read the first 100 rows it finds (after "filtering" them for visibility), and that would work fast no matter which direction you want the ORDER BY to go.
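A minimal sketch of that multicolumn index, using the table and column names from the question (the index name is illustrative):
-- One index that covers both the filter and the sort:
CREATE INDEX record_database_id_updated_at_idx
    ON record (database_id, updated_at);

-- A b-tree index can be scanned forward or backward, so both the ASC and DESC
-- versions of the query can now stop after the first 100 matching rows:
SELECT * FROM record WHERE database_id='fa9bcfa6-8d89-4c95-b04a-24c85b169066'
ORDER BY record.updated_at DESC
LIMIT 100;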

Optimal way of using joins in Redshift

I have 2 tables in AWS Redshift. The details are as below:
a) impressions (to count the number of impressions of a particular ad)
Number of rows (170 million)
distribution key(ad_campaign)
sort key (created_on)
b) clicks (to count the number of clicks of a particular ad).
Number of rows (80 million)
distribution key(ad_campaign)
sort key (created_on)
I have a single DC1 Large cluster with 2 slices.
I am trying to run the below query
select impressions.offer_id, count(imp_cnt) from
bidsflyer.tblImpressionLog_Opt impressions
full join bidsflyer.tblTrackingLinkLog_Opt clicks
on impressions.offer_id=clicks.offer_id and date_trunc('week',
impressions.created_on)=date_trunc('week', clicks.created_on)
where impressions.created_on >= '2017-07-27 00:00:00'
group by 1
This query takes more than 8 minutes to run. I think this is quite long considering the volume of data, which I don't feel is huge.
The query plan looks like something below
XN HashAggregate (cost=2778257688268.43..2778257688268.60 rows=67 width=12)
-> XN Hash Left Join DS_DIST_NONE (cost=179619.84..2778170875920.65 rows=17362469555 width=12)
Hash Cond: (("outer".offer_id = "inner".offer_id) AND (date_trunc('week'::text, "outer".created_on) = date_trunc('week'::text, "inner".created_on)))
-> XN Seq Scan on tblimpressionlog_opt impressions (cost=0.00..724967.36 rows=57997389 width=20)
Filter: (created_on >= '2017-07-27 00:00:00'::timestamp without time zone)
-> XN Hash (cost=119746.56..119746.56 rows=11974656 width=12)
-> XN Seq Scan on tbltrackinglinklog_opt clicks (cost=0.00..119746.56 rows=11974656 width=12)
Can anyone provide me guidance on the correct usage of distribution keys and sort keys?
How should I design my query?
Table setup:
1) According to the plan, the most expensive operation is grouping by offer_id. This makes sense because you didn't sort or distribute your data by offer_id. Your tables are quite large, so you can recreate them with an interleaved sort key on (offer_id, created_on) (interleaved keys are supposed to give equal and order-independent weight to the included columns and are known to have a positive effect on larger tables).
2) If you join by weeks, you can materialize your week column (create a physical column and populate it with the date_trunc output). That might save you the computation effort of deriving these values dynamically during the join. However, this operation is cheap, and if your table is already sorted by the timestamp column, Redshift might already scan only the appropriate blocks. Also, if each offer runs for a short period of time (meaning the offer column has high cardinality and a high correlation with the time column), you can use a compound sort key on (offer_id, week_created) that allows a faster merge join, and the aggregate will run fast as well.
3) If you don't use ad_campaign in other queries, you can distribute both tables by offer_id. Having the join column in the distribution key is good practice, although it's unlikely that your query will benefit from this since you have a single node, and distribution style mostly affects multi-node setups.
All of these recommendations are just assumptions made without knowing the exact nature of your data; they need to be benchmarked (create a table with the recommended configuration, as sketched below, copy the data, vacuum, analyze, run the same query at least 3 times, and compare the timings with the original setup). I would appreciate it if you do this and post the results here.
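For example, a sketch of recommendation 2) for the impressions table; the column list and types here are assumptions, since the real DDL isn't shown:
-- Hypothetical redefinition with a compound sort key and offer_id distribution
CREATE TABLE bidsflyer.tblImpressionLog_Opt_v2
(
    offer_id     BIGINT,
    imp_cnt      INTEGER,
    created_on   TIMESTAMP,
    week_created TIMESTAMP   -- materialized date_trunc('week', created_on)
)
DISTKEY (offer_id)
COMPOUND SORTKEY (offer_id, week_created);

-- After loading the data (e.g. INSERT INTO ... SELECT ..., date_trunc('week', created_on) ...),
-- sort the table and refresh statistics before benchmarking:
VACUUM FULL bidsflyer.tblImpressionLog_Opt_v2 TO 100 PERCENT;
ANALYZE bidsflyer.tblImpressionLog_Opt_v2;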
Regarding the query itself, you can replace the FULL JOIN with a plain JOIN because you don't need it. FULL JOIN should be used when you want not only the intersection of both tables but also impressions that don't have related clicks and vice versa, which doesn't seem to be the case here because you are filtering by impressions.created_on and grouping by impressions.offer_id. So all you need is the intersection. Replacing the FULL JOIN with a simple JOIN might also improve query performance. If you want to see offers that have zero clicks, you can use a LEFT JOIN.
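A sketch of the rewritten query with a plain inner join; it is the original query with only the join type changed:
SELECT impressions.offer_id, COUNT(imp_cnt)
FROM bidsflyer.tblImpressionLog_Opt impressions
JOIN bidsflyer.tblTrackingLinkLog_Opt clicks
  ON impressions.offer_id = clicks.offer_id
 AND date_trunc('week', impressions.created_on) = date_trunc('week', clicks.created_on)
WHERE impressions.created_on >= '2017-07-27 00:00:00'
GROUP BY 1;
-- If you materialize a week_created column as suggested above, the second join
-- condition becomes impressions.week_created = clicks.week_created.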
A merge join is faster than a hash join, so you should try to achieve a merge join. Your sort key looks okay, but is your data actually sorted? Redshift does not automatically keep a table's rows sorted by the sort key, so as it stands there is no way for Redshift to perform a merge join on your table. After running a full vacuum on the table, Redshift will be able to perform a merge join.
select * from svv_table_info where "table" = 'impressions';
select * from svv_table_info where "table" = 'clicks';
Use the queries above to check the amount of unsorted data in your tables (the unsorted column shows the percentage of rows that are unsorted).
Run a full vacuum on both of your tables. Depending on the amount of unsorted data, this might take a while and use a lot of your cluster's resources.
VACUUM impressions to 100 percent
VACUUM clicks to 100 percent
If I’ve made a bad assumption please comment and I’ll refocus my answer.

Selecting primary key: Why postgres prefers to do sequential scan vs index scan

I have the following table
create table log
(
id bigint default nextval('log_id_seq'::regclass) not null
constraint log_pkey
primary key,
level integer,
category varchar(255),
log_time timestamp,
prefix text,
message text
);
It contains about 3 million rows.
I'm comparing the following queries:
EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
LIMIT 100000
which yields the following plan:
Limit (cost=0.00..19498.87 rows=100000 width=8)
-> Seq Scan on log (cost=0.00..422740.48 rows=2168025 width=8)
Filter: (log_time < (now() - '3 mons'::interval))
And the same query with ORDER BY id instruction added:
EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
ORDER BY id ASC
LIMIT 100000
which yields
Limit (cost=0.43..25694.15 rows=100000 width=8)
-> Index Scan using log_pkey on log (cost=0.43..557048.28 rows=2168031 width=8)
Filter: (log_time < (now() - '3 mons'::interval))
I have the following questions:
The absence of an ORDER BY instruction allows Postgres not to care about the order of rows; they might just as well be delivered sorted. Why does it not use the index without ORDER BY?
How can Postgres use an index in such a query at all? The WHERE clause of the query contains a non-indexed column, and to fetch that column a sequential scan of the table would be required, but the plan for the query with ORDER BY doesn't indicate that.
The Postgres manual page says:
For a query that requires scanning a large fraction of the table, an explicit sort is likely to be faster than using an index because it requires less disk I/O due to following a sequential access pattern
Can you please clarify this statement for me? An index is always ordered, and reading an ordered structure is always faster; it is always sequential access (at least in terms of page scanning) compared to reading non-ordered data and then ordering it manually.
Can you please clarify this statement for me? An index is always ordered, and reading an ordered structure is always faster; it is always sequential access (at least in terms of page scanning) compared to reading non-ordered data and then ordering it manually.
The index is read sequentially, yes, but postgres needs to follow up with a read of the rows from the table. That is, in most cases, if an index identifies 100 rows, then postgres will need to perform up to 100 random reads against the table.
Internally, the postgres planner weighs sequential and random reads differently, with random reads generally much more expensive. The settings seq_page_cost and random_page_cost determine those. There are other settings you can view and tinker with if you want, though I recommend being very conservative with modifications.
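To see what the planner currently assumes about sequential versus random page reads (a quick check, not a tuning recommendation):
-- Default values are seq_page_cost = 1.0 and random_page_cost = 4.0, which is
-- why plans built on many random heap reads are penalized so heavily.
SHOW seq_page_cost;
SHOW random_page_cost;
-- Or both at once:
SELECT name, setting FROM pg_settings
WHERE name IN ('seq_page_cost', 'random_page_cost');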
Let's go back to your earlier questions:
The absence of an ORDER BY instruction allows Postgres not to care about the order of rows; they might just as well be delivered sorted. Why does it not use the index without ORDER BY?
The reason is the sort. As you note later, the index doesn't include the constraining column, so it doesn't make any sense to use the index. Instead, the planner is basically saying "read the whole table, figure out which rows conform to the constraint, and then return the first 100000 of them, in whatever order we find them".
The sort changes things. In that case, the planner is saying "we need to sort by this field, and we have an index which is already sorted, so read rows from the table in index order, checking against the constraint, until we have 100000 of them, and return that set".
You'll note that the cost estimates (e.g. '0.43..25694.15') are much higher for the second query -- the planner thinks that doing so many random reads from the index scan is going to cost significantly more than just reading the whole table at once with no sorting.
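If you wanted the WHERE clause itself to be index-assisted, the missing piece would be an index on the constraining column; a sketch for the log table from the question:
-- Lets the planner find old rows directly rather than filtering a full scan.
-- Note that when the condition matches a large fraction of the table (here
-- roughly 2.1M of 3M rows), the planner may still prefer a sequential scan
-- as the cheaper plan, exactly as the manual quote above describes.
CREATE INDEX log_log_time_idx ON log (log_time);

EXPLAIN SELECT id
FROM log
WHERE log_time < now() - INTERVAL '3 month'
LIMIT 100000;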
Hope that helps, and let me know if you have further questions.

How does postgres decide whether to use index scan or seq scan?

explain analyze shows that postgres will use index scanning for my query that fetches rows and performs filtering by date (i.e., 2017-04-14 05:27:51.039):
explain analyze select * from tbl t where updated > '2017-04-14 05:27:51.039';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------
Index Scan using updated on tbl t (cost=0.43..7317.12 rows=10418 width=93) (actual time=0.011..0.515 rows=1179 loops=1)
Index Cond: (updated > '2017-04-14 05:27:51.039'::timestamp without time zone)
Planning time: 0.102 ms
Execution time: 0.720 ms
However, running the same query with a different date filter, '2016-04-14 05:27:51.039', shows that postgres will run the query using a seq scan instead:
explain analyze select * from tbl t where updated > '2016-04-14 05:27:51.039';
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------
Seq Scan on tbl t (cost=0.00..176103.94 rows=5936959 width=93) (actual time=0.008..2005.455 rows=5871963 loops=1)
Filter: (updated > '2016-04-14 05:27:51.039'::timestamp without time zone)
Rows Removed by Filter: 947
Planning time: 0.100 ms
Execution time: 2910.086 ms
How does postgres decide what to use, specifically when filtering by date?
The Postgres query planner bases its decisions on cost estimates and column statistics, which are gathered by ANALYZE and opportunistically by some other utility commands. That all happens automatically when autovacuum is on (by default).
The manual:
Most queries retrieve only a fraction of the rows in a table, due to
WHERE clauses that restrict the rows to be examined. The planner thus
needs to make an estimate of the selectivity of WHERE clauses, that
is, the fraction of rows that match each condition in the WHERE
clause. The information used for this task is stored in the
pg_statistic system catalog. Entries in pg_statistic are updated by
the ANALYZE and VACUUM ANALYZE commands, and are always approximate
even when freshly updated.
There is a row count (in pg_class), a list of most common values, etc.
The more rows Postgres expects to find, the more likely it is to switch to a sequential scan, which is cheaper for retrieving large portions of a table.
Generally, the progression is index scan -> bitmap index scan -> sequential scan as the number of rows expected to be retrieved grows.
For your particular example, the important statistic is histogram_bounds, which gives Postgres a rough idea of how many rows have a greater value than the given one. There is the more convenient view pg_stats for the human eye:
SELECT histogram_bounds
FROM pg_stats
WHERE tablename = 'tbl'
AND attname = 'updated';
There is a dedicated chapter explaining row estimation in the manual.
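The row count and most-common-values mentioned above live in the same catalogs and can be inspected directly (a sketch; reltuples is only the planner's estimate, not an exact count):
-- Planner's current estimate of the table's total row count
SELECT reltuples::bigint AS estimated_rows
FROM pg_class
WHERE relname = 'tbl';

-- Most common values and their frequencies for the filtered column
SELECT n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'tbl'
AND attname = 'updated';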
Obviously, optimization of queries is tricky. This answer is not intended to dive into the specifics of the Postgres optimizer. Instead, it is intended to give you some background on how the decision to use an index is made.
Your first query is estimated to return 10,418 rows. When using an index, the following operations happen:
The engine uses the index to find the first value meeting the condition.
The engine then loops over the values, finishing when the condition is no longer true.
For each value in the index, the engine then looks up the data on the data page.
In other words, there is a little bit of overhead when using the index -- initializing the index and then looking up each data page individually.
When the engine does a full table scan it:
Starts with the first record on the first page
Does the comparison and accepts or rejects the record
Continues sequentially through all data pages
There is no additional overhead. Further, the engine can "pre-load" the next pages to be scanned while processing the current page. This overlap of I/O and processing is a big win.
The point I'm trying to make is that getting the balance between these two can be tricky. Somewhere between 10,418 and 5,936,959, Postgres decides that the index overhead (and fetching the pages randomly) costs more than just scanning the whole table.
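One way to watch that balance tip is to temporarily forbid sequential scans in a session and compare the resulting plan with the one above (an experiment only, not something for production):
BEGIN;
SET LOCAL enable_seqscan = off;  -- planner avoids seq scans whenever an alternative exists
EXPLAIN ANALYZE select * from tbl t where updated > '2016-04-14 05:27:51.039';
ROLLBACK;
-- Comparing the estimated cost and actual time of the forced index plan with the
-- sequential scan plan shows why the planner switched for the older date.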

PostgreSQL query doesn't use index

I have a very simple db schema, which has a multi column b-tree index on following columns:
PersonId, Amount, Commission
Now, if I try to select from the table with the following query:
explain select * from "Order" where "PersonId" = 2 AND "Commission" > 3
Pg is scanning the index and the query is very fast, but if I try the following query:
explain select * from "Order" where "PersonId" > 2 AND "Commission" > 3
It does a sequential scan, even when the index is present. Even this query
explain select * from "Order" where "Commission" > 3
does a sequential scan.
Anyone care to explain why? :-)
Thank you very much.
UPDATE
The table contains 100 million rows. I have created it just to test PostgreSQL performance against MS SQL. The table has already been VACUUMed. I'm running a Core i5 2500K quad-core CPU and 8 GB of RAM.
Here's the result of explain analyze for this query:
explain ANALYZE select * from "Order" where "Commission" BETWEEN 3000000 AND 3000010 LIMIT 20
Limit (cost=0.00..2218328.00 rows=1 width=24) (actual time=28043.249..28043.249 rows=0 loops=1)
-> Seq Scan on "Order" (cost=0.00..2218328.00 rows=1 width=24) (actual time=28043.247..28043.247 rows=0 loops=1)
Filter: (("Commission" >= 3000000::numeric) AND ("Commission" <= 3000010::numeric))
Total runtime: 28043.278 ms
The short answer is that when comparing the various available plans, the sequential scan is expected to be the fastest, based on the costing factors you have configured and the latest statistics available. From what little information you've provided, it seems quite likely that the planner has made the right choice. If you had three single-column indexes, it might be able to use bitmap index scans, particularly if the rows to be selected are less than about 10% of the rows in the table.
Note that with the index you describe, the entire index would need to be scanned for all rows where "PersonId" > 2, which, unless you have a lot of negative values for "PersonId", is very likely to be most of the table.
Also note that if you have a tiny table -- say a few thousand rows or less, accessing the rows through an index will rarely be faster than just scanning those few rows. Plans are sensitive to data volume, and the plan you get with a small number of rows is very unlikely to be the same plan you get with a lot of rows.
If it is, in fact, not picking the fastest plan, the odds are good that you need to adjust your cost factors to better model the costs on your machine. Another possibility is that you need to be more aggressive in your autovacuum settings, to make sure up-to-date statistics are available, or you may need to configure collection of finer-grained statistics.
People will be able to provide more specific advice if you show the table descriptions (including indexes), the EXPLAIN ANALYZE output for the query, and a description of the hardware.
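For the BETWEEN query in the update specifically, the existing index cannot help much because "Commission" is not its leading column; only an index led by "Commission" would. A sketch (the index name is illustrative):
CREATE INDEX order_commission_idx ON "Order" ("Commission");

EXPLAIN ANALYZE select * from "Order"
WHERE "Commission" BETWEEN 3000000 AND 3000010 LIMIT 20;
-- With up-to-date statistics, a range this narrow should now be able to use an
-- index or bitmap index scan instead of the 28-second sequential scan.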