I am writing some scripts that need to determine the last timestamp of a timeseries datastream that can be interrupted.
I am currently working out the most efficient way to do this. The simplest would be to look for the largest timestamp using MAX. As the tables in question are TimescaleDB hypertables, they are indexed, so in theory it should be a case of following the index to find the largest value, which should be a very efficient operation. However, I am not sure if this is actually true and was wondering if anyone knows how MAX scales when it works down an index; I know it's an O(n) function normally.
If there is an index on the column, max can use the index and becomes effectively O(1), since a single descent of the B-tree reaches the largest value:
EXPLAIN (COSTS OFF) SELECT max(attrelid) FROM pg_attribute;
                                          QUERY PLAN
══════════════════════════════════════════════════════════════════════════════════════════════
 Result
   InitPlan 1 (returns $0)
     ->  Limit
           ->  Index Only Scan Backward using pg_attribute_relid_attnum_index on pg_attribute
                 Index Cond: (attrelid IS NOT NULL)
(5 rows)
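Applied to the original timeseries case, the same plan shape should appear. A minimal sketch, assuming a hypothetical readings table with an index on its ts column (adjust the names to your actual hypertable; TimescaleDB creates an index on the time column by default):

-- hypothetical table and index, stand-ins for the real hypertable
CREATE TABLE readings (ts timestamptz NOT NULL, value double precision);
CREATE INDEX readings_ts_idx ON readings (ts);

-- should be answered by a backward index scan that stops after the first (largest) entry
EXPLAIN (COSTS OFF) SELECT max(ts) FROM readings;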
I have a table with 1 million records, with 100k records having NULL in colA. The remaining records have pretty distinct values; is there a difference between creating a regular index on this column and a partial index with where colA is not null?
Since regular Postgres indexes do not store NULL values, wouldn't it be the same as creating a partial index with where colA is not null?
Any pros or cons to either index?
If you create a partial index that excludes NULLs, Postgres will not use it to find NULLs.
Here's a test with a full index on 13.5.
# create index idx_test_num on test(num);
CREATE INDEX
# explain select count(*) from test where num is null;
                                     QUERY PLAN
-------------------------------------------------------------------------------------
 Aggregate  (cost=5135.00..5135.01 rows=1 width=8)
   ->  Bitmap Heap Scan on test  (cost=63.05..5121.25 rows=5500 width=0)
         Recheck Cond: (num IS NULL)
         ->  Bitmap Index Scan on idx_test_num  (cost=0.00..61.68 rows=5500 width=0)
               Index Cond: (num IS NULL)
(5 rows)
And with a partial index.
# create index idx_test_num on test(num) where num is not null;
CREATE INDEX
# explain select count(*) from test where num is null;
                                      QUERY PLAN
--------------------------------------------------------------------------------------
 Finalize Aggregate  (cost=10458.12..10458.13 rows=1 width=8)
   ->  Gather  (cost=10457.90..10458.11 rows=2 width=8)
         Workers Planned: 2
         ->  Partial Aggregate  (cost=9457.90..9457.91 rows=1 width=8)
               ->  Parallel Seq Scan on test  (cost=0.00..9352.33 rows=42228 width=0)
                     Filter: (num IS NULL)
(6 rows)
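If the goal is to find the NULL rows, the complementary partial index can serve that query. A sketch (not part of the original test), using the same test table:

-- hypothetical partial index covering only the NULL rows
create index idx_test_num_null on test(num) where num is null;

-- "where num is null" matches the index predicate, so this very small index can be used
explain select count(*) from test where num is null;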
Since regular postgres indexes do not store NULL values...
This has not been true since version 8.2, which was [checks notes] 16 years ago. The 8.2 docs say...
Indexes are not used for IS NULL clauses by default. The best way to use indexes in such cases is to create a partial index using an IS NULL predicate.
8.3 introduced NULLS FIRST and NULLS LAST and many other improvements around NULLs, including...
Allow col IS NULL to use an index (Teodor)
It all depends.
NULL values have been included in (default) B-tree indexes since Postgres 8.3, as Schwern mentioned. However, predicates like the one you mention (where colA is not null) are only properly supported since Postgres 9.0. The release notes:
Allow IS NOT NULL restrictions to use indexes (Tom Lane)
This is particularly useful for finding MAX()/MIN() values in
indexes that contain many null values.
GIN indexes followed later:
As of PostgreSQL 9.1, null key values can be included in the index.
Typically, a partial index makes sense if it excludes a major part of the table from the index, making it substantially smaller and saving writes to the index. Since B-tree indexes are so shallow, bare seek performance scales fantastically (once the index is cached). 10 % fewer index entries hardly matter in that area.
Your case would exclude only around 10% of all rows, and that rarely pays. A partial index adds some overhead for the query planner and excludes queries that don't match the index condition. (The Postgres query planner doesn't try hard if the match is not immediately obvious.)
OTOH, Postgres will rarely use an index for predicates retrieving 10 % of the table - a sequential scan will typically be faster. Again, it depends.
If (almost) all queries exclude NULL anyway (in a way the Postgres planner understands), then a partial index excluding only 10 % of all rows is still a sensible option. But it may backfire if query patterns change. The added complexity may not be worth it.
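To judge the size saving for a concrete table, the two variants can simply be built side by side and compared. A sketch with hypothetical table and index names:

-- hypothetical: build both variants and compare their on-disk size
create index cola_full_idx on tbl (colA);
create index cola_partial_idx on tbl (colA) where colA is not null;

select pg_size_pretty(pg_relation_size('cola_full_idx')) as full_size,
       pg_size_pretty(pg_relation_size('cola_partial_idx')) as partial_size;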
Also worth noting that there are still corner cases with NULL values in Postgres indexes. I bumped into this case recently where Postgres proved unwilling to read sorted rows from a multicolumn index when the first index expression was filtered with IS NULL (making a partial index preferable for the case):
db<>fiddle here
So, it depends on the complete picture.
Let's say I have a table with some columns and a column dt which is of type TIMESTAMP.
I create a (non-functional) index on this column.
Then I execute a query
SELECT *
FROM tbl
WHERE
dt::DATE = NOW()::DATE
The question is: will Postgres use the index I created earlier, and under which circumstances will it or won't it?
I understand that a functional index would cover this case, but does a simple index also cover the case of a TIMESTAMP -> DATE type conversion, or not?
EDIT:
Performing an EXPLAIN ANALYZE on the query shows that it does not use the index and performs a Seq Scan (table with 3+ million records):
 Seq Scan on tbl  (cost=0.00..192289.92 rows=17043 width=12) (actual time=7.237..2493.496 rows=4928 loops=1)
   Filter: ((dt)::date = (now())::date)
   Rows Removed by Filter: 3397155
 Total runtime: 2494.546 ms
Let me ask the question differently then: is it possible to make Postgres utilize this index, or should I create another one?
A simple index will not work in this case; try it with EXPLAIN.
What you could do to use the simple index is
WHERE dt >= current_date::timestamptz
AND dt < (current_date + 1)::timestamptz
I think that this is pretty readable and the best solution, but if you want to go with your current query, you'll have to add a second index on (dt::date).
Don't forget that every additional index costs space and slows down the performance of data modifying statements.
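For completeness, a sketch of that second (expression) index, using the table and column names from the question. The cast in the index definition has to match the cast used in the query, and this only works because dt is a plain TIMESTAMP (the timestamp-to-date cast is immutable):

-- expression index matching the dt::date cast in the original predicate
CREATE INDEX tbl_dt_date_idx ON tbl ((dt::date));

-- the original query can now use this index
EXPLAIN SELECT * FROM tbl WHERE dt::date = now()::date;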
I've been looking for a straight, clean answer to this question. Let's say I have a photo table.
Now this table has 1,000,000 rows. Let's run the following query:
SELECT * FROM photos ORDER BY creation_time LIMIT 10;
Will this query grab all 1,000,000 rows and then give me 10, or does it just grab the latest 10? I'm quite curious how this works, because if it does grab all 1,000,000 (mind you, this table is constantly growing), then it's a wasteful query. You're basically throwing 999,990 rows away. Is there a more efficient way to do this?
Whether the database has to scan the whole table or not depends on a number of
factors - in the case you describe the main factors are whether there is an ORDER BY
clause and whether there is an index on the sort field(s).
All is revealed by looking at the query plan, and the cost approximations on each
of the operations. Consider the case where there is no ordering clause:
testdb=> explain select * from bigtable limit 10;
                                 QUERY PLAN
---------------------------------------------------------------------------
 Limit  (cost=0.00..0.22 rows=10 width=39)
   ->  Seq Scan on bigtable  (cost=0.00..6943.06 rows=314406 width=39)
(2 rows)
The planner has decided that a sequential scan is the way to go. The expected cost
already gives us a clue. It is expressed as a range, 0.00..6943.06. The first number
(0.00) is the amount of work the database expects to have to do before it can deliver
any rows, while the second number is an estimate of the work required to deliver
the whole scan.
Thus, the input to the 'Limit' clause is going to start straight away, and it will
not have to process the full output of the sequential scan (since the total cost
is only 0.22, not 6943.06). So it definitely will not have to read the whole table
and discard most of it.
Now lets see what happens if you add an ORDER BY clause, using a column that is not
indexed.
testdb=> explain select * from bigtable ORDER BY title limit 10;
                                    QUERY PLAN
---------------------------------------------------------------------------------
 Limit  (cost=13737.26..13737.29 rows=10 width=39)
   ->  Sort  (cost=13737.26..14523.28 rows=314406 width=39)
         Sort Key: title
         ->  Seq Scan on bigtable  (cost=0.00..6943.06 rows=314406 width=39)
(4 rows)
We have a similar plan, but there is a 'Sort' operation in between the seq scan
and the limit. It has to scan the complete table, sort the full content of it,
and only then can it start delivering rows to the Limit clause. It makes sense
when you think about it - LIMIT is supposed to apply after ORDER BY; so it would
have to be sure to have found the top 10 rows in the whole table.
Now what happens when an index is used? Suppose we have a 'time' column which is
indexed:
testdb=> explain select * from bigtable ORDER BY time limit 10;
                                                 QUERY PLAN
----------------------------------------------------------------------------------------------------------
 Limit  (cost=0.00..0.35 rows=10 width=39)
   ->  Index Scan using bigtable_time_idx on bigtable  (cost=0.00..10854.96 rows=314406 width=39)
(2 rows)
An index scan, using the time index, is able to start delivering rows already in sorted order (its cost starts at 0.00). The LIMIT can cut the query short after
only 10 rows, so the overall cost is very small.
The moral to the story is to carefully choose which columns or combinations of
columns you will index. You can't add them indiscriminately because adding an
index has a cost of its own - it makes it more expensive to insert, update or
delete records.
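Translating that back to the photos question, a sketch (the index name is illustrative):

-- index matching the ORDER BY, so LIMIT 10 can stop after reading ten index entries
CREATE INDEX photos_creation_time_idx ON photos (creation_time);

EXPLAIN SELECT * FROM photos ORDER BY creation_time LIMIT 10;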
We have a table with an indexed array column:
CREATE TABLE mention (
    id SERIAL,
    phraseIds integer[],
    PRIMARY KEY (id)
);
CREATE INDEX indx_mentions_phraseIds on mention USING GIN (phraseids public.gin__int_ops);
Queries using the "overlaps" operator on this column don't seem to use the index:
explain analyze select m.id FROM mention m WHERE m.phraseIds && ARRAY[11638,11639];
 Seq Scan on mention m  (cost=0.00..933723.44 rows=1404 width=4) (actual time=103.018..3751.525 rows=1101 loops=1)
   Filter: (phraseids && '{11638,11639}'::integer[])
   Rows Removed by Filter: 7019974
 Total runtime: 3751.618 ms
Is it possible to get Postgresql to use the index? Or should we be doing something else?
Update: I repeated the test with 'SET enable_seqscan TO off' and the index is still not used.
Update: I should have mentioned that I am using 9.2 with the intarray extension.
Update: It seems that the intarray extension is part of this problem. I re-created the table without using the intarray extension and the index is used as expected. Anyone know how to get the index to be used with the intarray extension? The docs (http://www.postgresql.org/docs/9.2/static/intarray.html) say that indexes are supported for &&.
I built a similar table in PostgreSQL 9.2; the difference was USING GIN (phraseids); I don't seem to have int_ops available in this context for some reason. I loaded a few thousand rows of random (ish) data.
Setting enable_seqscan off let PostgreSQL use the index.
PostgreSQL calculated the cost of a sequential scan to be less than the cost of a bitmap heap scan. The actual time of a sequential scan was 10% the actual time of a bitmap heap scan, but the total run time for a sequential scan was a little more than the total run time of a bitmap heap scan.
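Consistent with the question's last update, one thing to try (a sketch, not a guaranteed fix) is rebuilding the index with the default GIN operator class, which backs the built-in array && operator; whether it helps depends on which && operator the query actually resolves to once intarray is installed:

-- sketch: drop the intarray-specific index and rebuild it on the default array operator class
DROP INDEX indx_mentions_phraseIds;
CREATE INDEX indx_mentions_phraseIds ON mention USING GIN (phraseids);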
Very simple example - one table, one index, one query:
CREATE TABLE book
(
    id bigserial NOT NULL,
    "year" integer,
    -- other columns...
);

CREATE INDEX book_year_idx ON book (year);
EXPLAIN
SELECT *
FROM book b
WHERE b.year > 2009
gives me:
 Seq Scan on book b  (cost=0.00..25663.80 rows=105425 width=622)
   Filter: (year > 2009)
Why does it NOT perform an index scan instead?
What am I missing?
If the SELECT returns more than approximately 5-10% of all rows in the table, a sequential scan is much faster than an index scan.
This is because an index scan requires several IO operations for each row (look up the row in the index, then retrieve the row from the heap), whereas a sequential scan only requires a single IO for each row - or even less, because a block (page) on the disk contains more than one row, so more than one row can be fetched with a single IO operation.
Btw: this is true for other DBMSs as well, optimizations such as "index only scans" aside (but for a SELECT * it's highly unlikely such a DBMS would go for an "index only scan").
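A quick way to see both plans for the query from the question (a sketch; disabling enable_seqscan is only for experimenting, not for production):

-- let the planner choose freely
EXPLAIN ANALYZE SELECT * FROM book b WHERE b.year > 2009;

-- take the sequential scan away and compare estimated cost and actual runtime
SET enable_seqscan = off;
EXPLAIN ANALYZE SELECT * FROM book b WHERE b.year > 2009;
RESET enable_seqscan;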
Did you ANALYZE the table/database? And what about the statistics? When there are many records where year > 2009, a sequential scan might be faster than an index scan.
@a_horse_with_no_name explained it quite well. Also, if you really want an index scan to be used, you should generally use bounded ranges in the WHERE clause, e.g.
year >= 2019 and year < 2020.
A lot of the time, statistics are not updated on a table, and it may not be possible to do so due to constraints. In that case, the optimizer will not know how many rows it would get for year > 2019, so it selects a sequential scan in lieu of full knowledge. Bounded ranges will solve the problem most of the time.
In an index scan, the read head jumps from one row to another, which is roughly 1000 times slower than reading the next physical block (as in a sequential scan).
So, as a rough rule, if (number of records to be retrieved * 1000) is less than the total number of records, the index scan will perform better.