I'm exploring how Postgres works in different circumstances. My question concerns multi-column indexes. Let's say I create and populate the following table.
DROP TABLE IF EXISTS perf_test;
create table perf_test(id int, reason text, annotation text);
insert into perf_test(id, reason, annotation)
select s.id, md5(random()::text), null
from generate_series(1,10000000) as s(id)
order by random();
-- do this separately to avoid the same values for both columns
UPDATE perf_test
SET annotation = md5(random()::text);
If I run a query such as:
EXPLAIN ANALYZE
SELECT *
FROM perf_test
WHERE reason LIKE 'bc%' AND annotation LIKE 'ab%';
I get a parallel sequential scan, not surprisingly. If I build an index for both columns:
CREATE INDEX idx_perf_test_reason_annotation
ON perf_test(reason,annotation);
and then run the same query again, I get the same result: a sequential scan. Why doesn't the planner switch to an index scan? The result consists of about 100 records (not that many, I'd think). I tried other prefixes that return only a handful of records, but the planner still chose a sequential scan. Why won't it use the index even when the result consists of just a couple of rows?
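One way to test whether the planner can use the index at all (a diagnostic only, never something to leave on in production) is to penalize sequential scans and re-run the EXPLAIN:

SET enable_seqscan = off;  -- diagnostic only: makes seq scans look prohibitively expensive to the planner

EXPLAIN ANALYZE
SELECT *
FROM perf_test
WHERE reason LIKE 'bc%' AND annotation LIKE 'ab%';

RESET enable_seqscan;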
I have a table with the following columns:
ID (VARCHAR)
CUSTOMER_ID (VARCHAR)
STATUS (VARCHAR) (4 different status possible)
other columns that aren't relevant
I'm trying to find all the rows for a given customer_id that have either of two statuses.
The query looks like:
SELECT *
FROM my_table
WHERE customer_id = '12345678' AND status IN ('STATUS1', 'STATUS2');
The table contains about 1 million rows. I added two indexes, one on customer_id and one on status. The query still takes about 1 second to run.
The explain plan is:
Gather
  ->  Seq Scan on my_table
        Filter: (((status)::text = ANY ('{SUBMITTED,CANCELLED}'::text[])) AND ((customer_id)::text = '12345678'::text))
I ran ANALYZE my_table after creating the indexes. What could I do to improve the performance of this quite simple query?
You need a compound (multi-column) index to help satisfy your query.
Guessing, the most selective column (the one with the most distinct values) is customer_id; status probably has only a few distinct values. So customer_id should go first in the index. Try this:
CREATE INDEX customer_id_status ON my_table (customer_id, status);
This creates a BTREE index. A useful mental model for such an index is an old-fashioned telephone book: it's kept in sorted order, so you look up the first matching entry and then scan forward for the items you want.
You may also want to try running ANALYZE my_table; to update the statistics (about selectivity) used by the query planner to choose an appropriate index.
Pro tip Avoid SELECT * if you can. Instead name the columns you want. This can help performance a lot.
Pro tip Your question said some of your columns aren't relevant. For query optimization that's probably not true; index design is a weird art, and SELECT * drags every column into the picture.
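To make both tips concrete, here is a sketch (it assumes PostgreSQL 11+ for the INCLUDE clause, and needed_col stands in for whatever extra column your query actually reads). With a covering index and an explicit column list, the query can often be satisfied by an index-only scan:

CREATE INDEX customer_id_status_covering
    ON my_table (customer_id, status) INCLUDE (needed_col);

SELECT customer_id, status, needed_col  -- name the columns instead of SELECT *
FROM my_table
WHERE customer_id = '12345678'
  AND status IN ('STATUS1', 'STATUS2');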
Within my database I have a table prediction_fsd with about 5 million rows. The site table contains approximately 3 million rows. I need to execute queries that look like
SELECT prediction_fsd.id AS prediction_fsd_id,
prediction_fsd.site_id AS prediction_fsd_site_id,
prediction_fsd.html_hash AS prediction_fsd_html_hash,
prediction_fsd.prediction AS prediction_fsd_prediction,
prediction_fsd.algorithm AS prediction_fsd_algorithm,
prediction_fsd.model_version AS prediction_fsd_model_version,
prediction_fsd.timestamp AS prediction_fsd_timestamp,
site_1.id AS site_1_id,
site_1.url AS site_1_url,
site_1.status AS site_1_status
FROM prediction_fsd
LEFT OUTER JOIN site AS site_1
ON site_1.id = prediction_fsd.site_id
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
LIMIT 1
At the moment this query takes about 4 seconds. I'd like to reduce that by introducing an index. Which tables and fields should I include in that index? I'm having trouble properly understanding the EXPLAIN ANALYZE output of Postgres.
CREATE INDEX prediction_fsd_site_id_algorithm_timestamp
ON public.prediction_fsd USING btree
(site_id, algorithm, "timestamp" DESC)
TABLESPACE pg_default;
By introducing a combined index, as suggested by Frank Heikens, I was able to bring the query execution time down to 0.25 s.
These three SQL lines point to a possible BTREE index to help you.
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
You're filtering the rows of the table by equality on two columns, and ordering by the third column. So try this index.
CREATE INDEX site_alg_ts ON prediction_fsd
(site_id, algorithm, timestamp DESC);
This BTREE index lets PostgreSQL jump straight to the first eligible row, which also happens to be the row you want given your ORDER BY ... LIMIT 1 clause.
The query plan in your question says that PostgreSQL did an expensive Parallel Sequential Scan on all five megarows of that table. This index will almost certainly change that to a cheap index lookup.
On the other table, it appears that you already look up rows in it via the primary key id. So you don't need any other index for that one.
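If you want to confirm the index is picked up, re-run the query under EXPLAIN after creating it; the plan should show an Index Scan using the new index (plus the primary-key lookup on site) in place of the Parallel Sequential Scan. Something like:

EXPLAIN (ANALYZE, BUFFERS)
SELECT prediction_fsd.id, prediction_fsd.timestamp, site_1.url
FROM prediction_fsd
LEFT OUTER JOIN site AS site_1 ON site_1.id = prediction_fsd.site_id
WHERE prediction_fsd.site_id = 95806
  AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
LIMIT 1;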
I'm running this SQL command on PostgreSQL 11:
CREATE TABLE IF NOT EXISTS my_temp_table AS TABLE my_enormous_table WITH NO DATA;
It takes 5 minutes to make the new table.
The EXPLAIN ... is:
Seq Scan on my_enormous_table (cost=0.00..35999196.34 rows=143407234 width=3278)
Moving to a query like CREATE TABLE ... AS SELECT * FROM my_enormous_table WHERE FALSE; is orders of magnitude faster: there is no seq scan, and the outcome is the same.
Any ideas what could be causing this issue?
WITH NO DATA still executes the query; it just discards the result.
The better way to do that would be to avoid CREATE TABLE ... AS:
CREATE TABLE my_temp_table (LIKE my_enormous_table);
That also allows you to use the INCLUDING clause to copy default values, storage parameters, constraints and other things from the original table:
CREATE TABLE my_temp_table (LIKE my_enormous_table
INCLUDING CONSTRAINTS INCLUDING DEFAULTS);
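If you want the copy to pick up indexes, identity settings and everything else as well, there is also a shorthand; check the documentation for exactly which properties you want carried over:

CREATE TABLE my_temp_table (LIKE my_enormous_table INCLUDING ALL);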
I'm trying to write a query for a rather large table (10 million+ rows would be a typical size), the result of which needs to be filtered on various predicates/conditions based on some business logic. My question: does the query optimizer (in SQL Server 2008+) attempt to use a single index for the whole query, or does it attempt to use different indexes on a per-subquery basis?
Consider the following:
--Use Index A
SELECT Set1
FROM ATable
WHERE AColumn = sarg-able value
UNION ALL
--Are we stuck with Index A?
SELECT Set2
FROM ATable
WHERE BColumn = sarg-able value
If we choose Index A for Set1, are we stuck with Index A for the entire query, or is the optimizer smart enough to use a different index for Set2 (assuming one exists)?
Everything #andreyNikolov said is 100% correct. This is the kind of thing you can easily figure out on your own by reviewing the Actual Execution Plan (Not Estimated Execution plan). Note the following sample data, table and index structure:
USE tempdb -- safe place in Dev to test this kind of thing...
GO
-- sample data and indexes
IF OBJECT_ID('dbo.ATable','U') IS NOT NULL DROP TABLE dbo.ATable
CREATE TABLE dbo.ATable
(
Set1 INT NOT NULL,
Set2 INT NOT NULL,
AColumn INT NOT NULL,
BColumn INT NOT NULL
);
INSERT dbo.ATable (Set1, Set2, AColumn, BColumn)
VALUES (1,2,3,3),(1,2,4,4),(5,5,6,6),(11,22,40,40),(11,20,40,44),(11,22,14,4),(1,2,3,3);
CREATE NONCLUSTERED INDEX indexA ON dbo.ATable(AColumn) INCLUDE(Set1);
CREATE NONCLUSTERED INDEX indexB ON dbo.ATable(BColumn) INCLUDE(Set2);
Now run the following with "Include Actual Execution Plan" turned on.
SELECT Set1 --Use Index A
FROM dbo.ATable
WHERE AColumn = 3
UNION ALL
SELECT Set2 --Use Index B
FROM dbo.ATable
WHERE BColumn = 4;
... and the execution plan:
The query above the UNION ALL performs a nonclustered seek against IndexA's key column (AColumn). Because I included Set1 as an include column on IndexA, IndexA can satisfy the query without a RID or Key Lookup against the underlying table. This is how indexes should be designed. The same is true of the query below the UNION ALL, except that it uses IndexB.
Again, this is the kind of thing that is easy to figure out on your own once you have a full understanding of how to read the execution plans.
Yes, the optimizer is smart enough. These are two separate operations, which can be performed either as a table/index scans or seeks. The decision for executing each one of them is independent and it is perfectly normal to use different indexes for each of them. Then the results of both operations will be combined.
I have a table in PostgreSQL that contains an array column which is updated constantly.
In my application I need to get the number of rows for which a specific parameter is not present in that array column. My query looks like this:
select count(id)
from table
where not (ARRAY['parameter value'] <@ table.array_column)
But as the number of rows and the number of executions of that query grow (several per second, possibly hundreds or thousands), performance decreases a lot. It seems to me that counting in PostgreSQL might run in linear time (I'm not completely sure of this).
Basically my question is:
Is there an existing pattern I'm not aware of that applies to this situation? What would be the best approach for this?
Any suggestion you could give me would be really appreciated.
PostgreSQL actually supports GIN indexes on array columns. Unfortunately, such an index doesn't seem to be usable for NOT (ARRAY[...] <@ indexed_col), and GIN indexes are unsuitable for frequently-updated tables anyway.
Demo:
CREATE TABLE arrtable (id integer primary key, array_column integer[]);
INSERT INTO arrtable (id, array_column) VALUES (1, ARRAY[1,2,3,4]);
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
-- Use the following *only* for testing whether Pg can use an index
-- Do not use it in production.
SET enable_seqscan = off;
explain (buffers, analyze) select count(id)
from arrtable
where not (ARRAY[1] <@ arrtable.array_column);
Unfortunately, this shows that as written we can't use the index. If you don't negate the condition it can be used, so you can search for and count rows that do contain the search element (by removing NOT).
You could use the index to count entries that do contain the target value, then subtract that result from a count of all entries. Since counting all rows in a table is quite slow in PostgreSQL (9.1 and older) and requires a sequential scan this will actually be slower than your current query. It's possible that on 9.2 an index-only scan can be used to count the rows if you have a b-tree index on id, in which case this might actually be OK:
SELECT (
SELECT count(id) FROM arrtable
) - (
SELECT count(id) FROM arrtable
WHERE (ARRAY[1] <@ arrtable.array_column)
);
It's guaranteed to perform worse than your original version for Pg 9.1 and below, because in addition to the seqscan your original query requires, it also needs a GIN index scan. I've now tested this on 9.2 and it does appear to use an index for the count, so it's worth exploring for 9.2. With some less trivial dummy data:
drop index arrtable_arraycolumn_gin_arr_idx ;
truncate table arrtable;
insert into arrtable (id, array_column)
select s, ARRAY[1,2,s,s*2,s*3,s/2,s/4] FROM generate_series(1,1000000) s;
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
Note that a GIN index like this will slow updates down a LOT, and is quite slow to create in the first place. It is not suitable for tables that get updated much at all - like your table.
Worse, the query using this index takes up to twice as long as your original query and at best half as long on the same data set. It's worst for cases where the index is not very selective, like ARRAY[1]: 4 s vs 2 s for the original query. Where the index is highly selective (i.e. not many matches, like ARRAY[199]) it runs in about 1.2 seconds vs the original's 3 s. This index simply isn't worth having for this query.
The lesson here? Sometimes, the right answer is just to do a sequential scan.
Since that won't do for your hit rates, either maintain a materialized view with a trigger as #debenhur suggests, or try to invert the array to be a list of parameters that the entry does not have so you can use a GiST index as #maniek suggests.
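As a rough sketch of the trigger-maintained counter (all names here are made up; adapt them to your schema, and note that the single counter row will serialize concurrent writers), you keep a running count of rows whose array does not contain the value, and the expensive count becomes a single-row read:

-- Counter table holding the number of rows whose array_column lacks the value 1
CREATE TABLE missing_param_count (n bigint NOT NULL);
INSERT INTO missing_param_count VALUES (0);

CREATE OR REPLACE FUNCTION maintain_missing_param_count() RETURNS trigger AS $$
BEGIN
    -- A row that lacked the parameter is being removed or changed: decrement
    IF TG_OP IN ('UPDATE', 'DELETE') AND NOT (ARRAY[1] <@ OLD.array_column) THEN
        UPDATE missing_param_count SET n = n - 1;
    END IF;
    -- A row that lacks the parameter is being added or changed: increment
    IF TG_OP IN ('INSERT', 'UPDATE') AND NOT (ARRAY[1] <@ NEW.array_column) THEN
        UPDATE missing_param_count SET n = n + 1;
    END IF;
    RETURN NULL;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER arrtable_missing_param_count
AFTER INSERT OR UPDATE OR DELETE ON arrtable
FOR EACH ROW EXECUTE PROCEDURE maintain_missing_param_count();

-- The per-request count is then a single-row read:
SELECT n FROM missing_param_count;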
Is there an existing pattern I'm not aware of that applies to this situation? What would be the best approach for this?
Your best bet in this situation might be to normalize your schema. Split the array out into a table. Add a b-tree index on the table of properties, or order the primary key so it's efficiently searchable by property.
CREATE TABLE demo( id integer primary key );
INSERT INTO demo (id) SELECT id FROM arrtable;
CREATE TABLE properties (
demo_id integer not null references demo(id),
property integer not null,
primary key (demo_id, property)
);
CREATE INDEX properties_property_idx ON properties(property);
You can then query the properties:
SELECT count(id)
FROM demo
WHERE NOT EXISTS (
SELECT 1 FROM properties WHERE demo.id = properties.demo_id AND property = 1
)
I expected this to be a lot faster than the original query, but it's actually much the same with the same sample data; it runs in the same 2s to 3s range as your original query. It's the same issue where searching for what is not there is much slower than searching for what is there; if we're looking for rows containing a property we can avoid the seqscan of demo and just scan properties for matching IDs directly.
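To illustrate that last point, the positive form of the lookup (counting rows that do have the property) can be answered from the properties table alone, roughly like this:

SELECT count(DISTINCT demo_id)
FROM properties
WHERE property = 1;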
Again, a seq scan on the array-containing table does the job just as well.
I think with your current data model you are out of luck. Try to think of an algorithm that the database would have to execute for your query: there is no way it could work without sequentially scanning the data.
Can you arrange the column so that it stores the inverse of the data (so that the query would be select count(id) from table where ARRAY['parameter value'] <@ table.array_column)? This query could use a GIN/GiST index.
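A rough sketch of that inverted layout, reusing the arrtable demo from the other answer (the missing_params column name is made up, and the approach only works if the full set of possible parameter values is known and bounded):

ALTER TABLE arrtable ADD COLUMN missing_params integer[] NOT NULL DEFAULT '{}';

CREATE INDEX arrtable_missing_params_gin
    ON arrtable USING GIN (missing_params);

-- "Which rows do NOT have parameter 1?" becomes a plain containment query:
SELECT count(id)
FROM arrtable
WHERE ARRAY[1] <@ missing_params;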