I have a BIG table with multiple indexes in Postgres. It has indexes on db_timestamp, id, and username.
I want to find the MAX timestamp for a particular username.
The problem is that a simple query like
SELECT MAX(db_timestamp) FROM Foo WHERE username = 'foo'
takes a very long time because of the huge table size (we are talking about a 450 GB table with over 30 GB of indexes).
Is there any way to optimize this query, or to tell Postgres which query plan to use?
Create an index on username and db_timestamp with the correct sort order:
CREATE INDEX idx_foo ON foo (username ASC, db_timestamp DESC);
Check EXPLAIN to see if things work as they should.
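For example, a quick sanity check:
EXPLAIN SELECT MAX(db_timestamp) FROM foo WHERE username = 'foo';
-- you want to see an Index Only Scan (or Index Scan) using idx_foo,
-- not a Seq Scan on foo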
PostgreSQL can't use an index on (db_timestamp, id, username) to satisfy that query: the term you're filtering on has to be a prefix of the index, i.e. it must use the first column(s).
So an index on (username, db_timestamp) would serve that query very well, since it just has to scan the subtree (username, 0)..(username, +inf) (and if I recall correctly, PostgreSQL should actually know to find (username, +inf) and walk backwards in-order).
In general, "covering indexes" aren't as useful a technique in PostgreSQL as in other databases, because PostgreSQL needs to consult the heap tuples for visibility information (index-only scans, added in 9.2, mitigate this when the visibility map is current).
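As a side note, the same index also serves the equivalent ORDER BY/LIMIT form, which makes the intended access path explicit (a sketch; it assumes db_timestamp is NOT NULL, since MAX ignores NULLs but ORDER BY ... DESC sorts them first):
SELECT db_timestamp
FROM foo
WHERE username = 'foo'
ORDER BY db_timestamp DESC
LIMIT 1;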
To the best of my knowledge, Postgres finalizes an execution plan for a query somewhere around its 1st–5th execution and then sticks to it.
My query (the table contains billions of rows and I have to pick the top n):
select col1, col2
from table_a
where col1='a'
and col3='b'
order by col1 desc
limit 5;
There is an existing index (ix_1) on (col1, col3) that the query is using.
After moving up to Postgres 12, I created a covering index (ix_2) as follows:
(col1 desc, col3) include (col2)
Now I want the query to use ix_2 so it becomes an index-only scan, since col2 is included in ix_2, but the query still uses the old ix_1.
Since index-forcing hints don't work in Postgres either, is there any way to force the query to recreate its execution plan, so that it considers my new index (ix_2)?
What an interesting username.
I think your assumptions about what is going on are wrong. Creating a new index sends out an invalidation message, which should force all other sessions to re-plan the query even if they think they already know the best plan.
Most likely what is going on is that the planner just re-picks the old plan anyway, because it still thinks it will be fastest. An index-only scan is only beneficial if many of the table pages are marked as all-visible. If few of them are, there isn't much to gain. But the new index is probably larger, which gives it a (slightly) higher cost estimate. You should VACUUM the table to make sure the visibility map is current.
But really, if you just want to get it to use the index-only scan rather than do a root-cause analysis, you can just drop the old index. There is no point in having both.
Also, I wouldn't bother with INCLUDE, unless col2 is of a type that doesn't define btree operators. Just throw it into the main body of the index like (col1 desc, col3, col2).
Finally, there is no point in ordering by a column whose values you have just forced to be identical.
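Putting that together, a minimal sketch (the name ix_3 is made up; it assumes you drop both existing indexes, per the advice above):
DROP INDEX ix_1;
DROP INDEX ix_2;
-- col2 goes in the main body of the index instead of INCLUDE
CREATE INDEX ix_3 ON table_a (col1 DESC, col3, col2);
-- refresh the visibility map so the index-only scan looks cheap to the planner
VACUUM ANALYZE table_a;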
I have a table in a Postgres database. After the table reached 2 million rows, the query became slower. This is my query:
SELECT
c.source,
c.destination,
c.product_id,
sum(c.weight),
count(c.weight),
c.owner_id
FROM stock c
GROUP BY c.source, c.destination, c.product_id, c.owner_id;
I already added an index:
CREATE INDEX stock_custom_idx ON public.stock USING btree (source, destination, product_id, owner_id)
The query is very slow, so I ran EXPLAIN ANALYZE, but the index is not used.
So, how can I optimize this query? It takes too long and doesn't return data.
Try this index:
CREATE INDEX better_index ON public.stock USING btree
(source, destination, product_id, owner_id, weight);
If you do not include the weight, this information still needs to be fetched from the table, thus you have a full table scan.
With the new index, you should have an index only scan. Also, the query planner can make use of the sorting order of the index for the grouping (just as it could have done with your index).
In newer versions of PostgreSQL (11 and later) there is also the INCLUDE clause, which lets you "add" columns to an index without affecting its sort order (the data is there, but that part of it is not sorted). This would make the index yet a bit more performant for your query, I guess.
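For reference, a sketch of that INCLUDE variant (the index name is made up):
CREATE INDEX stock_covering_idx ON public.stock USING btree
    (source, destination, product_id, owner_id) INCLUDE (weight);
-- weight is stored in the index leaf pages but takes no part in the sort order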
I have a table with a geometry column.
I have 2 indexes on this column:
create index idg1 on tbl using gist(geom)
create index idg2 on tbl using gist(st_geomfromewkb((geom)::bytea))
I have a lot of queries using the geom (geometry) field.
Which index is used? (When and why?)
If there are two indexes on the same column (as shown here), can SELECT queries run slower than with just one index defined on the column?
The use of an index depends on how the index was defined and how the query is invoked. If you SELECT <cols> FROM tbl WHERE geom = <some_value>, then you will use the idg1 index. If you SELECT <cols> FROM tbl WHERE st_geomfromewkb((geom)::bytea) = <some_value>, then you will use the idg2 index (an expression index only matches the exact expression).
A good way to know which index will be used for a particular query is to call the query with EXPLAIN (i.e., EXPLAIN SELECT <cols> FROM tbl WHERE geom = <some_value>) -- this will print out the query plan, which access methods, which indexes, which joins, etc. will be used.
For your question regarding performance: SELECT queries could run slower because there are more indexes to consider during the query-planning phase. In terms of executing a given query plan, a SELECT will not run slower, because by then the plan has been established and the decision of which index to use has already been made.
You will certainly experience performance impact upon INSERT/UPDATE/DELETE of the table, as all indexes will need to be updated with respect to the changes in the table. As such, there will be extra I/O activity on disk to propagate the changes, slowing down the database, especially at scale.
Which index is used depends on the query.
Any query that has
WHERE geom && '...'::geometry
or
WHERE st_intersects(geom, '...'::geometry)
or similar will use the first index.
The second index will only be used for queries that have the expression st_geomfromewkb((geom)::bytea) in them.
This is completely useless: it converts the geometry to EWKB format and back. You should find and rewrite all queries that have this weird construct, then you should drop that index.
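A hypothetical cleanup along those lines:
-- before: WHERE st_geomfromewkb((geom)::bytea) && '...'::geometry
-- after:  WHERE geom && '...'::geometry
DROP INDEX idg2;  -- the expression index is then dead weight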
Having two indexes on a single column does not slow down your queries significantly (planning will take a bit longer, but I doubt if you can measure that). You will have a performance penalty for every data modification though, which will take almost twice as long as with a single index.
We have a table with 10 million rows. We need to find the first few rows matching LIKE 'user%'.
This query is fast if it matches at least 2 rows (it returns results in 0.5 s). If it doesn't find 2 rows matching the criteria, it takes at least 10 s. 10 seconds is huge for us (we use this for auto-suggestions, and users will not wait that long to see them).
Query: select distinct(name) from user_sessions where name like 'user%' limit 2;
In the above query, the name column is of type citext and it is indexed.
Whenever you're working on performance, start by EXPLAINing your query. That'll show the query optimizer's plan, and you can get a sense of how long it's spending on the various pieces. In particular, check for any full table scans, which mean the database is examining every row in the table.
Since the query is fast when it finds something and slow when it doesn't, it sounds like you are indeed hitting a full table scan. I believe you that it's indexed, but since you're doing a like, the standard string index can't be used efficiently. You'll want to check out varchar_pattern_ops (or text_pattern_ops, depending on the column type of name). You create that this way:
CREATE INDEX pattern_index_on_users_name ON users (name varchar_pattern_ops);
After creating an index, run EXPLAIN on the query again to make sure the index is being used. text_pattern_ops doesn't work with the citext extension, so in this case you'll have to index and search on lower(name) to get good case-insensitive performance:
CREATE INDEX pattern_index_on_users_name ON users (lower(name) text_pattern_ops);
SELECT * FROM users WHERE lower(name) LIKE 'user%' LIMIT 2;
I have a table in a PostgreSQL database that contains an array column which is updated constantly.
In my application I need to get the number of rows for which a specific parameter is not present in that array column. My query looks like this:
select count(id)
from table
where not (ARRAY['parameter value'] <@ table.array_column)
But as the number of rows and the number of executions of that query grow (several times per second, possibly hundreds or thousands), performance decreases a lot. It seems to me that counting in PostgreSQL might be linear in the table size (I'm not completely sure of this).
Basically my question is:
Is there an existing pattern I'm not aware of that applies to this situation? What would be the best approach for this?
Any suggestion you could give me would be really appreciated.
PostgreSQL actually supports GIN indexes on array columns. Unfortunately, a GIN index doesn't seem to be usable for NOT ARRAY[...] <@ indexed_col, and GIN indexes are unsuitable for frequently updated tables anyway.
Demo:
CREATE TABLE arrtable (id integer primary key, array_column integer[]);
INSERT INTO arrtable (id, array_column) VALUES (1, ARRAY[1,2,3,4]);
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
-- Use the following *only* for testing whether Pg can use an index
-- Do not use it in production.
SET enable_seqscan = off;
explain (buffers, analyze) select count(id)
from arrtable
where not (ARRAY[1] <@ arrtable.array_column);
Unfortunately, this shows that as written we can't use the index. If you don't negate the condition it can be used, so you can search for and count rows that do contain the search element (by removing NOT).
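For illustration, the positive form, which can use the GIN index:
-- counts rows whose array DOES contain the element
SELECT count(id)
FROM arrtable
WHERE ARRAY[1] <@ arrtable.array_column;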
You could use the index to count entries that do contain the target value, then subtract that result from a count of all entries. Since counting all rows in a table is quite slow in PostgreSQL (9.1 and older) and requires a sequential scan, this will actually be slower than your current query. It's possible that on 9.2 an index-only scan can be used to count the rows if you have a b-tree index on id, in which case this might actually be OK:
SELECT (
SELECT count(id) FROM arrtable
) - (
SELECT count(id) FROM arrtable
WHERE (ARRAY[1] <@ arrtable.array_column)
);
It's guaranteed to perform worse than your original version on Pg 9.1 and below, because in addition to the seqscan your original requires, it also needs a GIN index scan. I've now tested this on 9.2, and it does appear to use an index for the count, so it's worth exploring there. With some less trivial dummy data:
drop index arrtable_arraycolumn_gin_arr_idx ;
truncate table arrtable;
insert into arrtable (id, array_column)
select s, ARRAY[1,2,s,s*2,s*3,s/2,s/4] FROM generate_series(1,1000000) s;
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
Note that a GIN index like this will slow updates down a LOT, and is quite slow to create in the first place. It is not suitable for tables that get updated much at all - like your table.
Worse, the query using this index takes up to twice as long as your original query, and at best about half as long, on the same data set. It's worst where the index is not very selective, like ARRAY[1]: 4 s vs 2 s for the original query. Where the index is highly selective (i.e. few matches, like ARRAY[199]) it runs in about 1.2 s vs the original's 3 s. This index simply isn't worth having for this query.
The lesson here? Sometimes, the right answer is just to do a sequential scan.
Since that won't do for your hit rates, either maintain a materialized view with a trigger as @debenhur suggests, or try to invert the array to be a list of parameters the entry does not have, so you can use a GiST index as @maniek suggests.
Is there an existing pattern I'm not aware of that applies to this situation? What would be the best approach for this?
Your best bet in this situation might be to normalize your schema. Split the array out into a separate table. Add a b-tree index on the table of properties, or order the primary key so it's efficiently searchable by property.
CREATE TABLE demo( id integer primary key );
INSERT INTO demo (id) SELECT id FROM arrtable;
CREATE TABLE properties (
demo_id integer not null references demo(id),
property integer not null,
primary key (demo_id, property)
);
CREATE INDEX properties_property_idx ON properties(property);
You can then query the properties:
SELECT count(id)
FROM demo
WHERE NOT EXISTS (
SELECT 1 FROM properties WHERE demo.id = properties.demo_id AND property = 1
)
I expected this to be a lot faster than the original query, but it's actually much the same with the same sample data; it runs in the same 2 s to 3 s range as your original query. It's the same issue: searching for what is not there is much slower than searching for what is there. If we're looking for rows containing a property, we can avoid the seqscan of demo and just scan properties for matching IDs directly.
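By contrast, a positive lookup can go straight to the properties index (a sketch):
-- rows that DO have property 1: uses properties_property_idx directly,
-- without touching demo
SELECT count(demo_id)
FROM properties
WHERE property = 1;
-- (demo_id, property) is the primary key, so demo_id is already unique here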
Again, a seq scan on the array-containing table does the job just as well.
I think with your current data model you are out of luck. Try to think of an algorithm that the database would have to execute for your query: there is no way it could work without sequentially scanning the data.
Can you arrange the column so that it stores the inverse of the data (so that the query would be select count(id) from table where ARRAY['parameter value'] <@ table.array_column)? That query could use a GIN/GiST index.
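A minimal sketch of that inversion (tbl stands in for the question's table, since table is a reserved word; the column name missing_params is made up, and keeping it current on every update is left to a trigger or the application):
-- store the parameters each row does NOT have
ALTER TABLE tbl ADD COLUMN missing_params text[];
CREATE INDEX tbl_missing_params_gin ON tbl USING gin (missing_params);
-- the NOT-contains question becomes a plain containment query
SELECT count(id) FROM tbl WHERE ARRAY['parameter value'] <@ tbl.missing_params;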