Frequently used queries in Postgres

I have a simple update query, shown below, that is used frequently.
UPDATE table_name SET column_1 = ?, column_2 = ? WHERE id = ?;
The update query runs every time a customer has a chat with support.
I have tried adding indexes on the two columns used in the query to reduce the execution time.
I don't see much difference. Is it correct to use indexes here, or is there another approach to reduce how often the query runs or how long it takes?

There are a couple of things you can do to make this faster:
an index on id
no index on column_1 or column_2, and a fillfactor under 100 on the table
Then you can get HOT updates, which are a considerable performance gain, since the indexes don't have to be updated (see the sketch below).
use prepared statements to save on planning time
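A minimal sketch of both suggestions, assuming the table really is called table_name and that column_1/column_2 are text and id is an integer (adjust types to your schema):
-- Leave free space on each page so updated row versions can stay on the same page,
-- which is what enables HOT updates (90 is only an illustrative value).
ALTER TABLE table_name SET (fillfactor = 90);
-- The new fillfactor only applies to newly written pages; a table rewrite
-- (e.g. VACUUM FULL, which locks the table) applies it to existing ones.

-- Prepared statement: parsed and planned once per session, executed many times.
-- Parameter types here are assumed; match your schema.
PREPARE update_chat_row(text, text, integer) AS
    UPDATE table_name SET column_1 = $1, column_2 = $2 WHERE id = $3;
EXECUTE update_chat_row('value 1', 'value 2', 42);
Note that prepared statements are per connection, so they combine well with connection pooling or with drivers that prepare statements automatically.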

create 2 indexes on same column

I have a table with geometry column.
I have 2 indexes on this column:
create index idg1 on tbl using gist(geom)
create index idg2 on tbl using gist(st_geomfromewkb((geom)::bytea))
I have a lot of queries using the geom (geometry) field.
Which index is used ? (when and why)
If there are two indexes on the same column (as shown here), can SELECT queries run slower than if just one index were defined on the column?
The use of an index depends on how the index was defined, and how the query is invoked. If you SELECT <cols> FROM tbl WHERE geom = <some_value>, then you will use the idg1 index. If you SELECT <cols> FROM tbl WHERE st_geomfromewkb((geom)::bytea) = <some_value>, then you will use the idg2 index.
A good way to know which index will be used for a particular query is to call the query with EXPLAIN (i.e., EXPLAIN SELECT <cols> FROM tbl WHERE geom = <some_value>) -- this will print out the query plan, which access methods, which indexes, which joins, etc. will be used.
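For example, a hedged illustration with a PostGIS bounding-box predicate (the envelope values are just placeholders):
EXPLAIN
SELECT <cols>
FROM tbl
WHERE geom && st_makeenvelope(0, 0, 10, 10, 4326);
If the plan shows an Index Scan (or Bitmap Index Scan) on idg1, that index is being used; a Seq Scan means neither index helps for that predicate.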
For your question regarding performance, the SELECT queries could run slower because there are more indexes to consider in the query planning phase. In terms of executing a given query plan, a SELECT query will not run slower because by then the query plan has been established and the decision of which index to use has been made.
You will certainly see a performance impact on INSERT/UPDATE/DELETE, as every index has to be updated to reflect the changes to the table. That means extra I/O activity on disk to propagate the changes, slowing down the database, especially at scale.
Which index is used depends on the query.
Any query that has
WHERE geom && '...'::geometry
or
WHERE st_intersects(geom, '...'::geometry)
or similar will use the first index.
The second index will only be used for queries that have the expression st_geomfromewkb((geom)::bytea) in them.
This is completely useless: it converts the geometry to EWKB format and back. You should find and rewrite all queries that have this weird construct, then you should drop that index.
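One way to double-check that idg2 is really unused before dropping it is to look at the cumulative index statistics:
-- idx_scan counts how many times each index has been used since the statistics were last reset.
select indexrelname, idx_scan
from pg_stat_user_indexes
where relname = 'tbl';

-- If idg2 shows idx_scan = 0 over a representative period, drop it:
drop index idg2;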
Having two indexes on a single column does not slow down your queries significantly (planning will take a bit longer, but I doubt if you can measure that). You will have a performance penalty for every data modification though, which will take almost twice as long as with a single index.

How can I get the total run time of a query in redshift, with a query?

I'm in the process of benchmarking some queries in redshift so that I can say something intelligent about changes I've made to a table, such as adding encodings and running a vacuum. I can query the stl_query table with a LIKE clause to find the queries I'm interested in, so I have the query id, but tables/views like stv_query_summary are much too granular and I'm not sure how to generate the summarization I need!
The GUI dashboard shows the metrics I'm interested in, but the format is difficult to store for later analysis/comparison (in other words, I want to avoid taking screenshots). Is there a good way to rebuild that view with SQL SELECTs?
To add to Alex's answer: the stl_query table has the inconvenience that if the query sat in a queue before running, the queue time is included in its run time, so that runtime is not a very good indicator of the query's performance.
To get the actual execution time of the query, check total_exec_time in stl_wlm_query.
select total_exec_time
from stl_wlm_query
where query = <query_id>;
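If you also want to see how much of the elapsed time was spent waiting in the queue, a small extension of that query (both columns come from the same system table and are reported in microseconds):
select query,
       total_queue_time / 1000000.0 as queue_seconds,
       total_exec_time  / 1000000.0 as exec_seconds
from stl_wlm_query
where query = <query_id>;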
There are some useful tools/scripts in https://github.com/awslabs/amazon-redshift-utils
Here is one of those scripts, stripped down to give you query run times in milliseconds. Play with the filters, ordering etc. to show the results you are looking for:
select userid,
       label,
       stl_query.query,
       trim(database) as database,
       trim(querytxt) as qrytext,
       starttime,
       endtime,
       datediff(milliseconds, starttime, endtime)::numeric(12,2) as run_milliseconds,
       aborted,
       decode(alrt.event,
              'Very selective query filter', 'Filter',
              'Scanned a large number of deleted rows', 'Deleted',
              'Nested Loop Join in the query plan', 'Nested Loop',
              'Distributed a large number of rows across the network', 'Distributed',
              'Broadcasted a large number of rows across the network', 'Broadcast',
              'Missing query planner statistics', 'Stats',
              alrt.event) as event
from stl_query
left outer join (
    select query, trim(split_part(event, ':', 1)) as event
    from stl_alert_event_log
    group by query, trim(split_part(event, ':', 1))
) as alrt on alrt.query = stl_query.query
where userid <> 1
-- and (querytxt like 'SELECT%' or querytxt like 'select%')
-- and database = ''
order by starttime desc
limit 100

First call of query on big table is surprisingly slow

I have a query that feels like it is taking more time than it should. This only applies to the first execution for a given set of parameters; once cached, there is no issue.
I am not sure what to expect, but given the setup and settings I was hoping someone could shed some light on a few questions and give some insight into what can be done to speed up the query. The table in question is fairly large; Postgres estimates around 155,963,000 rows in it (14 GB).
Query
select ts, sum(amp) as total_amp, sum(230 * factor) as wh
from data_cbm_aggregation_15_min
where virtual_id in (1818) and ts between '2015-02-01 00:00:00' and '2015-03-31 23:59:59'
and deleted is null
group by ts
order by ts
When I started looking into this, the query took around 15 seconds; after some changes I have gotten it down to around 10 seconds, which still seems long for a simple query like this. Here are the results from EXPLAIN ANALYZE: http://explain.depesz.com/s/97V1. Note that the reason GroupAggregate returns the same number of rows is that this example uses only one virtual_id, but there can be more.
Table and index
The table being queried; values are inserted into it every 15 minutes:
CREATE TABLE data_cbm_aggregation_15_min (
virtual_id integer NOT NULL,
ts timestamp without time zone NOT NULL,
amp real,
recs smallint,
min_amp real,
max_amp real,
deleted boolean,
factor real DEFAULT 0.25,
min_amp_ts timestamp without time zone,
max_amp_ts timestamp without time zone
)
ALTER TABLE data_cbm_aggregation_15_min ALTER COLUMN virtual_id SET STATISTICS 1000;
ALTER TABLE data_cbm_aggregation_15_min ALTER COLUMN ts SET STATISTICS 1000;
The index that is used in the query
CREATE UNIQUE INDEX idx_data_cbm_aggregation_15_min_virtual_id_ts
ON data_cbm_aggregation_15_min USING btree (virtual_id, ts DESC);
ALTER TABLE data_cbm_aggregation_15_min
CLUSTER ON idx_data_cbm_aggregation_15_min_virtual_id_ts;
Postgres settings
The following settings differ from the defaults; all other settings are at their default values.
default_statistics_target = 100
maintenance_work_mem = 2GB
effective_cache_size = 11GB
work_mem = 256MB
shared_buffers = 3840MB
random_page_cost = 1
What I have tried
I have been following the Things to try before you post in https://wiki.postgresql.org/wiki/Slow_Query_Questions and the results in a bit more detail were as follows:
Fiddling with the Postgres settings, mostly lowering random_page_cost, since the index scan, while nothing special, is miles ahead of the bitmap heap scan the planner chose instead when random_page_cost was higher.
Increasing the statistics targets on the virtual_id and ts columns, which the index and WHERE conditions are based on. The query planner's estimated row count was much closer to the actual row count after this change.
Clustering on the idx_data_cbm_aggregation_15_min_virtual_id_ts index did not seem to change much, at least not that I noticed.
Running VACUUM manually did not change much; I am already running autovacuum, so this was no surprise.
Running REINDEX on the index shrank it considerably (by almost 50%!) but did not improve the speed by much.
A couple of small improvements
SELECT ts, sum(amp) AS total_amp, sum(factor) * 230 AS wh
FROM data_cbm_aggregation_15_min
WHERE virtual_id = 1818
AND ts >= '2015-02-01 00:00'
AND ts < '2015-04-01 00:00'
AND deleted IS NULL
GROUP BY ts
ORDER BY ts;
sum(230 * factor): it's cheaper to multiply the sum once instead of multiplying each element, so use sum(factor) * 230. The result is the same, even with NULL values.
ts between '2015-02-01 00:00:00' and '2015-03-31 23:59:59' is potentially incorrect. To include all of March 2015, use the alternative shown above. BETWEEN is translated to ts >= lower AND ts <= upper anyway; it is always slightly faster to spell it out.
virtual_id in (1818) is just a needlessly convoluted way to say virtual_id = 1818.
Better index, potentially bigger improvement
CREATE INDEX data_cbm_aggregation_15_min_special_idx
ON data_cbm_aggregation_15_min (virtual_id, ts, amp, factor)
WHERE deleted IS NULL;
I see nothing in your question that would suggest DESC in your original index. While Index Scan Backward is almost as fast as a plain Index Scan, it's still better to drop the modifier.
Most importantly, there are index-only scans since Postgres 9.2. The two index columns I appended (amp, factor) only make sense if you get index-only scans out of it.
Since you obviously are not interested in deleted rows, make it a partial index. Only pays if you have more than a few deleted rows in the table.
If you have other large parts of the table that can be excluded, add more conditions - and remember to repeat the condition in the query (even if it seems redundant) so Postgres understands that the index is applicable.
Table definition
Reordering table columns like this would save 8 bytes per row:
CREATE TABLE data_cbm_aggregation_15_min (
virtual_id integer NOT NULL,
recs smallint,
deleted boolean,
ts timestamp NOT NULL,
amp real,
min_amp real,
max_amp real,
factor real DEFAULT 0.25,
min_amp_ts timestamp,
max_amp_ts timestamp
);
Related:
Configuring PostgreSQL for read performance
Most important information for last
The first query call can be substantially more expensive for very big tables, since the whole table cannot be cached. Subsequent calls profit from the populated cache. Postgres caches blocks, not necessarily whole tables.
One more thing that can be important for the first call. Due to the MVCC model of Postgres it has to maintain visibility information. When reading pages of a table the first time since the last write operation, Postgres opportunistically updates visibility information, which can impose some extra cost for the first access (and help a lot for subsequent calls). More in the manual here. Related answer on dba.SE:
Why does a SELECT statement dirty cache buffers in Postgres?
About what you've tried so far
SET STATISTICS 1000 for ts and virtual_id was an excellent idea, but the effect was largely nullified by setting random_page_cost = 1, which basically forces an index scan for this query either way.
random_page_cost = 1 is telling Postgres that random access is just as cheap as sequential access. This makes sense for a DB that (almost) completely resides in cache. For a DB with huge tables like yours, this setting seems too extreme (even if it gets Postgres to favor the desired index scan). Set it to random_page_cost = 1.1 or probably higher.
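A hedged way to try a less extreme value before persisting it (session-level first, then postgresql.conf if the plans look good):
-- Try it for the current session only:
SET random_page_cost = 1.1;
EXPLAIN ANALYZE
SELECT ts, sum(amp) AS total_amp, sum(factor) * 230 AS wh
FROM   data_cbm_aggregation_15_min
WHERE  virtual_id = 1818
AND    ts >= '2015-02-01 00:00'
AND    ts <  '2015-04-01 00:00'
AND    deleted IS NULL
GROUP  BY ts
ORDER  BY ts;
-- If you still get the desired index scan and good timings,
-- set random_page_cost = 1.1 in postgresql.conf and reload.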
A bitmap index scan is typically a good plan for the first call of the query you presented - for data distributed randomly across the table. Since you clustered the table just like you need it for this query, an index scan is more efficient. The question is: will your table stay clustered?
Your settings for work_mem and other resources depend on how much RAM you have, the speed of your disks, on access pattern, how many concurrent connections you typically have, what other programs on the server compete for resources, etc. work_mem = 256MB seems too high. You don't need nearly as much for the presented query. Setting it that high may actually harm performance, because it reduces RAM available to cache.
REINDEX is redundant immediately after CLUSTER, since CLUSTER recreates all indexes anyway. You must have run REINDEX before CLUSTER, or you have heavy write access on the table to have accumulated so much bloat again already.
Various
Upgrade to Postgres 9.4 (or the upcoming 9.5, currently in alpha). Version 9.2 is 3 years old now; the latest version has received many improvements.
The query plan suggests that nothing is actually aggregated. rows=4,117 are read from the index and rows=4,117 remain after GroupAggregate. Looks like rows are unique on ts already? Then you can remove the aggregation completely and make it a simple SELECT ...
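If that is the case, a hedged sketch of the simplified query (valid only if ts really is unique per virtual_id in the selected range):
SELECT ts, amp AS total_amp, factor * 230 AS wh
FROM   data_cbm_aggregation_15_min
WHERE  virtual_id = 1818
AND    ts >= '2015-02-01 00:00'
AND    ts <  '2015-04-01 00:00'
AND    deleted IS NULL
ORDER  BY ts;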
If that's just a misleading EXPLAIN output and you typically output much fewer rows than are read, a MATERIALIZED VIEW with index on ts would be another option. Especially in combination with Postgres 9.4, which introduces REFRESH MATERIALIZED VIEW CONCURRENTLY.
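A rough sketch of that option (view and index names are illustrative; materialized views require Postgres 9.3, and REFRESH ... CONCURRENTLY requires 9.4 plus a unique index on the view):
CREATE MATERIALIZED VIEW data_cbm_agg_15_min_sums AS
SELECT virtual_id, ts, sum(amp) AS total_amp, sum(factor) * 230 AS wh
FROM   data_cbm_aggregation_15_min
WHERE  deleted IS NULL
GROUP  BY virtual_id, ts;

CREATE UNIQUE INDEX data_cbm_agg_15_min_sums_idx
ON data_cbm_agg_15_min_sums (virtual_id, ts);

-- Refresh without blocking concurrent readers (Postgres 9.4+):
REFRESH MATERIALIZED VIEW CONCURRENTLY data_cbm_agg_15_min_sums;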

Optimization of count query for PostgreSQL

I have a table in PostgreSQL that contains an array column which is updated constantly.
In my application I need to get the number of rows for which a specific parameter is not present in that array column. My query looks like this:
select count(id)
from table
where not (ARRAY['parameter value'] <@ table.array_column)
But when increasing the number of rows and the number of executions of that query (several times per second, possibly hundreds or thousands), the performance decreases a lot; it seems to me that counting in PostgreSQL might scale linearly with the number of rows (I'm not completely sure of this).
Basically my question is:
Is there an existing pattern I’m not aware of that applies to this situation? what would be the best approach for this?
Any suggestion you could give me would be really appreciated.
PostgreSQL actually supports GIN indexes on array columns. Unfortunately, it doesn't seem to be usable for NOT (ARRAY[...] <@ indexed_col), and GIN indexes are unsuitable for frequently-updated tables anyway.
Demo:
CREATE TABLE arrtable (id integer primary key, array_column integer[]);
INSERT INTO arrtable (id, array_column) VALUES (1, ARRAY[1,2,3,4]);
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
-- Use the following *only* for testing whether Pg can use an index
-- Do not use it in production.
SET enable_seqscan = off;
explain (buffers, analyze) select count(id)
from arrtable
where not (ARRAY[1] <@ arrtable.array_column);
Unfortunately, this shows that as written we can't use the index. If you don't negate the condition it can be used, so you can search for and count rows that do contain the search element (by removing NOT).
You could use the index to count entries that do contain the target value, then subtract that result from a count of all entries. Since counting all rows in a table is quite slow in PostgreSQL (9.1 and older) and requires a sequential scan this will actually be slower than your current query. It's possible that on 9.2 an index-only scan can be used to count the rows if you have a b-tree index on id, in which case this might actually be OK:
SELECT (
SELECT count(id) FROM arrtable
) - (
SELECT count(id) FROM arrtable
WHERE (ARRAY[1] <@ arrtable.array_column)
);
It's guaranteed to perform worse than your original version for Pg 9.1 and below, because in addition to the seqscan your original requires, it also needs a GIN index scan. I've now tested this on 9.2 and it does appear to use an index for the count, so it's worth exploring for 9.2. With some less trivial dummy data:
drop index arrtable_arraycolumn_gin_arr_idx ;
truncate table arrtable;
insert into arrtable (id, array_column)
select s, ARRAY[1,2,s,s*2,s*3,s/2,s/4] FROM generate_series(1,1000000) s;
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
Note that a GIN index like this will slow updates down a LOT, and is quite slow to create in the first place. It is not suitable for tables that get updated much at all - like your table.
Worse, the query using this index takes up to twice as long as your original query, and at best about half as long, on the same data set. It's worst where the index is not very selective, like ARRAY[1]: 4s vs 2s for the original query. Where the index is highly selective (i.e. not many matches, like ARRAY[199]) it runs in about 1.2 seconds vs the original's 3s. This index simply isn't worth having for this query.
The lesson here? Sometimes, the right answer is just to do a sequential scan.
Since that won't do for your hit rates, either maintain a materialized view (a summary table kept up to date with a trigger) as @debenhur suggests, as sketched below, or try to invert the array to store the list of parameters the entry does not have, so you can use a GiST index as @maniek suggests.
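A minimal sketch of the trigger-maintained counter idea, assuming only one parameter value (1 here) ever needs to be counted; the table, function and trigger names are illustrative:
-- One-row summary table, initialised from the current data.
CREATE TABLE missing_param_1_count (cnt bigint NOT NULL);
INSERT INTO missing_param_1_count
SELECT count(*) FROM arrtable WHERE NOT (ARRAY[1] <@ array_column);

CREATE FUNCTION maintain_missing_param_1_count() RETURNS trigger AS $$
BEGIN
    IF TG_OP = 'INSERT' THEN
        IF NOT (ARRAY[1] <@ NEW.array_column) THEN
            UPDATE missing_param_1_count SET cnt = cnt + 1;
        END IF;
    ELSIF TG_OP = 'DELETE' THEN
        IF NOT (ARRAY[1] <@ OLD.array_column) THEN
            UPDATE missing_param_1_count SET cnt = cnt - 1;
        END IF;
    ELSE  -- UPDATE: remove the old state, add the new one
        IF NOT (ARRAY[1] <@ OLD.array_column) THEN
            UPDATE missing_param_1_count SET cnt = cnt - 1;
        END IF;
        IF NOT (ARRAY[1] <@ NEW.array_column) THEN
            UPDATE missing_param_1_count SET cnt = cnt + 1;
        END IF;
    END IF;
    RETURN NULL;  -- return value is ignored for AFTER triggers
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER arrtable_missing_param_1_count
AFTER INSERT OR UPDATE OR DELETE ON arrtable
FOR EACH ROW EXECUTE PROCEDURE maintain_missing_param_1_count();

-- The count is now a single-row read:
SELECT cnt FROM missing_param_1_count;
Be aware that a single-row counter serializes concurrent writers on that row, so this only pays off when the count is read far more often than the base table changes.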
Is there an existing pattern I'm not aware of that applies to this situation? What would be the best approach for this?
Your best bet in this situation might be to normalize your schema. Split the array out into a table. Add a b-tree index on the table of properties, or order the primary key so it's efficiently searchable by property.
CREATE TABLE demo( id integer primary key );
INSERT INTO demo (id) SELECT id FROM arrtable;
CREATE TABLE properties (
demo_id integer not null references demo(id),
property integer not null,
primary key (demo_id, property)
);
CREATE INDEX properties_property_idx ON properties(property);
You can then query the properties:
SELECT count(id)
FROM demo
WHERE NOT EXISTS (
SELECT 1 FROM properties WHERE demo.id = properties.demo_id AND property = 1
)
I expected this to be a lot faster than the original query, but it's actually much the same with the same sample data; it runs in the same 2s to 3s range as your original query. It's the same issue where searching for what is not there is much slower than searching for what is there; if we're looking for rows containing a property we can avoid the seqscan of demo and just scan properties for matching IDs directly.
Again, a seq scan on the array-containing table does the job just as well.
I think with your current data model you are out of luck. Try to think of an algorithm that the database has to execute for your query. There is no way it could work without sequentially scanning the data.
Can you arrange the column so that it stores the inverse of the data (so that the query would be select count(id) from table where ARRAY['parameter value'] <@ table.array_column)? That query could use a GIN/GiST index.
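A hedged sketch of that inversion on the demo arrtable from the other answer, assuming the full set of possible parameter values is known and reasonably small; missing_params and the index name are illustrative:
-- Store the complement: the parameters each row does NOT have
-- (kept up to date by the application or a trigger).
ALTER TABLE arrtable ADD COLUMN missing_params integer[];

CREATE INDEX arrtable_missing_params_gin_idx
ON arrtable USING GIN (missing_params);

-- "Rows where parameter 1 is absent" becomes a positive containment test,
-- which the GIN index can support:
SELECT count(*) FROM arrtable WHERE ARRAY[1] <@ missing_params;
Keep in mind the earlier caveat that GIN indexes are expensive to maintain on heavily updated tables, so measure whether the faster count outweighs the slower writes.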

How does this PostgreSQL query slow down when the number of rows increases?

I have a table briefly structured like this:
tn( id integer NOT NULL primary key DEFAULT nextval('tn_sequence'),
create_dt TIMESTAMP NOT NULL DEFAULT NOW(),
...............
deleted boolean );
create_dt is the timestamp when the row is inserted into the database.
deleted indicates that the row is no longer useful.
And I have the following queries:
select * from tn where create_dt > ( NOW() - interval '150 seconds' ) and deleted = FALSE;
select * from tn where create_dt < ( NOW() - interval '150 seconds' ) and deleted = FALSE;
My question is: how will these queries slow down as the number of rows increases? For instance, when the number of rows exceeds 10K, 20K, or 100K, will that have a big impact on speed? Is there any way I can optimize these queries? Note that every 5 seconds I set the 'deleted' column to TRUE for rows that are older than 150 seconds.
The effect of table growth on performance will depend on the query plan chosen, available indexes, the selectivity of the query, and lots of other factors. EXPLAIN ANALYZE on the query might help. In short, if your query only selects a few rows and can use a simple b-tree index then it won't usually slow down tons, only a little as the index grows. On the other hand queries using complex non-indexed conditions or returning lots of rows could perform very badly indeed.
Your issue appears to mirror that in the question How should we handle rows which won't be queried once they are old in PostgreSQL?
The advice given there should apply:
Use a partial index with the condition WHERE (not deleted); or
partition on 'deleted' with constraint exclusion enabled.
For example, you might:
CREATE INDEX create_dt_when_not_deleted_idx
ON tn (create_dt)
WHERE (NOT deleted);
This includes only rows where deleted = 'f' (assuming deleted is not null) in the index. This isn't the same as having them gone from the table completely:
Nothing changes with full table sequential scans, the deleted='t' rows must still be scanned; and
There's more I/O than if the deleted = 't' rows weren't there because any given heap page is likely to contain a mix of deleted = 't' and deleted = 'f' rows.
You can reduce the impact of the latter by CLUSTERing on an index that includes deleted. Again, this will have no effect on sequential scans. To help with sequential scans you would have to partition the table on deleted.
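For example (the index name is illustrative; CLUSTER takes an exclusive lock and is a one-off physical reorder, so it has to be repeated as the table churns):
CREATE INDEX tn_deleted_create_dt_idx ON tn (deleted, create_dt);
CLUSTER tn USING tn_deleted_create_dt_idx;
ANALYZE tn;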
Pg 9.2's index only scans should (I think, haven't tested) use the partial index. When an index only scan is possible the partial index should be as fast as an index on a table containing only the deleted = 'f' rows.
Note that you'll need to keep table and index bloat under control. Ensure autovacuum runs very frequently and use a current version of PostgreSQL that doesn't need things like a manually-managed free space map and has the latest, best-behaved autovacuum. I'd recommend 9.0 or above, preferably 9.1 or 9.2. Tune autovacuum to run aggressively.
When tuning and testing performance - test your queries with EXPLAIN ANALYZE, don't just guess.