PostgreSql not using index - postgresql

I have a table named snapshots with a column named data of type jsonb.
An index is created on the snapshots table:
create index on snapshots ((data->>'creator'));
The following query used the index initially, but stopped using it after a couple of days:
SELECT id, data - 'sections' - 'sharing' AS data FROM snapshots WHERE data->>'creator' = 'abc#email.com' ORDER BY xmin::text::bigint DESC
Below is the output of EXPLAIN ANALYZE:
Sort (cost=19.10..19.19 rows=35 width=77) (actual time=292.159..292.163 rows=35 loops=1)
Sort Key: (((xmin)::text)::bigint) DESC
Sort Method: quicksort Memory: 30kB
-> Seq Scan on snapshots (cost=0.00..18.20 rows=35 width=77) (actual time=3.500..292.104 rows=35 loops=1)
Filter: ((data ->> 'creator'::text) = 'abc#email.com'::text)
Rows Removed by Filter: 152
Planning Time: 0.151 ms
Execution Time: 292.198 ms

A table with 187 rows is very small. For very small tables, a sequential scan is the most efficient strategy.
What is surprising here is the long duration of the query execution (292 milliseconds!). Unless you have incredibly lame or overloaded storage, this must mean that the table is extremely bloated – it is comparatively large, but almost all pages are empty, with only 187 live rows. You should rewrite the table to compact it:
VACUUM (FULL) snapshots;
Then the query will become much faster.
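One way to test the bloat hypothesis before and after the rewrite is to compare the table's on-disk size with its row counts. This is only a sketch: it uses the standard size functions and the pg_stat_user_tables view, the table name snapshots is the only thing taken from the question, and what the numbers will show is not known here.

```sql
-- Size of the main table file alone, and of the table including TOAST and indexes.
SELECT pg_size_pretty(pg_relation_size('snapshots'))       AS heap_size,
       pg_size_pretty(pg_total_relation_size('snapshots')) AS total_size;

-- Live versus dead row estimates, plus the last time the table was vacuumed.
SELECT n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'snapshots';

-- Rewrite the table compactly. This takes an ACCESS EXCLUSIVE lock,
-- so reads and writes on snapshots block until it finishes.
VACUUM (FULL, ANALYZE) snapshots;
```

If the reported sizes are already small for 187 rows, bloat is probably not the explanation and the time is going somewhere else.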

Related

PostgreSQL slow query when querying large table by PK (int)

We have a large table (265M rows), which has about 50 columns in total, mostly either integers, dates, or varchars. The table has a primary key defined on an autoincremented column.
A query loads a temp table with PK values (say, 10000-20000 rows), and then the large table is queried by joining it to the temp table.
Average size of the row in the large table is fairly consistent and is around 280 bytes.
Here is the query plan it produces, when running:
Nested Loop (cost=0.57..115712.21 rows=13515 width=372) (actual time=0.016..54797.581 rows=11838 loops=1)
Buffers: shared hit=49960 read=9261, local hit=53
-> Seq Scan on t_ids ti (cost=0.00..188.15 rows=13515 width=4) (actual time=0.006..6.993 rows=11838 loops=1)
Buffers: local hit=53
-> Index Scan using test_pk on test t (cost=0.57..8.55 rows=1 width=368) (actual time=4.624..4.624 rows=1 loops=11838)
Index Cond: (pk = ti.pk)
Buffers: shared hit=49960 read=9261
Planning Time: 0.128 ms
Execution Time: 54801.600 ms
... where test is the large table (actually, clustered by pk) and t_ids is the temporary table.
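The query itself is not shown in the question. Judging only from the plan (aliases t and ti, join on pk), it presumably has roughly the following shape. The temp table definition and the column list are assumptions, not the author's actual statements.

```sql
-- Assumed setup: a temp table holding the keys to look up.
CREATE TEMP TABLE t_ids (pk integer);
-- ... load the 10000-20000 key values into t_ids here ...

-- Autovacuum does not process temp tables, so collect statistics by hand.
ANALYZE t_ids;

-- Join the large table to the temp table and capture timing plus buffer usage.
EXPLAIN (ANALYZE, BUFFERS)
SELECT t.*
FROM t_ids ti
JOIN test t ON t.pk = ti.pk;
```

In the plan above, read=9261 counts the pages that were not found in shared buffers. Spread over the roughly 54.8 seconds of runtime, that is about 6 ms per page read, which suggests that most of the time goes to random I/O rather than to the join itself.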
It seems to be doing the right thing - scanning temp table and hitting the large table on the pk index, 11k times... But it is sooo slow....
Any suggestions on what can be tried to make it run faster are greatly appreciated!

postgresql improve the query scan/filtering

I have the following table for attributes of different objects
create table attributes(id serial primary key,
object_id int,
attribute_id text,
text_data text,
int_data int,
timestamp_data timestamp,
state text default 'active');
An object can have different types of attributes, and the attribute value is stored in one of text_data, int_data, or timestamp_data, depending on the attribute's data type.
sample data is here
I want to retrieve the records, my query is
select * from attributes
where attribute_id = 55 and state='active'
order by text_data
which is very slow.
I increased work_mem to 1 GB for the current session using the SET command
SET work_mem TO '1 GB';
to change the sort method from external merge (disk) to quicksort, but query execution did not improve. The executed query plan is:
Gather Merge (cost=750930.58..1047136.19 rows=2538728 width=128) (actual time=18272.405..27347.556 rows=3462116 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=235635 read=201793
-> Sort (cost=749930.56..753103.97 rows=1269364 width=128) (actual time=14299.222..15494.774 rows=1154039 loops=3)
Sort Key: text_data
Sort Method: quicksort Memory: 184527kB
Worker 0: Sort Method: quicksort Memory: 266849kB
Worker 1: Sort Method: quicksort Memory: 217050kB
Buffers: shared hit=235635 read=201793
-> Parallel Seq Scan on attributes (cost=0.00..621244.50 rows=1269364 width=128) (actual time=0.083..3410.570 rows=1154039 loops=3)
Filter: ((attribute_id = 185) AND (state = 'active'))
Rows Removed by Filter: 8652494
Buffers: shared hit=235579 read=201793
Planning Time: 0.453 ms
Execution Time: 29135.285 ms
The total query runtime is 45 seconds:
Successfully run. Total query runtime: 45 secs 237 msec.
3462116 rows affected.
To improve filtering and query execution time, I created an index on attribute_id and state:
create index attribute_id_state on attributes(attribute_id,state);
Sort (cost=875797.49..883413.68 rows=3046474 width=128) (actual time=47189.534..49035.361 rows=3462116 loops=1)
Sort Key: text_data
Sort Method: quicksort Memory: 643849kB
Buffers: shared read=406048
-> Bitmap Heap Scan on attributes (cost=64642.80..547711.91 rows=3046474 width=128) (actual time=981.857..10348.441 rows=3462116 loops=1)
Recheck Cond: ((attribute_id = 185) AND (state = 'active'))
Heap Blocks: exact=396586
Buffers: shared read=406048
-> Bitmap Index Scan on attribute_id_state (cost=0.00..63881.18 rows=3046474 width=0) (actual time=751.909..751.909 rows=3462116 loops=1)
Index Cond: ((attribute_id = 185) AND (state = 'active'))
Buffers: shared read=9462
Planning Time: 0.358 ms
Execution Time: 50388.619 ms
But the query became even slower after creating the index.
The table has 29.5 million rows. text_data is null in 9 million rows.
The query returns about 3.5 million records, which is over 10% of the table.
Is there any other index, or another approach such as changing a parameter, that would improve the query?
Some suggestions:
ORDER BY clauses can be accelerated by indexes. So if you put your ordering column in your compound index you may get things to go much faster.
CREATE INDEX attribute_id_state_data
ON attributes(attribute_id, state, text_data);
This index is redundant with the one in your question, so drop that one when you create this one.
You use SELECT *, a notorious performance and maintainability antipattern. You're much better off naming the columns you want. This is especially important when your result sets are large: why waste CPU and network resources on data you don't need in your application? So, let's assume you want to do this. If you don't need all those columns, remove some of them from this SELECT.
SELECT id, object_id, attribute_id, text_data, int_data,
timestamp_data, state ...
You can use the INCLUDE clause on your index (available in PostgreSQL 11 and later) so it covers your query, that is, so the query can be satisfied entirely from the index. There is no need to repeat key columns such as state in INCLUDE.
CREATE INDEX attribute_id_state_data
ON attributes(attribute_id, state, text_data)
INCLUDE (id, object_id, int_data, timestamp_data);
When you use this BTREE index, your query is satisfied by random-accessing the index to the first eligible row and then scanning the index sequentially. As long as the table has been vacuumed recently, so that its visibility map is current, there's no need for PostgreSQL to bounce back to the table's data. It doesn't get much faster than that for a big result set.
If you remove some columns from your SELECT clause, you can also remove them from the index's INCLUDE clause.
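You ORDER BY a TEXT column. If the values are long, that's a lot of data to sort in each record, whether during index creation or a query, and values larger than roughly 2 kB are stored out-of-line (TOAST), which is slower still. Note that PostgreSQL stores TEXT and VARCHAR the same way, so switching the type alone changes nothing; what helps is keeping the sorted values short. Can you rework your application so that this column holds short values?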
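Put together, the index suggestions above might look like the following sketch. It assumes PostgreSQL 11 or later for INCLUDE, and it writes the attribute_id literal as text ('185', the value seen in the plans) because the column is declared as text. Whether the planner actually picks an index-only scan depends on the data and on how recently the table was vacuumed.

```sql
-- Replace the two-column index from the question with a covering index
-- that also provides the ORDER BY text_data ordering.
DROP INDEX IF EXISTS attribute_id_state;

CREATE INDEX attribute_id_state_data
    ON attributes (attribute_id, state, text_data)
    INCLUDE (id, object_id, int_data, timestamp_data);

-- Refresh statistics and the visibility map so an index-only scan is possible.
VACUUM (ANALYZE) attributes;

-- Name the columns explicitly and check which plan is chosen.
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, object_id, attribute_id, text_data, int_data, timestamp_data, state
FROM attributes
WHERE attribute_id = '185'
  AND state = 'active'
ORDER BY text_data;
```

A plan that shows "Index Only Scan using attribute_id_state_data" with no separate Sort node would indicate that both the filter and the ordering are served by the index.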

Postgres slow when selecting many rows

I'm running Postgres 11.
I have a table with 1.000.000 (1 million) rows and each row has a size of 40 bytes (it contains 5 columns). That is equal to 40MB.
When I execute (directly on the DB via DBeaver, DataGrid etc., not called via Node, Python etc.):
SELECT * FROM TABLE
it takes 40 seconds the first time (isn't this very slow, even for the first time?).
The CREATE statement of my tables:
CREATE TABLE public.my_table_1 (
c1 int8 NOT NULL GENERATED ALWAYS AS IDENTITY,
c2 int8 NOT NULL,
c3 timestamptz NULL,
c4 float8 NOT NULL,
c5 float8 NOT NULL,
CONSTRAINT my_table_1_pkey PRIMARY KEY (c1)
);
CREATE INDEX my_table_1_c3_idx ON public.my_table_1 USING btree (c3);
CREATE UNIQUE INDEX my_table_1_c2_idx ON public.my_table_1 USING btree (c2);
On 5 random tables: EXPLAIN (ANALYZE, BUFFERS) select * from [table_1...2,3,4,5]
Seq Scan on table_1 (cost=0.00..666.06 rows=34406 width=41) (actual time=0.125..7.698 rows=34406 loops=1)
Buffers: shared read=322
Planning Time: 15.521 ms
Execution Time: 10.139 ms
Seq Scan on table_2 (cost=0.00..9734.87 rows=503187 width=41) (actual time=0.103..57.698 rows=503187 loops=1)
Buffers: shared read=4703
Planning Time: 14.265 ms
Execution Time: 74.240 ms
Seq Scan on table_3 (cost=0.00..3486217.40 rows=180205440 width=41) (actual time=0.022..14988.078 rows=180205379 loops=1)
Buffers: shared hit=7899 read=1676264
Planning Time: 0.413 ms
Execution Time: 20781.303 ms
Seq Scan on table_4 (cost=0.00..140219.73 rows=7248073 width=41) (actual time=13.638..978.125 rows=7247991 loops=1)
Buffers: shared hit=7394 read=60345
Planning Time: 0.246 ms
Execution Time: 1264.766 ms
Seq Scan on table_5 (cost=0.00..348132.60 rows=17995260 width=41) (actual time=13.648..2138.741 rows=17995174 loops=1)
Buffers: shared hit=82 read=168098
Planning Time: 0.339 ms
Execution Time: 2730.355 ms
When I add a LIMIT 1.000.000 to the query on table_5 (which contains about 18 million rows according to the plan above):
Limit (cost=0.00..19345.79 rows=1000000 width=41) (actual time=0.007..131.939 rows=1000000 loops=1)
Buffers: shared hit=9346
-> Seq Scan on table_5(cost=0.00..348132.60 rows=17995260 width=41) (actual time=0.006..68.635 rows=1000000 loops=1)
Buffers: shared hit=9346
Planning Time: 0.048 ms
Execution Time: 164.133 ms
When I add a WHERE clause between 2 dates (I monitored the query below with DataDog; the results, max. ~31K rows/sec when fetching, are here: https://www.screencast.com/t/yV0k4ShrUwSd):
Seq Scan on table_5 (cost=0.00..438108.90 rows=17862027 width=41) (actual time=0.026..2070.047 rows=17866766 loops=1)
Filter: (('2018-01-01 00:00:00+04'::timestamp with time zone < matchdate) AND (matchdate < '2020-01-01 00:00:00+04'::timestamp with time zone))
Rows Removed by Filter: 128408
Buffers: shared hit=168180
Planning Time: 14.820 ms
Execution Time: 2673.171 ms
All tables have a unique index on the c3 column.
The size of the database is about 500GB in total.
The server has 16 cores and 112GB M2 memory.
I have tried to optimize Postgres system variables, such as work_mem (1GB), shared_buffers (50GB), and effective_cache_size (20GB), but it doesn't seem to change anything (I know the settings have been applied, because I can see a big difference in the amount of idle memory the server has allocated).
I know the database is too big for all data to be in memory. But is there anything I can do to boost the performance / speed of my query?
Make sure CreatedDate is indexed.
Make sure CreatedDate is using the date column type. This will be more efficient on storage (just 4 bytes), performance, and you can use all the built in date formatting and functions.
Avoid select * and only select the columns you need.
Use YYYY-MM-DD ISO 8601 format. This has nothing to do with performance, but it will avoid a lot of ambiguity.
The real problem is likely that you have thousands of tables with which you regularly make unions of hundreds of tables. This indicates a need to redesign your schema to simplify your queries and get better performance.
Unions and date change checks suggest a lot of redundancy. Perhaps you've partitioned your tables by date. Postgres has its own built in table partitioning which might help.
Without more detail that's all I can say. Perhaps ask another question about your schema.
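The advice above on a date column type, ISO 8601 literals, and built-in partitioning can be sketched together. Every name below (events, created_date, payload, the partition names) is hypothetical, since the schema is not shown. Declarative partitioning assumes PostgreSQL 10 or later, and creating an index on the partitioned parent assumes PostgreSQL 11 or later.

```sql
-- One partitioned table instead of many per-period tables joined with UNION.
CREATE TABLE events (
    id           bigint NOT NULL,
    created_date date   NOT NULL,   -- 4-byte date type, not text
    payload      text
) PARTITION BY RANGE (created_date);

CREATE TABLE events_2019 PARTITION OF events
    FOR VALUES FROM ('2019-01-01') TO ('2020-01-01');
CREATE TABLE events_2020 PARTITION OF events
    FOR VALUES FROM ('2020-01-01') TO ('2021-01-01');

-- On PostgreSQL 11+ this creates a matching index on every partition.
CREATE INDEX ON events (created_date);

-- Select only the needed columns, with unambiguous ISO 8601 date literals.
SELECT id, created_date
FROM events
WHERE created_date >= '2019-06-01'
  AND created_date <  '2019-07-01';
```

With a range condition on the partition key, the planner can skip partitions that cannot contain matching rows.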
Without seeing EXPLAIN (ANALYZE, BUFFERS), all we can do is speculate.
But we can do some pretty good speculation.
Cluster the tables on the index on CreatedDate. This will allow the data to be accessed more sequentially, allowing more read-ahead (but this might not help much for some kinds of storage). If the tables have high write load, they may not stay clustered, so you would have to recluster them occasionally. If they are static, this could be a one-time event.
Get more RAM. If you want to perform as if all the data was in memory, then get all the data into memory.
Get faster storage, like top-notch SSD. It isn't as fast as RAM, but much faster than HDD.
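The clustering suggestion could look like the sketch below. The table, column, and index names are hypothetical, because the real ones are not given.

```sql
-- An index on the date column is needed before the table can be clustered on it.
CREATE INDEX my_table_created_date_idx ON my_table (created_date);

-- Physically rewrite the table in index order. CLUSTER holds an
-- ACCESS EXCLUSIVE lock for the duration, so schedule it accordingly.
CLUSTER my_table USING my_table_created_date_idx;

-- The rewrite changes the physical layout, so refresh planner statistics.
ANALYZE my_table;

-- Later re-clustering can reuse the index remembered from the first run.
CLUSTER my_table;
```

PostgreSQL does not maintain the ordering for rows written afterwards, which is why tables with a high write load need the last statement repeated from time to time.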

Postgresql. Optimize retrieving distinct values from large table

I have one de-normalized table with 40+ columns (~ 1.5 million rows, 1 Gb).
CREATE TABLE tbl1 (
...
division_id integer,
division_name varchar(10),
...
);
I need to speed up this query:
SELECT DISTINCT division_name, division_id
FROM tbl1
ORDER BY division_name;
The query returns only ~250 rows, but it is very slow because of the size of the table.
I have tried to create index:
create index idx1 on tbl1 (division_name, division_id)
But current execution plan:
explain analyze SELECT Distinct division_name, division_id FROM tbl1 ORDER BY 1;
QUERY PLAN
-----------------------------------------------------------------
Sort (cost=143135.77..143197.64 rows=24748 width=74) (actual time=1925.697..1925.723 rows=294 loops=1)
Sort Key: division_name
Sort Method: quicksort Memory: 74kB
-> HashAggregate (cost=141082.30..141329.78 rows=24748 width=74) (actual time=1923.853..1923.974 rows=294 loops=1)
Group Key: division_name, division_id
-> Seq Scan on tbl1 (cost=0.00..132866.20 rows=1643220 width=74) (actual time=0.069..703.008 rows=1643220 loops=1)
Planning time: 0.311 ms
Execution time: 1925.883 ms
Any suggestions as to why the index is not used, or how I can speed up the query some other way?
Server: PostgreSQL 9.6.
P.S. Yes, the table has 40+ columns and is de-normalized, but I know all the pros and cons of this decision.
Update1
@a_horse_with_no_name suggested using vacuum analyze instead of analyze to update the table statistics. Now the query plan is:
QUERY PLAN
------------------------
Unique (cost=0.55..115753.43 rows=25208 width=74) (actual time=0.165..921.426 rows=294 loops=1)
-> Index Only Scan using idx1 on tbl1 (cost=0.55..107538.21 rows=1643044 width=74) (actual time=0.162..593.322 rows=1643220 loops=1)
Heap Fetches: 0
Much better!
The index will probably only help if PostgreSQL chooses an “index only scan”, that means that it does not have to look at the table data at all.
Normally PostgreSQL has to check the table data (“heap”) to see if a row is visible for the current transaction, because visibility information is not stored in the index.
If, however, the table does not change much and has recently been VACUUMed, PostgreSQL knows that most of the pages consist only of items visible for everyone (there is a “visibility map” to keep track of that information), and then it might be cheaper to scan the index.
Try running VACUUM on the table and see if that causes an index only scan to be used.
Other than that, there is no way to speed up such a query.
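The VACUUM suggestion can be tried as in the following sketch, using the table and query from the question. The pg_class lookup is an extra check that is not part of the original answer.

```sql
-- Update the visibility map and the planner statistics in one pass.
VACUUM (ANALYZE) tbl1;

-- Pages marked all-visible versus total pages. The closer the two numbers,
-- the cheaper an index-only scan looks to the planner.
SELECT relpages, relallvisible
FROM pg_class
WHERE relname = 'tbl1';

-- Re-check the plan. "Index Only Scan using idx1" with a low
-- "Heap Fetches" count means the table data was not visited.
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT division_name, division_id
FROM tbl1
ORDER BY division_name;
```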

Postgres JSONB timestamp query very slow compared to timestamp column query

I've got a Postgres 9.4.4 database with 1.7 million records with the following information stored in a JSONB column called data in a table called accounts:
data: {
"lastUpdated": "2016-12-26T12:09:43.901Z",
"lastUpdatedTimestamp": "1482754183"
}
The actual JSONB column stores much more information, but I've omitted the irrelevant data. The data format cannot be changed since this is legacy information.
I'm trying to efficiently obtain a count of all records where the lastUpdated value is greater or equal to some reference time (I'll use 2015-12-01T10:10:10Z in the following examples):
explain analyze SELECT count(*) FROM "accounts"
WHERE data->>'lastUpdated' >= '2015-12-01T10:10:10Z';
This takes over 22 seconds:
Aggregate (cost=843795.05..843795.06 rows=1 width=0) (actual time=22292.584..22292.584 rows=1 loops=1)
-> Seq Scan on accounts (cost=0.00..842317.05 rows=591201 width=0)
(actual time=1.410..22142.046 rows=1773603 loops=1)
Filter: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Planning time: 1.234 ms
Execution time: 22292.671 ms
I've tried adding the following text index:
CREATE INDEX accounts_last_updated ON accounts ((data->>'lastUpdated'));
But the query is still rather slow, at over 17 seconds:
Aggregate (cost=815548.64..815548.65 rows=1 width=0) (actual time=17172.844..17172.845 rows=1 loops=1)
-> Bitmap Heap Scan on accounts (cost=18942.24..814070.64 rows=591201 width=0)
(actual time=1605.454..17036.081 rows=1773603 loops=1)
Recheck Cond: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Heap Blocks: exact=28955 lossy=397518
-> Bitmap Index Scan on accounts_last_updated (cost=0.00..18794.44 rows=591201 width=0)
(actual time=1596.645..1596.645 rows=1773603 loops=1)
Index Cond: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Planning time: 1.373 ms
Execution time: 17172.974 ms
I've also tried following the directions in Create timestamp index from JSON on PostgreSQL and have tried creating the following function and index:
CREATE OR REPLACE FUNCTION text_to_timestamp(text)
RETURNS timestamp AS
$$SELECT to_timestamp($1, 'YYYY-MM-DD HH24:MI:SS.MS')::timestamp; $$
LANGUAGE sql IMMUTABLE;
CREATE INDEX accounts_last_updated ON accounts
(text_to_timestamp(data->>'lastUpdated'));
But this doesn't give me any improvement, in fact it was slower, taking over 24 seconds for the query, versus 22 seconds for the unindexed version:
explain analyze SELECT count(*) FROM "accounts"
WHERE text_to_timestamp(data->>'lastUpdated') >= '2015-12-01T10:10:10Z';
Aggregate (cost=1287195.80..1287195.81 rows=1 width=0) (actual time=24143.150..24143.150 rows=1 loops=1)
-> Seq Scan on accounts (cost=0.00..1285717.79 rows=591201 width=0)
(actual time=4.044..23971.723 rows=1773603 loops=1)
Filter: (text_to_timestamp((data ->> 'lastUpdated'::text)) >= '2015-12-01 10:10:10'::timestamp without time zone)
Planning time: 1.107 ms
Execution time: 24143.183 ms
In one last act of desperation, I decided to add another timestamp column and update it to contain the same values as data->>'lastUpdated':
alter table accounts add column updated_at timestamp;
update accounts set updated_at = text_to_timestamp(data->>'lastUpdated');
create index accounts_updated_at on accounts(updated_at);
This has given me by far the best performance:
explain analyze SELECT count(*) FROM "accounts" where updated_at >= '2015-12-01T10:10:10Z';
Aggregate (cost=54936.49..54936.50 rows=1 width=0) (actual time=676.955..676.955 rows=1 loops=1)
-> Index Only Scan using accounts_updated_at on accounts
(cost=0.43..50502.48 rows=1773603 width=0) (actual time=0.026..552.442 rows=1773603 loops=1)
Index Cond: (updated_at >= '2015-12-01 10:10:10'::timestamp without time zone)
Heap Fetches: 0
Planning time: 4.643 ms
Execution time: 678.962 ms
However, I'd very much like to avoid adding another column just to improve the speed of this one query.
This leaves me with the following question: is there any way to improve the performance of my JSONB query so it can be as efficient as the individual column query (the last query where I used updated_at instead of data->>'lastUpdated')? As it stands, it takes from 17 seconds to 24 seconds for me to query the JSONB data using data->>'lastUpdated', while it takes only 678 ms to query the updated_at column. It doesn't make sense that the JSONB query would be so much slower. I was hoping that by using the text_to_timestamp function that it would improve the performance, but it hasn't been the case (or I'm doing something wrong).
In your first and second attempts, most of the execution time is spent on the index recheck or on filtering, which must read the JSON field of every row the index hits, and reading JSON is expensive. If the index hits a couple of hundred rows, the query will be fast, but if it hits thousands or hundreds of thousands of rows, filtering or rechecking the JSON field will take serious time. In the attempt with text_to_timestamp, the additional function call makes it even worse.
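The recheck cost is visible in the bitmap plan from the question as "Heap Blocks: exact=28955 lossy=397518". Lossy blocks mean the bitmap did not fit in work_mem, so every row on those pages had to be rechecked against the JSON expression. As a hedged experiment, and assuming the session can afford the memory, the query can be rerun with a larger budget. The 256MB value is arbitrary and does not come from the original post.

```sql
-- Raise the per-operation memory budget for this session only.
SET work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*) FROM accounts
WHERE data->>'lastUpdated' >= '2015-12-01T10:10:10Z';

-- Go back to the server default afterwards.
RESET work_mem;
```

If the lossy count drops to zero, the per-row recheck goes away, although the heap pages still have to be read, so this may narrow the gap without closing it.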
A JSON field is good for storing data, but it is not intended for analytic queries such as summaries and statistics, and it is bad practice to use JSON objects in WHERE conditions, at least as the main filtering condition as in your case.
That last act of desperation of yours is the right way to go :)
To improve query performance, you must add one or several columns holding the key values that are used most often in WHERE conditions.
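Since the legacy JSON format cannot change, an extracted column has to be kept in sync with it. One possible way, sketched here under the assumption that the updated_at column and the text_to_timestamp function from the question already exist, is a row-level trigger. The trigger and function names are hypothetical.

```sql
-- Copy the JSON value into the plain timestamp column on every write.
CREATE OR REPLACE FUNCTION accounts_sync_updated_at()
RETURNS trigger AS
$$
BEGIN
    NEW.updated_at := text_to_timestamp(NEW.data->>'lastUpdated');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_sync_updated_at
BEFORE INSERT OR UPDATE ON accounts
FOR EACH ROW
EXECUTE PROCEDURE accounts_sync_updated_at();

-- Queries then filter on the indexed column instead of the JSON expression.
SELECT count(*) FROM accounts
WHERE updated_at >= '2015-12-01T10:10:10Z';
```

The trigger only covers rows written after it is created, so the one-time UPDATE shown in the question is still needed for existing rows.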