PostgreSQL: improve the query scan/filtering

I have the following table for attributes of different objects
create table attributes(id serial primary key,
object_id int,
attribute_id text,
text_data text,
int_data int,
timestamp_data timestamp,
state text default 'active');
An object will have different types of attributes, and the attribute value is stored in one of the columns text_data, int_data, or timestamp_data, depending on the attribute's data type.
Sample data is here.
I want to retrieve the records; my query is
select * from attributes
where attribute_id = 55 and state='active'
order by text_data
which is very slow.
I increased work_mem to 1 GB for the current session using
SET work_mem TO '1 GB';
to change the sort method from external merge (disk) to quicksort.
But there was no improvement in query execution time. The query plan is:
Gather Merge (cost=750930.58..1047136.19 rows=2538728 width=128) (actual time=18272.405..27347.556 rows=3462116 loops=1)
Workers Planned: 2
Workers Launched: 2
Buffers: shared hit=235635 read=201793
-> Sort (cost=749930.56..753103.97 rows=1269364 width=128) (actual time=14299.222..15494.774 rows=1154039 loops=3)
Sort Key: text_data
Sort Method: quicksort Memory: 184527kB
Worker 0: Sort Method: quicksort Memory: 266849kB
Worker 1: Sort Method: quicksort Memory: 217050kB
Buffers: shared hit=235635 read=201793
-> Parallel Seq Scan on attributes (cost=0.00..621244.50 rows=1269364 width=128) (actual time=0.083..3410.570 rows=1154039 loops=3)
Filter: ((attribute_id = 185) AND (state = 'active'))
Rows Removed by Filter: 8652494
Buffers: shared hit=235579 read=201793
Planning Time: 0.453 ms
Execution Time: 29135.285 ms
The query's total runtime is 45 seconds:
Successfully run. Total query runtime: 45 secs 237 msec.
3462116 rows affected.
To improve filtering and query execution time, I created an index on attribute_id and state:
create index attribute_id_state on attributes(attribute_id,state);
The new query plan is:
Sort (cost=875797.49..883413.68 rows=3046474 width=128) (actual time=47189.534..49035.361 rows=3462116 loops=1)
Sort Key: text_data
Sort Method: quicksort Memory: 643849kB
Buffers: shared read=406048
-> Bitmap Heap Scan on attributes (cost=64642.80..547711.91 rows=3046474 width=128) (actual time=981.857..10348.441 rows=3462116 loops=1)
Recheck Cond: ((attribute_id = 185) AND (state = 'active'))
Heap Blocks: exact=396586
Buffers: shared read=406048
-> Bitmap Index Scan on attribute_id_state (cost=0.00..63881.18 rows=3046474 width=0) (actual time=751.909..751.909 rows=3462116 loops=1)
Index Cond: ((attribute_id = 185) AND (state = 'active'))
Buffers: shared read=9462
Planning Time: 0.358 ms
Execution Time: 50388.619 ms
But the query became even slower after creating the index.
The table has 29.5 million rows; text_data is null in 9 million of them.
The query returns about 3.5 million records, roughly 12% of the table.
Is there any other index, or another approach such as changing a parameter, that would improve the query?

Some suggestions:
ORDER BY clauses can be accelerated by indexes. So if you put your ordering column in your compound index you may get things to go much faster.
CREATE INDEX attribute_id_state_data
ON attributes(attribute_id, state, text_data);
This index is redundant with the one in your question, so drop that one when you create this one.
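For example, using the index name from your question:
DROP INDEX attribute_id_state;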
You use SELECT *, a notorious performance and maintainability antipattern. You're much better off naming the columns you want. This is especially important when your result sets are large: why waste CPU and network resources on data you don't need in your application? So let's assume you actually want all the columns. If you don't need them all, remove the unneeded ones from this SELECT.
SELECT id, object_id, attribute_id, text_data, int_data,
timestamp_data, state ...
You can use the INCLUDE clause on your index so it covers your query, that is so the query can be satisfied entirely from the index.
CREATE INDEX attribute_id_state_data
ON attributes(attribute_id, state, text_data)
INCLUDE (id, object_id, int_data, timestamp_data);
When you use this BTREE index, your query is satisfied by random-accessing the index to the first eligible row and then scanning the index sequentially. There's no need for PostgreSQL to bounce back to the table's data. It doesn't get much faster than that for a big result set.
If you remove some columns from your SELECT clause, you can also remove them from the index's INCLUDE clause.
You ORDER BY a TEXT column. That's a lot of data to sort for each record, whether during index creation or during a query, and large values get TOASTed (stored out of line), which slows things down further. Can you rework your application to use a limited-length VARCHAR column for this instead? It would be more efficient.
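Once the covering index is in place, it is worth confirming with EXPLAIN that the planner actually produces an index-only scan. A quick check, reusing the filter value from your plan (note that attribute_id is declared as text in your DDL but compared to a bare number in your query; quote the value if the column really is text):
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, object_id, attribute_id, text_data, int_data, timestamp_data, state
FROM attributes
WHERE attribute_id = '185' AND state = 'active'
ORDER BY text_data;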

Related

PostgreSQL not using index

I have a table named snapshots with a column named data in jsonb format
An Index is created on snapshots table
create index on snapshots using btree ((data->>'creator'));
The following query was using the index initially, but not after a couple of days:
SELECT id, data - 'sections' - 'sharing' AS data FROM snapshots WHERE data->>'creator' = 'abc@email.com' ORDER BY xmin::text::bigint DESC
below is the output by running explain analyze
Sort (cost=19.10..19.19 rows=35 width=77) (actual time=292.159..292.163 rows=35 loops=1)
Sort Key: (((xmin)::text)::bigint) DESC
Sort Method: quicksort Memory: 30kB
-> Seq Scan on snapshots (cost=0.00..18.20 rows=35 width=77) (actual time=3.500..292.104 rows=35 loops=1)
Filter: ((data ->> 'creator'::text) = 'abc@email.com'::text)
Rows Removed by Filter: 152
Planning Time: 0.151 ms
Execution Time: 292.198 ms
A table with 187 rows is very small. For very small tables, a sequential scan is the most efficient strategy.
What is surprising here is the long duration of the query execution (292 milliseconds!). Unless you have incredibly lame or overloaded storage, this must mean that the table is extremely bloated – it is comparatively large, but almost all pages are empty, with only 187 live rows. You should rewrite the table to compact it:
VACUUM (FULL) snapshots;
Then the query will become much faster.
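If you want to verify the bloat theory before rewriting the table, a quick sanity check is to compare the table's physical size with its live row count; a table of 187 rows should only occupy a handful of 8 kB pages:
SELECT pg_size_pretty(pg_relation_size('snapshots')) AS table_size,
       (SELECT count(*) FROM snapshots) AS live_rows;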

Efficient full text search in PostgreSQL, sorting on another column

In PostgreSQL, how can one efficiently do a full text search on one column, sorting on another column?
Say I have a table tbl with columns a, b, c, ... and many (> a million) rows. I want to do a full text search on column a and sort the results by some other column.
So I create a tsvector va from column a,
ALTER TABLE tbl
ADD COLUMN va tsvector GENERATED ALWAYS AS (to_tsvector('english', a)) STORED;
create an index iva for that,
CREATE INDEX iva ON tbl USING GIN (va);
and an index ib for column b,
CREATE INDEX ib ON tbl (b);
Then I query like
SELECT * FROM tbl WHERE va @@ to_tsquery('english', 'test') ORDER BY b LIMIT 100
Now the obvious execution strategy for Postgres would be:
for frequent words, do an Index Scan using ib, filtering for va @@ 'test'::tsquery, and stop after 100 matches (strategy 1),
while for rare words, do a (Bitmap) Index Scan using iva with condition va @@ 'test'::tsquery, and then sort on b manually (strategy 2).
However, Postgres' query planner seems not to take word frequency into account:
With a low LIMIT (e.g. 100) it always uses strategy 1 (as I checked with EXPLAIN), and in my case takes over a minute for rare (or not occurring) words. However, if I trick it into using strategy 2 by setting a large (or no) LIMIT, it returns in a millisecond!
The other way round, with a larger LIMIT (e.g. 200) it always uses strategy 2, which works well for rare words but is very slow for frequent words.
So how do I get Postgres to use a good query plan in every case?
Since there currently seems to be no way to make Postgres choose the right plan automatically:
how do I get the number of rows containing a specific lexeme so I can decide on the best strategy?
(SELECT COUNT(*) FROM tbl WHERE va @@ to_tsquery('english', 'test') is horribly slow (~ 1 second for lexemes occurring in 10000 rows), and ts_stat seems also not to help, apart from building my own word frequency list)
how do I then tell Postgres to use this strategy?
Here is a concrete example:
I have a table items with 1.5 million rows, a tsvector column v3 on which I do the text search, and a column rating on which I sort. In this case I determined that the query planner always chooses strategy 1 if the LIMIT is 135 or less, and strategy 2 otherwise.
Here is the EXPLAIN ANALYZE for the rare word 'aberdeen' (occurring in 132 rows) with LIMIT 135:
EXPLAIN (ANALYZE, BUFFERS) SELECT nm FROM items WHERE v3 @@ to_tsquery('english', 'aberdeen')
ORDER BY rating DESC NULLS LAST LIMIT 135
Limit (cost=0.43..26412.78 rows=135 width=28) (actual time=5915.455..499917.390 rows=132 loops=1)
Buffers: shared hit=4444267 read=2219412
I/O Timings: read=485517.381
-> Index Scan using ir on items (cost=0.43..1429202.13 rows=7305 width=28) (actual time=5915.453..499917.242 rows=132 loops=1)
Filter: (v3 @@ '''aberdeen'''::tsquery)
Rows Removed by Filter: 1460845
Buffers: shared hit=4444267 read=2219412
I/O Timings: read=485517.381
Planning:
Buffers: shared hit=253
Planning Time: 1.270 ms
Execution Time: 499919.196 ms
and with LIMIT 136:
EXPLAIN (ANALYZE, BUFFERS) SELECT nm FROM items WHERE v3 @@ to_tsquery('english', 'aberdeen')
ORDER BY rating DESC NULLS LAST LIMIT 136
Limit (cost=26245.53..26245.87 rows=136 width=28) (actual time=29.870..29.889 rows=132 loops=1)
Buffers: shared hit=57 read=83
I/O Timings: read=29.085
-> Sort (cost=26245.53..26263.79 rows=7305 width=28) (actual time=29.868..29.876 rows=132 loops=1)
Sort Key: rating DESC NULLS LAST
Sort Method: quicksort Memory: 34kB
Buffers: shared hit=57 read=83
I/O Timings: read=29.085
-> Bitmap Heap Scan on items (cost=88.61..25950.14 rows=7305 width=28) (actual time=1.361..29.792 rows=132 loops=1)
Recheck Cond: (v3 @@ '''aberdeen'''::tsquery)
Heap Blocks: exact=132
Buffers: shared hit=54 read=83
I/O Timings: read=29.085
-> Bitmap Index Scan on iv3 (cost=0.00..86.79 rows=7305 width=0) (actual time=1.345..1.345 rows=132 loops=1)
Index Cond: (v3 @@ '''aberdeen'''::tsquery)
Buffers: shared hit=3 read=2
I/O Timings: read=1.299
Planning:
Buffers: shared hit=253
Planning Time: 1.296 ms
Execution Time: 29.932 ms
and here for the frequent word 'game' (occurring in 240464 rows) with LIMIT 135:
EXPLAIN (ANALYZE, BUFFERS) SELECT nm FROM items WHERE v3 @@ to_tsquery('english', 'game')
ORDER BY rating DESC NULLS LAST LIMIT 135
Limit (cost=0.43..26412.78 rows=135 width=28) (actual time=3.240..542.252 rows=135 loops=1)
Buffers: shared hit=2876 read=1930
I/O Timings: read=529.523
-> Index Scan using ir on items (cost=0.43..1429202.13 rows=7305 width=28) (actual time=3.239..542.216 rows=135 loops=1)
Filter: (v3 @@ '''game'''::tsquery)
Rows Removed by Filter: 867
Buffers: shared hit=2876 read=1930
I/O Timings: read=529.523
Planning:
Buffers: shared hit=208 read=45
I/O Timings: read=15.626
Planning Time: 25.174 ms
Execution Time: 542.306 ms
and with LIMIT 136:
EXPLAIN (ANALYZE, BUFFERS) SELECT nm FROM items WHERE v3 @@ to_tsquery('english', 'game')
ORDER BY rating DESC NULLS LAST LIMIT 136
Limit (cost=26245.53..26245.87 rows=136 width=28) (actual time=69419.656..69419.675 rows=136 loops=1)
Buffers: shared hit=1757820 read=457619
I/O Timings: read=65246.893
-> Sort (cost=26245.53..26263.79 rows=7305 width=28) (actual time=69419.654..69419.662 rows=136 loops=1)
Sort Key: rating DESC NULLS LAST
Sort Method: top-N heapsort Memory: 41kB
Buffers: shared hit=1757820 read=457619
I/O Timings: read=65246.893
-> Bitmap Heap Scan on items (cost=88.61..25950.14 rows=7305 width=28) (actual time=110.959..69326.343 rows=240464 loops=1)
Recheck Cond: (v3 @@ '''game'''::tsquery)
Rows Removed by Index Recheck: 394527
Heap Blocks: exact=49894 lossy=132284
Buffers: shared hit=1757817 read=457619
I/O Timings: read=65246.893
-> Bitmap Index Scan on iv3 (cost=0.00..86.79 rows=7305 width=0) (actual time=100.537..100.538 rows=240464 loops=1)
Index Cond: (v3 @@ '''game'''::tsquery)
Buffers: shared hit=1 read=60
I/O Timings: read=26.870
Planning:
Buffers: shared hit=253
Planning Time: 1.195 ms
Execution Time: 69420.399 ms
This is not easy to solve: full text search requires a GIN index, but a GIN index cannot support ORDER BY. Also, if you have a B-tree index for ORDER BY and a GIN index for the full text search, these can be combined using a bitmap index scan, but a bitmap index scan cannot support ORDER BY either.
I see a certain possibility if you create your own “stop word” list that contains all the frequent words in your data (in addition to the normal English stop words). Then you can define a text search dictionary that uses that stop word file and a text search configuration english_rare using that dictionary.
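For illustration, a minimal sketch of that setup (the dictionary name english_rare_dict and the stop word file english_rare.stop, which goes in $SHAREDIR/tsearch_data, are placeholders; the commands themselves are standard PostgreSQL DDL):
-- assumes english_rare.stop has been created in $SHAREDIR/tsearch_data
CREATE TEXT SEARCH DICTIONARY english_rare_dict (
    TEMPLATE = snowball,
    Language = english,
    StopWords = english_rare
);
CREATE TEXT SEARCH CONFIGURATION english_rare (COPY = english);
ALTER TEXT SEARCH CONFIGURATION english_rare
    ALTER MAPPING FOR asciiword, asciihword, hword_asciipart, word, hword, hword_part
    WITH english_rare_dict;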
Then you could create your full text index using that configuration and query in two steps like this:
look for rare words:
SELECT *
FROM (SELECT *
FROM tbl
WHERE va @@ to_tsquery('english_rare', 'test')
OFFSET 0) AS q
ORDER BY b LIMIT 100;
The subquery with OFFSET 0 will keep the optimizer from scanning the index on b.
For rare words, this will return the correct result quickly. For frequent words, this will return nothing, since to_tsquery will return an empty result. To distinguish between a miss because the word does not occur and a miss because the word is frequent, watch for the following notice:
NOTICE: text-search query contains only stop words or doesn't contain lexemes, ignored
look for frequent words (if the first query gave you the notice):
SELECT *
FROM (SELECT *
FROM tbl
ORDER BY b) AS q
WHERE va @@ to_tsquery('english', 'test')
LIMIT 100;
Note that we use the normal English configuration here. This will always scan the index on b and will be reasonably fast for frequent search terms.
Solution for my scenario, which I think will work well for many real-world cases:
Let Postgres use the "rare-word strategy" (strategy 2 in the question) always or mostly. The reason is that there should always be the possibility for a user to sort by relevance (e.g. using ts_rank), in which case the other strategy cannot be used, so one has to make sure that the "rare-word strategy" is fast enough for all searches anyway.
To force Postgres to use this strategy one can use a subquery, as Laurenz Albe has pointed out:
SELECT * FROM
(SELECT * FROM tbl WHERE va @@ to_tsquery('english', 'test') OFFSET 0) AS q
ORDER BY b LIMIT 100;
Alternatively one can simply set LIMIT sufficiently high (while only fetching as many results as needed).
I could achieve sufficient performance (nearly all queries take < 1 second) by
doing each search first against a smaller tsvector containing the most relevant parts of each document (e.g. title and summary), and checking the full document only if this first query does not yield enough results (a sketch of such a column follows this list)
treating very frequent words specially, e.g. only allowing them in AND-combination with other words (adding them to the stop words is problematic, since stop words are not handled sensibly when they occur in phrases, for example)
increasing RAM and increasing shared_buffers so the whole table can be cached (8.5 GB for me currently)
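Here is a sketch of the "smaller tsvector" idea from the first point, assuming hypothetical title and summary columns and PostgreSQL 12+ generated columns:
ALTER TABLE items
    ADD COLUMN v_short tsvector GENERATED ALWAYS AS
        (to_tsvector('english', coalesce(title, '') || ' ' || coalesce(summary, ''))) STORED;
CREATE INDEX iv_short ON items USING GIN (v_short);
Queries then run against v_short first and fall back to v3 only when too few rows come back.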
For cases where these optimizations are not enough, to achieve better performance for all queries (i.e. also those sorting by relevance, which are the hardest), I think one would have to use a more sophisticated text search index instead of GIN. There is the RUM index extension which looks promising, but I haven't tried it yet.
PS: Contrary to my observation in the question I have now found that under certain circumstances the planner does take word frequency into account and makes decisions in the right direction:
For rare words the borderline LIMIT above which it chooses the "rare-word strategy" is lower than for frequent words, and in a certain range this choice seems very good. However, this is in no way reliable, and sometimes the choice is very wrong; e.g. for low LIMITs it chooses the "frequent-word strategy" even for very rare or non-occurring words, which leads to awful slowness.
It appears to depend on many factors and is not predictable.

PostgreSQL: latest row in DISTINCT ON less performant than max row in GROUP BY

I have a situation that I would like to better understand:
I have a table t with two columns and one index:
CREATE TABLE t (
refid BIGINT NOT NULL,
created TIMESTAMPTZ NOT NULL
);
CREATE INDEX t_refid_created ON t (refid, created);
In order to get the latest (with the highest created value) row for each distinct refid, I composed two queries:
-- index only scan t_refid_created_desc_idx
SELECT DISTINCT ON (refid) * FROM t
ORDER BY refid, created DESC;
-- index scan t_refid_created_idx
SELECT refid, max(created) FROM t GROUP BY refid;
When t has about 16M rows and the variance in refid is about 500 different values, the second query returns substantially faster than the first one.
At first I figured that because I'm ordering by created DESC it needs to do a backwards index scan, which is slower when it starts from a column with high variance (created). So I added the following index:
CREATE index t_refid_created_desc_idx ON t (refid, created DESC);
It was indeed used (instead of the backwards scan on the previous index) but there was no improvement.
If I understand correctly, the second query would aggregate by refid and then scan each aggregate to find the max created value. That sounds like a lot of work.
The first query, to the best of my understanding, should have simply iterated over the first part of the index, then for each refid it should have used the second part of the index, taking the first value.
Obviously that is not the case, and the SELECT DISTINCT ON query takes several times as long as the GROUP BY one.
What am I missing here?
Here are EXPLAIN ANALYZE outputs for the first and second queries:
Unique (cost=0.56..850119.78 rows=291 width=16) (actual time=0.103..13414.913 rows=469 loops=1)
-> Index Only Scan using t_refid_created_desc_idx on t (cost=0.56..808518.47 rows=16640527 width=16) (actual time=0.102..12113.454 rows=16640527 loops=1)
Heap Fetches: 16640527
Planning time: 0.157 ms
Execution time: 13415.047 ms
Finalize GroupAggregate (cost=599925.13..599932.41 rows=291 width=16) (actual time=3454.350..3454.884 rows=469 loops=1)
Group Key: refid
-> Sort (cost=599925.13..599926.59 rows=582 width=16) (actual time=3454.344..3454.509 rows=1372 loops=1)
Sort Key: refid
Sort Method: quicksort Memory: 113kB
-> Gather (cost=599837.29..599898.40 rows=582 width=16) (actual time=3453.194..3560.602 rows=1372 loops=1)
Workers Planned: 2
Workers Launched: 2
-> Partial HashAggregate (cost=598837.29..598840.20 rows=291 width=16) (actual time=3448.225..3448.357 rows=457 loops=3)
Group Key: refid
-> Parallel Seq Scan on t (cost=0.00..564169.53 rows=6933553 width=16) (actual time=0.047..2164.459 rows=5546842 loops=3)
Planning time: 0.157 ms
Execution time: 3561.727 ms
The first query runs in about 10 seconds, while the second one achieves the same results in 2 seconds! And without even using the index!
I'm using PostgreSQL 10.5.
I cannot answer the riddle of why the DISTINCT ON does not consider the second plan. From the cost estimates we see that PostgreSQL considers it cheaper.
I guess that nobody has implemented pushing down DISTINCT into parallel plans. You could ask the mailing list.
However, the problem with the first query is the 16 million heap fetches. This means that this is actually a normal index scan! It looks like a bad misestimate on the side of the planner.
If I am right, a VACUUM on the table that cleans the visibility map should improve the first query considerably.
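A quick way to test that (a plain VACUUM is enough to refresh the visibility map; no VACUUM FULL needed):
VACUUM t;
EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT ON (refid) * FROM t
ORDER BY refid, created DESC;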

Postgresql. Optimize retriving distinct values from large table

I have one de-normalized table with 40+ columns (~ 1.5 million rows, 1 Gb).
CREATE TABLE tbl1 (
...
division_id integer,
division_name varchar(10),
...
);
I need to speed up query
SELECT DISTINCT division_name, division_id
FROM tbl1
ORDER BY division_name;
The query returns only ~250 rows, but it is very slow because of the table's size.
I have tried creating an index:
create index idx1 on tbl1 (division_name, division_id);
But the current execution plan is:
explain analyze SELECT Distinct division_name, division_id FROM tbl1 ORDER BY 1;
QUERY PLAN
-----------------------------------------------------------------
Sort (cost=143135.77..143197.64 rows=24748 width=74) (actual time=1925.697..1925.723 rows=294 loops=1)
Sort Key: division_name
Sort Method: quicksort Memory: 74kB
-> HashAggregate (cost=141082.30..141329.78 rows=24748 width=74) (actual time=1923.853..1923.974 rows=294 loops=1)
Group Key: division_name, division_id
-> Seq Scan on tbl1 (cost=0.00..132866.20 rows=1643220 width=74) (actual time=0.069..703.008 rows=1643220 loops=1)
Planning time: 0.311 ms
Execution time: 1925.883 ms
Any suggestions why the index is not used, or how I can speed up the query in another way?
Server: PostgreSQL 9.6.
P.S. Yes, the table has 40+ columns and is de-normalized, but I know all the pros and cons of that decision.
Update 1
@a_horse_with_no_name suggested using VACUUM ANALYZE instead of ANALYZE to update the table statistics. Now the query plan is:
QUERY PLAN
------------------------
Unique (cost=0.55..115753.43 rows=25208 width=74) (actual time=0.165..921.426 rows=294 loops=1)
-> Index Only Scan using idx1 on tbl1 (cost=0.55..107538.21 rows=1643044 width=74) (actual time=0.162..593.322 rows=1643220 loops=1)
Heap Fetches: 0
Much better!
The index will probably only help if PostgreSQL chooses an “index only scan”, that means that it does not have to look at the table data at all.
Normally PostgreSQL has to check the table data (“heap”) to see if a row is visible for the current transaction, because visibility information is not stored in the index.
If, however, the table does not change much and has recently been VACUUMed, PostgreSQL knows that most of the pages consist only of items visible for everyone (there is a “visibility map” to keep track of that information), and then it might be cheaper to scan the index.
Try running VACUUM on the table and see if that causes an index only scan to be used.
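To check how much of the table is already marked all-visible, you can take a quick look with the contrib extension pg_visibility (available since 9.6); a small sketch:
CREATE EXTENSION IF NOT EXISTS pg_visibility;
SELECT count(*) AS total_pages,
       count(*) FILTER (WHERE all_visible) AS all_visible_pages
FROM pg_visibility_map('tbl1');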
Other than that, there is no way to speed up such a query.

Postgres JSONB timestamp query very slow compared to timestamp column query

I've got a Postgres 9.4.4 database with 1.7 million records with the following information stored in a JSONB column called data in a table called accounts:
data: {
  "lastUpdated": "2016-12-26T12:09:43.901Z",
  "lastUpdatedTimestamp": "1482754183"
}
The actual JSONB column stores much more information, but I've omitted the irrelevant data. The data format cannot be changed since this is legacy information.
I'm trying to efficiently obtain a count of all records where the lastUpdated value is greater or equal to some reference time (I'll use 2015-12-01T10:10:10Z in the following examples):
explain analyze SELECT count(*) FROM "accounts"
WHERE data->>'lastUpdated' >= '2015-12-01T10:10:10Z';
This takes over 22 seconds:
Aggregate (cost=843795.05..843795.06 rows=1 width=0) (actual time=22292.584..22292.584 rows=1 loops=1)
-> Seq Scan on accounts (cost=0.00..842317.05 rows=591201 width=0)
(actual time=1.410..22142.046 rows=1773603 loops=1)
Filter: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Planning time: 1.234 ms
Execution time: 22292.671 ms
I've tried adding the following text index:
CREATE INDEX accounts_last_updated ON accounts ((data->>'lastUpdated'));
But the query is still rather slow, at over 17 seconds:
Aggregate (cost=815548.64..815548.65 rows=1 width=0) (actual time=17172.844..17172.845 rows=1 loops=1)
-> Bitmap Heap Scan on accounts (cost=18942.24..814070.64 rows=591201 width=0)
(actual time=1605.454..17036.081 rows=1773603 loops=1)
Recheck Cond: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Heap Blocks: exact=28955 lossy=397518
-> Bitmap Index Scan on accounts_last_updated (cost=0.00..18794.44 rows=591201 width=0)
(actual time=1596.645..1596.645 rows=1773603 loops=1)
Index Cond: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Planning time: 1.373 ms
Execution time: 17172.974 ms
I've also tried following the directions in Create timestamp index from JSON on PostgreSQL and have tried creating the following function and index:
CREATE OR REPLACE FUNCTION text_to_timestamp(text)
RETURNS timestamp AS
$$SELECT to_timestamp($1, 'YYYY-MM-DD HH24:MI:SS.MS')::timestamp; $$
LANGUAGE sql IMMUTABLE;
CREATE INDEX accounts_last_updated ON accounts
(text_to_timestamp(data->>'lastUpdated'));
But this doesn't give me any improvement, in fact it was slower, taking over 24 seconds for the query, versus 22 seconds for the unindexed version:
explain analyze SELECT count(*) FROM "accounts"
WHERE text_to_timestamp(data->>'lastUpdated') >= '2015-12-01T10:10:10Z';
Aggregate (cost=1287195.80..1287195.81 rows=1 width=0) (actual time=24143.150..24143.150 rows=1 loops=1)
-> Seq Scan on accounts (cost=0.00..1285717.79 rows=591201 width=0)
(actual time=4.044..23971.723 rows=1773603 loops=1)
Filter: (text_to_timestamp((data ->> 'lastUpdated'::text)) >= '2015-12-01 10:10:10'::timestamp without time zone)
Planning time: 1.107 ms
Execution time: 24143.183 ms
In one last act of desperation, I decided to add another timestamp column and update it to contain the same values as data->>'lastUpdated':
alter table accounts add column updated_at timestamp;
update accounts set updated_at = text_to_timestamp(data->>'lastUpdated');
create index accounts_updated_at on accounts(updated_at);
This has given me by far the best performance:
explain analyze SELECT count(*) FROM "accounts" where updated_at >= '2015-12-01T10:10:10Z';
Aggregate (cost=54936.49..54936.50 rows=1 width=0) (actual time=676.955..676.955 rows=1 loops=1)
-> Index Only Scan using accounts_updated_at on accounts
(cost=0.43..50502.48 rows=1773603 width=0) (actual time=0.026..552.442 rows=1773603 loops=1)
Index Cond: (updated_at >= '2015-12-01 10:10:10'::timestamp without time zone)
Heap Fetches: 0
Planning time: 4.643 ms
Execution time: 678.962 ms
However, I'd very much like to avoid adding another column just to improve the speed of this one query.
This leaves me with the following question: is there any way to improve the performance of my JSONB query so it can be as efficient as the individual column query (the last query where I used updated_at instead of data->>'lastUpdated')? As it stands, it takes from 17 seconds to 24 seconds for me to query the JSONB data using data->>'lastUpdated', while it takes only 678 ms to query the updated_at column. It doesn't make sense that the JSONB query would be so much slower. I was hoping that by using the text_to_timestamp function that it would improve the performance, but it hasn't been the case (or I'm doing something wrong).
In your first and second tries most of the execution time is spent on the index recheck or the filter, which must read the JSON field for every row the index hits, and reading JSON is expensive. If the index hits only a couple hundred rows the query will be fast, but if it hits thousands or hundreds of thousands of rows, filtering/rechecking the JSON field takes serious time. In the second try, going through an additional function makes it even worse.
A JSON field is good for storing data, but it is not intended for analytic queries like summaries and statistics, and it is bad practice to use JSON objects in WHERE conditions, at least as the main filtering condition as in your case.
That last act of desperation of yours is the right way to go :)
To improve query performance, you should add one or several columns with the key values that are used most in WHERE conditions.
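If keeping the extra column in sync with the JSONB data is the concern, a trigger can do that automatically. A minimal sketch, reusing the text_to_timestamp function from the question (the trigger and function names are illustrative):
CREATE OR REPLACE FUNCTION accounts_sync_updated_at() RETURNS trigger AS
$$
BEGIN
    -- copy the JSONB value into the plain timestamp column on every write
    NEW.updated_at := text_to_timestamp(NEW.data->>'lastUpdated');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_sync_updated_at
    BEFORE INSERT OR UPDATE ON accounts
    FOR EACH ROW
    EXECUTE PROCEDURE accounts_sync_updated_at();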