How to improve JPA/PostgreSQL query performance?

I have a table with 36.64 million entries.
The table definition is as follows:
id integer, PK
attribute, varchar 255
value, varchar 255
store_id, integer
timestamp, timestamp without timezone
mac_address, varchar 255
In addition, the mac_address and timestamp columns each have an index.
The queries:
select count(*) from table where mac_address = $1 and timestamp between $2 and $3
select * from table where mac_address = $1 and timestamp between $2 and $3
If I run these in pgAdmin, they take a total of 10 seconds.
If I run them through JPA, they take more than 40 seconds. There is no EAGER loading.
I've looked into the SimpleJpaRepository code; it executes exactly these two queries, a count() and a getResultList().
Questions:
1. It looks like the timestamp index is not used, in either pgAdmin or JPA. I've checked this with EXPLAIN and ANALYZE. But why?
2. Why does JPA need four times as long? An ORM adds overhead, but that much?
3. How do I improve it?
EDIT 1:
Maybe the count() from JPA is not using an index scan but a sequential scan, which is slow. My PostgreSQL version is 9.5.
EDIT 2:
In JPA, the code uses setFirstResult() and setMaxResults() to fetch 100 entries out of a total of 259,242.
I tried to mimic this with LIMIT and OFFSET, but I didn't see those keywords in the JPA query. Maybe JPA is fetching all results and then paging in memory, which in turn causes the performance issue?
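For reference, on PostgreSQL a JPA provider such as Hibernate normally translates setFirstResult()/setMaxResults() into LIMIT and OFFSET, so the paged select should look roughly like this (table and column names taken from the plans below; the exact SQL your provider emits may differ):
select id, attributecode, attributevalue, store_id, "timestamp", mac_address
from device_messages
where mac_address = $1 and "timestamp" between $2 and $3
limit 100 offset 0;   -- setMaxResults(100), setFirstResult(0)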
The first execution of the count() query takes 19 to 55 seconds in pgAdmin.
The EXPLAIN output of the two queries:
count()
Aggregate (cost=761166.10..761166.11 rows=1 width=4) (actual time=1273.871..1273.871 rows=1 loops=1)
Output: count(id)
Buffers: shared read=92986 written=56
-> Bitmap Heap Scan on public.device_messages playerstat0_ (cost=11165.36..760309.47 rows=342650 width=4) (actual time=76.217..1258.389 rows=259242 loops=1)
Output: id, attributecode, attributevalue, store_id, "timestamp", mac_address
Recheck Cond: (((playerstat0_.mac_address)::text = '0011E004CA34'::text) AND (playerstat0_."timestamp" >= '2018-04-04 00:00:00'::timestamp without time zone) AND (playerstat0_."timestamp" <= '2018-05-04 00:00:00'::timestamp without time zone))
Rows Removed by Index Recheck: 6281401
Heap Blocks: exact=36622 lossy=55083
Buffers: shared read=92986 written=56
-> Bitmap Index Scan on device_messages_mac_address_timestamp_idx (cost=0.00..11079.70 rows=342650 width=0) (actual time=69.636..69.636 rows=259242 loops=1)
Index Cond: (((playerstat0_.mac_address)::text = '0011E004CA34'::text) AND (playerstat0_."timestamp" >= '2018-04-04 00:00:00'::timestamp without time zone) AND (playerstat0_."timestamp" <= '2018-05-04 00:00:00'::timestamp without time zone))
Buffers: shared read=1281
Planning time: 0.138 ms
Execution time: 1274.275 ms
select
Limit (cost=3362.52..5043.49 rows=100 width=34) (actual time=30.291..42.846 rows=100 loops=1)
Output: id, attributecode, attributevalue, mac_address, store_id, "timestamp"
Buffers: shared hit=15447 read=1676
-> Index Scan Backward using device_messages_pkey on public.device_messages playerstat0_ (cost=0.57..5759855.56 rows=342650 width=34) (actual time=2.597..42.834 rows=300 loops=1)
Output: id, attributecode, attributevalue, mac_address, store_id, "timestamp"
Filter: ((playerstat0_."timestamp" >= '2018-04-04 00:00:00'::timestamp without time zone) AND (playerstat0_."timestamp" <= '2018-05-04 00:00:00'::timestamp without time zone) AND ((playerstat0_.mac_address)::text = '0011E004CA34'::text))
Rows Removed by Filter: 154833
Buffers: shared hit=15447 read=1676
Planning time: 0.180 ms
Execution time: 42.878 ms
EDIT 3:
After more testing, it is confirmed that the cause is the count(). The select with limit and offset is pretty fast; the count() alone can take up to a minute.
This is mentioned here: postgresql slow counting
While the count-estimate approach works (using the rows estimate from the query plan), I couldn't call that from JPA.
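For what it's worth, the estimate can be wrapped in a function along the lines of the one on the PostgreSQL wiki, which parses the planner's row estimate out of EXPLAIN; such a function could then be called from JPA as a native query (a sketch, not tested from JPA):
CREATE OR REPLACE FUNCTION count_estimate(query text) RETURNS integer AS $$
DECLARE
    rec  record;
    rows integer;
BEGIN
    -- run EXPLAIN (without ANALYZE) and pull the planner's row estimate out of the plan text
    FOR rec IN EXECUTE 'EXPLAIN ' || query LOOP
        rows := substring(rec."QUERY PLAN" FROM ' rows=([[:digit:]]+)');
        EXIT WHEN rows IS NOT NULL;
    END LOOP;
    RETURN rows;
END;
$$ LANGUAGE plpgsql VOLATILE STRICT;
-- example call (hypothetical): pass the filtered query without the count()
-- SELECT count_estimate('SELECT 1 FROM device_messages WHERE mac_address = ''0011E004CA34''');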
EDIT 4:
I have kind of solved the problem, but not completely.
Regarding the select: after creating an index that matches the query, it actually runs quite fast, 2-5 seconds. But that is without sorting; sorting adds another processing step to the query.
The count() is slow, and this is confirmed by the PostgreSQL documentation: MVCC forces count() to do a heap scan, similar to a sequential scan over the whole table.
The final problem, which I am still not sure about, is that the query on the production server is much slower than on the testing server: 60 seconds on production versus 5 seconds on testing, with the same table size and data. The big difference is that the production server receives about 20+ insert operations per second, while the testing server has no inserts going on. I am guessing that the insert operations need a write lock, so the query is slow because it has to wait for the lock?
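One way to check that guess is to look at pg_stat_activity on the production server while the query is running slowly; on 9.5 the waiting column shows whether a backend is blocked on a lock (plain readers are normally not blocked by inserts under MVCC):
-- PostgreSQL 9.5: "waiting" is true for backends blocked on a lock
SELECT pid, state, waiting, now() - query_start AS runtime, query
FROM pg_stat_activity
WHERE state <> 'idle';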

You should be able to get better performance with an index that has both mac_address and timestamp in the same index:
CREATE INDEX [CONCURRENTLY] ON table (mac_address, timestamp);
The reason the timestamp index is not used is that it would need to be cross-referenced with the mac_address index to find the correct rows, which would actually take longer than just looking up the rows directly.
I have no experience with JPA so I can't really say why it's slower.

Related

In Postgres, how to generate a UTC timestamp with timezone column from a text column with YYYY-MM-DDTHH:MM:SSZ dates

My table has a "Timestamp" column (text type) with YYYY-MM-DDTHH:MM:SSZ formatted dates. I want to generate a timestamptz column holding the same instants as UTC timestamps, but have been unable to do it. I have tried many methods suggested in forums and in the documentation, but I have not been able to get anything to work.
Here is a data example from the table:
select "Timestamp",("Timestamp"::timestamp with time zone) from public.time_177168 limit 1
This returns:
"2022-12-10T04:10:02-06:00" (Text) and "2022-12-10 10:10:02+00" (timestamp with time zone)
Here are a few examples of my attempts to generate the new column but they all return:
ERROR: generation expression is not immutable SQL state: 42P17
Attempt 1:
alter table public.time_177168 ADD COLUMN "TimestampUTC" timestamp with time zone GENERATED ALWAYS AS ("Timestamp"::timestamp with time zone) STORED
Attempt 2:
alter table public.time_177168 ADD COLUMN "TimestampUTC" timestamp with time zone GENERATED ALWAYS AS ("Timestamp"::timestamp AT TIME ZONE 'ETC/UTC') STORED
The overall goal is to be able to quickly order queries by UTC time. I am not able to change the data type for the existing "Timestamp" column because of legacy applications that use this database.
Any ideas or suggestions would be greatly appreciated.
Additional Information:
Using the solution below I was able to get the query performance to an acceptable level.
Original Query:
EXPLAIN ANALYSE SELECT "Timestamp","Column1","Column2","Column3" FROM time_177168 WHERE "Timestamp">'2022-11-06T00:59:00-06:00' ORDER BY ("Timestamp"::timestamp with time zone) limit 5000
Query Plan:
Limit (cost=125360.32..125943.69 rows=5000 width=81) (actual time=5826.521..5828.301 rows=5000 loops=1)
-> Gather Merge (cost=125360.32..198037.52 rows=622904 width=81) (actual time=5826.520..5827.743 rows=5000 loops=1)
Workers Planned: 2
Workers Launched: 0
-> Sort (cost=124360.29..125138.92 rows=311452 width=81) (actual time=5826.186..5826.712 rows=5000 loops=1)
Sort Key: ((Timestamp)::timestamp with time zone)
Sort Method: top-N heapsort Memory: 1089kB
-> Parallel Seq Scan on time_177168 (cost=0.00..103667.87 rows=311452 width=81) (actual time=0.136..5302.325 rows=747701 loops=1)
Filter: (Timestamp > '2022-11-06T00:59:00-06:00'::text)
Rows Removed by Filter: 438784
Planning Time: 0.145 ms
Execution Time: 5829.070 ms
New Query (Based on Accepted Solution)
EXPLAIN ANALYSE SELECT "Timestamp","Column1","Column2","Column3" FROM time_177168 WHERE "Timestamp">'2022-11-06T00:59:00-06:00' ORDER BY "TimestampUTC" limit 5000
Query Plan:
Limit (cost=0.43..2793.20 rows=5000 width=81) (actual time=728.625..748.371 rows=5000 loops=1)
-> Index Scan using timestamputc_time_177168 on time_177168 (cost=0.43..417511.91 rows=747486 width=81) (actual time=728.623..747.778 rows=5000 loops=1)
Filter: (Timestamp > '2022-11-06T00:59:00-06:00'::text)
Rows Removed by Filter: 438784
Planning Time: 0.134 ms
Execution Time: 756.844 ms
As long as you know the function is truly immutable, you can just declare it as such. So create a function like:
CREATE FUNCTION str2timestamp(text) RETURNS timestamp with time zone
IMMUTABLE SET timezone = 'UTC' LANGUAGE sql
RETURN to_timestamp($1, 'YYYY-MM-DD"T"HH24:MI:SS');
That is safe, because timezone is fixed while the function is running.
Such a function can be used to define a generated column using the following steps:
ALTER TABLE public.time_177168
ADD "TimestampUTC" timestamp with time zone
GENERATED ALWAYS AS (str2timestamp("Timestamp")) STORED;
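For the ORDER BY "TimestampUTC" to be fast, the generated column also needs an index; the plan in the question's edit references one named timestamputc_time_177168, which would be created along these lines:
-- plain B-tree index on the generated column so ORDER BY "TimestampUTC" can use an index scan
CREATE INDEX timestamputc_time_177168 ON public.time_177168 ("TimestampUTC");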
Check out this informative answer to a somewhat similar question.
This might do the trick for you:
select "Timestamp",("Timestamp"::timestamp with time zone) AT TIME ZONE 'UTC' from public.mx_time_well_177168 limit 1
Try adding AT TIME ZONE 'UTC' to it.

Does truncate a timestamp break the indexes?

I am wondering whether my indexes are still being used, since I am using Node.js and dates with microsecond precision are not supported in this language. So, to get a sound comparison, my query does this kind of thing:
`WHERE (created_at::timestamp(0), uuid) < (${createdAt}::timestamp(0), ${uuid})`
Since I am using a cast that truncates to seconds, I assume the indexes are broken (i.e., no longer used). Am I right? The solution would then be to change the precision of the stored timestamps, or is there another solution that keeps the old ones?
You could change the PostgreSQL data type to millisecond precision:
ALTER TABLE tab ALTER created_at TYPE timestamp(3) without time zone;
Let's test with the recommended EXPLAIN (ANALYZE, VERBOSE, BUFFERS).
I created a table named users and an index on created_at:
create table users (
id uuid default uuid_generate_v4() not null
constraint users_pkey primary key,
created_at timestamp default CURRENT_TIMESTAMP
);
create index users_created_at_idx on users (created_at);
The test:
EXPLAIN(ANALYZE, VERBOSE, BUFFERS)
SELECT id
FROM users
WHERE (created_at >= '2022-01-21 15:43:33.631779');
Index Scan using users_created_at_idx on public.users (cost=0.14..4.16 rows=1 width=16) (actual time=0.010..0.018 rows=0 loops=1)
Output: id
Index Cond: (users.created_at >= '2022-01-21 15:43:33.631779'::timestamp without time zone)
Buffers: shared hit=1
Planning Time: 0.074 ms
Execution Time: 0.058 ms
EXPLAIN(ANALYZE, VERBOSE, BUFFERS)
SELECT id
FROM users
WHERE (created_at::timestamp(0) >= '2022-01-21 15:43:33.631779'::timestamp(0));
Seq Scan on public.users (cost=0.00..4.50 rows=33 width=16) (actual time=0.034..0.043 rows=0 loops=1)
Output: id
Filter: ((users.created_at)::timestamp(0) without time zone >= '2022-01-21 15:43:34'::timestamp(0) without time zone)
Rows Removed by Filter: 100
Buffers: shared hit=3
Planning Time: 0.073 ms
Execution Time: 0.089 ms
As we can see, the index on the created_at column is not used when we cast and truncate.
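If changing the column's precision is not an option, another possibility (not covered above; assuming created_at is timestamp without time zone, as in the test table) would be an expression index that matches the truncating cast, so a comparison on created_at::timestamp(0) can use an index again:
-- hypothetical expression index matching the cast used in the WHERE clause
CREATE INDEX users_created_at_sec_idx ON users ((created_at::timestamp(0)));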

Slow execution time for a postgres query with multiple column index

We are running a PostgreSQL 9.6.11 database on Amazon RDS. The execution time of one of the queries is 6633.645 ms, which seems very slow. What changes can I make to improve the execution time of this query?
The query selects 3 columns, filtering on the account_id and date columns.
select
platform,
publisher_platform,
adset_id
FROM "adsets"
WHERE
(("adsets"."account_id" IN ('1595321963838425', '1320001405', 'urn:li:sponsoredAccount:507697540')) AND
("adsets"."date" >= '2019-05-06 00:00:00.000000+0000') AND ("adsets"."date" <= '2019-05-13 23:59:59.999999+0000'))
GROUP BY
"adsets"."platform",
"adsets"."publisher_platform",
"adsets"."adset_id"
ORDER BY
"adsets"."platform",
"adsets"."publisher_platform",
"adsets"."adset_id";
The query runs against a table called adsets. The table has the following columns:
account_id | text
campaign_id | text
adset_id | text
name | text
date | timestamp without time zone
publisher_platform | text
and 15 other columns which are a mix of integers and text fields.
We have added the following indexes -
"adsets_composite_unique_key" UNIQUE CONSTRAINT, btree (platform, account_id, campaign_id, adset_id, date, publisher_platform)
"adsets_account_id_date_idx" btree (account_id DESC, date DESC) CLUSTER
"adsets_account_id_index" btree (account_id)
"adsets_adset_id_index" btree (adset_id)
"adsets_campaign_id_index" btree (campaign_id)
"adsets_name_index" btree (name)
"adsets_platform_platform_id_publisher_platform" btree (account_id, platform, publisher_platform, adset_id)
"idx_account_date_adsets" btree (account_id, date)
"platform_pub_index" btree (platform, publisher_platform, adset_id).
work_mem on Postgres has been set to 125MB.
EXPLAIN (ANALYZE) shows:
Group (cost=33447.55..33532.22 rows=8437 width=29) (actual time=6625.170..6633.062 rows=2807 loops=1)
Group Key: platform, publisher_platform, adset_id
-> Sort (cost=33447.55..33468.72 rows=8467 width=29) (actual time=6625.168..6629.271 rows=22331 loops=1)
Sort Key: platform, publisher_platform, adset_id
Sort Method: quicksort Memory: 2513kB
-> Bitmap Heap Scan on adsets (cost=433.63..32895.18 rows=8467 width=29) (actual time=40.003..6471.898 rows=22331 loops=1)
Recheck Cond: ((account_id = ANY ('{1595321963838425,1320001405,urn:li:sponsoredAccount:507697540}'::text[])) AND (date >= '2019-05-06 00:00:00'::timestamp without time zone) AND (date <= '2019-05-13 23:59:59.999999'::timestamp without time zone))
Heap Blocks: exact=52907
-> Bitmap Index Scan on idx_account_date_adsets (cost=0.00..431.51 rows=8467 width=0) (actual time=27.335..27.335 rows=75102 loops=1)
Index Cond: ((account_id = ANY ('{1595321963838425,1320001405,urn:li:sponsoredAccount:507697540}'::text[])) AND (date >= '2019-05-06 00:00:00'::timestamp without time zone) AND (date <= '2019-05-13 23:59:59.999999'::timestamp without time zone))
Planning time: 5.380 ms
Execution time: 6633.645 ms
(12 rows)
Explain depesz
First of all, you are using GROUP BY without actually selecting any aggregates. You might as well just do SELECT DISTINCT in your query. This aside, here is the B-tree index which you probably should be using:
CREATE INDEX idx ON adsets (account_id, date, platform, publisher_platform,
adset_id);
The problem with your current index is that, while it does cover the columns you are selecting, it does not involve the columns which appear in the WHERE clause. This means that Postgres might choose to not even use the index, and rather just scan the entire table.
Note that my suggestion still does nothing to deal with the select distinct portion of the query, but at least it might speed up everything which comes before that part of the query.
Here is your updated query:
SELECT DISTINCT
platform,
publisher_platform,
adset_id
FROM adsets
WHERE
account_id IN ('1595321963838425', '1320001405',
'urn:li:sponsoredAccount:507697540') AND
date >= '2019-05-06' AND date < '2019-05-14';
Your problem is the many "false positives" that are found during the bitmap index scan phase and removed during the heap scan phase. Since there is no additional filter, I guess that the extra rows must be removed because they are not visible.
See if running VACUUM adsets improves the query performance.
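To see whether dead rows are indeed the culprit, something like this (a standard statistics view; table name taken from the question) can be checked before and after the VACUUM:
SELECT relname, n_live_tup, n_dead_tup, last_vacuum, last_autovacuum
FROM pg_stat_user_tables
WHERE relname = 'adsets';

VACUUM (VERBOSE, ANALYZE) adsets;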

Postgres JSONB timestamp query very slow compared to timestamp column query

I've got a Postgres 9.4.4 database with 1.7 million records with the following information stored in a JSONB column called data in a table called accounts:
data: {
"lastUpdated": "2016-12-26T12:09:43.901Z",
"lastUpdatedTimestamp": "1482754183"
}
}
The actual JSONB column stores much more information, but I've omitted the irrelevant data. The data format cannot be changed since this is legacy information.
I'm trying to efficiently obtain a count of all records where the lastUpdated value is greater or equal to some reference time (I'll use 2015-12-01T10:10:10Z in the following examples):
explain analyze SELECT count(*) FROM "accounts"
WHERE data->>'lastUpdated' >= '2015-12-01T10:10:10Z';
This takes over 22 seconds:
Aggregate (cost=843795.05..843795.06 rows=1 width=0) (actual time=22292.584..22292.584 rows=1 loops=1)
-> Seq Scan on accounts (cost=0.00..842317.05 rows=591201 width=0)
(actual time=1.410..22142.046 rows=1773603 loops=1)
Filter: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Planning time: 1.234 ms
Execution time: 22292.671 ms
I've tried adding the following text index:
CREATE INDEX accounts_last_updated ON accounts ((data->>'lastUpdated'));
But the query is still rather slow, at over 17 seconds:
Aggregate (cost=815548.64..815548.65 rows=1 width=0) (actual time=17172.844..17172.845 rows=1 loops=1)
-> Bitmap Heap Scan on accounts (cost=18942.24..814070.64 rows=591201 width=0)
(actual time=1605.454..17036.081 rows=1773603 loops=1)
Recheck Cond: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Heap Blocks: exact=28955 lossy=397518
-> Bitmap Index Scan on accounts_last_updated (cost=0.00..18794.44 rows=591201 width=0)
(actual time=1596.645..1596.645 rows=1773603 loops=1)
Index Cond: ((data ->> 'lastUpdated'::text) >= '2015-12-01T10:10:10Z'::text)
Planning time: 1.373 ms
Execution time: 17172.974 ms
I've also tried following the directions in Create timestamp index from JSON on PostgreSQL and have tried creating the following function and index:
CREATE OR REPLACE FUNCTION text_to_timestamp(text)
RETURNS timestamp AS
$$SELECT to_timestamp($1, 'YYYY-MM-DD HH24:MI:SS.MS')::timestamp; $$
LANGUAGE sql IMMUTABLE;
CREATE INDEX accounts_last_updated ON accounts
(text_to_timestamp(data->>'lastUpdated'));
But this doesn't give me any improvement, in fact it was slower, taking over 24 seconds for the query, versus 22 seconds for the unindexed version:
explain analyze SELECT count(*) FROM "accounts"
WHERE text_to_timestamp(data->>'lastUpdated') >= '2015-12-01T10:10:10Z';
Aggregate (cost=1287195.80..1287195.81 rows=1 width=0) (actual time=24143.150..24143.150 rows=1 loops=1)
-> Seq Scan on accounts (cost=0.00..1285717.79 rows=591201 width=0)
(actual time=4.044..23971.723 rows=1773603 loops=1)
Filter: (text_to_timestamp((data ->> 'lastUpdated'::text)) >= '2015-12-01 10:10:10'::timestamp without time zone)
Planning time: 1.107 ms
Execution time: 24143.183 ms
In one last act of desperation, I decided to add another timestamp column and update it to contain the same values as data->>'lastUpdated':
alter table accounts add column updated_at timestamp;
update accounts set updated_at = text_to_timestamp(data->>'lastUpdated');
create index accounts_updated_at on accounts(updated_at);
This has given me by far the best performance:
explain analyze SELECT count(*) FROM "accounts" where updated_at >= '2015-12-01T10:10:10Z';
Aggregate (cost=54936.49..54936.50 rows=1 width=0) (actual time=676.955..676.955 rows=1 loops=1)
-> Index Only Scan using accounts_updated_at on accounts
(cost=0.43..50502.48 rows=1773603 width=0) (actual time=0.026..552.442 rows=1773603 loops=1)
Index Cond: (updated_at >= '2015-12-01 10:10:10'::timestamp without time zone)
Heap Fetches: 0
Planning time: 4.643 ms
Execution time: 678.962 ms
However, I'd very much like to avoid adding another column just to improve the speed of this one query.
This leaves me with the following question: is there any way to improve the performance of my JSONB query so it can be as efficient as the individual column query (the last query, where I used updated_at instead of data->>'lastUpdated')? As it stands, it takes 17 to 24 seconds to query the JSONB data using data->>'lastUpdated', while it takes only 678 ms to query the updated_at column. It doesn't make sense that the JSONB query would be so much slower. I was hoping that using the text_to_timestamp function would improve the performance, but that hasn't been the case (or I'm doing something wrong).
In your first and second tries, most of the execution time is spent on the index recheck or filter step, which must read the JSON field for every row the index hits, and reading JSON is expensive. If the index hits a couple of hundred rows the query will be fast, but if the index hits thousands or hundreds of thousands of rows, filtering/rechecking the JSON field will take some serious time. In the second try, additionally calling another function makes it even worse.
A JSON field is good for storing data, but it is not intended for analytic queries like summaries and statistics, and it is bad practice to use JSON objects in WHERE conditions, at least as the main filtering condition, as in your case.
That last act of desperation of yours is the right way to go :)
To improve query performance, you should add one or several columns holding the key values that are used most in WHERE conditions.
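A minimal sketch of keeping such a column in sync (reusing the question's text_to_timestamp function and updated_at column; on 9.4 this needs a trigger, since generated columns only arrived in PostgreSQL 12):
CREATE OR REPLACE FUNCTION accounts_set_updated_at() RETURNS trigger AS $$
BEGIN
    -- copy the JSONB value into the plain timestamp column on every write
    NEW.updated_at := text_to_timestamp(NEW.data->>'lastUpdated');
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER accounts_set_updated_at
    BEFORE INSERT OR UPDATE ON accounts
    FOR EACH ROW EXECUTE PROCEDURE accounts_set_updated_at();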

Postgres not using partial timestamp index on interval queries (e.g., now() - interval '7 days' )

I have a simple table that stores precipitation readings from online gauges. Here's the table definition:
CREATE TABLE public.precip
(
gauge_id smallint,
inches numeric(8, 2),
reading_time timestamp with time zone
)
CREATE INDEX idx_precip3_id
ON public.precip USING btree
(gauge_id)
CREATE INDEX idx_precip3_reading_time
ON public.precip USING btree
(reading_time)
CREATE INDEX idx_precip_last_five_days
ON public.precip USING btree
(reading_time)
TABLESPACE pg_default WHERE reading_time > '2017-02-26 00:00:00+00'::timestamp with time zone
It's grown quite large: about 38 million records that go back 18 months. Queries rarely request rows that are more than 7 days old, so I created the partial index on the reading_time field so Postgres can traverse a much smaller index. But it's not using the partial index for all queries. It does use the partial index for
explain analyze select * from precip where gauge_id = 208 and reading_time > '2017-02-27'
Bitmap Heap Scan on precip (cost=8371.94..12864.51 rows=1169 width=16) (actual time=82.216..162.127 rows=2046 loops=1)
Recheck Cond: ((gauge_id = 208) AND (reading_time > '2017-02-27 00:00:00+00'::timestamp with time zone))
-> BitmapAnd (cost=8371.94..8371.94 rows=1169 width=0) (actual time=82.183..82.183 rows=0 loops=1)
-> Bitmap Index Scan on idx_precip3_id (cost=0.00..2235.98 rows=119922 width=0) (actual time=20.754..20.754 rows=125601 loops=1)
Index Cond: (gauge_id = 208)
-> Bitmap Index Scan on idx_precip_last_five_days (cost=0.00..6135.13 rows=331560 width=0) (actual time=60.099..60.099 rows=520867 loops=1)
Total runtime: 162.631 ms
But it does not use the partial index on the following; instead, it uses the full index on reading_time:
explain analyze select * from precip where gauge_id = 208 and reading_time > now() - interval '7 days'
Bitmap Heap Scan on precip (cost=8460.10..13007.47 rows=1182 width=16) (actual time=154.286..228.752 rows=2067 loops=1)
Recheck Cond: ((gauge_id = 208) AND (reading_time > (now() - '7 days'::interval)))
-> BitmapAnd (cost=8460.10..8460.10 rows=1182 width=0) (actual time=153.799..153.799 rows=0 loops=1)
-> Bitmap Index Scan on idx_precip3_id (cost=0.00..2235.98 rows=119922 width=0) (actual time=15.852..15.852 rows=125601 loops=1)
Index Cond: (gauge_id = 208)
-> Bitmap Index Scan on idx_precip3_reading_time (cost=0.00..6223.28 rows=335295 width=0) (actual time=136.162..136.162 rows=522993 loops=1)
Index Cond: (reading_time > (now() - '7 days'::interval))
Total runtime: 228.647 ms
Note that today is 3/5/2017, so these two queries are essentially requesting the same rows. But it seems like Postgres won't use the partial index unless the timestamp in the WHERE clause is hard-coded. Is the query planner not evaluating now() - interval '7 days' before deciding which index to use? I posted the query plans as suggested by one of the first people to respond.
I've written several functions (stored procedures) that summarize rainfall over the last 6 hours, 12 hours ... 72 hours. They all use the interval approach in the query (e.g., reading_time > now() - interval '7 days'). I don't want to move this code into the application and send Postgres a hard-coded timestamp; that would create a lot of messy PHP code that shouldn't be necessary.
Suggestions on how to encourage Postgres to use the partial index instead? My plan is to redefine the date range of the partial index nightly (drop index --> create index), but that seems a bit silly if Postgres isn't going to use it.
Thanks,
Alex
Generally speaking, an index can be used when the indexed column(s) are compared to constants (literal values), to function calls that are marked at least STABLE (which means that within a single statement, multiple calls of the function with the same parameters will produce the same result), or to combinations of those.
now() (which is an alias of current_timestamp) is marked as STABLE, and timestamp_mi_interval() (the function backing the <timestamp> - <interval> operator) is marked as IMMUTABLE, which is even stricter than STABLE (now(), current_timestamp and transaction_timestamp() mark the start of the transaction, statement_timestamp() marks the start of the statement -- still STABLE -- but clock_timestamp() gives the timestamp as seen on a clock and is therefore VOLATILE).
So in theory, the WHERE reading_time > now() - interval '7 days' should be able to use an index on the reading_time column. And it really does. But, since you defined a partial index, the planner needs to prove the following:
However, keep in mind that the predicate must match the conditions used in the queries that are supposed to benefit from the index. To be precise, a partial index can be used in a query only if the system can recognize that the WHERE condition of the query mathematically implies the predicate of the index. PostgreSQL does not have a sophisticated theorem prover that can recognize mathematically equivalent expressions that are written in different forms. (Not only is such a general theorem prover extremely difficult to create, it would probably be too slow to be of any real use.) The system can recognize simple inequality implications, for example "x < 1" implies "x < 2"; otherwise the predicate condition must exactly match part of the query's WHERE condition or the index will not be recognized as usable. Matching takes place at query planning time, not at run time.
And that is what is happening with your query, which has reading_time > now() - interval '7 days' in its WHERE clause. By the time now() - interval '7 days' is evaluated, planning has already happened, and PostgreSQL couldn't prove that the index predicate (reading_time > '2017-02-26 00:00:00+00') will be satisfied. But when you used reading_time > '2017-02-27', it could prove that.
You could "guide" the planner with constant values, like this:
where gauge_id = 208
and reading_time > '2017-02-26 00:00:00+00'
and reading_time > now() - interval '7 days'
This way the planner realizes that it can use the partial index, because indexed_col > index_condition AND indexed_col > something_else implies that indexed_col will be larger than (at least) index_condition. Maybe it will be larger than something_else too, but that doesn't matter for using the index.
I'm not sure if that is the answer you were looking for though. IMHO, if you have a really large amount of data (and PostgreSQL 9.5+) a single BRIN index might suit your needs better.
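For reference, such a BRIN index would look like this (a sketch; it is tiny compared to a B-tree and works well when reading_time correlates with the physical row order, as it typically does for append-only readings):
CREATE INDEX precip_reading_time_brin ON public.precip USING brin (reading_time);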
Queries are planned and then cached for possible later use, and that plan includes the choice of indexes. Since your query computes the cutoff with now(), whose value is not a constant known at planning time, the partial index cannot be used: the planner has no certainty about what the expression will return and thus whether it will fall within the partial index's predicate. Any human reading the query can see that the partial index would match, but the planner is not smart enough to know what now() does; all it knows is that the value is not fixed until the query runs.
A better solution in your case would be to partition the table into smaller chunks based on the reading_time. Properly crafted queries will then only access a single partition.
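A rough sketch of that idea with declarative partitioning (PostgreSQL 10+ syntax; on the 9.x versions discussed in this question the same layout would need inheritance plus triggers, and the existing data would have to be migrated):
CREATE TABLE public.precip (
    gauge_id     smallint,
    inches       numeric(8, 2),
    reading_time timestamp with time zone
) PARTITION BY RANGE (reading_time);

-- one partition per month; queries on the last 7 days touch at most two partitions
CREATE TABLE precip_2017_02 PARTITION OF public.precip
    FOR VALUES FROM ('2017-02-01') TO ('2017-03-01');
CREATE TABLE precip_2017_03 PARTITION OF public.precip
    FOR VALUES FROM ('2017-03-01') TO ('2017-04-01');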