Postgresql partial index on date to query sub period - postgresql

On some very big tables, one of the most common indexes are on date field (e.g. date of insert, or date of operation).
But when the table is very large, the index tend also to but large.
I thought I could create a partial index for each month on the table, so that the query on the 'current' month would be quicker.
ex.
CREATE TABLE test_index_partial
(
exint1 integer,
exint2 integer,
exdatetime1 timestamp without time zone
);
CREATE INDEX part_date_201503
ON test_index_partial USING btree
(exdatetime1)
WHERE exdatetime1 >= '20150301 000000' and exdatetime1 < '20150401 000000' ;
For my tests, I fed a table with the following instruction :
INSERT INTO test_index_partial SELECT sint1, sint2, th
from generate_series(0, 100, 1) sint1,
generate_series(0, 100, 1) sint2,
generate_series(date_trunc('hour', now()) - interval '90 day', now(), '1 day') th ;
When I perform a select with the exact condition, it uses the partial index. But after creating a 'all_dates' index, not partial, on the date field, it prefered the whole index.
What I wanted : if I select with a more narrow period (ex., a five day period), I want postgresql to get the tinyer index. In my example : I'd want him to choose the index of the month instead of the whole index on all the table.
But it seems Postgresql don't assume that the condition :
exdatetime1 >= '20150310 000000' and exdatetime1 <'20150315 000000'
is contained in the condition of the index
exdatetime1 >= '20150301 000000' and exdatetime1 <'20150401 000000'
I use postgresql 9.3, is it a limitation of partial index of this version ? or do I have other way to write it ?

Related

Range contains element filter not using index of the element column

I have a table with a timestamp column that has a btree index.
CREATE TABLE tstest AS
SELECT '2019-11-10'::timestamp + random() * '10 day'::interval ts
FROM generate_series(1,10000) s;
CREATE INDEX ON tstest(ts);
I would like to find all row between a timerange. Both ends of the range can be "infinite"/null, start of the range must be exclusive and end is inclusice. So this forms a query:
SELECT * FROM tstest WHERE ts <# tsrange($1, $2, '(]');
The result is correct but the index of the ts column is not used and a seq scan is done instead.
To use the index correctly I have to do the query like this:
SELECT * FROM tstest WHERE ($1 IS null OR $1 < ts) AND ($2 IS null OR ts <= $2);
I like the <# syntax more. It is easier to understand and it is shorter.
Is there something I can do differently to utilize the index and make the query faster? Maybe a different type of index instead?
I have also tried to add a gist index for ts column using the btree_gist module but that didn't change the situation.
I have tested this with PostgreSQL 9.6 and 12.0.
A btree index (which you are using) doesn't support the <# operator and a GiST index wouldn't help because the timestamp column isn't a range type.
But you can still make the btree index usable by using coalesce and infinity which removes the use of the OR condition:
SELECT *
FROM tstest
WHERE ts > coalesce($1, '-infinity')
and ts <= coalesce($2, 'infinity');

Redshift: milliseconds to timestamp

Let us say we have have two tables:
CREATE TABLE IF NOT EXISTS tech_time(
ms_since_epoch BIGINT
);
CREATE TABLE IF NOT EXISTS readable_time(
ts timestamp without time zone,
);
Let us say tech_time has data and we would like to populate readable_time.
So in Postgres you could use to_timestamp(double precision) and do something like
INSERT INTO readable_time(ts)
SELECT DISTINCT to_timestamp(ms_since_epoch::float / 1000) AS ts,
FROM tech_time;
No such function seems to exist in Amazon Redshift:
function to_timestamp(double precision) does not exist
My question is: how do I properly populate readable_time, while losing the least amount of precision?
We can try using DATEADD and add the ms_since_epoch to January 1, 1970:
INSERT INTO readable_time (ts)
SELECT DATEADD(ms, ms_since_epoch, 'epoch')
FROM tech_time;

PostgreSQL using different index for same query

I have a SQL query which is using inner join on two tables and filtering data based on several params. Going by the query plan, for different values of query params (like different date range), Postgres is using different index.
I am aware of the fact that Postgres determines if the index has to be used or not, depending on the number or rows in the result set. But why does Postgres choose to use different index for same query. The query time varies by a factor of 10, between the two cases. How can I optimise the query? As Postgres does not allows the user to define the index to be used in a query.
Edit:
explain (analyze, buffers, verbose) SELECT COUNT(*) FROM "bookings" INNER JOIN "hotels" ON "hotels"."id" = "bookings"."hotel_id" WHERE "bookings"."hotel_id" = 37016 AND (bookings.status in (0,1,2,3,4,5,6,7,9,10,11,12)) AND (bookings.source in (0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70) or bookings.status in (0,1,2,3,4,5,6,7,8,9,10,11,13)) AND (
bookings.source in (4,66,65)
OR
date(timezone('+05:30',bookings.created_at))>checkin
OR
(
( date(timezone('+05:30',bookings.created_at))=checkin
and
extract (epoch from COALESCE(cancellation_time,NOW())-bookings.created_at)>600
)
OR
( date(timezone('+05:30',bookings.created_at))<checkin
and
extract (epoch from COALESCE(cancellation_time,NOW())-bookings.created_at)>600
and
(
extract (epoch from ((bookings.checkin||' '||hotels.checkin_time)::timestamp -COALESCE(cancellation_time,bookings.checkin))) < extract(epoch from '16 hours'::interval)
OR
(DATE(bookings.checkout)-DATE(bookings.checkin))*(COALESCE(bookings.oyo_rooms,0)+COALESCE(bookings.owner_rooms,0)) > 3
)
)
)
) AND (bookings.checkin >= '2018-11-21') AND (bookings.checkin <= '2019-05-19') AND "bookings"."hotel_id" = '37016' AND "bookings"."status" IN (0, 1, 2, 3, 12);
QueryPlan : https://explain.depesz.com/s/SPeb
explain (analyze, buffers, verbose) SELECT COUNT(*) FROM "bookings" INNER JOIN "hotels" ON "hotels"."id" = 37016 WHERE "bookings"."hotel_id" = 37016 AND (bookings.status in (0,1,2,3,4,5,6,7,9,10,11,12)) AND (bookings.source in (0,1,2,3,4,5,6,7,8,9,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70) or bookings.status in (0,1,2,3,4,5,6,7,8,9,10,11,13)) AND (
bookings.source in (4,66,65)
OR
date(timezone('+05:30',bookings.created_at))>checkin
OR
(
( date(timezone('+05:30',bookings.created_at))=checkin
and
extract (epoch from COALESCE(cancellation_time,now())-bookings.created_at)>600
)
OR
( date(timezone('+05:30',bookings.created_at))<checkin
and
extract (epoch from COALESCE(cancellation_time,now())-bookings.created_at)>600
and
(extract (epoch from ((bookings.checkin||' '||hotels.checkin_time)::timestamp -COALESCE(cancellation_time,bookings.checkin))) < extract(epoch from '16 hours'::interval)
OR
(DATE(bookings.checkout)-DATE(bookings.checkin))*(COALESCE(bookings.oyo_rooms,0)+COALESCE(bookings.owner_rooms,0)) > 3
)
)
)
) AND (bookings.checkin >= '2018-11-22') AND (bookings.checkin <= '2019-05-19') AND "bookings"."hotel_id" = '37016' AND "bookings"."status" IN (0,1,2,3,4,12);
QueryPlan: https://explain.depesz.com/s/DWD
Finally found the solution to this problem. I am querying on the basis of more than 10 possible values of a column (status in this case). If I break this query into multiple sub-queries each querying upon only 1 status value and aggregate the result using union all, then the query plan executed uses optimized index for each subquery.
Results: The query time decreased by 10 times by this change.
Possible explanation for this behaviour, the query planner fetches less number of rows for each subquery and uses the optimized index in this case. I am not sure about if this is the correct explanation.

PostgreSQL gist index

I have a table with two date like dateTo and dateFrom, i would like use daterange approach in queries and a gist index, but it seem doesn't work. The table looks like:
CREATE TABLE test (
id bigeserial,
begin_date date,
end_date date
);
CREATE INDEX "idx1"
ON test
USING gist (daterange(begin_date, end_date));
Then when i try to explain a query like:
SELECT t.*
FROM test t
WHERE daterange(t.begin_date,t.end_date,'[]') && daterange('2015-12-30 00:00:00.0','2016-10-28 00:00:00.0','[]')
i get a Seq Scan.
Is this usage of gist index wrong, or is this scenario not feasible?
You have an index on the expression daterange(begin_date, end_date), but you query your table with daterange(begin_date, end_date, '[]') && .... PostgreSQL won't do math instead of you. To re-phrase your problem, it is like you're indexing (int_col + 2) and querying WHERE int_col + 1 > 2. Because the two expressions are different, the index will not be used in any circumstances. But as you can see, you can do the math (i.e. re-phrase the formula) sometimes.
You'll either need:
CREATE INDEX idx1 ON test USING gist (daterange(begin_date, end_date, '[]'));
Or:
CREATE INDEX idx2 ON test USING gist (daterange(begin_date, end_date + 1));
Note: both of them creates a range which includes end_date. The latter one uses the fact that daterange is discrete.
And use the following predicates for each of the indexes above:
WHERE daterange(begin_date, end_date, '[]') && daterange(?, ?, ?)
Or:
WHERE daterange(begin_date, end_date + 1) && daterange(?, ?, ?)
Note: the third parameter of the range constructor on the right side of && does not matter (in the context of index usage).

Retrieving multiple values from a large jsonb field faster (postgresql 9.4)

tl;dr
Using PSQL 9.4, is there a way to retrieve multiple values from a jsonb field, such as you would with the imaginary function:
jsonb_extract_path(x, ARRAY['a_dictionary_key', 'a_second_dictionary_key', 'a_third_dictionary_key'])
With the hope of speeding up the otherwise almost linear time required to select multiple values (1 value = 300ms, 2 values = 450ms, 3 values = 600ms)
Background
I have the following jsonb table:
CREATE TABLE "public"."analysis" (
"date" date NOT NULL,
"name" character varying (10) NOT NULL,
"country" character (3) NOT NULL,
"x" jsonb,
PRIMARY KEY(date,name)
);
With roughly 100 000 rows where each rows has a jsonb dictionary with 90+ keys and corresponding values. I'm trying to write an SQL query to select a few (< 10) key+values in a fairly quick way (< 500 ms)
Index and querying: 190ms
I started by adding an index:
CREATE INDEX ON analysis USING GIN (x);
This makes querying based on values in the "x" dictionary fast, such as this:
SELECT date, name, country FROM analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100;
This takes ~190 ms (acceptable for us)
Retrieving dictionary values
However, once I start adding keys to return in the SELECT part, execution time rises almost linear:
1 value: 300ms
select jsonb_extract_path(x, 'a_dictionary_key') from analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100;
Takes 366ms (+175ms)
select x#>'{a_dictionary_key}' as gear_down_altitude from analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100 ;
Takes 300ms (+110ms)
3 values: 600ms
select jsonb_extract_path(x, 'a_dictionary_key'), jsonb_extract_path(x, 'a_second_dictionary_key'), jsonb_extract_path(x, 'a_third_dictionary_key') from analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100;
Takes 600ms (+410, or +100 for each value selected)
select x#>'{a_dictionary_key}' as a_dictionary_key, x#>'{a_second_dictionary_key}' as a_second_dictionary_key, x#>'{a_third_dictionary_key}' as a_third_dictionary_key from analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100 ;
Takes 600ms (+410, or +100 for each value selected)
Retrieving more values faster
Is there a way to retrieve multiple values from a jsonb field, such as you would with the imaginary function:
jsonb_extract_path(x, ARRAY['a_dictionary_key', 'a_second_dictionary_key', 'a_third_dictionary_key'])
Which could possibly speed up these lookups. It can return them either as columns or as an list/array or even a json object.
Retrieving an array using PL/Python
Just for the heck of it I made a custom function using PL/Python, but that was much slower (5s+), possibly due to json.loads:
CREATE OR REPLACE FUNCTION retrieve_objects(data jsonb, k VARCHAR[])
RETURNS TEXT[] AS $$
if not data:
return []
import simplejson as json
j = json.loads(data)
l = []
for i in k:
l.append(j[i])
return l
$$ LANGUAGE plpython2u;
# Usage:
# select retrieve_objects(x, ARRAY['a_dictionary_key', 'a_second_dictionary_key', 'a_third_dictionary_key']) from analysis where date > '2014-01-01' and date < '2014-05-01'
Update 2015-05-21
I re-implemented the table using hstore with GIN index and the performance is almost identical to using jsonb, i.e not helpfull in my case.
You're using the #> operator, which looks like it performs a path search. Have you tried a normal -> lookup? Like:
select json_column->'json_field1'
, json_column->'json_field2'
It would be interesting to see what happened if you used a temporary table. Like:
create temporary table tmp_doclist (doc jsonb)
;
insert tmp_doclist
(doc)
select x
from analysis
where ... your conditions here ...
;
select doc->'col1'
, doc->'col2'
, doc->'col3'
from tmp_doclist
;
This is hard to test without the data.
Create a custom type
create type my_query_result_type (
a_dictionary_key float,
a_second_dictionary_key float
)
And your query
select (json_populate_record(null::my_query_result_type,j::json)).* from analysis;
You should be able to use a temporary table instead of type which will be created at, runtime making your query dynamic.
But first check it out if this helps form the performance point of view.