Retrieving multiple values from a large jsonb field faster (postgresql 9.4) - postgresql

tl;dr
Using PostgreSQL 9.4, is there a way to retrieve multiple values from a jsonb field in one call, as you would with the imaginary function:
jsonb_extract_path(x, ARRAY['a_dictionary_key', 'a_second_dictionary_key', 'a_third_dictionary_key'])
The hope is to speed up the otherwise almost linear time required to select multiple values (1 value = 300 ms, 2 values = 450 ms, 3 values = 600 ms).
Background
I have the following jsonb table:
CREATE TABLE "public"."analysis" (
"date" date NOT NULL,
"name" character varying (10) NOT NULL,
"country" character (3) NOT NULL,
"x" jsonb,
PRIMARY KEY(date,name)
);
It has roughly 100 000 rows, where each row has a jsonb dictionary with 90+ keys and corresponding values. I'm trying to write an SQL query that selects a few (< 10) keys and their values reasonably quickly (< 500 ms).
Index and querying: 190ms
I started by adding an index:
CREATE INDEX ON analysis USING GIN (x);
This makes querying based on values in the "x" dictionary fast, such as this:
SELECT date, name, country FROM analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100;
This takes ~190 ms (acceptable for us)
Retrieving dictionary values
However, once I start adding keys to return in the SELECT list, execution time rises almost linearly:
1 value: 300ms
select jsonb_extract_path(x, 'a_dictionary_key') from analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100;
Takes 366ms (+175ms)
select x#>'{a_dictionary_key}' as gear_down_altitude from analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100 ;
Takes 300ms (+110ms)
3 values: 600ms
select jsonb_extract_path(x, 'a_dictionary_key'), jsonb_extract_path(x, 'a_second_dictionary_key'), jsonb_extract_path(x, 'a_third_dictionary_key') from analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100;
Takes 600ms (+410, or +100 for each value selected)
select x#>'{a_dictionary_key}' as a_dictionary_key, x#>'{a_second_dictionary_key}' as a_second_dictionary_key, x#>'{a_third_dictionary_key}' as a_third_dictionary_key from analysis where date > '2014-01-01' and date < '2014-05-01' and cast(x#>> '{a_dictionary_key}' as float) > 100 ;
Takes 600ms (+410, or +100 for each value selected)
Retrieving more values faster
Is there a way to retrieve multiple values from a jsonb field in one call, as you would with the imaginary function:
jsonb_extract_path(x, ARRAY['a_dictionary_key', 'a_second_dictionary_key', 'a_third_dictionary_key'])
This could possibly speed up these lookups. It could return the values either as columns, as a list/array, or even as a json object.
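One built-in option that might be worth timing (a hedged sketch, not something benchmarked on this data): PostgreSQL 9.4's jsonb_to_record() expands several keys into columns in a single call. The column list below simply reuses the placeholder key names from above and assumes the values are numeric; it may or may not end up faster, since each value still has to be extracted from the same jsonb datum.
select a.date, a.name, r.*
from analysis a,
     jsonb_to_record(a.x) as r(a_dictionary_key float,
                               a_second_dictionary_key float,
                               a_third_dictionary_key float)
where a.date > '2014-01-01' and a.date < '2014-05-01'
  and r.a_dictionary_key > 100;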
Retrieving an array using PL/Python
Just for the heck of it I made a custom function using PL/Python, but that was much slower (5s+), possibly due to json.loads:
CREATE OR REPLACE FUNCTION retrieve_objects(data jsonb, k VARCHAR[])
RETURNS TEXT[] AS $$
    # jsonb arguments arrive in PL/Python as text, so they have to be parsed first
    if not data:
        return []
    import simplejson as json
    j = json.loads(data)
    l = []
    for i in k:
        l.append(j[i])
    return l
$$ LANGUAGE plpython2u;
# Usage:
# select retrieve_objects(x, ARRAY['a_dictionary_key', 'a_second_dictionary_key', 'a_third_dictionary_key']) from analysis where date > '2014-01-01' and date < '2014-05-01'
Update 2015-05-21
I re-implemented the table using hstore with a GIN index and the performance is almost identical to using jsonb, i.e. not helpful in my case.

You're using the #> operator, which looks like it performs a path search. Have you tried a normal -> lookup? Like:
select json_column->'json_field1'
, json_column->'json_field2'
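For instance, against the table in the question (a hedged illustration of the same idea, keeping the original filter):
select x->'a_dictionary_key'
     , x->'a_second_dictionary_key'
     , x->'a_third_dictionary_key'
from analysis
where date > '2014-01-01' and date < '2014-05-01'
  and cast(x#>> '{a_dictionary_key}' as float) > 100;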
It would be interesting to see what happened if you used a temporary table. Like:
create temporary table tmp_doclist (doc jsonb)
;
insert into tmp_doclist
(doc)
select x
from analysis
where ... your conditions here ...
;
select doc->'col1'
, doc->'col2'
, doc->'col3'
from tmp_doclist
;
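A hedged, filled-in variant of the same idea, using the filter from the question and create temporary table ... as for brevity:
create temporary table tmp_doclist as
select x as doc
from analysis
where date > '2014-01-01' and date < '2014-05-01'
  and cast(x#>> '{a_dictionary_key}' as float) > 100;

select doc->'a_dictionary_key'
     , doc->'a_second_dictionary_key'
     , doc->'a_third_dictionary_key'
from tmp_doclist;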

This is hard to test without the data.
Create a custom type
create type my_query_result_type as (
    a_dictionary_key float,
    a_second_dictionary_key float
);
And your query
select (json_populate_record(null::my_query_result_type, x::json)).* from analysis;
You should be able to use a temporary table instead of a type, which can be created at runtime, making your query dynamic.
But first check whether this helps from a performance point of view.
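On 9.4 there is also jsonb_populate_record(), which avoids the jsonb-to-json cast; a hedged variant of the query above, with the question's filter added:
select (jsonb_populate_record(null::my_query_result_type, x)).*
from analysis
where date > '2014-01-01' and date < '2014-05-01'
  and cast(x#>> '{a_dictionary_key}' as float) > 100;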

Related

PostgreSQL - Comparing a value based on a condition (<, >, =) written in a column?

I have this sample database:
Table 1:
Type  Condition_weight  Price
A     >50               1000
A     >10 & <50         500
A     <10               100
As I remember, I can do a comparison on Condition_weight without doing too much in the query.
My expectation query is something like this:
select Price from Table_1
where Type = 'A'
and {my_input_this_is a number} satisfies Condition_weight
I read about this solution somewhere but can't find it again.
You can create a function that returns true when the condition is satisfied - you will have to write the logic to extract the min and max and compare the value against them.
pseudo code...
CREATE FUNCTION conditionWeightIsSatisfied(number weight)
BEGIN
    set #minValue = 0;
    set #maxValue = 1000;
    ..... do your conversion of the text values .....
    select weight >= #minValue and weight <= #maxValue
END
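A minimal PL/pgSQL sketch of that idea, assuming Condition_weight only ever takes the three forms shown in the sample data ('>N', '>N & <M', '<N'); the function name and the bound handling here are my own:
CREATE OR REPLACE FUNCTION condition_weight_is_satisfied(cond text, weight numeric)
RETURNS boolean AS $$
DECLARE
    min_value numeric := substring(cond from '>\s*(\d+)')::numeric;  -- NULL when there is no '>'
    max_value numeric := substring(cond from '<\s*(\d+)')::numeric;  -- NULL when there is no '<'
BEGIN
    RETURN (min_value IS NULL OR weight > min_value)
       AND (max_value IS NULL OR weight < max_value);
END;
$$ LANGUAGE plpgsql IMMUTABLE;
-- Usage, mirroring the expected query from the question:
-- SELECT Price FROM Table_1
-- WHERE Type = 'A' AND condition_weight_is_satisfied(Condition_weight, 30);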

How can I query whether a value lies before or after a range?

I have a Postgres range and a value, and want to be able to determine if the value lies before, within, or after the range.
Determining if the value lies within the range is trivial:
SELECT '[1,10]'::int4range #> 3; -- t
But looking at range functions and operators, the #> operator is the only one I see that doesn't require both operands to be ranges, so determining whether the value lies before or after the range is not as straightforward.
I'm currently constructing a trivial range to represent my value (inclusive endpoints, both equal to the value), and then using << (strictly left of) and >> (strictly right of):
SELECT '[1,10]'::int4range << '[11,11]'::int4range; -- t
SELECT '[1,10]'::int4range >> '[-3,-3]'::int4range; -- t
This works, but having to construct this trivial range representing a single discrete value just so I can use the << and >> operators feels a bit kludge-y to me. Is there some built-in function or operator I'm overlooking that would allow me to do these queries using the value directly?
I considered and rejected an approach based on using lower(range) > value and upper(range) < value, as that doesn't account for the inclusivity/exclusivity of the range's bounds.
I'm using Postgres 9.6.5, but it doesn't look like anything has changed in this regard in Postgres 10.
I [...] rejected an approach based on using lower(range) > value and upper(range) < value, as that doesn't account for the inclusivity/exclusivity of the range's bounds.
I am not sure what you mean by that. lower() and upper() do account for inclusive/exclusive bounds: lower('(1,10]'::int4range) returns 2 and lower('[1,10]'::int4range) returns 1.
It seems to me creating an operator for this would be quite easy:
Create two functions to compare an int to an int4range:
create function int_smaller_than_range(p_value int, p_check int4range)
returns boolean
as
$$
select p_value < lower(p_check);
$$
language sql;
create function int_greater_than_range(p_value int, p_check int4range)
returns boolean
as
$$
select p_value > upper(p_check);
$$
language sql;
Then create the operators:
create operator < (
procedure = int_smaller_than_range,
leftarg = int,
rightarg = int4range,
negator = >
);
create operator > (
procedure = int_greater_than_range,
leftarg = int,
rightarg = int4range,
negator = <
);
This can now be used like this:
select 4 > int4range(5,10); -> false
select 4 < int4range(4,10,'[]'); -> false
select 4 < int4range(4,10,'(]'); -> true
select 5 > int4range(4,10,'[]'); -> false
select 11 > int4range(4,10,'[]'); -> false
select 11 > int4range(4,10,'[)'); -> true
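One caveat worth noting (my own observation, not part of the original answer): discrete ranges such as int4range are canonicalized to the '[)' form, so upper() returns the first value after the range:
select upper(int4range(4,10,'[]'));  -- 11, because the range is stored as [4,11)
That is why 11 > int4range(4,10,'[]') returns false with the operator above; using p_value >= upper(p_check) in int_greater_than_range would report 11 as lying after the range.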

PostgreSQL - sort by UUID version 1 timestamp

I am using UUID version 1 as the primary key. I would like to sort on UUID v1 timestamp. Right now if I do something like this:
SELECT id, title
FROM table
ORDER BY id DESC;
PostgreSQL does not sort records by UUID timestamp, but by UUID string representation, which ends up with unexpected sorting result in my case.
Am I missing something, or is there no built-in way to do this in PostgreSQL?
The timestamp is one of the parts of a v1 UUID. It is stored in hex format as the number of 100-nanosecond intervals since 1582-10-15 00:00. This function extracts the timestamp:
create or replace function uuid_v1_timestamp (_uuid uuid)
returns timestamp with time zone as $$
    select to_timestamp(
        (
            ('x' || lpad(h, 16, '0'))::bit(64)::bigint::double precision -
            122192928000000000
        ) / 10000000
    )
    from (
        select
            substring (u from 16 for 3) ||
            substring (u from 10 for 4) ||
            substring (u from 1 for 8) as h
        from (values (_uuid::text)) s (u)
    ) s;
$$ language sql immutable;
select uuid_v1_timestamp(uuid_generate_v1());
uuid_v1_timestamp
-------------------------------
2016-06-16 12:17:39.261338+00
122192928000000000 is the number of 100-nanosecond intervals between the start of the Gregorian calendar (1582-10-15) and the Unix epoch (1970-01-01).
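A quick sanity check of that constant (my own arithmetic, expressed in SQL):
select (date '1970-01-01' - date '1582-10-15')::bigint  -- 141427 days
       * 86400                                          -- seconds per day
       * 10000000;                                      -- 100 ns units per second
-- => 122192928000000000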
In your query:
select id, title
from t
order by uuid_v1_timestamp(id) desc
To improve performance an index can be created on that:
create index uuid_timestamp_ndx on t (uuid_v1_timestamp(id));

PostgreSQL hierarchical nested set huge database

I have a database that must store thousands of scenarios (each scenario with a single unix_timestamp value). Each scenario has 1,800,000 records organized in a Nested Set structure.
The general table structure is given by:
table_skeleton:
- unix_timestamp integer
- lft integer
- rgt integer
- value
Usually, my SELECTs retrieve all nested values within a specific scenario, for example:
SELECT * FROM table_skeleton WHERE unix_timestamp = 123 AND lft >= 10 AND rgt <= 53
So I hierarchically divided my table into master / children within groups of dates, for example:
table_skeleton_201303 inherits table_skeleton:
- unix_timestamp integer
- lft integer
- ...
and
table_skeleton_201304 inherits table_skeleton:
- unix_timestamp integer
- lft integer
- ...
I also created an index on each child table according to the usual search I am expecting, for example:
Create Index idx_201303
on table_skeleton_201303
using btree(unix_timestamp, lft, rgt)
It improved the retrieval, but it still takes about 1 minute for each select.
I imagined that this was because the index was too big to always fit in memory, so I tried to create a partial index for each timestamp, for example:
Create Index idx_201303_1362981600
on table_skeleton_201303
using btree(lft, rgt)
WHERE unix_timestamp = 1362981600
And in fact the second type of index created is much, much, much smaller than the general one. However, when I run an EXPLAIN ANALYZE for the SELECT I've previously shown here, the query planner ignores my new partial index and keeps using the giant old one.
Is there a reason for that?
Is there any new approach to optimize such type of huge nested set hierarchical database?
When you filter a table by field_a > x and field_b > y, an index on (field_a, field_b) will (actually just may, depending on the distribution and the percentage of rows with field_a > x, as per the collected statistics) only be used for "field_a > x", and field_b > y will be checked against each row the index returns.
In the case above, two indexes, one for each field, could be used and their results joined, the internal equivalent of:
SELECT *
FROM table t
JOIN (
    SELECT id FROM table WHERE field_a > x) ta ON (ta.id = t.id)
JOIN (
    SELECT id FROM table WHERE field_b > y) tb ON (tb.id = t.id);
There is a chance you could benefit from a GiST index, treating your lft and rgt fields as a point:
CREATE INDEX ON table USING GIST (unix_timestamp, point(lft, rgt));
SELECT * FROM table
WHERE unix_timestamp = 123 AND
      point(lft, rgt) <@ box(point(10,'-inf'), point('inf',53));
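One prerequisite worth flagging (my addition, not from the original answer): a multi-column GiST index that includes a plain integer column such as unix_timestamp needs the btree_gist extension, which provides GiST operator classes for scalar types:
CREATE EXTENSION IF NOT EXISTS btree_gist;
CREATE INDEX ON table_skeleton USING GIST (unix_timestamp, point(lft, rgt));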

Postgresql partial index on date to query sub period

On some very big tables, one of the most common indexes is on a date field (e.g. date of insert, or date of operation).
But when the table is very large, the index also tends to be large.
I thought I could create a partial index for each month on the table, so that the query on the 'current' month would be quicker.
ex.
CREATE TABLE test_index_partial
(
exint1 integer,
exint2 integer,
exdatetime1 timestamp without time zone
);
CREATE INDEX part_date_201503
ON test_index_partial USING btree
(exdatetime1)
WHERE exdatetime1 >= '20150301 000000' and exdatetime1 < '20150401 000000' ;
For my tests, I populated the table with the following statement:
INSERT INTO test_index_partial SELECT sint1, sint2, th
from generate_series(0, 100, 1) sint1,
generate_series(0, 100, 1) sint2,
generate_series(date_trunc('hour', now()) - interval '90 day', now(), '1 day') th ;
When I perform a select with the exact condition, it uses the partial index. But after creating an 'all_dates' index (not partial) on the date field, it preferred the whole index.
What I wanted: if I select a narrower period (e.g. a five-day period), I want PostgreSQL to use the smaller index. In my example, I'd want it to choose the index for the month instead of the whole-table index.
But it seems PostgreSQL does not infer that the condition
exdatetime1 >= '20150310 000000' and exdatetime1 <'20150315 000000'
is contained in the condition of the index
exdatetime1 >= '20150301 000000' and exdatetime1 <'20150401 000000'
I use PostgreSQL 9.3. Is this a limitation of partial indexes in this version, or is there another way to write it?