Optimizing Postgres JSONB query with not null constraint - postgresql

I've got a Postgres 9.4.4 database with 1.7 million records with the following information stored in a JSONB column called data in a table called accounts:
data: {
    "lastUpdatedTime": "2016-12-26T12:09:43.901Z",
    "UID": "2c5bb7fd-1a00-4988-8d92-ffaa52ebc20d",
    "data": {
        "country": "UK",
        "verified_at": "2017-01-01T23:49:10.217Z"
    }
}
The data format cannot be changed since this is legacy information.
I need to obtain all accounts where the country is UK, the verified_at value is not null and the lastUpdatedTime value is greater than some given value.
So far, I have the following query:
SELECT * FROM "accounts"
WHERE (data @> '{ "data": { "country": "UK" } }')
AND (data->'data' ? 'verified_at')
AND ((data->'data' ->> 'verified_at') is not null)
AND (data ->>'lastUpdatedTime' > '2016-02-28T05:49:08.511846')
ORDER BY data ->>'lastUpdatedTime' LIMIT 100 OFFSET 0;
And the following indexes:
"accounts_idxgin" gin (data)
"accounts_idxgin_on_data" gin ((data -> 'data'::text))
I've managed to get the query time down to about 1000 to 4000 ms.
Here is the EXPLAIN ANALYZE output for the query:
Bitmap Heap Scan on accounts (cost=41.31..6934.50 rows=9 width=1719)
(actual time=7.273..1067.657 rows=23190 loops=1)
Recheck Cond: ((data -> 'data'::text) ? 'verified_at'::text)
Filter: ((((data -> 'data'::text) ->> 'verified_at'::text) IS NOT NULL)
AND ((data ->> 'lastUpdatedTime'::text) > '2016-02-01 05:49:08.511846'::text)
AND (((data -> 'data'::text) ->> 'country'::text) = 'UK'::text))
Rows Removed by Filter: 4
Heap Blocks: exact=16039
-> Bitmap Index Scan on accounts_idxgin_on_data (cost=0.00..41.30 rows=1773 width=0)
(actual time=4.618..4.618 rows=23194 loops=1)
Index Cond: ((data -> 'data'::text) ? 'verified_at'::text)
Planning time: 0.448 ms
Execution time: 1069.344 ms
(9 rows)
I have the following questions:
Is there anything I can do to further speed up this query?
What is the correct way to speed up a field is not null query with JSONB? I ended up using the existence operator with (data->'data' ? 'verified_at') to filter out a large number of non-matching records, because much of my data doesn't have verified_at as a top level key. This increased the speed of the query, but I'm wondering if there's a general approach to optimizing this type of query.
In order to use the existence operator with (data->'data' ? 'verified_at'), I needed to add another index on ((data -> 'data'::text)). I already had an index on gin (data), but the existence operator didn't use this. Why is that? I thought the existence and containment operators would use this index.

3: Not really. This case is explicitly mentioned in the docs.
When you have an index on the column data, it is only used when you query your table like data @> '...' or data ? '...'. When you have an index on the expression (data -> 'data'), these queries can take advantage of it: (data -> 'data') @> '...' or (data -> 'data') ? '...'.
2: usual jsonb indexes won't help during a (jsonb_col -> '<key>') is [not] null query at all. And unfortunately, you cannot use jsonb_col @> '{"<key>":null}' either, because the JSON object might lack the key entirely. Also reverse use of the index (for is not null) is not possible at all. But there may be a trick...
1: Not much. There may be some improvements, but don't expect huge performance advantages. So here they go:
You can use the jsonb_path_ops operator class instead of the (default) jsonb_ops. This should mean a small improvement in performance, but jsonb_path_ops indexes cannot support the existence operator (?). We won't need it anyway, though.
You have a single, index-unfriendly, boolean-typed expression, which slows you down. Thankfully you can use a partial index here, if you are only interested in the rows where it is true.
So, your index should look something like this:
create index accounts_idxgin_on_data
on accounts using gin ((data -> 'data') jsonb_path_ops)
where (data -> 'data' ->> 'verified_at') is not null;
With this index, you can use the following query:
select *
from accounts
where (data -> 'data') @> '{"country":"UK"}'
and (data -> 'data' ->> 'verified_at') is not null
and (data ->> 'lastUpdatedTime') > '2016-02-28T05:49:08.511Z'
order by data ->>'lastUpdatedTime';
Note: for proper timestamp comparisons, you should use (data ->> 'lastUpdatedTime')::timestamptz > '2016-02-28T05:49:08.511Z'.
http://rextester.com/QWUW41874
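One additional idea beyond that answer: if you often filter and sort on lastUpdatedTime, a B-tree expression index on the cast value can help. The text-to-timestamptz cast is only STABLE (it depends on session settings such as the time zone), so Postgres rejects it directly in an index expression; you would need a wrapper function you declare IMMUTABLE yourself, which is only safe if every stored value has one consistent format. A sketch with a hypothetical wrapper:
CREATE FUNCTION jsonb_ts(j jsonb, key text) RETURNS timestamptz AS
$$ SELECT (j ->> key)::timestamptz $$
LANGUAGE sql IMMUTABLE;  -- assumption: all values are consistent ISO-8601 strings

CREATE INDEX accounts_idx_last_updated
ON accounts (jsonb_ts(data, 'lastUpdatedTime'));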

After playing around a bit more, I've managed to reduce my query time from around 1000ms to 350ms by creating the following partial index:
CREATE INDEX index_accounts_partial_on_verified_at
ON accounts ((data->'data'->'verified_at'))
WHERE (data->'data'->>'verified_at') IS NOT NULL
AND (data->'data' ? 'verified_at')
AND (data->'data'->>'country' = 'UK');
I was able to hardcode some of the values in this index, such as country=UK because I only need to consider UK accounts for this query. I was also able to remove the index on ((data->'data')) which was 258MB, and replace it with the partial index which is only 1360 kB!
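One caveat with partial indexes: the planner only considers them when the query's WHERE clause implies the index predicate, so the query has to repeat those conditions, along the lines of:
SELECT * FROM accounts
WHERE (data->'data'->>'verified_at') IS NOT NULL
AND (data->'data' ? 'verified_at')
AND (data->'data'->>'country' = 'UK')
AND (data->>'lastUpdatedTime') > '2016-02-28T05:49:08.511846'
ORDER BY data->>'lastUpdatedTime' LIMIT 100;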
For anyone interested, I found the details for building a partial JSONB index from here

Use the path access operator for faster access to lower-level objects:
SELECT * FROM "accounts"
WHERE data #>> '{data, country}' = 'UK'
AND data #>> '{data, verified_at}' IS NOT NULL
AND data ->> 'lastUpdatedTime' > '2016-02-28T05:49:08.511846'
ORDER BY data ->> 'lastUpdatedTime' LIMIT 100 OFFSET 0;
The index only works on the top-level key. So, with an index on the column data, queries like data @> '...' or data ? 'key' are supported. However, for a query like data -> 'data' ? 'verified_at' you need an index on the expression (data -> 'data').
Two more points:
I don't think it is necessary to test for the presence of verified_at. If it is not there, it simply comes out as NULL, so it gets caught by the same test.
Comparing string representations of timestamp values may work if the JSON value is properly and consistently formatted. Cast to timestamp to be on the safe side.
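A minimal sketch of that last point, assuming every stored value parses cleanly as a timestamp:
SELECT * FROM "accounts"
WHERE data #>> '{data, country}' = 'UK'
AND data #>> '{data, verified_at}' IS NOT NULL
AND (data ->> 'lastUpdatedTime')::timestamptz > '2016-02-28T05:49:08.511846'
ORDER BY (data ->> 'lastUpdatedTime')::timestamptz LIMIT 100 OFFSET 0;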

Related

How to use an index when using jsonb_array_elements in Postgres

I have the following table structure:
create table public.listings (id varchar(255) not null, data jsonb not null);
And the following indexes:
create index listings_data_index on public.listings using gin(data jsonb_ops);
create unique index listings_id_index on public.listings(id);
alter table public.listings add constraint listings_id_pk primary key(id);
With this row:
id | data
1 | {"attributes": {"ccid": "123", "listings": [{"vin": "1234","body": "Sleeper", "make": "International"}, { "vin": "5678", "body": "Sleeper", "make": "International" }]}}
The use case needs to retrieve a specific item inside the listings array that matches a specific vin.
I am accomplishing that with the following query:
SELECT elems
FROM public.listings, jsonb_array_elements(data->'attributes'->'listings') elems
WHERE id = '1' AND elems->'vin' ? '1234';
The output is what I need:
{"vin": "1234","body": "Sleeper", "make": "International"}
Now I am in the phase of optimizing this query, since there will be millions of rows, and up to 100K items inside the listings array.
When I run EXPLAIN over that query, it shows this:
Nested Loop (cost=0.01..2.53 rows=1 width=32)
-> Seq Scan on listings (cost=0.00..1.01 rows=1 width=32)
Filter: ((id)::text = '1'::text)
-> Function Scan on jsonb_array_elements elems (cost=0.01..1.51 rows=1 width=32)
Filter: ((value -> 'vin'::text) ? '1234'::text)
I wonder what the right way to construct an index for this would be, or whether I need to modify the query to a more efficient one.
Thank you!
First: with a table as small as that, you will never see PostgreSQL use an index. You need to try with realistic amounts. Second: while PostgreSQL will happily use an index for the condition on id, it can never use an index for such a JSON search, no matter how you write it.
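What an index can help with, though, is narrowing down the candidate rows before the array is unnested, by adding a containment condition that a jsonb_path_ops GIN index supports. A sketch of that approach (not part of the original answer; the index name is made up):
CREATE INDEX listings_data_path_idx
ON public.listings USING gin (data jsonb_path_ops);

SELECT elems
FROM public.listings,
     jsonb_array_elements(data -> 'attributes' -> 'listings') elems
WHERE data @> '{"attributes": {"listings": [{"vin": "1234"}]}}'  -- indexable containment pre-filter
  AND elems ->> 'vin' = '1234';                                  -- exact per-element filter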

Index created for PostgreSQL jsonb column not utilized

I have created an index for a field in a jsonb column as:
create index on Employee using gin ((properties -> 'hobbies'))
The generated statement is:
CREATE INDEX employee_expr_idx ON public.employee USING gin (((properties -> 'hobbies'::text)))
My search query has the following structure:
SELECT * FROM Employee e
WHERE e.properties @> '{"hobbies": ["trekking"]}'
AND e.department = 'Finance'
Running the EXPLAIN command for this query gives:
Seq Scan on employee e (cost=0.00..4452.94 rows=6 width=1183)
Filter: ((properties @> '{"hobbies": ["trekking"]}'::jsonb) AND (department = 'Finance'::text))
Going by this, I am not sure if the index is getting used for the search.
Is this entire setup ok?
The expression you use in the WHERE clause must match the expression in the index exactly. Your index uses the expression ((properties -> 'hobbies'::text)), but your query only uses e.properties on the left-hand side.
To make use of that index, your WHERE clause needs to use the same expression as was used in the index:
SELECT *
FROM Employee e
WHERE (properties -> 'hobbies') @> '["trekking"]'
AND e.department = 'Finance'
However: your execution plan shows that the table employee is really tiny (rows=6). With a table as small as that, a Seq Scan is always going to be the fastest way to retrieve data, no matter what kind of indexes you define.
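On a tiny test table you can still confirm the index is usable at all by temporarily discouraging sequential scans (a standard diagnostic, not a production setting):
SET enable_seqscan = off;
EXPLAIN SELECT * FROM Employee e
WHERE (e.properties -> 'hobbies') @> '["trekking"]'
AND e.department = 'Finance';
RESET enable_seqscan;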

How to efficiently index multiple nested numbers in JSONB structure with PostgreSQL for efficient comparison operations? (optional: SQLAlchemy)

I want to make sure PostgreSQL indexing works properly (B-tree, to allow efficient greater-than / less-than operations on numbers) on multiple nested numbers inside a JSONB column. The JSONB "data" column would look as follows:
data: {
    a: {n: 1000, str: 'blabla'},
    b: {n: 2000, str: 'blabla'},
    c: {n: 3000, str: 'blabla'},
    d: {n: 4000, str: 'blabla'},
    ...[we can assume 10 such nested dicts]
}
Where I would select rows based on combinations of multiple nested numbers, ex:
WHERE data['a']['n'] == 1000
AND data['b']['n'] == 2000
AND data['c']['n'] >= 3000
AND data['d']['n'] <= 4000
and adding multiple ORDER BYs such as:
ORDER BY DESC(data['a']['n']) + DESC(data['b']['n']) etc.
to achieve ordering based on the a, b, c, d hierarchy and nested numbers 'n' in ascending or descending order.
I've put some code below, but I can't tell if the indexing is working as expected, and I'm wondering if this is the right way or if there's a better way to achieve this (ideally using JSONB).
I'm using PostgreSQL 11 (with SQLAlchemy ORM), so the table and index declaration look as per below:
class TableWithJSONB(db.Base):
    __tablename__ = 'tablewithjsonb'
    id = Column(Integer, primary_key=True)
    data = Column(NestedMutable.as_mutable(JSONB), nullable=False)
    __table_args__ = (  # Adding Indexes
        # GIN using jsonb_path_ops => are these indexes useful?
        Index(
            'ix_data_a_gin',
            text("(data->'a') jsonb_path_ops"),
            postgresql_using='gin',
        ),
        Index(
            'ix_data_b_gin',
            text("(data->'b') jsonb_path_ops"),
            postgresql_using='gin',
        ),
        Index(
            'ix_data_c_gin',
            text("(data->'c') jsonb_path_ops"),
            postgresql_using='gin',
        ),
        ...
        # B-tree indexes on nested numbers
        Index(
            'ix_data_a_bTree',
            text("((data #> '{a, n}')::INTEGER) int4_ops"),
        ),
        Index(
            'ix_data_b_bTree',
            text("((data #> '{b, n}')::INTEGER) int4_ops"),
        ),
        Index(
            'ix_data_c_bTree',
            text("((data #> '{c, n}')::INTEGER) int4_ops"),
        ),
        ...
    )
After reading what I could find on the subject, I'm not sure if the b-Tree index actually works as expected for each nested numerical value inside JSONB. Also, I can't tell if the GIN jsonb_path_ops index makes any sense on the nested dicts a, b, c, d for the usage described above. Is this the right way or is there a better way?
UPDATE: I seem to have answered my own question. See dbfiddle here
Indexing nested numeric value in JSONB (with b-Tree index):
CREATE INDEX i_btree_a ON tablewithjsonb (((data #> '{a, n}')::INTEGER) int4_ops);
This successfully creates an index on the numeric value data['a']['n'] in the JSONB.
The index is used with queries such as:
explain analyze select * from tablewithjsonb
where (data #> '{a, n}')::INTEGER <= 10000;
Creating a combined index on multiple numeric values within the same JSONB works as well (in this particular case the index above (i_btree_a) would be redundant; searching on data['a']['n'] would use the index i_btree_a_b below instead):
CREATE INDEX i_btree_a_b ON tablewithjsonb
(((data #> '{a, n}')::INTEGER) int4_ops,
((data #> '{b, n}')::INTEGER) int4_ops);
...which would be used by queries such as:
explain analyze select * from tablewithjsonb
where (data #> '{a, n}')::INTEGER <= 10000 AND
      (data #> '{b, n}')::INTEGER <= 10000;
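In principle the same combined index can also serve the hierarchical ORDER BY from the question via a backward index scan, although the planner will only choose it when that is cheaper than sorting (untested sketch):
explain analyze select * from tablewithjsonb
order by (data #> '{a, n}')::INTEGER desc,
         (data #> '{b, n}')::INTEGER desc;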
Indexing nested string/text value in JSONB (with b-Tree index):
CREATE INDEX i_btree_s_a ON tablewithjsonb ((data #>> '{a, s}'));
The B-tree index will be used for equality (=) (Execution Time: 0.048 ms):
explain analyze select * from tablewithjsonb
where (data #>> '{a, s}') = 'blabla';
I expected it to be used for LIKE operations as well:
explain analyze select * from tablewithjsonb
where (data #>> '{a, s}') LIKE '%blabla%';
(Update: when I tried this separately, it went for a sequential scan instead of the index. Why?)
Likewise, the following goes for a sequential scan (Execution Time: 53.712 ms) (why?):
explain analyze select * from tablewithjsonb
where (data #>> '{a, s}') LIKE '%blabla 1 5%';
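The leading wildcard explains both scans: a B-tree can only support LIKE patterns anchored to a fixed prefix (e.g. LIKE 'blabla%'), so '%blabla%' and '%blabla 1 5%' can never use it. If substring search matters, a trigram index is the usual fix (a sketch; it assumes the pg_trgm extension is available, and the index name is made up):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX i_trgm_s_a ON tablewithjsonb
USING gin ((data #>> '{a, s}') gin_trgm_ops);
-- with this, LIKE '%blabla 1 5%' can use a bitmap index scan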
Indexing nested string/text value in JSONB (with GIN full text search index):
CREATE INDEX i_gin_ts_s_a ON tablewithjsonb
USING GIN (( to_tsvector('english', (data #>> '{a, s}')) ));
The GIN full-text search index will be used for queries such as:
explain analyze select * from tablewithjsonb where
to_tsvector('english', (data #>> '{a, s}')) @@ to_tsquery('blabla & 1 & 5:*');
(Execution Time: 34.845 ms)
I note that this last query (via GIN full-text search) is quite slow (why?), not far from the sequential scan mentioned above where Execution Time was 53.712 ms.
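To dig into where that time goes, comparing buffer usage between the full-text plan and the sequential scan is a reasonable first step:
explain (analyze, buffers) select * from tablewithjsonb
where to_tsvector('english', (data #>> '{a, s}')) @@ to_tsquery('blabla & 1 & 5:*');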

How to use jsonb index in postgres

My Postgres is 9.5+.
I have jsonb data in a column 'body':
{
    "id": "58cf96481ebf47eba351db3b",
    "JobName": "test",
    "JobDomain": "SAW",
    "JobStatus": "TRIGGERED",
    "JobActivity": "ACTIVE"
}
And I created indexes for the body and for one key:
CREATE INDEX scheduledjob_request_id_idx ON "ScheduledJob" USING gin ((body -> 'JobName'));
CREATE INDEX test_index ON "ScheduledJob" USING gin (body jsonb_path_ops)
These are my queries:
SELECT body FROM "ScheduledJob" WHERE body @> '{"JobName": "analytics_import_transaction_job"}';
SELECT body FROM "ScheduledJob" WHERE (body#>'{JobName}' = '"analytics_import_transaction_job"') LIMIT 10;
Both return the correct data, but neither uses an index.
I saw the explain:
-> Seq Scan on public."ScheduledJob" (cost=0.00..4.55 rows=1 width=532)
So I don't know why the index wasn't used, or how to use jsonb indexes correctly.
Update:
If I create the index before inserting data, the query can use the index.
But if I create the index after inserting the first data, the query
scans all records.
This is so strange; how can I make the index useful when I insert data first?
So, I did some research and tested this:
SELECT body FROM "ScheduledJob" WHERE (body#>'{JobName}' = '"analytics_import_transaction_job"') LIMIT 10;
This kind of query will never use the index.
And the index only becomes usable once the table holds enough data.
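If you do want the equality form, a B-tree expression index that matches it exactly would serve it (a sketch, not part of the original answer; the index name is made up):
CREATE INDEX scheduledjob_jobname_btree_idx
ON "ScheduledJob" ((body ->> 'JobName'));

SELECT body FROM "ScheduledJob"
WHERE body ->> 'JobName' = 'analytics_import_transaction_job'
LIMIT 10;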

Postgres jsonb query missing index?

We have the following json documents stored in our PG table (identities) in a jsonb column 'data':
{
    "email": {
        "main": "mainemail@email.com",
        "prefix": "aliasPrefix",
        "prettyEmails": ["stuff1", "stuff2"]
    },
    ...
}
I have the following index set up on the table:
CREATE INDEX ix_identities_email_main
ON identities
USING gin
((data -> 'email->main'::text) jsonb_path_ops);
What am I missing that is preventing the following query from hitting that index? It does a full seq scan on the table... We have tens of millions of rows, so this query hangs for 15+ minutes...
SELECT * FROM identities WHERE data->'email'->>'main'='mainemail@email.com';
If you use the JSONB data type for your data column, then in order to index ALL "email" entry values you need to create the following index:
CREATE INDEX ident_data_email_gin_idx ON identities USING gin ((data -> 'email'));
Also keep in mind that for JSONB you need to use the appropriate set of operators:
The default GIN operator class for jsonb supports queries with the @>,
?, ?& and ?| operators
Following queries will hit this index:
SELECT * FROM identities
WHERE data->'email' @> '{"main": "mainemail@email.com"}'
-- OR
SELECT * FROM identities
WHERE data->'email' @> '{"prefix": "aliasPrefix"}'
If you need to search against the array elements "stuff1" or "stuff2", the index above will not work; you need to explicitly add an expression index on the "prettyEmails" array element values in order to make the query faster.
CREATE INDEX ident_data_prettyemails_gin_idx ON identities USING gin ((data -> 'email' -> 'prettyEmails'));
This query will hit the index:
SELECT * FROM identities
WHERE data->'email' @> '{"prettyEmails":["stuff1"]}'