I have the following table structure:
create table public.listings (id varchar(255) not null, data jsonb not null);
And the following indexes:
create index listings_data_index on public.listings using gin(data jsonb_ops);
create unique index listings_id_index on public.listings(id);
alter table public.listings add constraint listings_id_pk primary key(id);
With this row:
id | data
1 | {"attributes": {"ccid": "123", "listings": [{"vin": "1234","body": "Sleeper", "make": "International"}, { "vin": "5678", "body": "Sleeper", "make": "International" }]}}
The use case needs to retrieve a specific item inside the listings array that matches a specific vin.
I am accomplishing that with the following query:
SELECT elems
FROM public.listings, jsonb_array_elements(data->'attributes'->'listings') elems
WHERE id = '1' AND elems->'vin' ? '1234';
The output is what I need:
{"vin": "1234","body": "Sleeper", "make": "International"}
Now I am at the stage of optimizing this query, since there will be millions of rows and up to 100K items inside the listings array.
When I run EXPLAIN on that query, it shows this:
Nested Loop (cost=0.01..2.53 rows=1 width=32)
-> Seq Scan on listings (cost=0.00..1.01 rows=1 width=32)
Filter: ((id)::text = '1'::text)
-> Function Scan on jsonb_array_elements elems (cost=0.01..1.51 rows=1 width=32)
Filter: ((value -> 'vin'::text) ? '1234'::text)
I wonder what the right way to construct an index for that would be, or whether I need to rewrite the query into something more efficient.
Thank you!
First: with a table as small as that, you will never see PostgreSQL use an index. You need to try with realistic amounts. Second: while PostgreSQL will happily use an index for the condition on id, it can never use an index for such a JSON search, no matter how you write it.
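For example, one way to load a realistic amount of test data (just a sketch; the row counts and generated VIN values here are made up):
INSERT INTO public.listings (id, data)
SELECT i::text,
       jsonb_build_object(
           'attributes', jsonb_build_object(
               'ccid', i::text,
               'listings', (SELECT jsonb_agg(jsonb_build_object(
                                       'vin',  (i * 1000 + j)::text,
                                       'body', 'Sleeper',
                                       'make', 'International'))
                            FROM generate_series(1, 100) AS j)))
FROM generate_series(2, 100000) AS i;  -- start at 2, id '1' already exists
ANALYZE public.listings;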
I have created an index on a field in a jsonb column as follows:
create index on Employee using gin ((properties -> 'hobbies'))
The generated index definition is:
CREATE INDEX employee_expr_idx ON public.employee USING gin (((properties -> 'hobbies'::text)))
My search query has the following structure:
SELECT * FROM Employee e
WHERE e.properties @> '{"hobbies": ["trekking"]}'
AND e.department = 'Finance'
Running EXPLAIN command for this query gives:
Seq Scan on employee e (cost=0.00..4452.94 rows=6 width=1183)
Filter: ((properties @> '{"hobbies": ["trekking"]}'::jsonb) AND (department = 'Finance'::text))
Going by this, I am not sure whether the index is being used for the search.
Is this entire setup ok?
The expression you use in the WHERE clause must match the expression in the index exactly. Your index uses the expression ((properties -> 'hobbies'::text)), but your query only uses e.properties on the left-hand side.
To make use of that index, your WHERE clause needs to use the same expression that was used in the index:
SELECT *
FROM Employee e
WHERE (properties -> 'hobbies') @> '["trekking"]'
AND e.department = 'Finance'
However: your execution plan shows that the table employee is really tiny (rows=6). With a table as small as that, a Seq Scan is always going to be the fastest way to retrieve data, no matter what kind of indexes you define.
My PostgreSQL version is 9.5+.
I have jsonb data in a column 'body':
{
"id":"58cf96481ebf47eba351db3b",
"JobName":"test",
"JobDomain":"SAW",
"JobStatus":"TRIGGERED",
"JobActivity":"ACTIVE"
}
And I created indexes on a key and on the body column:
CREATE INDEX scheduledjob_request_id_idx ON "ScheduledJob" USING gin ((body -> 'JobName'));
CREATE INDEX test_index ON "ScheduledJob" USING gin (body jsonb_path_ops)
These are my queries:
SELECT body FROM "ScheduledJob" WHERE body @> '{"JobName": "analytics_import_transaction_job"}';
SELECT body FROM "ScheduledJob" WHERE (body#>'{JobName}' = '"analytics_import_transaction_job"') LIMIT 10;
Both return the correct data, but neither uses the index.
I looked at the EXPLAIN output:
-> Seq Scan on public."ScheduledJob" (cost=0.00..4.55 rows=1 width=532)
So I don't know why the index isn't used, or how to use an index on jsonb correctly.
Update:
If I create the index before inserting data, the query can use the index. But if I create the index after the first data has already been inserted, the query scans all records.
This is strange. How can I make the index useful when I insert data first?
So I did some research and testing:
SELECT body FROM "ScheduledJob" WHERE (body#>'{JobName}' = '"analytics_import_transaction_job"') LIMIT 10;
A query of this form will never use the index.
And the index only becomes usable once the table contains enough data.
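One way to check whether the planner is able to use the GIN index at all, independently of table size (a diagnostic sketch only, not something to leave enabled):
SET enable_seqscan = off;
EXPLAIN
SELECT body
FROM "ScheduledJob"
WHERE body @> '{"JobName": "analytics_import_transaction_job"}';
RESET enable_seqscan;
-- Running ANALYZE "ScheduledJob" after bulk-loading data also refreshes the
-- statistics this planner decision is based on.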
I've got a Postgres 9.4.4 database with 1.7 million records with the following information stored in a JSONB column called data in a table called accounts:
data: {
"lastUpdatedTime": "2016-12-26T12:09:43.901Z",
"UID": "2c5bb7fd-1a00-4988-8d92-ffaa52ebc20d",
"data": {
"country": "UK",
"verified_at": "2017-01-01T23:49:10.217Z"
}
}
The data format cannot be changed since this is legacy information.
I need to obtain all accounts where the country is UK, the verified_at value is not null and the lastUpdatedTime value is greater than some given value.
So far, I have the following query:
SELECT * FROM "accounts"
WHERE (data @> '{ "data": { "country": "UK" } }')
AND (data->'data' ? 'verified_at')
AND ((data->'data' ->> 'verified_at') is not null)
AND (data ->>'lastUpdatedTime' > '2016-02-28T05:49:08.511846')
ORDER BY data ->>'lastUpdatedTime' LIMIT 100 OFFSET 0;
And the following indexes:
"accounts_idxgin" gin (data)
"accounts_idxgin_on_data" gin ((data -> 'data'::text))
I've managed to get the query time down to about 1000 to 4000 ms.
Here is the EXPLAIN ANALYZE output for the query:
Bitmap Heap Scan on accounts (cost=41.31..6934.50 rows=9 width=1719)
(actual time=7.273..1067.657 rows=23190 loops=1)
Recheck Cond: ((data -> 'data'::text) ? 'verified_at'::text)
Filter: ((((data -> 'data'::text) ->> 'verified_at'::text) IS NOT NULL)
AND ((data ->> 'lastUpdatedTime'::text) > '2016-02-01 05:49:08.511846'::text)
AND (((data -> 'data'::text) ->> 'country'::text) = 'UK'::text))
Rows Removed by Filter: 4
Heap Blocks: exact=16039
-> Bitmap Index Scan on accounts_idxgin_on_data (cost=0.00..41.30 rows=1773 width=0)
(actual time=4.618..4.618 rows=23194 loops=1)
Index Cond: ((data -> 'data'::text) ? 'verified_at'::text)
Planning time: 0.448 ms
Execution time: 1069.344 ms
(9 rows)
I have the following questions
Is there anything I can do to further speed up this query?
What is the correct way to speed up a field is not null query with JSONB? I ended up using the existence operator with (data->'data' ? 'verified_at') to filter out a large number of non-matching records, because much of my data doesn't have verified_at as a top level key. This increased the speed of the query, but I'm wondering if there's a general approach to optimizing this type of query.
In order to use the existence operator with (data->'data' ? 'verified_at'), I needed to add another index on ((data -> 'data'::text)). I already had an index on gin (data), but the existence operator didn't use this. Why is that? I thought the existence and containment operators would use this index.
3: Not really. This case is explicitly mentioned in the docs.
When you have an index on the column data, it is only used when you query your table like data @> '...' or data ? '...'. When you have an index on the expression (data -> 'data'), these queries can take advantage of it: (data -> 'data') @> '...' or (data -> 'data') ? '...'.
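For illustration with the two indexes from the question (a sketch; whether the planner actually picks them still depends on row counts and statistics):
-- Can use accounts_idxgin (gin (data)):
SELECT * FROM accounts WHERE data @> '{"UID": "2c5bb7fd-1a00-4988-8d92-ffaa52ebc20d"}';
-- Can use accounts_idxgin_on_data (gin ((data -> 'data'))):
SELECT * FROM accounts WHERE (data -> 'data') ? 'verified_at';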
2: The usual jsonb indexes won't help with a (jsonb_col -> '<key>') is [not] null query at all. And unfortunately, you cannot use jsonb_col @> '{"<key>":null}' either, because the JSON object might lack the key entirely. Reverse use of the index (for is not null) is not possible either. But there may be a trick...
1: Not much. There may be some improvements, but don't expect huge performance advantages. So here they go:
You can use the jsonb_path_ops operator class instead of the (default) jsonb_ops. This should mean a small improvement in performance, but it cannot be used with the existence operator (?). We won't need it anyway, though.
You have a single, index-unfriendly, boolean-typed expression, which slows you down. Thankfully, you can use a partial index here if you are only interested in the rows where it is true.
So, your index should look something like this:
create index accounts_idxgin_on_data
on accounts using gin ((data -> 'data') jsonb_path_ops)
where (data -> 'data' ->> 'verified_at') is not null;
With this index, you can use the following query:
select *
from accounts
where (data -> 'data') @> '{"country":"UK"}'
and (data -> 'data' ->> 'verified_at') is not null
and (data ->> 'lastUpdatedTime') > '2016-02-28T05:49:08.511Z'
order by data ->>'lastUpdatedTime';
Note: for proper timestamp comparisons, you should use (data ->> 'lastUpdatedTime')::timestamptz > '2016-02-28T05:49:08.511Z'.
http://rextester.com/QWUW41874
After playing around a bit more, I've managed to reduce my query time from around 1000ms to 350ms by creating the following partial index:
CREATE INDEX index_accounts_partial_on_verified_at
ON accounts ((data->'data'->'verified_at'))
WHERE (data->'data'->>'verified_at') IS NOT NULL
AND (data->'data' ? 'verified_at')
AND (data->'data'->>'country' = 'UK');
I was able to hardcode some of the values in this index, such as country=UK because I only need to consider UK accounts for this query. I was also able to remove the index on ((data->'data')) which was 258MB, and replace it with the partial index which is only 1360 kB!
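If anyone wants to reproduce the size comparison, something like this should work (index names as above):
SELECT relname AS index_name,
       pg_size_pretty(pg_relation_size(oid)) AS index_size
FROM pg_class
WHERE relname IN ('accounts_idxgin_on_data',
                  'index_accounts_partial_on_verified_at');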
For anyone interested, I found the details for building a partial JSONB index from here
Use the path access operator for faster access to lower-level objects:
SELECT * FROM "accounts"
WHERE data #>> '{data, country}' = 'UK'
AND data #>> '{data, verified_at}' IS NOT NULL
AND data ->> 'lastUpdatedTime' > '2016-02-28T05:49:08.511846'
ORDER BY data ->> 'lastUpdatedTime' LIMIT 100 OFFSET 0;
The index only works on top-level keys. So, with an index on the column data, queries like data ? 'key' are supported. However, for a query like data -> 'data' ? 'verified_at' you need an index on (data -> 'data').
Two more points:
I don't think it is necessary to test for the presence of verified_at. If it is not there it simply comes out as NULL so it gets caught by the same test.
Comparing string representations of timestamp values may work if the JSON value is properly and consistently formatted. Cast to timestamp to be on the safe side.
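Putting both points together, the query might look like this (a sketch, assuming lastUpdatedTime is consistently stored as an ISO 8601 string):
SELECT *
FROM "accounts"
WHERE data #>> '{data, country}' = 'UK'
  AND data #>> '{data, verified_at}' IS NOT NULL
  AND (data ->> 'lastUpdatedTime')::timestamptz > '2016-02-28T05:49:08.511846'::timestamptz
ORDER BY (data ->> 'lastUpdatedTime')::timestamptz
LIMIT 100 OFFSET 0;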
I imported all tables from http://www.geonames.org/ into my local postgresql 9.5.3.0 database and peppered it with indexes like so:
create extension pg_trgm;
CREATE INDEX name_trgm_idx ON geoname USING GIN (name gin_trgm_ops);
CREATE INDEX fcode_trgm_idx ON geoname USING GIN (fcode gin_trgm_ops);
CREATE INDEX fclass_trgm_idx ON geoname USING GIN (fclass gin_trgm_ops);
CREATE INDEX alternatename_trgm_idx ON alternatename USING GIN (alternatename gin_trgm_ops);
CREATE INDEX isolanguage_trgm_idx ON alternatename USING GIN (isolanguage gin_trgm_ops);
CREATE INDEX alt_geoname_id_idx ON alternatename (geonameid)
And now I would like to query the country names in different languages and cross reference the geonames attributes with these alternative names like so:
select g.geonameid as geonameid ,a.alternatename as name,g.country as country, g.fcode as fcode
from geoname g,alternatename a
where
a.isolanguage=LOWER('de')
and a.alternatename ilike '%Sa%'
and (a.ishistoric = FALSE OR a.ishistoric IS NULL)
and (a.isshortname = TRUE OR a.isshortname IS NULL)
and a.geonameid = g.geonameid
and g.fclass='A'
and g.fcode ='PCLI';
Unfortunately, this query takes 13 to 15 seconds on an octa-core machine with a fast SSD. EXPLAIN ANALYZE VERBOSE shows this:
Nested Loop (cost=0.43..237138.04 rows=1 width=25) (actual time=1408.443..10878.115 rows=15 loops=1)
Output: g.geonameid, a.alternatename, g.country, g.fcode
-> Seq Scan on public.alternatename a (cost=0.00..233077.17 rows=481 width=18) (actual time=0.750..10862.089 rows=2179 loops=1)
Output: a.alternatenameid, a.geonameid, a.isolanguage, a.alternatename, a.ispreferredname, a.isshortname, a.iscolloquial, a.ishistoric
Filter: (((a.alternatename)::text ~~* '%Sa%'::text) AND ((a.isolanguage)::text = 'de'::text))
Rows Removed by Filter: 10675099
-> Index Scan using pk_geonameid on public.geoname g (cost=0.43..8.43 rows=1 width=11) (actual time=0.006..0.006 rows=0 loops=2179)
Output: g.geonameid, g.name, g.asciiname, g.alternatenames, g.latitude, g.longitude, g.fclass, g.fcode, g.country, g.cc2, g.admin1, g.admin2, g.admin3, g.admin4, g.population, g.elevation, g.gtopo30, g.timezone, g.moddate
Index Cond: (g.geonameid = a.geonameid)
Filter: ((g.fclass = 'A'::bpchar) AND ((g.fcode)::text = 'PCLI'::text))
Rows Removed by Filter: 1
To me this seems to indicate that a sequential scan is performed that is estimated to return only 481 rows (which I deem to be fairly low), but it nevertheless takes very long. I currently can't make sense of this. Any ideas?
Trigrams only work if you are searching for a minimum of 3 characters: %Sa% won't work, %foo% will. However, your indexes are still not good enough. Depending on which parameters are dynamic, use multicolumn or partial indexes:
CREATE INDEX jkb1 ON geoname(fclass, fcode, geonameid, country);
CREATE INDEX jkb2 ON geoname(geonameid, country) WHERE fclass = 'A' AND fcode = 'PCLI';
Same for the other table:
CREATE INDEX jkb3 ON alternatename(geonameid, alternatename)
WHERE (ishistoric = FALSE OR ishistoric IS NULL)
AND (isshortname = TRUE OR isshortname IS NULL)
AND isolanguage = LOWER('de');
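Regarding the three-character minimum: pg_trgm's show_trgm() function shows which trigrams a value is broken into, which makes the limitation easy to see:
-- The trigrams that a %foo% search can use:
SELECT show_trgm('foo');
-- A pattern such as %Sa% contains no run of three non-wildcard characters,
-- so a trigram index cannot narrow that search down.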
I have a simple table which has a user_birthday field of type date (which can be NULL):
CREATE TABLE users
(
user_id bigserial NOT NULL,
user_email text NOT NULL,
user_password text,
user_first_name text NOT NULL,
user_middle_name text,
user_last_name text NOT NULL,
user_birthday date,
CONSTRAINT pk_users PRIMARY KEY (user_id)
)
There's an index (btree) defined on that field, with the predicate NOT user_birthday IS NULL:
CREATE INDEX ix_users_birthday
ON users
USING btree
(user_birthday)
WHERE NOT user_birthday IS NULL;
Trying to follow up on another idea, I've added the extension btree_gist and created the following index:
CREATE INDEX ix_users_birthday_gist
ON glances.users
USING gist
(user_birthday)
WHERE NOT user_birthday IS NULL;
But it had no effect either; from what I could read, it is not used for this kind of range check.
The PostgreSQL version is 9.3.4.0 (22) Postgres.app, and the issue also exists in 9.3.3.0 (21) Postgres.app.
I've been intrigued by the following queries:
Query #1:
EXPLAIN ANALYZE SELECT *
FROM users
WHERE user_birthday <@ '[1978-07-15,1983-03-01)'::daterange
Query #2:
EXPLAIN ANALYZE SELECT *
FROM users
WHERE user_birthday BETWEEN '1978-07-15'::date AND '1983-03-01'::date
which, at first glance, should both have the same execution plan, but for some reason produce these results:
Query #1:
"Seq Scan on users (cost=0.00..52314.25 rows=11101 width=241) (actual
time=0.014..478.983 rows=208886 loops=1)"
" Filter: (user_birthday <# '[1978-07-15,1983-03-01)'::daterange)"
" Rows Removed by Filter: 901214"
"Total runtime: 489.584 ms"
Query #2:
"Bitmap Heap Scan on users (cost=4468.01..46060.53 rows=210301 width=241)
(actual time=57.104..489.785 rows=209019 loops=1)"
" Recheck Cond: ((user_birthday >= '1978-07-15'::date) AND (user_birthday
<= '1983-03-01'::date))"
" Rows Removed by Index Recheck: 611375"
" -> Bitmap Index Scan on ix_users_birthday (cost=0.00..4415.44
rows=210301 width=0) (actual time=54.621..54.621 rows=209019 loops=1)"
" Index Cond: ((user_birthday >= '1978-07-15'::date) AND
(user_birthday <= '1983-03-01'::date))"
"Total runtime: 500.983 ms"
As you can see, the <@ daterange condition is not utilizing the existing index, while BETWEEN does.
It is important to note that the actual use case for this condition is in a more complex query, which doesn't result in the Recheck Cond and Bitmap Heap Scan.
In the application's complex query, the difference between the two methods (with 1.2 million records) is massive:
Query #1 at 415ms
Query #2 at 84ms.
Is this a bug with daterange?
Am I doing something wrong, or is daterange <@ performing as designed?
There's also a discussion in the pgsql-bugs mailing list
BETWEEN includes the upper and lower bounds. Your condition
WHERE user_birthday BETWEEN '1978-07-15'::date AND '1983-03-01'::date
matches
WHERE user_birthday <@ '[1978-07-15,1983-03-01]'::daterange
I see you mention a btree index. For that use simple comparison operators.
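For example, the same closed range written with plain comparisons, which the partial btree index from the question can serve:
SELECT *
FROM users
WHERE user_birthday >= '1978-07-15'::date
  AND user_birthday <= '1983-03-01'::date;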
The manual has a detailed page on which index types work with which operators.
The range type operators <@ or @> would work with GiST indexes.
Example:
Perform this hours of operation query in PostgreSQL