Effectively searching through entire 1 level nested JSONB in Postgres - postgresql

Let's say we need to check whether a jsonb column contains a particular value, matching by substring against any of its values (non-nested, first level only).
How does one efficiently optimize a query that searches an entire JSONB column (that is, every key) for a value?
Is there a good alternative to doing ILIKE '%val%' on the jsonb column cast to text?
jsonb_each_text(jsonb_column) ILIKE '%val%'
As an example consider this data:
SELECT
'{
"col1": "somevalue",
"col2": 5.5,
"col3": "2016-01-01",
"col4": "othervalue",
"col5": "yet_another_value"
}'::JSONB
How would you go about optimizing such a query when you need to search for the pattern %val% in a jsonb column whose rows have different sets of keys?
I'm aware that searching with both a leading and a trailing % sign is inefficient, so I'm looking for a better way but having a hard time finding one. Also, explicitly indexing all the fields within the json column is not an option, since they vary for each type of record and would require a huge set of indexes (not every row has the same set of keys).
Question
Is there a better alternative to extracting each key-value pair to text and performing an ILIKE/POSIX search?

If you know you will need to query only a few known keys, then you can simply index those expressions.
This is an oversimplified but self-explanatory example:
create table foo as SELECT '{"col1": "somevalue", "col2": 5.5, "col3": "2016-01-01", "col4": "othervalue", "col5": "yet_another_value"}'::JSONB as bar;
create index pickfoo1 on foo ((bar #>> '{col1}'));
create index pickfoo2 on foo ((bar #>> '{col2}'));
This is the basic idea, even though it isn't useful for ILIKE queries, but you can do more things (depending on your needs).
For example, if you only need case-insensitive matching, it would be sufficient to do:
-- Create index over lowered value:
create index pickfoo1_lower on foo (lower(bar #>> '{col1}'));
create index pickfoo2_lower on foo (lower(bar #>> '{col2}'));
-- Check that it matches:
select * from foo where lower(bar #>> '{col1}') = lower('soMEvaLUe');
NOTE: This is only an example. If you run EXPLAIN on the previous select, you will see that Postgres actually performs a sequential scan instead of using the index. That is because we are testing on a table with a single row, which is not the usual case. I'm sure you could test it with a bigger table ;-)
With huge tables, even LIKE queries should benefit from the index if the first wildcard doesn't appear at the beginning of the string (this isn't a matter of jsonb but of btree indexes themselves).
If you need to optimize queries like:
select * from foo where bar #>> '{col1}' ilike '%MEvaL%';
...then you should consider using GIN or GIST indexes instead.
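As a rough sketch of what such an index could look like: trigram indexes from the pg_trgm contrib module are the usual way to support patterns with a leading wildcard. The index names below are made up for illustration, and the example assumes pg_trgm can be installed on the server.

```sql
-- Assumption: the pg_trgm contrib module is available on this server.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Trigram GIN index on one extracted key (hypothetical index name).
CREATE INDEX foo_col1_trgm ON foo USING gin ((bar #>> '{col1}') gin_trgm_ops);

-- The WHERE clause has to repeat the indexed expression exactly.
SELECT * FROM foo WHERE bar #>> '{col1}' ILIKE '%MEvaL%';

-- When the keys vary per row, one possible option is a single trigram index
-- over the whole document cast to text. Note that this also matches key names.
CREATE INDEX foo_bar_text_trgm ON foo USING gin ((bar::text) gin_trgm_ops);

SELECT * FROM foo WHERE bar::text ILIKE '%val%';
```

Trigram indexes need patterns of at least three characters to be selective, and as with the btree examples above, the planner will still prefer a sequential scan on a tiny table.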

Related

Is it possible to index the position of an array column in PostgreSQL?

Let's say I want to find rows in the table my_table that have the value 5 at the first position of the array column my_array_column. To prepare the table, I executed the following statements:
CREATE TABLE my_table (
id serial primary key,
my_array_column integer[]
);
CREATE INDEX my_table_my_array_column_index on "my_table" USING GIN ("my_array_column");
SET enable_seqscan TO off;
INSERT INTO my_table (my_array_column) VALUES ('{5,7,10}');
Now, the query can look like this:
select * from my_table where my_array_column[1] = 5;
This works, but it doesn't use the created GIN index. Is it possible to search for the value 5 at a specific position with an index?
I want to find rows in the table my_table that have the value 5 at the first position of the array column
A partial index would be most efficient for that definition:
CREATE INDEX my_table_my_array_special_idx ON my_table ((true))
WHERE my_array_column[1] = 5;
If only a small fraction of rows qualifies, a partial index is accordingly smaller. Plus, the actual index column only occupies minimum space (typically 8 bytes). And, on top of that, Postgres 13 or later can apply index deduplication to make the index much smaller, yet.
Once the index is fully cached, its small size does not make it much faster, but still.
And most writes do not have to manipulate the index, which may be the most important benefit, depending on the workload.
Oh, and Postgres collects statistics for a partial index. So you can expect the query planner to make a fully educated choice when that index is involved.
Related:
PostgreSQL partial index unused when created on a table with existing data
Index that is not used, yet influences query
It's applicable when the query repeats the same condition.
Typically, you have something useful as index field on top of your declared purpose. But if you don't, just use any small constant - true in my example, but anything < 8 bytes is equally good.
Minor disclaimer: The "first position" in a Postgres array does not necessarily have index 1. If non-standard array indexes are possible, consider:
...
WHERE (my_array_column[:])[1] = 5;
In index and queries.
See:
Normalize array subscripts for 1-dimensional array so they start with 1
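To see why the disclaimer matters, here is a small self-contained query on an array literal with non-standard bounds. The aliases are only illustrative, and the slice syntax with omitted bounds assumes Postgres 9.6 or later.

```sql
-- An array whose subscripts run from 5 to 7 instead of 1 to 3.
SELECT a[1]                 AS subscript_1,          -- NULL: no element has subscript 1
       (a[:])[1]            AS first_after_slice,    -- 5: a slice is renumbered from 1
       a[array_lower(a, 1)] AS first_by_lower_bound  -- 5: explicit lower bound
FROM  (SELECT '[5:7]={5,7,10}'::integer[] AS a) AS sub;
```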
You can index just the first position. You need an extra set of parentheses in the create statement to do that:
create index on my_table ((my_array_column[1]));
Or you could augment your query to work with your gin index, on the theory that an array can't have the first element be 5 unless at least one element is 5.
select * from my_table where my_array_column[1] = 5 and my_array_column @> ARRAY[5];
Of course this won't be very efficient if a lot of your arrays contain 5, but in some other spot in the array. It would have to recheck all of those "false matches" to eliminate them. So if you only care about the first element, the first index I showed is better. (Of course, if you only care about the first element, why use an array to start with?)
If you always look at the first position a regular B-Tree index will do:
create index on my_table ( (my_array_column[1]) );
If you don't know the position, then a GIN index is indeed needed, but you need to use an operator that is supported by a GIN index, e.g. the @> operator. For that you need a different query:
select *
from my_table
where my_array_column @> array[5];
That would find all rows where the array column contains the value 5.
But you should heed the advice given in the manual regarding the use of arrays:
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.
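A minimal sketch of the layout the manual describes, assuming the my_table definition from the question. The child table, its column names (pos, val) and the index name are hypothetical.

```sql
-- Hypothetical child table: one row per former array element.
CREATE TABLE my_table_item (
    my_table_id integer NOT NULL REFERENCES my_table (id),
    pos         integer NOT NULL,  -- 1-based position within the former array
    val         integer NOT NULL,
    PRIMARY KEY (my_table_id, pos)
);

-- A plain btree index covers both "value anywhere" and "value at position N".
CREATE INDEX my_table_item_val_pos_idx ON my_table_item (val, pos);

-- Rows whose first element is 5:
SELECT t.*
FROM   my_table AS t
JOIN   my_table_item AS i ON i.my_table_id = t.id
WHERE  i.val = 5
AND    i.pos = 1;
```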

Postgres: Fast facets on big blobs

I have a Postgres table with a large jsonb column.
CREATE TABLE mytable (id integer, my_jsonb jsonb)
The my_jsonb column contains data like this:
{
"name": "Bob",
"city": "Somecity",
"zip": "12345"
}
The table contains several million rows.
I need to create facets, i.e. aggregations, on individual fields in our user interface. For example:
city | count
New York | 1000
Chicago | 3000
Los Angeles | 4000
maybe 200 more values...
My current query, which yields the correct results, looks like this:
select my_jsonb->>'city', count(*)
from mytable
where foo='bar'
group by my_jsonb->>'city'
order by my_jsonb->>'city'
The problem is that it is painfully slow. It takes 5-10 seconds, depending on the particular column that I pick. It has to do a full table scan and extract each jsonb value, row by row.
Question: how do I create an index that does this query efficiently, and works no matter which jsonb field I choose?
A GIN index doesn't work. The query optimizer doesn't use it. Same for a simple BTREE on the jsonb column.
I'm thinking that there might be some kind of expression index, and I might be able to rewrite the facet query to use the expression, but I haven't figured it out.
Worst case, I can extract all of the values into a second table and index that, but I'd prefer not to.
Your only hope would be an index-only scan, but since that doesn't work with expression indexes, you're out. There is no way to avoid scanning the whole table and extracting the JSON values.
You'll have to extract the JSON values in a normalized form. This goes as a reminder that data models involving JSON are very often a bad choice in a relational database (although there are valid use cases).
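One way such an extraction could look, as a sketch only: a side table with one row per top-level key/value pair, so that a single index serves every facet. The table and index names are invented here, and keeping the side table in sync (with a trigger or a periodic rebuild) is not shown.

```sql
-- Hypothetical side table: one row per (id, key, value) of the top-level object.
CREATE TABLE mytable_facets AS
SELECT t.id, kv.key, kv.value
FROM   mytable AS t
CROSS  JOIN LATERAL jsonb_each_text(t.my_jsonb) AS kv;

CREATE INDEX mytable_facets_key_value_idx ON mytable_facets (key, value);

-- Facet counts for any field: only the key literal changes.
SELECT value AS city, count(*)
FROM   mytable_facets
WHERE  key = 'city'
GROUP  BY value
ORDER  BY value;
```

A filter such as foo = 'bar' from the original query would need either a join back to mytable on id or a copy of that column in the side table.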

Postgres multi-column index (integer, boolean, and array)

I have a Postgres 9.4 database with a table like this:
| id | other_id | current | dn_ids | rank |
|----|----------|---------|---------------------------------------|------|
| 1 | 5 | F | {123,234,345,456,111,222,333,444,555} | 1 |
| 2 | 7 | F | {123,100,200,900,800,700,600,400,323} | 2 |
(update) I already have a couple indexes defined. Here is the CREATE TABLE syntax:
CREATE TABLE mytable (
id integer NOT NULL,
other_id integer,
rank integer,
current boolean DEFAULT false,
dn_ids integer[] DEFAULT '{}'::integer[]
);
CREATE SEQUENCE mytable_id_seq START WITH 1 INCREMENT BY 1 NO MINVALUE NO MAXVALUE CACHE 1;
ALTER TABLE ONLY mytable ALTER COLUMN id SET DEFAULT nextval('mytable_id_seq'::regclass);
ALTER TABLE ONLY mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);
CREATE INDEX ind_dn_ids ON mytable USING gin (dn_ids);
CREATE INDEX index_mytable_on_current ON mytable USING btree (current);
CREATE INDEX index_mytable_on_other_id ON mytable USING btree (other_id);
CREATE INDEX index_mytable_on_other_id_and_current ON mytable USING btree (other_id, current);
I need to optimize queries like this:
SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = F AND NOT (ARRAY[100,200] && dn_ids)
ORDER BY rank ASC
LIMIT 500 OFFSET 1000
This query works fine, but I'm sure it could be much faster with smart indexing. There are about 250,000 rows in the table and I always have current = F as a predicate. The input array I'm comparing to the stored array will have 1-9 integers, as well. The other_id can vary. But generally, before limiting, the scan will match between 0-25,000 rows.
Here's an example EXPLAIN:
Limit (cost=36944.53..36945.78 rows=500 width=65)
-> Sort (cost=36942.03..37007.42 rows=26156 width=65)
Sort Key: rank
-> Seq Scan on mytable (cost=0.00..35431.42 rows=26156 width=65)
Filter: ((NOT current) AND (NOT ('{-1,35257,35314}'::integer[] && dn_ids)) AND (other_id = 193))
Other answers on this site and the Postgres docs suggest it's possible to add a compound index to improve performance. I already have one on [other_id, current]. I've also read in various places that indexing can improve the performance of the ORDER BY in addition to the WHERE clause.
What's the right type of compound index to use for this query? I don't care about space at all.
Does it matter much how I order the terms in the WHERE clause?
What's the right type of compound index to use for this query? I don't care about space at all.
This depends on the complete situation. Either way, the GIN index you already have is most probably superior to a GiST index in your case:
Difference between GiST and GIN index
You can combine either with integer columns once you install the additional module btree_gin (or btree_gist, respectively).
Multicolumn index on 3 fields with heterogenous data types
However, that does not cover the boolean data type, which typically doesn't make sense as index column to begin with. With just two (three incl. NULL) possible values it's not selective enough.
And a plain btree index is more efficient for integer. While a multicolumn btree index on two integer columns would certainly help, you'll have to test carefully if combining (other_id, dn_ids) in a multicolumn GIN index is worth more than it costs. Probably not. Postgres can combine multiple indexes in a bitmap index scan rather efficiently.
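For reference, a multicolumn GIN index of the kind discussed above could be declared roughly like this. The index name is hypothetical, and as noted it may well cost more than it gains.

```sql
-- Assumption: the btree_gin contrib module is available; it supplies GIN
-- operator classes for scalar types such as integer.
CREATE EXTENSION IF NOT EXISTS btree_gin;

-- Hypothetical multicolumn GIN index over an integer and an integer[] column.
CREATE INDEX mytable_other_id_dn_ids_gin ON mytable USING gin (other_id, dn_ids);

-- A positive overlap test can use this index. The negated form from the
-- question, NOT (ARRAY[...] && dn_ids), generally cannot.
SELECT id, dn_ids
FROM   mytable
WHERE  other_id = 5
AND    dn_ids && ARRAY[100,200];
```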
Finally, while indexes can be used for sorted output, this will probably not pay to apply for a query like you display (unless you select large parts of the table).
Not applicable to updated question.
Partial indexes might be an option. Other than that, you already have all the indexes you need.
I would drop the pointless index on the boolean column current completely, and the index on just rank is probably never used for this query.
Does it matter much how I order the terms in the WHERE clause?
The order of WHERE conditions is completely irrelevant.
Addendum after question update
The utility of indexes is bound to selective criteria. If more than roughly 5 % (depends on various factors) of the table are selected, a sequential scan of the whole table is typically faster than dealing with the overhead on any indexes - except for pre-sorting output, that's the one thing an index is still good for in such cases.
For a query that fetches 25,000 of 250,000 rows, indexes are mostly just for that - which gets all the more interesting if you attach a LIMIT clause. Postgres can stop fetching rows from an index once the LIMIT is satisfied.
Be aware that Postgres always needs to read OFFSET + LIMIT rows, so performance deteriorates with the sum of both.
Even with your added information, much of what's relevant is still in the dark. I am going to assume that:
Your predicate NOT (ARRAY[100,200] && dn_ids) is not very selective. Ruling out 1 to 10 ID values should typically retain the majority of rows unless you have very few distinct elements in dn_ids.
The most selective predicate is other_id = 5.
A substantial part of the rows is eliminated with NOT current.
Aside: current = F isn't valid syntax in standard Postgres. It must be NOT current or current = FALSE.
While a GIN index would be great to identify few rows with matching arrays faster than any other index type, this seems hardly relevant for your query. My best guess is this partial, multicolumn btree index:
CREATE INDEX foo ON mytable (other_id, rank, dn_ids)
WHERE NOT current;
The array column dn_ids in a btree index cannot support the && operator, I just include it to allow index-only scans and filter rows before accessing the heap (the table). May even be faster without dn_ids in the index:
CREATE INDEX foo ON mytable (other_id, rank) WHERE NOT current;
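To check whether the planner picks up one of these partial indexes, the query from the question could be run through EXPLAIN in roughly this form. This is a sketch: the literal values are the ones from the question, and the resulting plan depends on the actual data distribution.

```sql
-- The boolean test is written as NOT current so that it matches the
-- partial index predicate (WHERE NOT current) literally.
EXPLAIN
SELECT id, dn_ids
FROM   mytable
WHERE  other_id = 5
AND    NOT current
AND    NOT (ARRAY[100,200] && dn_ids)
ORDER  BY rank
LIMIT  500 OFFSET 1000;
```

If the index is used, the plan should no longer contain a separate Sort node, because rows for a given other_id are read in rank order.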
GiST indexes may become more interesting in Postgres 9.5 due to this new feature:
Allow GiST indexes to perform index-only scans (Anastasia Lubennikova,
Heikki Linnakangas, Andreas Karlsson)
Aside: current is a reserved word in standard SQL, even if it's allowed as identifier in Postgres.
Aside 2: I assume id is an actual serial column with the column default set. Just creating a sequence like you demonstrate, would do nothing.
Auto increment SQL function
Unfortunately I don't think you can combine a BTree and a GIN/GIST index into a single compound index, so the planner is going to have to choose between using the other_id index or the dn_ids index. One advantage of using other_id, as you pointed out, is that you could use a multicolumn index to improve the sort performance. The way you would do this would be
CREATE INDEX index_mytable_on_other_id_and_current
ON mytable (other_id, rank) WHERE NOT current;
This is using a partial index, and will allow you to skip the sort step when you are sorting by rank and querying on other_id.
Depending on the cardinality of other_id, the only benefit of this might be the sorting. Because your plan has a LIMIT clause, it's difficult to tell. Sequential scans can be the fastest option if you're using more than 1/5 of the table, especially if you're using a standard HDD instead of solid state. If your planner insists on a sequential scan when you know an index scan is faster (you've tested with enable_seqscan set to false), you may want to try fine-tuning random_page_cost or effective_cache_size.
Finally, I'd recommend not keeping all of these indexes. Find the ones you need, and cull the rest. Indexes cause huge performance degradation on inserts (especially multi-column and GIN/GiST indexes).
The simplest index for your query is mytable(other_id, current). This handles the first two conditions. This would be a normal b-tree type index.
You can satisfy the array condition using a GIST index on mytable(dn_ids).
However, I don't think you can mix the different data types in one index, at least not without extensions.

How to combine DISTINCT and ORDER BY in array_agg of jsonb values in PostgreSQL

Note: I am using the latest version of Postgres (9.4)
I am trying to write a query which does a simple join of 2 tables, and groups by the primary key of the first table, and does an array_agg of several fields in the 2nd table which I want returned as an object. The array needs to be sorted by a combination of 2 fields in the json objects, and also uniquified.
So far, I have come up with the following:
SELECT
zoo.id,
ARRAY_AGG(
DISTINCT ROW_TO_JSON((
SELECT x
FROM (
SELECT animals.type, animals.name
) x
))::JSONB
-- ORDER BY animals.type, animals.name
)
FROM zoo
JOIN animals ON animals.zooId = zoo.id
GROUP BY zoo.id;
This results in one row for each zoo, with an aggregate array of jsonb objects, one for each animal, uniquely.
However, I can't seem to figure out how to also sort this by the parameters in the commented out part of the code.
If I take out the distinct, I can ORDER BY original fields, which works great, but then I have duplicates.
If you use row_to_json() you will lose the column names unless you put in a row that is typed. If you "manually" build the jsonb object with json_build_object() using explicit names then you get them back:
SELECT zoo.id, array_agg(za.jb ORDER BY za."type", za."name") AS animals
FROM zoo
JOIN (
-- DISTINCT ON expressions must match the leading ORDER BY expressions
SELECT DISTINCT ON (zooId, "type", "name")
zooId, "type", "name",
json_build_object('animal_type', "type", 'animal_name', "name")::jsonb AS jb
FROM animals
ORDER BY zooId, "type", "name"
) AS za ON za.zooId = zoo.id
GROUP BY zoo.id;
You could also ORDER BY the elements of the jsonb object (for example za.jb->>'animal_type' inside array_agg), but ordering by the plain columns is far more efficient. When an aggregate uses DISTINCT, Postgres requires every ORDER BY expression to appear in the aggregate's argument list, so DISTINCT cannot be combined with an ORDER BY on other columns inside array_agg. Deduplicating there would be rather inefficient anyway (first building all the jsonb objects, then throwing out duplicates). You can achieve the same result, however, by applying the DISTINCT clause before building the jsonb object.
Also, avoid use of SQL key words like "type" and standard data types like "name" as column names. Both are non-reserved keywords so you can use them in their proper contexts, but practically speaking your commands could get really confusing. You could, for instance, have a schema, with a table, a column in that table, and a data type each called "type" and then you could get this:
SELECT type::type FROM type.type WHERE type = something;
While PostgreSQL will graciously accept this, it is plain confusing at best and prone to error in all sorts of more complex situations. You can get a long way by double-quoting any key words, but they are best just avoided as identifiers.

Optimization of count query for PostgreSQL

I have a table in postgresql that contains an array which is updated constantly.
In my application i need to get the number of rows for which a specific parameter is not present in that array column. My query looks like this:
select count(id)
from table
where not (ARRAY['parameter value'] <@ table.array_column)
But as the number of rows and the number of executions of that query increase (several times per second, possibly hundreds or thousands), the performance decreases a lot. It seems to me that counting in PostgreSQL might have a linear order of execution (I'm not completely sure of this).
Basically my question is:
Is there an existing pattern I’m not aware of that applies to this situation? what would be the best approach for this?
Any suggestion you could give me would be really appreciated.
PostgreSQL actually supports GIN indexes on array columns. Unfortunately, it doesn't seem to be usable for NOT ARRAY[...] <@ indexed_col, and GIN indexes are unsuitable for frequently-updated tables anyway.
Demo:
CREATE TABLE arrtable (id integer primary key, array_column integer[]);
INSERT INTO arrtable VALUES (1, ARRAY[1,2,3,4]);
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
-- Use the following *only* for testing whether Pg can use an index
-- Do not use it in production.
SET enable_seqscan = off;
explain (buffers, analyze) select count(id)
from arrtable
where not (ARRAY[1] <@ arrtable.array_column);
Unfortunately, this shows that as written we can't use the index. If you don't negate the condition it can be used, so you can search for and count rows that do contain the search element (by removing NOT).
You could use the index to count entries that do contain the target value, then subtract that result from a count of all entries. Since counting all rows in a table is quite slow in PostgreSQL (9.1 and older) and requires a sequential scan this will actually be slower than your current query. It's possible that on 9.2 an index-only scan can be used to count the rows if you have a b-tree index on id, in which case this might actually be OK:
SELECT (
SELECT count(id) FROM arrtable
) - (
SELECT count(id) FROM arrtable
WHERE (ARRAY[1] <@ arrtable.array_column)
);
It's guaranteed to perform worse than your original version for Pg 9.1 and below, because in addition to the seqscan your original requires, it also needs a GIN index scan. I've now tested this on 9.2 and it does appear to use an index for the count, so it's worth exploring for 9.2. With some less trivial dummy data:
drop index arrtable_arraycolumn_gin_arr_idx ;
truncate table arrtable;
insert into arrtable (id, array_column)
select s, ARRAY[1,2,s,s*2,s*3,s/2,s/4] FROM generate_series(1,1000000) s;
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
Note that a GIN index like this will slow updates down a LOT, and is quite slow to create in the first place. It is not suitable for tables that get updated much at all - like your table.
Worse, the query using this index takes up to twice as long as your original query, and at best half as long, on the same data set. It's worst for cases where the index is not very selective, like ARRAY[1]: 4s vs 2s for the original query. Where the index is highly selective (i.e. not many matches, like ARRAY[199]) it runs in about 1.2 seconds vs the original's 3s. This index simply isn't worth having for this query.
The lesson here? Sometimes, the right answer is just to do a sequential scan.
Since that won't do for your hit rates, either maintain a materialized view with a trigger as @debenhur suggests, or try to invert the array to be a list of parameters that the entry does not have so you can use a GiST index as @maniek suggests.
Is there an existing pattern I'm not aware of that applies to this situation? what would be the best approach for this?
Your best bet in this situation might be to normalize your schema. Split the array out into a table. Add a b-tree index on the table of properties, or order the primary key so it's efficiently searchable by property_id.
CREATE TABLE demo( id integer primary key );
INSERT INTO demo (id) SELECT id FROM arrtable;
CREATE TABLE properties (
demo_id integer not null references demo(id),
property integer not null,
primary key (demo_id, property)
);
CREATE INDEX properties_property_idx ON properties(property);
You can then query the properties:
SELECT count(id)
FROM demo
WHERE NOT EXISTS (
SELECT 1 FROM properties WHERE demo.id = properties.demo_id AND property = 1
)
I expected this to be a lot faster than the original query, but it's actually much the same with the same sample data; it runs in the same 2s to 3s range as your original query. It's the same issue where searching for what is not there is much slower than searching for what is there; if we're looking for rows containing a property we can avoid the seqscan of demo and just scan properties for matching IDs directly.
Again, a seq scan on the array-containing table does the job just as well.
I think with your current data model you are out of luck. Try to think of an algorithm that the database would have to execute for your query. There is no way it could work without sequentially scanning the data.
Can you arrange the column so that it stores the inverse of the data (so that the query would be select count(id) from table where ARRAY['parameter value'] <@ table.array_column)? This query could use a GIN/GiST index.