Is it possible to index the position of an array column in PostgreSQL? - postgresql

Let's say I want to find rows in the table my_table that have the value 5 at the first position of the array column my_array_column. To prepare the table, I executed the following statements:
CREATE TABLE my_table (
id serial primary key,
my_array_column integer[]
);
CREATE INDEX my_table_my_array_column_index on "my_table" USING GIN ("my_array_column");
SET enable_seqscan TO off;
INSERT INTO my_table (my_array_column) VALUES ('{5,7,10}');
Now, the query can look like this:
select * from my_table where my_array_column[1] = 5;
This works, but it doesn't use the created GIN index. Is it possible to search for the value 5 at a specific position with an index?

I want to find rows in the table my_table that have the value 5 at the first position of the array column
A partial index would be most efficient for that definition:
CREATE INDEX my_table_my_array_special_idx ON my_table ((true))
WHERE my_array_column[1] = 5;
If only a small fraction of rows qualifies, a partial index is accordingly smaller. Plus, the actual index column only occupies minimum space (typically 8 bytes). And, on top of that, Postgres 13 or later can apply index deduplication to make the index much smaller, yet.
Once the index is fully cached, its small size does not make it much faster, but still.
And most writes do not have to manipulate the index, which may be the most important benefit, depending on the workload.
Oh, and Postgres collects statistics for a partial index. So you can expect the query planner to make a fully educated choice when that index is involved.
Related:
PostgreSQL partial index unused when created on a table with existing data
Index that is not used, yet influences query
It's applicable when the query repeats the same condition.
Typically, you have something useful as index field on top of your declared purpose. But if you don't, just use any small constant - true in my example, but anything < 8 bytes is equally good.
Minor disclaimer: The "first position" in a Postgres array does not necessarily have index 1. If non-standard array indexes are possible, consider:
...
WHERE (my_array_column[:])[1] = 5;
In index and queries.
See:
Normalize array subscripts for 1-dimensional array so they start with 1

You can index just the first position. You need an extra set of parentheses in the create statement to do that:
create index on my_table ((my_array_column[1]));
Or you could augment your query to work with your gin index, on the theory that an array can't have the first element be 5 unless at least one element is 5.
select * from my_table where my_array_column[1] = 5 and my_array_column #> ARRAY[5];
Of course this won't be very efficient if a lot of your arrays contain 5, but in some other spot in the array. It would have to recheck all of those "false matches" to eliminate them. So if you only care about the first element, the first index I showed is better. (Of course, if you only care about the first element, why use an array to start with?)

If you always look at the first position a regular B-Tree index will do:
create index on my_table ( (my_array_column[1]) );
If you don't know the position, then a GIN index is indeed needed, but you need to use an operator that is supported by a gin index, that would be e.g. the #> operator. But for that you need to use a different query:
select *
from my_table
where my_array_column #> array[5];
That would find all rows where the array column contains the value 5.
But you should head the advice given in the manual regarding the use of arrays:
Arrays are not sets; searching for specific array elements can be a sign of database misdesign. Consider using a separate table with a row for each item that would be an array element. This will be easier to search, and is likely to scale better for a large number of elements.

Related

PostgreSQL - should i only create index for the rest of columns that don't have index yet?

if table_name has 3 column (a,b,c), and I'm going to create an index with those 3 column:
CREATE INDEX idx_table_name_a_b_c ON table_name (a,b,c);
But there's already an index of column a that I previously created :
CREATE INDEX idx_table_name_a ON table_name (a);
Should I create only for the other 2, or create for those 3 columns that also include column a (with above query)?
Note that index considerations are only possible if you have a query. It never makes sense to index a table as such, but only to index a table for a query.
So let's assume that you have a query that would benefit from a three-column index, like
SELECT count(*) FROM table_name
WHERE a = 12 AND b = 42 AND c BETWEEN 7 AND 22;
The best option is to create that index and drop the existing one, because the three-column index can serve all purposes that the single-column index can (that is because a is the leading column in the index).
Such an index will lead to a single index-only scan on the table, which (if you have VACUUMed the table) is the most efficient way to execute the query.
The second best option is to create the two-column index you proposed and leave the single-column index on a.
Then the optimizer's strategy will depend on the distribution of values.
If the condition on a is selective enough, PostgreSQL will ignore your new index and just scan the one on a.
If the condition on b and c is selective, PostgreSQL will scan only your new index.
If all conditions together are not selective, PostgreSQL may choose a sequential scan of the table and ignore all your indexes.
If neither the condition on a nor the conditions on b and c together are selective, but all three conditions together are selective, PostgreSQL can opt to perform a bitmap index scan on both indexes and combine the result.

Postgres multi-column index (integer, boolean, and array)

I have a Postgres 9.4 database with a table like this:
| id | other_id | current | dn_ids | rank |
|----|----------|---------|---------------------------------------|------|
| 1 | 5 | F | {123,234,345,456,111,222,333,444,555} | 1 |
| 2 | 7 | F | {123,100,200,900,800,700,600,400,323} | 2 |
(update) I already have a couple indexes defined. Here is the CREATE TABLE syntax:
CREATE TABLE mytable (
id integer NOT NULL,
other_id integer,
rank integer,
current boolean DEFAULT false,
dn_ids integer[] DEFAULT '{}'::integer[]
);
CREATE SEQUENCE mytable_id_seq START WITH 1 INCREMENT BY 1 NO MINVALUE NO MAXVALUE CACHE 1;
ALTER TABLE ONLY mytable ALTER COLUMN id SET DEFAULT nextval('mytable_id_seq'::regclass);
ALTER TABLE ONLY mytable ADD CONSTRAINT mytable_pkey PRIMARY KEY (id);
CREATE INDEX ind_dn_ids ON mytable USING gin (dn_ids);
CREATE INDEX index_mytable_on_current ON mytable USING btree (current);
CREATE INDEX index_mytable_on_other_id ON mytable USING btree (other_id);
CREATE INDEX index_mytable_on_other_id_and_current ON mytable USING btree (other_id, current);
I need to optimize queries like this:
SELECT id, dn_ids
FROM mytable
WHERE other_id = 5 AND current = F AND NOT (ARRAY[100,200] && dn_ids)
ORDER BY rank ASC
LIMIT 500 OFFSET 1000
This query works fine, but I'm sure it could be much faster with smart indexing. There are about 250,000 rows in the table and I always have current = F as a predicate. The input array I'm comparing to the stored array will have 1-9 integers, as well. The other_id can vary. But generally, before limiting, the scan will match between 0-25,000 rows.
Here's an example EXPLAIN:
Limit (cost=36944.53..36945.78 rows=500 width=65)
-> Sort (cost=36942.03..37007.42 rows=26156 width=65)
Sort Key: rank
-> Seq Scan on mytable (cost=0.00..35431.42 rows=26156 width=65)
Filter: ((NOT current) AND (NOT ('{-1,35257,35314}'::integer[] && dn_ids)) AND (other_id = 193))
Other answers on this site and the Postgres docs suggest it's possible to add a compound index to improve performance. I already have one on [other_id, current]. I've also read in various places that indexing can improve the performance of the ORDER BY in addition to the WHERE clause.
What's the right type of compound index to use for this query? I don't care about space at all.
Does it matter much how I order the terms in the WHERE clause?
What's the right type of compound index to use for this query? I don't care about space at all.
This depends on the complete situation. Either way, the GIN index you already have is most probably superior to a GiST index in your case:
Difference between GiST and GIN index
You can combine either with integer columns once you install the additional module btree_gin (or btree_gist, respectively).
Multicolumn index on 3 fields with heterogenous data types
However, that does not cover the boolean data type, which typically doesn't make sense as index column to begin with. With just two (three incl. NULL) possible values it's not selective enough.
And a plain btree index is more efficient for integer. While a multicolumn btree index on two integer columns would certainly help, you'll have to test carefully if combining (other_id, dn_ids) in a multicolumn GIN index is worth more than it costs. Probably not. Postgres can combine multiple indexes in a bitmap index scan rather efficiently.
Finally, while indexes can be used for sorted output, this will probably not pay to apply for a query like you display (unless you select large parts of the table).
Not applicable to updated question.
Partial indexes might be an option. Other than that, you already have all the indexes you need.
I would drop the pointless index on the boolean column current completely, and the index on just rank is probably never used for this query.
Does it matter much how I order the terms in the WHERE clause?
The order of WHERE conditions is completely irrelevant.
Addendum after question update
The utility of indexes is bound to selective criteria. If more than roughly 5 % (depends on various factors) of the table are selected, a sequential scan of the whole table is typically faster than dealing with the overhead on any indexes - except for pre-sorting output, that's the one thing an index is still good for in such cases.
For a query that fetches 25,000 of 250,000 rows, indexes are mostly just for that - which gets all the more interesting if you attach a LIMIT clause. Postgres can stop fetching rows from an index once the LIMIT is satisfied.
Be aware that Postgres always needs to read OFFSET + LIMIT rows, so performance deteriorate with the sum of both.
Even with your added information, much of what's relevant is still in the dark. I am going to assume that:
Your predicate NOT (ARRAY[100,200] && dn_ids) is not very selective. Ruling out 1 to 10 ID values should typically retain the majority of rows unless you have very few distinct elements in dn_ids.
The most selective predicate is other_id = 5.
A substantial part of the rows is eliminated with NOT current.
Aside: current = F isn't valid syntax in standard Postgres. Must be NOT current or current = FALSE;
While a GIN index would be great to identify few rows with matching arrays faster than any other index type, this seems hardly relevant for your query. My best guess is this partial, multicolumn btree index:
CREATE INDEX foo ON mytable (other_id, rank, dn_ids)
WHERE NOT current;
The array column dn_ids in a btree index cannot support the && operator, I just include it to allow index-only scans and filter rows before accessing the heap (the table). May even be faster without dn_ids in the index:
CREATE INDEX foo ON mytable (other_id, rank) WHERE NOT current;
GiST indexes may become more interesting in Postgres 9.5 due to this new feature:
Allow GiST indexes to perform index-only scans (Anastasia Lubennikova,
Heikki Linnakangas, Andreas Karlsson)
Aside: current is a reserved word in standard SQL, even if it's allowed as identifier in Postgres.
Aside 2: I assume id is an actual serial column with the column default set. Just creating a sequence like you demonstrate, would do nothing.
Auto increment SQL function
Unfortunately I don't think you can combine a BTree and a GIN/GIST index into a single compound index, so the planner is going to have to choose between using the other_id index or the dn_ids index. One advantage of using other_id, as you pointed out, is that you could use a multicolumn index to improve the sort performance. The way you would do this would be
CREATE INDEX index_mytable_on_other_id_and_current
ON mytable (other_id, rank) WHERE current = F;
This is using a partial index, and will allow you to skip the sort step when you are sorting by rank and querying on other_id.
Depending on the cardinality of other_id, the only benefit of this might be the sorting. Because your plan has a LIMIT clause, it's difficult to tell. SEQ scans can be the fastest option if you're using > 1/5 of the table, especially if you're using a standard HDD instead of solid state. If you're planner insists on SEQ scanning when you know an IDX scan is faster (you've tested with enable_seqscan false, you may want to try fine tuning your random_page_cost or effective_cache_size.
Finally, I'd recomment not keeping all of these indexes. Find the ones you need, and cull the rest. Indexes cause huge performance degradation in inserts (especially mutli-column and GIN/GIST indexes).
The simplest index for your query is mytable(other_id, current). This handles the first two conditions. This would be a normal b-tree type index.
You can satisfy the array condition using a GIST index on mytable(dn_ids).
However, I don't think you can mix the different data types in one index, at least not without extensions.

Create an index for json_array_elements in PostgreSQL

I need to create an index from a query that uses json_array_elements()
SELECT *, json_array_elements(nested_json_as_text::json) as elements FROM my_table
Since the json contains multiple elements, the result is that the original index is now duplicated across rows and no longer unique.
I am not very familiar with creating indices and want to avoid doing anything destructive. What is the best way to create a column of unique integers for this case?
Found an answer:
SELECT *, json_array_elements(nested_json_as_text::json) as elements, row_number() over () as my_index FROM my_table

Optimization of count query for PostgreSQL

I have a table in postgresql that contains an array which is updated constantly.
In my application i need to get the number of rows for which a specific parameter is not present in that array column. My query looks like this:
select count(id)
from table
where not (ARRAY['parameter value'] <# table.array_column)
But when increasing the amount of rows and the amount of executions of that query (several times per second, possibly hundreds or thousands) the performance decreses a lot, it seems to me that the counting in postgresql might have a linear order of execution (I’m not completely sure of this).
Basically my question is:
Is there an existing pattern I’m not aware of that applies to this situation? what would be the best approach for this?
Any suggestion you could give me would be really appreciated.
PostgreSQL actually supports GIN indexes on array columns. Unfortunately, it doesn't seem to be usable for NOT ARRAY[...] <# indexed_col, and GIN indexes are unsuitable for frequently-updated tables anyway.
Demo:
CREATE TABLE arrtable (id integer primary key, array_column integer[]);
INSERT INTO arrtable(1, ARRAY[1,2,3,4]);
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
-- Use the following *only* for testing whether Pg can use an index
-- Do not use it in production.
SET enable_seqscan = off;
explain (buffers, analyze) select count(id)
from arrtable
where not (ARRAY[1] <# arrtable.array_column);
Unfortunately, this shows that as written we can't use the index. If you don't negate the condition it can be used, so you can search for and count rows that do contain the search element (by removing NOT).
You could use the index to count entries that do contain the target value, then subtract that result from a count of all entries. Since counting all rows in a table is quite slow in PostgreSQL (9.1 and older) and requires a sequential scan this will actually be slower than your current query. It's possible that on 9.2 an index-only scan can be used to count the rows if you have a b-tree index on id, in which case this might actually be OK:
SELECT (
SELECT count(id) FROM arrtable
) - (
SELECT count(id) FROM arrtable
WHERE (ARRAY[1] <# arrtable.array_column)
);
It's guaranteed to perform worse than your original version for Pg 9.1 and below, because in addition to the seqscan your original requires it also needs an GIN index scan. I've now tested this on 9.2 and it does appear to use an index for the count, so it's worth exploring for 9.2. With some less trivial dummy data:
drop index arrtable_arraycolumn_gin_arr_idx ;
truncate table arrtable;
insert into arrtable (id, array_column)
select s, ARRAY[1,2,s,s*2,s*3,s/2,s/4] FROM generate_series(1,1000000) s;
CREATE INDEX arrtable_arraycolumn_gin_arr_idx
ON arrtable USING GIN(array_column);
Note that a GIN index like this will slow updates down a LOT, and is quite slow to create in the first place. It is not suitable for tables that get updated much at all - like your table.
Worse, the query using this index takes up to twice times as long as your original query and at best half as long on the same data set. It's worst for cases where the index is not very selective like ARRAY[1] - 4s vs 2s for the original query. Where the index is highly selective (ie: not many matches, like ARRAY[199]) it runs in about 1.2 seconds vs the original's 3s. This index simply isn't worth having for this query.
The lesson here? Sometimes, the right answer is just to do a sequential scan.
Since that won't do for your hit rates, either maintain a materialized view with a trigger as #debenhur suggests, or try to invert the array to be a list of parameters that the entry does not have so you can use a GiST index as #maniek suggests.
Is there an existing pattern I’m not aware of that applies to this
situation? what would be the best approach for this?
Your best bet in this situation might be to normalize your schema. Split the array out into a table. Add a b-tree index on the table of properties, or order the primary key so it's efficiently searchable by property_id.
CREATE TABLE demo( id integer primary key );
INSERT INTO demo (id) SELECT id FROM arrtable;
CREATE TABLE properties (
demo_id integer not null references demo(id),
property integer not null,
primary key (demo_id, property)
);
CREATE INDEX properties_property_idx ON properties(property);
You can then query the properties:
SELECT count(id)
FROM demo
WHERE NOT EXISTS (
SELECT 1 FROM properties WHERE demo.id = properties.demo_id AND property = 1
)
I expected this to be a lot faster than the original query, but it's actually much the same with the same sample data; it runs in the same 2s to 3s range as your original query. It's the same issue where searching for what is not there is much slower than searching for what is there; if we're looking for rows containing a property we can avoid the seqscan of demo and just scan properties for matching IDs directly.
Again, a seq scan on the array-containing table does the job just as well.
I think with Your current data model You are out of luck. Try to think of an algorithm that the database has to execute for Your query. There is no way it could work without sequential scanning of data.
Can You arrange the column so that it stores the inverse of data (so that the the query would be select count(id) from table where ARRAY[‘parameter value’] <# table.array_column) ? This query would use a gin/gist index.

Is PostgreSQL IN() statement still fast with up to 1000 arguments?

I'm querying to return all rows from a table except those that are in some list of values that is constant at query time. E.g. SELECT * FROM table WHERE id IN (%), and % is guaranteed to be a list of values, not be a subquery. However, this list of values may be up to 1000 elements long in some cases. Should I limit this to a smaller sublist (as few as 50-100 elements is as low as I can go, in this case) or will there be a negligible performance gain?
I assume it's a large table, otherwise it wouldn't matter much.
Depending on table size and number of keys, this may turn into a sequence scan. If there are many IN keys, Postgres often chooses not to use an index for it. The more keys, the bigger the chance of a sequence scan.
If you use another indexed column in WHERE, like:
select * from table where id in (%) and my_date > '2010-01-01';
It's likely to fetch all rows matching the indexed (my_date) columns, and then perform an in-memory scan on them.
Using a JOIN to a persistent or temporary table may, but does not have to help. It still will need to locate all the rows, either with a nested loop (unlikely for large data), or for a hash/merge join.
I would say the solution is:
Use as few IN keys as possible.
Use other criteria for indexing and querying whenever possible. If IN requires an in-memory scan of all rows, at least there will be fewer of them thanks to additional criteria.
Use a temporary table to JOIN, gives better performance and has no limits. An IN() having a 1000 arguments, will give you problems in any database.