Is this the right way to create a partial index in Postgres? - postgresql

We have a table with 4 million records, and for a particular, frequently used use case we are only interested in records with a salesforce userType of 'Standard', which make up only about 10,000 of the 4 million. The other userType values that can occur are 'PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess' and 'CsnOnly'.
So for this use case I thought creating a partial index would be better, as per the documentation.
So I am planning to create this partial index to speed up queries for records with a userType of 'Standard' and to prevent the web request from timing out:
CREATE INDEX user_type_idx ON user_table(userType)
WHERE userType NOT IN
('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
The lookup query will be
select * from user_table where userType='Standard';
Could you please confirm if this is the right way to create the partial index? It would be of great help.

Postgres can use that index, but it does so in a way that is (slightly) less efficient than an index specifying where user_type = 'Standard'.
I created a small test table with 4 million rows, 10,000 of them having the user_type 'Standard'. The other values were randomly distributed using the following script:
create table user_table
(
id serial primary key,
some_date date not null,
user_type text not null,
some_ts timestamp not null,
some_number integer not null,
some_data text,
some_flag boolean
);
insert into user_table (some_date, user_type, some_ts, some_number, some_data, some_flag)
select current_date,
case (random() * 4 + 1)::int
when 1 then 'PowerPartner'
when 2 then 'CSPLitePortal'
when 3 then 'CustomerSuccess'
when 4 then 'PowerCustomerSuccess'
when 5 then 'CsnOnly'
end,
clock_timestamp(),
42,
rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
(random() + 1)::int = 1
from generate_series(1,4e6 - 10000) as t(i)
union all
select current_date,
'Standard',
clock_timestamp(),
42,
rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
(random() + 1)::int = 1
from generate_series(1,10000) as t(i);
(I create test tables with more than just a few columns, as the planner's choices are also driven by the size and width of the tables.)
The first test using the index with NOT IN:
create index ix_not_in on user_table(user_type)
where user_type not in ('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
explain (analyze true, verbose true, buffers true)
select *
from user_table
where user_type = 'Standard'
Results in:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on stuff.user_table (cost=139.68..14631.83 rows=11598 width=139) (actual time=1.035..2.171 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Recheck Cond: (user_table.user_type = 'Standard'::text)
Buffers: shared hit=262
-> Bitmap Index Scan on ix_not_in (cost=0.00..136.79 rows=11598 width=0) (actual time=1.007..1.007 rows=10000 loops=1)
Index Cond: (user_table.user_type = 'Standard'::text)
Buffers: shared hit=40
Total runtime: 2.506 ms
(The above is a typical execution time after I ran the statement about 10 times to eliminate caching issues)
As you can see, the planner uses a Bitmap Index Scan, which is a "lossy" scan that needs an extra step (the Recheck Cond on the heap) to filter out false positives.
When using the following index:
create index ix_standard on user_table(id)
where user_type = 'Standard';
This results in the following plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_standard on stuff.user_table (cost=0.29..443.16 rows=10267 width=139) (actual time=0.011..1.498 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Buffers: shared hit=313
Total runtime: 1.815 ms
Conclusion:
Your index is used but an index on only the type that you are interested in is a bit more efficient.
The runtime is not that much different. I executed each explain about 10 times; the average for the ix_standard index was slightly below 2 ms, and the average for the ix_not_in index was slightly above 2 ms, so there is no real performance difference.
But in general, an Index Scan will scale better with increasing table size than a Bitmap Index Scan. This is basically due to the "Recheck Cond" step, especially if not enough work_mem is available to keep the bitmap in memory (for larger tables).

For the index to be used, the query has to repeat the WHERE condition just as you wrote it in the index.
PostgreSQL has some ability to make deductions, but it won't be able to infer that userType = 'Standard' is equivalent to the condition in the index.
Use EXPLAIN to find out if your index can be used.
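To illustrate this answer's advice, here is a sketch (the index name is my own choice; table and values are from the question): spell the index predicate exactly like the query's condition, so the planner only has to show that the query's WHERE clause trivially implies the predicate.

```sql
-- Index predicate spelled exactly like the query's condition:
CREATE INDEX user_type_standard_idx ON user_table (userType)
WHERE userType = 'Standard';

-- Check whether the planner picks it up:
EXPLAIN
SELECT * FROM user_table WHERE userType = 'Standard';
```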

Related

restriction in second position - index not used - why?

I have created the example below and do not understand why the planner does not use index i2 for the query. As can be seen in pg_stats, it understands that column uniqueIds contains unique values. It also understands that column fourOtherIds contains only 4 different values. Shouldn't a search of index i2 then be by far the fastest way, looking for uniqueIds in only four different index leaves of fourOtherIds? What is wrong with my understanding of how an index works? Why does it think using i1 makes more sense here, even though it has to filter out 333,333 rows? In my understanding it should use i2 first to find the one row (or few rows, as there is no unique constraint) that has uniqueIds = 4000, and then apply fourIds = 1 as a filter.
create table t (fourIds int, uniqueIds int,fourOtherIds int);
insert into t ( select 1,*,5 from generate_series(1 ,1000000));
insert into t ( select 2,*,6 from generate_series(1000001,2000000));
insert into t ( select 3,*,7 from generate_series(2000001,3000000));
insert into t ( select 4,*,8 from generate_series(3000001,4000000));
create index i1 on t (fourIds);
create index i2 on t (fourOtherIds,uniqueIds);
analyze t;
select n_distinct,attname from pg_stats where tablename = 't';
/*
n_distinct|attname |
----------+------------+
4.0|fourids |
-1.0|uniqueids |
4.0|fourotherids|
*/
explain analyze select * from t where fourIds = 1 and uniqueIds = 4000;
/*
QUERY PLAN |
--------------------------------------------------------------------------------------------------------------------------+
Gather (cost=1000.43..22599.09 rows=1 width=12) (actual time=0.667..46.818 rows=1 loops=1) |
Workers Planned: 2 |
Workers Launched: 2 |
-> Parallel Index Scan using i1 on t (cost=0.43..21598.99 rows=1 width=12) (actual time=25.227..39.852 rows=0 loops=3)|
Index Cond: (fourids = 1) |
Filter: (uniqueids = 4000) |
Rows Removed by Filter: 333333 |
Planning Time: 0.107 ms |
Execution Time: 46.859 ms |
*/
Not every conceivable optimization has been implemented. You are looking for a variant of an index skip scan, also known as a loose index scan. PostgreSQL does not implement those automatically (yet; people were working on it, but I don't know if they still are. Also, I think I've read that one of the third-party extensions/forks, Citus maybe, has implemented it). You can emulate one yourself using a recursive CTE, but that is quite annoying to do.
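For completeness, here is a sketch of that recursive-CTE emulation for the table above (the standard loose-index-scan pattern; untested against this exact setup): it first walks the 4 distinct fourOtherIds values one by one, then probes i2 once per value.

```sql
WITH RECURSIVE vals AS (
  SELECT min(fourOtherIds) AS v FROM t
  UNION ALL
  SELECT (SELECT min(fourOtherIds) FROM t WHERE fourOtherIds > vals.v)
  FROM vals
  WHERE vals.v IS NOT NULL
)
-- one index probe on i2 per distinct fourOtherIds value
SELECT t.*
FROM vals
JOIN t ON t.fourOtherIds = vals.v
      AND t.uniqueIds = 4000;
```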

Junction table vs. many columns vs. arrays in PostgreSQL: Memory and performance

I'm building a Postgres database for a product search (up to 3 million products) with large groups of similar data for each product, e.g. the prices for different countries, and country-specific average ratings, with up to 170 countries.
The natural solution seems to use arrays (e.g. a real[] column for the prices and another for the ratings). However, the data needs to be indexed individually for each country for sorting and range queries (the data for different countries is not reliably correlated). So from this discussion I think it would be better to use individual columns for each country.
There are about 8 country-specific properties of which maybe 4 need to be indexed, so I may end up with more than 1300 columns and 650 indexes. Might that be a problem? Is there a better solution?
EDIT, after everyone told me about many-to-many relationships, normalization and so on:
I am not convinced. If I understand correctly, this always comes down to a junction table (known under many names), as in Erwin Brandstetter's answer.
As I mentioned in my first comment, this would be a great solution if for each product there were prices and ratings for a few countries only. If this is not the case, however, a junction table may lead to a significantly higher memory requirement (consider the ever-repeated product-id and country-id, and, even more serious, the row overhead for a narrow table with hundreds of millions of rows).
Here is a Python script to demonstrate this. It creates a junction table product_country for prices and ratings of products in different countries, and a "multi-column table" products for the same. The tables are populated with random values for 100,000 products and 100 countries.
For simplicity I use ints to identify products and countries, and for the junction-table-approach, I only build the junction table.
import psycopg2
from psycopg2.extras import execute_values
from random import random
from time import time

cn = psycopg2.connect(...)
cn.autocommit = True
cr = cn.cursor()

num_countries = 100
num_products = 100000


def junction_table():
    print("JUNCTION TABLE")
    cr.execute("CREATE TABLE product_country (product_id int, country_id int, "
               "price real, rating real, PRIMARY KEY (product_id, country_id))")
    t = time()
    for p in range(num_products):
        # use batch-insert, without that it would be about 10 times slower
        execute_values(cr, "INSERT INTO product_country "
                           "(product_id, country_id, price, rating) VALUES %s",
                       [[p, c, random() * 100, random() * 5]
                        for c in range(num_countries)])
    print(f"Insert data took {int(time() - t)}s")
    t = time()
    cr.execute("CREATE INDEX i_price ON product_country (country_id, price)")
    cr.execute("CREATE INDEX i_rating ON product_country (country_id, rating)")
    print(f"Creating indexes took {int(time() - t)}s")
    sizes('product_country')


def many_column_table():
    print("\nMANY-COLUMN TABLE")
    cr.execute("CREATE TABLE products (product_id int PRIMARY KEY, "
               + ', '.join([f'price_{i} real' for i in range(num_countries)]) + ', '
               + ', '.join([f'rating_{i} real' for i in range(num_countries)]) + ')')
    t = time()
    for p in range(num_products):
        cr.execute("INSERT INTO products (product_id, "
                   + ", ".join([f'price_{i}' for i in range(num_countries)]) + ', '
                   + ", ".join([f'rating_{i}' for i in range(num_countries)]) + ') '
                   + "VALUES (" + ",".join(["%s"] * (1 + 2 * num_countries)) + ') ',
                   [p] + [random() * 100 for i in range(num_countries)]
                   + [random() * 5 for i in range(num_countries)])
    print(f"Insert data took {int(time() - t)}s")
    t = time()
    for i in range(num_countries):
        cr.execute(f"CREATE INDEX i_price_{i} ON products (price_{i})")
        cr.execute(f"CREATE INDEX i_rating_{i} ON products (rating_{i})")
    print(f"Creating indexes took {int(time() - t)}s")
    sizes('products')


def sizes(table_name):
    cr.execute(f"SELECT pg_size_pretty(pg_relation_size('{table_name}'))")
    print("Table size: " + cr.fetchone()[0])
    cr.execute(f"SELECT pg_size_pretty(pg_indexes_size('{table_name}'))")
    print("Indexes size: " + cr.fetchone()[0])


if __name__ == '__main__':
    junction_table()
    many_column_table()
Output:
JUNCTION TABLE
Insert data took 179s
Creating indexes took 28s
Table size: 422 MB
Indexes size: 642 MB
MANY-COLUMN TABLE
Insert data took 138s
Creating indexes took 31s
Table size: 87 MB
Indexes size: 433 MB
Most importantly, the total size (table+indexes) of the junction table is about twice the size of the many-column table, and the table-only size is even nearly 5 times larger.
This is easily explained by the row-overhead and the repeated product-id and country-id in each row (10,000,000 rows, vs. just 100,000 rows of the many-column table).
The sizes scale approximately linearly with the number of products (I tested with 700,000 products), so for 3 million products the junction table would be about 32 GB (12.7 GB relation + 19.2 GB indexes), while the many-column table would be just 15.6 GB (2.6 GB table + 13 GB indexes), which is decisive if everything should be cached in RAM.
Query times are about the same when everything is cached; here is a somewhat typical example for 700,000 products:
EXPLAIN (ANALYZE, BUFFERS)
SELECT product_id, price, rating FROM product_country
WHERE country_id=7 and price < 10
ORDER BY rating DESC LIMIT 200
-- Limit (cost=0.57..1057.93 rows=200 width=12) (actual time=0.037..2.250 rows=200 loops=1)
-- Buffers: shared hit=2087
-- -> Index Scan Backward using i_rating on product_country (cost=0.57..394101.22 rows=74544 width=12) (actual time=0.036..2.229 rows=200 loops=1)
-- Index Cond: (country_id = 7)
-- Filter: (price < '10'::double precision)
-- Rows Removed by Filter: 1871
-- Buffers: shared hit=2087
-- Planning Time: 0.111 ms
-- Execution Time: 2.364 ms
EXPLAIN (ANALYZE, BUFFERS)
SELECT product_id, price_7, rating_7 FROM products
WHERE price_7 < 10
ORDER BY rating_7 DESC LIMIT 200
-- Limit (cost=0.42..256.82 rows=200 width=12) (actual time=0.023..2.007 rows=200 loops=1)
-- Buffers: shared hit=1949
-- -> Index Scan Backward using i_rating_7 on products (cost=0.42..91950.43 rows=71726 width=12) (actual time=0.022..1.986 rows=200 loops=1)
-- Filter: (price_7 < '10'::double precision)
-- Rows Removed by Filter: 1736
-- Buffers: shared hit=1949
-- Planning Time: 0.672 ms
-- Execution Time: 2.265 ms
Regarding flexibility, data integrity etc., I see no serious problem with the multi-column approach: I can easily add and delete columns for countries, and if a sensible naming scheme is used for the columns it should be easy to avoid mistakes.
So I think I have every reason not to use a junction table.
Further, with arrays everything would be clearer and simpler than with many columns, and if there were a way to easily define individual indexes for the array elements, that would be the best solution (maybe even the total index size could be reduced?).
So I think my original question is still valid. However, there is much more to consider and to test, of course. Also, I'm in no way a database expert, so tell me if I'm wrong.
Here are the test tables from the script for 5 products and 3 countries:
The "natural" solution for a relational database is to create additional tables in one-to-many or many-to-many relationships. Look into database normalization.
Basic m:n design for product ratings per country:
CREATE TABLE country (
country_id varchar(2) PRIMARY KEY
, country text UNIQUE NOT NULL
);
CREATE TABLE product (
product_id int GENERATED ALWAYS AS IDENTITY PRIMARY KEY
, product text NOT NULL
-- more?
);
CREATE TABLE product_ratings (
product_id int REFERENCES product
, country_id varchar(2) REFERENCES country
, rating1 real
, rating2 real
-- more?
, PRIMARY KEY (product_id, country_id)
);
More details:
How to implement a many-to-many relationship in PostgreSQL?
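With that design, the per-country sorted searches from the question would look something like this (a sketch; the index, the 'de' country code, and the use of rating1 as the sort column are my assumptions):

```sql
-- supports "top ratings for one country" directly
CREATE INDEX ON product_ratings (country_id, rating1 DESC);

SELECT p.product, r.rating1
FROM product_ratings r
JOIN product p USING (product_id)
WHERE r.country_id = 'de'
ORDER BY r.rating1 DESC
LIMIT 200;
```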

Index not used on filterd aggregate query

I need to increase performance of the following query, which filters on column status_classification and aggregrates on classification -> 'flags' (a jsonb field in the form: '{"flags": ["NO_CLASS_FOUND"]}'::jsonb):
SELECT SUM(CASE WHEN ("result_materials"."classification" -> 'flags') #> '["NO_CLASS_FOUND"]' THEN 1 ELSE 0 END) AS "no_class_found",
SUM(CASE WHEN ("result_materials"."classification" -> 'flags') #> '["RULE"]' THEN 1 ELSE 0 END) AS "rule",
SUM(CASE WHEN ("result_materials"."classification" -> 'flags') #> '["NO_MAPPING"]' THEN 1 ELSE 0 END) AS "no_mapping"
FROM "result_materials"
WHERE "result_materials"."status_classification" = 'PROCESSED';
To improve performance I created an index on status_classification, but the query plan shows that the index is never used and a Seq Scan is performed:
Aggregate (cost=1010.15..1010.16 rows=1 width=24) (actual time=19.942..19.946 rows=1 loops=1)
-> Seq Scan on result_materials (cost=0.00..869.95 rows=6231 width=202) (actual time=0.024..4.660 rows=6231 loops=1)
Filter: ((status_classification)::text = 'PROCESSED'::text)
Rows Removed by Filter: 5
Planning Time: 1.212 ms
Execution Time: 20.187 ms
I've tried (all SQL at the end of the question):
adding an index to status_classification
adding a GIN index to classification -> 'flags'
adding a multi field GIN index, with classification -> 'flags' and status_classification (see here)
The index is still not used, and performance suffers as the table grows. Cardinality is low in the status_classification field, but the entries in classification -> 'flags' are quite rare, so I would have thought an index very practical here.
Why is the index not used? What am I doing wrong?
SQL to recreate my db:
create table result_materials (
uuid int,
status_classification varchar(30),
classification jsonb
);
insert into result_materials (uuid, classification, status_classification)
select seq
, case (random() * 2)::int
when 0 then '{"flags": ["NO_CLASS_FOUND"]}'::jsonb
when 1 then '{"flags": ["RULE"]}'::jsonb
when 2 then '{"flags": ["NO_MAPPING"]}'::jsonb end
as dummy
, case (random() * 2)::int
when 0 then 'NOT_PROCESSABLE'
when 1 then 'PROCESSABLE'
when 2 then 'PROCESSED' end   -- third branch added, so the queried status exists and no NULLs are generated
as sta
from generate_series(1, 150000) seq;
Indexes attempted:
-- status_classification
create index other_testes on result_materials (status_classification);
-- classification -> 'flags'
CREATE INDEX idx_testes ON result_materials USING gin ((classification -> 'flags'));
-- multi field gin
-- REQUIRES you to run: CREATE EXTENSION btree_gin;
CREATE INDEX idx_testes ON result_materials USING gin ((classification -> 'flags'), status_classification);
The query takes 20 ms and the filter removes only 5 of roughly 6,200 rows, so a sequential scan is a good choice here.
Try adding more rows to the table, and check the selectivity of your WHERE clause.
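One way to test that at scale, sticking with the setup script from the question (a sketch; the partial index is my own suggestion and only pays off if 'PROCESSED' rows stay rare):

```sql
-- grow the table with rows that do NOT match the query, then re-check the plan
INSERT INTO result_materials (uuid, classification, status_classification)
SELECT seq, '{"flags": ["RULE"]}'::jsonb, 'NOT_PROCESSABLE'
FROM generate_series(150001, 3000000) seq;

ANALYZE result_materials;

-- if 'PROCESSED' is rare, a partial index keeps it small and makes it attractive
CREATE INDEX idx_processed ON result_materials (status_classification)
WHERE status_classification = 'PROCESSED';

EXPLAIN ANALYZE
SELECT count(*) FROM result_materials
WHERE status_classification = 'PROCESSED';
```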

Limits of Postgres Query Optimization (Already using Index-Only Scans)

I have a Postgres query that has already been optimized, but we're hitting 100% CPU usage under peak load, so I wanted to see if there's more that can yet be done in optimizing the database interactions. It already is using two index-only scans in the join, so my question is if there's much more to be done on the Postgres side of things.
The database is an Amazon-hosted Postgres RDS db.m3.2xlarge instance (8 vCPUs and 30 GB of memory) running 9.4.1, and the results below are from a period with low CPU usage and minimal connections (around 15). Peak usage is around 300 simultaneous connections, and that's when we're maxing our CPU (which kills performance on everything).
Here's the query and the EXPLAIN:
Query:
EXPLAIN (ANALYZE, BUFFERS)
SELECT m.valdate, p.index_name, m.market_data_closing, m.available_date
FROM md.market_data_closing m
JOIN md.primitive p on (m.primitive_id = p.index_id)
where p.index_name = ?
order by valdate desc
;
Output:
Sort (cost=183.80..186.22 rows=967 width=44) (actual time=44.590..54.788 rows=11133 loops=1)
Sort Key: m.valdate
Sort Method: quicksort Memory: 1254kB
Buffers: shared hit=181
-> Nested Loop (cost=0.85..135.85 rows=967 width=44) (actual time=0.041..32.853 rows=11133 loops=1)
Buffers: shared hit=181
-> Index Only Scan using primitive_index_name_index_id_idx on primitive p (cost=0.29..4.30 rows=1 width=25) (actual time=0.018..0.019 rows=1 loops=1)
Index Cond: (index_name = '?'::text)
Heap Fetches: 0
Buffers: shared hit=3
-> Index Only Scan using market_data_closing_primitive_id_valdate_available_date_mar_idx on market_data_closing m (cost=0.56..109.22 rows=2233 width=27) (actual time=0.016..12.059 rows=11133 loops=1)
Index Cond: (primitive_id = p.index_id)
Heap Fetches: 42
Buffers: shared hit=178
Planning time: 0.261 ms
Execution time: 64.957 ms
Here are the table sizes:
md.primitive: 14283 rows
md.market_data_closing: 13544087 rows
For reference, here is the underlying spec for the tables and indices:
CREATE TABLE md.primitive(
index_id serial NOT NULL,
index_name text NOT NULL UNIQUE,
index_description text not NULL,
index_source_code text NOT NULL DEFAULT 'MAN',
index_source_spec json NOT NULL DEFAULT '{}',
frequency text NULL,
primitive_type text NULL,
is_maintained boolean NOT NULL default true,
create_dt timestamp NOT NULL,
create_user text NOT NULL,
update_dt timestamp not NULL,
update_user text not NULL,
PRIMARY KEY
(
index_id
)
) ;
CREATE INDEX ON md.primitive
(
index_name ASC,
index_id ASC
);
CREATE TABLE md.market_data_closing(
valdate timestamp NOT NULL,
primitive_id int references md.primitive,
market_data_closing decimal(28, 10) not NULL,
available_date timestamp NULL,
pricing_source text not NULL,
create_dt timestamp NOT NULL,
create_user text NOT NULL,
update_dt timestamp not NULL,
update_user text not NULL,
PRIMARY KEY
(
valdate,
primitive_id
)
) ;
CREATE INDEX ON md.market_data_closing
(
primitive_id ASC,
valdate DESC,
available_date DESC,
market_data_closing ASC
);
What else can be done?
It seems the nested loop is taking an absurd amount of time even though the primitive table returns only one row. You can try eliminating the nested loop by doing something like this:
SELECT m.valdate, m.market_data_closing, m.available_date
FROM md.market_data_closing m
WHERE m.primitive_id = (SELECT p.index_id
                        FROM md.primitive p
                        WHERE p.index_name = ?
                        OFFSET 0) -- the OFFSET 0 is probably not needed; try it without
ORDER BY valdate DESC;
This does not return p.index_name but that can be easily fixed by selecting it as a const.
For future generations reading this: the problem seems to be with the index
md.market_data_closing(
...
PRIMARY KEY
(
valdate,
primitive_id
)
This seems to be an incorrect index. Should be:
md.market_data_closing(
...
PRIMARY KEY
(
primitive_id,
valdate
)
Explanation why. This kind of query:
...
JOIN md.primitive p on (m.primitive_id = p.index_id)
...
will only be effective if primitive_id is the first field.
Also,
order by valdate
will be more effective if valdate is the second field.
Why?
Because index is a B-tree structure.
(
valdate,
primitive_id
)
results in
valdate1
primitive_id1
primitive_id2
primitive_id3
valdate2
primitive_id1
primitive_id2
Using this tree you can't search by primitive_id effectively.
But
(
primitive_id,
valdate
)
results in
primitive_id1
valdate1
valdate2
valdate3
primitive_id2
valdate1
valdate2
Which is effective for looking up by primitive_id.
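The effect of the column order can be simulated outside the database with two sorted Python lists standing in for the two B-tree layouts (my own illustration, not part of the original answer):

```python
import bisect
from random import randrange, seed

seed(42)
# fake rows: (valdate, primitive_id) with many dates and few primitives
rows = [(randrange(3650), randrange(100)) for _ in range(100_000)]

# B-tree keyed (valdate, primitive_id): rows for one primitive_id are
# scattered across every date group, so we must walk the whole index.  O(n)
by_valdate = sorted(rows)
hits_full_scan = [r for r in by_valdate if r[1] == 7]

# B-tree keyed (primitive_id, valdate): rows for primitive_id = 7 are
# contiguous, so two binary searches bound the range.  O(log n + k)
by_primitive = sorted((pid, vd) for vd, pid in rows)
lo = bisect.bisect_left(by_primitive, (7,))
hi = bisect.bisect_left(by_primitive, (8,))
hits_range = [(vd, pid) for pid, vd in by_primitive[lo:hi]]

# both layouts find the same rows; only the work needed differs
assert sorted(hits_full_scan) == sorted(hits_range)
```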
There is another solution to this problem: if you don't want to change the index, you can add a strict equality condition on valdate, say valdate = some_date. This makes the existing index effective.

Redshift SELECT * performance versus COUNT(*) for non existent row

I am confused about what Redshift is doing when I run 2 seemingly similar queries. Neither should return a result (querying a profile that doesn't exist). Specifically:
SELECT * FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 36.75s
versus
SELECT COUNT(*) FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 0.2s
Given that the table is sorted by project_id then id, I would have thought this is just a key lookup. The SELECT COUNT(*) ... returns 0 results in 0.2 s, which is about what I would expect. The SELECT * ... returns 0 results in 36.75 s. That's a huge difference for the same result, and I don't understand why.
If it helps schema as follows:
CREATE TABLE profile (
project_id integer not null,
id varchar(256) not null,
created timestamp not null,
/* ... approx 50 other columns here */
)
DISTKEY(id)
SORTKEY(project_id, id);
Explain from SELECT COUNT(*) ...
XN Aggregate (cost=435.70..435.70 rows=1 width=0)
-> XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=0)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Explain from SELECT * ...
XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=7356)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Why is the non count much slower? Surely Redshift knows the row doesn't exist?
The reason is that in many RDBMSs the answer to a count(*) query can often be produced without an actual data scan, just from an index or from table statistics. Redshift stores the minimum and maximum value of each block, and it can use them to answer exists-or-not questions like the one described here. If the requested value lies outside a block's min/max boundaries, the block is skipped and the answer comes quickly from that stored metadata; if it lies inside the boundaries, the block has to be scanned, but for count(*) only the columns named in the WHERE clause need to be read. With SELECT *, Redshift actually scans the data of all columns, because the query asks for "*", while still filtering only on the columns in the WHERE clause. Reading roughly 50 columns instead of 2 is what makes it so much slower.
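If that explanation holds, restricting the select list to the columns that appear in the WHERE clause should restore the fast path, since only those columns' blocks have to be read (a sketch based on the table from the question):

```sql
-- reads only the id and project_id column blocks, like the COUNT(*) variant
SELECT id
FROM profile
WHERE id = 'id_that_doesnt_exist' AND project_id = 1;
```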