Simple lookup query very slow on Postgres, fast in MySQL

I've been beating my head on this since yesterday, and I don't understand what's happening:
I am populating a dimensional schema for a data warehousing project, using Pentaho Kettle to perform a "dimension lookup/update", which basically looks up existing rows in a dimension table, inserts the ones that do not exist, and returns the technical key.
The dimension table itself is very simple:
CREATE TABLE dim_loan
(
_tech_id INTEGER NOT NULL,
loan_id INTEGER,
type TEXT,
interest_rate_type TEXT,
_dim_project_id integer,
_validity_from date,
_validity_to date,
_version integer,
PRIMARY KEY (_tech_id)
);
CREATE INDEX dim_loan_pk_idx ON dim_loan USING btree (_tech_id);
CREATE INDEX dim_loan_compound_idx ON dim_loan USING btree (loan_id, _dim_project_id, _validity_from, _validity_to);
The table should contain, at the end of the process, around 650k rows. The transformation starts fast(ish), at around 1500 rows/sec.
The performance drops steadily, reaching 50 rows/sec by the time the table has around 50k rows.
The queries that Kettle does look like this:
SELECT _tech_id, _version, "type" AS "Loan Type", interest_rate_type AS int_rate, _validity_from, _validity_to FROM "public".dim_loan WHERE loan_id = $1 AND _dim_project_id = $2 AND $3 >= _validity_from AND $4 < _validity_to
An EXPLAIN ANALYZE of the query reports a total runtime of under 0.1 ms:
"Index Scan using dim_loan_compound_idx on dim_loan (cost=0.42..7.97 rows=1 width=42) (actual time=0.043..0.043 rows=0 loops=1)"
" Index Cond: ((loan_id = 1) AND (_dim_project_id = 2) AND ('2016-01-01'::date >= _validity_from) AND ('2016-01-01'::date < _validity_to))"
"Total runtime: 0.078 ms"
Of course, the real execution times are quite different, around 10 ms, which is unacceptable. Enabling slow-query logging with auto_explain, I see entries like this with increasing frequency:
Seq Scan on dim_loan (cost=0.00..2354.21 rows=12 width=52)
Filter: (($3 >= _validity_from) AND ($4 < _validity_to) AND (_dim_project_id = $2) AND ((loan_id)::double precision = $1))
< 2016-12-18 21:30:19.859 CET >LOG: duration: 14.260 ms plan:
Query Text: SELECT _tech_id, _version, "type" AS "Loan Type", interest_rate_type AS int_rate, _validity_from, _validity_to FROM "public".dim_loan WHERE loan_id = $1 AND _dim_project_id = $2 AND $3 >= _validity_from
AND $4 < _validity_to
These entries don't tell the whole story anyway, since it's not only these queries that run slowly, but all of them.
Of course I tried tweaking the memory parameters up to silly amounts, with no real difference in performance. I also tried the latest 9.6, which exhibited the same behavior as the 9.3 I'm actually using.
The same transformation, on a MySQL database with the same indexes, runs happily at 5000 rows/sec from start to finish. I really want to use PG and I'm sure it's something trivial, but what!?
Maybe something with the JDBC driver? I verified that it uses a single connection the whole time, so it's not a connection-overhead issue either...

Just found out that the cause is indeed loan_id being cast to double precision, which of course rendered the index useless! The reason is a wrong assumption made by Kettle about the metadata of this column, which comes from an Excel file.
Now the performance is on par with MySQL! Happy days.
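For illustration, here is a minimal sketch of the difference (the literal values are made up): when the parameter arrives as double precision the column has to be cast, which defeats the compound index, exactly as in the auto_explain filter above; with an integer parameter the index scan comes back.
-- Parameter bound as double precision: the planner casts loan_id and
-- typically falls back to a sequential scan, as in the logged plan.
EXPLAIN SELECT _tech_id, _version
FROM dim_loan
WHERE loan_id = 1::double precision
  AND _dim_project_id = 2
  AND '2016-01-01'::date >= _validity_from
  AND '2016-01-01'::date < _validity_to;

-- Parameter bound as integer: dim_loan_compound_idx can be used.
EXPLAIN SELECT _tech_id, _version
FROM dim_loan
WHERE loan_id = 1
  AND _dim_project_id = 2
  AND '2016-01-01'::date >= _validity_from
  AND '2016-01-01'::date < _validity_to;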

Related

PostgreSQL - how to improve this query/index

PostgreSQL 11.2
Settings:
shared_buffers = 1024MB
effective_cache_size = 2048MB
maintenance_work_mem = 320MB
checkpoint_completion_target = 0.5
wal_buffers = 3932kB
default_statistics_target = 100
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 64MB
max_worker_processes = 4
max_parallel_workers_per_gather = 2
max_parallel_workers = 4
I've got a table with about 40M rows.
The query I'm running on it (more fields exist in the query; it's the WHERE clauses that count):
select id,name from my_table where
action_performed = true AND
should_still_perform_action = false AND
action_performed_at <= '2021-09-05 00:00:00.000'
LIMIT 100;
The date is just something I picked for this example.
The point is to get items that need to be processed. The client retrieving this data would then use the metadata to find a file and upload it to a cloud provider. This can take some time.
The timestamp condition is really only there to say "only process entries older than today" or the given timestamp, in general. The order in which they are returned is of no practical importance, since the goal is to perform processing on any entry that has not yet been processed. The LIMIT was introduced to stop the application doing so from hanging, because of the network activity.
Table definition (redacted):
                              Table "public.my_table"
           Column            |            Type             | Collation | Nullable | Default
-----------------------------+-----------------------------+-----------+----------+---------
 action_performed_at         | timestamp without time zone |           |          | now()
 should_still_perform_action | boolean                     |           | not null | true
 action_performed            | boolean                     |           | not null | false
Indexes:
    "index001" btree (action_performed_at, should_still_perform_action, action_performed) WHERE should_still_perform_action = false AND action_performed = true
    "index002" btree (action_performed, should_still_perform_action, action_performed_at DESC) WHERE should_still_perform_action = false AND action_performed = true
These are all indexes added over time; they all worked at the start, but are no longer being used.
Re-indexing also does not seem to help; only dropping and re-creating them works, and only for a while.
While the table holds 40M rows, the number of rows matching these conditions is only around 100K.
The query plan looks like this:
QUERY PLAN
------------------------------------------------------------------------------------
Limit (cost=0.00..707.80 rows=100 width=3595) (actual time=18520.627..100644.933 rows=100 loops=1)
Buffers: shared hit=0 read=1392361 dirtied=26 written=26
-> Seq Scan on my_table (cost=0.00..4164264.45 rows=5883377 width=3595) (actual time=18520.624..100644.073 rows=100 loops=1)
Filter: (action_performed AND (NOT should_still_perform_action) AND (action_performed_at <= '2021-09-05 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 19846606
Buffers: shared hit=0 read=1392361 dirtied=26 written=26
Planning Time: 63.667ms
Execution Time: 100645.548 ms
(10 rows)
Using the query found here: https://github.com/ioguix/pgsql-bloat-estimation/blob/master/btree/btree_bloat-superuser.sql
This is the result:
current_database | schemaname | tblname | idxname | real_size | extra_size | extra_pct | fillfactor | bloat_size | bloat_pct | is_na
------------------+------------+------------------+---------------------------------------+-------------+-------------+------------------+------------+-------------+------------------+-------
mine | public | my_table | index001 | 343244800 | 341598208 | 99.5202863961814 | 90 | 341426176 | 99.4701670644391 | f
mine | public | my_table | index002 | 3290316800 | 2338521088 | 71.0728245985311 | 90 | 2231902208 | 67.832441180132 | f
And I'm looking for a way to do this better. Sure, I could drop and recreate the index every time I see it slow down, but that's not exactly a good way of doing things.
Changing LIMIT to FETCH FIRST changes nothing.
I'm wondering if I can improve this without changing the SELECT to a cursor-based FETCH, which I've never used before and which I'm not even sure the client can handle.
What should I do here?
EDIT:
After an analyze:
QUERY PLAN
------------------------------------------------------------------------------------
Limit (cost=0.00..690.14 rows=1000 width=3591) (actual time=0.044..5840.228 rows=1000 loops=1)
Buffers: shared hit=3 read=81426 dirtied=18 written=18
-> Seq Scan on my_table (cost=0.00..4163978.60 rows=6033500 width=3591) (actual time=0.034..5839.599 rows=100 loops=1)
Filter: (action_performed AND (NOT should_still_perform_action) AND (action_performed_at <= '2021-09-05 00:00:00'::timestamp without time zone))
Rows Removed by Filter: 953640
Buffers: shared hit=3 read=81426 dirtied=18 written=18
Planning Time: 63.667ms
Execution Time: 100645.548 ms
(10 rows)
The estimate (cost=0.00..4164264.45 rows=5883377 width=3595) shows the planner expects over 5M records to match the criteria. That is significantly different from the roughly 100K you mention.
In cases like this, ANALYZE public.my_table; usually helps. It refreshes the statistics for the table's data.
Your main problem seems to be index bloat, which is caused by more DELETEs or UPDATEs than autovacuum can clean up. Solve that problem, and you should be fine.
Tune autovacuum to be as fast as possible:
ALTER TABLE my_table SET (autovacuum_vacuum_cost_delay = 0);
Also, give autovacuum enough memory by setting maintenance_work_mem to a high value up to 1GB.
Then rebuild the indexes so that they are no longer bloated. If pgstattuple tells you that the table is bloated too, VACUUM (FULL) that table instead.
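As a sketch of the rebuild (the index definition is copied from the question; note that REINDEX ... CONCURRENTLY only exists from PostgreSQL 12, so on 11.2 a non-blocking rebuild means creating a replacement index concurrently and swapping it in):
-- maintenance_work_mem affects manual VACUUM and index builds in this session;
-- for autovacuum itself, set it (or autovacuum_work_mem) in postgresql.conf.
SET maintenance_work_mem = '1GB';

-- Non-blocking rebuild of a bloated index on PostgreSQL 11:
CREATE INDEX CONCURRENTLY index001_new
ON my_table (action_performed_at, should_still_perform_action, action_performed)
WHERE should_still_perform_action = false AND action_performed = true;
DROP INDEX CONCURRENTLY index001;
ALTER INDEX index001_new RENAME TO index001;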
Make sure that you don't have long running database transactions most of the time.
By the way, this is the perfect index for this query:
CREATE INDEX ON my_table (action_performed_at)
WHERE action_performed AND NOT should_still_perform_action;
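To confirm the planner actually picks that partial index, an EXPLAIN of the original query is enough (a sketch; the exact plan depends on your statistics):
-- Check that the limited scan now uses the partial index
EXPLAIN (ANALYZE, BUFFERS)
SELECT id, name
FROM my_table
WHERE action_performed = true
  AND should_still_perform_action = false
  AND action_performed_at <= '2021-09-05 00:00:00'
LIMIT 100;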
According to the plan, the entire time is spent scanning the table. A sequential scan is happening even though an index is available on the required columns. It seems the number of rows matching the query's conditions is quite high, not just 100K.
If you check the following line in the plan, around 20M rows are removed by the three filters used in the WHERE clause:
Rows Removed by Filter: 19846606
Please check:
why the index isn't being picked, by reviewing the cardinality of the columns, how many rows the query actually returns, and when this table was last analyzed;
whether autovacuum is enabled in this database, and when autovacuum last ran for this table.
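A quick way to answer those last two questions, assuming the standard statistics views are available:
-- When were statistics last refreshed, and when did (auto)vacuum last run?
SELECT relname, last_analyze, last_autoanalyze, last_vacuum, last_autovacuum, n_dead_tup
FROM pg_stat_user_tables
WHERE relname = 'my_table';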
Because of the statistics collected for the index, the distribution histogram contains detailed values only for the first column of the index key.
If this first column has only 2 values, the accuracy is poor and the optimizer will create a bad execution plan.
To avoid this problem, you must place the column action_performed_at first in the index key.
Another point: you do not need to store a column in the index key when it can only have a single value. When you create an index with a WHERE clause that relies on MyColumn = A_Single_Value, you can leave that column out of the index key.
Finally, you can use the INCLUDE clause, which MS SQL Server introduced 16 years ago and which has since arrived in PostgreSQL, to add columns that do not participate in any seek but are needed for the SELECT. This allows an index-only access instead of a two-phase access (a seek on the index followed by a scan of the table).
So I would try an index like this one:
CREATE INDEX SQLpro__B6B13FC3_6F90_4EEC_BA61_CA6C96C7958A__20210914
ON my_table (action_performed_at)
INCLUDE (id, name)
WHERE action_performed = true AND
      should_still_perform_action = false;

Avoid using Nested Loop Join while using a non-equi join condition

Postgres is using a Nested Loop Join algorithm when I use a non equi join condition in my update query. I understand that the Nested Loop Join can be very costly as the right relation is scanned once for every row found in the left relation as per
[https://www.postgresql.org/docs/8.3/planner-optimizer.html]
The update query and the execution plan is below.
Query
explain analyze
UPDATE target_tbl tgt
set descr = stage.descr,
prod_name = stage.prod_name,
item_name = stage.item_name,
url = stage.url,
col1_name = stage.col1_name,
col2_name = stage.col2_name,
col3_name = stage.col3_name,
col4_name = stage.col4_name,
col5_name = stage.col5_name,
col6_name = stage.col6_name,
col7_name = stage.col7_name,
col8_name = stage.col8_name,
flag = stage.flag
from tbl1 stage
where tgt.col1 = stage.col1
and tgt.col2 = stage.col2
and coalesce(tgt.col3, 'col3'::text) = coalesce(stage.col3, 'col3'::text)
and coalesce(tgt.col4, 'col4'::text) = coalesce(stage.col4, 'col4'::text)
and stage.row_number::int >= 1::int
and stage.row_number::int < 50001::int;
Execution Plan
Update on target_tbl tgt (cost=0.56..3557.91 rows=1 width=813) (actual time=346153.460..346153.460 rows=0 loops=1)
-> Nested Loop (cost=0.56..3557.91 rows=1 width=813) (actual time=4.326..163876.029 rows=50000 loops=1)
-> Seq Scan on tbl1 stage (cost=0.00..2680.96 rows=102 width=759) (actual time=3.060..2588.745 rows=50000 loops=1)
Filter: (((row_number)::integer >= 1) AND ((row_number)::integer < 50001))
-> Index Scan using tbl_idx on target_tbl tgt (cost=0.56..8.59 rows=1 width=134) (actual time=3.152..3.212 rows=1 loops=50000)
Index Cond: ((col1 = stage.col1) AND (col2 = stage.col2) AND (COALESCE(col3, 'col3'::text) = COALESCE(stage.col3, 'col3'::text)) AND (COALESCE(col4, 'col4'::text) = COALESCE(stage.col4, 'col4'::text)))
Planning time: 17.700 ms
Execution time: 346157.168 ms
Is there any way to avoid the nested loop join during the execution of the above query?
Or is there a way to reduce the cost of the nested loop scan? Currently it takes 6-7 minutes to update just 50,000 records.
PostgreSQL can choose a different join strategy in that case. The reason why it doesn't is the gross mis-estimate in the sequential scan: 102 instead of 50000.
Fix that problem, and things will get better:
ANALYZE tbl1;
If that is not enough, collect more detailed statistics:
ALTER TABLE tbl1 ALTER row_number SET STATISTICS 1000;
ANALYZE tbl1;
All this assumes that row_number is an integer and the type cast is redundant. If you made the mistake of using a different data type, an index is your only hope:
CREATE INDEX ON tbl1 ((row_number::integer));
ANALYZE tbl1;
I understand that the Nested Loop Join can be very costly as the right relation is scanned once for every row found in the left relation
But the "right relation" here is an index scan, not a scan of the full table.
You can get it to stop using the index by changing the leading column of the join condition to something like where tgt.col1+0 = stage.col1 .... Upon doing that, it will probably change to a hash join or a merge join, but you will have to try it and see if it does. Also, the new plan may not actually be faster. (And fixing the estimation problem would be preferable, if that works)
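A sketch of that rewrite against the query from the question (assuming col1 is a numeric type, so that + 0 is a no-op; only the join condition changes):
EXPLAIN ANALYZE
UPDATE target_tbl tgt
SET descr = stage.descr,
    prod_name = stage.prod_name
    -- ... remaining SET columns exactly as in the original query ...
FROM tbl1 stage
WHERE tgt.col1 + 0 = stage.col1   -- no-op expression hides the index from the planner
  AND tgt.col2 = stage.col2
  AND coalesce(tgt.col3, 'col3'::text) = coalesce(stage.col3, 'col3'::text)
  AND coalesce(tgt.col4, 'col4'::text) = coalesce(stage.col4, 'col4'::text)
  AND stage.row_number::int >= 1
  AND stage.row_number::int < 50001;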
Or is there a way to reduce the cost of the nested loop scan? Currently
it takes 6-7 minutes to update just 50,000 records.
Your plan shows that over half the time is spent on the update itself, so probably reducing the cost of just the nested loop scan can have only a muted impact on the overall time. Do you have a lot of indexes on the table? The maintenance of those indexes might be a major bottleneck.
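A quick way to see how many indexes each updated row may have to maintain (a sketch using the standard catalog view):
-- List every index defined on the updated table
SELECT indexname, indexdef
FROM pg_indexes
WHERE tablename = 'target_tbl';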

Optimizing a query that compares a table to itself with millions of rows

I could use some help optimizing a query that compares rows in a single table with millions of entries. Here's the table's definition:
CREATE TABLE IF NOT EXISTS data.row_check (
id uuid NOT NULL DEFAULT NULL,
version int8 NOT NULL DEFAULT NULL,
row_hash int8 NOT NULL DEFAULT NULL,
table_name text NOT NULL DEFAULT NULL,
CONSTRAINT row_check_pkey
PRIMARY KEY (id, version)
);
I'm reworking our push code and have a test bed with millions of records across about 20 tables. I run my tests, get the row counts, and can spot when some of my insert code has changed. The next step is to checksum each row, and then compare the rows for differences between versions of my code. Something like this:
-- Run my test of "version 0" of the push code, the base code I'm refactoring.
-- Insert the ID and checksum for each pushed row.
INSERT INTO row_check (id,version,row_hash,table_name)
SELECT id, 0, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
row_hash = EXCLUDED.row_hash,
table_name = EXCLUDED.table_name;
truncate table record_changes_log;
-- Run my test of "version 1" of the push code, the new code I'm validating.
-- Insert the ID and checksum for each pushed row.
INSERT INTO row_check (id,version,row_hash,table_name)
SELECT id, 1, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_check_pkey DO UPDATE SET
row_hash = EXCLUDED.row_hash,
table_name = EXCLUDED.table_name;
That gets two rows in row_check for every row in record_changes_log, or any other table I'm checking. For the two runs of record_changes_log, I end up with more than 8.6M rows in row_check. They look like this:
id version row_hash table_name
e6218751-ab78-4942-9734-f017839703f6 0 -142492569 record_changes_log
6c0a4111-2f52-4b8b-bfb6-e608087ea9c1 0 -1917959999 record_changes_log
7fac6424-9469-4d98-b887-cd04fee5377d 0 -323725113 record_changes_log
1935590c-8d22-4baf-85ba-00b563022983 0 -1428730186 record_changes_log
2e5488b6-5b97-4755-8a46-6a46317c1ae2 0 -1631086027 record_changes_log
7a645ffd-31c5-4000-ab66-a565e6dad7e0 0 1857654119 record_changes_log
I asked yesterday for some help on the comparison query, and it led to this:
select v0.table_name,
v0.id,
v0.row_hash as v0,
v1.row_hash as v1
from row_check v0
join row_check v1 on v0.id = v1.id and
v0.version = 0 and
v1.version = 1 and
v0.row_hash <> v1.row_hash
That works, but now I'm hoping to optimize it a bit. As an experiment, I clustered the data on version and then built a BRIN index, like this:
drop index if exists row_check_version_btree;
create index row_check_version_btree
on row_check
using btree(version);
cluster row_check using row_check_version_btree;
drop index row_check_version_btree; -- Eh? I want to see how the BRIN performs.
drop index if exists row_check_version_brin;
create index row_check_version_brin
on row_check
using brin(row_hash);
vacuum analyze row_check;
I ran the query through explain analyze and get this:
Merge Join (cost=1.12..559750.04 rows=4437567 width=51) (actual time=1511.987..14884.045 rows=10 loops=1)
Output: v0.table_name, v0.id, v0.row_hash, v1.row_hash
Inner Unique: true
Merge Cond: (v0.id = v1.id)
Join Filter: (v0.row_hash <> v1.row_hash)
Rows Removed by Join Filter: 4318290
Buffers: shared hit=8679005 read=42511
-> Index Scan using row_check_pkey on ascendco.row_check v0 (cost=0.56..239156.79 rows=4252416 width=43) (actual time=0.032..5548.180 rows=4318300 loops=1)
Output: v0.id, v0.version, v0.row_hash, v0.table_name
Index Cond: (v0.version = 0)
Buffers: shared hit=4360752
-> Index Scan using row_check_pkey on ascendco.row_check v1 (cost=0.56..240475.33 rows=4384270 width=24) (actual time=0.031..6070.790 rows=4318300 loops=1)
Output: v1.id, v1.version, v1.row_hash, v1.table_name
Index Cond: (v1.version = 1)
Buffers: shared hit=4318253 read=42511
Planning Time: 1.073 ms
Execution Time: 14884.121 ms
...which I did not really get the right idea from, so I ran it again with JSON output and fed the results into this wonderful plan visualizer:
http://tatiyants.com/pev/#/plans
The tips there are right: the top node estimate is bad. The result is 10 rows; the estimate is for about 4,437,567 rows.
I'm hoping to learn more about optimizing this kind of thing, and this query seems like a good opportunity. I have a lot of notions about what might help:
-- CREATE STATISTICS?
-- Rework the query to move the where comparison?
-- Use a better index? I did try a GIN index and a straight B-tree on version, but neither was superior.
-- Rework the row_check format to move the two hashes into the same row instead of splitting them over two rows, compare on insert/update, flag non-matches, and add a partial index for the non-matching values.
Granted, it's funny to even try to index something where there are only two values (0 and 1 in the case above), so there's that. In fact, is there any sort of clever trick for Booleans? I'll always be comparing two versions, so "old" and "new", which I can express however makes life best. I understand that Postgres only has bitmap indexes internally at search/merge (?) time and that it does not have a bitmap type index. Would there be some kind of INTERSECT that might help? I don't know how Postgres implements set math operators internally.
Thanks for any suggestions on how to rethink this data or the query to make it faster for comparisons involving millions, or tens of millions, of rows.
I'm going to add an answer to my own question, but am still interested in what anyone else has to say. In the process of writing out my original question, I realized that maybe a redesign is in order. This hinges on my plan to only ever compare two versions at a time. That's a good solution here, but there are other cases where it wouldn't work. Anyway, here's a slightly different table design that folds the two results into a single row:
DROP TABLE IF EXISTS data.row_compare;
CREATE TABLE IF NOT EXISTS data.row_compare (
id uuid NOT NULL DEFAULT NULL,
hash_1 int8, -- Want NULL to defer calculating hash comparison until after both hashes are entered.
hash_2 int8, -- Ditto
hashes_match boolean, -- Likewise
table_name text NOT NULL DEFAULT NULL,
CONSTRAINT row_compare_pkey
PRIMARY KEY (id)
);
The following partial index should, hopefully, be very small, as I shouldn't have any non-matching entries:
CREATE INDEX row_compare_fail ON row_compare (hashes_match)
WHERE hashes_match = false;
The trigger below does the column calculation, once hash_1 and hash_2 are both provided:
-- Run this as a BEFORE INSERT or UPDATE ROW trigger.
CREATE OR REPLACE FUNCTION data.on_upsert_row_compare()
RETURNS trigger AS
$BODY$
BEGIN
IF NEW.hash_1 IS NULL OR
NEW.hash_2 IS NULL THEN
RETURN NEW; -- Don't do the comparison; one of the hashes hasn't been populated yet.
ELSE -- Do the comparison. The point of this is to avoid constantly thrashing the partial index.
NEW.hashes_match := NEW.hash_1 = NEW.hash_2;
RETURN NEW; -- important!
END IF;
END;
$BODY$
LANGUAGE plpgsql;
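For completeness, a sketch of wiring the function up as the row trigger mentioned in the comment above (the trigger name is illustrative):
-- Attach the function as a BEFORE INSERT OR UPDATE row-level trigger.
CREATE TRIGGER row_compare_upsert
BEFORE INSERT OR UPDATE ON data.row_compare
FOR EACH ROW
EXECUTE PROCEDURE data.on_upsert_row_compare();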
This now adds 4.3M rows instead of 8.6M rows:
-- Add the first set of results and build out the row_compare records.
INSERT INTO row_compare (id,hash_1,table_name)
SELECT id, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
hash_1 = EXCLUDED.hash_1,
table_name = EXCLUDED.table_name;
-- I'll truncate the record_changes_log and push my sample data again here.
-- Add the second set of results and update the row compare records.
-- This time, the hash is going into the hash_2 field for comparison
INSERT INTO row_compare (id,hash_2,table_name)
SELECT id, hashtext(record_changes_log::text),'record_changes_log'
FROM record_changes_log
ON CONFLICT ON CONSTRAINT row_compare_pkey DO UPDATE SET
hash_2 = EXCLUDED.hash_2,
table_name = EXCLUDED.table_name;
And now the results are simple to find:
select * from row_compare where hashes_match = false;
This changes the query time from around 17 seconds to around 24 milliseconds.

Is this the right way to create a partial index in Postgres?

We have a table with 4 million records, and for a particular frequently used use case we are only interested in records with a Salesforce userType of 'Standard', which are only about 10,000 out of the 4 million. The other userTypes that could exist are 'PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess' and 'CsnOnly'.
So for this use case I thought creating a partial index would be better, as per the documentation.
So I am planning to create this partial index to speed up queries for records with a userType of 'Standard' and to prevent the web request from timing out:
CREATE INDEX user_type_idx ON user_table(userType)
WHERE userType NOT IN
('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
The lookup query will be
select * from user_table where userType='Standard';
Could you please confirm if this is the right way to create the partial index? It would be of great help.
Postgres can use that but it does so in a way that is (slightly) less efficient than an index specifying where user_type = 'Standard'.
I created a small test table with 4 million rows, 10,000 of them having the user_type 'Standard'. The other values were randomly distributed using the following script:
create table user_table
(
id serial primary key,
some_date date not null,
user_type text not null,
some_ts timestamp not null,
some_number integer not null,
some_data text,
some_flag boolean
);
insert into user_table (some_date, user_type, some_ts, some_number, some_data, some_flag)
select current_date,
case (random() * 4 + 1)::int
when 1 then 'PowerPartner'
when 2 then 'CSPLitePortal'
when 3 then 'CustomerSuccess'
when 4 then 'PowerCustomerSuccess'
when 5 then 'CsnOnly'
end,
clock_timestamp(),
42,
rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
(random() + 1)::int = 1
from generate_series(1,4e6 - 10000) as t(i)
union all
select current_date,
'Standard',
clock_timestamp(),
42,
rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
(random() + 1)::int = 1
from generate_series(1,10000) as t(i);
(I create tables that have more than just a few columns as the planner's choices are also driven by the size and width of the tables)
The first test using the index with NOT IN:
create index ix_not_in on user_table(user_type)
where user_type not in ('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
explain (analyze true, verbose true, buffers true)
select *
from user_table
where user_type = 'Standard'
Results in:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on stuff.user_table (cost=139.68..14631.83 rows=11598 width=139) (actual time=1.035..2.171 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Recheck Cond: (user_table.user_type = 'Standard'::text)
Buffers: shared hit=262
-> Bitmap Index Scan on ix_not_in (cost=0.00..136.79 rows=11598 width=0) (actual time=1.007..1.007 rows=10000 loops=1)
Index Cond: (user_table.user_type = 'Standard'::text)
Buffers: shared hit=40
Total runtime: 2.506 ms
(The above is a typical execution time after I ran the statement about 10 times to eliminate caching issues)
As you can see the planner uses a Bitmap Index Scan which is a "lossy" scan that needs an extra step to filter out false positives.
When using the following index:
create index ix_standard on user_table(id)
where user_type = 'Standard';
This results in the following plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_standard on stuff.user_table (cost=0.29..443.16 rows=10267 width=139) (actual time=0.011..1.498 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Buffers: shared hit=313
Total runtime: 1.815 ms
Conclusion:
Your index is used but an index on only the type that you are interested in is a bit more efficient.
The runtime is not that much different. I executed each explain about 10 times, and the average for the ix_standard index was slightly below 2ms and the average of the ix_not_in index was slightly above 2ms - so not a real performance difference.
But in general the Index Scan will scale better with increasing table sizes than the Bitmap Index Scan will do. This is basically due to the "Recheck Condition" - especially if not enough work_mem is available to keep the bitmap in memory (for larger tables).
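If you want to see that effect, a sketch is to lower work_mem artificially before re-running the EXPLAIN; a lossy bitmap typically shows up in the plan as "Heap Blocks: ... lossy=..." together with rows removed by the recheck (values here are only for demonstration):
-- Artificially small work_mem may force the bitmap to become lossy (session only)
SET work_mem = '64kB';
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM user_table WHERE user_type = 'Standard';
RESET work_mem;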
For the index to be used, the WHERE condition must be used in the query as you wrote it.
PostgreSQL has some ability to make deductions, but it won't be able to infer that userType = 'Standard' is equivalent to the condition in the index.
Use EXPLAIN to find out if your index can be used.
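For example, a minimal check (column and table names as in your question):
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM user_table WHERE userType = 'Standard';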

Redshift SELECT * performance versus COUNT(*) for non existent row

I am confused about what Redshift is doing when I run 2 seemingly similar queries. Neither should return a result (querying a profile that doesn't exist). Specifically:
SELECT * FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 36.75s
versus
SELECT COUNT(*) FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 0.2s
Given that the table is sorted by project_id then id, I would have thought this is just a key lookup. The SELECT COUNT(*) ... returns 0 results in 0.2 sec, which is about what I would expect. The SELECT * ... returns 0 results in 36.75 sec. That's a huge difference for the same result and I don't understand why.
If it helps schema as follows:
CREATE TABLE profile (
project_id integer not null,
id varchar(256) not null,
created timestamp not null,
/* ... approx 50 other columns here */
)
DISTKEY(id)
SORTKEY(project_id, id);
Explain from SELECT COUNT(*) ...
XN Aggregate (cost=435.70..435.70 rows=1 width=0)
-> XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=0)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Explain from SELECT * ...
XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=7356)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Why is the non-count query so much slower? Surely Redshift knows the row doesn't exist?
The reason is that in many RDBMSs the answer to a count(*) query can often be produced without an actual data scan, just from an index or from table statistics. Redshift stores the minimum and maximum value for each block, which it uses to answer exists/not-exists questions like the one described. If the requested value falls inside a block's min/max boundaries, the scan is performed only on the data of the filtering columns. If the requested value is below or above the block boundaries, the answer comes back much faster, based on the stored statistics alone. For a SELECT * query, Redshift actually scans the data of all the columns requested by the "*", while filtering only on the columns in the WHERE clause.
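Based on that explanation, a sketch of a cheaper existence check: restricting the column list means Redshift only has to read the blocks of the filtering columns (plus a constant), instead of all ~50 columns.
-- Existence check that avoids reading every column's blocks
SELECT 1
FROM profile
WHERE id = 'id_that_doesnt_exist'
  AND project_id = 1
LIMIT 1;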