My query is not picking up the created index - PostgreSQL

I created an index named user_type_index on the user_type column.
My query:
user_types = ['super_admin', 'system_admin']
User.where("user_type IN (:user_types)", user_types: user_types)
Even though user_type has an index, this query runs sequentially. I tried dropping and recreating the index, but it made no difference.
class User < ActiveRecord::Base
  def self.matching_with_user_types(user_types)
    where("user_type IN (:user_types)", user_types: user_types)
  end
end
user_types = ['super_admin', 'system_admin']
User.matching_with_user_types(user_types).explain
Below is the output:
Seq Scan on users (cost=0.00..11.75 rows=2 width=537)
Filter: ((user_type)::text = ANY ('{super_admin,system_admin}'::text[]))
(2 rows)
Expected: the query uses the index.
Actual: it scans sequentially.
Raw PostgreSQL queries:
Fetching records matching all four types:
EXPLAIN SELECT id FROM "users" WHERE (user_type IN ('super_admin','system_admin','network','super'));
QUERY PLAN
--------------------------------------------------------------------------------------
Seq Scan on users (cost=0.00..3039.28 rows=75228 width=377)
Filter: ((user_type)::text = ANY ('{super_admin,system_admin,network,super}'::text[]))
(2 rows)
Fetching records matching a single type:
EXPLAIN SELECT id FROM "users" WHERE (user_type IN ('super_admin'));
QUERY PLAN
--------------------------------------------------------------------------------------
Seq Scan on users (cost=0.00..2789.24 rows=25418 width=377)
Filter: ((user_type)::text = ANY ('{super_admin}'::text[]))
(2 rows)

Related

How to use an index when using jsonb_array_elements in Postgres

I have the following table structure:
create table public.listings (id varchar(255) not null, data jsonb not null);
And the following indexes:
create index listings_data_index on public.listings using gin(data jsonb_ops);
create unique index listings_id_index on public.listings(id);
alter table public.listings add constraint listings_id_pk primary key(id);
With this row:
id | data
1 | {"attributes": {"ccid": "123", "listings": [{"vin": "1234","body": "Sleeper", "make": "International"}, { "vin": "5678", "body": "Sleeper", "make": "International" }]}}
The use case is to retrieve a specific item inside the listings array that matches a specific vin.
I am accomplishing that with the following query:
SELECT elems
FROM public.listings, jsonb_array_elements(data->'attributes'->'listings') elems
WHERE id = '1' AND elems->'vin' ? '1234';
The output is what I need:
{"vin": "1234","body": "Sleeper", "make": "International"}
Now I am in the phase of optimizing this query, since there will be millions of rows, and up to 100K items inside the listings array.
When I run EXPLAIN over that query, it shows this:
Nested Loop (cost=0.01..2.53 rows=1 width=32)
-> Seq Scan on listings (cost=0.00..1.01 rows=1 width=32)
Filter: ((id)::text = '1'::text)
-> Function Scan on jsonb_array_elements elems (cost=0.01..1.51 rows=1 width=32)
Filter: ((value -> 'vin'::text) ? '1234'::text)
I wonder what the right way to construct an index for that would be, or if I need to modify the query into a more efficient one.
Thank you!
First: with a table as small as that, you will never see PostgreSQL use an index. You need to test with realistic amounts of data. Second: while PostgreSQL will happily use an index for the condition on id, it can never use an index for such a JSON search over the unnested elements, no matter how you write it.
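That said, the per-element filtering can be preceded by a containment predicate on the whole data column, which the existing listings_data_index GIN index can serve (jsonb_ops supports the @> operator); the unnesting then runs only on rows already known to contain the vin. A sketch of that rewrite, dropping the id filter for the general case (whether the planner actually picks the index still depends on realistic data volumes):
-- Sketch: @> lets the GIN index narrow down the candidate rows first;
-- jsonb containment descends into nested arrays, so this matches rows
-- whose listings array has an element with vin = "1234".
SELECT elems
FROM public.listings,
     jsonb_array_elements(data->'attributes'->'listings') elems
WHERE data @> '{"attributes": {"listings": [{"vin": "1234"}]}}'
  AND elems->>'vin' = '1234';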

Index created for PostgreSQL jsonb column not utilized

I have created an index for a field in a jsonb column as:
create index on Employee using gin ((properties -> 'hobbies'))
The generated index definition is:
CREATE INDEX employee_expr_idx ON public.employee USING gin (((properties -> 'hobbies'::text)))
My search query has structure as:
SELECT * FROM Employee e
WHERE e.properties @> '{"hobbies": ["trekking"]}'
AND e.department = 'Finance'
Running EXPLAIN on this query gives:
Seq Scan on employee e (cost=0.00..4452.94 rows=6 width=1183)
Filter: ((properties @> '{"hobbies": ["trekking"]}'::jsonb) AND (department = 'Finance'::text))
Going by this, I am not sure if the index is being used for the search.
Is this entire setup OK?
The expression you use in the WHERE clause must match the expression in the index exactly. Your index uses the expression ((properties -> 'hobbies'::text)), but your query only uses e.properties on the left-hand side.
To make use of that index, your WHERE clause needs to use the same expression as was used in the index:
SELECT *
FROM Employee e
WHERE (properties -> 'hobbies') @> '["trekking"]'
AND e.department = 'Finance'
However: your execution plan shows that the table employee is really tiny (rows=6). With a table as small as that, a Seq Scan is always going to be the fastest way to retrieve data, no matter what kind of indexes you define.
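To double-check on such a tiny table that the rewritten predicate is at least index-compatible, one common trick (a sketch for an interactive test session only, never production settings) is to make sequential scans unattractive and re-run EXPLAIN:
-- Sketch: discourage seq scans for this session only, then check whether
-- the planner is able to pick employee_expr_idx at all.
SET enable_seqscan = off;
EXPLAIN SELECT * FROM Employee e
WHERE (properties -> 'hobbies') @> '["trekking"]'
AND e.department = 'Finance';
RESET enable_seqscan;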

Postgres SELECT query not using index

I've created an index on a fairly large table of data (100 million + rows)
create index on web_responses_rebuild_2019_09 (response_time)
When I run EXPLAIN on a simple query of the data, it does not use the index; it does a full table scan:
explain select *
from web_responses_rebuild_2019_09
where response_time between 1567398218
and 1567398220;
Results:
QUERY PLAN
"Seq Scan on web_responses_rebuild_2019_09 (cost=10000000000.00..10044668761.32 rows=300 width=734)"
" Filter: ((response_time >= 1567398218) AND (response_time <= 1567398220))"
...and we have a similarly named table and index that correctly does use the index.

Is this the right way to create a partial index in Postgres?

We have a table with 4 million records, and for a particular, frequently used use case we are only interested in records with a Salesforce userType of 'Standard', which are only about 10,000 out of the 4 million. The other userTypes that could exist are 'PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess' and 'CsnOnly'.
So for this use case I thought creating a partial index would be better, as per the documentation.
So I am planning to create this partial index to speed up queries for records with a userType of 'Standard' and prevent the web request from timing out:
CREATE INDEX user_type_idx ON user_table(userType)
WHERE userType NOT IN
('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
The lookup query will be
select * from user_table where userType='Standard';
Could you please confirm if this is the right way to create the partial index? It would be of great help.
Postgres can use that index, but it does so in a way that is (slightly) less efficient than an index specifying WHERE user_type = 'Standard'.
I created a small test table with 4 million rows, 10,000 of them having the user_type 'Standard'. The other values were randomly distributed using the following script:
create table user_table
(
  id serial primary key,
  some_date date not null,
  user_type text not null,
  some_ts timestamp not null,
  some_number integer not null,
  some_data text,
  some_flag boolean
);

insert into user_table (some_date, user_type, some_ts, some_number, some_data, some_flag)
select current_date,
       case (random() * 4 + 1)::int
         when 1 then 'PowerPartner'
         when 2 then 'CSPLitePortal'
         when 3 then 'CustomerSuccess'
         when 4 then 'PowerCustomerSuccess'
         when 5 then 'CsnOnly'
       end,
       clock_timestamp(),
       42,
       rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
       (random() + 1)::int = 1
from generate_series(1, 4e6 - 10000) as t(i)
union all
select current_date,
       'Standard',
       clock_timestamp(),
       42,
       rpad(md5(random()::text), (random() * 200 + 1)::int, md5(random()::text)),
       (random() + 1)::int = 1
from generate_series(1, 10000) as t(i);
(I create tables with more than just a few columns, as the planner's choices are also driven by the size and width of the tables.)
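Note: after a bulk load like this, the optimizer statistics should be refreshed, otherwise the row estimates in the plans below are unreliable (autovacuum would eventually do this on its own):
-- Assumed housekeeping after the bulk insert: without fresh statistics
-- the planner's row estimates (e.g. rows=11598 below) mean little.
vacuum analyze user_table;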
The first test using the index with NOT IN:
create index ix_not_in on user_table(user_type)
where user_type not in ('PowerPartner', 'CSPLitePortal', 'CustomerSuccess', 'PowerCustomerSuccess', 'CsnOnly');
explain (analyze true, verbose true, buffers true)
select *
from user_table
where user_type = 'Standard'
Results in:
QUERY PLAN
--------------------------------------------------------------------------------------------------------------------------------
Bitmap Heap Scan on stuff.user_table (cost=139.68..14631.83 rows=11598 width=139) (actual time=1.035..2.171 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Recheck Cond: (user_table.user_type = 'Standard'::text)
Buffers: shared hit=262
-> Bitmap Index Scan on ix_not_in (cost=0.00..136.79 rows=11598 width=0) (actual time=1.007..1.007 rows=10000 loops=1)
Index Cond: (user_table.user_type = 'Standard'::text)
Buffers: shared hit=40
Total runtime: 2.506 ms
(The above is a typical execution time after I ran the statement about 10 times to eliminate caching issues)
As you can see, the planner uses a Bitmap Index Scan, a "lossy" scan that needs an extra recheck step to filter out false positives.
When using the following index:
create index ix_standard on user_table(id)
where user_type = 'Standard';
This results in the following plan:
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------
Index Scan using ix_standard on stuff.user_table (cost=0.29..443.16 rows=10267 width=139) (actual time=0.011..1.498 rows=10000 loops=1)
Output: id, some_date, user_type, some_ts, some_number, some_data, some_flag
Buffers: shared hit=313
Total runtime: 1.815 ms
Conclusion:
Your index is used, but an index covering only the type you are interested in is a bit more efficient.
The runtime is not that much different. I executed each explain about 10 times, and the average for the ix_standard index was slightly below 2ms and the average of the ix_not_in index was slightly above 2ms - so not a real performance difference.
But in general, the Index Scan will scale better with increasing table size than the Bitmap Index Scan. This is basically due to the "Recheck Cond" step, especially if not enough work_mem is available to keep the bitmap in memory (for larger tables).
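If you want to see that recheck behavior directly, one way (a hypothetical session-level experiment, not part of the test run above, and only visible on a sufficiently large table) is to shrink work_mem so the bitmap degrades to per-page granularity:
-- Sketch: on a large enough table, a tiny work_mem means the bitmap can no
-- longer hold one bit per row; EXPLAIN (ANALYZE) then reports
-- "Heap Blocks: exact=... lossy=..." and rows removed by the index recheck.
SET work_mem = '64kB';
EXPLAIN (ANALYZE) SELECT * FROM user_table WHERE user_type = 'Standard';
RESET work_mem;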
For the index to be used, the WHERE condition of the index must appear in the query just as you wrote it.
PostgreSQL has some ability to make deductions, but it won't be able to infer that userType = 'Standard' is equivalent to the NOT IN condition in the index.
Use EXPLAIN to find out if your index can be used.
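The safest variant is therefore to make the index predicate textually identical to the lookup predicate, so the planner has nothing to infer. A sketch, reusing the column name from the question (the index name is made up):
-- Sketch: index predicate and query predicate match verbatim, so the
-- planner can trivially prove the partial index covers the query.
CREATE INDEX user_type_standard_idx ON user_table(userType)
WHERE userType = 'Standard';

-- Verify that the lookup now uses the partial index:
EXPLAIN SELECT * FROM user_table WHERE userType = 'Standard';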

Redshift SELECT * performance versus COUNT(*) for a non-existent row

I am confused about what Redshift is doing when I run 2 seemingly similar queries. Neither should return a result (querying a profile that doesn't exist). Specifically:
SELECT * FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 36.75s
versus
SELECT COUNT(*) FROM profile WHERE id = 'id_that_doesnt_exist' and project_id = 1;
Execution time: 0.2s
Given that the table is sorted by project_id then id, I would have thought this is just a key lookup. The SELECT COUNT(*) ... returns 0 results in 0.2s, which is about what I would expect. The SELECT * ... returns 0 results in 36.75s. That's a huge difference for the same result, and I don't understand why.
If it helps schema as follows:
CREATE TABLE profile (
project_id integer not null,
id varchar(256) not null,
created timestamp not null,
/* ... approx 50 other columns here */
)
DISTKEY(id)
SORTKEY(project_id, id);
Explain from SELECT COUNT(*) ...
XN Aggregate (cost=435.70..435.70 rows=1 width=0)
-> XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=0)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Explain from SELECT * ...
XN Seq Scan on profile (cost=0.00..435.70 rows=1 width=7356)
Filter: (((id)::text = 'id_that_doesnt_exist'::text) AND (project_id = 1))
Why is the non-COUNT query so much slower? Surely Redshift knows the row doesn't exist?
The reason is that in many RDBMSs the answer to a COUNT(*) query can come without an actual data scan, just from an index or from table statistics. Redshift stores the minimum and maximum value for each block, which it can use to answer exists/not-exists questions like the one described. If the requested value lies inside a block's min/max boundaries, a scan is performed, but only over the data of the filtering columns. If the requested value falls outside the block boundaries, the answer comes much faster, straight from the stored statistics. With SELECT *, Redshift actually scans the data of all columns, as asked for by "*", even though it filters only on the columns in the WHERE clause.
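If that explanation holds, projecting only the filter columns should behave much like the COUNT(*) case, because only those columns' blocks have to be read. A sketch of the comparison (the only difference between the two statements is the select list):
-- Narrow projection: only the blocks of the two filter columns are read.
SELECT project_id, id FROM profile
WHERE id = 'id_that_doesnt_exist' AND project_id = 1;

-- Wide projection: all ~50 columns are requested, which is the slow case
-- observed above.
SELECT * FROM profile
WHERE id = 'id_that_doesnt_exist' AND project_id = 1;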