Postgres slow bitmap heap scan

I have the tables messages and phones with around 6M rows each, and this query's performance is very poor:
SELECT t1.id, t2.number, t1.name, t1.gender
FROM messages t1
INNER JOIN phones t2 ON t2.id = t1.parent_id
INNER JOIN regions t6 ON t6.id = t1.region_id
WHERE t2.number IS NOT NULL AND t1.entity AND NOT t2.type AND t1.region_id = 50
ORDER BY t1.id LIMIT 100
EXPLAIN ANALYZE result: http://explain.depesz.com/s/Pd6D
There are B-tree indexes on all columns in the WHERE condition, primary keys on all id columns, and foreign keys in the messages table on parent_id and region_id as well. VACUUM has been run on all tables, too.
But over 15 seconds for just 100 rows is far too slow. What is wrong?
Postgres 9.3, Ubuntu 13.10, CPU 2x 2.5 GHz, 4 GB RAM, pg config: http://pastebin.com/mPVH1YJi

This completely depends on your read vs. write load, but one solution may be to create composite indexes for the most common / general cases.
For example, BTREE(parent_id, region_id) to turn that heap scan into an index scan would be huge. Since you have dynamic queries, there may be a few other combinations of composite indexes you need for other queries, but I would recommend using only two columns in your composite indexes for now (as each query is different). Note that BTREE(parent_id, region_id) can also be scanned when only parent_id is needed, so there is no need to keep a separate BTREE(parent_id) index.
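As a concrete sketch against the messages table from the question (the index name is my own choosing):
-- composite index covering the join column and the region filter
CREATE INDEX messages_parent_id_region_id_idx
    ON messages (parent_id, region_id);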

Related

Does PostgreSQL's Statistics Collector track *all* usage of indexes?

After taking over the DBA duties on a fairly complex database, I wanted to eliminate any indexes that are consuming substantial disk space, but not being used. I ran the following, to identify unused indexes, sorting to prioritize those that consume the most space on disk:
SELECT schemaname,
       pg_stat_all_indexes.relname AS table,
       pg_class.relname AS index,
       pg_total_relation_size(oid) AS size,
       idx_scan,
       idx_tup_read,
       idx_tup_fetch
FROM pg_class
JOIN pg_stat_all_indexes ON pg_stat_all_indexes.indexrelname = pg_class.relname
WHERE relkind = 'i'
ORDER BY size DESC;
I was a little surprised at just how many large indexes appear not to be used at all -- as evidenced by a 0 for the idx_scan column. Some of these apparently unused indexes include a function call that does something pretty specific (as in the contrived example below), and appear to have been set up to assist with API functionality.
--not real index
CREATE INDEX foo_transform_foo_name_idx
ON foo USING btree
(foo_transform_name(foo_name));
My question, here, is whether the Statistics Collector captures all uses of a particular index, even if those indexes were scanned from a SQL-language function, or in some other way?
These indexes have never been scanned. However, there are some other uses for indexes:
they enforce uniqueness and other constraints
they make ANALYZE gather statistics on indexed expressions
Use this query from my blog to find the indexes that you can drop without any negative consequences:
SELECT s.schemaname,
       s.relname AS tablename,
       s.indexrelname AS indexname,
       pg_relation_size(s.indexrelid) AS index_size
FROM pg_catalog.pg_stat_user_indexes s
JOIN pg_catalog.pg_index i ON s.indexrelid = i.indexrelid
WHERE s.idx_scan = 0          -- has never been scanned
  AND 0 <> ALL (i.indkey)     -- no index column is an expression
  AND NOT i.indisunique       -- is not a UNIQUE index
  AND NOT EXISTS              -- does not enforce a constraint
      (SELECT 1 FROM pg_catalog.pg_constraint c
       WHERE c.conindid = s.indexrelid)
ORDER BY pg_relation_size(s.indexrelid) DESC;

Parallel Workers Not Being Used With BRIN Index

We're currently running a query that performs a pretty simple join and GROUP BY for a row count, with a UNION ALL at the end:
(select
    table_p."name",
    table_p.id,
    count(table.id),
    sum(table.views)
 from table
 inner join table_p on table_p.id = table.pageid
 where table.date BETWEEN '2020-03-01' AND '2020-03-31'
 group by table_p.id
 order by table_p.id)
union all
(select
    table_p."name",
    table_p.id,
    count(table.id),
    sum(table.views)
 from table
 inner join table_p on table_p.id = table.pageid
 where table.date BETWEEN '2020-02-01' AND '2020-02-29'
 group by table_p.id
 order by table_p.id)
union all ....
We decided to use a BRIN index because the table contains 360 million records. We do have the option to go with B-tree if needed.
Now, for some reason, EXPLAIN ANALYZE shows the BRIN index scan with "parallel aware" set to false, even though two workers are listed in the plan output. Also, we're seeing linear performance when breaking up the range we're querying, i.e. one month in 5 seconds, four months in 20 seconds. I'd assume this means we're querying sequentially rather than in parallel.
Does anyone have any ideas on what we could be missing in order to get parallel queries going where possible? Does BRIN not work with parallel workers?
Edit: Here is the BRIN index on "table":
CREATE INDEX table_brin_idx
ON table USING brin
(date, teamid, id, pageid, devicetypeid, makeid, modelid)
TABLESPACE pg_default;
My postgres version is PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit
Here's a link to the explain analyze that's too big to post here.
From the PostgreSQL documentation: "Currently, parallel index scans are supported only for btree indexes."
Source: https://www.postgresql.org/docs/11/parallel-plans.html#PARALLEL-SCANS
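That said, the bitmap heap scan fed by a BRIN index can still run in parallel; it is only the bitmap index scan building the bitmap that is done by a single process, which matches the "parallel aware = false" node in your plan. If fully parallel index scans matter more than index size, a B-tree is an option. A sketch, assuming date is your main filter column and using an index name of my own:
-- B-tree instead of (or besides) BRIN on the main filter column;
-- parallel index scans are btree-only, but expect far more disk use.
CREATE INDEX table_date_btree_idx ON table USING btree (date);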

Inner join on tables with 50M and 30K entries

I have two tables A and B. A contains 50 million entries and B contains just 30 thousand. I have created default indexes (B-tree) on the columns used to join the tables. The join field is of type character varying.
I am querying the database with this query:
SELECT count(*)
from B INNER JOIN A
ON B.id = A.id;
The execution time of the above query is approximately 8 seconds. Looking at the execution plan, I see that the planner applies a sequential scan to table A, scanning all 50 million entries (this takes most of the time), and an index scan on table B.
How can I speed up the query?
You cannot speed up this query if you want an exact result.
The most efficient join strategy will probably be a hash or merge join, depending on your work_mem setting.
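For a hash join, you can experiment with a larger session-level work_mem so that the hash table fits in memory (the value below is only a placeholder):
SET work_mem = '256MB';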
You might be able to get some speed improvement with an index only scan; try to VACUUM both tables before querying.
The only other tuning method would be to make sure both tables are cached in RAM; one way to do that is sketched below.
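For example, if your installation ships the pg_prewarm contrib extension (PostgreSQL 9.4 or later), you can load both tables into the buffer cache explicitly (table names follow the question):
CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('a');  -- returns the number of blocks read
SELECT pg_prewarm('b');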
There are ways to get estimated counts, see my blog for details.
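A sketch of the usual EXPLAIN-based estimation trick (not necessarily the exact function from the blog):
CREATE FUNCTION row_estimate(query text) RETURNS bigint
   LANGUAGE plpgsql AS
$$DECLARE
   plan json;
BEGIN
   -- ask the planner for its row estimate without executing the query
   EXECUTE 'EXPLAIN (FORMAT JSON) ' || query INTO plan;
   RETURN (plan -> 0 -> 'Plan' ->> 'Plan Rows')::bigint;
END;$$;

-- estimate the join size instead of counting exactly:
SELECT row_estimate('SELECT * FROM b JOIN a ON b.id = a.id');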

Optimise LEFT JOIN in PostgreSQL (PGAdmin4)

I have 2 tables in PostgreSQL, one of which has 16 million rows and the other around 3,000. They both share 2 common IDs, but the larger table has thousands of rows repeating the same ID.
I'm trying to do a LEFT JOIN with a few conditions as follows:
SELECT LT.Col1, LT.Col2, LT.Col3, ST.Col1, ST.Col2
FROM large_table as LT
LEFT JOIN small_table as ST
ON LT.id1 = ST.id1 AND LT.id2 = ST.id2
WHERE LT.Col1 > 30
AND LT.Col2 > 2
AND LT.Col3 BETWEEN '11:00:00'::time AND '21:00:00'::time
I have created multi-column indexes based on id1 and id2 for each table, but the query just runs and runs. I'm using pgAdmin 4 on a MacBook Pro (16 GB RAM, 2.9 GHz quad-core i7). I've checked the computer's performance and it's not struggling. Does anybody have any advice on how to speed up the query? Am I just asking too much of it?
Since this is a left outer join, your best bet is to use indexes on large_table that reduce the number of rows early on.
Unfortunately, none of your conditions checks for equality, so a combined index is useless.
You could create indexes on the three filter columns of large_table and see if PostgreSQL uses them (e.g. with a bitmap index scan combining the results); see the sketch below.
Those indexes that PostgreSQL chooses not to use can be dropped again.
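A minimal sketch of that, with hypothetical index names:
CREATE INDEX lt_col1_idx ON large_table (col1);
CREATE INDEX lt_col2_idx ON large_table (col2);
CREATE INDEX lt_col3_idx ON large_table (col3);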
You might try creating a combined index on the tuple (id1, id2) in both tables, then join with ON (LT.id1, LT.id2) = (ST.id1, ST.id2), as in the sketch below.
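Sketched out, with hypothetical index names:
CREATE INDEX lt_id1_id2_idx ON large_table (id1, id2);
CREATE INDEX st_id1_id2_idx ON small_table (id1, id2);

SELECT LT.Col1, LT.Col2, LT.Col3, ST.Col1, ST.Col2
FROM large_table AS LT
LEFT JOIN small_table AS ST
  ON (LT.id1, LT.id2) = (ST.id1, ST.id2)
WHERE LT.Col1 > 30
  AND LT.Col2 > 2
  AND LT.Col3 BETWEEN '11:00:00'::time AND '21:00:00'::time;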

Postgres using an index for one table but not another

I have three tables in my app, call them tableA, tableB, and tableC. tableA has fields for tableB_id and tableC_id, with indexes on both. tableB has a field foo with an index, and tableC has a field bar with an index.
When I do the following query:
select *
from tableA
left outer join tableB on tableB.id = tableA.tableB_id
where lower(tableB.foo) = lower(my_input)
it is really slow (~1 second).
When I do the following query:
select *
from tableA
left outer join tableC on tableC.id = tableA.tableC_id
where lower(tableC.bar) = lower(my_input)
it is really fast (~20 ms).
From what I can tell, the tables are about the same size.
Any ideas as to the huge performance difference between the two queries?
UPDATES
Table sizes:
tableA: 2061392 rows
tableB: 175339 rows
tableC: 1888912 rows
As requested in the postgresql-performance tag info:
Postgres version - 9.3.5
The full text of the queries is above.
Explain plans - tableB tableC
Relevant info from tables:
tableA
tableB_id, integer, no modifiers, storage plain
"index_tableA_on_tableB_id" btree (tableB_id)
tableC_id, integer, no modifiers, storage plain
"index_tableA_on_tableC_id" btree (tableC_id)
tableB
id, integer, not null default nextval('tableB_id_seq'::regclass), storage plain
"tableB_pkey" PRIMARY_KEY, btree (id)
foo, character varying(255), no modifiers, storage extended
"index_tableB_on_lower_foo_tableD" UNIQUE, btree (lower(foo::text), tableD_id)
tableD is a separate table that is otherwise irrelevant
tableC
id, integer, not null default nextval('tableC_id_seq'::regclass), storage plain
"tableC_pkey" PRIMARY_KEY, btree (id)
bar, character varying(255), no modifiers, storage extended
"index_tableC_on_tableB_id_and_bar" UNIQUE, btree (tableB_id, bar)
"index_tableC_on_lower_bar" btree (lower(bar::text))
Hardware:
OS X 10.10.2
CPU: 1.4 GHz Intel Core i5
Memory: 8 GB 1600 MHz DDR3
Graphics: Intel HD Graphics 5000 1536 MB
Solution
Looks like running VACUUM and then ANALYZE on all three tables fixed the issue. After running the commands, the slow query started using "index_tableB_on_lower_foo_tableD".
The other thing is that you query your indexed columns wrapped in lower(), and a plain index on the bare column cannot be used for such an expression.
If you will always query the column as lower(), then the column should be indexed as lower(column_name), as in:
create index idx_1 on tableb (lower(foo));
Also, have you looked at the execution plan? This will answer all your questions if you can see how it is querying the tables.
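For instance, a quick way to inspect the plan of the slow query ('my_input' stands in for the actual parameter):
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM tableA a
JOIN tableB b ON b.id = a.tableB_id
WHERE lower(b.foo) = lower('my_input');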
Honestly, there are many factors to this. The best solution is to study up on indexes, specifically in Postgres, so you can see how they work. It is a bit of a holistic subject; you can't really solve all your problems with only a minimal understanding of how they work.
For instance, Postgres has an initial "let's look at these tables and see how we should query them" phase before the query runs. It looks over all the tables: how big each of them is, what indexes exist, etc., and then figures out how the query should run. THEN it executes it. Oftentimes, this is what is wrong: the engine incorrectly determines how to execute it.
A lot of these calculations are based on the summarized table statistics. You can refresh the summarized statistics for any table by doing:
vacuum [table_name];
(this helps prevent bloat from dead rows)
and then:
analyze [table_name];
I haven't always seen this work, but oftentimes it helps.
Anyway, your best bet is to:
a) Study up on Postgres indexes (a SIMPLE write-up, not something ridiculously complex)
b) Study the execution plan of the query
c) Using your understanding of Postgres indexes and how the query plan is executed, you can't help but solve the exact problem.
For starters, your LEFT JOIN is counteracted by the WHERE predicate on tableB and is forced to act like a plain [INNER] JOIN. Replace it with:
SELECT *
FROM tableA a
JOIN tableB b ON b.id = a.tableB_id
WHERE lower(b.foo) = lower(my_input);
Or, if you actually want the LEFT JOIN to include all rows from tableA:
SELECT *
FROM tableA a
LEFT JOIN tableB b ON b.id = a.tableB_id
AND lower(b.foo) = lower(my_input);
I think you want the first one.
An index on (lower(foo::text)) like you posted is syntactically invalid. You'd better post the verbatim output from \d tbl in psql, as I repeatedly requested in the comments. The shorthand syntax for a cast (foo::text) in an index definition needs more parentheses, or use the standard syntax cast(foo AS text):
Create index on first 3 characters (area code) of phone field?
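To spell out the two unambiguous ways of writing that cast in an index definition (the index names are my own):
CREATE INDEX tableb_lower_foo_idx1 ON tableB (lower((foo)::text));
-- or, with the standard cast syntax:
CREATE INDEX tableb_lower_foo_idx2 ON tableB (lower(cast(foo AS text)));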
But that's also unnecessary. You can just use the data type of foo (character varying(255)) directly. Of course, the data type character varying(255) rarely makes sense in Postgres to begin with; the odd limitation to 255 characters is derived from limitations in other RDBMS which do not apply in Postgres. Details:
Refactor foreign key to fields
Be that as it may. The perfect index for this kind of query would be a multicolumn index on B - if (and only if) you get index-only scans out of this:
CREATE INDEX "tableB_lower_foo_id" ON tableB (lower(foo), id);
You can then drop the mostly superseded index "index_tableB_on_lower_foo". Same for tableC.
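The tableC counterpart would look something like this (hypothetical index name, mirroring the one above):
CREATE INDEX "tableC_lower_bar_id" ON tableC (lower(bar), id);
DROP INDEX "index_tableC_on_lower_bar";  -- now mostly superseded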
The rest is covered by the (more important!) indexes in tableA on tableB_id and tableC_id.
If there are multiple rows in tableA per tableB_id / tableC_id, then either one of these competing commands can swing the performance to favor the respective query by physically clustering related rows together:
CLUSTER tableA USING "index_tableA_on_tableB_id";
CLUSTER tableA USING "index_tableA_on_tableC_id";
You can't have both. It's either B or C. CLUSTER also does everything a VACUUM FULL would do. But be sure to read the details first:
Optimize Postgres timestamp query range
And don't use mixed case identifiers, sometimes quoted, sometimes not. This is very confusing and is bound to lead to errors. Use legal, lower-case identifiers exclusively - then it doesn't matter if you double-quote them or not.