I'm using Postgres.
I have queries of the form:
`SELECT COUNT(*) from clicks where link_id=1`
The clicks table has millions of rows.
These queries are taking 10-20 seconds.
Are there any elegant ways to accelerate this?
Edit: Query plan:
Indexes
CREATE INDEX clicks_link_id_index
ON public.clicks USING btree
(link_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX clicks_link_workspace_id_index
ON public.clicks USING btree
(link_workspace_id ASC NULLS LAST)
TABLESPACE pg_default;
CREATE INDEX index_date_trunc
ON public.clicks USING btree
(date_trunc('day'::text, inserted_at) ASC NULLS LAST)
TABLESPACE pg_default;
SET enable_seqscan TO off:
It thinks the index-only scan is going to be 4 times cheaper than the sequential scan, so I am confused about why it didn't pick the index-only scan in the first place (before you set enable_seqscan = off). Now if you set that back to the default, does it go back to using the seq scan again?
If you freshly VACUUM the table, does it get much faster with the index-only scan? While 99% of the rows are fetched without heap fetches, the remaining 1% could account for most of the time.
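A minimal sketch of both checks, using the clicks table and query from the question:
RESET enable_seqscan;      -- back to the default planner behaviour
VACUUM (ANALYZE) clicks;   -- refreshes the visibility map that index-only scans rely on
EXPLAIN (ANALYZE, BUFFERS)
SELECT COUNT(*) FROM clicks WHERE link_id = 1;
-- In the output, look for "Index Only Scan" and "Heap Fetches: 0".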
Or maybe you should just throw more hardware at it. Having more RAM in which to cache data (either in shared_buffers or in filesystem cache) can't hurt.
Related
Within my db I have a table prediction_fsd with about 5 million entries. The site table contains approx. 3 million entries. I need to execute queries that look like
SELECT prediction_fsd.id AS prediction_fsd_id,
prediction_fsd.site_id AS prediction_fsd_site_id,
prediction_fsd.html_hash AS prediction_fsd_html_hash,
prediction_fsd.prediction AS prediction_fsd_prediction,
prediction_fsd.algorithm AS prediction_fsd_algorithm,
prediction_fsd.model_version AS prediction_fsd_model_version,
prediction_fsd.timestamp AS prediction_fsd_timestamp,
site_1.id AS site_1_id,
site_1.url AS site_1_url,
site_1.status AS site_1_status
FROM prediction_fsd
LEFT OUTER JOIN site AS site_1
ON site_1.id = prediction_fsd.site_id
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
LIMIT 1
At the moment this query takes about 4 seconds. I'd like to reduce that by introducing an index. Which tables and fields should I include in that index? I'm having trouble properly understanding the EXPLAIN ANALYZE output of Postgres.
CREATE INDEX prediction_fsd_site_id_algorithm_timestamp
ON public.prediction_fsd USING btree
(site_id, algorithm, "timestamp" DESC)
TABLESPACE pg_default;
By introducing a combined index as suggested by Frank Heikens, I was able to bring the query execution time down to 0.25s.
These three SQL lines point to a possible BTREE index to help you.
WHERE 95806 = prediction_fsd.site_id
AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
You're filtering the rows of the table by equality on two columns, and ordering by the third column. So try this index.
CREATE INDEX site_alg_ts ON prediction_fsd
(site_id, algorithm, timestamp DESC);
This BTREE index lets PostgreSQL random-access its way to the first eligible row, which also happens to be the row you want because of your ORDER BY ... LIMIT 1 clause.
The query plan in your question says that PostgreSQL did an expensive Parallel Sequential Scan on all five megarows of that table. This index will almost certainly change that to a cheap index lookup.
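To verify after creating the index, you could run something like the following (the query is abbreviated from the question; expect an Index Scan node in place of the Parallel Seq Scan):
EXPLAIN (ANALYZE, BUFFERS)
SELECT prediction_fsd.id
FROM prediction_fsd
WHERE 95806 = prediction_fsd.site_id
  AND prediction_fsd.algorithm = 'xgboost'
ORDER BY prediction_fsd.timestamp DESC
LIMIT 1;
-- Expect a line like: Index Scan using site_alg_ts on prediction_fsd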
On the other table, it appears that you already look up rows in it via the primary key id. So you don't need any other index for that one.
I am new to Postgres and a bit confused on how Postgres decides which index to use if I have more than one btree indexes defined as below.
CREATE INDEX index_1 ON sample_table USING btree (col1, col2, COALESCE(col3, 'col3'::text));
CREATE INDEX index_2 ON sample_table USING btree (col1, COALESCE(col3, 'col3'::text));
I am using col1, col2, COALESCE(col3, 'col3'::text) in my join condition when I write to sample_table (from source tables), but when I run EXPLAIN ANALYZE to get the query plan, I sometimes see that it uses index_2 for the scan rather than index_1, and sometimes it just goes with a sequential scan. I want to understand what can make Postgres use one index over another.
Without seeing EXPLAIN (ANALYZE, BUFFERS) output, I can only give a generic answer.
PostgreSQL considers all execution plans that are feasible and estimates the row count and cost for each node. Then it takes the plan with the lowest cost estimate.
It could be that the condition on col2 is sometimes more selective and sometimes less, for example because you sometimes compare it to rare and sometimes to frequent values. If the condition involving col2 is not selective, it does not matter much which of the two indexes is used. In that case PostgreSQL prefers the smaller two-column index.
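If you want to see what drives those estimates, here is a sketch (table and column names are from the question; the literal values are placeholders you would swap for the values you actually query with):
EXPLAIN (ANALYZE, BUFFERS)
SELECT *
FROM sample_table
WHERE col1 = 'a' AND col2 = 'b';  -- placeholder values

-- The per-column statistics behind the planner's selectivity estimates:
SELECT attname, n_distinct, most_common_vals, most_common_freqs
FROM pg_stats
WHERE tablename = 'sample_table' AND attname IN ('col1', 'col2');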
We're currently running a query that performs a pretty simple join and GROUP BY for a row count, with a UNION ALL at the end.
(select
table_p."name",
table_p.id,
count(table.id),
sum(table.views)
from table
inner join table_p on table_p.id = table.pageid
where table.date BETWEEN '2020-03-01' AND '2020-03-31'
group by table_p.id
order by table_p.id)
union all
(select
table_p."name",
table_p.id,
count(table.id),
sum(table.views)
from table
inner join table_p on table_p.id = table.pageid
where table.date BETWEEN '2020-02-01' AND '2020-02-29'
group by table_p.id
order by table_p.id)
union all ....
We've decided to use a BRIN index because the table holds 360 million records. We do have the option to go with B-Tree if needed.
Now, for some reason, EXPLAIN ANALYZE shows the BRIN index with "parallel aware" set to false, even though two workers are listed in the plan output. We're also seeing linear performance when breaking up the amount we query, i.e. one month in 5 seconds, four months in 20 seconds. I'd assume this means that we're querying serially rather than in parallel.
Does anyone have any ideas on what we could potentially be missing in order to get parallel queries going where possible? Does BRIN not work with Parallel Workers?
Edit: Here is the BRIN index on "table":
CREATE INDEX table_brin_idx
ON table USING brin
(date, teamid, id, pageid, devicetypeid, makeid, modelid)
TABLESPACE pg_default;
My postgres version is PostgreSQL 11.6, compiled by Visual C++ build 1800, 64-bit
Here's a link to the explain analyze that's too big to post here.
Information from PostgreSQL documentation: Currently, parallel index scans are supported only for btree indexes.
Source: https://www.postgresql.org/docs/11/parallel-plans.html#PARALLEL-SCANS
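Since you mention that B-Tree is an option: a hedged sketch of a plain btree index on the range column (the index name is made up, and "table" is the question's placeholder name, quoted here because it is a reserved word):
CREATE INDEX table_date_btree_idx
    ON "table" USING btree (date);

-- Whether the planner actually picks a parallel index scan also depends
-- on cost estimates and settings such as this one (value is illustrative):
SET max_parallel_workers_per_gather = 4;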
I have a table with 2000000 records.
I have one index:
CREATE INDEX test
ON public."PaymentReceipt" USING btree
("CompanyID" DESC NULLS LAST, created_at ASC NULLS LAST)
TABLESPACE pg_default;
If I run this query, it will use the index:
explain
select *
from public."PaymentReceipt"
where "CompanyID" = '5c67762bd0949'
order by "created_at" desc
limit 100 offset 1600589
But if I run this query, it won't use the index:
explain
select *
from public."PaymentReceipt"
where "CompanyID" = '5c67762bd0949'
order by "created_at" desc
limit 100 offset 1600590
I'm not sure what happened to the index!
OFFSET is bad for query performance.
The reason is that in order to skip the first 1,600,590 rows, the database has to find them and then discard them. So the work for the database is the same as if you had omitted the OFFSET clause.
Now a sequential scan of the table is cheaper than an index scan unless the index scan retrieves far fewer rows. The bigger the OFFSET, the more rows have to be fetched, so at some point the query plan "tilts" and PostgreSQL assumes that a sequential scan is cheaper than the index scan. You have found exactly that point.
If your random_page_cost setting reflects your hardware correctly, PostgreSQL will estimate this correctly. In a nutshell, PostgreSQL is doing the right thing.
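If you don't need to jump to arbitrary page numbers, a common workaround is keyset pagination, which avoids the OFFSET entirely. A sketch against the table from the question (the timestamp literal is a placeholder for the created_at value of the last row on the previous page):
SELECT *
FROM public."PaymentReceipt"
WHERE "CompanyID" = '5c67762bd0949'
  AND created_at < '2020-01-01 00:00:00'  -- placeholder: last created_at of the previous page
ORDER BY created_at DESC
LIMIT 100;
-- This can keep using the ("CompanyID", created_at) index, scanned backwards,
-- no matter how deep you page. (Add a unique tiebreaker column if created_at
-- values can repeat.)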
Inside a Before trigger function, I'm trying to optimize a SELECT which uses an array intersection of the form:
select into matching_product * from products where global_ids && NEW.global_ids
The above is pegging the CPU at 100% while doing some modest batch inserts. (Without the above SELECT in the trigger function, the CPU drops to ~5%.)
I did define a GIN-index on global_ids but that doesn't seem to work.
Any other way to optimize the above? E.g.: Should I just go ahead and create a N-M relationship between products and global_ids and do some joins to get the same result?
EDIT
It seems the GIN index IS used; however, it's still slow. Not sure what I can expect (YMMV and all that), but the table has ~200,000 items. A query like the one below takes 300 ms. I feel this should be near-instant.
select * from products where global_ids && '{871712323629}'
Doing an explain on the above shows:
Bitmap Heap Scan on products (cost=40.51..3443.85 rows=1099 width=490)
Recheck Cond: (global_ids && '{871712323629}'::text[])
-> Bitmap Index Scan on "global_ids_GIN" (cost=0.00..40.24 rows=1099 width=0)
Index Cond: (global_ids && '{871712323629}'::text[])
Table definition, removed irrelevant columns
CREATE TABLE public.products
(
id text COLLATE pg_catalog."default" NOT NULL,
global_ids text[] COLLATE pg_catalog."default",
CONSTRAINT products_pkey PRIMARY KEY (id)
)
WITH (
OIDS = FALSE
)
TABLESPACE pg_default;
Index
CREATE INDEX "global_ids_GIN"
ON public.products USING gin
(global_ids COLLATE pg_catalog."default")
TABLESPACE pg_default;
I cannot think of any reason why such a query should behave differently inside a PL/pgSQL function; my experiments suggest that it doesn't.
Run EXPLAIN (ANALYZE, BUFFERS) on a query like the one you run inside the function, several times, to get a good estimate of the duration you should expect.
Run EXPLAIN (ANALYZE, BUFFERS) on inserts like the ones you are doing in batch on a similar table without a trigger to measure how long heap insert and index maintenance will take.
Add these values and multiply by the number of rows you insert in a batch.
If you end up with roughly the same time as you experience, there is no mystery to solve.
If you end up with a “lossy” bitmap index scan (look at EXPLAIN (ANALYZE, BUFFERS) output), you can boost the performance by increasing work_mem.
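A sketch of those measurements, using the query from the question (the work_mem value is illustrative):
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM products WHERE global_ids && '{871712323629}';
-- If the Bitmap Heap Scan reports "Heap Blocks: lossy=...", raise work_mem
-- for the session and re-run the EXPLAIN:
SET work_mem = '64MB';  -- illustrative; the default is 4MB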