I have a table that contains 133,072,194 records, and I am trying to execute
SELECT COUNT(test)
FROM mytable
WHERE test = false
but it is taking 128320.712 ms to execute.
I already have an index on the test column. Could you please let me know what I can optimize or change so that my query becomes faster?
Because of this, my other select query is also not working.
If there are many rows where test is FALSE, you won't be able to get an exact result faster than with a sequential scan, which is slow for big tables.
If only a few rows satisfy the condition, you should create a partial index:
CREATE INDEX mytable_notest_ind ON mytable(id) WHERE NOT test;
(assuming that id is the primary key) and keep mytable autovacuumed often enough that you get an index-only scan.
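With the partial index in place, a count over the same predicate can be answered by an index-only scan. A minimal sketch (the storage parameter value is an assumption to tune for your table):
-- Count via the partial index:
SELECT count(*) FROM mytable WHERE NOT test;

-- Make autovacuum process the table more often, so the visibility map
-- stays fresh enough for index-only scans (value is an assumption):
ALTER TABLE mytable SET (autovacuum_vacuum_scale_factor = 0.01);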
But usually exact results for queries like this are not required.
You could calculate an estimated count from the table statistics with a query like this:
SELECT t.reltuples
       * (1 - t.nullfrac)
       * mcv.freq AS count_false
FROM pg_stats AS s
   CROSS JOIN LATERAL unnest(s.most_common_vals::text::boolean[],
                             s.most_common_freqs) AS mcv(val, freq)
   JOIN pg_class AS t
      ON s.tablename = t.relname
         AND s.schemaname = t.relnamespace::regnamespace::text
WHERE s.tablename = 'mytable'
  AND s.attname = 'test'
  AND mcv.val = FALSE;
That would be very fast.
See my blog post for more considerations about the speed of SELECT count(*).
Related
(Postgres v13)
I've got a query which takes 2-5 seconds to plan. The query joins my languages and translations tables to get translation results for multiple languages. When I add even more languages/translations to load, the execution time grows exponentially.
select
key0_.id as col_0_0_,
key0_.name as col_1_0_,
(select
count(screenshot60_.id)
from
screenshot screenshot60_
inner join
key key61_
on screenshot60_.key_id=key61_.id
where
key0_.id=key61_.id) as col_2_0_,
languages2_.tag as col_3_0_,
translatio31_.id as col_4_0_,
translatio31_.text as col_5_0_,
translatio31_.state as col_6_0_,
translatio31_.auto as col_7_0_,
translatio31_.mt_provider as col_8_0_,
languages3_.tag as col_11_0_,
translatio32_.id as col_12_0_,
translatio32_.text as col_13_0_,
translatio32_.state as col_14_0_,
translatio32_.auto as col_15_0_,
translatio32_.mt_provider as col_16_0_,
... the same over and over many times ...
languages30_.tag as col_227_0_,
translatio59_.id as col_228_0_,
translatio59_.text as col_229_0_,
translatio59_.state as col_230_0_,
translatio59_.auto as col_231_0_,
translatio59_.mt_provider as col_232_0_,
0 as col_233_0_,
0 as col_234_0_
from
key key0_
inner join
project project1_
on key0_.project_id=project1_.id
inner join
language languages2_
on project1_.id=languages2_.project_id
and (
languages2_.tag='en-US'
)
inner join
language languages3_
on project1_.id=languages3_.project_id
and (
languages3_.tag='es-PE'
)
... many times the same ...
inner join
language languages30_
on project1_.id=languages30_.project_id
and (
languages30_.tag='es-MX'
)
left outer join
translation translatio31_
on key0_.id=translatio31_.key_id
and (
translatio31_.language_id=languages2_.id
)
... many times the same ...
left outer join
translation translatio59_
on key0_.id=translatio59_.key_id
and (
translatio59_.language_id=languages30_.id
)
where
(
key0_.name in (
'base_administrative_notes.desc'
)
)
and
key0_.project_id=836
group by
key0_.id ,
languages2_.tag ,
translatio31_.id ,
languages3_.tag ,
translatio32_.id ,
... many times the same ...
languages30_.tag ,
translatio59_.id
order by
key0_.name asc nulls first,
key0_.id asc nulls first limit 1
The visualised EXPLAIN ANALYSE result: https://explain.dalibo.com/plan/uWS (the full query can be found there as well as raw output from explain (ANALYZE, COSTS, VERBOSE, BUFFERS, FORMAT JSON)).
I found in other threads that this can be caused by using too many indexes on the tables, but I only have a unique index on my translations table on key_id and language_id columns.
EDIT:
I've found out that setting join_collapse_limit to some value between 1 and 5 reduces planning time to under 200 ms. I don't know if this is the best solution, but I am going to use it as a workaround for now.
As Laurenz Albe explained, the planner is probably trying to reorder the joins to optimize the query.
With n tables, the number of possible join orders is n! (n factorial).
My suggestion is to:
make sure the join order in your query is already the best one
set that particular parameter to 1 before the query (see the sketch below)
run the query
reset the parameter
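A minimal sketch of those steps; SET LOCAL is scoped to the transaction and is reverted automatically at COMMIT or ROLLBACK:
BEGIN;
SET LOCAL join_collapse_limit = 1;  -- planner keeps the joins in the order you wrote them
-- run the big generated query here
COMMIT;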
You can check Alicja's slide deck (from slide 22) where she illustrates that particular problem with examples here: https://www.postgresql.eu/events/pgconfeu2017/sessions/session/1617/slides/9/FromMinutesToMilliseconds.pdf
I have a bunch of tables in PostgreSQL, and I run a query as follows:
SELECT DISTINCT ON ...some stuff...
FROM "rent_flats" INNER JOIN "rent_flats_linked_users"
ON "rent_flats_linked_users"."rent_flat_id" = "rent_flats"."id"
INNER JOIN "users"
ON "users"."id" = rent_flats_linked_users"."user_id"
INNER JOIN "owners"
ON "owners"."id" = "users"."profile_id" AND "users"."profile_type" = 'Owner'
INNER JOIN "phone_numbers"
ON "phone_numbers"."person_id" = "owners"."id" AND "phone_numbers"."person_type" = 'Owner'
INNER JOIN "phone_number_categories"
ON "phone_number_categories"."id" = "phone_numbers"."phone_number_category_id"
INNER JOIN "localities"
ON "localities"."id" = "rent_flats"."locality_id"
INNER JOIN "regions"
ON "regions"."id" = "localities"."region_id"
INNER JOIN "cities"
ON "cities"."id" = "regions"."city_id"
INNER JOIN "property_types"
ON "property_types"."id" = "rent_flats"."property_type_id"
INNER JOIN "apartment_types"
ON "apartment_types"."id" = "rent_flats"."apartment_type_id"
WHERE "rent_flats"."status" = 3
AND (((extract(epoch from age(current_date,rent_flats.date_added))/86400)::int) IN (cities.short_period,cities.long_period))
AND (phone_number_categories.name IN ('SMS','SMS & Mobile'))
ORDER BY rf_id, phone_numbers.priority ASC
Note: The rent_flats table contains around 5 million rows, rent_flats_linked_users contains around 600k rows, and users contains 350k rows. The other tables are small.
The query takes about 6.8 seconds to execute, and EXPLAIN ANALYZE shows that around 50% of the total time goes into sequential scans of the rent_flats, users and rent_flats_linked_users tables, and another 30% into hash joins.
On setting enable_seqscan to off, the query takes even longer, ~11 seconds (in this case Hash and Hash Join take up to 97.5% of the time).
Here's the EXPLAIN ANALYZE query plan.
I have put indexes on the fields involved in the inner joins as well as on the fields involved in filters like phone_numbers.priority and cities.short_period and cities.long_period. But I still get a sequential scan. What could the reasons be, and what are possible solutions to speed up the query?
I suspect that if there is a part of that query worth optimising then it is this:
(((extract(epoch from age(current_date,rent_flats.date_added))/86400)::int) IN (cities.short_period,cities.long_period))
You really need to turn that into something like:
rent_flats.date_added in (...)
Then you can index date_added, and maybe index (date_added, status).
The next step would be to make sure that the join columns are indexed.
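A hedged sketch of that rewrite, assuming date_added is a date and short_period/long_period are integer day counts (which is what the original expression suggests). The relevant WHERE-clause fragment compares the column directly against computed dates, so an index on date_added becomes usable:
-- equivalent to "days since date_added IN (short_period, long_period)":
AND rent_flats.date_added IN (current_date - cities.short_period,
                              current_date - cities.long_period)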
I'm using PostgreSQL 9.2.9 and have the following problem.
There is a function:
CREATE OR REPLACE FUNCTION report_children_without_place(text, date, date, integer)
RETURNS TABLE (department_name character varying, kindergarten_name character varying, a1 bigint) AS $BODY$
BEGIN
RETURN QUERY WITH rh AS (
SELECT (array_agg(status ORDER BY date DESC))[1] AS status, request
FROM requeststatushistory
WHERE date <= $3
GROUP BY request
)
SELECT
w.name,
kgn.name,
COUNT(*)
FROM kindergarten_request_table_materialized kr
JOIN rh ON rh.request = kr.id
JOIN requeststatuses s ON s.id = rh.status AND s.sysname IN ('confirmed', 'need_meet_completion', 'kindergarten_need_meet')
JOIN workareas kgn ON kr.kindergarten = kgn.id AND kgn.tree <# CAST($1 AS LTREE) AND kgn.active
JOIN organizationforms of ON of.id = kgn.organizationform AND of.sysname IN ('state','municipal','departmental')
JOIN workareas w ON w.tree #> kgn.tree AND w.active
JOIN workareatypes mt ON mt.id = w.type AND mt.sysname = 'management'
WHERE kr.requestyear = $4
GROUP BY kgn.name, w.name
ORDER BY w.name, kgn.name;
END
$BODY$ LANGUAGE PLPGSQL STABLE;
EXPLAIN ANALYZE SELECT * FROM report_children_without_place('83.86443.86445', '14-04-2015', '14-04-2015', 2014);
Total runtime: 242805.085 ms.
But the query from the function's body executes much faster:
EXPLAIN ANALYZE WITH rh AS (
SELECT (array_agg(status ORDER BY date DESC))[1] AS status, request
FROM requeststatushistory
WHERE date <= '14-04-2015'
GROUP BY request
)
SELECT
w.name,
kgn.name,
COUNT(*)
FROM kindergarten_request_table_materialized kr
JOIN rh ON rh.request = kr.id
JOIN requeststatuses s ON s.id = rh.status AND s.sysname IN ('confirmed', 'need_meet_completion', 'kindergarten_need_meet')
JOIN workareas kgn ON kr.kindergarten = kgn.id AND kgn.tree <# CAST('83.86443.86445' AS LTREE) AND kgn.active
JOIN organizationforms of ON of.id = kgn.organizationform AND of.sysname IN ('state','municipal','departmental')
JOIN workareas w ON w.tree #> kgn.tree AND w.active
JOIN workareatypes mt ON mt.id = w.type AND mt.sysname = 'management'
WHERE kr.requestyear = 2014
GROUP BY kgn.name, w.name
ORDER BY w.name, kgn.name;
Total runtime: 2156.740 ms.
Why does the function execute so much longer than the same query? Thanks!
Your query runs faster because the "variables" are not actually variable: they are static values (i.e., strings in quotes). This means the execution planner can leverage indexes. Within your stored procedure, your variables are actual variables, and the planner cannot make assumptions about indexes. For example, you might have a partial index on requeststatushistory where "date" is <= '2012-12-31'. The index can only be used if the value of $3 is known at plan time. Since $3 might hold a date from 2015, the partial index would be of no use; in fact, it would be detrimental.
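To make that concrete, such a partial index could look like this (a hypothetical example; the planner can only choose it when the date bound is a constant known at plan time):
CREATE INDEX requeststatushistory_2012_idx
    ON requeststatushistory (request, date)
    WHERE date <= '2012-12-31';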
I frequently construct a string within my functions where I concatenate my variables as literals, and then execute it using something like the following:
DECLARE
my_dynamic_sql TEXT;
BEGIN
my_dynamic_sql := $$
SELECT *
FROM my_table
WHERE $$ || quote_literal($3) || $$::TIMESTAMPTZ BETWEEN start_time
AND end_time;$$;
/* You can only see this if client_min_messages = DEBUG */
RAISE DEBUG '%', my_dynamic_sql;
RETURN QUERY EXECUTE my_dynamic_sql;
END;
The dynamic SQL is VERY useful because you can actually get an EXPLAIN of the query: with SET client_min_messages = DEBUG, I can scrape the query from the screen and paste it back in after EXPLAIN or EXPLAIN ANALYZE to see what the execution planner is doing. This also allows you to construct very different queries as needed to optimize for the variables (i.e., excluding unnecessary tables if warranted) while maintaining a common API for your clients.
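A sketch of that workflow, assuming the function body has been rewritten to build its SQL dynamically and RAISE DEBUG it, as above:
SET client_min_messages = DEBUG;
SELECT * FROM report_children_without_place('83.86443.86445', '14-04-2015', '14-04-2015', 2014);
-- copy the statement printed as DEBUG output, then run:
-- EXPLAIN ANALYZE <pasted query>;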
You may be tempted to avoid dynamic SQL for fear of performance issues (I was at first), but you will be amazed at how LITTLE time is spent planning compared to the cost of a couple of table scans in your seven-table join!
Good luck!
Follow-up: You might experiment with Common Table Expressions (CTEs) for performance as well. If you have a table with a low signal-to-noise ratio (many, many more records than you actually want to return), then a CTE can be very helpful. PostgreSQL executes CTEs early in the query and materializes the resulting rows in memory. This allows you to use the same result set multiple times and in multiple places in your query. The benefit can really be surprising if you design it correctly.
sql_txt := $$
WITH my_cte as (
select fk1 as moar_data1
, field1
, field2 /*do not need all other fields taking up RAM!*/
from my_table
where field3 between $$ || quote_literal(input_start_ts) || $$::timestamptz
and $$ || quote_literal(input_end_ts) || $$::timestamptz
),
keys_cte as ( select key_field
from big_look_up_table
where look_up_name = ANY($$ ||
QUOTE_LITERAL(input_array_of_names) || $$::VARCHAR[])
)
SELECT field1, field2, moar_data1, moar_data2
FROM moar_data_table
INNER JOIN my_cte
USING (moar_data1)
WHERE moar_data_table.moar_data_key in (select key_field from keys_cte) $$;
An execution plan is likely to show that it chooses to use an index on moar_data_table.moar_data_key. This would appear to go against what I said above in my prior answer, except for the fact that the keys_cte results are materialized (and therefore cannot be changed by another transaction in a race condition): you have your own little copy of the data for use in this query.
Oh - and CTEs can use other CTEs that are declared earlier in the same query. I have used this "trick" to replace sub-queries in very complex joins and seen great improvements.
Happy Hacking!
I have been given a task to optimize the SQL query below. Currently the query is timing out and causing a lot of blocking. I just started using T-SQL, so please help me optimize the query.
select ExcludedID
from OfferConditions with (NoLock)
where OfferID = 27251
and ExcludedID in (210,223,409,423,447,480,633,...lots and lots of these...,
13346,13362,13380,13396,13407,1,2)
union
select CustomerGroupID as ExcludedID
from CPE_IncentiveCustomerGroups ICG with (NoLock)
inner join CPE_RewardOptions RO with (NoLock)
on RO.RewardOptionID = ICG.RewardOptionID
where RO.IncentiveID = 27251
AND ICG.Deleted = 0 and RO.Deleted = 0
and ExcludedUsers = 1
and CustomerGroupID in (210,223,409,423,447,480,633,...lots and lots of these...,
13346,13362,13380,13396,13407,1,2);
You can try inserting those IDs into a temp table and joining against it instead of using the IN list.
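A minimal sketch of that approach (the temp table and alias names are hypothetical):
CREATE TABLE #ExcludedIDs (ID int PRIMARY KEY);

INSERT INTO #ExcludedIDs (ID)
VALUES (210), (223), (409); -- ...and the rest of the ID list...

SELECT oc.ExcludedID
FROM OfferConditions oc
INNER JOIN #ExcludedIDs x ON x.ID = oc.ExcludedID
WHERE oc.OfferID = 27251;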
The key to solving your problem is NOT to fix the SQL, but to fix the indexes on your tables. For example, you should have a compound index on the OfferConditions table with OfferID and ExcludedID.
When you create the indexes on the other tables, remember that if a field appears in the WHERE clause OR the join filter, it should be part of your compound index.
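For example (a sketch; the index name is hypothetical):
CREATE INDEX IX_OfferConditions_OfferID_ExcludedID
    ON OfferConditions (OfferID, ExcludedID);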
I'd like to count all the Orders that are not urgent and whose order status = 1 (shipped).
This should be a very simple query to optimize. I'd like to put a simple filtered index on the Orders table to cover this query and make it a constant-time/O(1) operation. However, when I look at the query plan, it looks like it's using an Index Scan, which doesn't make sense. Ideally, this query should just return the number of items in the index.
The table look like this (simplified to get to the essence):
CREATE TABLE [dbo].[Orders](
    [Id] [int] IDENTITY(1,1) NOT NULL,
    [IsUrgent] [bit] NOT NULL,
    [Status] [tinyint] NOT NULL,
    CONSTRAINT [PK_Orders] PRIMARY KEY CLUSTERED ( [Id] ASC )
);
I've created this filtered index:
CREATE INDEX IX_Orders_ShippedNonUrgent ON Orders(Id) WHERE IsUrgent = 0 AND Status = 1;
Now, when I do this query:
SELECT COUNT(*) FROM Orders WHERE IsUrgent = 0 AND Status = 1
I see that the query plan is using IX_Orders_ShippedNonUrgent, but it's doing an Index Scan and performing around 200 reads across the ~150,000 rows in Orders.
Is it possible to always have this query run in constant time assuming the filtered index is kept up to date? Ideally, it should only perform 1 read to get the size of the index.
If I switch to a non-filtered index like this:
CREATE INDEX IX_Orders_IsUrgentStatus ON Orders(IsUrgent, Status);
The query plan uses an Index Seek, but still performs many more reads than should be necessary to answer this simple query.
UPDATE
I'm able to do this
SELECT TOP 1 rows FROM sys.partitions p
INNER JOIN sys.indexes i
ON i.name = 'IX_Orders_ShippedNonUrgent'
AND i.object_id = p.object_id
AND i.index_id = p.index_id
and get the result in 9 reads, but it seems like there should be a much easier and less brittle way that keeps the simple COUNT(*) query.
It seems like what I'm wanting isn't possible. The best answer was left in the comments by Nikola Markovinović, which is to forget about the filtered index and use an indexed view instead:
CREATE VIEW [dbo].vw_Orders_TotalShippedNonUrgent WITH SCHEMABINDING
AS
SELECT COUNT_BIG(*) AS TotalOrders
FROM dbo.Orders WHERE IsUrgent = 0 AND Status = 1;
with
CREATE UNIQUE CLUSTERED INDEX IX_vw_Orders_TotalShippedNonUrgent ON vw_Orders_TotalShippedNonUrgent(TotalOrders);
This forces me to create a view and its index for each summary statistic I want, as well as rewriting queries to ask the view instead of taking the simple approach, but it is fast, at only 2 reads.
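Reading the count back is then a single-row lookup. A sketch (on editions other than Enterprise/Developer, the NOEXPAND hint is required for the view's index to be used directly):
SELECT TotalOrders
FROM dbo.vw_Orders_TotalShippedNonUrgent WITH (NOEXPAND);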
I'll leave this question open for a while in case anyone has a simpler approach that's just as fast.