postgres Query optimization merge index - postgresql

I am experienced in fine-tuning in Oracle, but in Postgres I am unable to improve performance.
Problem statement: I need to aggregate rows from one Postgres table that has a large number of columns (110) and 175 million rows for a month-long range. Apart from the aggregation, the query has a very simple WHERE clause:
where time between '2019-03-15' and '2019-04-15'
and org_name in ('xxx','yyy'.. 15 elements)
There are individual btree indexes on the table for each column: idx_time on "time" and idx_org_name on "org_name", but no composite index.
I tried creating a new index on ('org_name', 'time'), but my manager does not want to change anything.
How can I make it run faster? It currently takes 15 minutes (with a smaller set of org_name values it takes 6 minutes). Most of the time is spent on data access from the table.
Is parallel execution possible?
thanks, Jay
QUERY EXPLAIN ANALYZE :
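On the parallel-execution question: PostgreSQL 9.6 and later can parallelize sequential scans and aggregation. A minimal sketch for checking whether the planner will use it, reusing the WHERE clause from the question (the table name and the aggregate are placeholders, and the setting values are illustrative, not tuned):

```sql
-- Allow more workers for this session (illustrative values)
SET max_parallel_workers_per_gather = 4;
-- Lower the cost thresholds so the planner considers parallel plans
SET parallel_setup_cost = 100;
SET parallel_tuple_cost = 0.01;

EXPLAIN (ANALYZE, BUFFERS)
SELECT org_name, count(*)           -- stand-in for the real aggregation
FROM   the_table                    -- hypothetical table name
WHERE  time BETWEEN '2019-03-15' AND '2019-04-15'
AND    org_name IN ('xxx', 'yyy');  -- 15 elements in the real query
```

If the resulting plan shows a Gather node above a Parallel Seq Scan (or Parallel Index Scan) with launched workers, parallelism is in effect.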

Related

Query takes long time to run on postgreSQL database despite creating an index

Using PostgreSQL 14.3.1, I have created a database instance that is now 1TB in size. The main userlogs table is 751GB in size with 525GB used for data and 226GB used for various indexes on this table. The userlogs table currently contains over 900 million rows. In order to assist with querying this table, a separate Logdates table holds all unique dates for the user logs and there is an integer foreign key column for logdates created in userlogs called logdateID. Amongst the various indexes on the userlogs table, one of them is on logdateID. There are 104 date entries in Logdates table. When running the below query I would expect the index to be used and the 104 records to be retrieved in a reasonable period of time.
select distinct logdateid from userlogs;
This query took a few hours to return with the data. I did an explain plan on the query and the output is as shown below.
"HashAggregate (cost=80564410.60..80564412.60 rows=200 width=4)"
" Group Key: logdateid"
" -> Seq Scan on userlogs (cost=0.00..78220134.28 rows=937710528 width=4)"
I then issued the command below to request that the database use the index.
set enable_seqscan=off
The revised explain plan now comes as below:
"Unique (cost=0.57..3705494150.82 rows=200 width=4)"
" -> Index Only Scan using ix_userlogs_logdateid on userlogs (cost=0.57..3703149874.49 rows=937710528 width=4)"
However, when running the same query, it still takes a few hours to retrieve the data. My question is, why should it take that long to retrieve the data if it is doing an index only scan?
The machine on which the database sits is highly spec'd: a xeon 16-core processor, that with virtualisation enabled, gives 32 logical cores. There is 96GB of RAM and data storage is via a RAID 10 configured 2TB SSD disk with a separate 500GB system SSD disk.
There are no possibilities to optimize such queries in PostgreSQL due to the internal structure of the data storage into rows inside pages.
All queries involving an aggregate in PostgreSQL, such as COUNT, COUNT DISTINCT or DISTINCT, must read all the rows inside the table pages to produce the result.
Let us take a look at the paper I wrote about this problem:
PostGreSQL vs Microsoft SQL Server – Comparison part 2 : COUNT performances
It seems like your table has none of its pages set as all visible (compare pg_class.relallvisible to the actual number of pages in the table), which is weird because even insert-only tables should get autovacuumed in v13 and up. This will severely punish the index-only scan. You can try to manually vacuum the table to see if that changes things.
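A quick sketch of that check and the manual vacuum, using the table name from the question:

```sql
-- Compare all-visible pages to the total page count for the table;
-- a large gap means the index-only scan must visit the heap anyway
SELECT relpages, relallvisible
FROM   pg_class
WHERE  relname = 'userlogs';

-- Manually vacuum so the visibility map gets updated
VACUUM (VERBOSE) userlogs;
```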
It is also weird that it is not using parallelization. It certainly should be. What are your non-default configuration settings?
Finally, I wouldn't expect even the poor plan you show to take a few hours. Maybe your hardware is not performing up to what it should. (Also, RAID 10 requires at least 4 disks, but your description makes it sound like that is not what you have)
Since you have the foreign-key table, you could use it in your query, testing for each of its rows that at least one matching row exists in the log table.
select logdateid from logdate where exists
(select 1 from userlogs where userlogs.logdateid=logdate.logdateid);
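If the distinct values need to come straight from userlogs, another known PostgreSQL technique is to emulate a "loose index scan" with a recursive CTE, which hops from one distinct value to the next with roughly one index probe per value (about 104 here) instead of scanning 900+ million index entries. A sketch using the column name from the question:

```sql
WITH RECURSIVE distinct_ids AS (
    -- anchor: smallest logdateid via one index probe
    (SELECT logdateid FROM userlogs ORDER BY logdateid LIMIT 1)
  UNION ALL
    -- step: the next logdateid strictly greater than the previous one
    SELECT (SELECT u.logdateid
            FROM   userlogs u
            WHERE  u.logdateid > d.logdateid
            ORDER  BY u.logdateid
            LIMIT  1)
    FROM   distinct_ids d
    WHERE  d.logdateid IS NOT NULL
)
SELECT logdateid FROM distinct_ids WHERE logdateid IS NOT NULL;
```

This relies on the existing ix_userlogs_logdateid index and typically returns in milliseconds where a plain DISTINCT scans the whole relation.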

Postgres - distinct query slow over 5 million data

I am trying to do a SELECT DISTINCT on a table with 5 million rows, which takes approximately 2 minutes. My intent is to improve the speed to seconds.
Query: select distinct accounttype from t_fin_do where country_id='abc'
I tried a composite index; the cost just went up.
You can try to create a partial and covering index for this query with:
create index on t_fin_do(country_id, accounttype) where country_id='abc';
Depending on the selectivity of country_id this could be faster than a table seq scan.
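To confirm the partial index is actually being used, a sketch with the table and column names from the question:

```sql
-- Vacuum first so the visibility map is current and an
-- index-only scan stays cheap
VACUUM t_fin_do;

EXPLAIN (ANALYZE, BUFFERS)
SELECT DISTINCT accounttype
FROM   t_fin_do
WHERE  country_id = 'abc';
-- Look for "Index Only Scan using ..." in the plan output
```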

Very long query planning times for database with lots of partitions in PostgreSQL

I have a PostgreSQL 10 database that contains two tables which both have two levels of partitioning (by list).
The data is now stored within 5K to 10K partitioned tables (grand-children of the two tables mentioned above) depending on the day.
There are three indexes per grand-child partition table, but the two columns on which partitioning is done aren't indexed
(since I don't think that is needed, no?).
The issue I'm observing is that the query planning time is very slow but the query execution time very fast.
Even when the partition values were hard-coded in the query.
Researching the issue, I thought that the linear search used by PostgreSQL 10 to find the partition metadata was the cause.
cf: https://blog.2ndquadrant.com/partition-elimination-postgresql-11/
So I decided to try out PostgreSQL 11 which includes the two aforementioned patches:
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=499be013de65242235ebdde06adb08db887f0ea5
https://git.postgresql.org/gitweb/?p=postgresql.git;a=commit;h=9fdb675fc5d2de825414e05939727de8b120ae81
Alas, it seems that the version change doesn't change anything.
Now I know that having lots of partitions isn't greatly appreciated by PostgreSQL, but I would still like to understand why the query planner is so slow in PostgreSQL 10, and still is in PostgreSQL 11.
An example of a query would be:
EXPLAIN ANALYZE
SELECT
    table_a.a,
    table_b.a
FROM (
    SELECT a, b
    FROM table_a
    WHERE partition_level_1_column = 'foo'
      AND partition_level_2_column = 'bar'
) AS table_a
INNER JOIN (
    SELECT a, b
    FROM table_b
    WHERE partition_level_1_column = 'baz'
      AND partition_level_2_column = 'bat'
) AS table_b
    ON table_b.b = table_a.b
LIMIT 10;
Running it on a database with 5K partitions returns Planning Time: 7155.647 ms but Execution Time: 2.827 ms.
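One workaround when the partition-key values are hard-coded, as above, is to address the leaf partitions directly, so the planner never has to consider the other 5K tables. The leaf-table names below are hypothetical (they depend on how the partitions were named):

```sql
EXPLAIN ANALYZE
SELECT a.a, b.a
FROM   table_a_foo_bar AS a   -- hypothetical leaf for ('foo', 'bar')
JOIN   table_b_baz_bat AS b   -- hypothetical leaf for ('baz', 'bat')
  ON   b.b = a.b
LIMIT  10;
```

Planning then only touches the two named relations, at the cost of the application having to know the partition layout.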

Query on large, indexed table times out

I am relatively new to using Postgres, but am wondering what could be the workaround here.
I have a table with about 20 columns and 250 million rows, and an index created for the timestamp column time (but no partitions).
Queries sent to the table have been failing, running endlessly, even simple select * queries (although the "view first/last 100 rows" function in pgAdmin works).
For example, if I want to LIMIT a selection of the data to 10
SELECT * from mytable
WHERE time::timestamp < '2019-01-01'
LIMIT 10;
Such a query hangs. What can be done to optimize queries on a table this large? When the table was smaller (~100 million rows), queries would always complete. What should one do in this case?
If time is of data type timestamp, or the index is created on (time::timestamp), the query should be lightning-fast.
Please show the CREATE TABLE and the CREATE INDEX statement, and the EXPLAIN output for the query for more details.
"Query that doesn't complete" usually means that it does disk swaps. Especially when you mention the fact that with 100M rows it manages to complete. That's because index for 100M rows still fits in your memory. But index twice this size doesn't.
Limit won't help you here, as database probably decides to read the index first, and that's what kills it.
You could try and increase available memory, but partitioning would actually be the best solution here.
Partitioning means smaller tables. Smaller tables means smaller indexes. Smaller indexes have better chances to fit into your memory.
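A minimal sketch of declarative range partitioning by month on the time column (table and partition names are illustrative, and the real table would carry its ~20 columns):

```sql
-- Parent table partitioned by range on "time"
CREATE TABLE mytable_part (
    -- ... the other ~20 original columns go here ...
    "time" timestamp NOT NULL
) PARTITION BY RANGE ("time");

-- One partition per month; repeat for each month of data
CREATE TABLE mytable_2018_12 PARTITION OF mytable_part
    FOR VALUES FROM ('2018-12-01') TO ('2019-01-01');

-- Index each partition on "time" so range queries stay index-driven
CREATE INDEX ON mytable_2018_12 ("time");
```

A query like WHERE "time" < '2019-01-01' then only scans the partitions whose ranges overlap the predicate. Note that the cast in the original query (time::timestamp) can prevent both pruning and index use if time is already a timestamp; compare the predicate directly against the column instead.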

How to speed up for huge tables in SQL select query?

I have the following huge tables.
Table_1 with 500000 (0.5m) rows
Table_8 with 20000000 (20m) rows
Table_13 with 4000000 (4m) rows
Table_6 with 500000 (0.5m) rows
Table_15 with 200000 (0.2m) rows
I need to pull out many records (recent events) to show on a Google map by joining about 28 tables.
How can I speed up a SQL select query over huge tables like these?
I searched Google and found advice to use clustered and non-clustered indexes. Following the recommendations of the DTA (Database Engine Tuning Advisor), I built those clustered and non-clustered indexes, but it still takes a long time.
I have 2 views and 1 stored procedure as the following.
https://gist.github.com/LaminGitHub/57b314b34599b2a47e65
Please kindly give me idea.
Best Regards