CloudSQL with PostgreSQL very slow performance - postgresql

I wanted to migrate from BigQuery to CloudSQL to save cost.
My problem is that Cloud SQL with PostgreSQL is very, very slow compared to BigQuery.
A query that takes 1.5 seconds in BigQuery takes almost 4.5 minutes(!) on CloudSQL with PostgreSQL.
I have a Cloud SQL for PostgreSQL server with the following configs:
My database has a main table with 16M rows (around 14GB in RAM).
An example query:
EXPLAIN ANALYZE
SELECT
    "title"
FROM
    public.videos
WHERE
    EXISTS (
        SELECT *
        FROM (
            SELECT
                COUNT(DISTINCT CASE WHEN LOWER(param) LIKE '%thriller%' THEN '0'
                                    WHEN LOWER(param) LIKE '%crime%' THEN '1' END) AS count
            FROM
                UNNEST(categories) AS param
        ) alias
        WHERE count = 2
    )
ORDER BY views DESC
LIMIT 12 OFFSET 0
The table is a videos table with a categories column of type text[].
The search condition looks for rows whose categories array contains at least one entry matching '%thriller%' and at least one matching '%crime%', which is what the COUNT(DISTINCT ...) = 2 check expresses.
The EXPLAIN ANALYZE of this query gives this output (CSV): link.
The EXPLAIN (BUFFERS) of this query gives this output (CSV): link.
Query Insights graph:
Memory profile:
BigQuery reference for the same query on the same table size:
Server config: link.
Table describe: link.
My goal is to get Cloud SQL to the same query speed as BigQuery.

For anyone coming here wondering how to tune their Postgres machine on Cloud SQL: the settings are called flags, and you can change them from the UI, although not all configuration options are editable.
https://cloud.google.com/sql/docs/postgres/flags#console
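The same flags can also be set from the command line. A minimal sketch using the gcloud CLI (the instance name and flag value are examples only; check the supported-flags list at the link above):
# Example only: sets work_mem (value in kB) on a hypothetical instance named my-postgres-instance.
# Note that --database-flags replaces the whole set of flags configured previously.
gcloud sql instances patch my-postgres-instance --database-flags=work_mem=65536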

The initial query looks overcomplicated. It could be rewritten as:
SELECT v."title"
FROM public.videos v
WHERE array_to_string(v.categories, '^') ILIKE ALL (ARRAY['%thriller%', '%crime%'])
ORDER BY views DESC
LIMIT 12 OFFSET 0;
db<>fiddle demo
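If the rewritten query still ends up in a sequential scan over 16M rows, a trigram index can make the ILIKE predicates indexable. A minimal sketch, assuming the pg_trgm extension is available; the wrapper function and index name are illustrative, and the predicates are spelled out as two ANDed ILIKEs so the planner can match them against the index:
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- array_to_string() is STABLE, not IMMUTABLE, so wrap it for use in an index expression:
CREATE OR REPLACE FUNCTION immutable_array_to_string(text[], text) RETURNS text AS
$$ SELECT array_to_string($1, $2); $$ LANGUAGE sql IMMUTABLE;

CREATE INDEX videos_categories_trgm_idx
    ON public.videos
    USING gin (immutable_array_to_string(categories, '^') gin_trgm_ops);

SELECT v."title"
FROM   public.videos v
WHERE  immutable_array_to_string(v.categories, '^') ILIKE '%thriller%'
AND    immutable_array_to_string(v.categories, '^') ILIKE '%crime%'
ORDER  BY v.views DESC
LIMIT  12 OFFSET 0;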

PostgreSQL is very slow by design on every query involving the COUNT aggregate function, and there is absolutely nothing you can do about it except use a materialized view to get the performance back.
The tests I have made on my 48-core machine comparing COUNT performance between PostgreSQL and MS SQL Server are clear: SQL Server is between 61 and 561 times faster in all situations, and with a columnstore index SQL Server can be 1,533 times faster…
The same speed is reached with any other RDBMS. The explanation is clearly PG's MVCC, which keeps ghost rows inside table and index pages, so every row has to be visited to know whether it is an active or a ghost row... In all the other RDBMSes, the count is done by reading a single piece of information at the top of the page (the number of rows in the page), and also by using parallelized access, or in SQL Server batch access rather than row-by-row access...
There is nothing you can do to speed up COUNT in PG until the storage engine is entirely rewritten to avoid ghost slots inside pages...
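If a precomputed result is acceptable, a minimal sketch of the materialized-view workaround mentioned above (the view name and refresh policy are assumptions, and the filter mirrors the rewritten query from the previous answer):
CREATE MATERIALIZED VIEW videos_thriller_crime AS
SELECT title, views
FROM   public.videos
WHERE  array_to_string(categories, '^') ILIKE '%thriller%'
AND    array_to_string(categories, '^') ILIKE '%crime%';

-- Refresh on whatever schedule the data allows; plain REFRESH locks the view while it runs.
REFRESH MATERIALIZED VIEW videos_thriller_crime;

SELECT title FROM videos_thriller_crime ORDER BY views DESC LIMIT 12;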

I believe you need to use a full-text search and the special GIN index. The steps:
Create the helper function for the index:
CREATE OR REPLACE FUNCTION immutable_array_to_string(text[]) RETURNS text AS $$
    SELECT array_to_string($1, ',');
$$ LANGUAGE sql IMMUTABLE;
Create index itself:
CREATE INDEX videos_cats_fts_idx ON videos USING gin(to_tsvector('english', LOWER(immutable_array_to_string(categories))));
Use the following query:
SELECT title FROM videos WHERE to_tsvector('english', LOWER(immutable_array_to_string(categories))) @@ to_tsquery('english', 'thriller & crime') LIMIT 12 OFFSET 0;
Be aware that this query has a different meaning for 'crime' and 'thriller': they are no longer plain substrings but tokens in English text. It looks like that is actually better for your task, though. Also, this index is not a good fit for frequently changing data; it should work fine when the data is mostly read-only.
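To confirm the planner actually picks up the GIN index, an EXPLAIN along these lines (a sketch only) should show a Bitmap Index Scan on videos_cats_fts_idx:
EXPLAIN ANALYZE
SELECT title
FROM   videos
WHERE  to_tsvector('english', LOWER(immutable_array_to_string(categories)))
       @@ to_tsquery('english', 'thriller & crime')
LIMIT  12 OFFSET 0;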
PS
This answer is inspired by answer & comments: https://stackoverflow.com/a/29640418/159923

Apart from the SQL syntax optimization, have you tried tuning PostgreSQL itself?
Checking the explain output, I found only two parallel workers planned and only 25 kB of memory used for sorting:
Workers Planned: 2
Sort Method: quicksort Memory: 25kB
Your query is a typical OLAP query. Its performance usually depends on memory and on the number of CPU cores used (workers). The default Postgres settings use kilobyte-level memory and few workers. You can tune your postgresql.conf so it works as an OLAP-type database.
===================================================
Here is my suggestion: use more memory (about 9 MB as work_mem) and more CPUs (up to 16 workers):
# DB Version: 13
# OS Type: linux
# DB Type: dw
# Total Memory (RAM): 24 GB
# CPUs num: 16
# Data Storage: ssd
max_connections = 40
shared_buffers = 6GB
effective_cache_size = 18GB
maintenance_work_mem = 2GB
checkpoint_completion_target = 0.9
wal_buffers = 16MB
default_statistics_target = 500
random_page_cost = 1.1
effective_io_concurrency = 200
work_mem = 9830kB
min_wal_size = 4GB
max_wal_size = 16GB
max_worker_processes = 16
max_parallel_workers_per_gather = 8
max_parallel_workers = 16
max_parallel_maintenance_workers = 4
You can append these lines to the end of your postgresql.conf and restart your PostgreSQL server for them to take effect.
For further optimization,
reduce the number of connections and increase work_mem.
200 × 9830 kB is about 2 GB of memory for all connections. If you have fewer connections (for example, 100), more memory is left for each query to work with.
====================================
Regarding the text array type and unnest: you can try to add a proper index, as sketched below.
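A minimal sketch of a GIN index directly on the text[] column (the index name is an assumption); note that it supports array containment operators such as @>, not LIKE on substrings of the elements:
CREATE INDEX videos_categories_gin_idx ON public.videos USING gin (categories);

-- Usable only for exact element matches, e.g.:
SELECT title FROM public.videos WHERE categories @> ARRAY['thriller', 'crime'];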
That's all, good luck.
WangYong

Related

Query takes long time to run on postgreSQL database despite creating an index

Using PostgreSQL 14.3.1, I have created a database instance that is now 1TB in size. The main userlogs table is 751GB in size with 525GB used for data and 226GB used for various indexes on this table. The userlogs table currently contains over 900 million rows. In order to assist with querying this table, a separate Logdates table holds all unique dates for the user logs and there is an integer foreign key column for logdates created in userlogs called logdateID. Amongst the various indexes on the userlogs table, one of them is on logdateID. There are 104 date entries in Logdates table. When running the below query I would expect the index to be used and the 104 records to be retrieved in a reasonable period of time.
select distinct logdateid from userlogs;
This query took a few hours to return with the data. I did an explain plan on the query and the output is as shown below.
"HashAggregate (cost=80564410.60..80564412.60 rows=200 width=4)"
" Group Key: logdateid"
" -> Seq Scan on userlogs (cost=0.00..78220134.28 rows=937710528 width=4)"
I then issued the below command to request that the database use the index.
set enable_seqscan=off
The revised explain plan now comes as below:
"Unique (cost=0.57..3705494150.82 rows=200 width=4)"
" -> Index Only Scan using ix_userlogs_logdateid on userlogs (cost=0.57..3703149874.49 rows=937710528 width=4)"
However, when running the same query, it still takes a few hours to retrieve the data. My question is, why should it take that long to retrieve the data if it is doing an index only scan?
The machine on which the database sits is highly spec'd: a Xeon 16-core processor which, with virtualisation enabled, gives 32 logical cores. There is 96GB of RAM, and data storage is via a RAID 10 configured 2TB SSD disk with a separate 500GB system SSD disk.
There is no possibility to optimize such queries in PostgreSQL because of the internal structure of the data storage into rows inside pages.
All queries involving an aggregate in PostgreSQL such as COUNT, COUNT DISTINCT or DISTINCT must read all rows inside the table pages to produce the result.
Let us take a look at the paper I wrote about this problem:
PostgreSQL vs Microsoft SQL Server – Comparison part 2: COUNT performances
It seems like your table has none of its pages set as all visible (compare pg_class.relallvisible to the actual number of pages in the table), which is weird because even insert-only tables should get autovacuumed in v13 and up. This will severely punish the index-only scan. You can try to manually vacuum the table to see if that changes things.
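A sketch of that check and of the manual vacuum (userlogs is the table from the question):
-- Compare the number of all-visible pages with the total number of pages:
SELECT relpages, relallvisible FROM pg_class WHERE relname = 'userlogs';

-- Then vacuum manually and re-check; VERBOSE reports what was done:
VACUUM (VERBOSE, ANALYZE) userlogs;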
It is also weird that it is not using parallelization. It certainly should be. What are your non-default configuration settings?
Finally, I wouldn't expect even the poor plan you show to take a few hours. Maybe your hardware is not performing up to what it should. (Also, RAID 10 requires at least 4 disks, but your description makes it sound like that is not what you have)
Since you have the foreign key table, you could use that in your query, testing for each of its rows that at least one matching row exists in the log table.
select logdateid from logdate where exists
(select 1 from userlogs where userlogs.logdateid=logdate.logdateid);

PostgreSQL Database size is not equal to sum of size of all tables

I am using an AWS RDS PostgreSQL instance. I am using the below query to get the size of all databases.
SELECT datname, pg_size_pretty(pg_database_size(datname))
from pg_database
order by pg_database_size(datname) desc
One database's size is 23 GB, but when I ran the below query to get the sum of the sizes of all individual tables in this particular database, it came to around 8 GB.
select pg_size_pretty(sum(pg_total_relation_size(table_schema || '.' || table_name)))
from information_schema.tables
As it is an AWS RDS instance, I don't have rights on pg_toast schema.
How can I find out which database objects are consuming the space?
Thanks in advance.
The documentation says:
pg_total_relation_size ( regclass ) → bigint
Computes the total disk space used by the specified table, including all indexes and TOAST data. The result is equivalent to pg_table_size + pg_indexes_size.
So TOAST tables are covered, and so are indexes.
One simple explanation could be that you are connected to a different database than the one that is shown to be 23GB in size.
Another likely explanation would be materialized views, which consume space, but do not show up in information_schema.tables.
Yet another explanation could be that there have been crashes that left some garbage files behind, for example after an out-of-space condition during the rewrite of a table or index.
This is of course harder to debug on a hosted platform, where you don't have shell access...
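To check the materialized-view explanation above, something like the following (a sketch) lists each materialized view with its total size:
SELECT schemaname, matviewname,
       pg_size_pretty(pg_total_relation_size(format('%I.%I', schemaname, matviewname)::regclass)) AS total_size
FROM   pg_matviews
ORDER  BY pg_total_relation_size(format('%I.%I', schemaname, matviewname)::regclass) DESC;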

Postgres multi-column index is taking forever to complete

I have a table with around 270,000,000 rows and this is how I created it.
CREATE TABLE init_package_details AS
SELECT pcont.package_content_id as package_content_id,
pcont.activity_id as activity_id,
pc.org_id as org_id,
pc.bed_type as bed_type,
pc.is_override as is_override,
pmmap.package_id as package_id,
pcont.activity_qty as activity_qty,
pcont.charge_head as charge_head,
pcont.activity_charge as charge,
COALESCE(pc.charge,0) - COALESCE(pc.discount,0) as package_charge
FROM a pc
JOIN b od ON
(od.org_id = pc.org_id AND od.status='A')
JOIN c pm ON
(pc.package_id=pm.package_id)
JOIN d pmmap ON
(pmmap.pack_master_id=pm.package_id)
JOIN e pcont ON
(pcont.package_id=pmmap.package_id);
I need to build index on the init_package_details table.
This table gets created in around 5-6 minutes.
I have created a btree index like this:
CREATE INDEX init_package_details_package_content_id_idx
ON init_package_details(package_content_id);
which takes 10 minutes (more than the time to create and populate the table itself).
And, when I create another index like,
CREATE INDEX init_package_details_package_act_org_bt_id_idx
ON init_package_details(activity_id,org_id,bed_type);
It just freezes and takes forever to complete. I waited for around 30 minutes before manually cancelling it.
Below are stats from iotop -o, if it helps:
When I created the table: averaging around 110-120 MB/s (this is how 270 million rows got inserted in 5-6 minutes)
When I created the first index: averaging around 70 MB/s
On the second index: crawling along at 5-7 MB/s
Could someone explain why this is happening? Is there any way I can speed up the index creation here?
EDIT 1: There are no other connections accessing the table, and pg_stat_activity shows 'active' as the status throughout the running time. This happens inside a transaction (between BEGIN and COMMIT; the same .sql file contains many other scripts).
EDIT 2:
postgres=# show work_mem ;
work_mem
----------
5MB
(1 row)
postgres=# show maintenance_work_mem;
maintenance_work_mem
----------------------
16MB
Building indexes takes a long time, that's normal.
If you are not bottlenecked on I/O, you are probably on CPU.
There are a few things to improve the performance:
Set maintenance_work_mem very high.
Use PostgreSQL v11 or better, where several parallel workers can be used.
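A minimal sketch of both suggestions applied to the index from the question (the values are assumptions; size them to your available RAM and cores):
-- Raise the sort memory for this session only before building the index:
SET maintenance_work_mem = '2GB';

-- On PostgreSQL 11 or later, allow parallel workers for the index build:
SET max_parallel_maintenance_workers = 4;

CREATE INDEX init_package_details_package_act_org_bt_id_idx
    ON init_package_details (activity_id, org_id, bed_type);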

Postgres query performance in write-heavy database

I have a Postgres 9.6 database hosted on Heroku (standard-4 with 15GB cache).
One of my tables has almost 100M rows and that table gets maybe 3k inserts/updates per minute on average, but that can spike to 10x.
I have a few queries that run against that table that are basically counting rows with different columns having various values.
I also have a copy of the database that is currently getting no updates (basically idle).
Queries against the live database are much slower than queries on the copy and I don't know if this is due to the load/contention from the updates or poor query optimization on my part.
The values that the queries are counting are very common values.
select count(*)
from my_table
where a = '123'
and b = 'abc'
There are a lot of rows with the combination of 123 and abc. I have created a partial index for that:
create index my_table_abc_123_index
on my_table (a, b)
where a = '123' and b = 'abc'
This seems to get me an index only scan that's fast on the copy (~0.5s) but much slower on the live DB (~26s). The costs also vary quite a lot (copy first, live second):
Aggregate (cost=29751.39..29751.39 rows=1 width=8)
Aggregate (cost=179603.59..179603.59 rows=1 width=8)
Those both return a count of about 1.4M and the tables are about the same size.
It also seems that when the writes go up (10x) the degradation is significant.
It's also interesting that Postgres doesn't always choose my partial indexes, opting for a full composite index on (a, b), depending on the values passed for those columns.
I've read the Postgres indexing documentation and realize that for common values it's not always best to use the index (though Postgres is choosing an index in all these cases).
I'm wondering if this is the sort of difference I should be expecting between a live and idle database, and if I'm approaching creating indexes correctly for my workload (which is definitely write-heavy) or in general.
Some other details:
individual (full) column indexes seem much more expensive, but I've not tried individual column partial indexes.
the autovacuum threshold has been set pretty low (0.01) to deal with the dead rows due to lots of updates

Slow Postgres 9.3 Queries, again

This is a follow-up to the question at Slow Postgres 9.3 queries.
The new indexes definitely help. But what we're seeing is sometimes queries are much slower in practice than when we run EXPLAIN ANALYZE. An example is the following, run on the production database:
explain analyze SELECT * FROM messages WHERE groupid=957 ORDER BY id DESC LIMIT 20 OFFSET 31980;
QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=127361.90..127441.55 rows=20 width=747) (actual time=152.036..152.143 rows=20 loops=1)
-> Index Scan Backward using idx_groupid_id on messages (cost=0.43..158780.12 rows=39869 width=747) (actual time=0.080..150.484 rows=32000 loops=1)
Index Cond: (groupid = 957)
Total runtime: 152.186 ms
(4 rows)
With slow query logging turned on, we see instances of this query taking over 2 seconds. We also have log_lock_waits=true, and no slow locks are reported around the same time. What could explain the vast difference in execution times?
LIMIT x OFFSET y generally performs not much faster than LIMIT x + y. A large OFFSET is always comparatively expensive. The suggested index in the linked question helps, but since you cannot get index-only scans out of it, Postgres still has to check visibility in the heap (the main relation) for at least x + y rows to determine the correct result.
SELECT *
FROM messages
WHERE groupid = 957
ORDER BY id DESC
LIMIT 20
OFFSET 31980;
CLUSTER on your index (groupid, id) would help to increase the locality of data in the heap and reduce the number of data pages read per query. Definitely a win. But if all groupid values are equally likely to be queried, that's not going to remove the bottleneck of too little RAM for the cache. If you have concurrent access, consider pg_repack instead of CLUSTER:
Optimize Postgres timestamp query range
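For reference, the plain CLUSTER variant is a one-liner (it takes an exclusive lock on the table while it runs; this assumes idx_groupid_id is the (groupid, id) index from the EXPLAIN output above):
CLUSTER messages USING idx_groupid_id;
ANALYZE messages;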
Do you actually need all columns returned? (SELECT *) A covering index enabling index-only scans might help if you only need a few small columns returned. (autovacuum must be strong enough to cope with writes to the table, though. Read-only table would be ideal.)
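A hypothetical covering index for that case; body_preview stands in for whichever small columns you actually need, and on Postgres 9.3 there is no INCLUDE clause, so the extra column becomes part of the key:
CREATE INDEX messages_groupid_id_cover_idx
    ON messages (groupid, id DESC, body_preview);

-- The query must then select only indexed columns to stay index-only:
SELECT id, body_preview
FROM   messages
WHERE  groupid = 957
ORDER  BY id DESC
LIMIT  20 OFFSET 31980;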
Also, according to your linked question, your table is 32 GB on disk. (Typically a bit more in RAM). The index on (groupid,id) adds another 308 MB at least (without any bloat):
SELECT pg_size_pretty(7337880.0 * 44); -- row count * tuple size
Making sense of Postgres row sizes
You have 8 GB RAM, of which you expect around 4.5 GB to be used for cache (effective_cache_size = 4608MB). That's enough to cache the index for repeated use, but not nearly enough to also cache the whole table.
If your query happens to find data pages in cache, it's fast. Else, not so much. Big difference, even with SSD storage (much more with HDD).
Not directly related to this query, but 8 MB of work_mem (work_mem = 7864kB) seems way too small for your setup. Depending on various other factors, I would set this to at least 64MB (unless you have many concurrent queries with sort / hash operations). As @Craig commented, EXPLAIN (BUFFERS, ANALYZE) might tell us more.
The best query plan also depends on value frequencies. If only few rows pass the filter, the result might be empty for certain groupid and the query is comparatively fast. If a large portion of the table has to be fetched, a plain sequential scan wins. You need valid table statistics (autovacuum again). And possibly a larger statistics target for groupid:
Keep PostgreSQL from sometimes choosing a bad query plan
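A sketch of raising the per-column statistics target mentioned above (the value 1000 is an assumption) followed by refreshing the statistics:
ALTER TABLE messages ALTER COLUMN groupid SET STATISTICS 1000;
ANALYZE messages;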
Since OFFSET is slow, an alternative is to simulate OFFSET using another column and some index preparation. We require a UNIQUE column (like a PRIMARY KEY) on the table. If there is none, one can be added with:
CREATE SEQUENCE messages_pkey_seq ;
ALTER TABLE messages
ADD COLUMN message_id integer DEFAULT nextval('messages_pkey_seq');
Next we create the position column for the OFFSET simulation:
ALTER TABLE messages ADD COLUMN position INTEGER;

UPDATE messages
SET    position = q.position
FROM  (SELECT message_id,
              row_number() OVER (PARTITION BY groupid ORDER BY id DESC) AS position
       FROM   messages) AS q
WHERE  q.message_id = messages.message_id;

CREATE INDEX ON messages (groupid, position);
Now we are ready for the new version of the query in the OP:
SELECT * FROM messages
WHERE  groupid = 957
AND    position BETWEEN 31980 AND (31980 + 20 - 1);