Group by count query takes a long time - postgresql

I have a table like this:
Column | Type | Modifiers
-------------+-----------------------------+-------------------------------------------------------
id | integer | not null default nextval('oks_id_seq'::regclass)
uname | text | not null
ess | text |
quest | text |
details | text |
status | character(1) | not null default 'q'::bpchar
last_parsed | timestamp without time zone |
qstatus | character(1) | not null default 'q'::bpchar
media_wc | integer | not null default 0
Indexes:
"oks_pkey" PRIMARY KEY, btree (id)
"oks_uname_key" UNIQUE CONSTRAINT, btree (uname)
"last_parsed_idx" btree (last_parsed)
"qstatus_idx" btree (qstatus)
"status_idx" btree (status)
And I have a query like this:
SELECT COUNT(status), status FROM oks GROUP BY status ORDER BY status;
Which results in:
count | status
---------+--------
1478472 | d
23599 | p
10178 | q
6278206 | s
(4 rows)
Which is great, but the query takes forever, and for some reason Postgres reads everything from disk; disk activity is really high during the query. Here is the EXPLAIN ANALYZE output:
Sort (cost=1117385.91..1117385.92 rows=4 width=2) (actual time=54122.991..54122.993 rows=4 loops=1)
Sort Key: status
Sort Method: quicksort Memory: 25kB
-> HashAggregate (cost=1117385.82..1117385.86 rows=4 width=2) (actual time=54122.280..54122.283 rows=4 loops=1)
-> Seq Scan on oks (cost=0.00..1078433.55 rows=7790455 width=2) (actual time=0.009..47978.616 rows=7790455 loops=1)
Total runtime: 54123.487 ms
(6 rows)
In my config I have the memory usage set at
work_mem = 128MB
Any ideas about how I can optimize queries like this that GROUP BY over the whole table? This seems unrealistically slow; it would have been much faster with flat-file storage.
Edit:
I was able to get the query to run in a fraction of a second by modifying the postgres config files. Specifically, setting
fsync = off
synchronous_commit = off
full_page_writes = off
commit_delay = 2000
effective_cache_size = 4GB
work_mem = 512MB
maintenance_work_mem = 512MB
I'm not sure these are optimal, but these options work in my case.
I think fsync = off helped the most.
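For reference, here is a minimal sketch of how a change like this can be applied and verified from psql, assuming superuser access (settings such as fsync and work_mem are reloadable, so a full restart is not needed for them):

-- after editing postgresql.conf
SELECT pg_reload_conf();   -- re-read the config file
SHOW fsync;                -- confirm the active value
SHOW work_mem;

Alternatively, work_mem can be raised for a single session with SET work_mem = '512MB'; before running the query.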

Try cstore_fdw. It provides column-store "tables" (as foreign tables), which suit aggregate queries like this.
Info:
https://github.com/citusdata/cstore_fdw
How to use cstore:
https://stackoverflow.com/questions/29970937/psql-using-cstore-table-for-aggregation-big-data
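Roughly, setting it up looks like the sketch below. This follows the project's README; the foreign table name and the two-column layout are just an illustration, and cstore_fdw also has to be added to shared_preload_libraries:

-- illustrative only: a column-store copy of the relevant columns of oks
CREATE EXTENSION cstore_fdw;
CREATE SERVER cstore_server FOREIGN DATA WRAPPER cstore_fdw;
CREATE FOREIGN TABLE oks_cstore (
    id     integer,
    status character(1)
) SERVER cstore_server
  OPTIONS (compression 'pglz');
-- load data (COPY or INSERT ... SELECT, per the README), then run the GROUP BY against oks_cstore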

How can I optimize a postgresql query where one dependent column is a timestamp

I have a table with a foreign key and a timestamp for when the row was most recently updated. Rows with the same foreign key value are updated at roughly the same time, plus or minus an hour. I have an index on (foreign_key, timestamp). This is on PostgreSQL 11.
When I make a query like:
select * from table where foreign_key = $1 and timestamp > $2 order by primary_key;
It will use my index in cases where the timestamp predicate is selective across the entire table. But if the timestamp is far enough in the past that the majority of rows match, it will scan the primary_key index instead, assuming that will be faster. The problem goes away if I remove the ORDER BY.
I've looked at PostgreSQL's CREATE STATISTICS, but it doesn't seem to help in cases where the correlation is over a range of values, like a timestamp plus or minus five minutes, rather than a specific value.
What are the best ways to work around this? I can remove the order by, but that complicates the business logic. I can partition the table on the foreign key id, but that is also a pretty expensive change.
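For illustration, the extended-statistics attempt mentioned above would look roughly like this (the statistics name is made up; functional dependencies are the statistics kind closest to this situation):

-- hypothetical: multivariate statistics on the two correlated columns
CREATE STATISTICS phm_mls_updated_stats (dependencies)
    ON mls_id, updated_at
    FROM property_home_attributes;
ANALYZE property_home_attributes;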
Specifics:
Table "public.property_home_attributes"
Column | Type | Collation | Nullable | Default
----------------------+-----------------------------+-----------+----------+------------------------------------------------------
id | integer | | not null | nextval('property_home_attributes_id_seq'::regclass)
mls_id | integer | | not null |
property_id | integer | | not null |
formatted_attributes | jsonb | | not null |
created_at | timestamp without time zone | | |
updated_at | timestamp without time zone | | |
Indexes:
"property_home_attributes_pkey" PRIMARY KEY, btree (id)
"index_property_home_attributes_on_property_id" UNIQUE, btree (property_id)
"index_property_home_attributes_on_updated_at" btree (updated_at)
"property_home_attributes_mls_id_updated_at_idx" btree (mls_id, updated_at)
The table has about 16 million rows.
psql=# EXPLAIN ANALYZE SELECT * FROM property_home_attributes WHERE mls_id = 46 AND (property_home_attributes.updated_at < '2019-10-30 16:52:06.326774') ORDER BY id ASC LIMIT 1000;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..10147.83 rows=1000 width=880) (actual time=1519.718..22310.674 rows=1000 loops=1)
-> Index Scan using property_home_attributes_pkey on property_home_attributes (cost=0.56..6094202.57 rows=600576 width=880) (actual time=1519.716..22310.398 rows=1000 loops=1)
Filter: ((updated_at < '2019-10-30 16:52:06.326774'::timestamp without time zone) AND (mls_id = 46))
Rows Removed by Filter: 358834
Planning Time: 0.110 ms
Execution Time: 22310.842 ms
(6 rows)
and then without the order by:
psql=# EXPLAIN ANALYZE SELECT * FROM property_home_attributes WHERE mls_id = 46 AND (property_home_attributes.updated_at < '2019-10-30 16:52:06.326774') LIMIT 1000;
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=0.56..1049.38 rows=1000 width=880) (actual time=0.053..162.081 rows=1000 loops=1)
-> Index Scan using foo on property_home_attributes (cost=0.56..629893.60 rows=600576 width=880) (actual time=0.053..161.992 rows=1000 loops=1)
Index Cond: ((mls_id = 46) AND (updated_at < '2019-10-30 16:52:06.326774'::timestamp without time zone))
Planning Time: 0.100 ms
Execution Time: 162.140 ms
(5 rows)
If you want to keep PostgreSQL from using an index scan on property_home_attributes_pkey to support the ORDER BY, you can simply use
ORDER BY primary_key + 0
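Applied to the query above (the primary key column here is id), that would look like the following. Because id + 0 is an expression rather than the bare indexed column, the planner can no longer use property_home_attributes_pkey to produce the ordering, so it will typically fall back to the (mls_id, updated_at) index plus a top-N sort of the matching rows:

SELECT *
FROM property_home_attributes
WHERE mls_id = 46
  AND updated_at < '2019-10-30 16:52:06.326774'
ORDER BY id + 0 ASC
LIMIT 1000;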

Speed up fulltext search - pgsql

I have seen millions of threads about speeding up PostgreSQL queries that use full-text search. I have tried everything, but I am out of ideas.
I have a pretty big table (20,612,971 records at the moment) and search it with PostgreSQL's full-text search, then order the results by ts_rank_cd. Query execution takes around 3500-4000 ms. Any ideas to make it faster? If possible I don't want to use external software like Sphinx or Solr, so native PostgreSQL solutions are preferred :) Below is a description of my table and an example EXPLAIN ANALYZE of the SELECT.
# \d artifacts.item
Table "artifacts.item"
Column | Type | Modifiers
-------------------------+-----------------------------+-------------------------------------------------------------
add_timestamp | timestamp without time zone |
author_account_id | integer |
description | text |
id | integer | not null default nextval('artifacts.item_id_seq'::regclass)
removed_since_timestamp | timestamp without time zone |
slug | character varying(2044) | not null
thumb_height | integer |
thumb_path | character varying(2044) | default NULL::character varying
thumb_width | integer |
title | character varying(2044) | not null
search_data | tsvector |
tags | integer[] |
is_age_restricted | boolean | not null default false
is_on_homepage | boolean | not null default false
is_public | boolean | not null default false
thumb_filename | character varying(2044) |
is_removed | boolean | not null default false
Indexes:
"artifacts_item_add_timestamp_idx" btree (add_timestamp DESC NULLS LAST)
"artifacts_item_id_idx" btree (id)
"artifacts_item_is_on_homepage_add_timestamp" btree (is_on_homepage DESC, add_timestamp DESC NULLS LAST)
"artifacts_item_is_on_homepage_idx" btree (is_on_homepage)
"artifacts_item_search_results" gin (search_data) WHERE is_public IS TRUE AND is_removed IS FALSE
"artifacts_item_tags_gin_idx" gin (tags)
"artifacts_item_thumbs_list" btree (is_public, is_removed, id DESC)
"index1" btree (add_timestamp)
"itemIdx" btree (is_public, is_removed, is_age_restricted)
"item_author_account_id_idx" btree (author_account_id)
analyze:
# explain analyze SELECT i.id,
# i.title,
# i.description,
# i.slug,
# i.thumb_path,
# i.thumb_filename,
# CONCAT(
# i.thumb_path,
# '/',
# i.thumb_filename
# ) AS thumb_url,
# (CASE WHEN i.thumb_width = 0 THEN 280 ELSE i.thumb_width END) as thumb_width,
# (CASE WHEN i.thumb_height = 0 THEN 280 ELSE i.thumb_height END) as thumb_height,
# (i.thumb_height > i.thumb_width) AS is_vertical,
# i.add_timestamp
# FROM artifacts.item AS i
# WHERE i.is_public IS true
# AND i.is_removed IS false
# AND (i.search_data @@ to_tsquery('public.polish', $$'lego'$$))
# ORDER BY ts_rank_cd(i.search_data, to_tsquery('public.polish', $$'lego'$$)) desc,
# ts_rank_cd(i.search_data, to_tsquery('public.polish', $$'lego'$$)) desc,
# i.add_timestamp DESC NULLS LAST
# LIMIT 60
# OFFSET 0;
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=358061.78..358061.93 rows=60 width=315) (actual time=335.870..335.876 rows=60 loops=1)
-> Sort (cost=358061.78..358357.25 rows=118189 width=315) (actual time=335.868..335.868 rows=60 loops=1)
Sort Key: (ts_rank_cd(search_data, '''lego'' | ''lega'''::tsquery)), add_timestamp
Sort Method: top-N heapsort Memory: 55kB
-> Bitmap Heap Scan on item i (cost=2535.96..353980.19 rows=118189 width=315) (actual time=33.163..308.371 rows=62025 loops=1)
Recheck Cond: ((search_data @@ '''lego'' | ''lega'''::tsquery) AND (is_public IS TRUE) AND (is_removed IS FALSE))
-> Bitmap Index Scan on artifacts_item_search_results (cost=0.00..2506.42 rows=118189 width=0) (actual time=23.066..23.066 rows=62085 loops=1)
Index Cond: (search_data @@ '''lego'' | ''lega'''::tsquery)
Total runtime: 335.967 ms
(9 rows)
Time: 3444.731 ms
There are 62025 rows that match the condition, and they have to be ordered…
That will take a while. Is there a chance that you can have the whole database or at least the index in RAM? That would help.
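One concrete way to pull the relation and its index into cache is the pg_prewarm contrib module, assuming a PostgreSQL version that ships it (9.4 or later) and enough shared_buffers/OS cache to hold them; this is a sketch, not a guaranteed fix:

CREATE EXTENSION IF NOT EXISTS pg_prewarm;
SELECT pg_prewarm('artifacts.item');                          -- table heap
SELECT pg_prewarm('artifacts.artifacts_item_search_results'); -- the partial GIN index used by the plan

pg_prewarm returns the number of blocks it read; the effect only lasts until the pages are evicted again.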

How can I make this query run faster in postgres

I have this query which takes 86 sec to execute.
select cust_id customer_id,
cust_first_name customer_first_name,
cust_last_name customer_last_name,
cust_prf customer_prf,
cust_birth_country customer_birth_country,
cust_login customer_login,
cust_email_address customer_email_address,
date_year ddyear,
sum(((stock_ls_price-stock_ws_price-stock_ds_price)+stock_es_price)/2) total_yr,
's' stock_type
from customer, stock, date
where customer_k = stock_customer_k
and stock_soldate_k = date_k
group by cust_id, cust_first_name, cust_last_name, cust_prf, cust_birth_country, cust_login, cust_email_address, date_year;
EXPLAIN ANALYZE RESULT:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=639753.55..764040.06 rows=2616558 width=213) (actual time=81192.575..86536.398 rows=190581 loops=1)
Group Key: customer.cust_id, customer.cust_first_name, customer.cust_last_name, customer.cust_prf, customer.cust_birth_country, customer.cust_login, customer.cust_email_address, date.date_year
-> Sort (cost=639753.55..646294.95 rows=2616558 width=213) (actual time=81192.468..83977.960 rows=2685453 loops=1)
Sort Key: customer.cust_id, customer.cust_first_name, customer.cust_last_name, customer.cust_prf, customer.cust_birth_country, customer.cust_login, customer.cust_email_address, date.date_year
Sort Method: external merge Disk: 460920kB
-> Hash Join (cost=6527.66..203691.58 rows=2616558 width=213) (actual time=60.500..2306.082 rows=2685453 loops=1)
Hash Cond: (stock.stock_customer_k = customer.customer_k)
-> Merge Join (cost=1423.66..144975.59 rows=2744641 width=30) (actual time=8.820..1412.109 rows=2750311 loops=1)
Merge Cond: (date.date_k = stock.stock_soldate_k)
-> Index Scan using date_key_idx on date (cost=0.29..2723.33 rows=73049 width=8) (actual time=0.013..7.164 rows=37622 loops=1)
-> Index Scan using stock_soldate_k_index on stock (cost=0.43..108829.12 rows=2880404 width=30) (actual time=0.004..735.043 rows=2750312 loops=1)
-> Hash (cost=3854.00..3854.00 rows=100000 width=191) (actual time=51.650..51.650 rows=100000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 16139kB
-> Seq Scan on customer (cost=0.00..3854.00 rows=100000 width=191) (actual time=0.004..30.341 rows=100000 loops=1)
Planning time: 1.761 ms
Execution time: 86621.807 ms
I have work_mem=512MB. I have indexes created on
cust_id, customer_k, stock_customer_k, stock_soldate_k and date_k.
There are about 100,000 rows in customer, 3,000,000 rows in stock and 80,000 rows in date.
How can I make this query run faster?
I would appreciate any help!
TABLE DEFINITIONS
date
Column | Type | Modifiers
---------------------+---------------+-----------
date_k | integer | not null
date_id | character(16) | not null
date_date | date |
date_year | integer |
stock
Column | Type | Modifiers
-----------------------+--------------+-----------
stock_soldate_k | integer |
stock_soltime_k | integer |
stock_customer_k | integer |
stock_ds_price | numeric(7,2) |
stock_es_price | numeric(7,2) |
stock_ls_price | numeric(7,2) |
stock_ws_price | numeric(7,2) |
customer:
Column | Type | Modifiers
---------------------------+-----------------------+-----------
customer_k | integer | not null
customer_id | character(16) | not null
cust_first_name | character(20) |
cust_last_name | character(30) |
cust_prf | character(1) |
cust_birth_country | character varying(20) |
cust_login | character(13) |
cust_email_address | character(50) |
TABLE "stock" CONSTRAINT "st1" FOREIGN KEY (stock_soldate_k) REFERENCES date(date_k)
"st2" FOREIGN KEY (stock_customer_k) REFERENCES customer(customer_k)
Try this:
with stock_grouped as
(select stock_customer_k, date_year, sum(((stock_ls_price-stock_ws_price-stock_ds_price)+stock_es_price)/2) total_yr
from stock, date
where stock_soldate_k = date_k
group by stock_customer_k, date_year)
select cust_id customer_id,
cust_first_name customer_first_name,
cust_last_name customer_last_name,
cust_prf customer_prf,
cust_birth_country customer_birth_country,
cust_login customer_login,
cust_email_address customer_email_address,
date_year ddyear,
total_yr,
's' stock_type
from customer, stock_grouped
where customer_k = stock_customer_k
This query performs the grouping before the join.
A big performance penalty here is that about 450MB of intermediate data is written to disk: Sort Method: external merge Disk: 460920kB. This happens because the planner first has to satisfy the join conditions between the three tables, including the possibly inefficient join to customer, before the aggregation sum() can take place, even though the aggregation can be performed perfectly well on table stock alone.
Query
Because your tables are fairly large, you are better off reducing the number of eligible rows as soon as possible and preferably before any joining. In this case that means doing the aggregation on table stock in a sub-query and join that result to the other two tables:
SELECT c.cust_id AS customer_id,
c.cust_first_name AS customer_first_name,
c.cust_last_name AS customer_last_name,
c.cust_prf AS customer_prf,
c.cust_birth_country AS customer_birth_country,
c.cust_login AS customer_login,
c.cust_email_address AS customer_email_address,
d.date_year AS ddyear,
ss.total_yr,
's' stock_type
FROM (
SELECT
stock_customer_k AS ck,
stock_soldate_k AS sdk,
sum((stock_ls_price-stock_ws_price-stock_ds_price+stock_es_price)*0.5) AS total_yr
FROM stock
GROUP BY 1, 2) ss
JOIN customer c ON c.customer_k = ss.ck
JOIN date d ON d.date_k = ss.sdk;
The sub-query on stock will result in far fewer rows, depending on the average number of orders per customer per date. Also, in the sum() function, multiplying by 0.5 is far cheaper than dividing by 2 (although in the grand scheme of things it will be relatively marginal).
Data model
You should also look seriously at your data model.
In table customer you use data types like char(30), which will always take up 30 bytes in your row, even when you store just 'X'. Using a varchar(30) data type is much more efficient when many strings are shorter than the declared maximum width, because it takes up less space and thus requires fewer page reads (and writes on the intermediate data).
Table stock uses numeric(7,2) for prices. Use of the numeric data type may give accurate results when subjecting data to many, many repeated operations, but they are also very slow. The double precision data type will be much faster and equally accurate in your scenario. For presentation purposes you can round the value off to the desired precision.
As a suggestion, create a table stock_f with double precision data types instead of numeric, copy all data over from stock to stock_f and run the query on that table.
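If it helps, such a copy could be created with something along these lines (the name stock_f comes from the suggestion above; the column list follows the definition shown earlier, and you would re-create any indexes you need on the new table):

CREATE TABLE stock_f AS
SELECT stock_soldate_k,
       stock_soltime_k,
       stock_customer_k,
       stock_ds_price::double precision AS stock_ds_price,
       stock_es_price::double precision AS stock_es_price,
       stock_ls_price::double precision AS stock_ls_price,
       stock_ws_price::double precision AS stock_ws_price
FROM stock;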

How can this simple query take so long?

=> SELECT * FROM "tags" WHERE ("kind" = 'View') ORDER BY "name";
Time: 278.318 ms
The tags table contains 358 rows. All of them are views at the moment.
Column | Type | Modifiers
-------------+--------------------------+-------------------------------------
id | uuid | not null default uuid_generate_v4()
name | text | not null
slug | text | not null
kind | text | not null
external_id | text |
created_at | timestamp with time zone | not null default now()
updated_at | timestamp with time zone |
filter | json |
Indexes:
"tags_pkey" PRIMARY KEY, btree (id)
"tags_kind_index" btree (kind)
"tags_name_index" btree (name)
EXPLAIN ANALYZE says:
QUERY PLAN
----------------------------------------------------------------------------------------------------------
Sort (cost=9.29..9.47 rows=358 width=124) (actual time=0.654..0.696 rows=358 loops=1)
Sort Key: name
Sort Method: quicksort Memory: 75kB
-> Seq Scan on tags (cost=0.00..6.25 rows=358 width=124) (actual time=0.006..0.108 rows=358 loops=1)
Filter: (kind = 'View'::text)
Total runtime: 0.756 ms
(6 rows)
Did you run ANALYZE tags? It updates the table's statistics.
First, if all the kind values are 'View', then an index on that column is useless. The index will only be used if the column's cardinality is high; otherwise it is cheaper to do a sequential scan of the table.
Second, with only 358 rows it is cheaper to do a sequential scan anyway.
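For completeness, refreshing the statistics and re-checking where the time goes might look like this (EXPLAIN with the BUFFERS option shows whether pages come from cache or disk; this is a suggestion, not something spelled out in the answers above):

ANALYZE tags;   -- refresh planner statistics for the table
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM tags WHERE kind = 'View' ORDER BY name;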

SQL optimization problem

In our production database there is a slow SQL statement which runs very frequently, shown below. The CSV log shows that it usually takes more than 1 second, but strangely, when I execute the SQL directly on the database server it usually takes only about 0.3 s. From the plan we can see that it uses the right index. Can anybody explain this and offer some optimization advice? Thanks!
--sql
SELECT id, content, the_geo, lon, lat, skyid, addtime
FROM skytf.tbl_map
where skytf.ST_Distance_sphere(the_geo, skytf.geometryFromText('POINT(0 -4.0E-5)')) <= 1000
and the_geo && skytf.ST_BUFFER(skytf.geometryfromtext('POINT(0 -4.0E-5)'),0.005)
order by addtime desc limit 30
--csv log
2011-08-30 03:02:06.029 CST,"lbs","skytf",28656,"192.168.168.46:53430",4e5b9d45.6ff0,356,"SELECT",2011-08-29 22:08:05 CST,106/3030952,0,LOG,00000,"duration: 1782.945 ms execute <unnamed>: SELECT id,content,the_geo,lon,lat,skyid,addtime FROM skytf.tbl_map where skytf.ST_Distance_sphere(the_geo,skytf.geometryFromText($1))<=1000 and the_geo && skytf.ST_BUFFER(skytf.geometryfromtext($2),0.005) order by addtime desc limit 30 ","parameters: $1 = 'POINT(0 -4.0E-5)', $2 = 'POINT(0 -4.0E-5)'",,,,,,,,""
There are plenty of these duration entries in the log; I have pasted just one line: "duration: 1782.945 ms"
--explain analyze
skytf=> explain analyze SELECT id,content,the_geo,lon,lat,skyid,addtime FROM skytf.tbl_map
skytf-> where skytf.ST_Distance_sphere(the_geo,skytf.geometryFromText('POINT(0 -4.0E-5)'))<=1000
skytf-> and the_geo && skytf.ST_BUFFER(skytf.geometryfromtext('POINT(0 -4.0E-5)'),0.005)
skytf-> order by addtime desc limit 30 ;
QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------------------------
Limit (cost=11118.56..11118.64 rows=30 width=128) (actual time=338.031..338.037 rows=30 loops=1)
-> Sort (cost=11118.56..11124.60 rows=2416 width=128) (actual time=338.030..338.032 rows=30 loops=1)
Sort Key: addtime
Sort Method: top-N heapsort Memory: 29kB
-> Bitmap Heap Scan on tbl_map (cost=201.51..11047.21 rows=2416 width=128) (actual time=53.455..309.962 rows=78121 loops=1)
Recheck Cond: ((the_geo)::box && '(0.005,0.00496),(-0.005,-0.00504)'::box)
Filter: (skytf.st_distance_sphere(the_geo, '01010000000000000000000000F168E388B5F804BF'::skytf.geometry) <=1000::double precision)
-> Bitmap Index Scan on tbl_map_idx_gin (cost=0.00..200.91 rows=7249 width=0) (actual time=49.392..49.392 rows=78559 loops=1)
Index Cond: ((the_geo)::box && '(0.005,0.00496),(-0.005,-0.00504)'::box)
Total runtime: 338.125 ms
(10 rows)
--table information
skytf=> \dt+ skytf.tbl_map
List of relations
Schema | Name | Type | Owner | Size | Description
------------+---------------+-------+-------+--------+-------------
skytf | tbl_map | table | lbs | 158 MB |
(1 row)
skytf=> \d skytf.tbl_map
Table "skytf.tbl_map"
Column | Type | Modifiers
--------------+--------------------------------+-----------------------------------------------------------------------
id | integer | not null default nextval('skytf.tbl_map_id_seq'::regclass)
content | character varying(100) |
lon | double precision |
lat | double precision |
skyid | integer |
addtime | timestamp(1) without time zone | default now()
the_geo | skytf.geometry |
viewcount | integer | default 0
lastreadtime | timestamp without time zone |
ischeck | boolean |
Indexes:
"tbl_map_pkey" PRIMARY KEY, btree (id)
"idx_map_book_skyid" btree (skyid, addtime)
"tbl_map_idx_gin" gist ((the_geo::box))
It is possible to have slower queries while a checkpoint is being executed.
Try logging checkpoints, and if this turns out to be your case, tune checkpoints for latency. In general this may reduce the bytes written per second but will improve the slow queries.
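A rough sketch of what that might look like in postgresql.conf (the values are illustrative, not a definitive recommendation):

# log every checkpoint so spikes can be correlated with the slow statements in the CSV log
log_checkpoints = on
# spread checkpoint I/O over more of the checkpoint interval to smooth write bursts
checkpoint_completion_target = 0.9
# checkpoint less often (default is 5min); larger values trade longer crash recovery for fewer bursts
checkpoint_timeout = 15min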