Attempt to avoid duplicated code with a CTE takes far too long - PostgreSQL

I need to create a view with some calculated and aggregated values. So I need certain values multiple times, like total_dist_pts in the example below.
There is a loc_a_run table with about 350 rows so far (constantly growing) and a loc_a_pline table with somewhat more than 4 million rows (also growing):
prod=# \d loc_a_run
Table "public.loc_a_run"
Column | Type | Collation | Nullable | Default
----------------+--------------------------+-----------+----------+-----------------------------
loc_a_run_id | integer | | not null | generated always as identity
execution_time | timestamp with time zone | | not null | CURRENT_TIMESTAMP(0)
run_type_id | smallint | | not null |
...
has_errors | boolean | | |
Indexes:
"loc_a_run_pkey" PRIMARY KEY, btree (loc_a_run_id)
prod=# \d loc_a_pline
Table "public.loc_a_pline"
Column | Type | Collation | Nullable | Default
----------------+-----------+-----------+----------+-----------------------------
loc_a_pline_id | bigint | | not null | generated always as identity
loc_a_run_id | integer | | not null |
is_top | boolean | | not null |
...
dist_left | numeric | | not null |
dist_right | numeric | | not null |
...
Indexes:
"loc_a_pline_pkey" PRIMARY KEY, btree (loc_a_pline_id)
Foreign-key constraints:
"loc_a_pline_loc_a_run_id_fkey" FOREIGN KEY (loc_a_run_id) REFERENCES loc_a_run(loc_a_run_id) ON UPDATE CASCADE ON DELETE CASCADE
The solution I use right now:
SELECT run.loc_a_run_id AS run_id
, run_type.run_type
, SUM(
CASE
WHEN pline.is_top IS true
THEN ROUND(pline.dist_right - pline.dist_left, 2)
ELSE ROUND(pline.dist_left - pline.dist_right, 2)
END)
AS total_dist_pts
, COUNT(pline.loc_a_pline_id) AS total_plines
, SUM(
CASE
WHEN pline.is_top IS true
THEN ROUND(pline.dist_right - pline.dist_left, 2)
ELSE ROUND(pline.dist_left - pline.dist_right, 2)
END)
/ COUNT(pline.loc_a_pline_id)
AS dist_pts_per_pline
FROM loc_a_run AS run
JOIN loc_a_pline AS pline USING (loc_a_run_id)
JOIN run_type USING (run_type_id)
WHERE run.has_errors IS false
GROUP BY run_id, run_type;
Query plan:
"Finalize GroupAggregate (cost=154201.17..154577.71 rows=1365 width=108)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Gather Merge (cost=154201.17..154519.69 rows=2730 width=76)"
" Workers Planned: 2"
" -> Sort (cost=153201.15..153204.56 rows=1365 width=76)"
" Sort Key: run.loc_a_run_id, run_type.run_type"
" -> Partial HashAggregate (cost=153113.01..153130.07 rows=1365 width=76)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Hash Join (cost=21.67..120633.75 rows=1623963 width=62)"
" Hash Cond: (run.run_type_id = run_type.run_type_id)"
" -> Hash Join (cost=20.55..112756.41 rows=1623963 width=32)"
" Hash Cond: (pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline pline (cost=0.00..107766.55 rows=1867855 width=30)"
" -> Hash (cost=17.14..17.14 rows=273 width=6)"
" -> Seq Scan on loc_a_run run (cost=0.00..17.14 rows=273 width=6)"
" Filter: (has_errors IS FALSE)"
" -> Hash (cost=1.05..1.05 rows=5 width=34)"
" -> Seq Scan on loc_a_run_type run_type (cost=0.00..1.05 rows=5 width=34)"
This takes around 14.2s to execute. I lack the experience to assess how good or bad the performance is for this part, but I could live with it. Of course, faster would be an advantage.
Because this contains duplicated code I tried to get rid of it by using a CTE (in the final view I need this for a few more calculations, but the pattern is the same):
WITH dist_pts AS (
SELECT loc_a_run_id
, CASE
WHEN is_top IS true
THEN ROUND(dist_right - dist_left, 2)
ELSE ROUND(dist_left - dist_right, 2)
END AS pts
FROM loc_a_pline
)
SELECT run.loc_a_run_id AS run_id
, run_type.run_type
, SUM(dist_pts.pts) AS total_dist_pts
, COUNT(pline.loc_a_pline_id) AS total_plines
, SUM(dist_pts.pts) / COUNT(pline.loc_a_pline_id) AS dist_pts_per_pline
FROM loc_a_run AS run
JOIN dist_pts USING (loc_a_run_id)
JOIN loc_a_pline AS pline USING (loc_a_run_id)
JOIN run_type USING (run_type_id)
WHERE run.has_errors IS false
GROUP BY run_id, run_type;
Query plan:
"Finalize GroupAggregate (cost=575677889.59..575678266.13 rows=1365 width=108)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Gather Merge (cost=575677889.59..575678208.12 rows=2730 width=76)"
" Workers Planned: 2"
" -> Sort (cost=575676889.57..575676892.98 rows=1365 width=76)"
" Sort Key: run.loc_a_run_id, run_type.run_type"
" -> Partial HashAggregate (cost=575676801.43..575676818.49 rows=1365 width=76)"
" Group Key: run.loc_a_run_id, run_type.run_type"
" -> Parallel Hash Join (cost=155366.81..111024852.15 rows=23232597464 width=62)"
" Hash Cond: (loc_a_pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline (cost=0.00..107877.85 rows=1869785 width=22)"
" -> Parallel Hash (cost=120758.30..120758.30 rows=1625641 width=48)"
" -> Hash Join (cost=21.67..120758.30 rows=1625641 width=48)"
" Hash Cond: (run.run_type_id = run_type.run_type_id)"
" -> Hash Join (cost=20.55..112872.83 rows=1625641 width=18)"
" Hash Cond: (pline.loc_a_run_id = run.loc_a_run_id)"
" -> Parallel Seq Scan on loc_a_pline pline (cost=0.00..107877.85 rows=1869785 width=12)"
" -> Hash (cost=17.14..17.14 rows=273 width=6)"
" -> Seq Scan on loc_a_run run (cost=0.00..17.14 rows=273 width=6)"
" Filter: (has_errors IS FALSE)"
" -> Hash (cost=1.05..1.05 rows=5 width=34)"
" -> Seq Scan on loc_a_run_type run_type (cost=0.00..1.05 rows=5 width=34)"
This takes forever and seems to be the wrong approach. I struggle to understand the query plan to find my mistake(s).
So my questions are:
Why does the CTE approach take so much time?
What would be the smartest solution to avoid duplicated code and eventually reduce execution time?
Is there a way to SUM(dist_pts.pts) only once?
Is there a way to COUNT(pline.loc_a_pline_id) in the same go as the subtraction in the CTE instead of accessing the big loc_a_pline table again? (is it accessed again at all?)
Any help is highly appreciated

Consider creating an index on loc_a_pline.loc_a_run_id. Postgres doesn't automatically create indexes on the referencing side of FK relationships. That should drastically improve the speed and remove the sequential scans over loc_a_pline from the execution plan.
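For example, a minimal sketch (the index name is just a placeholder):
CREATE INDEX loc_a_pline_loc_a_run_id_idx ON loc_a_pline (loc_a_run_id);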
Additionally, I'd suggest gathering all the data you want in the first portion of the CTE and then separating the aggregates out into their own portion. Something like this that accesses all of the tables once and aggregates over the set once:
WITH dist_pts AS (
SELECT run.loc_a_run_id rid
, pline.loc_a_pline_id pid
, rt.run_type
, CASE
WHEN pline.is_top IS true
THEN ROUND(pline.dist_right - pline.dist_left, 2)
ELSE ROUND(pline.dist_left - pline.dist_right, 2)
END AS pts
FROM loc_a_run run
JOIN loc_a_pline pline ON run.loc_a_run_id = pline.loc_a_run_id
JOIN run_type rt ON run.run_type_id = rt.run_type_id
WHERE run.has_errors IS FALSE
), aggs AS
(
SELECT SUM(dist_pts.pts)::NUMERIC total_dist_pts
, COUNT(dist_pts.pid)::NUMERIC total_plines
, dist_pts.rid
FROM dist_pts
GROUP BY dist_pts.rid
)
SELECT dp.rid
, dp.run_type
, aggs.total_dist_pts
, aggs.total_plines
, (aggs.total_dist_pts / aggs.total_plines) dist_pts_per_pline
FROM dist_pts dp
JOIN aggs ON dp.rid = aggs.rid
GROUP BY dp.rid, dp.run_type, aggs.total_dist_pts, aggs.total_plines
;
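For what it's worth, here is an untested variant of the same idea that aggregates loc_a_pline exactly once per run before joining the small tables, so each aggregate expression is written only once (pline_agg is an arbitrary name):
WITH pline_agg AS (
    SELECT loc_a_run_id
         , SUM(CASE
                 WHEN is_top IS true
                 THEN ROUND(dist_right - dist_left, 2)
                 ELSE ROUND(dist_left - dist_right, 2)
               END) AS total_dist_pts
         , COUNT(*) AS total_plines   -- loc_a_pline_id is NOT NULL, so COUNT(*) is equivalent
    FROM loc_a_pline
    GROUP BY loc_a_run_id
)
SELECT run.loc_a_run_id AS run_id
     , run_type.run_type
     , pline_agg.total_dist_pts
     , pline_agg.total_plines
     , pline_agg.total_dist_pts / pline_agg.total_plines AS dist_pts_per_pline
FROM loc_a_run AS run
JOIN pline_agg USING (loc_a_run_id)
JOIN run_type USING (run_type_id)
WHERE run.has_errors IS false;
Because pline_agg already returns one row per run, no outer GROUP BY is needed and the big table is scanned only once.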

Related

Optimize Postgres update query with join statement on large tables

I have 2 tables A and B.
A contains ~6M records and B ~1M records.
B contains a foreign key to A, in a column named a_id.
I need to move the relationship between the tables, so that table A will contain b_id, and then remove a_id from B.
I created this migration:
UPDATE A
SET b_id = B.id
FROM B
INNER JOIN A ON B.a_id = A.id;
This is the EXPLAIN output:
"Update on A (cost=64444.71..138800306174.79 rows=7984529761815 width=915)"
" -> Nested Loop (cost=64444.71..138800306174.79 rows=7984529761815 width=915)"
" -> Seq Scan on A (cost=0.00..485941.45 rows=5779245 width=899)"
" -> Materialize (cost=64444.71..642396.62 rows=1381587 width=16)"
" -> Hash Join (cost=64444.71..628741.69 rows=1381587 width=16)"
" Hash Cond: (A.id = B.a_id)"
" -> Seq Scan on A (cost=0.00..485941.45 rows=5779245 width=10)"
" -> Hash (cost=40427.87..40427.87 rows=1381587 width=14)"
" -> Seq Scan on B (cost=0.00..40427.87 rows=1381587 width=14)"
This is taking too long (more than half an hour), which exceeds the time window I have for the migration. Is there anything I can do to optimize the update statement?
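One thing worth checking (a sketch, not from the original thread): the target table A appears a second time in the FROM list, so the planner joins A against itself without a correlating condition, which is why the nested loop is estimated at roughly 8 trillion rows. In PostgreSQL, UPDATE ... FROM already joins the target table, so the rewrite would be roughly:
UPDATE A
SET    b_id = B.id
FROM   B
WHERE  B.a_id = A.id;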

In PostgreSQL what does hashed subplan mean and what is the final query rewritten?

This is a follow-up question I posted a few days ago.
In PostgreSQL what does hashed subplan mean?
Below was my question.
I want to know how the optimizer rewrote the query and how to read the execution plan in PostgreSQL. Here is the sample code.
DROP TABLE ords;
CREATE TABLE ords (
ORD_ID INT NOT NULL,
ORD_PROD_ID VARCHAR(2) NOT NULL,
ETC_CONTENT VARCHAR(100));
ALTER TABLE ords ADD CONSTRAINT ords_PK PRIMARY KEY(ORD_ID);
CREATE INDEX ords_X01 ON ords(ORD_PROD_ID);
INSERT INTO ords
SELECT i
,chr(64+case when i <= 10 then i else 26 end)
,rpad('x',100,'x')
FROM generate_series(1,10000) a(i);
SELECT COUNT(*) FROM ords WHERE ORD_PROD_ID IN ('A','B','C');
DROP TABLE delivery;
CREATE TABLE delivery (
ORD_ID INT NOT NULL,
VEHICLE_ID VARCHAR(2) NOT NULL,
ETC_REMARKS VARCHAR(100));
ALTER TABLE delivery ADD CONSTRAINT delivery_PK primary key (ORD_ID, VEHICLE_ID);
CREATE INDEX delivery_X01 ON delivery(VEHICLE_ID);
INSERT INTO delivery
SELECT i
, chr(88 + case when i <= 10 then mod(i,2) else 2 end)
, rpad('x',100,'x')
FROM generate_series(1,10000) a(i);
analyze ords;
analyze delivery;
This is the SQL I am interested in.
SELECT *
FROM ords a
WHERE ( EXISTS (SELECT 1
FROM delivery b
WHERE a.ORD_ID = b.ORD_ID
AND b.VEHICLE_ID IN ('X','Y')
)
OR a.ORD_PROD_ID IN ('A','B','C')
);
Here is the execution plan
| Seq Scan on portal.ords a (actual time=0.038..2.027 rows=10 loops=1) |
| Output: a.ord_id, a.ord_prod_id, a.etc_content |
| Filter: ((alternatives: SubPlan 1 or hashed SubPlan 2) OR ((a.ord_prod_id)::text = ANY ('{A,B,C}'::text[]))) |
| Rows Removed by Filter: 9990 |
| Buffers: shared hit=181 |
| SubPlan 1 |
| -> Index Only Scan using delivery_pk on portal.delivery b (never executed) |
| Index Cond: (b.ord_id = a.ord_id) |
| Filter: ((b.vehicle_id)::text = ANY ('{X,Y}'::text[])) |
| Heap Fetches: 0 |
| SubPlan 2 |
| -> Index Scan using delivery_x01 on portal.delivery b_1 (actual time=0.023..0.025 rows=10 loops=1) |
| Output: b_1.ord_id |
| Index Cond: ((b_1.vehicle_id)::text = ANY ('{X,Y}'::text[])) |
| Buffers: shared hit=8 |
| Planning: |
| Buffers: shared hit=78 |
| Planning Time: 0.302 ms |
| Execution Time: 2.121 ms
I don't know how the optimizer transformed the SQL. What is the final SQL the optimizer rewrote? I have only one EXISTS subquery in the SQL above, so why are there two subplans? What does "hashed SubPlan 2" mean? I would appreciate it if anyone would share a little knowledge with me.
Below is Laurenz Albe's answer.
You have the misconception that the optimizer rewrites the SQL statement. That is not the case. Rewriting the query is the job of the query rewriter, which for example replaces views with their definition. The optimizer comes up with a sequence of execution steps to compute the result. It produces a plan, not an SQL statement.
The optimizer plans two alternatives: either execute subplan 1 for each row found, or execute subplan 2 once (note that it is independent of a), build a hash table from the result and probe that hash for each row found in a.
At execution time, PostgreSQL decides to use the latter strategy, that is why subplan 1 is never executed.
Laurenz's answer enlightened me.
But, I wondered what the final query rewritten by the query rewriter would be.
Here is the rewritten query I thought the query rewriter would produce.
Am I right?
What do you, readers of this question, think that the final rewritten query would be?
(
SELECT *
FROM ords a
WHERE EXISTS (SELECT 1
FROM delivery b
WHERE a.ORD_ID = B.ORD_ID
AND b.VEHICLE_ID IN ('X','Y')
OFFSET 0 --> the optimizer prevented subquery collapse.
)
*alternative OR*
SELECT a.*
FROM ords a *(Semi Hash Join)* delivery b --> the optimizer used b as an build input
WHERE a.ORD_ID = b.ORD_ID
AND b.VEHICLE_ID IN ('X','Y') --> the optimizer used the delivery_x01 index.
)
*filtered OR*
SELECT *
FROM ords a
WHERE a.ORD_PROD_ID IN ('A','B','C') --> the optimizer cannot use the ords_x01 index due to the query transformation
No. The subplans are not generated by the rewriter, but by the optimizer. As soon as the optimizer takes over, you leave the realm of SQL for good. The procedural execution steps it generates cannot be represented in the declarative SQL language.
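If you want to inspect what the rewriter actually produces (an internal query tree, not SQL), you can ask PostgreSQL to print it; a sketch:
-- Debug output is emitted at LOG level, so raise client_min_messages to see it
-- in your session rather than only in the server log.
SET debug_print_rewritten = on;
SET client_min_messages = log;

SELECT *
FROM ords a
WHERE EXISTS (SELECT 1
              FROM delivery b
              WHERE a.ORD_ID = b.ORD_ID
                AND b.VEHICLE_ID IN ('X','Y'))
   OR a.ORD_PROD_ID IN ('A','B','C');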

How to index group by date_trunc('second',date)

How can I index GROUP BY date_trunc('second', date)? The cost for date_trunc('second', date) is very high even though I added the following index.
select c.name, t.code, count(*) count_second, sum(volume) sum_volume
from tradedetail t
inner join company c on c.code = t.code and country = 'MY'
where date::date = current_date and t.buysell = 'S'
group by t.code, c.name, date_trunc('second', date)
having sum(volume) >= 4000
CREATE INDEX tradedetail_sbuysell
ON tradedetail
USING btree
(code COLLATE pg_catalog."default", date_trunc('second',date), buysell COLLATE pg_catalog."default")
where buysell = 'S';
"GroupAggregate (cost=4094738.36..4108898.28 rows=435690 width=30)"
" Filter: (sum(t.volume) >= 4000::numeric)"
" -> Sort (cost=4094738.36..4095827.58 rows=435690 width=30)"
" Sort Key: t.code, c.name, (date_trunc('second'::text, t.date))"
" -> Nested Loop (cost=0.01..4033076.58 rows=435690 width=30)"
" -> Seq Scan on company c (cost=0.00..166.06 rows=4735 width=14)"
" Filter: ((country)::text = 'MY'::text)"
" -> Index Scan using tradedetail_index5 on tradedetail t (cost=0.01..849.77 rows=172 width=22)"
" Index Cond: (((code)::text = (c.code)::text) AND ((date)::date = ('now'::cstring)::date) AND ((buysell)::text = 'S'::text))"

Why can't PostgreSQL do this simple FULL JOIN?

Here's a minimal setup with 2 tables a and b each with 3 rows:
CREATE TABLE a (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON a (value);
CREATE TABLE b (
id SERIAL PRIMARY KEY,
value TEXT
);
CREATE INDEX ON b (value);
INSERT INTO a (value) VALUES ('x'), ('y'), (NULL);
INSERT INTO b (value) VALUES ('y'), ('z'), (NULL);
Here is a LEFT JOIN that works fine as expected:
SELECT * FROM a
LEFT JOIN b ON a.value IS NOT DISTINCT FROM b.value;
with output:
id | value | id | value
----+-------+----+-------
1 | x | |
2 | y | 1 | y
3 | | 3 |
(3 rows)
Changing "LEFT JOIN" to "FULL JOIN" gives an error:
SELECT * FROM a
FULL JOIN b ON a.value IS NOT DISTINCT FROM b.value;
ERROR: FULL JOIN is only supported with merge-joinable or hash-joinable join conditions
Can someone please answer:
What is a "merge-joinable or hash-joinable join condition" and why joining on a.value IS NOT DISTINCT FROM b.value doesn't fulfill this condition, but a.value = b.value is perfectly fine?
It seems that the only difference is how NULL values are handled. Since the value column is indexed in both tables, running an EXPLAIN on a NULL lookup is just as efficient as looking up values that are non-NULL:
EXPLAIN SELECT * FROM a WHERE value = 'x';
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.67 rows=6 width=36)
Recheck Cond: (value = 'x'::text)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value = 'x'::text)
EXPLAIN SELECT * FROM a WHERE value ISNULL;
QUERY PLAN
--------------------------------------------------------------------------
Bitmap Heap Scan on a (cost=4.20..13.65 rows=6 width=36)
Recheck Cond: (value IS NULL)
-> Bitmap Index Scan on a_value_idx (cost=0.00..4.20 rows=6 width=0)
Index Cond: (value IS NULL)
This has been tested with PostgreSQL 9.6.3 and 10beta1.
There has been discussion about this issue, but it doesn't directly answer the above question.
PostgreSQL implements FULL OUTER JOIN with either a hash or a merge join.
To be eligible for such a join, the join condition has to have the form
<expression using only left table> <operator> <expression using only right table>
Now your join condition does look like this, but PostgreSQL does not have a special IS NOT DISTINCT FROM operator, so it parses your condition into:
(NOT ($1 IS DISTINCT FROM $2))
And such an expression cannot be used for hash or merge joins, hence the error message.
I can think of a way to work around it:
SELECT a_id, NULLIF(a_value, '<null>'),
b_id, NULLIF(b_value, '<null>')
FROM (SELECT id AS a_id,
COALESCE(value, '<null>') AS a_value
FROM a
) x
FULL JOIN
(SELECT id AS b_id,
COALESCE(value, '<null>') AS b_value
FROM b
) y
ON x.a_value = y.b_value;
That works if <null> does not appear anywhere in the value columns.
I just solved such a case by replacing the ON condition with "TRUE", and moving the original "ON" condition into a WHERE clause. I don't know the performance impact of this, though.

How can I make this query run faster in postgres

I have this query which takes 86 sec to execute.
select cust_id customer_id,
cust_first_name customer_first_name,
cust_last_name customer_last_name,
cust_prf customer_prf,
cust_birth_country customer_birth_country,
cust_login customer_login,
cust_email_address customer_email_address,
date_year ddyear,
sum(((stock_ls_price-stock_ws_price-stock_ds_price)+stock_es_price)/2) total_yr,
's' stock_type
from customer, stock, date
where customer_k = stock_customer_k
and stock_soldate_k = date_k
group by cust_id, cust_first_name, cust_last_name, cust_prf, cust_birth_country, cust_login, cust_email_address, date_year;
EXPLAIN ANALYZE RESULT:
QUERY PLAN
-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
GroupAggregate (cost=639753.55..764040.06 rows=2616558 width=213) (actual time=81192.575..86536.398 rows=190581 loops=1)
Group Key: customer.cust_id, customer.cust_first_name, customer.cust_last_name, customer.cust_prf, customer.cust_birth_country, customer.cust_login, customer.cust_email_address, date.date_year
-> Sort (cost=639753.55..646294.95 rows=2616558 width=213) (actual time=81192.468..83977.960 rows=2685453 loops=1)
Sort Key: customer.cust_id, customer.cust_first_name, customer.cust_last_name, customer.cust_prf, customer.cust_birth_country, customer.cust_login, customer.cust_email_address, date.date_year
Sort Method: external merge Disk: 460920kB
-> Hash Join (cost=6527.66..203691.58 rows=2616558 width=213) (actual time=60.500..2306.082 rows=2685453 loops=1)
Hash Cond: (stock.stock_customer_k = customer.customer_k)
-> Merge Join (cost=1423.66..144975.59 rows=2744641 width=30) (actual time=8.820..1412.109 rows=2750311 loops=1)
Merge Cond: (date.date_k = stock.stock_soldate_k)
-> Index Scan using date_key_idx on date (cost=0.29..2723.33 rows=73049 width=8) (actual time=0.013..7.164 rows=37622 loops=1)
-> Index Scan using stock_soldate_k_index on stock (cost=0.43..108829.12 rows=2880404 width=30) (actual time=0.004..735.043 rows=2750312 loops=1)
-> Hash (cost=3854.00..3854.00 rows=100000 width=191) (actual time=51.650..51.650 rows=100000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 16139kB
-> Seq Scan on customer (cost=0.00..3854.00 rows=100000 width=191) (actual time=0.004..30.341 rows=100000 loops=1)
Planning time: 1.761 ms
Execution time: 86621.807 ms
I have work_mem=512MB. I have indexes created on
cust_id, customer_k, stock_customer_k, stock_soldate_k and date_k.
There are about 100,000 rows in customer, 3,000,000 rows in stock and 80,000 rows in date.
How can I make this query run faster?
I would appreciate any help!
TABLE DEFINITIONS
date
Column | Type | Modifiers
---------------------+---------------+-----------
date_k | integer | not null
date_id | character(16) | not null
date_date | date |
date_year | integer |
stock
Column | Type | Modifiers
-----------------------+--------------+-----------
stock_soldate_k | integer |
stock_soltime_k | integer |
stock_customer_k | integer |
stock_ds_price | numeric(7,2) |
stock_es_price | numeric(7,2) |
stock_ls_price | numeric(7,2) |
stock_ws_price | numeric(7,2) |
customer:
Column | Type | Modifiers
---------------------------+-----------------------+-----------
customer_k | integer | not null
customer_id | character(16) | not null
cust_first_name | character(20) |
cust_last_name | character(30) |
cust_prf | character(1) |
cust_birth_country | character varying(20) |
cust_login | character(13) |
cust_email_address | character(50) |
TABLE "stock" CONSTRAINT "st1" FOREIGN KEY (stock_soldate_k) REFERENCES date(date_k)
"st2" FOREIGN KEY (stock_customer_k) REFERENCES customer(customer_k)
Try this:
with stock_grouped as
(select stock_customer_k, date_year, sum(((stock_ls_price-stock_ws_price-stock_ds_price)+stock_es_price)/2) total_yr
from stock, date
where stock_soldate_k = date_k
group by stock_customer_k, date_year)
select cust_id customer_id,
cust_first_name customer_first_name,
cust_last_name customer_last_name,
cust_prf customer_prf,
cust_birth_country customer_birth_country,
cust_login customer_login,
cust_email_address customer_email_address,
date_year ddyear,
total_yr,
's' stock_type
from customer, stock_grouped
where customer_k = stock_customer_k
This query pushes the grouping ahead of the join to customer.
A big performance penalty that you get is that about 450MB of intermediate data is stored externally: Sort Method: external merge Disk: 460920kB. This happens because the planner first needs to satisfy the join conditions between the 3 tables, including the possibly inefficient table customer, before the aggregation sum() can take place, even though the aggregation could be performed perfectly well on table stock alone.
Query
Because your tables are fairly large, you are better off reducing the number of eligible rows as soon as possible and preferably before any joining. In this case that means doing the aggregation on table stock in a sub-query and join that result to the other two tables:
SELECT c.cust_id AS customer_id,
c.cust_first_name AS customer_first_name,
c.cust_last_name AS customer_last_name,
c.cust_prf AS customer_prf,
c.cust_birth_country AS customer_birth_country,
c.cust_login AS customer_login,
c.cust_email_address AS customer_email_address,
d.date_year AS ddyear,
ss.total_yr,
's' stock_type
FROM (
SELECT
stock_customer_k AS ck,
stock_soldate_k AS sdk,
sum((stock_ls_price-stock_ws_price-stock_ds_price+stock_es_price)*0.5) AS total_yr
FROM stock
GROUP BY 1, 2) ss
JOIN customer c ON c.customer_k = ss.ck
JOIN date d ON d.date_k = ss.sdk;
The sub-query on stock will result in far fewer rows, depending on the average number of orders per customer per date. Also, in the sum() function, multiplying by 0.5 is far cheaper than dividing by 2 (although in the grand scheme of things it will be relatively marginal).
Data model
You should also look seriously at your data model.
In table customer you use data types like char(30), which will always take up 30 bytes in your row, even when you store just 'X'. Using a varchar(30) data type is much more efficient when many strings are shorter than the declared maximum width, because it takes up less space and thus requires fewer page reads (and writes on the intermediate data).
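If you want to try that, a char(n) column can be converted in place; a sketch (this takes an ACCESS EXCLUSIVE lock and may rewrite the table, and the cast strips trailing blanks):
ALTER TABLE customer ALTER COLUMN cust_last_name TYPE varchar(30);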
Table stock uses numeric(7,2) for prices. Use of the numeric data type may give accurate results when subjecting data to many, many repeated operations, but it is also quite slow. The double precision data type will be much faster and equally accurate in your scenario. For presentation purposes you can round the value off to the desired precision.
As a suggestion, create a table stock_f with double precision data types instead of numeric, copy all data over from stock to stock_f and run the query on that table.
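An untested sketch of that experiment, with the column list taken from the stock definition above and the price columns cast to double precision:
CREATE TABLE stock_f AS
SELECT stock_soldate_k
     , stock_soltime_k
     , stock_customer_k
     , stock_ds_price::double precision AS stock_ds_price
     , stock_es_price::double precision AS stock_es_price
     , stock_ls_price::double precision AS stock_ls_price
     , stock_ws_price::double precision AS stock_ws_price
FROM stock;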