I have a very simple model. A table Articles has 10 relations: 2 are one-to-many (12N), 8 are many-to-many (N2N).
The issue is that this table can contain 10K rows on average, and some of the N2N relation tables have more than 200 records.
All in all, querying the data from table Articles with all joined records leads to a total of 10,849,200,000,000 rows, which is obviously too much.
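To see where that multiplication comes from: the N2N joins combine multiplicatively per article, so a single article matching 22 documents, 8 métiers, 40 magasins, 8 périmètres and 7 services already produces 22 × 8 × 40 × 8 × 7 = 394,240 joined rows, which is exactly the row count visible in the plan under [EDIT 2] below.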
Therefore I have to change the model to optimize it, maybe avoiding the N2N relations by folding some of the N2N data into a JSONB column on Articles, but this may introduce complexity when updating/adding/deleting records that used to live in the N2N tables.
What would be a better approach in your opinion?
What is the best way to update/delete/add records in JSONB columns?
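For illustration, this is the kind of JSONB maintenance I have in mind, assuming the métiers were folded into an invented metiers jsonb array column (a rough sketch, not my actual schema):

-- hypothetical denormalisation: store the métier entries directly on articles
ALTER TABLE data.articles ADD COLUMN metiers jsonb NOT NULL DEFAULT '[]';
-- a GIN index lets containment searches use an index
CREATE INDEX articles_metiers_gin ON data.articles USING gin (metiers);
-- find articles tagged with a given métier
SELECT id, title FROM data.articles WHERE metiers @> '[{"key": "logistics"}]';
-- add an entry: append to the array
UPDATE data.articles
SET metiers = metiers || '[{"key": "logistics", "value": "Logistique"}]'::jsonb
WHERE id = '000b827c-7a6a-4430-a28d-827c286983a5';
-- update an entry in place: replace the value at array index 0
UPDATE data.articles
SET metiers = jsonb_set(metiers, '{0,value}', '"Logistics"')
WHERE id = '000b827c-7a6a-4430-a28d-827c286983a5';
-- delete an entry: remove the element at array index 0
UPDATE data.articles
SET metiers = metiers - 0
WHERE id = '000b827c-7a6a-4430-a28d-827c286983a5';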
Running Postgres 12 on Google Cloud.
Thanks
[EDIT]
Here is the articles table (in a data schema):
CREATE TABLE IF NOT EXISTS data.articles
(
id uuid NOT NULL DEFAULT uuid_generate_v4(),
title character varying(3000) COLLATE pg_catalog."default" DEFAULT ''::character varying,
"prioriteId" uuid,
"isAlerte" boolean DEFAULT false,
"categorieId" uuid,
"publicationStartDate" timestamp without time zone DEFAULT '2021-12-03 14:20:03.172'::timestamp without time zone,
"publicationEndDate" timestamp without time zone DEFAULT '2021-12-03 14:20:03.172'::timestamp without time zone,
"hasAction" boolean DEFAULT false,
"briefDescription" character varying(3000) COLLATE pg_catalog."default" DEFAULT ''::character varying,
content text COLLATE pg_catalog."default" DEFAULT ''::text,
contact1 character varying(300) COLLATE pg_catalog."default" DEFAULT ''::character varying,
contact2 character varying(300) COLLATE pg_catalog."default" DEFAULT ''::character varying,
author character varying(300) COLLATE pg_catalog."default" DEFAULT ''::character varying,
"recordCreationDate" timestamp without time zone NOT NULL DEFAULT now(),
"recordUpdateDate" timestamp without time zone NOT NULL DEFAULT now(),
CONSTRAINT "PK_0a6e2c450d83e0b6052c2793334" PRIMARY KEY (id),
CONSTRAINT "FK_278d87b271a80d56e5d6cc0f888" FOREIGN KEY ("categorieId")
REFERENCES data.ref_categories (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION,
CONSTRAINT "FK_a55acd217d26e0d60f57b5f38f7" FOREIGN KEY ("prioriteId")
REFERENCES data.ref_priorites (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)
TABLESPACE pg_default;
ALTER TABLE data.articles
OWNER to postgres;
-- Index: IDX_0a6e2c450d83e0b6052c279333
-- DROP INDEX data."IDX_0a6e2c450d83e0b6052c279333";
CREATE INDEX "IDX_0a6e2c450d83e0b6052c279333"
ON data.articles USING btree
(id ASC NULLS LAST)
TABLESPACE pg_default;
Here is one of the 10 reference tables. All 10 have a similar definition:
CREATE TABLE IF NOT EXISTS data.ref_metiers
(
id uuid NOT NULL DEFAULT uuid_generate_v4(),
section character varying(300) COLLATE pg_catalog."default" NOT NULL,
key character varying(300) COLLATE pg_catalog."default" NOT NULL,
value character varying(300) COLLATE pg_catalog."default" NOT NULL,
parent character varying(300) COLLATE pg_catalog."default" NOT NULL,
"order" integer NOT NULL,
CONSTRAINT "PK_24035e57be39b22b5ee482f4a72" PRIMARY KEY (id)
)
TABLESPACE pg_default;
ALTER TABLE data.ref_metiers
OWNER to postgres;
-- Index: IDX_24035e57be39b22b5ee482f4a7
-- DROP INDEX data."IDX_24035e57be39b22b5ee482f4a7";
CREATE INDEX "IDX_24035e57be39b22b5ee482f4a7"
ON data.ref_metiers USING btree
(id ASC NULLS LAST)
TABLESPACE pg_default;
Here is one of the 8 junction tables that implement the N2N relations:
CREATE TABLE IF NOT EXISTS data.articles_metiers
(
id uuid NOT NULL DEFAULT uuid_generate_v4(),
"articleId" uuid NOT NULL,
"metierId" uuid NOT NULL,
CONSTRAINT "PK_62b37716d5cae9a5a9bee96c4da" PRIMARY KEY (id, "articleId", "metierId"),
CONSTRAINT "FK_43128a90c1671f1639d23ea3d5e" FOREIGN KEY ("articleId")
REFERENCES data.articles (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION,
CONSTRAINT "FK_f42358a7091bc79819c714987e4" FOREIGN KEY ("metierId")
REFERENCES data.ref_metiers (id) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE NO ACTION
)
The query (very simple):
select
a.*,
m.id,
d.*,
ad.*,
f.key,
f.value,
me.key,
me.value,
pe.key,
pe.value,
po.key,
po.value,
ser.key,
ser.value,
con.key,
con.value
from data.articles a
inner join data.articles_magasins am
on a."id" = am."articleId"
inner join data.hypermarches_stores m
on am."magasinId" = m."id"
inner join data.articles_documents ad
on a."id" = ad."articleId"
inner join data.base_documents d
on ad."documentId" = d."id"
inner join data.articles_fonctions af
on a."id" = af."articleId"
inner join data.ref_fonctions f
on af."fonctionId" = f."id"
inner join data.articles_metiers ame
on a."id" = ame."articleId"
inner join data.ref_metiers me
on ame."metierId" = me."id"
inner join data.articles_perimetres ape
on a."id" = ape."articleId"
inner join data.ref_perimetres pe
on ape."perimetreId" = pe."id"
inner join data.articles_poles apo
on a."id" = apo."articleId"
inner join data.ref_poles po
on apo."poleId" = po."id"
inner join data.articles_services ase
on a."id" = ase."articleId"
inner join data.ref_services ser
on ase."serviceId" = ser."id"
inner join data.articles_statuts_contribution acon
on a."id" = acon."articleId"
inner join data.ref_statuts_contrib con
on acon."statutId" = con."id"
And a screenshot of the plan in pgAdmin (not sure what info is relevant).
[EDIT 2]
Adding a where a.id = 'lklmklmkm' condition to get back only 1 article takes 7 s (the dev server is very small, to be honest, but it should be faster anyway).
Its EXPLAIN ANALYZE:
"Nested Loop (cost=2066.53..27008.53 rows=394240 width=6493) (actual time=92.243..341.860 rows=394240 loops=1)"
" -> Nested Loop (cost=0.00..259.68 rows=1 width=1048) (actual time=0.564..1.101 rows=1 loops=1)"
" Join Filter: (acon.""statutId"" = con.id)"
" Rows Removed by Join Filter: 5"
" -> Seq Scan on articles_statuts_contribution acon (cost=0.00..249.00 rows=1 width=32) (actual time=0.554..1.089 rows=1 loops=1)"
" Filter: (""articleId"" = '000b827c-7a6a-4430-a28d-827c286983a5'::uuid)"
" Rows Removed by Filter: 9999"
" -> Seq Scan on ref_statuts_contrib con (cost=0.00..10.30 rows=30 width=1048) (actual time=0.006..0.007 rows=6 loops=1)"
" -> Nested Loop (cost=2066.53..22806.46 rows=394240 width=5461) (actual time=91.676..247.043 rows=394240 loops=1)"
" -> Hash Join (cost=2066.53..14342.47 rows=7040 width=2365) (actual time=77.916..87.851 rows=7040 loops=1)"
" Hash Cond: (ame.""metierId"" = me.id)"
" -> Nested Loop (cost=2055.86..14310.70 rows=7040 width=1349) (actual time=77.880..84.789 rows=7040 loops=1)"
" -> Nested Loop (cost=1000.57..7340.93 rows=176 width=1333) (actual time=36.323..40.931 rows=176 loops=1)"
" -> Gather (cost=1000.28..5352.65 rows=22 width=164) (actual time=30.476..30.682 rows=22 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Nested Loop (cost=0.28..4350.45 rows=9 width=164) (actual time=17.455..20.764 rows=7 loops=3)"
" -> Parallel Seq Scan on articles_documents ad (cost=0.00..4288.83 rows=9 width=81) (actual time=17.434..20.702 rows=7 loops=3)"
" Filter: (""articleId"" = '000b827c-7a6a-4430-a28d-827c286983a5'::uuid)"
" Rows Removed by Filter: 73326"
" -> Index Scan using ""IDX_87eab207e6374a967ae94feb8e"" on base_documents d (cost=0.28..6.85 rows=1 width=83) (actual time=0.007..0.007 rows=1 loops=22)"
" Index Cond: (id = ad.""documentId"")"
" -> Materialize (cost=0.29..1986.11 rows=8 width=1169) (actual time=0.266..0.451 rows=8 loops=22)"
" -> Nested Loop (cost=0.29..1986.07 rows=8 width=1169) (actual time=5.830..9.852 rows=8 loops=1)"
" -> Nested Loop (cost=0.29..237.99 rows=1 width=1153) (actual time=0.767..1.265 rows=1 loops=1)"
" Join Filter: (af.""fonctionId"" = f.id)"
" Rows Removed by Join Filter: 2"
" -> Nested Loop (cost=0.29..227.31 rows=1 width=137) (actual time=0.755..1.251 rows=1 loops=1)"
" -> Index Scan using ""IDX_0a6e2c450d83e0b6052c279333"" on articles a (cost=0.29..8.30 rows=1 width=121) (actual time=0.021..0.028 rows=1 loops=1)"
" Index Cond: (id = '000b827c-7a6a-4430-a28d-827c286983a5'::uuid)"
" -> Seq Scan on articles_fonctions af (cost=0.00..219.00 rows=1 width=32) (actual time=0.730..1.217 rows=1 loops=1)"
" Filter: (""articleId"" = '000b827c-7a6a-4430-a28d-827c286983a5'::uuid)"
" Rows Removed by Filter: 9999"
" -> Seq Scan on ref_fonctions f (cost=0.00..10.30 rows=30 width=1048) (actual time=0.008..0.008 rows=3 loops=1)"
" -> Seq Scan on articles_metiers ame (cost=0.00..1748.00 rows=8 width=32) (actual time=5.057..8.576 rows=8 loops=1)"
" Filter: (""articleId"" = '000b827c-7a6a-4430-a28d-827c286983a5'::uuid)"
" Rows Removed by Filter: 79992"
" -> Materialize (cost=1055.29..6881.87 rows=40 width=32) (actual time=0.236..0.239 rows=40 loops=176)"
" -> Gather (cost=1055.29..6881.67 rows=40 width=32) (actual time=41.545..41.620 rows=40 loops=1)"
" Workers Planned: 2"
" Workers Launched: 2"
" -> Hash Join (cost=55.29..5877.67 rows=17 width=32) (actual time=29.068..35.769 rows=13 loops=3)"
" Hash Cond: (am.""magasinId"" = m.id)"
" -> Parallel Seq Scan on articles_magasins am (cost=0.00..5822.33 rows=17 width=32) (actual time=28.870..35.563 rows=13 loops=3)"
" Filter: (""articleId"" = '000b827c-7a6a-4430-a28d-827c286983a5'::uuid)"
" Rows Removed by Filter: 133320"
" -> Hash (cost=52.35..52.35 rows=235 width=16) (actual time=0.302..0.303 rows=235 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 20kB"
" -> Seq Scan on hypermarches_stores m (cost=0.00..52.35 rows=235 width=16) (actual time=0.062..0.214 rows=235 loops=1)"
" -> Hash (cost=10.30..10.30 rows=30 width=1048) (actual time=0.023..0.024 rows=41 loops=1)"
" Buckets: 1024 Batches: 1 Memory Usage: 12kB"
" -> Seq Scan on ref_metiers me (cost=0.00..10.30 rows=30 width=1048) (actual time=0.007..0.013 rows=41 loops=1)"
" -> Materialize (cost=0.00..3536.13 rows=56 width=3144) (actual time=0.002..0.006 rows=56 loops=7040)"
" -> Nested Loop (cost=0.00..3535.85 rows=56 width=3144) (actual time=13.749..17.341 rows=56 loops=1)"
" -> Nested Loop (cost=0.00..1761.92 rows=8 width=1048) (actual time=5.872..9.430 rows=8 loops=1)"
" Join Filter: (ape.""perimetreId"" = pe.id)"
" Rows Removed by Join Filter: 56"
" -> Seq Scan on ref_perimetres pe (cost=0.00..10.30 rows=30 width=1048) (actual time=0.018..0.020 rows=8 loops=1)"
" -> Materialize (cost=0.00..1748.04 rows=8 width=32) (actual time=0.731..1.174 rows=8 loops=8)"
" -> Seq Scan on articles_perimetres ape (cost=0.00..1748.00 rows=8 width=32) (actual time=5.844..9.379 rows=8 loops=1)"
" Filter: (""articleId"" = '000b827c-7a6a-4430-a28d-827c286983a5'::uuid)"
" Rows Removed by Filter: 79992"
" -> Materialize (cost=0.00..1773.25 rows=7 width=2096) (actual time=0.984..0.987 rows=7 loops=8)"
" -> Nested Loop (cost=0.00..1773.21 rows=7 width=2096) (actual time=7.866..7.880 rows=7 loops=1)"
" Join Filter: (ase.""serviceId"" = ser.id)"
" Rows Removed by Join Filter: 77"
" -> Seq Scan on ref_services ser (cost=0.00..10.30 rows=30 width=1048) (actual time=0.009..0.011 rows=12 loops=1)"
" -> Materialize (cost=0.00..1759.78 rows=7 width=1080) (actual time=0.350..0.654 rows=7 loops=12)"
" -> Nested Loop (cost=0.00..1759.74 rows=7 width=1080) (actual time=4.190..7.830 rows=7 loops=1)"
" -> Nested Loop (cost=0.00..229.68 rows=1 width=1048) (actual time=0.551..1.048 rows=1 loops=1)"
" Join Filter: (apo.""poleId"" = po.id)"
" Rows Removed by Join Filter: 2"
" -> Seq Scan on articles_poles apo (cost=0.00..219.00 rows=1 width=32) (actual time=0.543..1.038 rows=1 loops=1)"
" Filter: (""articleId"" = '000b827c-7a6a-4430-a28d-827c286983a5'::uuid)"
" Rows Removed by Filter: 9999"
" -> Seq Scan on ref_poles po (cost=0.00..10.30 rows=30 width=1048) (actual time=0.005..0.005 rows=3 loops=1)"
" -> Seq Scan on articles_services ase (cost=0.00..1530.00 rows=7 width=32) (actual time=3.637..6.776 rows=7 loops=1)"
" Filter: (""articleId"" = '000b827c-7a6a-4430-a28d-827c286983a5'::uuid)"
" Rows Removed by Filter: 69993"
"Planning Time: 2.550 ms"
"Execution Time: 362.384 ms"
[EDIT 3]
Have applied some changes suggested by @sticky bit:
remove the unnecessary id column from the N2N tables
add indexes on the N2N tables ("articleId" and "metierId" in the previous example)
run ANALYZE to update the statistics (see the sketch below)
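Roughly, in SQL (a sketch of what was applied to articles_metiers; the other junction tables got the same treatment):

-- drop the superfluous surrogate key, keep the natural composite key
ALTER TABLE data.articles_metiers DROP CONSTRAINT "PK_62b37716d5cae9a5a9bee96c4da";
ALTER TABLE data.articles_metiers DROP COLUMN id;
ALTER TABLE data.articles_metiers ADD PRIMARY KEY ("articleId", "metierId");
CREATE INDEX ON data.articles_metiers ("articleId");  -- arguably redundant: the new PK already covers it
CREATE INDEX ON data.articles_metiers ("metierId");
-- refresh the planner statistics
ANALYZE data.articles_metiers;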
Still no luck.
Have then tried to find which join has an issue.
I found one that takes 3 s to retrieve 400K rows (again, the dev server is very small, but neither CPU nor memory is anywhere near full; this is a managed Postgres on GCP):
select
a.*,
m.id
from data.articles a
inner join data.articles_magasins am
on a."id" = am."articleId"
inner join data.hypermarches_stores m
on am."magasinId" = m."id"
articles_magasins has 400,000 rows
data.articles has 10,000 rows
data.hypermarches_stores has 235 rows
articles_magasins has an index on articleId and another index on magasinId
both data.articles and data.hypermarches_stores have indexes on id
However, the execution plan doesn't use any of the indexes:
"Hash Join (cost=473.29..9534.67 rows=400000 width=137)"
" Hash Cond: (am.""magasinId"" = m.id)"
" -> Hash Join (cost=418.00..8410.44 rows=400000 width=137)"
" Hash Cond: (am.""articleId"" = a.id)"
" -> Seq Scan on articles_magasins am (cost=0.00..6942.00 rows=400000 width=32)"
" -> Hash (cost=293.00..293.00 rows=10000 width=121)"
" -> Seq Scan on articles a (cost=0.00..293.00 rows=10000 width=121)"
" -> Hash (cost=52.35..52.35 rows=235 width=16)"
" -> Seq Scan on hypermarches_stores m (cost=0.00..52.35 rows=235 width=16)"
What's wrong?
I want to index my tables for the following query:
select
t.*
from main_transaction t
left join main_profile profile on profile.id = t.profile_id
left join main_customer customer on (customer.id = profile.user_id)
where
(upper(t.request_no) LIKE upper('%' || #requestNumber || '%') OR upper(customer.phone) LIKE upper('%' || #phoneNumber || '%'))
and t.service_type = 'SERVICE_1'
and t.status = 'SUCCESS'
and t.mode = 'AUTO'
and t.transaction_type = 'WITHDRAW'
and customer.client = 'corp'
and t.pub_date>='2018-09-05' and t.pub_date<='2018-11-05'
order by t.pub_date desc, t.id asc
LIMIT 1000;
This is how I tried to index my tables:
CREATE INDEX main_transaction_pr_id ON main_transaction (profile_id);
CREATE INDEX main_profile_user_id ON main_profile (user_id);
CREATE INDEX main_customer_client ON main_customer (client);
CREATE INDEX main_transaction_gin_req_no ON main_transaction USING gin (upper(request_no) gin_trgm_ops);
CREATE INDEX main_customer_gin_phone ON main_customer USING gin (upper(phone) gin_trgm_ops);
CREATE INDEX main_transaction_general ON main_transaction (service_type, status, mode, transaction_type); -- not sure if this one is right!
After indexing as above, my query spends over 4.5 seconds just to select 1000 rows!
I am selecting from the following table, which has 34 columns including 3 foreign keys, and over 3 million rows:
CREATE TABLE main_transaction (
id integer NOT NULL DEFAULT nextval('main_transaction_id_seq'::regclass),
description character varying(255) NOT NULL,
request_no character varying(18),
account character varying(50),
service_type character varying(50),
pub_date" timestamptz(6) NOT NULL,
"service_id" varchar(50) COLLATE "pg_catalog"."default",
....
);
I am also joining two tables (main_profile, main_customer) to search customer.phone and to select customer.client. To get from the main_transaction table to the main_customer table, I can only go through main_profile.
My question is: how can I index my tables to increase performance for the above query?
Please do not suggest UNION for the OR in this case (upper(t.request_no) LIKE upper('%' || #requestNumber || '%') OR upper(customer.phone) LIKE upper('%' || #phoneNumber || '%')); could we use a CASE WHEN condition instead? I have to convert my PostgreSQL query to Hibernate JPA, and I don't know how to express UNION there except with Hibernate native SQL, which I am not allowed to use.
Explain:
Limit (cost=411601.73..411601.82 rows=38 width=1906) (actual time=3885.380..3885.381 rows=1 loops=1)
-> Sort (cost=411601.73..411601.82 rows=38 width=1906) (actual time=3885.380..3885.380 rows=1 loops=1)
Sort Key: t.pub_date DESC, t.id
Sort Method: quicksort Memory: 27kB
-> Hash Join (cost=20817.10..411600.73 rows=38 width=1906) (actual time=3214.473..3885.369 rows=1 loops=1)
Hash Cond: (t.profile_id = profile.id)
Join Filter: ((upper((t.request_no)::text) ~~ '%20181104-2158-2723948%'::text) OR (upper((customer.phone)::text) ~~ '%20181104-2158-2723948%'::text))
Rows Removed by Join Filter: 593118
-> Seq Scan on main_transaction t (cost=0.00..288212.28 rows=205572 width=1906) (actual time=0.068..1527.677 rows=593119 loops=1)
Filter: ((pub_date >= '2016-09-05 00:00:00+05'::timestamp with time zone) AND (pub_date <= '2018-11-05 00:00:00+05'::timestamp with time zone) AND ((service_type)::text = 'SERVICE_1'::text) AND ((status)::text = 'SUCCESS'::text) AND ((mode)::text = 'AUTO'::text) AND ((transaction_type)::text = 'WITHDRAW'::text))
Rows Removed by Filter: 2132732
-> Hash (cost=17670.80..17670.80 rows=180984 width=16) (actual time=211.211..211.211 rows=181516 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 3166kB
-> Hash Join (cost=6936.09..17670.80 rows=180984 width=16) (actual time=46.846..183.689 rows=181516 loops=1)
Hash Cond: (customer.id = profile.user_id)
-> Seq Scan on main_customer customer (cost=0.00..5699.73 rows=181106 width=16) (actual time=0.013..40.866 rows=181618 loops=1)
Filter: ((client)::text = 'corp'::text)
Rows Removed by Filter: 16920
-> Hash (cost=3680.04..3680.04 rows=198404 width=8) (actual time=46.087..46.087 rows=198404 loops=1)
Buckets: 131072 Batches: 4 Memory Usage: 2966kB
-> Seq Scan on main_profile profile (cost=0.00..3680.04 rows=198404 width=8) (actual time=0.008..20.099 rows=198404 loops=1)
Planning time: 0.757 ms
Execution time: 3885.680 ms
With the restriction to not use UNION, you won't get a good plan.
You can slightly speed up processing with the following indexes:
main_transaction ((service_type::text), (status::text), (mode::text),
(transaction_type::text), pub_date)
main_customer ((client::text))
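Spelled out as DDL, that would be something like:

CREATE INDEX ON main_transaction ((service_type::text), (status::text), (mode::text),
                                  (transaction_type::text), pub_date);
CREATE INDEX ON main_customer ((client::text));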
These should at least get rid of the sequential scans, but the hash join that takes the lion's share of the processing time will remain.
This query searches for product_groupings often purchased with the product_grouping with ID 99999. It fans out to all the orders that contain product_grouping 99999, then joins back down to count the number of times each other product_grouping appears in those orders, and takes the top 10.
Is there any way to speed this query up?
SELECT product_groupings.*, count(product_groupings.id) AS product_groupings_count
FROM "product_groupings"
INNER JOIN "products" ON "product_groupings"."id" = "products"."product_grouping_id"
INNER JOIN "variants" ON "products"."id" = "variants"."product_id"
INNER JOIN "order_items" ON "variants"."id" = "order_items"."variant_id"
INNER JOIN "shipments" ON "order_items"."shipment_id" = "shipments"."id"
INNER JOIN "orders" ON "shipments"."order_id" = "orders"."id"
INNER JOIN "shipments" "shipments_often_purchased_with_join" ON "orders"."id" = "shipments_often_purchased_with_join"."order_id"
INNER JOIN "order_items" "order_items_often_purchased_with_join" ON "shipments_often_purchased_with_join"."id" = "order_items_often_purchased_with_join"."shipment_id"
INNER JOIN "variants" "variants_often_purchased_with_join" ON "order_items_often_purchased_with_join"."variant_id" = "variants_often_purchased_with_join"."id"
INNER JOIN "products" "products_often_purchased_with_join" ON "variants_often_purchased_with_join"."product_id" = "products_often_purchased_with_join"."id"
WHERE "products_often_purchased_with_join"."product_grouping_id" = 99999 AND (product_groupings.id != 99999) AND "product_groupings"."state" = 'active' AND ("shipments"."state" NOT IN ('pending', 'cancelled'))
GROUP BY product_groupings.id
ORDER BY product_groupings_count desc LIMIT 10
Schema:
CREATE TABLE product_groupings (
id integer NOT NULL,
state character varying(255) DEFAULT 'active'::character varying,
brand_id integer,
product_content_id integer,
hierarchy_category_id integer,
hierarchy_subtype_id integer,
hierarchy_type_id integer,
product_type_id integer,
description text,
keywords text,
created_at timestamp without time zone,
updated_at timestamp without time zone
);
CREATE INDEX index_product_groupings_on_brand_id ON product_groupings USING btree (brand_id);
CREATE INDEX index_product_groupings_on_hierarchy_category_id ON product_groupings USING btree (hierarchy_category_id);
CREATE INDEX index_product_groupings_on_hierarchy_subtype_id ON product_groupings USING btree (hierarchy_subtype_id);
CREATE INDEX index_product_groupings_on_hierarchy_type_id ON product_groupings USING btree (hierarchy_type_id);
CREATE INDEX index_product_groupings_on_name ON product_groupings USING btree (name);
CREATE INDEX index_product_groupings_on_product_content_id ON product_groupings USING btree (product_content_id);
CREATE INDEX index_product_groupings_on_product_type_id ON product_groupings USING btree (product_type_id);
ALTER TABLE ONLY product_groupings
ADD CONSTRAINT product_groupings_pkey PRIMARY KEY (id);
CREATE TABLE products (
id integer NOT NULL,
name character varying(255) NOT NULL,
prototype_id integer,
deleted_at timestamp without time zone,
created_at timestamp without time zone,
updated_at timestamp without time zone,
item_volume character varying(255),
upc character varying(255),
state character varying(255),
volume_unit character varying(255),
volume_value numeric,
container_type character varying(255),
container_count integer,
upc_ext character varying(8),
product_grouping_id integer,
short_pack_size character varying(255),
short_volume character varying(255),
additional_upcs character varying(255)[] DEFAULT '{}'::character varying[]
);
CREATE INDEX index_products_on_additional_upcs ON products USING gin (additional_upcs);
CREATE INDEX index_products_on_deleted_at ON products USING btree (deleted_at);
CREATE INDEX index_products_on_name ON products USING btree (name);
CREATE INDEX index_products_on_product_grouping_id ON products USING btree (product_grouping_id);
CREATE INDEX index_products_on_prototype_id ON products USING btree (prototype_id);
CREATE INDEX index_products_on_upc ON products USING btree (upc);
ALTER TABLE ONLY products
ADD CONSTRAINT products_pkey PRIMARY KEY (id);
CREATE TABLE variants (
id integer NOT NULL,
product_id integer NOT NULL,
sku character varying(255) NOT NULL,
name character varying(255),
price numeric(8,2) DEFAULT 0.0 NOT NULL,
deleted_at timestamp without time zone,
supplier_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
inventory_id integer,
product_active boolean DEFAULT false NOT NULL,
original_name character varying(255),
original_item_volume character varying(255),
protected boolean DEFAULT false NOT NULL,
sale_price numeric(8,2) DEFAULT 0.0 NOT NULL
);
CREATE INDEX index_variants_on_inventory_id ON variants USING btree (inventory_id);
CREATE INDEX index_variants_on_product_id_and_deleted_at ON variants USING btree (product_id, deleted_at);
CREATE INDEX index_variants_on_sku ON variants USING btree (sku);
CREATE INDEX index_variants_on_state_attributes ON variants USING btree (deleted_at, product_active, protected, id);
CREATE INDEX index_variants_on_supplier_id ON variants USING btree (supplier_id);
ALTER TABLE ONLY variants
ADD CONSTRAINT variants_pkey PRIMARY KEY (id);
CREATE TABLE order_items (
id integer NOT NULL,
price numeric(8,2),
total numeric(8,2),
variant_id integer NOT NULL,
shipment_id integer,
created_at timestamp without time zone,
updated_at timestamp without time zone,
quantity integer DEFAULT 1
);
CREATE INDEX index_order_items_on_shipment_id ON order_items USING btree (shipment_id);
CREATE INDEX index_order_items_on_variant_id ON order_items USING btree (variant_id);
ALTER TABLE ONLY order_items
ADD CONSTRAINT order_items_pkey PRIMARY KEY (id);
CREATE TABLE shipments (
id integer NOT NULL,
order_id integer,
shipping_method_id integer NOT NULL,
number character varying,
state character varying(255) DEFAULT 'pending'::character varying NOT NULL,
created_at timestamp without time zone,
updated_at timestamp without time zone,
supplier_id integer,
confirmed_at timestamp without time zone,
canceled_at timestamp without time zone,
out_of_hours boolean DEFAULT false NOT NULL,
delivered_at timestamp without time zone,
uuid uuid DEFAULT uuid_generate_v4()
);
CREATE INDEX index_shipments_on_order_id_and_supplier_id ON shipments USING btree (order_id, supplier_id);
CREATE INDEX index_shipments_on_state ON shipments USING btree (state);
CREATE INDEX index_shipments_on_supplier_id ON shipments USING btree (supplier_id);
ALTER TABLE ONLY shipments
ADD CONSTRAINT shipments_pkey PRIMARY KEY (id);
CREATE TABLE orders (
id integer NOT NULL,
number character varying(255),
ip_address character varying(255),
state character varying(255),
ship_address_id integer,
active boolean DEFAULT true NOT NULL,
completed_at timestamp without time zone,
created_at timestamp without time zone,
updated_at timestamp without time zone,
tip_amount numeric(8,2) DEFAULT 0.0,
confirmed_at timestamp without time zone,
delivery_notes text,
cancelled_at timestamp without time zone,
courier boolean DEFAULT false NOT NULL,
scheduled_for timestamp without time zone,
client character varying(255),
subscription_id character varying(255),
pickup_detail_id integer
);
CREATE INDEX index_orders_on_bill_address_id ON orders USING btree (bill_address_id);
CREATE INDEX index_orders_on_completed_at ON orders USING btree (completed_at);
CREATE UNIQUE INDEX index_orders_on_number ON orders USING btree (number);
CREATE INDEX index_orders_on_ship_address_id ON orders USING btree (ship_address_id);
CREATE INDEX index_orders_on_state ON orders USING btree (state);
ALTER TABLE ONLY orders
ADD CONSTRAINT orders_pkey PRIMARY KEY (id);
Query plan:
Limit (cost=685117.80..685117.81 rows=10 width=595) (actual time=33659.659..33659.661 rows=10 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name, (count(product_groupings.id))
Buffers: shared hit=259132 read=85657, temp read=30892 written=30886
I/O Timings: read=5542.213
-> Sort (cost=685117.80..685117.81 rows=14 width=595) (actual time=33659.658..33659.659 rows=10 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name, (count(product_groupings.id))
Sort Key: (count(product_groupings.id))
Sort Method: top-N heapsort Memory: 30kB
Buffers: shared hit=259132 read=85657, temp read=30892 written=30886
I/O Timings: read=5542.213
-> HashAggregate (cost=685117.71..685117.75 rows=14 width=595) (actual time=33659.407..33659.491 rows=122 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name, count(product_groupings.id)
Buffers: shared hit=259129 read=85657, temp read=30892 written=30886
I/O Timings: read=5542.213
-> Hash Join (cost=453037.24..685117.69 rows=14 width=595) (actual time=26019.889..33658.886 rows=181 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name
Hash Cond: (order_items_often_purchased_with_join.variant_id = variants_often_purchased_with_join.id)
Buffers: shared hit=259129 read=85657, temp read=30892 written=30886
I/O Timings: read=5542.213
-> Hash Join (cost=452970.37..681530.70 rows=4693428 width=599) (actual time=22306.463..32908.056 rows=8417034 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name, order_items_often_purchased_with_join.variant_id
Hash Cond: (products.product_grouping_id = product_groupings.id)
Buffers: shared hit=259080 read=85650, temp read=30892 written=30886
I/O Timings: read=5540.529
-> Hash Join (cost=381952.28..493289.49 rows=5047613 width=8) (actual time=21028.128..25416.504 rows=8417518 loops=1)
Output: products.product_grouping_id, order_items_often_purchased_with_join.variant_id
Hash Cond: (order_items_often_purchased_with_join.shipment_id = shipments_often_purchased_with_join.id)
Buffers: shared hit=249520 read=77729
I/O Timings: read=5134.878
-> Seq Scan on public.order_items order_items_often_purchased_with_join (cost=0.00..82689.54 rows=4910847 width=8) (actual time=0.003..1061.456 rows=4909856 loops=1)
Output: order_items_often_purchased_with_join.shipment_id, order_items_often_purchased_with_join.variant_id
Buffers: shared hit=67957
-> Hash (cost=373991.27..373991.27 rows=2274574 width=8) (actual time=21027.220..21027.220 rows=2117538 loops=1)
Output: products.product_grouping_id, shipments_often_purchased_with_join.id
Buckets: 262144 Batches: 1 Memory Usage: 82717kB
Buffers: shared hit=181563 read=77729
I/O Timings: read=5134.878
-> Hash Join (cost=249781.35..373991.27 rows=2274574 width=8) (actual time=10496.552..20383.404 rows=2117538 loops=1)
Output: products.product_grouping_id, shipments_often_purchased_with_join.id
Hash Cond: (shipments.order_id = orders.id)
Buffers: shared hit=181563 read=77729
I/O Timings: read=5134.878
-> Hash Join (cost=118183.04..233677.13 rows=1802577 width=8) (actual time=6080.516..14318.439 rows=1899610 loops=1)
Output: products.product_grouping_id, shipments.order_id
Hash Cond: (variants.product_id = products.id)
Buffers: shared hit=107220 read=55876
I/O Timings: read=5033.540
-> Hash Join (cost=83249.21..190181.06 rows=1802577 width=8) (actual time=4526.391..11330.434 rows=1899808 loops=1)
Output: variants.product_id, shipments.order_id
Hash Cond: (order_items.variant_id = variants.id)
Buffers: shared hit=88026 read=44439
I/O Timings: read=4009.465
-> Hash Join (cost=40902.30..138821.27 rows=1802577 width=8) (actual time=3665.477..8553.803 rows=1899816 loops=1)
Output: order_items.variant_id, shipments.order_id
Hash Cond: (order_items.shipment_id = shipments.id)
Buffers: shared hit=56654 read=43022
I/O Timings: read=3872.065
-> Seq Scan on public.order_items (cost=0.00..82689.54 rows=4910847 width=8) (actual time=0.003..2338.108 rows=4909856 loops=1)
Output: order_items.variant_id, order_items.shipment_id
Buffers: shared hit=55987 read=11970
I/O Timings: read=1059.971
-> Hash (cost=38059.31..38059.31 rows=812284 width=8) (actual time=3664.973..3664.973 rows=834713 loops=1)
Output: shipments.id, shipments.order_id
Buckets: 131072 Batches: 1 Memory Usage: 32606kB
Buffers: shared hit=667 read=31052
I/O Timings: read=2812.094
-> Seq Scan on public.shipments (cost=0.00..38059.31 rows=812284 width=8) (actual time=0.017..3393.420 rows=834713 loops=1)
Output: shipments.id, shipments.order_id
Filter: ((shipments.state)::text <> ALL ('{pending,cancelled}'::text[]))
Rows Removed by Filter: 1013053
Buffers: shared hit=667 read=31052
I/O Timings: read=2812.094
-> Hash (cost=37200.34..37200.34 rows=1470448 width=8) (actual time=859.887..859.887 rows=1555657 loops=1)
Output: variants.product_id, variants.id
Buckets: 262144 Batches: 1 Memory Usage: 60768kB
Buffers: shared hit=31372 read=1417
I/O Timings: read=137.400
-> Seq Scan on public.variants (cost=0.00..37200.34 rows=1470448 width=8) (actual time=0.009..479.528 rows=1555657 loops=1)
Output: variants.product_id, variants.id
Buffers: shared hit=31372 read=1417
I/O Timings: read=137.400
-> Hash (cost=32616.92..32616.92 rows=661973 width=8) (actual time=1553.664..1553.664 rows=688697 loops=1)
Output: products.product_grouping_id, products.id
Buckets: 131072 Batches: 1 Memory Usage: 26903kB
Buffers: shared hit=19194 read=11437
I/O Timings: read=1024.075
-> Seq Scan on public.products (cost=0.00..32616.92 rows=661973 width=8) (actual time=0.011..1375.757 rows=688697 loops=1)
Output: products.product_grouping_id, products.id
Buffers: shared hit=19194 read=11437
I/O Timings: read=1024.075
-> Hash (cost=125258.00..125258.00 rows=1811516 width=12) (actual time=4415.081..4415.081 rows=1847746 loops=1)
Output: orders.id, shipments_often_purchased_with_join.order_id, shipments_often_purchased_with_join.id
Buckets: 262144 Batches: 1 Memory Usage: 79396kB
Buffers: shared hit=74343 read=21853
I/O Timings: read=101.338
-> Hash Join (cost=78141.12..125258.00 rows=1811516 width=12) (actual time=1043.228..3875.433 rows=1847746 loops=1)
Output: orders.id, shipments_often_purchased_with_join.order_id, shipments_often_purchased_with_join.id
Hash Cond: (shipments_often_purchased_with_join.order_id = orders.id)
Buffers: shared hit=74343 read=21853
I/O Timings: read=101.338
-> Seq Scan on public.shipments shipments_often_purchased_with_join (cost=0.00..37153.55 rows=1811516 width=8) (actual time=0.006..413.785 rows=1847766 loops=1)
Output: shipments_often_purchased_with_join.order_id, shipments_often_purchased_with_join.id
Buffers: shared hit=31719
-> Hash (cost=70783.52..70783.52 rows=2102172 width=4) (actual time=1042.239..1042.239 rows=2097229 loops=1)
Output: orders.id
Buckets: 262144 Batches: 1 Memory Usage: 73731kB
Buffers: shared hit=42624 read=21853
I/O Timings: read=101.338
-> Seq Scan on public.orders (cost=0.00..70783.52 rows=2102172 width=4) (actual time=0.012..553.606 rows=2097229 loops=1)
Output: orders.id
Buffers: shared hit=42624 read=21853
I/O Timings: read=101.338
-> Hash (cost=20222.66..20222.66 rows=637552 width=595) (actual time=1278.121..1278.121 rows=626176 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name
Buckets: 16384 Batches: 4 Memory Usage: 29780kB
Buffers: shared hit=9559 read=7921, temp written=10448
I/O Timings: read=405.651
-> Seq Scan on public.product_groupings (cost=0.00..20222.66 rows=637552 width=595) (actual time=0.020..873.844 rows=626176 loops=1)
Output: product_groupings.id, product_groupings.featured, product_groupings.searchable, product_groupings.state, product_groupings.brand_id, product_groupings.product_content_id, product_groupings.hierarchy_category_id, product_groupings.hierarchy_subtype_id, product_groupings.hierarchy_type_id, product_groupings.product_type_id, product_groupings.meta_description, product_groupings.meta_keywords, product_groupings.name, product_groupings.permalink, product_groupings.description, product_groupings.keywords, product_groupings.created_at, product_groupings.updated_at, product_groupings.tax_category_id, product_groupings.trimmed_name
Filter: ((product_groupings.id <> 99999) AND ((product_groupings.state)::text = 'active'::text))
Rows Removed by Filter: 48650
Buffers: shared hit=9559 read=7921
I/O Timings: read=405.651
-> Hash (cost=66.86..66.86 rows=4 width=4) (actual time=2.223..2.223 rows=30 loops=1)
Output: variants_often_purchased_with_join.id
Buckets: 1024 Batches: 1 Memory Usage: 2kB
Buffers: shared hit=49 read=7
I/O Timings: read=1.684
-> Nested Loop (cost=0.17..66.86 rows=4 width=4) (actual time=0.715..2.211 rows=30 loops=1)
Output: variants_often_purchased_with_join.id
Buffers: shared hit=49 read=7
I/O Timings: read=1.684
-> Index Scan using index_products_on_product_grouping_id on public.products products_often_purchased_with_join (cost=0.08..5.58 rows=2 width=4) (actual time=0.074..0.659 rows=6 loops=1)
Output: products_often_purchased_with_join.id
Index Cond: (products_often_purchased_with_join.product_grouping_id = 99999)
Buffers: shared hit=5 read=4
I/O Timings: read=0.552
-> Index Scan using index_variants_on_product_id_and_deleted_at on public.variants variants_often_purchased_with_join (cost=0.09..30.60 rows=15 width=8) (actual time=0.222..0.256 rows=5 loops=6)
Output: variants_often_purchased_with_join.id, variants_often_purchased_with_join.product_id
Index Cond: (variants_often_purchased_with_join.product_id = products_often_purchased_with_join.id)
Buffers: shared hit=44 read=3
I/O Timings: read=1.132
Total runtime: 33705.142 ms
Gained a significant ~20x increase in throughput using a sub-select:
SELECT product_groupings.*, count(product_groupings.id) AS product_groupings_count
FROM "product_groupings"
INNER JOIN "products" ON "products"."product_grouping_id" = "product_groupings"."id"
INNER JOIN "variants" ON "variants"."product_id" = "products"."id"
INNER JOIN "order_items" ON "order_items"."variant_id" = "variants"."id"
INNER JOIN "shipments" ON "shipments"."id" = "order_items"."shipment_id"
WHERE ("product_groupings"."id" != 99999)
AND "product_groupings"."state" = 'active'
AND ("shipments"."state" NOT IN ('pending', 'cancelled'))
AND ("shipments"."order_id" IN (
SELECT "shipments"."order_id"
FROM "shipments"
INNER JOIN "order_items" ON "order_items"."shipment_id" = "shipments"."id"
INNER JOIN "variants" ON "variants"."id" = "order_items"."variant_id"
INNER JOIN "products" ON "products"."id" = "variants"."product_id"
WHERE "products"."product_grouping_id" = 99999 AND ("shipments"."state" NOT IN ('pending', 'cancelled'))
GROUP BY "shipments"."order_id"
ORDER BY "shipments"."order_id" ASC
))
GROUP BY product_groupings.id
ORDER BY product_groupings_count desc
LIMIT 10
Although I'd welcome any further optimisations. :)
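One further candidate (an untested sketch, aliases invented): the ORDER BY inside the IN subquery does nothing, and the GROUP BY is usually pure overhead, since Postgres evaluates IN (subquery) as a semi-join that handles duplicates itself. The same predicate can also be written as a correlated EXISTS in place of the AND ("shipments"."order_id" IN (...)) block:

AND EXISTS (
  SELECT 1
  FROM "shipments" s2
  INNER JOIN "order_items" oi2 ON oi2."shipment_id" = s2."id"
  INNER JOIN "variants" v2 ON v2."id" = oi2."variant_id"
  INNER JOIN "products" p2 ON p2."id" = v2."product_id"
  WHERE p2."product_grouping_id" = 99999
    AND s2."state" NOT IN ('pending', 'cancelled')
    AND s2."order_id" = "shipments"."order_id"
)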
I have several large tables in Postgres 9.2 (millions of rows) where I need to generate a unique code based on the combination of two fields, 'source' (varchar) and 'id' (int). I can do this by generating row_numbers over the result of:
SELECT source,id FROM tablename GROUP BY source,id
but the results can take a while to process. It has been recommended that if the fields are indexed, and there are a proportionally small number of index values (which is my case), a loose index scan may be a better option: http://wiki.postgresql.org/wiki/Loose_indexscan
WITH RECURSIVE
t AS (SELECT min(col) AS col FROM tablename
UNION ALL
SELECT (SELECT min(col) FROM tablename WHERE col > t.col) FROM t WHERE t.col IS NOT NULL)
SELECT col FROM t WHERE col IS NOT NULL
UNION ALL
SELECT NULL WHERE EXISTS(SELECT * FROM tablename WHERE col IS NULL);
The example operates on a single field, though. Trying to return more than one field generates an error: "subquery must return only one column". One possibility might be to retrieve an entire ROW, e.g. SELECT ROW(min(source),min(id)..., but then I'm not sure what the syntax of the WHERE clause would need to look like to work with individual row elements.
The question is: can the recursion-based code be modified to work with more than one column, and if so, how? I'm committed to using Postgres, but it looks like MySQL has implemented loose index scans for more than one column: http://dev.mysql.com/doc/refman/5.1/en/group-by-optimization.html
As recommended, I'm attaching my EXPLAIN ANALYZE results.
For my situation - where I'm selecting distinct values for 2 columns using GROUP BY, it's the following:
HashAggregate (cost=1645408.44..1654099.65 rows=869121 width=34) (actual time=35411.889..36008.475 rows=1233080 loops=1)
-> Seq Scan on tablename (cost=0.00..1535284.96 rows=22024696 width=34) (actual time=4413.311..25450.840 rows=22025768 loops=1)
Total runtime: 36127.789 ms
(3 rows)
I don't know how to do a 2-column index scan (that's the question), but for purposes of comparison, using a GROUP BY on one column, I get:
HashAggregate (cost=1590346.70..1590347.69 rows=99 width=8) (actual time=32310.706..32310.722 rows=100 loops=1)
-> Seq Scan on tablename (cost=0.00..1535284.96 rows=22024696 width=8) (actual time=4764.609..26941.832 rows=22025768 loops=1)
Total runtime: 32350.899 ms
(3 rows)
But for a loose index scan on one column, I get:
Result (cost=181.28..198.07 rows=101 width=8) (actual time=0.069..1.935 rows=100 loops=1)
CTE t
-> Recursive Union (cost=1.74..181.28 rows=101 width=8) (actual time=0.062..1.855 rows=101 loops=1)
-> Result (cost=1.74..1.75 rows=1 width=0) (actual time=0.061..0.061 rows=1 loops=1)
InitPlan 1 (returns $1)
-> Limit (cost=0.00..1.74 rows=1 width=8) (actual time=0.057..0.057 rows=1 loops=1)
-> Index Only Scan using tablename_id on tablename (cost=0.00..38379014.12 rows=22024696 width=8) (actual time=0.055..0.055 rows=1 loops=1)
Index Cond: (id IS NOT NULL)
Heap Fetches: 0
-> WorkTable Scan on t (cost=0.00..17.75 rows=10 width=8) (actual time=0.017..0.017 rows=1 loops=101)
Filter: (id IS NOT NULL)
Rows Removed by Filter: 0
SubPlan 3
-> Result (cost=1.75..1.76 rows=1 width=0) (actual time=0.016..0.016 rows=1 loops=100)
InitPlan 2 (returns $3)
-> Limit (cost=0.00..1.75 rows=1 width=8) (actual time=0.016..0.016 rows=1 loops=100)
-> Index Only Scan using tablename_id on tablename (cost=0.00..12811462.41 rows=7341565 width=8) (actual time=0.015..0.015 rows=1 loops=100)
Index Cond: ((id IS NOT NULL) AND (id > t.id))
Heap Fetches: 0
-> Append (cost=0.00..16.79 rows=101 width=8) (actual time=0.067..1.918 rows=100 loops=1)
-> CTE Scan on t (cost=0.00..2.02 rows=100 width=8) (actual time=0.067..1.899 rows=100 loops=1)
Filter: (id IS NOT NULL)
Rows Removed by Filter: 1
-> Result (cost=13.75..13.76 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
One-Time Filter: $5
InitPlan 5 (returns $5)
-> Index Only Scan using tablename_id on tablename (cost=0.00..13.75 rows=1 width=0) (actual time=0.002..0.002 rows=0 loops=1)
Index Cond: (id IS NULL)
Heap Fetches: 0
Total runtime: 2.040 ms
The full table definition looks like this:
CREATE TABLE tablename
(
source character(25),
id bigint NOT NULL,
time_ timestamp without time zone,
height numeric,
lon numeric,
lat numeric,
distance numeric,
status character(3),
geom geometry(PointZ,4326),
relid bigint
)
WITH (
OIDS=FALSE
);
CREATE INDEX tablename_height
ON public.tablename
USING btree
(height);
CREATE INDEX tablename_geom
ON public.tablename
USING gist
(geom);
CREATE INDEX tablename_id
ON public.tablename
USING btree
(id);
CREATE INDEX tablename_lat
ON public.tablename
USING btree
(lat);
CREATE INDEX tablename_lon
ON public.tablename
USING btree
(lon);
CREATE INDEX tablename_relid
ON public.tablename
USING btree
(relid);
CREATE INDEX tablename_sid
ON public.tablename
USING btree
(source COLLATE pg_catalog."default", id);
CREATE INDEX tablename_source
ON public.tablename
USING btree
(source COLLATE pg_catalog."default");
CREATE INDEX tablename_time
ON public.tablename
USING btree
(time_);
Answer selection:
I took some time comparing the approaches that were provided. It's at times like this that I wish more than one answer could be accepted, but in this case, I'm giving the tick to @jjanes. The reason is that his solution matches the question as originally posed more closely, and I was able to get some insights into the form of the required WHERE clause. In the end, the HashAggregate is actually the fastest approach (for me), but that's due to the nature of my data, not any problems with the algorithms. I've attached the EXPLAIN ANALYZE output for the different approaches below, and will be giving +1 to both @jjanes and @joop.
HashAggregate:
HashAggregate (cost=1018669.72..1029722.08 rows=1105236 width=34) (actual time=24164.735..24686.394 rows=1233080 loops=1)
-> Seq Scan on tablename (cost=0.00..908548.48 rows=22024248 width=34) (actual time=0.054..14639.931 rows=22024982 loops=1)
Total runtime: 24787.292 ms
Loose Index Scan modification:
CTE Scan on t (cost=13.84..15.86 rows=100 width=112) (actual time=0.916..250311.164 rows=1233080 loops=1)
Filter: (source IS NOT NULL)
Rows Removed by Filter: 1
CTE t
-> Recursive Union (cost=0.00..13.84 rows=101 width=112) (actual time=0.911..249295.872 rows=1233081 loops=1)
-> Limit (cost=0.00..0.04 rows=1 width=34) (actual time=0.910..0.911 rows=1 loops=1)
-> Index Only Scan using tablename_sid on tablename (cost=0.00..965442.32 rows=22024248 width=34) (actual time=0.908..0.908 rows=1 loops=1)
Heap Fetches: 0
-> WorkTable Scan on t (cost=0.00..1.18 rows=10 width=112) (actual time=0.201..0.201 rows=1 loops=1233081)
Filter: (source IS NOT NULL)
Rows Removed by Filter: 0
SubPlan 1
-> Limit (cost=0.00..0.05 rows=1 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
-> Index Only Scan using tablename_sid on tablename (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.100..0.100 rows=1 loops=1233080)
Index Cond: (ROW(source, id) > ROW(t.source, t.id))
Heap Fetches: 0
SubPlan 2
-> Limit (cost=0.00..0.05 rows=1 width=34) (actual time=0.099..0.099 rows=1 loops=1233080)
-> Index Only Scan using tablename_sid on tablename (cost=0.00..340173.38 rows=7341416 width=34) (actual time=0.098..0.098 rows=1 loops=1233080)
Index Cond: (ROW(source, id) > ROW(t.source, t.id))
Heap Fetches: 0
Total runtime: 250491.559 ms
Merge Anti Join:
Merge Anti Join (cost=0.00..12099015.26 rows=14682832 width=42) (actual time=48.710..541624.677 rows=1233080 loops=1)
Merge Cond: ((src.source = nx.source) AND (src.id = nx.id))
Join Filter: (nx.time_ > src.time_)
Rows Removed by Join Filter: 363464177
-> Index Only Scan using tablename_pkey on tablename src (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=48.566..5064.551 rows=22024982 loops=1)
Heap Fetches: 0
-> Materialize (cost=0.00..1115255.89 rows=22024248 width=42) (actual time=0.011..40551.997 rows=363464177 loops=1)
-> Index Only Scan using tablename_pkey on tablename nx (cost=0.00..1060195.27 rows=22024248 width=42) (actual time=0.008..8258.890 rows=22024982 loops=1)
Heap Fetches: 0
Total runtime: 541750.026 ms
Rather hideous, but this seems to work:
WITH RECURSIVE
t AS (
select a,b from (select a,b from foo order by a,b limit 1) asdf union all
select (select a from foo where (a,b) > (t.a,t.b) order by a,b limit 1),
(select b from foo where (a,b) > (t.a,t.b) order by a,b limit 1)
from t where t.a is not null)
select * from t where t.a is not null;
I don't really understand why the "is not null"s are needed; where do the nulls come from in the first place?
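(The NULLs come from the scalar subqueries in the select list: once the last (a,b) is reached, each subquery returns NULL rather than no row, so a (NULL, NULL) row enters t, and the IS NOT NULL guards are what stop the recursion and filter that row out.) A tidier variant (an untested sketch, same idea as the LATERAL answer below) fetches both columns in one subquery with a row comparison, and terminates naturally because an empty lateral result produces no row at all:

WITH RECURSIVE t AS (
  (SELECT a, b FROM foo ORDER BY a, b LIMIT 1)
  UNION ALL
  SELECT n.a, n.b
  FROM t
  CROSS JOIN LATERAL (
    SELECT a, b FROM foo
    WHERE (a, b) > (t.a, t.b)  -- row-wise comparison, can use an index on (a, b)
    ORDER BY a, b
    LIMIT 1
  ) n
)
SELECT * FROM t;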
DROP SCHEMA zooi CASCADE;
CREATE SCHEMA zooi ;
SET search_path=zooi,public,pg_catalog;
CREATE TABLE tablename
( source character(25) NOT NULL
, id bigint NOT NULL
, time_ timestamp without time zone NOT NULL
, height numeric
, lon numeric
, lat numeric
, distance numeric
, status character(3)
, geom geometry(PointZ,4326)
, relid bigint
, PRIMARY KEY (source,id,time_) -- <<-- Primary key here
) WITH ( OIDS=FALSE);
-- invent some bogus data
INSERT INTO tablename(source,id,time_)
SELECT 'SRC_'|| (gs%10)::text
,gs/10
,gt
FROM generate_series(1,1000) gs
, generate_series('2013-12-01', '2013-12-07', '1hour'::interval) gt
;
Select unique values for two key fields:
VACUUM ANALYZE tablename;
EXPLAIN ANALYZE
SELECT source,id,time_
FROM tablename src
WHERE NOT EXISTS (
SELECT * FROM tablename nx
WHERE nx.source =src.source
AND nx.id = src.id
AND time_ > src.time_
)
;
Generates this plan here (Pg-9.3):
QUERY PLAN
----------------------------------------------------------------------------------------------------------------------------------
Hash Anti Join (cost=4981.00..12837.82 rows=96667 width=42) (actual time=547.218..1194.335 rows=1000 loops=1)
Hash Cond: ((src.source = nx.source) AND (src.id = nx.id))
Join Filter: (nx.time_ > src.time_)
Rows Removed by Join Filter: 145000
-> Seq Scan on tablename src (cost=0.00..2806.00 rows=145000 width=42) (actual time=0.010..210.810 rows=145000 loops=1)
-> Hash (cost=2806.00..2806.00 rows=145000 width=42) (actual time=546.497..546.497 rows=145000 loops=1)
Buckets: 16384 Batches: 1 Memory Usage: 9063kB
-> Seq Scan on tablename nx (cost=0.00..2806.00 rows=145000 width=42) (actual time=0.006..259.864 rows=145000 loops=1)
Total runtime: 1197.374 ms
(9 rows)
The hash-joins will probably disappear once the data outgrows the work_mem:
Merge Anti Join (cost=0.83..8779.56 rows=29832 width=120) (actual time=0.981..2508.912 rows=1000 loops=1)
Merge Cond: ((src.source = nx.source) AND (src.id = nx.id))
Join Filter: (nx.time_ > src.time_)
Rows Removed by Join Filter: 184051
-> Index Scan using tablename_sid on tablename src (cost=0.41..4061.57 rows=32544 width=120) (actual time=0.055..250.621 rows=145000 loops=1)
-> Index Scan using tablename_sid on tablename nx (cost=0.41..4061.57 rows=32544 width=120) (actual time=0.008..603.403 rows=328906 loops=1)
Total runtime: 2510.505 ms
Lateral joins can give you clean code for selecting multiple columns in nested selects, without checking for NULL, since there are no subqueries in the select clause:
-- Assuming you want to get one '(a,b)' for every 'a'.
with recursive t as (
(select a, b from foo order by a, b limit 1)
union all
(select s.* from t, lateral(
select a, b from foo f
where f.a > t.a
order by a, b limit 1) s)
)
select * from t;
I'm using PostgreSQL 8.4. What is the best way to optimize this query?
SELECT DISTINCT campaigns.* FROM campaigns
LEFT JOIN adverts ON campaign_id = campaigns.id
LEFT JOIN shops ON campaigns.shop_id = shops.id
LEFT JOIN exports_adverts ON advert_id = adverts.id
LEFT JOIN exports ON export_id = exports.id
LEFT JOIN rotations ON rotations.advert_id = adverts.id
LEFT JOIN blocks ON block_id = blocks.id
WHERE
(shops.is_active = TRUE)
AND exports.user_id = any(uids)
OR blocks.user_id = any(uids)
AND campaigns.id = any(ids)
My tables are:
CREATE TABLE campaigns (
id integer NOT NULL,
shop_id integer NOT NULL,
title character varying NOT NULL,
...
);
CREATE TABLE adverts (
id integer NOT NULL,
campaign_id integer NOT NULL,
title character varying NOT NULL,
...
);
CREATE TABLE shops (
id integer NOT NULL,
title character varying NOT NULL,
is_active boolean DEFAULT true NOT NULL,
...
);
CREATE TABLE exports (
id integer NOT NULL,
title character varying,
user_id integer NOT NULL,
...
);
CREATE TABLE exports_adverts (
id integer NOT NULL,
export_id integer NOT NULL,
advert_id integer NOT NULL,
...
);
CREATE TABLE rotations (
id integer NOT NULL,
block_id integer NOT NULL,
advert_id integer NOT NULL,
...
);
CREATE TABLE blocks (
id integer NOT NULL,
title character varying NOT NULL,
user_id integer NOT NULL,
...
);
I already have an index on every field used in this query. Is there anything I can do to optimize it?
EXPLAIN for this query:
Unique (cost=284529.95..321207.47 rows=57088 width=106) (actual time=508048.104..609870.600 rows=106 loops=1)
-> Sort (cost=284529.95..286567.59 rows=815056 width=106) (actual time=508048.102..602413.688 rows=8354563 loops=1)
Sort Key: campaigns.id, campaigns.shop_id, campaigns.title
Sort Method: external merge Disk: 1017136kB
-> Hash Left Join (cost=2258.33..62419.56 rows=815056 width=106) (actual time=49.509..17510.009 rows=8354563 loops=1)
Hash Cond: (rotations.block_id = blocks.id)
-> Merge Right Join (cost=1719.44..44560.73 rows=815056 width=110) (actual time=42.194..12317.422 rows=8354563 loops=1)
Merge Cond: (rotations.advert_id = adverts.id)
-> Index Scan using rotations_advert_id_key on rotations (cost=0.00..29088.30 rows=610999 width=8) (actual time=0.040..3026.898 rows=610999 loops=1)
-> Sort (cost=1719.44..1737.90 rows=7386 width=110) (actual time=42.144..3965.416 rows=8354563 loops=1)
Sort Key: adverts.id
Sort Method: external sort Disk: 1336kB
-> Hash Left Join (cost=739.01..1244.87 rows=7386 width=110) (actual time=10.519..21.351 rows=10571 loops=1)
Hash Cond: (exports_adverts.export_id = exports.id)
-> Hash Left Join (cost=715.60..1119.90 rows=7386 width=114) (actual time=10.178..17.472 rows=10571 loops=1)
Hash Cond: (adverts.id = exports_adverts.advert_id)
-> Hash Left Join (cost=304.71..433.53 rows=2781 width=110) (actual time=3.614..5.106 rows=3035 loops=1)
Hash Cond: (campaigns.id = adverts.campaign_id)
-> Hash Join (cost=1.13..9.32 rows=112 width=106) (actual time=0.051..0.303 rows=106 loops=1)
Hash Cond: (campaigns.shop_id = shops.id)
-> Seq Scan on campaigns (cost=0.00..6.23 rows=223 width=106) (actual time=0.011..0.150 rows=223 loops=1)
-> Hash (cost=1.08..1.08 rows=4 width=4) (actual time=0.015..0.015 rows=4 loops=1)
-> Seq Scan on shops (cost=0.00..1.08 rows=4 width=4) (actual time=0.010..0.012 rows=4 loops=1)
Filter: is_active
-> Hash (cost=234.37..234.37 rows=5537 width=8) (actual time=3.546..3.546 rows=5537 loops=1)
-> Seq Scan on adverts (cost=0.00..234.37 rows=5537 width=8) (actual time=0.010..2.200 rows=5537 loops=1)
-> Hash (cost=227.06..227.06 rows=14706 width=8) (actual time=6.532..6.532 rows=14706 loops=1)
-> Seq Scan on exports_adverts (cost=0.00..227.06 rows=14706 width=8) (actual time=0.016..3.028 rows=14706 loops=1)
-> Hash (cost=14.85..14.85 rows=685 width=4) (actual time=0.311..0.311 rows=685 loops=1)
-> Seq Scan on exports (cost=0.00..14.85 rows=685 width=4) (actual time=0.014..0.156 rows=685 loops=1)
-> Hash (cost=368.95..368.95 rows=13595 width=4) (actual time=7.281..7.281 rows=13595 loops=1)
-> Seq Scan on blocks (cost=0.00..368.95 rows=13595 width=4) (actual time=0.027..3.990 rows=13595 loops=1)
Splitting the OR into 2 UNION queries may help (UNION removes duplicates, so no DISTINCT is needed):
SELECT campaigns.*
FROM campaigns
LEFT JOIN adverts ON campaign_id = campaigns.id
LEFT JOIN shops ON campaigns.shop_id = shops.id
LEFT JOIN exports_adverts ON advert_id = adverts.id
LEFT JOIN exports ON export_id = exports.id
LEFT JOIN rotations ON rotations.advert_id = adverts.id
LEFT JOIN blocks ON block_id = blocks.id
WHERE
(shops.is_active = TRUE)
AND exports.user_id = any(uids)
UNION
SELECT campaigns.*
FROM campaigns
LEFT JOIN adverts ON campaign_id = campaigns.id
LEFT JOIN shops ON campaigns.shop_id = shops.id
LEFT JOIN exports_adverts ON advert_id = adverts.id
LEFT JOIN exports ON export_id = exports.id
LEFT JOIN rotations ON rotations.advert_id = adverts.id
LEFT JOIN blocks ON block_id = blocks.id
WHERE
blocks.user_id = any(uids)
AND campaigns.id = any(ids)
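A possible further refinement (a sketch, untested against your data): in each branch the WHERE clause rejects the NULL-extended rows anyway, so the left joins it touches act as inner joins, and the joins a branch never references only produce duplicate rows that UNION removes again. Each branch can therefore be trimmed down to the tables it actually uses:

SELECT campaigns.*
FROM campaigns
INNER JOIN shops ON campaigns.shop_id = shops.id
INNER JOIN adverts ON adverts.campaign_id = campaigns.id
INNER JOIN exports_adverts ON exports_adverts.advert_id = adverts.id
INNER JOIN exports ON exports.id = exports_adverts.export_id
WHERE shops.is_active = TRUE
AND exports.user_id = any(uids)
UNION
SELECT campaigns.*
FROM campaigns
INNER JOIN adverts ON adverts.campaign_id = campaigns.id
INNER JOIN rotations ON rotations.advert_id = adverts.id
INNER JOIN blocks ON blocks.id = rotations.block_id
WHERE blocks.user_id = any(uids)
AND campaigns.id = any(ids)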