I am working on Posrgres 9.6 with PostGIS 2.3, hosted on AWS RDS. I'm trying to optimize some geo-radius queries for data that comes from different tables.
I'm considering two approaches: single query with multiple joins or two separate but simpler queries.
At a high level, and simplifying the structure, my schema is:
CREATE EXTENSION "uuid-ossp";
CREATE EXTENSION IF NOT EXISTS postgis;
CREATE TABLE addresses (
id bigint NOT NULL,
latitude double precision,
longitude double precision,
line1 character varying NOT NULL,
"position" geography(Point,4326),
CONSTRAINT enforce_srid CHECK ((st_srid("position") = 4326))
);
CREATE INDEX index_addresses_on_position ON addresses USING gist ("position");
CREATE TABLE locations (
id bigint NOT NULL,
uuid uuid DEFAULT uuid_generate_v4() NOT NULL,
address_id bigint NOT NULL
);
CREATE TABLE shops (
id bigint NOT NULL,
name character varying NOT NULL,
location_id bigint NOT NULL
);
CREATE TABLE inventories (
id bigint NOT NULL,
shop_id bigint NOT NULL,
status character varying NOT NULL
);
The addresses table holds the geographical data. The position column is calculated from the lat-lng columns when the rows are inserted or updated.
Each address is associated to one location.
Each address may have many shops, and each shop will have one inventory.
I've omitted them for brevity, but all the tables have the proper foreign key constraints and btree indexes on the reference columns.
The tables have a few hundreds of thousands of rows.
With that in place, my main use case can be satisfied by this single query, which searches for addresses within 1000 meters from a central geographical point (10.0, 10.0) and returns data from all the tables:
SELECT
s.id AS shop_id,
s.name AS shop_name,
i.status AS inventory_status,
l.uuid AS location_uuid,
a.line1 AS addr_line,
a.latitude AS lat,
a.longitude AS lng
FROM addresses a
JOIN locations l ON l.address_id = a.id
JOIN shops s ON s.location_id = l.id
JOIN inventories i ON i.shop_id = s.id
WHERE ST_DWithin(
a.position, -- the position of each address
ST_SetSRID(ST_Point(10.0, 10.0), 4326), -- the center of the circle
1000, -- radius distance in meters
true
);
This query works, and EXPLAIN ANALYZE shows that it does correctly use the GIST index.
However, I could also split this query in two and manage the intermediate results in the application layer. For example, this works too:
--- only search for the addresses
SELECT
a.id as addr_id,
a.line1 AS addr_line,
a.latitude AS lat,
a.longitude AS lng
FROM addresses a
WHERE ST_DWithin(
a.position, -- the position of each address
ST_SetSRID(ST_Point(10.0, 10.0), 4326), -- the center of the circle
1000, -- radius distance in meters
true
);
--- get the rest of the data
SELECT
s.id AS shop_id,
s.name AS shop_name,
i.status AS inventory_status,
l.id AS location_id,
l.uuid AS location_uuid
FROM locations l
JOIN shops s ON s.location_id = l.id
JOIN inventories i ON i.shop_id = s.id
WHERE
l.address_id IN (1, 2, 3, 4, 5) -- potentially thousands of values
;
where the values in l.address_id IN (1, 2, 3, 4, 5) come from the first query.
The query plans for the two split queries look simpler than the first one's, but I wonder if that in itself means that the second solution is better.
I know that inner joins are pretty well optimized, and that a single round-trip to the DB would be preferable.
What about memory usage? Or resource contention on the tables? (e.g. locks)
I (re-)combined your second code into a single query, using IN(...):
--- get the rest of the data
SELECT
s.id AS shop_id,
s.name AS shop_name,
i.status AS inventory_status,
l.id AS location_id,
l.uuid AS location_uuid
FROM locations l
JOIN shops s ON s.location_id = l.id
JOIN inventories i ON i.shop_id = s.id
WHERE l.address_id IN ( --- only search for the addresses
SELECT a.id
FROM addresses a
WHERE ST_DWithin(a.position, ST_SetSRID(ST_Point(10.0, 10.0), 4326), 1000 true)
);
Or, similarly ,using EXISTS(...):
--- get the rest of the data
SELECT
s.id AS shop_id,
s.name AS shop_name,
i.status AS inventory_status,
l.id AS location_id,
l.uuid AS location_uuid
FROM locations l
JOIN shops s ON s.location_id = l.id
JOIN inventories i ON i.shop_id = s.id
WHERE EXISTS ( SELECT * --- only search for the addresses
FROM addresses a
WHERE a.id = l.address_id
AND ST_DWithin( a.position, ST_SetSRID(ST_Point(10.0, 10.0), 4326), 1000, true)
);
Related
I'm trying to join several tables and pull out each DISTINCT root record (from table_a), but for some reason I keep getting duplicates. Here is my select query:
Fiddle
select
ta.id,
ta.table_a_name as "tableName"
from my_schema.table_a ta
left join my_schema.table_b tb
on (tb.table_a_id = ta.id)
left join my_schema.table_c tc
on (tc.table_b_id = tb.id)
left join my_schema.table_d td
on (td.id = any(tc.table_d_ids))
where td.id = any(array[100]);
This returns the following:
[
{
"id": 2,
"tableName": "Root record 2"
},
{
"id": 2,
"tableName": "Root record 2"
}
]
But I am only expecting, in this case,
[
{
"id": 2,
"tableName": "Root record 2"
}
]
What am I doing wrong here?
Here's the fiddle and, just in case, the create and insert statements below:
create schema if not exists my_schema;
create table if not exists my_schema.table_a (
id serial primary key,
table_a_name varchar (255) not null
);
create table if not exists my_schema.table_b (
id serial primary key,
table_a_id bigint not null references my_schema.table_a (id)
);
create table if not exists my_schema.table_d (
id serial primary key
);
create table if not exists my_schema.table_c (
id serial primary key,
table_b_id bigint not null references my_schema.table_b (id),
table_d_ids bigint[] not null
);
insert into my_schema.table_a values
(1, 'Root record 1'),
(2, 'Root record 2'),
(3, 'Root record 3');
insert into my_schema.table_b values
(10, 2),
(11, 2),
(12, 3);
insert into my_schema.table_d values
(100),
(101),
(102),
(103),
(104);
insert into my_schema.table_c values
(1000, 10, array[]::int[]),
(1001, 10, array[100]),
(1002, 11, array[100, 101]),
(1003, 12, array[102]),
(1004, 12, array[103]);
Short answer is use distinct, and this will get the results you want:
select distinct
ta.id,
ta.table_a_name as "tableName"
from my_schema.table_a ta
left join my_schema.table_b tb
on (tb.table_a_id = ta.id)
left join my_schema.table_c tc
on (tc.table_b_id = tb.id)
left join my_schema.table_d td
on (td.id = any(tc.table_d_ids))
where td.id = any(array[100]);
That said, this doesn't sit well with me because I assume this is not the end of your query.
The root issue is that you have two records from table_b - table_d that match this criteria. If you follow the breadcrumbs back, you will see there really are two matches:
select
ta.id,
ta.table_a_name as "tableName", tb.*, tc.*, td.*
from my_schema.table_a ta
left join my_schema.table_b tb
on (tb.table_a_id = ta.id)
left join my_schema.table_c tc
on (tc.table_b_id = tb.id)
left join my_schema.table_d td
on (td.id = any(tc.table_d_ids))
where td.id = any(array[100]);
So 'distinct' is just a lazy fix to say if there are dupes, limit it to one...
My next question is, is there more to it than this? What's supposed to happen next? Do you really just want candidates from table_a, or is this part 1 of a longer issue? If there is more to it, then there is likely a better solution than a simple select distinct.
-- edit 10/1/2022 --
Based on your comment, I have one final suggestion. Because this really all there is to your output AND you don't actually need the data from the b/c/d tables, then I think a semi-join is a better solution.
It's slightly more code (not going to win any golf or de-obfuscation contents), but it's much more efficient than a distinct or group by all columns. The reason is a distinct pulls every row result and then has to order and remove dupes. A semi-join, by contrast, will "stop looking" once it finds a match. It also scales very well. Almost every time I see a distinct misused, it's better served by a semi-join.
select
ta.id,
ta.table_a_name as "tableName"
from my_schema.table_a ta
where exists (
select null
from
table_b tb,
table_c tc,
table_d tc
where
tb.table_a_id = ta.id and
tc.table_b_id = tb.id and
td.id = any(tc.table_d_ids) and
td.id = any(array[100])
)
I didn't suggest this initially because I was unclear on the "what next."
I have 3 tables, Transaction, Transaction_Items and Transaction_History.
Where the Transaction is the parent table, while Transaction_Items and Transaction_History are the children tables, with one to many relationship.
When i try to join those tables together, if i have 2+ Transaction_History records, or 2+ Transaction_Items i get duplicated or triplicated record results.
This is the SQL query im currently using which works, but what worries me that in the future if i have to Join another one-to-many table, it will duplicate the results again.
I found a workaround for this, but i was just wondering if there is a better and cleaner way to do this ?
The results should be a PostgreSQL JSON array which will contain the Transaction_Items and Transaction_History
SELECT
TR.id AS transaction_id,
TR.transaction_number,
TR.status,
TR.status AS status,
to_json(TR_INV.list),
COUNT(TR_INV) item_cnt,
COUNT(THR) tr_cnt,
json_agg(THR)
FROM transaction_transaction AS TR
LEFT JOIN (
SELECT
array_agg(t) list, -- this is a workaround method
t.transaction_id
FROM (
SELECT
TR_INV.transaction_id transaction_id,
IT.id,
IT.stock_number,
CAT.key category_key,
ITP.description description,
ITP.serial_number serial_number,
ITP.color color,
ITP.manufacturer manufacturer,
ITP.inventory_model inventory_model,
ITP.average_cost average_cost,
ITP.location_in_store location_in_store,
ITP.firearm_caliber firearm_caliber,
ITP.federal_firearm_number federal_firearm_number,
ITP.sold_price sold_price
FROM transaction_transaction_item TR_INV
LEFT JOIN inventory_item IT ON IT.id = TR_INV.item_id
LEFT JOIN inventory_itemprofile ITP ON ITP.id = IT.current_profile_id
LEFT JOIN inventory_category CAT ON CAT.id = ITP.category_id
LEFT JOIN inventory_categorytype CAT_T ON CAT_T.id = CAT.category_type_id
) t
GROUP BY t.transaction_id
) TR_INV ON TR_INV.transaction_id = TR.id
LEFT JOIN transaction_transactionhistory THR ON THR.transaction_id = TR.id
AND (THR.audit_code_id = 44 OR THR.audit_code_id = 27 OR THR.audit_code_id = 28)
WHERE TR.store_id = 21
AND TR.transaction_type = 'Pawn_Loan' AND TR.date_made >= '2018-10-08'
GROUP BY TR.id, TR_INV.list
What you want to do can be achieved by not using joins, as shown below.
Because your actual tables have so many columns that I don't know and should not care. I just created the simplest forms of them for demonstration.
CREATE TABLE transactions (
tid serial PRIMARY KEY,
name varchar(40) NOT NULL
);
CREATE TABLE transaction_histories (
hid serial PRIMARY KEY ,
tid integer REFERENCES transactions(tid),
history varchar(40) NOT NULL
);
CREATE TABLE transaction_items (
iid serial PRIMARY KEY ,
tid integer REFERENCES transactions(tid),
item varchar(40) NOT NULL
);
INSERT INTO transactions(tid,name) Values(1, 'transaction');
INSERT INTO transaction_histories(tid, history) Values(1, 'history1');
INSERT INTO transaction_histories(tid, history) Values(1, 'history2');
INSERT INTO transaction_items(tid, item) Values(1, 'item1');
INSERT INTO transaction_items(tid, item) Values(1, 'item2');
select
t.*,
(select count(*) from transaction_histories h where h.tid= t.tid) h_count ,
(select json_agg(h) from transaction_histories h where h.tid= t.tid) h ,
(select count(*) from transaction_items i where i.tid= t.tid) i_count ,
(select json_agg(i) from transaction_items i where i.tid= t.tid) i
from transactions t;
There are three tables: businesses, categories, categorizations,
CREATE TABLE businesses (
id SERIAL PRIMARY KEY,
name varchar(40)
);
CREATE TABLE categories (
id SERIAL PRIMARY KEY,
name varchar(40)
);
CREATE TABLE categorizations (
business_id integer,
category_id integer
);
So business has many categories through categorizations.
If I want to select businesses without categories, I would do something
like this:
SELECT businesses.* FROM businesses
LEFT OUTER JOIN categorizations
ON categorizations.business_id = businesses.id
LEFT OUTER JOIN categories
ON categories.id = categorizations.category_id
GROUP BY businesses.id
HAVING count(categories.id) = 0;
The question is: How do I select businesses without categories AND
businesses with category named "Media" in one query?
You can use a union:
SELECT businesses.*
FROM businesses
LEFT OUTER JOIN categorizations
ON categorizations.business_id = businesses.id
GROUP BY businesses.id
HAVING count(categorizations.business_id) = 0
UNION
SELECT businesses.*
FROM businesses
INNER JOIN categorizations
ON categorizations.business_id = businesses.id
INNER JOIN categories
ON categories.id = categorizations.category_id
WHERE categories.name = 'Media';
Note that in the first instance (businesses with no categories at all) that you won't need to join as far as categories - you can detect the lack of category in the junction table. If it is possible for the same business to have the same category more than once, you'll need to introduce the second query with DISTINCT.
I would try:
SELECT b.* FROM businesses b
LEFT JOIN categorizations cz ON b.business_id = cz.business_id
LEFT JOIN categories cs ON cz.category_id = cs.category_id
WHERE COALESCE(cs.name, 'Media') = 'Media';
... in the hope that businesses with no categorizations would get NULL entries on their joins.
The double-negation trick works for this kind of selections:
SELECT * FROM businesses b
WHERE NOT EXISTS (
SELECT *
FROM categorizations bc
JOIN categories c ON bc.category_id = c.category_id
WHERE bc.business_id = b.business_id
AND c.name <> 'Media'
);
I am trying to create a country_name, and country cid pair between each country that are neighbours:
Here's the schema:
CREATE TABLE country (
cid INTEGER PRIMARY KEY,
cname VARCHAR(20) NOT NULL,
height INTEGER NOT NULL,
population INTEGER NOT NULL);
CREATE TABLE neighbour (
country INTEGER REFERENCES country(cid) ON DELETE RESTRICT,
neighbor INTEGER REFERENCES country(cid) ON DELETE RESTRICT,
length INTEGER NOT NULL,
PRIMARY KEY(country, neighbor));
My query:
create view neighbour_pair as (
select c1.cid, c1.cname, c2.cid, c2.cname
from neighbour n join country c1 on c1.cid = n.country
join country c2 on n.neighbor = c2.cid);
I am getting error code 42701 which means that there is a duplicate column.
The actual error message I am getting is:
ERROR: column "cid" specified more than once
********** Error **********
ERROR: column "cid" specified more than once
SQL state: 42701
I am unsure how to go around the error problem since I WANT the pair of neighbour countries with the country name and their cid.
Nevermind. I edited the first line of the query and changed the column names
create view neighbour_pair as
select c1.cid as c1cid, c1.cname as c1name, c2.cid as c2cid, c2.cname as c2name
from neighbour n join country c1 on c1.cid = n.country
join country c2 on n.neighbor = c2.cid;
I ran into a similar issue recently. I had a query like:
CREATE VIEW pairs AS
SELECT p.id, p.name,
(SELECT count(id) from results
where winner = p.id),
(SELECT count(id) from results
where winner = p.id OR loser = p.id)
FROM players p LEFT JOIN matches m ON p.id = m.id
GROUP BY 1,2;
The error was telling me: ERROR: column "count" specified more than once. The query WAS working via psycopg2, however when I brought it into a .sql file for testing the error arose.
I realized I just needed to alias the 2 count subqueries:
CREATE VIEW pairs AS
SELECT p.id, p.name,
(SELECT count(id) from results
where winner = p.id) as wins,
(SELECT count(id) from results
where winner = p.id OR loser = p.id) as matches
FROM players p LEFT JOIN matches m ON p.id = m.id
GROUP BY 1,2;
You can use alias with AS:
For example your view could be as follows:
create view neighbour_pair as
(
select c1.**cid**
, c1.cname
, c2.**cid AS cid_c2**
, c2.cname
from neighbour n
join country c1 on c1.cid = n.country
join country c2 on n.neighbor = c2.cid
);
The problem: I need to select, for each building in my table that has say at least 2 pharmacies and 2 education centers within a radius of 1km, all POIs (pharmacies, comercial centres, medical centers, education centers, police stations, fire stations) which are within 1km of the respective building. table structure->
building (id serial, name varchar )
poi_category(id serial, cname varchar) --cname being the category name of course
poi(id serial, name varchar, c_id integer)-- c_id is the FK referencing poi_category(id)
all coordinate columns are of type geometry not geography (let's call them geom)
here's the way i thought it should be done but i'm not sure it's even correct let alone the optimal solution to this problem
SELECT r.id_b, r.id_p
FROM (
SELECT b.id AS id_b, p.id AS id_p, pc.id AS id_pc,pc.cname
FROM building AS b, poi AS p, poi_category AS pc
WHERE ST_DWithin(b.geom,p.geom, 1000) AND p.c_id=pc.id
) AS r,
(
SELECT * FROM r GROUP BY id_b
) AS r1
HAVING count (
SELECT *
FROM r, r1
WHERE r1.id_b=r.id_b AND r.id_pc='pharmacy'
)>1
AND
count (
SELECT *
FROM r, r1
WHERE r1.id_b=r.id_b AND r.id_pc='ed. centre'
)>1
Is this the way to go for what i need ? What solution would be better from a performance point of view? What about the most elegant solution?
I've also posted here :http://gis.stackexchange.com/questions/11445/postgis-advanced-selection-query
This is a solution I elaborated. It's the fastest one I could find but it's still slow. Given the nature of the task I doubt it can be made faster...
WITH
building AS (
SELECT way, osm_id
FROM osm_polygon
WHERE tags #> hstore('building','yes')
--ORDER BY 1
LIMIT 1000
),
pharmacy AS (
SELECT way
FROM osm_poi
WHERE tags #> hstore('amenity','pharmacy')
),
school AS (
SELECT way
FROM osm_poi
WHERE tags #> hstore('amenity','school')
)
SELECT ST_AsText(building.way) AS geom, building.osm_id AS label
FROM building
WHERE
(SELECT count(*) > 1
FROM pharmacy
WHERE ST_DWithin(building.way,pharmacy.way,1000))
AND
(SELECT count(*) > 1
FROM school
WHERE ST_DWithin(building.way,school.way,1000))
Yours. S.