How to find duplicates by name filtered by geo coordinate - postgresql

I need to find duplicate entries (accommodations) by name which will be done like this:
CREATE TABLE tbl_folded AS
SELECT name
, array_agg(id) AS ids
FROM accommodations
GROUP BY 1;
This works to collect all the ids of accommodations sharing a name, but they need further filtering: I only want to fold accommodations of the same name that are in the same location.
Every accommodation has an address (the addresses table has a foreign key accommodation_id and a lonlat column for the geo coordinate).
To find the closest locations I would go for something like this (comparing the addresses of two accommodation rows a1 and a2):
ORDER BY ST_Distance(a1.lonlat, a2.lonlat)
So how can I extend the query above to apply this location filtering?
Help is very much appreciated.
Table "public.accommodations"
Column | Type | Modifiers
-------------+------------------------+-------------------------------------------------------------
id | integer | not null default nextval('accommodations_id_seq'::regclass)
name | character varying(255) |
category | character varying(255) |
Table "public.addresses"
Column | Type | Modifiers
------------------+-----------------------------+--------------------------------------------------------
id | integer | not null default nextval('addresses_id_seq'::regclass)
formatted | character varying(255) |
city | character varying(255) |
state | character varying(255) |
country_code | character varying(255) |
postal | character varying(255) |
lonlat | geography(Point,4326) |
accommodation_id | integer |

You can first get the accommodation_ids that share a lonlat in the addresses table. Note that you cannot select accommodation_id directly while grouping by lonlat (it is neither grouped nor aggregated), but you can collect the ids per coordinate and unnest them:
select unnest(array_agg(accommodation_id)) as accommodation_id
from addresses
group by lonlat
having count(*) > 1
Then join this result with the accommodations table to get the name column, like below:
CREATE TABLE tbl_folded AS
select ac.id,
       ac.name
from accommodations ac
join (
    select unnest(array_agg(accommodation_id)) as accommodation_id
    from addresses
    group by lonlat
    having count(*) > 1
) tab on ac.id = tab.accommodation_id;
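Note that grouping on the raw lonlat value only folds accommodations whose coordinates match exactly. If the unnest step feels awkward, a window-function variant of the inner query (a sketch against the same addresses table) returns one row per duplicate id directly:
select accommodation_id
from (
    select accommodation_id,
           count(*) over (partition by lonlat) as dup_count
    from addresses
) a
where dup_count > 1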

So this is how I solved it. Instead of a true radius search, I group by the coordinates rounded to two decimal places, which buckets accommodations into cells of roughly one kilometre (varying with latitude):
SELECT
lower(name) AS base_name,
array_agg(accommodations.id) AS ids
FROM accommodations
INNER JOIN addresses ON accommodations.id = addresses.accommodation_id
GROUP BY 1, round(ST_X(lonlat::geometry)::numeric,2), round(ST_Y(lonlat::geometry)::numeric,2)
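An alternative, assuming PostGIS is available (it is, given the geography column), is to snap the points to a grid instead of rounding each axis by hand; ST_SnapToGrid with a 0.01-degree cell is roughly equivalent:
SELECT
    lower(name) AS base_name,
    array_agg(accommodations.id) AS ids
FROM accommodations
INNER JOIN addresses ON accommodations.id = addresses.accommodation_id
GROUP BY 1, ST_SnapToGrid(lonlat::geometry, 0.01);
Either way, a grid has edge effects: two points 50 m apart can land in neighbouring cells, so true radius-based de-duplication would need a self-join with ST_DWithin.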

Related

PostgreSQL create column for each array_agg element instead of comma-separated

I want to 1) force a new column for each string_agg element (i.e., Fiction, Mystery would instead be 'Fiction' in one column, 'Mystery' in the next column) returned from this query, and 2) be able to expand the tag columns up to five tags max:
SELECT books.isbn_13 as "ISBN", title as "Title",
author as "Author",
string_agg(tag_name, ', ') as "Tags"
FROM books
LEFT JOIN book_tags on books.isbn_13 = book_tags.isbn_13
GROUP BY books.isbn_13;
Right now everything looks good, except I would like a column for each tag instead of comma-separated values. Here is my current result:
ISBN | Title | Author | Tags
1111111111111 | The Adventures of Steve | Russell Barron | Fiction, Mystery
2222222222222 | It's all a mystery to me | Mystery Man | Mystery
3333333333333 | Biography of a Programmer | Solo Artist | Biography
4444444444444 | Steve and Russel go to Mars | Russell Groupon |
6666666666666 | Newest Book you Must Have | Newbie Onthescene |
Desired result (separating tags into columns where there is more than one):
ISBN | Title | Author | Tag1 | Tag2 | Tag3 | Tag4
1111111111111 | The Adventures of Steve | Russell Barron | Fiction | Mystery | Male Protagonists | Fantasy|
2222222222222 | It's all a mystery to me | Mystery Man | Mystery
3333333333333 | Biography of a Programmer | Solo Artist | Biography
4444444444444 | Steve and Russel go to Mars | Russell Groupon |
6666666666666 | Newest Book you Must Have | Newbie Onthescene |
SCHEMA for books table (parent):
CREATE TABLE public.books
(
isbn_13 character varying(13) COLLATE pg_catalog."default" NOT NULL,
title character varying(100) COLLATE pg_catalog."default",
author character varying(80) COLLATE pg_catalog."default",
publish_date date,
price numeric(6,2),
content bytea,
CONSTRAINT books_pkey PRIMARY KEY (isbn_13)
)
SCHEMA book_tags table:
CREATE TABLE public.book_tags
(
isbn_13 character varying(13) COLLATE pg_catalog."default" NOT NULL,
tag_name character varying(30) COLLATE pg_catalog."default" NOT NULL,
CONSTRAINT book_tags_pkey PRIMARY KEY (isbn_13, tag_name),
CONSTRAINT book_tags_isbn_13_fkey FOREIGN KEY (isbn_13)
REFERENCES public.books (isbn_13) MATCH SIMPLE
ON UPDATE NO ACTION
ON DELETE CASCADE
)
I've researched group by, crosstab/pivot resources for hours with no luck. This seems like it should be a simple thing to do but I'm a very-beginner and haven't found an answer. Thanks in advance for any guidance.
WITH cte AS (
    SELECT books.isbn_13 AS "ISBN",
           title AS "Title",
           author AS "Author",
           tag_name,
           row_number() OVER (PARTITION BY books.isbn_13 ORDER BY tag_name) AS rn
    FROM books
    LEFT JOIN book_tags
      ON books.isbn_13 = book_tags.isbn_13
)
SELECT "ISBN", "Title", "Author",
       MAX(CASE WHEN rn = 1 THEN tag_name END) AS tag1,
       MAX(CASE WHEN rn = 2 THEN tag_name END) AS tag2,
       MAX(CASE WHEN rn = 3 THEN tag_name END) AS tag3,
       MAX(CASE WHEN rn = 4 THEN tag_name END) AS tag4,
       MAX(CASE WHEN rn = 5 THEN tag_name END) AS tag5
FROM cte
GROUP BY "ISBN", "Title", "Author";
(Referencing tag_name directly avoids a case-folding trap: an alias quoted as "Tag" in the CTE cannot be referenced unquoted as Tag outside it, since PostgreSQL folds that to tag. The ORDER BY inside the window function makes the tag numbering deterministic.)

PostgreSQL how to?

Given the following tables:
\d users
Table "public.users"
Column | Type | Modifiers
--------------------+-----------------------------+----------------------------------------
id | integer | not null
email | character varying | not null default ''::character varying
encrypted_password | character varying | not null default ''::character varying
created_at | timestamp without time zone |
address_id | integer |
\d address
Table "public.address"
Column | Type | Modifiers
----------+-------------------+-----------
id | integer | not null
street | character varying |
city | character varying |
province | character varying |
zip | character varying |
country | character varying |
Write some SQL to select all users who live in XXXX.
Now scope that query to users created more than 3 months ago, ordered by newest first.
What index would improve the performance of this query?
First query:
select u.id, u.email
from users u
inner join address a
on u.address_id = a.id
where a.city = 'XXXX'
Second query:
select u.id, u.email
from users u
inner join address a
on u.address_id = a.id
where a.city = 'XXXX' AND
u.created_at < NOW() - interval '3 months'
order by u.created_at desc
Assuming you want to select users who live in a particular city, let's say "Addis Ababa" in a particular country, "Ethiopia":
SELECT *
FROM "users" AS u
JOIN "address" AS a ON a.id = u.address_id
WHERE a.city = 'Addis Ababa'
  AND a.country = 'Ethiopia'
  AND u.created_at < CURRENT_DATE - INTERVAL '3 months'
ORDER BY u.created_at DESC;
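Neither answer tackles the third part of the question. A plausible starting point, assuming city is reasonably selective, is one index to drive the filter and one to speed the join (the index names here are illustrative):
CREATE INDEX address_city_idx ON address (city);       -- drives the WHERE filter
CREATE INDEX users_address_id_idx ON users (address_id);  -- speeds the join
-- If many rows survive the city filter, an additional index on
-- users (created_at) can help the range predicate and the ORDER BY.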

Insert output of query into another table in postgres?

I'm working in Postgres 9.4. I have two tables:
Table "public.parcel"
Column | Type | Modifiers
ogc_fid | integer | not null default
wkb_geometry | geometry(Polygon,4326) |
county | character varying |
parcel_area | double precision |
Table "public.county"
Column | Type | Modifiers
--------+------------------------+-----------
name | character(1) |
chull | geometry(Polygon,4326) |
area | double precision |
I would like to find all the unique values of county in parcel, and the total areas of the attached parcels, and then insert them into the county table as name and area respectively.
I know how to do the first half of this:
SELECT county,
SUM(parcel_area) AS area
FROM inspire_parcel
GROUP BY county;
But what I don't know is how to insert these values into county. Can anyone advise?
I think it's something like:
UPDATE county SET name, area = (SELECT county, SUM(parcel_area) AS area
FROM inspire_parcel GROUP BY county)
You use INSERT INTO, naming the target columns so the SELECT's two output columns line up with name and area (without the column list, the second value would be matched against chull):
INSERT INTO county (name, area)
SELECT county, SUM(parcel_area) AS area
FROM inspire_parcel
GROUP BY county;
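If the county rows already exist and only need their area filled in, the asker's UPDATE instinct can be expressed with PostgreSQL's UPDATE ... FROM; a sketch, assuming county.name holds the same values as parcel's county column:
UPDATE county c
SET area = s.area
FROM (
    SELECT county, SUM(parcel_area) AS area
    FROM inspire_parcel
    GROUP BY county
) s
WHERE c.name = s.county;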

Need cleaner update method in PostgreSQL 9.1.3

Using PostgreSQL 9.1.3 I have a points table like so (What's the right way to show tables here??)
| Column | Type | Table Modifiers | Storage
|--------|-------------------|-----------------------------------------------------|----------|
| id | integer | not null default nextval('points_id_seq'::regclass) | plain |
| name | character varying | not null | extended |
| abbrev | character varying | not null | extended |
| amount | real | not null | plain |
In another table, orders, I have a bunch of columns whose names match values of the abbrev column in the points table, as well as a total_points column:
| Column | Type | Table Modifiers |
|--------------|--------|--------------------|
| ud | real | not null default 0 |
| sw | real | not null default 0 |
| prrv | real | not null default 0 |
| total_points | real | default 0 |
So in orders I have the sw column, and in points I'll now have an amount that relates to the row where abbrev = 'sw'.
I have about 15 columns like that in the orders table, and now I want to set a trigger so that when I create/update an entry, a total score is calculated. Basically, with just the three shown I could do it long-hand like this:
UPDATE orders
SET total_points =
    ud   * (SELECT amount FROM points WHERE abbrev = 'ud') +
    sw   * (SELECT amount FROM points WHERE abbrev = 'sw') +
    prrv * (SELECT amount FROM points WHERE abbrev = 'prrv')
WHERE ....
But that's just plain ugly and repetitive, and like I said there are really 15 of them (right now...). I'm hoping there's a more sophisticated way to handle this.
In general each of those silly names on the orders table represents a type of work associated with the order, and each of those types has a 'cost' to it, which is stored in the points table. I'm not married to this structure if there's a cleaner setup.
"Serialize" the costs for orders:
CREATE TABLE order_cost (
  order_cost_id serial PRIMARY KEY
, order_id      int NOT NULL REFERENCES orders
, cost_type_id  int NOT NULL REFERENCES points
, cost          int NOT NULL DEFAULT 0  -- in cents
);
For a single row:
UPDATE orders o
SET total_points = COALESCE((
    SELECT sum(oc.cost * p.amount) AS order_cost
    FROM order_cost oc
    JOIN points p ON oc.cost_type_id = p.id
    WHERE oc.order_id = o.order_id
    ), 0)
WHERE o.order_id = $<order_id>;  -- your order_id here ...
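Since the asker wanted this to happen automatically, here is a minimal trigger sketch, assuming the order_cost design above and that orders has an order_id primary key (not shown in the original table). It recomputes total_points whenever a cost row changes:
CREATE OR REPLACE FUNCTION refresh_order_total() RETURNS trigger AS $$
DECLARE
    v_order_id int;
BEGIN
    -- OLD is the only row available on DELETE, NEW otherwise
    IF TG_OP = 'DELETE' THEN
        v_order_id := OLD.order_id;
    ELSE
        v_order_id := NEW.order_id;
    END IF;

    UPDATE orders o
    SET total_points = COALESCE((
        SELECT sum(oc.cost * p.amount)
        FROM order_cost oc
        JOIN points p ON oc.cost_type_id = p.id
        WHERE oc.order_id = v_order_id
        ), 0)
    WHERE o.order_id = v_order_id;

    RETURN NULL;  -- return value is ignored for AFTER row triggers
END
$$ LANGUAGE plpgsql;

CREATE TRIGGER order_cost_total
AFTER INSERT OR UPDATE OR DELETE ON order_cost
FOR EACH ROW EXECUTE PROCEDURE refresh_order_total();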
Never use the lossy type real for currency data. Use exact types like money, numeric, or just integer, where integer stores the amount in cents.
More advice in this closely related example:
How to implement a many-to-many relationship in PostgreSQL?

Sort SELECT result by pairs of columns

In the following PostgreSQL 8.4.13 table
(where author users give grades to id users):
# \d pref_rep;
Table "public.pref_rep"
Column | Type | Modifiers
-----------+-----------------------------+-----------------------------------------------------------
id | character varying(32) | not null
author | character varying(32) | not null
good | boolean |
fair | boolean |
nice | boolean |
about | character varying(256) |
stamp | timestamp without time zone | default now()
author_ip | inet |
rep_id | integer | not null default nextval('pref_rep_rep_id_seq'::regclass)
Indexes:
"pref_rep_pkey" PRIMARY KEY, btree (id, author)
Check constraints:
"pref_rep_check" CHECK (id::text <> author::text)
Foreign-key constraints:
"pref_rep_author_fkey" FOREIGN KEY (author) REFERENCES pref_users(id) ON DELETE CASCADE
"pref_rep_id_fkey" FOREIGN KEY (id) REFERENCES pref_users(id) ON DELETE CASCADE
How can I find faked entries, which have the same id and the same author_ip?
I.e. some users register several accounts and then submit bad grades (the good, fair, nice columns above) for other users. But I can still identify them by their author_ip addresses.
I'm trying to find them by fetching:
# select id, author_ip from pref_rep group by id, author_ip;
id | author_ip
-------------------------+-----------------
OK490496816466 | 94.230.231.106
OK360565502458 | 78.106.102.16
DE25213 | 178.216.72.185
OK331482634936 | 95.158.209.5
VK25785834 | 77.109.20.182
OK206383671767 | 80.179.90.103
OK505822972559 | 46.158.46.126
OK237791033602 | 178.76.216.77
VK90402803 | 109.68.173.37
MR16281819401420759860 | 109.252.139.198
MR5586967138985630915 | 2.93.14.248
OK341086615664 | 93.77.75.142
OK446200841566 | 95.59.127.194
But I need to sort the above result.
How can I sort it by the number of pairs (id, author_ip) desc please?
select pr.id, pr.author_ip
from pref_rep pr
inner join (
    select author_ip
    from pref_rep
    group by author_ip
    having count(*) > 1
) s using (author_ip)
order by 2, 1;
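This lists the suspect rows but orders them by IP and id. To sort by how often each (id, author_ip) pair occurs, as the question asks, a grouped count (a small variation on the asker's own GROUP BY query) should work:
select id, author_ip, count(*) as pair_count
from pref_rep
group by id, author_ip
having count(*) > 1
order by pair_count desc;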