PostgreSQL: Address matching using fuzzymatch from two tables - postgresql

What I want to do:
I have two tables, each with an address column stored as text. I want to create a view returning the matching rows.
What I've tried:
I've created an index on the address column of each table, as below:
CREATE INDEX idx_table1_fulladdress ON table1 (LOWER(fulladdress_ppd));
Then I ran the following:
CREATE OR REPLACE VIEW view_adresscheck AS
SELECT
--from table1
table1.postcode,
table1.fulladdress_ppd,
--from table2
table2.epc_postcode,
table2.fulladdress_epc
FROM
table1,
table2
WHERE
table1.postcode = table2.epc_postcode
AND
table2.fulladdress_epc = table1.fulladdress_ppd ::text;
What hasn't worked
The above returned fewer records than I know to be there. On inspection, this is because the address format is not consistent between the two tables, i.e.:
table1.fulladdress_ppd = Flat 2d The building the street
table2.fulladdress_epc = Flat 2/d The building the street, the town
The address format isn't consistent within a table either (for example, not all addresses include the town), so I can't use a regex or trim to bulk clean them.
I've since come across the fuzzystrmatch module in Postgres, which sounds like it might resolve my problem.
Question
Which of Soundex, Levenshtein or Metaphone is most appropriate? Most records are in English, but some place names are Gaelic. I'm running PostgreSQL 9.6.

Speaking from experience of matching addresses from different sources: what you could do is assign each address an identifier (an index number). Regardless of formatting, the address above would then map to the same number in both tables, and you match on those numbers.
For example, in the UK every postal address has a UDPRN (Unique Delivery Point Reference Number).
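If no such identifier is available, here is a rough sketch of the fuzzystrmatch route from the question. Of the three functions, Levenshtein is probably the better fit: Soundex and Metaphone are phonetic codes built around English pronunciation, and the PostgreSQL documentation notes that they do not work well with multibyte encodings such as UTF-8, which matters for Gaelic names. The query strips case, spaces and punctuation before comparing, and compares only the leading part of the table2 address so that a trailing town does not count against the match. The threshold of 2 and the assumption that table2 holds the longer form are guesses to tune against real data.

```sql
-- fuzzystrmatch ships with PostgreSQL as a contrib module.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;

SELECT t1.postcode,
       t1.fulladdress_ppd,
       t2.fulladdress_epc
FROM (
    SELECT postcode,
           fulladdress_ppd,
           -- lower-case, then drop everything that is not a letter or digit
           regexp_replace(lower(fulladdress_ppd), '[^[:alnum:]]', '', 'g') AS norm
    FROM table1
) AS t1
JOIN (
    SELECT epc_postcode,
           fulladdress_epc,
           regexp_replace(lower(fulladdress_epc), '[^[:alnum:]]', '', 'g') AS norm
    FROM table2
) AS t2
  ON t1.postcode = t2.epc_postcode          -- only compare within a postcode
-- Compare t1 against the same-length prefix of t2 (assumes t2 may carry a
-- trailing town). levenshtein() raises an error on very long inputs
-- (over 255 bytes in 9.6), so unusually long addresses need trimming first.
WHERE levenshtein(t1.norm, left(t2.norm, length(t1.norm))) <= 2;  -- threshold is an assumption
```

With the two sample addresses from the question, both normalise to the same leading string, so the distance is 0 and the pair matches.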

Related

Snowflake invalid identifier when performing a join

I have been trying to do an outer join across two tables in two different schemas. Before joining, I want to filter the variants table down to SKUs that are 4 or 5 characters long. The join was not working with a simple WHERE clause at the end, hence the subquery.
The problem is that if I leave out the quotes, Snowflake reports invalid identifiers. When I run it with the quotes it works, but every row of the raw.stitch_heroku.spree_variants.SKU column just contains the column name itself!
SELECT
analytics.dbt_lcasucci.product_category.product_description,
'raw.stitch_heroku.spree_variants.SKU'
FROM analytics.dbt_lcasucci.product_category
LEFT JOIN (
SELECT * FROM raw.stitch_heroku.spree_variants
WHERE LENGTH('raw.stitch_heroku.spree_variants.SKU')<=5
and LENGTH('raw.stitch_heroku.spree_variants.SKU')>=4
) ON 'analytics.dbt_lcasucci.product_category.product_id'
= 'raw.stitch_heroku.spree_variants.SKU'
Is there a way to work around this? I am confused and have not found this issue on forums yet!
Thanks in advance.
Firstly, single quotes define a string literal ('this is text'), whereas double quotes delimit identifiers such as table and column names ("this_is_a_table_name").
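A small illustration of that difference, assuming the SKU column can be referenced unquoted as in the question:

```sql
-- Single quotes: a string literal, so every row returns the text SKU.
SELECT 'SKU' AS literal_value
FROM raw.stitch_heroku.spree_variants
LIMIT 3;

-- No quotes: an identifier, so each row returns the value stored in the SKU column.
SELECT SKU AS column_value
FROM raw.stitch_heroku.spree_variants
LIMIT 3;
```

This is why the quoted version of the original query filled the column with its own name: LENGTH and the join condition were comparing fixed strings, not column values.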
Adding aliases to the tables makes the SQL more readable, and the duplicated LENGTH call can be reduced with a BETWEEN, so this should work better:
SELECT pc.product_description,
sp.SKU
FROM analytics.dbt_lcasucci.product_category AS PC
LEFT JOIN (
SELECT SKU
FROM raw.stitch_heroku.spree_variants
WHERE LENGTH(SKU) BETWEEN 4 AND 5
) AS sp
ON pc.product_id = sp.SKU;
I reduced the subselect's result to SKU, since that is the only column you use from sp. But given that you are comparing product_id to SKU, as your example stands you don't need the join to sp at all.
The invalid identifier errors imply to me that something is named incorrectly. The first step is to check that the tables exist, that the columns are named as you expect, and that the column types match for the JOIN x ON y clause, via:
describe table analytics.dbt_lcasucci.product_category;
describe table raw.stitch_heroku.spree_variants;

Add columns to Postgres, source file changed

I get data files from one of our vendors. Each line is a continuous string with most positions filled in. There are plenty of sections that are just space characters, used as filler for future columns. I have a parser that formats each file into a CSV so I can upload it into Postgres. Today the vendor informed us that they are adding a column by splitting one of their filler fields into two columns: X and Filler.
For example index 0:5 is the name, 5:20 is filler and 20:X is other stuff. They are splitting 5:20 into 5:10 and 10:20 where 10:20 will still be a placeholder column.
NAME1 AUHDASFAAF!##!12312312541 -> NAME1, ,AUHDASFAAF,.....
Is now
NAME1AAAAA AUHDASFAAF!##!12312312541 -> NAME1,AAAAA, ,AUHDASFAAF,......
Modifying my parser to account for this change is the easy part. How do I edit my Postgres table to accept this new column from the CSV file? Ideally I don't want to recreate the table and re-upload all of the data.
Columns are in the order they are defined. When you add a new column it goes at the end. There's no direct way to add a column in the middle. While insert values (...) is convenient, you should not rely on the order of columns in the table.
There are various workarounds, like dropping and recreating the table or dropping and adding columns. These are all pretty inconvenient, and you'll have to do it again when there's another change.
You should never make assumptions about the order of columns in the table either in an insert or select *. You can either spell out all the columns, or you can create a view which specifies the order of the columns.
You don't have to write the columns out by hand. Get them from information_schema.columns and edit their order as necessary for your queries or to set up your view.
select column_name
from information_schema.columns
where table_name = ?
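As a minimal sketch of that approach: add the new column at the end of the table, then name the columns explicitly when loading so their order follows the CSV and not the table. The table name, column names and file path below are all hypothetical placeholders.

```sql
-- The new column is appended at the end of the table; existing rows get NULL.
ALTER TABLE vendor_data ADD COLUMN new_field text;

-- The column list matches the order of fields in the CSV, not the table's
-- physical column order. COPY reads a file on the database server; from psql,
-- \copy with the same arguments reads a file on the client instead.
COPY vendor_data (name, new_field, filler, other_stuff)
FROM '/path/to/parsed_file.csv'
WITH (FORMAT csv);
```

Existing data stays in place, and later vendor changes only need another ADD COLUMN plus an updated column list.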

How to query an ampersand symbol in Postgres

I have a Postgres table that has names and addresses. Some of the name fields contain both names of a couple, for example "John & Jane".
I am trying to write a query that pulls out only those rows where this is the case.
When I run this query, it selects 0 rows even though I know that they exist in the table:
SELECT count(*) FROM name_list where namefirst LIKE '%&%';
Does anyone know how to address this?
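In PostgreSQL, & has no special meaning in a LIKE pattern (only % and _ are wildcards), so the query as written should match such rows. A few diagnostic queries that might narrow it down, assuming the cause is either the client tool altering the & before it reaches the server or the data holding a different character than expected:

```sql
-- chr(38) is '&'. Building it in SQL avoids any client-side handling of a literal &.
SELECT count(*) FROM name_list WHERE strpos(namefirst, chr(38)) > 0;

-- The data may contain a full-width ampersand (U+FF06).
-- chr() with a code point above 127 requires a UTF8 database.
SELECT count(*) FROM name_list WHERE strpos(namefirst, chr(65286)) > 0;

-- Or the couples may be stored with the word "and".
SELECT count(*) FROM name_list WHERE namefirst ILIKE '% and %';
```

If the first query finds rows while the original does not, the client is the likely culprit.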

POSTGRES SELECT AS

I am joining two tables, house and tower. Both have some of the same column names, such as id, created_at, deleted, address etc. I wonder if it is possible to return the columns in the following fashion: house.created_at, house.id, tower.created_at, tower.id etc. I know I can alias individual columns with AS, but is it possible to write something like SELECT house.* AS house, tower.* AS tower? I tried that, but it was not valid SQL. Any idea how I can change the column name prefix easily?
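PostgreSQL has no syntax for prefixing a wildcard, so the usual options are to alias each column explicitly or to return each table's row as a single value. A sketch of both, where the join column tower.house_id is a hypothetical placeholder since the question does not show the join condition:

```sql
-- Option 1: alias each shared column by hand.
SELECT house.id         AS house_id,
       house.created_at AS house_created_at,
       tower.id         AS tower_id,
       tower.created_at AS tower_created_at
FROM house
JOIN tower ON tower.house_id = house.id;   -- hypothetical join column

-- Option 2: one JSON value per table, so the column names cannot collide.
SELECT row_to_json(house) AS house,
       row_to_json(tower) AS tower
FROM house
JOIN tower ON tower.house_id = house.id;   -- hypothetical join column
```

The first keeps ordinary typed columns; the second leaves it to the client to unpack each JSON object.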

PostgreSQL 9.5 Select only non matching records from two tables

I have three tables representing some geographical data:
- one with the actual datas,
- one storing the name of the streets,
- one storing the combination between the street number and the street name (table address).
I already have some addresses in my address table. In order to run an INSERT INTO ... SELECT into a fourth table, I am trying to build a SELECT query that retrieves only the objects not already present in the address table.
I tried different approaches, including the NOT EXISTS and the id_street IS NULL conditions, but I didn't manage to make it work.
Here is an example : http://rextester.com/KMSW4349
Thanks
You can simply use EXCEPT to remove the rows already in address:
INSERT INTO address(street_number,id_street)
SELECT DISTINCT datas.street_number, street.id_street
FROM datas
LEFT JOIN street USING (street_name)
EXCEPT
SELECT street_number, id_street FROM address;
You could end up with duplicates if there are concurrent data modifications on address.
To avoid that, you'd add a unique constraint and use INSERT ... ON CONFLICT DO NOTHING.
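A sketch of that variant, which needs PostgreSQL 9.5 or later. The constraint name is a hypothetical placeholder, and adding the constraint will fail if address already contains duplicate pairs.

```sql
-- One row per (street_number, id_street) pair.
ALTER TABLE address
    ADD CONSTRAINT address_number_street_uniq UNIQUE (street_number, id_street);

INSERT INTO address (street_number, id_street)
SELECT DISTINCT datas.street_number, street.id_street
FROM datas
LEFT JOIN street USING (street_name)
-- A unique constraint treats NULLs as distinct, so rows with no matching
-- street would never conflict; this sketch assumes they should be skipped.
WHERE street.id_street IS NOT NULL
ON CONFLICT (street_number, id_street) DO NOTHING;
```

Rows that already exist are silently skipped, so the statement can be re-run safely.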
Your subquery is not correct. You have to correlate it with the outer tables:
INSERT INTO address(street_number,id_street)
SELECT DISTINCT street_number, id_street
FROM datas
LEFT JOIN street ON street.street_name=datas.street_name
WHERE NOT EXISTS (SELECT * FROM address a2 WHERE a2.street_number = datas.street_number AND a2.id_street = street.id_street);