Getting address from db by fuzzy match in PostgreSQL

I have an address database with 1 million rows. Users will enter arbitrary address text (no fixed structure, and grammar mistakes are acceptable). I must separate the address into sections such as region, city, town, village and so on. I have almost done it with the trigram algorithm, but it is very slow. My question is: how can I optimize my query? For now I have this:
```
SELECT *
FROM adresses_1
ORDER BY SIMILARITY(CONCAT(region, district, city, town, area, street, building), **address_text**) DESC
LIMIT 1;
```

You could run the addresses they enter through an address standardization API (like smartystreets) to validate the address and pick out the address components you want (to store in discrete fields). This will make future retrieval, filtering, proximity searching, etc. very accurate. I have used smartystreets on millions of records in the past.

Your expression as written is not indexable. If you build a GiST trigram index on the expression CONCAT(region, district, city, town, area, street, building), then you could use:
ORDER BY CONCAT(region, district, city, town, area, street, building) <-> **address_text** ASC
LIMIT 1
Or if you build the GIN trigram index instead, the ORDER BY wouldn't be directly indexable; but instead you could use the index to efficiently filter out anything "obviously" not close, then sort the remaining ones.
WHERE CONCAT(region, district, city, town, area, street, building) % **address_text**
ORDER BY SIMILARITY(CONCAT(region, district, city, town, area, street, building), **address_text**) DESC
LIMIT 1
Or you could do as Jake proposes, and use software specially written for standardizing addresses.
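To make the GiST variant above concrete, here is a minimal setup sketch. It assumes the pg_trgm extension can be installed and that all seven columns are of type text. One caveat: CONCAT is marked STABLE rather than IMMUTABLE in PostgreSQL, so it is normally rejected inside an index expression. The sketch therefore uses a hypothetical immutable wrapper, full_address, which is not part of the original posts; the index name and the search literal are also illustrative.

```sql
-- pg_trgm supplies similarity(), the % and <-> operators and the trigram operator classes.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- Hypothetical helper (not in the original posts). Like CONCAT, it treats NULL as ''.
-- Assumes all seven columns are text, so the || operator used here is immutable.
CREATE FUNCTION full_address(region text, district text, city text, town text,
                             area text, street text, building text)
RETURNS text
LANGUAGE sql
IMMUTABLE
AS $$
    SELECT coalesce(region, '') || coalesce(district, '') || coalesce(city, '')
        || coalesce(town, '') || coalesce(area, '') || coalesce(street, '')
        || coalesce(building, '');
$$;

-- GiST trigram index: supports nearest-neighbour ordering with <->.
CREATE INDEX adresses_1_full_address_gist
    ON adresses_1
    USING gist (full_address(region, district, city, town, area, street, building) gist_trgm_ops);

-- The ORDER BY must use the same expression as the index for the planner to consider it.
SELECT *
FROM adresses_1
ORDER BY full_address(region, district, city, town, area, street, building) <-> 'user entered address text'
LIMIT 1;
```

For the GIN variant, the same expression would be indexed with gin_trgm_ops and combined with the % filter. The cut-off that % applies is controlled by the pg_trgm.similarity_threshold setting, which defaults to 0.3.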

Related

Atomic values / divisibility to reach 1NF

After reading about normalization, I am unsure how to interpret the 1NF requirements.
According to Wikipedia, something is in first normal form if the "domain of each attribute contains only atomic indivisible values".
My question is: who decides what is indivisible or not?
You may divide a date datatype into year, month, day, second, nanoseconds. You may as well divide an address into the exact latitude coordinates. When can you really be sure that you have reached 1NF?
Would this table be considered 1NF?
| fullName      | fullAddress                                 |
---------------------------------------------------------------
| Joe Zowesson  | 87th Victoria Street London EC96 1MB, 14584 |
| Mason Hamburg | 47th Jeremy Street London EC26 1MB, 13584   |
| Dedrik Terry  | 27th Burger Street London EC16 1MB, 17584   |
My interpretation here is that the value Joe Zowesson is indivisible with regard to the column fullName, and that zip code, street number and street name are all atomic in relation to the column fullAddress.
I am almost certain that I am wrong, but I cannot yet understand why.
The question relates to an upcoming exam, where I will need to "prove" which normal form something is currently in. I find that very hard, since it depends on how you interpret the word atomic.
You have basically misunderstood the concept of 1NF. An atomic value means that when you have a column for Name, you should not store any other values alongside it. In other words, the column intended for the Name should not store an ID, Address or anything else together with the Name, so that when you query the column Name you get only the Name, and not the name with an ID or Address. The Name can be in any form you want, whether First name + Last name or First name + Last name + Middle name + Previous name.
The decision whether you need separate columns for the related data should be made during design. Suppose you have a table Student:
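| StudentId | FullName      | Address         | Average grade |
---------------------------------------------------------------
| 1         | John Done     | New York, US    | 3.4           |
| 2         | Robert Bored  | New York, US    | 0             |
| 3         | Student LName | Dallas, US      | 1             |
| 4         | Another LName | Munich, Germany | 2             |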
In this case, you never query by first name or last name separately; you always need the full name at once, for example:
SELECT FullName
FROM Student
WHERE StudentId = 1;
John Done
And when you need the first name and last name separately, you split them into separate columns, for example:
| StudentId | FirstName | LastName | Address         | Average grade |
----------------------------------------------------------------------
| 1         | John      | Done     | New York, US    | 3.4           |
| 2         | Robert    | Bored    | New York, US    | 0             |
| 3         | Student   | LName    | Dallas, US      | 1             |
| 4         | Another   | LName    | Munich, Germany | 2             |
And your queries might look like this:
SELECT LastName, AverageGrade
FROM Student
WHERE AverageGrade >= 1 AND FirstName != 'John';
The result will be:
| LastName | AverageGrade |
---------------------------
| LName | 1 |
| LName | 2 |
Or something like this maybe:
UPDATE Student
SET AverageGrade = 4
WHERE LastName = 'LName' AND FirstName != 'Student'
Basically, the decision depends on how you manipulate the data and in which form you need it.
To sum up: whether the relation is in 1NF depends on what values you are trying to store in the table. As mentioned above, one column should store only one type of value, e.g. ID, Address, Name, etc. How your columns' values look depends on the design and on how you NEED TO STORE the data. If you do not need to query firstname, middlename, lastname, secondname separately, you can just save all of them in one column FullName, and it will still be in 1NF. If you need them separately, you can store them in separate columns, and again it will still be in 1NF, but it might violate other rules.
Here are some tutorials you might find useful: https://www.studytonight.com/dbms/first-normal-form.php
Let the application, and how it will be used, guide you as to which data should be split further into additional fields (or not).
For example:
If, in your application, you are constantly splitting first name from last name so that you can say "Hi Joe" in correspondence, you should split fullName into two fields. Conversely, if you had two fields firstName and lastName and were always concatenating them so that you could correctly address an envelope, it would make more sense to store them in a single column in your table.
In practice, it is not uncommon for a database to show some de-normalization with the above example given how common both scenarios are but the risk is that they get out of sync if someone updates first name (for example) but doesn't update fullName.
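One way to avoid the out-of-sync risk described above is to store only the parts and let the database derive the combined value. The sketch below assumes PostgreSQL 12 or later, which supports stored generated columns; the question itself does not name a DBMS, and the table and column names are hypothetical.

```sql
-- Hypothetical table: full_name is derived from the two stored parts,
-- so it cannot drift when first_name or last_name is updated.
CREATE TABLE person (
    person_id  integer PRIMARY KEY,
    first_name text NOT NULL,
    last_name  text NOT NULL,
    full_name  text GENERATED ALWAYS AS (first_name || ' ' || last_name) STORED
);

INSERT INTO person (person_id, first_name, last_name)
VALUES (1, 'Joe', 'Zowesson');

-- Only the part is updated; the database recomputes full_name.
UPDATE person SET first_name = 'Joseph' WHERE person_id = 1;

SELECT full_name FROM person WHERE person_id = 1;  -- Joseph Zowesson
```

The trade-off is that full_name can no longer be written directly, so a free-form display name would need its own ordinary column.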
Consider things like how you will force your users to follow a certain pattern if you decide to go with a single column fullName. How would you prevent "Smith, Joe" if your application needed "Joe Smith"?
Dates are another good example and again, whether you split the parts into separate columns depends on how they will be used.
A datetime field which indicates when a row was inserted probably doesn't need to be split out, but if you had many queries which were only interested in the year (for example), it might make sense to split it out.
This only scratches the surface which is why this answer is more about how to think about the underlying problem. Yes normalizing your database is important for all kinds of reasons, but how far you go with it depends on how your data will be used at the end of the day.

Parsing addresses from varchar in PostgreSQL

Could you please advise me on the best way to parse an address from a string? I have a table of addresses exported in the form of OSM Points (city, street, house number, country code, post code, geometry column, ...), and a text parameter entered by the user, for example:
'Prague Letna 15'
I need to parse this string (city, street name, street number, ...) and, based on these data, select from the Points table the point with the greatest similarity. I will be grateful for any advice.
I tried this:
select *
from parse_address('Prague Letna 15')
but the result is not good.

SQL Query sort by closest match

We have a Locations search page that is giving us a challenge I've never run across before.
In our database, we have a list of cities, states, etc. with the corresponding geocodes. All was working fine until now...
We have two locations in a city named "Black River Falls, WI" and we've recently opened one in "River Falls, WI".
So our table has records as follows:
Location   City                State
-------------------------------------
1          Black River Falls   WI
2          Black River Falls   WI
3          River Falls         WI
Obviously our query uses a "LIKE" clause to match city, but when a customer searches the text "River Falls", in the search results, the first results shown are always "Black River Falls".
In our application, we always use the first match, and use it as the default. (We could change it, but it would be a lot of un-budgeted work)
I know I could simply change the sort order to have "River Falls" come up first, but that's a sloppy solution that works only in this one case.
What I'm wondering is whether there is a way, through T-SQL (SQL Server 2008 R2), to sort by "best match", where "River Falls" would "win" if we search for "River Falls, WI" and "Black River Falls" would win if we search for "Black River Falls, WI".
You can use the "DIFFERENCE" function to search using the closest SOUNDEX match.
Select * From Locations WHERE City=@City ORDER BY Difference(City, @City) DESC
From the MSDN Documentation:
The integer returned is the number of characters in the SOUNDEX values
that are the same. The return value ranges from 0 through 4: 0
indicates weak or no similarity, and 4 indicates strong similarity or
the same values.
DIFFERENCE and SOUNDEX are collation sensitive.
Like this:
;WITH cte AS
(
    SELECT *
        , ROW_NUMBER() OVER(ORDER BY LEN(City)-LEN(@UserText)) AS MatchPrio
    FROM Cities
    WHERE City LIKE '%'+@UserText+'%'
)
SELECT *
FROM cte
WHERE MatchPrio = 1
Update:
You can change the ORDER BY expression above to also use DIFFERENCE(..) or any other combination of criteria.
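As one possible combination of criteria, the T-SQL sketch below ranks an exact match first, then a prefix match, then the smallest length difference, with DIFFERENCE as a final tie-breaker. The table and column names follow the CTE above. The ranking order is an assumption about what "best match" should mean, not something the answers specify.

```sql
-- Sample search text; in the application this would be the user's input.
DECLARE @UserText varchar(100) = 'River Falls';

SELECT TOP (1) *
FROM Cities
WHERE City LIKE '%' + @UserText + '%'
ORDER BY
    CASE
        WHEN City = @UserText THEN 0            -- exact match wins
        WHEN City LIKE @UserText + '%' THEN 1   -- then names starting with the text
        ELSE 2                                  -- then any other substring match
    END,
    LEN(City) - LEN(@UserText),                 -- fewer extra characters is better
    DIFFERENCE(City, @UserText) DESC;           -- closest SOUNDEX match breaks ties
```

With the sample data from the question, a search for 'River Falls' would put the River Falls row first, while a search for 'Black River Falls' can only match the two Black River Falls rows.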

Greatest n per group with multiple criteria for greatest

I need to select the largest, most recent or currently active term across a number of schools, assuming that a school can have multiple concurrent terms (i.e., one term that honors students are registered in, and another for non-honors students). I also need to take the end date into account, as the honors term may have the same start date but last a year instead of just a semester, and I want the semester.
Code looks something like this:
SELECT t.school_id, t.term_id, COUNT(s.id) AS size, t.start_date, t.end_date
FROM term t
INNER JOIN students s ON t.term_id = s.term_id
WHERE t.school_id = (some school id)
GROUP BY t.school_id, t.term_id
ORDER BY t.start_date DESC, t.end_date ASC, size DESC LIMIT 1;
This works perfectly to find the largest currently or most recently active term, but I want to be able to eliminate the WHERE t.school_id = (some school id) part.
A standard greatest n per group can easily choose the largest OR most recent term, but I need to select the most recent term that ends soonest with the largest number of students.
I am not sure I am interpreting your question correctly. It would be easier if you had supplied table definitions including primary and foreign keys.
If you want the most recent term that ends soonest with the largest number of students per school, this might do it:
SELECT DISTINCT ON (t.school_id)
       t.school_id, t.term_id, s.size, t.start_date, t.end_date
FROM   term t
JOIN  (
    -- Student count per term; the alias s only exists outside this subquery.
    SELECT term_id, COUNT(id) AS size
    FROM   students
    GROUP  BY term_id
    ) s USING (term_id)
ORDER  BY t.school_id, t.start_date DESC, t.end_date, s.size DESC;
More explanation for DISTINCT ON in this related answer:
Select first row in each GROUP BY group?
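For comparison, the same selection can be sketched with ROW_NUMBER() instead of DISTINCT ON, which also works on databases without DISTINCT ON. This is an alternative sketch, not part of the answer above. It assumes the same term and students tables, and that students.id is the primary key.

```sql
SELECT school_id, term_id, size, start_date, end_date
FROM (
    SELECT t.school_id, t.term_id, s.size, t.start_date, t.end_date,
           -- Rank the terms of each school: latest start, then earliest end, then largest.
           ROW_NUMBER() OVER (
               PARTITION BY t.school_id
               ORDER BY t.start_date DESC, t.end_date, s.size DESC
           ) AS rn
    FROM term t
    JOIN (
        SELECT term_id, COUNT(id) AS size
        FROM students
        GROUP BY term_id
    ) s USING (term_id)
) ranked
WHERE rn = 1;
```

Changing the filter to rn <= 2 would return the top two terms per school, which DISTINCT ON cannot do directly.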

Switch column data where a column contains a number?

I have a table with 3 columns:
id, company and address
I found a bug today that SOMETIMES saved the address in the company column and the company in the address column. I have fixed the bug, and now I'm trying to put the data in the right places.
Every address has a number in it, so my guess is that the easiest way is to swap the address and company columns wherever there is a number in the company column (if a real company name happens to contain a number, it won't matter that much :p).
How should I write this in T-SQL?
I'm not sure this is the right thing to do here, but as I can't think of any other alternative, this should do it.
Update dbo.MyTable
Set Company = Address,
Address = Company
Where Company like '%[0-9]%'
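Since the heuristic can misfire, it may be worth previewing the affected rows inside a transaction before making the change permanent. The sketch below is one cautious way to run the update above. It reuses dbo.MyTable and its columns from the answer, and the final ROLLBACK is a deliberately safe default. The swap works in a single statement because SQL Server evaluates both right-hand sides against the row's values from before the update.

```sql
BEGIN TRANSACTION;

-- Preview the rows the UPDATE is about to touch.
SELECT id, Company, Address
FROM dbo.MyTable
WHERE Company LIKE '%[0-9]%';

UPDATE dbo.MyTable
SET Company = Address,
    Address = Company
WHERE Company LIKE '%[0-9]%';

-- Inspect the outcome before deciding.
SELECT id, Company, Address
FROM dbo.MyTable;

-- Safe default: undo everything. Replace with COMMIT TRANSACTION once the result looks right.
ROLLBACK TRANSACTION;
```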
You can try this. I added a simple guard that skips the swap if the address already contains a number:
-- Sample rows; in some of them NAME and ADDRESS are swapped.
INSERT INTO COMPANY (NAME, ADDRESS)
VALUES ('2 bld d''Italie' , 'CA') ,
('Take 2' , 'anselmo street 234') ,
('Microsoft' , '1 Microsoft Way Redmond'),
('lake street 14' , 'Norton'),
('lake street 17' , 'trendMicro');
SELECT * FROM COMPANY;
-- Swap only when NAME contains a digit and ADDRESS does not.
UPDATE COMPANY SET NAME = ADDRESS, ADDRESS = NAME
WHERE NAME LIKE '%[0-9]%' AND (ADDRESS NOT LIKE '%[0-9]%');
SELECT * FROM COMPANY;
Note that the 'Take 2' row is not swapped.