PostgreSQL join with similar address - postgresql

I am trying to join data from disparate sources. The only common field to join on is the address. In table 1, the address has extra data (representing the neighborhood) between street and state. Is there a way to join these tables using the most similar address? I have 85,000 addresses, so a manual search using LIKE and wildcards will not work.
Table 1:
"239 Dudley St Dudley Square Roxbury MA 02119"
"539 Dudley St Dudley Square Roxbury MA 02119"
Table 2:
"239 Dudley St Roxbury MA 02119"
"539 Dudley St Roxbury MA 02119"

I have two suggestions:
1) "All words in the table 2 address are present in the table 1 address":
select *
from t1 join
t2 on (string_to_array(t2.address,' ') <@ string_to_array(t1.address,' '));
2) "For each table 1 address find the most similar address from table 2" (similarity() is provided by the pg_trgm extension, so run CREATE EXTENSION pg_trgm first):
select distinct on(t1.address) *
from t1 cross join t2
order by t1.address, similarity(t1.address, t2.address) desc;
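To see what the two suggestions do with the sample rows from the question, here is a rough Python sketch. It is not PostgreSQL: the word check mirrors the array containment test, and difflib's ratio is only a stand-in for similarity(), so scores will differ from pg_trgm. The helper names are made up for illustration.

```python
from difflib import SequenceMatcher

t1 = [
    "239 Dudley St Dudley Square Roxbury MA 02119",
    "539 Dudley St Dudley Square Roxbury MA 02119",
]
t2 = [
    "239 Dudley St Roxbury MA 02119",
    "539 Dudley St Roxbury MA 02119",
]


def words_contained(short, long):
    """True if every word of `short` also occurs in `long` (set containment)."""
    return set(short.split(" ")) <= set(long.split(" "))


def most_similar(address, candidates):
    """Pick the candidate with the highest difflib ratio (a stand-in, not pg_trgm)."""
    return max(candidates, key=lambda c: SequenceMatcher(None, address, c).ratio())


# Suggestion 1: pairs where all table 2 words appear in the table 1 address.
contained_pairs = [(a, b) for a in t1 for b in t2 if words_contained(b, a)]

# Suggestion 2: the best table 2 match for each table 1 address.
best_match = {a: most_similar(a, t2) for a in t1}

for a, b in best_match.items():
    print(a, "->", b)
```

Both approaches pair each table 1 address with the table 2 address that has the same house number. On real data the containment test is stricter, while the similarity ranking always returns some match, even a poor one.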

Related

More Efficient Way to Join Three Tables Together in Postgres

I am attempting to link three tables together in postgres.
All three tables are generated from subqueries. The first table is linked to the second table by the variable call_sign as a FULL JOIN (because I want the superset of entries from both tables). The third table has an INNER JOIN with the second table also on call_sign (but theoretically could have been linked to the first table)
The query runs but is quite slow and I feel will become even slower as I add more data. I realize that there are certain things that I can do to speed things up - like not pulling unnecessary data in the subqueries and not converting text to numbers on the fly. But is there a better way to structure the JOINs between these three tables?
Any advice would be appreciated because I am a novice in postgres.
Here is the code:
select
(CASE
WHEN tmp1.frequency_assigned is NULL
THEN tmp2.lower_frequency
ELSE tmp1.frequency_assigned END) as master_frequency,
(CASE
WHEN tmp1.call_sign is NULL
THEN tmp2.call_sign
ELSE tmp1.call_sign END) as master_call_sign,
(CASE
WHEN tmp1.entity_type is NULL
THEN tmp2.entity_type
ELSE tmp1.entity_type END) as master_entity_type,
(CASE
WHEN tmp1.licensee_id is NULL
THEN tmp2.licensee_id
ELSE tmp1.licensee_id END) as master_licensee_id,
(CASE
WHEN tmp1.entity_name is NULL
THEN tmp2.entity_name
ELSE tmp1.entity_name END) as master_entity_name,
tmp3.market_name
FROM
(select cast(replace(frequency_assigned, ',','.') as decimal) AS frequency_assigned,
frequency_upper_band,
f.uls_file_number,
f.call_sign,
entity_type,
licensee_id,
entity_name
from combo_fr f INNER JOIN combo_en e
ON f.call_sign=e.call_sign
ORDER BY frequency_assigned DESC) tmp1
FULL JOIN
(select cast(replace(lower_frequency, ',','.') as decimal) AS lower_frequency,
upper_frequency,
e.uls_file_number,
mf.call_sign,
entity_type,
licensee_id,
entity_name
FROM market_mf mf INNER JOIN combo_en e
ON mf.call_sign=e.call_sign
ORDER BY lower_frequency DESC) tmp2
ON tmp1.call_sign=tmp2.call_sign
INNER JOIN
(select en.call_sign,
mk.market_name
FROM combo_mk mk
INNER JOIN combo_en en
ON mk.call_sign=en.call_sign) tmp3
ON tmp2.call_sign=tmp3.call_sign
ORDER BY master_frequency DESC;
You'll want to unwind those subqueries and do it all in one join, if you can. Something like:
select <whatever you need>
from combo_fr f
JOIN combo_en e ON f.call_sign=e.call_sign
JOIN market_mf mf ON mf.call_sign=e.call_sign
JOIN combo_mk mk ON mk.call_sign=e.call_sign
I can't completely grok what you're doing, but some of the join clauses might have to become LEFT JOINs in order to deal with places where the call sign does or does not appear.
After creating indexes on call_sign for all four involved tables, try this:
WITH nodup AS (
SELECT call_sign FROM market_mf
EXCEPT SELECT call_sign FROM combo_fr
) SELECT
CAST(REPLACE(u.master_frequency_string, ',','.') AS DECIMAL)
AS master_frequency,
u.call_sign AS master_call_sign,
u.entity_type AS master_entity_type,
u.licensee_id AS master_licensee_id,
u.entity_name AS master_entity_name,
combo_mk.market_name
FROM (SELECT frequency_assigned AS master_frequency_string, call_sign,
entity_type, licensee_id, entity_name
FROM combo_fr
UNION ALL SELECT lower_frequency, call_sign,
entity_type, licensee_id, entity_name
FROM market_mf INNER JOIN nodup USING (call_sign)
) AS u
INNER JOIN combo_en USING (call_sign)
INNER JOIN combo_mk USING (call_sign)
ORDER BY 1 DESC;
I am posting this because it is the simplest way I can see to express what you need.
If there are no call_sign values which appear in both market_mf and combo_fr, WITH nodup ... and INNER JOIN nodup ... can be omitted.
I am assuming that call_sign is unique within combo_fr and within market_mf (that is, neither table has two records with the same value), even though a value may appear in both tables.
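The nodup idea can be tried on a couple of made-up rows. This is only a sketch using Python's sqlite3 module as a stand-in for Postgres, with invented call signs and with the tables cut down to two columns each. It shows that a call sign present in both tables is taken from combo_fr only.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE combo_fr (call_sign TEXT, frequency_assigned TEXT);
CREATE TABLE market_mf (call_sign TEXT, lower_frequency TEXT);
-- Invented sample rows: KA2 appears in both tables.
INSERT INTO combo_fr VALUES ('KA1', '451,25'), ('KA2', '98,5');
INSERT INTO market_mf VALUES ('KA2', '99,0'), ('KA3', '700,0');
""")

rows = conn.execute("""
WITH nodup AS (
    -- call signs that exist only in market_mf
    SELECT call_sign FROM market_mf
    EXCEPT
    SELECT call_sign FROM combo_fr
)
SELECT call_sign, frequency_assigned AS master_frequency_string
FROM combo_fr
UNION ALL
SELECT call_sign, lower_frequency
FROM market_mf INNER JOIN nodup USING (call_sign)
ORDER BY call_sign
""").fetchall()

for row in rows:
    print(row)
```

Each call sign comes out once, which is what the FULL JOIN plus CASE expressions in the question were doing by hand.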
It is very unfortunate that you order by a computed column, and that the computation has to be repeated for every row on every query. A sure optimization would be to convert the frequency strings once and for all in the table itself. The steps would be:
1) add numeric frequency columns to your tables
2) populate them with the values converted from the current text columns
3) convert new values directly into the new columns, by inputting them with a locale which has the desired decimal separator.
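Steps 1 and 2 could look roughly like this. The sketch uses Python's sqlite3 module rather than Postgres (where the new column would more likely be numeric), the column name frequency_assigned_num is made up, and the rows are invented.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE combo_fr (call_sign TEXT, frequency_assigned TEXT)")
conn.executemany(
    "INSERT INTO combo_fr VALUES (?, ?)",
    [("KA1", "451,25"), ("KA2", "98,5")],  # invented rows with decimal commas
)

# Step 1: add a numeric column next to the text one (hypothetical name).
conn.execute("ALTER TABLE combo_fr ADD COLUMN frequency_assigned_num REAL")

# Step 2: populate it once, swapping the decimal comma for a point.
conn.execute(
    "UPDATE combo_fr "
    "SET frequency_assigned_num = CAST(REPLACE(frequency_assigned, ',', '.') AS REAL)"
)

# Queries can now sort on the stored number instead of converting per row.
rows = conn.execute(
    "SELECT call_sign, frequency_assigned_num FROM combo_fr "
    "ORDER BY frequency_assigned_num DESC"
).fetchall()
print(rows)
```

Once the numeric column exists it can also be indexed, which is not possible for an expression recomputed inside a subquery.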

Biggest sale of every employee on Northwind?

I'm trying to list the biggest sale of each employee in the Northwind database, and so far the best I could do is this:
select top (select count(EmployeeID) from Employees)
max(Quantity*OrderDetails.UnitPrice) TotalSale, FirstName+' '+LastName Name, ProductName from Orders
left join OrderDetails
on
OrderDetails.OrderID=Orders.OrderID
left join Employees
on
Orders.EmployeeID=Employees.EmployeeID
left join Products
on
OrderDetails.ProductID=Products.ProductID
group by FirstName,LastName, ProductName
order by TotalSale desc
But even though I used GROUP BY I get repeated records:
TotalSale  Name              ProductName
15810,00   Andrew Fuller     Côte de Blaye
15810,00   Nancy Davolio     Côte de Blaye
10540,00   Robert King       Côte de Blaye
10540,00   Anne Dodsworth    Côte de Blaye
10540,00   Margaret Peacock  Côte de Blaye
9903,20    Janet Leverling   Thüringer Rostbratwurst
8432,00    Steven Buchanan   Côte de Blaye
7905,00    Janet Leverling   Côte de Blaye
7427,40    Andrew Fuller     Thüringer Rostbratwurst
Warning: Null value is eliminated by an aggregate or other SET operation.
(9 row(s) affected)
So I have 9 employees and I used TOP to limit the result to that many rows, but the employees are not unique. I also tried DISTINCT, but it didn't work either.
So I would appreciate a hand please!
Your issue is that you also group by ProductName, so you get the max sale per employee and per product name.
What you can do is drop ProductName from the GROUP BY; you will then see the max sale per employee only.
select max(Quantity*OrderDetails.UnitPrice) TotalSale, FirstName+' '+LastName Name
from Orders
left join [Order Details] as OrderDetails on OrderDetails.OrderID=Orders.OrderID
left join Employees on Orders.EmployeeID=Employees.EmployeeID
left join Products on OrderDetails.ProductID=Products.ProductID
group by FirstName,LastName
order by TotalSale desc
If you want to see the product name as well, you can wrap your query in a subquery and number the rows per employee, ordered by total sale descending, then keep only the rows numbered 1 in the outer query. ROW_NUMBER() gives exactly one row per employee; the query below uses RANK() instead, which shows all occurrences of product names when several tie for the top sale.
SELECT TotalSale, Name, ProductName
FROM
(
select max(Quantity*OrderDetails.UnitPrice) TotalSale
,FirstName + ' ' + LastName Name
,ProductName
,Rnk = Rank() OVER(PARTITION BY Employees.EmployeeId ORDER BY MAX(Quantity*OrderDetails.UnitPrice) DESC)
from Orders
left join [Order Details] as OrderDetails on OrderDetails.OrderID=Orders.OrderID
left join Employees on Orders.EmployeeID=Employees.EmployeeID
left join Products on OrderDetails.ProductID=Products.ProductID
group by FirstName,LastName, Employees.EmployeeId, ProductName
) as sub
where sub.Rnk = 1
order by Name
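The ranking idea can be checked on a small invented table. This sketch uses Python's sqlite3 module instead of SQL Server (window functions need SQLite 3.25 or newer), a single flattened sales table, and made-up employees and products, so it only illustrates the pattern.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE sales (employee TEXT, product TEXT, quantity INTEGER, unit_price REAL);
-- Invented rows: Bob has two products tied for his biggest sale.
INSERT INTO sales VALUES
  ('Ann', 'Widget', 2, 10.0),
  ('Ann', 'Gadget', 5, 3.0),
  ('Bob', 'Widget', 1, 10.0),
  ('Bob', 'Gizmo',  4, 2.5);
""")

rows = conn.execute("""
SELECT total_sale, employee, product
FROM (
    SELECT employee, product, total_sale,
           RANK() OVER (PARTITION BY employee ORDER BY total_sale DESC) AS rnk
    FROM (
        -- biggest single sale per employee and product
        SELECT employee, product, MAX(quantity * unit_price) AS total_sale
        FROM sales
        GROUP BY employee, product
    )
)
WHERE rnk = 1
ORDER BY employee, product
""").fetchall()

for row in rows:
    print(row)
```

Ann gets one row and Bob gets two, because RANK() keeps ties. Replacing RANK() with ROW_NUMBER() would return a single, arbitrarily chosen row for Bob.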

Assistance with a duplication query

When a new customer contacts us, they are allocated a reference number.
Unfortunately our contact centre sometimes logs the same person without checking if they have contacted us before and the customer ends up with two reference numbers. We want to cleanse this, so:
I would like to output instances where the customer's surname, address1 and zipcode are duplicated but only if the customer has different reference numbers.
This is the type of data that I'd like to see output:
Ref   LastName  Address 1        Zip
1875  Faulkner  10 Smith Street  08540
1876  Faulkner  10 Smith Street  08540
I have tried a few ideas, the latest being (forgive the huge amount of code here):
with Duplicates as
(
select r.LastName
, a.Address1
, a.ZipCode
, COUNT(*) as DuplicateCount
FROM Reference r
INNER JOIN Address a ON a.ReferenceNumber = r.ReferenceNumber
LEFT OUTER JOIN Telephone t ON r.ReferenceNumber = t.ReferenceNumber
LEFT OUTER Join Email e ON r.ReferenceNumber = e.ReferenceNumber
group by r.LastName
, a.Address1
, a.ZipCode
having COUNT(*) > 1
)
SELECT
r.ReferenceNumber
, r.LastName
, r.FirstName
,a.ReferenceNumber
, a.Address1
, a.Address2
, a.Address3
, a.Address4
, a.ZipCode
,t.ReferenceNumber
, t.TelephoneNumber
,e.ReferenceNumber
, e.EmailAddress
, d.DuplicateCount
FROM Reference r
INNER JOIN Address a ON a.ReferenceNumber = r.ReferenceNumber
LEFT OUTER JOIN Telephone t ON r.ReferenceNumber = t.ReferenceNumber
LEFT OUTER Join Email e ON r.ReferenceNumber = e.ReferenceNumber
join Duplicates d on d.LastName = r.LastName
AND d.Address1 = a.Address1
AND d.ZipCode = a.ZipCode;
Unfortunately this returns all duplicates, not those with the same surname, address1 and zipcode and different reference numbers.
Do you have any advice on how I can achieve this?
Many thanks.
Try using this part of your code in a self join, either by putting its result into a table variable or by joining it to itself under two different aliases.
SELECT
r.ReferenceNumber
, r.LastName
, r.FirstName
,a.ReferenceNumber
, a.Address1
, a.Address2
, a.Address3
, a.Address4
, a.ZipCode
,t.ReferenceNumber
, t.TelephoneNumber
,e.ReferenceNumber
, e.EmailAddress
FROM Reference r
INNER JOIN Address a ON a.ReferenceNumber = r.ReferenceNumber
LEFT OUTER JOIN Telephone t ON r.ReferenceNumber = t.ReferenceNumber
LEFT OUTER Join Email e ON r.ReferenceNumber = e.ReferenceNumber
An example of self join is here.
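As a rough illustration of the self join, here is a sketch in Python's sqlite3 module (a stand-in for SQL Server). It uses cut-down versions of the Reference and Address tables, the two Faulkner rows from the question, and one invented non-duplicate row.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Reference (ReferenceNumber INTEGER, LastName TEXT);
CREATE TABLE Address (ReferenceNumber INTEGER, Address1 TEXT, ZipCode TEXT);
INSERT INTO Reference VALUES (1875, 'Faulkner'), (1876, 'Faulkner'), (1900, 'Smith');
INSERT INTO Address VALUES
  (1875, '10 Smith Street', '08540'),
  (1876, '10 Smith Street', '08540'),
  (1900, '2 High Street',   '08541');
""")

rows = conn.execute("""
SELECT DISTINCT r1.ReferenceNumber, r1.LastName, a1.Address1, a1.ZipCode
FROM Reference r1
JOIN Address a1 ON a1.ReferenceNumber = r1.ReferenceNumber
-- second copy of the same tables: same person details, different reference
JOIN Reference r2 ON r2.LastName = r1.LastName
                 AND r2.ReferenceNumber <> r1.ReferenceNumber
JOIN Address a2 ON a2.ReferenceNumber = r2.ReferenceNumber
               AND a2.Address1 = a1.Address1
               AND a2.ZipCode = a1.ZipCode
ORDER BY r1.ReferenceNumber
""").fetchall()

for row in rows:
    print(row)
```

The condition r2.ReferenceNumber <> r1.ReferenceNumber is what restricts the output to customers logged under more than one reference number.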

TSQL/SQL Server 2008 R2 - Recursive select consolidating self-referenced table Unit and apply SUM on UnitSale and UnitCharge

I've been searching here and everywhere and I can't find a proper path to follow on my problem.
Here is the structure I am using:
Table [Unit] - represents a unit of an organization, like Management, General Coordination, Production Team 1, etc.
This table is self-referenced by its own key in the ParentID column.
Table [UnitSale] - holds fictitious sales data, referencing a specific Unit.
Table [UnitCharge] - holds fictitious costs and charges of a specific Unit.
My goal is to select the Units, starting from the top-most member of the tree, recursively consolidating the child Units by applying SUM to each child's UnitSale and UnitCharge, and finally applying these totals to the current Unit, in this case the top-most one.
Image of sample data: http://brit.dyndns-work.com:89/Brit/SampleData.png
Check the SQL Fiddle: http://sqlfiddle.com/#!3/75c3cc/3
Any help?
A CTE is a good way to go. I would, however, do it bottom-up: attribute sales from the lower levels to the upper levels, group by unit, and finally join to Unit for the description and calculate the rate. Check the updated fiddle: http://sqlfiddle.com/#!3/75c3cc/16/0.
with cte1 as
(
select u.id, u.parentid, s.salevalue, c.chargevalue
from Unit u
left join UnitSale s on s.unitid = u.id
left join UnitCharge c on c.unitid = u.id
union all
select u.id, u.parentid, x.salevalue, x.chargevalue
from Unit u
inner join cte1 x on x.parentid = u.id
)
, cte2 as
(
select id, sum(salevalue) as totalsale, sum(chargevalue) as totalcharge
from cte1
group by id
)
select u.id, u.description, u.parentid, x.totalsale, x.totalcharge, x.totalsale / x.totalcharge as rate
from cte2 x
inner join unit u on u.id = x.id
order by u.description
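Here is a small runnable variation of the bottom-up roll-up, sketched with Python's sqlite3 module rather than SQL Server and with an invented three-level tree. One deliberate difference from the query above: each unit's sales and charges are summed separately first, on the assumption that a unit may have several rows in both UnitSale and UnitCharge, in which case joining both tables at once would multiply the rows.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Unit (id INTEGER, parentid INTEGER, description TEXT);
CREATE TABLE UnitSale (unitid INTEGER, salevalue REAL);
CREATE TABLE UnitCharge (unitid INTEGER, chargevalue REAL);
-- Invented tree: 1 is the parent of 2, which is the parent of 3.
INSERT INTO Unit VALUES (1, NULL, 'Management'), (2, 1, 'Coordination'), (3, 2, 'Team 1');
INSERT INTO UnitSale VALUES (2, 100), (3, 50), (3, 30);
INSERT INTO UnitCharge VALUES (2, 20), (3, 10), (3, 10);
""")

rows = conn.execute("""
WITH RECURSIVE own AS (
    -- totals booked directly on each unit, summed per table
    SELECT u.id, u.parentid,
           COALESCE((SELECT SUM(salevalue) FROM UnitSale s WHERE s.unitid = u.id), 0) AS sale,
           COALESCE((SELECT SUM(chargevalue) FROM UnitCharge c WHERE c.unitid = u.id), 0) AS charge
    FROM Unit u
),
rolled AS (
    SELECT id, parentid, sale, charge FROM own
    UNION ALL
    -- hand each row up to the parent unit, one level per step
    SELECT p.id, p.parentid, r.sale, r.charge
    FROM own p
    JOIN rolled r ON r.parentid = p.id
)
SELECT id, SUM(sale), SUM(charge) FROM rolled GROUP BY id ORDER BY id
""").fetchall()

for row in rows:
    print(row)
```

Unit 1 ends up with the totals of the whole tree, unit 2 with its own values plus unit 3's, and unit 3 with only its own.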

The multi-part identifier "t.PartNumber" could not be bound - with union

I need the records from TableMain which have a matching record in ActivePNs and also a match in [Parts]. It seems that a join should do the trick but I keep running up against either a "could not be bound" or an "invalid column name" error.
I'm sure I could accomplish what I need by creating a temp table, but I'm trying to keep it simple.
Select * from TableMain t
INNER JOIN (select [PartNumber]
From ActivePNs ap
Where ap.PartNumber = t.PartNumber
Union
select [Number] PartNumber
From [Parts] p
Where p.Number = t.PartNumber) c
On t.PartNumber = c.PartNumber
Assuming there aren't multiple rows in ActivePNs or Parts for a given PartNumber, then from what I've understood, this should do the trick - only finding rows in TableMain that have a PartNumber in ActivePNs and Parts:
Select t.*
from TableMain t
INNER JOIN ActivePNs ap ON t.PartNumber = ap.PartNumber
INNER JOIN Parts p ON t.PartNumber = p.Number
Your problem is in the SELECT after the UNION.
select [Number] PartNumber -- You rename Number to PartNumber
From [Parts] p
Where p.Number = t.PartNumber -- but still reference Number here
The aliasing of Number in the SELECT means there's no column p.Number for use in the WHERE portion of the query.
A derived table cannot be correlated with the tables it is being joined to. What you are trying to do could be implemented like this:
SELECT
t.*,
COALESCE(ap.PartNumber, p.Number) AS PartNumber
FROM TableMain t
LEFT JOIN ActivePNs ap ON ap.PartNumber = t.PartNumber
LEFT JOIN Parts p ON p.Number = t.PartNumber
WHERE NOT (ap.PartNumber IS NULL AND p.Number IS NULL)
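The answers above do not return the same rows: the two INNER JOINs require a match in both tables, while the LEFT JOIN version keeps a row when either table matches, like the UNION in the question. A quick way to see the difference is this sketch in Python's sqlite3 module, with invented part numbers:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE TableMain (PartNumber TEXT);
CREATE TABLE ActivePNs (PartNumber TEXT);
CREATE TABLE Parts (Number TEXT);
-- Invented data: A is only in ActivePNs, B is in both, C is only in Parts, D in neither.
INSERT INTO TableMain VALUES ('A'), ('B'), ('C'), ('D');
INSERT INTO ActivePNs VALUES ('A'), ('B');
INSERT INTO Parts VALUES ('B'), ('C');
""")

in_both = [r[0] for r in conn.execute("""
SELECT t.PartNumber
FROM TableMain t
INNER JOIN ActivePNs ap ON t.PartNumber = ap.PartNumber
INNER JOIN Parts p ON t.PartNumber = p.Number
ORDER BY t.PartNumber
""")]

in_either = [r[0] for r in conn.execute("""
SELECT t.PartNumber
FROM TableMain t
LEFT JOIN ActivePNs ap ON ap.PartNumber = t.PartNumber
LEFT JOIN Parts p ON p.Number = t.PartNumber
WHERE NOT (ap.PartNumber IS NULL AND p.Number IS NULL)
ORDER BY t.PartNumber
""")]

print("in both:", in_both)
print("in either:", in_either)
```

Which one is right depends on whether "a match in ActivePNs and also a match in Parts" really means both at once, as the question's wording suggests, or either one, as its UNION suggests.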