Fuzzy merging two tables in PostgreSQL

I need to join two tables based on names. The problem is that a name may be slightly misspelled in one of the tables. I have remedied this in the past using Stata's and Python's fuzzy merging, where names are matched based on how similar they are, but I am wondering whether this is possible in PostgreSQL.
For example, my data may look something like this:
Table A:
first_name_a | last_name_a | id_a
----------------------------------
William | Hartnell | 1
Matt | Smithe | 2
Paul | McGann | 3
David | Tennant | 4
Colin | Baker | 5
Table B:
first_name_b | last_name_b | id_b
----------------------------------
Matt | Smith | a
Peter | Davison | b
Dave | Tennant | c
Colin | Baker | d
Will | Hartnel | e
And in the end, I hope my results would look something like:
first_name_a | last_name_a | id_a | first_name_b | last_name_b | id_b
----------------------------------------------------------------------
William | Hartnell | 1 | Will | Hartnel | e
Matt | Smithe | 2 | Matt | Smith | a
Paul | McGann | 3 | | |
David | Tennant | 4 | Dave | Tennant | c
Colin | Baker | 5 | Colin | Baker | d
| | | Peter | Davison | b
My Sonic Screwdriver gives me some pseudo-code like this:
SELECT a.*, b.* FROM A a
JOIN B b
WHERE LEVENSHTEIN(first_name_a, first_name_b) IS LESS THAN 1
AND LEVENSHTEIN(last_name_a, last_name_b) IS LESS THAN 1

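The query you mention is on the right track, but it needs valid SQL syntax: the join condition goes in an ON clause and 'IS LESS THAN' is written as <:
SELECT a.*, b.*
FROM A a
JOIN B b
  ON levenshtein(a.first_name_a, b.first_name_b) < 1
 AND levenshtein(a.last_name_a, b.last_name_b) < 1;
A threshold of '< 1' only accepts exact matches (distance 0), so substitute 1 with the level of fuzziness that you require. Note that a plain JOIN returns only matched pairs, so unmatched rows such as Paul McGann and Peter Davison will not appear. levenshtein comes from the fuzzystrmatch extension, which must be installed first (CREATE EXTENSION fuzzystrmatch;).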
See http://www.postgresql.org/docs/9.1/static/fuzzystrmatch.html for reference info on LEVENSHTEIN.

Done up as an SQLFiddle. Play with the thresholds/look at some of the other mapping functions mentioned in matching fuzzy strings.
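To get a feel for which thresholds the sample data needs, here is a minimal pure-Python sketch of the same idea. It is not PostgreSQL's implementation: the levenshtein function is a textbook dynamic-programming version, fuzzy_full_join is a hypothetical helper name, and the thresholds (3 for first names, 1 for last names) are assumptions picked for this sample data only.

```python
def levenshtein(s, t):
    """Textbook edit distance: inserts, deletes and substitutions cost 1 each."""
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # delete a character of s
                           cur[j - 1] + 1,              # insert a character of t
                           prev[j - 1] + (cs != ct)))   # substitute (free if equal)
        prev = cur
    return prev[-1]

table_a = [("William", "Hartnell", 1), ("Matt", "Smithe", 2), ("Paul", "McGann", 3),
           ("David", "Tennant", 4), ("Colin", "Baker", 5)]
table_b = [("Matt", "Smith", "a"), ("Peter", "Davison", "b"), ("Dave", "Tennant", "c"),
           ("Colin", "Baker", "d"), ("Will", "Hartnel", "e")]

def fuzzy_full_join(a_rows, b_rows, max_first, max_last):
    """Pair rows whose names are within the given distances; keep unmatched rows too."""
    pairs, matched_b = [], set()
    for a in a_rows:
        hits = [b for b in b_rows
                if levenshtein(a[0], b[0]) <= max_first
                and levenshtein(a[1], b[1]) <= max_last]
        for b in hits:
            pairs.append((a, b))
            matched_b.add(b)
        if not hits:
            pairs.append((a, None))      # row of A with no partner in B
    # rows of B that never matched anything in A
    pairs.extend((None, b) for b in b_rows if b not in matched_b)
    return pairs

# Assumed thresholds: "William" vs "Will" is 3 edits apart, so first names need <= 3.
result = fuzzy_full_join(table_a, table_b, max_first=3, max_last=1)
for a, b in result:
    print(a, b)
```

With these thresholds the pairs line up with the desired output in the question. Looser thresholds start producing false matches quickly on short names, so it is worth checking the distances on real data before settling on a value.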

Related

I need help identifying group of table members that have different status in another table

I wasn't able to google my way to figuring this out. I'm still very new to TSQL and I thought I could solve this with self joins and subqueries, but I'm getting too many results and don't know how to tame them. I appreciate the help. It's nice to see all the different methods people suggest for the same problem. I know I get tunnel vision when trying to solve a problem, when it's better to try it from a different angle.
My goal is this: I want to return the HouseholdID of every household whose members do not all share the same HairColor, whatever the colors may be. In the data below, HouseholdID 200 would be returned because its members' hair colors differ from each other, while HouseholdID 300 would not because its members' hair colors all match.
HouseholdMember
+------------+-----------------+-----------+
| MemberID | HouseholdID | PersonID |
+------------+-----------------+-----------+
| 100 | 200 | 1 |
| 101 | 200 | 2 |
| 102 | 200 | 3 |
| 103 | 300 | 4 |
| 104 | 300 | 5 |
| 105 | 300 | 6 |
+------------+-----------------+-----------+
Person
+------------+-----------------+-----------+------------+
| PersonID | FirstName | LastName | HairColor |
+------------+-----------------+-----------+------------+
| 1 | Josh | Smith | Brown |
| 2 | Jerry | Smith | Black |
| 3 | Ethan | Smith | Red |
| 4 | Mike | Jones | Black |
| 5 | Devan | Jones | Black |
| 6 | Todd | Jones | Black |
+------------+-----------------+-----------+------------+
Household
+---------------+-----------------+----------------+
| HouseholdID | Name | Address |
+---------------+-----------------+----------------+
| 200 | Smith's | 123 Candy Dr |
| 300 | Jones's | 812 Dentist Ln |
+---------------+-----------------+----------------+
One option uses aggregation:
WITH cte AS (
SELECT hm.HouseholdID
FROM HouseholdMember hm
INNER JOIN Person p ON hm.PersonID = p.PersonID
GROUP BY hm.HouseholdID
HAVING COUNT(DISTINCT p.HairColor) > 1
)
SELECT *
FROM Household
WHERE HouseholdID IN (SELECT HouseholdID FROM cte);
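As a quick check of the aggregation query, the sketch below loads the sample rows into an in-memory SQLite database and runs the same statement. SQLite is only a stand-in here (the question is about TSQL), and the column types in the CREATE TABLE statements are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE HouseholdMember (MemberID INTEGER, HouseholdID INTEGER, PersonID INTEGER);
CREATE TABLE Person (PersonID INTEGER, FirstName TEXT, LastName TEXT, HairColor TEXT);
CREATE TABLE Household (HouseholdID INTEGER, Name TEXT, Address TEXT);

INSERT INTO HouseholdMember VALUES
  (100, 200, 1), (101, 200, 2), (102, 200, 3),
  (103, 300, 4), (104, 300, 5), (105, 300, 6);
INSERT INTO Person VALUES
  (1, 'Josh',  'Smith', 'Brown'), (2, 'Jerry', 'Smith', 'Black'),
  (3, 'Ethan', 'Smith', 'Red'),   (4, 'Mike',  'Jones', 'Black'),
  (5, 'Devan', 'Jones', 'Black'), (6, 'Todd',  'Jones', 'Black');
INSERT INTO Household VALUES
  (200, 'Smith''s', '123 Candy Dr'), (300, 'Jones''s', '812 Dentist Ln');
""")

# Same query as in the answer: keep households with more than one distinct hair color.
query = """
WITH cte AS (
    SELECT hm.HouseholdID
    FROM HouseholdMember hm
    INNER JOIN Person p ON hm.PersonID = p.PersonID
    GROUP BY hm.HouseholdID
    HAVING COUNT(DISTINCT p.HairColor) > 1
)
SELECT *
FROM Household
WHERE HouseholdID IN (SELECT HouseholdID FROM cte);
"""
rows = conn.execute(query).fetchall()
print(rows)
```

Only household 200 comes back, since its three members have three different hair colors while all members of household 300 share one.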

how to migrate relational tables to dynamoDB table

I am new to DynamoDB. In my current project I am trying to migrate most relational tables to DynamoDB, and I am facing a tricky scenario that I don't know how to solve.
In PostgreSQL, I have 2 tables:
Student
id | name | age | address | phone
---+--------+-----+---------+--------
1 | Alex | 18 | aaaaaa | 88888
2 | Tome | 19 | bbbbbb | 99999
3 | Mary | 18 | ccccc | 00000
4 | Peter | 20 | dddddd | 00000
Registration
id | class | student | year
---+--------+---------+---------
1 | A1 | 1 | 2018
2 | A1 | 3 | 2018
3 | A1 | 4 | 2017
4 | B1 | 2 | 2018
My query:
select s.id, s.name, s.age, s.address, s.phone
from Registration r inner join Student s on r.student = s.id
where r.class = 'A1' and r.year = '2018'
Result:
id | name | age | address | phone
---+--------+-----+---------+--------
1 | Alex | 18 | aaaaaa | 88888
3 | Mary | 18 | ccccc | 00000
So, how can I design the DynamoDB table to achieve this result, and also support CRUD operations?
Any advice is appreciated.
DynamoDB table design is going to depend largely on your access patterns. Without knowing the full requirements and queries needed by your app, it's not going to be possible to write a proper answer. But given your example here's a table design that might work:
          | (GSI PK) |       |     |         |       |
(P. Key)  | (Sort)   |       |     |         |       | (GSI Sort)
studentId | itemType | name  | age | address | phone | year
----------+----------+-------+-----+---------+-------+-----------
1         | Details  | Alex  | 18  | aaaaaa  | 88888 |
1         | Class_A1 |       |     |         |       | 2018
2         | Details  | Tome  | 19  | bbbbbb  | 99999 |
2         | Class_B1 |       |     |         |       | 2018
3         | Details  | Mary  | 18  | ccccc   | 00000 |
3         | Class_A1 |       |     |         |       | 2018
4         | Details  | Peter | 20  | dddddd  | 00000 |
4         | Class_A1 |       |     |         |       | 2017
Note the global secondary index with the partition key on the item type and the sort key on the year.
With this design we have a few query options:
1) Get student for a given id: GetItem(partitionKey: studentId, sortkey: Details)
2) Get all classes for a given student id: Query(partitionKey: studentId, sortkey: STARTS_WITH("Class"));
3) Get all students in class A1 and year 2018: Query(GSI partitionkey: "Class_A1", sortkey: equals(2018))
For global secondary indexes, the partition and sort key don't need to be unique, so you can have many Class_A1, 2018 combinations. If you haven't already read the Best Practices for DynamoDB, I highly recommend reading them in full.
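To make the three access patterns concrete, here is a minimal in-memory sketch of the table above. It does not call DynamoDB or any AWS SDK: get_item, query and query_gsi are hypothetical helper functions that only mimic how the primary key and the global secondary index would be used.

```python
# Items of the single table; the primary key is (studentId, itemType).
items = [
    {"studentId": 1, "itemType": "Details", "name": "Alex", "age": 18, "address": "aaaaaa", "phone": "88888"},
    {"studentId": 1, "itemType": "Class_A1", "year": 2018},
    {"studentId": 2, "itemType": "Details", "name": "Tome", "age": 19, "address": "bbbbbb", "phone": "99999"},
    {"studentId": 2, "itemType": "Class_B1", "year": 2018},
    {"studentId": 3, "itemType": "Details", "name": "Mary", "age": 18, "address": "ccccc", "phone": "00000"},
    {"studentId": 3, "itemType": "Class_A1", "year": 2018},
    {"studentId": 4, "itemType": "Details", "name": "Peter", "age": 20, "address": "dddddd", "phone": "00000"},
    {"studentId": 4, "itemType": "Class_A1", "year": 2017},
]

def get_item(student_id, item_type):
    """1) Exact lookup on partition key + sort key."""
    for item in items:
        if item["studentId"] == student_id and item["itemType"] == item_type:
            return item
    return None

def query(student_id, sort_prefix):
    """2) All items of one partition whose sort key starts with a prefix."""
    return [i for i in items
            if i["studentId"] == student_id and i["itemType"].startswith(sort_prefix)]

def query_gsi(item_type, year):
    """3) Index lookup: GSI partition key = itemType, GSI sort key = year."""
    return [i for i in items if i["itemType"] == item_type and i.get("year") == year]

alex = get_item(1, "Details")
classes_of_alex = [i["itemType"] for i in query(1, "Class")]
registrations = query_gsi("Class_A1", 2018)
# The registration items carry only keys and the year, so fetch the details per student.
students = [get_item(r["studentId"], "Details")["name"] for r in registrations]
print(students)
```

Note that pattern 3 needs a follow-up read per student in this design, because the registration items do not hold the name, age, address or phone. Whether that is acceptable depends on the access patterns mentioned at the start of the answer.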

Informix - concatenating data contained in same column based on id

I have a need to concatenate strings in the same field based on id in Informix. I realize this can be done easily in MSSQL.
Here is an example of my current table:
id | doc_num | page_num | description
-------------------------------------------------
1 | 1 | 1 | This is the story about
1 | 1 | 2 | a girl named Daisy.
1 | 2 | 1 | Daisy had a dog named
1 | 2 | 2 | Rover.
2 | 1 | 1 | This story is about Bob.
2 | 2 | 1 | Bob is a DBA who works
2 | 2 | 2 | at an important company
2 | 2 | 3 | that develops important
2 | 2 | 4 | software.
Desired output:
id | description
------------------------------------------------------------
1 | This is the story about a girl named Daisy.
| Daisy had a dog named Rover.
------------------------------------------------------------
2 | This story is about Bob. Bob is a DBA who works at an
| important company that develops important software.
------------------------------------------------------------
I found my answer here:
https://dba.stackexchange.com/questions/65101/multiple-table-rows-in-one-row-informix
Since I am running Informix 12, the approach described there works for me: it combines rank() over() with sys_connect_by_path().
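For comparison, the target result can be pinned down outside the database with a small sketch. This is not the Informix solution from the link, just the same grouping logic in plain Python: order the fragments by doc_num and page_num within each id, then join them with spaces.

```python
from itertools import groupby

# (id, doc_num, page_num, description) rows from the example table
rows = [
    (1, 1, 1, "This is the story about"),
    (1, 1, 2, "a girl named Daisy."),
    (1, 2, 1, "Daisy had a dog named"),
    (1, 2, 2, "Rover."),
    (2, 1, 1, "This story is about Bob."),
    (2, 2, 1, "Bob is a DBA who works"),
    (2, 2, 2, "at an important company"),
    (2, 2, 3, "that develops important"),
    (2, 2, 4, "software."),
]

# Sorting the tuples orders by id, then doc_num, then page_num.
ordered = sorted(rows)

# groupby needs input sorted by the grouping key, which the sort above guarantees.
combined = {
    key: " ".join(description for _, _, _, description in group)
    for key, group in groupby(ordered, key=lambda r: r[0])
}

for key, text in combined.items():
    print(key, "|", text)
```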

how to flatten rows to columns in postgreSQL

Using PostgreSQL 9.3, I have a table that shows individual permits issued across a single year:
permit_typ| zipcode| address| name
-------------+------+------+-----
CONSTRUCTION | 20004 | 124 fake streeet | billy joe
SUPPLEMENTAL | 20005 | 124 fake streeet | james oswald
POST CARD | 20005 | 124 fake streeet | who cares
HOME OCCUPATION | 20007 | 124 fake streeet | who cares
SHOP DRAWING | 20009 | 124 fake streeet | who cares
I am trying to flatten this so it looks like
CONSTRUCTION | SUPPLEMENTAL | POST CARD| HOME OCCUPATION | SHOP DRAWING | zipcode
-------------+--------------+-----------+----------------+--------------+--------
1 | 2 | 3 | 5 | 6 | 20004
1 | 2 | 3 | 5 | 6 | 20005
1 | 2 | 3 | 5 | 6 | 20006
1 | 2 | 3 | 5 | 6 | 20007
1 | 2 | 3 | 5 | 6 | 20008
I have been trying to use crosstab, but it's a bit above my rusty SQL experience. Does anybody have any ideas?
I usually approach this type of query using conditional aggregation. In Postgres, you can do:
select zipcode,
sum( (permit_typ = 'CONSTRUCTION')::int) as Construction,
sum( (permit_typ = 'SUPPLEMENTAL')::int) as SUPPLEMENTAL,
. . .
from t
group by zipcode;
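The same conditional aggregation can be sketched in plain Python to show what the query computes: one row per zipcode and one counter per permit type. The rows are taken from the question's sample table; the variable names are only illustrative.

```python
from collections import Counter, defaultdict

# (permit_typ, zipcode) pairs from the sample table
permits = [
    ("CONSTRUCTION", "20004"),
    ("SUPPLEMENTAL", "20005"),
    ("POST CARD", "20005"),
    ("HOME OCCUPATION", "20007"),
    ("SHOP DRAWING", "20009"),
]

permit_types = ["CONSTRUCTION", "SUPPLEMENTAL", "POST CARD", "HOME OCCUPATION", "SHOP DRAWING"]

# GROUP BY zipcode: one Counter per zipcode, incremented per matching permit type.
counts = defaultdict(Counter)
for permit_typ, zipcode in permits:
    counts[zipcode][permit_typ] += 1

# One output row per zipcode, one column per permit type (0 when none were issued).
pivot = {z: [counts[z][t] for t in permit_types] for z in sorted(counts)}

for zipcode, row in pivot.items():
    print(zipcode, row)
```

Each column in the SQL version plays the role of one entry in permit_types: the boolean comparison is cast to 0 or 1 and summed per group.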

Merge multiple tables with a common column name

I am trying to merge multiple tables that share a common column name, where that column need not hold the same values in every table. For example:
-tmp1-
id dat
1 234
2 432
3 412
-tmp2-
id nom
1 jim
2
3 ryan
4 jack
-tmp3-
id pin
1 gi23
2 x4ed
3 yit42
8 hiu11
If the above are the input, the output needs to be:
id dat nom pin
1 234 jim gi23
2 432 x4ed
3 412 ryan yit42
4 jack
8 hiu11
Thanks in advance.
I am using PostgreSQL 8.2.15 on Greenplum, from R (pass-through queries).
Use the FULL JOIN ... USING (id) syntax.
please see example: http://sqlfiddle.com/#!12/3aff2/1
This is how the different join types work (provided that tab1.row3 meets the joining condition with tab2.row1, and tab1.row4 meets it with tab2.row2):
| tab1 |  | tab2 |  | JOIN                    |  | LEFT JOIN               |  | RIGHT JOIN              |  | FULL JOIN               |
--------  --------  -------------------------  -------------------------  -------------------------  -------------------------
| row1 |  |      |  |           |           |  | tab1.row1 |           |  |           |           |  | tab1.row1 |           |
| row2 |  |      |  |           |           |  | tab1.row2 |           |  |           |           |  | tab1.row2 |           |
| row3 |  | row1 |  | tab1.row3 | tab2.row1 |  | tab1.row3 | tab2.row1 |  | tab1.row3 | tab2.row1 |  | tab1.row3 | tab2.row1 |
| row4 |  | row2 |  | tab1.row4 | tab2.row2 |  | tab1.row4 | tab2.row2 |  | tab1.row4 | tab2.row2 |  | tab1.row4 | tab2.row2 |
|      |  | row3 |  |           |           |  |           |           |  |           | tab2.row3 |  |           | tab2.row3 |
|      |  | row4 |  |           |           |  |           |           |  |           | tab2.row4 |  |           | tab2.row4 |
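What a chain of FULL JOIN ... USING (id) does to the three sample tables can be sketched in plain Python: take the union of all ids and look each one up in every table, leaving a gap where a table has no row. The dictionaries below are an assumed in-memory stand-in for tmp1, tmp2 and tmp3, with None playing the role of NULL.

```python
# id -> value for each input table
tmp1 = {1: 234, 2: 432, 3: 412}                          # dat
tmp2 = {1: "jim", 2: None, 3: "ryan", 4: "jack"}         # nom (id 2 has an empty nom)
tmp3 = {1: "gi23", 2: "x4ed", 3: "yit42", 8: "hiu11"}    # pin

# FULL JOIN keeps every id that appears in at least one table.
all_ids = sorted(set(tmp1) | set(tmp2) | set(tmp3))

# USING (id) yields a single id column; missing rows contribute None.
merged = [(i, tmp1.get(i), tmp2.get(i), tmp3.get(i)) for i in all_ids]

for row in merged:
    print(row)
```

The ids 4 and 8 survive even though they are missing from some of the tables, which is exactly the behavior that separates FULL JOIN from the other join types in the table above.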