Perl: Tracking duplicates

I am trying to figure out the best way to locate duplicates in six-column CSV data. The real data has more than a million rows.
The six columns are:
Name, address, city, post-code, phone number, machine number
The data is not fixed length, and values may be missing from certain columns in certain rows.
I am thinking of using Perl to first normalize all the short forms used in the name, city and address fields. Fellow Perl enthusiasts from Stack Overflow have helped me a lot.
But there would still be a lot of data which would be difficult to match.
So I am wondering: is it possible to match content based on "LIKENESS / SIMILARITY" (e.g. google similar to gugl)? The likeness matching would be needed to overcome errors that crept in while the data was collected.
I have two tasks in hand with respect to the data:
Flag duplicate rows with certain identifier
Mention the percentage match between similar rows.
I would really appreciate suggestions as to which methods could be employed, and which would probably work best given their respective merits.

You could write a Perl program to do this, but it will be easier and faster to put it into a SQL database and use that.
Most SQL databases have a way to import CSV. For this answer, I suggest PostgreSQL because it has very powerful string functions which you will need to find your fuzzy duplicates. Create your table with an auto incremented ID column if your CSV data doesn't already have unique IDs.
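A minimal sketch of that import, using the columns from the question (the table name whatever and the file name data.csv are placeholders):
CREATE TABLE whatever (
    id         bigserial PRIMARY KEY, -- auto-incremented ID
    name       text,
    address    text,
    city       text,
    post_code  text,
    phone      text,
    machine    text
);
-- psql's \copy loads the CSV from the client side
\copy whatever (name, address, city, post_code, phone, machine) FROM 'data.csv' WITH (FORMAT csv, HEADER true)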
Once the import is done, add indexes on the columns you want to check for duplicates.
CREATE INDEX name ON whatever (name);
You can do a self-join to look for duplicates in whatever way you like. Here's an example that finds duplicate names.
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name = t2.name
PostgreSQL has powerful string functions including regexes to do the comparisons.
Indexes will have a hard time working on things like lower(t1.name). Depending on the sorts of duplicates you want to work with, you can add indexes for these transforms (this is a feature of PostgreSQL). For example, if you want to search case-insensitively you can add an index on the lower-cased name. (Thanks @asjo for pointing that out.)
CREATE INDEX ON whatever ((lower(name)));
-- This will be muuuuuch faster
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE lower(t1.name) = lower(t2.name)
A "likeness" match can be achieved in several ways; a simple one is to use the fuzzystrmatch functions such as metaphone(). Same trick as before: add a column with the transformed value and index it.
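A rough sketch of that approach, reusing the whatever table from above (the metaphone output length of 10 is an arbitrary choice):
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
ALTER TABLE whatever ADD COLUMN name_metaphone text;
UPDATE whatever SET name_metaphone = metaphone(name, 10);
CREATE INDEX ON whatever (name_metaphone);
-- Pairs of rows whose names sound alike
SELECT t1.id, t2.id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name_metaphone = t2.name_metaphone;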
Other simple things like data normalization are better done on the data itself before adding indexes and looking for duplicates. For example, trim out and squish extra whitespace.
UPDATE whatever SET name = trim(both from name);
UPDATE whatever SET name = regexp_replace(name, '[[:space:]]+', ' ', 'g');
Finally, you can use the Postgres trigram module (pg_trgm) to add fuzzy indexing to your table (thanks again to @asjo).
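As a sketch (not tested against your data; similarity() also gives you the percentage-style match score from your second task):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON whatever USING gin (name gin_trgm_ops);
-- % matches rows above the trigram similarity threshold (0.3 by default)
SELECT t1.id, t2.id, similarity(t1.name, t2.name) AS score
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name % t2.name;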

Related

PostgreSQL - Compare ts_vector fields

I have two tables in which I have data coming from two different sources. One of the fields of each table contains the title of a movie, but for some reason out of my control, the titles are not always exactly the same.
So I use the ts_vector to get rid of all the minor differences (stop words, plurals and so on).
See an example here: http://sqlfiddle.com/#!17/5ccbc/3
My problem is how to compare the two ts_vectors without taking into account the numeric values, but just the text content. If I compare the two fields directly, I only get exact matches between values, including the position of each word. The only solution I have found is the strip() function, which removes positions and weights from a tsvector, leaving only the text content.
I was wondering if there is a faster way to compare ts_vectors.
You could create an index on the stripped vector:
create index on tbl1 (strip(ts_title));
create index on tbl2 (strip(ts_title));
But given that your query has to fetch every row of each table, it is unlikely this would serve much of a point. Doing a merge join between the precomputed stripped vectors could be faster, but probably not once you include the overhead of building and maintaining the indexes. If the real WHERE clause is more restrictive (selecting only a few rows from one or the other of the tables) then please share the real query.
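For reference, a comparison on the stripped vectors might look like this (the id columns are an assumption on my part, since only ts_title is shown in the question):
SELECT t1.id, t2.id
FROM tbl1 t1
JOIN tbl2 t2 ON strip(t1.ts_title) = strip(t2.ts_title);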

Efficient way to find ordered string's exact, prefix and postfix match in PostgreSQL

Given a table named table and a string column named column, I want to search for the word word in that column in the following way: exact matches on top, followed by prefix matches and finally postfix matches.
Currently I have the following solutions:
Solution 1:
select column
from (select column,
case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end as rank
from table) as ranked
where rank is not null
order by rank;
Solution 2:
select column
from table
where column like 'word'
or column like 'word%'
or column like '%word'
order by case
when column like 'word' then 1
when column like 'word%' then 2
when column like '%word' then 3
end;
Now my question is: which of the two solutions is more efficient, or better yet, is there a solution better than both of them?
Your 2nd solution looks simpler for the planner to optimize, but it is possible that the first one gets the same plan as well.
For the WHERE clause, column like 'word' is not needed as it is covered by column like 'word%'; keeping it might make the DB do two checks instead of one.
But the biggest problem is the third condition, column like '%word', as there is no way for an ordinary index to optimize it.
So either way, PostgreSQL is going to scan your full table and manually extract the matches. This is going to be slow for 20,000 rows or more.
I recommend exploring fuzzy string matching and full text search; it looks like that is what you're trying to emulate.
Even if you don't want the full power of FTS or fuzzy string matching, you should definitely add the pg_trgm extension, as it lets you add a GIN index on the column that speeds up LIKE '%word' searches.
https://www.postgresql.org/docs/current/pgtrgm.html
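A quick sketch of that (keeping the placeholder names table and column from the question, quoted because they are reserved words):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON "table" USING gin ("column" gin_trgm_ops);
-- The trigram index can now be used for all three LIKE patterns, including '%word'
SELECT "column" FROM "table" WHERE "column" LIKE '%word';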
And seriously, have a look at FTS. It does provide ranking. If your requirements are strictly what you described, you can still perform the FTS query to "prefilter" and then apply this logic afterwards.
There are tons of introduction articles to PostgreSQL FTS, here's one:
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
I even wrote a post recently about adding FTS search to my site:
https://deavid.wordpress.com/2019/05/28/sedice-adding-fts-with-postgresql-was-really-easy/

Fuzzy match in PostgreSQL

I have two tables in my database, agridata and geoname. I am trying to find the geonameid for the names in agridata, like below:
select geonameid , name from geoname where name in (select distinct district_name from agridata );
I want to do a fuzzy match of the names, as the exact names are not in the database. How do I go about it?
You can use a variety of matching algorithms (see here), but I'm not 100% sure they will work with an in clause. I'd imagine you really want to use a soundex join e.g.
select distinct g.geonameid, g.name from geoname g join agridata a on soundex(a.district_name) = soundex(g.name)
or similar.
If you've got a huge match set to deal with, you may want to consider using some kind of search index such as ElasticSearch/Solr.
Use the PostgreSQL extension called pg_trgm, an implementation of trigram matching.
"We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages"
I used it, it's very fast and gives great results.
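For example (a sketch assuming the column names from the question, and that pg_trgm is installed):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
SELECT DISTINCT a.district_name, g.geonameid, g.name,
       similarity(a.district_name, g.name) AS score
FROM agridata a
JOIN geoname g ON a.district_name % g.name -- % uses the similarity threshold, 0.3 by default
ORDER BY score DESC;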

How to do a SQL query using columns from a related table?

I've got three related SQL tables, simplified they look like this:
ShopTable
[ShopID]
ShelfTable
[ShelfID]
[ShopID]
InventoryTable
[ShelfID]
[Value]
[ShopID] and [ShelfID] are relations. Now what I want to do is get the SUM of [Value] for one [ShopID], but this obviously won't work since [ShopID] ain't part of InventoryTable:
SELECT SUM([Value]) FROM [InventoryTable] WHERE [ShopID] = '1'
How do I have to write the query to filter the InventoryTable using the ShopID?
SELECT SUM(i.value)
FROM shelfTable s
JOIN inventoryTable i
ON i.shelfId = s.shelfId
WHERE s.shopId = 1
This is a fundamental question about relations between tables, so I'll provide some detail, hoping that you can use some of these ideas when writing SQL queries in the future.
Let's start with one basic thing first. [ShopID] could refer to two different but related columns, one in [ShopTable] and one in [ShelfTable]. The same things applies to [ShelfID]. It's useful to always specify the table.
You describe [ShopID] and [ShelfID] as "relations." As Damien_The_Unbeliever has commented, those columns are, in fact, two pairs of primary and foreign keys. That is, [ShelfTable].[ShelfID] identifies a "shelf" record, and [InventoryTable].[ShelfID] relates an "inventory item" (whatever that is) to a "shelf." (It's not always possible to interpret rows in a database this naively, but I'm willing to guess I'm not too far off from reality.)
Likewise, each "shelf" belongs to one "shop," and [ShelfTable].[ShopID] refers to that specific "shop." Notice that because we have the value of [ShopID] already (I'll call it "@MyShopID"), we don't even need the [ShopTable] here. We can just use [ShelfTable].[ShopID] to filter for the "shelves" we're interested in.
You're asking to get the sum total of [InventoryTable].[Value] for one [ShopID] value, but [ShopID] doesn't show up in [InventoryTable]. That's where your (inner) join comes into play. You know that you'll be adding up values from [InventoryTable], but you've got to specify the particular "shop." You specify @MyShopID for [ShelfTable].[ShopID], and the join on [ShelfID] then does your filtering in [InventoryTable] for you.
One final thing before composing the query. I'm assuming that you haven't oversimplified your tables too much, and that [Value] is the total value of each "inventory item," and not just a unit value. If it wasn't, we'd have to multiply values by quantities, etc., but I'll let you check your own work here.
So, here's what we do:
We select FROM the [InventoryTable]
but we INNER JOIN to the [ShelfTable] on [ShelfID] from both tables
and we only want "shelves" from one "shop," i.e. WHERE [ShelfTable].[ShopID] = @MyShopID
and then we SELECT the SUM([InventoryTable].[Value])
and we're done. In SQL, let's remove the brackets, provide some table aliases, and we'll get a query that looks like this:
SELECT SUM(inv.Value)
FROM InventoryTable AS inv
INNER JOIN ShelfTable AS shf ON shf.ShelfID = inv.ShelfID
WHERE shf.ShopID = @MyShopID
;
Here are a few take-away points to consider. Notice we handled the FROM clause first. You'll always want to do that.
You'll also want a "driving table" to start with, in this case, [InventoryTable]. The other tables in your join add extra information and provide you a means to filter, but don't otherwise interfere with your summing up. More complex queries don't offer such an obvious luxury, but we're not getting too fancy here.
You'll also note, just briefly, that because [ShelfID] is a primary key in [ShelfTable], those [ShelfID]'s are unique values in [ShelfTable], and so each "inventory" thing belongs to a single "shelf." So the join won't cause us to double-count values. That's a good thing to remember when you're not dealing with primary and foreign keys, like we're doing here.
Hope that helps. And I hope I didn't come across as too pedantic.

Better performance for SQLite Select Statement

I'm developing an iPhone app where the user types any string into a search bar and presses the search button. After that, a result list should appear.
In my SQLite database I have four columns a, b, c, d. Let's say they have the following values:
Dataset 1:
a: code1
b: report1
c: description1_1
d: description1_2
Dataset 2:
a: code2
b: report2
c: description2_1
d: description2_2
So if the user enters a value of "1_1", then the first dataset will be selected because of column c.
If the user enters a value of: "report" then the first and second dataset will be selected.
As I'm using a database with nearly 60,000 datasets, searching for a partial string is really killing the performance.
Setting an index on all 4 columns would make the SQLite database far too large.
So I didn't use an index at all.
My Select Statement looks like this:
NSString *sql = [NSString stringWithFormat:@"SELECT * FROM scode WHERE a LIKE '%@%@%@' OR b LIKE '%@%@%@' OR c LIKE '%@%@%@' OR d LIKE '%@%@%@'", wildcard, searchBar.text, wildcard, wildcard, searchBar.text, wildcard, wildcard, searchBar.text, wildcard, wildcard, searchBar.text, wildcard];
Is there any good way to enhance the performance of searching for a part-string in all columns?
Thank you and kind regards,
Daniel
You're after full-text searching, which SQLite doesn't support out of the box (its FTS modules are optional, compile-time extensions). I don't have any experience with third-party support, but from search results there are a few options.
You answered your own question: index all four columns, and measure the size difference. Considering the storage capacity of the iPhone, you're probably striking the wrong balance by trying to save storage.
The rule of thumb with SQLite performance is not to run a query that isn't indexed.
You can see what SQLite is actually doing by creating your database on the Mac with the same schema and running EXPLAIN QUERY PLAN. (There's also EXPLAIN, which is more detailed but less obvious.)
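For example (a sketch using the scode table from your query; the search text is just an example):
-- Shows whether SQLite can use an index or must scan the whole table
EXPLAIN QUERY PLAN
SELECT * FROM scode
WHERE a LIKE '%report%' OR b LIKE '%report%' OR c LIKE '%report%' OR d LIKE '%report%';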
You can create a separate table with two columns: a pattern string and a key value (which is used to refer to your data table). Let's call this table "search_index".
Then, on any change to your data table entries, you update the "search_index" table:
remove rows with keys of changed data table rows
for each column in data table, use the first X characters of the data, and add them to search_index with the key
You can work out the details yourself, but in this way, you just build your own (partial) search index.
When querying, you can use up to X characters to search in the search_index table alone. If the user types more than X characters you at least have a limited set of data table rows to search in. So you can search those 60k rows easily.
Find a good value for X to balance storage requirements and usability and performance.
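A rough sketch of what that could look like in SQL (the table and column names here are only illustrative):
CREATE TABLE search_index (
    pattern TEXT,    -- first X characters (or a whole word) taken from a data column
    data_id INTEGER  -- key of the corresponding row in the data table
);
CREATE INDEX idx_search_index_pattern ON search_index (pattern);
-- Prefix lookups then hit the small index table instead of scanning the data table
SELECT DISTINCT data_id FROM search_index WHERE pattern LIKE 'repo%';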
EDIT: Looks like you do not want to search only the beginning of the words? Well, then you should not just use the "first X characters"; instead, split the data into single words and put the full words into search_index. In practice you will still need only around a fourth of the index storage required by indexing all columns, so it's still a good idea to build your own "search_index".