fuzzy match in postgresql

I have two tables in my database, agridata and geoname. I am trying to find the geonameid for the names in agridata, like below:
select geonameid , name from geoname where name in (select distinct district_name from agridata );
I want to do a fuzzy match of the names, as the exact names are not in the database. How do I go about it?

You can use a variety of matching algorithms (see here), but I'm not 100% sure they will work with an IN clause. I'd imagine you really want to use a soundex join, e.g.
select distinct g.geonameid, g.name from geoname g join agridata a on soundex(a.district_name) = soundex(g.name)
or similar.
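A fuller sketch of that (assuming the fuzzystrmatch extension, which provides soundex(), and the table/column names from the question) might look like:
-- Sketch only: soundex() comes from the fuzzystrmatch extension;
-- adjust the table and column names to your actual schema.
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
SELECT DISTINCT g.geonameid, g.name, a.district_name
FROM geoname g
JOIN agridata a ON soundex(a.district_name) = soundex(g.name);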
If you've got a huge match set to deal with, you may want to consider using some kind of search index such as ElasticSearch/Solr.

Use the PostgreSQL extension pg_trgm, an implementation of trigram matching.
"We can measure the similarity of two strings by counting the number of trigrams they share. This simple idea turns out to be very effective for measuring the similarity of words in many natural languages"
I used it, it's very fast and gives great results.
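For example, a hedged sketch of a trigram-based join for the tables in the question (column names assumed from the question; the threshold used by the % operator defaults to 0.3 and can be changed with set_limit()):
-- Sketch only: similarity() and the % operator come from pg_trgm.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX geoname_name_trgm_idx ON geoname USING gin (name gin_trgm_ops);
SELECT DISTINCT ON (a.district_name)
       a.district_name, g.geonameid, g.name,
       similarity(a.district_name, g.name) AS sim
FROM agridata a
JOIN geoname g ON a.district_name % g.name   -- trigram similarity above the threshold
ORDER BY a.district_name, sim DESC;          -- keep the best match per district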

Related

Postgres...how to improve ilike results (quality not speed)

I have a list of chemicals in my database and I provide our users with the ability to do a live search via our website. I use SQLAlchemy and the query I use looks something like this:
Compound.query.filter(Compound.name.ilike(f'%{name}%')).limit(50).all()
When someone searches for toluene, for example, they don't get the result they're looking for because there are many chemicals that have the word toluene in them, such as:
2, 4 Dinitrotoluene
2-Chloroethyl-p-toluenesulfonate
4-Bromotoluene
6-Amino-m-toluenesulfonic acid
a,2,4-trichlorotoluene
a,o-Dichlorotoluene
a-Bromtoluene
etc...
I realize I could increase my limit but I feel like 50 is more than enough. Or, I could change the ilike(f'%{name}%')) to something like ilike(f'{name}%')) but our business requirements don't want this. What I'd rather do is improve the ability for Postgres to return results so that toluene is at the top of the search results.
Any ideas on how to improve Postgres' ilike capability?
Thanks in advance.
One option is to rank the results better; Postgres full text search allows you to rank results.
A cheap and dirty version of preferential ranking is to do multiple queries for name = ?, ilike(f'{name}%'), and ilike(f'%{name}%') using a union. That way the ilike(f'{name}%') results come first.
And rather than a hard limit, offer pagination. SQLAlchemy has paginate to help.
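In plain SQL, the cheap-and-dirty union described above might look something like this (the compounds table and the search term are illustrative, not taken from your schema):
SELECT name
FROM (SELECT name, 1 AS rank FROM compounds WHERE name ILIKE 'toluene'
      UNION ALL
      SELECT name, 2 AS rank FROM compounds WHERE name ILIKE 'toluene%'
      UNION ALL
      SELECT name, 3 AS rank FROM compounds WHERE name ILIKE '%toluene%') AS ranked
GROUP BY name          -- a name that matches several branches keeps its best rank
ORDER BY min(rank)
LIMIT 50;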
ILIKE yields a boolean. It doesn't specify what order to return the results, just whether to return them at all (you can order by a boolean, but if you only return trues there is nothing left to order by). So by the time you are done improving it, it would no longer be ILIKE at all but something else completely.
You might be looking for something like <-> from pg_trgm, which provides a distance score which can be sorted on. Although really, you could just order the result based on the length of the compound name, and return the shortest 50 that contain the target.
something like ilike(f'{name}%')) but our business requirements don't want this
Isn't your business requirement to get better results?
But at least in my database, this could just return a bunch of names in inverted format, like toluene, 2,4-dinitro, so the results might not be much better, unless you avoid storing such inverted names. Sorting by either <-> or by length would overcome that problem. But they would also penalize toluene, ACS reagent grade 99.99% by HPLC, should you have names like that.
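For example, a minimal sketch of ordering by the pg_trgm distance or by length (the compounds table is illustrative):
-- Sketch: <-> is the pg_trgm distance operator (roughly 1 - similarity).
SELECT name
FROM compounds
WHERE name ILIKE '%toluene%'
ORDER BY name <-> 'toluene'   -- or: ORDER BY length(name)
LIMIT 50;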

Efficient way to find ordered string's exact, prefix and postfix match in PostgreSQL

Given a table named table and a string column named column, I want to search for the word word in that column in the following way: exact matches should come first, followed by prefix matches and finally postfix matches.
Currently I have the following two solutions:
Solution 1:
select column
from (select column,
             case
                 when column like 'word' then 1
                 when column like 'word%' then 2
                 when column like '%word' then 3
             end as rank
      from table) as ranked
where rank is not null
order by rank;
Solution 2:
select column
from table
where column like 'word'
   or column like 'word%'
   or column like '%word'
order by case
             when column like 'word' then 1
             when column like 'word%' then 2
             when column like '%word' then 3
         end;
Now my question is: which one of the two solutions is more efficient? Or better yet, is there a solution better than both of them?
Your 2nd solution looks simpler for the planner to optimize, but it is possible that the first one gets the same plan as well.
In the WHERE clause, the exact-match condition (like 'word') is not needed, as it is already covered by the prefix condition (like 'word%'); it might confuse the DB into doing 2 checks instead of one.
But the biggest problem is the third condition, like '%word', as there is no way for a plain index to optimize it.
So either way, PostgreSQL is going to scan your full table and manually extract the matches. This is going to be slow for 20,000 rows or more.
I recommend exploring fuzzy string matching and full text search; it looks like that is what you're trying to emulate.
Even if you don't want the full power of FTS or fuzzy string matching, you should definitely add the extension pg_trgm, as it will enable you to add a GIN index on the column that will speed up LIKE '%word' searches.
https://www.postgresql.org/docs/current/pgtrgm.html
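For example, a minimal sketch using the placeholder table/column names from the question (quoted because they are reserved words):
-- Sketch: a trigram GIN index lets the planner use an index even for
-- LIKE/ILIKE patterns with a leading wildcard.
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX column_trgm_idx ON "table" USING gin ("column" gin_trgm_ops);
SELECT "column" FROM "table" WHERE "column" LIKE '%word';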
And seriously, have a look at FTS. It does provide ranking. If your requirements are strictly what you described, you can still perform the FTS query to "prefilter" and then apply this logic afterwards.
There are tons of introduction articles to PostgreSQL FTS, here's one:
https://www.compose.com/articles/mastering-postgresql-tools-full-text-search-and-phrase-search/
I even wrote a post recently when I added FTS search to my site:
https://deavid.wordpress.com/2019/05/28/sedice-adding-fts-with-postgresql-was-really-easy/
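A rough sketch of that prefilter-then-rank idea, again with the placeholder names from the question (note that FTS matches whole lexemes, so some pure postfix matches inside longer words may not survive the prefilter):
-- Sketch: use FTS to prefilter, then apply the exact/prefix/postfix
-- ranking from Solution 1 to the (much smaller) candidate set.
SELECT "column"
FROM (SELECT "column",
             CASE
                 WHEN "column" LIKE 'word'  THEN 1
                 WHEN "column" LIKE 'word%' THEN 2
                 WHEN "column" LIKE '%word' THEN 3
             END AS rank
      FROM "table"
      WHERE to_tsvector('simple', "column") @@ to_tsquery('simple', 'word')) AS ranked
WHERE rank IS NOT NULL
ORDER BY rank;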

Perl : Tracking duplicates

I am trying to figure out the best way to locate duplicates in six-column CSV data. The real data has more than a million rows in it.
The six columns are:
Name, address, city, post-code, phone number, machine number
The data does not have fixed lengths, and data in certain columns might be missing in certain instances.
I am thinking of using perl to first normalize all the short forms used in names, city and address. Fellow perl enthusiasts from stackoverflow have helped me a lot.
But there would still be a lot of data which would be difficult to match.
So I am wondering: is it possible to match content based on "LIKENESS / SIMILARITY" (e.g. google similar to gugl)? The likeness measure would be needed to overcome errors that crept in while collecting the data.
I have 2 tasks in hand w.r.t. the data:
* Flag duplicate rows with a certain identifier
* Report the percentage match between similar rows
I would really appreciate suggestions as to which methods could be employed and which would probably work best, given their respective merits.
You could write a Perl program to do this, but it will be easier and faster to put it into a SQL database and use that.
Most SQL databases have a way to import CSV. For this answer, I suggest PostgreSQL because it has very powerful string functions which you will need to find your fuzzy duplicates. Create your table with an auto incremented ID column if your CSV data doesn't already have unique IDs.
Once the import is done, add indexes on the columns you want to check for duplicates.
CREATE INDEX name ON whatever (name);
You can do a self-join to look for duplicates in whatever way you like. Here's an example that finds duplicate names.
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name = t2.name
PostgreSQL has powerful string functions including regexes to do the comparisons.
Indexes will have a hard time working on things like lower(t1.name). Depending on the sorts of duplicates you want to work with, you can add indexes for these transforms (this is a feature of PostgreSQL). For example, if you wanted to search case-insensitively you can add an index on the lower-case name. (Thanks @asjo for pointing that out.)
CREATE INDEX ON whatever ((lower(name)));
-- This will be muuuuuch faster
SELECT id
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE lower(t1.name) = lower(t2.name)
A "likeness" match can be achieved in several ways, a simple one would be to use the fuzzystrmatch functions like metaphone(). Same trick as before, add a column with the transformed row and index it.
Other simple things like data normalization are better done on the data itself before adding indexes and looking for duplicates. For example, trim out and squish extra whitespace.
UPDATE whatever SET name = trim(both from name);
UPDATE whatever SET name = regexp_replace(name, '[[:space:]]+', ' ', 'g');
Finally, you can use the Postgres trigram module (pg_trgm) to add fuzzy indexing to your table (thanks again to @asjo).
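A minimal sketch of that, reusing the same illustrative whatever(id, name) table (the % operator matches rows whose trigram similarity exceeds the set_limit() threshold, 0.3 by default, and similarity() gives a score for the percentage-match requirement):
CREATE EXTENSION IF NOT EXISTS pg_trgm;
CREATE INDEX ON whatever USING gin (name gin_trgm_ops);
SELECT t1.id, t2.id, similarity(t1.name, t2.name) AS match_score
FROM whatever t1
JOIN whatever t2 ON t1.id < t2.id
WHERE t1.name % t2.name;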

How to index a postgres table by name, when the name can be in any language?

I have a large postgres table of locations (shops, landmarks, etc.) which the user can search in various ways. When the user wants to do a search for the name of a place, the system currently does (assuming the search is on cafe):
lower(location_name) LIKE '%cafe%'
as part of the query. This is hugely inefficient. Prohibitively so. It is essential I make this faster. I've tried indexing the table on
gin(to_tsvector('simple', location_name))
and searching with
(to_tsvector('simple', location_name) @@ to_tsquery('simple', 'cafe'))
which works beautifully, and cuts down the search time by a couple of orders of magnitude.
However, the location names can be in any language, including languages like Chinese, which aren't whitespace delimited. This new system is unable to find any Chinese locations, unless I search for the exact name, whereas the old system could find matches to partial names just fine.
So, my question is: Can I get this to work for all languages at once, or am I on the wrong track?
If you want to optimize arbitrary substring matches, one option is to use the pg_trgm module. Add an index:
CREATE INDEX table_location_name_trigrams_key ON table
USING gin (location_name gin_trgm_ops);
This will break "Simple Cafe" into "sim", "imp", "mpl", etc., and add an entry to the index for each trigram in each row. The query planner can then automatically use this index for substring pattern matches, including:
SELECT * FROM table WHERE location_name ILIKE '%cafe%';
This query will look up "caf" and "afe" in the index, find the intersection, fetch those rows, then check each row against your pattern. (That last check is necessary since the intersection of "caf" and "afe" matches both "simple cafe" and "unsafe scaffolding", while "%cafe%" should only match one). The index becomes more effective as the input pattern gets longer since it can exclude more rows, but it's still not as efficient as indexing whole words, so don't expect a performance improvement over to_tsvector.
The catch is, trigrams don't work at all for patterns that are under three characters. That may or may not be a deal-breaker for your application.
Edit: I initially added this as a comment.
I had another thought last night when I was mostly asleep. Make a cjk_chars function that takes an input string, matches it against the CJK Unicode ranges with regexp_matches, and returns an array of any such characters, or NULL if none. Add a GIN index on cjk_chars(location_name). Then query for:
WHERE CASE
WHEN cjk_chars('query') IS NOT NULL THEN
cjk_chars(location_name) @> cjk_chars('query')
AND location_name LIKE '%query%'
ELSE
<tsvector/trigrams>
END
Ta-da, unigrams!
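A rough, untested sketch of such a function (the Unicode ranges here only cover kana plus the main CJK Unified Ideographs block and are purely illustrative; extend them to whatever blocks you need):
-- Returns the array of CJK/kana characters in txt, or NULL when there are none.
CREATE FUNCTION cjk_chars(txt text) RETURNS text[]
    LANGUAGE sql IMMUTABLE AS $$
    SELECT NULLIF(
        ARRAY(SELECT (regexp_matches(txt, '[\u3040-\u30ff\u4e00-\u9fff]', 'g'))[1]),
        '{}'::text[]);
$$;
CREATE INDEX location_cjk_chars_idx ON location USING gin (cjk_chars(location_name));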
For full text search in a multi-language environment you need to store the language each datum is in alongside the text itself. You can then use the language-specific flavours of the tsearch functions to get proper stemming, etc.
eg given:
CREATE TABLE location(
location_name text,
location_name_language text
);
... plus any appropriate constraints, you might write:
CREATE INDEX location_name_ts_idx ON location
USING gin(to_tsvector(location_name_language, location_name));
and for search:
SELECT to_tsvector(location_name_language, location_name) @@ to_tsquery('english', 'cafe');
Cross-language searches will be problematic no matter what you do. In practice I'd use multiple matching strategies: I'd compare the search term to the tsvector of location_name in the simple configuration and the stored language of the text. I'd possibly also use a trigram based approach like willglynn suggests, then I'd unify the results for display, looking for common terms.
It's possible you may find Pg's full text search too limited, in which case you might want to check out something like Lucene / Solr.
See:
* controlling full text search.
* tsearch dictionaries
Similar to what @willglynn already posted, I would consider the pg_trgm module. But preferably with a GiST index:
CREATE INDEX tbl_location_name_trgm_idx ON tbl
USING gist(location_name COLLATE "C" gist_trgm_ops);
The gist_trgm_ops operator class ignores case generally, and ILIKE is just as fast as LIKE. Quoting the source code:
Caution: IGNORECASE macro means that trigrams are case-insensitive.
I use COLLATE "C" here - which is effectively no special collation (byte order instead), because you obviously have a mix of various collations in your column. Collation is relevant for ordering or ranges; for a basic similarity search, you can do without it. I would consider setting COLLATE "C" for your column to begin with.
This index would lend support to your first, simple form of the query:
SELECT * FROM tbl WHERE location_name ILIKE '%cafe%';
* Very fast.
* Retains the capability to find partial matches.
* Adds the capability for fuzzy search; check out the % operator and set_limit().
GiST index is also very fast for queries with LIMIT n to select n "best" matches. You could add to the above query:
ORDER BY location_name <-> 'cafe'
LIMIT 20
Read more about the "distance" operator <-> in the manual here.
Or even:
SELECT *
FROM tbl
WHERE location_name ILIKE '%cafe%' -- exact partial match
OR location_name % 'cafe' -- fuzzy match
ORDER BY
(location_name ILIKE 'cafe%') DESC -- exact beginning first
,(location_name ILIKE '%cafe%') DESC -- exact partial match next
,(location_name <-> 'cafe') -- then "best" matches
,location_name -- break remaining ties (collation!)
LIMIT 20;
I use something like that in several applications for (to me) satisfactory results. Of course, it gets a bit slower with multiple features applied in combination. Find your sweet spot ...
You could go one step further and create a separate partial index for every language and use a matching collation for each:
CREATE INDEX location_name_trgm_idx ON tbl
USING gist(location_name COLLATE "de_DE" gist_trgm_ops)
WHERE location_name_language = 'German';
-- repeat for each language
That would only be useful if you only want results of a specific language per query, and it would be very fast in that case.

Possible to rank partial matches in Postgres full text search?

I'm trying to calculate a ts_rank for a full-text match where some of the terms in the query may not be in the ts_vector against which it is being matched. I would like the rank to be higher in a match where more words match. Seems pretty simple?
Because not all of the terms have to match, I have to | the operands, to give a query such as to_tsquery('one|two|three') (if it was &, all would have to match).
The problem is, the rank value seems to be the same no matter how many words match. In other words, it's maxing rather than multiplying the clauses.
select ts_rank('one two three'::tsvector, to_tsquery('one')); gives 0.0607927.
select ts_rank('one two three'::tsvector, to_tsquery('one|two|three|four'));
gives the expected lower value of 0.0455945 because 'four' is not in the vector.
But select ts_rank('one two three'::tsvector, to_tsquery('one|two'));
gives 0.0607927 and likewise
select ts_rank('one two three'::tsvector, to_tsquery('one|two|three'));
gives 0.0607927
I would like the result of ts_rank to be higher if more terms match.
Possible?
To counter one possible response: I cannot calculate all possible subsequences of the search query as intersections and then union them all in a query because I am going to be working with large queries. I'm sure there are plenty of arguments against this anyway!
Edit: I'm aware of ts_rank_cd but it does not solve the above problem.
Use the smlar extension (linux only AFAIK, written by the same guys that brought us text search).
It has functions for calculating TF-IDF, cosine, or overlap similarity between arrays. It supports indexing, so it is fast.
Another way would be to "spell-check" the query prior to using it, basically removing any query terms that are not in your corpus.
The conclusion that I have come to is to & the items together for the ranking. In my select query (with which I'm doing the search) the items are |ed. This seems to work.
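As a minimal sketch of that idea (the docs table and its body column are illustrative): the WHERE clause uses the |-query so that any term can match, while ts_rank uses the &-query so that rows matching more of the terms score higher.
SELECT body,
       ts_rank(to_tsvector(body), to_tsquery('one & two & three')) AS rank
FROM docs
WHERE to_tsvector(body) @@ to_tsquery('one | two | three')
ORDER BY rank DESC;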