I am attempting to write a piece of code that will compare two varchar columns, weight the number of matching characters, and assign a value that I can use later on to determine whether they are a "fuzzy" match or not. So far I have a function that strips numerics and spaces, which I figure I can use as a starting point. Does anyone have any direction they can push me in, or some advice?
Thanks
Brian
You might look at the SOUNDEX function.
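For example, a minimal sketch (table_a, table_b and their name columns are placeholders; DIFFERENCE() is T-SQL's companion function that scores SOUNDEX agreement from 0 to 4):
-- SOUNDEX maps similar-sounding strings to the same four-character code,
-- so equality of codes gives a rough phonetic match.
SELECT SOUNDEX('Robert'), SOUNDEX('Rupert');  -- both return 'R163'

-- Phonetic join of two hypothetical varchar columns, scored 0-4:
SELECT a.name, b.name, DIFFERENCE(a.name, b.name) AS score
FROM table_a a
JOIN table_b b ON SOUNDEX(a.name) = SOUNDEX(b.name);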
It depends on the type of data. Soundex, Metaphone, and Double Metaphone are good for human names, but not for comparing street addresses, for example; edit distance (Levenshtein distance) might be used for fuzzy-matching street addresses.
Jaro–Winkler distance and Q-grams are other fuzzy-matching techniques that come to mind.
Here is an implementation of edit distance, if you are wondering what it is:
http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=51540&whichpage=2
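To get a feel for what edit distance returns (a sketch, not from the linked post: on SQL Server you would call the UDF from the link above, while PostgreSQL ships a ready-made levenshtein() in its fuzzystrmatch extension):
-- Levenshtein distance counts the insertions, deletions and substitutions
-- needed to turn one string into the other; lower means more similar.
SELECT levenshtein('123 Main Street', '123 Mian Street');  -- returns 2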
Pointing in the same direction as #GilM (phonetic matching algorithms), there is also another option: Double Metaphone. It is not built into SQL Server the way SOUNDEX is, but you can find a T-SQL version here.
I want to have a SELECT-OPTIONS field in ABAP with the data type FLTP, which is basically a float. But this is not possible using SELECT-OPTIONS.
I tried to use PARAMETERS instead, which solved this issue. But now, of course, I get no results when using this parameter value in the WHERE clause when selecting.
So on the one hand I can't use data type 'F', but on the other hand I get no results. Is there any way out of this dilemma?
Checking floating point values for exact equality is a bad idea. It works in some edge cases (like 0), but often it does not. The reason is that not every value the user can express in decimal notation can also be expressed as a floating point value, so values get rounded internally, and you get inequality where you would expect equality. See "What Every Computer Scientist Should Know About Floating-Point Arithmetic" for more information on this phenomenon.
So offering a SELECT-OPTION or a single PARAMETER to SELECT floating point values out of a table might be a bad idea.
What I would recommend instead is have the user state a range between two values with both fields obligatory:
PARAMETERS:
  p_from TYPE f OBLIGATORY,
  p_to   TYPE f OBLIGATORY.

SELECT somdata
  FROM table
  INTO TABLE lt_results  " assumes lt_results is a matching internal table
  WHERE floatfield >= p_from AND floatfield <= p_to.
But another solution you might want to consider is whether float is really the appropriate data type for your situation. When the table is a Z-table, you might want to consider changing the type of that field to a packed number or one of the decfloat flavors, as those will cause you far fewer surprises.
I have 2 tables (projects and tasks) that both contain a name field. I want users to be able to search both tables at the same time when entering a new item. I want to rank results based on all the terms entered. A user should be able to enter text in any order he/she chooses.
For example, searching on:
office bmt
should yield these results:
PR BMT Time - Office
BMT Office - Development
BMT Office - Development
...
The following search should also work:
BMT canter
should contain this result:
Canterburry - BMT time
So partial matches need to work too.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
I now use something like this:
where to_tsvector(projects.name || ' - ' || tasks.name) @@ to_tsquery('OFF:*&BMT:*')
I build the search string itself in the Ruby backend by splitting the user's entry on its spaces.
This works fine; however, in some cases it doesn't, and I believe that's because the query is interpreted as English, so stop words like of, off, in, etc. are ignored.
For example searching for:
off bmt
Gives results that don't contain Off at all because off is ignored completely.
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
I could do it by building a list of AND'ed LIKE '% ... %' conditions in the WHERE clause, but that would probably hurt performance and doesn't support fuzzy search.
Ideally if the user would type a small error like:
ofice bmt
The results should still appear.
This could be very hard to do on more than a best-effort basis. If someone enters "Canter", how should the system know if they meant a shortening of Canterburry, or a misspelling of "cancer", or of "cantor", or if they really meant a horse's gait? Perhaps you can create a dictionary of common typos for your specific field? Also, without the specific knowledge that time zones are expected and common, "bmt" seems like a misspelling of, well, something.
This works fine; however, in some cases it doesn't, and I believe that's because the query is interpreted as English, so stop words like of, off, in, etc. are ignored.
Don't just believe, check and see!
select to_tsquery('english','OFF:*&BMT:*');
to_tsquery
------------
'bmt':*
Yes indeed, to_tsquery does omit stop words, even with the :* thingy.
One option is to use 'simple' rather than 'english' as your configuration:
select to_tsquery('simple','OFF:*&BMT:*');
to_tsquery
-------------------
'off':* & 'bmt':*
Another option is to write tsquery directly rather than processing through to_tsquery. Note that in this case, you have to lower-case it yourself:
select 'off:*&bmt:*'::tsquery;
tsquery
-------------------
'off':* & 'bmt':*
Also note that if you do this with 'office:*', you will never get a match in an 'english' configuration, because 'office' in the document gets stemmed to 'offic', while no stemming occurs when you write 'office:*'::tsquery. So you could use 'simple' rather than 'english' to avoid both stemming and stop words. Or you could test each word in the query individually to see if it gets stemmed before deciding to add :* to it.
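You can see the stemming difference directly (a quick check you can run yourself):
select to_tsvector('english', 'office');  -- 'offic':1   (stemmed)
select to_tsvector('simple', 'office');   -- 'office':1  (not stemmed)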
Is there a way to avoid this but still have good performance and fuzzy search? I'm not keen on having to sync my PG with ElasticSearch for this.
What do you mean by fuzzysearch? You don't seem to be using that now. You are just using prefix matching, and accidentally using stemming and stopwords. How large is your table to be searched, and what kind of performance is acceptable?
If you did use ElasticSearch, how would you then phrase your searches? If you explained how you would phrase the search in ES, maybe someone can help you do the same thing in PostgreSQL. I don't think we can take it as a given that switching to ES will just magically do the right thing.
I could do it by building a list of AND'ed LIKE '% ... %' conditions in the WHERE clause, but that would probably hurt performance and doesn't support fuzzy search.
Have you looked into pg_trgm? It can make those types of queries quite fast. Also, LIKE '%...%' is a lot more fuzzy than what you are currently doing, so I don't understand how you would lose that. pg_trgm also provides the '<->' operator, which is even fuzzier and might be your best bet. It can deal with typos fairly well when they are embedded in long strings, but in short strings they can really be a problem.
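A minimal pg_trgm sketch (assuming the extension can be installed; projects and its name column are taken from the question):
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- A trigram GiST index supports both similarity matches and
-- distance-ordered (nearest-neighbor) queries:
CREATE INDEX projects_name_trgm ON projects USING gist (name gist_trgm_ops);

-- '%' matches on trigram similarity, so 'ofice' can still find 'Office':
SELECT name FROM projects WHERE name % 'ofice';

-- '<->' is trigram distance (1 - similarity); order by it to rank matches:
SELECT name FROM projects ORDER BY name <-> 'ofice bmt' LIMIT 10;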
In your case, to_tsquery() needs to indicate that all words are required. You can use to_tsquery('english', 'off & bmt') and specify a particular dictionary that keeps the word 'off', as described in link 4 below.
Some tips for using tsvector:
1. Create a field on your table that gathers all the fields with terms you want to search; this field should be of type tsvector.
2. Your search should use tsquery, as you mentioned in your answer. In the search you can apply some good tricks, as follows (see the sketch after this list):
2.a. Create a rank with ts_rank(), indicating the search priority; this shows how closely the tsquery approximates the original terms.
2.b. If you have domain-specific words (as in my case, searching chemical terms), you can create a dictionary of the commonly used words; these words can be used to extract the radical or parts of words to compare similarity.
2.c. About performance: tsquery works very well with GIN and GiST indexes. I have used full-text search on a table with over 200k rows, and the search returns in under 0.4 seconds.
If you need fuzzier matching of words, you can also use fuzzy matching. Together with tsquery I used the levenshtein_less_equal() function with a distance of 3; it finds words that differ from the search term by 3 letters or fewer, which works well for single words.
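A minimal sketch of tips 1, 2.a and the fuzzy match (items and name are placeholder names; levenshtein_less_equal() comes from the fuzzystrmatch extension):
-- 1. Materialize a tsvector column and index it:
ALTER TABLE items ADD COLUMN search_vector tsvector;
UPDATE items SET search_vector = to_tsvector('simple', name);
CREATE INDEX items_search_idx ON items USING gin (search_vector);

-- 2.a. Match and rank in one query:
SELECT name, ts_rank(search_vector, query) AS rank
FROM items, to_tsquery('simple', 'off:* & bmt:*') AS query
WHERE search_vector @@ query
ORDER BY rank DESC;

-- Fuzzy match on single words (returns the distance when it is <= 3):
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
SELECT levenshtein_less_equal('benzene', 'benzine', 3);  -- returns 1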
1. tsquery and tsvector: https://www.postgresql.org/docs/10/datatype-textsearch.html
2. Text search: https://www.postgresql.org/docs/10/textsearch-controls.html#TEXTSEARCH-RANKING
3. Fuzzy: https://www.postgresql.org/docs/11/fuzzystrmatch.html#id-1.11.7.24.6
4. Lexize: https://www.postgresql.org/docs/10/textsearch-dictionaries.html#TEXTSEARCH-SIMPLE-DICTIONARY
Here's the problem:
I have a table in PostgreSQL with addresses in plain text and tsvectors, and I'm trying to find an address record with a query like this:
SELECT * FROM address_catalog
WHERE address_catalog.search_vector @@ to_tsquery('123456:* & Klingon:* & Empire:* & Kronos:* & city:* & Matrok:* & street:* & 789:*')
But the problem is that I don't know anything about the address in a query. I can't tell where a country, a city or a street is in the incoming string. I don't know what order the words of the address are in, or whether it contains extra words.
I can only search for countries and cities, but if the incoming string contains street, index or anything else, the search returns nothing because of the conjunction of all vector tokens. At the same time, I simply can't delete some string parts or use disjunction, because I never know where in the string the extra words are.
So, is there any way to construct a tsquery that returns the best matches for the incoming string, or maybe partial matches? When I tried to force it to use OR instead of AND everywhere in the tsquery, it returned nearly the whole database. I need vector intersection... in PostgreSQL.
I'd recommend using the smlar (PDF) extension for this. It was written by the same guys that wrote text search. It lets you use the TF-IDF similarity measure, which allows for "extraneous" query terms.
Here's how to compile it (I haven't figured out how to compile it on Windows):
http://blog.databasepatterns.com/2014/07/postgresql-install-smlar-extension.html
And here's how to use it:
http://blog.databasepatterns.com/2014/08/tf-idf-text-search-in-postgres.html
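The interface is roughly as follows (a hedged sketch based on the extension's README; TF-IDF mode additionally needs a statistics table, which the second post above walks through):
CREATE EXTENSION smlar;

-- smlar() scores the similarity of two arrays; extra elements lower the
-- score but don't zero it, unlike an AND'ed tsquery.
SELECT smlar('{klingon,empire,kronos}'::text[],
             '{klingon,empire,kronos,city,street}'::text[]);

-- '%' returns true when the score clears the smlar.threshold setting:
SET smlar.threshold = 0.6;
SELECT '{klingon,empire}'::text[] % '{klingon,empire,kronos}'::text[];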
Is there a way to prevent words shorter than a specified value end up in tsvector? MySQL has the ft_min_word_len option, is there something similar for PostgreSQL?
The short answer would be no.
tsearch2 uses dictionaries to normalize the text:
12.6. Dictionaries
Dictionaries are used to eliminate words that should not be considered
in a search (stop words), and to normalize words so that different
derived forms of the same word will match. A successfully normalized
word is called a lexeme.
and see Parsing and Lexing for how the dictionaries are used.
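The practical upshot is that word elimination happens in the dictionary pipeline, and stop words are the built-in example; there is no minimum-length setting (quick illustration):
-- Stop words are dropped during normalization; a custom dictionary would
-- be needed to drop short tokens, as there is no length filter built in.
SELECT to_tsvector('english', 'the cat in the hat');
-- result: 'cat':2 'hat':5   ('the' and 'in' were eliminated as stop words)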
I'm trying to calculate a ts_rank for a full-text match where some of the terms in the query may not be in the ts_vector against which it is being matched. I would like the rank to be higher in a match where more words match. Seems pretty simple?
Because not all of the terms have to match, I have to | the operands, to give a query such as to_tsquery('one|two|three') (if it was &, all would have to match).
The problem is, the rank value seems to be the same no matter how many words match. In other words, it's maxing rather than multiplying the clauses.
select ts_rank('one two three'::tsvector, to_tsquery('one')); gives 0.0607927.
select ts_rank('one two three'::tsvector, to_tsquery('one|two|three|four'));
gives the expected lower value of 0.0455945 because 'four' is not in the vector.
But select ts_rank('one two three'::tsvector, to_tsquery('one|two'));
gives 0.0607927 and likewise
select ts_rank('one two three'::tsvector, to_tsquery('one|two|three'));
gives 0.0607927
I would like the result of ts_rank to be higher if more terms match.
Possible?
To counter one possible response: I cannot calculate all possible subsequences of the search query as intersections and then union them all in a query because I am going to be working with large queries. I'm sure there are plenty of arguments against this anyway!
Edit: I'm aware of ts_rank_cd but it does not solve the above problem.
Use the smlar extension (Linux only AFAIK, written by the same guys that brought us text search).
It has functions for calculating TF-IDF, cosine, or overlap similarity between arrays. It supports indexing, so it is fast.
Another way would be to "spell-check" the query prior to using it, basically removing any query terms that are not in your corpus.
The conclusion I have come to is to & the terms together for the ranking, while in the select query (with which I'm doing the search) the terms are |'ed. This seems to work.
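A sketch of that workaround (docs and vec are placeholder names): match with | so partial matches qualify, but rank with & so that, per the observation above, rows matching more terms score higher:
SELECT name,
       ts_rank(vec, to_tsquery('one & two & three')) AS rank
FROM docs
WHERE vec @@ to_tsquery('one | two | three')
ORDER BY rank DESC;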