please, I am working on a PoC for Person Real-time Identification, and one of the critical aspects of it is to support both minor misspelling and phonetic variations of First, Middle, and Last name. Like HarinGton == HarrinBton or RaphEAl == RafAEl. It's working for longer names, but it's a bit more imprecise for names like Lee and John.
I am using Double Metaphone through dmetaphone() and dmetaphone_alt() in PostgreSQL 13.3 (Supabase.io). And although I appreciate Double Metaphone it has a (too?) short string as the outcome. metaphone() has parameters to make the resulting phonetic representation longer. I investigated dmetaphone() and couldn't find anything other than the default function.
Is there a way of making dmetaphone() and dmetaphone_alt() return a longer phonetic representation similar to metaphone()'s, but with a ALT variation?.
Any help would be much appreciated.
Thanks
Looking at the postgres docs for these features you don't have parametric control over the length of the encoded string for Double Metaphone. In the case of single Metaphone, you can only truncate the output string:
max_output_length sets the maximum length of the output metaphone code; if longer, the output is truncated to this length.
However you may get much better results by using Trigram Similarity or Levenshtein Distance on the encoded output from either of the metaphone methods - this can be a more powerful way to handle phonetic permutations using Metaphones.
Example
Consider all the spelling permutations possible for the artist Cyndi Lauper, using double metaphone with trigram similarity we can achieve 100% similarity between the incorrect string cindy lorper and the correct spelling:
SELECT similarity(dmetaphone('cindy lorper'), dmetaphone('cyndi lauper'));
yields: similarity real: 1 (ie: 100% similarity)
Which means the encodings are identical for both input strings using Double Metaphone. When using Metaphone, they're slightly different. All of the following yield SNTLRPR
SELECT metaphone('cyndy lorper',10);
SELECT metaphone('sinday lorper', 10);
SELECT metaphone('cinday laurper', 10);
SELECT metaphone('cyndi lauper',10);
yields: SNTLPR which is only one character different to SNTLRPR
You can also use Levenshtein Distance to calculate it, which gives you a filterable parameter to work with:
SELECT levenshtein(metaphone('sinday lorper', 10), metaphone('cyndi lauper', 10));
yields: levenshtein integer: 1
It's working for longer names, but it's a bit more imprecise for names
like Lee and John.
It's a bit difficult to see exactly what you're having trouble with - without a more complete reprex.
SELECT similarity(dmetaphone('lee'), dmetaphone('leigh'));
SELECT similarity(dmetaphone('jon'), dmetaphone('john'));
both yield: similarity real: 1 (ie: 100% similarity)
Edit: here's a easy to follow guide for fuzzy matching with postgres
Related
I try to use this solution and this (for str_eval()) but seems other encode or other UTF8's Normalization Form, perhaps combining diacritical marks...
select distinct logradouro, str_eval(logradouro)
from logradouro where logradouro like '%CECi%';
-- logradouro | str_eval
------------------------------+----------------------------
-- AV CECi\u008DLIA MEIRELLES | AV CECi\u008DLIA MEIRELLES
PROBLEM: how to select all rows of the table where the problem exists?That is, where \u occurs?
not works with like '%CECi\u%' neither like '%CECi\\u%'
works with like E'%CECi\u008D%' but is not generic
For Google, edited after solved question: this is a typical XY problem. In the original question (above) I used ~wrong hypothesis. All the solutions bellow are answers to the following (objective) question:
How to select only printable ASCII text?
"Printable ASCII" is a subset of UTF8, it is "all ASCII that is not a 'control character'".
The "non-printable" control characters are UNICODE hexadecimal 00 to 1F and 7F(HTML entity to + or decimal 0 to 31 + 127).
PS1: the zero () is the "end of text" mark of PostgreSQL text datatype internal representation, so not need to be checked, but no problems to include it in the range.
PS2: about the secondary question "how to convert a word with encode bug to a valid word?", see an heuristic at my answer.
This condition will exclude any strings that do not entirely consist of printable ASCII characters:
logradouro ~ '[^\u0020-\u007E]'
Solving with workaround
select distinct logradouro, str_eval(logradouro)
from logradouro where not(logradouro ~ E'^[a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+$');
There is a systematic bug on encode, no way to convert to correct UTF8... Even converting, the problem is that "CECi\u008DLIA" is not "CECíLIA".
The solution is to use a kind of "heuristic spell corrector" on
regexp_replace(logradouro, E'[^a-zA-Z0-9_,;\\- \\.\\(\\)\\/"\'\\*]+', '!')
Example: the i! of "Ceci!lia" is corrected to í.
NOTICE. Any heuristic solution (or neural network) trained with a specific dataset (specific systematic error source) is a black box solution, valid only for that type of systematic error. There is no generalization for this type of problem.
Why does DeepL not translate single words correctly?
Example:
Wrong:
**przekrzywić
наклон
przekrzywić się
наклон**
Correct:
**przekrzywić
перекосить
przekrzywić się
перекоситься**
This is a small example, but I checked many thousands of words and they are all incorrect.
I tried to contact support but it is not possible.
DeepL is not a dictionary, it's a machine translation engine. The less common a language combination is, the less accurate the engine will be because less training data is available for this language combination. The best systems will be from/into English; as soon as the translation is between rare languages like PL and RU, the quality will decrease dramatically.
Neural machine translation works on a context basis, meaning every word is defined by the words surrounding it. The less context, the less accurate the translation.
My index contains the word dog how can i also find this entry if i type dogs? I would find all parts of the word 'dogs','dog','do' to a min length of 2 or 3 chars
I'm not an expert on Bloodhound, but what you're talking about here is called stemming, and it seems like the kind of thing that you could do using the datumTokenizer and the queryTokenizer.
There are stemmers for most languages of varying quality, but I think the one most people are using for English these days is the Snowball Stemmer. There are a number of implementations in JavaScript floating around.
In general for things to work properly you'll want to stem both the uer's query and the results.
I am currently learning Go and am making a lot of progress. One way I do this is to port past projects and prototypes from a prior language to a new one.
Right now I am busying myself with a "language detector" I prototyped in Python a while ago. In this module, I generate an ngram frequency table, where I then calculate the difference between a given text and a known corpora.
This allows one to effectively determine which corpus is the best match by returning the cosine of two vector representations of the given ngram tables. Yay. Math.
I have a prototype written in Go that works perfectly with plain ascii characters, but I would very much like to have it working with unicode multibyte support. This is where I'm doing my head in.
Here is a quick example of what I'm dealing with: http://play.golang.org/p/2bnAjZX3r0
I've only posted the table generating logic since everything already works just fine.
As you can see by running the snippet, the first text works quite well and builds an accurate table. The second text, which is German, has a few double-byte characters in it. Due to the way I am building the ngram sequence, and due to the fact that these specific runes are made of two bytes, there appear 2 ngrams where the first byte is cut off.
Could someone perhaps post a more efficient solution or, at the very least, guide me through a fix? I'm almost positive I am over analysing this problem.
I plan on open sourcing this package and implementing it as a service using Martini, thus providing a simple API people can use for simple linguistic computation.
As ever, thanks!
If I understand correctly, you want chars in your Parse function to hold the last n characters in the string. Since you're interested in Unicode characters rather than their UTF-8 representation, you might find it easier to manage it as a []rune slice, and only convert back to a string when you have your ngram ready to add to the table. This way you don't need to special case non-ASCII characters in your logic.
Here is a simple modification to your sample program that does the above: http://play.golang.org/p/QMYoSlaGSv
By keeping a circular buffer of runes, you can minimise allocations. Also note that reading a new key from a map returns the zero value (which for int is 0), which means the unknown key check in your code is redundant.
func Parse(text string, n int) map[string]int {
chars := make([]rune, 2 * n)
table := make(map[string]int)
k := 0
for _, chars[k] = range strings.Join(strings.Fields(text), " ") + " " {
chars[n + k] = chars[k]
k = (k + 1) % n
table[string(chars[k:k+n])]++
}
return table
}
I am writing a function that transliterates UNICODE digits into ASCII digits, and I am a bit stumped on what to do if the string contains digits from different sets of UNICODE digits. So for example, if I have the string "\x{2463}\x{24F6}" ("④⓶"). Should my function
return 42?
croak that the string contains mixed sets?
carp that the string contains mixed sets and return 42?
give the user an additional argument to specify one of the three above behaviours?
do something else?
Your current function appears to do #1.
I suggest that you should also write another function to do #4, but only when the requirement appears, and not before .
I'm sure Joel wrote about "premature implementation" in a blog article sometime recently, but I can't find it.
I'm not sure I see a problem.
You support numeric conversion from a range of scripts, which is to say, you are aware of the Unicode codepoints for their numeric characters.
If you find an unknown codepoint in your input data, it is an error.
It is up to you what you do in the event of an error; you may insert a space or underscore, or you may abort conversion. What you would do will depend on the environment in which your function executes; it is not something we can tell you.
My initial thought was #4; strictly based on the fact that I like options. However, I changed my mind, when I viewed your function.
The purpose of the function seems to be, simply, to get the resulting digits 0..9. Users may find it useful to send in mixed sets (a feature :) . I'll use it.
If you ever have to handle input in bases greater than 10, you may end up having to treat many variants on the first 6 letters of the Latin alphabet ('ABCDEF') as digits in all their forms.