Double-metaphone errors - metaphone

I'm using Lawrence Philips Double-Metaphone algorithm with great success, but I have found the odd "unexpected result" for some combinations.
Does anyone else have additions or changes to the algorithm for other parts of it they wouldn't mind sharing, or just the combinations that they've found that do not work as expected.
eg. I had issues between:
Peashill and Bushley. (both match with PXL)
Rockliffe and Rockcliffe (RKLF and RKKL)

All Soundex, Metaphone and variant schemes are occasionally going to give results that aren't identical to what you expect. This is unavoidable - they can be regarded as more or less simple hash algorithms with special information preserving properties, and will sometimes produce collisions when you'd rather they didn't, and will sometimes produce differences when you'd rather they didn't.
One possible way of improving things is using 'synonym rings'. This basically produces lists of words that should be regarded as synonyms, independent of the spelling. I encountered them in the context of name matching. For example, variants on Chaudri
included:
CHAUDARY
CHAUDERI
CHAUDERY
CHAUDHARY
CHAUDHERI
CHAUDHERY
CHAUDHRI
CHAUDHRY
CHAUDHURI
CHAUDHURY
CHAUDHY
CHAUDREY
CHAUDRI
CHAUDRY
CHAUDURI
CHAWDHARY
CHAWDHRY
CHAWDHURY
CHDRY
CHODARY
CHODHARI
CHODHOURY
CHODHRY
CHODREY
CHODRY
CHODURY
CHOUDARI
CHOUDARY
CHOUDERY
CHOUDHARI
CHOUDHARY
CHOUDHERY
CHOUDHOURY
CHOUDHRI
CHOUDHRY
CHOUDHURI
CHOUDHURY
CHOUDREY
CHOUDRI
CHOUDRY
CHOUDURY
CHOUWDHRY
CHOWDARI
CHOWDARY
CHOWDHARY
CHOWDHERY
CHOWDHRI
CHOWDHRY
CHOWDHURI
CHOWDHURRYY
CHOWDHURY
CHOWDORY
CHOWDRAY
CHOWDREY
CHOWDRI
CHOWDRURY
CHOWDRY
CHOWDURI
CHOWDURY
CHUDARY
CHUDHRY
CHUDORY
COWDHURY

regular metaphone is returning a difference between Peashill and Bushley
Peashill PXL
Bushley BXL

Related

How to encode normalized(A,B) properly?

I am using clingo to solve a homework problem and stumbled upon something I can't explain:
normalized(0,0).
normalized(A,1) :-
A != 0.
normalized(10).
In my opinion, normalized should be 0 when the first parameter is 0 or 1 in every other case.
Running clingo on that, however, produces the following:
test.pl:2:1-3:12: error: unsafe variables in:
normalized(A,1):-[#inc_base];A!=0.
test.pl:2:12-13: note: 'A' is unsafe
Why is A unsafe here?
According to Programming with CLINGO
Some error messages say that the program
has “unsafe variables.” Such a message usually indicates that the head of one of
the rules includes a variable that does not occur in its body; stable models of such
programs may be infinite.
But in this example A is present in the body.
Will clingo produce an infinite set consisting of answers for all numbers here?
I tried adding number(_) around the first parameter and pattern matching on it to avoid this situation but with the same result:
normalized(number(0),0).
normalized(A,1) :-
A=number(B),
B != 0.
normalized(number(10)).
How would I write normalized properly?
With "variables occuring in the body" actually means in a positive literal in the body. I can recommend the official guide: https://github.com/potassco/guide/releases/
The second thing, ASP is not prolog. Your rules get grounded, i.e. each first order variable is replaced with its domain. In your case A has no domain.
What would be the expected outcome of your program ?
normalized(12351,1).
normalized(my_mom,1).
would all be valid replacements for A so you create an infinite program. This is why 'A' has to be bounded by a domain. For example:
dom(a). dom(b). dom(c). dom(100).
normalized(0,0).
normalized(A,1) :- dom(A).
would produce
normalize(0,0).
normalize(a,1).
normalize(b,1).
normalize(c,1).
normalize(100,1).
Also note that there is no such thing as number/1. ASP is a typefree language.
Also,
normalized(10).
is a different predicate with only one parameter, I do not know how this will fit in your program.
Maybe your are looking for something like this:
dom(1..100).
normalize(0,0).
normalize(X,1) :- dom(X).
foo(43).
bar(Y) :- normalize(X,Y), foo(X).

Mozilla Deep Speech SST suddenly can't spell

I am using deep speech for speech to text. Up to 0.8.1, when I ran transcriptions like:
byte_encoding = subprocess.check_output(
"deepspeech --model deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.1-models.scorer --audio audio/2830-3980-0043.wav", shell=True)
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I would get back results that were pretty good. But since 0.8.2, where the scorer argument was removed, my results are just rife with misspellings that make me think I am now getting a character level model where I used to get a word-level model. The errors are in a direction that looks like the model isn't correctly specified somehow.
Now I when I call:
byte_encoding = subprocess.check_output(
['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm', '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I now see errors like
endless -> "endules"
service -> "servic"
legacy -> "legaci"
earning -> "erting"
before -> "befir"
I'm not 100% that it is related to removing the scorer from the API, but it is one thing I see changing between releases, and the documentation suggested accuracy improvements in particular.
Short: The scorer matches letter output from the audio to actual words. You shouldn't leave it out.
Long: If you leave out the scorer argument, you won't be able to detect real world sentences as it matches the output from the acoustic model to words and word combinations present in the textual language model that is part of the scorer. And bear in mind that each scorer has specific lm_alpha and lm_beta values that make the search even more accurate.
The 0.8.2 version should be able to take the scorer argument. Otherwise update to 0.9.0, which has it as well. Maybe your environment is changed in a way. I would start in a new dir and venv.
Assuming you are using Python, you could add this to your code:
ds.enableExternalScorer(args.scorer)
ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)
And check the example script.

What's the correct interpretation of this byte string?

In a friend's music directory, I came across this path and filename:
Ministry/Κî•Î¦Î‘Î›Î—Îžî˜ (Psalm 69)/Ministry - Κî•Î¦Î‘Î›Î—Îžî˜ (Psalm 69) - 06 - Scarecrow.mp3
You can google Ministry Κî•Î¦Î‘Î›Î—Îžî˜ and get results. If I feed it into a url encoder, I get %C2%9Ai%C2%95i%C2%A6i%C2%91i%C2%9Bi%C2%97i%C2%9Ei%C2%98.
It's clearly mangled in some way by traversing multiple incorrect encode/decode cycles. What is it supposed to be? How did you get that answer?
I've tried various paper and pencil scribblings with UTF-8, but can't figure out anything that makes sense.
It is supposed to be ΚΕΦΑΛΗΞΘ, which is the title of the Ministry album commonly known as Psalm 69. ΚΕΦΑΛΗΞΘ is what it looks like when the UTF-8 encoded ΚΕΦΑΛΗΞΘ is interpreted as Windows-1252.
This is close, but not identical to your Κî•Î¦Î‘Î›Î—Îžî˜ which has îs in place of two of the Îs. My guess for the discrepancies is, given their change and position, somewhere along the way a TitleCase conversion happened as well.
Got there by way of an educated guess, testing, and #Remy's helpful comment.

Negation of osm class or type

If you search for an airport (aeroway=aerodrome) around brescia, italy, you will also receive a hit for a military airfield, which happens to be tagged as an aerodrome also (it's taggged: aeroway=aerodrome, landuse=military, military=airfield). To avoid this I want to search for aeroway=aerodrome but exclude [military]. I've tried [! military] and [military~"^$"]. Any suggestions?
This particular case may be rare, I realize, but the concept of negating multi-classed elements is useful. And multi-classed elements is not a rare occurance. In general, they seem to be complimentary, not conflicting, so it's not an issue. I also realize that I can weed out conflicting hits with some back-end processing. I wasn't expecting a military airfield to appear with a commercial aerodrome.
In any case, here is a shortened version of my query. I include node, way and relation in full query:
http://overpass-api.de/api/interpreter?
data=[out:json][timeout:25][bbox:45.400861,9.868469,45.641408,10.542755];
(node[aeroway~%22aero|term|heli%22][! military]; ... ) out etc
or:
http://overpass-api.de/api/interpreter?
data=[out:json][timeout:25][bbox:45.400861,9.868469,45.641408,10.542755];
(node[aeroway~%22aero|term|heli%22][military~%22^$%22]; ... ) out etc
If you try to run it, you'll need to include way and relation.
Also, as you can see I don't exactly ask for aeroway=aerodrome. I include terminal and variations on heliport. My experience has been that some aerodromes are tagged only as "terminal", so if you're looking for an airport, asking for "aerodrome" isn't enough.
The correct syntax for negation is as follows:
[military !~ ".*"]
Please see the documentation on the OSM wiki for details.

Problem while using NSPredicate

Sql query:
select * from test_mart
where replace(replace(replace(replace(replace(replace(lower(name),'+'),'_'),'the '),' the'),'a '),' a')='tariq'
I can fire following query very easy, if I have to use simply Sqlite... but In current project I am using Core Data so not familiar about NSPredicate much.
The functionality talks about removing all BUT alphanumeric characters, which means removing special characters.
The characters that should be valid in the comparison would be
ABCDEFGHIJKLMNOPQRESTUVWXYZ1234567890
But we should not fail the comparison for the following characters
:;,~`!##$%^&*()_-+="'/?.>,<|\
Or for the following words
'the' 'an' 'a'
Some examples:
'Walmart' would be seen as the same payee as 'Wal-Mart'
'The Shoe Store' would be seen as the same payee as 'Shoe Store'
'Domino's Pizza' would be seen as the same payee as 'Dominos Pizza'
'Test Payee;' would be seen as the same payee as 'Test Payee'
Can any one suggest appropriate Predicates/Regular Expression ?
Thanks
I would have an extra field in the data base which would be a processed version of the original with all the irrelevant characters stripped out. Then use that for comparisons.
You might want to look at the soundex algorithm which may suite your purposes better... Soundex
It seems to me that you would want to normalize your data before it every gets set into the core data store. So if you're given "Wal-Mart", normalize it to "walmart" once, and then save it. Then you won't be doing all of this expensive on-the-fly comparison many many times.
The normalization would be fairly simple, given your rules:
Strip the words "a", "an", and "the"
Remove punctuation