Sphinx with metaphone and wildcard search

We are an anatomy platform and use Sphinx for our search. We want to make our search fuzzier and have started to use metaphone to correct spelling mistakes. For example, it finds phalanges even though the search word is falanges.
That's good, but we want more: the user should be able to type in falange or even falang and still find phalanges. Any ideas how to accomplish this?
If you are interested, you can check out our sphinx config file here.
Thanks!

Well you can enable both metaphone and min_prefix_len on an index at once. It will sort of work.
falange*
might then just work (to match phalanges).
The problem is that the 'stripped' letters may change the 'sound' of the word (because they change the pronunciation).
e.g. falange becomes FLNJ, but falang actually becomes FLNK - so they are no longer 'substrings' of one another (i.e. phalanges becomes FLNJS, which FLNK* won't match).
... to be honest I don't know a good solution. You could perhaps get better results if you applied stemming BEFORE metaphone (so the endings that change the pronunciation of the words are removed).
Alas Sphinx can't do this. If you enable stemming and metaphone together, only ONE of the processors will ever fire.
Two possible solutions: implement stemming outside of sphinx (or maybe with regexp_filter - not sure if, say, a Porter stemmer can be implemented purely with regular expressions),
or modify sphinx so that ALL morphology processors apply (rather than just the first one that changes the word).
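For reference, a rough sketch of what the combined setup could look like in sphinx.conf (the index and source names are placeholders, regexp_filter needs Sphinx built with RE2 support, and the suffix-stripping pattern is only a crude illustration of the "stem before metaphone" idea):

index anatomy
{
    source          = anatomy_src
    path            = /var/lib/sphinx/data/anatomy

    morphology      = metaphone
    min_prefix_len  = 3
    # older Sphinx versions may also need enable_star = 1 for wildcard queries

    # crude pre-tokenization 'stemming': strip a trailing -es/-e/-s so that
    # falange, falanges and phalanges all reach metaphone with a similar ending
    regexp_filter   = (\w+?)(?:es|e|s)\b => \1
}

Whether this is good enough really depends on your vocabulary, so test it against real queries; as noted above, it is only a partial workaround.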

Related

What is "dont" and "isnt" in the pertained GloVe vector files (e.g. glove.6B.50d.txt)?

I found these 2 words, "dont" and "isnt", in the vector file glove.6B.50d.txt downloaded from https://nlp.stanford.edu/projects/glove/. I wonder if they were originally "don't" and "isn't". This will likely depend on the sentence_to_word parsing algorithms they used. If someone is familiar with this, please confirm whether that is the case.
A secondary question is whether this is a common way to deal with the apostrophe in words like "don't", "isn't", "hasn't" and so on, i.e. just replacing the apostrophe with an empty string so that "don" and "t" become one word.
Finally, I am also not sure whether GloVe comes with an API for sentence_to_word parsing, so you can be consistent with what the researchers did originally.
I think dont and isnt really were originally don't and isn't. I have seen a few other such examples. I suspect this is just the specific way the GloVe researchers handled this.
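If you want to check for yourself, the vector files are plain text with the token as the first space-separated field on each line, so a quick scan is enough (a sketch; the token list here is just an illustrative sample):

# look for apostrophe-less contraction tokens in the downloaded file
targets = {"dont", "isnt", "wont", "hasnt", "don't", "isn't"}
found = []
with open("glove.6B.50d.txt", encoding="utf-8") as f:
    for line in f:
        token = line.split(" ", 1)[0]  # first field is the token, the rest are the 50 floats
        if token in targets:
            found.append(token)
print(found)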

Sphinx search configuration for words ending with apostrophes

I am trying to improve my Sphinx configuration and I am having trouble with words ending in apostrophes.
For example, for a Surfin' USA record, searching for "Surfin USA" returns a match but "Surfing USA" doesn't return anything. How can I set up Sphinx to return a result in such a situation?
Hmm, that's an interesting one. I'm not sure sphinx can automatically deal with this, because it has no way of knowing what the apostrophe is meant to represent. I suppose there are cases where it could stand for multiple things.
The only way I can think of would be to list them in exceptions; you can build a list of all the words you want to support:
Surfin' > Surfing
You have to use exceptions to be able to use the apostrophe.
You might want to add
Surfin > Surfing
too, so the word can also be found without the apostrophe.
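A sketch of how that could be wired up (the file path is a placeholder, and note that the exceptions file separator is => in the Sphinx versions I have seen, so check your version's documentation):

# sphinx.conf, inside the index definition
exceptions = /usr/local/sphinx/data/exceptions.txt

# exceptions.txt, one mapping per line
Surfin' => Surfing
Surfin => Surfing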

How to find literals in source code of Smartforms and in SAPScripts (or reports, if the others can't be done)

I'd like to check hardcoded values in (a lot of) Smartforms and SAPScript forms.
I have found a way to read the source code of both of these, but it seems that I will have to do a lot of parsing before I get anything reliable.
I've come across the function module GET_LITERAL, but that doesn't seem to help me much since I have to specify the offset of the value, if I've understood correctly what the function is doing in the first place.
I also found RS_LITERAL_LIST, but that doesn't do what I expect either.
I also tried searching for reports and methods, but haven't found anything that seemed to help.
A backup plan would be to get a good parsing tool, so do you know of anything like that?
Anyway, any hints would be helpful and appreciated.
[EDIT]
Forgot to mention, the version of my system is 4.6C
If you have a fairly recent version of ABAP, you can use a regex.
Follow the pattern of this example, but use your source as the text and create your own regex. Have it look for single quotes at the ends of a word separated by spaces, or any integers with spaces on either side. That's just a start; you might need to work on a better pattern.
String functions count, find, and match
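Just to illustrate the kind of pattern, here is a rough sketch in Python (the source line and the patterns are only examples; the same expressions can be ported to ABAP's FIND ... REGEX on newer releases):

import re

source_line = "WRITE: / 'Hardcoded text', 42, lv_amount."

# quoted character literals; ABAP doubles '' inside a literal, which this allows for
literals = re.findall(r"'(?:[^']|'')*'", source_line)

# standalone numeric literals not glued to an identifier or a quote
numbers = re.findall(r"(?<![\w'])\d+(?![\w'])", source_line)

print(literals)  # ["'Hardcoded text'"]
print(numbers)   # ['42']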

What's a Good package for Phonetic Representation for Various Human Languages?

I'm currently working on a project for which I think being able to come up with phonetic representations of words in various languages would be really helpful. I know Aspell does this pretty well, but I don't think there's a very easy way to get at their phonetic representations, so I ask: is there some other good package for getting the phonetic representation of a word given the word and the language/dialect/accent/whatever it's coming from?
This doesn't need to be in any particular language, but if it were Perl, that would be best.
I've already tried Soundex, Metaphone, DoubleMetaphone, and everything else in Text::Phonetic, and none of that stuff was very good – definitely nowhere near as good as the stuff in Aspell.
The first thing that springs to mind is Soundex. Of course, there is a Perl module Soundex, too. While this is designed to generate a soundex "key" from input, it might be useful in mapping different variants to a common key.
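If leaving Perl is an option, the Python jellyfish package bundles several of these algorithms and makes the "common key" idea easy to see (a sketch only; as the question notes, these coders may still fall short of Aspell's suggestions):

import jellyfish

# Soundex maps similar-sounding names to the same key (Robert and Rupert both give R163)
for name in ("Robert", "Rupert", "Rubin"):
    print(name, jellyfish.soundex(name), jellyfish.metaphone(name))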
There is a package Text::Aspell in CPAN. Might be useful.
If you are trying to make a Google-style suggestion/correction system, it's not based on just phonetics or AI, but on a massive amount of user input. When a user makes a search and doesn't click on any link but corrects the input and searches again, that gives Google far more data about "correct" writing than a phonetics test or dictionary matching would.
The main problem is in human language itself: people don't speak or write in a deterministic way, let alone across multiple languages.
Of course, I might be wrong, but if you need a library that lets you do this:
getLanguage(string);
I want to see that working, really.

Reliably getting a character count for .doc files

What's a reliable way to automatically count the characters and/or words in a .doc or .docx file?
The only real requirement is a reasonably accurate and reasonably reliable count.
It needs to work with documents containing something other than Latin script, where counting characters (rather than words) is good enough for most cases.
The count does not necessarily need to match Word's, but the closer the better.
Since there are a gazillion different apps that can generate .doc files, it's okay to fail to count anything, but this case needs to be catchable so we're aware that a count may be inaccurate. For all other cases the count must be, say, at least 99% accurate at least 99% of the time.
I'm open as to the involved technologies, but something that can run on a *NIX command line would be greatly preferred.
Is there a reasonable solution for this?
Here's a link to some Linux word-to-text converters.
For example you could use
antiword file.doc | wc
to do the counting.
Edit:
This link shows that AbiWord has a command-line interface that you could use to convert the .docx format to .txt and then count the words using "wc". AbiWord does support the docx format.
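Something along these lines should work (the --to flag is from memory, so double-check against your AbiWord version; wc -m counts characters, wc -w counts words):

abiword --to=txt file.docx
wc -m file.txt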
Mac OS X has support for reading word files built into the system frameworks, so if you have that, it's easy. MacRuby sample:
NSSpellChecker.sharedSpellChecker.countWordsInString(NSAttributedString.alloc.initWithURL(fileURL, documentAttributes:nil), language:nil)
More portably — though it gives up support for docx — you could simply get Antiword and do antiword | wc -w.
Microsoft has published a specification for the Office binary file formats. Parsing a .DOC file doesn't look trivial, but with some care you should be able to get a dependable, repeatable result. I have no idea how closely it'll match with what Word shows -- that will probably depend (at least partly) on how you define "word" -- for example, whether you consider a group of digits a "word" or not. It probably won't take a lot to figure out how Word treats cases like that, so getting a close match shouldn't be terribly difficult.
If you consider online applications as a solution, yes, there is a solution.
This site, not so pretty design-wise, offers both word and character counts: http://allworldphone.com/count-words-characters.htm
I don't think there is a limit, and it shouldn't be a problem to just copy/paste the contents of your documents into the corresponding textarea and see the result.
Regarding the 100% or 99% accuracy, you could test it with a few short texts (say, 20-50 words) by counting them yourself first.
I hope this helps.
Regards, Chris