Use wordforms or regexp in Sphinx to force a multiword term to be a "word" - sphinx

Is there a way to "force" Sphinx to index a term such as iphone 5 as a single term? For various reasons I can't search for it as "iphone 5" or iphone NEAR/1 5; I need to search for it as plain iphone 5. The way Sphinx works, that means it searches for both iphone and 5 anywhere in the document, when I want it to match the exact term iphone 5. Can I somehow index iphone 5 as a single term to make this happen?
I still need to be able to apply wordforms/regexp and other mappings to the term, e.g.
iphone 5 > iphone5
That way, if someone searches on iphone5 it will find iphone 5 and vice versa. The issue: while a search on iphone 5 will find iphone5, it will also find Selling 5 iphone 6Gs, whereas a search on "iphone 5" will not find iphone5. So my goal is to turn iphone 5 into a term that is treated as a phrase without needing quotes, and without forcing an exact-phrase search, which would break any additional wordforms/regexp matching.
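For example, something like a regexp_filter, which rewrites both the indexed text and incoming queries so that both spellings collapse to one token (assuming Sphinx 2.1.1+ built against the RE2 library; the pattern is illustrative):
# in sphinx.conf; applied at both index and search time
regexp_filter = \biphone\s+5\b => iphone5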

Do you control the configuration of the index?
If so, you can configure the index with the index_exact_words option.
From the documentation (http://sphinxsearch.com/docs/current.html#conf-index-exact-words):
12.2.42. index_exact_words
Whether to index the original keywords along with the stemmed/remapped versions. Optional, default is 0 (do not index). Introduced in version 0.9.9-rc1.
When enabled, index_exact_words forces indexer to put the raw keywords in the index along with the stemmed versions. That, in turn, enables exact form operator in the query language to work. This impacts the index size and the indexing time. However, searching performance is not impacted at all.
Example:
index_exact_words = 1
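With that enabled (alongside any morphology/wordforms), the exact form operator = becomes available in queries. A sketch in SphinxQL, where the index name products is hypothetical and the =" phrase form needs Sphinx 2.1.1 or later:
select id from products where match('="iphone 5"');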

Related

Mozilla Deep Speech STT suddenly can't spell

I am using Deep Speech for speech-to-text. Up through 0.8.1, when I ran transcriptions like:
import subprocess

byte_encoding = subprocess.check_output(
    "deepspeech --model deepspeech-0.8.1-models.pbmm --scorer deepspeech-0.8.1-models.scorer --audio audio/2830-3980-0043.wav",
    shell=True)
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I would get back results that were pretty good. But since 0.8.2, where the scorer argument was removed, my results are rife with misspellings that make me think I am now getting a character-level model where I used to get a word-level model. The errors are in a direction that suggests the model isn't correctly specified somehow.
Now when I call:
byte_encoding = subprocess.check_output(
    ['deepspeech', '--model', 'deepspeech-0.8.2-models.pbmm', '--audio', myfile])
transcription = byte_encoding.decode("utf-8").rstrip("\n")
I now see errors like
endless -> "endules"
service -> "servic"
legacy -> "legaci"
earning -> "erting"
before -> "befir"
I'm not 100% sure that it is related to removing the scorer from the API, but it is one thing I see changing between releases, and the documentation suggested accuracy improvements in particular.
Short: The scorer maps the letter-level output of the acoustic model to actual words. You shouldn't leave it out.
Long: If you leave out the scorer argument, you won't be able to detect real-world sentences, since the scorer matches the acoustic model's output against the words and word combinations present in the textual language model it packages. Bear in mind, too, that each scorer ships with specific lm_alpha and lm_beta values that make the search even more accurate.
The 0.8.2 version should still accept the scorer argument; otherwise update to 0.9.0, which has it as well. Maybe your environment has changed somehow; I would start over in a new directory and venv.
Assuming you are using Python, you could add this to your code:
ds.enableExternalScorer(args.scorer)                # attach the scorer package
ds.setScorerAlphaBeta(args.lm_alpha, args.lm_beta)  # apply the scorer's tuned weights
And check the example script.
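For reference, a minimal end-to-end sketch using the deepspeech Python package directly rather than the CLI (model and scorer file names are the 0.8.2 release artifacts from the question; the sample WAV is already 16 kHz mono, as the model expects):
import wave
import numpy as np
from deepspeech import Model

ds = Model("deepspeech-0.8.2-models.pbmm")
ds.enableExternalScorer("deepspeech-0.8.2-models.scorer")

# Read raw 16-bit PCM samples from the WAV file
with wave.open("audio/2830-3980-0043.wav", "rb") as w:
    audio = np.frombuffer(w.readframes(w.getnframes()), dtype=np.int16)

print(ds.stt(audio))  # decode with the acoustic model plus the scorer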

How to allow leading wildcards in custom smart search web part (Kentico 10)

I have a custom index for my products and I am using the Subset Analyzer. This analyzer works great, but field searches do not work with it.
For example, I have a document with the following fields:
"documentname", "My-Document-Name"
"tags", "1234,5678,9101"
"documentdescription", "This is a great Document, My-Document-Name."
When I just search "name AND tags:(1234)", I get this document in my results because it searches +_content:name.
-- However:
When I search "documentname:(name)^3.0 AND tags:(1234)", I do not get this document in my results.
Of course, when I do "documentname:(*name*)^3.0" I get a parse error saying: '*' or '?' not allowed as first character in WildcardQuery.
How can I enable leading-wildcard queries in my custom CMS.Search web part?
First of all, you have to make sure the field you are checking is in the index under the proper name. documentname might not be in the index; it can be called _title, depending on how your index is set up. Get Luke (lukeall) and inspect your index (it should be in \CMS\App_Data\CMSModules\SmartSearch\YourIndexName). You can use Luke to test your searches as well.
For example, there is no tags field, but there is a documenttags field.
P.S. Wildcards do work, and you are right that you can't use them as the first character by default (the Lucene documentation says: You cannot use a * or ? symbol as the first character of a search). There is a way to enable that in Lucene.Net (the query parser has an allow-leading-wildcard setting), although I don't know whether Kentico exposes it. But I don't think you need wildcards here, so your query should be (assuming you have documentname and documenttags in the index):
+(documentname:"My-Name" AND documenttags:"tag1")
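If you do end up needing leading wildcards, the Lucene-level switch looks roughly like this (a Lucene.Net sketch only; whether Kentico 10's web part lets you reach the query parser, the exact member name in its bundled Lucene.Net version, and the analyzer variable are all assumptions):
// analyzer: whatever analyzer the index was built with (assumed in scope)
var parser = new Lucene.Net.QueryParsers.QueryParser(
    Lucene.Net.Util.Version.LUCENE_30, "_content", analyzer);
parser.AllowLeadingWildcard = true;  // allow *name / ?name terms (can be slow)
var query = parser.Parse("documentname:(*name*)^3.0 AND tags:(1234)");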

Custom 'stemming' in Sphinx with Wordforms?

I've found stem_en and the lemmatizer to be either too limiting or too inclusive for my needs. Can I do custom stemming with wordforms? Either full words, e.g.
Proctology > Proctologist
but ideally stems:
ology > ologist
No. Wordforms work on whole words, not endings.
You would have to either use regexp_filter, or develop your own morphology processor (as C code).
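For the suffix case specifically, a regexp_filter can fold one ending onto the other at both index and search time (a sketch, assuming Sphinx 2.1.1+ built with RE2; the pattern is illustrative, not production-ready):
# in sphinx.conf: proctologist, proctologists -> proctology, etc.
regexp_filter = (\w+olog)ists?\b => \1y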

Sphinx Search term boost

Is there a way I can add a weight to each word in my query?
I need to do something like this (Lucene query):
"word1^50|word2^45|word3^25|word4^20"
All the answers I found online are old, and I was hoping this has changed.
UPDATE:
Sphinx introduced term boosting in version 2.2.3: http://sphinxsearch.com/docs/current/extended-syntax.html
Usage:
select id,weight() from ljplain where match('open source^2') limit 2 option ranker=expr('sum(max_idf)*1000');
No, nothing has really changed. The same old workarounds should still work, though.
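Applied to the Lucene-style query above, the 2.2.3+ syntax would look something like this (the index name myindex is hypothetical; the ranker expression mirrors the documentation example above):
select id, weight() from myindex
where match('word1^50 | word2^45 | word3^25 | word4^20')
option ranker=expr('sum(max_idf)*1000');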

Core data assertion with "IN" clause in iOS 3.0

Long story short: the iPhone app crashes when trying to retrieve a set of data using a group of IDs. I have a set of records tied to a user; I look up all records where recordID matches any entry from user.recordIDs, and it crashes with the error:
unimplemented SQL generation for predicate : (recordID IN {name (user.record.recordID) by user (userID) ...)
I'll open by saying: yes, I already found this: http://cubic-m.blogspot.com/2010/03/supporting-leopard-while-developing-in.html (iOS 3.0 SQL generation does not support "IN" clauses using NSSets; you must use NSArrays).
The predicate is in the form:
(recordID IN %@.recordID)
where "%@" is user.records (either a set or an array, per the article above).
That's all well and good, and it seemed to fix most of my application's crashes -- however, it only fixes the crashes for 3.x > 3.0. That is to say, it still doesn't fix the issue on 3.0 firmware. If anyone has any suggestions as to the nuances of early Core Data, please help!
Not sure what the actual problem was here, but the fix was thus:
If we are running iOS 3.0, create an NSArray with all of the key objects I was looking for (here, recordIDs), and pass that to [NSPredicate predicateWithFormat:@"recordID IN %@", recordIDs] instead of using [NSPredicate predicateWithFormat:@"recordID IN %@.recordID", records].
Again, not sure what the actual problem was, but this workaround fixed it.
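In other words (a sketch; user.records is assumed to be an NSSet of managed objects with a recordID property, as in the predicates above):
// Flatten the relationship into a plain NSArray of IDs first...
NSArray *recordIDs = [[user.records allObjects] valueForKey:@"recordID"];
// ...then hand the predicate a simple collection-membership test.
NSPredicate *p = [NSPredicate predicateWithFormat:@"recordID IN %@", recordIDs];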