Using Start/End Modifier search with wordforms issue - sphinx

If I search on a table with a Name field using "^Word$" it will find it.
If I have a Wordform in a Word1 Word2 > Word3 construction e.g.
United States of America > USA
the same query will work. However if I do the same wordform in reverse e.g. Word3 > Word1 Word2:
USA > United States of America
Then it is not found using the same start/end modifier. However, my habit is to do Word1 > Word2 Word3 so that Word2 and Word3 can still be found in a search, which won't work the other way.
Is there a way to set up the Start/End modifier search so that it still finds W1 > W2 W3?

The only suggestion I have is to use regexp_filter to do the expansion, rather than wordforms.
regexp_filter = \bUSA\b => United States of America
or similar. A side benefit is more control over capitalization (e.g. only expand uppercase USA).
This means the expansion happens much earlier in the tokenization process, so it has less effect on extended query syntax.
In theory a query
"^Word$"
should then be turned into
"^United States of America$"
which still works :)
I think wordforms don't work because America$ will have been put into the index as a keyword, but the query is looking for both ^ and $ on the one word.
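To illustrate why expanding before tokenization leaves the ^ and $ modifiers intact, here is a minimal Python sketch that mimics the rewrite rule on a raw query string. It is not Sphinx code: the RULES list and the expand function are hypothetical names, and Python's re module stands in for whatever regex engine Sphinx uses.

```python
import re

# Hypothetical stand-in for the regexp_filter rule quoted above.
# The rule is case-sensitive, so only uppercase "USA" is expanded.
RULES = [(re.compile(r"\bUSA\b"), "United States of America")]


def expand(text: str) -> str:
    """Apply each rewrite rule in order, as a pre-tokenization pass."""
    for pattern, replacement in RULES:
        text = pattern.sub(replacement, text)
    return text


if __name__ == "__main__":
    # The modifiers sit outside the rewritten span, so they survive.
    print(expand('"^USA$"'))
```

Because the rewrite only touches the word itself, the ^ ends up in front of the first expanded word and the $ after the last one, which is the form the answer expects to keep working.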

Related

How to search for exact match for non-word string in MS office word?

For example, I have to search for "2/2" and I get these results:
2/2 good result
11/2/2022 bad result
When I use Advanced Find in MS Word, the "find the whole word" option is disabled because my search contains special characters.
I also tried a wildcard search, but I didn't know how to construct a good search string for that.
Is there a way to find an exact match for a non-word search?
You could use wildcards, with:
.Text = "<2/2>"
With this expression, the < and > specify a word start & end, respectively.
Even that, however, is prone to false matches with strings like 2/2/2022, since the / after the 2/2 designates a word end. Clearly, your code would need some additional checking to verify whether a given match is valid. Such checking might be done via a Find expression like:
.Text = "[!/0-9A-Za-z]2/2[!/0-9A-Za-z]"
This tells Word to find 2/2 only when the preceding and following characters are not a /, a number, or a letter. Then, to manipulate the found text, you'd simply move the start of the found range forward one character and the end of the found range back one character.
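The same "not preceded or followed by a /, digit, or letter" idea can be sketched outside Word with an ordinary regular expression. The Python below is only an analogy to the wildcard expression above, not Word's own engine, and find_exact is a hypothetical helper name. It uses lookarounds instead of consuming the neighbouring characters, so the returned spans need no trimming.

```python
import re


def find_exact(text: str, needle: str) -> list[tuple[int, int]]:
    """Return (start, end) spans of needle not touching '/', digits, or letters."""
    pattern = re.compile(
        r"(?<![/0-9A-Za-z])" + re.escape(needle) + r"(?![/0-9A-Za-z])"
    )
    return [match.span() for match in pattern.finditer(text)]


if __name__ == "__main__":
    for line in ("2/2 good result", "11/2/2022 bad result"):
        print(line, find_exact(line, "2/2"))
```

One difference worth noting: the Word expression requires an actual character on each side, so it would miss a 2/2 at the very start or end of the document, whereas the lookarounds in this sketch accept those positions.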

Sphinx wordforms Street/Saint?

I tried to wordform
St > Street
St. > Saint
Yet when I tried to rotate the index, it thinks I am duplicating "St". How can I tell it that in this one case I mean the period to be part of the wordform? I'm gathering it is one of the non-indexed characters, so it sees "St" as equal to "St.".
You would need . in your charset_table, to be able to use it in wordforms.

Search on last word in a field using Sphinx?

I'm using SphinxQL to prepare Sphinx searches (in fact part of a NOT operator) but am unable to do something that is pretty simple with MySQL: like '% Word'. I simply need to know when a specific word is the last one in the field/string, but SphinxQL doesn't seem to lend itself to that.
The quick brown fox jumped over the lazy dog.
Lazy Dog day afternoons
I'm essentially looking to search on
select Description from idx_Table WHERE (MATCH('@(Description) Fox Dog (not like '% Dog'))
I get that the above is not proper SphinxQL at all but is essentially what I am trying to achieve.
There is a field-end modifier, so you can specifically match the last word in a field.
... WHERE MATCH('@(Description) Fox Dog$')
Will only get you matches where the last word is Dog. Use a quoted phrase if you want the last two (or more!) words.
... WHERE MATCH('@(Description) "Fox Dog$"')
But there is still no assertion that says: match this, EXCEPT when it's the last word.
... WHERE MATCH('@(Description) Fox Dog -Dog$')
would execute, but may well exclude 'valid' matches.
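Since the query syntax has no "except when it is the last word" assertion, one fallback is to fetch the candidate rows and apply that condition client-side. The Python sketch below mirrors the `not like '% Dog'` condition from the question. It is an assumption-laden illustration: tokenize is a crude stand-in for Sphinx's tokenizer (it ignores charset_table, wordforms, and so on), and matches_unless_last is a hypothetical helper name.

```python
import re


def tokenize(text: str) -> list[str]:
    """Crude stand-in for Sphinx tokenization: lowercase runs of word characters."""
    return re.findall(r"\w+", text.lower())


def matches_unless_last(text: str, required: list[str], not_last: str) -> bool:
    """True if every required word occurs and not_last is not the final word."""
    tokens = tokenize(text)
    if not tokens:
        return False
    has_all = all(word.lower() in tokens for word in required)
    return has_all and tokens[-1] != not_last.lower()


rows = [
    "The quick brown fox jumped over the lazy dog.",
    "Lazy Dog day afternoons",
]
# Keep rows that mention "dog" anywhere except as the final word.
kept = [row for row in rows if matches_unless_last(row, ["dog"], "dog")]

if __name__ == "__main__":
    print(kept)
```

This costs an extra pass over the result set, so it is only reasonable when the full-text query has already narrowed the rows down.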

Field position limit in Sphinx to *start* search at character position?

As far as I can tell, "Field position limit" in Sphinx only allows you to force the search to the first N characters in a document. Is there any way to use it to force the search AFTER the first N characters instead?
The quick brown fox jumped over the lazy dog and he was crazy as a fox and just as fast
Fox[20]
will find the first fox and not the second.
What I am looking for is something like
Fox[50] that won't start search until char 50 ("and he was crazy as a fox and just as fast")
Well, you could say
"bla bla" @field[50] -"bla bla"
But you have the old problem of it also excluding items that have it after as well as before.
Otherwise, I think you will have to look at ranking expressions; there is min_hit_pos, which you can use. You would have to use the ranking expression to change the ranking calculation, and then 'post filter' based on the weight. You can use the weight in WHERE, via virtual attributes.
(This won't work either, see comments!)
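Given that neither query-side approach is confirmed to work, a client-side post-filter on the fetched text is one possible fallback. The Python sketch below checks whether a keyword occurs at or after a given character offset, which is the behaviour the question asks for. Note that it works in characters, as the question does, whereas Sphinx's field position limit is documented in terms of keyword positions. The occurs_after name is hypothetical.

```python
import re


def occurs_after(text: str, keyword: str, min_char: int) -> bool:
    """True if keyword appears as a whole word starting at or after min_char."""
    pattern = re.compile(r"\b" + re.escape(keyword) + r"\b", re.IGNORECASE)
    return any(match.start() >= min_char for match in pattern.finditer(text))


if __name__ == "__main__":
    text = (
        "The quick brown fox jumped over the lazy dog "
        "and he was crazy as a fox and just as fast"
    )
    print(occurs_after(text, "fox", 50))
    print(occurs_after(text, "dog", 50))
```

Unlike the exclusion query above, this keeps a document that contains the keyword both before and after the offset, because it only asks whether at least one occurrence lies past it.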

Perl part-of-speech tagging: need tag set for Lingua::EN::Tagger

So, I want to use Lingua::EN::Tagger, but I can't seem to find the domain of tags (aka the parts of speech, i.e. MD, NN, etc.).
I found something similar to what I want here: http://engtagger.rubyforge.org/
but this is for ruby, not perl, and I'm not sure if the tags are going to be the exact same set.
Thanks in advance.
The tag set is available in the README from Lingua-EN-Tagger-0.23 (http://cpansearch.perl.org/src/ACOBURN/Lingua-EN-Tagger-0.23/README)
Lingua::EN::Tagger
This module uses part-of-speech statistics from the Penn Treebank
to assign POS tags to English text. The tagger applies a bigram (two-word)
Hidden Markov Model to guess the appropriate POS tag for a word. That means
that the tagger will try to assign a POS tag based on the known POS tags
for a given word and the POS tag assigned to its predecessor.
The tagger tends to assume unknown words are nouns, but this behavior is
configurable.
The POS tagger can also be used to find maximal noun phrases in tagged text.
You can also use this module to extract all nouns and/or noun phrases.
TAG SET
----------------------------------------------------------------
The set of POS tags used here is a modified version of the
Penn Treebank tagset. Tags with non-letter characters have been
redefined to work better in our data structures. Also, the
``Determiner'' tag (DET) has been changed from `DT', in order to
avoid confusion with the HTML tag, <DT>.
-----------------------------------------------------------------
CC      Conjunction, coordinating               and, or
CD      Adjective, cardinal number              3, fifteen
DET     Determiner                              this, each, some
EX      Pronoun, existential there              there
FW      Foreign words
IN      Preposition / Conjunction               for, of, although, that
JJ      Adjective                               happy, bad
JJR     Adjective, comparative                  happier, worse
JJS     Adjective, superlative                  happiest, worst
LS      Symbol, list item                       A, A.
MD      Verb, modal                             can, could, 'll
NN      Noun                                    aircraft, data
NNP     Noun, proper                            London, Michael
NNPS    Noun, proper, plural                    Australians, Methodists
NNS     Noun, plural                            women, books
PDT     Determiner, prequalifier                quite, all, half
POS     Possessive                              's, '
PRP     Determiner, possessive second           mine, yours
PRPS    Determiner, possessive                  their, your
RB      Adverb                                  often, not, very, here
RBR     Adverb, comparative                     faster
RBS     Adverb, superlative                     fastest
RP      Adverb, particle                        up, off, out
SYM     Symbol                                  *
TO      Preposition                             to
UH      Interjection                            oh, yes, mmm
VB      Verb, infinitive                        take, live
VBD     Verb, past tense                        took, lived
VBG     Verb, gerund                            taking, living
VBN     Verb, past/passive participle           taken, lived
VBP     Verb, base present form                 take, live
VBZ     Verb, present 3SG -s form               takes, lives
WDT     Determiner, question                    which, whatever
WP      Pronoun, question                       who, whoever
WPS     Determiner, possessive & question       whose
WRB     Adverb, question                        when, how, however
PP      Punctuation, sentence ender             ., !, ?
PPC     Punctuation, comma                      ,
PPD     Punctuation, dollar sign                $
PPL     Punctuation, quotation mark left        ``
PPR     Punctuation, quotation mark right       ''
PPS     Punctuation, colon, semicolon, elipsis  :, ..., -
LRB     Punctuation, left bracket               (, {, [
RRB     Punctuation, right bracket              ), }, ]
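If the tags need to be compared against tools that use the conventional Penn Treebank names, a small lookup can translate the renamed ones back. The Python sketch below is not part of Lingua::EN::Tagger. Only the DET to DT mapping is stated in the README; the other entries are my inference from the descriptions in the table, and TO_PENN and to_penn are hypothetical names.

```python
# Hypothetical mapping from the modified tags back to Penn Treebank names.
# Only "DET" -> "DT" is stated in the README; the rest are inferred.
TO_PENN = {
    "DET": "DT",
    "PRPS": "PRP$",
    "WPS": "WP$",
    "PP": ".",
    "PPC": ",",
    "PPD": "$",
    "PPL": "``",
    "PPR": "''",
    "PPS": ":",
}


def to_penn(tag: str) -> str:
    """Translate a modified tag to its Penn Treebank name; pass others through."""
    key = tag.upper()
    return TO_PENN.get(key, key)


if __name__ == "__main__":
    print([to_penn(tag) for tag in ("det", "NN", "prps", "ppc")])
```

Tags that were not renamed, such as NN or VBZ, pass through unchanged apart from being upper-cased.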