Sphinx wordform file for English

Is there an existing wordforms file for English irregular verbs?
That is, I want Sphinx to return the same results for "find" and "found".
It seems that I need a wordforms file, because setting morphology = stem_en does not seem to handle irregular verbs.
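For reference, a wordforms file mapping irregular forms to a base verb looks something like the sketch below (the three entries and the file path are just illustrations; you would need to source a complete irregular-verb list yourself):

found > find
ran > run
went > go

It is then wired into the index in sphinx.conf alongside the stemmer; words matched by the wordforms file bypass morphology, so regular words still get stemmed:

index my_index
{
    ...
    morphology = stem_en
    wordforms  = /path/to/irregular_verbs.txt
}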

postgresql fulltext returning wrong results

I'm using a PostgreSQL full-text tsvector column, but I've found a problem.
When I search for "calça", the results contain the following:
1- calça red
2- calça blue
3- calçado red
Why is "calçado" being returned when I search for "calça"? Is there any configuration change that would solve this?
Thanks.
It isn't just a matter of one string containing the other. The Portuguese stemmer thinks this is the way they should be stemmed. If you turn the longer word into 'calçadot', for example, it no longer stems it, because (presumably) 'adot' is not recognized as a Portuguese suffix that ought to be removed the way 'ado' is.
If you don't want stemming at all, you could change the config to 'simple', which doesn't stem. But at that point, maybe you don't want full-text search at all and could just use LIKE with a pg_trgm index instead.
If it is just this particular word that you don't want stemmed, I think you can set up a synonym dictionary that maps calçado to itself, which will bypass stemming.
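You can watch the stemmer do this directly with ts_lexize; a quick sketch (the exact lexeme may vary by PostgreSQL version):

SELECT ts_lexize('portuguese_stem', 'calça');
SELECT ts_lexize('portuguese_stem', 'calçado');
-- both typically return the same lexeme, which is why both rows match

And a minimal sketch of the synonym-dictionary approach; the configuration, dictionary, and file names here are assumptions, and the .syn file would contain the single line "calçado calçado":

CREATE TEXT SEARCH DICTIONARY pt_syn (
    TEMPLATE = synonym,
    SYNONYMS = pt_synonyms  -- reads $SHAREDIR/tsearch_data/pt_synonyms.syn
);
CREATE TEXT SEARCH CONFIGURATION pt_custom ( COPY = portuguese );
ALTER TEXT SEARCH CONFIGURATION pt_custom
    ALTER MAPPING FOR asciiword, word
    WITH pt_syn, portuguese_stem;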

Pentaho spoon search and replace special characters in rows

I have a CSV file with MIME type US-ASCII, and one column in the dataset looks like this:
id      V_name
210001  cha?ne des Puys
210030  M?los
213004  G?ll?
213021  S?phan
221110  Afd?ra
And so on.
I would like to change those characters to:
id      V_name
210001  chaine des Puys
210030  Milos
213004  Gollu
213021  Suphan
221110  Afdera
The thing is that there are 95 rows of this kind. How can I search and replace those rows?
I am using the PDI suite (Spoon).
Thanks in advance.
As @Iłya Bursov has stated, the source file you are reading doesn't provide the correct characters; the ? is already in the source, so if you want to correct it you'll have to do it manually.
I don't think it is worth it unless you know you will always get the same set of V_name values over time across different files. In that case you could create a file that correlates each V_name containing ? characters with a V_name_corrected holding the correct display form. Since this seems to be an exercise, I would leave the names as they are. In real life, I would contact the provider of the file with the incorrect character set and tell them to fix the file generation so that it provides the correct characters.
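For what it's worth, the correlation file could be a small CSV like the sketch below, joined to the stream on V_name with a Stream lookup step (the file layout and step choice are just one way to do it):

V_name;V_name_corrected
cha?ne des Puys;chaine des Puys
M?los;Milos
G?ll?;Gollu
S?phan;Suphan
Afd?ra;Afdera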

Sphinx search: multi-term wordforms not indexed correctly

I'm having an issue with specific entries in my wordforms file that are not being interpreted as expected.
Here are a couple of examples:
1/48 > forty-eighth
1/96 > ninety-sixth
As you can see, these entries contain both slashes and hyphens, which may be related to my issue.
For some reason, Sphinx doesn't correctly equate each fraction to the spelled out version. Search results for "1/48" are not the same as for "forty-eighth", as they should be. In other words, the mapping between these equivalent forms is not working.
In my Sphinx config, I have the forward slash (/) set as a blend character, so I assume that the fraction is being recognized properly.
In support of that belief, the following wordforms entry does work correctly:
1/4 > fourth
Does anyone have any idea why my multi-term synonyms would not be working as expected? I have tried replacing the hyphen with a space, but this doesn't change the result at all. Would it help to change the order of the terms (i.e., on which side of the ">" they should be placed)?
Thank you very much for any help.
When using special characters in Sphinx, it is always good to keep the following in mind:
By default, the Sphinx tokenizer handles unknown characters as whitespace
https://sphinxsearch.com/blog/2014/11/26/sphinx-text-processing-pipeline/
That has given me weird results too when using wordforms.
I would suggest you add the hyphen to charset_table so that ninety-sixth becomes one word. ignore_chars is also an option, but then you will be searching for ninetysixth instead.
Much depends on the rest of your dataset and use cases, of course.
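As a sketch, the relevant parts of the index definition could look like this (the rest of your charset_table will differ; U+2D is the hyphen and U+2F the forward slash):

index my_index
{
    ...
    charset_table = 0..9, A..Z->a..z, _, a..z, U+2D
    blend_chars   = U+2F
    wordforms     = /path/to/wordforms.txt
}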

How to use Unicode characters in Eclipse File Search?

One of our XML files contains an invalid character, and the program reports neither which file it is, nor the line number or character offset. It would be a few seconds' work to fix the problem if I could just search for exactly that character, but I cannot find how to express a Unicode character in the file search (or at least I assume so, since the search returns nothing).
Neither 0x1e nor \u001e seems to match anything.
[EDIT] I mean, I can still change the code and eventually find which file it is by catching the exception, then use some kind of script/tool to find where exactly the character is, but I do believe it should be possible to search for Unicode in Eclipse, and that is what I am asking in this question.
It may be a problem with the character encoding.
As you're going to need to perform a global / site-wide search to find the offending character, you'll probably need to set the global text file encoding:
Preferences -> Workspace -> Text file encoding
This option may be under the 'General' section in Eclipse, depending on your setup and installed plugins etc.
Ensure that the encoding is set to UTF-8.
You will also need to escape Unicode character sequences, like so:
\u2665
(which I see you have tried)
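One more thing worth trying, assuming I remember the dialog correctly: the Search -> File Search dialog has a "Regular expression" checkbox, and with it ticked the pattern is treated as a Java regular expression, so the control character can be written as:

\x{1E}

(\u001E should work there as well, since both are Java regex escapes for the same code point.)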

Searching for hash tags via Thinking Sphinx

Is it possible to search for hash tags via thinking_sphinx? I can't find a solution.
I need to find only titles containing a hash tag: "title#text", "#text", etc.
You'll want to make sure Sphinx is indexing the hash character - which is done via the charset_table setting. Thinking Sphinx finds this value in config/sphinx.yml (create it if you haven't already), which is set up via environments, much like config/database.yml.
development:
  charset_table: "0..9, A..Z->a..z, _, a..z, \#, U+410..U+42F->U+430..U+44F, U+430..U+44F"
All other characters and character ranges listed are Sphinx's default set for UTF-8, which is what Thinking Sphinx uses by default.
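Once that's in place, rebuild the index (rake ts:rebuild) and search as usual; a quick sketch from the Rails console, where Post is an assumed model indexed on its title field:

Post.search "#text"  # should now match "title#text", "#text", etc.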