Accent and gender issues for full text search in PostgreSQL

I have several different inputs that should all be stemmed to the same output:
SELECT to_tsvector('french', 'fermier'),
to_tsvector('french', 'fermièr'),
to_tsvector('french', 'fermiér'),
to_tsvector('french', 'fermiere'),
to_tsvector('french', 'fermière'),
to_tsvector('french', 'fermiére')
-- Output
'fermi':1 'fermier':1 'fermier':1 'fermier':1 'fermi':1 'fermier':1
Ignoring accents would be possible, but it would still collapse the output to two different options: fermier or fermi.
For information, the only difference between fermier and fermière is gender: the former is masculine and the latter is feminine.
Thus, the issue is that the feminine form is stemmed to the masculine form, which is itself stemmed to fermi.
I don't understand why the stemming is not invariant with respect to gender here.
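For reference, a sketch of the accent-ignoring approach mentioned above: chain the unaccent dictionary into a custom configuration (the name fr_unaccent is an assumption) so accents are stripped before stemming. As noted, this still leaves the two stems fermier and fermi:
CREATE EXTENSION IF NOT EXISTS unaccent;
-- Copy the french configuration, but run word tokens through unaccent
-- before they reach the French stemmer
CREATE TEXT SEARCH CONFIGURATION fr_unaccent (COPY = french);
ALTER TEXT SEARCH CONFIGURATION fr_unaccent
    ALTER MAPPING FOR hword, hword_part, word
    WITH unaccent, french_stem;
SELECT to_tsvector('fr_unaccent', 'fermière');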

Related

Using unaccent with two different rules

The German language uses the diacritical characters ä, ö, ü. For international use, they are transliterated into ae, oe, ue (not a, o, u). This means that Müller appears as Mueller on his ID document. This is what we get when we read the document with (for example) a passport reader, and this is what we save to the database table.
In the next step we search for the records. We do it in two ways:
by entering the search data with a passport reader (no problem here)
by entering the search data manually
With manual entry there is a small problem, because the user may enter the data the international way ('Mueller') or the popular way ('Müller').
This problem can be solved by using the PostgreSQL extension unaccent and modifying the unaccent.rules file, so that whether the user enters 'Mueller' or 'Müller', we search the database for Mueller.
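For illustration, the relevant entries in a modified unaccent.rules might look like this (a sketch; each line maps a character to its replacement, and the file lives in the tsearch_data directory under pg_config --sharedir):
ä	ae
ö	oe
ü	ue
Ä	Ae
Ö	Oe
Ü	Ue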
So far so good...
BUT
in the same table we also have names of other origins, for example Turkish ones. Turkish transliterates its umlauts (ä, ö, ü) directly into a, o, u, and this is how they are saved on the documents, so Müller would be Muller on a Turkish document. This causes a problem because (as described before) we search with the German unaccent.rules and we don't find the people we are searching for.
Long story, but finally the question...
...does anybody have any idea how to handle this?
Is there any way to have two unaccent.rules files and combine them with OR? For example:
Select * from table
where last_name = unaccent('Müller' (use German rules))
or last_name = unaccent('Müller' (use Turkish rules))
(I know that the above does not work, but maybe there is something similar we could use)
regards
M
The solution should be simple. Define your German unaccent dictionary (I'll call it entumlauten), then query like this:
SELECT ...,
       last_name = unaccent('unaccent', 'Müller') AS might_be_turkish,
       last_name = unaccent('entumlauten', 'Müller') AS might_be_german
FROM tab
WHERE last_name IN (unaccent('unaccent', 'Müller'),
                    unaccent('entumlauten', 'Müller'))
IN (or = ANY) will perform better than OR, because it can use an index scan. The additional columns in the SELECT list tell you which condition was matched.
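For completeness, a sketch of how the entumlauten dictionary could be defined, assuming a custom rules file (the name de_umlauts is hypothetical; the file must exist in the tsearch_data directory):
CREATE EXTENSION IF NOT EXISTS unaccent;
-- Dictionary based on the unaccent template, reading its rules from
-- $SHAREDIR/tsearch_data/de_umlauts.rules (hypothetical file name)
CREATE TEXT SEARCH DICTIONARY entumlauten (
    TEMPLATE = unaccent,
    RULES = 'de_umlauts'
);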
Use the soundex() function. It is suitable only for building candidate lists from which a human user picks the wanted name. You should probably strip all diacritics (the Turkish way) before using it.
It also handles similar-sounding letters, like C, S and Z, or D and T. So Schmidt would match Smith, and Jönssen would match Johnson.
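A minimal sketch, assuming the fuzzystrmatch extension (which provides soundex()) is installed:
CREATE EXTENSION IF NOT EXISTS fuzzystrmatch;
-- Both spellings reduce to the same Soundex code (M460), so they match
SELECT soundex('Muller') = soundex('Mueller');  -- true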

Odd to_tsquery results for s:* and t:*

I was experimenting with PostgreSQL's text search feature, particularly with the normalization function to_tsquery.
I was using the english dictionary (configuration), and for some reason s and t won't normalize. I understand why i and a would not, but s and t? Interesting.
Are they being matched to space and tab?
Here is the query:
select
to_tsquery('english', 'a:*') as for_a,
to_tsquery('english', 's:*') as for_s,
to_tsquery('english', 't:*') as for_t,
to_tsquery('english', 'u:*') as for_u
fiddle just in case.
You can see that 'u:*' is returned as-is, while 'a:*' returns nothing.
The letters s and t are considered stop words in the english text search dictionary, so they get discarded. You can read the stop word list in tsearch_data/english.stop under the PostgreSQL shared folder, which you can locate by running pg_config --sharedir.
With PostgreSQL 11 on Ubuntu/Debian/Mint, that would be:
cat /usr/share/postgresql/11/tsearch_data/english.stop
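A quick way to confirm this from SQL is ts_debug (a sketch; stop words reach the english_stem dictionary but produce an empty lexeme array, so they are dropped):
SELECT token, dictionaries, lexemes
FROM ts_debug('english', 's t u')
WHERE alias = 'asciiword';
-- 's' and 't' yield lexemes = {}, while 'u' yields {u}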
Quoting from the docs,
Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching.
It is best to set aside English grammar and think of words in a programmatic and logical way, as described above. Full text search does not try to infer context from sentence structure, so it has no use for these words. After all, it's called full text search, not natural language search.
As to how s and t ended up on the stop word list, statistical analysis presumably showed these single characters to be noise.

How to get ts_headline to respect phraseto_tsquery

I have a query that uses phrase search to match whole phrases.
SELECT ts_headline(
'simple',
'This is my test text. My test text has many words. Well, not THAT many words.',
phraseto_tsquery('simple', 'text has many words')
);
Which results in:
This is my test <b>text</b>. My test <b>text</b> <b>has</b> <b>many</b> <b>words</b>. Well, not THAT <b>many</b> <b>words</b>.
But I would have expected this:
This is my test text. My test <b>text</b> <b>has</b> <b>many</b> <b>words</b>. Well, not THAT many words.
Or ideally even this:
This is my test text. My test <b>text has many words</b>. Well, not THAT many words.
Sidenote:
phraseto_tsquery('simple', 'text has many words')
is equivalent to
to_tsquery('simple', 'text <-> has <-> many <-> words')
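You can check the equivalence directly; both expressions produce the same tsquery:
SELECT phraseto_tsquery('simple', 'text has many words') AS a,
       to_tsquery('simple', 'text <-> has <-> many <-> words') AS b;
-- a = b = 'text' <-> 'has' <-> 'many' <-> 'words'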
I'm not sure if I'm doing something wrong, or if ts_headline simply does not support this kind of highlighting.
phraseto_tsquery('simple', 'text has many words') generates the correct query, but the problem seems to be in the ts_headline function. It looks like an already reported bug, BUG #155172.

How to full-text index both Chinese and English characters together using the ngram parser in MySQL 5.7?

I have a table named comp with a column compName; compName contains characters from different countries. I am using MySQL 5.7 with the ngram parser. Searching for Chinese words works fine, but searching for English words gives bad results. According to the INFORMATION_SCHEMA.INNODB_FT_INDEX_CACHE table, English words are tokenized character by character: abc is split into ab and bc. But English should be tokenized on spaces, right? How can this kind of case be resolved when using the ngram parser in MySQL 5.7?
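For reference, a sketch of the setup the question describes (MySQL 5.7 dialect; the index name idx_compname is an assumption):
CREATE TABLE comp (
    id INT AUTO_INCREMENT PRIMARY KEY,
    compName VARCHAR(255)
) ENGINE = InnoDB;
-- The ngram parser splits ALL text into n-grams (ngram_token_size
-- defaults to 2), regardless of language, so 'abc' is indexed as
-- 'ab' and 'bc'; this explains the INNODB_FT_INDEX_CACHE entries
CREATE FULLTEXT INDEX idx_compname ON comp (compName) WITH PARSER ngram;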

Cannot use T-SQL CONTAINS with short words

I call my statement with the CONTAINS function, but sometimes it does not return the correct records. For example, I want to return a row whose Comment field contains the word 'Your':
SELECT [Email]
,[Comment]
FROM [USERS]
WHERE CONTAINS(Comment, 'Your')
It gives me 0 results even though the field contains this word (the same happens with 'as', 'to', 'was', and 'me'). When I use 'given' instead of 'Your', I receive a result. Is there perhaps a list of words which cannot be used with CONTAINS? Or maybe these words are too short (when I use 'name', I receive results)? The word 'Your' is at the beginning of the Comment field.
The field is of type text and has a full-text index enabled.
Words such as those you mention are "stop words"; they are expressly excluded from being indexed and searched in Full Text Search due to how common (and thereby meaningless for searches) they are. You'll notice the same thing when searching Google, for instance.
It is possible to edit the list, but I would avoid doing so except perhaps to add words to it; the words in the list are chosen very well, IMHO, for their lack of utility in searches.
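A hedged T-SQL sketch for inspecting the stop list and, if you accept the trade-off the answer warns about, disabling it for the index (table name taken from the question):
-- List the English entries in the system stop list
SELECT stopword
FROM sys.fulltext_system_stopwords
WHERE language_id = 1033;  -- 1033 = US English

-- Make the full-text index ignore stop lists entirely
ALTER FULLTEXT INDEX ON [USERS] SET STOPLIST OFF;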