Indexing and searching French text with diacritics in Lucene - unicode

I am using Lucene Search.
I have uploaded a French file (french.txt) with the following content.
multimédia francophone pour l'enseignement du français langue étrangère
If I search for francophone then it shows the file in the search result. But when I search for multimédia or français or étrangère, it does not show any results.
I have tried to use org.apache.lucene.analysis.fr.FrenchAnalyzer, but it is still not working.
How can we search French words such as those above?

Do you use an ISOLatin1AccentFilterFactory in the analyzers for the field where this text is indexed? Make sure that if you have it for the index analyzer, you also have it for the query analyzer.

BTW, if you are using ISOLatin1AccentFilter, note that it was deprecated in favor of ASCIIFoldingFilter.

Basically, you have 2 options:
Index and search your French files with Snowball analyzer for French
Index your French docs as usual, but search with FuzzyQuery (not very accurate, but may be enough in your particular case).
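For reference, accent folding is the transformation that ISOLatin1AccentFilter/ASCIIFoldingFilter apply at both index and query time. A minimal Python sketch of the idea, using only the standard library (this mirrors the filter's effect; it is not Lucene code):

```python
import unicodedata

def ascii_fold(text: str) -> str:
    # Decompose accented characters (é -> e + combining acute accent),
    # then drop the combining marks, leaving plain ASCII letters.
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

print(ascii_fold("multimédia français étrangère"))
# multimedia francais etrangere
```

If folding happens only at index time, the index contains "multimedia" while the query still carries "multimédia", and the terms never match — which is why the filter must appear in both analyzer chains.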

Related

postgresql fulltext returning wrong results

I'm using postgresql full text tsvector column.
But I found a problem:
When I search for "calça"
The results contain:
1- calça red
2- calça blue
3- calçado red
Why is "calçado" being returned when I search for "calça"?
Is there any configuration so I can solve this?
Thanks.
It isn't just a matter that one string contains the other. The Portuguese stemmer thinks this is the way they should be stemmed. If you turn the longer word into 'calçadot', for example, it no longer stems it, because (presumably) 'adot' is not recognized as a Portuguese suffix which ought to be removed the way 'ado' is.
If you don't want stemming at all, then you could change the config to 'simple', which doesn't stem. But at that point, maybe you don't want full text search at all, and could just use LIKE instead with a pg_trgm index.
If it is just this particular word that you don't want stemmed, I think you can set up a synonym dictionary that maps calçado to itself, which bypasses stemming.
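A sketch of that synonym-dictionary approach in SQL, assuming a file `pt_syn.syn` (a hypothetical name) placed in the `tsearch_data` directory and containing the line `calçado calçado`:

```sql
-- Hypothetical file $SHAREDIR/tsearch_data/pt_syn.syn containing:
--   calçado calçado
CREATE TEXT SEARCH DICTIONARY pt_syn (
    TEMPLATE = synonym,
    SYNONYMS = pt_syn
);

CREATE TEXT SEARCH CONFIGURATION pt_custom (COPY = portuguese);

ALTER TEXT SEARCH CONFIGURATION pt_custom
    ALTER MAPPING FOR asciiword, word
    WITH pt_syn, portuguese_stem;
```

Because the synonym dictionary is listed before portuguese_stem in the mapping, any word it recognizes never reaches the stemmer; everything else falls through to normal Portuguese stemming.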

Odd to_tsquery results for s:* and t:*

I was experimenting with PostgreSQL's text search feature - particularly with the normalization function to_tsquery.
I was using the english dictionary (config), and for some reason s and t won't normalize. I understand why i and a would not, but s and t? Interesting.
Are they matched to single space and tab?
Here is the query:
select
    to_tsquery('english', 'a:*') as for_a,
    to_tsquery('english', 's:*') as for_s,
    to_tsquery('english', 't:*') as for_t,
    to_tsquery('english', 'u:*') as for_u;
fiddle just in case.
You would see that 'u:*' comes back as 'u:*' while 'a:*' returns nothing.
The letters s and t are considered stop words in the english text search dictionary, so they get discarded. You can read the stop word list in tsearch_data/english.stop under the PostgreSQL shared directory, which you can locate by running pg_config --sharedir
With pg 11 on ubuntu/debian/mint, that would be
cat /usr/share/postgresql/11/tsearch_data/english.stop
Quoting from the docs,
Stop words are words that are very common, appear in almost every document, and have no discrimination value. Therefore, they can be ignored in the context of full text searching.
It is best to set aside English grammar and think of words in a programmatic, logical way as described above. Full text search does not try to infer context from sentence structure, so it has no use for these words. After all, it's called full text search, not natural language search.
As to how s and t ended up on the stop word list: English stop lists typically include them because tokenizers split possessives and contractions (the user's data leaves a stray s, don't a stray t), and those single-letter fragments carry no discriminating value.
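A toy Python sketch of stop-word filtering with an illustrative word list (not the actual contents of english.stop) — it also shows how lone s and t tokens arise once a tokenizer splits on apostrophes:

```python
# Illustrative subset of an English stop list; the real english.stop is longer.
STOP_WORDS = {"a", "i", "s", "t", "the", "it", "of"}

def filter_tokens(text: str) -> list[str]:
    # Splitting on apostrophes turns "user's" into "user" + "s",
    # which is exactly where the single-letter s and t tokens come from.
    tokens = text.lower().replace("'", " ").split()
    return [tok for tok in tokens if tok not in STOP_WORDS]

print(filter_tokens("It's the user's data"))
# ['user', 'data']
```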

Is it possible to combine multiple text search configuration in FTS on postgresql?

I tried to combine multiple text search configurations for full text search on PostgreSQL.
I tried :
Create text search configuration test (
copy = english, french
)
But this didn't work:
text search configuration parameter "french" not recognized
I have a column with a mix of English and French words, and I want to combine multiple text search configurations to search it.
Example:
to_tsvector('test', words) @@ to_tsquery('test', 'activité')
to_tsvector('test', words) @@ to_tsquery('test', 'mystery')
How can I mix different text configurations to get result when I look for a french or english word?
The French text search configuration uses French stemming (the french_stem dictionary), while for English english_stem is used.
How do you want to stem for both? You could create a text search configuration that applies both stemmers, but I guess that the result would not be convincing. Similar for stop words.
You can explicitly specify the text search configuration in the query if you know what language you want to search for.
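One way to do that without a combined configuration, sketched for a hypothetical table docs with a text column words: run the match once per language and OR the results (each to_tsvector(...) expression can be backed by its own expression index):

```sql
-- Hypothetical table "docs" with a text column "words".
SELECT *
FROM docs
WHERE to_tsvector('english', words) @@ to_tsquery('english', 'mystery')
   OR to_tsvector('french',  words) @@ to_tsquery('french',  'activité');
```

Each branch stems and strips stop words with the rules of its own language, which avoids the questionable results of piling both stemmers into one configuration.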

Foreign languages words in a text

I have a French text with some words in English, and I want to find those words and highlight all of them at once. Is there a program that can help me do that? Is it possible to do this with any other foreign language?
I'm using Microsoft Word.
Word can do this IF the English words are formatted with the English language (and the rest with the French language). In that case, the advanced options of Word's Find functionality can filter so that the language formatting is searched instead of the text.

Lucene not searching full non-ASCII character

I am using the Lucene search engine for full text search. It returns results for non-ASCII characters too, but here is the problem: suppose I added the text 帕普部分分配数量. If I search with only one character, 帕, it returns a result, but when I search with the full non-ASCII word 帕普部分分配数量 it does not return anything. The strange thing is that when I put spaces between each character, for example 帕 普 部 分 分 配 数 量, and then search, it does return a result.
I would really appreciate any help.
Thanks.
Be sure to use the same Analyzer when indexing and searching.
What happens is your Analyzer is indexing each character as an individual Term, and then if you search with a different analyzer (e.g. WhitespaceAnalyzer), it searches for a single Token containing all the specified characters in your Query.
To search for a sequence of characters like you want, you need to use the same Analyzer and have the QueryParser build a PhraseQuery with all the individual Tokens.
Some sample code of your indexing and searching routines would make it easier to help you.
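Conceptually, this is what a PhraseQuery over single-character tokens does. A small self-contained Python sketch (illustrative names, not the Lucene API): index each character with its position, then require consecutive positions for a match:

```python
from collections import defaultdict

def index_chars(text: str) -> dict[str, list[int]]:
    # Like an analyzer emitting one Term per character, with its position.
    postings = defaultdict(list)
    for pos, ch in enumerate(text):
        postings[ch].append(pos)
    return postings

def phrase_match(postings, phrase: str) -> bool:
    # A phrase matches if, for some start position of its first character,
    # every later character occurs exactly i positions after the start.
    if not phrase:
        return False
    return any(
        all(start + i in postings.get(ch, []) for i, ch in enumerate(phrase))
        for start in postings.get(phrase[0], [])
    )

idx = index_chars("帕普部分分配数量")
print(phrase_match(idx, "帕"))        # True
print(phrase_match(idx, "帕普部分"))  # True
print(phrase_match(idx, "配帕"))      # False
```

Searching for the single token 帕 succeeds because that term exists; the unbroken string fails when the query side looks for one long token that was never indexed, which is why the same analyzer plus a position-aware phrase query fixes it.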