Hibernate Search Turkish characters

I am trying to search a field filled with Turkish characters.
For example: Müdürlük.
But when I query this field, I am not able to get results. I am new to Hibernate Search. What should I do?

There may be other problems (I can't tell without the code), but one thing you'll probably want is to use a language-specific analyzer, so that "mudurluk" will match "Müdürlük", for instance, or so that "istanbul" will match "İstanbul" (capital dotted "i").
To do that with Hibernate Search using annotations, fill in the analyzer attribute of your @Field annotation:
@Field(analyzer = @Analyzer(impl = org.apache.lucene.analysis.tr.TurkishAnalyzer.class))
String myProperty;
If you mapped your entity using the programmatic API, the process should be fairly similar.
Please see the official documentation for more details about analyzers in Hibernate Search: https://docs.jboss.org/hibernate/stable/search/reference/en-US/html_single/#_analyzer
EDIT: don't forget to reindex your data after changing the analyzer.
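A minimal reindexing sketch, assuming Hibernate Search 5.x with JPA; the entity class (MyEntity) and the EntityManager (em) are placeholders for your own:

FullTextEntityManager ftem = org.hibernate.search.jpa.Search.getFullTextEntityManager(em);
// Rebuild the index for MyEntity (a placeholder) with the mass indexer;
// startAndWait() blocks until done and throws InterruptedException.
ftem.createIndexer(MyEntity.class).startAndWait();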

Related

How to allow leading wild cards in custom smart search web part (Kentico 10)

I have a custom index for my products and I am using the Subset Analyzer. This analyzer works great for general searches, but field searches do not work.
For example, I have a document with the following fields:
"documentname", "My-Document-Name"
"tags", "1234,5678,9101"
"documentdescription", "This is a great Document, My-Document-Name."
When I just search "name AND tags:(1234)", I get this document in my results because it searches +_content:name.
However, when I search "documentname:(name)^3.0 AND tags:(1234)", I do not get this document in my results.
Of course, when I do "documentname:(*name*)^3.0" I get a parse error saying: '*' or '?' not allowed as first character in WildcardQuery.
How can I enable wildcard queries in my custom CMS.Search web part?
First of all, you have to make sure that the field you are checking is in the index under the proper name. documentname might not be in the index; it can be called _title, depending on how your index is set up. Get lukeall and check your index (it should be in \CMS\App_Data\CMSModules\SmartSearch\YourIndexName). You can use Luke to test your searches as well.
For example, there may be no tags field, but there is a documenttags field.
P.S. Wildcards do work, and you are right that you can't use them as the first character by default (the Lucene documentation says: you cannot use a * or ? symbol as the first character of a search), but there is a way to enable that in Lucene.Net (a sketch follows the query below), although I don't know if there is a setting for it in Kentico. I don't think you need wildcards, though, so your query should be (assuming you have documentname and documenttags in the index):
+(documentname:"My-Name" AND documenttags:"tag1")
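For reference, when you control the Lucene.Net query parser yourself, enabling leading wildcards is one call. A minimal sketch, assuming Lucene.Net 2.9-style APIs (newer versions expose this as an AllowLeadingWildcard property); the _content field and the analyzer choice are placeholders:

var analyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_29);
var parser = new Lucene.Net.QueryParsers.QueryParser(Lucene.Net.Util.Version.LUCENE_29, "_content", analyzer);
parser.SetAllowLeadingWildcard(true); // permits queries such as *name*
var query = parser.Parse("documentname:(*name*)^3.0");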

Query string parsing as number when it should be a string

I am trying to send a search input to a REST service. In some cases the form input is a long string of numbers (example: 1234567890000000000123456789). I am getting a 500 error, and it looks like something is trying to convert the string to a number. The data type in the source database is a string.
Is there something that can be done in building the query string that will force the input to be interpreted as a string?
The service is an implementation of ArcGIS Server.
More information on this issue per request.
To test, I have been using a client form provided with the service installation.
I have attempted to add single and double quotes, plus wildcard characters, in the form entry. The form submission does not error, but no results are found. If I shorten the number ("1234") or add some alphanumeric characters ("1234A"), the form submission does not error.
The problem surfaced after a recent upgrade to 10.1. I have looked for information that would tie this to a known problem, but have not found anything yet.
To force the input to be interpreted as a string, you enclose the input in single quotes (e.g., '1234567890000000000123456789'). Though if you are querying a field of type string, then you need to enclose all search strings in single quotes, and in that case none of your queries should be working. So it's a little hard to tell from the information you've provided exactly what you are doing and what might be going wrong. Can you provide more detail and/or code?
Are you formatting a where clause that you are using in a Query object via one of Esri's client-side APIs (such as the JavaScript API)? In that case, for fields of data type string you definitely need to enclose the search text in single quotes. For example, if the field you are querying were called FIELD, this is how you'd format the where clause:
FIELD = '1234'
or
FIELD Like '1234%'
for a wildcard search. If you are trying to enter query criteria directly into the Query form of a published ArcGIS Server service/layer, then there too you need to enclose the search in single quotes, as in the above examples.
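For completeness, here is a minimal sketch of the same where clause sent to the standard ArcGIS Server REST query endpoint; the service URL and the field name FIELD are placeholders:

import requests

url = "https://example.com/arcgis/rest/services/MyService/MapServer/0/query"
params = {
    "where": "FIELD = '1234567890000000000123456789'",  # single quotes force a string comparison
    "outFields": "*",
    "f": "json",
}
print(requests.get(url, params=params).json())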
According to an Esri help technician, this is a known bug.

whoosh doesn't search for short words like "C#"

I am using Whoosh to index over 200,000 books, but I have encountered some problems with it.
The Whoosh query parser returns NullQuery for words like "C#" and "C++", which contain meta-characters, and also for some other short words. These words are used in the title and body of some documents, so I am not using the keyword type for them. I guess the problem is in the analysis or query-parsing phase of searching or indexing, but I can't touch my data blindly. Can anyone help me correct this issue? Thanks.
I fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements. Here is the regex pattern:
'\w+[#+.\w]*'
This makes the tokenizing of fields succeed, and searching goes well too.
But when I use queries like "some query++*" or "some##*", the parsed query becomes a single Every query, just the '*'. I also found that this is not related to my analyzer; it is Whoosh's default behavior. So here is my new question: is this behavior correct, or is it a bug?
Note: removing the WildcardPlugin from the query parser solves this problem (a sketch is below), but I also need the WildcardPlugin.
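A minimal sketch of that removal, assuming an open index ix whose schema has a title field (both placeholders); note this drops wildcard support entirely:

from whoosh import qparser

parser = qparser.QueryParser("title", schema=ix.schema)
parser.remove_plugin_class(qparser.WildcardPlugin)
print(parser.parse(u"C++*"))  # '*' is no longer treated as a wildcard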
Now I am using the following code:
from whoosh import analysis, fields
from whoosh.util import rcompile

# for matching words like '.NET', 'C++' and 'C#'
word_pattern = rcompile('(\.|[\w]+)(\.?\w+|#|\+\+)*')

# I don't need words shorter than two characters, so I keep the minsize default
analyzer = analysis.StandardAnalyzer(expression=word_pattern)

Now in my schema:

title = fields.TEXT(analyzer=analyzer),
This solves my first problem, yes. But the main problem is in searching: I don't want to let users search using the Every query or *. But when I parse queries like C++*, I end up with an Every(*) query. I know there is some problem, but I can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
schema = whoosh.fields.Schema(
    name=whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
    # ...
)

Solr search error when dealing with Arabic string

I've been struggling with Solr searching Arabic for several days and have made some experiments. Here is a simple reflection of the problem.
After I store an Arabic sentence (for now only one word, السوري) in the database and have Solr index it, then query it with q=*:*&wt=python (without the wt part, the response is garbled characters), the response is:
'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'
The actual word I stored there for indexing is encoded in another way:
'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'
As you can tell, there is a one-to-one correspondence \xd8 ↔ \u00d8. But I don't know the name of this encoding, so I cannot convert it. And when I do the search as <>/select/?q=السوري&wt=python, the response is:
{'responseHeader':{'status':0,'QTime':0,'params':{'wt':'python','q':u'\u0627\u0644\u0633\u0648\u0631\u064a'}},'response':{'numFound':0,'start':0,'docs':[]}}
No docs are found, and it seems a third encoding is in use: u'\u0627\u0644\u0633\u0648\u0631\u064a'. If I take it and encode('utf8'), it converts back to '\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'.
In summary, when it (السوري) is in my code (Python) or in the database (MySQL), it appears as 'form1':
'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'
When it is indexed by Solr, it converts to form2:
'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'
And when I use <>/select/?q=السوري&wt=python to query from the browser (Google Chrome), it becomes form3:
'\u0627\u0644\u0633\u0648\u0631\u064a'
(which can be converted back to form1 by encode('utf8')). But since they are different, the search matches nothing.
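A diagnostic sketch in Python 2 (an observation, not a fix): decoding form1's bytes as cp1252 reproduces form2 exactly, while decoding them as UTF-8 reproduces form3, which points to a mis-decoding somewhere along the way:

form1 = '\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'  # the raw UTF-8 bytes

# Mis-decoding the UTF-8 bytes as cp1252 yields form2 (mojibake):
print(repr(form1.decode('cp1252')))  # u'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'

# Decoding them correctly as UTF-8 yields form3, the real Arabic word:
print(repr(form1.decode('utf-8')))   # u'\u0627\u0644\u0633\u0648\u0631\u064a'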
Therefore, these three different encodings may be the core of the problem. Could anyone help me figure this out and solve the search problem?
Thanks in advance.

Make Lucene index a value and store another

I want Lucene.NET to store a value while indexing a modified, stripped-down version of the stored value. For example, consider the value:
this_example-has some/weird (chars) 100%
I want it stored exactly like that (so that I can retrieve exactly that for showing in the results list), but I want Lucene to index it as:
this example has some weird chars 100
(you see, like a "sanitized" version of the original value) for a simplified search.
I figure this would be the job of an analyzer, but I don't want to mess with rolling my own. Ideally, the solution should remove everything that is not a letter, a number, or quotes, replacing the removed characters with whitespace before indexing.
Any suggestions on how to implement that?
This is because I am indexing products for an e-commerce search, and some have really creepy names. I think this would improve search accuracy.
Thanks in advance.
If you don't want a custom analyzer, try storing the value as a separate non-indexed field, and use a simple regex to generate the sanitized version.
var input = "this_example-has some/weird (chars) 100%";
var output = Regex.Replace(input, @"[\W_]+", " ");
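To wire that up, a minimal sketch of the two-field approach (the field names are placeholders), assuming Lucene.Net 2.9-style field flags and using Lucene.Net.Documents: the original value is stored but not indexed, and the sanitized value is indexed but not stored.

var doc = new Document();
// Original text, returned verbatim in results but never analyzed:
doc.Add(new Field("name", input, Field.Store.YES, Field.Index.NO));
// Sanitized text, analyzed for searching but not stored:
doc.Add(new Field("name_search", output, Field.Store.NO, Field.Index.ANALYZED));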
You mention that you need another analyzer for some searching functionality. Don't forget the PerFieldAnalyzerWrapper, which will allow you to use different analyzers within the same document.
public static void Main() {
    var wrapper = new PerFieldAnalyzerWrapper(defaultAnalyzer: new StandardAnalyzer(Version.LUCENE_29));
    wrapper.AddAnalyzer(fieldName: "id", analyzer: new KeywordAnalyzer());

    IndexWriter writer = null; // TODO: Retrieve these.
    Document document = null;

    writer.AddDocument(document, analyzer: wrapper);
}
You are correct that this is the work of the analyzer. I'd start by using a tool like Luke to see what the standard analyzer does with your term before deciding what to use; it tends to do a good job of stripping noise characters and words.