Handling Lucene query parser errors - lucene.net

My application takes a user-entered string and tries to parse it with the Lucene query parser. However, I noticed that several kinds of input make the query parser throw an error,
e.g.:
~anystring
anystring +
At first I tried sanitizing the user's input so that it could not contain these cases, but there are likely more cases I cannot foresee.
How do you handle Query parser exceptions? How do you prevent them?

We catch the remaining parse exceptions and display an error message ("Your search did not match any documents. Suggestion: Try different keywords.").
See also How to make the Lucene QueryParser more forgiving?
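A minimal sketch of that catch-and-fall-back pattern in Lucene.NET (the field name "body" and the QueryParser.Escape fallback are illustrative assumptions, not part of the original answer):

using Lucene.Net.Analysis.Standard;
using Lucene.Net.QueryParsers;
using Lucene.Net.Search;
using Version = Lucene.Net.Util.Version;

static class SafeSearch
{
    public static Query ParseUserQuery(string userInput)
    {
        var parser = new QueryParser(Version.LUCENE_29, "body", new StandardAnalyzer(Version.LUCENE_29));
        try
        {
            return parser.Parse(userInput);
        }
        catch (ParseException)
        {
            // Fall back to a literal search: escape every character that is
            // special to the Lucene query syntax so parsing cannot fail.
            return parser.Parse(QueryParser.Escape(userInput));
        }
    }
}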

Another approach is to strip the characters that are special to Lucene's query syntax from the input before parsing, e.g. in JavaScript:
// remove Lucene query-syntax metacharacters from the user's input
query = query.replace(/([\!\*\+\&\|\(\)\[\]\{\}\^\~\?\:\"\/])/g, "");

Related

Input string was not in correct format while inserting data into PostgreSQL through Entity Framework in .NET Core using a dynamic query

I am getting an error while inserting data into PostgreSQL with .NET Core Entity Framework.
The error is: Input string was not in correct format.
This is the query I am executing:
INSERT INTO public."MedQuantityVerification"("Id","MedId","ActivityBy","ActivityOn","Quantity","ActivityType","SupposedOn","Note") Values(7773866,248953,8887,'7/14/2018 10:43:43 PM','42.5 qty',5,NULL,'I counted forty two {point} five.')
Anyhow, when I run that query directly against PostgreSQL it works fine, so it looks like an issue on the C# side, but I don't know what.
The issue also seems to involve {point}.
This is how I execute the dynamic query:
db.Database.ExecuteSqlRaw(query);
You have to escape the curly brackets:
{point} should be {{point}}
ExecuteSqlRaw uses curly braces to parameterize the raw query, so if your query naturally includes them, as OP's does, the function is going to try to parse them. Doubling up the braces, as in Koen Schepens' answer, acts as an escape sequence and tells the function not to parse it as a parameter.
The documentation for the function uses the following example as to the purpose of why it does what it does:
var userSuppliedSearchTerm = ".NET";
context.Database.ExecuteSqlRaw("UPDATE Blogs SET Rank = 50 WHERE Name = {0}", userSuppliedSearchTerm);
Note that you'll want to use this to your advantage any time you're accepting user-input and passing it to ExecuteSqlRaw. If the curly brace is in a parameter instead of the main string it doesn't need to be escaped.
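So a minimal fix for the question's query is to double the braces before executing it (a sketch; query is the dynamically built INSERT from above):

// "{point}" would be parsed as a {0}-style parameter placeholder and fail;
// "{{point}}" is emitted as the literal text "{point}".
query = query.Replace("{point}", "{{point}}");
db.Database.ExecuteSqlRaw(query);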

Query string parsing as number when it should be a string

I am trying to send a search input to a REST service. In some cases the form input is a long string of digits (example: 1234567890000000000123456789). I am getting a 500 error, and it looks like something is trying to convert the string to a number. The data type in the source database is a string.
Is there something that can be done in building the query string that will force the input to be interpreted as a string?
The service is an implementation of ArcGIS server.
More information on this issue, per request.
To test, I have been using a client form provided with the service installation.
I have attempted to add single and double quotes, plus wildcard characters, in the form entry. The form submission does not error, but no results are found. If I shorten the number ("1234"), or add some alphanumeric characters ("1234A"), the form submission does not error.
The problem surfaced after a recent upgrade to 10.1. I have looked for information that would tie this to a known problem, but have not found anything yet.
In terms of forcing the input to be interpreted as a string, you enclose the input in single quotes (e.g., '1234567890000000000123456789'). Though if you are querying a field of type string, then you need to enclose all search strings in single quotes, and in that case none of your queries should be working. So it's a little hard to tell from the information you've provided what exactly you are doing and what might be going wrong. Can you provide more detail and/or code?

Are you formatting a where clause that you are using in a Query object via one of Esri's client-side APIs (such as the JavaScript API)? In that case, for fields of data type string you definitely need to enclose the search text in single quotes. For example, if the field you are querying were called FIELD, this is how you'd format the where clause:
FIELD = '1234'
or
FIELD Like '1234%'
for a wildcard search. If you are trying to enter query criteria directly into the Query form of a published ArcGIS Server service/layer, then there too you need to enclose the search in single quotes, as in the above examples.
According to an Esri help technician, this is a known bug.

whoosh doesn't search for short words like "C#"

I am using Whoosh to index over 200,000 books, but I have encountered some problems with it.
The Whoosh query parser returns NullQuery for words like "C#" or "C++" that contain metacharacters, and also for some other short words. These words are used in the title and body of some documents, so I am not using the keyword type for them. I guess the problem is in the analysis or query-parsing phase of searching or indexing, but I can't touch my data blindly. Can anyone help me correct this issue? Thanks.
I fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements. Here is the pattern:
'\w+[#+.\w]*'
This makes tokenizing of fields succeed, and searching goes well too.
But when I use queries like "some query++*" or "some##*", the parsed query ends up as a single Every query, just '*'. I also found that this is not related to my analyzer; it is Whoosh's default behavior. So here is my new question: is this behavior correct, or is it a bug?
Note: removing the WildcardPlugin from the query parser solves this problem, but I also need the WildcardPlugin.
Now I am using the following code:
from whoosh import analysis
from whoosh.util import rcompile

# for matching words like '.NET', 'C++' and 'C#'
word_pattern = rcompile('(\.|[\w]+)(\.?\w+|#|\+\+)*')
# I don't need words shorter than two characters, so I don't change the minsize default
analyzer = analysis.StandardAnalyzer(expression=word_pattern)
Now in my schema:
...
title = fields.TEXT(analyzer=analyzer),
...
This solves my first problem, yes, but the main problem is in searching. I don't want to let users search using the Every query ('*'), but when I parse queries like "C++*" I end up with an Every(*) query. I know there is a problem somewhere, but I can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
schema = whoosh.fields.Schema(
    name=whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
    # ...
)

Solr search error when dealing with Arabic string

I have been struggling with Solr searches in Arabic for several days and have run some experiments. Here is a simple illustration of the problem.
After I store an Arabic sentence (for now just one word, السوري) in the database and have Solr index it, then query it with q=*:*&wt=python (without the wt part the response is garbled characters), the response is:
'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'
The actual word I stored for indexing is encoded in another way:
'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'
As you can tell, there is a one-to-one correspondence from \xd8 ↔ \u00d8, but I don't know the name of this encoding, so I cannot convert it. And when I do the search as <>/select/?q=السوري&wt=python, the response is:
{'responseHeader':{'status':0,'QTime':0,'params':{'wt':'python','q':u'\u0627\u0644\u0633\u0648\u0631\u064a'}},'response':{'numFound':0,'start':0,'docs':[]}}
No docs are found, and it seems to use a third encoding, u'\u0627\u0644\u0633\u0648\u0631\u064a'. If I take it and encode('utf8'), it converts back to '\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'.
In summary, when it (السوري) is in my code (Python) or in the database (MySQL), it appears as form1:
'\xd8\xa7\xd9\x84\xd8\xb3\xd9\x88\xd8\xb1\xd9\x8a'
When it is indexed by Solr, it converts to form2:
'\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160'
And when I use <>/select/?q=السوري&wt=python to query from the browser (Google Chrome), it becomes form3:
'\u0627\u0644\u0633\u0648\u0631\u064a'
(which can be converted back to form1 by encode('utf8')). But since they are different, the search matches nothing.
Therefore, these three different encodings may be the core of the problem. Could anyone help me figure it out and solve the search problem?
Thanks in advance.
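For what it's worth, form2 looks like form1's UTF-8 bytes decoded as Windows-1252: \u201e, \u02c6 and \u0160 are exactly the Windows-1252 characters for bytes 0x84, 0x88 and 0x8a. A round-trip sketch of that inference (in C#; on .NET Core, code-page encodings additionally require the System.Text.Encoding.CodePages package):

using System;
using System.Text;

static class MojibakeDemo
{
    static void Main()
    {
        // form2: Solr's garbled response (UTF-8 bytes mis-decoded as Windows-1252)
        var form2 = "\u00d8\u00a7\u00d9\u201e\u00d8\u00b3\u00d9\u02c6\u00d8\u00b1\u00d9\u0160";

        // Encoding with Windows-1252 recovers the original UTF-8 bytes (form1)...
        byte[] utf8Bytes = Encoding.GetEncoding(1252).GetBytes(form2);

        // ...and decoding those bytes as UTF-8 yields the real word (form3)
        Console.WriteLine(Encoding.UTF8.GetString(utf8Bytes)); // prints السوري
    }
}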

Make Lucene index a value and store another

I want Lucene.NET to store a value while indexing a modified, stripped-down version of that value. For example, consider the value:
this_example-has some/weird (chars) 100%
I want it stored exactly like that (so that I can retrieve exactly that for display in the results list), but I want Lucene to index it as:
this example has some weird chars 100
(that is, a "sanitized" version of the original value) for simpler searching.
I figure this is the job of an analyzer, but I don't want to mess with rolling my own. Ideally, the solution should remove everything that is not a letter, a number or quotes, replacing the removed characters with whitespace before indexing.
Any suggestions on how to implement that?
This is because I am indexing products for an e-commerce search, and some have really odd names. I think this would improve search accuracy.
Thanks in advance.
If you don't want a custom analyzer, try storing the value as a separate non-indexed field, and use a simple regex to generate the sanitized version:
// using System.Text.RegularExpressions;
var input = "this_example-has some/weird (chars) 100%";
// replace every run of non-word characters (and underscores) with a single space
var output = Regex.Replace(input, @"[\W_]+", " ");
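A minimal sketch of that two-field approach in Lucene.NET, continuing from the snippet above (the field names are made up for illustration):

var doc = new Document();
// stored verbatim for display in the results list, but not indexed
doc.Add(new Field("name_display", input, Field.Store.YES, Field.Index.NO));
// sanitized copy that is analyzed for searching, but not stored
doc.Add(new Field("name_search", output, Field.Store.NO, Field.Index.ANALYZED));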
You mention that you need another analyzer for some searching functionality. Don't forget the PerFieldAnalyzerWrapper, which allows you to use different analyzers within the same document.
public static void Main() {
    var wrapper = new PerFieldAnalyzerWrapper(defaultAnalyzer: new StandardAnalyzer(Version.LUCENE_29));
    wrapper.AddAnalyzer(fieldName: "id", analyzer: new KeywordAnalyzer());

    IndexWriter writer = null; // TODO: Retrieve these.
    Document document = null;
    writer.AddDocument(document, analyzer: wrapper);
}
You are correct that this is the job of the analyzer. I'd start by using a tool like Luke to see what the standard analyzer does with your term before deciding what to use; it tends to do a good job of stripping noise characters and words.