Lucene wildcards and words with the letter 's' on the end - lucene.net

I'm having a bit of trouble finding information on what's happening with my Lucene searches.
(Id:gloves* Search:gloves* SpellCheckerSource:gloves*) OR
(Id:gloves Search:gloves SpellCheckerSource:gloves) OR
(Id:glove* Search:glove* SpellCheckerSource:glove*)
When I search for the above I get the following rewritten term
(() () ())
(Id:glove Search:glove SpellCheckerSource:glove)
(() ConstantScore(Search:glove*) ConstantScore(SpellCheckerSource:glove*))
This is using LUKE; I have been running the query in LUKE to try to see what's going on.
http://www.getopt.org/luke/
Now what I want to be able to do is search for a term, i.e. gloves*, but that ends up being rewritten as (() () ()).
I don't understand why it gets translated like this. Is there an issue with my query or with my index?
LUKE tells me the structure explanation is as follows
lucene.BooleanQuery
clauses=3, maxClauses=1024
Clause 0: SHOULD
lucene.BooleanQuery
clauses=3, maxClauses=1024
Clause 0: SHOULD
lucene.BooleanQuery
clauses=0, maxClauses=1024, coord=false
Clause 1: SHOULD
lucene.BooleanQuery
clauses=0, maxClauses=1024, coord=false
Clause 2: SHOULD
lucene.BooleanQuery
clauses=0, maxClauses=1024, coord=false
Clause 1: SHOULD
lucene.BooleanQuery
clauses=3, maxClauses=1024
Clause 0: SHOULD
lucene.TermQuery
Term: field='Id' text='glove'
Clause 1: SHOULD
lucene.TermQuery
Term: field='Search' text='glove'
Clause 2: SHOULD
lucene.TermQuery
Term: field='SpellCheckerSource' text='glove'
Clause 2: SHOULD
lucene.BooleanQuery
clauses=3, maxClauses=1024
Clause 0: SHOULD
lucene.BooleanQuery
clauses=0, maxClauses=1024, coord=false
Clause 1: SHOULD
lucene.ConstantScoreQuery, ConstantScore(Search:glove*)
Filter: Search:glove*
Clause 2: SHOULD
lucene.ConstantScoreQuery, ConstantScore(SpellCheckerSource:glove*)
Filter: SpellCheckerSource:glove*
This seems strange to me on multiple levels:
Why have I got blank translated clauses?
Why have I got a mix of TermQuery, ConstantScoreQuery and BooleanQuery?
Where are the ConstantScoreQuery instances getting generated?
It should be noted that everything works fine when I search for a term without an s (i.e. glove) or without a wildcard; only the combination of the two seems to break the query.

This is probably happening because there are no terms in your index that match "gloves*".
When a MultiTermQuery is rewritten, it finds the Terms that are suitable, and creates primitive queries (such as TermQuery) on those terms. If no suitable terms are found, you'll see an empty query generated instead, like what you've shown.
A TermQuery is already a primitive query, and no rewriting is needed there. It doesn't have to enumerate terms or anything, it just runs the thing.
The other piece of this is analysis. Your query for gloves is getting analyzed to glove (EnglishAnalyzer perhaps?). MultiTermQueries (like wildcard, fuzzy, regex and prefix queries) are not analyzed by the QueryParser. Your prefix query is trying to find terms starting with "gloves", but all those plural s endings were stemmed away at index time, so it doesn't find any matches.
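The interaction between index-time stemming and an unanalyzed prefix can be sketched with a toy model. This is only an illustration in Python, not Lucene code: toy_stem, build_terms and expand_prefix are hypothetical helpers, and the "stemmer" just drops one trailing s.

```python
# Toy model only: a real analyzer and index are far more involved.

def toy_stem(token: str) -> str:
    """Crude stand-in for an English stemmer: lowercase and drop one trailing 's'."""
    token = token.lower()
    return token[:-1] if token.endswith("s") and len(token) > 1 else token

def build_terms(texts):
    """Index-time analysis: split on whitespace, stem, collect the unique terms."""
    return sorted({toy_stem(tok) for text in texts for tok in text.split()})

def expand_prefix(terms, prefix):
    """Prefix rewrite: enumerate indexed terms that start with the raw prefix.
    The prefix is deliberately NOT stemmed, mirroring the unanalyzed wildcard."""
    return [t for t in terms if t.startswith(prefix)]

terms = build_terms(["Leather gloves", "glove box", "Gloves for winter"])

print(expand_prefix(terms, "gloves"))  # no indexed term starts with "gloves"
print(expand_prefix(terms, "glove"))   # the stemmed term is found
print(toy_stem("gloves") in terms)     # the analyzed (non-wildcard) query term exists
```

In this model the prefix "gloves" expands to nothing, which corresponds to the empty clauses in the rewritten query, while the analyzed term and the shorter prefix both find the stemmed term.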

Related

AEM 6.3 Query builder - How to search for case insensitive?

How can we make the query ignore the case of the property.value?
Our query:
path=/content/central-content/jcr:content/main/decline_letter
property.value=0091A
property=#letterNumber
type=nt:unstructured
Works for 0091A but fails for 0091a
Using fulltext seemed to be helping/working.
path=/content/central-content/jcr:content/main/decline_letter
fulltext=0091A
property=#letterNumber
type=nt:unstructured
orderby.case=ignore
fulltext may not be a good solution if we have to search among a lot of nodes/data. In our case, we search a very small number of nodes.

Erlang mnesia equivalent of "select * from Tb"

I'm a total Erlang noob and I just want to see what's in a particular table I have. I want to just "select *" from a particular table to start with. The examples I'm seeing, such as the official documentation, all have column restrictions which I don't really want. I don't really know how to form the MatchHead or Guard to match anything (aka "*").
A very simple primer on how to just get everything out of a table would be very appreciated!
For example, you can use qlc:
F = fun() ->
        Q = qlc:q([R || R <- mnesia:table(foo)]),
        qlc:e(Q)
    end,
mnesia:transaction(F).
The simplest way to do it is probably mnesia:dirty_match_object:
mnesia:dirty_match_object(foo, #foo{_ = '_'}).
That is, match everything in the table foo that is a foo record, regardless of the values of the fields (every field is '_', i.e. wildcard). Note that since it uses record construction syntax, it will only work in a module where you have included the record definition, or in the shell after evaluating rr(my_module) to make the record definition available.
(I expected mnesia:dirty_match_object(foo, '_') to work, but that fails with a bad_type error.)
To do it with select, call it like this:
mnesia:dirty_select(foo, [{'_', [], ['$_']}]).
Here, MatchHead is '_', i.e. match anything. The guards are [], an empty list, i.e. no extra limitations. The result spec is ['$_'], i.e. return the entire record. For more information about match specs, see the match specifications chapter of the ERTS user guide.
If an expression is too deep and gets printed with ... in the shell, you can ask the shell to print the entire thing by evaluating rp(EXPRESSION). EXPRESSION can either be the function call once again, or v(-1) for the value returned by the previous expression, or v(42) for the value returned by the expression preceded by the shell prompt 42>.

MongoDB - Using regex wildcards for search that properly filter results

I have a Mongo search set up that goes through my entries based on numerous criteria.
Currently the easiest way (I know it's not performance-friendly due to using wildcards, but I can't figure out a better way to do this due to case insensitivity and users not putting in whole words) is to use regex wildcards in the search. The search ends up looking like this:
{ gender: /Womens/i, designer: /Voodoo Girl/i } // Should return ~200 results
{ gender: /Mens/i, designer: /Voodoo Girl/i } // Should return 0 results
In the example above, both searches are returning ~200 results ("Voodoo Girl" is a womenswear label and all corresponding entries have a gender: "Womens" field.). Bizarrely, when I do other searches, like:
{ designer: /Voodoo Girl/i, store: /Store XYZ/i } // should return 0 results
I get the correct number of results (0). Is this an order thing? How can I ensure that my search only returns results that match all of my wildcarded queries?
For reference, the queries are being made in nodeJS through a simple db.products.find({criteria}) lookup.
To answer the aside real fast, something like ElasticSearch is a wonderful way to get more powerful, performant searching capabilities in your app.
Now, the reason that your searches are returning results is that "mens" is a substring of "womens"! You probably want either /^Mens/i and /^Womens/i (if Mens starts the gender field), or /\bMens\b/ if it can appear in the middle of the field. The first form will only match the given field from the beginning of the string, while the second form looks for the given word surrounded by word boundaries (that is, not as a substring of another word).
If you can use the /^Mens/ form (note the lack of the /i), it's advisable, as anchored case-sensitive regex queries can use indexes, while other regex forms cannot.
$regex can only use an index efficiently when the regular expression has an anchor for the beginning (i.e. ^) of a string and is a case-sensitive match.
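The substring problem and the two suggested fixes can be checked outside the database. The sketch below uses Python's re module as a stand-in for the regex engine MongoDB uses; for these simple patterns the matching behaviour is the same. The sample values and the matching helper are made up for illustration.

```python
import re

# Hypothetical sample values for the gender field.
genders = ["Womens", "Mens", "womens", "Kids and Mens"]

def matching(pattern, flags=0):
    """Return the values in which the pattern is found anywhere in the string."""
    rx = re.compile(pattern, flags)
    return [g for g in genders if rx.search(g)]

# Unanchored and case-insensitive: "mens" is a substring of "Womens".
unanchored = matching(r"Mens", re.IGNORECASE)

# Anchored at the start: only values that begin with "Mens".
anchored = matching(r"^Mens", re.IGNORECASE)

# Word boundaries: "Mens" as a whole word anywhere in the value.
bounded = matching(r"\bMens\b", re.IGNORECASE)

print(unanchored)
print(anchored)
print(bounded)
```

The unanchored pattern matches all four values, the anchored one only "Mens", and the word-boundary one "Mens" and "Kids and Mens".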

TermQuery not returning on a known search term, but WildcardQuery does

Am hoping someone with enough insight into the inner workings of Lucene might be able to point me in the right direction =)
I'll skip most of the surrounding irrelevant code, and cut right to the chase. I have a Lucene index, to which I am adding the following field (variables replaced by their literal values):
document.Add( new Field("Typenummer", "E5CEB501A244410EB1FFC4761F79E7B7",
Field.Store.YES , Field.Index.UN_TOKENIZED));
Later, when I search my index (using other types of queries), I am able to verify that this field does indeed appear in my index - like when looping through all Fields returned by Document.GetFields()
Field: Typenummer, Value: E5CEB501A244410EB1FFC4761F79E7B7
So far so good :-)
Now the real problem is - why can I not use a TermQuery to search against this value and actually get a result.
This code produces 0 hits:
// Returns 0 hits
bq.Add( new TermQuery( new Term( "Typenummer",
"E5CEB501A244410EB1FFC4761F79E7B7" ) ), BooleanClause.Occur.MUST );
But if I switch this to a WildcardQuery (with no wildcards), I get the 1 hit I expect.
// returns the 1 hit I expect
bq.Add( new WildcardQuery( new Term( "Typenummer",
"E5CEB501A244410EB1FFC4761F79E7B7" ) ), BooleanClause.Occur.MUST );
I've checked field lengths, I've checked that I am using the same Analyzer and so on and I am still on square 1 as to why this is.
Can anyone point me in a direction I should be looking?
I finally figured out what was going on. I'm expanding the tags for this question as it, much to my surprise, actually turned out to be an issue with the CMS this particular problem exists in. In summary, the problem came down to this:
The field is indexed UN_TOKENIZED, meaning Lucene will index it exactly "as-is", as a single term
The BooleanQuery I pasted snippets from gets sent to the Sitecore SearchManager inside a PreparedQuery wrapper
The behaviour I expected from this was, that my query (having already been prepared) would go - unaltered - to the Lucene API
Turns out I was wrong. It passes through a RewriteQuery method that copies my entire set of nested queries as-is, with one exception - all the Term arguments are passed through a LowercaseStrategy()
As I indexed an UPPERCASE Term (UN_TOKENIZED), and Sitecore changes my PreparedQuery to lowercase - 0 results are returned
Am not going to start an argument of whether this is "by design" or "by design flaw" implementation of the Lucene Wrapper API - I'll just note that rewriting my query when using the PreparedQuery overload is... to me... unexpected ;-)
Further teachings from this; storing the field as TOKENIZED will eliminate this problem too, as the StandardAnalyzer by default will lowercase all tokens.
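The case mismatch can be reproduced with a toy exact-match index. This is a Python illustration only, not the Lucene or Sitecore API: add_untokenized, term_query and lowercase_rewrite are hypothetical names, and lowercase_rewrite merely stands in for the lowercasing step described above.

```python
# Toy inverted index: term text -> set of document ids. Lookups are case-sensitive.
index = {}

def add_untokenized(doc_id, value):
    """Index the whole value as one term, unchanged (like UN_TOKENIZED)."""
    index.setdefault(value, set()).add(doc_id)

def term_query(text):
    """Exact term lookup, which is all a term query does."""
    return index.get(text, set())

def lowercase_rewrite(text):
    """Hypothetical stand-in for a wrapper that lowercases every Term before searching."""
    return text.lower()

value = "E5CEB501A244410EB1FFC4761F79E7B7"
add_untokenized(1, value)

direct = term_query(value)                        # query term matches the indexed term
rewritten = term_query(lowercase_rewrite(value))  # lowercased term is not in the index

# Indexing a lowercased term (roughly what a lowercasing analyzer would produce)
# makes the rewritten query match again.
add_untokenized(2, value.lower())
after_fix = term_query(lowercase_rewrite(value))

print(direct, rewritten, after_fix)
```

The point is only that an exact term lookup compares term text verbatim, so any transformation applied on one side (query or index) has to be applied on the other as well.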

Why is this Lucene query a "contains" instead of a "startsWith"?

string q = "a";
Query query = new QueryParser("company", new StandardAnalyzer()).Parse(q+"*");
will result in query being a prefixQuery :company:a*
Still I will get results like "Fleet Africa" where it is rather obvious that the A is not at the start and thus gives me undesired results.
Query query = new TermQuery(new Term("company", q+"*"));
will result in query being a termQuery :company:a* and not returning any results. Probably because it interprets the query as an exact match and none of my values are the "a*" literal.
Query query = new WildcardQuery(new Term("company", q+"*"));
will return the same results as the prefixquery;
What am I doing wrong?
StandardAnalyzer will tokenize "Fleet Africa" into "fleet" and "africa". Your a* search will match the latter term.
If you want to consider "Fleet Africa" as one single term, use an analyzer that does not break up your string on whitespaces. KeywordAnalyzer is an example, but you may still want to lowercase your data so queries are case insensitive.
The short answer: none of your queries constrain the search to the start of the field.
You need an EdgeNGramTokenFilter or something like it.
See this question for an implementation of autocomplete in Lucene.
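The edge n-gram idea can be sketched as follows. This is a toy Python model, not the Lucene EdgeNGramTokenFilter itself: it assumes the grams are taken from the whole lowercased field value (as if the value were a single token), and edge_ngrams and starts_with are hypothetical helpers.

```python
def edge_ngrams(value, min_len=1, max_len=10):
    """Leading substrings of the lowercased whole value, min_len to max_len characters."""
    text = value.lower()
    return [text[:n] for n in range(min_len, min(max_len, len(text)) + 1)]

# Toy index: each leading substring points at the values it came from.
index = {}
for company in ["Fleet Africa", "Acme Ltd"]:
    for gram in edge_ngrams(company):
        index.setdefault(gram, set()).add(company)

def starts_with(prefix):
    """Exact lookup of the lowercased prefix among the indexed edge n-grams."""
    return sorted(index.get(prefix.lower(), set()))

print(starts_with("a"))   # only values that begin with "a"
print(starts_with("fl"))  # only values that begin with "fl"
print(starts_with("af"))  # "africa" is not at the start of any value
```

Because only leading substrings are indexed, a plain exact-term lookup behaves like a "starts with" search, up to max_len characters.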
Another solution could be to use a StringField to store the data, e.g. "Fleet Africa".
Then use a WildcardQuery. Now F* would give results but A* or a* won't (f* matches only if you lowercase the value before indexing, since a StringField is not analyzed).
StringField is indexed but not tokenized.
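The difference between a tokenized field and a single-term field can be illustrated with a small toy model. This is Python, not Lucene: tokenized_terms roughly imitates a whitespace-splitting, lowercasing analyzer, single_term imitates an untokenized field, and all names and sample data are hypothetical.

```python
def tokenized_terms(value):
    """Rough stand-in for a tokenizing analyzer: split on whitespace and lowercase."""
    return [tok.lower() for tok in value.split()]

def single_term(value):
    """Rough stand-in for an untokenized field: the whole value is one term, unchanged."""
    return [value]

def prefix_hits(docs, to_terms, prefix):
    """Return the docs that have at least one indexed term starting with prefix."""
    return [d for d in docs if any(t.startswith(prefix) for t in to_terms(d))]

companies = ["Fleet Africa", "Acme Ltd", "Mobile Fleet"]

print(prefix_hits(companies, tokenized_terms, "a"))  # matches "africa" inside "Fleet Africa"
print(prefix_hits(companies, single_term, "A"))      # only values starting with "A"
print(prefix_hits(companies, single_term, "F"))      # only values starting with "F"
print(prefix_hits(companies, single_term, "f"))      # no match: single terms keep their case
```

With tokenization, the prefix is compared against every word, which is what makes the query look like a "contains". With one term per value, the prefix is compared against the start of the value only, and case matters unless the value is lowercased before indexing.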