Why is this Lucene query a "contains" instead of a "startsWith"? - lucene.net

string q = "a";
Query query = new QueryParser("company", new StandardAnalyzer()).Parse(q+"*");
This results in a PrefixQuery: company:a*
Still, I get results like "Fleet Africa", where the "A" is obviously not at the start of the value, so the results are not what I want.
Query query = new TermQuery(new Term("company", q+"*"));
This results in a TermQuery: company:a*, which returns no results, probably because it treats the query as an exact match and none of my values is the literal "a*".
Query query = new WildcardQuery(new Term("company", q+"*"));
This returns the same results as the PrefixQuery.
What am I doing wrong?

StandardAnalyzer will tokenize "Fleet Africa" into "fleet" and "africa". Your a* search will match the latter term.
If you want to treat "Fleet Africa" as one single term, use an analyzer that does not break up your string on whitespace. KeywordAnalyzer is an example, but you may still want to lowercase your data so that queries are case-insensitive.
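To make the difference concrete, here is a minimal Python sketch that imitates the two indexing strategies. It does not call Lucene: `tokenize_standard` and `tokenize_keyword` are hypothetical stand-ins for roughly what StandardAnalyzer and a lowercasing KeywordAnalyzer do to a field value, and "Acme Ltd" is made-up sample data.

```python
def tokenize_standard(value):
    # Rough stand-in for StandardAnalyzer: lowercase, then split on whitespace.
    return value.lower().split()


def tokenize_keyword(value):
    # Rough stand-in for KeywordAnalyzer plus lowercasing: one term per field.
    return [value.lower()]


def prefix_matches(prefix, value, tokenize):
    # A prefix query matches when ANY indexed term starts with the prefix.
    return any(term.startswith(prefix) for term in tokenize(value))


companies = ["Fleet Africa", "Acme Ltd"]  # "Acme Ltd" is hypothetical data
tokenized_hits = [c for c in companies if prefix_matches("a", c, tokenize_standard)]
keyword_hits = [c for c in companies if prefix_matches("a", c, tokenize_keyword)]

print(tokenized_hits)  # ['Fleet Africa', 'Acme Ltd']
print(keyword_hits)    # ['Acme Ltd']
```

With per-word terms the prefix "a" hits the term "africa"; with one term per field it can only hit values that really start with "a".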

The short answer: none of your queries constrains the search to the start of the field.
You need an EdgeNGramTokenFilter or something like it.
See this question for an implementation of autocomplete in Lucene.

Another solution could be to use a StringField to store the data, e.g. "Fleet Africa".
Then use a WildcardQuery. Now f* or F* would give results, but A* or a* won't.
StringField is indexed but not tokenized.

Related

how to find partial search in Mongodb?

How to find partial search?
Right now I'm searching with:
db.content.find({$text: {$search: "Customer london"}})
It finds all records matching customer, and all records matching london.
If I search for part of a word, for example lond or custom:
db.content.find({$text: {$search: "lond"}})
It returns an empty result. How can I modify the query to get the same result as when I search for london?
You can use $regex to work around this (https://docs.mongodb.com/manual/reference/operator/query/regex/). However, it only helps in cases like the following.
If a document contains the word Cooking, these queries may return it:
cooking (exact match)
coo (part of the word)
cooked (a word sharing the English root of the document's word; cook is the root from which cooking and cooked are derived)
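As a minimal sketch of the regex workaround, the filter document can be built in Python and checked locally with the built-in `re` module. The field name `body` is hypothetical (the question does not say which field holds the text), and handing the dictionary to a driver's `find` call is an assumption about how it would be used.

```python
import re


def partial_match_filter(field, fragment):
    # Escape the fragment so characters like "." or "+" are matched literally.
    return {field: {"$regex": re.escape(fragment), "$options": "i"}}


query = partial_match_filter("body", "lond")  # "body" is a hypothetical field name
print(query)

# The same pattern, checked locally with Python's re module.
pattern = re.compile(query["body"]["$regex"], re.IGNORECASE)
print(bool(pattern.search("Customer in London")))  # True
print(bool(pattern.search("Customer in Paris")))   # False
```

Escaping the user input matters because the fragment is otherwise interpreted as a regular expression.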
If you would like to go one step further and get a document containing cooking back when you type vooking (misspelled with a V instead of a C), go for Elasticsearch.
Elasticsearch is easy to set up and has a very powerful edge n-gram analyzer, which breaks each word into smaller, weighted tokens. So when you misspell a word, you can still get a document back, based on the score Elasticsearch gives that document.
You can read about it here : https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-edgengram-tokenizer.html
A $text search always returns an empty array for partial words, for example when you search for lond and expect to match london.
That is because it matches on whole words, not on fragments of them.
It does not give you matches for partial forms such as:
LO LON LOND LONDO LONDON
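The partial forms listed above are exactly the edge n-grams of the word. A minimal sketch of generating them in plain Python follows; `edge_ngrams` and its `min_gram` parameter are hypothetical names for illustration, not part of MongoDB or Elasticsearch.

```python
def edge_ngrams(word, min_gram=2):
    # All leading substrings of the word, from min_gram characters up to the full word.
    return [word[:n] for n in range(min_gram, len(word) + 1)]


grams = edge_ngrams("LONDON")
print(" ".join(grams))  # LO LON LOND LONDO LONDON
```

If these grams are indexed alongside the full word, a search for a fragment such as LOND becomes an ordinary exact-term lookup.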
Elasticsearch can help here. It is quite good for full-text search when used together with MongoDB.
Reference: Elasticsearch
Thanks
I load all the documents into an array:
clientDB.collection('details').find({}).toArray().then((docs) =>
I then use startsWith in a for loop to pick out my record:
if (docs[i].name.startsWith('U', 0)) {
  return console.log(docs[i].name);
} else {
  console.log('Record not found!!!');
}
This may not be efficient, but it works for now.

Satisfy this JPQL requirement

Requirement: The content of one string is present in the other. This tests on whole words only, but multi-word queries are allowed. For example, a query with givenName:Jane will match users with givenName values of "Jane" and "Jane Ann", but not "Janet". A multi-word query for givenName:Mary Ann would match values of "Mary Ann Evans" and "Sarah Mary Ann" but not "Ann Mary".
This is what I have so far.
I have a SiteRepository that extends the JpaRepository interface and it consists of the following method.
@Query("SELECT s from Site s WHERE s.name LIKE :givenName")
public List<Site> findByName(@Param("givenName") String givenName);
Now, the user is supposed to pass a String value to the method below (for example ":Jane" or ":Mary Ann").
public List<Site> findByName(String givenName) //Where Site is an entity with a name field.
In the above method, I essentially check the first character of the parameter givenName to see if it is ":", and if it is, I use substring to cut off the colon and concatenate the remaining characters into the pattern.
return siteRepository.findByName("%" + givenName.substring(1) + "%"
+ " AND NOT LIKE _" + givenName.substring(1) + "_");
So if I call findByName(":Jane"),
I should get the following JPQL query: SELECT s from Site s WHERE s.name LIKE %Jane% AND NOT LIKE _Jane_
This however does not work. Any help will be appreciated and thanks in advance.
I am not an expert in jpql, so I will write here only my assumptions which are not necessarily correct, but I try to help the op anyways.
If my answer is wrong, then please, leave a comment at the answer to let me know about it.
I think this answer does not deserve to be down-voted, as I have made clear that my answer is not necessarily correct and I will remove it if it is incorrect.
This was my idea described as a comment to the question without knowing the technical details:
You should write a query which checks whether all the words are
present and the first index of a word is before the last index of the
next word.
LOCATE(searchString, candidateString [, startIndex])
searches for the position of searchString in candidateString, starting from startIndex; indexes start at 1. The idea is to write a query that checks that each LOCATE call returns something other than 0, with the startIndex of each call depending on the position found for the previous word. (Source) If all your words match these criteria, the record should be included in the results.
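Before translating the requirement into JPQL, it can help to pin it down as an executable check. The following is a minimal Python sketch, not JPQL; `whole_word_match` is a hypothetical helper, and the test values are the ones given in the requirement.

```python
def whole_word_match(query, value):
    # Normalise runs of whitespace, then pad both strings with a space so the
    # query has to begin and end on a word boundary inside the value.
    padded_query = " " + " ".join(query.split()) + " "
    padded_value = " " + " ".join(value.split()) + " "
    return padded_query in padded_value


print(whole_word_match("Jane", "Jane Ann"))            # True
print(whole_word_match("Jane", "Janet"))               # False
print(whole_word_match("Mary Ann", "Sarah Mary Ann"))  # True
print(whole_word_match("Mary Ann", "Ann Mary"))        # False
```

One way the padding idea might carry over to a query is to compare CONCAT(' ', s.name, ' ') against a bound pattern such as '% Jane %'; that is an untested assumption, not something confirmed in this thread.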
I found a solution. This seems to work for the cases I tested so far.
" SELECT s FROM Site s WHERE s.name NOT LIKE 'Jane_' AND s.name NOT LIKE '_Jane' AND s.name NOT LIKE 'Jane' AND s.name LIKE '%Jane%' "

MongoDB - Using regex wildcards for search that properly filter results

I have a Mongo search set up that goes through my entries based on numerous criteria.
Currently the easiest way (I know it's not performance-friendly due to using wildcards, but I can't figure out a better way to do this due to case insensitivity and users not putting in whole words) is to use regex wildcards in the search. The search ends up looking like this:
{ gender: /Womens/i, designer: /Voodoo Girl/i } // Should return ~200 results
{ gender: /Mens/i, designer: /Voodoo Girl/i } // Should return 0 results
In the example above, both searches are returning ~200 results ("Voodoo Girl" is a womenswear label and all corresponding entries have a gender: "Womens" field.). Bizarrely, when I do other searches, like:
{ designer: /Voodoo Girl/i, store: /Store XYZ/i } // should return 0 results
I get the correct number of results (0). Is this an order thing? How can I ensure that my search only returns results that match all of my wildcarded queries?
For reference, the queries are being made in nodeJS through a simple db.products.find({criteria}) lookup.
To answer the aside real fast, something like ElasticSearch is a wonderful way to get more powerful, performant searching capabilities in your app.
Now, the reason that your searches are returning results is that "mens" is a substring of "womens"! You probably want either /^Mens/i and /^Womens/i (if Mens starts the gender field), or /\bMens\b/ if it can appear in the middle of the field. The first form will only match the given field from the beginning of the string, while the second form looks for the given word surrounded by word boundaries (that is, not as a substring of another word).
If you can use the /^Mens/ form (note the lack of the /i), it's advisable, as anchored case-sensitive regex queries can use indexes, while other regex forms cannot.
$regex can only use an index efficiently when the regular expression has an anchor for the beginning (i.e. ^) of a string and is a case-sensitive match.
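The substring effect and the two suggested fixes can be reproduced with Python's `re` module, whose handling of `^` and `\b` is the same for these simple patterns. This is a minimal sketch on made-up field values ("Kids and Mens" is hypothetical), not a MongoDB query.

```python
import re

genders = ["Womens", "Mens", "Kids and Mens"]  # sample values, partly hypothetical

# Unanchored and case-insensitive: "mens" is found inside "Womens" too.
unanchored = [g for g in genders if re.search(r"Mens", g, re.IGNORECASE)]
# Anchored at the start of the string: only values beginning with "Mens".
anchored = [g for g in genders if re.search(r"^Mens", g, re.IGNORECASE)]
# Word boundaries: "Mens" as a whole word anywhere in the value.
word_bounded = [g for g in genders if re.search(r"\bMens\b", g, re.IGNORECASE)]

print(unanchored)    # ['Womens', 'Mens', 'Kids and Mens']
print(anchored)      # ['Mens']
print(word_bounded)  # ['Mens', 'Kids and Mens']
```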

whoosh doesn't search for short words like "C#"

I am using Whoosh to index over 200,000 books, but I have run into some problems.
The Whoosh query parser returns NullQuery for words like "C#" and "C++" that contain meta-characters, and also for some other short words. These words appear in the title and body of some documents, so I am not using the keyword type for them. I guess the problem is in the analysis or query-parsing phase of indexing or searching, but I can't change my data blindly. Can anyone help me fix this? Thanks.
I fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements. Here is the pattern:
'\w+[#+.\w]*'
With this pattern the fields are tokenized correctly, and searching works as well.
But when I use queries like "some query++*" or "some##*", the parsed query is a single Every query, just the '*'. I also found that this is not related to my analyzer; it is Whoosh's default behavior. So here is my new question: is this behavior correct, or is it a bug?
Note: removing the WildcardPlugin from the query parser solves this problem, but I also need the WildcardPlugin.
I am now using the following code:
from whoosh import analysis, fields
from whoosh.util import rcompile
# for matching words like '.NET', 'C++' and 'C#'
word_pattern = rcompile(r'(\.|[\w]+)(\.?\w+|#|\+\+)*')
# I don't need words shorter than two characters, so I keep the default minsize
analyzer = analysis.StandardAnalyzer(expression=word_pattern)
... now in my schema:
...
title = fields.TEXT(analyzer=analyzer),
...
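What the two token patterns quoted in this question actually extract can be checked with Python's built-in `re` module alone. This is a minimal sketch that leaves out everything else an analyzer does (lowercasing, stop words, minimum size); `tokens` is a hypothetical helper.

```python
import re

simple_pattern = re.compile(r"\w+[#+.\w]*")
extended_pattern = re.compile(r"(\.|[\w]+)(\.?\w+|#|\+\+)*")


def tokens(pattern, text):
    # finditer + group(0) returns whole matches even when the pattern has groups.
    return [m.group(0) for m in pattern.finditer(text)]


text = "C# C++ .NET"
print(tokens(simple_pattern, text))    # ['C#', 'C++', 'NET']
print(tokens(extended_pattern, text))  # ['C#', 'C++', '.NET']
```

The first pattern must start on a word character, so it loses the leading dot of ".NET"; the second one allows a token to begin with a dot.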
This solves my first problem, yes. But the main problem is in searching. I don't want to let users search with the Every query, i.e. a bare *. Yet when I parse a query like C++* I end up with an Every(*) query. I know something is wrong, but I can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
import whoosh.analysis
import whoosh.fields

schema = whoosh.fields.Schema(
    name=whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
    # ...
)
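Why a minimum token size makes short words vanish can be shown without Whoosh. The sketch below is a simplified imitation of an analyzer: the `\w+` word pattern is an assumption standing in for the library's default expression, and `analyze` is a hypothetical function.

```python
import re

WORD = re.compile(r"\w+")  # assumed word pattern, not Whoosh's actual default


def analyze(text, minsize=2):
    # Lowercase, extract word tokens, then drop tokens shorter than minsize.
    return [t for t in WORD.findall(text.lower()) if len(t) >= minsize]


print(analyze("C#"))             # [] because only "c" is extracted and it is too short
print(analyze("C#", minsize=1))  # ['c']
```

When the analyzer returns no tokens for a search term, there is nothing left to build a query from, which fits the NullQuery described in the question.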

Mongoengine filtering on a listField with __contains not working

I have a field in my document, place_names, which is a list of all possible place names for a location. For example, New York City will have New York City, NYC, big apple, etc.
I want the user to be able to query on any of these values or any part of them.
For example, if they search for "apple" I want them to get New York City back. I was trying to use the __contains filter in mongoengine as below.
place_names is of type ListField()
pn = request.POST.get('place_name', None)
try:
    places_list = Places.objects()
    if pn is not None and pn != "":
        places_list = places_list.filter(place_names__contains=pn)
In the above example the filter doesn't work the way I expect it to. It works as a regular filter and doesn't do the "__contains". The same filter works fine if the type is StringField(). Is it possible to use "__contains" with ListFields? If not, is there any way around this? Thanks :)
__contains is a string lookup that uses a regex under the hood. To check whether an item is in a ListField you should use __in; however, that does an exact match.
You could denormalise and create a ListField with the place names split into single words and lowercased, then you can use __in to determine if there is a match.
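The denormalisation step can be sketched in plain Python. `search_terms` is a hypothetical helper, and storing its result in an extra ListField (called `place_terms` here purely for illustration) is an assumption about how the suggestion would be wired into the document.

```python
def search_terms(place_names):
    # Split every place name into lowercased single words, skipping duplicates.
    terms = []
    for name in place_names:
        for word in name.lower().split():
            if word not in terms:
                terms.append(word)
    return terms


terms = search_terms(["New York City", "NYC", "big apple"])
print(terms)             # ['new', 'york', 'city', 'nyc', 'big', 'apple']
print("apple" in terms)  # True
```

A lookup along the lines of place_terms__in=[pn.lower()] would then be an exact match on whole words, so "apple" finds the document but a fragment such as "appl" still does not.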