Lucene Wildcard Query - Length of matched string - lucene.net

I have set up a simple lucene.net index and am testing out a few queries.
I have an index with a field called "Biography" and i am running this query
WildcardQuery query = new WildcardQuery(new Term("Biography", "*anag*"));
This returns back matches for records with the word Management - which is great
If i search for this...
WildcardQuery query = new WildcardQuery(new Term("Biography", "*anagm*"));
then i get no results.
Here are the 2 strings i have in the index
"im good at project management"
"im good at programming and project management. i like managing things"
Is there a character limit to wildcard searching?
My usecase will be a free text search box for users - hence im not sure what they may type in and wanting to do a wildcard

The partial word "anagm" does not occur in either of your two sentences so returning 0 results should be the expected behavior:
"im good at project management"
"im good at programming and project management. i like managing things"
Which sentence did you think would match? and Why?
Lucene is more often used to match words or more specifically tokens from the original sentences. Doing wildcard matches with Lucene (as one might do with Sql) is quite a bit less common since leading with a wild card is not performant (just as it is not with sql either).

Related

Autocomplete by most frequent words - postgres or lucene?

We're using Postgres and its fulltext feature to search for documents (posts content) in our system, and it works really well.
For autocomplete we want to build index (dictionary?) with all words used in documents and search by most frequent ones.
We will always search for one word. We will never search for phrase.
So if I write:
"th"
I will receive (suppose the most frequent words in our documents):
"this"
"there"
"thoughts"
...
How to do it with Postgres? Or maybe we need some more advanced solution like apache lucene / solr ?
Neither postgres fulltext search (which provides lexems) nor postgres trigrams seems to be suitable for this work. Or maybe I am wrong ?
I don't want to manually parse text and ignore all english stopwords which would be error prone. Postgres does good job with this while building lexems index. But intead of lexems, we need to build and search words dictionary without normalization
Thank you for your assistance

Is it possible to make lucene.net ignore case of the field for queries?

I have documents indexed with field "GuidId" field and "guidid". How can I make lucene net ignore the case ...so that the following query searches regardless of the case ?
TermQuery termQuery = new TermQuery(new Term("GuidId", guidId.ToString()));
I don't want to write another query for the documents with fields "guidid" ..i.e. lowercase
Ideally, don't have fields names with funky cases. If you are defining field names dynamically or some such, then you should lowercase them before adding them to the index. That done, it should be easy enough to keep the query fields' names lowercase as well, and you're in good shape.
If, for whatever reason, you are stuck with this case-sensitive data, you'll be stuck expanding your queries to search all the known permutations of the field name to get all your results. Something like:
Query finalQuery = new DisjunctionMaxQuery(0)
finalQuery.add(new TermQuery(new Term("GuidId", guidId.ToString())));
finalQuery.add(new TermQuery(new Term("guidid", guidId.ToString())));
DisjunctionMaxQuery would probably be a good choice here, since it only returns the maximum scoring hit among is query collection, rather than possibly compounding scores across multiple hits.
You can also use MultiFieldQueryParser to similar effect. I don't believe it uses DisjunctionMax, but it doesn't sound like it would likely be that big a deal in this case.

Stemming does not work properly for MongoDB text index

I am trying to use full text search feature of MongoDB and observing some unexpected behavior. The problem is related to "stemming" aspect of the text indexing feature. The way full text search is described in many articles online, if you have a string "big hunting dogs" in a document's field that is part of the text index, you should be able to search on "hunt" or "hunting" as well as on "dog" or "dogs". MongoDB should normalize or stem the text when indexing and also when searching. So in my example, I would expect it to save words "dog" and "hunt" in the index and search for a stemmed version of this words. If I search for "hunting", MongoDB should search for "hunt".
Well, this is not how it works for me. I am running MongoDB 2.4.8 on Linux with full text search enabled. If my record has value "big hunting dogs", only searching for "big" will produce the result, while searches for "hunt" or "dog" produce nothing. It is as if the words that are not in their "normalized" form are not stored in the text the index (or stored in a way it cannot find them). Searches using $regex operator work fine, that is I am able to find the document by searching on a string like /hunting/ against the field in question.
I tried dropping and recreating the full text index - nothing changed. I can only find the documents containing the words on their "normal" form. Searching for words like "dogs" or "hunting" (or even "dog" or "hunt") produces no results.
Do I misunderstand or misuse the full text search operations or is there a bug in MongoDB?
After a fair amount of experimenting and scratching my head I discovered the reason for this behavior. It turned out that the documents in the collection in question had attribute 'language'. Apparently the presence and the value of that attribute made these documents non-searchable. (The value happened to be 'ENG'. It is possible that changing it to 'eng' would make this document searchable again. The field, however, served a completely different purpose). After I renamed the field to 'lang' I was able to find the document containing the word "dogs" by searching for "dog" or "dogs".
I wonder whether this is expected behavior of MongoDB - that the presence of language attribute in the document would affect the text search.
Michael,
The "language" field (if present) allows each document to override the
language in which the stemming of words would be done. I think, as
you specified to MongoDB a language which it didn't recognize ("ENG"),
it was unable to stem the words at all. As others pointed out, you can use the
language_override option to specify that MongoDB should be using some
other field for this purpose (say "lang") and not the default one ("language").
Below is a nice quote (about full text indexing and searching) which
is exactly related to your issue. It is taken from this book.
"MongoDB: The Definitive Guide, 2nd Edition"
Searching in Other Languages
When a document is inserted (or the index is first created), MongoDB looks at the
indexes fields and stems each word, reducing it to an essential unit. However, different
languages stem words in different ways, so you must specify what language the index
or document is. Thus, text-type indexes allow a "default_language" option to be
specified, which defaults to "english" but can be set to a number of other languages
(see the online documentation for an up-to-date list).
For example, to create a French-language index, we could say:
> db.users.ensureIndex({"profil" : "text", "interets" : "text"}, {"default_language" : "french"})
Then French would be used for stemming, unless otherwise specified. You can, on a
per-document basis, specify another stemming language by having a "language" field
that describes the document’s language:
> db.users.insert({"username" : "swedishChef", "profile" : "Bork de bork", language : "swedish"})
What the book does not mention (at least this page of it doesn't) is that
one can use the language_override option to specify that MongoDB
should be using some other field for this purpose (say "lang") and
not the default one ("language").
In http://docs.mongodb.org/manual/tutorial/specify-language-for-text-index/ take a look at the language_override option when setting up the index. It allows you to change the name of the field that should be used to define the language of the text search. That way you can leave the "language" property for your application's use, and call it something else (e.g. searchlang or something like that).

MongoDB fulltext search + workaround for partial word match

Since it is not possible to find "blueberry" by the word "blue" by using a mongodb full text search, I want to help my users to complete the word "blue" to "blueberry". To do so, is it possible to query all the words in a mongodb full text index -> that I can use the words as suggestions i.e. for typeahead.js?
Language stemming in text search uses an algorithm to try to relate words derived from a common base (eg. "running" should match "run"). This is different from the prefix match (eg. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search I would suggest focusing on the prefetch support in typeahead:
Create a keywords collection which has the common words (perhaps with usage frequency count) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on, and keep the word list up to date using a periodic Incremental Map/Reduce as new documents are added.
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
name: 'mysearch',
prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
A simple workaround I am doing right now is to break the text into individual chars stored as a text indexed array.
Then when you do the $search query you simply break up the query into chars again.
Please note that this only works for short strings say length smaller than 32 otherwise the indexing building process will take really long thus performance will be down significantly when inserting new records.
You can not query for all the words in the index, but you can of course query the original document's fields. The words in the search index are also not always the full words, but are stemmed anyway. So you probably wouldn't find "blueberry" in the index, but just "blueberri".
Don't know if this might be useful to some new people facing this problem.
Depending on the size of your collection and how much RAM you have available, you can make a search by $regex, by creating the proper index. E.g:
db.collection.find( {query : {$regex: /querywords/}}).sort({'criteria': -1}).limit(limit)
You would need an index as follows:
db.collection.ensureIndex( { "query": 1, "criteria" : -1 } )
This could be really fast if you have enough memory.
Hope this helps.
For those who have not yet started implementing any database architecture and are here for a solution, go for Elasticsearch. Its a json document driven database similar to mongodb structurally. It has "edge-ngram" analyzer which is really really efficient and quick in giving you did you mean for mis-spelled searches. You can also search partially.

Thinking sphinx fuzzy search?

I am implementing sphinx search in my rails application.
I want to search with fuzzy on. It should search for spelling mistakes e.g if is enter search query charact*a*ristics, it should search for charact*e*ristics.
How should I implement this
Sphinx doesn't naturally allow for spelling mistakes - it doesn't care if the words are spelled correctly or not, it just indexes them and matches them.
There's two options around this - either use thinking-sphinx-raspell to catch spelling errors by users when they search, and offer them the choice to search again with an improved query (much like Google does); or maybe use the soundex or metaphone morphologies so words are indexed in a way that accounts for how they sound. Search on this page for stemming, you'll find the relevant section. Also have a read of Sphinx's documentation on the matter as well.
I've no idea how reliable either option would be - personally, I'd opt for #1.
By default, Sphinx does not pay any attention to wildcard searching using an asterisk character. You can turn it on, though:
development:
enable_star: true
# ... repeat for other environments
See http://pat.github.io/thinking-sphinx/advanced_config.html Wildcard/Star Syntax section.
Yes, Sphinx generaly always uses the extended match modes.
There are the following matching modes available:
SPH_MATCH_ALL, matches all query words (default mode);
SPH_MATCH_ANY, matches any of the query words;
SPH_MATCH_PHRASE, matches query as a phrase, requiring perfect match;
SPH_MATCH_BOOLEAN, matches query as a boolean expression (see Section 5.2, “Boolean query syntax”);
SPH_MATCH_EXTENDED, matches query as an expression in Sphinx internal query language (see Section 5.3, “Extended query syntax”);
SPH_MATCH_EXTENDED2, an alias for SPH_MATCH_EXTENDED;
SPH_MATCH_FULLSCAN, matches query, forcibly using the "full scan" mode as below. NB, any query terms will be ignored, such that filters, filter-ranges and grouping will still be applied, but no text-matching.
SPH_MATCH_EXTENDED2 was used during 0.9.8 and 0.9.9 development cycle, when the internal matching engine was being rewritten (for the sake of additional functionality and better performance). By 0.9.9-release, the older version was removed, and SPH_MATCH_EXTENDED and SPH_MATCH_EXTENDED2 are now just aliases.
enable_star
Enables star-syntax (or wildcard syntax) when searching through prefix/infix indexes. >Optional, default is is 0 (do not use wildcard syntax), for compatibility with 0.9.7. >Known values are 0 and 1.
For example, assume that the index was built with infixes and that enable_star is 1. Searching should work as follows:
"abcdef" query will match only those documents that contain the exact "abcdef" word in them.
"abc*" query will match those documents that contain any words starting with "abc" (including the documents which contain the exact "abc" word only);
"*cde*" query will match those documents that contain any words which have "cde" characters in any part of the word (including the documents which contain the exact "cde" word only).
"*def" query will match those documents that contain any words ending with "def" (including the documents that contain the exact "def" word only).
Example:
enable_star = 1