I've implemented Lucene.NET on my site, using it to index my products, which are theatre shows, tours and attractions around London.
I want to implement a "Did you mean?" feature for when users misspell product names, one that takes whole product titles into account rather than just single words. For example, if the user typed:
Lodnon Eye
I would like to auto-suggest:
London
London Eye
I assume I need to have the analyzer index the titles as if they were a single entity, so that SpellChecker can nearest-match on the phrase as well as on the individual words.
How would I do this?
There is an excellent blog series here:
- Lucene.NET
- Introduction to Lucene
- Indexing basics
- Search basics
- Did you mean...
- Faceted Search
- Class Reference
I have also found another project called SimpleLucene, which you can use to maintain your Lucene indexes whenever you need to update or delete a document. Read about it here.
I've just recently implemented a phrase autosuggest system in Lucene.NET.
Basically, the Java version of Lucene has a ShingleFilter in one of the contrib folders, which breaks a sentence down into all possible phrase combinations. Unfortunately Lucene.NET's contrib filters aren't quite there yet, so we don't have a ShingleFilter.
However, a Lucene index written in Java can be read by Lucene.NET as long as the versions are the same. So what I did was the following:
Created a spell index in Lucene.NET using the SpellChecker.IndexDictionary method, as laid out in the "Did you mean" section of Jake Scott's link. Note that this only creates a spelling index of single words, not phrases.
I then created a Java app that uses the ShingleFilter to create phrases from the text I'm searching and saves them in a temporary index.
I then wrote another method in .NET to open this temporary index and add each of the phrases as a document to the spelling index that already contains the single words. The trick is to make sure the documents you add have the same form as the rest of the spell documents, so I ripped out the methods used in the SpellChecker code in the Lucene.NET project and edited those.
Once you've done that, you can call the SpellChecker.SuggestSimilar method and pass it a misspelled phrase, and it will return a valid suggestion.
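Once the combined index exists, the lookup side is small. Here's a minimal sketch of what that call looks like (Lucene.Net 2.9-era contrib SpellChecker; the index path is illustrative):

    using System;
    using Lucene.Net.Store;

    class PhraseSuggest
    {
        static void Main()
        {
            // Spell index that already contains both the single words and the
            // shingled phrases copied over from the Java-built index.
            var spellDir = FSDirectory.Open(new System.IO.DirectoryInfo("spell-index"));
            var spell = new SpellChecker.Net.Search.Spell.SpellChecker(spellDir);

            // Nearest-matches against every entry, so a phrase entry like
            // "london eye" can come back for a misspelled phrase input.
            string[] suggestions = spell.SuggestSimilar("lodnon eye", 5);
            foreach (var s in suggestions)
                Console.WriteLine(s);
        }
    }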
This is probably not the best solution, and I would definitely go with the answer suggested by spaceman, but here is another possibility: use the KeywordAnalyzer (or the KeywordTokenizer) on each title. This will not break the title down into separate tokens but will keep it as a single token, so the SuggestSimilar method will return whole titles as suggestions.
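A minimal sketch of that idea, assuming a "title" field and Lucene.Net 2.9-era APIs (paths and field names are illustrative):

    using Lucene.Net.Analysis;
    using Lucene.Net.Documents;
    using Lucene.Net.Index;
    using Lucene.Net.Store;

    class TitleIndexer
    {
        static void Main()
        {
            var dir = FSDirectory.Open(new System.IO.DirectoryInfo("title-index"));

            // KeywordAnalyzer emits the whole field value as a single token,
            // so "London Eye" is indexed as one term rather than two.
            var writer = new IndexWriter(dir, new KeywordAnalyzer(), true,
                                         IndexWriter.MaxFieldLength.UNLIMITED);

            var doc = new Document();
            doc.Add(new Field("title", "London Eye",
                              Field.Store.YES, Field.Index.ANALYZED));
            writer.AddDocument(doc);
            writer.Close();
        }
    }

A spell index built from this field (e.g. SpellChecker.IndexDictionary over a LuceneDictionary for "title") then contains whole titles, so SuggestSimilar can return "London Eye" for "Lodnon Eye".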
Having a collection of documents like {_id: 'xxx', text: 'abc'}:
What is the best way to get a list of entities with the same text, also taking spelling mistakes into account (for example 'gogle' and 'google'), ordered by the number of similar entities?
MongoDB doesn't have the capability to find misspelled items. There are some third-party libraries/plugins that offer this feature by storing double metaphone key codes along with the original version of the text. Here's an example program in C# that gets the result you want; see this page for a brief explainer on how it works. If you're not coding in C#, there's this plugin for Mongoose.
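To make the pattern concrete, here is a rough C# sketch. DoubleMetaphone.Encode stands in for whatever phonetic library you choose (it is not a built-in), and the collection and field names are illustrative:

    using MongoDB.Bson;
    using MongoDB.Driver;
    using MongoDB.Driver.Builders;

    class FuzzyText
    {
        static void Main()
        {
            var collection = new MongoClient("mongodb://localhost")
                .GetServer().GetDatabase("demo").GetCollection<BsonDocument>("entities");

            // Store a phonetic key next to the original text at insert time.
            collection.Insert(new BsonDocument {
                { "text", "google" },
                { "phonetic", DoubleMetaphone.Encode("google") } // hypothetical encoder
            });

            // "gogle" encodes to the same key, so an exact (and indexable) match
            // on the phonetic field finds the misspelling; group the hits by
            // text on the client to order by the number of similar entities.
            var matches = collection.Find(
                Query.EQ("phonetic", DoubleMetaphone.Encode("gogle")));
        }
    }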
We regularly need to perform a handful of relatively simple tests against a bunch of MS Word documents. As these checks are currently done manually, I am looking for a way to automate them. For example:
Check if every page actually has a page number and verify that it is correct.
Verify that a version identifier in the page header is identical across all pages.
Check if the document has a table of contents.
Check if the document has a table of figures.
Check if every figure has a caption.
et cetera. Is this reasonably feasible using PowerShell in conjunction with a Word API?
PowerShell can access Word via its object model/Interop (on Windows, at any rate) and, as I understand it, can also work with the Office Open XML (OOXML) API, so really you should be able to write any checks you want on the document content. What is slightly less obvious is how you verify that the document content will result in a particular "printed appearance". I'm going to start with some comments on the details first.
Bear in mind that in the following notes I'm just pointing out a few things that you might have to deal with. If you're examining documents produced by an organisation where people are already, broadly speaking, following the same standards, it may be easier.
Of the 5 examples you give, I couldn't say exactly how you would do them without checking the details, and there could be difficulties with all of them, but for example:
Check if every page actually has a page number and verify that it is correct.
Difficult using either OOXML or the object model, because what you would really be checking is that the header for a particular section has a visible { PAGE } field code. Because that field code might be nested inside other fields (an { IF } field, say) that conditionally hide it, it's not so easy to be sure that there would actually be a page number.
Which is what I mean by checking the document's "printed appearance": if, for example, you can use the object model to print to PDF and have some mechanism that lets PowerShell inspect the PDF's content, that might be a better approach.
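For what it's worth, a naive version of the header check through the object model might look like the C# Interop sketch below; the same members are reachable from PowerShell via New-Object -ComObject Word.Application. Per the caveat above, it spots a top-level { PAGE } field but not one hidden inside a conditional field:

    using Microsoft.Office.Interop.Word;

    class PageNumberCheck
    {
        static void Main()
        {
            var app = new Application();
            var doc = app.Documents.Open(@"C:\docs\spec.docx", ReadOnly: true);

            foreach (Section section in doc.Sections)
            {
                // Look for a { PAGE } field in the section's primary header.
                bool hasPageField = false;
                var header = section.Headers[WdHeaderFooterIndex.wdHeaderFooterPrimary];
                foreach (Field f in header.Range.Fields)
                    if (f.Type == WdFieldType.wdFieldPage)
                        hasPageField = true;

                System.Console.WriteLine("Section {0}: page field {1}",
                    section.Index, hasPageField ? "present" : "MISSING");
            }

            doc.Close(false);
            app.Quit();
        }
    }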
Verify that a version identifier in the page header is identical across all pages.
Similar problem to the above, IMO. It depends partly on how the version identifier might be inserted. Is it just a piece of text? Could it be constructed from a number of fields? Might it reference Document Properties or Variables, or Custom XML content?
Check if the document has a table of contents.
Perhaps enough to look for a TOC field that does not have certain options, such as a \c option that a Table of Figures would contain.
Check if the document has a table of figures.
Perhaps enough to check for a TOC field that does have a \c option, perhaps with a specific parameter such as "Figure".
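Both of those checks reduce to scanning the document's fields and inspecting the field code text, along these lines (C# Interop sketch; the same calls work from PowerShell):

    using Microsoft.Office.Interop.Word;

    static class TocChecks
    {
        public static void CheckTocFields(Document doc)
        {
            bool hasToc = false, hasTableOfFigures = false;

            foreach (Field f in doc.Fields)
            {
                if (f.Type != WdFieldType.wdFieldTOC) continue;

                // e.g. " TOC \o \"1-3\" " or " TOC \c \"Figure\" "
                string code = f.Code.Text;
                if (code.Contains("\\c"))
                    hasTableOfFigures = true;  // \c switch: figures, tables, etc.
                else
                    hasToc = true;             // ordinary table of contents
            }

            System.Console.WriteLine("TOC: {0}, table of figures: {1}",
                                     hasToc, hasTableOfFigures);
        }
    }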
Check if every figure has a caption.
Not sure that you can tell whether a particular image is "a Figure". But if you mean "verify that every graphic object has a caption", you could probably iterate through the inline and floating graphics in the document and verify that there is something that looks like a Word standard caption paragraph within a certain distance of each object. Word has two standard field code patterns for captions AFAIK (one where the chapter number is included and one where it isn't), so you could look for those. You could measure the distance between image and caption by ensuring that they are no more than a predefined number of paragraphs apart, or, in the case of a floating image, that the paragraph anchoring the image is no more than so many paragraphs away from the caption.
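A rough sketch of the inline case (the style name assumes the English built-in "Caption" style; floating shapes would be walked via doc.Shapes and their anchor ranges instead):

    using Microsoft.Office.Interop.Word;

    static class CaptionChecks
    {
        public static void CheckInlineCaptions(Document doc)
        {
            foreach (InlineShape shape in doc.InlineShapes)
            {
                // Check the next couple of paragraphs after the one holding
                // the graphic for a caption-styled paragraph.
                bool captioned = false;
                Paragraph p = shape.Range.Paragraphs[1].Next();
                for (int i = 0; i < 2 && p != null; i++, p = p.Next())
                {
                    var style = p.get_Style() as Style;
                    if (style != null && style.NameLocal == "Caption")
                    {
                        captioned = true;
                        break;
                    }
                }

                if (!captioned)
                    System.Console.WriteLine("Uncaptioned graphic at offset {0}",
                                             shape.Range.Start);
            }
        }
    }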
A couple of more general problems that you might have to deal with:
- just because a document contains a certain feature, such as a TOC field, does not mean that it is visible. A TOC field might have been formatted as hidden. Even harder to detect, its result could have been coloured white.
- change tracking. You might have to use the Word object model to "accept changes" before checking whether any given feature is actually there or not. Unless you can find existing code that would help you do that using the OOXML representation of the document, that's probably a strong case for doing checks via the object model.
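The accept-changes step itself is a one-liner through the object model, e.g.:

    // Do this on a copy: it permanently accepts every tracked revision
    // so that the remaining checks see the document as it would print.
    doc.Revisions.AcceptAll();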
Some final observations
- for future checks, perhaps worth noting that in principle you could create a "Document Inspector" that users could call from Word's Backstage view to run checks on a document. I'm not sure you can force users to run it, or that you could create it in PowerShell, but it could be a useful tool.
- longer term, if you are doing a very large number of checks, perhaps worth considering whether you could train an ML model to detect problems.
Our database stores photo albums and photos.
Each album has title, tags, description.
Each photo has title, tags and description.
All I want is the ability to show 5 search results as soon as the user types a word in a search box.
Then show 50 search results per page, and so on.
Which fields should I index - only title, or tags (an embedded array), or both?
What should I use for the best search experience - a MongoDB index on the field, or some other type of index?
Solution must scale as the data grows.
If anyone can help me with some pointers on how to proceed, that will be great.
I am still using an older version of MongoDB, 1.8.
Thanks
If you require only the title to be searchable, as per your last comment, then you could simply use the $regex operator:
http://docs.mongodb.org/manual/reference/operator/regex/#op._S_regex
If you anchor the regular expression (i.e. /^something/) then it will even use your indexes which will be super fast.
The performance of this on a huge database is not going to be fantastic though.
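A sketch of the anchored form with the C# driver (collection and field names are illustrative; keep the regex case-sensitive, since a case-insensitive one defeats the index - which usually means storing a lowercased copy of the title to search against):

    using MongoDB.Bson;
    using MongoDB.Driver;
    using MongoDB.Driver.Builders;

    class TypeaheadQuery
    {
        static void Main()
        {
            var albums = new MongoClient("mongodb://localhost")
                .GetServer().GetDatabase("photos").GetCollection<BsonDocument>("albums");

            // Equivalent to db.albums.find({ title: /^lond/ }).limit(5) in the
            // shell; the anchored prefix lets MongoDB walk the index on "title".
            var results = albums
                .Find(Query.Matches("title", new BsonRegularExpression("^lond")))
                .SetLimit(5);
        }
    }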
Otherwise, as WiredPrairie suggests, look into the keyword search:
http://docs.mongodb.org/manual/tutorial/model-data-for-keyword-search/
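The shape of that approach, roughly (continuing with the illustrative albums collection from the sketch above): store a lowercased keyword array on each document and put a multikey index on it:

    // One keyword array per album/photo, kept lowercased at write time.
    albums.Insert(new BsonDocument {
        { "title", "Summer in London" },
        { "_keywords", new BsonArray { "summer", "london" } }
    });

    // A multikey index on the array makes exact keyword lookups cheap.
    albums.CreateIndex("_keywords");

    var firstFive = albums.Find(Query.EQ("_keywords", "london"))
                          .SetLimit(5); // the five instant results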
I am using Apache Lucy to speed up a typeahead (autocomplete) field on a Web form. I am querying against nearly 800k records. I have a working setup, but would like to limit my responses to terms that begin with the query string. Currently the query matcher either matches whole words, or, if I tokenize with /./, it matches the query against arbitrary partials of whole words.
While going through the documentation I found Lucy::Docs::Cookbook::CustomQueryParser.
On that page, under the heading "Extending the query language", there is a reference to PrefixQuery. That package does not exist in Lucy, and I had to do some more searching. Eventually I found the PrefixQuery.pm code sample in Lucy's git repository.
Note that this sample references another non-existent package called Lucy::Search::Tally. Removing the references to the tally allowed me to get the example working, but it is far from a functional matcher: it doesn't handle multiple fields, there's no scoring, and so on.
Does anyone know of a way to make Lucy do prefix matching without all this mucking about?
Found a solution in the Apache docs.
http://lucy.apache.org/docs/perl/Lucy/Docs/Cookbook/CustomQuery.html
I found a great tutorial on performing a faceted search.
http://www.devatwork.nl/articles/lucenenet/faceted-search-and-drill-down-lucenenet/
This article does not explain how to retrieve the narrowed set of available attributes to filter by (for further drill-down).
Let's say I am looking for planners that are red. When I perform the faceted search, I want to return all available attributes to filter by among the red items. Then, when I add a "weekly format" filter, I want the attribute list to get even smaller, containing only the filters available for the narrowed-down group.
I would love to use Solr/SolrNet, but I am in a shared hosting situation with limited access to the actual server.
I am fairly new to Lucene.NET, so examples are much appreciated.
IIUC, you get a BitArray containing the filtered results. In the tutorial's example, combinedResults is that list. If you want to narrow it down further, you reiterate the process: run another search query and intersect its results with the BitArray you already have for combinedResults.
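A sketch of that second pass (Lucene.Net 2.9-era API; the field and value names are illustrative):

    using System.Collections;
    using Lucene.Net.Index;
    using Lucene.Net.Search;

    static class DrillDown
    {
        // Collect the matching doc ids for one facet query into a BitArray.
        static BitArray GetBits(IndexReader reader, Query facetQuery)
        {
            var bits = new BitArray(reader.MaxDoc());
            var it = new QueryWrapperFilter(facetQuery).GetDocIdSet(reader).Iterator();
            while (it.NextDoc() != DocIdSetIterator.NO_MORE_DOCS)
                bits[it.DocID()] = true;
            return bits;
        }

        public static BitArray Narrow(IndexReader reader)
        {
            // First pass: everything red (the tutorial's combinedResults).
            BitArray combined = GetBits(reader, new TermQuery(new Term("color", "red")));

            // Drill-down: AND in the new filter. To rebuild the attribute list,
            // AND combined against each remaining attribute's bits and keep the
            // attributes whose intersection still has set bits.
            return combined.And(GetBits(reader, new TermQuery(new Term("format", "weekly"))));
        }
    }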
I would love to use Solr/SolrNet, but I am in a shared hosting situation with limited access to the actual server.
You can always use an off-site, hosted Solr solution. See this question for more information.