I'm having trouble indexing state abbreviation codes such as IN and OR with Lucene.NET. If I use the StandardAnalyzer when indexing, I cannot retrieve documents by these state abbreviations. If I use the SimpleAnalyzer when indexing, I can retrieve documents by these abbreviations, but other queries, such as zip codes indexed as strings, no longer work.
Any suggestions on the best practice for this type of Lucene dilemma would be appreciated.
Thanks
This was a duplicate question. Thanks I4V, that post was the same issue. After reading the post, I resolved it by changing my code from:
_standardAnalyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30);
To:
_standardAnalyzer = new Lucene.Net.Analysis.Standard.StandardAnalyzer(Lucene.Net.Util.Version.LUCENE_30, new HashSet<string>());
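For anyone wondering why this works: StandardAnalyzer lowercases tokens and then drops English stop words, and both "in" and "or" are in its default stop set, so IN and OR never reach the index. Passing an empty set turns that filtering off. Below is a rough JavaScript sketch of the behaviour, not Lucene's actual tokenizer; `analyze` and `DEFAULT_STOP_WORDS` are made-up names, and the stop list is only a subset chosen for illustration.

```javascript
// Illustrative subset of English stop words (assumption: not the full Lucene list).
const DEFAULT_STOP_WORDS = new Set(["a", "an", "and", "in", "is", "of", "or", "the", "to"]);

// Simplified stand-in for an analyzer: lowercase the text, split on
// non-alphanumeric characters, then drop tokens found in the stop-word set.
function analyze(text, stopWords) {
  return text
    .toLowerCase()
    .split(/[^a-z0-9]+/)
    .filter((token) => token.length > 0 && !stopWords.has(token));
}

// With stop words, the state code disappears before indexing.
console.log(analyze("Portland OR 97201", DEFAULT_STOP_WORDS)); // [ 'portland', '97201' ]

// With an empty stop set, the state code survives as the token "or".
console.log(analyze("Portland OR 97201", new Set())); // [ 'portland', 'or', '97201' ]
```

The same analyzer configuration should be used at query time, otherwise the query parser may strip the abbreviation again before searching.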
I have a collection of documents shaped like {_id: 'xxx', text: 'abc'}.
What is the best way to get a list of entities with the same text, also accounting for spelling mistakes (for example 'gogle' and 'google'), ordered by the number of similar entities?
MongoDB doesn't have the capability to find misspelled items. There are some third-party libraries/plugins that offer this feature by storing double metaphone key codes along with the original version of the text. Here's an example program in C# that gets the result you want. See this page for a brief explainer on how it works. If you're not coding in C#, there's this plugin for mongoose.
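To give an idea of how the phonetic-key approach works, here is a minimal JavaScript sketch. Note that it uses Soundex, a much simpler phonetic code, as a stand-in for double metaphone, and it groups an in-memory array rather than querying MongoDB. The names `soundex` and `groupBySound` are made up for this example.

```javascript
// Soundex digit for each consonant; vowels, h, w and y have no code.
const SOUNDEX_CODES = {
  b: 1, f: 1, p: 1, v: 1,
  c: 2, g: 2, j: 2, k: 2, q: 2, s: 2, x: 2, z: 2,
  d: 3, t: 3,
  l: 4,
  m: 5, n: 5,
  r: 6,
};

// Compute a 4-character Soundex key (first letter plus three digits).
function soundex(word) {
  const letters = word.toLowerCase().replace(/[^a-z]/g, "");
  if (letters.length === 0) return "";
  let key = letters[0].toUpperCase();
  let previous = SOUNDEX_CODES[letters[0]] || 0;
  for (let i = 1; i < letters.length && key.length < 4; i++) {
    const ch = letters[i];
    const code = SOUNDEX_CODES[ch] || 0;
    // Append a digit only when it differs from the previous coded letter.
    if (code !== 0 && code !== previous) key += code;
    // 'h' and 'w' do not separate letters with the same code; vowels do.
    if (ch !== "h" && ch !== "w") previous = code;
  }
  return key.padEnd(4, "0");
}

// Group documents by phonetic key and order groups by size, largest first.
function groupBySound(docs) {
  const groups = new Map();
  for (const doc of docs) {
    const key = soundex(doc.text);
    if (!groups.has(key)) groups.set(key, []);
    groups.get(key).push(doc);
  }
  return [...groups.entries()]
    .map(([key, members]) => ({ key, count: members.length, members }))
    .sort((a, b) => b.count - a.count);
}

const docs = [
  { _id: 1, text: "google" },
  { _id: 2, text: "gogle" },
  { _id: 3, text: "abc" },
  { _id: 4, text: "googel" },
];
console.log(groupBySound(docs).map((g) => [g.key, g.count]));
```

In a real setup the key would be computed once at write time and stored next to the text, so that grouping can be done on an indexed field. Multi-word text is treated as a single word here, which is a limitation of this sketch.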
I am learning how to program in Swift and currently working with the Eureka framework on a project. If anyone could provide some hints/suggestions and brief examples on how to accomplish the following it would be greatly appreciated.
Formatting PhoneRow (or any other row really) using a formatter that would insert dashes like 555-123-4567 as the user is typing.
How to limit maximum input length on a per-field basis?
Filtering InlinePickerRow to allow for a user to narrow down a large list of choices to a subset based on input string in a dynamic small search box.
Thank you in advance! :)
Our database stores photo albums and photos.
Each album has title, tags, description.
Each photo has title, tags and description.
All I want is the ability to show 5 search results as soon as the user types a word in a search box.
Then show 50 search results per page and so on.
Which fields should I index - only title or tags (an embedded array) or both?
What to use for best search experience - MongoDB index on field or other type of index?
Solution must scale as the data grows.
If anyone can help me with some pointers on how to proceed, that will be great.
I am still using an older version of MongoDB, 1.8.
Thanks
If you require only the title to be searchable as per your last comment, then you could simply use the $regex operator:
http://docs.mongodb.org/manual/reference/operator/regex/#op._S_regex
If you anchor the regular expression at the start of the string (e.g. /^something/), the query can use an index on that field, which is much faster than an unanchored match.
The performance of this on a huge database is not going to be fantastic though.
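As a rough illustration of the anchored-prefix idea, here is a JavaScript sketch that builds a start-anchored, case-sensitive pattern from user input and pages through matches. It filters an in-memory array as a stand-in for the database; with MongoDB the same pattern would be passed as the title condition of a find query with a limit. The function names are hypothetical.

```javascript
// Escape regex metacharacters so user input is matched literally.
function escapeRegex(text) {
  return text.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Build a case-sensitive, start-anchored pattern, e.g. "Lon" -> /^Lon/.
function prefixPattern(userInput) {
  return new RegExp("^" + escapeRegex(userInput));
}

// Stand-in for a query with skip/limit: filter titles, sort, then take one page.
function searchTitles(titles, userInput, page, pageSize) {
  const pattern = prefixPattern(userInput);
  const matches = titles.filter((title) => pattern.test(title)).sort();
  return matches.slice(page * pageSize, (page + 1) * pageSize);
}

const titles = ["London Zoo", "Paris", "London Eye"];
// First 5 suggestions while the user types, then pages of 50 would use pageSize 50.
console.log(searchTitles(titles, "Lon", 0, 5)); // [ 'London Eye', 'London Zoo' ]
```

Keeping the pattern case-sensitive matters: a case-insensitive prefix match generally cannot take the same advantage of the index, so one option is to store a lowercased copy of the title and search against that.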
Otherwise, as WiredPrairie suggests, look into the keyword search:
http://docs.mongodb.org/manual/tutorial/model-data-for-keyword-search/
I've implemented Lucene.NET on my site, using it to index my products, which are theatre shows, tours and attractions around London.
I want to implement a "Did you mean?" feature for when users misspell product names that takes the whole product titles into account and not just single words. For example,
If the user typed:
Lodnon Eye
I would like to auto-suggest:
London
London Eye
I assume I need to have the analyzer index the titles as if they are a single entity, so that SpellChecker can nearest-match on the phrase, as well as the individual words.
How would I do this?
There is an excellent blog series here:
Lucene.NET
Introduction to Lucene
Indexing basics
Search basics
Did you mean..
Faceted Search
Class Reference
I have also found another project called SimpleLucene, which you can use to maintain your Lucene indexes whenever you need to update or delete a document. Read about it here
I've recently implemented a phrase autosuggest system in Lucene.NET.
Basically, the Java version of Lucene has a ShingleFilter in one of the contrib folders, which breaks a sentence down into all possible phrase combinations. Unfortunately, Lucene.NET's contrib filters aren't quite there yet, so we don't have a shingle filter.
However, a Lucene index written in Java can be read by Lucene.NET as long as the versions are the same. So what I did was the following:
I created a spell index in Lucene.NET using the SpellChecker.IndexDictionary method, as laid out in the "Did you mean" section of Jake Scott's link. Please note that this only creates a spelling index of single words, not phrases.
I then created a Java app that uses the shingle filter to create phrases from the text I'm searching and saves them in a temporary index.
I then wrote another method in .NET to open this temporary index and add each of the phrases as a line or document to my spelling index, which already contains the single words. The trick is to make sure the documents you're adding have the same form as the rest of the spell documents, so I ripped out the methods used in the SpellChecker code in the Lucene.NET project and edited those.
Once you've done that, you can call the SpellChecker.SuggestSimilar method, pass it a misspelled phrase, and it will return a valid suggestion.
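For readers unfamiliar with shingles, the following JavaScript sketch shows the kind of word n-grams a shingle filter produces from a title. It is only an illustration of the idea under simple assumptions (whitespace tokenization, no unigrams), not Lucene's ShingleFilter itself, and the `shingles` function name is made up.

```javascript
// Produce all runs of consecutive words between minSize and maxSize words long.
function shingles(text, minSize, maxSize) {
  const words = text.trim().split(/\s+/).filter((w) => w.length > 0);
  const result = [];
  for (let start = 0; start < words.length; start++) {
    for (let size = minSize; size <= maxSize && start + size <= words.length; size++) {
      result.push(words.slice(start, start + size).join(" "));
    }
  }
  return result;
}

console.log(shingles("London Eye River Cruise", 2, 3));
// [ 'London Eye', 'London Eye River', 'Eye River', 'Eye River Cruise', 'River Cruise' ]
```

Each of these phrases would then be added to the spelling index as its own entry, next to the single words that are already there.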
This is probably not the best solution, and I would definitely use the answer suggested by spaceman, but here is another possible solution. Use the KeywordAnalyzer or the KeywordTokenizer on each title; this will not break the title down into separate tokens but keep it as one token. The SuggestSimilar method would then return whole titles as suggestions.
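To show what matching on the whole title as a single token looks like, here is a small JavaScript sketch that ranks complete titles by Levenshtein edit distance to the user's input. This is only an approximation of the idea; the real SpellChecker works against an n-gram index, and the names `levenshtein` and `suggestTitles` plus the distance threshold are assumptions for this example.

```javascript
// Classic dynamic-programming Levenshtein distance, keeping two rows in memory.
function levenshtein(a, b) {
  let previousRow = Array.from({ length: b.length + 1 }, (_, j) => j);
  for (let i = 1; i <= a.length; i++) {
    const currentRow = [i];
    for (let j = 1; j <= b.length; j++) {
      const cost = a[i - 1] === b[j - 1] ? 0 : 1;
      currentRow[j] = Math.min(
        previousRow[j] + 1,        // deletion
        currentRow[j - 1] + 1,     // insertion
        previousRow[j - 1] + cost  // substitution or match
      );
    }
    previousRow = currentRow;
  }
  return previousRow[b.length];
}

// Treat each title as one token and return those within maxDistance, closest first.
function suggestTitles(titles, input, maxDistance) {
  const query = input.toLowerCase();
  return titles
    .map((title) => ({ title, distance: levenshtein(query, title.toLowerCase()) }))
    .filter((candidate) => candidate.distance <= maxDistance)
    .sort((a, b) => a.distance - b.distance)
    .map((candidate) => candidate.title);
}

console.log(suggestTitles(["London Eye", "London Zoo", "Lion King"], "Lodnon Eye", 2));
// [ 'London Eye' ]
```

The downside of this approach is that only complete titles can be suggested, so a partial phrase typed by the user will not match well.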
What's the best way to keep track of unique tags for a collection of documents millions of items large? The normal way of doing tagging seems to be indexing multikeys. I will frequently need to get all the unique keys, though. I don't have access to mongodb's new "distinct" command, either, since my driver, erlmongo, doesn't seem to implement it, yet.
Even if your driver doesn't implement distinct, you can implement it yourself. In JavaScript (sorry, I don't know Erlang, but it should translate pretty directly) you can say:
result = db.$cmd.findOne({"distinct" : "collection_name", "key" : "tags"})
So, that is: you do a findOne on the "$cmd" collection of whatever database you're using. Pass it the collection name and the key you want to run distinct on.
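If it helps to see what the command computes, here is a plain JavaScript sketch of the same thing done over an in-memory array of documents. It is only meant to show the semantics for a multikey field like tags (array elements are flattened and de-duplicated); it is not how the server implements it, and `distinctValues` is a made-up helper name. The server's reply wraps the resulting list in a document under a `values` field.

```javascript
// Collect the unique values of `key` across documents, flattening array fields.
function distinctValues(docs, key) {
  const seen = new Set();
  for (const doc of docs) {
    const value = doc[key];
    const items = Array.isArray(value) ? value : [value];
    for (const item of items) {
      // Documents without the key contribute nothing.
      if (item !== undefined) seen.add(item);
    }
  }
  return [...seen];
}

const docs = [
  { tags: ["erlang", "mongodb"] },
  { tags: ["mongodb", "tagging"] },
  { title: "no tags here" },
];
console.log(distinctValues(docs, "tags")); // [ 'erlang', 'mongodb', 'tagging' ]
```

Doing this client-side over millions of documents would be slow, which is why sending the command to the server is the better route.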
If you ever need a command your driver doesn't provide a helper for, you can look at http://www.mongodb.org/display/DOCS/List+of+Database+Commands for a somewhat complete list of database commands.
I know this is an old question, but I had the same issue and could not find a real solution in PHP for it.
So I came up with this:
http://snipplr.com/view/59334/list-of-keys-used-in-mongodb-collection/
John, you may find it useful to use Variety, an open source tool for analyzing a collection's schema: https://github.com/jamescropcho/variety
Perhaps you could run Variety every N hours in the background, and query the newly-created varietyResults database to retrieve a listing of unique keys which begin with a given string (i.e. are descendants of a specific parent).
Let me know if you have any questions, or need additional advice.
Good luck!