Make Lucene index a value and store another - lucene.net

I want Lucene.NET to store a value while indexing a modified, stripped-down version of the stored value. e.g. Consider the value:
this_example-has some/weird (chars) 100%
I want it stored right like that (so that I can retrieve exactly that for showing in the results list), but I want lucene to index it as:
this example has some weird chars 100
(you see, like a "sanitized" version of the original value) for a simplified search.
I figure this would be the job of an analyzer, but I don't want to mess with rolling my own. Ideally, the solution should remove everything that is not a letter, a number or quotes, replacing the removed chars by a white-space before indexing.
Any suggestions on how to implement that?
This is because I am indexing products for an e-commerce search, and some have realy creepy names. I think this would improve search assertiveness.
Thanks in advance.

If you don't want a custom analyzer, try storing the value as a separate non-indexed field, and use a simple regex to generate the sanitized version.
var input = "this_example-has some/weird (chars) 100%";
var output = Regex.Replace(input, #"[\W_]+", " ");
You mention that you need another Analyzer for some searching functionality. Dont forget the PerFieldAnalyzerWrapper which will allow you to use different analyzers within the same document.
public static void Main() {
var wrapper = new PerFieldAnalyzerWrapper(defaultAnalyzer: new StandardAnalyzer(Version.LUCENE_29));
wrapper.AddAnalyzer(fieldName: "id", analyzer: new KeywordAnalyzer());
IndexWriter writer = null; // TODO: Retrieve these.
Document document = null;
writer.AddDocument(document, analyzer: wrapper);
}

You are correct that this is the work of the analyzer. And I'd start by using a tool like luke to see what the standard analyzer does with your term before getting into what to use -- it tends to do a good job stripping noise characters and words.

Related

Autocomplete with Firebase

How does one use Firebase to do basic auto-completion/text preview?
For example, imagine a blog backed by Firebase where the blogger can tag posts with tags. As the blogger is tagging a new post, it would be helpful if they could see all currently-existing tags that matched the first few keystrokes they've entered. So if "blog," "black," "blazing saddles," and "bulldogs" were tags, if the user types "bl" they get the first three but not "bulldogs."
My initial thought was that we could set the tag with the priority of the tag, and use startAt, such that our query would look something like:
fb.child('tags').startAt('bl').limitToFirst(5).once('value', function(snap) {
console.log(snap.val())
});
But this would also return "bulldog" as one of the results (not the end of the world, but not the best either). Using startAt('bl').endAt('bl') returns no results. Is there another way to accomplish this?
(I know that one option is that this is something we could use a search server, like ElasticSearch, for -- see https://www.firebase.com/blog/2014-01-02-queries-part-two.html -- but I'd love to keep as much in Firebase as possible.)
Edit
As Kato suggested, here's a concrete example. We have 20,000 users, with their names stored as such:
/users/$userId/name
Oftentimes, users will be looking up another user by name. As a user is looking up their buddy, we'd like a drop-down to populate a list of users whose names start with the letters that the searcher has inputted. So if I typed in "Ja" I would expect to see "Jake Heller," "jake gyllenhaal," "Jack Donaghy," etc. in the drop-down.
I know this is an old topic, but it's still relevant. Based on Neil's answer above, you more easily search doing the following:
fb.child('tags').startAt(queryString).endAt(queryString + '\uf8ff').limit(5)
See Firebase Retrieving Data.
The \uf8ff character used in the query above is a very high code point
in the Unicode range. Because it is after most regular characters in
Unicode, the query matches all values that start with queryString.
As inspired by Kato's comments -- one way to approach this problem is to set the priority to the field you want to search on for your autocomplete and use startAt(), limit(), and client-side filtering to return only the results that you want. You'll want to make sure that the priority and the search term is lower-cased, since Firebase is case-sensitive.
This is a crude example to demonstrate this using the Users example I laid out in the question:
For a search for "ja", assuming all users have their priority set to the lowercased version of the user's name:
fb.child('users').
startAt('ja'). // The user-inputted search
limitToFirst(20).
once('value', function(snap) {
for(key in snap.val()){
if(snap.val()[key].indexOf('ja') === 0) {
console.log(snap.val()[key];
}
}
});
This should only return the names that actually begin with "ja" (even if Firebase actually returns names alphabetically after "ja").
I choose to use limitToFirst(20) to keep the response size small and because, realistically, you'll never need more than 20 for the autocomplete drop-down. There are probably better ways to do the filtering, but this should at least demonstrate the concept.
Hope this helps someone! And it's quite possible the Firebase guys have a better answer.
(Note that this is very limited -- if someone searches for the last name, it won't return what they're looking for. Hence the "best" answer is probably to use a search backend with something like Kato's Flashlight.)
It strikes me that there's a much simpler and more elegant way of achieving this than client side filtering or hacking Elastic.
By converting the search key into its' Unicode value and storing that as the priority, you can search by startAt() and endAt() by incrementing the value by one.
var start = "ABA";
var pad = "AAAAAAAAAA";
start += pad.substring(0, pad.length - start.length);
var blob = new Blob([start]);
var reader = new FileReader();
reader.onload = function(e) {
var typedArray = new Uint8Array(e.target.result);
var array = Array.prototype.slice.call(typedArray);
var priority = parseInt(array.join(""));
console.log("Priority of", start, "is:", priority);
}
reader.readAsArrayBuffer(blob);
You can then limit your search priority to the key "ABB" by incrementing the last charCode by one and doing the same conversion:
var limit = String.fromCharCode(start.charCodeAt(start.length -1) +1);
limit = start.substring(0, start.length -1) +limit;
"ABA..." to "ABB..." ends up with priorities of:
Start: 65666565656565650000
End: 65666665656565650000
Simples!
Based on Jake and Matt's answer, updated version for sdk 3.1. '.limit' no longer works:
firebaseDb.ref('users')
.orderByChild('name')
.startAt(query)
.endAt(`${query}\uf8ff`)
.limitToFirst(5)
.on('child_added', (child) => {
console.log(
{
id: child.key,
name: child.val().name
}
)
})

JQuery Wildcard for using atttributes in selectors

I've research this topic extensibly and I'm asking as a last resort before assuming that there is no wildcard for what I want to do.
I need to pull up all the text input elements from the document and add it to an array. However, I only want to add the input elements that have an id.
I know you can use the \S* wildcard when using an id selector such as $(#\S*), however I can't use this because I need to filter the results by text type only as well, so I searching by attribute.
I currently have this:
values_inputs = $("input[type='text'][id^='a']");
This works how I want it to but it brings back only the text input elements that start with an 'a'. I want to get all the text input elements with an 'id' of anything.
I can't use:
values_inputs = $("input[type='text'][id^='']"); //or
values_inputs = $("input[type='text'][id^='*']"); //or
values_inputs = $("input[type='text'][id^='\\S*']"); //or
values_inputs = $("input[type='text'][id^=\\S*]");
//I either get no values returned or a syntax error for these
I guess I'm just looking for the equivalent of * in SQL for JQuery attribute selectors.
Is there no such thing, or am I just approaching this problem the wrong way?
Actually, it's quite simple:
var values_inputs = $("input[type=text][id]");
Your logic is a bit ambiguous. I believe you don't want elements with any id, but rather elements where id does not equal an empty string. Use this.
values_inputs = $("input[type='text']")
.filter(function() {
return this.id != '';
});
Try changing your selector to:
$("input[type='text'][id]")
I figured out another way to use wild cards very simply. This helped me a lot so I thought I'd share it.
You can use attribute wildcards in the selectors in the following way to emulate the use of '*'. Let's say you have dynamically generated form in which elements are created with the same naming convention except for dynamically changing digits representing the index:
id='part_x_name' //where x represents a digit
If you want to retrieve only the text input ones that have certain parts of the id name and element type you can do the following:
var inputs = $("input[type='text'][id^='part_'][id$='_name']");
and voila, it will retrieve all the text input elements that have "part_" in the beginning of the id string and "_name" at the end of the string. If you have something like
id='part_x_name_y' // again x and y representing digits
you could do:
var inputs = $("input[type='text'][id^='part_'][id*='_name_']"); //the *= operator means that it will retrieve this part of the string from anywhere where it appears in the string.
Depending on what the names of other id's are it may start to get a little trickier if other element id's have similar naming conventions in your document. You may have to get a little more creative in specifying your wildcards. In most common cases this will be enough to get what you need.

whoosh doesn't search for short words like "C#"

i am using whoosh to index over 200,000 books. but i have encountered some problems with it.
the whoosh query parser returns NullQuery for words like "C#", "C++" with meta-characters in them and also for some other short words. this words are used in the title and body of some documents so i am not using keyword type for them. i guess the problem is in the analysis or query-parsing phase of searching or indexing but i can't touch my data blindly. can anyone help me to correct this issue. Tnx.
i fixed the problem by creating a StandardAnalyzer with a regex pattern that meets my requirements,here is the regex pattern:
'\w+[#+.\w]*'
this will make tokenizing of fields to be done successfully, and also the searching goes well.
but when i use queries like "some query++*" or "some##*" the parsed query will be a single Every query, just the '*'. also i found that this is not related to my analyzer and this is the Whoosh's default behavior. so here is my new question: is this behavior correct or it is a bug??
note: removing the WildcardPlugin from the query-parser solves this problem but i also need the WildcardPlugin.
now i am using the following code:
from whoosh.util import rcompile
#for matching words like: '.NET', 'C++' and 'C#'
word_pattern = rcompile('(\.|[\w]+)(\.?\w+|#|\+\+)*')
#i don't need words shorter that two characters so i don't change the minsize default
analyzer = analysis.StandardAnalyzer(expression=word_pattern)
... now in my schema:
...
title = fields.TEXT(analyzer=analyzer),
...
this will solve my first problem, yes. but the main problem is in searching. i don't want to let users to search using the Every query or *. but when i parse queries like C++* i end up an Every(*) query. i know that there is some problem but i can't figure out what it is.
I had the same issue and found out that StandardAnalyzer() uses minsize=2 by default. So in your schema, you have to tell it otherwise.
schema = whoosh.fields.Schema(
name = whoosh.fields.TEXT(stored=True, analyzer=whoosh.analysis.StandardAnalyzer(minsize=1)),
# ...
)

How to search across all the fields?

In Lucene, we can use TermQuery to search a text with a field. I am wondering how to search a keyword across a bunch of fields or all the searchable fields?
Another approach, which doesn't require to index anything more than what you already have, nor to combine different queries, is using the MultiFieldQueryParser.
You can provide a list of fields where you want to search on and your query, that's all.
MultiFieldQueryParser queryParser = new MultiFieldQueryParser(
Version.LUCENE_41,
new String[]{"title", "content", "description"},
new StandardAnalyzer(Version.LUCENE_41));
Query query = queryParser.parse("here goes your query");
This is how I would do it with the original lucene library written in Java. I'm not sure whether the MultiFieldQueryParser is available in lucene.net too.
Two approaches
1) Index-time approach: Use a catch-all field. This is nothing but appending all the text from all the fields (total text from your input doc) and place that resulting huge text in a single field. You've to add an additional field while indexing to act as a catch-all field.
2) Search-time approach: Use a BooleanQuery to combine multiple queries, for example TermQuery instances. Those multiple queries can be formed to cover all the target fields.
Example check at the end of the article.
Use approach 2 if you know the target-field list at runtime. Otherwise, you've got to use the 1st approach.
Another easy approach to search across all fields using "MultifieldQueryParser" is use IndexReader.FieldOption.ALL in your query.
Here is example in c#.
Directory directory = FSDirectory.Open(new DirectoryInfo(HostingEnvironment.MapPath(VirtualIndexPath)));
//get analyzer
Analyzer analyzer = new StandardAnalyzer(Version.LUCENE_29);
//get index reader and searcher
IndexReader indexReader__1 = IndexReader.Open(directory, true);
Searcher indexSearch = new IndexSearcher(indexReader__1);
//add all possible fileds in multifieldqueryparser using indexreader getFieldNames method
var queryParser = new MultiFieldQueryParser(Version.LUCENE_29, indexReader__1.GetFieldNames(IndexReader.FieldOption.ALL).ToArray(), analyzer);
var query = queryParser.Parse(Criteria);
TopDocs resultDocs = null;
//perform search
resultDocs = indexSearch.Search(query, indexReader__1.MaxDoc());
var hits = resultDocs.scoreDocs;
click here to check out my pervious answer to same quesiton in vb.net

TermQuery not returning on a known search term, but WildcardQuery does

Am hoping someone with enough insight into the inner workings of Lucene might be able to point me in the right direction =)
I'll skip most of the surrounding irellevant code, and cut right to the chase. I have a Lucene index, to which I am adding the following field to the index (variables replaced by their literal values):
document.Add( new Field("Typenummer", "E5CEB501A244410EB1FFC4761F79E7B7",
Field.Store.YES , Field.Index.UN_TOKENIZED));
Later, when I search my index (using other types of queries), I am able to verify that this field does indeed appear in my index - like when looping through all Fields returned by Document.GetFields()
Field: Typenummer, Value: E5CEB501A244410EB1FFC4761F79E7B7
So far so good :-)
Now the real problem is - why can I not use a TermQuery to search against this value and actually get a result.
This code produces 0 hits:
// Returns 0 hits
bq.Add( new TermQuery( new Term( "Typenummer",
"E5CEB501A244410EB1FFC4761F79E7B7" ) ), BooleanClause.Occur.MUST );
But if I switch this to a WildcardQuery (with no wildcards), I get the 1 hit I expect.
// returns the 1 hit I expect
bq.Add( new WildcardQuery( new Term( "Typenummer",
"E5CEB501A244410EB1FFC4761F79E7B7" ) ), BooleanClause.Occur.MUST );
I've checked field lengths, I've checked that I am using the same Analyzer and so on and I am still on square 1 as to why this is.
Can anyone point me in a direction I should be looking?
I finally figured out what was going on. I'm expanding the tags for this question as it, much to my surprise, actually turned out to be an issue with the CMS this particular problem exists in. In summary, the problem came down to this:
The field is stored UN_TOKENIZED, meaning Lucene will store it excactly "as-is"
The BooleanQuery I pasted snippets from gets sent to the Sitecore SearchManager inside a PreparedQuery wrapper
The behaviour I expected from this was, that my query (having already been prepared) would go - unaltered - to the Lucene API
Turns out I was wrong. It passes through a RewriteQuery method that copies my entire set of nested queries as-is, with one exception - all the Term arguments are passed through a LowercaseStrategy()
As I indexed an UPPERCASE Term (UN_TOKENIZED), and Sitecore changes my PreparedQuery to lowercase - 0 results are returned
Am not going to start an argument of whether this is "by design" or "by design flaw" implementation of the Lucene Wrapper API - I'll just note that rewriting my query when using the PreparedQuery overload is... to me... unexpected ;-)
Further teachings from this; storing the field as TOKENIZED will eliminate this problem too, as the StandardAnalyzer by default will lowercase all tokens.