Combining Lucene Results with Entity Framework Results? - entity-framework

Obviously, changes in a database will not always be reflected in the index. Is anyone using EF with Lucene and combining the results of a Lucene search with the results from the same search in EF? I understand you would only want to pull back results from EF that are not in the Lucene results.
Update:
I guess the best way to handle this would be to first search the Lucene index and get a list of results, then you would do a search like this for EF:
Pseudo Code:
var result = (from ef in EntityFrameworkList
              where !(from l in LuceneList
                      select l.documentId)
                     .Contains(ef.Id)
              select ef);
LuceneList.AddRange(result);
For those who like method chains:
var result = EntityFrameworkList
    .Where(ef => !LuceneList.Select(l => l.documentId).Contains(ef.Id));

We indeed followed the approach I suggested in the update. One thing to note is we only used this for displaying small lists of results and not for every type of search. For Full Text Search on larger documents we only used the Lucene results since we are not storing FTS data in our database.
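The merge described above can be sketched in Python to show the shape of the logic independent of EF and Lucene. This is a minimal sketch, not the code we ran: `Hit`, `merge_results`, and the sample data are hypothetical names for illustration, and `document_id` stands in for `documentId` / `Id` in the pseudocode. It collects the indexed ids into a set first, so the exclusion check does not rescan the Lucene list for every database row.

```python
from dataclasses import dataclass


@dataclass
class Hit:
    """A search result; `document_id` mirrors `documentId` in the pseudocode."""
    document_id: int
    title: str


def merge_results(lucene_hits, db_rows):
    """Return the Lucene hits followed by database rows the index missed."""
    # Collect the indexed ids once so each membership test is a set lookup.
    seen = {hit.document_id for hit in lucene_hits}
    missing = [row for row in db_rows if row.document_id not in seen]
    return list(lucene_hits) + missing


lucene_hits = [Hit(1, "alpha"), Hit(2, "beta")]
db_rows = [Hit(2, "beta"), Hit(3, "gamma")]  # row 3 is not indexed yet
merged = merge_results(lucene_hits, db_rows)
```

Lucene hits keep their relevance order and the database-only rows are appended at the end, which matches the `AddRange` call in the pseudocode.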

Related

How to display score of Hibernate Search query results

Hibernate Search allows sorting search results by relevance. Is it possible to obtain and display this score (e.g. in a JSP view) when using a Lucene query?
A Query in Hibernate Search can return Projections rather than the simple list of matching entities.
A projection result essentially means each result is an array containing the projections you asked for, in order. Normally this is used to extract text from a specific field, avoiding the need to load the data from the database, but there are also projection constants that return the score value or the explanation of the scoring.
query.setProjection( ProjectionConstants.SCORE, ProjectionConstants.EXPLANATION, ProjectionConstants.THIS );
See also the Reference documentation on projections explaining this and more.

Is it possible to make lucene.net ignore case of the field for queries?

I have some documents indexed with a "GuidId" field and others with a "guidid" field. How can I make Lucene.NET ignore the case of the field name, so that the following query matches regardless of case?
TermQuery termQuery = new TermQuery(new Term("GuidId", guidId.ToString()));
I don't want to write another query for the documents with the lowercase field name "guidid".
Ideally, don't have field names with funky cases. If you are defining field names dynamically or some such, then you should lowercase them before adding them to the index. That done, it should be easy enough to keep the query fields' names lowercase as well, and you're in good shape.
If, for whatever reason, you are stuck with this case-sensitive data, you'll be stuck expanding your queries to search all the known permutations of the field name to get all your results. Something like:
// Tie-breaker of 0: a document's score is that of its best-matching subquery.
DisjunctionMaxQuery finalQuery = new DisjunctionMaxQuery(0.0f);
finalQuery.Add(new TermQuery(new Term("GuidId", guidId.ToString())));
finalQuery.Add(new TermQuery(new Term("guidid", guidId.ToString())));
DisjunctionMaxQuery would probably be a good choice here, since it scores each document by the best-scoring query in its collection, rather than compounding scores across multiple matching queries.
You can also use MultiFieldQueryParser to similar effect. I don't believe it uses DisjunctionMax, but it doesn't sound like it would likely be that big a deal in this case.
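Both suggestions above (lowercase field names before indexing, or expand the query across the known case variants) come down to simple string handling. The following is a language-neutral sketch in Python rather than Lucene.NET code; `normalize_field_names` and `field_variants` are hypothetical helpers, and documents are modelled as plain dictionaries.

```python
def normalize_field_names(document):
    """Return a copy of `document` with every field name lowercased."""
    normalized = {}
    for name, value in document.items():
        key = name.lower()
        # Two names that differ only by case would silently overwrite each other.
        if key in normalized:
            raise ValueError(f"field names collide after lowercasing: {key!r}")
        normalized[key] = value
    return normalized


def field_variants(known_fields, name):
    """List the already-indexed field names that equal `name` ignoring case."""
    return sorted(f for f in known_fields if f.lower() == name.lower())


doc = normalize_field_names({"GuidId": "abc", "Title": "hello"})
variants = field_variants({"GuidId", "guidid", "title"}, "GUIDID")
```

The first helper belongs at index time; the second yields the list of field names to feed into one term query each when the index already contains mixed-case names.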

MongoDB fulltext search + workaround for partial word match

Since it is not possible to find "blueberry" by the word "blue" by using a mongodb full text search, I want to help my users to complete the word "blue" to "blueberry". To do so, is it possible to query all the words in a mongodb full text index -> that I can use the words as suggestions i.e. for typeahead.js?
Language stemming in text search uses an algorithm to try to relate words derived from a common base (eg. "running" should match "run"). This is different from the prefix match (eg. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search I would suggest focusing on the prefetch support in typeahead:
Create a keywords collection which has the common words (perhaps with usage frequency count) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on, and keep the word list up to date using a periodic Incremental Map/Reduce as new documents are added.
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
    name: 'mysearch',
    prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
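The keyword-list step can be sketched without Map/Reduce to show what the generated file might contain. This is only an illustration under assumptions: `build_keywords`, its `limit` and `min_length` parameters, and the sample texts are hypothetical, the tokenizer is a naive split on non-alphanumeric characters, and the exact JSON shape typeahead.js expects depends on the version in use.

```python
import json
import re
from collections import Counter


def build_keywords(texts, limit=1000, min_length=3):
    """Return up to `limit` of the most frequent words across `texts`."""
    counts = Counter()
    for text in texts:
        # Lowercase, then keep runs of letters and digits as words.
        words = re.findall(r"[a-z0-9]+", text.lower())
        counts.update(w for w in words if len(w) >= min_length)
    return [word for word, _ in counts.most_common(limit)]


texts = ["Blueberry pie and blueberry jam", "Blue cheese pie"]
# Serialize the popular keywords; an application would serve this as keywords.json.
keywords_json = json.dumps(build_keywords(texts, limit=2))
```

In a real deployment the texts would come from the collection that carries the text index, and the list would be regenerated periodically as documents are added.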
A simple workaround I am using right now is to break the text into individual characters and store them as a text-indexed array.
Then, when you run the $search query, you break the query into characters in the same way.
Please note that this only works for short strings, say shorter than 32 characters; otherwise building the index takes so long that insert performance drops significantly.
You can not query for all the words in the index, but you can of course query the original document's fields. The words in the search index are also not always the full words, but are stemmed anyway. So you probably wouldn't find "blueberry" in the index, but just "blueberri".
Don't know if this might be useful to some new people facing this problem.
Depending on the size of your collection and how much RAM you have available, you can make a search by $regex, by creating the proper index. E.g:
db.collection.find( {query : {$regex: /querywords/}}).sort({'criteria': -1}).limit(limit)
You would need an index as follows:
db.collection.ensureIndex( { "query": 1, "criteria" : -1 } )
This could be really fast if you have enough memory.
Hope this helps.
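One caveat worth adding: MongoDB can use an index most efficiently for a regular expression when the pattern is a case-sensitive prefix expression, i.e. anchored with `^`, which happens to be exactly what an autocomplete needs. Below is a small sketch that builds such a filter in Python; `prefix_filter` is a hypothetical helper, and the resulting dictionary is meant to be passed to a driver's `find` call (the driver and collection are assumed, not shown).

```python
import re


def prefix_filter(field, prefix):
    """Build a MongoDB filter matching values of `field` that start with `prefix`."""
    # Escape user input so characters such as "." or "*" match literally,
    # and anchor with "^" so the pattern is a prefix expression.
    return {field: {"$regex": "^" + re.escape(prefix)}}


query = prefix_filter("query", "blue")
```

With the compound index shown above, a filter like this can be answered by scanning only the index range that starts with the prefix, whereas an unanchored pattern such as `/querywords/` has to examine every index entry.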
For those who have not yet started implementing any database architecture and are here for a solution, consider Elasticsearch. It's a JSON document store, structurally similar to MongoDB. Its "edge-ngram" analysis is efficient and quick for partial (prefix) matching, and it can also give you "did you mean" suggestions for misspelled searches.

MongoEngine search index

I'm trying to implement an inverted-index search engine with MongoDb (MongoEngine) where terms in Posts are assigned weights and then used as embedded documents like such:
class Term(db.EmbeddedDocument):
t = db.StringField()
weight = db.FloatField()
class Post(db.Document):
terms = db.ListField(db.EmbeddedDocumentField(Term))
Then given a term, I can find the Posts that contain the term using this query:
post_list = Post.objects(terms__t=term)
However, this returns a list of Posts, but how can I find the weight of the term for each returned Post without having to iterate through the list of embedded terms looking for the term? Is there a way to query the Posts to automatically return the weight for any returned Posts as well?
Also would appreciate if anyone has any better methods of implementing a search engine in MongoDB?
Thanks!
MongoDB supports a basic text index; see http://docs.mongodb.org/manual/core/index-text/. This is a better way to store and search documents, especially if you want a score for each match.
You'd have to call the command manually, as it's not currently implemented in MongoEngine.
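If the embedded-term design is kept, MongoDB's positional projection (`terms.$`) returns only the first array element that matched the query, so the weight arrives with each post and no scan of the full term list is needed. The sketch below only builds the filter and projection dictionaries and reads the weight from a result document; `matched_term_query` and `weight_from` are hypothetical helpers, and running the query requires a raw driver collection (for example via pymongo), which is assumed and not shown.

```python
def matched_term_query(term):
    """Return (filter, projection) that fetch only the matching embedded term."""
    filter_ = {"terms.t": term}
    # "terms.$" limits the returned array to the first element matching the filter.
    projection = {"terms.$": 1}
    return filter_, projection


def weight_from(document):
    """Read the weight from a document returned with the projection above."""
    return document["terms"][0]["weight"]


filter_, projection = matched_term_query("lucene")
# Shape of a document as such a query would return it (sample data).
sample = {"_id": 1, "terms": [{"t": "lucene", "weight": 0.7}]}
weight = weight_from(sample)
```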

Sort a Range Query Using Zend Lucene

According to the documentation, Zend Lucene is supposed to sort lexicographically. I am finding this is not the case. If I have a query 'avg:[050 TO 300]', yes it will return all values in that range, but it will sort them according to the document id, not the value.
I have found that the find() function can accept additional parameters, allowing me to sort by a specific column (eg $hits = $index->find($query, 'avg', SORT_NUMERIC, SORT_ASC);). However, I am creating $query dynamically and do not want to sort every search by 'avg'.
How do I force Lucene to sort the results automatically, lexicographically, when I do a range search? And if that's not possible, how do I dynamically add a sort field to the find function?
Why don't you sort $hits yourself after getting the result from $index->find(...)? This is admittedly a workaround and will be time-consuming for very large result sets, but I guess it is the easiest way in most cases.
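Sorting after the fact, with the sort field chosen at run time, can be sketched as follows. This is Python rather than PHP and not Zend code: `sort_hits` is a hypothetical helper and hits are modelled as dictionaries. It also shows why zero-padded values such as '050' sort the same way lexicographically and numerically.

```python
def sort_hits(hits, field, numeric=False, descending=False):
    """Return `hits` ordered by `field`, compared as numbers or as strings."""
    if numeric:
        key = lambda hit: float(hit[field])
    else:
        key = lambda hit: str(hit[field])
    return sorted(hits, key=key, reverse=descending)


hits = [
    {"id": 1, "avg": "300"},
    {"id": 2, "avg": "050"},
    {"id": 3, "avg": "120"},
]
by_avg = sort_hits(hits, "avg")  # lexicographic, ascending
by_avg_desc = sort_hits(hits, "avg", numeric=True, descending=True)
```

Because the field name is an ordinary argument, the caller can pick it from whichever field the dynamically built range query targets, or skip sorting entirely for other queries.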