How to support tokenized and untokenized search at the same time - hibernate-search

I am trying to make Hibernate Search support both tokenized and untokenized search (pardon me if I use the wrong terms here). Here is an example.
I have a list of entities of the following type.
@Entity
@Indexed
@NormalizerDef(name = "lowercase",
    filters = {
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class)
    }
)
public class Deal {
    // other fields omitted for brevity
    @Field(store = Store.YES)
    @Field(name = "name_Sort", store = Store.YES, normalizer = @Normalizer(definition = "lowercase"))
    @SortableField(forField = "name_Sort")
    @Column(name = "NAME")
    private String name = "New Deal";
    // getters/setters omitted here
}
I also used the keyword method to build the query, as shown below. The getSearchableFields method returns a list of searchable fields. In this example, "name" will be in the returned list because the name field in Deal is searchable.
protected Query inputFilterBuilder() {
    return queryBuilder.keyword()
        .wildcard().onFields(getSearchableFields())
        .matching("*" + searchRequest.getQuery().toLowerCase() + "*").createQuery();
}
This setup works fine when I search with a single whole word. For example, suppose I have two Deal entities, one named "Practical Concrete Hat" and the other named "Practical Cotton Cheese". When searching for "Practical", I get both entities back. But when searching for "Practical Co", I get no entities back. The reason is that the name field is tokenized and "Practical Co" is not a single keyword in the index.
My question is how to support both kinds of search at the same time, so that these two entities are returned when searching for either "Practical" or "Practical Co".
I read through the official Hibernate Search documentation, and my hunch is that I should add one more field for untokenized search. Perhaps the way I construct the query needs to be updated as well?
Update
Not working solution using SimpleQueryString.
Based on the provided answer, I've written the following query builder logic. However, it doesn't work.
protected Query inputFilterBuilder() {
    String[] searchableFields = getSearchableFields();
    if (searchableFields.length == 0) {
        return queryBuilder.simpleQueryString().onField("").matching("").createQuery();
    }
    SimpleQueryStringMatchingContext simpleQueryStringMatchingContext = queryBuilder.simpleQueryString().onField(searchableFields[0]);
    for (int i = 1; i < searchableFields.length; i++) {
        simpleQueryStringMatchingContext = simpleQueryStringMatchingContext.andField(searchableFields[i]);
    }
    return simpleQueryStringMatchingContext
        .matching("\"" + searchRequest.getQuery() + "\"").createQuery();
}
Working solution using separate analyzer for query and phrase queries.
I found from the official documentation that we can use phrase queries to search for more than one word. So I wrote the following query builder method.
protected Query inputFilterBuilder() {
    String[] searchableFields = getSearchableFields();
    if (searchableFields.length == 0) {
        return queryBuilder.phrase().onField("").sentence("").createQuery();
    }
    PhraseMatchingContext phraseMatchingContext = queryBuilder.phrase().onField(searchableFields[0]);
    for (int i = 1; i < searchableFields.length; i++) {
        phraseMatchingContext = phraseMatchingContext.andField(searchableFields[i]);
    }
    return phraseMatchingContext.sentence(searchRequest.getQuery()).createQuery();
}
On its own, this does not work for searches containing more than one word separated by a space. Once I added separate analyzers for indexing and querying as suggested, it suddenly worked.
Analyzer definitions:
@AnalyzerDef(name = "edgeNgram", tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = EdgeNGramFilterFactory.class,
            params = {
                @Parameter(name = "minGramSize", value = "1"),
                @Parameter(name = "maxGramSize", value = "10")
            })
    })
@AnalyzerDef(name = "edgeNGram_query", tokenizer = @TokenizerDef(factory = WhitespaceTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = ASCIIFoldingFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class)
    })
Annotation for Deal name field:
@Field(store = Store.YES, analyzer = @Analyzer(definition = "edgeNgram"))
@Field(name = "edgeNGram_query", store = Store.YES, analyzer = @Analyzer(definition = "edgeNGram_query"))
@Field(name = "name_Sort", store = Store.YES, normalizer = @Normalizer(definition = "lowercase"))
@SortableField(forField = "name_Sort")
@Column(name = "NAME")
private String name = "New Deal";
Code that overrides the name field's analyzer to use the query analyzer:
String[] searchableFields = getSearchableFields();
if (searchableFields.length > 0) {
    EntityContext entityContext = fullTextEntityManager.getSearchFactory()
        .buildQueryBuilder().forEntity(this.getClass().getAnnotation(SearchType.class).clazz()).overridesForField(searchableFields[0], "edgeNGram_query");
    for (int i = 1; i < searchableFields.length; i++) {
        entityContext.overridesForField(searchableFields[i], "edgeNGram_query");
    }
    queryBuilder = entityContext.get();
}
Follow up question
Why does the above tweak actually work?

Your problem here is the wildcard query. Wildcard queries do not support tokenization: they only work on single tokens. In fact, they don't even support normalization, which is why you had to lowercase the user input yourself...
The solution is not to mix tokenized and untokenized search (that's possible, but wouldn't really solve your problem). The solution is to forget about wildcard queries altogether and use an edge n-gram filter in your analyzer.
See this answer for an extended explanation.
If you use the Elasticsearch integration, you will have to rely on a hack to make the "query-only" analyzer work properly. See here.
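To illustrate why the edge n-gram setup matches "Practical Co", here is a minimal plain-Java sketch. It is not Hibernate Search or Lucene code: the class and method names are hypothetical, ASCII folding is left out, and it ignores term positions, which a real phrase query also checks. It only models the idea that the index stores every prefix of each word while the query side keeps whole lowercase words.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

public class EdgeNGramSketch {

    // Split on whitespace and lowercase: a stand-in for the query-time analyzer.
    public static List<String> queryTokens(String text) {
        List<String> tokens = new ArrayList<>();
        for (String word : text.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                tokens.add(word.toLowerCase(Locale.ROOT));
            }
        }
        return tokens;
    }

    // Same tokenization, then each token is expanded into its prefixes
    // (edge n-grams): a stand-in for the index-time analyzer.
    public static List<String> indexTokens(String text, int minGram, int maxGram) {
        List<String> grams = new ArrayList<>();
        for (String token : queryTokens(text)) {
            int longest = Math.min(maxGram, token.length());
            for (int len = minGram; len <= longest; len++) {
                grams.add(token.substring(0, len));
            }
        }
        return grams;
    }

    // Simplified matching: every query token must equal some indexed token.
    public static boolean matches(String indexedText, String query) {
        return indexTokens(indexedText, 1, 10).containsAll(queryTokens(query));
    }

    public static void main(String[] args) {
        System.out.println(indexTokens("Cotton", 1, 10)); // [c, co, cot, cott, cotto, cotton]
        System.out.println(matches("Practical Concrete Hat", "Practical Co"));  // true
        System.out.println(matches("Practical Cotton Cheese", "Practical Co")); // true
        System.out.println(matches("Practical Concrete Hat", "Practical Cot")); // false
    }
}
```

In this model the query "Practical Co" becomes the tokens "practical" and "co", and both already exist in the index as prefixes of "practical" and of "concrete" or "cotton". If the query were also run through the edge n-gram analyzer, it would be expanded into "p", "pr", "c", and so on, which is why a separate query analyzer without the n-gram filter is used.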

Related

Build dynamic LINQ queries from a string - Use Reflection?

I have some Word templates (maybe thousands). Each template has merge fields which will be filled from the database. I don't like writing separate code for every template and then building and deploying the application whenever a template is changed or a field is added to a template!
Instead, I'm trying to define all merge fields in a separate xml file and for each field I want to write the "query" which will be called when needed. EX:
mergefield1 will call query "Case.Parties.FirstOrDefault.NameEn"
mergefield2 will call query "Case.CaseNumber"
mergefield3 will call query "Case.Documents.FirstOrDefault.DocumentContent.DocumentType"
Etc,
So, for a particular template I scan its merge fields, and for each merge field I take its "query definition" and make that request to the database using Entity Framework and LINQ. For example, it works for these queries: "TimeSlots.FirstOrDefault.StartDateTime" or
"Case.CaseNumber"
This will be an engine that generates Word documents and fills in their merge fields as defined in the XML file. In addition, it will work for any new template or new merge field.
So far, I have a working version that uses reflection.
public string GetColumnValueByObjectByName(Expression<Func<TEntity, bool>> filter = null, string objectName = "", string dllName = "", string objectID = "", string propertyName = "")
{
    string objectDllName = objectName + ", " + dllName;
    Type type = Type.GetType(objectDllName);
    Guid oID = new Guid(objectID);
    dynamic Entity = context.Set(type).Find(oID); // get the object by type and ObjectID
    string value = ""; // the value that will be filled with data from the database
    // get the names of all LINQ methods as a list of strings
    IEnumerable<string> linqMethods = typeof(System.Linq.Enumerable).GetMethods(BindingFlags.Static | BindingFlags.Public).Select(s => s.Name).ToList();
    if (propertyName.Contains('.'))
    {
        string[] properties = propertyName.Split('.');
        dynamic object1 = Entity;
        IEnumerable<dynamic> Child = new List<dynamic>();
        for (int i = 0; i < properties.Length; i++)
        {
            if (i < properties.Length - 1 && linqMethods.Contains(properties[i + 1]))
            {
                Child = type.GetProperty(properties[i]).GetValue(object1, null);
            }
            else if (linqMethods.Contains(properties[i]))
            {
                // for now this works only with FirstOrDefault; later it will be changed to work with ToList or other LINQ methods
                object1 = Child.Cast<object>().FirstOrDefault();
                type = object1.GetType();
            }
            else
            {
                if (linqMethods.Contains(properties[i]))
                {
                    object1 = type.GetProperty(properties[i + 1]).GetValue(object1, null);
                }
                else
                {
                    object1 = type.GetProperty(properties[i]).GetValue(object1, null);
                }
                type = object1.GetType();
            }
        }
        value = object1.ToString(); //.StartDateTime.ToString();
    }
    return value;
}
I'm not sure if this is the best approach. Does anyone have a better suggestion, or maybe someone has already done something like this?
To shorten it: The idea is to make generic linq queries to database from a string like: "Case.Parties.FirstOrDefault.NameEn".
Your approach is very good. I have no doubt that it already works.
Another approach is to use expression trees, as @Egorikas has suggested.
Disclaimer: I'm the owner of the project Eval-Expression.NET
In short, this library allows you to evaluate almost any C# code at runtime (which is exactly what you want to do).
I would suggest you use my library instead, to make the code:
More readable
Easier to support
More flexible
Example
public string GetColumnValueByObjectByName(Expression<Func<TEntity, bool>> filter = null, string objectName = "", string dllName = "", string objectID = "", string propertyName = "")
{
    string objectDllName = objectName + ", " + dllName;
    Type type = Type.GetType(objectDllName);
    Guid oID = new Guid(objectID);
    object entity = context.Set(type).Find(oID); // get the object by type and ObjectID
    // evaluate the property path (e.g. "Parties.FirstOrDefault().NameEn") against the entity
    var value = Eval.Execute("x." + propertyName, new { x = entity });
    return value.ToString();
}
The library also allows you to use dynamic strings with IQueryable.
Wiki: LINQ-Dynamic

Hibernate Search programmatic API HTMLStripCharFilterFactory

I want to set up Hibernate Search (5.5.1.Final) using the programmatic API.
With annotations I write
@AnalyzerDefs({
    @AnalyzerDef(name = "el",
        charFilters = {@CharFilterDef(factory = HTMLStripCharFilterFactory.class)},
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
            @TokenFilterDef(factory = StandardFilterFactory.class),
            @TokenFilterDef(factory = GreekLowerCaseFilterFactory.class),
            @TokenFilterDef(factory = StopFilterFactory.class,
                params = {@Parameter(name = "words", value = "stopwords-gr.txt")}),
            @TokenFilterDef(factory = EdgeNGramFilterFactory.class,
                params = {@Parameter(name = "minGramSize", value = "3"), @Parameter(name = "maxGramSize", value = "15"), @Parameter(name = "side", value = "front")})
        }
    )
})
With the programmatic API I write
SearchMapping mapping = new SearchMapping();
mapping.analyzerDef("el", StandardTokenizerFactory.class)
    .filter(StandardFilterFactory.class)
    .filter(GreekLowerCaseFilterFactory.class)
    .filter(StopFilterFactory.class)
        .param("words", "stopwords-gr.txt") // mirrors the "words" parameter of the annotation version
    .filter(EdgeNGramFilterFactory.class)
        .param("minGramSize", "3")
        .param("maxGramSize", "15")
        .param("side", "front");
But I cannot figure out how to use the HTMLStripCharFilterFactory.
The short answer is that you cannot. When the charFilters option was introduced as part of HSEARCH-477, adding it to the programmatic API was overlooked, so the functionality does not exist there yet. I created HSEARCH-2199 as a feature request to add it.

JPA Custom Query

I need your help. Basically I want to create a custom query for a view I made that contains most of the data needed by the client. The tricky part here is that the client can specify which columns to include in the search. A sample query would be like:
SELECT distinct s.empno FROM SesdbAllView s
WHERE s.lastname IN :lname AND s.examTaken IN :exam AND
s.training IN :train AND s.trainingFrom BETWEEN :from AND :to AND
s.eligibility IN :elig AND s.profession IN :prof
So I tried translating this to the Criteria API, but I am still stuck on how to do it, especially for the BETWEEN conditions (one checks a Date range and another an Integer range). I'm also not sure whether I handled the IN conditions correctly.
My current code is:
CriteriaBuilder cb = em.getCriteriaBuilder();
CriteriaQuery<Tuple> cq = cb.createTupleQuery();
Root<SesdbAllView> r = cq.from(SesdbAllView.class);
Predicate p = cb.conjunction();
for (Map.Entry<String, Object> param : parameters.entrySet()) {
    if (param.getValue() instanceof List) {
        // IN condition
        Expression<String> exp = r.get(param.getKey());
        p = cb.and(p, exp.in((List<String>) param.getValue()));
    } else if (param.getValue() instanceof DateFromTo) {
        // BETWEEN on a date range, combined with the predicate built so far instead of replacing it
        DateFromTo fromTo = (DateFromTo) param.getValue();
        p = cb.and(p, cb.between(r.get(param.getKey()).as(Date.class), fromTo.getFrom(), fromTo.getTo()));
    } else if (param.getValue() instanceof IntegerFromTo) {
        // BETWEEN on an integer range, combined the same way
        IntegerFromTo fromTo = (IntegerFromTo) param.getValue();
        p = cb.and(p, cb.between(r.get(param.getKey()).as(Integer.class), fromTo.getFrom(), fromTo.getTo()));
    } else {
        p = cb.and(p, cb.equal(r.get(param.getKey()), param.getValue()));
    }
}
cq.distinct(true);
cq.multiselect(r.get("empNo"))
    .where(p);
List<Tuple> result = em.createQuery(cq).getResultList();
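The loop above is meant to fold every parameter into one AND-ed predicate. As a rough illustration of that accumulation pattern, here is a plain-Java sketch that uses java.util.function.Predicate instead of the Criteria API. The Row class, its fields, and the build method are hypothetical stand-ins, not part of JPA. The point it shows is that each branch, including a range check, has to be combined with the predicate built so far; assigning a new condition to the variable directly would silently drop the earlier ones.

```java
import java.util.List;
import java.util.function.Predicate;

public class ConjunctionSketch {

    // Hypothetical row type standing in for the view entity.
    public static final class Row {
        final String lastname;
        final int score;

        public Row(String lastname, int score) {
            this.lastname = lastname;
            this.score = score;
        }
    }

    // Builds "lastname IN (...) AND score BETWEEN from AND to" step by step.
    public static Predicate<Row> build(List<String> lastnames, int from, int to) {
        Predicate<Row> p = row -> true;                            // plays the role of cb.conjunction()
        p = p.and(row -> lastnames.contains(row.lastname));        // IN-style condition
        p = p.and(row -> row.score >= from && row.score <= to);    // inclusive range, like BETWEEN
        return p;
    }

    public static void main(String[] args) {
        Predicate<Row> p = build(List.of("Cruz", "Reyes"), 70, 90);
        System.out.println(p.test(new Row("Cruz", 85)));   // true
        System.out.println(p.test(new Row("Santos", 85))); // false: name not in the list
        System.out.println(p.test(new Row("Cruz", 95)));   // false: score outside the range
    }
}
```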

Extending TokenStream

I am trying to index into a document a field with one term that has a payload.
Since the only constructor of Field that can work for me takes a TokenStream, I decided to inherit from this class and give the most basic implementation for what I need:
public class MyTokenStream : TokenStream
{
    TermAttribute termAtt;
    PayloadAttribute payloadAtt;
    bool moreTokens = true;

    public MyTokenStream()
    {
        termAtt = (TermAttribute)GetAttribute(typeof(TermAttribute));
        payloadAtt = (PayloadAttribute)GetAttribute(typeof(PayloadAttribute));
    }

    public override bool IncrementToken()
    {
        if (moreTokens)
        {
            termAtt.SetTermBuffer("my_val");
            payloadAtt.SetPayload(new Payload(/*byte[] data*/));
            moreTokens = false;
        }
        return false;
    }
}
The code which was used while indexing:
IndexWriter writer = //init index writer...
Document d = new Document();
d.Add(new Field("field_name", new MyTokenStream()));
writer.AddDocument(d);
writer.Commit();
And the code that was used during the search:
IndexSearcher searcher = //init index searcher
Query query = new TermQuery(new Term("field_name", "my_val"));
TopDocs result = searcher.Search(query, null, 10);
I used the debugger to verify that call to IncrementToken() actually sets the TermBuffer.
My problem is that the returned TopDocs instance contains no documents, and I can't understand why... Actually I started with TermPositions (which gives me access to the payload...), but it also gave me no results.
Can someone explain to me what I am doing wrong?
I am currently using Lucene .NET 2.9.2
After you set the TermBuffer, you need to return true from IncrementToken. Return false only when you have nothing left to feed the TermBuffer with.
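The contract described in this answer can be modelled in a few lines of plain Java. This is only a sketch: SimpleTokenStream, SingleToken, and consume are hypothetical names and not the Lucene or Lucene.NET API. It assumes the consumer drains the stream by calling incrementToken in a loop and stops at the first false, so a stream that returns false on its first call contributes no terms at all.

```java
import java.util.ArrayList;
import java.util.List;

public class SingleTokenStreamSketch {

    // Hypothetical stand-in for a token stream; not the Lucene API.
    public interface SimpleTokenStream {
        boolean incrementToken(); // true: a token was produced, false: end of stream
        String term();            // the term set by the last successful call
    }

    // Emits exactly one token, then reports end of stream.
    public static final class SingleToken implements SimpleTokenStream {
        private final String value;
        private String current;
        private boolean moreTokens = true;

        public SingleToken(String value) {
            this.value = value;
        }

        @Override
        public boolean incrementToken() {
            if (moreTokens) {
                current = value;
                moreTokens = false;
                return true;  // a token is available, so the consumer must see true
            }
            return false;     // nothing left to emit
        }

        @Override
        public String term() {
            return current;
        }
    }

    // Mimics how an indexer would drain a stream: it stops at the first false.
    public static List<String> consume(SimpleTokenStream stream) {
        List<String> terms = new ArrayList<>();
        while (stream.incrementToken()) {
            terms.add(stream.term());
        }
        return terms;
    }

    public static void main(String[] args) {
        System.out.println(consume(new SingleToken("my_val"))); // [my_val]
    }
}
```

With the version from the question, which sets the term but always returns false, the loop body in consume would never run and the field would be indexed without any term, which is consistent with the empty search result.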

Linq to Entities, insert foreign keys

I am using the ADO Entity Framework for the first time and am not sure of the best way to insert database records that contain foreign keys.
This is the code that I am using. I would appreciate any comments and suggestions on it.
using (KnowledgeShareEntities entities = new KnowledgeShareEntities())
{
    Questions question = new Questions();
    question.que_title = questionTitle;
    question.que_question_text = questionText;
    question.que_number_of_views = 0;
    question.que_is_anonymous = isAnonymous;
    question.que_last_activity_datetime = DateTime.Now;
    question.que_timestamp = DateTime.Now;
    question.CategoriesReference.Value = Categories.CreateCategories(categoryId);
    question.UsersReference.Value = Users.CreateUsers(userId);
    entities.AddToQuestions(question);
    entities.SaveChanges();
    return question.que_id;
}
You should use something like
question.UsersReference.EntityKey = new EntityKey("MyEntities.Users", "ID", userId);
You don't need a User object to set up the foreign key; just use the ID.