Is it possible to make lucene.net ignore case of the field for queries?

I have some documents indexed with the field "GuidId" and others with "guidid". How can I make Lucene.NET ignore the case, so that the following query searches regardless of the field name's case?
TermQuery termQuery = new TermQuery(new Term("GuidId", guidId.ToString()));
I don't want to write another query for the documents with the field "guidid", i.e. lowercase.

Ideally, don't have field names with funky cases. If you are defining field names dynamically or some such, then you should lowercase them before adding them to the index. That done, it should be easy enough to keep the query fields' names lowercase as well, and you're in good shape.
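For instance, a minimal sketch of that normalization (rawFieldName and doc are hypothetical names; Lucene.NET 3.x API):
// Normalize the field name once at index time...
string fieldName = rawFieldName.ToLowerInvariant();
doc.Add(new Field(fieldName, guidId.ToString(), Field.Store.YES, Field.Index.NOT_ANALYZED));
// ...and apply the same rule at query time:
var termQuery = new TermQuery(new Term(fieldName, guidId.ToString()));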
If, for whatever reason, you are stuck with this case-sensitive data, you'll be stuck expanding your queries to search all the known permutations of the field name to get all your results. Something like:
DisjunctionMaxQuery finalQuery = new DisjunctionMaxQuery(0f);
finalQuery.Add(new TermQuery(new Term("GuidId", guidId.ToString())));
finalQuery.Add(new TermQuery(new Term("guidid", guidId.ToString())));
DisjunctionMaxQuery would probably be a good choice here, since it only takes the maximum-scoring hit among its query collection, rather than possibly compounding scores across multiple hits.
You can also use MultiFieldQueryParser to similar effect. I don't believe it uses DisjunctionMax, but that doesn't sound like it would be a big deal in this case.
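A hedged sketch of that MultiFieldQueryParser route (Lucene.NET 3.x API; the KeywordAnalyzer choice is an assumption, to keep the GUID from being tokenized on hyphens):
// Parse the term against both field-name spellings at once.
var fields = new[] { "GuidId", "guidid" };
var parser = new MultiFieldQueryParser(Lucene.Net.Util.Version.LUCENE_30, fields, new KeywordAnalyzer());
Query query = parser.Parse(QueryParser.Escape(guidId.ToString()));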

Related

Firestore 1 global index vs 1 index per query what is better?

I'm working on my app and I just ran into a dilemma about the best way to handle indexes for Firestore.
I have a query that searches for publications in a specific community that contain at least one of the tags and fall within a geohash range. The index for that query looks like this:
community Ascending, tag Ascending, location.geohash Ascending
Now if my user doesn't need to filter by tag, I run the query without the arrayContains(tag) clause, which prompts me to create another index:
community Ascending, location.geohash Ascending
My question is: is it better to create that second index, or to just use the first one and specify all possible tags in arrayContains when the user wants no filter on tags?
Neither is inherently better; it's a typical space vs. time tradeoff.
Adding the extra tags to the query adds some overhead there, but it saves you the (storage) cost of the additional index. So you're trading a small amount of runtime performance for a small amount of space/cost savings.
One thing to check is whether the query with tags can actually run on just the second index, as Firestore may be able to do a zigzag merge join. In that case you could keep only the second, smaller index and save the runtime cost of the extra clauses, at the price of a (similarly small) performance difference on the query where you do specify one or more tags.
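For concreteness, the two query shapes being compared might look like this (a hypothetical sketch with the Firebase v9 web SDK; the collection name "publications" and the variable names are assumptions based on the question):
import { collection, query, where } from "firebase/firestore";

// Needs the composite index: community ASC, tag ASC, location.geohash ASC
const withTag = query(
  collection(db, "publications"),
  where("community", "==", communityId),
  where("tag", "array-contains", tag),
  where("location.geohash", ">=", rangeStart),
  where("location.geohash", "<=", rangeEnd)
);

// Without the tag clause, this needs the second index:
// community ASC, location.geohash ASC
const withoutTag = query(
  collection(db, "publications"),
  where("community", "==", communityId),
  where("location.geohash", ">=", rangeStart),
  where("location.geohash", "<=", rangeEnd)
);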

Extensive filtering

Example:
{
shortName: "KITT",
longName: "Knight Industries Two Thousand",
fromZeroToSixty: 2,
year: 1982,
manufacturer: "Pontiac",
/* 25 more fields */
}
I need the ability to query by at least 20 fields, which means that only 10 fields are left unindexed.
There are 3 fields (all numbers) that could be used for sorting (both ways).
This leaves me wondering how sites with lots of searchable fields do it: e.g. real estate or car sale sites where you can filter by every small detail and choose between several sort options.
How could I pull this off with MongoDB? How should I index that kind of collection?
I'm aware that there are databases made specifically for searching, but there must be general rules of thumb for doing this (even if less performantly) in every db. I'm sure not everybody uses Elasticsearch or similar.
---
Optional reading:
My reasoning is that the index could be huge, but index order matters. You'd always make sure that the fields that return the fewest results come first and the most generic fields come last in the index. However, what if the user picks only generic fields? Should I include the non-generic fields in the query anyway? How do I handle sorting in both directions? Or does index intersection save the day, meaning I should just add 20 different indexes?
A text index is your friend.
Read up on it here: https://docs.mongodb.com/v3.2/core/index-text/
In short, it's a way to tell MongoDB that you want full text search over a specific field, multiple fields, or all fields (yay!)
To allow text indexing of all fields, use the special symbol $** and define it as type 'text':
db.collection.createIndex( { "$**": "text" } )
You can also configure it with case insensitivity or diacritic insensitivity, and more.
To perform text searches using the index, use the $text query helper, see: https://docs.mongodb.com/v3.2/reference/operator/query/text/#op._S_text
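For example, a minimal relevance-sorted search against that index might look like this (the search terms are just an illustration):
db.collection.find(
  { $text: { $search: "knight industries" } },
  { score: { $meta: "textScore" } }
).sort({ score: { $meta: "textScore" } })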
Update:
In order to allow the user to select specific fields to search on, it's possible to use weights when creating the text index: https://docs.mongodb.com/v3.2/core/index-text/#specify-weights
If you carefully select your fields' weights, for example using distinct prime numbers, and then add the $meta textScore to your results, you may be able to infer from the "textScore" which field was matched by the query, and so filter out results that didn't get a hit on a selected search field.
Read more here: https://docs.mongodb.com/v3.2/tutorial/control-results-of-text-search/
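A hypothetical sketch of that prime-weight trick, using field names from the example document above (the weights are arbitrary distinct primes):
db.collection.createIndex(
  { shortName: "text", longName: "text", manufacturer: "text" },
  { weights: { shortName: 2, longName: 3, manufacturer: 5 } }
)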

MongoDB fulltext search + workaround for partial word match

Since it is not possible to find "blueberry" by searching for the word "blue" with a MongoDB full text search, I want to help my users complete the word "blue" to "blueberry". To do so, is it possible to query all the words in a MongoDB full text index, so that I can use the words as suggestions, e.g. for typeahead.js?
Language stemming in text search uses an algorithm to try to relate words derived from a common base (e.g. "running" should match "run"). This is different from the prefix match (e.g. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search I would suggest focusing on the prefetch support in typeahead:
Create a keywords collection which has the common words (perhaps with a usage frequency count) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on (a sketch follows below), and keep the word list up to date using a periodic incremental Map/Reduce as new documents are added.
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
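A hypothetical sketch of that first step; the "articles" collection and its "body" field are assumed names, and the word counts land in a "keywords" collection:
db.articles.mapReduce(
  function () {
    // Emit each lowercased word longer than 2 characters with a count of 1.
    (this.body || "").toLowerCase().split(/\W+/).forEach(function (word) {
      if (word.length > 2) { emit(word, 1); }
    });
  },
  function (key, values) { return Array.sum(values); },
  { out: "keywords" }
)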
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
    name: 'mysearch',
    prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
A simple workaround I am doing right now is to break the text into individual characters and store them as a text-indexed array.
Then when you do the $search query, you simply break the query up into characters again.
Please note that this only works for short strings (say, shorter than 32 characters); otherwise the index building process will take really long, and performance will drop significantly when inserting new records.
You cannot query for all the words in the index, but you can of course query the original document's fields. Also, the words in the search index are not always the full words; they are stemmed. So you probably wouldn't find "blueberry" in the index, but rather "blueberri".
Don't know if this might be useful to some new people facing this problem.
Depending on the size of your collection and how much RAM you have available, you can make a search with $regex by creating the proper index. E.g.:
db.collection.find({ query: { $regex: /querywords/ } }).sort({ criteria: -1 }).limit(limit)
You would need an index as follows:
db.collection.ensureIndex( { "query": 1, "criteria" : -1 } )
This could be really fast if you have enough memory.
Hope this helps.
For those who have not yet started implementing any database architecture and are here for a solution, go for Elasticsearch. It's a JSON document driven database, structurally similar to MongoDB. It has an "edge-ngram" analyzer which is really efficient and quick at producing "did you mean" suggestions for misspelled searches. You can also do partial-match searches.
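If you go that route, a hedged sketch of an edge-ngram analyzer in the index settings (the index and analyzer names are arbitrary, and the exact syntax varies a little between Elasticsearch versions):
PUT /cars
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "autocomplete_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": ["letter", "digit"]
        }
      },
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete_tokenizer",
          "filter": ["lowercase"]
        }
      }
    }
  }
}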

Mongo query for number of items in a sub collection

This seems like it should be very simple but I can't get it to work. I want to select all documents A where there are one or more B elements in a sub collection.
Like if a Store document had a collection of Employees, I just want to find Stores with 1 or more Employees in them.
I tried something like:
{Store.Employees:{$size:{$ne:0}}}
or
{Store.Employees:{$size:{$gt:0}}}
Just can't get it to work.
This isn't supported. You can basically only get documents in which the array size is equal to a given value; you can't do range searches on size.
What people normally do is cache the array length in a separate field in the same document. Then they index that field and can make very efficient queries.
Of course, this requires a little bit more work from you (don't forget to keep that length field current).
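A minimal sketch of that pattern, with assumed names ("stores", "employees", "employeeCount"): bump the counter whenever the array changes, index it, then query it.
// Keep the cached length in sync with the array:
db.stores.update(
  { _id: storeId },
  { $push: { employees: newEmployee }, $inc: { employeeCount: 1 } }
)
// Index the counter and query it efficiently:
db.stores.ensureIndex({ employeeCount: 1 })
db.stores.find({ employeeCount: { $gt: 0 } })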

The fastest way to show Documents with certain property first in MongoDB

I have collections with a huge number of documents on which I need to do custom searches with various different queries.
Each document has a boolean property. Let's call it "isInTop".
I need to show documents which have this property set first in all queries.
Yes, I can easily sort on this field like:
.sort( { isInTop: -1 } );
And create a proper index with the field "isInTop" as the last field in it. But this will work slowly, as indexes in Mongo work best with high-selectivity fields.
So is there a solution to show documents with the field "isInTop" at the top of each query's results?
I see two solutions here.
First: give the documents which need to be on top an _id from the "future". As you know, ObjectId contains a timestamp, so I can create an ObjectId with a timestamp from the future and use natural order.
Second: create a separate collection for the documents which need to be on top, and run queries against it first.
Are there any other solutions to this problem? Which will work faster?
UPDATE
I have solved this by sorting on a custom field which represents rank.
Using the _id field trick you mention has the problem that at some point in time you will reach that special time, and you can't change the _id field (without inserting a new document and removing the old one).
Creating a special collection which just holds the ones you care about is probably the best option. It gives you the ability to logically (and to some extent, physically) separate the documents.
Newly introduced in MongoDB, there is also support for a "sparse" index, which may fulfill your needs as well. You could set the "isInTop" field only when you want a document to be special, and then create a sparse index on it, which avoids the problems you would normally have with indexing a single boolean field (in B-trees).
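A hedged sketch of that sparse-index approach (the "items" collection name is an assumption; "isInTop" follows the question). Only the special documents carry the field at all, so the index stays small:
// Mark a document as special only when needed:
db.items.update({ _id: docId }, { $set: { isInTop: true } })
// The sparse index only contains documents that have the field:
db.items.ensureIndex({ isInTop: 1 }, { sparse: true })
// Serve the "top" documents first, then fall back to the rest:
db.items.find({ isInTop: true })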