Normalize unicode - unicode

Let's say I have a document indexed with Apache Solr that contains this string:
Klüft skräms inför
I want to be able to find it with search using this keyword (note the "u"-"ü"):
kluft
Is there a way to do this?

Use the ASCIIFoldingFilterFactory for both the index and query analyzers.
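To see what this kind of folding does to the example string, here is a minimal Python sketch of the idea. It is not Solr's implementation: the function name fold_to_ascii is hypothetical, and it approximates folding by decomposing characters and dropping combining marks. Lowercasing is included only for brevity; in Solr it is normally a separate lowercase filter in the analyzer chain.

```python
import unicodedata

def fold_to_ascii(text):
    """Approximate ASCII folding (hypothetical helper, not Solr's code)."""
    # NFKD splits "ü" into "u" followed by a combining diaeresis.
    decomposed = unicodedata.normalize("NFKD", text)
    # Drop the combining marks, keeping only the base characters.
    stripped = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Lowercase so that "Klüft" and "kluft" compare equal.
    return stripped.lower()

indexed = fold_to_ascii("Klüft skräms inför")
query = fold_to_ascii("kluft")
print(query in indexed.split())
```

Because the same folding is applied to both the indexed text and the query, the two sides end up as the same token. Characters that have no decomposition (such as "ø") pass through this sketch unchanged.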

Related

mongodb index on regex fields not working

I'm new to MongoDB and I'm facing a performance issue that I need help with. I have a collection with 400k records. Without an index on any field, each query takes 20-30s, so I created indexes on the fields that are usually used in search queries. The problem is that when I use $regex to search a string field that has an index, MongoDB does not use the index and still scans all records in the collection. I searched the internet for "index on regex fields mongodb" and found some answers saying that "MongoDB use prefix of RegEx to lookup indexes", which means you have to use the "^" prefix for the index to work, as in "db.users.find({name: /^key word/})". That is not working for me. Does an index on a $regex field need MongoDB Atlas to work? I'm using the community version of MongoDB. Thanks!
There's a lot to unpack here. We'll split the answer into two parts, the first to try and answer some of the direct questions about index usage and the second to explore solutions to satisfy the application requirements.
Index Usage with $regex
As is true with an index in any database that captures the full string value as the key, MongoDB can use the index for a $regex operation but its efficiency in doing so greatly depends on the regex being applied. That is what the Index Use documentation from the comments and the other answers you reference are describing.
In the comments you mention that an example query might be db.users.find({name: {$regex: '.*keyword.*', $options: 'i'}}). That means that the regex is both unanchored and case-insensitive. The aforementioned documentation states directly:
Case insensitive regular expression queries generally cannot use indexes effectively.
Why is this? Because the substring that you are searching for can be found in any string value captured by the index. So the document with matching value {name: 'a keyword'} might be located at one end of the index, {name: 'keyWord'} may be somewhere in the middle, and {name: 'Z keyword'} may be at the other end. The only way to ensure correct results is for the database to scan the index for all string values. So while it is still using the index, it may not be efficient, as most of the scanned values will not match and will be discarded.
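The difference can be illustrated with a small Python sketch that treats a sorted list as a stand-in for an index on name. This is only an analogy under that assumption, not how MongoDB stores or walks its indexes, and the function names are hypothetical.

```python
import bisect
import re

def prefix_lookup(sorted_keys, prefix):
    """Anchored, case-sensitive prefix: jump straight to one contiguous range."""
    lo = bisect.bisect_left(sorted_keys, prefix)
    # "\uffff" is used as an upper bound for keys starting with the prefix.
    hi = bisect.bisect_left(sorted_keys, prefix + "\uffff")
    return sorted_keys[lo:hi], hi - lo  # (matches, keys examined)

def substring_lookup(sorted_keys, needle):
    """Unanchored, case-insensitive: every key has to be examined."""
    pattern = re.compile(re.escape(needle), re.IGNORECASE)
    matches = [key for key in sorted_keys if pattern.search(key)]
    return matches, len(sorted_keys)  # (matches, keys examined)

keys = sorted(["a keyword", "keyWord", "Z keyword", "keyword list", "other"])
print(prefix_lookup(keys, "keyword"))
print(substring_lookup(keys, "keyword"))
```

The prefix lookup only touches the keys inside the matching range, while the substring lookup examines all five keys even though the matches are spread across the sorted order.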
You may always use .explain() to better understand how the database is answering the query, such as if and how it is using an index.
Solutions
So what do we do about this?
Well, as @rickhg12hs suggests in the comments, it depends on exactly what you are trying to achieve. You reiterate that you are looking for 'full regex search capability', but that is really an approach/solution rather than a goal. If what you really need, for example, is just to match an exact string in a case-insensitive manner, then something as simple as a case-insensitive index would likely do the trick.
However, if you truly do wish to perform arbitrary substring searching, then you are really looking at search engine capabilities. In that situation your best bets would probably be to emulate their indexes directly in MongoDB (e.g. have the application manually tokenize the strings to be indexed), stand up something like Solr/Elasticsearch next to MongoDB, or use MongoDB's Atlas Search offering. The $text operator mentioned in the comment has limitations when it comes to substring searching (such as just part of a word), which may or may not be relevant for your needs.
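As a rough idea of what "manually tokenize the strings" could look like for substring search, here is a Python sketch based on character n-grams. The choice of trigrams and the function names are assumptions for illustration; this is one possible tokenization, not a MongoDB feature.

```python
def ngrams(text, n=3):
    """Return the set of lowercase character n-grams of text."""
    normalized = text.lower()
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}

def may_contain(indexed_value, query, n=3):
    """Cheap candidate check: every n-gram of the query must be present."""
    return ngrams(query, n) <= ngrams(indexed_value, n)

def contains(indexed_value, query, n=3):
    """Confirm candidates with a real substring test to rule out false positives."""
    return may_contain(indexed_value, query, n) and query.lower() in indexed_value.lower()

print(sorted(ngrams("keyWord")))
print(contains("Z keyword", "KEYW"))
```

In an application, the n-grams would be stored alongside each document (for example in an array field that is indexed), so that the candidate check becomes an exact-match lookup on tokens rather than a regex scan.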

Searching a value in a specific field in MongoDB Full Text Search

(MongoDB Full Text Search)
Hello,
I have put some fields in index and this is how I could search for a search_keyword.
BasicDBObject search = new BasicDBObject("$search", "search_keyword");
BasicDBObject textSearch = new BasicDBObject("$text", search);
DBCursor cursor = users.find(textSearch);
I don't want search_keyword to be searched in all the fields specified in the index. Is there any way to restrict the search for search_keyword to specific fields from the index?
If so, please give me some idea how to do it in Java.
Thank you.
If you want to index a single field and search it, that is how it works by default. Let's say you want to index the field companyName. When you perform a $text search on this collection, only the data from the companyName field will be used because you only included that field in your index.
Now the second scenario, your $text index includes more than one field. In this case you cannot limit the search to only look for values indexed from a specific field. The $text index is constructed on the collection level and a collection can have at most one $text index. Your option to limit search on specific field in this case may be to use regex instead.
MongoDB has the flexibility to fulfil requirements of other scenarios, but you can also evaluate using other technologies if your application is heavily search-driven and you are primarily after a full-text search engine for locating documents by keyword with a rich query syntax. ElasticSearch might be an alternative here. It really depends on the type of the application and your needs.

Can MongoDB do alphanumeric sort?

I need to do an alphanumeric sort on one of the fields in my collection. This field contains Chromosome info.
Chr1,
Chr2,
..,
..,
..,
Chr10,
Chr11,
..,
..,
ChrX,
ChrY
Instead of getting the values in this order, I get them in
Chr1,
Chr11
This is a common problem, and I was wondering whether MongoDB has a built-in solution for this sort or whether we should hack it as we normally do in Oracle.
If there is no built-in solution, what is the best way to get this sort? Natural sort is an acceptable solution, but I have already applied it to another field.
Any help would be greatly appreciated.
MongoDB has no built-in way of sorting data alphanumerically. But when all of your data has a field containing a string that follows the same convention of "Chr" + number, you could put the number into a different field, keep it as an integer instead of a string, and sort by that.
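A minimal Python sketch of that idea, assuming the labels are always "Chr" followed by either digits or a non-numeric suffix such as X or Y. The function name and the decision to place non-numeric labels after the numbered ones are assumptions made for illustration.

```python
import re

def chromosome_sort_key(label):
    """Split "Chr" + number into a sortable key (hypothetical helper)."""
    match = re.fullmatch(r"Chr(\d+)", label)
    if match:
        # Numbered chromosomes sort first, by integer value rather than as text.
        return (0, int(match.group(1)), "")
    # Anything else (ChrX, ChrY) sorts after the numbered ones, alphabetically.
    return (1, 0, label)

labels = ["Chr10", "Chr2", "ChrX", "Chr1", "ChrY", "Chr11"]
print(sorted(labels, key=chromosome_sort_key))
```

The integer part of the key is what would be stored in the separate field; sorting on it puts Chr2 before Chr10, which a plain string sort does not.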
Please refer to "https://docs.mongodb.com/manual/reference/collation".
Mongo provides collation for alphanumeric sorting.
Example:-
db.names.find({}).sort({username:1}).collation({locale:'en_US', numericOrdering:true})
If you are using aggregation in MongoDB, you can also use:
db.names.aggregate([...query, {$sort:{username:1}}], {collation: {locale:'en_US', numericOrdering:true}});
Don't forget to include the $sort stage for the key in the pipeline; otherwise the results will not be sorted.
This is not spelled out in the documentation.

SQL 'like' statement in mongodb [duplicate]

Is there a 'like' statement in MongoDB/mongoose, as there is in SQL?
The only reason for using it is to implement full-text searching.
MongoDB supports regular expressions.
There are two ways to use a regular expression with the Java API for MongoDB:
Expression 1:
Search for documents whose field matches a given pattern, for example values that start with a string:
import java.util.regex.Pattern;
import com.mongodb.BasicDBObject;
String strpattern = "your regular expression";
Pattern p = Pattern.compile(strpattern);
BasicDBObject obj = new BasicDBObject();
obj.put("key", p);
Example (to find every document whose name starts with 'a'):
String strpattern = "^a";
Pattern p = Pattern.compile(strpattern);
BasicDBObject obj = new BasicDBObject();
obj.put("name", p);
Expression 2:
To search for any of several words in one field, you can use the following:
String pattern = "\\b(one|how many)\\b";
BasicDBObject obj = new BasicDBObject();
// CASE_INSENSITIVE makes the match ignore case; omit the flag for a case-sensitive match
obj.put("message", Pattern.compile(pattern, Pattern.CASE_INSENSITIVE));
Although you can use regular expressions, you won't get good performance on full text search because a contains query such as /text/ cannot use an index.
A begins-with query such as /^text/ can, however, use an index.
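To connect this back to SQL's 'like', here is a small Python sketch that translates a LIKE pattern into a regular expression and reports whether it starts with literal text (the begins-with case described above). The helper names are hypothetical, and the translation only covers the % and _ wildcards, without escape characters.

```python
import re

def like_to_regex(pattern):
    """Translate a SQL LIKE pattern into an anchored regex string."""
    parts = []
    for ch in pattern:
        if ch == "%":
            parts.append(".*")   # % matches any run of characters
        elif ch == "_":
            parts.append(".")    # _ matches exactly one character
        else:
            parts.append(re.escape(ch))  # everything else is literal
    return "^" + "".join(parts) + "$"

def has_literal_prefix(pattern):
    """True when the LIKE pattern begins with literal text, not a wildcard."""
    return bool(pattern) and pattern[0] not in "%_"

for like in ("text%", "%text%"):
    print(like, like_to_regex(like), has_literal_prefix(like))
```

A pattern such as 'text%' keeps a literal prefix and corresponds to the /^text/ case, while '%text%' starts with a wildcard and corresponds to the contains case.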
If you are considering full text search on any large scale please consider MongoDB's Multi Key Search feature.
Update
As of MongoDB v3.2 you can also use a text index.
Consider making a script at application level that transforms your data to tokens (words). Then treat tokens as tags, build an index on those tokens, and then search the tokens like searching for tags. This is like creating an inverted index.
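A minimal Python sketch of such a script, assuming whitespace/punctuation-delimited words are good enough as tokens. The function names and the in-memory dictionary are illustrative only; in practice the tokens would be stored with each document and indexed by the database.

```python
import re
from collections import defaultdict

def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def build_inverted_index(documents):
    """Map each token to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for token in tokenize(text):
            index[token].add(doc_id)
    return index

def search(index, query):
    """Return ids of documents containing every token of the query."""
    tokens = tokenize(query)
    if not tokens:
        return set()
    result = set(index.get(tokens[0], set()))
    for token in tokens[1:]:
        result &= index.get(token, set())
    return result

docs = {1: "How many apples?", 2: "One apple, many pears", 3: "No fruit"}
index = build_inverted_index(docs)
print(search(index, "many"))
```

Each lookup is an exact match on a token, which is the kind of query an ordinary index handles well. Note that this matches whole words only, so "apple" does not find "apples" without extra stemming or prefix handling.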
For way better search capabilities on text consider using Lucene instead of MongoDB.
Use regex: /^textforsearch$/i. Just don't forget to say goodbye to performance :).
Simulating 'like' with a regex search will not hit a custom-defined index. Only "starts with" searches are supported for now.

MongoDB query array containing search text

I have the following query (MongoMapper/Rails):
Card.where(:card_tags => {:$all => search_tags})
Where card_tags is an array of string tags and search_tags is an array of the search strings. At the moment if someone searches 'snow', no results with tag 'snowboarding' will be returned.
How can I modify this query to search whether any strings in card_tags contains any of the strings in search_tags? Regular expressions come to mind but not sure of the syntax given these are arrays...
Thanks
You can use regular expressions but you will be doing full collection scans - this is going to be bad for performance.
You can use a regex with an index only if you do "starts with" type searches, but I doubt you want to limit yourself to that.
For fulltext searching, you are better off using some external search service for this - like Lucene, ElasticSearch, or Solr.
Refer to this post too: like query in mongoDB