Mongo: Is # char ignored in text index [duplicate] - mongodb

So I have a document in a collection with one of the fields having the value "###".
I indexed the collection and tried running the query:
db.getCollection('TestCollection').find({$text:{$search:"\"###\""}})
But it didn't return the document.
How can I work around this?
Sample Document:
{
"_id" : ObjectId("5b90dc6d3de8562a6ef7c409"),
"field" : "value",
"field2" : "###"
}

Text search is designed to index strings based on language heuristics. Text indexing involves two general steps: tokenizing (converting a string into individual terms of interest) followed by stemming (converting each term into a root form for indexing based on language-specific rules).
During the tokenizing step certain characters (for example, punctuation symbols such as #) are classified as word separators (aka delimiters) rather than text input and used to separate the original string into terms. Language-specific stop words (common words such as "the", "is", or "on" in English) are also excluded from a text index.
Since your search phrase of ### consists entirely of delimiters, there is no corresponding entry in the text index.
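To make the tokenizing step concrete, here is a minimal sketch in plain JavaScript. It is not MongoDB's actual tokenizer: the `tokenize` function, the delimiter rule, and the three-word stop list are simplifying assumptions used only to show why a string made entirely of delimiters produces no terms.

```javascript
// Simplified illustration only: MongoDB's real tokenizer and stemmer are
// language-aware and more involved than this.
const STOP_WORDS = new Set(["the", "is", "on"]); // tiny sample stop list (assumption)

function tokenize(text) {
  return text
    .toLowerCase()
    .split(/[^\p{L}\p{N}]+/u) // treat any run of non-letter, non-digit chars as a delimiter
    .filter((term) => term.length > 0 && !STOP_WORDS.has(term));
}

console.log(tokenize("The printer is on #fire")); // [ 'printer', 'fire' ]
console.log(tokenize("###"));                     // []
```

Because "###" yields an empty term list in a scheme like this, there is nothing for a text query to look up.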
If you want to match generic string patterns, you should use regular expressions rather than text search. For example: db.getCollection('TestCollection').find({field2:/###/}). However, please note the caveats on index usage for regular expressions.
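If the string to match comes from user input, it usually needs escaping before it is turned into a regular expression. The sketch below is plain JavaScript; `escapeRegExp` is a helper name chosen for this example, not a built-in.

```javascript
// Escape regex metacharacters so the input is matched literally.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

const pattern = new RegExp(escapeRegExp("a.b(c)"));
console.log(pattern.test("xx a.b(c) yy")); // true: literal match
console.log(pattern.test("aXb(c)"));       // false: "." no longer matches any character
console.log(escapeRegExp("###"));          // "###" ("#" is not a metacharacter)
```

The resulting RegExp object could then be used as the value in a query such as {field2: pattern}, with the same index-usage caveats as any other regular expression.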

Your query has too many curly braces; remove them:
db.getCollection('so2').find({$text:{$search:"\"###\""}})
If you run it, Mongo tells you you're missing a text index. Add it like this:
db.so2.createIndex( { field2: "text" } )
The value you're using is pretty small. Try using longer values.

Related

officejs : Search Word document using regular expression

I want to search strings like "number 1" or "number 152" or "number 36985".
In all above strings "number " will be constant but digits will change and can have any length.
I tried Search option using wildcard but it doesn't seem to work.
basic regEx operators like + seem to not work.
I tried 'number*[1-9]*' and 'number*[1-9]+' but no luck.
This regular expression only selects up to one digit, e.g. if the string is 'number 12345' it only matches 'number 1'.
Does anyone know how to do this?
Word doesn't use regular expressions in its search (Find) functionality. It has its own set of wildcard rules. These are very similar to RegEx, but not identical and not as powerful.
Using Word's wildcards, the search text below locates the examples given in the question. (Note that the semicolon separator in 1;100 may be something else, depending on the list separator set in Windows or on the Mac. My European locale uses a semicolon; the United States would use a comma, for example.)
"number [0-9]{1;100}"
The 100 is an arbitrary number I chose for the maximum number of repeats of the search term just before it. Depending on how long you expect a number to be, this can be much smaller...
The logic of the search text is: number is a literal; the valid range of characters following the literal are 0 through 9; there may be one to one hundred of these characters - anything in that range is a match.
The only way RegEx can be used in Word is to extract a string and run the search on the string. But this dissociates the string from the document, meaning Word-specific content (formatting, fields, etc.) will be lost.
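As a sketch of that extract-then-search approach: once the text is available as a plain string, an ordinary regular expression finds the numbers. The `extractedText` value below is a made-up sample, and how the text is pulled out of the document is deliberately left out.

```javascript
// `extractedText` stands in for text already pulled out of the document;
// obtaining it (for example through the Office.js API) is outside this sketch.
const extractedText = "See number 1, number 152 and number 36985 for details.";

// "number " followed by one or more digits
const matches = extractedText.match(/number [0-9]+/g) || [];
console.log(matches); // [ 'number 1', 'number 152', 'number 36985' ]
```

The matches are plain strings, so as noted above they are no longer tied to ranges or formatting in the document.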
Try putting < and > on the ends of your search string to indicate the beginning and ending of the desired strings. This works for me: '<number [1-9]*>'. So does '<number [1-9]#>' which is probably what you want. Note that in Word wildcards the # is used where + is used in other RegEx systems.

MongoDB Text Search AND multiple search words with word stemming

I am trying to search for multiple words in text inclusively (AND operation) without losing word stemming.
For example:
db.supplies.runCommand("text", {search:"printers inks"})
should return results with (printer and ink) or (printers ink) or (printer inks) or (printers inks), instead of all results with either printer or ink.
This post covers the search for multiple words as an AND operation, but the solution doesn't search for stemmed words ->MongoDB Text Search AND multiple search words.
The only way I could think of is to create every permutation of the words and run one search per permutation (which could be a large number).
This may not be an effective way to search a large collection.
Is there a better and smarter way to do it?
So is there a reason you have to use a text search? If it were me, I would use a regular expression.
https://docs.mongodb.com/manual/reference/operator/query/regex/
Off the top of my head something like this.
db.collection.find({products:/printers inks|printers|inks/})
Now I suppose you can do the same thing with a text search too.
db.collection.find({$text:{$search : "\"printers inks\" printers inks"}})
Note the escaped quotes.
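One possible way to get AND semantics while still comparing stems is to post-filter candidate documents on the client. The sketch below is plain JavaScript under strong assumptions: `naiveStem` only strips a trailing "s" and is nothing like the stemmer MongoDB uses, and `matchesAllTerms` is a hypothetical helper, not a MongoDB feature.

```javascript
// Crude stand-in for a real stemmer: strips a trailing "s" only (assumption).
function naiveStem(word) {
  const lower = word.toLowerCase();
  return lower.length > 3 && lower.endsWith("s") ? lower.slice(0, -1) : lower;
}

// Split text into words and reduce each word to its (naive) stem.
function stems(text) {
  return new Set(text.split(/[^A-Za-z0-9]+/).filter(Boolean).map(naiveStem));
}

// True only if every stemmed search term occurs among the document's stems.
function matchesAllTerms(docText, search) {
  const docStems = stems(docText);
  return [...stems(search)].every((term) => docStems.has(term));
}

console.log(matchesAllTerms("Printer with spare ink", "printers inks")); // true
console.log(matchesAllTerms("Printers only", "printers inks"));          // false
```

A filter like this would run over the results of the broader OR-style text search, so it trades extra client-side work for not having to enumerate permutations.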

mongoDB query with case insensitive schema element

In my MongoDB collection I have added a record as follows
db.teacher.insert({_id:1 ,"name":"Kaushik"})
If I search
db.teacher.find({name:"Kaushik"})
I get one record. But if I try "NAME" instead of "name" i.e.
db.teacher.find({NAME:"Kaushik"})
It won't return any record.
It means that I must know exactly how the schema element is spelled, including its case. Is there a way to write a query that ignores the case of the schema element?
We can search the element value using case insensitive as follows
> db.teacher.find({name:/kAUSHIK/i})
{ "_id" : 1, "name" : "Kaushik" }
Is there something similar for the schema element, something like
> db.teacher.find({/NAME/i:"kaushik"})
We can search the element value using case insensitive [...]
Is there [something] similar for schema element [?]
No.
We may assume that JavaScript and JSON are case sensitive, and so are MongoDB queries.
That being said, internally MongoDB uses BSON, and the specs say nothing about case sensitivity of keys. The BNF grammar only says that an element name is a NUL-terminated modified UTF-8 string:
e_name  ::= cstring          Key name
cstring ::= (byte*) "\x00"   Zero or more modified UTF-8 encoded characters followed by '\x00'. The (byte*) MUST NOT contain '\x00', hence it is not full UTF-8.
But from the source code (here or here, for example), it appears that MongoDB's BSON implementation uses strcmp to perform a binary comparison on element names, confirming that there is no way to achieve what you want.
This may indeed be an issue that goes beyond case sensitivity: with combining characters, the same character can have several binary representations, and MongoDB does not perform Unicode normalization. For example:
> db.collection.insert({"é":1})
> db.collection.find({"é":1}).count()
1
> db.collection.find({"e\u0301":1}).count()
0
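The same effect can be reproduced in plain JavaScript, which also hints at a possible client-side mitigation: normalizing key names before documents are written. `normalizeKeys` below is a hypothetical helper for illustration, not something MongoDB provides.

```javascript
const composed = "\u00e9";    // "é" as a single code point
const decomposed = "e\u0301"; // "e" followed by a combining acute accent

console.log(composed === decomposed); // false: different code unit sequences
console.log(composed.normalize("NFC") === decomposed.normalize("NFC")); // true

// Hypothetical helper: bring top-level key names to NFC before storing a document.
function normalizeKeys(doc) {
  return Object.fromEntries(
    Object.entries(doc).map(([key, value]) => [key.normalize("NFC"), value])
  );
}

console.log(Object.keys(normalizeKeys({ [decomposed]: 1 }))[0] === composed); // true
```

This only helps if every writer and every query applies the same normalization form consistently.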
This is related to the JavaScript engine and the JSON specification. In JS, identifiers are case sensitive. This means you can have a document with two fields named "name" and "Name" or "NAME", so MongoDB treats them as distinct fields.
You could use a regex like
db.teacher.find({name:/^kaushik$/i})
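Since the query itself cannot ignore key case, one workaround is to handle it on the client after the document has been fetched. The sketch below uses a plain object as a stand-in for a fetched document; `getIgnoringKeyCase` is a hypothetical helper, and it only inspects top-level keys.

```javascript
const doc = { _id: 1, name: "Kaushik" }; // stand-in for a fetched document

console.log("name" in doc); // true
console.log("NAME" in doc); // false: a different key as far as JavaScript is concerned

// Hypothetical client-side helper: look up a top-level field ignoring key case.
function getIgnoringKeyCase(document, wantedKey) {
  const key = Object.keys(document).find(
    (k) => k.toLowerCase() === wantedKey.toLowerCase()
  );
  return key === undefined ? undefined : document[key];
}

console.log(getIgnoringKeyCase(doc, "NAME")); // "Kaushik"
```

If two keys differ only by case, this returns whichever comes first, so a consistent key-naming convention at write time is probably the more robust choice.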

MongoDB alphabetical sorting bug

I'm using MongoDB's standard $sort operation and found that the result order is disrupted when some strings start with a lowercase letter and others with an uppercase letter.
Example:
Google
HTC
LG
Yoc
iTaxi
As you can see, iTaxi gets pushed to the bottom instead of being placed after HTC.
This is case sensitive sorting where lowercase letters come after uppercase letters. Thus, for sorting and searching purposes, it makes sense to store a "normalized field", where strings are all caps and certain special characters are removed or replaced, e.g.
[ { name : "iTaxi", searchName: "ITAXI" },
{ name : "HTC", searchName: "HTC" },
{ name : "Ümlaut", searchName: "UMLAUT" },
.... ]
In this example, the searchName field should be indexed, not the name field.
The normalization of strings, particularly replacing umlauts and special characters, is a bit tricky. For instance, in German ü should become ue and ß should become ss or sz, but that is well beyond the scope of your original question.
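A minimal sketch of building such a normalized field in plain JavaScript follows. `toSearchName` is a name chosen for this example; it simply drops combining marks and upper-cases, so it yields "UMLAUT" as in the sample above and does not attempt language-specific rules such as ü to ue.

```javascript
// Build the normalized sort/search key for a display name.
function toSearchName(name) {
  return name
    .normalize("NFD")                // split letters from their combining marks
    .replace(/[\u0300-\u036f]/g, "") // drop the combining marks
    .toUpperCase();
}

const names = ["Google", "HTC", "LG", "Yoc", "iTaxi"];
const byCodeUnit = (a, b) => (a < b ? -1 : a > b ? 1 : 0);

// Sort by the normalized key, then return the original display names.
const sorted = names
  .map((name) => ({ name, searchName: toSearchName(name) }))
  .sort((a, b) => byCodeUnit(a.searchName, b.searchName))
  .map((entry) => entry.name);

console.log(sorted);                       // [ 'Google', 'HTC', 'iTaxi', 'LG', 'Yoc' ]
console.log(toSearchName("\u00dcmlaut"));  // "UMLAUT"
```

In the database, the value computed this way would be stored as searchName at write time, and the sort would run on that field.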

Include slashes and parentheses in tokens

Background
I have search indexes containing Greek characters. Many people don't know how to type Greek, so they enter something called "beta-code". Beta-code can be converted into Greek. For example, beta-code "NO/MOU" would be converted to "νόμου". Characters such as a slash or a parenthesis are used to indicate an accent.
Desired Behavior
I want users to be able to search using either beta-code or text in the Greek script. I figured out that the Whoosh Variations class provides the mechanism I need and it almost solves my problem.
Problem
The Variations class works well except when a slash or a parenthesis is used to indicate an accent in a user's query. The problem is that the query is parsed such that the special characters used to denote the accent cause the words to be split up. For example, a search for "NO/MOU" results in the Variations class being asked to find variations of "no" and "mou" instead of "NO/MOU".
Question
Is there a way to influence how the query is parsed such that slashes and parentheses are included in the search words (i.e. that a search for "NO/MOU" results in a search for a token of "NO/MOU" instead of "no" and "mou")?
The search parser uses a Tokenizer class for breaking up the search string into individual terms. Whoosh will use the class that is associated with the schema. For example, in the case below, the SimpleAnalyzer() will be used when searching the "content" field.
Schema( verse_id = NUMERIC(unique=True, stored=True),
content = TEXT(analyzer=SimpleAnalyzer()) )
By default, the SimpleAnalyzer() uses the following regular expression to tokenize search terms: "\w+(\.?\w+)*"
To use a different regular expression, pass it as the first argument to SimpleAnalyzer. For example, to include beta-code characters (slashes, parentheses, etc.) in tokens, use the following SimpleAnalyzer:
SimpleAnalyzer( rcompile(r"[\w/*()=\+|&']+(\.?[\w/*()=\+|&']+)*") )
Searches will now allow terms to include the special beta-code characters, and the Variations class will be able to convert the term to the Unicode version.
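The effect of the two patterns can be checked outside Whoosh. The sketch below applies them with JavaScript's regex engine rather than Python's, which is an assumption made purely for illustration: JavaScript's \w is ASCII-only, and any lowercasing the analyzer applies afterwards is left out. For ASCII beta-code input the splitting behaviour is the point being shown.

```javascript
// Default-style token pattern versus the extended beta-code pattern.
const defaultPattern = /\w+(\.?\w+)*/g;
const betaCodePattern = /[\w\/*()=+|&']+(\.?[\w\/*()=+|&']+)*/g;

console.log("NO/MOU".match(defaultPattern));  // [ 'NO', 'MOU' ]
console.log("NO/MOU".match(betaCodePattern)); // [ 'NO/MOU' ]
```

With the extended character class, the slash stays inside the token, so the whole beta-code word reaches the Variations class intact.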