I'm using MongoDB's standard $sort operation and found that the ordering gets disrupted when a string begins with a lowercase letter.
Example:
Google
HTC
LG
Yoc
iTaxi
As you can see, iTaxi gets pushed to the bottom instead of being placed after HTC.
This is case-sensitive sorting, where lowercase letters come after all uppercase letters. Thus, for sorting and searching purposes, it makes sense to store a "normalized" field, where strings are all caps and certain special characters are removed or replaced, e.g.
[ { name : "iTaxi", searchName: "ITAXI" },
{ name : "HTC", searchName: "HTC" },
{ name : "Ümlaut", searchName: "UMLAUT" },
.... ]
In this example, the searchName field should be indexed, not the name field.
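For example, a minimal mongo-shell sketch (the companies collection name is just for illustration):
db.companies.createIndex({ searchName: 1 })
db.companies.find().sort({ searchName: 1 })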
Normalizing strings, particularly replacing umlauts and special characters, is a bit tricky. For instance, in German ü should become ue and ß should become ss or sz, but that is well beyond the scope of your original question.
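For the simple cases, though, an application-side helper might look like this (a sketch only; normalizeName is a hypothetical name, and the transliterations follow the German convention just mentioned):
function normalizeName(name) {
    // Transliterate German umlauts and ß, then upper-case for binary-safe sorting
    return name
        .replace(/ä/gi, "ae")
        .replace(/ö/gi, "oe")
        .replace(/ü/gi, "ue")
        .replace(/ß/g, "ss")
        .toUpperCase();
}
normalizeName("iTaxi"); // "ITAXI"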
I'd like to sort a list of countries in Dart, by localised country name. This is how I'm doing it:
final countryNames = CountryNames.of(context);
_countries.sort((a, b) =>
    (countryNames.data[a.isoCode.toUpperCase()] ?? "").compareTo(
        countryNames.data[b.isoCode.toUpperCase()] ?? ""));
I'm not worried about the countries that aren't found in countryNames.data -- I just filter those out of the displayed list. The problem is that in English,
Åland Islands
appears at the bottom of the forward-sorted list, and in French and other languages with a proliferation of special characters, the situation is even worse.
Is there an idiomatic way to sort strings in Dart so that special characters are treated more logically?
You would have to create a mapping between characters with diacritics and their plain equivalents, and use it within the comparison so that 'Åland Islands' is treated as 'Aland Islands' for comparison purposes.
It looks like someone else has already done that and published it as a package: https://pub.dev/packages/diacritic
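Applied to the comparator from the question, a sketch using that package's removeDiacritics function might look like this:
import 'package:diacritic/diacritic.dart';

// Compare the diacritic-stripped names so 'Åland' sorts alongside 'Aland'
_countries.sort((a, b) =>
    removeDiacritics(countryNames.data[a.isoCode.toUpperCase()] ?? "")
        .compareTo(
            removeDiacritics(countryNames.data[b.isoCode.toUpperCase()] ?? "")));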
Eclipse autocomplete works OK for CamelCaseIdentifiers, but it is completely useless for MORE_TRADITIONAL_style_identifiers, which have upper-case prefixes and are separated by underscores.
Something like MTsi should match the latter, just like CCI matches the former.
Is there a way to do that? I could not find any preference.
Incidentally, the wildcard pattern MTst*id does work.
It looks like this already works, as long as you capitalize every letter in the query:
int MORE_TRADITIONAL_style_identifier();

int main() {
    int x = MTSI/*complete*/ // <-- completes MORE_TRADITIONAL_style_identifier
}
But it doesn't if some of the letters in the query are not capitalized, such as MTsi. I think capital letters are the signal to the matching algorithm that two consecutive letters may each begin a different segment, whereas a sequence of lowercase letters like si is expected to match as a verbatim substring.
If you feel the matching algorithm could be improved to handle mixed-case queries like this better, you could consider filing a bug and/or contributing a patch.
I want to search strings like "number 1" or "number 152" or "number 36985".
In all above strings "number " will be constant but digits will change and can have any length.
I tried the Search option using wildcards, but it doesn't seem to work.
Basic regex operators like + don't seem to work either.
I tried 'number*[1-9]*' and 'number*[1-9]+', but no luck.
The wildcard expression only selects up to one digit; e.g. if the string is 'number 12345', it only matches 'number 1'.
Does anyone know how to do this?
Word doesn't use regular expressions in its search (Find) functionality. It has its own set of wildcard rules. These are very similar to RegEx, but not identical and not as powerful.
Using Word's wildcards, the search text below locates the examples given in the question. (Note that the semicolon separator in 1;100 may be something else, depending on the list separator set in Windows or on the Mac. My European locale uses a semicolon; the United States would use a comma, for example.)
"number [0-9]{1;100}"
The 100 is an arbitrary number I chose for the maximum number of repeats of the search term just before it. Depending on how long you expect a number to be, this can be much smaller...
The logic of the search text is: number is a literal; the valid characters following the literal are 0 through 9; there may be from one to one hundred of these characters - anything in that range is a match.
The only way RegEx can be used in Word is to extract a string and run the search on the string. But this dissociates the string from the document, meaning Word-specific content (formatting, fields, etc.) will be lost.
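As a rough illustration of that extract-and-match approach, a VBA sketch along these lines could run a real regular expression over the document's plain text (the pattern and the Immediate-window output are just for demonstration):
Sub FindNumberMatches()
    ' Extract the document's plain text and run a genuine regex over it
    Dim re As Object, m As Object
    Set re = CreateObject("VBScript.RegExp")
    re.Pattern = "number \d+"
    re.Global = True
    For Each m In re.Execute(ActiveDocument.Content.Text)
        Debug.Print m.Value ' e.g. "number 12345"
    Next
End Sub
Note that the matches come back as plain strings, dissociated from their formatting, exactly as described above.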
Try putting < and > on the ends of your search string to indicate the beginning and ending of the desired strings. This works for me: '<number [1-9]*>'. So does '<number [1-9]@>', which is probably what you want. Note that in Word wildcards @ is used where + is used in other regex systems.
So I have a document in a collection where one of the fields has the value "###".
I indexed the collection and tried running the query:
db.getCollection('TestCollection').find({$text:{$search:"\"###\""}})
But it didn't show the result
How can I work around this?
Sample Document:
{
    "_id" : ObjectId("5b90dc6d3de8562a6ef7c409"),
    "field" : "value",
    "field2" : "###"
}
Text search is designed to index strings based on language heuristics. Text indexing involves two general steps: tokenizing (converting a string into individual terms of interest) followed by stemming (converting each term into a root form for indexing based on language-specific rules).
During the tokenizing step certain characters (for example, punctuation symbols such as #) are classified as word separators (aka delimiters) rather than text input and used to separate the original string into terms. Language-specific stop words (common words such as "the", "is", or "on" in English) are also excluded from a text index.
Since your search phrase of ### consists entirely of delimiters, there is no corresponding entry in the text index.
If you want to match generic string patterns, you should use regular expressions rather than text search. For example: db.getCollection('TestCollection').find({field2:/###/}). However, please note the caveats on index usage for regular expressions.
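For example (a sketch; whether an index on field2 exists is an assumption about your setup):
// Literal substring match - no text index required
db.getCollection('TestCollection').find({ field2: /###/ })

// A left-anchored pattern can make use of an ordinary index on field2
db.getCollection('TestCollection').createIndex({ field2: 1 })
db.getCollection('TestCollection').find({ field2: /^###/ })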
Your query has too many curly braces; remove them:
db.getCollection('so2').find({$text:{$search:"\"###\""}})
If you run it, Mongo tells you you're missing a text index. Add it like this:
db.so2.createIndex( { field2: "text" } )
The value you're using is pretty small. Try using longer values.
Background
I have search indexes containing Greek characters. Many people don't know how to type Greek, so they enter something called "beta-code". Beta-code can be converted into Greek. For example, beta-code "NO/MOU" would be converted to "νόμου". Characters such as a slash or parenthesis are used to indicate an accent.
Desired Behavior
I want users to be able to search using either beta-code or text in the Greek script. I figured out that the Whoosh Variations class provides the mechanism I need and it almost solves my problem.
Problem
The Variations class works well except when a slash or a parenthesis is used to indicate an accent in a user's query. The problem is that the query is parsed such that the special characters used to denote the accent cause the words to be split up. For example, a search for "NO/MOU" results in the Variations class being asked to find variations of "no" and "mou" instead of "NO/MOU".
Question
Is there a way to influence how the query is parsed such that slashes and parentheses are included in the search words (i.e. so that a search for "NO/MOU" results in a search for the single token "NO/MOU" instead of "no" and "mou")?
The search parser uses a Tokenizer class to break the search string into individual terms. Whoosh will use the analyzer that is associated with the schema. For example, in the case below, the SimpleAnalyzer() will be used when searching the "content" field.
Schema( verse_id = NUMERIC(unique=True, stored=True),
content = TEXT(analyzer=SimpleAnalyzer()) )
By default, the SimpleAnalyzer() uses the following regular expression to tokenize search terms: "\w+(\.?\w+)*"
To use a different regular expression, pass it as the first argument to SimpleAnalyzer. For example, to include the beta-code characters (slashes, parentheses, etc.) in tokens, use the following SimpleAnalyzer:
SimpleAnalyzer( rcompile(r"[\w/*()=\+|&']+(\.?[\w/*()=\+|&']+)*") )
Searches will now allow terms to include the special beta-code characters and the Variations class will be able to convert the term to the unicode version.
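Putting it together, a minimal sketch (the field names come from the schema above; in recent Whoosh versions rcompile lives in whoosh.util.text):
from whoosh.analysis import SimpleAnalyzer
from whoosh.fields import NUMERIC, TEXT, Schema
from whoosh.util.text import rcompile

# Keep beta-code accent characters (slashes, parentheses, etc.) inside tokens
beta_pattern = rcompile(r"[\w/*()=\+|&']+(\.?[\w/*()=\+|&']+)*")

schema = Schema(
    verse_id=NUMERIC(unique=True, stored=True),
    content=TEXT(analyzer=SimpleAnalyzer(beta_pattern)),
)
With this analyzer, a query such as "NO/MOU" survives parsing as a single token, so the Variations class receives the whole beta-code word rather than the fragments "no" and "mou".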