ICU collation rules with empty strings - unicode

I'm just getting started with ICU collation rules, and I have a question about how to create a rule that applies to the empty string.
In short, if I have the following list:
"", "abc", "", "def"
I would like to create a rule so that, after sorting, the empty strings will be at the end of the list:
"abc", "def", "", ""
The character set is not restricted to Latin, so the rule should apply to the whole range of Unicode code points. I have tried this without success:
"&[last regular] < \u0000"
I also tried looking at some options that the API offers but did not find anything of use.
My question: is it possible to create such a rule? As I said, I am not very familiar with the API or the semantics of the rules, but I suspect that it is not possible, since the empty string has no code points to attach weights to.
Thank you!
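Since a collation rule needs a code point to attach weights to, a common fallback is to special-case the empty string in the comparator and delegate everything else to the collator. A sketch in JavaScript, whose Intl.Collator is ICU-backed in Node.js (this is a workaround, not a rule-string solution):

```javascript
// Not a rule-string solution: handle "" in the comparator itself and
// let the ICU-backed collator order all non-empty strings.
const collator = new Intl.Collator();

function emptyLast(a, b) {
  if (a === "" && b === "") return 0;
  if (a === "") return 1;   // empty string sorts after everything
  if (b === "") return -1;
  return collator.compare(a, b);
}

const list = ["", "abc", "", "def"];
console.log(list.sort(emptyLast)); // [ 'abc', 'def', '', '' ]
```

This covers the whole Unicode range, because only the empty-string check is special-cased; every real comparison still goes through the collator.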

Related

Mongo: Is & char ignored in text index [duplicate]

So I have a document in a collection with one of the fields having the value "###".
I indexed the collection and tried running the query:
db.getCollection('TestCollection').find({$text:{$search:"\"###\""}})
But it didn't show the result
How can I work around this?
Sample Document:
{
    "_id" : ObjectId("5b90dc6d3de8562a6ef7c409"),
    "field" : "value",
    "field2" : "###"
}
Text search is designed to index strings based on language heuristics. Text indexing involves two general steps: tokenizing (converting a string into individual terms of interest) followed by stemming (converting each term into a root form for indexing based on language-specific rules).
During the tokenizing step certain characters (for example, punctuation symbols such as #) are classified as word separators (aka delimiters) rather than text input and used to separate the original string into terms. Language-specific stop words (common words such as "the", "is", or "on" in English) are also excluded from a text index.
Since your search phrase of ### consists entirely of delimiters, there is no corresponding entry in the text index.
If you want to match generic string patterns, you should use regular expressions rather than text search. For example: db.getCollection('TestCollection').find({field2:/###/}). However, please note the caveats on index usage for regular expressions.
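The tokenizing step described above can be illustrated with a toy splitter (a sketch of the principle only, not MongoDB's actual tokenizer):

```javascript
// Toy tokenizer: split on anything that is not a letter or digit.
// Illustrates the delimiter behaviour above, not MongoDB's real code.
const tokenize = (s) => s.split(/[^\p{L}\p{N}]+/u).filter(Boolean);

console.log(tokenize("The value is ###")); // [ 'The', 'value', 'is' ]
console.log(tokenize("###"));              // []
```

An input consisting entirely of delimiters produces no terms at all, so there is nothing for the text index to store.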
Your query has too many curly braces, remove them:
db.getCollection('so2').find({$text:{$search:"\"###\""}})
If you run it, Mongo tells you you're missing a text index. Add it like this:
db.so2.createIndex( { field2: "text" } )
The value you're using is pretty small. Try using longer values.

Algolia tag not searchable when ending with special characters

I'm coming across a strange situation where I cannot search on string tags that end with a special character. So far I've tried ) and ].
For example, given a Fruit index with a record with a tag apple (red), if you query (using the JS library) with tagFilters: "apple (red)", no results will be returned even if there are records with this tag.
However, if you change the tag to apple (red (which does not end with a special character), results will be returned.
Is this a known issue? Is there a way to get around this?
EDIT
I saw this FAQ on special characters. However, it seems as though even if I set () as separator characters to index, that only affects the directly searchable attributes, not the tags. Is this correct? Can I change the separator characters to index on tags?
You should try using the array syntax for your tags:
tagFilters: ["apple (red)"]
The reason it is currently failing is because of the syntax of tagFilters. When you pass a string, it tries to parse it using a special syntax, documented here, where commas mean "AND" and parentheses delimit an "OR" group.
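A toy parser for that string syntax (an illustration of the documented semantics only, not Algolia's implementation) shows why the tag never matches: the input is split at the parenthesis, so "apple (red)" is never compared as one whole tag.

```javascript
// Toy illustration of the legacy tagFilters string syntax described
// above: top-level commas mean AND, "(a,b)" is an OR group.
// Not Algolia's actual parser.
function parseTagFilters(input) {
  const terms = [];
  let i = 0;
  while (i < input.length) {
    if (input[i] === ",") { i++; continue; }
    if (input[i] === "(") {
      const close = input.indexOf(")", i);
      terms.push({ or: input.slice(i + 1, close).split(",") });
      i = close + 1;
    } else {
      let j = i;
      while (j < input.length && input[j] !== "," && input[j] !== "(") j++;
      terms.push({ and: input.slice(i, j) });
      i = j;
    }
  }
  return terms;
}

// The tag string is split at "(", producing two separate terms:
console.log(JSON.stringify(parseTagFilters("apple (red)")));
// [{"and":"apple "},{"or":["red"]}]
```

Passing the tag inside an array bypasses this parsing entirely, which is why the array syntax works.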
By the way, tagFilters is now deprecated for a much clearer syntax available with the filters parameter. For your specific example, you'd use it this way:
filters: '_tags:"apple (red)"'

mongoDB query with case insensitive schema element

In my MongoDB collection I have added a record as follows
db.teacher.insert({_id:1 ,"name":"Kaushik"})
If I search
db.teacher.find({name:"Kaushik"})
I get one record. But if I try "NAME" instead of "name" i.e.
db.teacher.find({NAME:"Kaushik"})
It won't return any record.
It means that I must know exactly how the schema element is spelled, including its case. Is there a way to write a query that ignores the case of the schema element?
We can search the element value case-insensitively as follows:
> db.teacher.find({name:/kAUSHIK/i})
{ "_id" : 1, "name" : "Kaushik" }
Is there something similar for the schema element, something like:
> db.teacher.find({/NAME/i:"kaushik"})
We can search the element value using case insensitive [...]
Is there [something] similar for schema element [?]
No.
We may assume that JavaScript and JSON are case-sensitive, and so are MongoDB queries.
That being said, MongoDB internally uses BSON, and the spec says nothing about the case-sensitivity of keys. The BNF grammar only says that an element name is a NUL-terminated modified UTF-8 string:
e_name ::= cstring Key name
cstring ::= (byte*) "\x00" Zero or more modified UTF-8 encoded
characters followed by '\x00'. The
(byte*) MUST NOT contain '\x00', hence
it is not full UTF-8.
But, from the source code (here or here, for example), it appears that MongoDB's BSON implementation uses strcmp to perform a binary comparison on element names, confirming there is no way to achieve what you want.
This might indeed be an issue beyond case sensitivity: with combining characters, the same character can have several binary representations, and MongoDB does not perform Unicode normalization. For example:
> db.collection.insert({"é":1})
> db.collection.find({"é":1}).count()
1
> db.collection.find({"e\u0301":1}).count()
0
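The same exact-byte matching can be seen in plain JavaScript objects, whose key lookup (like BSON's strcmp comparison) matches code units exactly, with no normalization (a small sketch):

```javascript
// Plain JS object keys behave the same way: lookup is an exact
// code-unit comparison, with no Unicode normalization applied.
const doc = { "\u00e9": 1 };                    // precomposed "é"

console.log("\u00e9" in doc);                   // true
console.log("e\u0301" in doc);                  // false: "e" + combining accent
console.log("e\u0301".normalize("NFC") in doc); // true after NFC normalization
```

Normalizing keys yourself (e.g. to NFC) before insert and before query is the usual way to sidestep this class of mismatch.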
This is related to the JavaScript engine and the JSON specification: in JavaScript, identifiers are case-sensitive. This means you can have a document with two distinct fields named "name" and "Name" (or "NAME"), so MongoDB treats differently-cased keys as distinct fields.
You could use a regex like
db.teacher.find({name:/^kaushik$/i})

In mongodb, need quotes around keys for CRUD operations, example: "_id" vs _id?

I'm reading over the MongoDB manual. Some examples have quotes around the keys, e.g. db.test.find({"_id" : 5}), and others don't, e.g. db.test.find({_id : 5}).
Both the quoted and unquoted versions work, but I'm wondering whether there is some nuanced difference here I don't know about, or whether one is the preferred best practice?
Thanks.
In JavaScript (the language of the MongoDB shell) those are treated exactly the same. The quotes are needed, however, when a key contains a period like when you're using dot notation to match against an embedded field as in:
db.test.find({"name.last": "Jones"})
My preference is to not use the quotes unless they're needed.
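A quick way to convince yourself the two spellings are identical, and to see where quotes become mandatory:

```javascript
// In JavaScript, quoted and unquoted keys produce the same object.
const a = { _id: 5 };
const b = { "_id": 5 };
console.log(JSON.stringify(a) === JSON.stringify(b)); // true

// Quotes are mandatory when the key is not a valid identifier:
const c = { "name.last": "Jones" };  // { name.last: "Jones" } is a SyntaxError
console.log(c["name.last"]);         // Jones
```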

REST pattern for for query parameters that might be LIKE searches

Hi, I'm building a RESTful app and can't find a recommended pattern for optional fuzzy or LIKE queries. For example, a strict query might be:
/place?city=New+York&state=NY
which corresponds to the SQL ... WHERE city = 'New York' AND state = 'NY'.
But what if I wanted to search the city field for any row with "York" in the city name?
... WHERE city LIKE '%{parameter}%' AND state = '{parameter2}'
I'm thinking about just adding some kind of url-valid character to the request like this:
/place?city=*York*&state=NY
Is there an established or recommended pattern I should use? Thanks!
It's fine to use the query string for searching, but it's a little odd to use wildcard characters like "*" or "?" in a query string (unless you're building a really powerful search engine like Google). More importantly, search is usually fuzzy by default, so it's redundant to prepend/append "*" to the keyword. If you do need an exact search, you could surround the exact (or strict) keyword with double quotes. That is, instead of /place?city=*York*&state=NY, I recommend /place?city=York&state="NY".
In fact, Google uses quotes to search for an exact word or set of words, and I found that this site follows the same pattern.
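Server-side, this quoted-means-exact convention is easy to detect. A sketch with a hypothetical parameter parser (the function name and shape are illustrative, not from any framework):

```javascript
// Hypothetical helper: a value wrapped in double quotes means an exact
// match; anything else is treated as a fuzzy/LIKE search.
function parseSearchParam(raw) {
  const m = raw.match(/^"(.*)"$/);
  return m ? { mode: "exact", value: m[1] }
           : { mode: "fuzzy", value: raw };
}

console.log(parseSearchParam('York'));  // { mode: 'fuzzy', value: 'York' }
console.log(parseSearchParam('"NY"'));  // { mode: 'exact', value: 'NY' }
```

The fuzzy value would then feed a LIKE '%...%' clause (properly parameterised to avoid injection), while the exact value feeds a plain equality comparison.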