How to sort a Thai field using MongoDB

How to sort a Thai field using mongoDB as in the following SQL command?
SELECT * FROM employee ORDER BY CONVERT(name USING tis620)

Right now, it is not possible for MongoDB to sort by anything other than Unicode code point. There is an issue in our issue tracker, https://jira.mongodb.org/browse/SERVER-1920, which tracks the inclusion of locale-aware and case-insensitive sorting in MongoDB.

Actually, there is a way! (Though it is a hack.)
I know this is an older thread, but I think it would be useful to answer anyway.
You definitely do not want to do the sorting in your app, because that means you have to get all documents in the collection into memory to sort them and return the window that you want. If your collection is huge, then this is extremely inefficient. The database should be doing the sorting and returning the window to you.
But, MongoDB doesn't support locale-sensitive sorting, you say. How do you solve the problem? The magic is the concept of "sort keys".
Basically, let's say you had the regular English/Latin alphabet from "a" to "z". What you would do is create a sort key mapping from "a" to "01", from "b" to "02", etc., through to "z" to "26". That is, map every letter to a number in the sort order for that language and then encode that number as a string. Then, map the string you want to sort on to this type of sort key. For example, "abc" would become "010203". Then add a property to your document holding the sort key for that property, and suffix the property's name with the name of the locale:
{
  name: "abc",
  name_en: "010203"
}
Now you can sort in the language "en" just by indexing the "name_en" property and using plain old MongoDB sorting, selectors, and ranges on "name_en" instead of the "name" property.
Now, let's say you have another crazy language "xx" where the order of the alphabet is "acb" instead of "abc". (Yes, there are languages that mess with the order of the Latin alphabet in that fashion!) The sort key would be like this:
{
  name: "abc",
  name_en: "010203",
  name_xx: "010302"
}
Now, all you have to do is create indexes on name_en and name_xx and use the regular MongoDB sort in order to sort correctly on those locales. Basically, the extra properties are proxies for sorting in different locales.
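The mapping described above can be sketched in JavaScript (a hypothetical helper; the property names and the "xx" alphabet are just the examples from this answer):

```javascript
// Build a sort key for a locale by mapping each letter to its
// two-digit rank in that locale's alphabet order.
function makeSortKey(s, alphabet) {
  // `alphabet` holds the locale's letters in collation order, e.g. "acb..."
  return s
    .toLowerCase()
    .split('')
    .map(ch => {
      const rank = alphabet.indexOf(ch) + 1;   // 1-based rank in the alphabet
      return String(rank).padStart(2, '0');    // 1 -> "01", 26 -> "26"
    })
    .join('');
}

// English order vs. the hypothetical "xx" locale where "c" sorts before "b":
const en = 'abcdefghijklmnopqrstuvwxyz';
const xx = 'acbdefghijklmnopqrstuvwxyz';

console.log(makeSortKey('abc', en)); // "010203"
console.log(makeSortKey('abc', xx)); // "010302"
```

Storing the result in "name_en" / "name_xx" lets a plain byte-wise MongoDB index sort correctly for each locale.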
So where do you get these mappings, you ask? After all, you're no globalization expert, right?
Well, if you're using Java, C, or C++, there are ready-made classes that do this mapping for you. In Java, use the standard Collator class, or use the icu4j Collator class. If you are using C/C++, use the C/C++ version of the ICU Collator functions/classes. For other languages, you are somewhat out of luck unless you can find a library that does it already.
I know that both Java and ICU support the Thai locale and can do proper sorting in Thai. Just make sure all your strings are properly encoded in UTF-8.
Here are some links to help you find them:
The standard Java library Collator: http://docs.oracle.com/javase/7/docs/api/java/text/Collator.html#getCollationKey(java.lang.String)
The C++ Collator class: http://icu-project.org/apiref/icu4c/classicu_1_1Collator.html#ae0bc68d37c4a88d1cb731adaa5a85e95
You can also make different sort keys that allow you to sort case-insensitively per locale (yes, case mapping is locale-sensitive!), accent-insensitively, Unicode-variant-insensitively, or any combination of the above. The only problem is that now you have many properties paralleling each sortable property, and you have to keep them all in sync when you update the base "name" property. It is a pain in the you-know-what, but still, it is better than doing the sorting in your app or business logic layer.
Also be careful of cursors with ranges. In English, for example, we just ignore accents on characters. So, an "Ö" sorts the same way as "O" and it will appear in the range "M" to "Z". But, in Swedish, accented characters sort after "Z". So, if you do a range "M" - "Z", you will include a bunch of records starting with "Ö" that should be there in English, but not in Swedish.
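You can see this locale difference directly with JavaScript's built-in Intl.Collator (assuming a runtime with full ICU data, such as recent Node.js):

```javascript
const sv = new Intl.Collator('sv'); // Swedish collation
const en = new Intl.Collator('en'); // English collation

// In Swedish, "Ö" collates after "Z" (outside an M-Z range)...
console.log(sv.compare('Ö', 'Z') > 0); // true

// ...while in English, "Ö" collates like "O", i.e. before "P"
// (inside an M-Z range).
console.log(en.compare('Ö', 'P') < 0); // true
```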
This also has implications in sharding if you split on a text property of a document. Be careful about what ranges go into which shard. It would be better to shard on things that are not locale-sensitive, like hashes.

Related

Firestore query on maps

In a firestore query, how do I check whether an element is the key in a map or not?
For example, I have this document:
I want to check if the user's UID matches one of the UIDs in the "authors" map data structure. All the answers that I've seen so far use "where", but I don't think that's allowed syntax for Firestore queries anymore?
You can't query on keys like that (at least not that I know of).
I instead recommend adding a field authorUids that is an array of the UIDs of the authors. With that array, you can then use the array-contains operator:
collectionRef.where('authorUids', 'array-contains', 'ppGr1M8s...');
Can't imagine how you got the impression that "where" is no longer valid (it is). But in particular, "where" is a test on the value of a field (not its existence), AND there is no test for "null" or "not equal to".
BUT - speculating a tad here - you might be able to fake a non-null test in your case:
collectionRef.where('authors.' + userUID + '.0', '>', '\u0000')
(fix the notation as needed) meaning
setting fieldPath to the concatenated string authors.ppGr1M8sQWVrrsna6MlcQqxzLA3.0 in your example
and the field value to the Unicode character value 0 (i.e. the minimal lexical value possible for a string),
so ANY value of the string is greater than that, if the field exists at all.
The Firestore documentation states that documents that do not contain the specified fieldPath will not be returned, but you still need a valid test on the value. I strongly suspect this will result in creating a lot of inefficient indexes, and it is highly NOT recommended.
It would be an interesting exercise to see if this approach actually functions (I haven't tried it and don't intend to) - but really, find another structure; the convoluted explanation of the hack shows what a poor idea it really is.
The most important decisions, especially for a NoSQL database, are your structure/schema decisions - don't put too much effort into forcing yourself to work around bad schema/structure.

In Algolia, how do you construct records to allow for alphabetical sorting of query results?

As far as I know, you can only sort on numeric fields in Algolia, so how do you efficiently set up your records to allow for results to be returned alphabetically based on a specific string field?
For example, let's say each record in an index has a field called "title" that contains an arbitrary string value. How would you create a sibling field called "title_sort" that contains a number that allows the results to be sorted such that the records come out in alphabetical order by "title"? Is there a particularly well-accepted algorithm for creating such a number from the string in "title"?
If you have a static dataset, then you can just sort your data and put an index on it. This works as long as you re-sort the data every time you update your indices.
I'm also thinking that if you can deal with a partial sorting, meaning that you can accept orc < orb as long as or < os, then you could derive a numeric key using base64 as the index. You can then sort on as many characters as you have precision for. It's only a partial sorting, but it might be acceptable for your use case. You just need to adjust your base64 -> base10 mapping to accommodate the sort order.
Additionally, if you don't care about the difference between capital and lowercase letters, you can do base26 -> base10 instead. The more I think about this, the more limited it seems, but it might work for your use case.
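A sketch of that idea (a hypothetical helper; the "title_sort" field name and the six-letter precision are assumptions): treat the first few letters as digits of a number, so comparing the numbers approximates comparing the strings:

```javascript
// Map the first `len` letters of a string to a single number usable as a
// numeric "title_sort" field. This is only a partial ordering: strings
// that agree on the first `len` letters tie.
function titleSortValue(title, len = 6) {
  const s = title.toLowerCase();
  let value = 0;
  for (let i = 0; i < len; i++) {
    // 'a' -> 1 ... 'z' -> 26; missing or non-letter positions -> 0, so
    // shorter strings sort before their extensions ("or" < "orb").
    const code = i < s.length ? s.charCodeAt(i) - 96 : 0;
    const digit = code >= 1 && code <= 26 ? code : 0;
    value = value * 27 + digit; // base 27 leaves room for the 0 padding
  }
  return value;
}

console.log(titleSortValue('orb') < titleSortValue('orc')); // true
console.log(titleSortValue('or') < titleSortValue('os'));   // true
```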

Stemming does not work properly for MongoDB text index

I am trying to use the full text search feature of MongoDB and observing some unexpected behavior. The problem is related to the "stemming" aspect of the text indexing feature. The way full text search is described in many articles online, if you have the string "big hunting dogs" in a document field that is part of the text index, you should be able to search on "hunt" or "hunting" as well as on "dog" or "dogs". MongoDB should normalize or stem the text when indexing and also when searching. So in my example, I would expect it to save the words "dog" and "hunt" in the index and to search for stemmed versions of these words. If I search for "hunting", MongoDB should search for "hunt".
Well, this is not how it works for me. I am running MongoDB 2.4.8 on Linux with full text search enabled. If my record has the value "big hunting dogs", only searching for "big" will produce a result, while searches for "hunt" or "dog" produce nothing. It is as if the words that are not in their "normalized" form are not stored in the text index (or are stored in a way that cannot be found). Searches using the $regex operator work fine; that is, I am able to find the document by searching on a string like /hunting/ against the field in question.
I tried dropping and recreating the full text index - nothing changed. I can only find the documents containing the words in their "normal" form. Searching for words like "dogs" or "hunting" (or even "dog" or "hunt") produces no results.
Do I misunderstand or misuse the full text search operations or is there a bug in MongoDB?
After a fair amount of experimenting and scratching my head, I discovered the reason for this behavior. It turned out that the documents in the collection in question had an attribute 'language'. Apparently the presence and value of that attribute made these documents non-searchable. (The value happened to be 'ENG'. It is possible that changing it to 'eng' would make the documents searchable again. The field, however, served a completely different purpose.) After I renamed the field to 'lang', I was able to find the document containing the word "dogs" by searching for "dog" or "dogs".
I wonder whether this is expected behavior of MongoDB - that the presence of language attribute in the document would affect the text search.
Michael,
The "language" field (if present) allows each document to override the language in which the stemming of words is done. I think, since you specified to MongoDB a language it didn't recognize ("ENG"), it was unable to stem the words at all. As others pointed out, you can use the language_override option to specify that MongoDB should use some other field for this purpose (say "lang") and not the default one ("language").
Below is a nice quote (about full-text indexing and searching) which is exactly related to your issue. It is taken from the book "MongoDB: The Definitive Guide, 2nd Edition":
Searching in Other Languages
When a document is inserted (or the index is first created), MongoDB looks at the
indexed fields and stems each word, reducing it to an essential unit. However, different
languages stem words in different ways, so you must specify what language the index
or document is. Thus, text-type indexes allow a "default_language" option to be
specified, which defaults to "english" but can be set to a number of other languages
(see the online documentation for an up-to-date list).
For example, to create a French-language index, we could say:
> db.users.ensureIndex({"profil" : "text", "interets" : "text"}, {"default_language" : "french"})
Then French would be used for stemming, unless otherwise specified. You can, on a
per-document basis, specify another stemming language by having a "language" field
that describes the document’s language:
> db.users.insert({"username" : "swedishChef", "profile" : "Bork de bork", language : "swedish"})
What the book does not mention (at least this page of it doesn't) is that one can use the language_override option to specify that MongoDB should use some other field for this purpose (say "lang") and not the default one ("language").
In http://docs.mongodb.org/manual/tutorial/specify-language-for-text-index/ take a look at the language_override option when setting up the index. It allows you to change the name of the field that should be used to define the language of the text search. That way you can leave the "language" property for your application's use, and call it something else (e.g. searchlang or something like that).
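For example (a sketch in mongo shell syntax; the collection and field names are placeholders):

```javascript
// Use "lang" instead of the default "language" as the per-document
// override field, so the "language" property stays free for the
// application's own data.
db.collection.createIndex(
  { description: "text" },
  { default_language: "english", language_override: "lang" }
)
```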

MongoDB fulltext search + workaround for partial word match

Since it is not possible to find "blueberry" by the word "blue" using a MongoDB full text search, I want to help my users complete the word "blue" to "blueberry". To do so, is it possible to query all the words in a MongoDB full text index, so that I can use the words as suggestions, e.g. for typeahead.js?
Language stemming in text search uses an algorithm to try to relate words derived from a common base (eg. "running" should match "run"). This is different from the prefix match (eg. "blue" matching "blueberry") that you want to implement for an autocomplete feature.
To most effectively use typeahead.js with MongoDB text search I would suggest focusing on the prefetch support in typeahead:
Create a keywords collection which has the common words (perhaps with usage frequency count) used in your collection. You could create this collection by running a Map/Reduce across the collection you have the text search index on, and keep the word list up to date using a periodic Incremental Map/Reduce as new documents are added.
Have your application generate a JSON document from the keywords collection with the unique keywords (perhaps limited to "popular" keywords based on word frequency to keep the list manageable/relevant).
You can then use the generated keywords JSON for client-side autocomplete with typeahead's prefetch feature:
$('.mysearch .typeahead').typeahead({
  name: 'mysearch',
  prefetch: '/data/keywords.json'
});
typeahead.js will cache the prefetch JSON data in localStorage for client-side searches. When the search form is submitted, your application can use the server-side MongoDB text search to return the full results in relevance order.
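As a rough client-side approximation of the keyword-collection step (a hypothetical helper, not the Map/Reduce itself), you could extract the most frequent words from your documents and serve them as the prefetch JSON:

```javascript
// Build a word-frequency map from one text field of some documents,
// then keep the most frequent words as typeahead prefetch data.
function buildKeywords(docs, field, limit = 1000) {
  const freq = new Map();
  for (const doc of docs) {
    // Lowercase and split into alphabetic words.
    const words = String(doc[field] || '').toLowerCase().match(/[a-z]+/g) || [];
    for (const w of words) {
      freq.set(w, (freq.get(w) || 0) + 1);
    }
  }
  // Sort by descending frequency and keep the top `limit` words.
  return [...freq.entries()]
    .sort((a, b) => b[1] - a[1])
    .slice(0, limit)
    .map(([word]) => word);
}

console.log(buildKeywords(
  [{ title: 'big hunting dogs' }, { title: 'hunting season' }],
  'title'
)); // [ 'hunting', 'big', 'dogs', 'season' ]
```

The resulting array can be serialized to the keywords.json file that the typeahead snippet above prefetches.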
A simple workaround I am using right now is to break the text into individual chars stored as a text-indexed array.
Then, when you do the $search query, you simply break up the query into chars again.
Please note that this only works for short strings (say, shorter than 32 characters); otherwise the index building process will take really long, and insert performance will drop significantly.
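A sketch of that workaround (the field names and the wiring into $search are assumptions):

```javascript
// Transform a short string into the char-array form stored in the
// text-indexed field...
function toCharArray(s) {
  return s.split('');               // "blue" -> ["b", "l", "u", "e"]
}

// ...and break a query string into the matching $search text.
function toSearchText(query) {
  return query.split('').join(' '); // "blue" -> "b l u e"
}

// Hypothetical document shape before insert:
// { name: "blueberry", name_chars: toCharArray("blueberry") }
// Hypothetical query:
// db.items.find({ $text: { $search: toSearchText("blue") } })
```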
You cannot query for all the words in the index, but you can of course query the original document's fields. The words in the search index are also not always the full words; they are stemmed anyway. So you probably wouldn't find "blueberry" in the index, but just "blueberri".
Don't know if this might be useful to some new people facing this problem.
Depending on the size of your collection and how much RAM you have available, you can do a search with $regex by creating the proper index. E.g.:
db.collection.find( {query : {$regex: /querywords/}}).sort({'criteria': -1}).limit(limit)
You would need an index as follows:
db.collection.ensureIndex( { "query": 1, "criteria" : -1 } )
This could be really fast if you have enough memory.
Hope this helps.
For those who have not yet started implementing any database architecture and are here for a solution, go for Elasticsearch. It's a JSON-document-driven database, structurally similar to MongoDB. It has an "edge-ngram" analyzer which is really efficient and quick at giving you "did you mean" suggestions for misspelled searches. You can also search partial words.

Can Kinosearch do mathematical comparisons on numbers like "greater-than"?

I am using Perl's KinoSearch module to index a bunch of text.
Some of the text has numeric fields associated with each word. For example, the word "Pizza" in the index may have a dollar field value like "5.50" (dollars).
How can I write a query in KinoSearch that will find all words that have a dollar value greater than 5?
I'm not even sure if a full-text search engine can do this kind of thing. It seems more like a SQL query.
After a bunch of searching (heh, heh), I found this in the docs: RangeQuery
I may be able to make that work. But it appears the newly required classes are not part of the standard release yet.