Is there any NoSQL database that allows for 8-bit single-byte encodings, such as ISO-8859-1? - nosql

I'm trying to choose the best NoSQL database service for my data.
Most NoSQL databases out there only support UTF-8, and there's no way to enforce a different encoding, unlike in relational databases. And the problem is that UTF-8 uses one byte only for the first 128 characters (0-127), but two bytes for the next 128, and these are the characters that comprise 80% of my data (don't ask me why I have more of these squiggles than the actual English alphabet, it's a long-winded answer).
I'll have to perform lots of regular-expression queries on those strings, which look like "àñÿÝtçh" and are made up mostly of characters 128-255 in ISO-8859-1, ISO-8859-15 or Windows-1252. I know that I will never need to store characters outside that range in my database, so it's safe to work with only 256 characters and miss out on the gazillion characters UTF-8 supports. I am also aware that ISO-8859-1 will create lots of compatibility problems with JSON objects and the like. However, we will be running lots of regex queries, some of them quite complex, and the extra cost of doubling the bytes just because I have no choice but to use UTF-8 may have a negative impact on performance.
I understand that NoSQL databases tend to be schema-less, and fields are not normally defined with data types and encodings, but NoSQL will suit our project much better than SQL. Cassandra stores strings in 1-byte US-ASCII for the 0-127 lot, and not 1-byte UTF-8. Is there any NoSQL out there that defaults to ISO-8859-1 or ISO-8859-15 for the 0-255 lot?
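For what it's worth, the size difference the question describes is easy to quantify. A minimal Python sketch, using the sample string from the question:

```python
# Characters 0-127 are one byte in UTF-8; the accented characters the
# question cares about (128-255) take two bytes each.
s = "àñÿÝtçh"  # sample string from the question

latin1 = s.encode("iso-8859-1")  # 1 byte per character
utf8 = s.encode("utf-8")         # 2 bytes for each of the 5 accented characters

print(len(s), len(latin1), len(utf8))  # 7 characters -> 7 vs 12 bytes
```

So for data dominated by the 128-255 range, UTF-8 storage approaches twice the size of a single-byte encoding, which is the asker's concern.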

Related

Postgresql support for national character data types

Looking for a reference that discusses PostgreSQL's support for the NATIONAL CHARACTER set of data types. e.g. this query runs without error:
select cast('foo' as national character varying(10))
yet the docs on Postgres character data types don't seem to discuss that type.
Does Postgres implement these differently from the CHARACTER data types? That is, how does the NATIONAL keyword affect how data is stored or represented?
Can someone share a link or two to any references I can't seem to find? (other than some mailing list correspondence from a while back)
If you request a national character varying in PostgreSQL, you'll get a regular character varying.
PostgreSQL uses the same encoding for normal and national characters.
“National character” is a leftover from the bad old days when people still used single-byte encodings like LATIN-1 and needed a different encoding for characters that didn't fit.
PostgreSQL has always supported Unicode encodings, so this is not an issue. Just make sure that you don't specify an encoding other than the default UTF8.
The SQL:92 standard gives NATIONAL CHARACTER no real meaning, saying only (section 4.2.1) that it means "a particular implementation-defined character repertoire". If you are surprised, don't be. There are many screwy aspects to the SQL standard.
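You can check the mapping yourself on a stock PostgreSQL installation; pg_typeof reports the plain type (illustrative snippet):

```sql
-- NATIONAL CHARACTER VARYING is silently mapped to character varying
SELECT pg_typeof(CAST('foo' AS NATIONAL CHARACTER VARYING(10)));
-- returns: character varying
```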
As for text handling in Postgres, you would likely be interested in learning about:
character encoding
Unicode
UTF-8
collations
support for ICU in Postgres 10 and later.
See:
More robust collations with ICU support in PostgreSQL 10 by Peter Eisentraut, post, 2017-05.
Collations: Introduction, Features, Problems by Peter Eisentraut, video, 2019-07-12.
Unicode Collation Algorithm (UCA)
ICU User Guide – Locale
List of locales with 209 languages, 501 regions & variants, as defined in ICU

Multiple languages with utf8 in postgresql

How exactly is one meant to seamlessly support all languages stored within postgres's utf8 character set? We seem to be required to specify a single language-specific collation along with the character set, such as en_US.utf8. If I'm not mistaken, we don't have the ability to store both English (en_US) and Chinese (zh_CN) in the same utf8 column, while maintaining any kind of meaningful collation behavior. If I define a column as en_US.utf8, how is it supposed to handle values containing Chinese (zh_CN) characters / byte sequences? The reality is that a single column value can contain multiple languages (ex: "Hello and 晚安"), and simply cannot be collated according to a single language.
Yes, I can physically store any character sequences; but what is the defined behavior for ordering on a en_US.utf8 column that contains English, German, Chinese, Japanese and Korean strings?
I understand that mysql's utf8mb4_unicode_ci collation isn't perfect, and that it is not following any set standard for how to collate the entire unicode set. I can already hear the anti-mysql crowd sighing about how mysql's language-agnostic collations are arbitrary, semantically meaningless, or even purely invalid. But the fact is, it works well enough, and fulfills the expectation that utf8 = multi-language unicode support.
Is postgres just being extremely stubborn with the fact that it's semantically incorrect to collate across the unicode spectrum? I know the developers are very strict when it comes to "doing things according to spec", but this inability to juggle multiple languages is frustrating to say the least. Am I missing something that solves the multi-language problem, or is the official stance that a single utf8 column can handle any language, but only one language at a time?
You are right that there will never be a perfect way to collate strings across languages.
PostgreSQL has decided not to create its own collations but to use those provided by the operating system. The idea behind this is to avoid re-inventing the wheel and to reduce maintenance effort.
So the traditional PostgreSQL answer to your question would be: if you want a string collation that works reasonably well for strings in different languages, complain to your operating system vendor or pick an operating system that provides such a collation.
However, this approach has drawbacks that the PostgreSQL community is aware of:
Few – if any – people decide on an operating system based on the collation support it provides.
PostgreSQL's sorting behaviour depends on the underlying operating system, which leads to frequent questions by confused users on the mailing lists.
With some operating systems collation behaviour can change during an operating system upgrade, leading to corrupt database indexes (see for example this thread).
It may well be that PostgreSQL changes its approach; there have been repeated efforts to use ICU libraries instead of operating system collations (see for example this recent thread), which would mitigate some of these problems.
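The underlying issue is visible outside the database, too. As a rough illustration (Python here, but the same applies to anything that sorts strings by raw code point, which is what PostgreSQL's "C" collation does):

```python
# Raw code-point order, as used by PostgreSQL's "C" collation and by
# Python's default string comparison; no linguistic rules are applied.
words = ["zebra", "Äpfel", "apple"]

print(sorted(words))  # ['apple', 'zebra', 'Äpfel'] -- 'Ä' (U+00C4) sorts after 'z'
```

A locale-aware collation (for example a German one, via locale.strxfrm, if that locale is installed) would instead sort "Äpfel" next to "apple"; which locales exist, and exactly how they sort, depends on the operating system.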

Specifying ASCII columns in a UTF-8 PostgreSQL database

I have a PostgreSQL database with UTF8 encoding and LC_* en_US.UTF8. The database stores text columns in many different languages.
On some columns however, I am 100% sure there will never be any special characters, i.e. ISO country & currency codes.
I've tried doing something like:
"countryCode" char(3) CHARACTER SET "C" NOT NULL
and
"countryCode" char(3) CHARACTER SET "SQL_ASCII" NOT NULL
but this comes back with the error
ERROR: type "pg_catalog.bpchar_C" does not exist
ERROR: type "pg_catalog.bpchar_SQL_ASCII" does not exist
What am I doing wrong?
More importantly, should I even bother with this? I'm coming from a MySQL background where doing this was a performance and space enhancement, is this also the case with PostgreSQL?
TIA
Honestly, I do not see the purpose of such settings, as:
as @JoachimSauer mentions, the ASCII subset of UTF-8 occupies exactly the same number of bytes as plain ASCII; keeping ASCII unchanged was the main point of inventing UTF-8. Therefore I see no size benefit;
all software capable of processing strings in different encodings uses a common internal encoding, which nowadays is UTF-8 by default for PostgreSQL. When textual data comes into the processing stage, the database converts it into the internal encoding if the encodings do not match. Therefore, if you specify some columns as being non-UTF-8, this leads to extra processing of the data, so you will lose some cycles (though I don't think it would be a noticeable performance hit).
Given there's no space benefits and there's a potential performance hit, I think it is better to leave things as they are, i.e. keep all columns in the database's default encoding.
For the same reasons, I think, PostgreSQL does not allow specifying encodings for individual objects within the database. Character set and locale are set at the per-database level.
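The first point is easy to verify: an ASCII-only value produces byte-for-byte identical output under any of these encodings. A minimal sketch:

```python
# The ASCII range is encoded identically in ASCII, ISO-8859-1 and UTF-8,
# so an ASCII-only column gains no space from a single-byte declaration.
code = "USD"  # an ISO currency code, pure ASCII

assert code.encode("ascii") == code.encode("utf-8") == code.encode("iso-8859-1")
print(len(code.encode("utf-8")))  # 3 bytes either way
```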

Compressing ASCII data to fit within a UTF-32 API?

I have an API that receives Unicode data, but I only need to store ASCII in it. I'd like to compress & obfuscate (or encrypt) the string values that will be persisted in Unicode.
My desire is to either compress this schema data, or to encrypt it from prying eyes. I don't think it's possible to do both well.
Considering that I want to restrict my source data to valid, printable ASCII; how can I "compress" that original string value into a value that is either smaller, obfuscated, or both?
Here is how I imagine this working (though you may have a better way):
This source code will take a given String as input
The byte representation of that string will be taken (UTF-8, ASCII, you decide)
Some magic happens - (this is the part I need your help on)
The resulting bytes will be converted into an int or long (no decimal points)
The number will be converted into a corresponding character using this utility
http://baseanythingconvert.codeplex.com/SourceControl/changeset/view/77855#1558651
(note that the utility will be used to enforce the constraint that the "final" Unicode name must not include the following characters: '/', '\', '#', '?' or '%')
Background
The Microsoft Azure Table service has an API that accepts Unicode data for the storage of property names. This is a schema-free database (so columns can be created ad hoc), therefore the schema is stored per row. The downside is that this schema data is stored on disk multiple times, and it is also transmitted over the wire, quite redundantly, in an XML blob.
In addition, I'm working on a utility that dynamically encrypts/decrypts Azure Table Data, but the schema is unencrypted. I'd like to mask or obfuscate this header information somehow.
These are just some ideas.
Isn't step 3 actually straightforward (just compress and/or encrypt the data into different bytes)? For 7-bit ASCII, you can also, before compressing and/or encrypting, store the data by packing the bits so they fit into fewer bytes.
If you can use UTF-32, UTF-8, and so on in step 5, you have access to all the characters in the Unicode Standard, up to 0x10FFFD, with some exceptions; for example, some code points, such as 0xFFFF, are noncharacters in the Unicode Standard, and others, such as the surrogate code point 0xD800, are not valid characters at all.
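One hedged sketch of steps 2-5 (certainly not the only possible "magic"): compress with DEFLATE, then re-encode with URL-safe Base64, whose alphabet (A-Z, a-z, 0-9, '-', '_', '=') contains none of the forbidden characters:

```python
import base64
import zlib

FORBIDDEN = set("/\\#?%")  # characters the final property name must avoid

def pack(name: str) -> str:
    raw = name.encode("ascii")          # step 2: bytes of the source string
    compressed = zlib.compress(raw, 9)  # step 3: DEFLATE (also obfuscates)
    # steps 4-5: URL-safe Base64 output contains none of the forbidden characters
    return base64.urlsafe_b64encode(compressed).decode("ascii")

def unpack(encoded: str) -> str:
    return zlib.decompress(base64.urlsafe_b64decode(encoded)).decode("ascii")

original = "SomeVeryLongRepetitiveColumnNameColumnNameColumnName"
packed = pack(original)

assert unpack(packed) == original     # lossless round trip
assert not (set(packed) & FORBIDDEN)  # no forbidden characters
print(len(original), len(packed))
```

Note that very short names will actually grow, since the zlib header plus the Base64 expansion (4 output characters per 3 input bytes) outweighs any compression savings; this only pays off for longer, repetitive schema names. Swapping zlib.compress for an encryption step would give obfuscation instead of compression.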

What is the difference between serializing and encoding?

What is the difference between serializing and encoding?
When should I use each in a web service?
Serializing is about moving structured data over a storage/transmission medium in a way that preserves the structure. Encoding is broader: it concerns how that data is converted into different forms. You could think of serializing as a subset of encoding in this example.
With regard to a web service, you will probably be serializing/deserializing certain data when making/receiving requests/responses, effectively transporting "messages". Encoding is at a lower level: it is how your messaging/web-service type/serialization mechanism works under the hood.
"Serialization" is the process of converting data (which may include arrays, objects and similar structures) into a single string so it can be stored or transmitted easily. For instance, a single database field can't store an array because database engines don't understand that concept. To be stored in a database, an array must either be broken down into individual elements and stored in multiple fields (which only works if none of the elements are themselves arrays or objects) or it must be serialized so it can be stored as a single string in a single field. Later, the string can be retrieved from the database and unserialized, yielding a copy of the original array.
"Encoding" is a very general term, referring to any format in which data can be represented, or the process of converting data into a particular representation.
For example, in most programming languages, a sequence of data elements can be encoded as an array. (Usually we call arrays "data structures" rather than "encodings", but technically all data structures are also encodings.) Likewise, the word "encoding" itself can be encoded as a sequence of English letters. Those letters themselves may be encoded as written symbols, or as bit patterns in a text encoding such as ASCII or UTF-8.
The process of encoding converts a given piece of data into another representation (another encoding, if you want to be technical about it.)
As you may have already realized, serialization is an example of an encoding process; it transforms a complex data structure into a single sequence of text characters.
That sequence of characters can then be further encoded into some sort of binary representation for more compact storage or quicker transmission. I don't know what form of binary encoding is used in BSON, but presumably it is more efficient than the bit patterns used in text encodings.
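The distinction can be shown in a few lines of Python: json.dumps performs serialization (structure to string), while str.encode performs text encoding (string to bytes):

```python
import json

record = {"name": "encoding", "tags": ["ascii", "utf-8"]}

# Serialization: structured data -> a single string that preserves structure
serialized = json.dumps(record)

# Encoding: that string -> concrete bytes under a chosen text encoding
encoded = serialized.encode("utf-8")

# Both steps reverse cleanly
assert json.loads(encoded.decode("utf-8")) == record
print(type(serialized).__name__, type(encoded).__name__)  # str bytes
```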