Encoding binary into unicode - unicode

I have a byte array that I need to store into a nvarchar DB column. A nvarchar takes 2 bytes. What is the optimal encoding?
Ideally I would store N bytes into a nvarchar of lenght N/2, but there is invalid unicode sequences that worries me.

The most optimal solution would be to store binary in a binary column. So you mean the most optimal encoding within the constraints of this suboptimal scenario?
Just go for base64, it's safe.
If you can't control the input bytes, you're bound to running into encoding problems sooner or later.

Usually Base64 is a good way, but you may use just Unicode code points.
Unicode codepoints go from 0 to 10FFFF, but you can encode easily and efficiently 2 bytes and an half into a Unicode code point. Depending on your requirements, you may shift all codepoints by 128, so that you have ASCII for boundaries (and you do not need to worry about byte 0, and still you have enough code points for the 20bit binary data (per code point). [Or maybe just escape 0 as 0x10000]
This is generic, for Unicode (so generic Unicode). If you know the encoding (e.g. UTF-8, you may choose different encoding).

Related

16-bit encoding that has all bits mapped to some value

UTF-32 has its last bits zeroed.
As I understand it UTF-16 doesn't use all its bits either.
Is there a 16-bit encoding that has all bit combinations mapped to some value, preferably a subset of UTF, like ASCII for 7-bit?
UTF-32 has its last bits zeroed
This might be not correct, depending on how you count. Typically we count from left, so the high (i.e. first) bits of UTF-32 will be zero
As I understand it UTF-16 doesn't use all its bits either
It's not correct either. UTF-16 uses all of its bits. It's just that the range [0xD800—0xDFFF] is reserved for UTF-16 surrogate pairs so those values will never be assigned any character and will never appear in UTF-32. If you need to encode characters outside the BMP with UTF-16 then those values will be used
In fact Unicode was limited to U+10FFFF just because of UTF-16, even though UTF-8 and UTF-32 themselves are able to represent up to U+7FFFFFFF and U+FFFFFFFF respectively. The use of surrogate pair makes it impossible to encode values larger than 0x10FFFF in UTF-16
See Why Unicode is restricted to 0x10FFFF?
Is there a 16 bit encoding that has all bit combinations mapped to some value, preferably a subset of UTF, like ASCII for 7 bit?
First there's no such thing as "a subset of UTF", since UTF isn't a character set but a way to encode Unicode code points
Prior to the existence of UTF-16 Unicode was a fixed 16-bit character set encoded with UCS-2. So UCS-2 might be the closest you'll get, which encodes only the characters in the BMP. Other fixed 16-bit non-Unicode charsets also has an encoding that maps all of the bit combinations to some characters
However why would you want that? UCS-2 has been deprecated long ago. Some old tools and less experienced programmers still imply that Unicode is always 16-bit long like that which is correct and will break modern text processing
Also note that not all the values below 0xFFFF are assigned, so no encoding can map every 16-bit value to a Unicode code point
Further reading
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
What is a "surrogate pair" in Java?

Why use Base64?

Base64 encoding increases the size of the input by around 37% when sent over the wire. If this is the case, why not use UTF-8 to encode the contents(say a .jpg file). This way the size of the file does not increase right?
eg: If I want to send the string "asd", a UTF-8 encoded version of this will be 3 bytes, whereas a Base64 encoded version will be 4 bytes.
The purpose of Base64 is to allow binary data to be transferred over a communication channel that cannot be relied on to transfer all possible byte values end-to-end. In particular, Base64 is used where byte values between 128 and 255 cannot be easily and reliably transferred.
In contrast, UTF-8 is used to encode Unicode across a channel that can be assumed to reliably transfer all possible byte values end-to-end (sometime referred to as an "8-bit clean" channel).
So, you have two problems with your proposal. First, a JPEG is binary data, not Unicode, so UTF-8 isn't really appropriate: if you "encode a JPEG as UTF-8" in the obvious way (treating the JPEG as a sequence of bytes, each associated with a Unicode code point from U+00 to U+FF, and then encoding those code points as UTF-8), it will double the size of all byte values from 128-255, so you'll have, on average, a 50% increase in file size. Second, even if you did this, the resulting encoded JPEG would require a communication channel that's 8-bit clean, so it couldn't be used in situations where Base64 is needed anyway.
Edit: In a comment, you asked if we couldn't use "input binary -> 7 bit ASCII encoding -> send over wire" to save space. I assume you mean taking the input binary as a long stream of bits and chopping them up into 7-bit chunks and sending those as ASCII? Yes, that could be done and would only increase size by 14%, but it's not just the non-ASCII byte values 128-255 that cause problems. In MIME email, where Base64 is most frequently used, differences in line-ending convention (carriage return, line feed, or a combination) from platform to platform, certain historical line length restrictions enshrined in the standards, and so on mean that not all ASCII characters (bytes 0-127) can be safely used. Base64 is not the best trade-off possible between compatibility and efficiency, but it's pretty close.
Base64 is usually used in instances to represent arbitrary binary data in a text format, it has a 33.3'% overhead but that's better than say hex notation which has a 50% overhead.
utf-8 is a text encoding which cannot represent arbitrary binary data which is what a jped file is.
There is little to no reason to convert the binary data to text to transfer it over the wire so a many times people do it because they don't know any better.
The only reason to use it is if you get it from apis or libraries.

Are surrogate pairs the only way to represent code points larger than 2 bytes in UTF-16?

I know that this is probably a stupid question, but I need to be sure on this issue. So I need to know for example if a programming language says that its String type uses UTF-16 encoding, does that mean:
it will use 2 bytes for code points in the range of U+0000 to U+FFFF.
it will use surrogate pairs for code points larger than U+FFFF (4 bytes per code point).
Or does some programming languages use their own "tricks" when encoding and do not follow this standard 100%.
UTF-16 is a specified encoding, so if you "use UTF-16", then you do what it says and don't invent any "tricks" of your own.
I wouldn't talk about "two bytes" the way you do, though. That's a detail. The key part of UTF-16 is that you encode code points as a sequence of 16-bit code units, and pairs of surrogates are used to encode code points greater than 0xFFFF. The fact that one code unit is comprised of two 8-bit bytes is a second layer of detail that applies to many systems (but there are systems with larger byte sizes where this isn't relevant), and in that case you may distinguish big- and little-endian representations.
But looking the other direction, there's absolutely no reason why you should use UTF-16 specifically. Ultimately, Unicode text is just a sequence of numbers (of value up to 221), and it's up to you how to represent and serialize those.
I would happily make the case that UTF-16 is a historic accident that we probably wouldn't have done if we had to redo everything now: It is a variable-length encoding just as UTF-8, so you gain no random access, as opposed to UTF-32, but it is also verbose. It suffers endianness problems, unlike UTF-8. Worst of all, it confuses parts of the Unicode standard with internal representation by using actual code point values for the surrogate pairs.
The only reason (in my opinion) that UTF-16 exists is because at some early point people believed that 16 bit would be enough for all humanity forever, and so UTF-16 was envisaged to be the final solution (like UTF-32 is today). When that turned out not to be true, surrogates and wider ranges were tacked onto UTF-16. Today, you should by and large either use UTF-8 for serialization externally or UTF-32 for efficient access internally. (There may be fringe reasons for preferring maybe UCS-2 for pure Asian text.)
UTF-16 per se is standard. However most languages whose strings are based on 16-bit code units (whether or not they claim to ‘support’ UTF-16) can use any sequence of code units, including invalid surrogates. For example this is typically an acceptable string literal:
"x \uDC00 y \uD800 z"
and usually you only get an error when you attempt to write it to another encoding.
Python's optional encode/decode option surrogateescape uses such invalid surrogates to smuggle tokens representing the single bytes 0x80–0xFF into standalone surrogate code units U+DC80–U+DCFF, resulting in a string such as this. This is typically only used internally, so you're unlikely to meet it in files or on the wire; and it only applies to UTF-16 in as much as Python's str datatype is based on 16-bit code units (which is on ‘narrow’ builds between 3.0 and 3.3).
I'm not aware of any other commonly-used extensions/variants of UTF-16.

Why must unicode use utf-8?

as far I know, the UNICODE is the industry standard for character mapping.
What I don't get is that why it has to be encoded via UTF-8 and not directly as Unicode?
Say the letter "a", why can't it be just stored as a String with "U+0061" as the value, and must be stored as octal 0061?
do i make any sense?
Who says it must be encoded as UTF-8? There are several common encodings for Unicode, including UTF-16 (big- or little-endian), and some less common ones such as UTF-7 and UTF-32.
Unicode itself is not an encoding; it's merely a specification of numeric code points for several thousand characters.
The Unicode code point for lowercase a is 0x61 in hexadecimal, 97 in decimal, or 0141 in octal.
If you're suggesting that 'a' should be encoded as the 6-character ASCII string "U+0061", that would be terribly wasteful of space and more difficult to decode than UTF-8.
If you're suggesting storing the numeric values directly, that's what UTF-32 does: it stores each character as a 32-bit (4-octet) number that directly represents the code point. The trouble with that is that it's nearly as wasteful of space as "U+0061" (4 bytes per character vs. 6.)
The UTF-8 encoding has a number of advantages. One is that it's upward compatible with ASCII. Another is that it's reasonably efficient even for non-ASCII characters, as long as most of the encoded text is within the first few thousand code points.
UTF-16 has some other advantages, but I personally prefer UTF-8. MS Windows tends to use UTF-16, but mostly for historical reasons; Windows added Unicode support when there were fewer than 65536 defined code points, which made UTF-16 equvalent to UCS-2, which is a simpler representation.
UTF-8 is only one 'memory format' of Unicode. There is also UTF-16, UTF-32 and a number of other memory mapping formats.
UTF-8 has been used broadly because it is upwardly compatible with an 8 bit character code like Ascii.
You can tell a browser via html, mySQL at several levels, and Notepad++ vie encoding option to use other formats for the data they operate on.
DuckDuckGo or Google Unicode and you will find plenty of articles on this on the internet. Here is one: https://ssl.icu-project.org/docs/papers/forms_of_unicode/
Say the letter "a", why can't it be just stored as a String with "U+0061" as the value
Stored data is a sequence of byte values, generally interpreted at the lowest level as numbers. We usually use bytes that can be one of 256 values, so we look at them as numbers in the range 0 to 255.
So when you say 'just stored as a String with "U+0061"' what sequence of numbers in the range 0-255 do you mean?
Unicode code points like U+0061 are written in hexadecimal. Hexadecimal 61 is the number 97 in the more familiar decimal system, so perhaps you think that the letter 'a' should be stored as a single byte with the value 97. You might be surprised to learn that this is exactly how the encoding UTF-8 represents this string.
Of course there are more than 256 characters defined in Unicode, so not all Unicode characters can be stored as bytes with the same value as their Unicode codepoint. UTF-8 has one way of dealing with this, and there are other encodings with different ways.
UTF-32, for example, is an encoding which uses 4 bytes together at a time to represent a codepoint. Since one byte has 256 values four bytes can have 256 × 256 × 256 × 256, or 4,294,967,296 different arrangements. We can number those arrangements of bytes from 0 to 4,294,967,295 and then store every Unicode codepoint as the arrangement of bytes that we've numbered with the number corresponding to the Unicode codepoint value. This is exactly what UTF-32 does.
(However, there are different ways to assign numbers to those arrangements of four bytes and so there are multiple versions of UTF-32, such as UTF-32BE and UTF-32LE. Typically a particular medium of storing or transmitting bytes specifies its own numbering scheme, and the encoding 'UTF-32' without further qualification implies that whatever the medium's native scheme is should be used.)
Read this article:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)
do i make any sense?
Not a lot! (Read on ...)
as far I know, the UNICODE (sic) is the industry standard for character mapping.
That is incorrect. Unicode IS NOT a standard for character mapping. It is a standard that defines a set of character codes and what they mean.
It is essentially a catalogue that defines a mapping of codes (Unicode "code points") to conceptual characters, but it is not a standard for mapping characters. It certainly DOES NOT define a standard way to represent the code points; i.e. a mapping to a representation. (That is what character encoding schemes do!)
What I don't get is that why it has to be encoded via UTF-8 and not directly as Unicode?
That is incorrect. Character data DOES NOT have to be encoded in UTF-8. It can be encoded as UTF-8. But it can also be encoded in a number of other ways too:
The Unicode has specified a number of encoding schemes, including UTF-8, UTF-16 and UTF-32, and various historical variants.
There are many other standard encoding schemes (probably hundreds of them). This Wikipedia page lists some of the common ones.
The various different encoding schemes have different purposes (and different limitations). For example:
ASCII and LATIN-1 are 7 and 8-bit character sets (respectively) that encode a small subset of Unicode code-points. (ASCII encodes roman letters and numbers, some punctuation, and "control codes". LATIN-1 adds a number of accented latin letters using in Western Europe and some other common "typographical" characters.)
UTF-8 is a variable length encoding scheme that encodes Unicode code points as 1 to 5 bytes (octets). (It is biased towards western usage ... since it encodes all latin / roman letters and numbers as single bytes.)
UTF-16 is designed for encoding Unicode code points in 16-bit units. (Java Strings are essentially UTF-16 encoded.)
Say the letter "a", why can't it be just stored as a String with "U+0061" as the value, and must be stored as octal 0061?
In fact, a Java String is represented as a sequence of char values. The char type is a 16-bit unsigned integer type; i.e. it has values 0 through 65535. And the char value that represents a lowercase "a" character is hex 0061 == octal 141 == decimal 97.
You are incorrect about "octal 0061" ... but I can't figure out what distinction you are actually trying to make here, so I can't really comment on that.

What is the most efficient binary to text encoding?

The closest contenders that I could find so far are yEnc (2%) and ASCII85 (25% overhead). There seem to be some issues around yEnc mainly around the fact that it uses an 8-bit character set. Which leads to another thought: is there a binary to text encoding based on the UTF-8 character set?
This really depends on the nature of the binary data, and the constraints that "text" places on your output.
First off, if your binary data is not compressed, try compressing before encoding. We can then assume that the distribution of 1/0 or individual bytes is more or less random.
Now: why do you need text? Typically, it's because the communication channel does not pass through all characters equally. e.g. you may require pure ASCII text, whose printable characters range from 0x20-0x7E. You have 95 characters to play with. Each character can theoretically encode log2(95) ~= 6.57 bits per character. It's easy to define a transform that comes pretty close.
But: what if you need a separator character? Now you only have 94 characters, etc. So the choice of an encoding really depends on your requirements.
To take an extremely stupid example: if your channel passes all 256 characters without issues, and you don't need any separators, then you can write a trivial transform that achieves 100% efficiency. :-) How to do so is left as an exercise for the reader.
UTF-8 is not a good transport for arbitrarily encoded binary data. It is able to transport values 0x01-0x7F with only 14% overhead. I'm not sure if 0x00 is legal; likely not. But anything above 0x80 expands to multiple bytes in UTF-8. I'd treat UTF-8 as a constrained channel that passes 0x01-0x7F, or 126 unique characters. If you don't need delimeters then you can transmit 6.98 bits per character.
A general solution to this problem: assume an alphabet of N characters whose binary encodings are 0 to N-1. (If the encodings are not as assumed, then use a lookup table to translate between our intermediate 0..N-1 representation and what you actually send and receive.)
Assume 95 characters in the alphabet. Now: some of these symbols will represent 6 bits, and some will represent 7 bits. If we have A 6-bit symbols and B 7-bit symbols, then:
A+B=95 (total number of symbols)
2A+B=128 (total number of 7-bit prefixes that can be made. You can start 2 prefixes with a 6-bit symbol, or one with a 7-bit symbol.)
Solving the system, you get: A=33, B=62. You now build a table of symbols:
Raw Encoded
000000 0000000
000001 0000001
...
100000 0100000
1000010 0100001
1000011 0100010
...
1111110 1011101
1111111 1011110
To encode, first shift off 6 bits of input. If those six bits are greater or equal to 100001 then shift another bit. Then look up the corresponding 7-bit output code, translate to fit in the output space and send. You will be shifting 6 or 7 bits of input each iteration.
To decode, accept a byte and translate to raw output code. If the raw code is less than 0100001 then shift the corresponding 6 bits onto your output. Otherwise shift the corresponding 7 bits onto your output. You will be generating 6-7 bits of output each iteration.
For uniformly distributed data I think this is optimal. If you know that you have more zeros than ones in your source, then you might want to map the 7-bit codes to the start of the space so that it is more likely that you can use a 7-bit code.
The short answer would be: No, there still isn't.
I ran into the problem with encoding as much information into JSON string, meaning UTF-8 without control characters, backslash and quotes.
I went out and researched how many bit you can squeeze into valid UTF-8 bytes. I disagree with answers stating that UTF-8 brings too much overhead. It's not true.
If you take into account only one-byte sequences, it's as powerful as standard ASCII. Meaning 7 bits per byte. But if you cut out all special characters you'll be left with something like Ascii85.
But there are fewer control characters in higher planes. So if you use 6-byte chunks you'll be able to encode 5 bytes per chunk. In the output you'll get any combination of UTF-8 characters of any length (for 1 to 6 bytes).
This will give you a better result than Ascii85: 5/6 instead of 4/5, 83% efficiency instead of 80%. In theory it'll get even better with higher chunk length: about 84% at 19-byte chunks.
In my opinion the encoding process becomes too complicated while it provides very little profit. So Ascii85 or some modified version of it (I'm looking at Z85 now) would be better.
I searched for most efficient binary to text encoding last year. I realized for myself that compactness is not the only criteria. The most important is where you are able to use encoded string. For example, yEnc has 2% overhead, but it is 8-bit encoding, so its usage is very very limited.
My choice is Z85. It has acceptable 25% overhead, and encoded string can be used almost everywhere: XML, JSON, source code etc. See Z85 specification for details.
Finally, I've written Z85 library in C/C++ and use it in production.
According to Wikipedia
basE91 produces the shortest plain ASCII output for compressed 8-bit binary input.
Currently base91 is the best encoding if you're limited to ASCII characters only and don't want to use non-printable characters. It also has the advantage of lightning fast encoding/decoding speed because a lookup table can be used, unlike base85 which has to be decoded using slow divisions
Going above that base122 will help increasing efficiency a little bit, but it's not 8-bit clean. However because it's based on UTF-8 encoding, it should be fine to use for many purposes. And 8-bit clean is just meaningless nowadays
Note that base122 is in fact base-128 because the 6 invalid values (128 – 122) are encoded specially so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128 where 7 bits will be encoded in 1 byte, and in reality can be optimized to be more efficient than base-128
Base-122 Encoding
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
http://blog.kevinalbs.com/base122
See also How viable is base128 encoding for scenarios like JavaScript strings?
Next to the ones listed on Wikipedia, there is Bommanews:
B-News (or bommanews) was developed to lift the weight of the overhead inherent to UUEncode and Base64 encoding: it uses a new encoding method to stuff binary data in text messages. This method eats more CPU resources, but it manages to lower the loss from approximately 40% for UUEncode to 3.5% (the decimal point between those digits is not dirt on your monitor), while still avoiding the use of ANSI control codes in the message body.
It's comparable to yEnc: source
yEnc is less CPU-intensive than B-News and reaches about the same low level of overhead, but it doesn't avoid the use of all control codes, it just leaves out those that were (experimentally) observed to have undesired effects on some servers, which means that it's somewhat less RFC compliant than B-News.
http://b-news.sourceforge.net/
http://www.iguana.be/~stef/
http://bnews-plus.sourceforge.net/
If you are looking for an efficient encoding for large alphabets, you might want to try escapeless. Both escapeless252 and yEnc have 1.6% overhead, but with the first it's fixed and known in advance while with the latter it actually ranges from 0 to 100% depending on the distribution of bytes.