How to convert between UTF-16 LE <-> BE? - unicode

If I want to convert UTF-16 BE <-> LE, what should I consider? Can I treat them just a plain 2-byte integer array? Or should I follow some special Unicode algorithm to handle some exceptional case?

You just need to byte-reorder the code units, thus taking two bytes, swapping them and writing them back. That is all there is to consider.
But usually there is a simple way of reading a stream in one encoding and writing it back in another encoding. Often with negligible performance drawbacks (especially in the UTF-16 case). So to make your code clearer you should probably opt for such a solution. But the trivial way should work regardless
if you know the input encoding precisely.

Related

Encoding binary into unicode

I have a byte array that I need to store into a nvarchar DB column. A nvarchar takes 2 bytes. What is the optimal encoding?
Ideally I would store N bytes into a nvarchar of lenght N/2, but there is invalid unicode sequences that worries me.
The most optimal solution would be to store binary in a binary column. So you mean the most optimal encoding within the constraints of this suboptimal scenario?
Just go for base64, it's safe.
If you can't control the input bytes, you're bound to running into encoding problems sooner or later.
Usually Base64 is a good way, but you may use just Unicode code points.
Unicode codepoints go from 0 to 10FFFF, but you can encode easily and efficiently 2 bytes and an half into a Unicode code point. Depending on your requirements, you may shift all codepoints by 128, so that you have ASCII for boundaries (and you do not need to worry about byte 0, and still you have enough code points for the 20bit binary data (per code point). [Or maybe just escape 0 as 0x10000]
This is generic, for Unicode (so generic Unicode). If you know the encoding (e.g. UTF-8, you may choose different encoding).

Are surrogate pairs the only way to represent code points larger than 2 bytes in UTF-16?

I know that this is probably a stupid question, but I need to be sure on this issue. So I need to know for example if a programming language says that its String type uses UTF-16 encoding, does that mean:
it will use 2 bytes for code points in the range of U+0000 to U+FFFF.
it will use surrogate pairs for code points larger than U+FFFF (4 bytes per code point).
Or does some programming languages use their own "tricks" when encoding and do not follow this standard 100%.
UTF-16 is a specified encoding, so if you "use UTF-16", then you do what it says and don't invent any "tricks" of your own.
I wouldn't talk about "two bytes" the way you do, though. That's a detail. The key part of UTF-16 is that you encode code points as a sequence of 16-bit code units, and pairs of surrogates are used to encode code points greater than 0xFFFF. The fact that one code unit is comprised of two 8-bit bytes is a second layer of detail that applies to many systems (but there are systems with larger byte sizes where this isn't relevant), and in that case you may distinguish big- and little-endian representations.
But looking the other direction, there's absolutely no reason why you should use UTF-16 specifically. Ultimately, Unicode text is just a sequence of numbers (of value up to 221), and it's up to you how to represent and serialize those.
I would happily make the case that UTF-16 is a historic accident that we probably wouldn't have done if we had to redo everything now: It is a variable-length encoding just as UTF-8, so you gain no random access, as opposed to UTF-32, but it is also verbose. It suffers endianness problems, unlike UTF-8. Worst of all, it confuses parts of the Unicode standard with internal representation by using actual code point values for the surrogate pairs.
The only reason (in my opinion) that UTF-16 exists is because at some early point people believed that 16 bit would be enough for all humanity forever, and so UTF-16 was envisaged to be the final solution (like UTF-32 is today). When that turned out not to be true, surrogates and wider ranges were tacked onto UTF-16. Today, you should by and large either use UTF-8 for serialization externally or UTF-32 for efficient access internally. (There may be fringe reasons for preferring maybe UCS-2 for pure Asian text.)
UTF-16 per se is standard. However most languages whose strings are based on 16-bit code units (whether or not they claim to ‘support’ UTF-16) can use any sequence of code units, including invalid surrogates. For example this is typically an acceptable string literal:
"x \uDC00 y \uD800 z"
and usually you only get an error when you attempt to write it to another encoding.
Python's optional encode/decode option surrogateescape uses such invalid surrogates to smuggle tokens representing the single bytes 0x80–0xFF into standalone surrogate code units U+DC80–U+DCFF, resulting in a string such as this. This is typically only used internally, so you're unlikely to meet it in files or on the wire; and it only applies to UTF-16 in as much as Python's str datatype is based on 16-bit code units (which is on ‘narrow’ builds between 3.0 and 3.3).
I'm not aware of any other commonly-used extensions/variants of UTF-16.

Are there any real-world uses for converting numbers between different bases?

I know that we need to convert decimal, octal, and hexadecimal into binary, but I am confused about conversion of decimal to octal or octal to hexadecimal or decimal to hexadecimal.
Why and where we need these types of conversion?
Different bases are good for different purposes.
Decimal is obviously what most people know how to deal with, so is good for output of real quantities to end users.
Hex is short and has an even ratio of exactly 2 characters per byte, so it's good for expressing large numbers like SHA1 hashes or private keys and the like in a type-able format, particularly since those numbers don't really represent a quantity, so users don't need to be able to understand them as numbers.
Octal is mostly for legacy reasons -- UNIX file permission codes are traditionally expressed as octal numbers, for example, because three bits per digit corresponds nicely to the three bits per user-category of the UNIX permission encoding scheme.
One sometimes will want to use numbers in one base for a purpose where another base is desired. Thus, the various conversion functions available. In truth, however, my experience is that in practice you almost never convert from one base to another much, except to convert numbers from some non-binary base into binary (in the form of your language of choice's native integral type) and back out into whatever base you need to output. Most of the time one goes from one non-binary base to another is when learning about bases and getting a feel for what numbers in different bases look like, or when debugging using hexadecimal output. Even then, if a computer does it the main method is to convert to binary and then back out, because current computers are just inherently good at dealing with base-2 numbers and not-so-good at anything else.
One important place you see numbers actually stored and operated on in decimal is in some financial applications or others where it's important that "number-of-decimal-place" level precision be preserved. Sometimes fixed-point arithmetic can work for currency, but not always, and if it doesn't using binary-floating-point is a bad idea. Older systems actually had built in support for this in the form of binary-coded-decimal arithmetic. In BCD, each 4 bits acts as a decimal digit, so you give up a chunk of every 4 bits of storage in exchange for maintaining your level of precision in the base-of-choice of the non-computing world.
Oddly enough, there is one common use case for other bases that's a bit hidden. Modern languages with large number support (e.g. Python 2.x's long type or Java's BigInteger and BigDecimal type) will usually store the numbers internally in an array with each element being a digit in some base. Then they implement the math they support on strings of digits of that base. Really efficient bigint implementations may actually use use a base approaching 2^(bits in machine native word size); a base 2^64 number is obviously impossible to usefully output in that form, but doing the calculations in chunks of that size ends up making the best use of space and the CPU. (I don't know if that's the best base; it may be best to use a base of half that number of bits to simplify overflow handling from one digit to the next. It's been awhile since I wrote my own bigint and I never implemented the faster/more-complicated versions of multiplication and division.)
MIME uses hexadecimal system for Quoted Printable encoding (e.g. mail subject in Unicode) abd 64-based system for Base64 encoding.
If your workplace is stuck in IPv4 CIDR - you'll be doing quite a lot of bin -> hex -> decimal conversions managing most of the networking equipment until you get them memorized (or just use some random, simple tool).
Even that usage is a bit few-and-far-between - most businesses just adopt the lazy "/24 everything" approach.
If you do a lot of graphics work - there's the chance you'll want to convert colors between systems and need to convert from hex -> dec... most tools have this built in to the color picker, though.
I suppose there's no practical reason to be able to do other than it's really simple and there's no point not learning how to do it. :)
... unless, for some reason, you're trying to do mantissa binary math in your head.
All of these bases have their uses. Hexadecimal in particular is useful as a shorthand for binary. Every hexadecimal digit is equivalent to 4 bits, so you can write a full 32-bit value as a string of 8 hex digits. Likewise, octal digits are equivalent to 3 bits, and are used frequently as a shorthand for things like Unix file permissions (777 = set read, write, execute bits for user/group/other).
No one base is special--they all have their (obscure) uses. Decimal is special to us because it reflects human experience (10 fingers) but that's really the only reason.
A real world use case: a program prints error code in decimal, to get info from a database or the internet you need the hexadecimal format, because the bits of the error 'number' convey extra info you need to look at it in binary.
I'm there are occasional uses for this. One use case would be a little app that allows user who wants to convert decimal to octal ... like you can with lots of calculators.
But I'm not sure I understand the point of the question. Standard libraries typically don't provide methods like String toOctal(String decimal). Instead, you would normally convert from a decimal String to a primitive integer and then from the primitive integer to (say) an octal String.

Space-saving character encoding for japanese?

In my opinion a common problem: character encoding in combination with a bitmap-font. Most multi-language encodings have an huge space between different character types and even a lot of unused code points there. So if I want to use them I waste a lot of memory (not only for saving multi-byte text - i mean specially for spaces in my bitmap-font) - and VRAM is mostly really valuable... So the only reasonable thing seems to be: Using an custom mapping on my texture for i.e. UTF-8 characters (so that no space is waste). BUT: This effort seems to be same with use an own proprietary character encoding (so also own order of characters in my texture). In my specially case I got texture space for 4096 different characters and need characters to display latin languages as well as japanese (its a mess with utf-8 that only support generall cjk codepages). Had somebody ever a similiar problem (I really wonder, if not)? If theres already any approach?
Edit: The same Problem is described here http://www.tonypottier.info/Unicode_And_Japanese_Kanji/ but it doesnt provide an real solution how to save these bitmapfont mappings to utf-8 space efficent. So any further help is welcome!
Edit2:
Thank you very much for your answer. Im sorry, that my problem wasn't clear enough described.
What I really want to solve, is: The CJK Unicode range is over 20000 characters. But only a subset of around 2000 characters are necessary to display japanese text properly. These characteres are spreaded in range from U+4E00 to U+9FA5. So I need to transform these Unicode Codepoints (only the 2000 for japanese) somehow to the coordinates of my created texture (where I can order the characters also like I want).
i.e. U+4E03 is a japanese character, but U+4E04, U+4E05, U+4E06 is not. Then U+4E07 is a japanese character as well. So the easiest solution, I can see: after character U+4E03 leave three spaces in my texture (or write the not necessary characters U+4E04, U+4E05, U+4E06 there) and then write U+4E07. But this would waste soo much texture space (20000 characters, even if only 2000 are necessary). So I want to be able to put in my texture only: "...U+4E03, U+4E07...". But I have no idea how to write my displayText function then - because I cant know where are the texture coordinates of the glyph I want to display. There would be a hashmap or something like this necessary, but I have no idea how to store these data (it would be a mess to write for every character something like ...{U+4E03, 128}, {U+4E07, 129}... to fill the hasmap).
To the questions:
1) No specific format - so I will write the displayText function by myself.
2) No reason against unicode - its only that CJK range problem for my bitmapfont.
3) I think, thats generally plattform & language independent, but in my case Im using C++ with OpenGL on Mac OS X/iOS.
Thank you very much for your help! If you have any further idea for this, it would really help me a lot!
What is the real problem you want to solve?
Is it that a UTF-8 encoded string occupies three bytes per character? If yes, switch to UTF-16. Otherwise don't blame UTF-8. (Explanation: UTF-8 is just an algorithm to convert a sequence of integers to a sequence of bytes. It has nothing to do with the grouping of characters in codepages. That in turn is what Unicode code points are for.)
Is it that the Unicode code points are distributed over many "codepages" (where a "codepage" means a block of 256 adjacent Unicode code points)? If yes, invent a mapping from the Unicode code points (0x000000 - 0x10FFFF) to a smaller set of integers. In terms of memory this should cost no more than 4 bytes times the number of characters you really need. The lookup time would be approximately 24 memory accesses, 24 integer comparisons and 24 branch instructions. (In fact, this would be a binary search in a tree map.) And if that's too expensive you could use a mapping based on a hash table.
Is it something else? Then please give us some examples, to better understand your problem.
As far as I understand it you should probably write a small utility program that takes as input a set of Unicode code points that you want to use in your application and then generates The code and data for displaying texts. This raises the questions:
Do you have to use a specific bitmap font format or will you write the displayText function yourself?
Is there any reason against using Unicode for all strings and to convert them to your bitmap-optimized encoding just for the time when you render text? The encoding conversion would of course be internal to the displayText method and not visible to the normal application code.
Just out of interest: Is the problem specific to a certain programming language or environment?
Update:
I am assuming that your main problem is some function like this:
Rectangle position(int codepoint)
If I had to do this, I would start by having one bitmap for each character. The bitmap's file name would be the codepoint, so that the "big picture" can be regenerated easily, just in case you find some more characters you need. The preparation consists of the following steps:
Load all the bitmaps and determine their dimensions. The result of this step is a map from integers to (width, height) pairs.
Compute a good layout for the character images in the big picture and remember where each character was placed. Save the big picture. Save the mapping from codepoints to (x, y, width, height) to another file. This can be a text file, or if you don't have disk space, a binary file. The details don't matter.
The displayText function would then work as follows:
void displayText(int x, int y, String s) {
for (char c : s.toCharArray()) { // TODO: handle code points correctly
int codepoint = c;
Rectangle position = positions.get(codepoint);
if (position != null) {
// draw bitmap
x += position.width;
}
}
}
Map<Integer, Rectangle> positions = loadPositionsFromFile();
Now the only problem that is left is how this map can be represented in memory using as little memory as possible, and still be fast enough. That, of course, depends on your programming language.
The in-memory representation could be a few arrays that contain x, y, width, height. For each element, a 16 bit integer should be enough. And probably you only need 8 bits for width and height anyway. Another array would then map the codepoint to the index into positionData (or some special value if the codepoint is not available). This would be an array of 20000 16 bit integers, so in summary you have:
2000 * (2 + 2 + 1 + 1) = 12000 bytes for positionX, positionY, positionWidth and positionHeight
20000 * 2 = 40000 bytes for codepointToIndexInPositionArrays, if you use an array instead of a map.
Compared to the size of the bitmap itself, this should be small enough. And since the arrays don't change they can be in read-only memory.
I believe the most efficient (lossless) method for encoding this data will be to use a Huffman encoding to store your document information. This is a classic information theory problem. You will need to perform a mapping to go from your compressed space to your character space.
This technique will compress your document as efficiently as possible, based on character frequency per document (or whatever domain/documents you choose to apply it to). Only the characters you use will be stored, and they will be stored in an efficient manner directly proportional to how often they are used.
I think the best way for you to solve this problem is to use an existing implementation (UTF16, UTF8...) This will be much less error prone than implementing your own Huffman coding in order to save a little bit of space. Disk space and bandwidth is cheap, errors that anger customers or managers are not. It is my belief that a Huffman encoding will theoretically be the most efficient (lossless) encoding possible, but not the most practical for this application. Check out the link though, this might help with some of these concepts.
-Brian J. Stinar-
UTF-8 is usually a very efficient encoding. If your application focuses primarily on Asia and other regions with multi-byte character sets, you may benefit more from using UTF-16. You could of course write your own encoding, but it won't save yo that much data and it will provide you with a lot of work.
If you really need to compact your data (and I wonder if and why) you could best use some algorithm to compress you UTF data. Most algorithms work more efficient on larger blocks of data, but there are also algorithms for compressing small chunks of text. I think you will save yourself a lot of time if you explore these instead of defining your own encoding.
The paper is pretty much obsolete, it isn't 1980 any more, scrounging bits is not a requirement of almost any display application. When developing an application, e.g. the iPhone you have to plan for l10n across multiple languages so saving a few bits for just Japanese is a bit pointless.
Japan is still on Shift-JIS because like China with GB18030, Hong Kong with BIG5, etc, they have a big, stable, and efficient resource pool already locked into locale encodings. Migrating to Unicode requires re-writing a significant amount of framework tools and the additional testing that ensues.
If you look at the iPod it saves bits by only supporting Latin, Chinese, Japanese, and Korean, skipping Thai and other scripts. As memory prices and dropped and storage increased with the iPhone Apple have been able to add support for more scripts.
UTF-8 is the way to save space, use UTF-8 for storage and convert to UCS-2 or higher for more convenient manipulation and display. The differences between Shift-JIS and Unicode are really pretty minor.
Chinese alone has more than 4096 characters, and I'm not talking punctuation, but characters that are used to form words. From Wikipedia:
The number of Chinese characters contained in the Kangxi dictionary is approximately 47,035, although a large number of these are rarely used variants accumulated throughout history.
Even though many of those are rarely used, even if 90% weren't needed you'd still exhaust your quota. (I think the actual number used in modern text is somewhere around 10 - 20k.)
If you know in advance which characters you'll need to use your best bet may be to create an indirection table of Unicode codepoints to indexes into your texture. Then you only have to put as many characters in your texture as you'll actually use. I believe Flash (and some PDFs) do something like this internally.
You could use multiple bitmaps and load them on demand, instead of a single bitmap that tries to encompass all possible characters.

What is the most efficient binary to text encoding?

The closest contenders that I could find so far are yEnc (2%) and ASCII85 (25% overhead). There seem to be some issues around yEnc mainly around the fact that it uses an 8-bit character set. Which leads to another thought: is there a binary to text encoding based on the UTF-8 character set?
This really depends on the nature of the binary data, and the constraints that "text" places on your output.
First off, if your binary data is not compressed, try compressing before encoding. We can then assume that the distribution of 1/0 or individual bytes is more or less random.
Now: why do you need text? Typically, it's because the communication channel does not pass through all characters equally. e.g. you may require pure ASCII text, whose printable characters range from 0x20-0x7E. You have 95 characters to play with. Each character can theoretically encode log2(95) ~= 6.57 bits per character. It's easy to define a transform that comes pretty close.
But: what if you need a separator character? Now you only have 94 characters, etc. So the choice of an encoding really depends on your requirements.
To take an extremely stupid example: if your channel passes all 256 characters without issues, and you don't need any separators, then you can write a trivial transform that achieves 100% efficiency. :-) How to do so is left as an exercise for the reader.
UTF-8 is not a good transport for arbitrarily encoded binary data. It is able to transport values 0x01-0x7F with only 14% overhead. I'm not sure if 0x00 is legal; likely not. But anything above 0x80 expands to multiple bytes in UTF-8. I'd treat UTF-8 as a constrained channel that passes 0x01-0x7F, or 126 unique characters. If you don't need delimeters then you can transmit 6.98 bits per character.
A general solution to this problem: assume an alphabet of N characters whose binary encodings are 0 to N-1. (If the encodings are not as assumed, then use a lookup table to translate between our intermediate 0..N-1 representation and what you actually send and receive.)
Assume 95 characters in the alphabet. Now: some of these symbols will represent 6 bits, and some will represent 7 bits. If we have A 6-bit symbols and B 7-bit symbols, then:
A+B=95 (total number of symbols)
2A+B=128 (total number of 7-bit prefixes that can be made. You can start 2 prefixes with a 6-bit symbol, or one with a 7-bit symbol.)
Solving the system, you get: A=33, B=62. You now build a table of symbols:
Raw Encoded
000000 0000000
000001 0000001
...
100000 0100000
1000010 0100001
1000011 0100010
...
1111110 1011101
1111111 1011110
To encode, first shift off 6 bits of input. If those six bits are greater or equal to 100001 then shift another bit. Then look up the corresponding 7-bit output code, translate to fit in the output space and send. You will be shifting 6 or 7 bits of input each iteration.
To decode, accept a byte and translate to raw output code. If the raw code is less than 0100001 then shift the corresponding 6 bits onto your output. Otherwise shift the corresponding 7 bits onto your output. You will be generating 6-7 bits of output each iteration.
For uniformly distributed data I think this is optimal. If you know that you have more zeros than ones in your source, then you might want to map the 7-bit codes to the start of the space so that it is more likely that you can use a 7-bit code.
The short answer would be: No, there still isn't.
I ran into the problem with encoding as much information into JSON string, meaning UTF-8 without control characters, backslash and quotes.
I went out and researched how many bit you can squeeze into valid UTF-8 bytes. I disagree with answers stating that UTF-8 brings too much overhead. It's not true.
If you take into account only one-byte sequences, it's as powerful as standard ASCII. Meaning 7 bits per byte. But if you cut out all special characters you'll be left with something like Ascii85.
But there are fewer control characters in higher planes. So if you use 6-byte chunks you'll be able to encode 5 bytes per chunk. In the output you'll get any combination of UTF-8 characters of any length (for 1 to 6 bytes).
This will give you a better result than Ascii85: 5/6 instead of 4/5, 83% efficiency instead of 80%. In theory it'll get even better with higher chunk length: about 84% at 19-byte chunks.
In my opinion the encoding process becomes too complicated while it provides very little profit. So Ascii85 or some modified version of it (I'm looking at Z85 now) would be better.
I searched for most efficient binary to text encoding last year. I realized for myself that compactness is not the only criteria. The most important is where you are able to use encoded string. For example, yEnc has 2% overhead, but it is 8-bit encoding, so its usage is very very limited.
My choice is Z85. It has acceptable 25% overhead, and encoded string can be used almost everywhere: XML, JSON, source code etc. See Z85 specification for details.
Finally, I've written Z85 library in C/C++ and use it in production.
According to Wikipedia
basE91 produces the shortest plain ASCII output for compressed 8-bit binary input.
Currently base91 is the best encoding if you're limited to ASCII characters only and don't want to use non-printable characters. It also has the advantage of lightning fast encoding/decoding speed because a lookup table can be used, unlike base85 which has to be decoded using slow divisions
Going above that base122 will help increasing efficiency a little bit, but it's not 8-bit clean. However because it's based on UTF-8 encoding, it should be fine to use for many purposes. And 8-bit clean is just meaningless nowadays
Note that base122 is in fact base-128 because the 6 invalid values (128 – 122) are encoded specially so that a series of 14 bits can always be represented with at most 2 bytes, exactly like base-128 where 7 bits will be encoded in 1 byte, and in reality can be optimized to be more efficient than base-128
Base-122 Encoding
Base-122 encoding takes chunks of seven bits of input data at a time. If the chunk maps to a legal character, it is encoded with the single byte UTF-8 character: 0xxxxxxx. If the chunk would map to an illegal character, we instead use the the two-byte UTF-8 character: 110xxxxx 10xxxxxx. Since there are only six illegal code points, we can distinguish them with only three bits. Denoting these bits as sss gives us the format: 110sssxx 10xxxxxx. The remaining eight bits could seemingly encode more input data. Unfortunately, two-byte UTF-8 characters representing code points less than 0x80 are invalid. Browsers will parse invalid UTF-8 characters into error characters. A simple way of enforcing code points greater than 0x80 is to use the format 110sss1x 10xxxxxx, equivalent to a bitwise OR with 0x80 (this can likely be improved, see §4). Figure 3 summarizes the complete base-122 encoding.
http://blog.kevinalbs.com/base122
See also How viable is base128 encoding for scenarios like JavaScript strings?
Next to the ones listed on Wikipedia, there is Bommanews:
B-News (or bommanews) was developed to lift the weight of the overhead inherent to UUEncode and Base64 encoding: it uses a new encoding method to stuff binary data in text messages. This method eats more CPU resources, but it manages to lower the loss from approximately 40% for UUEncode to 3.5% (the decimal point between those digits is not dirt on your monitor), while still avoiding the use of ANSI control codes in the message body.
It's comparable to yEnc: source
yEnc is less CPU-intensive than B-News and reaches about the same low level of overhead, but it doesn't avoid the use of all control codes, it just leaves out those that were (experimentally) observed to have undesired effects on some servers, which means that it's somewhat less RFC compliant than B-News.
http://b-news.sourceforge.net/
http://www.iguana.be/~stef/
http://bnews-plus.sourceforge.net/
If you are looking for an efficient encoding for large alphabets, you might want to try escapeless. Both escapeless252 and yEnc have 1.6% overhead, but with the first it's fixed and known in advance while with the latter it actually ranges from 0 to 100% depending on the distribution of bytes.