I cannot understand some key elements of encoding:
Is ASCII only a character set, or does it also have its own encoding scheme/algorithm?
Do other Windows code pages such as Latin1 have their own encoding algorithms?
Are UTF-7, 8, 16 and 32 the only encoding algorithms?
Are the UTF algorithms used only with the Unicode character set?
Given the ASCII text: Hello World, if I want to convert it into Latin1 or BIG5, which encoding algorithms are used in this process? More specifically, do Latin1/Big5 use their own encoding algorithm, or do I have to use a UTF algorithm?
1: Ascii is just an encoding — a really simple encoding. It's literally just the positive end of a signed byte (0...127) mapped to characters and control codes.
Refer to https://www.ascii.codes/ to see the full set and inspect the characters.
There are definitely encoding algorithms to convert ascii strings to and from strings in other encodings, but there is no compression/decompression algorithm required to write or read ascii strings like there is for utf8 or utf16, if that's what you're implying.
2: LATIN-1 is also not a compressed (usually called 'variable width') encoding, so there's no algorithm needed to get in and out of it.
See https://kb.iu.edu/d/aepu for a nice description of LATIN-1 conceptually and of each character in the set. Like a lot of encodings, its first 128 slots are just ascii. Like ascii, it's 1 byte in size, but it's an unsigned byte, so after the last ascii character (DEL/127), LATIN1 adds another 128 characters.
As with any conversion from one string encoding to another, there is an algorithm specifically tailored to that conversion.
3: Again, unicode encodings are just that — encodings. But they're all compressed except for utf32. So unless you're working with utf32 there is always a compression/decompression step required to write and read them.
Note: When working with utf32 strings there is one nonlinear oddity that has to be accounted for... combining characters. Technically that is yet another type of compression since they save space by not giving a codepoint to every possible combination of uncombined character and combining character. They "precombine" a few, but they would run out of slots very quickly if they did them all.
4: Yes. The compression/decompression algorithms for the compressed unicode encodings are just for those encodings. They would not work for any other encoding.
Think of it like zip/unzip. Unzipping anything other than a zipped file or folder would of course not work. That goes for things that are not compressed in the first place and also things that are compressed but using another compression algorithm (e.g.: rar).
I recently wrote the utf8 and utf16 compression/decompression code for a new cross-platform library being developed, and I can tell you quite confidently if you feed a Big5-encoded string into my method written specifically for decompressing utf8... not only would it not work, it might very well crash.
Re: your "Hello World" question... Refer to my answer to your second question about LATIN-1. No conversion is required to go from ascii to LATIN-1 because the first 128 characters (0...127) of LATIN-1 are ascii. If you're converting from LATIN-1 to ascii, the same is true for the lower half of LATIN-1, but if any of the characters beyond 127 are in the string, it would be what's called a "lossy"/partial conversion or an outright failure, depending on your tolerance level for lossiness. In your example, however, all of the characters in "Hello World" have the exact same values in both encodings, so it would convert perfectly, without loss, in either direction.
I know practically nothing about Big5, but regardless, don't use utf-x algos for other encodings. Each one of those is written very specifically for 1 particular encoding (or in the case of conversion: pair of encodings).
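The "Hello World" case can be checked with Python's built-in codecs. This is only a sketch: the codec names `ascii`, `latin-1` and `big5` are Python's own names, and the sample word "café" is an assumption added for illustration.

```python
text = "Hello World"

# Pure-ASCII text encodes to the same bytes in all three encodings.
ascii_bytes = text.encode("ascii")
latin1_bytes = text.encode("latin-1")
big5_bytes = text.encode("big5")
print(ascii_bytes == latin1_bytes == big5_bytes)

# Going from Latin-1 back to ASCII is lossy once a character above 127 appears.
cafe = "caf\u00e9".encode("latin-1")   # the last byte is 0xE9, outside ASCII
decoded = cafe.decode("latin-1")
try:
    decoded.encode("ascii")            # strict mode: outright failure
    strict_ok = True
except UnicodeEncodeError:
    strict_ok = False

# Lossy mode: the unencodable character is replaced by "?".
lossy = decoded.encode("ascii", errors="replace")
print(strict_ok, lossy)
```

Whether to fail or to substitute a placeholder is the "tolerance level for lossiness" choice; Python exposes it through the `errors` argument.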
If you're curious about utf8/16 compression/decompression algorithms, the unicode website is where you should start (watch out though. they don't use the compression/decompression metaphor in their documentation):
http://unicode.org
You probably won't need anything else.
... except maybe a decent codepoint lookup tool: https://www.unicode.codes/
You can roll your own code based on the unicode documentation, or use the official unicode library:
http://site.icu-project.org/home
Hope this helps.
In general, most encoding schemes like ASCII or Latin-1 are simply big tables mapping characters to specific byte sequences. There may or may not be some specific algorithm how the creators came up with those specific character⟷byte associations, but there's generally not much more to it than that.
One of the innovations of Unicode specifically is the indirection of assigning each character a unique number first and foremost, and worrying about how to encode that number into bytes secondarily. There are a number of encoding schemes for how to do this, from the UCS and GB 18030 encodings to the most commonly used UTF-8/UTF-16 encodings. Some are largely defunct by now like UCS-2. Each one has their pros and cons in terms of space tradeoffs, ease of processing and transportability (e.g. UTF-7 for safe transport over 7-bit system like email). Unless otherwise noted, they can all encode the full set of current Unicode characters.
To convert from one encoding to another, you pretty much need to map bytes from one table to another. Meaning, if you look at the EBCDIC table and the Windows 1250 table, the characters 0xC1 and 0x41 respectively both seem to represent the same character "A", so when converting between the two encodings, you'd map those bytes as equivalent. Yes, that means there needs to be one such mapping between each possible encoding pair.
Since that is obviously rather laborious, modern converters virtually always go through Unicode as a middleman. This way each encoding only needs to be mapped to the Unicode table, and the conversion can be done with encoding A → Unicode code point → encoding B. In the end you just want to identify which characters look the same/mean the same, and change the byte representation accordingly.
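The "encoding A → Unicode code point → encoding B" route can be sketched in Python. The codec names `cp037` (one EBCDIC variant) and `cp1250` are Python's names, chosen here as assumptions to match the EBCDIC / Windows 1250 example.

```python
# 0xC1 is "A" in EBCDIC (Python's cp037 codec is one EBCDIC variant).
ebcdic_bytes = b"\xc1"

# Step 1: encoding A -> Unicode code point.
char = ebcdic_bytes.decode("cp037")

# Step 2: Unicode code point -> encoding B.
cp1250_bytes = char.encode("cp1250")

print(char, hex(ord(char)), cp1250_bytes)
```

With this design each codec only needs one table (its own bytes to and from Unicode), instead of one table per pair of encodings.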
A character encoding is a mapping from a sequence of characters to a sequence of bytes (in the past there were also encodings to a sequence of bits - they are falling out of fashion). Usually this mapping is one-to-one but not necessarily onto. This means there may be byte sequences that don't correspond to a character sequence in this encoding.
The domain of the mapping defines which characters can be encoded.
Now to your questions:
ASCII is both: it defines 128 characters (some of them are control codes) and how they are mapped to the byte values 0 to 127.
Each encoding may define its own set of characters and how they are mapped to bytes.
No, there are others as well: ASCII, ISO-8859-1, ...
Unicode uses a two step mapping: first the characters are mapped to (relatively) small integers called "code points", then these integers are mapped to a byte sequence. The first part is the same for all UTF encodings, the second step differs. Unicode has the ambition to contain all characters. This means, most characters are in the "UNICODE set".
Every character has been assigned a Unicode value (numbered from 0 upward), and that value is unique. How you use that value is up to you: you can use it directly, or you can use a well-known encoding scheme such as UTF-8 or UTF-16. An encoding scheme maps the Unicode value to a specific bit sequence (currently 1 to 4 bytes, or maybe 8 in future if we get to know about all the languages of universe/aliens/multiverse) so that it can be uniquely identified within that scheme.
For example, ASCII is an encoding scheme that encodes only 128 of all those characters. It uses one byte per character, which is identical to the UTF-8 representation of the same characters. GSM7 is another format, which uses 7 bits per character to encode 128 characters from the Unicode character list.
Utf8:
It uses 1 byte for characters whose Unicode value is up to 127.
Beyond this it has its own way of representing the Unicode values.
It uses 2 bytes for Cyrillic, then 3 bytes for Hindi characters.
Utf16:
It uses 2 bytes for characters whose Unicode value is up to 127,
and it also uses 2 bytes for Cyrillic and Hindi characters.
These UTF encoding schemes fix the leading bits in a specific pattern [e.g. 110|restbits], and the remaining bits [e.g. initialbits|11001] carry the Unicode value, so that each representation is unique.
The Wikipedia articles on UTF-8, UTF-16 and Unicode will make it clear.
I coded a UTF translator which converts incoming UTF-8 text in any language into its equivalent UTF-16 text.
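A translator of that kind can be sketched in a few lines of Python by going through code points. This is not the author's translator; the function name `utf8_to_utf16` and the choice of big-endian output without a byte order mark are assumptions made for illustration.

```python
def utf8_to_utf16(data: bytes) -> bytes:
    """Transcode UTF-8 bytes to UTF-16 (big-endian, no BOM) via code points."""
    return data.decode("utf-8").encode("utf-16-be")

# Latin a (U+0061), Cyrillic ya (U+044F), Devanagari ha (U+0939)
sizes = []
for ch in ["a", "\u044f", "\u0939"]:
    u8 = ch.encode("utf-8")
    u16 = utf8_to_utf16(u8)
    sizes.append((len(u8), len(u16)))
    print(f"U+{ord(ch):04X}  utf-8: {len(u8)} byte(s)  utf-16: {len(u16)} byte(s)")
```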
What's the smallest hash I can get without making things overly collidable? I figure a good example is hashing "foo".
input = foo
sha1 = 0beec7b5ea3f0fdbc95d0dd47f3c5bc275da8a33
sha1 + b64 = C+7Hteo/D9vJXQ3UfzxbwnXaijM
Are there any other standards out there like Base64 that utilize Unicode characters? Maybe including upper/lower umlaut characters such as Ü and ü to pack more bits into each character? Ideally I'd love to compress the sha1 hash into 4-6 Unicode characters I can tack onto a URL.
Reversibly encoding the hash doesn't impact collision rate... Unless your encoding causes some loss of data (then it isn't reversible any more).
Base64 and other binary-to-text encoding schemes are all reversible. Your first output is the hexadecimal (or base16) representation, which is 50% efficient. Base64 achieves 75% efficiency, meaning it cuts the 40-character hex representation to 28 characters.
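The lengths quoted above can be reproduced with Python's standard `hashlib` and `base64` modules. This is a sketch for the "foo" input from the question; the variable names are illustrative.

```python
import base64
import hashlib

digest = hashlib.sha1(b"foo").digest()               # 20 raw bytes (160 bits)
hex_form = digest.hex()                               # 4 bits per character
b64_form = base64.b64encode(digest).decode("ascii")   # 6 bits per character, "=" padded

print(len(digest), len(hex_form), len(b64_form))

# The encoding is reversible, so it adds no collisions of its own.
print(base64.b64decode(b64_form) == digest)
```

The 28 characters include one trailing `=` of padding; without it, the Base64 form is the 27-character string shown in the question.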
The most efficient binary encoding scheme is yEnc, which achieves 98% efficiency, meaning a 100 byte long input will be roughly 102 bytes when encoded with yEnc. This is where the real problem arises for you: SHA-1 outputs are 160 bits (20 bytes) long. If you achieved 200% character-to-byte efficiency by using every 2-byte UTF-16 character, you would still be looking at 10 characters. Even that is out of reach, because the 2-byte values from U+D800 to U+DFFF are not valid UTF-16 characters on their own. Those values are reserved as surrogates, which are used in pairs to encode higher-plane characters.
Even if you find such a hyper-efficient [1] encoding scheme using Unicode, you can't really use its output in URLs. Non-ASCII characters are not allowed in URLs as such, and to be standards compliant, you should use % encodings for them. Many browsers will automatically convert them, so you may find this acceptable, but many of the characters you would regularly use would not be human readable and many more would appear to be in different languages.
At this point, if you really need short URLs, you should reconsider using a hash value and instead implement your own identity service (e.g. assign every page or resource an incremental ID, which is admittedly hard to scale) or utilizing another link-shortening service.
[1]: This is not possible from a bit standpoint. Unicode could achieve a higher character-to-bit ratio, but the Unicode characters themselves are represented by multiple bytes. The % encodings for UTF-8, which most browsers use as the default for unrecognized encodings, get messy quickly.
As I understand it, UTF-8 is a superset of ASCII, and therefore includes the control characters which are not used to represent printable characters.
My question is: Are there any bytes (of the 256 different) that are not used by the UTF-8 encoding?
I wondered if you could convert/encode UTF-8 text to binary.
Here is my thought process:
I have no idea how the UTF-8 text encoding works and how it can use so many characters (only that it uses multiple bytes for characters not in ASCII (Latin-1??)), but I know that ASCII text is valid in UTF-8, so the control characters (bytes 0-31) are not used differently by the UTF-8 encoding, but at the same time they are not used for displaying characters, right??
So of the 256 different bytes, only ~230 are used. For a 1000 (binary) long Unicode text there are only 1000^230 different texts? Right?
If that is true, you could convert it to a binary data which is smaller than 1000 bytes.
Wolfram alpha: 1000 bytes of unicode (assumption unicode only uses 230 of the 256 different bytes) --> 496 bytes
Yes, it is possible to devise encodings which are more space-efficient than UTF-8, but you have to weigh the advantages against the disadvantages.
For example, if your primary target is (say) ISO-8859-1, you could map the character codes 0xA0-0xFF to themselves, and use only 0x80-0x9F to select an extension map, somewhat vaguely like the way UTF-8 uses (nearly) all of 0x80-0xFF to encode sequences which can represent all of Unicode >= 0x80. You would gain a significant advantage when the majority of your text does not use characters in the ranges 0x80-0x9F or 0x0100-0x10FFFF, but correspondingly lose when this is not the case.
Or you could require the user to keep a state variable which tells you which range of characters is currently selected, and have each byte in the stream act as an index into that range. This has significant disadvantages, but used to be how these things were done way back when (witness e.g. ISO-2022).
The original UTF-8 draft before Ken Thompson and Rob Pike famously intervened was probably also somewhat more space-efficient than the final specification, but the changes they introduced had some very attractive properties, trading (I assume) some space efficiency for lack of contextual ambiguity.
I would urge you to read the Wikipedia article about UTF-8 to understand the design desiderata -- the spec is possible to grasp in just a few minutes, although you might want to reserve an hour or more to follow footnotes etc. (The Thompson anecdote is currently footnote #7.)
All in all, unless you are working on space travel or some similarly efficiency-intensive application, losing UTF-8 compatibility is probably not worth the time you have already spent, and you should stop now.
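0xF8-0xFF are not valid anywhere in UTF-8 (under the current definition, which forbids overlong forms and stops at U+10FFFF, neither are 0xC0, 0xC1 and 0xF5-0xF7), and some other bytes are not valid at certain positions.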
The lead byte of a character indicates the number of bytes used to encode the character, and each continuation byte has 10 as its two high order bits. This is so that you can pick any byte within the text and find the start of the character containing it. If you don't mind losing this ability, you could certainly come up with more efficient encoding.
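Both points can be checked by brute force in Python. This sketch encodes every code point that Python's UTF-8 codec accepts and collects the byte values that ever occur; the helper `char_start` is a hypothetical name for the "pick any byte and find the start of the character" property.

```python
# Collect every byte value that appears in the UTF-8 encoding of any code point.
used = set()
for cp in range(0x110000):
    if 0xD800 <= cp <= 0xDFFF:
        continue  # surrogates are not encodable in UTF-8
    used.update(chr(cp).encode("utf-8"))

unused = sorted(set(range(256)) - used)
print(len(used), [hex(b) for b in unused])


def char_start(data: bytes, i: int) -> int:
    """Walk back from index i to the lead byte of the character containing it."""
    # Continuation bytes have 10 as their two high-order bits.
    while i > 0 and (data[i] & 0b11000000) == 0b10000000:
        i -= 1
    return i


sample = "a\u20acb".encode("utf-8")   # 61 | E2 82 AC | 62
print(char_start(sample, 3))
```

The loop visits about 1.1 million code points, so it takes a moment to run.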
You have to distinguish Characters, Unicode and UTF-8 encoding:
In encodings like ASCII, LATIN-1, etc. there is a one-to-one relation of one character to one number between 0 and 255 so a character can be encoded by exactly one byte (e.g. "A"->65). For decoding such a text you need to know which encoding was used (does 65 really mean "A"?).
To overcome this situation Unicode assigns every Character (including all kinds of special things like control characters, diacritic marks, etc.) a unique number in the range from 0 to 0x10FFFF (so-called Unicode codepoint). As this range does not fit into one byte the question is how to encode. There are several ways to do this, e.g. simplest way would always use 4 bytes for each character. As this consumes a lot of space a more efficient encoding is UTF-8: Here every Unicode codepoint (= Character) is encoded in one, two, three or four bytes (for this encoding not all byte values from 0 to 255 are used but this is only a technical detail).
What's the basis for Unicode and why the need for UTF-8 or UTF-16?
I have researched this on Google and searched here as well, but it's not clear to me.
In VSS, when doing a file comparison, sometimes there is a message saying the two files have differing UTF's. Why would this be the case?
Please explain in simple terms.
Why do we need Unicode?
In the (not too) early days, all that existed was ASCII. This was okay, as all that would ever be needed were a few control characters, punctuation, numbers and letters like the ones in this sentence. Unfortunately, today's strange world of global intercommunication and social media was not foreseen, and it is not too unusual to see English, العربية, 汉语, עִבְרִית, ελληνικά, and ភាសាខ្មែរ in the same document (I hope I didn't break any old browsers).
But for argument's sake, let’s say Joe Average is a software developer. He insists that he will only ever need English, and as such only wants to use ASCII. This might be fine for Joe the user, but this is not fine for Joe the software developer. Approximately half the world uses non-Latin characters and using ASCII is arguably inconsiderate to these people, and on top of that, he is closing off his software to a large and growing economy.
Therefore, an encompassing character set including all languages is needed. Thus came Unicode. It assigns every character a unique number called a code point. One advantage of Unicode over other possible sets is that the first 256 code points are identical to ISO-8859-1, and hence also ASCII. In addition, the vast majority of commonly used characters are representable by only two bytes, in a region called the Basic Multilingual Plane (BMP). Now a character encoding is needed to access this character set, and as the question asks, I will concentrate on UTF-8 and UTF-16.
Memory considerations
So how many bytes give access to what characters in these encodings?
UTF-8:
1 byte: Standard ASCII
2 bytes: Arabic, Hebrew, most European scripts (most notably excluding Georgian)
3 bytes: BMP
4 bytes: All Unicode characters
UTF-16:
2 bytes: BMP
4 bytes: All Unicode characters
It's worth mentioning now that characters not in the BMP include ancient scripts, mathematical symbols, musical symbols, and rarer Chinese, Japanese, and Korean (CJK) characters.
If you'll be working mostly with ASCII characters, then UTF-8 is certainly more memory efficient. However, if you're working mostly with non-European scripts, using UTF-8 could be up to 1.5 times less memory efficient than UTF-16. When dealing with large amounts of text, such as large web-pages or lengthy word documents, this could impact performance.
Encoding basics
Note: If you know how UTF-8 and UTF-16 are encoded, skip to the next section for practical applications.
UTF-8: For the standard ASCII (0-127) characters, the UTF-8 codes are identical. This makes UTF-8 ideal if backwards compatibility is required with existing ASCII text. Other characters require anywhere from 2-4 bytes. This is done by reserving some bits in each of these bytes to indicate that it is part of a multi-byte character. In particular, the first bit of each byte is 1 to avoid clashing with the ASCII characters.
UTF-16: For valid BMP characters, the UTF-16 representation is simply its code point. However, for non-BMP characters UTF-16 introduces surrogate pairs. In this case a combination of two two-byte portions map to a non-BMP character. These two-byte portions come from the BMP numeric range, but are guaranteed by the Unicode standard to be invalid as BMP characters. In addition, since UTF-16 has two bytes as its basic unit, it is affected by endianness. To compensate, a reserved byte order mark can be placed at the beginning of a data stream which indicates endianness. Thus, if you are reading UTF-16 input, and no endianness is specified, you must check for this.
As can be seen, UTF-8 and UTF-16 are nowhere near compatible with each other. So if you're doing I/O, make sure you know which encoding you are using! For further details on these encodings, please see the UTF FAQ.
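The surrogate-pair arithmetic and the byte order mark check can be sketched in Python. The function names `to_surrogate_pair` and `detect_utf16_endianness` are hypothetical, and the sample code point U+24B62 is an assumption chosen for illustration.

```python
import codecs


def to_surrogate_pair(cp: int) -> tuple[int, int]:
    """Split a non-BMP code point into a UTF-16 high/low surrogate pair."""
    assert 0x10000 <= cp <= 0x10FFFF
    v = cp - 0x10000                      # 20-bit value
    high = 0xD800 + (v >> 10)             # top 10 bits
    low = 0xDC00 + (v & 0x3FF)            # bottom 10 bits
    return high, low


def detect_utf16_endianness(data: bytes) -> str:
    """Return 'le', 'be', or 'unknown' based on a leading byte order mark."""
    if data.startswith(codecs.BOM_UTF16_LE):
        return "le"
    if data.startswith(codecs.BOM_UTF16_BE):
        return "be"
    return "unknown"


high, low = to_surrogate_pair(0x24B62)
print(hex(high), hex(low))
print(detect_utf16_endianness("A".encode("utf-16")))  # Python's "utf-16" codec writes a BOM
```

When no byte order mark is present, the endianness has to come from somewhere else (a protocol header, a file format rule, or a guess), which is why the sketch returns "unknown" rather than picking a default.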
Practical programming considerations
Character and string data types: How are they encoded in the programming language? If they are raw bytes, the minute you try to output non-ASCII characters, you may run into a few problems. Also, even if the character type is based on a UTF, that doesn't mean the strings are proper UTF. They may allow byte sequences that are illegal. Generally, you'll have to use a library that supports UTF, such as ICU for C, C++ and Java. In any case, if you want to input/output something other than the default encoding, you will have to convert it first.
Recommended, default, and dominant encodings: When given a choice of which UTF to use, it is usually best to follow recommended standards for the environment you are working in. For example, UTF-8 is dominant on the web, and since HTML5, it has been the recommended encoding. Conversely, both .NET and Java environments are founded on a UTF-16 character type. Confusingly (and incorrectly), references are often made to the "Unicode encoding", which usually refers to the dominant UTF encoding in a given environment.
Library support: The libraries you are using support some kind of encoding. Which one? Do they support the corner cases? Since necessity is the mother of invention, UTF-8 libraries will generally support 4-byte characters properly, since 1, 2, and even 3 byte characters can occur frequently. However, not all purported UTF-16 libraries support surrogate pairs properly since they occur very rarely.
Counting characters: There exist combining characters in Unicode. For example, the code point U+006E (n), and U+0303 (a combining tilde) forms ñ, but the code point U+00F1 forms ñ. They should look identical, but a simple counting algorithm will return 2 for the first example, and 1 for the latter. This isn't necessarily wrong, but it may not be the desired outcome either.
Comparing for equality: A, А, and Α look the same, but they're Latin, Cyrillic, and Greek respectively. You also have cases like C and Ⅽ. One is a letter, and the other is a Roman numeral. In addition, we have the combining characters to consider as well. For more information, see Duplicate characters in Unicode.
Surrogate pairs: These come up often enough on Stack Overflow, so I'll just provide some example links:
Getting string length
Removing surrogate pairs
Palindrome checking
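The counting and comparison pitfalls described above can be reproduced with Python's standard `unicodedata` module. This is a sketch; normalizing to NFC before comparing is one common approach, not the only one.

```python
import unicodedata

decomposed = "n\u0303"    # n + combining tilde
precomposed = "\u00f1"    # single code point for the same visible character

# A naive count sees two code points versus one.
print(len(decomposed), len(precomposed))

# The raw strings differ, but they match after normalization.
print(decomposed == precomposed)
print(unicodedata.normalize("NFC", decomposed) == precomposed)

# Look-alikes from different scripts stay different even after normalization.
latin_a, cyrillic_a, greek_a = "A", "\u0410", "\u0391"
print(latin_a == cyrillic_a, unicodedata.name(cyrillic_a), unicodedata.name(greek_a))
```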
Unicode
is a set of characters used around the world
UTF-8
a character encoding capable of encoding all possible characters (called code points) in Unicode.
code unit is 8-bits
use one to four code units to encode Unicode
00100100 for "$" (one 8-bits);11000010 10100010 for "¢" (two 8-bits);11100010 10000010 10101100 for "€" (three 8-bits)
UTF-16
another character encoding
code unit is 16-bits
use one to two code units to encode Unicode
00000000 00100100 for "$" (one 16-bits);11011000 01010010 11011111 01100010 for "𤭢" (two 16-bits)
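The bit patterns listed above can be regenerated in Python. The helper `bits` is a hypothetical name, and big-endian byte order is assumed for UTF-16 to match the listing.

```python
def bits(text: str, encoding: str) -> str:
    """Render the encoded bytes of text as space-separated 8-bit groups."""
    return " ".join(f"{byte:08b}" for byte in text.encode(encoding))


for ch in ["$", "\u00a2", "\u20ac"]:          # $, cent sign, euro sign
    print(f"U+{ord(ch):04X} utf-8:  {bits(ch, 'utf-8')}")

for ch in ["$", "\U00024b62"]:                # $, and a character outside the BMP
    print(f"U+{ord(ch):04X} utf-16: {bits(ch, 'utf-16-be')}")
```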
Unicode is a fairly complex standard. Don’t be too afraid, but be prepared for some work! [2]
Because a credible resource is always needed, but the official report is massive, I suggest reading the following:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) An introduction by Joel Spolsky, Stack Exchange CEO.
To the BMP and beyond! A tutorial by Eric Muller, Technical Director then, Vice President later, at The Unicode Consortium (the first 20 slides and you are done)
A brief explanation:
Computers read bytes and people read characters, so we use encoding standards to map characters to bytes. ASCII was the first widely used standard, but covers only Latin (seven bits/character can represent 128 different characters). Unicode is a standard with the goal to cover all possible characters in the world (can hold up to 1,114,112 characters, meaning 21 bits/character maximum. Current Unicode 8.0 specifies 120,737 characters in total, and that's all).
The main difference is that an ASCII character can fit to a byte (eight bits), but most Unicode characters cannot. So encoding forms/schemes (like UTF-8 and UTF-16) are used, and the character model goes like this:
Every character holds an enumerated position from 0 to 1,114,111 (hex: 0-10FFFF) called a code point.
An encoding form maps a code point to a code unit sequence. A code unit is the way you want characters to be organized in memory: 8-bit units, 16-bit units and so on. UTF-8 uses one to four units of eight bits, and UTF-16 uses one or two units of 16 bits, to cover the entire Unicode range of 21 bits maximum. Units use prefixes so that character boundaries can be spotted, and more units mean more prefixes that occupy bits. So, although UTF-8 uses one byte for basic Latin (ASCII), it needs three bytes for later scripts inside the Basic Multilingual Plane, while UTF-16 uses two bytes for all of these. And that's their main difference.
Lastly, an encoding scheme (like UTF-16BE or UTF-16LE) maps (serializes) a code unit sequence to a byte sequence.
character: π
code point: U+03C0
encoding forms (code units):
UTF-8: CF 80
UTF-16: 03C0
encoding schemes (bytes):
UTF-8: CF 80
UTF-16BE: 03 C0
UTF-16LE: C0 03
Tip: a hexadecimal digit represents four bits, so a two-digit hex number represents a byte.
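The π example can be reproduced with Python's codecs. This is a sketch: the codec names are Python's, and `bytes.hex` with a separator needs Python 3.8 or later.

```python
ch = "\u03c0"   # the character π, code point U+03C0

# Encoding schemes: the same code point serialized to bytes three ways.
forms = {
    enc: ch.encode(enc).hex(" ").upper()
    for enc in ("utf-8", "utf-16-be", "utf-16-le")
}

print(f"code point: U+{ord(ch):04X}")
for enc, hex_bytes in forms.items():
    print(f"{enc}: {hex_bytes}")
```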
Also take a look at plane maps on Wikipedia to get a feeling of the character set layout.
The article What every programmer absolutely, positively needs to know about encodings and character sets to work with text explains all the details.
Writing to buffer
If you write the symbol あ to a 4-byte buffer with UTF-8 encoding, your binary will look like this:
00000000 11100011 10000001 10000010
If you write the symbol あ to a 4-byte buffer with UTF-16 encoding, your binary will look like this:
00000000 00000000 00110000 01000010
As you can see, depending on what language you use in your content, this will affect your memory usage accordingly.
Example: For this particular symbol, あ, UTF-16 encoding is more efficient, since we have 2 spare bytes to use for the next symbol. But that doesn't mean that you must use UTF-16 for Japanese text.
Reading from buffer
Now if you want to read the above bytes, you have to know which encoding they were written in, and decode them with that same encoding.
e.g. if you take the UTF-8 bytes:
00000000 11100011 10000001 10000010
and decode the first two non-padding bytes (E3 81) as little-endian UTF-16, you will end up with 臣, not あ
Note: Encoding and Unicode are two different things. Unicode is the big table in which each symbol is mapped to a unique code point, e.g. the symbol (letter) あ has the code point 30 42 (hex). Encoding, on the other hand, is an algorithm that converts code points into a byte representation suitable for storing on hardware.
30 42 (hex) -> UTF8 encoding -> E3 81 82 (hex), which is the above result in binary.
30 42 (hex) -> UTF16 encoding -> 30 42 (hex), which is the above result in binary.
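The write/read round trip can be tried out in Python. This is a sketch: big-endian UTF-16 is assumed for the write, and the mismatched read uses little-endian UTF-16 on the first two UTF-8 bytes, which is one way to arrive at a wrong character.

```python
symbol = "\u3042"   # あ, code point 30 42 (hex)

utf8_bytes = symbol.encode("utf-8")        # E3 81 82
utf16_bytes = symbol.encode("utf-16-be")   # 30 42

print(utf8_bytes.hex(" "), "|", utf16_bytes.hex(" "))

# Reading with the encoding that was used for writing gives the symbol back.
print(utf8_bytes.decode("utf-8") == symbol)

# Reading with the wrong encoding does not fail loudly here;
# it just yields a different character.
wrong = utf8_bytes[:2].decode("utf-16-le")
print(f"U+{ord(wrong):04X}", wrong)
```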
Originally, Unicode was intended to have a fixed-width 16-bit encoding (UCS-2). Early adopters of Unicode, like Java and Windows NT, built their libraries around 16-bit strings.
Later, the scope of Unicode was expanded to include historical characters, which would require more than the 65,536 code points a 16-bit encoding would support. To allow the additional characters to be represented on platforms that had used UCS-2, the UTF-16 encoding was introduced. It uses "surrogate pairs" to represent characters in the supplementary planes.
Meanwhile, a lot of older software and network protocols were using 8-bit strings. UTF-8 was made so these systems could support Unicode without having to use wide characters. It's backwards-compatible with 7-bit ASCII.
Unicode is a standard which maps the characters in all languages to a particular numeric value called a code point. The reason it does this is that it allows different encodings to be possible using the same set of code points.
UTF-8 and UTF-16 are two such encodings. They take code points as input and encode them using some well-defined formula to produce the encoded string.
Choosing a particular encoding depends upon your requirements. Different encodings have different memory requirements and depending upon the characters that you will be dealing with, you should choose the encoding which uses the least sequences of bytes to encode those characters.
For more in-depth details about Unicode, UTF-8 and UTF-16, you can check out this article,
What every programmer should know about Unicode
Why Unicode? Because ASCII has just 128 characters (0 to 127). Those from 128 to 255 differ in different countries, and that's why there are code pages. So they said: let’s have code points up to 1114111.
So how do you store the highest code point? You'll need to store it using 21 bits, so you'll use a DWORD having 32 bits with 11 bits wasted. So if you use a DWORD to store a Unicode character, it is the easiest way, because the value in your DWORD matches exactly the code point.
But DWORD arrays are of course larger than WORD arrays and of course even larger than BYTE arrays. That's why there is not only UTF-32, but also UTF-16. But UTF-16 means a WORD stream, and a WORD has 16 bits, so how can the highest code point 1114111 fit into a WORD? It cannot!
So they put everything higher than 65535 into a DWORD, which they call a surrogate pair. Such a surrogate pair consists of two WORDs and can be detected by looking at the first 6 bits of each WORD.
So what about UTF-8? It is a byte array or byte stream, but how can the highest code point 1114111 fit into a byte? It cannot! Okay, so they put in also a DWORD right? Or possibly a WORD, right? Almost right!
They invented utf-8 sequences which means that every code point higher than 127 must get encoded into a 2-byte, 3-byte or 4-byte sequence. Wow! But how can we detect such sequences? Well, everything up to 127 is ASCII and is a single byte. What starts with 110 is a two-byte sequence, what starts with 1110 is a three-byte sequence and what starts with 11110 is a four-byte sequence. The remaining bits of these so called "startbytes" belong to the code point.
Now depending on the sequence, following bytes must follow. A following byte starts with 10, and the remaining bits are 6 bits of payload bits and belong to the code point. Concatenate the payload bits of the startbyte and the following byte/s and you'll have the code point. That's all the magic of UTF-8.
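The start-byte and following-byte rules above can be written out by hand in Python. This is a minimal sketch, not a full codec: the function names are hypothetical, and it does not reject surrogates, overlong forms, or truncated input.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point using the start-byte / following-byte patterns."""
    if cp < 0x80:                       # 0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (cp >> 6),
                      0b10000000 | (cp & 0x3F)])
    if cp < 0x10000:                    # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (cp >> 12),
                      0b10000000 | ((cp >> 6) & 0x3F),
                      0b10000000 | (cp & 0x3F)])
    return bytes([0b11110000 | (cp >> 18),   # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0b10000000 | ((cp >> 12) & 0x3F),
                  0b10000000 | ((cp >> 6) & 0x3F),
                  0b10000000 | (cp & 0x3F)])


def utf8_decode_one(data: bytes) -> tuple[int, int]:
    """Decode the first sequence in data; return (code point, bytes consumed)."""
    first = data[0]
    if first < 0x80:
        return first, 1
    if first >> 5 == 0b110:
        length, cp = 2, first & 0x1F
    elif first >> 4 == 0b1110:
        length, cp = 3, first & 0x0F
    elif first >> 3 == 0b11110:
        length, cp = 4, first & 0x07
    else:
        raise ValueError("not a start byte")
    for byte in data[1:length]:
        if byte >> 6 != 0b10:
            raise ValueError("expected a following byte")
        cp = (cp << 6) | (byte & 0x3F)   # append 6 payload bits
    return cp, length


print(utf8_encode(0x20AC).hex(" "), utf8_decode_one(b"\xe2\x82\xac"))
```

In real code the built-in `str.encode("utf-8")` and `bytes.decode("utf-8")` are the better choice, since they also validate the input.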
ASCII - Software allocates only one 8-bit byte in memory for a given character. It works well for English characters, as their corresponding decimal values fall below 128. Adopted characters (as in loanwords like façade) are not covered: 'ç' has the decimal value 231, which lies outside ASCII. Example C program.
UTF-8 - Software allocates a variable number of 8-bit bytes, from one to four, for a given character. What is meant by variable here? Let us say you are sending the character 'A' through your HTML pages in the browser (HTML is typically UTF-8). The corresponding decimal value of A is 65, and when you convert it into binary it becomes 01000001. This requires only one byte. Special adopted English characters like 'ç' in the word façade, like other European characters beyond ASCII, require two bytes. When you go for Asian characters, you require a minimum of two bytes and a maximum of four bytes. Similarly, emojis require three to four bytes. UTF-8 will solve all your needs.
UTF-16 will allocate a minimum of 2 bytes and a maximum of 4 bytes per character; it will not allocate 1 or 3 bytes. Each character is represented in either 16 bits or 32 bits.
Then why does UTF-16 exist? Originally, Unicode was 16 bit, not 8 bit. Java adopted the original 16-bit encoding, which was later extended into UTF-16.
In a nutshell, you don't need UTF-16 anywhere unless it has already been adopted by the language or platform you are working on.
A Java program invoked by a web browser uses UTF-16 internally, but the web browser sends characters using UTF-8.
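The byte counts for individual characters can be measured directly in Python. This is a sketch; the helper `sizes` and the sample characters are assumptions chosen for illustration.

```python
def sizes(ch: str) -> tuple[int, int]:
    """Return (bytes in UTF-8, bytes in UTF-16) for a single character."""
    return len(ch.encode("utf-8")), len(ch.encode("utf-16-le"))


samples = {
    "A": "A",
    "c with cedilla": "\u00e7",
    "hiragana a": "\u3042",
    "grinning face emoji": "\U0001F600",
}

print("A in binary:", format(ord("A"), "08b"))
for name, ch in samples.items():
    u8, u16 = sizes(ch)
    print(f"{name}: U+{ord(ch):04X}  utf-8={u8}  utf-16={u16}")

# Whether a character fits in ASCII at all can be tested the same way.
try:
    "\u00e7".encode("ascii")
    cedilla_is_ascii = True
except UnicodeEncodeError:
    cedilla_is_ascii = False
print(cedilla_is_ascii)
```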
UTF stands for Unicode Transformation Format. Basically, in today's world there are scripts written in hundreds of other languages, which are not covered by the basic ASCII used earlier. Hence, UTF came into existence.
UTF-8 and UTF-16 are both character encodings. The code unit of UTF-8 is eight bits, while that of UTF-16 is 16 bits.
I've heard people talking about "base 64 encoding" here and there. What is it used for?
When you have some binary data that you want to ship across a network, you generally don't do it by just streaming the bits and bytes over the wire in a raw format. Why? because some media are made for streaming text. You never know -- some protocols may interpret your binary data as control characters (like a modem), or your binary data could be screwed up because the underlying protocol might think that you've entered a special character combination (like how FTP translates line endings).
So to get around this, people encode the binary data into characters. Base64 is one of these types of encodings.
Why 64?
Because you can generally rely on the same 64 characters being present in many character sets, and you can be reasonably confident that your data's going to end up on the other side of the wire uncorrupted.
It's basically a way of encoding arbitrary binary data in ASCII text. It takes 4 characters per 3 bytes of data, plus potentially a bit of padding at the end.
Essentially each 6 bits of the input is encoded in a 64-character alphabet. The "standard" alphabet uses A-Z, a-z, 0-9 and + and /, with = as a padding character. There are URL-safe variants.
Wikipedia is a reasonably good source of more information.
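As a quick illustration of the standard and URL-safe alphabets, here is a sketch using Python's base64 module. The two input bytes are arbitrary; I picked them only because they land on indexes 62 and 63, where the two alphabets differ.

```python
import base64

raw = bytes([0xFB, 0xFF])  # arbitrary bytes chosen to hit indexes 62 and 63

# Standard alphabet: index 62 is "+", index 63 is "/".
standard = base64.b64encode(raw).decode("ascii")
# URL-safe alphabet: index 62 is "-", index 63 is "_".
url_safe = base64.urlsafe_b64encode(raw).decode("ascii")

print(standard)  # +/8=
print(url_safe)  # -_8=
```

Two input bytes do not fill a 3-byte group, hence the single = at the end of both outputs.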
Years ago, when email was introduced, it was entirely text based. As time passed, the need arose for attachments such as images and media (audio, video, etc.). When these attachments, which are binary data, are sent over the internet in raw form, the probability of the data getting corrupted is high. Base64 came along to tackle this problem.
The problem with binary data is that it can contain null bytes, which in some languages like C and C++ mark the end of a character string. Code that treats raw binary data as a string will therefore stop reading at the first null byte, resulting in corrupt data.
For Example :
In C and C++, this "null" character shows the end of a string. So "HELLO" is stored like this:
H E L L O
72 69 76 76 79 00
The 00 says "stop here".
Now let’s dive into how BASE64 encoding works.
Point to be noted : Base64 processes the input in groups of 3 bytes, so the simplest case is a string whose length is a multiple of 3.
Example 1 :
String to be encoded : “ace”, Length=3
Convert each character to decimal.
a= 97, c= 99, e= 101
Change each decimal to 8-bit binary representation.
97= 01100001, 99= 01100011, 101= 01100101
Combined : 01100001 01100011 01100101
Separate in a group of 6-bit.
011000 010110 001101 100101
Calculate binary to decimal
011000= 24, 010110= 22, 001101= 13, 100101= 37
Convert the decimal values to base64 characters using the base64 chart.
24= Y, 22= W, 13= N, 37= l
“ace” => “YWNl”
Example 2 :
String to be encoded : “abcd”, Length=4, which is not a multiple of 3. The second group of 3 bytes is 2 bytes short, so the output ends with 2 padding characters. A padding character is represented by the “=” sign.
Point to be noted : One padding character corresponds to two zero bits (00) appended to the data, so two padding characters correspond to four zero bits (0000).
So lets start the process :–
Convert each character to decimal.
a= 97, b= 98, c= 99, d= 100
Change each decimal to 8-bit binary representation.
97= 01100001, 98= 01100010, 99= 01100011, 100= 01100100
Separate in a group of 6-bit.
011000, 010110, 001001, 100011, 011001, 00
The last group has only 2 bits, so we append four zero bits “0000” to complete it and mark this with two padding characters.
011000, 010110, 001001, 100011, 011001, 000000 ==
Now every group is complete. The two equals signs at the end show that 4 zero bits were added (this helps in decoding).
Calculate binary to decimal.
011000= 24, 010110= 22, 001001= 9, 100011= 35, 011001= 25, 000000=0 ==
Convert the decimal values to base64 characters using the base64 chart.
24= Y, 22= W, 9= J, 35= j, 25= Z, 0= A ==
“abcd” => “YWJjZA==”
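The two worked examples can be reproduced with a short sketch that follows the same steps by hand. The function name manual_b64encode is made up for illustration; in real code you would simply call base64.b64encode, which is used here only to cross-check the result.

```python
import base64
import string

# Standard Base64 chart: index 0-25 = A-Z, 26-51 = a-z, 52-61 = 0-9, 62 = +, 63 = /
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def manual_b64encode(data: bytes) -> str:
    # Write every byte as 8 binary digits and join them.
    bits = "".join(f"{byte:08b}" for byte in data)
    # Append zero bits so the total length is a multiple of 6.
    bits += "0" * (-len(bits) % 6)
    # Split into 6-bit groups and look each group up in the chart.
    chars = [ALPHABET[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)]
    # Append "=" characters so the output length is a multiple of 4.
    chars.append("=" * (-len(chars) % 4))
    return "".join(chars)


print(manual_b64encode(b"ace"))   # YWNl
print(manual_b64encode(b"abcd"))  # YWJjZA==
print(manual_b64encode(b"abcd") == base64.b64encode(b"abcd").decode("ascii"))  # True
```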
Base-64 encoding is a way of taking binary data and turning it into text so that it's more easily transmitted in things like e-mail and HTML form data.
http://en.wikipedia.org/wiki/Base64
It's a textual encoding of binary data where the resultant text has nothing but letters, numbers and the symbols "+", "/" and "=". It's a convenient way to store/transmit binary data over media that is specifically used for textual data.
But why Base-64? The two alternatives for converting binary data into text that immediately spring to mind are:
Decimal: store the decimal value of each byte as three digits: 045 112 101 037 etc. where each byte is represented by 3 bytes. The data bloats three-fold.
Hexadecimal: store the bytes as hex pairs: AC 47 0D 1A etc. where each byte is represented by 2 bytes. The data bloats two-fold.
Base-64 maps 3 bytes (8 x 3 = 24 bits) in 4 characters that span 6-bits (6 x 4 = 24 bits). The result looks something like "TWFuIGlzIGRpc3Rpb...". Therefore the bloating is only a mere 4/3 = 1.3333333 times the original.
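The three bloat factors are easy to check with a sketch like the following. The 300-byte input is arbitrary; I chose a multiple of 3 so that no Base64 padding muddies the ratio.

```python
import base64

data = bytes(range(256)) + bytes(44)  # 300 arbitrary bytes

decimal_text = "".join(f"{b:03d}" for b in data)    # three digits per byte
hex_text = data.hex()                                # two hex digits per byte
b64_text = base64.b64encode(data).decode("ascii")    # four characters per 3 bytes

print(len(data), len(decimal_text), len(hex_text), len(b64_text))  # 300 900 600 400
```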
Aside from what's already been said, two very common uses that have not been listed are
Hashes:
Hashes are one-way functions that transform a block of bytes into another block of bytes of a fixed size such as 128bit or 256bit (SHA/MD5). Converting the resulting bytes into Base64 makes it much easier to display the hash especially when you are comparing a checksum for integrity. Hashes are so often seen in Base64 that many people mistake Base64 itself as a hash.
Cryptography:
Since an encryption key does not have to be text but raw bytes it is sometimes necessary to store it in a file or database, which Base64 comes in handy for. Same with the resulting encrypted bytes.
Note that although Base64 is often used in cryptography, it is not a security mechanism. Anyone can convert the Base64 string back to its original bytes, so it should not be used as a means for protecting data, only as a format to display or store raw bytes more easily.
Certificates
x509 certificates in PEM format are base 64 encoded. http://how2ssl.com/articles/working_with_pem_files/
In the early days of computers, when telephone line inter-system communication was not particularly reliable, a quick & dirty method of verifying data integrity was used: "bit parity". In this method, every byte transmitted would have 7-bits of data, and the 8th would be 1 or 0, to force the total number of 1 bits in the byte to be even.
Hence 0x01 would be transmitted as 0x81; 0x02 would be 0x82; 0x03 would remain 0x03 etc.
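For the curious, the even-parity rule can be sketched in a few lines. The function name add_even_parity is hypothetical; it only illustrates the rule described above, not any particular piece of hardware.

```python
def add_even_parity(value: int) -> int:
    """Set bit 8 of a 7-bit value so that the total number of 1 bits is even."""
    if not 0 <= value <= 0x7F:
        raise ValueError("expected a 7-bit value")
    ones = bin(value).count("1")
    # An odd count of 1 bits needs the parity bit set to make the total even.
    return value | 0x80 if ones % 2 else value


for v in (0x01, 0x02, 0x03):
    print(f"0x{v:02X} -> 0x{add_even_parity(v):02X}")
# 0x01 -> 0x81
# 0x02 -> 0x82
# 0x03 -> 0x03
```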
To further this system, when the ASCII character set was defined, only 00-7F were assigned characters. (Still today, characters in the range 80-FF are not part of ASCII and differ between character sets.)
Many routers of the day put the parity check and byte translation into hardware, forcing the computers attached to them to deal strictly with 7-bit data. This forced email attachments (and all other data, which is why the HTTP & SMTP protocols are text-based) to be converted into a text-only format.
Few of the routers survived into the 90s. I severely doubt any of them are in use today.
From http://en.wikipedia.org/wiki/Base64
The term Base64 refers to a specific MIME content transfer encoding.
It is also used as a generic term for any similar encoding scheme that
encodes binary data by treating it numerically and translating it into
a base 64 representation. The particular choice of base is due to the
history of character set encoding: one can choose a set of 64
characters that is both part of the subset common to most encodings,
and also printable. This combination leaves the data unlikely to be
modified in transit through systems, such as email, which were
traditionally not 8-bit clean.
Base64 can be used in a variety of contexts:
Evolution and Thunderbird use Base64 to obfuscate e-mail passwords[1]
Base64 can be used to transmit and store text that might otherwise cause delimiter collision
Base64 is often used as a quick but insecure shortcut to obscure secrets without incurring the overhead of cryptographic key management
Spammers use Base64 to evade basic anti-spamming tools, which often do not decode Base64 and therefore cannot detect keywords in encoded
messages.
Base64 is used to encode character strings in LDIF files
Base64 is sometimes used to embed binary data in an XML file, using a syntax similar to ...... e.g.
Firefox's bookmarks.html.
Base64 is also used when communicating with government Fiscal Signature printing devices (usually, over serial or parallel ports) to
minimize the delay when transferring receipt characters for signing.
Base64 is used to encode binary files such as images within scripts, to avoid depending on external files.
Can be used to embed raw image data into a CSS property such as background-image.
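As an illustration of the last two items, here is a rough sketch that builds a data: URI for a CSS background-image. The bytes are only the 8-byte PNG file signature standing in for a real image, and the helper name to_data_uri is made up.

```python
import base64

# Stand-in for real image bytes: just the 8-byte PNG file signature.
png_bytes = b"\x89PNG\r\n\x1a\n"


def to_data_uri(data: bytes, mime_type: str) -> str:
    """Wrap raw bytes in a data: URI with a Base64 payload."""
    payload = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{payload}"


uri = to_data_uri(png_bytes, "image/png")
css_rule = "background-image: url('" + uri + "');"
print(css_rule)
# background-image: url('data:image/png;base64,iVBORw0KGgo=');
```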
Some transportation protocols only allow alphanumerical characters to be transmitted. Just imagine a situation where control characters are used to trigger special actions and/or that only supports a limited bit width per character. Base64 transforms any input into an encoding that only uses alphanumeric characters, +, / and the = as a padding character.
Base64 is a binary to a text encoding scheme that represents binary data in an ASCII string format. It is designed to carry data stored in binary format across the network channels.
Base64 mechanism uses 64 characters to encode. These characters consist of:
10 numeric value: i.e., 0,1,2,3,...,9
26 Uppercase alphabets: i.e., A,B,C,D,...,Z
26 Lowercase alphabets: i.e., a,b,c,d,...,z
2 special characters (these depend on the Base64 variant; URL-safe variants use - and _ instead): i.e. +,/
How base64 works
The steps to encode a string with base64 algorithm are as follow:
Count the number of bytes in the string. If it is not a multiple of 3, the final group is filled with zero bits and the output is padded with the special character = so that its length is a multiple of 4.
Convert string to ASCII binary format 8-bit using the ASCII table.
After converting to binary format, divide binary data into chunks of 6-bits.
Convert chunks of 6-bit binary data to decimal numbers.
Convert decimals to characters according to the base64 index table. The 2 special characters in the table may vary between variants.
Now, we got the encoded version of the input string.
Let's make an example: convert string THS to base64 encoding string.
Count the number of characters: it is already a multiple of 3.
Convert to ASCII binary format 8-bit. We got (T)01010100 (H)01001000 (S)01010011
Divide binary data into chunks of 6-bits. We got 010101 000100 100001 010011
Convert chunks of 6-bit binary data to decimal numbers.We got 21 4 33 19
Convert decimals to string according to the base64 Index Table. We got VEhT
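Going the other way is the same procedure in reverse. Below is a sketch of a hand-written decoder, assuming the standard alphabet; the name manual_b64decode is invented for this example, and base64.b64decode is what you would normally use.

```python
import string

# Standard Base64 index table: A-Z, a-z, 0-9, then + and /
ALPHABET = string.ascii_uppercase + string.ascii_lowercase + string.digits + "+/"


def manual_b64decode(text: str) -> bytes:
    # Padding characters carry no data, so drop them first.
    stripped = text.rstrip("=")
    # Turn every character back into its 6-bit index.
    bits = "".join(f"{ALPHABET.index(ch):06b}" for ch in stripped)
    # Ignore the leftover zero bits that do not fill a whole byte.
    usable = len(bits) - len(bits) % 8
    return bytes(int(bits[i:i + 8], 2) for i in range(0, usable, 8))


print(manual_b64decode("VEhT"))  # b'THS'
```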
It's used for converting arbitrary binary data to ASCII text.
For example, e-mail attachments are sent this way.
“Base64 encoding schemes are commonly used when there is a need to encode binary data that needs be stored and transferred over media that are designed to deal with textual data. This is to ensure that the data remains intact without modification during transport”(Wiki, 2017)
An example could be the following: you have a web service that accepts only ASCII chars. You want to save the user’s data and then transfer it to some other location (API), but the recipient wants to receive the data untouched. Base64 is for that. The only downside is that base64 encoding will require around 33% more space than regular strings.
Another example: uenc = url encoded = aHR0cDovL2xvYy5tYWdlbnRvLmNvbS9hc2ljcy1tZW4tcy1nZWwta2F5YW5vLXhpaS5odG1s = http://loc.magento.com/asics-men-s-gel-kayano-xii.html.
As you can see we can’t put char “/” in URL if we want to send last visited URL as parameter because we would break attribute/value rule for “MOD rewrite” – GET parameter.
A full example would be: “http://loc.querytip.com/checkout/cart/add/uenc/http://loc.magento.com/asics-men-s-gel-kayano-xii.html/product/93/”
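You can check the uenc value yourself with a couple of lines of Python. This is only a sketch of the decoding step; it says nothing about how the shop software itself produces or consumes the parameter.

```python
import base64

uenc = "aHR0cDovL2xvYy5tYWdlbnRvLmNvbS9hc2ljcy1tZW4tcy1nZWwta2F5YW5vLXhpaS5odG1s"

# Decoding gives back the URL that was hidden inside the path segment.
original_url = base64.b64decode(uenc).decode("ascii")
print(original_url)

# This particular value contains no "/", so it fits in a single path segment.
print("/" in uenc)  # False
```

Standard Base64 output can contain "/" and "+" for other inputs, which is where the URL-safe variant (base64.urlsafe_b64encode) comes in.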
I use it in a practical sense when we transfer large binary objects (images) via web services. So when I am testing a C# web service using a python script, the binary object can be recreated with a little magic.
[In python]
import base64
imageAsBytes = base64.b64decode( dataFromWS )
The usage of Base64 I'm going to describe here is somewhat a hack. So if you don't like hacks, please do not go on.
I ran into trouble when I discovered that MySQL's utf8 does not support 4-byte Unicode characters, since it uses a 3-byte version of UTF-8. So what did I do to support full 4-byte Unicode over MySQL's utf8? I base64 encode strings when storing them into the database and base64 decode them when retrieving.
Since base64 encoding and decoding is very fast, the above worked perfectly.
You have the following points to take note of:
Base64 encoding uses 33% more storage
Strings stored in the database won't be human readable (You could sell that as a feature that database strings use a basic form of encryption).
You could use the above method for any storage engine that does not support unicode.
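A minimal sketch of that round trip, with the database itself left out. The helper names to_storable and from_storable are hypothetical; the point is only that the stored value is pure ASCII and the 4-byte character survives.

```python
import base64


def to_storable(text: str) -> str:
    """Encode text as UTF-8, then Base64, giving an ASCII-only string."""
    return base64.b64encode(text.encode("utf-8")).decode("ascii")


def from_storable(stored: str) -> str:
    """Reverse of to_storable: Base64 decode, then UTF-8 decode."""
    return base64.b64decode(stored).decode("utf-8")


original = "caf\u00e9 \U0001F600"  # includes an emoji that needs 4 bytes in UTF-8
stored = to_storable(original)

print(stored.isascii())                   # True
print(from_storable(stored) == original)  # True
```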
Mostly, I've seen it used to encode binary data in contexts that can only handle ASCII, or another simple character set.
Base64 is a binary-to-text encoding scheme that represents binary data in an ASCII string format. It is designed to carry data stored in binary format across channels. It takes any form of data and transforms it into a long string of plain text. Earlier, we could not reliably transfer data such as files because they are made up of 8-bit bytes (2⁸ possible values), while many networks handled only 7-bit characters (2⁷ possible values). This is where base64 encoding came into the picture. But what does base64 actually mean?
Let’s understand the meaning of base64.
base64 = base+64
We can call base64 a radix-64 representation. Base64 uses only 6 bits per character (2⁶ = 64 characters), so that every output character is a printable one. But why 64, and not 65 or 78? Because 64 is a power of two, so each character carries exactly 6 bits of the input.
base64 encoding contains 64 characters to encode any string.
base64 contains:
10 numeric value i.e., 0,1,2,3,…..9.
26 Uppercase alphabets i.e., A,B,C,D,…….Z.
26 Lowercase alphabets i.e., a,b,c,d,……..z.
two special characters i.e., +,/. These depend on the Base64 variant.
The steps followed by the base64 algorithm are as follows:
Count the number of bytes in the string.
If it is not a multiple of 3, the output is padded with the special character = to mark the missing bytes.
Encode the string in ASCII format.
Convert the ASCII codes to binary format, 8 bits each.
After converting to binary format, divide the binary data into chunks of 6 bits each.
Convert the chunks of 6-bit binary data to decimal numbers.
Using the base64 index table, convert the decimals to characters.
Finally, we get the encoded version of our input string.
To expand a bit on what Brad is saying: many transport mechanisms for email and Usenet and other ways of moving data are not "8 bit clean", which means that characters outside the standard ascii character set might be mangled in transit - for instance, 0x0D might be seen as a carriage return, and turned into a carriage return and line feed. Base 64 maps all the binary characters into several standard ascii letters and numbers and punctuation so they won't be mangled this way.
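Here is a toy simulation of that kind of mangling. The function lossy_text_transport is a made-up stand-in for a channel that is not 8-bit clean (it rewrites every CR as CR LF); real transports misbehave in their own ways.

```python
import base64


def lossy_text_transport(data: bytes) -> bytes:
    """Hypothetical channel that rewrites every CR (0x0D) as CR LF."""
    return data.replace(b"\r", b"\r\n")


payload = bytes([0x00, 0x0D, 0xFF, 0x10])

# Sent raw, the 0x0D byte gains an extra 0x0A and the payload is corrupted.
received_raw = lossy_text_transport(payload)

# Sent as Base64, the text contains no control characters, so it arrives intact.
encoded = base64.b64encode(payload)
received_encoded = lossy_text_transport(encoded)
recovered = base64.b64decode(received_encoded)

print(received_raw == payload)  # False
print(recovered == payload)     # True
```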
One hexadecimal digit is of one nibble (4 bits). Two nibbles make 8 bits which are also called 1 byte.
MD5 generates a 128-bit output which is represented using a sequence of 32 hexadecimal digits, which in turn are 32*4=128 bits. 128 bits make 16 bytes (since 1 byte is 8 bits).
Each Base64 character encodes 6 bits (except the last non-pad character which can encode 2, 4 or 6 bits; and final pad characters, if any). Therefore, per Base64 encoding, a 128-bit hash requires at least ⌈128/6⌉ = 22 characters, plus pad if any.
With padding, the Base64 form of a 128-bit hash is 24 characters long; without padding it is 22 characters.
That is shorter than the 32 characters of the hexadecimal form, although still longer than the 16 raw bytes. Truncating the output to a shorter length (6, 8, or 10 characters) discards part of the hash.
So base64 encoding reduces the space consumed compared with hexadecimal text; it does not add any security.
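The lengths for a 128-bit hash can be verified with hashlib. The input b"hello" is arbitrary; every MD5 digest has the same length.

```python
import base64
import hashlib

digest = hashlib.md5(b"hello").digest()  # 16 raw bytes = 128 bits

hex_form = digest.hex()                                # 2 characters per byte
b64_form = base64.b64encode(digest).decode("ascii")    # padded Base64
b64_unpadded = b64_form.rstrip("=")                    # same data without "="

print(len(digest), len(hex_form), len(b64_form), len(b64_unpadded))  # 16 32 24 22
```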
Base64 can be used for many purposes.
The primary reason is to convert binary data to something passable.
I sometimes use it to pass JSON data around from one site to another, store information
in cookies about a user.
Note:
You "can" use it to obscure data, and I don't see why people say you can't, although it is easily reversible and is frowned upon. Strictly speaking it is not encryption, because there is no key: anyone can convert the base64 string back to the original data.