I'm developing a function that needs to detect if a string is Unicode.
I get this string from an Access DB.
Right now I'm examining every two bytes: if the second byte is 00, I treat the string as Unicode. That isn't always the case, though; sometimes I get a pair of bytes such as &H2 &HA1.
How can I solve this problem?
Only the characters from 0 to 127 are "safe." ANSI character values from 128 to 255 have different meanings and character mappings in different locales.
For example, in the U.S. English locale:
Option Explicit

Private Sub Form_Load()
    Dim S As String
    S = "‰"
    Debug.Print S, Asc(S), AscW(S)
End Sub
Produces:
‰ 137 8240
If the underlying data is primarily ASCII/ANSI, then your current check is enough. In 16-bit Unicode, such string data will have a majority of characters whose upper byte is 00. Not 100%, but an obvious majority. This won't occur in straight ANSI string data.
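As a rough illustration of the "majority of upper bytes are 00" heuristic, here is a minimal Python sketch. The function name `looks_like_utf16le` and the 0.5 threshold are assumptions made for this example, and it assumes little-endian byte order (the high byte is the second byte of each pair). It is a guess, not a reliable detector.

```python
def looks_like_utf16le(data: bytes, threshold: float = 0.5) -> bool:
    """Guess whether data is UTF-16LE text by counting zero high bytes."""
    pairs = len(data) // 2
    if pairs == 0:
        return False
    # In little-endian UTF-16 the high byte is the second byte of each pair.
    zero_high = sum(1 for i in range(1, pairs * 2, 2) if data[i] == 0)
    return zero_high / pairs > threshold


ansi = "Hello, world".encode("cp1252")
wide = "Hello ‰ world".encode("utf-16-le")
print(looks_like_utf16le(ansi))  # False
print(looks_like_utf16le(wide))  # True
```

The second string still passes even though ‰ (U+2030) has a non-zero high byte, because only one pair out of thirteen breaks the pattern.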
I'm reviewing for an exam right now and one of the review questions gives an answer that I'm not understanding.
A main memory location of a MIPS processor based computer contains the following bit pattern:
0 01111110 11100000000000000000000
a. If this is to be interpreted as a NULL-terminated string of ASCII characters, what is the string?
The answer that's given is "?p" but I'm not sure how they got that.
Thanks!
All ASCII characters are made up of 8 bits. So given your main memory location, we can break it up into a few bytes.
00111111
01110000
00000000
...
Null-terminated strings are terminated with none other than... a null byte! (A byte with all zeros.) So this means that your string contains two bytes that are ASCII characters. Byte one has a value of 63 and byte two has a value of 112. If you have a look at an ASCII chart you'll see that 63 corresponds to '?' and 112 corresponds to 'p'.
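The same regrouping can be checked in a few lines of Python. This is only a sketch: it assumes the 32-bit pattern is laid out most significant byte first, and the variable names are made up for the example.

```python
# Sign bit, exponent and fraction fields from the question, joined into 32 bits.
bits = "0" + "01111110" + "11100000000000000000000"

# Reinterpret the 32 bits as four bytes, most significant byte first.
word = int(bits, 2).to_bytes(4, "big")

# A null-terminated string ends at the first zero byte.
text = word.split(b"\x00", 1)[0].decode("ascii")

print(list(word))  # [63, 112, 0, 0]
print(text)        # ?p
```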
What is the easiest way to shorten a base 64 string. e.g
PHJkZjpEZXNjcmlwdGlvbiByZGY6YWJvdXQ9IiIKICAgICAgICAgICAgeG1sbnM6eG1wPSJodHRwOi8v
I just learned how to convert binary to base64. If I'm correct, groups of 24 bits are made and groups of 6 bits are used to create the 64 characters A-Z a-z 0-9 +/
I was wondering is it possible to further shrink a base 64 string and make it smaller; I was hoping to reduce a 100 character base64 string to 20 or less characters.
A 100-character base64 string contains 600 bits of information. A base64 string contains 6 bits in each character and requires 100 characters to represent your data. It is encoded in US-ASCII (by definition) and described in RFC 4648. In order to represent your data in 20 characters you need 30 bits in each character (600/20).
In a contrived fashion, using a very large Unicode mapping, it would be possible to render a unified CJK typeface, but it would still require the minimum of about 40 glyphs (~75 bytes) to represent the data. It would also be really difficult to debug the encoding and be really prone to misinterpretation. Further, the purpose of base64 encoding is to present a representation that is not destroyed by broken intermediate systems. This would very likely not work with anything as obscure as a base2Billion encoding.
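The arithmetic behind these numbers can be sketched in Python. The helper `chars_needed` is a made-up name for this illustration; it simply divides the payload size by the number of bits one character of a given alphabet can carry.

```python
import math


def chars_needed(bits: int, alphabet_size: int) -> int:
    """Minimum number of characters needed to carry `bits` bits."""
    return math.ceil(bits / math.log2(alphabet_size))


bits = 100 * 6  # 100 base64 characters at 6 bits each
print(bits, bits // 8)         # 600 75
print(chars_needed(bits, 64))  # 100
print(2 ** (bits // 20))       # 1073741824
```

The last line is the alphabet size that 20 characters would require: 2^30, roughly a billion distinct symbols, far more than Unicode defines.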
I am confused about the text encoding and charset. For many reasons, I have to
learn non-Unicode, non-UTF8 stuff in my upcoming work.
I find the word "charset" in email headers as in "ISO-2022-JP", but there's no
such encoding in text editors. (I looked around the different text editors.)
What's the difference between text encoding and charset? I'd appreciate it
if you could show me some use case examples.
Basically:
charset is the set of characters you can use
encoding is the way these characters are stored into memory
Every encoding has a particular charset associated with it, but there can be more than one encoding for a given charset. A charset is simply what it sounds like, a set of characters. There are a large number of charsets, including many that are intended for particular scripts or languages.
However, we are well along the way in the transition to Unicode, which includes a character set capable of representing almost all the world's scripts. There are multiple encodings for Unicode, though. An encoding is a way of mapping a string of characters to a string of bytes. Examples of Unicode encodings include UTF-8, UTF-16 BE, and UTF-16 LE. Each of these has advantages for particular applications or machine architectures.
Throwing more light for people visiting henceforth, hopefully it would be helpful.
Character Set
Each language has its characters, and the collection of those characters forms the “character set” of that language. When a character is encoded, it is assigned a unique identifier, a number called a code point. In a computer, these code points are represented by one or more bytes.
Examples of character set: ASCII (covers all English characters), ISO/IEC 646, Unicode (covers characters from all living languages in the world)
Coded Character Set
A coded character set is a set in which a unique number is assigned to each character. That unique number is called a "code point".
Coded character sets are sometimes called code pages.
Encoding
Encoding is the mechanism to map the code points with some bytes so that a character can be read and written uniformly across different system using same encoding scheme.
Examples of encoding: ASCII, Unicode encoding schemes like UTF-8, UTF-16, UTF-32.
Elaboration of above 3 concepts
Consider this - Character 'क' in Devanagari character set has a decimal code point of 2325 which will be represented by two bytes (09 15) when using the UTF-16 encoding
In the “ISO-8859-1” encoding scheme, “ü” (a character in the Latin character set) is represented by the hexadecimal value FC, while in “UTF-8” it is represented as C3 BC and in UTF-16 as FE FF 00 FC (where FE FF is the byte order mark).
Different encoding schemes may use the same code point to represent different characters. For example, in “ISO-8859-1” (also called Latin1) the decimal code point value for the letter ‘é’ is 233. However, in ISO 8859-5, the same code point represents the Cyrillic character ‘щ’.
On the other hand, a single code point in the Unicode character set can actually be mapped to different byte sequences, depending on which encoding was used for the document. The Devanagari character क, with code point 2325 (which is 915 in hexadecimal notation), will be represented by two bytes when using the UTF-16 encoding (09 15), three bytes with UTF-8 (E0 A4 95), or four bytes with UTF-32 (00 00 09 15)
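These byte values are easy to verify with Python's built-in codecs. The snippet below is a small sketch; the variable names are arbitrary, and the explicit big-endian codec names are used so that no byte order mark is added.

```python
ka = "\u0915"  # Devanagari letter KA
print(ord(ka))  # 2325

# One code point, three different byte sequences.
for enc in ("utf-8", "utf-16-be", "utf-32-be"):
    print(enc, ka.encode(enc).hex(" "))
# utf-8 e0 a4 95
# utf-16-be 09 15
# utf-32-be 00 00 09 15

# One byte value, two different characters.
b = bytes([233])
print(b.decode("iso-8859-1"))  # é
print(b.decode("iso-8859-5"))  # щ
```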
A character encoding consists of:
The set of supported characters
A mapping between characters and integers ("code points")
How code points are encoded as a series of "code units" (e.g., 16-bit units for UTF-16)
How code units are encoded into bytes (e.g., big-endian or little-endian)
Step #1 by itself is a "character repertoire" or abstract "character set", and #1 + #2 = a "coded character set".
But back before Unicode became popular and everyone (except East Asians) was using a single-byte encoding, steps #3 and #4 were trivial (code point = code unit = byte). Thus, older protocols didn't clearly distinguish between "character encoding" and "coded character set". Older protocols use charset when they really mean encoding.
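To make steps #2 to #4 concrete, here is a short Python sketch for a character outside the Basic Multilingual Plane, where code point, code units, and bytes all differ. The choice of U+1F600 is just an example picked for this illustration.

```python
import struct

ch = "\U0001F600"  # a character outside the Basic Multilingual Plane

# Step 2: character -> code point
print(hex(ord(ch)))  # 0x1f600

# Step 3: code point -> 16-bit code units (a surrogate pair in UTF-16)
be = ch.encode("utf-16-be")
units = struct.unpack(">2H", be)
print([hex(u) for u in units])  # ['0xd83d', '0xde00']

# Step 4: code units -> bytes, which depends on the byte order
print(be.hex(" "))                      # d8 3d de 00
print(ch.encode("utf-16-le").hex(" "))  # 3d d8 00 de
```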
A character set, or character repertoire, is simply a set (an unordered collection) of characters. A coded character set assigns an integer (a "code point") to each character in the repertoire. An encoding is a way of representing code points unambiguously as a stream of bytes.
Googled for it.
http://en.wikipedia.org/wiki/Character_encoding
The difference seems to be subtle. The term charset actually doesn't apply to Unicode. Unicode goes through a series of abstractions.
abstract characters -> code points -> encoding of code points to bytes.
Charsets actually skip this and directly jump from characters to bytes.
sequence of bytes <-> sequence of characters
In short,
encoding : code points -> bytes
charset: characters -> bytes
A charset is just a set; it either contains, e.g. the Euro sign, or else it doesn't. That's all.
An encoding is a bijective mapping from a character set to a set of integers. If it supports the Euro sign, it must assign a specific integer to that character and to no other.
In my opinion, a charset is part of an encoding (a component): an encoding has a charset attribute, so a charset can be used in many encodings. For example, Unicode is a charset used in encodings like UTF-8, UTF-16 and so on.
The char in charset doesn't mean the char type in the programming world. It means a character in the real world. In English it may be the same, but in other languages it is not. In Chinese, for example, '我' is an inseparable 'char' in charsets (Unicode, GB [used in GBK and GB2312]), and 'a' is also a char in charsets (ASCII, ISO-8859, Unicode).
In my opinion the word "charset" should be limited to identifying the parameter used in HTTP, MIME, and similar standards to specify a character encoding (a mapping from a series of text characters to a sequence of bytes) by name. For example: charset=utf-8.
I'm aware, though, that MySQL, Java, and other places may use the word "charset" to mean a character encoding.
An encoding is a mapping between bytes and characters from a character set, so it will be helpful to discuss and understand the difference between bytes and characters.
Think of bytes as numbers between 0 and 255, whereas characters are abstract things like "a", "1", "$" and "Ä". The set of all characters that are available is called a character set.
Each character has a sequence of one or more bytes that are used to represent it; however, the exact number and value of the bytes depends on the encoding used and there are many different encodings.
Most encodings are based on an old character set and encoding called ASCII which is a single byte per character (actually, only 7 bits) and contains 128 characters including a lot of the common characters used in US English.
For example, here are 6 characters in the ASCII character set that are represented by the values 60 to 65.
Extract of ASCII Table 60-65
╔══════╦═══════════╗
║ Byte ║ Character ║
╠══════╬═══════════╣
║  60  ║     <     ║
║  61  ║     =     ║
║  62  ║     >     ║
║  63  ║     ?     ║
║  64  ║     @     ║
║  65  ║     A     ║
╚══════╩═══════════╝
In the full ASCII set, the lowest value used is zero and the highest is 127 (both of these are hidden control characters).
However, once you start needing more characters than the basic ASCII provides (for example, letters with accents, currency symbols, graphic symbols, etc.), ASCII is not suitable and you need something more extensive. You need more characters (a different character set) and you need a different encoding as 128 characters is not enough to fit all the characters in. Some encodings offer one byte (256 characters) or up to six bytes.
Over time a lot of encodings have been created. In the Windows world, there is CP1252, or ISO-8859-1, whereas Linux users tend to favour UTF-8. Java uses UTF-16 natively.
One sequence of byte values for a character in one encoding might stand for a completely different character in another encoding, or might even be invalid.
For example, in ISO 8859-1, â is represented by one byte of value 226, whereas in UTF-8 it is two bytes: 195, 162. However, in ISO 8859-1, 195, 162 would be two characters, Ã, ¢.
When computers store data about characters internally or transmit it to another system, they store or send bytes. Imagine a system that opens a file or receives a message and sees the bytes 195, 162. How does it know what characters these are?
In order for the system to interpret those bytes as actual characters (and so display them or convert them to another encoding), it needs to know the encoding used. That is why encoding appears in XML headers or can be specified in a text editor. It tells the system the mapping between bytes and characters.
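The â example above can be reproduced directly. This is a minimal Python sketch showing that the same two bytes yield different text depending on which encoding the reader assumes; the variable name `raw` is arbitrary.

```python
raw = "â".encode("utf-8")
print(list(raw))  # [195, 162]

# Decoded with the encoding that produced the bytes: the original character.
print(raw.decode("utf-8"))  # â

# Decoded with the wrong encoding: two unrelated characters.
print(raw.decode("iso-8859-1"))  # Ã¢

# The same character needs only one byte in ISO 8859-1.
print(list("â".encode("iso-8859-1")))  # [226]
```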
I'm trying to get some legacy code to display Chinese characters properly. One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long (including the 0x7F byte). Does anyone know what kind of encoding this is and where I can find information for it? Thanks..
UPDATE:
I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long. It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application. However, if any other locale other than Japanese is selected, I cannot even view the filenames properly. So I'm guessing this encoding is not Unicode. Anyone know what it is? Is it ANSI? Is it Shift JIS?
For the Chinese one, I've tested it with Unicode and UTF-8 characters and I'm getting the same pattern; 0x7F followed by three bytes. Are Unicode and UTF-8 the same?
One character encoding I'm trying to work with starts with a 0x7F and is 4 bytes long
What are the other bytes? Do you have any Latin text in this encoding?
If it's “0x7f 0x... 0x00 0x00” you are looking at UTF-32LE. It could also be two UTF-16 (either LE or BE) characters.
Most East Asian encodings use 0x80-0xFF as lead bytes for non-ASCII characters; there is none I know of that would use a leading 0x7F as anything other than an ASCII delete.
ETA:
are there supposed to be Byte Order Marks?
There doesn't need to be a BOM if there is an out-of-band way of signalling that the encoding is ‘UTF-32LE’ (possibly one that is lost before it gets to you).
I've also had to work with some Japanese encoding that starts every character with a 0xE3 and is three bytes long.
That's surely UTF-8. Sequence 0xE3 0x... 0x... would result in a character between U+3000 and U+4000, which is where the hiragana/katakana live.
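The range claimed for the 0xE3 lead byte can be checked by decoding the smallest and largest three-byte sequences that start with it. A small Python sketch, with hiragana あ added as an example character chosen for this illustration:

```python
# Smallest and largest valid three-byte sequences with lead byte 0xE3.
lo = bytes([0xE3, 0x80, 0x80]).decode("utf-8")
hi = bytes([0xE3, 0xBF, 0xBF]).decode("utf-8")
print(hex(ord(lo)), hex(ord(hi)))  # 0x3000 0x3fff

# Hiragana A (U+3042) falls in that range and starts with 0xE3.
print("あ".encode("utf-8").hex(" "))  # e3 81 82
```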
It displays on my computer properly if I choose the Japanese locale in Windows, however, it doesn't display properly in our application.
Then chances are your application is one of the regrettable horde of non-Unicode-compliant apps, still using ‘A’(*) versions of the Win32 interfaces instead of the ‘W’-suffixed ones. Whether you can read in the string according to its real encoding is moot: a non-Unicode-compliant app will never be able to display an East Asian ideograph on a Western locale.
(*: named for “ANSI”, which is Windows's misleading term for “whatever the system codepage is set to at the moment”. That's why changing your locale affected it.)
ETA(2):
OK, cracked it. It's not any standardised encoding I've met before, but it's relatively easy to decipher if you assume the premise that Unicode code points are being encoded.
0x00-0x7E: plain ASCII
0x7F A B C: Unicode character
The character encoded in a Unicode escape can be calculated by taking the index in a key string of A, B and C and adding together:
A*0x1000 + B*0x40 + C
That is, it's a base-64 character set, but it's not the usual Base64 standard. A little experimentation gives a key string of:
.0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz
The ‘.’ and ‘_’ characters are guesses, since none of the characters you posted uses them. We'd need more data to find out the exact string.
So, for example:
0x7F 3 u g
A=4 B=58 C=44
4*0x1000 + 58*0x40 + 44 = 0x4EAC
U+4EAC = 京
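The decoding rule above could be sketched in Python as follows. The function name `decode_escaped` is made up for this example, and the key string carries the same caveat as above: the ‘.’ and ‘_’ positions are guesses, so treat this as an illustration of the scheme rather than a verified decoder.

```python
# Key string as reverse-engineered above; '.' and '_' are guesses.
KEY = ".0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"


def decode_escaped(data: bytes) -> str:
    """Decode ASCII text in which 0x7F introduces a three-character escape."""
    out = []
    i = 0
    while i < len(data):
        if data[i] == 0x7F:
            # Each escape character is a base-64 digit looked up in KEY.
            a, b, c = (KEY.index(chr(x)) for x in data[i + 1:i + 4])
            out.append(chr(a * 0x1000 + b * 0x40 + c))
            i += 4
        else:
            out.append(chr(data[i]))
            i += 1
    return "".join(out)


print(decode_escaped(b"Tokyo \x7f3ug"))  # Tokyo 京
```

A truncated escape or a character missing from the key string raises ValueError here, which seems preferable to silently producing the wrong code point.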
ETA(3):
Yeah, it should be easy to create a native Unicode string by sucking out each code point manually and joining as a character. Not quite sure what's available on whatever platform you're using, but any Unicode-capable platform should be able to make a string from codepoints simply (and hopefully without having to manually re-encode to UTF-16LE bytes).
I figured it must be Unicode codepoints by noticing that the three example characters had first escape-characters in the same general range, and in the same numerical order as their Unicode codepoints. The other two characters seemed to change randomly, so it was very likely a big-endian encoding of the code point, and probably a base-64 encoding as 6 is as many bits as you can get out of readable ASCII.
Standard Base64 itself starts with letters, which would put something starting with a number too far up to be in the Basic Multilingual Plane. So I started guessing with ‘0123456789ABCDEFG...’ which would be the other obvious choice of key string. That got resulting numbers that were close to the code points for the given characters, but a bit too low. Inserting an extra character at the start of the key string (so digit ‘0’ doesn't map to number 0) got one of the characters right and the other two very close; the one that was right had no lower-case letters, so to change only the lower-case letters I inserted another character between the upper and lower cases. This came up with the right numbers.
It's not guaranteed that this is actually right, but (apart from the arbitrary choice of inserted characters) it's very likely to be it.
You might want to look at the Chinese character encoding page on Wikipedia. The only encoding in there that I can see that is always 4 bytes is UTF-32.
GB 18030 is the current standard Chinese character set, but it can be 1 to 4 bytes long.
Try chardet. It does a good job of guessing the character encoding of a string of bytes.
Are Unicode and UTF-8 the same?
No. UTF-8 is just one way to represent Unicode characters as a sequence of bytes. Unicode is the full standard, assigning numeric and human-readable identifiers to each character, as well as lots of metadata about the characters.
It might be a valid Unicode encoding, such as a UTF-8 sequence or a UTF-16 surrogate pair.
Yes, the Chinese one is UTF-8, an implementation (encoding) of Unicode.
UTF-8 is 1 byte long for ASCII characters and up to 4 bytes for others.
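The variable length of UTF-8 is easy to see by encoding a few sample characters. A quick Python sketch; the four characters are arbitrary examples, one for each length.

```python
# One sample character for each UTF-8 sequence length.
samples = ["A", "é", "京", "\U0001F600"]
lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]
```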
Using ASCII encoding, how many characters are there in a GUID?
I'm interested in the Microsoft style, which includes the curly brackets and dashes.
From MSDN:
A GUID is a 128-bit value consisting
of one group of 8 hexadecimal digits,
followed by three groups of 4
hexadecimal digits each, followed by
one group of 12 hexadecimal digits.
The following example GUID shows the
groupings of hexadecimal digits in a
GUID:
6B29FC40-CA47-1067-B31D-00DD010662DA
From Wikipedia:
Often braces are added to enclose the
above format, as such:
{3F2504E0-4F89-11D3-9A0C-0305E82C3301}
So a total of 38 characters in the typical hexadecimal encoding with curly braces.
-Adam
TL;DR: None.
As Adam Davis stated, the Microsoft style is HEX encoding (with braces and dashes to make it more readable) that can be displayed using a subset of ASCII characters (0-9 and A-F), but this is not specifically ASCII encoding.
I guess it's important to remember that the Microsoft style of displaying GUIDs is only a representation of a GUID, which is actually a 16-byte integral value (as Micheal Trausch stated).
You can also present it in different, more compact ways by converting the bytes into a different character set (like ASCII).
Theoretically you can display each byte as an extended ASCII character (256 possible values), which would allow you to save a GUID as a 16-character string.
It wouldn't be very readable though because it would include whitespace characters (CR, space, tab, etc.) and other special characters, so this would only make sense if you want to efficiently save a GUID in a non-human-readable character format, for example in a database that doesn't natively support GUIDs or fast matching of small binary values:
http://en.wikipedia.org/wiki/Extended_ASCII
IMHO the most readable way to display a GUID more compactly would be to use Base64 encoding, which allows you to save it in a string with a length of 22 characters (24 with the trailing == padding), and would make it look like this:
7v26IM9P2kmVepd7ZxuXyQ==
But as Jeff Atwood states on his site, you can also push a GUID into an ASCII85 encoded string with 20 characters:
[Rb*hlkkXVW+q4s(YSF0
For more inspiration, see:
http://www.codinghorror.com/blog/2005/10/equipping-our-ascii-armor.html
As Adam mentioned from the MSDN quote, UUIDs are 128-bit values. This means that they take 16 bytes of RAM to hold a value. A text representation will take 32 bytes (two hexadecimal characters for each byte), plus the 4 hyphens, plus the two brackets if you want to include those; this amounts to 38 bytes.
Just keep in mind that if you are exposing UUIDs to users of your software, they may provide the UUID with or without the brackets. If you're storing the value anywhere, it's best to store it as the 16-byte binary representation. If you are interoperating with other UUID implementations, you may want to use the basic text format for interoperability, since different implementations do different things to the order of bytes when storing a binary UUID value.
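The sizes mentioned in these answers can be compared side by side with Python's standard `uuid` and `base64` modules. This is only a sketch using the example GUID quoted from Wikipedia above; the variable names are arbitrary.

```python
import base64
import uuid

g = uuid.UUID("3F2504E0-4F89-11D3-9A0C-0305E82C3301")

sizes = {
    "raw bytes": len(g.bytes),
    "hex digits only": len(g.hex),
    "with dashes": len(str(g)),
    "with dashes and braces": len("{" + str(g) + "}"),
    "base64, padded": len(base64.b64encode(g.bytes)),
    "base64, padding stripped": len(base64.b64encode(g.bytes).rstrip(b"=")),
    "ascii85": len(base64.a85encode(g.bytes)),
}
for name, size in sizes.items():
    print(name, size)
# raw bytes 16
# hex digits only 32
# with dashes 36
# with dashes and braces 38
# base64, padded 24
# base64, padding stripped 22
# ascii85 20
```

Note that `base64.a85encode` abbreviates an all-zero 4-byte group to a single `z`, so the Ascii85 length can drop below 20 for some values.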
The length depends on the encoding. You can get the standard encoding and length with this snippet:
using System;

public static class Program
{
    public static void Main()
    {
        var guid = Guid.Empty;
        Write(guid, "N"); // 32 characters
        Write(guid, "D"); // 36 characters (default)
        Write(guid, "B"); // 38 characters
        Write(guid, "P"); // 38 characters
        Write(guid, "X"); // 68 characters
    }

    // Formats the GUID with the given format specifier and prints its length.
    private static void Write(Guid guid, string format)
    {
        var guidString = guid.ToString(format);
        Console.WriteLine("{0}: {1} ({2} characters)", format, guidString, guidString.Length);
    }
}
See the Guid.ToString method for details: