Can someone explain Encoding.Unicode.GetBytes("hello") for me?

My code:
string input1;
input1 = Console.ReadLine();
Console.WriteLine("byte output");
byte[] bInput1 = Encoding.Unicode.GetBytes(input1);
for (int x = 0; x < bInput1.Length; x++)
Console.WriteLine("{0} = {1}", x, bInput1[x]);
outputs:
0 = 104
1 = 0
2 = 101
3 = 0
4 = 108
5 = 0
6 = 108
7 = 0
8 = 111
9 = 0
for the input "hello"
Is there a reference to the character map where I can make sense of this?

You should read "The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)" at http://www.joelonsoftware.com/articles/Unicode.html
You can find a list of all Unicode characters at http://www.unicode.org but don't expect to be able to read the files there without learning a lot about text encoding issues.

At http://www.unicode.org/charts/ you can find all the Unicode code charts. http://www.unicode.org/charts/PDF/U0000.pdf shows that the code point for 'h' is U+0068. (Another great tool for viewing this data is BabelMap.)
The exact details of UTF-16 encoding can be found at http://unicode.org/faq/utf_bom.html#6 and http://www.ietf.org/rfc/rfc2781.txt. In short, U+0068 is encoded (in UTF-16LE) as 0x68 0x00. In decimal, this is the first two bytes you see: 104 0.
The other characters are encoded similarly.
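To see this directly, here is a quick C# sketch added for illustration (not part of the original answer): Encoding.Unicode is UTF-16LE, so every BMP character becomes two bytes with the low-order byte first, and a code point above U+FFFF becomes a surrogate pair, i.e. four bytes.
using System;
using System.Text;

class Utf16Demo
{
    static void Main()
    {
        // "hello" is all BMP characters: each becomes <low byte> <high byte>.
        byte[] hello = Encoding.Unicode.GetBytes("hello");
        Console.WriteLine(BitConverter.ToString(hello)); // 68-00-65-00-6C-00-6C-00-6F-00

        // U+1D538 (MATHEMATICAL DOUBLE-STRUCK CAPITAL A) is outside the BMP,
        // so UTF-16 encodes it as a surrogate pair and GetBytes returns four bytes.
        byte[] doubleStruckA = Encoding.Unicode.GetBytes(char.ConvertFromUtf32(0x1D538));
        Console.WriteLine(BitConverter.ToString(doubleStruckA)); // 35-D8-38-DD
    }
}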
Finally, a great reference (when trying to understand the various Unicode specifications), apart from the Unicode Standard itself, is the Unicode Glossary.

Related

How to truncate a 2's complement output

I have data written into a short data type. The data is in 2's complement form.
When I try to print the data using %04x, values with MSB=0 print fine: e.g. if data=740, the print I get is 0740.
But when the MSB=1, I cannot get a proper print: e.g. if data=842, the print I get is fffff842.
I want the data truncated to 4 bytes, so the expected output is f842.
Either declare your data as a type which is 16 bits long, or make sure the printing function uses the right format for a 16-bit value. Or use your current type, but do a bitwise AND with 0xffff. What you can do really depends on the language you're using.
But whichever way you go, check your assumptions again. There seem to be a few issues in your question:
2's complement applies to signed numbers only. There are no negative numbers in your question.
Assuming you mean C's short - it doesn't have to be 16 bits long.
"I get is fffff842 I want the data truncated to 4 bytes" - fffff842 is 4 bytes long, f842 is 2 bytes long.
The 2-byte value 842 does not have its MSB set.
I'm assuming C (or possibly C++) as the language here.
Because of the default argument promotions involved when calling a variadic function (such as printf), your short argument undergoes the integer promotions, which state that "If an int can represent all values of the original type (as restricted by the width, for a bit-field), the value is converted to an int".
A short is converted to an int by means of sign extension, and 0xf842 sign-extended to 32 bits is 0xfffff842.
You can use a bitwise AND to mask off the most significant word:
printf("%04x", data & 0xffff);
You could also add the h length specifier to state that you only want to print an (unsigned) short worth of bits from an int:
printf("%04hx", data);

F# .Net portable subset Unicode issues

OK, I've made an F# portable library project in VS2012, and I have some integers that represent UTF-32 encoded characters, e.g. 0x0001D538, which is a double-struck A. Normally, to turn this into a UTF-16 surrogate pair you would use System.Char.ConvertFromUtf32(i), job done. However, Microsoft have kindly decided not to include this method in the .NET portable subset (it runs fine in the interactive window, which must be running the full .NET). So, what should I do instead to get my favorite surrogate pairs from these integers? They need to be integers because I do some arithmetic on them. Waiting for the next version of things to come out is a viable option.
Here's a quick translation of the C# from Reflector. Can you use this?
type System.Char with
    static member ConvertFromUtf32(utf32) =
        if utf32 < 0 || utf32 > 0x10ffff || (utf32 >= 0xd800 && utf32 <= 0xdfff) then
            invalidArg "utf32" "Out of range"
        elif utf32 < 0x10000 then
            // Basic Multilingual Plane: a single UTF-16 code unit
            new String(char utf32, 1)
        else
            // Supplementary plane: split into a high/low surrogate pair
            let utf32 = utf32 - 0x10000
            new String([| char ((utf32 / 0x400) + 0xd800); char ((utf32 % 0x400) + 0xdc00) |])

Extract the first letter of a UTF-8 string with Lua

Is there any way to extract the first letter of a UTF-8 encoded string with Lua?
Lua does not properly support Unicode, so string.sub("ÆØÅ", 2, 2) will return "?" rather than "Ø".
Is there a relatively simple UTF-8 parsing algorithm I could use on the string, byte by byte, for the sole purpose of getting the first letter of the string, be it a Chinese character or an A?
Or is this way too complex, requiring a huge library, etc.?
You can easily extract the first letter from a UTF-8 encoded string with the following code:
function firstLetter(str)
    return str:match("[%z\1-\127\194-\244][\128-\191]*")
end
This works because a UTF-8 character either begins with a byte from 0 to 127, or with a byte from 194 to 244 followed by one or several bytes from 128 to 191.
You can even iterate over UTF-8 code points in a similar manner:
for code in str:gmatch("[%z\1-\127\194-\244][\128-\191]*") do
    print(code)
end
Note that both examples return a string value for each letter, and not the Unicode code point numerical value.
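For comparison only, here is an added C# sketch (not one of the original answers) applying the same lead-byte rule by hand to the raw bytes: take one lead byte, then keep consuming continuation bytes in the range 0x80-0xBF.
using System;
using System.Text;

class FirstUtf8Sequence
{
    // Returns the first UTF-8 encoded character of s, using the same rule
    // as the Lua pattern: one lead byte followed by any continuation bytes.
    static string FirstLetter(string s)
    {
        byte[] bytes = Encoding.UTF8.GetBytes(s);
        if (bytes.Length == 0) return "";

        int len = 1;
        while (len < bytes.Length && bytes[len] >= 0x80 && bytes[len] <= 0xBF)
            len++;

        return Encoding.UTF8.GetString(bytes, 0, len);
    }

    static void Main()
    {
        Console.WriteLine(FirstLetter("ÆØÅ"));   // Æ
        Console.WriteLine(FirstLetter("hello")); // h
    }
}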
Lua 5.3 provides a UTF-8 library.
You can use utf8.codes to get each code point, and then use utf8.char to get the character:
local str = "ÆØÅ"
for _, c in utf8.codes(str) do
    print(utf8.char(c))
end
This also works:
local str = "ÆØÅ"
for w in str:gmatch(utf8.charpattern) do
    print(w)
end
where utf8.charpattern is just the string "[\0-\x7F\xC2-\xF4][\x80-\xBF]*", a pattern that matches one UTF-8 byte sequence.

How to encode 32-bit Unicode characters in a PowerShell string literal?

This Stack Overflow question deals with 16-bit Unicode characters. I would like a similar solution that supports 32-bit characters. See this link for a listing of the various Unicode charts. For example, one range of characters that requires 32 bits is the Musical Symbols block.
The answer in the question linked above doesn't work because it casts the System.Int32 value as a System.Char, which is a 16-bit type.
Edit: Let me clarify that I don't particularly care about displaying the 32-bit Unicode character, I just want to store the character in a string variable.
Edit #2: I wrote a PowerShell snippet that uses the info in the marked answer and its comments. I would have wanted to put this in another comment, but comments can't be multi-line.
$inputValue = '1D11E'
$hexValue = [int]"0x$inputValue" - 0x10000
$highSurrogate = [int][math]::Floor($hexValue / 0x400) + 0xD800  # floor, not round-to-nearest
$lowSurrogate = $hexValue % 0x400 + 0xDC00
$stringValue = [char]$highSurrogate + [char]$lowSurrogate
Dour High Arch still deserves credit for the answer for helping me finally understand surrogate pairs.
IMHO, the most elegant way to use Unicode literals in PowerShell is
[char]::ConvertFromUtf32(0x1D11E)
See my blog post for more details.
Assuming PowerShell uses UTF-16, code points above U+FFFF are represented as surrogate pairs. For example, U+10000 is represented as:
0xD800 0xDC00
That is, two 16-bit chars; hex D800 and DC00.
Good luck finding a font with surrogate chars.
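For reference, here is the same conversion in C# (an added sketch; char.ConvertFromUtf32 is the framework method that the [char]::ConvertFromUtf32(...) call above binds to), showing both the built-in call and the surrogate arithmetic from the question's snippet:
using System;

class SurrogateDemo
{
    static void Main()
    {
        // Framework method: returns a two-char string for code points above U+FFFF.
        string s = char.ConvertFromUtf32(0x1D11E); // U+1D11E MUSICAL SYMBOL G CLEF
        Console.WriteLine("{0:X4} {1:X4}", (int)s[0], (int)s[1]); // D834 DD1E

        // The same surrogate arithmetic done by hand.
        int offset = 0x1D11E - 0x10000;
        char high = (char)(0xD800 + (offset >> 10));   // 0xD834
        char low  = (char)(0xDC00 + (offset & 0x3FF)); // 0xDD1E
        Console.WriteLine(new string(new[] { high, low }) == s); // True
    }
}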
FYI: If anyone wants to store surrogate pairs in a Case Sensitive HashTable, this seems to work:
$NCRs = new-object System.Collections.Hashtable
$NCRs['Yopf'] = [string]::new(([char]0xD835, [char]0xDD50))
$NCRs['yopf'] = [string]::new(([char]0xD835, [char]0xDD6A))
$NCRs['Yopf']
$NCRs['yopf']
Outputs:
𝕐
𝕪

Encoding of ... some sort?

Forgive me if this has been asked before, but I assure you I've scoured the internet and have turned up nothing, probably because I don't have the right terminology.
I would like to take an integer and convert it to a little-endian(?) hex representation like this:
303 -> 0x2f010000
I can see that the bytes are packed such that the 16's and 1's places are both in the same byte, and that the 4096's place and 256's place share a byte. If someone could just point me to the right terminology for such encoding, I'm sure I could find my answer on how to do it. Thanks!
Use bit shift operators combined with bitwise AND and OR operators...
Assuming 32-bit unsigned:
unsigned int value = 303;
unsigned int result = 0;
for (int i = 0; i < 4; i++)
{
    /* take byte i of value and move it to the mirrored byte position (3 - i) */
    result |= ((value >> (i * 8)) & 0xFFu) << ((3 - i) * 8);
}
Big-endian and little-endian refer to the order of the bytes in memory. A value like 0x2f010000 has no intrinsic endianness; endianness depends on the CPU architecture.
If you always want to reverse the order of the bytes in a 32-bit value, use the code that Demi posted.
If you always want to get the specific byte order (because you're preparing to transfer those bytes over the network, or store them to a disk file), use something else. E.g. the BSD sockets library has a function htonl() that takes your CPU's native 32-bit value and puts it into big-endian order.
If you're running on a little-endian machine, htonl(303) == 0x2f010000. If you're running on a big-endian machine, htonl(303) == 303. In both cases the result will be represented by the bytes [0x00, 0x00, 0x01, 0x2f] in memory.
If anyone can put a specific term to what I was trying to do, I'd still love to hear it. I did, however, find a way to do what I needed, and I'll post it here so that anyone who comes looking after me can find it. There may be (probably is) an easier, more direct way to do it, but here's what I ended up doing in VB.Net to get back the bytecode I wanted:
Private Function Encode(ByVal original As Integer) As Byte()
    Dim twofiftysixes As Integer = CInt(Math.Floor(original / 256))
    Dim sixteens As Integer = CInt(Math.Floor((original - (256 * twofiftysixes)) / 16))
    Dim ones As Integer = original Mod 16
    Dim bytecode As Byte() = {CByte((16 * sixteens) + ones), CByte(twofiftysixes), 0, 0}
    Return bytecode
End Function
Effectively, this breaks the integer up into its hex components, then converts the appropriate pairs with CByte.
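In .NET there is also a built-in way to get the little-endian byte image of an integer. A C# sketch for comparison (assuming the goal is the byte sequence 2F 01 00 00 for 303, as in the question): BitConverter.GetBytes returns the bytes in the machine's own order, so on a big-endian machine you would reverse them.
using System;

class LittleEndianBytes
{
    static void Main()
    {
        byte[] bytes = BitConverter.GetBytes(303); // machine byte order

        if (!BitConverter.IsLittleEndian)
            Array.Reverse(bytes);                  // force little-endian ordering

        Console.WriteLine(BitConverter.ToString(bytes)); // 2F-01-00-00
    }
}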