Software should only work with Unicode strings internally, converting to a particular encoding on output.
-- Python Docs
The above quote is from the Python docs. Python has a Unicode string type, so this makes sense. Go doesn't have Unicode strings; its strings are just immutable byte slices. What would be the equivalent advice for Go?
Would it be to convert text to UTF-8 on entry to the program, store it as UTF-8 internally, and then output UTF-8?
Generally speaking, in Go you write a []byte, for example when using the ioutil package's WriteFile function: https://golang.org/pkg/io/ioutil/#WriteFile
So yes, the answer is that you explicitly declare the encoding. Since a string is just a byte slice, it has no inherent encoding; however, string literals in Go source are UTF-8. If you haven't already read Rob Pike's Go blog post on strings, bytes and runes, it's worth the time: https://blog.golang.org/strings
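A minimal sketch of that idea (the file name out.txt is just a placeholder): because the literal is UTF-8 in the source, writing its bytes out produces UTF-8 with no conversion step.

package main

import (
	"io/ioutil"
	"log"
)

func main() {
	// Go source is UTF-8, so this literal's bytes are already UTF-8.
	s := "héllo, 世界"

	// Writing the string is just writing those bytes; no re-encoding happens.
	if err := ioutil.WriteFile("out.txt", []byte(s), 0644); err != nil {
		log.Fatal(err)
	}
}

Converting from some other encoding on input is the part you do explicitly, for example with the golang.org/x/text/encoding packages.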
So I'm trying to open and parse some old Visual Studio compilation log files with Tcl; my only problem is the files are in a strange encoding. Upon examining them with Notepad++ it seems they are in the 'UCS-2 Little Endian' encoding. Two questions:
Is there any command in Tcl that allows me to look at the character encoding of a file? I know there is encoding system, which tells me the system encoding.
Using encoding names Tcl tells me the available encoding names are the following list:
cp860 cp861 cp862 cp863 tis-620 cp864 cp865 cp866 gb12345 gb2312-raw cp949 cp950 cp869 dingbats ksc5601 macCentEuro cp874 macUkraine jis0201 gb2312 euc-cn euc-jp macThai iso8859-10 jis0208 iso2022-jp macIceland iso2022 iso8859-13 jis0212 iso8859-14 iso8859-15 cp737 iso8859-16 big5 euc-kr macRomania macTurkish gb1988 iso2022-kr macGreek ascii cp437 macRoman iso8859-1 iso8859-2 iso8859-3 macCroatian koi8-r iso8859-4 ebcdic iso8859-5 cp1250 macCyrillic iso8859-6 cp1251 macDingbats koi8-u iso8859-7 cp1252 iso8859-8 cp1253 iso8859-9 cp1254 cp1255 cp850 cp1256 cp932 identity cp1257 cp852 macJapan cp1258 shiftjis utf-8 cp855 cp936 symbol cp775 unicode cp857
Given this, what would be the appropriate name to use in the fconfigure -encoding command to read these UCS-2 Little Endian encoded files and convert them to UTF-8 for use? If I understand the fconfigure command correctly, I need to specify the encoding of the source file rather than the one I want; I just don't know which of the options in the above list corresponds to UCS-2 Little Endian. After reading a little, I see that UCS-2 is a predecessor of the UTF-16 character encoding, but that option isn't in the list either.
Thanks!
I'm afraid there's currently no way to do it just by using fconfigure -encoding ?something?: the unicode encoding has a rather vague meaning, and there's a feature request to create explicit support for UTF-16 variants.
What can you do about it?
Since unicode in Tcl running on Windows should mean UTF-16 with native endianness¹ (little-endian on Wintel), if a quick-and-dirty solution is acceptable, just try using -encoding unicode and see if that helps.
If you're aiming at a more bullet-proof, future-proof, or cross-platform solution, I'd switch the channel to binary mode, read the contents in chunks of two bytes at a time, and then use
binary scan $twoBytes s n
to scan the sequence of two bytes in $twoBytes as a 16-bit integer into a variable named "n", followed by something like
set c [format %c $n]
to produce a unicode character out of the number in $n, and assign it to a variable.
This approach requires a bit more trickery to get right:
You might check the very first character obtained from the stream to see if it's a byte-order-mark, and drop it if it is.
If you need to process the stream in a line-wise manner, you'd have to implement a little state machine that would handle the CR+LF sequences correctly.
When doing your read $channelId 2 to get the next character, check whether it returned 0, 1, or 2 bytes; a result of 1 means the file is truncated or corrupted, and you should handle that case.
The UCS-2 encoding differs from UTF-16 in that the latter may contain so-called surrogate pairs, and hence it is not a fixed-length encoding. Handling a UTF-16 stream properly therefore also means detecting those surrogate pairs. On the other hand, I hardly believe a compilation log produced by MSVS would contain them, so I'd just assume it's encoded in UCS-2LE.
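A minimal sketch of that two-bytes-at-a-time approach, assuming a little-endian file named build.log (the name is just a placeholder) and ignoring line splitting:

set f [open "build.log" r]
fconfigure $f -translation binary
set text ""
while {1} {
    set twoBytes [read $f 2]
    set len [string length $twoBytes]
    if {$len == 0} { break }
    if {$len == 1} { error "truncated UCS-2 stream" }
    binary scan $twoBytes s code
    set code [expr {$code & 0xFFFF}]                  ;# undo the sign of the 16-bit scan
    if {$text eq "" && $code == 0xFEFF} { continue }  ;# drop a leading byte-order mark
    append text [format %c $code]
}
close $f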
¹ The true story is that the only thing Tcl guarantees about the textual strings it handles (that is, those obtained by manipulating text, not via binary format or encoding convertto or reading a stream in binary mode) is that they're Unicode (or, rather, the "BMP" part of it).
But technically, the interpreter might switch the internal representation of any string between the UTF-8 encoding it uses by default and some fixed-length encoding, which is what is referred to by the name "unicode". The "problem" is that no part of the Tcl documentation specifies that internal fixed-length encoding, because you're required to explicitly convert any text you output or read to/from some specific encoding: either by configuring the stream, or by using encoding convertfrom and encoding convertto, or by using binary format and binary scan. The interpreter will do the right thing no matter which precise encoding it's currently using for your source string value; it's all transparent. Moreover, the next release of the "standard" Tcl interpreter might decide to drop this internal feature completely, or, say, use 32-bit or 64-bit integers for that internal fixed-length encoding. Whatever "non-standard" interpreters (like Jacl etc.) do is also up to them. In other words, this feature is internal and is not part of the documented contract about the interpreter's behaviour. And by the way, the "standard" encoding for Tcl strings (UTF-8) is not specified as such either; it's just an implementation detail.
In Tcl v8.6.8 I could solve the same issue with fconfigure channelId -encoding unicode.
I'm working with a binary file that references another file using absolute paths.
The path contains both japanese and ascii characters.
The length of the string is given, so I can just read that many bytes and convert it into a string.
However the problem is trying to convert the string. If I specify the encoding as ascii, it'll fail on the japanese characters. If I specify it as japanese encoding (shift-jis or something), it won't read the english characters properly.
One byte is used for each ascii character, while two bytes are used for each japanese character.
What is the fastest and cleanest way to convert these bytes into a string? The encodings are known. Will the same technique work in older versions of Python?
This sounds like you have fallen victim to a misunderstanding of the basics of Unicode and encodings. It may be that you have not, but such misunderstandings are common and understandable, while the situation you describe is not.
A string of bytes that contains mixed encodings is, by definition, invalid in any of those encodings. If this really were the case, you would have to split the byte string into its parts and decode every part separately. In this case it would probably mean splitting on the path separators, so it would be reasonably easy, but in other cases it would not. However, I seriously doubt that this is the case, as it would mean that your source is insane. That happens, but it is unlikely. :-)
If the source gives you one path as a byte string, it is most likely that this string uses only one encoding. It may contain both Japanese and ASCII characters and still be using one encoding. The most common encodings that can handle both Japanese and ASCII are UTF-8 and UTF-16. My guess is that your source uses one of those. In fact, since you write "One byte is used for each ascii character, while two bytes are used for each japanese character", it is probably UTF-8. It could also be Shift JIS, but it seems you already tried that.
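As a quick sanity check of the single-encoding idea, a made-up byte string mixing ASCII and Japanese decodes in one call once the right codec is picked (the path below is purely illustrative):

# -*- coding: utf-8 -*-
# A hypothetical path: ASCII plus two Japanese characters, UTF-8 encoded.
raw = b"C:\\logs\\\xe3\x83\xad\xe3\x82\xb0.txt"

text = raw.decode("utf-8")   # one decode call handles both scripts
print(text)                  # C:\logs\ログ.txt

# Python 2 has the same .decode method on str objects, so the same call
# works there and returns a unicode object.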
If not, please explain what your source is, and give examples of the byte strings (in ASCII/HEX) that you are given.
I'm having a little trouble getting erlang to give me a unicode string.
Here's what works:
io:format("~ts~n", [<<226,132,162>>]).
™
ok
But instead of printing to the console, I want to assign it to a variable. So I thought:
T = lists:flatten(io_lib:format("~ts~n", [<<226,132,162>>])).
T.
[8482,10]
How can I get T in the io_lib example to contain the ™ symbol so I can write it to a network stream?
Instead of assigning the flattened version to a variable for sending on the network, can you rewrite the code that sends over the network to accept the binary in the first place and use the formatted-write mechanism ~ts when writing to the socket?
That would also let you avoid the lists:flatten, which isn't needed for the built-in IO mechanisms.
It does contain the trademark symbol: as you can see here, 8482 is its code point. It isn't printed as ™ in the shell because the shell prints as strings only those lists which contain printable character codes in Latin-1. So [8482, 10] is a Unicode string (a list of code points). If you want to convert it to a different encoding, use the unicode module.
The first thing is knowing what you need to do; then you can adapt your code accordingly.
Erlang represents Unicode strings as lists of code points. Unicode code points are integers, not bytes. Since you can only send bytes over the network, things like Unicode strings need to be encoded into byte sequences by the sending side and decoded by the receiving side. UTF-8 is the most used encoding for Unicode strings, and that's what your binary is: the UTF-8 encoding of the Unicode string composed of the code point 8482.
What you get out of the io_lib:format call is the Erlang string representation of that code point plus the newline character.
A very reasonable way to send Unicode strings over the network is to encode them in UTF-8. Don't use io_lib:format for that, though: unicode:characters_to_binary/1 is the function meant to transform Unicode strings into UTF-8 encoded binaries.
On the receiving side (and probably in your whole application) you'll have to decide how to handle the strings: either as encoded binaries (or lists) or as plain Unicode lists. But over the network the only choice is binaries (or iolists, which are possibly deep lists of bytes), and I'd bet the most reasonable encoding for your application is UTF-8.
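A minimal sketch of that, assuming an already-opened gen_tcp socket (the module and function names are just illustrative):

-module(send_utf8).
-export([send/2]).

%% String is a list of Unicode code points, e.g. [8482] for "™".
send(Socket, String) ->
    Utf8 = unicode:characters_to_binary(String),  % encodes to UTF-8 by default
    gen_tcp:send(Socket, Utf8).

Calling send(Sock, [8482]) would then put the bytes <<226,132,162>> on the wire.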
What is the best way to (losslessly) convert Unicode to a lower-order byte encoding (8 bits), in a language-independent way? I want a format that is standard, i.e. has widespread library support for conversion in both directions.
If I were using Python, I would use repr:
In [1]: x = u"Российская Федерация"
In [2]: repr(x)
Out[2]: "u'\\xd0\\xa0\\xd0\\xbe\\xd1\\x81\\xd1\\x81\\xd0\\xb8\\xd0\\xb9\\xd1\\x81\\xd0\\xba\\xd0\\xb0\\xd1\\x8f \\xd0\\xa4\\xd0\\xb5\\xd0\\xb4\\xd0\\xb5\\xd1\\x80\\xd0\\xb0\\xd1\\x86\\xd0\\xb8\\xd1\\x8f'"
However, I'm looking for a format that has good library support for converting the second string back to the first, in a variety of languages.
Out[2]: "u'\xd0\xa0\xd0\xbe\xd1\x81\xd1\x81\xd0\xb8\xd0\xb9\xd1\x81\xd0\xba\xd0\xb0\xd1\x8f \xd0\xa4\xd0\xb5\xd0\xb4\xd0\xb5\xd1\x80\xd0\xb0\xd1\x86\xd0\xb8\xd1\x8f'"
If that's what you see, your terminal is set up wrong: it's treating UTF-8 input as ISO-8859-1 (or cp1252 in the case of the Windows console, which isn't possible to set up right).
The proper Python repr of Российская Федерация would be the Unicode literal:
u'\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u0430\u044f \u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u044f'
Which as it happens is pretty close to the JavaScript/JSON string literal
"\u0420\u043e\u0441\u0441\u0438\u0439\u0441\u043a\u0430\u044f \u0424\u0435\u0434\u0435\u0440\u0430\u0446\u0438\u044f"
If you want a 7-bit-safe (ASCII) representation of a Unicode string, JSON is a reasonable choice of format. Get it by using json.dumps() though rather than hacking the Python repr, since there are some subtle inconsistencies between the two formats.
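For instance, a small round trip (a sketch; json.dumps escapes non-ASCII characters by default):

# -*- coding: utf-8 -*-
import json

x = u"Российская Федерация"
ascii_form = json.dumps(x)        # '"\u0420\u043e..."' -- 7-bit-safe ASCII
restored = json.loads(ascii_form)
assert restored == x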
Other well-understood ASCII representations you could try might include URL-encoding (%D0%A0%D0%BE...) and XML character escapes (<value>Рос...</value>).
If you only need an arbitrary binary representation that doesn't need to be 7-bit safe, as Max mentioned, just .encode('utf-8').
UTF-8, UTF-16 and UTF-32 are all standard. UTF-8 is perhaps the most common on the Internet; UTF-16 is used internally by Windows and Java. Any language with Unicode support will have encoding and decoding functions for all of them. In Python you can use the .encode method of unicode strings and the .decode method of byte strings to convert between them.
If you need something that's 7-bit clean (no 8th bits set), there's also UTF-7.
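To illustrate (a sketch using Python 2 syntax to match the question; drop the u prefix on Python 3):

# -*- coding: utf-8 -*-
x = u"Российская Федерация"

# Round-trip through each of the standard transformation formats.
for codec in ("utf-8", "utf-16", "utf-32", "utf-7"):
    data = x.encode(codec)        # bytes in that encoding
    assert data.decode(codec) == x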
Based on the link below, I'm confused as to whether the Lua programming language supports Unicode.
http://lua-users.org/wiki/LuaUnicode
It appears it does, but with limitations. I simply don't understand: are the limitations anything big/key, or not a big deal?
You can certainly store unicode strings in lua, as utf8. You can use these as you would any string.
However Lua doesn't provide any default support for higher-level "unicode aware" operations on such strings—e.g., counting string length in characters, converting lower-to-upper-case, etc. Whether this lack is meaningful for you really depends on what you intend to do with these strings.
Possible approaches, depending on your use:
If you just want to input/output/store strings, and generally use them as "whole units" (for table indexing etc), you may not need any special handling at all. In this case, you just treat these strings as binary blobs.
Due to utf8's clever design, some types of string manipulation can be done on strings containing utf8 and will yield the correct result without taking any special care.
For instance, you can append strings, split them apart before/after ASCII characters, etc. As an example, if you have the string "開発.txt" and you search for "." in it using string.find(string_var, "%.") (escaping the dot, since string.find takes a pattern), and then split it with the normal string.sub function into "開発" and ".txt", the resulting strings are correct UTF-8 strings even though you're not using any kind of "unicode-aware" algorithm.
Similarly, you can do case-conversions on only the ASCII characters in strings (those with the high bit zero), and treat the rest of the strings as binary without screwing them up.
Some utf8-aware operations are so simple that it's easy to just write one's own functions to do them.
For instance, to calculate the length in Unicode characters of a string, just count the number of bytes with the high bit zero (ASCII characters) and the number of bytes whose top two bits are 11 ("leading bytes" of non-ASCII characters); the length is the sum of those two (see the sketch after this list).
For more complex operations—e.g., case-conversion on non-ASCII characters, etc.—you'll probably have to use a Lua unicode library, such as those on the (previously mentioned) Lua-users Unicode page
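That byte-counting idea fits in a few lines of plain Lua (a sketch; it assumes the input is valid UTF-8 and works on Lua 5.1 and later):

-- Count characters in a UTF-8 string: ASCII bytes (high bit 0) plus
-- leading bytes of multi-byte sequences (top two bits 11).
local function utf8_len(s)
  local n = 0
  for i = 1, #s do
    local b = string.byte(s, i)
    if b < 0x80 or b >= 0xC0 then
      n = n + 1
    end
  end
  return n
end

print(utf8_len("開発.txt"))  --> 6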
Lua does not have any support for unicode (other than accepting any byte value in strings). The library slnunicode has a lot of unicode string functions, however. For example unicode.utf8.len.
(note: this answer is completely stolen from grom's comment on another question - I just think it deserves its own answer)
If you want a short answer, it is 'yes and no' as put on the linked site.
Lua supports Unicode in the sense that specifying, storing and querying arbitrary byte values in strings is supported, so you can store a string in any Unicode encoding in a Lua string.
What is not supported is iteration by Unicode character; there is no standard function for string length in Unicode characters, etc. So the higher-level kind of Unicode support (like what is available in Python: length, lower-to-upper-case conversion, encoding to arbitrary codings, etc.) is not available.
Lua 5.3 has now been released. It comes with a basic UTF-8 library.
You can use the utf8 library to work with UTF-8 encoded strings, for example getting the length of a UTF-8 string in characters (not bytes, as string.len gives) or iterating over characters (not bytes), etc.
It doesn't provide support beyond the encoding level; for instance, it can't tell you whether a given character is a Chinese character.
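For example (a sketch; utf8.len, utf8.codes and utf8.char are part of the standard Lua 5.3 library):

local s = "開発.txt"

print(#s)            --> 10  (bytes, via the ordinary length operator)
print(utf8.len(s))   --> 6   (characters)

-- Iterate over characters rather than bytes.
for _, cp in utf8.codes(s) do
  io.write(utf8.char(cp), " ")
end
io.write("\n")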
It supports it in the sense that you can use Unicode in Lua strings. It depends specifically on what you're planning to do, but most of the limitations can be fairly easily worked around by extending Lua with your own functions.