Related
Using SSIS, we are extracting names and addresses from one system and providing them to a downstream system via a vendor-specific file format that only accepts UTF-8 and parses the data based on character positions, so it expects each row to be an exact length.
Many users have umlauts, apostrophes or accents in their names and addresses.
These characters do not translate well to UTF-8, showing up as xD3, xE1 and similar.
Because one character is now replaced with three, the row length is incorrect and the upload fails.
Is there a way to represent characters with accents and umlauts in UTF-8?
We can change them in the source system, but that means the spelling is now technically incorrect.
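To see why the upload breaks, note that an accented character is still one character but occupies more than one byte once encoded as UTF-8, so byte positions and character positions no longer line up. A minimal Erlang shell sketch (the example name Müller is just an assumption):

1> Name = "Müller".
"Müller"
2> string:length(Name).   % character count
6
3> byte_size(unicode:characters_to_binary(Name)).   % UTF-8 byte count
7

If the downstream parser counts bytes rather than characters, every accented character shifts the remaining columns, which matches the failure described above.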
In the Lua 5.3 reference manual, we can see:
Lua is also encoding-agnostic; it makes no assumptions about the contents of a string.
I can't understand what this sentence means.
The same byte value in a string may represent different characters depending on the character encoding used for that string. For example, the same value \177 may represent ▒ in Code page 437 encoding or ± in Windows 1252 encoding.
Lua makes no assumption as to what the encoding of a given string is, and the ambiguity needs to be resolved at the script level; in other words, your script needs to know whether to treat the byte sequence as a Windows-1252, Code page 437, UTF-8, or otherwise encoded string.
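This ambiguity is easy to demonstrate in any language; as a sketch (using an Erlang shell here rather than Lua, purely for illustration), the same two bytes decode differently depending on the encoding you assume:

1> unicode:characters_to_list(<<208,144>>, latin1).
[208,144]
2> unicode:characters_to_list(<<208,144>>, utf8).
[1040]

Read as Latin-1 the bytes are two separate characters; read as UTF-8 they are the single Cyrillic letter А (code point 1040). The bytes themselves carry no encoding information, so the script has to supply it.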
Essentially, a Lua string is a counted sequence of bytes. If you use a Lua string for binary data, the concept of character encodings is not relevant and does not interfere with the binary data. In that way, a string is encoding-agnostic.
There are functions in the standard string library that treat string values as text, that is, an uncounted sequence of characters. There is no text but encoded text. An encoding maps a member of a character set to a sequence of bytes. A string holds the bytes for zero or more such encoded characters. To understand a string as text, you must know the character set and the encoding. To use the string functions, the encoding should be compatible with os.setlocale().
I am trying to convert a UTF-8 string to a Unicode (code point) list with the Erlang library "unicode". My input data is the string "АБВ" (a Russian string whose correct Unicode representation is [1040,1041,1042]), encoded in UTF-8. When I run the following code:
1> unicode:characters_to_list(<<208,144,208,145,208,146>>,utf8).
[1040,1041,1042]
it returns the correct value, but the following:
2> unicode:characters_to_list([208,144,208,145,208,146],utf8).
[208,144,208,145,208,146]
does not. Why does this happen? As I read in the specification, the input data can be either a binary or a list of chars, so it seems to me I am doing everything right.
The signature of the function is unicode:characters_to_list(Data, InEncoding); it expects Data to be either a binary containing a string encoded in InEncoding, or a possibly deep list of characters (code points) and of binaries encoded in InEncoding. It returns a list of Unicode characters. Characters in Erlang are integers.
When you call unicode:characters_to_list(<<208,144,208,145,208,146>>, utf8) or unicode:characters_to_list([1040,1041,1042], utf8), it correctly decodes the Unicode string (yes, the second call is a no-op as long as Data is a list of integers). But when you call unicode:characters_to_list([208,144,208,145,208,146], utf8), Erlang thinks you are passing a list of six characters; since integers in the list are already taken as code points, the output is exactly the same list.
There is no byte type in Erlang, but you assume that unicode:characters_to_list/2 will accept a list of bytes and behave correctly; it cannot tell a list of bytes apart from a list of code points.
To sum up: there are two usual ways to represent a string in Erlang, binaries (bitstrings) and lists of characters. unicode:characters_to_list(Data, InEncoding) takes a string Data in either of these representations (or a combination of them), encoded in InEncoding, and converts it to a list of Unicode code points.
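Because Data may mix the two representations, a call such as the following is also valid (a shell sketch; the integer is a code point while the binary is decoded as UTF-8):

1> unicode:characters_to_list([1040, <<208,145,208,146>>], utf8).
[1040,1041,1042]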
If you have a list such as [208,144,208,145,208,146] from your example, you can convert it to a binary using erlang:list_to_binary/1 and then pass it to unicode:characters_to_list/2, i.e.
1> unicode:characters_to_list(list_to_binary([208,144,208,145,208,146]), utf8).
[1040,1041,1042]
The unicode module supports only Unicode and Latin-1. Thus (since the function expects code points of Unicode or Latin-1), characters_to_list does not need to do anything with the list in the case of a flat list of code points. However, the list may be deep, e.g. unicode:characters_to_list([[1040],1041,<<1042/utf8>>]). That is the reason the list datatype is supported for the Data argument.
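Run in the shell, that deep-list example flattens and decodes to the same code point list:

1> unicode:characters_to_list([[1040],1041,<<1042/utf8>>]).
[1040,1041,1042]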
<<208,144,208,145,208,146>> is a UTF-8 binary.
[208,144,208,145,208,146] is a list of bytes (not code points).
[1040,1041,1042] is a list of code points.
You are passing a list of bytes, but the function wants a list of chars or a binary.
I am gathering information from a Hebrew (WINDOWS-1255 / UTF-8 encoding) website using VBScript and the WinHttp.WinHttpRequest.5.1 object.
For example:
Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
...
'writes the file as unicode (can't use Ascii)
Set Fileout = FSO.CreateTextFile("c:\temp\myfile.xml", true, true)
....
Fileout.WriteLine(objWinHttp.responsetext)
When viewing the file in Notepad / Notepad++, I see the Hebrew as gibberish.
For example:
äìëåú - äøá àáøäí éåñó - îåøùú
I need a VBScript function that returns the Hebrew correctly. The function should behave like the converter at http://www.pixiesoft.com/flip/: choose the 2nd radio button and press the Convert button, and you will see the Hebrew correctly.
Your script is correctly fetching the byte stream and saving it as-is. No problems there.
Your problem is that the local text editor doesn't know that it's supposed to read the file as cp1255, so it tries your machine's default of cp1252. You can't save the file locally as cp1252 so that Notepad will read it correctly, because cp1252 doesn't include any Hebrew characters.
What is ultimately going to read the file or byte stream and needs to pick up the Hebrew correctly? If it does not support cp1255, you will need to find an encoding that it does support and convert the cp1255 string to that encoding. I suggest you try UTF-8 or UTF-16LE (the encoding Windows misleadingly calls 'Unicode').
Converting text between encodings in VBScript/JScript can be done as a side-effect of an ADODB stream. See the example in this answer.
Thanks to Charming Bobince (who posted the answer), I am now able to see the Hebrew correctly (saving windows-1255 encoded text to a txt file and opening it in Notepad) by implementing the following:
Function ConvertFromUTF8(sIn)
    Dim oIn: Set oIn = CreateObject("ADODB.Stream")
    oIn.Open
    'Write the misdecoded text back out as raw ANSI bytes
    oIn.CharSet = "X-ANSI"
    oIn.WriteText sIn
    oIn.Position = 0
    'Re-read those same bytes as Windows-1255 (Hebrew)
    oIn.CharSet = "WINDOWS-1255"
    ConvertFromUTF8 = oIn.ReadText
    oIn.Close
End Function
Why does {html, "доуч"++[1076,1086,1091,1095]} in a Yaws page give me the following error:
Yaws process died: {badarg,[{erlang,list_to_binary,
[[[[208,180,208,190,209,131,209,135,1076,
1086,1091,1095]],
...
"доуч" = [1076,1086,1091,1095] -> gives me exact match, but how yaws translate 2-byte per elem list in two times longer list with 1 byte per elem for "доуч", but doesnt do it for [1076,1086,1091,1095]. Is there some internal represintation of unicode data involed?
I want to output lists like [1076,1086,1091,1095] to web pages, but it crashes.
Erlang source files only support the ISO-LATIN-1 charset. The Erlang console can accept Unicode characters, but to enter them inside a source code file, you need to use this syntax:
K = "A weird K: \x{a740}".
See http://www.erlang.org/doc/apps/stdlib/unicode_usage.html for more info.
You have to do the following to make it work:
{html, "доуч"++ binary_to_list(unicode:characters_to_binary([1076,1086,1091,1095]))}
Why does it fail?
In a bit more detail, list_to_binary fails because it is trying to convert each item in the list to a byte, which it cannot do, because each value in [1076,1086,1091,1095] would take more than one byte.
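You can reproduce this directly in the shell (the exact message varies by OTP release):

1> list_to_binary([1076,1086,1091,1095]).
** exception error: bad argument
     in function  list_to_binary/1
        called as list_to_binary([1076,1086,1091,1095])

Every element of the list must fit in a single byte (0..255), and these code points do not.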
What is going on?
[1076,1086,1091,1095] is the pure Unicode (code point) representation of "доуч". Yaws tries to convert the string (list) into a binary directly using list_to_binary and thus fails. Since each Unicode character can take more than one byte, we need to encode the string into bytes first. This can be done using:
unicode:characters_to_binary([1076,1086,1091,1095]).
<<208,180,208,190,209,131,209,135>>
This can now be safely converted back and forth between list and binary representations. See the unicode module documentation for more details.
You can convert back to unicode as follows:
unicode:characters_to_list(<<208,180,208,190,209,131,209,135>>).
[1076,1086,1091,1095]