I create the node-opcua address space from nodeset.xml files by filling the server_options.nodeset_filename array with the filenames to load. This works fine.
Now I wanted to load the nodeset "Opc.Ua.Ijt.Tightening.NodeSet2.xml" from the companion specification for tightening (https://github.com/OPCFoundation/UA-Nodeset/blob/v1.04/IJT/Tightening/Opc.Ua.Ijt.Tightening.NodeSet2.xml) and noticed that some descriptions are cut off when I connect to the server and read them with an OPC UA client. For example, UAVariable NodeId="ns=1;i=6094" contains the field 'Error' with the description '0 – OK ...'.
The '–' in '0 – OK' is a UTF-8 encoded character (an en dash, not an ASCII hyphen) in the nodeset XML.
After some investigation I found the call fs.readFile(xmlFile, "ascii", (err, xmlData: string) in the exported function readNodeSet2XmlFile: https://github.com/node-opcua/node-opcua/blob/master/packages/node-opcua-address-space/source_nodejs/generate_address_space.ts#:~:text=fs.readFile(xmlFile%2C%20%22ascii%22%2C%20(err%2C%20xmlData%3A%20string)
The OPC UA specification states 'All String values are encoded as a sequence of UTF-8 characters'.
Questions:
Does node-opcua really read nodeset XML files as 'ascii', or am I misinterpreting the code?
Is there a way to force node-opcua to use 'utf-8' encoding when reading nodesets?
This issue is now fixed in node-opcua. There is no longer any problem reading a nodeset2.xml file that contains UTF-8 special characters.
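To illustrate why an "ascii" read can damage such a description, here is a minimal Python sketch. It assumes the behaviour described in the Node.js Buffer documentation, where decoding with 'ascii' clears the highest bit of every byte; the helper name node_ascii_decode is hypothetical and this is not node-opcua code. Under that assumption the three bytes of the en dash turn into 'b', a NUL and a control character, and the NUL may be what makes a client show the text as cut off.

```python
def node_ascii_decode(data: bytes) -> str:
    """Mimic (as an assumption) Node.js 'ascii' decoding: clear the high bit of each byte."""
    return bytes(b & 0x7F for b in data).decode("ascii")


description = "0 \u2013 OK"        # '0 – OK' with an en dash, as in the nodeset
raw = description.encode("utf-8")  # the en dash becomes the bytes E2 80 93

as_ascii = node_ascii_decode(raw)  # what an 'ascii' read would produce
as_utf8 = raw.decode("utf-8")      # what a 'utf-8' read produces

print(repr(as_ascii))
print(repr(as_utf8))
```

Reading the file as UTF-8 keeps the en dash as a single character, which matches the behaviour reported after the fix.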
I have UTF-8 data in a MySQL table. I Base64 encode it as it is read from the table and transport it to a web page via PHP and AJAX. JavaScript Base64 decodes it as it is inserted into the HTML. The page receiving it is declared to be UTF-8.
My problem is that if I insert the Base64 decoded data (using atob()) into the page, any two bytes that make up a single UTF-8 character are presented as two separate Unicode code points. I have to use "decodeURIComponent(escape(window.atob(data)))" (learned from another question on this forum, thank you) to get the characters to be represented correctly, and what this process does is convert the two UTF-8 bytes to a single value equal to the Unicode code point for the char (which is also the same char under ISO 8859).
In short, to get the UTF-8 data correctly rendered in a UTF-8 page, the bytes have to be converted to their Unicode code point/ISO 8859 values.
An example:
The Unicode code point for lowercase e-acute is \u00e9. The UTF-8 encoding of this character is \xc3\xa9. The test word is née.
The following images show what is rendered for various decodings of my Base64 encoding of this word: first plain atob(), then adding escape() to the process, then further adding decodeURIComponent(). I show the console reporting the output of each, as well as three INPUT fields populated with the three outputs ("record[6]" contains the Base64 encoded data). First the code:
console.log(window.atob(record[6]));
console.log(escape(window.atob(record[6])));
console.log(decodeURIComponent(escape(window.atob(record[6]))));
jQuery("#b64-1").val(window.atob(record[6]));
jQuery("#b64-2").val(escape(window.atob(record[6])));
jQuery("#b64-3").val(decodeURIComponent(escape(window.atob(record[6]))));
Copying and pasting the two versions of née into a hex editor reveals what has happened.
Clearly, the two bytes from the atob() decoding are the correct values for UTF-8 e-acute (\xc3\xa9), but they are initially rendered not as a single UTF-8 char but as two individual chars: C3 (uppercase A tilde) and A9 (copyright sign). The next two steps convert those two chars to the single code point for e-acute, \u00e9.
So decodeURIComponent() obviously recognises the two bytes as a single UTF-8 character (because it changes them to E9), but the browser does not.
Can anyone explain to me why this needs to happen in a page declared to be UTF-8?
(I am using Chrome on W10-64)
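The behaviour can be reproduced outside the browser. The sketch below is a Python analogy, not the page's code: atob() returns a "binary string" with one character per decoded byte, which corresponds to decoding the bytes as Latin-1, and the escape()/decodeURIComponent() pair corresponds to turning those characters back into bytes and decoding them as UTF-8. The variable names are illustrative only.

```python
import base64

word = "n\u00e9e"  # 'née'
encoded = base64.b64encode(word.encode("utf-8")).decode("ascii")

raw = base64.b64decode(encoded)  # the original UTF-8 bytes: 6E C3 A9 65

# Rough equivalent of atob(): one character per byte, no UTF-8 decoding.
as_binary_string = raw.decode("latin-1")

# Rough equivalent of decodeURIComponent(escape(...)):
# map each character back to a byte, then decode the bytes as UTF-8.
repaired = as_binary_string.encode("latin-1").decode("utf-8")

print(encoded)
print(as_binary_string)
print(repaired)
```

Seen this way, the page's UTF-8 declaration plays no part: by the time the string reaches the DOM it is already a sequence of characters, and atob() created two of them for the two bytes of e-acute.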
Right now there aren't really any books on Red since it is so new, so I am trying to follow along with an old Rebol book and salvage what I can from it.
I have found a few commands, such as read, where I can't execute the code because of the file encoding.
save %/c/users/abagget/desktop/bay.jpg read http://rebol.com/view/bay.jpg
Access Error: invalid UTF-8 encoding: #{FFD8FFE0}
In Rebol this^ would have been read/binary and write/binary
>> write %/c/alex.txt read http://google.com
*** Access Error: invalid UTF-8 encoding: #{A050726F}
Is there a way to convert incoming content to UTF-8 so I can do the read?
Or are there other types of read that handle non-UTF-8?
In Rebol this^ would have been read/binary and write/binary
In Red too. save is for converting a Red datatype to a serialized text or binary format, so if you want to save to a JPEG file, you need to provide an image! value. read fetches text content (limited to UTF-8 for now), so your usage is invalid. The proper line should be:
write/binary %/c/users/abagget/desktop/bay.jpg read/binary http://rebol.com/view/bay.jpg
Is there a way to convert incoming content to UTF-8 so I can do the read?
To obtain a string from a non-UTF-8 text resource, you need to fetch the resource as binary and then write a poor man's converter, which should work fine for the common Latin-1 encoding:
bin-to-string: function [bin [binary!]][
    text: make string! length? bin
    foreach byte bin [append text to char! byte]
    text
]
Using it from the console:
>> bin-to-string read/binary http://google.com
== {<!doctype html><html itemscope="" itemtype="http://schema.org...
Red will provide proper converters for commonly used text encodings in the future. In the meantime, you can use such a function, or write a proper decoder (using a conversion table) for the encodings you use most often.
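The converter above works because Latin-1 maps each byte value directly to the Unicode code point with the same number. As a cross-check in another language, a minimal Python sketch of the same byte-to-character mapping (the function name mirrors the Red one for illustration only):

```python
def bin_to_string(data: bytes) -> str:
    """Map every byte to the code point with the same value (Latin-1 decoding)."""
    return "".join(chr(byte) for byte in data)


sample = b"caf\xe9"  # 'café' encoded as Latin-1
print(bin_to_string(sample))

# The hand-written loop matches the built-in Latin-1 decoder.
print(bin_to_string(sample) == sample.decode("latin-1"))
```

For any other single-byte encoding the direct mapping no longer holds, which is where the conversion table mentioned above comes in.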
I'm using DB-IP.com to get city names from IP addresses. Many of these are international cities, with special characters in the names.
As an example, one of these cities is Wężarów in Poland. Checking the JSON return in the console or opening the request URL directly, it's being returned from DB-IP as "W\u0119\u017car\u00f3w" with a Content-Type of text/javascript;charset=UTF-8. This is rendered in the browser as WÄ™Å¼arÃ³w, and it is also saved in my MySQL database as WÄ™Å¼arÃ³w (which I've tried with both utf8 and latin1 encoding).
I'm OK with saving it in the DB in another format, as long as I can convert it back to Wężarów for display in the browser. I've tried encoding and decoding to/from several formats, even just to display directly on the screen (ignoring the DB entirely). I'm completely confused about what I need to do here to get it into a readable format.
I'm working with Perl; however, if I can figure out what I need to do with the encoding/decoding/charset (as I'm currently clueless), I'm sure I can work it out from there.
It looks like the UTF-8 encoded string was interpreted by the browser as if it were Windows-1252. Here's how I deduced it:
% python3
>>> s = "W\u0119\u017car\u00f3w"
>>> b = bytes(s, encoding='utf-8')
>>> b
b'W\xc4\x99\xc5\xbcar\xc3\xb3w'
>>> str(b, encoding='utf-8')
'Wężarów'
>>> str(b, encoding='latin-1')
'WÄ\x99Å¼arÃ³w'
>>> str(b, encoding='windows-1252')
'WÄ™Å¼arÃ³w'
If you're not good with Python, what I'm doing here is encoding the string "W\u0119\u017car\u00f3w" into UTF-8, yielding the byte sequence 'W\xc4\x99\xc5\xbcar\xc3\xb3w'. Decoding that with UTF-8 yielded 'Wężarów', confirming that this is the correct UTF-8 encoding of the string you want. So I took a guess that the browser is using the wrong encoding to render it, and decoded it using Latin-1. That gave me something very close, so I looked up Latin-1 and noticed that it's named as the basis for Windows-1252. Decoding again as Windows-1252 gives the result you saw.
What's gone wrong here is that the browser can't tell what encoding to use to render the page, and it's guessing wrong. You need to fix this by telling it explicitly to use UTF-8. Here's a page by the W3C that describes how to do that. Essentially what you need to do is add an HTML <meta> element to the document head. If you also set an HTTP header with the encoding name in it, make sure they are consistent.
(In Firefox, while you're debugging, you can go to View -> Character Encoding to set the encoding on a page-by-page basis. I assume other browsers have the same feature.)
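For strings that have already been stored in their garbled form, the same reasoning can be run in reverse. This is a minimal sketch, assuming the damage is exactly one round of UTF-8 bytes misread as Windows-1252; the function name repair_mojibake is hypothetical, and text damaged in other ways (for example double encoding, or bytes that Windows-1252 leaves undefined) will not round-trip.

```python
def repair_mojibake(text: str) -> str:
    """Undo one round of 'UTF-8 bytes decoded as Windows-1252'."""
    return text.encode("windows-1252").decode("utf-8")


# The garbled city name, written with escapes so the source file stays ASCII.
garbled = "W\u00c4\u2122\u00c5\u00bcar\u00c3\u00b3w"
fixed = repair_mojibake(garbled)

print(garbled)
print(fixed)
```

Declaring the correct encoding at every hop (HTTP header, HTML, database connection) remains the real fix; a repair step like this only helps with rows that were saved while the declaration was wrong.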
The question says it all. Is it possible to transfer a UTF-8 file over FTP using ASCII mode? Or will this cause the characters to be written incorrectly? Thanks!
UTF-8 encoding was designed to be backward compatible with ASCII encoding.
RFC 959 requires FTP clients and servers to treat files in ASCII mode as 8-bit data:
3.1.1.1. ASCII TYPE
...
The sender converts the data from an internal character
representation to the standard 8-bit NVT-ASCII
representation (see the Telnet specification). The receiver
will convert the data from the standard form to his own
internal form.
In accordance with the NVT standard, the <CRLF> sequence
should be used where necessary to denote the end of a line
of text. (See the discussion of file structure at the end
of the Section on Data Representation and Storage.)
...
Using the standard NVT-ASCII representation means that data
must be interpreted as 8-bit bytes.
So even a UTF-8-unaware FTP client or server should correctly translate line endings, as these are encoded identically in ASCII and UTF-8, and it should not corrupt the other characters.
From a practical point of view: I haven't met a server that has problems with 8-bit text files. I'm Czech, so I regularly work with UTF-8 and, in the past, with the Windows-1250 and ISO/IEC 8859-2 8-bit encodings.
RFC 2640, from 1999, updates the FTP protocol to support internationalization. It requires FTP servers to use UTF-8 as the transfer encoding in section 2.2. So as long as you aren't trying to upload to a DEC TOPS-20 server (which stores five 7-bit bytes within a 36-bit word), you should be fine.
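The line-ending argument can be checked with a small sketch. It models only the LF to CRLF translation that an ASCII-mode sender performs on raw bytes (the helper name to_nvt_line_endings is hypothetical, and no FTP library is involved), and relies on the fact that every byte of a multi-byte UTF-8 sequence is 0x80 or above, so it can never be mistaken for CR (0x0D) or LF (0x0A).

```python
def to_nvt_line_endings(data: bytes) -> bytes:
    """Normalise line endings to CRLF, working purely on bytes."""
    return data.replace(b"\r\n", b"\n").replace(b"\n", b"\r\n")


# Czech sample text with many non-ASCII characters, written with escapes.
text = "P\u0159\u00edli\u0161 \u017elu\u0165ou\u010dk\u00fd k\u016f\u0148\n\u00fap\u011bl\n"
local = text.encode("utf-8")
wire = to_nvt_line_endings(local)

# Every byte belonging to a non-ASCII character has the high bit set.
high_bytes_only = all(
    b >= 0x80 for ch in text if ord(ch) > 0x7F for b in ch.encode("utf-8")
)

print(high_bytes_only)
print(wire.decode("utf-8") == text.replace("\n", "\r\n"))
```

A transfer that strips the high bit or assumes 7-bit data would break this, which is why the 8-bit requirement quoted above matters.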
I am analyzing this Metasploit module, and I am wondering which encoding method payload.encoded uses by default in Metasploit.
I did a print payload.encoded in that exploit (without setting any encoder), and I get a normal-looking string like:
PYIIIIIIIIIIQZVTX30VX4AP0A3HH0A00ABAABTAAQ2AB2BB0BBXP........
The module has an encoder option, but it's commented out.
I am used to seeing payloads encoded with the standard hex values, like:
\xd9\xf7\xbd\x0f\xee\xaa\x47.......
Could someone help me understand where that string returned by payload.encoded comes from and what encoding it uses?
It turns out the first one was an alpha_upper encoded payload; the second is just binary data written out as hex escapes.
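The difference between the two representations can be made concrete with a short sketch. It does not use any Metasploit API; the helper names are hypothetical, and the byte values are simply the visible prefixes of the two samples quoted in the question. The first helper checks whether a payload consists only of uppercase letters and digits, which is the visible property of the alpha_upper output, and the second prints arbitrary bytes in the \x.. notation.

```python
def is_upper_alnum(data: bytes) -> bool:
    """True if every byte is an ASCII digit or uppercase letter."""
    return data.isalnum() and data == data.upper()


def hex_escape(data: bytes) -> str:
    """Render raw bytes as a printable string of \\xNN escapes."""
    return "".join(f"\\x{b:02x}" for b in data)


alpha_sample = b"PYIIIIIIIIIIQZVTX30VX4AP0A3HH0A00ABAABTAAQ2AB2BB0BBXP"
raw_sample = bytes([0xD9, 0xF7, 0xBD, 0x0F, 0xEE, 0xAA, 0x47])

print(is_upper_alnum(alpha_sample))
print(is_upper_alnum(raw_sample))
print(hex_escape(raw_sample))
```

In other words, the hex form is only a way of displaying bytes, while the alphanumeric form is itself the payload, restricted to a printable character set.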