I have this vbscript code:
Set fs = CreateObject("Scripting.FileSystemObject")
Set ts = fs.OpenTextFile("tmp.txt", 2, True)
for i = 128 to 255
s = chr(i)
if lenb(s) <> 2 then
wscript.echo i
wscript.quit
end if
ts.write s
next
ts.close
On my system, each integer is converted to a double byte character: there are no numbers in that range that cannot be represented by a character, and no number requires more than 2 bytes.
But when I look at the file, I find only 127 bytes.
This answer: https://stackoverflow.com/a/31436726/1335492 suggests that the FSO creates UTF files and inserts a BOM. But the file contains only 127 bytes, and no Byte Order Mark.
How does FSO decide how to encode text? What encoding allows 8-bit single-byte characters? Which encodings do not include all 255 8-bit single-byte characters?
(Answers about how FSO reads characters may also be interesting, but that's not what I'm specifically asking here)
Edit: I've limited my question to the high-bit characters, to make it clear what the question is. (Answers about the low-bit characters may also be interesting, but that's not what I'm specifically asking here)
Short Answer:
The file system object maps "Unicode" to "ASCII" using the code page associated with the System Locale. (Chr and ChrW use the User Locale.)
Application:
There may be silent transposition errors between the System code page and the Thread (user) code page. There may also be coding and decoding errors if code points are missing from a code page, or, as with Japanese and UTF-8, the code pages contain multi-byte characters.
VBscript provides no native method to detect the User, Thread, or System code page. The Thread (user) code page may be inferred from the Locale set by SetLocale or returned by GetLocale (there is a list here: https://www.science.co.il/language/Locale-codes.php), but there does not appear to be any MS documentation. On Win2K+, WMI may be used to query the System code page. The CHCP command queries and changes the OEM codepage, which is neither the User nor the System code page.
The system code page may be spoofed by an application manifest. There is no way for an application (such as cscript or wscript) or a script (such as VBScript or JScript) to change its parent's system code page, except by creating a new process with a new manifest, or by rebooting the system after making a registry change.
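For reference, here is a minimal C sketch (an illustration only, not something VBScript can call directly) that prints the three distinct code-page values this answer talks about; the output depends on the machine's settings:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    wchar_t buf[16];

    /* System ("ANSI") code page - the one this answer says FSO maps to */
    printf("System ANSI code page: %u\n", GetACP());

    /* Default OEM code page (the console's initial code page) */
    printf("Default OEM code page: %u\n", GetOEMCP());

    /* Default ANSI code page of the User Locale (the "Time and Date" setting) */
    if (GetLocaleInfoW(LOCALE_USER_DEFAULT, LOCALE_IDEFAULTANSICODEPAGE, buf, 16))
        printf("User locale code page: %ls\n", buf);

    return 0;
}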
In detail:
s = chr(i)
'creates a Unicode string, using the Thread Locale Codepage.
Code points that do not exist as characters are mapped as control characters: 127 becomes U+007F (which is a standard Unicode control character), 128 becomes U+20AC (the Euro symbol), and 129 becomes U+0081 (a code point in a Unicode control-character region). In VBScript, the Thread Locale can be set and read with SetLocale and GetLocale.
createobject("Scripting.FileSystemObject").OpenTextFile(strOutFile, 2, True).write s
'creates a 'code page' string, using the System Locale Codepage.
There are two ways that Windows can handle Unicode values it can't map: it can either map to a default character, or return an error. "Scripting.FileSystemObject" uses the error setting, and throws an exception.
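The same choice is visible at the Win32 level. A minimal C sketch of the mechanism (my illustration of the API, not FSO's actual internals; code page 1252 is hard-coded so the result is predictable): WideCharToMultiByte reports, through lpUsedDefaultChar, that a character could not be mapped, which is the condition the FileSystemObject surfaces as an error:

#include <windows.h>
#include <stdio.h>

int main(void)
{
    /* U+0101 (a with macron) has no slot in code page 1252 */
    const wchar_t wide[] = L"\u0101";
    char narrow[8];
    BOOL usedDefault = FALSE;

    int n = WideCharToMultiByte(1252, WC_NO_BEST_FIT_CHARS,
                                wide, -1, narrow, sizeof(narrow),
                                NULL, &usedDefault);

    if (n == 0 || usedDefault)
        printf("not representable in CP1252 (the 'error' case)\n");
    else
        printf("mapped to byte 0x%02X\n", (unsigned char)narrow[0]);

    return 0;
}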
In More Detail:
The Thread Locale is, by default, the User Locale, which is the date and time format setting in the "Region and Language" control panel applet (called different things in different versions of Windows). It has an associated code page. According to MS internationalization expert Michka (Michael Kaplan, RIP), the reason it has a code page is so that months and days of the week can be written in appropriate characters, and it should not be used for any other purpose.
The ASP-classic people clearly had other ideas, since Response.CodePage is thread-locale, and can be controlled by vbscript GetLocale and SetLocale amongst other methods. If the User Locale is changed, all processes are notified, and any thread that is using the default value updates. (I haven't tested what happens to a thread currently using a non-default value).
The System Locale is also called "Language for non-Unicode programs" and is also found in the "Region and Language" applet, but requires a reboot to change. This is the value used internally by Windows ("the System") to map between the "A" API and the "W" API. Changing this has no effect on the language of the Windows GUI (that is not a "non-Unicode program").
Assuming that the "Time and Date" setting matches the "Language for non-Unicode programs", any Chr(i) that can create a valid Unicode code point (see "mapping errors" below), will map back exactly from Unicode to "code page". Note that this does work for code points that are "control characters": also note that it doesn't work the other way: UTF-CodePage-UTF doesn't always round-trip exactly. Famously (Character,Modifer)-CodePage-(Complex Character) does not round-trip correctly, where Unicode defines more than one way of constructing a language character representation.
If the "Time and Date" does not match the "Language for non-Unicode programs", any translation could take place, for example U+0101 is 0xE0 on cp28594 and 0xE2 on cp28603: Chr(224) would go through U+0101 to be written as 226.
Even if there are not transposition errors, if the "Time and Date" does not match the "Language for non-Unicode programs" the program may fail when translating to the System Locale: if the Unicode code point does not have a matching Code Page code point, there will be an exception from the FileSystemObject.
There may also be mapping errors at Chr(i), going from Code page to Unicode. Code page 1041 (Japanese) is a double-byte code page (probably Shift JIS). 0x81 is (only) the first byte of a double-byte pair. To be consistent with other code pages, 0x81 should map to the control character 0081, but when given 81 and code page 1041, Windows assumes that the next byte in the buffer, or in the BSTR, is the second byte of the double-byte pair (I've not determined if the mistake is made before or after the conversion). Chr(&H81) is mapped to U+xx81 (81,xx). When I did it, I got U+4581, which is a CJK Unified Ideograph (Brasenia purpurca): it's not mapped by code page 1041.
Mapping errors at Chr(i) do not cause VBScript exceptions at the point of creation. If the UTF-16 code point created is invalid or not on the System Locale code page, there will be a FileSystemObject exception at .write. This particular problem can be avoided by using ChrW(i) instead of Chr(i). On code page 1041, ChrW(129) becomes the Unicode control character U+0081 instead of xx81.
Background:
A program can map between Unicode and "codepage" using any installed code page: the Windows functions MultiByteToWideChar and WideCharToMultiByte take [UINT CodePage] as the first parameter. That mechanism is used internally in Windows to map the "A" API to the "W" API, for example GetAddressByNameA and GetAddressByNameW. Windows is "W", (wide, 16 bit) internally, and "A" strings are mapped to "W" strings on call, and back from "W" to "A" on return. When Windows does the mapping, it uses the code page associated with the "System Locale", also called "Language for non-Unicode programs".
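As a sketch of that thunking pattern (illustrative only, not Windows' actual source; the fixed-size buffers just keep the example short):

#include <windows.h>

/* Rough shape of an "A" wrapper around a "W" API: the narrow strings are
   widened with the system ("ANSI") code page, CP_ACP, before the real call. */
int MyMessageBoxA(HWND hwnd, const char *text, const char *caption, UINT type)
{
    wchar_t wText[256], wCaption[256];

    MultiByteToWideChar(CP_ACP, 0, text,    -1, wText,    256);
    MultiByteToWideChar(CP_ACP, 0, caption, -1, wCaption, 256);

    return MessageBoxW(hwnd, wText, wCaption, type);
}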
The Windows API function WriteFile writes bytes, not characters, so it's not an "A" or "W" function. Any program that uses it has to handle conversion between strings and bytes. The C function fwrite writes characters, so it can handle 16-bit characters, but it has no way of handling variable-length code points like UTF-8 or UTF-16: again, any program that uses fwrite has to handle conversion between strings and words.
The C++ function fwrite can handle UTF, and the compiler function _fwrite does magic that depends on the compiler. Presumably, on Windows, if code page translation is required the MultiByteToWideChar and WideCharToMultiByte API is used.
The "A" code pages and the "A" API were called "ANSI" or "ASCII" or "OEM", and started out as 8 bit characters, then grew to double-byte characters, and have now grown to UTF-8 (1..3 bytes). The "W" API started out as 16 bit characters, then grew to UTF-16 (1..6 bytes). Both are multi-word character encodings: the distinction is that for the "A" API and code pages, the word length is 8 bits: for the "W" API and UTF-16, the word length is 16 bits. Because they are both multi-byte mappings, and because "byte" and "word" and "char" and "character" mean different things in different contexts, and because "W" and particularly "A" mean different things than they did years ago, I've just use "A" and "W" and "code page" and "Unicode".
"OEM" is the code page associated with another locale: The Console I/O API. It is per-process (it's a thread locale), it can be changed dynamically (using the CHCP command) and its default value is set at installation: there is no GUI provided to change the value stored in the registry. Most console programs don't use the console I/O API, and as written, use either the system locale, or the user locale, or, (sometimes inadvertently), a mixture of both.
The System Locale can be spoofed by using a manifest and there was a WinXP utility called "AppLocale" that did the same thing.
FSO decides how to encode text when the file is opened. Use the format argument as follows:
Set ts = fs.OpenTextFile("tmp.txt", 2, True, -1)
' ↑↑
Resource: OpenTextFile Method
Syntax
object.OpenTextFile(filename[, iomode[, create[, format]]])
Arguments
object - Required. Object is always the name of a FileSystemObject.
filename - Required. String expression that identifies the file to
open.
iomode - Optional. Can be one of three constants: ForReading,
ForWriting, or ForAppending.
create - Optional. Boolean value that indicates whether a new file
can be created if the specified filename doesn't exist. The value is
True if a new file is created, False if it isn't created. If
omitted, a new file isn't created.
format - Optional. One of three Tristate values used to indicate the
format of the opened file.
TristateTrue = -1 to open the file as Unicode,
TristateFalse = 0 to open the file as ASCII,
TristateUseDefault = -2 to open the file as the system default.
If omitted, the file is opened as ASCII.
I wrote an application that prefers NFC. When I get a filename from OS X, it's normalized as NFD though. As far as I know I shouldn't convert the data, as mentioned here:
http://www.win.tue.nl/~aeb/linux/uc/nfc_vs_nfd.html
[...](Not because something is wrong with NFD, or this version of NFD,
but because one should never change data. Filenames must not be
normalized.)[...]
When I compare the filename with the user input (which is in NFC), I have to implement a corresponding compare function which takes care of Unicode equivalence. But that could be much slower than needed. Wouldn't it be better to normalize the filename to NFC instead? It would improve the speed a lot, since then just a memory compare is involved.
The accuracy of the advice you link to depends on the filesystem in question.
The 'standard' Linux file systems do not prescribe an encoding for filenames (they are treated as raw bytes), so assuming they are UTF-8 and normalising them is an error and may cause problems.
On the other hand, the default filesystem on Mac OS X (HFS+) enforces all filenames to be valid UTF-16 in a variant of NFD. If you need to compare file paths, you should do so in a similar format – ideally using the APIs provided by the system, as its NFD form is tied to an older version of Unicode.
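If you do end up comparing outside the system APIs, one portable approach is to bring both strings to the same form first. A hedged C sketch using ICU4C (assumptions: ICU is available, the strings are UTF-16, and they fit the fixed-size buffers):

#include <unicode/unorm2.h>
#include <unicode/ustring.h>

/* Compare two UTF-16 names after normalizing both to NFD,
   so "e + combining acute" and a precomposed "é" compare equal. */
int names_equal_nfd(const UChar *a, const UChar *b)
{
    UErrorCode status = U_ZERO_ERROR;
    const UNormalizer2 *nfd = unorm2_getNFDInstance(&status);
    UChar na[256], nb[256];

    unorm2_normalize(nfd, a, -1, na, 256, &status);
    unorm2_normalize(nfd, b, -1, nb, 256, &status);

    return U_SUCCESS(status) && u_strcmp(na, nb) == 0;
}

Keep in mind the caveat above: HFS+'s decomposition is pinned to an older Unicode version, so a current normalizer may not byte-match what the filesystem stores for every character.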
Matlab requires that script file names are limited to 63 characters
>> namelengthmax
ans =
63
and these 63 characters must come from a small character set that excludes - and other symbols.
Why does Matlab limit the filenames and is there a workaround?
The comment from beaker answers part of your question. Because script names can also be function names, you are restricted in the characters they can incorporate.
For example, if you had a file (function) named foot-ball.m, when you call it in an instruction, Matlab couldn't differentiate between:
a = foot-ball ;
where you mean to call the result of a function named foot-ball.m (actually impossible)
or
a = foot-ball ;
assigning to variable "a" the result of function foot.m minus the result of function ball.m
As to the maximum length, there is no workaround (yet) to my knowledge (until Matlab lifts the restriction).
Remember that your operating system also has a limit on the length of a file name (and full path). On Windows it is 256+4 characters. So I guess limiting a file name to 63 characters is just to allow for 193 characters of full path. This can be reached quickly, faster than we think.
If your filename were 255 characters long, you would have no choice but to put it directly into c:\, or the operating system couldn't access it (so Matlab couldn't call it, obviously).
Use the instruction len = namelengthmax to get the actual maximum length on your system. You can read more about it in Specify File Names.
Or read a similar problem from another user: Extending the maximum length of MATLAB function names. Note that this user couldn't bypass the length limit; he had to find another way to fit all the information he wanted into the maximum file name length.
The MATLAB Engine is a C interface to MATLAB. It provides a function engEvalString() which takes some MATLAB code as a C string (char *), evaluates it, then returns MATLAB's output as a C string again.
I need to be able to pass unicode data to MATLAB through engEvalString() and to retrieve the output as unicode. How can I do this? I don't care about the particular encoding (UTF-8, UTF-16, etc.), any will do. I can adapt my program.
More details:
To give a concrete example, if I send the following string, encoded as, say, UTF-8,
s='Paul Erdős'
I would like to get back the following output, encoded again as UTF-8:
s =
Paul Erdős
I hoped to achieve this by sending feature('DefaultCharacterSet', 'UTF-8') (reference) before doing anything else, and this worked fine when working with MATLAB R2012b on OS X. It also works fine with R2013a on Ubuntu Linux. It does not work on R2013a on OS X though. Instead of the character ő in the output of engEvalString(), I get character code 26, which is supposed to mean "I don't know how to represent this". However, if I retrieve the contents of the variable s by other means, I see that MATLAB does correctly store the character ő in the string. This means that it's only the output that didn't work, but MATLAB did interpret the UTF-8 input correctly. If I test this on Windows with R2013a, neither input, nor output works correctly. (Note that the Windows and the Mac/Linux implementations of the MATLAB Engine are different.)
The question is: how can I get unicode input/output working on all platforms (Win/Mac/Linux) with engEvalString()? I need this to work in R2013a, and preferably also in R2012b.
If people are willing to experiment, I can provide some test C code. I'm not posting that yet because it's a lot of work to distill a usable small example from my code that makes it possible to experiment with different encodings.
UPDATE:
I learned about feature('locale') which returns some locale-related data. On Linux, where everything works correctly, all encodings it returns are UTF-8. But not on OS X / Windows. Is there any way I could set the various encodings returned by feature('locale')?
UPDATE 2:
Here's a small test case: download. The zip file contains a MATLAB Engine C program, which reads a file, passes it to engEvalString(), then writes the output to another file. There's a sample file included with the following contents:
feature('DefaultCharacterSet', 'UTF-8')
feature('DefaultCharacterSet')
s='中'
The (last part of the) output I expect is
>>
s =
中
This is what I get with R2012b on OS X. However, R2013a on OS X gives me character code 26 instead of the character 中. Outputs produced by R2012b and R2013a are included in the zip file.
How can I get the expected output with R2013a on all three platforms (Windows, OS X, Linux)?
I strongly urge you to use engPutVariable, engGetVariable, and Matlab's eval instead. What you're trying to do with engEvalString will not work with many unicode strings due to embedded NULL (\0) characters, among other problems. Due to how the Windows COM interface works, the Matlab engine can't really support unicode in interpreted strings. I can't speculate about how the engine works on other platforms.
Your other question had an answer about using mxCreateString_UTF16. Wasn't that sufficient?
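If it helps, here is a rough sketch of the engPutVariable route (my own illustration, not official sample code; an open Engine handle is assumed, and the text is passed as 16-bit mxChar units so it never travels through the interpreted string):

#include <string.h>
#include "engine.h"   /* MATLAB Engine C API */

/* Put a UTF-16 string into MATLAB as variable "s", then refer to it by name. */
void put_utf16(Engine *ep, const unsigned short *utf16, size_t len)
{
    mwSize dims[2] = {1, (mwSize)len};
    mxArray *s = mxCreateCharArray(2, dims);

    memcpy(mxGetChars(s), utf16, len * sizeof(unsigned short));

    engPutVariable(ep, "s", s);
    engEvalString(ep, "disp(s)");   /* the string itself travels only as data */

    mxDestroyArray(s);
}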
For example, the MessageBox function has an LPCTSTR-typed argument for the text and caption, which is a pointer to wchar or a pointer to char when _UNICODE or _MBCS is defined, respectively.
How does the MessageBox function interpret those strings? As which encoding?
Only explanation I managed to find is this:
http://msdn.microsoft.com/en-us/library/cwe8bzh0(VS.90).aspx
But it doesn't say anything about encoding, just that in the case of _UNICODE one character takes up one wchar (which is 16-bit on Windows), and that in the case of _MBCS one character takes one or two chars (8-bit).
So are those some Microsoft versions of UTF-8 and UTF-16 that ignore anything that has to be encoded in 3 or 4 bytes in the case of UTF-8, and anything that has to be encoded in 4 bytes in the case of UTF-16? And is there a way to show anything outside of the basic multilingual plane of Unicode with MessageBox?
There are normally two different implementations of each function:
MessageBoxA, which accepts ANSI strings
MessageBoxW, which accepts Unicode strings
Here, 'ANSI' means the multi-byte code page currently assigned to the process. This varies according to the user's preferences and locale setting, although Win32 API functions such as WideCharToMultiByte can be counted on to do the right conversion, and the GetACP function will tell you the code page in use. MSDN explains the ANSI code page and how it interacts with Unicode.
'Unicode' generally means UCS-2; that is, support for characters above 0xFFFF isn't consistent. I haven't tried this, but UI functions such as MessageBox in recent versions (> Windows 2000) should support characters outside the BMP.
The ...A functions are obsolete and only wrap the ...W functions. The former were required for compatibility with Windows 9x, but since that is not used any more, you should avoid them at all costs and use the ...W functions exclusively. They require UTF-16 strings, the only native Windows encoding. All modern Windows versions should support non-BMP characters quite well (if there is a font that has these characters, of course).
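To test the non-BMP part of the question, a minimal sketch (whether the glyph actually renders depends on the installed fonts):

#include <windows.h>

int main(void)
{
    /* U+1D11E MUSICAL SYMBOL G CLEF, stored as the surrogate pair D834 DD1E */
    MessageBoxW(NULL, L"BMP: \u00E9   non-BMP: \xD834\xDD1E",
                L"Unicode test", MB_OK);
    return 0;
}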