WinJS, error reading file with FileIO.readTextAsync - Unicode

I am reading a .json file from disk using Windows.Storage.FileIO.readTextAsync.
All is fine until I put some non-English letters in the file, like Æ Å Ø.
The error I get is (roughly translated from Danish):
WinRT: No mapping for the Unicode character exists in the target multi-byte code page.
Any idea how to read those characters in WinJS?

I found the problem.
When I created the file manually with Notepad, I saved it as ANSI instead of UTF-8.
I reopened the file, chose Save As, changed the encoding to UTF-8, and overwrote it.
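The root cause is easy to reproduce outside WinJS. As a minimal Python sketch (assuming Notepad's "ANSI" means Windows-1252, the usual case on Western-European systems):
# "ANSI" (Windows-1252) stores Æ Å Ø as single bytes that are not
# valid UTF-8 sequences, so a UTF-8 reader has to fail on them.
data = "Æ Å Ø".encode("cp1252")
try:
    data.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)                    # the same class of failure readTextAsync reports
print(data.decode("cp1252"))      # decoding with the real encoding works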

You may be able to solve this by changing the encoding from the default (UTF-8) to UTF-16. The readTextAsync method accepts a second parameter, which is a UnicodeEncoding flag:
Windows.Storage.FileIO.readTextAsync(
    file,
    Windows.Storage.Streams.UnicodeEncoding.utf16LE
).done( ... );
Or, if you need to, you can use the utf16BE flag for big-endian data (see the UnicodeEncoding enumeration).

Related

How to get proper filename sanitizing on upload in TYPO3?

When I upload a badly (or "utf8-ly") named file in a fresh TYPO3 7.6 install, I get underscores instead of spelled-out special characters.
E.g. the filename Bräm!.png is sanitized to Bra__m_.png.
I would expect Braem.png.
The server locale looks fine:
LANG=de_CH.UTF-8
LC_CTYPE="de_CH.UTF-8"
LC_NUMERIC="de_CH.UTF-8"
LC_TIME="de_CH.UTF-8"
LC_COLLATE="de_CH.UTF-8"
LC_MONETARY="de_CH.UTF-8"
LC_MESSAGES="de_CH.UTF-8"
LC_PAPER="de_CH.UTF-8"
LC_NAME="de_CH.UTF-8"
LC_ADDRESS="de_CH.UTF-8"
LC_TELEPHONE="de_CH.UTF-8"
LC_MEASUREMENT="de_CH.UTF-8"
LC_IDENTIFICATION="de_CH.UTF-8"
LC_ALL=
In LocalConfiguration.php, we have
'systemLocale' => 'de_CH.UTF-8',
I even tried, in php.ini,
intl.default_locale = de_CH.UTF-8
Still, no "proper" renaming as I'd expect, renaming the File Bräm!.png to Braem.png or at least Braem_.png.
Where else could I look?
From what you describe, the name of the file is not encoded in UTF-8 but in a single-byte character set (ISO-8859-1, for example).
In \TYPO3\CMS\Core\Resource\Driver\LocalDriver::sanitizeFileName(), UTF-8 is used if you use it in the backend (same for the old file handling functions).
In that case the "ä" isn't a valid multi-byte UTF-8 character and is thus replaced by underscore characters.
Make sure [SYS][UTF8filesystem] = true in your LocalConfiguration.php.
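You can verify that diagnosis quickly. A minimal Python check (assuming the filename bytes arrived as ISO-8859-1):
# A Latin-1 "ä" is the single byte 0xE4, which is not a valid UTF-8
# sequence, so a UTF-8-based sanitizer can only replace it with "_".
raw = "Bräm!.png".encode("latin-1")   # b'Br\xe4m!.png'
try:
    raw.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)                        # invalid continuation byte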

ASCII / UTF8 set random?

I have tried a program called UTFCast Professional. It checks the file encoding.
When I write code I use Sublime Text.
Random encoding
What I get is that some files are UTF-8 and some files are ASCII/UTF-8. It appears to be set at random. All of them are set to "BOM: No".
Why are some files UTF-8 and some ASCII/UTF-8?
Is it possible that in some cases it cannot tell whether a file is ASCII or UTF-8?
Should I be worried about future encoding problems? I have not had any so far.
(I prefer UTF-8.)
A plain text file does not record anywhere what encoding it is in. Any program that purportedly tells you what encoding a file is in is by definition only giving you its best guess based on the content of the file. Now, since a file which contains only characters present in ASCII and is saved as UTF-8 is indistinguishable from a pure ASCII file, either answer is valid. Even Latin-1 and a large number of other answers would be valid.
So the reason that program randomly outputs one or the other is that its detection algorithm triggers one way or the other based on some characteristic of the file content. Only the program's author can tell you exactly why. The file is encoded as UTF-8 without BOM. Whatever any application tells you it thinks the file is, is entirely up to that application.
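The ambiguity is easy to demonstrate. A minimal Python sketch:
# Pure-ASCII text encodes to byte-identical output in ASCII and UTF-8,
# so no detector can tell the two apart after the fact.
text = "plain ascii only"
assert text.encode("ascii") == text.encode("utf-8")
# As soon as a non-ASCII character appears, the encodings diverge:
print("é".encode("utf-8"))    # b'\xc3\xa9' (two bytes)
print("é".encode("latin-1"))  # b'\xe9'     (one byte)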

file encoding and deformed strings

I am working with a text file that contains lots of deformed strings such as:
VyplÅ<88>te prosím pole "jméno
My editor says that the file encoding is Latin-1. The string is supposed to be a Czech sentence that contains some diacritics, so no wonder it is displayed wrong. I have tried to force UTF-8 and Latin-2 encodings in my editor, but that did not help. I have also tried to use iconv to convert the file from Latin-1 to UTF-8 or Latin-2, but neither helped. I quite often encounter issues like this, and I don't know any solution other than manually rewriting the strings. Is there a better way to fix this?
EDIT:
Here is the original sentence:
Vyplňte prosím pole "jméno"
Here is hex dump of the part where the malformed string occurs:
0002640: 6a6d 656e 6f22 5d20 3d20 2744 453a 2056 jmeno"] = 'DE: V
0002650: 7970 6cc5 8874 6520 7072 6f73 c3ad 6d20 ypl..te pros..m
0002660: 706f 6c65 2022 6a6d c3a9 6e6f 222e 273b pole "jm..no".';
EDIT2:
The sentence above really is correct UTF-8, as deceze said. But I have just found something strange. If I try to transcode the file from UTF-8 to UTF-8 (with iconv), I get an error on the word Postgebühr, at the character ü. If I look at the hex dump, this character is represented as \xfc (252 in decimal), which is the valid Latin-1 encoding of ü but completely invalid as UTF-8. It seems that part of the file is in Latin-1 and another part in UTF-8. Here is a part of the file that is (probably) in Latin-1:
0000250: 506f 7374 6765 62fc 6872 273b 0a09 0963 Postgeb.hr';...c
0000260: 6f6e 665b 2277 6166 6572 7322 5d20 3d20 onf["wafers"] =
0000270: 2744 453a 206f 706c c3a1 746b 20c3 273b 'DE: opl..tk .';
As I look into this more, this does not even seem to be valid Latin-1, because even in Latin-1 it is garbled (DE: oplátk à instead of, probably, DE: oplatky za). This part of the file seems to contain some damaged text.
I can't understand how the encodings in this file could have got mixed up like that. Any ideas?
If the file is supposed to contain Latin2 encoded text, then trying to convert it from Latin1 or similar is of course messing things up.
The problem is simply that your text editor does not automagically recognize the encoding, since the single-byte Latin* encodings all look interchangeable at the byte level. If your editor "tells" you the encoding is Latin1, what it means is that it is currently interpreting the file as Latin1. Obviously it has that wrong.
You either need to tell your editor to treat the file as Latin2 (Open As... Latin2, or however your editor gives you this choice) or to convert the file from Latin2 into an encoding your editor handles correctly.
To understand encodings better, I recommend you read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.
In response to your posted hex dump: That file is UTF-8 encoded.
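This can be checked directly against the dump. A quick Python verification (bytes copied from the hex dump above):
# c5 88 is the UTF-8 sequence for U+0148 (ň), c3 ad for U+00ED (í):
raw = bytes.fromhex("5679706cc58874652070726f73c3ad6d")
print(raw.decode("utf-8"))    # Vyplňte prosím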
iconv is the way to go, but you must know the correct encoding. Latin2 (ISO 8859-2) is only one of the possibilities, since there were many ASCII extensions in Europe. What language is this text supposed to be in?
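Since the file apparently mixes UTF-8 and Latin-1 sections, no single iconv pass can fix it. A hedged Python sketch that decodes line by line and falls back when UTF-8 fails (the filename and the fallback encoding are assumptions):
def decode_mixed(raw: bytes, fallback: str = "latin-1") -> str:
    # Decode each line as UTF-8; lines with invalid UTF-8 bytes
    # (like the \xfc in Postgeb\xfchr) get the single-byte fallback.
    parts = []
    for line in raw.splitlines(keepends=True):
        try:
            parts.append(line.decode("utf-8"))
        except UnicodeDecodeError:
            parts.append(line.decode(fallback))
    return "".join(parts)

with open("lang.php", "rb") as f:     # hypothetical filename
    print(decode_mixed(f.read()))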

How to convert from the raw coding system to latin-1

I have a file which used to be encoded in Latin-1. Now, when I open this file, I get the raw encoding only, that is, -t: in the status line. There are probably some non-Latin-1 characters in the file; at least, opening other Latin-1 files works.
I would like to change the file back to Latin-1 only. So I set the buffer's coding to latin-1 with C-x RET f. However, upon saving, I get \344 (which is ä), \374 (which is ü), and so on flagged as non-encodable characters. So the characters are actually there but are for some reason still misinterpreted.
By searching by hand, I found a \237 to be the culprit. What is odd is that that character is not identified as non-Latin-1 upon saving, but it causes the file to no longer be recognized as Latin-1.
Try to find out what encoding the file is now in. Then do C-x RET c <the file's encoding> RET C-x C-f <file> (i.e. open the file using its actual encoding). Then you can save it using latin-1.
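If you are not sure what the file is currently in, a small Python probe can narrow it down before you reopen it in Emacs (the candidate list is an assumption; note that latin-1 maps every byte, so a clean decode there proves nothing by itself):
candidates = ["utf-8", "iso8859-2", "cp1252", "latin-1"]
with open("myfile.txt", "rb") as f:   # hypothetical filename
    raw = f.read()
for enc in candidates:
    try:
        raw.decode(enc)
        print(enc, "decodes cleanly")
    except UnicodeDecodeError as err:
        print(enc, "fails:", err)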

Creating files with french characters and encoding

Hi, I am creating a file like so.
FileStream temp = File.Create( this.FileName );
Then putting data in the file like so.
this.Writer = new StreamWriter( this.Stream );
this.Writer.WriteLine( strMessage );
That code is encapsulated in a class hierarchy but that is the meat and potatoes of it.
My problem is this. MSDN says that the default encoding for creating a file this way is UTF-8. And when I write a French character such as é, TextPad interprets the file as a UTF-8 file, but Notepad++ says it's "ANSI as UTF8" (or maybe it's an ANSI file that is being read as UTF-8). When I create a file the same way without the French character, both TextPad and Notepad++ read the file as an ANSI file, even though according to MSDN it should still be a UTF-8 file.
Which program should be trusted, Notepad++ or TextPad? Notepad++ seems to be more consistent, but is still the opposite of what MSDN says it should be. My problem is that we create files that get sent off to another company, and depending on whether there are French characters, the encoding seems to keep changing.
Or is there a better way to determine the encoding of a file? I've read about byte order marks and preambles, but as far as I understand, neither is guaranteed to be there.
We initially thought that all the files we were building were ANSI. Also, please note that both ANSI and UTF-8 should handle the French characters appropriately, as those characters are part of both character sets.
As far as I know, "ANSI" in this context means the system's default single-byte code page (typically Windows-1252), which is a superset of US-ASCII.
If there are no characters in the file outside the ASCII set, then the file is both valid ASCII and valid UTF-8; there is no way to distinguish them. So your program can write it as UTF-8, and any other program would be correct in seeing it as ASCII (ANSI), just as it would be in seeing it as UTF-8.
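If the receiving company needs certainty, the only reliable in-band marker is an explicit BOM; in .NET you can force one by passing new UTF8Encoding(true) to the StreamWriter. A minimal Python sketch of the corresponding check (the file path is hypothetical):
import codecs

with open("outfile.txt", "rb") as f:
    head = f.read(3)
if head.startswith(codecs.BOM_UTF8):
    print("UTF-8 with BOM")
else:
    # Without a BOM, a pure-ASCII UTF-8 file is byte-identical to an
    # ANSI/ASCII file; no tool can do better than guess.
    print("no BOM: the encoding must be guessed from the bytes")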