How to get proper filename sanitizing on upload in TYPO3? - typo3

When I upload a badly (or "utf8-ly") named file in a fresh TYPO3 7.6 install, I get underscores instead of spelled out special characters.
E.g. the filename Bräm!.png is sanitized to Bra__m_.png.
I would expect Braem.png.
The server locale looks fine:
LANG=de_CH.UTF-8
LC_CTYPE="de_CH.UTF-8"
LC_NUMERIC="de_CH.UTF-8"
LC_TIME="de_CH.UTF-8"
LC_COLLATE="de_CH.UTF-8"
LC_MONETARY="de_CH.UTF-8"
LC_MESSAGES="de_CH.UTF-8"
LC_PAPER="de_CH.UTF-8"
LC_NAME="de_CH.UTF-8"
LC_ADDRESS="de_CH.UTF-8"
LC_TELEPHONE="de_CH.UTF-8"
LC_MEASUREMENT="de_CH.UTF-8"
LC_IDENTIFICATION="de_CH.UTF-8"
LC_ALL=
In LocalConfiguration.php, we have
'systemLocale' => 'de_CH.UTF-8',
And even, in php.ini, I tried
intl.default_locale = de_CH.UTF-8
Still, I don't get the renaming I'd expect: Bräm!.png should become Braem.png, or at least Braem_.png.
Where else could I look?

From what you describe, the name of the file is not encoded in UTF-8 but in a single-byte character set (ISO-8859-1, for example).
\TYPO3\CMS\Core\Resource\Driver\LocalDriver::sanitizeFileName() treats the name as UTF-8 when it is called from the backend (the same goes for the old file handling functions).
In that case the "ä" isn't a valid multi-byte UTF-8 character and is thus replaced by underscore characters.
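
The claim above can be checked in isolation. A minimal Python sketch (not TYPO3 code; the helper name `is_valid_utf8` is made up for illustration) shows that the ISO-8859-1 bytes for the file name do not decode as UTF-8, while the UTF-8 bytes do:

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes cleanly as UTF-8 (hypothetical helper)."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


latin1_name = "Bräm!.png".encode("iso-8859-1")  # ä becomes the single byte 0xE4
utf8_name = "Bräm!.png".encode("utf-8")         # ä becomes the two bytes 0xC3 0xA4

print(is_valid_utf8(latin1_name))  # False
print(is_valid_utf8(utf8_name))    # True
```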

Make sure [SYS][UTF8filesystem] = true in your LocalConfiguration.php.
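
For reference, the renaming the question expects (Bräm!.png becoming Braem_.png) can be sketched in Python. This is not how TYPO3 implements it; the `sanitize` function and the small `TRANSLIT` table are assumptions for illustration only. NFC normalization is included because some systems store "ä" as "a" followed by a combining diaeresis.

```python
import re
import unicodedata

# Hypothetical transliteration table covering only German umlauts and sharp s.
TRANSLIT = {"ä": "ae", "ö": "oe", "ü": "ue", "Ä": "Ae", "Ö": "Oe", "Ü": "Ue", "ß": "ss"}


def sanitize(name: str) -> str:
    """Transliterate known characters, then replace anything unsafe with '_'."""
    # Compose "a" + combining diaeresis into a single "ä" first.
    name = unicodedata.normalize("NFC", name)
    name = "".join(TRANSLIT.get(ch, ch) for ch in name)
    # Keep ASCII letters, digits, dot, underscore and hyphen only.
    return re.sub(r"[^A-Za-z0-9._-]", "_", name)


print(sanitize("Bräm!.png"))  # Braem_.png
```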

Related

I keep getting syntax errors with my new vBulletin Install

Here is one of many errors I keep getting.
Parse error: syntax error, unexpected 'Database' (T_STRING) in /home2/craven/public_html/forums/core/includes/config.php on line 39
Make sure you use the correct quote characters in the config file.
These are correct: 'string', "string"
These are wrong: ´string´, string
You could try to replace all single quotes with the correct double quotes, just in case there is something funky going on with the encoding of your file.
Also, make sure there are no spaces or newlines between the $config[...] definitions which might break the syntax parser.
Also check that the config.php file has the correct encoding: UTF-8, or one of the Latin encodings if you're using European special characters, for example. You can use a text editor like Notepad++ to check or change the file encoding, then re-upload the updated file.
If it still fails, you need to provide information about your webserver system, like the PHP version.
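
Quote look-alikes are hard to spot by eye, so a small script can help. The following Python sketch scans text for a few typographic characters that are not PHP string delimiters; the character list and the sample `$config` lines are assumptions for illustration, not taken from vBulletin:

```python
# Characters that look like quotes but do not delimit PHP strings:
# acute accent, left/right single quotes, left/right double quotes.
SUSPECT_QUOTES = "\u00b4\u2018\u2019\u201c\u201d"


def find_suspect_quotes(text: str) -> list:
    """Return (line, column, character) for every quote look-alike in text."""
    hits = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        for col, ch in enumerate(line, start=1):
            if ch in SUSPECT_QUOTES:
                hits.append((line_no, col, ch))
    return hits


# Hypothetical config content: line 1 is fine, line 2 uses look-alikes.
sample = "$config['a'] = 'ok';\n$config[\u2018b\u2019] = \u00b4bad\u00b4;"
for line_no, col, ch in find_suspect_quotes(sample):
    print(f"line {line_no}, column {col}: U+{ord(ch):04X}")
```

To check a real file, read it with the encoding you believe it has and pass the text to `find_suspect_quotes`.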

Winjs, error reading file with FileIO.readTextAsync

I am reading a .json file from disk using Windows.Storage.FileIO.readTextAsync.
All is fine until I put some non-English letters in the file, like Æ Å Ø.
The error I get is (rough translation from Danish language):
WinRT: No mapping for the Unicode character exists in the target multi-byte code page.
Any idea how to read those characters in WinJS?
I found the problem.
When I created the file manually with Notepad, I saved it as ANSI instead of UTF-8.
I reopened the file, chose Save As, changed the encoding to UTF-8, and overwrote it.
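
The same re-save can be scripted. A minimal Python sketch, assuming the file was written in Windows-1252 (which is what Notepad's "ANSI" means on a Western European system); the function name is illustrative:

```python
def reencode_to_utf8(raw: bytes, source_encoding: str = "cp1252") -> bytes:
    """Decode bytes from the assumed source encoding and re-encode as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


ansi_bytes = "Æ Å Ø".encode("cp1252")  # what the "ANSI" file contains
utf8_bytes = reencode_to_utf8(ansi_bytes)

print(ansi_bytes)  # b'\xc6 \xc5 \xd8'
print(utf8_bytes)  # b'\xc3\x86 \xc3\x85 \xc3\x98'
```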
You may be able to solve this by changing the encoding from the default (Utf8) to Utf16. The readTextAsync method accepts a second parameter which is a UnicodeEncoding flag:
Windows.Storage.FileIO.readTextAsync(
file,
Windows.Storage.Streams.UnicodeEncoding.utf16LE
).done( ... );
Or, if you need to, you can use the utf16BE flag (see link above).

WiX Installer heat.exe and non-ascii filenames

I added a file in my WiX script with the character " î " in the path name. Light.exe will complain:
A string was provided with characters that are not available in the specified database code page '1252'
The character in question is 0xEE in Windows-1252 encoding, that is, 0x00EE Unicode or 0xC3AE in UTF-8. These files are in a wxs file generated by heat.exe, and this xml is encoded as UTF-8.
I assume the error message comes from the fact that it tries to input the character in UTF encoding while the database is 1252? Since UTF isn't really supported by Windows Installer (as described in the WiX documentation), should I be using input xml encoded in 1252 or iso-8859? If so, can I tell heat.exe to use another encoding for its output?
My question is similar to this one:
"Leveraging heat.exe and harvest already localized file names and including them to msi using wix", but the difference is that in that case the characters are "true" non-ANSI characters. In my case the character can be encoded correctly in 1252, but it seems the conversion from the UTF-8 input files does not work.
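
The byte values quoted in the question can be confirmed with a short Python check (Python is used here only as a neutral calculator, it is not part of the WiX toolchain):

```python
ch = "\u00ee"  # LATIN SMALL LETTER I WITH CIRCUMFLEX

print(hex(ord(ch)))               # 0xee
print(ch.encode("cp1252").hex())  # ee
print(ch.encode("utf-8").hex())   # c3ae
```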
The WiX toolset verifies codepages like so (roughly):
encoding = Encoding.GetEncoding(codepage, new EncoderExceptionFallback(),
    new DecoderExceptionFallback());
writer = new StreamWriter(idtPath, false, encoding);
try
{
    // GetBytes will throw an exception if any character doesn't
    // match our current encoding
    rowBytes = writer.Encoding.GetBytes(rowString);
}
catch (EncoderFallbackException)
{
    rowBytes = convertEncoding.GetBytes(rowString);
    messageHandler.OnMessage(WixErrors.InvalidStringForCodepage(
        row.SourceLineNumbers,
        writer.Encoding.WindowsCodePage));
}
It is possible that NETFX is not translating that "î" correctly. Explicitly setting the codepage on your XML may help. To do that from heat, you can try to use an XSLT (I've never tried changing the XML document's codepage via XSL, but it seems possible) or post-edit the document.
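
The idea behind that check (encode strictly and treat a failure as "not in this code page") can be sketched in Python. This is an analogue for illustration, not WiX code; `fits_code_page` and the sample file names are hypothetical:

```python
def fits_code_page(text: str, code_page: str = "cp1252") -> bool:
    """Return True if every character of text is encodable in code_page."""
    try:
        text.encode(code_page, errors="strict")
        return True
    except UnicodeEncodeError:
        return False


print(fits_code_page("b\u00eete.txt"))     # True: î exists in Windows-1252
print(fits_code_page("\u65e5\u672c.txt"))  # False: CJK characters do not
```

If a path that should fit still fails a check like this, the string was probably decoded with the wrong encoding before it reached the check.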

How to "force" a file's ISO-8859-1ness?

I remember when I used to develop websites in Japan, where three different character encodings are in common use, the developers had a trick to "force" the encoding of a source file so it would always open in their IDEs in the correct encoding.
What they did was to put a comment at the top of the file containing a Japanese character that only existed in that particular character encoding - it wasn't in any of the others! This worked perfectly.
I remember this because now I have a similar, albeit Anglophone, problem.
I've got some files that MUST be ISO-8859-1 but keep opening in my editor (Bluefish 1.0.7 on Linux) as UTF-8. This isn't normally a problem EXCEPT for pound (£) symbols and whatnot. Don't get me wrong, I can fix the file and save it out again as ISO-8859-1, but I want it to always open as ISO-8859-1 in my editor.
So, are there any sort of character hacks - like I mention above - to do this? Or any other methods?
PS. Unicode advocates / evangelists needn't waste their time trying to convert me because I'm already one of them! This is a rickety older system I've inherited :-(
PPS. Please don't say "use a different editor" because I'm an old fart and set in my ways :-)
Normally, if you have a £ encoded as ISO-8859-1 (ie. a single byte 0xA3), that's not going to form part of a valid UTF-8 byte sequence, unless you're unlucky and it comes right after another top-bit-set character in such a way to make them work together as a UTF-8 sequence. (You could guard against that by putting a £ on its own at the top of the file.)
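
Both cases can be reproduced with a short Python sketch (the helper name `decodes_as_utf8` is made up for illustration). A lone 0xA3 byte is rejected by a UTF-8 decoder, while the "unlucky" pair 0xC2 0xA3 happens to be the valid UTF-8 encoding of £:

```python
def decodes_as_utf8(raw: bytes) -> bool:
    """Return True if raw is a valid UTF-8 byte sequence."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


lone_pound = "£5".encode("iso-8859-1")  # bytes: A3 35
unlucky = "Â£5".encode("iso-8859-1")    # bytes: C2 A3 35

print(decodes_as_utf8(lone_pound))  # False
print(decodes_as_utf8(unlucky))     # True
```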
So no editor should open any such file as UTF-8; if it did, it'd lose the £ completely. If your editor does that, “use a different editor”—seriously! If your problem is that your editor is loading files that don't contain £ or any other non-ASCII character as UTF-8, causing any new £ you add to them to be saved as UTF-8 afterwards, then again, simply adding a £ character on its own to the top of the file should certainly stop that.
What you can't necessarily do is make the editor load it as ISO-8859-1 as opposed to any other character set where all single top-bit-set bytes are valid. It's only multibyte encodings like UTF-8 and Shift-JIS that you can exclude, by using byte sequences that are invalid for that encoding.
What will usually happen on Windows is that the editor will load the file using the system default code page, typically 1252 on a Western machine. (Not actually quite the same as ISO-8859-1, but close.)
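
A quick Python check illustrates "close but not quite the same". The pound sign sits at the same byte in both encodings, but bytes in the 0x80 to 0x9F range differ:

```python
euro_byte = b"\x80"
pound_byte = b"\xa3"

# 0xA3 is the pound sign in both encodings.
print(pound_byte.decode("cp1252") == pound_byte.decode("iso-8859-1"))  # True

# 0x80 is the euro sign in Windows-1252 but a C1 control character in ISO-8859-1.
print(ascii(euro_byte.decode("cp1252")))      # '\u20ac'
print(ascii(euro_byte.decode("iso-8859-1")))  # '\x80'
```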
Some editors have a feature where you can give them a hint what encoding to use with a comment in the first line, eg. for vim:
# vim: set fileencoding=iso-8859-1 :
The syntax will vary from editor to editor/configuration. But it's usually pretty ugly. Other controls may exist to change default encodings on a directory basis, but since we don't know what you're using...
In the long run, files stored as ISO-8859-1 or any other encoding that isn't UTF-8 need to go away and die, of course. :-)
You can put character ÿ (0xFF) in the file. It's invalid in UTF8. BBEdit on Mac correctly identifies it as ISO-8859-1. Not sure how your editor of choice will do.

Creating files with french characters and encoding

Hi, I am creating a file like so.
FileStream temp = File.Create( this.FileName );
Then putting data in the file like so.
this.Writer = new StreamWriter( this.Stream );
this.Writer.WriteLine( strMessage );
That code is encapsulated in a class hierarchy but that is the meat and potatoes of it.
My problem is this. MSDN says that the default encoding for a file created this way is UTF-8. When I write a French character such as é, TextPad interprets the file as UTF-8, but Notepad++ reports "ANSI as UTF-8" (or maybe it is an ANSI file that Notepad++ is reading as UTF-8). When I create a file the same way without the French character, both TextPad and Notepad++ read it as an ANSI file, even though according to MSDN it should still be UTF-8.
Which program should be trusted, Notepad++ or TextPad? Notepad++ seems to be more consistent, but it is still the opposite of what MSDN says it should be. My problem is that we create files that get sent off to another company, and the encoding seems to change depending on whether there are French characters.
Or is there a better way to determine the encoding of a file? I've read about byte order marks and preambles, but as far as I understand neither is guaranteed to be there.
We initially thought that all the files we were building were ANSI. Also please note that both ANSI and UTF-8 should handle the French characters appropriately, as the characters are part of both character sets.
As far as I know, "ANSI" in these editors means the Windows system code page (typically Windows-1252 on a Western system), which is a superset of US-ASCII.
If the file contains no characters outside the ASCII character set, then it is valid ASCII, valid ANSI, and valid UTF-8 all at once, and there is no way to distinguish them. So your program can write it as UTF-8, and any other program would be just as correct in seeing it as ASCII (ANSI) as in seeing it as UTF-8.
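
The point is easy to verify. The sketch below uses Python rather than the C# from the question, purely to show the bytes: ASCII-only text is byte-for-byte identical in all three encodings, the bytes diverge once an é appears, and a byte order mark is present only if the writer chooses to emit one.

```python
import codecs

plain = "hello"
accented = "caf\u00e9"  # "café"

# ASCII-only text produces identical bytes in all three encodings.
print(plain.encode("ascii") == plain.encode("utf-8") == plain.encode("cp1252"))  # True

# Once a non-ASCII character appears, the bytes differ.
print(accented.encode("cp1252"))  # b'caf\xe9'
print(accented.encode("utf-8"))   # b'caf\xc3\xa9'

# A BOM is optional: "utf-8-sig" writes one, plain "utf-8" does not.
print(accented.encode("utf-8-sig").startswith(codecs.BOM_UTF8))  # True
print(accented.encode("utf-8").startswith(codecs.BOM_UTF8))      # False
```

This is why an editor can only guess for a BOM-less file: with no BOM and no non-ASCII bytes, nothing in the file identifies the encoding.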