Perl Net::FTP and non-ASCII (UTF8) characters in file names - perl

I am using Net::FTP to access a PVR (satellite receiver) and retrieve recorded video files. Obtaining a list of all files using the dir() subroutine works fine, however if file names contain non-ASCII (UTF8) characters, calls to mtdm() and get() fail for these files. Here's an example (containing a german "umlaut"):
Net::FTP=GLOB(0x253d000)>>> MDTM /DataFiles/Kommissar Beck ~ Tödliche Kunst.rec
Net::FTP=GLOB(0x253d000)<<< 550 Can't access /DataFiles/Kommissar Beck ~ Tödliche Kunst.rec
File names only containing ASCII characters work well. Accessing files with non-ASCII characters through other FTP software works well too.
Does anyone have an idea how I can possibly make this work? Obviously I cannot simply avoid "umlauts" in file names.

Thank you ikegame and Slaven Rezic, your suggestions helped me solve the problem.
To sum it up: it is a bug in Topfield SRP2100's FTP implementation. The problem is not Perl or Net::FTP related. The MDTM command does not accept non-ASCII characters while the RETR command does. I checked with a network sniffer that my code and Net::FTP was doing everything right. All filenames sent in FTP commands were 100% correct.
I worked around the problem by parsing the date shown in the output of dir() instead of using MDTM for non-ASCII file names -- not a nice solution but it worked.

Related

Compare filenames with different encoding in Octave

I'm trying to accomplish following task in Octave:
Read filename from text file
Search for this file in particular location on hard drive
My script works for most files, but for certain files containing unicode characters I'm unable to match the filename from textfile with filename as it appears in the file system.
Filenames in textfile are in UTF-8 encoding and I read them in Octave with function fgetl().
Filenames from file system are obtained via function readdir(). I'm on Windows, NTFS file system.
For example, one problematic filename contains character "Č".
When printed out in Octave console, the characters appear exactly the same. However, a HEX viewer reveals that the characters are not actually the same. In the first case the character is encoded as 0x010C, in the second case as 0x0043 + 0x030C. Comparing both of them via strcmp() fails, of course.
What I tried to do is to omitt all non-ASCII characters from the filename and then compare them. But this didn't work, probably because in the second variant the first part of the character (0x0043) is actually ASCII.
Now I'm looking for some way of converting one format to another to be able to compare them. Any ideas?
EDIT:
As I discovered later, the character Č in the filename on Windows is actually written as C+ˇ, which is just another way you can write that character. So the difference probably insn't in encoding standard, but in 2 different ways to achieve 1 visible character (glyph).
This question basically then changes to a task of matching characters written "at once" and corresponding pair of letter+combining character.

How should a properly UTF-8 encoded file look in notepad++

I am integrating data using some flat files. I'm getting the flat files delivered by FTP as .csv-files out of MS SQL exports from a business partner.
I asked him to encode it as UTF-8 (just using the standard I thought).
Now I can see in his files that a lot of UTF-8 bytes such as "& # 2 3 3 ;" (w/o the spaces) can be seen as plain text when I open it in Notedpad++ (or also using my "ETL" tool).
Before I ask him to fix it into proper UTF-8, I would like to understand the issue and whether my claim is actually correct?
Shouldn't special characters be shown as special characters when I open them in Notepad++ and not as plain text UTF-8 codes?
Any help is much appreciated :))
Cheers
Martin
é is an HTML entity. For some reason the text is HTML formatted, which I wouldn't count as "plaintext"/flat files. The file may or may not be encoded in UTF-8 in addition to that, we don't know from the information given.
A file containing "special characters" (meaning non-ASCII characters) encoded in UTF-8 opened in a text editor which correctly interprets the file as UTF-8 looks exactly like the text it should look like, e.g.:
正式名称は、ISO/IEC 10646では “UCS Transformation Format 8”、Unicodeでは “Unicode Transformation Format-8” という。両者はISO/IEC 10646とUnicodeのコード重複範囲で互換性がある。RFCにも仕様がある。
Put this in a file, save it as UTF-8, open it in another application as UTF-8, and this is what the text should look like.

UTF-8 source files are not supported in avisynth

I use avisynth to demux video from audio.
When I use
x = "m.mkv"
ffvideosource(x)
It work correctly but when I change my video filename to a UTF-8 one and my script as:
x = "م.mkv"
ffvideosource(x)
I Got the following error:
failed to open for hashing avisynth
I found a link (UTF-8 source files are not supported) who tell UTF-8 file name not work in avisynth, and to correct the problem, it said:
specify the parameter utf8=true when calling ffvideosource, save the script as UTF-8 without BOM and then see if that works.
But, I couldn't solve the problem. As I Open the script in the notepad and save it in utf-8 format, I got the following error:
UTF-8 Source files are not supported, re-save script with ANSI encoding
How can I solve the problem, How can I run my script with a UTF-8 filename?
“Withoutt BOM” is important. You need to save the file as raw UTF-8 without the Microsoft-style faux-BOM. Notepad can't do this, it always saves UTF-8 files with that generally-undesirable 0xEF 0xBB 0xBF header. Most other text editors (e.g. Notepad++) can do it properly.
AviSynth isn't really Unicode-aware so it doesn't want you using UTF-8 and will give that error message to try to stop you making mistakes. ffvideosource's workaround of hiding UTF-8 bytes in what AviSynth sees as ‘ANSI’ characters only works as long as AviSynth sees the file as ANSI. AviSynth doesn't have very sophisticated encoding-guessing, so removing the faux-BOM is enough to convince it is dealing with ANSI.
Very common problem when using UTF-8 in AviSynth.
Follow these steps:
Check the plugins folder. There should exist these three files: ffms2.dll, ffmsindex.exe, and FFMS2.avsi. If you did not have problem with ANSI, I guess that you don't have FFMS2.avsi in your plugins folder; In this situation download the latest version form here.
After that make an AVS file with Notepad++. For example I do this:
x = "C:/Users/Nemat/Desktop/StackOverFlow/نعمت.mkv"
ffmpegsource2(x,utf8=true)
Please note that here I used ffmpegsource2().
In the Encoding menu from Notepadd++ select Encode in UTF-8 without BOM.
Save your file.
Check the video file exists in the addressed directory.
Double click on your AVS file.
Enjoy it!

Uploading Amazon Inventory UTF 8 Encoding

I am trying to upload my english inventory to various european amazon sites. The issue I am having is that the accents found in certain different languages are not displaying correctly when an "inventory file" is uploaded to amazon. The inventory file is a tab delimited text file.
current setup:
$type = 'text/tab-separated-values; charset=utf-8';
header('Content-Type:'.$type);
header('Content-Disposition: attachment; filename="inventory-'.$_GET['cc'].'.txt');
header('Content-Length: ' . strlen($data));
header('Content-Encoding: UTF-8');
When the text file is outputted and saved it looks exactly how it should when opened in windows (all the characters are correct) but for some reason amazon doesn't see it as UTF8 and re-encodes it with all of the characters found here:
http://www.i18nqa.com/debug/utf8-debug.html
I have tried adding the BOM to the top of the file but this just results in amazon giving an error. Has anyone else experienced this?
As #fvu pointed out in his comment, Amazon is expecting the ISO-8859-1 format, not UTF-8. That's why you should use PHP's utf8_decode method when writing to your file.
Ok so after a lot of trying it turns out that the characters needed to be decoded. I opened the text files in excel and they seemed to encode themselves as weird characters like ü using php utf8_decode turned them back into the correct characters EVEN THOUGH the text file showed them as the right characters... very confusing.
To anyone out there having difficulties with UTF 8 try decoding first.
thanks for your help

ja chars in windows batch file

What is the secret to japanese characters in a Windows XP .bat file?
We have a script for open a file off disk in kiosk mode:
#ECHO OFF
"%ProgramFiles%\Internet Explorer\iexplore.exe" –K "%CD%\XYZ.htm"
It works fine when the OS is english, and it works fine for the japanese OS when XYZ is made up of english characters, but when XYZ is made up of japanese characters, they are getting mangled into gibberish by the time IE tries to find the file.
If the batch file is saved as Unicode or Unicode big endian the script wont even run.
I have tried various ways of encoding the japanese characters. ampersand escape does not work (〹)
Percent escape does not work %xx%xx%xx
ABC works, AB%43 becomes AB3 in the error message, so it looks like the percent escape is trying to do parameter substitution. This is confirmed because %043 puts in the name of the script !
One thing that does work is pasting the ja characters into a command prompt.
#ECHO OFF
CD "%ProgramFiles%\Internet Explorer\"
Set /p URL ="file to open: "
start iexplore.exe –K %URL%
This tells me that iexplore.exe will accept and parse the parameter correctly when it has ja characters, but not when they are written into the script.
So it would be nice to know what the secret may be to getting the parameter into IE successfully via the batch file, as opposed to via the clipboard and an environment variable.
Any suggestions greatly appreciated !
best regards
Richard Collins
P.S.
another post has has made this suggestion, which i am yet to follow up:
You might have more luck in cmd.exe if you opened it in UNICODE mode. Use "cmd /U".
Batch renaming of files with international chars on Windows XP
I will need to find out if this can be from inside the script.
For the record, a simple answer has been found for this question.
If the batch file is saved as ANSI - it works !
First of all: Batch files are pretty limited in their internationalization support. There is no direct way of telling cmd what codepage a batch file is in. UTF-16 is out anyway, since cmd won't even parse that.
I have detailed an option in my answer to the following question:
Batch file encoding
which might be helpful for your needs.
In principle it boils down to the following:
Use an encoding which has single-byte mappings for ASCII
Put a chcp ... at the start of the batch file
Use the set codepage for the rest of the file
You can use codepage 65001, which is UTF-8 but make sure that your file doesn't include the U+FEFF character at the start (used as byte-order mark in UTF-16 and UTF-32 and sometimes used as marker for UTF-8 files as well). Otherwise the first command in the file will produce an error message.
So just use the following:
echo off
chcp 65001
"%ProgramFiles%\Internet Explorer\iexplore.exe" –K "%CD%\XYZ.htm"
and save it as UTF-8 without BOM (Note: Notepad won't allow you to do that) and it should work.
cmd /u won't do anything here, that advice is pretty much bogus. The /U switch only specifies that Unicode will be used for redirection of input and output (and piping). It has nothing to do with the encoding the console uses for output or reading batch files.
URL encoding won't help you either. cmd is hardly a web browser and outside of HTTP and the web URL encoding isn't exactly widespread (hence the name). cmd uses percent signs for environment variables and arguments to batch files and subroutines.
"Ampersand escape" also known as character entities known from HTML and XML, won't work either, because cmd is also not HTML or XML. The ampersand is used to execute multiple commands in a single line.
I too suffered this frustrating problem in batch/cmd files. However, so far as I can see, no one yet has stated the reason why this problem occurs, here or in other, similar posts at StackOverflow. The nearest statement addressing this was:
“First of all: Batch files are pretty limited in their internationalization support. There is no direct way of telling cmd what codepage a batch file is in.”
Here is the basic problem. Cmd files are the Windows-2000+ successor to MS-DOS and IBM-DOS bat(ch) files. MS and IBM DOS (1984 vintage) were written in the IBM-PC character set (code page 437). There, the 8th-bit codes were assigned (or “clothed” with) characters different from those assigned to the corresponding codes of Windows, ANSI, or Unicode. The presumption of CP437 encoding is unalterable (except, as previously noted, through cmd.exe /u). Where the characters of the IBM-PC set have exact counterparts in the Unicode set, Windows Explorer remaps them to the Unicode counterparts. Alas, even Windows-1252 characters like š and ¾ have no counterpart in code page 437.
Here is another way to see the problem. Try opening your batch/cmd script using the Windows Edit.com program (at C:\Windows\system32\Edit.com). The Windows-1252 character 0145 ‘ (Unicode 8217) instead appears as IBM-PC 145 æ. A batch command to rename Mary'sFile.txt as Mary’sFile.txt fails, as it is interpreted as MaryæsFile.txt.
This problem can be avoided in the case of copying a file named Mary’sFile.txt: cite it as Mary?sFile.txt, e.g.:
xCopy Mary?sFile.txt Mary?sLastFile.txt
You will see a similar treatment (substitution of question marks) in a DIR list of files having Unicode characters.
Obviously, this is useless unless an extant file has the Unicode characters. This solution’s range is paltry and inadequate, but please make what use of it you can.
You can try to use Shift-JIS encoding.