Japanese filename being read as 8.3 on Windows - perl

I'm working on a Windows system with Perl 5.8.8 (yes, it's old but it's what's on the server and I can't change that).
I've been sent a CSV data file from our Japan office. The filename is 'a151110a(01_WeLBA①).csv'. My Perl script can open and read the file, but it sees the filename as 'A15111~1.CSV'; that is, it collapses the filename into 8.3 format. I've tried using glob and readdir to build a file list, and both give the same result.
The problem is that I need to pull some info out of the filename: the part inside the parentheses, '01_WeLBA'. But Perl doesn't seem to "see" that. The parentheses have a space (or other whitespace character) just before or after them. If I manually remove those and the '1'-inside-a-circle character, then Perl sees the filename as it is.
Is there a way to get Perl to 'see' the filename as it appears in Windows Explorer?

Use Win32::LongPath's opendirL, readdirL and closedirL.
Windows provides two versions of each system call that accepts strings:
The "UNICODE" version (suffixed with "W" for "wide") accepts/returns strings encoded using UTF-16le. This version supports all Unicode characters.
The "ANSI" version (suffixed with "A") accepts/returns strings encoded using the Active Code Page (ACP). The "A" version only supports a small subset of the Unicode characters.
You can obtain the ACP for your system using the following:
perl -Mv5.14 -MWin32 -e"say Win32::GetACP()"
Unfortunately, Perl's builtin functions (named operators) use the "A" version of system calls and expect/return text encoded using the ACP. This severely limits which file names can be passed to them.
For example, my system's ACP is 1252, so the "A" version of system calls would not support Japanese characters. This means there is nothing I can do to make open, -e, etc. work with file names containing Japanese characters. Ouch.
Win32::LongPath provides alternatives to Perl's builtins that use the "W" version of system calls, and thus support all Unicode characters. For example, -e is just a call to stat, and it provides statL.
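A minimal sketch of the module's directory-object interface (the regex that pulls out the '01_WeLBA' part is my own illustration, not from the question, and Win32::LongPath may require a newer Perl than 5.8.8, so treat this as a starting point, not a drop-in fix):

use strict;
use warnings;
use Win32::LongPath;

binmode STDOUT, ':encoding(UTF-8)';

# Read the directory through the "W" (wide) system calls so we get
# the real Unicode name instead of the 8.3 short name.
my $dir = Win32::LongPath->new;
$dir->opendirL('.') or die "opendirL failed: $^E";

for my $name ($dir->readdirL) {
    next unless $name =~ /\.csv\z/i;
    # Capture the part inside the parentheses; \w stops before the
    # circled digit '①', which is not a word character.
    if ($name =~ /\( \s* (\w+)/x) {
        print "found '$1' in $name\n";    # e.g. 01_WeLBA
    }
}
$dir->closedirL;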

Have a look at Win32::LongPath, it does exactly what you need.

Related

Compare filenames with different encoding in Octave

I'm trying to accomplish the following task in Octave:
Read filename from text file
Search for this file in particular location on hard drive
My script works for most files, but for certain files containing Unicode characters I'm unable to match the filename from the text file with the filename as it appears in the file system.
Filenames in the text file are in UTF-8 encoding and I read them in Octave with the function fgetl().
Filenames from the file system are obtained via the function readdir(). I'm on Windows, with an NTFS file system.
For example, one problematic filename contains the character "Č".
When printed out in the Octave console, the characters appear exactly the same. However, a HEX viewer reveals that the characters are not actually the same: in the first case the character is encoded as 0x010C, in the second case as 0x0043 + 0x030C. Comparing the two via strcmp() fails, of course.
What I tried to do is omit all non-ASCII characters from the filenames and then compare them. But this didn't work, probably because in the second variant the first part of the character (0x0043) is actually ASCII.
Now I'm looking for some way of converting one format to another to be able to compare them. Any ideas?
EDIT:
As I discovered later, the character Č in the filename on Windows is actually written as C + ˇ (a combining caron), which is just another way to write that character. So the difference probably isn't in the encoding standard, but in two different ways of composing one visible character (glyph).
This question then basically becomes a task of matching characters written "at once" (precomposed) against the corresponding pair of letter + combining character.
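What the EDIT describes is the difference between precomposed (NFC) and decomposed (NFD) Unicode. Normalizing both names to the same form before comparing makes the match work. Octave has no built-in normalizer, but the step is the same in any language; here is a sketch in Perl (the language of the rest of this page), using the core Unicode::Normalize module:

use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $from_textfile   = "\x{010C}";     # 'Č' precomposed
my $from_filesystem = "C\x{030C}";    # 'C' + combining caron (decomposed)

# The raw code point sequences differ, so eq fails...
print $from_textfile eq $from_filesystem ? "same\n" : "different\n";

# ...but after normalizing both to NFC they compare equal.
print NFC($from_textfile) eq NFC($from_filesystem) ? "same\n" : "different\n";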

How to achieve fool proof unicode handling in Perl CGI?

So I have a mysql database which is serving an old wordpress db I backed up. I am writing some simple perl scripts to serve the wordpress articles (I don't want a wordpress install).
Wordpress, for whatever reason, stored all quotes as Unicode chars, all ellipses as Unicode chars, all double dashes, all apostrophes; there are Unicode nbsp characters all over the place. It is a mess (this is why I do not want a wordpress install).
In my test environment, which is Linux Mint 17.1, Perl 5.18.2, Mysql 5.5, things mostly work fine when the Content-type line is served up with "charset=utf-8" (except that apostrophes simply never decode properly, no matter what combination of things I try). Omitting the charset causes all Unicode characters to break (except that apostrophes now work). This is OK; with the exception of the apostrophes, I understand what is going on and I have a handle on the data.
Now my production environment, which is a VM, is Linux CentOS 6.5, Perl 5.10.1, Mysql 5.6.22, and here things do not work at all. Whether or not I include the "charset=utf-8" in the Content-type makes no difference: no Unicode characters work correctly (including apostrophes). Maybe it has to do with the lower version of Perl? Does anyone have any insight?
Apart from this very specific case, does anyone know of a fool-proof Perl idiom for handling unicode which comes from the DB? (I'm not sure where in the pipeline things are going wrong, but I have a suspicion it is at the DB-driver level)
One of the problems is that my data is very inconsistent and dirty. I could parse the entire DB, scrub all the Unicode and re-import it -- the point is I want to avoid that. I want a one-size-fits-all collection of Perl scripts for reading wordpress databases.
Dealing with Perl and UTF-8 has been a pain for me. After a good amount of time I learned that there is no "fool proof unicode handling" in Perl ... but there is Unicode handling that can be of help:
The Encode module.
As the perlunifaq says (http://perldoc.perl.org/perlunifaq.html):
When should I decode or encode?
Whenever you're communicating text with anything that is external to
your perl process, like a database, a text file, a socket, or another
program. Even if the thing you're communicating with is also written
in Perl.
So we do this to every UTF-8 text string sent to our Perl process:
my $perl_str = decode('utf8',$myExt_str);
And this to every text string sent from Perl to anything external to our Perl process:
my $ext_str = encode('utf8',$perl_str);
...
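A tiny self-contained round trip shows the effect (the string is my example, not from the question):

use strict;
use warnings;
use Encode qw(decode encode);

my $bytes = "caf\xC3\xA9";             # the UTF-8 octets for 'café'
print length($bytes), "\n";            # 5 -- Perl sees raw octets

my $chars = decode('utf8', $bytes);    # octets -> characters
print length($chars), "\n";            # 4 -- one per character

my $again = encode('utf8', $chars);    # characters -> octets for output
print $again eq $bytes ? "round trip ok\n" : "mismatch\n";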
Now that's a lot of encoding/decoding when we retrieve or send data from/to a mysql or postgresql database. But fear not, because there is a way to tell Perl that EVERY TEXT STRING from/to a database is UTF-8. Additionally we tell the database that every text string should be treated as UTF-8. The only downside is that you need to be sure that every text string really is UTF-8 encoded... but that's another story:
# For MySQL:
# This requires DBD::mysql version 4 or greater
use DBI;
my $dbh = DBI->connect('dbi:mysql:test_db',
                       $username,
                       $password,
                       { mysql_enable_utf8 => 1 });
Ok, now we have the text strings from our database in UTF-8, and the database knows all our text strings should be treated as UTF-8... But what about everything else? We need to tell Perl (AND CGI) that EVERY TEXT STRING we write in our process is UTF-8, AND tell other processes to treat our text strings as UTF-8 as well:
use utf8;
use CGI '-utf8';
my $cgi = CGI->new;
$cgi->charset('UTF-8');
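Putting those pieces together, a minimal end-to-end sketch might look like this (the database name, credentials and query are placeholders of mine, not taken from the question; wp_posts/post_title follow the standard WordPress schema):

#!/usr/bin/perl
use strict;
use warnings;
use utf8;                            # this source file itself is UTF-8
use CGI '-utf8';                     # decode incoming form data as UTF-8
use DBI;

binmode STDOUT, ':encoding(UTF-8)';  # encode everything we print

my $cgi = CGI->new;
print $cgi->header(-type => 'text/html', -charset => 'UTF-8');

my $dbh = DBI->connect('dbi:mysql:wordpress_db', 'user', 'secret',
                       { mysql_enable_utf8 => 1, RaiseError => 1 });

# Rows now come back as Perl character strings, not raw octets.
my ($title) = $dbh->selectrow_array(
    'SELECT post_title FROM wp_posts WHERE ID = ?', undef, 1);

print "<h1>$title</h1>\n";           # the :encoding layer handles output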
UPDATED!
What is a "wide character"?
This is a term used both for characters with an ordinal value greater
than 127, characters with an ordinal value greater than 255, or any
character occupying more than one byte, depending on the context. The
Perl warning "Wide character in ..." is caused by a character with an
ordinal value greater than 255.
With no specified encoding layer, Perl
tries to fit things in ISO-8859-1 for backward compatibility reasons.
When it can't, it emits this warning (if warnings are enabled), and
outputs UTF-8 encoded data instead. To avoid this warning and to avoid
having different output encodings in a single stream, always specify
an encoding explicitly, for example with a PerlIO layer:
# The next line is required to avoid the "Wide character in print" warning
# AND to avoid having different output encodings in a single stream.
binmode STDOUT, ":encoding(UTF-8)";
...
Even with all of this, sometimes you need to encode('utf8', $perl_str). That's why I learned there is no fool-proof Unicode handling in Perl. Please read the perlunifaq (http://perldoc.perl.org/perlunifaq.html).
I hope this helps.

Command-line arguments as bytes instead of strings in python3

I'm writing a python3 program that gets the names of files to process from command-line arguments. I'm confused about the proper way to handle different encodings.
I think I'd rather consider filenames as bytes and not strings, since that avoids the danger of using an incorrect encoding. Indeed, some of my file names use an incorrect encoding (latin1 when my system locale uses utf-8), but that doesn't prevent tools like ls from working. I'd like my tool to be resilient to that as well.
I have two problems: the command-line arguments are given to me as strings (I use argparse), and I want to report errors to the user as strings.
I've successfully adapted my code to use bytes, and my tool can handle files whose names are invalid in the current default encoding, as long as it reaches them by recursing through the filesystem: I convert the arguments to bytes early, and use bytes when calling fs functions. When I receive a filename argument which is invalid, however, it is handed to me as a unicode string with strange characters like \udce8. I do not know what these are, and trying to encode it always fails, be it with utf8 or with the corresponding (wrong) encoding (latin1 here).
The other problem is for reporting errors. I expect users of my tool to parse my stdout (hence wanting to preserve filenames), but when reporting errors on stderr I'd rather encode it in utf-8, replacing invalid sequences with appropriate "invalid/question mark" characters.
So,
1) Is there a better, completely different way to do it? (Yes, fixing the filenames is planned, but I'd still like my tool to be robust.)
2) How do I get the command line arguments in their original binary form (not pre-decoded for me), knowing that for invalid sequences re-encoding the decoded argument will fail, and
3) How do I tell the utf-8 codec to replace invalid, undecodable sequences with some invalid mark rather than dying on me?
When I receive a filename argument
which is invalid, however, it is
handed to me as a unicode string with
strange characters like \udce8.
Those are surrogate characters. The low 8 bits are the original invalid byte.
See PEP 383: Non-decodable Bytes in System Character Interfaces.
Don't go against the grain: filenames are strings, not bytes.
You shouldn't use bytes when you should use a string. A bytes object is a sequence of integers. A string is a sequence of characters. They are different concepts. What you're doing is like using an integer when you should use a boolean.
(Aside: Python stores all strings in-memory under Unicode; all strings are stored the same way. Encoding specifies how Python converts the on-file bytes into this in-memory format.)
Your operating system stores filenames as strings under a specific encoding. I'm surprised you say that some filenames have different encodings; as far as I know, the filename encoding is system-wide. Functions like open default to the default system filename encoding, for example.

Perl strings internals

How are Perl strings represented internally? What encoding is used? How do I handle different encodings properly?
I've been using Perl for quite a long time, but it didn't involve a lot of string handling in different encodings, and when I encountered a minor problem that had something to do with encodings I usually resorted to some shamanic actions.
Until this moment I thought of Perl strings as sequences of bytes, which fit pretty well for my tasks. Now I need to do some processing of a UTF-8 encoded file and here the trouble starts.
First, I read file into string like this:
open(my $in, '<', $ARGV[0]) or die "cannot open file $ARGV[0] for reading";
binmode($in, ':utf8');
my $contents;
{
    local $/;
    $contents = <$in>;
}
close($in);
then simply print it:
print $contents;
And I get two things: a warning, Wide character in print at <scriptname> line <n>, and garbage in the console. So I can conclude that Perl strings have a concept of a "character" that can be "wide" or not, but when printed these "wide" characters are represented in the console as multiple bytes, not as a single "character".
(I now wonder why all my previous experience with binary files worked just as I expected, without any "character" issues.)
Why then do I see garbage in the console? If Perl stores strings as characters in some known encoding, I don't think it would be a big problem to find out the console encoding and print the text properly. (I use Windows, BTW.)
If Perl stores strings as variable-width character sequences (e.g. using the same UTF-8 encoding), why is it done this way? From my C experience, handling such strings is PAIN.
Update.
I use two computers for testing, one runs Windows 7 x64 with English language pack installed, but with Russian regional settings (so I have cp866 as OEM codepage and cp1251 as ANSI) with ActivePerl 5.10.1 x64; another runs Windows XP 32 bit Russian localization with Cygwin Perl 5.10.0.
Thanks to the links, I now have a much more solid understanding of what's going on and how things should be done.
Setting :utf8 before reading from the file is good; it automagically decodes the bytes into the internal encoding. (Which is also UTF-8, but you don't need to know that, and shouldn't rely on it.)
Before printing you need to encode the characters back to bytes.
use Encode;               # provides the general encode()/decode()
utf8::encode($contents);  # in place: characters -> UTF-8 bytes
There is also a two-argument form of encode, for encodings other than Unicode. (That sentence echoes too much, doesn't it?)
Here is a good reference. (Would have been more, but it's my first post.) Check out perlunitut too, and the unicode article on Joel on Software.
http://www.ahinea.com/en/tech/perl-unicode-struggle.html
Oh, and it must use multi-byte strings, because otherwise it's just not unicode.
Perl strings are stored internally in one of two encodings: either an 8-bit byte-oriented native encoding, or UTF-8. For backward compatibility the assumption is that all I/O and strings are in native encoding, unless otherwise specified. Native encoding is usually 8-bit ASCII, but this can be changed with use locale.
In your sample you call binmode on your input handle, changing it to use :utf8 semantics. One effect of this is that all strings read from that handle will be encoded as UTF-8. print writes to STDOUT by default, and STDOUT defaults to expecting natively encoded characters.
Perl, in an attempt to do the right thing, will allow a UTF-8 string to be sent to a natively encoded output, but if there is no encoding attached to that handle then it has to guess how to output multi-byte characters, and it will almost certainly guess wrong. That is what the warning means: a multi-byte character was sent to a stream expecting only single-byte characters, and the result was that the character was probably damaged in translation.
Depending on what you want to accomplish, you can use the Encode module mentioned by dylan to convert the UTF-8 data to a single-byte character set that can be printed safely, or, if you know that whatever is attached to STDOUT can handle UTF-8, you can use binmode(STDOUT, ':utf8'); to tell Perl you want any data sent to STDOUT to be sent as UTF-8.
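For the asker's setup that might look like the following (cp866 is the OEM code page from the question's update; substitute whatever chcp reports on your console):

use strict;
use warnings;

# Decode UTF-8 bytes into characters on input...
open my $in, '<:encoding(UTF-8)', $ARGV[0]
    or die "cannot open file $ARGV[0] for reading: $!";
my $contents = do { local $/; <$in> };
close $in;

# ...and encode characters to the console's code page on output.
# Characters that cp866 cannot represent are substituted (with a
# warning) instead of being dumped as raw UTF-8 garbage.
binmode STDOUT, ':encoding(cp866)';
print $contents;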
You should mention your actual Windows and Perl versions, as this really depends on the versions you use and the installed language packs.
Otherwise have a look at the perlunicode manual first -
Perl uses logically-wide characters to represent strings internally.
It will confirm your statements.
Windows does not install full Unicode character support by default, which might be the reason for your issue. You may need to install an additional language pack.

ja chars in windows batch file

What is the secret to japanese characters in a Windows XP .bat file?
We have a script for opening a file off disk in kiosk mode:
@ECHO OFF
"%ProgramFiles%\Internet Explorer\iexplore.exe" -K "%CD%\XYZ.htm"
It works fine when the OS is English, and it works fine on the Japanese OS when XYZ is made up of English characters, but when XYZ is made up of Japanese characters, they are getting mangled into gibberish by the time IE tries to find the file.
If the batch file is saved as Unicode or Unicode big endian, the script won't even run.
I have tried various ways of encoding the Japanese characters. Ampersand escape does not work (&#12345;).
Percent escape does not work %xx%xx%xx
ABC works, AB%43 becomes AB3 in the error message, so it looks like the percent escape is trying to do parameter substitution. This is confirmed because %043 puts in the name of the script !
One thing that does work is pasting the ja characters into a command prompt.
@ECHO OFF
CD "%ProgramFiles%\Internet Explorer\"
Set /p URL="file to open: "
start iexplore.exe -K %URL%
This tells me that iexplore.exe will accept and parse the parameter correctly when it has ja characters, but not when they are written into the script.
So it would be nice to know what the secret may be to getting the parameter into IE successfully via the batch file, as opposed to via the clipboard and an environment variable.
Any suggestions greatly appreciated !
best regards
Richard Collins
P.S.
another post has made this suggestion, which I am yet to follow up:
You might have more luck in cmd.exe if you opened it in UNICODE mode. Use "cmd /U".
Batch renaming of files with international chars on Windows XP
I will need to find out if this can be from inside the script.
For the record, a simple answer has been found for this question.
If the batch file is saved as ANSI - it works!
First of all: Batch files are pretty limited in their internationalization support. There is no direct way of telling cmd what codepage a batch file is in. UTF-16 is out anyway, since cmd won't even parse that.
I have detailed an option in my answer to the following question:
Batch file encoding
which might be helpful for your needs.
In principle it boils down to the following:
Use an encoding which has single-byte mappings for ASCII
Put a chcp ... at the start of the batch file
Use the set codepage for the rest of the file
You can use codepage 65001, which is UTF-8 but make sure that your file doesn't include the U+FEFF character at the start (used as byte-order mark in UTF-16 and UTF-32 and sometimes used as marker for UTF-8 files as well). Otherwise the first command in the file will produce an error message.
So just use the following:
echo off
chcp 65001
"%ProgramFiles%\Internet Explorer\iexplore.exe" –K "%CD%\XYZ.htm"
and save it as UTF-8 without BOM (Note: Notepad won't allow you to do that) and it should work.
cmd /u won't do anything here, that advice is pretty much bogus. The /U switch only specifies that Unicode will be used for redirection of input and output (and piping). It has nothing to do with the encoding the console uses for output or reading batch files.
URL encoding won't help you either. cmd is hardly a web browser and outside of HTTP and the web URL encoding isn't exactly widespread (hence the name). cmd uses percent signs for environment variables and arguments to batch files and subroutines.
"Ampersand escape" also known as character entities known from HTML and XML, won't work either, because cmd is also not HTML or XML. The ampersand is used to execute multiple commands in a single line.
I too suffered this frustrating problem in batch/cmd files. However, so far as I can see, no one yet has stated the reason why this problem occurs, here or in other, similar posts at StackOverflow. The nearest statement addressing this was:
“First of all: Batch files are pretty limited in their internationalization support. There is no direct way of telling cmd what codepage a batch file is in.”
Here is the basic problem. Cmd files are the Windows-2000+ successor to MS-DOS and IBM-DOS bat(ch) files. MS and IBM DOS (1984 vintage) were written in the IBM-PC character set (code page 437). There, the 8th-bit codes were assigned (or “clothed” with) characters different from those assigned to the corresponding codes of Windows, ANSI, or Unicode. The presumption of CP437 encoding is unalterable (except, as previously noted, through cmd.exe /u). Where the characters of the IBM-PC set have exact counterparts in the Unicode set, Windows Explorer remaps them to the Unicode counterparts. Alas, even Windows-1252 characters like š and ¾ have no counterpart in code page 437.
Here is another way to see the problem. Try opening your batch/cmd script using the Windows Edit.com program (at C:\Windows\system32\Edit.com). The Windows-1252 character 0145 ‘ (Unicode 8217) instead appears as IBM-PC 145 æ. A batch command to rename Mary'sFile.txt as Mary’sFile.txt fails, as it is interpreted as MaryæsFile.txt.
This problem can be avoided in the case of copying a file named Mary’sFile.txt: cite it as Mary?sFile.txt, e.g.:
xCopy Mary?sFile.txt Mary?sLastFile.txt
You will see a similar treatment (substitution of question marks) in a DIR list of files having Unicode characters.
Obviously, this is useless unless an extant file has the Unicode characters. This solution’s range is paltry and inadequate, but please make what use of it you can.
You can try to use Shift-JIS encoding (code page 932, which is the ANSI code page on Japanese Windows, so saving the file as "ANSI" on a Japanese system amounts to the same thing).