Python 3: is using sys.stdout.buffer.write() good style?

After learning about reading unicode files in a Python 3.0 web script, it's now time for me to learn to use print() with unicode.
I searched for how to write unicode; for example, this question explains that you can't write unicode characters to a non-unicode console. In my case, however, the output is handed to Apache, and I am sure it is capable of handling unicode text. Yet for some reason the stdout of my web script is in ascii.
Obviously, if I were opening a file to write myself, I would do something like
open(filename, 'w', encoding='utf8')
but since I'm given an open stream, I resorted to using
sys.stdout.buffer.write(mytext.encode('utf-8'))
and everything seems to work. Does this violate some rule of good behavior or have any unintended consequences?

I don't think you're breaking any rule, but
sys.stdout = codecs.EncodedFile(sys.stdout, 'utf8')
looks like it might be handier / less clunky.
Edit: per the comments, this isn't quite right -- @Miles gave the right variant (thanks!):
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
Edit: if you can arrange for the environment variable PYTHONIOENCODING to be set to utf8 when Apache starts your script, that would be even better, since sys.stdout would then be set to utf8 automatically; but if that's infeasible or impractical, the codecs solution stands.
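For illustration, a minimal sketch of the codecs approach in action (assuming a Python 3 interpreter whose inherited stdout encoding is ascii):
import codecs, sys

# Wrap the underlying binary buffer so everything print() emits is UTF-8.
sys.stdout = codecs.getwriter('utf8')(sys.stdout.buffer)
print('héllo wörld')  # written as UTF-8 regardless of the inherited locale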

This is an old question, but I'll add my version here, since I first ventured here before finding my solution.
One of the issues with codecs.getwriter is that, if you are running a script, the output will be block-buffered (whereas stdout attached to a console is normally line-buffered, flushing after every line).
sys.stdout in the console is an io.TextIOWrapper, so my solution uses that directly. It also allows you to set line_buffering=True or False.
For example, to make stdout backslash-escape all unencodable output instead of raising an error:
import io, sys

enc = sys.stdout.encoding  # grab this before detach() invalidates the old stream
sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding=enc,
                              errors="backslashreplace", line_buffering=True)
To force a specific encoding (in this case utf8):
sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding="utf8",
line_buffering=True)
A note: calling sys.stdout.detach() renders the old sys.stdout object unusable, since it hands back the underlying buffer. Also, some modules use sys.__stdout__, which holds the stream sys.stdout pointed at when the program started, so you may want to set that as well:
enc = sys.stdout.encoding  # capture before the wrappers are detached
sys.stdout = sys.__stdout__ = io.TextIOWrapper(sys.stdout.detach(), encoding=enc, errors="backslashreplace", line_buffering=True)
sys.stderr = sys.__stderr__ = io.TextIOWrapper(sys.stderr.detach(), encoding=enc, errors="backslashreplace", line_buffering=True)
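As a quick illustration of what errors="backslashreplace" does, here is a minimal sketch; the ascii encoding is forced here deliberately just to trigger the replacement:
import io, sys

# Re-wrap stdout with a deliberately narrow encoding to show the effect.
sys.stdout = io.TextIOWrapper(sys.stdout.detach(), encoding="ascii",
                              errors="backslashreplace", line_buffering=True)
print("café")  # prints caf\xe9 instead of raising UnicodeEncodeError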


ECL: The filesystem does not accept filenames with extended characters

How do you open a file whose name contains UTF-8 characters?
For example:
(open "~/a/你好.txt")
gives this:
The filesystem does not accept filenames with extended characters: "~/a/你好.txt"
I'm using ECL 16.1.3, installed via emerge on Gentoo.
Meanwhile, SBCL can open the file.
I'm pretty sure ECL simply does not support general unicode filenames on Unix or Linux, however they get encoded in the underlying filesystem (I also don't know how that happens on *nix nowadays, although I guess there must be a standard by now).
The specific error you're seeing originates here, in pathname.d. If you then look in unixfsys.d you'll see that ECL_NAMESTRING_FORCE_BASE_STRING is one of the flags passed to ecl_namestring all over the place, and this isn't conditionalized by anything.
So at the very least you would need to compile ECL from scratch, and more probably it simply does not support general unicode filenames at all.

Can you write Perl 6 scripts using an encoding that is not utf8?

Perl 5 has the encoding pragma and the Filter::Encoding module; however, I have not found anything similar in Perl 6. I guess source filters will eventually be created, but for the time being, can you use other encodings in Perl 6 scripts?
You cannot write your Perl 6 script in anything except utf8. I don't think there will ever be any other encoding you will be allowed to write your script in, as utf8 is basically the universal standard. Benefits like having no endianness issues and being backward compatible with ASCII are some of the reasons it has become the standard rather than utf16 or utf32.
Maybe there was a time when such a thing would have been useful, but today I do not see that being the case. All text editors in common usage that I know of default to utf8, and having files in multiple encodings makes it more difficult to share your Perl 6 programs with others. There are plenty of reasons to want to use other encodings external to Perl 6 (writing to files, reading files, etc.), but I don't see adding source filters as a smart move.
Rakudo currently supports an --encoding= option, so you might in theory be able to write a script in a different character encoding and call it with perl6 --encoding=utf16 yourscript.p6. But in my experiments I haven't managed to get it working with anything except utf8, and even if it worked, specifying --encoding on the command line would be a big no-go for me.
So the operational answer is: currently no.
(And I don't think anybody else has asked for it yet...)

Pyside does something weird with locale

Here are the relevant code lines for a unicode/PySide-related problem:
#!/usr/bin/env python
# coding=utf-8
...
msgBox = QtGui.QMessageBox()
msgBox.setText('é')
print 'é'
....
The print does what it should, implying my locale is utf-8, and printenv confirms it. On the other hand, the msgBox shows 'Ã©' unless I prefix the string with u. Is that normal, and do I really have to prefix every string with u in order to use PySide, when Python itself never raises a problem?
Thanks for your attention.
In Python 2.x there are two string types: byte strings and unicode strings. With the u prefix you choose unicode; with no prefix you choose a byte string. If you print bytes('é') you see where PySide/PyQt get their data from. So it probably is normal.
The easy way out is using Python 3.x, which has only unicode strings and won't bother you with the difference.
Not sure who is actually doing something that shouldn't be done, the print statement in Python 2.x or PySide/PyQt for Python 2.x, so I make it community wiki.
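If you must stay on Python 2, one workaround is to make every literal unicode by default with a __future__ import. A minimal sketch, assuming PySide 1 on Python 2 (the QApplication boilerplate is added here just to make it runnable):
#!/usr/bin/env python
# coding=utf-8
from __future__ import unicode_literals  # every literal below is now unicode

import sys
from PySide import QtGui

app = QtGui.QApplication(sys.argv)
msgBox = QtGui.QMessageBox()
msgBox.setText('é')  # a unicode literal now, so it displays correctly
msgBox.exec_()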

Using Perl, is there a difference between Win32API::File::MoveFile and CORE::rename on MSWin32?

I see that Win32API::File supports MoveFile(). However, I'm not sure how CORE::rename() is implemented in a fashion that would make this matter. Could someone explain the difference, specifically for the Win32 environment, between
CORE::rename()
File::Copy::move()
and Win32API::File::MoveFile()?
rename has been implemented in a broken fashion since forever; move too, since it uses rename.
Win32::Unicode::File exposes MoveFileW from windows.h as moveW, and apparently handles encoding in a sane fashion, whereas Win32API::File leaves that to the user AFAICS from existing example code.
Related: How do I copy a file with a UTF-8 filename to another UTF-8 filename in Perl on Windows?

Why shouldn't I use shell tools in Perl code?

It is generally advised not to shell out to additional Linux tools in Perl code;
e.g., if someone intends to print the last line of a text file, he can use:
$last_line = `tail -1 $file` ;
or else open the file and read it line by line:
open(my $info, '<', $file) or die "Could not open $file: $!";
while (<$info>) {
    $last_line = $_ if eof;
}
What are the pitfalls of using the previous and why should I avoid using shell tools in my code?
Thanks.
Efficiency - you don't have to spawn a new process
Portability - you don't have to worry about an executable not existing, accepting different switches, or having different output
Ease of use - you don't have to parse the output, the results are already in a usable form
Error handling - you have finer-grained control over errors and what to do about them in Perl.
It's better to keep all the action in Perl because it's faster and because it's more secure. It's faster because you're not spawning a new process, and it's more secure because you don't have to worry about shell metacharacter trickery.
For example, in your first case if $file contained "afilename ; rm -rf ~" you would be a very unhappy camper.
P.S. The best all-Perl way to do the tail is to use File::ReadBackwards.
One of the primary reasons (besides portability) for not executing shell commands is that it introduces overhead by spawning another process. That's why much of the same functionality is available via CPAN in Perl modules.
One reason is that your Perl code might be running in an environment where there is no shell tool called 'tail'.
It's a personal call depending on the project:
Is it going to be always used in shell environments with tail?
Do you care about only using pure Perl code?
Using tail? Fine. But that's really a special case, since it's so easy to use and so trivial.
The problem in general is not really efficiency or portability; those are largely irrelevant. The issue is ease of use. To run an external utility, you have to find out what arguments it accepts, write code to transform your program's data structures into that format, quote them properly, build the command line, and run the application. Then you might have to feed it data and read data back from it (involving complexity like an event loop, worrying about deadlocking, etc.), and finally interpret the return value. (UNIX processes treat an exit status of 0 as success and anything else as failure, but Perl treats 0 as false, so system('foo') and die reads backwards and is hard to follow.) This is a lot of work, and that's why people avoid it. It's much easier to create an instance of a class and call methods on it to get the data you need.
(You can abstract away processes this way; see Crypt::GpgME for example. It handles the complexity associated with invoking gpg, which would normally involve creating multiple filehandles other than STDOUT, STDIN, and STDERR, among other things.)
The main reason I see for doing it all in Perl would be for robustness. Your use of tail will fail if the filename has shell metacharacters or spaces or doesn't exist or isn't accessible. From Perl, characters in the filename aren't an issue, and you can distinguish between errors in accessing the file. Sometimes being robust is more important than speedy coding and sometimes it's not.