How to handle Unicode issues in PHP?

How do I properly handle Unicode issues in PHP? Do I just pass UTF-8 as a parameter to any function that needs it, or do I set it as the locale once, somewhere in the bootstrap file? How does it affect MySQL, etc.?

There's no way to set a single global encoding in PHP that every function respects. The best you can do is always use the mb_ family of functions and always be explicit about the encoding you want to use.
As for MySQL in particular, you can make sure that it connects using UTF-8, usually by calling a set-encoding method or function right after connecting, or in the constructor if you're using PDO (see http://ar.php.net/manual/en/pdo.construct.php).
Using UTF-8 with PHP requires some discipline, but it's definitely worth it. Hope some of this helps.

Related

Can you write Perl 6 scripts using an encoding that is not utf8?

Perl 5 has the encoding pragma or the Filter::Encoding module, however, I have not found anything similar in Perl 6. I guess eventually source filters will be created, but for the time being, can you use other encodings in Perl 6 scripts?
You cannot write your Perl 6 script in anything except utf8. I don't think any other encoding will ever be allowed for scripts, as utf8 is basically the universal standard. Benefits like having no endianness issues and being backward compatible with ASCII are some of the reasons it became the standard rather than utf16 or utf32.
Maybe there was a time when such a thing would have been useful, but today I do not see that being the case. All text editors in common use that I know of default to utf8, and having files in multiple encodings makes it more difficult to share your Perl 6 programs with others. There are plenty of reasons to want other encodings external to Perl 6 (writing to files, reading files, etc.), but I don't see adding filters as a smart move.
Rakudo currently supports an --encoding= option, so you might in theory be able to write a script in a different character encoding, and call it with perl6 --encoding=utf16 yourscript.p6. But in my experiments, I haven't managed to get it working with anything except utf8, and even if it worked, specifying --encoding on the command line would be a big no go for me.
So the operational answer is: currently no.
(And I don't think anybody else has asked for it yet...)

How does computer display a character on the screen with the correct encoding?

I'm interested in the encoding of the character in the computer.
When I open my xxx.c with Visual Studio Code, how does VS Code detect the encoding of my file and interpret the "01" sequence? Furthermore, how does Visual Studio Code (or even the computer system) display the characters on the screen according to my "01" sequence file and the character encoding?
Thank you!
I also use Chinese in my projects, and sometimes the file encoding really drives me crazy. Sometimes a correct UTF-8 file created by editor A, for example, is destroyed by some text editor B that interprets it as a GBK file, and editor A can never get it back correctly.
I have searched a lot, but most answers seem too abstract or irrelevant. I want to figure out how the software and the computer system (or operating system) cooperate together to make this simple but important job done!
First things first, regarding "can never get it back": always use source code control.
As for "How the software and the computer system (or operating system) cooperate together to make this simple but important job done!": they don't, and that's the problem!
Short history: Many decades ago people used small character sets. The idea was a system would always use the same one. Simple. Every time a text file was transferred between systems, it would be immediately transcribed to the local character encoding. Then came the globalization of file exchanges and systems needed to hold text files in different encodings. There was no general way of recording what the encoding was. In 1991 came the huge character set Unicode. Languages (VB4, Java), operating system APIs (Win32), file systems (NTFS), … began adopting it. However, its encodings (UTF-8, UTF-16) are just yet more possibilities for which encoding a text file uses. Many programs that read text files either rely on the old system of a system default encoding or guess ("detect").
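To make the "default or guess" point concrete, here is a minimal sketch in Python (chosen only for illustration; it is not how VS Code or any particular editor works). It shows the same bytes read under two different assumptions, plus a hypothetical guess_encoding helper that checks for a UTF-8 byte order mark, then tries strict UTF-8, then falls back to an assumed legacy encoding. The fallback of GBK is an assumption taken from the question.

```python
# The same bytes on disk, read under two different assumptions.
text = "中文"
raw = text.encode("utf-8")      # what editor A wrote to disk

as_utf8 = raw.decode("utf-8")   # the intended reading
# What editor B shows if it assumes GBK; errors="replace" keeps it from raising.
as_gbk = raw.decode("gbk", errors="replace")


def guess_encoding(data: bytes, fallback: str = "gbk") -> str:
    """Hypothetical heuristic: BOM first, then strict UTF-8, then a fallback."""
    if data.startswith(b"\xef\xbb\xbf"):
        return "utf-8-sig"      # UTF-8 with a byte order mark
    try:
        data.decode("utf-8")    # strict decoding fails on most non-UTF-8 input
        return "utf-8"
    except UnicodeDecodeError:
        return fallback         # nothing in the file says what it really is


print(as_utf8 == text)                      # the UTF-8 reading round-trips
print(as_gbk == text)                       # the GBK reading does not
print(guess_encoding(raw))                  # bytes written as UTF-8
print(guess_encoding(text.encode("gbk")))   # bytes written as GBK
```

The file itself carries no record of its encoding, so the fallback branch is only ever a guess. Once an editor saves the misread text back to disk, the original bytes may be gone, which is why source control matters.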
In the programming world, some languages require source files to use a specific encoding (say UTF-8); in others, tools default to a specific encoding (say UTF-8). In most cases, the toolset provided with a C or C++ implementation will have a consistent set of rules. If you also use an IDE or other form of project system, you can set the encoding for the entire project and in some cases for specific files.
So, the only solution is to only use tools that work for you and to properly configure them. If it hurts, stop doing it.
Aside: On the topic of programming and default character encodings, be careful not to get tricked by various language libraries' use of the system default character encoding, unless that is exactly what's needed. Otherwise, you are giving your users the same problem that you are encountering. (In Java, just avoid it with explicit arguments. In C and C++ libraries, encoding is combined into locales. But note that many systems initialize a program to use the default character encoding.)
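The same "explicit arguments" advice can be sketched in Python, used here purely as an illustration because the aside names Java and C/C++ without showing code. The file name demo.txt and the temporary directory are assumptions for the example. When the encoding argument is omitted, open() may fall back to a platform-dependent default, which also varies with the Python version and settings.

```python
import locale
import os
import tempfile

# Hypothetical scratch file, used only for this demonstration.
path = os.path.join(tempfile.mkdtemp(), "demo.txt")

# Be explicit when writing...
with open(path, "w", encoding="utf-8") as f:
    f.write("中文\n")

# ...and when reading, so the result does not depend on the machine.
with open(path, encoding="utf-8") as f:
    content = f.read()

# The locale's preferred encoding, which text I/O may fall back to
# when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print(default_encoding)
```

Passing the encoding at every read and write is more typing, but the program then behaves the same on a machine whose default is GBK as on one whose default is UTF-8.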

Using Perl, is there a difference between Win32API::File::MoveFile and CORE::rename on MSWin32?

I see that Win32API::File supports MoveFile(). However, I'm not sure how CORE::rename() is implemented in such a fashion that it should matter. Could someone juxtapose the difference -- specifically for the Win32 Environment -- between
CORE::rename()
File::Copy::move()
and, Win32API::File::MoveFile()
rename has been implemented in a broken fashion since forever; move too, since it uses rename.
Win32::Unicode::File exposes MoveFileW from windows.h as moveW, and apparently handles encoding in a sane fashion, whereas Win32API::File leaves that to the user AFAICS from existing example code.
Related: How do I copy a file with a UTF-8 filename to another UTF-8 filename in Perl on Windows?

Hitting ORA-01461 when inserting multibyte characters from perl into oracle

I have a Perl script that inserts records from a text file into our database. Whenever a record has a multibyte character, like "RODR_Í_GUEZ", I receive the error ORA-01461; however, I'm nowhere near the 4000 characters needed to switch from varchar2 to long.
setting:
$ENV{NLS_CHARACTERSET} = 'AL32UTF8';
before connecting doesn't seem to help.
Using a Java client (SQuirreL SQL) and manually writing the INSERT INTO statement inserts the record just fine, so I'm sure it's not how the database is configured.
Any thoughts?
You probably want to set the NLS_LANG environment variable. For Unix-ish systems, there is a script supplied in $ORACLE_HOME/server/bin called nls_lang.sh to output a reasonable value for your system, based on the LANG environment variable.
e.g. for my system (LANG=en_GB.UTF-8) the equivalent Oracle setting is
NLS_LANG=ENGLISH_UNITED KINGDOM.AL32UTF8
More info: http://forums.oracle.com/forums/thread.jspa?threadID=381531
Sergiusz's post there says practically all you need to know: I'll just add that the Perl DBD::Oracle driver is OCI-based, and the pure-Java JDBC driver isn't, hence they work differently in the same environment.

How can I combine Catalyst and ngettext?

I'm trying to get my head around i18n with Catalyst. As far as I understood the matter, there are two ways to make translations with Perl: Maketext and Gettext. However, I have a requirement to support gettext's .po format so basically I'm going with gettext.
Now, I've found Catalyst::Plugin::I18n and thus Locale::Maketext::Lexicon, which does what I want most of the time. However, it doesn't generate proper pluralization forms, i.e. properly writing msgid_plural and msgstr[x] into the .pot file. This happens probably because Maketext depends on its bracket notation [quant,_1...] and thus has to have the same notation in the translation.
Yet another solution might be using some direct gettext port like Locale::Messages, however this would mean rewriting C::P::I18n.
Does anybody have a proper solution for this problem apart from rewriting several modules? Anything that combines proper gettext with all its features and Catalyst will do.
You will probably get a better answer on the mailing list:
http://lists.scsys.co.uk/cgi-bin/mailman/listinfo/catalyst
I assume you've also read this:
http://www.catalystframework.org/calendar/2006/18