java.io.UnsupportedEncodingException: cp932? - encoding

What type of content would cause this exception?
Caused by: java.io.UnsupportedEncodingException: cp932
at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71)
at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
at com.google.code.com.sun.mail.handlers.text_plain.getContent(text_plain.java:109)
at com.google.code.javax.activation.DataSourceDataContentHandler.getContent(DataHandler.java:803)
at com.google.code.javax.activation.DataHandler.getContent(DataHandler.java:550)
at com.google.code.javax.mail.internet.MimeBodyPart.getContent(MimeBodyPart.java:639)
And why can't OpenJDK handle this encoding?

Any text or text-based content that uses that character set / encoding!
According to Wikipedia, CP932 is an extension of Shift JIS ... which is one of the character sets that is used to represent Japanese text.
According to this page, CP932 is in the "Extended Encoding Set (contained in lib/charsets.jar)". If it is not in your OpenJDK installation, look for a yum / apt / whatever OpenJDK package that offers extra Java character set support. Support for CP932 in OpenJDK is definitely available somewhere ...
It is also possible (though IMO unlikely) that OpenJDK doesn't recognize "cp932" as an alias for what it refers to as "MS932" and "windows-31j".
I checked the code.
The issue is that Java (not just OpenJDK!) does not recognize the "cp932" alias at all. The reason it doesn't recognize it is that the alias is non-standard.
The official (IANA endorsed) name for this encoding is "windows-31j", and Java also supports the following aliases by default:
"MS932"
"windows-932"
"csWindows31J"
If you set the "sun.nio.cs.map" system property (i.e. using "-D...") to "Windows-31J/Shift_JIS", then Java will also recognize "shift-jis", "ms_kanji", "x-sjis", and "csShiftJIS" as equivalents. However, this should only be used for backwards compatibility with old (1.4.0 and earlier) JDKs that didn't implement the real SHIFT-JIS encoding correctly. (Besides, it doesn't solve your problem ...)
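If you want to see which names your particular JVM accepts, here is a quick probe (a minimal sketch; the set of recognized aliases varies between JDK versions and builds):

import java.nio.charset.Charset;

public class AliasProbe {
    public static void main(String[] args) {
        // Candidate names; which of these resolve depends on your JDK.
        String[] names = { "windows-31j", "MS932", "windows-932", "csWindows31J", "cp932" };
        for (String name : names) {
            try {
                System.out.println(name + " -> " + Charset.forName(name).name());
            } catch (RuntimeException e) {
                // IllegalCharsetNameException or UnsupportedCharsetException
                System.out.println(name + " -> " + e);
            }
        }
    }
}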
So what can you do?
Reject / discard the content as invalid. (And it is.)
Find out where this content is coming from, and get them to fix the incorrect encoding name.
Intercept the encoding name in the Google code before it tries to use it, and replace the non-standard name with an appropriate standard one (see the sketch after this list).
Use nasty reflective hackery to add an encoding alias to the private data structure that the Oracle code is using to lookup encodings. (Warning: this may make your application fragile, and lead to portability problems.)
Raise an RFE against Java SE requesting an easy way to add aliases for character encodings. (This is a really long-term solution, though you may be able to accelerate it by writing the proposed enhancement and submitting it to the OpenJDK team as a patch.)
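As a sketch of the interception approach (the third option), you could read the raw part content yourself and decode it with a corrected charset name, instead of letting getContent() trip over the bogus label. This is a minimal sketch only: it assumes the standard javax.mail API (the com.google.code.* classes in your stack trace are a repackaging of it), and the helper name is made up.

import java.io.ByteArrayOutputStream;
import java.io.InputStream;
import java.nio.charset.Charset;
import javax.mail.internet.ContentType;
import javax.mail.internet.MimeBodyPart;

public class CharsetFixup {
    // Hypothetical helper: decode a text part manually, mapping the
    // non-standard "cp932" charset label to the IANA name "windows-31j".
    static String getTextWithFixedCharset(MimeBodyPart part) throws Exception {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (InputStream in = part.getInputStream()) {   // transfer-decoded bytes
            byte[] chunk = new byte[8192];
            for (int n; (n = in.read(chunk)) != -1; ) {
                buf.write(chunk, 0, n);
            }
        }
        String declared = new ContentType(part.getContentType()).getParameter("charset");
        // Assumption: treat a missing or "cp932" label as windows-31j.
        String fixed = (declared == null || "cp932".equalsIgnoreCase(declared))
                ? "windows-31j" : declared;
        return new String(buf.toByteArray(), Charset.forName(fixed));
    }
}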

Related

Encoding issue when using a Spring message source in IDEA

I have a resource bundle property file with the following content:
OwnerImagesController.TerminalContentFormatIsNotAcceptable = \u0424\u0430\u0439\u043b \u0438\u043c\u0435\u0435\u0442 \u043d\u0435\u0434\u043e\u043f\u0443\u0441\u0442\u0438\u043c\u044b\u0439 \u0444\u043e\u0440\u043c\u0430\u0442
In the IDEA configuration, the file looks like this (see the screenshot).
To convert my file to a resource-bundle-compatible state, I directly use the native2ascii.exe application from the JDK.
This is not convenient.
Please help me simplify working with my property file.
According to the official documentation:
It is possible to encode non-ascii symbols using both upper- and lower-case hex symbols (e.g. '\u00E3' vs '\u00e3'). Upper case is used by default. To use lower case, set 'idea.native2ascii.lowercase' property in the bin/idea.properties file to true.
Source:
https://www.jetbrains.com/idea/help/editing-resource-bundle.html
This seems better than editing the vmoptions.
You can enable automatic conversion of non-ascii characters to appropriate escape sequences by checking Transparent native-to-ascii conversion in Settings/File Encodings (the section you show in the screenshot).
I also noticed that the escape sequences in your snippet are lower case (i.e. \u043b instead of \u043B). IntelliJ converts them to uppercase by default. If you want to keep them lowercase to avoid unnecessary VCS changes, add the following property to idea.vmoptions:
-Didea.native2ascii.lowercase=true
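For completeness, here is a minimal Java sketch (the bundle base name "messages" is assumed for illustration) showing that the \uXXXX escapes in the properties file decode back to the original Cyrillic text at runtime:

import java.util.ResourceBundle;

public class BundleDemo {
    public static void main(String[] args) {
        // "messages" is an assumed base name for a messages.properties file on the classpath.
        ResourceBundle bundle = ResourceBundle.getBundle("messages");
        // Java reads \uXXXX escapes back as real characters:
        System.out.println(bundle.getString(
                "OwnerImagesController.TerminalContentFormatIsNotAcceptable"));
        // Prints: Файл имеет недопустимый формат ("The file has an invalid format")
    }
}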

Porting a Delphi 2006 app to XE

I want to port several large apps from Delphi 2006 to XE. The reasons are not so much to do with Unicode as to take advantage of (hopefully) better IDE stability, native PNG support, more components, fewer VCL bugs, less dependence on 3rd party stuff, less ribbing from you guys, etc. Some of the apps might benefit from Unicode, but that's not a concern at present. At the moment I just want to take the most direct route to getting them to compile again.
As a start, I have changed all ambiguous string declarations, i.e. string to AnsiString or ShortString, char to AnsiChar and pChar to pAnsiChar and recompiled with D2006. So far so good. Nothing broke.
My question is: Where to from here? Assuming I just present my sources to the XE compiler and light the touch paper, what is likely to be the big issue?
For example,
var
  S: AnsiString;
...
MainForm.Caption := S;
Will this generate an error? A warning? I'm assuming the VCL is now Unicode, or will XE pull in a non-Unicode component, or convert the string? Is it in fact feasible to keep an app using 8-bit strings in XE, or will there be too many headaches?
If the best/easiest way to go is to Unicode, I'll do that, even though I won't be using the extended characters, at least in the near future anyway.
The other thing that I wonder about is 3rd party stuff. I guess I will need to get updated versions that are XE-compatible.
Any (positive!) comment appreciated.
It is a long jump from Delphi 2006 to XE.
But it is possible if you consider that:
You have to convert String variables using the new conversion methods;
You have to check all the versions between 2006 and XE to know how the libraries have changed, because some have been split, others merged, and a few deleted;
You have to buy/download the upgrade (if any) of your 3rd party components.
The VCL is completely Unicode now, so the code you showed will generate a compiler warning, not an error, about an implicit conversion from AnsiString to UnicodeString. That is a potentially lossy conversion if the AnsiString contains non-ASCII characters (which the compiler cannot validate). If you continue using AnsiString, then you have to do an explicit type-cast to avoid the warning:
var
  S: AnsiString;
...
MainForm.Caption := String(S);
You are best off NOT Ansi-fying your code like this. Embrace Unicode. Your code will be easier to manage for it, and it will be more portable to future versions and platforms. You should restrict AnsiString usage to just those places where Ansi is actually needed - network communications, file I/O of legacy data, etc. If you want to save memory inside your app, especially if you are using ASCII characters only, use UTF8String instead of AnsiString. UTF-8 is an 8-bit encoding of Unicode, and conversions between UTF8String and UnicodeString are loss-less with no compiler warnings.
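The loss-less claim at the end is easy to check. The round trip below is sketched in Java rather than Delphi, purely because it shows the byte-level behaviour compactly; the UTF8String <-> UnicodeString conversions behave the same way:

import java.nio.charset.StandardCharsets;

public class Utf8RoundTrip {
    public static void main(String[] args) {
        String original = "Übergrößenträger";  // arbitrary non-ASCII text
        byte[] utf8 = original.getBytes(StandardCharsets.UTF_8);  // like UnicodeString -> UTF8String
        String back = new String(utf8, StandardCharsets.UTF_8);   // like UTF8String -> UnicodeString
        System.out.println(original.equals(back));  // true: nothing was lost
    }
}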

Plone 4.0.5 and Unicode confusion

First of all: I'm using FreeBSD 8.1, Plone 4.0.5, and testing both Data.fs and RelStorage 1.5.0b2 (PostgreSQL 9.0.3). I'm from Denmark and we use the Danish letters "æøå".
I'm confused about encoding, but my initial guess is that the best way to go is with Unicode (UTF-8). What is the correct way to configure FreeBSD, Plone (and products), and PostgreSQL to handle Danish letters? I've already been told that the encoding does not matter for PostgreSQL.
I've seen comments about site.py and sitecustomize.py while googling for errors - please comment.
Thanks.
Nikolaj G.
Plone and all its add-ons support Unicode by default; you don't need to configure the encoding at any level.
Even when using RelStorage, we only store binary data inside the SQL database and no strings, so there's no de/encoding taking place at this level.
Changing the Python default encoding in site.py or sitecustomize.py is actually harmful and you should not do this. It will only mask actual programming errors inside the code base and can lead to inconsistent data.
Inside the codebase we do use a mixture of both Unicode and utf-8 encoded strings. So generally your code will have to be written in a way to handle both of these. This is unfortunate but a side-effect of us slowly migrating to proper Unicode at all levels.

Which programming languages were designed with Unicode support from the beginning?

Which widely used programming languages were designed ground-up with Unicode support?
A lot of programming languages have added Unicode support as an afterthought in later versions, but which widely used languages were released with Unicode support from day one?
Java was probably the first popular language to have ground-up Unicode support.
Basically all of the .NET languages are Unicode languages, such as C# and VB.NET.
There were many breaking changes in Python 3, among them the switch to Unicode for all text.
So Python wasn't designed ground-up for Unicode, but Python 3 was.
I don't know how far this goes in other languages, but a fun thing about C# is that not only is the runtime (the string class etc.) Unicode-aware - Unicode is also fully supported in source:
using משליט = System.Object;
using תוצאה = System.Int32;

public class שלום : משליט {
    public תוצאה בית() {
        int אלף = 0;
        for (int λ = 0; λ < 20; λ++) אלף += λ;
        return אלף;
    }
}
Google's Go programming language supports Unicode and works with UTF-8.
It really is difficult to design future-proof Unicode support into a programming language right from the beginning.
Java is one of the languages that had this designed into the language specification. However, Unicode support in v1.0 of Java is different from that in v5 and v6 of the Java SDK. This is primarily due to the version of Unicode that the language specification catered to when the language was originally designed. Java attempts to track changes in the Unicode standard with every major release.
Early implementations of the JLS could claim Unicode support, primarily because Unicode itself supported 65536 characters (v1.0 of Java supported Unicode 1.1, and Java v1.4 supported Unicode 3.0), which was compatible with the 16-bit storage space taken up by characters. That changed with Unicode 3.1 - it's an evolving standard, usually with more characters getting added in each release. The characters added in 3.1 and later, which fall outside the original 16-bit range, were called supplementary characters. Support for supplementary characters was added in Java 5 via JSR-204; Java 5 and 6 support Unicode 4.0.
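The practical consequence is easy to demonstrate: a supplementary character occupies two 16-bit chars (a surrogate pair) in a Java String, and only the code-point APIs added in Java 5 see it as one character. A minimal sketch:

public class SupplementaryDemo {
    public static void main(String[] args) {
        String s = "\uD834\uDD1E";  // U+1D11E MUSICAL SYMBOL G CLEF, as a surrogate pair
        System.out.println(s.length());                       // 2 UTF-16 code units
        System.out.println(s.codePointCount(0, s.length()));  // 1 code point (Java 5+ API)
        System.out.printf("U+%X%n", s.codePointAt(0));        // U+1D11E
    }
}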
Therefore, don't be surprised if different programming languages implement Unicode support differently.
On the other hand, PHP (!!) and Ruby did not have Unicode support built into them at inception.
PS: Support for v5.1 of Unicode is expected in Java 7.
Java and the .NET languages, as other commenters have pointed out, although Java's strings are UTF-16 rather than UCS or UTF-8. (At the time, it seemed like a sensible idea! Now clearly either UTF-8 or UCS would be better.) And Python 3 is really a different, incompatible language from Python 1.x and 2.x, so it qualifies too.
The Plan9 languages around 1992 were probably the first to do this: their dialect of C, rc, Alef, mk, ACID, and so on, were all Unicode-enabled. They took the very simple approach that anything that wasn't ASCII was an identifier character. See their paper from 1993 on the subject. (This is the project where UTF-8 was invented, which meant they could do this in a pretty compatible way, in particular without plumbing binary-versus-text through all their programs.)
Other languages that support non-ASCII identifiers include current PHP.
Perl 6 has complete Unicode support from scratch.
(With the Rakudo Perl 6 compiler being the first implementation.)
The documentation covers a general overview and the Unicode operators.
Strings, regular expressions, and grammars all operate on graphemes, even for those codepoint combinations for which there is no composed representation (an artificial composed-representation codepoint is generated on the fly for those cases).
A special encoding, "utf8-c8", exists to handle data of unknown encoding: it assumes UTF-8 where possible, but creates artificial codepoints for undecodable sequences, allowing them to round-trip if necessary.
Python 3.x: http://docs.python.org/dev/3.0/whatsnew/3.0.html
A feature that was included in a language when it was first designed is not always the best version of it.
Languages have changed over time, and many have become bloated with extra features, while not necessarily keeping the features they first included up to date.
So I just throw out the idea that you shouldn't necessarily discount languages that have recently added Unicode. They will have the advantage of adding Unicode to an already mature development tool, and getting the chance to do it right the first time.
With that in mind, I want to ensure that Delphi is included here, as one of your answers. Embarcadero added Unicode in their Delphi 2009 version and did a mighty fine job on it. It was enough to finally prompt me to upgrade from the Delphi 4 that I had been using for 10 years.
Java uses characters from the Unicode character set.
Java and the .NET languages.

What problems should I expect when moving legacy Perl code to UTF-8?

Until now, the project I work on has used only ASCII in its source code. Due to several upcoming changes in the I18N area, and also because we need some Unicode strings in our tests, we are thinking about biting the bullet and moving the source code to UTF-8, while using the utf8 pragma (use utf8;).
Since the code is all ASCII now, I don't expect any trouble with the code itself. However, I'm not quite aware of any side effects we might get, and I think it's quite probable that there will be some, considering our environment (perl5.8.8, Apache2, mod_perl, MSSQL Server with the FreeTDS driver).
If you have done such migrations in the past: what problems can I expect? How can I manage them?
The utf8 pragma merely tells Perl that your source code is UTF-8 encoded. If you have only used ASCII in your source, you won't have any problems with Perl understanding the source code. You might want to make a branch in your source control just to be safe. :)
If you need to deal with UTF-8 data from files, or write UTF-8 to files, you'll need to set the encodings on your filehandles and encode your data as external bits expect it. See, for instance, With a utf8-encoded Perl script, can it open a filename encoded as GB2312?.
Check out the Perl documentation that tells you about Unicode:
perlunicode
perlunifaq
perlunitut
Also see Juerd's Perl Unicode Advice.
A few years ago I moved our in-house mod_perl platform (~35k LOC) to UTF-8. Here are the things which we had to consider/change:
despite the perl doc advice of 'only where necessary', go for using 'use utf8;' in every source file - it gives you consistency.
convert your database to UTF-8 and ensure your DB config sets the connection charset to UTF-8 (in MySQL, watch out for field length issues with VARCHARs when doing this)
use a recent version of DBI - older versions don't correctly set the utf8 flag on returned scalars
use the Encode module, avoid using perl's built in utf8 functions unless you know exactly what data you're dealing with
when reading UTF-8 files, specify the layer - open($fh,"<:utf8",$filename)
on a RedHat-style OS (even 2008 releases) the included libraries won't like reading XML files stored in utf8 scalars - upgrade perl or just use the :raw layer
in older perls (even 5.8.x versions) some older string functions can be unpredictable - eg. $b=substr(lc($utf8string),0,2048) fails randomly but $a=lc($utf8string);$b=substr($a,0,2048) works!
remember to convert your input - eg. in a web app, incoming form data may need decoding
ensure all dev staff know which way around the terms encode/decode are - a 'utf8 string' in perl is in /de/-coded form, a raw byte string containing utf8 data is /en/-coded (see the sketch after this list)
handle your URLs properly - /en/-code a utf8 string into bytes and then do the %xx encoding to produce the ASCII form of the URL, and /de/-code it when reading it from mod_perl (eg. $uri=utf_decode($r->uri()))
one more for web apps, remember the charset in the HTTP header overrides the charset specified with <meta>
I'm sure this one goes without saying - if you do any byte operations (eg. packet data, bitwise operations, even a MIME Content-Length header) make sure you're calculating with bytes and not chars
make sure your developers know how to ensure their text editors are set to UTF-8 even if there's no BOM on a given file
remember to ensure your revision control system (for google's benefit - subversion/svn) will correctly handle the files
where possible, stick to ASCII for filenames and variable names - this avoids portability issues when moving code around or using different dev tools
One more - this is the golden rule - don't just hack til it works, make sure you fully understand what's happening in a given en/decoding situation!
I'm sure you already had most of these sorted out but hopefully all that helps someone out there avoid the many hours debugging which we went through.
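The encode/decode distinction in the list above is language-neutral, so here is a minimal sketch of it in Java (used only because its type system makes the text-versus-bytes split explicit; Perl's Encode::decode and Encode::encode play exactly the same roles):

import java.nio.charset.StandardCharsets;

public class EncodeDecodeDemo {
    public static void main(String[] args) {
        String text = "héllo";                                     // decoded form: characters
        byte[] bytes = text.getBytes(StandardCharsets.UTF_8);      // encode: characters -> bytes
        System.out.println(text.length());                         // 5 characters
        System.out.println(bytes.length);                          // 6 bytes ("é" is 2 in UTF-8)
        String again = new String(bytes, StandardCharsets.UTF_8);  // decode: bytes -> characters
        System.out.println(text.equals(again));                    // true: a clean round trip
    }
}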