less is not displaying what it considers unassigned UTF-8 Unicode characters

Not a question but a small contribution, since I was not finding the information anywhere.
The problem appears even with a proper UTF-8 locale with "older" versions of less.
If you cannot update to a version of less that corrects the issue, I found a workaround:
export LESSUTFBINFMT=*n%C
This instructs less to display any Unicode code points that were successfully decoded with normal attributes (*n) and as they are (%C).
Notes:
I found the %C format by trial and error; I did not find it explicitly documented.
I found this description of the issue: Red Hat bug 1074489.
It mentions a newer version of less solving it in 2014, but I still run into this problem with less-458-9.el7 from 2018.

Related

What is the difference between IBM874 and MS874?

I am trying to add Thai collation support in my driver, and to do so I need to use the appropriate character encoding. After some research, I am left with two options:
Code page 874, which is also known as CP874 and IBM874
and
Code page 1162, which is also known as windows-874, CP1162, IBM1162, MS874, x-windows-874, and x-IBM874
They both seem to belong to the ISO/IEC 8859-11 family and differ from it only by a handful (8 to 9) of symbols; ISO/IEC 8859-11 itself is nearly identical to the Thai standard TIS-620.
My question is: which of the two (IBM874 and MS874) would be the better choice for supporting Thai collation?
I tried both, one after the other, and both seem to do the job. I cannot seem to find much information about the two on Google.
Can someone please help me understand which of the two is the more appropriate or comprehensive choice?
P.S.: I found an Oracle doc which mentions the two, and the only notable difference I see is that:
MS874 is described as "Windows Thai" and is categorized under "Extended Encoding Set" - International Version
whereas
IBM874 is described as "IBM Thai" and falls under "Basic Encoding Set" - European Version
The 'International Version' seems to support all encodings listed on the Oracle page, so I am guessing that is the more extensive or appropriate choice, and I am planning to go ahead with MS874. Am I missing something?
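One way to compare the two empirically is to decode every high byte with each charset and print where they disagree. This is only a minimal sketch, assuming your JRE ships the extended charset providers and registers the code pages under the canonical names "x-IBM874" and "x-windows-874" (check Charset.availableCharsets() if it does not); the class name is just for illustration:

import java.nio.charset.Charset;

public class ThaiCharsetDiff {
    public static void main(String[] args) {
        // Assumed canonical names; they may differ between JDK builds.
        Charset ibm874 = Charset.forName("x-IBM874");
        Charset ms874 = Charset.forName("x-windows-874");
        System.out.println("IBM874 aliases: " + ibm874.aliases());
        System.out.println("MS874 aliases:  " + ms874.aliases());
        // Decode each single high byte with both charsets and report any mismatch.
        for (int b = 0x80; b <= 0xFF; b++) {
            String a = new String(new byte[]{(byte) b}, ibm874);
            String c = new String(new byte[]{(byte) b}, ms874);
            if (!a.equals(c)) {
                System.out.printf("0x%02X: IBM874=U+%04X  MS874=U+%04X%n",
                        b, (int) a.charAt(0), (int) c.charAt(0));
            }
        }
    }
}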

Have there been any "breaking changes" in the history of Unicode?

As Unicode versions have progressed, has there ever been a breaking change? For example, has a symbol's code point ever been re-mapped, say so that the symbol appears together with the ones it relates to (think of a character set for a language that at some point gained a new letter)?
May Unicode "change" these things at all, or is there a guarantee that these mappings are constant forever?
If there weren't any code point re-mappings, have there been other breaking changes?
Yes, there have been many breaking changes. One interesting account, related by Michael Kaplan (the late Microsoft internationalization expert) in an archived version of "Every character has a story #5", quotes Ken Whistler:
Hundreds -- maybe thousands -- of Unicode 1.0 character names were changed in 1993 for Unicode 1.1 as part of the merger between the repertoires of Unicode and ISO/IEC 10646-1:1993. (The Great Compromise) The gory details of all the changes can be found in UTR #4, The Unicode Standard, Version 1.1. It was after that point (which was very painful for some people) that we put in place the never change a character name rule.
In another post ("Stability of the Unicode Character Database", archived), Kaplan quotes a discussion of changes to character categories, with this quote also by Ken Whistler:
The significant point of instability in General Category assignments was in establishing Unicode 2.0 data files (now more than 8 years in the past).
There was a significant hiccup for Unicode 3.0, at the point when it became clear that normalization stability was going to be a major issue, and when the data was culled for consistency under canonical and compatibility equivalence.
Since that time, the UTC has been very conservative, indeed, in approving any General Category change for an existing character. The types of changes have been limited to:
Clarification regarding obscure characters for which insufficient information was available earlier.
Establishment of further data consistency constraints (this impacted some numeric categories, and also explains the change for the Katakana middle dot)
Implementation issues with a few format characters (ZWSP, Arabic end of ayah, Mongolian free variation selectors)
There were many changes early in Unicode, and fewer "breaking changes" as time went on. Unicode has an official Stability Policy, describing which changes are no longer allowed and the "applicable version" at which each policy was instituted (and thus when that kind of change stopped). I have to expect that each of those policies tells a story of changes being made, causing trouble for people who were relying on the prior behavior and who then needed to update existing data in some way; at the very least, people knew that more pain would come later if they didn't fix those particular aspects of Unicode at the time. It makes sense that as Unicode got adopted, problems with particular aspects were found and fixed, and now that Unicode is ubiquitous there is less need to make breaking changes and more need to keep compatibility with the existing data that's out there.
To answer your specific question about code point placement, let me quote the Encoding Stability Policy:
Encoding Stability
Applicable Version: Unicode 2.0+
Once a character is encoded, it will not be moved or removed.
This policy ensures that implementers can always depend on each version of the Unicode Standard being a superset of the previous version. The Unicode Standard may deprecate the character (that is, formally discourage its use), but it will not reallocate, remove, or reassign the character.
Note: Ordering of characters is handled via collation, not by moving characters to different code points. For more information, see Unicode Technical Standard #10, Unicode Collation Algorithm, and the Unicode FAQ.
In general, I expect that you can rely on the Unicode Consortium to keep the promises that they've now made in their Stability Policy, though you may need to be aware of the changes made before each policy existed if you have data that predates the adoption of that applicable version of Unicode by the software that created it. And data not explicitly called out as now being "stable" can of course be changed in any future version.

java.io.UnsupportedEncodingException: cp932?

What type of content would cause this exception?
Caused by: java.io.UnsupportedEncodingException: cp932
at sun.nio.cs.StreamDecoder.forInputStreamReader(StreamDecoder.java:71)
at java.io.InputStreamReader.<init>(InputStreamReader.java:100)
at com.google.code.com.sun.mail.handlers.text_plain.getContent(text_plain.java:109)
at com.google.code.javax.activation.DataSourceDataContentHandler.getContent(DataHandler.java:803)
at com.google.code.javax.activation.DataHandler.getContent(DataHandler.java:550)
at com.google.code.javax.mail.internet.MimeBodyPart.getContent(MimeBodyPart.java:639)
And why can't OpenJDK handle this encoding?
Any text or text-based content that uses that character set / encoding!
According to Wikipedia, CP932 is an extension of Shift JIS ... which is one of the character sets that is used to represent Japanese text.
According to this page, CP932 is in the "Extended Encoding Set (contained in lib/charsets.jar)". If it is not in your install of OpenJDK, look for a yum / apt / whatever OpenJDK package that offers extra Java character set support. Support for CP932 in OpenJDK is definitely available somewhere ...
It is also possible (though IMO unlikely) that OpenJDK doesn't recognize "cp932" as an alias for what it refers to as "MS932" and "windows-31j".
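If you want to see what your particular JRE actually provides, a quick probe (just a sketch, nothing more) is to list every installed charset related to code page 932:

import java.nio.charset.Charset;

public class InstalledCharsets {
    public static void main(String[] args) {
        // List every installed charset whose canonical name or aliases mention 932,
        // to see whether this JRE ships the extended (Japanese) charsets at all.
        Charset.availableCharsets().forEach((name, cs) -> {
            if (name.contains("932") || cs.aliases().stream().anyMatch(a -> a.contains("932"))) {
                System.out.println(name + " -> aliases " + cs.aliases());
            }
        });
    }
}

If windows-31j (or one of its MS932 aliases) does not show up, the extended charset providers are probably missing from your install.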
I checked the code.
The issue is that Java (not just OpenJDK!) does not recognize the "cp932" alias at all. The reason it doesn't recognize it is that the alias is non-standard.
The official (IANA endorsed) name for this encoding is "windows-31j", and Java also supports the following aliases by default:
"MS932"
"windows-932"
"csWindows31J"
If you set the "sun.nio.cs.map" system property (i.e. using "-D...") to "Windows-31J/Shift_JIS", then Java will also recognize "shift-jis", "ms_kanji", "x-sjis", and "csShiftJIS" as being equivalent ... but this should only be used for backwards compatibility with old (1.4.0 and earlier) JDKs that didn't implement the real SHIFT-JIS encoding correctly. (Besides, this doesn't solve your problem ...)
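To confirm which spellings your own JRE accepts (the exact alias set can vary between JDK builds, so treat this as a quick probe rather than a definitive list):

import java.nio.charset.Charset;

public class Windows31jAliases {
    public static void main(String[] args) {
        // Canonical name and registered aliases of the charset itself.
        Charset w31j = Charset.forName("windows-31j");
        System.out.println(w31j.name() + " aliases: " + w31j.aliases());
        // Try the spellings mentioned above, plus the one from the failing mail.
        for (String name : new String[]{"windows-31j", "MS932", "windows-932", "csWindows31J", "cp932"}) {
            System.out.println(name + " -> " + Charset.isSupported(name));
        }
    }
}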
So what can you do?
Reject / discard the content as invalid. (And it is.)
Find out where this content is coming from, and get them to fix the incorrect encoding name.
Intercept the encoding name in the Google code before it tries to use it, and replace the non-standard name with an appropriate standard one (see the sketch after this list).
Use nasty reflective hackery to add an encoding alias to the private data structure that the Oracle code uses to look up encodings. (Warning: this may make your application fragile, and lead to portability problems.)
Raise an RFE against Java SE requesting an easy way to add aliases for character encodings. (This is a really long-term solution, though you may be able to accelerate it by writing the proposed enhancement and submitting it to the OpenJDK team as a patch.)
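For the interception approach, here is a minimal sketch of the idea. The helper class and its alias table are hypothetical, not part of any real API; you would call something like this wherever the Google code takes the charset name from the MIME headers:

import java.nio.charset.Charset;
import java.util.HashMap;
import java.util.Map;

public class CharsetFixup {
    // Hypothetical table mapping non-standard names seen in the wild to names Java knows.
    private static final Map<String, String> ALIASES = new HashMap<>();
    static {
        ALIASES.put("cp932", "windows-31j");
    }

    // Resolve a possibly non-standard charset name to a Charset that Java supports.
    public static Charset resolve(String name) {
        String canonical = ALIASES.getOrDefault(name.toLowerCase(), name);
        return Charset.forName(canonical);
    }
}

You would then construct the InputStreamReader with CharsetFixup.resolve(headerCharset) instead of the raw header value.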

My dollar signs are now little boxes

This week we upgraded to JasperReports Server 4.7 (Professional) and iReport 4.7. I have several reports that I created in iReport 4.5.1 and successfully used in JasperReports Server 4.5.1.
After the upgrade, all of my dollar signs are now little boxes. The pattern for my currency fields is ¤ #,##0.00. JasperReports Server is not replacing the box with a dollar sign when the report is generated. Everything looks ok in the pattern sample. My percentage symbols are all still working. I tried removing and applying the currency pattern to the fields again, but this didn't fix the problem.
Any thoughts on how I can fix this?
This is Java operating as intended... but not as you want it to operate. The "¤" in your pattern is the generic currency placeholder; it is normally replaced by the locale's currency symbol, but your locale does not specify a currency, so you get that "¤" symbol instead of "$".
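A minimal sketch of the behavior, standalone and outside JasperReports (the class name is just for illustration):

import java.text.DecimalFormat;
import java.text.DecimalFormatSymbols;
import java.util.Locale;

public class CurrencyPatternDemo {
    public static void main(String[] args) {
        String pattern = "\u00A4 #,##0.00";  // the same ¤ placeholder used in the report
        for (Locale locale : new Locale[]{new Locale("en"), Locale.US}) {
            DecimalFormat df = new DecimalFormat(pattern, new DecimalFormatSymbols(locale));
            System.out.println(locale + " -> " + df.format(50));
        }
        // A hard-coded symbol, as suggested below, sidesteps the locale lookup entirely.
        System.out.println(new DecimalFormat("$ #,##0.00").format(50));
    }
}

With a bare "en" locale the placeholder should stay as "¤", while "en_US" yields "$".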
You could work around it by changing your locale from "en" to "en_US". I just did this last week. As a side note, there was one extra tweak I needed to make: after changing the locale to en_US, I had to copy one file like this:
cp .../jasperserver-pro/scripts/jquery/js/jquery.ui.datepicker-en.js .../jasperserver-pro/scripts/jquery/js/jquery.ui.datepicker-en-US.js
Alternatively, I usually find it's better to work around it by setting your format mask to use a hard-coded dollar sign. If you are displaying "$50.00" to a user in the United States, it would be nonsensical to display "€50,00" to a European user or "¥50.00" to a Japanese user for the same value. There are lots of times when the hard-coded currency symbol is more appropriate.

Plone 4.0.5 and Unicode confusion

First off, I'm using FreeBSD 8.1 and Plone 4.0.5, and I'm testing both Data.fs and RelStorage 1.5.0b2 (PostgreSQL 9.0.3). I'm from Denmark and we use Danish letters ("æøå").
I'm confused about encoding, but my initial guess is that the best way to go is with Unicode (UTF-8). What is the correct way to configure FreeBSD, Plone (and products) and PostgreSQL to handle Danish letters? I've already been told that the encoding does not matter for PostgreSQL.
I've been seeing comments about site.py and sitecustomize.py when googling for errors - please comment.
Thanks.
Nikolaj G.
Plone and all its add-ons support Unicode by default; you don't need to configure the encoding at any level.
Even when using RelStorage, we only store binary data inside the SQL database and no strings, so there's no de/encoding taking place at this level.
Changing the Python default encoding in site.py or sitecustomize.py is actually harmful and you should not do this. It will only mask actual programming errors inside the code base and can lead to inconsistent data.
Inside the codebase we use a mixture of both Unicode and UTF-8-encoded strings, so in general your code will have to be written to handle both of these. This is unfortunate, but it is a side effect of us slowly migrating to proper Unicode at all levels.