lift can't handle diacritics - scala

I'm developing a web app with Lift Framework, GlassfishV3 and there is a problem with diacritics in my app. I do just value binding to model and when I log the value from input text field, the diacritics letters are already broken. Where could possibly be the problem?
bind("entry",content,
"place" -> SHtml.text(lib.place, lib.place=_),
"submit" -> SHtml.submit("Kaboom", () => {
Logger.getAnonymousLogger.severe(lib.place)
Service.library.save(lib)})
)
It's probably a general java problem, not limited to Lift.
I enter š and I see Å¡ as the output from logger.

Would it make you feel better or worse to know that the source of your entire problem is located between the keyboard and the back of your chair?
Here's what happened:
You wanted to print out š, a lower-case s-with-caron, which is represented in Unicode by the number 0x161. You printed it out to a file, and your I/O system dutifully (and correctly) encoded it in UTF-8 as 0xC5, 0xA1. Then you asked view to that file without explaining to your viewing program that it was a UTF-8 file. Your viewing program, whatever it was, interpreted the file as ISO 8859-1, a very common, if somewhat elderly, format. The 0xC5 was displayed as Å, A-with-a-ring, and the 0xA1 as ¡, an inverted exclamation mark.
To summarize, there's nothing wrong with the output, there's just something wrong with the way you are looking at it. Bring the log up in an editor and set the encoding to UTF-8 or bring it up in a web browser and select View / Character Encoding / UTF-8 .

My guess would be that the clue to this might be the browser. What encoding does the browser assume for the page? Do you have an encoding meta-tag in the head; like this:
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />

This could well be a problem in the logger or your console. Try logging to a file and opening that in an editor you know can handle UTF8

so one option to get UTF-8 encoding is to specify encoding in sun-web.xml file like this:
<sun-web-app error-url="">
<parameter-encoding default-charset="UTF-8"/>
</sun-web-app>
the other option is to set encoding in lift bootstrapping class:
def boot {
LiftRules.early.append(makeUtf8)
}
private def makeUtf8(req: HTTPRequest) {
req.setCharacterEncoding("UTF-8")
}

Related

Page won't show special characters

So I know this is a common problem with stuff like the charset, but the weird thing is that this works on a page with the same set-up/template, but not on this one!
So basically, my problem is that the page won't show Norwegian characters like å and ø.
Here's the page with the problem: http://suldal.underbakke.net/register.php
and here's one with the same template but working: http://suldal.underbakke.net/
(On the second one, it's a "å" in 4th post, in the name)
The page is declared as being in UTF-8 encoding, but it is in fact windows-1252 (or iso-8859-1) encoded. You can see this by manually selecting the encoding while viewing the page in a browser; browsers typically have a View menu where you can select the encoding.
Thus, as a quick fix, you could just change utf-8 to windows-1252 in the meta tag.
As a different workaround, you could replace the “special characters” (Scandinavian letters) by HTML entities, e.g. “ø” by ø. Depending on the authoring software, you might need to do something special to achieve this (e.g., enter “HTML mode”), because an authoring tool might automatically convert “&” to &.
As the best solution, find out how to save a file in UTF-8 encoding in the authoring program you are using, and keep the meta tag as is. This is typically either an option in the general settings of the program or a choice you can make in a “Save As” command.

GWT: Character encoding umlauts

I want to set a text in a label:
labelDemnaechst.setText(" Demnächst fällig:");
On the output in the application the characters "ä" are displayed wrong.
How can I display them well?
GWT assumes all source files are encoded in UTF-8 (and this is why you see löschen if you open them with an editor that interprets them as Windows-1252 or ISO-8859-1).
GWT also generates UTF-8-encoded files, so your web server should serve them as UTF-8 (or don't specify an encoding at all, to let browsers "sniff" it) and your HTML host page should be served in UTF-8 too (best is to put a <meta charset=utf-8> at the beginning of your <head>).
If a \u00E4 in Java is displayed correctly in the browser, then you're probably in the first case, where your source files are not encoded in UTF-8.
See http://code.google.com/webtoolkit/doc/latest/FAQ_Troubleshooting.html#International_characters_don't_display_correctly
well you have to encode your special charactars to Unicode. You can finde a list of the representive Unicode characters here.
Your examle would look like this:
labelDemnaechst.setText("Demn\u00E4lachst f\u00E4llig:");
Hope this helps, if noone has a better solution.
Appendix:
Thanks Thomas for your tipp, you really have to change the format in which eclipse safes it's source files. Per default it uses something like Cp1252. If you change it to UTF-8, your example works correct. (So Demnächst is written correctly).
You can edit the safing format, if you right-click on your file --> Preferences.
To get UTF-8 encoding for your entire workspace, go to Window -> Preferences. In the pop-up start typing encoding. Now you should have Content Types, Workspace, CSS Files, HTML Files, XML Files as result. In content Types you can type UTF-8 in the Default encoding text box, for the other elements you can just select the encoding in their respective listboxes.
Then check the encoding for your project in Project -> Properties -> Resource.
Detailed instruction with pictures can be found here:
http://stijndewitt.wordpress.com/2010/05/05/unicode-utf-8-in-eclipse-java/
Cheers
what i did:
open the file with notepad (Windows Explorer),
and save it with the option UFT-8 instead of proposed ANSI.
Encoding the project to UTF-8 didn't work (for me)
Cheerio
Use iso-8859-1 (western europe) character set instead of UTF-8.

How do I specify an encoding for TextCells in CellList?

I use a CellList like this
CellList<String> cellList = new CellList<String>(new TextCell());
and then give it an ArrayList<String>.
If a String contains an "ü" I get a question mark in the browser (FF4, GWT Dev Plugin). If I use ü I get ü
Where can I specify the encoding, so that "ü" works? (I'm not sure if it makes a difference, but the "ü" is currently hardcoded in the .java file and not read from somewhere else).
The GWT compiler assumes, that your Java files are encoded in UTF-8. Make sure, that your editor is set to save in that encoding.
You should also make sure to set the encoding of the HTML page to a unicode capable encoding like UTF-8 (this allows you to use even more exotic characters that you won't find in other charsets):
<head>
<meta http-equiv="content-type" content="text/html; charset=UTF-8">
...
Moreover, if you later want to retrieve the strings from a database, make sure, that it is also set up to handle Unicode, and that your JDBC driver connects in Unicode mode (required for some databases).

Java EE Web Project and Character Encoding

we built a java ee web project and use jdbc for storing our data.
The problem is that German 'Umlaute' like äöü are in use and properly stored in the mysql database. We don't know why, but in the browser those characters are broken, displaying weird stuff like
ö�
instead.
I've already tried setting the encoding of the jdbc connection like described in this question:
JDBC character encoding
And the encoding of the html page is correctly set:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
Any ideas how to fix that?
Update
connection.prepareStatement("SET CHARACTER SET utf8").execute();
won't make umlauts work.
changing the meta-tag to
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
won't change anything, too
"We don't know why, but in the browser those characters are broken"
Well, that's the first thing to find out. You should trace your data at every stage:
As you fetch it out of the database (with logging)
When you inject it into the page (with logging)
On the wire (via Wireshark)
When you log, don't just log the strings: log the Unicode characters that make up the strings, as integers. Just cast each character in the string to an integer and log it. It's primitive, but it'll tell you what you need to know.
When you look on the wire, of course, you'll be seeing bytes rather than characters as such. You should work out what bytes you expect for your chosen encoding, and check those against what's actually coming across the network.
You've specified the encoding in the HTML - but have you told whatever's generating your page that you want it in ISO Latin 1? That's likely to be responsible for both setting the content-type header and performing the actual conversion from text to bytes.
Additionally, is there any reason why you're using ISO Latin 1 instead of UTF-8? Why would you deliberately restrict yourself like that? (ISO Latin 1 can only handle the first 256 characters of Unicode, instead of the full range of Unicode characters. UTF-8 can handle everything, and is just as efficient for ASCII.)

How do you troubleshoot character encoding problems?

If all you see is the ugly no-char boxes, what tools or strategies do you use to figure out what went wrong?
(The specific scenario I'm facing is no-char boxes within a <select> when it should be showing Japanese chars.)
Firstly, "ugly no-char boxes" might not be an encoding problem, they might just be a sign you don't have a font installed that can display the glyphs in the page.
Most character encoding problems happen when strings are being passed from one system to another. For webapps, this is usually between the browser and the application, between the application and the filesystem and between the application and the database.
So you need to check where the mis-encoded data is coming from, what character encoding it has at the source, and what encoding it is being received as. The best way is to send through characters you know the system is having problems with, and examine them at each level of the app. What do they look like inside the app? In the database? When you get them back from the database? When they're displayed in the browser?
Sorry to be so general, but the question doesn't give much more to work with.
If the data you send to the browser becomes mangled (moji-bake) you will get trash characters. Also, if you specify the wrong character set in your META headers, your browser will render the page incorrectly, causing moji-bake again, sometimes in random places on the page.
When handling CJK character sets, you must be sure to use UTF8 character encoding throughout the lifetime of your program (data storage, retrieval, data manipulation in your code, displaying in the browsser etc...)
What is UTF8?
UTF8 handles binary streams of data, not strings. This means the bit combinations can have variable length. ASCII characters have a fixed length of 8 bits representing 1 byte, however UTF8 characters can be composed of 6bits, 8bits, 12bits, etc... As such, UTF8 is prone to what Japanese call "mojibake".
As a coder, from database to codebase to browser, you should try and use UTF8 completely. For email you can use UTF8, but you will probably find most mail servers and clients are still old and use a mishmash of different character sets (e.g. ISO9022X).
Database Settings
If you are a mysql user, then make sure you have to ensure all connections to the DB use UTF8, and that all tables/fields use UTF8. By default mysql uses Latin (Swedish) character sets. Those kooky swedes love their sense of humour!!
Checking your Codebase
In my experience editors like Notepad++, Notepad2, UltraEdit, e, etc... all have UTF8 support problems. They mostly work, but since their developers don't use CJK languages themselves, they are not perfected. Issues like turning off BOM (Byte Order Mark), mangled tabs, poor character set conversion, etc ... all present problems.
I highly recommend using a proven UTF8 editor like Maruo. This is made by a Japanese company, but there is an English version (and a trial version) at http://www.hidemaru.interlink.or.jp/software/
Lastly, you may need to convert your source files into UTF8. Especially if the codebase itself has CJK language strings contained therein.
Manipulating Strings
Any string function need to multibyte safe. Notice I didn't say double-byte. UTF8 is not a double byte but multibyte, depending on the total number of bits used to represent a character. In PHP you need to call the MB string functions specifically. Ruby and other languages have more transparent support, but you need to check the docs for your flavour of application server!
META Tags
Check out google.co.jp or yahoo.co.jp for their META headers. These are sites that know how to to it properly. Basically include the following META tag the doucment <HEAD>
<meta http-equiv="content-type" content="text/html; charset=utf-8">
It is usually safe to mix English HTML document type attributes with the above character too. So adding the META tag above seems to work in a HTML document that has:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
Email
This is a wholly different can of worms. UTF8 works a lot, but many older Japanese clients use ISO2022X more. This is not worth covering here.
Debugging UTF8 Issues
Once you have a reliable UTF8 editor like Maruo, you can create static pages and resolve your issues.
Hope that helps
Redirect the data to disk and use a Hex Editor. Most text editors / viewers do their own conversions behind the scenes, so it is difficult to be sure you are seeing the data in it's true form.