Character encoding mystery (working flawlessly for me, but unfortunately not for all)

I have a content manager website where character encoding works wonderfully for almost everyone, but not for a few "lucky" users, which drives me crazy.
I have a (search) keyword database table (a log) which sometimes shows very strange keywords (the keyword was originally in Hungarian, but something happened to it).
For example the database shows:
gyerekeknek papã rbã³l
instead of
gyerekeknek papírból
I can't replicate the culprit word/encoding or URL, because it works flawlessly for me with every accent and even with different languages (Cyrillic letters, Chinese, special characters, etc.).
A typical search URL looks like this:
https://example.com/keyword=árvíztűrő+tükörfúrógép
OR: https://example.com/keyword=%c3%a1rv%c3%adzt%c5%b1r%c5%91+t%c3%bck%c3%b6rf%c3%bar%c3%b3g%c3%a9p
And it works absolutely fine (the database row for this is "árvíztűrő tükörfúrógép", as expected).
The strange thing: the $_GET['keyword'] parameter already seems to be in the wrong encoding ("gyerekeknek papã rbã³l") when it arrives, before any jQuery/PHP validation or processing.
My website and database are UTF-8 encoded everywhere. My website has:
AddDefaultCharset utf-8 (in .htaccess)
All files in UTF8 without BOM
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" /> on every single page
mysqli_query($con, "SET NAMES 'utf8'");
mysqli_query($con, "SET CHARACTER SET 'utf8'");
mysqli_query($con, "SET COLLATION_CONNECTION = 'utf8_general_ci'");
on every database connection.
Any idea or help is greatly appreciated,
Thank you!
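For reference, the corruption pattern in the log matches classic mojibake: UTF-8 bytes decoded as a single-byte charset, apparently lowercased afterwards by the keyword logger. A minimal Python sketch, purely illustrative, reproduces the reported string:

from urllib.parse import unquote

word = "papírból"

# UTF-8 bytes misread as Latin-1 produce the "Ã"-style garbage:
mojibake = word.encode("utf-8").decode("latin-1")
print(repr(mojibake))          # 'papÃ\xadrbÃ³l' -- the \xad is an invisible soft hyphen
print(repr(mojibake.lower()))  # 'papã\xadrbã³l' -- lowercased, this is the logged "papã rbã³l"

# The same percent-encoded query decodes correctly as UTF-8,
# but turns to mojibake if something assumes a single-byte charset:
q = "%c3%a1rv%c3%adzt%c5%b1r%c5%91"
print(unquote(q, encoding="utf-8"))   # árvíztűrő
print(unquote(q, encoding="cp1252"))  # Ã¡rv... (mojibake), not árvíztűrő

If that matches, the affected users' browsers (or some intermediate script) are submitting or interpreting the query string under a single-byte encoding rather than UTF-8.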

Related

Laravel issue with: language character encoding

Hello!
I don't understand why the non-ASCII characters of my different languages, like "ç, ñ, я", are not being displayed.
The text in question is hardcoded; it is not served from a DB.
I have seen identical questions here
Charset=utf8 not working in my PHP page
I have seen that I should write this:
header('Content-type: text/html; charset=utf-8');
But where does that go? I can't just write it into the page itself; the browser mirrors the words and displays them as plain text, with no parsing.
My encoding for the frontpage says this:
<head>
<meta charset="utf-8">
</head>
which is supposed to be Unicode.
I tried to test my page in validator.w3.org and it said:
Sorry, I am unable to validate this document because on line 60 it contained one or more bytes that I cannot interpret as utf-8 (in other words, the bytes found are not valid values in the specified Character Encoding). Please check both the content of the file and the character encoding indication.
Line 60 actually has the word Español (Spanish) with that weird n.
Any hint?
thank you
best regards
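That validator message is the key clue: the file was almost certainly saved in a single-byte encoding such as ISO-8859-1 while the page declares UTF-8. In Latin-1, ñ is the single byte 0xF1, which is not a valid UTF-8 sequence. A quick Python check (illustrative only) shows exactly the failure the validator reports:

# "Español" as saved by a Latin-1 editor:
latin1_bytes = "Español".encode("latin-1")
print(latin1_bytes)  # b'Espa\xf1ol'

# Declaring UTF-8 makes the browser/validator attempt this, which fails:
try:
    latin1_bytes.decode("utf-8")
except UnicodeDecodeError as err:
    print(err)  # can't decode byte 0xf1 in position 4: invalid continuation byte

# Re-saving the file as UTF-8 stores ñ as two bytes instead:
print("Español".encode("utf-8"))  # b'Espa\xc3\xb1ol'

So the fix is to re-save the source file as UTF-8. (And on the header() question: in PHP it must go at the top of the script, before any output; pasted into the HTML body it is just text, which is why the browser mirrored it.)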

Decoding Korean text files from the 90s

I have a collection of .html files created in the mid-90s which include a significant amount of Korean text. The HTML lacks character set metadata, so of course none of the Korean text renders properly now. The following examples all use the same excerpt of text.
In text editors such as Coda and TextWrangler the text displays as
╙╦ ╝№бя└К ▓щ╥НВь╕цль▒Ф ▓щ╥НВь╕цль▒Ф
Which, in the absence of character set metadata in <head>, is rendered by the browser as:
ÓË ¼ü¡ïÀŠ ²éÒ‚ì¸æ«ì±” ²éÒ‚ì¸æ«ì±”
Adding EUC-KR metadata to <head>
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
Yields the following, which is illegible nonsense (verified by a native speaker):
沓 숩∽핅 꿴�귥멩レ콛 꿴�귥멩レ콛
I have tried this approach with all of the historic Korean character sets, each yielding similarly unsuccessful results. I also tried parsing and upgrading to UTF-8 via Beautiful Soup, which also failed.
Viewing the files in Emacs seems promising, as it reveals the text encoding at a lower level. The following is the same sample of text:
\323\313 \274\374\241\357\300\212
\262\351\322\215\202\354\270\346\253\354\261\224 \262\351\322\215\202\354\270\346\253\354\261\224
How can I identify this text encoding and promote it to UTF-8?
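As a starting point, those Emacs octal escapes can be rebuilt as raw bytes for experimentation; a small Python sketch, with the wrapped sample rejoined:

# The octal values Emacs showed, rebuilt as raw bytes:
raw = bytes([0o323, 0o313, 0o040,                        # \323\313 + space
             0o274, 0o374, 0o241, 0o357, 0o300, 0o212])  # \274\374\241\357\300\212
print(raw)           # b'\xd3\xcb \xbc\xfc\xa1\xef\xc0\x8a'
print(raw.hex(" "))  # d3 cb 20 bc fc a1 ef c0 8a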
All of those octal codes that Emacs revealed are less than 254 (\376 in octal), so it looks like one of those old pre-Unicode fonts that just used its own mapping in the ASCII range. If this is right, you'll just have to figure out which font it was intended for, find it, and perhaps do the conversion yourself.
It's a pain. Many years ago I did something similar for some popular pre-Unicode Greek fonts: http://litot.es/unicode-converter/ (the code: https://github.com/seanredmond/Encoding-Converter)
In the end, it is about finding the correct character encoding and using iconv.
iconv --list
displays all available encodings. Grepping for "KR" reveals that at least my system can do CSEUCKR, CSISO2022KR, EUC-KR, ISO-2022-KR and ISO646-KR. Wikipedia also lists CSKSC5636 and KSC5636 for Korean. Try them all until something reasonable pops out.
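That brute-force loop is easy to script; a Python sketch over the bytes recovered above, using the codec names Python knows (they differ slightly from iconv's):

raw = bytes([0o323, 0o313, 0o040, 0o274, 0o374,
             0o241, 0o357, 0o300, 0o212])

# Candidate legacy Korean codecs available in Python's standard library:
for codec in ("euc_kr", "cp949", "johab", "iso2022_kr"):
    try:
        print(f"{codec:12} -> {raw.decode(codec)}")
    except UnicodeDecodeError as err:
        print(f"{codec:12} -> failed: {err.reason}")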
Even though this thread is old, it's still an issue, and not having found a way to convert the files in bulk (outside of using a Korean version of Windows 7), I'm now using Naver, which has a cloud service like Google Docs: if you upload those weirdly encoded files there, it deals with them very well. I just edit and copy the text, and it's back to being standard when I paste it elsewhere.
Not the kind of solution I like, but it might save a few passers-by.
By the way, you can register for the cloud account with an ID even if you do not live in South Korea; there's some minimal English to get by.

Java EE Web Project and Character Encoding

We built a Java EE web project and use JDBC for storing our data.
The problem is that German umlauts like äöü are in use and are properly stored in the MySQL database. We don't know why, but in the browser those characters are broken, displaying weird stuff like
ö�
instead.
I've already tried setting the encoding of the jdbc connection like described in this question:
JDBC character encoding
And the encoding of the html page is correctly set:
<meta http-equiv="Content-Type" content="text/html; charset=iso-8859-1" />
Any ideas how to fix that?
Update
connection.prepareStatement("SET CHARACTER SET utf8").execute();
won't make umlauts work.
changing the meta-tag to
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
won't change anything either.
"We don't know why, but in the browser those characters are broken"
Well, that's the first thing to find out. You should trace your data at every stage:
As you fetch it out of the database (with logging)
When you inject it into the page (with logging)
On the wire (via Wireshark)
When you log, don't just log the strings: log the Unicode characters that make up the strings, as integers. Just cast each character in the string to an integer and log it. It's primitive, but it'll tell you what you need to know.
When you look on the wire, of course, you'll be seeing bytes rather than characters as such. You should work out what bytes you expect for your chosen encoding, and check those against what's actually coming across the network.
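For the "work out what bytes you expect" step: ö is U+00F6, which is the single byte 0xF6 in ISO-8859-1 but the pair 0xC3 0xB6 in UTF-8, and UTF-8 bytes re-decoded as Latin-1 are exactly what produce the "ö"-style garbage. A short Python illustration (the same cast-to-int trick works on Java chars):

s = "ö"  # U+00F6
print([hex(ord(c)) for c in s])      # ['0xf6'] <- the code points, logged as integers
print(s.encode("iso-8859-1").hex())  # f6       <- expected on the wire for Latin-1
print(s.encode("utf-8").hex())       # c3b6     <- expected on the wire for UTF-8

# UTF-8 bytes misread as Latin-1 give the broken browser output:
print(s.encode("utf-8").decode("iso-8859-1"))  # Ã¶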
You've specified the encoding in the HTML - but have you told whatever's generating your page that you want it in ISO Latin 1? That's likely to be responsible for both setting the content-type header and performing the actual conversion from text to bytes.
Additionally, is there any reason why you're using ISO Latin 1 instead of UTF-8? Why would you deliberately restrict yourself like that? (ISO Latin 1 can only handle the first 256 characters of Unicode, instead of the full range of Unicode characters. UTF-8 can handle everything, and is just as efficient for ASCII.)

Lift can't handle diacritics

I'm developing a web app with the Lift Framework and GlassFish v3, and there is a problem with diacritics in my app. I do just value binding to the model, and when I log the value from the input text field, the letters with diacritics are already broken. Where could the problem possibly be?
bind("entry", content,
  "place" -> SHtml.text(lib.place, lib.place = _),
  "submit" -> SHtml.submit("Kaboom", () => {
    Logger.getAnonymousLogger.severe(lib.place)
    Service.library.save(lib)
  })
)
It's probably a general java problem, not limited to Lift.
I enter š and I see Å¡ as the output from the logger.
Would it make you feel better or worse to know that the source of your entire problem is located between the keyboard and the back of your chair?
Here's what happened:
You wanted to print out š, a lower-case s-with-caron, which is represented in Unicode by the number 0x161. You printed it out to a file, and your I/O system dutifully (and correctly) encoded it in UTF-8 as 0xC5, 0xA1. Then you viewed that file without explaining to your viewing program that it was a UTF-8 file. Your viewing program, whatever it was, interpreted the file as ISO 8859-1, a very common, if somewhat elderly, format. The 0xC5 was displayed as Å, A-with-a-ring, and the 0xA1 as ¡, an inverted exclamation mark.
To summarize, there's nothing wrong with the output; there's just something wrong with the way you are looking at it. Bring the log up in an editor and set the encoding to UTF-8, or bring it up in a web browser and select View / Character Encoding / UTF-8.
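The round trip described above is easy to verify; one Python line per step (illustrative):

print(hex(ord("š")))                             # 0x161
print("š".encode("utf-8").hex())                 # c5a1
print("š".encode("utf-8").decode("iso-8859-1"))  # Å¡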
My guess would be that the clue to this might be the browser. What encoding does the browser assume for the page? Do you have an encoding meta tag in the head, like this:
<meta http-equiv="content-type" content="text/html; charset=UTF-8" />
This could well be a problem in the logger or your console. Try logging to a file and opening that in an editor you know can handle UTF-8.
So one option to get UTF-8 encoding is to specify the encoding in the sun-web.xml file, like this:
<sun-web-app error-url="">
  <parameter-encoding default-charset="UTF-8"/>
</sun-web-app>
The other option is to set the encoding in the Lift bootstrapping class:
def boot {
  LiftRules.early.append(makeUtf8)
}

private def makeUtf8(req: HTTPRequest) {
  req.setCharacterEncoding("UTF-8")
}

How do you troubleshoot character encoding problems?

If all you see is the ugly no-char boxes, what tools or strategies do you use to figure out what went wrong?
(The specific scenario I'm facing is no-char boxes within a <select> when it should be showing Japanese chars.)
Firstly, "ugly no-char boxes" might not be an encoding problem; they might just be a sign that you don't have a font installed that can display the glyphs in the page.
Most character encoding problems happen when strings are being passed from one system to another. For webapps, this is usually between the browser and the application, between the application and the filesystem and between the application and the database.
So you need to check where the mis-encoded data is coming from, what character encoding it has at the source, and what encoding it is being received as. The best way is to send through characters you know the system is having problems with, and examine them at each level of the app. What do they look like inside the app? In the database? When you get them back from the database? When they're displayed in the browser?
Sorry to be so general, but the question doesn't give much more to work with.
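One way to make those boundary checks concrete is a tiny probe that dumps code points at each stage; a hypothetical Python sketch (the stage labels are made up):

def probe(stage, s):
    # Log the code points behind a string so mojibake becomes visible.
    print(f"{stage:14} {s!r} -> {[hex(ord(c)) for c in s]}")

probe("from browser", "日本語")  # healthy: ['0x65e5', '0x672c', '0x8a9e']
probe("from db", "日本語".encode("utf-8").decode("latin-1"))  # a mangled round trip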
If the data you send to the browser gets mangled (mojibake), you will see garbage characters. Also, if you specify the wrong character set in your meta tags, the browser will render the page incorrectly, causing mojibake again, sometimes in random places on the page.
When handling CJK character sets, you must be sure to use the UTF-8 character encoding throughout the lifetime of your program (data storage, retrieval, manipulation in your code, display in the browser, etc.).
What is UTF-8?
UTF-8 is a variable-length encoding: every character is stored as one to four 8-bit bytes. ASCII characters take exactly one byte, while other characters, including all CJK characters, take two, three or four bytes. When those multi-byte sequences are truncated or read under the wrong character set, you get what the Japanese call "mojibake".
As a coder, from database to codebase to browser, you should try to use UTF-8 throughout. For email you can use UTF-8, but you will probably find most mail servers and clients are still old and use a mishmash of different character sets (e.g. ISO-2022-JP).
Database Settings
If you are a MySQL user, then make sure all connections to the DB use UTF-8, and that all tables/fields use UTF-8. By default MySQL uses a Latin (Swedish) character set. Those kooky Swedes love their sense of humour!!
Checking your Codebase
In my experience, editors like Notepad++, Notepad2, UltraEdit, e, etc. all have UTF-8 support problems. They mostly work, but since their developers don't use CJK languages themselves, the support is not perfected. Issues like turning off the BOM (Byte Order Mark), mangled tabs, poor character set conversion, etc. all present problems.
I highly recommend using a proven UTF8 editor like Maruo. This is made by a Japanese company, but there is an English version (and a trial version) at http://www.hidemaru.interlink.or.jp/software/
Lastly, you may need to convert your source files into UTF-8, especially if the codebase itself has CJK language strings contained therein.
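Bulk conversion is scriptable once you know (or guess) the source encoding; a minimal Python sketch, assuming Shift_JIS input (the directory and glob are placeholders):

from pathlib import Path

for path in Path("src").rglob("*.php"):           # placeholder location and extension
    text = path.read_text(encoding="shift_jis")  # assumed legacy encoding
    path.write_text(text, encoding="utf-8")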
Manipulating Strings
Any string function needs to be multibyte-safe. Notice I didn't say double-byte: UTF-8 is not double-byte but multibyte, depending on the total number of bytes used to represent a character. In PHP you need to call the mb_* string functions specifically. Ruby and other languages have more transparent support, but you need to check the docs for your flavour of application server!
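The byte-versus-character distinction is easy to demonstrate. In PHP this is strlen() versus mb_strlen(); the equivalent Python illustration makes the same point:

s = "日本語"
print(len(s))                  # 3 characters
print(len(s.encode("utf-8")))  # 9 bytes -- what a non-multibyte-safe strlen() reports

# Cutting bytes mid-character corrupts text; cutting characters is safe:
print(s.encode("utf-8")[:4])   # b'\xe6\x97\xa5\xe6' -- ends in a broken sequence
print(s[:1])                   # 日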
META Tags
Check out google.co.jp or yahoo.co.jp for their META tags. These are sites that know how to do it properly. Basically, include the following META tag in the document <head>:
<meta http-equiv="content-type" content="text/html; charset=utf-8">
It is usually safe to mix English HTML document-type attributes with the above charset, too. So adding the META tag above works in an HTML document that has:
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
Email
This is a wholly different can of worms. UTF-8 often works, but many older Japanese clients still use ISO-2022-JP. This is not worth covering here.
Debugging UTF8 Issues
Once you have a reliable UTF-8 editor like Maruo, you can create static pages and resolve your issues.
Hope that helps
Redirect the data to disk and use a hex editor. Most text editors/viewers do their own conversions behind the scenes, so it is difficult to be sure you are seeing the data in its true form.
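If no hex editor is at hand, a few lines of Python give the same unvarnished view of the bytes, with no charset guessing in between:

# "dump.bin" is a stand-in for wherever you redirected the data:
with open("dump.bin", "rb") as f:
    data = f.read()

for i in range(0, len(data), 16):  # 16 bytes per row
    row = data[i:i + 16]
    print(f"{i:08x}  {row.hex(' ')}")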