Encoding issue in the web page - perl

There is an encoding issue in the web page means it showing some special characters in the browser(Cinéma). content is in ISO, web page is rendering in UTF-8. some articles are displaying properly,bcz those are in UTF encode.some of the articles are shows the encoding issue like Cinéma in Perl 5.
Can any once help me out for this encoding issue.that would be a great!
Thanks in advance.

Ensure your Content-type header, or meta document element, contains correct encoding information.
A quick and easy way to test if this is your issue is to ask the browser to render the page as if it had received a specific encoding directive. In Safari this would be View -> Text Encoding and then selecting something appropriate.
I'd hazard a guess that if you inform the browser to use utf-8 then it will render the page correctly.

The only way to solve this will be to spend some time reading up on Unicode and UTF-8 and how to handle encoding in Perl. (perldoc perluniintro, perldoc perlunitut, perldoc perlunicode, perldoc perlunifaq for example).
UTF-8 encoding is a very different concept to other encodings that programmers encounter (escaping in strings, URL encoding, HTML character entities, etc) - it's about how your code should interpret sequences of bytes as characters.
Without knowing the source of the word containing the special character (an accented 'e'), it's impossible to offer further help - is it coming from a database? in a static HMTL page? in an HTML template? a string within Perl code?

Related

Hebrew characters processed by HTML Tidy turn into gibberish

I'm using HTML Tidy Online (http://infohound.net/tidy/) to tidy up some very old and messed up HTML file which contains some Hebrew characters. Whenever the page is processed by Tidy the output turns Hebrew characters into gibberish, even after changing encoding methods in the settings. Using different settings, I do manage to get the same output with the Hebrew characters as unicode entities.
I Googled around for a possible solution but found none.
I had a couple ideas in mind, but I'm not sure exactly how to approach them, if at all (maybe someone has a better solution).
I thought maybe I could (after processing the page) scan the page for unicode entities and replace them with the corresponding Hebrew characters (in a systematic way, of course).
Maybe I could take the HTML Tidy source code and modify it to output Hebrew characters appropriately. The problem with this is that I doubt I am knowledgeable enough to even get started on something like this.
I had a similar problem. Document in UTF-8, containing unicode characters. HTML Tidy turned them into HTML entities. This in HTMLTIDY.CFG fixed it:
char-encoding: utf8
input-encoding: utf8
output-encoding: utf8
Hope it helps.
The website http://infohound.net/tidy/ that you are using has a "Char encoding" clause at the bottom right. You need to choose utf-8, but first you need to make sure that the page is encoded in UTF-8 in your test editor. In Notepad++ for example, you can go to Encoding > Convert to UTF-8 without BOM.

Should I use hex ascii accented character code in HTML or use the actual character?

I have several huge CSVs with lots of accented characters in html hex code: é for é and lots of others, even – for –, etc.
My site is a wiki for people to update listings. So when they are presented a textarea for update, the existing content is filled in, and obviously those hex codes will be shown.
Should I be bothered replacing those codes with actual accented characters, or just leave it as it is? I wrote a script to replace the characters, but somehow the output are weird characters. Probably the format saved in Ruby isn't in UTF-8 format.
By default my site is in UTF-8, and the accented characters are displayed properly with some html coding in the view.
Please advise. Thanks.
Could you clarify what the problem is?
If your data (CSV) is in UTF-8, and the default encoding of your site is UTF-8, then all you would need to do is make sure that when users are editing content, that content is properly treated as UTF-8.
You may not need to display the markup to the users. Perhaps you could leverage a WYSIWIG editor package like TinyMCE?

UTF-8 incorrectly displayed in Lua/ Corona

In Lua, for an iPad Corona project, I'm requesting a UTF-8 server text file (containing Chinese characters) using network.request, but the result when displayed in the console or in the app shows as "garbage". Google Chrome, for instance, displays the same UTF-8 page fine, as I'm setting the http header when the server sends this (using PHP) to 'Content-Type: text/plain; charset=utf-8' (and there's no BOM, byte order mark either). The "garbage" I'm seeing in Lua looks similar to when I "force" Chrome to render the page as ISO-8859-1 using the options menu.
Does anyone have any help or pointers?
If all else fails, how would I convert the "garbage" string back to its UTF-8 origins within Lua?
Thanks for any help!
Lua doesn't know anything about UTF-8; Lua strings are just sequences of bytes. It sounds like Corona itself is parsing the strings as ISO8859-1. The most likely cause for this is them doing something really stupid and naive like treating each byte of the string as a Unicode code point.
I'm afraid I don't know Corona, so can't provide any specific solutions, but I'd suggest looking to see what functions it's got that involve encodings --- there may be a specific function to render a string with a particular encoding, for example.
Can you show the code for your network.request() call?
If you're downloading a html page, you should use network.download().
I had this exact same problem, except with Japanese characters. Although Lua doesn't support UTF-8, Corona acts like it does. What that means is that... if you pass a UTF-8 String to display.newText(...), it should display properly. Now, if you output to the console, it will actually print out the raw bytes of the String. And, if you try to print the length of the string, it will actually print out the number of bytes.
So, in summary, Lua treats all strings as an array of bytes. It knows nothing about UTF-8. Some Corona API methods, when passed UTF-8 strings, will display the strings correctly.
I had issues when I mixed UTF-8 with plain ASCII characters, which I believe confused Corona (what I mean is that I mixed English characters with Japanese characters... still all UTF-8, though). I have a hunch that each character in the string must be of the same length in bytes for Corona to display it properly. Try printing out one character at a time to see if that helps. Please feel free to post comments here if you run into trouble. I'd like to figure this issue out myself, too.

UTF8 charset, diacritical elements, conversion problems - and Zend Framework form escaping

I am writing a webapp in ZF and am having serious issues with UTF8. It's using multi lingual content through Zend Form and it seems that ZF heavily escapes all of these characters and basically just won't show a field if there's diacritical elements 'é' and if I use the HTML entity equivalent e.g. é it gets escaped so that the user will see 'é'.
Zend Form allows for having non escaped data, but trying to use this is confusing, and it seems it'd need to be used all over the place.
So, I have been told that if the page and the text is in UTF8, no conversion to htmlentities is required. Is this true?
And if the last question is true, then how do I convert the source text to UTF8? I am comfortable setting up apache so that it sends a default UTF8 charset heading, and also adding the charset meta tag to the html, but doing this I am still getting messed up encoding. I have also tried opening the translation csv file in TextWrangler on OSX as UTF8, but it has done nothing.
Thanks!
L
'é' and if I use the HTML entity equivalent e.g. é it gets escaped so that the user will see 'é'.
This I don't understand. Can you show an example of how it is displayed, as opposed to how it should be displayed?
So, I have been told that if the page and the text is in UTF8, no conversion to htmlentities is required. Is this true?
Yup. In more detail: If the data you're displaying and the encoding of the HTML page are both UTF-8, the multi-byte special characters will be displayed correctly.
And if the last question is true, then how do I convert the source text to UTF8?
Advanced editors and IDEs enable you to define what encoding the source file is saved in. You would need to open the file in its current encoding (with special characters being displayed correctly) and save it as UTF-8.
If the content is messed up when you have the right content-type header and/or meta tag specified, then the content is not UTF-8 yet. If you don't get it sorted, post an example of what it looks like here.

What kind of text code is %62%69%73%68%6F%70?

On a specific webpage, when I hover over a link, I can see the text as "bishop" but when I copy-and-paste the link to TextPad, it shows up as "%62%69%73%68%6F%70". What kind of code is this, and how can I convert it into text?
Thanks!
URL encoding, I think.
You can decode it here: http://meyerweb.com/eric/tools/dencoder/
Most programming languages will have functions to urlencode/decode too.
This is URL encoding. It is designed to pass characters like < / or & through a URL using their ASCII values in hex after a %. However, you can also use this for characters that don't need encoding per se. Makes the URL harder to read, which is sometimes desirable.
URL encoding replaces characters outside the ascii set.
More info about URL encoding in the w3schools site.
As mentioned by others, this is simply an ASCII representation of the text so that it can be passed around the HTTP object easily. If you've ever noticed typing in a website URL that has a space in it, the browser will usually convert that to %20. That's the hexadecimal value for the "space" character in ASCII.
This used to be a way to trick old spam scrapers. One way spammers get email addresses is to scrape the source code of websites for strings matching the pattern "username#company.tld". By encoding just the username portion or the whole string as ASCII characters, the string would be readable by humans, but would require the scraper to convert it to a literal string before it could be used to send emails. Of course, modern-day spamming tools account for these sort of strings.