Should I use hex ascii accented character code in HTML or use the actual character? - unicode

I have several huge CSVs with lots of accented characters in html hex code: é for é and lots of others, even – for –, etc.
My site is a wiki for people to update listings. So when they are presented a textarea for update, the existing content is filled in, and obviously those hex codes will be shown.
Should I be bothered replacing those codes with actual accented characters, or just leave it as it is? I wrote a script to replace the characters, but somehow the output are weird characters. Probably the format saved in Ruby isn't in UTF-8 format.
By default my site is in UTF-8, and the accented characters are displayed properly with some html coding in the view.
Please advise. Thanks.

Could you clarify what the problem is?
If your data (CSV) is in UTF-8, and the default encoding of your site is UTF-8, then all you would need to do is make sure that when users are editing content, that content is properly treated as UTF-8.
You may not need to display the markup to the users. Perhaps you could leverage a WYSIWIG editor package like TinyMCE?

Related

How should a properly UTF-8 encoded file look in notepad++

I am integrating data using some flat files. I'm getting the flat files delivered by FTP as .csv-files out of MS SQL exports from a business partner.
I asked him to encode it as UTF-8 (just using the standard I thought).
Now I can see in his files that a lot of UTF-8 bytes such as "& # 2 3 3 ;" (w/o the spaces) can be seen as plain text when I open it in Notedpad++ (or also using my "ETL" tool).
Before I ask him to fix it into proper UTF-8, I would like to understand the issue and whether my claim is actually correct?
Shouldn't special characters be shown as special characters when I open them in Notepad++ and not as plain text UTF-8 codes?
Any help is much appreciated :))
Cheers
Martin
é is an HTML entity. For some reason the text is HTML formatted, which I wouldn't count as "plaintext"/flat files. The file may or may not be encoded in UTF-8 in addition to that, we don't know from the information given.
A file containing "special characters" (meaning non-ASCII characters) encoded in UTF-8 opened in a text editor which correctly interprets the file as UTF-8 looks exactly like the text it should look like, e.g.:
正式名称は、ISO/IEC 10646では “UCS Transformation Format 8”、Unicodeでは “Unicode Transformation Format-8” という。両者はISO/IEC 10646とUnicodeのコード重複範囲で互換性がある。RFCにも仕様がある。
Put this in a file, save it as UTF-8, open it in another application as UTF-8, and this is what the text should look like.

Difference between Unicode format?

I noticed something while uploading some unicode data to the database. When the content is uploaded throught textarea, is gets stored in क format, but when you personally type or paste the unicode and insert it hardcoded in php, then it would store in ठformat. But for both, the unicode character is same क.
Now please tell me the difference between the different formats of unicode characters. And how they affect the development. There has to be some limitations in those formats.
& #2325; is markup used in HTML to represent a Unicode character
If you hard code something in a php source file, Make sure you are opening it with editor that correctly displays text files with unicode characters in it.
http://www.joelonsoftware.com/articles/Unicode.html is good place to know the basics of unicode.
UTF-8 encoding of क has the byte sequence E0 A4
Now if somebody interprets this as 8 bit Latin encoding it will think it is two characters
you will see in the table in the above link E0 is à and A4 is ¤
When the content is uploaded throught textarea, is gets stored in क format,
Forms should not submit content in a character-reference (&#...;) format.
But in reality, they do in most current browsers... but only when they can't submit the character in question in any other way. In this case, you can't tell whether the user originally typed क or क, it is a lossy encoding.
To avoid this, make sure you are serving your page in a charset that supports all possible Unicode characters. In practical terms this means always use UTF-8, and serve your page with the Content-Type: text/html;charset=utf-8 header and/or the <meta http-equiv="Content=Type" content="text/html;charset=utf-8"/> element in the header. You'll then get all characters in simple, uncorrupted UTF-8 format.

Hebrew characters processed by HTML Tidy turn into gibberish

I'm using HTML Tidy Online (http://infohound.net/tidy/) to tidy up some very old and messed up HTML file which contains some Hebrew characters. Whenever the page is processed by Tidy the output turns Hebrew characters into gibberish, even after changing encoding methods in the settings. Using different settings, I do manage to get the same output with the Hebrew characters as unicode entities.
I Googled around for a possible solution but found none.
I had a couple ideas in mind, but I'm not sure exactly how to approach them, if at all (maybe someone has a better solution).
I thought maybe I could (after processing the page) scan the page for unicode entities and replace them with the corresponding Hebrew characters (in a systematic way, of course).
Maybe I could take the HTML Tidy source code and modify it to output Hebrew characters appropriately. The problem with this is that I doubt I am knowledgeable enough to even get started on something like this.
I had a similar problem. Document in UTF-8, containing unicode characters. HTML Tidy turned them into HTML entities. This in HTMLTIDY.CFG fixed it:
char-encoding: utf8
input-encoding: utf8
output-encoding: utf8
Hope it helps.
The website http://infohound.net/tidy/ that you are using has a "Char encoding" clause at the bottom right. You need to choose utf-8, but first you need to make sure that the page is encoded in UTF-8 in your test editor. In Notepad++ for example, you can go to Encoding > Convert to UTF-8 without BOM.

Encoding issue in the web page

There is an encoding issue in the web page means it showing some special characters in the browser(Cinéma). content is in ISO, web page is rendering in UTF-8. some articles are displaying properly,bcz those are in UTF encode.some of the articles are shows the encoding issue like Cinéma in Perl 5.
Can any once help me out for this encoding issue.that would be a great!
Thanks in advance.
Ensure your Content-type header, or meta document element, contains correct encoding information.
A quick and easy way to test if this is your issue is to ask the browser to render the page as if it had received a specific encoding directive. In Safari this would be View -> Text Encoding and then selecting something appropriate.
I'd hazard a guess that if you inform the browser to use utf-8 then it will render the page correctly.
The only way to solve this will be to spend some time reading up on Unicode and UTF-8 and how to handle encoding in Perl. (perldoc perluniintro, perldoc perlunitut, perldoc perlunicode, perldoc perlunifaq for example).
UTF-8 encoding is a very different concept to other encodings that programmers encounter (escaping in strings, URL encoding, HTML character entities, etc) - it's about how your code should interpret sequences of bytes as characters.
Without knowing the source of the word containing the special character (an accented 'e'), it's impossible to offer further help - is it coming from a database? in a static HMTL page? in an HTML template? a string within Perl code?

UTF8 charset, diacritical elements, conversion problems - and Zend Framework form escaping

I am writing a webapp in ZF and am having serious issues with UTF8. It's using multi lingual content through Zend Form and it seems that ZF heavily escapes all of these characters and basically just won't show a field if there's diacritical elements 'é' and if I use the HTML entity equivalent e.g. é it gets escaped so that the user will see 'é'.
Zend Form allows for having non escaped data, but trying to use this is confusing, and it seems it'd need to be used all over the place.
So, I have been told that if the page and the text is in UTF8, no conversion to htmlentities is required. Is this true?
And if the last question is true, then how do I convert the source text to UTF8? I am comfortable setting up apache so that it sends a default UTF8 charset heading, and also adding the charset meta tag to the html, but doing this I am still getting messed up encoding. I have also tried opening the translation csv file in TextWrangler on OSX as UTF8, but it has done nothing.
Thanks!
L
'é' and if I use the HTML entity equivalent e.g. é it gets escaped so that the user will see 'é'.
This I don't understand. Can you show an example of how it is displayed, as opposed to how it should be displayed?
So, I have been told that if the page and the text is in UTF8, no conversion to htmlentities is required. Is this true?
Yup. In more detail: If the data you're displaying and the encoding of the HTML page are both UTF-8, the multi-byte special characters will be displayed correctly.
And if the last question is true, then how do I convert the source text to UTF8?
Advanced editors and IDEs enable you to define what encoding the source file is saved in. You would need to open the file in its current encoding (with special characters being displayed correctly) and save it as UTF-8.
If the content is messed up when you have the right content-type header and/or meta tag specified, then the content is not UTF-8 yet. If you don't get it sorted, post an example of what it looks like here.