Difference between Unicode format?

I noticed something while uploading some Unicode data to the database. When the content is uploaded through a textarea, it gets stored in &#2325; format, but when you type or paste the Unicode character yourself and insert it hardcoded in PHP, it is stored in à¤ format. In both cases the Unicode character is the same: क.
Please tell me the difference between these formats of Unicode characters and how they affect development. There have to be some limitations in those formats.

&#2325; is the markup used in HTML to represent a Unicode character (a numeric character reference; 2325 is the decimal code point of क, U+0915).
If you hard-code something in a PHP source file, make sure you open it with an editor that correctly displays text files containing Unicode characters.
http://www.joelonsoftware.com/articles/Unicode.html is a good place to learn the basics of Unicode.
The UTF-8 encoding of क is the byte sequence E0 A4 95.
If software interprets these bytes as an 8-bit Latin encoding, it will think they are three separate characters.
You will see in the table in the above link that E0 is à and A4 is ¤. The third byte, 95, is a control character in ISO-8859-1 (and a bullet, •, in windows-1252), so it is often invisible.
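The byte-level explanation above can be checked with a short Python sketch. It is only an illustration: the question is about PHP, and the choice of windows-1252 (Python's `cp1252` codec) as the "8-bit Latin encoding" is an assumption.

```python
# U+0915 DEVANAGARI LETTER KA, written as an escape so the source file's
# own encoding cannot affect the result.
ka = "\u0915"

# Encode to UTF-8 and show the bytes in hex.
utf8_bytes = ka.encode("utf-8")
hex_bytes = " ".join(f"{b:02x}" for b in utf8_bytes)
print(hex_bytes)  # e0 a4 95

# Misinterpret the same bytes as windows-1252 (assumed wrong encoding).
misread = utf8_bytes.decode("cp1252")
print(misread)  # à¤•

# The damage is reversible as long as no byte was dropped on the way.
repaired = misread.encode("cp1252").decode("utf-8")
print(repaired == ka)  # True
```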

When the content is uploaded through a textarea, it gets stored in &#2325; format,
Forms should not submit content in a character-reference (&#...;) format.
But in reality, they do in most current browsers, though only when they can't submit the character in question in any other way. In this case, you can't tell whether the user originally typed क or &#2325;; it is a lossy encoding.
To avoid this, make sure you are serving your page in a charset that supports all possible Unicode characters. In practical terms this means always use UTF-8, and serve your page with the Content-Type: text/html;charset=utf-8 header and/or the <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> element in the header. You'll then get all characters in simple, uncorrupted UTF-8 format.
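Why the character-reference fallback is lossy can be sketched in Python. This is an approximation under stated assumptions: Python's `xmlcharrefreplace` error handler stands in for the browser's fallback behaviour, and ISO-8859-1 (`latin-1`) stands in for a page charset that cannot represent क.

```python
# Two different things a user might type into the textarea.
typed_character = "\u0915"   # the letter क itself
typed_reference = "&#2325;"  # the seven ASCII characters & # 2 3 2 5 ;

# Simulated submission from a page served as ISO-8859-1: anything the
# charset cannot represent is replaced by a numeric character reference.
submitted_a = typed_character.encode("latin-1", errors="xmlcharrefreplace")
submitted_b = typed_reference.encode("latin-1", errors="xmlcharrefreplace")

# Both inputs arrive as the same bytes, so the server cannot tell them apart.
print(submitted_a)                 # b'&#2325;'
print(submitted_a == submitted_b)  # True

# With a UTF-8 page the two inputs stay distinct.
utf8_a = typed_character.encode("utf-8")
utf8_b = typed_reference.encode("utf-8")
print(utf8_a == utf8_b)            # False
```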

Related

Unicode and UTF-8 difference, lot of inconsistencies from the whole internet

I know there are a lot of answers about this subject, but I need some clarification.
From what I've understood, ASCII and Unicode are both charsets;
they tell you that A is 65 (hex 41) and B is 66 (hex 42), for example.
UTF-8, UTF-16, UTF-32, and ANSI are encodings;
they are tasked with storing those numbers in a binary form of their liking and managing their retrieval and conversion back to numbers. Then, with the charset, you are able to get the corresponding char.
But, I was looking into how to get which charset/encoding is used by a webpage and I did tools>page information on Firefox.
And I can read this: charset=utf-8
(this is the page: http://www.leboncoin.fr/annonces/offres/ile_de_france/)
Is this a bug in Firefox?
Or, did I completely misunderstand charset/encoding?

You have slightly misunderstood character sets, though this is not a big issue. A character set is just the set of available characters; it doesn't have to assign numbers to them (though it almost always does). See also: What's the difference between encoding and charset?
The real issue here is the use of charset. It comes from an HTML5 meta tag that often looks something like this:
<meta charset="utf-8" />
Despite the name, charset actually specifies a character encoding in HTML5, not a character set. This is likely due to historical confusion between character sets and encodings, as there was not much difference between the two before Unicode introduced multiple encodings for a single character set.
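A minimal Python sketch of the distinction: the character set assigns each character one number (its code point), while each encoding turns that number into a different byte sequence. The choice of "é" as the sample character is arbitrary.

```python
# Code points: the number the character set assigns to a character.
code_point_a = ord("A")       # 65, which is 0x41 in hex
code_point_e = ord("\u00e9")  # 233, which is 0xE9 (é)

# Encodings: different byte sequences for the very same code point.
encoded = {
    name: "\u00e9".encode(name)
    for name in ("latin-1", "utf-8", "utf-16-le", "utf-32-le")
}
for name, data in encoded.items():
    print(f"{name:10} {data.hex()}")
# latin-1    e9
# utf-8      c3a9
# utf-16-le  e900
# utf-32-le  e9000000

# Decoding with the matching encoding always recovers the same character.
round_trips = all(data.decode(name) == "\u00e9" for name, data in encoded.items())
print(round_trips)  # True
```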

Find non-ASCII characters in a text file and convert them to their Unicode equivalent

I am importing a .txt file from a remote server and saving it to a database. I use a .NET script for this purpose. I sometimes notice garbled words/characters (Ullerهkersvنgen) inside the files, which cause problems while saving to the database.
I want to filter all such characters and convert them to Unicode before saving to the database.
Note: I have been through many similar posts but had no luck.
Your help in this context will be highly appreciated.
Thanks.

Assuming your script knows the correct encoding of your text snippet, this regular expression should find all runs of non-ASCII characters:
[^\x00-\x7F]+
see here: https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
Also, the base-R tools package provides two functions to detect non-ASCII characters:
tools::showNonASCII()
tools::showNonASCIIfile()
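As a sketch, the same regular expression can be applied in Python (the question uses .NET, so this only illustrates the pattern; `find_non_ascii` is a hypothetical helper name). Note that it only locates the characters; it says nothing about which encoding produced them.

```python
import re

# Same pattern as above: one or more characters outside the ASCII range.
NON_ASCII = re.compile(r"[^\x00-\x7F]+")

def find_non_ascii(text):
    """Return (offset, run) pairs for every run of non-ASCII characters."""
    return [(m.start(), m.group()) for m in NON_ASCII.finditer(text)]

# Escapes are used so the example does not depend on the file's encoding.
sample = "Uller\u00e5kersv\u00e4gen 12"  # Ulleråkersvägen 12
hits = find_non_ascii(sample)
print(hits)  # [(5, 'å'), (11, 'ä')]

# A pure-ASCII string produces no hits.
print(find_non_ascii("plain text"))  # []
```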

You need to know or at least guess the character encoding of the data in order to be able to convert it properly. So you should try and find information about the origin and format of the text file and make sure that you read the file properly in your software.
For example, “Ullerهkersvنgen” looks like a Scandinavian name, with Scandinavian letters in it, misinterpreted according to a wrong character encoding assumption or munged by an incorrect character code conversion. The first Arabic letter in it, “ه”, is U+0647 ARABIC LETTER HEH. In the ISO-8859-6 encoding, it is E7 (hex.); in windows-1256, it is E5. Since Scandinavian texts are normally represented in ISO-8859-1 or windows-1252 (when Unicode encodings are not used), it is natural to check what E7 and E5 mean in them: “ç” and “å”. For linguistic reasons, the latter is much more probable here. The second Arabic letter is “ن” U+0646 ARABIC LETTER NOON, which is E4 in windows-1256. And in ISO-8859-1, E4 is “ä”. This makes perfect sense: the word is “Ulleråkersvägen”, a real Swedish street name (in Uppsala, at least).
Thus, the data is probably ISO-8859-1 or windows-1252 (Windows Latin 1) encoded text, incorrectly interpreted as windows-1256 (Windows Arabic). No conversion is needed; you just need to read the data as windows-1252 encoded. (After reading, it can of course be converted to another encoding.)
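The diagnosis above can be reproduced with a small Python sketch (the question uses .NET; this only illustrates the reasoning). It undoes the wrong windows-1256 interpretation to get the original bytes back, then reads them as windows-1252.

```python
# The garbled string from the question, with the two Arabic letters as escapes:
# U+0647 ARABIC LETTER HEH and U+0646 ARABIC LETTER NOON.
garbled = "Uller\u0647kersv\u0646gen"

# Step 1: recover the raw bytes by re-encoding with the wrongly assumed encoding.
raw_bytes = garbled.encode("cp1256")
print(raw_bytes)  # b'Uller\xe5kersv\xe4gen'

# Step 2: decode those bytes with the encoding the file was really written in.
fixed = raw_bytes.decode("cp1252")
print(fixed)  # Ulleråkersvägen
```

In practice the cleaner fix is the one described above: read the file as windows-1252 in the first place, so the wrong interpretation never happens.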

converting arabic words to windows-1252

I'm inserting a large amount of data into an Oracle database.
In that database, text is stored in windows-1252 format.
It turns out that there are a lot of things to be entered, and all of them need to be converted to this format. All of that data consists of Arabic words.
Can someone help me find an online converter or a tool that encodes Arabic words to windows-1252 format?
*hope the details are enough
--rangana

The pair of Win32 APIs, MultiByteToWideChar and WideCharToMultiByte, allow you to convert code-page encoding to Unicode and Unicode data to code-page encoding, respectively. Each of these APIs takes as an argument the value of the code page to be used for that conversion. You can, therefore, either specify the value of a given code page (example: 1256 for Arabic) or use predefined flags such as:
CP_ACP: for the currently selected system Windows code page
CP_OEMCP: for the currently selected system OEM code page
CP_UTF8: for conversions between UTF-16 and UTF-8

Since windows-1252 does not encode Arabic letters at all, the only way to do the conversion would be to use some kind of transliteration. This is something completely different from encoding conversion (which does not change the identity of characters, only their coded representation).
There is a large number of transliteration (romanization) schemes for Arabic. Almost all of them are non-reversible, and almost all of them are unsuitable for fully automatic processing (mainly because normal Arabic writing does not indicate short vowels but most transliteration schemes do, i.e. the transliterator needs to know how the word is pronounced in order to insert vowel characters).
You could fake a conversion by converting to windows-1256 and then inserting the windows-1256 encoded data into the database as raw bytes. You would then need to keep track of the encoding of each value in the database, so that you know which bytes are windows-1252 and which are really windows-1256. This sounds like a mess, so consider whether it is possible to convert the database to use UTF-8.
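A short Python sketch of these points, using an arbitrary sample word (سلام) that is not from the question: windows-1252 rejects Arabic letters outright, the "fake" windows-1256 route stores bytes that look like Latin gibberish to anything reading them as windows-1252, and UTF-8 handles the text directly.

```python
arabic = "\u0633\u0644\u0627\u0645"  # سلام, an arbitrary sample word

# windows-1252 has no Arabic letters, so a strict conversion fails.
try:
    arabic.encode("cp1252")
    fits_cp1252 = True
except UnicodeEncodeError:
    fits_cp1252 = False
print(fits_cp1252)  # False

# The "fake" route: windows-1256 bytes stored in a windows-1252 column.
cp1256_bytes = arabic.encode("cp1256")
seen_by_database = cp1256_bytes.decode("cp1252")
print(seen_by_database)  # ÓáÇã

# It round-trips only if every reader remembers the real encoding.
recovered = seen_by_database.encode("cp1252").decode("cp1256")
print(recovered == arabic)  # True

# UTF-8 needs no such bookkeeping.
utf8_bytes = arabic.encode("utf-8")
print(len(cp1256_bytes), len(utf8_bytes))  # 4 8
```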

Should I use hex ascii accented character code in HTML or use the actual character?

I have several huge CSVs with lots of accented characters in HTML hex code: &#xE9; for é and lots of others, even &#x2013; for –, etc.
My site is a wiki for people to update listings. So when they are presented with a textarea for an update, the existing content is filled in, and obviously those hex codes will be shown.
Should I bother replacing those codes with the actual accented characters, or just leave them as they are? I wrote a script to replace the characters, but somehow the output is weird characters. Probably the output saved by Ruby isn't in UTF-8 format.
By default my site is in UTF-8, and the accented characters are displayed properly with some html coding in the view.
Please advise. Thanks.

Could you clarify what the problem is?
If your data (CSV) is in UTF-8, and the default encoding of your site is UTF-8, then all you would need to do is make sure that when users are editing content, that content is properly treated as UTF-8.
You may not need to display the markup to the users. Perhaps you could leverage a WYSIWYG editor package like TinyMCE?
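If the references are converted to real characters, the conversion itself is small. A minimal Python sketch follows (the asker's script is in Ruby, so this is only an illustration, and the sample cell value is made up). The important parts are decoding the references once and then writing the result out as UTF-8.

```python
import html

# A made-up CSV cell containing hexadecimal character references.
cell = "Caf&#xE9; Central &#x2013; open daily"

# Turn the references into the actual characters.
decoded = html.unescape(cell)
print(decoded)  # Café Central – open daily

# When saving, encode explicitly as UTF-8 so the bytes match the site's charset.
utf8_bytes = decoded.encode("utf-8")

# When placing stored text back into a textarea, escape only the characters
# that are special in HTML; accented letters can stay as they are.
for_textarea = html.escape("Caf\u00e9 <b>Central</b>")
print(for_textarea)  # Café &lt;b&gt;Central&lt;/b&gt;
```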

Codepages and encodings

Before anyone recommends that I do a Google search on this, I have. I just need a bit more clarity around what codepages and encodings are.
If I use UTF-8 encoding, and use an Italian code page and then a French code page, does this mean I'll get different characters even though the bytes haven't changed?

Joel has a nice summary of this:
http://www.joelonsoftware.com/articles/Unicode.html
And no, if I understand your question correctly, it doesn't mean that.
When you're converting UTF-8 to a specific code page, it is possible that only some of the characters are going to be converted. What happens to the ones that don't get converted depends on how you call the conversion. A possible result is that the characters which could not be mapped to the code page would be converted to question mark characters.

An encoding is simply a mapping between numerical values and "characters".
US-ASCII maps the number 65 to the letter A, 32 to a space and 49 to the digit "1". (How these things are rendered is another matter.) In fact, UTF-8 does the same! But there are other values which UTF-8 treats differently from ASCII. It is a variable-length encoding, i.e. a character may be encoded with 1, 2, 3, or 4 bytes; common characters generally consume fewer bytes.
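The variable-length behaviour is easy to observe. In this Python sketch the four sample characters are arbitrary picks, one for each possible UTF-8 length.

```python
# One sample character per UTF-8 length: A, é, क, and an emoji (U+1F600).
samples = ["A", "\u00e9", "\u0915", "\U0001F600"]

lengths = [len(ch.encode("utf-8")) for ch in samples]
print(lengths)  # [1, 2, 3, 4]

# For ASCII characters, UTF-8 produces exactly the same byte as US-ASCII.
same_as_ascii = "A".encode("utf-8") == "A".encode("ascii") == bytes([65])
print(same_as_ascii)  # True
```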
Plain text files, including web pages, are stored and transmitted as sequences of bytes. These bytes are supposed to represent something textual. Software applications (like text editors and web browsers) are responsible for rendering the information within these files on the screen. Usually they make use of library or OS functions.
If the software assumes a different encoding to the software that created the file, the wrong characters may be displayed!
Note that it is possible to convert between different encodings; however if you convert to an encoding that does not contain a certain character, the software must make a choice as to what to use instead. This conversion often happens transparently (when you save a file with a certain encoding, whatever you've typed must be changed into that encoding).
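Both failure modes can be shown in a few lines of Python. The code pages used here are assumptions chosen for contrast: windows-1252 for Western European text and windows-1251 (Cyrillic) as the wrong guess. Replacing unmappable characters with "?" is one common choice, not the only one a converter can make.

```python
# The same byte means different characters under different code pages.
data = "\u00e8".encode("cp1252")        # è stored as the single byte E8
as_western = data.decode("cp1252")      # è
as_cyrillic = data.decode("cp1251")     # и (CYRILLIC SMALL LETTER I)
print(data.hex(), as_western, as_cyrillic)  # e8 è и

# Converting to an encoding that lacks a character forces a substitution.
text = "\u00e8\u20ac"                   # è followed by the euro sign
lossy = text.encode("latin-1", errors="replace")
print(lossy)  # b'\xe8?'

# The original euro sign cannot be recovered from the converted bytes.
print(lossy.decode("latin-1"))  # è?
```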

UTF-8 includes all characters from your French and Italian code pages, but the language-specific code pages do not include all of each other's characters.
So you can take input from each language and convert it to UTF-8 for storage, but you can not be certain that you will get the right characters if you take Italian input and show it as French.
Use UTF-8 all the way if you can.
