I have received this in a name field (so it should be a person's name)
Игорќ
What could that decode to? Is it UTF-8? What language does that translate to? Russian?
If you can give me a hint or maybe links to websites that explain what meaningful letters I should get out of that would be helpful, thank you :)
This typically is UTF-8 interpreted as some single-byte Windows encoding.
String s = "Игорќ"; // Source encoding UTF-8
byte[] b = s.getBytes("Cp1252");
System.out.println("" + new String(b, StandardCharsets.UTF_8));
// Игорќ
The data might easily get corrupted. Above I got some results with Windows-1252 (MS Windows Latin-1). The java source must be compiled with encoding UTF-8 to accept those chars.
Since you have already pasted the original text into a UTF-8 encoded site such as Stack Overflow, what you posted is now corrupt data perfectly encoded as UTF-8. To find out anything about the data's encoding, you need to use a hexadecimal editor or a similar tool on the original raw bytes.
In any case, if you do this:
Open a text file in some single-byte encoding (possibly the ANSI code page used by your copy of Windows, I used Windows-1252)
Paste the Игорќ gibberish and save the file
Reload the file as UTF-8
... you get this:
Игорќ
So it's probably valid UTF-8 incorrectly decoded.
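If a hex editor is not at hand, the raw bytes can also be inspected with a short script. The sketch below is only an illustration: the helper name `try_decodings` and the list of candidate encodings are assumptions, and the sample bytes are the ones behind the name in the question.

```python
def try_decodings(raw, encodings=("utf-8", "cp1252", "cp1251")):
    """Return how each candidate encoding reads the same raw bytes."""
    results = {}
    for encoding in encodings:
        try:
            results[encoding] = raw.decode(encoding)
        except UnicodeDecodeError:
            results[encoding] = None  # the bytes are not valid in this encoding
    return results

# Sample: the ten bytes behind the garbled name.
sample = bytes.fromhex("d0 98 d0 b3 d0 be d1 80 d1 9c")
print(sample.hex(" "))
for name, text in try_decodings(sample).items():
    print(name, "->", text)
```

Only one of the candidates normally produces text that reads as a plausible name, which is the practical way to decide between them.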
I am gathering information from a HEBREW (WINDOWS-1255 / UTF-8 encoding) website using vbscript and WinHttp.WinHttpRequest.5.1 object.
For Example :
Set objWinHttp = CreateObject("WinHttp.WinHttpRequest.5.1")
...
'writes the file as unicode (can't use Ascii)
Set Fileout = FSO.CreateTextFile("c:\temp\myfile.xml", true, true)
....
Fileout.WriteLine(objWinHttp.responsetext)
When viewing the file in Notepad / Notepad++, I see the Hebrew as gibberish.
For example :
äìëåú - äøá àáøäí éåñó - îåøùú
I need a vbscript function to return Hebrew correctly, the function should be similar to the following http://www.pixiesoft.com/flip/ choosing the 2nd radio button and press convert button , you will see Hebrew correctly.
Your script is correctly fetching the byte stream and saving it as-is. No problems there.
Your problem is that the local text editor doesn't know that it's supposed to read the file as cp1255, so it tries the default on your machine of cp1252. You can't save the file locally as cp1252, so that Notepad will read it correctly, because cp1252 doesn't include any Hebrew characters.
What is ultimately going to be reading the file or byte stream, that will need to pick up the Hebrew correctly? If it does not support cp1255, you will need to find an encoding that is supported by that tool, and convert the cp1255 string to that encoding. Suggest you might try UTF-8 or UTF-16LE (the encoding Windows misleadingly calls 'Unicode'.)
Converting text between encodings in VBScript/JScript can be done as a side-effect of an ADODB stream. See the example in this answer.
Thanks to Charming Bobince (who posted the answer), I am now able to see Hebrew correctly (saving Windows-1255 encoded text to a txt file for Notepad) by implementing the following:
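Function ConvertFromUTF8(sIn)
    ' Despite the name, the input here is not UTF-8: the string is written out
    ' as single-byte ANSI and the same bytes are read back as Windows-1255.
    Dim oIn: Set oIn = CreateObject("ADODB.Stream")
    oIn.Open
    oIn.CharSet = "X-ANSI"          ' charset used when writing the string
    oIn.WriteText sIn
    oIn.Position = 0                ' rewind before reading
    oIn.CharSet = "WINDOWS-1255"    ' charset used when reading the bytes back
    ConvertFromUTF8 = oIn.ReadText
    oIn.Close
End Function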
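The same reinterpretation can be sketched outside VBScript. The Python below assumes the gibberish is Windows-1255 bytes that were displayed as Windows-1252, and uses one word of such gibberish as a sample.

```python
# One word as Notepad shows it: Windows-1255 bytes decoded as Windows-1252.
garbled = "äìëåú"

raw = garbled.encode("cp1252")   # back to the bytes the site sent
hebrew = raw.decode("cp1255")    # decode them with the intended code page

print(raw.hex(" "))
print(hebrew)
```

Writing `hebrew` to a file with an explicit `encoding="utf-8"` would then give a file that any Unicode-aware editor can open without guessing the code page.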
I noticed something while uploading some Unicode data to the database. When the content is uploaded through a textarea, it gets stored in &#2325; format, but when you type or paste the Unicode text yourself and insert it hardcoded in PHP, it is stored in à¤• format. In both cases the Unicode character is the same: क.
Now please tell me the difference between the different formats of unicode characters. And how they affect the development. There has to be some limitations in those formats.
&#2325; is a numeric character reference, the markup used in HTML to represent a Unicode character.
If you hardcode something in a PHP source file, make sure you open it with an editor that correctly displays text files containing Unicode characters.
http://www.joelonsoftware.com/articles/Unicode.html is a good place to learn the basics of Unicode.
The UTF-8 encoding of क is the byte sequence E0 A4 95.
If somebody interprets these bytes as an 8-bit Latin encoding, they will see one character per byte instead of a single character.
In the table in the above link, E0 is à and A4 is ¤; the third byte, 95, is • in Windows-1252.
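The relationship between the character, its UTF-8 bytes, the misread text, and the HTML reference can be checked directly. This is a small Python sketch; Windows-1252 is assumed as the "8-bit Latin" encoding.

```python
import html

char = "क"                        # DEVANAGARI LETTER KA, U+0915
utf8_bytes = char.encode("utf-8")
print(utf8_bytes.hex(" "))        # e0 a4 95

# The same three bytes read as Windows-1252 become three unrelated characters.
misread = utf8_bytes.decode("cp1252")
print(misread)                    # à¤•

# The HTML numeric character reference names the code point in decimal.
print(ord(char))                  # 2325
print(html.unescape("&#2325;") == char)  # True
```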
When the content is uploaded through a textarea, it gets stored in &#2325; format,
Forms should not submit content in a character-reference (&#...;) format.
But in reality, most current browsers do... but only when they can't submit the character in question in any other way. In this case, you can't tell whether the user originally typed क or the literal text &#2325;, so it is a lossy encoding.
To avoid this, make sure you are serving your page in a charset that supports all possible Unicode characters. In practical terms this means always use UTF-8, and serve your page with the Content-Type: text/html;charset=utf-8 header and/or the <meta http-equiv="Content-Type" content="text/html;charset=utf-8"/> element in the page's <head>. You'll then get all characters in simple, uncorrupted UTF-8 format.
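The lossiness can be illustrated with Python's `xmlcharrefreplace` error handler, which is roughly analogous to what a browser does when a form's charset cannot represent a character. ISO-8859-1 is assumed here as the limited page charset; the variable names are illustrative.

```python
typed_character = "क"        # the user typed the actual letter
typed_reference = "&#2325;"  # the user typed these seven ASCII characters

# With a charset that lacks the letter, it is replaced by a character reference.
sent_a = typed_character.encode("iso-8859-1", errors="xmlcharrefreplace")
sent_b = typed_reference.encode("iso-8859-1", errors="xmlcharrefreplace")
print(sent_a, sent_b, sent_a == sent_b)

# With UTF-8 the two inputs remain distinguishable.
print(typed_character.encode("utf-8") == typed_reference.encode("utf-8"))
```

The first comparison is true and the second is false, which is why serving the form as UTF-8 removes the ambiguity.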
I'm trying to understand how ASP classic handles strings internally. I've googled and debugged, but I still don't know how a string is encoded within the ASP script.
See the illustration below.
Is input data transformed so that all string variables have the same encoding no matter what source?
Most ASP pages are saved on disk as UTF-8. They do, however, #include ASP files that are saved with another encoding. At the top of front-end pages I set the response encoding to UTF-8.
Response.CodePage = 65001 ' UTF-8
Response.CharSet = "utf-8"
http://www.designerline.se/db/aspclassicencoding.png
First of all, it's worth considering that UTF-8 and Windows-1252 (and ISO-8859-1 and others) are all based on US-ASCII. The first 128 characters in all of these code pages are identical: they use exactly the same byte values and each occupies just one byte.
In many cases the vast majority of the content is within the US-ASCII range, so it's hard to tell there is any difference between them. Frequently the whole file uses only US-ASCII characters, and hence the files are identical regardless of the chosen encoding (save perhaps for the BOM at the start of the file).
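A quick way to convince yourself of this is to encode the same text several ways and compare the bytes. The following Python sketch uses an arbitrary ASCII-only sample string.

```python
text = "Response.Write"  # ASCII-only sample

# The same bytes come out regardless of which of these encodings is chosen.
encoded = {name: text.encode(name) for name in ("ascii", "utf-8", "cp1252", "iso-8859-1")}
print(len(set(encoded.values())) == 1)

# A UTF-8 file saved with a BOM differs only by its three leading bytes.
with_bom = text.encode("utf-8-sig")
print(with_bom[:3].hex(" "), with_bom[3:] == encoded["utf-8"])

# Outside the ASCII range the encodings diverge.
print("é".encode("utf-8"), "é".encode("cp1252"))
```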
Basic Script Processing
First the processor combines an ASP file with all its includes and the includes of those includes. This is done very simply, by sequentially replacing the include markers with the content of the include file being referenced. This is done purely at the byte level; no attempt is made to convert files of different encodings.
Next the combined version of the file is parsed, tokenized, and even "compiled" into a tight, interpreter-friendly form. It's at this point that chunks of static content in the file (the stuff outside of script code blocks) are turned into a special form of Response.Write. It's special in that when script execution reaches one of these writes, the processor simply copies the bytes verbatim, as found in the file, directly to the output stream; again, no attempt is made to convert any encodings.
Script code and character encoding
The ASP processor just doesn't cope well with anything that isn't ASCII. All your code and especially your string literals in your code should only be in ASCII.
What can be a bit confusing is that once a script is executing, all string variables are stored using Unicode encoding.
When code writes content to the response using the proper Response.Write method, this is where Response.CodePage comes into effect. It encodes the Unicode string the script provides into the response code page before adding it to the output stream.
What is the effect of Response.CharSet
It adds the charset attribute to the Content-Type HTTP header. That is it; it has no other impact. If you set this to one character set but send a different one, either because your Response.CodePage doesn't match it or because the byte content of the files is not in that encoding, then you can expect problems.
Input encoding
Things get really messy here. When form data is posted to the server, there is no provision in the form URL encoding standard to declare the code page used. Browsers can be told what encoding to use, and they default to the charset of the HTML page containing the form, but there is no mechanism to communicate that choice to the server.
ASP takes the view that the code page of posted form fields is the same as the code page of the response it's about to send. Take a moment to absorb that... This means that, quite counterintuitively, the Response.CodePage value has an impact on the strings returned by Request.Form. For this reason it's important to set the correct code page early; doing some form processing and then setting the code page later, just before sending a response, can lead to unexpected results.
The classic "the web page looks fine but the data in the database is corrupt" gotcha
One common gotcha this behaviour results in is where the developer has set CharSet = "UTF-8" but left the code page at something like Windows-1252.
What ends up happening is that the user enters text, which is sent to the server in UTF-8 encoding, but the script code reads it as 1252. This corrupt string gets stored in the database. A subsequent web page reads the corrupt string back from the DB. The string is then sent by Response.Write using 1252 encoding, but the destination page is told it's UTF-8. This has the effect of reversing the corruption, and everything looks fine to the user.
However, when another component, say a report generator, creates content from the database, the data appears corrupt because it is.
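The gotcha described above can be simulated in a few lines. The Python below is only a model of the byte-level behaviour, not ASP itself; "café" is an arbitrary sample input and the variable names are illustrative.

```python
typed = "café"                       # what the user entered in the form

# Browser submits UTF-8 (CharSet says so); the script decodes with code page 1252.
posted = typed.encode("utf-8")
stored = posted.decode("cp1252")     # the corrupt string saved to the database

# A later page writes the stored string using code page 1252 again,
# and the browser decodes the response as UTF-8.
response = stored.encode("cp1252")
shown = response.decode("utf-8")

print(repr(stored))                  # what a report generator would see
print(repr(shown))                   # what the web page shows
```

The page shows the original text because the second mismatch cancels the first, while the value actually held in the database stays corrupt.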
The Bottom Line
You are already doing the correct thing: get that CharSet and CodePage set early and consistently. Where other files are not saved as UTF-8 you will have problems if there is non-ASCII content in them, but otherwise you will be fine.
Many included ASP files are purely code with no content, and since that code ought to be purely ASCII, their encoding doesn't really matter.
’ is showing on my page instead of '.
I have the Content-Type set to UTF-8 in both my <head> tag and my HTTP headers:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
In addition, my browser is set to Unicode (UTF-8):
So what's the problem, and how can I fix it?
So what's the problem,
It's a ’ (RIGHT SINGLE QUOTATION MARK - U+2019) character which is being decoded as CP-1252 instead of UTF-8. If you check the encodings table, you'll see that this character is composed in UTF-8 of the bytes 0xE2, 0x80 and 0x99. If you check the CP-1252 code page layout, you'll see that each of those bytes stands for an individual character: â, € and ™.
and how can I fix it?
Use UTF-8 instead of CP-1252 to read, write, store, and display the characters.
I have the Content-Type set to UTF-8 in both my <head> tag and my HTTP headers:
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
This only instructs the client which encoding to use to interpret and display the characters. This doesn't instruct your own program which encoding to use to read, write, store, and display the characters in. The exact answer depends on the server side platform / database / programming language used. Do note that the one set in HTTP response header has precedence over the HTML meta tag. The HTML meta tag would only be used when the page is opened from local disk file system instead of from HTTP.
In addition, my browser is set to Unicode (UTF-8):
This only forces the client to use that encoding to interpret and display the characters. But the actual problem is that you're already sending ’ (encoded in UTF-8) to the client instead of ’. The client is correctly displaying ’ using the UTF-8 encoding. If the client were misinstructed to use, for example, ISO-8859-1, you would likely have seen Ã¢â¬â¢ instead.
I am using ASP.NET 2.0 with a database.
This is most likely where your problem lies. You need to verify with an independent database tool what the data looks like.
If the ’ character is there, then you aren't connecting to the database correctly. You need to tell the database connector to use UTF-8.
If your database contains ’, then it's your database that's messed up. Most probably the tables aren't configured to use UTF-8. Instead, they use the database's default encoding, which varies depending on the configuration. If this is your issue, then usually just altering the table to use UTF-8 is sufficient. If your database doesn't support that, you'll need to recreate the tables. It is good practice to set the encoding of the table when you create it.
You're most likely using SQL Server, but here is some MySQL code (copied from this article):
CREATE DATABASE db_name CHARACTER SET utf8;
CREATE TABLE tbl_name (...) CHARACTER SET utf8;
If, however, your table is already UTF-8, then you need to take a step back. Who or what put the data there? That's where the problem is. One example would be HTML form submitted values which are incorrectly encoded/decoded.
Here are some more links to learn more about the problem:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!), from our own Joel.
Unicode - How to get the characters right?, with more concise and practical information, solutions are targeted on Java environments.
How to setup your PHP site to use UTF8, targeted on PHP environments.
Ensure the browser and editor are using UTF-8 encoding instead of ISO-8859-1/Windows-1252.
Or use &rsquo;.
’ (Unicode codepoint U+2019 RIGHT SINGLE QUOTATION MARK) is encoded in UTF-8 as bytes:
0xE2 0x80 0x99.
’ (Unicode codepoints U+00E2 U+20AC U+2122) is encoded in UTF-8 as bytes:
0xC3 0xA2 0xE2 0x82 0xAC 0xE2 0x84 0xA2.
These are the bytes your browser is actually receiving in order to produce ’ when processed as UTF-8.
That means that your source data is going through two charset conversions before being sent to the browser:
1. The source ’ character (U+2019) is first encoded as UTF-8 bytes:
0xE2 0x80 0x99
2. Those individual bytes are then misinterpreted and decoded to Unicode codepoints U+00E2 U+20AC U+2122 by one of the Windows-125X charsets (1252, 1254, 1256, and 1258 all map 0xE2 0x80 0x99 to U+00E2 U+20AC U+2122), and then those codepoints are encoded as UTF-8 bytes:
0xE2 -> U+00E2 -> 0xC3 0xA2
0x80 -> U+20AC -> 0xE2 0x82 0xAC
0x99 -> U+2122 -> 0xE2 0x84 0xA2
You need to find where the extra conversion in step 2 is being performed and remove it.
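The two conversions can be reproduced step by step. This Python sketch assumes Windows-1252 is the charset that performed the wrong decode; the variable names are illustrative.

```python
original = "\u2019"                           # RIGHT SINGLE QUOTATION MARK

step1 = original.encode("utf-8")              # correct UTF-8 bytes
misdecoded = step1.decode("cp1252")           # the extra, wrong conversion
step2 = misdecoded.encode("utf-8")            # bytes the browser receives

print(step1.hex(" "))                             # e2 80 99
print([f"U+{ord(c):04X}" for c in misdecoded])    # ['U+00E2', 'U+20AC', 'U+2122']
print(step2.hex(" "))                             # c3 a2 e2 82 ac e2 84 a2
```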
This sometimes happens when a string is converted from Windows-1252 to UTF-8 twice.
We had this in a Zend/PHP/MySQL application where characters like that were appearing in the database, probably due to the MySQL connection not specifying the correct character set. We had to:
Ensure Zend and PHP were communicating with the database in UTF-8 (was not by default)
Repair the broken characters with several SQL queries like this...
UPDATE MyTable SET
MyField1 = CONVERT(CAST(CONVERT(MyField1 USING latin1) AS BINARY) USING utf8),
MyField2 = CONVERT(CAST(CONVERT(MyField2 USING latin1) AS BINARY) USING utf8);
Do this for as many tables/columns as necessary.
You can also fix some of these strings in PHP if necessary. Note that because characters have been encoded twice, we actually need to do a reverse conversion from UTF-8 back to Windows-1252, which confused me at first.
mb_convert_encoding('’', 'Windows-1252', 'UTF-8'); // returns ’
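The same reverse conversion can be wrapped so that text which was never double-encoded is left alone. This is a minimal Python sketch; the helper name `undo_double_encoding` is hypothetical, and Windows-1252 is assumed as the intermediate charset.

```python
def undo_double_encoding(text: str) -> str:
    """Reverse one UTF-8 -> Windows-1252 mis-decode, or return text unchanged."""
    try:
        return text.encode("cp1252").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Not representable in cp1252, or the bytes are not valid UTF-8:
        # the string was most likely not double-encoded in the first place.
        return text

print(undo_double_encoding("It’s fixed"))
print(undo_double_encoding("plain ASCII"))
print(undo_double_encoding("café"))
```

A caveat: a legitimate string that happens to look like mojibake (for example one that really contains "Ã©") would also be converted, so a helper like this is best applied only to data known to be damaged.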
I have some documents where … was showing as … and ê was showing as ê. This is how it got there (Python code):
# Adam edits original file using windows-1252
windows = b'\x85\xea'  # bytes literal, so .decode() also works on Python 3
# that is HORIZONTAL ELLIPSIS, LATIN SMALL LETTER E WITH CIRCUMFLEX
# Beth reads it correctly as windows-1252 and writes it as utf-8
utf8 = windows.decode("windows-1252").encode("utf-8")
print(utf8)
# Charlie reads it *incorrectly* as windows-1252 writes a twingled utf-8 version
twingled = utf8.decode("windows-1252").encode("utf-8")
print(twingled)
# detwingle by reading as utf-8 and writing as windows-1252 (it's really utf-8)
detwingled = twingled.decode("utf-8").encode("windows-1252")
assert utf8==detwingled
To fix the problem, I used python code like this:
with open("dirty.html","rb") as f:
    dt = f.read()
ct = dt.decode("utf8").encode("windows-1252")
with open("clean.html","wb") as g:
    g.write(ct)
(Because someone had inserted the twingled version into a correct UTF-8 document, I actually had to extract only the twingled part, detwingle it and insert it back in. I used BeautifulSoup for this.)
It is far more likely that you have a Charlie in content creation than that the web server configuration is wrong. You can also force your web browser to twingle the page by selecting windows-1252 encoding for a utf-8 document. Your web browser cannot detwingle the document that Charlie saved.
Note: the same problem can happen with any other single-byte code page (e.g. latin-1) instead of windows-1252.
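Which code page did the damage matters when reversing it. The Python sketch below compares Windows-1252 with latin-1 on the ellipsis character; it is an illustration, not part of the original fix.

```python
utf8 = "…".encode("utf-8")            # e2 80 a6

via_cp1252 = utf8.decode("cp1252")    # three visible characters
via_latin1 = utf8.decode("latin-1")   # 'â', an invisible control character (U+0080), '¦'

print(repr(via_cp1252), repr(via_latin1))

# Detwingling works only with the code page that did the damage.
print(via_latin1.encode("latin-1").decode("utf-8"))
try:
    via_latin1.encode("cp1252")
except UnicodeEncodeError as exc:
    print("cp1252 cannot undo a latin-1 twingle:", exc.reason)
```

If a detwingle attempt raises an encode error on characters in the U+0080 to U+009F range, trying latin-1 instead of windows-1252 is a reasonable next step.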
You have a mismatch in your character encoding; your string is encoded in one encoding (UTF-8) and whatever is interpreting this page is using another (say ASCII).
Always specify your encoding in your http headers and make sure this matches your framework's definition of encoding.
Sample http header:
Content-Type: text/html; charset=utf-8
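To check that a declared charset matches the bytes actually sent, the header value can be parsed and used to decode the body. This Python sketch uses the standard library's `email.message.Message`; the helper name `declared_charset` and the ISO-8859-1 fallback are assumptions made for illustration.

```python
from email.message import Message

def declared_charset(content_type: str, default: str = "iso-8859-1") -> str:
    """Extract the charset parameter from a Content-Type header value."""
    msg = Message()
    msg["Content-Type"] = content_type
    return msg.get_content_charset() or default

body = "It’s fine".encode("utf-8")     # bytes produced by the framework
charset = declared_charset("text/html; charset=utf-8")
print(charset, body.decode(charset))

# A header that disagrees with the bytes reproduces the symptom.
wrong = declared_charset("text/html; charset=windows-1252")
print(body.decode(wrong))
```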
Setting encoding in asp.net
<configuration>
  <system.web>
    <globalization
      fileEncoding="utf-8"
      requestEncoding="utf-8"
      responseEncoding="utf-8"
      culture="en-US"
      uiCulture="de-DE"
    />
  </system.web>
</configuration>
Setting encoding in jsp
If your content type is already UTF-8, then it is likely the data is already arriving in the wrong encoding. If you are getting the data from a database, make sure the database connection uses UTF-8.
If this is data from a file, make sure the file is encoded correctly as UTF-8. You can usually set this in the "Save as..." Dialog of the editor of your choice.
If the data is already broken when you view it in the source file, chances are that it used to be a UTF-8 file but was saved in the wrong encoding somewhere along the way.
If someone gets this error on WordPress website, you need to change wp-config db charset:
define('DB_CHARSET', 'utf8mb4_unicode_ci');
instead of:
define('DB_CHARSET', 'utf8mb4');
If the other answers haven't helped, you might want to check whether your database is actually storing the mojibake characters. I was viewing the text in utf-8, but I was still seeing the mojibake and it turned out that, due to a database upgrade, the text had been permanently "mojibaked".
In this case, one option is to "fix" the text with Python's ftfy package (or JavaScript version here).
You must have copied and pasted text from a Word document. Word documents use smart quotes. You can replace the character with the special character entity (&rsquo;) or simply type a plain apostrophe (') in your HTML editor.
I'm sure this will solve your problem.
In DBeaver (or other editors), the script file you're working on can prompt you to save as UTF-8, and that will change the char:
–
into
–
or
–
The same thing happened to me with the '–' character (long minus sign).
I used this simple replace to resolve it:
htmlText = htmlText.Replace("–", "-");