I'm having trouble sharing messages containing the Scandinavian ä and ö to Twitter through a share button on my site. If I use percent-encoded byte values above %7F, I just bump into an "Invalid Unicode value in one or more parameters" error.
An example: http://twitter.com/home/?status=%40user+blah%26%E4
I've tried a bunch of different encodings, but none seem to work with ä, ö etc.
Anyone found a solution for this?
Edit:
Part of this problem is related to which address you point the share tweet at. Links to http://twitter.com/home/?status=%40user+blah%26%E4%C3%A4
and
http://www.twitter.com/home/?status=%40user+blah%26%E4%C3%A4
yield very different results.
UTF-8 represents code points above U+007F using more than one byte. So when you want ä (U+00E4), the UTF-8 representation is the two bytes C3 A4 and thus the percent-encoding is %C3%A4. A handy website that will help you with these conversions is https://www.url-encode-decode.com
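If you're generating the link programmatically, here's a minimal sketch in Python (the status text is from your example; quote_plus is just one way to produce the encoding):

from urllib.parse import quote_plus

# quote_plus encodes the string as UTF-8 and percent-encodes each byte,
# turning spaces into '+' as in your example URL.
status = "@user blah&ä"
print(quote_plus(status))  # %40user+blah%26%C3%A4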
I am trying to read string data from a .txt file which has special Turkish characters in it.
I want to store the content in a string. I tried methods like textscan and fileread, but instead of the special Turkish characters ş, ç, ı, ö, ğ I get some weird symbols. Is there any way to do that?
I created a file called turkish.txt with the characters you mentioned (ş,ç,ı,ö,ğ). Trying to read it gave me the following:
fid = fopen('turkish.txt','r','n','UTF-8');  % open for reading; the encoding argument only affects text-mode reads
str = fread(fid);                            % fread returns the raw bytes (as doubles), bypassing the encoding
fclose(fid);
native2unicode(str')                         % interprets the bytes using MATLAB's default encoding
ans =
ÿþ_, ç , 1, ö ,
As you can see, ş,ı,ğ are not rendered correctly. If you type
help slCharacterEncoding
You can see a list of the encodings most commonly supported by platforms. I played with the encodings a little; some of those I checked were:
ISO-8859-1
US-ASCII
Windows-1252
Shift_JIS
The last one is for Japanese characters. It covers some of the Turkish characters, which were rendered correctly, such as ç and ö, but not all of them.
If you skim through the docs it says:
If you want to use a different character encoding, you need to start MATLAB with the appropriate locale settings for your operating system. Consult your operating system manual to change the locale setting.
The instructions for setting the locale on Windows platforms, which I haven't tried, can be found here.
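If you can step outside MATLAB for a moment, inspecting the raw bytes shows what the file really contains. The ÿþ at the start of the output above is the byte pair FF FE, which looks like a UTF-16 little-endian byte order mark, suggesting the file was saved as UTF-16 rather than UTF-8. A quick sketch in Python (assuming the file name from above):

with open('turkish.txt', 'rb') as f:
    raw = f.read()

print(raw[:2].hex())             # 'fffe' would confirm a UTF-16 LE byte order mark
if raw.startswith(b'\xff\xfe'):
    print(raw.decode('utf-16'))  # should render ş, ç, ı, ö, ğ correctly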
Hope it helps.
Some charsets don't have all of the first 128 characters identical to ASCII, but are A to Z and a to z always in the same positions?
I had a plan to set Apache's default charset to something odd in my test environment, to make it easy to detect sites that don't tell the browser what encoding they're sending.
But so far, I can't find one that makes A to Z show up as something else.
There is another question close to this subject, but that's about all 128 ASCII chars:
Are ASCII characters always encoded the same way in all character encodings?
No, EBCDIC from IBM is the famous exception.
Another test case is UTF-16 Big Endian, which encodes "A" as the two bytes 00 41. Software expecting ASCII would treat the leading 00 as a NUL, which is often interpreted as an end-of-string.
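To make the difference concrete, a quick check in Python (cp500 is one of the EBCDIC code pages):

print("A".encode("ascii"))      # b'A'      -> byte 0x41
print("A".encode("utf-16-be"))  # b'\x00A'  -> bytes 0x00 0x41
print("A".encode("cp500"))      # b'\xc1'   -> EBCDIC puts "A" at 0xC1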
In short: no. The encoding mentioned in the other answer, EBCDIC, has a very different layout, to pick just one example.
Most encodings you find in the wild on the web today are probably ASCII-compatible. But there are also encodings in use that are entirely incompatible with ASCII.
My other question brought up a related question:
Is there a standard table of Unicode to ASCII transcriptions? Think for instance of German ü mapping to ue.
User bobince mentioned in a comment that other languages use the same character in a different way, and I fear they may not only use the same glyph but also the same code point. Hence mapping e.g. "ü" to "u" would also be acceptable (mapping by visual similarity). So is mapping ü to "u (a quote plus the letter), as done by iconv (see for instance the link posted by Juancho).
The methods shown in the link posted by Juancho are technically working solutions. However, is there a formal standard for such a mapping, or at least a mapping used as a quasi-standard? Ideally it would also include, for instance, phonetics-based transcriptions for non-Latin characters. I remember that one exists for Japanese kana and Greek characters. It shouldn't be a big problem in that regard either.
There is no formal standard for such mappings. Mappings that deal with Latin letters in general (like ü, é and ß), mapping them all to Ascii, are not really transcriptions or transliterations but just, well, mappings, which might be called simplifications or Asciifications. They are performed for various purposes, often in an ad hoc way.
Mapping ü to ue is rather common in German and might be called an unofficial or de facto standard for German names when ü cannot be used. But other languages use other rules, and it would be odd to Asciify French or Spanish that way; instead, the diacritic would just be dropped, mapping ü to u.
People may map e.g. ü to u" when they are forced (or they believe they are forced) to use Ascii and yet want to convey the message that the u has a diaeresis on it.
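For the diacritic-dropping flavor (the French/Spanish style, not the German ü → ue convention), a common recipe is Unicode decomposition followed by stripping the combining marks. A sketch in Python; note it is lossy, since e.g. ß has no decomposition and simply disappears:

import unicodedata

def asciify(text):
    # NFKD splits ü into u + U+0308 COMBINING DIAERESIS ...
    decomposed = unicodedata.normalize('NFKD', text)
    # ... and encoding to ASCII with errors='ignore' drops the marks
    # (and any character that has no decomposition, such as ß).
    return decomposed.encode('ascii', 'ignore').decode('ascii')

print(asciify('über'))  # uber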
I am importing a .txt file from a remote server and saving it to a database, using a .NET script. I sometimes notice a garbled word/characters (Ullerهkersvنgen) inside the files, which causes problems when saving to the database.
I want to filter all such characters and convert them to Unicode before saving to the database.
Note: I have been through many similar posts but had no luck.
Your help in this context will be highly appreciated.
Thanks.
Assuming your script knows the correct encoding of your text snippet, this regular expression should find all non-ASCII characters:
[^\x00-\x7F]+
see here: https://stackoverflow.com/a/20890052/1144966 and https://stackoverflow.com/a/8845398/1144966
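Your question is about .NET, but the character class behaves the same in any regex engine; a quick illustration in Python using your example word:

import re

non_ascii = re.compile(r'[^\x00-\x7F]+')
print(non_ascii.findall('Ullerهkersvنgen'))  # ['ه', 'ن']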
Also, the base-R tools package provides two functions to detect non-ASCII characters:
tools::showNonASCII()
tools::showNonASCIIfile()
You need to know or at least guess the character encoding of the data in order to be able to convert it properly. So you should try and find information about the origin and format of the text file and make sure that you read the file properly in your software.
For example, “Ullerهkersvنgen” looks like a Scandinavian name, with Scandinavian letters in it, misinterpreted according to a wrong character encoding assumption or munged by an incorrect character code conversion. The first Arabic letter in it, “ه”, is U+0647 ARABIC LETTER HEH. In the ISO-8859-6 encoding, it is E7 (hex.); in windows-1256, it is E5. Since Scandinavian text is normally represented in ISO-8859-1 or windows-1252 (when Unicode encodings are not used), it is natural to check what E7 and E5 mean in them: “ç” and “å”. For linguistic reasons, the latter is much more probable here. The second Arabic letter is “ن” U+0646 ARABIC LETTER NOON, which is E4 in windows-1256. And in ISO-8859-1, E4 is “ä”. This makes perfect sense: the word is “Ulleråkersvägen”, a real Swedish street name (in Uppsala, at least).
Thus, the data is probably ISO-8859-1 or windows-1252 (Windows Latin 1) encoded text, incorrectly interpreted as windows-1256 (Windows Arabic). No conversion is needed; you just need to read the data as windows-1252 encoded. (After reading, it can of course be converted to another encoding.)
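If the damage has already happened and you only have the garbled string, the misinterpretation can be reversed by re-encoding with the wrongly assumed encoding and decoding with the right one. A sketch in Python; this round trip only works because every garbled character happens to survive both encodings:

garbled = 'Ullerهkersvنgen'
# Recover the original bytes by encoding with the wrongly assumed Arabic
# code page, then decode those bytes as Windows Latin 1.
fixed = garbled.encode('windows-1256').decode('windows-1252')
print(fixed)  # Ulleråkersvägen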
Some UTF-8 characters, like the UTF-8 sequence C2 96 (which I expected to be a hyphen), don't display properly. The browser shows a box (with 00 96 in it) instead of '-' (a hyphen). Any reasons for this behavior? How do we correct this?
http://stuffofinterest.com/misc/utf8.php?s=128 (Refer this URL for the codes)
I found that this can be handled with HTML entities. Is there any way to display this without converting to HTML entities?
The character you're talking about is an en-dash, not a hyphen. Its Unicode code point is U+2013, and its UTF-8 encoding is E2 80 93, not C2 96. That table you linked to is incorrect. The first two columns have nothing to do with UCS-2 or Unicode; they actually contain the windows-1252 encodings for the characters in question. The columns labeled "UTF-8 Hex" and "UTF-8 Native" are just plain wrong, at least for the rows labeled 128 to 159. The numeric references &#150; and &#8211; both display as an en-dash (browsers treat character references in the 128-159 range as windows-1252), but the UTF-8 sequence C2 96 represents a non-displayable control character.
You shouldn't need to encode those characters manually anyway. Just tell your text editor (or whatever you use to create the content) to save the file as UTF-8.
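A quick way to sanity-check the bytes, sketched in Python:

print('–'.encode('utf-8').hex(' '))        # e2 80 93 -> the real en-dash (U+2013)
print(ascii(b'\xc2\x96'.decode('utf-8')))  # '\x96'   -> control character U+0096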
I suspect this is because the characters between U+0080 and U+009F inclusive are control characters. I'm still slightly surprised that they show differently when encoded directly in the HTML than using entities, but basically you shouldn't be using them to start with. U+0096 isn't really "hyphen", it's "start of guarded area".
See the U+0080-U+00FF code chart for more information. Basically, try to avoid control characters...
Two reasons come to mind:
Are you sure that you have output the correct character code to the browser? Better check in some hex viewer.
The font you are using doesn't have a glyph defined at this code point.