How to set receivedDataEncoding for big5 chinese? - iphone

I am having trouble with received data from a Big5-encoded Chinese web page,
and I tried to find sample code, but could not find what I need for Big5. For example:
if ([encodingName isEqualToString:@"euc-jp"]) {
    receivedDataEncoding = NSJapaneseEUCStringEncoding;
} else {
    receivedDataEncoding = NSUTF8StringEncoding;
}
What should replace NSJapaneseEUCStringEncoding for the Big5 Chinese encoding?
Thanks in advance.

You can use the kCFStringEncodingBig5_E constant, which is declared in
CoreFoundation/CFStringEncodingExt.h. Note that it is a CFStringEncoding, not an NSStringEncoding, so convert it before assigning:
receivedDataEncoding = CFStringConvertEncodingToNSStringEncoding(kCFStringEncodingBig5_E);
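Independent of the Cocoa constant, it can help to sanity-check the Big5 bytes themselves; a minimal sketch in Python, whose big5 codec covers the same character repertoire:

```python
# "中文" ("Chinese") encoded in Big5 is the well-known byte sequence A4 A4 A4 E5.
data = b"\xa4\xa4\xa4\xe5"
text = data.decode("big5")
print(text)  # 中文
```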

Related

Arabic text file couldn’t be opened

I am trying to open a .srt file as a String:
let str = try String(contentsOfFile: filePath, encoding: .utf8)
But I get this error every time I try to read one particular .srt file:
couldn't be opened using text encoding
sampleFile
What is the problem?
Often, when you are not sure about the encoding, you can actually request that init(contentsOfFile:usedEncoding:) determine the encoding for you during the conversion process, e.g.
let fileURL = Bundle.main.url(forResource: "2", withExtension: "srt")!
var encoding: String.Encoding = .utf8
do {
    let string = try String(contentsOfFile: fileURL.path, usedEncoding: &encoding)
    print(encoding)
    print(string)
} catch {
    print(error)
}
Unfortunately, in this case, it throws an error:
The file “2.srt” couldn’t be opened because the text encoding of its contents can’t be determined.
Needless to say, this obviously is not utf8, much less any other encoding that Foundation understands.
When you look at the file in a hex editor (in this case, in Xcode, right-clicking on the file and choosing “Open As” » “Hex”), it shows:
So, we can see that this is text, but the Arabic is not in utf8. And, as https://subtitletools.com/ says,
Text encoding is a tricky thing. Years ago, there were hundreds of different text encodings in an attempt to support all languages and character sets. Nowadays all these different languages can be encoded in unicode UTF-8, but unfortunately all the files from years ago still exist, and some stubborn countries still use old text encodings. Many devices have trouble displaying text encodings that are not UTF-8, they will display the text as random, unreadable characters.
After a little investigation, it would appear that this is a Windows-1256 format. I could not open this on a Mac, but Word on my PC gave me a choice of encodings, and “Arabic (Windows)” looks promising:
Now, having confirmed the encoding, I would have thought that I could use CFStringEncodings.windowsArabic and CFStringCreateWithBytes, but I could not get that to work.
So, in the end, I built my own cross reference table:
class Windows1256 {
    private static let values: [UInt32] = [
        0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x000E, 0x000F,
        0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017, 0x0018, 0x0019, 0x001A, 0x001B, 0x001C, 0x001D, 0x001E, 0x001F,
        0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
        0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037, 0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
        0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047, 0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
        0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057, 0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
        0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067, 0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
        0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077, 0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x007F,
        0x20AC, 0x067E, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, 0x02C6, 0x2030, 0x0679, 0x2039, 0x0152, 0x0686, 0x0698, 0x0688,
        0x06AF, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, 0x06A9, 0x2122, 0x0691, 0x203A, 0x0153, 0x200C, 0x200D, 0x06BA,
        0x00A0, 0x060C, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7, 0x00A8, 0x00A9, 0x06BE, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
        0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7, 0x00B8, 0x00B9, 0x061B, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x061F,
        0x06C1, 0x0621, 0x0622, 0x0623, 0x0624, 0x0625, 0x0626, 0x0627, 0x0628, 0x0629, 0x062A, 0x062B, 0x062C, 0x062D, 0x062E, 0x062F,
        0x0630, 0x0631, 0x0632, 0x0633, 0x0634, 0x0635, 0x0636, 0x00D7, 0x0637, 0x0638, 0x0639, 0x063A, 0x0640, 0x0641, 0x0642, 0x0643,
        0x00E0, 0x0644, 0x00E2, 0x0645, 0x0646, 0x0647, 0x0648, 0x00E7, 0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x0649, 0x064A, 0x00EE, 0x00EF,
        0x064B, 0x064C, 0x064D, 0x064E, 0x00F4, 0x064F, 0x0650, 0x00F7, 0x0651, 0x00F9, 0x0652, 0x00FB, 0x00FC, 0x200E, 0x200F, 0x06D2
    ]

    static let scalars = values.map { Unicode.Scalar($0)! }

    static func convert(_ data: Data) -> String {
        var string = ""
        string.reserveCapacity(data.count)
        for byte in data {
            string += String(scalars[Int(byte)])
        }
        return string
    }
}
Then I can convert that
let string = Windows1256.convert(data)
And that yields:
1
00:00:00,276 --> 00:00:02,401
:انچه گذشت
قرباني لين دوهرسته
2
00:00:02,403 --> 00:00:04,870
الان ديگه ما دنبال يه مجرم سريالي هستيم
I do not read or write Arabic, so I cannot verify this solution, but it looks promising to my untrained eye.
Needless to say, having done this conversion, you can write the string to a file (or convert it to a Data) using .utf8 if you need to save the result in that encoding.
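For what it's worth, platforms that ship a Windows-1256 codec can do the same conversion without a hand-built table; a minimal sketch in Python, with byte values taken from the table above:

```python
# The Arabic word "الان" in Windows-1256 is the byte sequence C7 E1 C7 E4
# (alef, lam, alef, noon; see rows 0xC0 and 0xE0 of the table above).
data = bytes([0xC7, 0xE1, 0xC7, 0xE4])
text = data.decode("cp1256")
print(text)  # الان
```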

Read ansi file (persian language) file from internet and show that in textbox

I created a text file in ANSI format and wrote Persian words in it. Now I read it with this code:
System.Net.WebClient wc = new System.Net.WebClient();
string textBoxNewsRight2Left = wc.DownloadString("http://dl.rosesoftware.ir/RoseSoftware%20List/Settings/News.Settings.txt");
MessageBox.Show(textBoxNewsRight2Left);
but I see ???????????????????? characters!
I changed the file format to UTF-8 and that fixed the Persian words, but then I found another problem, this time with the English words!
Once again I used the same code, now with the UTF-8 format file:
System.Net.WebClient wc = new System.Net.WebClient();
string textBoxNewsRight2Left = wc.DownloadString("http://dl.rosesoftware.ir/RoseSoftware%20List/Settings/News.Settings.txt");
MessageBox.Show(textBoxNewsRight2Left);
Now I see "ÿþr" and I don't see the words from my file!
Now I use:
wc.Encoding = Encoding.Unicode;
Now I see my file's words, without the "ÿþr".
In the original file I have hello, and after reading it with C# I appear to have hello as well,
so now I test whether textBoxNewsRight2Left equals "hello":
if (textBoxNewsRight2Left == "hello")
{
MessageBox.Show("Equal");
}
else
{
MessageBox.Show("Not Equal");
}
Now I see the message Not Equal, but the text looks like hello!
What is the problem? How can I fix it?
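One likely culprit for the Not Equal result is the byte-order mark: a file saved as "Unicode" (UTF-16LE) begins with the bytes FF FE, which is what rendered earlier as "ÿþ", and which survives decoding as an invisible U+FEFF prefix on the string. A minimal sketch of the same comparison failure, in Python:

```python
# A file saved by Notepad as "Unicode" (UTF-16LE) begins with a byte-order mark.
raw = "\ufeffhello".encode("utf-16-le")   # bytes as they would sit on disk
text = raw.decode("utf-16-le")            # the BOM survives as invisible U+FEFF
print(text == "hello")                    # False
print(text.lstrip("\ufeff") == "hello")   # True once the BOM is stripped
```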

Facebook Graph API - non English album names

I am trying to do a simple thing: get all my albums.
The problem is that the album names are not in English (they are in Hebrew).
The code that retrieves the albums :
string query = "https://graph.facebook.com/me/albums?access_token=...";
string result = webClient.DownloadString(query);
And this is what one of the returned albums looks like:
{
"id": "410329886431",
"from": {
"name": "Noam Levinson",
"id": "500786431"
},
"name": "\u05ea\u05e2\u05e8\u05d5\u05db\u05ea \u05d2\u05de\u05e8 \u05e9\u05e0\u05d4 \u05d0",
"location": "\u05e9\u05e0\u05e7\u05e8",
"link": "http://www.facebook.com/album.php?aid=193564&id=500786431",
"count": 27,
"type": "normal",
"created_time": "2010-07-18T06:20:27+0000",
"updated_time": "2010-07-18T09:29:34+0000"
},
As you can see, the problem is in the "name" property. Instead of Hebrew letters
I get these codes (the codes are not garbage; they are consistent, and each one probably represents a single Hebrew letter).
The question is: how can I convert those codes to a non-English language (in my case, Hebrew)?
Or maybe the problem is in how I retrieve the albums with the webClient object; should I change webClient.Encoding somehow?
What can I do to solve this problem?
Thanks in advance.
That's how Unicode is represented in JSON (see the char definition in the JSON grammar). These are escape sequences in which the four hex digits are the Unicode code point of the character. Note that since a single escape holds only four hex digits, it can only represent characters from the BMP; characters outside the BMP are written as a surrogate pair of two escapes.
Any decent JSON parser will transform these Unicode escape sequences into properly encoded characters for you - provided the target encoding supports the character in the first place.
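To illustrate, here is a minimal sketch (in Python, though any JSON library behaves the same way) using the name value from the question:

```python
import json

# The \uXXXX escapes are decoded automatically by the JSON parser.
payload = '{"name": "\\u05ea\\u05e2\\u05e8\\u05d5\\u05db\\u05ea"}'
album = json.loads(payload)
print(album["name"])  # תערוכת (readable Hebrew)
```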
I had the same problem with the Facebook Graph API and escaped Unicode Romanian characters. I used PHP, but you can probably translate the regexp method into JavaScript.
Method 1 (PHP):
$str = "\u05ea\u05e2\u05e8\u05d5\u05db\u05ea";
function esc_unicode2html($string) {
    return preg_replace('/\\\\u([0-9a-fA-F]{4})/', '&#x$1;', $string);
}
echo esc_unicode2html($str);
Method 2 (PHP), and it probably also works if you declare the charset directly in the HTML:
header('content-type:text/html;charset=utf-8');
These are Unicode character codes. The \u sequence tells the parser that the next 4 characters form a Unicode character number. What these characters look like will depend on your font; if someone does not have the correct font, they may just appear as a lot of square boxes.
That's about as much as I know; Unicode is complicated.
For Hebrew texts, this code in PHP will solve the problem:
$str = '\u05ea\u05e2\u05e8\u05d5\u05db\u05ea \u05d2\u05de\u05e8 \u05e9\u05e0\u05d4 \u05d0';
function decode_encoded_utf8($string){
    return preg_replace_callback('#\\\\u([0-9a-f]{4})#ism', function($matches) {
        return mb_convert_encoding(pack("H*", $matches[1]), "UTF-8", "UCS-2BE");
    }, $string);
}
echo decode_encoded_utf8($str); // will show (תערוכת גמר שנה א) text
For Arabic texts use this:
$str = '\u00d8\u00ae\u00d9\u0084\u00d8\u00b5';
function decode_encoded_utf8($string){
    return preg_replace_callback('#\\\\u([0-9a-f]{4})#ism', function($matches) {
        return mb_convert_encoding(pack("H*", $matches[1]), "UTF-8", "UCS-2BE");
    }, $string);
}
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", decode_encoded_utf8($str));

What is the character encoding?

I have several characters that aren't recognized properly.
Characters like:
º
á
ó
(etc.)
This means that the character encoding is not UTF-8, right?
So, can you tell me what character encoding it could be, please?
We don't have nearly enough information to really answer this, but the gist of it is: you shouldn't just guess. You need to work out where the data is coming from, and find out what the encoding is. You haven't told us anything about the data source, so we're completely in the dark. You might want to try Encoding.Default if these are files saved with something like Notepad.
If you know what the characters are meant to be and how they're represented in binary, that should suggest an encoding... but again, we'd need to know more information.
read this first http://www.joelonsoftware.com/articles/Unicode.html
There are two encodings involved: the one that was used to encode the string and the one that is used to decode it. They must be the same to get the expected result. If they differ, some characters will be displayed incorrectly. We can try to guess if you post the actual and expected results.
I wrote a couple of methods to narrow down the possibilities a while back for situations just like this.
using System;
using System.Collections.Generic;
using System.Text;

static void Main(string[] args)
{
    Encoding[] matches = FindEncodingTable('Ÿ');
    Encoding[] enc2 = FindEncodingTable(159, 'Ÿ');
}

// Locates all Encodings with the specified character at the specified position
// "CharacterPosition": decimal position of the character in the unknown encoding table, e.g. 159 in the extended ASCII table
// "character": the character to locate in the encoding table, e.g. 'Ÿ' in the extended ASCII table
static Encoding[] FindEncodingTable(int CharacterPosition, char character)
{
    List<Encoding> matches = new List<Encoding>();
    byte myByte = (byte)CharacterPosition;
    byte[] bytes = { myByte };
    foreach (EncodingInfo encInfo in Encoding.GetEncodings())
    {
        Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
        char[] chars = thisEnc.GetChars(bytes);
        if (chars[0] == character)
        {
            matches.Add(thisEnc);
        }
    }
    return matches.ToArray();
}

// Locates all Encodings that can round-trip the specified character
static Encoding[] FindEncodingTable(char character)
{
    List<Encoding> matches = new List<Encoding>();
    foreach (EncodingInfo encInfo in Encoding.GetEncodings())
    {
        Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
        char[] chars = { character };
        byte[] temp = thisEnc.GetBytes(chars);
        // GetBytes never returns null; check that decoding the bytes gives the character back
        if (temp.Length > 0 && thisEnc.GetChars(temp)[0] == character)
            matches.Add(thisEnc);
    }
    return matches.ToArray();
}
Encoding is the transformation of some existing content into a form that can be parsed by the required destination protocols.
An example of encoding can be seen when browsing the internet:
The URL you visit, www.example.com, may offer a search facility that runs custom searches via the URL address:
www.example.com?search=...
The variables on the URL require URL encoding. If you were to write:
www.example.com?search=cat food cheap
the browser wouldn't understand your request, because you have used an invalid character, ' ' (a white space).
To correct this encoding error you should exchange the ' ' for '%20', to form this URL:
www.example.com?search=cat%20food%20cheap
Different systems use different forms of encoding; in this example I have used standard percent-encoding for a URL. In other applications and instances you may need to use other types of encoding.
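The space-to-%20 exchange described above can be checked with any URL-encoding helper; a minimal sketch in Python:

```python
from urllib.parse import quote

# Percent-encode the reserved space character for use in a URL query.
print(quote("cat food cheap"))  # cat%20food%20cheap
```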
Good Luck!

How to recover text from a wrong encoding?

I have some files created on Asian OSes (Chinese and Japanese XP).
The file names are garbled, for example:
иè+¾«Ñ¡Õä²ØºÏ¼­
How can I recover the original text?
I tried this in C#:
Encoding unicode = Encoding.Unicode;
Encoding chinese = Encoding.GetEncoding(936);
byte[] chineseBytes = chinese.GetBytes(garbledString);
byte[] unicodeBytes = Encoding.Convert(unicode, chinese, chineseBytes);
// (then convert the bytes to a string)
and I tried changing unicode to windows-1252, but no luck.
It's double-encoded text. The original is in Windows-936; then some application assumed the text was in ISO-8859-1 and encoded the result to UTF-8. Here is an example of how to decode it in Python 2:
>>> print 'иè+¾«Ñ¡Õä²ØºÏ¼­'.decode('utf8').encode('latin1').decode('cp936')
新歌+精选珍藏合辑
I'm sure you can do something similar in C#.
Encoding unicode = Encoding.Unicode;
That's not what you want. “Unicode” is Microsoft's totally misleading name for what is really the UTF-16LE encoding. UTF-16LE plays no part here, what you have is a simple case where a 936 string has been misdecoded as 1252.
Windows codepage 1252 is similar but not the same as ISO-8859-1. There is no way to tell which is in the example string as it does not contain any of the bytes 0x80-0x9F which are different in the two encodings, but I'm assuming 1252 because that's the standard codepage on a western Windows install.
Encoding latin = Encoding.GetEncoding(1252);
Encoding chinese = Encoding.GetEncoding(936);
string recovered = new string(chinese.GetChars(latin.GetBytes(s)));
The first argument to Encoding.Convert is the source encoding; shouldn't that be chinese in your case? So
Encoding.Convert(chinese, unicode, chineseBytes);
might actually work. Because, after all, you want to convert CP-936 to Unicode and not vice-versa. And I'd suggest you don't even try bothering with CP-1252 since your text there is very likely not Latin.
This is an old question, but I just ran into the same situation while trying to migrate WordPress upload files off an old Windows Server 2008 R2 server. bobince's answer set me on the right track, but I had to search for the right encoding/decoding pair.
With the following C#, I found the relevant pair:
using System;
using System.Text;

public class Program
{
    public static void Main()
    {
        // On .NET Core / .NET 5+, register the code-page encodings first:
        // Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);

        // garbled
        string s = "2020竹慶本樂ä»æ³¢åˆ‡äºžæ´²æ³•ç­µ-Intro-2-1024x643.jpg";
        // expected
        string t = "2020竹慶本樂仁波切亞洲法筵-Intro-2-1024x643.jpg";
        foreach (EncodingInfo ei in Encoding.GetEncodings())
        {
            Encoding e = ei.GetEncoding();
            foreach (EncodingInfo ei2 in Encoding.GetEncodings())
            {
                Encoding e2 = ei2.GetEncoding();
                var s2 = e2.GetString(e.GetBytes(s));
                if (s2 == t)
                {
                    Console.WriteLine($"e1={ei.DisplayName} (CP {ei.CodePage}), e2={ei2.DisplayName} (CP {ei2.CodePage})");
                    Console.WriteLine(t);
                    Console.WriteLine(s2);
                }
            }
        }
        Console.WriteLine("-----------");
        Console.WriteLine(t);
        Console.WriteLine(Encoding.GetEncoding(65001).GetString(Encoding.GetEncoding(1252).GetBytes(s)));
    }
}
It turned out that the correct encoding/decoding pair in my case was:
e1=Western European (Windows) (CP 1252), e2=Unicode (UTF-8) (CP 65001)
So the last line of the code is a one-liner for the correct conversion:
Console.WriteLine(Encoding.GetEncoding(65001).GetString(Encoding.GetEncoding(1252).GetBytes(s)));
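The general shape of this repair — re-encode with the wrongly assumed code page, then decode with the right one — can also be sketched in Python (using an illustrative string rather than the file name from the question, since that name is only partially garbled):

```python
# Simulate the damage: UTF-8 bytes misread as Windows-1252 (cp1252)...
original = "日本語"
garbled = original.encode("utf-8").decode("cp1252")
# ...and the repair: re-encode with the wrong codec, decode with the right one.
recovered = garbled.encode("cp1252").decode("utf-8")
assert recovered == original
```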