I am trying to open .srt file using String:
let str = try String(filePath , encoding : .utf8)
But I get this error every time I want to read specific .srt file:
couldn’t be opened using text encoding
sampleFile
what is the problem?
Often, when you are not sure about the encoding, you can actually request that init(contentsOfFile:usedEncoding:) determine the encoding for you during the conversion process, e.g.
let fileURL = Bundle.main.url(forResource: "2", withExtension: "srt")!
var encoding: String.Encoding = .utf8
do {
let string = try String(contentsOfFile: fileURL.path, usedEncoding: &encoding)
print(encoding)
print(string)
} catch {
print(error)
}
Unfortunately, in this case, it throws an error:
The file “2.srt” couldn’t be opened because the text encoding of its contents can’t be determined.
Needless to say, this obviously is not utf8, much less any other encoding that Foundation understands.
When you look at the file in a hex editor (in this case, in Xcode, right clicking on the file and choose “Open as” » “Hex”), it shows:
So, we can see that this is text, but the Arabic is not in utf8. And, as https://subtitletools.com/ says,
Text encoding is a tricky thing. Years ago, there were hundreds of different text encodings in an attempt to support all languages and character sets. Nowadays all these different languages can be encoded in unicode UTF-8, but unfortunately all the files from years ago still exist, and some stubborn countries still use old text encodings. Many devices have trouble displaying text encodings that are not UTF-8, they will display the text as random, unreadable characters.
After a little investigation, it would appear that this is a Windows-1256 format. I could not open this on a Mac, but Word on my PC gave me a choice of encodings, and “Arabic (Windows)” looks promising:
Now, having confirmed the encoding, I would have thought that I could use CFStringEncodings.windowsArabic and CFStringCreateWithBytes, but I could not get that to work.
So, in the end, I built my own cross reference table:
class Windows1256 {
private static let values: [UInt32] = [
0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x000E, 0x000F,
0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017, 0x0018, 0x0019, 0x001A, 0x001B, 0x001C, 0x001D, 0x001E, 0x001F,
0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037, 0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047, 0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057, 0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067, 0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077, 0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x007F,
0x20AC, 0x067E, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, 0x02C6, 0x2030, 0x0679, 0x2039, 0x0152, 0x0686, 0x0698, 0x0688,
0x06AF, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, 0x06A9, 0x2122, 0x0691, 0x203A, 0x0153, 0x200C, 0x200D, 0x06BA,
0x00A0, 0x060C, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7, 0x00A8, 0x00A9, 0x06BE, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7, 0x00B8, 0x00B9, 0x061B, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x061F,
0x06C1, 0x0621, 0x0622, 0x0623, 0x0624, 0x0625, 0x0626, 0x0627, 0x0628, 0x0629, 0x062A, 0x062B, 0x062C, 0x062D, 0x062E, 0x062F,
0x0630, 0x0631, 0x0632, 0x0633, 0x0634, 0x0635, 0x0636, 0x00D7, 0x0637, 0x0638, 0x0639, 0x063A, 0x0640, 0x0641, 0x0642, 0x0643,
0x00E0, 0x0644, 0x00E2, 0x0645, 0x0646, 0x0647, 0x0648, 0x00E7, 0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x0649, 0x064A, 0x00EE, 0x00EF,
0x064B, 0x064C, 0x064D, 0x064E, 0x00F4, 0x064F, 0x0650, 0x00F7, 0x0651, 0x00F9, 0x0652, 0x00FB, 0x00FC, 0x200E, 0x200F, 0x06D2
]
static let scalars = values.map { Unicode.Scalar($0)! }
static func convert(_ data: Data) -> String {
var string = ""
string.reserveCapacity(data.count)
for byte in data {
string += String(scalars[Int(byte)])
}
return string
}
}
Then I can convert that
let string = Windows1256.convert(data)
And that yields:
1
00:00:00,276 --> 00:00:02,401
:انچه گذشت
قرباني لين دوهرسته
2
00:00:02,403 --> 00:00:04,870SW
الان ديگه ما دنبال يه مجرم سريالي هستيم
I do not read or write Arabic, so I cannot verify this solution, but it looks promising to my untrained eye.
Needless to say, having done this conversion, you can write this to a file (or convert to a Data) using .utf8 if you need to save the result in this encoding.
I am trying to do a simple thing - get all my albums.
the problem is that the album names are non-English ( they are in Hebrew ).
The code that retrieves the albums :
string query = "https://graph.facebook.com/me/albums?access_token=...";
string result = webClient.DownloadString(query);
And this is how one of the returned albums looks like :
{
"id": "410329886431",
"from": {
"name": "Noam Levinson",
"id": "500786431"
},
"name": "\u05ea\u05e2\u05e8\u05d5\u05db\u05ea \u05d2\u05de\u05e8 \u05e9\u05e0\u05d4 \u05d0",
"location": "\u05e9\u05e0\u05e7\u05e8",
"link": "http://www.facebook.com/album.php?aid=193564&id=500786431",
"count": 27,
"type": "normal",
"created_time": "2010-07-18T06:20:27+0000",
"updated_time": "2010-07-18T09:29:34+0000"
},
As you can see the problem is in the "name" property. Instead of Hebrew letters
I get those codes (These codes are not garbage, they are consistent - each code probably represents a single Hebrew letter).
The question is , how can I convert those codes to a non-English language ( In my case, Hebrew).
Or maybe the problem is how I retrive the albums with the webClient object. maybe change webclient.Encoding somehow?
what can I do to solve this problem ?
Thanks in advance.
That's how Unicode is represented in JSON (see the char definition in the sidebar). They are escape sequences in which the four hex digits are the Unicode code point of the character. Note that since there's only four hex digits available, only Unicode characters from the BMP can be represented in JSON.
Any decent JSON parser will transform these Unicode escape sequences into properly encoded characters for you - provided the target encoding supports the character in the first place.
I had the same problem with Facebook Graph Api and escaped unicode Romanian characters. I have used PHP but, you probably can translate the regexp method into javascript.
Method 1 (PHP):
$str = "\u05ea\u05e2\u05e8\u05d5\u05db\u05ea";
function esc_unicode2html($string) {
return preg_replace('/\\\\u([0-9a-z]{4})/', '&#x$1;', $string);
}
echo esc_unicode2html($str);
Method 2 (PHP) and probaby it works also if u declare the charset directly in the html:
header('content-type:text/html;charset=utf-8');
These are Unicode character codes. The \u sequence tells the parser that the next 4 characters are actually form a unicode character number. What these characters look like will depend on your font, if someone does not have the correct font they may just appear as a lot of square boxes.
That's about as much as I know, Unicode is complicated.
For Hebrew texts, this code in PHP will solve the problem:
$str = '\u05ea\u05e2\u05e8\u05d5\u05db\u05ea \u05d2\u05de\u05e8 \u05e9\u05e0\u05d4 \u05d0';
function decode_encoded_utf8($string){
return preg_replace_callback('#\\\\u([0-9a-f]{4})#ism', function($matches) { return mb_convert_encoding(pack("H*", $matches[1]), "UTF-8", "UCS-2BE"); }, $string);
}
echo decode_encoded_utf8($str); // will show (תערוכת גמר שנה א) text
For Arabic texts use this:
$str = '\u00d8\u00ae\u00d9\u0084\u00d8\u00b5';
function decode_encoded_utf8($string){
return preg_replace_callback('#\\\\u([0-9a-f]{4})#ism', function($matches) { return mb_convert_encoding(pack("H*", $matches[1]), "UTF-8", "UCS-2BE"); }, $string);
}
echo iconv("UTF-8", "ISO-8859-1//TRANSLIT", decode_encoded_utf8($str));
I have got some files created from some asian OS (chinese and japanese XPs)
the file name is garbled, for example:
иè+¾«Ñ¡Õä²ØºÏ¼
how i can recover the original text?
I tried with this in c#
Encoding unicode = Encoding.Unicode;
Encoding cinese = Encoding.GetEncoding(936);
byte[] chineseBytes = chinese.GetBytes(garbledString);
byte[] unicodeBytes = Encoding.Convert(unicode, chinese, chineseBytes);
//(Then convert byte in string)
and tried to change unicode to windows-1252 but no luck
It's a double-encoded text. The original is in Windows-936, then some application assumed the text is in ISO-8869-1 and encoded the result to UTF-8. Here is an example how to decode it in Python:
>>> print 'иè+¾«Ñ¡Õä²ØºÏ¼'.decode('utf8').encode('latin1').decode('cp936')
新歌+精选珍藏合辑
I'm sure you can do something similar in C#.
Encoding unicode = Encoding.Unicode;
That's not what you want. “Unicode” is Microsoft's totally misleading name for what is really the UTF-16LE encoding. UTF-16LE plays no part here, what you have is a simple case where a 936 string has been misdecoded as 1252.
Windows codepage 1252 is similar but not the same as ISO-8859-1. There is no way to tell which is in the example string as it does not contain any of the bytes 0x80-0x9F which are different in the two encodings, but I'm assuming 1252 because that's the standard codepage on a western Windows install.
Encoding latin= Encoding.getEncoding(1252);
Encoding chinese= Encoding.getEncoding(936);
chinese.getChars(latin.getBytes(s));
The first argument to Encoding.Convert is the source encoding, Shouldn't that be chinese in your case? So
Encoding.Convert(chinese, unicode, chineseBytes);
might actually work. Because, after all, you want to convert CP-936 to Unicode and not vice-versa. And I'd suggest you don't even try bothering with CP-1252 since your text there is very likely not Latin.
This is an old question, but I just ran into the same situation while trying to migrate WordPress upload files off of an old Windows Server 2008 R2 server. bobince's answer set me on the right track, but I had to search for the right encoding/decoding pair.
With the following C#, I found the relevant encoding/deciding pair:
using System;
using System.Text;
public class Program
{
public static void Main()
{
// garbled
string s = "2020竹慶本樂ä»æ³¢åˆ‡äºžæ´²æ³•çµ-Intro-2-1024x643.jpg";
// expected
string t = "2020竹慶本樂仁波切亞洲法筵-Intro-2-1024x643.jpg";
foreach( EncodingInfo ei in Encoding.GetEncodings() ) {
Encoding e = ei.GetEncoding();
foreach( EncodingInfo ei2 in Encoding.GetEncodings() ) {
Encoding e2 = ei2.GetEncoding();
var s2 = e2.GetString(e.GetBytes(s));
if (s2 == t) {
var x = ei.CodePage;
Console.WriteLine($"e1={ei.DisplayName} (CP {ei.CodePage}), e2={ei2.DisplayName} (CP {ei2.CodePage})");
Console.WriteLine(t);
Console.WriteLine(s2);
}
}
}
Console.WriteLine("-----------");
Console.WriteLine(t);
Console.WriteLine(Encoding.GetEncoding(65001).GetString(Encoding.GetEncoding(1252).GetBytes(s)));
}
}
It turned out that the correct encoding/deciding in my case was:
e1=Western European (Windows) (CP 1252), e2=Unicode (UTF-8) (CP 65001)
So the last line of code is a one-liner for the correct conversion Console.WriteLine(Encoding.GetEncoding(65001).GetString(Encoding.GetEncoding(1252).GetBytes(s)));.