Arabic text file couldn’t be opened - swift
I am trying to open an .srt file as a String:
let str = try String(contentsOfFile: filePath, encoding: .utf8)
But I get this error every time I try to read one specific .srt file:
couldn’t be opened using text encoding
What is the problem?
Often, when you are not sure about the encoding, you can ask init(contentsOfFile:usedEncoding:) to determine the encoding for you during the conversion, e.g.:
let fileURL = Bundle.main.url(forResource: "2", withExtension: "srt")!
var encoding: String.Encoding = .utf8

do {
    let string = try String(contentsOfFile: fileURL.path, usedEncoding: &encoding)
    print(encoding)
    print(string)
} catch {
    print(error)
}
Unfortunately, in this case, it throws an error:
The file “2.srt” couldn’t be opened because the text encoding of its contents can’t be determined.
Needless to say, this file is obviously not UTF-8, nor any other encoding that Foundation can detect on its own.
When you look at the file in a hex editor (in this case, in Xcode, by right-clicking on the file and choosing “Open As” » “Hex”), it shows:
So we can see that this is text, but the Arabic is not in UTF-8. And, as https://subtitletools.com/ says,
Text encoding is a tricky thing. Years ago, there were hundreds of different text encodings in an attempt to support all languages and character sets. Nowadays all these different languages can be encoded in unicode UTF-8, but unfortunately all the files from years ago still exist, and some stubborn countries still use old text encodings. Many devices have trouble displaying text encodings that are not UTF-8, they will display the text as random, unreadable characters.
After a little investigation, it would appear that this is a Windows-1256 format. I could not open this on a Mac, but Word on my PC gave me a choice of encodings, and “Arabic (Windows)” looks promising:
Now, having confirmed the encoding, I would have thought that I could use CFStringEncodings.windowsArabic and CFStringCreateWithBytes, but I could not get that to work.
So, in the end, I built my own cross reference table:
class Windows1256 {
    // The Windows-1256 code page: the Unicode scalar for each of the 256 byte values
    private static let values: [UInt32] = [
        0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x000E, 0x000F,
        0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017, 0x0018, 0x0019, 0x001A, 0x001B, 0x001C, 0x001D, 0x001E, 0x001F,
        0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
        0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037, 0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
        0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047, 0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
        0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057, 0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
        0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067, 0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
        0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077, 0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x007F,
        0x20AC, 0x067E, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, 0x02C6, 0x2030, 0x0679, 0x2039, 0x0152, 0x0686, 0x0698, 0x0688,
        0x06AF, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, 0x06A9, 0x2122, 0x0691, 0x203A, 0x0153, 0x200C, 0x200D, 0x06BA,
        0x00A0, 0x060C, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7, 0x00A8, 0x00A9, 0x06BE, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
        0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7, 0x00B8, 0x00B9, 0x061B, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x061F,
        0x06C1, 0x0621, 0x0622, 0x0623, 0x0624, 0x0625, 0x0626, 0x0627, 0x0628, 0x0629, 0x062A, 0x062B, 0x062C, 0x062D, 0x062E, 0x062F,
        0x0630, 0x0631, 0x0632, 0x0633, 0x0634, 0x0635, 0x0636, 0x00D7, 0x0637, 0x0638, 0x0639, 0x063A, 0x0640, 0x0641, 0x0642, 0x0643,
        0x00E0, 0x0644, 0x00E2, 0x0645, 0x0646, 0x0647, 0x0648, 0x00E7, 0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x0649, 0x064A, 0x00EE, 0x00EF,
        0x064B, 0x064C, 0x064D, 0x064E, 0x00F4, 0x064F, 0x0650, 0x00F7, 0x0651, 0x00F9, 0x0652, 0x00FB, 0x00FC, 0x200E, 0x200F, 0x06D2
    ]

    static let scalars = values.map { Unicode.Scalar($0)! }

    // Map each byte of the data through the table
    static func convert(_ data: Data) -> String {
        var string = ""
        string.reserveCapacity(data.count)
        for byte in data {
            string.unicodeScalars.append(scalars[Int(byte)])
        }
        return string
    }
}
Then I can load the file as Data and convert it:

let data = try Data(contentsOf: fileURL)
let string = Windows1256.convert(data)
And that yields:
1
00:00:00,276 --> 00:00:02,401
:انچه گذشت
قرباني لين دوهرسته
2
00:00:02,403 --> 00:00:04,870
الان ديگه ما دنبال يه مجرم سريالي هستيم
I do not read or write Arabic, so I cannot verify this solution, but it looks promising to my untrained eye.
Needless to say, having done this conversion, you can write the result to a file (or convert it to a Data) using .utf8 if you need to save it as UTF-8.
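Incidentally, the hand-built table can be cross-checked: Python ships a Windows-1256 codec ("cp1256"), so an equivalent of Windows1256.convert can be sketched and spot-checked against the table above (the sample bytes here are my own, not taken from the questioner's file):

```python
# Build a 256-entry lookup table from Python's built-in cp1256 codec,
# mirroring the Windows1256.values table in the Swift code above.
win1256 = [bytes([b]).decode("cp1256") for b in range(256)]

# Spot-check a few entries against the Swift table:
assert win1256[0x81] == "\u067E"   # peh (a Persian letter, from the 0x80 row)
assert win1256[0xC7] == "\u0627"   # alef (0xC0 row)
assert win1256[0xE1] == "\u0644"   # lam (0xE0 row)

def convert(data: bytes) -> str:
    """Equivalent of Windows1256.convert(_:): map each byte through the table."""
    return "".join(win1256[b] for b in data)

print(convert(b"\xc7\xe1\xe1\xe5"))  # alef, lam, lam, heh
```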
Related
Java iText - special characters not being displayed
I have been trying for quite some time now to generate a PDF in Java using itextpdf (com.itextpdf kernel, layout, form, pdfa) with text containing special characters (äöüß). I tried several things in different variations, like loading a TTF file and setting the encoding:

FontProgram fontProgram = FontProgramFactory.createFont("font/FreeSans.ttf");
PdfFont font = PdfFontFactory.createFont(fontProgram, "UTF-8");
document.setFont(font);

This way it just doesn't display special characters at all. This doesn't work either:

var font = PdfFontFactory.createFont(StandardFonts.HELVETICA, PdfEncodings.UTF8);
document.setFont(font);

I haven't found any solution to this, and the official tutorials don't seem to have one. Other encodings just render placeholder characters. This is how I add the text:

PdfWriter writer = new PdfWriter(filename);
PdfDocument pdf = new PdfDocument(writer);
Document document = new Document(pdf);
Paragraph p = new Paragraph("äüöß");
document.add(p);
document.close();

Edit: I just realized that it works when I load the text from elsewhere, like an input field, instead of passing a hardcoded string. How can I make this work with hardcoded strings? I tried re-encoding the string as described here: https://www.baeldung.com/java-string-encode-utf-8 but none of these methods work either. It always shows the wrong characters.

PdfFont freeUnicode = PdfFontFactory.createFont("font/FreeSans.ttf", PdfEncodings.IDENTITY_H);
String rawString = "äöüß1234'";
byte[] bytes = StringUtils.getBytesUtf8(rawString);
String utf8EncodedString = StringUtils.newStringUtf8(bytes);
document.add(new Paragraph().setFont(freeUnicode).add(utf8EncodedString));

Edit: The encoding in the source code editor is UTF-8, and I passed UTF-8 to the createFont() method, but that didn't work. When I pass CP1252 and change the source code encoding to ISO-8859-1, it shows the correct characters. Really strange how little information I could find about this problem.
How to make my own encoding for a file in VSCode Editor
Is it possible to have my own encoding in the VSCode editor, inherited from an existing one?

class myEncoding implements utf-8 {
    // changes for some codes
}

I have some files which contain German characters like "ä ö ü" that are stored as escaped Unicode numbers. For example, a file contains the following line:

Pr\u00FCfsignal

While I want to edit this file with the correct German characters, it should exist on the hard disk in the form above. This is how I want to see it in the editor:

Prüfsignal

I already have a function that can transform a string in both directions:

function translate(content: string, direction: boolean): string {
    if (direction) {
        content = content
            .replace(/\\u00E4/g, "ä")
            .replace(/\\u00F6/g, "ö")
            .replace(/\\u00FC/g, "ü")
            .replace(/\\u00C4/g, "Ä")
            .replace(/\\u00D6/g, "Ö")
            .replace(/\\u00DC/g, "Ü")
            .replace(/\\u00DF/g, "ß")
            .replace(/\\u00B0/g, "°")
            .replace(/\\u00B1/g, "±")
            .replace(/\\u00B5/g, "µ");
    } else {
        content = content
            .replace(/ä/g, "\\u00E4")
            .replace(/ö/g, "\\u00F6")
            .replace(/ü/g, "\\u00FC")
            .replace(/Ä/g, "\\u00C4")
            .replace(/Ö/g, "\\u00D6")
            .replace(/Ü/g, "\\u00DC")
            .replace(/ß/g, "\\u00DF")
            .replace(/°/g, "\\u00B0")
            .replace(/±/g, "\\u00B1")
            .replace(/µ/g, "\\u00B5");
    }
    return content;
}

Can this be solved with a custom encoding, and if yes, any hints? Is there possibly a better solution?
There has been an open feature request for some years to provide encoding-related APIs for editor extensions: https://github.com/microsoft/vscode/issues/824. For now, you could just wrap that function in a loop that encodes all files in the working directory.
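Rather than enumerating each character pair, that translate function's round trip can also be written generically over all \uXXXX escapes. A minimal sketch (in Python rather than the question's TypeScript, purely to illustrate the idea; it escapes every non-ASCII character on the way back, which is a slightly broader rule than the original ten pairs):

```python
import re

def decode_escapes(content: str) -> str:
    # "Pr\u00FCfsignal" (literal backslash-u) -> "Prüfsignal"
    return re.sub(r"\\u([0-9A-Fa-f]{4})",
                  lambda m: chr(int(m.group(1), 16)), content)

def encode_escapes(content: str) -> str:
    # "Prüfsignal" -> "Pr\u00FCfsignal"; only non-ASCII characters are escaped
    return re.sub(r"[^\x00-\x7F]",
                  lambda c: "\\u%04X" % ord(c.group(0)), content)

assert decode_escapes(r"Pr\u00FCfsignal") == "Prüfsignal"
assert encode_escapes("Prüfsignal") == r"Pr\u00FCfsignal"
```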
Storing Special Characters in Windows Azure Blob Metadata
I have an app that is storing images in a Windows Azure Block Blob. I'm adding meta data to each blob that gets uploaded. The metadata may include some special characters. For instance, the registered trademark symbol (®). How do I add this value to meta data in Windows Azure? Currently, when I try, I get a 400 (Bad Request) error anytime I try to upload a file that uses a special character like this. Thank you!
You might use HttpUtility to encode/decode the string:

blob.Metadata["Description"] = HttpUtility.HtmlEncode(model.Description);
Description = HttpUtility.HtmlDecode(blob.Metadata["Description"]);

http://lvbernal.blogspot.com/2013/02/metadatos-de-azure-vs-caracteres.html
The supported characters in blob metadata must be ASCII. To work around this you can either escape the string (percent-encode it), Base64-encode it, etc.
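Both workarounds are easy to demonstrate. A sketch in Python (the description string is made up, and Azure itself is not involved here; this only shows that both escaped forms are pure ASCII and round-trip losslessly):

```python
import base64
from urllib.parse import quote, unquote

desc = "Registered® 1080x1080 Facebook Images"

# Option 1: percent-encode the string (non-ASCII bytes become %XX sequences)
escaped = quote(desc)
assert escaped.isascii()
assert unquote(escaped) == desc

# Option 2: Base64-encode the UTF-8 bytes
b64 = base64.b64encode(desc.encode("utf-8")).decode("ascii")
assert b64.isascii()
assert base64.b64decode(b64).decode("utf-8") == desc
```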
HttpUtility.HtmlEncode may not work; if Unicode characters are in your string (i.e. ’), it will fail. So far, I have found that Uri.EscapeDataString does handle this edge case and others. However, there are a number of characters that get encoded unnecessarily, such as space (' ' = chr(32) = %20). I mapped the illegal ASCII characters that metadata will not accept and built this to restore the characters:

static List<string> illegals = new List<string> {
    "%1", "%2", "%3", "%4", "%5", "%6", "%7", "%8", "%A", "%B", "%C", "%D", "%E", "%F",
    "%10", "%11", "%12", "%13", "%14", "%15", "%16", "%17", "%18", "%19", "%1A", "%1B", "%1C", "%1D", "%1E", "%1F", "%7F",
    "%80", "%81", "%82", "%83", "%84", "%85", "%86", "%87", "%88", "%89", "%8A", "%8B", "%8C", "%8D", "%8E", "%8F",
    "%90", "%91", "%92", "%93", "%94", "%95", "%96", "%97", "%98", "%99", "%9A", "%9B", "%9C", "%9D", "%9E", "%9F",
    "%A0", "%A1", "%A2", "%A3", "%A4", "%A5", "%A6", "%A7", "%A8", "%A9", "%AA", "%AB", "%AC", "%AD", "%AE", "%AF",
    "%B0", "%B1", "%B2", "%B3", "%B4", "%B5", "%B6", "%B7", "%B8", "%B9", "%BA", "%BB", "%BC", "%BD", "%BE", "%BF",
    "%C0", "%C1", "%C2", "%C3", "%C4", "%C5", "%C6", "%C7", "%C8", "%C9", "%CA", "%CB", "%CC", "%CD", "%CE", "%CF",
    "%D0", "%D1", "%D2", "%D3", "%D4", "%D5", "%D6", "%D7", "%D8", "%D9", "%DA", "%DB", "%DC", "%DD", "%DE", "%DF",
    "%E0", "%E1", "%E2", "%E3", "%E4", "%E5", "%E6", "%E7", "%E8", "%E9", "%EA", "%EB", "%EC", "%ED", "%EE", "%EF",
    "%F0", "%F1", "%F2", "%F3", "%F4", "%F5", "%F6", "%F7", "%F8", "%F9", "%FA", "%FB", "%FC", "%FD", "%FE"
};

private static string MetaDataEscape(string value)
{
    // e.g. CDC%20Guideline%20for%20Prescribing%20Opioids%20Module%206%3A%20%0Ahttps%3A%2F%2Fwww.cdc.gov%2Fdrugoverdose%2Ftraining%2Fdosing%2F
    var sz = value.Trim();
    sz = Uri.EscapeDataString(sz);
    for (int i = 1; i < 255; i++)
    {
        var hex = "%" + i.ToString("X");
        if (!illegals.Contains(hex))
        {
            sz = sz.Replace(hex, Uri.UnescapeDataString(hex));
        }
    }
    return sz;
}

The result is:
Before ==> "1080x1080 Facebook Images"
Uri.EscapeDataString => "1080x1080%20Facebook%20Images"
After => "1080x1080 Facebook Images"

I am sure there is a more efficient way, but the hit seems negligible for my needs.
What is the character encoding?
I have several characters that aren't recognized properly, characters like: º á ó (etc.). This means that the character encoding is not UTF-8, right? So, can you tell me what character encoding it could be, please?
We don't have nearly enough information to really answer this, but the gist of it is: you shouldn't just guess. You need to work out where the data is coming from, and find out what the encoding is. You haven't told us anything about the data source, so we're completely in the dark. You might want to try Encoding.Default if these are files saved with something like Notepad. If you know what the characters are meant to be and how they're represented in binary, that should suggest an encoding... but again, we'd need to know more information.
Read this first: http://www.joelonsoftware.com/articles/Unicode.html There are two encodings: the one that was used to encode the string, and the one that is used to decode it. They must be the same to get the expected result; if they differ, some characters will be displayed incorrectly. We can try to guess if you post the actual and expected results.
I wrote a couple of methods a while back to narrow down the possibilities, for situations just like this:

static void Main(string[] args)
{
    Encoding[] matches = FindEncodingTable('Ÿ');
    Encoding[] enc2 = FindEncodingTable(159, 'Ÿ');
}

// Locates all encodings with the specified character at the specified position.
// "CharacterPosition": decimal position of the character in the unknown encoding table, e.g. 159 in the extended ASCII table
// "character": the character to locate in the encoding table, e.g. 'Ÿ' in the extended ASCII table
static Encoding[] FindEncodingTable(int CharacterPosition, char character)
{
    List<Encoding> matches = new List<Encoding>();
    byte myByte = (byte)CharacterPosition;
    byte[] bytes = { myByte };
    foreach (EncodingInfo encInfo in Encoding.GetEncodings())
    {
        Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
        char[] chars = thisEnc.GetChars(bytes);
        if (chars[0] == character)
        {
            matches.Add(thisEnc);
        }
    }
    return matches.ToArray();
}

// Locates all encodings that contain the specified character
static Encoding[] FindEncodingTable(char character)
{
    List<Encoding> matches = new List<Encoding>();
    foreach (EncodingInfo encInfo in Encoding.GetEncodings())
    {
        Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
        char[] chars = { character };
        byte[] temp = thisEnc.GetBytes(chars);
        if (temp != null)
            matches.Add(thisEnc);
    }
    return matches.ToArray();
}
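The same brute-force idea — try every encoding the runtime knows about and keep the ones that map the byte to the expected character — can be sketched in Python as well (the function name is my own):

```python
import encodings.aliases

def find_encodings(byte_value: int, expected_char: str) -> list[str]:
    """Return the codec names that decode the given single byte to expected_char."""
    names = sorted(set(encodings.aliases.aliases.values()))
    matches = []
    for name in names:
        try:
            if bytes([byte_value]).decode(name) == expected_char:
                matches.append(name)
        except Exception:
            # skip codecs that reject the byte (or aren't text encodings at all)
            pass
    return matches

print(find_encodings(0x9F, "Ÿ"))  # cp1252 should be among the matches
```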
Encoding is the transformation of some existing content so that it can be parsed by the destination protocol. An example of encoding can be seen when browsing the internet: the URL you visit, www.example.com, may have a search facility that runs custom searches via the URL address: www.example.com?search=... The variables on the URL require URL encoding. If you were to write www.example.com?search=cat food cheap the browser wouldn't understand your request, because you have used an invalid character, ' ' (a white space). To correct this encoding error you should exchange the ' ' for '%20' to form this URL: www.example.com?search=cat%20food%20cheap Different systems use different forms of encoding; in this example I have used standard hex (percent) encoding for a URL. In other applications and instances you may need to use other types of encoding. Good luck!
How to recover text from a wrong encoding?
I have got some files created on Asian OSes (Chinese and Japanese XP); the file names are garbled, for example:

иè+¾«Ñ¡Õä²ØºÏ¼

How can I recover the original text? I tried this in C#:

Encoding unicode = Encoding.Unicode;
Encoding chinese = Encoding.GetEncoding(936);
byte[] chineseBytes = chinese.GetBytes(garbledString);
byte[] unicodeBytes = Encoding.Convert(unicode, chinese, chineseBytes);
// (then convert the bytes to a string)

and I also tried changing unicode to windows-1252, but no luck.
It's double-encoded text. The original is in Windows-936; then some application assumed the text was in ISO-8859-1 and encoded the result to UTF-8. Here is an example of how to decode it in Python:

>>> print 'иè+¾«Ñ¡Õä²ØºÏ¼'.decode('utf8').encode('latin1').decode('cp936')
新歌+精选珍藏合辑

I'm sure you can do something similar in C#.
Encoding unicode = Encoding.Unicode;

That's not what you want. “Unicode” is Microsoft's totally misleading name for what is really the UTF-16LE encoding. UTF-16LE plays no part here; what you have is a simple case where a 936 string has been misdecoded as 1252. Windows code page 1252 is similar to, but not the same as, ISO-8859-1. There is no way to tell which was used from the example string, as it does not contain any of the bytes 0x80-0x9F that differ between the two encodings, but I'm assuming 1252 because that's the standard code page on a Western Windows install.

Encoding latin = Encoding.GetEncoding(1252);
Encoding chinese = Encoding.GetEncoding(936);
string repaired = chinese.GetString(latin.GetBytes(s));
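The repair described here — re-encode the mis-decoded text with code page 1252, then decode the resulting bytes as code page 936 — can be sketched in Python (using a short made-up string whose CP-936 bytes all happen to be defined in CP-1252; bytes like 0x81 that are undefined in CP-1252 would need an error handler):

```python
original = "新歌"  # "new song", the start of the question's file name

# Simulate the damage: CP-936 bytes mis-decoded as CP-1252
garbled = original.encode("cp936").decode("cp1252")

# The repair: re-encode as CP-1252, then decode as CP-936
repaired = garbled.encode("cp1252").decode("cp936")

assert repaired == original
print(garbled, "->", repaired)
```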
The first argument to Encoding.Convert is the source encoding; shouldn't that be chinese in your case? So Encoding.Convert(chinese, unicode, chineseBytes); might actually work, because, after all, you want to convert CP-936 to Unicode and not vice versa. And I'd suggest you not even bother with CP-1252, since your text there is very likely not Latin.
This is an old question, but I just ran into the same situation while trying to migrate WordPress upload files off of an old Windows Server 2008 R2 server. bobince's answer set me on the right track, but I had to search for the right encoding/decoding pair. With the following C#, I found the relevant pair:

using System;
using System.Text;

public class Program
{
    public static void Main()
    {
        // garbled
        string s = "2020竹慶本樂ä»æ³¢åˆ‡äºžæ´²æ³•çµ-Intro-2-1024x643.jpg";
        // expected
        string t = "2020竹慶本樂仁波切亞洲法筵-Intro-2-1024x643.jpg";

        foreach (EncodingInfo ei in Encoding.GetEncodings())
        {
            Encoding e = ei.GetEncoding();
            foreach (EncodingInfo ei2 in Encoding.GetEncodings())
            {
                Encoding e2 = ei2.GetEncoding();
                var s2 = e2.GetString(e.GetBytes(s));
                if (s2 == t)
                {
                    Console.WriteLine($"e1={ei.DisplayName} (CP {ei.CodePage}), e2={ei2.DisplayName} (CP {ei2.CodePage})");
                    Console.WriteLine(t);
                    Console.WriteLine(s2);
                }
            }
        }

        Console.WriteLine("-----------");
        Console.WriteLine(t);
        Console.WriteLine(Encoding.GetEncoding(65001).GetString(Encoding.GetEncoding(1252).GetBytes(s)));
    }
}

It turned out that the correct pair in my case was:

e1=Western European (Windows) (CP 1252), e2=Unicode (UTF-8) (CP 65001)

So the last line of code is a one-liner for the correct conversion:

Console.WriteLine(Encoding.GetEncoding(65001).GetString(Encoding.GetEncoding(1252).GetBytes(s)));