iText 7.1.14 'StandardEncoding' is not a supported encoding name

I have a method in our software that pulls the text from a PDF, whether the PDF is a scan or contains generated text.
I usually try the GetTextFromPage() method first. If it doesn't return text, then I move on to OCR'ing the page.
I have a particular 5-page PDF with the first three pages being a scanned document, and the last two being a form.
On this PDF I'm getting an error that I can't figure out how to resolve.
'StandardEncoding' is not a supported encoding name. For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
Parameter name: name
at System.Globalization.EncodingTable.internalGetCodePageFromName(String name)
at System.Globalization.EncodingTable.GetCodePageFromName(String name)
at iText.IO.Util.IanaEncodings.GetEncodingEncoding(String name)
at iText.IO.Util.EncodingUtil.ConvertToBytes(Char[] chars, String encoding)
at iText.IO.Font.PdfEncodings.ConvertToBytes(String text, String encoding)
at iText.IO.Font.FontEncoding.FillNamedEncoding()
at iText.IO.Font.FontEncoding.CreateFontEncoding(String baseEncoding)
at iText.Kernel.Font.PdfType1Font..ctor(PdfDictionary fontDictionary)
at iText.Kernel.Font.PdfFontFactory.CreateFont(PdfDictionary fontDictionary)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.GetFont(PdfDictionary fontDict)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.SetTextFontOperator.Invoke(PdfCanvasProcessor processor, PdfLiteral operator, IList`1 operands)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.InvokeOperator(PdfLiteral operator, IList`1 operands)
at iText.Kernel.Pdf.Canvas.Parser.PdfCanvasProcessor.ProcessContent(Byte[] contentBytes, PdfResources resources)
at iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(PdfPage page, ITextExtractionStrategy strategy, IDictionary`2 additionalContentOperators)
at iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(PdfPage page)
at EFR.OCR.OCR.ExtractTextFromPDF(FileInfo fileInfo, Int32 StartingPage, Int32 NumberOfPages) in P:\Cloud\Dropbox\EF Recovery\OCRTest\EFR.OCR\OCR.vb:line 113
I've processed many PDFs through my code, some text, some scans, some mixed together. Some had forms... This is the first time that I've had this error.
Here's a snippet of my code...
Using reader As New iText.Kernel.Pdf.PdfReader(fileInfo.FullName)
reader.SetUnethicalReading(True)
Using sourceDoc As New iText.Kernel.Pdf.PdfDocument(reader)
If NumberOfPages = 0 Then NumberOfPages = sourceDoc.GetNumberOfPages
For i As Integer = StartingPage To StartingPage + NumberOfPages - 1
Dim pageText As String = ""
Try
pageText = iText.Kernel.Pdf.Canvas.Parser.PdfTextExtractor.GetTextFromPage(sourceDoc.GetPage(i))
Catch ex As Exception
OCRLog.Log($"Error attempting to extract text from page {i}. {ex.ToString}")
End Try
If pageText = "" Then
'extract this page
Dim results As OCRResults = ExtractTextFromPDFImagePage(fileInfo.FullName, i)
pageText = results.Text
pageItems.Add(New OCRResults.PagesClass(results.Accuracy, True, pageText))
Else
pageItems.Add(New OCRResults.PagesClass(100, False, pageText))
End If
stringBuilder.Append(pageText)
Next
Return New OCRResults(stringBuilder.ToString, pageItems)
End Using
End Using
Any ideas?

There is an error in the PDF, just as indicated by the error text "'StandardEncoding' is not a supported encoding name.".
The fonts on the page you shared use the name StandardEncoding in their Encoding entries. This is not a valid name here. According to the specification ISO 32000-1 the only valid values here are MacRomanEncoding, MacExpertEncoding, and WinAnsiEncoding, see Table 111 – Entries in a Type 1 font dictionary – and Table 114 – Entries in an encoding dictionary.
Adobe Preflight also complains about these names when checking for syntax errors:
An unexpected value is associated with the key
Key: BaseEncoding
Value: /StandardEncoding
Type: CosName
Formal Representation: Encoding
Cos ID: 38
Traversal Path: ->Pages->Kids->[0]->Resources->Font->WARSP->Encoding
An unexpected value is associated with the key
Key: Encoding
Value: /StandardEncoding
Type: CosName
Formal Representation: Font.FontType1
Cos ID: 27
Traversal Path: ->Pages->Kids->[0]->Resources->Font->Arial,Bold
An unexpected value is associated with the key
Key: BaseEncoding
Value: /StandardEncoding
Type: CosName
Formal Representation: Encoding
Cos ID: 22
Traversal Path: ->Pages->Kids->[0]->Resources->Font->Arial->Encoding
An unexpected value is associated with the key
Key: BaseEncoding
Value: /StandardEncoding
Type: CosName
Formal Representation: Encoding
Cos ID: 19
Traversal Path: ->Pages->Kids->[0]->Resources->Font->ARROW->Encoding
(Excerpt from a preflight report for your shared PDF)
In spite of StandardEncoding not being a valid name here, the PDF specification knows a "Standard Encoding", see Annex D of ISO 32000-1. Most likely your document attempts to refer to that encoding at the locations outlined above.
If you need to extract text from the document in question, therefore, you may want to follow the recommendation of the error message:
For information on defining a custom encoding, see the documentation for the Encoding.RegisterProvider method.
The Encoding class here is the one in System.Text.
To extract the text from your PDF, therefore, it should suffice to implement an EncodingProvider that for the name StandardEncoding provides an Encoding instance according to the information from the STD column of the table in Annex D.2 – Latin Character Set and Encodings – of ISO 32000-1.
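The table-driven idea can be illustrated outside .NET as well. The following is a minimal Python sketch, not iText or System.Text code: it registers a codec under the name StandardEncoding through Python's codecs module, which plays a role loosely comparable to an EncodingProvider. The override table is deliberately partial (four entries from the STD column only), and the names STD_OVERRIDES, decode_standard, encode_standard and find_codec are made up for this illustration.

```python
import codecs

# Partial, illustrative table: a few codes where Standard Encoding differs
# from ASCII. A real implementation would fill in the whole STD column of
# Annex D.2 of ISO 32000-1.
STD_OVERRIDES = {
    0x27: "\u2019",  # quoteright
    0x60: "\u2018",  # quoteleft
    0xAE: "\ufb01",  # fi ligature
    0xAF: "\ufb02",  # fl ligature
}


def decode_standard(data, errors="strict"):
    """Decode bytes one at a time using the (partial) table above."""
    raw = bytes(data)
    chars = []
    for b in raw:
        if b in STD_OVERRIDES:
            chars.append(STD_OVERRIDES[b])
        elif 0x20 <= b <= 0x7E:
            chars.append(chr(b))  # identical to ASCII in this range
        else:
            chars.append("\ufffd")  # code not covered by this sketch
    return "".join(chars), len(raw)


def encode_standard(text, errors="strict"):
    # Text extraction only needs the decoding direction.
    raise NotImplementedError("encoding is not part of this sketch")


def find_codec(name):
    # Compare case-insensitively; codecs normalizes the requested name.
    if name.lower() == "standardencoding":
        return codecs.CodecInfo(
            name="standardencoding",
            encode=encode_standard,
            decode=decode_standard,
        )
    return None


codecs.register(find_codec)

print(b"don\x27t".decode("StandardEncoding"))
```

Once registered, the name resolves like any built-in encoding name, which is the same effect Encoding.RegisterProvider is meant to achieve for the lookup that fails in the stack trace.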

Related

Arabic text file couldn’t be opened

I am trying to open .srt file using String:
let str = try String(contentsOfFile: filePath, encoding: .utf8)
But I get this error every time I want to read specific .srt file:
couldn’t be opened using text encoding
sampleFile
what is the problem?
Often, when you are not sure about the encoding, you can actually request that init(contentsOfFile:usedEncoding:) determine the encoding for you during the conversion process, e.g.
let fileURL = Bundle.main.url(forResource: "2", withExtension: "srt")!
var encoding: String.Encoding = .utf8
do {
let string = try String(contentsOfFile: fileURL.path, usedEncoding: &encoding)
print(encoding)
print(string)
} catch {
print(error)
}
Unfortunately, in this case, it throws an error:
The file “2.srt” couldn’t be opened because the text encoding of its contents can’t be determined.
Clearly this is not UTF-8, nor any other encoding that Foundation can detect on its own.
When you look at the file in a hex editor (in this case, in Xcode, by right-clicking the file and choosing “Open As” » “Hex”), you can see that this is text, but the Arabic is not in UTF-8. And, as https://subtitletools.com/ says,
Text encoding is a tricky thing. Years ago, there were hundreds of different text encodings in an attempt to support all languages and character sets. Nowadays all these different languages can be encoded in unicode UTF-8, but unfortunately all the files from years ago still exist, and some stubborn countries still use old text encodings. Many devices have trouble displaying text encodings that are not UTF-8, they will display the text as random, unreadable characters.
After a little investigation, it would appear that this file is encoded as Windows-1256. I could not open this on a Mac, but Word on my PC gave me a choice of encodings, and “Arabic (Windows)” looks promising.
Now, having confirmed the encoding, I would have thought that I could use CFStringEncodings.windowsArabic and CFStringCreateWithBytes, but I could not get that to work.
So, in the end, I built my own cross reference table:
class Windows1256 {
private static let values: [UInt32] = [
0x0000, 0x0001, 0x0002, 0x0003, 0x0004, 0x0005, 0x0006, 0x0007, 0x0008, 0x0009, 0x000A, 0x000B, 0x000C, 0x000D, 0x000E, 0x000F,
0x0010, 0x0011, 0x0012, 0x0013, 0x0014, 0x0015, 0x0016, 0x0017, 0x0018, 0x0019, 0x001A, 0x001B, 0x001C, 0x001D, 0x001E, 0x001F,
0x0020, 0x0021, 0x0022, 0x0023, 0x0024, 0x0025, 0x0026, 0x0027, 0x0028, 0x0029, 0x002A, 0x002B, 0x002C, 0x002D, 0x002E, 0x002F,
0x0030, 0x0031, 0x0032, 0x0033, 0x0034, 0x0035, 0x0036, 0x0037, 0x0038, 0x0039, 0x003A, 0x003B, 0x003C, 0x003D, 0x003E, 0x003F,
0x0040, 0x0041, 0x0042, 0x0043, 0x0044, 0x0045, 0x0046, 0x0047, 0x0048, 0x0049, 0x004A, 0x004B, 0x004C, 0x004D, 0x004E, 0x004F,
0x0050, 0x0051, 0x0052, 0x0053, 0x0054, 0x0055, 0x0056, 0x0057, 0x0058, 0x0059, 0x005A, 0x005B, 0x005C, 0x005D, 0x005E, 0x005F,
0x0060, 0x0061, 0x0062, 0x0063, 0x0064, 0x0065, 0x0066, 0x0067, 0x0068, 0x0069, 0x006A, 0x006B, 0x006C, 0x006D, 0x006E, 0x006F,
0x0070, 0x0071, 0x0072, 0x0073, 0x0074, 0x0075, 0x0076, 0x0077, 0x0078, 0x0079, 0x007A, 0x007B, 0x007C, 0x007D, 0x007E, 0x007F,
0x20AC, 0x067E, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021, 0x02C6, 0x2030, 0x0679, 0x2039, 0x0152, 0x0686, 0x0698, 0x0688,
0x06AF, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014, 0x06A9, 0x2122, 0x0691, 0x203A, 0x0153, 0x200C, 0x200D, 0x06BA,
0x00A0, 0x060C, 0x00A2, 0x00A3, 0x00A4, 0x00A5, 0x00A6, 0x00A7, 0x00A8, 0x00A9, 0x06BE, 0x00AB, 0x00AC, 0x00AD, 0x00AE, 0x00AF,
0x00B0, 0x00B1, 0x00B2, 0x00B3, 0x00B4, 0x00B5, 0x00B6, 0x00B7, 0x00B8, 0x00B9, 0x061B, 0x00BB, 0x00BC, 0x00BD, 0x00BE, 0x061F,
0x06C1, 0x0621, 0x0622, 0x0623, 0x0624, 0x0625, 0x0626, 0x0627, 0x0628, 0x0629, 0x062A, 0x062B, 0x062C, 0x062D, 0x062E, 0x062F,
0x0630, 0x0631, 0x0632, 0x0633, 0x0634, 0x0635, 0x0636, 0x00D7, 0x0637, 0x0638, 0x0639, 0x063A, 0x0640, 0x0641, 0x0642, 0x0643,
0x00E0, 0x0644, 0x00E2, 0x0645, 0x0646, 0x0647, 0x0648, 0x00E7, 0x00E8, 0x00E9, 0x00EA, 0x00EB, 0x0649, 0x064A, 0x00EE, 0x00EF,
0x064B, 0x064C, 0x064D, 0x064E, 0x00F4, 0x064F, 0x0650, 0x00F7, 0x0651, 0x00F9, 0x0652, 0x00FB, 0x00FC, 0x200E, 0x200F, 0x06D2
]
static let scalars = values.map { Unicode.Scalar($0)! }
static func convert(_ data: Data) -> String {
var string = ""
string.reserveCapacity(data.count)
for byte in data {
string += String(scalars[Int(byte)])
}
return string
}
}
Then I can convert that
let string = Windows1256.convert(data)
And that yields:
1
00:00:00,276 --> 00:00:02,401
:انچه گذشت
قرباني لين دوهرسته
2
00:00:02,403 --> 00:00:04,870SW
الان ديگه ما دنبال يه مجرم سريالي هستيم
I do not read or write Arabic, so I cannot verify this solution, but it looks promising to my untrained eye.
Needless to say, having done this conversion, you can write this to a file (or convert to a Data) using .utf8 if you need to save the result in this encoding.
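For comparison, environments that ship a Windows-1256 codec do not need a hand-built table. The following is a small Python sketch (Python names this codec cp1256); the function name windows1256_to_utf8 and the sample bytes are made up for illustration and do not come from the .srt file in the question.

```python
def windows1256_to_utf8(raw: bytes) -> bytes:
    """Decode Windows-1256 bytes and re-encode the text as UTF-8."""
    text = raw.decode("cp1256")
    return text.encode("utf-8")


# Four bytes that map to U+0627 U+0644 U+0627 U+0646, matching the
# entries at 0xC7, 0xE1 and 0xE4 in the table above.
sample = bytes([0xC7, 0xE1, 0xC7, 0xE4])

print(sample.decode("cp1256"))
print(windows1256_to_utf8(sample))
```

The UTF-8 bytes returned by the function can be written straight to a new file.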

Emoji and accent encoding in dart/flutter

I get the next String from my api
"à é í ó ú ü ñ \uD83D\uDE00\uD83D\uDE03\uD83D\uDE04\uD83D\uDE01\uD83D\uDE06\uD83D\uDE05"
from a response in a json format
{
'apiText': "à é í ó ú ü ñ \uD83D\uDE00\uD83D\uDE03\uD83D\uDE04\uD83D\uDE01\uD83D\uDE06\uD83D\uDE05",
'otherInfo': 'etc.',
.
.
.
}
it contains accents à é í ó ú ü ñ that are not correctly encoded and it contains emojis \uD83D\uDE00\uD83D\uDE03\uD83D\uDE04\uD83D\uDE01\uD83D\uDE06\uD83D\uDE05
so far i have tried
var json = jsonDecode(response.body);
String apiText = json['apiText'];
List<int> bytes = apiText.codeUnits;
comentario = utf8.decode(bytes);
but produces a
[ERROR:flutter/lib/ui/ui_dart_state.cc(166)] Unhandled Exception: FormatException: Invalid UTF-8 byte (at offset 21)
how can i get the correct text with accents and emoji?
Based on the fact that you called response.body, I assume you are using the http package, which does have the body property on Response objects.
You should note the following detail in the documentation:
This is converted from bodyBytes using the charset parameter of the Content-Type header field, if available. If it's unavailable or if the encoding name is unknown, latin1 is used by default, as per RFC 2616.
Well, it seems rather likely that it cannot figure out the charset and therefore defaults to latin1 which explains how your response got messed up.
A solution for this is to use response.bodyBytes instead, which contains the raw bytes from the response. You can then decode these yourself with e.g. utf8.decode(response.bodyBytes) if you are sure the response should be parsed as UTF-8.
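The effect is easy to reproduce in any language. Here is a hedged Python sketch (not Dart, and not the http package): it decodes UTF-8 bytes as latin-1 to produce the garbled text, then reverses the mistake. The helper name repair_latin1_mojibake is hypothetical.

```python
def repair_latin1_mojibake(text: str) -> str:
    """Undo a wrong latin-1 decode of bytes that were really UTF-8.

    If the text cannot be mapped back (it was never garbled this way),
    it is returned unchanged.
    """
    try:
        return text.encode("latin-1").decode("utf-8")
    except (UnicodeEncodeError, UnicodeDecodeError):
        return text


original = "\u00e0 \u00e9 \u00f1 \U0001F600"  # accented letters plus one emoji
raw = original.encode("utf-8")                # what the server actually sent
garbled = raw.decode("latin-1")               # what a latin-1 default produces

print(garbled == original)                           # False
print(repair_latin1_mojibake(garbled) == original)   # True
```

Decoding the raw bytes as UTF-8 in the first place avoids the repair step entirely, which is why reading the body bytes directly is the cleaner fix.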
rewrite your function as
String utf8convert(String text) {
var bytes = text.codeUnits;
String decodedCode = utf8.decode(bytes, allowMalformed: true);
if (decodedCode.contains("�")) {
return text;
}
return decodedCode;
}

How do I avoid truncation of message subjects at 255 chars with the EWS managed API?

I have an email message on an Exchange server (2010 SP1) with a Subject header that is 272 characters long. Both Outlook and OWA show it truncated to the first 252 characters followed by "...". EWSEditor shows it the same way. I know, however, that the full Subject is stored somewhere, because when I look at the headers in the Message Options dialog in Outlook or in the Message Details in OWA, all 272 characters are there.
My code is only getting the truncated Subject, and I need a way to get the full string.
My code is using SyncFolderItems to get a ChangeCollection of ItemChange objects. I have two code branches for this. One retrieves FirstClassProperties, and one retrieves IdOnly. I have a function called getItemStringProp(), and depending on the branch, I either call it directly with the Item that I get from the ItemChange, or with the Item that I get by binding to the ItemChange.Item.Id. In both cases, my getItemStringProp() uses Item.TryGetProperty() and returns a max of 255 characters for the Subject. If the actual subject is longer, then I get 252 chars followed by "...".
Here's my code from the branch doing SyncFolderItems with FirstClassProperties:
useIdOnly = false;
icc = exchange.SyncFolderItems(folderId, PropertySet.FirstClassProperties, null, syncFolderItemsBatchSize, SyncFolderItemsScope.NormalItems, result.getSyncState());
and from the other branch:
useIdOnly = true;
icc = exchange.SyncFolderItems(folderId, PropertySet.IdOnly, null, syncFolderItemsBatchSize, SyncFolderItemsScope.NormalItems, result.getSyncState());
Following this, I drill down to get the Subject:
foreach (ItemChange ic in icc)
{
if (!useIdOnly)
{
icSubject = getItemStringProp(ic.Item, EmailMessageSchema.Subject,"Subject", folderName,"");
}
else
{
PropertySet itemProps = new PropertySet(BasePropertySet.IdOnly);
itemProps.Add(EmailMessageSchema.Subject);
itemProps.Add(EmailMessageSchema.DateTimeSent);
itemProps.Add(EmailMessageSchema.ItemClass);
Item item = Item.Bind(exchange, ic.Item.Id, itemProps);
icSubject = getItemStringProp(item, EmailMessageSchema.Subject, "Subject", folderName, "");
}
}
And here's the function that gets the Subject:
private String getItemStringProp(Item item, PropertyDefinition propDef, String propName, String fName, String defaultValue)
{
// some debug logging code and error checks omitted
object prop = null;
String value = "";
try
{
if (item.TryGetProperty(propDef, out prop) && prop != null)
{
value = prop.ToString();
}
if (prop == null || value == null)
{
value = defaultValue;
}
}
catch (Exception)
{
// logging omitted; fall back to the default value
value = defaultValue;
}
return value;
}
By the way, I'm aware that neither Outlook (at least the 2007 version) nor OWA allows creation of a message with a Subject longer than 255 characters. The message in question came into Exchange via SMTP, and a Subject far longer than 255 characters is legal according to the RFCs.
Don't rely on Item.Bind(), sync, search, or any other operation in EWS to load up all of the properties you're looking for. Have you tried getting the item, then doing a .load(PropertySet) or ExchangeService.loadPropertiesForItems()? Some properties won't come through in various retrieval actions even if you specifically request them. Some may come through, but get truncated. What makes it more fun is that I don't think there's any documentation telling you exactly which operations will return which properties, so you get to guess and check. You have to load the property set after you retrieve the Item(s), so it's usually best to get the Item with the ID only, then load the property set.

Storing Special Characters in Windows Azure Blob Metadata

I have an app that is storing images in a Windows Azure Block Blob. I'm adding meta data to each blob that gets uploaded. The metadata may include some special characters. For instance, the registered trademark symbol (®). How do I add this value to meta data in Windows Azure?
Currently, when I try, I get a 400 (Bad Request) error anytime I try to upload a file that uses a special character like this.
Thank you!
You might use HttpUtility to encode/decode the string:
blob.Metadata["Description"] = HttpUtility.HtmlEncode(model.Description);
Description = HttpUtility.HtmlDecode(blob.Metadata["Description"]);
http://lvbernal.blogspot.com/2013/02/metadatos-de-azure-vs-caracteres.html
The supported characters in the blob metadata must be ASCII characters. To work around this you can either escape the string ( percent encode), base64 encode etc.
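Both workarounds are simple to sketch. The Python below is an illustration of the two encodings only; it does not call any Azure SDK, and the helper names are invented for the example.

```python
import base64
from urllib.parse import quote, unquote


def to_percent(value: str) -> str:
    """Percent-encode the UTF-8 bytes of value so the result is pure ASCII."""
    return quote(value)


def from_percent(value: str) -> str:
    return unquote(value)


def to_base64(value: str) -> str:
    """Base64-encode the UTF-8 bytes of value; the result is pure ASCII."""
    return base64.b64encode(value.encode("utf-8")).decode("ascii")


def from_base64(value: str) -> str:
    return base64.b64decode(value).decode("utf-8")


description = "Acme\u00ae photo"  # contains the registered trademark sign

print(to_percent(description))  # Acme%C2%AE%20photo
print(to_base64(description))
```

Percent-encoding keeps plain ASCII text readable in the stored value, while base64 is opaque but has a fixed alphabet; either way the reader of the metadata has to apply the matching decode step.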
joe
HttpUtility.HtmlEncode may not work; if Unicode characters are in your string (e.g. &#8217;), it will fail. So far, I have found that Uri.EscapeDataString does handle this edge case and others. However, there are a number of characters that get encoded unnecessarily, such as space (' '=chr(32)=%20).
I mapped the escape sequences for the characters that metadata will not accept, and built this to restore all the other characters after escaping:
static List<string> illegals = new List<string> { "%1", "%2", "%3", "%4", "%5", "%6", "%7", "%8", "%A", "%B", "%C", "%D", "%E", "%F", "%10", "%11", "%12", "%13", "%14", "%15", "%16", "%17", "%18", "%19", "%1A", "%1B", "%1C", "%1D", "%1E", "%1F", "%7F", "%80", "%81", "%82", "%83", "%84", "%85", "%86", "%87", "%88", "%89", "%8A", "%8B", "%8C", "%8D", "%8E", "%8F", "%90", "%91", "%92", "%93", "%94", "%95", "%96", "%97", "%98", "%99", "%9A", "%9B", "%9C", "%9D", "%9E", "%9F", "%A0", "%A1", "%A2", "%A3", "%A4", "%A5", "%A6", "%A7", "%A8", "%A9", "%AA", "%AB", "%AC", "%AD", "%AE", "%AF", "%B0", "%B1", "%B2", "%B3", "%B4", "%B5", "%B6", "%B7", "%B8", "%B9", "%BA", "%BB", "%BC", "%BD", "%BE", "%BF", "%C0", "%C1", "%C2", "%C3", "%C4", "%C5", "%C6", "%C7", "%C8", "%C9", "%CA", "%CB", "%CC", "%CD", "%CE", "%CF", "%D0", "%D1", "%D2", "%D3", "%D4", "%D5", "%D6", "%D7", "%D8", "%D9", "%DA", "%DB", "%DC", "%DD", "%DE", "%DF", "%E0", "%E1", "%E2", "%E3", "%E4", "%E5", "%E6", "%E7", "%E8", "%E9", "%EA", "%EB", "%EC", "%ED", "%EE", "%EF", "%F0", "%F1", "%F2", "%F3", "%F4", "%F5", "%F6", "%F7", "%F8", "%F9", "%FA", "%FB", "%FC", "%FD", "%FE" };
private static string MetaDataEscape(string value)
{
//CDC%20Guideline%20for%20Prescribing%20Opioids%20Module%206%3A%20%0Ahttps%3A%2F%2Fwww.cdc.gov%2Fdrugoverdose%2Ftraining%2Fdosing%2F
var sz = value.Trim();
sz = Uri.EscapeDataString(sz);
for (int i = 1; i < 255; i++)
{
var hex = "%" + i.ToString("X");
if (!illegals.Contains(hex))
{
sz = sz.Replace(hex, Uri.UnescapeDataString(hex));
}
}
return sz;
}
The result is:
Before ==> "1080x1080 Facebook Images"
Uri.EscapeDataString => "1080x1080%20Facebook%20Images"
After => "1080x1080 Facebook Images"
I am sure there is a more efficient way, but the hit seems negligible for my needs.

What is the character encoding?

I have several characters that aren't recognized properly.
Characters like:
º
á
ó
(etc..)
This means that the character encoding is not UTF-8, right?
So, can you tell me what character encoding it could be, please?
We don't have nearly enough information to really answer this, but the gist of it is: you shouldn't just guess. You need to work out where the data is coming from, and find out what the encoding is. You haven't told us anything about the data source, so we're completely in the dark. You might want to try Encoding.Default if these are files saved with something like Notepad.
If you know what the characters are meant to be and how they're represented in binary, that should suggest an encoding... but again, we'd need to know more information.
read this first http://www.joelonsoftware.com/articles/Unicode.html
There are two encodings involved: the one that was used to encode the string and the one that is used to decode it. They must be the same to get the expected result; if they differ, some characters will be displayed incorrectly. We can try to guess if you post the actual and expected results.
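One practical way to compare "actual" against "expected" is to decode the same raw bytes with a handful of candidate encodings and see which result looks right. A minimal Python sketch follows; the candidate list and the name decode_candidates are assumptions chosen for illustration.

```python
def decode_candidates(raw, encodings=("utf-8", "cp1252", "latin-1", "cp437")):
    """Return a dict mapping each candidate encoding to the decoded text.

    Encodings that cannot decode the bytes at all map to None.
    """
    results = {}
    for name in encodings:
        try:
            results[name] = raw.decode(name)
        except UnicodeDecodeError:
            results[name] = None
    return results


# A single byte as it might appear in a file saved with a Windows code page.
raw = b"\xe1"

for name, text in decode_candidates(raw).items():
    print(name, repr(text))
```

Here the byte is not valid UTF-8 on its own, decodes to á under cp1252 and latin-1, and to a different character under cp437, so knowing the expected character narrows the choice quickly.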
I wrote a couple of methods to narrow down the possibilities a while back for situations just like this.
static void Main(string[] args)
{
Encoding[] matches = FindEncodingTable('Ÿ');
Encoding[] enc2 = FindEncodingTable(159, 'Ÿ');
}
// Locates all Encodings with the specified Character and position
// "CharacterPosition": Decimal position of the character on the unknown encoding table. E.G. 159 on the extended ASCII table
//"character": The character to locate in the encoding table. E.G. 'Ÿ' on the extended ASCII table
static Encoding[] FindEncodingTable(int CharacterPosition, char character)
{
List<Encoding> matches = new List<Encoding>();
byte myByte = (byte)CharacterPosition;
byte[] bytes = { myByte };
foreach (EncodingInfo encInfo in Encoding.GetEncodings())
{
Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
char[] chars = thisEnc.GetChars(bytes);
if (chars[0] == character)
{
// keep going so that all matching encodings are collected
matches.Add(thisEnc);
}
}
return matches.ToArray();
}
// Locates all Encodings that contain the specified character
static Encoding[] FindEncodingTable(char character)
{
List<Encoding> matches = new List<Encoding>();
foreach (EncodingInfo encInfo in Encoding.GetEncodings())
{
Encoding thisEnc = Encoding.GetEncoding(encInfo.CodePage);
char[] chars = { character };
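byte[] temp = thisEnc.GetBytes(chars);
// GetBytes never returns null; unsupported characters become '?' or a best-fit
// substitute, so check that the bytes decode back to the same character.
if (thisEnc.GetString(temp) == character.ToString())
matches.Add(thisEnc);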
}
return matches.ToArray();
}
Encoding is a way of transforming existing content so that it can be parsed by the destination system or protocol.
An example of encoding can be seen when browsing the internet:
The URL you visit: www.example.com, may have the search facility to run custom searches via the URL address:
www.example.com?search=...
Values passed in the URL's query string require URL encoding. If you were to write:
www.example.com?search=cat food cheap
The browser wouldn't understand your request as you have used an invalid character of ' ' (a white space)
To correct this encoding error you should exchange the ' ' with '%20' to form this URL:
www.example.com?search=cat%20food%20cheap
Different systems use different forms of encoding. In this example I have used percent-encoding, the standard hex-based encoding for URLs. In other applications you may need other types of encoding.
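Most languages have a library routine for this, so the replacement rarely needs to be done by hand. As a short Python sketch (the https scheme and the variable names are assumptions added for the example):

```python
from urllib.parse import quote, unquote

query = "cat food cheap"

# quote() percent-encodes characters that are not safe in a URL component.
url = "https://www.example.com/?search=" + quote(query)

print(url)             # https://www.example.com/?search=cat%20food%20cheap
print(unquote("cat%20food%20cheap"))  # cat food cheap
```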
Good Luck!