Error when extracting text - itext

Unable to cast object of type 'iTextSharp.text.pdf.PdfLiteral' to type 'iTextSharp.text.pdf.PdfNumber'.
CODE:
StringBuilder text = new StringBuilder();
SimpleTextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
for (int p = 1; p <= reader.NumberOfPages; p++)
{
    text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, p, strategy));
}
reader.Close();
return text.ToString();
I only get this error with a very small number of PDFs. Any ideas?
STACK TRACE:
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ShowTextArray.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
at DCS.Common.PDF.Functions.GetTextPdf(PdfReader reader) in C:\Users\rmaldonado\Documents\Visual Studio 2008\Projects\DCS\Contract\Common\PDF\Functions.cs:line 35
at DCS.Common.PDF.Functions.ParsePDF(Byte[] bytes) in C:\Users\rmaldonado\Documents\Visual Studio 2008\Projects\DCS\Contract\Common\PDF\Functions.cs:line 23
at DCS.CAPPS.BLL.Common.Attachment.ReParseText() in C:\Users\rmaldonado\Documents\Visual Studio 2008\Projects\DCS\Contract\ContractBLL\Common\Common.cs:line 1120

The page content of your document Mod 2.pdf is utterly broken. It is so badly broken that Adobe Preflight (from Acrobat 9.5.4), just like iText, runs into an error while trying to analyze it.
A manual inspection indicates that the most obvious errors relate to operations injected into an array of operands of TJ operations, e.g.
[(OMB) 0.0 Tc -278.0 (Approval) 0.0 Tc -278.0 (2700-0042) ] TJ
[(AMENDMENT) 0.0 Tc -278.0 (OF) 0.0 Tc -278.0 (SOLICITATION/MODIFICATION)
0.0 Tc -278.0 (OF) 0.0 Tc -278.0 (CONTRACT) ] TJ
This pattern continues, i.e. every non-trivial [ ... ] TJ operation contains injected 0.0 Tc operations.
This is wrong, cf. section 7.8.2 of the PDF specification ISO 32000-1:2008:
In PDF, all of the operands needed by an operator shall immediately precede that operator. Operators do not return results, and operands shall not be left over when an operator finishes execution.
This makes PdfContentStreamProcessor.ShowTextArray.Invoke (responsible for processing TJ operations) run into the error. As the operand array of TJ may contain only strings and numbers, everything that is not a PdfString is cast to PdfNumber, but the injected Tc operators are instances of PdfLiteral, so the cast fails.
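To check a suspicious content stream for this pattern before handing it to a parser, a rough scan can look for bare words inside [ ... ] arrays. This is only a sketch in Python, not iText code: the helper name stray_operators_in_arrays is made up for illustration, and the tokenizer deliberately ignores hex strings, nested parentheses and nested arrays.

```python
import re

# Tokens of a simplified content stream: literal strings, array brackets,
# and bare words (numbers and operators). Assumption: no hex strings,
# no nested parentheses, no nested arrays.
TOKEN = re.compile(r"\((?:\\.|[^\\()])*\)|[\[\]]|[^\s\[\]()]+")
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)$")


def stray_operators_in_arrays(content):
    """Return bare words found inside [ ... ] arrays (hypothetical helper)."""
    stray = []
    depth = 0
    for token in TOKEN.findall(content):
        if token == "[":
            depth += 1
        elif token == "]":
            depth = max(depth - 1, 0)
        elif depth > 0 and not token.startswith("(") and not NUMBER.match(token):
            # Inside an array only strings and numbers are allowed.
            stray.append(token)
    return stray


broken = "[(OMB) 0.0 Tc -278.0 (Approval) 0.0 Tc -278.0 (2700-0042) ] TJ"
valid = "[(OMB) -278.0 (Approval) -278.0 (2700-0042) ] TJ"
print(stray_operators_in_arrays(broken))  # ['Tc', 'Tc']
print(stray_operators_in_arrays(valid))   # []
```

A non-empty result for a page indicates the kind of injected operators described above.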

To extract text from the PDF, try using the code below:
PdfTextExtractor.GetTextFromPage(reader, p, new LocationTextExtractionStrategy())

As @mkl said, there might be an error in the PDF too. Try to copy and paste the text contents from the PDF into Notepad. Does it come out blank? This is just to check whether the contents are in image format or some other format. Also, please provide the complete code if possible.

Related

Formatting math expression before parsing using math_expressions package in Flutter

How do I properly format a math expression before passing it to the math_expressions package in Flutter?
Context
I'm using math_expressions package but there are two cases I found when it throws an error:
A. Missing an asterisk before a parenthesis.
B. Missing parenthesis within the expression.
E.g.
// Throw an error
final expressionA = "8(3+1)"; // A
final expressionB = "8(3+1"; // B
// Executes correctly
final expression = "8*(3+1)";
final Parser parser = Parser();
Expression exp = parser.parse(expression);
ContextModel cm = ContextModel();
final double result = exp.evaluate(EvaluationType.REAL, cm);
I'm aware of the syntactic requirements of the package, so I'd like to properly format the expression before passing it to the parser, since I cannot guarantee that user input will comply with the requirements mentioned above.
What I've got so far
A. Missing an asterisk before a parenthesis:
I read about the replaceAllMapped method but I don't really know how to start from here in order to add the missing asterisks when needed.
B. Missing parenthesis within the expression. (solved)
Hypothesis
A. Missing an asterisk before a parenthesis:
I think the way is to create an array of digits, search for occurrences of a digit followed by a parenthesis, and then replace each one with the asterisk added, like this: digit + "*" + parenthesis
Any ideas on how to solve this appropriately?
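The hypothesis for case A can be expressed as a single regular-expression replacement. The sketch below uses Python rather than Dart, and the function name insert_implicit_multiplication is made up for illustration; the same pattern could presumably be passed to Dart's replaceAllMapped. It also treats ")(" as an implicit multiplication, which goes slightly beyond the digit-only idea above.

```python
import re


def insert_implicit_multiplication(expression):
    """Insert '*' between a digit or ')' and a following '(' (hypothetical helper)."""
    # Group 1 captures the digit or closing parenthesis so it can be kept.
    return re.sub(r"(\d|\))\s*\(", r"\1*(", expression)


print(insert_implicit_multiplication("8(3+1)"))      # 8*(3+1)
print(insert_implicit_multiplication("(2+3)(4+5)"))  # (2+3)*(4+5)
print(insert_implicit_multiplication("8*(3+1)"))     # 8*(3+1)
```

Expressions that already contain the asterisk pass through unchanged. A function name that ends in a digit would be rewritten incorrectly, so this is only safe if user input cannot contain such names.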

Get UTF-16 code unit at a given index in ABAP

I want to get the UTF-16 code unit at a given index in ABAP.
Same can be done in JavaScript with charCodeAt().
For example "d".charCodeAt(); will give back 100.
Is there a similar functionality in ABAP?

This can be done with class CL_ABAP_CONV_OUT_CE:
DATA(lo_converter) = cl_abap_conv_out_ce=>create( encoding = '4103' ). "Little Endian
TRY.
    CALL METHOD lo_converter->convert
      EXPORTING
        data = 'a'
        n = 1
      IMPORTING
        buffer = DATA(lv_buffer). "lv_buffer will be 6100 (0061 with codepage 4102)
  CATCH ...
ENDTRY.
Codepage 4102 is for UTF-16 Big Endian.
It is possible to encode not just a single character, but a string as well:
EXPORTING
data = 'abc'
n = 3
"n" always stands for the length of the string you want to be encoded. It can be less than the actual length of the string.

When you say you "want to get the UTF-16 code unit",
either you mean the Unicode code point, e.g. the character d is always U+0064 (the official "name" of the Unicode character; the two bytes 0x0064 are the hexadecimal representation of decimal 100),
or you mean you want to encode d to UTF-16 little endian (SAP code page 4103) or big endian (SAP code page 4102), which gives the 2 bytes 0x6400 or 0x0064 respectively.
For the second case, see József's answer.
For the first case, you may get it using the method UCCP (UniCode Code Point) or UCCPI (UniCode Code Point Integer) of class CL_ABAP_CONV_OUT_CE:
DATA: l_unicode_point_hex TYPE x LENGTH 2,
l_unicode_point_int TYPE i.
l_unicode_point_hex = cl_abap_conv_out_ce=>UCCP( 'd' ).
ASSERT l_unicode_point_hex = '0064'.
l_unicode_point_int = cl_abap_conv_out_ce=>UCCPI( 'd' ).
ASSERT l_unicode_point_int = 100.
EDIT: Note that the two methods always return the same values whatever the SAP system code page is (4102, 4103 or any other).
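For comparison outside ABAP, the difference between the code unit value and its byte encodings can be reproduced with a few lines of Python. This is only a sketch to make the byte orders visible; the function name utf16_code_unit_at is made up for illustration and mimics JavaScript's charCodeAt.

```python
import struct


def utf16_code_unit_at(text, index):
    """Return the UTF-16 code unit at `index` (hypothetical helper)."""
    # Encode big endian so each code unit is two bytes, high byte first.
    encoded = text.encode("utf-16-be")
    (unit,) = struct.unpack_from(">H", encoded, 2 * index)
    return unit


print(utf16_code_unit_at("d", 0))     # 100
print("d".encode("utf-16-le").hex())  # 6400
print("d".encode("utf-16-be").hex())  # 0064
```

The integer 100 is the same in both cases; only the order of the two encoded bytes differs between little and big endian.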

How do character expansions work under the hood?

I am reading the CLR VIA C# by Jeffrey Richter. And while explaining string comparison, he notes that:
When the Compare method is not performing an ordinal comparison, it
performs character expansions. A character expansion is when a
character is expanded to multiple characters regardless of culture.
String s1 = "Strasse";
String s2 = "Straße";
Boolean eq;
CultureInfo ci = new CultureInfo("de-DE");
eq = String.Compare(s1, s2, true, ci) == 0; // returns true
For the above case, he notes:
...the German Eszet character ‘ß’ is always expanded to
‘ss’. So in the code example, the call to Compare will always
return 0 regardless of which culture I actually pass in to it.
I want to know from which source the runtime learns that ß is equal to ss, or how it calculates this.
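A similar effect can be observed outside .NET. The Python sketch below is not the CLR's mechanism and does not answer where the runtime takes its data from; it only shows an analogous mapping, since Python's str.casefold applies Unicode full case folding, which maps ß to ss. The helper name equal_after_folding is made up for illustration.

```python
def equal_after_folding(a, b):
    """Compare two strings after Unicode full case folding (hypothetical helper)."""
    return a.casefold() == b.casefold()


s1 = "Strasse"
s2 = "Straße"
print(s1 == s2)                     # False (ordinal comparison)
print(equal_after_folding(s1, s2))  # True
print("ß".casefold())               # ss
```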

Strange results when deleting all special characters from a string in Progress / OpenEdge

I have the code snippet below (as suggested in this previous Stack Overflow answer ... Deleting all special characters from a string in progress 4GL) which is attempting to remove all extended characters from a string so that I may transmit it to a customer's system which will not accept any extended characters.
do v-int = 128 to 255:
    assign v-string = replace(v-string,chr(v-int),"").
end.
It is working perfectly with one exception (which makes me fear there may be others I have not caught). When it gets to 255, it will replace all 'y's in the string.
If I do the following ...
display chr(255) = chr(121). /* 121 is asc code of y */
I get true as the result.
And therefore, if I do the following ...
display replace("This is really strange",chr(255),"").
I get the following result:
This is reall strange
I have verified that 'y' is the only character affected by running the following:
def var v-string as char init "abcdefghijklmnopqrstuvwxyz".
def var v-int as int.
do v-int = 128 to 255:
    assign v-string = replace(v-string,chr(v-int),"").
end.
display v-string.
Which results in the following:
abcdefghijklmnopqrstuvwxz
I know I can fix this by removing 255 from the range but I would like to understand why this is happening.
Is this a character collation set issue or am I missing something simpler?
Thanks for any help!

This is a bug. Here's a Progress Knowledge Base article about it:
http://knowledgebase.progress.com/articles/Article/000046181
The workaround is to specify the codepage in the CHR() statement, like this:
CHR(255, "UTF-8", "1252")
Here it is in your example:
def var v-string as char init "abcdefghijklmnopqrstuvwxyz".
def var v-int as int.
do v-int = 128 to 255:
    assign v-string = replace(v-string, chr(v-int, "UTF-8", "1252"), "").
end.
display v-string.
You should now see the 'y' in the output.
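For reference, the intended behaviour of the filter (drop extended characters, keep plain 'y') looks like this in a small Python sketch. It is not ABL, the helper name strip_extended is made up for illustration, and unlike the loop above it drops every character with a code of 128 or higher, not just 128 to 255.

```python
def strip_extended(text):
    """Drop every character whose code point is 128 or above (hypothetical helper)."""
    return "".join(ch for ch in text if ord(ch) < 128)


print(strip_extended("This is reallÿ strange, yes"))  # This is reall strange, yes
print(chr(255) == chr(121))                           # False
```

Here only the ÿ is removed and every plain 'y' survives, which is the result the workaround is meant to restore.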

This seems to be a bug!
The REPLACE() function returns an unexpected result when replacing character CHR(255) (ÿ) in a String.
The REPLACE() function modifies the value of the target character, but additionally it changes any occurrence of characters 'Y' and 'y' present in the String.
This behavior seems to affect only the character ÿ. Other characters are correctly changed by REPLACE().
Using default codepage ISO-8859-1
Link to knowledgebase

itextsharp text extraction fails for some pdfs

I have a couple of PDF files from which I am not able to extract text. These PDF files were created by converting Word files to PDF.
The main purpose of extracting the text from the PDFs is to index it and make it searchable.
PdfReader reader = new PdfReader(inFileName);
for (int page = 1; page <= reader.NumberOfPages; page++)
{
    // where strPDFText is a StringBuilder
    strPDFText.Append(iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(reader, page) + " ");
}
string str = strPDFText.ToString();
I get an empty string. What could be the reason for this? I am using iTextSharp 5.5.

While the sample PDF provided by the OP indeed indicates that it is a MS Word export, it simply does not contain any text, only an image (which incidentally shows text).
The content of the PDF is this:
/P <</MCID 0>> BDC BT
/F1 11.04 Tf
1 0 0 1 540.1 500.95 Tm
/GS7 gs
0 g
0 G
[( )] TJ
ET
EMC /P <</MCID 1>> BDC q
0.000000071 488.88 612 231.12 re
W* n
468 0 0 219.05 72 500.95 cm
/Image8 Do Q
EMC
As you can see, the only actual text displayed is a single space ([( )] TJ), and the only remaining content is a bitmap image (/Image8 Do).
Thus,
I get an empty string. What could be the reason for the same.
The reason is that there is no text in your document.
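The same check can be approximated on a decoded content stream by collecting the strings drawn by Tj and TJ and seeing whether anything but whitespace remains. The sketch below is in Python, not iTextSharp; the helper name shown_text is made up for illustration, and the regular expressions ignore hex strings, nested parentheses, strings containing "]" and font encodings.

```python
import re

# Literal strings in a simplified content stream (assumption: no nesting).
LITERAL = re.compile(r"\(((?:\\.|[^\\()])*)\)")
# A text-showing operation: one string before Tj, or an array before TJ.
SHOW = re.compile(r"(\((?:\\.|[^\\()])*\))\s*Tj\b|\[([^\]]*)\]\s*TJ\b")


def shown_text(content):
    """Concatenate the literal strings drawn by Tj/TJ (hypothetical helper)."""
    parts = []
    for single, array in SHOW.findall(content):
        parts.extend(LITERAL.findall(single or array))
    return "".join(parts)


page = """/P <</MCID 0>> BDC BT
/F1 11.04 Tf
1 0 0 1 540.1 500.95 Tm
/GS7 gs
0 g
0 G
[( )] TJ
ET
EMC /P <</MCID 1>> BDC q
0.000000071 488.88 612 231.12 re
W* n
468 0 0 219.05 72 500.95 cm
/Image8 Do Q
EMC"""

print(repr(shown_text(page)))          # ' '
print(bool(shown_text(page).strip()))  # False
```

For the content stream above the only drawn string is a single space, so nothing is left after stripping whitespace.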
