itext: Textextracting example not working

I want to extract text from a PDF document page by page, and I am using iText. I used the example code from their website:
PdfReader reader = new PdfReader(pathToFile);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
TextExtractionStrategy strategy = parser.processContent(page, new SimpleTextExtractionStrategy());
The method processContent gives me a NullPointerException. What did I do wrong?
This is the stacktrace I get when using version 5.5.0 with this file:
java.lang.NullPointerException
at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:82)
at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:105)
at org.languageresources.PDFExtraktor.extractTextFromPage(PDFExtractor.java:100)

Given the code snippet and sample document you provided, I tried to reproduce the issue but to no avail; text extraction worked fine.
Furthermore, the stacktrace given:
java.lang.NullPointerException
at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:82)
at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:105)
at org.languageresources.PDFExtraktor.extractTextFromPage(PDFExtractor.java:100)
does not match the alleged version 5.5.0, because PdfReaderContentParser.java:82 in that version is an empty line and PdfReaderContentParser.java:105 does not exist: back then that file was only 85 lines long.
Assuming, though, that you use the current version 5.5.9, the stacktrace makes sense: in that version, PdfReaderContentParser.java:82 is the second of these lines:
PdfDictionary pageDic = reader.getPageN(pageNumber);
PdfDictionary resourcesDic = pageDic.getAsDict(PdfName.RESOURCES);
and pageDic indeed can be null if there is no page pageNumber.
Thus, please check that page, the page number in question, is between 1 and reader.getNumberOfPages() inclusive.
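The range check suggested above can be sketched as a small guard that runs before processContent. This is a minimal sketch, not iText code: PageRangeCheck and requireValidPage are hypothetical names, and numberOfPages stands in for the value returned by reader.getNumberOfPages().

```java
// Hypothetical helper, not part of iText: validates a 1-based page number.
final class PageRangeCheck {
    private PageRangeCheck() {}

    /**
     * Returns the page number unchanged if it lies in 1..numberOfPages,
     * otherwise throws with a message that names the valid range.
     * numberOfPages is assumed to come from reader.getNumberOfPages().
     */
    static int requireValidPage(int page, int numberOfPages) {
        if (page < 1 || page > numberOfPages) {
            throw new IllegalArgumentException(
                "Page " + page + " is outside the valid range 1.." + numberOfPages);
        }
        return page;
    }
}
```

Passing PageRangeCheck.requireValidPage(page, reader.getNumberOfPages()) as the first argument of processContent would then fail with a descriptive message instead of a NullPointerException, for example when a loop starts at 0 although page numbers start at 1.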

Related

Xpages PDF creation using the XMLWorker class

Using SSJS I've successfully created a PDF using the simple HTML parser that comes with iText but the simple HTML parser doesn't respect CSS and is very limited. I downloaded the XMLWorker class from the iText site and have tried to use that instead but my knowledge of working out how to call Java packages is too limited. All the examples I can find use Java and refer to the classes directly, eg.
Document newPDF = new Document();
But in SSJS we have to use dot notation, eg.
var newPDF:com.itextpdf.text.Document = new com.itextpdf.text.Document();
This - I think - is where I stumble. My code looks like this:
function createLPO2(pReqDoc:NotesDocument) {
    importPackage(com.itextpdf);
    //importPackage(com.itextpdf.tool.xml.XMLWorkerHelper);
    importPackage(java.io);
    var con = facesContext.getExternalContext();
    var response:com.ibm.xsp.webapp.XspHttpServletResponse = con.getResponse();
    response.setContentType("application/pdf");
    response.setHeader("Cache-Control", "no-cache");
    response.setDateHeader("Expires", -1);
    response.setHeader("Content-Disposition","attachment; filename=\"LPO_" + pReqDoc.getItemValueString("RequisitionNo") + ".pdf\"");
    var newPDF:com.itextpdf.text.Document = new com.itextpdf.text.Document();
    var writer = com.itextpdf.text.pdf.PdfWriter.getInstance(newPDF,response.getOutputStream());
    var xmlWorkerHelper = com.itextpdf.tool.xml.XMLWorkerHelper.getInstance();
    var strHTML = getTestHTML(); //this is the HTML used in the examples on the iText site
    xmlWorkerHelper.parseXHtml(writer, newPDF, new java.io.StringReader(strHTML));
    newPDF.close();
    writer.close();
    facesContext.responseComplete();
}
If I run this script as it is, I get a script error on the Domino console. If I remove the comment on the line importPackage(com.itextpdf.tool.xml.XMLWorkerHelper); it gives a completely different error. I suspect I have to import the XMLWorkerHelper package and not just the com.itextpdf package. I thought that if I opened the jar file using a tool like 7-zip I could work out the path, which is how I arrived at com.itextpdf.tool.xml.XMLWorkerHelper.
Is this right? If so, why does my script fail?

Rob,
seriously, don't try to do that in SSJS. iText is all Java; if you try to wrangle it from a different language, it will stress you out. Create a wrapper class that has a method that takes an OutputStream and whatever data (Document, View etc.) you need. Obtain the OutputStream in your SSJS and call the function. Look for the XAgent XSnippet on OpenNTF and my blog series (the last two are missing - bear with me) on PDF creation.
One word of caution: iText 5 is licensed under the AGPL, so you either license your software under the AGPL too, buy a commercial iText license, or look for alternatives like Apache PDFBox or Apache FOP. A second caution: HTML to PDF is a PITA. You could look at a commercial tool, e.g. from Swing Software, or change your approach.
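The wrapper approach described above might look like the following. This is only a sketch under stated assumptions: PdfExportSketch, HtmlToPdf, and export are hypothetical names. The actual iText calls (the Document, PdfWriter.getInstance, and XMLWorkerHelper.getInstance().parseXHtml sequence from the question) would live inside an HtmlToPdf implementation, which is left out here so that the example compiles without iText on the classpath.

```java
import java.io.IOException;
import java.io.OutputStream;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical wrapper: SSJS would only obtain the OutputStream and call export().
final class PdfExportSketch {

    /** Hypothetical seam for the iText-specific conversion code. */
    interface HtmlToPdf {
        void convert(Reader xhtml, OutputStream out) throws IOException;
    }

    private final HtmlToPdf converter;

    PdfExportSketch(HtmlToPdf converter) {
        this.converter = converter;
    }

    /** Hands the XHTML to the converter, then flushes out; the caller closes out. */
    void export(String xhtml, OutputStream out) throws IOException {
        converter.convert(new StringReader(xhtml), out);
        out.flush();
    }
}
```

With a class like this deployed, the SSJS side would shrink to obtaining response.getOutputStream() and passing it to export, so all iText types stay on the Java side.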

finding hyperlinks in a pdf file using iTextSharp

I'm getting an error while extracting hyperlinks from a PDF file, on this line:
PdfDictionary AnnotationAction = (PdfDictionary)AnnotationDictionary.Get(PdfName.A);
I get the exception below. Can somebody please help with this?
Unable to cast object of type
'iTextSharp.text.pdf.PRIndirectReference' to type
'iTextSharp.text.pdf.PdfDictionary'.

Instead of
PdfDictionary AnnotationAction = (PdfDictionary)AnnotationDictionary.Get(PdfName.A);
please try
PdfDictionary AnnotationAction = AnnotationDictionary.GetAsDict(PdfName.A);
In your document, the value of the /A key does not appear to be a direct dictionary but a reference to one. This reference has to be resolved; GetAsDict does this, and the cast, under the hood.
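To illustrate why the plain cast fails and what resolving the reference means, here is a toy model in Java. It is only a sketch: Obj, Dict, and Ref are hypothetical stand-ins, not iText or iTextSharp classes, and the real GetAsDict may differ in detail.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model only: these types imitate the idea, they are not iText classes.
final class IndirectRefSketch {
    private IndirectRefSketch() {}

    /** Marker for anything that can be stored as a dictionary value. */
    interface Obj {}

    /** Stand-in for an indirect reference that points at another object. */
    static final class Ref implements Obj {
        final Obj target;

        Ref(Obj target) {
            this.target = target;
        }
    }

    /** Stand-in for a dictionary whose values may be direct or referenced. */
    static final class Dict implements Obj {
        final Map<String, Obj> entries = new HashMap<>();

        /** Resolves a reference first, then returns the value only if it is a dictionary. */
        Dict getAsDict(String key) {
            Obj value = entries.get(key);
            if (value instanceof Ref) {
                value = ((Ref) value).target;
            }
            return value instanceof Dict ? (Dict) value : null;
        }
    }
}
```

In this model, casting entries.get("A") directly to Dict throws a ClassCastException whenever the stored value is a Ref, which mirrors the exception in the question, while getAsDict("A") returns the referenced dictionary.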

Eclipse plugin development AndroidXmlFormatter

I am building an eclipse plugin which modifies an android XML(layout) file. I use a dom parser internally to produce the output XML. However the XML formatting is messed up.
I want to use the android xml-formatting mechanism. I tried this -
//xmlFile is a IFile
IDocumentProvider provider = new TextFileDocumentProvider();
provider.connect(xmlFile);
IDocument document = provider.getDocument(xmlFile);
xmlFile.setContents(inputStream, IFile.ENCODING_UTF_8, new NullProgressMonitor());
AndroidXmlFormatter a=new AndroidXmlFormatter();
IFormattingContext context=new FormattingContext();
context.setProperty(FormattingContextProperties.CONTEXT_DOCUMENT, Boolean.TRUE);
a.format(document, context);
But, the document isn't formatted. :(
What could be the problem? Are there alternatives for my problem?

I still have not found a definitive answer. As far as formatting goes, I used the package com.android.ide.eclipse.adt.internal.editors.formatting to format the XML in the default Android style.
XmlFormatPreferences xfp=XmlFormatPreferences.create();
xfp.joinLines=true;
xfp.removeEmptyLines=true;
xfp.useEclipseIndent=true;
return XmlPrettyPrinter.prettyPrint(xmlAsWriter.toString(), xfp, XmlFormatStyle.FILE, "\n");
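If depending on an internal ADT package is undesirable, the JDK's own javax.xml.transform API can at least indent XML output. This is a different technique from XmlPrettyPrinter and does not reproduce the Android formatting style. XmlIndentSketch and indent are hypothetical names, and the indent-amount property is implementation-specific, so it may be ignored by some transformer implementations.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical helper: plain JDK indentation, not the Android formatter.
final class XmlIndentSketch {
    private XmlIndentSketch() {}

    /** Runs an identity transform with indentation switched on. */
    static String indent(String xml) {
        try {
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.INDENT, "yes");
            // Implementation-specific hint; honoured by the JDK's built-in transformer.
            transformer.setOutputProperty("{http://xml.apache.org/xslt}indent-amount", "4");
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(xml)), new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("Could not format XML", e);
        }
    }
}
```

The returned string could then be written back to the IFile in place of the unformatted DOM output.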

iTextSharp dll 4.1.2.0 issue causing blank pages in a merged pdf

I am using the iTextSharp DLL version 4.1.2.0 for PDF merging, but it leaves some pages blank in the final merged PDF. This issue is not present in the latest DLL.
I am using .NET Framework 1.1, so I can't use the latest DLL because it doesn't support that framework version.
Please suggest what I should do about this.
Thanks

Yes, the suggestions I found all amount to "use the latest DLL and the problem goes away", but that does not help when you need a DLL that supports .NET Framework 1.1, which predates the latest DLL.
The problem in my case was that some PDFs were corrupted and could not be parsed correctly, which is why the exception "Attempt to Read Past the End of The Stream" was thrown. I found on the web that this happens when a PDF has extra characters after the EOF marker. So we just have to remove all the characters after the marker, write the result to a new PDF file, and use that file instead. It worked for me.
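using System.IO;
using System.Net;
using System.Text;

// Copies the PDF at ofilepath to nfilepath, dropping any bytes after the last %%EOF marker.
public void RemoveExtraBytes(string ofilepath, string nfilepath)
{
    WebClient client = new WebClient();
    byte[] buffer = client.DownloadData(ofilepath);
    // ASCII decoding yields one character per byte, so string indices match byte offsets.
    string str = ASCIIEncoding.ASCII.GetString(buffer);
    // Keep the whole file if no marker is found.
    int length = buffer.Length;
    int position = str.LastIndexOf("%%EOF");
    if (position >= 0)
    {
        // Keep the marker itself, which is 5 bytes long.
        length = position + 5;
    }
    Stream stream = new System.IO.FileStream(nfilepath, FileMode.Create);
    stream.Write(buffer, 0, length);
    stream.Close();
}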
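For readers working on the Java side of iText, the same trimming idea can be sketched with the standard library alone. This is a minimal sketch, not a repair tool: PdfTailTrimmer and trimAfterLastEof are hypothetical names, the sketch keeps the %%EOF marker itself, and it returns the input unchanged when no marker is present.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper: drops trailing bytes that follow the last %%EOF marker.
final class PdfTailTrimmer {
    private static final String EOF_MARKER = "%%EOF";

    private PdfTailTrimmer() {}

    /** Returns a copy of data that ends right after the last %%EOF marker. */
    static byte[] trimAfterLastEof(byte[] data) {
        // ISO-8859-1 maps every byte to exactly one char, so string indices equal byte offsets.
        String text = new String(data, StandardCharsets.ISO_8859_1);
        int position = text.lastIndexOf(EOF_MARKER);
        if (position < 0) {
            // No marker found: return an unchanged copy rather than an empty file.
            return data.clone();
        }
        return Arrays.copyOf(data, position + EOF_MARKER.length());
    }
}
```

The bytes could be read and written with java.nio.file.Files.readAllBytes and Files.write before handing the cleaned file to the merge code.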

Intricate sites using htmlunit

I'm trying to dump the whole contents of a certain site using HtmlUnit, but when I try to do this on a certain (rather intricate) site, I get an empty file (not an empty file per se, but it has an empty head tag, an empty body tag and that's it).
The site is https://www.abcdin.cl/abcdin/abcdin.nsf#https://www.abcdin.cl/abcdin/abcdin.nsf/linea?openpage&cat=Audio&cattxt=TV%20y%20Audio&catpos=03&linea=LCD&lineatxt=LCD%20&
And here's my code:
BufferedWriter writer = new BufferedWriter(new FileWriter(fullOutputPath));
HtmlPage page;
final WebClient webClient = new WebClient(BrowserVersion.INTERNET_EXPLORER_8);
webClient.setCssEnabled(false);
webClient.setPopupBlockerEnabled(true);
webClient.setRedirectEnabled(true);
webClient.setThrowExceptionOnScriptError(false);
webClient.setThrowExceptionOnFailingStatusCode(false);
webClient.setUseInsecureSSL(true);
webClient.setJavaScriptEnabled(true);
page = webClient.getPage(url);
dumpString += page.asXml();
writer.write(dumpString);
writer.close();
webClient.closeAllWindows();
Some people say that I need to introduce a pause in my code, since the page takes a while to load in Google Chrome, but I set long pauses and it doesn't work.
Thanks in advance.

Just some ideas...
Retrieving that URL with wget returns a non-trivial HTML file. So does running your code with webClient.setJavaScriptEnabled(false). So it's definitely something to do with the JavaScript in the page.
With JavaScript enabled, I see from the logs that a bunch of JavaScript jobs are being queued up, and I see corresponding errors like this:
EcmaError: lineNumber=[49] column=[0] lineSource=[<no source>] name=[TypeError] sourceName=[https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js] message=[TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)]
com.gargoylesoftware.htmlunit.ScriptException: TypeError: Cannot read property "nodeType" from undefined (https://www.abcdin.cl/js/jquery/jquery-1.4.2.min.js#49)
at com.gargoylesoftware.htmlunit.javascript.JavaScriptEngine$HtmlUnitContextAction.run(JavaScriptEngine.java:601)
Maybe those jobs are meant to populate your HTML? So when they fail, the resulting HTML is empty?
The error looks strange, as HtmlUnit usually has no issues with jQuery. I suspect the issue is with the code calling that particular line of the jQuery library.
