PDF text extraction using iText

We are doing research in information extraction and would like to use iText.
We are in the process of exploring iText. According to the literature we have reviewed, iText is the best tool to use. Is it possible to extract text from a PDF line by line with iText? I have read a related question here on Stack Overflow, but it only reads the text; it does not extract it. Can anyone help me with my problem? Thank you.

As Theodore said, you can extract text from a PDF and, as Chris pointed out, this works
as long as it is actually text (not outlines or bitmaps).
The best thing to do is buy Bruno Lowagie's book iText in Action. In the second edition, chapter 15 covers extracting text.
You can also look at his site for examples: http://itextpdf.com/examples/iia.php?id=279
You can parse the PDF to create a plain text file.
Here is a code example:
/*
* This class is part of the book "iText in Action - 2nd Edition"
* written by Bruno Lowagie (ISBN: 9781935182610)
* For more info, go to: http://itextpdf.com/examples/
* This example only works with the AGPL version of iText.
*/
package part4.chapter15;
import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;
public class ExtractPageContent {
/** The original PDF that will be parsed. */
public static final String PREFACE = "resources/pdfs/preface.pdf";
/** The resulting text file. */
public static final String RESULT = "results/part4/chapter15/preface.txt";
/**
* Parses a PDF to a plain text file.
* @param pdf the original PDF
* @param txt the resulting text
* @throws IOException
*/
public void parsePdf(String pdf, String txt) throws IOException {
PdfReader reader = new PdfReader(pdf);
PdfReaderContentParser parser = new PdfReaderContentParser(reader);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
out.println(strategy.getResultantText());
}
reader.close();
out.flush();
out.close();
}
/**
* Main method.
* @param args no arguments needed
* @throws IOException
*/
public static void main(String[] args) throws IOException {
new ExtractPageContent().parsePdf(PREFACE, RESULT);
}
}
Note the license: this example only works with the AGPL version of iText.
The other examples on that site show how to leave out parts of the text or how to extract parts of the PDF.
Hope it helps.

iText allows you to do that, but there is no guarantee about the granularity of the text blocks; those depend on the software that produced your documents.
It's quite possible that each word, or even each letter, has its own text block. Nor do the blocks need to be in reading order: for reliable results, you may have to reorder text blocks based on their coordinates. You may also have to work out whether to insert spaces between text blocks.
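The reordering and space insertion described above can be sketched without any iText dependency. This is a minimal illustration, not iText's own algorithm: the Chunk class, the assemble method, and both tolerance parameters are hypothetical names, and the chunk positions are assumed to have been collected beforehand (for example from a custom extraction listener).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class LineAssembler {

    /** Hypothetical value type: one extracted text chunk and its position on the page. */
    static final class Chunk {
        final String text;
        final float x;      // left edge
        final float y;      // baseline; PDF coordinates grow upward
        final float width;  // rendered width of the chunk

        Chunk(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Groups chunks into lines by baseline, orders each line left to right,
     * and inserts a space wherever the horizontal gap exceeds gapThreshold.
     */
    static List<String> assemble(List<Chunk> chunks, float yTolerance, float gapThreshold) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // Top of the page first: larger y means higher up.
        sorted.sort(Comparator.comparingDouble((Chunk c) -> c.y).reversed());

        // Chunks whose baselines differ by at most yTolerance share a line.
        List<List<Chunk>> lines = new ArrayList<>();
        for (Chunk c : sorted) {
            boolean startNewLine = lines.isEmpty()
                    || Math.abs(lines.get(lines.size() - 1).get(0).y - c.y) > yTolerance;
            if (startNewLine) {
                lines.add(new ArrayList<>());
            }
            lines.get(lines.size() - 1).add(c);
        }

        List<String> result = new ArrayList<>();
        for (List<Chunk> line : lines) {
            line.sort(Comparator.comparingDouble((Chunk c) -> c.x));
            StringBuilder sb = new StringBuilder();
            Chunk previous = null;
            for (Chunk c : line) {
                // A visible gap between the previous chunk's right edge and this one means a space.
                if (previous != null && c.x - (previous.x + previous.width) > gapThreshold) {
                    sb.append(' ');
                }
                sb.append(c.text);
                previous = c;
            }
            result.add(sb.toString());
        }
        return result;
    }
}
```

Suitable values for yTolerance and gapThreshold depend on the font size and on how the PDF was produced, so they usually need tuning per document set.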

In newer versions of iText:
public static void main(String[] args) throws Exception {
try (var document = new PdfDocument(new PdfReader("your.pdf"))) {
// Page numbers are 1-based, so the loop must include the last page.
for (int i = 1; i <= document.getNumberOfPages(); i++) {
// Use a fresh strategy per page; a reused SimpleTextExtractionStrategy keeps accumulating text.
var strategy = new SimpleTextExtractionStrategy();
String text = PdfTextExtractor.getTextFromPage(document.getPage(i), strategy);
System.out.println(text);
}
}
}


How to access query param from url in SOAP WS?

I want to create a SOAP service method that consumes data passed as a query parameter in the URL, but I am not sure how to pass data as a query parameter in a SOAP URL. I have created the method below; it accepts data, but that data comes from the SOAP request.
Also, please let me know how to pass data in the query parameter from SoapUI:
@WebMethod
public String test(String str){
System.out.println("Test method called:"+ str);
return str;
}
Any help would be much appreciated. Thanks in advance!
The following code uses the Servlet context to obtain the query param(s). I've provided two methods. The first method just uses the numbers passed via the SOAP request. The second method deals with one or more query parameters passed and gives two examples of accessing the query parameters.
package net.myco.ws;
import java.util.Iterator;
import java.util.Map;
import java.util.Map.Entry;
import javax.annotation.Resource;
import javax.jws.WebMethod;
import javax.jws.WebParam;
import javax.jws.WebService;
import javax.servlet.http.HttpServletRequest;
import javax.xml.ws.WebServiceContext;
import javax.xml.ws.handler.MessageContext;
@WebService
public class SOAPWSWithQueryParam {
@Resource
private WebServiceContext ctx;
/**
* Default no arg constructor
*/
public SOAPWSWithQueryParam() {
super();
}
/*
* Web Service that adds two numbers together
*/
@WebMethod
public int addTwoNumbers(
@WebParam(name="inputNumber1") int inputNumber1,
@WebParam(name="inputNumber2") int inputNumber2
){
int result;
result = inputNumber1 + inputNumber2;
return result;
}
/*
* Web Service that adds two numbers together, *but* also inspects
* the HTTP POST for a single queryParam and adds that as well.
*
* Example URL:
* http://localhost:8080/SOAPWSWithQueryParam/SOAPWSWithQueryParam?number1=8&number2=6
*
* Note, we're only getting the first query param, we could split based on "&" and get
* other params.
*
*/
@WebMethod
public int addThreeNumbers(
@WebParam(name="inputNumber1") int inputNumber1,
@WebParam(name="inputNumber2") int inputNumber2
){
int result;
int queryStringNumber1 = 0;
Map <String, String[]>quesryStringMap;
HttpServletRequest servletRequest = (HttpServletRequest) ctx.getMessageContext().get(MessageContext.SERVLET_REQUEST);
/*
* Likely want to add a try catch on this or other logic in case there isn't a query string param.
* Also, because the example URL contains a second param, we split again at the "&" in URL else the
* result would be "8&number2"
*/
queryStringNumber1 = Integer.valueOf(servletRequest.getQueryString().split("=")[1].split("&")[0]);
/*
* The second and more elegant way of accomplishing it is using the Parameters Map, because we're
* adding the second way of doing it, the returned value is increased as it was 17 based on our URL
* and the WS two input numbers. Now it becomes 31.
*
*/
quesryStringMap = servletRequest.getParameterMap();
Iterator<Entry<String, String[]>> mapIterator = quesryStringMap.entrySet().iterator();
while (mapIterator.hasNext()) {
Map.Entry<String, String[]> pair = (Entry<String, String[]>)mapIterator.next();
System.out.println(pair.getKey() + " = " + pair.getValue()[0]);
/*
* Prints:
07:43:57,666 INFO [stdout] (default task-10) number1 = 8
07:43:57,666 INFO [stdout] (default task-10) number2 = 6
*/
//Add the other param values
queryStringNumber1 += Integer.valueOf(pair.getValue()[0]);
mapIterator.remove();
}
result = inputNumber1 + inputNumber2 + queryStringNumber1;
return result;
}
}
In SoapUI, after you have created a new SOAP project, you can call both methods. The first request calls the web service method that just adds together the two numbers passed as arguments in the SOAP body. The second request calls addThreeNumbers, which first reads a single query parameter, even though there are two, and assigns it to queryStringNumber1; it then uses an iterator to walk through the parameter map and adds every passed value to queryStringNumber1. Finally, it adds the SOAP input values to queryStringNumber1 and returns the total.
You could also use the BindingProvider; see the question "purpose of bindingprovider in jax-ws web service". A web search turns up even more examples.
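The manual split("=") and split("&") handling in addThreeNumbers breaks as soon as the query string is missing or a value is not numeric. A small helper along the following lines is one way to make that step sturdier. It is a sketch under stated assumptions: QueryStringSum, parse, and sumNumeric are hypothetical names, and the helper works on a plain String such as the one returned by servletRequest.getQueryString().

```java
import java.io.UnsupportedEncodingException;
import java.net.URLDecoder;
import java.util.LinkedHashMap;
import java.util.Map;

public class QueryStringSum {

    /** Splits "a=1&b=2" into an ordered map; a null or empty query yields an empty map. */
    static Map<String, String> parse(String query) {
        Map<String, String> params = new LinkedHashMap<>();
        if (query == null || query.isEmpty()) {
            return params;
        }
        for (String pair : query.split("&")) {
            int eq = pair.indexOf('=');
            if (eq <= 0) {
                continue; // skip fragments without a name or without '='
            }
            // If a name repeats, the last value wins in this sketch.
            params.put(decode(pair.substring(0, eq)), decode(pair.substring(eq + 1)));
        }
        return params;
    }

    private static String decode(String s) {
        try {
            return URLDecoder.decode(s, "UTF-8");
        } catch (UnsupportedEncodingException e) {
            throw new IllegalStateException("UTF-8 is always supported", e);
        }
    }

    /** Adds up every value that parses as an int and ignores the rest. */
    static int sumNumeric(Map<String, String> params) {
        int sum = 0;
        for (String value : params.values()) {
            try {
                sum += Integer.parseInt(value.trim());
            } catch (NumberFormatException ignored) {
                // non-numeric parameters contribute nothing
            }
        }
        return sum;
    }
}
```

With the example URL from the code comment, sumNumeric(parse("number1=8&number2=6")) adds 8 and 6, and a request without any query string simply contributes 0 instead of throwing.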

How to process images in XMLWorkerHelper.ParseToElementList [duplicate]

I am posting this question because many developers ask more or less the same question in different forms. I will answer this question myself (I am the Founder/CTO of iText Group), so that it can be a "Wiki-answer." If the Stack Overflow "documentation" feature still existed, this would have been a good candidate for a documentation topic.
The source file:
I am trying to convert the following HTML file to PDF:
<html>
<head>
<title>Colossal (movie)</title>
<style>
.poster { width: 120px;float: right; }
.director { font-style: italic; }
.description { font-family: serif; }
.imdb { font-size: 0.8em; }
a { color: red; }
</style>
</head>
<body>
<img src="img/colossal.jpg" class="poster" />
<h1>Colossal (2016)</h1>
<div class="director">Directed by Nacho Vigalondo</div>
<div class="description">Gloria is an out-of-work party girl
forced to leave her life in New York City, and move back home.
When reports surface that a giant creature is destroying Seoul,
she gradually comes to the realization that she is somehow connected
to this phenomenon.
</div>
<div class="imdb">Read more about this movie on
IMDB
</div>
</body>
</html>
In a browser, this HTML looks like this:
The problems I encountered:
HTMLWorker doesn't take CSS into account at all
When I use HTMLWorker, I need to create an ImageProvider to avoid an error that informs me that the image can't be found. I also need to create a StyleSheet instance to change some of the styles:
public static class MyImageFactory implements ImageProvider {
public Image getImage(String src, Map<String, String> h,
ChainedProperties cprops, DocListener doc) {
try {
return Image.getInstance(
String.format("resources/html/img/%s",
src.substring(src.lastIndexOf("/") + 1)));
} catch (DocumentException e) {
e.printStackTrace();
} catch (IOException e) {
e.printStackTrace();
}
return null;
}
}
public static void main(String[] args) throws IOException, DocumentException {
Document document = new Document();
PdfWriter.getInstance(document, new FileOutputStream("results/htmlworker.pdf"));
document.open();
StyleSheet styles = new StyleSheet();
styles.loadStyle("imdb", "size", "-3");
HTMLWorker htmlWorker = new HTMLWorker(document, null, styles);
HashMap<String,Object> providers = new HashMap<String, Object>();
providers.put(HTMLWorker.IMG_PROVIDER, new MyImageFactory());
htmlWorker.setProviders(providers);
htmlWorker.parse(new FileReader("resources/html/sample.html"));
document.close();
}
The result looks like this:
For some reason, HTMLWorker also shows the content of the <title> tag. I don't know how to avoid this. The CSS in the header isn't parsed at all, I have to define all the styles in my code, using the StyleSheet object.
When I look at my code, I see that plenty of objects and methods I'm using are deprecated:
So I decided to upgrade to using XML Worker.
Images aren't found when using XML Worker
I tried the following code:
public static final String DEST = "results/xmlworker1.pdf";
public static final String HTML = "resources/html/sample.html";
public void createPdf(String file) throws IOException, DocumentException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
document.open();
XMLWorkerHelper.getInstance().parseXHtml(writer, document,
new FileInputStream(HTML));
document.close();
}
This resulted in the following PDF:
Instead of Times-Roman, the default font Helvetica is used; this is typical for iText (I should have defined a font explicitly in my HTML). Otherwise, the CSS seems to be respected, but the image is missing, and I didn't get an error message.
With HTMLWorker, an exception was thrown, and I was able to fix the problem by introducing an ImageProvider. Let's see if this works for XML Worker.
Not all CSS styles are supported in XML Worker
I adapted my code like this:
public static final String DEST = "results/xmlworker2.pdf";
public static final String HTML = "resources/html/sample.html";
public static final String IMG_PATH = "resources/html/";
public void createPdf(String file) throws IOException, DocumentException {
Document document = new Document();
PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
document.open();
CSSResolver cssResolver =
XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
htmlContext.setImageProvider(new AbstractImageProvider() {
public String getImageRootPath() {
return IMG_PATH;
}
});
PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
XMLWorker worker = new XMLWorker(css, true);
XMLParser p = new XMLParser(worker);
p.parse(new FileInputStream(HTML));
document.close();
}
My code is much longer, but now the image is rendered:
The image is larger than when I rendered it using HTMLWorker which tells me that the CSS attribute width for the poster class is taken into account, but the float attribute is ignored. How do I fix this?
The remaining question:
So the question boils down to this: I have a specific HTML file that I try to convert to PDF. I have gone through a lot of work, fixing one problem after the other, but there is one specific problem that I can't solve: how do I make iText respect CSS that defines the position of an element, such as float: right?
Additional question:
When my HTML contains form elements (such as <input>), those form elements are ignored.
Why your code doesn't work
As explained in the introduction of the HTML to PDF tutorial, HTMLWorker has been deprecated many years ago. It wasn't intended to convert complete HTML pages. It doesn't know that an HTML page has a <head> and a <body> section; it just parses all the content. It was meant to parse small HTML snippets, and you could define styles using the StyleSheet class; real CSS wasn't supported.
Then came XML Worker. XML Worker was meant as a generic framework to parse XML. As a proof of concept, we decided to write some XHTML to PDF functionality, but we didn't support all of the HTML tags. For instance: forms weren't supported at all, and it was very hard to support CSS that is used to position content. Forms in HTML are very different from forms in PDF. There was also a mismatch between the iText architecture and the architecture of HTML + CSS. Gradually, we extended XML Worker, mostly based on requests from customers, but XML Worker became a monster with many tentacles.
Eventually, we decided to rewrite iText from scratch, with the requirements for HTML + CSS conversion in mind. This resulted in iText 7. On top of iText 7, we created several add-ons, the most important one in this context being pdfHTML.
How to solve the problem
Using the latest version of iText (iText 7.1.0 + pdfHTML 2.0.0) the code to convert the HTML from the question to PDF is reduced to this snippet:
public static final String SRC = "src/main/resources/html/sample.html";
public static final String DEST = "target/results/sample.pdf";
public void createPdf(String src, String dest) throws IOException {
HtmlConverter.convertToPdf(new File(src), new File(dest));
}
The result looks like this:
As you can see, this is pretty much the result you'd expect. Since iText 7.1.0 / pdfHTML 2.0.0, the default font is Times-Roman. The CSS is being respected: the image is now floating on the right.
Some additional thoughts.
Developers are often reluctant to upgrade to a newer iText version when I advise them to move to iText 7 / pdfHTML 2. Allow me to respond to the top three arguments I hear:
I need to use the free iText, and iText 7 isn't free / the pdfHTML add-on is closed source.
iText 7 is released using the AGPL, just like iText 5 and XML Worker. The AGPL allows free use in the sense of free of charge in the context of open source projects. If you are distributing a closed source / proprietary product (e.g. you use iText in a SaaS context), you can't use iText for free; in that case, you have to purchase a commercial license. This was already true for iText 5; this is still true for iText 7. As for versions prior to iText 5: you shouldn't use these at all. Regarding pdfHTML: the first versions were indeed only available as closed source software. We have had heavy discussion within iText Group: on the one hand, there were the people who wanted to avoid the massive abuse by companies who don't listen to their developers when those developers tell the powers that be that open source isn't the same as free. Developers were telling us that their boss forced them to do the wrong thing, and that they couldn't convince their boss to purchase a commercial license. On the other hand, there were the people who argued that we shouldn't punish developers for the wrong behavior of their bosses. Eventually, the people in favor of open sourcing pdfHTML, that is: the developers at iText, won the argument. Please prove that they weren't wrong, and use iText correctly: respect the AGPL if you're using iText for free; make sure that your boss purchases a commercial license if you're using iText in a closed source context.
I need to maintain a legacy system, and I have to use an old iText version.
Seriously? Maintenance also involves applying upgrades and migrating to new versions of the software you're using. As you can see, the code needed when using iText 7 and pdfHTML is very simple, and less error-prone than the code needed before. A migration project shouldn't take too long.
I've only just started and I didn't know about iText 7; I only found out after I finished my project.
That's why I'm posting this question and answer. Think of yourself as an eXtreme Programmer. Throw away all of your code, and start anew. You'll notice that it's not as much work as you imagined, and you'll sleep better knowing that you've made your project future-proof because iText 5 is being phased out. We still offer support to paying customers, but eventually, we'll stop supporting iText 5 altogether.
Use iText 7 and this code:
public void generatePDF(String htmlFile) {
try {
// htmlFile holds the HTML markup itself (a String), not a file path
// Setting destination; dirPath is assumed to be a field defined elsewhere
FileOutputStream fileOutputStream = new FileOutputStream(new File(dirPath + "/USER-16-PF-Report.pdf"));
PdfWriter pdfWriter = new PdfWriter(fileOutputStream);
ConverterProperties converterProperties = new ConverterProperties();
PdfDocument pdfDocument = new PdfDocument(pdfWriter);
//For setting the PAGE SIZE
pdfDocument.setDefaultPageSize(new PageSize(PageSize.A3));
Document document = HtmlConverter.convertToDocument(htmlFile, pdfDocument, converterProperties);
document.close();
}
catch (Exception e) {
e.printStackTrace();
}
}
To convert a static HTML page, including any CSS styles:
HtmlConverter.convertToPdf(new File("./pdf-input.html"),new File("demo-html.pdf"));
For Spring Boot users: convert a dynamic HTML page using Spring Boot and Thymeleaf:
@RequestMapping(path = "/pdf")
public ResponseEntity<?> getPDF(HttpServletRequest request, HttpServletResponse response) throws IOException {
/* Do Business Logic*/
Order order = OrderHelper.getOrder();
/* Create HTML using Thymeleaf template Engine */
WebContext context = new WebContext(request, response, servletContext);
context.setVariable("orderEntry", order);
String orderHtml = templateEngine.process("order", context);
/* Setup Source and target I/O streams */
ByteArrayOutputStream target = new ByteArrayOutputStream();
ConverterProperties converterProperties = new ConverterProperties();
converterProperties.setBaseUri("http://localhost:8080");
/* Call convert method */
HtmlConverter.convertToPdf(orderHtml, target, converterProperties);
/* extract output as bytes */
byte[] bytes = target.toByteArray();
/* Send the response as downloadable PDF */
return ResponseEntity.ok()
.header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=order.pdf")
.contentType(MediaType.APPLICATION_PDF)
.body(bytes);
}

How to convert string containing macro in wiki markup to xhtml format

During the migration from Confluence 3.5.13 to Confluence 5.0.3, my MacroMigration class needs to convert a string containing some text and a macro to XHTML format.
I've tried the following code:
WikiStyleRenderer wikiStyleRenderer = (WikiStyleRenderer) ContainerManager.getComponent("wikiStyleRenderer");
String result= wikiStyleRenderer.convertWikiToXHtml(new PageContext(context.getEntity()), body.getBody());
It works on simple text, but as soon as the text contains a reference to a macro (for example {info:title=int Random(int range)}{info}), the result is just a line feed ('\n').
I got it working using the com.atlassian.confluence.xhtml.api.XhtmlContent interface. Here is how I did it:
private XhtmlContent xhtmlContent;
public void setXhtmlContent(XhtmlContent xhtmlContent) {
this.xhtmlContent = xhtmlContent;
}
public MacroDefinition migrate(MacroDefinition macroDefinition,
ConversionContext context) {
MacroBody body = macroDefinition.getBody();
List<RuntimeException> migrationExceptions = new ArrayList<RuntimeException>();
String resultContent;
resultContent = xhtmlContent.convertWikiToStorage(body.getBody(),
context, migrationExceptions);
The setter setXhtmlContent() is just there for Confluence to inject the right instance.

Play 2.0 date format

I'm trying to format a date in a Scala template in Play. So far I've written this:
<p>@DateFormat.getInstance().format(deadline)</p>
Where deadline is the date I'm outputting to the web page. However, this uses the JVM's locale and not the one selected by the user.
My app currently supports two locales, Norwegian (no) and English (en). This works well for messages, but not for Dates. So I tried adding a GlobalSettings to intercept each request as shown below, but apparently it's never invoked:
import java.lang.reflect.Method;
import java.util.Locale;
import org.springframework.context.i18n.LocaleContext;
import org.springframework.context.i18n.LocaleContextHolder;
import play.GlobalSettings;
import play.i18n.Lang;
import play.mvc.Action;
import play.mvc.Http.Request;
public class Global extends GlobalSettings {
@SuppressWarnings("rawtypes")
@Override
public Action onRequest(final Request request, final Method actionMethod) {
LocaleContextHolder.setLocaleContext(new LocaleContext() {
public Locale getLocale() {
Lang preferred = Lang.preferred(request.acceptLanguages());
return preferred.toLocale();
}
});
return super.onRequest(request, actionMethod);
}
}
Does someone have a solution to this problem? Is it a known bug in Play? I'm using version 2.0.4.
Thanks!
I tried estmatic's solution, but it didn't discriminate properly between country variants of the same language. For example, if my browser's preferred languages were "en_AU" and "en_US", in that order, it would only use the "en" part, which resulted in a US-style date (with the month first) rather than an Aussie-style one (with the day first, as is right and proper).
My solution was to create a helper class as follows:
public class Formatter extends Controller {
private static final int DATE_STYLE = LONG;
private static final int TIME_STYLE = SHORT;
/**
* Formats the given Date as a date and time, using the locale of the current
* request's first accepted language.
*
* @param date the date to format (required)
* @return the formatted date
* @see play.mvc.Http.Request#acceptLanguages()
*/
public static String formatDateTime(final Date date) {
final Locale locale = getPreferredLocale();
return DateFormat.getDateTimeInstance(
DATE_STYLE, TIME_STYLE, locale).format(date);
}
private static Locale getPreferredLocale() {
final List<Lang> acceptedLanguages = request().acceptLanguages();
final Lang preferredLanguage = acceptedLanguages.isEmpty() ?
Lang.preferred(acceptedLanguages) : acceptedLanguages.get(0);
return new Locale(preferredLanguage.language(), preferredLanguage.country());
}
}
Then in my Scala templates, all I had to do was use (for example):
@import my.package.Formatter
...
Date = @Formatter.formatDateTime(someDate)
This seems cleaner to me than having a lot of Locale construction logic in the templates.
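The locale-selection step can also be sketched in plain Java, independent of Play. This is an illustration only: LocalePicker, pick, and formatDate are hypothetical names, and the accepted languages are assumed to arrive as tags such as "en_AU" or "en-AU", in order of preference.

```java
import java.text.DateFormat;
import java.util.Date;
import java.util.List;
import java.util.Locale;

public class LocalePicker {

    /**
     * Returns a Locale for the first usable tag, keeping the country part,
     * or the fallback when the list holds no usable tag.
     */
    static Locale pick(List<String> acceptedTags, Locale fallback) {
        for (String tag : acceptedTags) {
            // Locale.forLanguageTag expects BCP 47 tags ("en-AU"), so normalise underscores.
            Locale candidate = Locale.forLanguageTag(tag.replace('_', '-'));
            if (!candidate.getLanguage().isEmpty()) {
                return candidate;
            }
        }
        return fallback;
    }

    /** Formats a date in LONG style for the given locale. */
    static String formatDate(Date date, Locale locale) {
        return DateFormat.getDateInstance(DateFormat.LONG, locale).format(date);
    }
}
```

Because the country survives the conversion, a list that starts with "en_AU" yields a Locale whose country is AU, so the date format follows Australian rather than US conventions.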
Well you need to provide the locale when you get your DateFormat instance; otherwise it'll just use the system default locale instead of what Play is getting from the browser.
Something like this seems to work:
@DateFormat.getDateInstance(DateFormat.LONG, (implicitly[Lang]).toLocale).format(deadline)
That implicitly[Lang] bit is basically calling Lang.preferred(request.acceptLanguages()), just like you were doing in your onRequest() method.

How to get input from Output tab in NetBeans

I have an Output tab created and I would like to listen for user's input (to do a chat like component). Of course you can't predict when the user is going to type.
I found the org.jivesoftware.smack.util package and the related ObservableReader and ReaderListener that should do the trick, but I'm missing something and can't figure it out... yet.
Here's the code I have:
/*
* Enable/create the tabs we need for the component
*/
package sample.component;
import com.dreamer.outputhandler.OutputHandler;
import org.jivesoftware.smack.util.ObservableReader;
import org.jivesoftware.smack.util.ReaderListener;
import org.openide.modules.ModuleInstall;
/**
* Manages a module's lifecycle. Remember that an installer is optional and
* often not needed at all.
*/
public class Installer extends ModuleInstall implements ReaderListener {
private final String normal = "Output";
@Override
public void restored() {
OutputHandler.output(normal, "Welcome! Type something below.");
OutputHandler.setInputEnabled(normal, true);
ObservableReader reader = new ObservableReader(OutputHandler.getReader(normal));
reader.addReaderListener(this);
}
@Override
public void read(String read) {
System.out.println("Read: " + read);
OutputHandler.output(normal, "You typed: " + read);
}
}
OutputHandler is a helper class I created to handle the output tabs. You can see its source here
Any idea?
Finally got it! It was a mix of the above code with this forum post and these classes: org.jivesoftware.smack.util.ObservableReader and org.jivesoftware.smack.util.ReaderListener. See the FAQ here
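The answer does not show the final code. As a general illustration of the pattern, independent of NetBeans and Smack, something has to keep reading from the Reader so that a listener gets called whenever the user enters a line. The sketch below does that on a background thread. InputLineListener, pump, and listenInBackground are hypothetical names; in the question's module the Reader would presumably be the one returned by OutputHandler.getReader(normal).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.util.function.Consumer;

public class InputLineListener {

    /** Reads lines until the source ends and hands each one to the callback. */
    static void pump(Reader source, Consumer<String> onLine) {
        try (BufferedReader in = new BufferedReader(source)) {
            String line;
            while ((line = in.readLine()) != null) {
                onLine.accept(line);
            }
        } catch (IOException e) {
            // The source was closed or failed; stop listening.
        }
    }

    /** Runs pump on a daemon thread so the caller is never blocked waiting for input. */
    static Thread listenInBackground(Reader source, Consumer<String> onLine) {
        Thread worker = new Thread(() -> pump(source, onLine), "input-listener");
        worker.setDaemon(true);
        worker.start();
        return worker;
    }
}
```

A call such as listenInBackground(reader, line -> System.out.println("Read: " + line)) returns immediately, and the callback fires whenever a complete line becomes available.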