How to process images in XMLWorkerHelper.ParseToElementList [duplicate] - itext

I am posting this question because many developers ask more or less the same question in different forms. I will answer this question myself (I am the Founder/CTO of iText Group), so that it can be a "Wiki-answer." If the Stack Overflow "documentation" feature still existed, this would have been a good candidate for a documentation topic.
The source file:
I am trying to convert the following HTML file to PDF:
<html>
<head>
<title>Colossal (movie)</title>
<style>
.poster { width: 120px;float: right; }
.director { font-style: italic; }
.description { font-family: serif; }
.imdb { font-size: 0.8em; }
a { color: red; }
</style>
</head>
<body>
<img src="img/colossal.jpg" class="poster" />
<h1>Colossal (2016)</h1>
<div class="director">Directed by Nacho Vigalondo</div>
<div class="description">Gloria is an out-of-work party girl
forced to leave her life in New York City, and move back home.
When reports surface that a giant creature is destroying Seoul,
she gradually comes to the realization that she is somehow connected
to this phenomenon.
</div>
<div class="imdb">Read more about this movie on
</div>
</body>
</html>
In a browser, this HTML looks like this:
The problems I encountered:
HTMLWorker doesn't take CSS into account at all
When I use HTMLWorker, I need to create an ImageProvider to avoid an error that informs me that the image can't be found. I also need to create a StyleSheet instance to change some of the styles:
public static class MyImageFactory implements ImageProvider {
    public Image getImage(String src, Map<String, String> h,
            ChainedProperties cprops, DocListener doc) {
        try {
            // Load the image by its file name only, dropping any directory part of src
            return Image.getInstance(
                src.substring(src.lastIndexOf("/") + 1));
        } catch (DocumentException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
        return null;
    }
}

public static void main(String[] args) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter.getInstance(document, new FileOutputStream("results/htmlworker.pdf"));
    document.open();
    StyleSheet styles = new StyleSheet();
    styles.loadStyle("imdb", "size", "-3");
    HTMLWorker htmlWorker = new HTMLWorker(document, null, styles);
    // Register the image provider so that <img> tags can be resolved
    HashMap<String,Object> providers = new HashMap<String, Object>();
    providers.put(HTMLWorker.IMG_PROVIDER, new MyImageFactory());
    htmlWorker.setProviders(providers);
    htmlWorker.parse(new FileReader("resources/html/sample.html"));
    document.close();
}
The result looks like this:
For some reason, HTMLWorker also shows the content of the <title> tag. I don't know how to avoid this. The CSS in the header isn't parsed at all; I have to define all the styles in my code, using the StyleSheet object.
When I look at my code, I see that plenty of objects and methods I'm using are deprecated:
So I decided to upgrade to using XML Worker.
Images aren't found when using XML Worker
I tried the following code:
public static final String DEST = "results/xmlworker1.pdf";
public static final String HTML = "resources/html/sample.html";

public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();
    XMLWorkerHelper.getInstance().parseXHtml(writer, document,
            new FileInputStream(HTML));
    document.close();
}
This resulted in the following PDF:
Instead of Times-Roman, the default font Helvetica is used; this is typical for iText (I should have defined a font explicitly in my HTML). Otherwise, the CSS seems to be respected, but the image is missing, and I didn't get an error message.
With HTMLWorker, an exception was thrown, and I was able to fix the problem by introducing an ImageProvider. Let's see if this works for XML Worker.
Not all CSS styles are supported in XML Worker
I adapted my code like this:
public static final String DEST = "results/xmlworker2.pdf";
public static final String HTML = "resources/html/sample.html";
public static final String IMG_PATH = "resources/html/";

public void createPdf(String file) throws IOException, DocumentException {
    Document document = new Document();
    PdfWriter writer = PdfWriter.getInstance(document, new FileOutputStream(file));
    document.open();
    // CSS (assumed: iText's default CSS resolver)
    CSSResolver cssResolver =
            XMLWorkerHelper.getInstance().getDefaultCssResolver(true);
    // HTML: the context needs a tag factory; the image provider supplies the image root path
    HtmlPipelineContext htmlContext = new HtmlPipelineContext(null);
    htmlContext.setTagFactory(Tags.getHtmlTagProcessorFactory());
    htmlContext.setImageProvider(new AbstractImageProvider() {
        public String getImageRootPath() {
            return IMG_PATH;
        }
    });
    // Pipelines: CSS -> HTML -> PDF
    PdfWriterPipeline pdf = new PdfWriterPipeline(document, writer);
    HtmlPipeline html = new HtmlPipeline(htmlContext, pdf);
    CssResolverPipeline css = new CssResolverPipeline(cssResolver, html);
    XMLWorker worker = new XMLWorker(css, true);
    XMLParser p = new XMLParser(worker);
    p.parse(new FileInputStream(HTML));
    document.close();
}
My code is much longer, but now the image is rendered:
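Conceptually, the image root path fixes the missing image because the relative src from the HTML (img/colossal.jpg) gets resolved against it. The following is a minimal sketch of that idea in plain Java, not XML Worker's actual code; the class name ImagePathResolver and its resolve method are hypothetical and only illustrate the path arithmetic.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper: illustrates how a relative <img src> can be
// combined with an image root path. Not part of iText or XML Worker.
final class ImagePathResolver {

    /** Resolves a relative src against the root and returns a '/'-separated path. */
    static String resolve(String imageRoot, String src) {
        Path resolved = Paths.get(imageRoot).resolve(src).normalize();
        // Normalize separators so the result looks the same on every platform
        return resolved.toString().replace('\\', '/');
    }
}
```

With IMG_PATH set to "resources/html/" and the src from the sample HTML, this yields resources/html/img/colossal.jpg, which is where the poster image would have to live on disk.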
The image is larger than when I rendered it using HTMLWorker which tells me that the CSS attribute width for the poster class is taken into account, but the float attribute is ignored. How do I fix this?
The remaining question:
So the question boils down to this: I have a specific HTML file that I try to convert to PDF. I have gone through a lot of work, fixing one problem after the other, but there is one specific problem that I can't solve: how do I make iText respect CSS that defines the position of an element, such as float: right?
Additional question:
When my HTML contains form elements (such as <input>), those form elements are ignored.

Why your code doesn't work
As explained in the introduction of the HTML to PDF tutorial, HTMLWorker was deprecated many years ago. It wasn't intended to convert complete HTML pages. It doesn't know that an HTML page has a <head> and a <body> section; it just parses all the content. It was meant to parse small HTML snippets, and you could define styles using the StyleSheet class; real CSS wasn't supported.
Then came XML Worker. XML Worker was meant as a generic framework to parse XML. As a proof of concept, we decided to write some XHTML to PDF functionality, but we didn't support all of the HTML tags. For instance: forms weren't supported at all, and it was very hard to support CSS that is used to position content. Forms in HTML are very different from forms in PDF. There was also a mismatch between the iText architecture and the architecture of HTML + CSS. Gradually, we extended XML Worker, mostly based on requests from customers, but XML Worker became a monster with many tentacles.
Eventually, we decided to rewrite iText from scratch, with the requirements for HTML + CSS conversion in mind. This resulted in iText 7. On top of iText 7, we created several add-ons, the most important one in this context being pdfHTML.
How to solve the problem
Using the latest version of iText (iText 7.1.0 + pdfHTML 2.0.0) the code to convert the HTML from the question to PDF is reduced to this snippet:
public static final String SRC = "src/main/resources/html/sample.html";
public static final String DEST = "target/results/sample.pdf";

public void createPdf(String src, String dest) throws IOException {
    HtmlConverter.convertToPdf(new File(src), new File(dest));
}
The result looks like this:
As you can see, this is pretty much the result you'd expect. Since iText 7.1.0 / pdfHTML 2.0.0, the default font is Times-Roman. The CSS is being respected: the image is now floating on the right.
Some additional thoughts.
Developers often resist upgrading to a newer iText version when I advise them to move to iText 7 / pdfHTML 2. Allow me to respond to the three arguments I hear most often:
I need to use the free iText, and iText 7 isn't free / the pdfHTML add-on is closed source.
iText 7 is released using the AGPL, just like iText 5 and XML Worker. The AGPL allows free use in the sense of free of charge in the context of open source projects. If you are distributing a closed source / proprietary product (e.g. you use iText in a SaaS context), you can't use iText for free; in that case, you have to purchase a commercial license. This was already true for iText 5; this is still true for iText 7. As for versions prior to iText 5: you shouldn't use these at all. Regarding pdfHTML: the first versions were indeed only available as closed source software. We have had heavy discussion within iText Group: on the one hand, there were the people who wanted to avoid the massive abuse by companies who don't listen to their developers when those developers tell the powers that be that open source isn't the same as free. Developers were telling us that their boss forced them to do the wrong thing, and that they couldn't convince their boss to purchase a commercial license. On the other hand, there were the people who argued that we shouldn't punish developers for the wrong behavior of their bosses. Eventually, the people in favor of open sourcing pdfHTML, that is: the developers at iText, won the argument. Please prove that they weren't wrong, and use iText correctly: respect the AGPL if you're using iText for free; make sure that your boss purchases a commercial license if you're using iText in a closed source context.
I need to maintain a legacy system, and I have to use an old iText version.
Seriously? Maintenance also involves applying upgrades and migrating to new versions of the software you're using. As you can see, the code needed when using iText 7 and pdfHTML is very simple, and less error-prone than the code needed before. A migration project shouldn't take too long.
I've only just started and I didn't know about iText 7; I only found out after I finished my project.
That's why I'm posting this question and answer. Think of yourself as an eXtreme Programmer. Throw away all of your code, and start anew. You'll notice that it's not as much work as you imagined, and you'll sleep better knowing that you've made your project future-proof because iText 5 is being phased out. We still offer support to paying customers, but eventually, we'll stop supporting iText 5 altogether.

Use iText 7 and this code:
public void generatePDF(String htmlFile) {
    try {
        // HTML string to convert
        String htmlString = htmlFile;
        // Destination file (dirPath is a field defined elsewhere)
        FileOutputStream fileOutputStream = new FileOutputStream(new File(dirPath + "/USER-16-PF-Report.pdf"));
        PdfWriter pdfWriter = new PdfWriter(fileOutputStream);
        ConverterProperties converterProperties = new ConverterProperties();
        PdfDocument pdfDocument = new PdfDocument(pdfWriter);
        // Set the page size
        pdfDocument.setDefaultPageSize(new PageSize(PageSize.A3));
        Document document = HtmlConverter.convertToDocument(htmlString, pdfDocument, converterProperties);
        // Closing the document flushes the content and finishes the PDF
        document.close();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

Convert a static HTML page, including any CSS styles it uses:
HtmlConverter.convertToPdf(new File("./pdf-input.html"), new File("demo-html.pdf"));
For Spring Boot users: convert a dynamic HTML page using Spring Boot and Thymeleaf:
@RequestMapping(path = "/pdf")
public ResponseEntity<?> getPDF(HttpServletRequest request, HttpServletResponse response) throws IOException {
    /* Do Business Logic*/
    Order order = OrderHelper.getOrder();
    /* Create HTML using Thymeleaf template Engine */
    WebContext context = new WebContext(request, response, servletContext);
    context.setVariable("orderEntry", order);
    String orderHtml = templateEngine.process("order", context);
    /* Setup Source and target I/O streams */
    ByteArrayOutputStream target = new ByteArrayOutputStream();
    ConverterProperties converterProperties = new ConverterProperties();
    /* Call convert method */
    HtmlConverter.convertToPdf(orderHtml, target, converterProperties);
    /* extract output as bytes */
    byte[] bytes = target.toByteArray();
    /* Send the response as downloadable PDF */
    return ResponseEntity.ok()
            .header(HttpHeaders.CONTENT_DISPOSITION, "attachment; filename=order.pdf")
            .contentType(MediaType.APPLICATION_PDF)
            .body(bytes);
}


Why is BinaryFormatter not supporting Version Tolerance?

According to the .NET docs, Version Tolerance was added to BinaryFormatter in .NET 2.0:
So in theory, I should be able to remove a field from a serialized class without things blowing up, i.e.:
//Version 1
[Serializable]
class SaveData {
    public int Num1;
    public int Num2;
}
//Version 2
[Serializable]
class SaveData {
    public int Num1;
}
Even though I removed Num2, the older versions of the class should still deserialize ok, due to Version Tolerance: "In the case of removing an unused member variable, the binary formatter will simply ignore the additional information found in the stream. "
Instead, Unity is throwing an Error:
System.Runtime.Serialization.SerializationException: Field "Num2" not found in class SaveData
Is there a way to get this working in Unity?
[Edit] Here's the code I'm using to Save/Load:
// Save
BinaryFormatter bf = new BinaryFormatter();
FileStream fs = File.Create(path);
bf.Serialize(fs, data);
fs.Close();
// Load
BinaryFormatter bf = new BinaryFormatter();
FileStream fs = new FileStream(path, FileMode.Open);
SaveData data = (SaveData)bf.Deserialize(fs);
fs.Close();
First of all, BinaryFormatter may behave differently here (I haven't tested this case myself), because Unity uses its own version of Mono instead of Microsoft's .NET. So the .NET docs may not apply :-< (most of the time, the two behave the same way).
Second, I use protobuf-net for data serialization in my Unity project. So far, it works like a charm.

Wicket generate BookmarkablePageLink or Link from URL String

My final goal is to generate a go back button in my wicket site forms.
Right now I'm able to get the referrer with:
HttpServletRequest req = (HttpServletRequest)getRequest().getContainerRequest();
// log is an SLF4J-style logger (the name is assumed)
log.info("referer: {}", req.getHeader("referer"));
This works and I get the whole URL (as a String) but I'm unable to generate a Link object from this.
I'm not sure about the internals. Although I've been reading the code for Application.addMount, IRequestHandler and more, I can't find exactly where a URL is converted into what I need to generate a BookmarkablePageLink: the Class and the PageParameters.
P.S. I know this can be done with JavaScript, but I want to serve users without JS active.
Possible solution I'm currently using:
public static WebMarkupContainer getBackButton(org.apache.wicket.request.Request request, String id) {
    WebMarkupContainer l = new WebMarkupContainer(id);
    HttpServletRequest req = (HttpServletRequest)request.getContainerRequest();
    l.add(AttributeModifier.append("href", req.getHeader("referer")));
    return l;
}
In my markup I have:
<a wicket:id="backButton">Back</a>
And then, in my Page object:
add(WicketUtils.getBackButton(getRequest(), "backButton"));
If anyone has any better idea, I'm leaving this open for a while.
You should be able to use an ExternalLink for this.
Something resembling
public Component getBackButton(org.apache.wicket.request.Request request, String id) {
    HttpServletRequest req = (HttpServletRequest)request.getContainerRequest();
    String url = req.getHeader("referer");
    return new ExternalLink(id, url, "Back");
}
with html
<a wicket:id="backButton">this body will be replaced</a>
And your Page object code unchanged.

How to restrict a component to add only once per page

How can I restrict a CQ5 custom component so that it can be added only once per page? I want to prevent the component from being dragged and dropped onto the page when the author tries to add the same component to the same page a second time.
One option is to include the component directly in the JSP of the template and exclude it from the list of available components in the sidekick. To do so, add the component directly to your JSP (foundation carousel in this example):
<cq:include path="carousel" resourceType="foundation/components/carousel" />
To hide the component from the sidekick, either set:
componentGroup: .hidden
or exclude it from the list of "Allowed Components" using design mode.
If you need to allow users to create a page without this component you can provide a second template with the cq:include omitted.
Thanks Rampant, I have followed your method and the link you gave.
Posting link again : please follow this blog
It was really helpful. I am posting the implementation I ended up with.
It worked fine for me. The code quality can definitely be improved; this is raw code, just for reference.
1. Servlet Filter
Keep in mind that this filter executes whenever any resource is refreshed, so you need to filter the content on your end before further processing.
P.S. chain.doFilter(request,response); is a must, or CQ will hang and nothing will be displayed.
@SlingFilter(generateComponent = false, generateService = true, order = -700,
        scope = SlingFilterScope.REQUEST)
@Component(immediate = true, metatype = false)
public class ComponentRestrictorFilter implements Filter {

    private static final Logger logger = LoggerFactory.getLogger(ComponentRestrictorFilter.class);

    // Assumed: injected by the OSGi container
    @Reference
    private ResourceResolverFactory resolverFactory;

    public void init(FilterConfig filterConfig) throws ServletException {}

    public void doFilter(ServletRequest request, ServletResponse response, FilterChain chain)
            throws IOException, ServletException {
        WCMMode mode = WCMMode.fromRequest(request);
        if (mode == WCMMode.EDIT) {
            SlingHttpServletRequest slingRequest = (SlingHttpServletRequest) request;
            PageManager pageManager = slingRequest.getResource().getResourceResolver().adaptTo(PageManager.class);
            Page currentPage = pageManager.getContainingPage(slingRequest.getResource());
            logger.error("***mode" + mode);
            if (currentPage != null) {
                // RESTRICTED_COMPONENT is a constant naming the restricted component (defined elsewhere)
                ComponentRestrictor restrictor = new ComponentRestrictor(currentPage.getPath(), RESTRICTED_COMPONENT);
                restrictor.removeDuplicateEntry(resolverFactory, pageManager);
            }
        }
        // Always continue the chain, otherwise CQ hangs and nothing is displayed
        chain.doFilter(request, response);
    }

    public void destroy() {}
}
2.ComponentRestrictor class
public class ComponentRestrictor {
    private String targetPage;
    private String component;
    private Pattern pattern;
    private Set<Resource> duplicateResource = new HashSet<Resource>();
    private Logger logger = LoggerFactory.getLogger(ComponentRestrictor.class);
    private Resource resource = null;
    private ResourceResolver resourceResolver = null;
    private ComponentRestrictorHelper helper = new ComponentRestrictorHelper();

    public ComponentRestrictor(String targetPage_, String component_){
        targetPage = targetPage_ + "/jcr:content";
        component = component_;
    }

    public void removeDuplicateEntry(ResourceResolverFactory resolverFactory, PageManager pageManager) {
        pattern = Pattern.compile("([\"']|^)(" + component + ")(\\S|$)");
        findReference(resolverFactory, pageManager);
    }

    private void findReference(ResourceResolverFactory resolverFactory, PageManager pageManager) {
        try {
            resourceResolver = resolverFactory.getAdministrativeResourceResolver(null);
            resource = resourceResolver.getResource(this.targetPage);
            if (resource == null)
                return;
            // Walk the page content and collect every resource that references the component
            search(resource);
        } catch (LoginException e) {
            logger.error("Exception while getting the ResourceResolver " + e.getMessage());
        }
    }

    private void search(Resource parentResource) {
        searchReferencesInContent(parentResource);
        for (Iterator<Resource> iter = parentResource.listChildren(); iter.hasNext();) {
            Resource child = iter.next();
            search(child);
        }
    }

    private void searchReferencesInContent(Resource resource) {
        ValueMap map = ResourceUtil.getValueMap(resource);
        for (String key : map.keySet()) {
            if (!helper.checkKey(key)) {
                continue; // assumed: skip keys the helper rejects
            }
            String[] values = map.get(key, new String[0]);
            for (String value : values) {
                if (pattern.matcher(value).find()) {
                    logger.error("resource**" + resource.getPath());
                    duplicateResource.add(resource);
                }
            }
        }
    }
}
3.To remove the node/ resource
Whichever resource you want to remove/delete just use PageManager api
That's it !!! You are good to go.
None of the options looks easy to implement. The best approach I found is to use the ACS Commons Implementation which is very easy and can be adopted into any project.
Here is the link and how to configure it:
Enjoy coding !!!
You can't prevent that without doing some massive hacking to the UI code, and even then, you've only prevented it in one aspect of the UI. There's still CRXDE, and then the ability to POST content.
If this is truly a requirement, the best approach might be the following:
Have the component check for a special value in the pageContext object (use REQUEST_SCOPE).
If the value is not found, render the component and set the value.
Otherwise, print out a message that the component can only be used once.
Note that you can't prevent a dialog from showing, but at the very least the author has an indication that that particular component can only be used once.
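The check-and-set steps above can be sketched in plain Java. This is only an illustration under stated assumptions: a Map stands in for the request-scoped attributes of pageContext, and the class name OncePerRequestGuard, the FLAG key and the rendered markup are all hypothetical.

```java
import java.util.Map;

// Hypothetical sketch: a Map stands in for request-scoped attributes
// (pageContext with REQUEST_SCOPE in a real JSP component).
final class OncePerRequestGuard {

    // Assumed attribute name; any key unique to the component would do
    static final String FLAG = "example.singleton.rendered";

    /** Returns true only for the first caller within the same request. */
    static boolean claim(Map<String, Object> requestAttributes) {
        return requestAttributes.putIfAbsent(FLAG, Boolean.TRUE) == null;
    }

    /** Renders the component once; later instances get a message instead. */
    static String render(Map<String, Object> requestAttributes) {
        return claim(requestAttributes)
            ? "<div class=\"carousel\">...</div>"
            : "<p>This component can only be used once per page.</p>";
    }
}
```

The first instance rendered during a request claims the flag and renders normally; every later instance on the same page sees the flag and prints the message.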
It sounds like there needs to be clarification of requirements (and understanding why).
If the authors can be trained, let them manage limits of components through authoring and review workflows.
If there is just 1 fixed location the component can appear, then the page component should include the content component, and let the component have an "enable" toggle property to determine if it should render anything. The component's group should be .hidden to prevent dragging from the sidekick.
If there is a fixed set of locations for the component, the page component can have a dropdown of the list of locations (including "none"). The page render component would then conditionally include the component in the correct location. Again, prevent dragging the component from the sidekick.
In the "hard to imagine" case that the component can appear anywhere on the page, added by authors, but limited to only 1 instance - use a wrapper component to manage including the (undraggable) component. Let the authors drag the wrapper on the page as many times as they want, but the wrapper should query the page's resources and determine if it is the first instance, and if so, include the end component. Otherwise, the wrapper does nothing.
In our experience (>2years on CQ), implementing this type of business rules via code creates a brittle solution. Also, requirements have a habit of changing. If enforced via code, development work is required instead of letting authors make changes faster & elegantly.
None of these options are that great. If you truly want a robust solution to this problem (limit the number of items on the page without hardcoding location) then the best way is with a servlet filter chain OSGI service where you can administer the number of instances and then use a resource resolver to remove offending instances.
The basic gist is:
Refresh the page on edit using cq:editConfig
Create an OSGI service implementing javax.servlet.Filter that encapsulates your business rules.
Use the filter to remove excess components according to business rules
Continue page processing.
For more details see here:
Using a servlet filter to limit the number of instances of a component per page or parsys
This approach will let you administer the number of items per page or per parsys and apply other possibly complex business rules in a way that the other offered solutions simply cannot.
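The business rule inside such a filter can be kept separate from the Sling and JCR plumbing. Below is a minimal sketch of that rule only, assuming the components of a page or parsys are available as a list of resource types in document order; the class InstanceLimiter and the example resource types are hypothetical, and the actual removal of the offending resources is left out.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch of the "limit instances" rule, independent of CQ APIs.
final class InstanceLimiter {

    /**
     * Returns the positions (in document order) of components that exceed
     * the configured limit for their resource type. Types without a
     * configured limit are never reported.
     */
    static List<Integer> excess(List<String> resourceTypes, Map<String, Integer> limits) {
        Map<String, Integer> seen = new HashMap<>();
        List<Integer> toRemove = new ArrayList<>();
        for (int i = 0; i < resourceTypes.size(); i++) {
            String type = resourceTypes.get(i);
            int count = seen.merge(type, 1, Integer::sum);
            Integer limit = limits.get(type);
            if (limit != null && count > limit) {
                toRemove.add(i);
            }
        }
        return toRemove;
    }
}
```

For a page holding carousel, text, carousel, text, carousel with a limit of one carousel, the sketch reports positions 2 and 4, so the first instance is kept and the later ones are the candidates for removal.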

PDF text extraction using iText

We are doing research in information extraction, and we would like to use iText.
We are in the process of exploring iText. According to the literature we have reviewed, iText is the best tool to use. Is it possible to extract text from a PDF line by line with iText? I have read a related question here on Stack Overflow, but it only covers reading the text, not extracting it. Can anyone help me with my problem? Thank you.
Like Theodore said, you can extract text from a PDF and, like Chris pointed out,
as long as it is actually text (not outlines or bitmaps).
The best thing to do is buy Bruno Lowagie's book iText in Action. In the second edition, chapter 15 covers extracting text.
But you can look at his site for examples.
And you can parse it to create a plain txt file.
Here is a code example:
/*
 * This class is part of the book "iText in Action - 2nd Edition"
 * written by Bruno Lowagie (ISBN: 9781935182610)
 * For more info, go to:
 * This example only works with the AGPL version of iText.
 */
package part4.chapter15;

import java.io.FileOutputStream;
import java.io.IOException;
import java.io.PrintWriter;

import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;
import com.itextpdf.text.pdf.parser.TextExtractionStrategy;

public class ExtractPageContent {

    /** The original PDF that will be parsed. */
    public static final String PREFACE = "resources/pdfs/preface.pdf";
    /** The resulting text file. */
    public static final String RESULT = "results/part4/chapter15/preface.txt";

    /**
     * Parses a PDF to a plain text file.
     * @param pdf the original PDF
     * @param txt the resulting text
     * @throws IOException
     */
    public void parsePdf(String pdf, String txt) throws IOException {
        PdfReader reader = new PdfReader(pdf);
        PdfReaderContentParser parser = new PdfReaderContentParser(reader);
        PrintWriter out = new PrintWriter(new FileOutputStream(txt));
        TextExtractionStrategy strategy;
        for (int i = 1; i <= reader.getNumberOfPages(); i++) {
            strategy = parser.processContent(i, new SimpleTextExtractionStrategy());
            // Write the text of page i to the output file
            out.println(strategy.getResultantText());
        }
        out.flush();
        out.close();
        reader.close();
    }

    /**
     * Main method.
     * @param args no arguments needed
     * @throws IOException
     */
    public static void main(String[] args) throws IOException {
        new ExtractPageContent().parsePdf(PREFACE, RESULT);
    }
}
Notice the license
This example only works with the AGPL version of iText.
If you look at the other examples it will show how to leave out parts of the text or how to extract parts of the pdf.
Hope it helps.
iText allows you to do that, but there is no guarantee about the granularity of the text blocks; that depends on the software that produced your PDF documents.
It's quite possible that each word or even each letter has its own text block. Nor do these need to be in reading order; for reliable results you may have to reorder text blocks based on their coordinates. You may also have to calculate whether you need to insert spaces between text blocks.
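To make the reordering concrete, here is a minimal sketch in plain Java that does not use iText at all. It assumes each text block has already been reduced to its text, the x and y of its baseline start, and its width; the Chunk and ChunkAssembler classes and the two tolerance parameters are hypothetical, and real documents (rotated text, multiple columns) need more care than this.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: reorders text blocks by position and inserts spaces.
final class ChunkAssembler {

    /** A text block with the start of its baseline (x, y) and its width. */
    static final class Chunk {
        final String text;
        final float x;
        final float y;
        final float width;

        Chunk(String text, float x, float y, float width) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
        }
    }

    /**
     * Groups chunks into lines (same y within lineTolerance), orders each line
     * left to right, and inserts a space where the horizontal gap between
     * neighbours exceeds spaceGap.
     */
    static String assemble(List<Chunk> chunks, float lineTolerance, float spaceGap) {
        List<Chunk> sorted = new ArrayList<>(chunks);
        // PDF y coordinates grow upward, so larger y means higher on the page
        sorted.sort((a, b) -> Float.compare(b.y, a.y));
        StringBuilder out = new StringBuilder();
        List<Chunk> line = new ArrayList<>();
        for (Chunk c : sorted) {
            if (!line.isEmpty() && Math.abs(line.get(0).y - c.y) > lineTolerance) {
                appendLine(out, line, spaceGap);
                line.clear();
            }
            line.add(c);
        }
        if (!line.isEmpty()) {
            appendLine(out, line, spaceGap);
        }
        return out.toString();
    }

    private static void appendLine(StringBuilder out, List<Chunk> line, float spaceGap) {
        line.sort((a, b) -> Float.compare(a.x, b.x));
        if (out.length() > 0) {
            out.append('\n');
        }
        Chunk prev = null;
        for (Chunk c : line) {
            if (prev != null && c.x - (prev.x + prev.width) > spaceGap) {
                out.append(' ');
            }
            out.append(c.text);
            prev = c;
        }
    }
}
```

Suitable values for lineTolerance and spaceGap depend on the font size of the document, so they are left as parameters rather than hard-coded.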
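In newer versions of iText:
public static void main(String[] args) throws Exception {
    try (var document = new PdfDocument(new PdfReader("your.pdf"))) {
        // Page numbers are 1-based, so use <= to include the last page
        for (int i = 1; i <= document.getNumberOfPages(); i++) {
            // Use a fresh strategy per page so text from earlier pages isn't repeated
            var strategy = new SimpleTextExtractionStrategy();
            String text = PdfTextExtractor.getTextFromPage(document.getPage(i), strategy);
            System.out.println(text);
        }
    }
}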

Pull generated HTML programmatically vs. a HttpWebRequest

For our newsletter, I generate the final body of the email in a web page and then want to pull that into the body of the email. I found a way to do that with HttpWebRequest.
private string GetHtmlBody(Guid id)
{
    HttpWebRequest request = (HttpWebRequest)WebRequest.Create(String.Format("{0}", id.ToString()));
    // Dispose the response and the reader once the body has been read
    using (HttpWebResponse response = (HttpWebResponse)request.GetResponse())
    using (System.IO.StreamReader sr = new System.IO.StreamReader(response.GetResponseStream()))
    {
        return sr.ReadToEnd();
    }
}
However, I feel there has to be a better way. Can I somehow pull the generated view without making a web call?
You could use MVCContrib for this task.
Or try to roll some ugly code:
public static string ViewToString(string controlName, object viewData)
{
    var vd = new ViewDataDictionary(viewData);
    var vp = new ViewPage { ViewData = vd };
    var control = vp.LoadControl(controlName);
    vp.Controls.Add(control);
    var sb = new StringBuilder();
    using (var sw = new StringWriter(sb))
    using (HtmlTextWriter tw = new HtmlTextWriter(sw))
    {
        // Render the page, including the loaded control, into the StringBuilder
        vp.RenderControl(tw);
    }
    return sb.ToString();
}
and then:
var viewModel = ...
string template = ViewToString("~/Emails/EmailTemplate.ascx", viewModel);
Assuming the email code is in the same project as the website, you should be able to call the action method, get the ActionResult back, then call the ExecuteResult method. The downside is that in order to do it this way, you will need to set it up such that ExecuteResult writes to a stream that you can read from. In order to do all of this, you will need to mock up some of the classes used by the ControllerContext.
What would probably be a better way (though it will likely take more work) is to have the markup you want generated by an XSLT transform. XSLT is a type of XML document template that can be merged with an XML document that holds data to produce a desired result. If you do this, then the process that sends out emails can run the transform, and so can your website. The advantage of this is that if you want the markup to be different (i.e. you are redesigning the newsletter), you simply need to update the XSLT file and deploy it.
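To illustrate the XSLT idea, here is a minimal sketch that merges a stylesheet with an XML data document to produce markup. It is written in Java with the standard javax.xml.transform API purely as an illustration; in the .NET project from the question the equivalent would be the framework's own XSLT classes. The class name NewsletterXslt, the stylesheet and the newsletter/title/body element names are all made up for this example.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;

// Hypothetical sketch: one stylesheet, reusable by both the website
// and the process that sends the emails.
final class NewsletterXslt {

    // Assumed template; in practice this would live in its own .xslt file
    static final String STYLESHEET =
        "<xsl:stylesheet version=\"1.0\" xmlns:xsl=\"http://www.w3.org/1999/XSL/Transform\">"
      + "<xsl:output method=\"xml\" omit-xml-declaration=\"yes\"/>"
      + "<xsl:template match=\"/newsletter\">"
      + "<div><h1><xsl:value-of select=\"title\"/></h1>"
      + "<p><xsl:value-of select=\"body\"/></p></div>"
      + "</xsl:template>"
      + "</xsl:stylesheet>";

    /** Applies the stylesheet to the given XML data and returns the markup. */
    static String render(String dataXml) {
        try {
            Transformer transformer = TransformerFactory.newInstance()
                .newTransformer(new StreamSource(new StringReader(STYLESHEET)));
            StringWriter out = new StringWriter();
            transformer.transform(new StreamSource(new StringReader(dataXml)),
                                  new StreamResult(out));
            return out.toString();
        } catch (TransformerException e) {
            throw new IllegalStateException("XSLT transform failed", e);
        }
    }
}
```

Because the markup lives entirely in the stylesheet, redesigning the newsletter means editing that one template; the code that calls render stays the same.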
Finally got a working solution. After finding some proper search terms (thanks to @Darin) and many, many trials, I found a solution that works. Putting this in my controller and then passing the rendered string into my EmailHelper works great for what I needed.