PDF generation from templated Word documents - ms-word

I have a Word document (a template) that contains placeholders for data to be filled in, and there are several such templates in a directory. When data arrives, I choose a template based on some criteria, fill in the data, and then the resulting document has to be converted to PDF.
I have been investigating Apache POI for this. If anyone has a good suggestion, it would be much appreciated.

As mbeckish mentioned, you should indicate how you are going to run or automate this. For example, is it a one-off, run by hand, or part of another program (and if so, which programming languages do you use)?
If you are trying to automate it, JODReports and Docmosis are tools that can use templates like the ones you describe and can produce PDF. JODReports is free. Docmosis is not, but it has several APIs. Please note I work for the company that develops Docmosis.
Hope that helps.
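For a sense of what the "fill the placeholders" step involves, here is a minimal sketch in Python using only the standard library. It is not how JODReports or Docmosis work internally. It assumes placeholders are written as ${name} (a hypothetical convention, not something stated in the question) and that Word has stored each placeholder as one contiguous piece of text in word/document.xml; Word often splits text across runs, in which case this simple approach will miss the placeholder. The function name fill_template is made up for illustration.

```python
import io
import re
import zipfile
from xml.sax.saxutils import escape

# Hypothetical placeholder syntax: ${name}
PLACEHOLDER = re.compile(r"\$\{(\w+)\}")


def fill_template(template_bytes, values):
    """Return a copy of a .docx (as bytes) with ${key} placeholders replaced."""

    def substitute(match):
        key = match.group(1)
        # Unknown placeholders are left untouched so they are easy to spot.
        if key not in values:
            return match.group(0)
        # Escape &, <, > so the value cannot break the XML.
        return escape(str(values[key]))

    out = io.BytesIO()
    with zipfile.ZipFile(io.BytesIO(template_bytes)) as src, \
            zipfile.ZipFile(out, "w", zipfile.ZIP_DEFLATED) as dst:
        for item in src.infolist():
            data = src.read(item.filename)
            # Only the main document part is rewritten; headers, footers
            # and all other parts are copied unchanged.
            if item.filename == "word/document.xml":
                xml = data.decode("utf-8")
                data = PLACEHOLDER.sub(substitute, xml).encode("utf-8")
            dst.writestr(item, data)
    return out.getvalue()
```

The PDF conversion is a separate step that this sketch leaves out; it would be handled by whichever converter or tool you settle on.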

I've just uploaded this presentation, which presents three approaches for doing this.

Why not use one of the existing PDF virtual printers?

Related

Loading a .docx file into ETL/ELT tool?

Could someone please guide me on how to extract the contents of a .docx file and load them into a database using an ETL (Extract-Transform-Load) or ELT (Extract-Load-Transform) tool?
Assuming that the .docx file contains mostly unstructured data, isn't it an ELT tool I should go for instead of ETL?
The ETL and ELT tools I have found so far don't support MS Word as a source. What other way is there to extract the content of a .docx file and store it in a database?
My requirement is to:
Extract the data inside the .docx file,
Convert it into meaningful data, and
Store it in a data lake so I can perform data analysis and make productive decisions based on the results.
It's just like how e-commerce companies convert customer reviews into meaningful data so they can take decisions to boost their sales. In my case, it's Word files I need to analyze.
I'm asking this because I've searched for so many ETL and ELT tools but couldn't find anything that supported Word files. Maybe it's because I haven't been searching for the right tool or the right way to do it?
If somebody knows a way, please guide me through the process. What should I start looking for? A tool, or a way to code the entire thing?
I've been looking for an answer for weeks now but didn't find a helpful answer. And it's starting to get really frustrating to see all the tools supporting every other component like social media, MongoDB, or whatever EXCEPT Word files.
You have to do this in 2 steps:
Extract the data from the .docx file to txt or XML.
Then use SSIS to import the result (or Azure Data Factory if you are in the cloud).
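The two steps above can be sketched in Python with only the standard library. This is an illustration under assumptions, not an SSIS or Azure Data Factory recipe: a .docx file is a ZIP archive whose main text lives in word/document.xml, and an in-memory SQLite database stands in for the real target. The function names and the docx_text table are hypothetical.

```python
import io
import sqlite3
import zipfile
import xml.etree.ElementTree as ET

# WordprocessingML namespace, in ElementTree's {uri}tag form.
W = "{http://schemas.openxmlformats.org/wordprocessingml/2006/main}"


def docx_paragraphs(docx_bytes):
    """Step 1: extract the non-empty paragraphs of a .docx as plain text."""
    with zipfile.ZipFile(io.BytesIO(docx_bytes)) as archive:
        root = ET.fromstring(archive.read("word/document.xml"))
    paragraphs = []
    for p in root.iter(W + "p"):
        # A paragraph's text is spread over one or more w:t elements.
        text = "".join(t.text or "" for t in p.iter(W + "t"))
        if text.strip():
            paragraphs.append(text)
    return paragraphs


def load_paragraphs(conn, source_name, paragraphs):
    """Step 2: load the extracted text into a table (SQLite as a stand-in)."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS docx_text "
        "(source TEXT, position INTEGER, body TEXT)"
    )
    conn.executemany(
        "INSERT INTO docx_text VALUES (?, ?, ?)",
        [(source_name, i, text) for i, text in enumerate(paragraphs)],
    )
    conn.commit()
```

Typical use would be to read each file's bytes, call docx_paragraphs, and pass the result to load_paragraphs. Headers, footers, footnotes, and comments live in other parts of the archive and are ignored here, and paragraphs nested inside text boxes may be reported more than once.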

Can I convert .docx Word documents using the DocX .NET Library?

I am currently attempting to convert a couple of .NET desktop applications that I have developed into a web application harnessing AngularJS and RESTful services.
One of the key components of these applications is their ability to generate Word documents on the fly using a .dotx Word template. I am currently exploring the possibility of using a third-party library called DocX to generate these Word documents without resorting to a template.
I guess my question is: Can I use this library to read an existing Word document in .docx format and generate a source code representation of the document? If this is possible could someone point me in the direction of any code samples that I could use? I have looked around and have been unable to find anything that could help me get started.
Generating a code representation of the document and using it with DocX seems like a time-consuming effort to me. Why not use a template instead and fill it with data at runtime?
I have some experience with Docentric, which is a third-party Open XML toolkit. It features a Word add-in for template design and libraries for document generation and manipulation. It took me less than a week to generate pretty complex documents. If I were in your shoes, I would definitely try some third-party toolkits. They cost money but save time, so do the math and see if they can be useful for you.
It is possible to read an existing Word document in .docx format with the following code:
DocX document = DocX.Load(filename);
However, it is not possible to generate a source code representation of a document.

Can Crystal Reports generate documents in PDF/A file format?

We are looking for a solution to generate documents in PDF/A format for sharing and also for archiving purposes.
I checked the description of ExportFormatType.PortableDocFormat; however, it just says "PDF file".
Can Crystal Reports generate PDF/A-compatible files?
I don't think you can export directly to PDF/A. Instead, I recommend using Crystal to export to PDF and then using third-party software to convert the PDF to PDF/A. It takes one extra step, but it will meet your needs.
I googled a bit and found http://www.abbyyusa.com/shop/pdftransformer/. I know nothing about this software; I'm just presenting it as an example. It costs 80 USD, but you might be able to find a freeware alternative.
http://www.pdfa.org/doku.php is the official homepage of PDF/A. You might find something useful there too.
According to this SAP community thread from a few days ago, it can't be done natively, although there was a third-party component mentioned there. I haven't tried it, so I have no idea if it works or not.

Is there an alternative to open-xml sdk to generate word documents

I'm trying to generate Word documents using the Open XML SDK. When the documents are small, this is no problem (and rather easy). When the documents become larger (500+ pages), I notice that performance (duration, memory usage, ...) degrades significantly.
Googling this problem, I came across some posts that point out the same issue. For Excel there is a solution: SpreadsheetGear.
I would like to know whether there is a Word equivalent, or whether there are other solutions for generating Word documents.
Thanks,
Jelle
I've written a blog post series on generating Open XML WordprocessingML documents. The approach that I take is that you create a template Word document, insert content controls, and then write XPath expressions in those content controls to specify the XML to pull from a source XML data file. I've also explored another approach where you write C# code in Open XML content controls. That approach also works.
http://ericwhite.com/blog/map/generating-open-xml-wordprocessingml-documents-blog-post-series/
-Eric
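As a rough illustration of the content-control idea (this is not the code from the blog series), the sketch below is written in Python with only the standard library. It assumes each content control (w:sdt) in the template holds, as its visible text, a simple path relative to the root of the data XML, such as customer/name. ElementTree supports only a limited XPath subset, so absolute paths and most XPath functions will not work here. The function name fill_content_controls is hypothetical.

```python
import xml.etree.ElementTree as ET

W_URI = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"
NS = {"w": W_URI}


def fill_content_controls(document_xml, data_xml):
    """Replace each content control's text with the value its path selects."""
    # Keep the conventional "w:" prefix when serializing the result.
    ET.register_namespace("w", W_URI)
    doc = ET.fromstring(document_xml)
    data = ET.fromstring(data_xml)
    for sdt in doc.iter("{%s}sdt" % W_URI):
        texts = sdt.findall("w:sdtContent//w:t", NS)
        if not texts:
            continue
        # Assumption: the control's visible text is the path expression.
        path = "".join(t.text or "" for t in texts).strip()
        value = data.findtext(path)
        if value is None:
            # No match in the data file: leave the control as it is.
            continue
        # Put the value in the first text node and blank out the rest.
        texts[0].text = value
        for extra in texts[1:]:
            extra.text = ""
    return ET.tostring(doc, encoding="unicode")
```

The input here is the XML of word/document.xml rather than the .docx package itself; reading and writing the ZIP container would be a separate step.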
You might look at http://docx.codeplex.com/
On Java, you could use docx4j. If you were brave, you could create DLLs for it via IKVM...
I decided to go with Aspose Words. It is really fast and not very demanding on resources (CPU, memory). It has the disadvantage of being quite expensive. I also investigated SoftArtisans OfficeWriter. The possibilities are the same, but because the company I'm currently working for already used other Aspose components, we decided to go with Aspose Words.

Is there a Platform-independent Web-based replacement for Word Templates?

The above Title is my Manager's words, not mine. :)
This is a follow-up to a question that I posted previously. After reading my assessment of the impact of converting Word Templates from PC to Mac, my manager has now asked me to investigate whether Word Templates can be replaced with a "Platform-independent Web-based solution" (her words, not mine). She has suggested using Adobe Forms (i.e., Adobe Designer).
Personally, I think the only truly platform-independent web-based solution is text files or HTML forms. What do other people think?
It's called WordprocessingML (aka. WordXML, WordML)...
Overview of WordprocessingML [Word 2003 XML Reference] at http://msdn.microsoft.com/en-us/library/aa212812(office.11).aspx.
MSDN Search for "WordML" at http://social.msdn.microsoft.com/Search/en-US?query=WordML&ac=3
It could be called XForms...
The Web was supposed to be about platform-independent electronic documents. In other words, if you truly want platform independence, then I agree with you, and your forms should be in HTML. Yet HTML forms are really not a good development platform. That is why Adobe, Microsoft, and others provide "form" solutions. XForms is an attempt to make developing and using HTML forms more flexible, overcome their limitations, and provide a platform-independent object model for completing HTML forms. You might want to look at XForms at http://www.w3.org/MarkUp/Forms/.
But I wouldn't call it PDF
In my opinion, working with PDF files is difficult. I have not looked at the file format specification, but I have heard it is not trivial. Moreover, you need a custom editor and you are locked into one vendor, which is Adobe. (That said, open-source projects and other vendors also support the file format.) Adobe is not known for creating programs that are easy to use.
My Suggestion
If you are already using Word, then moving to WordML should be fairly easy. You can convert your existing Word documents into WordML by simply saving them as XML from the Save dialog, and this process can be automated through code. In addition, I believe WordML supports form templates (the actual form) and data documents (the actual data for a form).
It's called PDF...
At the core (and without the millions of extra "unnecessary features"), that's exactly the niche that Adobe PDFs were designed to fill.
I'd suggest you look into Adobe Acrobat Professional for more info, although I don't think there's any good way to directly convert Word docs to PDF format.
Note: This question should be moved to Super User since it's not really programming related.
Google Docs meets those requirements of a Platform-independent Web-based solution. Your mileage will vary with Google Docs though - if you just want to use it for letters, it's good. Much beyond that, it's rather limited. Unless you get the Premier (read: Corporate) version which you have to pay for, you won't be able to programmatically fiddle with the templates.
If you want a "Platform-independent solution", go with ODF or OOXML. You can make either "web-based" to your heart's content, maybe with HTML5 or another solution such as Flash or Silverlight.