How to convert Word 2007 document to PDF using Apache FOP

I am currently using Apache FOP and have a stylesheet (possibly from RenderX) that converts Word 2003 XML documents (Saved as XML option) to PDF. However, this does not work for Word 2007 XML documents.
I am looking for options or suggestions on how to get a stylesheet that will transform a Word 2007 XML file to one of the following:
Word 2003 XML, or
PDF using FOP (using a stylesheet to create XSL-FO)
I am also open to any other options you might have. If possible I would like to do this with little to no cost. However, I am limited to using Java so a C# type option is not possible.
Thanks,

You could try docx4j, an open source Java library (ASL v2) which uses FOP to create PDFs from docx files.

I'm not aware of any style sheets that do this transformation. It would be reasonably sophisticated. If you end up having to engineer another way of doing it, you might want to look at JODConverter (straight conversion - might be your best bet), the OpenOffice UNO API (very manual), JODReports or Docmosis (both can produce documents in various formats). All can produce PDFs from a Java environment. I think they all have free versions.
Hope that helps.

Related

Generate .hhk file From Word Document

I am trying to convert an MS Word file to a CHM file. I have a well-organized Word document, but I could not figure out how to turn the document, saved as an HTML file, into a CHM file. I know I can add the HTML file to a project, but there are some issues I could not solve, such as how to convert the Word table of contents into an index file in HTML Help Workshop. I would be very happy if someone could provide an example of converting Word documents. (I am trying to achieve this through the HTML Help Workshop program.)
Best regards,

Converting a Word document to CHM format is difficult without special (often expensive) tools and has a learning curve.
Consider whether the PDF format might be sufficient. That said, the CHM format, which is integrated into the Windows operating system, does offer some popular features.
I recommend reading through "Search and Index not working after converting from Word 2016 to CHM".
As I mentioned in my answer there, I had never used chmProcessor before (I use other tools), but it surprisingly seems to be a good tool for converting Word documents in a simple way.
Please try chmProcessor for your needs. You may want to ask a new question here on SO later.
Edit:
Maybe you have additional interest in the following CodeProject article:
How to Easily Write a User's Guide for Your Application using Different File Extensions

Index PDF files and generate keywords summary

I have a large number of PDF files in my local filesystem that I use as a documentation base, and I would like to create an index of these files.
I would like to:
1. Parse the contents of the PDF files to get keywords.
2. Select the most relevant keywords to make a summary.
3. Create static HTML pages for some keywords with entries linked to the appropriate files.
My questions are:
Is there an existing tool to perform the whole job?
What is the most appropriate tool to parse the content of PDF files, filter words (by size), and count them?
I am considering using Perl, swish-e, and pdfgrep to write a script. Do you know of other tools which could be useful?

Given that points 2 and 3 seem custom, I'd recommend writing your own script: call a tool from it to parse the PDF, process its output as you please, and write the HTML (perhaps using another tool).
Perl is well suited for that, since it excels at the kind of processing you'll need and also supports working with all kinds of file formats via modules.
As for reading PDF, here are some options if your needs aren't too elaborate:
Use the CAM::PDF (and CAM::PDF::PageText) or PDF::API2 modules
Use pdftotext from the poppler library (probably in the poppler-utils package)
Use pdftohtml with the -xml option, and read the generated simple XML file with XML::LibXML or XML::Twig
The last two are external tools which you run via Perl's builtins like system.
The text processing that follows, to build your summary and design the output, is precisely what languages like Perl are for. The couple of tasks that are mentioned take a few lines of code.
Then write out HTML, either directly if simple or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
Full parsing of PDF may be infeasible, but if the files aren't too complex it should work.
If your process for selecting keywords and building statistics is fairly common, there are integrated tools for document management (search for bibliography managers). However, I think that most of them resort to external tools to parse pdf so you may still be better off with your own script.
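
The extract, filter, and count steps described above can be sketched as follows. This is only an illustration in Python rather than Perl, and it rests on assumptions that are not in the original answer: that poppler's pdftotext is on the PATH, that keywords are plain ASCII words, and that the function names (extract_text, top_keywords) and the default thresholds are hypothetical choices.

```python
import re
import subprocess
from collections import Counter


def extract_text(pdf_path):
    """Return the text of a PDF by calling poppler's pdftotext (assumed installed)."""
    # Passing "-" as the output file makes pdftotext write to standard output.
    result = subprocess.run(
        ["pdftotext", pdf_path, "-"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


def top_keywords(text, min_length=5, limit=10, stopwords=frozenset()):
    """Return the `limit` most frequent words with at least `min_length` letters."""
    # Simplification: only ASCII letters count as word characters, so
    # punctuation, digits, and accented letters split or drop words.
    words = re.findall(r"[a-z]+", text.lower())
    counts = Counter(
        word for word in words
        if len(word) >= min_length and word not in stopwords
    )
    return counts.most_common(limit)
```

Calling top_keywords(extract_text("manual.pdf")) on each file (the file name is a placeholder) gives a list of (word, count) pairs that could feed the per-keyword HTML pages. Choosing which keywords are relevant is still up to you.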

DocBook to Word Conversion?

I need some help with conversion of DocBook files to Microsoft Word files.
Do I need an XSL file for the transformation?

Yes, you do need an XSL file. You can get XSL files for DocBook from the free DocBook XML distribution. Then, you run a free XSLT transformer such as Saxon. If you run Saxon from a command line, you give it the name of your DocBook file, and the name of one of the stylesheets, and it will transform your file according to the rules in the stylesheet.
To transform to Word, you need to pick the right stylesheet.
From DocBook XSL: The Complete Guide, here are three possibilities:
Convert to XSL-FO and then use XMLmind to export to Word. See the XMLmind website for more information.
Use a limited set of tags and then use one of DocBook XML's included stylesheets to output to WordML.
Try to use Jfor to output to RTF, although Jfor no longer appears to be maintained.
And I have one of my own:
As above, use one of DocBook XML's included stylesheets to publish to XSL-FO, then run Apache FOP to convert from XSL-FO to RTF. You will lose the structural information, but you will keep a certain amount of the formatting.

I recently implemented the same feature for our users. They use the Oxygen XML editor, which allows for easy transformations via XSL. I was going to do OOXML but settled on WordML. As a starting point I used the roundtrip XSL, but I had to rewrite lots of templates because of existing bugs or missing functionality. In addition, I made other customizations that serve a specific purpose or apply only to our XML files.
I would not mind contributing back to the project, but I don't really know how to go about it.

I know this is an 11-year-old question, but now, in 2022, you can use pandoc to convert DocBook to MS Word (docx).
pandoc --from docbook --to docx --output filename.docx filename.docbook

I am using XQuery to transform DocBook into various formats using an XQuery typeswitch library. XQuery uses indexes, so I can transform many documents very quickly.

Writing MS Word 2007 XML

How would I write Word 2007 XML (WordprocessingML) on my own? I have a requirement to write the Word 2007 XML format for a Word template. The important part is that I need to convert a Word template document to XML (by zipping it, etc.) and write the Word 2007 XML with the respective tags. How can I do this?

docx4j

I'm not sure what you mean by "on your own," but there is an existing API for this:
Apache POI - the Java API for Microsoft Documents
If it doesn't do what you need it to, just extend it.

Go to OpenXMLDeveloper - you can get most of the information there on WordprocessingML and links out to other places.
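
To give an idea of what writing WordprocessingML "on your own" involves, here is a minimal sketch in Python using only the standard library. It is not taken from any of the answers above. It assumes a plain .docx rather than a template (a template part uses a different content type), and the function names write_docx and build_document_xml are hypothetical. The package is a zip archive holding three parts: the content types, the package relationships, and the main document.

```python
import zipfile
from xml.sax.saxutils import escape

W_NS = "http://schemas.openxmlformats.org/wordprocessingml/2006/main"

# Declares which content type each part of the package has.
CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" ContentType='
    '"application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType='
    '"application/vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

# Points from the package root to the main document part.
RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type='
    '"http://schemas.openxmlformats.org/officeDocument/2006/relationships/officeDocument"'
    ' Target="word/document.xml"/>'
    '</Relationships>'
)


def build_document_xml(paragraphs):
    """Wrap each string in a paragraph (w:p) containing one run (w:r) of text (w:t)."""
    body = "".join(
        '<w:p><w:r><w:t xml:space="preserve">%s</w:t></w:r></w:p>' % escape(text)
        for text in paragraphs
    )
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
        '<w:document xmlns:w="%s"><w:body>%s</w:body></w:document>' % (W_NS, body)
    )


def write_docx(target, paragraphs):
    """Write a minimal .docx package to a file name or a binary file object."""
    with zipfile.ZipFile(target, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", RELS)
        package.writestr("word/document.xml", build_document_xml(paragraphs))
```

For example, write_docx("hello.docx", ["First paragraph", "Second paragraph"]) should produce a file that Word can open. Styles, headers, and anything beyond plain paragraphs need additional parts and markup, which is where a library such as docx4j or Apache POI saves work.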

How to generate Microsoft Word documents using Sphinx

Sphinx supports a few output formats:
Multiple HTML files (with html or dirhtml)
LaTeX, which is useful for creating .pdf or .ps
text
How can I obtain output in a Microsoft Word file instead?
With another doc generator I managed to generate a single html output file and then convert it to Microsoft Word format using the Word application.
Unfortunately I don't know a way to generate either Word or the HTML single-page format.

The solution I use is the singlehtml builder, as andho mentioned in a comment, followed by converting the HTML to docx with pandoc.
The following sample assumes the generated html would be located at _build/singlehtml/index.html
make singlehtml
cd _build/singlehtml/
pandoc -o index.docx index.html

There is a Sphinx extension for generating docx format (which I haven't tested) and a newer one (which I also haven't tested, but looks like it is more actively maintained)

To convert reStructuredText files to MS Word .doc, I use rst2odt followed by unoconv. See the following script:
#!/bin/sh
# Convert a reStructuredText file to .doc via an intermediate .odt file.
rst2odt "$1" "$1.odt"
unoconv -f doc "$1.odt"
rm "$1.odt"
With rst2odt you can use your own stylesheet. unoconv, which runs on top of OpenOffice, also lets you apply an OpenOffice style (template) during the conversion. Simply edit a converted document, change styles, add headers and footers, save it as an ODF Text Document Template (OTT), and pass it to the conversion like this:
unoconv -f doc -t template.ott "$1.odt"
That template can then be reused for later conversions.

I realize this is an old question, but I found that LibreOffice supports the following way of doing conversion (assuming soffice.exe is in your path):
soffice.exe --invisible --convert-to doc myInputFile.odt
Some things I have read say to use the --headless option rather than --invisible. Both seem to work on Windows.
You can start with the rst2odt.py script and then do the above to convert to an MS Word document.
Here is a link with additional start up options for LibreOffice:
http://help.libreoffice.org/Common/Starting_the_Software_With_Parameters
Here is a link with file types supported by OpenOffice which, I believe, LibreOffice should also support:
http://wiki.services.openoffice.org/wiki/Framework/Article/Filter/FilterList_OOo_3_0

This answer is not a command-line solution and is obviously not the best, but it works for me and saves me time. After generating the HTML file, open it in a browser, copy the entire page (Ctrl+A, then Ctrl+C), then open Microsoft Office (or use the online version if, like me, you don't have Microsoft Windows) and paste (Ctrl+V) into it.

The best option might be rst -> odt -> doc
Convert the Sphinx documents into OpenOffice format.
Then open the .odt file with OpenOffice and save it as a Word document. But I don't know how to do this automatically.
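
One way the rst -> odt -> doc chain might be automated is sketched below in Python. It reuses the rst2odt tool and the LibreOffice --headless and --convert-to options mentioned in other answers on this page. The assumptions are that both programs are on the PATH under the names rst2odt and soffice (on some systems the first is called rst2odt.py), and that the function names are hypothetical.

```python
import subprocess
from pathlib import Path


def build_commands(rst_path):
    """Return the two command lines for the rst -> odt -> doc chain."""
    source = Path(rst_path)
    odt_path = source.with_suffix(".odt")
    return [
        # Step 1: docutils' rst2odt writes the OpenDocument file.
        ["rst2odt", str(source), str(odt_path)],
        # Step 2: LibreOffice converts the .odt file without opening a window.
        ["soffice", "--headless", "--convert-to", "doc", str(odt_path)],
    ]


def convert(rst_path):
    """Run both steps in order, stopping at the first command that fails."""
    for command in build_commands(rst_path):
        subprocess.run(command, check=True)
```

Calling convert("guide.rst") (a placeholder file name) should leave guide.odt next to the source and, as far as I know, a guide.doc in the current working directory. Keeping the command construction separate from the execution makes it easy to print the commands first and check them by hand.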

This is a workaround using Calibre (https://calibre-ebook.com), which includes a powerful converter. This worked well, and most of the formatting is preserved:
Generate epub output in Sphinx: make epub
Import the epub output into Calibre and then convert epub to docx using the built-in ebook converter.
Answer is too late for the original question, but people looking at the same problem may find this useful.

I don't know what Sphinx is, but you could create an RTF file, an HTML file, or something similar.
See the following blogpost for more information/approaches : OFFICE AUTOMATION
and from there : How to use ASP to generate a Rich Text Format (RTF) document to stream to Microsoft Word
This article describes how you can generate Rich Text Format (RTF) files with ASP script and then stream those files to Microsoft Word. This technique provides an alternative to server-side Automation of Microsoft Word for run-time document generation.
You probably don't use ASP script (who does? :-) ), but the article is useful for the idea.
