Conversion between docx / doc / rtf and lightweight markup - ms-word

I am looking for a tool or set of tools to convert between file formats D and M where
D is a format handled by MSWord, in order of preference, docx, doc, rtf
M is a lightweight markup, such as markdown, textile, txt2tags, it can be an esoteric one
there is a way to generate html from M
conversion is two-way, it's done both from D to M, and from M to D
utf-8 encoding is handled properly
the content is simple, paragraphs, some simple formatting like bold and italics, maybe lists
the tools are platform-independent
What I've found so far
TeX, LaTeX -- too heavyweight
docx2txt -- too lightweight, it supports no formatting at all
html -- MSWord produces bloated html
a few one-way conversions, like doc to mediawiki,
UPDATE:
The use case is a document workflow between technical and non-technical people
I, the technical guy edit a document in plain text, put it into version control, etc.
I send it to my manager or other non-technical people
They add comments, make changes to it using their Word, then they send it back to me
I want to simply grok their changes, make my changes, put it into version control, without having to use Word

I think that Pandoc much more than meet all requirements.
http://pandoc.org

Adam, I've used docx4j to convert docx to html, edit the html in CKEditor, and then use docx4j to convert the html back to docx. My process made some assumptions about the css (ie it was designed to handle docx4j's clean html, and editing in CKEditor).
You don't say whether there is a way to generate M from HTML?

This is probably hard to do two-way, since you will have impedance mismatches between the various formats.
The best world I can think of would be a sort of Wiki / Word hybrid: Maybe you can get Google Wave to do that for you?
Another solution that might work is a CMS like Plone (did they ever add WYSIWIG capability? I stopped caring after version 1). Keep your documents there. Let the system handle changes, annotations etc. You can automate retrieval of the source (should be ReStructuredText) and commit that to your source control if you have to.

This script I wrote might help you in your workflow:
https://github.com/matb33/docx2md
It is a command-line PHP script that will only work with .docx files. It will extract the XML, run some XSL transformations, and provide you the result in Markdown format.
I encourage you to send me .docx files that don't convert accurately. I'd love to make this script as robust and reliable as possible.

Related

Generate .hhk file From Word Document

I am trying to convert MS Word file to chm file. I have a well organized word document. But,I could not figure out how to word saved as a html file to chm file. I know I can add html file to created project but there are some issue such that I could not solve how to convert ms word table of content file to index file in html help workshop program. I would be very happy If someone provide some example about conversion of word documents.(I am trying to achieve this thorough HTML Help Workshop program)
Best regards,
Converting a Word document to CHM format is difficult without special (often expensive) tools and has a learning curve.
You should think about whether the PDF format is not sufficient. But the CHM format - integrated in the Windows operating system - has of course some popular functions.
I recommend to read through Search and Index not working after converting from Word 2016 to CHM.
As I mentioned in my answer I never used chmProcessor before (because using other tools) but surprisingly seems to be a good one for converting Word documents in a simple way.
Please try chmProcessor for your needs. You may want to ask a new question here on SO later.
Edit:
Maybe you have additional interest in the following CodeProject article:
How to Easily Write a User's Guide for Your Application using Different File Extensions

Index PDF files and generate keywords summary

I have a large amount of PDF files in my local filesystem I use as documentation base and I would like to create an index of these files.
I would like to :
Parse the contents of the PDF files to get keywords.
Select the most relevant keywords to make a summary.
Create static HTML pages for some keywords with entries linked to the appropriate files.
My questions are :
Is there an existing tool to perform the whole job ?
What is the most appropriate tool to parse PDF files content, filter (by words size) and counting the words?
I consider using Perl, swish-e, pdfgrep to make a script. Do you know other tools which could be useful?
Given that points 2 and 3 seem custom I'd recommend to have your own script, use a tool out of it to parse pdf, process its output as you please, and write HTML (perhaps using another tool).
Perl is well suited for that, since it excels in processing that you'll need and also provides support for working with all kinds of file formats, via modules.
As for reading pdf, here are some options if your needs aren't too elaborate
Use CAM::PDF (and CAM::PDF::PageText) or PDF-API2 modules
Use pdftotext from the poppler library (probably in poppler-utils package)
Use pdftohtml with -xml option, read the generated simple XML file with XML::libXML or XML::Twig
The last two are external tools which you use via Perl's builtins like system.
The following text processing, to build your summary and design the output, is precisely what languages like Perl are for. The couple of tasks that are mentioned take a few lines of code.
Then write out HTML, either directly if simple or using a suitable module. Given your purpose, you may want to look into HTML::Template. Also see this post, for example.
Full parsing of PDF may be infeasible, but if the files aren't too complex it should work.
If your process for selecting keywords and building statistics is fairly common, there are integrated tools for document management (search for bibliography managers). However, I think that most of them resort to external tools to parse pdf so you may still be better off with your own script.

UIMA Ruta input type - html

I have pdf and word files that need to be used as an input for Ruta. I can convert them into text files, but lose all the tables and formatting if I do that. Is there anyway I can use them without losing any information?
Thanks!
You need an additional program that is able to convert pdf (/doc/docx) to html. There are mainly two different types of PDF converter: those which use absolute positions for generating nice-looking html, and those which rely only on html elements and css. For processing tables, I recommend the latter ones. I personally use a commerical solution, but there is also a lot of good open source software, e.g., pdf2htmlEX
If you have html, then you can apply the HtmlAnnotator and HtmlConverter for gaining plain text with annotations for the html tags as described in the UIMA Ruta documentation

Using version control for RTF documents?

I have recently started a new job as a Business Systemd Analyst. The company has an in-house document management system that reads/parses RTF documents that have a BBCode-like syntax to do basic conditional, looping and inserting of data from a database; my role is to modify these RTF files with the code blocks to make them dynamic.
For my own personal use I would like to utilize a version control system to better handle revisions and so I don't have to have dozens of copies of a file during the various stages I'm working on them, probably Mercurial (I don't feel like dealing with Cygwin), but seeing as I'm more used to source code in an IDE than a rich text document template, I'm not quite sure if a VCS system is even the appropriate solution to use as I couldn't really use them to diff files, just as storage and tracking.
Any suggestions for this? Could I get by with a VCS system or am I applying programmer logic to a non-programming problem? :)
seeing as I'm more used to source code in an IDE than a rich text
document template
It is a look at a strange angle: you can version all, always, anytime. Just sometimes it's less usable, sometimes - more.
If your files are basically text - you can version/compare/rollback, if your files are readable by special viewers texts - you can also diff revisons, if your files are readable by eyes - you can also merge sources. If you have GUI, you have all power of SCM and usability of tools.
...And be glad that you did not have to work with something like this
{\rtf1\ansi\ansicpg1251\deff0\deflang1049{\fonttbl{\f0\fswiss\fcharset204{\*\fname Arial;}Arial CYR;}}
{\colortbl ;\red0\green128\blue0;}
{\*\generator Msftedit 5.41.15.1515;}\viewkind4\uc1\pard\f0\fs20\'dd\'f2\'ee \'ef\'e5\'f0\'e2\'e0\'ff \'f1\'f2\'f0\'ee\'ea\'e0\par
\'dd\'f2\'ee \b\'e2\'f2\'ee\'f0\'e0\'ff \cf1\b0\'f1\'f2\'f0\'ee\'ea\'e0\cf0\par
}
(ordinary pure-RTF with short russian text in it)

Understanding WordProcessingML tags and avoid unnecessary tags

I am using MS Word API to generate .docx which contains the data fetched from DB, in which i am applying the respective styles, fonts, symbols, etc. If the data fetched from the DB is quite huge, then there is a problem in displaying those data in the .docx file. I found that internally MS Word 2007 will write some content through tags which may not be needed to display the data. Hence i am figuring out what are the necessary MS Word tags needed when converting into a .xml file. So that i can avoid unnecessary tags and build only the respective tags which are needed to display the data. Hence i am planning to write my own .xml with the MS Word tags which are needed, than generating a .XML from .docx file
My queries are:-
1) Whether it is right that the MS Word will generate some tags which may not be needed during the conversion of .docx to document.xml? That makes it heavy? If so what are the tags , so that i can avoid them when write by own .xml file.
2) Please send links to understand about the MS Word tags and its advantages, which tags are needed and which are not ?
3) Whether my approach to write a new .xml similar to document.xml (.docx conversion) is worthy one to go forward so that i can build the .xml with the tags i needed , so that i can improve the performance of the data display?
Please shed some light into it and thanks in advance..
Thanks,
Rithu
You'll want to learn WordprocessingML in much more detail to do this. It certainly isn't impossible, but it is quite a learning curve to start with. Probably the best place to start is with this eBook. If you go the manual route, you'll need a zip technology. If you're in Visual Studio, you can make the writing of all of this easier by using the Open XML SDK.
As to your questions on 'unnecessary tags', it's hard to believe that there would be much at all in the file that is unnecessary. But that depends on what you consider not needed - for example, if a word is caught as mispelled, there will be "dirty=1" attribute on the Run tag. If you're okay with displaying mispelled words, then that could be considered unnecessary. Really depends on what you're displaying for and in what.