Programmatically convert Doc(x) files to PDF using Microsoft Word - ms-word

We are developing a Java application that needs to programmatically convert .rtf, .doc and .docx files to PDF files.
Formatting is important to us, so we need the page numbers to be the same between a source file and a target PDF file, and the contents of each page to be the same as in the original file.
We have tried out open source solutions, such as JODConverter to invoke a LibreOffice or OpenOffice installation, Docx4j and XDocReport. The best formatting was achieved with LibreOffice. However, even in that case, the pages were different (for example, an 87-page .rtf file results in an 80-page PDF file).
So, we think that the ideal way to make the conversion would be to somehow invoke Microsoft Word through our Java application, and make the conversion with it. That would produce PDF files that have the same formatting as the original files.
Is this possible in any of the following ways:
An API that is directly invocable from Java?
An API that is invocable from a .NET language, which we would then use with something like JACOB?
A third-party library that uses a Microsoft Word installation under the hood (something like JODConverter for Word)?
A CLI interface supported by Word (relevant question)?
Something else?
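One route that fits the options listed above is Word's COM automation interface. The following is only a minimal sketch, written in Python rather than Java, and it assumes a Windows machine with Microsoft Word installed and the pywin32 package available. The helper names `pdf_target` and `convert_with_word` are made up for illustration; from Java the same COM calls would have to go through a bridge such as JACOB.

```python
from pathlib import Path

# Numeric value of Word's WdSaveFormat.wdFormatPDF constant.
WD_FORMAT_PDF = 17


def pdf_target(source):
    """Return the path of the PDF that sits next to the source document."""
    return Path(source).with_suffix(".pdf")


def convert_with_word(source):
    """Open a .rtf/.doc/.docx file in Word and save it as PDF.

    Assumption: runs on Windows with Word and pywin32 installed.
    """
    import win32com.client  # imported lazily so the module loads elsewhere

    src = Path(source).resolve()  # Word expects absolute paths
    dst = pdf_target(src)
    word = win32com.client.DispatchEx("Word.Application")
    word.Visible = False
    try:
        doc = word.Documents.Open(str(src), ReadOnly=True)
        try:
            doc.SaveAs(str(dst), FileFormat=WD_FORMAT_PDF)
        finally:
            doc.Close(False)  # close without saving changes to the source
    finally:
        word.Quit()
    return dst
```

Because Word itself lays out the pages, the page breaks in the PDF should match what Word shows for the source file, which is the property the question asks for. Whether this is suitable for unattended server-side use is a separate matter that would need to be checked.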

Related

Is it possible to write a binary file import extension for vs code?

I want to display some information from a binary file in VS Code.
Is it possible to write an extension for VS Code such that, when you select that file in the Explorer (or open it directly), you see some text extracted from the binary file by that extension?
So the core functionality of that extension would be (simplified) a binary to text converter.
Any suggestions?
A VS Code team member has confirmed in my issue that there is no support for registering content providers for binary files.
I've inspected the workspace.onDidOpenTextDocument and window.onDidChangeActiveTextEditor APIs, but neither seems to be called when opening binary files.
Is there a way to display fallback content using registerTextDocumentContentProvider (or otherwise) for binary files?
That's why these types all carry Text in their names: TextEditor, TextDocument, etc. They can only handle textual data, not binary data ;-)
No explanation as to why this works for PDFs, probably special-cased.

Extract data from many PDF forms

I regularly receive large numbers of the same PDF form. I want to extract the data from them into a text file. I'd like to do this via a script of some sort. I'm working in a UNIX environment.
Is this possible? I've googled my brains out and can't find anything.
Text in a PDF is represented by text elements in page content streams. The streams are commonly compressed. If you have the time and resources, you can use ISO 32000-1:2008 or the Adobe PDF 1.7 specification to build your own PDF parser. Otherwise, it may be more practical to use a third-party application as an intermediate translation step.
There are utilities that will decode the stream and give you clear text. One option is PDFtk Server which will work in your environment. Another option is to use the Poppler PDF Rendering Library which has a command line utility "pdftotext" useful for searching for strings in PDFs.
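As a rough sketch of the PDFtk route: `pdftk form.pdf dump_data_fields` prints one block per form field, with blocks separated by `---` lines and entries such as `FieldName:` and `FieldValue:`. The Python below parses that text into a dictionary. It assumes `pdftk` is on the PATH; the function names `parse_fields` and `dump_fields` are hypothetical, and multi-line field values are not handled.

```python
import subprocess


def parse_fields(dump):
    """Turn the text printed by `pdftk ... dump_data_fields` into a dict."""
    fields = {}
    name, value = None, ""
    for line in dump.splitlines():
        if line.startswith("---"):
            # A separator starts a new field block; store the previous one.
            if name is not None:
                fields[name] = value
            name, value = None, ""
        elif line.startswith("FieldName:"):
            name = line[len("FieldName:"):].strip()
        elif line.startswith("FieldValue:"):
            value = line[len("FieldValue:"):].strip()
    if name is not None:
        fields[name] = value
    return fields


def dump_fields(pdf_path):
    """Run pdftk on one form and return its fields (assumes pdftk is installed)."""
    result = subprocess.run(
        ["pdftk", str(pdf_path), "dump_data_fields"],
        capture_output=True, text=True, check=True,
    )
    return parse_fields(result.stdout)
```

Looping `dump_fields` over a directory of forms and writing each dictionary as one line of a text file would give the batch extraction the question describes.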

How can I open a .mediawiki file in Word so that Word will interpret it as a MediaWiki file rather than a TXT file?

I've recently installed the Microsoft Office Word Add-in For MediaWiki (http://www.microsoft.com/en-us/download/details.aspx?id=12298) and I'm able to save MediaWiki files just fine but I can't open them (they are opened as plain text).
How can I force MS Word to make the correct association for MediaWiki files?
You can't. "Interpreting" the MediaWiki markup, i.e. wikitext, is called parsing. Writing a MediaWiki parser is a pain and there is no single parser which fully works yet, other than MediaWiki itself. LibreOffice's wiki-publisher plugin and those which copied it are able to produce good wikitext from their well-formed data format, but making this bidirectional is another matter.
Parsoid is almost perfect now and produces standard HTML, but it's a rather heavy application, so you can't expect it to be embedded in Word. Maybe someone can write a LibreOffice plugin for Parsoid, though! Would be scary, but who knows.
See https://www.mediawiki.org/wiki/Alternative_parsers for more information. Many tried and got hurt. :)

How to write a MsWord (.doc) file in Objective C?

I need to write a pure MS Word file in Objective-C.
I have been writing .txt files until now, but when I write a .doc file I run into encoding issues with every encoding scheme.
Microsoft provides a library in Visual Studio for working with .doc files, but it is not available in Xcode.
So is there any way to make it happen?
You can extract the code from, e.g., the OpenOffice project. It is C++ code, but you can put a wrapper around it. This will be a lot of work.
If it's an option for you:
On the server, use Microsoft VSTO to create a document (doc, xls, ppt...) using the ServerDocument class, passing in the data that you collect on the client. Then you download that file to your phone.
Sounds like you are trying to port MS Word. Basically you need to write XML files and zip them in a particular manner. Check out the markup specification here:
http://msdn.microsoft.com/en-us/library/cc313105%28office.12%29.aspx
If it's .docx, you may compose the XML parts and zip them into a package with a .docx extension. With .doc you may have to do it with your own wrapper.
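To illustrate the "write XML files and zip them" approach for .docx, here is a minimal sketch in Python (not Objective-C) that builds a package with the three parts a bare document needs: `[Content_Types].xml`, `_rels/.rels` and `word/document.xml`. The function name `build_docx_bytes` is made up for this example, and the output is a single unformatted paragraph, not a full-featured document.

```python
import io
import zipfile
from xml.sax.saxutils import escape

CONTENT_TYPES = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Types xmlns="http://schemas.openxmlformats.org/package/2006/content-types">'
    '<Default Extension="rels" '
    'ContentType="application/vnd.openxmlformats-package.relationships+xml"/>'
    '<Default Extension="xml" ContentType="application/xml"/>'
    '<Override PartName="/word/document.xml" ContentType="application/'
    'vnd.openxmlformats-officedocument.wordprocessingml.document.main+xml"/>'
    '</Types>'
)

RELS = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<Relationships '
    'xmlns="http://schemas.openxmlformats.org/package/2006/relationships">'
    '<Relationship Id="rId1" Type="http://schemas.openxmlformats.org/'
    'officeDocument/2006/relationships/officeDocument" '
    'Target="word/document.xml"/>'
    '</Relationships>'
)

DOCUMENT = (
    '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>'
    '<w:document xmlns:w='
    '"http://schemas.openxmlformats.org/wordprocessingml/2006/main">'
    '<w:body><w:p><w:r><w:t>{text}</w:t></w:r></w:p></w:body>'
    '</w:document>'
)


def build_docx_bytes(text):
    """Return the bytes of a one-paragraph .docx package."""
    buffer = io.BytesIO()
    with zipfile.ZipFile(buffer, "w", zipfile.ZIP_DEFLATED) as package:
        package.writestr("[Content_Types].xml", CONTENT_TYPES)
        package.writestr("_rels/.rels", RELS)
        # Escape &, < and > so the paragraph text stays valid XML.
        package.writestr("word/document.xml", DOCUMENT.format(text=escape(text)))
    return buffer.getvalue()
```

Writing the returned bytes to a file named, say, `hello.docx` gives a file that word processors supporting Office Open XML should be able to open. The same structure can be produced from Objective-C with any zip library.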

What's the easiest way to generate DOC files?

Right now I'm generating HTML with a Perl script, and then manually converting it to DOC in OpenOffice. (Actually I have to copy, create a new "Text document", paste, and save, because it treats HTML and DOC as separate file types, but that detail is beside the point.) It's very inconvenient.
Is there any automated way to convert HTML to a decent DOC, or some other nice format like HTML that I can generate textually and convert to DOC in an automated way?
(I'm on OSX)
I can't help you get to .doc, but have you seen the Open XML Format SDK from Microsoft? This will allow you to generate Office 2007 format documents (.docx, .xlsx etc) from .NET code.
Theoretically you may have some luck with this under Mono on OS X, as it doesn't require an installation of Office 2007 (for Windows) to function.
Not sure if this is what you want, but you can fairly easily generate WordML documents with code. WordML is the Word 2003 XML file format. It's NOT the same thing as the Office 2007 Open XML formats. WordML is just one file that's not too hard to create if you're just doing fairly basic formatting. You could generate it directly rather than creating the HTML first. You can name the files with a .DOC extension and Word 2003 and later will open them just fine. You can resave them as real .DOC files if you want.
Here's the on-line WordML reference. I can send you some sample code if you'd like.
http://msdn.microsoft.com/en-us/library/aa212812(office.11).aspx
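A minimal sketch of what generating WordML directly can look like, in Python rather than Perl. It emits a single XML file with a `w:wordDocument` root in the Word 2003 namespace and one `w:p` per paragraph. The function name `build_wordml` is hypothetical, and only plain, unformatted paragraphs are covered.

```python
from xml.sax.saxutils import escape

# Namespace of the Word 2003 XML (WordML) format.
WORDML_NS = "http://schemas.microsoft.com/office/word/2003/wordml"


def build_wordml(paragraphs):
    """Return a WordML document containing one plain paragraph per string."""
    body = "".join(
        "<w:p><w:r><w:t>{}</w:t></w:r></w:p>".format(escape(text))
        for text in paragraphs
    )
    return (
        '<?xml version="1.0" encoding="UTF-8" standalone="yes"?>\n'
        # Processing instruction that tells Windows to open the file in Word.
        '<?mso-application progid="Word.Document"?>\n'
        '<w:wordDocument xmlns:w="{ns}"><w:body>{body}</w:body></w:wordDocument>'
    ).format(ns=WORDML_NS, body=body)
```

Saving the returned string as UTF-8 under a name such as `report.doc` follows the naming trick described in the answer above.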
If you really want to create a general file format that can be converted into other formats, creating an XSL-FO file might be the way to go. There are a number of products out there that can take XSL-FO and transform it into other formats, such as Word and PDF.
We use the Aspose components, which are available for .NET and Java. With Java you should be able to use them on OS X, too.
You have to purchase the components (i.e. they are not free), but aside from that, they are really great.