LibreOffice doc to html file conversion with embedded images - libreoffice

We've been using LibreOffice's headless conversion to convert Word documents to HTML files. On version 4.0, it would create an HTML file plus a separate JPG file for each image embedded in the Word document, and reference those files via the img tag's src attribute.
Now that we've upgraded to 4.2, the conversion only creates the HTML file, with all of the images embedded inline as base64-encoded data URIs in the img src attributes.
Is there a way to make the LibreOffice conversion create the individually linked image files again? Here's the command we are using for the conversion:
soffice --headless --convert-to html:HTML file_to_convert.docx

I found a LibreOffice bug report that pointed me to the commit where this behavior changed; the change has caused problems for others as well. For now there is no solution, but many people on the bug report are asking for image embedding to be made optional. According to one commenter on the bug, the issue is still present in the most recent versions: 5.0.0rc2 and current master.
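
Until such an option exists, one possible stopgap outside LibreOffice is to post-process the generated HTML. The following is only a minimal sketch, not a LibreOffice feature: it assumes the embedded images appear as src="data:image/...;base64,..." attributes, and the names extract_images, out_dir and prefix are made up for illustration.

```python
import base64
import re
from pathlib import Path

# Assumed shape of the embedded images: src="data:image/<type>;base64,<payload>"
DATA_URI = re.compile(r'src="data:image/([a-zA-Z0-9.+-]+);base64,([^"]+)"')


def extract_images(html: str, out_dir: Path, prefix: str = "image") -> str:
    """Write each embedded image into out_dir and return HTML that links to the files."""
    out_dir.mkdir(parents=True, exist_ok=True)
    counter = 0

    def replace(match: re.Match) -> str:
        nonlocal counter
        counter += 1
        subtype, payload = match.group(1), match.group(2)
        ext = "jpg" if subtype == "jpeg" else subtype
        name = f"{prefix}_{counter}.{ext}"
        # b64decode ignores line breaks inside the payload by default.
        (out_dir / name).write_bytes(base64.b64decode(payload))
        # The relative link assumes the HTML file is saved inside out_dir.
        return f'src="{name}"'

    return DATA_URI.sub(replace, html)
```

To use it, read the converted HTML file as text, pass it through extract_images with the directory of the HTML file as out_dir, and write the returned string back.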

Related

Get documentation from GitHub project as a single pdf

I'm looking for a single pdf of the ErpNext and Frappe user manuals.
Documentation seems to be provided in html and the source is in markdown. I did find tools to convert markdown to html/pdf, but no reliable solution to generate a SINGLE pdf file keeping the structure as shown here:
Put more abstractly: How to transform GitHub markdown documentation (organized in subdirectories) into a single pdf file?
Could anyone help me out?
Any way of achieving this is welcome, thanks in advance!
You can convert markdown to PDF with Pandoc or similar tools.
You can search the internet for how to concatenate files on your OS.
There are several (online) tools to merge multiple PDFs into one.
To create a single file you can either
concatenate the markdown files into one big file, then convert to PDF, or
convert all markdown files to PDF, then merge all PDF files into one big PDF.
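
The first option (concatenate, then convert) could look roughly like the sketch below. It assumes that sorted path order is an acceptable chapter order, which may not match the structure of the original manual; the function name concatenate_markdown and the output file name are illustrative only.

```python
from pathlib import Path


def concatenate_markdown(root: Path, output: Path) -> int:
    """Join every .md file under root into one file, in sorted path order."""
    # Skip the output file itself in case it already exists under root.
    files = sorted(p for p in root.rglob("*.md") if p.resolve() != output.resolve())
    parts = [p.read_text(encoding="utf-8").rstrip() for p in files]
    # A blank line between files keeps headings from merging with preceding text.
    output.write_text("\n\n".join(parts) + "\n", encoding="utf-8")
    return len(files)
```

The combined file can then be converted with Pandoc, for example with pandoc combined.md -o manual.pdf (Pandoc needs a PDF engine such as a LaTeX installation for this). Relative image and link paths inside the markdown files may need adjusting after concatenation.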

Programmatically convert Doc(x) files to PDF using Microsoft Word

We are developing a Java application that needs to programmatically convert .rtf, .doc and .docx files to PDF files.
Formatting is important to us, so we need the page numbers to be the same between a source file and a target PDF file, and the contents of each page being the same as the original file.
We have tried out open source solutions, such as JODConverter to invoke a LibreOffice or OpenOffice installation, Docx4j and XDocReport. The best formatting was achieved with LibreOffice. However, even in that case, the pages were different (for example, an 87-page .rtf file results in an 80-page PDF file).
So, we think that the ideal way to make the conversion would be to somehow invoke Microsoft Word through our Java application, and make the conversion with it. That would produce PDF files that have the same formatting as the original files.
Is this possible in any of the following ways:
An API that is directly invokeable through Java?
An API that is invokeable through a .Net language and we would use that with something like JACOB?
A 3rd party library that uses a Microsoft Word installation under the hood (something like JODConverter for Word)?
A CLI interface supported by Word (relevant question)?
Something else?

iText form filling missing PDF content

I am running into an odd problem with iText. I have a document with a few fields. On my server, I open the local document, set the fields and send the output of the stamper to the browser.
Works perfectly on my local devel machine.
The pdf generated on the server is missing the PDF contents. I only see the content of the fields I set, the rest is completely blank.
Any tips?
Your application on your local machine respects the bytes of the PDF you're using as a template. Your application on the server doesn't respect those bytes. Maybe you've copied the template using the wrong encoding, making all the binary characters corrupt. Or maybe your application is reading the template using the wrong encoding with the same result.
You can find out by opening both PDF files in a text editor (not in a PDF viewer). Look for the keyword stream and inspect the bytes that follow it. Do you see the difference? In the PDF produced on your local machine, the bytes look like a normal binary stream. In the PDF produced on your server, the bytes look wrong; for instance, they may consist largely of question marks.
How to solve: check if the template was copied correctly. If so, check the way you're reading the document. For instance: read the PDF template into a byte array without using iText and write it to a new byte array. Can you reproduce the process of corruption? If so, tweak your application (the one that doesn't involve iText) until you've got the correct encoding.
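
The kind of corruption described above can be reproduced without iText. The sketch below is in Python rather than Java purely for illustration, and the template bytes are a made-up stand-in for a real PDF; it shows how passing binary data through a text encoding turns stream bytes into question marks, while a plain byte copy leaves them intact.

```python
def roundtrip_as_text(data: bytes, encoding: str = "ascii") -> bytes:
    """Simulate the bug: decode binary data as text, then encode it again."""
    text = data.decode(encoding, errors="replace")
    return text.encode(encoding, errors="replace")


def first_difference(a: bytes, b: bytes) -> int:
    """Return the offset of the first differing byte, or -1 if identical."""
    for offset, (x, y) in enumerate(zip(a, b)):
        if x != y:
            return offset
    return -1 if len(a) == len(b) else min(len(a), len(b))


# Hypothetical stand-in for a PDF template: ASCII keywords around binary stream bytes.
template = b"%PDF-1.4\nstream\n\x80\xff\x9c\nendstream\n"

safe_copy = bytes(template)                # byte-for-byte copy, no decoding involved
broken_copy = roundtrip_as_text(template)  # what a text-mode copy can do to the file
```

Comparing the template on the server with the one on the local machine via something like first_difference shows whether the copy step is where the bytes change.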

Remove "generated by doxygen" and timestamp in PDF

As the title says. I've just started using doxygen, and on the first test run I noticed that the PDF it creates has "created by doxygen 1.8.3.1", followed by the date and time, across the front page.
Is it possible to remove this, or even just move it, say to the end of the document?
I have seen similar questions, but only for HTML (or RTF, which I'm not generating) and not for PDF.
You can do this by using a custom LaTeX header.
First generate a default one using
doxygen -w latex header.tex footer.tex doxygen.sty
Now edit header.tex, look for the "Generated on ..." part, and replace it with something of your liking.
Then mention the customized header in doxygen's configuration file
LATEX_HEADER = header.tex
and run doxygen as normal.
Note: When you upgrade to a newer version of doxygen you may need to update your custom header as well.
I believe you should use the HTML_FOOTER configuration tag.
I haven't tested this, but it sounds right:
The HTML_FOOTER tag can be used to specify a user-defined HTML footer for each generated HTML page. If the tag is left blank doxygen will generate a standard footer.

Convert rtf files to chm files? Convert hlp files to chm files?

We were shipping .hlp files to customers when development was in VC++. The process to create it was as follows:
1. Create rtf file
2. Create new project in WinHelp and then compile to get .hlp file.
Development has now moved to .NET, and I also found that we can no longer open .hlp files in Windows 7 or Vista.
Are there any free command line tools that can convert these .hlp files to .chm files?
Are there also any free command line tools that can convert an .rtf file to .chm?
Microsoft has a tool which can convert Win Help projects to HTML Help. It is called HTML Help Workshop. You can open the existing .hpj project file with it and choose the option to convert it to HTML Help project .hhp. You can then compile the .hhp project with the same tool to generate the .chm file.
The tool has many shortcomings, however. It generates an HTML page for each page in the rtf file, but the names of these HTML pages are random, which makes them difficult to reference later.
If you just have the .hlp file and not the original Win Help project files, you can use a decompiler to generate the .hpj and .rtf files first and then convert them using HTML Help Workshop.
I found the following link quite helpful:
http://www.help-info.de/en/Help_Info_WinHelp/hw_converting.htm
EDIT: there are also some 3rd party converters and Help Authoring Tools (HATs) available which may do the job better than HTML Help Workshop, but most of them are not free.
Keep in mind that CHM is compiled HTML and has little in common with the RTF-based WinHelp format, so your main problem is the conversion of RTF to HTML.
I would try to convert the RTF to HTML, with one file per topic.
What you could try is to open the RTF in Word and save it as HTML, then use a program or script to split the various topics out into individual files and fix up the references.
Then compile the result with a CHM compiler (like MS HTML Help Workshop).
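
The splitting step could start from something like the following sketch. It assumes every topic begins with an h1 heading, which may not hold for the HTML that Word produces, and it does not fix up cross-references; split_topics is an illustrative name, not part of any tool mentioned here.

```python
import re
from html import unescape

# Assumption: each help topic starts at an <h1> element.
HEADING = re.compile(r"<h1\b[^>]*>(.*?)</h1>", re.IGNORECASE | re.DOTALL)


def split_topics(body_html: str) -> list[tuple[str, str]]:
    """Split HTML into (title, fragment) pairs, one per <h1> section."""
    matches = list(HEADING.finditer(body_html))
    topics = []
    for i, match in enumerate(matches):
        # A topic runs from its heading up to the next heading (or the end).
        end = matches[i + 1].start() if i + 1 < len(matches) else len(body_html)
        # Strip inline tags and decode entities to get a plain-text title.
        title = unescape(re.sub(r"<[^>]+>", "", match.group(1))).strip()
        topics.append((title, body_html[match.start():end]))
    return topics
```

Each fragment can then be wrapped in a minimal HTML page and written to a file with a predictable name derived from the title, which avoids the random page names mentioned above.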