unwanted indent after line wrap - apache-fop

I'm converting some Word documents (docx) to PDF with docx4j 6.1.2 and docx4j-export-fo 8.1.2 (Apache FOP 2.3) on Java 11, like this:
// Load File
var wordMLPackage = WordprocessingMLPackage.load(wordDocument.getInputStream());
// Convert to PDF
var out = new FastByteArrayOutputStream();
Docx4J.toPDF(wordMLPackage, out);
return new ByteArrayResource(out.toByteArray());
Every paragraph in the generated PDF has a formatting issue I can't get a grip on. The following image shows a section from the docx in Word.
The next image shows the same section from the PDF file.
Each wrapped line starts with some extra indent on the left side.
Long lines are not wrapped.
Any ideas?
Edit 1:
The docx File is here: https://filebin.net/cux9s1p5ufm1vgul.

<dependency>
<groupId>org.docx4j</groupId>
<artifactId>docx4j-export-fo</artifactId>
<version>6.1.0</version>
</dependency>
With this version (6.1.0) the conversion works OK.
It seems the problem is white-space-collapse="false" white-space-treatment="preserve" introduced by https://github.com/plutext/docx4j-export-FO/commit/4451111aa02a698ed54788299513f7eac74bd996#diff-eeb9c00a64479f4ff29769e29a6a0cd7R455

Related

Apache POI docx: HTML as an altChunk

Good morning
I would like to add HTML as an altChunk to a DOCX file using Apache POI. To do that I followed this stackoverflow answer
How to add an altChunk element to a XWPFDocument using Apache POI
Everything works perfectly except for a problem with the special characters of my language (Italian).
My case is as follows: I have an external HTML file. To import it I use the following code:
byte[] inputBytes = Files.readAllBytes(Paths.get("testo.html"));
String xhtml = new String(inputBytes, StandardCharsets.UTF_8);
Then I generate the docx using the code provided in the stackoverflow answer.
If I unzip the .docx, the file "chunk1.html" is correctly present under the "word" folder.
If I open it, the special characters are shown correctly, for example:
L'attività in oggetto è:
but when I open the document in Word I see this:
L'attivitÃ  in oggetto Ã¨:
Is there some Microsoft configuration that I missed?
Do I need to specify the character set when I create the chunk?
Microsoft seems to take ANSI as the default character encoding for HTML chunks in Word. That's annoying, as the rest of the world now takes Unicode (UTF-8) as the default.
So we need to set charset for the HTML explicitly. In the template of the chunk's HTML do:
...
private MyXWPFHtmlDocument(PackagePart part, String id) throws Exception {
super(part);
this.html = "<!DOCTYPE html><html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\"><style></style><title>HTML import</title></head><body></body>";
this.id = id;
}
...
I would recommend this instead of using ANSI encoding for the HTML chunks.
I have edited this into my answer in How to add an altChunk element to a XWPFDocument using Apache POI too.
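To see why the missing charset declaration matters, here is a minimal sketch that reproduces the kind of garbling described in the question. It assumes that Word's "ANSI" default corresponds to windows-1252; the class and method names (`MojibakeDemo`, `misdecode`) are made up for illustration and are not part of Apache POI.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Assumption: the reader treats the bytes as windows-1252 ("ANSI")
    // although they were written as UTF-8.
    static String misdecode(String text) {
        byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // Each accented letter is two bytes in UTF-8, so it comes back
        // as two unrelated characters when decoded byte by byte.
        System.out.println(misdecode("L'attività in oggetto è:"));
    }
}
```

Every accented letter comes out as a pair of characters starting with "Ã", which is the typical symptom of UTF-8 content read as ANSI. Declaring `charset=utf-8` in the chunk's HTML head tells the reader which decoding to use.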

convert stream file of iText PDF not opening MS word

Our project is required to generate the final report as both a PDF and an MS Word document. We are using iTextSharp to dynamically generate the tables and rows in the report. Finally, we upload the file to the server as PDF and as MS Word: in both cases the content is converted to a byte array/stream and saved as a PDF and as an MS Word document. The uploaded PDF works as expected, but the MS Word file produces an error and does not open (screenshot attached).
iTextSharp doesn't produce MS Word documents, so this isn't an actual iText question. When I look at your screen shot, I see that you are trying to import a PDF file into Word. Since Word can't interpret PDF syntax, it shows you the syntax of the PDF file:
%PDF-1.4
%âãÏÓ
1 0 obj
<</Type/Font...
I think your question is wrong. You are not using iTextSharp to create a PDF file and an MS Word file. You are using iTextSharp to create a PDF file, and not an MS Word file.
There is no such thing as "Save a PDF as MS Word file" in iTextSharp, and it will be extremely difficult to find another tool that can convert a PDF document to a Word document in an acceptable way. (There are such tools, but the quality is suboptimal for PDFs that weren't made to be converted to another format.)
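A quick way to confirm this diagnosis is to look at the first bytes of the stream before saving it with a Word extension. The following is a minimal sketch in Java (the question uses C#, but the check is the same); the class and method names (`PdfSniffer`, `looksLikePdf`) are hypothetical and not part of iText or iTextSharp.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PdfSniffer {

    // A PDF file starts with the ASCII marker "%PDF-", as in the "%PDF-1.4" header above.
    private static final byte[] PDF_MAGIC = "%PDF-".getBytes(StandardCharsets.US_ASCII);

    // Returns true if the content begins with the PDF header marker.
    static boolean looksLikePdf(byte[] content) {
        if (content == null || content.length < PDF_MAGIC.length) {
            return false;
        }
        return Arrays.equals(Arrays.copyOf(content, PDF_MAGIC.length), PDF_MAGIC);
    }

    public static void main(String[] args) {
        byte[] upload = "%PDF-1.4\n1 0 obj".getBytes(StandardCharsets.ISO_8859_1);
        // If this prints true, the bytes are a PDF, whatever file extension they are saved under.
        System.out.println(looksLikePdf(upload));
    }
}
```

If the check succeeds for the bytes that are being saved as the Word file, the file is a PDF with a different extension, which is why Word shows the raw PDF syntax.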

Put highlighted code in a word document using Apache POI

I'm generating some docx file (using Apache POI) that has a lot of SQL code in it. Because I'd like that code to be colored in a Word document, I'm first generating HTML with styles that does syntax highlighting. Now I can't put that HTML in a Word document. Is that even possible (using POI)?
What I'd like to achieve is SQL code in a docx being colored based on a generated HTML (like exporting SQL code from Notepad++ as HTML and pasting it in a Word document). Any ideas?

Cyrillic characters in Apache POI excel file hyperlink address

I use Scala and Apache POI (with folone/poi-scala).
I want to create a hyperlink to a local file in a cell. The path of the file contains Cyrillic characters. In Excel I can't open this file; I see '?' instead of the Cyrillic characters.
I tried many different encodings, including URL encoding, but it did not work.
Here is my code:
...
val cell = sheetOne.asPoi.getSheetAt(0).getRow(0).getCell(0)
cell.setHyperlink({
val link = new HSSFHyperlink(HSSFHyperlink.LINK_FILE)
link.setAddress("D:/Проверка/проверка.txt")
link
})
...
Any suggestions?
You need to replace
HSSFHyperlink.LINK_FILE
by
HSSFHyperlink.LINK_URL

PDFs are not displaying apostrophes in field data inserted by iTextSharp

I am using iTextSharp to fill pre-defined fields on an existing PDF document using the following syntax:
PdfStamper stamper = new PdfStamper(reader, stream);
stamper.AcroFields.SetField("A","O'Henry");
stamper.FormFlattening = true;
stamper.Close();
Unfortunately, apostrophes (and likely other forms of common punctuation) are not displayed in the output PDF. For instance, in the code above, field "A" displays the text "OHENRY" instead of "O'HENRY".
How do I get the output PDF to display the text including the apostrophes?
Also, please note that I do not have control over creating/modifying the original PDF being filled. I was given the PDF from an external source and will likely be given new versions of the PDF as the form changes.
Thanks!
An easy fix is to replace the single quotes with the ` character.
I found a solution here http://www.nabble.com/Populating-form-fields-with-Unicode-data-td21610346.html.
This solution involves embedding into each field a font that can handle the desired characters.