Apache POI docx: HTML as an altChunk - ms-word

Good morning
I would like to add HTML as an altChunk to a DOCX file using Apache POI. To do that I followed this Stack Overflow answer:
How to add an altChunk element to a XWPFDocument using Apache POI
Everything works perfectly except for the special characters of my language (Italian).
My case is as follows: I have an external HTML file, which I import with the following code:
byte[] inputBytes = Files.readAllBytes(Paths.get("testo.html"));
String xhtml = new String(inputBytes, StandardCharsets.UTF_8);
Then I generate the docx using the code provided in the Stack Overflow answer.
If I unzip the .docx, the file "chunk1.html" is correctly present under the "word" folder.
If I open it, the special characters are shown correctly, for example:
L'attività in oggetto è:
but when I open the document in Word I see this:
L'attivitÃ  in oggetto Ã¨:
Is there some Microsoft configuration that I missed?
Do I need to specify the character set when I create the chunk?

Microsoft Word seems to assume ANSI (the Windows code page) as the default character encoding for HTML chunks. That's annoying, since the rest of the world now defaults to Unicode (UTF-8).
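To see why the text comes out garbled, here is a minimal sketch using only the JDK. It encodes the sample sentence as UTF-8 and then decodes those bytes as windows-1252, the code page usually meant by "ANSI" on Western European Windows systems. The class and method names are made up for illustration, and the choice of windows-1252 is an assumption; other locales would fall back to a different code page.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class MojibakeDemo {

    // Hypothetical helper: decode UTF-8 bytes with the wrong (ANSI) charset.
    // Assumption: this mirrors what Word does when the chunk declares no charset.
    static String readUtf8AsAnsi(String text) {
        byte[] utf8Bytes = text.getBytes(StandardCharsets.UTF_8);
        return new String(utf8Bytes, Charset.forName("windows-1252"));
    }

    public static void main(String[] args) {
        // The sample sentence, written with Unicode escapes so that the
        // encoding of this source file does not influence the result.
        String original = "L'attivit\u00E0 in oggetto \u00E8:";
        System.out.println(readUtf8AsAnsi(original));
    }
}
```

Each accented letter occupies two bytes in UTF-8, so a single-byte decoder turns it into two unrelated characters, while plain ASCII text passes through unchanged.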
So we need to declare the charset of the HTML explicitly. In the template of the chunk's HTML, do:
...
private MyXWPFHtmlDocument(PackagePart part, String id) throws Exception {
    super(part);
    this.html = "<!DOCTYPE html><html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\"><style></style><title>HTML import</title></head><body></body>";
    this.id = id;
}
...
I would recommend this rather than encoding the HTML chunks as ANSI.
I have also edited this into my answer to How to add an altChunk element to a XWPFDocument using Apache POI.
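The same idea can be sketched outside the POI class with only the JDK: wrap the fragment in a template that declares the charset, then encode the resulting string as UTF-8 so the bytes match the declaration. The names `ChunkHtml`, `wrap`, and `toBytes` are hypothetical, and filling the template by replacing the empty `<body></body>` is an assumption about how the body gets inserted, not necessarily what the linked answer does.

```java
import java.nio.charset.StandardCharsets;

public class ChunkHtml {

    // Template copied from the constructor above; it declares charset=utf-8.
    static final String TEMPLATE =
        "<!DOCTYPE html><html><head>"
        + "<meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\">"
        + "<style></style><title>HTML import</title></head><body></body>";

    // Hypothetical helper: put an HTML fragment into the empty body of the template.
    static String wrap(String bodyHtml) {
        return TEMPLATE.replace("<body></body>", "<body>" + bodyHtml + "</body>");
    }

    // Encode with the same charset the meta tag announces.
    static byte[] toBytes(String html) {
        return html.getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        String html = wrap("<p>L'attivit\u00E0 in oggetto \u00E8:</p>");
        System.out.println(toBytes(html).length + " bytes to write into the chunk part");
    }
}
```

The point of the design is that the declared charset and the charset used to produce the bytes are kept in one place, so they cannot drift apart.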

Related

Put highlighted code in a word document using Apache POI

I'm generating a docx file (using Apache POI) that has a lot of SQL code in it. Because I'd like that code to be colored in the Word document, I first generate HTML with styles that do the syntax highlighting. Now I can't get that HTML into the Word document. Is that even possible (using POI)?
What I'd like to achieve is SQL code in a docx that is colored based on generated HTML (like exporting SQL code from Notepad++ as HTML and pasting it into a Word document). Any ideas?

php remove arabic/farsi string from string

I fetched content from another website (in Persian) with "simple dom html" and stored the content of a div in a variable. Here is my code:
$html = file_get_html('./test.html');
$tmp = $html->find('a div.min_price_space')->plaintext;
So my first question is: how can I detect the character encoding of this string?
To detect the encoding I used the code below, which does not work:
echo mb_detect_encoding($tmp);
Here is a sample of the string in my language (Persian): "کمترین قیمت رزرو شبی ۲۲۸,۰۰۰ تومان". I want to remove "تومان" from this string, so I used the code below:
$result = str_replace('تومان','',$tmp);
When I run my PHP file, IE shows just "?" instead of my string. If I add header('Content-Type: text/html; charset=utf-8'); to my PHP file, the string is displayed with the right characters, but the specified substring is not removed from it.
Do you have any idea how to fix this?
I found my problem.
In Visual Studio 2013, go to Tools > Options > Environment and check the option "Save documents as Unicode when data cannot be saved in codepage".
After that everything works perfectly.

Writing CR+LF into Open XML from a Database

I'm trying to take some data stored in a database and populate a Word template's Content Controls with it using the Open XML SDK. The data contains paragraphs and so there are carriage return and line feed characters in it. The data is stored in the database as nvarchar.
When I open the generated document, the CR+LF combination shows up as a question mark with a box around it (I'm not sure of the name of this character). There are actually two sequences back to back, so CR+LF CR+LF produces two strange characters.
If I unzip the .docx, take the Custom XML part and do a hex dump, I can clearly see 0d0a 0d0a so the CR+LF is there. Word is just printing it weird.
I've tried enforcing UTF-8 encoding in my XmlWriter's settings, but that didn't seem to help:
Dim docStream As New MemoryStream
Dim settings As XmlWriterSettings = New XmlWriterSettings()
settings.Encoding = New UTF8Encoding(False)
Dim docWriter As XmlWriter = XmlTextWriter.Create(docStream, settings)
Does anyone know how I can get Word to render these characters correctly when written to a .docx through the Open XML SDK?
To bind to a Word 2013 rich text control, your XML element has to contain a complete docx. See [MS-DOCX]:
the data stored in the XML element will be an escaped string comprised of a flattened WordprocessingML document representing the formatted data in the structured document tag range.
Earlier versions couldn't bind a rich text control.
Things should work though (with CR/LF, not w:br), if you bind to a plain text control, with multiline set to true.

How to properly store umlauts and accents in vcard file?

From the RFC for vCard 4.0 I learned that vCard 4.0 is always UTF-8.
I am using ez-vcard to export contacts into an export.vcf file transferred via HTTP:
response.setContentType("text/vcard; charset=utf-8");
response.setStatus(HttpServletResponse.SC_OK);
PrintWriter writer = response.getWriter();
VCardWriter vCardWriter = new VCardWriter(writer, VCardVersion.V4_0);
// add cards...
vCardWriter.close();
Guess what? Characters are not being encoded properly. If I open the file in a text editor, I see characters are messed up.
Any help?
The servlet container may be ignoring the character encoding specified in the content type because you are setting the type to something other than text/html.
Try setting the character encoding using setCharacterEncoding() instead (make sure to call it before calling getWriter()).
response.setContentType("text/vcard");
response.setCharacterEncoding("UTF-8");
response.setStatus(HttpServletResponse.SC_OK);
PrintWriter writer = response.getWriter();
Also, make sure your text editor is reading the file correctly. During my testing, I found that Eclipse would not display UTF-8 characters correctly, because it was configured to load the file under a different character set. Try viewing the file contents from the terminal:
cat the-vcard-file.vcf
EDIT: One more thing: Do not close the VCardWriter object. This will close the servlet's PrintWriter object, which you must never close!!
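The effect of the writer's charset can be reproduced without a servlet container. The sketch below pushes text through an `OutputStreamWriter`, once with ISO-8859-1, which the Servlet specification names as the default when no response encoding has been set, and once with UTF-8. The class `WriterCharsetDemo` and its `encode` method are hypothetical, and the in-memory stream only stands in for the response; the sketch flushes the writer instead of closing it, in line with the advice above.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.UncheckedIOException;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class WriterCharsetDemo {

    // Hypothetical helper: write text through a Writer bound to the given
    // charset and return the bytes that would go over the wire.
    static byte[] encode(String text, Charset charset) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        Writer writer = new OutputStreamWriter(out, charset);
        try {
            writer.write(text);
            writer.flush(); // flush instead of close, as one would with the servlet's writer
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    public static void main(String[] args) {
        String name = "\u0141\u00F3d\u017A"; // a name with characters outside ISO-8859-1
        byte[] latin1 = encode(name, StandardCharsets.ISO_8859_1);
        byte[] utf8 = encode(name, StandardCharsets.UTF_8);
        System.out.println("ISO-8859-1 bytes: " + latin1.length);
        System.out.println("UTF-8 bytes: " + utf8.length);
    }
}
```

Characters the charset cannot represent are silently replaced by the writer, and characters it can represent still end up as bytes that are not valid UTF-8, which is why the file looks broken in an editor that expects UTF-8.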

Cyrillic characters in Apache POI excel file hyperlink address

I use Scala and Apache POI (with folone/poi-scala).
I want to create a hyperlink to a local file in a cell. The path of the file contains Cyrillic characters, and Excel can't open the file: I see '?' instead of the Cyrillic characters.
I tried many encodings, including URL encoding, but nothing worked.
Here is my code:
...
val cell = sheetOne.asPoi.getSheetAt(0).getRow(0).getCell(0)
cell.setHyperlink({
  val link = new HSSFHyperlink(HSSFHyperlink.LINK_FILE)
  link.setAddress("D:/Проверка/проверка.txt")
  link
})
...
Any suggestions?
You need to replace
HSSFHyperlink.LINK_FILE
with
HSSFHyperlink.LINK_URL