Cyrillic characters in Apache POI excel file hyperlink address - scala

I use Scala and Apache POI (with folone/poi-scala).
I want to create a hyperlink to a local file in a cell. The file path contains Cyrillic characters, and Excel cannot open the file: I see '?' in place of the Cyrillic characters.
I tried many different encodings as well as URL encoding, but nothing worked.
Here is my code:
...
val cell = sheetOne.asPoi.getSheetAt(0).getRow(0).getCell(0)
cell.setHyperlink({
  val link = new HSSFHyperlink(HSSFHyperlink.LINK_FILE)
  link.setAddress("D:/Проверка/проверка.txt")
  link
})
...
Any suggestions?

You need to replace
HSSFHyperlink.LINK_FILE
with
HSSFHyperlink.LINK_URL

Related

Apache POI docx: HTML as an altChunk

Good morning
I would like to add HTML as an altChunk to a DOCX file using Apache POI. To do that I followed this Stack Overflow answer:
How to add an altChunk element to a XWPFDocument using Apache POI
Everything works perfectly except for a problem with the special characters of my language (Italian).
My case is as follows: I have an external HTML file. To import it I use the following code:
byte[] inputBytes = Files.readAllBytes(Paths.get("testo.html"));
String xhtml = new String(inputBytes, StandardCharsets.UTF_8);
Then I generate the docx using the code provided in the Stack Overflow answer.
If I unzip the .docx, the file "chunk1.html" is correctly present under the "word" folder.
If I open it, the special characters are shown correctly, for example:
L'attività in oggetto è:
but when I open the document in Word I see this:
L'attivitÃ  in oggetto Ã¨:
Is there some Microsoft configuration that I missed?
Do I need to specify the character set when I create the chunk?
Microsoft seems to take ANSI as the default character encoding for HTML chunks in Word. That's annoying, as the rest of the world now takes Unicode (UTF-8) as the default.
So we need to set the charset for the HTML explicitly. In the template of the chunk's HTML do:
...
private MyXWPFHtmlDocument(PackagePart part, String id) throws Exception {
  super(part);
  this.html = "<!DOCTYPE html><html><head><meta http-equiv=\"content-type\" content=\"text/html; charset=utf-8\"><style></style><title>HTML import</title></head><body></body>";
  this.id = id;
}
...
I would recommend this instead of using ANSI encoding for the HTML chunks.
I have edited this into my answer in How to add an altChunk element to a XWPFDocument using Apache POI too.
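The garbling described in the question is what you get when UTF-8 bytes are decoded with a single-byte ANSI code page. A minimal Python sketch of the effect, assuming the code page involved is windows-1252 (the thread only says "ANSI", so that choice is an assumption):

```python
# Assumption: the ANSI code page in play is windows-1252 (cp1252).
original = "L'attività in oggetto è:"

# The chunk is stored as UTF-8, so each accented letter becomes two bytes.
utf8_bytes = original.encode("utf-8")

# Decoding those bytes with a single-byte code page yields two characters per accent.
misread = utf8_bytes.decode("cp1252")

# Decoding with the charset that was actually used restores the text.
correct = utf8_bytes.decode("utf-8")

print(repr(misread))
print(correct)
```

Declaring charset=utf-8 in the chunk's meta tag is what lets the consumer choose the second decoding instead of the first.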

Scala java.nio.charset.UnmappableCharacterException: Input length = 1

I've found several questions with similar titles, but couldn't use any of them to resolve my issue. I can't seem to load my .csv file:
val source = io.Source.fromFile("C:/mon_usatotaldat.csv")
Returns:
java.nio.charset.UnmappableCharacterException: Input length = 1
So I tried:
val source = io.Source.fromFile("UTF-8", "C:/mon_usatotaldat.csv")
and got:
java.nio.charset.IllegalCharsetNameException: C:/mon_usatotaldat.csv
I guess UTF-8 wouldn't work, if the file isn't in UTF-8 format, so that makes sense, but I don't know what to do next.
I've managed to discover the encoding is windows-1252 using:
val source = io.Source.fromFile("C:/mon_usatotaldat.csv").codec.decodingReplaceWith("UTF-8")
But this didn't do what I had expected, which was to convert the file to UTF-8. I have no idea how to work with it.
Another thing I've tried was:
val source = io.Source.fromFile("windows-1252","C:/mon_usatotaldat.csv")
But that returned:
java.nio.charset.IllegalCharsetNameException: C:/mon_usatotaldat.csv
Please help. Thanks in advance.
Try mapping your Excel file to UTF-8 first and then try val source = io.Source.fromFile("C:/mon_usatotaldat.csv", "UTF-8") (the file name comes first, the encoding second).
To map to UTF-8 try:
(1) Open the Excel file that holds the data (.xls, .xlsx).
(2) In Excel, choose "CSV (Comma Delimited) (*.csv)" as the file type and save as that type.
(3) Open the saved .csv file in Notepad (found under "Programs" and then "Accessories" in the Start menu).
(4) Choose Save As... At the bottom of the "Save As" box there is a select box labelled "Encoding". Select UTF-8 (do NOT use ANSI or you lose all accents etc.), then save the file under a slightly different name from the original.
This file is in UTF-8 and retains all characters and accents and can be imported, for example, into MySQL and other database programs.
Reference: Excel to CSV with UTF8 encoding
Hope this helps!
Set up an InputStreamReader to correctly read windows-1252. Don't bother with intermediate UTF-8.
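The last suggestion amounts to decoding the bytes with the charset the file was actually written in. A minimal sketch of that idea, in Python rather than Scala; the sample file and its contents are made up for the demonstration:

```python
import os
import tempfile

# Hypothetical sample: a small CSV with accented characters, saved as windows-1252.
sample_text = "city,price\nMálaga,10€\n"
fd, path = tempfile.mkstemp(suffix=".csv")
with os.fdopen(fd, "wb") as f:
    f.write(sample_text.encode("cp1252"))

# Reading with the wrong charset fails, much like the exception in the question.
try:
    with open(path, encoding="utf-8") as f:
        f.read()
    utf8_failed = False
except UnicodeDecodeError:
    utf8_failed = True

# Reading with the file's real charset works; no intermediate conversion is needed.
with open(path, encoding="cp1252", newline="") as f:
    rows = f.read()

os.remove(path)
print(utf8_failed, rows == sample_text)
```

In Scala the equivalent is to pass the file name first and the charset second, for example io.Source.fromFile("C:/mon_usatotaldat.csv", "windows-1252"). The IllegalCharsetNameException in the question comes from the two arguments being swapped.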

How to convert escaped Unicode (e.g. \u0432\u0441\u0435) to UTF-8 chars (все) in Notepad++

I have .properties files with a bunch of Unicode-escaped characters. I want to convert them so that the actual characters are displayed.
E.g.:
Currently: \u0432\u0441\u0435 \u0433\u043e\u0442\u043e\u0432\u043e\u005c
Desired result: все готово
Notepad++ is already set to encode UTF8 without BOM. Opening the document and 'converting' (from the Encoding drop-down menu) doesn't do anything.
How do I achieve this with notepad++?
If not in Notepad++, is there any other way to do this for many files, perhaps by using some script?
You need a plugin named HTML Tag.
Once the plugin is installed, select your text and invoke the command Plugins > HTML Tag > Decode JS (Ctrl+Shift+J).
I'm not aware of how you can do it natively in Notepad++ but as requested you can script it with Python:
import codecs

# Open the file and decode the \uXXXX escapes into real Unicode text
with codecs.open("escaped-unicode.txt", "rb", "unicode_escape") as my_input:
    contents = my_input.read()
# contents is now a Unicode string (unicode in Python 2, str in Python 3)

# Write the text back out encoded as UTF-8
with codecs.open("utf8-out.txt", "wb", "utf8") as my_output:
    my_output.write(contents)
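Since the question also asks about many files, the same decoding can be wrapped in a loop. The sketch below is one possible way to do it in Python 3; the function names, the "*.properties" pattern and the sample file are illustrative assumptions. Note that the unicode_escape codec also interprets other backslash sequences such as \n and \t, so files that rely on literal backslashes need extra care.

```python
import tempfile
from pathlib import Path


def decode_escapes(raw: bytes) -> str:
    """Turn \\uXXXX sequences into the characters they stand for."""
    return raw.decode("unicode_escape")


def convert_tree(root: Path, pattern: str = "*.properties") -> int:
    """Rewrite every matching file under root as UTF-8; return the count."""
    count = 0
    for path in sorted(root.rglob(pattern)):
        text = decode_escapes(path.read_bytes())
        path.write_text(text, encoding="utf-8")
        count += 1
    return count


# Small self-check in a throwaway directory (hypothetical file name and key).
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "messages.properties"
    sample.write_bytes(rb"greeting=\u0432\u0441\u0435 \u0433\u043e\u0442\u043e\u0432\u043e")
    converted = convert_tree(Path(tmp))
    result = sample.read_text(encoding="utf-8")

print(converted, result)
```

The files are overwritten in place, so it is safer to run this on a copy of the directory first.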

Writing CR+LF into Open XML from a Database

I'm trying to take some data stored in a database and populate a Word template's Content Controls with it using the Open XML SDK. The data contains paragraphs and so there are carriage return and line feed characters in it. The data is stored in the database as nvarchar.
When I open the generated document, the CR+LF combination shows up as a question mark with a box around it (I'm not sure of the name of this character). This is actually two sequences back to back, so CR+LF CR+LF shows up as two strange characters.
If I unzip the .docx, take the Custom XML part and do a hex dump, I can clearly see 0d0a 0d0a so the CR+LF is there. Word is just printing it weird.
I've tried enforcing UTF-8 encoding in my XmlWriter's settings, but that didn't seem to help:
Dim docStream As New MemoryStream
Dim settings As XmlWriterSettings = New XmlWriterSettings()
settings.Encoding = New UTF8Encoding(False)
Dim docWriter As XmlWriter = XmlTextWriter.Create(docStream, settings)
Does anyone know how I can get Word to render these characters correctly when written to a .docx through the Open XML SDK?
To bind to a Word 2013 rich text control, your XML element has to contain a complete docx. See [MS-DOCX]:
the data stored in the XML element will be an escaped string comprised of a flattened WordprocessingML document representing the formatted data in the structured document tag range.
Earlier versions couldn't bind a rich text control.
Things should work though (with CR/LF, not w:br), if you bind to a plain text control, with multiline set to true.

Saving outlook mail items in UTF-8 / Unicode using C#

We have created an Outlook Plugin which (amongst other things) can be used to save Mail items in text form to a specific folder. However, the text of the resulting text file is encoded in ANSI and I would like to save it as UTF8. I have already set the Codepage of the mail item like so:
mail = (MailItem)objItem;
mail.InternetCodepage = 65001; // equal UTF8 encoding; see http://msdn.microsoft.com/en-us/library/office/ff860730.aspx
mail.SaveAs(filePath, olSaveAsType);
However, the resulting file is saved as "ANSI as UTF8" and all extended characters (e.g. in Arabic or Russian) come out as question marks.
Does anyone know how I can save the mail item in utf8?
Thanks a lot.
Cheers,
Martin
Instead of trying to set the encoding, try reading InternetCodepage and then using a System.Text.Encoding object to read the saved file into a string. You could then convert and re-save the string as another file in the encoding you prefer.
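The read-then-re-save approach can be sketched as follows. This is Python rather than C#, and the source code page (windows-1251 here) is an assumption standing in for whatever InternetCodepage reports; the file names are hypothetical.

```python
import os
import tempfile


def transcode(src_path: str, dst_path: str, src_encoding: str) -> None:
    """Decode src_path with src_encoding and write the text to dst_path as UTF-8."""
    with open(src_path, encoding=src_encoding, newline="") as src:
        text = src.read()
    with open(dst_path, "w", encoding="utf-8", newline="") as dst:
        dst.write(text)


# Demonstration with a throwaway file that stands in for the saved mail item.
workdir = tempfile.mkdtemp()
saved = os.path.join(workdir, "mail.txt")
converted = os.path.join(workdir, "mail-utf8.txt")
with open(saved, "wb") as f:
    f.write("Привет, мир".encode("cp1251"))

transcode(saved, converted, "cp1251")
with open(converted, "rb") as f:
    utf8_bytes = f.read()

print(utf8_bytes.decode("utf-8"))
```

This only helps when the characters survived the save. If the saved file already contains literal question marks, the original characters are gone and re-encoding cannot bring them back.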