docx4j - Set default font or encoding to UTF-8 for docx output file

I'm using docx4j to build a translation app whose input and output files are both docx. I have problems with Chinese-character input. This is the w:rFonts tag of the input file: <w:rFonts w:hint="eastAsia" w:ascii="MingLiU" w:hAnsi="MingLiU" w:eastAsia="MingLiU" w:cs="MingLiU"/>
How can I change the font to Times New Roman in the output file, or change the encoding to UTF-8?
Thank you guys!

The encoding should be UTF-8 already. That's standard for docx files.
The simplest way to change to "Times New Roman" is to set the attributes of the rFonts tag above, that is, where it says "MingLiU".
To do that, get the rFonts object (in direct formatting, styles, etc.) and set its ascii, hAnsi, eastAsia and cs attributes.
You should also change the font in the document defaults (rPrDefault), since that takes effect anywhere it isn't overridden by another rFonts tag.
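A minimal sketch of that approach in docx4j (the file names, and the XPath lookup used to reach direct formatting, are illustrative assumptions rather than the only way to do it):

import java.io.File;
import java.util.List;

import org.docx4j.openpackaging.packages.WordprocessingMLPackage;
import org.docx4j.openpackaging.parts.WordprocessingML.MainDocumentPart;
import org.docx4j.wml.RFonts;

public class ChangeFonts {

    public static void main(String[] args) throws Exception {
        // Hypothetical file names; adjust to your own paths.
        WordprocessingMLPackage pkg = WordprocessingMLPackage.load(new File("input.docx"));
        MainDocumentPart main = pkg.getMainDocumentPart();

        // 1. Document defaults: w:docDefaults/w:rPrDefault/w:rPr/w:rFonts.
        //    In a minimal document some links in this chain can be null; they are
        //    assumed present here because the input already declares rFonts.
        RFonts defaultFonts = main.getStyleDefinitionsPart().getJaxbElement()
                .getDocDefaults().getRPrDefault().getRPr().getRFonts();
        setTimesNewRoman(defaultFonts);

        // 2. Direct formatting: every w:rFonts element in the main document part.
        List<Object> nodes = main.getJAXBNodesViaXPath("//w:rFonts", false);
        for (Object o : nodes) {
            if (o instanceof RFonts) {
                setTimesNewRoman((RFonts) o);
            }
        }

        pkg.save(new File("output.docx"));
    }

    private static void setTimesNewRoman(RFonts rFonts) {
        if (rFonts == null) {
            return;
        }
        rFonts.setAscii("Times New Roman");
        rFonts.setHAnsi("Times New Roman");
        rFonts.setEastAsia("Times New Roman");
        rFonts.setCs("Times New Roman");
    }
}

Note that the XPath lookup only touches the main document part; rFonts defined in styles.xml, headers or footers would need the same treatment applied to those parts.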

Related

Can eclipse detect a file encoding with a specific text header?

What can I do to a file so that Eclipse opens it with UTF-8 encoding on any computer?
Context: I will distribute a text file to multiple people. This file contains UTF-8 characters, but Eclipse does not display them correctly by default, since the file properties specify that the encoding is "Default (Inherited from container: Cp1252)".
The file is displayed correctly if I change this property to "Other: UTF-8", but I don't want the people who receive this file to have to configure this property, or change any other setting, just to see the UTF-8 characters correctly.
If the text file starts with a UTF-8 Byte Order Mark (the sequence 0xEF,0xBB,0xBF) then the Eclipse content type system should recognize the file as being UTF-8.
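For example, here is one way to produce such a file (the file name and content are made up for illustration); writing U+FEFF through a UTF-8 writer emits exactly those three bytes at the start of the file:

import java.io.FileOutputStream;
import java.io.OutputStreamWriter;
import java.io.Writer;
import java.nio.charset.StandardCharsets;

public class WriteUtf8WithBom {
    public static void main(String[] args) throws Exception {
        try (Writer w = new OutputStreamWriter(
                new FileOutputStream("distributed.txt"), StandardCharsets.UTF_8)) {
            w.write('\uFEFF');                        // BOM, encoded as 0xEF,0xBB,0xBF
            w.write("Accented sample text: éàç");     // regular UTF-8 content follows
        }
    }
}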

Extra characters in .doc file when opened with textpad

When I open a document in TextPad, an extra null character appears between every character.
For example, my document contains the following text:
बॉम्बे testing for webmail.
but when I open it in TextPad it comes out as:
I....M....I t.e.s.t.i.n.g. f.o.r. w.e.b.m.a.i.l.
Can anybody help me with this?
This file is in UTF-16 or UCS-2 format. When opening it, you must specify which encoding to use; your text editor does not detect this encoding automatically.
If your text editor does not let you choose the encoding when opening a file, try Notepad++ or TextPad.
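If you need to read such a file programmatically, here is a minimal sketch (assuming the file really is UTF-16 text rather than a binary .doc; the file name is made up):

import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;

public class ReadUtf16 {
    public static void main(String[] args) throws Exception {
        byte[] bytes = Files.readAllBytes(Paths.get("mail.txt"));
        // The UTF-16 decoder honours a leading BOM; use UTF_16LE or UTF_16BE explicitly if there is none.
        String text = new String(bytes, StandardCharsets.UTF_16);
        System.out.println(text);
    }
}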

Applescript: Save Word documents as plaintext while retaining accents

I'm trying to save Word documents as plain text docs. Currently, sometimes the accents turn into other symbols (usually the same ones; for example, é turns into a theta). Other times it works fine. How do I prevent this?
Currently using the line:
save as active document file name FullDocPath file format format Unicode text
When I encounter this error, I can save the document using the dialog (selecting the Western Mac OS Roman encoding); that fixes the problem.
The applescript Word dictionary mentions:
[text encoding unsigned integer] : Text encoding to use when saving out as text file
I have no idea if this is the piece I'm missing or how to utilize it (is there a set integer that designates Western Mac OS Roman encoding?)
Anyone have any ideas?
Try:
set wordDoc to choose file
do shell script "textutil -convert txt " & quoted form of POSIX path of (wordDoc as text)
Check out StefanK's solution using textutil
This is in response to your comment beginning "Thanks Stefan and bibadiak"
The problem with .txt file formats is that there is no universally used way to specify the encoding of a file inside the file itself, so either the application has to guess, or you have to know the encoding and the application has to let you tell it.
AFAIK if you do not specify an output encoding when you use textutil to convert from .doc or .docx format to text, you get UTF-8. But Mac Word just does not seem to recognise that when you try to open it, either programmatically or in the UI.
So I think you need to do some mix of the following:
a. Save in, and work with, a format that uses a 16-bit Unicode encoding. Word should recognise that, certainly if the BOM is preserved.
b. Save to UTF-8 and work with UTF-8 elsewhere, but use textutil to do the conversion back to (say) .docx before you re-open the document in Mac Word.
c. If all your characters can be encoded using Mac OS Roman, use e.g.
textutil -convert txt -encoding 30
to save, ensure you work only with that character set, and re-open with Word. (30 is the value of the Apple NSString constant NSMacOSRomanStringEncoding.) I think textutil will fail to convert documents that contain characters outside the Mac OS Roman set.

Non unicode to Unicode conversion, for any font!

I have an HTML file with text encoded in a non-Unicode font. I need to convert that file to Unicode. I searched for a converter, but most converters work only for a list of fonts, not for all fonts.
My font is very specific; the text is in Devanagari script.
I have the file, I have the font, now, please suggest me a tool or technique. Thanks.
Unicode is not about fonts, it is about encoding. You need to find a converter that can convert your text to Unicode. What is the encoding of your text?
Apache Tika has the ability to pull text from PDF files via knowledge of font behavior. So if the file is in fact a PDF you have a chance. If you have a text file full of font indices in no particular encoding, you have a big programming job ahead of you.
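For reference, a minimal sketch of text extraction with Apache Tika (the file name is illustrative; this shows plain extraction only and does not by itself remap a legacy Devanagari font encoding to Unicode):

import java.io.File;
import org.apache.tika.Tika;

public class ExtractText {
    public static void main(String[] args) throws Exception {
        Tika tika = new Tika();
        // Auto-detects the document type (PDF, HTML, DOC, ...) and returns its plain text.
        String text = tika.parseToString(new File("document.pdf"));
        System.out.println(text);
    }
}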

How to draw Thai text to PDF file by using libharu library

I am using the free PDF library libharu to generate a PDF file,
but I have an encoding problem: I cannot draw Thai-language text on the PDF file;
all the text shows as "???.."
Does somebody know how to fix it?
Thanks
I have succeeded in rendering ideographic texts (not Thai, but Chinese and Japanese) using libharu. First of all, I used Unicode mode; please refer to the HPDF_UseUTFEncodings() function documentation.
For C language, here is a sequence of libharu API calls needed to overcome your trouble:
HPDF_UseUTFEncodings(docHandle);
HPDF_SetCurrentEncoder(docHandle, "UTF-8");
Here docHandle is a valid HPDF_Doc object.
Next part is proper work with UTF fonts:
/* fontFileName is the path to a .ttf file; HPDF_TRUE embeds the font in the PDF */
const char *libFontName = HPDF_LoadTTFontFromFile(docHandle, fontFileName, HPDF_TRUE);
HPDF_Font font = HPDF_GetFont(docHandle, libFontName, "UTF-8");
After these calls you may render Unicode text containing Thai characters. Also note the embedding flag (third parameter of HPDF_LoadTTFontFromFile): without embedding, your PDF file may be unreadable on machines that lack the font, because it relies on external font references. If you are not too worried about output PDF size, you can simply embed the fonts.
I've tested a couple of Thai .ttf fonts found via Google and they rendered OK. Also (it may be important, but I'm not sure) I'm using the fork of libharu at https://github.com/kdeforche/libharu, which has since been merged into the master branch.
When you write text to the PDF, use the correct font and encoding. The libharu documentation lists all the possibilities: https://github.com/libharu/libharu/wiki/Fonts
In your case, you must use the Thai character set ISO8859-11 (TIS 620-2569).
An example (in Spanish):
HPDF_Font fontEn = HPDF_GetFont(pdf, "Helvetica-Bold", "ISO8859-2");
HPDF_Page_TextOut(page1, 50.00, 750.00, [@"Código para correcta codificación en libharu" cStringUsingEncoding:NSISOLatin1StringEncoding]);