Special characters are not converted by Apache FOP 1.0 - apache-fop

I have an XML document that I am converting to PDF using Apache FOP 1.0. The conversion runs, but some special characters such as Δ, μ, ρ and α are not rendered correctly and a different character appears in the output.
Expected Output = Δ INTRODUCTION length constant (λ)
Actual Output = # INTRODUCTION length constant (#)
FOP prints a # symbol in place of each character it does not recognise, and when I try other tools I don't get this error.
What is the problem with Apache FOP?

You may wish to try using the character in escaped format in your template, e.g.
&#x3B1;
for the Greek alpha. Refer here: http://www.fileformat.info/info/unicode/char/03b1/index.htm
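If the template is generated by a script, the escaped form can be produced mechanically. The following is a minimal sketch in Python, assuming the XML text is available as a string; the helper name to_char_refs is hypothetical and nothing here is specific to FOP. Escaping only changes how the character is written in the source, so the font used for the output still has to contain the glyph.

```python
def to_char_refs(text):
    """Replace every non-ASCII character with a decimal XML character reference."""
    # 'xmlcharrefreplace' is a standard error handler for Python's encoders.
    return text.encode("ascii", "xmlcharrefreplace").decode("ascii")


print(to_char_refs("Δ INTRODUCTION length constant (λ)"))
```

Decimal references such as &#945; and hexadecimal ones such as &#x3B1; name the same code point, so either form should be acceptable to an XML parser.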

Related

Unicode variable labels

I have a dataset but its variable labels are in unicode like this:
I tried typing:
unicode
However, this simply displays the following:
How can I correctly display the unicode?
Or, at least, is there any method I can see the labels using another program?
Assuming your data are in a file data.dta in your working directory (where Stata should point):
clear
unicode encoding set euc-kr
unicode translate data.dta
Type help encodings from Stata's command prompt for details regarding the different formats.

Unicode character creation in Python 3.4

Using Python 3.4, suppose I have some data from a file, and it is literally the 6 individual characters \ u 0 0 C 0 but I need to convert it to the single unicode character \u00C0. Is there a simple way of doing that conversion? I can't find anything in the Python 3.4 Unicode documentation that seems to provide that kind of conversion, except for a complex way using exec() of an assignment statement which I'd like to avoid if possible.
Thanks.
Well, there is:
>>> b'\\u00C0'.decode('unicode-escape')
'À'
However, the unicode-escape codec is aimed at a particular format of string encoding, the Python string literal. It may produce unexpected results when faced with other escape sequences that are special in Python, such as \xC0, \n, \\ or \U000000C0, and it may not recognise escape sequences from other string literal formats. It may also handle characters outside the Basic Multilingual Plane incorrectly (e.g. JSON would encode U+10000 as the surrogates \uD800\uDC00).
So unless your input data really is a Python string literal shorn of its quote delimiters, this isn't the right thing to do and it'll likely produce unwanted results for some of these edge cases. There are lots of formats that use \u to signal Unicode characters; you should try to find out what format it is exactly, and use a decoder for that scheme. For example if the file is JSON, the right thing to do would be to use a JSON parser instead of trying to deal with \u/\n/\\/etc yourself.
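To make the difference between the two decoders concrete, here is a small sketch that decodes the same text under JSON rules and under Python string-literal rules. The function names are hypothetical, and the JSON variant assumes the input is the body of a JSON string with its surrounding quotes removed and no unescaped double quotes inside.

```python
import json


def decode_json_escapes(raw):
    """Decode escape sequences in raw using JSON string rules."""
    # Re-wrap the text in quotes so the JSON parser sees a complete string.
    return json.loads('"' + raw + '"')


def decode_python_escapes(raw):
    """Decode escape sequences in raw using Python string-literal rules."""
    # unicode-escape works on bytes; this sketch assumes ASCII-only input.
    return raw.encode("ascii").decode("unicode-escape")


print(decode_json_escapes(r"\u00C0"))
print(decode_python_escapes(r"\u00C0"))

# A surrogate pair is where the two schemes can disagree, so compare lengths.
print(len(decode_json_escapes(r"\uD800\uDC00")))
print(len(decode_python_escapes(r"\uD800\uDC00")))
```

For a simple BMP character both routes agree; the surrogate-pair case is the one worth checking against your real data before picking a decoder.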

How to produce Unicode characters with Matlab LaTeX interpreter

I have the following line of code
ylabel('Średnia wartość parametru $f_{max}$','Interpreter','latex');
and would like to use it as a label for my plot. Unfortunately what I actually get is:
Warning: Unable to interpret LaTeX string
If I remove Unicode characters like so:
ylabel('Srednia wartosc parametru $f_{max}$','Interpreter','latex');
it works with no problem.
So how could I make Matlab print those unfortunate characters?
Use LaTeX representations for those characters: \'S, \'s, etc. And don't forget to duplicate quote signs within the string:
ylabel('\''Srednia warto\''s\''c parametru $f_{max}$','Interpreter','latex')
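If many labels need this treatment, the substitution and the quote doubling can be scripted. The sketch below is in Python rather than Matlab and only covers the three accented letters from the label above; the table LATEX_ACCENTS and the function to_matlab_latex are hypothetical names, and other letters would need their own LaTeX commands added.

```python
# Illustrative subset only: the acute-accented letters used in the label above.
LATEX_ACCENTS = {
    "Ś": r"\'S",
    "ś": r"\'s",
    "ć": r"\'c",
}


def to_matlab_latex(text):
    """Return text as the body of a single-quoted Matlab string with LaTeX accents."""
    latex = "".join(LATEX_ACCENTS.get(ch, ch) for ch in text)
    # Inside a single-quoted Matlab string, a quote is escaped by doubling it.
    return latex.replace("'", "''")


label = to_matlab_latex("Średnia wartość parametru $f_{max}$")
print("ylabel('" + label + "','Interpreter','latex')")
```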

Does Apache Tika do character set conversion?

I'm using org.apache.tika.Tika.parseToString() to convert documents into plain text (i.e., unformatted text) files. My application potentially needs to convert documents that don't use a Unicode character set. For instance, some documents may be encoded in the Chinese GB2312 character set. It would be great if Tika re-coded the output into UTF-8. This would require Tika to reference a mapping between many different character sets and Unicode in order to convert the characters.
Does Tika convert the non-Unicode character set text into Unicode as the output of parseToString()? There are a lot of character sets out there so I would be impressed if Tika did this for more than a few character sets.
Update: I was able to create a couple different files with some non-Latin charsets (GB2312 (Chinese) and KOI8-R (Russian)). Tika.parseToString() couldn't even detect the charset or encoding. I opened an issue on the Tika bug tracker here: https://issues.apache.org/jira/browse/TIKA-1262
When talking about character sets in Apache Tika, you need to consider two kinds of files differently. One kind is basically just plain text; the other covers more complex types (including binary ones).
With the more complex files, Tika mostly uses third party libraries, and these libraries are responsible for returning Java Strings. The exact way of doing that will depend on the file format in question - sometimes the file format will include encoding information, other times it'll be fixed in what it supports. Either way, Tika gets Java Strings, and returns to you a Java String. How you choose to encode that for output is up to you. (For Windows users especially, check the encoding of your terminal, and the font used. There've been lots of "Tika Encoding Problems" which were actually people failing to correctly set the default Java encoding on output, or failing to have a Unicode capable terminal!)
With plain text files, there's no encoding information in the file, all we have is a bunch of bytes. Here, Apache Tika uses one of a number of EncodingDetector instances to do the detection. These use hints, n-grams, language detection etc, to try to work out the most likely encoding of the file based on information given, pattern of bytes in the file etc.
The definition of EncodingDetector is held in the Tika-Core jar, but most of the implementations are held in the Tika-Parsers jar (and loaded by the service loader method, just like Detectors and Parsers). The main ones are here in SVN. If you check there, you'll see the main list of encodings that Tika can detect.
One final thing - the encoding detection is only performed on files that are text files, it isn't done on the binary type files. Depending on how you call Tika, you might need to tweak that and/or provide a hint that it's a text file, so that the EncodingDetector logic gets triggered.
This answer actually comes from a JIRA user on the Tika project. https://issues.apache.org/jira/browse/TIKA-1262
It turns out that if you tell Tika that the file extension is '.txt' it will treat the file as plain text, attempt to detect the encoding, and convert it to UTF.
An easy way to do this is to pass an empty Metadata object to TikaInputStream.get(). This will fill out the resourceName field of the Metadata object. Then pass this object to parseToString(). With the resourceName field set to a file name that ends with .txt, the parser knows to treat this file as plain text and will run encoding detection to work out how to decode the file. The string returned from parseToString() is a Java UTF-16 String object. When written to a file you can see that it is Unicode and uses the UCS charset.
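Tika tika = new Tika();
// Start with empty metadata; TikaInputStream.get() fills in the resource name from the file.
Metadata metadata = new Metadata();
TikaInputStream reader = TikaInputStream.get(new File(filepath), metadata);
// With a resource name ending in .txt, Tika treats the input as plain text and detects its encoding.
String contents = tika.parseToString(reader, metadata);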
So far this has worked for text files using GB2312/GB18030 and KOI8-R. This is the expected behavior and it's perfect! I don't know what other charsets/encodings it can handle.
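The decode-then-encode step that happens once the source encoding is known can be sketched outside of Tika. The example below uses only the Python standard library, does no detection at all, and assumes the source encoding has already been identified; transcode_to_utf8 is a hypothetical helper, not part of any Tika API.

```python
def transcode_to_utf8(raw, source_encoding):
    """Decode bytes from a known legacy encoding and re-encode them as UTF-8."""
    text = raw.decode(source_encoding)  # bytes -> encoding-independent string
    return text.encode("utf-8")         # string -> UTF-8 bytes


# Stand-in for the contents of a GB2312-encoded text file.
gb_bytes = "中文".encode("gb2312")
utf8_bytes = transcode_to_utf8(gb_bytes, "gb2312")
print(utf8_bytes.decode("utf-8"))
```

The intermediate string plays the same role as the Java String returned by parseToString(): it has no byte encoding of its own until it is written out.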

How to discover what codepage to use when converting RTF hex literals to Unicode

I'm parsing RTF 1.5+ files generated by Word 2003+ that may have content from other languages. This content is usually encoded as hex literals (\'xx). I would like to convert these literals to unicode values.
I know my document's code page by looking for ansicpg (\ansi\ansicpg1252).
When I use the ansicpg codepage to decode to Unicode, many languages (like French) seem to convert to the Unicode char values that I expect.
However, when I see Russian text (like below), codepage 1252 decodes the content to gibberish.
\f277\lang1049\langfe1033\langnp1049\insrsid5989826\charrsid6817286
\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb \'e1\'e5\'e7 \'ed\'e0\'e7\'e2\'e0\'ed\'e8\'ff. \'dd\'f2
\'e0 \'f1\'f2\'f0\'e0\'ed\'e8\'f6\'e0 \'ed\'e5 \'e4\'ee\'eb\'e6\'ed\'e0
\'ee\'f2\'ee\'e1\'f0\'e0\'e6\'e0\'f2\'fc\'f1\'ff \'e2 \'f2\'e0\'e1\'eb\'e8\'f6\'e5
\'e2 \'f1\'ee\'e4\'e5\'f0\'e6\'e0\'ed\'e8\'e8.
I assume that lang1049, langfe1033, langnp1049 should provide me clues so I can programmatically choose a different (non-default) code page for the text that they reference? If so, where can I find information that explains how to map a lang* code to a codepage? Or should I be looking for some other RTF command/directive to provide me with the information I'm looking for? (Or must I use \f277 as a font reference and see if it has an associated codepage?)
\lang really only marks up particular stretches of the text as being in a particular language, and shouldn't impact what code page is to be used for the old non-Unicode \' escapes.
Putting an \ansicpg token in the header should perhaps do it, but seems to be ignored by Word (for both raw bytes and \' escapes).
Or must I use \f277 as a font reference and see if it has an associated codepage?
It looks that way. Changing the \fcharset of the font assigned to a particular stretch of text is the only way I can get Word to change how it treats the bytes, anyway. The codes in this token (see eg here for list) are, aggravatingly, different again from either the language ID or the code page number.
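Once the \fcharset of the font is known, decoding the \' escapes is mechanical. The sketch below is in Python and makes two assumptions that the question does not confirm: that font \f277 is declared with \fcharset204 (the Russian charset) in the font table, and that a partial charset table is enough for illustration. The names FCHARSET_TO_CODEC and decode_hex_escapes are hypothetical.

```python
import re

# Partial \fcharset -> Python codec table; extend it from the RTF specification.
FCHARSET_TO_CODEC = {
    0: "cp1252",    # ANSI
    161: "cp1253",  # Greek
    162: "cp1254",  # Turkish
    204: "cp1251",  # Russian
    238: "cp1250",  # Eastern European
}

# A run of consecutive \'xx escapes; decoding whole runs keeps multi-byte codecs intact.
HEX_RUN = re.compile(r"(?:\\'[0-9a-fA-F]{2})+")


def decode_hex_escapes(rtf_text, fcharset):
    r"""Replace each run of \'xx escapes with text decoded via the font's charset."""
    codec = FCHARSET_TO_CODEC[fcharset]

    def decode_run(match):
        raw = bytes.fromhex(match.group(0).replace("\\'", ""))
        return raw.decode(codec)

    return HEX_RUN.sub(decode_run, rtf_text)


# Assumes \f277 maps to \fcharset204; check the document's font table first.
print(decode_hex_escapes(r"\'d1\'f2\'f0\'e0\'ed\'e8\'f6\'fb \'e1\'e5\'e7", 204))
```

A real parser would track the current font per group and look the charset up from the font table rather than taking it as an argument.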
It is not entirely clear, but according to MSDN you can use the RichEdit control to convert the RTF to UTF-8:
http://msdn.microsoft.com/en-us/library/windows/desktop/bb774304(v=vs.85).aspx
Take a look at SF_USECODEPAGE for the EM_STREAMOUT message.