How to generate a PDF with PDFBox that contains Latin/Asian/Arabic characters on the same page? - unicode

I have been trying for a few days to produce a new PDF file with PDFBox from a data extraction whose values need different fonts. Most of the text is in Latin characters, but some names in my list of strings are in Chinese, Cyrillic, etc.
I have spent a lot of time and energy on Google and Stack Overflow but still can't manage to produce it (glyph issues).
Currently I am on Windows, but the code will be deployed on Linux, and I use PDFBox version 2.0.26 or 3.0.0-RC1.
I manage to load a TTF like this:
PDType0Font.load(doc, File("src/main/resources/font/LiberationSans-Regular.ttf").inputStream(), false)
If I set the embedSubset flag to true instead, I get a CMap issue.
I also tried to load TTC files but failed each time.
I have already started to implement this solution (link), but I can't manage to initialize/load my font correctly.
Do you have any idea how to do it?
Best, Mat

Related

Why can't Google Dataprep handle the encoding in my log files?

We receive big log files each month. Before loading them into Google BigQuery they need to be converted from fixed width to delimited. I found a good article on how to do that in Google Dataprep. However, there seems to be something wrong with the encoding.
Each time a Swedish character appears in the log file, the Split function seems to add another space. This messes up the rest of the columns, as can be seen in the attached screenshot.
I can't determine the correct encoding of the log files, but I know they are created by pretty old Windows servers in Poland.
Can anyone advise on how to solve this?
Screenshot of the issue in Google Dataprep.
What is the exact recipe you are using? Do you use (split every x)?
When I used an ISO Latin-1 text in a test case and ingested it as ISO 8859-1, the output was as expected and only the display was off.
Can you try the same?
Would it be possible to share an example input file with one or two rows?
As a workaround you can use a regex, which should work.
It's unfortunately a bit more complex, because you would have to use multiple regex splits. Here's an example for the first two splits after 10 characters each: /.{10}/, then split on //.
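The byte-versus-character confusion behind the extra space can also be checked outside Dataprep. Below is a minimal Python sketch of the fixed-width-to-delimited conversion, assuming the files come from an old Polish Windows server and are therefore in cp1250 - the code page and the field widths are illustrative guesses, not details from the thread. The key point is to decode the bytes first and slice the decoded text, so an accented character counts as one character instead of shifting the columns.

```python
# Hypothetical field widths, in characters (the real layout is unknown).
FIELD_WIDTHS = [10, 10, 5]

def fixed_width_to_delimited(raw: bytes, encoding: str = "cp1250",
                             delimiter: str = "|") -> str:
    """Decode first, then slice: slicing decoded text counts characters,
    so accented letters can no longer shift the column boundaries."""
    out_lines = []
    for line in raw.decode(encoding).splitlines():
        fields, pos = [], 0
        for width in FIELD_WIDTHS:
            fields.append(line[pos:pos + width].strip())
            pos += width
        out_lines.append(delimiter.join(fields))
    return "\n".join(out_lines)
```

For example, the cp1250 record "Göteborg  2023-01-01   42" comes out as "Göteborg|2023-01-01|42" even though "ö" is a non-ASCII byte in the input.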

Some non-ASCII characters appear as a thick vertical line in Talend 5.6

I have a file named "→Ψjohn.txt" and I want to remove those special characters from the file name, updating it to "john.txt". But Talend renders those characters as a thick vertical line, so it does not recognize the source file in its physical location. Can anyone please suggest a solution?
I have this file in the database as well as in a physical location, and when I read the file from the database it has to remove the special characters and update the name in both the database and the physical location.
In the database the file name looks like this:
(database screenshot)
When I read it from the database using Talend, it looks like the following:
(talend screenshot)
Thanks in advance
You don't have to specify the exact file name: you can use tFileList to get all files in a specific directory, and you can also use a regex to match names, for example to iterate over all *john.txt.
Once you have the actual filename, use a regex to remove the unwanted characters (for example \W for non-word characters) and rename the file through a system command or with tFileCopy.
After seeing the pictures, this is not a Talend issue but has to do with the font used in the Talend (Eclipse) console and Java's encoding setting.
Those rectangles (better visible with a bigger font size) show that the font cannot represent your characters - it has no glyphs for them.
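To illustrate the regex step outside Talend (say, in a quick pre-processing script), here is a small Python sketch - the file paths are hypothetical. Note that a bare \W would also strip the dot in "john.txt", so the character class below keeps ASCII word characters and dots, and re.ASCII is needed so that non-ASCII letters such as Ψ count as unwanted.

```python
import os
import re

def cleaned_name(filename: str) -> str:
    """Remove everything except ASCII word characters and dots.
    With re.ASCII, \\w matches only [a-zA-Z0-9_], so both the arrow
    '→' and the Greek letter 'Ψ' are stripped."""
    return re.sub(r"[^\w.]", "", filename, flags=re.ASCII)

def rename_clean(path: str) -> str:
    """Rename a file on disk to its cleaned name; returns the new path."""
    directory, name = os.path.split(path)
    new_path = os.path.join(directory, cleaned_name(name))
    if new_path != path:
        os.rename(path, new_path)
    return new_path
```

With this, cleaned_name("→Ψjohn.txt") returns "john.txt".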
Talend (Eclipse) settings
In Talend, navigate to Window / Preferences and choose General / Appearance / Colors and Fonts (as described in the Eclipse help). Check which font you are using for Debug / Console font. I've had good results with Consolas, which is the default from Talend version 6 onwards; before that it was Courier New.
Java encoding
You should check that Java uses UTF-8 to display the characters: the console encoding has to be set to UTF-8. See this answer for an explanation of how to do this.
Alternative
Alternatively, you could write all the log data to a file and open that file in e.g. Notepad++ to check whether the output is generated correctly and only displayed wrong.

Decoding Korean text files from the 90s

I have a collection of .html files created in the mid-90s which include a significant amount of Korean text. The HTML lacks character set metadata, so of course none of the Korean text renders properly anymore. The following examples all use the same excerpt of text.
In text editors such as Coda and Text Wrangler the text displays as
╙╦ ╝№бя└К ▓щ╥НВь╕цль▒Ф ▓щ╥НВь╕цль▒Ф
Which, in the absence of character set metadata in <head>, is rendered by the browser as:
ÓË ¼ü¡ïÀŠ ²éÒ‚ì¸æ«ì±” ²éÒ‚ì¸æ«ì±”
Adding euc-kr metadata to <head>:
<meta http-equiv="Content-Type" content="text/html; charset=euc-kr">
Yields the following, which is illegible nonsense (verified by a native speaker):
沓 숩∽핅 꿴�귥멩レ콛 꿴�귥멩レ콛
I have tried this approach with all historic Korean character sets, each yielding similarly unsuccessful results. I also tried parsing and upgrading to UTF-8, via Beautiful Soup, which also failed.
Viewing the files in Emacs seems promising, as it reveals the text encoding at a lower level. The following is the same sample of text:
\323\313 \274\374\241\357\300\212
\262\351\322\215\202\354\270\346\253\354\261\224 \262\351\322\215\202\354\270\346\253\354\261\224
How can I identify this text encoding and promote it to UTF-8?
All of those octal codes that Emacs revealed are below \376 octal (254), so it looks like one of those old pre-Unicode fonts that just used its own mapping in the extended ASCII range. If this is right, you'll have to figure out what font the text was intended for, find it, and perhaps do the conversion yourself.
It's a pain. Many years ago I did something similar for some popular pre-Unicode Greek fonts: http://litot.es/unicode-converter/ (the code: https://github.com/seanredmond/Encoding-Converter)
In the end, it is about finding the correct character encoding and using iconv.
iconv --list
displays all available encodings. Grepping for "KR" reveals that at least my system can do CSEUCKR, CSISO2022KR, EUC-KR, ISO-2022-KR and ISO646-KR. According to Wikipedia, CSKSC5636 and KSC5636 are also Korean. Try them all until something reasonable pops out.
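The same trial-and-error loop can be run with Python's standard codecs instead of iconv. A minimal sketch: the byte string below is the sample from the question's Emacs dump (octal escapes rewritten in hex), and the candidate list is my guess based on the codecs Python ships. A successful decode only means the bytes are valid in that codec - a human reader still has to judge whether the result is real Korean.

```python
# The question's sample as shown by Emacs (octal \323\313 ... -> hex).
SAMPLE = b"\xd3\xcb \xbc\xfc\xa1\xef\xc0\x8a"

# Candidate Korean codecs available in Python's stdlib.
CANDIDATES = ("euc-kr", "cp949", "iso2022_kr", "johab")

def try_decodings(raw: bytes, encodings=CANDIDATES):
    """Return {encoding: decoded text} for every codec that accepts
    the bytes without error; codecs that reject them are skipped."""
    results = {}
    for enc in encodings:
        try:
            results[enc] = raw.decode(enc)
        except (UnicodeDecodeError, LookupError):
            pass
    return results
```

As a sanity check, the known EUC-KR bytes b"\xbe\xc8\xb3\xe7" decode to 안녕 ("hello"), so the loop itself works; whether any codec produces sense from SAMPLE is exactly what the thread is trying to find out.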
Even though this thread is old, it's still an issue, and not having found a way to convert the files in bulk (outside of using a Korean version of Windows 7), I'm now using Naver, which has a cloud service like Google Docs; if you upload those weirdly encoded files there, it deals with them very well. I just edit and copy the text, and it's back to standard when I paste it elsewhere.
Not the kind of solution I like, but it might save a few passers-by.
By the way, you can register for the cloud account with an ID even if you don't live in South Korea; there's some minimal English to get by.

Where to get a reference image for any unicode code point?

I am looking for an online service (or collection of images) that can return an image for any unicode code point.
Unicode.org does not have an image for each one, consider for example
http://www.unicode.org/cgi-bin/GetUnihanData.pl?codepoint=31cf
EDIT: I need to use these images programmatically, so the code chart PDFs provided at unicode.org are not useful.
The images in the PDF are copyrighted, so there are legal issues around extracting them. (I am not a lawyer.) I suspect that those legal issues prevent a simple solution from being provided, unless someone wants to go to the trouble of drawing all of those images. It might happen, but seems unlikely.
Your best bet is to download a selection of fonts that collectively cover the entire range of characters, and display the characters using those fonts. There are two difficulties with this approach: combining characters and invisible characters.
The combining characters can easily be detected from the Unicode database, and you can supply a base character (such as NBSP) to use for displaying them. (There is a special code point intended for this purpose, but I can't find it at the moment.)
Invisible characters could be displayed with a dotted square box containing the abbreviation for the character. Those you may have to locate manually and construct the necessary abbreviations. I am not aware of any shortcuts for that.
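Both problem classes can be detected programmatically. Here is a sketch using Python's stdlib unicodedata module; the dotted-circle base U+25CC is the glyph fonts conventionally place under combining marks (it may be the special code point alluded to above, though that is my guess), and the set of "invisible" categories checked is a rough heuristic, not an exhaustive rule.

```python
import unicodedata

DOTTED_CIRCLE = "\u25CC"  # conventional base glyph for combining marks

def display_form(cp: int) -> str:
    """Return a string suitable for rendering code point `cp` on its own.
    Combining marks get attached to a dotted circle; invisible characters
    fall back to their Unicode name as a textual placeholder."""
    ch = chr(cp)
    if unicodedata.combining(ch):                       # nonzero => combining mark
        return DOTTED_CIRCLE + ch
    if unicodedata.category(ch) in ("Cf", "Cc", "Zs"):  # format/control/space
        return unicodedata.name(ch, f"U+{cp:04X}")
    return ch
```

For instance, display_form(0x0301) yields the dotted circle with a combining acute accent on it, while display_form(0x200B) yields the placeholder text "ZERO WIDTH SPACE".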

Input utf-8 characters in management studio

Hi,
[background]
We currently build files for many different companies. Our job as a company is basically to sit between other companies and help with communication and data storage. We have begun to run into encoding issues where we receive data encoded in one format but need to send it out in another. All files were previously built using the .NET Framework default of UTF-8. However, we've discovered that certain companies cannot read UTF-8 files; I assume they have older systems that require something else. This becomes apparent when sending certain French characters in particular.
I have a solution in place where we can build a specific file for a specific member using a specific encoding. (While I understand that this may not be enough, unfortunately this is as far as I can go at the moment due to other issues.)
[problem]
Anyway, I'm at the testing stage and I want to input UTF-8 or other characters into Management Studio, perform an update on some data, and then verify that the file is built correctly from that data. I realize that this is not perfect; I've already tried programmatically reading the file and verifying the encoding by reading preambles etc., so this is what I'm stuck with. According to this website http://www.biega.com/special-char.html I can input characters by typing ALT+&+#+"decimal representation of character" or ALT+"decimal representation of character", but when I use the values from the table I get completely different characters in Management Studio. I've even saved the file in UTF-8 format from Management Studio by clicking the arrow on the Save button in the save dialog and specifying the encoding. So my question is: how can I accurately specify a character so that it ends up being the character I'm trying to input, and actually gets into the data that will then be put in a file?
Thanks,
Kevin
I eventually found the solution. The website doesn't specify that you need to type ALT+0+"decimal character representation" - the zero was left out. I'd been searching for this for ages.
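The reason the zero matters: on Windows, ALT+digits without a leading zero is interpreted in the legacy OEM code page (cp437 on US systems), while ALT+0+digits uses the ANSI code page (cp1252 on Western systems). A small Python sketch of the two mappings - the code pages are the common US/Western defaults, and your machine's may differ:

```python
def alt_code(decimal: int, leading_zero: bool) -> str:
    """Simulate what a Windows ALT code types for a single-byte value.
    ALT+0+digits -> ANSI code page (cp1252 on Western systems);
    ALT+digits   -> OEM code page (cp437 on US systems)."""
    codepage = "cp1252" if leading_zero else "cp437"
    return bytes([decimal]).decode(codepage)
```

So alt_code(233, leading_zero=True) gives "é", but alt_code(233, leading_zero=False) gives "Θ" - exactly the "completely different characters" the question describes.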