FPDF UTF-8 encoding (HOW-TO)

Does anybody know how to set the encoding in the FPDF package to UTF-8? Or at least to ISO-8859-7 (Greek), which supports Greek characters?
Basically, I want to create a PDF file containing Greek characters.
Any suggestions would help.
George

Don't use UTF-8 encoding. Standard FPDF fonts use ISO-8859-1 or Windows-1252. It is possible to perform a conversion to ISO-8859-1 with utf8_decode():
$str = utf8_decode($str);
But some characters, such as the Euro sign, won't be translated correctly. If the iconv extension is available, the right way to do it is the following:
$str = iconv('UTF-8', 'windows-1252', $str);
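To illustrate, here is a minimal end-to-end sketch of that approach (the font, sizes, and sample string are arbitrary):
<?php
require('fpdf.php');

$pdf = new FPDF();
$pdf->AddPage();
// Core fonts such as Arial expect windows-1252, not UTF-8
$pdf->SetFont('Arial', '', 12);

// Incoming data is UTF-8 (e.g. from a form or database)
$utf8 = 'Crème brûlée: 5 €';
$pdf->Cell(0, 10, iconv('UTF-8', 'windows-1252', $utf8));
$pdf->Output();
?>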

There is also an official UTF-8 version of FPDF called tFPDF: http://www.fpdf.org/en/script/script92.php
You can easily switch from the original FPDF; just make sure you also use a Unicode font, as shown in the example in the above link or in my code:
<?php
//this is a UTF-8 file, we won't need any encode/decode/iconv workarounds
//define the path to the .ttf files you want to use
define('FPDF_FONTPATH',"../fonts/");
require('tfpdf.php');
$pdf = new tFPDF();
$pdf->AddPage();
// Add Unicode fonts (.ttf files)
$fontName = 'Helvetica';
$pdf->AddFont($fontName,'','HelveticaNeue LightCond.ttf',true);
$pdf->AddFont($fontName,'B','HelveticaNeue MediumCond.ttf',true);
//now use the Unicode font in bold
$pdf->SetFont($fontName,'B',12);
//anything else is identical to the old FPDF, just use Write(),Cell(),MultiCell()...
//without any encoding trouble
$pdf->Cell(100,20, "Some UTF-8 String");
//...
?>
I think it's much more elegant to use this instead of spamming utf8_decode() everywhere, and the ability to use .ttf files directly in AddFont() is an upside too.
Any other answer here is just a way to avoid or work around the problem, and avoiding UTF-8 is no real option for an up-to-date project.
There are also alternatives like mPDF or TCPDF (and others) which are based on FPDF but offer advanced functions, have UTF-8 support, and can interpret HTML code (limited, of course, as there is no direct way to convert HTML to PDF).
Most of the FPDF code can be used directly in those libraries, so it's pretty easy to migrate; a minimal mPDF sketch follows the links below.
https://github.com/mpdf/mpdf
http://www.tcpdf.org/
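For instance, a minimal mPDF sketch might look like this (assuming mPDF 7+ installed via Composer; the class and method names are from the mPDF documentation):
<?php
require_once __DIR__ . '/vendor/autoload.php';

// mPDF bundles Unicode fonts and accepts UTF-8 input directly
$mpdf = new \Mpdf\Mpdf();
$mpdf->WriteHTML('<h1>Ελληνικά</h1><p>Greek text, no iconv calls needed.</p>');
$mpdf->Output();
?>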

There is a really simple solution for this problem.
In the file fpdf.php, go to the line that says:
if($txt!=='')
{
It is line 648 in my version of FPDF. Just above the line
if($align=='R')
insert the following line of code:
$txt = iconv('utf-8', 'cp1252', $txt);
This works for all German special characters. It will not cover Greek, though, since cp1252 contains no Greek letters; simply replace cp1252 with the code page your alphabet requires. You can see all characters supported by cp1252 here: http://en.wikipedia.org/wiki/Windows-1252
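For Greek, the inserted line would presumably become (cp1253 being the Windows Greek code page):
$txt = iconv('utf-8', 'cp1253', $txt);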
I saw the solution here: http://fudforum.org/forum/index.php?t=msg&goto=167345
Please use my example code above, as the original author forgot to insert a dash between utf and 8.
Hope the above was helpful.
Daan

You need to generate a font first, using the MakeFont utility included in the FPDF package. On Linux I used this slightly extended version of the script from the demo:
<?php
// Generation of font definition files (extended from the tutorial 7 demo)
require('../makefont/makefont.php');
$dir = opendir('/usr/share/fonts/truetype/ttf-dejavu/');
while (($relativeName = readdir($dir)) !== false) {
    if ($relativeName == '..' || $relativeName == '.')
        continue;
    MakeFont("/usr/share/fonts/truetype/ttf-dejavu/$relativeName", 'ISO-8859-2');
}
?>
Then I copied the generated files to the font directory of my web app and used this:
$pdf->Cell(80,70, iconv('UTF-8', 'ISO-8859-2', 'Buňka jedna'),1);
(I was working on a table.) That worked for my language ("Buňka jedna" is Czech for "cell one"). Czech belongs to the Central European languages, i.e. ISO-8859-2. Regrettably, the FPDF user is forced to lose the advantages of UTF-8 encoding; you cannot get this in your PDF:
Městečko Fruens Bøge
Danish letter ø becomes ř in ISO-8859-2.
Suggested solution: get a Greek font, generate the font files using the proper encoding (ISO-8859-7), and use iconv with the same target encoding the font was generated with.
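Applied to the Greek case, that suggestion might look roughly like this (the font file name is hypothetical; MakeFont is the utility shipped with FPDF, as used above):
<?php
// One-off step: generate the font definition files with Greek encoding
require('../makefont/makefont.php');
MakeFont('/path/to/SomeGreekFont.ttf', 'ISO-8859-7');

// Then, after copying the generated files to FPDF's font directory:
require('fpdf.php');
$pdf = new FPDF();
$pdf->AddPage();
$pdf->AddFont('SomeGreekFont', '', 'SomeGreekFont.php');
$pdf->SetFont('SomeGreekFont', '', 12);
// Convert with the same target encoding the font was generated with
$pdf->Cell(80, 10, iconv('UTF-8', 'ISO-8859-7', 'Ελληνικό κείμενο'), 1);
$pdf->Output();
?>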

How do I create PDFs in FPDF that support Chinese, Japanese, Russian, etc.?
(snapshots of code in use below)
I'd like to provide: a summary of the problem, the solution, a github project with the working code, and an online example with the expected, resultant PDF.
The Problem:
As stated by Tarsis, swap FPDF for tFPDF.
You actually need a font that supports the UTF-8 characters you are using.
That is, merely using Helvetica and trying to display Japanese will not work. If you use FontForge, or some other font tool, you can scroll to the Chinese characters of the font and see that they are blank.
Google has a font (the Noto font) that aims to cover all languages, and it is about 20 MB, which is usually many times the size of your text. So you can see why many fonts simply won't cover every single language.
The Solution:
I'm using rounded-mgenplus-20140828.ttf and ZCOOL_QingKe_HuangYou.ttf font packs for Japanese and Chinese, which are open source and can be found in many open source projects. In tFPDF itself, or a new inheriting class of it, like class HTMLtoPDF extends tFPDF {...}, you'll do this...
$this->AddFont('japanese', '', 'rounded-mgenplus-20140828.ttf', true);
$this->SetFont('japanese', '', 14);
$this->Write(14, '日本語');
Should be nothing more to it!
Code Package on GitHub:
https://github.com/HoldOffHunger/php-html-to-pdf
Working, Online Demo of Japanese:
https://www.earthfluent.com/privacy.pdf?language=ja

This answer didn't work for me; I needed to run HTML entity decoding on the string as well:
iconv('UTF-8', 'windows-1252', html_entity_decode($str));
Props go to emfi's answer at "html_entity_decode in FPDF (using tFPDF extension)".

Just edit the Cell function in the fpdf.php file. Look for the line that looks like this:
function Cell($w, $h=0, $txt='', $border=0, $ln=0, $align='', $fill=false, $link='')
{
After finding that line, add the following right after the opening brace:
$txt = utf8_decode($txt);
Save the file and you're done: accents and UTF-8 encoding will work. :)
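For reference, the start of the patched function would then look roughly like this (the rest of Cell() stays untouched; exact line positions vary between FPDF versions):
function Cell($w, $h=0, $txt='', $border=0, $ln=0, $align='', $fill=false, $link='')
{
    // added: convert incoming UTF-8 to the single-byte encoding of the core fonts
    $txt = utf8_decode($txt);
    // ... original body of Cell() continues unchanged ...
}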

There is an extension of FPDF called mPDF that allows Unicode fonts.
http://www.mpdf1.com/mpdf/index.php

None of the above solutions are going to work.
Try this:
function filter_html($value){
    $value = mb_convert_encoding($value, 'ISO-8859-1', 'UTF-8');
    return $value;
}

You can make a class to extend FPDF and add this:
class utfFPDF extends FPDF {

    function Cell($w, $h=0, $txt="", $border=0, $ln=0, $align='', $fill=false, $link='')
    {
        if (!empty($txt)){
            if (mb_detect_encoding($txt, 'UTF-8', false)){
                $txt = iconv('UTF-8', 'ISO-8859-5', $txt);
            }
        }
        parent::Cell($w, $h, $txt, $border, $ln, $align, $fill, $link);
    }
}
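Usage is then the same as with plain FPDF (note that ISO-8859-5 in the class above covers Cyrillic; swap in the code page for your own alphabet):
$pdf = new utfFPDF();
$pdf->AddPage();
$pdf->SetFont('Arial', '', 12);
$pdf->Cell(100, 10, "Привет мир"); // converted to ISO-8859-5 inside Cell()
$pdf->Output();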

I wanted to answer this for anyone who hasn't switched over to tFPDF for whatever reason (framework integration, etc.).
Go to: http://www.fpdf.org/makefont/index.php
Use a .ttf compatible font for the language you want to use. Make sure to choose the encoding number that is correct for your language. Download the files and paste them in your current FPDF font directory.
Use this to activate the new font: $pdf->AddFont($font_name,'','Your_Font_Here.php');
Then you can use $pdf->SetFont normally.
On the text itself, use iconv to convert from UTF-8 to the font's encoding. So if, for example, you're using Hebrew, you would do iconv('UTF-8', 'windows-1255', $first_name).
Substitute the Windows code page for your language's encoding.
For right to left, a quick fix is doing something like strrev(iconv('UTF-8', 'windows-1255', $first_name)).
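Wrapped in a small helper, that quick fix might look like this (a sketch; strrev() on the single-byte windows-1255 output reverses character order, which only approximates proper bidirectional shaping):
// Sketch: UTF-8 Hebrew -> windows-1255, visually reversed for RTL display
function hebrew_for_fpdf($utf8)
{
    return strrev(iconv('UTF-8', 'windows-1255', $utf8));
}

$pdf->Cell(50, 10, hebrew_for_fpdf($first_name));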

You can apply this function to your text:
$yourtext = iconv('UTF-8', 'windows-1252', $yourtext);
Thanks

Like many said here:
$yourtext = iconv('UTF-8', 'windows-1252', $yourtext);
BUT with '//IGNORE' appended after the windows-1252 (or, in my case, CP1252) target, like this:
iconv("UTF-8", "CP1252//IGNORE", $row['project_name'])
This one worked for me; I hope it works for you!
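iconv also supports a //TRANSLIT suffix, which tries to approximate unmappable characters instead of dropping them (the exact replacements are implementation-dependent); a quick comparison with an arbitrary sample string:
$s = 'Œuvre: 5 ₺'; // ₺ has no CP1252 equivalent
echo iconv('UTF-8', 'CP1252//TRANSLIT', $s); // substitutes something close where it can
echo iconv('UTF-8', 'CP1252//IGNORE', $s);   // silently drops it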

Not sure if it will work for Greek, but I had the same issue with Brazilian Portuguese characters, and my solution was to use HTML entities. I had basically two cases:
1. Strings that may contain UTF-8 characters. For these, I first encoded the text to HTML entities with htmlentities() and then decoded them to ISO-8859-1. Example:
$s = html_entity_decode(htmlentities($my_variable_text), ENT_COMPAT | ENT_HTML401, 'iso-8859-1');
2. Fixed strings with HTML entities. For these, I just left the htmlentities() call out. Example:
$s = html_entity_decode("Treasurer/Trésorier", ENT_COMPAT | ENT_HTML401, 'iso-8859-1');
Then I passed $s to FPDF, like in this example:
$pdf->Cell(100, 20, $s, 0, 0, 'L');
Note: ENT_COMPAT | ENT_HTML401 is the default value for parameter #2, as documented at http://php.net/manual/en/function.html-entity-decode.php
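Wrapped as a helper, the first case might look like this (a sketch; the explicit 'UTF-8' argument to htmlentities() matters on PHP versions before 5.4, where the default charset was ISO-8859-1):
// Sketch: UTF-8 -> ISO-8859-1 by round-tripping through HTML entities
function to_latin1($utf8_text)
{
    $entities = htmlentities($utf8_text, ENT_COMPAT, 'UTF-8');
    return html_entity_decode($entities, ENT_COMPAT | ENT_HTML401, 'iso-8859-1');
}

$pdf->Cell(100, 20, to_latin1($my_variable_text), 0, 0, 'L');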
Hope that helps.

For posterity:
How I managed to add Russian text to FPDF on my Linux machine:
1) Go to http://www.fpdf.org/makefont/ and convert your TTF font (for example, ArialRegular.ttf) into two files using the ISO-8859-5 encoding: ArialRegular.php and ArialRegular.z
2) Put these two files into the fpdf/font directory
3) Use it in your code:
$pdf = new \FPDI();
$pdf->AddFont('ArialMT','','ArialRegular.php');
$pdf->AddPage();
$tplIdx = $pdf->importPage(1);
$pdf->useTemplate($tplIdx, 0, 0, 211, 297); //width and height in mms
$pdf->SetFont('ArialMT','',35);
$pdf->SetTextColor(255,0,0);
$fullName = iconv('UTF-8', 'ISO-8859-5', 'Алексей');
$pdf->SetXY(60, 54);
$pdf->Write(0, $fullName);

Instead of this iconv solution:
$str = iconv('UTF-8', 'windows-1252', $str);
You could do the same conversion with the mbstring extension instead (note the argument order: the target encoding comes second):
$str = mb_convert_encoding($str, "Windows-1252", "UTF-8");
See: How to convert Windows-1252 characters to values in php?
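If you are unsure which extension is installed, a small guard like this covers both (a sketch; the //TRANSLIT suffix is optional):
if (function_exists('iconv')) {
    $str = iconv('UTF-8', 'windows-1252//TRANSLIT', $str);
} else {
    $str = mb_convert_encoding($str, 'Windows-1252', 'UTF-8');
}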

There's an extension of FPDF called UFPDF:
http://acko.net/blog/ufpdf-unicode-utf-8-extension-for-fpdf/
But, IMHO, it's better to use mPDF if it's possible for you to change the class.

I use FPDF for ASP, and the iconv function is not available there.
It may seem strange, but I solved the UTF-8 problem by adding a fake image (a 1x1 px JPEG) to the PDF, just after the AddPage() call:
pdf.Image "images/fpdf.jpg",0,0,1
In this way, accented characters are correctly added to my PDF. Don't ask me why, but it works.

I know that this question is old, but I think my answer will help those who haven't found a solution in the other answers. My problem was that I couldn't display Croatian characters in my PDF. At first I used FPDF, which does not support Unicode. What finally solved my problem is tFPDF, a version of FPDF that supports Unicode. This is the example that worked for me:
require('tFPDF/tfpdf.php');
$pdf = new tFPDF();
$pdf->AddPage();
$pdf->AddFont('DejaVu','','DejaVuSansCondensed.ttf',true);
$pdf->AddFont('DejaVu', 'B', 'DejaVuSansCondensed-Bold.ttf', true);
$pdf->SetFont('DejaVu','',14);
$txt = 'čćžšđČĆŽŠĐ';
$pdf->Write(8,$txt);
$pdf->Output();

Related

How to make JRequest::getVar filter accented characters correctly?

I want to filter some variables with accented characters in a component for Joomla 1.5, for example:
$name = JRequest::getVar('name', '', 'post','WORD');
but the getVar function filters out áéíóú. I need these to come through intact for a form in Spanish.
I'm new to Joomla development, but as far as I can see, it doesn't let me set any other config parameter to get this.
Is there a way to do this while keeping the advantage of filtering with JRequest::getVar, or should I write a function myself which does so?
Do you mean JRequest::getVar() removes symbols like 'áéíóú'? That is very weird, because I've worked on Joomla with Danish and Hebrew symbols, and they were passed through GET, POST, and SESSION successfully: Joomla works with UTF-8 and understands such symbols. The problem could simply be your file encoding; the files should be in UTF-8. Is that so? If not, try changing it. This should help.
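If the files are already UTF-8 and the 'WORD' filter is still stripping the accents, one possible workaround is to request a less aggressive filter type; in Joomla 1.5, the 'string' filter should pass accented letters through where 'word' does not (an assumption based on the JFilterInput filter types):
// Assumption: Joomla 1.5 JRequest API; 'string' keeps áéíóú intact
$name = JRequest::getVar('name', '', 'post', 'string');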

Why am I unable to parse non-proportional text using CAM::PDF?

While parsing page no. 22 of http://sfdoccentral.symantec.com/sf/5.1/linux/pdf/vxfs_admin.pdf, I am able to parse all the words except mount_vxfs, as its encoding style and/or font is different from normal plain text.
Please find attached PDF Page for details.
Please find my code:
#!/usr/bin/perl
use CAM::PDF;

my $file_name = "vxfs_admin_51sp1_lin.pdf";
my $pdf = CAM::PDF->new($file_name);
my $no_pages = $pdf->numPages();
print "$no_pages\n";
for (my $i = 1; $i <= $no_pages; $i++) {
    my $page = $pdf->getPageText($i);
    # for page no. 22 only:
    # if ($i == 22) {
    print $page;
    # }
}
PDF doesn't store the semantic text that you read but rather uses character codes which map to glyphs (the painted characters) in a particular font. Often, however, the code-glyph mapping matches common character sets (such as ISO-8859-1 or UTF-8) so that the codes are human-readable. That's the case for all of the text you have been able to parse, although sometimes the odd character, mostly punctuation, is also "wrong".
The text for "mount_vxfs" in your document is encoded completely differently, unfortunately, resulting in apparent garbage. If you're curious, you can see what's really there by substituting getPageText() with getPageContent() in your code.
In order to convert the PDF text back to meaningful characters, PDF readers have to jump through hoops with a number of conversion tables (including the so-called CMaps). Because this is a lot of programming work, many simpler libraries opt not to implement them. That's the case with CAM::PDF.
If you're just interested in parsing the text (not editing it), the following technique is something I use with success:
Obtain xpdf (http://foolabs.com/xpdf) or Poppler (http://poppler.freedesktop.org/). Poppler is a newer fork of xpdf. If you're using *nix, there will be a package available.
Use the command-line tool 'pdftotext' to extract the text from a file, either page-wise or all at once.
Example:
#!/usr/bin/perl
use English;

my $file_name = "vxfs_admin.pdf";
open my $text_fh, "/usr/bin/pdftotext -layout -q '$file_name' - 2>/dev/null |";
local $INPUT_RECORD_SEPARATOR = "\f"; # slurp a whole page at a time
while (my $page_text = <$text_fh>) {
    # this is here only for demo purposes
    print $page_text if $INPUT_LINE_NUMBER == 19;
}
close $text_fh;
(Note: The document I retrieved using your link is slightly different; the offending bit is on page 19 instead.)

How to draw Thai text to PDF file by using libharu library

I am using the free PDF library libharu to generate PDF files, but I have an encoding problem: I cannot draw Thai-language text on the PDF file; all the text shows as "???..".
Does somebody know how to fix it?
Thanks
I have succeeded in rendering ideographic text (not Thai, but Chinese and Japanese) using libharu. First of all, I used Unicode mode; please refer to the HPDF_UseUTFEncodings() function documentation.
For C language, here is a sequence of libharu API calls needed to overcome your trouble:
HPDF_UseUTFEncodings(docHandle);
HPDF_SetCurrentEncoder(docHandle, "UTF-8");
Here docHandle is a valid HPDF_Doc object.
The next part is loading and selecting a UTF-capable font properly:
const char * libFontName = HPDF_LoadTTFontFromFile(docHandle, fontFileName.c_str(), font_embed::EmbedFonts);
HPDF_Font font = HPDF_GetFont(docHandle, libFontName, "UTF-8");
After these calls you may render Unicode text containing Thai characters. Also note the embedding flag (3rd parameter of HPDF_LoadTTFontFromFile): your PDF file may be unreadable due to external font references. If you are not too worried about the output PDF's size, you can simply embed the fonts.
I've tested a couple of Thai .ttf fonts found via Google and they rendered OK. Also (it may be important, but I'm not sure), I'm using a fork of libharu, https://github.com/kdeforche/libharu, which has now been merged into the master branch.
When you write text to the PDF, use the correct font and encoding. The libharu documentation lists all the possibilities: https://github.com/libharu/libharu/wiki/Fonts
In your case, you must use the ISO8859-11 Thai (TIS 620-2569) character set.
An example (in Spanish; the string literal here is Objective-C):
HPDF_Font fontEn = HPDF_GetFont(pdf, "Helvetica-Bold", "ISO8859-2");
HPDF_Page_TextOut(page1, 50.00, 750.00, [@"Código para correcta codificación en libharu" cStringUsingEncoding:NSISOLatin1StringEncoding]);

How can I convert japanese characters to unicode in Perl?

Can you point me to a tool to convert Japanese characters to Unicode?
CPAN gives me Unicode::Japanese, which may be helpful to start with. You can also look at articles on character encodings in Perl and the Perl Unicode documentation for more information.
See http://p3rl.org/UNI.
use Encode qw(decode encode);
my $bytes_in_sjis_encoding = "\x88\xea\x93\xf1\x8e\x4f";
my $unicode_string = decode('Shift_JIS', $bytes_in_sjis_encoding); # returns 一二三
my $bytes_in_utf8_encoding = encode('UTF-8', $unicode_string); # returns "\xe4\xb8\x80\xe4\xba\x8c\xe4\xb8\x89"
For batch conversion from the command line, use piconv:
piconv -f Shift_JIS -t UTF-8 < infile > outfile
First, you need to find out the encoding of the source text if you don't know it already.
The most common encodings for Japanese are:
- euc-jp: often used on Unixes and some web pages, with greater kanji coverage than Shift-JIS
- shift-jis: Microsoft also added some extensions to Shift-JIS, called cp932, which is often used by non-Unicode Windows programs
- iso-2022-jp: a distant third
A common encoding conversion library for many languages is iconv (see http://en.wikipedia.org/wiki/Iconv and http://search.cpan.org/~mpiotr/Text-Iconv-1.7/Iconv.pm) which supports many other encodings as well as Japanese.
This question seems a bit vague to me; I'm not sure what you're asking. Usually you would use something like this:
open my $file, "<:encoding(cp932)", "JapaneseFile.txt";
to open a file with Japanese characters. Perl will then automatically convert it into its internal Unicode format.

How can I extract text from a PDF file in Perl?

I am trying to extract text from PDF files using Perl. I have been using pdftotext.exe from the command line (i.e., via Perl's system function), and this method works fine.
The problem is that we have symbols like α, β, and other special characters in the PDF files which are not being displayed in the generated txt file. Also, a few extra spaces are being added randomly in the text.
Is there a better and more reliable way to extract text from PDF files such that the text will include all the symbols like α, β, etc., and will exactly match the text in the PDF (i.e., without extra spaces)?
You can extract text from a PDF with these modules:
PDF::API2
CAM::PDF
CAM::PDF::PageText
From CPAN:
my $pdf = CAM::PDF->new($filename);
my $pageone_tree = $pdf->getPageContentTree(1);
print CAM::PDF::PageText->render($pageone_tree);
This module attempts to extract sequential text from a PDF page. This is not a robust process, as PDF text is graphically laid out in arbitrary order. This module uses a few heuristics to try to guess what text goes next to what other text, but may be fooled easily by, say, subscripts, non-horizontal text, changes in font, form fields etc.
All those disclaimers aside, it is useful for a quick dump of text from a simple PDF file.
You may never get an appropriate solution to your problem. The PDF format can encode text either as ASCII values with a font applied, or it can encode it as a bitmap. If the tool that created your PDF decided to encode the special characters as a bitmap, you will be out of luck (unless you want to get into OCR solutions, of course).
I'm not a Perl user, but I imagine you'll struggle to find a better free text extractor than pdftotext.
pdftotext usually recognises non-ASCII characters fine; is it possible it's extracting them OK, but the app you're using to view the text file isn't using the correct encoding? If pdftotext on Windows is the same as the one on my Linux system, then it defaults to exporting as UTF-8.
There is getpdftext.pl; part of CAM::PDF.
Well, I tried two or three Perl modules, like CAM::PDF and PDF::API2, but the problem remains the same. I'm parsing a PDF file containing main pages. CAM::PDF and PDF::API2 parse the plain text very well; however, they are not able to parse code snippets (code snippets are usually in a different font and encoding than the plain text).
James Healy is correct. After trying CAM::PDF and PDF::API2 (the former of which I've had some success reading text with), downloading pdftotext worked great for a number of my implementations.
If you're on Windows, go here and download the xpdf precompiled binary:
http://www.foolabs.com/xpdf/download.html
Then, if you need to run it from Perl, use system, e.g.:
system("C:/Utilities/xpdfbin-win-3.04/bin64/pdftotext.exe \"$saveName\"");
where $saveName is the full path to your PDF file.
This hopefully leaves you with a text file you can open and parse in perl.
I tried this module, which works fine for the special characters in a PDF:
#!/usr/bin/perl
use strict;
use warnings;
use PDF::OCR::Thorough;

my $filename = "pdf.pdf";
my $pdf = PDF::OCR::Thorough->new($filename);
my $text = $pdf->get_text();
print "$text";
Take a look at PDFBox. It is a library, but I think it also comes with a tool to do text extraction.