FPDF library not showing special characters like '✓' - fpdf

Hey, I'm trying to write special characters like '✓' with FPDF, but it doesn't work; only normal strings work.
I have a checkbox on the PDF and I'm trying to fill it with '✓'.
I tried it like this:
$value = iconv('UTF-8', 'windows-1255', html_entity_decode('✓'));
$pdf->Write(0, $value);
But when I open the PDF, the string is broken and doesn't match what I wrote.
Thanks

This character is not included in windows-1255. You can use the "ZapfDingbats" font and chr(51) or chr(52).
$pdf->SetFont('ZapfDingbats', '', 12);
$pdf->Write(0, chr(51));
See here for a font dump of all standard fonts.
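Put together, a minimal sketch of the whole flow (the setup around the two calls, such as new FPDF(), AddPage() and the Output() call, is assumed rather than taken from the question):

<?php
require 'fpdf.php'; // adjust the path to wherever FPDF lives

$pdf = new FPDF();
$pdf->AddPage();

// Label in a regular Latin core font
$pdf->SetFont('Helvetica', '', 12);
$pdf->Write(6, 'Checked: ');

// ZapfDingbats is one of the standard core fonts; chr(51) and chr(52) are check marks
$pdf->SetFont('ZapfDingbats', '', 12);
$pdf->Write(6, chr(52));

$pdf->Output();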

Related

How to remove quotes in my product description string?

I'm using OSCommerce for my online store and I'm currently optimizing my product page for rich snippets.
Some of my Google Indexed pages are being marked as "Failed" by Google due to double quotes in the description field.
I'm using existing code which strips the HTML and truncates anything beyond 197 characters.
<?php echo substr(trim(preg_replace('/\s\s+/', ' ', strip_tags($product_info['products_description']))), 0, 197); ?>
How can I include the removal of quotes in that code so that the following string:
<strong>This product is the perfect "fit"</strong>
becomes:
This product is the perfect fit
This happened to me too; try using:
tep_output_string($product_info['products_description'])
With it, " becomes &quot;.
We can try using preg_replace_callback here:
$input = "SOME TEXT HERE <strong>This product is the perfect \"fit\"</strong> SOME MORE TEXT HERE";
$output = preg_replace_callback(
    "/<([^>]+)>(.*?)<\/\\1>/",
    function ($m) {
        return str_replace("\"", "", $m[2]);
    },
    $input
);
echo $output;
This prints:
SOME TEXT HERE This product is the perfect fit SOME MORE TEXT HERE
The regex pattern used does the following:
<([^>]+)> matches an opening HTML tag and captures the tag name
(.*?) then matches and captures the content inside the tag
<\/\\1> finally matches the corresponding closing tag
Then, we use a callback function which does an additional replacement to strip off all double quotes.
Note that, in general, using regex against HTML is bad practice. But if your text only has single-level, occasional HTML tags, then the solution above might be viable.

Convert unicode to HTML entities function

I have the following function that converts unicode to HTML entities, but if I run the function again over the result it will not leave the HTML entities intact. How can I get the function to leave already converted HTML entities alone?
sub convert_unicode {
    use HTML::Entities;
    use Encode;

    my $str = shift;
    Encode::_utf8_off($str);
    return encode_entities(decode('utf8', $str));
}
What you're asking for is to be able to safely double character encode. Some encodings allow this. HTML character encoding does not because it uses certain characters like & to do the encoding and it cannot tell the difference between a special character being used for encoding and one that needs to be encoded.
For example...
use HTML::Entities;
use v5.10;
say encode_entities("&foo");
That produces &amp;foo. If we encode it again it produces &amp;amp;foo, because & is a special character which it faithfully encodes. It does not know that &amp; is an already encoded &, so it treats it as a literal & and encodes it again.
You could write your own custom HTML encoding function that assumes &xxx; (and its variants) are already encoded, but that's just a guess. You can't actually tell a literal &foo; and an encoded &foo; apart. It will break with, for example, old school Perl code like &function;. Maybe you can be super clever and use an array of objects to indicate which parts are encoded and have the whole thing overload stringification so it looks like a string, and so long as everything carefully preserves that object that looks like a string it'll work...
And now we're into the lava flow anti-pattern where rather than fixing bad design, more complex and bad design is layered on top of it. Trying to "fix" that will just create more problems. The real problem lies deeper.
The real problem is that you're encoding multiple times. This probably means you've welded your formatting and your functionality together. For example...
sub get_user_name {
    my $uid = shift;
    my $name = ...do a bunch of work to get the user name...
    return encode_entities($name);
}
By HTML encoding the data, a function like this makes assumptions about how the data is going to be used. It limits its use to just HTML. If all your functions do this, you have a double encoding problem.
Then maybe you have something like this:
sub do_something {
    my $uid = shift;

    # $name is already HTML encoded.
    my $name = get_user_name($uid);

    my $stuff = ...something incorporating $name...

    # Whoops, the user name is double encoded.
    return encode_entities($stuff);
}
The answer is to leave the HTML formatting and encoding until the last minute. Ideally don't do it at all, just work with data and let an HTML template system take care of it. Template Toolkit, for example.
This also provides a clean separation between the formatting and the code, so now non-programmers can work on the formatting using a documented template system.

How can I save Perl/Expect output that contains mixed ASCII content?

I have a Perl script that uses the Expect library to log in to a remote system. I'm getting the final output of the interaction with the before method:
$exp->before();
I'm saving this to a text file. When I use cat on the file it outputs fine in the terminal, but when I open the text file in an editor or try to process it, the formatting is bizarre:
[H[2J[1;19HCIRCULATION ACTIVITY by TERMINAL (Nov 6,14)[11;1H
Is there a better way to save the output?
When I run enca it's identified as:
7bit ASCII characters
Surrounded by/intermixed with non-text data
You can remove the non-ASCII chars:
$str1 =~ s/[^[:ascii:]]//g;
print "$str1\n";
I was able to remove the ANSI escape codes from my output by using the Text::ANSI::Util library's ta_strip() function:
my $ansi_string = $exp->before();
my $clean_string = ta_strip($ansi_string);

preg_match a keyword variable against a list of Latin and non-Latin keywords in a local UTF-8 encoded file

I have a bad words filter that uses a list of keywords saved in a local UTF-8 encoded file. This file includes both Latin and non-Latin chars (mostly English and Arabic). Everything works as expected with Latin keywords, but when the variable includes non-Latin chars, the matching does not seem to recognize these existing keywords.
How do I go about matching both Latin and non-Latin keywords?
The badwords.txt file includes one word per line as in this example
bad
nasty
racist
سفالة
وساخة
جنس
Code used for matching:
$badwords = file_get_contents("badwords.txt");
$badtemp = explode("\n", $badwords);
$badwords = array_unique($badtemp);
$hasBadword = 0;
$query = strtolower($query);
foreach ($badwords as $key => $val) {
    if (!empty($val)) {
        $val = trim($val);
        $regexp = "/\b" . $val . "\b/i";
        if (preg_match($regexp, $query))
            $badFlag = 1;
        if ($badFlag == 1) {
            // Bad word detected die...
        }
    }
}
I've read that iconv, the multibyte (mbstring) functions, and the /u modifier might help with this, and I tried a few things but can't seem to get it right. Any help in resolving this and having it match both Latin and non-Latin keywords would be much appreciated.
The problem seems to relate to recognizing word boundaries; the \b construct is apparently not “Unicode aware.” This is what the answers to question php regex word boundary matching in utf-8 seem to suggest. I was able to reproduce the problem even with text containing Latin letters like “é” when \b was used. And the problem seems to disappear (i.e., Arabic words get correctly recognized) when I set
$wstart = '(^|[^\p{L}])';
$wend = '([^\p{L}]|$)';
and modify the regexp as follows:
$regexp = "/" . $wstart . $val . $wend . "/iu";
Some string functions in PHP cannot be used on UTF-8 strings; this was supposedly going to be fixed in version 6, but for now you need to be careful what you do with a string.
It looks like strtolower() is one of them; you need to use mb_strtolower($query, 'UTF-8') instead. If that doesn't fix it, you'll need to read through the code, find every point where you process $query or badwords.txt, and check the documentation for UTF-8 bugs.
As far as I know, preg_match() is ok with UTF-8 strings, but there are some features disabled by default to improve performance. I don't think you need any of them.
Please also double-check that badwords.txt is a UTF-8 file and that $query contains a valid UTF-8 string (if it's coming from the browser, the page's encoding is set with a <meta> tag).
If you're trying to debug UTF-8 text, remember most web browsers do not default to the UTF-8 text encoding, so any PHP variable you print out for debugging will not be displayed correctly by the browser, unless you select UTF-8 (in my browser, with View -> Encoding -> Unicode).
You shouldn't need to use iconv or any of the other conversion APIs; most of them will simply replace the non-Latin characters with Latin ones, which is obviously not what you want.
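Putting the two answers together, one possible shape for the matching loop (a sketch, assuming badwords.txt is saved as UTF-8 and $query arrives as valid UTF-8; the preg_quote() call is an extra safety measure that was not in the original code):

<?php
$badwords = array_unique(array_filter(array_map('trim', file('badwords.txt'))));
$query    = mb_strtolower($query, 'UTF-8');

$wstart = '(^|[^\p{L}])'; // start of string or a non-letter
$wend   = '([^\p{L}]|$)'; // a non-letter or end of string

$hasBadword = 0;
foreach ($badwords as $val) {
    $regexp = "/" . $wstart . preg_quote($val, '/') . $wend . "/iu";
    if (preg_match($regexp, $query)) {
        $hasBadword = 1;
        break; // bad word detected
    }
}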

Dompdf unicode problem

Is there any solution for the dompdf Unicode problem?
The dompdf UTF-8 problem is mainly about fonts. You can supply your own fonts or use something like DejaVu, which contains a large set of characters. Edit the information for your fonts in the config file dompdf_font_family_cache.dist.php.
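For example, a minimal sketch against a recent dompdf release (newer versions bundle the DejaVu fonts and use the Dompdf\Dompdf class; older 0.6.x releases use the DOMPDF class and load_html() instead, which is where the font cache file mentioned above comes in):

<?php
require 'dompdf/autoload.inc.php'; // path depends on how dompdf was installed

use Dompdf\Dompdf;

$html = '<html><head><meta charset="UTF-8"></head>'
      . '<body style="font-family: DejaVu Sans, sans-serif;">Üğişçö / عربى / ✓</body></html>';

$dompdf = new Dompdf();
$dompdf->loadHtml($html, 'UTF-8');
$dompdf->render();
$dompdf->stream('unicode.pdf');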
Get your language's special characters as numeric HTML character codes and replace them like this (Turkish special characters in this example):
$from = array('İ', 'ı', 'Ö', 'ö', 'Ü', 'ü', 'Ç', 'ç', 'Ğ', 'ğ', 'Ş', 'ş');
$to   = array('&#304;', '&#305;', '&#214;', '&#246;', '&#220;', '&#252;', '&#199;', '&#231;', '&#286;', '&#287;', '&#350;', '&#351;');
$html = str_replace($from, $to, $html);
Your output should then look right, without weird characters.
Unfortunately, there is not. I would go with wkhtmltopdf, but that requires access to the box. Other options are FPDF and the libraries spawned from it (but these don't convert HTML to PDF; they just give you primitives for creating a PDF). Again, I would go with wkhtmltopdf.