Perl split string at character entity reference - perl

Quick Perl question with hopefully a simple answer. I'm trying to split a string that contains non-breaking spaces (&nbsp;). This happens after reading in an HTML page with HTML::TreeBuilder::XPath and retrieving the string with $titleString = $tree->findvalue('/html/head/title'):
use HTML::TreeBuilder::XPath;
$tree = HTML::TreeBuilder::XPath->new;
$tree->parse_file( "filename" );
$titleString = $tree->findvalue('/html/head/title');
print "$titleString\n";
Pasted below is the original string and below that the string that gets printed:
Mr&nbsp;Dan Perkins&nbsp;(Active)
Mr?Dan Perkins?(Active)
I've tried splitting $titleString with @parts = split('\?', $titleString); and also with the original &nbsp;, but neither has worked. My hunch is that a simple piece of encoding code needs to be added somewhere?
HTML code:
<html>
<head>
<title>Mr&nbsp;Dan Perkins&nbsp;(Active)</title>
</head>
</html>

You shouldn't have to know how the text in the document is encoded. findvalue returns an actual non-breaking space (U+00A0) wherever the document contains &nbsp;, so you'd split on that character:
split(/\xA0/, $title_string)
-or-
split(/\x{00A0}/, $title_string)
-or-
split(/\N{U+00A0}/, $title_string)
-or-
split(/\N{NBSP}/, $title_string)
-or-
split(/\N{NO-BREAK SPACE}/, $title_string)
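As a minimal sketch of the approach above, the snippet below uses a hard-coded string as a stand-in for the value returned by findvalue (an assumption, so that it runs without an HTML file). The "?" in the printed output is most likely the non-breaking space being written to a handle with no encoding layer, so the sketch also sets one on STDOUT.

```perl
use strict;
use warnings;

# Stand-in for the decoded string returned by findvalue(); the two
# U+00A0 characters are where the document had &nbsp;.
my $title_string = "Mr\x{00A0}Dan Perkins\x{00A0}(Active)";

# Split on the actual non-breaking space character.
my @parts = split /\x{00A0}/, $title_string;

# Encode on output so non-ASCII characters are written as UTF-8.
binmode STDOUT, ':encoding(UTF-8)';
print "$_\n" for @parts;
```

The ordinary space inside "Dan Perkins" is left alone, because only U+00A0 is used as the separator.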

Related

Html encode in perl excluding html tags

$encoded = encode_entities($input, '<>&"');
This encodes <, >, & and ". But how can I exclude these characters from the encoding instead?
There is an example in the documentation:
$encoded = encode_entities($input, '^\n\x20-\x25\x27-\x7e');
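The second argument is treated as the contents of a character class, so a leading ^ lists the characters to leave alone. The sketch below runs the documentation example on a made-up input string; the second call, which also leaves & untouched, is an assumption about what the asker wants rather than something taken from the documentation.

```perl
use strict;
use warnings;
use HTML::Entities qw(encode_entities);

# Made-up input containing tags, an ampersand and one non-ASCII character.
my $input = "<b>caf\x{e9} & bar</b>";

# Documentation example: leave newline and printable ASCII alone,
# except "&" (0x26), which sits between the two safe ranges.
my $keep_tags = encode_entities( $input, '^\n\x20-\x25\x27-\x7e' );

# Variation (assumption): one contiguous range, so "&" is left alone too.
my $keep_amp = encode_entities( $input, '^\n\x20-\x7e' );

print "$keep_tags\n";    # <b>caf&eacute; &amp; bar</b>
print "$keep_amp\n";     # <b>caf&eacute; & bar</b>
```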

How can I fix garbled multibyte text when using Text::vCard's as_string() method?

When using multibyte UTF-8 characters in a NOTE node, characters are garbled/lost around the newline.
For example:
$vcard = $address_book->add_vcard();
$vcard->version('3.0');
$vcard->FN('Tèśt Ûšér');
$vcard->NOTE('①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳①②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳');
say $vcard->as_string();
Produces:
BEGIN:VCARD
VERSION:3.0
FN:Tèśt Ûšér
NOTE:①②③④⑤⑥⑦⑧⑨⑩⑪��
�⑬⑭⑮⑯⑰⑱⑲⑳①②③④
⑤⑥⑦⑧⑨⑩⑪⑫⑬⑭⑮⑯�
��⑱⑲⑳①②③④⑤⑥⑦⑧��
�⑩⑪⑫⑬⑭⑮⑯⑰⑱⑲⑳①
②③④⑤⑥⑦⑧⑨⑩⑪⑫⑬�
��⑮⑯⑰⑱⑲⑳
END:VCARD
How would I go about fixing this? I also posted this as an issue on the text-vcard project page. I think this is related to how the newlines are inserted (by inserting the raw bytes \x0D\x0A), but I'm not sure.
It looks like the culprit is Text::vCard::Node->_wrap_utf8(). I was able to at least get it to stop cutting up characters by bypassing that method altogether.
sub _wrap_utf8 {
my ( $self, $key, $value, $max, $newline ) = @_;
#bypass wrapping
return $key . $value;
…
}
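Bypassing the method avoids the garbling but also gives up line folding. If folding is still wanted, one option is to break only between characters while counting UTF-8 bytes. The sketch below is not Text::vCard code: fold_utf8 is a hypothetical helper, and the 10-byte limit is chosen only to keep the demonstration short.

```perl
use strict;
use warnings;
use Encode qw(decode encode);

# Hypothetical helper: fold a decoded string into lines of at most
# $max UTF-8 bytes, breaking only between characters.
sub fold_utf8 {
    my ( $text, $max ) = @_;
    my @lines;
    my ( $line, $bytes ) = ( '', 0 );
    for my $char ( split //, $text ) {
        my $len = length encode( 'UTF-8', $char );
        if ( $bytes + $len > $max ) {
            push @lines, $line;
            # A continuation line starts with a single space.
            ( $line, $bytes ) = ( ' ', 1 );
        }
        $line  .= $char;
        $bytes += $len;
    }
    push @lines, $line;
    return join "\r\n", @lines;
}

# "NOTE:" followed by four 3-byte characters (U+2460 to U+2463).
my $note = decode( 'UTF-8',
    "NOTE:\xE2\x91\xA0\xE2\x91\xA1\xE2\x91\xA2\xE2\x91\xA3" );
my $folded = fold_utf8( $note, 10 );

binmode STDOUT, ':encoding(UTF-8)';
print "$folded\r\n";
```

Because the split happens on the decoded string, a multibyte character can never be cut in half; the byte count is only used to decide where the next line starts.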

PHP preg_replace with any variation of upper/lowercase?

I needed to write a custom module in drupal to help out with my location search. Initially I simply needed to remove a comma from queries, and then I realized that I would need to replace all instances of states with their abbreviation (California -> CA) because of how information is stored in my database. However, upon doing this I found out that my method of using preg_replace seems to be dependent on upper/lowercase. So in this line:
$form_state['values'] = preg_replace("/alabama/", 'al', $form_state['values']);
"alabama" will be replaced with "al", but "Alabama" or "ALABAMA" will not. Is there a way to replace any instance of Alabama with its abbreviation without accounting for every possible variation in casings?
You can also try str_ireplace(), which is case-insensitive:
<?php
$str = 'alabama ,Alabama,ALABAMA';
$replace = str_ireplace('alabama','al',$str);
echo $str;
echo "<br/>";
echo $replace;
?>
$form_state['values'] = preg_replace("/alabama/i", 'al', $form_state['values']);
The 'i' modifier will make the pattern case-insensitive.

Perl OpenOffice::OODoc - accessing header/footer elements

How do you get elements in the header/footer of an .odt document?
For example, I have:
use OpenOffice::OODoc;
my $doc = odfDocument(file => 'whatever.odt');
my $t=0;
while (my $table = $doc->getTable($t))
{
print "Table $t exists\n";
$t++;
}
When I check the tables, they are all from the body. I can't seem to find any elements from the header or footer.
I found sample code here which led me to the answer:
#! /usr/local/bin/perl
use OpenOffice::OODoc;
my $file='asdf.odt';
# odfContainer is a representation of the zipped odf file
# and all of its parts.
my $container = odfContainer("$file");
# We're going to look at the 'style' part of the container,
# because that's where the header is located.
my $style = odfDocument
(
container => $container,
part => 'styles'
);
# masterPageHeader takes the style name as its argument.
# This is not at all clear from the documentation.
my $masterPageHeader = $style->masterPageHeader('Standard');
my $headerText = $style->getText( $masterPageHeader );
print "$headerText\n";
The master page style defines the look and feel of the document -- think CSS. Apparently 'Standard' is the default name of the master page style in a document created by OpenOffice. That was the toughest nut to crack; once I found the example code, the rest fell into my lap.

Escape Single Quotes in Template Toolkit

Do you ever escape single quotes in Template Toolkit for necessary JavaScript handlers? If so, how do you do it?
[% SET s = "A'B'C" %]
ABC
html_entity obviously doesn't work because it only handles the double quote. So how do you do it?
I don't use inline event handlers, for the same reason I refuse to use the style attribute for CSS. jQuery just makes it too easy to put class="foo" on the HTML and $('.foo').click( function () {} ) in an external .js file.
But, for the purpose of doing my best to answer this question, check out these docs on Template::Filter for the ones in core.
It seems as if you could do [% s | replace( "'", "\\'" ) %] to escape single quotes. Or you could probably write a more complex sanitizing JavaScript parser that permits only function calls, and make your own Template::Filter.
2018 update for reference:
TT has a method for this called squote for escaping single quotes and dquote for double quotes.
[% tim = "Tim O'Reilly" %]
[% tim.squote %] # Tim O\'Reilly
The link from the question would then be something like:
ABC
http://www.template-toolkit.org/docs/manual/VMethods.html#section_squote
You can try: popup('[% s | html %]').
Perl isn't my strongest language... But!
Easiest way I've found is to use the JSON module. In a module called JS.pm or something:
use JSON;
sub encode {
    my ( $self, $string ) = @_;
    # allow_nonref lets encode() accept a plain string, not only a hash or array reference
    my $json = JSON->new->allow_nonref;
    return $json->encode($string);
}
More here: http://search.cpan.org/~makamaka/JSON-2.90/lib/JSON.pm
Then in your template:
[% USE JS %]
<script>
var escaped_string = [% JS.encode( some_template_variable ) %];
</script>
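To see what ends up in the page, here is a minimal standalone sketch of the same encoding step. It uses JSON::PP, the pure-Perl encoder shipped with Perl, in place of the JSON module so that it runs without extra installs; the sample string is made up.

```perl
use strict;
use warnings;
use JSON::PP;

# allow_nonref permits encoding a bare string instead of a reference.
my $json = JSON::PP->new->allow_nonref;

# The encoder adds the surrounding double quotes and escapes embedded ones;
# the apostrophe needs no escaping inside a double-quoted literal.
my $literal = $json->encode( q{Tim O'Reilly said "hi"} );

print "var escaped_string = $literal;\n";
# var escaped_string = "Tim O'Reilly said \"hi\"";
```

Since the result is a complete JavaScript string literal, the template should not wrap it in quotes of its own.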
Remember to double the backslash in the replacement; otherwise it will be interpreted as escaping the apostrophe.
[% string.replace( "'", "\\'" ) %]