Escape angle brackets using Pod::Markdown - perl

I have a problem getting the correct behavior from Pod::Markdown when the POD contains the angle brackets < and >. For example:
use strict;
use warnings;
use Data::Dump;
use Pod::Markdown;
my $str = "=head1 OPTIONS\n\n=over 4\n\n=item B<< --file=<filename> >>\n\nFile name \n\n=back\n";
my $parser = Pod::Markdown->new;
my $markdown;
$parser->output_string( \$markdown );
$parser->parse_string_document($str);
dd $markdown;
Gives output:
"# OPTIONS\n\n- **--file=<filename>**\n\n File name \n"
When this is rendered on GitHub, the <filename> part is missing. It is probably treated as an HTML tag inside the ** markup and is therefore not shown.
The desired output would be
"# OPTIONS\n\n- **--file=\<filename\>**\n\n File name \n"
where the brackets < and > should be escaped with a backslash.
Update
It seems the problem is not restricted to double-star sequences. I updated the question accordingly.

For the moment, a workaround seems to be to insert a backslash in a postprocessing step. For example:
$parser->output_string( \$markdown );
$parser->parse_string_document($str);
fix_escape_chars(\$markdown);
sub fix_escape_chars {
my ($str) = @_;
$$str =~ s/(?<!\\)>/\\>/g;
$$str =~ s/(?<!\\)</\\</g;
}
This seems to work well. (It even works inside URLs, contrary to what is claimed in this question.)

Pod::Markdown 3.000 has been released and fixes this issue.
Not all Markdown processors recognize backslash-escaped < characters, so I followed the Markdown spec's suggestion of escaping & and < as HTML entities (&amp; and &lt;).
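For readers stuck on an older release, the entity approach can also be applied as a postprocessing step. This is only a minimal sketch, not Pod::Markdown's own implementation; the helper name escape_for_markdown is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Pod::Markdown: encode the two
# characters that the answer above says should become HTML entities.
sub escape_for_markdown {
    my ($text) = @_;
    $text =~ s/&/&amp;/g;    # must run first so "&lt;" is not double-encoded
    $text =~ s/</&lt;/g;
    return $text;
}

print escape_for_markdown("- **--file=<filename>**"), "\n";
```

Note that a blanket substitution like this also rewrites any < that was meant as Markdown or HTML syntax (an autolink, for example), so it is only safe when the generated Markdown contains no such constructs.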

Related

Matching special character (###!~`%^&()[]}{;') and replace it with _ (underscore) in perl

I want to replace all special characters except these two: . and -
$name=~s/[^\w\d\.-]/_/g ;
But the line above replaces not only the special characters but also non-alphabet characters, e.g. Arabic or other non-alphabet characters.
How do I replace only these characters (###!~`%^&()[]}{;',)?
There are a few things to consider here.
First, do \d and \w really do what you think they do? Recent perls are Unicode aware (and in some cases locale aware), and those character classes aren't the same in every situation.
Since you know what you want to exclude, you can just put that directly into the character class. You need to escape only the ] so it doesn't end the character class:
use v5.10;
my $name = "(Hello] #&^% {World[} (###!~`%^&()[]}{;',)!";
$name =~ s/[(###!~`%^&()[\]}{;',)]/_/g;
say $name;
Mark Jason Dominus has written about the "American" and "Prussian" approaches to cleansing data. You can specify what to exclude, or what to include.
If you specify the things to exclude, you potentially pass through some things that you should have excluded but did not. This may be because you forgot or didn't even know you should exclude it. These unintended situations may bite you.
If you specify only the things that are safe, you potentially miss out on things you should pass through, but bad things don't get through by mistakes of omission.
You then might try this, where you don't use the character class shortcuts:
$name =~ s/[^0-9A-Za-z.-]/_/g;
But the output is a bit weird because this also replaces whitespace. You might add the \s shortcut in there:
$name =~ s/[^0-9A-Za-z\s.-]/_/g;
But the meaning of \s has also changed over time too (vertical tab!) and is also Unicode aware. You could list the whitespace you would accept:
$name =~ s/[^0-9A-Za-z\x20.-]/_/g;
But now this is getting a bit weird. There's another way. You can go back to the ASCII versions of the character class shortcuts with the /a flag:
$name =~ s/[^\d\w\s.-]/_/ga;
The regex operator flags are documented in perlop since they apply to an operator. But, for as long as I've been using Perl and telling that to people in classes, somehow I still go to perlre first.
Transliterate
Second, the substitution operator may be more than you need though. If you want to change single characters into other single characters, the transliteration operator may be what you need. It changes the character on the left with the corresponding character on the right:
$name =~ tr/abc/XYZ/; # a -> X, b -> Y, c -> Z
If you don't have enough characters to match up on the right, it reuses the last character:
$name =~ tr/abc/XY/; # a -> X, b -> Y, c -> Y
So, in your case with one underscore:
$name =~ tr/##!~`%^&()[]}{;',/_/;
Since the sequence of characters in tr/// isn't a regular expression, you don't have to worry about regex metacharacters.
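To see that in action, here is a small sketch with a sample string of my own (not the one from the question). It also shows a second use of tr///: with an empty replacement list and no /d flag, it changes nothing and just returns how many characters matched.

```perl
use strict;
use warnings;

# Sample input chosen for illustration; q|...| avoids any interpolation.
my $name = q|(Hello] {World[} ~`%^&;',!|;

# Every listed character becomes an underscore; brackets and ^ need no escaping.
(my $clean = $name) =~ tr/!~`%^&()[]{};',/_/;

# Empty replacement list, no /d: tr only counts the matching characters.
my $count = ( $name =~ tr/!~`%^&()[]{};',// );

print "$clean\n";
print "$count characters would be replaced\n";
```

The hyphen is the one character to watch for in tr///, because it forms ranges such as a-z; put it first or last in the list, or escape it.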
Just for giggles
If this pattern is something you want to use in multiple places, you might want to give it a name with a user-defined Unicode property. Once it has a name, you use that everywhere and can update for everyone at the same time:
use v5.10;
my $name = "(Hello] #&^% {World[} (###!~`%^&()[]}{;',)!";
$name =~ s/\p{IsForbidden}/_/g;
say $name;
sub IsForbidden {
# see https://perldoc.perl.org/perlunicode#User-Defined-Character-Properties
state $exclude = q|##!~`%^&()[]}{;',|;
state $string =
join '',
map { sprintf "%X\n", ord }
split( //, $exclude );
return $string;
}
Building on Gene's comment, specify what you want to replace but I'd escape each special character. Note, to replace #, use \#\# in character array as shown in line 2:
$name = "# # R ! ~## ` % ^ & ( O ){{();,'`## { } ;!!! ' N , ";
$name =~ s/[\#\!\~\`\%\&\^\(\)\{\}\;\'\,\#\#]//g;
$name =~ s/ *//g;
print $name;
### Outputs RON

How do I convert an escaped t into a tab character

I have a variable that contains a backslash and a t.
my $var = "\\t";
I want to convert that to a tab. How do I do that?
use Data::Dumper;
use Term::ReadLine;
my $rl = Term::ReadLine->new();
my $var = $rl->readline( 'Enter \t:' );
print Dumper $var;
The following is the simplest solution:
$var = "\t" if $var eq "\\t";
If you want to do this no matter where the sequence appears in the string, you could use
$var =~ s/\\t/\t/g;
But it sounds like you're not asking the right question. Nothing supports \t and nothing else. At the very least, I would also expect \\ to produce \. Are you perhaps trying to parse JSON? If so, there are a number of other escape sequences you need to worry about.
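If you do end up supporting more than one sequence, a lookup table handled in a single pass keeps \\ and \t from interfering with each other. This is a rough sketch; the set of escapes in %unescape and the name unescape_text are assumptions, so adjust them to whatever your input format really defines.

```perl
use strict;
use warnings;

# Assumed mapping for illustration; extend it to match your input format.
my %unescape = (
    t    => "\t",
    n    => "\n",
    '\\' => '\\',
);

# Hypothetical helper: replace each backslash sequence in one left-to-right pass.
sub unescape_text {
    my ($text) = @_;
    # Unknown sequences are left untouched (backslash plus the character).
    $text =~ s/\\(.)/exists $unescape{$1} ? $unescape{$1} : "\\$1"/ge;
    return $text;
}

my $var = unescape_text('col1\tcol2');
print $var, "\n";
```

Doing it in one pass matters: running s/\\\\/\\/g and s/\\t/\t/g one after the other would wrongly turn an escaped backslash followed by a literal t into a tab.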

How to use a variable in a substitution?

I've a text file and I want to match and erase the following text (please note the newline):
[ From:
http://www.website.com ]
The following code works
$text =~ s/\[.*\]//ms;
This other doesn't
my $patt = \[.*\];
$text =~ s/$patt//ms;
Would someone be so kind as to explain why?
Thanks in advance
The second variant works perfectly, if you quote the pattern string and get rid of syntax error:
#!/usr/bin/perl
use strict;
use warnings;
my $text = qq{a[ From:
http://www.website.com ]b};
my $patt = qr/\[.*?\]/s;
$text =~ s/$patt//;
print $text;
Prints:
ab
I added the ? quantifier to the regexp to make the match non-greedy, and removed the m modifier: you are not using ^ or $ in your regexp, so m has no effect.
The only reason your variation isn't working is that you haven't put quotes around your $patt string. As it is it throws a syntax error. This works fine
my $patt = '\[.*\]';
$text =~ s/$patt//ms;
My only comment is that the /m modifier is superfluous as it modifies the behaviour of the $ and ^ anchors, which you aren't using here. Only /s is necessary to make the . match newline characters.
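To put the two answers side by side, here is a short sketch of the two common ways to get a variable into a substitution: as a compiled pattern with qr//, or as literal text protected with \Q...\E. The sample text is the one from the question; the variable names are mine.

```perl
use strict;
use warnings;

my $text = "a[ From:\nhttp://www.website.com ]b";

# A pattern compiled with qr// carries its own flags (/s here) with it.
my $patt = qr/\[.*?\]/s;
(my $stripped = $text) =~ s/$patt//;

# To match a variable's contents literally, quote metacharacters with \Q...\E.
my $literal = '[ From:';
(my $no_prefix = $text) =~ s/\Q$literal\E//;

print "$stripped\n";
print "$no_prefix\n";
```

Without \Q, the [ in $literal would be taken as the start of a character class and the pattern would fail to compile.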

How to truncate the extension for special case in Perl?

I'm working on a script to strip all the extensions from a file name using the regex below, but it doesn't work well: it removes data that I want to keep, because it removes everything from the first dot onward.
The regex I use currently:-
/\..*?$/
It would remove some files like
b10_120.00c.current.all --> b10_120
abc_10.77.log.bac.temp.ls --> abc_10
but I'm looking for an output in b10_120.00c and abc_10.77
Aside from that, is there a way to keep only certain extensions in the output? For the two examples above, that would give b10_120.00c.current and abc_10.77.log. Thank you very much.
The following will strip file name extensions off:
s/\.[^.]+$//;
Explanation
\. matches a literal .
[^.]+ matches one or more characters that are not a .
$ anchors the match at the end of the string
Update
my ($new_file_name) = ( $file_name =~ m/^( [^.]+ \. [^.]+ )/x );
Explanation
^ anchor at the start of the string
[^.]+ matches every character that is not a .
\. matches a literal .
[^.]+ matches every character that is not a .
Test
#!/usr/bin/env perl
use strict;
use warnings;
use Test::More 'tests' => 2;
my %file_name_map = (
'b10_120.00c.current.all' => 'b10_120.00c',
'abc_10.77.log.bac.temp.ls' => 'abc_10.77',
);
sub new_file_name {
my $file_name = shift;
my ($new_file_name) = ( $file_name =~ m/^( [^.]+ \. [^.]+ )/x );
return $new_file_name;
}
for my $file_name ( keys %file_name_map ) {
is new_file_name($file_name), $file_name_map{$file_name},
"Got $file_name_map{$file_name}";
}
$file =~ s/(\.[^.]+).*/$1/; # SO requires 30 chars in answer, that is stupid
You should use \. for the dot in the regular expression.
Also please explain in more details how you want to process file name.
Instead of a regex, I would suggest using this package:
http://perldoc.perl.org/File/Basename.html
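For what it's worth, here is a small sketch of what File::Basename gives you for the file names in the question. fileparse with a suffix pattern is part of the core module; the loop around it is just for illustration.

```perl
use strict;
use warnings;
use File::Basename qw(fileparse);

for my $file ( 'b10_120.00c.current.all', 'abc_10.77.log.bac.temp.ls' ) {
    # The suffix pattern is matched once at the end of the name,
    # so only the last extension is split off.
    my ( $base, $dir, $suffix ) = fileparse( $file, qr/\.[^.]*/ );
    print "$base (removed $suffix)\n";
}
```

Because only the final extension is removed per call, this does not by itself produce b10_120.00c; for that case the anchored regex from the other answers is still the more direct tool.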

How can I check if HTML contains extended entities like &lt;?

Let's say we have an HTML string like "2 &lt; 4".
How can I determine whether it contains any of these extended sequences?
I've found HTML::Entities on CPAN, but it doesn't provide a 'check' method.
Details: I am fixing a 'truncate' method so that it does not leave a corrupted string like "2 &l" and does not do unnecessary work. It should look like this:
$s = HTML::Entities::decode_entities ($s) if $has_ext_chars;
$s = substr ($s, 0, $len - 3) . '...' if length $s > $len;
$s = HTML::Entities::encode_entities ($s, "‚„-‰‹‘-™›\xA0¤¦§©«-®°-±µ-·»") if $has_ext_chars;
How do I determine $has_ext_chars?
A complete list of character entities can be found on the W3C reference.
You also have to match \&#u?\d+; and \&#x[a-fA-F0-9]+;
From perldoc HTML::Entities:
The module can also export the
%char2entity and the %entity2char
hashes, which contain the mapping from all characters to the
corresponding entities (and vice versa, respectively).
You can probably use them to build regexes. For example, to match entities:
use HTML::Entities '%entity2char';
my $regex = "&(?:" . join("|", map {s/;\z//; $_} keys %entity2char) . ");";
if ($str =~ /$regex/) {
print "$str contains entities\n";
}
This will skip entities like &#entity_number; though.
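If you want a check that also catches the numeric forms, a shape-based regex is one option. This is a sketch under the assumption that anything of the form &name;, &#123; or &#x1F; counts as an entity; it does not verify that a named entity actually exists, and has_entities is a made-up helper name.

```perl
use strict;
use warnings;

# Assumed entity shapes: named (&lt;), decimal (&#60;), hex (&#x3C;).
my $entity_re = qr/&(?:[A-Za-z][A-Za-z0-9]*|#[0-9]+|#[xX][0-9A-Fa-f]+);/;

# Hypothetical helper: true if the string contains something entity-shaped.
sub has_entities {
    my ($html) = @_;
    return $html =~ $entity_re ? 1 : 0;
}

for my $sample ( '2 &lt; 4', '2 &#60; 4', '2 &#x3C; 4', 'R&D budget' ) {
    printf "%-12s %s\n", $sample, has_entities($sample) ? 'has entities' : 'plain';
}
```

The result could serve as $has_ext_chars in the question's snippet. Combining it with the %entity2char keys from above would tighten the named branch to known entities only.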
You can try it with a regular expression
$str =~ /.*\&[^\s]+;.*/
From your code sample you have probably just introduced a cross site scripting attack into your application. If I were to get your code to process something like <script src="evil.example.com"></script> your code would decode it to valid HTML and not re-encode the < and > back to entities. (The angle brackets in the code are not ASCII angle brackets.)
If you are truncating a string that contains any HTML tags or entities you will probably break something if you use a simple solution. You might be better off building a solution based on an HTML parsing module. If you are only looking at text inside an element with no elements inside it you can grab the text, truncate it and then replace it back into the element. If you have to deal with mixed content it will be more complicated.
But in the interest of bad solutions:
#treats each entity as one character: "2 &lt; 4" is 5 characters long
$trunc_len = $len - 3;
$str =~ s/^((?>(?:[^&]|&[^\s;]+;?){$trunc_len}))(?:[^&]|&[^\s;]+;?){4,}/$1.../;
#abuses the procedural nature of the regexp engine
#treats each input character as one character: "2 &lt; 4" is 8 characters long
$str =~ s/^( (?:[^&]|&[^\s;]+;?)+ )(?(?{ $found = (pos() > ( $found ? $len - 3 : $len ))})(?!)).*$(?(?{pos() < $len })(?!))/$1.../x;
Both are fairly permissive in what is an entity to allow for common browser quirks.