Characters (curved quotes, dash, etc.) coming as â on parsing XML - dom

I am trying to parse the Guardian RSS feed (Link). The feed contains curved quotes (” ’ “ ‘), dashes (-) and accented characters (Orbán).
When I parse and display the text on an HTML page, these characters show as â (for quotes and dashes), Ã¡ (for á) and so on in the 'description' section. How do I make them parse properly?
Code
$xml = simplexml_load_file($link);
for($i = 0; $i < 30; $i++){
    $title = $xml->channel->item[$i]->title;
    $description = $xml->channel->item[$i]->description;
    $count = 0;
    $para = "";
    $doc = new DOMDocument();
    @$doc->loadHTML($description);
    while($count<3){
        if($count==0){
            $para = $doc->getElementsByTagName('p')->item($count)->nodeValue;
        }else{
            $para = $para."<br><br>".$doc->getElementsByTagName('p')->item($count)->nodeValue;
        }
        $count++;
    }
    echo "<tr>";
    echo "<td>" . $title . "</td>";
    echo "<td>" . $para . "</td>";
    echo "</tr>";
}
I have the below line in my 'head' section.
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
The title section shows properly. It might be because they use straight quotes (') in title & curved (‘) in description. But as you can see á is also showing correctly in title.

The problem was with the loadHTML line. DOMDocument does not treat the input as UTF-8 unless the encoding is declared, so the multi-byte characters were read as single-byte characters.
I replaced this line
@$doc->loadHTML($description);
with this line
@$doc->loadHTML('<?xml encoding="utf-8" ?>'.$description);
Check the original answer here.
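To see where the â sequences come from, here is a small sketch of the same mix-up outside PHP. It is written in Python purely for illustration, and the choice of Windows-1252 as the "wrong" encoding is an assumption (it is what ISO-8859-1 is usually treated as in practice). The helper names are made up for this example.

```python
# Illustration only: UTF-8 bytes read back with a single-byte encoding.
# cp1252 is assumed here as the wrong encoding.

def misread_utf8(text: str, wrong_encoding: str = "cp1252") -> str:
    """Encode text as UTF-8, then decode those bytes with the wrong encoding."""
    return text.encode("utf-8").decode(wrong_encoding)

def repair(garbled: str, wrong_encoding: str = "cp1252") -> str:
    """Reverse the mix-up; this only works if no bytes were lost on the way."""
    return garbled.encode(wrong_encoding).decode("utf-8")

garbled = misread_utf8("Orbán’s")
print(garbled)          # OrbÃ¡nâ€™s
print(repair(garbled))  # Orbán’s
```

Each non-ASCII character is two or three bytes in UTF-8, and every byte is shown as its own character, which is why a single curved quote turns into a run starting with â. Declaring the encoding up front, as in the fix above, avoids the problem rather than repairing it afterwards.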

Related

How to create HTML tables dynamically using Perl?

I am working on a project where I need to access a CSV file from a web URL. I am able to access the file and print its content in the terminal, but I'm unable to produce an HTML table (which I will later send by email using MIME).
Here is my code. I need the complete CSV file delivered to my email as an HTML table.
#!/usr/bin/perl
use strict;
use warnings;
use LWP::Simple;
use POSIX qw(strftime);
use MIME::Lite;
use Getopt::Long;
my $to="pankajdustbin\@gmail.com";
my $from="pankajdustbin\@gmail.com";
$subject="CSV File";
my $content = `curl -s "https:csvfile.com"`;
@output = split(/\n/,$content);
foreach my $line (@output) {
    my ($col1, $col2, $col3, $col4, $col5, $col6) = split(/,/, $line);
    #print "\n$col1, $col2, $col3, $col4, $col5, $col6\n";
    $message = "<tr> <td>$col1</td> <td>$col2</td> <td>$col3</td> <td>$col4</td> <td>$col5</td> <td>$col6</td></tr>";
}
my $msg=MIME::Lite->new(
    From => $from,
    To => $to,
    Subject => $subject,
    Data => $message
);
$msg->attr('content-type' => 'text/html');
#MIME::Lite->send("smtp");
$msg->send;
With this code, the HTML table contains only the last row of the CSV. How should I fix this?
The CSV has around 100 rows; a sample of the output I see in the terminal is below:
1.2.3.4 03-04-2022. 03-08-2022. Red. 1%. Positive
5.2.3.4 03-05-2022. 04-08-2022. Blue. 1%. Neutral
and so on...
The problem is that you overwrite the contents of the $message variable each time through the foreach loop. This means that $message will only have the last value that you assign to it.
You could append to the contents of the variable using the .= operator:
my $message;
foreach my $line (@output) {
    my ($col1, $col2, $col3, $col4, $col5, $col6) = split(/,/, $line);
    $message .= "<tr> <td>$col1</td> <td>$col2</td> <td>$col3</td> <td>$col4</td> <td>$col5</td> <td>$col6</td></tr>";
}
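The same accumulate-instead-of-overwrite idea, sketched in Python for comparison (the thread itself is Perl, so treat this as an illustration rather than a drop-in replacement). The function name `csv_to_html_table` is made up for this example; it also uses a CSV parser and escapes each cell, which the original code does not do.

```python
import csv
import html
import io

def csv_to_html_table(csv_text: str) -> str:
    """Return an HTML table with one <tr> per non-empty CSV row."""
    rows = []
    for fields in csv.reader(io.StringIO(csv_text)):
        if not fields:
            continue  # skip blank lines
        # Escape each cell so characters like < or & cannot break the markup.
        cells = "".join(f"<td>{html.escape(field)}</td>" for field in fields)
        rows.append(f"<tr>{cells}</tr>")  # append, never overwrite
    return "<table>\n" + "\n".join(rows) + "\n</table>"

sample = (
    "1.2.3.4,03-04-2022,03-08-2022,Red,1%,Positive\n"
    "5.2.3.4,03-05-2022,04-08-2022,Blue,1%,Neutral\n"
)
print(csv_to_html_table(sample))
```

The sample rows are assumed from the terminal output in the question. Collecting rows in a list and joining them once at the end plays the same role as `.=` in the Perl version.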
The previous answer covered that you overwrite $message in the loop, which is not what you intended.
The following snippet demonstrates a slightly different approach: it builds the HTML table using split and a for loop.
The table can then be used any way you like: sent by mail or written out as an HTML page. This demo generates a complete HTML page.
Note #1: \n and \t are optional and are added only for HTML readability.
Note #2: as no sample input CSV file was provided, its content was assumed from the terminal output shown in the question.
use strict;
use warnings;
use feature 'say';
my $table = '<table border=1>';
while( my $line = <DATA>) {
    chomp $line;
    $table .= "\n\t\t\t<tr>";
    $table .= "\n\t\t\t\t<td>" . $_ . '</td>' for split(/,/,$line);
    $table .= "\n\t\t\t</tr>";
}
$table .= "\n\t\t</table>";
my $html =
"<html lang='en'>
    <head>
        <meta charset='utf-8' />
        <link rel='stylesheet' href='css/styles.css'>
        <title>
            CSV table
        </title>
    </head>
    <body>
        $table
    </body>
</html>
";
say $html;
__DATA__
1.2.3.4,03-04-2022,03-08-2022,Red,1%,Positive
5.2.3.4,03-05-2022,04-08-2022,Blue,1%,Neutral
1.2.3.4,03-04-2022,03-08-2022,Red,1%,Positive
5.2.3.4,03-05-2022,04-08-2022,Blue,1%,Neutral
1.2.3.4,03-04-2022,03-08-2022,Red,1%,Positive
5.2.3.4,03-05-2022,04-08-2022,Blue,1%,Neutral

Parse UTF-8 HTML to CSV Ascii using Perl

First off I am a little new to this, so the answer may be that it is up to the consumer, however, I have the following code:
#!/usr/bin/perl
open(RESPONSE,"response.xml");
$result ="";
while(<RESPONSE>){
    next unless $. > 1;
    $line = $_;
    $line =~ "<html><body>";
    $line =~ "</body></html>";
    $result .= $line;
}
print "$result";
exit 0;
But this still outputs \n and \r\n explicitly. I tried adding the following...
use Encode;
...
$final = decode_utf8($result);
print "$final";
But I still see the chars when I open up the doc generated by this shell command....
perl parse.pl > "outfile.csv"
So for example
<html><body>test,a\r\ntest2,b<body></html>
Stays as test,a\r\ntest2,b in the csv
Thanks!
If you want to parse HTML or XML, then use an HTML or XML parser. If you want to create a CSV file, then use a CSV module.
This problem has nothing at all to do with the differences between Unicode and ASCII.

Identifying a standard gettext pot file

My PHP framework (CakePHP) includes an i18n tool for generating POT files. The header of the file is generated like so:
protected function _writeHeader() {
    $output = "# LANGUAGE translation of CakePHP Application\n";
    $output .= "# Copyright YEAR NAME <EMAIL@ADDRESS>\n";
    $output .= "#\n";
    $output .= "#, fuzzy\n";
    $output .= "msgid \"\"\n";
    $output .= "msgstr \"\"\n";
    $output .= "\"Project-Id-Version: PROJECT VERSION\\n\"\n";
    $output .= "\"POT-Creation-Date: " . date("Y-m-d H:iO") . "\\n\"\n";
    $output .= "\"PO-Revision-Date: YYYY-mm-DD HH:MM+ZZZZ\\n\"\n";
    $output .= "\"Last-Translator: NAME <EMAIL@ADDRESS>\\n\"\n";
    $output .= "\"Language-Team: LANGUAGE <EMAIL@ADDRESS>\\n\"\n";
    $output .= "\"MIME-Version: 1.0\\n\"\n";
    $output .= "\"Content-Type: text/plain; charset=utf-8\\n\"\n";
    $output .= "\"Content-Transfer-Encoding: 8bit\\n\"\n";
    $output .= "\"Plural-Forms: nplurals=INTEGER; plural=EXPRESSION;\\n\"\n\n";
    return $output;
}
I am curious to know if the following:
$output .= "#, fuzzy\n";
$output .= "msgid \"\"\n";
$output .= "msgstr \"\"\n";
violates any gettext standard. If it does not, I would love an explanation as to why one would put that in the header of the file.
I suppose it's 'fuzzy' because the headers are not complete, i.e. the template entries will be filled in when the PO files are generated from the POT.
The official Gettext xgettext tool that generates POT files from source code also adds the fuzzy flag to the header. By that token it is certainly not against the standard.

How to substitute arbitrary fixed strings in Perl

I want to replace a fixed string within another string using Perl. Both strings are contained in variables.
If it was impossible for the replaced string to contain any regex meta-characters, I could do something like this:
my $text = 'The quick brown fox jumps over the lazy dog!';
my $search = 'lazy';
my $replace = 'drowsy';
$text =~ s/$search/$replace/;
Alas, I want this to work for arbitrary fixed strings. E.g., this should leave $text unchanged:
my $text = 'The quick brown fox jumps over the lazy dog!';
my $search = 'dog.';
my $replace = 'donkey.';
$text =~ s/$search/$replace/;
Instead, this replaces dog! with donkey., since the dot matches the exclamation mark.
Assuming that the variable contents themselves are not hardcoded, e.g., they can come from a file or from the command line, is there a way to quote or otherwise escape the contents of a variable so that they are not interpreted as a regular expression in such substitution operations?
Or is there a better way to handle fixed strings? Preferably something that would still allow me to use regex-like features such as anchors or back-references.
Run your $search through quotemeta:
my $text = 'The quick brown fox jumps over the lazy dog!';
my $search = quotemeta('dog.');
my $replace = 'donkey.';
$text =~ s/$search/$replace/;
This will unfortunately not allow you to use other regex features. If you have a select set of features you want to escape out, perhaps you can just run your $search through a first "cleaning" regex or function, something like:
my $search = 'dog.';
$search = clean($search);
sub clean {
    my $str = shift;
    $str =~ s/\./\\\./g;
    return $str;
}
Wrap your search string with \Q...\E, which quotes any meta characters within.
$text =~ s/\Q$search\E/$replace/;
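Because only the search string is quoted, the rest of the pattern can still use regex features such as anchors. For comparison, here is the same idea sketched in Python (an illustration only; this thread is about Perl), where `re.escape` plays the role of `quotemeta` / `\Q...\E`. The function `replace_fixed` and its `anchor_end` parameter are made up for this example.

```python
import re

def replace_fixed(text, search, replacement, anchor_end=False):
    """Replace the first literal occurrence of `search` in `text`."""
    pattern = re.escape(search)   # every metacharacter is now literal
    if anchor_end:
        pattern += r"\Z"          # regex features still work around it
    # A function replacement keeps backslashes in `replacement` literal too.
    return re.sub(pattern, lambda match: replacement, text, count=1)

text = "The quick brown fox jumps over the lazy dog!"
print(replace_fixed(text, "dog.", "donkey."))   # unchanged: no literal "dog."
print(replace_fixed("The lazy dog.", "dog.", "donkey."))
```

The first call leaves the text alone because the dot no longer matches the exclamation mark, which is the behaviour the question asks for.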
# Replace a string without using a regular expression.
sub str_replace {
    my $replace_this = shift;
    my $with_this = shift;
    my $string = shift;
    my $length = length($string);
    my $target = length($replace_this);
    for(my $i=0; $i<$length - $target + 1; $i++) {
        if(substr($string,$i,$target) eq $replace_this) {
            $string = substr($string,0,$i) . $with_this . substr($string,$i+$target);
            return $string; # Comment this out if you want a global replace
        }
    }
    return $string;
}

How convert text into XML using perl?

The input text file contains the following:
....
ponies B-pro
were I-pro
used I-pro
A O
report O
of O
indirect B-cd
were O
. O
...
The desired output XML file:
<sen>
<base id="pro">
<w id="1">ponies</w>
<w id="2">were</w>
<w id="3">were</w>
</base>A report of
<base id="cd">indirect</base> were
</sen>
I want to make an XML file by reading the text file. B- marks the beginning of a tag, I- marks a word included inside that tag, and "O" means the word is outside any base tag, so it appears only as plain text inside the enclosing sen tag.
I tried the following code:
#!/usr/local/bin/perl -w
open(my $f, "input.txt") or die "Can't";
open(my $o, ">output.xml") or die "Can't";
my $c;
sub read_line {
    my $fh = shift;
    if ($fh and my $line = <$fh>) {
        chomp($line);
        my @words = split(/\t/, $line);
        my $word = $words[0];
        my $group = $words[1];
        if($word eq "."){
            return;
        }
        else{
            if($group ne 'O'){
                my @b = split(/\-/, $group);
                if($b[0] eq 'B'){
                    my $e = "<e id=\"";
                    $e .= $b[1] . "\">";
                    $e .= $word . "</e>";
                    return $e;
                }
                if($b[0] eq 'I'){
                    my $w = "<w id=\"";
                    $w .= $c . "\">";
                    $w .= $word . "</w>";
                    $c++;
                    return $w;
                }
            }
            else{
                $c = 2;
                return $word;
            }
        }
    }
    return;
}
sub get_text(){
    my $txt = "";
    my $r = read_line($f);
    while($r){
        if($r =~ m/[[:punct:]]/){
            chop($txt);
            $txt .= " " . $r . " ";
        }
        else{
            $txt .= $r . " ";
        }
        $r = read_line($f);
    }
    chop($txt);
    return "<sen>" . $txt . ".</sen>";
}
Instead, I'm getting this output:
<sen>
<base id="pro"> ponies </base>
<w id="2">were</w>
<w id="3">were</w>
A report of
<base id="cd">indirect</base> were
</sen>
I really need help.
Thanks
Writing XML "by hand" will only get you in trouble. Use a module from CPAN.
In your case, I would first put the data in a proper Perl data structure (maybe a hash containing some arrays, or something similar) and then use a module (e.g. XML::Simple for starters) to output to a file.
As Javs said, you want to use a module rather than do this by hand. For your purposes, since you have mixed content, I recommend XML::LibXML. Here is an example I made to test that you can indeed do mixed content like you've got:
use XML::LibXML;
my $doc = XML::LibXML::Document->new();
my $root = $doc->createElement('html');
$doc->setDocumentElement($root);
my $body = $doc->createElement('body');
$root->appendChild($body);
my $link = $doc->createElement('a');
$link->setAttribute('href', 'http://google.com');
$link->appendText('Google');
$body->appendChild($link);
$body->appendText('Inline Text');
print $doc->toString;
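To show the "build a tree, then serialize" approach applied to the B-/I-/O input from the question, here is a rough sketch in Python using the standard library's ElementTree (Python is used only for illustration; the same structure maps onto XML::LibXML's createElement, appendChild and appendText). Several things are assumptions: the function name `bio_to_xml`, that every word inside a base gets its own w element numbered from 1, that fields are separated by whitespace, and that a "." token ends the sentence.

```python
import xml.etree.ElementTree as ET

def bio_to_xml(lines):
    """Build one <sen> element from 'word TAG' lines (TAG is B-x, I-x or O)."""
    sen = ET.Element("sen")
    base = None   # the <base> element currently open, if any
    last = None   # last child of <sen>; plain text after it goes in .tail

    for line in lines:
        parts = line.split()
        if len(parts) != 2:
            continue                      # skip blank or malformed lines
        word, tag = parts
        if tag == "O":
            base = None                   # an O token closes any open <base>
            if word == ".":
                break                     # assumed: "." ends the sentence
            if last is None:
                sen.text = (sen.text or "") + word + " "
            else:
                last.tail = (last.tail or "") + word + " "
        else:
            kind, _, label = tag.partition("-")
            if kind == "B" or base is None:
                base = ET.SubElement(sen, "base", id=label)
                last = base
            # Number the words inside this <base> starting from 1.
            w = ET.SubElement(base, "w", id=str(len(base) + 1))
            w.text = word
    return ET.tostring(sen, encoding="unicode")

sample = ["ponies B-pro", "were I-pro", "used I-pro", "A O", "report O",
          "of O", "indirect B-cd", "were O", ". O"]
print(bio_to_xml(sample))
```

Because the tree tracks which element is open, the I- words end up inside the base element that the B- word started, which is the part the hand-written string concatenation gets wrong.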