"Unrecognized character \xEF" Error with BOM-Free File - perl

I am currently extracting text from CSV files using Perl and the module, Text::CSV.
Each of the CSV files have quotation marks separating each field. The texts are being saved to independent text files with tab separation into columns. I can call and print each column from the text files no problem, but when I try to use the values in a loop, I get the error Unrecognized character \xEF.
An example of my code is as follows:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
#### Match ligand data with GPCR interaction data ####
my $csv = Text::CSV->new();
my $file = $ARGV[0];
open (FILE, "<$file");
open (OUT, ">new_$file");
while (my $line2 = <FILE>)
{
binmode(STDOUT, ":utf8");
if ($line2 =~ /^(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)$/)
{
#### Data from filtered1.txt ####
my $up_fil = $1;
my $ligid_fil = $2;
my $units_fil = $3;
my $low_fil = $4;
my $median_fil = $5;
my $upper_fil = $6;
my $ref = $7;
#### Convert negative log affinity values to normal ####
my $activity = $units_fil;
$activity =~ s/p//;
my $value;
if ($median_fil ne "")
{
$value = $median_fil;
$value = (10**-$median_fil)/(10**-9);
}
elsif ($low_fil ne "" and $upper_fil ne "")
{
my $lower = $low_fil;
$lower = (10**-$low_fil)/(10**-9);
my $upper = $upper_fil;
$upper = (10**-$upper_fil)/(10**-9);
$value = "$upper - $lower";
}
else
{
$value = "n/a";
}
#### Match entries from filtered1.txt with ligands.csv ####
open (LIG, "<ligands.csv");
while (my $line3 = <LIG>)
{
$csv->parse($line3);
my #ligand_fields = $csv->fields();
if (!$ligand_fields[14]) { next; }
if ($ligand_fields[0] eq $ligid_fil)
{
#print OUT "$ligand_fields[14]\t$ligand_fields[13]\t$up_fil\t$ligid_fil\t$activity\t$value\t$ref\n";
print "$ligand_fields[14]\t$ligand_fields[13]\t$up_fil\t$ligid_fil\t$activity\t$value\t$ref\n";
next;
}
}
close LIG;
}
}
close FILE;
close OUT;
I've also tried using a regex along the lines of the following, but to no avail.
# remove BOM
${$self->{CODE}} =~ s/^(?:
\xef\xbb\xbf |
\xfe\xff |
\xff\xfe |
\x00\x00\xfe\xff |
\xff\xfe\x00\x00
)//x;
The original CSV files appear not to have any BOM, so I suspect that Text::CSV may be creating it when it is parsing and returning values. I hope this was a clear enough explanation of the problem, and if needed, I can provide more details. Thanks in advance for any advice given.

The documentation of Text::CSV states you should almost certainly be using the binary mode.
my $csv = Text::CSV->new ( { binary => 1 } ) # should set binary attribute.
or die "Cannot use CSV: ".Text::CSV->error_diag ();
From https://metacpan.org/pod/Text::CSV#SYNOPSIS.
You may also want to take a look at Text::CSV::Encoded.
I also see you are setting a binmode of :utf8 on STDOUT. There are a couple of problems with that:
You are setting it each time around the loop unnecessarily
The :utf8 binmode does not have good error checking, you should use :encoding(UTF-8) instead
The byte 0xEF can appear in UTF-8 byte sequences, but only in very specific circumstances, it is too high (> 0x7F) to be a single character. However in Perl \xEF or \x{ef} does not refer to the byte 0xEF, but the Unicode code point U+00EF which is represented in UTF-8 as 0xC3 0xAF. You can see this in a Unicode/UTF-8 character table such as http://www.utf8-chartable.de/.
$ perl -E 'binmode STDOUT, ":encoding(UTF-8)"; say "\xEF";'
ï
So I think that is why your regular expression to remove the BOM didn't work.
I would recommend using three argument open with either '<:encoding(UTF-8)' or '>:encoding(UTF-8)' to open all of your input and output files, and using Text::CSV in binary mode, for the best results.

Related

How can i split and fill value from csv data?

How can i Split Value by Newline (\n) in some column, extract to new row and fill other column
My Example CSV Data (data.csv)
No,Email,IP,Service,Comment
1,test#email.com,192.168.10.109,FTP
HTTP
HTTPS,,
2,webmaster#email.com,192.168.10.111,SFTP
SNMP,,
3,admin#email.com,192.168.10.112,HTTP,,
In Service column has multiple value, separate by new line.
I want to extract it and fill with other value in some row look like this.
1,test#email.com,192.168.10.110,FTP,,
1,test#email.com,192.168.10.110,HTTP,,
1,test#email.com,192.168.10.110,HTTPS,,
2,webmaster#email.com,192.168.10.111,SFTP,,
2,webmaster#email.com,192.168.10.111,SNMP,,
3,admin#email.com,192.168.10.112,HTTP,,
I try to parsing with Text::CSV, I can only split multiple ip and service But i Don't known to fill other value as above example.
#!/usr/bin/perl
use Text::CSV;
my $file = "data.csv";
my #csv_value;
open my $fh, '<', $file or die "Could not open $file: $!";
my $csv = Text::CSV->new;
my $sum = 0;
open(my $data, '<:encoding(utf8)', $file) or die "Could not open '$file' $!\n";
while (my $fields = $csv->getline( $data )) {
push #csv_value, $fields;
}
close $data;
Thank you in advance for any help you can provide.
To expand on my comment
perl -ne 'if (!/^\d/){print "$line$_";} else {print $_;} /(.*,).*/; $line=$1;' file1
Use the perl command line options
e = inline command
n = implicit loop, i.e. for every line in the file do the script
Each line of the file is now in the $_ default variable
if (!/^\d/){print "$line$_";} - if the line does not start with a digit print the $line (more later) variable, followed by default variable which is the line from the file
else {print $_;} - else just print the line
Now after we've done this if the line matches anything followed by a comma followed by anything, catch it with the regex bracket so it's put in $1. So for the first line $1 will be '1,test#email.com,192.168.10.109,'
/(.*,).*/; $line=$1;
Because we do this after the first line has been printed $line will always be the previous full line.
Your input CSV is broken. I would suggest to fix the generator.
With correctly formatted input CSV you will have to enable binary option in Text::CSV as your data contains non-ASCII characters.
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;
# input has non-ASCII characters
my $csv_in = Text::CSV->new({ binary => 1 });
my $csv_out = Text::CSV->new();
$csv_out->eol("\n");
while (my $row = $csv_in->getline(\*STDIN)) {
for my $protocol (split("\n", $row->[3])) {
$row->[3] = $protocol;
$csv_out->print(\*STDOUT, $row);
}
}
exit 0;
Test with fixed input data:
$ cat dummy.csv
No,Email,IP,Service,Comment
1,test#email.com,192.168.10.109,"FTP
HTTP
HTTPS",,
2,webmaster#email.com,192.168.10.111,"SFTP
SNMP",,
3,admin#email.com,192.168.10.112,HTTP,,
$ perl dummy.pl <dummy.csv
No,Email,IP,Service,Comment
1,test#email.com,192.168.10.109,FTP,,
1,test#email.com,192.168.10.109,HTTP,,
1,test#email.com,192.168.10.109,HTTPS,,
2,webmaster#email.com,192.168.10.111,SFTP,,
2,webmaster#email.com,192.168.10.111,SNMP,,
3,admin#email.com,192.168.10.112,HTTP,,

Search and add a text file for words in an XML file

I have two files. The first has a sequence of words on each line
bus do car
car tree
The second file is an XML file
<title>i have a car. i take bus..</title>
I want to search the text file for each word in the XML file. If it is found, I want to insert all lines from the text file where it appears, with any spaces replaced by x.
The result file would be
<title>i have a car busxdoxcar carxtree. i take bus busxdoxcar..</title>
I try this
use strict;
use warnings;
use autodie;
my $QueryFile = "query.txt";
my $SequenceFile = "Seq_2_terms_150.txt";
my %hashlist;
open NewQueryFile ,">./NewQuery.txt"
or die "Cannot create NewQuery.txt";
open(my $fh,$SequenceFile)
or die "$SequenceFile : $!";
while ( <$fh> ) {
chop;
s/^\s+|\s+$//g;
my $h = \%hashlist;
foreach ( split('\s+', $_) ) {
$h->{$_} //= {};
$h = $h->{$_};
}
$h->{'#'} = 1;
}
close $fh;
open(my $fd, $QueryFile)
or die "$QueryFile : $!";
for my $xml (<$fd>) {
foreach my $line (split(/\n/, $xml)) {
my #words = split(/\s/, $line);
if $words = #hashlist[$_] {
print NewQueryFile join ('x',$words) ;
}
}
}
close NewQueryFile ;
close($fd);
I have put together a quick script to indicate how one might go about this.
I have not bothered with xml, because that may well have left me in a bad mood.
My advice would be: do use variables, whatever you save from not doing so is lost as your code gets confusing and then buggy.
#!/usr/bin/env perl
use strict;
use warnings;
# Notes:
# - more than one space or tab in a row are mangled: They become one space only
# - the query file is not checked for containing actual words to match on,
# it is assumed to be suitable
# - I have made no attempt to parse xml. You should use a parser for that.
# Search Stack Overflow or Google or CPAN or all of those for examples.
# - The replace_xml_text function can be used on the text supplied by the
# parser to get the desired output
# - a feeble attempt is made to deal with punctuation in replace_xml_text
# - This code is not really tested
my %query_words;
my $query_fn = 'query.txt';
open (my $fh, "<",$query_fn) or die "could not open file '$query_fn'";
# build list of words from query file
while ( <$fh> ){
chomp;
# Words mentioned in line.
my #words = split(/\s+/,$_);
# Words joined by 'x'. Seems a strange choice *shrug*.
# This is used to replace words with later.
my $line = join("x",#words);
# Storing in arrayref, this seems easier to me
# than concatening repeatedly and getting the spaces right.
for my $word ( #words ){
push #{$query_words{$word}}, $line;
}
}
# Expects the text to replace.
# Returns the text to replace it with.
sub replace_xml_text {
my $original_text = shift;
my #words;
for my $word ( split(/\s+/,$original_text) ){
my $punctuation = '';
# Remove punctuation before matching,
# but do preserve it.
if ( $word =~s /(\s*[,.]\s*)$// ){
$punctuation = $1;
}
if ( my $additions = $query_words{$word} ){
$word = join(" ",$word,#$additions);
}
# Put punctuation back.
$word .= $punctuation;
# Done replacing in this word, next
push #words,$word;
}
return join(" ",#words);
}

Grep using perl

I'm trying to grep multiple patterns from a log file using perl. For the first pattern i'm getting the desired matching pattern via read only variable($1,$2..). But for the next pattern the read only variable is returning the previous value but not the value matching the second pattern.
here is the code:
$tmp = `grep "solo_video_channel_.*(0): queueing" $log`;
chomp($tmp);
$tmp =~ m/(.*):.*solo_video_channel_write(.*): queueing page (.*).*/;
$chnl = $2;
$page = $3;
$timestamp = $1;
$tmp1 = `grep "(0): DUMP GO" $log`;
chomp($tmp1);
$tmp1 =~ m/(.*): solo_video_channel_write(0): DUMP GO/;
$dmp = $1;
print "dump go time = $1\n";
tmp1's value after grep is coming as expected. but $1 value remains same as the previous one.
Any suggestions?
Always make sure that you verify that a regex matched before using a captured variable.
Additionally, there is no reason to shell out to grep. Use Perl's file processing instead:
use strict;
use warnings;
local #ARGV = $log;
while (<>) {
chomp;
if (/solo_video_channel_.*\(0\): queueing/) {
if ( my ( $timestamp, $chnl, $page ) = m/(.*):.*solo_video_channel_write(.*): queueing page (.*).*/ ) {
print "$. - $timestamp, $chnl, $page\n";
}
}
if ( my ($dmp) = m/(.*): solo_video_channel_write\(0\): DUMP GO/ ) {
print "dump go time = $dmp\n";
}
}
Note, your first set of if's could almost certainly be combined into a single if statement, but I left it as is for now.
Why not use Pure Perl? It's faster than running external greps. Plus, you can grep both regular expressions at once. Faster than looping through the file twice.
Always check the value of your rexp match. Here I'm using if statements to do this. Note too that I am printing all lines that don't match with UNMATCHED LINES. You can remove the else when you see that everything is working, or simply redirect 2> /dev/null.
use strict;
use warnings;
use autodie;
use feature qw(say);
my $log = "log.txt";
open my $log_fh, "<", $log;
while ( my $line = <$log_fh> ) {
my $timestamp;
my $channel;
my $page;
my $gotime;
if ( $line =~ /(.*):.*solo_video_channel_(.*):\s+queueing page (.*)/ ) {
$timestamp = $1;
$channel = $2;
$page = $3;
say qq(Timestamp = "$timestamp" Channel = "$channel" Page = "$page");
}
elsif ( $line =~ /(.*): solo_video_channel_write(0): DUMP GO/ ) {
$gotime = $1;
say "Dump Go Time = $1";
}
else {
say STDERR qq(UNMATCHED LINES: "$line");
}
}
close $log_fh;
In the second regexp you need to escape the literal brackets
$tmp1 =~ m/(.*): solo_video_channel_write\(0\): DUMP GO/
This is because the expression \(0\) matches the exact pattern (0)
In the example given in this answer this would include strings such as
37: solo_video_channel_write(0): DUMP GO
In contrast, the expression (0) matches the exact pattern 0 and sets a capture group.
With the regexp given in your original question
$tmp1 =~ m/(.*): solo_video_channel_write(0): DUMP GO/;
matching would occur on strings such as
37: solo_video_channel_write0: DUMP GO
Of course in the original program the strings are not in this format, so they do not match and $1 is not set
The regular expression syntax for the shell program grep is (confusingly) different
To use round brackets for setting a capture group they must be escaped with a backslash, which is the opposite to the syntax in perl

Replace substring with string from second csv file

Earlier I was working on a loop within a loop and if a match was made it would replace the entire string from the second loop file. Now i have a slightly different situation. I'm trying to replace a substring from the first loop with a string from the second loop. They're both csv files and semicolon delimited. What i'm trying to replace are special characters: from the numerical code to the character itself The first file looks like:
1;2;bla&#322blabla &#261bla;7;8
3;4;bl&#261blabla;9;10
2;3;blablabla&#261ał8;9
and the second file has the numerical code and the corresponding character:
Ą;Ą
ą;ą
Ǟ;Ǟ
Á;Á
á;á
Â;Â
ł;ł
The first semicolon in the second file belongs to the numerical code of the corresponding character and should not be used to split the file. The result should be:
1;2;blałblabla ąbla;7;8
3;4;bląblabla;9;10
2;3;blablablaąał;8;9
This is the code I have. How can i fix this?
use strict;
use warnings;
my $inputfile1 = shift || die "input/output!\n";
my $inputfile2 = shift || die "input/output!\n";
my $outputfile = shift || die "output!\n";
open my $INFILE1, '<', $inputfile1 or die "Used/Not found :$!\n";
open my $INFILE2, '<', $inputfile2 or die "Used/Not found :$!\n";
open my $OUTFILE, '>', $outputfile or die "Used/Not found :$!\n";
my $infile2_pos = tell $INFILE2;
while (<$INFILE1>) {
s/"//g;
my #elements = split /;/, $_;
seek $INFILE2, $infile2_pos, 0;
while (<$INFILE2>) {
s/"//g;
my #loopelements = split /;/, $_;
#### The problem part ####
if (($elements[2] =~ /\&\#\d{3}\;/g) and (($elements[2]) eq ($loopelements[0]))){
$elements[2] =~ s/(\&\#\d{3}\;)/$loopelements[1]/g;
print "$2. elements[2]\n";
}
#### End problem part #####
}
my $output_line = join(";", #elements);
print $OUTFILE $output_line;
#print "\n"
}
close $INFILE1;
close $INFILE2;
close $OUTFILE;
exit 0;
Assuming your character codes are standard Unicode entities, you are better off using HTML::Entities to decode them.
This program processes the data you show in your first file and ignores the second file completely. The output seems to be what you want.
use strict;
use warnings;
use HTML::Entities 'decode_entities';
binmode STDOUT, ":utf8";
while (<DATA>) {
print decode_entities($_);
}
__DATA__
1;2;bla&#322blabla &#261bla;7;8
3;4;bl&#261blabla;9;10
2;3;blablabla&#261ał8;9
output
1;2;blałblabla ąbla;7;8
3;4;bląblabla;9;10
2;3;blablablaąał8;9
You split your #elements at every occurrence of ;, which is then removed. You will not find it in your data, the semicolon in your Regexp can never match, so no substitutions are done.
Anyway, using seek is somewhat disturbing for me. As you have a reasonable number of replacement codes (<5000), you might consider putting them into a hash:
my %subst;
while(<$INFILE2>){
/^&#(\d{3});;(.*)\n/;
$subst{$1} = $2;
}
Then we can do:
while(<$INFILE1>){
s| &# (\d{3}) | $subst{$1} // "&#$1" |egx;
# (don't try to concat undef
# when no substitution for our code is defined)
print $OUTFILE $_;
}
We do not have to split the files or view them as CSV data if replacement should occur everywhere in INFILE1. My solution should speed things up a bit (parsing INFILE2 only once). Here I assumed your input data is correct and the number codes are not terminated by a semicolon but by length. You might want to remove that from your Regexes.(i.e. m/&#\d{3}/)
If you have trouble with character encodings, you might want to open your files with :uft8 and/or use Encode or similar.

How to match Chinese character in the web pages whose charset is big5 (Perl)?

I am doing a match between Chinese words, for example, "语言中心“ and a mount of web files (php, html, htm, etc).
However, somehow I get the following error:
Malformed UTF-8 character (1 byte, need 2, after start byte 0xdf) in regexp compilation at ../Final_FindOnlyNoReplace_CLE_Chinese.pl line 89, <INFILE> line 12.
Can anyone help?
Here is my code.
#!/usr/bin/env perl
use Encode qw/encode decode/;
use utf8;
use strict;
use Cwd;
use LWP::UserAgent;
my($path) = #_;
## append a trailing / if it's not there
$path .= '/' if($path !~ /\/$/);
use File::Glob ':glob';
my #all_files = bsd_glob($path."*");
for my $eachFile (#all_files) {
open(INFILE, "<$eachFile") || die ("Could not open '$eachFile'\n");
my(#inlines) = <INFILE>;
my($line, $find);
my $outkey = 1;
foreach $line (#inlines) {
$find = &find($line);
if ($find ne 'false') {
chomp($line);
print "\tline$outkey : $line\n";
}
$outkey ++;
}
}
#subroutine
sub find {
my $m = encode("utf8", decode("big5", #_));
my $html = LWP::UserAgent->new
->get($m)
->decoded_content;
my $str_chinese = '語言中心';
if ($m =~ /$str_chinese/) {
$m; ##if match, return the whole line.
}
}
You aren't searching in $html you've retrieved and decoded, but in URL instead: $m =~ /$str_chinese/, which, I guess, is not what you intend.
Also, you're comparing result of find function with exact string "false," which will never work. Change if ($find ne 'false') to if (defined($find)) and add explicit returns for success and failure to find for clarity.
Finally, you script seems to fail because you point it to directory that have some other Perl scripts amongst other files. They're most likely in UTF-8, so when your script tries to read them as big5 data, it falis on decoding. Just change your glob to cover data files only.
#!/usr/bin/env perl
use utf8;
use strictures;
use LWP::UserAgent qw();
use Path::Class::Rule qw();
use URI::file qw();
my $start_directory = q(.);
my $search_text = qr'語言中心';
my $next = Path::Class::Rule->new->name(qw(*.php *.htm*))->iter($start_directory);
my #matching_lines;
while (my $file = $next->()) {
for my $line (split /\R/, LWP::UserAgent
->new
->get(URI::file->new_abs($file))
->decoded_content
) {
push #matching_lines, $line if $line =~ $search_text;
}
}
# #matching_lines is (
# '<title>Untitled 語言中心 Document</title>',
# 'abc 語言中心 cde',
# '天天向上語言中心他'
# )