Rogue Character in tab delimited file causing error

I am trying to read and parse a file line by line, but there is some kind of delimiter at the end of the file that is causing strange behavior.
Here is what the lines of the file I am reading look like:
20111129 AMEX BHO OTCBB BHODD
20111129 AMEX LCAPA NASDAQ LMCA
The code to read it is straightforward :
my (@line) = <INFO>;
foreach my $line (@line) {
    chomp($line);
    my @vals = split(/\t/, $line);
    my $date = $vals[0];
    my $old_exch = $vals[1];
    my $old_symb = $vals[2];
    my $new_exch = $vals[3];
    my $new_symb = $vals[4];
    print "0> date '$date'\n";
    print "1> old Exch '$old_exch'\n";
    print "2> old symb '$old_symb'\n";
    print "3> new Exch '$new_exch'\n";
    print "4> new symb '$new_symb'\n";
}
The output appears like this :
0> date '20111129'
1> old Exch 'AMEX'
2> old symb 'BHO'
3> new Exch 'OTCBB'
'> new symb 'BHODD
So there appears to be a character at the end of each line that causes the trailing ' to print at the beginning of the line, wiping out the 4 that should print there. It acts like a character that resets the print position back to the beginning of the line. Is there any way to 'chomp out' this rogue character? Or perhaps there is some kind of bug in my code, but I have other scripts doing something similar...
Thanks much In Advance!
Don

The file has Windows line endings, so the rogue character is "\r" (carriage return). You can remove it with a regular expression:
s/\r//;
Or you can specify the :crlf layer when opening the file.
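
Both options from this answer can be tried side by side. The following is a minimal sketch, not the asker's actual script: the temporary file, its two sample records, and the variable names are assumptions made up for illustration.

```perl
use strict;
use warnings;
use File::Temp qw(tempfile);

# Hypothetical sample data: two tab-separated records with Windows (CRLF) endings.
my ($out, $path) = tempfile(UNLINK => 1);
binmode $out;    # write the bytes exactly as given
print {$out} "20111129\tAMEX\tBHO\tOTCBB\tBHODD\r\n";
print {$out} "20111129\tAMEX\tLCAPA\tNASDAQ\tLMCA\r\n";
close $out or die "Cannot close $path: $!";

# Option 1: let the :crlf layer translate CRLF to LF while reading.
open(my $in, '<:crlf', $path) or die "Cannot open $path: $!";
my @layer_symbols;
while (my $line = <$in>) {
    chomp $line;
    my @vals = split /\t/, $line;
    push @layer_symbols, $vals[4];
}
close $in;

# Option 2: read the file as-is and strip a trailing CR after chomp.
open($in, '<', $path) or die "Cannot open $path: $!";
my @regex_symbols;
while (my $line = <$in>) {
    chomp $line;
    $line =~ s/\r\z//;    # anchored, so only a trailing CR is removed
    my @vals = split /\t/, $line;
    push @regex_symbols, $vals[4];
}
close $in;

# Quoting the values makes any leftover control character easy to spot.
print "'$_'\n" for @layer_symbols, @regex_symbols;
```

The anchored form s/\r\z// is a slightly stricter variant of the s/\r// shown above: it only touches a carriage return at the very end of the line.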

Related

Reading CSV with Perl produces distorted lines

I am reading a CSV file using Perl 5.26.1 with lines that look like this:
B1_10,202337840166,R08C02,202337840166_R08C02.gtc
I'm reading this data into a hash that has the last element as a key, and the first as a value.
I read the file line by line (snippet only):
while (<$csv>) {
    if (/^Sample/) { next }
    say "-----start----\noriginal = $_";
    chomp;
    my @line = split /,/;
    my $name = $line[0];
    my $vcf = $line[3];
    say "1st element = $name";
    say "4th element = $vcf";
    $vcf2dir{$vcf} = $name;
    say "\$vcf2dir{$vcf} = '$name'";
    say '-----end------';
}
which produces the following output:
-----start----
original = B1_10,202337840166,R08C02,202337840166_R08C02.gtc
1st element = B1_10
4th element = 202337840166_R08C02.gtc
} = 'B1_10'2337840166_R08C02.gtc
-----end-------
but it should look like
-----start----
original = B1_10,202337840166,R08C02,202337840166_R08C02.gtc
1st element = B1_10
4th element = 202337840166_R08C02.gtc
$vcf2dir{202337840166_R08C02.gtc} = 'B1_10'
-----end-------
and it shows strangely with the data printer package:
use DDP;
p %vcf2dir;
produces
{
' "B1_10"840166_R08C02.gtc
}
in other words, the last string is being cut up for some reason.
I have tried removing non-ascii characters with $_ =~ s/[[:^ascii:]]//g; but this still produces the same error.
I have no idea why Perl is ripping these strings apart :(

while (<$csv>) {
...
chomp;
My guess is that the input file uses \r\n line endings (Windows style) while you are running the code in a Unix-like environment (Linux, Mac...) where the line ending is \n. There $INPUT_RECORD_SEPARATOR ($/) is also \n, so chomp removes only the \n and leaves the \r. The leftover \r causes the strange output.
To fix this, either fix the line endings in your input file, set $INPUT_RECORD_SEPARATOR to the expected separator, or just use s{\r?\n\z}{} instead of chomp to handle both \r\n and \n line endings.
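
As a rough illustration of the last two fixes, here is a small sketch. The second record and the in-memory list standing in for the CSV file are assumptions added for the example; only the first record comes from the question.

```perl
use strict;
use warnings;

# Hypothetical records: one with a Windows ending, one with a Unix ending.
my @raw = (
    "B1_10,202337840166,R08C02,202337840166_R08C02.gtc\r\n",
    "B1_11,202337840166,R09C02,202337840166_R09C02.gtc\n",
);

# Fix 1: strip either kind of line ending with one substitution.
my %vcf2dir;
for my $record (@raw) {
    (my $line = $record) =~ s{\r?\n\z}{};
    my @fields = split /,/, $line;
    $vcf2dir{ $fields[3] } = $fields[0];
}

# Fix 2: tell chomp which separator to expect by setting $/ locally.
my $crlf_line = $raw[0];
{
    local $/ = "\r\n";
    chomp $crlf_line;    # removes the whole "\r\n" pair
}

printf "%s => %s\n", $_, $vcf2dir{$_} for sort keys %vcf2dir;
```

The substitution is the more forgiving choice when a file might mix both styles, since a fixed $/ of "\r\n" leaves lines that end in a bare \n untouched.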

I ran your snippet against your sample line and it worked as expected.
But I have seen behavior like what you show caused by spurious Control-M characters in my data.
Try filtering out the Control-M characters:
after your chomp, remove all of them with the command below
s/\cM//g;

Perl chomp doesn't remove the newline

I want to read a string from the first line of a file, then print it n times to the console, where n is specified on the second line of the file.
Simple I think?
#!/usr/bin/perl
open(INPUT, "input.txt");
chomp($text = <INPUT>);
chomp($repetitions = <INPUT>);
print $text x $repetitions;
Where input.txt is as follows
Hello
3
I expected the output to be
HelloHelloHello
But the words are separated by newlines even though chomp is used.
Hello
Hello
Hello
You may try it on the following Perl fiddle CompileOnline
The strange thing is that if the code is as follows:
#!/usr/bin/perl
open(INPUT, "input.txt");
chomp($text = <INPUT>);
print $text x 3;
It will work fine and displays
HelloHelloHello
Am I misunderstanding something, or is it a problem with the online compiler?

You have issues with line endings: chomp removes the trailing string stored in $/ from $text, and that can vary depending on platform. You can, however, choose to remove any trailing whitespace from the strings using a regex:
open(my $INPUT, "<", "input.txt");
my $text = <$INPUT>;
my $repetitions = <$INPUT>;
s/\s+\z// for $text, $repetitions;
print $text x $repetitions;
I'm using an online Perl editor/compiler as mentioned in the initial post http://compileonline.com/execute_perl_online.php
The reason for your output is that the string Hello\rHello\rHello\r is interpreted differently in HTML (where \r acts like a line break), while in a console \r returns the cursor to the beginning of the current line.
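
One way to check this kind of diagnosis is to print the string with its control characters made visible. The sketch below assumes the input line ended in \r\n; show_invisibles is a hypothetical helper written for this example, not a built-in.

```perl
use strict;
use warnings;

# Hypothetical helper: render common control characters as visible escapes.
sub show_invisibles {
    my ($text) = @_;
    $text =~ s/\r/\\r/g;
    $text =~ s/\n/\\n/g;
    $text =~ s/\t/\\t/g;
    return $text;
}

my $text = "Hello\r\n";    # assumed content of the first line of input.txt
chomp $text;               # $/ is "\n" here, so only the LF is removed

my $repeated = $text x 3;
print show_invisibles($repeated), "\n";    # prints Hello\rHello\rHello\r
```

Seeing the literal \r in the output confirms that chomp did its job and that the leftover carriage return is what the display is reacting to.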

I want to replace some characters in file 1 and want to write the output of replaced characters into file 2 with Perl

I have file actual.out.tmp, and I want to replace some characters and send the output to
file actual.out. I tried the following code :
open(ACTUAL, "$tmpDir/data/actual_out.tmp");
my $pattern="";
while(<ACTUAL>)
{
$pattern .= $_;
}
close(ACTUAL);
$pattern=~s/#[^[]*/#/g;
$rc= systemTestSetup::execute("touch $tmpDir/data/actual_out");
open(ACTUAL1, ">$tmpDir/data/actual_out");
print ACTUAL1 ;
close(ACTUAL1);
sleep(10);

I believe the line print ACTUAL1; should be print ACTUAL1 $pattern; since $pattern is where you did your search-and-replace. A bare print writes $_, and once the while loop has finished reading the file, $_ is undefined, so nothing useful gets written.
There may be other problems as well.
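
For reference, the whole read, substitute, and write cycle could be sketched as follows. This is an illustration under stated assumptions: the scratch directory stands in for $tmpDir/data, the sample file content is invented, and the digit-replacing substitution is a placeholder rather than the pattern from the question.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical setup: a scratch directory standing in for $tmpDir/data.
my $dir = tempdir(CLEANUP => 1);
my $src = "$dir/actual_out.tmp";
my $dst = "$dir/actual_out";

# Invented sample input so the sketch can run on its own.
open(my $seed, '>', $src) or die "Cannot write $src: $!";
print {$seed} "run 17 ok\nrun 42 ok\n";
close $seed or die "Cannot close $src: $!";

# Slurp the whole input file into one string.
open(my $in, '<', $src) or die "Cannot read $src: $!";
my $pattern = do { local $/; <$in> };
close $in;

# Placeholder substitution (an assumption, not the one from the question).
$pattern =~ s/\d+/N/g;

# Opening with '>' creates the file, so no separate touch is needed.
open(my $out, '>', $dst) or die "Cannot write $dst: $!";
print {$out} $pattern;    # name the variable explicitly instead of relying on $_
close $out or die "Cannot close $dst: $!";
```

Lexical filehandles and the three-argument open are used here as a matter of style; checking the return value of open also surfaces a wrong path immediately.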

Trying to understand Perl split() output

I have a few lines of text that I'm trying to use Perl's split function to convert into an array. The problem is that I'm getting some unusual extra characters in the output, specifically the following string "\cM" (without the quotes). This string appears where there were line breaks in the original text; however, (I believe) those line breaks were removed in the text that I'm trying to split. Does anybody know what's going on with this phenomenon? I posted an example below. Thanks.
Here's the original plain text that I'm trying to split. I'm loading it from a file, in case that matters:
10b2obo12b2o2b$6b3obob3o8bob3o2b$2bobo10bo3b2obo4bo2b$2o4b2o5bo3b4obo
3b2o2b$2bob2o2bo4b3obo5b4obob$8bo4bo13b3o$2bob2o2bo4b3obo5b4obob$2o4b
2o5bo3b4obo3b2o2b$2bobo10bo3b2obo4bo2b$6b3obob3o8bob3o2b$10b2obo12b2o!
Here is my Perl code that is supposed to do the splitting:
while(<$FH>) {
    chomp;
    $string .= $_;
    last if m/!$/;
}
@rows = split(qr/\$/, $string);
print; # a dummy line to provide a breakpoint for the debugger
This is what the debugger outputs when it gets to the "print" line. The issue I'm trying to deal with appears in lines 3, 7, and 10:
DB<10> p $string
2o5bo3b4obo3b2o2b$2bobo10bo3b2obo4bo2b$6b3obob3o8bob3o2b$10b2obo12b2o!
DB<11> x @rows
0 '10b2obo12b2o2b'
1 '6b3obob3o8bob3o2b'
2 '2bobo10bo3b2obo4bo2b'
3 "2o4b2o5bo3b4obo\cM3b2o2b"
4 '2bob2o2bo4b3obo5b4obob'
5 '8bo4bo13b3o'
6 '2bob2o2bo4b3obo5b4obob'
7 "2o4b\cM2o5bo3b4obo3b2o2b"
8 '2bobo10bo3b2obo4bo2b'
9 '6b3obob3o8bob3o2b'
10 "10b2obo12b2o!\cM"

You know, changing the file input separator would make this code a lot simpler.
$/ = '$';
my @rows = <$FH>;
chomp @rows;
print "@rows";
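
Note that with '$' as the record separator, any line breaks that fall inside a record stay in the data, so they still need to be removed. A possible sketch, assuming CRLF-wrapped input and using a shortened, made-up excerpt of the pattern text read from an in-memory filehandle:

```perl
use strict;
use warnings;

# Hypothetical input: a shortened excerpt of the pattern, wrapped with CRLF breaks.
my $data = join "\r\n",
    '10b2obo12b2o2b$6b3obob3o8bob3o2b$2o4b2o5bo3b4obo',
    '3b2o2b$8bo4bo13b3o!',
    '';

open(my $FH, '<', \$data) or die "Cannot open in-memory file: $!";
my @rows;
{
    local $/ = '$';    # records end at '$' instead of at a newline
    @rows = <$FH>;
    chomp @rows;       # strips the trailing '$' from each record
}
close $FH;

# Line breaks inside a record are still present, so delete CR and LF explicitly.
tr/\r\n//d for @rows;

print "$_\n" for @rows;
```

Using local keeps the changed separator confined to the block, so later reads in the same program still split on newlines.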

The debugger is probably using \cM to represent Ctrl-M, which is also known as a carriage return (and sometimes written \r or ^M). Text files from Windows use a CR-LF (carriage return, line feed) pair to represent the end of a line. If you read such a file on a Unix system, your chomp will strip off the Unix EOL (a single line feed) but leave the CR as is, and you end up with stray CRs in your data.
For a file like yours, you can just strip out all the trailing whitespace instead of using chomp:
while(defined(my $line = <$FH>)) {
$line =~ s/\s+$//;
$string .= $line;
last if($line =~ /!$/);
}

You don't say which OS you're on.
Check out binmode and what its documentation has to say about \cM, and note that the positions of the \cM characters coincide with the line endings of your input file:
http://perldoc.perl.org/functions/binmode.html

Perl Regex Error Help

I'm receiving a similar error in two completely unrelated places in our code that we can't seem to figure out how to resolve. The first error occurs when we try to parse XML using XML::Simple:
Malformed UTF-8 character (unexpected end of string) in substitution (s///) at /usr/local/lib/perl5/XML/LibXML/Error.pm line 217.
And the second is when we try to do simple string substitution:
Malformed UTF-8 character (unexpected non-continuation byte 0x78, immediately after start byte 0xe9) in substitution (s///) at /gold/content/var/www/alltrails.com/cgi-bin/API/Log.pm line 365.
The line in question in our Log.pm file is as follows where $message is a string:
$message =~ s/\s+$//g;
Our biggest problem in troubleshooting this is that we haven't found a way to identify the input that is causing this to occur. My hope is that someone else has run into this issue before and can provide advice or sample code that will help us resolve it.
Thanks in advance for your help!

Not sure what the cause is, but if you want to log the message that is causing this, you could always add a __DIE__ signal handler to make sure you capture the error:
$SIG{__DIE__} = sub {
if ($_[0] =~ /Malformed UTF-8 character/) {
print STDERR "message = $message\n";
}
};
That should at least let you know what string is triggering these errors.
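
Depending on the context, Perl may report this message as a warning rather than a fatal error, in which case a __DIE__ handler never fires. A __WARN__ handler along the same lines might help then. The sketch below is an assumption-laden illustration: $message, its sample value, and the @captured list are hypothetical, and the warning is simulated with warn so the example runs without malformed data.

```perl
use strict;
use warnings;

our $message = "sample log line   ";    # hypothetical value being processed
my @captured;

$SIG{__WARN__} = sub {
    my ($warning) = @_;
    if ($warning =~ /Malformed UTF-8 character/) {
        # Record the input that was being handled when the warning fired.
        push @captured, "message = $message";
    }
    else {
        warn $warning;    # pass every other warning through unchanged
    }
};

# Simulated warning; in real code it would come from the s/// on bad data.
warn "Malformed UTF-8 character (simulated) in substitution (s///)\n";

print "$_\n" for @captured;
```

In real code the handler would more likely write to the application log than to an array, and $message has to be visible from the handler's scope.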

Can you do a hex dump of the source data to see what it looks like?
If you're reading this from a file, you can do this with a tool like "od".
Or, you can do this inside the perl script itself by passing the string to a function like this:
sub DumpString {
    my @a = unpack('C*',$_[0]);
    my $o = 0;
    while (@a) {
        my @b = splice @a,0,16;
        my @d = map sprintf("%03d",$_), @b;
        my @x = map sprintf("%02x",$_), @b;
        my $c = substr($_[0],$o,16);
        $c =~ s/[[:^print:]]/ /g;
        printf "%6d %s\n",$o,join(' ',@d);
        print " "x8,join(' ',@x),"\n";
        print " "x9,join(' ',split(//,$c)),"\n";
        $o += 16;
    }
}

Sounds like you have an "XML" file that is expected to contain UTF-8 encoded characters but doesn't. Try just opening it and looking for high-bit characters.
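
Instead of looking for such bytes by eye, the check can be scripted with the core Encode module. This is a sketch under assumptions: is_valid_utf8 is a hypothetical helper name and the two sample byte strings are invented. The second sample uses the byte 0xE9 followed by "x" (0x78), the same byte sequence the second error message in the question complains about, which is what a Latin-1 encoded "é" followed by "x" would look like.

```perl
use strict;
use warnings;
use Encode ();

# Hypothetical helper: true when the byte string decodes cleanly as strict UTF-8.
sub is_valid_utf8 {
    my ($bytes) = @_;
    my $copy = $bytes;    # decode may modify its argument when a CHECK value is given
    return eval { Encode::decode('UTF-8', $copy, Encode::FB_CROAK); 1 } ? 1 : 0;
}

my @samples = (
    "caf\xC3\xA9 x",    # "é" encoded as two UTF-8 bytes
    "caf\xE9x",         # "é" as the single byte 0xE9, followed by "x" (0x78)
);

for my $bytes (@samples) {
    my $hex = join ' ', map { sprintf '%02x', $_ } unpack 'C*', $bytes;
    printf "%-24s %s\n", $hex, is_valid_utf8($bytes) ? 'valid' : 'malformed';
}
```

Running each input line through such a check before parsing is one way to find the offending record without waiting for the error to appear deeper in the code.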
