Trying to understand Perl split() output - perl

I have a few lines of text that I'm trying to use Perl's split function to convert into an array. The problem is that I'm getting some unusual extra characters in the output, specifically the following string "\cM" (without the quotes). This string appears where there were line breaks in the original text; however, (I believe) those line breaks were removed in the text that I'm trying to split. Does anybody know what's going on with this phenomenon? I posted an example below. Thanks.
Here's the original plain text that I'm trying to split. I'm loading it from a file, in case that matters:
10b2obo12b2o2b$6b3obob3o8bob3o2b$2bobo10bo3b2obo4bo2b$2o4b2o5bo3b4obo
3b2o2b$2bob2o2bo4b3obo5b4obob$8bo4bo13b3o$2bob2o2bo4b3obo5b4obob$2o4b
2o5bo3b4obo3b2o2b$2bobo10bo3b2obo4bo2b$6b3obob3o8bob3o2b$10b2obo12b2o!
Here is my Perl code that is supposed to do the splitting:
while(<$FH>) {
chomp;
$string .= $_;
last if m/!$/;
}
#rows = split(qr/\$/, $string);
print; # a dummy line to provide a breakpoint for the debugger
This what the debugger outputs when it gets to the "print" line. The issue I'm trying to deal with appears in lines 3, 7, and 10:
DB<10> p $string
2o5bo3b4obo3b2o2b$2bobo10bo3b2obo4bo2b$6b3obob3o8bob3o2b$10b2obo12b2o!
DB<11> x #rows
0 '10b2obo12b2o2b'
1 '6b3obob3o8bob3o2b'
2 '2bobo10bo3b2obo4bo2b'
3 "2o4b2o5bo3b4obo\cM3b2o2b"
4 '2bob2o2bo4b3obo5b4obob'
5 '8bo4bo13b3o'
6 '2bob2o2bo4b3obo5b4obob'
7 "2o4b\cM2o5bo3b4obo3b2o2b"
8 '2bobo10bo3b2obo4bo2b'
9 '6b3obob3o8bob3o2b'
10 "10b2obo12b2o!\cM"

You know, changing the file input separator would make this code a lot simpler.
$/ = '$';
my #rows = <$FH>;
chomp #rows;
print "#rows";

The debugger is probably using \cM to represent Ctrl-M which is also known as a carriage return (and sometimes \r or ^M). Text files from Windows use a CR-LF (carriage return, line feed) pair to represent the end of a line. If you read such a file on a Unix system, your chomp will strip off the Unix EOL (a single line feed) but leave the CR as is and you end up with stray CRs in your file.
For a file like you have you can just strip out all the trailing whitespace instead of using chomp:
while(defined(my $line = <$FH>)) {
$line =~ s/\s+$//;
$string .= $line;
last if($line =~ /!$/);
}

You don't say which OS you're on.
Check out binmode and what it has to say about \cM, and that their position coincides with the line endings of your input file:
http://perldoc.perl.org/functions/binmode.html

Related

Why is my last line is always output twice?

I have a uniprot document with a protein sequence as well as some metadata. I need to use perl to match the sequence and print it out but for some reason the last line always comes out two times. The code I wrote is here
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if($_=~m /^\s+(\D+)/) { #this is the pattern I used to match the sequence in the document
$seq=$1;
$seq=~s/\s//g;} #removing the spaces from the sequence
print $seq;
}
I instead tried $seq.=$1; but it printed out the sequence 4.5 times. Im sure i have made a mistake here but not sure what. Here is the input file https://www.uniprot.org/uniprot/P30988.txt
Here is your code reformatted and extra whitespace added between operators to make it clearer what scope the statements are running in.
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if ($_ =~ m /^\s+(\D+)/) {
$seq = $1;
$seq =~ s/\s//g;
}
print $seq;
}
The placement of the print command means that $seq will be printed for every line from the input file -- even those that don't match the regex.
I suspect you want this
#!usr/bin/perl
open (IN,'P30988.txt');
while (<IN>) {
if ($_ =~ m /^\s+(\D+)/) {
$seq = $1;
$seq =~ s/\s//g;
# only print $seq for lines that match with /^\s+(\D+)/
# Also - added a newline to make it easier to debug
print $seq . "\n";
}
}
When I run that I get this
MRFTFTSRCLALFLLLNHPTPILPAFSNQTYPTIEPKPFLYVVGRKKMMDAQYKCYDRMQ
QLPAYQGEGPYCNRTWDGWLCWDDTPAGVLSYQFCPDYFPDFDPSEKVTKYCDEKGVWFK
HPENNRTWSNYTMCNAFTPEKLKNAYVLYYLAIVGHSLSIFTLVISLGIFVFFRSLGCQR
VTLHKNMFLTYILNSMIIIIHLVEVVPNGELVRRDPVSCKILHFFHQYMMACNYFWMLCE
GIYLHTLIVVAVFTEKQRLRWYYLLGWGFPLVPTTIHAITRAVYFNDNCWLSVETHLLYI
IHGPVMAALVVNFFFLLNIVRVLVTKMRETHEAESHMYLKAVKATMILVPLLGIQFVVFP
WRPSNKMLGKIYDYVMHSLIHFQGFFVATIYCFCNNEVQTTVKRQWAQFKIQWNQRWGRR
PSNRSARAAAAAAEAGDIPIYICHQELRNEPANNQGEESAEIIPLNIIEQESSA
You can simplify this a bit:
while (<IN>) {
next unless m/^\s/;
s/\s+//g;
print;
}
You want the lines that begin with whitespace, so immediately skip those that don't. Said another way, quickly reject things you don't want, which is different than accepting things you do want. This means that everything after the next knows it's dealing with a good line. Now the if disappears.
You don't need to get a capture ($1) to get the interesting text because the only other text in the line is the leading whitespace. That leading whitespace disappears when you remove all the whitespace. This gets rid of the if and the extra variable.
Finally, print what's left. Without an argument, print uses the value in the topic variable $_.
Now that's much more manageable. You escape that scoping issue with if causing the extra output because there's no scope to worry about.

Reading CSV with Perl produces distorted lines

I am reading a CSV file using Perl 5.26.1 with lines that look like this:
B1_10,202337840166,R08C02,202337840166_R08C02.gtc
I'm reading this data into a hash that has the last element as a key, and the first as a value.
I read the file line by line (snippet only):
while (<$csv>) {
if (/^Sample/) { next }
say "-----start----\noriginal = $_";
chomp;
my #line = split /,/;
my $name = $line[0];
my $vcf = $line[3];
say "1st element = $name";
say "4th element = $vcf";
$vcf2dir{$vcf} = $name;
say "\$vcf2dir{$vcf} = '$name'";
say '-----end------';
}
which produces the following output:
-----start----
original = B1_10,202337840166,R08C02,202337840166_R08C02.gtc
1st element = B1_10
4th element = 202337840166_R08C02.gtc
} = 'B1_10'2337840166_R08C02.gtc
-----end-------
but it should look like
-----start----
original = B1_10,202337840166,R08C02,202337840166_R08C02.gtc
1st element = B1_10
4th element = 202337840166_R08C02.gtc
$vcf2dir{202337840166_R08C02.gtc} = 'B1_10'
-----end-------
and it shows strangely with the data printer package:
use DDP;
p %vcf2dir;
produces
{
' "B1_10"840166_R08C02.gtc
}
in other words, the last string is being cut up for some reason.
I have tried removing non-ascii characters with $_ =~ s/[[:^ascii:]]//g; but this still produces the same error.
I have no idea why Perl is ripping these strings apart :(
while (<$csv>) {
...
chomp;
My guess is that the input file has as line end \r\n (windows style) while you are executing the code in a UNIX like environment (Linux, Mac...) where the line end is \n. This means that $INPUT_RECORD_SEPARATOR is also \n and that chomp only removes the \n and leaves the \r. This left \r causes such strange output.
To fix this either fix the line endings in your input file, set $INPUT_RECORD_SEPARATOR to the expected separator or just do s{\r?\n\z}{} instead of chomp to handle both \r\n and \n line endings.
I ran your snippet against your line and it worked as expected
But I have had behavior like what you show because a spurious Control-M's in my data.
Try filtering for control-M's
after your chomp replace all control-M's with the command below
s/\cM//g;

Unpack and x option in TEMPLATE

I have a row which looks like (no new lines, this is one line and I replaces the spaces with _ as otherwise they are trimmed):
46S990BZ6BRIG___1381TRANSOCEAN_LTD______________BCALL_FEB00025000__1000000000000000000000000000000000000000B90002132015000000099999900161100000000000000007500111214111414121714100003000_H8817H100015012200005000000000010000000000000000000000009920202020150213__20_________________________________________________OV__0203P
The the use of unpack perl method returns as follows:
unpack("x49 A4", $line); # where $line is the above example line
returns: CALL
unpack("x68 A4", $line);
returns: 0122
unpack("x238 A4", $line);
return: 2015
Apparently, the column numbers do not match with the number given after 'x' in the TEMPLATE, as x238 is not equal to column 238 ('0000'), I have '2015' on column 251, not 238. The same for the other.
Please, explain how exactly the numbers given after 'x' in TEMPLATE work.
Thank you
First of all, your data isn't what you said it is. The data you provided produces the desired result.
$ perl -E'
my $line = "46S990BZ6BRIG___1381TRANSOCEAN_LTD______________BCALL_FEB00025000__1000000000000000000000000000000000000000B90002132015000000099999900161100000000000000007500111214111414121714100003000_H8817H100015012200005000000000010000000000000000000000009920202020150213__20_________________________________________________OV__0203P";
$line =~ s/_/ /g;
say unpack("x238 A4", $line);
'
0000
Maybe your actual data contained non-printable characters. Or maybe some of the spaces were actually tabs.
$ perl -E'
$_ = "46S990BZ6BRIG___1381TRANSOCEAN_LTD______________BCALL_FEB00025000__1000000000000000000000000000000000000000B90002132015000000099999900161100000000000000007500111214111414121714100003000_H8817H100015012200005000000000010000000000000000000000009920202020150213__20_________________________________________________OV__0203P";
s/_/ /g;
s/LTD\K\s+/\t\t/; # If the spaces after LTD were tabs
say unpack("x238 A4", $_);
'
2015
If you want to extract the characters at based on how your viewer expands tabs, you will need to expand tabs to spaces in the same fashion before passing the string to unpack.

Perl chomp doesn't remove the newline

I want to read a string from a the first line in a file, then repeat it n repetitions in the console, where n is specified as the second line in the file.
Simple I think?
#!/usr/bin/perl
open(INPUT, "input.txt");
chomp($text = <INPUT>);
chomp($repetitions = <INPUT>);
print $text x $repetitions;
Where input.txt is as follows
Hello
3
I expected the output to be
HelloHelloHello
But words are new line separated despite that chomp is used.
Hello
Hello
Hello
You may try it on the following Perl fiddle CompileOnline
The strange thing is that if the code is as follows:
#!/usr/bin/perl
open(INPUT, "input.txt");
chomp($text = <INPUT>);
print $text x 3;
It will work fine and displays
HelloHelloHello
Am I misunderstanding something, or is it a problem with the online compiler?
You have issues with line endings; chomp removes trailing char/string of $/ from $text and that can vary depending on platform. You can however choose to remove from string any trailing white space using regex,
open(my $INPUT, "<", "input.txt");
my $text = <$INPUT>;
my $repetitions = <$INPUT>;
s/\s+\z// for $text, $repetitions;
print $text x $repetitions;
I'm using an online Perl editor/compiler as mentioned in the initial post http://compileonline.com/execute_perl_online.php
The reason for your output is that string Hello\rHello\rHello\r is differently interpreted in html (\r like line break), while in console \r returns cursor to the beginning of the current line.

Removing lines containing a string from a file w/ perl

I'm trying to take a file INPUT and, if a line in that file contains a string, replace the line with something else (the entire line, including line breaks), or nothing at all (remove the line like it wasn't there). Writing all this to a new file .
Here's that section of code...
while(<INPUT>){
if ($_ =~ / <openTag>/){
chomp;
print OUTPUT "Some_Replacement_String";
} elsif ($_ =~ / <\/closeTag>/) {
chomp;
print OUTPUT ""; #remove the line
} else {
chomp;
print OUTPUT "$_\r\n"; #print the original line
}
}
while(<INPUT>) should read one line at a time (if my understanding is correct) and store each line in the special variable $_
However, when I run the above code I get only the very first if statement condition returned Some_Replacement_String, and only once. (1 line, out of a file with 1.3m, and expecting 600,000 replacements). This obviously isn't the behavior I expect. If I do something like while(<INPUT>){print OUTPUT $_;) I get a copy of the entire file, every line, so I know the entire file is being read (expected behavior).
What I'm trying to do is get a line, test it, do something with it, and move on to the next one.
If it helps with troubleshooting at all, if I use print $.; anywhere in that while statement (or after it), I get 1 returned. I expected this to be the "Current line number for the last filehandle accessed.". So by the time my while statement loops through the entire file, it should be equal to the number of lines in the file, not 1.
I've tried a few other variations of this code, but I think this is the closest I've come. I assume there's a good reason I'm not getting the behavior I expect, can anyone tell me what it is?
The problem you are describing indicates that your input file only contains one line. This may be because of a great many different things, such as:
You have changed the input record separator $/
Your input file does not contain the correct line endings
You are running your script with -0777 switch
Some notes on your code:
if ($_ =~ / <openTag>/){
chomp;
print OUTPUT "Some_Replacement_String";
No need to chomp a line you are not using.
} elsif ($_ =~ / <\/closeTag>/) {
chomp;
print OUTPUT "";
This is quite redundant. You don't need to print an empty string (ever, really), and chomp a value you're not using.
} else {
chomp;
print OUTPUT "$_\r\n"; #print the original line
No need to remove newlines and then put them back. Also, normally you would use \n as your line ending, even on windows.
And, since you are chomping in every if-else clause, you might as well move that outside the entire if-block.
chomp;
if (....) {
But since you are never relying on line endings not being there, why bother using chomp at all?
When using the $_ variable, you can abbreviate some commands, such as you are doing with chomp. For example, a lone regex will be applied to $_:
} elsif (/ <\/closeTag>/) { # works splendidly
When, like above, you have a regex that contains slashes, you can choose another delimiter for your regex, so that you do not need to escape the slashes:
} elsif (m# </closeTag>#) {
But then you need to use the full notation of the m// operator, with the m in front.
So, in short
while(<INPUT>){
if (/ <openTag>/){
print OUTPUT "Some_Replacement_String";
} elsif (m# </closeTag>#) {
# do nothing
} else {
print OUTPUT $_; # print the original line
}
}
And of course, the last two can be combined into one, with some negation logic:
} elsif (not m# </closeTag>#) {
print OUTPUT $_;
}