Weird printing behaviour in Perl - perl

I want to read some numbers from a file into an array and then print them as part of a statement. But I am getting some weird results. The file list1.txt contains the following text:
01
02
15
30
43
75
76
I want to print:
The number is: 01;
The number is: 02;
The number is: 15;
The number is: 30;
The number is: 43;
The number is: 75;
The number is: 76;
So I wrote the following simple program:
my #arr;
my $i = 0;
open(my $file_fh,"<","list1.txt") or die "Could not find specified file\n";
while (<$file_fh>) {
$arr[$i] = $_;
chomp($arr[$i]);
$i++;
}
foreach my $num (#arr) {
my $stmt = "The number is: ".$num."\;";
print "$stmt\n";
}
But the result I get is:
;he number is: 01
;he number is: 02
;he number is: 15
;he number is: 30
;he number is: 43
;he number is: 75
;he number is: 76
What am I doing wrong?

The file you are reading is in Windows format with CRLF at the end of each line. Your cygwin Perl is configured to use Linux line endings, so the chomp removes only the ending LF, leaving the string value stored with a trailing CR, which is causing the semicolon to overwrite the first character of the output.

You're reading a Windows text file (CRLF line endings) using a Perl on a unix (cygwin) system. Unix systems expect text files to have LF endings.
The fix is to use s/\s+\z// instead of chomp.

Related

Perl: how to compare array to hash and print out results

I'm quite new to Perl, so I'm sorry if this is somewhat rudimentary.
I'm working with a Perl script that is working as a wrapper for some Python, text formatting, etc. and I'm struggling to get my desired output.
The script takes a folder, for this example, the folder contains 6 text files (test1.txt through test6.txt). The script then extracts some information from the files, runs a series of command line programs and then outputs a tab-delimited result. However, that result contains only those results that made it through the rest of the processing by the script, i.e. the result.
Here are some snippets of what I have so far:
use strict;
use warnings;
## create array to capture all of the file names from the folder
opendir(DIR, $folder) or die "couldn't open $folder: $!\n";
my #filenames = grep { /\.txt$/ } readdir DIR;
closedir DIR;
#here I run some subroutines, the last one looks like this
my $results = `blastn -query $shortname.fasta -db DB/$db -outfmt "6 qseqid sseqid score evalue" -max_target_seqs 1`;
#now I would like to compare what is in the #filenames array with $results
Example of tab delimited result - stored in $results:
test1.txt 200 1:1-20 79 80
test3.txt 800 1:1-200 900 80
test5.txt 900 1:1-700 100 2000
test6.txt 600 1:1-1000 200 70
I would like the final output to include all of the files that were run through the script, so I think I need a way to compare two arrays or perhaps compare an array to a hash?
Example of the desired output:
test1.txt 200 1:1-20 79 80
test2.txt 0 No result
test3.txt 800 1:1-200 900 80
test4.txt 0 No result
test5.txt 900 1:1-700 100 2000
test6.txt 600 1:1-1000 200 70
Update
Ok, so I got this to work with suggestions by #terdon by reading the file into a hash and then comparing. So I was trying to figure out how to do this with out writing to file and the reading the file back in - I still can't seem to get the syntax correct. Here's what I have, however it seems like I'm not able to match the array to the hash - meaning the hash must not be correct:
#!/usr/bin/env perl
use strict;
use warnings;
#create variable to mimic blast results
my $blast_results = "file1.ab1 9 350 0.0 449 418 418 403479 403042 567
file3.ab1 2 833 0.0 895 877 877 3717226 3718105 984";
#create array to mimic filename array
my #filenames = ("file1.ab1", "file2.ab1", "file3.ab1");
#header for file
my $header = "Query\tSeq_length\tTarget found\tScore (Bits)\tExpect(E-value)\tAlign-length\tIdentities\tPositives\tChr\tStart\tEnd\n";
#initialize hash
my %hash;
#split blast results into array
my #row = split(/\s+/, $blast_results);
$hash{$row[0]}=$_;
print $header;
foreach my $file (#filenames){
## If this filename has an associated entry in the hash, print it
if(defined($hash{$file})){
print "$row[0]\t$row[9]\t$row[1]:$row[7]-$row[8]\t$row[2]\t$row[3]\t$row[4]\t$row[5]\t$row[6]\t$row[1]\t$row[7]\t$row[8]\n";
}
## If not, print this.
else{
print "$file\t0\tNo Blast Results: Sequencing Rxn Failed\n";
}
}
print "-----------------------------------\n";
print "$blast_results\n"; #test what results look like
print "-----------------------------------\n";
print "$row[0]\t$row[1]\n"; #test if array is getting split correctly
print "-----------------------------------\n";
print "$filenames[2]\n"; #test if other array present
The result from this script is (the #filenames array is not matching the hash):
Query Seq_length Target found Score (Bits) Expect(E-value) Align-length Identities Positives Chr Start End
file1.ab1 0 No Blast Results: Sequencing Rxn Failed
file2.ab1 0 No Blast Results: Sequencing Rxn Failed
file3.ab1 0 No Blast Results: Sequencing Rxn Failed
-----------------------------------
file1.ab1 9 350 0.0 449 418 418 403479 403042 567
file3.ab1 2 833 0.0 895 877 877 3717226 3718105 984
-----------------------------------
file1.ab1 9
-----------------------------------
file3.ab1
I'm not entirely sure what you need here but the equivalent of awk's A[$1]=$0 is done using hashes in Perl. Something like:
my %hash;
## Open the output file
open(my $fh, "<","text_file");
while(<$fh>){
## remove newlines
chomp;
## split the line
my #A=split(/\s+/);
## Save this in a hash whose keys are the 1st fields and whose
## values are the associated lines.
$hash{$A[0]}=$_;
}
close($fh);
## Now, compare the file to #filenames
foreach my $file (#filenames){
## Print the file name
print "$file\t";
## If this filename has an associated entry in the hash, print it
if(defined($hash{$file})){
print "$hash{$file}\n";
}
## If not, print this.
else{
print "0\tNo result\n";
}
}

How to calculate inverse log2 ratio of a UCSC wiggle file using perl?

I have 2 separate files namely A & B containing same header lines but 2 and 1 column respectively. I want to take inverse log2 of the 2nd column or 1st column in separate files but keep the other description intact. I am having some thing like this.. values in file A $1 and $2 are separated by delimiter tab
file A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 0.781985
16 0.810993
20 0.769601
24 0.733831
file B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
0.721985
0.610993
0.760123
0.573831
I expect an output like this. file A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.7194950944
16 1.754418585
20 1.7047982296
24 1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr2
for file B (in this file values are just copy paste of file A)
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.7194950944
1.754418585
1.7047982296
1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig rep1.bar.wig graphType=bar
variableStep chrom=chr2
This awk script does the calculation that you want:
awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' file
This matches lines that contain only digits, periods and any space characters, substituting the value of the last field $NF for 2 raised to the power of $NF. The format specifier %.12f can be modified to give you the required number of decimal places. The 1 at the end is shorthand for {print}.
Testing it out on your new files:
$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.719495094445
16 1.754418584953
20 1.704798229573
24 1.663049372620
$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.649449947457
1.527310087388
1.693635012985
1.488470882686
So here's the Perl version:
use strict;
open IN, $ARGV[0];
while (<IN>) {
chomp;
if (/^(.*)[\t ]*(-?\d\.\d*)/) { # format "nn m.mmmmm"
my $power = 2 ** $2;
print("$1\t" . $power . "\n");
} elsif (/^(-?\d\.\d*)/) { # format "m.mmmmm"
my $power = 2 ** $1;
print($power . "\n");
} else { # echo all other stuff
print;
print ("\n");
}
}
close IN;
If you run <file>.pl <datafile> (replace with appropriate names) it will convert one file so the lines have 2**<2nd value>). It simply echoes the lines that do not match the number pattern.
This is the modified little script of #ThomasKilian
Thanks to him for providing the framework.
use strict;
open IN, $ARGV[0];
while (<IN>) {
chomp;
if (/^(\d*)[\t ]*(-?\d\.\d*)/) { # format "nn m.mmmmm"
my $power = 2 ** $2;
$power= sprintf("%.12f", $power);
print("$1\t" . $power . "\n");
} elsif (/^(-?\d\.\d*)/) { # format "m.mmmmm"
my $power = 2 ** $1;
$power= sprintf("%.12f", $power);
print($power . "\n");
} else { # echo all other stuff
print;
print ("\n");
}
}
close IN;

some help on the following perl script

Need help in merging/concatenating /combining /binding etc
I have several ascii files each defining one variable which I have converted to a single column array
I have such columnised data for many variables ,so I need to perform a column bind like R does and make it one single file.
I can do the same in R but there are too many files. Being able to do it with one single code will help save a lot of time.
Using the following code ,new to perl and need help with this.
#filenames = ("file1.txt","file2.txt");
open F2, ">file_combined.txt" or die;
for($j = 0; $j< scalar #filenames;$j++){
open F1, $filenames[$j] or die;
for($i=1;$i<=6;$i++){$line=<F1>;}
while($line=<F1>){
chomp $line;
#spl = split '\s+', $line;
for($i=0;$i<scalar #spl;$i++){
print F2 "$spl[$i]\n";
paste "file_bio1.txt","file_bio2.txt"> file_combined.txt;
}
}
close F1;
}
Input files here are Ascii text files of a raster.They look like this
32 12 34 21 32 21 22 23
12 21 32 43 21 32 21 12
The above mentioned code without the paste syntax converts these files into a single column
32
12
34
21
32
21
22
23
12
21
32
43
21
32
21
12
The output should look like this
12 21 32
32 23 23
32 21 32
12 34 12
43 32 32
32 23 23
32 34 21
21 32 23
Each column represents a different ascii file.
I need around 15 such ascii files into one dataframe.I can do the same in R but it consumes a lot of time as the number of files and regions of interest are too many and the files are a bit large too.
Let's step through what you have...
# files you want to open for reading..
#filenames = ("file1.txt","file2.txt");
# I would use the 3 arg lexical scoped open
# I think you want to open this for 'append' as well
# open($fh, ">>", "file_combined.txt") or die "cannot open";
open F2, ">file_combined.txt" or die;
# #filenames is best thought as a 'list'
# for my $file (#filenames) {
for($j = 0; $j< scalar #filenames;$j++){
# see above example of 'open'
# - $filenames[$j] + $file
open F1, $filenames[$j] or die;
# what are you trying to do here? You're overriding
# $line in the next 'while loop'
for($i=1;$i<=6;$i++){$line=<F1>;}
# while(<$fh1>) {
while($line=<F1>){
chomp $line;
# #spl is short for split?
# give '#spl' list a meaningful name
#spl = split '\s+', $line;
# again, #spl is a list...
# for my $word (#spl) {
for($i=0;$i<scalar #spl;$i++){
# this whole block is a bit confusing.
# 'F2' is 'file_combined.txt'. Then you try and merge
# ( and overwrite the file) with the paste afterwards...
print F2 "$spl[$i]\n";
# is this a 'system call'?
# Missing 'backticks' or 'system'
paste "file_bio1.txt","file_bio2.txt"> file_combined.txt;
}
}
# close $fh1
close F1;
}
# I'm assuming there's a 'close F2' somewhere here..
It looks like you're trying to do this:
#filenames = ("file1.txt","file2.txt");
$oufile = "combined_text.txt";
`paste $filenames[0] $filenames[1] > $outfile`;

Perl: utf8::decode vs. Encode::decode

I am having some interesting results trying to discern the differences between using Encode::decode("utf8", $var) and utf8::decode($var). I've already discovered that calling the former multiple times on a variable will eventually result in an error "Cannot decode string with wide characters at..." whereas the latter method will happily run as many times as you want, simply returning false.
What I'm having trouble understanding is how the length function returns different results depending on which method you use to decode. The problem arises because I am dealing with "doubly encoded" utf8 text from an outside file. To demonstrate this issue, I created a text file "test.txt" with the following Unicode characters on one line: U+00e8, U+00ab, U+0086, U+000a. These Unicode characters are the double-encoding of the Unicode character U+8acb, along with a newline character. The file was encoded to disk in UTF8. I then run the following perl script:
#!/usr/bin/perl
use strict;
use warnings;
require "Encode.pm";
require "utf8.pm";
open FILE, "test.txt" or die $!;
my #lines = <FILE>;
my $test = $lines[0];
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
my #unicode = (unpack('U*', $test));
print "Unicode:\n#unicode\n";
my #hex = (unpack('H*', $test));
print "Hex:\n#hex\n";
print "==============\n";
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
#unicode = (unpack('U*', $test));
print "Unicode:\n#unicode\n";
#hex = (unpack('H*', $test));
print "Hex:\n#hex\n";
print "==============\n";
$test = Encode::decode("utf8", $test);
print "Length: " . (length $test) . "\n";
print "utf8 flag: " . utf8::is_utf8($test) . "\n";
#unicode = (unpack('U*', $test));
print "Unicode:\n#unicode\n";
#hex = (unpack('H*', $test));
print "Hex:\n#hex\n";
This gives the following output:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 2
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
This is what I would expect. The length is originally 7 because perl thinks that $test is just a series of bytes. After decoding once, perl knows that $test is a series of characters that are utf8-encoded (i.e. instead of returning a length of 7 bytes, perl returns a length of 4 characters, even though $test is still 7 bytes in memory). After the second decoding, $test contains 4 bytes interpreted as 2 characters, which is what I would expect since Encode::decode took the 4 code points and interpreted them as utf8-encoded bytes, resulting in 2 characters. The strange thing is when I modify the code to call utf8::decode instead (replace all $test = Encode::decode("utf8", $test); with utf8::decode($test))
This gives almost identical output, only the result of length differs:
Length: 7
utf8 flag:
Unicode:
195 168 194 171 194 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
232 171 139 10
Hex:
c3a8c2abc28b0a
==============
Length: 4
utf8 flag: 1
Unicode:
35531 10
Hex:
e8ab8b0a
It seems like perl first counts the bytes before decoding (as expected), then counts the characters after the first decoding, but then counts the bytes again after the second decoding (not expected). Why would this switch happen? Is there a lapse in my understanding of how these decoding functions work?
Thanks,Matt
You are not supposed to use the functions from the utf8 pragma module. Its documentation says so:
Do not use this pragma for anything else than telling Perl that your script is written in UTF-8.
Always use the Encode module, and also see the question Checklist for going the Unicode way with Perl. unpack is too low-level, it does not even give you error-checking.
You are going wrong with the assumption that the octects E8 AB 86 0A are the result of UTF-8 double-encoding the characters 諆 and newline. This is the representation of a single UTF-8 encoding of these characters. Perhaps the whole confusion on your side stems from that mistake.
length is unappropriately overloaded, at certain times it determines the length in characters, or the length in octets. Use better tools such as Devel::Peek.
#!/usr/bin/env perl
use strict;
use warnings FATAL => 'all';
use Devel::Peek qw(Dump);
use Encode qw(decode);
my $test = "\x{00e8}\x{00ab}\x{0086}\x{000a}";
# or read the octets without implicit decoding from a file, does not matter
Dump $test;
# FLAGS = (PADMY,POK,pPOK)
# PV = 0x8d8520 "\350\253\206\n"\0
$test = decode('UTF-8', $test, Encode::FB_CROAK);
Dump $test;
# FLAGS = (PADMY,POK,pPOK,UTF8)
# PV = 0xc02850 "\350\253\206\n"\0 [UTF8 "\x{8ac6}\n"]
Turns out this was a bug: https://rt.perl.org/rt3//Public/Bug/Display.html?id=80190.

Why does 'chomp' fail to remove newlines on Windows XP with Eclipse and Cygwin Perl?

I'm running Windows XP, Eclipse 3.2 with EPIC and Cygwin for my Perl interpreter, and I get an unexpected result.
FYI... When I run it on my Ubuntu distribution (VMware, same pc) I get the expected results. Why?
############ CODE: #############
use warnings;
use strict;
my $test = "test";
my $input = <STDIN>;
print length $test, " ", length $input, "\n";
chomp $input;
print "|$test| |$input| \n"; #The bars indicate white space, new line, etc...
print length $test, " ", length $input, "\n";
if ($test eq $input) {
print "TIME TO QUIT";
}
Results on Windows XP:
test <-- My input
4 6 <-- Lengths printed before chomp
|test| |test <-- Print the variables after chomp
| <-- There is still a new line there
4 5 <-- Lengths after the initial chomp
Given that Windows XP figures in the problem, the difference must be due to CRLF (carriage return, line feed) handling. The chomp removes, it appears, the LF but not the CR; the print translates the CR into CR LF.
The Perl doc for chomp says that if you set the EOL correctly for Windows ($/ = "\r\n";), then chomp should do its stuff correctly:
$/ = "\r\n";
$test = "test\r\n";
print "<<$test>>\n";
chomp $test;
print "<<$test>>\n";
A hex dump of the output of that yields:
0x0000: 3C 3C 74 65 73 74 0D 0A 3E 3E 0A 3C 3C 74 65 73 <<test..>>.<<tes
0x0010: 74 3E 3E 0A t>>.
0x0014:
I'm not sure why $/ is not set automatically - it may be Cygwin confusing things (pretending too successfully it is running on Unix).
Based on the lengths, I'd say you're getting the input string as:
test<cr><lf>
where <cr> and <lf> are ASCII codes 0x13 and 0x10 respectively.
When you chomp it, it removes the <lf> but leaves the <cr> there.
It's almost certainly an interaction issue between Eclipse, Cygwin and Windows, disagreeing on what the end-of-line character sequence should be. I couldn't replicate your problem with just Perl/Cygwin or Perl/Windows but this command gives similar results (in Cygwin):
echo 'test^M' | perl qq.pl | sed 's/^M/\n/g'
(qq.pl is your script and "^M" is the actual CTRL-M). Here's the output in text form:
4 6
|test| |test
|
4 5
and octal dump:
0000000 2034 0a36 747c 7365 7c74 7c20 6574 7473
4 6 \n | t e s t | | t e s t
064 040 066 012 174 164 145 163 164 174 040 174 164 145 163 164
0000020 7c0a 340a 3520 000a
\n | \n 4 5 \n \0
012 174 012 064 040 065 012 000
0000027
So I'd say that your input is putting on both <cr> and <lf>, and the print is translating <cr> to <lf> (or just doing the same thing for both of them).
If you need a workaround for your environment, you can replace your chomp line with:
$input =~ s/\r?\n$//;
as in:
use warnings;
use strict;
my $test = "test";
my $input = <STDIN>;
print length $test ," ",length $input,"\n";
$input =~ s/\r?\n$//;
print "|$test| |$input|\n";
print length $test," ",length $input,"\n";
if ($test eq $input) {
print "TIME TO QUIT";
}
which works on Cygwin for the test data I used (check it for your own situation, of course), but you may find you can solve it better by using tools that all agree on the line end sequence (eg, Perl for Windows rather than the Cygwin one may do the trick for you).
Here is how to remove a trailing \r\n or \n (whichever is at the end):
$input =~ s#\r?\n\Z(?!\n)##;
Another option is to do a
binmode(STDIN, ':crlf')
before reading anything from STDIN. This would convert trailing \r\n to just a \n, which you can remove using chomp. This will also work even if your input contains only \n. See the documentation about PerlIO for more.