Perl program to look for k-mer with specific sequence

Perl program to look for k-mer with specific sequence - perl

I am trying to enhance a perl program I have previously written so that it recognizes top 1000 length 23 k-mers that ends with GG and print out the k-mers that only appears once in the sequence. However, no matter where I add the reg exp, I am unable to get the expected result.
The code I have:
#!/usr/bin/perl
use strict;
use warnings;
my $k = 23;
my $input = 'Fasta.fasta';
my $output = 'Fasta2.fasta';
my $match_count = 0;
#Open File
unless ( open( FASTA, "<", $input ) ) {
die "Unable to open fasta file", $!;
}
#Unwraps the FASTA format file
$/ = ">";
#Separate header and sequence
#Remove spaces
unless ( open( OUTPUT, ">", $output ) ) {
die "Unable to open file", $!;
}
<FASTA>; # discard 'first' 'empty' record
my %seen;
while ( my $line = <FASTA> ) {
chomp $line;
my ( $header, #seq ) = split( /\n/, $line );
my $sequence = join '', #seq;
for ( length($sequence) >= $k ) {
$sequence =~ m/([ACTG]{21}[G]{2})/g;
for my $i ( 0 .. length($sequence) - $k ) {
my $kmer = substr( $sequence, $i, $k );
##while ($kmer =~ m/([ACTG]{21}[G]{2})/g){
$match_count = $match_count + 1;
print OUTPUT ">crispr_$match_count", "\n", "$kmer", "\n" unless $seen{$kmer}++;
}
}
}
The input fasta file looks like this:
> >2L type=chromosome_arm; loc=2L:1..23011544; ID=2L; dbxref=REFSEQ:NT_033779,GB:AE014134; MD5=bfdfb99d39fa5174dae1e2ecd8a231cd; length=23011544; release=r5.54; species=Dmel;
CGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTGCCTCTCAT
TTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCGAGAGTAGT
GCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATGGTGGCGGA
TGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAATAAATTCA
TTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGAGAGTAGTG
CCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTATATTACCG
CAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCGGAGATATT
TAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGCCAACATAT
TGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTTTACCGCAA
ACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATATTTAGATTG
CCTCTCATTTTCTCTCCCATATTATAGGGAGAAATATGATCGCGTATGCG
AGAGTAGTGCCAACATATTGTGCTCTTTGATTTTTTGGCAACCCAAAATG
GTGGCGGATGAACGAGATGATAATATATTCAAGTTGCCGCTAATCAGAAA
TAAATTCATTGCAACGTTAAATACAGCACAATATATGATCGCGTATGCGA
GAGTAGTGCCAACATATTGTGCTAATGAGTGCCTCTCGTTCTCTGTCTTA
TATTACCGCAAACCCAAAAAGACAATACACGACAGAGAGAGAGAGCAGCG
GAGATATTTAGATTGCCTATTAAATATGATCGCGTATGCGAGAGTAGTGC
CAACATATTGTGCTCTCTATATAATGACTGCCTCTCATTCTGTCTTATTT
TACCGCAAACCCAAATCGACAATGCACGACAGAGGAAGCAGAACAGATAT
and so on...
The expected outcome (print out the 23k-mers with GG ending that only appear once in the sequence) I am hoping to get:
>crispr_1
GGGTGGAGCTCCCGAAATGCAGG
>crispr_2
TTAATAAATATTGACACAGCGGG
>crispr_3
ATCGTGGGGCGTTTTGTGAAAGG
>crispr_4
AGTTTTTCACATAATCAGACAGG
>crispr_5
GTGTTGGATGAGTGTCCTCTGGG
>crispr_6
ATAGGTTGGTTGTTTTAAAAGGG
>crispr_7
AAATTTTTGTTGCCACTGAATGG
>crispr_8
AAGTTTCGAACTACGATGGTTGG
>crispr_9
CATGCTTTGTGGAAATAAGTCGG
>crispr_10
CACAGTGGGTGTTTGCACCTCGG
.... and so on
The current code I did create a fasta file with following:
>crispr_1
CGACAATGCACGACAGAGGAAGC
>crispr_2
GACAATGCACGACAGAGGAAGCA
>crispr_3
ACAATGCACGACAGAGGAAGCAG
>crispr_4
CAATGCACGACAGAGGAAGCAGA
>crispr_5
AATGCACGACAGAGGAAGCAGAA
>crispr_6
ATGCACGACAGAGGAAGCAGAAC
>crispr_7
TGCACGACAGAGGAAGCAGAACA
>crispr_8
GCACGACAGAGGAAGCAGAACAG
>crispr_9
CACGACAGAGGAAGCAGAACAGA
>crispr_10
ACGACAGAGGAAGCAGAACAGAT
.... and so on
while if I remove the
for (length($sequence) >=$k){
$sequence =~m/([ACTG]{21}[G]{2})/g;
and add the ##while ($kmer =~ m/([ACTG]{21}[G]{2})/g){
while ($kmer =~ m/([ACTG]{21}[G]{2})/g){
I am getting fasta file (with results which is not numbered correctly and unable to distinguish between duplicated and unique sequences):
>crispr_1
CATTTTCTCTCCCATATTATAGG
>crispr_2
ATTTTCTCTCCCATATTATAGGG
>crispr_3
TATTGTGCTCTTTGATTTTTTGG
>crispr_4
GATTTTTTGGCAACCCAAAATGG
>crispr_5
TTTTTGGCAACCCAAAATGGTGG
>crispr_6
TTGGCAACCCAAAATGGTGGCGG
>crispr_7
ACGACAGAGAGAGAGAGCAGCGG
>crispr_8
AAATCGACAATGCACGACAGAGG
>crispr_91
TATTGTGATCTTCGATTTTTTGG
>crispr_93
TTTTTGGCAACCCAAAATGGAGG
.... and so on
I have attempted to move the regex around my code, but none of them generated the expected result. I do not know what I did wrong over here. I have not add the exit the program when count reaches 1000 into the code yet.
Thanks in advance!

I'm not sure I understand your question completely, but could the following be what you need.
<FASTA>; # discard 'first' 'empty' record
my %data;
while (my $line = <FASTA>){
chomp $line;
my($header, #seq) = split(/\n/, $line);
my $sequence = join '', #seq;
for my $i (0 .. length($sequence) - $k) {
my $kmer = substr($sequence, $i, $k);
$data{$kmer}++ if $kmer =~ /GG$/;
}
}
my $i = 0;
for my $kmer (sort {$data{$b} <=> $data{$a}} keys %data) {
printf "crispr_%d\n%s appears %d times\n", ++$i, $kmer, $data{$kmer};
last if $i == 1000;
}
Some output on a file I have is:
crispr_1
ggttttccggcacccgggcctgg appears 4 times
crispr_2
ccgagctgggcgagaagtagggg appears 4 times
crispr_3
gccgagctgggcgagaagtaggg appears 4 times
crispr_4
gcacccgggcctgggtggcaggg appears 4 times
crispr_5
agcagcgggatcgggttttccgg appears 4 times
crispr_6
gctgggcgagaagtaggggaggg appears 4 times
crispr_7
cccttctgcttcagtgtgaaagg appears 4 times
crispr_8
gtggcagggaagaatgtgccggg appears 4 times
crispr_9
gatcgggttttccggcacccggg appears 4 times
crispr_10
tgagggaaagtgctgctgctggg appears 4 times
crispr_11
agctgggcgagaagtaggggagg appears 4 times
. . . .
ggcacccgggcctgggtggcagg appears 4 times
crispr_50
gaatctctttactgcctggctgg appears 4 times
crispr_51
accacaacattgacagttggtgg appears 2 times
crispr_52
caacattgacagttggtggaggg appears 2 times
crispr_53
catgctcatcgtatctgtgttgg appears 2 times
crispr_54
gattaatgaagtggttattttgg appears 2 times
crispr_55
gaaaccacaacattgacagttgg appears 2 times
crispr_56
aacattgacagttggtggagggg appears 2 times
crispr_57
gacttgatcgattaatgaagtgg appears 2 times
crispr_58
acaacattgacagttggtggagg appears 2 times
crispr_59
gaaccatatattgttatcactgg appears 2 times
crispr_60
ccacagcgcccacttcaaggtgg appears 1 times
crispr_61
ctgctcctgggtgtgagcagagg appears 1 times
crispr_62
ccatatattatctgtggtttcgg appears 1 times
. . . .
Update
To get the results you mentioned in your comment (below), replace the output code with:
my $i = 1;
while (my ($kmer, $count) = each %data) {
next unless $count == 1;
print "crispr_$i\n$kmer\n";
last if $i++ == 1000;
}
To answer my own comment to get first 1000.
<FASTA>; # discard 'first' 'empty' record
my %seen;
my #kmers;
while (my $line = <FASTA>){
chomp $line;
my($header, #seq) = split(/\n/, $line);
my $sequence = join '', #seq;
for my $i (0 .. length($sequence) - $k) {
my $kmer = substr($sequence, $i, $k);
if ($kmer =~ /GG$/) {
push #kmers, $kmer unless $seen{$kmer}++;
}
}
}
my $i = 1;
for my $kmer (#kmers) {
next unless $seen{$kmer} == 1;
print "crispr_$i\n$kmer\n";
last if $i++ == 1000;
}
Answer To check for uniqueness of final 12 chars ending in GG, the code below achieves that.
if ($kmer =~ /(.{10}GG)$/) {
my $substr = $1;
push #kmers, $kmer unless $seen{$substr}++;
}
my $i = 1;
for my $kmer (#kmers) {
my $substr = substr $kmer, -12;
next unless $seen{$substr} == 1;
print "crispr_$i\n$kmer\n";
last if $i++ == 1000;
}

Actually, this code line
$sequence =~m/([ACTG]{21}[G]{2})/g;
this line is just for the regex match, if you try to print this $sequence, it will surely print out the original result.
please add the code segement like this
if($sequence =~/([ACTG]{21}[G]{2}$)/g)
{
}#please remember to match the end with $.
BTW,It looks like the multiple for loop to parse this data is not very reasonable, the parse speed is without the best-efficiency.

Related

How can I calculate the geometric center of a protein in Perl?

I have a PDB file which contains information about a specific protein. One of the information it holds is the positions of the different atoms (xyz coordinates).
The file is the following https://files.rcsb.org/view/6U9D.pdb . With this file I want to calculate the geometric center of the atoms. In theory I know what I need to do, but the script I wrote does not seem to work.
The first part of the program, before the for loop, is another part of the assignment which requires me to read the sequence and convert it from the 3 letter nomenclature to the 1 letter one. The part that interests me is the for loop until the end. I tried to pattern match in order to isolate the XYZ coordinates. Then I used a counter that I had set up in the beginning which is the $k variable. When I check the output on cygwin the only output I get is the sequence 0 0 0 instead of the sum of each dimension divided by $k. Any clue what has gone wrong?
$k=0;
open (IN, '6U9D.pdb.txt');
%amino_acid_conversion = (
ALA=>'A',TYR=>'Y',MET=>'M',LEU=>'L',CYS=>'C',
GLY=>'G',ARG=>'R',ASN=>'N',ASP=>'D',GLN=>'Q',
GLU=>'E',HIS=>'H',TRP=>'W',LYS=>'K',PHE=>'F',
PRO=>'P',SER=>'S',THR=>'T',ILE=>'I',VAL=>'V'
);
while (<IN>) {
if ($_=~m/HEADER\s+(.*)/){
print ">$1\n"; }
if ($_=~m/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/){
$seq.=$1;
$seq=~s/ //g;
}
}
for ($i=0;$i<=length $seq; $i+=3) {
print "$amino_acid_conversion{substr($seq,$i,3)}";
if ($_=~m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)/) {
$x+=$1; $y+=$2; $z+=$3; $k++;
}
}
print "\n";
#print $k;
$xgk=($x/$k); $ygk=($y/$k); $zgk=($z/$k);
print "$xgk $ygk $zgk \n";

I do not know bioinformatics but it seems like you should do something like this:
use feature qw(say);
use strict;
use warnings;
my $fn = '6U9D.pdb';
open ( my $IN, '<', $fn ) or die "Could not open file '$fn': $!";
my $seq = '';
my $x = 0;
my $y = 0;
my $z = 0;
my $k = 0;
while (<$IN>) {
if ($_ =~ m/HEADER\s+(.*)/) {
say ">$1";
}
if ($_=~m/^SEQRES\s+\d+\s+\w+\s+\d+\s+(.*)/){
$seq .= $1;
}
if ($_ =~ m/^ATOM\s+\d+\s+\w+\s+\w+\s+\w+\s+\d+\s+(\S+)\s+(\S+)\s+(\S+)/) {
$x+=$1; $y+=$2; $z+=$3; $k++;
}
}
close $IN;
$seq =~ s/ //g;
my %amino_acid_conversion = (
ALA=>'A',TYR=>'Y',MET=>'M',LEU=>'L',CYS=>'C',
GLY=>'G',ARG=>'R',ASN=>'N',ASP=>'D',GLN=>'Q',
GLU=>'E',HIS=>'H',TRP=>'W',LYS=>'K',PHE=>'F',
PRO=>'P',SER=>'S',THR=>'T',ILE=>'I',VAL=>'V'
);
my %unknown_keys;
my $conversion = '';
say "Sequence length: ", length $seq;
for (my $i=0; $i < length $seq; $i += 3) {
my $key = substr $seq, $i, 3;
if (exists $amino_acid_conversion{$key}) {
$conversion.= $amino_acid_conversion{$key};
}
else {
$unknown_keys{$key}++;
}
}
say "Conversion: $conversion";
say "Unknown keys: ", join ",", keys %unknown_keys;
say "Number of atoms: ", $k;
my $xgk=($x/$k);
my $ygk=($y/$k);
my $zgk=($z/$k);
say "Geometric center: $xgk $ygk $zgk";
This gives me the following output:
[...]
Number of atoms: 76015
Geometric center: 290.744642162734 69.196842162731 136.395842938893

Hash incorrectly tracking counts, runtime long

I am working on a program in Perl and my output is wrong and taking forever to process. The code is meant to take in a large DNA sequence file, read through it in 15 letter increments (kmers), stepping forward 1 position at a time. I'm supposed to enter the kmer sequences into a hash, with their value being the number of incidences of that kmer- meaning each key should be unique and when a duplicate is found, it should increase the count for that particular kmer. I know from my Prof. expected output file, that I have too many lines, so it is allowing duplicates and not counting correctly. It's also running 5+ minutes, so I have to Ctrl+C to escape. When I go look at kmers.txt, the file is at least written and formatted correctly.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# countKmers.pl
# Open file /scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta
# Identify all k-mers of length 15, load them into a hash
# and count the number of occurences of each k-mer. Each
# unique k-mer and its' count will be written to file
# kmers.txt
#Create an empty hash
my %kMersHash = ();
#Open a filehandle for the output file kmers.txt
unless ( open ( KMERS, ">", "kmers.txt" ) ) {
die $!;
}
#Call subroutine to load Fly Chromosome 2L
my $sequenceRef = loadSequence("/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta");
my $kMer = 15; #Set the size of the sliding window
my $stepSize = 1; #Set the step size
for (
#The sliding window's start position is 0
my $windowStart = 0;
#Prevent going past end of the file
$windowStart <= ( length($$sequenceRef) - $kMer );
#Advance the window by the step size
$windowStart += $stepSize
)
{
#Get the substring from $windowStart for length $kMer
my $kMerSeq = substr( $$sequenceRef, $windowStart, $kMer );
#Call the subroutine to iterate through the kMers
processKMers($kMerSeq);
}
sub processKMers {
my ($kMerSeq) = #_;
#Initialize $kCount with at least 1 occurrence
my $kCount = 1;
#If the key already exists, the count is
#increased and changed in the hash
if ( not exists $kMersHash{$kMerSeq} ) {
#The hash key=>value is loaded: kMer=>count
$kMersHash{$kMerSeq} = $kCount;
}
else {
#Increment the count
$kCount ++;
#The hash is updated
$kMersHash{$kMerSeq} = $kCount;
}
#Print out the hash to filehandle KMERS
for (keys %kMersHash) {
print KMERS $_, "\t", $kMersHash{$_}, "\n";
}
}
sub loadSequence {
#Get my sequence file name from the parameter array
my ($sequenceFile) = #_;
#Initialize my sequence to the empty string
my $sequence = "";
#Open the sequence file
unless ( open( FASTA, "<", $sequenceFile ) ) {
die $!;
}
#Loop through the file line-by-line
while (<FASTA>) {
#Assign the line, which is in the default
#variable to a named variable for readability.
my $line = $_;
#Chomp to get rid of end-of-line characters
chomp($line);
#Check to see if this is a FASTA header line
if ( $line !~ /^>/ ) {
#If it's not a header line append it
#to my sequence
$sequence .= $line;
}
}
#Return a reference to the sequence
return \$sequence;
}

Here's how I would write your application. The processKMers subroutine boils down to just incrementing a hash element, so I've removed that. I've also altered the identifiers to be match the snake_case that is more usual in Perl code, and I didn't see any point in load_sequence returning a reference to the sequence so I've changed it to return the string itself
use strict;
use warnings 'all';
use constant FASTA_FILE => '/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta';
use constant KMER_SIZE => 15;
use constant STEP_SIZE => 1;
my $sequence = load_sequence( FASTA_FILE );
my %kmers;
for (my $offset = 0;
$offset + KMER_SIZE <= length $sequence;
$offset += STEP_SIZE ) {
my $kmer_seq = substr $sequence, $start, KMER_SIZE;
++$kmers{$kmer_seq};
}
open my $out_fh, '>', 'kmers.txt' or die $!;
for ( keys %kmers ) {
printf $out_fh "%s\t%d\n", $_, $kmers{$_};
}
sub load_sequence {
my ( $sequence_file ) = #_;
my $sequence = "";
open my $fh, '<', $sequence_file or die $!;
while ( <$fh> ) {
next if /^>/;
chomp;
$sequence .= $_;
}
return $sequence;
}
Here's a neater way to increment a hash element without using ++ on the hash directly
my $n;
if ( exists $kMersHash{$kMerSeq} ) {
$n = $kMersHash{$kMerSeq};
}
else {
$n = 0;
}
++$n;
$kMersHash{$kMerSeq} = $n;

Everything looks fine in your code besides processKMers. The main issues:
$kCount is not persistent between calls to processKMers, so in your else statement, $kCount will always be 2
You are printing every time you call processKMers, which is what is slowing you down. Printing frequently slows down your process significantly, you should wait until the end of your program and print once.
Keeping your code mostly the same:
sub processKMers {
my ($kMerSeq) = #_;
if ( not exists $kMersHash{$kMerSeq} ) {
$kMersHash{$kMerSeq} = 1;
}
else {
$kMersHash{$kMerSeq}++;
}
}
Then you want to move your print logic to immediately after your for-loop.

How to identify nth lines of n files in while<>

I have a code which adds all vectors in all files.
There can be any number of input files. For example first input file is:
0.55 0 0.3335 1.2
0.212 0 2.2025 1
and the second one is:
0.25 0 0.3333 1.0
0.1235 0 0.2454 1
What I get is the sum of all vectors, thus in result i get one vector
which is:
1.13550 0 3.1147 4.2
But I'm trying to sum the first vector of the first file with the first vector of the second file and so on. In result according to this example I should get 2 vectors.
For now I have this:
use strict;
use warnings;
if ($ARGV[0] ne "vector1.dat"){
die ("vector1.dat is necessary as first argument");
}
my #sum = 0;
my $dim = 0;
while (<>) {
#Ignore blank lines, hashtags
#and lines starting with $
if ($_ =~ /#/ || $_ =~ /^$/ || $_ =~ /^\s$/){
next;
}
my #vectors = split(" ", $_);
my $vector_length = #vectors;
if ($dim eq 0) {
$dim = $vector_length;
}
else {
if ($dim ne $vector_length) {
die ("Vector dimensions do not match. : $!");
}
}
for (my $i = 0; $i <= $#vectors; $i++) {
$sum[$i] += $vectors[$i];
}
}
$" = "\t\t";
print "\n --- \n #sum \n";
What I need is just to find out how to identify each file's nth line
and to sum the column values of those lines while keeping in mind, that there can be n number of files.
I saw filehandling question over here with similar issue, however
I didn't find my answer there.
Just looking for some suggestions and guidance. Got stuck on this.

Open each file yourself and use the $. variable to know which line you are on (or count the files yourself). Here's the basic structure:
foreach my $file ( #files ) {
open my $fh, '<', $file or die ...;
while( <$fh> ) {
chomp;
$sum[ $. ] = ...; # $. is the line number
}
}
If you don't like $., you can use its longer name. You have to turn on English (which comes with Perl):
use English;
## use English qw( -no_match_vars ); # for v5.16 and earlier
foreach my $file ( #files ) {
open my $fh, '<', $file or die ...;
while( <$fh> ) {
chomp;
$sum[ $INPUT_LINE_NUMBER ] = ...;
}
}
Or, you can count yourself, which might be handy if the vectors in the files don't line up by strict line number (perhaps because of comments or some other formatting oddity):
foreach my $file ( #files ) {
open my $fh, '<', $file or die ...;
my $line = -1;
while( <$fh> ) {
$line++;
chomp;
$sum[ $line ] = ...;
}
}
The harder way is the answer bart gives which inspects eof at the end of every line to see if the magical ARGV handle is looking at a new file, and resetting $. if it is. It's an interesting trick but hardly anyone is going to understand what it's doing (or even notice it).
For the other part of the problem, I think you're doing the vector sum wrong, or using confusing variable names. A line is a vector, and the numbers in the lines are a component. A two dimensional array will work. The first index is the line number and the second in the component index:
while( <$fh> ) {
chomp;
... skip unwanted lines
my #components = split;
... various dimension checks
foreach my $i ( 0 .. $#components ) {
$sum[ $. ][ $i ] += $components[ $i ];
}
}
The Data::Dumper module is handy for complex data structures. You can also see the perldsc (Perl Data Structures Cookbook) documentation. The $. variable is found in perlvar.

$. is the line number of the most recently read file handle. close(ARGV) if eof; can be used to reset the file number between files (as documented in eof). (Note: eof() is different than eof.) So you now have line numbers.
The second problem you have is that you are adding vector components ($vectors[$i]) to a vectors ($sum[$i]). You need to add vector components to vectors components. Start by using more appropriate variable names.
This is what we get:
my #sum_vectors;
while (<>) {
s/#.*//; # Remove comments.
next if /^\s*$/; # Ignore blank lines.
my #vector = split;
if ($sum_vectors[$.] && #{ $sum_vectors[$.] } != #vector) {
die("$ARGV:$.: Vector dimensions do not match\n");
}
for my $i (0..$#vector) {
$sum_vectors[$.][$i] += $vector[$i];
}
} continue {
close(ARGV) if eof; # Reset line numbers for each file.
}
Two other errors fixed:
$! did not contain anything meaningful when you used it.
You ignored lines that contain comments, even if they contained valid data too.

Try this:
#!/usr/bin/perl
use strict;
use warnings;
if ($ARGV[0] ne "vector1.dat"){
die ("vector1.dat is necessary as first argument");
}
my %sum;
my $dim = 0;
my $vector_length;
my $line_number;
while (<>) {
#Ignore blank lines, hashtags
#and lines starting with $
if ($_ =~ /#/ || $_ =~ /^$/ || $_ =~ /^\s$/){
next;
}
my #vectors = split(" ", $_);
$vector_length = #vectors;
if ($dim eq 0) {
$dim = $vector_length;
}
else {
if ($dim ne $vector_length) {
die ("Vector dimensions do not match. : $!");
}
}
for (my $i = 0; $i <= $#vectors; $i++) {
$sum{$.}{$i} += $vectors[$i];
}
$line_number = $.;
$. = 0 if eof;
}
$" = "\t\t";
for (my $line=1; $line<=$line_number; $line++)
{
print $line;
for (my $vector=0; $vector<$vector_length; $vector++)
{
print " " . $sum{$line}{$vector};
}
print "\n";
}

Perl hash does not print value if it begins with 2 or 22 under certain conditions

This is really frustrating me. The script I'm writing is indexing coordinates in a hash and then using those index numbers to pull out values from an array.
The weird thing is that if the value begins with 2 or 22 it will not print. Any other number works. I'll show you two variations and output of the script.
First variation. This is what I want the script to do. Print chromosome, position, value.
#!/usr/bin/perl
use strict;
use warnings;
use File::Find;
use Scalar::Util qw(looks_like_number);
open IN, "/home/big/scratch/affy_map.txt" or die "Cannot open reference\n";
my %ref;
my $head = <IN>;
my $index = 0;
while(<IN>){
chomp $_;
my #row = split /\t/, $_;
my $value = join "\t", $row[1],$row[2];
if($row[1] == 2 && $row[2] <= 50000 && $row[2] <= 51113178) { $ref{$index}=$value; print $index."\t".$value."\n";}
if($row[1] == 22 && $row[2] <= 16300001 && $row[2] <= 20500000) { $ref{$index}=$value; print $index."\t".$value."\n"; }
$index++;
}
close(IN);
my #files;
my $masterDirect = "/nfs/archive02/big/Norm/norm_gcc/";
find(\&file_names, $masterDirect);
sub file_names {
if( -f && $File::Find::name=~/\.nzd$/)
{
push #files, $File::Find::name;
}
}
my $count=0;
foreach(#files){
$count++;
if($count % 100 == 0 ){ print "\n","-" x 10, " $count ", "-" x 10,"\n";}
undef my #probes;
open IN, $_;
#file name handling
my #inDir = split "\/", $_;
my $id = pop(#inDir);
$id =~ s/\.gcc.nzd$//;
#header test
$head =<IN>;
if(looks_like_number($head)) { push #probes, $head; }
#open output
open OUT, ">/home/big/scratch/phase1_affy/".$id."_select_probeset.txt";
#load probe array
#probes = <IN>;
close(IN);
foreach my $key (sort keys %ref){
#intended function
print OUT $ref{$key}."\t".$probes[$key];
#testing
my #temp = split "\t", $ref{$key};
foreach(#temp){if($temp[0] == 2){print $key."\t".$ref{$key}."\t".$probes[$key];}}
}
close(OUT);
}
Here's the output for the test. The printing from the reference file is flawless. The first number is the $key or index number. The second is frome $probes[$key] why is the $ref{$key} missing?
146529 0.777314368326637
146529 0.777314368326637
146530 0.116241153901913
146530 0.116241153901913
146531 0.940593233609167
146531 0.940593233609167
Variation 2.
...
foreach my $key (sort keys %ref){
print OUT $ref{$key}."\t".$probes[$key];
my #temp = split "\t", $ref{$key};
foreach(#temp){if($temp[0] == 2){print $key."\t".$ref{$key}."\n";}}
}
And its output. See now it's printing correctly. $key and $ref{$key}
146542 2 31852
146542 2 31852
146543 2 37693
146543 2 37693
146544 2 40415
146544 2 40415
146545 2 40814
I thought it might be a DOS->UNIX file problem but I performed perl -pi -e 's/\R/\n/g' input_files.txt for all the input the script sees. It prints the same value twice because there are two elements in the #temp array. I'm really at a loss right now.

Here is a hint for possible issue. In the beginning part,
if($row[1] == 2 && $row[2] <= 50000 && $row[2] <= 51113178) { $ref{$index}=$value; print $index."\t".$value."\n";}
Note that you used two "<=" for $row[2], which looks peculiar. The next line has such "problem" too. Please double check it first otherwise you may have filtered them out in the first place.

How to parse through tab-delimited file in perl?

I'm new to Perl, and I've hit a mental roadblock. I need to extract information from a tab delimited file as shown below.
#name years risk total
adam 5 100 200
adam 5 50 100
adam 10 20 300
bill 20 5 100
bill 30 10 800
In this example, the tab delimited file shows length of investment, amount of money risked, and total at the end of investment.
I want to parse through this file, and for each name (e.g. adam), calculate sum of years invested 5+5, and calculate sum of earnings (200-100) + (100-50) + (300-20). I also would like to save the totals for each name (200, 100, 300).
Here's what I have tried so far:
my $filename;
my $seq_fh;
open $seq_fh, $frhitoutput
or die "failed to read input file: $!";
while (my $line = <$seq_fh>) {
chomp $line;
## skip comments and blank lines and optional repeat of title line
next if $line =~ /^\#/ || $line =~ /^\s*$/ || $line =~ /^\+/;
#split each line into array
my #line = split(/\s+/, $line);
my $yeartotal = 0;
my $earning = 0;
#$line[0] = name
#$line[1] = years
#$line[2] = start
#$line[3] = end
while (#line[0]){
$yeartotal += $line[1];
$earning += ($line[3]-$line[2]);
}
}
Any ideas of where I went wrong?

The Text::CSV module can be used to read tab-delimited data. Often much nicer than trying to manually hack yourself something up with split and so on when it comes to things like quoting, escaping, etc..

You're wrong here : while(#line[0]){
I'd do:
my $seq_fh;
my %result;
open($seq_fh, $frhitoutput) || die "failed to read input file: $!";
while (my $line = <$seq_fh>) {
chomp $line;
## skip comments and blank lines and optional repeat of title line
next if $line =~ /^\#/ || $line =~ /^\s*$/ || $line =~ /^\+/;
#split each line into array
my #line = split(/\s+/, $line);
$result{$line[0]}{yeartotal} += $line[1];
$result{$line[0]}{earning} += $line[3] - $line[2];
}

You should use hash, something like this:
my %hash;
while (my $line = <>) {
next if $line =~ /^#/;
my ($name, $years, $risk, $total) = split /\s+/, $line;
next unless defined $name and defined $years
and defined $risk and defined $total;
$hash{$name}{years} += $years;
$hash{$name}{risk} += $risk;
$hash{$name}{total} += $total;
$hash{$name}{earnings} += $total - $risk;
}
foreach my $name (sort keys %hash) {
print "$name earned $hash{$name}{earnings} in $hash{$name}{years}\n";
}

Nice opportunity to explore Perl's powerful command line options! :)
Code
Note: this code should be a command line oneliner, but it's a little bit easier to read this way. When writing it in a proper script file, you really should enable strict and warnings and use a little bit better names. This version won't compile under strict, you have to declare our $d.
#!/usr/bin/perl -nal
# collect data
$d{$F[0]}{y} += $F[1];
$d{$F[0]}{e} += $F[3] - $F[2];
# print summary
END { print "$_:\tyears: $d{$_}{y},\tearnings: $d{$_}{e}" for sort keys %d }
Output
adam: years: 20, earnings: 430
bill: years: 50, earnings: 885
Explanation
I make use of the -n switch here which basically lets your code iterate over the input records (-l tells it to use lines). The -a switch lets perl split the lines into the array #F. Simplified version:
while (defined($_ = <STDIN>)) {
chomp $_;
our(#F) = split(' ', $_, 0);
# collect data
$d{$F[0]}{y} += $F[1];
$d{$F[0]}{e} += $F[3] - $F[2];
}
%d is a hash with the names as keys and hashrefs as values, which contain years (y) and earnings (e).
The END block is executed after finishing the input line processing and outputs %d.
Use O's Deparse to view the code which is actually executed:
book:/tmp memowe$ perl -MO=Deparse tsv.pl
BEGIN { $/ = "\n"; $\ = "\n"; }
LINE: while (defined($_ = <ARGV>)) {
chomp $_;
our(#F) = split(' ', $_, 0);
$d{$F[0]}{'y'} += $F[1];
$d{$F[0]}{'e'} += $F[3] - $F[2];
sub END {
print "${_}:\tyears: $d{$_}{'y'},\tearnings: $d{$_}{'e'}" foreach (sort keys %d);
}
;
}
tsv.pl syntax OK

It seems like a fixed-width file, I would use unpack for that

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse

Perl program to look for k-mer with specific sequence - perl

Related

How can I calculate the geometric center of a protein in Perl?

Hash incorrectly tracking counts, runtime long

How to identify nth lines of n files in while<>

Perl hash does not print value if it begins with 2 or 22 under certain conditions

How to parse through tab-delimited file in perl?

Categories

Resources