exists statement not working perl - perl

This is a homework assignment. I am not looking for the "code to make it work" more looking for a point in the right direction on where my logic is wrong.
use strict;
use warnings;
#rot13 sub for passwords
sub rot13{
my $result;
chomp(my $input = <STDIN>);``
# all has to be lower case
my $lower = lc $input;
my $leng = length $lower;
for(my $i = 0; $i < $leng; $i++){
my $temp = substr ($lower,$i,1);
my $con = ord $temp;
if($con >= '55'){
if($con >= '110'){
$con -= 13;
}
else{
$con += 13;
}
}
$result = $result . chr $con;
}
return $result;
};
#opening a file specified by the user for input and reading it
#into an array then closing file.
open FILE, $ARGV[0] or die "cannot open input.txt";
my #input = <FILE>;
close FILE;
my (#username,#password,#name,#uid,#shell,#ssn,#dir,#group,#gid);
my $ui = 100;
foreach(#input){
my ($nam, $ss, $gro) = split ('/', $_);
chomp ($gro);
$nam= lc $nam;
I created a hash so I can use the exists function then using the function and if it does exist go to the next round of the loop. I feel like I am missing something with this.
my %nacheck;
if( exists ($nacheck { '$nam' } )){
next;
}
$nacheck{ "$nam" } = 1;
while (my ($key, $value) = each %nacheck){
print "$key => $value\n";
}
All this works for now but any tips on how to do it better would be appreaciated
my($unf, $unm, $unl) = split (/ /, $nam);
$unf = (substr $unf,0,1);
$unm = (substr $unm,0,1);
$unl = (substr $unl,0,1);
my $un = $unf . $unm . $unl;
if(($gro) eq "faculty"){
push #username, $un;
push #gid, "1010";
push #dir, "/home/faculty/$un";
push #shell, "/bin/tcsh";
}
else{
my $lssn = substr ($ss,7,4);
push #username, $un . $lssn;
push #gid, "505";
push #dir, "/home/student/$un";
push #shell, "/bin/bash";
}
#pushing results onto global arrays to print out later
push #ssn, $ss;
my $pass = rot13;
push #password, $pass;
push #name, $nam;
push #uid, $ui += 1;
}
#printing results
for(my $i = 0; $i < #username; $i++){
print
"$username[$i]:$password[$i]:$uid[$i]:$gid[$i]:$name[$i]:$dir[$i]:$shell[$i]\n";
}

The value of the expression '$nam' is those four characters themselves. The value of the expression "$nam" is whatever the value of the variable $nam is, expressed as a string.
Double quotes allow string interpolation. Single quotes do not; you get exactly what you type.

As you've written it:
my %nacheck;
if( exists ($nacheck { '$nam' } )){
next;
}
$nacheck{ "$nam" } = 1;
the %nacheck is newly created and must be empty. Therefore the exists test fails.
Or have you just shown the definition adjacent to the test for the purpose of the example?
If so, can you show us what your code actually looks like?
Edit: Also, as Charles Engelke noted, you've used single-quotes around a variable '$nam' which is wrong.

Related

Hash incorrectly tracking counts, runtime long

I am working on a program in Perl and my output is wrong and taking forever to process. The code is meant to take in a large DNA sequence file, read through it in 15 letter increments (kmers), stepping forward 1 position at a time. I'm supposed to enter the kmer sequences into a hash, with their value being the number of incidences of that kmer- meaning each key should be unique and when a duplicate is found, it should increase the count for that particular kmer. I know from my Prof. expected output file, that I have too many lines, so it is allowing duplicates and not counting correctly. It's also running 5+ minutes, so I have to Ctrl+C to escape. When I go look at kmers.txt, the file is at least written and formatted correctly.
#!/usr/bin/perl
use strict;
use warnings;
use diagnostics;
# countKmers.pl
# Open file /scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta
# Identify all k-mers of length 15, load them into a hash
# and count the number of occurences of each k-mer. Each
# unique k-mer and its' count will be written to file
# kmers.txt
#Create an empty hash
my %kMersHash = ();
#Open a filehandle for the output file kmers.txt
unless ( open ( KMERS, ">", "kmers.txt" ) ) {
die $!;
}
#Call subroutine to load Fly Chromosome 2L
my $sequenceRef = loadSequence("/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta");
my $kMer = 15; #Set the size of the sliding window
my $stepSize = 1; #Set the step size
for (
#The sliding window's start position is 0
my $windowStart = 0;
#Prevent going past end of the file
$windowStart <= ( length($$sequenceRef) - $kMer );
#Advance the window by the step size
$windowStart += $stepSize
)
{
#Get the substring from $windowStart for length $kMer
my $kMerSeq = substr( $$sequenceRef, $windowStart, $kMer );
#Call the subroutine to iterate through the kMers
processKMers($kMerSeq);
}
sub processKMers {
my ($kMerSeq) = #_;
#Initialize $kCount with at least 1 occurrence
my $kCount = 1;
#If the key already exists, the count is
#increased and changed in the hash
if ( not exists $kMersHash{$kMerSeq} ) {
#The hash key=>value is loaded: kMer=>count
$kMersHash{$kMerSeq} = $kCount;
}
else {
#Increment the count
$kCount ++;
#The hash is updated
$kMersHash{$kMerSeq} = $kCount;
}
#Print out the hash to filehandle KMERS
for (keys %kMersHash) {
print KMERS $_, "\t", $kMersHash{$_}, "\n";
}
}
sub loadSequence {
#Get my sequence file name from the parameter array
my ($sequenceFile) = #_;
#Initialize my sequence to the empty string
my $sequence = "";
#Open the sequence file
unless ( open( FASTA, "<", $sequenceFile ) ) {
die $!;
}
#Loop through the file line-by-line
while (<FASTA>) {
#Assign the line, which is in the default
#variable to a named variable for readability.
my $line = $_;
#Chomp to get rid of end-of-line characters
chomp($line);
#Check to see if this is a FASTA header line
if ( $line !~ /^>/ ) {
#If it's not a header line append it
#to my sequence
$sequence .= $line;
}
}
#Return a reference to the sequence
return \$sequence;
}
Here's how I would write your application. The processKMers subroutine boils down to just incrementing a hash element, so I've removed that. I've also altered the identifiers to be match the snake_case that is more usual in Perl code, and I didn't see any point in load_sequence returning a reference to the sequence so I've changed it to return the string itself
use strict;
use warnings 'all';
use constant FASTA_FILE => '/scratch/Drosophila/dmel-2L-chromosome-r5.54.fasta';
use constant KMER_SIZE => 15;
use constant STEP_SIZE => 1;
my $sequence = load_sequence( FASTA_FILE );
my %kmers;
for (my $offset = 0;
$offset + KMER_SIZE <= length $sequence;
$offset += STEP_SIZE ) {
my $kmer_seq = substr $sequence, $start, KMER_SIZE;
++$kmers{$kmer_seq};
}
open my $out_fh, '>', 'kmers.txt' or die $!;
for ( keys %kmers ) {
printf $out_fh "%s\t%d\n", $_, $kmers{$_};
}
sub load_sequence {
my ( $sequence_file ) = #_;
my $sequence = "";
open my $fh, '<', $sequence_file or die $!;
while ( <$fh> ) {
next if /^>/;
chomp;
$sequence .= $_;
}
return $sequence;
}
Here's a neater way to increment a hash element without using ++ on the hash directly
my $n;
if ( exists $kMersHash{$kMerSeq} ) {
$n = $kMersHash{$kMerSeq};
}
else {
$n = 0;
}
++$n;
$kMersHash{$kMerSeq} = $n;
Everything looks fine in your code besides processKMers. The main issues:
$kCount is not persistent between calls to processKMers, so in your else statement, $kCount will always be 2
You are printing every time you call processKMers, which is what is slowing you down. Printing frequently slows down your process significantly, you should wait until the end of your program and print once.
Keeping your code mostly the same:
sub processKMers {
my ($kMerSeq) = #_;
if ( not exists $kMersHash{$kMerSeq} ) {
$kMersHash{$kMerSeq} = 1;
}
else {
$kMersHash{$kMerSeq}++;
}
}
Then you want to move your print logic to immediately after your for-loop.

Difficulty in finding error in perl script

Its a kind of bioinformatics concept but programmatic problem. I've tried a lot and at last I came here. I've reads like following.
ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC
Now what I want to do is, in a simplistic way,
take the last 6 residues of first read -> check if any other read is starting with those 6 residues, if yes add the last residue of that read to the first read -> again same with the 2nd read and so on.
Here is the code what I've tried so far.
#!/usr/bin/perl -w
use strict;
use warnings;
my $in = $ARGV[0];
open(IN, $in);
my #short_reads = <IN>;
my $first_read = $short_reads[0];
chomp $first_read;
my #all_end_res;
for(my $i=0; $i<=$#short_reads; $i++){
chomp $short_reads[$i];
my $end_of_kmers = substr($short_reads[$i], -6);
if($short_reads[$i+1] =~ /^$end_of_kmers/){
my $end_res = substr($short_reads[$i], -1);
push(#all_end_res, $end_res);
}
}
my $end_res2 = join('', #all_end_res);
print $first_read.$end_res2,"\n\n";
At the end I should get an output like ATGGAAGTCGCGGAATC but I'm getting ATGGAAGGTCGCGGAAT. The error must be in if, any help is greatly appreciated.
There are three huge problems in IT.
Naming of things.
Off by one error.
And you just hit the second one. The problem is in a way you think about this task. You think in way I have this one string and if next one overlap I will add this one character to result. But correct way to think in this case I have this one string and if it overlaps with previous string or what I read so far, I will add one character or characters which are next.
#!/usr/bin/env perl
use strict;
use warnings;
use constant LENGTH => 6;
my $read = <>;
chomp $read;
while (<>) {
chomp;
last unless length > LENGTH;
if ( substr( $read, -LENGTH() ) eq substr( $_, 0, LENGTH ) ) {
$read .= substr( $_, LENGTH );
}
else {last}
}
print $read, "\n";
I didn't get this ARGV[0] thing. It is useless and inflexible.
$ chmod +x code.pl
$ cat data
ATGGAAG
TGGAAGT
GGAAGTC
GAAGTCG
AAGTCGC
AGTCGCG
GTCGCGG
TCGCGGA
CGCGGAA
GCGGAAT
CGGAATC
$ ./code.pl data
ATGGAAGTCGCGGAATC
But you have not defined what should happen if data doesn't overlap. Should there be some recovery or error? You can be also more strict
last unless length == LENGTH + 1;
Edit:
If you like working with an array you should try avoid using for(;;). It is prone to errors. (BTW for (my $i = 0; $i < #a; $i++) is more idiomatic.)
my #short_reads = <>;
chomp #short_reads;
my #all_end_res;
for my $i (1 .. $#short_reads) {
my $prev_read = $short_reads[$i-1];
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($prev_read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
push #all_end_res, $1;
}
}
print $short_reads[0], join('', #all_end_res), "\n";
The performance and memory difference is negligible up to thousands of lines. Now you can ask why to accumulate characters into an array instead of accumulate it to string.
my #short_reads = <>;
chomp #short_reads;
my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
my $prev_read = $short_reads[$i-1];
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($prev_read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
Now the question is why to use $prev_read when you have $end_of_kmers inside of $read.
my #short_reads = <>;
chomp #short_reads;
my $read = $short_reads[0];
for my $i (1 .. $#short_reads) {
my $curr_read = $short_reads[$i+1];
my $end_of_kmers = substr($read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
Now you can ask why I need indexes at all. You just should remove the first line to work with the rest of array.
my #short_reads = <>;
chomp #short_reads;
my $read = shift #short_reads;
for my $curr_read (#short_reads) {
my $end_of_kmers = substr($read, -6);
if ( $curr_read =~ /^\Q$end_of_kmers(.)/ ) {
$read .= $1;
}
}
print "$read\n";
And with few more steps and tweaks you will end up with the code what I posted initially. I don't need an array at all because I look only to the current line and accumulator. The difference is in a way how you think about the problem. If you think in terms of arrays and indexes and looping or in terms of data flow, data processing and state/accumulator. With more experience, you don't have to do all those steps and make the final solution just due different approach to problem solving.
Edit2:
It is almost ten times faster using substr and eq then using regular expressions.
$ time ./code.pl data.out > data.test
real 0m0.480s
user 0m0.468s
sys 0m0.008s
$ time ./code2.pl data.out > data2.test
real 0m4.520s
user 0m4.516s
sys 0m0.000s
$ cmp data.test data2.test && echo OK
OK
$ wc -c data.out data.test
6717368 data.out
839678 data.test
with minor modification:
use warnings;
use strict;
open my $in, '<', $ARGV[0] or die $!;
chomp(my #short_reads = <$in>);
my $first_read = $short_reads[0];
my #all_end_res;
for(my $i=0; $i<=$#short_reads; $i++){
chomp $short_reads[$i];
my $end_of_kmers = substr($short_reads[$i], -6);
my ($next_read) = $short_reads[$i+1];
if( (defined $next_read) and ($next_read =~ /^\Q$end_of_kmers/)){
my $end_res = substr($next_read, -1);
push(#all_end_res, $end_res);
}
}
my $end_res2 = join('', #all_end_res);
print $first_read.$end_res2,"\n";
ATGGAAGTCGCGGAATC

How I can take the coordinates of a block of numbers?

I have a problem tha bothers me a lot...
I have a file with two columns (thanks to your help in a previous question) like:
14430001 0.040
14430002 0.000
14430003 0.990
14430004 1.000
14430005 0.050
14430006 0.490
....................
the first column is coordinates the second probabilities.
I am trying to find the blocks with probability >=0.990 and to be more than 100 in size.
As output I want to be like this:
14430001 14430250
14431100 14431328
18750003 18750345
.......................
where the first column has the coordinate of the start of each block and the second the end of it.
I wrote this script:
use strict;
#use warnings;
use POSIX;
my $scores_file = $ARGV[0];
#finds the highly conserved subsequences
open my $scores_info, $scores_file or die "Could not open $scores_file: $!";
#open(my $fh, '>', $coords_file) or die;
my $count = 0;
my $cons = "";
my $newcons = "";
while( my $sline = <$scores_info>) {
my #data = split('\t', $sline);
my $coord = $data[0];
my $prob = $data[1];
if ($data[1] >= 0.990) {
#$cons = "$cons + '\n' + $sline + '\n'";
$cons = join("\n", $cons, $sline);
# print $cons;
$count++;
if($count >= 100) {
$newcons = join("\n", $newcons, $cons);
my #array = split /'\n'/, $newcons;
print #array;
}
}
else {
$cons = "";
$count = 0;
}
}
It gives me the lines with probability >=0.990 (the first if works) but the coordinates are wrong. When Im trying to print it in a file it stacks, so I have only one sample to check it.
Im terrible sorry if my explanations aren't helpful, but I am new in programming.
Please, I need your help...
Thank you very much in advance!!!
You seem to be using too much variables. Also, after splitting the array and assigning its parts to variables, use the new variables rather than the original array.
sub output {
my ($from, $to) = #_;
print "$from\t$to\n";
}
my $threshold = 0.980; # Or is it 0.990?
my $count = 0;
my ($start, $last);
while (my $sline = <$scores_info>) {
my ($coord, $prob) = split /\t/, $sline;
if ($prob >= $threshold) {
$count++;
defined $start or $start = $coord;
$last = $coord;
} else {
output($start, $last) if $count > 100;
undef $start;
$count = 0;
}
}
output($start, $last) if $count > 100;
(untested)

Argument is not numeric error while comparing hash values based on keys

#!/usr/bin/perl
use strict;
use Data::Dumper;
use warnings;
my #mdsum;
open (IN1,"$ARGV[0]") || die "counldn't open";
open (MYFILE, '>>md5sum-problem.txt');
open (IN2, "mdsumfile.txt");
my %knomexl=();
my %knomemdsum = ();
my #arrfile ;
my $tempkey ;
my $tempval ;
my #values ;
my $val;
my $i;
my #newarra;
my $testxl ;
my $testmdsum;
while(<IN1>){
next if /barcode/;
#arrfile = split('\t', $_);
$knomexl{$arrfile[0]} = $arrfile[2];
}
while(<IN2>){
chomp $_;
#newarra = split(/ {1,}/, $_);
$tempval = $newarra[0];
$tempkey = $newarra[1];
$tempkey=~ s/\t*$//g;
$tempval=~ s/\s*$//g;
$tempkey=~s/.tar.gz//g;
$knomemdsum{$tempkey} = $tempval;
}
#values = keys %knomexl;
foreach $i(#values){
$testxl = $knomexl{$values[$i]};
print $testxl."\n";
$testmdsum = $knomemdsum{$values[$i]};
print $testmdsum."\n";
if ( $testxl ne $testmdsum ) {
if ($testxl ne ""){
print MYFILE "Files hasving md5sum issue $i\n";
}
}
}
close (MYFILE);
I have two files one both having File name and Mdsum values and I need to check that which all file's md5sum values are not matching so I understand that in some case where Value and corresponding values will not be their and I want those cases only. Any work around on this code ? Please. This code is pretty simple but don't know why it's not working!! :( :(
#values = keys %knomexl;
foreach $i(#values){
#print Dumper $knomexl{$values[$i]};
$testxl = $knomexl{$i};
print $testxl."\n";
$testmdsum = $knomemdsum{$i};
print $testmdsum."\n";
$i is an element of #values because of the foreach, not an index, so you shouldn't use $values[$i].

How do I average column values from a tab-separated data file, ignoring a header row and the left column?

My task is to compute averages from the following data file, titled Lab1_table.txt:
retrovirus genome gag pol env
HIV-1 9181 1503 3006 2571
FIV 9474 1353 2993 2571
KoRV 8431 1566 3384 1980
GaLV 8088 1563 3498 2058
PERV 8072 1560 3621 1532
I have to write a script that will open and read this file, read each line by splitting the contents into an array and computer the average of the numerical values (genome, gag, pol, env), and write to a new file the average from each of the aforementioned columns.
I've been trying my best to figure out how to not take into account the first row, or the first column, but every time I try to execute on the command line I keep coming up with 'explicit package name' errors.
Global symbol #average requires explicit package name at line 23.
Global symbol #average requires explicit package name at line 29.
Execution aborted due to compilation errors.
I understand that this involves # and $, but even knowing that I've not been able to change the errors.
This is my code, but I emphasise that I'm a beginner having started this just last week:
#!/usr/bin/perl -w
use strict;
my $infile = "Lab1_table.txt"; # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";
my $count = 0;
my $average = ();
while (<INFILE>) {
chomp;
my #columns = split /\t/;
$count++;
if ( $count == 1 ) {
$average = #columns;
}
else {
for( my $i = 1; $i < scalar $average; $i++ ) {
$average[$i] += $columns[$i];
}
}
}
for( my $i = 1; $i < scalar $average; $i++ ) {
print $average[$i]/$count, "\n";
}
I'd appreciate any insight, and I would also great appreciate letting me know by list numbering what you're doing at each step - if appropriate. I'd like to learn and it would make more sense to me if I was able to read through what someone's process was.
Here are the points you need to change
Use another variable for the headers
my $count = 0;
my #header = ();
my #average = ();
then change the logic inside if statement
if ( $count == 1 ) {
#header = #columns;
}
Now don't use the #average for the limit, use $i < scalar #columns for else statement.
Initially #average is zero, you will never get inside the for loop ever.
else {
for( my $i = 1; $i < scalar #columns; $i++ ) {
$average[$i] += $columns[$i];
}
}
Finally add -1 to your counter. Remember you increment your counter when you parse your header
for( my $i = 1; $i < scalar #average; $i++ ) {
print $average[$i]/($count-1), "\n";
}
Here is the final code
You can take advantage of #header to display the result neatly
#!/usr/bin/perl -w
use strict;
my $infile = "Lab1_table.txt"; # This is the file path
open INFILE, $infile or die "Can't open $infile: $!";
my $count = 0;
my #header = ();
my #average = ();
while (<INFILE>) {
chomp;
my #columns = split /\t/;
$count++;
if ( $count == 1 ) {
#header = #columns;
}
else {
for( my $i = 1; $i < scalar #columns; $i++ ) {
$average[$i] += $columns[$i];
}
}
}
for( my $i = 1; $i < scalar #average; $i++ ) {
print $average[$i]/($count-1), "\n";
}
There are other ways to write this code but I thought it would be better to just correct your code so that you can easily understand what is wrong with your code. Hope it helps