How to calculate frequency of characters in a FASTA file in Perl

I'm trying to calculate the percentage of certain characters in a string from a file that is in FASTA format. The file looks like this:
>label
sequence
>label
sequence
>label
sequence
I'm trying to calculate the percentage of specific characters (e.g. G's) from the "sequence" strings.
After calculating that (which I have been able to do), I'm trying to print a sentence that says: "The percentage of G's in (e.g.) label 1 is (e.g.) 53%".
So my question really is, how do I do a calculation on the sequence strings and then name each one in its corresponding output by the label above it?
The code I have so far works out the percentage but I have no way of identifying it.
#!/usr/bin/perl
use strict;
# opens file
my $infile = "Lab1_seq.fasta.txt";
open INFILE, $infile or die "$infile: $!\n";
# reads each line
while (my $line = <INFILE>){
chomp $line;
#creates an array
my @seq = split (/>/, $line);
# Calculates percent
if ($line !~ />/){
my $G = ($line =~ tr/G//);
my $C = ($line =~ tr/C//);
my $total = $G + $C;
my $length = length($line);
my $percent = ($total / $length) * 100;
#prints the percentage of G's and C's for label is x%
print "The percentage of G's and C's for @seq[1] is $percent\n";
}
else{
}
}
close INFILE
It spits out this output (below) when I'm really trying to get it to also say the name of each label that corresponds to the sequence
The percentage of G's and C's for is 53.4868841970569
The percentage of G's and C's for is 52.5443110348771
The percentage of G's and C's for is 50.8746355685131

You just need to match your label and save that in a variable:
my $label;
# reads each line
while (my $line = <INFILE>){
...
if ($line =~ />(.*)/){
$label = $1;
# Calculates percent
} else{
...
print "The percentage of G's and C's for $label, @seq[1] is $percent\n";
}
}
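Putting the idea above together, a minimal self-contained sketch might look like this. It assumes a sequence may span several lines under one label, and it reads from an in-memory string rather than Lab1_seq.fasta.txt so that it runs on its own. The helper names read_fasta and gc_percent and the sample data are made up for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Returns the percentage of G and C characters in a sequence string.
sub gc_percent {
    my ($seq) = @_;
    return 0 unless length $seq;          # avoid dividing by zero
    my $gc = ($seq =~ tr/GCgc//);         # tr returns the number of matches
    return 100 * $gc / length($seq);
}

# Reads FASTA text from a filehandle and returns a list of [label, sequence] pairs.
sub read_fasta {
    my ($fh) = @_;
    my @records;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        if ($line =~ /^>(.*)/) {
            push @records, [$1, ''];      # a header line starts a new record
        }
        elsif (@records) {
            $records[-1][1] .= $line;     # sequence lines extend the current record
        }
    }
    return @records;
}

# Hypothetical sample data; in real use, open the FASTA file instead.
my $fasta = ">label1\nGGCC\nAATT\n>label2\nATAT\n";
open my $fh, '<', \$fasta or die "Cannot open in-memory data: $!";
for my $rec (read_fasta($fh)) {
    printf "The percentage of G's and C's for %s is %.2f%%\n",
        $rec->[0], gc_percent($rec->[1]);
}
```

Keeping the label and the sequence together in one record means the print statement never has to guess which label belongs to which sequence.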

Related

can't loop through the whole thing to start at the beginning after it shows your results

I am really new to Perl and I am writing a program that gives you the unique words in a text file. However, I don't know how to make it loop to ask the user for another file or to quit the program altogether.
I tried to put my whole code under a do-until loop and it did not work.
use 5.18.0;
use warnings;
use strict;
print "Enter the name of the file: ";
my %count;
my $userinput = <>; #the name of the text file the user wants to read
chomp($userinput); #take out the newline character
my $linenumb = $ARGV[1];
my $uniqcount = 0;
#opens the file if it is readable
open(FH, '<:encoding(UTF-8)', $userinput) or die "Could not open file '$userinput' $!";
print "Summary of file '$userinput': \n";
my ($lines, $wordnumber, $total) = (0, 0, 0);
my @words = ();
my $count =1;
while (my $line = <FH>) {
$lines++;
my @words = split (" ", $line);
$wordnumber = @words;
print "\n Line $lines : $wordnumber ";
$total = $total+$wordnumber;
$wordnumber++;
}
print "\nTotal no. of words in file are $total \n";
#my @uniq = uniq @words;
#print "Unique Names: " .scalar @uniq . "\n";
close(FH);

It's often a good idea to put complicated pieces of your code into subroutines so that you can forget (temporarily) how the details work and concentrate on the bigger picture.
I'd suggest that you have two obvious subroutines here that might be called get_user_input() and process_file(). Putting the code into subroutines might look like this:
sub get_user_input {
print "Enter the name of the file: ";
my $userinput = <>; #the name of the text file the user wants to read
chomp($userinput); #take out the newline character
return $userinput;
}
sub process_file {
my ($file) = @_;
#opens the file if it is readable
# Note: Changed to using a lexical filehandle.
# This will automatically be closed when the
# lexical variable goes out of scope (i.e. at
# the end of this subroutine).
open(my $fh, '<:encoding(UTF-8)', $file)
or die "Could not open file '$file' $!";
print "Summary of file '$file': \n";
# Removed $lines variable. We'll use the built-in
# variable $. instead.
# Moved declaration of $wordnumber inside the loop.
# Removed @words and $count variables that aren't used.
my $total = 0;
# Removed $line variable. We'll use $_ instead.
while (<$fh>) {
# With no arguments, split() defaults to
# behaving as split ' ', $_.
# When assigned to a scalar, split() returns
# the number of elements in the split list
# (which is what we want here - we never actually
# use the list of words).
my $wordnumber = split;
print "\n Line $. : $wordnumber ";
# $x += $y is a shortcut for $x = $x + $y.
$total += $wordnumber;
$wordnumber++;
}
print "\nTotal no. of words in file are $total \n";
}
And then you can plug them together with code something like this:
# Get the first filename from the user
my $filename = get_user_input();
# While the user hasn't typed 'q' to quit
while ($filename ne 'q') {
# Process the file
process_file($filename);
# Get another filename from the user
$filename = get_user_input();
}
Update: I've cleaned up the process_file() subroutine a bit and added comments about the changes I've made.

Wrap everything in a neverending loop and conditionally jump out of it.
while () {
my $prompt = …
last if $prompt eq 'quit';
… # file handling goes here
}
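As a rough, runnable sketch of that loop-and-exit pattern: the subroutine names count_words and run_prompt_loop are hypothetical, and the demo feeds scripted input from an in-memory string so it does not block waiting for a user. In real use you would pass \*STDIN instead.

```perl
use strict;
use warnings;

# Counts whitespace-separated words in a file; returns undef if it cannot be opened.
sub count_words {
    my ($file) = @_;
    open my $fh, '<', $file or return;
    my $total = 0;
    while (my $line = <$fh>) {
        my @words = split ' ', $line;
        $total += @words;                 # array in numeric context gives its size
    }
    close $fh;
    return $total;
}

# Prompts repeatedly until the user types 'quit' or the input ends.
# Returns the number of files that were processed successfully.
sub run_prompt_loop {
    my ($in) = @_;
    my $processed = 0;
    while (1) {
        print "Enter the name of the file (or 'quit'): ";
        my $name = <$in>;
        last unless defined $name;        # end of input
        chomp $name;
        last if $name eq 'quit';          # conditional jump out of the loop
        my $total = count_words($name);
        if (defined $total) {
            print "Total no. of words in '$name': $total\n";
            $processed++;
        }
        else {
            print "Could not open '$name': $!\n";
        }
    }
    return $processed;
}

# Scripted input standing in for a user typing at the keyboard.
my $scripted = "no_such_file_for_this_sketch.txt\nquit\n";
open my $in, '<', \$scripted or die "Cannot open scripted input: $!";
run_prompt_loop($in);
```

Reporting a failed open and continuing, rather than calling die, keeps one mistyped filename from ending the whole session.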

The output of a subroutine is returning 0

I have written a script which uses a subroutine to call percentage of nucleotides in a given sequence. When I run the script the output for each nucleotide percentage is always shown to be zero.
Here's my code;
#!/usr/bin/perl
use strict;
use warnings;
#### Subroutine to report percentage of each nucleotide in DNA sequence ####
my $input = $ARGV[0];
my $nt = $ARGV[1];
my $args = $#ARGV +1;
if($args != 2){
print "Error!!! Insufficient number of arguments\n";
print "Usage: $0 <input fasta file>\n";
}
my($FH, $line);
open($FH, '<', $input) || die "Could\'nt open file: $input\n";
$line = do{
local $/;
<$FH>;
};
$line =~ s/>(.*)//g;
$line =~ s/\s+//g;
my $perc = perc_nucleotide($line , $nt);
printf("The percentage of $nt nucleotide in given sequence is %.0f", $perc);
print "\n";
sub perc_nucleotide {
my($line, $nt) = @_;
print "$nt\n";
my $count = 0;
if( $nt eq "A" || $nt eq "T" || $nt eq "G" || $nt eq "C"){
$count++;
}
my $total_len = length($line);
my $perc = ($count/$total_len)*100;
}
I think that I am setting the $count variable wrong. I tried different ways but can't figure it out.
This is the input file
>XM_024894547.1 Trichoderma citrinoviride Redoxin (BBK36DRAFT_1163529), partial mRNA
ATGGCCTTCCGTCTCCCTCTGCGCCGCATTGCCCTGGCCCGCCCCGCCACCGTTGCGCGTGGCTTCCACT
CGACGCCCCGCGCCCTGGTCAAGGTCGGCGACGAGGTCCCGAGCTTGGAGCTGTTCGAGAAGTCGGCCGC
CAGCAAGATCAACCTGGCCGACGAGTTCAAGAAGGGCGACGGCTACATTGTCGGCGTCCCGGGCGCCTTC
TCCGGCACCTGCTCCGGCACCCACGTCCCGTCGTACATCAACCACCCTGACATCAAGACGGCCGGCCAGG
TCTTTGTCGTCTCCGTCAACGACCCCTTTGTCATGAAGGCTTGGGCAGACCAGCTGGATCCCGCCGGAGA
GACAGGAATCCGGTTCGTTGCCGACCCCACGGCTGAGTTCACAAAGGCTCTGGAACTGGGATTCGACGAC
GCTGCTCCTCTGTTCGGAGGCACCCGAAGCAAGCGCTATGCTCTCAAGGTTAAGGATGGCAAGGTCACTG
CCGCCTTTGTTGAGCCCGACAACACGGGCACTTCCGTGTCAATGGCCGACAAGGTCCTCAGCTAA

The problem is here:
my $perc = perc_nucleotide($line , $nt);
printf("The percentage of $nt nucleotide in given sequence is %.0f", $perc);
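perc_nucleotide is returning 0.18018018018018, but the format %.0f says to print it with no decimal places, so it gets rounded to 0. You should probably use something more like %.2f. Note also that $count is only ever incremented once (when $nt is one of A, T, G or C), so the subroutine computes 1/length rather than the actual frequency of $nt in the sequence.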
It's also worth noting that perc_nucleotide does not have a return. It still works, but for reasons that might not be obvious.
perc_nucleotide sets my $perc = ($count/$total_len)*100; but never uses that $perc. The $perc in the main program is a different variable.
perc_nucleotide does return something: every Perl subroutine without an explicit return returns the "last evaluated expression". In this case it's my $perc = ($count/$total_len)*100; but the last-evaluated-expression rules can get a bit tricky.
It's easier to read and safer to have an explicit return. return ($count/$total_len)*100;
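For illustration, here is one way the subroutine could be written with an explicit return and a real count. This is only a sketch: it counts matches with a regex in list context (a swapped-in technique, not what the question's code does), and the short test sequence is made up.

```perl
use strict;
use warnings;

# Returns the percentage of characters in $seq that are equal to $nt.
sub perc_nucleotide {
    my ($seq, $nt) = @_;
    my $len = length $seq;
    return 0 if $len == 0;                    # avoid dividing by zero
    # Assigning the matches to an empty list and then to a scalar yields the match count.
    my $count = () = $seq =~ /\Q$nt\E/g;
    return 100 * $count / $len;               # explicit return value
}

# Hypothetical five-letter sequence: 2 of 5 characters are A.
printf "The percentage of A is %.2f%%\n", perc_nucleotide("AACGT", "A");
```

With %.2f this prints 40.00%, whereas %.0f would print 40 and would turn any value below 0.5 into 0.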

I corrected the script and it gave me right answers.
#!/usr/bin/perl
use strict;
use warnings;
##### Subroutine to calculate percentage of all nucleotides in a DNA sequence #####
my $input = $ARGV[0];
my $nt = $ARGV[1];
my $args = $#ARGV + 1;
if($args != 2){
print "Error!!! Insufficient number of arguments\n";
print "Usage: $0 <input_fasta_file> <nucleotide>\n";
}
my($FH, $line);
open($FH, '<', $input) || die "Couldn\'t open input file: $input\n";
$line = do{
local $/;
<$FH>;
};
chomp $line;
#print $line;
$line =~ s/>(.*)//g;
$line =~ s/\s+//g;
#print "$line\n";
my $total_len = length($line);
my $perc_of_nt = perc($line, $nt);
printf("The percentage of nucleotide $nt in a given sequence is %.2f%%", $perc_of_nt);
print "\n";
#print "$total_len\n";
sub perc{
my($line, $nt) = @_;
my $char; my $count = 0;
foreach $char (split //, $line){
if($char eq $nt){
$count += 1;
}
}
return (($count/$total_len)*100)
}
The answer for the above input file is:
Total_len = 555
The percentage of nucleotide A in a given sequence is 18.02%
The percentage of nucleotide T in a given sequence is 18.74%
The percentage of nucleotide G in a given sequence is 28.47%
The changes I made are the printf format, the counting loop in perc, and the explicit return.
Thanks for amazing insight!!!

Counting and printing location of duplicate words in a line using Perl

I am trying to read from a file and print out the location of duplicate words on each line. I have stored each line in an array, but I am not sure if this is the right way to start.
while (my $fileLine = <$fh>){
my @lineWords = split /\s+/, $fileLine;
print "#\n"
}

#!/usr/bin/perl
use strict;
use warnings;
while (<DATA>){
chomp; # remove end of line chars
my @wordsInLine = split /\s+/, $_;
@wordsInLine = map {lc($_)} @wordsInLine; # convert words to lowercase
my( $word, %wordsInLine, $n );
for $word (@wordsInLine) {
$wordsInLine{$word}++; # use hash %wordsInLine to count occurrences of words
}
for $word (@wordsInLine) {
$n++;
if( (my $count = $wordsInLine{$word}||0) > 1 ) {
print "line $.: Word $n \"$word\" is repeated $count times\n";
delete($wordsInLine{$word}); # do not generate more than one report
# about the same word in single line
}
}
}
__DATA__
This this is a sample sentence
A that That THAT !
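The script above reports where each repeated word first appears. If every position is wanted, one possible variation is sketched below. The function name duplicate_positions and the sample sentence are invented for the example, and words are compared case-insensitively as in the answer.

```perl
use strict;
use warnings;

# Returns a hash ref mapping each repeated (lowercased) word
# to the list of its 1-based positions in the line.
sub duplicate_positions {
    my ($line) = @_;
    my %positions;
    my $n = 0;
    for my $word (split ' ', lc $line) {
        push @{ $positions{$word} }, ++$n;    # remember every position of every word
    }
    # Drop words that occur only once.
    delete $positions{$_} for grep { @{ $positions{$_} } < 2 } keys %positions;
    return \%positions;
}

my $dups = duplicate_positions("This this is a sample this");
for my $word (sort keys %$dups) {
    print "\"$word\" appears at positions @{ $dups->{$word} }\n";
}
```

A hash of arrays keeps both the count (the array length) and the locations, so no second pass over the line is needed.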

Trying to write a specific result to a new outfile

I am extremely new to the Perl process. I am very much enjoying the learning curve and Perl but I am frustrated beyond belief and have spent many, many hours on one task achieving little to no results.
#!/usr/bin/perl
use strict;
print "Average value of retroviruses for the length of each genome and each of the genes:\n"; #create a title for the script
my $infile = "Lab1_table.txt"; # This is the file path.
open INFILE, $infile or die "Can't open $infile: $!"; # Provides an error message if the file can't be found.
# set my initial values.
my $tally = 0;
my @header = ();
my @averages = ();
# create my first loop to run through the file by line.
while (my $line = <INFILE>){
chomp $line;
print "$line\n";
# add one to the loop and essentially remove the header line of value.
# the first line is what was preventing me from calculating averages as Perl can't calculate words.
my @row = split /\t/, $line; # split the file by tab characters.
$tally++; #adds one to the tally.
if ( $tally == 1 ) { #if the tally = 1 the row is determined as the header.
@header = @row;
}
# if the tally is anything else besides 1 then it will read those rows.
else {
for( my $i = 1; $i < scalar @row; $i++ ) {
$averages[$i] += $row[$i];
}
foreach my $element (@row){
}
foreach my $i (0..4){
$averages[$i] = $averages[$i] + $row[1..4];
}
}
}
print "Average values of genome, gag, pol and env:\n";
for( my $i = 1; $i < scalar @averages; $i++ ) { # this line is used to determine the averages of the columns and print the values
print $averages[$i]/($tally-1), "\n";
}
So, I got the results to come up with what I wanted (not in the exact format I wanted but as close as I can seem to get at the moment) and they do average the columns.
The issue now is writing to an outfile. I am trying to get my table and results from the previous code to appear in my outfile. I get a good file name but no results.
foreach my $i (1){
my $outfile= "Average_values".".txt";
open OUTFILE, ">$outfile" or die "$outfile: $!";
print "Average values of genome, gag, pol and env:\n";
}
close OUTFILE;
close INFILE;
I feel like there is an easy way to do this and a hard way and I have taken the very hard way. Any help would be much appreciated.

You did not tell Perl where to print:
print OUTFILE "Average values of genome, gag, pol and env:\n";
BTW, together with use strict, also use warnings. And for working with files, use lexical filehandles and the three argument form of open:
open my $FH, '>', $filename or die $!;
print $FH 'Something';
close $FH or die $!;
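As a sketch of how the averaging and the output step could fit together with lexical filehandles: the column names and numbers below are invented, column_averages is a hypothetical helper, and both the input and the output are in-memory strings so the example runs without Lab1_table.txt. To write a real file, pass a filename such as 'Average_values.txt' to open in place of \$report.

```perl
use strict;
use warnings;

# Averages every column except the first of a tab-separated table,
# treating the first line as the header.
# Returns (array ref of column names, array ref of averages).
sub column_averages {
    my ($fh) = @_;
    my $header_line = <$fh>;
    return ([], []) unless defined $header_line;
    chomp $header_line;
    my @header = split /\t/, $header_line;
    my @sums;
    my $rows = 0;
    while (my $line = <$fh>) {
        chomp $line;
        next if $line eq '';
        my @row = split /\t/, $line;
        $sums[$_ - 1] += $row[$_] for 1 .. $#row;   # skip the name column
        $rows++;
    }
    my @avg = map { $_ / $rows } @sums;
    return ([ @header[1 .. $#header] ], \@avg);
}

# Hypothetical table standing in for the real input file.
my $table = "name\tgenome\tgag\nvirusA\t10\t2\nvirusB\t20\t4\n";
open my $in, '<', \$table or die "Cannot read table: $!";
my ($names, $avg) = column_averages($in);
close $in;

# Every print names the output handle explicitly.
my $report = '';
open my $out, '>', \$report or die "Cannot open output: $!";
print {$out} "Average values of @$names:\n";
print {$out} "$names->[$_]\t$avg->[$_]\n" for 0 .. $#$names;
close $out or die "Cannot close output: $!";

print $report;
```

Writing print {$out} ... makes the destination visible at each call, which is the step the original code was missing.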

How to multiply the numbers in csv file per line and add together

I need a way to take the numbers in one line in my .csv file and multiply them together, and then add the products from each line together to get just one number. My .csv file looks something like:
1,1
2,3
3,4
I know the answer should be 19, but I'm not sure how exactly to program it in Perl. I have both numbers split into different variables by:
($x,$y) = split (/,/, $line)
I've already read the file in and all that, I just need help with this one part of my code.
If anyone could point me in the right direction I would really appreciate it.

A naive solution could look like this:
use strict;
use warnings FATAL => 'all';
my $total = 0;
open(my $fh, '<', "temp.csv") or die "Could not open temp.csv: $!";
while( my $line = <$fh> ) {
my ($x, $y) = split(',', $line);
$total += ($x * $y);
}
print "Total is: $total\n";

In short form
perl -F, -anE'$s+=$F[0]*$F[1]}{say$s'

my $sum = 0;
open my $csv, '<', $filename or die $!;
while(my $line = <$csv>) {
my $prod = 1;
$prod *= $_ for split ',', $line;
$sum += $prod;
}
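To check the approach against the sample data in the question, here is a small self-contained sketch. The subroutine name sum_of_products is made up, and the CSV is read from an in-memory string so the example runs as is.

```perl
use strict;
use warnings;

# Multiplies the comma-separated numbers on each line and sums the per-line products.
sub sum_of_products {
    my ($fh) = @_;
    my $sum = 0;
    while (my $line = <$fh>) {
        chomp $line;
        next unless $line =~ /\S/;        # skip blank lines
        my $prod = 1;
        $prod *= $_ for split /,/, $line;
        $sum += $prod;
    }
    return $sum;
}

# The three lines from the question: 1*1 + 2*3 + 3*4 = 19.
my $csv = "1,1\n2,3\n3,4\n";
open my $fh, '<', \$csv or die "Cannot open in-memory data: $!";
print "Total is: ", sum_of_products($fh), "\n";
```

This prints "Total is: 19", matching the expected answer, and it also handles lines with more than two numbers.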
