Unable to remove newlines from Ensembl FASTA - perl

I'm trying to find protein motifs from an Ensembl FASTA file. I've gotten the bulk of the script done, such as retrieving the sequence ID and the sequence itself, but I am receiving some funny results.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $motif1 = qr/(HE(\D)(\D)H(\D{18})E)/x;
my $motif2 = qr/(AMEN)/x;
my $input;
my $output;
my $count_total = 0;
my $count_processed = 0;
my $total_run = 0;
my $id;
my $seq;
my $motif1_count = 0;
my $motif2_count = 0;
my $motifboth_count = 0;
############################################################################################################################
# FILEHANDLING - INPUT/OUTPUT
# User input prompting and handling
print "**********************************************************\n";
print "Question 3\n";
print "**********************************************************\n";
#opens the user input file previously assigned to variable to new variable or kills script.
open my $fh, '<', "chr2.txt" || die "Error! Cannot open file:$!\n";
#Opens and creates output file previously assigned to variable to new variable or kills script
#open(RESULTS, '>', $output)||die "Error! Cannot create output file:$!\n";
# FILE and DATA PROCESSING
############################################################################################################################
while (<$fh>) {
if (/^>(\S+)/) {
$count_total = ++$count_total; # Plus one to count
find_motifs($id, $seq) if $seq; # Passing to subroutine
$id = substr($1, 0, 15); # Taking only the first 16 characters for the id
$seq = '';
}
else {
chomp;
$seq .= $_;
}
}
print "Total proteins: $count_total \n";
print "Proteins with both motifs: $motifboth_count \n";
print "Proteins with motif 1: $motif1_count \n";
print "Proteins with motif 2: $motif2_count \n";
exit;
######################################################################################################################################
# SUBROUTINES
#
# Takes passed variables from special array
# Finds the position of motif within seq
# Checks for motif 1 presence and if found, checks for motif 2. If not found, prints motif 1 results
# If no motif 1, checks for motif 2
sub find_motifs {
my ($id, $seq) = @_;
if ($seq =~ $motif1) {
my $motif_position = index $seq, $1;
my $motif = $1;
if ($seq =~ $motif2) {
$motif1_count = ++$motif1_count;
$motif2_count = ++$motif2_count;
$motifboth_count = ++$motifboth_count;
print "$id, $motif_position, \n$motif \n";
}
else {
$motif1_count = ++$motif1_count;
print "$id, $motif_position,\n $motif\n\n";
}
}
elsif ($seq =~ $motif2) {
$motif2_count = ++$motif2_count;
}
}
What is happening is that if the motif is found at the end of one line of data and the beginning of the next one, it will return the motif with the newline in the data. This method of slurping in data has worked well before.
Sample Results:
ENSG00000119013, 6, HEHGHHKMELPDYRQWKIEGTPLE (CORRECT!)
ENSG00000142327, 123, HEVAHSWFGNAVTNATWEEMWLSE (CORRECT!)
ENSG00000151694, 410, **AECAPNEFGAEHDPDGL**
This is the problem. The motif matches but returns the first half, the newline, then prints the second half on the same line as well (which is a symptom of the larger problem - Getting rid of the newline!)
Total proteins: 13653
Proteins with both motifs: 1
Proteins with motif 1: 12
Proteins with motif 2: 22
I've tried different methods such as $seq =~ s/\r//g or s/\n//g, at different places within the script.

It's not clear from your description, but "prints the second half on the same line as well" sounds like your output is overlaid on itself because it has a carriage-return character at the end.
This happens when you run on a Linux system and chomp a line that came from a Windows file: chomp removes only the trailing linefeed, so the carriage return stays in the string.
You should replace chomp with s/\s+\z// which will remove all trailing whitespace. And because both carriage return and linefeed count as "whitespace" it will remove all possible termination characters.
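A minimal sketch of the difference, assuming a made-up sequence line with a Windows (CRLF) ending. The helper name strip_eol is hypothetical and only wraps the substitution suggested above:

```perl
use strict;
use warnings;

# Hypothetical helper: remove every trailing whitespace character,
# including "\r" and "\n", from a copy of the line.
sub strip_eol {
    my ($line) = @_;
    $line =~ s/\s+\z//;
    return $line;
}

my $windows_line = "HEHGHHKMELPD\r\n";    # made-up line with a CRLF ending

my $chomped = $windows_line;
chomp $chomped;                            # removes only "\n" (default $/)

printf "after chomp:        %d characters\n", length $chomped;
printf "after substitution: %d characters\n", length strip_eol($windows_line);
```

The chomped copy is one character longer because it still ends in "\r", which is what makes a terminal overwrite the start of the line when the string is printed.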
By the way, you are misunderstanding the purpose of the ++ operator. It modifies the contents of the variable it is applied to, so all you need is ++$motif1_count etc. Your code works as it is because the operator also returns the value of the incremented variable, so $motif1_count = ++$motif1_count first increments the variable and then assigns it to itself.
Also, you use \D in your regex. Are you aware that this matches any non-digit character? It seems a very vague classification to be useful.
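One way to make the pattern more specific is sketched here, assuming the sequences contain only the 20 standard one-letter amino-acid codes. That assumption, the helper name find_motif1, and the test sequence are illustrative and not taken from the question's data:

```perl
use strict;
use warnings;

# Assumption: only the 20 standard one-letter amino-acid codes occur.
my $aa     = qr/[ACDEFGHIKLMNPQRSTVWY]/;
my $motif1 = qr/HE(?:$aa){2}H(?:$aa){18}E/;

# Hypothetical helper: return (position, matched text), or an empty list
# when the motif is absent.
sub find_motif1 {
    my ($seq) = @_;
    return unless $seq =~ /($motif1)/;
    return ( $-[1], $1 );    # $-[1] is the start offset of capture group 1
}

my ( $pos, $hit ) = find_motif1('MKTAYIHEHGHHKMELPDYRQWKIEGTPLEAA');    # made-up sequence
print "$pos, $hit\n" if defined $pos;
```

Unlike \D, this class cannot match a stray "\r" or "\n", so a line ending left in the sequence makes the match fail visibly instead of ending up inside the reported motif.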

Related

Write a Perl script that takes in a fasta and reverses all the sequences (without BioPerl)?

I don't know if this is just a quirk with Strawberry Perl, but I can't seem to get it to run. I just need to take a FASTA file and reverse every sequence in it.
-The problem-
I have a multifasta file:
>seq1
ABCDEFG
>seq2
HIJKLMN
and the expected output is:
>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
The script is here:
$NUM_COL = 80; ## set the column width of output file
$infile = shift; ## grab input sequence file name from command line
$outfile = "test1.txt"; ## name output file, prepend with “REV”
open (my $IN, $infile);
open (my $OUT, '>', $outfile);
$/ = undef; ## allow entire input sequence file to be read into memory
my $text = <$IN>; ## read input sequence file into memory
print $text; ## output sequence file into new decoy sequence file
my @proteins = split (/>/, $text); ## put all input sequences into an array
for my $protein (@proteins) { ## evaluate each input sequence individually
$protein =~ s/(^.*)\n//m; ## match and remove the first descriptive line of
## the FASTA-formatted protein
my $name = $1; ## remember the name of the input sequence
print $OUT ">REV$name\n"; ## prepend with #REV#; a # will help make the
## protein stand out in a list
$protein =~ s/\n//gm; ## remove newline characters from sequence
$protein = reverse($protein); ## reverse the sequence
while (length ($protein) > $NUM_C0L) { ## loop to print sequence with set number of cols
$protein =~ s/(.{$NUM_C0L})//;
my $line = $1;
print $OUT "$line\n";
}
print $OUT "$protein\n"; ## print last portion of reversed protein
}
close ($IN);
close ($OUT);
print "done\n";
This will do as you ask.
It builds a hash %fasta out of the FASTA file, with an array @keys to keep the sequences in order, and then prints out each element of the hash.
Each line of the sequence is reversed using reverse before it is added to the hash, and using unshift adds the lines of the sequence in reverse order.
The program expects the input file as a parameter on the command line, and prints the result to STDOUT, which may be redirected on the command line.
use strict;
use warnings 'all';
my (%fasta, @keys);
{
    my $key;
    while ( <> ) {
        chomp;
        if ( s/^>\K/REV/ ) {
            $key = $_;
            push @keys, $key;
        }
        elsif ( $key ) {
            unshift @{ $fasta{$key} }, scalar reverse;
        }
    }
}
for my $key ( @keys ) {
    print $key, "\n";
    print "$_\n" for @{ $fasta{$key} };
}
output
>REVseq1
GFEDCBA
>REVseq2
NMLKJIH
Update
If you prefer to rewrap the sequence so that short lines are at the end, then you just need to rewrite the code that dumps the hash.
This alternative uses the length of the longest line in the original file as the limit, and rewraps the reversed sequence to the same length. It's clear that it would be simple to specify an explicit length instead of calculating it.
You will need to add use List::Util 'max' at the top of the program.
my $len = max map length, map @$_, values %fasta;
for my $key ( @keys ) {
    print $key, "\n";
    my $seq = join '', @{ $fasta{$key} };
    print "$_\n" for $seq =~ /.{1,$len}/g;
}
Given the original data the output is identical to that of the solution above. I used this as input
>seq1
ABCDEFGHI
JKLMNOPQRST
UVWXYZ
>seq2
HIJKLMN
OPQRSTU
VWXY
with this result. All lines have been wrapped to eleven characters - the length of the longest JKLMNOPQRST line in the original data
>REVseq1
ZYXWVUTSRQP
ONMLKJIHGFE
DCBA
>REVseq2
YXWVUTSRQPO
NMLKJIH
I don't know if this is just for a class that uses toy datasets or actual research FASTAs that can be gigabytes in size. If the latter, it would make sense not to keep the whole data set in memory as both your program and Borodin's do but read it one sequence at a time, print that out reversed and forget about it. The following code does that and also deals with FASTA files that may have asterisks as sequence-end markers as long as they start with >, not ;.
#!/usr/bin/perl
use strict;
use warnings;
my $COL_WIDTH = 80;
my $sequence = '';
my $seq_label;
sub print_reverse {
my $seq_label = shift;
my $sequence = reverse shift;
return unless $sequence;
print "$seq_label\n";
for(my $i=0; $i<length($sequence); $i += $COL_WIDTH) {
print substr($sequence, $i, $COL_WIDTH), "\n";
}
}
while(my $line = <>) {
chomp $line;
if($line =~ s/^>/>REV/) {
print_reverse($seq_label, $sequence);
$seq_label = $line;
$sequence = '';
next;
}
$line = substr($line, 0, -1) if substr($line, -1) eq '*';
$sequence .= $line;
}
print_reverse($seq_label, $sequence);

Perl script to print out all the lines containing a keyword and the line below it

I need to write a perl script to search for a keyword in a large file and then print all the lines containing the keyword plus the line below each keyword to a new file.
In the original file, there are multiple lines (the exact number varies) below each keyword-containing line. I already have a script that makes the variable number of lines to equal 1. I need this functionality to remain in the script and build upon it.
I found out that I could use grep to extract the lines, but this requires running the script I already have first and then using the grep command. I'd really need to have these functions to be combined into one.
Any help is much appreciated!
Here is the script I have so far:
use strict;
open (FILE, $ARGV[0]) or die ("Cannot open file");
my $name;
my $sequence;
while (my $line = <FILE>) {
chomp ($line);
if (substr ($line, 0, 1) eq ">") {
if ($sequence ne "") {
printf ("%s\n%s\n", $name, $sequence);
}
$name = $line;
$sequence = "";
} else {
$sequence .= $line;
}
}
if ($sequence ne "") {
printf ("%s\n%s\n", $name, $sequence);
}
And an example of the original file:
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGW
In this example, the keyword would be "FRG3G". The keyword is always in the same place, the characters before it vary, but the structure is the same.
If you have only 1 line to print after the keyword line, you can just remember if you found the keyword and then print the line like this:
my $matched = 0;
while (<FILE>) {
    if (m/$keyword/) {
        print;                  # the keyword line itself
        $matched = 1;
    }
    else {
        print if $matched;      # the line directly below a keyword line
        $matched = 0;
    }
}
If you can detect the end of the lines you want to print somehow, you can adjust the code above instead of just hard-coding it to print 1 line.
Redirect to a new file as needed.
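The two steps (joining the sequence lines and filtering by keyword) can also be combined in one pass. The following is only a sketch under the assumption that every record starts with a ">" header line; the sub name print_matching and the command-line layout are hypothetical:

```perl
use strict;
use warnings;

# Hypothetical helper: read FASTA records from $in, join each record's
# sequence lines into one line, and print to $out only the records whose
# header contains $keyword.
sub print_matching {
    my ( $keyword, $in, $out ) = @_;
    my ( $name, $sequence );
    my $flush = sub {
        print {$out} "$name\n$sequence\n"
            if defined $name && $name =~ /\Q$keyword\E/;
    };
    while ( my $line = <$in> ) {
        $line =~ s/\s+\z//;                  # strip "\n" and any "\r"
        if ( $line =~ /^>/ ) {
            $flush->();                      # finish the previous record
            ( $name, $sequence ) = ( $line, '' );
        }
        elsif ( defined $name ) {
            $sequence .= $line;
        }
    }
    $flush->();                              # finish the last record
}

# Assumed usage: perl filter.pl FRG3G input.fasta > output.fasta
if ( @ARGV == 2 ) {
    my ( $keyword, $file ) = @ARGV;
    open my $in, '<', $file or die "Cannot open $file: $!";
    print_matching( $keyword, $in, \*STDOUT );
    close $in;
}
```

Only one record is held in memory at a time, so the size of the input file does not matter.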

Using Perl's SEEK to jump to a line in a file and continuing to read the file

My goal is to open a file containing a single column of fixed length (1 character = 2 bytes on my Mac), and then to read the lines of the file into an array, beginning and ending at specified points. The file is very long, so I am using the seek command to jump to the appropriate starting line of the file. The file is a chromosomal sequence, arranged as a single column. I am successfully jumping to the appropriate point in the file, but I am having trouble reading the sequence into the array.
my @seq = (); # to contain the stretch of sequence I am seeking to retrieve from file.
my $from_bytes = 2*$from - 2; # specifies the "start point" in terms of bytes.
seek( SEQUENCE, $from_bytes, 0 );
my $from_base = <SEQUENCE>;
push ( @seq, $from_base ); # script is going to the correct line and retrieving correct base.
my $count = $from + 1; # here I am trying to continue the read into @seq
while ( <SEQUENCE> ) {
    if ( $count = $to ) { # $to specifies the line at which to stop
        last;
    }
    else {
        push( @seq, $_ );
        $count++;
        next;
    }
}
print "seq is: @seq\n\n"; # script prints only the first base
It seems you are reading fixed width records, consisting of $to lines, and each line has 2 bytes (1 char + 1 newline). As such you can simply read each chromosome sequence with a single read. A short example:
use strict;
use warnings;
use autodie;
my $record_number = $ARGV[0];
my $lines_per_record = 4; # change to the correct value
my $record_length = $lines_per_record * 2;
my $offset = $record_length * $record_number;
my $fasta_test = "fasta_test.txt";
if (open my $SEQUENCE, '<', $fasta_test) {
my $sequence_string;
seek $SEQUENCE, $offset, 0;
my $chars_read = read($SEQUENCE, $sequence_string, $record_length);
if ($chars_read) {
my @seq = split /\n/, $sequence_string; # if you want it as an array
$sequence_string =~ s/\n//g; # if you want the chromosome sequence as a single string without newlines
print $sequence_string, "\n";
} else {
print STDERR "Failed to read record $record_number!\n";
}
close $SEQUENCE;
}
With more information one could probably present a better solution.
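For the range described in the question (lines $from to $to, one base per line), a sketch along the same lines might look as follows. It assumes every line is exactly 2 bytes (one base plus "\n"); the sub name read_bases and the in-memory test data are made up. Note also that the question's loop tests $count = $to, which is an assignment and therefore true for any non-zero $to, so the loop stops on its first pass; a comparison needs ==.

```perl
use strict;
use warnings;

# Hypothetical helper: return the bases on lines $from..$to (1-based,
# inclusive). Assumes every line is exactly one base followed by "\n".
sub read_bases {
    my ( $fh, $from, $to ) = @_;
    my $bytes_per_line = 2;                              # assumption, see above
    seek( $fh, ( $from - 1 ) * $bytes_per_line, 0 ) or die "seek failed: $!";
    my $buffer = '';
    my $want   = ( $to - $from + 1 ) * $bytes_per_line;
    my $got    = read( $fh, $buffer, $want );
    die "read failed: $!" unless defined $got;
    return split /\n/, $buffer;
}

# Demo on an in-memory file standing in for the real sequence file.
my $data = "A\nC\nG\nT\nA\nC\n";
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @seq = read_bases( $fh, 2, 4 );
print "seq is: @seq\n";
close $fh;
```

A single read replaces the line-by-line loop, so no counter is needed to decide where to stop.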

Reading the next line in the file and keeping counts separate

Another question for everyone. To reiterate I am very new to the Perl process and I apologize in advance for making silly mistakes
I am trying to calculate the GC content of different lengths of DNA sequence. The file is in this format:
>gene 1
DNA sequence of specific gene
>gene 2
DNA sequence of specific gene
...etc...
This is a small piece of the file
>env
ATGCTTCTCATCTCAAACCCGCGCCACCTGGGGCACCCGATGAGTCCTGGGAA
I have established the counter and can read each line of DNA sequence, but at the moment it is doing a running summation of the total across all lines. I want it to read each sequence, print the content after the sequence is read, then move on to the next one, giving individual base counts for each line.
This is what I have so far.
#!/usr/bin/perl
#necessary code to open and read a new file and create a new one.
use strict;
my $infile = "Lab1_seq.fasta";
open INFILE, $infile or die "$infile: $!";
my $outfile = "Lab1_seq_output.txt";
open OUTFILE, ">$outfile" or die "Cannot open $outfile: $!";
#establishing the initial counts for each base
my $G = 0;
my $C = 0;
my $A = 0;
my $T = 0;
#initial loop created to read through each line
while ( my $line = <INFILE> ) {
chomp $line;
# reads file until the ">" character is encountered and prints the line
if ($line =~ /^>/){
print OUTFILE "Gene: $line\n";
}
# otherwise count the content of the next line.
# my percent counts seem to be incorrect due to my Total length counts skewing the following line. I am currently unsure how to fix that
elsif ($line =~ /^[A-Z]/){
my @array = split //, $line;
my $array= (@array);
# reset the counts of each variable
$G = ();
$C = ();
$A = ();
$T = ();
foreach $array (@array){
#if statements assess which base is present and makes a running total of the bases.
if ($array eq 'G'){
++$G;
}
elsif ( $array eq 'C' ) {
++$C; }
elsif ( $array eq 'A' ) {
++$A; }
elsif ( $array eq 'T' ) {
++$T; }
}
# all is printed to the outfile
print OUTFILE "G:$G\n";
print OUTFILE "C:$C\n";
print OUTFILE "A:$A\n";
print OUTFILE "T:$T\n";
print OUTFILE "Total length:_", ($A+=$C+=$G+=$T), "_base pairs\n";
print OUTFILE "GC content is(percent):_", (($G+=$C)/($A+=$C+=$G+=$T)*100),"_%\n";
}
}
#close the outfile and the infile
close OUTFILE;
close INFILE;
Again I feel like I am on the right path, I am just missing some basic foundations. Any help would be greatly appreciated.
The final problem is in the final counts printed out. My percent values are wrong and give me the wrong value. I feel like the total is being calculated then that new value is incorporated into the total.
Several things:
1. Use a hash instead of declaring a separate variable for each base.
2. An assignment such as $G = (0); works, but it is not the idiomatic way to assign a scalar. The parentheses only group the expression, so $G simply receives 0; no array is created. The clearer form is $G = 0.
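my %seen;
while (my $line = <INFILE>) {
    next if $line =~ /^>/;              # skip the FASTA header lines
    chomp $line;
    $seen{$_}++ for split //, $line;    # one hash entry per base, counted over the whole file
}
foreach my $base (sort keys %seen) {
    print "$base: $seen{$base}\n";
}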
Just reset the counters when a new gene is found. Also, I'd use hashes for the counting:
use strict; use warnings;
my %counts;
while (<>) {
if (/^>/) {
# print counts for the prev gene if there are counts:
print_counts(\%counts) if keys %counts;
%counts = (); # reset the counts
print $_; # print the Fasta header
} else {
chomp;
$counts{$_}++ for split //;
}
}
print_counts(\%counts) if keys %counts; # print counts for last gene
sub print_counts {
my ($counts) = @_;
print "$_:=", ($counts->{$_} || 0), "\n" for qw/A C G T/;
}
Usage: $ perl count-bases.pl input.fasta.
Example output:
> gene 1
A:=3
C:=1
G:=5
T:=5
> gene 2
A:=1
C:=5
G:=0
T:=13
Style comments:
When opening a file, always use lexical filehandles (normal variables). Also, you should do a three-arg open. I'd also recommend the autodie pragma for automatic error handling (since perl v5.10.1).
use autodie;
open my $in, "<", $infile;
open my $out, ">", $outfile;
Note that I don't open files in my above script because I use the special ARGV filehandle for input, and print to STDOUT. The output can be redirected on the shell, like
$ perl count-bases.pl input.fasta >counts.txt
Declaring scalar variables with their values in parens like my $G = (0) is weird, but works fine. I think this is more confusing than helpful. → my $G = 0.
Your indentation is a bit weird. It is very unusual and visually confusing to put closing braces on the same line with another statement like
...
elsif ( $array eq 'C' ) {
++$C; }
I prefer cuddling elsif:
...
} elsif ($base eq 'C') {
$C++;
}
This statement my $array= (@array); puts the length of the array into $array. What for? Tip: You can declare variables right inside foreach-loops, like for my $base (@array) { ... }.
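On the wrong percentages: the question prints ($A+=$C+=$G+=$T), and += modifies its left operand, so the counts are changed by the first print and the next line then works with the altered values. A sketch that computes the total and GC percentage without modifying the counts, assuming only A, C, G and T should be counted (the sub name gc_stats and the test sequence are made up):

```perl
use strict;
use warnings;

# Hypothetical helper: count the bases of one sequence and return a
# reference to the counts, the total length and the GC percentage.
sub gc_stats {
    my ($sequence) = @_;
    my %count = ( A => 0, C => 0, G => 0, T => 0 );
    $count{$_}++ for grep { exists $count{$_} } split //, uc $sequence;
    my $total = $count{A} + $count{C} + $count{G} + $count{T};    # plain +, nothing is modified
    my $gc    = $total ? 100 * ( $count{G} + $count{C} ) / $total : 0;
    return ( \%count, $total, $gc );
}

my ( $count, $total, $gc ) = gc_stats('ATGCTTCTCA');    # made-up sequence
print "$_:$count->{$_}\n" for qw/G C A T/;
printf "Total length: %d base pairs\n", $total;
printf "GC content is (percent): %.1f %%\n", $gc;
```

Calling gc_stats once per sequence line gives fresh counts each time, so nothing has to be reset by hand.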

Perl --Change in all matches in previous lines according to match in current line

File I want to parse:
input Pattern;
input SDF;
input ABC
input Pattern;
output Pattern;
output XYZ;
In Perl, the usual approach is to scan a file line by line.
I want to check whether the current line has output Pattern; and a previous line (any of the previous lines) has input Pattern;.
If so, I want to change all matches in the previous lines to "input Pattern 2;" and the current line to "output Pattern2;".
It is complicated; I hope I have explained it properly.
Is it possible in Perl to scan and change previous lines after they have been read?
Thanks
If this is your data:
my $sfile =
'input Pattern;
input SDF;
input ABC
input Pattern;
output Pattern;
output XYZ;' ;
then, the following snippet will read the whole file and change text accordingly:
open my $fh, '<', \$sfile or die $!;
local $/ = undef; # set file input mode to 'slurp'
my $content = <$fh>;
close $fh;
$content =~ s{ ( # open capture group
input \s+ (Pattern); # find occurrence of input pattern
.+? # skip some text
output \s+ \2 # find same for output
) # close capture group
}
{ # replace by evaluated expression
do{ # within a do block
local $_=$1; # get whole match to $_
s/($2)/$1 2/g; # substitute Pattern by Pattern 2
$_ # return substituted text
} # close do block
}esgx;
Then you can check the string:
print $content;
=>
input Pattern 2;
input SDF;
input ABC
input Pattern 2;
output Pattern 2;
output XYZ;
You may even include a counter $n which will be incremented after each successful match (by a code assertion (?{ ... })):
our $n = 1;
$content =~ s{ ( # open capture group
input \s+ (Pattern); # find occurrence of input pattern
.+? # skip some text
output \s+ \2 # find same for output
) # close capture group
(?{ $n++ }) # ! update match count
}
{ # replace by evaluated expression
do{ # within a do block
local $_=$1; # get whole match to $_
s/($2)/$1 $n/g; # substitute Pattern by Pattern and count
$_ # return substituted text
} # close do block
}esgx;
The substitution will now start with input Pattern 2; and increment subsequently.
I think this will do what you need, but try it on a 'scratch' file first (a copy of the original) since it actually changes the file:
use Modern::Perl;
open my $fh_in, '<', 'parseThis.txt' or die $!;
my @fileLines = <$fh_in>;
close $fh_in;
for ( my $i = 1 ; $i < scalar @fileLines ; $i++ ) {
    # act only when this line is the output line and an earlier input line exists
    next
        unless $fileLines[$i] =~ /output Pattern;/
        and grep { /input Pattern;/ } @fileLines[ 0 .. $i - 1 ];
    $fileLines[$i] =~ s/output Pattern;/output Pattern2;/g;
    $fileLines[$_] =~ s/input Pattern;/input Pattern 2;/g for 0 .. $i - 1;
}
open my $fh_out, '>', 'parseThis.txt' or die $!;
print $fh_out @fileLines;
close $fh_out;
Results:
input Pattern 2;
input SDF;
input ABC;
input Pattern 2;
output Pattern2;
output XYZ;
Hope this helps!
#!/usr/bin/env perl
$in1 = 'input Pattern';
$in2 = 'input Pattern2';
$out1 = 'output Pattern';
$out2 = 'output Pattern2';
undef $/;
$_ = <DATA>;
if (/^$in1\b.*?^$out1\b/gms) {
s/(^$in1\b)(?=.*?^$out1\b)/$in2/gms;
s/^$out1\b/$out2/gms;
}
print;
__DATA__
input Pattern;
input SDF;
input ABC;
input Pattern;
output Pattern;
output XYZ;
Will there be additional "Input Pattern1" lines following an occurrence of "Output Pattern1"?
Are there going to be multiple patterns to search for, or will it just be "if we find Output Pattern1, then perform the replacement"?
Will the "Output Pattern" line occur multiple times, or just once?
I would perform this task in two (or multiple) passes:
Pass 1 - read the file, looking for the matching output lines, and store their line numbers in memory.
Pass 2 - read the file, and based on the line numbers in the set of matches, perform the replacement on the appropriate Input lines.
So in semi-Perlish, untested pseudocode:
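my $inputfile = shift;
my @matches = ();
# Pass 1: remember the line number of every matching output line
open my $fh, '<', $inputfile or die "Cannot open $inputfile: $!";
while (<$fh>) {
    push @matches, $. if /Output Pattern1/;
}
close $fh;    # closing the handle also resets $.
# Pass 2: rewrite the input lines up to the last match, print every line
open $fh, '<', $inputfile or die "Cannot open $inputfile: $!";
while (<$fh>) {
    s/Input Pattern1/Input Pattern2/ if @matches and $. <= $matches[-1];
    print;
}
close $fh;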
You run this like:
$ replace_pattern.pl input_file > output_file
You'll need to adjust it a little to meet your exact needs, but that should get you close.
You cannot go back and change lines in Perl. What you can do is open the file for the first time in read mode, find out which line has the pattern (say the 5th line), close it before gulping the entire file into an array, open it again in write mode, modify the contents of the array up to the 5th line, dump that array into that file, and close it. Something like this (assuming each file will have at most one output pattern):
my @arr;
my @files = ();
my ( @lines, $curindex );
while (<>) {
    if ( $. == 1 ) {                          # first line of a new file
        $curindex = undef;
        @lines    = ();
        push @files, $ARGV;
    }
    push @lines, $_;
    if (/output pattern/) { $curindex = $. }
    if (eof) {
        push @arr, [ [@lines], $curindex ];   # store a copy of this file's lines
        close ARGV;                           # resets $. for the next file
    }
}
my $currentfilenumber = 0;
for my $file (@files) {
    open my $out, '>', $file or die "Cannot write $file: $!";
    my ( $contents, $marker ) = @{ $arr[ $currentfilenumber++ ] };
    if ($marker) {
        for ( 0 .. $marker - 2 ) {
            $contents->[$_] =~ s/input pattern/input pattern2/g;
        }
        $contents->[ $marker - 1 ] =~ s/output pattern/output pattern2/g;
    }
    print $out @$contents;
    close $out;
}