Find nucleotides in DNA sequence with perl - perl

I have a DNA sequence and I want to find the nucleotides at a position chosen by the user. Here is an example:
Enter the sequence DNA:
ACTAAAAATACAAAAATTAGCCAGGCGTGGTGGCAC (the length of the sequence is 36)
Enter the position: (12)
The expected result is that at position 12 the nucleotides are AAA.
I have no problem finding the amino acid of the position. Below is the current code I have.
print "ENTER THE FILENAME OF THE DNA SEQUENCE:= ";
$DNAfilename = <STDIN>;
chomp $DNAfilename;
unless ( open(DNAFILE, $DNAfilename) ) {
print "Cannot open file \"$DNAfilename\"\n\n";
}
@DNA = <DNAFILE>;
close DNAFILE;
$DNA = join( '', @DNA);
print " \nThe original DNA file is:\n$DNA \n";
$DNA =~ s/\s//g;
print" enter the number ";
$po=<STDIN>;
@pos=$DNA;
if ($po>length($DNA))
{
print" no data";
}
else
{
print " @pos\n\n";
}
Please advise how I can find the position in the DNA sequence.

my $nucleotide = substr $DNA, $po, 3;
This takes the 3 nucleotides at positions $po up to $po+2 (counting from 0) and assigns them to $nucleotide.

That'll be something like this:
use strict;
use warnings;
print 'ENTER THE FILENAME OF THE DNA SEQUENCE:= ';
my $DNA_filename = <STDIN>;
chomp $DNA_filename;
unless (open(DNAFILE, $DNA_filename))
{
die 'Cannot open file "' . $DNA_filename . '"' . "\n\n";
}
my @DNA = <DNAFILE>;
close DNAFILE;
my $DNA_string = join('', @DNA);
print "\n" . 'The original DNA file is:' . "\n" . $DNA_string . "\n";
$DNA_string =~ s/\s//g;
print ' enter the number ';
my $pos = <STDIN>;
if ($pos > length($DNA_string))
{
print ' no data';
}
else
{
print ' ' . substr($DNA_string, $pos, 3) . "\n\n";
}
Some comments:
Always use strict and use warnings; they help you write better code with fewer bugs.
I personally don't like interpolation in double-quoted strings, hence the concatenations.
Positions in the result start at 0; if you want 1-based positions, adjust the condition of the last if and the substr call in the else.
Edit: I misread the part of the question about nucleotides; as @hexcoder wrote, you want substr($DNA_string, $pos, 3).
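The 0-based versus 1-based point can be sketched as a small helper. This is only an illustration under assumptions: the name nucleotides_at is made up here, the user is assumed to count positions from 1 (as in the question's example), and the default length of 3 comes from the expected output AAA.

```perl
use strict;
use warnings;

# Hypothetical helper (not from the answers above): return $len nucleotides
# starting at the 1-based position $pos, or nothing when $pos is not a
# positive integer inside the sequence.
sub nucleotides_at {
    my ($dna, $pos, $len) = @_;
    $len //= 3;                       # assumed default: three nucleotides
    $dna =~ s/\s+//g;                 # drop whitespace and newlines
    return unless defined $pos && $pos =~ /^\s*(\d+)\s*$/;
    $pos = $1;                        # also strips the newline left by <STDIN>
    return if $pos < 1 || $pos > length $dna;
    return substr($dna, $pos - 1, $len);   # substr itself counts from 0
}

my $dna = 'ACTAAAAATACAAAAATTAGCCAGGCGTGGTGGCAC';
my $hit = nucleotides_at($dna, "12\n");
print defined $hit ? "Position 12: $hit\n" : "no data\n";
```

With the sequence from the question this prints "Position 12: AAA". Near the end of the sequence substr returns fewer than three characters, which may or may not be what you want.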

Related

Get shortest and longest sequence in file

I'm trying to get the shortest and longest sequence in a file containing multiple genbank-like entries. example of the file:
LOCUS NM_182854 2912 bp mRNA linear PRI 20-APR-2016
DEFINITION Homo sapiens mRNA.
ACCESSION NM_182854
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
ORIGIN
1 gggcgatcag aagcaggtca cacagcctgt ttcctgtttt caaacgggga acttagaaag
61 tggcagcccc tcggcttgtc gccggagctg agaaccaaga gctcgaaggg gccatatgac
//
LOCUS NM_001323410 6992 bp mRNA linear PRI 20-APR-2016
DEFINITION Homo sapiens mRNA.
ACCESSION NM_001323410
SOURCE Homo sapiens (human)
ORGANISM Homo sapiens
Eukaryota; Metazoa; Chordata; Craniata; Vertebrata; Euteleostomi;
Mammalia; Eutheria; Euarchontoglires; Primates; Haplorrhini;
Catarrhini; Hominidae; Homo.
ORIGIN
1 actacttccg gcttccccgc cccgccccgt ccccgggcgt ctccattttg gtctcaggtg
61 tggactcggc aagaaccagc gcaagaggga agcagagtta tagctacccc ggc
//
I'd like to print the accession number and the organism for the shortest and the longest sequence.
My code so far:
#!/usr/bin/perl
use strict;
use warnings;
print "enter file path\n";
while (my $line = <>){
chomp $line;
my @record = ($line);
foreach my $file(@record){
open(IN, "$file") or die "\n error opening file \n;/\n";
$/="//";
while (my $line = <IN>){
my @gb_seq = split ("ORIGIN", $line);
my $definition = $gb_seq[0];
my $sequence = $gb_seq[1];
$definition =~ m/ORGANISM[\s\t]+(.+)[\n\s\t]+/;
my $organism = $1;
if ($definition =~ m/ACCESSION[\s\t]+(\D\D_\d\d\d\d\d\d(\d*))[\n\s\t]+/){
my $accession = $1;
$sequence =~ s/\d//g;
$sequence =~ s/[\n\s\t]//g;
my $size = length($sequence);
my @sorted_keys = sort { $a <=> $b } keys my %size;
my $shortest = $sorted_keys[0];
my $longest = $sorted_keys[-1];
print "this is the shortest: $accession $organism size: $shortest\n";
print "this is the longest: $accession $organism size: $longest\n";
}
}}}
exit;
I thought about putting the length in a hash to get the shortest and the longest but something is wrong there. I get these errors:
Use of uninitialized value $organism in concatenation (.) or string at test.pl line 39, <IN> chunk 1
Use of uninitialized value $shortest in concatenation (.) or string at test.pl line 39, <IN> chunk 1.
Use of uninitialized value $longest in concatenation (.) or string at test.pl line 40, <IN> chunk 1.
What part should I change? Thanks
We need to find extreme-length entries while being able to identify the record they belong to. Reading records by // is again a nice idea. However, then each record is a string and pulling the sequence out of it directly is harder than breaking it into lines first. Thus we may as well go line by line, given that there are clear markers for everything needed.
A choice of data structure is important and depends on the purpose. Here I organize data so that it is easy to work with, into a hash with elements
%block = ( 'accession' => { 'type' => type, 'sequence' => sequence }, ... )
The search to perform once the data is read in would be greatly aided by organizing this by 'sequence' (instead of by 'accession'), but that would make it very hard to work with. I presume that this may end up being used for more, and that a small loss of speed is not of consequence. If the sole objective here were to answer the specific question with optimal performance other approaches would be more suitable. Comments follow the code.
use warnings;
use strict;
use feature qw(say);
my $file = 'data_seqs.txt';
open my $fh, '<', $file or die "Can't open $file -- $!";
# Hash, helper variables, flag (inside a sequence?), sequence-end marker
my (%block, $accession, $sequence);
my $is_seq = 0;
my $end_marker = qr(\s*//); # marks end of sequence: //
while (my $line = <$fh>)
{
chomp($line);
next if $line =~ /^\s*$/; # skip empty lines
if ($line =~ /$end_marker/) { # done with the sequence
$is_seq = 0;
$sequence = '';
next;
}
if ($line =~ /^\s*ACCESSION\s*(\w+)/) {
$accession = $1;
}
elsif ($line =~ /^\s*ORGANISM\s*(.+)/) {
$block{$accession}{'type'} = $1;
}
elsif ($line =~ /^\s*ORIGIN/) { # start sequence on next line
$is_seq = 1;
}
elsif ($is_seq) { # read (and add to) sequence
if ($line =~ /^\s*\d+\s*(.*)/) {
$block{$accession}{'sequence'} .= $1;
}
else { warn "Not sequence? Line: $line " }
}
}
# Identify keys for max and min length. Initialize both with the same (any) key
my ($max) = keys %block;
my $min = $max;
foreach my $acc (keys %block)
{
my $current_len = length($block{$acc}{'sequence'});
if ( $current_len > length($block{$max}{'sequence'}) ) {
$max = $acc;
}
if ( $current_len < length($block{$min}{'sequence'}) ) {
$min = $acc;
}
}
say "Maximum length sequence: ACCESSION: $max, ORGANISM: " . $block{$max}{'type'};
say "Minimum length sequence: ACCESSION: $min, ORGANISM: " . $block{$min}{'type'};
use Data::Dumper;
print Dumper(\%block);
This prints (Dumper's printout omitted)
Maximum length sequence: ACCESSION: NM_182854, ORGANISM: Homo sapiens
Minimum length sequence: ACCESSION: NM_001323410, ORGANISM: Homo sapiens
A comment on searching efficiency
One common approach would be to first build a reverse lookup hash, then use a library, say List::Util, to find the max and min, and then look up where they belong. For this we need to build the lookup hash and call the library twice, while searching by hand as above makes one pass over the structure and is also simpler. Another option would be to have the sequences as top-level hash keys and then find the max and min directly. However, such a hash would be considerably harder to work with.
Yet another approach would be to organize data into a structure that would allow more efficient retrieval of this specific information, probably based on arrays.
However, the efficiency gain doesn't seem to justify the great loss of convenience. If the speed turns out to be a problem then this should be considered.
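For comparison, a library-based search can also be written without a reverse lookup hash. The following is only a sketch: it assumes a hash shaped like %block above, uses reduce from the core List::Util module, and the helper name extreme_accessions and the shortened sample sequences are made up for illustration.

```perl
use strict;
use warnings;
use List::Util qw(reduce);

# Hypothetical helper: given a hash ref shaped like %block, return the
# accession keys of the longest and of the shortest sequence.
sub extreme_accessions {
    my ($block) = @_;
    my $len = sub { length $block->{ $_[0] }{sequence} };
    # reduce keeps whichever of the two keys wins each comparison
    my $max = reduce { $len->($a) >= $len->($b) ? $a : $b } keys %$block;
    my $min = reduce { $len->($a) <= $len->($b) ? $a : $b } keys %$block;
    return ($max, $min);
}

# Shortened sample data, for illustration only
my %block = (
    NM_182854    => { type => 'Homo sapiens', sequence => 'gggcgatcag aagcaggtca' },
    NM_001323410 => { type => 'Homo sapiens', sequence => 'actacttccg' },
);
my ($max, $min) = extreme_accessions(\%block);
print "Longest: $max, shortest: $min\n";
```

This still walks the keys twice, so it is not faster than the hand-written loop; it mainly trades a few lines of code for a dependency on reduce.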
If you need to work with multiple files just change the loop to while (<>) and submit them on the command line. All lines from all of them will then be read line by line and the code stays the same.
It may be that I misunderstood some terms. I don't remove empty spaces from the "sequence", and use words on the first line only for "type", just to name a couple of candidates. These are easy to adjust, please let me know.
You state that you want two pieces of data - the accession and the organism - for the longest and shortest sequence. This means your hash values need to store two elements. As well as that, when you use '//' as a record separator, the '//' still appears at the end of each record. So, when you filter out whitespace and digits from your sequence, you're still left with '//' on the end. When I ran your code through the debugger, I found that the lengths were all off by 2 because of this.
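The trailing '//' can be seen, and removed, in a small sketch. It assumes nothing beyond core Perl: the helper name read_records and the two tiny sample records are made up, and the text is read through an in-memory filehandle so the example is self-contained.

```perl
use strict;
use warnings;

# Hypothetical helper: split a text into records terminated by "//".
sub read_records {
    my ($text) = @_;
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/ = "//";              # record separator, restored when the sub returns
    my @records;
    while (my $chunk = <$fh>) {
        chomp $chunk;             # chomp removes the current $/, i.e. the trailing "//"
        next unless $chunk =~ /\S/;   # skip the whitespace-only tail after the last "//"
        push @records, $chunk;
    }
    return @records;
}

my @records = read_records("ORIGIN\n 1 acgt\n//\nORIGIN\n 1 ac\n//\n");
print scalar(@records), " records\n";
```

Using chomp here is an alternative to adding '/' to the character class in the substitution; either way the separator no longer counts towards the sequence length.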
A couple of other things:
When using regexes, use 'extended mode', /x, so you can include whitespace for readability.
You presume a successful match when you dig out $definition - better to test your regexes and assign on match, die on mismatch.
Rather than store the length in the hash (and lose the sequence itself), you might as well store the sequence and calculate the lengths later.
I renamed the variable $line to $chunk as it contains several lines.
All the code that calculates the shortest and longest and prints the results needs to move out of the loop. In its place, you simply need to make an entry in the hash. As described above, the hash values need to be an array with two values - the accession and the organism.
You remove digits from the sequence in one command and then whitespace in another - you might as well do both together. While we're at it, we might as well remove the '/'s at the end of the record.
Given the changes above, I get:
use v5.14;
use warnings;
print "Enter file path: ";
chomp(my $filename = <>);
open(IN, $filename) or die "\n error opening file \n;/\n";
$/ = "//" ;
my %organisms ;
while (my $chunk = <IN>) {
next if $chunk =~ /^\s*\n\s*$/ ;
my ($definition , $sequence) = split "ORIGIN", $chunk ;
my $organism ;
$definition =~ m/ ORGANISM [\s\t]+ (.+) [\n\s\t]+ /x
? $organism = $1
: die "Couldnt find ORGANISM line" ;
my $accession ;
$definition =~ m/ ACCESSION [\s\t]+ (\D\D _ \d{6} (\d*)) [\n\s\t]+ /x
? $accession = $1
: die "Cant find ACCESSION line" ;
$sequence =~ s/[\d\n\s\t\/]//g;
$organisms{ $sequence } = [ $accession , $organism ] ;
}
my @sorted_keys = sort { length $a <=> length $b } keys %organisms ;
my $shortest = $sorted_keys[0];
my $longest = $sorted_keys[-1];
say "this is the shortest: ", $organisms{$shortest}->[0],
", ", $organisms{$shortest}->[1],
" size: ", length $shortest, "\n",
" sequence: ", $shortest ;
say "this is the longest: ", $organisms{$longest}->[0],
", ", $organisms{$longest}->[1],
" size: ", length $longest, "\n",
" sequence: ", $longest ;
exit;
When run on your data, it produces:
$ ./sequence.pl
Enter file path: data.txt
this is the shortest: NM_001323410, Homo sapiens size: 113
sequence: actacttccggcttccccgccccgccccgtccccgggcgtctccattttggtctcaggtgtggactcggcaagaaccagcgcaagagggaagcagagttatagctaccccggc
this is the longest: NM_182854, Homo sapiens size: 120
sequence: gggcgatcagaagcaggtcacacagcctgtttcctgttttcaaacggggaacttagaaagtggcagcccctcggcttgtcgccggagctgagaaccaagagctcgaaggggccatatgac
UPDATE
The problem with the code above is that if the same sequence appears in two chunks, data is going to be overwritten in the hash and lost. Below is an updated version that stores the data in an array of arrays, which avoids the problem. It produces exactly the same output:
use v5.14;
use warnings;
print "Enter file path: ";
chomp(my $filename = <>);
open(IN, $filename) or die "\n error opening file \n;/\n";
$/ = "//" ;
my @organisms ;
while (my $chunk = <IN>) {
next if $chunk =~ /^\s*\n\s*$/ ;
my ($definition , $sequence) = split "ORIGIN", $chunk ;
my $organism ;
$definition =~ m/ ORGANISM [\s\t]+ (.+) [\n\s\t]+ /x
? $organism = $1
: die "Couldnt find ORGANISM line" ;
my $accession ;
$definition =~ m/ ACCESSION [\s\t]+ (\D\D _ \d{6} (\d*)) [\n\s\t]+ /x
? $accession = $1
: die "Cant find ACCESSION line" ;
$sequence =~ s/[\d\n\s\t\/]//g;
push @organisms, [$organism , $accession , $sequence] ;
}
my @sorted_organisms = sort { length $a->[2] <=> length $b->[2] } @organisms ;
my ($organism , $accession , $sequence) = @{ $sorted_organisms[0] };
say "this is the shortest: $accession, $organism, size: ",
length $sequence, "\n", " sequence: ", $sequence ;
($organism , $accession , $sequence) = @{ $sorted_organisms[-1] };
say "this is the longest: $accession, $organism, size: ",
length $sequence, "\n", " sequence: ", $sequence ;
exit;

Open a file.txt and find the possible start and end positions of its genes

Hi, I have a file and I would like to open it and find the start and end positions of its genes. I also have some extra information. The beginning of each gene is marked by the following pattern: an 8-letter consensus known as the Shine-Dalgarno sequence (TAAGGAGG), followed by 4-10 bases downstream before the initiation codon (ATG). However, there are variants of the Shine-Dalgarno sequence, the most common of which is [TA][AC]AGGA[GA][GA]. The end of the gene is specified by one of the stop codons TAA, TAG and TGA. Care must be taken that the stop codon is found in the correct open reading frame (ORF).
I have made a text file with the genome and I open it with the code below. The errors begin when I read the genome and look for the start and end positions. Any help? Thanks a lot.
#!/usr/bin/perl -w
use strict;
use warnings;
# Searching for motifs
# Ask the user for the filename of the file containing
my $proteinfilename = "yersinia_genome.fasta";
print "\nYou open the filename of the protein sequence data: yersinia_genome.fasta \n";
# Remove the newline from the protein filename
chomp $proteinfilename;
# open the file, or exit
unless (open(PROTEINFILE, $proteinfilename) )
{
print "Cannot open file \"$proteinfilename\"\n\n";
exit;
}
# Read the protein sequence data from the file, and store it
# into the array variable @protein
my @protein = <PROTEINFILE>;
# Close the file - we've read all the data into @protein now.
close PROTEINFILE;
# Put the protein sequence data into a single string, as it's easier
# to search for a motif in a string than in an array of
# lines (what if the motif occurs over a line break?)
my $protein = join( '', @protein);
# Remove whitespace.
$protein =~ s/\s//g;
# In a loop, ask the user for a motif, search for the motif,
# and report if it was found.
my $motif='TAAGGAGG';
do
{
print "\n Your motif is:$motif\n";
# Remove the newline at the end of $motif
chomp $motif;
# Look for the motif
if ( $protein =~ /$motif/ )
{
print "I found it!This is the motif: $motif in line $.. \n\n";
}
else
{
print "I couldn't find it.\n\n";
}
}
until ($motif =~ /TAAGGAGG/g);
my $reverse=reverse $motif;
print "Here is the reverse Motif: $reverse. \n\n";
#HERE STARTS THE PROBLEMS,I DONT KNOW WHERE I MAKE THE MISTAKES
#$genome=$motif;
#$genome = $_[0];
my $ORF = 0;
while (my $genome = $proteinfilename) {
chomp $genome;
print "processing $genome\n";
my $mrna = split(/\s+/, $genome);
while ($mrna =~ /ATG/g) {
# $start and $stop are 0-based indexes
my $start = pos($mrna) - 3; # back up to include the start sequence
# discard remnant if no stop sequence can be found
last unless $mrna=~ /TAA|TAG|TGA/g;
#m/^ATG(?:[ATGC]{3}){8,}?(?:TAA|TAG|TGA)/gm;
my $stop = pos($mrna);
my $genlength = $stop - $start;
my $genome = substr($mrna, $start, $genlength);
print "\t" . join(' ', $start+1, $stop, $genome, $genlength) . "\n";
# $ORF ++;
#print "$ORF\n";
}
}
exit;
Thanks, I have solved it. The solution is:
local $_=$protein;
while(/ATG/g){
my $start = pos()-3;
if(/T(?:TAA|TAG|TGA)/g){
my $stop = pos;
print $start, " " , $stop, " " ,$stop - $start, " " ,
substr ($_,$start,$stop - $start),$/;
}
}
while (my $genome = $proteinfilename) {
This creates an endless loop: you are copying the file name (not the $protein data) over and over.
The purpose of the while loop is unclear; it will never terminate.
Perhaps you simply mean
my ($genome) = $protein;
Here is a simplistic attempt at fixing the obvious problems in your code.
#!/usr/bin/perl -w
use strict;
use warnings;
my $proteinfilename = "yersinia_genome.fasta";
chomp $proteinfilename;
unless (open(PROTEINFILE, $proteinfilename) )
{
# die, don't print & exit
die "Cannot open file \"$proteinfilename\"\n";
}
# Avoid creating a potentially large temporary array
# Read directly into $protein instead
my $protein = join ('', <PROTEINFILE>);
close PROTEINFILE;
$protein =~ s/\s//g;
# As this is a static variable, no point in looping
my $motif='TAAGGAGG';
chomp $motif;
if ( $protein =~ /$motif/ )
{
print "I found it! This is the motif: $motif in line $.. \n\n";
}
else
{
print "I couldn't find it.\n\n";
}
my $reverse=reverse $motif;
print "Here is the reverse Motif: $reverse. \n\n";
# $ORF isn't used; removed
# Again, no point in writing a loop
# Also, $genome is a copy of the data, not the filename
my $genome = $protein;
# Whitespace was already removed above, so use the string as is
# (split in scalar context would return a field count, not the data)
my $mrna = $genome;
while ($mrna =~ /ATG/g) {
my $start = pos($mrna) - 3; # back up to include the start sequence
last unless $mrna=~ /TAA|TAG|TGA/g;
my $stop = pos($mrna);
my $genlength = $stop - $start;
my $genome = substr($mrna, $start, $genlength);
print "\t" . join(' ', $start+1, $stop, $genome, $genlength) . "\n";
}
exit;
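The pattern described in the question (a Shine-Dalgarno variant, 4-10 spacer bases, ATG, then the first stop codon in the same reading frame) can be sketched as a single regex. This is a hedged illustration, not part of either answer above: the helper name find_orfs and the toy genome are made up, the input is assumed to be uppercase A/C/G/T with no FASTA header line, and overlapping genes or the reverse strand are not handled.

```perl
use strict;
use warnings;

# Hypothetical helper: return [start, end, orf] triples with 1-based,
# inclusive coordinates of the ORF (from ATG through the stop codon).
sub find_orfs {
    my ($genome) = @_;
    my @hits;
    while ($genome =~ /
            [TA][AC]AGGA[GA][GA]       # Shine-Dalgarno variant from the question
            [ACGT]{4,10}?              # 4-10 spacer bases
            (                          # capture the ORF itself:
              ATG                      #   start codon
              (?:[ACGT]{3})*?          #   whole codons only, so the frame is kept
              (?:TAA|TAG|TGA)          #   first in-frame stop codon
            )
        /gx) {
        # $-[1] and $+[1] are the 0-based start and end offsets of capture group 1
        push @hits, [ $-[1] + 1, $+[1], $1 ];
    }
    return @hits;
}

# Toy genome: 2 filler bases, the consensus, a 6-base spacer, a short ORF, 2 filler bases
my $genome = 'CC' . 'TAAGGAGG' . 'ACGTAC' . 'ATGAAACCCTAG' . 'GG';
for my $hit (find_orfs($genome)) {
    my ($start, $end, $orf) = @$hit;
    print join(' ', $start, $end, $orf, length $orf), "\n";
}
```

Stepping through whole codons after ATG is what keeps the stop codon in frame; a plain search for TAA|TAG|TGA, as in the code above, would also accept stop codons that straddle two codons.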

Perl script not running conditional statements?

I know I'm a newb in Perl so please excuse my stupid mistakes. I am making a calculator that takes user input, but it isn't working. It runs fine and dandy until the if statement, when it reaches the if statement it just ends the program. I looked through forums and books but couldn't find anything.
use warnings;
print "number\n";
$number = <STDIN>;
# Asks the user for what number to calculate.
print "Second number\n";
$secnumber = <STDIN>;
# Asks the user for second number to calculate the first number with
print "Calculation\n Multiplication x\n Addition +\n Substraction -\n ";
$calculation = <STDIN>;
# Asks the user for which calculation to make.
if ($calculation eq "x") {
print "$number" . 'x' . "\n$secnumber" . '=' . "\n" . ($number * $secnumber);
} elsif ($calculation eq "+") {
print "$number" . '+' . "\n$secnumber" . '=' . "\n" . ($number + $secnumber);
} elsif ($calculation eq "-") {
print "$number" . '-' . "\n$secnumber" . '=' . "\n" . ($number - $secnumber);
}
# Displays the calculation and answer.
The value assigned to $calculation will contain a newline character, so on a Unix-type system the value would actually be +\n rather than +, and none of the eq comparisons match.
You need to use the chomp function, which removes the trailing newline. You can find more information on chomp at http://perldoc.perl.org/functions/chomp.html
You can apply chomp in two ways. You can chomp the newline at the time of reading:
chomp ($calculation = <STDIN>);
Or you can do it after the assignment:
$calculation = <STDIN>;
chomp ($calculation);
Also, as a new user of Perl, I would recommend that in addition to the warnings pragma you also use the strict pragma to help you keep your code maintainable.
use warnings;
use strict;
You need to call chomp($calculation) after the initial assignment ($calculation = <STDIN>;) and before the if statement.
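Putting the advice together, the calculation step might look like the sketch below. It is one possible arrangement, not the asker's program: the helper name calculate and the dispatch table are assumptions made for illustration, and the inputs are passed in as strings that still carry the newline <STDIN> would leave on them.

```perl
use strict;
use warnings;

# Hypothetical helper: takes the three raw input lines and returns the
# result, or nothing for an unknown operator.
sub calculate {
    my ($x, $op, $y) = @_;
    chomp($x, $op, $y);    # strip the trailing newline from each value
    my %ops = (
        'x' => sub { $_[0] * $_[1] },
        '+' => sub { $_[0] + $_[1] },
        '-' => sub { $_[0] - $_[1] },
    );
    return unless exists $ops{$op};
    return $ops{$op}->($x, $y);
}

# The arguments mimic what <STDIN> returns, newline included
print calculate("6\n", "x\n", "7\n"), "\n";
```

Without the chomp, the lookup key would be "x\n" and no branch would be taken, which is the same failure the original if/elsif chain shows.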

Splitting and printing with Perl

My Perl script is attempting to take in a command line argument that is a file name such as name.txt or hello.txt.exe and parse out the file extension based on the . delimiter, and print only the extension like exe or txt. Here's what I currently have which doesn't print anything and I'm not entirely sure why.
#!/usr/bin/perl -w
use strict;
my ($user_arg) = shift @ARGV;
my ($ext);
if ( ($ext) = $user_arg =~ /(\.[^.].+)$/)
{
print "Ends in ", ($ext) = $user_arg =~ /(\.[^.].+)$/ , "\n";
print "Ends in" , ($ext) = $user_arg =~ /(\.[^.]+)$/, "\n";
}
elsif( ($ext) = $user_arg =~ /(\.[^.]+)$/)
{
print"Ends in " , ($ext), "\n";
}
else
{
print "No Extension";
}
*Update: now my problem is that the first branch prints both conditions. For something like name.txt it prints .txt twice, where I want it printed only once, unless the name is name.txt.exe, in which case I'd like it to print .txt.exe and then .exe.
There are two main issues here:
1) You need to shift off @ARGV
my $arg = shift @ARGV;
2) You need to escape the 'dot'
my @values = split /\./, $user_arg;
Other things...
You usually want to sanitize user input:
die "usage: $0 filename\n" if {some condition}
I think you mean chomp $val; in your foreach.
It wouldn't hurt to be familiar with File::Basename, fileparse could make your life easier. Although it might be overkill here.
UPDATE
You should be able to integrate this yourself. In your case you won't need to loop
over a list of files, you'll just have one.
This doesn't do exactly what you want, which is to print "txt.exe" and then "exe", but you can fine-tune it to your liking.
my @file_tests = qw(nosuffix testfile.txt /path/to/file.exe foo.bar.baz);
for my $fullname (@file_tests) {
my @names = split /\./, $fullname;
# shift off the first element, which will
# give you the list of suffixes or an empty list
shift @names;
# you can decide how you want to print this list
# if scalar @names is 0 don't print anything
print "list of suffixes: " . join( ', ', @names ) . "\n"
if scalar(@names) > 0;
}
}
OUTPUT:
list of suffixes: txt
list of suffixes: exe
list of suffixes: bar, baz
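To get the output the question asks for (.txt.exe and then .exe), the same split can be extended as sketched below. The helper name extension_chain is made up for illustration, and using basename from the core File::Basename module to drop the directory part first is an assumption about what is wanted for names like /path/to/file.exe.

```perl
use strict;
use warnings;
use File::Basename qw(basename);

# Hypothetical helper: list every suffix of a file name, longest first.
sub extension_chain {
    my ($path) = @_;
    my @parts = split /\./, basename($path);
    shift @parts;                      # drop the base name
    my @chain;
    while (@parts) {
        push @chain, '.' . join('.', @parts);
        shift @parts;                  # move on to the next, shorter suffix
    }
    return @chain;
}

for my $name (qw(name.txt name.txt.exe nosuffix)) {
    my @chain = extension_chain($name);
    print "$name: ", (@chain ? join(' then ', @chain) : 'No Extension'), "\n";
}
```

Note that a name starting with a dot, such as .bashrc, is treated as having an empty base name and the suffix .bashrc; whether that is right depends on what you consider an extension.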

How do I search for a string in file with different headings?

I am using Perl to search for specific strings in a file with different sequences listed under different headings. I am able to write the script when there is one sequence present, i.e. one heading, but I am not able to extend it to several.
Suppose I am required to search for some string "FSFSD" in a given file. For example,
I can't search if the file has the following content:
Polons
CACAGTGCTACGATCGATCGATDDASD
HCAYCHAYCHAYCAYCSDHADASDSADASD
Seliems
FJDSKLFJSLKFJKASFJLAKJDSADAK
DASDNJASDKJASDJDSDJHAJDASDASDASDSAD
Teerag
DFAKJASKDJASKDJADJLLKJ
SADSKADJALKDJSKJDLJKLK
I can search when the file has one heading, i.e.:
Terrans
FDKFJSKFJKSAFJALKFJLLJ
DKDJKASJDKSADJALKJLJKL
DJKSAFDHAKJFHAFHFJHAJJ
I need to output the result as "String xyz found under Heading abc"
The code I am using is:
print "Input the file name \n";
$protein= <STDIN>;
chomp $protein;
unless (open (protein, $protein))
{
print "cant open file \n\n";
exit;
}
@prot= <protein>;
close protein;
$newprotein=join("",@prot);
$protein=~s/\s//g;
do{
print "enter the motif to be searched \n";
$motif= <STDIN>;
chomp $motif;
if ($protein =~ /motif/)
{
print "found motif \n\n";
}
else{
print "not found \n\n";
}
}
until ($motif=~/^\s*$/);
exit;
Seeing your code, I want to make a few suggestions without answering your question:
Always, always, always use strict;. For the love of whatever higher power you may (or may not) believe in, use strict;.
Every time you use strict;, you should use warnings; along with it.
Also, seriously consider using some indentation.
Also, consider using obviously different names for different variables.
Lastly, your style is really inconsistent. Is this all your code or did you patch it together? Not trying to insult you or anything, but I recommend against copying code you don't understand - at least try before you just copy it.
Now, a much more readable version of your code, including a few fixes and a few guesses at what you may have meant to do, follows:
use strict;
use warnings;
print "Input the file name:\n";
my $filename = <STDIN>;
chomp $filename;
open FILE, "<", $filename or die "Can't open file\n\n";
my $newprotein = join "", <FILE>;
close FILE;
$newprotein =~ s/\s//g;
while(1) {
print "enter the motif to be searched:\n";
my $motif = <STDIN>;
last if $motif =~ /^\s*$/;
chomp $motif;
# here I might even use the ternary ?: operator, but whatever
if ($newprotein =~ /$motif/) {
print "found motif\n\n";
}
else {
print "not found\n\n";
}
}
The main issue is how to distinguish between a header and the data. From your examples I assume that a line is a header iff it contains a lowercase letter.
use strict;
use warnings;
print "Enter the motif to be searched \n";
my $motif = <STDIN>;
chomp($motif);
my $header;
while (<>) {
if(/[a-z]/) {
chomp($header = $_); # drop the newline so the message below prints on one line
next;
}
if (/$motif/o) {
print "Found $motif under header $header\n";
exit;
}
}
print "$motif not found\n";
So you are saying you are able to read one line and achieve this task. But when you have more than one line in the file you are not able to do the same thing?
Just have a loop and read the file line by line.
$data_file="yourfilename.txt";
open(DAT, '<', $data_file) || die("Could not open file!");
while( my $line = <DAT>)
{
# the same code you use for one 'heading' goes here; $line holds one line of the file
}
EDIT: Your posted example has no clear delimiter; you need to find a clear division between your headings and your sequences. You could use multiple line breaks or a non-alphanumeric character such as ','. Whatever you choose, let WHITESPACE in the following code be equal to your chosen delimiter. If you are stuck with the format you have, you will have to change the following grammar to disregard whitespace and delimit through capitalization (which makes it slightly more complex).
A simple way ( O(n^2)? ) is to split the file using a whitespace delimiter, giving you an array of headings and sequences ( heading[i] = split_array[i*2], sequence[i] = split_array[i*2+1] ). For each sequence, perform your regex.
A slightly more difficult way ( O(n) ), given a BNF grammar such as:
file: block
| file block
;
block: heading sequence
heading: [A-Z][a-z]
sequence: [A-Z][a-z]
Try recursive descent parsing (pseudo-code, I don't know Perl):
GLOBAL sequenceHeading, sequenceCount
GLOBAL substringLength = 5
GLOBAL substring = "FSFSD"
FUNC file ()
WHILE nextChar() != EOF
block()
printf ( "%d substrings in %s", sequenceCount, sequenceHeading )
END WHILE
END FUNC
FUNC block ()
heading()
sequence()
END FUNC
FUNC heading ()
in = popChar()
IF in == WHITESPACE
sequenceHeading = tempHeading
tempHeading = ""
RETURN
END IF
tempHeading &= in
END FUNC
FUNC sequence ()
in = popChar()
IF in == WHITESPACE
sequenceCount = count
count = 0
i = 0
END IF
IF in == substring[i]
i++
IF i > substringLength
count++
END IF
ELSE
i = 0
END IF
END FUNC
For detailed information on recursive descent parsing, check out Let's Build a Compiler or Wikipedia.
use strict;
use warnings;
use autodie qw'open';
my($filename,$motif) = @ARGV;
if( @ARGV < 1 ){
print "Please enter file name:\n";
$filename = <STDIN>;
chomp $filename;
}
if( @ARGV < 2 ){
print "Please enter motif:\n";
$motif = <STDIN>;
chomp $motif;
}
my %data;
# fill in %data;
{
open my $file, '<', $filename;
my $heading;
while( my $line = <$file> ){
chomp $line;
if( $line ne uc $line ){
$heading = $line;
next;
}
if( $data{$heading} ){
$data{$heading} .= $line;
} else {
$data{$heading} = $line;
}
}
}
{
# protect against malicious users
my $motif_cmp = quotemeta $motif;
for my $heading ( keys %data ){
my $data = $data{$heading};
if( $data =~ /$motif_cmp/ ){
print "String $motif found under Heading $heading\n";
exit 0;
}
}
die "String $motif not found anywhere in file $filename\n";
}
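The code above stops at the first heading that matches, and hash keys come back in no particular order. If every matching heading should be reported in file order, a variation could look like the sketch below. It rests on assumptions: the helper name headings_with_motif is made up, a heading is taken to be any line containing a lowercase letter (as suggested in an earlier answer), and the motif is matched literally with index rather than as a regex.

```perl
use strict;
use warnings;

# Hypothetical helper: return, in file order, every heading whose joined
# sequence lines contain $motif as a literal substring.
sub headings_with_motif {
    my ($text, $motif) = @_;
    my (@order, %seq, $heading);
    for my $line (split /\n/, $text) {
        $line =~ s/\s+//g;
        next if $line eq '';
        if ($line =~ /[a-z]/) {            # assumed rule: lowercase means heading
            $heading = $line;
            push @order, $heading unless exists $seq{$heading};
            $seq{$heading} //= '';
            next;
        }
        next unless defined $heading;      # ignore sequence lines before any heading
        $seq{$heading} .= $line;           # join lines so a motif may span a line break
    }
    return grep { index($seq{$_}, $motif) >= 0 } @order;
}

my $sample = <<'END';
Polons
CACAGTGCTACGATCGATCGATDDASD
HCAYCHAYCHAYCAYCSDHADASDSADASD
Seliems
FJDSKLFJSLKFJKASFJLAKJDSADAK
DASDNJASDKJASDJDSDJHAJDASDASDASDSAD
Teerag
DFAKJASKDJASKDJADJLLKJ
SADSKADJALKDJSKJDLJKLK
END

print "String DAKDAS found under Heading $_\n" for headings_with_motif($sample, 'DAKDAS');
```

The motif DAKDAS was picked because it only occurs across the line break under Seliems in the question's sample, which shows why the lines are joined before searching.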