I have two files as input, a file containing a list of words StopWordsList.txt, I want to remove from StopWordsList.txt the words that are in StopWordsList.txt, here is my code:
my $FichierResulat = '/home/lenovo/Bureau/MesTravaux/LeskAlgo/OriginalLeskResult';
open( my $FhResultat, '>:utf8', $FichierResulat );
open( my $fh1, "<:utf8", '/home/lenovo/Bureau/MesTravaux/LeskAlgo/DemoLesk/StopWordsList.txt' )
or die "Failed to open file: $!\n"; #file contains stop words
open( my $fh2, "<:utf8", '/home/lenovo/Bureau/MesTravaux/LeskAlgo/text1.txt' ) #file contains text
or die "Failed to open file: $!\n";
my #tabStopWords = <$fh1>;
my #tab_contexte;
my #words;
while ( <$fh2> ) {
chomp;
next if m/^$/;
my $context = $_;
#words = split( / /, $_ );
}
#compare: remove from #words the words existing in #tabStopWords
my %temp;
#temp{#tabStopWords} = 0 .. $#tabStopWords;
for my $val ( #words ) {
if ( exists $temp{$val} ) {
print "$val est présent dans tab1 à la position $temp{$val}.\n";
}
else {
print "$val n'est pas dans tab1.\n";
push #tab_sans_SW, $val;
}
}
foreach my $value ( #tab_sans_SW ) {
print $FhResultat "$value\n";
}
but in the result file i have all the words existing in #words without removing the word that exist in #tabStopWords..
I hope tha can you help me.
my sotpwords file :
ال
الآن
التي
الذي
الذين
اللاتي
اللائي
اللتان
اللتين
my texte file :
ومواصلات بما فيه من بريد ونور ومياه وصناعات وعلوم ومعارف وحينما يركب احدنا قطارا فإنه يركب في نفس الوقت على حرية جاهزة اعدها له آلاف العمال والمخترعين والمهندسين في
There are a couple of problems
You don't chomp the contents of #tabStopWords, so each entry has a newline at the end
You overwrite the contents of #words each time around the while loop with #words = split(/ /, $_) instead of adding to it
This program will do what you want. I have added use autodie to avoid having to check the result of every open, and I have removed a couple of unused variables. Local variable names are better written using just lower-case letters and underscores, especially for readers whose first language isn't English
I've used split on both files to reduce them both to individual words. Because split also removes newline characters there is no need for chomp
use strict;
use warnings 'all';
use autodie;
use constant FICHIER_STOP_WORD => '/home/lenovo/Bureau/MesTravaux/LeskAlgo/DemoLesk/StopWordsList.txt';
use constant FICHIER_TEXTE => '/home/lenovo/Bureau/MesTravaux/LeskAlgo/text1.txt';
use constant FICHIER_RESULAT => '/home/lenovo/Bureau/MesTravaux/LeskAlgo/OriginalLeskResult';
my #tab_stop_words = do {
open my $fh1, "<:utf8", FICHIER_STOP_WORD;
map { split } <$fh1>;
};
my #words = do {
open my $fh1, "<:utf8", FICHIER_TEXTE;
map { split } <$fh1>;
};
my %words = map { $words[$_] => $_ } 0 .. $#words;
open my $fh_resultat, '>:utf8', FICHIER_RESULAT;
for my $word ( #words ) {
my $position = $words{$word};
if ( defined $position ) {
print "$word est présent dans tab1 à la position $position.\n";
}
else {
print "$word n'est pas dans tab1.\n";
print $fh_resultat "$word\n";
}
}
This problem would be easier to solve if you showed us the format of your two input files. But as you don't, this will be guesswork.
I guess that your file of stopwords contains a single word on each line. In that case, each element in #tabStopWords and, therefore, each key in %temp will have newline at the end of them. This makes it extremely unlikely that any of the words in your source file will match these keys.
You probably want to add:
chomp #tabStopWords;
to your code.
We can get the difference using smart match operator (~~),
my(#words_arr) = ("is","a");
my(#input_arr) = ("This","is","a","example","code");
my (#diff) = grep { not $_ ~~ #words_arr} #input_arr;
Related
I'm searching for a list of Keywords from a file. I am able to match the whole keyword, but for some keywords i need to match a first part of word. For Example
DES
AES
https:// --- here it should match the word starting with https:// but my code considers the whole word and skips it.
For example using the above keywords I would want to match DES, DES and https:// only from the below input:
DES some more words
DESTINY and more...
https://example.domain.com
http://anotherexample.domain.com # note that this line begins with http://, not https://
Here is what I've tried so far:
use warnings;
use strict;
open STDOUT, '>>', "my_stdout_file.txt";
#die qq[Usage: perl $0 <keyword-file> <search-file> <file-name>\n] unless #ARGV == 3;
my $filename = $ARGV[2];
chomp ($filename);
open my $fh, q[<], shift or die $!; --- This file handle Opening all the 3 arguments. I need to Open only 2.
my %keyword = map { chomp; $_ => 1 } <$fh>;
print "$fh\n";
while ( <> ) {
chomp;
my #words = split;
for ( my $i = 0; $i <= $#words; $i++ ) {
if ( $keyword{^$words[ $i ] } ) {
print "Keyword Found for file:$filename\n";
printf qq[$filename Line: %4d\tWord position: %4d\tKeyword: %s\n],
$., $i, $words[ $i ];
}
}
}
close ($fh);
Here's a working solution for what I think you're trying to achieve. Let me know if not:
use warnings;
use strict;
use feature qw/ say /;
my %keywords;
while(<DATA>){
chomp;
my ($key) = split;
my $length = length($key);
$keywords{$key} = $length;
}
open my $in, '<', 'in.txt' or die $!;
while(<$in>){
chomp;
my $firstword = (split)[0];
for my $key (keys %keywords){
if ($firstword =~ m/$key/){
my $word = substr($firstword, 0, $keywords{$key});
say $word;
}
}
}
__DATA__
Keywords:-
DES
AES
https:// - here it should match the word starting with https:// but my code considers the whole word and skipping it.
For an input file containing:
here are some words over multiple
lines
that may or
may not match your keywords:
DES DEA AES SSE
FOO https:
https://example.domain.com
This produces the output:
DES
https://
I have a text in a file F1 each sentence in line, and another file contain the part of speech(POS) of each word in the text for example:
F1 contains:
he lives in paris\n
he jokes
F2 contains:
he pro\n
lives verb\n
in prep\n
paris adv_pl\n
he pro\n
jokes verb\n
I would like to parse each sentence of F1 and extract the POS of each word. I arrived to extract the POS of the first sentence, but the program can't parse the second line. This is the code:
open( FILE, $filename ) || die "Problème d'ouverture du ficher en entrée";
open( FILEOUT, ">$filenameout" ) || die "Problème d'ouverture";
open( F, "/home/ahmed/Bureau/test/corpus.txt" ) || die " Pb pour ouvrir";
open( F2, "/home/ahmed/Bureau/test/corp.txt" ) || die " Pb pour ouvrir";
my $z;
my $y = 0;
my $l;
my $li;
my $pos;
while ( $ligne = <F> ) {
while ( $li = <F2> ) { # F2 POS
chomp($li);
# prem contain the first word of F2 in each line,
# deux contain the POS of this word
( $prem, $deux ) = ( $li =~ m/^\W*(\w+)\W+(\w+)/ );
print "premier: $prem\n";
chomp($ligne);
#val = split( / /, $ligne ); # corpus de texte
$l = #val;
while ( $y < $l ) { # $l length of sentence
$z = $val[$y];
print "z : $z\n";
if ( $z eq $prem ) {
print "true\n";
$pos .= "POSw" . $y . "=" . $deux . " ";
++$y;
} else {
last;
}
}
}
print FILEOUT "$pos\n";
$pos = "";
}
The result I had in the terminal:
premier: he
z : he
true
premier : lives
z : lives
true
premier : in
z : in
true
premier : paris
z : paris
true
premier : he
premier : jokes
The first sentence has 4 words, when it pass 4, we must go to the next line in the text, I can't arrive to solve it.
There are some issues in your script.
You must always use strict; use warnings; to show the most common syntax and/or typing errors, unused variables, etc.
You should always use the three-parameter open and no global FILEHANDLEs (see opentut).
You should use some sensible names for your filehandles, not FH, FH1, etc. but $fh_sentences and $fh_grammar (or other meaningful names).
So far for the general part. Now let's get more specific:
Your outer loop (F) reads the sentences one by one. The next loop (F2) reads the grammatical types but it does so only once for the first sentence. When the F2 file is read, subsequent calls to <F2> will always return undef because the file was already read. You have to reset the filepointer to the beginning of the file after each sentence or -- even better -- read the file F2 in advance and store its contents in a hash.
Iterating over an array of words in a sentence is easier with foreach my $word (#words). No need to do the housekeeping of index variables (like $y) yourself.
chomping and splitting the sentences should be moved outside the F2 loop because $ligne doesn't change in the loop and only burns CPU cycles.
Putting this together I end up with this:
use strict;
use warnings;
# Read the grammar file, F2, into a hash:
my %grammar;
open( my $fh_grammar, '<', 'F2' ) or die "Pb pour ouvrir F2: $!\n";
while( my $ligne = <$fh_grammar> ) {
my ($prem, $deux) = ( $ligne =~ m/^\W*(\w+)\W+(\w+)/ );
$grammar{$prem} = $deux;
}
close($fh_grammar);
# The hash is now:
# %grammar = (
# 'he' => 'pro',
# 'lives => 'verb',
# 'in' => 'prep',
# 'paris' => 'adv_pl'
# 'jokes' => 'verb'
# );
# Read the sentences from F1 and check the grammar:
open( my $fh_sentences, '<', 'F1' ) or die "Pb pour ouvrir F1: $!\n";
while( my $ligne = <$fh_sentences> ) {
my #words = split(/\s+/, $ligne );
foreach my $word (#words) {
print "z: $word\n";
if ( exists $grammar{$word} ) {
print "true; $grammar{$word}\n";
}
}
print "\n";
}
close($fh_sentences);
Output:
z: he
true; pro
z: lives
true; verb
z: in
true; prep
z: paris
true; adv_pl
z: he
true; pro
z: jokes
true; verb
You can solve the above problem in different way like :
First read the POS file and put it in hash
Code :
my $filename = "F2";
open FH2, $filename or die "Error";
my %POS_hash;
while(<FH2>)
{
chomp($_);
my #arr = split(/ /, $_); # you can change the split function
$POS_hash{$arr[0]} = $arr[1];
}
Now read your file and replace it with the POS
my $filename1 = "F1";
open FH1, $filename1 or die "Error";
while(<FH1>)
{
chomp($_);
my #arr = split(/ /, $_); # you can change the split function
foreach my $val (#arr)
{
if(exists $POS_hash{$val})
{
print "$POS_hash{$val}\t";
}
}
print "\n";
}
I believe this is a better way for your problem. Hope this will solve your problem.
I have the following script which searches for specified substrings within an input string (a DNA sequence). I was wondering if anybody could help out with being able to specify degeneracy of specific characters. For example, instead of searching for GATC (or anything consisting solely of G's, T's, A's and C's), I could instead search for GRTNA where R = A or G and where N = A, G, C or T. I would need to be able to specify quite a few of these in a long list within the script. Many thanks for any help or tips!
use warnings;
use strict;
#User Input
my $usage = "Usage (OSX Terminal): perl <$0> <FASTA File> <Results Directory + Filename>\n";
#Reading formatted FASTA/FA files
sub read_fasta {
my ($in) = #_;
my $sequence = "";
while(<$in>) {
my $line = $_;
chomp($line);
if($line =~ /^>/){ next }
else { $sequence .= $line }
}
return(\$sequence);
}
#Scanning for restriction sites and length-output
open(my $in, "<", shift);
open(my $out, ">", shift);
my $DNA = read_fasta($in);
print "DNA is: \n $$DNA \n";
my $len = length($$DNA);
print "\n DNA Length is: $len \n";
my #pats=qw( GTTAAC );
for (#pats) {
my $m = () = $$DNA =~ /$_/gi;
print "\n Total DNA matches to $_ are: $m \n";
}
my $pat=join("|",#pats);
my #cutarr = split(/$pat/, $$DNA);
for (#cutarr) {
my $len = length($_);
print $out "$len \n";
}
close($out);
close($in);
GRTNA would correspond to the pattern G[AG]T[AGCT]A.
It looks like you could do this by writing
for (#pats) {
s/R/[AG]/g;
s/N/[AGCT]/g;
}
before
my $pat = join '|', #pats;
my #cutarr = split /$pat/, $$DNA;
but I'm not sure I can help you with the requirement that "I would need to be able to specify quite a few of these in a long list within the script". I think it would be best to put your sequences in a separate text file rather than embed the list directly into the program.
By the way, wouldn't it be simpler just to
return $sequence
from your read_fasta subroutine? Returning a reference just means you have to dereference it everywhere with $$DNA. I suggest that it should look like this
sub read_fasta {
my ($fh) = #_;
my $sequence;
while (<$fh>) {
unless (/^>/) {
chomp;
$sequence .= $_;
}
}
return $sequence;
}
I have two files:
file_1 has three columns (Marker(SNP), Chromosome, and position)
file_2 has three columns (Chromosome, peak_start, and peak_end).
All columns are numeric except for the SNP column.
The files are arranged as shown in the screenshots. file_1 has several hundred SNPs as rows while file_2 has 61 peaks. Each peak is marked by a peak_start and peak_end. There can be any of the 23 chromosomes in either file and file_2 has several peaks per chromosome.
I want to find if the position of the SNP in file_1 falls within the peak_start and peak_end in file_2 for each matching chromosome. If it does, I want to show which SNP falls in which peak (preferably write output to a tab-delimited file).
I would prefer to split the file, and use hashes where the chromosome is the key. I have found only a few questions remotely similar to this, but I could not understand well the suggested solutions.
Here is the example of my code. It is only meant to illustrate my question and so far doesn't do anything so think of it as "pseudocode".
#!usr/bin/perl
use strict;
use warnings;
my (%peaks, %X81_05);
my #array;
# Open file or die
unless (open (FIRST_SAMPLE, "X81_05.txt")) {
die "Could not open X81_05.txt";
}
# Split the tab-delimited file into respective fields
while (<FIRST_SAMPLE>) {
chomp $_;
next if (m/Chromosome/); # Skip the header
#array = split("\t", $_);
($chr1, $pos, $sample) = #array;
$X81_05{'$array[0]'} = (
'position' =>'$array[1]'
)
}
close (FIRST_SAMPLE);
# Open file using file handle
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
my ($chr, $peak_start, $peak_end);
while (<PEAKS>) {
chomp $_;
next if (m/Chromosome/); # Skip header
($chr, $peak_start, $peak_end) = split(/\t/);
$peaks{$chr}{'peak_start'} = $peak_start;
$peaks{$chr}{'peak_end'} = $peak_end;
}
close (PEAKS);
for my $chr1 (keys %X81_05) {
my $val = $X81_05{$chr1}{'position'};
for my $chr (keys %peaks) {
my $min = $peaks{$chr}{'peak_start'};
my $max = $peaks{$chr}{'peak_end'};
if (($val > $min) and ($val < $max)) {
#print $val, " ", "lies between"," ", $min, " ", "and", " ", $max, "\n";
}
else {
#print $val, " ", "does not lie between"," ", $min, " ", "and", " ", $max, "\n";
}
}
}
More awesome code:
http://i.stack.imgur.com/fzwRQ.png
http://i.stack.imgur.com/2ryyI.png
A couple of program hints in Perl:
You can do this:
open (PEAKS, "peaks.txt")
or die "Couldn't open peaks.txt";
Instead of this:
unless (open (PEAKS, "peaks.txt")) {
die "could not open peaks.txt";
}
It's more standard Perl, and it's a bit easier to read.
Talking about Standard Perl, you should use the 3 argument open form, and use scalars for file handles:
open (my $peaks_fh, "<", "peaks.txt")
or die "Couldn't open peaks.txt";
This way, if your file's name just happens to start with a | or >, it will still work. Using scalars variables (variables that start with a $) makes it easier to pass file handles between functions.
Anyway, just to make sure I understand you correctly: You said "I would prefer ... use hashes where the chromosome is the key."
Now, I have 23 pairs of chromosomes, but each of those chromosomes might have thousands of SNPs on it. If you key by chromosome this way, you can only store a single SNP per chromosome. Is this what you want? I notice your data is showing all the same chromosome. That means you can't key by chromosome. I'm ignoring that for now, and using my own data.
I've also noticed a difference in what you said the files contained, and how your program uses them:
You said: "file 1 has 3 columns (SNP, Chromosome, and position)" , yet your code is:
($chr1, $pos, $sample) = #array;
Which I assume is Chromosome, Position, and SNP. Which way is the file arranged?
You've got to clarify exactly what you're asking for.
Anyway, here's the tested version that prints out in tab delimited format. This is in a bit more modern Perl format. Notice that I only have a single hash by chromosome (as you specified). I read the peaks.txt in first. If I find in my position file a chromosome that doesn't exist in my peaks.txt file, I simply ignore it. Otherwise, I'll add in the additional hashes for POSITION and SNP:
I do a final loop that prints everything out (tab delimitated) as you specified, but you didn't specify a format. Change it if you have to.
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie; #No need to check for file open failure
use constant {
PEAKS_FILE => "peak.txt",
POSITION_FILE => "X81_05.txt",
};
open ( my $peak_fh, "<", PEAKS_FILE );
my %chromosome_hash;
while ( my $line = <$peak_fh> ) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ( $chromosome, $peak_start, $peak_end ) = split ( "\t", $line );
$chromosome_hash{$chromosome}->{PEAK_START} = $peak_start;
$chromosome_hash{$chromosome}->{PEAK_END} = $peak_end;
}
close $peak_fh;
open ( my $position_fh, "<", POSITION_FILE );
while ( my $line = <$position_fh> ) {
chomp $line;
my ( $chromosome, $position, $snp ) = split ( "\t", $line );
next unless exists $chromosome_hash{$chromosome};
if ( $position >= $chromosome_hash{$chromosome}->{PEAK_START}
and $position <= $chromosome_hash{$chromosome}->{PEAK_END} ) {
$chromosome_hash{$chromosome}->{SNP} = $snp;
$chromosome_hash{$chromosome}->{POSITION} = $position;
}
}
close $position_fh;
#
# Now Print
#
say join ("\t", qw(Chromosome, SNP, POSITION, PEAK-START, PEAK-END) );
foreach my $chromosome ( sort keys %chromosome_hash ) {
next unless exists $chromosome_hash{$chromosome}->{SNP};
say join ("\t",
$chromosome,
$chromosome_hash{$chromosome}->{SNP},
$chromosome_hash{$chromosome}->{POSITION},
$chromosome_hash{$chromosome}->{PEAK_START},
$chromosome_hash{$chromosome}->{PEAK_END},
);
}
A few things:
Leave spaces around parentheses on both sides. It makes it easier to read.
I use parentheses when others don't. The current style is not to use them unless you have to. I tend to use them for all functions that take more than a single argument. For example, I could have said open my $peak_fh, "<", PEAKS_FILE;, but I think parameters start to get lost when you have three parameters on a function.
Notice I use use autodie;. This causes the program to quit if it can't open a file. That's why I don't even have to test whether or not the file opened.
I would have preferred to use object oriented Perl to hide the structure of the hash of hashes. This prevents errors such as thinking that the start peek is stored in START_PEEK rather than PEAK_START. Perl won't detect these type of miskeyed errors. Therefore, I prefer to use objects whenever I am doing arrays of arrays or hashes of hashes.
You only need one for loop because you are expecting to find some of the SNPs in the second lot. Hence, loop through your %X81_05 hash and check if any matches one in %peak. Something like:
for my $chr1 (keys %X81_05)
{
if (defined $peaks{$chr1})
{
if ( $X81_05{$chr1}{'position'} > $peaks{$chr1}{'peak_start'}
&& $X81_05{$chr1}{'position'} < $peaks{$chr1}{'peak_end'})
{
print YOUROUTPUTFILEHANDLE $chr1 . "\t"
. $peaks{$chr1}{'peak_start'} . "\t"
. $peaks{$chr1}{'peak_end'};
}
else
{
print YOUROUTPUTFILEHANDLE $chr1
. "\tDoes not fall between "
. $peaks{$chr1}{'peak_start'} . " and "
. $peaks{$chr1}{'peak_end'};
}
}
}
Note: I Have not tested the code.
Looking at the screenshots that you have added, this is not going to work.
The points raised by #David are good; try to incorporate those in your programs. (I have borrowed most of the code from #David's post.)
One thing I didn't understand is that why load both peak values and position in hash, as loading one would suffice. As each chromosome has more than one record, use HoA. My solution is based on that. You might need to change the cols and their positions.
use strict;
use warnings;
our $Sep = "\t";
open (my $peak_fh, "<", "data/file2");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ($chromosome) = (split($Sep, $line))[0];
push #{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromo
}
close $peak_fh;
open (my $position_fh, "<", "data/file1");
while (my $line = <$position_fh>) {
chomp $line;
my ($chromosome, $snp, $position) = split ($Sep, $line);
next unless exists $chromosome_hash{$chromosome};
foreach my $peak_line (#{$chromosome_hash{$chromosome}}) {
my ($start,$end) = (split($Sep, $line))[1,2];
if ($position >= $start and $position <= $end) {
print "MATCH REQUIRED-DETAILS...$line-$peak_line\n";
}
else {
print "NO MATCH REQUIRED-DETAILS...$line-$peak_line\n";
}
}
}
close $position_fh;
I used #tuxuday and #David's code to solve this problem. Here is the final code that did what I wanted. I have not only learned a lot, but I have been able to solve my problem successfully! Kudos guys!
use strict;
use warnings;
use feature qw(say);
# Read in peaks and sample files from command line
my $usage = "Usage: $0 <peaks_file> <sample_file>";
my $peaks = shift #ARGV or die "$usage \n";
my $sample = shift #ARGV or die "$usage \n";
our $Sep = "\t";
open (my $peak_fh, "<", "$peaks");
my %chromosome_hash;
while (my $line = <$peak_fh>) {
chomp $line;
next if $line =~ /Chromosome/; #Skip Header
my ($chromosome) = (split($Sep, $line))[0];
push #{$chromosome_hash{$chromosome}}, $line; # Store the line(s) indexed by chromosome
}
close $peak_fh;
open (my $position_fh, "<", "$sample");
while (my $line = <$position_fh>) {
chomp $line;
next if $line =~ /Marker/; #Skip Header
my ($snp, $chromosome, $position) = split ($Sep, $line);
# Check if chromosome in peaks_file matches chromosome in sample_file
next unless exists $chromosome_hash{$chromosome};
foreach my $peak_line (#{$chromosome_hash{$chromosome}}) {
my ($start,$end,$peak_no) = (split( $Sep, $peak_line ))[1,2,3];
if ( $position >= $start and $position <= $end) {
# Print output
say join ("\t",
$snp,
$chromosome,
$position,
$start,
$end,
$peak_no,
);
}
else {
next; # Go to next chromosome
}
}
}
close $position_fh;
I am using perl to search for a specific strings in a file with different sequences listed under different headings. I am able to write script when there is one sequence present i.e one heading but am not able to extrapolate it.
suppose I am reqd to search for some string "FSFSD" in a given file then eg:
can't search if file has following content :
Polons
CACAGTGCTACGATCGATCGATDDASD
HCAYCHAYCHAYCAYCSDHADASDSADASD
Seliems
FJDSKLFJSLKFJKASFJLAKJDSADAK
DASDNJASDKJASDJDSDJHAJDASDASDASDSAD
Teerag
DFAKJASKDJASKDJADJLLKJ
SADSKADJALKDJSKJDLJKLK
Can search when file has one heading i.e:
Terrans
FDKFJSKFJKSAFJALKFJLLJ
DKDJKASJDKSADJALKJLJKL
DJKSAFDHAKJFHAFHFJHAJJ
I need to output the result as "String xyz found under Heading abc"
The code I am using is:
print "Input the file name \n";
$protein= <STDIN>;
chomp $protein;
unless (open (protein, $protein))
{
print "cant open file \n\n";
exit;
}
#prot= <protein>;
close protein;
$newprotein=join("",#prot);
$protein=~s/\s//g;
do{
print "enter the motif to be searched \n";
$motif= <STDIN>;
chomp $motif;
if ($protein =~ /motif/)
{
print "found motif \n\n";
}
else{
print "not found \n\n";
}
}
until ($motif=~/^\s*$/);
exit;
Seeing your code, I want to make a few suggestions without answering your question:
Always, always, always use strict;. For the love of whatever higher power you may (or may not) believe in, use strict;.
Every time you use strict;, you should use warnings; along with it.
Also, seriously consider using some indentation.
Also, consider using obviously different names for different variables.
Lastly, your style is really inconsistent. Is this all your code or did you patch it together? Not trying to insult you or anything, but I recommend against copying code you don't understand - at least try before you just copy it.
Now, a much more readable version of your code, including a few fixes and a few guesses at what you may have meant to do, follows:
use strict;
use warnings;
print "Input the file name:\n";
my $filename = <STDIN>;
chomp $filename;
open FILE, "<", $filename or die "Can't open file\n\n";
my $newprotein = join "", <FILE>;
close FILE;
$newprotein =~ s/\s//g;
while(1) {
print "enter the motif to be searched:\n";
my $motif = <STDIN>;
last if $motif =~ /^\s*$/;
chomp $motif;
# here I might even use the ternary ?: operator, but whatever
if ($newprotein =~ /$motif/) {
print "found motif\n\n";
}
else {
print "not found\n\n";
}
}
The main issue is how do you distinguish between a header and the data, from your examples I assume that a line is a header iff it contains a lower case letter.
use strict;
use warnings;
print "Enter the motif to be searched \n";
my $motif = <STDIN>;
chomp($motif);
my $header;
while (<>) {
if(/[a-z]/) {
$header = $_;
next;
}
if (/$motif/o) {
print "Found $motif under header $header\n";
exit;
}
}
print "$motif not found\n";
So you are saying you are able to read one line and achieve this task. But when you have more than one line in the file you are not able to do the same thing?
Just have a loop and read the file line by line.
$data_file="yourfilename.txt";
open(DAT, '<', $data_file) || die("Could not open file!");
while( my $line = <DAT>)
{
//same command that you do for one 'heading' will go here. $line represents one heading
}
EDIT: You're posted example has no clear delimiter, you need to find a clear division between your headings and your sequences. You could use multiple linebreaks or a non-alphanumeric character such as ','. Whatever you choose, let WHITESPACE in the following code be equal to your chosen delimiter. If you are stuck with the format you have, you will have to change the following grammar to disregard whitespace and delimit through capitalization (makes it slightly more complex).
Simple way ( O(n^2)? ) is to split the file using a whitespace delimiter, giving you an array of headings and sequences( heading[i] = split_array[i*2], sequence[i] = split_array[i*2+1]). For each sequence perform your regex.
Slightly more difficult way ( O(n) ), given a BNF grammar such as:
file: block
| file block
;
block: heading sequence
heading: [A-Z][a-z]
sequence: [A-Z][a-z]
Try recursive decent parsing (pseudo-code, I don't know perl):
GLOBAL sequenceHeading, sequenceCount
GLOBAL substringLength = 5
GLOBAL substring = "FSFSD"
FUNC file ()
WHILE nextChar() != EOF
block()
printf ( "%d substrings in %s", sequenceCount, sequenceHeading )
END WHILE
END FUNC
FUNC block ()
heading()
sequence()
END FUNC
FUNC heading ()
in = popChar()
IF in == WHITESPACE
sequenceHeading = tempHeading
tempHeading = ""
RETURN
END IF
tempHeading &= in
END FUNC
FUNC sequence ()
in = popChar()
IF in == WHITESPACE
sequenceCount = count
count = 0
i = 0
END IF
IF in == substring[i]
i++
IF i > substringLength
count++
END IF
ELSE
i = 0
END IF
END FUNC
For detailed information on recursive decent parsing, check out Let's Build a Compiler or Wikipedia.
use strict;
use warnings;
use autodie qw'open';
my($filename,$motif) = #ARGV;
if( #ARGV < 1 ){
print "Please enter file name:\n";
$filename = <STDIN>;
chomp $filename;
}
if( #ARGV < 2 ){
print "Please enter motif:\n";
$motif = <STDIN>;
chomp $motif;
}
my %data;
# fill in %data;
{
open my $file, '<', $filename;
my $heading;
while( my $line = <$file> ){
chomp $line;
if( $line ne uc $line ){
$heading = $line;
next;
}
if( $data{$heading} ){
$data{$heading} .= $line;
} else {
$data{$heading} = $line;
}
}
}
{
# protect against malicious users
my $motif_cmp = quotemeta $motif;
for my $heading ( keys %data ){
my $data = $data{$heading};
if( $data =~ /$motif_cmp/ ){
print "String $motif found under Heading $heading\n";
exit 0;
}
}
die "String $motif not found anywhere in file $filename\n";
}