How can I parse multiline records in Perl? - perl

I'm trying to parse a string that uses '#' as a delimiter. Each record has 3 lines:
101#Introduction to the Professor#SG_FEEL#QUE_NOIMAGE#
head up to the Great Hall and speak to the professor to check in for class.#
#
102#Looking for Instructors#SG_FEEL#QUE_NOIMAGE#
Look for the Battle Instructor.#
Talk to Battle Instructor#
103#Battle Instructor#SG_FEEL#QUE_NOIMAGE#
You have spoken to the Battle Instructor#
#
How do I get each value before the '#' delimiter, so I can produce a new format that looks like this?
[101] = {
Title = "Introduction to the Professor",
Description = {
"head up to the Great Hall and speak to the professor to check in for class."
},
Summary = ""
},
[102] = {
Title = "Looking for Instructors",
Description = {
"Look for the Battle Instructor."
},
Summary = "Talk to Battle Instructor"
},
[103] = {
Title = "Battle Instructor",
Description = {
"You have spoken to the Battle Instructor"
},
Summary = ""
},
Also, there will be multiple records from 101 to n.
I'm trying to use split with the code below:
#!/usr/bin/perl
use strict;
use warnings;
my $data = '101#Introduction to the Professor#SG_FEEL#QUE_NOIMAGE#';
my @values = split('#', $data);
foreach my $val (@values) {
print "$val\n";
}
exit 0;
and the output:
101
Introduction to the Professor
SG_FEEL
QUE_NOIMAGE
How do I read multi-line data? And how do I exclude some fields? For example, to match the new format I don't need the SG_FEEL and QUE_NOIMAGE values.

The Perl special variable $/ sets the "input record separator"—the string that Perl uses to decide where a line ends. You can set that to something else.
use v5.26;
use utf8;
use strict;
use warnings;
$/ = "\n\n"; # set the input record separator
while( <DATA> ) {
chomp;
say "$. ------\n", $_;
}
__END__
101#Introduction to the Professor#SG_FEEL#QUE_NOIMAGE#
head up to the Great Hall and speak to the professor to check in for class.#
#

102#Looking for Instructors#SG_FEEL#QUE_NOIMAGE#
Look for the Battle Instructor.#
Talk to Battle Instructor#

103#Battle Instructor#SG_FEEL#QUE_NOIMAGE#
You have spoken to the Battle Instructor#
#
The output shows that you read whole records with each call to <DATA>:
1 ------
101#Introduction to the Professor#SG_FEEL#QUE_NOIMAGE#
head up to the Great Hall and speak to the professor to check in for class.#
#
2 ------
102#Looking for Instructors#SG_FEEL#QUE_NOIMAGE#
Look for the Battle Instructor.#
Talk to Battle Instructor#
3 ------
103#Battle Instructor#SG_FEEL#QUE_NOIMAGE#
You have spoken to the Battle Instructor#
#
From there you can parse that record however you need.
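For example, replacing the loop above with something like this (a rough sketch that assumes each record has the three-line shape shown in the question, with the summary on the optional third line) gets you close to the layout you asked for:
$/ = "\n\n";                          # paragraph-style records, as above
while ( my $record = <DATA> ) {
    chomp $record;
    my @lines = split /\n/, $record;
    s/#\s*$// for @lines;             # strip the trailing '#' from each line

    # first line: id # Title # SG_FEEL # QUE_NOIMAGE
    my ( $id, $title ) = split /#/, shift @lines;

    # second line is the description; a non-empty third line is the summary
    my $description = shift(@lines) // '';
    my $summary     = shift(@lines) // '';

    print qq([$id] = {\n);
    print qq(    Title = "$title",\n);
    print qq(    Description = {\n        "$description"\n    },\n);
    print qq(    Summary = "$summary"\n);
    print qq(},\n);
}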

Reading multiple lines is easy, see readline:
open my $fh, '<', $filename
or die "Couldn't read '$filename': $!";
my @input = <$fh>;
Now you want to go through all lines and look at what to do with them:
my $linenumber = 0;
my %info; # We want to collect information
while ($linenumber < $#input) {
Each line that starts with nnn# starts a new item:
if( $input[ $linenumber ] =~ /^(\d+)#/ ) {
my @data = split /#/, $input[ $linenumber ];
$info{ number } = $data[0];
$info{ Title } = $data[1];
$linenumber++;
};
Now, read stuff into the description until we encounter an empty line:
while ($input[$linenumber] !~ /^#$/) {
$info{ Description } .= $input[$linenumber];
$linenumber++;
};
$linenumber++; # skip the last "#" line
Now, output the stuff in %info; formatting is left as an exercise. I've used qq{} for demonstration purposes. You will want to change that to qq() wherever the output itself contains unbalanced braces:
print qq{Number: "$info{ number }"\n};
print qq{Title: "$info{ Title }"\n};
print qq(Description: {"$info{ Description }"}\n);
};
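Put together as a runnable sketch (the file name is hypothetical; I've also added a small guard so a record without a closing '#' line, like 102 in your sample, doesn't swallow the next one; splitting the summary out of the description is left as an exercise, as above):
#!/usr/bin/perl
use strict;
use warnings;

my $filename = 'quests.txt';    # hypothetical input file
open my $fh, '<', $filename
    or die "Couldn't read '$filename': $!";
my @input = <$fh>;
close $fh;
chomp @input;

my $linenumber = 0;
while ( $linenumber <= $#input ) {
    my %info = ( number => '', Title => '', Description => '' );

    # a line that starts with nnn# begins a new item
    if ( $input[$linenumber] =~ /^(\d+)#/ ) {
        my @data = split /#/, $input[$linenumber];
        $info{ number } = $data[0];
        $info{ Title }  = $data[1];
        $linenumber++;
    }

    # collect lines until a lone '#' or the start of the next item
    while (    $linenumber <= $#input
            && $input[$linenumber] !~ /^#$/
            && $input[$linenumber] !~ /^\d+#/ ) {
        ( my $line = $input[$linenumber] ) =~ s/#$//;
        $info{ Description } .= $line;
        $linenumber++;
    }

    # skip the terminating '#' line, if there is one
    $linenumber++
        if $linenumber <= $#input && $input[$linenumber] =~ /^#$/;

    print qq{Number: "$info{ number }"\n};
    print qq{Title: "$info{ Title }"\n};
    print qq(Description: {"$info{ Description }"}\n);
}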

Related

Perl read and write text file with strings

Friends, I need help. The following is my INPUT TEXT FILE:
Andrew UK
Cindy China
Rupa India
Gordon Australia
Peter New Zealand
I want to convert the above into a hash, and to write it back into a file when the records exist in a directory. I have tried the following (it does not work).
#!/usr/perl/5.14.1/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %hash = ();
my $file = ".../input_and_output.txt";
my $people;
my $country;
open (my $fh, "<", $file) or die "Can't open the file $file: ";
my $line;
while (my $line =<$fh>) {
my ($people) = split("", $line);
$hash{$people} = 1;
}
foreach my $people (sort keys %hash) {
my @country = $people;
foreach my $c (@country) {
my $c_folder = `country/test1_testdata/17.26.6/$c/`;
if (-d $cad_root){
print "Exit\n";
} else {
print "NA\n";
}
}
This is the primary problem:
my ($people) = split("", $line);
You are splitting using an empty string, and you are assigning the return value to a single variable (which will just end up with the first character of each line).
Instead, you should split on ' ' (a single space character which is a special pattern):
As another special case, ... when the PATTERN is either omitted or a string composed of a single space character (such as ' ' or "\x20" , but not e.g. / /). In this case, any leading whitespace in EXPR is removed before splitting occurs, and the PATTERN is instead treated as if it were /\s+/; in particular, this means that any contiguous whitespace (not just a single space character) is used as a separator.
Limit the number of fields returned to ensure the integrity of country names with spaces:
#!/usr/bin/env perl
use strict;
use warnings;
my @people;
while (my $line = <DATA>) {
$line =~ /\S/ or next;
$line =~ s/\s+\z//;
push @people, [ split ' ', $line, 2 ];
}
use YAML::XS;
print Dump \@people;
__DATA__
Andrew UK
Cindy China
Rupa India
Gordon Australia
Peter New Zealand
The entries are added to an array so that 1) the input order is preserved, and 2) two people with the same name but from different countries do not result in one entry being lost.
If the order is not important, you could just use a hash keyed on country names with people's names in an array reference for each entry. For now, I am going to assume order matters (it would help us help you if you put more effort into formulating a clear question).
One option is to now go through the list of person-country pairs, and print all those pairs for which the directory country/test1_testdata/17.26.6/$c/ exists (incidentally, in your code you have
my $c_folder = `country/test1_testdata/17.26.6/$c/`;
That will try to execute a program called country/test1_testdata/17.26.6/$c/ and save its output in $c_folder if it produces any. The moral of the story: In programming, precision matters. Just because ` looks like ', that doesn't mean you can use one to mean the other.)
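What you presumably meant, inside your foreach loop, is a plain string and a -d file test, something along these lines (paths taken from your own code):
my $c_folder = "country/test1_testdata/17.26.6/$c/";   # an ordinary string, not backticks
if ( -d $c_folder ) {
    print "Exit\n";
} else {
    print "NA\n";
}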
Given that your question is focused on hashes, I use an array of references to anonymous hashes to store the list of people-country pairs in the code below. I cache the result of the lookup to reduce the number of times you need to hit the disk.
#!/usr/bin/env perl
use strict;
use warnings;
@ARGV == 2 ? run( @ARGV )
: die_usage()
;
sub run {
my $people_data_file = shift;
my $country_files_location = shift;
open my $in, '<', $people_data_file
or die "Failed to open '$people_data_file': $!";
my @people;
my %countries;
while (my $line = <$in>) {
next unless $line =~ /\S/; # ignore lines consisting of blanks
$line =~ s/\s+\z//;# remove all trailing whitespace
my ($name, $country) = split ' ', $line, 2;
push @people, { name => $name, country => $country };
$countries{ $country } = undef;
}
# At this point, @people has a list of person-country pairs
# We are going to use %countries to reduce the number of
# times we need to check the existence of a given directory,
# assuming that the directory tree is stable while this program
# is running.
PEOPLE:
for my $person ( @people ) {
my $country = $person->{country};
if ($countries{ $country }) {
print join("\t", $person->{name}, $country), "\n";
}
elsif (-d "$country_files_location/$country/") {
$countries{ $country } = 1;
redo PEOPLE;
}
}
}
sub die_usage {
die "Need data file name and country files location\n";
}
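With the file and directory names from the question (the script name here is hypothetical), you would invoke it as something like:
perl script.pl input_and_output.txt country/test1_testdata/17.26.6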
Now, there are a bazillion variations on this, which is why it is important for you to formulate a clear and concise question, so that people trying to help you can answer your specific questions instead of each coming up with his/her own solution to the problem as they see it. For example, one could also do this:
#!/usr/bin/env perl
use strict;
use warnings;
@ARGV == 2 ? run( @ARGV )
: die_usage()
;
sub run {
my $people_data_file = shift;
my $country_files_location = shift;
open my $in, '<', $people_data_file
or die "Failed to open '$people_data_file': $!";
my %countries;
while (my $line = <$in>) {
next unless $line =~ /\S/; # ignore lines consisting of blanks
$line =~ s/\s+\z//;# remove all trailing whitespace
my ($name, $country) = split ' ', $line, 2;
push @{ $countries{$country} }, $name;
}
for my $country (keys %countries) {
-d "$country_files_location/$country"
or delete $countries{ $country };
}
# At this point, %countries maps each country for which
# we have a data file to a list of people. We can then
# print those quite simply so long as we don't care about
# replicating the original order of lines from the original
# data file. People's names will still be sorted in order
# of appearance in the original data file for each country.
while (my ($country, $people) = each %countries) {
for my $person ( @$people) {
print join("\t", $person, $country), "\n";
}
}
}
sub die_usage {
die "Need data file name and country files location\n";
}
If what you want is a counter of names in a hash, then I got you, buddy!
I won't attempt the rest of the code because you are checking a folder of records
that I don't have access to, so I can't troubleshoot anything more than this.
I see one of your problems. Look at this:
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say'; # Really like using say instead of print because no need for newline.
my $file = 'input_file.txt';
my $fh; # A filehandle.
my %hash;
my $people;
my $country;
my $line;
unless(open($fh, '<', $file)){die "Could not open file $file because $!"}
while($line = <$fh>)
{
($people, $country) = split(/\s{2,}/, $line); # splitting on at least two spaces
say "$people \t $country"; # Just printing out the columns in the file or people and Country.
$hash{$people}++; # Just counting all the people in the hash.
# Seeing how many unique names there are, like is there more than one Cindy, etc ...?
}
say "\nNow I'm just sorting the hash of people by names.";
foreach(sort{$a cmp $b} keys %hash)
{
say "$_ => $hash{$_}"; # Based on your file. The counter is at 1 because nobody has the same names.
}
Here is the output. As you can see I fixed the problem by splitting on at least two white-spaces so the country names don't get cut out.
Andrew UK
Cindy China
Rupa India
Gordon Australia
Peter New Zealand
Andrew United States
Now I'm just sorting the hash of people by names.
Andrew => 2
Cindy => 1
Gordon => 1
Peter => 1
Rupa => 1
I added another Andrew to the file. This Andrew is from the United States
as you can see. I see one of your problems. Look at this:
my ($people) = split("", $line);
You are splitting on characters as there is no space between those quotes.
If you look at this change now, you are splitting on at least one space.
my ($people) = split(" ", $line);
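(If you also need the country, note that splitting on spaces alone would break names like "New Zealand"; a field limit, as used in the earlier answers, keeps them in one piece:)
my ( $people, $country ) = split ' ', $line, 2;   # at most two fields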

Search and add a text file for words in an XML file

I have two files. The first has a sequence of words on each line
bus do car
car tree
The second file is an XML file
<title>i have a car. i take bus..</title>
I want to search the text file for each word in the XML file. If it is found, I want to insert all lines from the text file where it appears, with any spaces replaced by x.
The result file would be
<title>i have a car busxdoxcar carxtree. i take bus busxdoxcar..</title>
I try this
use strict;
use warnings;
use autodie;
my $QueryFile = "query.txt";
my $SequenceFile = "Seq_2_terms_150.txt";
my %hashlist;
open NewQueryFile ,">./NewQuery.txt"
or die "Cannot create NewQuery.txt";
open(my $fh,$SequenceFile)
or die "$SequenceFile : $!";
while ( <$fh> ) {
chop;
s/^\s+|\s+$//g;
my $h = \%hashlist;
foreach ( split('\s+', $_) ) {
$h->{$_} //= {};
$h = $h->{$_};
}
$h->{'#'} = 1;
}
close $fh;
open(my $fd, $QueryFile)
or die "$QueryFile : $!";
for my $xml (<$fd>) {
foreach my $line (split(/\n/, $xml)) {
my #words = split(/\s/, $line);
if $words = @hashlist[$_] {
print NewQueryFile join ('x',$words) ;
}
}
}
close NewQueryFile ;
close($fd);
I have put together a quick script to indicate how one might go about this.
I have not bothered with xml, because that may well have left me in a bad mood.
My advice would be: do use variables; whatever you save by not doing so is lost as your code gets confusing and then buggy.
#!/usr/bin/env perl
use strict;
use warnings;
# Notes:
# - more than one space or tab in a row are mangled: They become one space only
# - the query file is not checked for containing actual words to match on,
# it is assumed to be suitable
# - I have made no attempt to parse xml. You should use a parser for that.
# Search Stack Overflow or Google or CPAN or all of those for examples.
# - The replace_xml_text function can be used on the text supplied by the
# parser to get the desired output
# - a feeble attempt is made to deal with punctuation in replace_xml_text
# - This code is not really tested
my %query_words;
my $query_fn = 'query.txt';
open (my $fh, "<",$query_fn) or die "could not open file '$query_fn'";
# build list of words from query file
while ( <$fh> ){
chomp;
# Words mentioned in line.
my @words = split(/\s+/,$_);
# Words joined by 'x'. Seems a strange choice *shrug*.
# This is used to replace words with later.
my $line = join("x",@words);
# Storing in arrayref, this seems easier to me
# than concatenating repeatedly and getting the spaces right.
for my $word ( @words ){
push @{$query_words{$word}}, $line;
}
}
# Expects the text to replace.
# Returns the text to replace it with.
sub replace_xml_text {
my $original_text = shift;
my @words;
for my $word ( split(/\s+/,$original_text) ){
my $punctuation = '';
# Remove punctuation before matching,
# but do preserve it.
if ( $word =~s /(\s*[,.]\s*)$// ){
$punctuation = $1;
}
if ( my $additions = $query_words{$word} ){
$word = join(" ",$word,@$additions);
}
# Put punctuation back.
$word .= $punctuation;
# Done replacing in this word, next
push @words,$word;
}
return join(" ",@words);
}
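A hypothetical usage example, assuming the query file from the question has already been read into %query_words by the loop above:
print replace_xml_text('i have a car'), "\n";
# prints something like: i have a car busxdoxcar carxtree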

Perl hash formatting

I have a log file like below
ID: COM-1234
Program: Swimming
Name: John Doe
Description: Joined on July 1st
------------------------------ID: COM-2345
Program: Swimming
Name: Brock sen
Description: Joined on July 1st
------------------------------ID: COM-9876
Program: Swimming
Name: johny boy
Description: Joined on July 1st
------------------------------ID: COM-9090
Program: Running
Name: justin kim
Description: Good Record
------------------------------
and I want to group it based on the Program (Swimming, Running, etc.) and want a display like:
PROGRAM: Swimming
==>ID
COM-1234
COM-2345
COM-9876
PROGRAM: Running
==>ID
COM-9090
I'm very new to Perl and I wrote the below piece (incomplete).
#!/usr/bin/perl
use Data::Dumper;
$/ = "%%%%";
open( AFILE, "D:\\mine\\out.txt");
while (<AFILE>)
{
@temp = split(/-{20,}/, $_);
}
close (AFILE);
my %hash = @new;
print Dumper(\%hash);
I have read from Perl tutorials that hash key-value pairs will take unique keys with multiple values, but I'm not sure how to make use of that.
I'm able to read a file and store it into a hash, but unsure how to process it into the aforementioned format. Any help is really appreciated. Thanks.
I always prefer to write programs like this so they read from STDIN as that makes them more flexible.
I'd do it like this:
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;
# Read one "chunk" of data at a time
local $/ = '------------------------------';
# A hash to store our results.
my %program;
# While there's data on STDIN...
while (<>) {
# Remove the end of record marker
chomp;
# Skip any empty records
# (i.e. ones with no printable characters)
next unless /\S/;
# Extract the information that we want with a regex
my ($id, $program) = /ID: (.+?)\n.*Program: (.+?)\n/s;
# Build a hash of arrays containing our data
push @{$program{$program}}, $id;
}
# Now we have all the data we need, so let's display it.
# Keys in %program are the program names
foreach my $p (keys %program) {
say "PROGRAM: $p\n==>ID";
# $program{$p} is a reference to an array of IDs
say "\t$_" for @{$program{$p}};
say '';
}
Assuming this is in a program called programs.pl and the input data is in programs.txt, then you'd run it like this:
C:/> programs.pl < programs.txt
Always put use warnings; and use strict; at the top of the program. And always use the three-argument form of open:
open my $fh, "<", "D:\\mine\\out.txt";
my %hash;
while (<$fh>){
if(/ID/)
{
my $nxt = <$fh>;
s/.*?ID: //g;
$hash{"$nxt==>ID \n"}.=" $_";
}
}
print %hash;
Output
Program: Running
==>ID
COM-9090
Program: Swimming
==>ID
COM-1234
COM-2345
COM-9876
In your input file, the Program is found on the line after the ID. So I used
my $nxt = <$fh>; now the program name is stored in the $nxt variable.
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my %hash = ();
open my $IN, "<", "your file name here" or die "Error: $!\n";
while (<$IN>) {
if ($_ =~ m/^\s*-*ID:\s*COM/) {
(my $id) = ($_ =~ m/\s*ID:\s*(.*)/);
my $prog_name = <$IN>;
chomp $prog_name;
$prog_name =~ s/Program/PROGRAM/;
$hash{$prog_name} = [] unless $hash{$prog_name};
push @{$hash{$prog_name}}, $id;
}
}
close $IN;
print Dumper(\%hash);
Output will be:
$VAR1 = {
'PROGRAM: Running' => [
'COM-9090'
],
'PROGRAM: Swimming' => [
'COM-1234',
'COM-2345',
'COM-9876'
]
};
Let's look at these two lines:
$hash{$prog_name} = [] unless $hash{$prog_name};
push @{$hash{$prog_name}}, $id;
The first line initializes an empty array reference as the value if the hash entry is undefined. The second line pushes the ID to the end of that array (regardless of the first line).
Moreover, the first line is not mandatory. Perl knows what you mean if you just write push @{$hash{$prog_name}}, $id; and interprets it as if you said "go to the value of this key", creating it if it wasn't already there. Now you say that the value is an array and you push $id onto the list.
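That behaviour is known as autovivification. A tiny standalone sketch (the key and values are just for illustration):
my %hash;
push @{ $hash{'PROGRAM: Swimming'} }, 'COM-1234';    # the array ref springs into existence
push @{ $hash{'PROGRAM: Swimming'} }, 'COM-2345';    # appended to the same array
print scalar @{ $hash{'PROGRAM: Swimming'} }, "\n";  # prints 2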

perl code that makes a list of all words that follow a given string in a text file

This is difficult to describe but useful in extracting data in the output I am dealing with (I hope to use this code for a large number of purposes)
Here is an example:
Say I have a text file with words and some special characters ($, #, !, etc) that reads:
blah blah
blah add this word to the list: 1234.56 blah blah
blah blah
blah now don't forget to add this word to the list: PINAPPLE blah blah
And for bonus points,
it would be nice to know that the script
would be able to add this word to the list: 1!@#$%^&*()[]{};:'",<.>/?asdf blah blah
blah blah
As the example implies, I would like to add whatever "word" (defined as any string that does not contain spaces in this context) to some form of list such that I can extract elements of the list as list[2] list[3] or list(4) list(5), or something along those lines.
This would be very versatile, and after some questioning in another thread and another forum, I am hoping that having it in perl would make it relatively fast in execution--so it will work well even for large text files.
I intend to use this to read data from output files generated from different programs regardless of structure of the output file, i.e. if I know the string to search for, I can get the data.
I think there are some missing words in your question :)
But this sounds like what you want (assuming even the "large text files" fit in memory - if not, you'd loop through line by line pushing onto $list instead).
my $filecontents = File::Slurp::read_file("filename");
@list = $filecontents =~ /add this word to the list: (\S+)/g;
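The line-by-line variant mentioned above might look like this (the file name is a placeholder):
my @list;
open my $fh, '<', 'filename' or die "can't open: $!";
while ( my $line = <$fh> ) {
    push @list, $line =~ /add this word to the list: (\S+)/g;
}
close $fh;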
If the string for the searches is the same, let Perl do the processing by using the search phrase as input record separator:
open my $fh, '<', 'test.dat' or die "can't open $!"; # usual way of opening a file
my #list; # declare empty array 'list' (results)
$/= 'add this word to the list:'; # define custom input record separator
while( <$fh> ) { # read records one by one
push @list, $1 if /(\S\S*)/
}
close $fh; # thats it, close file!
print join "\n", @list; # this will list the results
The above is "almost OK": it will save the first word of the file in $list[0] because
of the way the processing works. But this way makes it very easy to comprehend (IMHO):
blah <== first word of the file
1234.56
PINAPPLE
1!@#$%^&*()[]{};:'",<.>/?asdf
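If that stray first entry bothers you, one simple fix is to drop it after the loop, since it is just whatever text came before the first marker:
shift @list;   # discard the word captured before the first marker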
Q: Why not simply look the strings up with one regex over the entire data (as has already been suggested here)? Because in my experience, the record-wise processing with a per-record regular expression (probably a very complicated regex in a real use case) will be faster - especially on very large files. That's the reason.
Real world test
To back this claim up, I performed some tests with a 200MB data file containing 10,000 of
your markers. The test source follows:
use strict;
use warnings;
use Benchmark qw(timethese cmpthese);
use File::Slurp;
# 'data.dat', a 200MB data file, containing 10_000
# markers: 'add this word to the list:' and a
# one of different data items after each.
my $t = timethese(10,
{
'readline+regex' => sub { # trivial reading line-by-line
open my $fh, '<', 'data.dat' or die "can't open $!";
my @list;
while(<$fh>) {
push @list,$1 if /add this word to the list:\s*(\S+)/
}
close $fh;
return scalar @list;
},
'readIRS+regex' => sub { # treat each 'marker' as start of an input record
open my $fh, '<', 'data.dat' or die "can't open $!";
$/= 'add this word to the list:'; # new IRS
my @list;
while(<$fh>) { push @list, $1 if /(\S+)/ }
close $fh;
return scalar @list;
},
'slurp+regex' => sub { # read the whole file and apply regular expression
my $filecontents = File::Slurp::read_file('data.dat');
my @list = $filecontents =~ /add this word to the list:\s*(\S+)/g;
return scalar @list;
},
}
);
cmpthese( $t ) ;
which outputs the following timing results:
Benchmark: timing 10 iterations of readIRS+regex, readline+regex, slurp+regex...
readIRS+regex: 43 wallclock secs (37.11 usr + 5.48 sys = 42.59 CPU) @ 0.23/s (n=10)
readline+regex: 42 wallclock secs (36.47 usr + 5.49 sys = 41.96 CPU) @ 0.24/s (n=10)
slurp+regex: 142 wallclock secs (135.85 usr + 4.98 sys = 140.82 CPU) @ 0.07/s (n=10)
s/iter slurp+regex readIRS+regex readline+regex
slurp+regex 14.1 -- -70% -70%
readIRS+regex 4.26 231% -- -1%
readline+regex 4.20 236% 1% --
which basically means that the simple line-wise reading and the block-wise reading by custom IRS
are about 2.3 times faster (one pass in ~4 sec) than slurping the file and scanning by regular
expression.
This basically says that if you are processing files of this size on a system like mine ;-),
you should read line-by-line if your search problem is located on one line and read
by custom input record separator if your search problem involves more than one line (my $0.02).
Want to make the test too? This one:
use strict;
use warnings;
sub getsomerandomtext {
my ($s, $n) = ('', (shift));
while($n --> 0) {
$s .= chr( rand(80) + 30 );
$s .= "\n" if rand($n) < $n/10
}
$s x 10
}
my @stuff = (
q{1234.56}, q{PINEAPPLE}, q{1!@#$%^&*()[]{};:'",<.>/?asdf}
);
my $fn = 'data.dat';
open my $fh, '>', $fn or die $!;
my $phrase='add this word to the list:';
my $x = 10000;
while($x --> 0) {
print $fh
getsomerandomtext(1000), ' ',
$phrase, ' ', $stuff[int(rand(@stuff))], ' ',
getsomerandomtext(1000), "\n",
}
close $fh;
print "done.\n";
creates the 200MB input file 'data.dat'.
Regards
rbo
How about:
my(@list);
my $rx = qr/.*add this word to the list: +(\S+)/;
while (<>)
{
while (m/$rx/)
{
push @list, $1;
s/$rx//;
}
}
This allows for long lines containing more than one of the 'add' markers. If there definitively can only be one, replace the inner while with if. (Except, of course, that I used a greedy '.*' which snaffles up everything to the last occurrence of the match...)
my(@list);
my $rx = qr/(?:.*?)add this word to the list: +(\S+)/;
while (<>)
{
while (m/$rx/)
{
push @list, $1;
s/$rx//;
}
}
With a selectable marker:
my $marker = "add this word to the list:";
my(@list);
my $rx = qr/(?:.*?)$marker\s+(\S+)/;
while (<>)
{
while (m/$rx/)
{
push @list, $1;
s/$rx//;
}
}
With no repeats:
my $marker = "add this word to the list:";
my(%hash);
my(@list);
my $rx = qr/(?:.*?)$marker\s+(\S+)/;
while (<>)
{
while (m/$rx/)
{
push @list, $1 unless defined $hash{$1};
$hash{$1} = 1;
s/$rx//;
}
}
Etc.
And, as @ysth points out, you (I) don't need the substitution - Perl DWIM's correctly a g-qualified match in the inner loop:
#!/bin/perl -w
use strict;
my(@list);
my(%hash);
my($marker) = "add this word to the list:";
my $rx = qr/(?:.*?)$marker\s+(\S+)/;
while (<>)
{
while (m/$rx/g)
{
push @list, $1 unless defined $hash{$1};
$hash{$1} = 1;
}
}
foreach my $i (@list)
{
print "$i\n";
}

How do I search for a string in file with different headings?

I am using Perl to search for specific strings in a file with different sequences listed under different headings. I am able to write the script when there is one sequence present, i.e. one heading, but am not able to extrapolate it.
Suppose I am required to search for some string "FSFSD" in a given file. For example:
I can't search if the file has the following content:
Polons
CACAGTGCTACGATCGATCGATDDASD
HCAYCHAYCHAYCAYCSDHADASDSADASD
Seliems
FJDSKLFJSLKFJKASFJLAKJDSADAK
DASDNJASDKJASDJDSDJHAJDASDASDASDSAD
Teerag
DFAKJASKDJASKDJADJLLKJ
SADSKADJALKDJSKJDLJKLK
I can search when the file has one heading, i.e.:
Terrans
FDKFJSKFJKSAFJALKFJLLJ
DKDJKASJDKSADJALKJLJKL
DJKSAFDHAKJFHAFHFJHAJJ
I need to output the result as "String xyz found under Heading abc"
The code I am using is:
print "Input the file name \n";
$protein= <STDIN>;
chomp $protein;
unless (open (protein, $protein))
{
print "cant open file \n\n";
exit;
}
@prot= <protein>;
close protein;
$newprotein=join("",@prot);
$protein=~s/\s//g;
do{
print "enter the motif to be searched \n";
$motif= <STDIN>;
chomp $motif;
if ($protein =~ /motif/)
{
print "found motif \n\n";
}
else{
print "not found \n\n";
}
}
until ($motif=~/^\s*$/);
exit;
Seeing your code, I want to make a few suggestions without answering your question:
Always, always, always use strict;. For the love of whatever higher power you may (or may not) believe in, use strict;.
Every time you use strict;, you should use warnings; along with it.
Also, seriously consider using some indentation.
Also, consider using obviously different names for different variables.
Lastly, your style is really inconsistent. Is this all your code or did you patch it together? Not trying to insult you or anything, but I recommend against copying code you don't understand - at least try before you just copy it.
Now, a much more readable version of your code, including a few fixes and a few guesses at what you may have meant to do, follows:
use strict;
use warnings;
print "Input the file name:\n";
my $filename = <STDIN>;
chomp $filename;
open FILE, "<", $filename or die "Can't open file\n\n";
my $newprotein = join "", <FILE>;
close FILE;
$newprotein =~ s/\s//g;
while(1) {
print "enter the motif to be searched:\n";
my $motif = <STDIN>;
last if $motif =~ /^\s*$/;
chomp $motif;
# here I might even use the ternary ?: operator, but whatever
if ($newprotein =~ /$motif/) {
print "found motif\n\n";
}
else {
print "not found\n\n";
}
}
The main issue is how you distinguish between a header and the data; from your examples I assume that a line is a header iff it contains a lowercase letter.
use strict;
use warnings;
print "Enter the motif to be searched \n";
my $motif = <STDIN>;
chomp($motif);
my $header;
while (<>) {
if(/[a-z]/) {
$header = $_;
next;
}
if (/$motif/o) {
print "Found $motif under header $header\n";
exit;
}
}
print "$motif not found\n";
So you are saying you are able to read one line and achieve this task. But when you have more than one line in the file you are not able to do the same thing?
Just have a loop and read the file line by line.
$data_file="yourfilename.txt";
open(DAT, '<', $data_file) || die("Could not open file!");
while( my $line = <DAT>)
{
# same command that you do for one 'heading' will go here. $line represents one heading
}
EDIT: Your posted example has no clear delimiter; you need to find a clear division between your headings and your sequences. You could use multiple linebreaks or a non-alphanumeric character such as ','. Whatever you choose, let WHITESPACE in the following code be equal to your chosen delimiter. If you are stuck with the format you have, you will have to change the following grammar to disregard whitespace and delimit through capitalization (which makes it slightly more complex).
A simple way ( O(n^2)? ) is to split the file using a whitespace delimiter, giving you an array of headings and sequences (heading[i] = split_array[i*2], sequence[i] = split_array[i*2+1]). For each sequence, perform your regex, as sketched below.
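A rough Perl sketch of that simple approach, assuming the chosen delimiter is a blank line between every heading and every sequence (the file name and motif are placeholders):
use strict;
use warnings;

my $motif = 'FSFSD';
open my $fh, '<', 'sequences.txt' or die "can't open: $!";
my $text = do { local $/; <$fh> };      # slurp the whole file
close $fh;

# split_array[i*2] is a heading, split_array[i*2+1] is its sequence
my @parts = split /\n\s*\n/, $text;
for ( my $i = 0; $i + 1 < @parts; $i += 2 ) {
    my ( $heading, $sequence ) = @parts[ $i, $i + 1 ];
    $sequence =~ s/\s+//g;              # join wrapped sequence lines
    print "String $motif found under Heading $heading\n"
        if index( $sequence, $motif ) >= 0;
}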
Slightly more difficult way ( O(n) ), given a BNF grammar such as:
file: block
| file block
;
block: heading sequence
heading: [A-Z][a-z]
sequence: [A-Z][a-z]
Try recursive descent parsing (pseudo-code, I don't know Perl):
GLOBAL sequenceHeading, sequenceCount
GLOBAL substringLength = 5
GLOBAL substring = "FSFSD"
FUNC file ()
WHILE nextChar() != EOF
block()
printf ( "%d substrings in %s", sequenceCount, sequenceHeading )
END WHILE
END FUNC
FUNC block ()
heading()
sequence()
END FUNC
FUNC heading ()
in = popChar()
IF in == WHITESPACE
sequenceHeading = tempHeading
tempHeading = ""
RETURN
END IF
tempHeading &= in
END FUNC
FUNC sequence ()
in = popChar()
IF in == WHITESPACE
sequenceCount = count
count = 0
i = 0
END IF
IF in == substring[i]
i++
IF i > substringLength
count++
END IF
ELSE
i = 0
END IF
END FUNC
For detailed information on recursive descent parsing, check out Let's Build a Compiler or Wikipedia.
use strict;
use warnings;
use autodie qw'open';
my($filename,$motif) = @ARGV;
if( @ARGV < 1 ){
print "Please enter file name:\n";
$filename = <STDIN>;
chomp $filename;
}
if( @ARGV < 2 ){
print "Please enter motif:\n";
$motif = <STDIN>;
chomp $motif;
}
my %data;
# fill in %data;
{
open my $file, '<', $filename;
my $heading;
while( my $line = <$file> ){
chomp $line;
if( $line ne uc $line ){
$heading = $line;
next;
}
if( $data{$heading} ){
$data{$heading} .= $line;
} else {
$data{$heading} = $line;
}
}
}
{
# protect against malicious users
my $motif_cmp = quotemeta $motif;
for my $heading ( keys %data ){
my $data = $data{$heading};
if( $data =~ /$motif_cmp/ ){
print "String $motif found under Heading $heading\n";
exit 0;
}
}
die "String $motif not found anywhere in file $filename\n";
}