string match search - perl

I have one text file like this as the query file:
fooLONGcite
GetmoreDATA
stringMATCH
GOODthing
and another text file like this as the subject file:
sometingfooLONGcite
anyotherfooLONGcite
matchGetmoreDATA
GETGOODthing
brotherGETDATA
CITEMORETHING
TOOLONGSTUFFETC
The expected result is to find every line of the subject file that contains one of the query strings and print it. So the output should be:
sometingfooLONGcite
anyotherfooLONGcite
matchGetmoreDATA
GETGOODthing
Here is my Perl script, but it doesn't work. Can you help me find where the problem is? Thanks.
#!/usr/bin/perl
use strict;
# to check the command line option
if($#ARGV<0){
printf("Usage: \n <tag> <seq> <outfile>\n");
exit 1;
}
# to open the given infile file
open(tag, $ARGV[0]) or die "Cannot open the file $ARGV[0]";
open(seq, $ARGV[1]) or die "Cannot open the file $ARGV[1]";
my %seqhash = ();
my $tag_id;
my $tag_seq;
my $seq_id;
my $seq_seq;
my $seq;
my $i = 0;
print "Processing cds seq\n";
#check the seq file
while(<seq>){
my @line = split;
if($i != 0){
$seqhash{$seq_seq} = $seq;
$seq = "";
print "$seq_seq\n";
}
$seq_seq = $line[0];
$i++;
}
while(<tag>){
my @tagline = split;
$tag_seq = $tagline[0];
$seq = $seqhash{$seq_seq};
#print "$tag_seq\n";
print "$seq\n";
#print output ">$id\n$seq\n";
}
#print "Ending of Processing gff\n";
close(tag);
close(seq);

As I understand it, you are looking for a match on part of the string, not an exact one. Here is a script that does what I think you are looking for.
Content of script.pl. I assume the query file is small, because I add all of its content to a single regex:
use warnings;
use strict;
## Check arguments.
die qq[Usage: perl $0 <query_file> <subject_file>\n] unless @ARGV == 2;
## Open input files. Abort if found errors.
open my $fh_query, qq[<], shift @ARGV or die qq[Cannot open input file: $!\n];
open my $fh_subject, qq[<], shift @ARGV or die qq[Cannot open input file: $!\n];
## Variable to save a regex with alternations of the content of the 'query' file.
my $query_regex;
{
## Read content of the 'query' file in slurp mode.
local $/ = undef;
my $query_content = <$fh_query>;
## Remove trailing spaces and generate a regex.
$query_content =~ s/\s+\Z//;
$query_content =~ s/\n/|/g;
$query_regex = qr/(?i:($query_content))/;
}
## Read 'subject' file and for each line compare if that line matches with
## any word of the 'query' file and print in success.
while ( <$fh_subject> ) {
if ( m/$query_regex/o ) {
print
}
}
Run the script:
perl script.pl query.txt subject.txt
And result:
sometingfooLONGcite
anyotherfooLONGcite
matchGetmoreDATA
GETGOODthing
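One caveat with building the regex straight from the file content: any regex metacharacter in a query line (a dot, a bracket, a plus sign) would be interpreted instead of matched literally. Below is a minimal sketch of the same idea with each query escaped through quotemeta. The helper names build_query_regex and matching_lines are made up for illustration, and the data is the sample from the question.

```perl
use strict;
use warnings;

# Hypothetical helper: build one case-insensitive regex from literal query strings.
sub build_query_regex {
    my @queries = @_;
    # quotemeta escapes metacharacters so each query matches literally
    my $alternation = join '|', map { quotemeta } @queries;
    return qr/(?:$alternation)/i;
}

# Hypothetical helper: return the subject lines that contain any query string.
sub matching_lines {
    my ($regex, @subjects) = @_;
    return grep { $_ =~ $regex } @subjects;
}

my $regex = build_query_regex(qw(fooLONGcite GetmoreDATA stringMATCH GOODthing));
my @hits  = matching_lines($regex, qw(
    sometingfooLONGcite anyotherfooLONGcite matchGetmoreDATA
    GETGOODthing brotherGETDATA CITEMORETHING TOOLONGSTUFFETC
));
print "$_\n" for @hits;
```

Note that an empty query list would produce an empty pattern, which matches every line, so a real script should probably check for that first.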

Your current code doesn't make a lot of sense; you're even referencing variables you don't assign anything to.
All you need to do is read the first file into a hash, then check each line of the second against that hash.
while (my $line = <FILE>)
{
chomp($line);
$hash{$line} = 1;
}
...
while (my $line = <FILE2>)
{
chomp($line);
if (defined $hash{$line})
{
print "$line\n";
}
}

Related

Split up files according to column value perl text::csv

I've asked this question before how to do this with AWK but it doesn't handle it all that well.
The data has semicolons in quoted fields, which AWK doesn't take into account. So I was trying it in Perl with the Text::CSV module so that I don't have to think about that. The problem is that I don't know how to write the rows to different files based on a column value.
Short example from previous question, the data:
10002394;"""22.98""";48;New York;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10025155;27.99;65;Chicago;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10003062;19.99;26;San Francisco;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;"""Miami""";http://testdata.com/bla/29019899.jpg;5.95;24404000059
10029650;27.99;48;New York;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10007645;20.99;65;Chicago;"""http://testdata.com/bla/28798580.jpg""";5.95;10201848233
10025825;12.99;65;Chicago;"""http://testdata.com/bla/29017837.jpg""";5.95;93962025367
The desired result:
File --> 26.csv
10003062;19.99;26;San Francisco;http://testdata.com/bla/29002816.jpg;5.95;17012725049
File --> 48.csv
10002394;22.98;48;New York;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10029650;27.99;48;New York;http://testdata.com/bla/29003007.jpg;5.95;3692164452
File --> 53.csv
10003122;13.0;53;Miami;http://testdata.com/bla/29019899.jpg;5.95;24404000059
File --> 65.csv
10025155;27.99;65;Chicago;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10007645;20.99;65;Chicago;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10025825;12.99;65;Chicago;http://testdata.com/bla/29017837.jpg;5.95;93962025367
This is what I have so far. EDIT: Modified code:
#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV_XS;
#use Data::Dumper;
use Time::Piece;
my $inputfile = shift || die "Give input and output names!\n";
open my $infile, '<', $inputfile or die "Sourcefile in use / not found :$!\n";
#binmode($infile, ":encoding(utf8)");
my $csv = Text::CSV_XS->new({binary => 1,sep_char => ";",quote_space => 0,eol => $/});
my %fh;
my %count;
my $country;
my $date = localtime->strftime('%y%m%d');
open(my $fh_report, '>', "report$date.csv");
$csv->getline($infile);
while ( my $elements = $csv->getline($infile)){
EDITED IN:
__________
next unless ($elements->[29] =~ m/testdata/);
for (@$elements){
next if ($elements =~ /apple|orange|strawberry/);
}
__________
for (@$elements){
s/\"+/\"/g;
}
my $filename = $elements->[2];
$shop = $elements->[3] .";". $elements->[2];
$count{$country}++;
$fh{$filename} ||= do {
open(my $fh, '>:encoding(UTF-8)', $filename . ".csv") or die "Could not open file '$filename'";
$fh;
};
$csv->print($fh{$filename}, $elements);
}
#print $fh_report Dumper(\%count);
foreach my $name (reverse sort { $count{$a} <=> $count{$b} or $a cmp $b } keys %count) {
print $fh_report "$name;$count{$name}\n";
}
close $fh_report;
Errors:
Can't call method "print" on an undefined value at sort_csv_delimiter.pl line 28, <$infile> line 2
I've been messing around with this but I'm totally at a loss. Can someone help me?
My guess is that you want a hash of cached file handles:
my %fh;
while ( my $elements = $csv->getline( $infile ) ) {
my $filename = $elements->[2];
$fh{$filename} ||= do {
open my $fh, ">", "$filename.csv" or die $!;
$fh;
};
# $csv->combine(@$elements);
$csv->print($fh{$filename}, $elements);
}
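To see the caching idiom in isolation, here is a small sketch that needs no files on disk. It assumes the rows have already been parsed into array refs (as getline returns them) and writes to in-memory handles instead of real files; the %buffer hash and the sample rows are illustrative only.

```perl
use strict;
use warnings;

# Rows shaped like the array refs that getline returns: one per record.
my @rows = (
    [ 10002394, '22.98', 48, 'New York' ],
    [ 10025155, '27.99', 65, 'Chicago'  ],
    [ 10029650, '27.99', 48, 'New York' ],
);

my (%fh, %buffer);
for my $row (@rows) {
    my $key = $row->[2];
    # Open each output only once; later rows with the same key reuse the handle.
    $fh{$key} ||= do {
        $buffer{$key} = '';
        open my $out, '>', \$buffer{$key}
            or die "Cannot open buffer for $key: $!";
        $out;
    };
    print { $fh{$key} } join(';', @$row), "\n";
}
close $_ for values %fh;

print "File --> $_.csv\n$buffer{$_}" for sort { $a <=> $b } keys %buffer;
```

Because the do block returns the handle and ||= stores it, the open runs at most once per key, which is also what keeps the output file from being truncated on every row.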
I don't see an instance of your stated problem -- occurrences of the semicolon separator character ; within quoted fields -- but you are correct that Text::CSV will handle it correctly.
This short program reads your example data from the DATA file handle and prints the result to STDOUT. I presume you know how to read from or write to different files if you wish.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({ sep_char => ';', eol => $/ });
my @data;
while ( my $row = $csv->getline(\*DATA) ) {
push @data, $row;
}
my $file;
for my $row ( sort { $a->[2] <=> $b->[2] or $a->[0] <=> $b->[0] } @data ) {
unless (defined $file and $file == $row->[2]) {
$file = $row->[2];
printf "\nFile --> %d.csv\n", $file;
}
$csv->print(\*STDOUT, $row);
}
__DATA__
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
output
File --> 26.csv
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
File --> 48.csv
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
File --> 53.csv
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
File --> 65.csv
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;"10201848233 "
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
Update
I have just realised that your "desired result" isn't the output that you expect to see, but rather the way separate records are written to different files. This program solves that.
It looks from your question as though you want the data sorted in order of the first field as well, and so I have read all of the file into memory and printed a sorted version to the relevant files. I have also used autodie to avoid having to code status checks for all the IO operations.
use strict;
use warnings;
use autodie;
use Text::CSV;
my $csv = Text::CSV->new({ sep_char => ';', eol => $/ });
my @data;
while ( my $row = $csv->getline(\*DATA) ) {
push @data, $row;
}
my ($file, $fh);
for my $row ( sort { $a->[2] <=> $b->[2] or $a->[0] <=> $b->[0] } @data ) {
unless (defined $file and $file == $row->[2]) {
$file = $row->[2];
open $fh, '>', "$file.csv";
}
$csv->print($fh, $row);
}
close $fh;
__DATA__
10002394;22.98;48;http://testdata.com/bla/29012827.jpg;5.95;93962094820
10025155;27.99;65;http://testdata.com/bla/29011075.jpg;5.95;14201021349
10003062;19.99;26;http://testdata.com/bla/29002816.jpg;5.95;17012725049
10003122;13.0;53;http://testdata.com/bla/29019899.jpg;5.95;24404000059
10029650;27.99;48;http://testdata.com/bla/29003007.jpg;5.95;3692164452
10007645;20.99;65;http://testdata.com/bla/28798580.jpg;5.95;10201848233
10025825;12.99;65;http://testdata.com/bla/29017837.jpg;5.95;93962025367
FWIW I have done this using Awk (gawk):
awk --assign col=2 'BEGIN { if(!(col ~/^[1-9]/)) exit 2; outname = "part-%s.txt"; } !/^#/ { out = sprintf(outname, $col); print > out; }' bigfile.txt
other_process data | awk --assign col=2 'BEGIN { if(!(col ~/^[1-9]/)) exit 2; outname = "part-%s.txt"; } !/^#/ { out = sprintf(outname, $col); print > out; }'
Let me explain the awk script:
BEGIN { # execution block before reading any file (once)
if(!(col ~/^[1-9]/)) exit 2; # assert the `col` variable is a positive number
outname = "part-%s.txt"; # formatting string of the output file names
}
!/^#/ { # only process lines not starting with '#' (header/comments in various data files)
out = sprintf(outname, $col); # format the output file name, given the value in column `col`
print > out; # put the line to that file
}
If you like you can add a variable to specify a custom filename or use the current filename (or STDIN) as prefix:
NR == 1 { # at the first file (not BEGIN, as we might need FILENAME)
if(!(col ~/^[1-9]/)) exit 2; # assert the `col` variable is a positive number
if(!outname) outname = (FILENAME == "-" ? "STDIN" : FILENAME); # if `outname` variable was not provided (with `-v/--assign`), use current filename or STDIN
if(!(outname ~ /%s/)) outname = outname ".%s"; # if `outname` is not a formatting string - containing %s - append it
}
!/^#/ { # only process lines not starting with '#' (header/comments in various data files)
out = sprintf(outname, $col); # format the output file name, given the value in column `col`
print > out; # put the line to that file
}
Note: if you provide multiple input files, only the first file's name will be used as output prefix. To support multiple input files and multiple prefixes, you can use FNR == 1 instead and add another variable to distinguish between user-provided outname and the auto-generated one.

how to extract substrings by knowing the coordinates

I am terribly sorry for bothering you with my problem in several questions, but I need to solve it...
I want to extract several substrings from a file that contains a string, using another file that holds the start and the end of each substring I want to extract.
The first file is like:
>scaffold30 24194
CTTAGCAGCAGCAGCAGCAGTGACTGAAGGAACTGAGAAAAAGAGCGAGCTGAAAGGAAGCATAGCCATTTGGGAGTGCCAGAGAGTTGGGAGG GAGGGAGGGCAGAGATGGAAGAAGAAAGGCAGAAATACAGGGAGATTGAGGATCACCAGGGAG.........
.................
(the string must be everything in the file except the first line), and the coordinates file is like:
44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
.....................
my script is this:
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $count = 0;
while(<INFILE1>) {
if ($_ !~ m/^>/) {
local $/ = undef;
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my @cline = <INFILE>;
foreach my $cline (@cline) {
print "$cline\n";
my @data = split('\t', $cline);
my $start = $data[0];
my $end = $data[1];
my $offset = $end - $start;
$count++;
my $sub = substr ($var, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE;
}
}
When I run it, it looks like it does only one iteration, and it prints the start of the first file.
It seems like the foreach loop doesn't work.
substr doesn't seem to work either.
When I put in an exit and print the cline to check the loop, it prints all the lines of the coordinates file.
I am sorry if I become annoying, but I must finish it and I am a little bit desperate...
Thank you again.
This line
local $/ = undef;
changes $/ for the entire enclosing block, which includes the section where you read in your second file. $/ is the input record separator, which essentially defines what a "line" is (it is a newline by default, see perldoc perlvar for details). When you read from a filehandle using <>, $/ is used to determine where to stop reading. For example, the following program relies on the default line-splitting behavior, and so only reads until the first newline:
use feature 'say';
my $foo = <DATA>;
say $foo;
# Output:
# 1
__DATA__
1
2
3
Whereas this program reads all the way to EOF:
use feature 'say';
local $/;
my $foo = <DATA>;
say $foo;
# Output:
# 1
# 2
# 3
__DATA__
1
2
3
This means your @cline array gets only one element, which is a string containing the text of your entire coordinates file. You can see this using Data::Dumper:
use Data::Dumper;
print Dumper(\@cline);
Which in your case will output something like:
$VAR1 = [
'44801988 44802104
44846151 44846312
45620133 45620274
45640443 45640543
45688249 45688358
45729531 45729658
45843362 45843490
46066894 46066996
46176337 46176464
'
];
Notice how your array (technically an arrayref in this case), delineated by [ and ], contains only a single element, which is a string (delineated by single quotes) that contains newlines.
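If you do want slurp mode for one read only, the usual way is to confine local $/ to a small bare block. A minimal sketch, using in-memory file handles and made-up sample text so that it runs on its own:

```perl
use strict;
use warnings;

my $text = "1\n2\n3\n";

# Slurp inside a bare block so the change to $/ ends with the block.
my $slurped;
{
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
    local $/;               # undef: a single read returns everything up to EOF
    $slurped = <$fh>;
}

# Out here $/ is "\n" again, so a list-context read splits into lines.
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my @lines = <$fh>;

printf "slurped %d characters, then read %d lines\n",
    length $slurped, scalar @lines;
```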
Let's walk through the relevant sections of your code:
while(<INFILE1>) {
if ($_ !~ m/^>/) {
# Enable localized slurp mode. Stays in effect until we leave the 'if'
local $/ = undef;
# Read the rest of INFILE1 into $var (from current line to EOF)
my $var = <INFILE1>;
open INFILE, $coords_file or die "Could not open $coords_file: $!";
# In list context, return each block until the $/ character as a
# separate list element. Since $/ is still undef, this will read
# everything until EOF into our first list element, resulting in
# a one-element array
my @cline = <INFILE>;
# Since @cline only has one element, the loop only has one iteration
foreach my $cline (@cline) {
As a side note, your code could be cleaned up a bit. The names you chose for your filehandles leave something to be desired, and you should probably use lexical filehandles anyway (and the three-argument form of open):
open my $chromosome_fh, "<", $ARGV[0] or die $!;
open my $coordinates_fh, "<", $ARGV[1] or die $!;
Also, you do not need to nest your loops in this case, it just makes your code more convoluted. First read the relevant parts of your chromosome file into a variable (named something more meaningful than var):
# Get rid of the `local $/` statement, we don't need it
my $chromosome;
while (<$chromosome_fh>) {
next if /^>/;
$chromosome .= $_;
}
Then read in your coordinates file:
my @cline = <$coordinates_fh>;
Or if you only need to use the contents of the coordinates file once, process each line as you go using a while loop:
while (<$coordinates_fh>) {
# Do something for each line here
}
As 'ThisSuitIsBlackNot' suggested, your code could be cleaned up a little. Here is a possible solution that may be what you want.
#!/usr/bin/perl
use strict;
use warnings;
my $chrom = $ARGV[0];
my $coords_file = $ARGV[1];
#finds subsequences: fasta files
open INFILE1, $chrom or die "Could not open $chrom: $!";
my $fasta;
<INFILE1>; # get rid of the first line - '>scaffold30 24194'
while(<INFILE1>) {
chomp;
$fasta .= $_;
}
close INFILE1 or die "Could not close '$chrom'. $!";
open INFILE, $coords_file or die "Could not open $coords_file: $!";
my $count = 0;
while(<INFILE>) {
my ($start, $end) = split;
# Or, should this be: my $offset = $end - ($start - 1);
# That would include the start fasta
my $offset = $end - $start;
$count++;
my $sub = substr ($fasta, $start, $offset);
print ">conserved $count\n";
print "$sub\n";
}
close INFILE or die "Could not close '$coords_file'. $!";
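The comment inside the loop raises a real question: how are the coordinates counted? substr uses 0-based offsets, so if the coordinates file is 1-based and inclusive (an assumption, the question does not say), both the offset and the length need adjusting. A small sketch with a made-up helper name and a toy sequence:

```perl
use strict;
use warnings;

# Hypothetical helper: extract positions $start..$end, counted from 1, both ends included.
sub region_1based_inclusive {
    my ($sequence, $start, $end) = @_;
    # substr offsets start at 0, and an inclusive range holds end - start + 1 characters
    return substr($sequence, $start - 1, $end - $start + 1);
}

my $sequence = 'ACGTACGTTTGA';
print region_1based_inclusive($sequence, 1, 4), "\n";   # first four bases
print region_1based_inclusive($sequence, 7, 9), "\n";   # bases 7, 8 and 9
```

With 0-based, end-exclusive coordinates the original substr($fasta, $start, $end - $start) is already correct, so it is worth checking one known region by hand before trusting either version.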

How to print all lines between certain matching string to a different file in perl

I am fairly new to Perl scripting and need some help. Below is my query:
I have a file which has contents like below:
AA ABC 0 0
line1
line2
...
AA XYZ 1 1
line..
line..
AA GHI 2 2
line..
line...
Now I would like to get all the lines between the lines that start with the string/pattern "AA" and write them to the files ABC.txt, XYZ.txt and GHI.txt respectively, including the AA* line itself. For example, ABC.txt should look like
AA ABC 0 0
line1
line2...
and XYZ.txt should look like
AA XYZ 1 1
line..
line..
I hope the question is clear; any help with this is much appreciated.
Thanks,
Sandy
I presume you're asking for an algorithm since you didn't specify what you needed help with.
Declare a file handle for use for output.
While you haven't reached the end of the input file,
    Read a line.
    If it's a header line,
        Parse it.
        Determine file name.
        (Re)open the output file.
    Print the line to the output file handle.
Lest you be tempted to use one of the poor solutions that have been posted since I posted the above, here's the code:
my $fh;
while (<>) {
if (my ($fn) = /^AA\s+(\S+)/) {
$fn .= '.txt';
open($fh, '>', $fn)
or die("Can't create file \"$fn\": $!\n");
}
print $fh $_;
}
Possible improvements, all of which are easy to add:
Check for duplicate headers. (if -e $fn is one way)
Check for data before the first header. (if !$fh is one way)
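Both improvements can be sketched as follows. This is only one way to do it: it tracks duplicates in a %seen hash rather than with -e, writes into a directory passed as an argument, and the helper name split_records and the sample input are made up for the example.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Hypothetical helper: write one "<dir>/<name>.txt" file per "AA <name> ..." header.
sub split_records {
    my ($in_fh, $dir) = @_;
    my ($out_fh, %seen);
    while (my $line = <$in_fh>) {
        if (my ($name) = $line =~ /^AA\s+(\S+)/) {
            # Refuse to overwrite a section we have already written
            die "Duplicate header '$name'\n" if $seen{$name}++;
            # Reopening the same lexical handle closes the previous file
            open $out_fh, '>', "$dir/$name.txt"
                or die "Can't create file \"$dir/$name.txt\": $!\n";
        }
        die "Data before the first header\n" unless $out_fh;
        print {$out_fh} $line;
    }
    close $out_fh if $out_fh;
    my @names = sort keys %seen;
    return @names;
}

my $dir   = tempdir(CLEANUP => 1);
my $input = "AA ABC 0 0\nline1\nline2\nAA XYZ 1 1\nline3\n";
open my $in_fh, '<', \$input or die "Cannot open in-memory file: $!";
my @names = split_records($in_fh, $dir);
print "wrote @names\n";
```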
You just need to keep one file open at a time... When a line matches XYZ, then you open your XYZ.txt file and output the line. You keep that file open (let's just say it's the handle CURRENT_FILE) and output each successive line to it until you match a new header line. Then you close the current file and open another one.
My Perl is extremely rusty, so I don't think I can provide code that compiles, but essentially it's something close to this.
my $current_name = "";
while (my $line = <INPUT>)
{
# Only header lines ("AA NAME ...") give a defined $name
my($name) = $line =~ /^AA (\w+)/;
if( defined $name && $name ne $current_name ) {
close(CURRENT_FILE) if $current_name ne "";
open(CURRENT_FILE, ">>", "$name.txt") || die "Argh\n";
$current_name = $name;
}
# Skip any lines that come before the first header
next if $current_name eq "";
print CURRENT_FILE $line;
}
close(CURRENT_FILE) if $current_name ne "";
What do you think about this one?
1: Get contents from the file (maybe using File::Slurp's read_file) and save to a scalar.
use File::Slurp qw(read_file write_file);
my $contents = read_file($filename);
2: Have a regex pattern matching similar to this:
my @file_rows = ($contents =~ /(AA\s[A-Z]{3}\s+\d+\s+\w*)/g);
3: If column 2 values are always unique throughout the file:
foreach my $file_row (@file_rows) {
my @values = split(' ', $file_row, 3);
write_file($values[1] . ".txt", $file_row);
}
3: Otherwise: Split the row values. Store them to a hash using the second column as the key. Write data to output files using the hash.
my %hash;
foreach my $file_row (@file_rows) {
my @values = split(' ', $file_row, 3);
if (defined $hash{$values[1]}) {
$hash{$values[1]} .= $file_row;
} else {
$hash{$values[1]} = $file_row;
}
}
foreach my $key (keys %hash) {
write_file($key . '.txt', $hash{$key});
}
Here's an option that looks for the pattern matching the start of each record. When found, it loops through the data file's lines and builds a record until it finds the same pattern again or eof, then that record is written to a file. It does not check to see if the file already exists before writing to it, so it will replace ABC.txt if it already exists:
use strict;
use warnings;
my $dataFile = 'data.txt';
my $nextLine = '';
my $recordRegex = qr/^AA\s+(\S+)\s+\d+\s+\d+/;
open my $inFH, '<', $dataFile or die $!;
RECORD: while ( my $line = <$inFH> ) {
my $record = $nextLine . $line;
if ( $record =~ $recordRegex ) {
my $fileName = $1 . '.txt';
while ( $nextLine = <$inFH> ) {
if ( $nextLine =~ $recordRegex or eof $inFH ) {
$record .= $nextLine if eof $inFH;
open my $outFH, '>', $fileName or die $!;
print $outFH $record;
close $outFH;
next RECORD;
}
$record .= $nextLine;
}
}
}
close $inFH;
Hope this helps!
Edit: This code replaces the original that was problematic. Thank you, amon, for reviewing the original code.

Skip the problematic DAT and proceed with next DATs and out put the error or missed DATs list in separate text file in Perl

I have another question here. I have several DAT files and want to merge them. The script first checks the header of every DAT file, and if one does not match, it throws an error and stops. Now I want the script to skip the problematic DAT file, carry on with the rest, and write the list of failed DAT files, with the reason, to a separate text file. Could anyone please help with this? Here is what I have so far:
use strict;
my $rootdir = $ARGV[0];
die "usage: perl mergetxtfiles.pl <folder>" if ($#ARGV != 0);
#$rootdir =~ s/\\/\\\\/g;
print "\nFolder = $rootdir\n\n";
opendir(DIR, $rootdir)
or die "failed opening the directory $rootdir";
open(OF,">:utf8",'combined_'.time.'.dat')
or die "failed opening the file";
my $icr = 0;
my $cnt = 0;
my $header = '';
my $header_flag = 0;
while(my $fname = readdir(DIR)) {
# add extensions if needed
if ($fname =~ m/(\.txt)|(\.dat)|(\.csv)$/i) {
$icr++;
my $fnamepath = $rootdir.'\\'.$fname;
print "\($icr\) $fname\n";
open(IF, "<:utf8", $fnamepath)
or die "ERROR: cannot open the file\n$fnamepath ";
my $sep_icr = 0;
while(<IF>) {
my $line = $_;
chomp $line;
next if (/^$/);
$sep_icr++;
$cnt++;
my @ar = split(/\t/,$line);
if ($cnt == 1) {
$header_flag = 1;
$header = $line;
}
if ($sep_icr == 1 and $header_flag == 1) {
#print "$line \n $header\n";
if ($line ne $header) {
die "Headers are not same\n";
}
elsif (($line eq $header) and ($cnt >1)) {
print "INFO\: ignoring the same header for $fname \n";
$cnt--;
next;
}
}
print OF $line."\n";
}
print "\--Line count= $sep_icr\n\n";
close IF;
#print OF "\n";
}
}
print "\-\-\> Total line count= $cnt\n";
Named Loops
In your loop, we have to change your if-clause and the outer loop a bit:
FILE:
while(my $fname = readdir(DIR)) {
...;
if ($line ne $header) {
logger($fname, "Headers not matching");
next FILE;
}
...;
}
In Perl, loops can be labeled, so we can specify which loop we do next, instead of setting and checking flags. I used an example logging function logger as given below, but you can substitute it with an appropriate print statement.
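Here is a self-contained sketch of the labeled-loop idea. To keep it runnable without a directory of DAT files, it assumes the files are represented by an in-memory hash of file name to lines, and it collects the skipped files in an array instead of writing a log; all the names are illustrative.

```perl
use strict;
use warnings;

# Hypothetical stand-ins for the DAT files: file name => list of lines.
my %files = (
    'a.dat' => [ "id\tname",  "1\tfoo" ],
    'b.dat' => [ "id\ttitle", "2\tbar" ],   # this header differs
    'c.dat' => [ "id\tname",  "3\tbaz" ],
);

my ($header, @merged, @skipped);
FILE:
for my $fname (sort keys %files) {
    my $line_no = 0;
    for my $line (@{ $files{$fname} }) {
        if (++$line_no == 1) {
            $header //= $line;               # the first file defines the header
            if ($line ne $header) {
                push @skipped, "$fname: Headers not matching";
                next FILE;                   # give up on this file, go on to the next one
            }
            next if @merged;                 # header was already written once
        }
        push @merged, $line;
    }
}

print "$_\n" for @merged;
print STDERR "skipped $_\n" for @skipped;
```

Without the label, next would only move on to the next line of the same file, which is why the flag-based version gets complicated.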
Logging
This is probably a bit more than asked, but here is a little logging function for flexibility. Arguments are a filename, a reason, and an optional severity. You can remove the severity code if it isn't needed. The severity is optional anyway and defaults to debug.
open my $logfile, ">>", "FILENAME" or die "..."; # open for append
sub logger {
my ($file, $reason, $severity) = (@_, 'debug');
$severity = {
debug => '',
info => 'INFO',
warn => '!WARN!',
fatal => '!!!ERROR!!!',
}->{$severity} // $severity; # transform the severity if it is a name we know
$severity .= ' ' if length $severity; # append space if we have a severity
print {$logfile} $severity . qq{$reason while processing "$file"\n};
}
If called with logger("./foo/bar", "Headers not matching", 'warn') it will output:
!WARN! Headers not matching while processing "./foo/bar"
Change the printed error message to something more machine-readable if needed.
Style tips and tricks:
I find these lines more elegant:
die "usage: ...\n" unless @ARGV;
my ($rootdir) = @ARGV;
Note the newline at the end (it suppresses the "at line 3" suffix). In scalar context, an array returns its length. In the second line we can avoid array subscripting by assigning in list context; surplus elements are ignored.
Instead of
if ($fname =~ m/(\.txt)|(\.dat)|(\.csv)$/i) { ...; }
we can say
next unless $fname =~ m/(?: \.txt | \.dat | \.csv )$/xi;
and avoid unnecessary indentation, therefore improving readability.
I modified the regex so that all suffixes must come at the end, not only the .csv suffix, and added the /x modifier so that I can use non-semantic whitespace inside the regex.
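The difference is easy to check with a few file names. In this sketch the names are invented for the test; the two patterns are the ones discussed above.

```perl
use strict;
use warnings;

my $loose  = qr/(\.txt)|(\.dat)|(\.csv)$/i;         # $ binds to the .csv branch only
my $strict = qr/(?: \.txt | \.dat | \.csv )$/xi;    # $ applies to every suffix

for my $name (qw(report.dat notes.txt.bak data.CSV)) {
    printf "%-14s loose=%d strict=%d\n", $name,
        ($name =~ $loose  ? 1 : 0),
        ($name =~ $strict ? 1 : 0);
}
```

Only notes.txt.bak tells the two apart: the loose pattern accepts it because .txt may appear anywhere in the name, while the strict pattern rejects it.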
Windows, and pretty much any OS, understands forward slashes in path names. So instead of
my $fnamepath = $rootdir.'\\'.$fname;
we can write
my $fnamepath = "$rootdir/$fname";
I find that easier to write and understand.
The
while(<IF>) {
my $line = $_;
construct can be simplified to
while(my $line = <IF>) {...}
Last but not least, consider starting a habit of using filehandles with my. Often, global filehandles are not needed and can cause some bugs.

How do I search for a string in file with different headings?

I am using Perl to search for specific strings in a file with different sequences listed under different headings. I am able to write the script when there is one sequence present, i.e. one heading, but I am not able to extend it to several.
Suppose I am required to search for some string "FSFSD" in a given file. For example,
I can't search if the file has the following content:
Polons
CACAGTGCTACGATCGATCGATDDASD
HCAYCHAYCHAYCAYCSDHADASDSADASD
Seliems
FJDSKLFJSLKFJKASFJLAKJDSADAK
DASDNJASDKJASDJDSDJHAJDASDASDASDSAD
Teerag
DFAKJASKDJASKDJADJLLKJ
SADSKADJALKDJSKJDLJKLK
I can search when the file has one heading, i.e.:
Terrans
FDKFJSKFJKSAFJALKFJLLJ
DKDJKASJDKSADJALKJLJKL
DJKSAFDHAKJFHAFHFJHAJJ
I need to output the result as "String xyz found under Heading abc"
The code I am using is:
print "Input the file name \n";
$protein= <STDIN>;
chomp $protein;
unless (open (protein, $protein))
{
print "cant open file \n\n";
exit;
}
@prot= <protein>;
close protein;
$newprotein=join("",@prot);
$protein=~s/\s//g;
do{
print "enter the motif to be searched \n";
$motif= <STDIN>;
chomp $motif;
if ($protein =~ /motif/)
{
print "found motif \n\n";
}
else{
print "not found \n\n";
}
}
until ($motif=~/^\s*$/);
exit;
Seeing your code, I want to make a few suggestions without answering your question:
Always, always, always use strict;. For the love of whatever higher power you may (or may not) believe in, use strict;.
Every time you use strict;, you should use warnings; along with it.
Also, seriously consider using some indentation.
Also, consider using obviously different names for different variables.
Lastly, your style is really inconsistent. Is this all your code or did you patch it together? Not trying to insult you or anything, but I recommend against copying code you don't understand - at least try before you just copy it.
Now, a much more readable version of your code, including a few fixes and a few guesses at what you may have meant to do, follows:
use strict;
use warnings;
print "Input the file name:\n";
my $filename = <STDIN>;
chomp $filename;
open FILE, "<", $filename or die "Can't open file\n\n";
my $newprotein = join "", <FILE>;
close FILE;
$newprotein =~ s/\s//g;
while(1) {
print "enter the motif to be searched:\n";
my $motif = <STDIN>;
last if $motif =~ /^\s*$/;
chomp $motif;
# here I might even use the ternary ?: operator, but whatever
if ($newprotein =~ /$motif/) {
print "found motif\n\n";
}
else {
print "not found\n\n";
}
}
The main issue is how to distinguish between a header and the data. From your examples, I assume that a line is a header iff it contains a lowercase letter.
use strict;
use warnings;
print "Enter the motif to be searched \n";
my $motif = <STDIN>;
chomp($motif);
my $header;
while (<>) {
if(/[a-z]/) {
chomp($header = $_);
next;
}
if (/$motif/o) {
print "Found $motif under header $header\n";
exit;
}
}
print "$motif not found\n";
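The loop above tests one line at a time and stops at the first hit, so a motif that is split across two lines of the same sequence would be missed. A possible variation, sketched under the same lowercase-letter assumption, joins the lines under each heading first and reports every heading that contains the motif. The helper name find_motif and the shortened sample data are made up, and index is used so the motif is treated as a literal string rather than a regex.

```perl
use strict;
use warnings;

# Hypothetical helper: return the headings whose joined sequence contains $motif.
sub find_motif {
    my ($motif, $fh) = @_;
    my ($heading, %sequence, @order);
    while (my $line = <$fh>) {
        chomp $line;
        next unless length $line;
        if ($line =~ /[a-z]/) {          # assumption: only headings contain a lowercase letter
            $heading = $line;
            push @order, $heading unless exists $sequence{$heading};
            $sequence{$heading} //= '';
            next;
        }
        next unless defined $heading;    # ignore data before the first heading
        $sequence{$heading} .= $line;    # join lines so a motif may span a line break
    }
    return grep { index($sequence{$_}, $motif) >= 0 } @order;
}

my $sample = "Polons\nCACAGTFS\nFSDHCAY\nSeliems\nFJDSKLFJ\nTeerag\nDFAFSFSDK\n";
open my $fh, '<', \$sample or die "Cannot open in-memory file: $!";
my @found = find_motif('FSFSD', $fh);
print "String FSFSD found under Heading $_\n" for @found;
```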
So you are saying you are able to read one line and achieve this task. But when you have more than one line in the file you are not able to do the same thing?
Just have a loop and read the file line by line.
$data_file="yourfilename.txt";
open(DAT, '<', $data_file) || die("Could not open file!");
while( my $line = <DAT>)
{
# the same commands that you use for one 'heading' go here; $line holds one line of the file
}
EDIT: Your posted example has no clear delimiter; you need to find a clear division between your headings and your sequences. You could use multiple linebreaks or a non-alphanumeric character such as ','. Whatever you choose, let WHITESPACE in the following code be equal to your chosen delimiter. If you are stuck with the format you have, you will have to change the following grammar to disregard whitespace and delimit through capitalization (which makes it slightly more complex).
Simple way ( O(n^2)? ) is to split the file using a whitespace delimiter, giving you an array of headings and sequences( heading[i] = split_array[i*2], sequence[i] = split_array[i*2+1]). For each sequence perform your regex.
Slightly more difficult way ( O(n) ), given a BNF grammar such as:
file: block
| file block
;
block: heading sequence
heading: [A-Z][a-z]
sequence: [A-Z][a-z]
Try recursive descent parsing (pseudo-code, I don't know Perl):
GLOBAL sequenceHeading, sequenceCount
GLOBAL substringLength = 5
GLOBAL substring = "FSFSD"
FUNC file ()
WHILE nextChar() != EOF
block()
printf ( "%d substrings in %s", sequenceCount, sequenceHeading )
END WHILE
END FUNC
FUNC block ()
heading()
sequence()
END FUNC
FUNC heading ()
in = popChar()
IF in == WHITESPACE
sequenceHeading = tempHeading
tempHeading = ""
RETURN
END IF
tempHeading &= in
END FUNC
FUNC sequence ()
in = popChar()
IF in == WHITESPACE
sequenceCount = count
count = 0
i = 0
END IF
IF in == substring[i]
i++
IF i > substringLength
count++
END IF
ELSE
i = 0
END IF
END FUNC
For detailed information on recursive descent parsing, check out Let's Build a Compiler or Wikipedia.
use strict;
use warnings;
use autodie qw'open';
my($filename,$motif) = @ARGV;
if( @ARGV < 1 ){
print "Please enter file name:\n";
$filename = <STDIN>;
chomp $filename;
}
if( @ARGV < 2 ){
print "Please enter motif:\n";
$motif = <STDIN>;
chomp $motif;
}
my %data;
# fill in %data;
{
open my $file, '<', $filename;
my $heading;
while( my $line = <$file> ){
chomp $line;
if( $line ne uc $line ){
$heading = $line;
next;
}
if( $data{$heading} ){
$data{$heading} .= $line;
} else {
$data{$heading} = $line;
}
}
}
{
# protect against malicious users
my $motif_cmp = quotemeta $motif;
for my $heading ( keys %data ){
my $data = $data{$heading};
if( $data =~ /$motif_cmp/ ){
print "String $motif found under Heading $heading\n";
exit 0;
}
}
die "String $motif not found anywhere in file $filename\n";
}