Get value from next N rows of a file - perl

I'm having trouble reading the lines that follow the one I'm currently on ($lines[0]) in the following foreach loop:
my $IN_DIR = "/tmp/appo/log"; # Input Directories
my $jumprow = '<number of row to skip>'; # This is a value
foreach my $INPUT ( glob( "$IN_DIR/logrotate_*.log" ) ) {
open( my $fh, '<', $INPUT ) or die $!;
while ( <$fh> ) {
next unless $. > $jumprow;
my @lines = split /\n/;
my $i = 0;
foreach my $lines ( @lines ) {
if ( $lines[$i] =~ m/\A#\d.\d.+#\d{4}\s\d{2}\s\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}#\+\d+#\w+#\/\w+\/\w+\/Authentication/ ) {
# Shows only LOGIN/LOGOUT access type and exclude GUEST users
if ( $lines[ $i + 2 ] =~ m/Login/ || $lines[ $i + 2 ] =~ m/Logout/ && $lines[ $i + 3 ] !~ m/Guest/ ) {
my ( $y, $m, $d, $time ) = $lines[$i] =~ /\A#\d.\d.+#(\d{4})\s(\d{2})\s(\d{2})\s(\d{2}:\d{2}:\d{2}:\d{3})/;
my ( $action ) = $lines[ $i + 2 ] =~ /\A(\w+)/;
my ( $user ) = $lines[ $i + 3 ] =~ /\w+:\s(.+)/;
print "$y/$m/$d;$time;$action;$user\n";
}
}
else {
next; # Is this next technically necessary according to you?
}
$i++;
}
}
close( $fh );
}
I think the Tie::File module could help me:
my $IN_DIR = "/tmp/appo/log"; # Input Directories
my $jumprow = '<number of row to skip>'; # This is a value
foreach my $INPUT ( glob( "$IN_DIR/logrotate_*.log" ) ) {
tie @lines, 'Tie::File', $INPUT, mode => O_RDONLY;
or die $!;
my $i = $.;
next unless $i > $jumprow;
foreach my $lines ( @lines ) {
if ( $lines[$i] =~ m/\A#\d.\d.+#\d{4}\s\d{2}\s\d{2}\s\d{2}:\d{2}:\d{2}:\d{3}#\+\d+#\w+#\/\w+\/\w+\/Authentication/ ) {
# Shows only LOGIN/LOGOUT access type and exclude GUEST users
if ( $lines[ $i + 2 ] =~ m/Login/ || $lines[ $i + 2 ] =~ m/Logout/ && $lines[ $i + 3 ] !~ m/Guest/ ) {
my ( $y, $m, $d, $time ) = $lines[$i] =~ /\A#\d.\d.+#(\d{4})\s(\d{2})\s(\d{2})\s(\d{2}:\d{2}:\d{2}:\d{3})/;
my ( $action ) = $lines[ $i + 2 ] =~ /\A(\w+)/;
my ( $user ) = $lines[ $i + 3 ] =~ /\w+:\s(.+)/;
print "$y/$m/$d;$time;$action;$user\n";
}
}
else {
next; # Is this next technically necessary according to you?
}
$i++;
}
}
Could you tell me if my declaration with Tie::File is correct or not?
This is only part of my main script, cut down as the MCVE guide suggests.
Without tie, my main script only works with $lines[0]; it doesn't get any value from $lines[$i+2] or $lines[$i+3].

It looks like you're getting very lost here. I've written a working program that processes the data you showed in your previous question; it should at least form a stable basis for you to continue your work. I think it's fairly straightforward, but ask if anything isn't clear from the Perl documentation.
use strict;
use warnings 'all';
use feature 'say';
use autodie; # Handle IO failures automatically
use constant IN_DIR => '/tmp/appo/log';
chdir IN_DIR; # Change to input directory
# Status handled by autodie
for my $file ( glob 'logrotate_*.log' ) {
say $file;
say '-' x length $file;
say "";
open my $fh, '<', $file; # Status handled by autodie
local $/ = ""; # Enable paragraph mode: records are separated by blank lines
while ( <$fh> ) {
my @lines = split /\n/;
next unless $lines[0] =~ /
^
\# \d.\d .+?
\# (\d\d\d\d) \s (\d\d) \s (\d\d)
\s
( \d\d : \d\d : \d\d : \d\d\d )
/x;
my ( $y, $m, $d, $time ) = ($1, $2, $3, $4);
$time =~ s/.*\K:/./; # Change the last colon to a dot so the milliseconds read as a decimal fraction
next unless $lines[2] =~ /^(Log(?:in|out))/;
my $action = $1;
next unless $lines[3] =~ /^User:\s+(.*\S)/ and $1 ne 'Guest';
my $user = $1;
print "$y/$m/$d;$time;$action;$user\n";
}
say "";
}
output
logrotate_0.0.log
-----------------
2018/05/24;11:05:04.011;Login;USER4
2018/05/24;11:04:59.410;Login;USER4
2018/05/24;11:05:07.100;Logout;USER3
2018/05/24;11:07:21.314;Login;USER2
2018/05/24;11:07:21.314;Login;USER2
2018/05/26;10:48:02.458;Logout;USER2
2018/05/28;10:00:25.000;Logout;USER0
logrotate_1.0.log
-----------------
2018/05/29;10:09:45.969;Login;USER4
2018/05/29;11:51:06.541;Login;USER1
2018/05/30;11:54:03.906;Login;USER4
2018/05/30;11:59:59.156;Logout;USER3
2018/05/30;08:32:11.348;Login;USER4
2018/05/30;11:09:54.978;Login;USER2
2018/06/01;08:11:30.008;Logout;USER2
2018/06/01;11:11:29.658;Logout;USER1
2018/06/02;12:05:00.465;Logout;USER9
2018/06/02;12:50:00.065;Login;USER9
2018/05/24;10:43:38.683;Login;USER1
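The key to the program above is the `local $/ = ""` line, which makes each read return a whole blank-line-separated block instead of a single line. Here is a minimal sketch of that behaviour on its own. The sample text and the helper name `read_records` are invented for illustration; they are not taken from your log files.

```perl
use strict;
use warnings;

# Hypothetical sample: two records separated by a blank line.
my $data = "first\nsecond\nthird\n\nalpha\nbeta\n";

# Hypothetical helper: return one array ref of lines per record.
sub read_records {
    my ($text) = @_;

    # Open the string as an in-memory file so the sketch needs no real file.
    open my $fh, '<', \$text or die "Cannot open in-memory file: $!";

    local $/ = "";    # paragraph mode: one read returns one block
    my @records;
    while ( my $record = <$fh> ) {
        # split drops the trailing empty fields left by the blank line
        my @lines = split /\n/, $record;
        push @records, \@lines;
    }
    close $fh;
    return @records;
}

for my $rec ( read_records($data) ) {
    print scalar(@$rec), " lines, first is '$rec->[0]'\n";
}
```

With the sample text this prints "3 lines, first is 'first'" and then "2 lines, first is 'alpha'". Inside the loop, `$lines[2]` and `$lines[3]` are simply the third and fourth lines of the current block, which is why no index arithmetic is needed.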

Related

how to display the hash value from my sample data

I'm learning Perl at the moment and would like some help with this exercise.
My objective is to display the hash values for PartID 1, 2 and 3.
The sample output displays only the lot, wafer, program, version, testnames, testnumbers, hilimit, lolimit and partid values.
sample data
lot=lot123
wafer=1
program=prgtest
version=1
Testnames,T1,T2,T3
Testnumbers,1,2,3
Hilimit,5,6,7
Lolimit,1,2,3
PartID,,,,
1,3,0,5
2,4,3,2
3,5,6,3
This is my code:
#!/usr/bin/perl
use strict;
use Getopt::Long;
my $file = "";
GetOptions ("infile=s" => \$file ) or die("Error in command line arguments\n");
my $lotid = "";
open(DATA, $file) or die "Couldn't open file $file";
while(my $line = <DATA>) {
#print "$line";
if ( $line =~ /^lot=/ ) {
#print "$line \n";
my ($dump, $lotid) = split /=/, $line;
print "$lotid\n";
}
elsif ($line =~ /^program=/ ) {
my ($dump, $progid) = split /=/, $line;
print "$progid \n";
}
elsif ($line =~ /^wafer=/ ) {
my ($dump, $waferid) = split /=/, $line;
print "$waferid \n";
}
elsif ($line =~ /^version=/ ) {
my ($dump, $verid) = split /=/, $line;
print "$verid \n";
}
elsif ($line =~ /^testnames/i) {
my ($dump, @arr) = split /\,/, $line;
foreach my $e (@arr) {
print $e, "\n";
}
}
elsif ($line =~ /^testnumbers/i) {
my ($dump, @arr1) = split /\,/, $line;
foreach my $e1 (@arr1) {
print $e1, "\n";
}
}
elsif ($line =~ /^hilimit/i) {
my ($dump, @arr2) = split /\,/, $line;
foreach my $e2 (@arr2) {
print $e2, "\n";
}
}
elsif ($line =~ /^lolimit/i) {
my ($dump, @arr3) = split /\,/, $line;
foreach my $e3 (@arr3) {
print $e3, "\n";
}
}
}
Please help me add code to display the PartID 1, 2, 3 hash.
So I've rewritten your code a little to use a few more modern Perl idioms (along with some comments to explain what I've done). The bit I've added is near the bottom.
#!/usr/bin/perl
use strict;
# Added 'warnings' which you should always use
use warnings;
# Use say() instead of print()
use feature 'say';
use Getopt::Long;
my $file = "";
GetOptions ("infile=s" => \$file)
or die ("Error in command line arguments\n");
# Use a lexical variable for a filehandle.
# Use the (safer) 3-argument version of open().
# Add $! to the error message.
open(my $fh, '<', $file) or die "Couldn't open file $file: $!";
# Read each record into $_ - which makes the following code simpler
while (<$fh>) {
# Match on $_
if ( /^lot=/ ) {
# Use "undef" instead of a $dump variable.
# split() works on $_ by default.
my (undef, $lotid) = split /=/;
# Use say() instead of print() - less punctuation :-)
say $lotid;
}
elsif ( /^program=/ ) {
my (undef, $progid) = split /=/;
say $progid;
}
elsif ( /^wafer=/ ) {
my (undef, $waferid) = split /=/;
say $waferid;
}
elsif ( /^version=/ ) {
my (undef, $verid) = split /=/;
say $verid;
}
elsif ( /^testnames/i) {
my (undef, @arr) = split /\,/;
# Changed all of these similar pieces of code
# to use the same variable names. As they are
# defined in different code blocks, they are
# completely separate variables.
foreach my $e (@arr) {
say $e;
}
}
elsif ( /^testnumbers/i) {
my (undef, @arr) = split /\,/;
foreach my $e (@arr) {
say $e;
}
}
elsif ( /^hilimit/i) {
my (undef, @arr) = split /\,/;
foreach my $e (@arr) {
say $e;
}
}
elsif ( /^lolimit/i) {
my (undef, @arr) = split /\,/;
foreach my $e (@arr) {
say $e;
}
}
# And here's the new bit.
# If we're on the "partid" line, then read the next
# three lines, split each one and print the first
# element from the list returned by split().
elsif ( /^partid/i) {
say +(split /,/, <$fh>)[0] for 1 .. 3;
}
}
Update: By the way, there are no hashes anywhere in this code :-)
Update 2: I've just realised that you only have three different ways to process the data. So you can simplify your code drastically by using slightly more complex regexes.
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
use Getopt::Long;
my $file = "";
GetOptions ("infile=s" => \$file)
or die ("Error in command line arguments\n");
open(my $fh, '<', $file) or die "Couldn't open file $file: $!";
while (<$fh>) {
# Single value - just print it.
if ( /^(?:lot|program|wafer|version)=/ ) {
my (undef, $value) = split /=/;
say $value;
}
# List of values - split and print.
elsif ( /^(?:testnames|testnumbers|hilimit|lolimit)/i) {
my (undef, @arr) = split /\,/;
foreach my $e (@arr) {
say $e;
}
}
# Extract values from following lines.
elsif ( /^partid/i) {
say +(split /,/, <$fh>)[0] for 1 .. 3;
}
}
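Since the question asked for a hash keyed by PartID, here is a minimal sketch of one way to build it. It is not part of the programs above: the helper name `parts_to_hash` is made up for illustration, and the rows are copied from the question's sample data rather than read from a file.

```perl
use strict;
use warnings;

# Rows copied from the question's sample data (the lines after "PartID,,,,").
my @rows = ( '1,3,0,5', '2,4,3,2', '3,5,6,3' );

# Hypothetical helper: map each part ID to an array ref of its results.
sub parts_to_hash {
    my @lines = @_;
    my %part;
    for my $line (@lines) {
        chomp $line;
        # The first field is the part ID, the rest are the test results
        my ( $id, @results ) = split /,/, $line;
        $part{$id} = \@results;
    }
    return %part;
}

my %part = parts_to_hash(@rows);
for my $id ( sort { $a <=> $b } keys %part ) {
    print "PartID $id: @{ $part{$id} }\n";
}
```

In the real script the same split could be applied to each line read after the "partid" line, so the hash fills up as the file is read.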

"No such file" when opening multiple files in a directory, but no error when opening only one file

I can open one file in a directory and run the following code. However, when I try to use the same code on multiple files within a directory, I get an error regarding there not being a file.
I have tried to make sure that I am naming the files correctly, that they are in the right format, that they are located in my current working directory, and that things are referenced correctly.
I know a lot of people have had this error before and have posted similar questions, but any help would be appreciated.
Working code:
#!/usr/bin/perl
use warnings;
use strict;
use diagnostics;
use List::Util qw( min max );
my $RawSequence = loadSequence("LDTest.fasta");
my $windowSize = 38;
my $stepSize = 1;
my %hash;
my $s1;
my $s2;
my $dist;
for ( my $windowStart = 0; $windowStart <= 140; $windowStart += $stepSize ) {
my $s1 = substr( $$RawSequence, $windowStart, $windowSize );
my $s2 = 'CGGAGCTTTACGAGCCGTAGCCCAAACAGTTAATGTAG';
# the 28 nt forward primer after the barcode plus the first 10 nt of the mtDNA dequence
my $dist = levdist( $s1, $s2 );
$hash{$dist} = $s1;
#print "Distance between '$s1' and '$s2' is $dist\n";
sub levdist {
my ( $seq1, $seq2 ) = (@_)[ 0, 1 ];
my $l1 = length($s1);
my $l2 = length($s2);
my @s1 = split '', $seq1;
my @s2 = split '', $seq2;
my $distances;
for ( my $i = 0; $i <= $l1; $i++ ) {
$distances->[$i]->[0] = $i;
}
for ( my $j = 0; $j <= $l2; $j++ ) {
$distances->[0]->[$j] = $j;
}
for ( my $i = 1; $i <= $l1; $i++ ) {
for ( my $j = 1; $j <= $l2; $j++ ) {
my $cost;
if ( $s1[ $i - 1 ] eq $s2[ $j - 1 ] ) {
$cost = 0;
}
else {
$cost = 1;
}
$distances->[$i]->[$j] = minimum(
$distances->[ $i - 1 ]->[ $j - 1 ] + $cost,
$distances->[$i]->[ $j - 1 ] + 1,
$distances->[ $i - 1 ]->[$j] + 1,
);
}
}
my $min_distance = $distances->[$l1]->[$l2];
for ( my $i = 0; $i <= $l1; $i++ ) {
$min_distance = minimum( $min_distance, $distances->[$i]->[$l2] );
}
for ( my $j = 0; $j <= $l2; $j++ ) {
$min_distance = minimum( $min_distance, $distances->[$l1]->[$j] );
}
return $min_distance;
}
}
sub minimum {
my $min = shift @_;
foreach (@_) {
if ( $_ < $min ) {
$min = $_;
}
}
return $min;
}
sub loadSequence {
my ($sequenceFile) = @_;
my $sequence = "";
unless ( open( FASTA, "<", $sequenceFile ) ) {
die $!;
}
while (<FASTA>) {
my $line = $_;
chomp($line);
if ( $line !~ /^>/ ) {
$sequence .= $line; #if the line doesn't start with > it is the sequence
}
}
return \$sequence;
}
my @keys = sort { $a <=> $b } keys %hash;
my $BestMatch = $hash{ $keys[0] };
if ( $keys[0] < 8 ) {
$$RawSequence =~ s/\Q$BestMatch\E/CGGAGCTTTACGAGCCGTAGCCCAAACAGTTAATGTAG/g;
print ">|Forward|Distance_of_Best_Match: $keys[0] |Sequence_of_Best_Match: $BestMatch", "\n",
"$$RawSequence", "\n";
}
Here is an abbreviated version of my non-working code. I haven't included the parts that didn't change:
Headers and Globals:
my $dir = ("/Users/roblogan/Documents/FakeFastaFiles");
my @ArrayofFiles = glob "$dir/*.fasta";
foreach my $file ( @ArrayofFiles ) {
open( my $Opened, $file ) or die "can't open file: $!";
while ( my $OpenedFile = <$Opened> ) {
my $RawSequence = loadSequence($OpenedFile);
for ( ... ) {
...;
print
">|Forward|Distance_of_Best_Match: $keys[0] |Sequence_of_Best_Match: $BestMatch",
"\n", "$$RawSequence", "\n";
}
}
}
The exact error is:
Uncaught exception from user code:
No such file or directory at ./levenshtein_for_directory.pl line 93, <$Opened> line 1.
main::loadSequence('{\rtf1\ansi\ansicpg1252\cocoartf1404\cocoasubrtf470\x{a}') called at ./levenshtein_for_directory.pl line 22
line 93:
89 sub loadSequence{
90 my ($sequenceFile) = @_;
91 my $sequence = "";
92 unless (open(FASTA, "<", $sequenceFile)){
93 die $!;
94 }
Line 22:
18 foreach my $file ( @ArrayofFiles ) {
19 open (my $Opened, $file) or die "can't open file: $!";
20 while (my $OpenedFile = <$Opened>) {
21
22 my $RawSequence = loadSequence($OpenedFile);
23
I just learned that "FASTA file" is an established term. I wasn't aware of that and previously thought these were files containing filenames or something similar. As @zdim already said, you're opening these files twice.
The following code gets a list of FASTA files (only the filenames) and then calls loadSequence with each filename. That subroutine opens the given file, concatenates the lines that don't start with > into one long string and returns it.
# input: the NAME of a FASTA file
# return: all sequences in that file as one very long string
sub loadSequence
{
my ($fasta_filename) = @_;
my $sequence = "";
open( my $fasta_fh, '<', $fasta_filename ) or die "Cannot open $fasta_filename: $!\n";
while ( my $line = <$fasta_fh> ) {
chomp($line);
if ( $line !~ /^>/ ) {
$sequence .= $line; #if the line doesn't start with > it is the sequence
}
}
close($fasta_fh);
return $sequence;
}
# ...
my $dir = '/Users/roblogan/Documents/FakeFastaFiles';
my @ArrayofFiles = glob "$dir/*.fasta";
foreach my $filename (@ArrayofFiles) {
my $RawSequence = loadSequence($filename);
# ...
}
You seem to be trying to open files twice. The line
my @ArrayofFiles = glob "$dir/*.fasta";
gives you the list of files. Then
foreach my $file (@ArrayofFiles){
open (my $Opened, $file) or die "can't open file: $!";
while (my $OpenedFile = <$Opened>) {
my $RawSequence = loadSequence($OpenedFile);
# ...
does the following, line by line. It iterates through files, opens each, reads a line from it, and then submits that line to the function loadSequence().
However, in that function you attempt to open a file again
sub loadSequence{
my ($sequenceFile) = @_;
my $sequence = "";
unless (open(FASTA, "<", $sequenceFile)){
# ...
The $sequenceFile variable in the function is passed to the function as $OpenedFile -- which is a line in the file that is already opened and being read from, not the file name. While I am not certain about details of your code, the error you show seems to be consistent with this.
It may be that you are confusing the glob, which gives you the list of files, with the opendir which would indeed need a following readdir to access the files.
Try renaming $OpenedFile to, say, $line (which it is) and see how it looks then.
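To make the glob versus opendir/readdir distinction concrete, here is a small sketch under some assumptions: the helper names `fasta_via_glob` and `fasta_via_readdir` are invented, and the demo works in a throwaway temporary directory rather than your FakeFastaFiles folder. Either way the result is a list of file names to hand to loadSequence, never a line read from a file.

```perl
use strict;
use warnings;
use File::Basename qw(basename);
use File::Spec;
use File::Temp qw(tempdir);

# glob returns ready-to-open paths with the directory already included.
sub fasta_via_glob {
    my ($dir) = @_;
    my @files = glob "$dir/*.fasta";
    return sort @files;
}

# readdir returns bare names, so the directory has to be prepended.
sub fasta_via_readdir {
    my ($dir) = @_;
    opendir my $dh, $dir or die "Cannot open directory $dir: $!";
    my @names = grep { /\.fasta\z/ } readdir $dh;
    closedir $dh;
    my @files = map { File::Spec->catfile( $dir, $_ ) } @names;
    return sort @files;
}

# Hypothetical demo directory: two empty FASTA files and one unrelated file.
my $dir = tempdir( CLEANUP => 1 );
for my $name (qw( a.fasta b.fasta notes.txt )) {
    my $path = File::Spec->catfile( $dir, $name );
    open my $fh, '>', $path or die "Cannot create $path: $!";
    close $fh;
}

print "glob:    ", join( ' ', map { basename($_) } fasta_via_glob($dir) ),    "\n";
print "readdir: ", join( ' ', map { basename($_) } fasta_via_readdir($dir) ), "\n";
```

Both lines should list a.fasta and b.fasta. One caveat with glob is that it splits its pattern on whitespace, so a directory name containing spaces needs extra care.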

Running a nested while loop inside a foreach loop in Perl

I'm trying to use a foreach loop to loop through an array and then use a nested while loop to loop through each line of a text file to see if the array element matches a line of text; if so then I push data from that line into a new array to perform calculations.
The outer foreach loop appears to be working correctly (based on printed results with each array element) but the inner while loop is not looping (same data pushed into array each time).
Any advice?
The code is below
#! /usr/bin/perl -T
use CGI qw(:cgi-lib :standard);
print "Content-type: text/html\n\n";
my $input = param('sequence');
my $meanexpfile = "final_expression_complete.txt";
open(FILE, $meanexpfile) or print "unable to open file";
my @meanmatches;
@regex = (split /\s/, $input);
foreach $regex (@regex) {
while (my $line = <FILE>) {
if ( $line =~ m/$regex\s(.+\n)/i ) {
push(@meanmatches, $1);
}
}
my $average = average(@meanmatches);
my $std_dev = std_dev($average, @meanmatches);
my $average_round = sprintf("%0.4f", $average);
my $stdev_round = sprintf("%0.4f", $std_dev);
my $coefficient_of_variation = $stdev_round / $average_round;
my $cv_round = sprintf("%0.4f", $coefficient_of_variation);
print font(
{ color => "blue" }, "<br><B>$regex average: $average_round
&nbspStandard deviation: $stdev_round&nbspCoefficient of
variation(Cv): $cv_round</B>"
);
}
sub average {
my (@values) = @_;
my $count = scalar @values;
my $total = 0;
$total += $_ for @values;
return $count ? $total / $count : 0;
}
sub std_dev {
my ($average, @values) = @_;
my $count = scalar @values;
my $std_dev_sum = 0;
$std_dev_sum += ($_ - $average)**2 for @values;
return $count ? sqrt($std_dev_sum / $count) : 0;
}
Yes, my advice would be:
Turn on strict and warnings.
perltidy your code.
Use 3-argument open: open ( my $inputfile, "<", 'final_expression.txt' );
die if the file doesn't open; the rest of your program is irrelevant without it.
chomp $line.
You are iterating over your filehandle, but once you've done that you're at end of file for the next iteration of the foreach loop, so your while loop becomes a null operation. Simplistically, reading the file into an array with my @lines = <FILE>; would fix this.
So with that in mind:
#!/usr/bin/perl -T
use strict;
use warnings;
use CGI qw(:cgi-lib :standard);
print "Content-type: text/html\n\n";
my $input = param('sequence');
my $meanexpfile = "final_expression_complete.txt";
open( my $input_file, "<", $meanexpfile ) or die "unable to open file";
my @regex = ( split /\s/, $input );
my @lines = <$input_file>;
chomp (@lines);
close($input_file) or warn $!;
foreach my $regex (@regex) {
# Start with an empty list for each pattern so results don't accumulate
my @meanmatches;
foreach my $line (@lines) {
# The lines are chomped, so the pattern must not expect a trailing newline
if ( $line =~ m/$regex\s(.+)/i ) {
push( @meanmatches, $1 );
}
}
my $average = average(@meanmatches);
my $std_dev = std_dev( $average, @meanmatches );
my $average_round = sprintf( "%0.4f", $average );
my $stdev_round = sprintf( "%0.4f", $std_dev );
my $coefficient_of_variation = $stdev_round / $average_round;
my $cv_round = sprintf( "%0.4f", $coefficient_of_variation );
print font(
{ color => "blue" }, "<br><B>$regex average: $average_round
&nbspStandard deviation: $stdev_round&nbspCoefficient of
variation(Cv): $cv_round</B>"
);
}
sub average {
my (@values) = @_;
my $count = scalar @values;
my $total = 0;
$total += $_ for @values;
return $count ? $total / $count : 0;
}
sub std_dev {
my ( $average, @values ) = @_;
my $count = scalar @values;
my $std_dev_sum = 0;
$std_dev_sum += ( $_ - $average )**2 for @values;
return $count ? sqrt( $std_dev_sum / $count ) : 0;
}
The problem here is that starting from the second iteration of foreach you are trying to read from already read file handle. You need to rewind to the beginning to read it again:
foreach $regex (@regex) {
seek FILE, 0, 0;
while ( my $line = <FILE> ) {
However, that does not look very performant. Why read the file several times at all, when you can read it once before the foreach starts, and then iterate through the list:
my @lines;
while (<FILE>) {
push (@lines, $_);
}
foreach $regex (@regex) {
foreach $line (@lines) {
Having the latter, you might also want to consider using grep instead of the inner loop.
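As a rough sketch of that last suggestion, the inner loop can collapse into a single list operation once the lines are in an array. The sample lines, the "name value" layout and the helper name `values_for` are assumptions made for illustration, not the contents of your real file. I've used map rather than grep here because it can filter and extract the captured value in one step.

```perl
use strict;
use warnings;

# Hypothetical contents of the expression file: "name value" on each line.
my @lines = ( "geneA 1.5", "geneB 2.0", "geneA 2.5" );

# Hypothetical helper: return the values on lines that start with $name.
sub values_for {
    my ( $name, @candidates ) = @_;

    # An empty list is returned for non-matching lines, so they drop out
    return map { /^\Q$name\E\s+(\S+)/i ? $1 : () } @candidates;
}

for my $name (qw( geneA geneB )) {
    my @values = values_for( $name, @lines );
    print "$name: @values\n";
}
```

Because the array is read once and reused, every name sees all the lines, which is exactly what the exhausted filehandle prevented. The `\Q...\E` treats the name as literal text; drop it if the input is meant to be a regex.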

Take random substrings from genome data

I am trying to use the substring function to take random 21 base sequences from a genome in fasta format. Below is the start of the sequence:
FILE1 data:
>gi|385195117|emb|HE681097.1| Staphylococcus aureus subsp. aureus HO 5096 0412 complete genome
CGATTAAAGATAGAAATACACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTTGGTCCGAAGCATGAGTGTTTACAATGTTTGAATACCTTATACAGTTCTTATACATAC
I have tried adapting a previous answer to use while reading my file and I'm not getting any error messages, just no output! The code is meant to prevent any overlap between sequences, though the chances of that are very small anyway.
Code as follows:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $outputfile = "/Users/edwardtickle/Documents/randomoutput.txt";
open FILE1, "/Users/edwardtickle/Documents/EMRSA-15.fasta";
open( OUTPUTFILE, ">$outputfile" );
while ( my $line = <FILE1> ) {
if ( $line =~ /^([ATGCN]+)/ ) {
my $genome = $1;
my $size = 21;
my $count = 5;
my $mark = 'X';
if ( 2 * $size * $count - $size - $count >= length($genome) ) {
my @substrings;
while ( @substrings < $count ) {
my $pos = int rand( length($genome) - $size + 1 );
push @substrings, substr( $genome, $pos, $size, $mark x $size )
if substr( $genome, $pos, $size ) !~ /\Q$mark/;
for my $random (@substrings) {
print OUTPUTFILE "random\n";
}
}
}
}
}
Thanks for your help!
One of the neatest ways to select a random start point is to shuffle a list of all possible start points and select the first few -- as many as you need.
It's also best practice to use the three-parameter form of open, and lexical file handles.
The loop in this example starts much like your own -- picking up the genomes using a regex. The subsequences of length $size can start anywhere from zero up to $len_genome - $size, so the program generates a list of all these starting points, shuffles them using the utility function from List::Util, and puts them in @start_points.
Finally, if there are sufficient start points to form $count different subsequences, then they are printed, using substr in the print statement.
use strict;
use warnings;
use autodie;
use List::Util qw/ shuffle /;
my $outputfile = '/Users/edwardtickle/Documents/randomoutput.txt';
open my $in_fh, '<', '/Users/edwardtickle/Documents/EMRSA-15.fasta';
open my $out_fh, '>', $outputfile;
my $size = 21;
my $count = 5;
while (my $line = <$in_fh>) {
next unless $line =~ /^([ATGCN]+)/;
my $genome = $1;
my $len_genome = length $genome;
my @start_points = shuffle(0 .. $len_genome-$size);
next unless @start_points >= $count;
# Write to the output file, using $size rather than a hard-coded length
print {$out_fh} substr($genome, $_, $size), "\n" for @start_points[0 .. $count-1];
}
output
TACACGATGCGAGCAATCAAA
GTTTACAATGTTTGAATACCT
ACATCACCATGAGTTTGGTCC
ATAACATCACCATGAGTTTGG
GGTCCGAAGCATGAGTGTTTA
I would recommend saving all possible positions for a substring in an array. That way you can remove possibilities after each substring to prevent overlap:
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $infile = "/Users/edwardtickle/Documents/EMRSA-15.fasta";
my $outfile = "/Users/edwardtickle/Documents/randomoutput.txt";
my $size = 21;
my $count = 5;
my $min_length = ( $count - 1 ) * ( 2 * $size - 1 ) + $size;
#open my $infh, '<', $infile;
#open my $outfh, '>', $outfile;
my $infh = \*DATA;
my $outfh = \*STDOUT;
while ( my $line = <$infh> ) {
next unless $line =~ /^([ATGCN]+)/;
my $genome = $1;
# Need a long enough sequence for multiple substrings with no overlap
if ( $min_length > length $genome ) {
warn "Line $., Genome too small: Must be $min_length, not ", length($genome), "\n";
next;
}
# Save all possible positions for substrings in an array. This enables us
# to remove possibilities after each substring to prevent overlap.
my @pos = ( 0 .. length($genome) - 1 - ( $size - 1 ) );
for ( 1 .. $count ) {
my $index = int rand @pos;
my $pos = $pos[$index];
# Remove from possible positions
my $min = $index - ( $size - 1 );
$min = 0 if $min < 0;
splice @pos, $min, $size + $index - $min;
my $substring = substr $genome, $pos, $size;
print $outfh "$pos - $substring\n";
}
}
__DATA__
>gi|385195117|emb|HE681097.1| Staphylococcus aureus subsp. aureus HO 5096 0412 complete genome
CGATTAAAGATAGAAATACACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTTGGTCCGAAGCATGAGTGTTTACAATGTTTGAATACCTTATACAGTTCTTATACATACCGATTAAAGATAGAAATACACGATGCGAGCAATCAAA
CGATTAAAGATAGAAATACACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTTGGTCCGAAGCATGAGTGTTTACAATGTTTGAATACCTTATACAGTTCTTATACATACCGATTAAAGATAGAAATACACGATGCGAGCAATCAAATTTCATAACATCACCATGAGTTTGGTCCGAAGCATGAGTGTTTACAATGTTTGAATACCTTATACAGTTCTTATACATAC
Outputs:
Line 2, Genome too small: Must be 185, not 154
101 - CAGTTCTTATACATACCGATT
70 - ATGAGTGTTTACAATGTTTGA
6 - AAGATAGAAATACACGATGCG
38 - TTCATAACATCACCATGAGTT
182 - GAAGCATGAGTGTTTACAATG
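The $min_length check may be easier to follow with the arithmetic spelled out. A genome of length L has L - $size + 1 possible start positions, and each pick removes at most 2 * $size - 1 of them (the chosen start plus $size - 1 on either side). For the last pick to still have a start position left after $count - 1 earlier picks, the genome needs at least ($count - 1) * (2 * $size - 1) + $size characters. The sketch below just evaluates that expression; the helper name `min_genome_length` is made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: shortest genome that guarantees $count
# non-overlapping substrings of $size characters in the worst case.
sub min_genome_length {
    my ( $size, $count ) = @_;
    return ( $count - 1 ) * ( 2 * $size - 1 ) + $size;
}

# 4 * 41 + 21 = 185, the figure reported in the warning above
print min_genome_length( 21, 5 ), "\n";
```

This is a worst-case bound: a shorter genome can still yield five chunks if the random picks happen to pack tightly, but it is no longer guaranteed.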
Alternative method for large genomes
You mentioned in a comment that genome may be 2 gigs in size. If that's the case then it's possible that there won't be enough memory to have an array of all possible positions.
Your original approach of substituting a fake character for each chosen substring would work in that case. The following is how I would do it, using redo:
for ( 1 .. $count ) {
my $pos = int rand( length($genome) - ( $size - 1 ) );
my $str = substr $genome, $pos, $size;
redo if $str =~ /X/;
substr $genome, $pos, $size, 'X' x $size;
print $outfh "$pos - $str\n";
}
Also note, that if your genome really is that big, then you must also be wary of the randbits setting of your Perl version:
$ perl -V:randbits
randbits='48';
On some Windows versions, the randbits setting was just 15, which gives only 32,768 possible random values; see the related question "Why does rand($val) not warn when $val > 2 ** randbits?"
I found it more effective to move the output for loop outside the inner while, and to add a condition to the while such that $genome must contain a $size-long chunk that hasn't already been partly selected.
Just because you've got a string that's 117 characters long doesn't mean you'll find 5 random non-overlapping chunks.
#!/usr/bin/perl
use strict;
use warnings;
use autodie;
my $outputfile = "/Users/edwardtickle/Documents/randomoutput.txt";
open FILE1, "/Users/edwardtickle/Documents/EMRSA-15.fasta";
open( OUTPUTFILE, ">$outputfile" );
while ( my $line = <FILE1> ) {
if ( $line =~ /^([ATGCN]+)/ ) {
my $genome = $1;
my $size = 21;
my $count = 5;
my $mark = 'X';
if ( 2 * $size * $count - $size - $count >= length($genome) ) {
my @substrings;
while ( @substrings < $count
and $genome =~ /[ATGCN]{$size}/ ) { # <- changed this
my $pos = int rand( length($genome) - $size + 1 );
push @substrings, substr( $genome, $pos, $size, $mark x $size )
if substr( $genome, $pos, $size ) !~ /\Q$mark/;
}
# v- changed this
print OUTPUTFILE "$_\n" for @substrings;
}
}
}

How do I speed up pattern recognition in perl

This is the program as it stands right now. It takes in a .fasta file (a file containing genetic code), creates a hash table with the data and prints it. However, it is quite slow. It splits the string into substrings and compares each one against every other substring in the file.
use strict;
use warnings;
use Data::Dumper;
my $total = $#ARGV + 1;
my $row;
my $compare;
my %hash;
my $unique = 0;
open( my $f1, '<:encoding(UTF-8)', $ARGV[0] ) or die "Could not open file '$ARGV[0]' $!\n";
my $discard = <$f1>;
while ( $row = <$f1> ) {
chomp $row;
$compare .= $row;
}
my $size = length($compare);
close $f1;
for ( my $i = 0; $i < $size - 6; $i++ ) {
my $vs = ( substr( $compare, $i, 5 ) );
for ( my $j = 0; $j < $size - 6; $j++ ) {
foreach my $value ( substr( $compare, $j, 5 ) ) {
if ( $value eq $vs ) {
if ( exists $hash{$value} ) {
$hash{$value} += 1;
} else {
$hash{$value} = 1;
}
}
}
}
}
foreach my $val ( values %hash ) {
if ( $val == 1 ) {
$unique++;
}
}
my $OUTFILE;
open $OUTFILE, ">output.txt" or die "Error opening output.txt: $!\n";
print {$OUTFILE} "Number of unique keys: " . $unique . "\n";
print {$OUTFILE} Dumper( \%hash );
close $OUTFILE;
Thanks in advance for any help!
It is not clear from the description what is wanted from this script, but if you're looking for matching sets of 5 characters, you don't actually need to do any string matching: you can just run through the whole sequence and keep a tally of how many times each 5-letter sequence occurs.
use strict;
use warnings;
use Data::Dumper;
my $str; # store the sequence here
my %hash;
# slurp in the whole file
open(IN, '<:encoding(UTF-8)', $ARGV[0]) or die "Could not open file '$ARGV[0]' $!\n";
while (<IN>) {
chomp;
$str .= $_;
}
close(IN);
# not sure if you were deliberately omitting the last two letters of sequence
# this looks at all the sequence
my $l_size = length($str) - 4;
for (my $i = 0; $i < $l_size; $i++) {
$hash{ substr($str, $i, 5) }++;
}
# grep in a scalar context will count the values.
my $unique = grep { $_ == 1 } values %hash;
open OUT, ">output.txt" or die "Error opening output.txt: $!\n";
print OUT "Number of unique keys: ". $unique."\n";
print OUT Dumper(\%hash);
close OUT;
It might help to remove searching for information that you already have.
I don't see that $j depends upon $i so you're actually matching values to themselves.
So you're getting bad counts as well. It works for 1, because 1 is the square of 1.
But if for each five-character string you're counting strings that match, you're going
to get the square of the actual number.
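Here is a small sketch that shows the squaring effect on a toy sequence. The sequence and the helper names `count_nested` and `count_single` are invented for illustration, and for simplicity both helpers look at every 5-character window rather than stopping two short as the original loops do.

```perl
use strict;
use warnings;

my $seq = 'AAAAAACGT';    # hypothetical toy sequence

# Same shape as the original: compare every window against every window.
sub count_nested {
    my ($s) = @_;
    my %n;
    my $last = length($s) - 5;
    for my $i ( 0 .. $last ) {
        my $vs = substr( $s, $i, 5 );
        for my $j ( 0 .. $last ) {
            $n{$vs}++ if substr( $s, $j, 5 ) eq $vs;
        }
    }
    return %n;
}

# Single pass: tally each window exactly once.
sub count_single {
    my ($s) = @_;
    my %n;
    $n{ substr( $s, $_, 5 ) }++ for 0 .. length($s) - 5;
    return %n;
}

my %nested = count_nested($seq);
my %single = count_single($seq);
for my $key ( sort keys %single ) {
    print "$key: nested=$nested{$key} single=$single{$key}\n";
}
```

AAAAA occurs twice in the toy sequence, so the single pass reports 2 while the nested loops report 4; the windows that occur once report 1 either way.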
You would actually get better results if you did it this way:
# compute it once.
my $lim = length( $compare ) - 6;
for ( my $i = 0; $i < $lim; $i++ ){
my $vs = substr( $compare, $i, 5 );
# count each unique identity *once*
# if it's in the table, we've already counted it.
next if $hash{ $vs };
$hash{ $vs }++; # we've found it, record it.
for ( my $j = $i + 1; $j < $lim; $j++ ) {
my $value = substr( $compare, $j, 5 );
$hash{ $value }++ if $value eq $vs;
}
}
However, it could be an improvement on this to use index for your second loop
and let the C level of perl do your matching for you.
my $pos = $i;
while ( $pos > -1 ) {
$pos = index( $compare, $vs, ++$pos );
$hash{ $vs }++ if $pos > -1;
}
Also, if you used index, and wanted to omit the last two characters--as you do, it might make sense to remove those from the characters you have to search:
substr( $compare, -2 ) = ''
But you could do all of this in one pass, as you loop through file. I believe the code
below is almost an equivalent.
my $last_4 = '';
my $last_row = '';
my $discard = <$f1>;
# each row in the file after the first...
while ( $row = <$f1> ) {
chomp $row;
$last_row = $row;
$row = $last_4 . $row;
my $lim = length( $row ) - 4; # number of 5-character windows in $row
for ( my $i = 0; $i < $lim; $i++ ) {
$hash{ substr( $row, $i, 5 ) }++;
}
# four is the maximum we can copy over to the new row and not
# double count a strand of characters at the end.
$last_4 = substr( $row, -4 );
}
# I'm not sure what you're getting by omitting the last two characters of
# the last row, but this would replicate it
foreach my $bad_key ( map { substr( $last_row, $_, 5 ) } ( -5, -6 )) {
--$hash{ $bad_key };
delete $hash{ $bad_key } if $hash{ $bad_key } < 1;
}
# grep in a scalar context will count the values.
$unique = grep { $_ == 1 } values %hash;
You may be interested in this more concise version of your code that uses a global regex match to find all the subsequences of five characters. It also reads the entire input file in one go, and removes the newlines afterwards.
The path to the input file is expected as a parameter on the command line, and the output is sent to STDOUT, so it can be redirected to a file on the command line, like this:
perl subseq5.pl input.txt > output.txt
I've also used Data::Dump instead of Data::Dumper because I believe it to be vastly superior. However it is not a core module, and so you will probably need to install it.
use strict;
use warnings;
use open qw/ :std :encoding(utf-8) /;
use Data::Dump;
my $str = do { local $/; <>; };
$str =~ tr/\n//d; # tr does not interpolate $/, so name the newline directly
my %dups;
++$dups{$1} while $str =~ /(?=(.{5}))/g;
my $unique = grep $_ == 1, values %dups;
print "Number of unique keys: $unique\n";
dd \%dups;
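The lookahead in that regex is what makes the matches overlap, which may not be obvious at first sight. Here is a minimal sketch of the idiom on its own; the helper name `windows` and the sample string are made up for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: every overlapping window of $k characters in $str.
sub windows {
    my ( $str, $k ) = @_;

    # The lookahead consumes nothing, so each match is zero-length and the
    # regex engine moves on by one character before trying again.
    # In list context a /g match returns all the captured groups.
    return $str =~ /(?=(.{$k}))/g;
}

my @w = windows( 'ABCDEFG', 5 );
print "@w\n";    # ABCDE BCDEF CDEFG
```

Without the lookahead, /(.{5})/g would consume five characters per match and return only the non-overlapping chunks. Note also that `.` does not match a newline by default, which is why the newlines need to be stripped from the string first.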