How to look at the next line of a file in Perl

I have a piece of code which opens up a file and parses it. This text document has a repetitive structure and contains multiple entries. I need to peek ahead within my loop to see if there is a new entry; if there is, I will be able to parse all of the data my program extracts. Let me first show my implementation so far:
use strict;
my $doc = open(my $fileHandler, "<", "test.txt");
while (my $line = <$fileHandler>) {
    ## right here I want to look at the next line to see if
    ## $line =~ m/>/ where > denotes a new entry
}

Try handling the iteration yourself:
my $line = <$fileHandler>;
while (1) { # keep looping until I say so
    my $nextLine = <$fileHandler>;
    if ($line =~ m/>/ || !defined $nextLine) {
        ### Do the stuff
    }
    ### Do any other stuff;
    last unless defined $nextLine;
    $line = $nextLine;
}
I added the extra check in the if statement under the assumption that you will also want to process what you have when you reach the end of the file.
Alternatively, as suggested by friedo, if the file can fit into memory, you can load the whole thing into an array at once:
my @lines = <$fileHandler>;
for (my $i = 0; $i <= $#lines; $i++) {
    if ($i == $#lines || $lines[$i+1] =~ />/) {
        ### Do the stuff
    }
}
This is more flexible in that you can access any arbitrary line of the file, in any order, but as mentioned the file does have to be small enough to fit into memory.

A nice way to handle these problems is using Tie::File, which allows you to treat a file like an array, without the performance penalty of actually loading the file into memory. It is also a core module since perl v5.7.3.
use Tie::File;
tie my @file, 'Tie::File', "test.txt" or die $!;
for my $linenr (0 .. $#file) {                # loop over line numbers
    if ($file[$linenr] =~ /foo/) {            # this is the current line
        if ($linenr < $#file &&               # don't go past the end of the file
            $file[$linenr + 1] =~ /^>/) {     # this is the next line
            # do stuff
        }
    }
}
untie @file; # all done


Open (IN...) command failing possibly due to problems with naming

I'm new to Perl and quite new to coding in general, so I apologise if this is formatted terribly and is an easy question! I'm simply trying to use somebody else's code as a step in a larger project involving PRAAT. The code is designed to distinguish beats in speech rhythm; I've followed their nomenclature in file naming (on line 2), but the code won't move past line 13. Could anyone tell me why? Is it trying to open a directory called "intensities"? Additionally, is there anywhere else I may have to change the code? It is quite possibly reasonably old! Thank you very much!
#!/usr/local/bin/perl -w
scalar(@ARGV) == 1 or scalar(@ARGV) == 2 or die "Usage: getBeatsOneShot.pl someSoundFile <threshold>";
$stem = shift;

# Parameters to fiddle with
if (scalar(@ARGV) == 0) {
    $threshold = 0.2;
} else {
    $threshold = shift;
    print "Threshold is $threshold\n";
}

open(IN, "intensities/$stem.intensity") or die "badly";
open(OUT, ">beats/$stem.beats") or die "eek";

# File type = "ooTextFile short"
$_ = <IN>; print OUT $_;
# replace "Intensity" with "TextGrid"
$_ = <IN>; print OUT "\"TextGrid\"\n\n";
# skip a line
$_ = <IN>;
chomp($xmin = <IN>);
chomp($xmax = <IN>);
chomp($nx = <IN>); $nx = 0; # (just suppress a warning here)
chomp($dx = <IN>);
chomp($x1 = <IN>);

# Read in intensity contour into @e (envelope)
@e = ();
while($_ = <IN>) { chomp; last unless $_ eq "1"; }
push @e, $_;
while($_ = <IN>) {
    chomp($_);
    push @e, $_;
}

# (1) Find max and min
$max = 0; $min = 1000000;
foreach $ival (@e) {
    if($ival > $max) {
        $max = $ival;
    }
    if($ival < $min) {
        $min = $ival;
    }
}

# (2) look for beats
@beats = ();
print "Thresh: $threshold\n";
open doesn't create the path to the file. Directories intensities/ and beats/ therefore must exist in the current working directory before the script is run.
When open fails, it sets $! to the reason for the failure. Instead of eek or badly, use die $! so Perl can tell you what went wrong.
Moreover, you should turn strict and warnings on. They prevent many common mistakes. As a newbie, you might like to enable diagnostics, too, to get detailed explanations of all the errors and warnings.
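For example (a sketch based on the script above, with the variable declared via my as strict requires), the two open calls could look like this:
use strict;
use warnings;
use diagnostics;   # optional: verbose explanations of warnings and errors

my $stem = shift;  # declared with my, as strict requires

open(IN,  "<", "intensities/$stem.intensity")
    or die "Cannot open intensities/$stem.intensity: $!";
open(OUT, ">", "beats/$stem.beats")
    or die "Cannot open beats/$stem.beats: $!";
Now a failure tells you which file could not be opened and why (e.g. "No such file or directory").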

Perl Split File Into Lines and Variables

I am currently splitting a file which holds some user/password information in Perl, and am doing it successfully, but I am not satisfied with my code. I am sure there is a better way to do it in Perl (I am a beginner). If someone could come up with a slicker way, that would be great!
my $i = 1;
my $DB;
my $DBHOST;
my $DBUSER;
my $DBPASS;
my $filename = "some_file";

open(my $fh, '<:encoding(UTF-8)', $filename)
    or die "Could not open file '$filename' $!";

while (my $row = <$fh>) {
    chomp $row;
    if ($i == 1) {
        $DB = (split /=/, $row)[1];
    }
    if ($i == 2) {
        $DBHOST = (split /=/, $row)[1];
    }
    if ($i == 3) {
        $DBUSER = (split /=/, $row)[1];
    }
    if ($i == 4) {
        $DBPASS = (split /=/, $row)[1];
    }
    $i++;
}
map() is pretty handy for things like this:
my ($DB, $DBHOST, $DBUSER, $DBPASS) = map {$_ =~ /.*?=(.*)/} <$fh>;
What's happening:
map() operates on lists, so it treats <$fh> as one
for each element in the list (a file line in this case), assign it into the default variable ($_)
then, capture the part of the line we want using the regex, return it, and assign it to the corresponding variable on the left-hand side (as map works through the file's lines, each captured value fills the next receiving variable in turn)
the regex operates as follows:
/
  .*?    # ignore everything, non-greedy, until we match a
  =      # our delimiter
  (      # begin capture
    .*   # capture everything until end of line (less the newline char)
  )      # end capture
/
Note that this solution will iterate all the way through the whole file, even after all four of the variables are populated (as does the way you've done it in your OP).
Also note that there's no error checking here, so if a value isn't captured, you'll get warnings related to undefined variables.
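If that matters, here is one possible variation (just a sketch, assuming the file really does start with exactly those four key=value lines): it reads only the first four lines and falls back to an empty string when a value cannot be captured:
my ($DB, $DBHOST, $DBUSER, $DBPASS) =
    map { /=(.*)/ ? $1 : '' }        # capture the part after '=', default to ''
    map { scalar(<$fh>) // '' }      # read just the first four lines
        1 .. 4;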
I think I would do away with your individual variables and store the connection information in a hash.
my %db_conn;
while (<$fh>) {
    chomp;
    my ($key, $val) = split /=/, $_, 2;
    $db_conn{$key} = $val;
}
Note the chomp, which strips the trailing newline so it doesn't end up in the stored value.
Of course, this assumes that whatever is to the left of the = on each line is a unique identifier for the value.
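For instance, if the file contains lines such as DBHOST=localhost and DBUSER=someuser (hypothetical keys; they depend entirely on the file's contents), the values are then available as hash lookups:
# keys are whatever appears left of '=' in the file
print "host: $db_conn{DBHOST}, user: $db_conn{DBUSER}\n";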

Extract and filter a range of lines from the input using Perl

I'm quite new to Perl and I have some problems skipping lines in a foreach loop. I want to copy some lines of a text file to a new one.
When the first words of a line are FIRST ITERATION, skip two more lines and print everything following until the end of the file or an empty line is encountered.
I've tried to find a similar post, but none of them talk about working with text files.
This is the form I thought of:
use 5.010;
use strict;
use warnings;

open( INPUT, "xxx.txt" ) or die("Could not open log file.");
open( OUT, ">>yyy.txt" );

foreach my $line (<INPUT>) {
    if ( $line =~ m/^FIRST ITERATION/ ) {
        # print OUT
    }
}

close(OUT);
close(INPUT);
I tried using next and $line++ but my program prints only the line that begins with FIRST ITERATION.
I may try to use a for loop but I don't know how many lines my file may have, nor do I know how many lines there are between "First Iteration" and the next empty line.
The simplest way is to process the file a line at a time and keep a state flag: it is set to 1 if the current line begins with FIRST ITERATION and to 0 if the line is blank; otherwise it is incremented if it is already positive, so that it counts the line number within the current block.
This solution expects the path to the input file as a parameter on the command line and prints its output to STDOUT, so you will need to redirect the output to a file on the command line as necessary.
Note that the regex pattern /\S/ checks whether there is a non-blank character anywhere in the current line, so not /\S/ is true if the line is empty or contains only blank characters.
use strict;
use warnings;

my $lines = 0;

while ( <> ) {
    if ( /^FIRST ITERATION/ ) {
        $lines = 1;
    }
    elsif ( not /\S/ ) {
        $lines = 0;
    }
    elsif ( $lines > 0 ) {
        ++$lines;
    }
    print if $lines > 3;
}
This can be simplified substantially by using Perl's built-in range operator, which keeps its own internal state and returns the number of times it has been evaluated. So the above may be written
use strict;
use warnings;

while ( <> ) {
    my $s = /^FIRST ITERATION/ ... not /\S/;
    print if $s and $s > 3;
}
And the last can be rewritten as a one-line command line program like this
$ perl -ne '$s = /^FIRST ITERATION/ ... not /\S/; print if $s and $s > 3' myfile.txt
Use an additional counter that tells you when to print a line. Something like this:
$skipCounter = 3;
And in foreach:
if ($skipCounter == 2) {
    # print OUT
}
if ( $line =~ m/^FIRST ITERATION/ ) {
    $skipCounter = 0;
}
$skipCounter++;
Advice: use STDIN and STDOUT instead of named files; this allows you to change them without modifying the script.
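A minimal sketch of that idea: read from STDIN, write to STDOUT, and choose the actual files only when running the script (the name filter.pl below is just a placeholder):
#!/usr/bin/perl
use strict;
use warnings;

while (my $line = <STDIN>) {
    # filtering logic (e.g. the counter above) goes here
    print $line;    # selected lines go to STDOUT
}
Run it as perl filter.pl < xxx.txt > yyy.txt.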
Code:
#!/usr/bin/perl
use 5.010;
use strict;
use warnings;

open(INPUT, "xxx.txt")  or die "Could not open log file: $!.";
open(OUT, ">yyy.txt")   or die "Could not open output file: $!";

while ( my $line = <INPUT> ) {
    if ( $line =~ m/^FIRST ITERATION/ ) {
        <INPUT>; # skip line
        <INPUT>; # skip line
        while ( $line = <INPUT> ) { # print till empty line
            last if $line eq "\n";
            print OUT $line;
        }
    }
}

close(OUT);
close(INPUT);
You're on the right track. What you need to use is the flip-flop operator (which is basically the range operator) ... It will toggle for you between two matches, so you get everything in between. After that, it's a matter of keeping track of the lines you want to skip.
So basically we are checking for FIRST ITERATION and for an empty line, and grab everything in between those. $skip is used to remember how many lines were skipped. It starts at 0 and gets incremented for the first two lines after we start being in the flip-flop if block. In the else case, where we are after the flip-flop, it gets reset to 0 so we can start over with the next block.
Since you know how to open and write files, I'll skip that.
use strict;
use warnings;

my $skip = 0;
while (<DATA>) {
    if (/^FIRST ITERATION$/ .. /^$/) {
        next if $skip++ <= 2;
        print $_;
    } else {
        $skip = 0;
    }
}
__DATA__
FIRST ITERATION
skip1
skip2
foo
bar
baz
don't print this
The output of this is:
foo
bar
baz
To stick with your own code, here's a very verbose solution that uses a foreach and no flip-flop. It does the same thing, just with a lot more words.
my $skip  = 0; # skip lines
my $match = 0; # keep track of whether we're in between the borders

foreach my $line (<DATA>) {
    if ( $line =~ m/^FIRST ITERATION/ ) {
        $match = 1;          # we are inside the match
        next;
    }
    if ( $line =~ m/^$/ ) {
        $match = 0;          # we are done matching
        next;
    }
    if ($match) {
        $skip++;             # count skip-lines
        if ($skip <= 2) {
            next;            # ... and skip the first two
        }
        print $line;         # this is the content we want
    }
}
Using paragraph mode (which returns blocks separated by blank lines rather than lines):
local $/ = ""; # Paragraph mode.
while (<>) {
s/\n\K\n+//; # Get rid of trailing empty lines.
print /^FIRST ITERATION\n[^\n]*\n[^\n]*\n(.*)/ms;
}
Using the flip-flop operator:
while (<>) {
    if (my $line_num = /^FIRST ITERATION$/ .. /^$/) {
        print if $line_num > 3 && $line_num !~ /E0/;
    }
}
$line_num !~ /E0/ is true when the flip-flop is flopping (i.e. for the first empty line after FIRST ITERATION). This is checked to avoid printing the blank line.

Perl script to print out all the lines containing a keyword and the line below it

I need to write a perl script to search for a keyword in a large file and then print all the lines containing the keyword plus the line below each keyword to a new file.
In the original file, there are multiple lines (the exact number varies) below each keyword-containing line. I already have a script that makes the variable number of lines to equal 1. I need this functionality to remain in the script and build upon it.
I found out that I could use grep to extract the lines, but this requires running the script I already have first and then using the grep command. I'd really like to have these functions combined into one script.
Any help is much appreciated!
Here is the script I have so far:
use strict;

open (FILE, $ARGV[0]) or die ("Cannot open file");

my $name;
my $sequence;

while (my $line = <FILE>) {
    chomp ($line);
    if (substr ($line, 0, 1) eq ">") {
        if ($sequence ne "") {
            printf ("%s\n%s\n", $name, $sequence);
        }
        $name = $line;
        $sequence = "";
    } else {
        $sequence .= $line;
    }
}
if ($sequence ne "") {
    printf ("%s\n%s\n", $name, $sequence);
}
And an example of the original file:
>sp|Q6GZX4|001R_FRG3G Putative transcription factor 001R OS=Frog virus 3 (isolate Goorha) GN=FV3-001R PE=4 SV=1
MAFSAEDVLKEYDRRRRMEALLLSLYYPNDRKLLDYKEWSPPRVQVECPKAPVEWNNPPSEKGLIVGHFSGIKYKGEKAQASEVDVNKMCCWVSKFKDAMRRYQGIQTCKIPGKVLSDLDAKIKAYNLTVEGVEGFVRYSRVTKQHVAAFLKELRHSKQYENVNLIHYILTDKRVDIQHLEKDLVKDFKALVESAHRMRQGHMINVKYILYQLLKKHGHGPDGPDILTVKTGSKGVLYDDSFRKIYTDLGW
In this example, the keyword would be "FRG3G". The keyword is always in the same place, the characters before it vary, but the structure is the same.
If you have only 1 line to print after the keyword line, you can just remember if you found the keyword and then print the line like this:
my $matched = 0;
while (<FILE>) {
    print if ($matched);
    if (m/$keyword/) {
        print;
        $matched = 1;
    }
    else {
        $matched = 0;
    }
}
If you can detect the end of the lines you want to print somehow, you can adjust the code above instead of just hard-coding it to print 1 line.
Redirect to a new file as needed.
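For data like the example above, where every record starts with ">", one way to adjust it (a sketch, reusing the FILE handle and assuming a $keyword variable as in the snippet above) is to decide once per header line and keep printing until the next header:
my $matched = 0;
while (my $line = <FILE>) {
    if ($line =~ /^>/) {                     # a new record starts here
        $matched = $line =~ /\Q$keyword\E/;  # decide once for the whole record
    }
    print $line if $matched;                 # header line plus everything below it
}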

Skip the problematic DAT, proceed with the next DATs, and output the list of errored or missed DATs to a separate text file in Perl

I have another question here. I have several DATs and want to merge them, but the script first checks the headers of all the DATs and, if one does not match, it throws an error and stops. Now I want to run the script so that it skips the problematic DAT and outputs the error to a separate text file, listing the errored DATs and the reason. Could anyone please help with this? Here is what I have so far:
use strict;

my $rootdir = $ARGV[0];
die "usage: perl mergetxtfiles.pl <folder>" if ($#ARGV != 0);
#$rootdir =~ s/\\/\\\\/g;
print "\nFolder = $rootdir\n\n";

opendir(DIR, $rootdir)
    or die "failed opening the directory $rootdir";
open(OF, ">:utf8", 'combined_'.time.'.dat')
    or die "failed opening the file";

my $icr = 0;
my $cnt = 0;
my $header = '';
my $header_flag = 0;

while (my $fname = readdir(DIR)) {
    # add extensions if needed
    if ($fname =~ m/(\.txt)|(\.dat)|(\.csv)$/i) {
        $icr++;
        my $fnamepath = $rootdir.'\\'.$fname;
        print "\($icr\) $fname\n";
        open(IF, "<:utf8", $fnamepath)
            or die "ERROR: cannot open the file\n$fnamepath ";
        my $sep_icr = 0;
        while (<IF>) {
            my $line = $_;
            chomp $line;
            next if (/^$/);
            $sep_icr++;
            $cnt++;
            my @ar = split(/\t/, $line);
            if ($cnt == 1) {
                $header_flag = 1;
                $header = $line;
            }
            if ($sep_icr == 1 and $header_flag == 1) {
                #print "$line \n $header\n";
                if ($line ne $header) {
                    die "Headers are not same\n";
                }
                elsif (($line eq $header) and ($cnt > 1)) {
                    print "INFO\: ignoring the same header for $fname \n";
                    $cnt--;
                    next;
                }
            }
            print OF $line."\n";
        }
        print "\--Line count= $sep_icr\n\n";
        close IF;
        #print OF "\n";
    }
}
print "\-\-\> Total line count= $cnt\n";
Named Loops
In your loop, we have to change your if-clause and the outer loop a bit:
FILE:
while (my $fname = readdir(DIR)) {
    ...;
    if ($line ne $header) {
        logger($fname, "Headers not matching");
        next FILE;
    }
    ...;
}
In Perl, loops can be labeled, so we can specify which loop we want to continue, instead of setting and checking flags. I used an example logging function logger, given below, but you can substitute it with an appropriate print statement.
Logging
This is probably a bit more than asked, but here is a little logging function for flexibility. Arguments are a filename, a reason, and an optional severity. You can remove the severity code if it isn't needed. The severity is optional anyway and defaults to debug.
open my $logfile, ">>", "FILENAME" or die "..."; # open for append

sub logger {
    my ($file, $reason, $severity) = (@_, 'debug');
    $severity = {
        debug => '',
        info  => 'INFO',
        warn  => '!WARN!',
        fatal => '!!!ERROR!!!',
    }->{$severity} // $severity;              # transform the severity if it is a name we know
    $severity .= ' ' if length $severity;     # append a space if we have a severity
    print {$logfile} $severity . qq{$reason while processing "$file"\n};
}
If called with logger("./foo/bar", "Headers not matching", 'warn') it will output:
!WARN! Headers not matching while processing "./foo/bar"
Change the printed error message to something more machine-readable if needed.
Style tips and tricks:
I find these lines more elegant:
die "usage: ...\n" unless @ARGV;
my ($rootdir) = @ARGV;
Note the newline at the end (it suppresses the "at line 3" suffix in the error message). In scalar context, an array returns its length. In the second line we avoid array subscripting by assigning in list context; surplus elements are ignored.
Instead of
if ($fname =~ m/(\.txt)|(\.dat)|(\.csv)$/i) { ...; }
we can say
next unless $fname =~ m/(?: \.txt | \.dat | \.csv )$/xi;
and avoid unnecessary indentation, thereby improving readability.
I modified the regex so that all suffixes must come at the end, not only the .csv suffix, and added the /x modifier so that I can use non-semantic whitespace inside the regex.
Windows, like pretty much any other OS, understands forward slashes in path names. So instead of
my $fnamepath = $rootdir.'\\'.$fname;
we can write
my $fnamepath = "$rootdir/$fname";
I find that easier to write and understand.
The
while(<IF>) {
my $line = $_;
construct can be simplified to
while(my $line = <IF>) {...}
Last but not least, consider getting into the habit of using lexical filehandles declared with my. Global bareword filehandles are often not needed and can cause subtle bugs.
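As a sketch, the inner file loop from the script above could use a lexical handle like this (same behaviour, plus $! in the error message):
open my $if, "<:utf8", $fnamepath
    or die "ERROR: cannot open the file $fnamepath: $!";
while (my $line = <$if>) {
    chomp $line;
    # ... process $line as before ...
}
close $if;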