Perl Script to Count Words/Lines - perl

I'm learning Perl for the first time and am trying to replicate exactly the simple Perl script on page four of this document:
This is my code:
# example.pl, introductory example
# comments begin with the sharp sign
# open the file whose name is given in the first argument on the command
# line, assigning to a file handle INFILE (it is customary to choose
# all-caps names for file handles in Perl); file handles do not have any
# prefixing punctuation
open(INFILE,$ARGV[0]);
# names of scalar variables must begin with $
$line_count - 0;
$word_count - 0;
# <> construct means read one line; undefined response signals EOF
while ($line - <INFILE>) {
$line_count++;
# break $line into an array of tokens separated by " ", using split()
# (array names must begin with @)
@words_on_this_line - split(" ",$line);
# scalar() gives the length of an array
$word_count += scalar(@words_on_this_line);
}
print "the file contains ", $line_count, "lines and ", $word_count, " words\n";
and this is my text file:
This is a test file for the example code.
The code is written in Perl.
It counts the amount of lines
and the amount of words.
This is the end of the text file that will
be run
on the example
code.
I'm not getting the right output and I'm not sure why. My output is:
C:\Users\KP\Desktop\test>perl example.pl test.txt
the file contains lines and words

For some reason all your "=" operators appear to be "-"
$line_count - 0;
$word_count - 0;
...
while ($line - <INFILE>) {
...
@words_on_this_line - split(" ",$line);
I'd recommend using "my" to declare your variables and then "use strict" and "use warnings" to help you detect such typos:
Currently:
$i -1;
/tmp/test.pl -- no output
When you add strict and warnings:
use strict;
use warnings;
$i -1;
/tmp/test.pl Global symbol "$i" requires explicit package name at
/tmp/test.pl line 4. Execution of /tmp/test.pl aborted due to
compilation errors.
When you add "my" to declare it:
vim /tmp/test.pl
use strict;
use warnings;
my $i -1;
/tmp/test.pl Useless use of subtraction (-) in void context at
/tmp/test.pl line 4. Use of uninitialized value in subtraction (-) at
/tmp/test.pl line 4.
And finally, with "=" instead of the "-" typo, this is what the correct declaration and initialization looks like:
use strict;
use warnings;
my $i = 1;

You have to replace - with = in several statements in your code. I've also made some changes to bring the code closer to modern Perl (use strict is a must):
use strict;
use warnings;
open my $INFILE, '<', $ARGV[0] or die $!;
# names of scalar variables must begin with $
my $line_count = 0;
my $word_count = 0;
# <> construct means read one line; undefined response signals EOF
while( my $line = <$INFILE> ) {
$line_count++;
# break $line into an array of tokens separated by " ", using split()
# (array names must begin with @)
my @words_on_this_line = split / /,$line;
# scalar() gives the length of an array
$word_count += scalar(@words_on_this_line);
}
print "the file contains ", $line_count, " lines and ", $word_count, " words\n";
close $INFILE;

replace while ($line - <INFILE>) {
with
while ($line = <INFILE>) {

The word count part could be made a bit simpler (and more efficient). split returns the number of fields when called in scalar context.
replace
my @words_on_this_line = split / /,$line;
$word_count += scalar(@words_on_this_line);
with
$word_count += split / /,$line;
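Putting the pieces from the answers together, a minimal sketch of the whole counter might look like the following. The sub name count_lines_and_words and the in-memory sample text are only illustrative; the sketch also uses split ' ' (a single space in quotes), which splits on runs of whitespace, and assumes Perl 5.12 or later, where split in scalar context simply returns the field count.

```perl
use strict;
use warnings;

# Count lines and whitespace-separated words read from a filehandle.
sub count_lines_and_words {
    my ($fh) = @_;
    my ($line_count, $word_count) = (0, 0);
    while (my $line = <$fh>) {
        $line_count++;
        # In scalar context split returns the number of fields;
        # split ' ' skips leading whitespace and splits on runs of whitespace.
        $word_count += split ' ', $line;
    }
    return ($line_count, $word_count);
}

# Demo on an in-memory file (hypothetical sample text).
my $text = "This is a test\nof the   word counter\n";
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
my ($lines, $words) = count_lines_and_words($fh);
close $fh;
print "the file contains $lines lines and $words words\n";
```

To use it on a real file, open the file named in $ARGV[0] instead of the in-memory string and pass that handle to the sub.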

Related

Split file Perl

I want to split parts of a file. Here is what the start of the file looks like (it continues in same way):
Location Strand Length PID Gene
1..822 + 273 292571599 CDS001
906..1298 + 130 292571600 trxA
I want to split the Location column, subtract the start from the end (822-1) for every row, and add the results together. For these two rows the value would be (822-1)+(1298-906) = 1213.
How?
My code right now (I don't get any output at all in the terminal; it just continues to process forever):
use warnings;
use strict;
my $infile = $ARGV[0]; # Reading infile argument
open my $IN, '<', $infile or die "Could not open $infile: $!, $?";
my $line2 = <$IN>;
my $coding = 0; # Initialize coding variable
while(my $line = $line2){ # reading the file line by line
# TODO Use split and do the calculations
my @row = split(/\.\./, $line);
my @row2 = split(/\D/, $row[1]);
$coding += $row2[0]- $row[0];
}
print "total amount of protein coding DNA: $coding\n";
So what I get from my code if I put:
print "$coding \n";
at the end of the while loop just to test is:
821
1642
And so the first number is correct (822-1) but the next number doesn't make any sense to me, it should be (1298-906). What I want in the end outside the loop:
print "total amount of protein coding DNA: $coding\n";
is the sum of all the subtractions of every line i.e. 1213. But I don't get anything, just a terminal that works on forever.
As a one-liner:
perl -nE '$c += $2 - $1 if /^(\d+)\.\.(\d+)/; END { say $c }' input.txt
(Extracting the important part of that and putting it into your actual script should be easy to figure out).
Explicitly opening the file makes your code more complicated than it needs to be. Perl will automatically open any files passed on the command line and allow you to read from them using the empty file input operator, <>. So your code becomes as simple as this:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
my $total;
while (<>) {
my ($min, $max) = /(\d+)\.\.(\d+)/;
next unless $min and $max;
$total += $max - $min;
}
say $total;
If this code is in a file called adder and your input data is in add.dat, then you run it like this:
$ adder add.dat
1213
Update: And, to explain where you were going wrong...
You only ever read a single line from your file:
my $line2 = <$IN>;
And then you continually assign that same value to another variable:
while(my $line = $line2){ # reading the file line by line
The comment in this line is wrong. I'm not sure where you got that line from.
To fix your code, just remove the my $line2 = <$IN> line and replace your loop with:
while (my $line = <$IN>) {
# your code here
}
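For reference, the range arithmetic from the one-liner can be wrapped in a small sub and checked against the two sample rows. This is only a sketch: the name total_coding_length is made up here, and it assumes every data line starts with a start..end range.

```perl
use strict;
use warnings;

# Sum (end - start) over every line that begins with a "start..end" range.
sub total_coding_length {
    my (@lines) = @_;
    my $total = 0;
    for my $line (@lines) {
        # Lines without a leading range (such as the header) are skipped.
        next unless $line =~ /^(\d+)\.\.(\d+)/;
        $total += $2 - $1;
    }
    return $total;
}

my @sample = (
    "Location Strand Length PID Gene\n",
    "1..822 + 273 292571599 CDS001\n",
    "906..1298 + 130 292571600 trxA\n",
);
print "total amount of protein coding DNA: ", total_coding_length(@sample), "\n";
```

With the sample rows this gives 821 + 392 = 1213, the value the question asks for.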

find a match and replace next line in perl

I am working on a Perl script and need some help with it. The requirement is that I have to find a label and, once the label is found, replace a word in the line immediately following the label. For example, if the label is ABC:
ABC:
string to be replaced
some other lines
ABC:
string to be replaced
some other lines
ABC:
string to be replaced
I want to write a script to match the label (ABC) and once the label is found, replace a word in the next line immediately following the label.
Here is my attempt:
open(my $fh, "<", "file1.txt") or die "cannot open file:$!";
while (my $line = <$fh>))
{
next if ($line =~ /ABC/) {
$line =~ s/original_string/replaced_string/;
}
else {
$msg = "pattern not found \n ";
print "$msg";
}
}
Is this correct..? Any help will be greatly appreciated.
The following one-liner will do what you need:
perl -pe '++$x and next if /ABC:/; $x-- and s/old/new/ if $x' inFile > outFile
The code sets a flag and gets the next line if the label is found. If the flag is set, it's unset and the substitution is executed.
Hope this helps!
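The flag logic of that one-liner can also be written out as an ordinary sub, which may be easier to drop into a larger script. This is a sketch rather than a translation: the name replace_after_label is hypothetical, the label must fill the whole line here, and the old text is matched literally.

```perl
use strict;
use warnings;

# Replace $old with $new only on the line that directly follows a label line.
sub replace_after_label {
    my ($label, $old, $new, @lines) = @_;
    my $after_label = 0;    # true when the previous line was the label
    my @out;
    for my $line (@lines) {
        if ($line =~ /^\Q$label\E\s*$/) {
            $after_label = 1;             # remember the label for the next line
        }
        elsif ($after_label) {
            $line =~ s/\Q$old\E/$new/;    # literal match of the old text
            $after_label = 0;
        }
        push @out, $line;
    }
    return @out;
}

my @lines = ("ABC:\n", "string to be replaced\n", "some other lines to be kept\n");
print replace_after_label('ABC:', 'to be', 'that was', @lines);
```

Only the line right after ABC: is changed; the "to be" in the third line is left alone.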
You're doing this in your loop:
next if ($line =~ /ABC/);
So, you're reading the file, and if a line contains ABC anywhere in that line, you skip the line. However, for every other line, you do the replacement. In the end, you're replacing the string on all other lines and printing that out, and you're not printing out your labels.
Here's what you said:
I have to read the file until I find a line with the label:
Once the label is found
I have to read the next line and replace the word in a line immediately following the label.
So:
You want to read through a file line-by-line.
If a line matches the label
read the next line
replace the text on the line
Print out the line
Following these directions:
use strict;
use warnings; # Hope you're using strict and warnings
use autodie; # Program automatically dies on failed opens. No need to check
use feature qw(say); # Allows you to use say instead of print
open my $fh, "<", "file1.txt"; # Removed parentheses. It's the latest style
while (my $line = <$fh>) {
chomp $line; # Always do a chomp after a read.
if ( $line eq "ABC:" ) { # Use 'eq' to ensure an exact match for your label
say "$line"; # Print out the current line
$line = <$fh>; # Read the next line
chomp $line; # Remove its newline too, since say adds one
$line =~ s/old/new/; # Replace that word
}
say "$line"; # Print the line
}
close $fh; # Might as well do it right
Note that when I use say, I don't have to put the \n on the end of the line. Also, by doing my chomp after my read, I can easily match the label without worrying about the \n on the end.
This is done exactly as you said it should be done, but there are a couple of issues. The first is that when we do $line = <$fh>, there's no guarantee we are really reading a line. What if the file ends right there?
Also, it's bad practice to read a file in multiple places. It makes it harder to maintain the program. To get around this issue, we'll use a flag variable. This allows us to know if the line before was a tag or not:
use strict;
use warnings; # Hope you're using strict and warnings
use autodie; # Program automatically dies on failed opens. No need to check
use feature qw(say); # Allows you to use say instead of print
open my $fh, "<", "file1.txt"; # Removed parentheses. It's the latest style
my $tag_found = 0; # Flag isn't set
while (my $line = <$fh>) {
chomp $line; # Always do a chomp after a read.
if ( $line eq "ABC:" ) { # Use 'eq' to ensure an exact match for your label
$tag_found = 1; # We found the tag!
}
elsif ( $tag_found ) { # The previous line was the tag
$line =~ s/old/new/; # Replace that word
$tag_found = 0; # Reset our flag variable
}
say "$line"; # Print the line
}
close $fh; # Might as well do it right
Of course, I would prefer to eliminate mysterious values. For example, the tag should be a variable or constant. Same with the string you're searching for and the string you're replacing.
You mentioned this was a word, so your regular expression replacement should probably look like this:
$line =~ s/\b$old_word\b/$new_word/;
The \b marks a word boundary. This way, if you're supposed to replace the word cat with dog, you don't get tripped up on a line that says:
The Jeopardy category is "Say what".
You don't want to change category to dogegory.
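A quick way to convince yourself that the word boundaries work is a small sketch like this one. The sub name replace_word is made up for illustration, and \Q...\E is added as an extra precaution so that any regex metacharacters in the search word are matched literally.

```perl
use strict;
use warnings;

# Replace the first whole-word occurrence of $old_word with $new_word.
sub replace_word {
    my ($line, $old_word, $new_word) = @_;
    # \b anchors the match at word boundaries; \Q...\E quotes metacharacters.
    $line =~ s/\b\Q$old_word\E\b/$new_word/;
    return $line;
}

print replace_word('The Jeopardy category is "Say what".', 'cat', 'dog'), "\n";
print replace_word('The cat sat on the mat.', 'cat', 'dog'), "\n";
```

The first line is printed unchanged, because cat inside category is not a whole word; the second becomes "The dog sat on the mat."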
Your problem is that reading in a file does not work like that. You're doing it line by line, so when your regex tests true, the line you want to change isn't there yet. You can try adding a boolean variable to check if the last line was a label.
#!/usr/bin/perl
use strict;
use warnings;
my $found;
my $replacement = "Hello";
while(my $line = <>){
if($line =~ /ABC/){
$found = 1;
next;
}
if($found){
$line =~ s/^.*?$/$replacement/;
$found = 0;
print $line, "\n";
}
}
Or you could use File::Slurp and read the whole file into one string:
use File::Slurp;
$x = read_file( "file.txt" );
$x =~ s/^(ABC:\s*$ [\n\r]{1,2}^.*?)to\sbe/$1to was/mgx;
print $x;
The /m modifier makes ^ and $ match at the start and end of embedded lines.
The /x modifier allows the space after the $; there is probably a better way.
Yields:
ABC:
string to was replaced
some other lines
ABC:
string to was replaced
some other lines
ABC:
string to was replaced
Also, relying on perl's in-place editing:
use File::Slurp qw(read_file write_file);
use strict;
use warnings;
my $file = 'fakefile1.txt';
# Initialize Fake data
write_file($file, <DATA>);
# Enclosed is the actual code that you're looking for.
# Everything else is just for testing:
{
local @ARGV = $file;
local $^I = '.bac';
while (<>) {
print;
if (/ABC/ && !eof) {
$_ = <>;
s/.*/replaced string/;
print;
}
}
unlink "$file$^I";
}
# Compare new file.
print read_file($file);
1;
__DATA__
ABC:
string to be replaced
some other lines
ABC:
string to be replaced
some other lines
ABC:
string to be replaced
ABC:
outputs
ABC:
replaced string
some other lines
ABC:
replaced string
some other lines
ABC:
replaced string
ABC:

perl: Use of uninitialized value and output is truncated

I am trying to use the following script to shuffle the order of sequences (lines) within a file. I'm not sure how to "initialize" values -- please help!
print "Please enter filename (without extension): ";
my $input = <>;
chomp $input;
use strict;
use warnings;
print "Please enter total no. of sequence in fasta file: ";
my $orig_size = <>*2-1;
chomp $orig_size;
open INFILE, "$input.fasta"
or die "Error opening input file for shuffling!";
open SHUFFLED, ">"."$input"."_shuffled.fasta"
or die "Error creating shuffled output file!";
my @array = (0); # Need to initialise 1st element in array1&2 for the shift function
my @array2 = (0);
my $i = 1;
my $index = 0;
my $index2 = 0;
while (my @line = <INFILE>){
while ($i <= $orig_size) {
$array[$i] = $line[$index];
$array[$i] =~ s/(.)\s/$1/seg;
$index++;
$array2[$i] = $line[$index];
$array2[$i] =~ s/(.)\s/$1/seg;
$i++;
$index++;
}
}
my $array = shift (@array);
my $array2 = shift (@array2);
for ($i = my $header_size; $i >= 0; $i--) {
my $j = int rand ($i+1);
next if $i == $j;
@array[$i,$j] = @array[$j,$i];
@array2[$i,$j] = @array2[$j,$i];
}
}
while ($index2 <= my $header_size) {
print SHUFFLED "$array[$index2]\n";
print SHUFFLED "$array2[$index2]\n";
$index2++;
}
close INFILE;
close SHUFFLED;
I'm getting these warnings:
Use of uninitialized value in substitution (s///) at fasta_corrector6.pl line 27, <INFILE> line 578914.
Use of uninitialized value in substitution (s///) at fasta_corrector6.pl line 31, <INFILE> line 578914.
Use of uninitialized value in numeric ge (>=) at fasta_corrector6.pl line 40, <INFILE> line 578914.
Use of uninitialized value in addition (+) at fasta_corrector6.pl line 41, <INFILE> line 578914.
Use of uninitialized value in numeric eq (==) at fasta_corrector6.pl line 42, <INFILE> line 578914.
Use of uninitialized value in numeric le (<=) at fasta_corrector6.pl line 47, <INFILE> line 578914.
Use of uninitialized value in numeric le (<=) at fasta_corrector6.pl line 50, <INFILE> line 578914.
First, you read the whole input file in:
use IO::File;
my @lines = IO::File->new($file_name)->getlines;
then you shuffle it:
use List::Util 'shuffle';
my @shuffled_lines = shuffle(@lines);
then you write them out:
IO::File->new($new_file_name, "w")->print(@shuffled_lines);
There's an entry in the Perl FAQ about how to shuffle an array. Another entry tells of the many ways to read a file in one go. Perl FAQs contain a lot of samples and trivia on how to do many common things -- it's a good place to continue learning more about Perl.
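One caveat: the question fills two arrays from alternating lines, which suggests each record is a header line followed by one sequence line, and shuffling raw lines would separate the two. A sketch under that assumption (exactly two lines per record; the name shuffle_records and the sample data are made up) groups the lines into pairs first and shuffles the pairs:

```perl
use strict;
use warnings;
use List::Util qw(shuffle);

# Group lines into [header, sequence] pairs and return the pairs in random order.
# Assumes every record is exactly two lines: a header, then its sequence.
sub shuffle_records {
    my (@lines) = @_;
    chomp @lines;
    my @records;
    while (@lines >= 2) {
        push @records, [ splice(@lines, 0, 2) ];    # take the next two lines
    }
    return shuffle @records;
}

my @fasta = (">seq1\n", "ACGT\n", ">seq2\n", "GGCC\n", ">seq3\n", "TTAA\n");
for my $record (shuffle_records(@fasta)) {
    print "$record->[0]\n$record->[1]\n";
}
```

The record order changes from run to run, but each header stays attached to its own sequence, and no manual index bookkeeping is needed.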
On your previous question I gave this answer, and noted that your code failed because you had not initialized a variable named $header_size used in a loop condition. Not only have you repeated that mistake, you have elaborated on it by starting to declare the variable with my each time you try to access it.
for ($i = my $header_size; $i >= 0; $i--) {
# ^^--- wrong!
while ($index2 <= my $header_size) {
# ^^--- wrong!
A variable that is declared with my is empty (undef) by default. $header_size can never contain anything but undef here, and your loop will run only once, because 0 <= undef evaluates to true (albeit with an uninitialized warning).
Please take my advice and set a value for $header_size. And only use my when declaring a variable, not every time you use it.
A better solution
Seeing your errors above, it seems that your input files are rather large. If you have over 500,000 lines in your files, it means your script will consume large amounts of memory to run. It may be worthwhile for you to use a module such as Tie::File and work only with array indexes. For example:
use strict;
use warnings;
use Tie::File;
use List::Util qw(shuffle);
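my $filename = $ARGV[0]; # file to shuffle, taken from the command line
tie my @file, 'Tie::File', $filename or die $!;
for my $lineno (shuffle 0 .. $#file) {
print "$file[$lineno]\n"; # Tie::File removes line endings by default
}
untie @file; # all done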
I cannot pinpoint what exactly went wrong, but there are a few oddities with your code:
The Diamond Operator
Perl's diamond operator <FILEHANDLE> reads a line from the filehandle. If no filehandle is provided, each command-line argument (@ARGV) is treated as a file and read. If there are no arguments, STDIN is used. It is better to specify this yourself. You should also chomp before you do arithmetic with the line, not afterwards. Note that strings that do not start with a number are treated as numeric 0. You should check that the input is numeric (with a regex?) and include error handling.
The Diamond/Readline operator is context sensitive. If given in scalar context (e.g, a conditional, a scalar assignment) it returns one line. If given in list context, e.g. as a function parameter or an array assignment, it returns all lines as an array. So
while (my @line = <INFILE>) { ...
will not give you one line but all lines and is thus equivalent to
my @line;
if (@line = <INFILE>) { ...
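The difference between the two contexts is easy to see with a small sketch that reads the same in-memory data twice (the sample data and variable names are only illustrative):

```perl
use strict;
use warnings;

my $data = "first\nsecond\nthird\n";

# Scalar context: each read returns one line.
open my $fh, '<', \$data or die "Cannot open in-memory file: $!";
my $one_line = <$fh>;
close $fh;

# List context: a single read returns every remaining line.
open $fh, '<', \$data or die "Cannot open in-memory file: $!";
my @all_lines = <$fh>;
close $fh;

print "scalar context read: $one_line";
print "list context read ", scalar(@all_lines), " lines\n";
```

The scalar read yields only "first", while the list read returns all three lines at once.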
Array gymnastics
After you read in the lines, you try to do some manual chomping. Here I remove all trailing whitespace in @line, in a single line:
s/\s+$// foreach @line;
And here, I remove all non-leading whitespace (which is in fact what your regex is doing):
s/(?<!^)\s//g foreach @line;
To put elements alternately into two arrays, this might work as well:
for my $i (0 .. $#line) {
if ($i % 2) {
push @array1, shift @line;
} else {
push @array2, shift @line;
}
}
or
my $i = 0;
while (@line) {
push @{ $i++ % 2 ? \@array1 : \@array2 }, shift @line;
}
Manual bookkeeping of array indices is messy and error-prone.
Your for loop could be written more idiomatically as
for my $i (reverse 0 .. $header_size)
Do note that declaring $header_size inside the loop initialisation is possible if it was not declared before, but it will yield undef. You therefore assigned undef to $i, which leads to some of the warnings, as undef should not be used in arithmetic operations. An assignment always assigns the right side to the left side.

Perl readline problem

I'd like to read a file, e.g. test.test, which contains
#test:testdescription\n
#cmd:binary\n
#return:0\n
#stdin:|\n
echo"toto"\n
echo"tata"\n
#stdout:|\n
toto\n
tata\n
#stderr:\n
I succeeded in extracting the values that come after #test:, #cmd:, etc.
But for stdin or stdout, I want to take all the lines before the next # into the arrays @stdin and @stdout.
I loop with while ($line = <TEST>), so it looks at each line. If I see the pattern /^#stdin:|/, I want to move to the next line and add its value to an array until I see the next #.
How do I move to the next line in the while loop?
This file format can be easily handled with some creativity in selecting the appropriate value for $/:
use strict; use warnings;
my %parsed;
{
local $/ = '#';
while ( my $line = <DATA> ) {
chomp $line;
my $content = (split /:/, $line, 2)[1];
next unless defined $content;
$content =~ s/\n+\z//;
if ( my ($chan) = $line =~ /^(std(?:err|in|out))/ ) {
$content =~ s/^\|\n//;
$parsed{$chan} = [ split /\n/, $content];
}
elsif ( my ($var) = $line =~ /^(cmd|return|test)/ ) {
$parsed{ $var } = $content;
}
}
}
use YAML;
print Dump \%parsed;
__DATA__
#test:testdescription
#cmd:binary
#return:0
#stdin:|
echo"toto"
echo"tata"
#stdout:|
toto
tata
#stderr:
Output:
---
cmd: binary
return: 0
stderr: []
stdin:
- echo"toto"
- echo"tata"
stdout:
- toto
- tata
test: testdescription
UPDATED as per user's comments
If I understand the question correctly, you want to read one more line within a loop?
If so, you can either:
just do another line read inside the loop.
my $another_line = <TEST>;
Keep some state flag and use it next iteration of the loop, and accumulate lines between stdins in a buffer:
my $last_line_was_stdin = 0;
my @line_buffer = ();
while (my $line = <TEST>) {
if ($line =~ /^#stdin:\|/) {
#
# Some code to process all lines accumulated since last "stdin"
#
@line_buffer = ();
$last_line_was_stdin = 1;
next;
}
push @line_buffer, $line;
}
This solution may not do 100% of what you need, but it defines the pattern you need to follow in your state machine implementation: read a line. Check your current state (if it matters). Based on the current state and a pattern in the line, decide what to do about the current line (add to the buffer? change the state? If changing the state, process the buffer based on the last state?)
Also, as per your comment, you have a bug in your regex - the pipe (| character) means "OR" in regex, so you are saying "if line starts with #stdin OR matches an empty regex" - the latter part is always true so your regex will match 100% of time. You need to escape the "|" via /^#stdin:\|/ or /^#stdin:[|]/
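To make the state machine idea concrete, here is a minimal sketch that collects the stdin and stdout blocks from the sample file, using the escaped pipe. The sub name parse_blocks and the hash layout are hypothetical choices for illustration, not something from the question.

```perl
use strict;
use warnings;

# Collect the lines that follow "#stdin:|" and "#stdout:|" until the next "#" line.
sub parse_blocks {
    my (@lines) = @_;
    my %blocks = (stdin => [], stdout => []);
    my $current;    # name of the block being filled, or undef
    for my $line (@lines) {
        chomp $line;
        if ($line =~ /^#(stdin|stdout):\|/) {    # the pipe is escaped
            $current = $1;
        }
        elsif ($line =~ /^#/) {
            $current = undef;                    # any other tag ends the block
        }
        elsif (defined $current) {
            push @{ $blocks{$current} }, $line;
        }
    }
    return \%blocks;
}

my @test_file = (
    "#test:testdescription\n", "#cmd:binary\n", "#return:0\n",
    "#stdin:|\n", "echo\"toto\"\n", "echo\"tata\"\n",
    "#stdout:|\n", "toto\n", "tata\n", "#stderr:\n",
);
my $blocks = parse_blocks(@test_file);
print "stdin:  @{ $blocks->{stdin} }\n";
print "stdout: @{ $blocks->{stdout} }\n";
```

The only state kept between lines is $current, so the loop never has to read ahead inside its own body.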

Perl - the fastest way to read a range of lines from a file into a variable

Given a start and end line number, what's the fastest way to read a range of lines from a file into a variable?
Use the range operator .. (also known as the flip-flop operator), which offers the following syntactic sugar:
If either operand of scalar .. is a constant expression, that operand is considered true if it is equal (==) to the current input line number (the $. variable).
If you plan to do this for multiple files via <>, be sure to close the implicit ARGV filehandle as described in the perlfunc documentation for the eof operator. (This resets the line count in $..)
The program below collects in the variable $lines lines 3 through 5 of all files named on the command line and prints them at the end.
#! /usr/bin/perl
use warnings;
use strict;
my $lines;
while (<>) {
$lines .= $_ if 3 .. 5;
}
continue {
close ARGV if eof;
}
print $lines;
Sample run:
$ ./prog.pl prog.pl prog.c main.hs
use warnings;
use strict;
int main(void)
{
import Data.Function (on)
import Data.List (sortBy)
--import Data.Ord (comparing)
You can use the flip-flop operator:
while(<>) {
if (($. == 3) .. ($. == 7)) {
push @result, $_;
}
}
The following will load all desired lines of a file into an array variable. It will stop reading the input file as soon as the end line number is reached:
use strict;
use warnings;
my $start = 3;
my $end = 6;
my @lines;
while (<>) {
last if $. > $end;
push @lines, $_ if $. >= $start;
}
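The same early-exit loop can be packaged as a reusable sub that works on any open filehandle. This is a sketch; the name read_line_range and the generated sample text are assumptions made for the example.

```perl
use strict;
use warnings;

# Return lines $start..$end (1-based, inclusive) from an open filehandle,
# and stop reading once $end has been passed.
sub read_line_range {
    my ($fh, $start, $end) = @_;
    my @lines;
    while (my $line = <$fh>) {
        last if $. > $end;                     # $. is the current line number
        push @lines, $line if $. >= $start;
    }
    return @lines;
}

my $text = join '', map { "line $_\n" } 1 .. 10;
open my $fh, '<', \$text or die "Cannot open in-memory file: $!";
print read_line_range($fh, 3, 5);
close $fh;
```

Because of the last, at most one line beyond $end is read, however long the file is.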
Reading line by line isn't going to be optimal. Fortunately, someone has already done the hard work :)
Use Tie::File; it presents the file as an array.
http://perldoc.perl.org/Tie/File.html
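As a rough sketch of what that looks like for a line range: the temporary sample file below is created only so the example is self-contained, and the variable names are illustrative. No claim is made here about which approach is faster; that would need measuring on your own files.

```perl
use strict;
use warnings;
use Tie::File;
use File::Temp qw(tempfile);

# Write a small sample file so the sketch is self-contained.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
print {$tmp_fh} "line $_\n" for 1 .. 10;
close $tmp_fh or die "Cannot close $path: $!";

# Tie the file to an array; element 0 is line 1, and line endings are removed.
tie my @file, 'Tie::File', $path or die "Cannot tie $path: $!";
my ($start, $end) = (3, 5);
my @range = @file[ $start - 1 .. $end - 1 ];    # array slice of the wanted lines
untie @file;

print "$_\n" for @range;
```

Note that the array indices are zero-based while line numbers start at 1, hence the - 1 in the slice.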
# cat x.pl
#!/usr/bin/perl
my @lines;
my $start = 2;
my $end = 4;
my $i = 0;
for( $i=0; $i<$start; $i++ )
{
scalar(<STDIN>);
}
for( ; $i<=$end; $i++ )
{
push @lines, scalar(<STDIN>);
}
print @lines;
# cat xxx
1
2
3
4
5
# cat xxx | ./x.pl
3
4
5
#
Otherwise, you're reading a lot of extra lines at the end that you don't need. As it is, print @lines may be copying memory, so printing inside the second for loop while reading might be a better idea. But if you need to store the lines in a variable in Perl, you may not be able to get around it.
Update:
You could do it in one loop with "next if $. < $start", but you need to make sure to reset $. manually on eof() if you're iterating over several files with <>.