Parsing the large files in Perl - perl

I need to compare a big file (2 GB, 22 million lines) with another file. It takes too long to process with Tie::File; I have also done it with a 'while' loop, but the problem remains. See my code below:
use strict;
use Tie::File;
# use warnings;
my @arr;
# tie @arr, 'Tie::File', 'title_Nov19.txt';
# open(IT,"<title_Nov19.txt");
# my @arr=<IT>;
# close(IT);
while(my $data=<IN>){
my $occ=0;
while(my $line2=<IT>){
my $line=$line2;
print RE"$data\t$occ\n";
Please help me reduce the processing time.

Lots of things wrong with this.
Aside from the usual (lack of use strict, use warnings, use of 2-argument open(), not checking open() result, use of global filehandles), the specific problem in your case is that you are opening/reading/closing the second file once for every single line of the first. This is going to be very slow.
I suggest you open the file title_Nov19.txt once, read all the lines into an array or hash or something, then close it; and then you can open the first file, input.txt and walk along that once, comparing to things in the array so you don't have to reopen that second file all the time.
Further, I suggest you read some basic articles on style etc., as your question is likely to gain more attention if it's actually written to vaguely modern standards.

I tried to build a small example script with a better structure but I have to say, man, your problem description is really very unclear. It's important to not read the whole comparison file each time as @LeoNerd explained in his answer. Then I use a hash to keep track of the match count:
#!/usr/bin/env perl
use strict;
use warnings;
# cache all lines of the comparison file
open my $comp_file, '<', 'input.txt' or die "input.txt: $!\n";
chomp (my @comparison = <$comp_file>);
close $comp_file;
# prepare comparison
open my $input, '<', 'title_Nov19.txt' or die "title_Nov19.txt: $!\n";
my %count = ();
# compare each line
while (my $title = <$input>) {
    chomp $title;
    # iterate comparison strings
    foreach my $comp (@comparison) {
        $count{$comp}++ if $title =~ /\b$comp\b/i;
    }
}
# done
close $input;
# output (in the order of the comparison file)
open my $output, '>>', 'res.txt' or die "res.txt: $!\n";
foreach my $comp (@comparison) {
    my $matches = $count{$comp} // 0;  # strings that never matched count as 0
    print $output "$comp\t$matches\n";
}
close $output;
Just to get you started... If someone wants to further work on this: these were my test files:
This is the foo title
Wow, we have bar too
Nothing special here but foo
OMG, the last title! And Foo again!
And the result of the program was written to res.txt:
foo 3
bar 1

Here's another option using memowe's (thank you) data:
use strict;
use warnings;
use File::Slurp qw/read_file write_file/;
my %count;
my $regex = join '|', map { chomp; $_ = "\Q$_\E" } read_file 'input.txt';
for ( read_file 'title_Nov19.txt' ) {
    my %seen;
    !$seen{ lc $1 }++ and $count{ lc $1 }++ while /\b($regex)\b/ig;
}
write_file 'res.txt', map "$_\t$count{$_}\n",
    sort { $count{$b} <=> $count{$a} } keys %count;
Numerically-sorted output to res.txt:
foo 3
bar 1
An alternation regex which quotes meta characters (\Q$_\E) is built and used, so only one pass against the large file's lines is needed. The hash %seen is used to ensure that the input words are only counted once per line.
Hope this helps!

Try this:
grep -i -c -w -f input.txt title_Nov19.txt > res.txt


print specific INFILE area using perl

I have a file with the format below
I would like a Perl script that takes the option "-a" followed by the file name and outputs something like
Available locales:
To do that, I have the perl script below
$o = $ARGV[0];
$f = $ARGV[1];
open (INFILE, "<$f") or die "error";
my $line = <INFILE>;
my @fields = split(',', $line);
if($o eq "-a"){
    if(!$fields[2]){print "No locales available\n";}
    else{print "Available locales: \n";
        while($fields[2]){print "$fields[2]\n";}
    }
}
And I have three questions here.
1. my script will only print the first locale "en_Au" forever.
2. it should be able to test if a file is empty, but if a file is purely empty, it outputs nothing, but if I type in two empty lines in the file, it prints two lines of "No locales available" instead.
3. In fact, in the (!$fields[2]) part I should verify whether the file is empty or no available locales exist. If so, do I need to put a regular expression here to verify that the field is a locale as well?
Hope someone could help me figure these out! Many thanks!!!
The biggest missing thing is a loop over lines from the file, in which you then process one line at a time. Comments follow the code.
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
#my ($opt, $file) = @ARGV;  # better use a module
my ($opt, $file);
GetOptions( 'a' => \$opt, 'file=s' => \$file ) or usage();
usage() if not $file;  # mandatory argument
open my $fh, '<', $file or die "Can't open $file: $!";
while (my $line = <$fh>) {
    chomp $line;
    my @fields = split /,/, $line;
    next if not $fields[2];
    if ($opt) {
        say $fields[2];
    }
}
close $fh;
sub usage {
    say STDERR "Usage: $0 [-a] --file filename";
    exit 1;
}
This prints the desired output. (Is that simple condition on $fields[2] really all you need?)
Always have use warnings; and use strict; at the beginning
I do not recommend single-letter variable names. One forgets what they mean, it makes the code harder to follow, and it's way too easy to make silly mistakes
The @ARGV can be assigned to variables in a list. Much better, use Getopt::Long module, which checks invocation and allows for far easier interface changes. I set the -a option to act as a "flag," so it just sets a variable ($opt) if it's given. If that should have possible values instead, use 'a=s' => \$opt and check for a value.
Use lexical filehandles and the three-argument open, open my $fh, '<', $file ...
When die-ing print the error, die "... $!";, using $! variable
The "diamond" (angle) operator, <$fh>, reads one line from a file opened with $fh when used in scalar context, as in $line = <$fh>. It advances a pointer in the file as it reads a line so the next time it's used it returns the next line. If you use it in list context then it returns all lines, but when you process a file you normally want to go line by line.
Some of the described logic and requirements aren't clear to me, but hopefully the code above is going to be easier to adjust as needed.
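The difference between scalar and list context for the angle operator can be seen in a minimal sketch. The sample data below is invented for illustration (the original file format is not shown), and an in-memory filehandle stands in for a real file:

```perl
use strict;
use warnings;

# Hypothetical sample data; an in-memory filehandle stands in for a real file.
my $text = "id1,name1,en_AU\nid2,name2,fr_FR\nid3,name3,de_DE\n";

open my $fh, '<', \$text or die "Can't open in-memory file: $!";

# Scalar context: one line per read, and the file position advances each time.
my $first  = <$fh>;
my $second = <$fh>;

# List context: everything that is left (here, only the third line).
my @rest = <$fh>;
close $fh;

chomp($first, $second, @rest);
print "first:  $first\n";
print "second: $second\n";
print "rest:   @rest\n";
```

Reading in scalar context inside a while loop is what keeps memory use flat for large files, since only one line is held at a time.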

Opening, splitting and sorting into an Array in perl

I am a beginner programmer, who has been given a weeklong assignment to build a complex program, but is having a difficult time starting off. I have been given a set of data, and the goal is separate it into two separate arrays by the second column, based on whether the letter is M or F.
this is the code I have thus far:
open (FILE, "ssbn1898.txt");
if @array1[2]="M";
print @array2;
print @array3;
close (FILE);
How do I fix this? Please try to use the simplest terms possible; I started coding last week!
Thank You
First off - you split on comma, so I'm going to assume your data looks something like this:
There's a few problems with your code:
Turn on strict and warnings. They warn you about possible problems with your code.
open is better off written as open ( my $input, "<", $filename ) or die $!;
You only actually read one line from <FILE> - because if you assign it to a scalar $x it only reads one line.
you don't actually insert your value into either array.
So to do what you're basically trying to do:
use strict;
use warnings;
#define your arrays.
my @M_array;
my @F_array;
#open your file.
open (my $input, "<", 'ssbn1898.txt') or die $!;
#read file one line at a time - this sets the implicit variable $_ each loop,
#which is what we use for the split.
while ( <$input> ) {
    #remove linefeeds
    chomp;
    #capture values from either side of the comma.
    my ( $name, $id ) = split ( /,/ );
    #test if id is M. We _assume_ that if it's not, it must be F.
    if ( $id eq "M" ) {
        #insert it into our list.
        push ( @M_array, $name );
    }
    else {
        push ( @F_array, $name );
    }
}
close ( $input );
#print the results
print "M: @M_array\n";
print "F: @F_array\n";
You could probably do this more concisely - I'd suggest perhaps looking at hashes next, because then you can associate key-value pairs.
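As a rough sketch of the hash idea: one hash keyed by the M/F code can hold a list of names per key, so no separate array per category is needed. The records and the name %names_by_sex are made up for this example, and the real ssbn1898.txt layout may differ:

```perl
use strict;
use warnings;

# Hypothetical records in "name,sex" form; the real file layout may differ.
my @records = ("Mary,F", "John,M", "Anna,F", "William,M");

# One hash, keyed by the sex code, holding a list of names per key.
my %names_by_sex;
for my $record (@records) {
    my ($name, $sex) = split /,/, $record;
    push @{ $names_by_sex{$sex} }, $name;
}

# Print each key followed by the names collected under it.
for my $sex (sort keys %names_by_sex) {
    print "$sex: @{ $names_by_sex{$sex} }\n";
}
```

A new category in the data would simply become a new key, without any change to the code.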
There's a part function in List::MoreUtils that does exactly what you want.
use strict;
use warnings;
use 5.010;
use List::MoreUtils 'part';
my ($f, $m) = part { (split /,/)[1] eq 'M' } <DATA>;
say "M: @$m";
say "F: @$f";
The output is:
M: one,M,foo
F: two,F,bar
#!/usr/bin/perl -w
use strict;
use Data::Dumper;
my @boys=();
my @girls=();
my $fname="ssbn1898.txt"; # I keep stuff like this in a scalar
open (FIN,"< $fname")
    or die "$fname:$!";
while ( my $line=<FIN> ) {
    chomp $line;
    my @f=split(",",$line);
    push @boys,$f[0] if $f[1]=~ m/[mM]/;
    push @girls,$f[0] if $f[1]=~ m/[fF]/;
}
print Dumper(\@boys);
print Dumper(\@girls);
exit 0;
# Caveats:
# Code is not tested but should work and definitely shows the concepts
In fact the same thing...
use strict;
my (@m,@f);
while (<>) {
    push (@m,$1) if(/(.*),M/);
    push (@f,$1) if(/(.*),F/);
}
print "M=@m\nF=@f\n";
Or a "perl -n" (=for all lines do) variant:
#!/usr/bin/perl -n
push (@m,$1) if(/(.*),M/);
push (@f,$1) if(/(.*),F/);
END { print "M=@m\nF=@f\n";}

I want to replace a sequence name in fasta file with another name

I have one fasta file and one text file. The fasta file contains sequences in fasta format, and the text file contains gene names. I want to replace the name of each sequence in the fasta file (the part after the '>' sign) with the corresponding gene name from the text file.
I am new to Perl. I have written a script, but I don't know why it's not working. Can anyone help me with that, please?
following is my script:
print"Enter annotated file...";
print"Enter sequence file...";
open(FILE1,$f1) || die"Can't open $f1";
open(FILE2,$f2) || die"Can't open $f2";
print @seqfile[$j];
My files look like the following:
pool75_contig_389 ubiquitin ligase e3a
pool75_contig_704 tumor susceptibility
pool75_contig_1977 serine threonine-protein phosphatase 4 catalytic subunit
pool75_contig_3064 bardet-biedl syndrome 2 protein P
pool75_contig_2499 succinyl- ligase
Consider using Bio::SeqIO to parse your Fasta dataset, instead of doing it yourself. Bio::SeqIO lives for this task, and is well developed for it. Additionally, if you're in bioinformatics, it would serve you well to get to know Bio::SeqIO. Given this, consider the following:
use strict;
use warnings;
use Bio::SeqIO;
open my $fh, '<', 'annot.txt' or die $!;
my %annot = map { /(\S+)\s+(.+)/; $1 => $2 } <$fh>;
close $fh;
my $in = Bio::SeqIO->new( -file => 'goat300.fasta', -format => 'Fasta' );
while ( my $seq = $in->next_seq() ) {
    my $seqID = $annot{ $seq->id } // $seq->id;
    print "$seqID\n" . $seq->seq . "\n";
}
Output on your datasets:
tumor susceptibility
ubiquitin ligase e3a
serine threonine-protein phosphatase 4 catalytic subunit
bardet-biedl syndrome 2 protein P
succinyl- ligase
The hash %annot is initialized by reading and capturing the contents of your annot.txt data. A Bio::SeqIO object is created using your goat300.fasta file data. The while loop iterates through your fasta sequences. The variable $seqID either takes the associated value of the key in the %annot hash or it keeps the current sequence ID (the // notation means defined-or, which ensures $seqID will be defined). Finally, the Fasta record is printed.
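The fallback behaviour of the // operator can be tried in isolation. This is only a sketch: the one-entry lookup table and the ID unknown_contig are made up for illustration, and the operator needs Perl 5.10 or later:

```perl
use strict;
use warnings;

# Hypothetical lookup table with a single known ID.
my %annot = ( pool75_contig_389 => 'ubiquitin ligase e3a' );
my @ids   = qw(pool75_contig_389 unknown_contig);

# Use the annotation when the key has a defined value, otherwise keep the ID.
my @labels = map { $annot{$_} // $_ } @ids;

print "$ids[$_] -> $labels[$_]\n" for 0 .. $#ids;
```

Unlike ||, the // operator only falls back when the left side is undefined, so a legitimate value such as 0 or the empty string would be kept.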
Hope this helps!
There were a lot of warnings in your code, and your approach was inefficient. Let me first show you a working Perl program. I'll explain afterwards.
use strict;
use warnings;
# Read the annotations file
print"Enter annotated file...\n";
# my $f1 = <STDIN>;
my $f1 = 'annot.txt';
open(my $fh_annotations, '<', $f1) or die "Can't open $f1";
my @annotfile = <$fh_annotations>;
close $fh_annotations;
# Read the sequence file
print"Enter sequence file...\n";
# my $f2 = <STDIN>;
my $f2 = 'goat300.fasta';
open(my $fh_genes, '<', $f2) or die "Can't open $f2";
my @seqfile = <$fh_genes>;
close $fh_genes;
# Process the annotations data
my %names; # this hash is going to hold the names
foreach my $line (@annotfile) {
    chomp $line; # remove newline
    my @fields = split /\t/, $line; # split into array
    $names{$fields[0]} = $fields[1]; # save in the hash as key->value pair
}
# Process the sequence data
foreach my $line (@seqfile) {
    # Look at each line
    if ($line =~ m/>(.+)$/) {
        # If there is a heading there, remember it...
        if (exists $names{$1}) {
            # ... check if we know a name for it and replace it in the line
            $line =~ s/(\Q$1\E)/$names{$1}/;
        }
    }
    # output the line (this would be done to another filehandle)
    print $line;
}
This reads both files and saves them in memory, just like yours did. But instead of trying to build two arrays for the names, I went with a hash, which is a key/value pair. Think of it like an array with names instead of numbers and no particular sorting.
Once these names are set up, I can process the sequence file. I simply look at each line and check if there is a heading there, by looking for the > sign. If it's there (it goes into $1 because of the parenthesis), I look if we have a hash entry (with exists) in our %names hash. If we do, we can replace the heading with the proper name.
After that, we could write it out to a new file. I'm just printing it.
I've used a few other techniques. Unfortunately the literature people get in a BioPerl context is quite outdated. Please take this advice, it will make your life easier.
Always use strict and warnings. They will tell you about problems with your code.
Always declare your variables with my. This is not like other languages, where you need to set up a variable at the top of your program. You can declare it where you need it. The vars only live in a certain scope, which means between the nearest enclosing { and } brackets, or block.
Use three-argument open and lexical file handles for security. Read more here.
Perl offers foreach as an alternative to the C for loop. In this case, it made things a lot easier.
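The scoping advice and the foreach loop can be seen together in a small sketch. The headings and variable names are invented for this example:

```perl
use strict;
use warnings;

my @headings = ('>seq1', '>seq2');   # hypothetical sample headings
my $count    = 0;                    # visible in the rest of the file

foreach my $heading (@headings) {
    # $name is declared where it is needed and exists only inside this block.
    my $name = substr $heading, 1;
    $count++;
    print "$count: $name\n";
}

# $heading and $name are out of scope here; under "use strict",
# referring to either of them would be a compile-time error.
print "Saw $count headings\n";
```

Keeping variables in the smallest scope that needs them makes it harder to reuse a stale value by accident.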
One more thing about this program: While this example data was rather short, I believe your actual data might be a lot larger. Consider processing the sequence file while you read it so you do not run out of memory. There's no need to save all the lines, unless you want to do something else with them.
open my $fh_out, '>', $filename_out or die $!;
open my $fh_in, '<', $filename_in or die $!;
while (my $line = <$fh_in>) {
    # do stuff with the line, like your regex
    print $fh_out $line;
}
close $fh_in;
close $fh_out;

Getting unique random line (at each script run) from a text file with perl

I have a text file like the following, called "input.txt":
some field1a | field1b | field1c
...another approx 1000 lines....
fielaNa | field Nb | field Nc
I can choose any field delimiter.
I need a script that, at every separate run, gets one unique (never repeated) random line from this file, until all lines have been used.
My solution: I added one column into a file, so have
0|some field1a | field1b | field1c
...another approx 1000 lines....
0|fielaNa | field Nb | field Nc
and processing it with the next code:
use 5.014;
use warnings;
use utf8;
use List::Util;
use open qw(:std :utf8);
my $file = "./input.txt";
#read all lines into array and shuffle them
open(my $fh, "<:utf8", $file);
my @lines = List::Util::shuffle map { chomp $_; $_ } <$fh>;
close $fh;
#search for the 1st line what has 0 at the start
#change the 0 to 1
#and rewrite the whole file
my $random_line;
for(my $i=0; $i<=$#lines; $i++) {
    if( $lines[$i] =~ /^0/ ) {
        $random_line = $lines[$i];
        $lines[$i] =~ s/^0/1/;
        last;
    }
}
open($fh, ">:utf8", $file);
print $fh join("\n", @lines);
close $fh;
$random_line = "1|NO|more|lines" unless( $random_line =~ /\w/ );
do_something_with_the_fields(split /\|/, $random_line);
It is a working solution, but not a very nice one, because:
the line order is changing at each script run
not concurrent script-run safe.
How can I write it more efficiently and more elegantly?
What about keeping a shuffled list of the line numbers in a different file, removing the first one each time you use it? Some locking might be needed to ensure that concurrent script runs are safe.
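A minimal sketch of that idea follows, assuming an advisory flock is enough for the scripts involved. The subroutine name next_index, its parameters, and the END marker line (used to tell an exhausted list apart from a brand-new state file) are all choices made for this illustration, not part of the suggestion above:

```perl
use strict;
use warnings;
use Fcntl qw(:flock O_RDWR O_CREAT);
use List::Util 'shuffle';

# Return the next unused line index, or undef once all indices are used.
# $state_file and $line_count are hypothetical parameters for this sketch.
sub next_index {
    my ($state_file, $line_count) = @_;

    # Open read/write, creating the file if needed, without truncating it.
    sysopen(my $fh, $state_file, O_RDWR | O_CREAT)
        or die "Can't open $state_file: $!";
    # Hold an exclusive lock for the whole read-modify-write cycle.
    flock($fh, LOCK_EX) or die "Can't lock $state_file: $!";

    my @indices = <$fh>;
    chomp @indices;
    # An empty state file means first use: store a shuffled list plus a marker.
    @indices = (shuffle(0 .. $line_count - 1), 'END') unless @indices;

    my $next = shift @indices;
    if ($next eq 'END') {
        unshift @indices, $next;   # keep the marker so the list stays exhausted
        $next = undef;
    }

    # Rewrite the state file with the remaining indices.
    seek($fh, 0, 0)   or die "Can't rewind $state_file: $!";
    truncate($fh, 0)  or die "Can't truncate $state_file: $!";
    print {$fh} "$_\n" for @indices;
    close($fh)        or die "Can't close $state_file: $!";   # releases the lock
    return $next;
}
```

A caller would do something like my $i = next_index('indices.txt', $number_of_lines); and then read line $i of input.txt. Deleting the state file starts a fresh shuffled sequence.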
From perlfaq5.
How do I select a random line from a file?
Short of loading the file into a database or pre-indexing the lines in
the file, there are a couple of things that you can do.
Here's a reservoir-sampling algorithm from the Camel Book:
rand($.) < 1 && ($line = $_) while <>;
This has a significant advantage in space over reading the whole file
in. You can find a proof of this method in The Art of Computer
Programming, Volume 2, Section 3.4.2, by Donald E. Knuth.
You can use the File::Random module which provides a function for that
use File::Random qw/random_line/;
my $line = random_line($filename);
Another way is to use the Tie::File module, which treats the entire
file as an array. Simply access a random array element.
All Perl programmers should take the time to read the FAQ.
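The one-line reservoir algorithm quoted from the FAQ can be unpacked into a longer form. This is only a sketch with invented sample data and a hypothetical subroutine name, pick_random_line; it picks one line per call and does not by itself prevent repeats across runs:

```perl
use strict;
use warnings;

# Pick one line uniformly at random from an open filehandle in a single pass.
sub pick_random_line {
    my ($fh) = @_;
    my ($chosen, $seen) = (undef, 0);
    while (my $line = <$fh>) {
        $seen++;
        # Replace the current choice with probability 1/$seen.
        $chosen = $line if rand($seen) < 1;
    }
    return $chosen;
}

my $text = "alpha\nbeta\ngamma\n";   # hypothetical sample data
open my $fh, '<', \$text or die "Can't open in-memory file: $!";
my $line = pick_random_line($fh);
close $fh;
print $line;
```

Only the currently chosen line is kept in memory, which is the space advantage the FAQ mentions.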
Update: To get a unique random line each time you're going to have to store state. The easiest way to store the state is to remove the lines that you've used from the file.
This program uses the Tie::File module to open your input.txt file as well as an indices.txt file.
If indices.txt is empty then it is initialised with the indices of all the records in input.txt in a shuffled order.
Each run, the index at the end of the list is removed and the corresponding input record displayed.
use strict;
use warnings;
use Tie::File;
use List::Util 'shuffle';
tie my @input, 'Tie::File', 'input.txt'
    or die qq(Unable to open "input.txt": $!);
tie my @indices, 'Tie::File', 'indices.txt'
    or die qq(Unable to open "indices.txt": $!);
@indices = shuffle(0..$#input) unless @indices;
my $index = pop @indices;
print $input[$index];
I have modified this solution so that it populates a new indices.txt file only if it doesn't already exist and not, as before, simply when it is empty. That means a new sequence of records can be printed simply by deleting the indices.txt file.
use strict;
use warnings;
use Tie::File;
use List::Util 'shuffle';
my ($input_file, $indices_file) = qw( input.txt indices.txt );
tie my @input, 'Tie::File', $input_file
    or die qq(Unable to open "$input_file": $!);
my $first_run = not -f $indices_file;
tie my @indices, 'Tie::File', $indices_file
    or die qq(Unable to open "$indices_file": $!);
@indices = shuffle(0..$#input) if $first_run;
@indices or die "All records have been displayed";
my $index = pop @indices;
print $input[$index];

Comparing lines in a file with perl

I've been trying to compare lines between two files and match the lines that are the same.
For some reason the code below only ever goes through the first line of 'text1.txt' and prints the 'if' statement regardless of if the two variables match or not.
use strict;
open( <FILE1>, "<text1.txt" );
open( <FILE2>, "<text2.txt" );
foreach my $first_file (<FILE1>) {
    foreach my $second_file (<FILE2>) {
        if ( $second_file == $first_file ) {
            print "Got a match - $second_file + $first_file";
        }
    }
}
If you compare strings, use the eq operator. "==" compares arguments numerically.
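A short sketch shows why the numeric comparison always "matches" for ordinary text lines. The two sample strings are made up; neither starts with a number, so both evaluate to 0 numerically:

```perl
use strict;
use warnings;

my ($first, $second) = ("apple\n", "banana\n");   # hypothetical file lines

my $numeric_equal;
{
    # Silence the "isn't numeric" warnings that this comparison triggers.
    no warnings 'numeric';
    # Both strings evaluate to 0 as numbers, so == reports a match.
    $numeric_equal = ($first == $second) ? 1 : 0;
}
# eq compares the strings character by character.
my $string_equal = ($first eq $second) ? 1 : 0;

print "== says: $numeric_equal\n";
print "eq says: $string_equal\n";
```

With use warnings enabled and the no warnings line removed, Perl reports the non-numeric arguments, which is one more reason to keep warnings on.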
Here is a way to do the job if your files aren't too large.
use Modern::Perl;
use File::Slurp qw(slurp);
use Array::Utils qw(:all);
use Data::Dumper;
# read entire files into arrays
my @file1 = slurp('file1');
my @file2 = slurp('file2');
# get the common lines from the 2 files
my @intersect = intersect(@file1, @file2);
say Dumper \@intersect;
A better and faster (but less memory efficient) approach would be to read one file into a hash, and then search for lines in the hash table. This way you go over each file only once.
# This will find matching lines in two files,
# print the matching line and its line number in each file.
use strict;
open (FILE1, "<text1.txt") or die "can't open file text1.txt\n";
my %file_1_hash;
my $line;
my $line_counter = 0;
#read the 1st file into a hash
while ($line=<FILE1>){
    chomp ($line); #-only if you want to get rid of 'endl' sign
    $line_counter++;
    if (!($line =~ m/^\s*$/)){
        $file_1_hash{$line}=$line_counter;
    }
}
close (FILE1);
#read and compare the second file
open (FILE2,"<text2.txt") or die "can't open file text2.txt\n";
$line_counter = 0;
while ($line=<FILE2>){
    $line_counter++;
    chomp ($line);
    if (defined $file_1_hash{$line}){
        print "Got a match: \"$line\"
in line #$line_counter in text2.txt and line #$file_1_hash{$line} at text1.txt\n";
    }
}
close (FILE2);
You must re-open or reset the pointer of file 2. Move the open and close commands to within the loop.
A more efficient way of doing this, depending on file and line sizes, would be to only loop through the files once and save each line that occurs in file 1 in a hash. Then check if the line was there for each line in file 2.
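Resetting the pointer of the second file, rather than reopening it, could look like the sketch below. The file contents are invented, and in-memory filehandles stand in for text1.txt and text2.txt:

```perl
use strict;
use warnings;

my $text1 = "a\nb\nc\n";   # hypothetical contents of the first file
my $text2 = "c\nx\na\n";   # hypothetical contents of the second file
open my $fh1, '<', \$text1 or die "Can't open first file: $!";
open my $fh2, '<', \$text2 or die "Can't open second file: $!";

my @matches;
while (my $line1 = <$fh1>) {
    # Rewind the second handle so the inner loop starts from its first line.
    seek($fh2, 0, 0) or die "Can't rewind second file: $!";
    while (my $line2 = <$fh2>) {
        push @matches, $line1 if $line1 eq $line2;
    }
}
close $fh1;
close $fh2;

chomp @matches;
print "Matched: @matches\n";
```

This still reads the second file once per line of the first, so for large files the hash approach remains the faster choice.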
If you want the number of lines,
my $count=`grep -f [FILE1PATH] -c [FILE2PATH]`;
If you want the matching lines,
my #lines=`grep -f [FILE1PATH] [FILE2PATH]`;
If you want the lines which do not match,
my #lines = `grep -f [FILE1PATH] -v [FILE2PATH]`;
This is a script I wrote that tries to see if two files are identical, although it could easily be modified by playing with the code and switching it to eq. As Tim suggested, using a hash would probably be more effective, although you couldn't ensure the files were being compared in the order they were inserted without using a CPAN module (and as you can see, this method should really use two loops, but it was sufficient for my purposes). This isn't exactly the greatest script ever, but it may give you somewhere to start.
use warnings;
open (FILE, "orig.txt") or die "Unable to open first file.\n";
@data1 = <FILE>;
close (FILE);
open (FILE, "2.txt") or die "Unable to open second file.\n";
@data2 = <FILE>;
close (FILE);
for($i = 0; $i < @data1; $i++){
    $data1[$i] =~ s/\s+$//;
    $data2[$i] =~ s/\s+$//;
    if ($data1[$i] ne $data2[$i]){
        print "Failure to match at line ". ($i + 1) . "\n";
        print $data1[$i];
        print "Doesn't match:\n";
        print $data2[$i];
        print "\nProgram Aborted!\n";
        exit;
    }
}
print "\nThe files are identical. \n";
Taking the code you posted, and transforming it into actual Perl code, this is what I came up with.
use strict;
use warnings;
use autodie;
open my $fh1, '<', 'text1.txt';
open my $fh2, '<', 'text2.txt';
while(
    defined( my $line1 = <$fh1> )
    and
    defined( my $line2 = <$fh2> )
){
    chomp $line1;
    chomp $line2;
    if( $line1 eq $line2 ){
        print "Got a match - $line1\n";
    }else{
        print "Lines don't match $line1 $line2\n";
    }
}
close $fh1;
close $fh2;
Now what you may really want is a diff of the two files, which is best left to Text::Diff.
use strict;
use warnings;
use Text::Diff;
print diff 'text1.txt', 'text2.txt';