Search for multiple terms in perl - perl

I have a file with more than hundred single column entries. I need to search for each of these entries into a file of multiple column and more than thousand entries and need a output file. I tried these codes:
#!/usr/bin/perl -w
use strict;
use warnings;
print "Enter the input file name:";
my $inputfile = <STDIN>;
chomp($inputfile);
print "\nEnter the search file name:";
my $searchfile=<STDIN>;
chomp($searchfile);
open (INPUTFILE, $inputfile) || die;
open (SEARCHFILE, $searchfile) || die;
open (OUT, ">write.txt") || die;
while (my $line=<SEARCHFILE>){
while (<INPUTFILE>) {
if (/$line/){
print OUT $_;
}
}
}
close (INPUTFILE) || die;
close (SEARCHFILE) || die;
close (OUT) || die;
The output file has only one line. It has searched the term from the search file into input file, but only for the first term, not for all. Please help!

When you read INPUTFILE in the inner loop, it's read to the end during the first round of SEARCHFILE. Because it's not reset, the filehandle is used up and will always return eof.
If there are hundreds of lines, but not several 100,000 you can easily read it into an array first and then use that for the lookup. The fact that it's single-column makes that very easy. Note that this is less efficient then the alternative solution below.
chomp( my #needles = <SEARCHFILE> );
while (<INPUTFILE>) {
foreach my $needle (#needles) {
print OUT $_ if m/\Q$needle\E/; # \Q end \E quote regex meta chars
}
}
Alternatively you can also build one large lookup regex that matches all the strings in one go. That is probably faster than iterating the array for each line.
# open ...
chomp( my #needles = <SEARCHFILE> );
my $lookup = join '|', map quotemeta, #needles;
my $lookup_regex = qr/$lookup/; # possibly with /i?
while (my $line = <INPUTFILE>) {
print OUT $line if $line =~ $lookup_regex;
}
The quotemeta takes care of strings that contain regex meta characters like / or | or even .. It's the same as using \Q and \E as above.
Please also use three-argument-open and named filehandles.
open my $fh_searchfile, '<', $searchfile or die $!;
open my $fh_inputfile, '<', $inputfile or die $!;
open my $fh_out, '>', 'write.txt' or die $!;
chomp( my #needles = <$fh_searchfile> );
# ...
The three-argument-open is important because you are taking user input and using it as the filename directly. A malicious user could enter something like | rm -rf *, which would open a pipe to a delete all my files without asking program. Oops. But if you specify the '<' read open method explicitly in its own parameter, the method characters are ignored in the third param.
The lexical filehandle $fh is, as the name says, lexical, while INPUTFILE is a GLOB, which makes it global. That's not so bad if you only have this one script and no modules, but as soon as you deal with different packages it becomes problematic because those are super-global and every part of the program sees them. That can lead to name collisions and weird stuff happening.

Related

Perl - Compare two large txt files and return the required lines from the first

So I am quite new to perl programming. I have two txt files, combined_gff.txt and pegs.txt.
I would like to check if each line of pegs.txt is a substring for any of the lines in combined_gff.txt and output only those lines from combined_gff.txt in a separate text file called output.txt
However my code returns empty. Any help please ?
P.S. I should have mentioned this. Both the contents of the combined_gff and pegs.txt are present as rows. One row has a string. second row has another string. I just wish to pickup the rows from combined_gff whose substrings are present in pegs.txt
#!/usr/bin/perl -w
use strict;
open (FILE, "<combined_gff.txt") or die "error";
my #gff = <FILE>;
close FILE;
open (DATA, "<pegs.txt") or die "error";
my #ext = <DATA>;
close DATA;
my $str = ''; #final string
foreach my $gffline (#gff) {
foreach my $extline (#ext) {
if ( index($gffline, $extline) != -1) {
$str=$str.$gffline;
$str=$str."\n";
exit;
}
}
}
open (OUT, ">", "output.txt");
print OUT $str;
close (OUT);
The first problem is exit. The output file is never created if a substring is found.
The second problem is chomp: you don't remove newlines from the lines, so the only way how a substring can be found is when a string from pegs.txt is a suffix of a string from combined_gff.txt.
Even after fixing these two problems, the algorithm will be very slow, as you're comparing each line from one file to each line of the second file. It will also print a line multiple times if it contains several different substrings (not sure if that's what you want).
Here's a different approach: First, read all the lines from pegs.txt and assemble them into a regex (quotemeta is needed so that special characters in substrings are interpreted literally in the regex). Then, read combined_gff.txt line by line, if the regex matches the line, print it.
#!/usr/bin/perl
use warnings;
use strict;
open my $data, '<', 'pegs.txt' or die $!;
chomp( my #ext = <$data> );
my $regex = join '|', map quotemeta, #ext;
open my $file, '<', 'combined_gff.txt' or die $!;
open my $out, '>', 'output.txt' or die $!;
while (<$file>) {
print {$out} $_ if /$regex/;
}
close $out;
I also switched to 3 argument version of open with lexical filehandles as it's the canonical way (3 argument version is safe even for files named >file or rm *| and lexical filehandles aren't global and are easier to pass as arguments to subroutines). Also, showing the actual error is more helpful than just dying with "error".
As choroba says you don't need the "exit" inside the loop since it ends the complete execution of the script and you must remove the line forwards (LF you do it by chomp lines) to find the matches.
Following the logic of your script I made one with the corrections and it worked fine.
#!/usr/bin/perl -w
use strict;
open (FILE, "<combined_gff.txt") or die "error";
my #gff = <FILE>;
close FILE;
open (DATA, "<pegs.txt") or die "error";
my #ext = <DATA>;
close DATA;
my $str = ''; #final string
foreach my $gffline (#gff) {
chomp($gffline);
foreach my $extline (#ext) {
chomp($extline);
print $extline;
if ( index($gffline, $extline) > -1) {
$str .= $gffline ."\n";
}
}
}
open (OUT, ">", "output.txt");
print OUT $str;
close (OUT);
Hope it works for you.
Welcho

How to find position of a word by using a counter?

I am currently working on a code that changes certain words to Shakespearean words. I have to extract the sentences that contain the words and print them out into another file. I had to remove .START from the beginning of each file.
First I split the files with the text by spaces, so now I have the words. Next, I iterated the words through a hash. The hash keys and values are from a tab delimited file that is structured as so, OldEng/ModernEng (lc_Shakespeare_lexicon.txt). Right now, I'm trying to figure out how to find the exact position of each modern English word that is found, change it to the Shakespearean; then find the sentences with the change words and printing them out to a different file. Most of the code is finished except for this last part. Here is my code so far:
#!/usr/bin/perl -w
use diagnostics;
use strict;
#Declare variables
my $counter=();
my %hash=();
my $conv1=();
my $conv2=();
my $ssph=();
my #text=();
my $key=();
my $value=();
my $conversion=();
my #rmv=();
my $splits=();
my $words=();
my #word=();
my $vals=();
my $existingdir='/home/nelly/Desktop';
my #file='Sentences.txt';
my $eng_words=();
my $results=();
my $storage=();
#Open file to tab delimited words
open (FILE,"<", "lc_shakespeare_lexicon.txt") or die "could not open lc_shakespeare_lexicon.txt\n";
#split words by tabs
while (<FILE>){
chomp($_);
($value, $key)= (split(/\t/), $_);
$hash{$value}=$key;
}
#open directory to Shakespearean files
my $dir="/home/nelly/Desktop/input";
opendir(DIR,$dir) or die "can't opendir Shakespeare_input.tar.gz";
#Use grep to get WSJ file and store into an array
my #array= grep {/WSJ/} readdir(DIR);
#store file in a scalar
foreach my $file(#array){
#open files inside of input
open (DATA,"<", "/home/nelly/Desktop/input/$file") or die "could not open $file\n";
#loop through each file
while (<DATA>){
#text=$_;
chomp(#text);
#Remove .START
#rmv=grep(!/.START/, #text);
foreach $splits(#rmv){
#split data into separate words
#word=(split(/ /, $splits));
#Loop through each word and replace with Shakespearean word that exists
$counter=0;
foreach $words(#word){
if (exists $hash{$words}){
$eng_words= $hash{$words};
$results=$counter;
print "$counter\n";
$counter++;
#create a new directory and store senteces with Shakespearean words in new file called "Sentences.txt"
mkdir $existingdir unless -d $existingdir;
open my $FILE, ">>", "$existingdir/#file", or die "Can't open $existingdir/conversion.txt'\n";
#print $FILE "#words\n";
close ($FILE);
}
}
}
}
}
close (FILE);
close (DIR);
Natural language processing is very hard to get right except in trivial cases, for instance it is difficult to define exactly what is meant by a word or a sentence, and it is awkward to distinguish between a single quote and an apostrophe when they are both represented using the U+0027 "apostrophe" character '
Without any example data it is difficult to write a reliable solution, but the program below should be reasonably close
Please note the following
use warnings is preferable to -w on the shebang line
A program should contain as few comments as possible as long as it is comprehensible. Too many comments just make the program bigger and harder to grasp without adding any new information. The choice of identifiers should make the code mostly self documenting
I believe use diagnostics to be unnecessary. Most messages are fairly self-explanatory, and diagnostics can produce large amounts of unnecessary output
Because you are opening multiple files it is more concise to use autodie which will avoid the need to explicitly test every open call for success
It is much better to use lexical file handles, such as open my $fh ... instead of global ones, like open FH .... For one thing a lexical file handle will be implicitly closed when it goes out of scope, which helps to tidy up the program a lot by making explicit close calls unnecessary
I have removed all of the variable declarations from the top of the program except those that are non-empty. This approach is considered to be best practice as it aids debugging and assists the writing of clean code
The program lower-cases the original word using lc before checking to see if there is a matching entry in the hash. If a translation is found, then the new word is capitalised using ucfirst if the original word started with a capital letter
I have written a regular expression that will take the next sentence from the beginning of the string $content. But this is one of the things that I can't get right without sample data, and there may well be problems, for instance, with sentences that end with a closing quotation mark or a closing parenthesis
use strict;
use warnings;
use autodie;
my $lexicon = 'lc_shakespeare_lexicon.txt';
my $dir = '/home/nelly/Desktop/input';
my $existing_dir = '/home/nelly/Desktop';
my $sentences = 'Sentences.txt';
my %lexicon = do {
open my ($fh), '<', $lexicon;
local $/;
reverse(<$fh> =~ /[^\t\n\r]+/g);
};
my #files = do {
opendir my ($dh), $dir;
grep /WSJ/, readdir $dh;
};
for my $file (#files) {
my $contents = do {
open my $fh, '<', "$dir/$file";
join '', grep { not /\A\.START/ } <$fh>;
};
# Change any CR or LF to a space, and reduce multiple spaces to single spaces
$contents =~ tr/\r\n/ /;
$contents =~ s/ {2,}/ /g;
# Find and process each sentence
while ( $contents =~ / \s* (.+?[.?!]) (?= \s+ [A-Z] | \s* \z ) /gx ) {
my $sentence = $1;
my #words = split ' ', $sentence;
my $changed;
for my $word (#words) {
my $eng_word = $lexicon{lc $word};
$eng_word = ucfirst $eng_word if $word =~ /\A[A-Z]/;
if ($eng_word) {
$word = $eng_word;
++$changed;
}
}
if ($changed) {
mkdir $existing_dir unless -d $existing_dir;
open my $out_fh, '>>', "$existing_dir/$sentences";
print "#words\n";
}
}
}

I want to replace a sequence name in fasta file with another name

I have one fasta file and one text file fasta file contains sequences in fasta format and text file contains name of genes now I want to replace name of the sequences in fasta file after '>' sign with the gene names in text file
I am new to perl though I have written a script but I don't know why its not working can anyone help me on that please
following is my script:
print"Enter annotated file...";
$f1=<STDIN>;
print"Enter sequence file...";
$f2=<STDIN>;
open(FILE1,$f1) || die"Can't open $f1";
#annotfile=<FILE1>;
open(FILE2,$f2) || die"Can't open $f2";
#seqfile=<FILE2>;
#d=split('\t',#annotfile[0]);
for($i=0;$i<scalar(#annotfile);$i++)
{
#curr_all=split('\t',#annotfile[$i]);
#curr_id[$i]=#curr_all[0];
#gene_nm[$i]=#curr_all[1];
}
for($j=0;$j<scalar(#seqfile);$j++)
{
$id=#curr_id[$j];
$gene=#gene_nm[$j];
#seqfile[$j]=~s/$id[$j]/$gene[$j]/g;
print #seqfile[$j];
}
my files looks like following:
annot.txt
pool75_contig_389 ubiquitin ligase e3a
pool75_contig_704 tumor susceptibility
pool75_contig_1977 serine threonine-protein phosphatase 4 catalytic subunit
pool75_contig_3064 bardet-biedl syndrome 2 protein P
pool75_contig_2499 succinyl- ligase
goat300.fasta
goat300.fasta
>pool75_contig_704
CCCTTTCTCCCTTCCCAACATTCAGAGATACTGAATCGAAACTCTTACTGTCTGTTAGAT
GACAAAGAGTTATCCATCCTACATACTCCAATTTCCTTCCGCAACTTGTGATTTCGCCGC
TTGAATCTTGACGCCGTGCGTCCACAGTTTGTTGTGTTTTATCAATCAAGGTCATTATCA
ACCGAAGACGCTATCTATTTTCTTGGCGAAGCTCTCGGAAAGGAGCCATCGAAATGGAAG
TATTTCTCAAGAAAGTCCGCGAGTTATCCCGGAAGCAGTTC
>pool75_contig_389
GACCTATACCGGACCGTCACTGAAAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ACGATCCAGGCATGGAGTTGTGGTGACGAGTAGGAGGGTCACCGTGGTGAGCGGGAAGCC
TCGGGCGTGAGCCTGGGTGGAGCCGCCACGGGTGCAGATCTTGGTGGTAGTAGCAAATAT
TCAAGTGAGAACCTTGAAGGCCGAGGTGGAGAAGGNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTCATTTGTAT
CGCCCGGAAAACGTCACAAGAACGGGAGTTGCGTACAGAA
>pool75_contig_1977
AAGGGACACCGTTGGGTGAGGCGAGCTGCGTTCCTCGAACCATGGCTTCAAAAAGCGACT
TAGACCGTCAGATTGAACAGCTCAGGGCCTGCAAGCTCATTACAGAGGATGAGGTTAAGG
CACTCTGCGCTAAGGCGCGTGAGATTTTAATTGAAGAGAGTAATGTCCAGTGCGTGGACT
CACCTGTCACGGTTTGTGGCGATATCCACGGCCAGTTTTACGACTTGATTGAACTGTTTA
AAGTGGGCGGAGATGTTC
>pool75_contig_3064
TTACTATTTCTGGGCCTTAAGACTGGCTTAGTCGCTTACGACCCTTATAACAATGTAGAT
GTATATTATAAGGATCTTCCTGATGGTGCTAACGCTATGTTAATTTATTCAAACTCACCG
ACAAAGGAACAGAATATGCTTTGGCAGGTGGAAACTGTTCGATAATTGGATTGAACGACG
GCGGATGCGAGGTATTTTGGACAGTCACTGGCGACTCCGTTTGCTCTCTTTGCTCGATTA
AATCCGACAGCGATAAGTCAAGAGATTTTGTGGTTGGCTCTGAAGATTTTGACATCCGAA
TCTTCCATGGGGATGCCATAATATATGAAATCACGGAGTCTGATG
>pool75_contig_2499
AAGAGAAGAGGTGAGTTTGAGTATTGTTTGTGTGTGTGTGGTTGGGTGAGTGTGTGGTAT
GTGGTGTATGTGTGTGATGAATGTATGTGAAAGAGAGTGATGAATCTCATGGATATGTTC
GAGTTCGTGGTTTCCATTGATCGGTTATAGCCGAGATGATGGATGTGTTCCATGTGTCTG
ATTTCAGTTTAGGATTGTGTTGATGATGTTGATGATGAAAATTGTTGATGGTGATGACGA
TAGTGATGATGATGACGATGTTTCGGATAATGGTGATGATGATGATGGTTCCGACGATGA
TGTTTCGCTTGATGATGGTGATAATGATGACTCCGAAAATAACGTTGACTCGGATGAG
Consider using Bio::SeqIO to parse your Fasta dataset, instead of doing it yourself. Bio::SeqIO lives for this task, and is well developed for it. Additionally, if you're in bioinformatics, it would serve you well to get to know Bio::SeqIO. Given this, consider the following:
use strict;
use warnings;
use Bio::SeqIO;
open my $fh, '<', 'annot.txt' or die $!;
my %annot = map { /(\S+)\s+(.+)/; $1 => $2 } <$fh>;
close $fh;
my $in = Bio::SeqIO->new( -file => 'goat300.fasta', -format => 'Fasta' );
while ( my $seq = $in->next_seq() ) {
my $seqID = $annot{ $seq->id } // $seq->id;
print "$seqID\n" . $seq->seq . "\n";
}
Output on your datasets:
tumor susceptibility
CCCTTTCTCCCTTCCCAACATTCAGAGATACTGAATCGAAACTCTTACTGTCTGTTAGATGACAAAGAGTTATCCATCCTACATACTCCAATTTCCTTCCGCAACTTGTGATTTCGCCGCTTGAATCTTGACGCCGTGCGTCCACAGTTTGTTGTGTTTTATCAATCAAGGTCATTATCAACCGAAGACGCTATCTATTTTCTTGGCGAAGCTCTCGGAAAGGAGCCATCGAAATGGAAGTATTTCTCAAGAAAGTCCGCGAGTTATCCCGGAAGCAGTTC
ubiquitin ligase e3a
GACCTATACCGGACCGTCACTGAAAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGATCCAGGCATGGAGTTGTGGTGACGAGTAGGAGGGTCACCGTGGTGAGCGGGAAGCCTCGGGCGTGAGCCTGGGTGGAGCCGCCACGGGTGCAGATCTTGGTGGTAGTAGCAAATATTCAAGTGAGAACCTTGAAGGCCGAGGTGGAGAAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTCATTTGTATCGCCCGGAAAACGTCACAAGAACGGGAGTTGCGTACAGAA
serine threonine-protein phosphatase 4 catalytic subunit
AAGGGACACCGTTGGGTGAGGCGAGCTGCGTTCCTCGAACCATGGCTTCAAAAAGCGACTTAGACCGTCAGATTGAACAGCTCAGGGCCTGCAAGCTCATTACAGAGGATGAGGTTAAGGCACTCTGCGCTAAGGCGCGTGAGATTTTAATTGAAGAGAGTAATGTCCAGTGCGTGGACTCACCTGTCACGGTTTGTGGCGATATCCACGGCCAGTTTTACGACTTGATTGAACTGTTTAAAGTGGGCGGAGATGTTC
bardet-biedl syndrome 2 protein P
TTACTATTTCTGGGCCTTAAGACTGGCTTAGTCGCTTACGACCCTTATAACAATGTAGATGTATATTATAAGGATCTTCCTGATGGTGCTAACGCTATGTTAATTTATTCAAACTCACCGACAAAGGAACAGAATATGCTTTGGCAGGTGGAAACTGTTCGATAATTGGATTGAACGACGGCGGATGCGAGGTATTTTGGACAGTCACTGGCGACTCCGTTTGCTCTCTTTGCTCGATTAAATCCGACAGCGATAAGTCAAGAGATTTTGTGGTTGGCTCTGAAGATTTTGACATCCGAATCTTCCATGGGGATGCCATAATATATGAAATCACGGAGTCTGATG
succinyl- ligase
AAGAGAAGAGGTGAGTTTGAGTATTGTTTGTGTGTGTGTGGTTGGGTGAGTGTGTGGTATGTGGTGTATGTGTGTGATGAATGTATGTGAAAGAGAGTGATGAATCTCATGGATATGTTCGAGTTCGTGGTTTCCATTGATCGGTTATAGCCGAGATGATGGATGTGTTCCATGTGTCTGATTTCAGTTTAGGATTGTGTTGATGATGTTGATGATGAAAATTGTTGATGGTGATGACGATAGTGATGATGATGACGATGTTTCGGATAATGGTGATGATGATGATGGTTCCGACGATGATGTTTCGCTTGATGATGGTGATAATGATGACTCCGAAAATAACGTTGACTCGGATGAG
The hash %annot is initialized by reading and capturing the contents of your annot.txt data. A Bio::SeqIO object is created using your goat300.fasta file data. The while loop iterates through your fasta sequences. The variable $seqID either takes the associated value of the key in the %annot hash or it keeps the current sequence ID (the // notation means defined or, so that insures $seqID will be defined). Finally, the Fasta record is printed.
Hope this helps!
There were a lot of warnings in your code, and your approach was inefficient. Let me first show you a working Perl program. I'll explain afterwards.
#!/usr/bin/perl
use strict;
use warnings;
# Read the annotations file
print"Enter annotated file...\n";
# my $f1 = <STDIN>;
my $f1 = 'annot.txt';
open(my $fh_annotations, '<', $f1) or die "Can't open $f1";
my #annotfile = <$fh_annotations>;
close $fh_annotations;
# Read the sequence file
print"Enter sequence file...\n";
# my $f2 = <STDIN>;
my $f2 = 'goat300.fasta';
open(my $fh_genes, '<', $f2) or die "Can't open $f2";
my #seqfile = <$fh_genes>;
close $fh_genes;
# Process the annotations data
my %names; # this hash is going to hold the names
foreach my $line (#annotfile) {
chomp $line; # remove newline
my #fields = split /\t/, $line; # split into array
$names{$fields[0]} = $fields[1]; # save in the hash as key->value pair
}
# Process the sequence data
foreach my $line (#seqfile) {
# Look at each line
if ($line =~ m/>(.+)$/) {
# If there is a heading there, remember it...
if (exists $names{$1}) {
# ... check if we know a name for it and replace it in the line
$line =~ s/($1)/$names{$1}/;
}
}
# output the line (this would be done to another filehandle)
print $line;
}
This reads both files and saves them in memory, just like yours did. But instead of trying to build two arrays for the names, I went with a hash, which is a key/value pair. Think of it like an array with names instead of numbers and no particular sorting.
Once these names are set up, I can process the sequence file. I simply look at each line and check if there is a heading there, by looking for the > sign. If it's there (it goes into $1 because of the parenthesis), I look if we have a hash entry (with exists) in our %names hash. If we do, we can replace the heading with the proper name.
After that, we could write it out to a new file. I'm just printing it.
I've used a few other techniques. Unfortunately the literature people get in a BioPerl context is quite outdated. Please take this advice, it will make your live easier.
Always use strict and warnings. They will tell you about problems with your code.
Always declare your variables with my. This is not like other languages, where you need to set up a variable at the top of your problem. You can declare it where you need it. The vars only live in a certain scope, which means between the nearest enclosing { and } brackets, or block.
Use three-argument open and lexical file handles for security. Read more here.
Perl offers foreach as an alternative to the C for loop. In this case, it made things a lot easier.
One more thing about this program: While this example data was rather short, I believe your actual data might be a lot larger. Consider processing the sequence file while you read it so you do not run out of memory. There's no need to save all the lines, unless you want to do something else with them.
open my $fh_out, '>', $filename_out or die $!;
open my $fh_in, '<', $filename_in or die $!;
while (my $line = <$fh_in>) {
# do stuff with the line, like your regex
print $fh_out $line;
}
close $fh_in;
close $fh_out;

Perl - Move Pointer to Start of Line

I have 2 files.
Obfuscated file called input.txt
A second file called mapping.txt consisting of key value pairs.
I want to find every occurrence of the key from mapping.txt in input.txt and replace it with the value corresponding to the key.
Please note that I want to overwrite the contents of the line in input.txt everytime a successful match occurs.
I have written the following code:
#! /usr/bin/perl
use strict;
use warnings;
(my $mapping,my $input)=#ARGV;
open(MAPPING,'<',$mapping) || die("couldn't read from the file, $mapping with error: $!\n");
while(<MAPPING>)
{
chomp $_;
my $line=$_;
(my $key,my $value)=split("=",$line);
open(INPUT,'+<',$input);
while(<INPUT>)
{
chomp $_;
if(index($_,$key)!=-1)
{
$_=~s/\Q$key/$value/g;
# move pointer to beginning of line
print INPUT $_."\n";
}
}
close INPUT;
}
close MAPPING;
Brief Overview of the code:
Opens the mapping.txt file in read mode.
Since each line is a key value pair, it splits it into key and value.
Opens the input.txt file in overwrite mode.
Checks if the key is found in the current line.
If the key is found, then substitute the key with the value ignoring any meta characters in the key (by prefixing \Q)
At this point, the file pointer would be at the end of the line since the previous statement would scan the entire line to find the key and replace it.
If I could move the file pointer to the start of the line, then I can overwrite with:
print INPUT $_,"\n"
I tried looking up the seek function however unable to figure out a way to use it for this purpose.
Once this is done, then the code will close the file. It will pick the next key value pair from mapping.txt and again scan the input file from beginning looking for matches and replacing them.
The most important point is, each time the inner while loop will be operating on the input.txt which was modified in the previous iteration of inner while loop. This way, any successful Find and Replace operations would keep on getting saved in the input.txt file.
How do I do this?
Thanks.
First of all you should use lexical file handles, the three-parameter form of open, and always check the status to make sure that an open has succeeded (as you do with the mapping file but not the input file).
The solution you suggest, of rewinding to the start of the line before using print will not work because you cannot update part of a file unless your replacement data is exactly the same size as the data it is replacing. This will not generally be true in your situation.
There are a number of solutions to this, the first and simplest is to invert the loops and put the read loop for the mapping file inside the read loop for the input file. Your code would look like this:
use strict;
use warnings;
my ($mapping, $input) = #ARGV;
open my $infh, '<', $input or die "Unable to open '$input': $!";
while (my $line = <$input>) {
open my $mapfh, '<', $mapping or die "Unable to open '$mapping': $!";
while (<$mapfh>) {
chomp;
my ($key, $value) = split /=/;
$line =~ s/\Q$key/$value/g;
}
print $line;
}
but your output is sent to STDOUT and you will have to arrange the output to be saved to a file and renamed appropriately.
An alternative here is to use the -I command-line option, which forces a file to be renamed automatically, and a backup saved if required. Using a bare -I will modify the file in-place by deleting the old file and renaming the new output, while giving the parameter a value like -I.bak will rename the old file by appending .bak instead of deleting it. The -I option applies only to files read from ARGV using an empty <> operator, and setting the built-in variable $^I to a value (or to the empty string '') has the same effect. The code looks like this:
use strict;
use warnings;
my $mapping = shift #ARGV;
$^I = '.bak';
while (my $line = <>) {
open my $mapfh, '<', $mapping or die "Unable to open '$mapping': $!";
while (<$mapfh>) {
chomp;
my ($key, $value) = split /=/;
$line =~ s/\Q$key/$value/g;
}
print $line;
}
A third, and neater alternative is to use Tie::File, which maps a Perl array to the file contents and reflects all modifications of the array back to the original file. This is an example:
use strict;
use warnings;
use Tie::File;
my ($mapping, $input) = #ARGV;
tie my #input, 'Tie::File', $input or die "Unable to open '$input': $!";
for my $line (#input) {
open my $mapfh, '<', $mapping or die "Unable to open '$mapping': $!";
while (<$mapfh>) {
chomp;
my ($key, $value) = split /=/;
$line =~ s/\Q$key/$value/g;
}
}
Finally, it is highly inefficient to keep opening and reading the mapping file for every line of input, and it is best to build a regex from its contents and use it throughout the program. This version first builds a hash %mapping from the mapping file and then creates a regex by applying quotemeta to each hash key to escape any regex metacharacters, and then joining them with the regex alternation operator |. The keys are sorted by descending length so that the longest matches are found and replaced in priority over the shorter ones.
use strict;
use warnings;
use Tie::File;
my ($mapping, $input) = #ARGV;
open my $mapfh, '<', $mapping or die "Unable to open '$mapping': $!";
my %mapping = map { chomp; /\S/ ? split /=/ : () } <$mapfh>;
my $regex = join '|', map quotemeta, sort { length $b <=> length $b } keys %mapping;
tie my #input, 'Tie::File', $input or die "Unable to open '$input': $!";
for my $line (#input) {
$line =~ s/($regex)/$mapping{$1}/g;
}
If I could move the file pointer to the start of the line, then I can overwrite with:
print INPUT $_,"\n"
Your premise is wrong: Assuming the byte sequence 00 01 02 and the rule 01 = A1 A2, the resulting byte sequence would be 00 A1 A2 and not 00 A1 A2 02. Ways around this include:
Use the Tie::File module.
Write to another file, and rename the second file to the original, once your pass is complete. This is probably most efficient and scalable.
seeking is not a good idea: You would be constrained to fix-length substitutions, and seek and tell operate on bytes, not characters. If you really have to use in-place editing, you could use this loop:
my $beginning_of_line = tell $fh;
while (<$fh>) {
# do processing
seek $fh, $beginning_of_line, 0;
# do update
} continue {$beginning_of_line = tell $fh}
Also, you make several passes over the input file. Assuming the token sequence a b c and the rules b = d e and d = f, you would produce the sequences a f e c or a d e c depending on the order of the rules! This may not be what you want.
Also, consider the ambiguity between the rules a = c and a b = d over the input a b. Does this produce c b or d?

What can be wrong with word count program?

I've got a question in my test:
What is wrong with program that counts number of lines and words in file?
open F, $ARGV[0] || die $!;
my #lines = <F>;
my #words = map {split /\s/} #lines;
printf "%8d %8d\n", scalar(#lines), scalar(#words);
close(F);
My conjectures are:
If file does not exist, program won't tell us about that.
If there are punctuation signs in file, program will count them, for example, in
abc cba
, , ,dce
will be five word, but on the other hand wc outputs the same result, so it might be considered as correct behavior.
If F is a large file, it might be better to iterate over lines and not to dump it into lines array.
Do you have any less trivial ideas?
On the first line, you have a precedence problem:
open F, $ARGV[0] || die $!;
is the same as
open F, ($ARGV[0] || die $!);
which means the die is executed if the filename is false, not if the open fails. You wanted to say
open(F, $ARGV[0]) || die $!;
or
open F, $ARGV[0] or die $!;
Also, you should be using the 3 argument form of open, in case $ARGV[0] contains characters that mean something to open.
open F, '<', $ARGV[0] or die $!;
On a different note, splitting on /\s/ means that you get a "word" between consecutive whitespace characters. You probably meant /\s+/, or as amphetamachine suggested, /\W+/, depending on how you want to define a "word".
That still leaves the problem of the empty "word" you get if the line begins with whitespace. You could split on ' ' to suppress that (it's a special case), or you could trim leading whitespace first, or insert a grep { length $_ } to weed out empty "words", or abandon split and use a different method for counting words.
Processing line by line instead of reading the whole file at once would also be a good improvement, but it's not as important as those first two items.
Your conjecture #1 is incorrect: your program will die if the open fails. (see cjm's answer re order of operations.)
you're using a global filehandle, rather than a lexical variable.
you're not using the three-argument form of open.
you could just read from stdin, which gives more flexibility as to input - the user can provide a file, or pipe the input into stdin.
lastly, I wouldn't write my own code to parse words; I'd reach for CPAN, say something like Lingua::EN::Splitter.
use strict; use warnings;
use Lingua::EN::Splitter qw(words);
my ($wordcount, $lines);
while (<>)
{
my $line = $_;
$lines++;
$wordcount += scalar(words $line);
}
printf "%8d %8d\n", $lines, $wordcount;
When you open F, $ARGV[0] || die $! that will effectively exit if the file doesn't exist.
There are some improvements to be made here:
{local $/; $lines = <F>;} # read all lines at once
my #words = split /\W+/, $lines;