How to store file content sentence by sentence in an array - Perl

I want to open a file, store its contents in an array, make changes to each sentence one at a time, and then print the result back to the file.
I have something like this:
open (FILE , $file);
my @lines = split('.' , <FILE>)
close FILE;
for (@lines) {
s/word/replace/g;
}
open (FILE, ">$file");
print FILE @lines;
close FILE;
For some reason, Perl doesn't like this and won't output any content into the new file. It seems not to like me splitting up the array. Can someone explain why Perl does this and suggest a possible fix? Thanks!

split's pattern is treated as a regular expression. Change split('.' , <FILE>) to split(/\./ , <FILE>)

Change my @lines = split('.' , <FILE>) to my @lines = split('\.' , <FILE>)
In a regex, a bare . matches any single character, so you need to escape it to split on a literal full stop.
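To see the difference, here is a small sketch (filename and sample text invented):

```perl
use strict;
use warnings;

my $text = "First sentence. Second sentence. Third.";

# Unescaped: '.' is a regex metacharacter matching ANY character, so every
# character is a delimiter; all fields are empty and trailing empties are
# dropped, leaving nothing.
my @bad = split '.', $text;

# Escaped: \. matches only a literal full stop.
my @good = split /\./, $text;

print scalar(@bad), "\n";     # 0
print scalar(@good), "\n";    # 3
```

Note that even a single-quoted pattern like `split '.'` is compiled as a regex, which is why `split '\.'` also works: the string contains the two characters `\.`, which the regex engine sees as an escaped dot.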

#!/usr/local/bin/perl
use strict;
use warnings;
my $filename = "somefile.txt";
my $contents = do { local(@ARGV, $/) = $filename; <> };
my @lines = split '\.', $contents;
foreach(@lines){
# @lines holds one sentence per element; $_ is the current sentence.
}

What I found is that the second line of your script is missing a semicolon (;), which is the error. Also, your script cannot handle the content of the entire file: it will process only one line. Please find a modified version of your script below. If you need any clarification, let me know.
my $file='test.txt'; # input file name
open (FILE , $file);
# my @lines = split('\.' ,<FILE>); this will not process the entire content of the file.
my @lines;
while(<FILE>) {
s/word/replace/g;
push(@lines,$_);
}
close FILE;
open (FILE, ">$file");
print FILE @lines;
close FILE;

You have lots of problems in your code.
my @lines = split('.' , <FILE>) will just read the first line and split it.
split('.' should be split(/\./
my @lines = split('.' , <FILE>) has no semicolon terminator.
print FILE @lines; - you have lost all your full stops!
Finally, I have to wonder why you are bothered about 'sentences' at all when you are just replacing one word. If you really want to read one sentence at a time (presumably to do some kind of sentence-based processing) then you need to change the input record separator variable $/. For example:
#!/usr/bin/perl
use strict;
use warnings;
my $file = "data.txt";
open (FILE , $file);
my @buffer;
$/ = '.'; # Change the Input Separator to read one sentence at a time.
# Be careful though, it won't work for questions ending in ?
while ( my $sentence = <FILE> ) {
$sentence =~ s/word/replace/g;
push @buffer, $sentence;
}
close FILE;
.. saving to file is left for you to solve.
However if you just want to change the strings you can read the whole file in one gulp by setting $/ to undef. Eg:
#!/usr/bin/perl
use strict;
use warnings;
my $file = "data.txt";
open (FILE , $file);
$/ = undef; # Slurp mode!
my $buffer = <FILE>;
close FILE;
$buffer =~ s/word/replace/g;
open (FILE, ">$file");
print FILE $buffer;
close FILE;
If you are really looking to process sentences and you want to get questions then you probably want to slurp the whole file and then split it, but use a capture in your regex so that you don't lose the punctuation. Eg:
#!/usr/bin/perl
use strict;
use warnings;
my $file = "data.txt";
open (FILE , $file);
$/ = undef; # slurp!
my $buffer = <FILE>;
close FILE;
open (FILE, ">$file" . '.new'); # Don't want to overwrite my input.
foreach my $sentence (split(/([.?]+)/, $buffer)) # the capturing () makes split keep the punctuation in the output list.
{
$sentence =~ s/word/replace/g;
print FILE $sentence;
}
close FILE;

Related

Perl - Compare two large txt files and return the required lines from the first

So I am quite new to perl programming. I have two txt files, combined_gff.txt and pegs.txt.
I would like to check if each line of pegs.txt is a substring for any of the lines in combined_gff.txt and output only those lines from combined_gff.txt in a separate text file called output.txt
However, my code returns an empty output file. Any help please?
P.S. I should have mentioned this. The contents of both combined_gff.txt and pegs.txt are rows, one string per row. I just wish to pick up the rows from combined_gff.txt that contain a string from pegs.txt as a substring.
#!/usr/bin/perl -w
use strict;
open (FILE, "<combined_gff.txt") or die "error";
my @gff = <FILE>;
close FILE;
open (DATA, "<pegs.txt") or die "error";
my @ext = <DATA>;
close DATA;
my $str = ''; # final string
foreach my $gffline (@gff) {
foreach my $extline (@ext) {
if ( index($gffline, $extline) != -1) {
$str=$str.$gffline;
$str=$str."\n";
exit;
}
}
}
open (OUT, ">", "output.txt");
print OUT $str;
close (OUT);
The first problem is exit. The output file is never created if a substring is found.
The second problem is chomp: you don't remove newlines from the lines, so the only way how a substring can be found is when a string from pegs.txt is a suffix of a string from combined_gff.txt.
Even after fixing these two problems, the algorithm will be very slow, as you're comparing each line from one file to each line of the second file. It will also print a line multiple times if it contains several different substrings (not sure if that's what you want).
Here's a different approach: First, read all the lines from pegs.txt and assemble them into a regex (quotemeta is needed so that special characters in substrings are interpreted literally in the regex). Then, read combined_gff.txt line by line, if the regex matches the line, print it.
#!/usr/bin/perl
use warnings;
use strict;
open my $data, '<', 'pegs.txt' or die $!;
chomp( my @ext = <$data> );
my $regex = join '|', map quotemeta, @ext;
open my $file, '<', 'combined_gff.txt' or die $!;
open my $out, '>', 'output.txt' or die $!;
while (<$file>) {
print {$out} $_ if /$regex/;
}
close $out;
I also switched to the 3-argument version of open with lexical filehandles, as it's the canonical way (the 3-argument version is safe even for files named >file or rm *|, and lexical filehandles aren't global and are easier to pass as arguments to subroutines). Also, showing the actual error is more helpful than just dying with "error".
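As an illustration of that point (the filename here is invented), the three-argument form keeps the mode and the name separate, so no filename can be misread as a mode:

```perl
use strict;
use warnings;

# A file whose name contains a character that looks like an open() mode.
# (Invented name; legal on most Unix filesystems.)
my $name = 'weird>name.txt';

# Three-argument open: the mode ('>') and the name are separate arguments,
# so the '>' inside the name is just an ordinary character.
open my $out, '>', $name or die "Cannot create '$name': $!";
print {$out} "hello\n";
close $out or die $!;

# Lexical handles are scoped: they close automatically when they go out of
# scope, and they can't collide with a global FILE handle elsewhere.
open my $in, '<', $name or die "Cannot open '$name': $!";
print scalar <$in>;    # hello
unlink $name;
```

With the two-argument form, `open my $fh, $name` would have parsed the `>` out of the name and truncated the file instead of reading it.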
As choroba says, you don't need the "exit" inside the loop, since it ends the complete execution of the script, and you must remove the line feeds (LF), which you do by chomping the lines, to find the matches.
Following the logic of your script I made one with the corrections and it worked fine.
#!/usr/bin/perl -w
use strict;
open (FILE, "<combined_gff.txt") or die "error";
my @gff = <FILE>;
close FILE;
open (DATA, "<pegs.txt") or die "error";
my @ext = <DATA>;
close DATA;
my $str = ''; # final string
foreach my $gffline (@gff) {
chomp($gffline);
foreach my $extline (@ext) {
chomp($extline);
print $extline;
if ( index($gffline, $extline) > -1) {
$str .= $gffline ."\n";
}
}
}
}
open (OUT, ">", "output.txt");
print OUT $str;
close (OUT);
Hope it works for you.
Welcho

Perl script to subset multiple DNA sequences

I have a FASTA file of ~500 DNA sequences, each of which has target position for a Single-Neucleotide Polymorphism (SNP) of interest that is known to me.
I also have a separate tab-delimited text file that, for each entry in the FASTA file, has a line with:
The FASTA sequence name
The start position
The end position
The SNP position
The sequences and the positions in the text file are in the same order.
The dummy FASTA file is:
>AOS-94_S25_L002_R1_001_trimmed_contig_767
GACACACACTGATTGTTAGTGGTGTACAGACATTGCTTCAAACTGCA
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
TAGGTTTTCTTTCCCATGTCCCCTGAATAACATGGGATTCCCTGTGACTGTGGGGACCCCTGAGAGCCTGGT
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
GATAAGGAGCTCACAGCAACCCACATGAGTTGTCC
and the dummy position file is
AOS-94_S25_L002_R1_001_trimmed_contig_767 5 15 10
AOS-94_S25_L002_R1_001_trimmed_contig_2199 8 19 11
AOS-94_S25_L002_R1_001_trimmed_contig_2585 4 20 18
This is the script I have written and tried
use warnings;
use strict;
# Read in the complete FASTA file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;
# Read in file of contig name, start pos, stop pos, SNP pos in tab delimited text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;
# Output file
print "What are we calling the output? \n";
my $out= <STDIN>;
chomp $out;
local $/ = "\n>"; #Read by fasta record
my $seq1 = ();
open(FASTA,$fasta) || die "\n Unable to open the file!\n";
open(POS,$text) || die "\n Unable to open the file! \n";
my @fields = <POS>;
while (my $seq = <FASTA>){
chomp $seq;
my @seq = split(/\n/,$seq);
if($seq[0] =~ /^>/){
$seq1 = $seq[0];
}elsif($seq[0] =~ /[^>]/){ #matches any character except the >
$seq1 = ">".$seq[0];
}
for my $pos (@fields){
chomp $pos;
my @field = split(/\t/,$pos);
open(OUTFILE,">>$out");
print OUTFILE "$seq1";
my $subseq = substr $seq[1], $field[1] -1, $field[2] - $field[1];
print OUTFILE "$subseq\n";
}
}
close FASTA;
close POS;
close OUTFILE;
This is what I get out, which is what I sort of want:
>AOS-94_S25_L002_R1_001_trimmed_contig_767
CACACTGATT
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
TTTTCTTTCC
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
AGGAGCTCAC
However, I need to also print out the SNP position (column 4) after the sequence name, e.g.,
>AOS-94_S25_L002_R1_001_trimmed_contig_767
pos=10
CACACTGATT
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
pos=11
TTTTCTTTCC
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
pos=18
AGGAGCTCAC
I tried inserting print OUTFILE "pos= $field[3]\n"; after print OUTFILE "$seq1"; and I get the following:
>AOS-94_S25_L002_R1_001_trimmed_contig_767
10
AOS-94_S25_L002_R1_001_trimmed_contig_2199
CACACTGATT
>AOS-94_S25_L002_R1_001_trimmed_contig_2199
10
AOS-94_S25_L002_R1_001_trimmed_contig_2199
TTTTCTTTCC
>AOS-94_S25_L002_R1_001_trimmed_contig_2585
10
AOS-94_S25_L002_R1_001_trimmed_contig_2199
AGGAGCTCAC
Obviously I have messed up my loops, and probably some chomp commands.
For instance, when I print "$seq1" to a file, why does it not need a "\n" included in the printed string? There must already be a hard return in the string?
I know I am missing some basics of how this is structured, but I so far can't figure out how to fix my mistakes. Can anyone provide any suggestions?
Update
Perl code reformatted for legibility
use warnings;
use strict;
# Read in the complete FASTA file:
print "What is the name of the fasta contig file?\n";
my $fasta = <STDIN>;
chomp $fasta;
# Read in file of contig name, start pos, stop pos, SNP pos in tab delimited text:
print "Name of text file with contig name and SNP position info? \n";
my $text = <STDIN>;
chomp $text;
#Output file
print "What are we calling the output? \n";
my $out = <STDIN>;
chomp $out;
local $/ = "\n>"; # Read by FASTA record
my $seq1 = ();
open( FASTA, $fasta ) || die "\n Unable to open the file!\n";
open( POS, $text ) || die "\n Unable to open the file! \n";
my @fields = <POS>;
while ( my $seq = <FASTA> ) {
chomp $seq;
my @seq = split( /\n/, $seq );
if ( $seq[0] =~ /^>/ ) {
$seq1 = $seq[0];
}
elsif ( $seq[0] =~ /[^>]/ ) { # matches any character except the >
$seq1 = ">" . $seq[0];
}
for my $pos ( @fields ) {
chomp $pos;
my @field = split( /\t/, $pos );
open( OUTFILE, ">>$out" );
print OUTFILE "$seq1";
my $subseq = substr $seq[1], $field[1] - 1, $field[2] - $field[1];
print OUTFILE "$subseq\n";
}
}
close FASTA;
close POS;
close OUTFILE;
There are many problems with your code
Your comments don't correspond to the code. For instance, you have Read in the complete FASTA file when the code just accepts the file name from STDIN and trims it. It is usually best to write clean code with well-chosen identifiers; that way the program explains itself
You are using the two-parameter form of open and global file handles. You also don't have the reason for failure in the die string, and you have a newline at the end, which will prevent Perl from giving you the source file name and line number where the error occurred
Something like
open( FASTA, $fasta ) || die "\n Unable to open the file!\n"
should be
open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!}
and
open( OUTFILE, ">>$out" );
should be
open my $out_fh, '>>', $output_file or die qq{Unable to open "$output_file" for appending: $!}
You should avoid putting quotes around variable names.
print OUTFILE "$seq1"
should be
print OUTFILE $seq1
You set the input record separator to "\n>". That means that every time you call <FASTA> Perl will read up to the next occurrence of that string. It also means that chomp will remove exactly that string from the end of the line, if it is there
The biggest problem is that you never reset $/ before reading from POS. Remember that its setting affects every readline (or <>) and every chomp. And because your $text file probably contains no > characters at the start of a line, you will read the entire file in one go
That is why you are seeing newlines in your output without asking for them. You have read the whole file, together with all embedded newlines, and chomp is useless here because you have modified the string that it removes
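A minimal demonstration of that interaction (with made-up record data):

```perl
use strict;
use warnings;

# chomp removes whatever $/ currently holds, not always just "\n".
my $record;
{
    local $/ = "\n>";           # FASTA-style record separator
    $record = "GACACACACT\n>";  # a record as readline would return it
    chomp $record;              # strips the two-character "\n>" terminator
}
print "$record\n";              # GACACACACT

# Outside the block $/ is "\n" again, so chomp behaves as usual.
my $line = "pegs.txt\n";
chomp $line;                    # strips just "\n"
```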
local is named that way for a reason. It alters the value temporarily and locally to the current scope. But your "current scope" is the entirety of the rest of the file and you are reading both files with the modified terminator
Use some braces { ... } to limit the scope of the local modification. Alternatively, because file handles in more recent versions of Perl behave as IO::Handle objects, you can write
$fasta_fh->input_record_separator("\n>")
and the change will apply only to that file handle, and there is no need to localise $/ at all
Here's an amended version of your program which also addresses some poor choices of identifier as well as a few other things. Please note that this code is untested. I am working on the train at present and can only check what I'm writing mentally
Note that things like while ( <$fasta_fh> ) and for ( #pos_records ) use the default variable $_ when no loop variable is specified. Likewise, operators like chomp and split will apply to $_ when the corresponding parameter is missing. That way there is never any need to mention any variable explicitly, and it leads to more concise and readable code. $_ is equivalent to it in the English language
I encourage you to understand what the things you're writing actually do. It is becoming common practice to copy code from one part of the internet and offer it to some kind souls elsewhere to get it working for you. That isn't "learning to program", and you will not understand anything unless you study the language and put your mind to it
And please take more care with laying out your code. I hope you can see that the edit I made to your question, and the code in my solution, is far more comfortable to read than the program that you posted? While you're welcome to make your own job as awkward as you like, it's unfair and impolite to offer a mess like that to a world of total strangers whom you're asking for free programming help. A nice middle ground is to set your editor to insert four spaces when the tab key is pressed. Never use tab characters in source code!
use strict;
use warnings 'all';
print "Name of the FASTA contig file: ";
chomp( my $fasta_file = <STDIN> );
print "Name file with SNP position info: ";
chomp( my $pos_file = <STDIN> );
print "Name of the output file: ";
chomp( my $out_file = <STDIN> );
open my $out_fh, '>', $out_file or die qq{Unable to open "$out_file" for output: $!};
my @pos_records = do {
open my $pos_fh, '<', $pos_file or die qq{Unable to open "$pos_file" for input: $!};
<$pos_fh>;
};
chomp @pos_records; # Remove all newlines
{
open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!};
local $/ = "\n>"; # Reading FASTA format now
while ( <$fasta_fh> ) {
chomp; # Remove "\n>" from the end
my ( $header, $seq ) = split /\n/; # Separate the two lines
$header =~ s/^>?/>/; # Replace any chomped >
for ( @pos_records ) {
my ( $name, $beg, $end, $pos ) = split /\t/;
my $subseq = substr $seq, $beg-1, $end-$beg;
print $out_fh "$header\n";
print $out_fh "pos=$pos\n";
print $out_fh "$subseq\n";
}
}
} # local $/ expires here
close $out_fh or die $!;
Okay, with a couple very minor edits, your code worked perfectly. This is the solution that worked for me:
#!/usr/bin/perl
use strict;
use warnings;
print "Name of the FASTA contig file: ";
chomp( my $fasta_file = <STDIN> );
print "Name file with SNP position info: ";
chomp( my $pos_file = <STDIN> );
print "Name of the output file: ";
chomp( my $out_file = <STDIN> );
open my $out_fh, '>', $out_file or die qq{Unable to open "$out_file" for output: $!};
my @pos_records = do {
open my $pos_, '<' , $pos_file or die qq{Unable to open "$pos_file" for input: $!};
<$pos_>;
};
chomp @pos_records; # remove all newlines
{
open my $fasta_fh, '<', $fasta_file or die qq{Unable to open "$fasta_file" for input: $!};
local $/ = "\n>"; #Reading FASTA format now
for ( <$fasta_fh> ) {
chomp; #Remove ">\n" from the end
my ( $header, $seq) = split /\n/; #separate the two lines
$header = ">$header" unless $header =~ /^>/; # Replace any chomped >
for ( @pos_records ) {
my ($name,$beg,$end,$pos) = split /\t/;
my $subseq = substr $seq, $beg-1, $end-$beg;
my $final_SNP = $end - $pos;
if($header =~ /$name/){
print $out_fh "$header\n";
print $out_fh "pos=$final_SNP\n";
print $out_fh "$subseq\n";
}
}
}
} #local expires here
close $out_fh or die $!;
The only substantive thing I changed was the addition of an if statement. Without it, each FASTA sequence was being written three times, once with each of the three SNP positions. I also slightly changed how I notate the SNP position, which after excising the sequence is actually $end - $pos and not just $pos.
Again, I can't thank you enough, as it is obvious you spent a fair bit of time helping me. For what it's worth, I sincerely appreciate it. Your solution will serve as a template for my future efforts (which will likely be similar manipulations of FASTA files), and your explanations have helped me to better understand things like what local does in a way that my pea brain can comprehend.

Read text file in Perl word by word instead of line by line

I have a big (300 kB) text file containing words delimited by spaces. Now I want to open this file and process every word in it one by one.
The problem is that Perl reads the file line by line, and since the whole file is effectively one line, it reads the entire file at once, which gives me strange results. I know the normal way is to do something like
open($inFile, 'tagged.txt') or die $!;
$_ = <$inFile>;
@splitted = split(' ',$_);
print $#splitted;
But this gives me a faulty word count (too large array?).
Is it possible to read the text file word by word instead?
Instead of reading it in one fell swoop, try the line-by-line approach which is easier on your machine's memory usage too (although 300 KB isn't too large for modern computers).
use strict;
use warnings;
my @words;
open (my $inFile, '<', 'tagged.txt') or die $!;
while (<$inFile>) {
chomp;
@words = split(' ');
foreach my $word (@words) { # process }
}
close ($inFile);
To read the file one word at a time, change the input record separator ($/) to a space:
local $/ = ' ';
Example:
#!/usr/bin/perl
use strict;
use warnings;
use feature 'say';
{
local $/ = ' ';
while (<DATA>) {
say;
}
}
__DATA__
one two three four five
Output:
one
two
three
four
five
It's unclear what your input file looks like, but you imply that it contains just a single line composed of many "words".
300KB is far from a "big text file". You should read it in its entirety and pull the words from there one by one. This program demonstrates
use strict;
use warnings;
my $data = do {
open my $fh, '<', 'data.txt' or die $!;
local $/;
<$fh>;
};
my $count = 0;
while ($data =~ /(\S+)/g ) {
my $word = $1;
++$count;
printf "%2d: %s\n", $count, $word;
}
output
1: alpha
2: beta
3: gamma
4: delta
5: epsilon
Without more explanation of what a "faulty word count" might be it is very hard to help, but it is certain that the problem isn't because of the size of your array: if there was a problem there then Perl would raise an exception and die.
But if you are comparing the result with the statistics from a word processor, then it is probably because the definition of "word" is different. For instance, the word processor may consider a hyphenated word to be two words.
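One concrete source of an inflated count is the split pattern itself; this sketch (sample text invented) shows the classic leading-whitespace gotcha:

```perl
use strict;
use warnings;

my $line = "  alpha beta\tgamma  ";

# split /\s+/ produces a leading empty field when the string starts with
# whitespace, inflating the count by one.
my @a = split /\s+/, $line;    # ("", "alpha", "beta", "gamma")

# split ' ' (a literal single space as the pattern) is a documented special
# case that discards leading whitespace first - usually what you want.
my @b = split ' ', $line;      # ("alpha", "beta", "gamma")

print scalar(@a), " vs ", scalar(@b), "\n";   # 4 vs 3
```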
300K doesn't seem to be big, so you may try:
my $text=`cat t.txt` or die $!;
my @words = split /\s+/, $text;
foreach my $word (@words) { # process }
or slightly modified solution of squiguy
use strict;
use warnings;
my @words;
open (my $inFile, '<', 'tagged.txt') or die $!;
while (<$inFile>) {
push(@words, split /\s+/);
}
close ($inFile);
foreach my $word (@words) { # process }

Read newline delimited file in Perl

I am trying to read a newline-delimited file into an array in Perl. I do NOT want the newlines to be part of the array, because the elements are filenames to read later. That is, each element should be "foo" and not "foo\n". I have done this successfully in the past using the methods advocated in Stack Overflow question Read a file into an array using Perl and Newline Delimited Input.
My code is:
open(IN, "< test") or die ("Couldn't open");
@arr = <IN>;
print("$arr[0] $arr[1]")
And my file 'test' is:
a
b
c
d
e
My expected output would be:
a b
My actual output is:
a
b
I really don't see what I'm doing wrong. How do I read these files into arrays?
Here is how I generically read from files.
open (my $in, "<", "test") or die $!;
my @arr;
while (my $line = <$in>) {
chomp $line;
push @arr, $line;
}
close ($in);
chomp will remove newlines from the line read. You should also use the three-argument version of open.
Put the file path in its own variable so that it can be easily changed.
Use the 3-argument open.
Test all opens, prints, and closes for success, and if not, print the error and the file name.
Try:
#!/usr/bin/env perl
use strict;
use warnings;
# --------------------------------------
use charnames qw( :full :short );
use English qw( -no_match_vars ); # Avoids regex performance penalty
# conditional compile DEBUGging statements
# See http://lookatperl.blogspot.ca/2013/07/a-look-at-conditional-compiling-of.html
use constant DEBUG => $ENV{DEBUG};
# --------------------------------------
# put file path in a variable so it can be easily changed
my $file = 'test';
open my $in_fh, '<', $file or die "could not open $file: $OS_ERROR\n";
chomp( my @arr = <$in_fh> );
close $in_fh or die "could not close $file: $OS_ERROR\n";
print "@arr[ 0 .. 1 ]\n";
A less verbose option is to use File::Slurp::read_file
my $array_ref = read_file 'test', chomp => 1, array_ref => 1;
if, and only if, you need to save the list of file names anyway.
Otherwise,
my $filename = 'test';
open (my $fh, "<", $filename) or die "Cannot open '$filename': $!";
while (my $next_file = <$fh>) {
chomp $next_file;
do_something($next_file);
}
close ($fh);
would save memory by not having to keep the list of files around.
Also, you might be better off using $next_file =~ s/\s+\z// rather than chomp unless your use case really requires allowing trailing whitespace in file names.
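For example (made-up filename), a Windows-style CRLF ending slips past chomp on a Unix Perl, where $/ is just "\n":

```perl
use strict;
use warnings;

my $name = "data.txt\r\n";   # e.g. a file list written on Windows
chomp $name;                 # removes only "\n", leaving "data.txt\r"
print "after chomp: [$name]\n";

$name =~ s/\s+\z//;          # removes the "\r" (and any trailing spaces/tabs)
print "after s///:  [$name]\n";   # [data.txt]
```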

Comparing lines in a file with perl

I've been trying to compare lines between two files and match the lines that are the same.
For some reason the code below only ever goes through the first line of 'text1.txt' and prints the 'if' statement regardless of whether the two variables match or not.
Thanks
use strict;
open( <FILE1>, "<text1.txt" );
open( <FILE2>, "<text2.txt" );
foreach my $first_file (<FILE1>) {
foreach my $second_file (<FILE2>) {
if ( $second_file == $first_file ) {
print "Got a match - $second_file + $first_file";
}
}
}
close(FILE1);
close(FILE2);
If you compare strings, use the eq operator. "==" compares arguments numerically.
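A short illustration of the difference:

```perl
use strict;
use warnings;

my $x = "10";
my $y = "10.0";

# Numeric comparison: both strings convert to the number 10, so this is true.
print "numerically equal\n" if $x == $y;

# String comparison: the character sequences differ, so this is false.
print "stringwise equal\n" if $x eq $y;

# Worse, == on two non-numeric lines converts both to 0 and always "matches"
# (with warnings under 'use warnings'), which is why the original code prints
# its 'if' branch regardless of the actual content.
```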
Here is a way to do the job if your files aren't too large.
#!/usr/bin/perl
use Modern::Perl;
use File::Slurp qw(slurp);
use Array::Utils qw(:all);
use Data::Dumper;
# read entire files into arrays
my @file1 = slurp('file1');
my @file2 = slurp('file2');
# get the common lines from the 2 files
my @intersect = intersect(@file1, @file2);
say Dumper \@intersect;
A better and faster (but less memory efficient) approach would be to read one file into a hash, and then search for lines in the hash table. This way you go over each file only once.
# This will find matching lines in two files,
# print the matching line and it's line number in each file.
use strict;
open (FILE1, "<text1.txt") or die "can't open file text1.txt\n";
my %file_1_hash;
my $line;
my $line_counter = 0;
#read the 1st file into a hash
while ($line=<FILE1>){
chomp ($line); #-only if you want to get rid of 'endl' sign
$line_counter++;
if (!($line =~ m/^\s*$/)){
$file_1_hash{$line}=$line_counter;
}
}
close (FILE1);
#read and compare the second file
open (FILE2,"<text2.txt") or die "can't open file text2.txt\n";
$line_counter = 0;
while ($line=<FILE2>){
$line_counter++;
chomp ($line);
if (defined $file_1_hash{$line}){
print "Got a match: \"$line\"
in line #$line_counter in text2.txt and line #$file_1_hash{$line} at text1.txt\n";
}
}
close (FILE2);
You must re-open or reset the pointer of file 2. Move the open and close commands to within the loop.
A more efficient way of doing this, depending on file and line sizes, would be to only loop through the files once and save each line that occurs in file 1 in a hash. Then check if the line was there for each line in file 2.
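That single-pass idea can be sketched like this, using the question's filenames and lexical filehandles:

```perl
use strict;
use warnings;

# One pass per file: O(lines1 + lines2) instead of comparing every pair.
my %seen;
open my $fh1, '<', 'text1.txt' or die "text1.txt: $!";
while ( my $line = <$fh1> ) {
    chomp $line;            # trailing newlines would defeat the hash lookup
    $seen{$line} = 1;
}
close $fh1;

open my $fh2, '<', 'text2.txt' or die "text2.txt: $!";
while ( my $line = <$fh2> ) {
    chomp $line;
    print "Got a match - $line\n" if $seen{$line};
}
close $fh2;
```

Note this finds whole-line matches in any order; if the files must match line-for-line in sequence, read them in parallel instead.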
If you want the number of lines,
my $count=`grep -f [FILE1PATH] -c [FILE2PATH]`;
If you want the matching lines,
my #lines=`grep -f [FILE1PATH] [FILE2PATH]`;
If you want the lines which do not match,
my #lines = `grep -f [FILE1PATH] -v [FILE2PATH]`;
This is a script I wrote that tries to see if two files are identical, although it could easily be modified by playing with the code and switching it to eq. As Tim suggested, using a hash would probably be more effective, although you couldn't ensure the files were being compared in the order they were inserted without using a CPAN module (and as you can see, this method should really use two loops, but it was sufficient for my purposes). This isn't exactly the greatest script ever, but it may give you somewhere to start.
use warnings;
open (FILE, "orig.txt") or die "Unable to open first file.\n";
@data1 = <FILE>;
close(FILE);
open (FILE, "2.txt") or die "Unable to open second file.\n";
@data2 = <FILE>;
close(FILE);
for($i = 0; $i < @data1; $i++){
$data1[$i] =~ s/\s+$//;
$data2[$i] =~ s/\s+$//;
if ($data1[$i] ne $data2[$i]){
print "Failure to match at line ". ($i + 1) . "\n";
print $data1[$i];
print "Doesn't match:\n";
print $data2[$i];
print "\nProgram Aborted!\n";
exit;
}
}
print "\nThe files are identical. \n";
Taking the code you posted, and transforming it into actual Perl code, this is what I came up with.
use strict;
use warnings;
use autodie;
open my $fh1, '<', 'text1.txt';
open my $fh2, '<', 'text2.txt';
while(
defined( my $line1 = <$fh1> )
and
defined( my $line2 = <$fh2> )
){
chomp $line1;
chomp $line2;
if( $line1 eq $line2 ){
print "Got a match - $line1\n";
}else{
print "Lines don't match $line1 $line2\n"
}
}
close $fh1;
close $fh2;
Now what you may really want is a diff of the two files, which is best left to Text::Diff.
use strict;
use warnings;
use Text::Diff;
print diff 'text1.txt', 'text2.txt';