Perl - Comparison of files using specific substrings

I've written this script to compare the lines of two files and output the common and non-common lines into two different files. The script is:
use strict;
use warnings;
use autodie;
my $f1 = shift || "CSP8216.TXT";
my $f2 = shift || "CSP8217.TXT";
open my $fh1, '>', 'file1';
open FH2, '>', 'file2';
my %results;
open my $file1, '<', $f1;
while (my $line = <$file1>) {
$results{$line} = 1
}
open my $file2, '<', $f2;
while (my $line = <$file2>) {
$results{$line}++
}
foreach my $line (sort { $results{$b} <=> $results{$a} } keys %results)
{
if ($results{$line} >= 1)
{
print {$fh1} "$line";
}
else
{
print FH2 "$line";
}
}
My problem comes when I try to modify this script to run the comparison on specific substrings of each line, i.e.:
If a specific substring of a line of file A matches another specific substring of a line in file B, then output the /entire/ line of file B into fh1; otherwise output it into fh2.
I tried this, but it doesn't work. I'm still really new to Perl, so any help will be much appreciated:
use strict;
use warnings;
use autodie;
my $f1 = shift || "CSP8216.TXT";
my $f2 = shift || "CSP8216.TXT";
open my $fh1, '>', 'file1';
open FH2, '>', 'file2';
my %results;
open my $file1, '<', $f1;
while (my $line = <$file1>)
{
my $sbs1 = substr($line, 0, 10);
$results{$sbs1} = 1
}
open my $file2, '<', $f2;
while (my $line = <$file2>)
{
my $sbs2 = substr($line, 0, 10);
$results{$sbs2}++
}
foreach my $line (sort { $results{$b} <=> $results{$a} } keys %results)
{
if ($results{$line} >= 1)
{
print {$fh1} "$line";
}
else
{
print FH2 "$line";
}
}
This does not work, and I have a feeling it's a problem in the logic: it outputs just the substrings, all on a single line.

As per my comment, we need to keep the lines from file A and file B separate if we need to support that a single line can appear twice in one file.
One option is to solve the basic problem like this:
open my $fh1, '<', $filename1 or die "Can't open $filename1: $!";
while (my $line = <$fh1>) {
$combined{$line} = $file1{$line} = 1;
}
open my $fh2, '<', $filename2 or die "Can't open $filename2: $!";
while (my $line = <$fh2>) {
$combined{$line} = $file2{$line} = 1;
}
open my $out1, '>', $outfilename1 or die "...";
open my $out2, '>', $outfilename2 or die "...";
for my $line (keys %combined) {
if ($file1{$line} && $file2{$line}) {
print $out1 $line;
} else {
print $out2 $line;
}
}
To solve the substring issue I would keep the substrings from each file as keys in the hashes. But instead of just storing the true value I would store the full string as value in %file2:
open my $fh1, '<', $filename1 or die "Can't open $filename1: $!";
while (my $line = <$fh1>) {
my $substr = substr($line, 0, 10);
$combined{$line} = $file1{$substr} = 1;
}
open my $fh2, '<', $filename2 or die "Can't open $filename2: $!";
while (my $line = <$fh2>) {
my $substr = substr($line, 20, 30);
$combined{$line} = 1;
$file2{$substr} = $line;
}
open my $out1, '>', $outfilename1 or die "...";
open my $out2, '>', $outfilename2 or die "...";
for my $line (keys %combined) {
my $substr1 = substr($line, 0, 10);
my $substr2 = substr($line, 20, 30);
if ($file1{$substr1} && $file2{$substr2}) {
print $out1 $file2{$substr2};
} else {
print $out2 $line;
}
}
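As a minimal sketch of the substring idea, the comparison can be kept in a function that works on in-memory lines, which makes it easy to try without the input files. The name split_by_key is hypothetical, and the sketch assumes both files use the same key position (offset 0, length 10, as in the question); adjust the offsets if the two files differ.

```perl
use strict;
use warnings;

# Hypothetical helper: classify the lines of file B by whether their key
# substring also occurs as a key substring of some line in file A.
# Assumption: the same $offset and $length apply to both files.
sub split_by_key {
    my ($lines_a, $lines_b, $offset, $length) = @_;

    # Record every key seen in file A.
    my %seen_in_a;
    for my $line (@$lines_a) {
        next if length($line) < $offset;    # line too short to hold a key
        $seen_in_a{ substr($line, $offset, $length) } = 1;
    }

    # Sort the full lines of file B into matched and unmatched.
    my (@matched, @unmatched);
    for my $line (@$lines_b) {
        my $key = length($line) >= $offset
            ? substr($line, $offset, $length)
            : undef;
        if (defined $key && $seen_in_a{$key}) {
            push @matched, $line;
        }
        else {
            push @unmatched, $line;
        }
    }
    return (\@matched, \@unmatched);
}

my ($matched, $unmatched) = split_by_key(
    [ "ABCDEFGHIJ first\n" ],
    [ "ABCDEFGHIJ other\n", "ZZZZZZZZZZ other\n" ],
    0, 10,
);
print "matched: $_"   for @$matched;
print "unmatched: $_" for @$unmatched;
```

Because the hash stores only keys from file A and the loop prints whole lines from file B, the output contains entire lines rather than just the substrings.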

This works for me
#!/usr/bin/perl
use warnings;
use autodie;
my %results;
my $f1 = shift || "CSP8216.TXT";
my $f2 = shift || "CSP8217.TXT";
open my $fh1, '>', 'file1';
open my $fh2, '>', 'file2';
open my $file1, '<', $f1;
while (my $line = <$file1>) {
my $sbs1 = substr($line, 0, 10);
$results{$sbs1} = 1
}
open my $file2, '<', $f2;
while (my $line = <$file2>) {
my $sbs2 = substr($line, 0, 10);
if (!$results{$sbs2}) {
$results{$sbs2} = 1;
}
$results{$sbs2}++
}
foreach my $line (sort { $results{$b} <=> $results{$a} } keys %results) {
if ($results{$line} > 1) {
print {$fh1} "$line";
}
else {
print {$fh2} "$line";
}
}

Related

Create infinite loop

This script is looping only once. I want to make it an infinite loop.
open( FILE, "<$ARGV[0]" );
if ( $LOOP == 1) {
while ( <FILE> ) {
if ( $. == $LOOP ) {
next
};
}
}
while ( <FILE> ) {
$LIST = $_;
print "$LIST";
}
I want to make it loop back to the first line, like this:
abc
abcd
abcde
abc
abcd
abcde
...
Just wrap what you want to loop with while (1) { ... }.
Cleaned up, you get:
my $qfn = $ARGV[0];
while (1) {
open(my $fh, '<', $qfn)
or die("Can't open \"$qfn\": $!\n");
while (<$fh>) {
print;
}
}
A very minor optimization is to avoid re-opening the file, seeking instead.
use Fcntl qw( SEEK_SET );
my $qfn = $ARGV[0];
open(my $fh, '<', $qfn)
or die("Can't open \"$qfn\": $!\n");
while (1) {
while (<$fh>) {
print;
}
seek($fh, 0, SEEK_SET);
}
Use seek to reset the position of the filehandle. eof is true once the last line has been read, i.e. when the next read would hit the end of the file.
open FILE,'<',$ARGV[0];
while (my $line = <FILE>) {
print $line;
if (eof) {
seek FILE,0,0;
}
}
You can try something like this...
my ( $filename ) = @ARGV;
while ( 1 )
{
open my $fh, '<', $filename or die $!;
my @lines = <$fh>;
close $fh;
print @lines;
}
It should open the file, read it in, close it, print contents and then repeat.
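For experimenting with the seek approach without an endless stream of output, here is a sketch that rewinds the handle a fixed number of times instead of looping forever. The function name cycle_lines and the pass limit are assumptions added for illustration; replacing the for loop with while (1) and printing directly gives the infinite version.

```perl
use strict;
use warnings;
use Fcntl qw( SEEK_SET );

# Hypothetical helper: read the whole file $passes times, rewinding with
# seek after each pass, and return all lines in the order they were read.
sub cycle_lines {
    my ($fh, $passes) = @_;
    my @out;
    for (1 .. $passes) {
        while (my $line = <$fh>) {
            push @out, $line;
        }
        # Rewind to the start of the file for the next pass.
        seek($fh, 0, SEEK_SET) or die "Can't seek: $!";
    }
    return @out;
}

# Demonstration on an in-memory file, so no real file is needed.
my $text = "abc\nabcd\nabcde\n";
open(my $fh, '<', \$text) or die "Can't open in-memory file: $!";
print cycle_lines($fh, 2);
```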

How can I remove the patterns that are matched from the original file

How can I remove the matched patterns that I print to the output file from the original files? I used the script below to print all matches to FILE6:
open FILE4,'<RM' or die $!;
open FILE5,'<OR' or die $!;
open FILE6, '>Compare3.txt' or die $!;
my @array3 = <FILE4>;
my @array4 = <FILE5>;
foreach $x (@array3) {
if($x =~ /(\S+) (\d+) (\S+)/) {
$temp_allreg = $3;
foreach $y (@array4) {
if($y =~ /\b$temp_allreg\b/i) {
print FILE6 "$x$y\n";
}
}
}
}
With the code above, I print the matched data from the two files to the output file, FILE6. How can I modify this code to also remove the matched data printed to FILE6 from the original files, FILE4 and FILE5, so that they keep only the patterns/data that did not match?
use strict;
use warnings;
my $rm_qfn = 'RM';
my $or_qfn = 'OR';
my $out_qfn = 'Compare3.txt';
open(my $rm_fh, '<', $rm_qfn)
or die("Can't open \"$rm_qfn\": $!\n");
open(my $or_fh, '<', $or_qfn)
or die("Can't open \"$or_qfn\": $!\n");
open(my $out_fh, '>', $out_qfn)
or die("Can't create \"$out_qfn\": $!\n");
open(my $out_rm_fh, '>', "$rm_qfn.tmp")
or die("Can't create \"$rm_qfn.tmp\": $!\n");
open(my $out_or_fh, '>', "$or_qfn.tmp")
or die("Can't create \"$or_qfn.tmp\": $!\n");
chomp( my @ors = <$or_fh> );
my @matched_ors;
while (my $rm = <$rm_fh>) {
chomp($rm);
my $matched_rm = 0;
if (my ($all_reg) = $rm =~ /\S+ \d+ (\S+)/) {
for my $or_idx (0..$#ors) {
my $or = $ors[$or_idx];
if ($or =~ /\b\Q$all_reg\E\b/i) {
++$matched_rm;
++$matched_ors[$or_idx];
print($out_fh "$rm$or\n");
}
}
}
if (!$matched_rm) {
print($out_rm_fh "$rm\n");
}
}
close($rm_fh);
close($or_fh);
for my $or_idx (0..$#ors) {
if (!$matched_ors[$or_idx]) {
my $or = $ors[$or_idx];
print($out_or_fh "$or\n");
}
}
close($out_rm_fh);
close($out_or_fh);
rename("$rm_qfn.tmp", $rm_qfn)
or die("Can't rename \"$rm_qfn.tmp\" to \"$rm_qfn\": $!\n");
rename("$or_qfn.tmp", $or_qfn)
or die("Can't rename \"$or_qfn.tmp\" to \"$or_qfn\": $!\n");
Do it in the same loop?
use File::Slurp qw(read_file write_file);
my $data = read_file $filename, {binmode => ':utf8'};
foreach $x (@array3) {
if($x =~ /(\S+) (\d+) (\S+)/) {
$temp_allreg = $3;
foreach $y (@array4) {
if($y =~ /\b$temp_allreg\b/i){
#print to output file
print FILE6 "$x$y\n";
#remove from input file
$data =~ s/$x$y//g;
write_file $filename, {binmode => ':utf8'}, $data;
}
}
}
}
Since you can't use File::Slurp module, you can consider writing those subroutines yourself.
sub read_file {
my ($filename) = @_;
open my $in, '<:encoding(UTF-8)', $filename or die "Could not open '$filename' for reading $!";
local $/ = undef;
my $all = <$in>;
close $in;
return $all;
}
sub write_file {
my ($filename, $content) = @_;
open my $out, '>:encoding(UTF-8)', $filename or die "Could not open '$filename' for writing $!";
print $out $content;
close $out;
return;
}
See: How to replace a string in a file [perlmaven]

Find and replace a string in a text file in Perl

I really have no idea how to do this. I tried a lot of methods and I don't know why none of them work. Here is a sample of what I tried:
{
open my $fh1, '<', 'hex1.txt';
open my $fh2, '<', 'hex2.txt';
until ( eof $fh1 or eof $fh2 ) {
my @l1 = map hex, split //, <$fh1>;
my @l2 = map hex, split //, <$fh2>;
my $n = @l2 > @l1 ? @l2 : @l1;
my @sum = map {
no warnings 'uninitialized';
$l1[$_] + $l2[$_];
} 0 .. $n - 1;
@sum = map { sprintf '%X', $_ } @sum;
open my $out, '>', 'hexsum.txt';
print {$out} @sum, "\n";
}
close $fh1;
close $fh2;
}
{
open my $IN, "<", 'hexsum.txt';
open my $OUT, ">", 'sym.txt';
while ( my $linie = <$IN> ) {
$linie =~ s/40/20/g;
print $OUT $linie;
}
close $IN;
close $OUT;
}
{
my $input = do {
open my $in, '<', 'hexsumspace.txt';
local $/;
<$in>;
};
open my $out, '>', 'sym.txt';
print $out pack 'H*', $input;
}
How can I replace the value 40 with the value 20 everywhere it occurs?
use strict;
use warnings;
open my $OUT, ">", 'output.txt';
open my $IN, "<", 'input.txt';
while (my $line = <$IN>) {
$line =~ s/40/20/g;
print $OUT $line;
}
close $IN;
close $OUT;
Here's a Perl one-liner from the command line:
perl -pe 's/40/20/g' input.txt > output.txt
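Note that s/40/20/g replaces the characters 40 wherever they occur, so 140 becomes 120 as well. If the values are separate tokens (an assumption about the data, not something the question states), a word-boundary version is a possible sketch. The name replace_token is hypothetical, and it assumes the values consist of word characters such as digits and letters.

```perl
use strict;
use warnings;

# Hypothetical helper: replace $from with $to only where $from stands as a
# whole token, not where it is part of a longer run of word characters.
sub replace_token {
    my ($text, $from, $to) = @_;
    # \Q...\E quotes any regex metacharacters in $from;
    # \b anchors the match at word boundaries.
    $text =~ s/\b\Q$from\E\b/$to/g;
    return $text;
}

print replace_token("40 140 40\n", '40', '20');
```

With the plain s/40/20/g from the answers, the middle value would be rewritten too.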

Perl compilation error

Sorry if this seems obvious, but I'm pretty new to Perl and programming; I've been working on this for over a week and can't get it done.
My idea is simple. I've got a .csv with the names in the first column, a number from -1 to 1 in the second, and a position in the third. Then I have another file with the names (each name line starts with >) and the info, with 80 characters per line.
What I want to do is keep the name lines of the first file and grab the region from -20 to +60 around the given 'position'. But I cannot get it to work, and I've got to the point where I don't know how to continue.
use strict; #read file line by line
use warnings;
my $outputfile = "Output1.txt";
my $filename = "InputP.txt";
my $inputfasta = "Inputfasta.txt";
open my $fh, '<', $filename or die "Couldn't open '$filename'";
open my $fh2, '>', $outputfile or die "Couldn't create '$outputfile'";
open my $fh3, '<', $inputfasta or die "Couldn't open '$inputfasta'";
my $Psequence = 0;
my $seqname = 0;
while (my $line = <$fh>) {
chomp $line;
my $length = index ($line, ",");
$seqname = substr ($line, 0, $length);
my $length2 = index ($line, ",", $length);
my $score = substr ($line, $length +1, $length2);
my $length3 = index ($line, ",", $length2);
my $position = substr ($line, $length2 +1, $length3);
#print $fh2 "$seqname"."\t"."$score"."\t"."$position"."\n"; }
my $Rlength2 = index ($score, ",");
my $Rscore = substr ($score, 0, $Rlength2);
#print "$Rscore"."\n";}
while (my $linea = <$fh3>){ #same order.
chomp $linea;
if ($linea=~/^>(.+)/) {
print $fh3 "\n"."$linea"."\n"; }
else { $linea =~ /^\s*(.*)\s*$/;
chomp $linea;
print $fh3 "$linea". "\n"; }
}
if ($Rscore >= 0.5){
$Psequence = substr ($linea, -20, 81);
print "$seqname"."\n"."$Psequence";}
}
Please, learn to indent the code correctly. Then the error will be more obvious:
while (my $linea = <$fh3>){ #same order.
chomp $linea;
if ($linea =~ /^>(.+)/) {
print $fh3 "\n$linea\n";
} else {
# Commented out as it does nothing.
# $linea =~ /^\s*(.*)\s*$/;
# chomp $linea;
print $fh3 "$linea\n";
}
}
if ($Rscore >= 0.5){
$Psequence = substr $linea, -20, 81;
print "$seqname\n$Psequence";
}
$linea exists only in the while loop, but you try to use it in the following paragraph, too. The variable disappears when the loop ends.
Create a hash from the CSV where the key is the name and the value is the position.
use Text::CSV_XS qw( );
my %pos_by_name;
{
open(my $fh, '<', $input_qfn)
or die("Can't open $input_qfn: $!\n");
my $csv = Text::CSV_XS->new({ auto_diag => 1, binary => 1 });
while (my $row = $csv->getline($fh)) {
$pos_by_name{ $row->[0] } = $row->[2];
}
}
Then, it's just a question of extracting the names from the other file, and using the hash to find the associated position.
open(my $fh, '<', $fasta_qfn)
or die("Can't open $fasta_qfn: $!\n");
while (<$fh>) {
chomp;
my ($name) = /^>(.*)/
or next;
my $pos = $pos_by_name{$name};
if (!defined($pos)) {
die("Can't find position for $name\n");
}
# ... Do something with $name and $pos ...
}
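For the "-20 to +60" extraction itself, note that a negative offset in substr counts from the end of the string, so substr($linea, -20, 81) does not mean "20 characters before the position". A minimal sketch of the window follows, under two assumptions that the question does not confirm: the position is 1-based, and the 80-character lines of each record have already been joined into one string. The name window is hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: return the part of $seq from $before characters
# before position $pos to $after characters after it, clamped to the
# bounds of the string. Assumption: $pos is 1-based.
sub window {
    my ($seq, $pos, $before, $after) = @_;

    my $start = $pos - 1 - $before;     # 0-based index of first wanted char
    $start = 0 if $start < 0;

    my $end = $pos - 1 + $after;        # 0-based index of last wanted char
    $end = length($seq) - 1 if $end > length($seq) - 1;

    return '' if $end < $start;         # position lies outside the sequence
    return substr($seq, $start, $end - $start + 1);
}

# Position 5 is "E"; two characters before and three after it.
print window("ABCDEFGHIJ", 5, 2, 3), "\n";
```

For the question's case the call would be window($sequence, $pos, 20, 60), with $sequence being the joined record for the matching name.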

How can I print lines from a file to separate files

I have a file which has lines like this:
1 107275 447049 scaffold1443 465 341154 -
There are several lines that start with 1; after them a blank line acts as a separator, then come the lines that start with 2, and so on.
I want to separate these lines to different files based on their number.
I wrote this script but it prints in every file only the first line.
#!/usr/bin/perl
#script for choosing chromosome
use strict;
my $filename= $ARGV[0];
open(FILE, $filename);
while (my $line = <FILE>) {
my @data = split('\t', $line);
my $length = @data;
#print $length;
my $num = $data[0];
if ($length == 6) {
open(my $fh, '>', $num);
print $fh $line;
}
$num = $num + 1;
}
Please, I need your help!
Use >> to open the file for appending, since > always truncates the file to zero bytes:
use strict;
my $filename = $ARGV[0];
open(my $FILE, "<", $filename) or die $!;
while (my $line = <$FILE>) {
my @data = split('\t', $line);
my $length = @data;
#print $length;
my $num = $data[0];
if ($length == 6) {
open(my $fh, '>>', $num);
print $fh $line;
}
$num = $num + 1;
}
If I understand your question correctly, then paragraph mode might be useful. This breaks a record on two or more new-lines, instead of just one:
@ARGV or die "Supply a filename\n";
my $filename= $ARGV[0];
local $/ = ""; # Set paragraph mode
open(my $file, $filename) or die "Unable to open '$filename' for read: $!";
while (my $lines = <$file>) {
my $num = (split("\t", $lines))[0];
open(my $fh, '>', $num) or die "Unable to open '$num' for write: $!";
print $fh $lines;
close $fh;
}
close $file;
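Another option, sketched here under the assumption that the fields are tab-separated as in the question's split, is to read line by line and keep one output handle per number in a hash. Each file is then opened only once with >, so a rerun starts from empty files rather than appending to old output. The name split_by_number and the $out_dir parameter are hypothetical.

```perl
use strict;
use warnings;

# Hypothetical helper: write each non-blank line read from $in_fh to a file
# in $out_dir named after the line's first tab-separated field. Returns the
# sorted list of file names that were written.
sub split_by_number {
    my ($in_fh, $out_dir) = @_;
    my %out_fh;    # first field => open output handle

    while (my $line = <$in_fh>) {
        next unless $line =~ /\S/;    # skip the blank separator lines
        my ($num) = split /\t/, $line;

        # Open each output file once, the first time its number is seen.
        if (!$out_fh{$num}) {
            open($out_fh{$num}, '>', "$out_dir/$num")
                or die "Unable to open '$out_dir/$num' for write: $!";
        }
        print { $out_fh{$num} } $line;
    }

    close $_ for values %out_fh;
    my @names = sort keys %out_fh;
    return @names;
}

# Usage (assumes the current directory is writable):
#   open(my $in, '<', $ARGV[0]) or die "Unable to open '$ARGV[0]': $!";
#   split_by_number($in, '.');
```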