Perl: matching data in two files - perl

I would like to match and print data from two files (File1.txt and File2.txt). Currently, I'm trying to match the first letter of the second column in File1 to the first letter of the third column in File2.txt.
File1.txt
1 H 35
1 C 22
1 H 20
File2.txt
A 1 HB2 MET 1
A 2 CA MET 1
A 3 HA MET 1
OUTPUT
1 MET HB2 35
1 MET CA 22
1 MET HA 20
Here is my script, I've tried following this submission: In Perl, mapping between a reference file and a series of files
#!/usr/bin/perl
use strict;
use warnings;
my %data;
open (SHIFTS,"file1.txt") or die;
open (PDB, "file2.txt") or die;
while (my $line = <PDB>) {
chomp $line;
my #fields = split(/\t/,$line);
$data{$fields[4]} = $fields[2];
}
close PDB;
while (my $line = <SHIFTS>) {
chomp($line);
my #columns = split(/\t/,$line);
my $value = ($columns[1] =~ m/^.*?([A-Za-z])/ );
}
print "$columns[0]\t$fields[3]\t$value\t$data{$value}\n";
close SHIFTS;
exit;

Here's one way using split() hackery:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $f1 = 'file1.txt';
my $f2 = 'file2.txt';
my #pdb;
open my $pdb_file, '<', $f2
or die "Can't open the PDB file $f2: $!";
while (my $line = <$pdb_file>){
chomp $line;
push #pdb, $line;
}
close $pdb_file;
open my $shifts_file, '<', $f1
or die "Can't open the SHIFTS file $f1: $!";
while (my $line = <$shifts_file>){
chomp $line;
my $pdb_line = shift #pdb;
# - inner split: get the third element from the $pdb_line
# - outer split: get the first element (character) from the
# result of the inner split
my $criteria = (split('', (split('\s+', $pdb_line))[2]))[0];
# - compare the 2nd element of the file1.txt line against
# the above split() operations
if ((split('\s+', $line))[1] eq $criteria){
print "$pdb_line\n";
}
else {
print "**** >$pdb_line< doesn't match >$line<\n";
}
}
Files:
file1.txt (note I changed line two to ensure a non-match worked):
1 H 35
1 A 22
1 H 20
file2.txt:
A 1 HB2 MET 1
A 2 CA MET 1
A 3 HA MET 1
Output:
./app.pl
A 1 HB2 MET 1
****>A 2 CA MET 1< doesn't match >1 A 22<
A 3 HA MET 1

Related

Join element in an array and separate with space

I want to join the first to 16th word and 17th to 31st, etc in an array with space to one line but do not know why the code does not work. Hope to get help here.Thanks
my #file = <FILE>;
for ( $i=0; $i<=$#file; $i+=16 ){
my $string = join ( " ", #file[$i..$i+15] );
print FILE1 "$string\n";
}
Below is part of my file.
1
2
3
...
What i wan to print is
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16
17 18 19 20 21....
I wouldn't do it the way you've done it.
Instead I would:
open ( my $input, '<', "your_file_name" ) or die $!;
chomp ( my #file = <$input> );
print join (" ",splice (#file, 0, 15)),"\n" while #file;
Note - I've used a lexical file handle with a 3 argument open, because that's better style.
splice removes the first 16 elements from #file each iteration, and continues until #file is empty.
Your lines have newlines attached to them. Remove them with chomp. Then loop over the array, remove 16 items and print them.
my #file = <FILE>;
chomp #file;
while (#file) {
my #temp;
INNER: for ( 0 .. 15 ) {
push #temp, shift #file || last INNER; # not or
}
print join( q{ }, #temp ), "\n";
}
This is the long implementation of the splice solution Sobrique suggested in the comments. It's does the same thing, just way more verbose.
This is the old answer before the edit:
If you only want the first 16, this is way more effective.
my $string = join q{ }, map { <FILE>; chomp; $_ } 1 .. 16;
This reads 16 lines and chomp each of them, then joins.
You might also want to use lexical file handles $fh instead of the GLOB FILE.
open my $fh, '<', $path_to_file or die $!;
Suppose if you want to read it from file, don't store the whole file into an array. Instead loop through line by line. And check the line number with $. special variable.
use warnings;
use strict;
open my $fh,"<","input.txt";
my $line;
while (<$fh>)
{
chomp;
$line.= $_." ";
print "$line\n" and $line="" if($. % 16 == 0);
END{ print "$line\n";};
}
Or this also will work
use warnings;
use strict;
open my $wh,"<","input.txt";
my $line;
foreach (;;)
{
my $data = join " ",(map { my $m=<$wh> || ""; chomp($m); $m} (0..15));
last if ($data =~m/^\s+$/);
print $data,"\n";
}
Assuming that you have FILE and FILE1 descriptors open, try:
$.%16?s!\s+! !:1 and print FILE1 $_ while <FILE>;

Perl: Iterating through large hash, runs out of memory

I have been trying to find values that match between two columns (columns a and column b) of a large file and print the common values, plus the corresponding column d. I have been doing this by interating through hashes, however, because the file is so large, there is not enough memory to produce the output file. Is there any other way to do the same thing using less memory resources.
Any help is much appreciated.
The script I have written thus far is below:
#!usr/bin/perl
use warnings;
use strict;
open (FILE1, "<input.txt") || die "$!\n Couldn't open input.txt\n";
open (Output, ">output.txt")||die "Can't Open output.txt ";
my $hash1={};
my $hash2={};
while (<FILE1>) {
chomp (my $line=$_);
my ($a, $b, $c, $d) = split (/\t/, $line);
if ($a) {
$hash1->{$a}{info1} = "$d"; #original_ID-> YOB
}
if ($b) {
$hash2->{$b}{info2} = "$a"; #original_ID-> sire
}
foreach my $key (keys %$hash2) {
if (exists $hash1{$a}) {
$info1 = $hash1->{$a}->{info1};
print "$a\t$info1\n";
}
}
}
close FILE1;
close Output;
print "Done\n";
To clarify, the input file is a large pedigree file. An example is:
1 2 3 1977
2 4 5 1944
3 4 5 1950
4 5 6 1930
5 7 6 1928
An example of the output file is:
2 1944
4 1950
5 1928
Does the below work for you ?
#!/usr/local/bin/perl
use strict;
use warnings;
use DBM::Deep;
use List::MoreUtils qw(uniq);
my #seen;
my $db = DBM::Deep->new(
file => "foo.db",
autoflush => 1
);
while (<>) {
chomp;
my #fields = split /\s+/;
$$db{$fields[0]} = $fields[3];
push #seen, $fields[1];
}
for (uniq #seen) {
print $_ . " " . $$db{$_} . "\n" if exists $$db{$_};
}

Perl: Search and Replace

I'm trying to improve my script in which I hope to match characters in input.txt (column 4: H1, 2HB, CA, HB3) to dictionary.txt and replace with appropriate characters from dictionary.txt (column 2: H, HB, C, 3HB). Using dictionary.txt as a dictionary:
input.txt
1 N 22 H1 MET
1 H 32 2HB MET
1 C 40 CA MET
2 H 35 HB3 ASP
dictionary.txt
MET H H1
MET HB 2HB
MET C CA
ASP 3HB HB3
output
1 N 22 H MET
1 H 32 HB MET
1 C 40 C MET
2 H 35 3HB ASP
I'm trying to approach this by first matching the word in input.txt (MET) and dictionary.txt (MET) and then performing the substitution. This is what I've written so far:
#!/usr/bin/perl
use strict;
use warnings;
my %dictionary;
open my $dic_fh, '<', 'dictionary.txt' or die "Can't open file: $!";
while (my $ref = <$dic_fh>) {
chomp $ref;
my #columns = split(/\t/, $ref);
my $res_name = $columns[0];
my $ref_nuc = $columns[1];
$dictionary{$res_name} = {$ref_nuc};
open my $in_fh, '<', 'input.txt' or die "Can't open file: $!";
while (my $line = <$in_fh>) {
chomp $line;
my #columns = split(/\t/, $line);
my #name = $columns[3];
if (my $name eq $res_name) {
my $line = $_;
foreach my $res_name (keys %dictionary) {
$line =~ s/$name/$dictionary{$ref_nuc}/;
}
print $line;
}
}
}
The problem seems to be that you are assigning the single field $columns[3] to array #name, and then expecting to find it in $name, which is a separate variable altogether. You even declare $name at the point of the comparison
You are also executing the statement
$line =~ s/$name/$dictionary{$ref_nuc}/;
once for each key in the hash. That is unnecessary: it needs to be done only once. It is also better to change the value of $columns[3] to $dictionary{$columns[3]} instead of doing a search and replace on the whole line, as the target string may appear in other columns that you don't want to modify
It is very simple to do by building a dictionary hash and replacing the fourth field of the input file with its dictionary lookup
use strict;
use warnings;
use 5.010;
use autodie;
open my $fh, '<', 'dictionary.txt';
my %dict;
while ( <$fh> ) {
my ($k, $v) = (split)[2,1];
$dict{$k} = $v;
}
open $fh, '<', 'input.txt';
while ( <$fh> ) {
my #fields = split;
$fields[3] = $dict{$fields[3]};
say join "\t", #fields;
}
output
1 N 22 H MET
1 H 32 HB MET
1 C 40 C MET
2 H 35 3HB ASP

perl script to find matching lines in two files

I have two files that look like (below) and want to find the fields from the first in the second file, but print every field of the second.
#rs116801199 720381
#rs138295790 16057310
#rs131531 16870251
#rs131546 16872281
#rs140375 16873251
#rs131552 16873461
and
#--- rs116801199 720381 0.026 0.939 0.996 0 -1 -1 -1
#1 rs12565286 721290 0.028 1.000 1.000 2 0.370 0.934 0.000
#1 rs3094315 752566 0.432 1.000 1.000 2 0.678 0.671 0.435
#--- rs3131972 752721 0.353 0.906 0.938 0 -1 -1 -1
#--- rs61770173 753405 0.481 0.921 0.950 0 -1 -1 -1
My script looks like:
#! perl -w
my $file1 = shift#ARGV;
my #filtered_snps;
open (IN, $file1) or die "couldn't read file one";
while(<IN>){
my#L=split;
#next if ($L[0] =~ m/peak/);
push #filtered_snps,[$L[0],$L[1]];
}
close IN;
my $file2 = shift#ARGV;
my #snps;
open (IN, $file2);
while (<IN>){
my#L=split;
foreach (#filtered_snps){
if (($L[1] eq ${$_}[0]) && ($L[2] == ${$_}[1])) {
print "#L\n";
next;
}
}
}
I am getting no output, when I should be finding every line from file 1. I've also tried grep with no success.
In first while you are assigning to wrong array, you meant #L here.
Then you have pretty different strings in your first array (from first file) and in other. Try to print them both out in your for-iteration. You'll see they can't match.
Create a hash table of the items from the first file, then iterate over the second file and check if that rs-name exists... I'm also confirming that the number matches the name.
use strict;
use warnings;
my %hash;
my $regex = qr/#.* *(rs\d+) (\d+) *.*/;
open my $file1, '<', shift #ARGV;
while (<$file1>) {
my ($name, $num) = $_ =~ $regex;
$hash{$name} = $num;
}
close $file1;
open my $file2, '<', shift #ARGV;
while (<$file2>) {
my ($name, $num) = $_ =~ $regex;
print if (exists $hash{$name} and $hash{$name} = $num)
}
close $file2;

Matching a line and printing lines above

Code :
#!/usr/bin/perl
my $file = $ARGV[0];
my $position = $ARGV[1]; # POSITION OF THE RESIDUE
open (FILE, $file);
while (<FILE>) {
my #f = split;
if (($f[0] == "ANNOT_RESID_NO") && ($f[1] == $position)){
push #line, $_;
}
}
print #line;
close(FILE);
INPUT :
ANNOT_TYPE[1] 0
ANNOT_TYPE_NAME[1] CATRES
ANNOT_NUMBER[1][1] 1
ANNOT_NAME[1][1] 3.1.3.16
ANNOT_DESC[1][1] Phosphoprotein phosphatase.
ANNOT_RESID_NO[1][1][1] 91
ANNOT_RESID_NAME[1][1][1] ASP
ANNOT_RESID_NUM[1][1][1] 95
ANNOT_RESID_NO[1][1][2] 92
ANNOT_RESID_NAME[1][1][2] ARG
ANNOT_NRESID[1][1] 6
ANNOT_NUMBER[1][2] 2
ANNOT_NAME[1][2] 3.1.3.53
ANNOT_DESC[1][2] [Myosin-light-chain] phosphatase.
ANNOT_RESID_NO[1][2][1] 91
ANNOT_RESID_NAME[1][2][1] ASP
ANNOT_RESID_NUM[1][2][1] 95
ANNOT_RESID_NO[1][2][2] 92
ANNOT_RESID_NAME[1][2][2] ARG
Question :
I am printing the line with has $position(for example 91) for the line starting with "ANNOT_RESID_NO". Along with this line, I also want to print, every time, in #line is the first line above this match containing "ANNOT_DESC". This "ANNOT_DESC" line is not necessarily always the line just above my matched line.
Try (complete code):
#!/usr/bin/perl
use strict;
use warnings;
my $file = $ARGV[0];
my $position = $ARGV[1];
open (FILE, $file) or die $!;
my $desc;
my #line;
while (<FILE>) {
my #f = split " ";
if ( $f[0] =~ /^ANNOT_DESC/ ) {
$desc = $_;
next;
}
if ( $f[0] =~ /^ANNOT_RESID_NO/ and $f[1] == $position ) {
push #line, $desc, $_;
}
}
output:
ANNOT_DESC[1][1] Phosphoprotein phosphatase.
ANNOT_RESID_NO[1][1][1] 91
ANNOT_DESC[1][2] [Myosin-light-chain] phosphatase.
ANNOT_RESID_NO[1][2][1] 91
With a data set that small you can push the lines from the file to an array(e.g. #file_data) , iterate the #file_data array and push the values you want into your #line array.