Creating multiple hashes from multiple files in one go - perl

I want to perform a VLOOKUP-like process across multiple files, where the contents of the first column from all files (sorted and uniq-ed) are the reference values. I would like to store the key-value pairs from each file in its own hash and then print them together. Something like this:
file1: while(){$hash1{$key}=$val}...file2: while(){$hash2{$key}=$val}...file3: while(){$hash3{$key}=$val}...so on
Then print it: print "$ref_val $hash1{$ref_val} $hash2{$ref_val} $hash3{$ref_val}..."
$i=1;
@FILES = @ARGV;
foreach $file(@FILES)
{
open($fh,$file);
$hname="hash".$i; ##trying to create unique hash by attaching a running number to hash name
while(<$fh>){@d=split("\t");$hname{$d[0]}=$d[7];}$i++;
}
$set=$i-1; ##store this number for recreating the hash names during printing
open(FH,"ref_list.txt");
while(<FH>)
{
chomp();print "$_\t";
## here i run the loop recreating the hash names and printing its corresponding value
for($i=1;$i<=$set;$i++){$hname="hash".$i; print "$hname{$_}\t";}
print "\n";
}
Now this is where I am stuck: Perl takes $hname as the hash name instead of $hash1, $hash2, ...
Thanks in advance for any help and opinions.

The code shown attempts to use symbolic references to construct variable names at runtime. These can cause a lot of trouble and should not be used, except very occasionally in very specialized code.
Here is a way to read multiple files, each into a hash, and store them for later processing.
use warnings;
use strict;
use feature 'say';
use Data::Dump qw(dd);
my @files = @ARGV;
my @data;
for my $file (@files) {
open my $fh, '<', $file or do {
warn "Skip $file, can't open it: $!";
next;
};
push @data, { map { (split /\t/, $_)[0,6] } <$fh> };
}
dd \@data;
Each hash associates the first column with the seventh (index 6), as clarified, for each line. A reference to such a hash for each file, formed by { }, is added to the array.
Note that when you add a key-value pair to a hash which already has that key the new overwrites the old. So if a string repeats in the first column in a file, the hash for that file will end up with the value (column 7) for the last one. The OP doesn't discuss possible duplicates of this kind in data files (only for the reference file), please clarify if needed.
The Data::Dump is used only to print; if you don't wish to install it use core Data::Dumper.
I am not sure that I get the use of that "reference file", but you can now go through the array of hash references for each file and fetch values as needed. Perhaps like
open my $fh_ref, '<', $ref_file or die "Can't open $ref_file: $!";
while (my $line = <$fh_ref>) {
my $key = ... # retrieve the key from $line
print "$key: ";
foreach my $hr (@data) {
print "$hr->{$key} ";
}
say '';
}
This will print key: followed by values for that string, one from each file.
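Putting the two pieces together, a minimal self-contained sketch might look like the following. The sub names load_column_map and lookup_row, and the in-memory sample data, are made up for illustration; the column indices (0 and 6) follow the answer above.

```perl
use strict;
use warnings;

# Hypothetical helper: build one hash (first column => chosen column)
# from a tab-separated file handle.
sub load_column_map {
    my ($fh, $value_idx) = @_;
    my %map;
    while (my $line = <$fh>) {
        chomp $line;
        my @fields = split /\t/, $line;
        $map{ $fields[0] } = $fields[$value_idx];
    }
    return \%map;
}

# Hypothetical helper: one output row for a key, with an empty string
# where a file has no entry for that key.
sub lookup_row {
    my ($key, @maps) = @_;
    return join "\t", $key, map { $_->{$key} // '' } @maps;
}

# Two in-memory "files" stand in for the real inputs (sample data only).
my @inputs = (
    "k1\ta\tb\tc\td\te\tv1\nk2\ta\tb\tc\td\te\tv2\n",
    "k1\ta\tb\tc\td\te\tw1\n",
);
my @data;
for my $content (@inputs) {
    open my $fh, '<', \$content or die "Can't open in-memory file: $!";
    push @data, load_column_map($fh, 6);
    close $fh;
}

print lookup_row($_, @data), "\n" for qw(k1 k2);
```

Keeping the hashes in an array means the number of input files never has to appear in a variable name, which is what the symbolic-reference attempt was working around.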

Related

How to use perl to parse a text file

I am new to Perl.
I have a text file that contains 2 columns:
lib1 cell1
lib1 cell2
lib2 cell3
lib2 cell1
I would like to use Perl to find names that are duplicated in column 2, then print the corresponding names from column 1.
In this text cell1 is repeated 2 times.
I would like to have a report something like:
cell1 found in lib1 lib2
I use the code below to read and open the file
#!/usr/bin/env perl
use strict;
use warnings;
for my $file ( @ARGV ){
open my $in_fh, '<', $file or die "could not open $file: $!\n";
while( my $line = <$in_fh> ){
chomp( $line );
print "$line\n"
}
}
But I don't know how to find the duplicated names in the second column and print the first column.
There are a few things here that Perl can do for you.
First, Perl will handle opening and reading the files you specify on the command line with the empty readline operator (and here I'm using the safer double diamond version introduced in v5.22):
use v5.22;
while( <<>> ) {
...
}
Then, you can track what you've seen with a hash. Extract the columns, and use the interesting column as the key in the hash. Here I post-increment its value. On the first go around, the post-increment returns 0 (then increases the value by 1), so the conditional is false the first time. The next time it sees that same key, the value is true, so it warns:
use v5.22;
my %Seen;
while( <<>> ) {
chomp;
my( $first, $second ) = split;
if( $Seen{$second}++ ) {
warn "Duplicated second column! Line $.\n";
}
}
The hash is a great way to track things that are strings instead of positions.
Now, you want to know which values in the first column appear with each value in the second. You could get a bit more fancy with that hash and make another level in the hash to store the first column. Perl automatically takes care of the details for you (and we have extended examples of this in Intermediate Perl).
First, accumulate the data in the hash:
use v5.22;
my %Seen;
while( <<>> ) {
chomp;
my( $first, $second ) = split;
$Seen{$second}{$first}++;
}
Once you have the hash, you move on to the second step of reporting the data. All the values of the second column are the top level keys for the hash. With that key, get the second level of the hash, and get those keys, which are the first column:
foreach my $second ( keys %Seen ) {
my @firsts = keys %{ $Seen{$second} };
say "$second found in @firsts";
}
With v5.24's postfix dereferencing, that's slightly cleaner since the dereference reads left to right rather than inside out:
use v5.24;
foreach my $second ( keys %Seen ) {
my @firsts = keys $Seen{$second}->%*;
say "$second found in @firsts";
}
And, since the hash keys in the second level only appear once per value, you don't have duplicates.
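The question asked only for names that occur under more than one library, while the loop above reports every name. One possible way to narrow the report, sketched here with a hypothetical sub duplicated_cells and the sample rows from the question, is to keep only second-level hashes that have more than one key:

```perl
use strict;
use warnings;

# Hypothetical helper: given lines of "lib cell", return a hash mapping
# each cell that occurs under more than one lib to its sorted list of libs.
sub duplicated_cells {
    my @lines = @_;
    my %seen;
    for my $line (@lines) {
        my ($lib, $cell) = split ' ', $line;
        next unless defined $cell;      # skip blank or short lines
        $seen{$cell}{$lib}++;
    }
    my %dups;
    for my $cell (keys %seen) {
        my @libs = sort keys %{ $seen{$cell} };
        $dups{$cell} = \@libs if @libs > 1;
    }
    return %dups;
}

my %dups = duplicated_cells("lib1 cell1", "lib1 cell2", "lib2 cell3", "lib2 cell1");
print "$_ found in @{ $dups{$_} }\n" for sort keys %dups;
```

With the four sample rows this prints "cell1 found in lib1 lib2", which is the report format requested in the question.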

Print different value each time key is called

I need some help with Perl.
I have a list of IDs & corresponding values in a file. Each ID acts as a key in a hash of hashes so there are multiple values for each key. I'm trying to open a second file & assign a different value each time the key is encountered. Here is what I have so far:
This code takes the input file & builds the hash of hashes. $prot is the key & $dir is the value. Each key has multiple values.
open (IN, "file_name");
while (<IN>)
{
($prot, $dir) = split;
push (@{$dir{$prot}}, $dir );
}
In the second part of the code, I would like to read each line of the file & assign a different value using the first column in the line as the key. Each key will appear multiple times in the second file & for each instance I would like it to print a different value.
open (FH, "results_file");
while (<FH>)
{
chomp;
@a=split;
$prot=$a[1];
foreach (values %dir)
{print "$a[1]"."\t"."@{$dir{$prot}}"."\n";}
}
Right now the way the code is written it prints all the values for each key when it encounters the key.
Thanks so much for any help that can be offered!
Edit:
The first input file is something along the lines of
BC_123456 dir_6789
BC_456789 dir_3456
BC_234689 dir_1298
BC_123456 dir_3987
BC_876432 dir_7642
Each ID acts as a key in a hash of hashes
You actually have a hash of arrays there.
Assuming you want to print the first value for the first instance, second for the second instance, and so on, you can just shift off the values for each key you encounter:
open (FH, "results_file");
while (<FH>)
{
chomp;
@a=split;
$prot=$a[1];
# take the next unused value for this key; no loop over %dir is needed
my $val = shift @{$dir{$prot}};
print "$a[1]\t$val\n";
}
This will remove one value from the HoA entry, assuming you don't need to use that array afterwards.
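As a small self-contained illustration of that idea, the sketch below builds the hash of arrays from a few of the sample rows and hands out one value per request. The sub name next_dir_for and the 'no value left' fallback text are assumptions made for the example, not part of the original code.

```perl
use strict;
use warnings;

# Hash of arrays built from sample rows in the question.
my %dirs;
for my $line ("BC_123456 dir_6789", "BC_456789 dir_3456", "BC_123456 dir_3987") {
    my ($prot, $dir) = split ' ', $line;
    push @{ $dirs{$prot} }, $dir;
}

# Hypothetical helper: hand out the next unused value for a key,
# or undef when the key is unknown or its values are used up.
sub next_dir_for {
    my ($hoa, $prot) = @_;
    return undef unless exists $hoa->{$prot};   # also avoids autovivifying the key
    return shift @{ $hoa->{$prot} };
}

# The same key requested three times: two stored values, then the fallback.
for my $prot (qw(BC_123456 BC_123456 BC_123456)) {
    my $dir = next_dir_for(\%dirs, $prot);
    print "$prot\t", $dir // 'no value left', "\n";
}
```

If the arrays are needed again later, an index counter per key would be a non-destructive alternative to shift.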
I think this gets your code working. There are some best practices to note:
You should always run under strict & warnings.
use strict;
use warnings;
I recommend the three argument version of open(), with an error check (or die "message $!";), and a lexical variable to store the filehandle.
See:
perldoc -f open
perldoc perlopentut
Close your files when done using them.
Variables should be introduced (declared) with my unless you have a good reason to use something else.
I also made a couple of changes that I recommend but that aren't strictly necessary.
Removed the variable @a because you did not actually need it.
Cleaned up your print because it was hard to read. You could also try printf.
You have two variables named dir (%dir and $dir), which I find confusing in this case, so I renamed %dir to %dirs.
CODE:
use strict;
use warnings;
my %dirs;
# Part 1 - Input
my $filename_input = "file_name.txt";
open(my $IN,'<',$filename_input) or die "Unable to open [$filename_input] for reading - $!";
while(<$IN>) {
my ($prot, $dir) = split;
push @{$dirs{$prot}}, $dir;
}
close $IN;
# Part 2 - Output
my $filename_results = "results_file.txt";
open(my $RESULTS,'<',$filename_results) or die "Unable to open [$filename_results] for reading - $!";
while(<$RESULTS>) {
chomp;
my $prot = (split)[1];
foreach (values %dirs) {
print "$prot\t@{$dirs{$prot}}\n"; # Or try: printf "%s\t%s\n",$prot,"@{$dirs{$prot}}";
}
}
close $RESULTS;
file_name.txt
BC_123456 dir_6789
BC_456789 dir_3456
BC_234689 dir_1298
BC_123456 dir_3987
BC_876432 dir_7642
results_file.txt
don'tcare BC_123456
don'tcare BC_234689

identify and insert the missing rows

An array is populated from a tab delimited text (5 column) file that sometimes is missing rows. I need to identify and insert the missing rows. Inserting a string "blank row found" is sufficient.
Here is an example of data from file:
chr1:11174372 MTOR 42939 42939 7
chr1:65310459 JAK1 1948 1948 3
I’ve created an array of elements that identifies the second column of each row that should be present in the file, in the order each row should be present. However, I'm not sure how to continue from here, since I'm unable to install any Perl modules on the server (e.g. Arrays::Utils).
Is comparing arrays the correct way of approaching this problem? Perhaps there is a straightforward solution, that doesn’t require installation of any CPAN modules? Thanks for your help.
#!perl
use strict;
use warnings;
use File::Basename;
#use Arrays::Utils;
opendir my $dir, "/data/test_all_runs" or die "Cannot open directory: $!";
my @run_folder = readdir $dir;
closedir $dir;
my $run_folder = pop @run_folder; print "The folder is".$run_folder."\n";
my $home="/data/";
my $CNV_file = $home."test_all_runs/".$run_folder."/CNV.txt";
my @CNVarray;
open(TXT2, "$CNV_file");
while (<TXT2>){
push (@CNVarray, $_);
}
close(TXT2);
foreach (@CNVarray){
chop($_);
}
my @array1 = map { $_->[1] } @CNVarray;
my @array2 = qw(MTOR JAK1 NRAS DDR2 MYCN ALK IDH1 ERBB4 RAF1 CTNNB1 PIK3CA DCUN1D1 FGFR3 PDGFRA KIT APC FGFR4 ROS1 ESR1 EGFR CDK6 MET SMO BRAF FGFR1 MYC JAK2 GNAQ RET FGFR2 HRAS CCND1 BIRC2 KRAS ERBB3 CDK4 AKT1 MAP2K1 IDH2 NF1 ERBB2 BRCA1 GNA11 MAP2K2 JAK3 AR MED12);
my %array1_hash;
my %array2_hash;
# Create a hash entry for each element in @array1
for my $element ( @array1 ) {
$array1_hash{$element} = @array1;
}
# Same for @array2: This time, use map instead of a loop
map { $array_2{$_} = 1 } @array2;
for my $entry ( @array2 ) {
if ( not $array1_hash{$entry} ) {
return 1; #Entry in @array2 but not @array1: Differ
}else {
return 0; #Arrays contain the same elements
}
#if ( keys %array_hash1 != keys %array_hash2 ) {
#return 1; #Arrays differ
}
Note: the best version is reached at the end; it is only a few lines of code.
If I get it right, you have a separate reference list of key-words that need to be in the second field in a row, with rows in that order. One way to find skipped rows is to iterate through both lists.
That approach can be picky and error prone but here it can be made easier by removing the front element from the reference list each time. Then you always need to compare the current line against the first element in the reference list. Here is the basic logic, with the better version further below.
use warnings;
use strict;
open my $cnv_fh, '<', $CNV_file or die "Can't open $CNV_file: $!";
my @CNVarray = <$cnv_fh>;
close $cnv_fh;
chomp(@CNVarray);
my @ref_list = qw(MTOR JAK1 ...);
foreach my $line (@CNVarray)
{
if ( (split /\t/, $line)[1] eq $ref_list[0] ) { # good row
shift @ref_list;
print $line, "\n";
}
else {
shift @ref_list;
print "blank row found\n";
while ( (split /\t/, $line)[1] ne $ref_list[0] ) {
# multiple missing rows? keep going through the reference list
shift @ref_list;
print "blank row found\n";
}
# the reference list has caught up with the current row
shift @ref_list;
print $line, "\n";
}
}
# We are done with the array, but are there more reference items?
print "blank row found\n" for @ref_list;
The while loop is needed since multiple rows can be missing (in a row), so we need to get to the place in the reference list that does match the current row. A few notes on the code.
The filehandle read <...> in the list context returns a list with all lines from the resource.
The chop in the original code removes the last character, probably not what you want. It is the chomp that removes the new line (or really $/).
Tested against the reference list qw(AA BB CC DD EE) with the input file (note spaces not tabs)
1 AA first
2 BB more
5 EE last
To test with this, change /\t/ to /\s/ (which will then work for tabs as well). It prints
1 AA first
2 BB more
blank row found
blank row found
5 EE last
With further elements added to the @ref_list (FF etc) further blank ... lines are printed.
The code above can be simplified. Lines are also collected in an array, then printed to a new file.
use warnings;
use strict;
open my $cnv_fh, '<', $CNV_file or die "Can't open $CNV_file: $!";
my @CNVarray = <$cnv_fh>;
close $cnv_fh;
chomp(@CNVarray);
my @ref_list = qw(MTOR JAK1 ...);
my @new_lines;
foreach my $line (@CNVarray)
{
while ( (split /\t/, $line)[1] ne $ref_list[0] ) {
shift @ref_list;
push @new_lines, 'blank row found';
print "blank row found\n";
}
shift @ref_list;
push @new_lines, $line;
print $line, "\n";
}
# There may be more items remaining on the reference list
for (@ref_list) {
push @new_lines, 'blank row found';
print "blank row found\n";
}
my $filled_file = 'skipped_rows_added.txt';
open my $out_fh, '>', $filled_file or die "Can't open $filled_file: $!";
print $out_fh "$_\n" for @new_lines;
close $out_fh;
This behaves the same way with the test input above. It can be simplified further yet
foreach my $line (@CNVarray)
{
while ( (split /\t/, $line)[1] ne shift @ref_list ) {
print "blank row found\n";
}
print $line, "\n";
}
The shift returns the removed element, which is what needs to be tested against.
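For experimenting without the real CNV file, the same shift-inside-while idea can be wrapped in a function. This is only a sketch: the sub name fill_missing_rows is hypothetical, it splits on whitespace to match the space-separated test input above, and it adds a guard (not in the loop above) so it stops when the reference list runs out.

```perl
use strict;
use warnings;

# Hypothetical helper: return the input rows with a placeholder inserted
# for every reference key that has no row. Assumes the rows appear in
# reference order and that every row's second field is in the reference list.
sub fill_missing_rows {
    my ($rows, $ref_keys, $placeholder) = @_;
    my @ref = @$ref_keys;               # work on a copy; shift is destructive
    my @filled;
    for my $row (@$rows) {
        my $key = (split /\s+/, $row)[1];
        # The @ref guard stops the loop if a row's key is not in the list.
        while (@ref and $key ne shift @ref) {
            push @filled, $placeholder;
        }
        push @filled, $row;
    }
    push @filled, ($placeholder) x @ref;   # reference keys left over at the end
    return @filled;
}

my @rows = ("1 AA first", "2 BB more", "5 EE last");
print "$_\n" for fill_missing_rows(\@rows, [qw(AA BB CC DD EE)], 'blank row found');
```

Run on the test input above, this prints the three rows with two "blank row found" lines between the BB and EE rows.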
A note on split syntax, following the code update ("\t" changed to /\t/).
When invoked as split /$patt/, $str, the $patt is used as a regular expression, with a few very minor differences. So with /\s/ the string is split on white space as understood in regex, thus including the tab, for example.
With double quotes "..." used instead of /.../, what is inside is interpolated first which may result in surprises, in particular with escapes. (Unless it is used as m"..." in which case it is merely a regex with " being the delimiter.)
In the above code for the tab one can use /\t/, or "\t", or '\t' (or /\s/ which includes yet other types of space). The "\t" was changed to /\t/, which is better in my opinion, being clearer (it is a regex, no questions asked). Thanks to Borodin for the early edit and for the comment.
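The difference between these patterns is easy to check on a small string. The sample line below is made up (one tab followed by two spaces) to show how /\t/, /\s/ and /\s+/ divide it differently:

```perl
use strict;
use warnings;

my $line = "chr1:11174372\tMTOR  42939";   # one tab, then two spaces (made-up sample)

my @by_tab   = split /\t/,  $line;   # only the tab separates
my @by_space = split /\s/,  $line;   # every whitespace character separates
my @by_run   = split /\s+/, $line;   # a run of whitespace is one separator

print scalar(@by_tab),   " fields with /\\t/\n";    # 2
print scalar(@by_space), " fields with /\\s/\n";    # 4, including one empty field
print scalar(@by_run),   " fields with /\\s+/\n";   # 3
```

The empty field produced by /\s/ comes from the two adjacent spaces, so /\s+/ may be the safer choice when columns can be separated by more than one whitespace character.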
I would write this as follows.
The input file is read into a hash, keyed by the value of the second column. Then the hash is read back and printed in the specified sequence of keys.
Most of the code is finding the input file and setting up the sequence of keys. The core of the program is only three lines of code.
use strict;
use warnings 'all';
use File::Spec::Functions 'catfile';
my $home = '/data';
my @run_folder = grep -f, glob catfile($home, 'test_all_runs', '*', 'CNV.txt');
die "No CNV file found" unless @run_folder;
my $cnv_file = $run_folder[-1];
print "The file is $cnv_file\n\n";
my @sequence = qw/
MTOR JAK1 NRAS DDR2 MYCN ALK
IDH1 ERBB4 RAF1 CTNNB1 PIK3CA DCUN1D1
FGFR3 PDGFRA KIT APC FGFR4 ROS1
ESR1 EGFR CDK6 MET SMO BRAF
FGFR1 MYC JAK2 GNAQ RET FGFR2
HRAS CCND1 BIRC2 KRAS ERBB3 CDK4
AKT1 MAP2K1 IDH2 NF1 ERBB2 BRCA1
GNA11 MAP2K2 JAK3 AR MED12
/;
open my $fh, '<', $cnv_file or die qq{Unable to open "$cnv_file" for input: $!};
my %data;
$data{ (split)[1] } = $_ while <$fh>;
print $data{$_} // "no data for $_\n" for @sequence;
output
The file is /data/test_all_runs/XXX/CNV.txt
chr1:11174372 MTOR 42939 42939 7
chr1:65310459 JAK1 1948 1948 3
no data for NRAS
no data for DDR2
no data for MYCN
no data for ALK
no data for IDH1
no data for ERBB4
no data for RAF1
no data for CTNNB1
no data for PIK3CA
no data for DCUN1D1
no data for FGFR3
no data for PDGFRA
no data for KIT
no data for APC
no data for FGFR4
no data for ROS1
no data for ESR1
no data for EGFR
no data for CDK6
no data for MET
no data for SMO
no data for BRAF
no data for FGFR1
no data for MYC
no data for JAK2
no data for GNAQ
no data for RET
no data for FGFR2
no data for HRAS
no data for CCND1
no data for BIRC2
no data for KRAS
no data for ERBB3
no data for CDK4
no data for AKT1
no data for MAP2K1
no data for IDH2
no data for NF1
no data for ERBB2
no data for BRCA1
no data for GNA11
no data for MAP2K2
no data for JAK3
no data for AR
no data for MED12

I want to replace a sequence name in fasta file with another name

I have one FASTA file and one text file. The FASTA file contains sequences in FASTA format, and the text file contains gene names. I want to replace the sequence names in the FASTA file (the text after the '>' sign) with the gene names from the text file.
I am new to Perl. I have written a script, but I don't know why it's not working. Can anyone help me with that, please?
following is my script:
print"Enter annotated file...";
$f1=<STDIN>;
print"Enter sequence file...";
$f2=<STDIN>;
open(FILE1,$f1) || die"Can't open $f1";
@annotfile=<FILE1>;
open(FILE2,$f2) || die"Can't open $f2";
@seqfile=<FILE2>;
@d=split('\t',@annotfile[0]);
for($i=0;$i<scalar(@annotfile);$i++)
{
@curr_all=split('\t',@annotfile[$i]);
@curr_id[$i]=@curr_all[0];
@gene_nm[$i]=@curr_all[1];
}
for($j=0;$j<scalar(@seqfile);$j++)
{
$id=@curr_id[$j];
$gene=@gene_nm[$j];
@seqfile[$j]=~s/$id[$j]/$gene[$j]/g;
print @seqfile[$j];
}
my files looks like following:
annot.txt
pool75_contig_389 ubiquitin ligase e3a
pool75_contig_704 tumor susceptibility
pool75_contig_1977 serine threonine-protein phosphatase 4 catalytic subunit
pool75_contig_3064 bardet-biedl syndrome 2 protein P
pool75_contig_2499 succinyl- ligase
goat300.fasta
>pool75_contig_704
CCCTTTCTCCCTTCCCAACATTCAGAGATACTGAATCGAAACTCTTACTGTCTGTTAGAT
GACAAAGAGTTATCCATCCTACATACTCCAATTTCCTTCCGCAACTTGTGATTTCGCCGC
TTGAATCTTGACGCCGTGCGTCCACAGTTTGTTGTGTTTTATCAATCAAGGTCATTATCA
ACCGAAGACGCTATCTATTTTCTTGGCGAAGCTCTCGGAAAGGAGCCATCGAAATGGAAG
TATTTCTCAAGAAAGTCCGCGAGTTATCCCGGAAGCAGTTC
>pool75_contig_389
GACCTATACCGGACCGTCACTGAAAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
ACGATCCAGGCATGGAGTTGTGGTGACGAGTAGGAGGGTCACCGTGGTGAGCGGGAAGCC
TCGGGCGTGAGCCTGGGTGGAGCCGCCACGGGTGCAGATCTTGGTGGTAGTAGCAAATAT
TCAAGTGAGAACCTTGAAGGCCGAGGTGGAGAAGGNNNNNNNNNNNNNNNNNNNNNNNNN
NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTCATTTGTAT
CGCCCGGAAAACGTCACAAGAACGGGAGTTGCGTACAGAA
>pool75_contig_1977
AAGGGACACCGTTGGGTGAGGCGAGCTGCGTTCCTCGAACCATGGCTTCAAAAAGCGACT
TAGACCGTCAGATTGAACAGCTCAGGGCCTGCAAGCTCATTACAGAGGATGAGGTTAAGG
CACTCTGCGCTAAGGCGCGTGAGATTTTAATTGAAGAGAGTAATGTCCAGTGCGTGGACT
CACCTGTCACGGTTTGTGGCGATATCCACGGCCAGTTTTACGACTTGATTGAACTGTTTA
AAGTGGGCGGAGATGTTC
>pool75_contig_3064
TTACTATTTCTGGGCCTTAAGACTGGCTTAGTCGCTTACGACCCTTATAACAATGTAGAT
GTATATTATAAGGATCTTCCTGATGGTGCTAACGCTATGTTAATTTATTCAAACTCACCG
ACAAAGGAACAGAATATGCTTTGGCAGGTGGAAACTGTTCGATAATTGGATTGAACGACG
GCGGATGCGAGGTATTTTGGACAGTCACTGGCGACTCCGTTTGCTCTCTTTGCTCGATTA
AATCCGACAGCGATAAGTCAAGAGATTTTGTGGTTGGCTCTGAAGATTTTGACATCCGAA
TCTTCCATGGGGATGCCATAATATATGAAATCACGGAGTCTGATG
>pool75_contig_2499
AAGAGAAGAGGTGAGTTTGAGTATTGTTTGTGTGTGTGTGGTTGGGTGAGTGTGTGGTAT
GTGGTGTATGTGTGTGATGAATGTATGTGAAAGAGAGTGATGAATCTCATGGATATGTTC
GAGTTCGTGGTTTCCATTGATCGGTTATAGCCGAGATGATGGATGTGTTCCATGTGTCTG
ATTTCAGTTTAGGATTGTGTTGATGATGTTGATGATGAAAATTGTTGATGGTGATGACGA
TAGTGATGATGATGACGATGTTTCGGATAATGGTGATGATGATGATGGTTCCGACGATGA
TGTTTCGCTTGATGATGGTGATAATGATGACTCCGAAAATAACGTTGACTCGGATGAG
Consider using Bio::SeqIO to parse your Fasta dataset, instead of doing it yourself. Bio::SeqIO lives for this task, and is well developed for it. Additionally, if you're in bioinformatics, it would serve you well to get to know Bio::SeqIO. Given this, consider the following:
use strict;
use warnings;
use Bio::SeqIO;
open my $fh, '<', 'annot.txt' or die $!;
my %annot = map { /(\S+)\s+(.+)/; $1 => $2 } <$fh>;
close $fh;
my $in = Bio::SeqIO->new( -file => 'goat300.fasta', -format => 'Fasta' );
while ( my $seq = $in->next_seq() ) {
my $seqID = $annot{ $seq->id } // $seq->id;
print "$seqID\n" . $seq->seq . "\n";
}
Output on your datasets:
tumor susceptibility
CCCTTTCTCCCTTCCCAACATTCAGAGATACTGAATCGAAACTCTTACTGTCTGTTAGATGACAAAGAGTTATCCATCCTACATACTCCAATTTCCTTCCGCAACTTGTGATTTCGCCGCTTGAATCTTGACGCCGTGCGTCCACAGTTTGTTGTGTTTTATCAATCAAGGTCATTATCAACCGAAGACGCTATCTATTTTCTTGGCGAAGCTCTCGGAAAGGAGCCATCGAAATGGAAGTATTTCTCAAGAAAGTCCGCGAGTTATCCCGGAAGCAGTTC
ubiquitin ligase e3a
GACCTATACCGGACCGTCACTGAAAGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNACGATCCAGGCATGGAGTTGTGGTGACGAGTAGGAGGGTCACCGTGGTGAGCGGGAAGCCTCGGGCGTGAGCCTGGGTGGAGCCGCCACGGGTGCAGATCTTGGTGGTAGTAGCAAATATTCAAGTGAGAACCTTGAAGGCCGAGGTGGAGAAGGNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNTCATTTGTATCGCCCGGAAAACGTCACAAGAACGGGAGTTGCGTACAGAA
serine threonine-protein phosphatase 4 catalytic subunit
AAGGGACACCGTTGGGTGAGGCGAGCTGCGTTCCTCGAACCATGGCTTCAAAAAGCGACTTAGACCGTCAGATTGAACAGCTCAGGGCCTGCAAGCTCATTACAGAGGATGAGGTTAAGGCACTCTGCGCTAAGGCGCGTGAGATTTTAATTGAAGAGAGTAATGTCCAGTGCGTGGACTCACCTGTCACGGTTTGTGGCGATATCCACGGCCAGTTTTACGACTTGATTGAACTGTTTAAAGTGGGCGGAGATGTTC
bardet-biedl syndrome 2 protein P
TTACTATTTCTGGGCCTTAAGACTGGCTTAGTCGCTTACGACCCTTATAACAATGTAGATGTATATTATAAGGATCTTCCTGATGGTGCTAACGCTATGTTAATTTATTCAAACTCACCGACAAAGGAACAGAATATGCTTTGGCAGGTGGAAACTGTTCGATAATTGGATTGAACGACGGCGGATGCGAGGTATTTTGGACAGTCACTGGCGACTCCGTTTGCTCTCTTTGCTCGATTAAATCCGACAGCGATAAGTCAAGAGATTTTGTGGTTGGCTCTGAAGATTTTGACATCCGAATCTTCCATGGGGATGCCATAATATATGAAATCACGGAGTCTGATG
succinyl- ligase
AAGAGAAGAGGTGAGTTTGAGTATTGTTTGTGTGTGTGTGGTTGGGTGAGTGTGTGGTATGTGGTGTATGTGTGTGATGAATGTATGTGAAAGAGAGTGATGAATCTCATGGATATGTTCGAGTTCGTGGTTTCCATTGATCGGTTATAGCCGAGATGATGGATGTGTTCCATGTGTCTGATTTCAGTTTAGGATTGTGTTGATGATGTTGATGATGAAAATTGTTGATGGTGATGACGATAGTGATGATGATGACGATGTTTCGGATAATGGTGATGATGATGATGGTTCCGACGATGATGTTTCGCTTGATGATGGTGATAATGATGACTCCGAAAATAACGTTGACTCGGATGAG
The hash %annot is initialized by reading and capturing the contents of your annot.txt data. A Bio::SeqIO object is created using your goat300.fasta file data. The while loop iterates through your FASTA sequences. The variable $seqID either takes the associated value of the key in the %annot hash or keeps the current sequence ID (the // notation means defined-or, which ensures $seqID will be defined). Finally, the FASTA record is printed.
Hope this helps!
There were a lot of warnings in your code, and your approach was inefficient. Let me first show you a working Perl program. I'll explain afterwards.
#!/usr/bin/perl
use strict;
use warnings;
# Read the annotations file
print"Enter annotated file...\n";
# my $f1 = <STDIN>;
my $f1 = 'annot.txt';
open(my $fh_annotations, '<', $f1) or die "Can't open $f1";
my @annotfile = <$fh_annotations>;
close $fh_annotations;
# Read the sequence file
print"Enter sequence file...\n";
# my $f2 = <STDIN>;
my $f2 = 'goat300.fasta';
open(my $fh_genes, '<', $f2) or die "Can't open $f2";
my @seqfile = <$fh_genes>;
close $fh_genes;
# Process the annotations data
my %names; # this hash is going to hold the names
foreach my $line (@annotfile) {
chomp $line; # remove newline
my @fields = split /\t/, $line; # split into array
$names{$fields[0]} = $fields[1]; # save in the hash as key->value pair
}
# Process the sequence data
foreach my $line (@seqfile) {
# Look at each line
if ($line =~ m/>(.+)$/) {
# If there is a heading there, remember it...
if (exists $names{$1}) {
# ... check if we know a name for it and replace it in the line
$line =~ s/($1)/$names{$1}/;
}
}
# output the line (this would be done to another filehandle)
print $line;
}
This reads both files and saves them in memory, just like yours did. But instead of trying to build two arrays for the names, I went with a hash, which is a key/value pair. Think of it like an array with names instead of numbers and no particular sorting.
Once these names are set up, I can process the sequence file. I simply look at each line and check if there is a heading there, by looking for the > sign. If it's there (it goes into $1 because of the parenthesis), I look if we have a hash entry (with exists) in our %names hash. If we do, we can replace the heading with the proper name.
After that, we could write it out to a new file. I'm just printing it.
I've used a few other techniques. Unfortunately the literature people get in a BioPerl context is quite outdated. Please take this advice, it will make your life easier.
Always use strict and warnings. They will tell you about problems with your code.
Always declare your variables with my. This is not like other languages, where you need to set up a variable at the top of your program. You can declare it where you need it. The variables only live in a certain scope, which means between the nearest enclosing { and } brackets, or block.
Use three-argument open and lexical file handles for security. Read more here.
Perl offers foreach as an alternative to the C for loop. In this case, it made things a lot easier.
One more thing about this program: While this example data was rather short, I believe your actual data might be a lot larger. Consider processing the sequence file while you read it so you do not run out of memory. There's no need to save all the lines, unless you want to do something else with them.
open my $fh_out, '>', $filename_out or die $!;
open my $fh_in, '<', $filename_in or die $!;
while (my $line = <$fh_in>) {
# do stuff with the line, like your regex
print $fh_out $line;
}
close $fh_in;
close $fh_out;
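As one possible way to fill in the "do stuff with the line" step, the sketch below renames headers while streaming. The sub name rename_header is hypothetical, the in-memory FASTA text uses shortened sequences, and only one annotation entry is included; a real script would read both files from disk as shown above.

```perl
use strict;
use warnings;

# Hypothetical helper: return a FASTA line with its header ID replaced by
# the annotated name when one is known; other lines pass through unchanged.
sub rename_header {
    my ($line, $names) = @_;
    if ($line =~ /^>(\S+)/ and exists $names->{$1}) {
        my $name = $names->{$1};
        $line =~ s/^>\S+/>$name/;
    }
    return $line;
}

my %names = (pool75_contig_704 => 'tumor susceptibility');   # sample entry from annot.txt
my $fasta = ">pool75_contig_704\nCCCTTTCTCC\n>pool75_contig_389\nGACCTATACC\n";

# Read line by line so that only one line is held in memory at a time.
open my $fh_in, '<', \$fasta or die "Can't open in-memory file: $!";
while (my $line = <$fh_in>) {
    print rename_header($line, \%names);
}
close $fh_in;
```

Headers without a known annotation, such as the second record here, are printed as they are.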

Merging two files based on first column and returns multiple values for each key

I am fairly new to Perl so hopefully this has a quick solution.
I have been trying to combine two files based on a key. The problem is there are multiple values instead of the one it is returning. Is there a way to loop through the hash to get the 1-10 more values it could be getting?
Example:
File Input 1:
12345|AA|BB|CC
23456|DD|EE|FF
File Input2:
12345|A|B|C
12345|D|E|F
12345|G|H|I
23456|J|K|L
23456|M|N|O
32342|P|Q|R
The reason I put that last one in is that the second file has a lot of values I don't want, whereas I want all values from file 1. The result I want is something like this:
WANTED OUTPUT:
12345|AA|BB|CC|A|B|C
12345|AA|BB|CC|D|E|F
12345|AA|BB|CC|G|H|I
23456|DD|EE|FF|J|K|L
23456|DD|EE|FF|M|N|O
Attached is the code I am currently using. It gives an output like so:
OUTPUT I AM GETTING:
12345|AA|BB|CC|A|B|C
23456|DD|EE|FF|J|K|L
My code so far:
#use strict;
#use warnings;
open file1, "<FILE1.txt";
open file2, "<FILE2.txt";
while(<file2>){
my($line) = $_;
chomp $line;
my($key, $value1, $value2, $value3) = $line =~ /(.+)\|(.+)\|(.+)\|(.+)/;
$value4 = "$value1|$value2|$value3";
$file2Hash{$key} = $value4;
}
while(<file1>){
my ($line) = $_;
chomp $line;
my($key, $value1, $value2, $value3) = $line =~/(.+)\|(.+)\|(.+)\|(.+)/;
if (exists $file2Hash{$key}) {
print $line."|".$file2Hash{$key}."\n";
}
else {
print $line."\n";
}
}
Thank you for any help you may provide,
Your overall idea is sound. However in file2, if you encounter a key you have already defined, you overwrite it with a new value. To work around that, we store an array(-ref) inside our hash.
So in your first loop, we do:
push @{$file2Hash{$key}}, $value4;
The @{...} is just array dereferencing syntax.
In your second loop, we do:
if (exists $file2Hash{$key}){
foreach my $second_value (@{$file2Hash{$key}}) {
print "$line|$second_value\n";
}
} else {
print $line."\n";
}
Beyond that, you might want to declare %file2Hash with my so you can reactivate strict.
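For reference, here is a compact sketch of the whole hash-of-arrays merge that runs under strict and warnings. The sub name merge_on_key is hypothetical, and the input lines are passed in as arrays (taken from the example data in the question) instead of being read from FILE1.txt and FILE2.txt.

```perl
use strict;
use warnings;

# Hypothetical helper: join each line of file 1 with every line of file 2
# that shares its key (the first |-separated field).
sub merge_on_key {
    my ($file1_lines, $file2_lines) = @_;
    my %file2;
    for my $line (@$file2_lines) {
        my ($key, $rest) = split /\|/, $line, 2;
        push @{ $file2{$key} }, $rest;      # several values per key
    }
    my @merged;
    for my $line (@$file1_lines) {
        my ($key) = split /\|/, $line, 2;
        if (exists $file2{$key}) {
            push @merged, "$line|$_" for @{ $file2{$key} };
        }
        else {
            push @merged, $line;            # no match: keep the file 1 line as is
        }
    }
    return @merged;
}

my @file1 = ('12345|AA|BB|CC', '23456|DD|EE|FF');
my @file2 = ('12345|A|B|C', '12345|D|E|F', '23456|J|K|L', '32342|P|Q|R');
print "$_\n" for merge_on_key(\@file1, \@file2);
```

Keys that appear only in the second list (32342 here) are ignored, matching the wanted output in the question.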
Keys in a hash must be unique. If keys in file1 are unique, use file1 to create the hash. If keys are not unique in either file, you have to use a more complicated data structure: hash of arrays, i.e. store several values at each unique key.
I assume that each key in FILE1.txt is unique and that each unique key has at least one corresponding line in FILE2.txt.
Your approach is then quite close to what you need, you should just use FILE1.txt to create the hash from (as already mentioned here).
The following should work:
#!/usr/bin/perl
use strict;
use warnings;
my %file1hash;
open file1, "<", "FILE1.txt" or die "$!\n";
while (<file1>) {
my ($key, $rest) = split /\|/, $_, 2;
chomp $rest;
$file1hash{$key} = $rest;
}
close file1;
open file2, "<", "FILE2.txt" or die "$!\n";
while (<file2>) {
my ($key, $rest) = split /\|/, $_, 2;
if (exists $file1hash{$key}) {
chomp $rest;
printf "%s|%s|%s\n", $key, $file1hash{$key}, $rest;
}
}
close file2;
exit 0;