Display full taxon path from NCBI GI number - perl

I wrote the following script, which takes an NCBI GI ID number from my TSV file and prints the scientific name associated with the ID:
#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::Taxonomy;

my ($filename) = @ARGV;
open my $fh, '<', $filename or die qq{Unable to open "$filename": $!};

while (<>) {
    my ($taxonid, $counts) = split /\t/;
    for my $each ($taxonid) {
        print "$each\n";
        my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
        my $taxon = $db->get_taxon(-taxonid => $taxonid);
        print "Taxon ID is $taxon->id, \n";
        print "Scientific name is ", $taxon->scientific_name, "\n";
    }
}
With this script, I receive the following:
1760
Taxon ID is Bio::Taxon=HASH(0x33a91f8)->id,
Scientific name is Actinobacteria
What I want to do
Now the next step is to list the full taxon path of the bacterium in question. For the above example, I want to see k__Bacteria; p__Actinobacteria; c__Actinobacteria as output. Furthermore, I want the GI IDs in my table to be replaced with this full taxon path.
In which direction should I go?

First, I notice that you open $filename, your first command-line argument, but never use the filehandle $fh you created.
So these two lines are not needed in your case, because <> already reads the files named in @ARGV:
my ($filename) = @ARGV;
open my $fh, '<', $filename or die qq{Unable to open "$filename": $!};
Also, method calls are not interpolated inside double-quoted strings, which is why your output shows Bio::Taxon=HASH(0x33a91f8)->id; print the ID as a separate list element instead, as in the code below.
Next: I don't know what is inside your file and your database, so I cannot help you more. Can you provide an example of what is inside your database and your file?
One more thing I can see here: you don't need to create your $db instance inside the loop.
#!/usr/bin/perl
use strict;
use warnings;
use Bio::DB::Taxonomy;

my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
while (<>) {
    my ($taxonid, $counts) = split /\t/;
    for my $each ($taxonid) {
        print "$each\n";
        my $taxon = $db->get_taxon(-taxonid => $taxonid);
        # method calls don't interpolate in double quotes, so pass them as list elements
        print "Taxon ID is ", $taxon->id, "\n";
        print "Scientific name is ", $taxon->scientific_name, "\n";
    }
}
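If the script is saved as, say, taxon.pl (the name is only for illustration), it reads your TSV either from a filename argument or from standard input, since <> handles both:
perl taxon.pl counts.tsv
perl taxon.pl < counts.tsv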
Edit
From your comment it is hard to help you. When you write
my $taxon = $db->get_taxon(-taxonid => $taxonid);
you receive a Bio::Taxon node, whose documentation can be found here.
I don't know what k__Bacteria; p__Actinobacteria; c__Actinobacteria represents for you. Is it information offered by a Bio::Taxon node?
Anyway, you can still explore $taxon with this:
#!/usr/bin/env perl
# Author: Yves Chevallier
# Date:
use strict;
use warnings;
use Data::Dumper;
use Bio::DB::Taxonomy;

my $db = Bio::DB::Taxonomy->new(-source => 'entrez');
while (<DATA>) {
    my ($taxonid, $counts) = split /\t/;
    for my $each ($taxonid) {
        print "$each\n";
        my $taxon = $db->get_taxon(-taxonid => $taxonid);
        print Dumper $taxon;
        print "Taxon ID is ", $taxon->id, "\n";
        print "Scientific name is ", $taxon->scientific_name, "\n";
    }
}
__DATA__
12 1760
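If the k__/p__/c__ path is simply the ranked lineage, you can build it by walking upwards from the taxon with the database's ancestor method and collecting rank/name pairs. Here is a sketch, assuming Bio::DB::Taxonomy's ancestor() and Bio::Taxon's rank and scientific_name accessors behave as documented; the rank-to-letter mapping is my own guess at the format you want:

#!/usr/bin/env perl
use strict;
use warnings;
use Bio::DB::Taxonomy;

# Map NCBI rank names to the single-letter prefixes from the question.
# Note: NCBI calls the kingdom level of bacteria "superkingdom".
my %prefix = (
    superkingdom => 'k',
    phylum       => 'p',
    class        => 'c',
    order        => 'o',
    family       => 'f',
    genus        => 'g',
    species      => 's',
);

my $db    = Bio::DB::Taxonomy->new(-source => 'entrez');
my $taxon = $db->get_taxon(-taxonid => 1760);

# Climb from the taxon to the root, remembering every ranked ancestor.
my @path;
for (my $node = $taxon; defined $node; $node = $db->ancestor($node)) {
    my $rank = $node->rank // '';
    unshift @path, $prefix{$rank} . '__' . $node->scientific_name
        if $prefix{$rank};
}
print join('; ', @path), "\n";
# e.g. k__Bacteria; p__Actinobacteria; c__Actinobacteria

To replace the GI IDs in your table, you could build this string for each $taxonid inside your existing loop and print it alongside $counts.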

Related

Perl print to separate files

I have a text file which lists a service, a device and a filter; here I list 3 examples only:
service1 device04 filter9
service2 device01 filter2
service2 device10 filter11
I have written a Perl script that iterates through the file and should print device=device filter=filter to a file named after the service each line belongs to; but if a service appears more than once, it should add the device/filter pairs to the same file, separated by semicolons. Looking at the above example, I then need this result:
service1.txt
device=device04 filter=filter9
service2.txt
device=device01 filter=filter2 ; device=device10 filter=filter11
Here is my code:
use strict;
use warnings qw(all);

open INPUT, "<", "file.txt" or die $!;
my @Input = <INPUT>;

foreach my $item (@Input) {
    my ($serv, $device, $filter) = split(/ /, $item);
    chomp($serv, $device, $filter);
    push my @arr, "device==$device & filter==$filter";
    open OUTPUT, ">>", "$serv.txt" or die $!;
    print OUTPUT join(" ; ", @arr);
    close OUTPUT;
}
The problem I am having is that both service1.txt and service2.txt are created, but my results are all wrong, see my current result:
service1.txt
device==device04 filter==filter9
service2.txt
device==device04 filter==filter9 ; device==device01 filter==filter2device==device04 filter==filter9 ; device==device01 filter==filter2 ; device==device10 filter==filter11
I apologise, I know this is something stupid, but it has been a really long night and my brain cannot function properly, I believe.
For each service to have its own file where its data accumulates, you need to determine, for each line, which file it should be printed to.
Then open a new service file whenever a service without one is encountered; this is feasible since there aren't many of them, as clarified in a comment. This can be organized with a hash of service => filehandle.
use warnings;
use strict;
use feature 'say';

my $file = shift @ARGV || 'data.txt';

my %handle;

open my $fh, '<', $file or die "Can't open $file: $!";
while (<$fh>) {
    my ($serv, $device, $filter) = split;

    if (exists $handle{$serv}) {
        print { $handle{$serv} } " ; device==$device & filter==$filter";
    }
    else {
        open my $fh_out, '>', "$serv.txt" or do {
            warn "Can't open $serv.txt: $!";
            next;
        };
        print $fh_out "device==$device & filter==$filter";
        $handle{$serv} = $fh_out;
    }
}
say $_ '' for values %handle;  # terminate the line in each file
close $_  for values %handle;
For clarity the code prints almost the same thing in both branches, which could surely be made cleaner. This was tested only with the provided sample data, and it produces the desired output.
Note that when the filehandle has to come from an expression, it needs to be wrapped in { }. See this post, for example.
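A quick illustration:
print $handle{$serv} "text";        # syntax error: the filehandle slot only takes a bareword, a simple scalar, or a block
print { $handle{$serv} } "text";    # OK: the block evaluates to the filehandle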
Comments on the original code (addressed in the code above)
Use lexical filehandles (my $fh) instead of typeglobs (FH)
Don't read the whole file at once unless there is a specific reason for that
split has nice defaults: split ' ', $_, where ' ' splits on whitespace and discards leading and trailing space as well. (There is then no need to chomp in this case; see the snippet below.)
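For example, with a line like those in the sample data:
my ($serv, $device, $filter) = split ' ', "  service1  device04  filter9\n";
# ('service1', 'device04', 'filter9') - the trailing newline is discarded too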
Another option is to first collect the data for each service, just as the OP attempts, but again using a hash (service => arrayref/string with data), and print at the end. But I don't see a reason not to print as you go, since you'd need the same logic to decide when the ; needs to be added.
Your code looks pretty perl4-ish, but that's not a problem. As MrTux has pointed out, you are confusing the collection and the fanning-out of your data. I have refactored this to use a hash as an intermediate container, with the service names as keys. Please note that this will not accumulate results across multiple runs (as it uses ">" and not ">>").
use strict;
use warnings qw(all);
use File::Slurp qw/read_file/;

my @Input = read_file('file.txt', chomp => 1);
my %store = (); # Global container

# Capture
foreach my $item (@Input) {
    my ($serv, $device, $filter) = split(/ /, $item);
    push @{ $store{$serv} }, "device==$device & filter==$filter";
}

# Write out each service file
foreach my $k (keys %store) {
    open(my $OUTPUT, ">", "$k.txt") or die $!;
    print $OUTPUT join(" ; ", @{ $store{$k} });
    close($OUTPUT);
}

Two CSV files: change one CSV using the other and pull out that line

I have two CSV files. The first is a list file; it contains IDs and names. For example:
1127100,Acanthocolla cruciata
1127103,Acanthocyrta haeckeli
1127108,Acanthometra fusca
The second file is the one I want to change: when a number in its first column matches an ID from the first file, I want to swap in the name and pull out that line. The first columns of numbers correspond between the two files. For example:
1127108,1,0.60042
1127103,1,0.819671
1127100,2,0.50421,0.527007
10207,3,0.530422,0.624466
So I want to end up with a CSV file like this:
Acanthometra fusca,1,0.60042
Acanthocyrta haeckeli,1,0.819671
Acanthocolla cruciata,2,0.50421,0.527007
I tried Perl, but opening two files at once proved messy. So I tried converting one of the CSV files to a string and parsing it that way, but that didn't work either. Then I was reading about grep and other one-liners, but I am not familiar with them. Would this be possible with grep?
This is the Perl code I tried
use strict;
use warnings;

open my $csv_score, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[0]" for input: $!\n};
open my $csv_list,  '<', "$ARGV[1]" or die qq{Failed to open "$ARGV[1]" for input: $!\n};
open my $out, ">$ARGV[0]_final.txt" or die qq{Failed to open for output: $!\n};

my $string = <$csv_score>;

while ( <$csv_list> ) {
    my ($find, $replace) = split /,/;
    $string =~ s/$find/$replace/g;
    if ($string =~ m/^$replace/) {
        print $out $string;
    }
}

close $csv_score;
close $csv_list;
close $out;
The general-purpose text-processing tool that comes with all UNIX installations is named awk. Here NR==FNR is true only while the first file is being read, so the first block builds a map m from ID to name; for each line of the second file, $1=m[$1] replaces the ID with its name, and because the assignment's result is used as the pattern, lines with no mapping (such as 10207) evaluate to false and are not printed:
$ awk -F, -v OFS=, 'NR==FNR{m[$1]=$2;next} $1=m[$1]' file1 file2
Acanthometra fusca,1,0.60042
Acanthocyrta haeckeli,1,0.819671
Acanthocolla cruciata,2,0.50421,0.527007
Your code was failing because you read only the first line from the $csv_score file, and because you tried to print $string every time it changed. You also failed to remove the newline from the end of the lines of your $csv_list file. If you fix those things then it looks like this:
use strict;
use warnings;

open my $csv_score, '<', "$ARGV[0]" or die qq{Failed to open "$ARGV[0]" for input: $!\n};
open my $csv_list,  '<', "$ARGV[1]" or die qq{Failed to open "$ARGV[1]" for input: $!\n};
open my $out, ">$ARGV[0]_final.txt" or die qq{Failed to open for output: $!\n};

my $string = do {
    local $/;    # slurp the whole file
    <$csv_score>;
};

while ( <$csv_list> ) {
    chomp;
    my ($find, $replace) = split /,/;
    $string =~ s/$find/$replace/g;
}

print $out $string;

close $csv_score;
close $csv_list;
close $out;
output
Acanthometra fusca,1,0.60042
Acanthocyrta haeckeli,1,0.819671
Acanthocolla cruciata,2,0.50421,0.527007
10207,3,0.530422,0.624466
However, that's not a safe way of doing things, as the IDs may be found elsewhere than at the start of a line.
I would instead build a hash out of the $csv_list file, like this, which also makes the program more concise:
use strict;
use warnings;
use v5.10.1;
use autodie;

my %ids;
{
    open my $fh, '<', $ARGV[1];
    while ( <$fh> ) {
        chomp;
        my ($id, $name) = split /,/;
        $ids{$id} = $name;
    }
}

open my $in_fh,  '<', $ARGV[0];
open my $out_fh, '>', "$ARGV[0]_final.txt";

while ( <$in_fh> ) {
    s{^(\d+)}{$ids{$1} // $1}e;    # replace a leading ID only if we know its name
    print $out_fh $_;
}
The output is identical to that of the first program above
The problem with the code as written is that you only do this once:
my $string = <$csv_score>;
This reads one line from $csv_score and you don't ever use the rest.
I would suggest that you need to:
Read the first file into a hash
Iterate the second file, and do a replace on the first column.
Using Text::CSV is generally a good idea for processing CSV data, but it doesn't seem to be strictly necessary for your example.
So:
#!/usr/bin/env perl
use strict;
use warnings;
use Text::CSV;
use Data::Dumper;

my $csv = Text::CSV->new( { binary => 1 } );

my %replace;
while ( my $row = $csv->getline( \*DATA ) ) {
    last if $row->[0] =~ m/NEXT/;
    $replace{ $row->[0] } = $row->[1];
}
print Dumper \%replace;

# compile one alternation that matches any of the keys
my $search = join( "|", map {quotemeta} keys %replace );
$search = qr/$search/;

while ( my $row = $csv->getline( \*DATA ) ) {
    $row->[0] =~ s/^($search)$/$replace{$1}/;
    $csv->print( \*STDOUT, $row );
    print "\n";
}
__DATA__
1127100,Acanthocolla cruciata
1127103,Acanthocyrta haeckeli
1127108,Acanthometra fusca
NEXT
1127108,1,0.60042
1127103,1,0.819671
1127100,2,0.50421,0.527007
10207,3,0.530422,0.624466
Note - this still prints that last line of your source content:
"Acanthometra fusca ",1,"0.60042 "
"Acanthocyrta haeckeli ",1,"0.819671 "
"Acanthocolla cruciata ",2,0.50421,"0.527007 "
(Your data contained whitespace, so Text::CSV wraps it in quotes)
If you want to discard that, then you could test if the replace actually occurred:
if ( $row->[0] =~ s/^($search)$/$replace{$1}/ ) {
    $csv->print( \*STDOUT, $row );
    print "\n";
}
(And you can, of course, keep on using split /,/ if you're sure you won't have any of the wacky things that CSV normally supports.)
I would like to provide a very different approach.
Let's say you are far more comfortable with databases than with Perl's data structures. You can use DBD::CSV to treat your CSV files as a kind of relational database. It uses Text::CSV under the hood (hat tip to @Sobrique). You will need to install it from CPAN, though, as it is not bundled in the default DBI distribution.
use strict;
use warnings;
use Data::Printer; # for p
use DBI;

my $dbh = DBI->connect( "dbi:CSV:", undef, undef, { f_ext => '.csv' } );
$dbh->{csv_tables}->{names}   = { col_names => [qw/id name/] };
$dbh->{csv_tables}->{numbers} = { col_names => [qw/id int float/] };

my $sth_select = $dbh->prepare(<<'SQL');
SELECT names.name, numbers.int, numbers.float
FROM names
JOIN numbers ON names.id = numbers.id
SQL

# column types will be silently discarded
$dbh->do('CREATE TABLE result ( name CHAR(255), int INTEGER, float INTEGER )');

my $sth_insert =
  $dbh->prepare('INSERT INTO result ( name, int, float ) VALUES ( ?, ?, ? )');

$sth_select->execute;
while ( my @res = $sth_select->fetchrow_array ) {
    p @res;
    $sth_insert->execute(@res);
}
What this does is set up column names for the two tables (your CSV files) as those do not have a first row with names. I made the names up based on the data types. It will then create a new table (CSV file) named result and fill it by writing one row at a time.
At the same time it will output data (for debugging purposes) to STDERR through Data::Printer.
[
    [0] "Acanthocolla cruciata",
    [1] 2,
    [2] 0.50421
]
[
    [0] "Acanthocyrta haeckeli",
    [1] 1,
    [2] 0.819671
]
[
    [0] "Acanthometra fusca",
    [1] 1,
    [2] 0.60042
]
The resulting file looks like this:
$ cat scratch/result.csv
name,int,float
"Acanthocolla cruciata",2,0.50421
"Acanthocyrta haeckeli",1,0.819671
"Acanthometra fusca",1,0.60042

"Out of memory" error from Perl when processing files

I'm using Perl with Mojo::DOM to process a large batch of text files. I need to count the occurrences of all the words that end with certain suffixes.
Running this code keeps producing "out of memory" errors for batches of more than about 40 files.
Is there any way to accomplish this task more efficiently (less memory usage) than what I'm doing below?
#!/software/perl512/bin/perl
use strict;
use warnings;
use autodie;
use Mojo::DOM;

my $path = "/data/10K/2012";
chdir($path) or die "Cant chdir to $path $!";

# This program counts the total number of suffixes of a form in a given document.
my @sequence;
my %sequences;
my $file;
my $fh;
my @output;

# Reading in the data.
for my $file (<*.txt>) {
    my %affixes;
    my %word_count;

    my $data = do {
        open my $fh, '<', $file;
        local $/;    # Slurp mode
        <$fh>;
    };

    my $dom  = Mojo::DOM->new($data);
    my $text = $dom->all_text();

    for (split /\s+/, $text) {
        if ($_ =~ /[a-zA-Z]+(ness|ship|dom|ance|ence|age|cy|tion|hood|ism|ment|ure|tude|ery|ity|ial)\b/ ) {
            ++$affixes{"affix_count"};
        }
        ++$word_count{"word_count"};
    }

    my $output = join ",", $file, $affixes{"affix_count"}, $word_count{"word_count"};
    push @output, $output;
}

@output = sort @output;

open(my $fh3, '>', '/home/usr16/rcazier/PerlCode/affix_count.txt');
foreach (@output) {
    print $fh3 "$_\n ";
}
close $fh3;
This is as near as I can get to a solution. It incorporates all the points made in the comments, and it solves the "Out of memory" error by leaving any HTML tags intact. It also leaves the results unsorted, as the original code doesn't really do any useful sorting.
Because of the way you are looking for suffixed words, I think it is very unlikely that leaving the HTML tags in your text files will distort your results significantly.
#!/software/perl512/bin/perl
use strict;
use warnings;
use 5.010;
use autodie;

# Build and compile a regex that will match any of the suffixes that interest
# us, for later use in testing each "word" in the input file
#
my $suffix_re = do {
    my @suffixes = qw/ ness ship dom ance ence age cy tion hood ism ment ure tude ery ity ial /;
    my $alternation = join '|', @suffixes;
    qr/ (?: $alternation ) /xi;
};

# Set the directory that we want to examine. `autodie` will check the success
# of `chdir` for us
#
my $path = '/data/10K/2012';
chdir $path;

# Process every file with a `txt` file type
#
for my $filename ( grep -f, glob('*.txt') ) {
    warn qq{Processing "$filename"\n};
    open my ($fh), '<', $filename;

    my ($suffixes, $word_count) = (0, 0);
    while (<$fh>) {
        for (split) {
            ++$word_count;
            ++$suffixes if /\A[a-z]+$suffix_re\z/i;
        }
    }
    say join ',', $filename, $suffixes, $word_count if $suffixes;
}
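If you do want the markup gone after all, a crude per-line strip could replace the read loop above. This is my own addition, not part of the answer, and the regex is naive: it only removes tags that open and close on the same line:

while (<$fh>) {
    s/<[^>]*>//g;    # drop single-line tags; tags spanning lines survive
    for (split) {
        ++$word_count;
        ++$suffixes if /\A[a-z]+$suffix_re\z/i;
    }
}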

Compare two files and write the lines from the second file to the first file

I have two files named test1 and test2. The contents of the files are as follows.
test1
nijin:qwe123
nijintest:qwerty
nijintest2:abcsdef
nijin2:qwehj
test2
nijin:qwe
nijintest2:abc
I have to change the values of nijin and nijintest2 in test1 to match those in test2, leaving all the other values alone. I have tried every Perl replace command I could find, without any success. Any help will be appreciated.
Edit
I have tried many open/close file functions to replace the entry, but none of them gives the required output. I have tried everything in In Perl, how do I change, delete, or insert a line in a file, or append to the beginning of a file?, but with no luck.
This works, though it could probably still be compressed:
#!/usr/bin/env perl
use strict;
use warnings;

die "Usage: $0 map [file ...]\n" unless scalar(@ARGV) >= 1;

my %mapping;
open my $fh, "<", $ARGV[0] or die "Failed to open $ARGV[0] for reading";
while (<$fh>) {
    my ($key, $value) = ($_ =~ m/^([^:]*):(.*)/);
    $mapping{$key} = "$value\n";
}
close $fh;
shift;

while (<>) {
    my ($key) = ($_ =~ m/^([^:]*):/);
    $_ = "$key:$mapping{$key}" if (defined $mapping{$key});
    print;
}
If it is called sub.pl, you can run:
perl sub.pl test2 test1
perl sub.pl test2 <test1
perl sub.pl test2 test1 test3 test4
For the first two invocations, the output is:
nijin:qwe
nijintest:qwerty
nijintest2:abc
nijin2:qwehj
The following should work for importing any number of new files. I have also included code for appending new entries, since you didn't specify how you wanted those handled.
#!/usr/bin/env perl
use strict;
use warnings;
use autodie;

die "Usage: $0 dbfile [mapfiles ...]\n" if @ARGV < 2;

my $db = shift;

# build a key => "key:newvalue" map from the remaining files
my %mapping = map {chomp; /([^:]*)/; $1 => $_} <>;

local @ARGV = $db;
local $^I = '.bak';    # edit the dbfile in place, keeping a backup

while (<>) {
    chomp;
    /([^:]*)/;
    print delete $mapping{$1} // $_, "\n";
}
#unlink "$db$^I"; # Uncomment if you want to delete backup

# Append new entries;
open my $fh, '>>', $db;
$fh->print($_, "\n") for values %mapping;
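If this script is saved as, say, merge.pl (the name is purely illustrative), it would be run as:
perl merge.pl test1 test2
The $^I in-place edit rewrites test1 directly and keeps the original as test1.bak; because matched entries are deleted from %mapping as they are used, the final loop appends only the test2 entries whose keys were not found in test1.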

check if a pattern exists in a file

I have a very simple Perl question regarding a pattern-matching problem.
I am reading a file with a list of names (fileA).
I would like to check whether any of these names exist in another file (fileB):
if ($name -e $fileB) {
    do something
} else {
    do something else
}
In a way, it is a check for whether a pattern exists in a file.
I have tried:
open(IN, $controls) or die "Can't open the control file\n";
while (my $line = <IN>) {
    if ($name =~ $line) {
        print "$name\tfound\n";
    } else {
        print "$name\tnotFound\n";
    }
}
This repeats itself: it checks and prints for every entry, rather than reporting once whether the name exists or not.
When you are comparing one list to another, you want hashes. A hash is like a keyed array: the list itself has no order, and a hash can only have a single instance of a particular key (though different keys can hold the same data).
What you can do is go through the first file and create a hash keyed by each line. Then you go through the second file and check whether each of its lines matches a key in your hash:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;   # You don't have to check if "open" fails.

use constant {
    FIRST_FILE  => 'file1.txt',
    SECOND_FILE => 'file2.txt',
};

open my $first_fh, "<", FIRST_FILE;

# Get each line as a hash key
my %line_hash;
while ( my $line = <$first_fh> ) {
    chomp $line;
    $line_hash{$line} = 1;
}
close $first_fh;
Now each line is a key in your hash %line_hash. The value really doesn't matter; the important part is the key itself.
Now that I have my hash of the lines in the first file, I can read the second file and see whether each of its lines exists in my hash:
open my $second_fh, "<", SECOND_FILE;
while ( my $line = <$second_fh> ) {
    chomp $line;
    if ( exists $line_hash{$line} ) {
        say qq(I found "$line" in both files);
    }
}
close $second_fh;
There is also a map function that can be used:
#! /usr/bin/env perl
use strict;
use warnings;
use feature qw(say);
use autodie;   # You don't have to check if "open" fails.

use constant {
    FIRST_FILE  => 'file1.txt',
    SECOND_FILE => 'file2.txt',
};

open my $first_fh, "<", FIRST_FILE;
chomp( my @lines = <$first_fh> );

# Get each line as a hash key
my %line_hash = map { $_ => 1 } @lines;
close $first_fh;

open my $second_fh, "<", SECOND_FILE;
while ( my $line = <$second_fh> ) {
    chomp $line;
    if ( exists $line_hash{$line} ) {
        say qq(I found "$line" in both files);
    }
}
close $second_fh;
I am not a great fan of map, because I don't find it much more efficient and it is harder to understand what is going on.
To check whether a pattern exists in a file, you have to open the file and read its content. The fastest way to test whether one list is contained in another is to store the content in a hash:
#!/usr/bin/perl
use strict;
use warnings;

open my $LST, '<', 'fileA' or die "fileA: $!\n";
open my $FB,  '<', 'fileB' or die "fileB: $!\n";

my %hash;
while (<$FB>) {
    chomp;
    undef $hash{$_};
}

while (<$LST>) {
    chomp;
    if (exists $hash{$_}) {
        print "$_ exists in fileB.\n";
    }
}
What follows is untested, algorithm-style code, but I feel it does the job for you.
my @a;
my $matched;
my $line;

open(A, "fileA");
open(B, "fileB");    # read each file through its own handle

while (<A>) {
    chomp;
    push @a, $_;
}

while (<B>) {
    chomp;
    $line    = $_;
    $matched = 0;
    # set the flag before leaving the loop, or it never gets set
    for (@a) {
        if ($line =~ /$_/) { $matched = 1; last }
    }
    if ($matched) {
        do something
    }
    else {
        do something else
    }
}