Extracting and joining exons from multiple sequence alignments - perl

Using my (fairly) basic coding skills, I have put together a script that will parse an aligned multi-fasta file (a multiple sequence alignment) and extract all the data between two specified columns.
use Bio::SimpleAlign;
use Bio::AlignIO;
$str = Bio::AlignIO->new(-file => $inputfilename, -format => 'fasta');
$aln = $str->next_aln();
$mini = $aln->slice($array[0], $array[1]);
$out = Bio::AlignIO->new(-file => $array[3], -format => 'fasta');
$out->write_aln($mini);
The problem I have is that I want to be able to slice multiple regions from the same alignment and then join these regions prior to writing to an outfile. The complication is that I want to supply a file with a list of co-ordinates where each line contains two or more co-ordinates between which data should be extracted and joined.
Here is an example co-ordinate file
ORF1, 10, 50, exon1 # The above line should produce a slice between columns 10-50 and write to an outfile
ORF2, 70, 140, exon1
ORF2, 190, 270, exon2
ORF2, 500, 800, exon3 # Data should be extracted between the ranges specified here and in the above two lines and then joined (side by side) to produce the outfile.
ORF3, 1200, 1210, exon1
etc etc
And here is an (small) example of an aligned fasta file
\>Sample1
ATGGCGACCGTGCACTACTCCCGCCGACCTGGGACCCCGCCGGTCACCCTCACGTCGTCC
CCCAGCATGGATGACGTTGCGACCCCCATCCCCTACCTACCCACATACGCCGAGGCCGTG
GCAGACGCGCCCCCCCCTTACAGAAGCCGCGAGAGTCTGGTGTTCTCCCCGCCTCTTTTT
CCTCACGTGGAGAATGGCACCACCCAACAGTCTTACGATTGCCTAGACTGCGCTTATGAT
GGAATCCACAGACTTCAGCTGGCTTTTCTAAGAATTCGCAAATGCTGTGTACCGGCTTTT
TTAATTCTTTTTGGTATTCTCACCCTTACTGCTGTCGTGGTCGCCATTGTTGCCGTTTTT
CCCGAGGAACCTCCCAACTCAACTACATGA
\>Sample2
ATGGCGACCGTGCACTACTCCCGCCGACCTGGGACCCCGCCGGTCACCCTCACGTCGTCC
CCCAGCATGGATGACGTTGCGACCCCCATCCCCTACCTACCCACATACGCCGAGGCCGTG
GCAGACGCGCCCCCCCCTTACAGAAGCCGCGAGAGTCTGGTGTTCTCCCCGCCTCTTTTT
CCTCACGTGGAGAATGGCACCACCCAACAGTCTTACGATTGCCTAGACTGCGCTTATGAT
GGAATCCACAGACTTCAGCTGGCTTTTCTAAGAATTCGCAAATGCTGTGTACCGGCTTTT
TTAATTCTTTTTGGTATTCTCACCCTTACTGCTGTCGTGGTCGCCATTGTTGCCGTTTTT
CCCGAGGAACCTCCCAACTCAACTACATGA
I think there should be a fairly simple way to solve this problem, potentially using the information in the first column, paired with the exon number, but I can't for the life of me figure out how this can be done.
Can anyone help me out?

The aligned FASTA file you posted (at least as it appears on the Stack Overflow web page) could not be parsed. According to https://en.wikipedia.org/wiki/FASTA_format, the description lines should begin with a >, not with \>.
Be sure to run all Perl programs with use strict; use warnings;. This will facilitate debugging.
You have not populated @array. Consequently, you can expect to get errors like these:
Use of uninitialized value $start in pattern match (m//) at perl-5.24.0/lib/site_perl/5.24.0/Bio/SimpleAlign.pm line 1086, <GEN0> line 16.
Use of uninitialized value $start in concatenation (.) or string at perl-5.24.0/lib/site_perl/5.24.0/Bio/SimpleAlign.pm line 1086, <GEN0> line 16.
------------- EXCEPTION: Bio::Root::Exception -------------
MSG: Slice start has to be a positive integer, not []
STACK: Error::throw
STACK: Bio::Root::Root::throw perl-5.24.0/lib/site_perl/5.24.0/Bio/Root/Root.pm:444
STACK: Bio::SimpleAlign::slice perl-5.24.0/lib/site_perl/5.24.0/Bio/SimpleAlign.pm:1086
STACK: fasta.pl:26
Once you assign plausible values, e.g.,
@array = (1,17);
... you will get more plausible results:
$ perl fasta.pl
>Sample1/1-17
ATGGCGACCGTGCACTA
>Sample2/1-17
ATGGCGACCGTGCACTA
HTH!
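The part of the question about joining several regions per ORF can be sketched independently of BioPerl. The following is a minimal sketch in Python rather than Perl; the function names read_fasta, read_regions and join_regions are made up for illustration. It assumes every co-ordinate line has exactly four comma-separated fields, that anything after a # is a comment, and that co-ordinates are 1-based and inclusive (as in the /1-17 output above).

```python
def read_fasta(lines):
    """Return {sample_name: aligned_sequence} from FASTA-formatted lines."""
    parts = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            parts[name] = []
        elif name is not None:
            parts[name].append(line)
    return {n: "".join(chunks) for n, chunks in parts.items()}


def read_regions(lines):
    """Group (start, end) pairs by the ORF name in the first column."""
    regions = {}
    for line in lines:
        line = line.split("#", 1)[0].strip()  # drop trailing comments
        if not line:
            continue
        orf, start, end, _exon = [field.strip() for field in line.split(",")]
        regions.setdefault(orf, []).append((int(start), int(end)))
    return regions


def join_regions(seqs, spans):
    """Cut each span (1-based, inclusive) out of every sequence and join them."""
    return {
        name: "".join(seq[start - 1:end] for start, end in spans)
        for name, seq in seqs.items()
    }


if __name__ == "__main__":
    seqs = read_fasta([">Sample1", "ABCDEFGHIJ", ">Sample2", "abcdefghij"])
    regions = read_regions(["ORF2, 2, 4, exon1", "ORF2, 8, 10, exon2"])
    for orf, spans in regions.items():
        for name, joined in join_regions(seqs, spans).items():
            print(f">{name}_{orf}")
            print(joined)
```

Because the regions are grouped by the first column, an ORF with one exon and an ORF with several exons are handled by the same code path; writing one outfile per ORF is then a loop over the grouped regions.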

Convert the position of a character in a string to account for "gaps" (i.e., non alphanumeric characters in the string)

In a nutshell
I have a string that looks something like this ...
---MNTSDSEEDACNERTALVQSESPSLPSYTRQTDPQHGTTEPKRAGHT--------LARGGVAAPRERD
And I have a list of positions and corresponding characters that looks something like this...
position character
10 A
12 N
53 V
54 A
This position/character key doesn't account for hyphen (-) characters in the string. So for example, in the given string the first letter M is in position 1, the N in position 2, the T in position 3, etc. The T preceding the second chunk of hyphens is position 47, and the L after that hyphen chunk is position 48.
I need to convert the list of positions and corresponding characters so that the position accounts for hyphen characters. Something like this...
position character
13 A
15 N
64 V
65 A
I think there should be a simple enough way to do this, but I am fairly new so I am probably missing something obvious, sorry about that! I am doing this as part of bigger script, so if anyone had a way to accomplish this using perl that would be amazing. Thank you so much in advance and please let me know if I can clarify anything or provide more information!
What I tried
At first, I took a substring of characters equal to the position value, counted the number of hyphens in that substring, and added the hyphen count onto the original position. So for the first position/character in my list, take the first 10 characters, and then there are 3 hyphens in that substring, so 10+3 = 13 which gives the correct position. This works for most of my positions, but fails when the original position falls within a bunch of hyphens like for positions 53 and 54.
I also tried grabbing the character by taking out the hyphens and then using the original position value like this...
my @array = ($string =~ /\w/g);
my $character = $array[$position];
which worked great, but then I was having a hard time using this to convert the position to include the hyphens because there are too many matching characters to match the character I grabbed here back to the original string with hyphens and find the position in that (this may have been a dumb thing to try from the start).
The actual character seems not to be relevant. It's enough to count the non-hyphens:
use strict;
use warnings;
use Data::Dumper;
my $s = '---MNTSDSEEDACNERTALVQSESPSLPSYTRQTDPQHGTTEPKRAGHT--------LARGGVAAPRERD';
my @positions = (10,12,53,54);
my @transformed = ();
my $start = 0;
for my $loc (@positions) {
    # number of non-hyphen characters still to consume
    my $dist = $loc - $start;
    while ($dist) {
        # each successful /g match advances pos($s) past one non-hyphen
        $dist-- if ($s =~ m/[^-]/g);
    }
    my $pos = pos($s);
    push @transformed, $pos;
    $start = $loc;
}
print Dumper \@transformed;
prints:
$VAR1 = [
13,
15,
64,
65
];
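The same mapping can also be computed without relying on regex position state. A minimal sketch in Python rather than Perl (the function name gapped_positions is made up for illustration), assuming 1-based positions and - as the only gap character:

```python
def gapped_positions(aligned, positions, gap="-"):
    """Map 1-based ungapped positions to 1-based columns in the gapped string."""
    # columns[k] is the column of the (k+1)-th non-gap character
    columns = [col for col, ch in enumerate(aligned, start=1) if ch != gap]
    return [columns[p - 1] for p in positions]


s = "---MNTSDSEEDACNERTALVQSESPSLPSYTRQTDPQHGTTEPKRAGHT--------LARGGVAAPRERD"
print(gapped_positions(s, [10, 12, 53, 54]))  # [13, 15, 64, 65]
```

Building the lookup list once means the positions do not have to be supplied in ascending order, which the loop above depends on.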

Splitting a PDL in half

I have a one-dimensional PDL that I'd like to perform calculations on each half of; i.e. split it, then do calculations on the first half, and the same calculations on the second half.
Is there an easier/nicer/elegant way to simply split the PDL in half than getting the number of elements (with nelem), dividing that in two, then doing two lots of slices?
Thanks
Yes, in so far as you don't need to directly invoke slice to get what you want. You could chain splitdim and dog with something like this:
# Assume we have $data, a piddle
my ($left, $right) = $data->splitdim(0, $data->nelem/2)->dog;
That, of course, is easily extended to more than two divisions. However, if you want to extend it to higher-dimensional piddles (i.e. a collection of time series all stored in one piddle), you would need to be a little more subtle. If you want to split along the first dimension (which has index 0), you would say this instead:
# Assume we have $data, a piddle
my ($left, $right) = $data->splitdim(0, $data->dim(0)/2)->mv(1, -1)->dog;
The splitdim operation splits the 0th dimension into two dimensions, the 0th being dim(0)/2 in length, the 1st being 2 in length (because we divided it into two pieces). Since dog operates on the last dimension, we move the 1st dimension to the end before invoking dog.
However, even with the single-dimensional solution, there's a caveat. Due to the way that $data->splitdim works, it will truncate the last piece of data if you have an odd number of elements. Try that operation on a piddle with 21 elements and you'll see what I mean:
my $data = sequence(20);
say "data is $data"; # lists 0-19
my ($left, $right) = $data->splitdim(0, $data->nelem/2)->dog;
say "left is $left and right is $right"; # lists 0-9, then 10-19
$data = sequence(21);
say "data is $data"; # lists 0-20, i.e. 21 elements
($left, $right) = $data->splitdim(0, $data->nelem/2)->dog; # reuse the variables declared above
say "left is $left and right is $right"; # lists 0-9, then 10-19!!
If you want to avoid that, you can produce your own method that splits the first dimension in half without truncation. It would probably look something like this:
sub PDL::split_in_half {
my $self = shift;
# the int() isn't strictly necessary, but should make things a
# tad faster
my $left = $self->slice(':' . int($self->dim(0)/2-1) );
my $right = $self->slice(int($self->dim(0)/2) . ':');
return ($left, $right);
}
Here I have also used the int built-in to make sure we don't have the .5 if dim(0) is odd. It's a little more complicated, but we're burying this complexity into a method precisely so we don't have to think about the complexity, so we may as well buy ourselves a few clock cycles while we're at it.
Then you could easily invoke the method thus:
my ($left, $right) = $data->split_in_half;

What does @data actually contain in PDF::Report::Table $table_write->addTable(@data);?

I think I've got the gist of creating a table using Perl's PDF::Report and PDF::Report::Table, but am having difficulty seeing what the 2-dimensional array @data would look like.
The documentation says it's a 2-dimensional array, but the example on CPAN just shows an array of arrays containing test1, test2, and so on; it does not show where formatting such as $padding, $bgcolor_odd, and so on goes.
Here's what I've done so far:
$main_rpt_path = "/home/ics/work/rpts/interim/mtr_prebill.rpt";
$main_rpt_pdf =
new PDF::Report('PageSize' => 'letter', 'PageOrientation' => 'Landscape',);
$main_rpt_tbl_wrt =
PDF::Report::Table->new($main_rpt_pdf);
Obviously, I can't pass a one dimensional array, but I have searched for examples and can only find the one in CPAN search.
Edit:
Here is how I am trying to call addTable:
$main_rpt_tbl_wrt->addTable(build_table_writer_array($pt_column_headers_ref, undef));
.
.
.
sub build_table_writer_array
# $data -- an array ref of data
# $format -- an array ref of formatting
#
# returns an array ref of a 2d array.
#
{
    my ($data, $format) = @_;
    my $out_data_table = undef;
    my @format_array = (10, 10, 0xFFFFFF, 0xFFFFCC);
    $out_data_table = [[@$data],];
    return $out_data_table;
}
and here is the error I'm getting.
Use of uninitialized value in subtraction (-) at /usr/local/share/perl5/PDF/Report/Table.pm line 88.
at /usr/local/share/perl5/PDF/Report/Table.pm line 88
I cannot figure out what addTable wants for data. That is I am wondering where the formatting is supposed to go.
Edit:
It appears the addTable call should look like
$main_rpt_tbl_wrt->addTable(build_table_writer_array($pt_column_headers_ref), 10, 10, 0xFFFFFF, 0xFFFFCC);
not the way I've indicated.
This looks like a bug in the module. I tried running the example code in the SYNOPSIS, and I got the same error you get. The module has no real tests, so it is no surprise that there would be bugs. You can report it on CPAN.
The POD has bugs, too.
You increase your chances of getting it fixed if you look at the source code and fix it yourself with a patch.

perl code taking values from three files and printing in another file

I have one file named 1.txt having values like:
a
b
c
...
Second file named 2.txt like this:
a 123,
a 156,
a 899,
c 255,
Third file named 3.txt like this:
a 236,
a 890,
b 123,
How can I read the values from all three files above and write my results in a single file like the one below:
a 123 236,
- 156 890,
- 899 -,
b - 123,
The files do not have the same number of lines, and the number of lines is about 10000. I have to use Perl for this.
I have to take the values from the first file.
I have to take the second file and I have to take the values of the second column of the second file corresponding to values of first file.
Similarly I have to take values from the third file.
And I have to write my results in an output file like
values from first file in first column, all the corresponding values from the second file in the second column of output-file and all the corresponding values from the third file in the third column of output-file.
Read the three files: the first can be read into a simple array, the other two into hashes where the hash value is an array (reference to an array).
Read through the first array in sorted order.
For each value in the first array, find the array from the second and third files (the corresponding hashes), sorting the array values into order. Handle the printing appropriately, dealing with missing values in the second and third columns and repeated values in the first column specially.
I would start by reading the contents of 2.txt and storing it in a hash of arrayrefs. For example, if 2.txt is this:
a 123,
a 156,
a 899,
c 255,
then the data structure would be this:
{ a => [123, 156, 899],
c => [255]
}
Then, do the same for 3.txt.
Only then does it make sense to read through 1.txt. For each line of 1.txt, you would look up the appropriate arrays from the above data structures, and iterate over those arrays (using something like for(my $i = 0; $i < @two || $i < @three; ++$i), so that the loop runs to the end of the longer array and a - can be printed where the shorter one has run out), printing your results.
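The approach described in these answers can be sketched as follows. This is a minimal sketch in Python rather than Perl; the names parse_pairs and merge_rows are made up for illustration. It assumes values keep their file order, that the key is printed only on the first row of its group, and that a key with no values in either file still produces one row of dashes (the question does not say what should happen in that case).

```python
from collections import defaultdict
from itertools import zip_longest


def parse_pairs(lines):
    """Build {key: [values...]} from lines such as 'a 123,'."""
    table = defaultdict(list)
    for line in lines:
        fields = line.strip().rstrip(",").split()
        if len(fields) == 2:
            table[fields[0]].append(fields[1])
    return table


def merge_rows(keys, second, third, missing="-"):
    """Yield output rows, padding the shorter value list with `missing`."""
    for key in keys:
        twos = second.get(key, [])
        threes = third.get(key, [])
        # assumption: a key without any values still gets one row
        pairs = list(zip_longest(twos, threes, fillvalue=missing)) or [(missing, missing)]
        for i, (two, three) in enumerate(pairs):
            label = key if i == 0 else missing
            yield f"{label} {two} {three},"


if __name__ == "__main__":
    keys = ["a", "b", "c"]
    second = parse_pairs(["a 123,", "a 156,", "a 899,", "c 255,"])
    third = parse_pairs(["a 236,", "a 890,", "b 123,"])
    for row in merge_rows(keys, second, third):
        print(row)
```

zip_longest plays the role of the index loop that runs to the end of the longer array: it keeps pairing values until both lists are exhausted and fills the gaps with the placeholder.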

Python 3.2 lxml fill and submit form, select multiple, how to do it? value not working

Great page, this one. I come from the Perl world and, after several years of doing nothing, I have started programming again (this web page didn't exist back then; how things change). Now, after two full days of searching, I am playing my last card and asking here for help.
Working under mac environment, with python 3.2 and lxml 2.3 (installed following www.jtmoon.com/?p=21), what I am trying to do:
web: http://biodbnet.abcc.ncifcrf.gov/db/db2db.php
to fill the form that you find there
to submit it
My code. I put several attempts and the output code.
from lxml.html import parse, submit_form, tostring
page = parse('http://biodbnet.abcc.ncifcrf.gov/db/db2db.php').getroot()
page.forms[0].fields['input'] = 'GI Number'
page.forms[0].inputs['outputs[]'].value = 'Gene ID'
page.forms[0].fields['hasComma'] = 'no'
page.forms[0].fields['removeDupValues'] = 'yes'
page.forms[0].fields['request'] = 'db2db'
page.forms[0].action = 'http://biodbnet.abcc.ncifcrf.gov/db/db2dbRes.php'
page.forms[0].fields['idList'] = '86439006'
submit_form(page.forms[0])
Output:
File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 30, in <module>
page.forms[0].inputs['outputs[]'].value = 'Gene ID'
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1058, in _value__set
"You must pass in a sequence")
TypeError: You must pass in a sequence
So, since that element is a multi-select element, I understand that I have to give a list
page.forms[0].inputs['outputs[]'].value = list('Gene ID')
Output:
File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 30, in <module>
page.forms[0].inputs['outputs[]'].value = list('Gene ID')
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1059, in _value__set
self.value.clear()
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/_setmixin.py", line 115, in clear
self.remove(item)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 1159, in remove
"The option %r is not currently selected" % item)
ValueError: The option 'Affy ID' is not currently selected
'Affy ID' is the first option value of the list, and it is not selected. But what's the problem with it?
Surprisingly, if I instead put
page.forms[0].inputs['outputs[]'].multiple = list('Gene ID')
#page.forms[0].inputs['outputs[]'].value = list('Gene ID')
Then, somehow lxml likes it and moves on. However, the multiple attribute should be a boolean (it actually is, if I print the value), so I shouldn't touch it, and the "value" of the item should actually point to the selected items, according to the lxml docs.
The new output
File "/Users/gerard/Desktop/barbacue/MGFtoXML.py", line 87, in <module>
submit_form(page.forms[0])
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 856, in submit_form
return open_http(form.method, url, values)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/site-packages/lxml/html/__init__.py", line 876, in open_http_urllib
return urlopen(url, data)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 138, in urlopen
return opener.open(url, data, timeout)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 364, in open
req = meth(req)
File "/Library/Frameworks/Python.framework/Versions/3.2/lib/python3.2/urllib/request.py", line 1052, in do_request_
raise TypeError("POST data should be bytes"
TypeError: POST data should be bytes or an iterable of bytes. It cannot be str.
So, what can be done? I am sure that with Python 2.6 I could use mechanize, or that perhaps lxml could work. But I really don't want to code in a sort-of deprecated version. I am enjoying Python a lot, but I am starting to consider going back to Perl. Perhaps this would be a smart move?
Any help will be hugely appreciated
Gerard
Reading in this forum, I find pythonpaste.org, could it be a replacement for lxml?
Passing a sequence to list() will generate a list from that sequence. 'Gene ID' is a sequence (namely a sequence of characters). So list('Gene ID') will generate a list of characters, like so:
>>> list('Gene ID')
['G', 'e', 'n', 'e', ' ', 'I', 'D']
That's not what you want. Try this:
>>> ['Gene ID']
['Gene ID']
In other words:
page.forms[0].inputs['outputs[]'].value = ['Gene ID']
That should take you a bit forward.
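The last traceback in the question ("POST data should be bytes") comes from urllib on Python 3, which refuses a str request body. A minimal sketch of producing a bytes body with the standard library follows; the helper name encode_form is made up for illustration, and how to hook it into lxml's submit_form is not shown here and would need checking against the lxml version in use.

```python
from urllib.parse import urlencode


def encode_form(values):
    """URL-encode form values and return bytes suitable as a POST body."""
    # doseq=True expands list values into one key=value pair per element,
    # which is what a multi-select field such as 'outputs[]' needs
    return urlencode(values, doseq=True).encode("ascii")


fields = [
    ("input", "GI Number"),
    ("outputs[]", ["Gene ID"]),
    ("idList", "86439006"),
]
body = encode_form(fields)
print(body)  # b'input=GI+Number&outputs%5B%5D=Gene+ID&idList=86439006'
```

A body built this way can be passed as the data argument of urllib.request.urlopen, which is the call that raised the TypeError in the traceback.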