BioPerl with clustalw - outputting file - perl

I have a perl script to automate many multiple alignments (I'm making the script first with only one file and one multiple alignment - big one though. I can then modify for multiple files) and I want to output the resulting file, but I am unsure on how to do with with AlignIO: so far I have:
use warnings;
use strict;
use Bio::AlignIO;
use Bio::SeqIO;
use Bio::Tools::Run::Alignment::Clustalw;
my $file = shift or die; # Get filename from command prompt.
my $factory = Bio::Tools::Run::Alignment::Clustalw->new(-matrix => 'BLOSUM');
my $ktuple = 3;
$factory->ktuple($ktuple);
my $inseq = Bio::SeqIO->new(
-file => "<$file",
-format => $format
);
my $seq;
my #seq_array;
while ($seq = $inseq->next_seq) {
push(#seq_array, $seq);
}
# Now we do the actual alignment.
my $seq_array_ref = \#seq_array;
my $aln = $factory->align($seq_array_ref);
Once the alignment is done I have $aln which is the alignment I want to get out of the process as a fasta file - I tried something like:
my $out = Bio::AlignIO->new(-file => ">outputalignmentfile",
-format => 'fasta');
while( my $outaln = $aln->next_aln() ){
$out->write_aln($outaln);
}
but it didn't work, presumably because the method next_aln() only applies to AlignIO things, which $aln is probably not. So I need to know what it is that is generated by the line my $aln = $factory->align($seq_array_ref); and how to get the aligned sequences output into a file. My next step is tree estimation or network analysis.
Thanks,
Ben.

$out->write_aln($outaln);
Was the only line needed to write the object returned by the clustalw line to output the object to that stream.

Related

Remove duplicate lines on file by substring - preserve order (PERL)

i m trying to write a perl script to deal with some 3+ gb text files, that are structured like :
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212123x534534534534xx4545454x232323xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
I want to perform two operations :
Count the number of delimiters per line and compare it to a static number (ie 5), those lines that exceed said number should be output to a file.control.
Remove duplicates on the file by substring($line, 0, 7) - first 7 numbers, but i want to preserve order. I want the output of that in a file.output.
I have coded this in simple shell script (just bash), but it took too long to process, the same script calling on perl one liners was quicker, but i m interested in a way to do this purely in perl.
The code i have so far is :
open $file_hndl_ot_control, '>', $FILE_OT_CONTROL;
open $file_hndl_ot_out, '>', $FILE_OT_OUTPUT;
# INPUT.
open $file_hndl_in, '<', $FILE_IN;
while ($line_in = <$file_hndl_in>)
{
# Calculate n. of delimiters
my $delim_cur_line = $line_in =~ y/"$delimiter"//;
# print "$commas \n"
if ( $delim_cur_line != $delim_amnt_per_line )
{
print {$file_hndl_ot_control} "$line_in";
}
# Remove duplicates by substr(0,7) maintain order
my substr_in = substr $line_in, 0, 11;
print if not $lines{$substr_in}++;
}
And i want the file.output file to look like
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx
and the file.control file to look like :
(assuming delimiter control number is 6)
4352342xx23232xxx345545x45454x23232xxx
Could someone assist me? Thank you.
Posting edits : Tried code
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
open(my $fh1, ">>", "outputcontrol.txt");
open(my $fh2, ">>", "outputoutput.txt");
while ( <> ) {
my $count = ($_ =~ y/x//);
print "$count \n";
# print $_;
if ( $count != $delim_amnt_per_line )
{
print fh1 $_;
}
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print fh2;
}
I dont know if i m supposed to post new code in here. But i tried the above, based on your example. What baffles me (i m still very new in perl) is that it doesnt output to either filehandle, but if i redirected from the command line just as you said, it worked perfect. The problem is that i need to output into 2 different files.
It looks like entries with the same seven-character prefix may appear anywhere in the file, so it's necessary to use a hash to keep track of which ones have already been encountered. With a 3GB text file this may result in your perl process running out of memory, in which case a different approach is necessary. Please give this a try and see if it comes in under the bar
The tr/// operator (the same as y///) doesn't accept variables for its character list, so I've used eval to create a subroutine delimiters() that will count the number of occurrences of $delimiter in $_
It's usually easiest to pass the input file as a parameter on the command line, and redirect the output as necessary. That way you can run your program on different files without editing the source, and that's how I've written this program. You should run it as
$ perl filter.pl my_input.file > my_output.file
use strict;
use warnings 'all';
my %seen;
my $delimiter = 'x';
my $delim_amnt_per_line = 5;
eval "sub delimiters { tr/$delimiter// }";
while ( <> ) {
next if delimiters() == $delim_amnt_per_line;
my ($prefix) = substr $_, 0, 7;
next if $seen{$prefix}++;
print;
}
output
1212123x534534534534xx4545454x232322xx
0901001x876879878787xx0909918x212245xx
1212133x534534534534xx4549454x232322xx
4352342xx23232xxx345545x45454x23232xxx

Retain quotes on CSV fields that were quoted in the input

I have a CSV file such that a few of the fields are quoted regardless of whether they need to be. What I wish to do is load this file, modify a few of the values, and produce the modified CSV with the quoted fields intact.
I'm currently using Perl's Text::CSV package to attempt to solve this problem, but have ran into a bit of a roadblock. The following is a small test script to demonstrate the problem:
use Text::CSV;
my $csv = Text::CSV->new ({'binary' => 1, 'allow_loose_quotes' => 1, 'keep_meta_info' => 1});
my $line = q^hello,"world"^;
print qq^input: $line\n^;
$csv->parse($line);
my #flds = $csv->fields();
$csv->combine(#flds);
print 'output: ', $csv->string(), "\n";
produces:
input: hello,"world"
output: hello,world
According to Text::CSV's documentation, an is_quoted() function exists to test if a field had been quoted in the input, but if I use this to add surrounding quotes to a field, I get unexpected results:
my $csv = Text::CSV->new ({'binary' => 1, 'allow_loose_quotes' => 1, 'keep_meta_info' => 1});
my $line = q^hello,"world"^;
print qq^input: $line\n^;
$csv->parse($line);
my #flds = $csv->fields();
for my $idx (0..$#flds) {
if ($csv->is_quoted($idx)) {
$flds[$idx] = qq^"$flds[$idx]"^;
}
}
$csv->combine(#flds);
print 'output: ', $csv->string(), "\n";
Producing:
input: hello,"world"
output: hello,"""world"""
where I believe the quotes I've added before the combine() are being seen as part of the field, and so are being escaped with a second double quote as combine() is processing.
What would be the best way to ensure quoted fields are left intact from input to output? I'm not certain the application will accept always_quote'ed fields... Is there some combination of Text::CSV object attributes that will allow for keeping quotes intact? Or perhaps am I left with adjusting the record post-combine?
It's a shame but it appears that while keep_meta_info gives you access to the metadata there's no option to tell Text::CSV to reapply the is_quoted state on output.
Depending on how complex your record is you could just reassemble it yourself. But then you'd have to cope with changes to string fields that were previously safely unquoted but after your processing now require quotes. That will depend on the types of changes you introduce, i.e. whether or not you ever expect that a previously "safe" string value will become unsafe. If the answer is "never" (i.e. 0.00000% chance), then you should just do the reassembly yourself and document what you've done.
Post-processing would require that you CSV-parse the string to handle the possibility of commas and other unsafe characters inside strings, so that may not be an option.
Or, you could dive into the code for Text::CSV and implement the desired functionality. I.e. allow the user to force quoting of a specific field on output. I played around with it, and it looks like part of the required mechanism might be in place but unfortunately all I have access to is the XS version, which delegates to native code, so I can't delve deeper at this time. This is as far as I got:
Original combine method. Note the setting of _FFLAGS to undef.
sub combine
{
my $self = shift;
my $str = "";
$self->{_FIELDS} = \#_;
$self->{_FFLAGS} = undef;
$self->{_STATUS} = (#_ > 0) && $self->Combine (\$str, \#_, 0);
$self->{_STRING} = \$str;
$self->{_STATUS};
} # combine
My attempt. I guessed that the second argument to Combine might be the flags, but since the (lowercase) combine API is based on receiving an array and not an arrayref, there's no way to pass two arrays in. I changed it to expect two arrayrefs and tried passing the second to Combine but that failed with "Can't call method "print" on unblessed reference".
sub combine2
{
my $self = shift;
my $str = "";
my $f = shift;
my $g = shift;
$self->{_FIELDS} = $f;
$self->{_FFLAGS} = $g;
$self->{_STATUS} = (#$f > 0) && $self->Combine (\$str, $f, $g);
$self->{_STRING} = \$str;
$self->{_STATUS};
} # combine

Perl - Need to append duplicates in a file and write unique value only

I have searched a fair bit and hope I'm not duplicating something someone has already asked. I have what amounts to a CSV that is specifically formatted (as required by a vendor). There are four values that are being delimited as follows:
"Name","Description","Tag","IPAddresses"
The list is quite long (and there are ~150 unique names--only 2 in the sample below) but it basically looks like this:
"2B_AppName-Environment","desc","tag","192.168.1.1"
"2B_AppName-Environment","desc","tag","192.168.22.155"
"2B_AppName-Environment","desc","tag","10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4"
"6G_ServerName-AltEnv","desc","tag","192.192.192.40"
"6G_ServerName-AltEnv","desc","tag","192.168.50.5"
I am hoping for a way in Perl (or sed/awk, etc.) to come up with the following:
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
So basically, the resulting file will APPEND the duplicates to the first match -- there should only be one line per each app/server name with a list of comma-separated IP addresses just like what is shown above.
Note that the "Decription" and "Tag" fields don't need to be considered in the duplication removal/append logic -- let's assume these are blank for the example to make things easier. Also, in the vendor-supplied list, the "Name" entries are all already sorted to be together.
This short Perl program should suit you. It expects the path to the input CSV file as a parameter on the command line and prints the result to STDOUT. It keeps track of the appearance of new name fields in the #names array so that it can print the output in the order that each name first appears, and it takes the values for desc and tag from the first occurrence of each unique name.
use strict;
use warnings;
use Text::CSV;
my $csv = Text::CSV->new({always_quote => 1, eol => "\n"});
my (#names, %data);
while (my $row = $csv->getline(*ARGV)) {
my $name = $row->[0];
if ($data{$name}) {
$data{$name}[3] .= ','.$row->[3];
}
else {
push #names, $name;
$data{$name} = $row;
}
}
for my $name (#names) {
$csv->print(*STDOUT, $data{$name});
}
output
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
Update
Here's a version that ignores any record that doesn't have a valid IPv4 address in the fourth field. I've used Regexp::Common as it's the simplest way to get complex regex patterns right. It may need installing on your system.
use strict;
use warnings;
use Text::CSV;
use Regexp::Common;
my $csv = Text::CSV->new({always_quote => 1, eol => "\n"});
my (#names, %data);
while (my $row = $csv->getline(*ARGV)) {
my ($name, $address) = #{$row}[0,3];
next unless $address =~ $RE{net}{IPv4};
if ($data{$name}) {
$data{$name}[3] .= ','.$address;
}
else {
push #names, $name;
$data{$name} = $row;
}
}
for my $name (#names) {
$csv->print(*STDOUT, $data{$name});
}
I would advise you to use a CSV parser like Text::CSV for this type of problem.
Borodin has already pasted a good example of how to do this.
One of the approaches that I'd advise you NOT to use are regular expressions.
The following one-liner demonstrates how one could do this, but this is a very fragile approach compared to an actual csv parser:
perl -0777 -ne '
while (m{^((.*)"[^"\n]*"\n(?:(?=\2).*\n)*)}mg) {
$s = $1;
$s =~ s/"\n.*"([^"\n]+)(?=")/,$1/g;
print $s
}' test.csv
Outputs:
"2B_AppName-Environment","desc","tag","192.168.1.1,192.168.22.155,10.20.30.40"
"6G_ServerName-AltEnv","desc","tag","1.2.3.4,192.192.192.40,192.168.50.5"
Explanation:
Switches:
-0777: Slurp the entire file
-n: Creates a while(<>){...} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Code:
while (m{^((.*)"[^"]*"\n(?:(?=\2).*\n)*)}mg): Separate text into matching sections.
$s =~ s/"\n.*"([^"\n]+)(?=")/,$1/g;: Join all ip addresses by a comma in matching sections.
print $s: Print the results.

how to put a file into an array and save it in perl

Hello everyone I'm a beginner in perl and I'm facing some problems as I want to put my strings starting from AA to \ in to an array and want to save it. There are about 2000-3000 strings in a txt file starting from same initials i.e., AA to / I'm doing it by this way plz correct me if I'm wrong.
Input File
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\
Source code
$flag = 0
while ($line = <ifh>)
{
if ( $line = m//\/g)
{
$flag = 1;
}
while ( $flag != 0)
{
for ($i = 0; $i <= 10000; $i++)
{ # Missing brace added by editor
$array[$i] = $line;
} # Missing brace added by editor
}
} # Missing close brace added by editor; position guessed!
print $ofh, $line;
close $ofh;
Welcome to StackOverflow.
There are multiple issues with your code. First, please post compilable Perl; I had to add three braces to give it the remotest chance of compiling, and I had to guess where one of them went (and there's a moderate chance it should be on the other side of the print statement from where I put it).
Next, experts have:
use warnings;
use strict;
at the top of their scripts because they know they will miss things if they don't. As a learner, it is crucial for you to do the same; it will prevent you making errors.
With those in place, you have to declare your variables as you use them.
Next, remember to indent your code. Doing so makes it easier to comprehend. Perl can be incomprehensible enough at the best of times; don't make it any harder than it has to be. (You can decide where you like braces - that is open to discussion, though it is simpler to choose a style you like and stick with it, ignoring any discussion because the discussion will probably be fruitless.)
Is the EB vs VB in the data significant? It is hard to guess.
It is also not clear exactly what you are after. It might be that you're after an array of entries, one for each block in the file (where the blocks end at the line containing just a backslash), and where each entry in the array is a hash keyed by the first two letters (or first word) on the line, with the remainder of the line being the value. This is a modestly complex structure, and probably beyond what you're expected to use at this stage in your learning of Perl.
You have the line while ($line = <ifh>). This is not invalid in Perl if you opened the file the old fashioned way, but it is not the way you should be learning. You don't show how the output file handle is opened, but you do use the modern notation when trying to print to it. However, there's a bug there, too:
print $ofh, $line; # Print two values to standard output
print $ofh $line; # Print one value to $ofh
You need to look hard at your code, and think about the looping logic. I'm sure what you have is not what you need. However, I'm not sure what it is that you do need.
Simpler solution
From the comments:
I want to flag each record starting from AA to \ as record 0 till record n and want to save it in a new file with all the record numbers.
Then you probably just need:
#!/usr/bin/env perl
use strict;
use warnings;
my $recnum = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
print "$_\n";
$recnum++;
}
else
{
print "$recnum $_\n";
}
}
This reads from the files specified on the command line (or standard input if there are none), and writes the tagged output to standard output. It prefixes each line except the 'end of record' marker lines with the record number and a space. Choose your output format and file handling to suit your needs. You might argue that the chomp is counter-productive; you can certainly code the program without it.
Overly complex solution
Developed in the absence of clear direction from the questioner.
Here is one possible way to read the data, but it uses moderately advanced Perl (hash references, etc). The Data::Dumper module is also useful for printing out Perl data structures (see: perldoc Data::Dumper).
#!/usr/bin/env perl
use strict;
use warnings;
use Data::Dumper;
my #data;
my $hashref = { };
my $nrecs = 0;
while (<>)
{
chomp;
if (m/^\\$/)
{
# End of group - save to data array and start new hash
$data[$nrecs++] = $hashref;
$hashref = { };
}
else
{
m/^([A-Z]+)\s+(.*)$/;
$hashref->{$1} = $2;
}
}
foreach my $i (0..$nrecs-1)
{
print "Record $i:\n";
foreach my $key (sort keys $data[$i])
{
print " $key = $data[$i]->{$key}\n";
}
}
print Data::Dumper->Dump([ \#data ], [ '#data' ]);
Sample output for example input:
Record 0:
AA = c0001
BB = afsfjgfjgjgjflffbg
CC = table
DD = hhhfsegsksgk
EB = jksgksjs
Record 1:
AA = e0002
BB = rejwkghewhgsejkhrj
CC = chair
DD = egrhjrhojohkhkhrkfs
VB = rkgjehkrkhkh;r
$#data = [
{
'EB' => 'jksgksjs',
'CC' => 'table',
'AA' => 'c0001',
'BB' => 'afsfjgfjgjgjflffbg',
'DD' => 'hhhfsegsksgk'
},
{
'CC' => 'chair',
'AA' => 'e0002',
'VB' => 'rkgjehkrkhkh;r',
'BB' => 'rejwkghewhgsejkhrj',
'DD' => 'egrhjrhojohkhkhrkfs'
}
];
Note that this data structure is not optimized for searching except by record number. If you need to search the data in some other way, then you need to organize it differently. (And don't hand this code in as your answer without understanding it all - it is subtle. It also does no error checking; beware faulty data.)
It can't be right. I can see two main issues with your while-loop.
Once you enter the following loop
while ( $flag != 0)
{
...
}
you'll never break out because you do not reset the flag whenever you find an break-line. You'll have to parse you input and exit the loop if necessary.
And second you never read any input within this loop and thus process the same $line over and over again.
You should not put the loop inside your code but instead you can use the following pattern (pseudo-code)
if flag != 0
append item to array
else
save array to file
start with new array
end
I believe what you want is to split the files content at \ though it's not too clear.
To achieve this you can slurp the file into a variable by setting the input record separator, then split the content.
To find out about Perl's special variables related to filehandlers read perlvar
#!perl
use strict;
use warnings;
my $content;
{
open my $fh, '<', 'test.txt';
local $/; # slurp mode
$content = <$fh>;
close $fh;
}
my #blocks = split /\\/, $content;
Make sure to localize modifications of Perl's special variables to not interfere with different parts of your program.
If you want to keep the separator you could set $/ to \ directly and skip split.
#!perl
use strict;
use warnings;
my #blocks;
{
open my $fh, '<', 'test.txt';
local $/ = '\\'; # seperate at \
#blocks = <$fh>;
close $fh;
}
Here's a way to read your data into an array. As I said in a comment, "saving" this data to a file is pointless, unless you change it. Because if I were to print the #data array below to a file, it would look exactly like the input file.
So, you need to tell us what it is you want to accomplish before we can give you an answer about how to do it.
This script follows these rules (exactly):
Find a line that begins with "AA",
and save that into $line
Concatenate every new line from the
file into $line
When you find a line that begins with
a backslash \, stop concatenating
lines and save $line into #data.
Then, find the next line that begins
with "AA" and start the loop over.
These matching regexes are pretty loose, as they will match AAARGH and \bonkers as well. If you need them stricter, you can try /^\\$/ and /^AA$/, but then you need to watch out for whitespace at the beginning and end of line. So perhaps /^\s*\\\s*$/ and /^\s*AA\s*$/ instead.
The code:
use warnings;
use strict;
my $line="";
my #data;
while (<DATA>) {
if (/^AA/) {
$line = $_;
while (<DATA>) {
$line .= $_;
last if /^\\/;
}
}
push #data, $line;
}
use Data::Dumper;
print Dumper \#data;
__DATA__
AA c0001
BB afsfjgfjgjgjflffbg
CC table
DD hhhfsegsksgk
EB jksgksjs
\
AA e0002
BB rejwkghewhgsejkhrj
CC chair
DD egrhjrhojohkhkhrkfs
VB rkgjehkrkhkh;r
\

Extract zip Files on cmd with progress indicator

I am looking for a program, which is able to extract zip archives via the windows commandline and that is able to display a progressbar or a percentage indicator on the cmd. I want to use this from within a Perl script and so give the user a hint how long the progress will take. I tried 7zip(http://www.7-zip.org/) and Unzip(from InfoZIP) so far, but was not able to produce the behaviour described above. Does somebody know how to solve this?
Update:
Currently i'm trying it with this approach:
#!/usr/bin/perl
use strict; $|++;
use warnings;
use Archive::Zip;
my $zip = Archive::Zip->new('file.zip');
my $total_bytes = 0;
my $bytes_already_unzipped = 0;
foreach my $member ($zip->members()) {
$total_bytes += $member->uncompressedSize();
}
foreach my $member ($zip->members()) {
$zip->extractMember($member);
$bytes_already_unzipped += $member->uncompressedSize();
print progress_bar($bytes_already_unzipped, $total_bytes, 25, '=' );
}
#routine by tachyon at http://tachyon.perlmonk.org/
#also have a look at http://oreilly.com/pub/h/943
sub progress_bar {
my ( $got, $total, $width, $char ) = #_;
$width ||= 25; $char ||= '=';
my $num_width = length $total;
sprintf "|%-${width}s| Got %${num_width}s bytes of %s (%.2f%%)\r",
$char x (($width-1)*$got/$total). '>',
$got, $total, 100*$got/+$total;
}
However i have two problems:
this approach seems to be very slow
i do not have a periodic update in the progress bar, but only when a file is finished beeing extracted. As i have some large files, the system seems to not respond while extracting them
Do the extraction from within your program instead of delegating to a different one. Use Archive::Zip and Term::ProgressBar. Extract files one by one. Update the progress after each.