EMBOSS cons for getting a consensus sequence for many files, not just one - perl

I installed and configured EMBOSS and can run the simple command-line steps for getting the consensus of one previously aligned multifasta file:
% cons
Create a consensus sequence from a multiple alignment
Input (aligned) sequence set: dna.msf
output sequence [dna.fasta]: aligned.cons
This is perfect for dealing with one file at a time, but I have hundreds to process.
I have started to write a Perl script with a foreach loop to try and process this for every file, but I guess I need to be outside of the script to run these commands. Any clue on how I can run a command-line-friendly program for getting a single consensus sequence in FASTA format from a previously aligned multifasta file, for many files in succession? I don't have to use EMBOSS - I could use another program.
Here is my code so far:
#!/usr/bin/perl
use warnings;
use strict;
my $dir = ("/Users/roblogan/Documents/Clustered_Barcodes_Aligned");
my #ArrayofFiles = glob "$dir/*"; #put all files in the directory into an array
#print join("\n", #ArrayofFiles), "\n"; #diagnostic print
foreach my $file (#ArrayofFiles){
print 'cons', "\n";
print "/Users/roblogan/Documents/Clustered_Barcodes_Aligned/Clustered_Barcode_Number_*.*.Sequences.txt.out", "\n";
print "*.*.Consensus.txt", "\n";
}

EMBOSS cons has two mandatory qualifiers:
- -sequence (to provide the input sequence)
- -outseq (for the output).
So you need to provide both of those fields.
Now change your code a little to run the program over multiple files:
my $count = 1;
foreach my $file (@ArrayofFiles){
    my $output_path = "/Users/roblogan/Documents/Clustered_Barcodes_Aligned/";
    my $output_file = $output_path . "out$count"; # please change here to get your desired output filename
    my $command = "cons -sequence '$file' -outseq '$output_file'";
    system($command);
    $count++;
}
Hope the above code will work for you.
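If some of the runs fail silently, it may also help to check the exit status that system returns inside the loop, for example (a small optional addition, assuming cons is on your PATH):
system($command) == 0
    or warn "cons failed for '$file': exit status $?\n";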

Related

Perl script -- Multiple text file parsing and writing

Suppose I have this directory full of text files (raw text). What I need is a Perl script that will parse the directory's text files one by one (top to bottom) and save their contents in a new single file, appointed by me. In other words, I simply want to create a corpus of many documents. Note: these documents have to be separated by some tag, e.g. indicating the sequence in which they were parsed.
So far I have managed to follow some examples and I know how to read, write and parse text files. But I am not yet in a position to merge them into one script and handle many text files. Can you please provide some assistance? Thanks
edit:
example code for writing to a file.
#!/usr/local/bin/perl
open (MYFILE, '>>data.txt');
print MYFILE "text\n";
close (MYFILE);
example code for reading a file.
#!/usr/local/bin/perl
open (MYFILE, 'data.txt');
while (<MYFILE>) {
chomp;
print "$_\n";
}
close (MYFILE);
I've also found out about foreach, which can be used for tasks like this, but I still don't know how to combine them and achieve the result explained in the description.
The important points in this suggestion are:
the "magic" diamond operator (a.k.a. readline), which reads from each file in *ARGV,
the eof function, which tells if the next readline on the current filehandle will return any data
the $ARGV variable, that contains the name of the currently opened file.
With that intro, here we go!
#!/usr/bin/perl
use strict; # Always!
use warnings; # Always!
my $header = 1;          # Flag to tell us to print the header

while (<>) {             # read a line from a file
    if ($header) {
        # This is the first line, print the name of the file
        print "========= $ARGV ========\n";
        # reset the flag to a false value
        $header = undef;
    }
    # Print out what we just read in
    print;
}
continue {               # This happens before the next iteration of the loop
    # Check if we finished the previous file
    $header = 1 if eof;
}
To use it, just do: perl concat.pl *.txt > compiled.TXT

Defining Hash Values and Keys and Using Multiple Different Files

I am struggling with writing a Perl program for several tasks. I have tried really hard to review all errors since I am a beginner and want to understand my mistakes, but I am failing. Hopefully, my description of the tasks and my deficient program so far will not be confusing.
In my current directory, I have a variable number of “.txt” files. (I can have 4, 5, 8, or any number of files. However, I don’t think I will get more than 17 files.) The format of the “.txt” files is the same. There are six columns, which are separated by white space. I only care about two columns in these files: the second column, which is the coral reef regionID (made up of letters and numbers), and the fifth column, which is the p-value. The number of rows in each file is undetermined. What I need to do is find all the common regionIDs in all .txt files and print these common regions to an outfile. However, before printing, I must sort them.
The following is my program so far, but I have received error messages, which I have included after the program. Thus, my definitions of variables are the major problems. I really appreciate any suggestions for writing the program and thank you for your patience with a beginner like me.
UPDATE: I have declared the variables as suggested. After reviewing my program, two syntax errors appear.
syntax error at oreg.pl line 19, near "$hash{"
syntax error at oreg.pl line 23, near "}"
Execution of oreg.pl aborted due to compilation errors.
Here is an excerpt of the edited program that includes where said errors are.
#!/usr/bin/perl
use strict;
use warnings;

# Trying to read files in @txtfiles for reading into hash
foreach my $file (@txtfiles) {
    open(FH,"<$file") or die "Can't open $file\n";
    while(chomp(my $line = <FH>)){
        $line =~ s/^\s+//;
        my @IDp = split(/\s+/, $line); # split the line on whitespace
        my $i = 0;
        # trying to define values and keys in terms of array elements in IDp
        my $value = my $hash{$IDp[$i][1]};
        $value .= "$IDp[$i][4]"; # confused here at format to append p-values
        $i++;
    }
}
close(FH);
These are past errors:
Global symbol "$file" requires explicit package name at oreg.pl line 13.
Global symbol "$line" requires explicit package name at oreg.pl line 16.
#[And many more just like that...]
Execution of oreg.pl aborted due to compilation errors.
You didn't declare $file.
foreach my $file (@txtfiles) {
You didn't declare $line.
while(chomp(my $line = <FH>)){
etc.
use strict;
use warnings;

my %region;
foreach my $file (@txtfiles) {
    open my $FH, "<", $file or die "Can't open $file \n";
    while (my $line = <$FH>) {
        chomp($line);
        my @values = split /\s+/, $line;
        my $regionID = $values[1]; # 2nd column, per your notes
        my $pvalue   = $values[4]; # 5th column, per your notes
        $region{$regionID} //= []; # Inits this value in the hash to an empty arrayref if undefined
        push @{$region{$regionID}}, $pvalue;
    }
}
# Now sort and print using %region as needed
# Now sort and print using %region as needed
At the end of this code, %region is a hash where the keys are the region IDs and the values are array references containing the various pvalues.
Here's a few snippets that may help you with next steps:
keys %region will give you a list of region id values.
my @pvals = @{$region{SomeRegionID}} will give you the list of pvalues for SomeRegionID
$region{SomeRegionID}->[0] will give you the first pvalue for that region.
You may want to check out Data::Printer or Data::Dumper - they are CPAN modules that will let you easily print out your data structure, which might help you understand what's going on in your code.
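To finish the original task (find the region IDs common to all files, sort them, and write them to an outfile), here is a minimal sketch built on the %region hash above. It assumes @txtfiles holds the input file names, that each region ID appears at most once per file (so a "common" region is one that collected a p-value from every file), and the output filename is just a placeholder:
my $file_count = scalar @txtfiles;
my @common     = sort grep { scalar @{ $region{$_} } == $file_count } keys %region;

open my $out, '>', 'common_regions.txt' or die "Can't open output: $!";
print {$out} "$_\n" for @common;
close $out;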

How to open/join more than one file (depending on user input) and then use 2 files simultaneously

EDIT: Sorry for the misunderstanding, I have edited a few things, to hopefully actually request what I want.
I was wondering if there was a way to open/join two or more files to run the rest of the program on.
For example, my directory has these files:
taggedchpt1_1.txt, parsedchpt1_1.txt, taggedchpt1_2.txt, parsedchpt1_2.txt etc...
The program must use a tagged and a parsed file simultaneously. I want to run the program on both chpt1_1 and chpt1_2, preferably joined together in one .txt file, unless it would be very slow to do so. For instance, run what would be accomplished by having two files:
taggedchpt1_1_and_chpt1_2 and parsedchpt1_1_and_chpt1_2
Can this be done through Perl? Or should I just combine the text files myself (or automate that process, making chpt1.txt, which would include chpt1_1, chpt1_2, chpt1_3 etc...)?
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
print "Please type in the chapter and section NUMBERS in the form chp#_sec#:\n"; ##So the user inputs 31_3, for example
chomp (my $chapter_and_section = "chpt".<>);
print "Please type in the search word:\n";
chomp (my $search_key = <>);
open(my $tag_corpus, '<', "tagged${chapter_and_section}.txt") or die $!;
open(my $parse_corpus, '<', "parsed${chapter_and_section}.txt") or die $!;
For the rest of the program to work, I need to be able to have:
my @sentences = <$tag_corpus>; ## right now this is one file, I want to make it more
my @typeddependencies = <$parse_corpus>; ## same as above
EDIT2: Really sorry about the misunderstanding. In the program, after the steps shown, I do 2 for loops, reading through the lines of the tagged and parsed files.
What I want is to accomplish this with more files from the same directory, without having to re-input the next files. (ie. I can run taggedchpt31_1.txt and parsedchpt31_1.txt...... I want to run taggedchpt31 and parsedchpt31 - which includes ~chpt31_1, ~chpt31_2, etc...)
Ultimately, it would be best if I joined all the tagged files and all the parsed files that have a common chapter (in the end still requiring only two files I want to run) but not have to save the joined file to the directory... Now that I put it into words, I think I should just save files that include all the sections.
Sorry and Thanks for all your time! Look at FMc's breakdown of my question for more help.
You could iterate over the file names, opening and reading each one in turn. Or you could produce an iterator that knows how to read lines from a sequence of files.
sub files_reader {
    # Takes a list of file names and returns a closure that
    # will yield lines from those files.
    my @handles = map { open(my $h, '<', $_) or die $!; $h } @_;
    return sub {
        shift @handles while @handles and eof $handles[0];
        return unless @handles;
        return readline $handles[0];
    }
}
my $reader = files_reader('foo.txt', 'bar.txt', 'quux.txt');
while (my $line = $reader->()) {
    print $line;
}
Or you could use Perl's built-in iterator that can do the same thing:
local @ARGV = ('foo.txt', 'bar.txt', 'quux.txt');
while (my $line = <>) {
    print $line;
}
Edit in response to follow-up questions:
Perhaps it would help to break your problem down into smaller sub-tasks. As I understand it, you have three steps.
Step 1 is to get some input from the user -- perhaps a directory name, or maybe a couple of file name patterns (taggedchpt and parsedchpt).
Step 2 is for the program to find all of the relevant file names. For this task, glob() or readdir() might be useful (a short sketch follows below). There are many questions on StackOverflow related to such issues. You'll end up with two lists of file names, one for the tagged files and one for the parsed files.
Step 3 is to process the lines across all of the files in each of the two sets. Most of the answers you have received, including mine, will help you with this step.
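For example, a minimal sketch of steps 1 and 2 using glob (the prompt wording, and the assumptions that the user types just a chapter number such as 31 and that the files live in the current directory, are mine):
print "Please type in the chapter NUMBER:\n";
chomp(my $chapter = <STDIN>);
my @tagged_files = sort glob("taggedchpt${chapter}_*.txt");
my @parsed_files = sort glob("parsedchpt${chapter}_*.txt");
Either list can then be handed to files_reader() above, or assigned to local @ARGV, so that all sections of a chapter read as one stream for step 3.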
No one has mentioned the @ARGV hack yet? Ok, here it is.
{
    local @ARGV = ('taggedchpt1_1.txt', 'parsedchpt1_1.txt', 'taggedchpt1_2.txt',
                   'parsedchpt1_2.txt');
    while (<ARGV>) {
        s/THIS/THAT/;
        print FH $_;   # FH is assumed to be an output filehandle opened beforehand
    }
}
ARGV is a special filehandle that iterates through all the filenames in @ARGV, closing a file and opening the next one as necessary. Normally @ARGV contains the command-line arguments that you passed to perl, but you can set it to anything you want.
You're almost there... this is a bit more efficient than discrete opens on each file...
#!/usr/bin/perl
use strict;
use warnings FATAL => "all";
print "Please type in the chapter and section NUMBERS in the for chp#_sec#:\n";
chomp (my $chapter_and_section = "chpt".<>);
print "Please type in the search word:\n";
chomp (my $search_key = <>);
open(FH, '>output.txt') or die $!; # Open an output file for writing
foreach ("tagged${chapter_and_section}.txt", "parsed${chapter_and_section}.txt") {
open FILE, "<$_" or die $!; # Read a filename (from the array)
foreach (<FILE>) {
$_ =~ s/THIS/THAT/g; # Regex replace each line in the open file (use
# whatever you like instead of "THIS" &
# "THAT"
print FH $_; # Write to the output file
}
}

Is there a simple way to do bulk file text substitution in place?

I've been trying to code a Perl script to substitute some text on all source files of my project. I'm in need of something like:
perl -p -i.bak -e "s/thisgoesout/thisgoesin/gi" *.{cs,aspx,ascx}
But I need one that parses all the files of a directory recursively.
I just started a script:
use File::Find::Rule;
use strict;
my @files = (File::Find::Rule->file()->name('*.cs', '*.aspx', '*.ascx')->in('.'));
foreach my $f (@files){
    if ($f =~ s/thisgoesout/thisgoesin/gi) {
        # In-place file editing, or something like that
    }
}
But now I'm stuck. Is there a simple way to edit all files in place using Perl?
Please note that I don't need to keep a copy of every modified file; I have 'em all subversioned =)
Update: I tried this on Cygwin,
perl -p -i.bak -e "s/thisgoesout/thisgoesin/gi" {*,*/*,*/*/*}.{cs,aspx,ascx}
But it looks like my arguments list exploded to the maximum size allowed. In fact, I'm getting very strange errors on Cygwin...
If you assign @ARGV before using *ARGV (aka the diamond <>), $^I/-i will work on those files instead of what was specified on the command line.
use File::Find::Rule;
use strict;
@ARGV = (File::Find::Rule->file()->name('*.cs', '*.aspx', '*.ascx')->in('.'));
$^I = '.bak'; # or set `-i` in the #! line or on the command-line
while (<>) {
    s/thisgoesout/thisgoesin/gi;
    print;
}
This should do exactly what you want.
If your pattern can span multiple lines, add undef $/; before the <> so that Perl operates on a whole file at a time instead of line-by-line.
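For instance, a minimal slurp-mode variant of the loop above (same file set as before; with $/ undefined, $_ holds a whole file, so patterns may match across line boundaries):
@ARGV = (File::Find::Rule->file()->name('*.cs', '*.aspx', '*.ascx')->in('.'));
$^I = '.bak';
undef $/;   # slurp each file whole
while (<>) {
    s/thisgoesout/thisgoesin/gi;
    print;
}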
You may be interested in File::Transaction::Atomic or File::Transaction
The SYNOPSIS for F::T::A looks very similar with what you're trying to do:
# In this example, we wish to replace
# the word 'foo' with the word 'bar' in several files,
# with no risk of ending up with the replacement done
# in some files but not in others.
use File::Transaction::Atomic;
my $ft = File::Transaction::Atomic->new;
eval {
    foreach my $file (@list_of_file_names) {
        $ft->linewise_rewrite($file, sub {
            s#\bfoo\b#bar#g;
        });
    }
};
if ($@) {
    $ft->revert;
    die "update aborted: $@";
}
else {
    $ft->commit;
}
Couple that with the File::Find::Rule code you've already written, and you should be good to go.
You can use Tie::File to scalably access large files and change them in place. See the manpage (man 3perl Tie::File).
Change
foreach my $f (@files){
    if ($f =~ s/thisgoesout/thisgoesin/gi) {
        # inplace file editing, or something like that
    }
}
To
foreach my $f (@files){
    open my $in, '<', $f;
    open my $out, '>', "$f.out";
    while (my $line = <$in>){
        chomp $line;
        $line =~ s/thisgoesout/thisgoesin/gi;
        print $out "$line\n";
    }
}
This assumes that the pattern doesn't span multiple lines. If the pattern might span lines, you'll need to slurp in the file contents. ("slurp" is a pretty common Perl term).
The chomp isn't actually necessary, I've just been bitten by lines that weren't chomped one too many times (if you drop the chomp, change print $out "$line\n"; to print $out $line;).
Likewise, you can change open my $out, '>', "$f.out"; to open my $out, '>', undef; to open a temporary file and then copy that file back over the original when the substitution's done. In fact, and especially if you slurp in the whole file, you can simply make the substitution in memory and then write over the original file. But I've made enough mistakes doing that that I always write to a new file, and verify the contents.
Note, I originally had an if statement in that code. That was most likely wrong. That would have only copied over lines that matched the regular expression "thisgoesout" (replacing it with "thisgoesin" of course) while silently gobbling up the rest.
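As for the anonymous temp-file idea mentioned above, a minimal sketch (note that reading the data back requires a read-write mode, so '+>' rather than plain '>'):
open my $out, '+>', undef or die "Can't open temp file: $!";
# ... print the substituted lines to $out, as in the loop above ...
seek $out, 0, 0;                          # rewind the temp file
open my $orig, '>', $f or die "Can't rewrite '$f': $!";
print {$orig} $_ while <$out>;
close $orig;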
You could use find:
find . -name '*.{cs,aspx,ascx}' | xargs perl -p -i.bak -e "s/thisgoesout/thisgoesin/gi"
This will list all the filenames recursively, then xargs will read its stdin and run the remainder of the command line with the filenames appended on the end. One nice thing about xargs is it will run the command line more than once if the command line it builds gets too long to run in one go.
Note that I'm not sure whether find completely understands all the shell methods of selecting files, so if the above doesn't work then perhaps try:
find . | grep -E '(cs|aspx|ascx)$' | xargs ...
When using pipelines like this, I like to build up the command line and run each part individually before proceeding, to make sure each program is getting the input it wants. So you could run the part without xargs first to check it.
It just occurred to me that although you didn't say so, you're probably on Windows due to the file suffixes you're looking for. In that case, the above pipeline could be run using Cygwin. It's possible to write a Perl script to do the same thing, as you started to do, but you'll have to do the in-place editing yourself because you can't take advantage of the -i switch in that situation.
Thanks to ephemient on this question and on this answer, I got this:
use File::Find::Rule;
use strict;
sub ReplaceText {
    my $regex   = shift;
    my $replace = shift;
    @ARGV = (File::Find::Rule->file()->name('*.cs', '*.aspx', '*.ascx')->in('.'));
    $^I = '.bak';
    while (<>) {
        s/$regex/$replace->()/gie;
        print;
    }
}
ReplaceText qr/some(crazy)regexp/, sub { "some $1 text" };
Now I can even loop through a hash containing regexp=>subs entries!
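For instance, a minimal sketch of that loop (the patterns and replacement subs here are made-up placeholders):
my %replacements = (
    qr/some(crazy)regexp/ => sub { "some $1 text" },
    qr/other(\d+)thing/   => sub { "other $1 things" },
);

for my $regex (keys %replacements) {
    ReplaceText($regex, $replacements{$regex});
}
Note that every call re-scans the files and overwrites the previous .bak backups, so only the state before the last call is kept as a backup.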

Searching/reading another file from awk based on current file's contents, is it possible?

I'm processing a huge file with (GNU) awk (other available tools are Linux shell tools and some old (>5.0) version of Perl, but I can't install modules).
My problem: if field1, field2 and field3 contain X, Y, Z, I must search for a file in another directory which contains field4 and field5 on one line, and insert some data from the found file into the current output.
E.g.:
Actual file line:
f1 f2 f3 f4 f5
X Y Z A B
Now I need to search for another file (in another directory), which contains e.g.
f1 f2 f3 f4
A U B W
And write to STDOUT $0 from the original file, and f2 and f3 from the found file, then process the next line of the original file.
Is it possible to do it with awk?
Let me start out by saying that your problem description isn't really that helpful. Next time, please just be more specific: You might be missing out on much better solutions.
So from your description, I understand you have two files which contain whitespace-separated data. In the first file, you want to match the first three columns against some search pattern. If found, you want to find all lines in another file which contain the fourth and fifth column of the matching line in the first file. From those lines, you need to extract the second and third column and then print the first column of the first file and the second and third from the second file. Okay, here goes:
#!/usr/bin/env perl -nwa
use strict;
use File::Find 'find';
our @F;   # @F is filled in by the -a (autosplit) switch

my @search = qw(X Y Z);
# if you know in advance that the otherfile isn't
# huge, you can cache it in memory as an optimization.

# with any more columns, you want a loop here:
if ($F[0] eq $search[0]
    and $F[1] eq $search[1]
    and $F[2] eq $search[2])
{
    my @files;
    find(sub {
             return if not -f $_;
             # verbatim search for the columns in the file name.
             # I'm still not sure what your file-search criteria are, though.
             push @files, $File::Find::name if /\Q$F[3]\E/ and /\Q$F[4]\E/;
             # alternatively search for the combination:
             #push @files, $File::Find::name if /\Q$F[3]\E.*\Q$F[4]\E/;
             # or search *all* files in the search path?
             #push @files, $File::Find::name;
         }, '/search/path'
    );

    foreach my $file (@files) {
        open my $fh, '<', $file or die "Can't open file '$file': $!";
        while (defined($_ = <$fh>)) {
            chomp;
            # order of fields doesn't matter per your requirement.
            my @cols = split ' ', $_;
            my %seen = map { ($_ => 1) } @cols;
            if ($seen{$F[3]} and $seen{$F[4]}) {
                print join(' ', $F[0], @cols[1,2]), "\n";
            }
        }
        close $fh;
    }
} # end if matching line
Unlike another poster's solution which contains lots of system calls, this doesn't fall back to the shell at all and thus should be plenty fast.
This is the type of work that got me to move from awk to perl in the first place. If you are going to accomplish this, you may actually find it easier to create a shell script that creates awk script(s) to query and then update in separate steps.
(I've written such a beast for reading/updating windows-ini-style files - it's ugly. I wish I could have used perl.)
I often see the restriction "I can't use any Perl modules", and when it's not a homework question, it's often just due to a lack of information. "Yes, even you can use CPAN" contains the instructions on how to install CPAN modules locally without having root privileges. Another alternative is just to take the source code of a CPAN module and paste it into your program.
None of this helps if there are other, unstated, restrictions, like lack of disk space that prevent installation of (too many) additional files.
This seems to work for some test files I set up matching your examples. Involving perl in this manner (interposed with grep) is probably going to hurt the performance a great deal, though...
## perl code to do some dirty work
for my $line (`grep 'X Y Z' myhugefile`) {
    chomp $line;
    my ($a, $b, $c, $d, $e) = split(/ /, $line);
    my $cmd = 'grep -P "' . $d . ' .+? ' . $e . '" otherfile';
    for my $from_otherfile (`$cmd`) {
        chomp $from_otherfile;
        my ($oa, $ob, $oc, $od) = split(/ /, $from_otherfile);
        print "$a $ob $oc\n";
    }
}
EDIT: Use tsee's solution (above), it's much more well-thought-out.