Perl Loop Output to Excel Spreadsheet - perl

I have a Perl script, the relevant bits of which are posted below.
# Pull values from cells
ROW:
for my $row ( $row_min + 1 .. $row_max ) {
    my $target_cell = $worksheet->get_cell( $row, $target_col);
    my $response_cell = $worksheet->get_cell( $row, $response_col);
    if ( defined $target_cell && defined $response_cell ) {
        my $target = $target_cell->value();
        my $response = $response_cell->value();
        # Determine relatedness
        my $value = $lesk->getRelatedness($target, $response);
        # Copy output to new Excel spreadsheet, 'data.xls'
        my $workbook1 = Spreadsheet::WriteExcel->new('data.xls');
        my $worksheet1 = $workbook1->add_worksheet();
        $worksheet1->set_column(0, 3, 18);
        my $row = 0;
        foreach ($target) {
            $row++;
            $worksheet1->write( $row, 0, "Target = $target\n");
            $worksheet1->write( $row, 1, "Response = $response\n");
            $worksheet1->write( $row, 2, "Relatedness = $value\n");
        }
    }
}
This script uses the Perl modules Spreadsheet::ParseExcel and Spreadsheet::WriteExcel. The input spreadsheet is a list of words in two columns, one labelled 'Target' and the other labelled 'Response'. The script takes each target word and its response word and computes a value of relatedness between them. That is what the $lesk->getRelatedness call is doing: it uses a Perl module called WordNet::Similarity, which computes a measure of relatedness between words.
All of this works perfectly fine. The problem is I am trying to write the output (the measure of similarity, or $value in this script) into a new Excel file. No matter what I do with the code, the only output it will give me is the relatedness between the LAST target and response words. It ignores all of the rest.
However, this only occurs when I am trying to write to an Excel file. If I use the 'print' function instead, I can see all of the outputs in the command window. I can always just copy and paste this into Excel, but it would be much easier if I could automate this. Any idea what the problem is?

You're resetting the value of $row each time to 0.

Problem is solved. I just needed to move the
my $workbook1 = Spreadsheet::WriteExcel->new('data.xls');
my $worksheet1 = $workbook1->add_worksheet();
lines to another part of the script. Since they were in the 'for' statement, the program kept overwriting the 'data.xls' file every time it ran through the loop.
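Both answers describe the same pattern: create the output once before the loop, and keep the output row counter outside the loop so it is not reset. The sketch below shows that layout with a plain tab-separated file instead of Spreadsheet::WriteExcel, so that it runs with core Perl only. The word pairs and scores are made up, and the names @results, $out_row and data.tsv are hypothetical.

```perl
use strict;
use warnings;
use File::Temp qw(tempdir);

# Made-up (target, response, score) rows standing in for the real results.
my @results = (
    [ 'dog', 'cat',  0.42 ],
    [ 'car', 'road', 0.17 ],
    [ 'sun', 'moon', 0.63 ],
);

my $dir  = tempdir( CLEANUP => 1 );
my $path = "$dir/data.tsv";

# Open the output ONCE, before the loop. Opening with '>' inside the loop
# would truncate the file on every pass and leave only the last row.
open( my $out, '>', $path ) or die "Cannot open $path: $!";

# The row counter also lives outside the loop, so it is never reset.
my $out_row = 0;

for my $result (@results) {
    my ( $target, $response, $value ) = @$result;
    print {$out} join( "\t", $out_row, $target, $response, $value ), "\n";
    $out_row++;
}
close($out) or die "Cannot close $path: $!";

# Read the file back to confirm that every row survived.
open( my $in, '<', $path ) or die "Cannot open $path: $!";
my @lines = <$in>;
close($in);
print scalar(@lines), " rows written\n";
```

In the original script the equivalent change would be to move the Spreadsheet::WriteExcel->new and add_worksheet calls, and the declaration of the output row counter, above the for loop.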

Related

Add new hash keys and then print in a new file

Previously, I posted a question looking for a way to use a regex to match specific sequence identifiers (IDs).
Now I'm looking for recommendations on how to print the data I'm after.
If you want to see the complete file, here's a GitHub link.
The script needs two input files. The first file looks like this (only part of it is shown):
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 2 2 0.0804934 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 4 4 0.0925522 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 13 13 0.0250116 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 23 23 0.565981 . .
...
The sixth column of this file holds a value, and I am interested in the rows where that value is >= 0.5. For those rows, my script takes the first column (an ID, used to match against the second file) and the fourth column (the position of a letter in the second file).
Here is my second file (again, only a part):
>AGY29650.2|NA spike protein
MTYSVFPLMCLLTFIGANAKIVTLPGNDA...EEYDLEPHKIHVH*
As I said, the script uses the ID from the first file to find the matching ID in the second file, and then looks up the position (fourth column) in that sequence.
For example, the fourth row of file one has a positive value (>= 0.5), and the position in its fourth column is 23.
The script then looks up position 23 in the sequence from the second file, which here is the letter T:
MTYSVFPLMCLLTFIGANAKIV T LP
Once the script has found that letter, it should also take the 2 letters to its left and the 2 letters to its right:
IVTLP
In the previous post, with help from some people on Stack, I solved a problem caused by the IDs being written differently in each file (AGY29650_2_NA in file one versus AGY29650.2 in file two).
Now I am looking for help producing the output I need to complete the script.
The script is incomplete because I couldn't find a way to print the output of interest: the 5 letters from the second file, that is, the letter at the position given in file one plus the 2 letters to its left and the 2 to its right.
I have thousands of pairs of files like these two, so any idea for completing the script is welcome.
Here is the script:
use strict;
use warnings;
use Bio::SeqIO;

my $file = $ARGV[0];
my $in = $ARGV[1];
my %fastadata = ();
my @array_residues = ();
my $seqio_obj = Bio::SeqIO->new(-file => $in,
                                -format => "fasta" );
while (my $seq_obj = $seqio_obj->next_seq ) {
    my $dd = $seq_obj->id;
    my $ss = $seq_obj->seq;
    ###my $ee = $seq_obj->desc;
    $fastadata{$dd} = "$ss";
}

my $thres = 0.5; ### Selection of values in column N°5 with the following condition: >=0.5

# Open file
open (F, $file) or die; ### open the file or end the analyze
while(my $one = <F>) {### readline => F
    $one =~ s/\n//g;
    $one =~ s/\r//g;
    my @cols = split(/\s+/, $one); ### split columns
    next unless (scalar (@cols) == 7); ### the line must have 7 columns to add to the array
    my $val = $cols[5];

    if ($val >= 0.5) {
        my $position = $cols[3];
        my $id_list = $cols[0];
        $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
        if (exists($fastadata{$id_list})) {
            my $new_seq = $fastadata{$id_list};
            my $subresidues = substr($new_seq, $position -3, 6);
        }
    }
}
close F;
I'm thinking of using push to collect the new data and then printing it to a new file.
My expected output is the letter at each positive position (>= 0.5), in this case T (position 23), together with the 2 letters to its left and the 2 letters to its right.
With the example data on GitHub (link above), the expected output is:
IVTLP
Any recommendation or help is welcome.
Thanks!
The main problem seems to be that each line has 8 columns, not 7 as the script assumes. A smaller issue is that the extracted substring should have 5 characters, not 6. Here is a modified version of the loop that works for me:
open (F, $file) or die; ### open the file or end the analyze
while(my $one = <F>) {### readline => F
    chomp $one;
    my @cols = split(/\s+/, $one); ### split columns
    next unless (scalar @cols) == 8; ### the line must have 8 columns to add to the array
    my $val = $cols[5];
    if ($val >= 0.5) {
        my $position = $cols[3];
        my $id_list = $cols[0];
        $id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
        if (exists($fastadata{$id_list})) {
            my $new_seq = $fastadata{$id_list};
            my $subresidues = substr($new_seq, $position -3, 5);
            print $subresidues, "\n";
        }
    }
}
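The offset arithmetic in that substr call can be checked in isolation. The following is a minimal sketch using only the part of the example sequence shown in the question; residue_window is a hypothetical helper name, not something from the original script. It also guards against one case the loop above does not handle: for positions 1 and 2, $position - 3 is negative, and substr treats a negative offset as counting from the end of the string. The results are pushed onto an array, as the question suggested, so they can be printed to a file afterwards.

```perl
use strict;
use warnings;

# Return the residue at a 1-based position plus $flank residues on each side.
# Returns nothing (undef in scalar context) when the window would run past
# either end of the sequence.
sub residue_window {
    my ( $seq, $position, $flank ) = @_;
    my $start  = $position - 1 - $flank;    # convert to a 0-based offset
    my $length = 2 * $flank + 1;
    return if $start < 0 || $start + $length > length $seq;
    return substr( $seq, $start, $length );
}

# Only the visible start of the example sequence from the question.
my $seq = 'MTYSVFPLMCLLTFIGANAKIVTLPGNDA';

# Collect the windows first; they can be written to a file afterwards.
my @windows;
for my $position ( 23, 2 ) {
    my $window = residue_window( $seq, $position, 2 );
    push @windows, defined $window ? $window : 'n/a';
}
print "$_\n" for @windows;
```

Position 23 gives IVTLP, matching the expected output in the question, while position 2 is reported as n/a because there are not two residues to its left.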

How can Perl operate on multiple worksheets in an Excel file?

I'm still learning Perl, and I'm trying to perform the following task, but I'm miserably stuck.
Currently I do the following:
Open the necessary Excel file
Store the workbook in $workbook1
Store the worksheet in $worksheet1
my $excel = Win32::OLE -> new("Excel.Application");
$excel->{Visible} = 0;
$excel->{DisplayAlerts}=0;
$excel->{SheetsInNewWorkbook} = 1;
my $workbook1 = $excel->Workbooks->Open($OutFileName);
my $worksheet1 = $workbook1->Worksheets(1);
Later I use these to have Perl write into the cells of the Excel file ($Row and $Col are initialized as needed).
$worksheet1->Cells($Row, $Col)->{'Value'} = $var1;
$worksheet1->Cells($Row, $Col+1)->{'Value'} = $var2;
$worksheet1->Cells($Row, $Col+2)->{'Value'} = $var3;
$worksheet1->Cells($Row, $Col+3)->{'Value'} = $var4;
So far this works fine.
What I want to do is operate on multiple worksheets instead of just one:
Open the necessary Excel file
Store the workbook in $workbook1
Store the worksheets in $worksheet1, $worksheet2, $worksheet3, $worksheet4, $worksheet5 (or in an array, if that's possible?)
I have a total of 5 worksheets to operate on in the Excel file.
I'm trying to loop the cell-writing step over each worksheet, and I'm looking for an easier way to refer to the $worksheet1, $worksheet2, ..., $worksheet5 variables. For now, without a proper array (which I haven't yet learned how to use in Perl), I need an if statement for each worksheet inside the for loop, with the corresponding code repeated, which is very inefficient. Is there a way to do this with an array, or something else, for the worksheet variable?
I'm more familiar with how Matlab works. Here is what I would like Perl to do, written as if it were a Matlab script with arrays:
for i = 1:num_worksheets
# write in worksheet[i] cell (j,k) with variable var1
end
I am not entirely sure I get what you mean, but I think what you want to do is not repeat code. At least that is how I understand your Matlab code.
You can put all the worksheets into one array in Perl like this.
my $worksheet1 = $workbook1->Worksheets(1);
my $worksheet2 = $workbook1->Worksheets(2);
my $worksheet3 = $workbook1->Worksheets(3);
Or shorter:
my @worksheets = (
    $workbook1->Worksheets(1),
    $workbook1->Worksheets(2),
    $workbook1->Worksheets(3),
);
Or even shorter:
my @worksheets = map { $workbook1->Worksheets($_) } 1 .. 3;
Then you can iterate over that array. The $worksheet variables are all objects, and objects in Perl are references. That means that putting them into an array does not make copies of the objects.
foreach my $sheet ( @worksheets ) {
    $sheet->Cells($row, $col)->{'Value'} = $var1;
    $sheet->Cells($row, $col + 1)->{'Value'} = $var2;
}
That code will write $var1 and $var2 into the $row/$col cells in all three worksheets 1, 2 and 3.
You don't even have to create an array. You can just supply a list of the worksheets if that's the only place where you use all of them together in that order.
foreach my $sheet ($worksheet1, $worksheet2, $worksheet3) {
    $sheet->Cells($row, $col)->{'Value'} = $var1;
}
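The point about references can be tried without Excel. This is a minimal sketch in which FakeSheet is a hypothetical stand-in class that only records what is written to it; it is not part of Win32::OLE. It also shows an index-based loop, which is the closest equivalent to the Matlab-style for i = 1:num_worksheets in the question.

```perl
use strict;
use warnings;

# Hypothetical stand-in for a worksheet object; it only records writes.
package FakeSheet;

sub new {
    my ( $class, $name ) = @_;
    return bless { name => $name, cells => {} }, $class;
}

sub set_cell {
    my ( $self, $row, $col, $value ) = @_;
    $self->{cells}{"$row,$col"} = $value;
    return $self;
}

sub get_cell {
    my ( $self, $row, $col ) = @_;
    return $self->{cells}{"$row,$col"};
}

package main;

my $sheet1 = FakeSheet->new('Sheet1');
my $sheet2 = FakeSheet->new('Sheet2');

# The array holds references to the same two objects, not copies of them.
my @sheets = ( $sheet1, $sheet2 );

# Index-based loop, for when the worksheet number is needed inside the body.
for my $i ( 0 .. $#sheets ) {
    $sheets[$i]->set_cell( 1, 1, "value for sheet " . ( $i + 1 ) );
}

# The writes are visible through the original variables.
print $sheet1->get_cell( 1, 1 ), "\n";
print $sheet2->get_cell( 1, 1 ), "\n";
```

Note that Perl arrays are indexed from 0 while the Excel Worksheets collection in the code above is numbered from 1, hence the $i + 1.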

BioPerl with clustalw - outputting file

I have a Perl script to automate many multiple alignments. (I'm writing the script first for a single file and a single, though large, multiple alignment; I can then adapt it for multiple files.) I want to write the resulting alignment to a file, but I am unsure how to do that with AlignIO. So far I have:
use warnings;
use strict;
use Bio::AlignIO;
use Bio::SeqIO;
use Bio::Tools::Run::Alignment::Clustalw;
my $file = shift or die; # Get filename from command prompt.
my $factory = Bio::Tools::Run::Alignment::Clustalw->new(-matrix => 'BLOSUM');
my $ktuple = 3;
$factory->ktuple($ktuple);
my $inseq = Bio::SeqIO->new(
    -file => "<$file",
    -format => $format
);
my $seq;
my @seq_array;
while ($seq = $inseq->next_seq) {
    push(@seq_array, $seq);
}
# Now we do the actual alignment.
my $seq_array_ref = \@seq_array;
my $aln = $factory->align($seq_array_ref);
Once the alignment is done I have $aln which is the alignment I want to get out of the process as a fasta file - I tried something like:
my $out = Bio::AlignIO->new(-file => ">outputalignmentfile",
                            -format => 'fasta');
while( my $outaln = $aln->next_aln() ){
    $out->write_aln($outaln);
}
but it didn't work, presumably because the method next_aln() only applies to AlignIO things, which $aln is probably not. So I need to know what it is that is generated by the line my $aln = $factory->align($seq_array_ref); and how to get the aligned sequences output into a file. My next step is tree estimation or network analysis.
Thanks,
Ben.
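$out->write_aln($aln);
was the only line needed. $factory->align() returns the alignment object itself (a Bio::SimpleAlign), not an AlignIO stream, so it can be passed straight to write_aln() on the output stream and no next_aln() loop is required.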

Parsing CSV files, finding columns and remembering them

I am trying to figure out a way to do this, I know it should be possible. A little background first.
I want to automate the process of creating the NCBI Sequin block for submitting DNA sequences to GenBank. I always end up creating a table that lists the species name, the specimen ID value, the type of sequence, and finally the collection location. It is easy enough for me to export this into a tab-delimited file. Right now I do something like this:
while ($csv) {
    foreach ($_) {
        if ($_ !~ m/table|species|accession/i) {
            @csv = split('\t', $csv);
            print NEWFILE ">[species=$csv[0]] [molecule=DNA] [moltype=genomic] [country=$csv[2]] [spec-id=$csv[1]]\n";
        }
        else {
            next;
        }
    }
}
I know that is messy, and I just typed up something similar to what I have by memory (don't have script on any of my computers at home, only at work).
Now that works for me fine right now because I know which columns the information I need (species, location, and ID number) are in.
But is there a way (there must be) for me to find the columns that are for the needed info dynamically? That is, no matter the order of the columns the correct info from the correct column goes to the right place?
The first row will usually read Table X (where X is the number of the table in the publication). The next row will usually hold the column headings of interest, which are nearly universal in title. Nearly all tables will have standard headings to search for, so I can just use | in my pattern matching.
First off, I would be remiss if I didn't recommend the excellent Text::CSV_XS module; it does a much more reliable job of reading CSV files, and can even handle the column-mapping scheme described in Barmar's answer.
That said, Barmar has the right approach, though it ignores the fact that the "Table X" row is a separate row entirely. I recommend taking an explicit approach, perhaps something like this (this has a bit more detail just to make things clear; I would probably write it more tightly in production code):
# Assumes the file has been opened and that the filehandle is stored in $csv_fh.
# Also assumes NEWFILE has been opened for writing.
# Get header information first.
my $hdr_data = {};
while( <$csv_fh> ) {
    chomp; # otherwise the last heading keeps its newline
    if( ! $hdr_data->{'table'} && /Table (\d+)/ ) {
        $hdr_data->{'table'} = $1;
        next;
    }
    if( ! $hdr_data->{'species'} && /species/i ) {
        my $n = 0;
        # Takes the column headers as they come, creating
        # a map between the column name and column number.
        # Assumes that column names are case-insensitively
        # unique.
        my %columns = map { lc($_) => $n++ } split( /\t/ );
        # Now pick out exactly the columns we want.
        foreach my $thingy ( qw{ species accession country } ) {
            $hdr_data->{$thingy} = $columns{$thingy};
        }
        last;
    }
}
# Now process the rest of the lines.
while( <$csv_fh> ) {
    chomp;
    my @col = split( /\t/ ); # list assignment: keep the fields, not their count
    printf NEWFILE ">[species=%s] [molecule=DNA] [moltype=genomic] [country=%s] [spec-id=%s]\n",
        $col[$hdr_data->{'species'}],
        $col[$hdr_data->{'country'}],
        $col[$hdr_data->{'accession'}];
}
Some variation of that will get you close to what you need.
Create a hash that maps column headings to column numbers:
my %columns;
...
if (/table|species|accession/i) {
    my @headings = split /\t/;
    my $col = 0;
    # Use a separate loop variable so the column counter is not shadowed.
    foreach my $heading (@headings) {
        $columns{"\L$heading"} = $col++;
    }
}
Then you can use $csv[$columns{'species'}].
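To see that the heading-to-index map really makes the output independent of column order, here is a small self-contained sketch. The two tables, their values and the helper name column_map are all made up for illustration, and the header row is assumed to be the first line of each table (any "Table X" row already skipped).

```perl
use strict;
use warnings;

# Build a map from lower-cased heading to column index for one header line.
sub column_map {
    my ($header_line) = @_;
    my $index = 0;
    return map { ( lc($_) => $index++ ) } split /\t/, $header_line;
}

# Two made-up tables with the same headings in a different order.
my @table_a = ( "Species\tAccession\tCountry", "Homo sapiens\tAB123\tKenya" );
my @table_b = ( "Country\tSpecies\tAccession", "Peru\tLama glama\tCD456" );

my @records;
for my $table ( \@table_a, \@table_b ) {
    my ( $header, @rows ) = @$table;
    my %col = column_map($header);
    for my $row (@rows) {
        my @field = split /\t/, $row;
        # Look the fields up by heading name, never by fixed position.
        push @records,
            sprintf( '>[species=%s] [country=%s] [spec-id=%s]',
                @field[ @col{qw(species country accession)} ] );
    }
}
print "$_\n" for @records;
```

Both tables produce records with the fields in the same slots, even though the columns are ordered differently in the input.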

Perl - Spreadsheet::WriteExcel function explanation

I'd like to use the function "autofit_columns" as found here: CPAN
Here's my program so far (I skipped the DB connection and query part):
my $workbook = Spreadsheet::WriteExcel->new("TEST.xls");
my $bold = $workbook->add_format();
$bold->set_bold();
my $number = $workbook->add_format();
$number->set_num_format(0x01);
$worksheet = $workbook->add_worksheet('Sheet1');
my @headings = ('Blabla...');
foreach $i (@headings){
    $worksheet->write(0, $col++, $i, $bold);
};
$col=0;
$lrow=1;
while (@row = $sth->fetchrow_array()) {
    $worksheet->write($lrow,$col,\@row);
    $lrow++;
};
$sth->finish;
$dbh->disconnect;
autofit_columns($worksheet);
$workbook->close();
sub autofit_columns {
    my $worksheet = shift;
    my $col = 0;
    for my $width (@{$worksheet->{__col_widths}}) {
        $worksheet->set_column($col, $col, $width) if $width;
        $col++;
    }
}
PROBLEM: My columns are not autofitted in the xls file. Any idea why?
I also don't understand this piece of code:
for my $width (@{$worksheet->{__col_widths}}) {
    $worksheet->set_column($col, $col, $width) if $width;
    $col++;
}
You need to look at that example again and also implement the add_write_handler part before you write anything to your worksheet.
Please take a look at the
$worksheet->add_write_handler(qr[\w], \&store_string_widths);
line and then at the implementation of the store_string_widths subroutine.
The answer is that you need to store the width of each string as it is written. Then, after you have written all the data to your worksheet, you need the largest stored width for each column; that is the desired column width.
Good luck.
You are missing the part of the example code that adds the callback function:
$worksheet->add_write_handler(qr[\w], \&store_string_widths);
You are also missing the store_string_widths() function.
In relation to your second question, the callback stores the maximum string length used for each column. The code snippet uses these lengths to set the width of each column, from the first column to the last one that has a length stored. If a column has no autofit width stored, its width isn't adjusted.
This is all a little hacky in Spreadsheet::WriteExcel. It will be more integrated into the module in Excel::Writer::XLSX which is the replacement for WriteExcel.
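The bookkeeping both answers describe, remembering the widest string seen in each column, can be sketched on its own. This is an illustration in core Perl only, not the store_string_widths code from the CPAN example; the names note_width and @col_widths and the sample rows are made up. In the real program this logic would live in the write handler, and the final widths would be passed to set_column.

```perl
use strict;
use warnings;

# Widest string seen so far in each column (array index = column number).
my @col_widths;

# Record a cell's string length if it is the widest seen in its column.
sub note_width {
    my ( $col, $text ) = @_;
    my $width = length $text;
    $col_widths[$col] = $width
        if !defined $col_widths[$col] || $width > $col_widths[$col];
    return;
}

# Made-up rows standing in for the database results.
my @rows = (
    [ 'id', 'description',        'qty' ],
    [ 1,    'short',              10 ],
    [ 2,    'a much longer text', 5 ],
);

# Note every cell as it would be written, row by row.
for my $row (@rows) {
    note_width( $_, $row->[$_] ) for 0 .. $#$row;
}

print join( ', ', @col_widths ), "\n";
```

Because the widths are only known once every row has been seen, the column widths have to be applied after all writes, which is why autofit_columns is called last in the question's script.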