How do I turn a table into a matrix? - perl

If I got a table in a text file such like
A B 1
A C 2
A D 1
B A 3
C D 2
A E 1
E D 2
C B 2
. . .
. . .
. . .
And I got another symbol list in another text file. I want to transform this table into a Perl data structure like:
_ A D E . . .
A 0 1 1 . . .
D 1 0 2 . . .
E 1 2 0 . . .
. . . . . . .
But I only need some selected symbol, for example A, D and E are selected in the symbol text but B and C are not.

Use an array for the first one and a 2-dimentional hash for the second one. The first one should look roughly like:
$list[0] # row 1 - the value is "A B 1"
And the hash like:
$hash{A}{A} # the intersection of A and A - the value is 0
Figuring out how to implement a problem is about 75% of the mental battle for me. I'm not going to go into specifics about how to print the hash or the array, because that's easy and I'm also not entirely clear on how you want it printed or how much you want printed. But converting the array to the hash should look a bit like this:
foreach (#list) {
my ($letter1, $letter2, $value) = split(/ /);
$hash{$letter1}{$letter2} = $value;
}
At least, I think that's what you're looking for. If you really want you could use a regular expression, but that's probably overkill for just extracting 3 values out of a string.
EDIT: Of course, you could forgo the #list and just assemble the hash straight from the file. But that's your job to figure out, not mine.

you can try this with awk:
awk -f matrix.awk yourfile.txt > newfile.matrix.txt
where matrix.awk is :
BEGIN {
OFS="\t"
}
{
row[$1,$2]=$3
if (!($2 in f2)) { header=(header)?header OFS $2:$2;f2[$2]}
if (col1[c]!=$1)
col1[++c]=$1
}
END {
printf("%*s%s\n", length(col1[1])+2, " ",header)
ncol=split(header,colA,OFS)
for(i=1;i<=c;i++) {
printf("%s", col1[i])
for(j=1;j<=ncol;j++)
printf("%s%s%c", OFS, row[col1[i],colA[j]], (j==ncol)?ORS:"")
}
}

Another way to do this would be to make a two-dimensional array -
my #fArray = ();
## Set the 0,0th element to "_"
push #{$fArray[0]}, '_';
## Assuming that the first line is the range of characters to skip, e.g. BC
chomp(my $skipExpr = <>);
while(<>) {
my ($xVar, $yVar, $val) = split;
## Skip this line if expression matches
next if (/$skipExpr/);
## Check if these elements have already been added in your array
checkExists($xVar);
checkExists($yVar);
## Find their position
for my $i (1..$#fArray) {
$xPos = $i if ($fArray[0][$i] eq $xVar);
$yPos = $i if ($fArray[0][$i] eq $yVar);
}
## Set the value
$fArray[$xPos][$yPos] = $fArray[$yPos][$xPos] = $val;
}
## Print array
for my $i (0..$#fArray) {
for my $j (0..$#{$fArray[$i]}) {
print "$fArray[$i][$j]", " ";
}
print "\n";
}
sub checkExists {
## Checks if the corresponding array element exists,
## else creates and initialises it.
my $nElem = shift;
my $found;
$found = ($_ eq $nElem ? 1 : 0) for ( #{fArray[0]} );
if( $found == 0 ) {
## Create its corresponding column
push #{fArray[0]}, $nElem;
## and row entry.
push #fArray, [$nElem];
## Get its array index
my $newIndex = $#fArray;
## Initialise its corresponding column and rows with '_'
## this is done to enable easy output when printing the array
for my $i (1..$#fArray) {
$fArray[$newIndex][$i] = $fArray[$i][$newIndex] = '_';
}
## Set the intersection cell value to 0
$fArray[$newIndex][$newIndex] = 0;
}
}
I am not too proud regarding the way I have handled references but bear with a beginner here (please leave your suggestions/changes in comments). The above mentioned hash method by Chris sounds a lot easier (not to mention a lot less typing).

CPAN has many potentially useful suff. I use Data::Table for many purposes. Data::Pivot also looks promising, but I have never used it.

Related

Add new hash keys and then print in a new file

Previously, I post a question to search for an answer to using regex to match specifics sequence identification (ID).
Now I´m looking for some recommendations to print the data that I looking for.
If you want to see the complete file, here's a GitHub link.
This script takes two files to work. The first file is something like this (this is only a part of the file):
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 2 2 0.0804934 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 4 4 0.0925522 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 13 13 0.0250116 . .
AGY29650_2_NA netOGlyc-4.0.0.13 CARBOHYD 23 23 0.565981 . .
...
This file tells me when there is a value >= 0.5, this information is in the sixth column. When this happens my script takes the first column (this is an ID, to match in with the second file) and the fourth column (this is a position of a letter in the second file).
Here my second file (this is only a part):
>AGY29650.2|NA spike protein
MTYSVFPLMCLLTFIGANAKIVTLPGNDA...EEYDLEPHKIHVH*
Like I said previously, the script takes the ID in the first file to match with the ID in the second file when these are the same and then searches for the position (fourth column) in the contents of the data.
Here an example, in file one the fourth row is a positive value (>=0.5) and the position in the fourth column is 23.
Then the script searches for position 23 in the data contents of the second file, here position 23 is a letter T:
MTYSVFPLMCLLTFIGANAKIV T LP
When the script match with the letter, the looking for 2 letters right and 2 letters left to the position of interest:
IVTLP
In the previous post, thank the help of some people in Stack I could solve the problem because of a difference between ID in each file (difference like this: AGY29650_2_NA (file one) and AGY29650.2 (file two)).
Now I looking for help to obtain the output that I need to complete the script.
The script is incomplete because I couldn't found the way to print the output of interest, in this case, the 5 letters in the second file (one letter of the position that appears in file one) 2 letters right, and 2 left.
I have thousands of files like the one and two, now I need some help to complete the script with any idea that you recommend.
Here is the script:
use strict;
use warnings;
use Bio::SeqIO;
​
my $file = $ARGV[0];
my $in = $ARGV[1];
my %fastadata = ();
my #array_residues = ();
my $seqio_obj = Bio::SeqIO->new(-file => $in,
-format => "fasta" );
while (my $seq_obj = $seqio_obj->next_seq ) {
my $dd = $seq_obj->id;
my $ss = $seq_obj->seq;
###my $ee = $seq_obj->desc;
$fastadata{$dd} = "$ss";
}
​
my $thres = 0.5; ### Selection of values in column N°5 with the following condition: >=0.5
​
# Open file
open (F, $file) or die; ### open the file or end the analyze
while(my $one = <F>) {### readline => F
$one =~ s/\n//g;
$one =~ s/\r//g;
my #cols = split(/\s+/, $one); ### split columns
next unless (scalar (#cols) == 7); ### the line must have 7 columns to add to the array
my $val = $cols[5];
​
if ($val >= 0.5) {
my $position = $cols[3];
my $id_list = $cols[0];
$id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
if (exists($fastadata{$id_list})) {
my $new_seq = $fastadata{$id_list};
my $subresidues = substr($new_seq, $position -3, 6);
}
}
}
close F;
I´m thinking in add a push function to generate the new data and then print in a new file.
My expected output is to print the position of a positive value (>=0.5), in this case, T (position 23) and the 2 letters right and 2 letters left.
In this case, with the data example in GitHub (link above) the expected output is:
IVTLP
Any recommendation or help is welcome.
Thank!
Main problem seems to be that the line has 8 columns not 7 as assumed in the script. Another small issue is that the extracted substring should have 5 characters not 6 as assumed by the script. Here is a modified version of the loop that works for me:
open (F, $file) or die; ### open the file or end the analyze
while(my $one = <F>) {### readline => F
chomp $one;
my #cols = split(/\s+/, $one); ### split columns
next unless (scalar #cols) == 8; ### the line must have 8 columns to add to the array
my $val = $cols[5];
if ($val >= 0.5) {
my $position = $cols[3];
my $id_list = $cols[0];
$id_list =~ s/^\s*([^_]+)_([0-9]+)_([a-zA-Z0-9]+)/$1.$2|$3/;
if (exists($fastadata{$id_list})) {
my $new_seq = $fastadata{$id_list};
my $subresidues = substr($new_seq, $position -3, 5);
print $subresidues, "\n";
}
}
}

Loop recursive through two files in perl

Hi guys I have an issue to solve,
I have 2 files.
File A
col1,col2, value_total_to_put
File A 201843,12345,30
File B
col1,col2,col3, value_inputted, missing_value, value_max
201843,12345,447,4,0,4
201843,12345,448,0,0,4
201843,12345,449,0,0,2
201843,12345,450,4,0,4
201843,12345,451,2,0,2
201843,12345,455,4,0,4
201843,12345,457,0,0,4
201843,12345,899,10,0,10
201843,12345,334,0,1,1
201843,12345,364,0,1,1
201843,12345,71,0,2,2
201843,12345,260,0,2,2
201843,12345,321,0,2,2
201843,12345,328,0,2,2
201843,12345,371,0,2,2
201843,12345,385,0,2,2
201843,12345,426,0,2,2
201843,12345,444,0,2,2
201843,12345,31,4,6,10
201843,12345,360,2,87,99
201843,12345,373,4,95,99
201843,12345,472,4,95,99
201843,12345,475,4,95,99
201843,12345,430,0,99,99
201843,12345,453,0,99,99
201843,12345,463,0,99,99
201843,12345,482,0,99,99
201843,12345,484,0,99,99
My keys are col1 and col2 from both files and I am doing this way below and my loop is wrong because when I reach the EOF from File B my loop is stopped.
What I want is match File A and B with $col1 and $col2 and while the value_total_to_put is > 0 withdraw 1 in each loop and in value_inputted from File B when value_inputted is less than value_max. For withdraw from File A missing_value might be > 0.
For the result I will print when value_inputted is equal to value_max in other words the last value until reach value_max or value_total_to_put is 0.
while ( <FA> ){
chomp;
my($col1,$col2, $value_total_to_put) = split ",";
push #A, [$col1,$col2, $value_total_to_put];
}
my #B;
while ( <FB> ){
chomp;
my($col1,$col2,$col3, $value_inputted, $missing_value, $value_max) = split ",";
push #B, [$col1,$col2,$col3, $value_inputted, $missing_value, $value_max];
}
foreach my $line (#A){
my $idxl = #$line[0].",".#$line[1];
my $value_total_to_put = #$line[2];
while ($value_total_to_put > 0){
foreach my $row ( #B ){
if ( $idxr eq $idxl ){
my $idxr = #$row[0].",".#$row[1];
my $value_inputted = #$row[3];
my $value_max = #$row[5];
my $missing_value = #$row[4];
if ( ($value_inputted eq 0) and ($missing_value eq 0)){
#do_nothing
} elsif($value_inputted == $value_max){
#do_nothing
print join(",", $idxr, #$row[2],"Value_inputted: ".$value_inputted, "Missing_value: ".$missing_value, "Value_max:".$value_max, "Total: ".$value_total_to_put)."\n";
}else{
$value_inputted++;
$missing_value--;
$value_total_to_put--;
}
}
}
last if $value_total_to_put > 0;
}
}
The third file will be this way:
201843,12345,447,4,0,4
201843,12345,450,4,0,4
201843,12345,451,2,0,2
201843,12345,455,4,0,4
201843,12345,899,10,0,10
201843,12345,334,1,0,1
201843,12345,364,1,0,1
201843,12345,71,2,0,2
201843,12345,260,2,0,2
201843,12345,321,2,0,2
201843,12345,328,2,0,2
201843,12345,371,2,0,2
201843,12345,385,2,0,2
201843,12345,426,2,0,2
201843,12345,444,2,0,2
201843,12345,31,10,0,10
201843,12345,360,3,86,99
201843,12345,373,5,94,99
201843,12345,472,5,94,99
201843,12345,475,5,94,99
201843,12345,430,1,98,99
201843,12345,453,1,98,99
As explained by #Dave Cross, your code is quite hard to read (and misses use strictand use warnings), and your explanation of what you are trying to achieve is quite unclear...
One thing that caught my eye however, is that you start a loop with this statement...
while ($value_total_to_put > 0){
... but at the very end of that same block you do :
last if $value_total_to_put > 0;
}
This will effectively cause Perl to exit the loop after the first iteration, no matter the value of the $value_total_to_put variable. This is probably not what you want. Hence, as far as I understand, you should start your investigation by removing that last statement.

Binary search—Can't use string "1" as a symbol ref while strict refs is in use

I've been browsing over the already answered questions regarding this error message.
I am trying to solve a problem from the Rosalind web site that looks for some indexes using a binary search.
When my subroutine finds the number it seems to ignore it, and if I try to print the $found variable, it gives me the error
Can't use string "1" as a symbol ref while strict refs is in use
The code is this
sub binarysearch
{
my $numbertolook = shift;
my #intarray=#_;
my $lengthint = scalar #intarray;
my #sorted = sort {$a <=> $b} #intarray;
#print $numbertolook, " " , #sorted, "\n";
my $low=0;
my $high=$lengthint-1;
my $found =undef;
my $midpoint;
while ($low<$high)
{
$midpoint=int(($low+$high)/2);
#print $midpoint, " ",$low," ", $high, " ", #sorted, "\n";
if ($numbertolook<$sorted[$midpoint])
{
$high=$midpoint;
}
elsif ($numbertolook>$sorted[$midpoint])
{
$low=$midpoint;
}
elsif ($numbertolook==$sorted[$midpoint])
{
$found=1;
print $found "\n";
last;
}
if ($low==$high-1 and $low==$midpoint)
{
if ($numbertolook==$sorted[$high])
{
$found=1;
print $found "\n";
last;
}
$low=$high;
}
}
return $found;
}
You want
print $found, "\n";
Or
print $found . "\n";
With no operator between $found and the newline, it thinks $found is the filehandle to print a newline to, and is getting an error because it isn't a filehandle.
I'll try to help
First of all, as simple as it may seem, a binary search is quite difficult to code correctly. The main reason is that it's a hotbed of off-by-one errors, which are so prevalent that they have their own Wikipedia page
The issue is that an array containing, say, the values A to Z will have 26 elements with indices 0 to 25. I think FORTRAN bucks the trend, and Lua, but pretty much every other language has the first element of an array at index zero
A zero base works pretty well for everything until you start using divide and conquer algorithms. Merge Sort as well as Binary Search are such algorithms. Binary search goes
Is it in the first half?
If so then search the first half further
Else search the second half further
The hard part is when you have to decide when you've found the object, or when you need to give up looking. Splitting data in two nearly-halves is easy. Knowing when to stop is hard
It's highly efficient for sorted data, but the problem comes when implementing it that, if we do it properly, we have to deal with all sorts of weird index bases beyond zero or one.
Suppose I have an array
my #alpha = 'A' .. 'Q'
If I print scalar #alpha I will see 17, meaning the array has seventeen elements, indexed from 0 to 16
Now I'm looking for E in that array, so I do a binary search, so I want the "first half" and the "second half" of #alpha. If I add 0 to 16 and divide by 2 I get a neat "8", so the middle element is at index 8, which is H
But wait. There are 17 elements, which is an odd number, so if we say the first eight (A .. H) are left of the middle and the last eight (I .. Q) are right of the middle then surely the "middle" is I?
In truth this is all a deception, because a binary search will work however we partition the data. In this case binary means two parts, and although the search would be more efficient if those parts could be equal in size it's not necessary for the algorithm to work. So it can be the first third and the last two-thirds, or just the first element and the rest
That's why using int(($low+high)/2) is fine. It rounds down to the nearest integer so that with our 17-element array $mid is a usable 8 instead of 8.5
But your code still has to account for some unexpected things. In the case of our 17-element array we have calculated the middle index to be 8. So indexes 0 .. 7 are the "first half" while 8 .. 16 are the "second half", and the middle index is where the second half starts
But didn't we round the division down? So in the case of an odd number of elements, shouldn't our mid point be at the end of the first half, and not the start of the second? This is an arcane off-by-one error, but let's see if it still works with a simple even number of elements
#alpha = `A` .. `D`
The start and and indices are 0 and 3; the middle index is int((0+3)/2) == 1. So the first half is 0..1 and the second half is 2 .. 3. That works fine
But there's still a lot more. Say I have to search an array with two elements X and Y. That has two clear halves, and I'm looking for A, which is before the middle. So I now search the one-element list X for A. The minimum and maximum elements of the target array are both zero. The mid-point is int((0+0)/2) == 0. So what happens next?
It is similar but rather worse when we're searching for Z in the same list. The code has to be exactly right, otherwise we will be either searching off the end of the array or checking the last element again and again
Saving the worst for last, suppose
my #alpha = ( 'A', 'B, 'Y, 'Z' )
and I'm looking for M. That lest loose all sorts of optimisations that involve checks that may may the ordinary case much slower
Because of all of this it's by far the best solution to use a library or a language's built-in function to do all of this. In particular, Perl's hashes are usually all you need to check for specific strings and any associated data. The algorithm used is vastly better than a binary search for any non-trivial data sets
Wikipedia shows this algorithm for an iterative binary search
The binary search algorithm can also be expressed iteratively with two index limits that progressively narrow the search range.
int binary_search(int A[], int key, int imin, int imax)
{
// continue searching while [imin,imax] is not empty
while (imin <= imax)
{
// calculate the midpoint for roughly equal partition
int imid = midpoint(imin, imax);
if (A[imid] == key)
// key found at index imid
return imid;
// determine which subarray to search
else if (A[imid] < key)
// change min index to search upper subarray
imin = imid + 1;
else
// change max index to search lower subarray
imax = imid - 1;
}
// key was not found
return KEY_NOT_FOUND;
}
And here is a version of your code that is far from bug-free but does what you intended. You weren't so far off
use strict;
use warnings 'all';
print binarysearch( 76, 10 .. 99 ), "\n";
sub binarysearch {
my $numbertolook = shift;
my #intarray = #_;
my $lengthint = scalar #intarray;
my #sorted = sort { $a <=> $b } #intarray;
my $low = 0;
my $high = $lengthint - 1;
my $found = undef;
my $midpoint;
while ( $low < $high ) {
$midpoint = int( ( $low + $high ) / 2 );
#print $midpoint, " ",$low," ", $high, " ", #sorted, "\n";
if ( $numbertolook < $sorted[$midpoint] ) {
$high = $midpoint;
}
elsif ( $numbertolook > $sorted[$midpoint] ) {
$low = $midpoint;
}
elsif ( $numbertolook == $sorted[$midpoint] ) {
$found = 1;
print "FOUND\n";
return $midpoint;
}
if ( $low == $high - 1 and $low == $midpoint ) {
if ( $numbertolook == $sorted[$high] ) {
$found = 1;
print "FOUND\n";
return $midpoint;
}
return;
}
}
return $midpoint;
}
output
FOUND
66
If you call print with several parameters separated with a space print expects the first one to be a filehandle. This is interprented as print FILEHANDLE LIST from the documentation.
print $found "\n";
What you want to do is either to separate with ,, to call it as print LIST.
print $found, "\n";
or to concat as strings, which will also call it as print LIST, but with only one element in LIST.
print $found . "\n";

Perl - Converting special characters to corresponding numerical values in quality score

I am trying to convert the following set of characters into their corresponding values for a quality score that accompanies a fasta file:
!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~
They should have the values 0-93. So when I input a fastq file that uses these symbols I want to output the numerical values for each in a quality score file.
I have tried putting them into an array using split // and then making into a hash where each key is the symbol and the value is its position in the array:
for (my $i = 0; $i<length(#qual); $i++) {
print "i is $i, elem is $qual[$i]\n";
$hash{$qual[$i]} = $i;
I have tried hard coding the hash:
my %hash = {"!"=>"0", "\""=>"1", "#"=>"2", "\$"=>"3"...
With and without escapes for the special characters that require them but cannot seem to get this to work.
This merely outputs:
.
.
.
i is 0, elem is !
i is 1, elem is "
i is 0, elem is !
i is 1, elem is "
i is 0, elem is !
i is 1, elem is "
" 1
Use of uninitialized value $hash{"HASH(0x100804ed0)"} in concatenation (.) or string at convert_fastq.pl line 24, <> line 40.
HASH(0x100804ed0)
! 0
Does anyone have any ideas? I appreciate the help.
Perhaps subtracting 33 from the character's ord to yield the value you want would be helpful:
use strict;
use warnings;
my $string = q{!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~};
for ( split //, $string ) {
print "$_ = ", ord($_) - 33, "\n";
}
Partial output:
! = 0
" = 1
# = 2
$ = 3
% = 4
& = 5
' = 6
( = 7
) = 8
* = 9
+ = 10
...
This way, you don't need to build a hash with character/value pairs, but just use $val = ord ($char) - 33; to get the value.
{ ... }
is similar to
do { my %anon; %anon = ( ... ); \%anon }
So when you did
my %hash = { ... };
you assigned a single item to the hash (a reference to a hash) rather than a list of key-values as you should. Perl warned you about that with the following:
Reference found where even-sized list expected
(Why didn't you mention this?!)
You should be using
my %decode_map = ( ... );
For example,
my %decode_map;
{
my $encoded = q{!"#$%&'()*+,-./0123456789:;<=>?#ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~};
my #encoded = split //, $encoded;
$decode_map{$encoded[$_]} = $_ for 0..$#encoded;
}
Given that those are basically the non-whitespace printable ASCII characters, so you could simply use
my %decode_map = map { chr($_ + 0x21) => $_ } 0x21..0x7E;
Which means you could avoid building the hash at all, replacing
my %decode_map = map { chr($_ + 0x21) => $_ } 0x21..0x7E;
die if !exists($decode_map{$c});
my $num = $decode_map{$c};
with just
die if ord($c) < 0x21 || ord($c) > 0x7E;
my $num = ord($c) - 0x21;
From a language-agnostic point of view: Use an array with 256 entries, one for each ASCII character. You can then store 0 at ['!'], 1 at ['"'] and so on. When parsing the input, you can lookup the index of a char in that array directly. Fore careful error handling, you could store -1 at all invalid chars and check that while parsing the file.

Problem with Hash of Array

Hi all
I got the problem that i cannot return the value and key in the hash of array
sub nextWords{
for my $language(0 .. $#language )
{
my $eng = $db->selectall_arrayref("select word from words
left outer join language
on words.languageId = language.languageId
where words.languageId = $language
order by word asc
;"); # #language[$id] limit 10 offset $currentOffset
#%returnArray2d = (#language[$language] =>[#$eng] );
$returnArray2d{#language[$language]} = [#$eng];
}
return %returnArray2d;
}
I cannot really return all the list of words
my %newwordsList =NextWords();
foreach my $key(keys %newwordsList)
{
print "here you are 2 : " . $key . "\n";
for my $ind(0 .. #{$newwordsList{$key}}){
print "dzo" . $newwordsList{$key}[$ind] . "\n";
}
}
output: $key ==> 132 not 123
and the word cannot be printed.. it just prints some
ARRAY(0x320d514)
ARRAY(0x320d544)
ARRAY(0x320d574)
ARRAY(0x320d5a4)
ARRAY(0x320d5d4)
ARRAY(0x320d604)
Please help.. thanks
It looks like you're not setting up %returnArray2d correctly.
Assuming that #language contains the language ids you want, instead of:
$returnArray2d{ #language[$language] } = [#$eng];
You'll want this:
$returnArray2d{ $language[$language] } = [#$eng];
Also, you should avoid using the same name for an array and a scalar value (it works, but it's confusing) (see #language / $language in your code).
Lastly, you are correctly iterating through each key of %newwordsList, however, you will want to subtract 1 from the iteration, so that you don't go past the end of the array:
for my $ind ( 0 .. #{ $newwordsList{$key} } ) {
Should be:
for my $ind (0 .. #{ $newwordsList{$key} } - 1) {
Or (as David pointed out in the comments), you can do:
for my $ind ( 0 .. $#{ $newwordsList{$key} } ) {