Different output from same code parsing simiiar tab-delimited files - perl

The Perl script below is written in a shell.
If I use the tab-delimited file numeric then I get the desired result of each line parsed accordingly. However, if I use the file alpha as input then only the first line is parsed.
The only difference between alpha and numeric is that numeric has NC_000023
NC_000023.11:g.41747805_41747806delinsTT
NC_000023.11:g.41750615C>A
while alpha has NC_0000X
NC_0000X.11:g.41747805_41747806delinsTT
NC_0000X.11:g.41750615C>A
What am I missing?
numeric
Input Variant Errors Chromosomal Variant Coding Variant(s)
NM_003924.3:c.*18_*19delGCinsAA NC_000023.11:g.41747805_41747806delinsTT LRG_513t1:c.*18_*19delinsAA NM
NM_003924.3:c.013G>T NC_000023.11:g.41750615C>A LRG_513t1:c.13G>T
alpha
Input Variant Errors Chromosomal Variant Coding Variant(s)
NM_003924.3:c.*18_*19delGCinsAA NC_0000X.11:g.41747805_41747806delinsTT LRG_513t1:c.*18_*19delinsAA NM_003924.3:c.*18_*19delinsAA
NM_003924.3:c.013G>T NC_0000X.11:g.41750615C>A LRG_513t1:c.13G>T NM_003924.3:c.13G>T
Perl
perl -ne '
next if $. == 1;
if ( /.*del([A-Z]+)ins([A-Z]+).*NC_0+([^.]+)\..*g\.([0-9]+)_([0-9]+)/ ) { # indel
print join( "\t", $3, $4, $5, $1, $2 ), "\n";
}
else {
while ( /\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g ) {
# conditional parse
( $num1, $num2, $common ) = ( $1, $2, $3 );
$num3 = $num2;
if ( $common =~ /^([A-Z])>([A-Z])$/ ) { # SNP
( $ch1, $ch2 ) = ( $1, $2 );
}
elsif ( $common =~ /^del([A-Z])$/ ) { # deletion
( $ch1, $ch2 ) = ( $1, "-" );
}
elsif ( $common =~ /^ins([A-Z])$/ ) { # insertion
( $ch1, $ch2 ) = ( "-", $1 );
}
elsif ( $common =~ /^_(\d+)del([A-Z]+)$/ ) { # multi deletion
( $num3, $ch1, $ch2 ) = ( $1, $2, "-" );
}
elsif ( $common =~ /^_(\d+)ins([A-Z]+)$/ ) { # multi insertion
( $num3, $ch1, $ch2 ) = ( "-", $1, $2 );
}
printf( "%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2 ); # output
map { undef } ( $num1, $num2, $num3, $common, $ch1, $ch2 );
}
}' numeric
output
23 41747805 41747806 GC AA
23 41750615 41750615 C A
output using alpha:
X 41747805 41747806 GC AA
If I use \w instead of \d in the while condition, like this
while ( /\t*NC_(\w+)\.\S+g\.(\d+)(\S+)/g ) { ... }
I get this result
X 41747805 41747806 GC AA
0 41750615 41750615 C A
Why the zero in $1

The while (/\t*NC_(\d+)\. will not match 'NC_0000X.11', due to the 'X' and the regex is looking for digits only.
After the change you made, NC_(\w+) will match 'NC_0000X' and $num1 is set to '0000X'.
Your printf "%d...." $num1 ... will print a 0 for non numeric input. As $num1 is '0000X', it will print as 0.
The input example suggests, that each line consists of fields, which are separated by white-space. Some fields are of interest, otheres not. Each field holds recognizable information.
Your program should follow this structure.
Read the file line by line
split the line into fields
skip fields, which are not of interest, e.g. non NC.*
extract the necessary information from the field
do whatever necessary, sum it up, collect it
print the information at the level needed. Per field, line, file or after all files
It is much easier to work on smaller chuncks instead of finding a regex which works for the whole line. It is much easier to read, understand and maintain.

Related

Capture group stored as variable, Substitution operator returns it as blank?

#!/approot/opt/bin/perl
use strict;
my $file = $ARGV[0];
open FILE, $file or die;
my $line;
while (<FILE>) {
if (m/create unique clustered index \S+ on \S+ \(([^\)]+)\)/) {
$line = $1;
}
s/chargeable_items/$line/;
print;
}
Here is the text file I am trying to make this work with but everytime it attempts to replace it. It replaces the string with a blank.
CREATE TABLE t_test
(
system_name varchar(20) NOT NULL,
server_type smallint NOT NULL,
chargeable_system bit NOT NULL,
chargeable_items bit NOT NULL
)
create unique clustered index host_idx on dbo.t_host (system_name, server_type, environment)
create nonclustered index tt_host on dbo.t_host (N.A.)
Everytime it does the substitution operator it replaces "chargeable_items" with a blank value as shown below
CREATE TABLE t_test
(
system_name varchar(20) NOT NULL,
server_type smallint NOT NULL,
chargeable_system bit NOT NULL,
bit NOT NULL
)
create unique clustered index host_idx on dbo.t_host (system_name, server_type, environment)
create nonclustered index tt_host on dbo.t_host (N.A.)
You are reading file line-by-line in while loop.
At line
chargeable_items bit NOT NULL
you have not yet set $line to any value, so chargeable_items is replaced with empty value. You only set $line later on line
create unique clustered index host_idx on dbo.t_host (system_name, server_type, environment)
but $line is never used after that since there is no chargeable_items after that line.
Working solution would be to read whole file at once and then do matching and substitution:
#!/approot/opt/bin/perl
use strict;
my $file = $ARGV[0];
open FILE, $file or die;
local $/;
my $data = <FILE>;
if ($data =~ m/create unique clustered index \S+ on \S+ \(([^\)]+)\)/) {
my $line = $1;
$data =~ s/chargeable_items/$line/;
}
print $data;
Here local $/ sets special variable $/ (input record separator) to undefined value locally, so that <FILE> will read whole file at once, instead of line-by-line, which is normal behaviour.

Untainting a blessed hash member with or without the delete

I saw this line of code in some sources
( $self->{arg} ) = ( ( delete $self->{arg} ) =~ /(.*)/s ) if ${^TAINT};
I understand the untainting. I also known delete
My question is, in what circumstances is it necessary or preferred to use the delete, and isn't it enough to use the simpler
( $self->{arg} ) = ( ( $self->{arg} ) =~ /(.*)/s ) if ${^TAINT};
For example
#!/usr/bin/env perl -T
use 5.014;
use warnings;
package Some {
use Moose;
has 'arg' => (is => 'rw', isa => 'Str');
sub doit {
my $self = shift;
#( $self->{arg} ) = ( ( delete $self->{arg} ) =~ /(.*)/s ) if ${^TAINT};
( $self->{arg} ) = ( ( $self->{arg} ) =~ /(.*)/s ) if ${^TAINT};
}
};
my $some = Some->new( arg => 'some text' );
$some->doit();
say $some->arg;
With a normal hash deleting the value and reinserting will give the same result as modifying it in place.
The commit does not give any information about why he deletes it just that he copies the functionality from Mason 1. But if you look at the source of HTML::Mason::Lexer, you will find this comment:
We need to untaint the component or else the regexes will fail
to a Perl bug. The delete is important because we need to
create an entirely new scalar, not just modify the existing one.
($current->{comp_source}) = (delete $current->{comp_source}) =~ /(.*)/s if taint_is_on;
So the reason of doing it this way is to have a new scalar, although he does not do that for the other place where he is untainting: Mason::Interp, so my guess is an earlier Perl bug, when untainting.
So the difference is that with delete will give you a new scalar, although this will seldom have a practical application. (Delete and insert is also a slower operation of course.)
use strict;
my $hash->{test} = 'test';
print \($hash->{test}),"\n";
( $hash->{test} ) = ( ( $hash->{test} ) =~ /(.*)/s );
print \($hash->{test}),"\n";
( $hash->{test} ) = ( ( delete $hash->{test} ) =~ /(.*)/s );
print \($hash->{test}),"\n";
gives
SCALAR(0x7f84d10047e8)
SCALAR(0x7f84d10047e8)
SCALAR(0x7f84d1029230)

use multiple values as a key in a Perl hash

I have two tables. First one is $sampleand looks like this:
col1 col2
A 1
A 3
A 4
B 7
... ...
Second one is $exonand looks like this:
col1 col2 col3 col4 col5
name1 A 1 100 200
name2 A 2 300 400
name3 A 3 500 600
name4 A 4 700 800
I want to check if there is a match between col1 and col2 from $sampleand col2 and col3from exon.
I normally use hashes for this in Perl. I know how it works when you are just looking for a match between two columns. But I'm stuck now because values from two columns should match. This is what I have for now
my %hash = ();
while(<$sample>){
chomp;
my #cols = split(/\t/);
my $keyfield = $cols[0]; #col1
my $keyfield2 = $cols[1]; #col2
push #{ $hash{$keyfield}}, $keyfield2}; #this is probably not correct
}
seek $exon,0,0; #cursor resetting
while(<$exon>){
chomp;
my #cols = split(/\t/);
my $keyfield = $cols[1]; #col2
my $keyfield2 = $cols[2]; #col3
if (exists($hash{$keyfield}) && exists($hash{$keyfield2})) {
print $output $cols[0], "\t", $cols[3], "\t", $cols[4], "\n";
}
}
You should use a concatenation of col2 and col3 values as the keys for your hastable
my %hash = ();
while(<$sample>){
chomp;
my #cols = split(/\t/);
my $keyfield = $cols[0] #col1
my $keyfield2 = $cols[1] #col2
my $key = "$keyfield - $keyfield2";
$hash{$key}=1;
}
seek $exon,0,0 #cursor resetting
while(<$exon>){
chomp;
my #cols = split(/\t/);
my $keyfield = $cols[1]; #col2
my $keyfield2 = $cols[2]; #col3
my $key = "$keyfield - $keyfield2";
if (exists($hash{$key}) {
print $output $cols[0], "\t", $cols[3], "\t", $cols[4], "\n";
}
}
You can put both fields as key separarted with a delimiter in your hash:
my #cols = split(/\t);
my $keyfield = $cols[0]."--".$cols[1];
push #{ $hash{$keyfield}}, value};

more elegant way to construct SQL adding WHERE and using placeholders

What is the best way to construct sql with various number of WHERE conditions ?
My solution looks ugly:
my ($where, #values);
if ($phone_number)
{
$where = 'AND pnone_number=?';
#values = ($from, $till, $phone_number);
}
else
{
$where = '';
#values = ($from, $till);
}
my $sql = 'SELECT * FROM calls WHERE time between ? AND ? '.$where.' ORDER BY time';
my $res = $dbh->selectall_arrayref($sql, undef, #values) or warn 'error';
How about:
my $where = '';
my #values = ( $from, $till );
if ( $phone_number ) {
$where = 'AND phone_number=?';
push #values, $phone_number;
}
That eliminates the need for your else clause.
You could also use something like SQL::Abstract.
use SQL::Abstract;
...
my ( $sql, #values ) = SQL::Abstract->new->select(
'calls', # table
'*', # columns
{ time => { '<=' => $till, '>' => $from }, # where clause
$phone_number ? ( phone_number => $phone_number ) : ( ),
},
'time' # order clause
);
1=1 is added for cases when $where would be epmty.
my $where = "AND time between ? AND ? ";
my #values = ($from, $till);
if ($phone_number) {
$where .= 'AND pnone_number=? ';
push #values, $phone_number;
}
my $sql = 'SELECT * FROM calls WHERE 1=1 $where ORDER BY time';
my $res = $dbh->selectall_arrayref($sql, undef, #values) or warn 'error';
Conditional list-include (aka "enterprise"):
my #values = ( $from,
$till,
( $phone_number ) x !! $phone_number,
);
my $sql = 'SELECT * FROM calls WHERE time between ? AND ? '
. 'AND phone_number=?' x !! $phone_number
. ' ORDER BY time';

Perl DBI and sql now()

I have been trying to use sql NOW() function while I update a table. But nothing happens to that date field, I think DBI just ignores that value. Code is :
dbUpdate($id, (notes => 'This was an update', filesize => '100.505', dateEnd => 'NOW()'));
and the function is :
sub dbUpdate {
my $id = shift;
my %val_hash = #_;
my $table = 'backupjobs';
my #fields = keys %val_hash;
my #values = values %val_hash;
my $update_stmt = "UPDATE $table SET ";
my $count = 1;
foreach ( #fields ) {
$update_stmt .= "$_ = ? ";
$update_stmt .= ', ' if ($count != scalar #fields);
$count++;
}
$update_stmt .= " WHERE ID = ?";
print "update_stmt is : $update_stmt\n";
my $dbo = dbConnect();
my $sth = $dbo->prepare( $update_stmt );
$sth->execute( #values, $id ) or die "DB Update failed : $!\n";
$dbo->disconnect || die "Failed to disconnect\n";
return 1;
}#dbUpdate
As you can see this is a "dynamic" generation of the sql query and hence I dont know where an sql date function(like now()) come.
In the given example the sql query will be
UPDATE backupjobs SET filesize = ? , notes = ? , dateEnd = ? WHERE ID = ?
with param values
100.55, This was an update, NOW(), 7
But the date column still shows 0000-00-........
Any ideas how to fix this ?
I have been trying to use sql NOW() function while I update a table. But
nothing happens to that date field, I think DBI just ignores that
value.
No, it doesn't. But the DB thinks you are supplying a datetime or timestamp data type when you are in fact trying to add the string NOW(). That of course doesn't work. Unfortunately DBI cannot find that out for you. All it does is change ? into an escaped (via $dbh->quote()) version of the string it got.
There are several things you could do:
Supply a current timestamp string in the appropriate format instead of the string NOW().
If you have it available, you can use DateTime::Format::MySQL like this:
use DateTime::Format::MySQL;
dbUpdate(
$id,
(
notes => 'This was an update',
filesize => '100.505',
dateEnd => DateTime::Format::MySQL->format_datetime(DateTime->now),
)
);
If not, just use DateTime or Time::Piece.
use DateTime;
my $now = DateTime->now->datetime;
$now =~ y/T/ /;
dbUpdate(
$id,
(notes => 'This was an update', filesize => '100.505', dateEnd => $now));
Tell your dbUpdate function to look for things like NOW() and react accordingly.
You can do something like this. But there are better ways to code this. You should also consider that there is e.g. CURRENT_TIMESTAMP() as well.
sub dbUpdate {
my $id = shift;
my %val_hash = #_;
my $table = 'backupjobs';
my $update_stmt = "UPDATE $table SET ";
# Check for the NOW() value
# This could be done with others too
foreach ( keys %val_hash ) {
if ($val_hash{$_} =~ m/^NOW\(\)$/i) {
$update_stmt .= "$_ = NOW()";
$update_stmt .= ', ' if scalar keys %val_hash > 1;
delete $val_hash{$_};
}
}
# Put everything together, join keeps track of the trailing comma
$update_stmt .= join(', ', map { "$_=?" } keys %val_hash );
$update_stmt .= " WHERE ID = ?";
say "update_stmt is : $update_stmt";
say "values are: ", join(', ', values %val_hash);
my $dbo = dbConnect();
my $sth = $dbo->prepare( $update_stmt );
$sth->execute( values %val_hash, $id ) or die "DB Update failed : $!\n";
$dbo->disconnect || die "Failed to disconnect\n";
return 1;
}
Write your queries yourself.
You're probably not going to do it and I'll not add an example since you know how to do it anyway.
Here's something else: Is this the only thing you do with your database while your program runs? It is not wise to connect and disconnect the database every time you make a query. It would be better for performance to connect the database once you need it (or at the beginning of the program, if you always use it) and just use this dbh/dbo everywhere. It saves a lot of time (and code).
Sorry but you can't use NOW() as an interpolated value like that.
when a ? is used in an SQL statement, the corresponding value is (via some mechanism or other) passed to the database escaped, so whatever value is interpreted as a string, not as SQL. So you are in effect attempting to use the string 'NOW()' as a date value, not the function NOW().
To use NOW() as a function, you will have to insert it into the SQL itself rather than pass it as a bound value. This means you will either have to use a hack of your dbupdate function or write a new one, or obtain the time as a string in perl and pass the resulting string the dbupdate.