I have two tables. First one is $sampleand looks like this:
col1 col2
A 1
A 3
A 4
B 7
... ...
Second one is $exonand looks like this:
col1 col2 col3 col4 col5
name1 A 1 100 200
name2 A 2 300 400
name3 A 3 500 600
name4 A 4 700 800
I want to check if there is a match between col1 and col2 from $sampleand col2 and col3from exon.
I normally use hashes for this in Perl. I know how it works when you are just looking for a match between two columns. But I'm stuck now because values from two columns should match. This is what I have for now
my %hash = ();
while(<$sample>){
chomp;
my #cols = split(/\t/);
my $keyfield = $cols[0]; #col1
my $keyfield2 = $cols[1]; #col2
push #{ $hash{$keyfield}}, $keyfield2}; #this is probably not correct
}
seek $exon,0,0; #cursor resetting
while(<$exon>){
chomp;
my #cols = split(/\t/);
my $keyfield = $cols[1]; #col2
my $keyfield2 = $cols[2]; #col3
if (exists($hash{$keyfield}) && exists($hash{$keyfield2})) {
print $output $cols[0], "\t", $cols[3], "\t", $cols[4], "\n";
}
}
You should use a concatenation of col2 and col3 values as the keys for your hastable
my %hash = ();
while(<$sample>){
chomp;
my #cols = split(/\t/);
my $keyfield = $cols[0] #col1
my $keyfield2 = $cols[1] #col2
my $key = "$keyfield - $keyfield2";
$hash{$key}=1;
}
seek $exon,0,0 #cursor resetting
while(<$exon>){
chomp;
my #cols = split(/\t/);
my $keyfield = $cols[1]; #col2
my $keyfield2 = $cols[2]; #col3
my $key = "$keyfield - $keyfield2";
if (exists($hash{$key}) {
print $output $cols[0], "\t", $cols[3], "\t", $cols[4], "\n";
}
}
You can put both fields as key separarted with a delimiter in your hash:
my #cols = split(/\t);
my $keyfield = $cols[0]."--".$cols[1];
push #{ $hash{$keyfield}}, value};
Related
The Perl script below is written in a shell.
If I use the tab-delimited file numeric then I get the desired result of each line parsed accordingly. However, if I use the file alpha as input then only the first line is parsed.
The only difference between alpha and numeric is that numeric has NC_000023
NC_000023.11:g.41747805_41747806delinsTT
NC_000023.11:g.41750615C>A
while alpha has NC_0000X
NC_0000X.11:g.41747805_41747806delinsTT
NC_0000X.11:g.41750615C>A
What am I missing?
numeric
Input Variant Errors Chromosomal Variant Coding Variant(s)
NM_003924.3:c.*18_*19delGCinsAA NC_000023.11:g.41747805_41747806delinsTT LRG_513t1:c.*18_*19delinsAA NM
NM_003924.3:c.013G>T NC_000023.11:g.41750615C>A LRG_513t1:c.13G>T
alpha
Input Variant Errors Chromosomal Variant Coding Variant(s)
NM_003924.3:c.*18_*19delGCinsAA NC_0000X.11:g.41747805_41747806delinsTT LRG_513t1:c.*18_*19delinsAA NM_003924.3:c.*18_*19delinsAA
NM_003924.3:c.013G>T NC_0000X.11:g.41750615C>A LRG_513t1:c.13G>T NM_003924.3:c.13G>T
Perl
perl -ne '
next if $. == 1;
if ( /.*del([A-Z]+)ins([A-Z]+).*NC_0+([^.]+)\..*g\.([0-9]+)_([0-9]+)/ ) { # indel
print join( "\t", $3, $4, $5, $1, $2 ), "\n";
}
else {
while ( /\t*NC_(\d+)\.\S+g\.(\d+)(\S+)/g ) {
# conditional parse
( $num1, $num2, $common ) = ( $1, $2, $3 );
$num3 = $num2;
if ( $common =~ /^([A-Z])>([A-Z])$/ ) { # SNP
( $ch1, $ch2 ) = ( $1, $2 );
}
elsif ( $common =~ /^del([A-Z])$/ ) { # deletion
( $ch1, $ch2 ) = ( $1, "-" );
}
elsif ( $common =~ /^ins([A-Z])$/ ) { # insertion
( $ch1, $ch2 ) = ( "-", $1 );
}
elsif ( $common =~ /^_(\d+)del([A-Z]+)$/ ) { # multi deletion
( $num3, $ch1, $ch2 ) = ( $1, $2, "-" );
}
elsif ( $common =~ /^_(\d+)ins([A-Z]+)$/ ) { # multi insertion
( $num3, $ch1, $ch2 ) = ( "-", $1, $2 );
}
printf( "%d\t%d\t%d\t%s\t%s\n", $num1, $num2, $num3, $ch1, $ch2 ); # output
map { undef } ( $num1, $num2, $num3, $common, $ch1, $ch2 );
}
}' numeric
output
23 41747805 41747806 GC AA
23 41750615 41750615 C A
output using alpha:
X 41747805 41747806 GC AA
If I use \w instead of \d in the while condition, like this
while ( /\t*NC_(\w+)\.\S+g\.(\d+)(\S+)/g ) { ... }
I get this result
X 41747805 41747806 GC AA
0 41750615 41750615 C A
Why the zero in $1
The while (/\t*NC_(\d+)\. will not match 'NC_0000X.11', due to the 'X' and the regex is looking for digits only.
After the change you made, NC_(\w+) will match 'NC_0000X' and $num1 is set to '0000X'.
Your printf "%d...." $num1 ... will print a 0 for non numeric input. As $num1 is '0000X', it will print as 0.
The input example suggests, that each line consists of fields, which are separated by white-space. Some fields are of interest, otheres not. Each field holds recognizable information.
Your program should follow this structure.
Read the file line by line
split the line into fields
skip fields, which are not of interest, e.g. non NC.*
extract the necessary information from the field
do whatever necessary, sum it up, collect it
print the information at the level needed. Per field, line, file or after all files
It is much easier to work on smaller chuncks instead of finding a regex which works for the whole line. It is much easier to read, understand and maintain.
Please advice.
for my $record (#item) {
for my $int (#$record){
# DEBUG( "DEBUG:: $record and $int");
my %data = ( $record , $int );
}
}
}
Record is like
abc ,china
abc ,japan
abc , italy
abc , singapore
print Dumper %data;
output :
abc , singapore
Now the issues is when I dump the output it shows me last record entry in hash table.May be because of unique key.
Kindly suggest.
Two problems:
You are recreating the hash in each iteration of the loop. The correct way would be
my %data;
for my $record (#item) {
for my $int (#$record){
$data{$record} = $int;
}
}
Hash keys must be unique. It's not possible to have a hash like
( abc => 'china',
abc => 'japan' )
You can use a hash of arrays, though. Just assign to it with
push #{ $data{$record} }, $int;
It will create the following structure:
( abc => [ 'china', 'japan', 'italy', 'singapore' ] )
What is the best way to construct sql with various number of WHERE conditions ?
My solution looks ugly:
my ($where, #values);
if ($phone_number)
{
$where = 'AND pnone_number=?';
#values = ($from, $till, $phone_number);
}
else
{
$where = '';
#values = ($from, $till);
}
my $sql = 'SELECT * FROM calls WHERE time between ? AND ? '.$where.' ORDER BY time';
my $res = $dbh->selectall_arrayref($sql, undef, #values) or warn 'error';
How about:
my $where = '';
my #values = ( $from, $till );
if ( $phone_number ) {
$where = 'AND phone_number=?';
push #values, $phone_number;
}
That eliminates the need for your else clause.
You could also use something like SQL::Abstract.
use SQL::Abstract;
...
my ( $sql, #values ) = SQL::Abstract->new->select(
'calls', # table
'*', # columns
{ time => { '<=' => $till, '>' => $from }, # where clause
$phone_number ? ( phone_number => $phone_number ) : ( ),
},
'time' # order clause
);
1=1 is added for cases when $where would be epmty.
my $where = "AND time between ? AND ? ";
my #values = ($from, $till);
if ($phone_number) {
$where .= 'AND pnone_number=? ';
push #values, $phone_number;
}
my $sql = 'SELECT * FROM calls WHERE 1=1 $where ORDER BY time';
my $res = $dbh->selectall_arrayref($sql, undef, #values) or warn 'error';
Conditional list-include (aka "enterprise"):
my #values = ( $from,
$till,
( $phone_number ) x !! $phone_number,
);
my $sql = 'SELECT * FROM calls WHERE time between ? AND ? '
. 'AND phone_number=?' x !! $phone_number
. ' ORDER BY time';
#!/usr/bin/env perl
use warnings;
use 5.012;
use DBI;
my $dsn = "DBI:Proxy:hostname=horst;port=2000;dsn=DBI:ODBC:db1.mdb";
my $dbh = DBI->connect( $dsn, undef, undef ) or die $DBI::errstr;
$dbh->{RaiseError} = 1;
$dbh->{PrintError} = 0;
my $my_table = 'my_table';
eval{ $dbh->do( "DROP TABLE $my_table" ) };
$dbh->do( "CREATE TABLE $my_table" );
my $ref = [ qw( 1 2 ) ];
for my $col ( 'col_1', 'col_2', 'col_3' ) {
my $add = "$col INT";
$dbh->do( "ALTER TABLE $my_table ADD $add" );
my $sql = "INSERT INTO $my_table ( $col ) VALUES( ? )";
my $sth = $dbh->prepare( $sql );
$sth->bind_param_array( 1, $ref );
$sth->execute_array( { ArrayTupleStatus => \my #tuple_status } );
}
my $sth = $dbh->prepare( "SELECT * FROM $my_table" );
$sth->execute();
$sth->dump_results();
$dbh->disconnect;
This script outputs:
'1', undef, undef
'2', undef, undef
undef, '1', undef
undef, '2', undef
undef, undef, '1'
undef, undef, '2'
6 rows
How do I have to change this script to get this output:
'1', '1', '1'
'2', '2', '2'
2 rows
Do this in two steps:
Create the 3 columns
insert data in them
You prepare a SQL statement 3 times and execute twice for values 1,2 so you get 6 rows. I don't know how to answer your question of how do you change it to get 2 rows since we've no idea what you are trying to achieve. Without knowing what you are trying to achieve I'd be guessing but the following results in the output you wanted:
my $ref = [ qw( 1 2 ) ];
for my $col ( 'col_1', 'col_2', 'col_3' ) {
my $add = "$col INT";
$dbh->do( "ALTER TABLE $my_table ADD $add" );
}
$sql = "INSERT INTO $my_table ( col_1, col_2, col_3 ) VALUES( ?,?,? )";
my $sth = $dbh->prepare( $sql );
$sth->bind_param_array( 1, $ref );
$sth->bind_param_array( 2, $ref );
$sth->bind_param_array( 3, $ref );
$sth->execute_array( { ArrayTupleStatus => \my #tuple_status } );
In this script I have problems with file-name-extensions:
if I use /home/mm/test_x it works, with file named /home/mm/test_x.csv it doesn't:
#!/usr/bin/env perl
use warnings; use strict;
use 5.012;
use DBI;
my $table_1 = '/home/mm/test_1.csv';
my $table_2 = '/home/mm/test_2.csv';
#$table_1 = '/home/mm/test_1';
#$table_2 = '/home/mm/test_2';
my $dbh = DBI->connect( "DBI:CSV:" );
$dbh->{RaiseError} = 1;
$table_1 = $dbh->quote_identifier( $table_1 );
$table_2 = $dbh->quote_identifier( $table_2 );
my $sth = $dbh->prepare( "SELECT a.id, a.name, b.city FROM $table_1 AS a NATURAL JOIN $table_2 AS b" );
$sth->execute;
$sth->dump_results;
$dbh->disconnect;
Output with file-name-extention:
DBD::CSV::st execute failed:
Execution ERROR: No such column '"/home/mm/test_1.csv".id' called from /usr/local/lib/perl5/site_perl/5.12.0/x86_64-linux/DBD/File.pm at 570.
Output without file-name-extension:
'1', 'Brown', 'Laramie'
'2', 'Smith', 'Watertown'
2 rows
Is this a bug?
cat test_1.csv
id,name
1,Brown
2,Smith
5,Green
cat test_2.csv
id,city
1,Laramie
2,Watertown
8,Springville
DBD::CSV provides a way to map the table names you use in your queries to filenames. The same mechanism is used to set up per-file attributes like line ending, field separator etc. look for 'csv_tables' in the DBD::CSV documentation.
#!/usr/bin/env perl
use warnings;
use strict;
use DBI;
my $dbh = DBI->connect("DBI:CSV:f_dir=/home/mm", { RaiseError => 1 });
$dbh->{csv_tables}->{table_1} = {
'file' => 'test_1.csv',
'eol' => "\n",
};
$dbh->{csv_tables}->{table_2} = {
'file' => 'test_2.csv',
'eol' => "\n",
};
my $sth = $dbh->prepare( "SELECT a.id, a.name, b.city FROM table_1 AS a NATURAL JOIN table_2 AS b" );
$sth->execute();
$sth->dump_results();
$dbh->disconnect();
In my case I had to specify a line ending character, because I created the CSV files in vi so they ended up with Unix line endings whereas DBD::CSV assumes DOS/Windows line-endings regardless of the platform the script is run on.
I looks like even this works:
#!/usr/bin/env perl
use warnings; use strict;
use 5.012;
use DBI;
my $dbh = DBI->connect("DBI:CSV:f_dir=/home/mm/Dokumente", undef, undef, { RaiseError => 1, });
my $table = 'new.csv';
$dbh->do( "DROP TABLE IF EXISTS $table" );
$dbh->do( "CREATE TABLE $table ( id INT, name CHAR(64), city CHAR(64) )" );
my $sth_new = $dbh->prepare( "INSERT INTO $table( id, name, city ) VALUES ( ?, ?, ? )" );
$dbh->{csv_tables}->{table_1} = { 'file' => '/tmp/test_1.csv', 'eol' => "\n", };
$dbh->{csv_tables}->{table_2} = { 'file' => '/tmp/test_2.csv', 'eol' => "\n", };
my $sth_old = $dbh->prepare( "SELECT a.id, a.name, b.city FROM table_1 AS a NATURAL JOIN table_2 AS b" );
$sth_old->execute();
while ( my $hash_ref = $sth_old->fetchrow_hashref() ) {
state $count = 1;
$sth_new->execute( $count++, $hash_ref->{'a.name'}, $hash_ref->{'b.city'} );
}
$dbh->disconnect();
I think you might want to take a look at the f_ext and f_dir attributes. You can then class your table names as "test_1" and "test_2" without the csv but the files used will be test_1.csv and test_2.csv. The problem with a dot in the table name is a dot is usually used for separating the schema from the table name (see f_schema).