Postgres table with SELECT DISTINCT and multiple subqueries

I have a table in Postgres 9.2 with 38 variables and I need a selection of the "best" results.
What I need is:
distinct var1 and var2, then from that:
the min of var3, and also var4 from that same row;
the max of var5 (if there is more than one result, the one with the min var3), plus var6 to var12 from that same row;
var13 sorted by priority (3 first, 6 second, 0 last), plus var14-var18 from that same row.
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18 ...
1 1 2 a 2 a . . . . . . 0 . . . . .
1 1 1 b 1 b . . . . . . 3 . . . . .
1 2 4 c 3 c . . . . . . 3 . . . . .
1 2 3 d 4 d . . . . . . 6 . . . . .
2 1 1 a 3 a . . . . . . 3 . . . . .
3 1 3 a 2 a . . . . . . 6 . . . . .
3 1 2 b 4 b . . . . . . 0 . . . . .
4 1 3 a 4 a . . . . . . 3 . . . . .
4 1 6 b 2 b . . . . . . 0 . . . . .
4 2 2 c 2 c . . . . . . 0 . . . . .
4 3 5 d 3 d . . . . . . 3 . . . . .
4 3 4 e 4 e . . . . . . 6 . . . . .
4 3 7 f 4 f . . . . . . 3 . . . . .
...
The result should be:
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18
1 1 1 b 2 a . . . . . . 3 . . . . .
1 2 3 d 4 d . . . . . . 3 . . . . .
2 1 1 a 3 a . . . . . . 3 . . . . .
3 1 2 b 4 b . . . . . . 6 . . . . .
4 1 3 a 4 a . . . . . . 3 . . . . .
4 2 2 c 2 c . . . . . . 0 . . . . .
4 3 4 e 4 e . . . . . . 3 . . . . .
...
There is also an image of the table where the colored fields show what should be selected.
Hope this makes sense.
EDIT:
I got a pointer in another post to provide CREATE and INSERT statements for the table:
create table parent (
v1 character varying,
v2 character varying,
v3 character varying,
v4 character varying,
v5 character varying,
v6 character varying,
v7 character varying,
v8 character varying,
v9 character varying,
v10 character varying,
v11 character varying,
v12 character varying,
v13 character varying,
v14 character varying,
v15 character varying,
v16 character varying,
v17 character varying,
v18 character varying
);
insert into parent values('1','1','2','a','2','a','x1','x1','x1','x1','x1','x1','0','x1','x1','x1','x1','x1');
insert into parent values('1','1','1','b','1','b','x2','x2','x2','x2','x2','x2','3','x2','x2','x2','x2','x2');
insert into parent values('1','2','4','c','3','c','x3','x3','x3','x3','x3','x3','3','x3','x3','x3','x3','x3');
insert into parent values('1','2','3','d','4','d','x4','x4','x4','x4','x4','x4','6','x4','x4','x4','x4','x4');
insert into parent values('2','1','1','a','3','a','x1','x1','x1','x1','x1','x1','3','x1','x1','x1','x1','x1');
insert into parent values('3','1','3','a','2','a','x1','x1','x1','x1','x1','x1','6','x1','x1','x1','x1','x1');
insert into parent values('3','1','2','b','4','b','x2','x2','x2','x2','x2','x2','0','x2','x2','x2','x2','x2');
insert into parent values('4','1','3','a','4','a','x1','x1','x1','x1','x1','x1','3','x1','x1','x1','x1','x1');
insert into parent values('4','1','6','b','2','b','x2','x2','x2','x2','x2','x2','0','x2','x2','x2','x2','x2');
insert into parent values('4','2','2','c','2','c','x3','x3','x3','x3','x3','x3','0','x3','x3','x3','x3','x3');
insert into parent values('4','3','5','d','3','d','x4','x4','x4','x4','x4','x4','3','x4','x4','x4','x4','x4');
insert into parent values('4','3','4','e','4','e','x5','x5','x5','x5','x5','x5','6','x5','x5','x5','x5','x5');
insert into parent values('4','3','7','f','4','f','x6','x6','x6','x6','x6','x6','3','x6','x6','x6','x6','x6');
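A possible starting point (just a sketch against the sample data above, not a complete solution): Postgres's DISTINCT ON picks one whole row per (v1, v2) group according to the ORDER BY, so each group of columns that has to come from its own "best" row would need its own DISTINCT ON subquery, joined back together on (v1, v2). For example, for the min-v3 columns and for the v13 priority columns (the ::int casts assume the values are numeric strings, as in the sample):
-- one row per (v1, v2) by minimum v3
SELECT DISTINCT ON (v1, v2)
       v1, v2, v3, v4
FROM   parent
ORDER  BY v1, v2, v3::int;

-- one row per (v1, v2) by the v13 priority: 3 first, 6 second, 0 last
SELECT DISTINCT ON (v1, v2)
       v1, v2, v13, v14, v15, v16, v17, v18
FROM   parent
ORDER  BY v1, v2,
          CASE v13 WHEN '3' THEN 1 WHEN '6' THEN 2 WHEN '0' THEN 3 END;
The max-v5 / min-v3 tiebreak for v5 to v12 would be a third subquery with ORDER BY v1, v2, v5::int DESC, v3::int, and the three result sets can then be joined on (v1, v2).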


Data losing original format

I am relatively new to PowerShell and having a bit of a strange problem with a script. I have searched the forums and haven't been able to find anything that works.
The issue is that when I convert the output of commands to and from Base64 for transport via a custom protocol we use in our environment, it loses its formatting. Commands are executed on the remote systems by passing the command string to IEX and storing the output in a variable. I convert the output to Base64 using the following commands:
$Bytes = [System.Text.Encoding]::Unicode.GetBytes($str1)
$EncodedCmd = [Convert]::ToBase64String($Bytes)
At the other end, when we receive the output, we convert back using:
[System.Text.Encoding]::Unicode.GetString([System.Convert]::FromBase64String($EncodedCmd))
The problem is that although the output is correct, its formatting has been lost. For example, if I run the ipconfig command:
Windows IP Configuration Ethernet adapter Local Area Connection 2: Media State . . . . . . . . . . . : Media disconnected Connection-specific DNS Suffix . : Ethernet
adapter Local Area Connection 3: Connection-specific DNS Suffix . : Link-local IPv6 Address . . . . . : fe80::3cd8:3c7f:c78b:a78f%14 IPv4 Address. . . . . . . . . . .
: 192.168.10.64 Subnet Mask . . . . . . . . . . . : 255.255.255.0 Default Gateway . . . . . . . . . : 192.168.10.100 Ethernet adapter Local Area Connection: Connection-sp
ecific DNS Suffix . : IPv4 Address. . . . . . . . . . . : 172.10.15.201 Subnet Mask . . . . . . . . . . . : 255.255.255.0 Default Gateway . . . . . . . . . : 172.10.15
1.200 Tunnel adapter isatap.{42EDCBE-8172-5478-AD67E-8A28273E95}: Media State . . . . . . . . . . . : Media disconnected Connection-specific DNS Suffix . : Tunnel ada
pter isatap.{42EDCBE-8172-5478-AD67E-8A28273E95}: Media State . . . . . . . . . . . : Media disconnected Connection-specific DNS Suffix . : Tunnel adapter isatap.{42EDCBE-8172-5478-AD67E-8A28273E95}: Media State . . . . . . . . . . . : Media disconnected Connection-specific DNS Suffix . : Tunnel adapter Teredo Tunneling Pseudo-Inter
face: Media State . . . . . . . . . . . : Media disconnected Connection-specific DNS Suffix . :
The formatting is all over the place and hard to read. I have played around with it a bit, but I can't find a good way of returning the command output in the correct format. I'd appreciate any ideas on how I can fix the formatting.
What happens here is that the $str1 variable is an array of strings. It doesn't contain newline characters; each line is simply its own element.
When the variable is converted to Base64, all the elements of the array are concatenated together. This can be seen easily enough:
$Bytes[43..60] | % { "$_ -> " + [char] $_}
0 ->
105 -> i
0 ->
111 -> o
0 ->
110 -> n
0 ->
32 ->
0 ->
32 ->
0 ->
32 ->
0 ->
69 -> E
0 ->
116 -> t
0 ->
104 -> h
Here the 0 bytes come from the two-byte Unicode (UTF-16) encoding. Pay attention to 32, which is the space character. So one can see that there is just space padding, no line terminators, in the source string:
Windows IP Configuration
Ethernet
As a solution, either add line feed characters or serialize the whole array as XML.
Adding line feed characters is done by joining the array elements with -join, using [Environment]::NewLine as the separator character. Like so:
$Bytes = [System.Text.Encoding]::Unicode.GetBytes( $($str1 -join [environment]::newline))
$Bytes[46..67] | % { "$_ -> " + [char] $_}
105 -> i
0 ->
111 -> o
0 ->
110 -> n
0 ->
13 ->
0 ->
10 ->
0 ->
13 ->
0 ->
10 ->
0 ->
13 ->
0 ->
10 ->
0 ->
69 -> E
0 ->
116 -> t
0 ->
Here, the 13 and 10 are the CR and LF characters that Windows uses for line breaks. After adding the line feed characters, the resulting string looks like the source. Be aware that though it looks the same, it is not the same: the source is an array of strings, the outcome is a single string containing line feeds.
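Putting the pieces together, a minimal round trip might look like this (a sketch, assuming $str1 holds the array of lines captured from the remote command):
# sending end: join into one string, then Base64-encode
$joined     = $str1 -join [Environment]::NewLine
$Bytes      = [System.Text.Encoding]::Unicode.GetBytes($joined)
$EncodedCmd = [Convert]::ToBase64String($Bytes)

# receiving end: the line breaks survive the round trip
[System.Text.Encoding]::Unicode.GetString([Convert]::FromBase64String($EncodedCmd))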
If you must preserve the original, serialization is the way to go.
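For example, with PSSerializer (available from PowerShell 3.0; Export-Clixml/Import-Clixml via a file is the older alternative). This is a sketch, and $str2 is just an illustrative variable name:
# sending end: serialize the array itself, then Base64-encode the CLIXML text
$clixml     = [System.Management.Automation.PSSerializer]::Serialize($str1)
$Bytes      = [System.Text.Encoding]::Unicode.GetBytes($clixml)
$EncodedCmd = [Convert]::ToBase64String($Bytes)

# receiving end: decode and deserialize back into an array of strings
$clixml = [System.Text.Encoding]::Unicode.GetString([Convert]::FromBase64String($EncodedCmd))
$str2   = [System.Management.Automation.PSSerializer]::Deserialize($clixml)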

sed: delete lines that match a pattern in a given field

I have a file tab delimited that looks like this:
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
53_234 78 . CCG GAT 999 . . GT:PL:DP:DPR
45_569 5 . TCCG GTTA 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
I am trying to use sed to delete all the lines that contain more than one letter in the 4th field (in the case above, lines 7 and 8 from the top). I have tried the following regular expression, but there must be a glitch somewhere that I cannot find:
sed '5,${;/\([^.]*\t\)\{3\}\[A-Z][A-Z]\+\t/d;}' input.vcf>new.vcf
The syntax is as follows:
5,$ #start at line 5 until the end of the file ($)
([^.]*\t) #matching group is any single character followed by zero or more characters, followed by a tab.
{3} #previous block repeated 3 times (presumably for the 4th field)
[A-Z][A-Z]+\t #followed by any string of two letters or more followed by a tab.
Unfortunately, this doesn't work, but I know I am close to making it work. Any hints or help will make this a great teaching moment.
Thanks.
If awk is okay for you, you can use the command below:
awk '(FNR<5){print} (FNR>=5)&&length($4)<=1' input.vcf
The default delimiter is whitespace; you can use -F"\t" to switch it to tab by putting it right after awk, for instance awk -F"\t" ....
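With the delimiter set explicitly for this tab-separated file, the full command would be, for example:
awk -F"\t" '(FNR<5){print} (FNR>=5)&&length($4)<=1' input.vcf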
(FNR<5){print}: FNR is the per-file record (line) number; when it is less than 5, print the whole line.
(FNR>=5) && length($4)<=1 handles the remaining lines and keeps only those whose 4th field has one character or less.
Output:
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
You can redirect the output to an output file.
$ awk 'NR<5 || $4~/^.$/' file
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
Fixed your sed filter (took me a while; I almost went crazy over it):
5,${/^\([^\t]\+\t\)\{3\}[A-Z][A-Z]\+\t/d}
Your errors:
[^.]*: everything but a dot.
Thanks to Ed, now I know that. I thought the dot had to be escaped, but that does not seem to apply between brackets. Anyhow, this could match a tab character and match 2 or 3 groups instead of one, failing to match your line (regexes are greedy by default).
\[A-Z][A-Z]: bad backslash. What did it do? hum, dunno!
test:
$ sed '5,${/^\([^\t]\+\t\)\{3\}[A-Z][A-Z]\+\t/d}' foo.Txt
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
conclusion: to process delimited fields, awk is better :)

Optimizing Large Data Intersect

I have two files from which a subset looks like this:
regions
chr1 150547262 150547338 v2MCL1_29.1.122 . GENE_ID=MCL1;Pool=2;PURPOSE=CNV
chr1 150547417 150547537 v2MCL1_29.1.283 . GENE_ID=MCL1;Pool=1;PURPOSE=CNV
chr1 150547679 150547797 v2MCL1_29.2.32 . GENE_ID=MCL1;Pool=2;PURPOSE=CNV
chr1 150547866 150547951 v2MCL1_29.2.574 . GENE_ID=MCL1;Pool=1;PURPOSE=CNV
chr1 150548008 150548096 v2MCL1_29.2.229 . GENE_ID=MCL1;Pool=2;PURPOSE=CNV
chr4 1801108 1801235 v2FGFR3_3.11.182 . GENE_ID=FGFR3;Pool=2;PURPOSE=CNV
chr4 1801486 1801615 v2FGFR3_3.11.202 . GENE_ID=FGFR3;Pool=1;PURPOSE=CNV
chrX 66833436 66833513 v2AR_region.70.118 . GENE_ID=AR;Pool=1;PURPOSE=CNV
chrX 66866117 66866228 v2AR_region.103.68 . GENE_ID=AR;Pool=2;PURPOSE=CNV
chrX 66871579 66871692 v2AR_region.108.32 . GENE_ID=AR;Pool=1;PURPOSE=CNV
Note: field 1 goes from chr1..chrX
query (a somewhat standard VCF file)
1 760912 . C T 21408 PASS . GT:DP:GQ:PL 1/1:623:99:21408,1673,0
1 766105 . T A 11865 PASS . GT:DP:GQ:PL 1/1:618:99:11865,1025,0
1 767780 . G A 15278 PASS . GT:DP:GQ:PL 1/1:512:99:15352,1274,74
1 150547747 . G A 9840 PASS . GT:DP:GQ:PL 0/1:645:99:9840,0,9051
1 204506107 . C T 22929 PASS . GT:DP:GQ:PL 1/1:636:99:22929,1801,0
1 204508549 . T G 22125 PASS . GT:DP:GQ:PL 1/1:638:99:22125,1757,0
2 2765262 . A G 22308 PASS . GT:DP:GQ:PL 1/1:678:99:22308,1854,0
2 2765887 . C T 9355 PASS . GT:DP:GQ:PL 0/1:649:99:9355,0,9235
2 25463483 . G A 31041 PASS . GT:DP:GQ:PL 1/1:936:99:31041,2422,0
2 212578379 . TA T 5355 PASS . GT:DP:GQ:PL 0/1:500:99:5355,0,3249
3 178881270 . T G 10012 PASS . GT:DP:GQ:PL 0/1:632:99:10012,0,7852
3 182673196 . C T 31170 PASS . GT:DP:GQ:PL 1/1:896:99:31170,2483,0
4 1801511 . C T 12218 PASS . GT:DP:GQ:PL 0/1:885:99:12218,0,11568
4 55097835 . G C 7259 PASS . GT:DP:GQ:PL 0/1:512:99:7259,0,7099
4 55152040 . C T 15866 PASS . GT:DP:GQ:PL 0/1:1060:99:15866,0,14953
X 152017752 . G A 9786 PASS . GT:DP:GQ:PL 0/1:735:99:9786,0,11870
X 152018832 . T G 12281 PASS . GT:DP:GQ:PL 0/1:924:99:12281,0,13971
X 152019715 . A G 10128 PASS . GT:DP:GQ:PL 0/1:689:99:10128,0,9802
Note: there are several leading lines that comprise the header and start with a '#' char.
I'm trying to write a script that will use the first two fields of the query file to see if the coordinates fall between the second and third fields of the regions file. I've coded it like this:
#!/usr/bin/perl
use warnings;
use strict;
use Data::Dump;

my $bed        = shift;
my $query_file = shift;

my %regions;
open( my $region_fh, "<", $bed ) || die "Can not open the input regions BED file: $!";
while (<$region_fh>) {
    next if /track/;
    my @line = split;
    $line[0] =~ s/chr//;    # need to strip off 'chr' or it won't match the query file
    my ( $gene, $pool, $purpose ) = $line[5] =~ /GENE_ID=(\w+);(Pool=\d+);PURPOSE=(.*)$/;
    @{ $regions{ $line[3] } } = ( @line[ 0 .. 4 ], $gene, $pool, $purpose );
}
close $region_fh;

my ( @header, @results );
open( my $query_fh, "<", $query_file ) || die "Can not open the query file: $!";
while (<$query_fh>) {
    if (/^#/) {
        push( @header, $_ );
        next;
    }
    my @fields = split;
    for my $amp ( keys %regions ) {
        if (   $fields[0] eq $regions{$amp}->[0]
            && $fields[1] >= $regions{$amp}->[1]
            && $fields[1] <= $regions{$amp}->[2] )
        {
            $fields[2] = $regions{$amp}->[5];    # add gene name to VCF file
            push( @results, join( "\t", @fields ) );
        }
    }
}
close $query_fh;
The issue is that the query file is ~3.25 million lines long, and the regions file is about 2500 lines long. So, running this takes a very long time (I quit after about 20 minutes of waiting).
I think my overall logic is correct (hopefully!), and I'm wondering if there is a way to optimize how the data is processed to speed things up. I think the problem is that I end up checking all 2500 regions for each of the 3.25 million query lines. Can anyone offer any advice on how to revise my algorithm to process these data more efficiently?
Edit: Added a larger sample dataset, which should show some positives this time.
There are two approaches that I can think of. The first is to change the keys of %regions to the chromosome names, with the values being a list of all the start, end, and gene ID values for that chromosome, sorted by the start value.
With your new data the hash would look like this:
(
chr1 => [
[150547262, 150547338, "MCL1"],
[150547417, 150547537, "MCL1"],
[150547679, 150547797, "MCL1"],
[150547866, 150547951, "MCL1"],
[150548008, 150548096, "MCL1"],
],
chr4 => [
[1801108, 1801235, "FGFR3"],
[1801486, 1801615, "FGFR3"]
],
chrX => [
[66833436, 66833513, "AR"],
[66866117, 66866228, "AR"],
[66871579, 66871692, "AR"],
],
)
This way the chromosome name gives instant access to the right part of the hash instead of having to search through every entry each time, and the sorted start values allow a binary search.
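A rough sketch of that first approach (illustrative only; it uses a simple early-exit scan rather than a full binary search, and gene_for is a made-up helper name):
my %regions;
while (<$region_fh>) {
    next if /track/;
    my @line = split;
    my ($gene) = $line[5] =~ /GENE_ID=(\w+)/;
    push @{ $regions{ $line[0] } }, [ @line[ 1, 2 ], $gene ];   # keyed by chr1, chr4, chrX, ...
}
close $region_fh;

# sort each chromosome's ranges by start position
@$_ = sort { $a->[0] <=> $b->[0] } @$_ for values %regions;

# look up the gene for one query line; only this chromosome's ranges are scanned
sub gene_for {
    my ( $chrom, $pos ) = @_;    # $chrom as it appears in the VCF, e.g. "1" or "X"
    for my $range ( @{ $regions{"chr$chrom"} || [] } ) {
        last if $range->[0] > $pos;            # sorted by start, so nothing later can match
        return $range->[2] if $pos <= $range->[1];
    }
    return;
}

# in the query loop, something like:
# if ( defined( my $gene = gene_for( $fields[0], $fields[1] ) ) ) { $fields[2] = $gene; ... }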
The other possibility is to write the whole of the regions file to an SQLite temporary in-memory database. Once the data is stored and indexed, looking up a gene ID for a given chromosome and position will be pretty fast.
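Roughly, with DBI and DBD::SQLite (again a sketch; the table and column names are made up for illustration):
use DBI;

my $dbh = DBI->connect( "dbi:SQLite:dbname=:memory:", "", "", { RaiseError => 1 } );
$dbh->do("CREATE TABLE regions (chrom TEXT, start INTEGER, stop INTEGER, gene TEXT)");
$dbh->do("CREATE INDEX regions_idx ON regions (chrom, start, stop)");

my $insert = $dbh->prepare("INSERT INTO regions VALUES (?, ?, ?, ?)");
# for each BED line: $insert->execute( $line[0], $line[1], $line[2], $gene );

my $lookup = $dbh->prepare(
    "SELECT gene FROM regions WHERE chrom = ? AND start <= ? AND stop >= ?"
);
# for each query line:
# $lookup->execute( "chr$fields[0]", $fields[1], $fields[1] );
# my ($gene) = $lookup->fetchrow_array;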

If/Else statement is not working when trying to grab address list

I am trying to run an if/else statement on the following URL:
http://nominatim.openstreetmap.org/search.php?q=MK3+5JE&viewbox=-147.13%2C72.78%2C147.13%2C-55.67
and
http://nominatim.openstreetmap.org/search.php?q=MK1+1AS&viewbox=-147.13%2C72.78%2C147.13%2C-55.67
My code is below:
; Collect results 1
Sleep 1000
Addr1 := IE.document.getElementsByClassName("name")[0].innerHTML
If Substr(Addr1, 1, 2) = "MK"
{
StringSplit, AddrNew, Addr1, `,
StringTrimLeft, AddrNew3, AddrNew3, 1
Addr1 := AddrNew2 . "," . AddrNew3 . "," . PostCode
MsgBox, %Addr1%
}
Else If Substr(Addr1, 1, 2) <> "MK"
{
StringSplit, AddrNew, Addr1, `,
StringTrimLeft, AddrNew2, AddrNew2, 1
Addr1 := AddrNew1 . "," . AddrNew2 . "," . PostCode
MsgBox, %Addr1%
}
; Collect results 2
Sleep 1000
Addr := IE.document.getElementsByClassName("name")[2].innerHTML
If Substr(Addr, 1, 2) = "MK"
{
StringSplit, AddrNew, Addr2, `,
StringTrimLeft, AddrNew3, AddrNew3, 1
Addr := AddrNew2 . "," . AddrNew3 . "," . PostCode
MsgBox, %Addr%
}
Else If Substr(Addr, 1, 2) <> "MK"
{
StringSplit, AddrNew, Addr2, `,
StringTrimLeft, AddrNew2, AddrNew2, 1
Addr := AddrNew1 . "," . AddrNew2 . "," . PostCode
MsgBox, %Addr%
}
Each time I try grabbing the data, the output comes out wrong, when the correct output should be:
Saint Andrew's Road, Far Bletchley
Buckingham Road, Far Bletchley
Mount Farm, Milton Keynes
Dawson Road, Mount Farm
Any idea what is causing this issue?
Figured it out - I was not keeping track of my variables...
; Collect results 1
Sleep 1000
Addr1 := IE.document.getElementsByClassName("name")[0].innertext
String_Object := StrSplit(addr1, "`,")
If (Substr(Addr1, 1, 2) = "MK")
{
Addr1 := String_Object[2] . "," . Trim(String_Object[3]) . "," . PostCode
MsgBox, %Addr1%
}
Else
{
Addr1 := String_Object[1] . "," . Trim(String_Object[2]) . "," . PostCode
MsgBox, %Addr1%
}
; Collect results 2
Sleep 1000
Addr2 := IE.document.getElementsByClassName("name")[1].innertext
String_Object := StrSplit(addr2, "`,")
If (Substr(Addr2, 1, 2) = "MK")
{
Addr2 := String_Object[2] . "," . Trim(String_Object[3]) . "," . PostCode
MsgBox, %Addr2%
}
Else
{
Addr2 := String_Object[1] . "," . Trim(String_Object[2]) . "," . PostCode
MsgBox, %Addr2%
}

perltidy formatting multilines

I'm trying to get perltidy to format an if statement like this:
if ($self->image eq $_->[1]
and $self->extension eq $_->[2]
and $self->location eq $_->[3]
and $self->modified eq $_->[4]
and $self->accessed eq $_->[5]) {
but no matter what I try, it insists on formatting it like this:
if ( $self->image eq $_->[1]
and $self->extension eq $_->[2]
and $self->location eq $_->[3]
and $self->modified eq $_->[4]
and $self->accessed eq $_->[5]) {
Also, is there any way to get the last line of this block:
$dbh->do("INSERT INTO image VALUES(NULL, "
. $dbh->quote($self->image) . ", "
. $dbh->quote($self->extension) . ", "
. $dbh->quote($self->location) . ","
. $dbh->quote($self->modified) . ","
. $dbh->quote($self->accessed)
. ")");
to jump up to the previous line like the other lines:
$dbh->do("INSERT INTO image VALUES(NULL, "
. $dbh->quote($self->image) . ", "
. $dbh->quote($self->extension) . ", "
. $dbh->quote($self->location) . ","
. $dbh->quote($self->modified) . ","
. $dbh->quote($self->accessed) . ")");
Here is what I'm currently doing:
perltidy -ce -et=4 -l=100 -pt=2 -msc=1 -bar -ci=0 reporter.pm
Thanks.
I don't have much to offer on the 1st question, but for the 2nd, have you considered refactoring it to use placeholders? It would probably format up better, automatically do the quoting for you, and give you (and the users of your module) a healthy barrier against SQL injection problems.
my $sth = $dbh->prepare('INSERT INTO image VALUES(NULL, ?, ?, ?, ?, ?)');
$sth->execute(
$self->image, $self->extension, $self->location,
$self->modified, $self->accessed
);
I've also found format skipping (-fs) to protect a specific segment of code from perltidy. I'd put an example here, but the site seems to do a hatchet job on it...
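For the record, and from memory (so treat this as a sketch): format skipping is controlled by marker comments, and everything between a #<<< line and a #>>> line is passed through untouched, e.g.:
#<<<  perltidy leaves this block exactly as written
if ($self->image eq $_->[1]
    and $self->extension eq $_->[2]
    and $self->location eq $_->[3]
    and $self->modified eq $_->[4]
    and $self->accessed eq $_->[5]) {
    # ... body ...
}
#>>>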