I have a tab-delimited file that looks like this:
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
53_234 78 . CCG GAT 999 . . GT:PL:DP:DPR
45_569 5 . TCCG GTTA 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
I am trying to use sed to delete all the lines that contain more than one letter in the 4th field (in the case above, lines 7 and 8 from the top). I have tried the following regular expression, but there must be a glitch somewhere that I cannot find:
sed '5,${;/\([^.]*\t\)\{3\}\[A-Z][A-Z]\+\t/d;}' input.vcf>new.vcf
The syntax is as follows:
5,$ #start at line 5 until the end of the file ($)
([^.]*\t) #matching group is any single character followed by a zero or more characters followed by a tab.
{3} #previous block repeated 3 times (presumably for the 4th field)
[A-Z][A-Z]+\t #followed by any string of two letters or more followed by a tab.
Unfortunately, this doesn't work, but I know I am close to making it work. Any hints or help will make this a great teaching moment.
Thanks.
If awk is okay for you, you can use the command below:
awk '(FNR<5){print} (FNR>=5)&&length($4)<=1' input.vcf
By default awk splits fields on whitespace (spaces and tabs); to split on tabs only, put -F"\t" right after awk, for instance awk -F"\t" ....
(FNR<5){print}: FNR is the record (line) number within the current file; when it is less than 5, the whole line is printed.
(FNR>=5) && length($4)<=1 handles the remaining lines and keeps only those whose 4th field is at most one character long.
Output:
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
You can redirect the output to an output file.
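For comparison, the same filter can be sketched in Python. This is only an illustration of the logic, not part of the awk answer: the function name filter_vcf and the header_lines parameter are made up here, and the sample rows are assumed to be tab-separated as in the real file.

```python
def filter_vcf(lines, header_lines=4):
    """Keep the first header_lines lines, then only rows whose 4th
    tab-separated field (REF) is at most one character long."""
    kept = []
    for number, line in enumerate(lines, start=1):
        if number <= header_lines:
            kept.append(line)  # mirrors (FNR<5){print}
            continue
        fields = line.rstrip("\n").split("\t")
        # mirrors (FNR>=5) && length($4)<=1; a row with fewer than
        # 4 fields is kept, because awk's $4 would be empty there
        if len(fields) < 4 or len(fields[3]) <= 1:
            kept.append(line)
    return kept


sample = [
    '##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">',
    '##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">',
    '##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">',
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT",
    "53_344\t2\t.\tC\tG\t999\t.\t.\tGT:PL:DP:DPR",
    "53_234\t78\t.\tCCG\tGAT\t999\t.\t.\tGT:PL:DP:DPR",
    "45_569\t5\t.\tTCCG\tGTTA\t999\t.\t.\tGT:PL:DP:DPR",
    "53_569\t89\t.\tT\tG\t999\t.\t.\tGT:PL:DP:DPR",
]

for line in filter_vcf(sample):
    print(line)
```

Counting header lines copies what the awk command does; testing whether a line starts with # would be an alternative if the number of header lines varies between files.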
$ awk 'NR<5 || $4~/^.$/' file
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
Fixed your sed filter (it took me a while; I almost went crazy over it):
5,${/^\([^\t]\+\t\)\{3\}[A-Z][A-Z]\+\t/d}
Your errors:
[^.]*: this matches everything but a dot.
Thanks to Ed, now I know that. I thought the dot had to be escaped, but that does not apply between brackets. Anyhow, [^.] can also match a tab character, so one group could swallow 2 or 3 fields instead of one (regexes are greedy by default), and the pattern would no longer line up with the 4th field.
\[A-Z][A-Z]: bad backslash. \[ matches a literal [, so the pattern looked for the literal text [A-Z] followed by uppercase letters, which never occurs in your data.
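The two points can be checked outside sed with a small Python sketch. Note that this uses Python's re syntax rather than sed's basic regular expressions, so the patterns are translations, and the constant names and sample strings are made up for the illustration.

```python
import re

# Python spelling of the corrected sed pattern: three non-empty
# tab-terminated fields, then two or more uppercase letters and a tab.
FIXED = re.compile(r"^(?:[^\t]+\t){3}[A-Z][A-Z]+\t")

# Python spelling of the tail of the original pattern: the escaped
# bracket makes it look for the literal characters "[A-Z]".
STRAY_BACKSLASH = re.compile(r"\[A-Z][A-Z]+\t")

single = "53_344\t2\t.\tC\tG\t999\t.\t.\tGT:PL:DP:DPR"
multi = "53_234\t78\t.\tCCG\tGAT\t999\t.\t.\tGT:PL:DP:DPR"

print(bool(FIXED.search(single)))               # False: REF is one letter
print(bool(FIXED.search(multi)))                # True: REF is CCG
print(bool(STRAY_BACKSLASH.search(multi)))      # False: no literal "[A-Z]"
print(bool(STRAY_BACKSLASH.search("[A-Z]CG\t")))  # True: literal text present
```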
test:
$ sed '5,${/^\([^\t]\+\t\)\{3\}[A-Z][A-Z]\+\t/d}' foo.Txt
##INFO=<ID=AC1,Number=1,Type=Float,Description="Max-likelihood">
##INFO=<ID=MQ,Number=1,Type=Integer,Description="Root-mean-square">
##INFO=<ID=FQ,Number=1,Type=Float,Description="Phred probability">
#CHROM POS ID REF ALT QUAL FILTER INFO FORMAT
53_344 2 . C G 999 . . GT:PL:DP:DPR
6_56775 67 . T A 999 . . GT:PL:DP:DPR
3_67687 2 . T G 999 . . GT:PL:DP:DPR
53_569 89 . T G 999 . . GT:PL:DP:DPR
conclusion: to process delimited fields, awk is better :)
I have a table in Postgres 9.2 with 38 variables and I need a selection of the "best" results.
What I need is:
for each distinct combination of var1 and var2:
the min of var3, plus var4 from that same row
the max of var5 (if more than one row has the max, the one with the min var3), plus var6 to var12 from that same row
var13 picked by priority (3 first, 6 second, 0 last), plus var14-var18 from that same row
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18 ...
1 1 2 a 2 a . . . . . . 0 . . . . .
1 1 1 b 1 b . . . . . . 3 . . . . .
1 2 4 c 3 c . . . . . . 3 . . . . .
1 2 3 d 4 d . . . . . . 6 . . . . .
2 1 1 a 3 a . . . . . . 3 . . . . .
3 1 3 a 2 a . . . . . . 6 . . . . .
3 1 2 b 4 b . . . . . . 0 . . . . .
4 1 3 a 4 a . . . . . . 3 . . . . .
4 1 6 b 2 b . . . . . . 0 . . . . .
4 2 2 c 2 c . . . . . . 0 . . . . .
4 3 5 d 3 d . . . . . . 3 . . . . .
4 3 4 e 4 e . . . . . . 6 . . . . .
4 3 7 f 4 f . . . . . . 3 . . . . .
...
The result should be:
v1 v2 v3 v4 v5 v6 v7 v8 v9 v10 v11 v12 v13 v14 v15 v16 v17 v18
1 1 1 b 2 a . . . . . . 3 . . . . .
1 2 3 d 4 d . . . . . . 3 . . . . .
2 1 1 a 3 a . . . . . . 3 . . . . .
3 1 2 b 4 b . . . . . . 6 . . . . .
4 1 3 a 4 a . . . . . . 3 . . . . .
4 2 2 c 2 c . . . . . . 0 . . . . .
4 3 4 e 4 e . . . . . . 3 . . . . .
...
here is also an image of the table where the colored fields show what should be selected:
Hope this makes sense.
EDIT:
Got a pointer in another post to provide CREATE and INSERT for the table.
create table parent (
v1 character varying,
v2 character varying,
v3 character varying,
v4 character varying,
v5 character varying,
v6 character varying,
v7 character varying,
v8 character varying,
v9 character varying,
v10 character varying,
v11 character varying,
v12 character varying,
v13 character varying,
v14 character varying,
v15 character varying,
v16 character varying,
v17 character varying,
v18 character varying
);
insert into parent values('1','1','2','a','2','a','x1','x1','x1','x1','x1','x1','0','x1','x1','x1','x1','x1');
insert into parent values('1','1','1','b','1','b','x2','x2','x2','x2','x2','x2','3','x2','x2','x2','x2','x2');
insert into parent values('1','2','4','c','3','c','x3','x3','x3','x3','x3','x3','3','x3','x3','x3','x3','x3');
insert into parent values('1','2','3','d','4','d','x4','x4','x4','x4','x4','x4','6','x4','x4','x4','x4','x4');
insert into parent values('2','1','1','a','3','a','x1','x1','x1','x1','x1','x1','3','x1','x1','x1','x1','x1');
insert into parent values('3','1','3','a','2','a','x1','x1','x1','x1','x1','x1','6','x1','x1','x1','x1','x1');
insert into parent values('3','1','2','b','4','b','x2','x2','x2','x2','x2','x2','0','x2','x2','x2','x2','x2');
insert into parent values('4','1','3','a','4','a','x1','x1','x1','x1','x1','x1','3','x1','x1','x1','x1','x1');
insert into parent values('4','1','6','b','2','b','x2','x2','x2','x2','x2','x2','0','x2','x2','x2','x2','x2');
insert into parent values('4','2','2','c','2','c','x3','x3','x3','x3','x3','x3','0','x3','x3','x3','x3','x3');
insert into parent values('4','3','5','d','3','d','x4','x4','x4','x4','x4','x4','3','x4','x4','x4','x4','x4');
insert into parent values('4','3','4','e','4','e','x5','x5','x5','x5','x5','x5','6','x5','x5','x5','x5','x5');
insert into parent values('4','3','7','f','4','f','x6','x6','x6','x6','x6','x6','3','x6','x6','x6','x6','x6');
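To make the intended selection unambiguous, here is a minimal Python sketch of the three rules applied per (v1, v2) group. It is not a Postgres query, only a reference for what a query should return. Several things are assumptions on my part: v3 and v5 are compared as integers even though the columns are character varying, ties on v13 are broken by taking the first row found, only the columns that drive the selection are kept, and the names ROWS, V13_RANK and best_rows are made up.

```python
from itertools import groupby

# (v1, v2, v3, v4, v5, v6, v13) taken from the INSERT statements;
# the remaining columns simply travel along with the chosen row.
ROWS = [
    ("1", "1", "2", "a", "2", "a", "0"),
    ("1", "1", "1", "b", "1", "b", "3"),
    ("1", "2", "4", "c", "3", "c", "3"),
    ("1", "2", "3", "d", "4", "d", "6"),
    ("2", "1", "1", "a", "3", "a", "3"),
    ("3", "1", "3", "a", "2", "a", "6"),
    ("3", "1", "2", "b", "4", "b", "0"),
    ("4", "1", "3", "a", "4", "a", "3"),
    ("4", "1", "6", "b", "2", "b", "0"),
    ("4", "2", "2", "c", "2", "c", "0"),
    ("4", "3", "5", "d", "3", "d", "3"),
    ("4", "3", "4", "e", "4", "e", "6"),
    ("4", "3", "7", "f", "4", "f", "3"),
]

# Priority of v13 values: 3 first, 6 second, 0 last.
V13_RANK = {"3": 0, "6": 1, "0": 2}


def best_rows(rows):
    """Return one (v1, v2, v3, v4, v5, v6, v13) tuple per (v1, v2) group."""
    def key(row):
        return (row[0], row[1])

    result = []
    for (v1, v2), group in groupby(sorted(rows, key=key), key=key):
        group = list(group)
        # Rule 1: row with the smallest v3 supplies v3 and v4.
        low_v3 = min(group, key=lambda r: int(r[2]))
        # Rule 2: row with the largest v5, ties broken by smallest v3,
        # supplies v5 and v6 (and v7..v12 in the full table).
        high_v5 = min(group, key=lambda r: (-int(r[4]), int(r[2])))
        # Rule 3: row with the best-ranked v13 supplies v13 (and v14..v18).
        top_v13 = min(group, key=lambda r: V13_RANK[r[6]])
        result.append(
            (v1, v2, low_v3[2], low_v3[3], high_v5[4], high_v5[5], top_v13[6])
        )
    return result


for row in best_rows(ROWS):
    print(*row)
```

On the sample data this reproduces the v1 to v6 and v13 columns of the expected result shown above.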
I'm trying to get perltidy to format an if statement like this:
if ($self->image eq $_->[1]
and $self->extension eq $_->[2]
and $self->location eq $_->[3]
and $self->modified eq $_->[4]
and $self->accessed eq $_->[5]) {
but no matter what I try, it insists on formatting it like this:
if ( $self->image eq $_->[1]
and $self->extension eq $_->[2]
and $self->location eq $_->[3]
and $self->modified eq $_->[4]
and $self->accessed eq $_->[5]) {
Also, is there any way to get the last line of this block:
$dbh->do("INSERT INTO image VALUES(NULL, "
. $dbh->quote($self->image) . ", "
. $dbh->quote($self->extension) . ", "
. $dbh->quote($self->location) . ","
. $dbh->quote($self->modified) . ","
. $dbh->quote($self->accessed)
. ")");
to jump up to the previous line like the other lines:
$dbh->do("INSERT INTO image VALUES(NULL, "
. $dbh->quote($self->image) . ", "
. $dbh->quote($self->extension) . ", "
. $dbh->quote($self->location) . ","
. $dbh->quote($self->modified) . ","
. $dbh->quote($self->accessed) . ")");
Here is what I'm currently doing:
perltidy -ce -et=4 -l=100 -pt=2 -msc=1 -bar -ci=0 reporter.pm
Thanks.
I don't have much to offer on the 1st question, but for the 2nd, have you considered refactoring it to use placeholders? It would probably format better, do the quoting for you automatically, and give you (and the users of your module) a healthy barrier against SQL injection problems.
my $sth = $dbh->prepare('INSERT INTO image VALUES(NULL, ?, ?, ?, ?, ?)');
$sth->execute(
$self->image, $self->extension, $self->location,
$self->modified, $self->accessed
);
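The effect of placeholders is easy to see in a small self-contained sketch. It uses Python's sqlite3 module rather than Perl DBI, purely so it can run without a database server; the table layout and the sample values are assumptions made up for the illustration, not the real image table.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical layout: an auto-assigned id followed by five text columns.
conn.execute(
    "CREATE TABLE image (id INTEGER PRIMARY KEY, image TEXT, extension TEXT,"
    " location TEXT, modified TEXT, accessed TEXT)"
)

# Values containing quote characters need no manual quoting:
# the driver binds them to the ? placeholders as data, not as SQL.
row = ("o'brien", "png", "/tmp/it's here", "2010-12-31", "2010-12-31")
conn.execute("INSERT INTO image VALUES (NULL, ?, ?, ?, ?, ?)", row)

stored = conn.execute(
    "SELECT image, extension, location, modified, accessed FROM image"
).fetchone()
print(stored == row)
```

The statement text stays one short line regardless of how many values are bound, which also sidesteps the formatting question.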
I've also found format skipping (-fs), which protects a specific segment of code from perltidy: the segment is wrapped between a #<<< comment line and a #>>> comment line. I'd put an example here but the site seems to do a hatchet job on it...
When I use LWP::UserAgent to retrieve content encoded in UTF-8 it seems LWP::UserAgent doesn't handle the encoding correctly.
Here's the output after setting the Command Prompt window to Unicode with the command chcp 65001. Note that this initially gives the appearance that all is well, but I think it's just the shell reassembling bytes and decoding UTF-8. From the other output you can see that Perl itself is not handling wide characters correctly.
C:\>perl getutf8.pl
======================================================================
HTTP/1.1 200 OK
Connection: close
Date: Fri, 31 Dec 2010 19:24:04 GMT
Accept-Ranges: bytes
Server: Apache/2.2.8 (Win32) PHP/5.2.6
Content-Length: 75
Content-Type: application/xml; charset=utf-8
Last-Modified: Fri, 31 Dec 2010 19:20:18 GMT
Client-Date: Fri, 31 Dec 2010 19:24:04 GMT
Client-Peer: 127.0.0.1:80
Client-Response-Num: 1
<?xml version="1.0" encoding="UTF-8"?>
<name>Budějovický Budvar</name>
======================================================================
response content length is 33
....v....1....v....2....v....3....v....4
<name>Budějovický Budvar</name>
. . . . v . . . . 1 . . . . v . . . . 2 . . . . v . . . . 3 . . . .
3c6e616d653e427564c49b6a6f7669636bc3bd204275647661723c2f6e616d653e
< n a m e > B u d � � j o v i c k � � B u d v a r < / n a m e >
Above you can see the payload length is 31 characters but Perl thinks it is 33.
For confirmation, in the hex, we can see that the UTF-8 sequences c49b and c3bd are being interpreted as four separate characters and not as two Unicode characters.
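The 33 versus 31 discrepancy is just bytes versus characters, which can be reproduced from the hex dump above. The sketch below is in Python rather than Perl and only illustrates the counting; the variable names are arbitrary.

```python
# Hex dump copied from the program output above.
raw = bytes.fromhex(
    "3c6e616d653e427564c49b6a6f7669636bc3bd204275647661723c2f6e616d653e"
)

# Undecoded, each UTF-8 byte counts as one unit.
print(len(raw))

# Decoded, c49b and c3bd each collapse into a single character.
text = raw.decode("utf-8")
print(len(text))
print(text)
```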
Here's the code
#!perl
use strict;
use warnings;
use LWP::UserAgent;
my $ua = LWP::UserAgent->new();
my $response = $ua->get('http://localhost/Bud.xml');
if (! $response->is_success) { die $response->status_line; }
print '='x70,"\n",$response->as_string(), '='x70,"\n";
my $r = $response->decoded_content((charset => 'UTF-8'));
$/ = "\x0d\x0a"; # seems to be \x0a otherwise!
chomp($r);
# Remove any xml prologue
$r =~ s/^<\?.*\?>\x0d\x0a//;
print "Response content length is ", length($r), "\n\n";
print "....v....1....v....2....v....3....v....4\n";
print $r,"\n";
print ". . . . v . . . . 1 . . . . v . . . . 2 . . . . v . . . . 3 . . . . \n";
print unpack("H*", $r), "\n";
print join(" ", split("", $r)), "\n";
Note that Bud.xml is UTF-8 encoded without a BOM.
How can I persuade LWP::UserAgent to do the right thing?
P.S. Ultimately I want to translate the Unicode data into an ASCII encoding, even if it means replacing each non-ASCII character with one question mark or other marker.
Update 1
I have accepted Ysth's "upgrade" answer because I know it is the right thing to do when possible. However, there is a workaround to fix up the data into a well-formed Perl Unicode string:
use Encode qw(decode);
$r = decode("utf8", $r);
Update 2
My data gets fed to a non-Perl application that displays the data using Code Page 437 to Putty/Reflection/Teraterm terminals at many locations. The app is currently displaying something like:
Bud├ä┬øjovick├â┬¢ Budvar
I am going to use ($r = decode("UTF-8", $r)) =~ s/[\x80-\x{FFFF}]/\xFE/g; to get the app to display:
Bud■jovick■ Budvar
Moving away from CP437 would be a major job, so that is not going to happen in the short to medium term.
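For readers who want to try the substitution without the terminal setup, here is the same idea as a Python sketch. It is a translation of the Perl one-liner, not the code I actually run: the function name mask_non_ascii is made up, and like the Perl character class it only covers code points up to U+FFFF.

```python
import re


def mask_non_ascii(raw: bytes) -> bytes:
    """Decode UTF-8, then replace each non-ASCII character with byte 0xFE."""
    text = raw.decode("utf-8")
    masked = re.sub("[\x80-\uffff]", "\xfe", text)
    # Everything left is ASCII or U+00FE, so Latin-1 yields one byte each.
    return masked.encode("latin-1")


payload = "Budějovický Budvar".encode("utf-8")
masked = mask_non_ascii(payload)

# Decoding as CP437 shows what a CP437 terminal would display for 0xFE.
print(masked.decode("cp437"))
```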
Update 3
CPAN has some interesting Unicode modules such as:
Text::Unidecode
Unicode::Map8
Unicode::Map
Unicode::Escape
Unicode::Transliterate
Text::Unidecode translated "Budějovický Budvar" into "Budejovicky Budvar" - which didn't seem to me a particularly impressive attempt at a phonetic transliteration but then I don't speak Czech. English speakers might prefer it to "Bud■jovick■ Budvar" though.
Upgrade to a newer libwww-perl. The old version you are using only honored the charset argument to decoded_content for text/* content types; the newer version also does so for application/xml or anything ending in +xml.