Extract reads of a designated length from a BAM/SAM file - perl

I am a bit new to Perl and wish to use it to extract reads of a specific length from my BAM (alignment) file.
The BAM file contains reads whose lengths range from 19 to 29 nt.
Here is an example of the first two reads:
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:29884:1078 0 3R 6234066 42 22M * 0 0 TCACTGGGCTTTGTTTATCTCA FF:FFFF,FFFFFFFF:FFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:22
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:1777:1094 16 4 1313373 1 24M * 0 0 TCGCATTCTTATTGATTTTCCTTT FFFFFFF,FFFFFFFFFFFFFFFF AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:24
I want to extract only those, which are, let's say, 21 nt in length.
I try to do this with the following code:
my $string = <STDIN>;
$length = samtools view ./file.bam | head | perl -F'\t' -lane'length #F[10]';
if ($length == 21){
print($string)
}
However, the program does not give any result...
Could anyone please suggest the right way of doing this?

Your question is a bit confusing. Is the code snippet supposed to be a Perl script or a shell script that calls a Perl one-liner?
Assuming that you meant to write a Perl script into which you pipe the output of samtools view:
#!/usr/bin/perl
use strict;
use warnings;
while (<STDIN>) {
    my @fields = split("\t", $_);
    # debugging, just to see what field is extracted...
    print "'$fields[10]' ", length($fields[10]), "\n";
    if (length($fields[10]) == 21) {
        print $_;
    }
}
exit 0;
With your test data in dummy.txt I get:
# this would be "samtools view ./file.bam | head | perl dummy.pl" in your case?
$ cat dummy.txt | perl dummy.pl
'FF:FFFF,FFFFFFFF:FFFFF' 22
'FFFFFFF,FFFFFFFFFFFFFFFF' 24
Your test data doesn't contain a sample with length 21 though, so the if clause is never executed.
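As a quick end-to-end check, a minimal variant of the script can be run against synthetic tab-separated records (the /tmp paths and the 21-nt read below are made up for illustration; this sketch filters on the SEQ column, index 9, which has the same length as the quality column used for debugging above):

```shell
# A minimal version of the filter script from the answer above.
cat > /tmp/filter21.pl <<'PL'
#!/usr/bin/perl
use strict;
use warnings;
while (<STDIN>) {
    chomp;                                      # drop the newline so field lengths are exact
    my @fields = split("\t", $_);
    print "$_\n" if length($fields[9]) == 21;   # index 9 is the SEQ column
}
PL

# Two synthetic SAM-like records: a 21-nt read and a 24-nt read.
perl -e 'print join("\t", "r1", 0, "3R", 100, 42, "21M", "*", 0, 0, "A" x 21, "F" x 21), "\n";
         print join("\t", "r2", 16, "4", 200, 1, "24M", "*", 0, 0, "A" x 24, "F" x 24), "\n";' > /tmp/reads.sam

# In real use this would be: samtools view ./file.bam | perl /tmp/filter21.pl
perl /tmp/filter21.pl < /tmp/reads.sam
```

Only the r1 record should come out.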

Note that the 10th field in your sample input has a length of either 22 or 24, not 21. Also, the syntax that you use is wrong. Here is a Perl one-liner to match records whose field has length 22.
$ cat pkom.txt
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:29884:1078 0 3R 6234066 42 22M * 0 0 TCACTGGGCTTTGTTTATCTCA FF:FFFF,FFFFFFFF:FFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:22
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:1777:1094 16 4 1313373 1 24M * 0 0 TCGCATTCTTATTGATTTTCCTTT FFFFFFF,FFFFFFFFFFFFFFFF AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:24
$ perl -lane ' print if length($F[9])==22 ' pkom.txt
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:29884:1078 0 3R 6234066 42 22M * 0 0 TCACTGGGCTTTGTTTATCTCA FF:FFFF,FFFFFFFF:FFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:22
$

Related

select specific columns from complex lines

I have a file that contains lines in the following format. I would like to keep only the first column and the column containing a string of the form NC_XXXX.1
484-2117 16 gi|9634679|ref|NC_002188.1| 188705 23 21M * 0 0 CGCGTACCAAAAGTAATAATT IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0G20 YT:Z:UU
787-1087 16 gi|21844535|ref|NC_004068.1| 7006 23 20M * 0 0 CTATACAACCTACTACCTCA IIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19T0 YT:Z:UU
.....
....
...
output:
484-2117 NC_002188.1
787-1087 NC_004068.1
Something like this in perl:
#!/usr/bin/env perl
use strict;
use warnings;
while (<DATA>) {
    my ( $id, $nc ) = m/^([\d\-]+).*(NC_[\d\.]+)/;
    print "$id $nc\n";
}
__DATA__
484-2117 16 gi|9634679|ref|NC_002188.1| 188705 23 21M * 0 0 CGCGTACCAAAAGTAATAATT IIIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:0G20 YT:Z:UU
787-1087 16 gi|21844535|ref|NC_004068.1| 7006 23 20M * 0 0 CTATACAACCTACTACCTCA IIIIIIIIIIIIIIIIIIII AS:i:-6 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:19T0 YT:Z:UU
Output:
484-2117 NC_002188.1
787-1087 NC_004068.1
Which reduces to a one liner of:
perl -ne 'm/^([\d\-]+).*(NC_[\d\.]+)/ and print "$1 $2\n"' yourfile
Note - this specifically matches a first column made up of digits and dashes - you could do this with a wider regex match.
awk to the rescue!
$ awk -F' +|\\|' '{for(i=2;i<=NF;i++) if($i ~ /^NC_[0-9.]+$/) {print $1,$i; next}}' file
484-2117 NC_002188.1
787-1087 NC_004068.1
if the space is a tab char, need to add to the delimiter list
$ awk -F' +|\\||\t' ...
With perl:
perl -anE'say "$F[0] ",(split /\|/, $F[2])[3]' file
or awk:
awk -F'\\|| +' '{print $1,$6}' file
Using gnu-awk, the following could be a solution:
awk '{printf "%s %s\n",$1,gensub(/.*(NC_.*\.1).*/,"\\1",1,$0)}' file
Output
484-2117 NC_002188.1
787-1087 NC_004068.1
A more restrictive version would be
awk '{printf "%s %s\n",$1,gensub(/.*(NC_[[:digit:]]*\.1).*/,"\\1",1,$0)}' file
awk -F'[ |]' '{print $1,$10}' file
484-2117 NC_002188.1
787-1087 NC_004068.1

Split a column then sum and push sum onto array in perl

I have a file that looks like this:
LOCUS POS ALIAS VAR TEST P I DESC
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.43 0/1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.295 0/1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.005 1/1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 0.676617 0.005 1/0
I want to split the last field by "/", then sum those numbers, and push another column on with the sum. For example, I would want the output to look like:
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.43 0/1 1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.295 0/1 1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.005 1/1 2
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 0.676617 0.005 1/0 1
I have this code, but it doesn't work:
#! perl -w
my $file1 = shift @ARGV;
my $NVAR=0;
my @vars;
open (IN, $file1) or die "couldn't read file one";
while(<IN>){
    my @L=split;
    next if ($L[0] =~ m/LOCUS/);
    my @counts=split /\//, $L[7];
    foreach (@counts){
        $NVAR=${$_}[0] + ${$_}[1];
    }
    push @vars,[$L[0],$L[1],$L[2],$L[3],$L[4],$L[5],$L[6],$L[7],$NVAR];
}
close IN;
print "LOCUS POS ALIAS NVAR TEST P I DESC SUM\n";
foreach(@vars){
    print "@{$_}\n";
}
Any help is appreciated.
Always include use strict; and use warnings; at the top of EVERY script.
Limit your variables to the smallest scope possible, as declaring $NVAR outside of the while loop introduced a bug. Your summation can be fixed by the following:
my $NVAR = 0;
foreach (@counts){
    #$NVAR=${$_}[0] + ${$_}[1]; <-- this was bad.
    $NVAR += $_;
}
However, this can be solved with a Perl one-liner:
perl -MList::Util=sum -lane 'push @F, sum split "/", $F[-1]; print "@F"' file.txt
Or if you have a header row:
perl -MList::Util=sum -lane '
    push @F, $. == 1 ? "SUM" : sum split "/", $F[-1];
    print "@F"
' file.txt
Note that you can use List::Util's sum in your script as well.

How to extract column data from a multidimensional array in perl

ls -l
-rw-r--r-- 1 angus angus 0 2013-08-16 01:33 copy.pl
-rw-r--r-- 1 angus angus 1931 2013-08-16 08:27 copy.txt
-rw-r--r-- 1 angus angus 492 2013-08-16 03:15 ex.txt
-rw-r--r-- 1 angus angus 25 2013-08-16 09:07 hello.txt
-rw-r--r-- 1 angus angus 98 2013-08-16 09:05 hi.txt
I need only the read/write/access permission data as well as the file name.
#! /usr/bin/perl -w
@list = `ls -l`;
$index = 0;
#print "@list\n";
for(@list){
    ($access) = split(/[\s+]/,$_);
    print "$access\n";
    ($data) = split(/pl+/,$_);
    print "$data";
    @array1 = ($data,$access);
}
print "@array1\n"
I have written this code to extract the read,write,access permission details and the file name corresponding to it.
I couldn't extract the filename which is the last column.
Check perl's stat (http://perldoc.perl.org/functions/stat.html).
It's more robust and efficient than calling the external ls command:
use File::stat;
$sb = stat($filename);
printf "File is %s, size is %s, perm %04o, mtime %s\n",
    $filename, $sb->size, $sb->mode & 07777,
    scalar localtime $sb->mtime;
I think you have an error in line number 8 of your script. You are trying to split the line using the string "pl" as a delimiter which will only match the first line of your input and will not give you what I think you want.
I believe you should just split the whole line on white space and assign just the columns you want (number 1 and 8 in this case).
change your loop for this:
for my $filename (@list){
    chomp($filename);
    my ($access, $data) = (split(/\s+/, $filename))[0, 7]; # use a slice to get only the columns you want
    print "$access $data\n";
}
Note: mpapec's suggestion to use File::stat would be better. I just wanted to let you know why your code was not working.

Grepping an array obtained through Net::Telnet

I'm writing a Munin plugin and I'd like to capture the screen output from a telnet session.
The output of such a session looks as follows:
...
0x00017 0x41b3f340 BPING 0 0 0 0 198 132 330
0x00018 0x41b47340 CHKFAILED 0 0 0 0 198 132 330
0x00026 0x41b4f340 CIP 0 0 0 0 370 264 634
0x0001e 0x41b57340 CONTROL 0 1 0 0 3876 2178 6054
0x01014 0x41b5f340 UNETSRVR 0 0 0 1 296 198 494
0x00037 0x41b67340 ---- 0 0 0 0 198 132 330
0x00000 0x43b67450 ---- 0 0 0 0 0 0 0
0x00000 0x4bb67450 ---- 0 0 0 0 5084 4224 9308
0x00000 0x49367450 ---- 0 0 0 0 14742 4158 18900
-------------------------------------------------------------------------------------------
SUMMARY : 2 40 5 7 4898229 2728176 7626405
This script extracts the screen content into an array (@lines).
#!/usr/bin/perl
use Net::Telnet ();
use strict;
use warnings;
my $t = new Net::Telnet (Timeout => 10);
$t->port(777);
$t->open("192.168.0.1");
$t->buffer_empty;
my @lines = $t->waitfor(match => "m/.* SUMMARY : .* \n/");
my @gagu = grep { "$_" =~ /^.*BPING.*\n/ } @lines;
print @gagu;
Of what type is the array @lines?
Why do I always get the whole content from grep and not a filtered line?
Is the array I got from Net::Telnet different from other arrays?
Yes, I'm new to Perl.
I am not familiar with this module and what it does, but I assume it gives you some sort of return value similar to what you have stated.
If you are getting all the lines in your @gagu array, that could mean either that the data in the @lines array consists of just one string, or that the grep fails.
For example, @lines may contain the single string:
"foo bar baz\nfoo1 bar1 baz1\n";
and not, as you expect
"foo bar baz\n";
"foo1 bar1 baz1\n";
Your grep statement probably works as expected, though you might want to consider:
Not quoting $_, since that serves no purpose.
Not using $_ at all; since it is the default variable, it is not needed (except for clarity).
Not using anchors ^ and \n, because they are redundant.
For example, ^.* matches any string, anywhere. Using it to simply match a string is redundant. Ending the regex with .*\n is redundant, because all it says is "match any character except newline until we find a newline". Assuming you have newlines, it does nothing. Assuming you don't, it gives you a false negative. All you need for this match is /BPING/. So here's what your code might look like:
use Data::Dumper;
my @lines = $t->waitfor(match => "m/ SUMMARY :/");
my @gagu = grep /BPING/, @lines;
print Dumper \@gagu;
If you want to see whitespace printed out visibly, you can use the $Data::Dumper::Useqq variable:
$Data::Dumper::Useqq = 1;
print Dumper \@gagu;
Printing variables is a very good debugging tool.
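If waitfor does hand back the whole screen as one string, splitting it into lines before grepping restores line-level filtering. A sketch on canned data (no live telnet session here):

```shell
perl -e '
    # Stand-in for the buffer a Net::Telnet waitfor might return: one big string.
    my $screen = "0x00017 0x41b3f340 BPING 0 0 0 0 198 132 330\n"
               . "0x00018 0x41b47340 CHKFAILED 0 0 0 0 198 132 330\n";
    my @lines = split /\n/, $screen;      # one element per screen line
    my @gagu  = grep { /BPING/ } @lines;  # grep now filters individual lines
    print "$_\n" for @gagu;
'
```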

delete lines from multiple files using gawk / awk / sed

I have two sets of text files. First set is in AA folder. Second set is in BB folder. The content of ff.txt file from first set(AA folder) is shown below.
Name number marks
john 1 60
maria 2 54
samuel 3 62
ben 4 63
I would like to print the second column (number) from this file if marks>60. The output would be 3 and 4. Next, read the ff.txt file in the BB folder and delete the lines containing the numbers 3 and 4.
Files in the BB folder look like this. The second column is the number.
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
I used the following code. This code works perfectly for one file.
gawk 'BEGIN {getline} $3>60{print $2}' AA/ff.txt | while read number; do gawk -v number=$number '$2 != number' BB/ff.txt > /tmp/ff.txt; mv /tmp/ff.txt BB/ff.txt; done
But when I run this code with multiple files, I get an error.
gawk 'BEGIN {getline} $3>60{print $2}' AA/*.txt | while read number; do gawk -v number=$number '$2 != number' BB/*.txt > /tmp/*.txt; mv /tmp/*.txt BB/*.txt; done
error:-
mv: target `BB/kk.txt' is not a directory
I asked this question two days ago. Please help me solve this error.
This creates an index of all files in folder AA and checks against all files in folder BB:
cat AA/*.txt | awk 'FNR==NR { if ($3 > 60) array[$2]; next } !($2 in array)' - BB/*.txt
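On trimmed copies of the question's sample files, the index approach behaves like this (the /tmp paths are placeholders):

```shell
mkdir -p /tmp/AA /tmp/BB
cat > /tmp/AA/ff.txt <<'EOF'
Name number marks
john 1 60
maria 2 54
samuel 3 62
ben 4 63
EOF
cat > /tmp/BB/ff.txt <<'EOF'
marks 1 11.824 24.015 41.220 1.00 13.65
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
EOF
# Numbers 3 and 4 exceed 60 in AA, so those rows disappear from BB's output.
cat /tmp/AA/*.txt | awk 'FNR==NR { if ($3 > 60) array[$2]; next } !($2 in array)' - /tmp/BB/*.txt
```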
This compares two individual files, assuming they have the same name in folders AA and BB:
ls AA/*.txt | sed "s%AA/\(.*\)%awk 'FNR==NR { if (\$3 > 60) array[\$2]; next } !(\$2 in array)' & BB/\1 %" | sh
HTH
EDIT
This should help :-)
ls AA/*.txt | sed "s%AA/\(.*\)%awk 'FNR==NR { if (\$3 > 60) array[\$2]; next } !(\$2 in array)' & BB/\1 > \1_tmp \&\& mv \1_tmp BB/\1 %" | sh
> /tmp/*.txt and mv /tmp/*.txt BB/*.txt are wrong: the shell expands the globs before the command runs, so the redirection target is ambiguous and mv's last argument must be a single directory.
For single file
awk 'NR>1 && $3>60{print $2}' AA/ff.txt > idx.txt
awk 'NR==FNR{a[$0]; next}; !($2 in a)' idx.txt BB/ff.txt
For multiple files
awk 'FNR>1 && $3>60{print $2}' AA/*.txt >idx.txt
cat BB/*.txt | awk 'NR==FNR{a[$0]; next}; !($2 in a)' idx.txt -
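The two-step multi-file variant can be checked the same way on cut-down copies of the data (paths are placeholders):

```shell
mkdir -p /tmp/AA2 /tmp/BB2
cat > /tmp/AA2/ff.txt <<'EOF'
Name number marks
john 1 60
samuel 3 62
EOF
cat > /tmp/BB2/ff.txt <<'EOF'
marks 1 11.824
marks 3 12.120
EOF
awk 'FNR>1 && $3>60{print $2}' /tmp/AA2/*.txt > /tmp/idx.txt   # idx.txt now holds: 3
cat /tmp/BB2/*.txt | awk 'NR==FNR{a[$0]; next}; !($2 in a)' /tmp/idx.txt -
```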
One perl solution:
use warnings;
use strict;
use File::Spec;

## Hash to save data to delete from files of BB folder.
## key -> file name.
## value -> string with numbers of second column. They will be
## joined separated with '-...-', like: -2--3--1-. And it will be easier to
## search for them using a regexp.
my %delete;

## Check arguments:
## 1.- They are two.
## 2.- Both are directories.
## 3.- Both have same number of regular files and with identical names.
die qq[Usage: perl $0 <dir_AA> <dir_BB>\n] if
    @ARGV != 2 ||
    grep { ! -d } @ARGV;

{
    my %h;
    for ( glob join q[ ], map { qq[$_/*] } @ARGV ) {
        next unless -f;
        my $file = ( File::Spec->splitpath( $_ ) )[2] or next;
        $h{ $file }++;
    }
    for ( values %h ) {
        if ( $_ != 2 ) {
            die qq[Different files in both directories\n];
        }
    }
}

## Get files from dir 'AA'. Process them, print to output lines which
## match the condition and save the information in the %delete hash.
for my $file ( glob( shift . qq[/*] ) ) {
    open my $fh, q[<], $file or do { warn qq[Couldn't open file $file\n]; next };
    $file = ( File::Spec->splitpath( $file ) )[2] or do {
        warn qq[Couldn't get file name from path\n]; next };
    while ( <$fh> ) {
        next if $. == 1;
        chomp;
        my @f = split;
        next unless @f >= 3;
        if ( $f[ $#f ] > 60 ) {
            $delete{ $file } .= qq/-$f[1]-/;
            printf qq[%s\n], $_;
        }
    }
}

## Process files found in dir 'BB'. For each line, print it if not found in
## file from dir 'AA'.
{
    @ARGV = glob( shift . qq[/*] );
    $^I = q[.bak];
    while ( <> ) {
        ## Sanity check. Shouldn't occur.
        my $filename = ( File::Spec->splitpath( $ARGV ) )[2];
        if ( ! exists $delete{ $filename } ) {
            close ARGV;
            next;
        }
        chomp;
        my @f = split;
        if ( $delete{ $filename } =~ m/-$f[1]-/ ) {
            next;
        }
        printf qq[%s\n], $_;
    }
}

exit 0;
A test:
Assuming the following tree of files. Command:
ls -R1
Output:
.:
AA
BB
script.pl
./AA:
ff.txt
gg.txt
./BB:
ff.txt
gg.txt
And the following file contents. Command:
head AA/*
Output:
==> AA/ff.txt <==
Name number marks
john 1 60
maria 2 54
samuel 3 62
ben 4 63
==> AA/gg.txt <==
Name number marks
john 1 70
maria 2 54
samuel 3 42
ben 4 33
Command:
head BB/*
Output:
==> BB/ff.txt <==
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
==> BB/gg.txt <==
marks 1 11.824 24.015 41.220 1.00 13.65
marks 2 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
Run the script like:
perl script.pl AA/ BB
With the following output to the screen:
samuel 3 62
ben 4 63
john 1 70
And files of BB directory modified like:
head BB/*
Output:
==> BB/ff.txt <==
marks 1 11.824 24.015 41.220 1.00 13.65
marks 1 13.058 24.521 40.718 1.00 11.82
==> BB/gg.txt <==
marks 2 13.058 24.521 40.718 1.00 11.82
marks 3 12.120 13.472 46.317 1.00 10.62
marks 4 10.343 24.731 47.771 1.00 8.18
So, lines with numbers 3 and 4 have been deleted from ff.txt, and lines with number 1 from gg.txt, since those were the rows whose last column in AA was bigger than 60. I think this is what you wanted to achieve. I hope it helps, although it is not awk.