Split a column, then sum and push the sum onto an array in Perl

I have a file that looks like this:
LOCUS POS ALIAS VAR TEST P I DESC
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.43 0/1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.295 0/1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.005 1/1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 0.676617 0.005 1/0
I want to split the last field by "/", then sum those numbers, and push another column on with the sum. For example, I would want the output to look like:
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.43 0/1 1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.295 0/1 1
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 1 0.005 1/1 2
CCL23|disruptive chr17:34340329..34340854 . 2 BURDEN 0.676617 0.005 1/0 1
I have this code, but it doesn't work:
#! perl -w
my $file1 = shift @ARGV;
my $NVAR = 0;
my @vars;
open (IN, $file1) or die "couldn't read file one";
while (<IN>) {
    my @L = split;
    next if ($L[0] =~ m/LOCUS/);
    my @counts = split /\//, $L[7];
    foreach (@counts) {
        $NVAR = ${$_}[0] + ${$_}[1];
    }
    push @vars, [$L[0],$L[1],$L[2],$L[3],$L[4],$L[5],$L[6],$L[7],$NVAR];
}
close IN;
print "LOCUS POS ALIAS NVAR TEST P I DESC SUM\n";
foreach (@vars) {
    print "@{$_}\n";
}
Any help is appreciated.

Always include use strict; and use warnings; at the top of EVERY script.
Limit your variables to the smallest scope possible, as declaring $NVAR outside of the while loop introduced a bug. Your summation can be fixed by the following:
my $NVAR = 0;
foreach (@counts) {
    # $NVAR = ${$_}[0] + ${$_}[1];  <-- this was bad.
    $NVAR += $_;
}
However, this can also be solved with a Perl one-liner:
perl -MList::Util=sum -lane 'push @F, sum split "/", $F[-1]; print "@F"' file.txt
Or if you have a header row:
perl -MList::Util=sum -lane '
    push @F, $. == 1 ? "SUM" : sum split "/", $F[-1];
    print "@F"
' file.txt
Note that you can use List::Util's sum in your full script as well.
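For completeness, here is a minimal sketch of the corrected script using sum (the header and output format are taken from the question; the usage message is my addition):

#!/usr/bin/perl
use strict;
use warnings;
use List::Util qw(sum);

my $file1 = shift @ARGV or die "Usage: $0 file\n";
open my $in, '<', $file1 or die "Couldn't read $file1: $!";

print "LOCUS POS ALIAS NVAR TEST P I DESC SUM\n";
while (<$in>) {
    my @L = split;
    next if $L[0] eq 'LOCUS';           # skip the header line
    my $nvar = sum split /\//, $L[7];   # e.g. "1/1" -> 2
    print "@L $nvar\n";
}
close $in;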

Related

Extract reads from a BAM/SAM file of a designated length

I am a bit new to Perl and wish to use it to extract reads of a specific length from my BAM (alignment) file.
The BAM file contains reads whose lengths range from 19 to 29 nt.
Here is an example of the first 2 reads:
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:29884:1078 0 3R 6234066 42 22M * 0 0 TCACTGGGCTTTGTTTATCTCA FF:FFFF,FFFFFFFF:FFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:22
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:1777:1094 16 4 1313373 1 24M * 0 0 TCGCATTCTTATTGATTTTCCTTT FFFFFFF,FFFFFFFFFFFFFFFF AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:24
I want to extract only those which are, let's say, 21 nt in length.
I try to do this with the following code:
my $string = <STDIN>;
$length = samtools view ./file.bam | head | perl -F'\t' -lane'length @F[10]';
if ($length == 21){
print($string)
}
However, the program does not give any result...
Could anyone please suggest the right way of doing this?
Your question is a bit confusing. Is the code snippet supposed to be a Perl script or a shell script that calls a Perl one-liner?
Assuming that you meant to write a Perl script into which you pipe the output of samtools view:
#!/usr/bin/perl
use strict;
use warnings;
while (<STDIN>) {
    my @fields = split("\t", $_);
    # debugging, just to see what field is extracted...
    print "'$fields[10]' ", length($fields[10]), "\n";
    if (length($fields[10]) == 21) {
        print $_;
    }
}
exit 0;
With your test data in dummy.txt I get:
# this would be "samtools view ./file.bam | head | perl dummy.pl" in your case?
$ cat dummy.txt | perl dummy.pl
'FF:FFFF,FFFFFFFF:FFFFF' 22
'FFFFFFF,FFFFFFFFFFFFFFFF' 24
Your test data doesn't contain a sample with length 21 though, so the if clause is never executed.
Note that the 10th field in your sample input is either 22 or 24 characters long. Also, the syntax that you use is wrong. Here is a Perl one-liner to match the field with length 22:
$ cat pkom.txt
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:29884:1078 0 3R 6234066 42 22M * 0 0 TCACTGGGCTTTGTTTATCTCA FF:FFFF,FFFFFFFF:FFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:22
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:1777:1094 16 4 1313373 1 24M * 0 0 TCGCATTCTTATTGATTTTCCTTT FFFFFFF,FFFFFFFFFFFFFFFF AS:i:0 XS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:24
$ perl -lane ' print if length($F[9])==22 ' pkom.txt
YT:Z:UUA00182:193:HG2NLDMXX:1:1101:29884:1078 0 3R 6234066 42 22M * 0 0 TCACTGGGCTTTGTTTATCTCA FF:FFFF,FFFFFFFF:FFFFF AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:22
$
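Adapting that one-liner to the original question (reads of length 21, fed directly from samtools; $F[9] is the SAM sequence column), a sketch:

samtools view ./file.bam | perl -lane 'print if length($F[9]) == 21'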

Randomly pick a region and process it, a number of times

I have data like this:
>sp|Q96A73|P33MX_HUMAN Putative monooxygenase p33MONOX OS=Homo sapiens OX=9606 GN=KIAA1191 PE=1 SV=1
RNDDDDTSVCLGTRQCSWFAGCTNRTWNSSAVPLIGLPNTQDYKWVDRNSGLTWSGNDTCLYSCQNQTKGLLYQLFRNLFCSYGLTEAHGKWRCADASITNDKGHDGHRTPTWWLTGSNLTLSVNNSGLFFLCGNGVYKGFPPKWSGRCGLGYLVPSLTRYLTLNASQITNLRSFIHKVTPHR
>sp|P13674|P4HA1_HUMAN Prolyl 4-hydroxylase subunit alpha-1 OS=Homo sapiens OX=9606 GN=P4HA1 PE=1 SV=2
VECCPNCRGTGMQIRIHQIGPGMVQQIQSVCMECQGHGERISPKDRCKSCNGRKIVREKKILEVHIDKGMKDGQKITFHGEGDQEPGLEPGDIIIVLDQKDHAVFTRRGEDLFMCMDIQLVEALCGFQKPISTLDNRTIVITSHPGQIVKHGDIKCVLNEGMPIYRRPYEKGRLIIEFKVNFPENGFLSPDKLSLLEKLLPERKEVEE
>sp|Q7Z4N8|P4HA3_HUMAN Prolyl 4-hydroxylase subunit alpha-3 OS=Homo sapiens OX=9606 GN=P4HA3 PE=1 SV=1
MTEQMTLRGTLKGHNGWVTQIATTPQFPDMILSASRDKTIIMWKLTRDETNYGIPQRALRGHSHFVSDVVISSDGQFALSGSWDGTLRLWDLTTGTTTRRFVGHTKDVLSVAFSSDNRQIVSGSRDKTIKLWNTLGVCKYTVQDESHSEWVSCVRFSPNSSNPIIVSCGWDKLVKVWNLANCKLK
>sp|P04637|P53_HUMAN Cellular tumor antigen p53 OS=Homo sapiens OX=9606 GN=TP53 PE=1 SV=4
IQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQL
>sp|P10144|GRAB_HUMAN Granzyme B OS=Homo sapiens OX=9606 GN=GZMB PE=1 SV=2
MQPILLLLAFLLLPRADAGEIIGGHEAKPHSRPYMAYLMIWDQKSLKRCGGFLIRDDFVLTAAHCWGSSINVTLGAHNIKEQEPTQQFIPVKRPIPHPAYNPKNFSNDIMLLQLERKAKRTRAVQPLRLPSNKAQVKPGQTCSVAGWGQTAPLGKHSHTLQEVKMTVQEDRKCES
>sp|Q9UHX1|PUF60_HUMAN Poly(U)-binding-splicing factor PUF60 OS=Homo sapiens OX=9606 GN=PUF60 PE=1 SV=1
MGKDYYQTLGLARGASDEEIKRAYRRQALRYHPDKNKEPGAEEKFKEIAEAYDVLSDPRKREIFDRYGEEGLKGSGPSGGSGGGANGTSFSYTFHGDPHAMFAEFFGGRNPFDTFFGQRNGEEGMDIDDPFSGFPMGMGGFTNVNFGRSRSAQEPARKKQDPPVTHDLRVSLEEIYSGCTKKMKISHK
>sp|Q06416|P5F1B_HUMAN Putative POU domain, class 5, transcription factor 1B OS=Homo sapiens OX=9606 GN=POU5F1B PE=5 SV=2
IVVKGHSTCLSEGALSPDGTVLATASHDGYVKFWQIYIEGQDEPRCLHEWKPHDGRPLSCLLFCDNHKKQDPDVPFWRFLITGADQNRELKMWCTVSWTCLQTIRFSPDIFSSVSVPPSLKVCLDLSAEYLILSDVQRKVLYVMELLQNQEEGHACFSSISEFLLTHPVLSFGIQVVSRCRLRHTEVLPAEEENDSLGADGTHGAGAMESAAGVLIKLFCVHTKALQDVQIRFQPQLNPDVVAPLPTHTAHEDFTFGESRPELGSEGLGSAAHGSQPDLRRIVELPAPADFLSLSSETKPKLMTPDAFMTPSASLQQITASPSSSSSGSSSSSSSSSSSLTAVSAMSSTSAVDPSLTRPPEELTLSPKLQLDGSLTMSSSGSLQASPRGLLPGLLPAPADKLTPKGPGQVPTATSALSLELQEVEP
>sp|O14683|P5I11_HUMAN Tumor protein p53-inducible protein 11 OS=Homo sapiens OX=9606 GN=TP53I11 PE=1 SV=2
MIHNYMEHLERTKLHQLSGSDQLESTAHSRIRKERPISLGIFPLPAGDGLLTPDAQKGGETPGSEQWKFQELSQPRSHTSLKVSNSPEPQKAVEQEDELSDVSQGGSKATTPASTANSDVATIPTDTPLKEENEGFVKVTDAPNKSEISKHIEVQVAQETRNVSTGSAENEEKSEVQAIIESTPELDMDKDLSGYKGSSTPTKGIENKAFDRNTESLFEELSSAGSGLIGDVDEGADLLGMGREVENLILENTQLLETKNALNIVKNDLIAKVDELTCEKDVLQGELEAVKQAKLKLEEKNRELEEELRKARAEAEDARQKAKDDDDSDIPTAQRKRFTRVEMARVLMERNQYKERLMELQEAVRWTEMIRASRENPAMQEKKRSSIWQFFSRLFSSSSNTTKKPEPPVNLKYNAPTSHVTPSVK
I want to randomly pick a region of 10 letters from it and then count the number of F characters. I want to do that a certain number of times, for example 1000 times or even more.
As an example, I randomly pick
LVPSLTRYLT 0
then
ITNLRSFIHK 1
then again randomly go and pick 10 consecutive letters
AHSRIRKERP 0
This continues until the requested number of runs is reached. I want to store all the randomly selected regions with their counts, because I then want to calculate how many times F is seen.
So I do the following
# first I remove the header
grep -v ">" data.txt > out.txt
Then, to randomly get one region with 10 letters, I tried to use shuf with no success:
shuf -n1000 data.txt
Then I tried to use awk and was not successful either:
awk 'BEGIN {srand()} !/^$/ { if (rand() == 10) print $0}'
Then calculate the number of F and save it in the file:
grep -i -e [F] |wc -l
Note: we should not pick the same region twice.
I've got to assume some things here, and leave some restrictions:
Random regions to pick don't depend in any way on specific lines
Order doesn't matter; there need to be N regions spread out through the file
The file can be a gigabyte in size, so we can't read it whole (that would be much easier!)
There are unhandled (edge or unlikely) cases, discussed after the code
First build a sorted list of random numbers; these are positions in the file at which regions start. Then, as each line is read, compute its range of characters in the file, and check whether our numbers fall within it. If some do, they mark the start of each random region: pick substrings of desired length starting at those characters. Check whether substrings fit on the line.
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use List::MoreUtils qw(uniq);

my ($region_len, $num_regions) = (10, 10);
my $count_freq_for = 'F';
#srand(10);

GetOptions(
    'num-regions|n=i' => \$num_regions,
    'region-len|l=i'  => \$region_len,
    'char|c=s'        => \$count_freq_for,
) or usage();
my $file = shift || usage();

# List of (up to) $num_regions random numbers, spanning the file size
# However, we skip all '>sp' lines so take more numbers (estimate)
open my $fh, '<', $file or die "Can't open $file: $!";
$num_regions += int $num_regions * fraction_skipped($fh);
my @rand = uniq sort { $a <=> $b }
    map { int(rand (-s $file)-$region_len) } 1..$num_regions;
say "Starting positions for regions: @rand";

my ($nchars_prev, $nchars, $chars_left) = (0, 0, 0);
my $region;

while (my $line = <$fh>) {
    chomp $line;
    # Total number of characters so far, up to this line and with this line
    $nchars_prev = $nchars;
    $nchars += length $line;
    next if $line =~ /^\s*>sp/;

    # Complete the region if there weren't enough chars on the previous line
    if ($chars_left > 0) {
        $region .= substr $line, 0, $chars_left;
        my $cnt = () = $region =~ /$count_freq_for/g;
        say "$region $cnt";
        $chars_left = -1;
    };

    # Random positions that happen to be on this line
    my @pos = grep { $_ > $nchars_prev and $_ < $nchars } @rand;
    # say "\tPositions on ($nchars_prev -- $nchars) line: @pos" if @pos;

    for (@pos) {
        my $pos_in_line = $_ - $nchars_prev;
        $region = substr $line, $pos_in_line, $region_len;

        # Don't print if there aren't enough chars left on this line
        last if ( $chars_left =
            ($region_len - (length($line) - $pos_in_line)) ) > 0;

        my $cnt = () = $region =~ /$count_freq_for/g;
        say "$region $cnt";
    }
}

sub fraction_skipped {
    my ($fh) = @_;
    my ($skip_len, $data_len) = (0, 0);
    my $curr_pos = tell $fh;
    seek $fh, 0, 0 if $curr_pos != 0;
    while (<$fh>) {
        chomp;
        if (/^\s*>sp/) { $skip_len += length }
        else           { $data_len += length }
    }
    seek $fh, $curr_pos, 0;  # leave it as we found it
    return $skip_len / ($skip_len + $data_len);
}

sub usage {
    say STDERR "Usage: $0 [options] file", "\n\toptions: ...";
    exit;
}
Uncomment the srand line to get the same run every time, for testing.
Notes follow.
Some corner cases
If the 10-long window doesn't fit on the line from its random position it is completed in the next line -- but any (possible) further random positions on this line are left out. So if our random list has 1120 and 1122 while a line ends at 1125 then the window starting at 1122 is skipped. Unlikely, possible, and of no consequence (other than having one region fewer).
When an incomplete region is filled up in the next line (the first if in the while loop), it is possible that that line is shorter than the remaining needed characters ($chars_left). This is very unlikely and needs an additional check there, which is left out; one possible shape of that check is sketched after this list.
Random numbers are pruned of dupes. This skews the sequence, but only minutely, which should not matter here; and we may end up with fewer numbers than asked for, but only by very little.
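For illustration, a hedged sketch of that missing check, reusing the loop's own variables; it completes the region from what is available and carries any remainder over to yet another line:

# Possible replacement for the body of `if ($chars_left > 0)` in the main loop
if ($chars_left > 0) {
    my $take = substr $line, 0, $chars_left;   # takes at most what the line has
    $region     .= $take;
    $chars_left -= length $take;
    if ($chars_left == 0) {                    # region is now complete
        my $cnt = () = $region =~ /$count_freq_for/g;
        say "$region $cnt";
        $chars_left = -1;
    }
    # else: this line was also too short; keep filling on the next one
}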
Handling of issues regarding randomness
"Randomness" here is pretty basic, what seems suitable. We also need to consider the following.
Random numbers are drawn over the interval spanning the file size, int(rand -s $file) (minus the region size). But lines starting with >sp are skipped, and any of our numbers that fall within those lines won't be used, so we may end up with fewer regions than drawn numbers. Those lines are shorter, thus with a lesser chance of having numbers on them, so not many numbers are lost; but in some runs I saw even 3 out of 10 numbers skipped, ending up with a random sample 70% of the desired size.
If this is a bother, there are ways to approach it; to avoid skewing the distribution even further, they should all involve pre-processing the file.
The code above makes an initial run over the file, to compute the fraction of chars that will be skipped. That is then used to increase the number of random points drawn. This is of course an "average" measure, but it should still produce a number of regions close to the desired one for large enough files.
More detailed measures would need to see which random points of a (much larger) distribution are going to be lost to skipped lines, and then re-sample to account for that. This may still mess with the distribution, which arguably isn't an issue here, but more to the point may simply be unneeded.
In all this you read the big file twice. The extra processing time should only be in the seconds, but if that is unacceptable, change the function fraction_skipped to read through only 10-20% of the file. With large files this should still provide a reasonable estimate.
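For instance, a sketch of such a change (the 10% cutoff is an arbitrary assumption):

# Estimate the skipped fraction from only the first ~10% of the file
sub fraction_skipped {
    my ($fh) = @_;
    my ($skip_len, $data_len) = (0, 0);
    my $limit = 0.10 * -s $fh;       # sample only this many bytes
    my $curr_pos = tell $fh;
    seek $fh, 0, 0 if $curr_pos != 0;
    while (<$fh>) {
        chomp;
        if (/^\s*>sp/) { $skip_len += length }
        else           { $data_len += length }
        last if tell($fh) > $limit;
    }
    seek $fh, $curr_pos, 0;          # leave the handle as we found it
    return $skip_len / ($skip_len + $data_len);
}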
Note on a particular test case
With srand(10) (the commented-out line near the beginning) we get random numbers such that on one line the region starts 8 characters before the end of the line! So that case does test the code that completes the region on the next line.
Here is a simple driver to run the above a given number of times, for statistics.
Doing it with builtin tools (system, qx) is altogether harder, and libraries (modules) help. I use IPC::Run here; there are quite a few other options.†
Adjust and add code to process as needed for statistics; output is in files.
use warnings;
use strict;
use feature 'say';
use Getopt::Long;
use IPC::Run qw(run);

my $outdir = 'rr_output';        # pick a directory name
mkdir $outdir if not -d $outdir;

my $prog  = 'random_regions.pl'; # your name for the program
my $input = 'data_file.txt';     # your name for the input file
my $ch    = 'F';
my ($runs, $regions, $len) = (10, 10, 10);

GetOptions(
    'runs|n=i'  => \$runs,
    'regions=i' => \$regions,
    'length=i'  => \$len,
    'char=s'    => \$ch,
    'input=s'   => \$input
) or usage();

my @cmd = ( $prog, $input,
    '--num-regions', $regions,
    '--region-len',  $len,
    '--char',        $ch
);

say "Run: @cmd, $runs times.";

for my $n (1..$runs) {
    my $outfile = "$outdir/regions_r$n.txt";
    say "Run #$n, output in: $outfile";
    run \@cmd, '>', $outfile or die "Error with @cmd: $!";
}

sub usage {
    say STDERR "Usage: $0 [options]", "\n\toptions: ...";
    exit;
}
Please expand on the error checking. See for instance this post and its links for details.
Simplest use: driver_random.pl -n 4, but you can give all of the main program's parameters.
The called program (random_regions.pl above) must be executable.
†   Some, from simple to more capable: IPC::System::Simple, Capture::Tiny, IPC::Run3. (Then comes IPC::Run, used here.) Also see String::ShellQuote, to prepare commands without quoting issues, shell injection bugs, and other problems. See the links (examples) assembled in this post.
awk to the rescue!
You didn't specify, but there are two random actions going on: first picking a line, and second picking a random 10-letter substring from that line. I treated them independently, which may not be what you want.
This assumes the file (or actually half of it) can fit in memory. Otherwise, split the file into equal chunks and run this on the chunks. Doing so will reduce some of the clustering, though I'm not sure how important that is in this case. (If you have one big file, it's possible that all samples are drawn from the first half; with splitting you eliminate this possibility.) For certain cases this is a desired property. I don't know your case.
$ awk 'BEGIN {srand()}
!/^>/ {a[++n]=$0}
END {while(i++<1000)
{line=a[int(rand()*n)+1];
s=int(rand()*(length(line)-9))+1;
print ss=substr(line,s,10), gsub(/F/,"",ss)}}' file
GERISPKDRC 0
QDEPRCLHEW 0
LLYQLFRNLF 2
GTHGAGAMES 0
TKALQDVQIR 0
FCVHTKALQD 1
SNKAQVKPGQ 0
CMECQGHGER 0
TRRFVGHTKD 1
...
Here is one solution using Perl
It slurps the entire file into memory, then removes the lines starting with >.
Here I'm looping 10 times ($i<10); you can increase the count.
Then the rand function is called with the length of the file, and using that value a substr of 10 characters is taken. The $s!~/\n/ guard makes sure we don't choose a substring that crosses newlines.
$ perl -0777 -ne '$_=~s/^>.+?\n//smg; while($i<10) { $x=rand(length($_)); $s=substr($_,$x,10); $f=()=$s=~/F/g; if($s!~/\n/) { print "$s $f\n"; $i++ } }'
random10.txt
ENTQLLETKN 0
LSEGALSPDG 0
LRKARAEAED 0
RLWDLTTGTT 0
KWSGRCGLGY 0
TRRFVGHTKD 1
PVKRPIPHPA 0
GMVQQIQSVC 0
LTHPVLSFGI 1
KVNFPENGFL 2
$
To see the random offsets generated as well:
$ perl -0777 -ne '$_=~s/^>.+?\n//smg; while($i<10) { $x=rand(length($_)); $s=substr($_,$x,10); $f=()=$s=~/F/g; if($s!~/\n/) { print "$s $f $x\n"; $i++ } }
' random10.txt
QLDGSLTMSS 0 1378.61409368207
DLIAKVDELT 0 1703.46689004765
SGGGANGTSF 1 900.269562152326
PEELTLSPKL 0 1368.55540468164
TCLSEGALSP 0 1016.50744004085
NRTWNSSAVP 0 23.7868578293154
VNFPENGFLS 2 363.527933104776
NSGLTWSGND 0 48.656607650744
MILSASRDKT 0 422.67705815168
RRGEDLFMCM 1 290.828530365
AGDGLLTPDA 0 1481.78080339531
$
Since your input file is huge I'd do it in these steps:
select random 10-char strings from each line of your input file
shuffle those to get the number of samples you want in random order
count the Fs
e.g.
$ cat tst.sh
#!/bin/env bash
infile="$1"
sampleSize=10
numSamples=15
awk -v sampleSize="$sampleSize" '
BEGIN { srand() }
!/^>/ {
begPos = int((rand() * sampleSize) + 1)
endPos = length($0) - sampleSize
for (i=begPos; i<=endPos; i+=sampleSize) {
print substr($0,i,sampleSize)
}
}
' "$infile" |
shuf -n "$numSamples"
$ ./tst.sh file
HGDIKCVLNE
QDEPRCLHEW
SEVQAIIEST
THDLRVSLEE
SEWVSCVRFS
LTRYLTLNAS
KDGQKITFHG
SNSPEPQKAV
QGGSKATTPA
QLLETKNALN
LLFCDNHKKQ
DETNYGIPQR
IRFQPQLNPD
LQTIRFSPDI
SLKRCGGFLI
$ ./tst.sh file | awk '{print $0, gsub(/F/,"")}'
SPKLQLDGSL 0
IKLFCVHTKA 1
VVSRCRLRHT 0
SPEPQKAVEQ 0
AYNPKNFSND 1
FGESRPELGS 1
AGDGLLTPDA 0
VGHTKDVLSV 0
VTHDLRVSLE 0
PISLGIFPLP 1
ASQITNLRSF 1
LTRPPEELTL 0
FDRYGEEGLK 1
IYIEGQDEPR 0
WNTLGVCKYT 0
Just change the numSamples from 15 to 1000 or whatever you like when run against your real data.
The above relies on shuf -n being able to handle however much input we throw at it, presumably much like sort does by using paging. If it fails in that regard then obviously you'd have to choose/implement a different tool for that part. FWIW I tried seq 100000000 | shuf -n 10000 (i.e. 10 times as many input lines as the OP's posted max file length of 10000000, to account for the awk part generating N lines of output per 1 line of input, and 10 times as many output lines as the OP's posted 1000) and it worked fine, only taking a few seconds to complete.

Perl - Print first letter of column

I'm trying to print the first letter of column 2 of an input file, as well as other columns of interest. I'm not sure why the following script, adapted from Matching first letter of word, gives me a 'Use of uninitialized value $columns[2]' warning.
Input File Example:
ATOM 1 CAY GLY X 1 -0.124 0.401 -0.153 1.00 2.67 PEP
ATOM 2 HY1 GLY X 1 -0.648 0.043 -1.064 1.00 0.00 PEP
ATOM 3 HY2 GLY X 1 -0.208 1.509 -0.145 1.00 0.00 PEP
Output File Example:
1 C -0.124 0.401 -0.153 1.00 2.67
2 H -0.648 0.043 -1.064 1.00 0.00
3 H -0.208 1.509 -0.145 1.00 0.00
Script
open (my $input_fh, "<", $filename) or die $!;
while (my $data = <$input_fh>) {
    chomp $data;
    my @columns = split(/\t/, $data);
    my ($firstletter) = ($columns[2] =~ m/^\d+(\w)/);
    if (/CAY/../HT2/) {
        print $output_fh join("\t", $columns[1], $firstletter, $columns[6], $columns[7], $columns[8]), "\n";
    }
}
UPDATE The warning occurred due to the if (/CAY/../HT2/) statement for some reason -- but since the input files are identical, I don't really need this condition. Also, since there are no digits in column 2, it is more appropriate to use the /^(\w)/ regex.
Is there some particular reason that you must split on tabs? Handling the various kinds of whitespace in an arbitrary text file correctly can be tricky. If it's not necessary, it seems fully fitting to just split on (any) whitespace, then grab the first letter:
my @cols = split ' ', $data;
my ($firstletter) = $cols[2] =~ m/^(\w)/;
I am not sure what the rest does but you can easily pluck the columns you need.
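Putting that together, a minimal sketch; the column indices come from the question's desired output, and reading from STDIN/ARGV is my assumption:

#!/usr/bin/perl
use strict;
use warnings;

while (my $data = <>) {
    chomp $data;
    my @cols = split ' ', $data;               # split on any whitespace
    my ($firstletter) = $cols[2] =~ /^(\w)/;   # first letter of the atom name, e.g. CAY -> C
    print join("\t", $cols[1], $firstletter, @cols[6..10]), "\n";
}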
Try to debug what you get after splitting:
my @columns = split(/\t/, $data);
local $" = "\n"; print "$data\nSplit into:\n@columns";
My guess is that your file has doubled \t characters. I mean you probably have:
ATOM\t\t1 CAY GLY X... so the second column is undef.
It sounds to me like the code that gave that warning was not what you show but instead had something like
($columns[2]) = ($columns[2] =~ m/^\d+(\w)/);
And you are getting the warning because the regex is failing due to not finding a digit. Maybe you meant \d*?
Personally, I might use the cut command in a pipeline, and then split, to get the exact info you want.

Compare fields of two VCF files

I would like to ask you for some help with an apparently easy script I am trying to work on.
Basically I would like to compare the fields of two tab-delimited files:
If the second fields of the files match, compare all the rest of the fields of the line.
Where a field of the first file is "NA", print the field of the second file at the same location.
Now I have written this small script, but the problems I am having are:
1. how to keep the first 9 fields from the first file;
2. how to tell Perl to print out the line with the changed fields from the second file.
Here is an example if I was not clear:
File 1:
16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT NA NA 0/1
File2:
16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT 0/1 1/1 0/1
Desidered tab delimited output:
16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT 0/1 1/1 0/1
Thank you in advance for any comment and help!
use strict;
use warnings;
my $frameshift_file  = <>;
my $monomorphic_file = <>;
my @split_file1 = split "\t", $frameshift_file;  # splits the line on tabs
my @split_file2 = split "\t", $monomorphic_file; # splits the line on tab-delimited fields
if ($split_file1[1] eq $split_file2[1]) {
    for (my $i = 0; $i < scalar(@split_file1); $i++) {
        if ($split_file1[$i] eq "NA") {
            print $split_file2[$i], "\t";
        } else {
            print $split_file1[$i], "\t";
        }
    }
}
Try something like this (replace "\s+" with "\t" to split only on tabs):
use strict;
use warnings;
my (@split_file1, @split_file2, $frameshift_file, $monomorphic_file, $x);
$frameshift_file  = "16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT NA NA 0/1";
$monomorphic_file = "16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT 0/1 1/1 0/1";
(@split_file1) = split('\s+', $frameshift_file);  # splits the line on whitespace
(@split_file2) = split('\s+', $monomorphic_file);
if ("$split_file1[1]" eq "$split_file2[1]") {  # 2nd fields of the files match
    for ($x = 2; $x <= $#split_file1; $x++) {
        if ($split_file1[$x] eq "NA") {  # if file1 shows "NA", print file2's equivalent element
            print "split_file1[$x] = \"NA\" .. split_file2[$x] = $split_file2[$x]\n";
        }
    }
}
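The debug print above only reports which fields differ; a minimal sketch that actually emits the desired tab-delimited merged line (assuming one line per file, as in the example):

use strict;
use warnings;

my @f1 = split /\s+/, "16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT NA NA 0/1";
my @f2 = split /\s+/, "16 50763778 x GCCC GCCCC 210.38 PASS AC1=1 GT 0/1 1/1 0/1";

if ($f1[1] eq $f2[1]) {
    # take file 2's field wherever file 1 has NA, otherwise keep file 1's field
    my @merged = map { $f1[$_] eq 'NA' ? $f2[$_] : $f1[$_] } 0 .. $#f1;
    print join("\t", @merged), "\n";
}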

Perl script to extract part of some numbers

I wrote a Perl script that reads from a file and does some calculations. Basically I'm trying to calculate the throughput of network traffic. The file I'm reading from has the following format:
- 0.152416 1 2 tcp 1040 ------- 2 12.0 2.9 2 13
r 0.153584 1 2 tcp 1040 ------- 2 12.0 2.9 1 12
+ 0.154208 1 2 tcp 1040 ------- 2 10.0 2.7 3 15
- 0.154208 1 2 tcp 1040 ------- 2 11.0 2.8 3 15
r 0.155248 1 2 tcp 1040 ------- 2 12.0 2.9 2 13
I'm extracting columns [0], [3], [7], [8], and [9]. Since columns [8] and [9] come as doubles (i.e. x.y), I was trying to get only the first part (the x part). In other words, I don't care about the part that comes after the dot "."; all I need is the first part. I guess I have two ways: deal with regular expressions, or add extra code to process the tokens in [8] and [9] for each line I read. Any short suggestion? Part of the script:
#input parameters:
$infile=$ARGV[0];
$dest=$ARGV[1];
$from=$ARGV[2];
$to=$ARGV[3];
$fId=$ARGV[4];
$TimeShift=$ARGV[5];
I want to make $from and $to contain only the first part.
open (DATA,"<$infile") || die "error in $infile $!";
while (<DATA>)
{
@x = split(' '); # I'm using space
What about
$from = int $ARGV[2];
See int for details.
Or, rather,
my ($infile, $dest, $from, $to, $fId, $TimeShift) = @ARGV;
$_ = int for $from, $to;
You should be aware, though, that int has some dangerous caveats.
From perldoc -f int:
You should not use this function for rounding: one because
it truncates towards 0, and two because machine representations
of floating-point numbers can sometimes produce
counterintuitive results. For example, "int(-6.725/0.025)"
produces -268 rather than the correct -269; that's because it's
really more like -268.99999999999994315658 instead. Usually,
the "sprintf", "printf", or the "POSIX::floor" and
"POSIX::ceil" functions will serve you better than will int().
Instead, consider doing:
use POSIX;
...
...
$from = POSIX::floor($ARGV[2]);
If you just want to throw away the dot and the following digits, you can use s/[.][0-9]+\z//. That way, no floating-point conversions are involved.
#!/usr/bin/env perl
use strict; use warnings;
use Data::Dumper;
while (my $line = <DATA>) {
    last unless $line =~ /\S/;
    my @cols = (split ' ', $line)[0, 3, 7 .. 9];
    s/[.][0-9]+\z// for @cols[-2 .. -1];
    print Dumper \@cols;
}
__DATA__
- 0.152416 1 2 tcp 1040 ------- 2 12.0 2.9 2 13
r 0.153584 1 2 tcp 1040 ------- 2 12.0 2.9 1 12
+ 0.154208 1 2 tcp 1040 ------- 2 10.0 2.7 3 15
- 0.154208 1 2 tcp 1040 ------- 2 11.0 2.8 3 15
r 0.155248 1 2 tcp 1040 ------- 2 12.0 2.9 2 13