How to extract the column data from multidimensional array in perl

ls -l
-rw-r--r-- 1 angus angus 0 2013-08-16 01:33 copy.pl
-rw-r--r-- 1 angus angus 1931 2013-08-16 08:27 copy.txt
-rw-r--r-- 1 angus angus 492 2013-08-16 03:15 ex.txt
-rw-r--r-- 1 angus angus 25 2013-08-16 09:07 hello.txt
-rw-r--r-- 1 angus angus 98 2013-08-16 09:05 hi.txt
I need only the read/write/access permission data as well as the file name.
#! /usr/bin/perl -w
@list = `ls -l`;
$index = 0;
#print "@list\n";
for (@list) {
    ($access) = split(/[\s+]/, $_);
    print "$access\n";
    ($data) = split(/pl+/, $_);
    print "$data";
    @array1 = ($data, $access);
}
print "@array1\n";
I have written this code to extract the read/write/access permission details and the file name corresponding to them, but I couldn't extract the filename, which is the last column.

Check perl's stat: http://perldoc.perl.org/functions/stat.html
It's more robust and efficient than calling the external ls command:
use File::stat;
$sb = stat($filename);
printf "File is %s, size is %s, perm %04o, mtime %s\n",
    $filename, $sb->size, $sb->mode & 07777,
    scalar localtime $sb->mtime;
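To get the rwx-style permission string the asker wanted without parsing ls at all, the mode bits can be decoded directly. A minimal sketch; rwx_string is a hypothetical helper written for this example, not part of File::stat:

```perl
#!/usr/bin/perl
use strict;
use warnings;
use File::stat;

# Turn the low 9 permission bits into an "rwxr-xr--" style string.
sub rwx_string {
    my ($mode) = @_;
    my @flags = qw(r w x r w x r w x);
    my $str   = '';
    for my $i (0 .. 8) {
        # Test bits from 0400 down to 0001, highest first.
        $str .= ($mode & (1 << (8 - $i))) ? $flags[$i] : '-';
    }
    return $str;
}

# Print "permissions filename" for everything in the current directory.
for my $file (glob '*') {
    my $sb = stat($file) or next;
    printf "%s %s\n", rwx_string($sb->mode & 0777), $file;
}
```

This sidesteps the column-splitting problem entirely, since the file name comes from glob rather than from parsing ls output.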

I think you have an error in line number 8 of your script. You are trying to split the line using the string "pl" as a delimiter, which will only match lines containing "pl" (such as copy.pl) and will not give you what I think you want.
I believe you should just split the whole line on whitespace and take only the columns you want (numbers 1 and 8 in this case).
Change your loop to this:
for my $filename (@list) {
    chomp($filename);
    # Use a slice to get only the columns you want.
    my ($access, $data) = (split(/\s+/, $filename))[0, 7];
    print "$access $data\n";
}
Note: mpapec's suggestion to use stat would be better. I just wanted to let you know why your code is not working.

Related

How to capture large STDOUT output in Perl when executing an external command

I want to get the list of file names present in the remote location.
I am using the below snippet in my Perl script.
my $command = "sftp -q -o${transferAuthMode}=yes -oPort=$sftpPort ${remoteUsername}\@${remoteHost} 2>\&1 <<EOF\n" .
              "cd \"${remotePath}\"\n" .
              "ls -l \n" .
              "quit\n" .
              "EOF\n";
my @files = `$command`;
When the number of files in the remote location is large (>500), not all of the file names are captured in @files.
When I do a manual SFTP session and list the files, all of them are listed, but I'm not getting the same through the script. The size of @files is different each time. It occurs only when there is a large number of files.
I'm unable to find the reason behind this. Could you please help?
This can be achieved without requiring any additional modules. I tested this on my CentOS 7 server (a VM running under Windows).
My remote host details: I have ~2000 files in the remote host dir, on a CentOS 6.8 server.
%_gaurav@[remotehost]:/home/gaurav/files/test> ls -lrth|head -3;echo;ls -lrth|tail -2
total 7.9M
-rw-rw-r--. 1 gaurav gaurav 35 Feb 16 23:51 File-0.txt
-rw-rw-r--. 1 gaurav gaurav 35 Feb 16 23:51 File-1.txt
-rw-rw-r--. 1 gaurav gaurav 38 Feb 16 23:51 File-1998.txt
-rw-rw-r--. 1 gaurav gaurav 38 Feb 16 23:51 File-1999.txt
%_gaurav@[remotehost]: /home/gaurav/files/test>
Script output from the local host: please note that I am running your command without the -o${transferAuthMode}=yes part. As seen below, the script is able to gather all results in an array, well over 500 entries.
I am printing the total number of entries and some particular index numbers from the array to show the results, but give it a try with the Dumper line uncommented to see the full result.
%_STATION@gaurav * /root/ga/study/pl> ./scp.pl
Read 2003 lines from SCP command.
ArrayIndex: 2,3,1999,2000 contain:
[-rw-rw-r-- 0 501 501 36B Feb 16 23:51 File-58.txt]
[-rw-rw-r-- 0 501 501 37B Feb 16 23:51 File-129.txt]
[-rw-rw-r-- 0 501 501 38B Feb 16 23:51 File-1759.txt]
[-rw-rw-r-- 0 501 501 38B Feb 16 23:51 File-1810.txt]
%_STATION@gaurav * /root/ga/study/pl>
Script and its Working:
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $sftp_port = 22;
my ($user, $host) = ("gaurav", "192.168.246.137");
my $remote_path = '/home/gaurav/files/test';
my @result;    # To store the result
my $command = "sftp -q -oPort=$sftp_port ${user}\@${host} 2>\&1 <<EOF\n"
            . "cd $remote_path\nls -lrth\nquit\nEOF";
# Open the command as a file handle, read its output and store it.
open FH, "$command |" or die "Something went wrong!!\n";
while (<FH>) {
    tr/\r\f\n//d;    # Remove any newline, carriage return or form feed.
    push(@result, "[$_]");
}
close FH;
#print Dumper \@result;
# Just for printing a little bit of the result from
# the array. The following lines can be deleted.
my $total = scalar @result;
print "Read $total lines from SCP command.\n";
print "\nArrayIndex: 2,3,1999,2000 contain:\n
$result[2]
$result[3]
$result[1999]
$result[2000]
";
Another way: one could also get around this issue by making a shell script and calling it from the Perl script, then reading its output. Shown below are my shell script, which gets called by the Perl script, and the final output. This can be used as a quick technique when one doesn't have much time to write the commands in Perl directly. You can use the qx style (shown below) in the earlier script as well.
Shell script "scp.sh"
%_STATION@gaurav * /root/ga/study/pl> cat scp.sh
#!/bin/bash
sftp -oPort=${1} ${2}@${3} 2>&1 <<EOF
cd ${4}
ls -l
quit
EOF
Perl Script "2scp.pl"
%_STATION@gaurav * /root/ga/study/pl> cat 2scp.pl
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my $sftp_port = 22;
my ($user, $host) = ("gaurav", "192.168.246.137");
my $remote_path = '/home/gaurav/files/test';
# Passing arguments to the shell script using concatenation.
my $command = './scp.sh' . " $sftp_port $user $host $remote_path";
my @result = qx{$command};    # Runs the command and stores the result.
my $total = scalar @result;
print "Read $total lines from SCP command.\n";
# End.
Output:
%_STATION@gaurav * /root/ga/study/pl> ./2scp.pl
Read 2004 lines from SCP command.
%_STATION@gaurav * /root/ga/study/pl>
Try it out and let us know.
Thanks.

How to extract specific columns from different files and output in one file?

I have a directory with 12 files; each file has 4 columns. The first column is a gene name and the remaining 3 are count columns. I want to extract columns 1 and 4 from each of the 12 files and paste them into one output file. Since the first column is the same in every file, the output file should contain it only once, followed by the 4th column of each file. I do not want to use R here; I am a big fan of awk. So I was trying something like the below, but it did not work.
My input files look like
Input file 1
ZYG11B 8267 16.5021 2743.51
ZYG11A 4396 0.28755 25.4208
ZXDA 5329 2.08348 223.281
ZWINT 1976 41.7037 1523.34
ZSCAN5B 1751 0.0375582 1.32254
ZSCAN30 4471 4.71253 407.923
ZSCAN23 3286 0.347228 22.9457
ZSCAN20 4343 3.89701 340.361
ZSCAN2 3872 3.13983 159.604
ZSCAN16-AS1 2311 1.1994 50.9903
Input file 2
ZYG11B 8267 18.2739 2994.35
ZYG11A 4396 0.227859 19.854
ZXDA 5329 2.44019 257.746
ZWINT 1976 8.80185 312.072
ZSCAN5B 1751 0 0
ZSCAN30 4471 9.13324 768.278
ZSCAN23 3286 1.03543 67.4392
ZSCAN20 4343 3.70209 318.683
ZSCAN2 3872 5.46773 307.038
ZSCAN16-AS1 2311 3.18739 133.556
Input file 3
ZYG11B 8267 20.7202 3593.85
ZYG11A 4396 0.323899 29.8735
ZXDA 5329 1.26338 141.254
ZWINT 1976 56.6215 2156.05
ZSCAN5B 1751 0.0364084 1.33754
ZSCAN30 4471 6.61786 596.161
ZSCAN23 3286 0.79125 54.5507
ZSCAN20 4343 3.9199 357.177
ZSCAN2 3872 5.89459 267.58
ZSCAN16-AS1 2311 2.43055 107.803
Desired output from above
ZYG11B 2743.51 2994.35 3593.85
ZYG11A 25.4208 19.854 29.8735
ZXDA 223.281 257.746 141.254
ZWINT 1523.34 312.072 2156.05
ZSCAN5B 1.32254 0 1.33754
ZSCAN30 407.923 768.278 596.161
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN20 340.361 318.683 357.177
ZSCAN2 159.604 307.038 267.58
ZSCAN16-AS1 50.9903 133.556 107.803
As you can see above, I take the first and fourth columns from each file; since the first column is the same in every file, I keep it only once, and the rest of the output has the 4th column of each file. I have shown only 3 files here, but it should work for all the files in the directory at once, since they share the naming convention file1_quant.genes.sf, file2_quant.genes.sf, file3_quant.genes.sf, and so on.
Every file has the same first column but different counts in the remaining columns. My idea is to create one output file with the 1st column and the 4th column from all the files.
awk '{print $1,$2,$4}' *_quant.genes.sf > genes.estreads
Any heads up?
If I understand you correctly, what you're looking for is one line per key, collated from multiple files.
The tool you need for this job is an associative array. I think awk can, but I'm not 100% sure. I'd probably tackle it in perl though:
#!/usr/bin/perl
use strict;
use warnings;

# An associative array, or hash as perl calls it.
my %data;

# Iterate the input files (sort might be irrelevant here).
foreach my $file ( sort glob("*_quant.genes.sf") ) {
    # Open the file for reading.
    open( my $input, '<', $file ) or die $!;
    # Iterate line by line.
    while (<$input>) {
        # Extract the data - splitting on any whitespace.
        my ( $key, @values ) = split;
        # Add 'column 4' to the hash (of arrays).
        push( @{ $data{$key} }, $values[2] );
    }
    close($input);
}

# Start output.
open( my $output, '>', 'genes.estreads' ) or die;
# Sort, because hashes are explicitly unordered.
foreach my $key ( sort keys %data ) {
    # Print the key and all the elements collected.
    print {$output} join( "\t", $key, @{ $data{$key} } ), "\n";
}
close($output);
With data as specified as above, this produces:
ZSCAN16-AS1 50.9903 133.556 107.803
ZSCAN2 159.604 307.038 267.58
ZSCAN20 340.361 318.683 357.177
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN30 407.923 768.278 596.161
ZSCAN5B 1.32254 0 1.33754
ZWINT 1523.34 312.072 2156.05
ZXDA 223.281 257.746 141.254
ZYG11A 25.4208 19.854 29.8735
ZYG11B 2743.51 2994.35 3593.85
The following is how you do it in awk:
awk 'BEGIN{FS = " "};{print $1, $4}' *|awk 'BEGIN{FS = " "};{temp = x[$1];x[$1] = temp " " $2;};END {for(xx in x) print xx,x[xx]}'
As cryptic as it looks, I am just using associative arrays.
Here is the solution broken down:
Just print the key and the value, one per line.
print $1, $2
Store the data in an associative array, keep updating
temp = x[$1]; x[$1] = temp " " $2
Display it:
for(xx in x) print xx,x[xx]
Sample run:
[cloudera@quickstart test]$ cat f1
A k1
B k2
[cloudera@quickstart test]$ cat f2
A k3
B k4
C k1
[cloudera@quickstart test]$ awk 'BEGIN{FS = " "};{print $1, $2}' *|awk 'BEGIN{FS = " "};{temp = x[$1];x[$1] = temp " " $2;};END {for(xx in x) print xx,x[xx]}'
A k1 k3
B k2 k4
C k1
As a side note, the approach should be reminiscent of the Map Reduce paradigm.
awk '{E[$1]=E[$1] "\t" $4}END{for(K in E)print K E[K]}' *_quant.genes.sf > genes.estreads
The order is the order of appearance when reading the files (so normally based on the first file read).
If the first column is the same in all the files, you can use paste:
paste <(tabify f1 | cut -f1,4) \
<(tabify f2 | cut -f4) \
<(tabify f3 | cut -f4)
Where tabify changes consecutive spaces to tabs:
sed 's/ \+/\t/g' "$#"
and f1, f2, f3 are the input files' names.
Here's another way to do it in Perl:
perl -lane '$data{$F[0]} .= " $F[3]"; END { print "$_ $data{$_}" for keys %data }' input_file_1 input_file_2 input_file_3
Here's another way of doing it with awk, and it supports multiple input files:
awk 'FNR==1{f++}{a[f,FNR]=$1}{b[f,FNR]=$4}END{for(x=1;x<=FNR;x++){printf("%s ",a[1,x]);for(y=1;y<=f;y++)printf("%s ",b[y,x]);print ""}}' input1.txt input2.txt input3.txt
That line of code gives the following output:
ZYG11B 2743.51 2994.35 3593.85
ZYG11A 25.4208 19.854 29.8735
ZXDA 223.281 257.746 141.254
ZWINT 1523.34 312.072 2156.05
ZSCAN5B 1.32254 0 1.33754
ZSCAN30 407.923 768.278 596.161
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN20 340.361 318.683 357.177
ZSCAN2 159.604 307.038 267.58
ZSCAN16-AS1 50.9903 133.556 107.803

Perl: how to compare array to hash and print out results

I'm quite new to Perl, so I'm sorry if this is somewhat rudimentary.
I'm working with a Perl script that is working as a wrapper for some Python, text formatting, etc. and I'm struggling to get my desired output.
The script takes a folder; for this example, the folder contains 6 text files (test1.txt through test6.txt). The script extracts some information from the files, runs a series of command-line programs, and then outputs a tab-delimited result. However, that result contains only those files that made it through the rest of the processing by the script.
Here are some snippets of what I have so far:
use strict;
use warnings;
## Create an array to capture all of the file names from the folder.
opendir(DIR, $folder) or die "couldn't open $folder: $!\n";
my @filenames = grep { /\.txt$/ } readdir DIR;
closedir DIR;
# Here I run some subroutines; the last one looks like this:
my $results = `blastn -query $shortname.fasta -db DB/$db -outfmt "6 qseqid sseqid score evalue" -max_target_seqs 1`;
# Now I would like to compare what is in the @filenames array with $results.
Example of tab delimited result - stored in $results:
test1.txt 200 1:1-20 79 80
test3.txt 800 1:1-200 900 80
test5.txt 900 1:1-700 100 2000
test6.txt 600 1:1-1000 200 70
I would like the final output to include all of the files that were run through the script, so I think I need a way to compare two arrays or perhaps compare an array to a hash?
Example of the desired output:
test1.txt 200 1:1-20 79 80
test2.txt 0 No result
test3.txt 800 1:1-200 900 80
test4.txt 0 No result
test5.txt 900 1:1-700 100 2000
test6.txt 600 1:1-1000 200 70
Update
Ok, so I got this to work with suggestions by @terdon, by reading the file into a hash and then comparing. Then I was trying to figure out how to do this without writing to a file and reading it back in, and I still can't seem to get the syntax correct. Here's what I have; however, it seems I'm not able to match the array to the hash, meaning the hash must not be correct:
#!/usr/bin/env perl
use strict;
use warnings;
# Create a variable to mimic blast results.
my $blast_results = "file1.ab1 9 350 0.0 449 418 418 403479 403042 567
file3.ab1 2 833 0.0 895 877 877 3717226 3718105 984";
# Create an array to mimic the filename array.
my @filenames = ("file1.ab1", "file2.ab1", "file3.ab1");
# Header for the file.
my $header = "Query\tSeq_length\tTarget found\tScore (Bits)\tExpect(E-value)\tAlign-length\tIdentities\tPositives\tChr\tStart\tEnd\n";
# Initialize the hash.
my %hash;
# Split the blast results into an array.
my @row = split(/\s+/, $blast_results);
$hash{$row[0]} = $_;
print $header;
foreach my $file (@filenames) {
    ## If this filename has an associated entry in the hash, print it.
    if (defined($hash{$file})) {
        print "$row[0]\t$row[9]\t$row[1]:$row[7]-$row[8]\t$row[2]\t$row[3]\t$row[4]\t$row[5]\t$row[6]\t$row[1]\t$row[7]\t$row[8]\n";
    }
    ## If not, print this.
    else {
        print "$file\t0\tNo Blast Results: Sequencing Rxn Failed\n";
    }
}
print "-----------------------------------\n";
print "$blast_results\n";    # Test what the results look like.
print "-----------------------------------\n";
print "$row[0]\t$row[1]\n";  # Test if the array is getting split correctly.
print "-----------------------------------\n";
print "$filenames[2]\n";     # Test if the other array is present.
The result from this script is as follows (the @filenames array is not matching the hash):
Query Seq_length Target found Score (Bits) Expect(E-value) Align-length Identities Positives Chr Start End
file1.ab1 0 No Blast Results: Sequencing Rxn Failed
file2.ab1 0 No Blast Results: Sequencing Rxn Failed
file3.ab1 0 No Blast Results: Sequencing Rxn Failed
-----------------------------------
file1.ab1 9 350 0.0 449 418 418 403479 403042 567
file3.ab1 2 833 0.0 895 877 877 3717226 3718105 984
-----------------------------------
file1.ab1 9
-----------------------------------
file3.ab1
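For reference on why nothing matches in the update above: the whole multi-line string is split once into one flat @row, and $_ (which is undefined at that point) is stored as the hash value. A minimal sketch that splits the string into lines first and keys each line on its first field:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Mimic the blast results from the question.
my $blast_results = "file1.ab1 9 350 0.0 449 418 418 403479 403042 567
file3.ab1 2 833 0.0 895 877 877 3717226 3718105 984";

my %hash;
# Split into lines first; only then split each line into fields.
for my $line (split /\n/, $blast_results) {
    my ($key) = split /\s+/, $line;
    $hash{$key} = $line;    # key = first column, value = whole row
}

# Now the filename array matches the hash where a result exists.
for my $file ("file1.ab1", "file2.ab1", "file3.ab1") {
    print defined $hash{$file} ? "$hash{$file}\n"
                               : "$file\t0\tNo result\n";
}
```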
I'm not entirely sure what you need here but the equivalent of awk's A[$1]=$0 is done using hashes in Perl. Something like:
my %hash;
## Open the input file.
open(my $fh, "<", "text_file");
while (<$fh>) {
    ## Remove newlines.
    chomp;
    ## Split the line.
    my @A = split(/\s+/);
    ## Save this in a hash whose keys are the 1st fields and whose
    ## values are the associated lines.
    $hash{$A[0]} = $_;
}
close($fh);
## Now, compare the file to @filenames.
foreach my $file (@filenames) {
    ## Print the file name.
    print "$file\t";
    ## If this filename has an associated entry in the hash, print it.
    if (defined($hash{$file})) {
        print "$hash{$file}\n";
    }
    ## If not, print this.
    else {
        print "0\tNo result\n";
    }
}

Reading output from command into Perl array

I want to get the output of a command into an array, like this:
my @output = `$cmd`;
but it seems that the output from the command does not go into the @output array.
Any idea where it does go?
This simple script works for me:
#!/usr/bin/env perl
use strict;
use warnings;
my $cmd = "ls";
my @output = `$cmd`;
chomp @output;
foreach my $line (@output)
{
    print "<<$line>>\n";
}
It produced this output (the triple dots mark lines omitted here):
$ perl xx.pl
<<args>>
<<args.c>>
<<args.dSYM>>
<<atob.c>>
<<bp.pl>>
...
<<schwartz.pl>>
<<timer.c>>
<<timer.h>>
<<utf8reader.c>>
<<xx.pl>>
$
The output of the command is split on line boundaries (by default, in list context). The chomp deletes the newlines from the array elements.
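A small self-contained sketch of that context distinction (the printf command is just an illustrative way to produce three lines of output on a POSIX shell):

```perl
#!/usr/bin/perl
use strict;
use warnings;

# List context: backticks return one array element per output line.
my @lines = `printf 'a\\nb\\nc\\n'`;

# Scalar context: the same output arrives as a single string.
my $all = `printf 'a\\nb\\nc\\n'`;

printf "list elements: %d, scalar length: %d\n",
    scalar @lines, length $all;
```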
The (standard) output does go to that array:
david@cyberman:listing # cat > demo.pl
#!/usr/bin/perl
use strict;
use warnings;
use v5.14;
use Data::Dump qw/ddx/;
my @output = `ls -lh`;
ddx \@output;
david@cyberman:listing # touch a b c d
david@cyberman:listing # perl demo.pl
# demo.pl:8: [
# "total 8\n",
# "-rw-r--r-- 1 david staff 0B 5 Jun 12:15 a\n",
# "-rw-r--r-- 1 david staff 0B 5 Jun 12:15 b\n",
# "-rw-r--r-- 1 david staff 0B 5 Jun 12:15 c\n",
# "-rw-r--r-- 1 david staff 0B 5 Jun 12:15 d\n",
# "-rw-r--r-- 1 david staff 115B 5 Jun 12:15 demo.pl\n",
# ]
Enable automatic error checks:
require IPC::System::Simple;
use autodie qw(:all);
⋮
my @output = `$cmd`;

Loading Big files into Hashes in Perl (BLAST tables)

I'm a perl beginner, please help me out with my query... I'm trying to extract information from a blast table (a snippet of what it looks like is below):
It's a standard blast table input... I basically want to extract any information on a list of reads (Look at my second script below , to get an idea of what I want to do).... Anyhow this is precisely what I've done in the second script:
INPUTS:
1) the blast table:
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
34.3 7.5 123828 GH8NFLV01A03QJ GH8NFLV01A03QJ rank=0239249 x=305.0 y=1945.5 length=452 1 XP_002639994 Hypothetical protein CBG10824 [Caenorhabditis briggsae] 3 0 52.1739130434783 32.6086956521739 46 452 367 10.1769911504425 12.5340599455041 111 248 79 124
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
37.7 0.98 33114 GH8NFLV01A0H5C GH8NFLV01A0H5C rank=0066011 x=298.0 y=2638.5 length=573 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] -3 0 73.5294117647059 52.9411764705882 34 573 703 5.93368237347295 4.83641536273115 131 232 273 306
103 1e-020 65742 GH8NFLV01A0MXI GH8NFLV01A0MXI rank=0124865 x=300.5 y=644.0 length=475 1 ABZ08973 hypothetical protein ALOHA_HF4000APKG6B14ctg1g18 [uncultured marine crenarchaeote HF4000_APKG6B14] 2 0 77.9411764705882 77.9411764705882 68 475 151 14.3157894736842 45.0331125827815 2 205 1 68
41.6 0.053 36083 GH8NFLV01A0QKX GH8NFLV01A0QKX rank=0071366 x=301.0 y=1279.0 length=526 1 XP_766153 hypothetical protein [Theileria parva strain Muguga] -1 0 66.6666666666667 56.6666666666667 30 526 304 5.70342205323194 9.86842105263158 392 481 31 60
45.4 0.003 78246 GH8NFLV01A0Z29 GH8NFLV01A0Z29 rank=0148293 x=304.0 y=1315.0 length=432 1 ZP_04111769 hypothetical protein bthur0007_56280 [Bacillus thuringiensis serovar monterrey BGSC 4AJ1] 3 0 51.8518518518518 38.8888888888889 54 432 193 12.5 27.979274611399 48 209 97 150
71.6 4e-011 97250 GH8NFLV01A14MR GH8NFLV01A14MR rank=0184885 x=317.5 y=609.5 length=314 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 92.5 92.5 40 314 311 12.7388535031847 12.8617363344051 193 312 13 52
58.2 5e-007 154555 GH8NFLV01A1KCH GH8NFLV01A1KCH rank=0309994 x=310.0 y=2991.0 length=267 1 ZP_03823721 DNA replication protein [Acinetobacter sp. ATCC 27244] 1 0 82.051282051282 82.051282051282 39 267 311 14.6067415730337 12.540192926045 142 258 1 39
2) The reads list:
GH8NFLV01A09B8
GH8NFLV01A02ED
etc
etc
3) the output I want:
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 x=307.0 y=1014.0 length=512 1 XP_002756773 PREDICTED: probable G-protein coupled receptor 123-like, partial [Callithrix jacchus] 1 0 73.5294117647059 52.9411764705882 34 512 703 6.640625 4.83641536273115 43 144 273 306
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 x=305.0 y=211.5 length=345 1 YP_003242370 Dynamin family protein [Paenibacillus sp. Y412MC10] -1 0 48.936170212766 40.4255319148936 47 345 1213 13.6231884057971 3.87469084913438 31 171 544 590
I want a subset of the information in the first list, given a list of read names to extract (found in the 4th column).
Instead of hashing the reads list (only?), I want to hash the blast table itself and use the information in column 4 of the blast table as the keys, extracting the values of each key even when a key has more than one value (i.e. each read name might have more than one hit, or associated blast result, in the table), keeping in mind that the value includes the WHOLE row with that key (read name) in it.
My listgrep.pl script does this, but is very, very slow. I think (and correct me if I'm wrong) that by loading the whole table into a hash, this should speed things up tremendously...
Thank you for your help.
My scripts:
The Broken one (mambo5.pl)
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames
use strict;
open (DATA, $ARGV[0]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
open (LIST, $ARGV[1]) or die ("Usage: ./mambo5.pl BlastXTable readslist");
my %hash = <DATA>;
close (DATA);
my $filename = $ARGV[0];
open(OUT, "> $filename.bololom");
my $readName;
while ( <LIST> )
{
    if (/^(.*?)$/)
    {
        $readName = $1;
        chomp $readName;
        if (exists $hash{$readName})
        {
            print "bingo!";
            my $output = $hash{$readName};
            print OUT "$output\n";
        }
        else
        {
            print "it aint workin\n";
            #print %hash;
        }
    }
}
close (LIST);
The Slow and quick cheat (that works) and is very slow (my blast tables can be about 400MB to 2GB large, I'm sure you can see why it's so slow)
#!/usr/bin/perl -w
##
# This script finds a list of names in a blast table and outputs the result in a new file.
# Each name must exist and the list must be correctly formatted.
# It will not output anything for a "normal" blast file; the input must be a table blast.
# If you have the standard blast output, use the blast2table script.
use strict;
my $filein = $ARGV[0] or die ("usage: ./listgrep.pl readslist blast_table\n");
my $db     = $ARGV[1] or die ("usage: ./listgrep.pl readslist blast_table\n");
# Open the reads you want to grep.
my $read;
my $line;
open(READSLIST, $filein);
while ($line = <READSLIST>)
{
    if ($line =~ /^(.*)$/)
    {
        $read = $1;
        print "$read\n";
        system("grep \"$read\" $db >$read\_.out\n");
    }
    #system("grep $read $db >$read\_.out\n");
}
system("cat *\_.out >$filein\_greps.txt\n");
system("rm *.out\n");
I don't know how to define that 4th column as the key: maybe I could use the split function, but I've tried to find a way that does this for a table of more than 2 columns to no avail... Please help!
If there is an easy way out of this, please let me know.
Thanks!
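Since the sticking point is keying on the 4th column when a read can have several rows, the usual idiom is a hash of arrays: split each row, use field index 3 as the key, and push the whole row onto that key's array. A minimal sketch, with a few shortened sample rows from the question inlined as a string:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# A few shortened rows from the blast table (4th column = read name).
my $table = <<'END';
37.7 0.70 62716 GH8NFLV01A09B8 GH8NFLV01A09B8 rank=0119267 length=512
38.1 0.53 59544 GH8NFLV01A02ED GH8NFLV01A02ED rank=0113471 length=345
58.2 5e-007 154555 GH8NFLV01A1KCH GH8NFLV01A1KCH rank=0309994 length=267
END

my %by_read;
for my $line (split /\n/, $table) {
    my $key = (split ' ', $line)[3];    # 4th whitespace-separated column
    push @{ $by_read{$key} }, $line;    # a key can collect many rows
}

# Look up one read name: every stored row for it comes back.
print "$_\n" for @{ $by_read{'GH8NFLV01A09B8'} || [] };
```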
I'd do the opposite, i.e. read the readslist file into a hash, then walk through the big blast file and print the desired lines.
#!/usr/bin/perl
use strict;
use warnings;
use 5.010;

# Read the readslist file into a hash.
open my $fh, '<', 'readslist' or die "Can't open 'readslist' for reading: $!";
my %readslist = map { chomp; $_ => 1 } <$fh>;
close $fh;

open my $fh_blast, '<', 'blastfile' or die "Can't open 'blastfile' for reading: $!";
# Loop on all the blastfile lines.
while (<$fh_blast>) {
    chomp;
    # Retrieve the key (4th column).
    my ($key) = (split /\s+/)[3];
    # Print the line if the key exists in the hash.
    say $_ if exists $readslist{$key};
}
close $fh_blast;
I suggest you build an index to turn your blasts file temporarily into an indexed-sequential file. Read through it and build a hash of addresses within the file where every record for each key starts.
After that it is just a matter of seeking to the correct places in the file to pick up the records required. This will certainly be faster than most simple solutions, as it entails reading the big file only once. This example code demonstrates the approach.
use strict;
use warnings;
use Fcntl qw/SEEK_SET/;

my %index;

open my $blast, '<', 'blast.txt' or die $!;
until (eof $blast) {
    my $place = tell $blast;
    my $line  = <$blast>;
    my $key   = (split ' ', $line, 5)[3];
    push @{ $index{$key} }, $place;
}

open my $reads, '<', 'reads.txt' or die $!;
while (<$reads>) {
    next unless my ($key) = /(\S+)/;
    next unless my $places = $index{$key};
    foreach my $place (@$places) {
        seek $blast, $place, SEEK_SET;
        my $line = <$blast>;
        print $line;
    }
}
Voila, two ways of doing this, one having nothing to do with perl:
awk 'BEGIN {while ( i = getline < "reads_list") ar[$i] = $1;} {if ($4 in ar) print $0;}' blast_table > new_blast_table
Mambo6.pl
#!/usr/bin/perl -w
# purpose: extract blastX data from a list of readnames.
# HINT: make sure your list file only has unique names; that way you save time.
use strict;
open (DATA, $ARGV[0]) or die ("Usage: ./mambo6.pl BlastXTable readslist");
open (LIST, $ARGV[1]) or die ("Usage: ./mambo6.pl BlastXTable readslist");
my %hash;
my $val;
my $key;
while (<DATA>)
{
    #chomp;
    if(/((.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?)\t(.*?))$/)
    {
        #print "$1\n";
        $key = $5;    # read
        $val = $1;    # whole row; notice the brackets around the whole match.
        $hash{$key} .= exists $hash{$key} ? "$val\n" : $val;
    }
    else {
        print "something wrong with format";
    }
}
close (DATA);
open(OUT, "> $ARGV[1]\_out.txt");
my $readName;
while ( <LIST> )
{
    if (/^(.*?)$/)
    {
        $readName = $1;
        chomp $readName;
        if (exists $hash{$readName})
        {
            print "$readName\n";
            my $output = $hash{$readName};
            print OUT "$output";
        }
        else
        {
            #print "it aint workin\n";
        }
    }
}
close (LIST);
close (OUT);
The oneliner is faster, and probably better than my script, I'm sure some people can find easier ways to do it... I just thought I'd put this up since it does what I want.