Read and parse multiple text files - perl

Can anyone suggest a simple way of achieving this. I have several files which ends with extension .vcf . I will give example with two files
In the below files, we are interested in
File 1:
38 107 C 3 T 6 C/T
38 241 C 4 T 5 C/T
38 247 T 4 C 5 T/C
38 259 T 3 C 6 T/C
38 275 G 3 A 5 G/A
38 304 C 4 T 5 C/T
38 323 T 3 A 5 T/A
File2:
38 107 C 8 T 8 C/T
38 222 - 6 A 7 -/A
38 241 C 7 T 10 C/T
38 247 T 7 C 10 T/C
38 259 T 7 C 10 T/C
38 275 G 6 A 11 G/A
38 304 C 5 T 12 C/T
38 323 T 4 A 12 T/A
38 343 G 13 A 5 G/A
Index file :
107
222
241
247
259
275
304
323
343
The index file is created based on unique positions from file 1 and file 2. I have that ready as index file. Now i need to read all files and parse data according to the positions here and write in columns.
From above files, we are interested in 4th (Ref) and 6th (Alt) columns.
Another challenge is to name the headers accordingly. So the output should be something like this.
Position File1_Ref File1_Alt File2_Ref File2_Alt
107 3 6 8 8
222 6 7
241 4 5 7 10
247 4 5 7 10
259 3 6 7 10
275 3 5 6 11
304 4 5 5 12
323 3 5 4 12
343 13 5

You can do this using the join command:
# add file1
$ join -e' ' -1 1 -2 2 -a 1 -o 0,2.4,2.6 <(sort -n index) <(sort -n -k2 file1) > file1.merged
# add file2
$ join -e' ' -1 1 -2 2 -a 1 -o 0,1.2,1.3,2.4,2.6 file1.merged <(sort -n -k2 file2) > file2.merged
# create the header
$ echo "Position File1_Ref File1_Alt File2_Ref File2_Alt" > report
$ cat file2.merged >> report
Output:
$ cat report
Position File1_Ref File1_Alt File2_Ref File2_Alt
107 3 6 8 8
222 6 7
241 4 5 7 10
247 4 5 7 10
259 3 6 7 10
275 3 5 6 11
304 4 5 5 12
323 3 5 4 12
323 4 12 4 12
343 13 5 13 5
Update:
Here is a script which can be used to combine multiple files.
The following assumptions have been made:
The index file is sorted
The vcf files are sorted on their second column
There are no spaces (or any other special characters) in filenames
Save the following script to a file e.g. report.sh and run it without any arguments from the directory containing your files.
#!/bin/bash
INDEX_FILE=index # the name of the file containing the index data
REPORT_FILE=report # the file to write the report to
TMP_FILE=$(mktemp) # a temporary file
header="Position" # the report header
num_processed=0 # the number of files processed so far
# loop over all files beginning with "file".
# this pattern can be changed to something else e.g. *.vcf
for file in file*
do
echo "Processing $file"
if [[ $num_processed -eq 0 ]]
then
# it's the first file so use the INDEX file in the join
join -e' ' -t, -1 1 -2 2 -a 1 -o 0,2.4,2.6 <(sort -n "$INDEX_FILE") <(sed 's/ \+/,/g' "$file") > "$TMP_FILE"
else
# work out the output fields
for ((outputFields="0",j=2; j < $((2 + $num_processed * 2)); j++))
do
outputFields="$outputFields,1.$j"
done
outputFields="$outputFields,2.4,2.6"
# join this file with the current report
join -e' ' -t, -1 1 -2 2 -a 1 -o "$outputFields" "$REPORT_FILE" <(sed 's/ \+/,/g' "$file") > "$TMP_FILE"
fi
((num_processed++))
header="$header,File${num_processed}_Ref,File${num_processed}_Alt"
mv "$TMP_FILE" "$REPORT_FILE"
done
# add the header to the report
echo "$header" | cat - "$REPORT_FILE" > "$TMP_FILE" && mv "$TMP_FILE" "$REPORT_FILE"
# the report is a csv file. Uncomment the line below to make it space-separated.
# tr ',' ' ' < "$REPORT_FILE" > "$TMP_FILE" && mv "$TMP_FILE" "$REPORT_FILE"

This Perl solution will handle 1 or more, (50), files.
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp qw/ slurp /;
use Text::Table;
my $path = '.';
my #file = qw/ o33.txt o44.txt /;
my #position = slurp('index.txt') =~ /\d+/g;
my %data;
for my $filename (#file) {
open my $fh, "$path/$filename" or die "Can't open $filename $!";
while (<$fh>) {
my ($pos, $ref, $alt) = (split)[1, 3, 5];
$data{$pos}{$filename} = [$ref, $alt];
}
close $fh or die "Can't close $filename $!";
}
my #head;
for my $file (#file) {
push #head, "${file}_Ref", "${file}_Alt";
}
my $tb = Text::Table->new( map {title => $_}, "Position", #head);
for my $pos (#position) {
$tb->load( [
$pos,
map $data{$pos}{$_} ? #{ $data{$pos}{$_} } : ('', ''), #file
]
);
}
print $tb;

Related

Modifying perl script to not print duplicates and extract sequences of a certain length

I want to first apologize for the biological nature of this post. I thought I should post some background first. I have a set of gene files that contain anywhere from one to five DNA sequences from different species. I used a bash shell script to perform blastn with each gene file as a query and a file of all transcriptome sequences (all_transcriptome_seq.fasta) from the five species as the subject. I now want to process these output files (and there are many) so that I can get all subject sequences that hit into one file per gene, with duplicate sequences removed (except to keep one), and ensure I'm getting the length of the sequences that actually hit the query.
Here is what the blastn output looks like for one gene file (columns: qseqid qlen sseqid slen qframe qstart qend sframe sstart send evalue bitscore pident nident length)
Acur_01000750.1_OFAS014956-RA-EXON04 248 Apil_comp17195_c0_seq1 1184 1 1 248 1 824 1072 2e-73 259 85.60 214 250
Acur_01000750.1_OFAS014956-RA-EXON04 248 Atri_comp5613_c0_seq1 1067 1 2 248 1 344 96 8e-97 337 91.16 227 249
Acur_01000750.1_OFAS014956-RA-EXON04 248 Acur_01000750.1 992 1 1 248 1 655 902 1e-133 459 100.00 248 248
Acur_01000750.1_OFAS014956-RA-EXON04 248 Btri_comp17734_c0_seq1 1001 1 1 248 1 656 905 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Atri_comp5613_c0_seq1 1067 1 2 250 1 344 96 1e-60 217 82.33 205 249
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Acur_01000750.1 992 1 1 250 1 655 902 5e-69 244 84.40 211 250
Btri_comp17734_c0_seq1_OFAS014956-RA-EXON04 250 Btri_comp17734_c0_seq1 1001 1 1 250 1 656 905 1e-134 462 100.00 250 250
I've been working on a perl script that would, in short, take the sseqid column to pull out the corresponding sequences from the all_transcriptome_seq.fasta file, place these into a new file, and trim the transcripts to the sstart and send positions. Here is the script, so far:
#!/usr/bin/env perl
use warnings;
use strict;
use Data::Dumper;
############################################################################
# blastn_post-processing.pl v. 1.0 by Michael F., XXXXXX
############################################################################
my($progname) = $0;
############################################################################
# Initialize variables
############################################################################
my($jter);
my($com);
my($t1);
if ( #ARGV != 2 ) {
print "Usage:\n \$ $progname <infile> <transcriptomes>\n";
print " infile = tab-delimited blastn text file\n";
print " transcriptomes = fasta file of all transcriptomes\n";
print "exiting...\n";
exit;
}
my($infile)=$ARGV[0];
my($transcriptomes)=$ARGV[1];
############################################################################
# Read the input file
############################################################################
print "Reading the input file... ";
open (my $INF, $infile) or die "Unable to open file";
my #data = <$INF>;
print #data;
close($INF) or die "Could not close file $infile.\n";
my($nlines) = $#data + 1;
my($inlines) = $nlines - 1;
print "$nlines blastn hits read\n\n";
############################################################################
# Extract hits and place sequences into new file
############################################################################
my #temparray;
my #templine;
my($seqfname);
open ($INF, $infile) or die "Could not open file $infile for input.\n";
#temparray = <$INF>;
close($INF) or die "Could not close file $infile.\n";
$t1 = $#temparray + 1;
print "$infile\t$t1\n";
$seqfname = "$infile" . ".fasta";
if ( -e $seqfname ) {
print " --> $seqfname exists. overwriting\n";
unlink($seqfname);
}
# iterate through the individual hits
for ($jter=0; $jter<$t1; $jter++) {
(#templine) = split(/\s+/, $temparray[$jter]);
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
# print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
} # end for ($jter=0; $jter<$t1...
# Arguments for "extract_from_genome2"
# // argv[1] = name of genome file
# // argv[2] = gi number for contig
# // argv[3] = start of subsequence
# // argv[4] = end of subsequence
# // argv[5] = name of output sequence
Using this script, here is the output I'm getting:
>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
As you can see, it's pretty close to what I'm wanting. Here are the two issues I have and cannot seem to figure out how to resolve with my script. The first is that a sequence may occur more than once in the sseqid column, and with the script in its current form, it will print out duplicates of these sequences. I only need one. How can I modify my script to not duplicate sequences (i.e., how do I only retain one but remove the other duplicates)? Expected output:
>Apil_comp17195_c0_seq1
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
>Atri_comp5613_c0_seq1
GAGAATTCTAGCATCAGCAGTGAGGCCTGAAATACTCATGCCTATGTGACTATCTAGAGGTATTATTTTTTTTTGATGAGCTGACAGTTCAGAAGAAGCTCTTTTGAGAGCTACAAGAACTGCATACTGTTTATTTTTTACTCCAACTGTTGCTGCTCCAAGCTTTACAGCCTCCATTGCATATTCCACTTGGTGTAAACGCCCCTGAGGACTCCATACCGTAACATCAGAATCATACTGATTACGGA
>Acur_01000750.1
GAATTCTAGCGTCAGCAGTGAGTCCTGAAATACTCATCCCTATGTGGCTATCTAGAGGTATTATTTTTTCTGATGGGCCGACAGTTCAGAGGATGCTCTTTTAAGAGCCACAAGAACTGCATACTCTTTATTTTTACTCCAACAGTAGCAGCTCCAAGCTTCACAGCCTCCATTGCATATTCCACCTGGTGTAAACGTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
>Btri_comp17734_c0_seq1
GAATCCTTGCATCTGCAGTAAGTCCAGAAATGCTCATTCCAATATGGCTATCTAATGGTATTATTTTTTTCTGGTGAGCAGACAATTCAGATGATGCTCTTTTAAGAGCTACCAGTACTGCAAAATCATTGTTCTTCACTCCAACAGTTGCAGCACCTAATTTGACTGCCTCCATTGCATACTCCACTTGGTGCAATCTTCCCTGAGGGCTCCATACCGTAACATCAGAATCATACTGGTTACGGAACA
The second is the script is not quite extracting the right base pairs. It's super close, off by one or two, but its not exact.
For example, take the first subject hit Apil_comp17195_c0_seq1. The sstart and send values are 824 and 1072, respectively. When I go to the all_transcriptome_seq.fasta, I get
AAGATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAAC
at that base pair range, not
GATTCTTGCATCTGCAGTAAGACCAGAAATGCTCATTCCTATATGGCTATCTAATGGTATTATTTTTTTCTGATGTGCTGATAATTCAGACGAAGCTCTTTTAAGAGCCACAAGAACTGCATACTGCTTGTTTTTTACTCCAACAGTAGCAGCTCCCAGTTTTACAGCTTCCATTGCATATTCGACTTGGTGCAGGCGTCCCTGGGGACTCCAGACGGTAACGTCAGAATCATACTGGTTACGGAACA
as outputted by my script, which is what I'm expecting. You will also notice that the sequence outputted by my script is slightly shorter than it should be. Does anyone know how I can fix these issues in my script?
Thanks, and sorry for the lengthy post!
Edit 1: a solution was offered that work for some of the infiles. However, some were causing the script to output fewer sequences than expected. Here is one such infile with 9 hits, from which I was expecting only 4 sequences.
Note: this issue has been largely resolved based on the solution provided below the answer section
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Apil_comp16418_c0_seq1 2079 1 1 1587 1 416 2002 0.0 2931 100.00 1587 1587
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Atri_comp13712_c0_seq1 1938 1 1 1587 1 1651 75 0.0 1221 80.73 1286 1593
Apil_comp16418_c0_seq1_OFAS000119-RA-EXON01 1587 Ctom_01003023.1 2162 1 1 1406 1 1403 1 0.0 1430 85.07 1197 1407
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Apil_comp16418_c0_seq1 2079 1 1 1437 1 1866 430 0.0 1170 81.43 1175 1443
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Atri_comp13712_c0_seq1 1938 1 1 1441 1 201 1641 0.0 2662 100.00 1441 1441
Atri_comp13712_c0_seq1_OFAS000119-RA-EXON01 1441 Acur_01000228.1 2415 1 1 1440 1 2231 797 0.0 1906 90.62 1305 1440
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Apil_comp16418_c0_seq1 2079 1 3 1284 1 1714 430 0.0 1351 85.69 1102 1286
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Acur_01000228.1 2415 1 1 1287 1 2084 797 0.0 1219 83.81 1082 1291
Ctom_01003023.1_OFAS000119-RA-EXON01 1289 Ctom_01003023.1 2162 1 1 1289 1 106 1394 0.0 2381 100.00 1289 1289
Edit 2: There is still an occasional output lacking fewer sequences than expected, although not as many after incorporating modifications to my script from Edit 1 suggestion (i.e., accounting for reverse direction). I cannot figure out why the script would be outputting fewer sequences in these other cases. Below the infile in question. The output is lacking Btri_comp15171_c0_seq1:
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Apil_comp19456_c0_seq1 3549 1 1 2464 1 761 3224 0.0 4551 100.00 2464 2464
Apil_comp19456_c0_seq1_OFAS000248-RA-EXON07 2464 Btri_comp15171_c0_seq1 3766 1 1 2456 1 3046 591 0.0 1877 80.53 1985 2465
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Apil_comp19456_c0_seq1 3549 1 1 2457 1 3214 758 0.0 1879 80.54 1986 2466
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Atri_comp28646_c0_seq1 1403 1 1256 2454 1 1401 203 0.0 990 81.60 980 1201
Btri_comp15171_c0_seq1_OFAS000248-RA-EXON07 2457 Btri_comp15171_c0_seq1 3766 1 1 2457 1 593 3049 0.0 4538 100.00 2457 2457
You can use hash to remove duplicates
The bellow code remove duplicates depending on their subject length (keep larger subject length rows).
Just update your # iterate through the individual hits part with
# iterate through the individual hits
my %filterhash;
my $subject_length;
for ($jter=0; $jter<$t1; $jter++) {
(#templine) = split(/\s+/, $temparray[$jter]);
$subject_length = $templine[9] -$templine[8];
if(exists $filterhash{$templine[2]} ){
if($filterhash{$templine[2]} < $subject_length){
$filterhash{$templine[2]}= $subject_length;
}
}
else{
$filterhash{$templine[2]}= $subject_length;
}
}
my %printhash;
for ($jter=0; $jter<$t1; $jter++) {
(#templine) = split(/\s+/, $temparray[$jter]);
$subject_length = $templine[9] -$templine[8];
if(not exists $printhash{$templine[2]})
{
$printhash{$templine[2]}=1;
if(exists $filterhash{$templine[2]} and $filterhash{$templine[2]} == $subject_length ){
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
# print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
}
}
else{
if(exists $filterhash{$templine[2]} and $filterhash{$templine[2]} == $subject_length ){
$com = "./extract_from_genome2 $transcriptomes $templine[2] $templine[8] $templine[9] $templine[2]";
#print "$com\n";
system("$com");
system("cat temp.3 >> $seqfname");
}
}
} # end for ($jter=0; $jter<$t1...
Hope this will help you.
Edit part update
for negative stand you need to replace
$subject_length = $templine[9] -$templine[8];
with
if($templine[8] > $templine[9]){
$subject_length = $templine[8] -$templine[9];
}else{
$subject_length = $templine[9] -$templine[8];
}
You also need to update your extract_from_genome2 code for negative strand sequences.

Modifying Script to include the Count of a each time a name appears from a table

I have a script below that takes my FILE1 and parses out FILE2 only if the first column of FILE1 matches column number 10 of FILE2. So it will print out the rows I need. This part works great. The part I am having a tad bit of difficulty is inserting a sort of count for the output. The goal of the script is take column 10 at the end and produce an output. In my list there are 12 names and I want to get the count of each name. For the example below, I have used four names.
FILE1:
name1 15
name2 15
name2 30
name5 15
name4 10
name2 5
name2 5
FILE2:
23 15 5.4 1.3 5 55 128 21799 + 32 name2 1 77 0 1
23 20 5.4 1.3 5 55 128 7998 + 18 name4 1 77 0 1
23 20 5.4 1.3 6 55 128 9984 + 13 name4 1 77 1 1
23 20 5.4 1.3 7 55 128 7998 + 14 name5 1 77 2 1
23 20 5.4 1.3 6 55 128 994 + 14 name1 1 77 3
23 20 5.4 1.3 9 55 128 984 + 5 name7 1 77 4 1
23 20 5.4 1.3 5 55 128 99 + 5 name8 1 77 5 1
Expected Output
$VAR1 = {
'name1' => 1,
'name2' => 4,
'name4' => 1,
'name5' => 1,
};
5 55 128 21799 32 name2 77 0 1
5 55 128 7998 18 name4 77 0 1
6 55 128 9984 13 name4 77 1 1
7 55 128 7998 14 name5 77 2 1
6 55 128 994 14 name1 77 3 1
name1 1
name2 1
name4 2
name5 1
You can test the script it works. The part I am having difficulty with is inserting the count of each name based on the output. The print \%x is a way of checking if my original list was truly used as I am working with a much larger set of data. If someone could point me the right direction on how to modify my script without changing it drastically that would be great. I feel like this script fulfills the majority of my needs even if it is not the most efficient way of doing it.
use strict;
use Data::Dumper;
my %x;
open(FILE1, $ARGV[0]) or die "Cannot open the file: $!";
while (my $line = <FILE1>) {
my #array = split(" ", $line);
$x{$array[0]}++;
}
close FILE1;
print Dumper( \%x );
my %count;
open(FILE2, $ARGV[1]) or die "Cannot open the file: $!";
while (my $line = <FILE2>) {
my #name = split(" ", $line);
my $y = $name[9];
if ( $x{ $y } ) {
print join(" ", #name[4,5,6,7,9,11,12,13]), "\n";
$count{#name[9]}++;
}
}
print Dumper (\%count);
close FILE2;
exit;
Script now counts. Just need to debug.
the "minimal" change would be to set the elements of %x to 0 in the FILE1 loop, then check for exists $x{$y} in the FILE2 loop and do ++$x{$y} inside the condition body. Now at the end %x has the counts of all the occurrences.
The usual way (as mentioned in the comments of the question) would be to declare an additional %count and perform the same ++$count{$y} inside the if block as in the above method.
The first has the advantage and disadvantage (depending on your needs) of reporting the count even when the name has zero found occurrences.

Extraction of rows which have a value > 50

How to select those lines which have a value < 10 value from a large matrix of 21 columns and 150 rows.eg.
miRNameIDs degradome AGO LKM......till 21
osa-miR159a 0 42 42
osa-miR396e 0 7 9
vun-miR156a 121 77 4
ppt-miR156a 12 7 4
gma-miR6300 118 2 0
bna-miR156a 0 114 48
gma-miR156k 0 46 1
osa-miR1882e 0 7 0
.
.
.
Desired output is:-
miRNameIDs degradome AGO LKM......till 21
vun-miR156a 121 77 4
gma-miR6300 118 2 0
bna-miR156a 0 114 48
.
.
.
till 150 rows
Using a perl one-liner
perl -ane 'print if $. == 1 || grep {$_ > 50} #F[1..$#F]' file.txt
Explanation:
Switches:
-a: Splits the line on space and loads them in an array #F
-n: Creates a while(<>){...} loop for each “line” in your input file.
-e: Tells perl to execute the code on command line.
Code:
$. == 1: Checks if the current line is line number 1.
grep {$_ > 50} #F[1..$#F]: Looks at each entries from the array to see if it is greater than 50.
||: Logical OR operator. If any of our above stated condition is true, it prints the line.

concat columns in perl

Each iteration in my perl code generates a vector of 5.
Output of first iteration is
out1
1
2
3
4
5
The second iterations generates same length of vector.
out2
10
20
30
40
50
and then it runs for nth time
out n
100
200
300
400
500
I want to merge these columns and have the final output in a tabular format or matrix format if you like:
out1 out2 ... outn
1 10 100
2 20 200
3 30 300
4 40 400
5 50 500
I tried splitting and then using the push but it prints "(101" and only do it once and not for all 20. I also have no idea where the "(101" comes from.
Any suggestions?
First, put all those output lists to a list. Second, iterate on that list: output every first element of each element-list in the first iteration, output every second element of each element-list in the second iteration, and so on.
For example
#!/usr/bin/perl
use strict;
use warnings;
use Data::Dumper;
my #lists;
for my $i (1..10) {
my #list;
push #list, $_ * $i for (1..5);
push #lists, \#list;
}
$Data::Dumper::Indent = 0;
print Dumper(\#lists), "\n\n";
while (#{$lists[0]}) {
for my $list (#lists) {
print shift #$list, "\t";
}
print "\n";
}
Output:
$ perl t.pl
$VAR1 = [
[1,2,3,4,5],
[2,4,6,8,10],
[3,6,9,12,15],
[4,8,12,16,20],
[5,10,15,20,25],
[6,12,18,24,30],
[7,14,21,28,35],
[8,16,24,32,40],
[9,18,27,36,45],
[10,20,30,40,50]
];
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 21 24 27 30
4 8 12 16 20 24 28 32 36 40
5 10 15 20 25 30 35 40 45 50
Note: The output of Data::Dumper has been edited to make it more compact.
Save your vector information to an array of array as you do your processing. Then you can output the rows using a simple join:
use strict;
use warnings;
my #rows;
for my $i (1..10) {
my #vector = map {$i * $_} (1..5);
push #{$rows[$_]}, $vector[$_] for (0..$#vector);
}
for my $row (#rows) {
print join(" ", map {sprintf "%-3s", $_} #$row), "\n";
}
Outputs:
1 2 3 4 5 6 7 8 9 10
2 4 6 8 10 12 14 16 18 20
3 6 9 12 15 18 21 24 27 30
4 8 12 16 20 24 28 32 36 40
5 10 15 20 25 30 35 40 45 50
Note: It'd be a lot easier to advise if you provided code and actual data.

bash merge files by matching columns

I do have two files:
File1
12 abc
34 cde
42 dfg
11 df
9 e
File2
23 abc
24 gjr
12 dfg
8 df
I want to merge files column by column (if column 2 is the same) for the output like this:
File1 File2
12 23 abc
42 12 dfg
11 8 df
34 NA cde
9 NA e
NA 24 gjr
How can I do this?
I tried it like this:
cat File* >> tmp; sort tmp | uniq -c | awk '{print $2}' > column2; for i in
$(cat column2); do grep -w "$i" File*
But this is where I am stuck...
Don't know how after greping I should combine files column by column & write NA where value is missing.
Hope someone could help me with this.
Since I was testing with bash 3.2 running as sh (which does not have process substitution as sh), I used two temporary files to get the data ready for use with join:
$ sort -k2b File2 > f2.sort
$ sort -k2b File1 > f1.sort
$ cat f1.sort
12 abc
34 cde
11 df
42 dfg
9 e
$ cat f2.sort
23 abc
8 df
12 dfg
24 gjr
$ join -1 2 -2 2 -o 1.1,2.1,0 -a 1 -a 2 -e NA f1.sort f2.sort
12 23 abc
34 NA cde
11 8 df
42 12 dfg
9 NA e
NA 24 gjr
$
With process substitution, you could write:
join -1 2 -2 2 -o 1.1,2.1,0 -a 1 -a 2 -e NA <(sort -k2b File1) <(sort -k2b File2)
If you want the data formatted differently, use awk to post-process the output:
$ join -1 2 -2 2 -o 1.1,2.1,0 -a 1 -a 2 -e NA f1.sort f2.sort |
> awk '{ printf "%-5s %-5s %s\n", $1, $2, $3 }'
12 23 abc
34 NA cde
11 8 df
42 12 dfg
9 NA e
NA 24 gjr
$