How to extract specific columns from different files and output in one file? - perl

I have 12 files in a directory; each file has 4 columns. The first column is a gene name and the remaining 3 are count columns. I want to extract columns 1 and 4 from each of the 12 files and paste them into one output file. Since the first column is identical in every file, the output should contain it only once, followed by the 4th column of each file. I do not want to use R here. I am a big fan of awk, so I tried something like the command below, but it did not work.
My input files look like
Input file 1
ZYG11B 8267 16.5021 2743.51
ZYG11A 4396 0.28755 25.4208
ZXDA 5329 2.08348 223.281
ZWINT 1976 41.7037 1523.34
ZSCAN5B 1751 0.0375582 1.32254
ZSCAN30 4471 4.71253 407.923
ZSCAN23 3286 0.347228 22.9457
ZSCAN20 4343 3.89701 340.361
ZSCAN2 3872 3.13983 159.604
ZSCAN16-AS1 2311 1.1994 50.9903
Input file 2
ZYG11B 8267 18.2739 2994.35
ZYG11A 4396 0.227859 19.854
ZXDA 5329 2.44019 257.746
ZWINT 1976 8.80185 312.072
ZSCAN5B 1751 0 0
ZSCAN30 4471 9.13324 768.278
ZSCAN23 3286 1.03543 67.4392
ZSCAN20 4343 3.70209 318.683
ZSCAN2 3872 5.46773 307.038
ZSCAN16-AS1 2311 3.18739 133.556
Input file 3
ZYG11B 8267 20.7202 3593.85
ZYG11A 4396 0.323899 29.8735
ZXDA 5329 1.26338 141.254
ZWINT 1976 56.6215 2156.05
ZSCAN5B 1751 0.0364084 1.33754
ZSCAN30 4471 6.61786 596.161
ZSCAN23 3286 0.79125 54.5507
ZSCAN20 4343 3.9199 357.177
ZSCAN2 3872 5.89459 267.58
ZSCAN16-AS1 2311 2.43055 107.803
Desired output from above
ZYG11B 2743.51 2994.35 3593.85
ZYG11A 25.4208 19.854 29.8735
ZXDA 223.281 257.746 141.254
ZWINT 1523.34 312.072 2156.05
ZSCAN5B 1.32254 0 1.33754
ZSCAN30 407.923 768.278 596.161
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN20 340.361 318.683 357.177
ZSCAN2 159.604 307.038 267.58
ZSCAN16-AS1 50.9903 133.556 107.803
As you can see above, the output keeps the first column once (it is the same in every file), followed by the 4th column of each file. I have shown only 3 files here, but it should work for all the files in the directory at once, since they follow a common naming convention: file1_quant.genes.sf, file2_quant.genes.sf, file3_quant.genes.sf
Every file has the same first column but different counts in the remaining columns. My idea is to create one output file containing the 1st column and the 4th column from all the files.
awk '{print $1,$2,$4}' *_quant.genes.sf > genes.estreads
Any heads up?

If I understand you correctly, what you're looking for is one line per key, collated from multiple files.
The tool you need for this job is an associative array. I think awk has them, but I'm not 100% sure; I'd probably tackle it in perl though:
#!/usr/bin/perl
use strict;
use warnings;

# an associative array, or hash as perl calls it
my %data;

# iterate the input files (sort might be irrelevant here)
foreach my $file ( sort glob("*_quant.genes.sf") ) {
    # open the file for reading
    open( my $input, '<', $file ) or die $!;
    # iterate line by line
    while (<$input>) {
        # extract the data - splitting on any whitespace
        my ( $key, @values ) = split;
        # add 'column 4' to the hash (of arrays)
        push( @{ $data{$key} }, $values[2] );
    }
    close($input);
}

# start output
open( my $output, '>', 'genes.estreads' ) or die;
# sort, because hashes are explicitly unordered
foreach my $key ( sort keys %data ) {
    # print the key and all the elements collected
    print {$output} join( "\t", $key, @{ $data{$key} } ), "\n";
}
close($output);
With the data specified above, this produces:
ZSCAN16-AS1 50.9903 133.556 107.803
ZSCAN2 159.604 307.038 267.58
ZSCAN20 340.361 318.683 357.177
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN30 407.923 768.278 596.161
ZSCAN5B 1.32254 0 1.33754
ZWINT 1523.34 312.072 2156.05
ZXDA 223.281 257.746 141.254
ZYG11A 25.4208 19.854 29.8735
ZYG11B 2743.51 2994.35 3593.85

The following is how you do it in awk:
awk 'BEGIN{FS = " "};{print $1, $4}' *|awk 'BEGIN{FS = " "};{temp = x[$1];x[$1] = temp " " $2;};END {for(xx in x) print xx,x[xx]}'
As cryptic as it looks, I am just using associative arrays.
Here is the solution broken down:
Just print the key and the value, one per line.
print $1, $2
Store the data in an associative array, keep updating
temp = x[$1];x[$1] = temp " " $2;}
Display it:
for(xx in x) print xx,x[xx]
Sample run:
[cloudera@quickstart test]$ cat f1
A k1
B k2
[cloudera@quickstart test]$ cat f2
A k3
B k4
C k1
[cloudera@quickstart test]$ awk 'BEGIN{FS = " "};{print $1, $2}' *|awk 'BEGIN{FS = " "};{temp = x[$1];x[$1] = temp " " $2;};END {for(xx in x) print xx,x[xx]}'
A k1 k3
B k2 k4
C k1
As a side note, the approach should be reminiscent of the Map Reduce paradigm.

awk '{E[$1]=E[$1] "\t" $4}END{for(K in E)print K E[K]}' *_quant.genes.sf > genes.estreads
The order is the order of appearance while reading the files (so normally it follows the order of the first file read).
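If you want the output to follow the gene order of the first file rather than awk's arbitrary for-in order, here is a sketch of the same idea with an extra order-tracking array. The two tiny `*_quant.genes.sf` files below are made-up demo data, and it assumes every file lists the same genes:

```shell
# Demo in a scratch directory with two tiny input files
# (assumption: every *_quant.genes.sf lists the same genes).
cd "$(mktemp -d)"
printf 'ZYG11B 8267 16.5 2743.51\nZYG11A 4396 0.28 25.42\n' > f1_quant.genes.sf
printf 'ZYG11B 8267 18.2 2994.35\nZYG11A 4396 0.22 19.85\n' > f2_quant.genes.sf

# Like the one-liner above, but remember the order in which keys first
# appear, so output follows the first file read instead of for-in order.
awk '!($1 in E){ord[++n]=$1} {E[$1]=E[$1] "\t" $4}
     END{for(i=1;i<=n;i++) print ord[i] E[ord[i]]}' *_quant.genes.sf > genes.estreads
cat genes.estreads
```

This trades a second array for deterministic output order.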

If the first column is the same in all the files, you can use paste:
paste <(tabify f1 | cut -f1,4) \
<(tabify f2 | cut -f4) \
<(tabify f3 | cut -f4)
Where tabify changes consecutive spaces to tabs:
sed 's/ \+/\t/g' "$@"
and f1, f2, f3 are the input files' names.
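With 12 files, writing a process substitution per file gets tedious. Below is a hedged sketch that loops over every matching file instead, building the table one column at a time with paste. The file names and contents are made-up demo data, and it assumes all files share the same row order:

```shell
# Demo data in a scratch directory (assumes identical row order in all files).
cd "$(mktemp -d)"
printf 'ZYG11B 8267 16.5 2743.51\nZYG11A 4396 0.28 25.42\n' > f1_quant.genes.sf
printf 'ZYG11B 8267 18.2 2994.35\nZYG11A 4396 0.22 19.85\n' > f2_quant.genes.sf

# Start from the gene names, then paste on column 4 of each file in turn.
awk '{print $1}' f1_quant.genes.sf > out.tmp
for f in *_quant.genes.sf; do
  awk '{print $4}' "$f" | paste out.tmp - > out.new && mv out.new out.tmp
done
mv out.tmp genes.estreads
cat genes.estreads
```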

Here's another way to do it in Perl:
perl -lane '$data{$F[0]} .= " $F[3]"; END { print "$_ $data{$_}" for keys %data }' input_file_1 input_file_2 input_file_3

Here's another way of doing it with awk; it supports any number of input files.
awk 'FNR==1{f++}{a[f,FNR]=$1}{b[f,FNR]=$4}END{for(x=1;x<=FNR;x++){printf("%s ",a[1,x]);for(y=1;y<=f;y++)printf("%s ",b[y,x]);print ""}}' input1.txt input2.txt input3.txt
That line of code gives the following output:
ZYG11B 2743.51 2994.35 3593.85
ZYG11A 25.4208 19.854 29.8735
ZXDA 223.281 257.746 141.254
ZWINT 1523.34 312.072 2156.05
ZSCAN5B 1.32254 0 1.33754
ZSCAN30 407.923 768.278 596.161
ZSCAN23 22.9457 67.4392 54.5507
ZSCAN20 340.361 318.683 357.177
ZSCAN2 159.604 307.038 267.58
ZSCAN16-AS1 50.9903 133.556 107.803

Related

How can I extract specific columns in perl?

chr1 1 10 el1
chr1 13 20 el2
chr1 50 55 el3
I have this tab delimited file and I want to extract the second and third column using perl. How can I do that?
I tried reading the file using file handler and storing it in a string, then converting the string to an array but it didn't get me anywhere.
My attempt is:
while (defined($line=<FILE_HANDLE>)) {
    my @tf1;
    @tf1 = split(/\t/ , $line);
}
Simply autosplit on tab
#                                  ↓ index starts at 0
$ perl -F'\t' -lane'print join ",", @F[1,2]' inputfile
Output:
1,10
13,20
50,55
See perlrun.
use strict;
use warnings;
my $input = shift or die "must provide <input_file> as an argument\n";
open(my $in, "<", $input) or die "Cannot open $input for reading: $!";
while (<$in>) {
    my @tf1 = split(/\t/, $_);
    print "$tf1[1]|$tf1[2]\n"; # $tf1[1] is the second column and $tf1[2] is the third column
}
close($in);
What problem are you having? Your code already does all the hard parts.
while (defined($line=<FILE_HANDLE>)) {
    my @tf1;
    @tf1 = split(/\t/ , $line);
}
You have all three columns in your @tf1 array (by the way - your variable naming needs serious work!). All you need to do now is print the second and third elements of the array (but remember that Perl array elements are numbered from zero).
print "$tf1[1] / $tf1[2]\n";
It's possible to simplify your code quite a lot by taking advantage of Perl's default behaviours.
while (<FILE_HANDLE>) {        # Store record in $_
    my @tf1 = split(/\t/);     # Declare and initialise on one line;
                               # split() works on $_ by default
    print "$tf1[1] / $tf1[2]\n";
}
Even more pithily than @daxim as a one-liner:
perl -aE 'say "@F[1,2]"' file
See also: How to sort an array or table by column in perl?

How to sort column A uniquely based on descending order of column B in unix/perl/tcl?

I have a csv file like the one below.
Column A, Column B
cat,30
cat,40
dog,10
elephant,23
dog,3
elephant,37
How would I uniquely sort column A, based on the largest corresponding value in column B?
The result I would like to get is,
Column A, Column B
cat,40
elephant,37
dog,10
awk to the rescue!
$ sort -t, -k1,1 -k2,2nr filename | awk -F, '!a[$1]++'
Column A, Column B
cat,40
dog,10
elephant,37
If you want your specific output, it needs a little more coding because of the header line.
$ sort -t, -k1,1 -k2nr filename | awk -F, 'NR==1{print "999999\t"$0;next} !a[$1]++{print $2"\t"$0}' | sort -k1nr | cut -f2-
Column A, Column B
cat,40
elephant,37
dog,10
Another alternative with removing header upfront and adding it back at the end
$ h=$(head -1 filename); sed 1d filename | sort -t, -k1,1 -k2nr | awk -F, '!a[$1]++' | sort -t, -k2nr | sed '1i'"$h"''
Perlishly:
#!/usr/bin/env perl
use strict;
use warnings;

# print header row
print scalar <>;

my %seen;
# iterate the magic filehandle (file specified on command line or
# stdin - e.g. like grep/sed)
while (<>) {
    chomp;    # strip trailing linefeed
    # split this line on ','
    my ( $key, $value ) = split /,/;
    # save this value if the previous one is lower or non-existent
    if ( not defined $seen{$key}
        or $seen{$key} < $value )
    {
        $seen{$key} = $value;
    }
}

# sort, comparing values in %seen
foreach my $key ( sort { $seen{$b} <=> $seen{$a} } keys %seen ) {
    print "$key,$seen{$key}\n";
}
I've +1'd karakfa's answer. It's simple and elegant.
My answer is an extension of karakfa's header handling. If you like it, please feel free to +1 my answer, but "best answer" should go to karakfa. (Unless of course you prefer one of the other answers! :] )
If your input is as you've described in your question, then we can recognize the header by seeing that $2 is not numeric. Thus, the following does not take the header into consideration:
$ sort -t, -k1,1 -k2,2nr filename | awk -F, '!a[$1]++'
You might alternately strip the header with:
$ sort -t, -k1,1 -k2,2nr filename | awk -F, '$2~/^[0-9]+$/&&!a[$1]++'
This slows things down quite a bit, since a regex may take longer to evaluate than a simple array assignment and numeric test. I'm using a regex for the numeric test in order to permit a 0, which would otherwise evaluate to "false".
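To illustrate why the regex matters, here is a sketch with a made-up row whose count is 0: a bare numeric truthiness test drops it, while the regex keeps it.

```shell
# '$2' alone as a condition is false for 0, so "dog,0" would vanish;
# the regex test treats any digit string, including 0, as a valid count.
printf 'dog,0\ncat,5\n' | awk -F, '$2~/^[0-9]+$/ {print}'
```

Compare with `awk -F, '$2 {print}'`, which emits only the `cat,5` line.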
Next, if you want to keep the header, but print it first, you can process your output at the end of the stream:
$ sort -t, -k1,1 -k2,2nr filename | awk -F, '$2!~/^[0-9]+$/{print;next} !a[$1]++{b[$1]=$0} END{for(i in b){print b[i]}}'
Last option to achieve the same effect without storing the extra array in memory would be to process your input a second time. This is more costly in terms of IO, but less costly in terms of memory:
$ sort -t, -k1,1 -k2,2nr filename | awk -F, 'NR==FNR&&$2!~/^[0-9]+$/{print;nextfile} $2~/^[0-9]+$/&&!a[$1]++' filename -
Another perl
perl -MList::Util=max -F, -lane '
if ($.==1) {print; next}
$val{$F[0]} = max $val{$F[0]}, $F[1];
} {
print "$_,$val{$_}" for reverse sort {$val{$a} <=> $val{$b}} keys %val;
' file
One possible Tcl solution:
# read the contents of the file into a list of lines
set f [open data.csv]
set lines [split [string trim [chan read $f]] \n]
chan close $f
# detach the header
set lines [lassign $lines header]
# map the list of lines to a list of tuples
set tuples [lmap line $lines {split $line ,}]
# use an associative array to get unique tuples in a flat list
array set uniqueTuples [concat {*}[lsort -index 1 -integer $tuples]]
# reassemble the tuples, sorted by name
set tuples [lmap {a b} [lsort -stride 2 -index 0 [array get uniqueTuples]] {list $a $b}]
# map the tuples to csv lines and insert the header
set lines [linsert [lmap tuple $tuples {join $tuple ,}] 0 $header]
# convert the list of lines into a data string
set data [join $lines \n]
This solution assumes a simplified data set where there are no quoted elements. If there are quoted elements, the csv module should be used instead of the split command.
Another solution, inspired by the Perl solution:
puts [gets stdin]
set seen [dict create]
while {[gets stdin line] >= 0} {
lassign [split $line ,] key value
if {![dict exists $seen $key] || [dict get $seen $key] < $value} {
dict set seen $key $value
}
}
dict for {key val} [lsort -stride 2 -index 0 $seen] {
puts $key,$val
}
Documentation: chan, concat, dict, gets, if, join, lassign, linsert, lmap, lmap replacement, lsort, open, set, split, string, while

How to repeat a sequence of numbers to the end of a column?

I have a data file that needs a new column of identifiers from 1 to 5. The final purpose is to split the data into five separate files with no leftover file (split leaves a leftover file).
Data:
aa
bb
cc
dd
ff
nn
ww
tt
pp
with identifier column:
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
Not sure if this can be done with seq? Afterwards it will be split with:
awk '$2 == 1 {print $0}'
awk '$2 == 2 {print $0}'
awk '$2 == 3 {print $0}'
awk '$2 == 4 {print $0}'
awk '$2 == 5 {print $0}'
Perl to the rescue:
perl -pe 's/$/" " . $. % 5/e' < input > output
It uses 0 instead of 5, since $. % 5 is 0 on every fifth line.
$. is the line number.
% is the modulo operator.
the /e modifier tells the substitution to evaluate the replacement part as code
i.e. end of line ($) is replaced with a space concatenated (.) with the line number modulo 5.
$ awk '{print $0, ((NR-1)%5)+1}' file
aa 1
bb 2
cc 3
dd 4
ff 5
nn 1
ww 2
tt 3
pp 4
No need for that to create 5 separate files of course. All you need is:
awk '{print > ("file_" ((NR-1)%5)+1)}' file
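To see that one-liner in action, a small sketch using the question's data (the parentheses make awk compute ((NR-1)%5)+1 before concatenating it to "file_"):

```shell
# Demo in a scratch directory.
cd "$(mktemp -d)"
printf 'aa\nbb\ncc\ndd\nff\nnn\nww\ntt\npp\n' > file

# Each line goes straight to file_1 .. file_5, round-robin;
# no identifier column is needed.
awk '{print > ("file_" ((NR-1)%5)+1)}' file
cat file_1
```

file_1 receives the 1st and 6th lines (aa and nn), file_2 the 2nd and 7th, and so on.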
Looks like you're happy with a perl solution that outputs 1-4 then 0 instead of 1-5 so FYI here's the equivalent in awk:
$ awk '{print $0, NR%5}' file
aa 1
bb 2
cc 3
dd 4
ff 0
nn 1
ww 2
tt 3
pp 4
I am going to offer a Perl solution even though it wasn't tagged because Perl is well suited to solve this problem.
If I understand what you want to do, you have a single file that you want to split into 5 separate files based on the position of a line in the data file:
the first line in the data file goes to file 1
the second line in the data file goes to file 2
the third line in the data file goes to file 3
...
Since you already have the line's position in the file, you don't really need the identifier column (though you could pursue that solution if you wanted).
Instead you can open 5 filehandles and simply alternate which handle you write to:
use strict;
use warnings;

my $datafilename = shift @ARGV;

# open filehandles and store them in an array
my @fhs;
foreach my $i ( 0 .. 4 ) {
    open my $fh, '>', "${datafilename}_$i"
        or die "$!";
    $fhs[$i] = $fh;
}

# open the datafile
open my $datafile_fh, '<', $datafilename
    or die "$!";

my $row_number = 0;
while ( my $datarow = <$datafile_fh> ) {
    print { $fhs[ $row_number++ % @fhs ] } $datarow;
}

# close resources
foreach my $fh (@fhs) {
    close $fh;
}

How to calculate inverse log2 ratio of a UCSC wiggle file using perl?

I have 2 separate files, A and B, containing the same header lines but 2 and 1 data columns respectively. I want to take the inverse log2 (i.e. 2 raised to the value) of the 2nd column in file A and the 1st column in file B, while keeping the other description lines intact. The values in file A's columns $1 and $2 are separated by a tab.
file A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 0.781985
16 0.810993
20 0.769601
24 0.733831
file B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
0.721985
0.610993
0.760123
0.573831
I expect output like this for file A:
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.7194950944
16 1.754418585
20 1.7047982296
24 1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr2
and for file B (the values in this file are just copied from file A):
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.7194950944
1.754418585
1.7047982296
1.6630493726
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig rep1.bar.wig graphType=bar
variableStep chrom=chr2
This awk script does the calculation that you want:
awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' file
This matches lines that contain only digits, periods and any space characters, substituting the value of the last field $NF for 2 raised to the power of $NF. The format specifier %.12f can be modified to give you the required number of decimal places. The 1 at the end is shorthand for {print}.
Testing it out on your new files:
$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' A
track type=wiggle_0 name=rep1.bar.wig description=GSM1076_rep1.bar.wig graphType=bar
variableStep chrom=chr1
12 1.719495094445
16 1.754418584953
20 1.704798229573
24 1.663049372620
$ awk '/^[0-9.[:space:]]+$/{$NF=sprintf("%.12f", 2^$NF)}1' B
track type=wiggle_0 name=rep1.bar.wig description=GSM1078_rep1.bar.wig graphType=bar
variableStep chrom=chr1
1.649449947457
1.527310087388
1.693635012985
1.488470882686
So here's the Perl version:
use strict;
use warnings;
open IN, $ARGV[0] or die $!;
while (<IN>) {
    chomp;
    if (/^(\d+)[\t ]+(-?\d\.\d*)/) {    # format "nn m.mmmmm"
        my $power = 2 ** $2;
        print("$1\t" . $power . "\n");
    } elsif (/^(-?\d\.\d*)/) {          # format "m.mmmmm"
        my $power = 2 ** $1;
        print($power . "\n");
    } else {                            # echo all other stuff
        print;
        print("\n");
    }
}
close IN;
If you run <file>.pl <datafile> (replace with the appropriate names) it will convert one file so that the value lines contain 2**<2nd value>. It simply echoes the lines that do not match the number pattern.
This is the modified little script of @ThomasKilian
Thanks to him for providing the framework.
use strict;
use warnings;
open IN, $ARGV[0] or die $!;
while (<IN>) {
    chomp;
    if (/^(\d+)[\t ]+(-?\d\.\d*)/) {    # format "nn m.mmmmm"
        my $power = 2 ** $2;
        $power = sprintf( "%.12f", $power );
        print("$1\t" . $power . "\n");
    } elsif (/^(-?\d\.\d*)/) {          # format "m.mmmmm"
        my $power = 2 ** $1;
        $power = sprintf( "%.12f", $power );
        print($power . "\n");
    } else {                            # echo all other stuff
        print;
        print("\n");
    }
}
close IN;

Need to replace a specific column value of a file based on values in other columns of the file

Below is the content of my file, followed by my requirement.
Based on the values in column 1 and column 5, I want to replace the value in column 7 with "1".
For example:
If column 1 = "Change" and column 5 = "defer", then replace the value in column 7 with "1".
If column 1 = "Change" and column 5 = "defererence", then replace the value in column 7 with "1".
Otherwise do not do anything with the row; keep it as it is.
Input file:
Change|sinmg|1234|ewfew|def|fdfd|JAMES|rewr|ROBERT|3|fe
Change|sinmg|2345|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|
noChange|sinmg|2323|ewfew|def|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|fe
Change|sinmg|3456|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|
Change|sinmg|2345|ewfew|defererence|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|
The above is just a sample to make the explanation easier. However, I want to pass the values for column 1 and column 5 from a file and match them against the input file. Only if they match should the value in column 7 be replaced with "1"; otherwise do not do anything with the row, keep it as it is.
I tried a couple of options but was not able to achieve the required results.
perl -F'\|' -i -lape 'if ($F[0] eq "change" && $F[4] eq "defer") { s/$F[6]/1/g;}' file_name
The above command replaces every occurrence of the matching value in the line, irrespective of fields. I want to replace only the value in column 7, based on columns 1 and 5, passing different values for those columns in a loop.
Adding more information:
As mentioned above, the example is the simplest form of my problem. I have a file named "listfile" which holds the list of values for column 1 and column 5 to match. If columns 1 and 5 of a row in my source file match the values from "listfile", the solution should replace the value in column 7 with "1". Otherwise do not do anything with the row from the source file; keep it as it is.
I tried the following, but was unable to achieve the required result.
#!/usr/bin/ksh
for line in $(cat dir/listfile)
do
VAR1=$(echo $line | nawk -F"|" '{print $1}')
VAR2=$(echo $line | nawk -F"|" '{print $2}')
nawk -F"|" 'BEGIN {OFS="|"} {if($1!="'"$VAR1"'" && $5!="'"$VAR2"'") {$8="1"; print $0;} else {print $0;}}' dir/sourcefile >> dir/sourcefile_REVISED
done
The number of records in the revised source file must be the same as in the original; the only difference is that for every pair of column-1/column-5 values from listfile, the value in column 7 should be replaced by "1".
Thanks,
You can use awk to do this.
awk -F'|' 'BEGIN{OFS="|"}{if($1=="Change"&&$5=="defer"){$7=1}{print}}' file
I realize that you also need the 5th column to match "defererence". The following should work (note the parentheses, which keep the || from overriding the $1 test):
awk -F'|' 'BEGIN{OFS="|"}{if($1=="Change"&&($5=="defer"||$5=="defererence")){$7=1}{print}}' file
perl -F'\|' -i -lape '
    BEGIN{ $" = "|" }
    $F[6]=1, $_="@F" if $F[0] eq "Change" && $F[4] =~ /^defer(erence)?$/;
' file_name
The substitute operator s/// doesn't understand the concept of columns of data; it just operates on the given string. In this case, you are replacing whatever is in column 7 everywhere it occurs in the input line, because of the g modifier.
The awk solution in the answer by @imauser is a good solution.
With awk you would do:
$ awk '$1=="Change"&&$5~/^defer(erence)?$/{$7=1}1' FS='|' OFS='|' file
Change|sinmg|1234|ewfew|def|fdfd|JAMES|rewr|ROBERT|3|fe
Change|sinmg|2345|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|1|rewr|BEAEY|3|
noChange|sinmg|2323|ewfew|def|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|fe
Change|sinmg|3456|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|1|rewr|BEAEY|3|
Change|sinmg|2345|ewfew|defererence|VENKTRAAAMMAMAMAMMAMA|1|rewr|BEAEY|3|
You can still use a perl+regexp solution by "skipping" the first six fields before doing the replacement:
perl -F'\|' -i -lapE 'if ($F[0] eq "Change" && $F[4] =~ m{^defer(erence)?$}) { s{^(?:[^|]*\|){6}\K([^|]*)}{1} }' file_name
An advantage over the awk solutions: you can still use the -i switch here.
You can achieve this using split and join in a perl script as follows:
#!/usr/bin/perl
use warnings;
use strict;

my $infile   = "dir/sourcefile";
my $listfile = "dir/listfile";
my $outfile  = "dir/sourcefile_REVISED";

my @list;
open LFOPEN, $listfile or die $!;
while (<LFOPEN>) {
    chomp;
    my @col = split /\|/, $_;
    push @list, \@col;
}
close LFOPEN;

open IFOPEN, $infile or die $!;
open OFOPEN, '>', $outfile or die $!;
while (<IFOPEN>) {
    chomp;
    my @col = split /\|/, $_;
    foreach my $lref (@list) {
        $col[6] = '1' if ( $col[0] eq $lref->[0] and $col[4] eq $lref->[1] );
    }
    print OFOPEN join( '|', @col ) . "\n";
}
close IFOPEN;
close OFOPEN;
Input (dir/sourcefile):
Change|sinmg|1234|ewfew|def|fdfd|JAMES|rewr|ROBERT|3|fe
Change|sinmg|2345|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|
noChange|sinmg|2323|ewfew|def|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|fe
Change|sinmg|3456|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|
Change|sinmg|2345|ewfew|defererence|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|
List (dir/listfile):
Change|defer
Change|defererence
Output (dir/sourcefile_REVISED):
Change|sinmg|1234|ewfew|def|fdfd|JAMES|rewr|ROBERT|3|fe
Change|sinmg|2345|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|1|rewr|BEAEY|3
noChange|sinmg|2323|ewfew|def|VENKTRAAAMMAMAMAMMAMA|3|rewr|BEAEY|3|fe
Change|sinmg|3456|ewfew|defer|VENKTRAAAMMAMAMAMMAMA|1|rewr|BEAEY|3
Change|sinmg|2345|ewfew|defererence|VENKTRAAAMMAMAMAMMAMA|1|rewr|BEAEY|3