bash merge files by matching columns - perl

I have two files:
File1
12 abc
34 cde
42 dfg
11 df
9 e
File2
23 abc
24 gjr
12 dfg
8 df
I want to merge the files column by column (where column 2 is the same), for output like this:
File1 File2
12 23 abc
42 12 dfg
11 8 df
34 NA cde
9 NA e
NA 24 gjr
How can I do this?
I tried it like this:
cat File* >> tmp; sort tmp | uniq -c | awk '{print $2}' > column2
for i in $(cat column2); do grep -w "$i" File*; done
But this is where I am stuck...
I don't know how, after grepping, I should combine the files column by column and write NA where a value is missing.
Hope someone could help me with this.

Since I was testing with bash 3.2 running as sh (which does not support process substitution when run as sh), I used two temporary files to get the data ready for use with join:
$ sort -k2b File2 > f2.sort
$ sort -k2b File1 > f1.sort
$ cat f1.sort
12 abc
34 cde
11 df
42 dfg
9 e
$ cat f2.sort
23 abc
8 df
12 dfg
24 gjr
$ join -1 2 -2 2 -o 1.1,2.1,0 -a 1 -a 2 -e NA f1.sort f2.sort
12 23 abc
34 NA cde
11 8 df
42 12 dfg
9 NA e
NA 24 gjr
$
With process substitution, you could write:
join -1 2 -2 2 -o 1.1,2.1,0 -a 1 -a 2 -e NA <(sort -k2b File1) <(sort -k2b File2)
If you want the data formatted differently, use awk to post-process the output:
$ join -1 2 -2 2 -o 1.1,2.1,0 -a 1 -a 2 -e NA f1.sort f2.sort |
> awk '{ printf "%-5s %-5s %s\n", $1, $2, $3 }'
12    23    abc
34    NA    cde
11    8     df
42    12    dfg
9     NA    e
NA    24    gjr
$
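For reference: -1 2 -2 2 joins on the second column of each file, -o 1.1,2.1,0 outputs File1's number, File2's number, then the join key, -a 1 -a 2 keeps unpairable lines from both files (a full outer join), and -e NA fills in the missing fields. If join isn't available, here is a rough awk equivalent (a sketch; it assumes the column 2 values are unique within each file and it holds both files in memory):
$ awk '
    NR == FNR { one[$2] = $1; next }   # first file: number keyed by column 2
    { two[$2] = $1 }                   # second file: likewise
    END {
        for (k in one) print one[k], ((k in two) ? two[k] : "NA"), k
        for (k in two) if (!(k in one)) print "NA", two[k], k
    }' File1 File2 | sort -k3
The trailing sort -k3 puts the rows in key order, since awk's for (k in ...) traversal order is unspecified; the result should match the six join rows above.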

Related

How to skip a line every two lines starting by skipping the first line?

Here's my code: ls -lt | sed -n 'p;n'
That code skips every other line when listing file names, but it doesn't start by skipping the first one. How can I make that happen?
Here's an example of the listing without my skip code:
And here's an example of when I use the skip code:
You have to invert your sed command: it should be n;p instead of p;n. The n command reads the next input line before p prints anything, so the odd lines are consumed and only the even ones are printed.
Your code:
for x in {1..20}; do echo $x ; done | sed -n 'p;n'
1
3
5
7
9
11
13
15
17
19
The version with sed inverted:
for x in {1..20}; do echo $x ; done | sed -n 'n;p'
Output:
2
4
6
8
10
12
14
16
18
20
You can use GNU sed's ~ (step) address operator, first~step:
$ seq 1 10 | sed -n '1~2p'
1
3
5
7
9
$ seq 1 10 | sed -n '2~2p'
2
4
6
8
10
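If your sed doesn't have the GNU ~ extension (BSD/macOS sed, for example), awk's line counter does the same job:
$ seq 1 10 | awk 'NR % 2'        # odd lines: 1 3 5 7 9
$ seq 1 10 | awk 'NR % 2 == 0'   # even lines: 2 4 6 8 10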

Find "N" minimum and "N" maximum values with respect to a column in the file and print the specific rows

I have a tab-delimited file such as
Jack 2 98 F
Jones 6 25 51.77
Mike 8 11 61.70
Gareth 1 85 F
Simon 4 76 4.79
Mark 11 12 38.83
Tony 7 82 F
Lewis 19 17 12.83
James 12 1 88.83
I want to find the N minimum and N maximum values (N may be more than 5) in the last column and print the rows that have those values. I want to ignore the rows with F. For example, if I want the two minimum and two maximum values in the above data, my output would be
Minimum case
Simon 4 76 4.79
Lewis 19 17 12.83
Maximum case
James 12 1 88.83
Mike 8 11 61.70
I can ignore the rows that do not have a numeric value in the fourth column using
awk -F "\t" '$4+0 != $4{next}1' inputfile.txt
I can also pipe this output and find one minimum value using
awk -F "\t" '$4+0 != $4{next}1' inputfile.txt |awk 'NR == 1 || $4 < min {line = $0; min = $4}END{print line}'
and similarly for the maximum value, but how can I extend this to more than one value, like the 2 values in the toy example above and 10 cases for my real data?
n can be a variable; in this case, I set n=3. Note: this may have a problem if there are lines with the same value in the last column (the array is keyed on that value). It needs GNU awk 4+ for asorti with a sort specification.
kent$ awk -v n=3 '$NF+0==$NF{a[$NF]=$0}
      END{ asorti(a,k,"@ind_num_asc")
           print "min:"
           for(i=1;i<=n;i++) print a[k[i]]
           print "max:"
           for(i=length(a)-n+1;i<=length(a);i++) print a[k[i]]}' f
min:
Simon 4 76 4.79
Lewis 19 17 12.83
Mark 11 12 38.83
max:
Jones 6 25 51.77
Mike 8 11 61.70
James 12 1 88.83
You can get the minimum and maximum at once with a little redirection:
minmaxlines=2
( ( grep -v 'F$' inputfile.txt | sort -n -k4 | tee /dev/fd/4 | head -n $minmaxlines >&3 ) 4>&1 | tail -n $minmaxlines ) 3>&1
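In case the file-descriptor plumbing is hard to follow, here is the same command spelled out with comments (identical behavior, only reformatted):
minmaxlines=2
(
  (
    grep -v 'F$' inputfile.txt |
      sort -n -k4 |               # numeric rows sorted by the fourth column
      tee /dev/fd/4 |             # duplicate the sorted stream onto fd 4
      head -n $minmaxlines >&3    # the first N rows (minimums) leave via fd 3
  ) 4>&1 |                        # the fd 4 copy becomes stdout of the subshell...
    tail -n $minmaxlines          # ...so tail keeps the last N rows (maximums)
) 3>&1                            # fd 3 (the minimums) rejoins stdout at the end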
Here's a pipeline approach to the problem.
$ grep -v 'F$' inputfile.txt | sort -nk 4 | head -2
Simon 4 76 4.79
Lewis 19 17 12.83
$ grep -v 'F$' inputfile.txt | sort -nk 4 | tail -2
Mike 8 11 61.70
James 12 1 88.83
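The pipeline generalizes to any n; sorting once into a temporary file lets you read both ends (a sketch; note that head and tail will overlap if n exceeds half the row count):
$ n=2
$ grep -v 'F$' inputfile.txt | sort -nk 4 > sorted.tmp
$ head -n "$n" sorted.tmp    # the n minimum rows
$ tail -n "$n" sorted.tmp    # the n maximum rows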

How to implement this in awk or shell?

Input File1:
5 5 NA
NA NA 1
2 NA 2
Input File2:
1 1 1
2 NA 2
3 NA NA
NA 4 4
5 5 5
NA NA 6
Output:
3 NA NA
NA 4 4
NA NA 6
The purpose is: in file1, add every field that is not NA to a set; then in file2, eliminate each line that contains any value from this set. Does anyone have ideas about this?
To collect any item that is not 'NA' and filter with it:
awk -f script.awk file1 file2
Contents of script.awk:
# first file: remember every field that is not "NA"
FNR==NR {
    for (i=1;i<=NF;i++) {
        if ($i != "NA") {
            a[$i]++
        }
    }
    next
}

# second file: skip any line containing a remembered value
{
    for (j=1;j<=NF;j++) {
        if ($j in a) {
            next
        }
    }
}

# print the lines that survive
1
Results:
3 NA NA
NA 4 4
NA NA 6
Alternatively, here's the one-liner:
awk 'FNR==NR { for (i=1;i<=NF;i++) if ($i != "NA") a[$i]++; next } { for (j=1;j<=NF;j++) if ($j in a) next }1' file1 file2
You could do this with grep:
$ egrep -o '[0-9]+' file1 | fgrep -wvf - file2
3 NA NA
NA 4 4
NA NA 6
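The same idea works with plain grep flags, since egrep and fgrep are deprecated spellings these days; -o extracts the numbers from file1, and -w keeps a pattern like 1 from matching inside 11:
$ grep -oE '[0-9]+' file1 | grep -vwFf - file2
3 NA NA
NA 4 4
NA NA 6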
awk one-liner:
awk 'NR==FNR{for(i=1;i<=NF;i++)if($i!="NA")a[$i];next}{for(i=1;i<=NF;i++)if($i in a)next}1' file1 file2
with your data:
kent$ awk 'NR==FNR{for(i=1;i<=NF;i++)if($i!="NA")a[$i];next}{for(i=1;i<=NF;i++)if($i in a)next}1' file1 file2
3 NA NA
NA 4 4
NA NA 6
If the column position of the values matters:
awk '
NR==FNR{
for(i=1; i<=NF; i++) if($i!="NA") A[i,$i]=1
next
}
{
for(i=1; i<=NF; i++) if($i!="NA" && A[i,$i]) next
print
}
' file1 file2

Read and parse multiple text files

Can anyone suggest a simple way of achieving this? I have several files which end with the extension .vcf. I will give an example with two files.
In the files below, we are interested in particular columns, described after the index file.
File 1:
38 107 C 3 T 6 C/T
38 241 C 4 T 5 C/T
38 247 T 4 C 5 T/C
38 259 T 3 C 6 T/C
38 275 G 3 A 5 G/A
38 304 C 4 T 5 C/T
38 323 T 3 A 5 T/A
File2:
38 107 C 8 T 8 C/T
38 222 - 6 A 7 -/A
38 241 C 7 T 10 C/T
38 247 T 7 C 10 T/C
38 259 T 7 C 10 T/C
38 275 G 6 A 11 G/A
38 304 C 5 T 12 C/T
38 323 T 4 A 12 T/A
38 343 G 13 A 5 G/A
Index file:
107
222
241
247
259
275
304
323
343
The index file is created from the unique positions in file 1 and file 2; I have it ready as the index file. Now I need to read all the files, parse the data according to the positions in it, and write the results in columns.
From the above files, we are interested in the 4th (Ref) and 6th (Alt) columns.
Another challenge is to name the headers accordingly, so the output should be something like this:
Position File1_Ref File1_Alt File2_Ref File2_Alt
107 3 6 8 8
222 6 7
241 4 5 7 10
247 4 5 7 10
259 3 6 7 10
275 3 5 6 11
304 4 5 5 12
323 3 5 4 12
343 13 5
You can do this using the join command:
# add file1
$ join -e' ' -1 1 -2 2 -a 1 -o 0,2.4,2.6 <(sort -n index) <(sort -n -k2 file1) > file1.merged
# add file2
$ join -e' ' -1 1 -2 2 -a 1 -o 0,1.2,1.3,2.4,2.6 file1.merged <(sort -n -k2 file2) > file2.merged
# create the header
$ echo "Position File1_Ref File1_Alt File2_Ref File2_Alt" > report
$ cat file2.merged >> report
Output:
$ cat report
Position File1_Ref File1_Alt File2_Ref File2_Alt
107 3 6 8 8
222 6 7
241 4 5 7 10
247 4 5 7 10
259 3 6 7 10
275 3 5 6 11
304 4 5 5 12
323 3 5 4 12
343 13 5
Update:
Here is a script which can be used to combine multiple files.
The following assumptions have been made:
The index file is sorted
The vcf files are sorted on their second column
There are no spaces (or any other special characters) in filenames
Save the following script to a file e.g. report.sh and run it without any arguments from the directory containing your files.
#!/bin/bash

INDEX_FILE=index    # the name of the file containing the index data
REPORT_FILE=report  # the file to write the report to
TMP_FILE=$(mktemp)  # a temporary file

header="Position"   # the report header
num_processed=0     # the number of files processed so far

# loop over all files beginning with "file".
# this pattern can be changed to something else e.g. *.vcf
for file in file*
do
    echo "Processing $file"
    if [[ $num_processed -eq 0 ]]
    then
        # it's the first file so use the INDEX file in the join
        join -e' ' -t, -1 1 -2 2 -a 1 -o 0,2.4,2.6 <(sort -n "$INDEX_FILE") <(sed 's/ \+/,/g' "$file") > "$TMP_FILE"
    else
        # work out the output fields
        for ((outputFields="0",j=2; j < $((2 + $num_processed * 2)); j++))
        do
            outputFields="$outputFields,1.$j"
        done
        outputFields="$outputFields,2.4,2.6"

        # join this file with the current report
        join -e' ' -t, -1 1 -2 2 -a 1 -o "$outputFields" "$REPORT_FILE" <(sed 's/ \+/,/g' "$file") > "$TMP_FILE"
    fi
    ((num_processed++))
    header="$header,File${num_processed}_Ref,File${num_processed}_Alt"
    mv "$TMP_FILE" "$REPORT_FILE"
done

# add the header to the report
echo "$header" | cat - "$REPORT_FILE" > "$TMP_FILE" && mv "$TMP_FILE" "$REPORT_FILE"

# the report is a csv file. Uncomment the line below to make it space-separated.
# tr ',' ' ' < "$REPORT_FILE" > "$TMP_FILE" && mv "$TMP_FILE" "$REPORT_FILE"
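A run over the sample files might look like this (assuming they are named file1 and file2 so the file* glob picks them up; the report is CSV unless you uncomment the final tr line):
$ bash report.sh
Processing file1
Processing file2
$ head -2 report
Position,File1_Ref,File1_Alt,File2_Ref,File2_Alt
107,3,6,8,8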
This Perl solution will handle 1 or more (e.g. 50) files.
#!/usr/bin/perl
use strict;
use warnings;
use File::Slurp qw/ slurp /;
use Text::Table;

my $path = '.';
my @file = qw/ o33.txt o44.txt /;
my @position = slurp('index.txt') =~ /\d+/g;

my %data;
for my $filename (@file) {
    open my $fh, '<', "$path/$filename" or die "Can't open $filename $!";
    while (<$fh>) {
        my ($pos, $ref, $alt) = (split)[1, 3, 5];
        $data{$pos}{$filename} = [$ref, $alt];
    }
    close $fh or die "Can't close $filename $!";
}

my @head;
for my $file (@file) {
    push @head, "${file}_Ref", "${file}_Alt";
}

my $tb = Text::Table->new( map {title => $_}, "Position", @head );
for my $pos (@position) {
    $tb->load( [
        $pos,
        map $data{$pos}{$_} ? @{ $data{$pos}{$_} } : ('', ''), @file
    ] );
}
print $tb;

Convert Unix `cal` output to latex table code: one-liner solution?

Trying to achieve the following has been a real struggle:
Convert Unix cal output to LaTeX table code, using a short and sweet one-liner (or few-liner).
E.g. cal -h 02 2012 | $magicline should yield
Mo &Tu &We &Th &Fr \\
& & 1 & 2 & 3 \\
6 & 7 & 8 & 9 &10 \\
13 &14 &15 &16 &17 \\
20 &21 &22 &23 &24 \\
27 &28 & & & \\
The only reasonable solution I could come up with so far was
cal -h | sed -r -e '1d' -e \
's/^(..)?(...)?(...)?(...)?(...)?(...)?(...)?$/\2\t\&\3\t\&\4\t\&\5\t\&\6\t\\\\/'
... and I really tried hard. The nice thing about it is that it's uncomplicated and easy to understand; the bad thing is that it's inflexible (it couldn't cope with a week of 8 days) and a little verbose. I'm looking for alternative solutions to learn from ;-)
EDIT: Found another one that seems acceptable:
cal -h | tail -n +2 |
perl -ne 'chomp;
$,="\t&";
$\="\t\\\\\n";
$line=$_;
print map {substr($line,$_*3,3)} (1..5)'
EDIT: Nice one:
cal -h | perl \
-F'(.{1,3})' -ane \
'BEGIN{$,="\t&";$\="\t\\\\\n"}
next if $.==1;
print @F[3,5,7,9,11]'
Tested on OS X:
cal 02 2012 |grep . |tail +2 |perl -F'/(.{3})/' -ane \
'chomp(@F=grep $_,@F); $m=$#F if !$m; printf "%s"."\t&%s"x$m."\t\\\\\n", @F;'
Where cal output has 3-character columns; {3} could be changed to match your cal output.
Using the GNU version of awk:
My output of cal using an English LANG.
Command:
LANG=en_US cal
Output:
   February 2012
Su Mo Tu We Th Fr Sa
          1  2  3  4
 5  6  7  8  9 10 11
12 13 14 15 16 17 18
19 20 21 22 23 24 25
26 27 28 29
The awk one-liner:
LANG=en_US cal | awk '
BEGIN {
    FIELDWIDTHS = "3 3 3 3 3 3 3";
    OFS = "&";
}
FNR == 1 || $0 ~ /^\s*$/ { next }
{
    for (i=2; i<=6; i++) {
        printf "%-3s%2s", $i, i < 6 ? OFS : "\\\\";
    }
    printf "\n";
}'
Result:
Mo &Tu &We &Th &Fr \\
& & 1 & 2 & 3 \\
6 & 7 & 8 & 9 &10 \\
13 &14 &15 &16 &17 \\
20 &21 &22 &23 &24 \\
27 &28 &29 & & \\
cal 02 2012|perl -lnE'$.==1||eof||do{$,="\t&";$\="\t\\\\\n";$l=$_;print map{substr($l,$_*3,3)}(1..5)}'
my new favorite:
cal 02 2012|perl -F'(.{1,3})' -anE'BEGIN{$,="\t&";$\="\t\\\\\n"}$.==1||eof||do{$i//=@F;print@F[map{$_*2-1}(1..$i/2)]}'
This might work for you:
cal | sed '1d;2{h;s/./ /g;x};/^\s*$/b;G;s/\n/ /;s/^...\(.\{15\}\).*/\1/;s/.../ &\t\&/g;s/\&$/\\\\/'
This works for my implementation of cal, which uses four-character columns and has an initial title line showing the month and year:
cal | perl -pe "next if $.==1;s/..../$&&/g;s/&$/\\\\/"
It looks as though yours may have eight-character columns and has no title line, in which case
cal | perl -pe "s/.{8}/$&&/g;s/&$/\\\\/"
should do the trick, but be prepared to tweak it.
cal -h 02 2012| cut -c4-17 | sed -r 's/(..)\s/\0\t\&/g' | sed 's/$/\t\\\\/' | head -n-1 | tail -n +2
This will produce:
Mo &Tu &We &Th &Fr \\
& & 1 & 2 & 3 \\
6 & 7 & 8 & 9 &10 \\
13 &14 &15 &16 &17 \\
20 &21 &22 &23 &24 \\
27 &28 &29 & & \\
You can easily replace \t with the number of spaces you wish.