Second try at joining 3 tables using awk/sed

So, my last question wasn't quite specific enough, and although I'm a lot closer, I am still having problems joining my 3 text tables in a way that makes sense. In more detail, here they are:
T1_01 = Table 1
No Object CCmax Vhel cont noise Mag1
001 _P10644 0.816 123.04 2450.3 74.2 15.34
002 Parked -99.900 -99.90 -99.9 -99.9 -99.90
003 _P10569 0.791 146.30 2650.7 75.3 15.50
004 _P10769 0.641 141.49 482.7 30.2 16.42
005 _P10572 0.848 138.15 2161.4 46.3 15.85
T1_02 = Table 2
Fibrel Namel Typel Pivl RAl DECl Magl
001 F1_P10644 P 1 4.89977691 -0.5104696 15.3
002 Parked N 2 4.88965087 -0.4904939 0.0
003 F1_P10569 P 3 4.89642427 -0.5099916 15.5
004 F1_P10769 P 4 4.90643599 -0.5112466 16.4
005 F1_P10572 P 5 4.89644907 -0.5105655 15.8
T1_03 = Table 3
Name RA DEC Imag Fieldname fiber RV eRV
F1_P10644 4.899776910023531 -0.510469633262908 15.34 100606F1red 001 122.47 2.94
F1_P10569 4.896424277974554 -0.509991655454702 15.50 100606F1red 003 145.55 2.72
F1_P10769 4.906435995618358 -0.511246644149622 16.42 100606F1red 004 116.28 12.87
F1_P10572 4.896449076194342 -0.510565529409031 15.85 100606F1red 005 136.15 3.01
The table output I am hoping for is:
T1_0123 (joined on column 1 T1_01, column 1 T1_02, and column 6 T1_03)
No Object CCmax Vhel cont noise Mag1 Fibrel Namel Typel Pivl RAl DECl Magl Name RA DEC Imag Fieldname fiber RV eRV
where line1 =
001 _P10644 0.816 123.04 2450.3 74.2 15.34 001 F1_P10644 P 1 4.89977691 -0.5104696 15.3 F1_P10644 4.899776910023531 -0.510469633262908 15.34 100606F1red 001 122.47 2.94
and line2 =
002 Parked -99.9 -99.9 -99.9 -99.9 -99.9 002 Parked N 2 4.88965087 -0.4904939 0.0 -99.9 -99.9 -99.9 -99.9 -99.9 -99.9 -99.9 -99.9
So that -99.9 was written into the line that had no match for the 3rd file.
Now I CAN join the files if I skip the headers with:
join -1 1 -2 1 <(awk 'NR != 1' T1_01) <(awk 'NR != 1' T1_02) > T1_021
join -1 1 -2 6 T1_021 <(awk 'NR != 1' T1_03) > T1_0123
However this ONLY prints the results of the first table listed in the join, so I don't get all columns I need. Likewise if I want all 3 tables I 'could' do:
paste T1_01 T1_02 T1_03
Except, in this case my T1_03 will not match as it is missing several values. So what I am looking for is a way to say something like:
for all i in files T1_01,T1_02,T1_03
if T1_01 $1 == T1_02 $2 == T1_03 $6
# then print T1_01[i] T1_02[i] T1_03[i] \n,
else
# print T1_01[i] T1_02[i] -99.9 (for all blanks)
fi
done
Or conversely, use my join statement above and print all lines in BOTH tables joined, or perhaps some sort of paste | join?? Not sure about that last idea as I haven't found anything that really works yet.
Additionally, I can put the -99.9 values in later with something like:
sed -i -e 's/$/ -99.9 -99.9 -99.9 -99.9 -99.9 -99.9 -99.9 -99.9/' T1_0123
And I can manually add headers as well, so the main problem is getting the right paste result.
Hopefully I have phrased the question better this time, thanks everyone, for helping a new bash user!

This does what you want. The script assumes your data is in data1, data2 and data3. It writes all of this data into a temporary file, tagging each line with its origin (lines from data1 get "A" appended, etc.). It also prepends the join key (column 6) to the lines from data3. The data is then sorted to group corresponding lines.
Then awk is used to print corresponding records and fill in placeholder data for missing entries from data3.
You should be able to adjust to your needs if that's not exactly what you wanted - otherwise drop a comment :-)
#!/bin/bash
awk 'NR > 1 {print $0, "A"}' data1 > tmp
awk 'NR > 1 {print $0, "B"}' data2 >> tmp
awk '{print $6, $0, "C"}' data3 >> tmp
sort -nk1,1 tmp | \
awk '
BEGIN {
    # eight placeholder fields for keys with no line in data3
    PLACEHOLDER = "-99.9 -99.9 -99.9 -99.9 -99.9 -99.9 -99.9 -99.9"
    DATA["C"] = PLACEHOLDER
}
function printDATA() {
    print DATA["A"], DATA["B"], DATA["C"]
    DATA["C"] = PLACEHOLDER
}
$1 != last && NR > 1 { printDATA() }
{
    m = $NF; $NF = ""; last = $1
    if (m == "C") { $1 = "" }  # drop the duplicated join key
    DATA[m] = $0
}
END { printDATA() }
'
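An alternative worth noting: GNU join can itself keep unmatched lines and fill in the placeholders, so the tag-and-sort pass isn't strictly needed. A minimal sketch, assuming GNU coreutils (for -o auto) and trimmed-down, hypothetical sample files:

```shell
# Hypothetical miniature T1_021 (the first join's result) and a T1_03
# already sorted on its join key (here field 4, the fibre number).
cat > T1_021 <<'EOF'
001 _P10644 0.816 F1_P10644
002 Parked -99.900 Parked
EOF
cat > T1_03_sorted <<'EOF'
F1_P10644 4.8997 15.34 001 122.47
EOF
# -a 1 keeps T1_021 lines with no partner, -e supplies the filler for
# the missing fields, -o auto prints every field of both files.
join -a 1 -e '-99.9' -o auto -1 1 -2 4 T1_021 T1_03_sorted
```

With no match for key 002, the four non-key fields of T1_03_sorted come out as -99.9 on that line.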

Related

Merging two files with condition on two columns

I have two files of the type:
File1.txt
1 117458 rs184574713 rs184574713
1 119773 rs224578963 rs224500000
1 120000 rs224578874 rs224500045
1 120056 rs60094200 rs60094200
2 120056 rs60094536 rs60094536
File2.txt
10 120200 120400 A 0 189,183,107
1 0 119600 C 0 233,150,122
1 119600 119800 D 0 205,92,92
1 119800 120400 F 0 192,192,192
2 120400 122000 B 0 128,128,128
2 126800 133200 A 0 192,192,192
I want to add the information contained in the second file to the first file. The first column in both files needs to match, while the second column in File1.txt should fall in the interval that is indicated by columns 2 and 3 in File2.txt. So that the output should look like this:
1 117458 rs184574713 rs184574713 C 0 233,150,122
1 119773 rs224578963 rs224500000 D 0 205,92,92
1 120000 rs224578874 rs224500045 F 0 192,192,192
1 120056 rs60094200 rs60094200 F 0 192,192,192
2 120440 rs60094536 rs60094536 B 0 128,128,128
Please help me with awk/perl.. or any other script.
This is how you would do it in bash (with a little help from awk):
join -1 1 -2 1 \
<(sort -n -k1,1 test1) <(sort -n -k1,1 test2) | \
awk '$2 >= $5 && $2 <= $6 {print $1, $2, $3, $4, $7, $8, $9}'
Here is a brief explanation.
First, we use join to merge lines on the common key (the
first field).
But join expects both input files to be sorted already (hence
the two sorts).
Finally, we use awk to apply the range condition and to
project the fields we want.
Try this (note there is a typo in your expected output for the last entry: 120056 is not between 120400 and 122000):
$ awk '
NR==FNR {
    a[$1,$2,$3] = $4 FS $5 FS $6
    next
}
{
    for (x in a) {
        split(x, tmp, SUBSEP)
        if ($1==tmp[1] && $2>=tmp[2] && $2<=tmp[3])
            print $0 FS a[x]
    }
}' file2 file1
1 117458 rs184574713 rs184574713 C 0 233,150,122
1 119773 rs224578963 rs224500000 D 0 205,92,92
1 120000 rs224578874 rs224500045 F 0 192,192,192
1 120056 rs60094200 rs60094200 F 0 192,192,192
Reading the first input (file2), you create an array indexed by columns 1, 2 and 3, with columns 4, 5 and 6 as the value.
For each line of the second input (file1), you look it up in the array: for every key, you split the key and check that the first column matches and that the second column falls in the interval.
If the condition is true you print the entire line from file1 followed by the array value.
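Since the for (x in a) loop scans every stored interval for every line of file1, it slows down as file2 grows. A sketch of the same idea (same sample data as above) that keeps a per-chromosome counter so each lookup only walks that chromosome's intervals:

```shell
cat > File2.txt <<'EOF'
10 120200 120400 A 0 189,183,107
1 0 119600 C 0 233,150,122
1 119600 119800 D 0 205,92,92
1 119800 120400 F 0 192,192,192
2 120400 122000 B 0 128,128,128
2 126800 133200 A 0 192,192,192
EOF
cat > File1.txt <<'EOF'
1 117458 rs184574713 rs184574713
1 119773 rs224578963 rs224500000
1 120000 rs224578874 rs224500045
1 120056 rs60094200 rs60094200
EOF
# First pass stores each chromosome's intervals under a counter n[$1];
# second pass loops only over intervals of $1, not the whole array.
awk '
NR==FNR { n[$1]++; lo[$1,n[$1]]=$2; hi[$1,n[$1]]=$3; ann[$1,n[$1]]=$4 FS $5 FS $6; next }
{ for (i=1; i<=n[$1]; i++)
      if ($2 >= lo[$1,i] && $2 <= hi[$1,i]) { print $0, ann[$1,i]; break } }
' File2.txt File1.txt
```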

Awk to generate a file of 100 lines using 5 lines supplied based on some condition

I have a file with 5 lines in it. I'd like to copy and paste these lines until a condition is satisfied.. e.g.,
123 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
I'd like to repeat these lines 20 times, each time incrementing the number '123' (on the first line) by 1, i.e.:
123 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
124 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
.....
.....
172 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
Is this possible?
One way to do it (but without any elegance) is this:
1) Create a file with the desired final number of lines (in your case, 20x5 = 100). Let's name it "100lines.txt".
2) Do:
awk '{
    if (NR%5==1) { print (NR+4)/5+122, "89.98 34.56" }
    if (NR%5==2) { print "AbcDef" }
    if (NR%5==3) { print "0.0 10.567" }
    if (NR%5==4) { print "Hijkl" }
    if (NR%5==0) { print "XYZ 345" }
}' 100lines.txt
I think it could work.
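A sketch that avoids the helper file altogether: slurp the 5-line template, then print it in a loop while bumping the leading number. I use 50 repetitions here because the question's example runs 123 through 172, even though its text says 20; adjust the loop bound as needed. The file name template.txt is an assumption:

```shell
cat > template.txt <<'EOF'
123 89.98 34.56
AbcDef
0.0 10.567
Hijkl
XYZ 345
EOF
awk '
{ l[NR] = $0 }                         # store the whole template
END {
    for (i = 0; i < 50; i++) {
        first = l[1]
        sub(/^[0-9]+/, 123 + i, first) # bump the leading number
        print first
        for (j = 2; j <= NR; j++) print l[j]
    }
}' template.txt
```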

How to find lines in a file matching lines in another file?

I have a large file with 11 columns of either text or numbers:
ETNOFIKK 03001 E0146 a1 1001 0146 10303001 10 500 EKO24 2001_200
ETNOFIKK 03002 E0147 a1 1001 0147 10303002 10 500 EKO24 2001_200
ETNOFIKK 03003 E0148 a1 1001 0148 10303003 10 500 EKO24 2001_200
...
and another file of only one column of numbers:
0146
0148
...
I need to extract lines from the first file when the 6th column matches the entries of the second file. So, in the above example, if the second file contains only the two entries, then the first and the third lines are printed from the first file.
Thanks
Using awk
awk 'FNR==NR {a[$1];next} $6 in a' file2 file1
ETNOFIKK 03001 E0146 a1 1001 0146 10303001 10 500 EKO24 2001_200
ETNOFIKK 03003 E0148 a1 1001 0148 10303003 10 500 EKO24 2001_200
This stores file2 (the index) in an array,
then checks whether $6 exists in the array; if yes, the line is printed.
sed 's/^/^\\([^[:blank:]]\\{1,\\}[[:blank:]]\\{1,\\}\\)\\{5\\}/' Other.file > /tmp/pregrep.txt
grep -f /tmp/pregrep.txt Source.File
(The generated patterns use BRE escapes, so plain grep is needed rather than egrep.)
Using sed alone is possible (after a cat of both files and a pipe), but it takes many more instructions, so Jotne's awk seems to be the champ.
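For illustration, here is what the sed at the top of this answer generates from one key, and a check that the resulting pattern really skips five fields. The patterns use BRE escapes (\(, \{), so this sketch assumes plain grep rather than egrep:

```shell
# Build the anchored pattern for one key from Other.file
pat=$(echo 0146 | sed 's/^/^\\([^[:blank:]]\\{1,\\}[[:blank:]]\\{1,\\}\\)\\{5\\}/')
echo "$pat"
# The group matches one blank-separated field; \{5\} repeats it, so the
# key must begin the sixth field of a Source.File line.
grep "$pat" <<'EOF'
ETNOFIKK 03001 E0146 a1 1001 0146 10303001 10 500 EKO24 2001_200
ETNOFIKK 03002 E0147 a1 1001 0147 10303002 10 500 EKO24 2001_200
EOF
```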
Try this:
awk 'FNR==NR &&NF{a[$1];next} $6 in a' file2 file1

Insert space between pairs of characters - sed

Another sed question! I have nucleotide data in pairs
1 Affx-14150122 0 75891 00 CT TT CT TT CT
split by spaces and I need to put a space into every pair, eg
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
I've tried sed 's/[A-Z][A-Z]/ &/g' and sed 's/[A-Z][A-Z]/& /g',
and both again with [A-Z] replaced by .., and it never splits the pairs as I'd like (it puts spaces before or after, or splits every other pair, or similar!).
I assume this will work for you; however, it's not perfect!
echo "1 Affx-14150122 0 75891 00 CT TT CT TT CT" | \
sed 's/\(\s[A-Z]\)\([A-Z]\)/\1 \2/g'
gives
1 Affx-14150122 0 75891 00 C T T T C T T T C T
sed 's/\(\s[A-Z]\)\([A-Z]\)/\1 \2/g' matches whitespace (\s) upper case character ([A-Z]), puts that in a group (\(...\)), and then matches upper case character and stores that in second group. Then this match is substituted by first group (\1) space second group (\2).
NOTE:
This fails when you have sequences that are longer than 2 characters.
A solution using awk which modifies only pairs of characters, and which might be more robust depending on your input data:
echo "1 Affx-14150122 0 75891 00 CT TT CT TT CT" | \
awk '
{
for(i=1;i<=NF;i++) {
if($i ~ /^[A-Z][A-Z]$/){
$i=substr($i,1,1)" "substr($i,2,1)
}
}
}
1'
gives
1 Affx-14150122 0 75891 00 C T T T C T T T C T
This might work for you (GNU sed):
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' |
sed ':a;s/\(\s\S\)\(\S\(\s\|$\)\)/\1 \2/g;ta'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
This second method works but might provide false positives:
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' | sed 's/\<\(.\)\(.\)\>/\1 \2/g'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
This is actually easier in Python than in awk:
echo caca | python3 -c 'import sys
for line in sys.stdin: print(" ".join(line.strip()))'
c a c a

Create file by fetching corresponding data from other files?

I have a list of SNPs for example (let's call it file1):
SNP_ID chr position
rs9999847 4 182120631
rs999985 11 107192257
rs9999853 4 148436871
rs999986 14 95803856
rs9999883 4 870669
rs9999929 4 73470754
rs9999931 4 31676985
rs9999944 4 148376995
rs999995 10 78735498
rs9999963 4 84072737
rs9999966 4 5927355
rs9999979 4 135733891
I have another list of SNPs with the corresponding P-value (P) and BETA (as shown below) for different phenotypes; here I have shown only one (let's call it file2):
CHR SNP BP A1 TEST NMISS BETA SE L95 U95 STAT P
1 rs3094315 742429 G ADD 1123 0.1783 0.2441 -0.3 0.6566 0.7306 0.4652
1 rs12562034 758311 A ADD 1119 -0.2096 0.2128 -0.6267 0.2075 -0.9848 0.3249
1 rs4475691 836671 A ADD 1111 -0.006033 0.2314 -0.4595 0.4474 -0.02608 0.9792
1 rs9999847 878522 A ADD 1109 -0.2784 0.4048 -1.072 0.5149 -0.6879 0.4916
1 rs999985 890368 C ADD 1111 0.179 0.2166 -0.2455 0.6034 0.8265 0.4087
1 rs9999853 908247 C ADD 1110 -0.02015 0.2073 -0.4265 0.3862 -0.09718 0.9226
1 rs999986 918699 G ADD 1111 -1.248 0.7892 -2.795 0.2984 -1.582 0.114
Now I want to make two files named file3 and file4 such that:
file3 should contain:
SNPID Pvalue_for_phenotype1 Pvalue_for_phenotype2 Pvalue_for_phenotype3 and so on....
rs9999847 0.9263 0.00005 0.002 ..............
The first column (SNP IDs) in file3 will be fixed (all the SNPs in my chip will be listed here), and I want to write a program that will match the SNP ID between file3 and file2, fetch the P-value for that SNP from file2, and put it in file3.
file4 should contain:
SNPID BETAvale_for_phenotype1 BETAvale_for_phenotype2 BETAvale_for_phenotype3 .........
rs9999847 0.01812 -0.011 0.22
the 1st column (SNPIDs) in file4 will be fixed (all the SNPs in my chip will be listed here), and I want to write a program so that it will match SNP ID in file4 and file2 and will fetch the BETA for that corresponding SNP ID and put it in file4 from file2.
It's a simple exercise in transferring column data to rows with awk.
file2 to file3:
I assume you have a machine with a lot of RAM, because file2 probably has millions of lines.
you could save this code into column2row.awk file:
#!/usr/bin/awk -f
BEGIN {
    snp = 2
    val = 12
}
{
    if (vector[$snp])
        vector[$snp] = vector[$snp] "," $val
    else
        vector[$snp] = $val
}
END {
    for (snp in vector)
        print snp "," vector[snp]
}
where snp is column 2 and val is column 12 (the P-value).
Now you can run the script:
/usr/bin/awk -f column2row.awk file2 > file3
If you have little RAM, you can divide the load:
while read -r l; do s=$(echo "$l" | awk '{print $1}'); grep -w "$s" file2 > "$s.snp"; awk -f column2row.awk "$s.snp" >> file3; done < file1
This takes the first field of each line $l (the SNP name, $s), greps for $s in file2 to create a small file for each SNP name,
and then runs the awk script on it to build file3.
file2 to file4:
Change val in the awk script from column 12 to column 7 (the BETA column).
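A minimal sketch of that change as a one-shot command (it also skips file2's header, which the original script would otherwise treat as data; the sample lines are from the question):

```shell
cat > file2 <<'EOF'
CHR SNP BP A1 TEST NMISS BETA SE L95 U95 STAT P
1 rs3094315 742429 G ADD 1123 0.1783 0.2441 -0.3 0.6566 0.7306 0.4652
1 rs9999847 878522 A ADD 1109 -0.2784 0.4048 -1.072 0.5149 -0.6879 0.4916
EOF
# Same logic as column2row.awk, with val = 7 (BETA) and the header skipped.
awk 'NR > 1 {
         if (vector[$2]) vector[$2] = vector[$2] "," $7
         else            vector[$2] = $7
     }
     END { for (snp in vector) print snp "," vector[snp] }' file2 > file4
```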