Merging two files with condition on two columns - perl

I have two files of the type:
File1.txt
1 117458 rs184574713 rs184574713
1 119773 rs224578963 rs224500000
1 120000 rs224578874 rs224500045
1 120056 rs60094200 rs60094200
2 120056 rs60094536 rs60094536
File2.txt
10 120200 120400 A 0 189,183,107
1 0 119600 C 0 233,150,122
1 119600 119800 D 0 205,92,92
1 119800 120400 F 0 192,192,192
2 120400 122000 B 0 128,128,128
2 126800 133200 A 0 192,192,192
I want to add the information contained in the second file to the first file. The first column in both files needs to match, while the second column in File1.txt should fall in the interval indicated by columns 2 and 3 of File2.txt, so that the output looks like this:
1 117458 rs184574713 rs184574713 C 0 233,150,122
1 119773 rs224578963 rs224500000 D 0 205,92,92
1 120000 rs224578874 rs224500045 F 0 192,192,192
1 120056 rs60094200 rs60094200 F 0 192,192,192
2 120440 rs60094536 rs60094536 B 0 128,128,128
Please help me with awk/perl, or any other script.

This is how you would do it in bash (with a little help from awk):
join -1 1 -2 1 \
  <(sort -k1,1 test1) <(sort -k1,1 test2) | \
awk '$2 >= $5 && $2 <= $6 {print $1, $2, $3, $4, $7, $8, $9}'
Here is a brief explanation. First, we use join to join lines based on the common key (the first field). But join expects both input files to already be sorted lexicographically on that field (hence the sort calls). Lastly, we employ awk to apply the required interval condition and to project the fields we want.

Try this (note that there is a typo in your expected output for the last entry: 120056 does not fall between 120400 and 122000, so that line is not printed):
$ awk '
NR==FNR {
    a[$1,$2,$3]=$4 FS $5 FS $6;
    next
}
{
    for(x in a) {
        split(x,tmp,SUBSEP);
        if($1==tmp[1] && $2>=tmp[2] && $2<=tmp[3])
            print $0 FS a[x]
    }
}' file2 file1
1 117458 rs184574713 rs184574713 C 0 233,150,122
1 119773 rs224578963 rs224500000 D 0 205,92,92
1 120000 rs224578874 rs224500045 F 0 192,192,192
1 120056 rs60094200 rs60094200 F 0 192,192,192
You read through file2 first, creating an array indexed by columns 1, 2 and 3 and holding the values of columns 4, 5 and 6.
For each line of file1, you then loop over the array: for every key, you split the key and check that the first column matches and that the second column falls in the range.
If the condition is true, you print the entire line from file1 followed by the array value.
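Since the question also asks for Perl, here is a minimal sketch of the same lookup idea (the file names and the space-separated output are taken from the sample data; lines with no matching interval, such as the chromosome 2 entry, are simply not printed):
#!/usr/bin/perl
use strict;
use warnings;

# Load the intervals from File2.txt: a list of [start, end, annotation] per chromosome.
my %intervals;
open my $f2, '<', 'File2.txt' or die $!;
while (<$f2>) {
    my ($chr, $start, $end, @rest) = split;
    push @{ $intervals{$chr} }, [ $start, $end, "@rest" ];
}
close $f2;

# For each File1.txt line, print it followed by the annotation of the matching interval.
open my $f1, '<', 'File1.txt' or die $!;
while (<$f1>) {
    chomp;
    my ($chr, $pos) = split;
    for my $iv (@{ $intervals{$chr} || [] }) {
        if ($pos >= $iv->[0] && $pos <= $iv->[1]) {
            print "$_ $iv->[2]\n";
            last;
        }
    }
}
close $f1;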

Related

How to delete lines in a file with sed which match a certain pattern and are longer or shorter than certain length

I am able to delete lines shorter than a certain length (sed '/^.\{,20\}$/d' -i FILE) or longer than a certain length (sed '/^.\{25\}..*/d' -i FILE) separately, but how do I combine a pattern and a length test in sed?
Lines containing A should be between 20 and 25 characters
Lines containing B should be between 10 and 15 characters
Lines containing C should be between 3 and 8 characters
All other lines should be deleted from the file
1234567890 A 1234567890
12345 A 12345
1 A 1
1234567890 B 1234567890
12345 B 12345
1 B 1
1234567890 C 1234567890
12345 C 12345
1 C 1
So that the output should look like this:
1234567890 A 1234567890
12345 B 12345
1 C 1
This is how you can do it with sed:
$ sed -ne '/A/ s/^\(.\{20,25\}\)$/\1/p; /B/ s/^\(.\{10,15\}\)$/\1/p; /C/ s/^\(.\{3,8\}\)$/\1/p;' file
1234567890 A 1234567890
12345 B 12345
1 C 1
How does it work:
-ne - suppress automatic printing of the pattern space
/A/ - only operate on lines matching pattern A
^\(.\{20,25\}\)$ - match a whole line of 20-25 characters, captured as group 1
\1/p - replace with group 1 and print the pattern space when the substitution succeeds
Use awk and you can simply write the conditions as a boolean expression; you're not stuck trying to make a condition out of a regexp:
$ awk '(/A/ && /^.{20,25}$/) || (/B/ && /^.{10,15}$/) || (/C/ && /^.{3,8}$/)' file
1234567890 A 1234567890
12345 B 12345
1 C 1
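For completeness, the same boolean-expression approach can be written as a Perl one-liner (a sketch, keeping the inclusive bounds from the question):
$ perl -ne 'print if (/A/ && /^.{20,25}$/) || (/B/ && /^.{10,15}$/) || (/C/ && /^.{3,8}$/)' file
1234567890 A 1234567890
12345 B 12345
1 C 1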
Here's an awk solution
awk '/.*A.*/ && length($0) > 19 && length($0) < 26 \
|| /.*B.*/ && length($0) > 9 && length($0) < 16 \
|| /.*C.*/ && length($0) > 2 && length($0) < 9' test1.dat
edit
And here's a more efficient version, where we only compute length($0) once:
awk '{len=length($0)}
/.*A.*/ && len > 19 && len < 26 \
|| /.*B.*/ && len > 9 && len < 16 \
|| /.*C.*/ && len > 2 && len < 9' test1.dat
output
1234567890 A 1234567890
12345 B 12345
1 C 1
I have incremented/decremented your boundary numbers by one to eliminate the need to test with <= and >= (which are slightly more expensive tests; on a very large file that might cost you 30 secs (just a guess!)).
(Don't let any whitespace characters creep in after the \ at the end of those continued lines.)
(Also, you can remove the \ characters and fold this up onto one line if you need that.)
This can be enhanced to accept variable values, and I include a short example here, finishing it out to your needs can be seen as an opportunity for learning ;-)
awk -v lim1=10 -v lim2=26 '/.*A.*/ && length($0) > lim1 && length($0) < lim2 ...
IHTH

Output from calculation is messed up in perl one-liner

I'm trying to do some calculations on the columns of a tab delimited file using this perl one-liner:
perl -ape 'if (/^\d/) { s/$F[2]/$F[2]\/$F[4]/e && s/$F[3]/$F[3]\/$F[4]/e}' infile
The idea is to divide columns A and B by column C.
infile:
X Y A B C
5001 3 1.03333 0.652549 4215
6001 4 1.2 0.723137 4870
7001 2 1 0.807843 5153
8001 2 1 0.807843 5355
9001 2 1 0.807843 5389
10001 2 1 0.807843 4955
11001 7 1.7671 1.05573 4966
12001 17 8.18802 4.72554 5124
But the output is this:
X Y A B C
5001 3 0.000245155397390273 0.000154815895610913 4215
6001 4 0.000246406570841889 0.000148488090349076 4870
7000.000194061711624297 2 1 0.000156771395303707 5153
8000.000186741363211951 2 1 0.000150857703081232 5355
9000.000185563184264242 2 1 0.000149905919465578 5389
0.0002018163471241170001 2 1 0.000163035923309788 4955
11001 7 0.000355839710028192 0.000212591623036649 4966
12001 17 0.00159797423887588 0.000922236533957845 5124
What is going on in the 3rd to 6th lines? How can I fix this?
Thanks.
EDIT:
I removed the /e option from the substitute command and it seems that the calculation is being performed on the wrong column.
perl -ape 'if (/^\d/) { s/$F[2]/$F[2]\/$F[4]/ && s/$F[3]/$F[3]\/$F[4]/}' infile
X Y A B C
5001 3 1.03333/4215 0.652549/4215 4215
6001 4 1.2/4870 0.723137/4870 4870
7001/5153 2 1 0.807843/5153 5153
8001/5355 2 1 0.807843/5355 5355
9001/5389 2 1 0.807843/5389 5389
1/49550001 2 1 0.807843/4955 4955
11001 7 1.7671/4966 1.05573/4966 4966
12001 17 8.18802/5124 4.72554/5124 5124
13001 30 13.8763/5138 8.05385/5138 5138
After substitution and evaluation, you have something like s/1/0.000194061711624297/. So the s operator looks for a 1 and finds it as part of the first column. Whoops. If we add some \b word-boundary markers, we can force the match part of the s operators to match a complete column, never just part of a column:
perl -ape 'if (/^\d/) { s/\b$F[2]\b/$F[2]\/$F[4]/e && s/\b$F[3]\b/$F[3]\/$F[4]/e}' infile
But that's still going to run into issues if it's possible for column X to equal column A or B. Better to just do the calculations and then replace the entire line by assigning to $_:
perl -ape 'if (/^\d/) { $F[2] /= $F[4]; $F[3] /= $F[4]; $_ = join(" ", @F); }'
Use sprintf instead of join if you want a particular format to the output.
Your basic problem is that you are substituting the values in columns 3 and 4 wherever they first appear in the whole line. For row 3, for example, you are doing s/1/1\/5153/e, which affects the first occurrence of the digit 1 in the line, not necessarily the 1 that happens to be in column 3.
Try this:
perl -lane 'if ($F[4] =~ /[1-9]/) { $F[2] /= $F[4]; $F[3] /= $F[4] } print join "\t", @F' infile
If you want to limit the precision, do something like $F[2] = sprintf "%f", $F[2]/$F[4]; ...
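Putting that together, a sketch of the formatted variant (the %.6f precision is an arbitrary choice):
perl -lane 'if ($F[4] =~ /[1-9]/) { $F[2] = sprintf "%.6f", $F[2]/$F[4]; $F[3] = sprintf "%.6f", $F[3]/$F[4] } print join "\t", @F' infile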

How to sum values in a column grouped by values in the other

I have a large file consisting data in 2 columns
100 5
100 10
100 10
101 2
101 4
102 10
102 2
I want to sum the values in the 2nd column grouped by matching values in column 1. For this example, the output I'm expecting is
100 25
101 6
102 12
I'd prefer to do this with a bash script. Can someone explain how I can do this?
Using awk:
awk '{a[$1]+=$2}END{for(i in a){print i, a[i]}}' inputfile
For your input, it'd produce:
100 25
101 6
102 12
In a perl one-liner:
perl -lane '$s{$F[0]} += $F[1]; END { print qq{$_ $s{$_}} for keys %s }' file.txt
You can use an associative array. The first column is the index and the second becomes what you add to it.
#!/bin/bash
declare -A columns=()
while read -r -a line ; do
    columns[${line[0]}]=$(( ${columns[${line[0]}]} + ${line[1]} ))
done < "${1}"
for idx in "${!columns[@]}" ; do
    echo "${idx} ${columns[${idx}]}"
done
Using awk and maintain the order:
awk '!($1 in a){a[$1]=$2; b[++i]=$1;next} {a[$1]+=$2} END{for (k=1; k<=i; k++) print b[k], a[b[k]]}' file
100 25
101 6
102 12
Python is my choice:
d = {}
with open("inputfile") as f:
    for line in f:
        key, value = line.split()
        d[key] = d.get(key, 0) + int(value)
for key in d:
    print("%s %s" % (key, d[key]))
Why would you want a bash script?

Insert space between pairs of characters - sed

Another sed question! I have nucleotide data in pairs
1 Affx-14150122 0 75891 00 CT TT CT TT CT
split by spaces, and I need to put a space into every pair, e.g.
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
I've tried sed 's/[A-Z][A-Z]/ &/g' and sed 's/[A-Z][A-Z]/& /g',
and both again with [A-Z][A-Z] replaced by .., and it never splits the pairs as I'd like (it puts spaces before or after, or splits every other pair, or similar!).
I assume this will work for you; however, it's not perfect!
echo "1 Affx-14150122 0 75891 00 CT TT CT TT CT" | \
sed 's/\(\s[A-Z]\)\([A-Z]\)/\1 \2/g'
gives
1 Affx-14150122 0 75891 00 C T T T C T T T C T
sed 's/\(\s[A-Z]\)\([A-Z]\)/\1 \2/g' matches a whitespace character (\s) followed by an upper-case character ([A-Z]) and puts that in a group (\(...\)); it then matches a second upper-case character and stores that in a second group. The match is then substituted by the first group (\1), a space, and the second group (\2).
NOTE:
This fails when you have sequences that are longer than 2 characters.
A solution using awk which modifies only pairs of characters, and might be more robust depending on your input data:
echo "1 Affx-14150122 0 75891 00 CT TT CT TT CT" | \
awk '
{
for(i=1;i<=NF;i++) {
if($i ~ /^[A-Z][A-Z]$/){
$i=substr($i,1,1)" "substr($i,2,1)
}
}
}
1'
gives
1 Affx-14150122 0 75891 00 C T T T C T T T C T
This might work for you (GNU sed):
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' |
sed ':a;s/\(\s\S\)\(\S\(\s\|$\)\)/\1 \2/g;ta'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
This second method works but might provide false positives:
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' | sed 's/\<\(.\)\(.\)\>/\1 \2/g'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
This is actually easier in python than in awk:
echo caca | python -c 'import sys
for line in sys.stdin: print(" ".join(line.strip()))'
c a c a
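And a Perl sketch that splits only the two-character fields, leaving 75891 and Affx-14150122 untouched (this assumes every field you want split is exactly two characters long):
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' | \
perl -lane 'print join " ", map { length == 2 ? join(" ", split //) : $_ } @F'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T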

Create file by fetching corresponding data from other files?

I have a list of SNPs for example (let's call it file1):
SNP_ID chr position
rs9999847 4 182120631
rs999985 11 107192257
rs9999853 4 148436871
rs999986 14 95803856
rs9999883 4 870669
rs9999929 4 73470754
rs9999931 4 31676985
rs9999944 4 148376995
rs999995 10 78735498
rs9999963 4 84072737
rs9999966 4 5927355
rs9999979 4 135733891
I have another list of SNPs with the corresponding P-value (P) and BETA for different phenotypes, as shown below; here I have shown only one phenotype (let's call it file2):
CHR SNP BP A1 TEST NMISS BETA SE L95 U95 STAT P
1 rs3094315 742429 G ADD 1123 0.1783 0.2441 -0.3 0.6566 0.7306 0.4652
1 rs12562034 758311 A ADD 1119 -0.2096 0.2128 -0.6267 0.2075 -0.9848 0.3249
1 rs4475691 836671 A ADD 1111 -0.006033 0.2314 -0.4595 0.4474 -0.02608 0.9792
1 rs9999847 878522 A ADD 1109 -0.2784 0.4048 -1.072 0.5149 -0.6879 0.4916
1 rs999985 890368 C ADD 1111 0.179 0.2166 -0.2455 0.6034 0.8265 0.4087
1 rs9999853 908247 C ADD 1110 -0.02015 0.2073 -0.4265 0.3862 -0.09718 0.9226
1 rs999986 918699 G ADD 1111 -1.248 0.7892 -2.795 0.2984 -1.582 0.114
Now I want to make two files named file3 and file4 such that:
file3 should contain:
SNPID Pvalue_for_phenotype1 Pvalue_for_phenotype2 Pvalue_for_phenotype3 and so on....
rs9999847 0.9263 0.00005 0.002 ..............
The first column (SNP IDs) in file3 will be fixed (all the SNPs on my chip will be listed here), and I want to write a program that matches the SNP ID between file3 and file2, fetches the P-value for that SNP ID from file2, and puts it into file3.
file4 should contain:
SNPID BETAvalue_for_phenotype1 BETAvalue_for_phenotype2 BETAvalue_for_phenotype3 .........
rs9999847 0.01812 -0.011 0.22
The 1st column (SNP IDs) in file4 will be fixed (all the SNPs on my chip will be listed here), and I want to write a program that matches the SNP ID between file4 and file2, fetches the BETA for that SNP ID from file2, and puts it into file4.
It's a simple exercise in transferring column data to rows (with awk).
file2 to file3:
I assume you have a machine with plenty of RAM, because I expect that file2 has millions of lines.
You could save this code into a file called column2row.awk:
#!/usr/bin/awk -f
BEGIN {
    snp=2
    val=12
}
{
    if ( vector[$snp] )
        vector[$snp] = vector[$snp]","$val
    else
        vector[$snp] = $val
}
END {
    for (snp in vector)
        print snp","vector[snp]
}
where snp is column 2 and val is column 12 (the P-value).
Now you can run the script:
/usr/bin/awk -f column2row.awk file2 > file3
If you have limited RAM, then you could divide the load:
cat file1 | while read l; do
    s=$(echo $l | awk '{print $1}')
    grep -w $s file2 > $s.snp
    /usr/bin/awk -f column2row.awk $s.snp >> file3
done
For each line $l of file1, this extracts the first field ($s, the SNP name), searches for $s in file2 with grep, and writes the matching lines to a small per-SNP file; it then runs the awk script over each of those files to append to file3.
file2 to file4:
You can change val in the awk script from column 12 to column 7 (the BETA column).
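If Perl is an option, here is a minimal sketch that builds both file3 and file4 in one pass for a single phenotype file (the column numbers come from the file2 header above; the file names and the one-phenotype layout are assumptions, and extending it means adding one column per phenotype file):
#!/usr/bin/perl
use strict;
use warnings;

# Index file2 by SNP ID: column 7 is BETA, column 12 is P.
my (%pval, %beta);
open my $f2, '<', 'file2' or die $!;
while (<$f2>) {
    my @f = split;
    next if $f[0] eq 'CHR';        # skip the header line
    $pval{$f[1]} = $f[11];
    $beta{$f[1]} = $f[6];
}
close $f2;

# Walk the fixed SNP list in file1 and emit one line per SNP.
open my $f1,   '<', 'file1' or die $!;
open my $out3, '>', 'file3' or die $!;
open my $out4, '>', 'file4' or die $!;
while (<$f1>) {
    my ($snp) = split;
    next if !defined $snp or $snp eq 'SNP_ID';   # skip blanks and the header
    print $out3 "$snp $pval{$snp}\n" if exists $pval{$snp};
    print $out4 "$snp $beta{$snp}\n" if exists $beta{$snp};
}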