diff ignore white spaces or the same string on a different line - diff

I need to diff two files, but if the same lines appear in both files at different positions, I don't want any output.
Example:
File1:
cc aaaw
bb bbbw
aa cccw
File2:
cc aaaw
bb bbbw
aa cccw
diff file1 file2:
2d1
< bb bbbw
3a3
> bb bbbw
-> I don't want any output
But if file1 is the one above and file2 is:
cc aaaw
bb bbbw
aa cccw
ddddddd
I want this output:
4a5
> ddddddd
Thanks.

You might use diff -B to ignore empty/blank lines.
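If the files differ only in where blank lines fall or in the order of their lines, one option is to normalize both before comparing. The sketch below assumes the sample files contained blank lines that do not show in the listing above (an assumption), and uses hypothetical temporary file names. Note that the reported line numbers refer to the sorted, blank-free copies, not to the original files.

```shell
# Hypothetical sample files; the blank lines are an assumption about the input
tmp=$(mktemp -d)
printf 'cc aaaw\nbb bbbw\n\naa cccw\n' > "$tmp/file1"
printf 'cc aaaw\n\nbb bbbw\naa cccw\n' > "$tmp/file2"
printf 'cc aaaw\n\nbb bbbw\naa cccw\nddddddd\n' > "$tmp/file3"

# Drop blank lines and sort, so that line position no longer matters
normalize() { grep -v '^[[:space:]]*$' "$1" | LC_ALL=C sort; }

normalize "$tmp/file1" > "$tmp/a"
normalize "$tmp/file2" > "$tmp/b"
normalize "$tmp/file3" > "$tmp/c"

# diff exits with status 1 when the inputs differ, hence the "|| true"
same=$(diff "$tmp/a" "$tmp/b") || true
extra=$(diff "$tmp/a" "$tmp/c") || true

printf 'file1 vs file2: [%s]\n' "$same"
printf 'file1 vs file3: [%s]\n' "$extra"
```

Sorting discards the original order, so this only suits cases where order genuinely does not matter.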

substituting chemical atomic numbers using sed

I am trying to substitute some patterns of atomic numbers in a single file. The file contains a series of atomic numbers in a column, as shown in the first column below. I want to replace each entry of the first column with the corresponding entry of the second column, line by line.
C1 C21
C2 C22
C4 C23
C5 C24
C6 C25
C7 C26
C8 C27
C9 C28
C10 C29
C11 C30
C12 C31
C13 C32
C14 C33
O1 O11
O2 O12
O3 O13
O4 O14
O5 O15
O6 O16
H1 H31
H2 H32
H3 H33
H4 H34
H5 H35
H6 H36
H7 H37
H8 H38
H9 H39
H10 H40
H11 H41
H12 H42
H13 H43
H14 H44
H15 H45
H16 H46
H17 H47
H18 H48
H19 H49
H20 H50
H21 H51
H22 H52
H23 H53
H24 H54
H25 H55
H26 H56
H27 H57
H28 H58
To achieve this I tried the sed command below:
sed -i -e 's/C1/C21/;s/C2/C22/;s/C3/C23/;s/C4/C24/;s/C5/C25/;s/C6/C26/;s/C7/C27/;s/C8/C28/;s/C9/C29/;s/C10/C30/;s/C11/C31/;s/C12/C32/;s/C13/C33/;s/C14/C34/;s/O1/O11/;s/O2/O12/;s/O3/O13/;s/O4/O14/;s/O5/O15/;s/O6/O16/;s/H1/H31/;s/H2/H32/;s/H3/H33/;s/H4/H34/;s/H5/H35/;s/H6/H36/;s/H7/H37/;s/H8/H38/;s/H9/H39/;s/H10/H40/;s/H11/H41/;s/H12/H42/;s/H13/H43/;s/H14/H44/;s/H15/H45/;s/H16/H46/;s/H17/H47/;s/H18/H48/;s/H19/H49/;s/H20/H50/;s/H21/H51/;s/H22/H52/;s/H23/H53/;s/H24/H54/;s/H25/H55/;s/H26/H56/;s/H27/H57/;s/H28/H58/' FILE_NAME
Unfortunately, what I get is multiple substitutions, such as C3328.
Can anyone help me find the correct way of doing this? Thanks in advance.
It's still not clear but I THINK this is what you want:
$ cat tst.awk
BEGIN { cnt["C"]=21; cnt["O"]=11; cnt["H"]=31 }
NF { c=substr($0,1,1); $0=c cnt[c]++ }
{ print }
$ awk -f tst.awk file
C21
C22
C23
C24
C25
C26
C27
C28
C29
C30
C31
C32
C33
O11
O12
O13
O14
O15
O16
H31
H32
H33
H34
H35
H36
H37
H38
H39
H40
H41
H42
H43
H44
H45
H46
H47
H48
H49
H50
H51
H52
H53
H54
H55
H56
H57
H58
The problem is that sed applies every substitution, in order, to the same line, so the result of one substitution can be matched again by a later one. You therefore need to order your substitutions from most specific to least specific. For example:
echo "C1" | sed -n 's/C1/C21/p; s/C2/C22/p; s/C3/C23/p'
C21
C221
echo "C1" | sed -n 's/C3/C23/p; s/C2/C22/p; s/C1/C21/p'
C21
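Ordering alone can still leave chained substitutions when a new label contains an old one (for example H28 becoming H58, which a later s/H5/H55/ would match). A minimal sketch of a single-pass alternative is a lookup table in awk, so each label is replaced at most once. It assumes the label is the whole first field of each data line; the map and data files here are hypothetical.

```shell
tmp=$(mktemp -d)
# Hypothetical mapping file: old label, new label
printf 'C1 C21\nC2 C22\nC10 C29\nH5 H35\n' > "$tmp/map"
# Hypothetical data file: label in the first field, then other data
printf 'C1 0.1\nC2 0.2\nC10 0.3\nH5 0.4\n' > "$tmp/data"

renamed=$(awk '
  FNR == NR { new[$1] = $2; next }   # first file: remember old -> new
  $1 in new { $1 = new[$1] }         # data file: replace the whole first field once
  { print }
' "$tmp/map" "$tmp/data")

printf '%s\n' "$renamed"
```

Because the comparison is on the whole field, C1 cannot match inside C10, and a replaced label is never examined again. Assigning to $1 rebuilds the line with single spaces between fields.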
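Putting [^0-9] after each pattern should work fine (note that it requires a non-digit character after the label, so a label at the very end of a line is not matched). To automate this: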
awk '$0{printf("s/%s\\([^0-9]\\)/%s\\1/g\n", $1, $2)}' <pattern-file >sedscr
Run this one-liner on the pattern file; cat sedscr then shows:
s/C1\([^0-9]\)/C21\1/g
s/C2\([^0-9]\)/C22\1/g
s/C4\([^0-9]\)/C23\1/g
...
After that, run sed with the generated script on your sample files:
sed -f sedscr sample-files...

How to find lines in a file matching lines in another file?

I have a large file with 11 columns of either text or numbers:
ETNOFIKK 03001 E0146 a1 1001 0146 10303001 10 500 EKO24 2001_200
ETNOFIKK 03002 E0147 a1 1001 0147 10303002 10 500 EKO24 2001_200
ETNOFIKK 03003 E0148 a1 1001 0148 10303003 10 500 EKO24 2001_200
...
and another file of only one column of numbers:
0146
0148
...
I need to extract lines from the first file when the 6th column matches the entries of the second file. So, in the above example, if the second file contains only the two entries, then the first and the third lines are printed from the first file.
Thanks
Using awk
awk 'FNR==NR {a[$1];next} $6 in a' file2 file1
ETNOFIKK 03001 E0146 a1 1001 0146 10303001 10 500 EKO24 2001_200
ETNOFIKK 03003 E0148 a1 1001 0148 10303003 10 500 EKO24 2001_200
This stores the entries of file2 as indices of an array.
It then checks whether $6 of each line of file1 is in the array; if it is, the line is printed.
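The same two-file idiom, written out with comments as a self-contained sketch. The array name keys and the temporary files are illustrative choices, not part of the original answer.

```shell
tmp=$(mktemp -d)
printf '%s\n' \
  'ETNOFIKK 03001 E0146 a1 1001 0146 10303001 10 500 EKO24 2001_200' \
  'ETNOFIKK 03002 E0147 a1 1001 0147 10303002 10 500 EKO24 2001_200' \
  'ETNOFIKK 03003 E0148 a1 1001 0148 10303003 10 500 EKO24 2001_200' > "$tmp/file1"
printf '0146\n0148\n' > "$tmp/file2"

matched=$(awk '
  FNR == NR { keys[$1]; next }   # reading file2: store each entry as an array index
  $6 in keys                     # reading file1: print lines whose 6th field is stored
' "$tmp/file2" "$tmp/file1")

printf '%s\n' "$matched"
```

FNR == NR is true only while the first file named on the command line is being read, which is why file2 must come first.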
sed 's/^/^\\([^[:blank:]]\\{1,\\}[[:blank:]]\\{1,\\}\\)\\{5\\}/' Other.file > /tmp/pregrep.txt
grep -f /tmp/pregrep.txt Source.File
Doing this with sed alone is possible (after concatenating both files and piping them in), but it takes many more instructions, so Jotne's awk solution seems to be the champ.
Try this, which also skips blank lines in file2:
awk 'FNR==NR && NF {a[$1];next} $6 in a' file2 file1

Sed remove text to text except last line

I want to delete part of text:
0
1
test1
a
b
random letter
test2
e
f
g
I want to get:
0
1
test2
e
f
g
I've tried using sed:
sed '/test1/,/test2/d'
But this removes test2 too.
How can I delete the text but keep test2, if I don't know exactly what text comes before test2?
I need to use awk or sed.
give this a try:
sed '/test1/,/test2/{/test2/!d}'
test with your example:
kent$ echo "0
1
test1
a
b
random letter
test2
e
f
g"|sed '/test1/,/test2/{/test2/!d}'
0
1
test2
e
f
g
awk 'BEGIN{p=1}/test1/{p=0}/test2/{p=1}p' your_file
Tested Below:
> cat temp
0
1
test1
a
b
random letter
test2
e
f
g
>
> awk 'BEGIN{p=1}/test1/{p=0}/test2/{p=1}p' temp
0
1
test2
e
f
g
>
If you want to match a whole word in awk (the \< and \> word anchors are a GNU awk extension), search like this:
/\<WORD\>/
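Where the markers occupy entire lines, as in the sample input, a portable alternative to word anchors is to compare the whole line. This is a sketch under that assumption; the flag name keep is an illustrative choice.

```shell
filtered=$(printf '%s\n' 0 1 test1 a b 'random letter' test2 e f g |
  awk '
    BEGIN { keep = 1 }
    $0 == "test1" { keep = 0 }   # stop printing at the line that is exactly test1
    $0 == "test2" { keep = 1 }   # resume printing at the line that is exactly test2
    keep                         # print the line while the flag is set
  ')

printf '%s\n' "$filtered"
```

A line such as test10 or mytest1 leaves the flag untouched here, whereas the plain /test1/ pattern would match it.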
Alternatively, you can use Perl as well:
perl -lne 'BEGIN{$p=1}if(/\btest1\b/){$p=0}if(/\btest2\b/){$p=1}print if $p' your_file

Insert space between pairs of characters - sed

Another sed question! I have nucleotide data in pairs
1 Affx-14150122 0 75891 00 CT TT CT TT CT
split by spaces, and I need to put a space into every pair, e.g.
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
I've tried sed 's/[A-Z][A-Z]/ &/g' and sed 's/[A-Z][A-Z]/& /g', and both again with A-Z replaced by .., but none of them splits the pair as I'd like (they put spaces before or after the pair, split every other pair, or similar!).
I assume that this will work for you, however it's not perfect!
echo "1 Affx-14150122 0 75891 00 CT TT CT TT CT" | \
sed 's/\(\s[A-Z]\)\([A-Z]\)/\1 \2/g'
gives
1 Affx-14150122 0 75891 00 C T T T C T T T C T
sed 's/\(\s[A-Z]\)\([A-Z]\)/\1 \2/g' matches a whitespace character (\s, a GNU sed extension) followed by an uppercase character ([A-Z]) and stores them in the first group (\(...\)), then matches another uppercase character and stores it in the second group. The match is then replaced by the first group (\1), a space, and the second group (\2).
NOTE:
This fails when you have sequences that are longer than 2 characters.
A solution using awk, which modifies only pairs of characters and might be more robust depending on your input data:
echo "1 Affx-14150122 0 75891 00 CT TT CT TT CT" | \
awk '
{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^[A-Z][A-Z]$/) {
            $i = substr($i, 1, 1) " " substr($i, 2, 1)
        }
    }
}
1'
gives
1 Affx-14150122 0 75891 00 C T T T C T T T C T
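The requested output also splits the 00 field, which an uppercase-only test leaves alone. A minimal variant, assuming every field that is exactly two characters long should be split regardless of its content, tests the field length instead:

```shell
spaced=$(echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' |
  awk '{
    for (i = 1; i <= NF; i++)
      if (length($i) == 2)   # any two-character field, digits included
        $i = substr($i, 1, 1) " " substr($i, 2, 1)
    print
  }')

printf '%s\n' "$spaced"
```

This would also split a two-character value in the earlier columns, so it only fits data where that cannot occur or does not matter.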
This might work for you (GNU sed):
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' |
sed ':a;s/\(\s\S\)\(\S\(\s\|$\)\)/\1 \2/g;ta'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
This second method works but might provide false positives:
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' | sed 's/\<\(.\)\(.\)\>/\1 \2/g'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
This is actually easier in Python than in awk, if spacing out every character is acceptable:
echo caca | python -c 'import sys
for line in sys.stdin: print(" ".join(line.strip()))'
c a c a

Create file by fetching corresponding data from other files?

I have a list of SNPs, for example (let's call it file1):
SNP_ID chr position
rs9999847 4 182120631
rs999985 11 107192257
rs9999853 4 148436871
rs999986 14 95803856
rs9999883 4 870669
rs9999929 4 73470754
rs9999931 4 31676985
rs9999944 4 148376995
rs999995 10 78735498
rs9999963 4 84072737
rs9999966 4 5927355
rs9999979 4 135733891
I have another list of SNPs with the corresponding P-value (P) and BETA (as shown below) for different phenotypes; here I have shown only one (let's call it file2):
CHR SNP BP A1 TEST NMISS BETA SE L95 U95 STAT P
1 rs3094315 742429 G ADD 1123 0.1783 0.2441 -0.3 0.6566 0.7306 0.4652
1 rs12562034 758311 A ADD 1119 -0.2096 0.2128 -0.6267 0.2075 -0.9848 0.3249
1 rs4475691 836671 A ADD 1111 -0.006033 0.2314 -0.4595 0.4474 -0.02608 0.9792
1 rs9999847 878522 A ADD 1109 -0.2784 0.4048 -1.072 0.5149 -0.6879 0.4916
1 rs999985 890368 C ADD 1111 0.179 0.2166 -0.2455 0.6034 0.8265 0.4087
1 rs9999853 908247 C ADD 1110 -0.02015 0.2073 -0.4265 0.3862 -0.09718 0.9226
1 rs999986 918699 G ADD 1111 -1.248 0.7892 -2.795 0.2984 -1.582 0.114
Now I want to make two files, named file3 and file4, such that:
file3 should contain:
SNPID Pvalue_for_phenotype1 Pvalue_for_phenotype2 Pvalue_for_phenotype3 and so on....
rs9999847 0.9263 0.00005 0.002 ..............
The first column (SNP IDs) in file3 will be fixed (all the SNPs on my chip will be listed here). I want to write a program that matches the SNP IDs in file3 against file2, fetches the P-value for each matching SNP ID from file2, and puts it in file3.
file4 should contain:
SNPID BETAvalue_for_phenotype1 BETAvalue_for_phenotype2 BETAvalue_for_phenotype3 .........
rs9999847 0.01812 -0.011 0.22
The first column (SNP IDs) in file4 will be fixed (all the SNPs on my chip will be listed here). I want to write a program that matches the SNP IDs in file4 against file2, fetches the BETA for each matching SNP ID from file2, and puts it in file4.
This is a simple exercise in transposing column data into rows with awk.
file2 to file3:
I assume that you have a machine with plenty of RAM, because file2 probably has millions of lines.
You could save this code in a file named column2row.awk:
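#!/usr/bin/awk -f
# Collect, for every SNP, the values of one column across all input lines.
BEGIN {
    snp = 2     # column holding the SNP ID
    val = 12    # column holding the value to collect (12 = P)
}
$snp == "SNP" { next }    # skip header rows
{
    # test membership with "in" so that a stored value of 0 is not treated as unset
    if ($snp in vector)
        vector[$snp] = vector[$snp] "," $val
    else
        vector[$snp] = $val
}
END {
    for (id in vector)
        print id "," vector[id]
}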
where snp is column 2 and val is column 12 (P-value).
Now you can run the script:
/usr/bin/awk -f column2row.awk file2 > file3
If you have little RAM, you can split the load:
while read -r s _; do grep -w "$s" file2 > "$s.snp"; /usr/bin/awk -f column2row.awk "$s.snp" >> file3; done < file1
This takes the first field of each line of file1 ($s, the SNP name), searches for $s in file2, and creates a small file for each SNP name.
It then uses the awk script to append the result to file3.
file2 to file4:
Change val in the awk script from column 12 to 7.
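To keep the SNP list from file1 fixed and in its original order, as the question asks, a lookup keyed on the SNP ID is another option. The sketch below handles a single phenotype file; the lookup helper, the NA placeholder for SNPs missing from file2, and the shortened sample files are all assumptions made for illustration.

```shell
tmp=$(mktemp -d)
printf 'SNP_ID chr position\nrs9999847 4 182120631\nrs999985 11 107192257\nrs9999883 4 870669\n' > "$tmp/file1"
printf '%s\n' \
  'CHR SNP BP A1 TEST NMISS BETA SE L95 U95 STAT P' \
  '1 rs9999847 878522 A ADD 1109 -0.2784 0.4048 -1.072 0.5149 -0.6879 0.4916' \
  '1 rs999985 890368 C ADD 1111 0.179 0.2166 -0.2455 0.6034 0.8265 0.4087' > "$tmp/file2"

# lookup COLUMN: print each SNP of file1 with that column of file2, or NA if absent
lookup() {
  awk -v col="$1" '
    FNR == NR { if (FNR > 1) val[$2] = $col; next }         # file2: value per SNP, header skipped
    FNR > 1   { print $1, (($1 in val) ? val[$1] : "NA") }  # file1: keep its order
  ' "$tmp/file2" "$tmp/file1"
}

pvalues=$(lookup 12)   # column 12 = P
betas=$(lookup 7)      # column 7 = BETA

printf '%s\n' "$pvalues"
printf '%s\n' "$betas"
```

For several phenotypes, the same lookup could be run once per phenotype file and the resulting columns combined, for example with paste.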