How to replace a value with "." in sed

I want to replace all instances of "a number followed by any number of spaces followed by a period and possibly more spaces" with the number and period only.
For example, '14 . x' will become '14.x'.
My test data is:
1. c4 e5 2. g3 c6 { good move. } 3. Bg2 Nf6 4. Nc3 $6 d5 5. cxd5 cxd5 6. Qb3 Nc6 $1.. Nxd5 Nd4
8. Nxf6+ Qxf6 9. Qd1.f5 10. d3 Rc8 (10... Bb4+ $5 11. Bd2 Bxd2+ 12. Qxd2 Qa6 $1.3. Rc1.xa2
14. Bxb7 $2 Rb8 15. Qb4 Bd7) 11. Kf1.c5 12. Nf3 O-O
How can I do that?

If you want any number of spaces removed from either side of the period, you should try s/\([0-9]\) *\. */\1./g:
$ echo '11. A 12 .B 13 . C 14.D 15 . E' | sed 's/\([0-9]\) *\. */\1./g'
11.A 12.B 13.C 14.D 15.E
For your test data, the results are:
1.c4 e5 2.g3 c6 { good move. } 3.Bg2 Nf6 4.Nc3 $6 d5 5.cxd5 cxd5 6.Qb3 Nc6 $1.. Nxd5 Nd4
8.Nxf6+ Qxf6 9.Qd1.f5 10.d3 Rc8 (10... Bb4+ $5 11.Bd2 Bxd2+ 12.Qxd2 Qa6 $1.3.Rc1.xa2
14.Bxb7 $2 Rb8 15.Qb4 Bd7) 11.Kf1.c5 12.Nf3 O-O
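For comparison, the same substitution can be sketched in Python with re.sub. This is only an illustration and not part of the original answer; the function name strip_spaces_around_period is made up for the example.

```python
import re

def strip_spaces_around_period(text):
    """Collapse 'digit, spaces, period, spaces' into 'digit.'."""
    # ([0-9]) captures the digit; ' *' matches zero or more spaces
    # on either side of the literal period.
    return re.sub(r"([0-9]) *\. *", r"\1.", text)

print(strip_spaces_around_period("11. A 12 .B 13 . C 14.D 15 . E"))
# prints: 11.A 12.B 13.C 14.D 15.E
```

As with the sed command, a period that is not preceded by a digit (for example the one in "good move. }") is left alone.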

Related

Utf8 encoding makes me confused

let buf1 = Buffer.from("3", "utf8");
let buf2 = Buffer.from("Здравствуйте", "utf8");
// <Buffer 33>
// <Buffer d0 97 d0 b4 d1 80 d0 b0 d0 b2 d1 81 d1 82 d0 b2 d1 83 d0 b9 d1 82 d0 b5>
Why does char '3' encode to '33' in buf1 but 'd0 97' in buf2?

Because 3 is not З, despite the similarity to the untrained eye. Look closer and you'll see the difference, however subtle.
The former is Unicode code point U+0033 - DIGIT THREE, while the latter is U+0417 - CYRILLIC CAPITAL LETTER ZE, encoded in UTF-8 as d0 97.
The Russian word is actually hello, pronounced (very roughly, since I only know hello and goodbye, taught by a Russian girlfriend many decades ago) "Strasvoytza", with no "three" anywhere in the concept.

The first character of the second buffer is the Cyrillic character "Ze" (https://en.m.wikipedia.org/wiki/Ze_(Cyrillic)), not the Arabic numeral 3 (https://en.m.wikipedia.org/wiki/3).
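The difference between the two look-alike characters can be checked directly. The following is a small Python sketch (the question itself uses Node.js; the helper name utf8_hex is made up here) that prints the code point and the UTF-8 bytes of each character.

```python
def utf8_hex(ch):
    """Return the UTF-8 bytes of ch as space-separated hex pairs."""
    return " ".join(f"{b:02x}" for b in ch.encode("utf-8"))

digit_three = "\u0033"   # DIGIT THREE
cyrillic_ze = "\u0417"   # CYRILLIC CAPITAL LETTER ZE

for ch in (digit_three, cyrillic_ze):
    # ord() gives the code point; utf8_hex() shows how it is encoded
    print(f"U+{ord(ch):04X} -> {utf8_hex(ch)}")
# prints:
# U+0033 -> 33
# U+0417 -> d0 97
```

Code points below U+0080 take one byte in UTF-8, while U+0417 falls in the range that needs two, which is why the buffers differ in length per character.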

How to sum values in a column grouped by values in the other

I have a large file consisting of data in 2 columns:
100 5
100 10
100 10
101 2
101 4
102 10
102 2
I want to sum the values in the 2nd column for each distinct value in column 1. For this example, the output I'm expecting is
100 25
101 6
102 12
I'd prefer to do this with a bash script. Can someone explain how I can do this?

Using awk:
awk '{a[$1]+=$2}END{for(i in a){print i, a[i]}}' inputfile
For your input, it'd produce:
100 25
101 6
102 12

In a perl oneliner
perl -lane '$s{$F[0]} += $F[1]; END { print qq{$_ $s{$_}} for keys %s }' file.txt

You can use an associative array. The first column is the index and the second becomes what you add to it.
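#!/bin/bash
# Sum column 2 per distinct value of column 1 (associative arrays need bash 4+).
declare -A columns=()
while read -r -a line ; do
    # A key seen for the first time starts at 0
    columns[${line[0]}]=$(( ${columns[${line[0]}]:-0} + ${line[1]} ))
done < "${1}"
for idx in "${!columns[@]}" ; do
    echo "${idx} ${columns[${idx}]}"
done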

Using awk and maintain the order:
awk '!($1 in a){a[$1]=$2; b[++i]=$1;next} {a[$1]+=$2} END{for (k=1; k<=i; k++) print b[k], a[b[k]]}' file
100 25
101 6
102 12

Python is my choice:
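import sys

d = {}
with open(sys.argv[1]) as f:  # assumes the input file is the first argument
    for line in f:
        key, value = line.split()
        # start missing keys at 0 and add the value as a number
        d[key] = d.get(key, 0) + int(value)
for key, total in d.items():
    print(key, total)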
Why would you want a bash script?

How to find lines in a file matching lines in another file?

I have a large file with 11 columns of either text or numbers:
ETNOFIKK 03001 E0146 a1 1001 0146 10303001 10 500 EKO24 2001_200
ETNOFIKK 03002 E0147 a1 1001 0147 10303002 10 500 EKO24 2001_200
ETNOFIKK 03003 E0148 a1 1001 0148 10303003 10 500 EKO24 2001_200
...
and another file of only one column of numbers:
0146
0148
...
I need to extract lines from the first file when the 6th column matches the entries of the second file. So, in the above example, if the second file contains only the two entries, then the first and the third lines are printed from the first file.
Thanks

Using awk
awk 'FNR==NR {a[$1];next} $6 in a' file2 file1
ETNOFIKK 03001 E0146 a1 1001 0146 10303001 10 500 EKO24 2001_200
ETNOFIKK 03003 E0148 a1 1001 0148 10303003 10 500 EKO24 2001_200
This stores the entries of file2 as keys (indices) of the array a.
Then, for each line of file1, it checks whether $6 is a key in the array and, if so, prints the line.
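The same two-pass idea (load the keys from the second file, then filter the first) can be sketched in Python. This is an illustration only; filter_by_column and its parameters are names invented for the example, and the rows are the sample lines from the question.

```python
def filter_by_column(rows, keys, column=6):
    """Return the rows whose 1-based field `column` appears in `keys`."""
    # Like a[$1] in awk: remember the first field of every non-blank key line
    wanted = {line.split()[0] for line in keys if line.strip()}
    kept = []
    for row in rows:
        fields = row.split()
        # Like "$6 in a": keep the row if the chosen field is a known key
        if len(fields) >= column and fields[column - 1] in wanted:
            kept.append(row)
    return kept

rows = [
    "ETNOFIKK 03001 E0146 a1 1001 0146 10303001 10 500 EKO24 2001_200",
    "ETNOFIKK 03002 E0147 a1 1001 0147 10303002 10 500 EKO24 2001_200",
    "ETNOFIKK 03003 E0148 a1 1001 0148 10303003 10 500 EKO24 2001_200",
]
for row in filter_by_column(rows, ["0146", "0148"]):
    print(row)  # prints the first and third rows
```

Using a set for the keys keeps each lookup constant-time, which matters when the first file is large.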

sed 's/^/^\\([^[:blank:]]\\{1,\\}[[:blank:]]\\{1,\\}\\)\\{5\\}/' Other.file > /tmp/pregrep.txt
grep -f /tmp/pregrep.txt Source.File
Using sed alone is possible (after a cat of both files into a pipe) but takes many more instructions, so Jotne's awk solution seems to be the champ.

Try this (the extra NF test skips blank lines in file2):
awk 'FNR==NR && NF {a[$1];next} $6 in a' file2 file1

Insert space between pairs of characters - sed

Another sed question! I have nucleotide data in pairs
1 Affx-14150122 0 75891 00 CT TT CT TT CT
split by spaces and I need to put a space into every pair, eg
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
I've tried sed 's/[A-Z][A-Z]/ &/g' and sed 's/[A-Z][A-Z]/& /g', and both again with A-Z replaced by .., but none of them splits the pair as I'd like (they put spaces before or after the pair, or split every other pair, or similar!).

I assume that this will work for you, however it's not perfect!
echo "1 Affx-14150122 0 75891 00 CT TT CT TT CT" | \
sed 's/\(\s[A-Z]\)\([A-Z]\)/\1 \2/g'
gives
1 Affx-14150122 0 75891 00 C T T T C T T T C T
sed 's/\(\s[A-Z]\)\([A-Z]\)/\1 \2/g' matches a whitespace character (\s) followed by an upper-case character ([A-Z]) and stores them in the first group (\(...\)), then matches another upper-case character and stores it in the second group. The match is replaced by the first group (\1), a space, and the second group (\2).
NOTE:
This fails when you have sequences that are longer than 2 characters.

A solution using awk which modifies only pairs of characters and might be more robust depending on your input data:
echo "1 Affx-14150122 0 75891 00 CT TT CT TT CT" | \
awk '
{
    for (i = 1; i <= NF; i++) {
        if ($i ~ /^[A-Z][A-Z]$/) {
            $i = substr($i, 1, 1) " " substr($i, 2, 1)
        }
    }
}
1'
gives
1 Affx-14150122 0 75891 00 C T T T C T T T C T
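The field-by-field idea above can also be sketched in Python. This variant rests on an assumption about what is wanted: it splits every field that is exactly two characters long, so 00 becomes 0 0 as in the question's expected output. The function name split_pairs is made up for the example.

```python
def split_pairs(line):
    """Insert a space inside every whitespace-separated field of length 2."""
    fields = line.split()
    # " ".join("CT") yields "C T"; fields of any other length are kept as-is
    return " ".join(" ".join(f) if len(f) == 2 else f for f in fields)

print(split_pairs("1 Affx-14150122 0 75891 00 CT TT CT TT CT"))
# prints: 1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
```

Because the length test looks at whole fields, longer tokens such as Affx-14150122 or 75891 are never split.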

This might work for you (GNU sed):
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' |
sed ':a;s/\(\s\S\)\(\S\(\s\|$\)\)/\1 \2/g;ta'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T
This second method works but might provide false positives:
echo '1 Affx-14150122 0 75891 00 CT TT CT TT CT' | sed 's/\<\(.\)\(.\)\>/\1 \2/g'
1 Affx-14150122 0 75891 0 0 C T T T C T T T C T

This is actually easier in python than in awk:
echo caca | python -c 'import sys
for line in sys.stdin: print(" ".join(line.rstrip("\n")))'
c a c a

sed remove line containing a string and nothing but; automation using for loop

Q1: Make sed match the whole line, and delete the line only if it contains nothing but the string
I have a file that contains several of the following numbers:
1 1
3 1
12 1
1 12
25 24
23 24
I want to delete the lines in which both numbers are the same. For that I have either been using:
sed '/1 1/d' < old.file > new.file
OR
sed -n '/1 1/!p' < old.file > new.file
Here is the main problem: if I search for the pattern '1 1', I also get rid of '1 12'. So I want the pattern to match the whole line and to delete the line only if it does.
Q2: Automation of question 1
I am also trying to automate this problem. The range of numbers in the first column and the second column could be from 1 to 25.
So far this is what I got:
for ((i=1;i<26;i++)); do
sed "/'$i' '$i'/d" < oldfile > newfile; mv newfile oldfile;
done
This does nothing to the oldfile in the end. :(

This would be more readable with awk:
awk '$1 == $2 {next} {print}' oldfile > newfile
Update based on comment:
If the requirement is to remove lines where the two values are within 1 of each other:
awk '{d = $1-$2; if (-1 <= d && d <= 1) next; else print}' oldfile
Unfortunately, awk does not have abs() (at least nawk and gawk don't)

Just put the first number in a group (\([0-9]*\)) and then look for it with a backreference (\1). Since the line to delete should contain only the group, repeated, use the ^ to mark the beginning of line and the $ to mark the end of line. For example, for the following file:
$ cat input
1 1
3 1
12 1
1 12
12 12
12 13
13 13
25 24
23 24
...the result is:
$ sed '/^\([0-9]*\) \1$/d' input
3 1
12 1
1 12
12 13
25 24
23 24
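The anchored backreference can be tried out in Python as well. The snippet below is only a sketch of the same pattern (with + instead of * so that at least one digit is required); drop_equal_pairs is a name invented for the example.

```python
import re

# ^ and $ force the pattern to cover the whole line; \1 must repeat
# exactly the digits captured by the first group.
SAME_PAIR = re.compile(r"^([0-9]+) \1$")

def drop_equal_pairs(lines):
    """Return the lines that are not 'N N' with both numbers identical."""
    return [line for line in lines if not SAME_PAIR.match(line)]

lines = ["1 1", "3 1", "12 1", "1 12", "12 12", "12 13", "13 13", "25 24", "23 24"]
for line in drop_equal_pairs(lines):
    print(line)
# prints: 3 1, 12 1, 1 12, 12 13, 25 24, 23 24 (one per line)
```

Without the anchors, "1 12" would also match, because "1 1" occurs inside it; that is the problem described in the question.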

You can also do it with grep:
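grep -E -v '^([0-9]+)\s\1$' testfile
This looks for a line that starts with one or more digits (remembered as a group), followed by a single whitespace character, followed by exactly the digits that were remembered and then the end of the line. The -v flag prints only the lines that do not match.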