How to find and replace a diaeresis?

I have a file containing some diaeresis marks, ̈. I need to replace them with \textdiaeresis, for use in TeX.
The usual commands, which work with other symbols, always produce either \\textdiaeresis or \ extdiaeresis; in the latter, \t is interpreted as a tab.
I have tried these sed commands:
sed -i 's/\ ̈/\textdiaeresis /g' ./file.txt
sed -i 's/\ ̈/\\textdiaeresis /g' ./file.txt
sed -i 's/\ ̈/\\\textdiaeresis /g' ./file.txt
sed -i "s/\ ̈/\textdiaeresis /g" ./file.txt
sed -i "s/\ ̈/\\textdiaeresis /g" ./file.txt
sed -i "s/\ ̈/\\\textdiaeresis /g" ./file.txt
I have tried these nawk commands:
nawk '{sub(/ ̈/,"\textdiaeresis"); print}' file.txt > file.txt2
cp file.txt2 file.txt
nawk '{sub(/ ̈/,"\\textdiaeresis"); print}' file.txt > file.txt2
cp file.txt2 file.txt
nawk '{sub(/ ̈/,"\\\textdiaeresis"); print}' file.txt > file.txt2
cp file.txt2 file.txt
How can I replace a diaeresis with this TeX code?

On Mac OS X 10.7.4, under bash (version 3.2.48), I find no problem with sed (which is the Mac OS X sed, not the GNU sed).
$ x="s, ̈. "
$ echo "$x" | ~/src/sbcs2utf8/utf8-unicode
(standard input):
0x73 = U+0073
0x2C = U+002C
0x20 = U+0020
0xCC 0x88 = U+0308
0x2E = U+002E
0x20 = U+0020
0x0A = U+000A
$ echo "$x" | sed 's/ ̈/\\textdiaeresis/'
s,\textdiaeresis.
$
The character is U+0308 COMBINING DIAERESIS; I copied the fragment assigned to x from the question. The Unicode standard specifies (Chapter 2, §2.11):
In the Unicode Standard, all combining characters are to be used in sequence following the base characters to which they apply. The sequence of Unicode characters U+0061 “a” LATIN SMALL LETTER A, U+0308 “ ¨ ” COMBINING DIAERESIS, U+0075 “u” LATIN SMALL LETTER U unambiguously represents “äu” and not “aü”.
Thus, the diaeresis in the question text should be rendered over the space. Using Firefox (14.0.1), in the shell output, the diaeresis is shown over the . following it, which is wrong. And in the sed command, the diaeresis appears to be combined with the following slash, which is also wrong. Oh well! But the translation via sed looks correct to me.
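Because a combining character is easy to mangle when it is pasted into a terminal, one option is to build U+0308 from its UTF-8 bytes (0xCC 0x88, as in the dump above) instead of typing it. The following is a minimal sketch, assuming a POSIX shell and a sed that matches those bytes literally; the helper name replace_diaeresis and the sample word are illustrative and not part of the question.

```shell
# Build U+0308 COMBINING DIAERESIS from its UTF-8 bytes (octal 314 210 = 0xCC 0x88).
diaeresis=$(printf '\314\210')

# Hypothetical helper: replace every combining diaeresis read from stdin.
# Inside double quotes the shell reduces \\\\ to \\, and sed then emits
# that pair as one literal backslash.
replace_diaeresis() {
  sed "s/${diaeresis}/\\\\textdiaeresis /g"
}

# Sample input: "i" followed by the combining diaeresis.
printf 'nai\314\210ve\n' | replace_diaeresis
# prints: nai\textdiaeresis ve
```

Keeping the character in a variable also makes it obvious that the pattern contains only the diaeresis, with no stray space or backslash in front of it.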

Related

sed to copy part of line to end

I'm trying to copy part of a line to append to the end:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz
becomes:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz
I have tried:
sed 's/\(.*(GCA_\)\(.*\))/\1\2\2)'

$ f1=$'ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz'
$ echo "$f1"
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz
$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\1\2\3\/\2\4/' <<<"$f1"
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz
sed -E (or -r on some systems) enables extended regex support in sed, so you don't need to escape the grouping parentheses ( ).
The group (GCA_.[^.]*) means "starting at GCA_, capture every character up to, but excluding, the first dot":
$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\2/' <<<"$f1"
GCA_900169985
Similarly, (.[^_]*) captures every character up to, but excluding, the first _. A negated character class like this is the usual way to emulate a non-greedy (lazy) capture in sed; in Perl regex it could be written with a lazy quantifier, something like .*?_
$ sed -E 's/(.*)(GCA_.[^.]*)(.[^_]*)(.*)/\3/' <<<"$f1"
.1
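The difference between a greedy match and the negated-class idiom can be seen in isolation. This is a small sketch, assuming a sed that accepts -E; the variable names are made up for the illustration and only the file name portion of the URL is used.

```shell
name='GCA_900169985.1_IonXpress_024_genomic.fna.gz'

# Greedy: .* extends as far as it can, so the group ends at the LAST underscore.
greedy=$(printf '%s\n' "$name" | sed -E 's/^(.*)_.*$/\1/')

# Negated class: [^_]* cannot cross an underscore, so the group ends at the FIRST one.
first=$(printf '%s\n' "$name" | sed -E 's/^([^_]*)_.*$/\1/')

printf '%s\n' "$greedy"   # GCA_900169985.1_IonXpress_024
printf '%s\n' "$first"    # GCA
```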

Short sed approach:
s="ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1_IonXpress_024_genomic.fna.gz"
sed -E 's/(GCA_[^._]+)\.([^_]+)/\1.\2\/\1/' <<< "$s"
The output:
ftp://ftp.ncbi.nlm.nih.gov/genomes/all/GCA/900/169/985/GCA_900169985.1/GCA_900169985_IonXpress_024_genomic.fna.gz

Why is sed not matching excess whitespace between non-whitespace characters

I have a sed oneliner which removes excess whitespace:
sed -e 's/^\s*//' -e 's/\s*$//' -e 's/\s{2,}/ /g'
When I test it on " \tone1 two\t3three\t ", the sed removes the whitespace at the beginning and end of the line but doesn't match the excess whitespace between words, and sed returns \tone1 two\t3three. What I want is \tone1 two 3three, so sed -e 's/[ \t]{2,}/ /g' is not functioning.
regexr.com shows the expression as functional.
My version is GNU sed version 4.2.1.

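In the basic regex (BRE) mode that sed uses by default, { and } must be escaped as \{ and \} to act as an interval.
Alternatively, switch to extended regex mode with -E, where alternation trims both ends in one substitution and a second substitution squeezes inner runs to a single space:
sed -E 's/^[[:blank:]]+|[[:blank:]]+$//g; s/[[:blank:]]{2,}/ /g' file
The POSIX character class [[:blank:]] matches a space or a tab character.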
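To make the brace rule concrete, here is the same clean-up written once for each regex flavour. It is a sketch rather than the answerer's exact command: the function names squeeze_bre and squeeze_ere are hypothetical, and the sample line is an assumed stand-in for the test string in the question.

```shell
# BRE (sed's default mode): the interval braces are escaped.
squeeze_bre() {
  sed -e 's/^[[:blank:]]*//' -e 's/[[:blank:]]*$//' -e 's/[[:blank:]]\{2,\}/ /g'
}

# ERE (-E): the interval braces are written unescaped.
squeeze_ere() {
  sed -E -e 's/^[[:blank:]]+//' -e 's/[[:blank:]]+$//' -e 's/[[:blank:]]{2,}/ /g'
}

# Sample line: leading space+tab, three spaces, two tabs, trailing tab+space.
printf ' \tone1   two\t\t3three\t \n' | squeeze_bre   # one1 two 3three
printf ' \tone1   two\t\t3three\t \n' | squeeze_ere   # one1 two 3three
```

Both variants use [[:blank:]] instead of \s, since \s is a GNU extension.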

how to fix: sed -e 's/é/\\'{e}/g'

How may I fix the following command, sed -e 's/é/\\'{e}/g', so that it substitutes é with \'{e}?
The issue is that the second occurrence of ' is seen as the closing quote of the command;
sed -e 's/é/\\\'{e}/g' does not work either.

With GNU sed, to go the other way and replace \'{e} with é:
echo "\'{e}" | sed "s/\\\'{e}/é/"
Output:
é
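For the direction asked about in the question (é to \'{e}), the difficulty is purely shell quoting. The sketch below shows two equivalent spellings, assuming the script is saved as UTF-8 and run in a POSIX shell; the function names are hypothetical.

```shell
# A single quote cannot appear inside a single-quoted shell string, so the
# string is closed, an escaped quote \' is added, and the string is reopened.
# sed receives the script:  s/é/\\'{e}/g
to_tex_single() {
  sed 's/é/\\'\''{e}/g'
}

# Same script in double quotes: the shell reduces \\\\ to \\ before sed sees it.
to_tex_double() {
  sed "s/é/\\\\'{e}/g"
}

# The sample input is "café" with é given as its UTF-8 bytes.
printf 'caf\303\251\n' | to_tex_single   # caf\'{e}
printf 'caf\303\251\n' | to_tex_double   # caf\'{e}
```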

How to find and replace all percent, plus, and pipe signs?

I have a document containing many percent, plus, and pipe signs. I want to replace them with a code, for use in TeX.
% becomes \textpercent.
+ becomes \textplus.
| becomes \textbar.
This is the code I am using, but it does not work:
sed -i "s/\%/\\\textpercent /g" ./file.txt
sed -i "s/|/\\\textbar /g" ./file.txt
sed -i "s/\+/\\\textplus /g" ./file.txt
How can I replace these symbols with this code?

Test script:
#!/bin/bash
cat << 'EOF' > testfile.txt
1+2+3=6
12 is 50% of 24
The pipe character '|' looks like a vertical line.
EOF
sed -i -r 's/%/\\textpercent /g;s/[+]/\\textplus /g;s/[|]/\\textbar /g' testfile.txt
cat testfile.txt
Output:
1\textplus 2\textplus 3=6
12 is 50\textpercent of 24
The pipe character '\textbar ' looks like a vertical line.
This was already suggested in a similar way by @tripleee, and I see no reason why it should not work. As you can see, my platform uses the very same version of GNU sed as yours. The only difference from @tripleee's version is that I use extended regex mode, so I have to either escape the pipe and the plus or put each of them into a character class with [].

nawk '{gsub(/%/,"\\textpercent");gsub(/\+/,"\\textplus");gsub(/\|/,"\\textbar"); print}' file
Tested below:
> echo "% + |" | nawk '{gsub(/%/,"\\textpercent");gsub(/\+/,"\\textplus");gsub(/\|/,"\\textbar"); print}'
\textpercent \textplus \textbar

Use single quotes:
$ cat in.txt
foo % bar
foo + bar
foo | bar
$ sed -e 's/%/\\textpercent /g' -e 's/\+/\\textplus /g' -e 's/|/\\textbar /g' < in.txt
foo \textpercent bar
foo \textplus bar
foo \textbar bar
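When it is unclear how many backslashes survive the shell, printing the script before handing it to sed removes the guesswork. This is a diagnostic sketch only; show_script is a hypothetical helper, not a standard tool.

```shell
# Hypothetical helper: print a sed script exactly as sed would receive it.
show_script() {
  printf '%s\n' "$1"
}

# Single quotes pass backslashes through untouched.
show_script 's/|/\\textbar /g'     # s/|/\\textbar /g   (sed emits \textbar)

# Double quotes: each \\ collapses to \ before sed sees the script.
show_script "s/|/\\\\textbar /g"   # s/|/\\textbar /g   (same script as above)
show_script "s/|/\\textbar /g"     # s/|/\textbar /g    (GNU sed reads \t as a tab)
```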

How do I get rid of this unicode character?

Any idea how to get rid of this irritating character, U+0092, from a bunch of text files? I've tried all of the commands below, but none of them works. The character map lists it as U+0092+control.
sed -i 's/\xc2\x92//' *
sed -i 's/\u0092//' *
sed -i 's///' *
Ah, I've found a way:
CHARS=$(python2 -c 'print u"\u0092".encode("utf8")')
sed 's/['"$CHARS"']//g'
But is there a direct sed method for this?

Try sed "s/\`//g" *. (I added the g so it will remove all the backticks it finds).
EDIT: It's not a backtick that OP wants to remove.
Following the solution in this question, this ought to work:
sed 's/\xc2\x92//g'
To demonstrate it does:
$ CHARS=$(python2 -c 'print u"asdf\u0092asdf".encode("utf8")')
$ echo $CHARS
asdf<funny glyph symbol>asdf
$ echo $CHARS | sed 's/\xc2\x92//g'
asdfasdf
Seeing as it's something you tried already, perhaps what is in your text file is not U+0092?

This might work for you (GNU sed):
echo "string containing funny character(s)" | sed -n 'l0'
This displays the string as sed sees it, with non-printable bytes shown as octal escapes. Then use:
echo "string containing funny character(s)" | sed 's/\onnn//g'
where nnn is the octal value, to delete the character(s).
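A variant that does not rely on sed's own escape sequences is to let printf produce the bytes and splice them into the script. This is a sketch under the assumption that the files are UTF-8, where U+0092 is encoded as the two bytes 0xC2 0x92; the helper name strip_u0092 is hypothetical.

```shell
# U+0092 encoded as UTF-8 is the two bytes 0xC2 0x92 (octal 302 222).
u0092=$(printf '\302\222')

# Hypothetical helper: delete every U+0092 read from stdin.
# LC_ALL=C makes sed treat the input as plain bytes.
strip_u0092() {
  LC_ALL=C sed "s/${u0092}//g"
}

printf 'asdf\302\222asdf\n' | strip_u0092   # asdfasdf
```

If this leaves the files unchanged, the unwanted character is probably stored differently, for example as the single byte 0x92 in a Windows-1252 file.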
