How to replace special characters in Unix with sed

My source data contains special characters that are not in a readable format. Can anyone help with the following?
Source data:
Commands tried:
sed 's/../t/g' test.txt > test2.txt

You can use tr to keep only printable characters:
tr -cd "[:print:]" <test.txt > test2.txt
This uses tr's delete option (-d) on non-printable characters; the -c option complements the [:print:] set.
If you want to replace those special characters with something else (e.g. X):
tr -c "[:print:]" "X" <test.txt > test2.txt
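Note that [:print:] does not include the newline character, so the tr commands above also delete (or replace) line breaks. A minimal sketch that keeps them, assuming the input is ordinary line-oriented text; the sample bytes and the variable name are made up for illustration:

```shell
# Delete every byte that is neither printable nor a newline.
# \001 stands in for an arbitrary control character in the sample input.
kept=$(printf 'a\001b\nc\n' | tr -cd '[:print:]\n')
printf '%s\n' "$kept"
```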
With sed, you could try this to replace non-printable characters with X:
sed -r 's/[^[:print:]]/X/g' test.txt > test2.txt
It works on some characters but fails on bytes above 127 on my machine (maybe because the one I tried is printable as ▒!), whereas tr works perfectly.
Inline examples (printf generates the special characters, the filter is applied, and od shows the resulting bytes):
$ printf "\x01ABC\x05\xff\xe0" | od -c
0000000 001 A B C 005 377 340
0000007
$ printf "\x01ABC\x05\xff\xe0" | sed "s/[^[:print:]]//g" | od -c
0000000 A B C 377 340
0000005
$ printf "\x01ABC\x05\xff\xe0" | tr -cd "[:print:]" | od -c
0000000 A B C
0000003
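A likely explanation for the sed result is the locale: in a multibyte or 8-bit locale, bytes above 127 may be treated as invalid or as printable, so the negated class does not match them. A hedged sketch that forces the C locale for the sed call, assuming the goal is to keep only printable ASCII; the variable name is illustrative, and octal escapes are used because \x is not supported by every printf:

```shell
# Same sample bytes as above, written as octal escapes for portability.
# LC_ALL=C makes sed treat each byte as one character, so bytes above
# 127 count as non-printable and are removed by the negated class.
cleaned=$(printf '\001ABC\005\377\340' | LC_ALL=C sed 's/[^[:print:]]//g')
printf '%s\n' "$cleaned"
```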

Related

How to convert a ASCII NULL (NUL) into single spacing in a text file using Unix command?

When I BCP the data out of SQL Server, I get a NUL-like character in the output file, and I want to replace it with a single blank space.
When I use the sed command below, it removes the NUL character, but no single space is left between the two delimiters.
sed 's/\x0/ /g' output file name
Example: after the sed command, I get an output file like the one below
PHMO||P00000005233
PHMO||P00000005752
But I need a single space between those delimiters, as in
PHMO| |P00000005233
PHMO| |P00000005752
The usual approach to this would be using tr. However, solutions with tr and sed are not portable. (The question is tagged "unix", so only portable solutions are interesting).
Here is a simple demo script
#!/bin/sh
date
tr '\000' ' ' <$0.in
date
sed -e 's/\x00/ /g' <$0.in
which I named foo, and its input (with the ASCII NUL shown here as ^@):
this is a null: "^@"
Running with GNU tr and sed:
Fri Apr 1 04:41:15 EDT 2016
this is a null: " "
Fri Apr 1 04:41:15 EDT 2016
this is a null: " "
With OSX:
Fri Apr 1 04:41:53 EDT 2016
this is a null: " "
Fri Apr 1 04:41:53 EDT 2016
this is a null: "^@"
With Solaris 10 (and 11, though there may be a recent change):
Fri Apr 1 04:38:08 EDT 2016
this is a null: ""
Fri Apr 1 04:38:08 EDT 2016
this is a null: ""
Bear in mind that sed is line-oriented, and that ASCII NUL is considered a binary (non-line) character. If you want a portable solution, then other tools such as Perl (which does not have that limitation) are useful. For that case one could add this to the script:
perl -np -e 's/\0/ /g' <$0.in
The intermediate tool awk is no better in this instance. Going to Solaris again, with these lines:
for awk in awk nawk mawk gawk
do
echo "** $awk:"
$awk '{ gsub("\0"," "); print; }' <$0.in
done
I see this output:
** awk:
awk: syntax error near line 1
awk: illegal statement near line 1
** nawk:
nawk: empty regular expression
source line number 1
context is
{ gsub("\0"," >>> ") <<<
** mawk:
this is a null: " "
** gawk:
this is a null: " "
Further reading:
sed - stream editor (POSIX)
tr - translate characters (POSIX), which notes
Unlike some historical implementations, this definition of the tr utility correctly processes NUL characters in its input stream. NUL characters can be stripped by using:
tr -d '\000'
perlrun - how to execute the Perl interpreter
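Applied to a record shaped like the one in the question, the tr approach can be sketched as follows. This assumes a tr that handles NUL as POSIX describes (the Solaris output above shows that not every implementation does); the record is rebuilt with printf and the variable name is illustrative:

```shell
# Build one sample record with a NUL byte between the two delimiters,
# then translate every NUL to a single space.
fixed=$(printf 'PHMO|\000|P00000005233\n' | tr '\000' ' ')
printf '%s\n' "$fixed"
```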
This is an easy job for sed. Let's start creating a test file as you didn't provide one:
$ echo -e "one,\x00,two,\x00,three" > a
$ echo -e "four,\x00,five,\x00,six" >> a
As you can see it contains ASCII 0:
$ od -c a
0000000 o n e , \0 , t w o , \0 , t h r e
0000020 e \n f o u r , \0 , f i v e , \0 ,
0000040 s i x \n
0000044
Now let's run sed:
$ sed 's/\x00/ /g' a > b
And check the output:
$ cat b
one, ,two, ,three
four, ,five, ,six
$ od -c b
0000000 o n e , , t w o , , t h r e
0000020 e \n f o u r , , f i v e , ,
0000040 s i x \n
0000044
It can be done quite easily with Perl:
cat -v inputfile.txt
abc^@def^@ghij^@klmnop^@qrstuv^@wxyz
perl -np -e 's/\0/ /g' <inputfile.txt >outputfile.txt
cat -v outputfile.txt
abc def ghij klmnop qrstuv wxyz

Why does sed only replace the first character?

$ echo lcdefghijklmnopqrstblvcxyz | tr [a-i] [1-9] | sed 's/j/10/' | sed 's/k/11/' | sed 's/l/12/' | sed 's/m/13/' | sed 's/n/14/' | sed 's/o/15/' | sed 's/p/16/' | sed 's/q/17/' | sed 's/r/18/' | sed 's/s/19/' | sed 's/t/20/' | sed 's/u/21/' | sed 's/v/22/' | sed 's/w/23/' | sed 's/x/24/' | sed 's/y/25/' | sed 's/z/26/'
1234567891011l13141516171819202l223242526
The long command is intended to replace a..z with 1..26. Notice there are 3 "l" characters in the echoed string. Why is the first one correctly converted to "12" yet the other two (results 11l13 and 202l223) aren't?
I tried this on both my Windows 7 PC running Cygwin (bash 4.3.33(1)-release (x86_64-unknown-cygwin)) and on my MacBook Pro running Terminal (bash 3.2) and got the same results. I expected the result to be 1..26 concatenated. This is part of a bigger problem that I reduced to this test case.
You need the g flag for the substitution to be repeated:
$ echo lll | sed 's/l/12/'
12ll
$ echo lll | sed 's/l/12/'g
121212
Without the g flag, s replaces only the first instance on each line, as documented in man sed.
Also, you can put all of those commands in a single invocation of sed. You don't need all those pipes:
sed 's/j/10/g;s/k/11/g;s/l/12/g...'
Multiple sed commands (with g switch)
Under bash, you could try something like:
c=1 o=
for i in {a..z};do
o+="s/$i/$((c++))/g;"
done
sed -e "$o" <<<'lcdefghijklmnopqrstblvcxyz'
1234567891011121314151617181920212223242526
or
fold -s <<< ${o//;/; }
s/a/1/g; s/b/2/g; s/c/3/g; s/d/4/g; s/e/5/g; s/f/6/g; s/g/7/g; s/h/8/g;
s/i/9/g; s/j/10/g; s/k/11/g; s/l/12/g; s/m/13/g; s/n/14/g; s/o/15/g; s/p/16/g;
s/q/17/g; s/r/18/g; s/s/19/g; s/t/20/g; s/u/21/g; s/v/22/g; s/w/23/g; s/x/24/g;
s/y/25/g; s/z/26/g;
then
sed -e '
s/a/1/g; s/b/2/g; s/c/3/g; s/d/4/g; s/e/5/g; s/f/6/g; s/g/7/g; s/h/8/g;
s/i/9/g; s/j/10/g; s/k/11/g; s/l/12/g; s/m/13/g; s/n/14/g; s/o/15/g; s/p/16/g;
s/q/17/g; s/r/18/g; s/s/19/g; s/t/20/g; s/u/21/g; s/v/22/g; s/w/23/g; s/x/24/g;
s/y/25/g; s/z/26/g;
' <<<'lcdefghijklmnopqrstblvcxyz'
1234567891011121314151617181920212223242526
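The loop above relies on bash features ({a..z}, += and <<<). A variant that should also run under a plain POSIX sh is sketched below; the variable names are illustrative, and the trailing semicolon in the generated script is accepted by GNU sed but may need trimming for stricter implementations:

```shell
# Build "s/a/1/g;s/b/2/g;...;s/z/26/g;" without bash-only syntax.
script=
n=1
for ch in a b c d e f g h i j k l m n o p q r s t u v w x y z; do
    script="${script}s/$ch/$n/g;"
    n=$((n + 1))
done

# Run all 26 substitutions in a single sed invocation.
result=$(printf '%s\n' 'lcdefghijklmnopqrstblvcxyz' | sed "$script")
printf '%s\n' "$result"
```

Because the replacements are digits and the patterns are letters, a later substitution never rewrites the output of an earlier one.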
This might work for you (GNU sed):
sed -r '1{x;s/^/a1b2c3d4e5f6g7h8i9j10k11l12m13n14o15p16q17r18s19t20u21v22w23x24y25z26/;x};G;:a;s/([a-z])(.*\n.*\1([0-9]+))/\3\2/;ta;P;d' file
This uses a lookup table to translate the required strings.

optimize sed multiple expressions including white spaces and square brackets

I have the following command working fine, but just for learning purposes, I want to know how I can combine its sed expressions into one:
[user@localhost]$ echo '[lib:Library10] [idx:10] [Fragment] [75] [color]'| sed -e 's/\]//g' -e 's/\[//g' -e 's/\s\+/\t/g' -e 's/\:/\t/'
lib Library10 idx:10 Fragment 75 color
sed 's/[][]//g; s/:\|\s\+/\t/g'
Demonstrating:
$ echo '[lib:Library10] [idx:10] [Fragment] [75] [color]'| sed 's/[][]//g; s/:\|\s\+/\t/g'
lib Library10 idx 10 Fragment 75 color
$ echo '[lib:Library10] [idx:10] [Fragment] [75] [color]'| sed 's/[][]//g; s/:\|\s\+/\t/g' | od -c
0000000 l i b \t L i b r a r y 1 0 \t i d
0000020 x \t 1 0 \t F r a g m e n t \t 7 5
0000040 \t c o l o r \n
0000047
If you want to put a right bracket in a character class, it must be the first character, so [][] will match either a left or a right bracket.
You can group it in two blocks:
$ sed -re 's/(\]|\[)//g' -e 's/(\s+|\:)/\t/g' <<< "[lib:Library10] [idx:10] [Fragment] [75] [color]"
lib Library10 idx 10 Fragment 75 color
That is,
sed -e 's/\]//g' -e 's/\[//g' -e 's/\s\+/\t/g' -e 's/\:/\t/'
    ------------------------- ------------------------------
    |    delete ] and [     | | replace \s+ and : with tab |
    ------------------------- ------------------------------
    -re 's/(\]|\[)//g'        -e 's/(\s+|\:)/\t/g'
By pieces:
sed -e 's/\]//g' -e 's/\[//g'
can be compacted as:
sed -re 's/(\]|\[)//g'
by joining the alternatives in a (condition1|condition2) group, which requires the -r option (extended regular expressions) in sed.
And the same with the other expression.
As a side note, tr may be a better fit for deleting the [ and ] characters:
$ echo '[lib:Library10] [idx:10] [Fragment] [75] [color]' | tr -d '[]'
lib:Library10 idx:10 Fragment 75 color
And to replace : with \t you can also use tr:
$ echo '[lib:Library10] [idx:10] [Fragment] [75] [color]' | tr ':' '\t'
[lib Library10] [idx 10] [Fragment] [75] [color]
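The two tr steps can be chained to reproduce the whole transformation without sed. This is a sketch rather than a drop-in replacement: it assumes the only whitespace in the input is spaces, and it relies on tr -s squeezing runs of the translated tab. The variable name is illustrative:

```shell
# Step 1: delete the square brackets.
# Step 2: map ':' and ' ' to a tab, squeezing repeated tabs into one.
tabbed=$(echo '[lib:Library10] [idx:10] [Fragment] [75] [color]' |
    tr -d '[]' |
    tr -s ': ' '\t\t')
printf '%s\n' "$tabbed"
```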

How to change what sed thinks is the line delimiter

As I'm new with sed, I'm having the fun of seeing that sed doesn't think that the \r character is a valid line delimiter.
Does anyone know how to tell sed which character(s) I'd like it to use as the line delimiter when processing many lines of text?
You can specify it with awk's RS (record separator) variable: awk 'BEGIN {RS = "\r"} ...
Or you can convert with: tr '\r' '\n'
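A minimal sketch of the awk approach, assuming the goal is the same kind of per-line edit one would do in sed (here, replacing the first character of each record with #). The sample input and the variable name are made up for illustration:

```shell
# RS = "\r" makes awk split records on carriage returns.
# Output records end with the default ORS (newline); set ORS = "\r"
# in the BEGIN block instead if the \r delimiters must be preserved.
out=$(printf 'abc\rdef\r' | awk 'BEGIN { RS = "\r" } { sub(/^./, "#"); print }')
printf '%s\n' "$out"
```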
(For making the examples below clearer and less ambiguous, I'll use the od util extensively.)
As far as I can tell, there is no flag for this. I bet the best solution is the one given in the previous answers: using tr. Suppose you have a file such as the one below:
$ od -xc slashr.txt
0000000 6261 0d63 6564 0d66
a b c \r d e f \r
0000010
There are various ways of using tr; the one we want passes it two parameters, two different characters, and tr replaces each occurrence of the first with the second. Sending the file content as input to tr '\r' '\n' gives the following result:
$ tr '\r' '\n' < slashr.txt | od -xc
0000000 6261 0a63 6564 0a66
a b c \n d e f \n
0000010
Great! Now we can use sed:
$ tr '\r' '\n' < slashr.txt | sed 's/^./#/'
#bc
#ef
$ tr '\r' '\n' < slashr.txt | sed 's/^./#/' | od -xc
0000000 6223 0a63 6523 0a66
# b c \n # e f \n
0000010
But I presume you need to use \r as the line delimiter, right? In this case, just use tr '\n' '\r' to reverse the conversion:
$ tr '\r' '\n' < slashr.txt | sed 's/^./#/' | tr '\n' '\r' | od -xc
0000000 6223 0d63 6523 0d66
# b c \r # e f \r
0000010
As far as I know, you can't. What's wrong with using a newline as the delimiter? If your input has DOS-style \r\n line endings it can be preprocessed to remove them and, if necessary, they can be returned afterwards.
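For the DOS-style case, the preprocessing and restoring steps might look like the sketch below. It assumes every line ends in \r\n and that the edit itself never introduces a carriage return; the sample input and variable names are illustrative:

```shell
# Hold a literal carriage return in a variable, since "\r" in a sed
# replacement is not understood by every sed implementation.
cr=$(printf '\r')

# Strip the \r bytes, edit the lines, then append \r to each line again.
out=$(printf 'abc\r\ndef\r\n' |
    tr -d '\r' |
    sed -e 's/^./#/' -e "s/\$/$cr/")
printf '%s\n' "$out" | od -c
```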

Collect numerals at the beginning of the file

I have a text file which contains some numerals, for example,
There are 60 nuts and 35 apples,
but only 24 pears.
I want to collect these numerals (60, 35, 24) at the beginning of the same file, in particular, I want after processing, the file to read
read "60"
read "35"
read "24"
There are 60 nuts and 35 apples,
but only 24 pears.
How could I do this using one of the text manipulation tools available in *nix?
You can script an ed session to edit the file in place:
{
echo 0a # insert text at the beginning of the file
grep -o '[0-9]\+' nums.txt | sed 's/.*/read "&"/'
echo ""
echo . # end insert mode
echo w # save
echo q # quit
} | ed nums.txt
More succinctly:
printf "%s\n" 0a "$(grep -o '[0-9]\+' nums.txt|sed 's/.*/read "&"/')" "" . w q | ed nums.txt
One way to do it is:
egrep -o '[0-9]+' input | sed -re 's/([0-9]+)/read "\1"/' > /tmp/foo
cat input >> /tmp/foo
mv /tmp/foo input
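A slightly more defensive sketch of the same temporary-file idea, assuming mktemp and grep -o are available (neither is required by POSIX). The working directory, file names and sample text are created here only so the example is self-contained:

```shell
# Create a scratch directory and a sample input file (illustrative names).
workdir=$(mktemp -d) || exit 1
input="$workdir/nums.txt"
printf '%s\n' 'There are 60 nuts and 35 apples,' 'but only 24 pears.' > "$input"

# Write the generated "read" lines followed by the original text to a
# uniquely named temporary file, and replace the input only on success.
tmp=$(mktemp "$workdir/tmp.XXXXXX") || exit 1
{
    grep -o '[0-9][0-9]*' "$input" | sed 's/.*/read "&"/'
    cat "$input"
} > "$tmp" && mv "$tmp" "$input"

cat "$input"
```

Using a unique temporary name avoids clobbering an existing /tmp/foo, and the && keeps the original file intact if the first step fails.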