I want to remove every NL character except those that directly follow a CR character. tr cannot do this. What is the best command to achieve this?
Perl can do this:
$ od -c file
0000000 o n e \n t w o \r \n t h r e e \n f
0000020 o u r \r \n \n
0000026
$ perl -0777 -pe 's/(\A|[^\r])\n/$1/g' file > file.out
$ od -c file.out
0000000 o n e t w o \r \n t h r e e f o u
0000020 r \r \n
0000023
Related
My source data contains special characters that are not in a readable format. Can anyone help with the following?
Source data:
Commands tried:
sed 's/../t/g' test.txt > test2.txt
You can use tr to keep only printable characters:
tr -cd "[:print:]" <test.txt > test2.txt
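This uses tr's delete option (-d) on the complement (-c) of the printable set, that is, on every non-printable character. Note that newline is not part of [:print:], so line breaks are deleted as well.
If you want to replace those special characters with something else (e.g. X):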
tr -c "[:print:]" "X" <test.txt > test2.txt
With sed, you could try this to replace non-printable characters with X:
sed -r 's/[^[:print:]]/X/g' test.txt > test2.txt
On my machine it works for some characters but fails on bytes above 127 (perhaps because the one I tried is displayed as ▒), whereas tr works perfectly.
Inline examples (printf to generate the special characters, then the filter, then od to show the bytes):
$ printf "\x01ABC\x05\xff\xe0" | od -c
0000000 001 A B C 005 377 340
0000007
$ printf "\x01ABC\x05\xff\xe0" | sed "s/[^[:print:]]//g" | od -c
0000000 A B C 377 340
0000005
$ printf "\x01ABC\x05\xff\xe0" | tr -cd "[:print:]" | od -c
0000000 A B C
0000003
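The difference above is most likely a locale effect: in a multibyte locale such as UTF-8, sed may decline to match bytes that do not form a valid character, while tr works byte by byte. A hedged sketch that forces the C locale, where [:print:] covers only printable ASCII, and that also keeps newlines. The function names are illustrative, and the sed variant is an assumption that has not been checked on every sed implementation:

```shell
# Hypothetical helpers; the names are not from the original answer.

# Keep printable ASCII plus newline, drop every other byte.
keep_printable() {
    LC_ALL=C tr -cd '[:print:]\n'
}

# Same idea with sed, line by line; LC_ALL=C asks sed to treat input as bytes.
keep_printable_sed() {
    LC_ALL=C sed 's/[^[:print:]]//g'
}

# Octal escapes are used because \x is not portable in printf.
printf '\001ABC\005\377\340\n' | keep_printable | od -c
```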
When I BCP the data in SQL Server,
I get a NUL-like character in the output file, and I want to replace it with a single blank space.
When I use the sed command below, it removes the NUL character, but no single space is left between the two delimiters.
sed 's/\x0/ /g' output file name
Example: after the sed command, the output file looks like this:
PHMO||P00000005233
PHMO||P00000005752
But I need a single space between the delimiters, as in:
PHMO| |P00000005233
PHMO| |P00000005752
The usual approach to this would be using tr. However, solutions with tr and sed are not portable. (The question is tagged "unix", so only portable solutions are interesting).
Here is a simple demo script
#!/bin/sh
date
tr '\000' ' ' <$0.in
date
sed -e 's/\x00/ /g' <$0.in
which I named foo, and its input (with the ASCII NUL shown here as ^@):
this is a null: "^@"
Running with GNU tr and sed:
Fri Apr 1 04:41:15 EDT 2016
this is a null: " "
Fri Apr 1 04:41:15 EDT 2016
this is a null: " "
With OSX:
Fri Apr 1 04:41:53 EDT 2016
this is a null: " "
Fri Apr 1 04:41:53 EDT 2016
this is a null: "^@"
With Solaris 10 (and 11, though there may be a recent change):
Fri Apr 1 04:38:08 EDT 2016
this is a null: ""
Fri Apr 1 04:38:08 EDT 2016
this is a null: ""
Bear in mind that sed is line-oriented, and that ASCII NUL is considered a binary (non-line) character. If you want a portable solution, other tools such as Perl (which does not have that limitation) are useful. In that case, one could add this to the script:
perl -np -e 's/\0/ /g' <$0.in
The intermediate tool awk is no better in this instance. Going to Solaris again, with these lines:
for awk in awk nawk mawk gawk
do
echo "** $awk:"
$awk '{ gsub("\0"," "); print; }' <$0.in
done
I see this output:
** awk:
awk: syntax error near line 1
awk: illegal statement near line 1
** nawk:
nawk: empty regular expression
source line number 1
context is
{ gsub("\0"," >>> ") <<<
** mawk:
this is a null: " "
** gawk:
this is a null: " "
Further reading:
sed - stream editor (POSIX)
tr - translate characters (POSIX), which notes
Unlike some historical implementations, this definition of the tr utility correctly processes NUL characters in its input stream. NUL characters can be stripped by using:
tr -d '\000'
perlrun - how to execute the Perl interpreter
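Given how much the behaviour differs between systems, it may be worth verifying the result rather than trusting the tool. A small sketch, assuming a tr that handles NUL as POSIX requires; nul_to_space and count_nul are made-up names, and the sample record is taken from the question:

```shell
# Hypothetical helpers for illustration.

# Replace every NUL byte on stdin with a single space.
nul_to_space() {
    tr '\000' ' '
}

# Count NUL bytes on stdin: delete everything except NUL, then count bytes.
count_nul() {
    tr -cd '\000' | wc -c | tr -d ' '
}

printf 'PHMO|\000|P00000005233\n' | count_nul                  # NULs before
printf 'PHMO|\000|P00000005233\n' | nul_to_space | count_nul   # NULs after
printf 'PHMO|\000|P00000005233\n' | nul_to_space               # converted record
```

If the second count is not zero on a given system, that system's tr is one of the implementations that mishandles NUL, and the Perl variant is the safer choice there.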
This is an easy job for sed, at least for GNU sed, which understands the \x00 escape. Let's start by creating a test file, since you didn't provide one:
$ echo -e "one,\x00,two,\x00,three" > a
$ echo -e "four,\x00,five,\x00,six" >> a
As you can see it contains ASCII 0:
$ od -c a
0000000 o n e , \0 , t w o , \0 , t h r e
0000020 e \n f o u r , \0 , f i v e , \0 ,
0000040 s i x \n
0000044
Now let's run sed:
$ sed 's/\x00/ /g' a > b
And check the output:
$ cat b
one, ,two, ,three
four, ,five, ,six
$ od -c b
0000000 o n e , , t w o , , t h r e
0000020 e \n f o u r , , f i v e , ,
0000040 s i x \n
0000044
It can be done quite easily with Perl:
cat -v inputfile.txt
abc^@def^@ghij^@klmnop^@qrstuv^@wxyz
perl -np -e 's/\0/ /g' <inputfile.txt >outputfile.txt
cat -v outputfile.txt
abc def ghij klmnop qrstuv wxyz
I have the following command working fine, but just for learning purposes, I want to know how I can combine the following sed expressions into one:
[user@localhost]$ echo '[lib:Library10] [idx:10] [Fragment] [75] [color]'| sed -e 's/\]//g' -e 's/\[//g' -e 's/\s\+/\t/g' -e 's/\:/\t/'
lib Library10 idx:10 Fragment 75 color
sed 's/[][]//g; s/:\|\s\+/\t/g'
Demonstrating:
$ echo '[lib:Library10] [idx:10] [Fragment] [75] [color]'| sed 's/[][]//g; s/:\|\s\+/\t/g'
lib Library10 idx 10 Fragment 75 color
$ echo '[lib:Library10] [idx:10] [Fragment] [75] [color]'| sed 's/[][]//g; s/:\|\s\+/\t/g' | od -c
0000000 l i b \t L i b r a r y 1 0 \t i d
0000020 x \t 1 0 \t F r a g m e n t \t 7 5
0000040 \t c o l o r \n
0000047
If you want to put a right bracket in a character class, it must be the first character, so [][] will match either a left or a right bracket.
You can group it in two blocks:
$ sed -re 's/(\]|\[)//g' -e 's/(\s+|\:)/\t/g' <<< "[lib:Library10] [idx:10] [Fragment] [75] [color]"
lib Library10 idx 10 Fragment 75 color
That is,
sed -e 's/\]//g' -e 's/\[//g'   -e 's/\s\+/\t/g' -e 's/\:/\t/'
    -------------------------   ------------------------------
    |    delete ] and [     |   | replace \s+ and : with tab |
    -------------------------   ------------------------------
    -re 's/(\]|\[)//g'          -e 's/(\s+|\:)/\t/g'
By pieces:
sed -e 's/\]//g' -e 's/\[//g'
can be compacted as:
sed -re 's/(\]|\[)//g'
by joining the alternatives as (pattern1|pattern2), together with sed's -r option, which enables extended regular expressions.
The same applies to the other expression.
As a side note, tr may be a better fit for deleting the [ and ] characters:
$ echo '[lib:Library10] [idx:10] [Fragment] [75] [color]' | tr -d '[]'
lib:Library10 idx:10 Fragment 75 color
And to replace : with \t you can also use tr:
$ echo '[lib:Library10] [idx:10] [Fragment] [75] [color]' | tr ':' '\t'
[lib Library10] [idx 10] [Fragment] [75] [color]
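Chaining the two tr ideas gives the whole transformation without sed. A sketch, with brackets_to_tsv as an illustrative name; -s squeezes runs of the translated character, so several consecutive spaces or colons still yield a single tab:

```shell
# Hypothetical helper: strip [ and ], then turn ':' and runs of spaces into tabs.
brackets_to_tsv() {
    tr -d '[]' | tr -s ': ' '\t\t'
}

echo '[lib:Library10] [idx:10] [Fragment] [75] [color]' | brackets_to_tsv | od -c
```

Unlike the \s in the sed version, this only treats the space character as whitespace, which is enough for the sample input.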
I just learned that in Perl, the symbol table for a given module is stored in a hash that matches the module name -- so, for example, the symbol table for the fictional module Foo::Bar would be %Foo::Bar::. The default symbol table is stored in %main::. Just for the sake of curiosity, I decided that I wanted to see what was in %main::, so I iterated through each key/value pair in the hash, printing them out as I went:
#! /usr/bin/perl
use v5.14;
use strict;
use warnings;
my $foo;
my $bar;
my %hash;
while( my ( $key, $value ) = each %:: ) {
say "Key: '$key' Value '$value'";
}
The output looked like this:
Key: 'version::' Value '*main::version::'
Key: '/' Value '*main::/'
Key: '' Value '*main::'
Key: 'stderr' Value '*main::stderr'
Key: '_<perl.c' Value '*main::_<perl.c'
Key: ',' Value '*main::,'
Key: '2' Value '*main::2'
...
I was expecting to see the STDOUT and STDERR file handles, and perhaps @INC and %ENV... what I wasn't expecting to see was non-printable characters... what the code block above doesn't show is that the third line of the output actually had a glyph indicating a non-printable character.
I ran the script and piped it as follows:
perl /tmp/asdf.pl | grep '[^[:print:]]' | while read line
do
echo $line
od -c <<< $line
echo
done
The output looked like this:
Key: '' Value '*main::'
0000000 K e y : ' 026 ' V a l u e '
0000020 * m a i n : : 026 ' \n
0000032
Key: 'ARNING_BITS' Value '*main::ARNING_BITS'
0000000 K e y : ' 027 A R N I N G _ B I
0000020 T S ' V a l u e ' * m a i n
0000040 : : 027 A R N I N G _ B I T S ' \n
0000060
Key: '' Value '*main::'
0000000 K e y : ' 022 ' V a l u e '
0000020 * m a i n : : 022 ' \n
0000032
Key: 'E_TRIE_MAXBUF' Value '*main::E_TRIE_MAXBUF'
0000000 K e y : ' 022 E _ T R I E _ M A
0000020 X B U F ' V a l u e ' * m a
0000040 i n : : 022 E _ T R I E _ M A X B
0000060 U F ' \n
0000064
Key: ' Value '*main:'
0000000 K e y : ' \b ' V a l u e '
0000020 * m a i n : : \b ' \n
0000032
Key: '' Value '*main::'
0000000 K e y : ' 030 ' V a l u e '
0000020 * m a i n : : 030 ' \n
0000032
So what are non-printable characters doing in the Perl symbol table? What are they symbols for?
Guru is on the right track: specifically, the answer is to be found in perlvar, which says:
"Perl variable names may also be a sequence of digits or a single punctuation or control character. These names are all reserved for special uses by Perl; for example, the all-digits names are used to hold data captured by backreferences after a regular expression match. Perl has a special syntax for the single-control-character names: It understands ^X (caret X) to mean the control-X character. For example, the notation $^W (dollar-sign caret W) is the scalar variable whose name is the single character control-W. This is better than typing a literal control-W into your program.
Since Perl 5.6, Perl variable names may be alphanumeric strings that begin with control characters (or better yet, a caret). These variables must be written in the form ${^Foo}; the braces are not optional. ${^Foo} denotes the scalar variable whose name is a control-F followed by two o's. These variables are reserved for future special uses by Perl, except for the ones that begin with ^_ (control-underscore or caret-underscore). No control-character name that begins with ^_ will acquire a special meaning in any future version of Perl; such names may therefore be used safely in programs. $^_ itself, however, is reserved."
If you want to print those names in a readable way, you could add a line like this to your code:
$key = '^' . ($key ^ '@') if $key =~ /^[\0-\x1f]/;
If the first character of $key is a control character, this will replace it with a caret followed by the corresponding letter (^A for control-A, ^B for control-B, etc.).
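A shell-side alternative, if changing the script is not an option: cat -v prints control characters in the same caret notation. This is a sketch that assumes a cat supporting -v (the GNU and BSD versions do, but POSIX does not require the option); the sample line imitates the output shown in the question:

```shell
# \027 is control-W, so the key below is the name behind ${^WARNING_BITS}.
printf "Key: '\027ARNING_BITS'\n" | cat -v
# prints: Key: '^WARNING_BITS'
```

In practice the whole script output could be piped through cat -v in the same way.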
Perl has special variables such as $", $,, $/, $\ and so on. All of these are part of the symbol table, which is what you are seeing. You should also be able to see @INC and %ENV in the symbol table.
As I'm new to sed, I'm having the fun of discovering that sed doesn't treat the \r character as a valid line delimiter.
Does anyone know how to tell sed which character(s) I'd like it to use as the line delimiter when processing many lines of text?
You can specify it with awk's RS (record separator) variable: awk 'BEGIN {RS = "\r"} ...
Or you can convert with: tr '\r' '\n'
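To make the awk suggestion concrete, here is a sketch that sets both RS and ORS to \r, so records are split on carriage returns and written back the same way. The sub() call is only a stand-in for whatever per-line edit is needed, and cr_records is an illustrative name:

```shell
# Hypothetical helper: treat \r as the record (line) delimiter in awk.
cr_records() {
    awk 'BEGIN { RS = "\r"; ORS = "\r" } { sub(/^./, "#"); print }'
}

printf 'abc\rdef\r' | cr_records | od -c
```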
(To make the examples below clearer and less ambiguous, I'll use the od utility extensively.)
It is not possible to do this with, for example, a flag. I bet the best solution is the one given in the previous answers: using tr. Suppose you have a file such as the one below:
$ od -xc slashr.txt
0000000 6261 0d63 6564 0d66
a b c \r d e f \r
0000010
There are various ways of using tr; the one we want is to pass it two parameters, two different characters, so that tr replaces the first with the second. Sending the file content as input to tr '\r' '\n' gives the following result:
$ tr '\r' '\n' < slashr.txt | od -xc
0000000 6261 0a63 6564 0a66
a b c \n d e f \n
0000010
Great! Now we can use sed:
$ tr '\r' '\n' < slashr.txt | sed 's/^./#/'
#bc
#ef
$ tr '\r' '\n' < slashr.txt | sed 's/^./#/' | od -xc
0000000 6223 0a63 6523 0a66
# b c \n # e f \n
0000010
But I presume you need to use \r as the line delimiter, right? In this case, just use tr '\n' '\r' to reverse the conversion:
$ tr '\r' '\n' < slashr.txt | sed 's/^./#/' | tr '\n' '\r' | od -xc
0000000 6223 0d63 6523 0d66
# b c \r # e f \r
0000010
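The round trip can be wrapped in a small function. This is only a sketch (sed_cr is a made-up name), and it assumes the input contains no real newlines, since those would come back as \r as well:

```shell
# Hypothetical wrapper: run sed on \r-delimited input and restore the delimiter.
# All arguments are passed through to sed unchanged.
sed_cr() {
    tr '\r' '\n' | sed "$@" | tr '\n' '\r'
}

printf 'abc\rdef\r' | sed_cr 's/^./#/' | od -c
```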
As far as I know, you can't. What's wrong with using a newline as the delimiter? If your input has DOS-style \r\n line endings it can be preprocessed to remove them and, if necessary, they can be returned afterwards.