I have a file with millions of lines
And in a script I want to use the following line (which removes unprintable characters) in a loop for a given line range
sed -i $'s/[^[:print:]\t]//g' ~/test.txt
How do I do that? It needs to run very fast.
I've tried
sed -i $"{${line1},${line2}}{s/[^[:print:]\t]//g}" ~/test.txt
which is very slow for a range of 1000 lines near the end of the file.
I'm not sure of the composition of your test file (how many replacements are needed?), so I can't faithfully reproduce your test, but I'm using my /bin/bash (1168776 bytes in 5714 "lines", with 372073 (31.8%) printable/tab characters).
sed
(Baseline for timing purposes)
$ cp /bin/bash sh; time sed -i $'s/[^[:print:]\t]//g' sh
sed -i $'s/[^[:print:]\t]//g' sh 1.66s user 0.01s system 98% cpu 1.687 total
$ cp /bin/bash sh; time sed -i $'s/[^[:print:]\t]//g' sh
sed -i $'s/[^[:print:]\t]//g' sh 1.74s user 0.01s system 89% cpu 1.945 total
$ cp /bin/bash sh; time sed -i $'s/[^[:print:]\t]//g' sh
sed -i $'s/[^[:print:]\t]//g' sh 1.67s user 0.01s system 97% cpu 1.718 total
Mean average of total times = 1.783s 🚶 (it's important to run multiple times to control for caching. I ran four times and dropped the first to account for caching, then averaged to control for externalities like my web browser)
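The averaging step can be reproduced with a small helper. This is only a sketch: `mean` is a hypothetical name that does not appear in the original answer, and it assumes a POSIX awk is installed. The three inputs are the "total" values from the sed runs above.

```shell
# mean: hypothetical helper that prints the arithmetic mean of its arguments
# to three decimal places (assumes a POSIX awk is available).
mean() {
    printf '%s\n' "$@" | awk '{ sum += $1 } END { if (NR) printf "%.3f\n", sum / NR }'
}

# The three "total" times reported by the sed runs above.
mean 1.687 1.945 1.718
```

For those three values this prints 1.783, matching the figure quoted above.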
perl
I translated this to perl to see if that would be faster:
$ cp /bin/bash sh; time perl -i -pe $'s/[^[:print:]\t]//g' sh
perl -i -pe $'s/[^[:print:]\t]//g' sh 0.18s user 0.01s system 92% cpu 0.208 total
$ cp /bin/bash sh; time perl -i -pe $'s/[^[:print:]\t]//g' sh
perl -i -pe $'s/[^[:print:]\t]//g' sh 0.18s user 0.01s system 68% cpu 0.271 total
$ cp /bin/bash sh; time perl -i -pe $'s/[^[:print:]\t]//g' sh
perl -i -pe $'s/[^[:print:]\t]//g' sh 0.21s user 0.00s system 81% cpu 0.258 total
Mean average of total times = 0.246s 🏃
However, I noticed some differences. GNU sed is struggling, perhaps due to either a different definition of the [:print:] class or (more likely) different handling of control characters:
$ sed $'s/[^[:print:]\t]//g' /bin/bash |head -c64 |hd
00000000 45 4c 46 3e 30 f6 40 48 ce 40 38 40 40 40 40 68 |ELF>0.@H.@8@@@@h|
00000010 68 a8 a8 a8 98 cd 98 cd d0 d0 d0 8d d7 0a 8d d7 |h...............|
00000020 0a b0 b0 b0 30 57 30 57 f0 f0 23 f0 23 b9 a8 55 |....0W0W..#.#..U|
00000030 f0 3c f0 4c f0 4c c4 c4 c4 44 44 50 e5 74 64 30 |.<.L.L...DDP.td0|
00000040
$ perl -pe $'s/[^[:print:]\t]//g' /bin/bash |head -c64 |hd
00000000 45 4c 46 3e 30 40 48 40 38 40 40 40 40 68 68 30 |ELF>0@H@8@@@@hh0|
00000010 57 30 57 23 23 55 3c 4c 4c 44 44 50 74 64 30 49 |W0W##U<LLDDPtd0I|
00000020 30 49 30 49 44 44 51 74 64 52 74 64 23 23 2c 2c |0I0IDDQtdRtd##,,|
00000030 2f 6c 69 62 36 34 2f 6c 64 2d 6c 69 6e 75 78 2d |/lib64/ld-linux-|
00000040
See all those dots in the GNU sed output? Those are failures to replace content. I also observe these in Busybox sed. BSD sed (which is used by Mac OS X) does not appear to have this limitation, but note this for portability purposes as needed.
tr
$ cp /bin/bash sh; time tr -cd $'[ -~\t\n]' < sh > sh-tr && mv sh-tr sh
tr -cd $'[[:print:]\t]' < sh > sh-tr 0.00s user 0.01s system 62% cpu 0.012 total
$ cp /bin/bash sh; time tr -cd $'[ -~\t\n]' < sh > sh-tr && mv sh-tr sh
tr -cd $'[[:print:]\t]' < sh > sh-tr 0.00s user 0.01s system 81% cpu 0.009 total
$ cp /bin/bash sh; time tr -cd $'[ -~\t\n]' < sh > sh-tr && mv sh-tr sh
tr -cd $'[[:print:]\t]' < sh > sh-tr 0.00s user 0.01s system 82% cpu 0.012 total
Mean average of total times = 0.011s ⚡️
I've tested this with GNU tr and Busybox tr (they have equal performance). We're using tr to delete (-d) rather than translate, and we're acting on the complement (-c) of the given class (tr does not use regex, so we can't invert the character class with a caret the way we can in sed).
Busybox tr does not support $'[[:print:]\t]', so I converted it to a range from space to tilde (every printable ASCII character) and then added tab and also newline, since tr must explicitly preserve that character (sed did not need to, because it works line by line). If the lines don't match up properly, consider adding \r to the preserved set.
strings is also good here, but it does not preserve lines (it replaces each contiguous string of non-printable characters with a newline)
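The question asked for a line range, which the tr command above does not handle on its own. One possible way to combine the two is sketched below, assuming a POSIX shell with head, tail, sed, tr, and mktemp. The name `strip_range` and its argument order are hypothetical, and I have not benchmarked it against the sed range version.

```shell
# strip_range FILE LINE1 LINE2
# Hypothetical helper: delete every byte that is not printable ASCII, tab,
# or newline, but only on lines LINE1..LINE2 of FILE (1-based, inclusive).
# The body runs in a subshell so its variables do not leak into the caller.
strip_range() (
    file=$1 line1=$2 line2=$3
    tmp=$(mktemp) || exit 1
    {
        # Lines before the range are copied through untouched.
        if [ "$line1" -gt 1 ]; then
            head -n "$((line1 - 1))" "$file"
        fi
        # Only the selected lines pass through tr; sed quits after LINE2.
        LC_ALL=C sed -n "${line1},${line2}p;${line2}q" "$file" |
            LC_ALL=C tr -cd ' -~\t\n'
        # Lines after the range are copied through untouched.
        tail -n "+$((line2 + 1))" "$file"
    } > "$tmp" && mv "$tmp" "$file"
)
```

A call such as `strip_range ~/test.txt "$line1" "$line2"` rewrites the file in place. Note that the replacement file gets the permissions of the temporary file created by mktemp, so restore them afterwards if that matters.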
I have these bytes:
E6 2A 1B EF 11 00 00 00 00 00 00 4E 43 DB E8
I need to replace them with these:
64 08 1A EF 11 00 00 00 00 00 DA D8 26 04
When I started to experiment, I noticed one strange thing.
sed -e 's/\xE6/\x64/g'
This code replaces the first E6 with 64 correctly.
However, when I try to change more bytes (2A), it causes a problem.
sed -e 's/\xE6\x2A/\x64\x08/g'
As I understand it, 2A inserts the same code. How can I avoid that? I just need to replace 2A with 08. Thanks in advance :)
UPDATED
Now I'm stuck on \x26. sed -e 's/\xDB/\x26/g' refuses to replace DB with 26, but when I run s/\xDB/\xFF it works. Any ideas? So something is wrong with 26. I have tried [\x26], but it did not help.
OK. 's/\xDB/\&/g' seems to be working :)
\x2a is *, which is special in regex:
$ sed 's/a*/b/' <<<'aaa'
b
$ sed 's/a\x2a/b/' <<<'aaa'
b
You may use a bracket expression in the regex to cancel the special meaning of a character, but I see it doesn't work well with all characters in my GNU sed:
$ sed 's/\xE6[\x2A]/OK/' <<<$'I am \xE6\x2A'
I am OK
$ sed 's/[\xE6][\x2A]/OK/' <<<$'I am \xE6\x2A'
I am �*
That's because \xE6 followed by ] is not a valid UTF-8 sequence (E6 starts a three-byte character). Remember to use the C locale:
$ LC_ALL=C sed 's/[\xE6][\x2A]/OK/' <<<$'I am \xE6\x2A'
I am OK
Remember that \1 \2 etc. and & are special in replacement part too. Read Escape a string for a sed replace pattern - but you will need to escape \xXX sequences instead of each character (or convert characters to the actual bytes first, why work with all the \xXX sequences?). Or pick another tool.
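Putting those points together for the first two bytes and the DB byte from the question gives something like the sketch below. It assumes GNU sed, since \xHH escapes in sed scripts are a GNU extension, and `replace_bytes` is a hypothetical name used only for illustration.

```shell
# replace_bytes: hypothetical filter; assumes GNU sed, because \xHH escapes
# inside sed scripts are a GNU extension.
replace_bytes() {
    # [\x2A] matches a literal "*" (0x2A) instead of acting as a quantifier;
    # \& in the replacement is a literal "&" (0x26), not "the whole match".
    LC_ALL=C sed 's/\xE6[\x2A]/\x64\x08/g; s/\xDB/\&/g'
}

# Feed it the bytes E6 2A DB (written as octal escapes) and dump the result.
printf '\346\052\333\n' | replace_bytes | od -An -tx1
```

With GNU sed the dump should show 64 08 26 followed by the newline byte 0a.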
I have a shell script that I use to remotely clean an XML file produced by another system that contains invalid UNICODE characters. I am currently using this command in the script to remove the invalid characters:
perl -CSDA -i -pe's/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;' file.xml
and this has worked so far, but now the file has a new error (as far as I can tell, 'xA0'): when my perl command reaches that byte in the file, it erases the rest of the file. I modified my command to include xA0, but it doesn't work:
perl -CSDA -i -pe's/[^\x9\xA0\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;' file.xml
I have also tried using:
iconv -f UTF-8 -t UTF-8 -c file.xml > file2.xml
but that doesn't do anything. It produces an identical file with the same errors.
Is there a unix command that I can use that will completely remove all invalid UNICODE characters?
EDIT:
some HEX output (note the 1A's and A0's):
3E 1A 1A 33 30 34 39 37 1A 1A 3C 2F 70
6D 62 65 72 3E A0 39 34 32 39 38 3C 2F
You may use the following one-liner:
perl -i -MEncode -0777ne'print encode("UTF-8",decode("UTF-8",$_,sub{""}))' file.xml
You also may extend it with warnings:
perl -i -MEncode -0777ne'print encode("UTF-8",decode("UTF-8",$_,sub{warn "Bad byte: @_";""}))' file.xml
A0 is not a valid UTF-8 sequence. The errors you were encountering were XML encoding errors, while this one is a character encoding error.
A0 is the Unicode Code Point for a non-breaking space. It is also the iso-8859-1 and cp1252 encoding of that Code Point.
I would recommend fixing the problem at its source. But if that's not possible, I would recommend using Encoding::FixLatin to fix this new type of error (perhaps via the bundled fix_latin script). It will correctly replace A0 with C2 A0 (the UTF-8 encoding of a non-breaking space).
Combined with your existing script:
perl -i -MEncoding::FixLatin=fix_latin -0777pe'
$_ = fix_latin($_);
utf8::decode($_);
s/[^\x9\xA\xD\x20-\x{D7FF}\x{E000}-\x{FFFD}\x{10000}-\x{10FFFF}]+//g;
utf8::encode($_);
' file.xml
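For the 1A bytes alone (the XML-level problem, not the A0 encoding problem), a byte-level filter is one possible alternative. This is a sketch rather than a replacement for the perl above: `strip_xml_controls` is a hypothetical name, and the filter deliberately leaves the stray A0 byte untouched.

```shell
# strip_xml_controls: hypothetical filter that deletes the C0 control bytes
# that XML 1.0 forbids (everything below 0x20 except tab, LF, and CR).
# It works on raw bytes, which is safe for UTF-8 input because bytes below
# 0x80 never occur inside a multi-byte sequence.
strip_xml_controls() {
    LC_ALL=C tr -d '\000-\010\013\014\016-\037'
}

# First hex line from the question: 3E 1A 1A 33 30 34 39 37 1A 1A 3C 2F 70
printf '>\032\03230497\032\032</p\n' | strip_xml_controls
```

This prints `>30497</p`. It mirrors only the `\x9\xA\xD\x20-` lower bound of the character class in the perl substitution, so invalid UTF-8 sequences and the other excluded ranges still need the perl approach.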
Using GNU sed I'm able to replace some hex values using the following command
gsed 's/.*\xFF\xD8/\xFF\xD8/g' myfile
I'm on OS X, so the default sed is the BSD one. Unfortunately the previous command does not work with BSD sed.
Any idea why, and how to do what I'm looking for: removing everything before an FFD8 value in my file?
The simplest way to deal with that problem is to use bash's 'ANSI-C Quoting' mechanism:
sed $'s/.*\xFF\xD8/\xFF\xD8/g' myfile
Note that \xFF\xD8 is not valid UTF-8, so you may have problems with the characters, but the basic mechanism works:
$ echo sed $'s/.*\xFF\xD8/\xFF\xD8/g' | odx
0x0000: 73 65 64 20 73 2F 2E 2A FF D8 2F FF D8 2F 67 0A sed s/.*../../g.
0x0010:
$
odx is a hex dump program.
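In a shell without $'...' quoting (plain POSIX sh, for example), the same raw bytes can be produced with printf instead. The following is a minimal sketch under that assumption. `keep_from_ffd8` is a hypothetical name, and LC_ALL=C is added so that sed treats the input as bytes rather than as UTF-8 text.

```shell
# keep_from_ffd8: hypothetical filter that drops everything on a line before
# the last FF D8 byte pair (".*" is greedy, as in the command above).
keep_from_ffd8() {
    # Build the two raw bytes with printf octal escapes (377 = FF, 330 = D8).
    marker=$(printf '\377\330')
    # LC_ALL=C makes sed work on bytes, so FF D8 is not rejected as bad UTF-8.
    LC_ALL=C sed "s/.*${marker}/${marker}/"
}

printf 'junk\377\330data\n' | keep_from_ffd8 | od -An -tx1
```

The dump should show ff d8 followed by the bytes of "data" and a newline. Because sed works line by line, this only removes what precedes FF D8 on the same line; a binary file with newline bytes before the marker would need a different tool.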
My input foo.txt is this:
Grull^Zn Hernand^Zz
where the ^Z resolves to the control character \x1a (verified with od -x on the file )
When I run the following Perl command:
perl -pe s/\x1a//g foo.txt
I get the output: Grulln Hernandz
as expected. However when I redirect this to a file
perl -pe s/\x1a//g foo.txt > out.txt
The files are identical, demonstrated by
diff -c out.txt foo.txt
No differences encountered
How can I force this behavior to work as expected?
I don't know how you are ascertaining that the first version works, but it doesn't for me.
You need to either escape the backslash in the regex, or quote it (quoting it is more common).
$ hexdump -C input
00000000 61 62 63 1a 64 65 66 1a 67 68 69 0a |abc.def.ghi.|
$ perl -pe s/\x1a//g input | hexdump -C
00000000 61 62 63 1a 64 65 66 1a 67 68 69 0a |abc.def.ghi.|
$ perl -pe s/\\x1a//g input | hexdump -C
00000000 61 62 63 64 65 66 67 68 69 0a |abcdefghi.|
$ perl -pe 's/\x1a//g' input | hexdump -C
00000000 61 62 63 64 65 66 67 68 69 0a |abcdefghi.|
I don't think
perl -pe s/\x1a//g foo.txt
does what you think it does. In any sane Solaris shell, an unquoted \x is treated the same as x, and you are running the same thing as
perl -pe s/x1a//g foo.txt
You can test this by executing
echo s/\x1a//g
to see what the shell actually passes to the command. You can also try
perl -pe s/\x1a//g foo.txt | od -c
to see whether the control characters really are removed from your input.
The correct thing to do is to enclose your one-line script in single quotes:
perl -pe 's/\x1a//g' foo.txt > out.txt
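The effect of the quoting can be checked without perl at all. The sketch below uses printf with %s instead of echo, because some echo implementations interpret backslashes themselves. The variable names are only for illustration.

```shell
# Unquoted: the shell removes the backslash before the program ever sees it.
unquoted=$(printf '%s\n' s/\x1a//g)
# Single-quoted: the backslash reaches the program unchanged.
quoted=$(printf '%s\n' 's/\x1a//g')

printf 'unquoted: %s\n' "$unquoted"
printf 'quoted:   %s\n' "$quoted"
```

The first line shows `s/x1a//g` and the second shows `s/\x1a//g`, which is why only the quoted form makes perl match the control character.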
What I ultimately ended up doing (though I found out that mob's solution worked too) was, instead of typing \x1a, holding Ctrl and pressing v, then z, to enter the control character literally.
This also has the benefit of being a little more readable.
setopt rcquotes
zsh -c 'export LANG="ru_RU.CP1251"; echo "Русский текст" | iconv -f utf8 | perl -p -i -e ''BEGIN{use open ":locale"}s/\p{InCyrillic}/й/g'''
gives me a bunch of errors:
"\x{00d0}" does not map to cp1251, <> line 1.
"\x{00b9}" does not map to cp1251, <> line 1.
What should be done to avoid these errors (note that the locale may be anything)?
You forgot to denote the encoding of the substitution text. Update: In the first revision, I had a solution involving the nasty encoding pragma. It can be completely avoided, but the standard way as below did not come to my mind until now for some reason.
bash> export LANG=ru_RU.koi8r # I do not have CP…
bash> echo "Русский текст" | iconv -f UTF-8 | hex
0000 f2 d5 d3 d3 cb c9 ca 20 d4 c5 cb d3 d4 0a ������� �����.
bash> echo "Русский текст" | iconv -f UTF-8 | perl -p -i -e'BEGIN {use open ":locale"}; use utf8; s/\p{InCyrillic}/й/g' | hex
0000 ca ca ca ca ca ca ca 20 ca ca ca ca ca 0a ������� �����.
bash> echo "Русский текст" | iconv -f UTF-8 | perl -p -i -e'BEGIN {use open ":locale"}; use utf8; s/\p{InCyrillic}/й/g' | iconv -t UTF-8
ййййййй ййййй