I know this is an old question, but none of the answers I found help in the following scenario:
fc /u TextA.txt TextB.txt
compares the two Unicode encoded txt files and displays the result correctly (!) on the screen.
As expected,
fc /u TextA.txt TextB.txt > Comp.txt
does not result in a Unicode encoded file.
Unfortunately the method used in similar situations
cmd /u /c fc /u TextA.txt TextB.txt > Comp.txt
does not work; the generated file is ANSI-encoded.
I hope somebody here can help ...
EDITED (after first comments): The problem seems to be that cmd /u (or chcp) works only with "internal" commands (like dir). fc is not an internal command ... (Thanks to LotPings!)
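To see the difference yourself, compare an internal and an external command under cmd /u (a minimal illustration; the output file names here are just examples): the redirected dir output comes out as UTF-16LE, while the redirected fc output stays ANSI/OEM:
cmd /u /c dir > DirUnicode.txt
cmd /u /c fc /u TextA.txt TextB.txt > CompAnsi.txt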
Short answer:
Use PowerShell's Compare-Object cmdlet as follows:
Compare-Object (Get-Content ".\fileA.txt") (Get-Content ".\fileB.txt")
For example, to write customized output to a file:
Compare-Object (Get-Content ".\fileA.txt") (Get-Content ".\fileB.txt") |
Format-Table -Property SideIndicator, InputObject -AutoSize -HideTableHeaders -Wrap |
Out-File .\fileAB.txt -Encoding unicode
or
Compare-Object (Get-Content ".\fileA.txt") (Get-Content ".\fileB.txt") -PassThru |
Out-File .\fileAB.txt -Encoding unicode
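If you also want the matching lines in the report (closer to a full fc-style listing), Compare-Object's -IncludeEqual switch can be added; a sketch along the same lines:
Compare-Object (Get-Content ".\fileA.txt") (Get-Content ".\fileB.txt") -IncludeEqual |
Out-File .\fileAB.txt -Encoding unicode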
Original answer (see also amendment below):
The č letter (Latin Small Letter C With Caron, code point U+010D) appears in code pages 775/1257 (Baltic) and 852/1250 (Central Europe). I would assume the latter, as the word koča sounds like a common Slavic term for the English hut, cabin or cottage.
Reproducing the problem: the next example shows a possible mojibake case between OEM and ANSI code pages; apparently, cmd.exe itself performs some implicit (and unclear) character code transformations:
D:\test\Unicode> powershell -c "'fileA','fileB'|ForEach-Object {$_; Get-Content .\$_.txt}"
fileA
a lc ěščřžýáíé ď ť ň
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
fileB
b lc ěščřžýáíé ď ť ň
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
D:\test\Unicode> chcp
Active code page: 1250
D:\test\Unicode> fc.exe /U .\fileA.txt .\fileB.txt > .\CompAB_1250.txt
D:\test\Unicode> type .\CompAB_1250.txt
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc Řçźý§ě ˇ‚ Ô ś ĺ
a UC ·ć¬ü¦íµÖ Ň › Ő
***** .\FILEB.TXT
b lc Řçźý§ě ˇ‚ Ô ś ĺ
b UC ·ć¬ü¦íµÖ Ň › Ő
*****
cmd fix:
D:\test\Unicode> chcp 852
Active code page: 852
D:\test\Unicode> fc.exe /U .\fileA.txt .\fileB.txt > .\CompAB_852.txt
D:\test\Unicode> type .\CompAB_852.txt
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****
In the above example, both CompAB_1250.txt (garbled) and CompAB_852.txt (valid) are encoded in a single-byte code page. To get Unicode output, use PowerShell as follows:
PowerShell fix #1. Force PowerShell to use code page 852 from the command line (run chcp 852 explicitly before calling powershell):
D:\test\Unicode> chcp 852
Active code page: 852
D:\test\Unicode> powershell -c ". fc.exe /U .\fileA.txt .\fileB.txt > .\CompAB.txt"
D:\test\Unicode> powershell -c "'CompAB' | ForEach-Object {$_; Get-Content .\$_.txt}"
CompAB
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****
PowerShell fix #2. Force PowerShell to use code page 852 on the fly, regardless of the active console code page, and keep the latter unchanged (for illustration, code page 1252 is chosen, which does not contain most of the letters used):
D:\test\Unicode> chcp 1252
Active code page: 1252
D:\test\Unicode> powershell -c "[System.Console]::OutputEncoding=[System.Text.ASCIIEncoding]::GetEncoding(852);. fc.exe /U .\fileA.txt .\fileB.txt > .\CompAB.txt"
D:\test\Unicode> powershell -c "'CompAB' | ForEach-Object {$_; Get-Content .\$_.txt}"
CompAB
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****
D:\test\Unicode> chcp
Active code page: 1252
Please run the following commands from a newly opened cmd window for further insight:
powershell -c "[console]::OutputEncoding"
chcp 1252
powershell -c "[console]::OutputEncoding"
chcp 1250
powershell -c "[console]::OutputEncoding"
chcp 852
powershell -c "[console]::OutputEncoding"
rem etc. etc. etc.
Edit (amendment): finally tested with some Greek characters added to the input files; fc.exe output looks fine from the command line (fc.exe /U .\fileA.txt .\fileB.txt) or even from PowerShell:
D:\test\Unicode> powershell -c ". fc.exe /U .\fileA.txt .\fileB.txt"
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a Ελληνικά ΕΛΛΗΝΙΚΆ
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b Ελληνικά ΕΛΛΗΝΙΚΆ
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****
However, redirecting the above output to a file with > (or piping it with | into another cmdlet) loses information, so some characters are either garbled (mojibake) or at least replaced by a ? question mark, e.g. as follows:
PS D:\test\Unicode> . fc.exe /U .\fileA.txt .\fileB.txt | ForEach-Object {$_}
Comparing files .\fileA.txt and .\FILEB.TXT
***** .\fileA.txt
a lc ěščřžýáíé ď ť ň
a ???????? ????????
a UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
***** .\FILEB.TXT
b lc ěščřžýáíé ď ť ň
b ???????? ????????
b UC ĚŠČŘŽÝÁÍÉ Ď Ť Ň
*****
Related
How can Perl do input from stdin, one char at a time, like read -N1 does?
You can do that with the base perl distribution, no need to install extra packages:
use strict;
sub IO::Handle::icanon {
    my ($fh, $on) = @_;     # filehandle and desired canonical-mode state
    use POSIX;
    my $ts = POSIX::Termios->new;
    $ts->getattr(fileno $fh) or die "tcgetattr: $!";
    my $f = $ts->getlflag;
    $ts->setlflag($on ? $f | ICANON : $f & ~ICANON);
    $ts->setattr(fileno $fh) or die "tcsetattr: $!";
}
# usage example
# a key like `Left` or `á` may generate multiple bytes
STDIN->icanon(0);
sysread STDIN, my $c, 256;
STDIN->icanon(1);
# the read key is in $c
Reading just one byte may not be a good idea because it will just leave garbage to be read later when pressing a key like Left or F1. But you can replace the 256 with 1 if you want just that, no matter what.
<STDIN> will read one byte at a time from stdin (a C char, which is not the same thing as a character: these days characters are typically made of several bytes, except for those in the US-ASCII charset) if the record separator is set to a reference to the number 1.
$ echo perl | perl -le '$/ = \1; $a = <STDIN>; print "<$a>"'
<p>
Note that underneath, it may read (consume) more than one byte from the input. Above, the next <STDIN> within perl would return <e>, but possibly from some large buffer that was read beforehand.
$ echo perl | (perl -le '$/ = \1; $a = <STDIN>; print "<$a>"'; wc -c)
<p>
0
Above, you'll notice that wc didn't receive any input as it had all already been consumed by perl.
$ echo perl | (PERLIO=raw perl -le '$/ = \1; $a = <STDIN>; print "<$a>"'; wc -c)
<p>
4
This time, wc got 4 bytes (e, r, l, \n) as we told perl to use raw I/O, so the <STDIN> translates to a read(0, buf, 1).
Instead of <STDIN>, you can use perl's read with the same caveat:
$ echo perl | (perl -le 'read STDIN, $a, 1; print "<$a>"'; wc -c)
<p>
0
$ echo perl | (PERLIO=raw perl -le 'read STDIN, $a, 1; print "<$a>"'; wc -c)
<p>
4
Or use sysread which is the true wrapper for the raw read():
$ echo perl | (perl -le 'sysread STDIN, $a, 1; print "<$a>"'; wc -c)
<p>
4
To read one character at a time, you need to read one byte at a time until the end of the character.
You can do it for UTF-8 encoded input (in locales using that encoding) in perl with <STDIN> or read (not sysread) with the -C option, including with raw PERLIO:
$ echo été | (PERLIO=raw perl -C -le '$/ = \1; $a = <STDIN>; print "<$a>"'; wc -c)
<é>
4
$ echo été | (PERLIO=raw perl -C -le 'read STDIN, $a, 1; print "<$a>"'; wc -c)
<é>
4
With strace, you'd see perl does two read(0, buf, 1) system calls underneath to read that 2-byte é character.
Like with ksh93 / bash's read -N (or zsh's read -k), you can get surprises if the input is not properly encoded in UTF-8:
$ printf '\375 12345678' | (PERLIO=raw perl -C -le 'read STDIN, $a, 1; print "<$a>"'; wc -c)
<� 1234>
4
\375 (\xFD) would normally be the first byte of the encoding of a 6 byte character in UTF-8¹, so perl reads all 6 bytes here even though the second to sixth can't possibly be part of that character as they don't have the 8th bit set.
Note that when stdin is a tty device, read() will not return until the terminal at the other end sends a LF (eol), a CR (which is by default converted to LF), an eof (usually ^D), or an eol2 (usually not defined) character, as configured in the tty line discipline (e.g. with the stty command); the tty driver implements its own internal line editor, allowing you to edit what you type before pressing Enter.
If you want to read the byte(s) that is (are) sent for each key pressed by the user there, you'd need to disable that line editor (which bash/ksh93's read -N and zsh's read -k do when stdin is a tty); see @guest's answer for details on how to do that.
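For a quick shell-level illustration of what disabling that line editor means (a sketch only; it assumes a POSIX shell with stdin on a terminal, and the variable names are mine):
saved=$(stty -g)                    # remember the current tty settings
stty -icanon min 1 time 0           # leave canonical mode: read() returns after each byte
key=$(dd bs=1 count=1 2>/dev/null)  # grab a single byte (multi-byte keys send more)
stty "$saved"                       # restore the line editor
printf 'got: %s\n' "$key"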
¹ While Unicode now restricts code points to at most 0x10FFFF, which means UTF-8 encodings have at most 4 bytes, UTF-8 was originally designed to encode code points up to 0x7FFFFFFF (up to 6-byte encodings), and perl extends it to up to 0x7FFFFFFFFFFFFFFF (13-byte encodings).
First, this is not a duplicate of, e.g., How can I replace each newline (\n) with a space using sed?
What I want is to exactly replace every newline (\n) in a string, like so:
printf '%s' $'' | sed '...; s/\n/\\&/g'
should result in the empty string
printf '%s' $'a' | sed '...; s/\n/\\&/g'
should result in a (not followed by a newline)
printf '%s' $'a\n' | sed '...; s/\n/\\&/g'
should result in
a\
(the trailing \n of the final line should be replaced, too)
A solution like :a;N;$!ba; s/\n/\\&/g from the other question doesn't do that properly:
printf '%s' $'' | sed ':a;N;$!ba; s/\n/\\&/g' | hd
works;
printf '%s' $'a' | sed ':a;N;$!ba;s/\n/\\&/g' | hd
00000000 61 |a|
00000001
works;
printf '%s' $'a\nb' | sed ':a;N;$!ba;s/\n/\\&/g' | hd
00000000 61 5c 0a 62 |a\.b|
00000004
works;
but when there's a trailing \n on the last line
printf '%s' $'a\nb\n' | sed ':a;N;$!ba;s/\n/\\&/g' | hd
00000000 61 5c 0a 62 0a |a\.b.|
00000005
it doesn't get quoted.
It's easier to use perl than sed here, since it has (by default, at least) a more straightforward treatment of the newlines in its input:
printf '%s' '' | perl -pe 's/\n/\\\n/' # Empty string
printf '%s' a | perl -pe 's/\n/\\\n/' # a
printf '%s\n' a | perl -pe 's/\n/\\\n/' # a\<newline>
printf '%s\n' a b | perl -pe 's/\n/\\\n/' # a\<newline>b\<newline>
# etc
If your inputs aren't huge, you could use
perl -0777 -pe 's/\n/\\\n/g'
to read the entire input at once rather than line by line, which can be more efficient.
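For instance (a quick check in the same style as the examples above), the trailing newline gets escaped and an input without one stays without one:
printf '%s\n' a b | perl -0777 -pe 's/\n/\\\n/g'   # a\<newline>b\<newline>
printf '%s' a     | perl -0777 -pe 's/\n/\\\n/g'   # a (no newline to escape)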
How to replace newline characters with a string in sed
It's not possible in standard sed: from a sed script's point of view, a missing trailing newline makes no difference and is undetectable.
Anyway, you can use GNU sed with its -z option:
sed -z 's/\n/\\\n/g'
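A quick check (GNU sed only; with -z the whole input, including any trailing newline, ends up in the pattern space, and sed does not append a delimiter that was not there):
printf 'a\nb\n' | sed -z 's/\n/\\\n/g'   # a\<newline>b\<newline>
printf 'a\nb'   | sed -z 's/\n/\\\n/g'   # a\<newline>b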
GNU awk can use the RT variable to detect a missing record terminator:
$ printf 'a\nb\n' | gawk '{ORS=(RT != "" ? "\\" : "") RT} 1'
a\
b\
$ printf 'a\nb' | gawk '{ORS=(RT != "" ? "\\" : "") RT} 1'
a\
b$
This adds a "\" before each non-empty record terminator.
Using any awk:
$ printf 'a\nb\n\n' | awk '{printf "%s%s", sep, $0; sep="\\\n"}'
a\
b\
$ printf 'a\nb\n' | awk '{printf "%s%s", sep, $0; sep="\\\n"}'
a\
b$
Or { cat file; echo; } | awk ... – always add a newline to the input.
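Spelled out with the same awk program as above (file stands for your input file):
{ cat file; echo; } | awk '{printf "%s%s", sep, $0; sep="\\\n"}'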
Using Perl, I want to replace the CRLF at the end of a line beginning with "ID" with |.
So, to be more explicit: if a line begins with "ID", I want to replace the CRLF at the end of that line with |.
This is what I have done:
elsif ($line =~ /^ID:\n/) { print $outputFile $line."|"; }
I think that it is not good ..
Depending on the platform, \n has different meanings. From perlport:
LF eq \012 eq \x0A eq \cJ eq chr(10) eq ASCII 10
CR eq \015 eq \x0D eq \cM eq chr(13) eq ASCII 13
     | Unix | DOS  | Mac |
---------------------------
\n   |  LF  |  LF  |  CR  |
\r   |  CR  |  CR  |  LF  |
\n * |  LF  | CRLF |  CR  |
\r * |  CR  |  CR  |  LF  |
---------------------------
* text-mode STDIO
You could do:
elsif ($line =~ /^(ID\b.*)\R/) { print $outputFile "$1|" }
\R stands for any kind of linebreak.
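Put into context, a minimal sketch built around the same \R idea ($line and $outputFile come from the question; $inputFile is a placeholder for the input handle, and line-by-line reading is assumed):
while (my $line = <$inputFile>) {
    if ($line =~ /^ID\b/) {
        $line =~ s/\R\z/|/;   # replace the trailing CR, LF or CRLF with |
    }
    print $outputFile $line;  # write the (possibly modified) line
}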
I am improving a script listing duplicated files that I wrote last year (see the second script if you follow the link).
The record separator of the duplicated.log output is the zero byte instead of the newline character \n. Example:
$> tr '\0' '\n' < duplicated.log
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
32 dir6/video.m4v
32 dir7/video.m4v
(In this example, the five files dir1/index.htm, ... and dir5/index.htm have the same md5sum and their size is 12 bytes. The other two files, dir6/video.m4v and dir7/video.m4v, have the same md5sum and their content size (du) is 32 bytes.)
As each line is terminated by a zero byte (\0) instead of a newline character (\n), blank lines are represented as two successive zero bytes (\0\0).
I use the zero byte as line separator because file path names may contain newline characters.
But doing that, I am faced with this issue:
How to 'grep' all duplicates of a specified file from duplicated.log?
(e.g. How to retrieve duplicates of dir1/index.htm?)
I need:
$> ./youranswer.sh "dir1/index.htm" < duplicated.log | tr '\0' '\n'
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
$> ./youranswer.sh "dir4/index.htm" < duplicated.log | tr '\0' '\n'
12 dir1/index.htm
12 dir2/index.htm
12 dir3/index.htm
12 dir4/index.htm
12 dir5/index.htm
$> ./youranswer.sh "dir7/video.m4v" < duplicated.log | tr '\0' '\n'
32 dir6/video.m4v
32 dir7/video.m4v
I was thinking about some thing like:
awk 'BEGIN { RS="\0\0" } #input record separator is double zero byte
/filepath/ { print $0 }' duplicated.log
...but filepath may contain slash symbols / and many other symbols (quotes, newlines, ...).
I may have to use perl to deal with this situation...
I am open to any suggestions, questions, other ideas...
You're almost there: use the matching operator ~:
awk -v RS='\0\0' -v pattern="dir1/index.htm" '$0~pattern' duplicated.log
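Note that the path is used there as a regular expression, so characters like . or ( in a file name are treated specially. If that is a concern, a literal substring match with index() avoids it (a sketch using the same record separator):
awk -v RS='\0\0' -v str="dir1/index.htm" 'index($0, str)' duplicated.log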
I have just realized that I could use the md5sum instead of the pathname because in my new version of the script I am keeping the md5sum information.
This is the new format I am currently using:
$> tr '\0' '\n' < duplicated.log
12 89e8a208e5f06c65e6448ddeb40ad879 dir1/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir2/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir3/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir4/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir5/index.htm
32 fc191f86efabfca83a94d33aad2f87b4 dir6/video.m4v
32 fc191f86efabfca83a94d33aad2f87b4 dir7/video.m4v
gawk and nawk give the wanted result:
$> awk 'BEGIN { RS="\0\0" }
/89e8a208e5f06c65e6448ddeb40ad879/ { print $0 }' duplicated.log |
tr '\0' '\n'
12 89e8a208e5f06c65e6448ddeb40ad879 dir1/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir2/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir3/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir4/index.htm
12 89e8a208e5f06c65e6448ddeb40ad879 dir5/index.htm
But I am still open to your answers :-)
(this current answer is just a workaround)
For the curious, below is the new (horrible) script under construction...
#!/bin/bash
fifo=$(mktemp -u)
fif2=$(mktemp -u)
dups=$(mktemp -u)
dirs=$(mktemp -u)
menu=$(mktemp -u)
numb=$(mktemp -u)
list=$(mktemp -u)
mkfifo $fifo $fif2
# run processing in background
find . -type f -printf '%11s %P\0' | #print size and filename
tee $fifo | #write in fifo for dialog progressbox
grep -vzZ '^ 0 ' | #ignore empty files
LC_ALL=C sort -z | #sort by size
uniq -Dzw11 | #keep files having same size
while IFS= read -r -d '' line
do #for each file compute md5sum
echo -en "${line:0:11}" "\t" $(md5sum "${line:12}") "\0"
#file size + md5sum + file name + null terminated instead of '\n'
done | #keep the duplicates (same md5sum)
tee $fif2 |
uniq -zs12 -w46 --all-repeated=separate |
tee $dups |
#xargs -d '\n' du -sb 2<&- | #retrieve size of each file
gawk '
function tgmkb(size) {
if(size<1024) return int(size) ; size/=1024;
if(size<1024) return int(size) "K"; size/=1024;
if(size<1024) return int(size) "M"; size/=1024;
if(size<1024) return int(size) "G"; size/=1024;
return int(size) "T"; }
function dirname (path)
{ if(sub(/\/[^\/]*$/, "", path)) return path; else return "."; }
BEGIN { RS=ORS="\0" }
!/^$/ { sz=substr($0,0,11); name=substr($0,48); dir=dirname(name); sizes[dir]+=sz; files[dir]++ }
END { for(dir in sizes) print tgmkb(sizes[dir]) "\t(" files[dir] "\tfiles)\t" dir }' |
LC_ALL=C sort -zrshk1 > $dirs &
pid=$!
tr '\0' '\n' <$fifo |
dialog --title "Collecting files having same size..." --no-shadow --no-lines --progressbox $(tput lines) $(tput cols)
tr '\0' '\n' <$fif2 |
dialog --title "Computing MD5 sum" --no-shadow --no-lines --progressbox $(tput lines) $(tput cols)
wait $pid
DUPLICATES=$( grep -zac -v '^$' $dups) #total number of files concerned
UNIQUES=$( grep -zac '^$' $dups) #number of files, if all redundant are removed
DIRECTORIES=$(grep -zac . $dirs) #number of directories concerned
lins=$(tput lines)
cols=$(tput cols)
cat > $menu <<EOF
--no-shadow
--no-lines
--hline "After selection of the directory, you will choose the redundant files you want to remove"
--menu "There are $DUPLICATES duplicated files within $DIRECTORIES directories.\nThese duplicated files represent $UNIQUES unique files.\nChoose directory to proceed redundant file removal:"
$lins
$cols
$DIRECTORIES
EOF
tr '\n"' "_'" < $dirs |
gawk 'BEGIN { RS="\0" } { print FNR " \"" $0 "\" " }' >> $menu
dialog --file $menu 2> $numb
[[ $? -eq 1 ]] && exit
set -x
dir=$( grep -zam"$(< $numb)" . $dirs | tac -s'\0' | grep -zam1 . | cut -f4- )
md5=$( grep -zam"$(< $numb)" . $dirs | tac -s'\0' | grep -zam1 . | cut -f2 )
grep -zao "$dir/[^/]*$" "$dups" |
while IFS= read -r -d '' line
do
file="${line:47}"
awk 'BEGIN { RS="\0\0" } '"/$md5/"' { print $0 }' >> $list
done
echo -e "
fifo $fifo \t dups $dups \t menu $menu
fif2 $fif2 \t dirs $dirs \t numb $numb \t list $list"
#rm -f $fifo $fif2 $dups $dirs $menu $numb
How to squeeze runs of "#" characters down to a single "#" with sed?
From:
param=## ### ff ## e ##44
To:
param=# # ff # e #44
One way to do it using extended regexps:
vinko@parrot:~$ echo "## ### ff ## e ##44" | sed -r s/#+/#/g
# # ff # e #44
With regular regexps:
vinko@parrot:~$ echo "## ### ff ## e ##44" | sed -e s/##*/#/g
# # ff # e #44
Only after the equal sign:
vinko@parrot:~$ echo "param=## ### ff ## e ##44" | sed s/=##*/=#/g
param=# ### ff ## e ##44
POSIX version (for non-GNU sed):
sed 's/#\{2,\}/#/g' YourFile