How do applications know character encoding?

Let's say I have two files as below:
$ ll
total 8
-rw-rw-r--. 1 matias matias 6 Nov 27 20:25 ascii.txt
-rw-rw-r--. 1 matias matias 8 Nov 28 21:57 unicode.txt
Both contain a single line of text, but there is an extra character (the Greek letter Sigma) in the second file, as shown here:
$ cat ascii.txt
matias
$ cat unicode.txt
matiasΣ
If I pass them through the file command, this is the output:
$ file *
ascii.txt: ASCII text, with no line terminators
unicode.txt: UTF-8 Unicode text, with no line terminators
Which seems OK. Now, if I make a hexdump of the files, I get this:
$ hexdump -C ascii.txt
00000000 6d 61 74 69 61 73 |matias|
00000006
$ hexdump -C unicode.txt
00000000 6d 61 74 69 61 73 ce a3 |matias..|
00000008
So, my question is: how does an application such as cat know that the last two bytes are actually a single Unicode character? If I print the last two bytes individually I get:
$ printf '%d' '0xce'
206
$ printf '%d' '0xa3'
163
Which in extended ASCII are:
$ py3 -c 'print(chr(206))'
Î
$ py3 -c 'print(chr(163))'
£
Is my logic flawed? What am I missing here?

Command-line tools work with bytes – they receive bytes and send bytes.
The notion of a character – be it represented by a single or multiple bytes – is a task-specific interpretation of the raw bytes.
When you call cat on a UTF-8 file, I assume it just forwards the bytes it reads without caring about characters.
But your terminal, which has to display the output of cat, does take care to interpret the bytes as characters and show a single character for the byte sequence 206, 163.
From its configuration (locale env vars etc.), your terminal apparently assumes that text IO happens with UTF-8.
If this assumption is violated (e.g. if a command sends the byte 206 in isolation, which is invalid UTF-8), you will see � symbols or other text garbage.
Since UTF-8 was designed to be backwards-compatible with ASCII, ASCII text files can be treated just like UTF-8 files (they are UTF-8).
While cat probably doesn't care about characters, many other commands do, e.g. the wc -m command to count characters (not bytes!) in a text file.
Such commands all need to know how UTF-8 (or whatever your terminal encoding is) maps bytes to characters and vice versa.
For example, when you print(chr(206)) in Python, it sends the bytes 195, 142 to STDOUT because:
(a) it has figured out your terminal expects UTF-8 and (b) the character "Î" (to which Unicode codepoint 206 corresponds) is represented with these two bytes in UTF-8.
Finally, the terminal displays "Î", because it decodes the two bytes to the corresponding character.
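To make the byte-level view concrete, here is a small Python sketch (standard library only; not part of either answer) that reproduces the numbers from the question: the two bytes ce a3 decode to the single character Σ under UTF-8, while code point 206 ("Î") encodes to the two bytes c3 8e:
# Decode the two bytes from unicode.txt's hexdump: under UTF-8 they form one character.
print(bytes([0xce, 0xa3]).decode('utf-8'))      # Σ
# Encode the character with code point 206 ("Î") as UTF-8: two bytes come out.
print(list(chr(206).encode('utf-8')))           # [195, 142]
# The same two bytes read as Latin-1 ("extended ASCII") give the two unrelated characters from the question.
print(bytes([0xce, 0xa3]).decode('latin-1'))    # Î£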

How do applications know character encoding?
Either:
They guess, perhaps with heuristics. (This isn't "knowing".)
They tell you exactly which one to use (via documentation, standard, convention, etc.). (This isn't really "knowing" either.)
They allow you to tell them which one you are using.
It's your file; you have to know.
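As a rough illustration of the last point (you tell the program which encoding you are using), here is a hedged Python sketch; unicode.txt is the file from the question, and any tool that accepts an explicit encoding parameter behaves analogously:
# You, the owner of the file, declare the encoding; the program does not "know" it.
with open('unicode.txt', encoding='utf-8') as f:
    print(f.read())      # matiasΣ
# Declare the wrong encoding and you get the "extended ASCII" reading instead.
with open('unicode.txt', encoding='latin-1') as f:
    print(f.read())      # matiasÎ£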

Related

How to remove leading whitespace denoted with ? using rename in macOS

I have directories that look like this on my macOS:
For example ???9-24_v_hMrgprx2 where ??? is actually
whitespace. What I want to do is to use rename to remove those leading whitespace characters.
I tried this but failed.
rename "s/\s*//g" *
What's the right way to do it?
Update
Hexdump looks like this:
ls | hexdump -C
00000000 e3 80 80 39 2d 32 34 5f 76 5f 68 4d 72 67 70 72 |...9-24_v_hMrgpr|
00000010 78 32 0a |x2.|
00000013
Verify what those characters are first, since macOS doesn't display ASCII whitespace characters in filenames as ? (unless you have some weird encoding issue going on). It would help if you added information like this to your question:
$ touch " touched"
$ ls -l *touched
-rw-rw-r--# 1 brian staff 0 Aug 18 13:52 touched
$ ls *touched | hexdump -C
00000000 20 20 20 74 6f 75 63 68 65 64 0a | touched.|
0000000b
For rename, you've almost got it right if those leading characters were whitespace. However, you want to anchor the pattern so you only match whitespace at the beginning of the name:
rename 's/\A\s+//' *
Now that we know your filenames start with U+3000 (which is whitespace), I can see what's going on.
There are various versions of rename. Larry Wall wrote one, tchrist wrote one based on that (and I use that one), and File::Rename is another modification of Larry's original. Then there is Aristotle's version.
The problem with my rename (from tchrist) is that it doesn't interpret the filenames as UTF-8. So, U+3000 looks like the three bytes you see: e3 80 80. I'm guessing that your font might not support any of those. There could be all sorts of things going on. See tchrist's Unicode answer.
I can create the file:
% perl -CS -le 'print qq(\x{3000}abc)' | xargs touch
I can easily see the file, but I have a font that can display that character:
$ ls -l *abc
-rw-rw-r-- 1 brian staff 0 Aug 22 02:44  abc
But, when I try to rename it, using the -n for a dry run, I get no output (so, no matching files to change):
$ rename -n 's/\A\s+//' *abc
If I run perl directly and give it -CSA to treat the standard file handles (-CS) and the command-line arguments (-CA) as UTF-8, the file matches and the replacement happens:
$ perl -CSA `which rename` -n 's/\A\s+//' *abc
rename  abc abc
So, for my particular version, I can edit the shebang line to have the options I need. This works for me because I know my terminal settings, so it might not work everywhere for all settings:
#!/usr/bin/env perl -CSA
But the trick is how did I get that version of rename? I'm pretty sure I installed some module from CPAN that gave it to me, but what? I'd supply a patch if I could.
E3 80 80 is the UTF-8 encoding of U+3000 (IDEOGRAPHIC SPACE), which is a CJK whitespace character.
rename is not a standard utility on macOS, and there are several popular utilities with this name, so what exactly works will depend on which version you have installed. The syntax looks like you have this one, from the Perl distribution. Maybe try
rename 's/\xe3\x80\x80//' *
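If none of the installed rename variants cooperates, the same clean-up can be sketched in Python, which treats filenames as Unicode strings, so U+3000 counts as strippable whitespace; this is just an illustration, not one of the answers above:
import os
# Strip leading whitespace (str.lstrip() with no argument also strips U+3000
# IDEOGRAPHIC SPACE) from every entry in the current directory.
for name in os.listdir('.'):
    new_name = name.lstrip()
    if new_name and new_name != name:
        os.rename(name, new_name)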

What is this sed command doing?

I am looking for non-printable characters in a file, and I found this web page.
It shows the following command:
sed "l" file
If I am not mistaken, according to the man page, this command is described as:
List out the current line in a ''visually unambiguous'' form.
Moreover, when I run this command on a fake file with one line, the output is as follows:
The line is displayed twice, but the first time, each output line contains at most 69 bytes of the input line; the rest of the line is displayed on the next line.
The second time the line is displayed, it is in its full length.
fake file
toto, titi, tatafdsfdsfdgfgfdsgrgdfgzfdgzgffgerssssssssssssssssssssssssss
Command
sed "l" fake_file
output
$ sed "l" fake_file
toto, titi, tatafdsfdsfdgfgfdsgrgdfgzfdgzgffgerssssssssssssssssssssss\
ssss$
toto, titi, tatafdsfdsfdgfgfdsgrgdfgzfdgzgffgerssssssssssssssssssssssssss
Questions
What does ''visually unambiguous'' exactly mean?
Why is the output like this? I was expecting only one line, with the $ sign at the end. I was also not expecting the output to be wrapped at 69 bytes.
Environment
Tested with same output on:
sed (GNU sed) 4.7
sed (GNU sed) 4.2.2
By default, sed prints the pattern space (the line it has just processed) after each cycle; that automatic printing is the second, full-length copy of the line you see. Since the l command already produces its own output, tell sed not to print the line automatically by using the -n switch.
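For the original goal of spotting non-printable characters, a comparable effect can be sketched in Python (offered only as an aside): printing each line's bytes object shows non-printable and non-ASCII bytes as \xNN escapes and makes the line ending visible, much like sed's ''visually unambiguous'' form:
# Show each line of fake_file with non-printable bytes escaped.
with open('fake_file', 'rb') as f:
    for line in f:
        print(line)    # e.g. b'toto, titi, ...ssss\n'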

How can I redirect input in PowerShell without a BOM?

I am trying to redirect input in PowerShell by:
Get-Content input.txt | my-program args
The problem is the piped UTF-8 text is preceded with a BOM (0xEFBBBF), and my program cannot handle that correctly.
A minimal working example:
// File: Hex.java
import java.io.IOException;
public class Hex {
public static void main(String[] dummy) {
int ch;
try {
while ((ch = System.in.read()) != -1) {
System.out.print(String.format("%02X ", ch));
}
} catch (IOException e) {
}
}
}
Then in PowerShell:
javac Hex.java
Set-Content textfile "ABC" -Encoding Ascii
# Now the content of textfile is 0x41 42 43 0D 0A
Get-Content textfile | java Hex
Or simply
javac Hex.java
Write-Output "ABC" | java Hex
In either case, the output is EF BB BF 41 42 43 0D 0A.
How can I pipe the text into the program without 0xEFBBBF?
Note:
The following contains general information that, in a normally functioning PowerShell environment, would explain the OP's symptom. That the solution doesn't work in the OP's case is due to machine-specific causes that are unknown at this point.
This answer is about sending BOM-less UTF-8 to an external program; if you're looking to make your PowerShell console windows use UTF-8 in all respects, see this answer.
To ensure that your Java program receives its input UTF-8-encoded without a BOM, you must set $OutputEncoding to a System.Text.UTF8Encoding instance that does not emit a BOM:
# Assigns UTF-8 encoding *without a BOM*.
# PowerShell uses this encoding to encode data piped to external programs.
# $OutputEncoding defaults to ASCII(!) in Windows PowerShell, and more sensibly
# to BOM-*less* UTF-8 in PowerShell [Core] v6+
$OutputEncoding = [Text.UTF8Encoding]::new($false)
Caveats:
Do NOT use the seemingly equivalent New-Object Text.Utf8Encoding $false, because, due to the bug described in GitHub issue #5763, it won't work if you assign to $OutputEncoding in a non-global scope, such as in a script. In PowerShell v4 and below, use (New-Object Text.Utf8Encoding $false).psobject.BaseObject as a workaround.
Windows 10 version 1903 and up allows you to set BOM-less UTF-8 as the system-wide default encoding (although note that the feature is still classified as beta as of version 20H2) - see this answer. In PowerShell [Core] up to v7.0 (this was fixed in PowerShell 7.1), with this feature turned on, the above technique is not effective, due to a presumptive .NET Core bug that causes a UTF-8 BOM always to be emitted, irrespective of what encoding you set $OutputEncoding to (the bug is possibly connected to GitHub issue #28929); the only solution is to turn the feature off, as shown in imgx64's answer.
If, by contrast, you use [Text.Encoding]::Utf8, you'll get a System.Text.Encoding.UTF8 instance with BOM - which is what I suspect happened in your case.
Note that this problem is unrelated to the source encoding of any file read by Get-Content, because what is sent through the PowerShell pipeline is never a stream of raw bytes, but .NET objects, which in the case of Get-Content means that .NET strings are sent (System.String, internally a sequence of UTF-16 code units).
Because you're piping to an external program (a Java application, in your case), PowerShell character-encodes the (stringified-on-demand) objects sent to it based on preference variable $OutputEncoding, and the resulting encoding is what the external program receives.
Perhaps surprisingly, even though BOMs are typically only used in files, PowerShell respects the BOM setting of the encoding assigned to $OutputEncoding also in the pipeline, prepending it to the first line sent (only).
See the bottom section of this answer for more information about how PowerShell handles pipeline input for and output from external programs, including how it is [Console]::OutputEncoding that matters when PowerShell interprets data received from external programs.
To illustrate the difference using your sample program (note how using a PowerShell string literal as input is sufficient; no need to read from a file):
# Note the EF BB BF sequence representing the UTF-8 BOM.
# Enclosure in & { ... } ensures that a local, temporary copy of $OutputEncoding
# is used.
PS> & { $OutputEncoding = [Text.Encoding]::Utf8; 'hö' | java Hex }
EF BB BF 68 C3 B6 0D 0A
# Note the absence of EF BB BF, due to using a BOM-less
# UTF-8 encoding.
PS> & { $OutputEncoding = [Text.Utf8Encoding]::new($false); 'hö' | java Hex }
68 C3 B6 0D 0A
In Windows PowerShell, where $OutputEncoding defaults to ASCII(!), you'd see the following with the default in place:
# The default of ASCII(!) results in *lossy* encoding in Windows PowerShell.
PS> 'hö' | java Hex
68 3F 0D 0A
Note that 3F represents the literal ? character, which is what the non-ASCII ö character was transliterated to, given that it has no representation in ASCII; in other words: information was lost.
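What "transliterated to ?" means at the byte level can be reproduced in Python (purely as an illustration of lossy ASCII encoding, not of PowerShell's mechanics):
# Encoding a non-ASCII character to ASCII with replacement yields a literal '?' (0x3F).
print('hö'.encode('ascii', errors='replace'))   # b'h?'    -> bytes 68 3F
print('hö'.encode('utf-8').hex())               # '68c3b6' -> bytes 68 C3 B6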
PowerShell [Core] v6+ now sensibly defaults to BOM-less UTF-8, so the default behavior there is as expected.
While BOM-less UTF-8 is PowerShell [Core]'s consistent default, also for cmdlets that read from and write to files, on Windows [Console]::OutputEncoding still reflects the active OEM code page by default as of v7.0, so to correctly capture output from UTF-8-emitting external programs, it must be set to [Text.UTF8Encoding]::new($false) as well - see GitHub issue #7233.
You could try setting the OutputEncoding to UTF-8 without BOM:
# Keep the current output encoding in a variable
$oldEncoding = [console]::OutputEncoding
# Set the output encoding to use UTF8 without BOM
[console]::OutputEncoding = New-Object System.Text.UTF8Encoding $false
Get-Content input.txt | my-program args
# Reset the output encoding to the previous
[console]::OutputEncoding = $oldEncoding
If the above has no effect and your program does understand UTF-8, but only expects it to be without the 3-byte BOM, then you can try removing the BOM from the content and piping the result to your program:
(Get-Content 'input.txt' -Raw -Encoding UTF8) -replace '^\xef\xbb\xbf' | my-program args
If ever you have 'hacked' the codepage with chcp 65001, I recommend turning that back to chcp 5129 for English - New Zealand. See here.
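For reference, the BOM in question is just the three-byte sequence EF BB BF at the start of the stream. If you control the consuming program, it can also tolerate the BOM itself; a minimal Python sketch of that idea (reading raw bytes from stdin is an assumption for illustration):
import codecs
import sys
# Read raw bytes, drop a leading UTF-8 BOM (EF BB BF) if present, then decode.
data = sys.stdin.buffer.read()
if data.startswith(codecs.BOM_UTF8):
    data = data[len(codecs.BOM_UTF8):]
sys.stdout.write(data.decode('utf-8'))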
Although mklement0's answer worked for me on one PC, it didn't work on another PC.
The reason was that I had the Beta: Use Unicode UTF-8 for worldwide language support checkbox selected in Language → Administrative language settings → Change system locale.
I unchecked it and now $OutputEncoding = [Text.UTF8Encoding]::new($false) works as expected.
It's odd that enabling it forces BOM, but I guess it's beta for a reason.

Generating md5 checksum using Windows Certutil program

I need to generate an md5 hash for a list of tif and jpg images.
The hash must be inserted in an XML file with metadata about each image, which will be used for the digitalisation of the document.
The use of the md5 hash was not my decision, but a formal requirement of a standard based on the Dublin Core for the digitalisation of these kinds of documents.
(Screenshot: XML file, with the md5 tag underlined.)
I am currently generating each md5 hash using Windows built-in Certutil program from the command prompt.
My question is simple: am I doing this right?
I know the process is slow, but the list is short.
(Screenshot: Certutil hash function.)
It looks OK. Remember that specifying the hash algorithm (MD5) works from Windows 7 and up (older Windows versions throw an error) and that it must be in uppercase.
You can also add find /v "hash" to get only the hash itself, like this:
certUtil -hashfile pathToFileToCheck MD5 | find /v "hash"
For example, running on Windows 8, I got this output:
C:\Users\xxxx\Documents>certutil -hashfile innfo MD5
MD5 hash of file innfo:
67 4b ba 79 42 32 d6 24 f0 56 91 b6 da 41 34 6d
CertUtil: -hashfile command completed successfully.
and with find /v "hash" I've got:
C:\Users\xxxx\Documents>certutil -hashfile innfo MD5 | find /v "hash"
67 4b ba 79 42 32 d6 24 f0 56 91 b6 da 41 34 6d
The find trick is to exclude (that's the /v parameter) lines containing the string "hash", and you have to specify the string in double quotes. As the first and last lines around the hash itself contain the word "hash", you get clean output.
My version works from cmd, not PowerShell.
I often need to create a hash file to place on an FTP service.
These hash files must contain the name of the original file to allow automatic verification that is done by various tools.
For example, if you have a file named foo.txt, it is necessary to have a file foo.txt.md5 with the following content:
a3713593c5edb65c8287eb6ff9ec4bc0 *foo.txt
The following batch does the job
for %%a in (%1) do set filename=%%~nxa
certutil -hashfile "%1" md5 | find /V "hash" >temp.txt
for /f "delims=" %%i in (temp.txt) do set hash=%%i
echo %hash% *%filename%>%1.md5
del temp.txt
You can replace md5 with sha256 or sha512 if you need; certutil supports them.
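If a scripting language is available, the same kind of .md5 file can be produced without certutil; here is a rough Python sketch of the batch file above (the "digest *filename" format follows the example, everything else is an assumption):
import hashlib
import os
import sys
# For each file given on the command line, write <file>.md5 containing
# "<hex digest> *<file name>", like the foo.txt.md5 example above.
for path in sys.argv[1:]:
    with open(path, 'rb') as f:
        digest = hashlib.md5(f.read()).hexdigest()
    with open(path + '.md5', 'w') as out:
        out.write('{} *{}\n'.format(digest, os.path.basename(path)))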

Does UTF-8 encoding mess up file globbing and grep'ing?

I'm playing with bash, experimenting with UTF-8 encoding. I'm new to Unicode.
The following commands (well, their output) surprise me:
$ locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=
$ printf '1\né\n12\n123\n' | egrep '^(.|...)$'
1
é
12
$ touch 1 é 12 123
$ ls | egrep '^(.|...)$'
1
123
OK. The two egrep commands filter lines with one or three characters. Their input is quite similar, but the output differs for the character é. Any explanation?
More details on my environment :
$ uname -a
Darwin macbook-pro-de-admin-6.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386
$ egrep -V
egrep (GNU grep) 2.5.1
Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Any variable-length encoding can mess with tools that are not aware of the encoding and consider bytes, not characters, when you use single-character wildcards (because the tool assumes that byte = character). If you use literal characters, then for UTF-8 it doesn't matter, since the structure of UTF-8 prevents matches in the middle of a character (assuming proper encoding).
At least some versions of grep are supposed to be UTF-8 aware; according to http://mailman.uib.no/public/corpora/2006-December/003760.html, GNU grep 2.5.1 and later are, as long as an appropriate LANG is set. If you use an older version, however, or something other than GNU grep, that would likely be the cause of your problem, since é is a two-byte character (0xC3 0xA9).
EDIT: Based on your recent comment, your grep is probably Unicode-aware, but it does not perform any sort of Unicode normalization (and I wouldn't really expect it to, to be honest).
0x65 0xCC 0x81 is an e, followed by COMBINING ACUTE ACCENT (U+0301). This is effectively two characters, but it's rendered as one due to the semantics of combining characters. This then causes grep to detect it as two characters; one for the e and one for the accent.
It seems likely that decomposed Unicode is how the file name is actually stored in your file system - otherwise, you could store files that, for all intents and purposes, have the exact same name, but only differ in their use of combining characters.
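The composed/decomposed difference described here is easy to check with Python's unicodedata module; the two spellings of é render identically but have different lengths and different UTF-8 bytes, which is exactly what the ^(.|...)$ pattern is reacting to (a side note, not part of the original answer):
import unicodedata
nfc = '\u00e9'      # é as a single precomposed code point
nfd = 'e\u0301'     # e followed by COMBINING ACUTE ACCENT (U+0301)
print(nfc == nfd)                                 # False
print(len(nfc), len(nfd))                         # 1 2
print(unicodedata.normalize('NFC', nfd) == nfc)   # True
print(nfc.encode('utf-8').hex())                  # c3a9
print(nfd.encode('utf-8').hex())                  # 65cc81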