Does UTF-8 encoding mess up file globbing and grepping?

I'm playing with bash and experimenting with UTF-8 encoding. I'm new to Unicode.
The output of the following commands surprises me:
$ locale
LANG="fr_FR.UTF-8"
LC_COLLATE="fr_FR.UTF-8"
LC_CTYPE="fr_FR.UTF-8"
LC_MESSAGES="fr_FR.UTF-8"
LC_MONETARY="fr_FR.UTF-8"
LC_NUMERIC="fr_FR.UTF-8"
LC_TIME="fr_FR.UTF-8"
LC_ALL=
$ printf '1\né\n12\n123\n' | egrep '^(.|...)$'
1
é
12
$ touch 1 é 12 123
$ ls | egrep '^(.|...)$'
1
123
OK. Both egrep commands select lines with one or three characters. Their input looks the same, but the output differs for the character é. Is there an explanation?
More details on my environment:
$ uname -a
Darwin macbook-pro-de-admin-6.local 10.4.0 Darwin Kernel Version 10.4.0: Fri Apr 23 18:28:53 PDT 2010; root:xnu-1504.7.4~1/RELEASE_I386 i386
$ egrep -V
egrep (GNU grep) 2.5.1
Copyright 1988, 1992-1999, 2000, 2001 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.

Any variable-length encoding can confuse tools that are not aware of the encoding and that operate on bytes rather than characters. This shows up when you use single-character wildcards, because the tool assumes that one byte equals one character. If you use literal characters, it doesn't matter for UTF-8, since the structure of UTF-8 prevents matches in the middle of a character (assuming proper encoding).
At least some versions of grep are supposed to be UTF-8 aware. According to http://mailman.uib.no/public/corpora/2006-December/003760.html, this includes GNU grep 2.5.1 and later, as long as an appropriate LANG is set. If you use an older version, or something other than GNU grep, that would likely be the cause of your problem, since é is a two-byte character in UTF-8 (0xC3 0xA9).
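To make the byte-versus-character distinction concrete, here is a minimal Python sketch. Python is used only for illustration and is not what grep does internally. It applies the question's pattern once to raw bytes and once to decoded text.

```python
import re

composed = "\u00e9"                 # é as a single code point (U+00E9)
raw = composed.encode("utf-8")      # the two bytes 0xC3 0xA9

# Byte-oriented matching: '.' consumes one byte, so é counts as two.
byte_match = re.fullmatch(rb".|...", raw)

# Character-oriented matching: '.' consumes one code point.
char_match = re.fullmatch(r".|...", composed)

print(len(raw), byte_match is not None)       # 2 False
print(len(composed), char_match is not None)  # 1 True
```

A tool that counts bytes would reject the two-byte é for a one-or-three pattern, while an encoding-aware tool accepts it as one character.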
EDIT: Based on your recent comment, your grep is probably Unicode-aware, but it does not perform any sort of Unicode normalization (and I wouldn't really expect it to, to be honest).
0x65 0xCC 0x81 is an e, followed by COMBINING ACUTE ACCENT (U+0301). This is effectively two characters, but it's rendered as one due to the semantics of combining characters. This then causes grep to detect it as two characters; one for the e and one for the accent.
It seems likely that decomposed Unicode is how the file name is actually stored in your file system. Otherwise, you could store files that, for all intents and purposes, have the exact same name and differ only in their use of combining characters.
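The effect of composed versus decomposed forms can be reproduced outside the shell. The following is a small sketch using Python's standard unicodedata module. It only illustrates the counting argument above and assumes nothing about how the file system chooses its form.

```python
import re
import unicodedata

composed = "\u00e9"                                   # precomposed é
decomposed = unicodedata.normalize("NFD", composed)   # e + U+0301

pattern = re.compile(r".|...")   # the question's one-or-three pattern

def matches(name):
    """True if name is exactly one or exactly three code points long."""
    return pattern.fullmatch(name) is not None

print(decomposed.encode("utf-8"))                          # b'e\xcc\x81'
print(len(composed), matches(composed))                    # 1 True
print(len(decomposed), matches(decomposed))                # 2 False
print(matches(unicodedata.normalize("NFC", decomposed)))   # True
```

Normalizing names to NFC before filtering is one way to get the same answer for both inputs.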

Related

Using SED on MAC (zsh) to get first jpg after marker string

Please Note: I found other gnu implementations of this, but they don't seem to work on a mac. This question is specifically for MacOS running zsh
I'm trying to pipe some output into SED and use it to find the first jpg after a marker string.
Here is my sample .sh file:
Phrase="where is \“frankenstien\" tonight.jpg with my hamburger tomorrow.jpg"
echo $Phrase | sed 's/.*\frankenstien" \(.*\)jpg/\1/'
The marker string is “frankenstien" (WITH quotes). I would like the output to be:
tonight.jpg
But instead it's:
tonight.jpg with my hamburger tomorrow.
So obviously the sequence passed to SED is wrong, how should I write it so that it stops after the first jpg AND includes the ".jpg" in it? I found many examples online of similar things but they did not work for MAC running zsh. Can the same code work on macs running bash? If you only get it to work on bash that might be good enough.
Thanks!

If the first jpg immediately follows the frankenstien string (the marker), you can modify your regex as below. This should work on any POSIX-compliant sed, as it does not use any GNU-specific constructs.
sed 's/.*\"frankenstien\" \([^ ]*\).*/\1/'
The above regex captures the string after the marker up to the next space and ignores the rest.
P.S. Note that the shell does not play a role in how your regex is interpreted by sed. Remember that sed is a binary of its own that ships with your OS (GNU sed on Linux, BSD sed on macOS). A few features are supported in one and not in the other (GNU vs. *BSD), but the shell should not come into the picture here. E.g. on macOS with zsh as the default shell, you can have both BSD sed (shipped by default) and GNU sed (installable using Homebrew).

how should I write it so that it stops after the first jpg AND includes the ".jpg" in it?
Match up until a space.
sed 's/.*frankenstien" \([^ ]*\) .*/\1/' <<<"$Phrase"
Handle tab also:
sed 's/.*frankenstien" \([^[:space:]]*\)[[:space:]].*/\1/' <<<"$Phrase"
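For comparison, the same "match up to the next whitespace" idea can be sketched in Python. The helper name first_token_after is made up for this illustration. Unlike the sed commands above, which use a greedy .* and therefore anchor on the last occurrence of the marker, re.search anchors on the first occurrence.

```python
import re

phrase = 'where is "frankenstien" tonight.jpg with my hamburger tomorrow.jpg'

def first_token_after(marker, text):
    """Return the first whitespace-delimited token following marker, or None."""
    m = re.search(re.escape(marker) + r"\s+(\S+)", text)
    return m.group(1) if m else None

print(first_token_after('"frankenstien"', phrase))  # tonight.jpg
```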

How do applications know character encoding?

Let's say I have two files as below:
$ ll
total 8
-rw-rw-r--. 1 matias matias 6 Nov 27 20:25 ascii.txt
-rw-rw-r--. 1 matias matias 8 Nov 28 21:57 unicode.txt
Both contain a single line of text, but there is an extra character in the second file (the Greek letter Sigma), as shown here:
$ cat ascii.txt
matias
$ cat unicode.txt
matiasΣ
If I pass them through the file command, this is the output:
$ file *
ascii.txt: ASCII text, with no line terminators
unicode.txt: UTF-8 Unicode text, with no line terminators
Which seems OK. Now if I make a hexdump of the files, I get this:
$ hexdump -C ascii.txt
00000000 6d 61 74 69 61 73 |matias|
00000006
$ hexdump -C unicode.txt
00000000 6d 61 74 69 61 73 ce a3 |matias..|
00000008
So, my question is: how does an application such as cat know that the last two bytes are actually a single Unicode character? If I print the last two bytes individually I get:
$ printf '%d' '0xce'
206
$ printf '%d' '0xa3'
163
Which in extended ASCII are:
$ py3 -c 'print(chr(206))'
Î
$ py3 -c 'print(chr(163))'
£
Is my logic flawed? What am I missing here?

Command-line tools work with bytes – they receive bytes and send bytes.
The notion of a character – be it represented by a single or multiple bytes – is a task-specific interpretation of the raw bytes.
When you call cat on a UTF-8 file, I assume it just forwards the bytes it reads without caring about characters.
But your terminal, which has to display the output of cat, does take care to interpret the bytes as characters and show a single character for the byte sequence 206, 163.
From its configuration (locale env vars etc.), your terminal apparently assumes that text IO happens with UTF-8.
If this assumption is violated (eg. if a command sends the byte 206 in isolation, which is invalid UTF-8), you will see � symbols or other text garbage.
Since UTF-8 was designed to be backwards-compatible with ASCII, ASCII text files can be treated just like UTF-8 files (they are UTF-8).
While cat probably doesn't care about characters, many other commands do, eg. the wc -m command to count characters (not bytes!) in a text file.
Such commands all need to know how UTF-8 (or whatever your terminal encoding is) maps bytes to characters and vice versa.
For example, when you print(chr(206)) in Python, then it sends the bytes 195, 142 to STDOUT because:
(a) it has figured out your terminal expects UTF-8 and (b) the character "Î" (to which Unicode codepoint 206 corresponds) is represented with these two bytes in UTF-8.
Finally, the terminal displays "Î", because it decodes the two bytes to the corresponding character.
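The byte values discussed in this answer can be checked directly. This is a short Python sketch. It compares values instead of printing non-ASCII text so that it does not depend on the terminal's encoding.

```python
sigma = "\u03a3"                       # Σ, GREEK CAPITAL LETTER SIGMA
data = sigma.encode("utf-8")
print(list(data))                      # [206, 163]

# The same two bytes read as Latin-1 give the two characters from the question.
as_latin1 = data.decode("latin-1")
print(as_latin1 == "\u00ce\u00a3")     # True (Î followed by £)

# chr(206) is the code point U+00CE; its UTF-8 form is two different bytes.
print(list(chr(206).encode("utf-8")))  # [195, 142]

# A lone lead byte is invalid UTF-8; with errors="replace" it becomes U+FFFD.
lone = bytes([206]).decode("utf-8", errors="replace")
print(lone == "\ufffd")                # True
```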

How do applications know character encoding?
Either:
(They guess—perhaps with heuristics. This isn't "knowing".)
They tell you exactly which one to use (via documentation, standard, convention, etc). (This isn't really "knowing" either.)
They allow you to tell them which one you are using.
It's your file; You have to know.
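The options in the list above can be sketched as a single function. The helper decode_with_fallback is hypothetical and its guessing rule is only one possible heuristic: strict UTF-8 is tried first because bytes in another encoding rarely happen to be valid UTF-8, and Latin-1 is a last resort because it accepts any byte sequence and may therefore be wrong.

```python
def decode_with_fallback(data, declared=None):
    """Decode bytes with a caller-declared encoding, otherwise guess.

    Returns a (text, encoding_used) pair.
    """
    if declared is not None:
        # The caller told us the encoding: no guessing involved.
        return data.decode(declared), declared
    try:
        return data.decode("utf-8"), "utf-8"
    except UnicodeDecodeError:
        # Heuristic fallback; never fails, but the result may be mojibake.
        return data.decode("latin-1"), "latin-1"

print(decode_with_fallback(b"matias\xce\xa3")[1])       # utf-8
print(decode_with_fallback(b"caf\xe9")[1])              # latin-1
print(decode_with_fallback(b"\xd6\xd0", "gb2312")[1])   # gb2312
```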

Format string on Linux and Solaris

I have a Korn Shell script, and one part of it is that it takes a given date in YYYYMMDD format and outputs it in YYYY/MM/DD format. At first I tried
typeset displaystart=`date --date="${gbegdate}" '+%Y/%m/%d'`
which works fine on Linux, but Solaris's date doesn't have a --date option. I then tried
typeset displaystart=`echo ${gbegdate:0:4}`/`echo ${gbegdate:4:2}`/`echo ${gbegdate:6:2}`
which also works on Linux, but on Solaris it just outputs //.
How can I format this date string in a way that works on Linux and Solaris?

The ${variable:start:length} extension to POSIX shell syntax was introduced in the version of ksh released in 1993, precisely named ksh93, and was also introduced in bash 1.13 the very same year.
The Advanced bash scripting guide from the Linux Documentation Project states:
Variable expansion / Substring replacement
These constructs have been adopted from ksh.
${var:pos}
Variable var expanded, starting from offset pos.
${var:pos:len}
Expansion to a max of len characters of variable var,
from offset pos. See Example A-13 for an example of the creative use
of this operator.
The issue is that on Solaris 10 and older, /bin/ksh is ksh88, an earlier ksh version that didn't implement this feature.
On the other hand, on Linux, ksh is often ksh93 which supports substring extraction. That explains why your script works under Linux ksh (if you really tested it on ksh.)
An old derivative of ksh93 is available on Solaris 10, though. It is named dtksh and is located in /usr/dt/bin/dtksh. Your command should work unchanged with it. However, I wouldn't recommend switching to dtksh entirely, since this shell is being phased out of Solaris. You can still call it from a regular ksh script to work around your issue:
typeset displaystart=$(/usr/dt/bin/dtksh -c "gbegdate=$gbegdate; echo \${gbegdate:0:4}/\${gbegdate:4:2}/\${gbegdate:6:2}")
Note that Solaris 11 and newer provide both GNU date and ksh93 so you wouldn't have that issue in the first place.

Korn shell doesn't have ${variable:start:length} syntax; this is a bash extension to POSIX shell syntax.
You can use echo "$variable" | cut -cstart-end instead.
typeset displaystart=`echo $gbegdate | cut -c1-4`/`echo $gbegdate | cut -c5-6`/`echo $gbegdate | cut -c7-8`
Or maybe you could change your script to use bash instead of ksh.
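The two shell approaches in this thread select the same characters but count differently: ${var:pos:len} uses a zero-based offset plus a length, while cut -c uses one-based inclusive ranges. Here is a small Python sketch of the mapping. The function name slash_date is made up for this example.

```python
def slash_date(yyyymmdd):
    """Turn 'YYYYMMDD' into 'YYYY/MM/DD'."""
    if len(yyyymmdd) != 8 or not yyyymmdd.isdigit():
        raise ValueError("expected 8 digits, got %r" % yyyymmdd)
    # ${v:0:4} -> v[0:4], ${v:4:2} -> v[4:6], ${v:6:2} -> v[6:8]
    # cut -c1-4, -c5-6 and -c7-8 pick the same characters (1-based, inclusive).
    return "/".join((yyyymmdd[0:4], yyyymmdd[4:6], yyyymmdd[6:8]))

print(slash_date("20240131"))  # 2024/01/31
```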

Which is the Dockerfile encoding?

Defining my Dockerfile I got to this line:
...
MAINTAINER Ramón <ramon#example.com>
...
Which encoding shall I use to save this file?
Shall I escape non ASCII characters?

Considering Docker is done in Go, and Go has native support for utf-8, it is best to save a Dockerfile directly encoded in UTF-8.
That way, all characters (ASCII or not) are supported.
See "Dealing with encodings in Go".
Even though Go has good support for UTF-8 (and minimal support for UTF-16), it has no built-in support for any other encoding.
If you have to use other encodings (e.g. when dealing with user input), you have to use third party packages, like for example go-charset.
Here, it is best if the Dockerfile is directly encoded in UTF-8.
Update July 2016, docker 1.12-rc5 adds:
PR 23372: Support unicode characters in parseWords
PR 23234: Skip UTF-8 BOM bytes from Dockerfile and .dockerignore if exist
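As an illustration of what "skip UTF-8 BOM bytes" means, here is a minimal Python sketch. It is not Docker's implementation, and the helper name read_utf8_text is an assumption made for this example.

```python
import codecs

def read_utf8_text(raw):
    """Decode bytes as strict UTF-8, dropping a leading BOM if present."""
    if raw.startswith(codecs.BOM_UTF8):
        raw = raw[len(codecs.BOM_UTF8):]
    return raw.decode("utf-8")   # raises UnicodeDecodeError if not UTF-8

line = "MAINTAINER Ram\u00f3n"
with_bom = codecs.BOM_UTF8 + line.encode("utf-8")
print(read_utf8_text(with_bom) == line)              # True
print(read_utf8_text(line.encode("utf-8")) == line)  # True
```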

You need to set the locale correctly, remove the accent, check the encoding with a basic docker run -it container env, and then set a correct encoding. The "Bible" on that is http://jaredmarkell.com/docker-and-locales/

Convert GB2312 to UTF-8

I have a text file that contains localized language strings that is currently encoded in GB2312 (simplified Chinese), but all of my other language files are in UTF-8. I am finding it very difficult to work with this file, as none of my text editors will work properly with it and keep corrupting it. Are there any tools to convert this to UTF-8, and are there any downsides to doing this? Would it be better to just keep it as GB2312 and use a different editor (if so, can you recommend one)?
Update: I'm using Windows XP (English install).
Update #2: I've tried using Notepad++ and Notepad2 to edit the GB2312 files, but both are unable to read the files and corrupt them.

You can try this online service that uses the Open Source iconv utility.
You can also install Charco, a command-line version of it on your machine.
For GB2312, you can use CP936 as the encoding.
If you are a .Net developer you can make a small tool that does just that.
I've struggled with this as well and found that it was actually simple to solve from a programmatic point of view.
All you need is something like this (I tested it and it works):
In C#
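using System.IO;
using System.Text;

class Program {
    static void Main(string[] args) {
        string infile = args[0];
        string outfile = args[1];
        // Code page 936 is the Windows code page that covers GB2312.
        using (StreamReader sr = new StreamReader(infile, Encoding.GetEncoding(936)))
        using (StreamWriter sw = new StreamWriter(outfile, false, Encoding.UTF8)) {
            // The using blocks close both files, so no explicit Close() is needed.
            sw.Write(sr.ReadToEnd());
        }
    }
}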
In VB.Net
Private Shared Sub Main(ByVal args() As String)
    Dim infile As String = args(0)
    Dim outfile As String = args(1)
    Dim sr As StreamReader = New StreamReader(infile, Encoding.GetEncoding(936))
    Dim sw As StreamWriter = New StreamWriter(outfile, False, Encoding.UTF8)
    sw.Write(sr.ReadToEnd)
    sw.Close
    sr.Close
End Sub

I might be thinking a bit too simple here, but if it's just this one plain text file, you could try the following:
Replace all & by &amp;, all < by &lt; and all > by &gt; (to be on the safe side)
Prepend the following to the text file:
<html><head><meta http-equiv="Content-Type" content="text/html; charset=gb2312" /></head><body><pre>
Open the file in your favorite browser
Select and copy all text
Paste it in Notepad and save as UTF-8.
You'd be done with this before you could have written any code to do the conversion or downloaded any programs that would do the conversion for you.
Of course, I'm not a hundred percent sure this'll work, and your browser would need the correct fonts and everything, but considering you're working with these kinds of files I'm assuming you already have those.

GB 2312 is mostly compatible with GB 18030, so any tool able to deal with the latter should treat GB 2312 correctly as well. There are many tools for converting GB 18030 to UTF-8 (or some other Unicode encoding form), but I can't recommend any specific one for Windows, because I work on Unix. If you're wanting to write a bit of code, the iconv library, or ICU, springs to mind: you'll find all the conversion data readily available in these libraries.
Conversion from GB 2312 to UTF-8 is completely safe and lossless; you shouldn't worry about it.
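Both claims can be spot-checked with Python's built-in codecs. This is only a sketch on one sample string, not a proof for every GB 2312 character.

```python
text = "\u7b80\u4f53\u4e2d\u6587"      # the four characters of "Simplified Chinese"
gb = text.encode("gb2312")

# Decoding the same bytes with the GB 18030 codec gives the same text.
print(gb.decode("gb18030") == text)                   # True

# GB2312 -> UTF-8 -> GB2312 returns the original bytes.
utf8 = gb.decode("gb2312").encode("utf-8")
print(utf8.decode("utf-8").encode("gb2312") == gb)    # True

# Two bytes per character in GB2312, three in UTF-8 for these characters.
print(len(gb), len(utf8))                             # 8 12
```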

I agree with the currently chosen answer that this is "actually simple to solve from a programmatic point of view", especially when your source file contains sensitive information that you do not want to expose to an unknown third-party online service.
Nowadays Python is available out of the box in most Linux environments and is also easy to install on Windows (easier than installing a C# stack, IMHO). So, without further ado, here is a two-line Python script that converts GB2312 to UTF-8. I tested it, and it works.
# Usage: python this_script.py your_input.txt your_output.txt
import io, sys
io.open(sys.argv[2], "w", encoding="utf-8").write(io.open(sys.argv[1], encoding="gb2312").read())
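For larger files, a variant that closes both files explicitly and reads in chunks may be preferable. The following is a sketch under those assumptions. The function name convert_file and the chunk size are choices made for this example, and newline="" is passed so that line endings are copied unchanged.

```python
import io
import os
import tempfile

def convert_file(src, dst, src_encoding="gb2312", chunk_size=64 * 1024):
    """Re-encode the text file src into UTF-8 at dst, reading in chunks."""
    with io.open(src, encoding=src_encoding, newline="") as fin, \
         io.open(dst, "w", encoding="utf-8", newline="") as fout:
        while True:
            chunk = fin.read(chunk_size)   # chunk of decoded characters
            if not chunk:
                break
            fout.write(chunk)

# Small self-check using temporary files.
tmpdir = tempfile.mkdtemp()
src = os.path.join(tmpdir, "in.txt")
dst = os.path.join(tmpdir, "out.txt")
sample = u"\u4e2d\u6587\n"
with io.open(src, "wb") as f:
    f.write(sample.encode("gb2312"))
convert_file(src, dst)
with io.open(dst, "rb") as f:
    result = f.read()
print(result == sample.encode("utf-8"))  # True
```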

If the command-line tool iconv is available on your OS, you can do this with a one-line script:
# From GB18030
iconv -f gb18030 -t utf8 -o output.txt input.txt
# From GB2312
iconv -f gb2312 -t utf8 -o output.txt input.txt
Check whether your OS has iconv:
$ iconv --version
iconv (Debian GLIBC 2.31-13+deb11u3) 2.31
Copyright (C) 2020 Free Software Foundation, Inc.
This is free software; see the source for copying conditions. There is NO
warranty; not even for MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
Written by Ulrich Drepper.
