(e)grep: accented characters not recognised as part of a word - unicode

I would like to use (e)grep to match a whole word using the -w switch. I've set the locale, but accented characters are being treated as word boundaries as in this example:
$ locale
LANG=es_VE.utf8
LC_CTYPE="es_VE.utf8"
LC_NUMERIC="es_VE.utf8"
LC_TIME="es_VE.utf8"
LC_COLLATE="es_VE.utf8"
LC_MONETARY="es_VE.utf8"
LC_MESSAGES="es_VE.utf8"
LC_ALL=es_VE.utf8
$ echo -e "cáñamo\namo" | egrep -w amo
cáñamo
amo
How can I find amo while ignoring cáñamo

Which code points count as a word-class character is not locale-dependent in Unicode, and LATIN SMALL LETTER N WITH TILDE is always a word character.
Here’s an all-UTF8 workflow demonstrating searching for amo after a word boundary, and after a non-(word-boundary):
$ perl -Mutf8 -CSDA -e 'print "cáñamo\namo\n"' |
perl -Mutf8 -CSDA -ne 'print if /\bamo\b/'
amo
$ perl -Mutf8 -CSDA -e 'print "cáñamo\namo\n"' |
perl -Mutf8 -CSDA -ne 'print if /\Bamo\b/'
cáñamo
I cannot help but be amused by your choice of search strings. Thanks for the chuckle.

Related

Insert linebreak in a file after a string

I have a unique (to me) situation:
I have a file - file.txt with the following data:
"Line1", "Line2", "Line3", "Line4"
I want to insert a linebreak each time the pattern ", is found.
The output of file.txt shall look like:
"Line1",
"Line2",
"Line3",
"Line4"
I am having a tough time trying to escape ", .
I tried sed -i -e "s/\",/\n/g" file.txt, but I am not getting the desired result.
I am looking for a one liner using either perl or sed.
You may use this gnu sed:
sed -E 's/(",)[[:blank:]]*/\1\n/g' file.txt
"Line1",
"Line2",
"Line3",
"Line4"
Note how you can use single quote in sed command to avoid unnecessary escaping.
If you don't have gnu sed then here is a POSIX compliant sed solution:
sed -E 's/(",)[[:blank:]]*/\1\
/g' file.txt
To save changes inline use:
sed -i.bak -E 's/(",)[[:blank:]]*/\1\
/g' file.txt
Could you please try following. using awk's substitution mechanism here, in case you are ok with awk.
awk -v s1="\"" -v s2="," '{gsub(/",[[:blank:]]+"/,s1 s2 ORS s1)} 1' Input_file
Here's a Perl solution:
perl -pe 's/",\K/\n/g' file.txt
The substitution pattern matches the ",, but the \K says to ignore anything to the left for the replacement (so, ",) will not be replaced. The replacement then effectively inserts the newline.
I used the single quote for the argument to -e, but that doesn't work on Windows where you have to use ". Instead of escaping the ", you can specify it in another way. That's code number 0x22, so you can write:
perl -pe "s/\x22,\K/\n/g" file.txt
Or in octal:
perl -pe "s/\042,\K/\n/g" file.txt
Use this Perl one-liner:
perl -F'/"\K,\s*/' -lane 'print join ",\n", #F;' in_file > out_file
Or this for in-line replacement:
perl -i.bak -F'/"\K,\s*/' -lane 'print join ",\n", #F;' in_file
The Perl one-liner uses these command line flags:
-e : Tells Perl to look for code in-line, instead of in a file.
-n : Loop over the input one line at a time, assigning it to $_ by default.
-l : Strip the input line separator ("\n" on *NIX by default) before executing the code in-line, and append it when printing.
-a : Split $_ into array #F on whitespace or on the regex specified in -F option.
-F'/"\K,\s*/' : Split into #F on a double quote, followed by comma, followed by 0 or more whitespace characters, rather than on whitespace. \K : Cause the regex engine to "keep" everything it had matched prior to the \K and not include it in the match. This causes to keep the double quote in #F elements, while comma and whitespace are removed during the split.
-i.bak : Edit input files in-place (overwrite the input file). Before overwriting, save a backup copy of the original file by appending to its name the extension .bak.
SEE ALSO:
perldoc perlrun: how to execute the Perl interpreter: command line switches
perldoc perlrequick: Perl regular expressions quick start

sed equivalent of perl -pe

I'm looking for an equivalent of perl -pe. Ideally, it would be replace with sed if it's possible. Any help is highly appreciated.
The code is:
perl -pe 's/^\[([^\]]+)\].*$/$1/g'
$ echo '[foo] 123' | perl -pe 's/^\[([^\]]+)\].*$/$1/g'
foo
$ echo '[foo] 123' | sed -E 's/^\[([^]]+)\].*$/\1/'
foo
sed by default accepts code from command line, so -e isn't needed (though it can be used)
printing the pattern space is default, so -p isn't needed and sed -n is similar to perl -n
-E is used here to be as close as possible to Perl regex. sed supports BRE and ERE (not as feature rich as Perl) and even that differs from implementation to implementation.
with BRE, the command for this example would be: sed 's/^\[\([^]]*\)\].*$/\1/'
\ isn't special inside character class unless it is an escape sequence like \t, \x27 etc
backreferences use \N format (and limited to maximum 9)
Also note that g flag isn't needed in either case, as you are using line anchors

Perl regex replacement of logical unicode characters

Here is a simple substitution that adds parentheses arounds upper-case characters in an unicode string. As you can see, the result is rather ugly:
~$ echo "Whatéver 5" | perl -ape "s/(\p{Upper})/(\1)/g"
(W)hat(�)�ver 5
My understanding is that the regex operates on "code points" instead of "logical characters", which splits my 'é' into meaningless characters. Is there a way to force the regex to work on logical unicode characters at once ?
Thanks,
As illustrated by the other answers, turning on UTF-8 in Perl is a piecemeal process. There's use utf8 for the syntax and raw strings. Then you have to make sure all your filehandles are UTF-8. What about #ARGV? readdir? glob? The output from ``?
There's nothing worse than having half your program working in ASCII and the other half working in UTF-8. utf8::all to the rescue!
Install it, add use utf8::all, and it will turn on UTF-8... all of it. Someone else figured it out, you don't have to worry about it.
$ echo "Whatéver 5" | perl -ape "use utf8::all; s/(\p{Upper})/(\1)/g"
(W)hatéver 5
You haven't told Perl to expect UTF-8 input, so it is treating each byte of the encoding as a separate character
Within a program you can set the default encoding for the three standard IO channels like this
use open ':std' => ':encoding(UTF-8)'
On the command line, the option -CS does the same thing, so this should work for you. I have removed the unnecessary autosplit option and replaced \1 with the correct $1 in the replacement string
echo "Whatéver 5" | perl -CS -pe "s/(\p{Upper})/($1)/g"
Assuming that your terminal uses UTF-8 encoding,
$ echo -n "é" | perl -ne 'printf "%vX\n", $_'
gives
C3.A9
so the input to the Perl program has not been converted internally to Unicode (it is still a string of UTF-8 bytes)
To convert the input to a Perl string, add a UTF-8 layer on the standard input stream using option -CI :
$ echo -n "é" | perl -CI -ne 'printf "%vX\n", $_'
the output is now
E9
However, if you also try to print the character back to standard output
you will not get é but a unicode replacement character � from the terminal. This is because the character 0xE9 is Unicode, but the terminal expect UTF-8, and 0xE9 is not valid UTF-8:
$ echo -n "é" | perl -CI -nE 'printf "$_: %vX\n", $_, $_'
�: E9
To get correct output, you can add an UFT-8 encoding layer on the standard output stream also (using -CO flag):
$ echo -n "é" | perl -CIO -nE 'printf "$_: %vX\n", $_, $_'
é: E9
According to perlunicode
"Upper" is a synonym for "Uppercase" , and we could have written
\p{Uppercase} equivalently as \p{Upper}
and
For instance, \p{Uppercase} matches any single character with the
Unicode "Uppercase" property
It seems like if you try to use \p{Upper} on a byte string, you will not get any warnings from Perl. Also bytes in the range 0xC0 to 0xDE will match the uppercase property. Try
perl -E 'for $i (0x80..0xFF) {$_=chr $i; printf "%x\n", $i if /\p{Upper}/}'
This explains the output you got:
$ echo "Whatéver 5" | perl -ape "s/(\p{Upper})/(\1)/g"
(W)hat(�)�ver 5
Here, the letter é is represented as 2 bytes (in UTF-8) 0xC3 and 0xA9, and 0xC3 will match the Unicode Upper property.
A solution to your problem is therefore to add UTF-8 encoding layers on the standard input and output (you can combine -CI and -CO using -CS):
echo "Whatéver 5" | perl -CS -ape "s/(\p{Upper})/(\1)/g"
with output:
(W)hatéver 5

Is there a way to run a tr command (or something like it) in Windows Perl?

I have a line of shell script that I need to recreate the functionality of in Windows Perl?
tr -cd '[:print:]\n\r' | tr -s ' ' | sed -e 's/ $//'
The sed part is easy but not being a shell scripting expert (or perl expert for that matter) I'm having trouble recreating the functionality of the tr command in Perl.
Break down the bits to what each does and convert that to Perl:
tr -cd '[:print:]\n\r'
Take the complement (-c) of printable characters, newlines, and carriage returns and delete them (-d). The tr manpage tells you which each of those do.
tr -s ' '
Collapse multiple same characters (-s) to one character.
sed -e 's/ $//'
Get rid of a trailing space.
Now just do that in whatever language you want to use. Putting it together, you might like something like:
perl -pe 's/\P{PosixPrint}//g; tr/ //s; s/ \z//;'
Note that Perl's tr does not do character classes, but I can use a complemented (\P{...} with the uppercase P) Unicode character class to do the same thing. Perl character classes also understand the regular POSIX character classes.

perl -a: How to change column separator?

I want to read the columns in a file where the separator is :.
I tried it like this (because according to http://www.asciitable.com, the octal representation of the colon is 072):
$ echo "a:b:c" | perl -a -072 -ne 'print "$F[1]\n";'
I want it to print b, but it doesn't work.
Look at -F in perlrun:
% echo "a:b:c" | perl -a -F: -ne 'print "$F[1]\n";'
b
Note that the value is taken as regular expression, so some delimiters may need some extra escaping:
% echo "a.b.c" | perl -a -F. -ne 'print "$F[1]\n";'
% echo "a.b.c" | perl -a -F\\. -ne 'print "$F[1]\n";'
b
-0 specifies the record (line) separator. It was cause Perl to receive three lines:
>echo a:b:c | perl -072 -nE"say"
a:
b:
c
Since there's no whitespace on any of those lines, $F[1] would be empty if -a were to be used.
-F specifies the input field separator. This is what you want.
perl -F: -lanE'say $F[1];'
Or if you're stuck with an older Perl:
perl -F: -lane'print $F[1];'
Command line options are documented in perlrun.