I'm trying to compare two directories synced by Syncthing. I do this using the following:
vimdiff <(cd ~/Pictures/shared && find . | sort) <(ssh argon "cd ~/pictures/shared && find ." | sort)
One machine is a recent Arch Linux box and the local machine is a MacBook Pro. Skimming through the diff, I have trouble finding real differences because most of them are umlauts that somehow get interpreted differently:
A hexdump shows that the characters differ. Here it's a German ö (U+00F6), while there it is an o followed by a combining diaeresis ◌̈ (U+0308). Is vimdiff capable of recognizing these equivalent forms as identical?
You can edit each of the buffers to replace the problematic character with the same character in both (e.g. here I'd replace both with o). Vimdiff should update automatically after you modify one of the buffers.
For the replacement you can use :%s/<Ctrl-V>u00f6/o/g (and the equivalent for the other buffer); the u00f6 is automatically replaced with the Unicode character on the command line as you type it.
I found a way to translate the encoding before comparing them by piping the output through iconv -f utf-8 -t utf-8-mac:
vimdiff <(cd ~/Pictures/shared && find . | sort) <(ssh argon "cd ~/pictures/shared && find . " | iconv -f utf-8 -t utf-8-mac | sort)
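What's actually going on is Unicode normalization: HFS+ on the Mac stores file names in decomposed form (o plus combining diaeresis), while the Linux box keeps the precomposed form. A quick illustration (the octal escapes simply spell out the UTF-8 bytes of each form):

```shell
# Precomposed U+00F6 (what the Linux box stores): two bytes
printf '\303\266' | od -An -tx1     # c3 b6
# Decomposed o + U+0308 (what the Mac stores): three bytes
printf 'o\314\210' | od -An -tx1    # 6f cc 88
```

Both render as the same glyph, which is why the diff looks identical on screen yet every such line is flagged as different.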
Also see this question on iconv.
I'm recording terminal sessions using the script command. Unfortunately the typescript output file contains many control characters, for example from pressing the full-screen key (F11) while in the vim editor.
script -f -t 2>${LOGNAME}-$(/bin/date +%Y%m%d-%H%M%S).time -a ${LOGNAME}-$(/bin/date +%Y%m%d-%H%M%S).session
vi abc.log
#write something and save
#:x to quit vi
ctrl + d to quit script
The script output hostname-datetime.session contains many vi control characters.
I found a perl script in commandlinefu, which can remove these control characters from the typescript.
I am actually doing this replacement in C, and the program runs in a chroot environment where Perl is not available.
Question: is there a way to translate the following Perl command to sed?
cat typescript | perl -pe 's/\e([^\[\]]|\[.*?[a-zA-Z]|\].*?\a)//g' | col -b > typescript-processed
If you only want printable ASCII:
LC_ALL=C tr -cd ' -~\n\t' < typescript > typescript_printable_ascii_only
Why does this work? All printable ("normal") ASCII characters lie between Space and Tilde, and in addition you need Newline and Tab.
So ' -~\n\t' covers all printable "normal" ASCII characters. tr -d 'chars' deletes all chars, and -c takes the complement of the given set (so everything except 'chars').
=> Thus LC_ALL=C tr -cd ' -~\n\t' deletes everything except the normal ASCII characters (including newline and tab). (I force the locale to C to be sure we are in the right locale when calling tr.)
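A short illustration (the escape sequence and carriage return are invented input). Note that tr removes the non-printable bytes themselves, but the printable body of an escape sequence survives, which is why a sequence-aware filter like the perl or sed approaches is still needed for those:

```shell
# ESC (\033) and CR (\r) are deleted; printable bytes, including the
# "[1m" body of the escape sequence, are kept.
printf 'a\033[1mb\r\n' | LC_ALL=C tr -cd ' -~\n\t'    # prints a[1mb
```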
This works well for me with GNU sed (or gsed on a Mac):
sed -re 's/\x1b[^m]*m//g' typescript | col -b
I created a sample typescript, and since I'm using a relatively advanced shell prompt, it's full of control characters. The perl script in the OP doesn't actually work for it, so rather than converting it I came up with my own.
Looking at the typescript with hexdump -C, it seems that all control sequences start with \x1b (the Escape character, or ^[), and end with the letter "m". So in sed I use a simple replacement from ^[ until m, normally written as \x1b.*?m but since sed doesn't support the ? symbol to make a pattern non-greedy, I used [^m]*m to emulate non-greedy matching.
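To sanity-check the substitution, here is a made-up colored string run through it (GNU sed's \x1b escape denotes the Escape character):

```shell
# Strip ESC...m sequences; [^m]*m emulates a non-greedy match.
printf 'plain \033[31mred\033[0m text\n' | sed -e 's/\x1b[^m]*m//g'
# prints: plain red text
```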
I would like to make a batch file that removes all blank lines and sorts the lines in the files with a regular case-insensitive sort.
So far I got this:
@echo off
IF [%1]==[] goto BAR_PAR
IF EXIST %1 (
egrep -v "^[[:space:]]*$" %1 | sort > xxx
mv -f xxx %1
) else (
echo File doesn't exist
)
goto END
:BAR_PAR
echo No Parameter Passed
:END
But this screws up my files that have encoding UCS-2 Little Endian.
Is there a way to handle all encoding blindly?
If not, what should I do to make this UCS-2 Little Endian Compatible?
UPDATE
Forgot to mention that I was using Windows but with Cygwin so I have general linux shell commands like grep, sed, etc...
Cygwin sort -f will sort the file case-insensitively by converting all characters to upper-case.
Cygwin iconv converts from one character set to another.
grep -e '[[:graph:]]' foo.txt | sort -f
In short, this command looks for any line that has at least one visible character. Therefore, lines with only spaces and tabs are excluded.
For some reason, the file I was working with didn't respond to any combination I could think of using '^' and '$'.
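Putting the pieces together for the UCS-2 little-endian files, one sketch (the file names are placeholders and the sample data is invented; this assumes the files have no BOM) is to convert to UTF-8 at both ends of the pipeline:

```shell
# Create a sample UCS-2 LE file (stand-in for the real data)
printf 'banana\n\n  \nApple\ncherry\n' | iconv -f UTF-8 -t UTF-16LE > input.txt

# Convert to UTF-8, drop blank lines, sort case-insensitively,
# then convert back to UCS-2 LE.
iconv -f UTF-16LE -t UTF-8 input.txt |
  grep -e '[[:graph:]]' |
  sort -f |
  iconv -f UTF-8 -t UTF-16LE > output.txt
```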
I'm looking for a way to recursively replace all incompatible Windows-1252 characters with their respective UTF-8 equivalents.
I tried iconv, without success.
I also found the following command:
grep -rl oldstring . |xargs sed -i -e 's/oldstring/newstring/'
But I'd rather not run this command by hand for every character.
Is there a way or software that can do that?
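One way to sketch such a recursive conversion is with find and iconv; the testdir directory, the *.txt pattern, and the sample file below are all invented for illustration, and this blindly assumes every matched file really is Windows-1252 (files already in UTF-8 would be corrupted):

```shell
# Sample Windows-1252 file: 0xE9 is "é" in that code page
mkdir -p testdir
printf 'caf\351\n' > testdir/menu.txt

# Convert every matched file in place from Windows-1252 to UTF-8
find testdir -type f -name '*.txt' -exec sh -c '
  for f; do
    iconv -f WINDOWS-1252 -t UTF-8 "$f" > "$f.tmp" && mv "$f.tmp" "$f"
  done
' sh {} +
```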
I want to search for text in a specified encoding (cp1251/UTF-8/UTF-16LE/iso-8859-4, etc.) in a file hierarchy.
For example, I have source code encoded in cp1251, and I run Debian with the system encoding UTF-8. grep and Midnight Commander perform searches in UTF-8, so I cannot find Russian words.
Preferred solutions will use standard POSIX or GNU command line utilities (like grep).
MC or Emacs solution also appreciated.
I tried:
$ grep `echo Привет | iconv -f cp1251 -t utf-8` *
but this command sometimes shows no results.
The command you proposed converts the string Привет from cp1251 to UTF-8 and uses the result as the grep pattern; but the Привет you type is already UTF-8, so the converted pattern is mojibake. That is not what you want. What you want is this:
find . -type f -printf "iconv -f cp1251 -t utf-8 '%p' | grep --label '%p' -H 'Привет'\n" | sh
This applies iconv, followed by grep, to every file below the current directory.
But note that this assumes that all of your files are in CP1251. It will fail if only some of them are. In that case you'd first have to write a program that detects the encoding of a file and then applies iconv only if necessary.
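As an end-to-end check of that pipeline, here is a self-contained run against a throwaway cp1251 file (the cptest directory and file name are invented):

```shell
# Make a cp1251-encoded file containing a Russian word
mkdir -p cptest
printf 'Привет мир\n' | iconv -f UTF-8 -t CP1251 > cptest/hello.txt

# Generate one iconv|grep pipeline per file and run it
find cptest -type f -printf "iconv -f cp1251 -t utf-8 '%p' | grep --label '%p' -H 'Привет'\n" | sh
# prints: cptest/hello.txt:Привет мир
```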
From the command line:
LANG=ru_RU.cp1251 grep Привет *
I need to remove Unicode characters from many files (many .cpp files!) and I'm looking for a script or something to remove them. The files are spread across many folders.
If you have it, you should be able to use iconv (the command-line tool, not the C function). Something like this:
$ for a in $(find . -name '*.cpp') ; do iconv -f utf-8 -t ascii -c "$a" > "$a.ascii" ; done
The -c option to iconv causes it to drop characters it can't convert. Then you'd verify the result, and go over them again, renaming the ".ascii" files to the plain filenames, overwriting the Unicode input files:
$ for a in $(find . -name '*.ascii') ; do mv "$a" "${a%.ascii}" ; done
Note that both of these commands are untested; verify by adding echo after the do in each to make sure they seem sane.
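As a quick illustration of what -c does (the accented words are invented examples):

```shell
# iconv -c silently drops characters it cannot represent in the target set
printf 'na\303\257ve r\303\251sum\303\251\n' | iconv -f utf-8 -t ascii -c
# prints: nave rsum
```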
Open the .srt file in Gaupol, click File, then Save As, open the character-encoding drop-down menu, select UTF-8, and save the file.