How to find and remove the invisible characters in a text file using Emacs

I have a .txt file named COPYING which was edited on Windows.
It contains Windows-style line breaks:
$ file COPYING
COPYING: ASCII English text, with CRLF line terminators
I tried to convert it to Unix style using dos2unix. Here is the output:
$ dos2unix COPYING
dos2unix: Skipping binary file COPYING
I was surprised to find that the dos2unix program reports it as a binary file. Then using some other editor (not Emacs) I found that the file contains a control character. I am interested in finding all the invisible characters in the file using Emacs.
By googling, I found the following solution, which uses tr:
tr -cd '\11\12\40-\176' < file_name
How can I do the same in an Emacs way? I tried Hexl mode, which shows the text and the corresponding byte values in a single buffer, which is great. How do I find the characters whose octal values are other than 11, 12, and 40-176 (i.e., anything except tab, newline, space, and visible characters)? I tried to create a regular expression for that search, but it is quite complicated.
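Outside Emacs, the "find" half of the question can be sketched in a few lines of Python. This is only an illustration of what the tr set above keeps and drops; the function name find_invisible and the sample bytes are made up for the example.

```python
# Sketch: list every byte that the tr command above would delete.
# Allowed bytes mirror tr's set: tab (octal 11), newline (octal 12),
# and space through tilde (octal 40-176).
ALLOWED = {0o11, 0o12} | set(range(0o40, 0o177))


def find_invisible(data: bytes):
    """Return (line, column, byte value) for each byte outside ALLOWED."""
    hits = []
    line, col = 1, 0
    for b in data:
        if b not in ALLOWED:
            hits.append((line, col, b))
        if b == 0o12:          # newline: move to the start of the next line
            line, col = line + 1, 0
        else:
            col += 1
    return hits


# Hypothetical sample: a form feed plus CRLF line endings.
sample = b"GNU\x0c GPL\r\nversion 3\r\n"
for line, col, b in find_invisible(sample):
    print(f"line {line}, column {col}: byte {b:#04x} (octal {b:03o})")
```

To scan a real file, read it in binary mode and pass the bytes in, e.g. find_invisible(open("COPYING", "rb").read()). Columns are counted from 0.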

To see invisible characters, you can try whitespace-mode. Spaces and tabs will be displayed with a symbol in a different face. If the coding system is automatically detected as dos (showing (DOS) on the mode line), the carriage returns at the end of each line will be hidden. Run revert-buffer-with-coding-system to switch it to unix or binary (e.g. C-x RET r unix) and they'll always show up as ^M. The binary coding system will also display any non-ASCII bytes as octal escapes.

Emacs won't hide any character by default. Press Ctrl+Meta+% (query-replace-regexp), or Esc then Ctrl+% if the former is too hard on your fingers, or M-x replace-regexp RET if you prefer to replace everything without being asked. Then, for the regular expression, enter
[^@-^H^K-^_^?]
However, where I wrote ^H, type Ctrl+Q then Ctrl+H, to enter a “control-H” character literally, and similarly for the others. You can press Ctrl+Q then Ctrl+Space for ^@, and usually Ctrl+Q then Backspace for ^?. Replace all occurrences of this regular expression with the empty string.
Since you have the file open in Emacs, you can change its line endings while you're at it. Press C-x RET f (Ctrl+X Return F) and enter us-ascii-unix as the new desired encoding for the file.
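For comparison, here is roughly the same character class written as a Python regular expression: NUL through Ctrl+H, Ctrl+K through Ctrl+_, and DEL. It is a sketch, not a replacement for the Emacs command; the names CONTROL_CHARS and strip_control_chars are invented for the example.

```python
import re

# Same set as the Emacs class: NUL-BS, VT-US, and DEL.
# Tab (\x09) and line feed (\x0a) are deliberately left out of the class.
CONTROL_CHARS = re.compile(r"[\x00-\x08\x0b-\x1f\x7f]")


def strip_control_chars(text: str) -> str:
    """Delete every character matched by CONTROL_CHARS."""
    return CONTROL_CHARS.sub("", text)


print(repr(strip_control_chars("one\ttwo\x0c\r\nthree\x7f\n")))
```

Note that carriage return (\x0d) falls inside the second range, so in this Python version CRLF line endings become plain LF as a side effect.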

Check out M-x set-buffer-file-coding-system. From the documentation:
(set-buffer-file-coding-system CODING-SYSTEM &optional FORCE NOMODIFY)
Set the file coding-system of the current buffer to CODING-SYSTEM.
This means that when you save the buffer, it will be converted
according to CODING-SYSTEM. For a list of possible values of
CODING-SYSTEM, use M-x list-coding-systems.
So, going from DOS to UNIX, M-x set-buffer-file-coding-system unix.
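At the byte level, the line-ending part of that conversion amounts to rewriting each CR LF pair as a single LF. A minimal sketch, assuming the data is handled as raw bytes and nothing else about the encoding changes (dos_to_unix is a hypothetical helper, not an Emacs function):

```python
def dos_to_unix(data: bytes) -> bytes:
    """Rewrite CRLF pairs as LF; lone CR bytes are left untouched."""
    return data.replace(b"\r\n", b"\n")


converted = dos_to_unix(b"line one\r\nline two\r\n")
print(converted)
```

A CR that is not followed by LF is kept, so stray control characters in the middle of a line still need separate treatment.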

Related

Dired appears with 015 (Octal?)

Recently, my Dired listing in Emacs started appearing with 015 at the end of each line.
I'm not sure what brought it on. I had been making some changes with my Spacemacs layers but since then I've gone to a completely out-of-the-box Spacemacs configuration and the 015s are still there. It makes Dired pretty much useless because if I try to select a file or drill into a directory it doesn't recognize it. Any ideas or suggestions would be greatly appreciated!
Those are Control M characters. Emacs writes them as either ^M (one char, not two) or \015 (again, one char, not 4).
This Emacs Wiki page tells you about this: EndOfLineTips.
This is some of what it says:
If you see ^M in your file, you may have opened a file with DOS-style line endings (carriage return + line feed) while Emacs assumes it has Unix-style line endings (line feed only). (The carriage-return character, sometimes abbreviated as CR, is ^M. The line-feed character, sometimes abbreviated as LF, is ^J.)
You can reopen the file with the correct line ending with a command like C-x C-m r dos.
C-x C-m r is bound to revert-buffer-with-coding-system. Use C-h k or C-h f to see more about it.
See also (C-h v) variable buffer-file-coding-system.
Also: in the Emacs manual (C-h r), type i line endings RET to jump to the node Coding Systems. That tells you all you need to know about this.
This question and its answers might also help. And see the UNIX/Linux command dos2unix.
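To check which convention a chunk of text actually uses, one can count the three possible terminators. The following Python sketch does that on raw bytes; count_line_endings is a name made up for this example.

```python
import re


def count_line_endings(data: bytes) -> dict:
    """Count CRLF, lone LF, and lone CR terminators in data."""
    counts = {"CRLF": 0, "LF": 0, "CR": 0}
    # CRLF is tried first so a pair is not counted as a separate CR and LF.
    for match in re.finditer(rb"\r\n|\n|\r", data):
        token = match.group()
        if token == b"\r\n":
            counts["CRLF"] += 1
        elif token == b"\n":
            counts["LF"] += 1
        else:
            counts["CR"] += 1
    return counts


print(count_line_endings(b"a\r\nb\nc\r\n"))
```

A mix of CRLF and LF counts is a hint that the text was edited on both kinds of system.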

GitHub code preview looks odd

On GitHub, why does the code look like this? It looks fine in all the code editors out there.
Ctrl+M (^M) characters show up in a file on Linux when the file originally came from a Windows environment.
You may need to remove Ctrl+M characters, when you import a text file from MS-DOS (or MS-Windows), and forget to transfer it in ASCII or text mode. Here are several ways to do it; pick the one you are most comfortable with.
The easiest way is probably to use the stream editor sed to remove the ^M characters. Type this command:
% sed -e "s/^M//" filename > newfilename
To enter ^M, type CTRL-V, then CTRL-M. That is, hold down the CTRL key then press V and M in succession.
You can also do it in vi:
% vi filename
Inside vi [in ESC mode] type:
:%s/^M//g
To enter ^M, type CTRL-V, then CTRL-M. That is, hold down the CTRL key then press V and M in succession.
You can also do it inside Emacs. To do so, follow these steps:
Go to the beginning of the document
Type: M-x replace-string RET C-q C-m RET RET
where "RET" means press the Return key, and C-q and C-m mean hold down the CTRL key and press the q (or m) key.
Courtesy: https://its.ucsc.edu/unix-timeshare/tutorials/clean-ctrl-m.html
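One detail worth noticing: the sed command has no g flag, so it removes only the first ^M on each line, while the vi command removes all of them. For ordinary CRLF files there is just one per line, so the result is the same. The difference can be sketched in Python (both function names are invented for the illustration):

```python
def strip_first_cr(line: str) -> str:
    """Like sed 's/^M//': remove only the first CR on the line."""
    return line.replace("\r", "", 1)


def strip_all_cr(line: str) -> str:
    """Like vi ':%s/^M//g': remove every CR on the line."""
    return line.replace("\r", "")


line = "col1\rcol2\r"
print(repr(strip_first_cr(line)))
print(repr(strip_all_cr(line)))
```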

How can I input a line separator in Emacs?

Some elisp files use a line separator (I have no idea what its real name is) to separate groups of functions.
Some functions may be called only by other functions that make up the API for the user, so the two groups of functions are kept apart.
In these files there is one character which is displayed like a line in Emacs (I call it a line separator).
For example, in the help.el file, after the line (defvar help-button-cache nil), there is a line separator on line 114.
So my question is: how do I input it in Emacs?
This character is called "form feed", shown in Emacs as ^L, represented in files as byte 12 (decimal) / 0C (hex). Its function is to separate pages; when sent to a printer, it will usually make the printer output the current page and restart output at the top of a new page.
You can input it with C-q C-l. C-q is bound to quoted-insert, which can insert almost anything into the buffer literally.
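As a small illustration of the "page" idea, here is a Python sketch that splits a source text at each form feed. The sample string and the name split_pages are assumptions for the example, not taken from help.el.

```python
FORM_FEED = "\f"           # same character as chr(12) / "\x0c"


def split_pages(source: str) -> list:
    """Split text into pages at each form feed character."""
    return source.split(FORM_FEED)


# Hypothetical two-page source with one form feed between the sections.
source = "(defvar a nil)\n\f\n(defun b ())\n"
pages = split_pages(source)
print(len(pages))
```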
You are looking for C-q C-l I believe. This inserts the ^L escape, which is commonly known as a FORM_FEED. Traditionally, this command was used to tell printers to eject the page and start a new one; of course, this has changed over time. Normally, this is used as a directive to clear the screen in terminals.
I'm not sure what you're seeing, because the character displays as ^L to me.
EDIT: sniped.

Emacs yanking (pasting) from website always yields character code 160 instead of SPC

When copying code from the web (usually using Chrome in Ubuntu) I am frustrated by the fact that Emacs inserts blank spaces of Char:   (160, #o240, #xa0) wherever there should be a space character, Char: SPC (32, #o40, #x20). This appears fine in the editor but as soon as I try to execute the code I get errors. How can I make Emacs convert these non-breaking spaces into normal space characters?
You can use query-replace (M-%) to convert the characters. To enter the non-breaking space at the prompt, copy and paste one from the buffer, or type C-q 240 RET (its octal code).
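If the pasted code is going through a script anyway, the same replacement is a one-liner outside Emacs. A minimal Python sketch (normalize_spaces is a hypothetical helper name):

```python
NBSP = "\u00a0"            # code point 160, #o240, #xa0


def normalize_spaces(code: str) -> str:
    """Replace every no-break space with an ordinary space (code 32)."""
    return code.replace(NBSP, " ")


pasted = "x\u00a0=\u00a01"
print(normalize_spaces(pasted))
```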

^M in file names & find-file

Update: This only occurs when I access the particular server from a Windows machine
With emacs tramp (plink) I'm logging on to 2 different servers, and am experiencing a problem in one of them with find-file.
If I do tab completion in a directory, all file names have ^M appended to them, e.g.:
Click <mouse-2> on a completion to select it.
In this buffer, type RET to select the completion near point.
Possible completions are:
-name^M ../^M
./^M .bash_history^M
.git/^M .gitconfig^M
.gitignore^M .lesshst^M
.ssh/^M .subversion/^M
and when I tab-complete the file name, it completes with the ^M suffix, which is the filename of a nonexistent file:
/plink:user@myserver.com:/home/me/.gitignore^M
Anyone experience a similar problem? ^M is ungoogleable!
^M really is the carriage return part of a carriage return/line feed (hex 0x0d, oct 015). You probably need to configure your server to use linefeeds as line endings. There might be a way to fix this in emacs, but I don't know offhand.
In a way, it's an MS Windows (carriage return/line feed) vs. Linux (line feed) issue. However, it's not really that simple, and both types of line endings exist for historically good reasons.
As mentioned elsewhere, ^M is the CR character, which makes up the first half of a CR LF pair used as a line terminator in DOS. Unix/Linux just uses the LF, so the ^M is displayed as an extra character.
When I see this in vim, I remove it by searching for ^M and replacing it with nothing. To specify ^M I press ctrl-V then ctrl-M.
I'm sure emacs has some way to do the same thing.
Try
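;; Strip trailing ^M from subprocess output (shell-strip-ctrl-m is an
;; alias for comint-strip-ctrl-m); the final t makes the hook buffer-local.
(add-hook 'comint-output-filter-functions
          'shell-strip-ctrl-m nil t)
(add-hook 'comint-output-filter-functions
          'comint-watch-for-password-prompt nil t)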
or maybe even
'(ansi-color-for-comint-mode-on)
'(ansi-color-for-comint-mode-filter)
^M is Enter.
It could have to do with DOS vs. Unix line breaks.
DOS: (13, 10)
Unix: (10)
So when you write lines using DOS style, a Unix-style renderer will show:
your line(char13)
another line
Check your terminal and stuff like that...
And... the same happens if you write a shell script for Unix on DOS. You have to convert it using dos2unix filename to remove the extra 13's from your file.
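The (13, 10) vs. (10) difference is easy to reproduce. In this Python sketch, splitting DOS-style text on LF only, the way a Unix-style reader would, leaves a CR stuck to the end of every line, which is exactly the stray ^M:

```python
dos_text = "your line\r\nanother line\r\n"

# Splitting on LF only, as a Unix-style reader would, leaves CR on each line.
unix_view = dos_text.split("\n")
print(unix_view)

# Dropping the trailing CR (and the empty tail) recovers the intended lines.
cleaned = [line.rstrip("\r") for line in unix_view if line]
print(cleaned)
```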