How to strip CR (^M) and leave LF (^J) characters? - emacs

I am trying to use Hexl mode to manually remove some special characters from a text file and don't see how to delete anything in Hexl mode.
What I really want is to remove carriage return and keep linefeed characters.
Is Hexl mode the right way to do this?

No need to find replace. Just use.
M-x delete-trailing-whitespace
You can also set the file encoding through
C-x RET f unix

Oops. That ^J^M needs to be entered as two literal characters.
Use c-q c-j, c-q c-m and for the replacement string, use c-q c-j.

No need for hexl-mode for this. Just do a global-search-and-replace of ^J^M with ^J Works for me. :) Then save the file, kill the buffer, and revisit the file so the window shows the new file mode (Unix vs DOS).

There's also a command-line tool called unix2dos/dos2unix that exists specifically to convert line endings.

Assuming you want a DOS encoded file to be changed into UNIX encoding, use M-x set-buffer-file-coding-system (C-x RET f) to set the coding-system to "unix" and save the file.

If you want to remove a carriage return (usually displayed as ^M) and leave the line feed. You can just visit the file w/out any conversion:
M-x find-file-literally /path/to/file
Because a file with carriage returns is generally displayed in DOS mode (hiding the carriage returns). The mode line will likely display (DOS) on the left side.
Once you've done that, the ^M will show up and you can delete them like you would any character.

You don't need to use hexl-mode. Instead:
open file in a way that shows you those ^M's. See M-x find-file-literally /path/to/file above. In XEmacs you can also do C-u C-x C-f and select binary encoding.
select the string you want replace and copy it using M-w
do M-% (query replace) and paste what you want to copy using C-y
present Enter when prompted to what replace it with
possible press ! now to replace all occurrences
The point is that even if you don't how to enter what you are trying to replace, you can always select/copy it.

(in hexl mode) I'm not sure that you can delete characters. I've always converted them to spaces or some other character, switched to the regular text editor, and deleted them there.

I use this function:
(defun l/cr-sanitise ()
"Make sure current buffer uses unix-utf8 encoding.
If necessary remove superfluous ^M. Buffer will need to be saved
for changes to be permanent."
(interactive)
(set-buffer-file-coding-system 'utf-8-unix)
(delete-trailing-whitespace)
(message "Please save buffer to persist encoding changes."))

From http://www.xsteve.at/prg/emacs/xsteve-functions.el:
;02.02.2000
(defun xsteve-remove-control-M ()
"Remove ^M at end of line in the whole buffer."
(interactive)
(save-match-data
(save-excursion
(let ((remove-count 0))
(goto-char (point-min))
(while (re-search-forward (concat (char-to-string 13) "$") (point-max) t)
(setq remove-count (+ remove-count 1))
(replace-match "" nil nil))
(message (format "%d ^M removed from buffer." remove-count))))))
Add this to your .emacs and run it via M-x xsteve-remove-control-M or bind it to a easier key. It will strip the ^Ms in anymode.

Related

Highlighting and replacing non-printable unicode characters in Emacs

I have an UTF-8 file containing some Unicode characters like LEFT-TO-RIGHT OVERRIDE (U+202D) which I want to remove from the file. In Emacs, they are hidden (which should be the correct behavior?) by default. How do I make such "exotic" unicode characters visible (while not changing display of "regular" unicode characters like german umlauts)? And how do I replace them afterwards (with replace-string for example. C-X 8 Ret does not work for isearch/replace-string).
In Vim, its quite easy: These characters are displayed with their hex representation per default (is this a bug or missing feature?) and you can easily remove them with :%s/\%u202d//g for example. This should be possible with Emacs?
You can do M-x find-file-literally then you will see these characters.
Then you can remove them using usual string-replace
How about this:
Put the U+202d character you want to match at the top of the kill ring by typing M-:(kill-new "\u202d"). Then you can yank that string into the various searching commands, with either C-y (eg. query-replace) or M-y (eg. isearch-forward).
(Edited to add:)
You could also just call commands non-interactively, which doesn't present the same keyboard-input difficulties as the interactive calls. For example, type M-: and then:
(replace-string "\u202d" "")
This is somewhat similar to your Vim version. One difference is that it only performs replacements from the cursor position to the bottom of the file (or narrowed region), so you'd need to go to the top of the file (or narrowed region) prior to running the command to replace all matches.
I also have this issue, and this is particularly annoying for commits as it may be too late to fix the log message when one notices the mistake. So I've modified the function I use when I type C-x C-c to check whether there is a non-printable character, i.e. matching "[^\n[:print:]]", and if there is one, put the cursor over it, output a message, and do not kill the buffer. Then it is possible to manually remove the character, replace it by a printable one, or whatever, depending on the context.
The code to use for the detection (and positioning the cursor after the non-printable character) is:
(progn
(goto-char (point-min))
(re-search-forward "[^\n[:print:]]" nil t))
Notes:
There is no need to save the current cursor position since here, either the buffer will be killed or the cursor will be put over the non-printable character on purpose.
You may want to slightly modify the regexp. For instance, the tab character is a non-printable character and I regard it as such, but you may also want to accept it.
About the [:print:] character class in the regexp, you are dependent on the C library. Some printable characters may be regarded as non-printable, like some recent emojis (but not everyone cares).
The re-search-forward return value will be regarded as true if and only if there is a non-printable character. This is exactly what we want.
Here's a snippet of what I use for Subversion commits (this is between more complex code in my .emacs).
(defvar my-svn-commit-frx "/svn-commit\\.\\([0-9]+\\.\\)?tmp\\'")
and
((and (buffer-file-name)
(string-match my-svn-commit-frx (buffer-file-name))
(progn
(goto-char (point-min))
(re-search-forward "[^\n[:print:]]" nil t)))
(backward-char)
(message "The buffer contains a non-printable character."))
in a cond, i.e. I apply this rule only on filenames used for Subversion commits. The (backward-char) can be used or not, depending on whether you want the cursor to be over or just after the non-printable character.

Force Emacs to use a particular encoding if and only if that causes no trouble

In my .emacs file, I use the line
'(setq coding-system-for-write 'iso-8859-1-unix)
to have Emacs save files in the iso-8859-1-unix encoding. When I
enter characters that cannot be encoded that way ("Łódź" for
example), I get prompted to select a different encoding, but upon
entering `iso-8859-1-unix' into the minibuffer, the file is saved and
the offending characters are lost.
If I just hit enter at the prompt, the file is saved in Unicode, and
when I close and reopen Emacs it is interpreted as a Unicode file
again. If I then remove the offending characters, save the file and
close and reopen Emacs another time, it is still interpreted as a
Unicode file -- from which I deduce that it has still been saved in
Unicode, even though saving in iso-8859-1-unix would have been
possible.
So is there a way to force Emacs to write a file in iso-8859-1
whenever possible, and never save it in that encoding if doing so
would gobble characters?
Many thanks in advance,
Thure Dührsen
As per the doc string for coding-system-for-write, you should not be setting it globally.
Perhaps what you are looking for is (prefer-coding-system 'iso-8859-1-unix)?
I'd try to write a save time hook function which would check the content of the buffer and set the encoding correspondingly. By using find-coding-system-region there shouldn't be much work.
Try
(setq-default buffer-file-coding-system 'iso-8859-1)
Edit:
Incorporating AProgrammer's suggestion, we get
(defun enforce-coding-system-priority ()
(let ((pref (car (coding-system-priority-list)))
(list (find-coding-systems-region (point-min) (point-max))))
(when (or (memq 'undecided list) (memq pref list))
(setq buffer-file-coding-system pref))))
(add-hook 'before-save-hook 'enforce-coding-system-priority)
(prefer-coding-system 'iso-8859-1)
The following should make Emacs ask when the buffer encoding is not that with which emacs is set up to save the file. Emacs will then prompt you to chose one among "safe" encodings.
(setq select-safe-coding-system-accept-default-p
'(lambda (coding)
(string=
(coding-system-base coding)
(coding-system-base buffer-file-coding-system))))

How can I set the encoding of shell-command-on-region output?

I have a small elisp script which applies Perl::Tidy on region or whole file. For reference, here's the script (borrowed from EmacsWiki):
(defun perltidy-command(start end)
"The perltidy command we pass markers to."
(shell-command-on-region start
end
"perltidy"
t
t
(get-buffer-create "*Perltidy Output*")))
(defun perltidy-dwim (arg)
"Perltidy a region of the entire buffer"
(interactive "P")
(let ((point (point)) (start) (end))
(if (and mark-active transient-mark-mode)
(setq start (region-beginning)
end (region-end))
(setq start (point-min)
end (point-max)))
(perltidy-command start end)
(goto-char point)))
(global-set-key "\C-ct" 'perltidy-dwim)
I'm using current Emacs 23.1 for Windows (EmacsW32). The problem I'm having is that if I apply that script on a UTF-8 coded file ("U(Unix)" in the status bar) the output comes back Latin-1 coded, i.e. two or more characters for each non-ASCII source character.
Is there any way I can fix that?
EDIT: Problem seems to be solved by using (set-terminal-coding-system 'utf-8-unix) in my init.el. In anyone has other solutions, go ahead and write them!
Below are from shell-command-on-region document
To specify a coding system for converting non-ASCII characters
in the input and output to the shell command, use C-x RET c
before this command. By default, the input (from the current buffer)
is encoded using coding-system specified by `process-coding-system-alist',
falling back to `default-process-coding-system' if no match for COMMAND
is found in `process-coding-system-alist'.
During executing, it looks for coding system from process-coding-system-alist at first, if it's nil, then looks from default-process-coding-system.
If your want to change the encoding, you can add your converting option to process-coding-system-alist, below are the content of it.
Value: (("\\.dz\\'" no-conversion . no-conversion)
...
("\\.elc\\'" . utf-8-emacs)
("\\.utf\\(-8\\)?\\'" . utf-8)
("\\.xml\\'" . xml-find-file-coding-system)
...
("" undecided))
Or, if you didn't set process-coding-system-alist, it's nil, you could assign your encoding option to default-process-coding-system,
for example:
(setq default-process-coding-system '(utf-8 . utf-8))
(If input is encoded as utf-8, then output encoded as utf-8)
Or
(setq default-process-coding-system '(undecided-unix . iso-latin-1-unix))
I also wrote a post about this if you want details.
Quoting the documentation for shell-command-on-region (C-h f shell-command-on-region RET):
To specify a coding system for converting non-ASCII characters
in the input and output to the shell command, use C-x RET c
before this command. By default, the input (from the current buffer)
is encoded in the same coding system that will be used to save the file,
`buffer-file-coding-system'. If the output is going to replace the region,
then it is decoded from that same coding system.
The noninteractive arguments are START, END, COMMAND,
OUTPUT-BUFFER, REPLACE, ERROR-BUFFER, and DISPLAY-ERROR-BUFFER.
Noninteractive callers can specify coding systems by binding
`coding-system-for-read' and `coding-system-for-write'.
In other words, you'd do something like
(let ((coding-system-for-read 'utf-8-unix))
(shell-command-on-region ...) )
This is untested, not sure what the value of coding-system-for-read (or perhaps -write instead? or as well?) should be in your case. I guess you could also utilize the OUTPUT-BUFFER argument and direct the output to a buffer whose coding system is set to what you need it to be.
Another option might be to wiggle the locale in the perltidy invocation, but again, without more information about what you are using now, and no means to experiment on a system similar to yours, I can only hint.

Hiding ^M in Emacs

Sometimes I need to read log files that have ^M (control-M) in the line endings. I can do a global replace to get rid of them, but then something more is logged to the log file and, of course, they all come back.
Setting Unix-style or dos-style end-of-line encoding doesn't seem to make much difference (but Unix-style is my default). I'm using the undecided-(unix|dos) coding system.
I'm on Windows, reading log files created by log4net (although log4net obviously isn't the only source of this annoyance).
(defun remove-dos-eol ()
"Do not show ^M in files containing mixed UNIX and DOS line endings."
(interactive)
(setq buffer-display-table (make-display-table))
(aset buffer-display-table ?\^M []))
Solution by Johan Bockgård. I found it here.
Modern versions of emacs know how to handle both UNIX and DOS line endings, so when ^M shows up in the file, it means that there's a mixture of both in the file. When there is such a mixture, emacs defaults to UNIX mode, so the ^Ms are visible. The real fix is to fix the program creating the file so that it uses consistent line-endings.
What about?
C-x RET c dos RET C-x C-f FILENAME RET
I made a file that has two lines, with the second having a carriage return. Emacs would open the file in Unix coding, and switching coding system does nothing. However, the universal-coding-system-argument above works.
I believe you can change the line coding system the file is using to the Unix format with
C-x RET f UNIX RET
If you do that, the mode line should change to add the word "(Unix)", and all those ^M's should go away.
If you'd like to view the log files and simply hide the ^M's rather than actually replace them you can use Drew Adam's highlight extension to do so.
You can either write elisp code or make a keyboard macro to do the following
select the whole buffer
hlt-highlight-regexp-region
C-q C-M
hlt-hide-default-face
This will first highlight the ^M's and then hide them. If you want them back use `hlt-show-default-face'
Edric's answer should get more attention. Johan Bockgård's solution does address the poster's complaint, insofar as it makes the ^M's invisible, but that just masks the underlying problem, and encourages further mixing of Unix and DOS line-endings.
The proper solution would be to do a global M-x replace-regexp to turn all line endings to DOS ones (or Unix, as the case may be). Then close and reopen the file (not sure if M-x revert-buffer would be enough) and the ^M's will either all be invisible, or all be gone.
You can change the display-table entry of the Control-M (^M) character, to make it displayable as whitespace or even disappear totally (vacuous). See the code in library pp-c-l.el (Pretty Control-L) for inspiration. It displays ^L chars in an arbitrary way.
Edited: Oops, I just noticed that #binOr already mentioned this method.
Put this in your .emacs:
(defun dos2unix ()
"Replace DOS eolns CR LF with Unix eolns CR"
(interactive)
(goto-char (point-min))
(while (search-forward "\r" nil t) (replace-match "")))
Now you can simply call dos2unix and remove all the ^M characters.
If you encounter ^Ms in received mail in Gnus, you can use W c (wash CRs), or
(setq gnus-treat-strip-cr t)
what about using dos2unix, unix2dos (now tofrodos)?
sudeepdino008's answer did not work for me (I could not comment on his answer, so I had to add my own answer.).
I was able to fix it using this code:
(defun dos2unix ()
"Replace DOS eolns CR LF with Unix eolns CR"
(interactive)
(goto-char (point-min))
(while (search-forward (string ?\C-m) nil t) (replace-match "")))
Like binOr said add this to your %APPDATA%.emacs.d\init.el on windows or where ever is your config.
;; Windows EOL
(defun hide-dos-eol ()
"Hide ^M in files containing mixed UNIX and DOS line endings."
(interactive)
(setq buffer-display-table (make-display-table))
(aset buffer-display-table ?\^M []))
(defun show-dos-eol ()
"Show ^M in files containing mixed UNIX and DOS line endings."
(interactive)
(setq buffer-display-table (make-display-table))
(aset buffer-display-table ?\^M ?\^M))
(add-hook 'text-mode-hook 'hide-dos-eol)

determining the line terminator in Emacs

I'm writing a config file and I need to define if the process expects a windows format file or a unix format file. I've got a copy of the expected file - is there a way I can check if it uses \n or \r\n without exiting emacs?
If it says (DOS) on the modeline when you open the file on Unix, the line endings are Windows-style. If it says (Unix) when you open the file on Windows, the line endings are Unix-style.
From the Emacs 22.2 manual (Node: Mode Line):
If the buffer's file uses
carriage-return linefeed, the colon
changes to either a backslash ('\') or
'(DOS)', depending on the operating
system. If the file uses just
carriage-return, the colon indicator
changes to either a forward slash
('/') or '(Mac)'. On some systems,
Emacs displays '(Unix)' instead of the
colon for files that use newline as
the line separator.
Here's a function that – I think – shows how to check from elisp what Emacs has determined to be the type of line endings. If it looks inordinately complicated, perhaps it is.
(defun describe-eol ()
(interactive)
(let ((eol-type (coding-system-eol-type buffer-file-coding-system)))
(when (vectorp eol-type)
(setq eol-type (coding-system-eol-type (aref eol-type 0))))
(message "Line endings are of type: %s"
(case eol-type
(0 "Unix") (1 "DOS") (2 "Mac") (t "Unknown")))))
If you go in hexl-mode (M-x hexl-mode), you shoul see the line termination bytes.
Open the file in emacs using find-file-literally. If lines have ^M symbols at the end, it expects a windows format text file.
The following Elisp function will return nil if no "\r\n" terminators appear in a file (otherwise it returns the point of the first occurrence). You can put it in your .emacs and call it with M-x check-eol.
(defun check-eol (FILE)
(interactive "fFile: ")
(set-buffer (generate-new-buffer "*check-eol*"))
(insert-file-contents-literally FILE)
(let ((point (search-forward "\r\n")))
(kill-buffer nil)
point))