Highlighting and replacing non-printable unicode characters in Emacs

Highlighting and replacing non-printable unicode characters in Emacs - unicode

I have an UTF-8 file containing some Unicode characters like LEFT-TO-RIGHT OVERRIDE (U+202D) which I want to remove from the file. In Emacs, they are hidden (which should be the correct behavior?) by default. How do I make such "exotic" unicode characters visible (while not changing display of "regular" unicode characters like german umlauts)? And how do I replace them afterwards (with replace-string for example. C-X 8 Ret does not work for isearch/replace-string).
In Vim, its quite easy: These characters are displayed with their hex representation per default (is this a bug or missing feature?) and you can easily remove them with :%s/\%u202d//g for example. This should be possible with Emacs?

You can do M-x find-file-literally then you will see these characters.
Then you can remove them using usual string-replace

How about this:
Put the U+202d character you want to match at the top of the kill ring by typing M-:(kill-new "\u202d"). Then you can yank that string into the various searching commands, with either C-y (eg. query-replace) or M-y (eg. isearch-forward).
(Edited to add:)
You could also just call commands non-interactively, which doesn't present the same keyboard-input difficulties as the interactive calls. For example, type M-: and then:
(replace-string "\u202d" "")
This is somewhat similar to your Vim version. One difference is that it only performs replacements from the cursor position to the bottom of the file (or narrowed region), so you'd need to go to the top of the file (or narrowed region) prior to running the command to replace all matches.

I also have this issue, and this is particularly annoying for commits as it may be too late to fix the log message when one notices the mistake. So I've modified the function I use when I type C-x C-c to check whether there is a non-printable character, i.e. matching "[^\n[:print:]]", and if there is one, put the cursor over it, output a message, and do not kill the buffer. Then it is possible to manually remove the character, replace it by a printable one, or whatever, depending on the context.
The code to use for the detection (and positioning the cursor after the non-printable character) is:
(progn
(goto-char (point-min))
(re-search-forward "[^\n[:print:]]" nil t))
Notes:
There is no need to save the current cursor position since here, either the buffer will be killed or the cursor will be put over the non-printable character on purpose.
You may want to slightly modify the regexp. For instance, the tab character is a non-printable character and I regard it as such, but you may also want to accept it.
About the [:print:] character class in the regexp, you are dependent on the C library. Some printable characters may be regarded as non-printable, like some recent emojis (but not everyone cares).
The re-search-forward return value will be regarded as true if and only if there is a non-printable character. This is exactly what we want.
Here's a snippet of what I use for Subversion commits (this is between more complex code in my .emacs).
(defvar my-svn-commit-frx "/svn-commit\\.\\([0-9]+\\.\\)?tmp\\'")
and
((and (buffer-file-name)
(string-match my-svn-commit-frx (buffer-file-name))
(progn
(goto-char (point-min))
(re-search-forward "[^\n[:print:]]" nil t)))
(backward-char)
(message "The buffer contains a non-printable character."))
in a cond, i.e. I apply this rule only on filenames used for Subversion commits. The (backward-char) can be used or not, depending on whether you want the cursor to be over or just after the non-printable character.

Related

Emacs macro that only fires on non-empty lines

Generally, I find myself writing short macros that, e.g. add or remove line comments or correct indentation on a line.
However, with whitespace-mode enabled, I will still have to look out not to fire these macros on blank lines; if the macro tries to delete a character on an empty line, generally it will mess up the entire document.
Is there any solution to this problem which does not involve having some amount of spaces on blank lines, or otherwise altering my document structure?

You could use C-M-s ^. at the beginning of the macro. That is, search for a line that contains at least one character.

Expanding on the legoscia's answer -- here's a version which also notifies if there's no previous line at all:
(defun goto-first-previous-non-empty-line ()
(interactive)
(if (re-search-backward "^." nil t)
(message "First previous non-empty line")
(message "Beginning of buffer")
)
)

How to disable underscore (_) subscripting in Emacs, TeX input method

On Emacs, while editing a text document of notes for myself (a .txt document, not a .tex document), I am using M-x set-input-method Ret TeX, in order to get easy access to various Unicode characters. So for example, typing \tospace causes a "→" to be inserted into the text, and typing x^2 causes "x2" to be inserted, because the font I am using has support for Unicode codepoints 0x2192 and 0x00B2, respectively.
One of the specially handled characters in the method is for the underscore key, _. However, the font I am using for Emacs does not appear to have support for the codepoints for the various subscript characters, such as subscript zero (codepoint 0x2080), and so when I type _0, I get something rendered as a thin blank in my output. I would prefer to just have the two characters _0 in this case.
I can get _0 by the awkward keystroke sequence _spacedel0, since the space keystroke in the middle of the sequence causes Emacs to abort the TeX input method. But this is awkward.
So, my question: How can I locally customize my Emacs to not remap the _ key at all when I am in the TeX input method? Or how can I create a modified clone (or extension, etc) of the TeX input method that leaves out underscore from its magic?
Things I have tried so far:
I have already done M-xdescribe-key on _; but it is just bound to self-insert-command, like many other text characters. I did see a post-self-insert-hook there, but I have not explored trying to use that to subvert the TeX input method.
Things I have not tried so far:
I have not tried learning anything about the input method architecture or its source code. From my quick purview of the code and methods. it did not seem like something I could quickly jump into.

So here is the solution I just found: Make a personalized copy of the TeX input method, with all of the undesirable entries removed. Then when using M-x set-input-method, select the personalized version instead of TeX.
I would have tried this earlier, but the built-in documentation for set-input-mode and its ilk does not provide sufficient guidance to the actual source for the input-methods for me to find it. It was only after doing another search on SO and finding this: Emacs: Can't activate input method that I was able to get enough information to do this on my own.
Details:
In Emacs, open /usr/share/emacs/22.1/leim/leim-list.el and find the entry for the input method you want to customize. The entry will be something like the following form:
(register-input-method
"TeX" "UTF-8" 'quail-use-package
"\\" "LaTeX-like input method for many characters."
"quail/latin-ltx")
Note the file name prefix referenced in the last element in the form above. Find the corresponding Elisp source file; in this case, it is a relative path to the file quail/latin-ltx.el[.gz]. Open that file in Emacs, and check it out; it should have the entries for the method remappings, both desired and undesired.
Make a user-local copy of that Elisp source file amongst your other Emacs customizations. Open that local copy in Emacs.
In your local copy, find the (quail-define-package ...) form in the file, and change the name of the package; I used FSK-TeX as my new name, like so:
(quail-define-package
"FSK-TeX" "UTF-8" "\\" t ;; <-- The first argument here is the important bit to change.
"LaTeX-like input method for many characters but not as many as you might think.
...)
Go through your local copy, and delete all the S-expressions for mappings that you don't want.
In your .emacs configuration file, register your customized input method, using a form analogous to the one you saw when you looked at leim-list.el in step 1:
(register-input-method
"FSK-TeX" "UTF-8" 'quail-use-package
"\\" "FSK-customized LaTeX-like input method for many characters."
"~/ConfigFiles/Elisp/leim/latin-ltx")
Restart Emacs and test your new input-method; in my case, by doing M-x set-input-method FSK-TeX, typing a_0, and confirming that a_0 shows up in the buffer.
So, there's at least one answer that is less awkward once you have it installed than some of the workarounds listed in the question (and as it turns out, are also officially documented in the Emacs 22 manual as a way to cut off input method processing).
However, I am not really happy with this solution, since I would prefer to inherit future changes to TeX mode, and just have my .emacs remove the undesirable entries on startup.
So I will wait to see if anyone else comes up with a better answer than this.

I did not test this myself, but this seems to be the exact thing you are looking for:
"How to disable underscore subscript in TeX mode in emacs" - source
Two solutions are given in this blogpot:
By the author of the blogpost: (setq font-lock-maximum-decoration nil) (from maximum)
Mentioned as comment:
(eval-after-load "tex-mode" '(fset 'tex-font-lock-subscript 'ignore))

The evil plugin for vim-like modal keybinding allows to map two subsequent presses of the _ key to the insertion of a single _ character:
(set-input-method 'TeX)
(define-key evil-insert-state-local-map (kbd "_ _")
(lambda () (interactive) (insert "_")))
(define-key evil-insert-state-local-map (kbd "^ ^")
(lambda () (interactive) (insert "^")))
When _ and then 1 is pressed, we get ₁ as before, but
when _ and then _ is pressed, we get _.
Analogous for ^.

As already explained in pnkfelix answer, it seems we have to make a personalized copy of the TeX input method. But here comes a lighter way to do that, without any file tweaking. Simply put the following in your .emacs :
(eval-after-load "quail/latin-ltx"
'(let ((pkg (copy-tree (quail-package "TeX"))))
(setcar pkg "MyTeX")
(assq-delete-all ?_ (nth 2 pkg))
(quail-add-package pkg)))
(set-input-method 'TeX)
(register-input-method "MyTeX" "UTF-8" 'quail-use-package "\\")
(set-input-method 'MyTeX)
The important part is the assq-delete-all line in the middle that remove all shortcut entries starting with _. It's a bit of a lisp hack but it seems to work. Since I'm also annoyed by the shortcuts starting with - and ^, I also use the following two lines to disable them :
(assq-delete-all ?- (nth 2 pkg))
(assq-delete-all ?^ (nth 2 pkg))
Note that afterwards you can M-x set-input-method at any time and indicate TeX or MyTeX to switch between the pristine TeX input method or the customized one.

Define a character as a word boundary

I've defined the \ character to behave as a word constituent in latex-mode, and I'm pretty happy with the results. The only thing bothering me is that a sequence like \alpha\beta gets treated as a single word (which is the expected behavior, of course).
Is there a way to make emacs interpret a specific character as a word "starter"? This way it would always be considered part of the word following it, but never part of the word preceding it.
For clarity, here's an example:
\alpha\beta
^ ^
1 2
If the point is at 1 and I press M-d, the string "\alpha" should be killed.
If the point is at 2 and I press M-<backspace>, the string "\beta" should be killed.
How can I achieve this?

Another thought:
Your requirement is very like what subword-mode provides for camelCase.
You can't customize subword-mode's behaviour -- the regexps are hard-coded -- but you could certainly copy that library and modify it for your purposes.
M-x find-library RET subword RET
That would presumably be a pretty robust solution.
Edit: updated from the comments, as suggested:
For the record, changing every instance of [[:upper:]] to [\\\\[:upper:]] in the functions subword-forward-internal and subword-backward-internal inside subword.el works great =) (as long as "\" is defined as "w" syntax).
Personally I would be more inclined to make a copy of the library than edit it directly, unless for the purpose of making the existing library a little more general-purpose, for which the simplest solution would seem to be to move those regexps into variables -- after which it would be trivial to have buffer-local modified versions for this kind of purpose.
Edit 2: As of Emacs 24.3 (currently a release candidate), subword-mode facilitates this with the new subword-forward-regexp and subword-backward-regexp variables (for simple modifications), and the subword-forward-function and subword-backward-function variables (for more complex modifications).
By making those regexp variables buffer-local in latex-mode with the desired values, you can just use subword-mode directly.

You should be able to implement this using syntax text properties:
M-: (info "(elisp) Syntax Properties") RET
Edit: Actually, I'm not sure if you can do precisely this?
The following (which is just experimentation) is close, but M-<backspace> at 2 will only delete "beta" and not the preceding "\".
I suppose you could remap backward-kill-word to a function which checked for that preceding "\" and killed that as well. Fairly hacky, but it would probably do the trick if there's not a cleaner solution.
I haven't played with this functionality before; perhaps someone else can clarify.
(modify-syntax-entry ?\\ "w")
(setq parse-sexp-lookup-properties t)
(setq syntax-propertize-function 'my-propertize-syntax)
(defun my-propertize-syntax (start end)
"Set custom syntax properties."
(save-excursion
(goto-char start)
(while (re-search-forward "\\w\\\\" end t)
(put-text-property
(1- (point)) (point) 'syntax-table (cons "." ?\\)))))

emacs equivalent of ct

looking for an equivalent cut and paste strategy that would replicate vim's 'cut til'. I'm sure this is googleable if I actually knew what it was called in vim, but heres what i'm looking for:
if i have a block of text like so:
foo bar (baz)
and I was at the beginning of the line and i wanted to cut until the first paren, in visual mode, I'd do:
ct (
I think there is probably a way to look back and i think you can pass more specific regular expressions. But anyway, looking for some emacs equivalents to doing this kind of text replacement. Thanks.

Here are three ways:
Just type M-dM-d to delete two words. This will leave the final space, so you'll have to delete it yourself and then add it back if you paste the two words back elsewhere.
M-z is zap-to-char, which deletes text from the cursor up to and including a character you specify. In this case you'd have to do something like M-2M-zSPC to zap up to and including the second space character.
Type C-SPC to set the mark, then go into incremental search with C-s, type a space to jump to the first space, then C-s to search forward for the next space, RET to terminate the search, and finally C-w to kill the text you selected.
Personally I'd generally go with #1.

as ataylor said zap-to-char is the way to go, The following modification to the zap-to-char is what exactly you want
(defun zap-up-to-char (arg char)
"Like standard zap-to-char, but stops just before the given character."
(interactive "p\ncZap up to char: ")
(kill-region (point)
(progn
(search-forward (char-to-string char) nil nil arg)
(forward-char (if (>= arg 0) -1 1))
(point))))
(define-key global-map [(meta ?z)] 'zap-up-to-char) ; Rebind M-z to our version
BTW don't forget that it has the ability to go backward with a negative prefix

That sounds like zap-to-char in emacs, bound to M-z by default. Note that zap-to-char will cut all the characters up to and including the one you've selected.

How do I run an Emacs hook when a buffer is modified?

Building on Getting Emacs to untabify when saving certain file types (and only those file types) , I'd like to run a hook to untabify my C++ files when I start modifying the buffer. I tried adding hooks to untabify the buffer on load, but then it untabifies all my writable files that are autoloaded when emacs starts.
(For those that wonder why I'm doing this, it's because where I work enforces the use of tabs in files, which I'm happy to comply with. The problem is that I mark up my files to tell me when lines are too long, but the regexp matches the number of characters in the line, not how much space the line takes up. 4 tabs in a line can push it far over my 132 character limit, but the line won't be marked appropriately. Thus, I need a way to tabify and untabify automatically.)

Take a look at the variable "before-change-functions".
Perhaps something along this line (warning: code not tested):
(add-hook 'before-change-functions
(lambda (&rest args)
(if (not (buffer-modified-p))
(untabify (point-min) (point-max)))))

Here is what I added to my emacs file to untabify on load:
(defun untabify-buffer ()
"Untabify current buffer"
(interactive)
(untabify (point-min) (point-max)))
(defun untabify-hook ()
(untabify-buffer))
; Add the untabify hook to any modes you want untabified on load
(add-hook 'nxml-mode-hook 'untabify-hook)

This answer is tangential, but may be of use.
The package wide-column.el link text changes the cursor color when the cursor is past a given column - and actually the cursor colors can vary depending on the settings. This sounds like a less intrusive a solution than your regular expression code, but it may not suit your needs.

And a different, tangential answer.
You mentioned that your regexp wasn't good enough to tell when the 132 character limit was met. Perhaps a better regexp...
This regexp will match a line when it has more than 132 characters, assuming a tabs width is 4. (I think I got the math right)
"^\\(?: \\|[^ \n]\\{4\\}\\)\\{33\\}\\(.+\\)$"
The last parenthesized expression is the set of characters that are over the limit. The first parenthesized expression is shy.

We Keep Coding

iphone swift flutter scala powershell matlab mongodb postgresql perl eclipse