How to process the buffer character-by-character in elisp?

I am trying to write some elisp to process each character in the current buffer (I know this will be a bit slow, but I think it is the best way). I do not want to use a regular expression. How can I do this?
The function buffer-string returns the contents of the current buffer as a string. Using this, I could iterate over the string and read or set each character. However, I cannot figure out how to put the result back into the buffer.
Can someone give an example of iterating over each character, changing it in some simple way, and putting the result back in the buffer?

Use a while loop to iterate over all characters in the buffer, replacing each one in place:
(goto-char (point-min))
(while (not (eobp))
  ;; do-something is a placeholder for your own character-to-character function.
  (let* ((current-character (char-after))
         (new-character (do-something current-character)))
    ;; Replace the character at point. insert-char leaves point after the
    ;; inserted character, so no explicit forward-char is needed.
    (delete-char 1)
    (insert-char new-character)))
The loop advances one character per iteration until the end of the buffer (eobp) is reached.
char-after returns the character at the current position. The delete-char/insert-char calls replace the old character with the new one that results from processing it. Because insert-char leaves point after the inserted character, the loop continues with the next unprocessed character; calling forward-char as well would skip every other character.
To replace the old character with multiple characters, i.e. a string, simply replace insert-char with insert. insert also puts point after the newly inserted text, so the loop still proceeds with the next unprocessed character.

To add something to what @lunaryorn says (perhaps part of a solution, depending on what you need, and so not just a comment):
When you process a buffer character by character, it is very commonly the case that you do not want or need to do something for each character, but you instead need to do something for particular characters in the buffer.
When this is the case, you often do not need to examine each character. Instead, you can use functions such as these (see their doc for what they do):
search-forward or re-search-forward, if specific characters are targeted
next-single-property-change, if one or more characters with specific text properties are targeted (or perhaps next-single-char-property-change, if overlays are involved)
next-property-change, if one or more characters at any text-property change are targeted (or perhaps next-char-property-change, if overlays are involved)
In such cases, you iterate over buffer positions (so, over the chars at those positions), as in @lunaryorn's solution, but you use such a function to quickly skip over characters you are not interested in. This is much more common, IMO, than checking each character one by one. But whether or not it fits your use case, I don't know.
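The skip-ahead idea can also be sketched outside Emacs. The following is an illustrative Python analogue, not elisp: str.find plays the role that search-forward plays in a buffer, and the helper names positions_of and replace_at_targets are hypothetical, chosen only for this example.

```python
def positions_of(text, target):
    """Yield each index where the single character `target` occurs,
    jumping from match to match instead of inspecting every character."""
    pos = text.find(target)
    while pos != -1:
        yield pos
        pos = text.find(target, pos + 1)


def replace_at_targets(text, target, transform):
    """Apply `transform` only to the characters equal to `target`."""
    chars = list(text)
    for pos in positions_of(text, target):
        chars[pos] = transform(chars[pos])
    return "".join(chars)


print(replace_at_targets("a-b-c", "-", lambda c: "+"))
```

Only the positions that the search reports are touched; everything in between is skipped without running any per-character code.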

Related

Null pointer Exception in the case of reading in emoticons

I have a text file that looks like this:
shooting-stars 💫 "are cool"
I have a lexical analyzer that uses FileInputStream to read the characters one at a time, passing those characters to a switch statement that returns the corresponding lexeme.
In this case, 💫 represents assignment so this case passes:
case 'ð' :
return new Lexeme("ASSIGN");
For some reason, the file reader stops at that point, returning a null pointer even though it has yet to process the string (or whatever is after the 💫). Any time it reads in an emoticon it does this. If there were no emoticons, it gets to the end of file. Any ideas?
I suspect the problem is that the character 💫 (Unicode code point U+1F4AB) is outside the range of characters that Java represents internally as a single char value. Instead, Java represents characters above U+FFFF as two char values known as a surrogate pair, in this case U+D83D followed by U+DCAB. (See this thread for more info and some links.)
It's hard to know exactly what's going on with the little bit of code that you presented, but my guess is that you are not handling this situation correctly. You will need to adjust your processing logic to deal with your emoticons arriving in two pieces.
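To make the encoding issue concrete, here is a small Python sketch (Python only because it makes the code units easy to print; the question itself concerns Java). It shows the UTF-16 surrogate pair mentioned above and also the four UTF-8 bytes of the character. That the file is stored as UTF-8 is an assumption on my part; if it holds, a byte-at-a-time reader such as FileInputStream would see the byte 0xF0 first, which is 'ð' in Latin-1 and may be why that character appears in the case label.

```python
ch = "\U0001F4AB"  # the symbol used as the assignment token in the question

# UTF-16 code units (big-endian, no BOM): two bytes per unit.
utf16 = ch.encode("utf-16-be")
units = [int.from_bytes(utf16[i:i + 2], "big") for i in range(0, len(utf16), 2)]
surrogates = [f"U+{u:04X}" for u in units]

# UTF-8 bytes: what a byte-oriented reader would deliver one at a time,
# assuming the file is stored as UTF-8.
utf8_bytes = list(ch.encode("utf-8"))
first_byte_as_latin1 = bytes(utf8_bytes[:1]).decode("latin-1")

print(surrogates)
print([hex(b) for b in utf8_bytes])
print(first_byte_as_latin1)
```

Either way, one visible character arrives as several units, so a lexer that switches on a single unit needs to collect the remaining units before it continues.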

How to escape double quote?

In org mode, if I want to format text as monospace verbatim, i.e. ~...~, and the text is inside quotes, as in ~"..."~, it is not formatted (it is left as is).
Also, are quotes a reserved symbol? If so, what do they mean? (They don't seem to affect the generated HTML or the display inside Emacs.)
The culprit in this case is the regular expression in org-emph-re / org-verbatim-re, which determines whether a sequence of characters in the document is set verbatim or not.
org-verbatim-re is a variable defined in `org.el'.
Its value is
"\([ ('\"{]\|^\)\(\([=~]\)\([^
\n,\"']\|[^
\n,\"'].?\(?:\n.?\)\{0,1\}[^
\n,\"']\)\3\)\([- .,:!?;'\")}\]\|$\)"
Single and double quotes are explicitly forbidden inside the verbatim markers = and ~ by
[^
\n,\"']\|[^
\n,\"']
I found discussions dating back three years that came to the conclusion that you have to tinker with this regular expression and set the variable org-emph-re/org-verbatim-re to something that matches your wishes in your Emacs setup (maybe a file-local variable works as well). You can experiment by removing double quotes from the excluding character classes and the outside matches, as in
"\([ ('{]\|^\)\(\([*/_=~+]\)\([^
\n,']\|[^
\n,'].?\(?:\n.?\)\{0,1\}[^
\n,']\)\3\)\([- .,:!?;')}\]\|$\)"
but looking at that regex, heaven knows what happens to complex documents; you have to try...
Edit: as it happens, if I evaluate the following as a region, quotes inside = are exported correctly, but nothing else is :-). I will investigate further when I have more time.
(setq org-emph-re "\([ ('{]\|^\)\(\([*/_=~+]\)\([^
\n,']\|[^
\n,'].?\(?:\n.?\)\{0,1\}[^
\n,']\)\3\)\([- .,:!?;')}]\|$\)")
Edit 2: Got it to work by changing org.el directly:
Change the line following (defvar org-emphasis-regexp-components from '(" \t('\"{" "- \t.,:!?;'\")}\\" " \t\r\n,\"'" "." 1) to '(" \t('{" "- \t.,:!?;')}\\" " \t\r\n,'" "." 1), then recompile org and restart Emacs.
This was a defcustom prior to the 8.0 release; it isn't anymore, so you have to live with this manual modification.
regards,
Tom
Finally, I found a solution from http://comments.gmane.org/gmane.emacs.orgmode/82571
According to that thread, the regexp for verbatim is built from the variable org-emphasis-regexp-components, which defines the legal characters before, after, at the border of, and in the body of emphasis; verbatim is one of the emphasis environments in org mode.
A workable setting given by that thread:
(setcar (nthcdr 2 org-emphasis-regexp-components) " \t\n,")
(custom-set-variables `(org-emphasis-alist ',org-emphasis-alist))
For a small number of characters that have some unwanted effect in Emacs org-mode (because they are metacharacters), it may be helpful to have a look at the special symbols in org-mode (org-entities.el).
For example, " can be encoded as \quot{} (the pair of braces at the end is not mandatory, but it is needed if no whitespace follows).
So instead of ="..."= you would write =\quot{}...\quot{}=.
That is more typing and looks pretty ugly. For the latter, org-mode has a solution: with C-c C-x \ you can toggle a display feature for those symbols. If it is active, a " is displayed directly after you type \quot{}.
Besides, this list of symbols can easily be extended, e.g.
(add-to-list 'org-entities
'("backslash" "\\textbackslash" nil "\\" "\\" "\\" "\\"))
Nevertheless, I sorely miss easier escaping in org-mode, beyond the above solution and beyond escaping a whole line with a : at its beginning.
I'd be happy if =verbatim= left the text between the ='s unchanged in all cases. Not =this*bold*text=, but =this *bold* text=, as we know it from any well-designed markup or markdown language.
But, of course, this is better placed at the org-mode development pages. Ideally with a fitting patch... :-)
I've met a similar problem, and thanks to @chaiko for a basic solution. However, @chaiko's solution only works for org-mode's fontification; it doesn't affect org-export. To get a correctly exported document, you need one more hack to org-mode's parser: (org-element--set-regexps).
So the full code snippet should be something like:
(setcar (nthcdr 2 org-emphasis-regexp-components) " \t\n\r")
(custom-set-variables `(org-emphasis-alist ',org-emphasis-alist))
(org-element--set-regexps)
I've integrated this to my oh-my-emacs project: https://github.com/xiaohanyu/oh-my-emacs/blob/e82fce10d47f7256df6d39e32ca288d0ec97a764/core/ome-org.org#code-block-fontification .

how can I get emacs to recognize single quotes as not being string begin/end tokens in font-lock mode

I've a preprocessor (xhp) that allows me to write unquoted text in php code e.g.:
<foo>
my enemies' base
</foo>
might appear in a .php file, but as soon as emacs sees that single quote it sees the entire rest of the file as being in a string.
I can't figure out where 'font-lock-syntactic-keywords' is getting set in (c-mode), but it has a syntax table associated with it that seems to cause this.
(c-in-literal) returns 'string as well, so maybe I need to solve this deeper in the code than at the font-lock level. If anyone has any tips on this, it would be appreciated.
The simplest solution that I'd be happy with would be just assuming the string is one-line only.
I don't know what major-mode you're using, but in general the trick is to change the syntax of the ' character with something like (modify-syntax-entry ?\' "." <syntaxtable>). Of course, if the ' character can sometimes delimit strings and sometimes not, then it's more tricky and you'll need to come up with a font-lock-syntactic-keywords (or syntax-propertize-function) rule which can tell which is used at any given point.
E.g. assuming PHP never treats ' as a string delimiter, something like the following might solve your problem:
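(add-hook 'php-mode-hook
          ;; With no table argument, modify-syntax-entry changes the
          ;; current buffer's syntax table; "." means punctuation.
          (lambda () (modify-syntax-entry ?\' ".")))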

transliterate a portion of the input

Could I get some pointers for a sed script that transliterates like y/abc/123/, but only on some of the input? The processing would follow these rules:
enable transliterate once ¡ char seen
disable once µ char seen (may be on diff line to ¡)
never transliterate between &; or <> chars
This can be done in sed, but it's going to be extremely painful. Perl, Python, Ruby, etc. would be better choices.
If you must do it in sed, the basic approach is to preserve parts of the line you don't want to change in the hold buffer, working your way through the line and appending completed portions to the hold buffer until the main buffer is empty, then pull the hold buffer back into the main buffer. Also, you want to have two separate loops, one for transliterating mode (entered on ¡) and the other for passthrough mode (the initial mode, and entered on µ).
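Since the answer suggests that a scripting language is a better fit, here is a minimal Python sketch of the rules as a small state machine. It is one possible reading of the question, not a definitive implementation: the function name transliterate is hypothetical, the ¡ and µ markers are kept in the output, and markers that occur inside an &...; or <...> span are treated as ordinary protected text. All three are assumptions.

```python
def transliterate(text, src="abc", dst="123"):
    """Apply a y/src/dst/-style mapping only while enabled."""
    table = str.maketrans(src, dst)
    out = []
    enabled = False        # switched on by '¡', off by 'µ'
    protected_end = None   # closing char of an &...; or <...> span, if inside one
    for ch in text:
        if protected_end is not None:
            # Inside an entity or tag: copy verbatim until it closes.
            out.append(ch)
            if ch == protected_end:
                protected_end = None
        elif ch == "&":
            protected_end = ";"
            out.append(ch)
        elif ch == "<":
            protected_end = ">"
            out.append(ch)
        elif ch == "¡":
            enabled = True
            out.append(ch)  # assumption: markers are kept in the output
        elif ch == "µ":
            enabled = False
            out.append(ch)
        else:
            out.append(ch.translate(table) if enabled else ch)
    return "".join(out)


print(transliterate("abc ¡abc &abc; <abc> abc\nabc µabc"))
```

Because the state lives in plain variables rather than in sed's hold buffer, the enabled flag carries across line breaks without extra work when the whole input is passed as one string.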

How to use '^@' in Vim scripts?

I'm trying to work around a problem with using ^@ (i.e., <ctrl-@>) characters in Vim scripts. I can insert them into a script, but when the script runs it seems the line is truncated at the point where a ^@ was located.
My kludgy solution so far is to have a ^@ stored in a variable, then reference the variable in the script whenever I would have quoted a literal ^@. Can someone tell me what's going on here? Is there a better way around this problem?
That is one reason why I never use raw special character values in scripts. While ^@ does not work, the string <C-@> in mappings works as expected, so you may use one of
nnoremap <C-@> {rhs}
nnoremap <Nul> {rhs}
It is strange, but you cannot use <Char-0x0> here. Some notes about the null byte in strings:
Inserting a null byte into a string truncates it: Vim uses old C-style strings that end with a null byte, so a null byte cannot appear inside a string. These strings are very inefficient, so if you want to generate a very large text, try accumulating it into a list of lines (using setline is very fast, as a buffer is represented as a list of lines).
Most functions that return a list of strings (like readfile, getline(start, end)) or take a list of strings (like writefile, setline, append) treat \n (NL) as Null. It is also the internal representation of buffer lines, see :h NL-used-for-Nul.
If you try to insert a \n character into the command line, you will get Null shown (but this is really a newline). If you want to edit a file that has \n in its filename (it is possible on *nix), you will need to prepend a backslash to the newline.
The byte ctrl-@ is also known as '\0'. Many languages, programs, etc. use it as an "end of string" marker, so it's not surprising that Vim gets confused there. If you must use this byte in the middle of a script string, it sounds like your workaround is a decent one.
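The truncation effect of an "end of string" marker can be reproduced outside Vim. The sketch below uses Python's ctypes module to view the same bytes once as a raw buffer and once as a C-style string; it only illustrates NUL-terminated strings in general and says nothing about Vim's actual source code.

```python
import ctypes

raw = b"before\x00after"
buf = ctypes.create_string_buffer(raw)

# .raw returns every byte in the buffer, including the extra terminating
# NUL that create_string_buffer appends.
full = buf.raw
# .value reads the buffer the way C string functions do: up to the first NUL.
as_c_string = buf.value

print(full)
print(as_c_string)
```

Everything after the embedded '\0' is still in memory, but any code that treats the data as a C string never sees it, which matches the truncation described in the question.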