How to correct files with mixed encodings? - emacs

Given a corrupted file with mixed encodings (e.g. utf-8 and latin-1), how do I configure Emacs to "project" all of its characters onto a single encoding (e.g. utf-8) when saving the file?
I wrote the following function to automate some of the cleaning, but I would guess the information needed to map the symbol "é" in one encoding to "é" in utf-8 is available somewhere (or that somebody has already written such a function), which would let me improve it; see the sketch after the error message below.
(defun jyby/cleanToUTF ()
  "Cleaning to UTF"
  (interactive)
  (save-excursion
    ;; start from the top so the whole buffer is cleaned, not just the text after point
    (goto-char (point-min))
    (replace-regexp "अ" "")
    (goto-char (point-min))
    (replace-regexp "आ" "")
    (goto-char (point-min))
    (replace-regexp "ॆ" "")))
(global-unset-key [f11])
(global-set-key [f11] 'jyby/cleanToUTF)
I have many files "corrupted" with mixed encodings (due to copy-pasting from a browser with an ill-configured font setup), which generates the error below. I sometimes clean them by hand, searching for each problematic symbol and replacing it with either "" or the appropriate character, or more quickly by specifying "utf-8-unix" as the encoding (which only postpones the same prompt to the next time I edit and save the file). It has become a real issue: in any such corrupted file, every accented character is replaced by a sequence that doubles in size at each save, eventually doubling the size of the file. I am using GNU Emacs 24.2.1.
These default coding systems were tried to encode text
in the buffer `test_accents.org':
(utf-8-unix (30 . 4194182) (33 . 4194182) (34 . 4194182) (37
. 4194182) (40 . 4194181) (41 . 4194182) (42 . 4194182) (45
. 4194182) (48 . 4194182) (49 . 4194182) (52 . 4194182))
However, each of them encountered characters it couldn't encode:
utf-8-unix cannot encode these: ...
Click on a character (or switch to this window by `C-x o'
and select the characters by RET) to jump to the place it appears,
where `C-u C-x =' will give information about it.
Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
to remove or modify the problematic characters,
or specify any other coding system (and risk losing
the problematic characters).
raw-text emacs-mule no-conversion
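A note on the "é" → "é" mapping asked about above: if the corruption is the common double-encoding case (UTF-8 bytes that were decoded as latin-1 and saved again), the mapping does not have to be listed by hand, it can be computed. A minimal sketch, assuming exactly that kind of damage (the my/ name is made up, and this is untested against the files in question):
(defun my/fix-double-encoded-region (beg end)
  "Re-decode a region whose UTF-8 bytes were mis-read as latin-1."
  (interactive "r")
  (let* ((raw (encode-coding-string
               (buffer-substring-no-properties beg end) 'latin-1))
         (fixed (decode-coding-string raw 'utf-8)))
    (delete-region beg end)
    (goto-char beg)
    (insert fixed)))
With the region around "é" this yields "é"; plain ASCII text passes through unchanged, but text that is already correct UTF-8 should not be run through it a second time.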

I have struggled with this in emacs many times. When I have a file that was messed up, e.g. in raw-text-unix mode, and save it as utf-8, emacs complains even about text that is already clean utf-8. I haven't found a way to get it to complain only about the non-utf-8 parts.
I just found a reasonable semi-automated approach using recode:
f=mixed-file
recode -f ..utf-8 <"$f" > /tmp/recode.out
diff "$f" /tmp/recode.out | cat -vt
# manually fix lines of text that can't be converted to utf-8 in $f,
# and re-run recode and diff until the diff output is empty.
One helpful tool along the way is http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=342+200+224&mode=obytes
Then I just re-open the file in emacs, and it is recognized as clean unicode.
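If the buffer is still open, the re-reading can also be done from inside Emacs with the stock revert-buffer-with-coding-system command (bound to C-x RET r), which re-visits the file with the coding system you name:
(revert-buffer-with-coding-system 'utf-8) ;; interactively: C-x RET r utf-8 RET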

Here's something to maybe get you started:
(require 'cl)    ;; for `loop', `incf', `decf'
(require 'eieio) ;; for `defclass' and `defmethod'
(put 'eof-error 'error-conditions '(error eof-error))
(put 'eof-error 'error-message "End of stream")
(put 'bad-byte 'error-conditions '(error bad-byte))
(put 'bad-byte 'error-message "Not a UTF-8 byte")
(defclass stream ()
  ((bytes :initarg :bytes :accessor bytes-of)
   (position :initform 0 :accessor position-of)))
(defun logbitp (byte bit) (not (zerop (logand byte (ash 1 bit)))))
(defmethod read-byte ((this stream) &optional eof-error eof)
  (with-slots (bytes position) this
    (if (< position (length bytes))
        (prog1 (aref bytes position) (incf position))
      (if eof-error (signal eof-error (list position)) eof))))
(defmethod unread-byte ((this stream))
  (when (> (position-of this) 0) (decf (position-of this))))
(defun read-utf8-char (stream)
  (let ((byte (read-byte stream 'eof-error)))
    (if (not (logbitp byte 7)) byte        ; single-byte (ASCII) character
      (let ((numbytes
             (cond
              ((not (logbitp byte 5))
               (setf byte (logand #2r11111 byte)) 1)
              ((not (logbitp byte 4))
               (setf byte (logand #2r1111 byte)) 2)
              ((not (logbitp byte 3))
               (setf byte (logand #2r111 byte)) 3))))
        ;; a leading byte that matches none of the patterns above is not valid UTF-8
        (unless numbytes (signal 'bad-byte (list byte)))
        (dotimes (b numbytes byte)
          (let ((next-byte (read-byte stream 'eof-error)))
            (if (and (logbitp next-byte 7) (not (logbitp next-byte 6)))
                (setf byte (logior (ash byte 6) (logand next-byte #2r111111)))
              (signal 'bad-byte (list next-byte)))))))))
(defun load-corrupt-file (file)
  (interactive "fFile to load: ")
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    (with-output-to-string
      (set-buffer-multibyte t)
      (loop with stream = (make-instance 'stream :bytes (buffer-string))
            for next-char =
            (condition-case err
                (read-utf8-char stream)
              (bad-byte (message "Fix this byte %d" (cadr err)))
              (eof-error nil))
            while next-char
            do (write-char next-char)))))
What this code does: it loads a file with no conversion and tries to read it as if it were encoded in UTF-8. Once it encounters a byte that doesn't look like it belongs to a UTF-8 sequence, it signals an error, and you need to handle that somehow (that's where the "Fix this byte" message is). But you would need to be inventive about how you fix it...
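One possible way to "handle it somehow", offered only as a sketch (this is my assumption, not part of the code above): substitute the Unicode replacement character U+FFFD for every offending byte and keep going, reusing the stream class and read-utf8-char defined above:
(defun load-corrupt-file-replacing (file)
  "Like `load-corrupt-file', but return the text with bad bytes shown as U+FFFD."
  (interactive "fFile to load: ")
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-contents-literally file)
    (let ((stream (make-instance 'stream :bytes (buffer-string)))
          (chars '())
          (done nil))
      (while (not done)
        (condition-case nil
            (push (read-utf8-char stream) chars)
          (bad-byte (push ?\xfffd chars)) ; mark the spot instead of stopping
          (eof-error (setq done t))))
      (concat (nreverse chars)))))
Searching the result for the replacement character then gives a list of the places that still need a human decision.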

Related

Is there a string to 32 bit integer coding system in elisp?

Is there a coding system such that (encode-coding-string msg '??? t) would convert my message into a list of 32-bit integers?
The binary coding system converts the message to 8-bit data just fine, and I am aware that I could post-process the result into 32-bit integers. I'm just wondering if there is already a coding system that does this... :) #lazy
Ok, attempt number two.
I wrote a test generator snippet in python:
def make_test_2():
    with open('test2.bin', 'wb') as f:
        args = [1867, 1982]
        for a in args:
            f.write((a).to_bytes(4, byteorder='little'))
This is pretty hacky (a couple of reverse calls, etc.), but it's only meant to be a quick and dirty prototype.
(defun read-int-list2 (filename)
  (let ((result '())
        (accumulator '())
        (accum-count 0)
        (accum-max 4))
    (with-temp-buffer
      (set-buffer-multibyte nil)
      (setq buffer-file-coding-system 'binary) ;; find a way to set temporarily? not sure
      (insert-file-contents-literally filename)
      (while (< (point) (point-max))
        (if (< accum-count accum-max)
            (progn
              (setq accumulator (cons (aref (buffer-substring-no-properties (point) (1+ (point))) 0)
                                      accumulator))
              (setq accum-count (1+ accum-count))))
        (if (>= accum-count accum-max) ;; four bytes accumulated, lets bundle
            (progn
              (let* ((s (reverse accumulator))
                     (e1 (elt s 0))
                     (e2 (elt s 1))
                     (e3 (elt s 2))
                     (e4 (elt s 3))
                     (val (logior (lsh e4 24) (lsh e3 16) (lsh e2 8) e1))) ;; assume little endian (intel, ARM)
                ;; (message (format "%x %x %x %x -> %d" e1 e2 e3 e4 val))
                (setq result (cons val result))
                (setq accum-count 0)
                (setq accumulator '()))))
        (forward-char)))
    (reverse result)))
(read-int-list2 "test2.bin") ;; (1867 1982)
I only did the one test, so this needs improvement. In words:
accumulate bytes (8-bit characters) from the special temp buffer (special because of the binary/literal load)
once the bytes-per-integer count has been reached, merge the accumulated bytes into an integer by bit-shifting them into place (be aware some machines are big-endian; I assume little-endian here)
push the merged value onto the result list
reset the accumulator
go to step 1
I have no doubt there are many improvements to be made; my lisp is rusty.
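Coming back to the original question (a list of 32-bit integers from a string rather than from a file), the same folding can be done on the result of encode-coding-string. A minimal sketch, with a made-up my/ name and the same little-endian assumption as above:
(defun my/string-to-int32-list (msg)
  "Return MSG's bytes as a list of little-endian 32-bit integers.
Assumes the encoded length is a multiple of 4."
  (let ((bytes (append (encode-coding-string msg 'binary) nil)) ; string -> list of bytes
        (result '()))
    (while bytes
      (push (logior (nth 0 bytes)
                    (lsh (nth 1 bytes) 8)
                    (lsh (nth 2 bytes) 16)
                    (lsh (nth 3 bytes) 24))
            result)
      (setq bytes (nthcdr 4 bytes)))
    (nreverse result)))
(my/string-to-int32-list "\x4b\x07\x00\x00") ;; => (1867)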

Haskell with emacs org-mode: Variable not in scope

After wandering off in frustration a while back, I've decided to try Haskell in Emacs org-mode again. I'm using Haskell stack ghci (8.6.3), Emacs 26.2, and org-mode 9.2.3, set up with intero. This code block
#+begin_src haskell :results raw :session *haskell*
pyth2 :: Int -> [(Int, Int, Int)]
pyth2 n =
  [ (x, y, z)
  | x <- [1 .. n]
  , y <- [x .. n]
  , z <- [y .. n]
  , x ^ 2 + y ^ 2 == z ^ 2
  ]
#+end_src
produces this under RESULTS:
*Main| *Main| *Main| *Main| *Main|
<interactive>:59:16: error: Variable not in scope: n
<interactive>:60:16: error: Variable not in scope: n
<interactive>:61:16: error: Variable not in scope: n
However, this
#+begin_src haskell :results raw
tripleMe x = x + x + x
#+end_src
works fine. I've added :set +m to both ghci.conf and the individual code block, to no effect. The same code works fine in a separate .hs file run in a separate REPL, and pyth2 defined in a separate file can also be called from the org-mode-started REPL and runs just fine. Not sure how to proceed. I can include my Emacs init info if necessary.
Over on the org-mode mailing list I got an answer that basically says the same thing as you, D. Gillis. He had a similar workaround that is actually more org-mode-centric. Under a heading where your code blocks will live, put this "drawer":
:PROPERTIES:
:header-args:haskell: :prologue ":{\n" :epilogue ":}\n"
:END:
and then (possibly in a local variable) run
#+begin_src haskell :results output
:set prompt-cont ""
#+end_src
For reasons unknown, I've had to include the :results output, otherwise a cryptic "expecting a string" error occurs.
On a few other notes: haskell babel doesn't respond to or care about the :session option, i.e., when you run a code block, a REPL *haskell* starts and that will be the sole REPL. Also, a haskell-mode-started REPL doesn't play well with an existing org-mode-initiated REPL: if you start a REPL from haskell-mode, it kills the original org-mode *haskell* REPL, and any new attempt to run org-mode code blocks can't see the new, non-*haskell* REPL. Then if you kill the haskell-mode REPL and try to run org-mode blocks, you get
executing Haskell code block...
inferior-haskell-start-process: List contains a loop: ("--no-build" "--no-load" "--ghci-options=-ferror-spans" "--no-build" "--no-load" . #2)
... and you're hosed -- nothing seems to shake it: not a restart/refresh, not killing and reloading the file; a complete restart of Emacs is necessary. Anyone knowing a better solution, please tell us.
This is a GHCi issue.
The same error occurs when your code is copied directly into GHCi, which also gives a parse error when it encounters the newline after the equals sign. That first error isn't showing up here because org-babel only shows the value of the last expression (in this case, the error caused by the list comprehension).
I'm not entirely familiar with how haskell-mode sends the code to GHCi, but it looks like it involves loading the buffer into GHCi as a file, which may be why you didn't have this problem working from the hs file.
There are a few options to fix this, none of which are completely ideal:
Move some portion of the list into the first line (e.g. the first line could be pyth2 n = [).
Wrap the entire function definition with :{ and :}.
Write an Elisp function that modifies what is sent to GHCi and then changes it back after it is evaluated.
The first two options require you to format your code in a form that GHCi will accept. In your example, the first option may not be too bad, but this won't always be so trivial for multi-line declarations (e.g. pattern-matching function definitions). The downside of the second option is that it requires adding brackets to code that shouldn't be there in real source files.
To fix the issue of extraneous brackets being added, I've written an Elisp command (my-org-babel-execute-haskell-blocks) that places these brackets around code blocks that it finds, evaluates the region, and then deletes the brackets. Note that this function requires that blocks be separated from all other code with at least one empty line.
Calling my-org-babel-execute-haskell-blocks on your example declares the function without any errors.
EDIT: The previous function I gave failed to work on pattern matching declarations. I've rewritten the function to fix this issue as well as to be comment aware. This new function should be significantly more useful. However, it's worth noting that I didn't handle multi-line comments in a sophisticated manner, so code blocks with multi-line comments may not be wrapped properly.
(defun my-org-babel-execute-haskell-blocks ()
  "Wraps :{ and :} around all multi-line blocks and then evaluates the source block.
Multi-line blocks are those where all non-indented, non-comment lines are declarations using the same token."
  (interactive)
  (save-excursion
    ;; jump to top of source block
    (my-org-jump-to-top-of-block)
    (forward-line)
    ;; get valid blocks
    (let ((valid-block-start-ends (seq-filter #'my-haskell-block-valid-p (my-get-babel-blocks))))
      (mapcar #'my-insert-haskell-braces valid-block-start-ends)
      (org-babel-execute-src-block)
      (mapcar #'my-delete-inserted-haskell-braces (reverse valid-block-start-ends)))))
(defun my-get-blocks-until (until-string)
  (let ((block-start nil)
        (block-list nil))
    (while (not (looking-at until-string))
      (if (looking-at "[[:space:]]*\n")
          (when (not (null block-start))
            (setq block-list (cons (cons block-start (- (point) 1))
                                   block-list)
                  block-start nil))
        (when (null block-start)
          (setq block-start (point))))
      (forward-line))
    (when (not (null block-start))
      (setq block-list (cons (cons block-start (- (point) 1))
                             block-list)))
    ;; return the collected blocks even when the last line before UNTIL-STRING was blank
    block-list))
(defun my-get-babel-blocks ()
  (my-get-blocks-until "#\\+end_src"))
(defun my-org-jump-to-top-of-block ()
  (forward-line)
  (org-previous-block 1))
(defun my-empty-line-p ()
  (beginning-of-line)
  (= (char-after) 10))
(defun my-haskell-type-declaration-line-p ()
  (beginning-of-line)
  (and (not (looking-at "--"))
       (looking-at "^.*::.*$")))
(defun my-insert-haskell-braces (block-start-end)
  (let ((block-start (car block-start-end))
        (block-end (cdr block-start-end)))
    (goto-char block-end)
    (insert "\n:}")
    (goto-char block-start)
    (insert ":{\n")))
(defun my-delete-inserted-haskell-braces (block-start-end)
  (let ((block-start (car block-start-end))
        (block-end (cdr block-start-end)))
    (goto-char block-start)
    (delete-char 3)
    (goto-char block-end)
    (delete-char 3)))
(defun my-get-first-haskell-token ()
  "Gets all consecutive non-whitespace text until the first whitespace."
  (save-excursion
    (beginning-of-line)
    (let ((starting-point (point)))
      (re-search-forward ".*?[[:blank:]\n]")
      (goto-char (- (point) 1))
      (buffer-substring-no-properties starting-point (point)))))
(defun my-haskell-declaration-line-p ()
  (beginning-of-line)
  (or (looking-at "^.*=.*$") ;; has equals sign
      (looking-at "^.*\n[[:blank:]]*|")
      (looking-at "^.*where[[:blank:]]*$")))
(defun my-haskell-block-valid-p (block-start-end)
  (let ((block-start (car block-start-end))
        (block-end (cdr block-start-end))
        (line-count 0))
    (save-excursion
      (goto-char block-start)
      (let ((token 'nil)
            (is-valid t))
        ;; eat top comments
        (while (or (looking-at "--")
                   (looking-at "{-"))
          (forward-line))
        (when (my-haskell-type-declaration-line-p)
          (progn
            (setq token (my-get-first-haskell-token)
                  line-count 1)
            (forward-line)))
        (while (<= (point) block-end)
          (let ((current-token (my-get-first-haskell-token)))
            (cond ((string= current-token "") ; line with indentation
                   (when (null token) (setq is-valid nil))
                   (setq line-count (+ 1 line-count)))
                  ((or (string= (substring current-token 0 2) "--") ;; skip comments
                       (string= (substring current-token 0 2) "{-"))
                   '())
                  ((and (my-haskell-declaration-line-p)
                        (or (null token) (string= token current-token)))
                   (setq token current-token
                         line-count (+ 1 line-count)))
                  (t (setq is-valid nil)
                     (goto-char (+ 1 block-end))))
            (forward-line)))
        (and is-valid (> line-count 1))))))
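If the command earns its keep, it can be bound to a key in org buffers; the binding below is only an illustration, not part of the answer:
(with-eval-after-load 'org
  (define-key org-mode-map (kbd "C-c h") #'my-org-babel-execute-haskell-blocks))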

In a split-window Emacs, how do I search the other buffer's first line for ":" and shift the border of the original buffer to that point?

I use Mercurial blame via the monky.el package.
In a split window, when activating blame, the results come up in the other window with info about each changed line (author/changeset/date:).
I would like a command that searches the first line of the "result" buffer, finds where the ":" mark is, and shifts the border of the original buffer up to that point.
Basically, if the borders of both windows are:
| ...... | ...... |
Before executing the command:
|author 4543 11-27-2013: int x; | int x; |
After executing the command:
|author 4543 11-27-2013:| int x; |
The reason for this is that I would like to keep the syntax coloring of data types/functions/etc. while seeing who last changed each line of the source file.
In the blame result buffer, when the lines are prefixed with author, changeset and date, they lose their coloring.
So I want to use the per-line info from the blame buffer side by side with the original fontified file.
I also can't use a fixed window-border shift value, because the position of ":" changes with the length of the author name(s) in each file.
I have modified the last version at
"Mirroring location in file in two opened buffers side by side"
so that the following code makes sense. If you run the code below and then switch on sync-window-mode in the mercury-blame buffer, isearch will have the desired effect.
(defun mercury-blame-resize ()
  "Resize mercury blame window to blame string at point only."
  (interactive) ;; for debugging
  (window-resize (selected-window)
                 (- (save-excursion
                      (beginning-of-line)
                      (skip-chars-forward "^:\n"))
                    (window-width) -1)
                 'horizontal 'ignore-fixed-size))
(add-hook 'sync-window-master-hook 'mercury-blame-resize)
(add-hook 'sync-window-mode-hook '(lambda ()
                                    (setq-local isearch-update-post-hook
                                                #'(lambda () (set-window-hscroll (selected-window) 0)))))
Version for emacs 23:
(defvar mercury-blame-resize-min 5)
(defun mercury-blame-resize ()
  "Resize mercury blame window to blame string at point only."
  (interactive) ;; for debugging
  (save-excursion
    (beginning-of-line)
    (let ((n (skip-chars-forward "^:\n")))
      (when (looking-at ":")
        (condition-case err
            (enlarge-window (- (max mercury-blame-resize-min n)
                               (window-width) -1)
                            'horizontal)
          (error))))))
(add-hook 'sync-window-master-hook 'mercury-blame-resize)
(add-hook 'sync-window-mode-hook '(lambda ()
                                    (set (make-local-variable 'isearch-update-post-hook)
                                         #'(lambda () (set-window-hscroll (selected-window) 0)))))
Helper for testing (without mercury):
(require 'cl) ;; for `loop'
(loop for i from 1 upto 100 do
      (loop for j from 0 upto (random 20) do
            (insert (+ 32 (random 20))))
      (insert ":\n"))
EDIT: In the version for emacs 23: only resize the mercurial blame buffer when there is a ":" on the current line.

How to replace string in a file with lisp?

What's the lisp way of replacing a string in a file?
There is a file identified by *file-path*, a search string *search-term* and a replacement string *replace-term*.
How do I produce a file with all instances of *search-term* replaced with *replace-term*, preferably in place of the old file?
One more take at the problem, but a few warnings first:
To make this really robust and usable in real-life situations you would need to wrap it in handler-case and handle various errors: insufficient disc space, device not ready, insufficient permission for reading/writing, insufficient memory to allocate the buffer, and so on.
This does not do regular-expression replacement, only simple string replacement. Making a regular-expression-based replacement work on large files may be far less trivial than it looks at first; it would be worth writing a separate program for it, something like sed or awk, or even an entire language like Perl ;)
Unlike the other solutions, this one creates a temporary file next to the file being processed and saves the data processed so far into it. That may be worse in the sense that it uses more disc space, but it is safer: if the program fails in the middle, the original file remains intact. Moreover, with some more effort you could later resume the replacement from the temporary file if, for example, you also saved the offset into the original file there.
(defun file-replace-string (search-for replace-with file
                            &key (element-type 'base-char)
                                 (temp-suffix ".tmp"))
  (with-open-file (open-stream
                   file
                   :direction :input
                   :if-exists :supersede
                   :element-type element-type)
    (with-open-file (temp-stream
                     (concatenate 'string file temp-suffix)
                     :direction :output
                     :element-type element-type)
      (do ((buffer (make-string (length search-for)))
           (buffer-fill-pointer 0)
           (next-matching-char (aref search-for 0))
           (in-char (read-char open-stream nil :eof)
                    (read-char open-stream nil :eof)))
          ((eql in-char :eof)
           (when (/= 0 buffer-fill-pointer)
             (dotimes (i buffer-fill-pointer)
               (write-char (aref buffer i) temp-stream))))
        (if (char= in-char next-matching-char)
            (progn
              (setf (aref buffer buffer-fill-pointer) in-char
                    buffer-fill-pointer (1+ buffer-fill-pointer))
              (when (= buffer-fill-pointer (length search-for))
                (dotimes (i (length replace-with))
                  (write-char (aref replace-with i) temp-stream))
                (setf buffer-fill-pointer 0)))
            (progn
              (dotimes (i buffer-fill-pointer)
                (write-char (aref buffer i) temp-stream))
              (write-char in-char temp-stream)
              (setf buffer-fill-pointer 0)))
        (setf next-matching-char (aref search-for buffer-fill-pointer)))))
  (delete-file file)
  (rename-file (concatenate 'string file temp-suffix) file))
It can be accomplished in many ways, for example with regexes. The most self-contained way I see is something like the following:
(defun replace-in-file (search-term file-path replace-term)
  (let ((contents (rutil:read-file file-path)))
    (with-open-file (out file-path :direction :output :if-exists :supersede)
      (do* ((start 0 (+ pos (length search-term)))
            (pos (search search-term contents)
                 (search search-term contents :start2 start)))
           ((null pos) (write-string (subseq contents start) out))
        (format out "~A~A" (subseq contents start pos) replace-term))))
  (values))
See the implementation of rutil:read-file here: https://github.com/vseloved/rutils/blob/master/core/string.lisp#L33
Also note that this function will replace search terms with any characters, including newlines.
in chicken scheme with the irregex egg:
(use irregex) ; irregex, the regular expression library, is one of the
              ; libraries included with CHICKEN.
(define (process-line line re rplc)
  (irregex-replace/all re line rplc))
(define (quickrep re rplc)
  (let ((line (read-line)))
    (if (not (eof-object? line))
        (begin
          (display (process-line line re rplc))
          (newline)
          (quickrep re rplc)))))
(define (main args)
  (quickrep (irregex (car args)) (cadr args)))
Edit: in the above example, reading the input line by line doesn't permit the regexp to span multiple lines.
To counter that, here is an even simpler implementation which reads the whole file as one string:
(use irregex)
(use utils)
(define (process-line line re rplc)
  (irregex-replace/all re line rplc))
(define (quickrep re rplc file)
  (let ((line (read-all file)))
    (display (process-line line re rplc))))
(define (main args)
  (quickrep (irregex (car args)) (cadr args) (caddr args)))
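If the replacement is happening from within Emacs anyway, the same job can be done with a temp buffer in Emacs Lisp. A minimal sketch, assuming plain (non-regexp) replacement; my/replace-in-file is a made-up name:
(defun my/replace-in-file (file search-term replace-term)
  "Replace every literal occurrence of SEARCH-TERM in FILE with REPLACE-TERM."
  (with-temp-file file               ; writes the buffer back to FILE on exit
    (insert-file-contents file)
    (goto-char (point-min))
    (while (search-forward search-term nil t)
      (replace-match replace-term t t))))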

unibyte text buffers in emacs: encode in hexa?

I have a "text" file that has some invalid byte sequences. Emacs renders these as "\340\360"; is there a way to make the mighty text processor render them in hexadecimal instead, e.g. "\co0a"? Thanks.
EDIT: I will not mark my own answer as accepted, but I just wanted to say that it does work fine.
Found it, just in case someone needs it too... (from here)
(setq standard-display-table (make-display-table))
(let ((i ?\x80) hex hi low)
  (while (<= i ?\xff)
    (setq hex (format "%x" i))
    (setq hi (elt hex 0))
    (setq low (elt hex 1))
    (aset standard-display-table (unibyte-char-to-multibyte i)
          (vector (make-glyph-code ?\\ 'escape-glyph)
                  (make-glyph-code ?x 'escape-glyph)
                  (make-glyph-code hi 'escape-glyph)
                  (make-glyph-code low 'escape-glyph)))
    (setq i (+ i 1))))
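A hedged variant of the same idea (my own assumption, not from the linked source): build the table into the buffer-local buffer-display-table instead of mutating the global standard-display-table, so only the buffer with the raw bytes is affected:
(defun my/hex-display-raw-bytes ()
  "Display raw bytes \\200..\\377 as \\xNN, in the current buffer only."
  (interactive)
  (let ((table (make-display-table))
        (i ?\x80))
    (while (<= i ?\xff)
      (let ((hex (format "%02x" i)))
        (aset table (unibyte-char-to-multibyte i)
              (vector (make-glyph-code ?\\ 'escape-glyph)
                      (make-glyph-code ?x 'escape-glyph)
                      (make-glyph-code (elt hex 0) 'escape-glyph)
                      (make-glyph-code (elt hex 1) 'escape-glyph))))
      (setq i (1+ i)))
    (setq buffer-display-table table)))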