I am trying to detect file encodings using LispWorks.
LispWorks should be capable of this; see the "External Formats and File Streams" section of its documentation.
[Note: details based on comments by Rainer Joswig and Svante]
system:*file-encoding-detection-algorithm* is set to its default,
(setf system:*file-encoding-detection-algorithm*
      '(find-filename-pattern-encoding-match
        find-encoding-option
        detect-utf32-bom
        detect-unicode-bom
        detect-utf8-bom
        specific-valid-file-encoding
        locale-file-encoding))
And also,
;; Specify the correct characters
(lw:set-default-character-element-type 'cl:character)
Some files to verify against are available here:
UCS-2 and UTF-8
LATIN-1: windows-1252-2000.ucm
UNICODE and LATIN-1 are properly detected
;; UNICODE
;; http://www.humancomp.org/unichtm/tongtwst.htm
(with-open-file (ss "/tmp/tongtwst.htm")
(stream-external-format ss))
;; => (:UNICODE :LITTLE-ENDIAN T :EOL-STYLE :CRLF)
;; LATIN-1
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
(stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :LF)
Detecting UTF-8 does not work right away,
;; UTF-8 encoding
;; http://www.humancomp.org/unichtm/tongtwst8.htm
(with-open-file (ss "/tmp/tongtws8.htm")
(stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :CRLF)
Adding UTF-8 to *specific-valid-file-encodings* makes it work,
(pushnew :utf-8 system:*specific-valid-file-encodings*)
;; system:*specific-valid-file-encodings*
;; => (:UTF-8)
;; http://www.humancomp.org/unichtm/tongtwst8.htm
(with-open-file (ss "/tmp/tongtws8.htm")
(stream-external-format ss))
;; => (:UTF-8 :EOL-STYLE :CRLF)
But now the same LATIN-1 file as above is detected as UTF-8:
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
(stream-external-format ss))
;; => (:UTF-8 :EOL-STYLE :LF)
Pushing LATIN-1 onto *specific-valid-file-encodings* as well:
(pushnew :latin-1 system:*specific-valid-file-encodings*)
;; system:*specific-valid-file-encodings*
;; => (:LATIN-1 :UTF-8)
;; This one works again
(with-open-file (ss "/tmp/windows-1252-2000.ucm")
(stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :LF)
;; But this one, which was properly detected as `UTF-8`,
;; is now detected as `LATIN-1`, *which is wrong.*
(with-open-file (ss "/tmp/tongtws8.htm")
(stream-external-format ss))
;; => (:LATIN-1 :EOL-STYLE :CRLF)
What am I doing wrong?
How can I correctly detect file encodings using LispWorks?
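One workaround (a sketch on my part, not something taken from the LispWorks detection documentation) is to bypass detection entirely for files whose encoding you already know, by passing an explicit :external-format to with-open-file:

```lisp
;; Sketch: skip encoding detection by stating the encoding up front.
;; The filename is the UTF-8 test file from above.
(with-open-file (ss "/tmp/tongtws8.htm" :external-format :utf-8)
  (stream-external-format ss))
;; should report a :UTF-8 external format regardless of the
;; detection variables' settings
```

This does not answer the detection question itself, but it is a reliable fallback when the heuristics disagree.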
Related
I'm reading a file character by character and constructing a list of words, where each word is a list of its letters. I did that, but when it comes to testing, it prints NIL. Also, outside of the test function, when I print the list, it prints nicely. What is the problem here? Is there some other meaning of the LET keyword?
This is my read function:
(defun read-and-parse (filename)
  (with-open-file (s filename)
    (let (words)
      (let (letter)
        (loop for c = (read-char s nil)
              while c
              do (when (char/= c #\Space)
                   (if (char/= c #\Newline) (push c letter)))
              do (when (or (char= c #\Space) (char= c #\Newline))
                   (push (reverse letter) words)
                   (setf letter '())))
        (reverse words)))))
This is test function:
(defun test_on_test_data ()
  (let (doc (read-and-parse "document2.txt"))
    (print doc)))
This is input text:
hello
this is a test
You're not using let properly. The syntax is:
(let ((var1 val1)
      (var2 val2)
      ...)
  body)
If the initial value of the variable is NIL, you can abbreviate (varN nil) as just varN.
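For example, these two forms are equivalent and bind the same variables:

```lisp
;; Full form: x explicitly bound to NIL, y bound to 1.
(let ((x nil) (y 1)) (list x y)) ; => (NIL 1)

;; Abbreviated form: a bare symbol means "bind to NIL".
(let (x (y 1)) (list x y))       ; => (NIL 1)
```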
You wrote:
(let (doc
      (read-and-parse "document2.txt"))
  (print doc))
Based on the above, this is using the abbreviation, and it's equivalent to:
(let ((doc nil)
      (read-and-parse "document2.txt"))
  (print doc))
Now you can see that this binds doc to NIL, and binds the variable read-and-parse to "document2.txt". It never calls the function. The correct syntax is:
(let ((doc (read-and-parse "document2.txt")))
  (print doc))
Barmar's answer is the right one. For interest, here is a version of read-and-parse which makes possibly-more-idiomatic use of loop, and also abstracts out the 'is the character white' decision since this is something which is really not usefully possible in portable CL as the standard character repertoire is absurdly poor (there's no tab for instance!). I'm sure there is some library available via Quicklisp which deals with this better than the below.
I think this is fairly readable: there's an outer loop which collects words, and an inner loop which collects characters into a word, skipping over whitespace until it finds the next word. Both use loop's collect feature to collect lists forwards. On the other hand, I feel kind of bad every time I use loop (I know there are alternatives).
By default this collects the words as lists of characters: if you tell it to it will collect them as strings.
(defun char-white-p (c)
  ;; Is a character whitespace?  The fallback for this is horrid,
  ;; since tab &c are not standard characters.  There must be a
  ;; portability library with a function which does this.
  #+LispWorks (lw:whitespace-char-p c)
  #+CCL (ccl:whitespacep c)             ;?
  #-(or LispWorks CCL)
  (member c (load-time-value
             (mapcan (lambda (n)
                       (let ((ch (name-char n)))
                         (and ch (list ch))))
                     '("Space" "Newline" "Page" "Tab" "Return" "Linefeed"
                       ;; and I am not sure about the following, but, well
                       "Backspace" "Rubout")))))
(defun read-and-parse (filename &key (as-strings nil))
  "Parse a file into a list of words, splitting on whitespace.
By default the words are returned as lists of characters.  If
AS-STRINGS is true then they are coerced to strings."
  (with-open-file (s filename)
    (loop for maybe-word = (loop with collecting = nil
                                 for c = (read-char s nil)
                                 ;; carry on until we hit EOF, or we
                                 ;; hit whitespace while collecting a
                                 ;; word
                                 until (or (not c) ;EOF
                                           (and collecting (char-white-p c)))
                                 ;; if we're not collecting and we see
                                 ;; a non-white character, then we're
                                 ;; now collecting
                                 when (and (not collecting) (not (char-white-p c)))
                                   do (setf collecting t)
                                 when collecting
                                   collect c)
          while maybe-word
          collect (if as-strings
                      (coerce maybe-word 'string)
                      maybe-word))))
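A quick usage sketch (the filename and its contents are hypothetical, assumed here for illustration):

```lisp
;; Assuming /tmp/words.txt contains the text "hello world":
(read-and-parse "/tmp/words.txt")
;; => ((#\h #\e #\l #\l #\o) (#\w #\o #\r #\l #\d))

(read-and-parse "/tmp/words.txt" :as-strings t)
;; => ("hello" "world")
```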
I have code that runs with no error if executed from the SLIME prompt inside Emacs. If I start SBCL from the shell prompt, I get this error:
* (ei:proc-file "BRAvESP000.log" "lixo")
debugger invoked on a SB-INT:STREAM-ENCODING-ERROR:
:UTF-8 stream encoding error on
#<SB-SYS:FD-STREAM for "file /Users/arademaker/work/IBM/scolapp/lixo"
{10049E8FF3}>:
the character with code 55357 cannot be encoded.
Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.
restarts (invokable by number or by possibly-abbreviated name):
0: [OUTPUT-NOTHING ] Skip output of this character.
1: [OUTPUT-REPLACEMENT] Output replacement string.
2: [ABORT ] Exit debugger, returning to top level.
(SB-IMPL::STREAM-ENCODING-ERROR-AND-HANDLE #<SB-SYS:FD-STREAM for "file /Users/arademaker/work/IBM/scolapp/lixo" {10049E8FF3}> 55357)
0]
The mystery is that in both cases I am using the same SBCL 1.1.8 on the same machine, Mac OS 10.8.4. Any idea?
The code:
(defun proc-file (filein fileout &key (fn-convert #'identity))
  (with-open-file (fout fileout
                        :direction :output
                        :if-exists :supersede
                        :external-format :utf8)
    (with-open-file (fin filein :external-format :utf8)
      (loop for line = (read-line fin nil)
            while line
            do (handler-case
                   (let* ((line (ppcre:regex-replace "^.*{jsonTweet=" line "{\"jsonTweet\":"))
                          (data (gethash "jsonTweet" (yason:parse line))))
                     (yason:encode (funcall fn-convert (yason:parse data)) fout)
                     (format fout "~%"))
                 (end-of-file ()
                   (format *standard-output* "Error[~a]: ~a~%" filein line)))))))
This is almost certainly a bug in yason. JSON requires that if a non-BMP character is escaped, it is done through a surrogate pair. Here's a simple example with U+10000 (which is optionally escaped in JSON as "\ud800\udc00"); I use babel because babel's conversion is less strict:
(map 'list #'char-code (yason:parse "\"\\ud800\\udc00\""))
=> (55296 56320)
Unicode code point 55296 (decimal) is the start of a surrogate pair, and should not appear except as part of a surrogate pair in UTF-16. Fortunately it can easily be worked around by using babel to encode the string to UTF-16 and back again:
(babel:octets-to-string (babel:string-to-octets (yason:parse "\"\\ud800\\udc00\"") :encoding :utf-16le) :encoding :utf-16le)
=> "𐀀"
You should be able to work around this by changing this line:
(yason:encode (funcall fn-convert (yason:parse data)) fout)
To use an intermediate string, which you convert to UTF-16 and back:
(write-sequence
 (babel:octets-to-string
  (babel:string-to-octets
   (with-output-to-string (outs)
     (yason:encode (funcall fn-convert (yason:parse data)) outs))
   :encoding :utf-16le)
  :encoding :utf-16le)
 fout)
I submitted a patch that has been accepted to fix this in yason:
https://github.com/hanshuebner/yason/commit/4a9bdaae652b7ceea79984e0349a992a5458a0dc
Given a corrupted file with mixed encoding (e.g. utf-8 and latin-1), how do I configure Emacs to "project" all its symbols to a single encoding (e.g. utf-8) when saving the file?
I wrote the following function to automate some of the cleaning, but I would guess the information needed to map a mojibake sequence such as "Ã©" in one encoding to "é" in UTF-8 is available somewhere, which would improve this function (or that somebody has already written such a function).
(defun jyby/cleanToUTF ()
  "Cleaning to UTF"
  (interactive)
  (save-excursion (replace-regexp "अ" ""))
  (save-excursion (replace-regexp "आ" ""))
  (save-excursion (replace-regexp "ॆ" "")))
(global-unset-key [f11])
(global-set-key [f11] 'jyby/cleanToUTF)
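A possible extension (an assumption on my part: that the corruption is UTF-8 bytes re-read as Latin-1, the classic "Ã©"-for-"é" mojibake) is to undo one layer of the double encoding for a region, rather than replacing each symbol by hand:

```elisp
;; Sketch, assuming the corruption is UTF-8 text that was decoded as
;; Latin-1.  Re-encoding the region as Latin-1 recovers the raw
;; bytes, and decoding those bytes as UTF-8 restores the characters.
(defun jyby/undo-latin1-mojibake (beg end)
  "Decode the region as if its UTF-8 bytes had been read as Latin-1."
  (interactive "r")
  (let ((text (buffer-substring-no-properties beg end)))
    (delete-region beg end)
    (insert (decode-coding-string
             (encode-coding-string text 'latin-1)
             'utf-8))))
```

Running it repeatedly undoes one layer of doubling per invocation, which matches the "doubles in size at each save" symptom described above.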
I have many files "corrupted" with mixed encoding (due to copy-pasting from a browser with an ill font configuration), generating the error below. I sometimes clean them by hand, searching for each problematic symbol and replacing it with either "" or the appropriate character, or, more quickly, by specifying "utf-8-unix" as the encoding (which will prompt the same message the next time I edit and save the file). It has become an issue: in any such corrupted file, every accented character is replaced by a sequence which doubles in size at each save, ending up doubling the size of the file. I am using GNU Emacs 24.2.1.
These default coding systems were tried to encode text
in the buffer `test_accents.org':
(utf-8-unix (30 . 4194182) (33 . 4194182) (34 . 4194182) (37
. 4194182) (40 . 4194181) (41 . 4194182) (42 . 4194182) (45
. 4194182) (48 . 4194182) (49 . 4194182) (52 . 4194182))
However, each of them encountered characters it couldn't encode:
utf-8-unix cannot encode these: ...
Click on a character (or switch to this window by `C-x o'
and select the characters by RET) to jump to the place it appears,
where `C-u C-x =' will give information about it.
Select one of the safe coding systems listed below,
or cancel the writing with C-g and edit the buffer
to remove or modify the problematic characters,
or specify any other coding system (and risk losing
the problematic characters).
raw-text emacs-mule no-conversion
I have struggled with this in Emacs many times. When I have a file that was messed up, e.g. in raw-text-unix mode, and save it as utf-8, Emacs complains even about text that is already clean UTF-8. I haven't found a way to get it to complain only about non-UTF-8 text.
I just found a reasonable semi-automated approach using recode:
f=mixed-file
recode -f ..utf-8 $f > /tmp/recode.out
diff $f /tmp/recode.out | cat -vt
# manually fix lines of text that can't be converted to utf-8 in $f,
# and re-run recode and diff until the diff output is empty.
One helpful tool along the way is http://www.ltg.ed.ac.uk/~richard/utf-8.cgi?input=342+200+224&mode=obytes
Then I just re-open the file in emacs, and it is recognized as clean unicode.
Here's something to maybe get you started:
(require 'cl)    ; for loop, incf, decf
(require 'eieio) ; for defclass/defmethod

(put 'eof-error 'error-conditions '(error eof-error))
(put 'eof-error 'error-message "End of stream")
(put 'bad-byte 'error-conditions '(error bad-byte))
(put 'bad-byte 'error-message "Not a UTF-8 byte")

(defclass stream ()
  ((bytes :initarg :bytes :accessor bytes-of)
   (position :initform 0 :accessor position-of)))

(defun logbitp (byte bit) (not (zerop (logand byte (ash 1 bit)))))

(defmethod read-byte ((this stream) &optional eof-error eof)
  (with-slots (bytes position) this
    (if (< position (length bytes))
        (prog1 (aref bytes position) (incf position))
      (if eof-error (signal eof-error (list position)) eof))))

(defmethod unread-byte ((this stream))
  (when (> (position-of this) 0) (decf (position-of this))))
(defun read-utf8-char (stream)
  (let ((byte (read-byte stream 'eof-error)))
    (if (not (logbitp byte 7)) byte
      (let ((numbytes
             (cond
              ((not (logbitp byte 5))
               (setf byte (logand #2r11111 byte)) 1)
              ((not (logbitp byte 4))
               (setf byte (logand #2r1111 byte)) 2)
              ((not (logbitp byte 3))
               (setf byte (logand #2r111 byte)) 3))))
        ;; an invalid lead byte gives NUMBYTES = nil
        (unless numbytes
          (signal 'bad-byte (list byte)))
        (dotimes (b numbytes byte)
          (let ((next-byte (read-byte stream 'eof-error)))
            (if (and (logbitp next-byte 7) (not (logbitp next-byte 6)))
                (setf byte (logior (ash byte 6) (logand next-byte #2r111111)))
              (signal 'bad-byte (list next-byte)))))))))
(defun load-corrupt-file (file)
  (interactive "fFile to load: ")
  (with-temp-buffer
    (set-buffer-multibyte nil)
    (insert-file-literally file)
    (with-output-to-string
      (set-buffer-multibyte t)
      (loop with stream = (make-instance 'stream :bytes (buffer-string))
            for next-char =
            (condition-case err
                (read-utf8-char stream)
              (bad-byte (message "Fix this byte %d" (cadr err)))
              (eof-error nil))
            while next-char
            do (write-char next-char)))))
What this code does: it loads a file with no conversion and tries to read it as if it were encoded in UTF-8. Once it encounters a byte that doesn't look like it belongs to UTF-8, it errors, and you need to handle that somehow (that's where the "Fix this byte" message comes in). But you would need to be inventive about how you fix it...
There are tags like #+AUTHOR or #+LATEX in org-mode - are they called tags? I'd like to define my own tag which calls a function to preprocess the data and then output it - if the export target is LaTeX.
My solution was defining my own language, qtree, for SRC blocks.
#+BEGIN_SRC qtree
[.CP [.TP [.NP [] [.N' [.N Syntax] []]] [.VP [] [.V' [.V sucks] []]]]]
#+END_SRC
And process it accordingly. I even added a qtree-mode with paredit, and a landscape parameter for when the trees grow big. https://github.com/Tass/emacs-starter-kit/blob/master/vendor/assorted/org-babel-qtree.el
(require 'org)

(defun org-babel-execute:qtree (body params)
  "Reformat a block of lisp-edited tree to one tikz-qtree likes."
  (let ((tree
         (concat "\\begin{tikzpicture}
\\tikzset{every tree node/.style={align=center, anchor=north}}
\\Tree "
                 ;; Prefix words with \\, see
                 ;; http://tex.stackexchange.com/questions/75217
                 (replace-regexp-in-string
                  " \\_<\\w+\\_>" (lambda (x) (concat "\\\\\\\\" (substring x 1)))
                  ;; qtree needs a space before every closing bracket.
                  (replace-regexp-in-string
                   (regexp-quote "]") " ]"
                   ;; Empty leaf nodes, see
                   ;; http://tex.stackexchange.com/questions/75915
                   (replace-regexp-in-string
                    (regexp-quote "[]") "[.{}]" body)))
                 "\n\\end{tikzpicture}")))
    (if (assoc :landscape params)
        (concat "\\begin{landscape}\n" tree "\n\\end{landscape}")
      tree)))

(setq org-babel-default-header-args:qtree
      '((:results . "latex") (:exports . "results")))
(add-to-list 'org-src-lang-modes '("qtree" . qtree))
(define-generic-mode
  'qtree-mode                     ;; name of the mode to create
  '("%")                          ;; comments start with '%'
  '()                             ;; no keywords
  '(("[." . 'font-lock-operator)  ;; some operators
    ("]" . 'font-lock-operator))
  '()                             ;; files for which to activate this mode
  '(paredit-mode)                 ;; other functions to call
  "A mode for qtree edits")       ;; doc string for this mode
They seem to be just called keywords for in-buffer settings. Whatever they're called, they don't seem to be user-definable.
What you want to do is closely related to a common way of choosing whether to export with xelatex or pdflatex, as described on Worg.
The relevant part would be :
;; Originally taken from Bruno Tavernier: http://thread.gmane.org/gmane.emacs.orgmode/31150/focus=31432
(defun my-auto-tex-cmd ()
  (if (string-match "YOUR_TAG: value1" (buffer-string))
      (do-something))
  (if (string-match "YOUR_TAG: value2" (buffer-string))
      (do-something-else)))

(add-hook 'org-export-latex-after-initial-vars-hook 'my-auto-tex-cmd)
I have a "text" file that has some invalid byte sequences. Emacs renders these as "\340\360"; is there a way to make the mighty text processor render those in hexadecimal instead, e.g. "\co0a"? Thanks.
EDIT: I will not mark my own answer as accepted, but just wanted to say that it does work fine.
Found it, just in case someone needs it too... (from here)
(setq standard-display-table (make-display-table))
(let ((i ?\x80) hex hi low)
  (while (<= i ?\xff)
    (setq hex (format "%x" i))
    (setq hi (elt hex 0))
    (setq low (elt hex 1))
    (aset standard-display-table (unibyte-char-to-multibyte i)
          (vector (make-glyph-code ?\\ 'escape-glyph)
                  (make-glyph-code ?x 'escape-glyph)
                  (make-glyph-code hi 'escape-glyph)
                  (make-glyph-code low 'escape-glyph)))
    (setq i (+ i 1))))