Common Lisp: getting the Unicode name of a character

In CL, can I get the Unicode name of a character into a string? Is there a
function that, receiving #\α as an argument, would return "GREEK SMALL LETTER ALPHA"?

Using the cl-unicode library:
CL-USER> (cl-unicode:unicode-name #\α)
"GREEK SMALL LETTER ALPHA"
CL-USER> (cl-unicode:unicode-name 945)
"GREEK SMALL LETTER ALPHA"

The result of CHAR-NAME is not standardized, but often you'll get:
? (char-name #\α)
"Greek_Small_Letter_Alpha"
In LispWorks:
CL-USER 40 > (char-name #\α)
"U+03B1"
CL-USER 41 > (system::lookup-unicode-character-name #\α)
"GREEK SMALL LETTER ALPHA"

Related

How can I use Lisp subseq using colon (or other non-alphanumeric characters)?

I need to extract a substring from a string; the substring is enclosed by ":" and ";". E.g.
:substring;
But with Lisp (SBCL), I'm having trouble extracting the substring. When I run:
(subseq "8.I:123;" : ;)
I get:
#<THREAD "main thread" RUNNING {1000510083}>:
illegal terminating character after a colon: #\
Stream: #<SYNONYM-STREAM :SYMBOL SB-SYS:*STDIN* {1000025923}>
Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.
restarts (invokable by number or by possibly-abbreviated name):
0: [ABORT] Exit debugger, returning to top level.
(SB-IMPL::READ-TOKEN #<SYNONYM-STREAM :SYMBOL SB-SYS:*STDIN* {1000025923}> #\:)
I've tried preceding the colon and semicolon with \ but that throws a different error. Can anyone advise? Thanks in advance for the help!

As you can see in the docs for subseq, start and end are bounding index designators: start must be an integer, and end can be either an integer or nil.
#\: and #\; are characters, so you can't use them, but you can use the function position to find the first index of each character and use these indices as arguments for subseq. You have to check that both indices exist and the second one is bigger than the first one:
(let* ((string "8.I:123;")
       (pos1 (position #\: string))
       (pos2 (position #\; string)))
  (when (and pos1 pos2 (> pos2 pos1))
    (subseq string
            (1+ pos1)
            pos2)))
=> "123"
This is a little bit cumbersome, so I suggest using a regex library. The following example was created with CL-PPCRE:
(load "~/quicklisp/setup.lisp")
(ql:quickload :cl-ppcre)
> (cl-ppcre:all-matches-as-strings "(?<=:)([^;]*)(?=;)" "8.I:123;:aa;")
("123" "aa")
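If you would rather stay with standard functions, the position-based approach can be wrapped in a small helper. This is only a sketch: the name substring-between is made up for illustration, and it assumes you want the text between the first opening character and the next closing character after it. Passing :start to the second position call means a closing character that appears before the opening one is ignored rather than causing a NIL result.

```lisp
;; SUBSTRING-BETWEEN is a hypothetical helper name, not a standard function.
(defun substring-between (string open close)
  "Return the text between the first OPEN character in STRING and the
next CLOSE character after it, or NIL if there is no such pair."
  (let* ((start (position open string))
         ;; Look for CLOSE only after OPEN, so an earlier CLOSE is skipped.
         (end (and start (position close string :start (1+ start)))))
    (when end
      (subseq string (1+ start) end))))

(substring-between "8.I:123;" #\: #\;)      ; => "123"
(substring-between ";x:123;" #\: #\;)       ; => "123"
(substring-between "no delimiters" #\: #\;) ; => NIL
```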

How to insert unusual characters in emacs?

In VIM I can insert unusual characters by using digraphs:
<C-K>{char1}{char2}
for example the ¿ character is represented by the ?I digraph.
<C-K>?I
I can also define a custom list of digraphs in a separate file. For now, I'm just going to post the content of that file:
digraph uh 601 " ə UNSTRESSED SCHWA VOWEL
digraph uH 652 " ʌ STRESSED SCHWA VOWEL
digraph ii 618 " ɪ NEAR-CLOSE NEAR-FRONT UNROUNDED VOWEL
digraph uu 650 " ʊ NEAR-CLOSE NEAR-BACK ROUNDED VOWEL
digraph ee 603 " ɛ OPEN-MID FRONT UNROUNDED VOWEL
digraph er 604 " ɜ OPEN-MID CENTRAL UNROUNDED VOWEL
digraph oh 596 " ɔ OPEN-MID BACK ROUNDED VOWEL
digraph ae 230 " æ NEAR-OPEN FRONT UNROUNDED VOWEL
digraph ah 593 " ɑ OPEN BACK UNROUNDED VOWEL
digraph th 952 " θ VOICELESS DENTAL FRICATIVE
digraph tH 240 " ð VOICED DENTAL FRICATIVE
digraph sh 643 " ʃ VOICELESS POSTALVEOLAR FRICATIVE
digraph zs 658 " ʒ VOICED POSTALVEOLAR FRICATIVE
digraph ts 679 " ʧ VOICELESS POSTALVEOLAR AFFRICATE
digraph dz 676 " ʤ VOICED POSTALVEOLAR AFFRICATE
digraph ng 331 " ŋ VOICED VELAR NASAL
digraph as 688 " ʰ ASPIRATED
digraph ps 712 " ˈ PRIMARY STRESS
digraph ss 716 " ˌ SECONDARY STRESS
digraph st 794 " ̚ NO AUDIBLE RELEASE
digraph li 8255 " ‿ LINKING
They are symbols of the phonetic alphabet I frequently use in documents.
The question is: Is there a way to port the same symbols to emacs so I can use them possibly with the same letter combination "uh, uH, ii, uu" and so on?

First of all, Emacs comes with three "input methods" that let you type IPA characters, ipa-kirshenbaum, ipa-praat and ipa-x-sampa. You can see the description of them by typing C-h I (for describe-input-method), and you can switch to one of them with C-u C-\ (for toggle-input-method with a prefix argument).
If you'd rather use your own combinations, you can define your own input method:
(quail-define-package
 "my-ipa-symbols" "" "IPA" t
 "My IPA input method
Documentation goes here."
 nil t nil nil nil nil nil nil nil nil t)
(quail-define-rules
 ("uh" ?ə) ; UNSTRESSED SCHWA VOWEL
 ("uH" ?ʌ) ; STRESSED SCHWA VOWEL
 ;; add more combinations here
 )
Evaluate that with eval-buffer or eval-region, and then switch to the newly created input method with C-u C-\ my-ipa-symbols.

M-x insert-char will let you interactively search for a character to insert. Searching for 'schwa' brings up a set of different schwas to choose from.
For characters that I insert often, I've added keybindings like this:
(global-set-key (kbd "C-<down>") (lambda () (interactive) (insert "↓")))
where I just copied and pasted the character I want into that string. According to the docs, you should also be able to create a keybinding that calls insert-char with the name or the hex code of the character you want: https://www.gnu.org/software/emacs/manual/html_node/emacs/Inserting-Text.html

A nicer alternative to M-x insert-char is to use helm-ucs (or alternatively helm-unicode). This brings up a list of Unicode characters in a helm interface. You can enter words of the name in any order (e.g. "alpha small greek") to choose from characters matching those strings.
Note: helm-ucs takes a few seconds to load the first time it's used in a session, but helm-unicode doesn't suffer from this problem.

Convert binary string to number

Pretty straightforward, but I can't seem to find an answer. I have a string of 1s and 0s such as "01001010" - how would I parse that into a number?

Use string-to-number, which optionally accepts the base:
(string-to-number "01001010" 2)
;; 74

As explained by @sds in a comment, string-to-number returns 0 if the conversion fails. This is unfortunate, since a return value of 0 could also mean that the parsing succeeded.
I'd rather use the Common Lisp version of this function, cl-parse-integer. The standard function is described in the Hyperspec, whereas the one in Emacs Lisp is slightly different (in particular, there is no secondary return value):
(cl-parse-integer STRING &key START END RADIX JUNK-ALLOWED)
Parse integer from the substring of STRING from START to END. STRING
may be surrounded by whitespace chars (chars with syntax ‘ ’). Other
non-digit chars are considered junk. RADIX is an integer between 2 and
36, the default is 10. Signal an error if the substring between START
and END cannot be parsed as an integer unless JUNK-ALLOWED is non-nil.
(cl-parse-integer "001010" :radix 2)
=> 10
(cl-parse-integer "0" :radix 2)
=> 0
;; exception on parse error
(cl-parse-integer "no" :radix 2)
=> Debugger: (error "Not an integer string: ‘no’")
;; no exception, but nil in case of errors
(cl-parse-integer "no" :radix 2 :junk-allowed t)
=> nil
;; no exception, parse as much as possible
(cl-parse-integer "010no" :radix 2 :junk-allowed t)
=> 2

This thread has an elisp tag. Because it also has a lisp tag, I would like to show standard Common Lisp versions of two solutions. I checked these on LispWorks only. If my solutions are not standard Common Lisp, maybe someone will correct and improve my solutions.
For solutions
(string-to-number "01001010" 2)
and
(cl-parse-integer "001010" :radix 2)
LispWorks does not have string-to-number and does not have cl-parse-integer.
In LispWorks, you can use:
(parse-integer "01001010" :radix 2)
For the solution
(read (concat "#2r" STRING))
LispWorks does not have concat. You can use concatenate instead. read won't work on strings in LispWorks. You have to give read a stream.
In LispWorks, you can do this:
(read (make-string-input-stream (concatenate 'string "#2r" "01001010")))
You can also use format like this:
(read (make-string-input-stream (format nil "#2r~a" "01001010")))
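As a sketch in standard Common Lisp (not specific to LispWorks), the two ideas above can be written a little more compactly. The helper name parse-binary is made up for illustration. parse-integer signals a parse-error on invalid input when :junk-allowed is not given, so the helper turns that into NIL, and read-from-string reads directly from a string without an explicit stream. Binding *read-eval* to nil is a precaution I am adding, assuming the string may come from an untrusted source.

```lisp
;; PARSE-BINARY is a hypothetical helper name used only for illustration.
(defun parse-binary (string)
  "Return the integer written in base 2 in STRING, or NIL if STRING
cannot be parsed as a binary integer."
  (handler-case (parse-integer string :radix 2)
    ;; PARSE-INTEGER signals PARSE-ERROR when :JUNK-ALLOWED is false.
    (parse-error () nil)))

(parse-binary "01001010") ; => 74
(parse-binary "0")        ; => 0
(parse-binary "010no")    ; => NIL

;; The reader-based variant, reading straight from a string.
;; *READ-EVAL* is bound to NIL so that #. forms are not evaluated.
(let ((*read-eval* nil))
  (read-from-string (concatenate 'string "#2r" "01001010"))) ; => 74
```

Unlike the Emacs Lisp string-to-number, this keeps a successful parse of "0" distinguishable from a failure.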

This seems hacky by comparison, but FWIW you could also do this:
(read (concat "#2r" STRING))
i.e. read a single expression from STRING as a binary number.
This method will signal an error if the expression isn't valid.

Converting Integers to Characters in Common Lisp

Is there a way to parse integers to their char equivalents in Common Lisp?
I've been looking all morning, only finding char-int...
* (char-int #\A)
65
Some other sources also claim the existence of int-char
* (int-char 65)
; in: INT-CHAR 65
; (INT-CHAR 65)
;
; caught STYLE-WARNING:
; undefined function: INT-CHAR
;
; compilation unit finished
; Undefined function:
; INT-CHAR
; caught 1 STYLE-WARNING condition
debugger invoked on a UNDEFINED-FUNCTION:
The function COMMON-LISP-USER::INT-CHAR is undefined.
Type HELP for debugger help, or (SB-EXT:EXIT) to exit from SBCL.
restarts (invokable by number or by possibly-abbreviated name):
0: [ABORT] Exit debugger, returning to top level.
("undefined function")
What I'm really looking for, however, is a way of converting 1 to #\1
How exactly would I do that?

To convert between characters and their numeric encodings, there are char-code and code-char:
* (char-code #\A)
65
* (code-char 65)
#\A
However, to convert a digit to the corresponding character, there is digit-char:
* (digit-char 1)
#\1
* (digit-char 13 16) ; radix 16
#\D
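A few more cases may help, shown here as a small sketch. digit-char returns NIL when the weight is not a single digit in the given radix, and digit-char-p goes the other way, from character to weight. The helper integer-digit-chars is a made-up name for illustration; it assumes a non-negative integer and simply prints the number in base 10 and splits the resulting string into characters.

```lisp
(digit-char 7)        ; => #\7
(digit-char 10)       ; => NIL, since 10 is not a single digit in radix 10
(digit-char-p #\7)    ; => 7
(digit-char-p #\D 16) ; => 13

;; INTEGER-DIGIT-CHARS is a hypothetical helper name.
(defun integer-digit-chars (n)
  "Return the decimal digits of the non-negative integer N as a list
of characters."
  ;; Print N in base 10 without a radix prefix, then split the string.
  (coerce (write-to-string n :base 10 :radix nil) 'list))

(integer-digit-chars 1905) ; => (#\1 #\9 #\0 #\5)
```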

There's already an accepted answer, but it can be just as helpful to learn how to find the answer as getting the specific answer. One way of finding the function you needed would have been to do an apropos search for "CHAR". E.g., in CLISP, you'd get:
> (apropos "CHAR" "CL")
...
CHAR-CODE function
...
CODE-CHAR function
...
Another useful resource is the HyperSpec. There's a permuted index, and searching for "char" in the "C" page will be useful. Alternatively, in the HyperSpec, chapter 13. Characters is relevant, and 13.2 The Characters Dictionary would be useful.
Both of these approaches would also find the digit-char function mentioned in the other answer.

Emacs byte-to-position function is not consistent with document?

Emacs 24.3.1, Windows 2003
I found that the 'byte-to-position' function behaves a little strangely.
According to the documentation:
-- Function: byte-to-position byte-position
Return the buffer position, in character units, corresponding to
given BYTE-POSITION in the current buffer. If BYTE-POSITION is
out of range, the value is `nil'. **In a multibyte buffer, an
arbitrary value of BYTE-POSITION can be not at character boundary,
but inside a multibyte sequence representing a single character;
in this case, this function returns the buffer position of the
character whose multibyte sequence includes BYTE-POSITION.** In
other words, the value does not change for all byte positions that
belong to the same character.
We can make a simple experiment:
Create a buffer, eval this expression: (insert "a" (- (max-char) 128) "b")
Since the maximum number of bytes per character in Emacs' internal coding system is 5, the character between 'a' and 'b' takes 5 bytes. (Note that the last 128 characters are used for 8-bit raw bytes; each of them takes only 2 bytes.)
Then define and eval this test function:
(defun test ()
  (interactive)
  (let ((max-bytes (1- (position-bytes (point-max)))))
    (message "%s"
             (loop for i from 1 to max-bytes collect (byte-to-position i)))))
What I get is "(1 2 3 2 2 2 3)".
Each number in the list is a character position in the buffer. Because there is a 5-byte character, there should be five '2's between '1' and '3', but how can the stray '3' among the '2's be explained?

This was a bug. I no longer see this behavior in 26.x. You can read more about it here (which actually references this SO question).
https://debbugs.gnu.org/cgi/bugreport.cgi?bug=20783
