On Linux, with a UTF-8-enabled console:
Clojure 1.6.0
user=> (def c \の)
#'user/c
user=> (str c)
"の"
user=> (def c \🍒)
RuntimeException Unsupported character: \🍒 clojure.lang.Util.runtimeException (Util.java:221)
RuntimeException Unmatched delimiter: ) clojure.lang.Util.runtimeException (Util.java:221)
I was hoping to have an emoji-rich Clojure application with little effort, but it appears I will be looking up and typing in emoji codes? Or am I missing something obvious here? 😞
Java represents Unicode characters in UTF-16. The emoji characters are "supplementary characters" and have a codepoint that cannot be represented in 16 bits.
http://www.oracle.com/technetwork/articles/javase/supplementary-142654.html
In essence, supplementary characters are represented not as chars but as ints, and there are special APIs for dealing with them.
One way is with (Character/toChars 128516) - this returns a char array that you can convert to a string to print: (apply str (Character/toChars 128516)). Or you can create a String from an array of codepoint ints directly with (String. (int-array [128516]) 0 1). Depending on all the various things between Java/Clojure and your eyeballs, that may or may not do what you want.
The format API supports supplementary characters, so that may be easiest; however, %c takes an int, so you'll need a cast: (format "Smile! %c" (int 128516)).
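Putting those together at the REPL (assuming a UTF-8 terminal; 128516 is U+1F604, the smiling-face emoji):
user=> (apply str (Character/toChars 128516))
"😄"
user=> (String. (int-array [128516]) 0 1)
"😄"
user=> (format "Smile! %c" (int 128516))
"Smile! 😄"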
Thanks to Clojure’s extensible reader tags, you can create Unicode literals quite easily yourself.
We already know that not all of Unicode can be represented as char literals; that the preferred representation of Unicode characters on the JVM is int; and that a string literal can hold any Unicode character in a way that’s also convenient for humans to read.
So, a tagged literal #u "🍒" that reads as an int would make an excellent Unicode character literal!
Set up a reader function for the new tagged literal in *data-readers*:
;; Read a single-code-point string as its code point (an int).
(defn read-codepoint
  [^String s]
  {:pre [(= 1 (.codePointCount s 0 (.length s)))]}
  (.codePointAt s 0))

(set! *data-readers* (assoc *data-readers* 'u #'read-codepoint))
With that in place, the reader reads such literals as code point integers:
#u"🍒" ; => 127826
(Character/getName #u"🍒") ; => "CHERRIES"
‘Reader tags without namespace qualifiers are reserved for Clojure’, says the documentation … #u is short but perhaps not the most responsible choice.
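If you want to play it safe, the same reader function works under a namespace-qualified tag; for instance, with a hypothetical my qualifier:
(set! *data-readers* (assoc *data-readers* 'my/u #'read-codepoint))
#my/u "🍒" ; => 127826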
Related
Is there any syntax for Unicode characters in Common Lisp, like \u03B1 in Java? Maybe something like #\U+03B1, or something similar?
The uber-portable way is #.(code-char X), which will produce the Unicode char with the given numeric code X (provided that the implementation actually uses Unicode - which the ANSI standard does not require - and, indeed, all implementations that go beyond ASCII - which is not mandated either! - do use Unicode).
If you know the Unicode name of the character, you can also use the #\ syntax:
(char= (code-char 12345) #\HANGZHOU_NUMERAL_TWENTY)
T
Implementations often define additional Unicode character syntax, e.g.:
#\Code<decimal> in CLISP.
#\U+<hex> in SBCL.
See:
code-char
*read-eval*
Sharpsign Dot
I'm very new to elisp and have just started learning it. I have seen the following expressions in the documentation:
(1+ (buffer-size))
(+ 1 (buffer-size))
What do they mean? As far as I know, elisp uses prefix notation, so the second one should be the correct one. But both of them can be executed without any errors. The first one is from the documentation of the point-max function.
Thanks.
The token 1+ is an identifier which denotes a symbol. This symbol has a binding as a function, and so (1+ arg) means "call the 1+ function, with the value of arg as its argument". The 1+ function returns 1 plus the value of its argument.
The syntax (+ 1 arg) is a different way to achieve that effect. Here the function is named by the symbol +. The + function receives two arguments which it adds together.
In many mainstream programming languages popular today, the tokenization rules are such that there is no difference between 1+ and 1 +: both denote a numeric constant followed by a + token. Lisp tokenization is different: languages in the Lisp family usually support tokens that can contain digits and non-alphanumeric characters.

I'm looking at the Emacs Lisp reference manual and do not see a section about the logic which the read function uses to convert printed representations to objects. Typically, "Lispy" tokenizing behavior is something like this: a token is scanned first, without regard for what kind of token it is, by accumulating characters which are valid token constituents and stopping at a character which is not a token constituent. For instance, when the input is abcde(f, the token that will be extracted is abcde. The ( character terminates the token (and stays in the input stream).

Then the resulting clump of characters abcde is re-examined and classified, and converted to an object based on what it looks like, according to the rules of the given Lisp dialect. Across Lisp dialects, we can broadly depend on a token of all alphabetic characters to denote a symbol, and a token of all digits (possibly with a leading sign) to denote an integer. 1+ has a trailing + though, which is different!
I find that when I use :utf-16 as the encoding to convert a Lisp string to a C string with CFFI, the actual encoding used is UTF-16LE. But when converting the C string back to a Lisp string, the actual encoding used is UTF-16BE. Since I'm not familiar with babel yet (which provides the encoding facility for CFFI), I'm not sure whether that's a bug.
(defun convtest (str to-c from-c)
  (multiple-value-bind (ptr size)
      (cffi:foreign-string-alloc str :encoding to-c)
    (declare (ignore size))
    (prog1
        (cffi:foreign-string-to-lisp ptr :encoding from-c)
      (cffi:foreign-string-free ptr))))
(convtest "hello" :utf-16 :utf-16) ;=> garbage string
(convtest "hello" :utf-16 :utf-16le) ;=> "hello"
(convtest "hello" :utf-16 :utf-16be) ;=> garbage string
(convtest "hello" :utf-16le :utf-16be) ;=> garbage string
(convtest "hello" :utf-16le :utf-16le) ;=> "hello"
The `convtest' function converts a Lisp string to a C string and then back to a Lisp string, using `to-c' and `from-c' as the encodings. All the garbage output strings are the same. From the tests we see that if we use :utf-16 as both `to-c' and `from-c' at the same time, the conversion fails.
Here the to-c encoding assumes little-endian (LE) by default; from-c then has big-endian (BE) as its default. The platform itself (x86) is little-endian. UTF-16 prefers big-endian, or takes the byte order from a byte-order mark.
This probably depends on the platform you are running on; platforms seem to have different defaults. Best to look into the source code to see why those encodings are chosen. You might also ask on the CFFI mailing list about the encoding choices and how they depend on the platform, if at all.
If I need to have the following Python value, the Unicode character with code point 0:
>>> unichr(0)
u'\x00'
How can I define it in Lua?
There isn't one.
Lua has no concept of a Unicode value. Lua has no concept of Unicode at all. All Lua strings are 8-bit sequences of "characters", and all Lua string functions will treat them as such. Lua does not treat strings as having any Unicode encoding; they're just a sequence of bytes.
You can insert an arbitrary number into a string. For example:
"\065\066"
Is equivalent to:
"AB"
The \ notation is followed by up to three decimal digits (or one of the escape characters), whose value must be less than or equal to 255. Lua is perfectly capable of handling strings with embedded \000 characters.
But you cannot directly insert Unicode codepoints into Lua strings. You can decompose the codepoint into UTF-8 and use the above mechanism to insert the codepoint into a string. For example:
"x\226\131\151"
This is the x character followed by the Unicode combining right arrow above character (U+20D7).
But since no Lua functions actually understand UTF-8, you will have to expose some function that expects a UTF-8 string in order for it to be useful in any way.
How about this, which UTF-8-encodes the code point using the decomposition described above:
-- Return the UTF-8 byte sequence for a Unicode code point (plain Lua, no libraries).
local floor = math.floor
function unichr(ord)
  if ord == nil then return nil end
  if ord < 0x80 then return string.char(ord) end
  if ord < 0x800 then return string.char(0xC0 + floor(ord / 0x40), 0x80 + ord % 0x40) end
  if ord < 0x10000 then
    return string.char(0xE0 + floor(ord / 0x1000), 0x80 + floor(ord / 0x40) % 0x40, 0x80 + ord % 0x40)
  end
  if ord < 0x110000 then
    return string.char(0xF0 + floor(ord / 0x40000), 0x80 + floor(ord / 0x1000) % 0x40,
                       0x80 + floor(ord / 0x40) % 0x40, 0x80 + ord % 0x40)
  end
end
While native Lua does not directly support or handle Unicode, its strings are really buffers of arbitrary bytes that by convention hold ASCII characters. Since strings may contain any byte values, it is relatively straightforward to build support for Unicode on top of native strings. Should byte buffers prove to be insufficiently robust for the purpose, one can also use a userdata object to hold anything, and with the addition of a suitable metatable, endow it with methods for creation, translation to a desired encoding, concatenation, iteration, and anything else that is needed.
There is a page at the Lua User's Wiki that discusses various ways to handle Unicode in Lua programs.
For a more modern answer, Lua 5.3 now has the utf8.char function:
Receives zero or more integers, converts each one to its corresponding UTF-8 byte sequence and returns a string with the concatenation of all these sequences.
This is a double question for you amazingly kind Stack Overflow Wizards out there.
How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL? At the moment I cannot send any non-roman characters to swank-clojure, and using the command-line REPL garbles things.
It's really easy to do regular expressions on Latin text:
(re-seq #"[\w]+" "It's really true that Japanese sentences don't need spaces?")
But what if I had some japanese? I thought that this would work, but I can't test it:
(re-seq #"[(?u)\w]+" "日本語 の 文章 に は スペース が 必要 ない って、 本当?")
It gets harder if we have to use a dictionary to find word breaks, or to find a katakana-only word ourselves:
(re-seq #"[アイウエオ-ン]" "日本語の文章にはスペースが必要ないって、本当?")
Thanks!
Can't help with swank or Emacs, I'm afraid. I'm using Enclojure on NetBeans and it works well there.
On matching: As Alex said, \w doesn't work for non-English characters, not even the extended Latin charsets for Western Europe:
(re-seq #"\w+" "prøve") =>("pr" "ve") ; Norwegian
(re-seq #"\w+" "mañana") => ("ma" "ana") ; Spanish
(re-seq #"\w+" "große") => ("gro" "e") ; German
(re-seq #"\w+" "plaît") => ("pla" "t") ; French
The \w skips the extended chars. Using [(?u)\w]+ instead makes no difference, same with the Japanese.
But see this regex reference: \p{L} matches any Unicode character in the Letter category, so it actually works for Norwegian:
(re-seq #"\p{L}+" "prøve")
=> ("prøve")
as well as for Japanese (at least I suppose so; I can't read it, but it seems to be in the ballpark):
(re-seq #"\p{L}+" "日本語 の 文章 に は スペース が 必要 ない って、 本当?")
=> ("日本語" "の" "文章" "に" "は" "スペース" "が" "必要" "ない" "って" "本当")
There are lots of other options, like matching on combining diacritical marks and whatnot, check out the reference.
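For example, \p{M} matches combining marks, so \p{L}\p{M}* treats a base letter plus any attached diacritics as one unit (here "e\u0301" is the decomposed form of é, an e followed by a combining acute accent):
(re-seq #"\p{L}\p{M}*" "e\u0301") ; => ("é")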
Edit: More on Unicode in Java
A quick reference to other points of potential interest when working with Unicode.
Fortunately, Java generally does a very good job of reading and writing text in the correct encodings for the location and platform, but occasionally you need to override it.
This is all Java; most of this stuff does not have a Clojure wrapper (at least not yet).
java.nio.charset.Charset - represents a charset like US-ASCII, ISO-8859-1, UTF-8
java.io.InputStreamReader - lets you specify a charset to translate from bytes to strings when reading. There is a corresponding OutputStreamWriter.
java.lang.String - lets you specify a charset when creating a String from an array of bytes.
java.lang.Character - has methods for getting the Unicode category of a character and converting between Java chars and Unicode code points.
java.util.regex.Pattern - specification of regexp patterns, including Unicode blocks and categories.
Java characters/strings are UTF-16 internally. The char type (and its wrapper Character) is 16 bits, which is not enough to represent all of Unicode, so symbols outside the Basic Multilingual Plane (emoji, for example) need two chars - a surrogate pair - to represent one symbol.
When dealing with non-Latin Unicode it's often better to use code points rather than characters. A code point is one Unicode character/symbol represented as an int. The String and Character classes have methods for converting between Java chars and Unicode code points.
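A quick REPL sketch of that round trip, using the cherries emoji (U+1F352, code point 127826):
(count "🍒")                           ; => 2, two UTF-16 chars (a surrogate pair)
(.codePointCount "🍒" 0 2)             ; => 1, but only one code point
(.codePointAt "🍒" 0)                  ; => 127826
(apply str (Character/toChars 127826)) ; => "🍒"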
unicode.org - the Unicode standard and code charts.
I'm putting this here since I occasionally need this stuff, but not often enough to actually remember the details from one time to the next. Sort of a note to my future self, and it might be useful to others starting out with international languages and encodings as well.
I'll answer half a question here:
How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL?
A more interactive way:
M-x customize-group
"slime-lisp"
Find the option for slime coding system, and select utf-8-unix. Save this so Emacs picks it up in your next session.
Or place this in your .emacs:
(custom-set-variables '(slime-net-coding-system (quote utf-8-unix)))
That's what the interactive menu will do anyway.
Works on Emacs 23 and works on my machine
For katakana, Wikipedia shows you the Unicode ordering. So if you wanted to use a regex character class that caught all the katakana, I suppose you could do something like this:
user> (re-seq #"[\u30a0-\u30ff]+" "日本語の文章にはスペースが必要ないって、本当?")
("スペース")
Hiragana, for what it's worth:
user> (re-seq #"[\u3040-\u309f]+" "日本語の文章にはスペースが必要ないって、本当?")
("の" "には" "が" "ないって")
I'd be pretty amazed if any regex could detect Japanese word breaks.
For international characters you need to use Java character classes, something like [\p{javaLowerCase}\p{javaUpperCase}]+ to match any word character. \w is only for ASCII - see the java.util.regex.Pattern documentation.
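For instance, reusing the Norwegian example from above (a sketch; note this class alone covers only letters, so digits and underscores would need to be added separately):
(re-seq #"[\p{javaLowerCase}\p{javaUpperCase}]+" "prøve") ; => ("prøve")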
Prefix your regex with (?U) like so: (re-matches #"(?U)\w+" "ñé2_hi") => "ñé2_hi".
This sets the UNICODE_CHARACTER_CLASS flag to true so that the typical character classes do what you want with non-ASCII Unicode.
See here for more info: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS
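For instance, the examples that failed above work once the flag is set (this needs Java 7 or later, where UNICODE_CHARACTER_CLASS was introduced):
(re-seq #"(?U)\w+" "mañana") ; => ("mañana")
(re-seq #"(?U)\w+" "日本語 の 文章") ; => ("日本語" "の" "文章")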