MIT Scheme using special characters in the interpreter - unicode

I am using MIT scheme, and would like to be able to do something like this:
(define π 3.14159265)
Without having an encoding error like this:
;Illegal character: #\U+80
;To continue, call RESTART with an option number:
; (RESTART 1) => Return to read-eval-print level 1
MIT Scheme does have Unicode support, but it appears that it doesn't have support for unicode in the code, which is what I am looking to do. It turns out that ISO-8859-1 (the encoding used in MIT Scheme) does not have any greek letters within it, which is a pity.
Solutions that might work, but are not very good:
Writing all of my code into text files and using the built in unicode support to read in the unicode characters as code.
Rewriting the entire interpreter to accept unicode names
Using a different lisp implementation which allows for Unicode names.
Can't wait to hear from the Stack Overflowers!

You can use unicode symbols in guile, gambit, scm, and chicken for sure.

Related

Julia: How to deal with special unicode characters

I am working with the Distributions package which uses special unicode characters for many of the variables within types. The normal distribution, for instance, uses μ and σ. If I want to edit the standard deviation, I need to somehow type:
n.σ = 5.0
Is it possible to type these values into the repl (outside of using copy-paste)? How does one create these characters with one's keyboard?
Thank you
At the REPL, use LaTeX shortcuts, e.g. type \sigma and press tab to autocomplete. Note you need to using Julia 0.3 or higher for this to work.
Many text editors have add-ins to do something similar, e.g. https://github.com/mvoidex/UnicodeMath for SublimeText.
In Windows 10 and Linux under most modern desktops you can add a Greek keyboard map and then switch between Greek and English using [[windows button]]+space. Since many Latin / English letters are derived from ancient Greek, these have analogs so S types a Sigma (σΣ), D, a delta(δΔ) etc. ετψ.
Figured it out by looking up "Entering Unicode in Linux" on Google.
One can press Ctrl+Shift+u, then the 4-digit UTF-16 Hex encoding for the character. For example, σ = u03bc
Tt is difficult to remember all the unicodes or latex shortcuts (at least for me) or search it on the web. When working with REPL or jupyter notebook, Julia has provided a simple way to do it as mentioned here:
You can also get information on how to type a symbol by entering it in the REPL help, i.e. by typing ? and then entering the symbol in the REPL (e.g., by copy-paste from somewhere you saw the symbol).
For example:

Using unicode / utf-8 in programmers editors

There are a lot of programmers editors that claim to support unicode / utf-8. I've tried a number of them (UltraEdit, jedit, emedit) but none of them tell you how to actually enter unicode characters into a file. Some of them tell you how to change the default file encoding to utf-8 or how to select a font that has good support for utf-8, but not how to enter utf-8 into a file using their editor.
The Go language (and some others) support utf-8 and I like the idea of using the actual utf-8 symbols for variables instead of variables with names like omega. I haven't found a programmers editor yet that actually allows you to do this, though.
The only editor / word processor that I've found that lets you how to enter unicode is Microsoft Word. Type the unicode and Alt+X and Word converts it. To get the Greek letter omega type "03c9" followed by Alt+X. UltraEdit will let you copy utf-8 from a web page into it, but their docs don't say how to actually enter utf-8 in a file, and their tech. support people don't know either.
This should be simple, but seems to be completely undocumented. Is there some key combination convention the lets you enter unicode into these editors that supposedly support unicode the way that Ctrl-F is widely used for search?
Thanks.
The standard programmer’s editor vim(1) supports limited Unicode input even if your operating system should be too broken to do so (are there any such, still?).
Just enter ^VuXXXX, where XXXX represents exactly four hex digits.
That will allow you to enter the ~6% of Unicode allocated to the Basic Multilingual Plane. The rest are forbidden to you.
This may be fixed in a newer release.
Otherwise, just use your mouse.
A few techniques I use if an editor is lacking:
Use the Windows charmap.exe utility to select characters and paste into a document.
Install an input method editor (IME) to write in a particular language.
Windows ALT keycodes.
Better to set your keyboard to generate Unicode characters across all Windows applications than to rely on a single application's custom input feature IMO.
Use the EnableHexNumpad feature and you can type any character in the Basic Multilingual Plane using Alt+numbad-plus,hexcode. (May not be of much use on a laptop without a numpad though.)
Or if there are particular characters you want to type a lot, find a keyboard layout that allows you to type them directly. For example eurokb might cover it, or you can make your own with MSKLC.
Old question, but you can type a lot of unicode in GNU Emacs or Vim
GNU Emacs: M-x set-input-method RET tex (or C-x RET C-\ tex) will let you type \omega to generate ω
Vim: Vim digraphs can generate unicode; C-k w * in insert mode gives you ω.
deceze hit the nail on the head. (S)he just didn't elaborate. bobince gave a bit more.
And I'm hazarding a guess that you're a developer or tester working on L14N or I18N. I'm also guessing you need to do more than just a few characters here or there, or you'd be satisfied with pasting from another app. So, I'll share some advice. (note: here, "you" refers to the next person to look here. I'm sure the original poster doesn't care anymore by now. :-))
If you're on Windows 10, install an appropriate keyboard driver that lets you input the characters you want into any application. I'm sure Linux has support for the same sort of thing.
E.g. I'm teaching myself Hindi (हिंदी), so I installed Windows' Hindi (Devanangari) support. I typed "Hindi", in Hindi using that support, then I switched back to US English to do the rest of this post. If all you need are accented characters from Western European languages, you can install the INTL English support and type directly in español or français or whatever.
Don't look at entering Unicode characters as entering some sort of special data amidst your English text. It's just someone else's language. Use their keyboard. Type their language.
I'm writing a flashcard app to help my learning. I'm using the Hindi keyboard support to type characters into Word, WordPad, Excel, and the Visual Studio editor. And that Hindi keyboard support works exactly the same way in all of those apps, as I'd expect it to work in just about any text editor that supports Unicode. And as you saw above, it also works in a simple text edit control in Chrome. No copy and paste. No remembering special codes. It's as ubiquitous as ctrl-F.
It looks like the unicode support in programmers editors (except for some Microsoft products) is mostly read-only. They can open a file with unicode and display the characters, but typing unicode into a file is a different story. If you want to enter unicode in a programmers editor you can copy it from somewhere else (a web page or Microsoft Word or Notepad) and paste it into the editor, but the editors make typing unicode difficult or impossible.
UltraEdit tech support referred me to this web page which explains a lot. Unfortunately none of the solutions worked with UltraEdit.
Microsoft Word and Notepad support unicode entry. Type the unicode value followed by Alt+X and it converts the hexadecimal and displays it. You can then copy and paste it into UltraEdit or one of the other programmers editors. As others have mentioned unicode support depends on support within the operating system as well as the editor.
What got me interested in using unicode in source code files is Mark Summerfield's book Programming in Go. He includes an example .go file that uses unicode. It would be great to use unicode Greek characters for variable names instead of variables named "omega" or "theta".
Using unicode in source code is a bad idea, however. Support for unicode in programmers editors is lousy, and developers would have to save or convert their source code files to utf-8 instead of ASCII. Developer's tools are just not ready to write code in unicode no matter how neat the idea sounds.

wxWidgets and Unicode

i want to use korean translations under in my - quite large - wxwidgets application. The application uses the wxwidgets translation framework, which is based on gettext.
I have working translations for french, german and russian. I want to go unicode anyway, but my first question is:
does my application need unicode support to display korean and japanese languages?
If so, - just for interest - why does russian work without, since they have a cyrillic letterset?
I have thousands of string literals. Do i have to prepend each and every one of them with 'L' ? ( wxString foo("foo") --> wxString foo(L"foo") )
if so, did someone build a regex or sed or perl script to do this in ca. 500 .cpp files ? ( pleeze! =) )
Will this change in wxWidgets 3.0?
Unicode question general: i use these string literals in many descriptive and many technical ways .. as displayed text as well as parts of GLSL shaders as well as XML. These APIs have char* / const char* as function arguments, so my internal wxString representation should not matter in these areas. Theory and practice: is this true? Some experiences to share, anyone?
I do some text processing ( comparing, string finding etc ) - are there any logical differences in unicode vs. ansi?
Is there any remarkeable performance impact in using Unicode?
Thank you!
Wendy
Addressing some of your questions…
does my application need unicode support to display korean and japanese languages?
If so, - just for interest - why does russian work without, since they have a cyrillic letterset?
Russian fits in a single-byte charset, just like western European languages (though it is a different charset). Korean and Japanese (and Chinese) don't. There are many workarounds for this, but the most elegant I know of to date is to use Unicode so that you don't need to rebuild your application for each locale; just change its message catalog.
Unicode question general: i use these string literals in many descriptive and many technical ways .. as displayed text as well as parts of GLSL shaders as well as XML. These APIs have char* / const char* as function arguments, so my internal wxString representation should not matter in these areas. Theory and practice: is this true? Some experiences to share, anyone?
Only strings that are going to be shown to (non-technical) users need to be localized, so they're the only ones that have to be in Unicode. The most common approach is to use UTF-8 (which is a particular way of encoding Unicode) as that means that ASCII strings – the most common type passed around inside programs – are exactly the same, which simplifies things a lot. The down-side is that you no longer have cheap indexing into the string as not all characters are the same number of bytes long. That can be anything from a non-issue to a right royal hindering PITA, depending on what the program is doing.
I do some text processing ( comparing, string finding etc ) - are there any logical differences in unicode vs. ansi?
Comparisons work fine, as does simple string finding. Other operations (e.g., getting the 20th character of a string, or working out how many characters into a string you've found a substring) are nasty because you've not got constant character widths. The nastiness can be mitigated by using wide characters, but they're less nice to use for external data (they introduce potential problems with endianness unless you go into working with byte-order marks, and that's another matter right there).
Is there any remarkeable performance impact in using Unicode?
Depends on exactly what you do. With UTF-8, if you're mostly dealing with ASCII text in reality then you get very little in the way of performance problems for most operations. With wide characters, you take more memory for every character, which naturally has performance implications (but which might acceptable because it does mean you've got constant-time indexing).
There's a korean .po file on http://www.wxwidgets.org/about/i18n.php for wxWidget's own strings. If your application displays wxWidget's own strings correctly when using that file, then it does not need Unicode support to display Korean and Japanese languages.
ISO-8859-5 is an 8 bit character set with Cyrillic letters.
Only if 1. does not yield the correct result. But if you want to translate the string, you should have used _().
I don't know.
wxWidgets 3.0 will not have separate Unicode- and ANSI-builds. 2.9.1 doesn't have, either.
It depends on how you use the arguments. C- and C++-functions usually operate on the representation of strings and are unaware of any particular character encoding. Particularly what you perceive to be a character and what the program considers a character might be different things.
See 6.
I do not know, but many toolkits use UTF-16 or UTF-32 instead of UTF-8 because these schemes are simpler. It's a size-speed tradeoff.
1.does my application need unicode support to display korean and japanese
languages?
Thanks to Oswald, i found out that you can have a korean translation without using unicode in your wxwidgets application. Change ( under windows, at least ) settings for non-unicode aware programs. But i still have to check out if this is enough for a whole application.
3.I have thousands of string literals. Do i have to prepend each
and every one of them with 'L' ? (
wxString foo("foo") --> wxString
foo(L"foo") )
If you have to use unicode with wxwidgets prior to 3.0, you have to. But do not use 'L' under wxwidgets, use wxT("foo")
4.if so, did someone build a regex or sed or perl script to do this in ca. 500 .cpp files ?
I did, at least a search and replace under Visual Studio:
Search: {"([^"]*)"}
Replace: wxT(\1)
But be careful! Will replace all string literals, #include "file.h" with #include wxT("file.h")
Will this change in wxWidgets 3.0?
Yes. See answer/quote above.

What encoding Win32 API functions expect?

For example, MessageBox function has LPCTSTR typed argument for text and caption, which is a pointer to char or pointer to wchar when _UNICODE or _MBCS are defined, respectively.
How does the MessageBox function interpret those stings? As which encoding?
Only explanation I managed to find is this:
http://msdn.microsoft.com/en-us/library/cwe8bzh0(VS.90).aspx
But it doesn't say anything about encoding? Just that in case of _MBCS one character takes up one wchar (which is 16-bit on Windows) and that in case of _UNICODE one or two char's (8-bit).
So are those some Microsoft's versions of UTF-8 and UTF-16 that ignore anything that has to be encoded in 3 or four bytes in case of UTF-8 and anything that has to be encoded in 4 bytes in case of UTF-16? And is there a way to show anything outside of basic multilingual plane of Unicode with MessageBox?
There are normally two different implementations of each function:
MessageBoxA, which accepts ANSI strings
MessageBoxW, which accepts Unicode strings
Here, 'ANSI' means the multi-byte code page currently assigned to the process. This varies according to the user's preferences and locale setting, although Win32 API functions such as WideCharToMultiByte can be counted on to do the right conversion, and the GetACP function will tell you the code page in use. MSDN explains the ANSI code page and how it interacts with Unicode.
'Unicode' generally means UCS-2; that is, support for characters above 0xFFFF isn't consistent. I haven't tried this, but UI functions such as MessageBox in recent versions (> Windows 2000) should support characters outside the BMP.
The ...A functions are obsolete and only wrap the ...W functions. The former were required for compatibility with Windows 9x, but since that is not used any more, you should avoid them at any costs and use the ...W functions exclusively. They require UTF-16 strings, the only native Windows encoding. All modern Windows versions should support non-BMP characters quite well (if there is a font that has these characters, of course).

How do I use unicode (UTF-8) characters in Clojure regular expressions?

This is a double question for you amazingly kind Stacked Overflow Wizards out there.
How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL? At the moment I cannot send any non-roman characters to swank-clojure, and using the command-line REPL garbles things.
It's really easy to do regular expressions on latin text:
(re-seq #"[\w]+" "It's really true that Japanese sentences don't need spaces?")
But what if I had some japanese? I thought that this would work, but I can't test it:
(re-seq #"[(?u)\w]+" "日本語 の 文章 に は スペース が 必要 ない って、 本当?")
It gets harder if we have to use a dictionary to find word breaks, or to find a katakana-only word ourselves:
(re-seq #"[アイウエオ-ン]" "日本語の文章にはスペースが必要ないって、本当?")
Thanks!
Can't help with swank or Emacs, I'm afraid. I'm using Enclojure on NetBeans and it works well there.
On matching: As Alex said, \w doesn't work for non-English characters, not even the extended Latin charsets for Western Europe:
(re-seq #"\w+" "prøve") =>("pr" "ve") ; Norwegian
(re-seq #"\w+" "mañana") => ("ma" "ana") ; Spanish
(re-seq #"\w+" "große") => ("gro" "e") ; German
(re-seq #"\w+" "plaît") => ("pla" "t") ; French
The \w skips the extended chars. Using [(?u)\w]+ instead makes no difference, same with the Japanese.
But see this regex reference: \p{L} matches any Unicode character in category Letter, so it actually works for Norwegian
(re-seq #"\p{L}+" "prøve")
=> ("prøve")
as well as for Japanese (at least I suppose so, I can't read it but it seems to be in the ballpark):
(re-seq #"\p{L}+" "日本語 の 文章 に は スペース が 必要 ない って、 本当?")
=> ("日本語" "の" "文章" "に" "は" "スペース" "が" "必要" "ない" "って" "本当")
There are lots of other options, like matching on combining diacritical marks and whatnot, check out the reference.
Edit: More on Unicode in Java
A quick reference to other points of potential interest when working with Unicode.
Fortunately, Java generally does a very good job of reading and writing text in the correct encodings for the location and platform, but occasionally you need to override it.
This is all Java, most of this stuff does not have a Clojure wrapper (at least not yet).
java.nio.charset.Charset - represents a charset like US-ASCII, ISO-8859-1, UTF-8
java.io.InputStreamReader - lets you specify a charset to translate from bytes to strings when reading. There is a corresponding OutputStreamWriter.
java.lang.String - lets you specify a charset when creating a String from an array of bytes.
java.lang.Character - has methods for getting the Unicode category of a character and converting between Java chars and Unicode code points.
java.util.regex.Pattern - specification of regexp patterns, including Unicode blocks and categories.
Java characters/strings are UTF-16 internally. The char type (and its wrapper Character) is 16 bits, which is not enough to represent all of Unicode, so many non-Latin scripts need two chars to represent one symbol.
When dealing with non-Latin Unicode it's often better to use code points rather than characters. A code point is one Unicode character/symbol represented as an int. The String and Character classes have methods for converting between Java chars and Unicode code points.
unicode.org - the Unicode standard and code charts.
I'm putting this here since I occasionally need this stuff, but not often enough to actually remember the details from one time to the next. Sort of a note to my future self, and it might be useful to others starting out with international languages and encodings as well.
I'll answer half a question here:
How do I set emacs/slime/swank to use UTF-8 when talking with Clojure, or use UTF-8 at the command-line REPL?
A more interactive way:
M-x customize-group
"slime-lisp"
Find the option for slime coding system, and select utf-8-unix. Save this so Emacs picks it up in your next session.
Or place this in your .emacs:
(custom-set-variables '(slime-net-coding-system (quote utf-8-unix)))
That's what the interactive menu will do anyway.
Works on Emacs 23 and works on my machine
For katakana, Wikipedia shows you the Unicode ordering. So if you wanted to use a regex character class that caught all the katakana, I suppose you could do something like this:
user> (re-seq #"[\u30a0-\u30ff]+" "日本語の文章にはスペースが必要ないって、本当?")
("スペース")
Hiragana, for what it's worth:
user> (re-seq #"[\u3040-\u309f]+" "日本語の文章にはスペースが必要ないって、本当?")
("の" "には" "が" "ないって")
I'd be pretty amazed if any regex could detect Japanese word breaks.
for international characters you need to use Java Character classes, something like [\p{javaLowerCase}\p{javaUpperCase}]+ to match any word character... \w is used for ASCII - see java.util.Regex documentation
Prefix your regex with (?U) like so: (re-matches #"(?U)\w+" "ñé2_hi") => "ñé2_hi".
This sets the UNICODE_CHARACTER_CLASS flag to true so that the typical character classes do what you want with non-ASCII Unicode.
See here for more info: http://docs.oracle.com/javase/8/docs/api/java/util/regex/Pattern.html#UNICODE_CHARACTER_CLASS