Programming in a language other than English - unicode

I was having a discussion on twitter about adding the ability of Ruby to use λ instead of lambda, and more generally about Unicode support. I realized that all the languages I know work only with English reserved words and mostly assume a us-en keyboard (for example using $ instead of £ or ¥). While some languages are now starting to have some support for Unicode in there string functions, there are still so many convention based on English or the Latin style character set. For example Ruby requires class names begin with an upper case letter, but upper and lower case is not a property of glyphs in most scripts.
So the question is: "Are there programming languages that work in a large set of languages, and how do they do it?"

You can have a look ant the APL programming language, for example.

Some languages define very simple syntaxes and little or no keyworks. For example, LISPs and languages that function like them (Tcl, etc...) where everything is "command arg1 ... argn". These languages, since there are no keywords per se, are language agnostic.
For example, in Tcl, you can rename the various commands to use whatever language you want and everything should work perfectly.

Python 3 is completely Unicode-based, so identifiers can be constructed out of any Unicode letters/digits etc.
It's still not a good idea to use characters for function names that programmers from other nations don't have easy access to on their keyboards.

In the 3.0.0 release of the Parrot VM, they added support for a language, Ωη;)XD that is named using unicode which caused all kinds of breakage for the VM. It might be worth taking a look at.

Related

Is there such a thing as an "Encoding Language"?

I know that there are at least two types of coding languages: markup and programming. HTML is an example of the former, Python an example of the latter.
Is there such a thing as an encoding language? An example of this could be Unicode.
Here's a concept tree I made to help illustrate my point:
Unicode and ASCII are character sets and no languages, so they only define the amount of symbols you can use and display.
For the other two (Markup and Programming Languages) it depends on your definition of language. Maybe this is interesting for you: formal languages

Which langages let you use fully customised lexems, including keywords and all symbol defined in their grammar?

I wish to code fully with Esperanto lexemes. That is, not ending up with a English/Esperanto mix up. Perligata is a good example of the kind of result I would like, but it use Latin where I want to use Esperanto.
So Perl seems to be a valid answer to my question. On the other end a language like Python have no mechanism that would let you use se (if in esperanto) rather than if. On what you may call middle ground, you have languages like C that allow to replace keywords through its processor (#define se if), but won't allow you to get ride of the define keywords itself. Also you have languages like racket and the LISP-family that would probably let you use wrap most internal symbols, but probably not allow you to easily change parentheses for anything else. For example mapping ( with ene and ) with ele.
Also an other point is ability to use unicode in identifiers, as Esperanto do use non-ASCII characters in its alphabet, like ĉ. That's not really a blocking element, as one is available to use cx instead of ĉ, but its nevertheless an interesting parameter.
So I guess an ideal answer to this question would be a matrix of languages specifying their lexeme and grammar customability.
each formal language has its syntax. in my opinion, lowest 'syntax overhead' is offered by lisp-like languages. but then you don't want to have parenthesis. you also don't want to have #define therefore you reject any syntax at all and all possible replacements.
i don't think there is any language that will let you do it. you should look for language generators, write your own language (at least the syntax part) or the simplest possible way, add your own find-replace layer on top of any existing language

Perl functions naming convention for long names

I have read some articles online on the naming convention for Perl style which suggest using lowercase letters and separating words by underscore for functions or methods names. Some others use the first word in lowercase then capitalize the other words. Of course Windows .NET etc Capitalize every word and no underscore.
I have some packages methods many words like entriesoncurrentpage, if I follow Perl style suggested I should do it like:
sub entries_on_current_page {...}
this added four underscore letters to the method name, the other style is:
sub entriesOnCurrentPage {...}
and Windows style should be:
sub EntriesOnCurrentPage {...}
PHP sometimes uses all lowercase with underscore like mysql_real_escape_string() and sometimes uses all lowercase without underscore like htmlspecialchars, of course PHP function names are not case sensitive so this feature is not supported in Perl.
So the question is, for the long name with many words what is the best style to use for Perl coding.
Originally, most Perl developers used camel casing with the first letter lowercased. This is the standard with most programming languages. Names with first letter capitalized were used for classes and methods.
Later on, Damian Conway's book Perl Best Practices suggested using underscores rather than camel casing. Damian argued that it increased readability, and was not that much harder to type.
Damian Conway's suggestion on names became the standard because 1). He was correct. It's much more legible and isn't that much harder to type, and most importantly 2). It was incorporated into Perltidy. Perltidy is a program that helps standardize your code according to Damian's suggestions. Perltidy is much like CheckStyle in Java.
Are these arbitrary standards? Yes, all standards are somewhat arbitrary in nature. You have a few candidate suggestions for rules, and you must make a decision:
Should the curly brace in while loops and if statements be appended on the end of the line, or go under the while or if statement. In standerd C style, curly braces are cuddled. In Java, they're not suppose to be according to CheckStyle. In Kornshell, the then goes under the if. In Bash, the standard is now that the then goes on the same line even though the Bash interpreter doesn't really like it. (You have to add a semicolon before the then because it's considered a separate command.
How should variable names be done. In most languages, CamelCase rules. In .NET, you even capitalize the first character, but in Perl, we use underscores.
Should constants be all uppercase? Most languages have agreed with that. However, in shell script, you usually reserve all uppercase variable names for special environment variables such as $PWD, $PATH, etc. Therefore, in Bash and Kornshell scripts, constant variables are all camelCased like regular variables.
The idea is to follow the standard for that language. Why? Because the standard says so. Because you can't argue with The Standard as you can with your fellow programmers whether or not curly braces are cuddled or not. The main this is to realize that most standards may be somewhat arbitrary, but they don't really affect the way you program. By everyone following a standard, you make it easier to understand other people's code.

If Ascii operators are definable, why not Unicode Symbols?

I'm sure I join many in being glad there's finally a powerful language tied tightly to a mainstream GUI/Database/Communication framework.
I haven't been sure where to post this, but here seems the best spot.
I need to use Unicode symbol characters either as operators or as function names. I'd like syntactic sugar, but I don't need it.
Guy Steele pointed out in Communications of the ACM that "*" was a forced choice when it was adopted from Ascii as multiply, but my software works in Unicode, so I'm not tethered to Ascii anymore.
!$%&*+-./<=>?,#^|~:
Part of localization includes local programmers. Why limit the set of operators that can be defined in F#? It isn’t orthogonal to C#'s and F#'s acceptance of many Unicode IsLetter in identifiers.
Also, F# is likely to be used for symbolic manipulation of problems from logic, math, physicists, etc. It makes work much easier if there’s a direct mapping into the language of the basic operators. (F# and C# accept many Unicode IsLetter? as well as IsDigit’? This is a request to allow Unicode IsSymbol? As operators with the precedence of, for example, *, or, since “+” is both a unary and binary operator, I could put up with the precedence of + and make up the difference with parenthesized groupings.
Consider the domain-specific needs of logicians, mathematicians, physicists, etc. I’d rather write a symbolic differentiator or integrator using math symbols than Ascii permutations of already-taken operators.
Logic: ∀ ∃ ⇒
Math: ∑ ∫ ∂
Group theory: ≤ ≥ ∈ ∉
Set Theory: ⊆ ⊇ ⊃ ∪ ∩
Tensors: ⊗
I’ve written many languages in other languages, but because F# is tightly .Net-integrated, this issue poses special challenges without language support:
It’s trivial to cobble up a translator that takes Unicode-operator F# source and maps it, line-by-line, to Ascii-operator F# source.
But when debugging, how do I make sure the programmer still sees their untranslated source? And that they can see variable values.
Operators and converts them is trivial. But how do I ensure the translation is what gets compiled, while the programmer sees their own source? If I map line-for-line correctly, how do I ensure they can still point at a variable and see its value?
There is a math (Unicode) symbol extension for F# available in the Visual Studio Gallery.
This allows you to define Unicode symbols, e.g.:
let inline (~∑) xs = xs |> Seq.sum
let total = ∑myList
You may be interested in Project Fortress which is a new functional programming language that embraces the Unicode character set (among many other features). In particular, see the Mathematical Syntax in Fortress page which contains some sample code.
For an interesting discussion on this check: http://cs.hubfs.net/forums/thread/9690.aspx
Other languages, such as Scala, do permit operators from outside the ASCII range -- mathematical symbols(Sm) and other symbols(So)

What are the experiences with using unicode in identifiers

These days, more languages are using unicode, which is a good thing. But it also presents a danger. In the past there where troubles distinguising between 1 and l and 0 and O. But now we have a complete new range of similar characters.
For example:
ì, î, ï, ı, ι, ί, ׀ ,أ ,آ, ỉ, ﺃ
With these, it is not that difficult to create some very hard to find bugs.
At my work, we have decided to stay with the ANSI characters for identifiers. Is there anybody out there using unicode identifiers and what are the experiences?
Besides the similar character bugs you mention and the technical issues that might arise when using different editors (w/BOM, wo/BOM, different encodings in the same file by copy pasting which is only a problem when there are actually characters that cannot be encoded in ASCII and so on), I find that it's not worth using Unicode characters in identifiers. English has become the lingua franca of development and you should stick to it while writing code.
This I find particularly true for code that may be seen anywhere in the world by any developer (open source, or code that is sold along with the product).
My experience with using unicode in C# source files was disastrous, even though it was Japanese (so there was nothing to confuse with an "i"). Source Safe doesn't like unicode, and when you find yourself manually fixing corrupted source files in Word you know something isn't right.
I think your ANSI-only policy is excellent. I can't really see any reason why that would not be viable (as long as most of your developers are English, and even if they're not the world is used to the ANSI character set).
I think it is not a good idea to use the entire ANSI character set for identifiers. No matter which ANSI code page you're working in, your ANSI code page includes characters that some other ANSI code pages don't include. So I recommend sticking to ASCII, no character codes higher than 127.
In experiments I have used a wider range of ANSI characters than just ASCII, even in identifiers. Some compilers accepted it. Some IDEs needed options to be set for fonts that could display the characters. But I don't recommend it for practical use.
Now on to the difference between ANSI code pages and Unicode.
In experiments I have stored source files in Unicode and used Unicode characters in identifiers. Some compilers accepted it. But I still don't recommend it for practical use.
Sometimes I have stored source files in Unicode and used escape sequences in some strings to represent Unicode character values. This is an important practice and I recommend it highly. I especially had to do this when other programmers used ANSI characters in their strings, and their ANSI code pages were different from other ANSI code pages, so the strings were corrupted and caused compilation errors or defective results. The way to solve this is to use Unicode escape sequences.
I would also recommend using ascii for identifiers. Comments can stay in a non-english language if the editor/ide/compiler etc. are all locale aware and set up to use the same encoding.
Additionally, some case insensitive languages change the identifiers to lowercase before using, and that causes problems if active system locale is Turkish or Azerbaijani . see here for more info about Turkish locale problem. I know that PHP does this, and it has a long standing bug.
This problem is also present in any software that compares strings using Turkish locales, not only the language implementations themselves, just to point out. It causes many headaches
It depends on the language you're using. In Python, for example, is easierfor me to stick to unicode, as my aplications needs to work in several languages. So when I get a file from someone (something) that I don't know, I assume Latin-1 and translate to Unicode.
Works for me, as I'm in latin-america.
Actually, once everithing is ironed out, the whole thing becomes a smooth ride.
Of course, this depends on the language of choice.
I haven't ever used unicode for identifier names. But what comes to my mind is that Python allows unicode identifiers in version 3: PEP 3131.
Another language that makes extensive use of unicode is Fortress.
Even if you decide not to use unicode the problem resurfaces when you use a library that does. So you have to live with it to a certain extend.