Unicode functions for Lua - unicode

Sorry if this has already been asked, but Lua's in-built string functions (such as string.len and string.sub) don't work very well with unicode characters, so are there any alternatives which do?

There are various libraries available that do this, for example: https://github.com/alexander-yakushev/awesompd/blob/master/utf8.lua. Also, Lua 5.3+ supports some of utf8-related functions out of the box: https://www.lua.org/manual/5.3/manual.html#6.5

Related

Is there such a thing as an "Encoding Language"?

I know that there are at least two types of coding languages: markup and programming. HTML is an example of the former, Python an example of the latter.
Is there such a thing as an encoding language? An example of this could be Unicode.
Here's a concept tree I made to help illustrate my point:
Unicode and ASCII are character sets and no languages, so they only define the amount of symbols you can use and display.
For the other two (Markup and Programming Languages) it depends on your definition of language. Maybe this is interesting for you: formal languages

Does development with scalaz require an Unicode/APL-like keyboard?

Can scalaz be used without a keyboard containing the appropriate Unicode characters or does every Unicode identifier also have an "ASCII" equivalent (and if yes, is there any guarantee that it stays that way)? Are there special keyboard layouts for usage with scalaz?
What's the best practice? Inputting the Unicode identifiers directly or using the ASCII substitutes and using a script to replace them with the Unicode ones before commit?
No, you don't need anything besides ASCII to use Scalaz.
However, most editors and IDEs have some way of automatically or semi-automatically (like, -space) converting a sequence of characters into something else. That takes care of it if you want to keep your source code in Unicode.
Now, the problem with keeping stuff in Unicode is that you might trouble with some fonts when displaying stuff in web pages, etc. Hell, you might even be forced to convert the code to ASCII for some reason. Yes, it is unlikely, but it is an issue you should be aware of.
This post from Superuser has some information about this.
This wikipedia article on Unicode input might be helpful.
No. Yes. Yes. No. Benign guarantees are for sissies. Write code. I use an appropriate development environment that allows me to type whatever I like.

Programming in a language other than English

I was having a discussion on twitter about adding the ability of Ruby to use λ instead of lambda, and more generally about Unicode support. I realized that all the languages I know work only with English reserved words and mostly assume a us-en keyboard (for example using $ instead of £ or ¥). While some languages are now starting to have some support for Unicode in there string functions, there are still so many convention based on English or the Latin style character set. For example Ruby requires class names begin with an upper case letter, but upper and lower case is not a property of glyphs in most scripts.
So the question is: "Are there programming languages that work in a large set of languages, and how do they do it?"
You can have a look ant the APL programming language, for example.
Some languages define very simple syntaxes and little or no keyworks. For example, LISPs and languages that function like them (Tcl, etc...) where everything is "command arg1 ... argn". These languages, since there are no keywords per se, are language agnostic.
For example, in Tcl, you can rename the various commands to use whatever language you want and everything should work perfectly.
Python 3 is completely Unicode-based, so identifiers can be constructed out of any Unicode letters/digits etc.
It's still not a good idea to use characters for function names that programmers from other nations don't have easy access to on their keyboards.
In the 3.0.0 release of the Parrot VM, they added support for a language, Ωη;)XD that is named using unicode which caused all kinds of breakage for the VM. It might be worth taking a look at.

iPhone Chinese simplified to traditional character conversion

Is there a way to convert Chinese simplified characters to traditional characters in Cocoa/Objective-C? On the .NET platform you can include a VB dll in your projects that gives you access to a function for an easy conversion. Is there anything I can use in Cocoa/Objective-C that will allow me to do the same? I want to go between simplified and traditional and vice-versa. Thank you!
As I know, Apple does not have public APIs to let you convert Chinese characters by simply calling a function, but you can do the conversion character by character your self.
The OpenVanilla project, an open source input method project, maintains a Chinese character conversion table. It was used in the input method software but I think it could also be used for other purposes. It is available at
http://github.com/lukhnos/openvanilla-oranje/blob/master/Modules/OVOFHanConvert/VXHCSC2TCTable.c
http://github.com/lukhnos/openvanilla-oranje/blob/master/Modules/OVOFHanConvert/VXHCTC2SCTable.c

Which programming languages were designed with Unicode support from the beginning?

Which widely used programming languages were designed ground-up with Unicode support?
A lot of programming languages have added Unicode support as an afterthought in later versions, but which widely used languages were released with Unicode support from day one?
Java was probably the first popular language to have ground-up Unicode support.
Basically all of the .NET languages are Unicode languages, such as C# and VB.NET.
There were many breaking changes in Python 3, among them the switch to Unicode for all text.
So Python wasn't designed ground-up for Unicode, but Python 3 was.
I don't know how far this goes in other languages, but a fun thing about C# is that not only is the runtime (the string class etc) unicode aware - but unicode is fully supported in source:
using משליט = System.Object;
using תוצאה = System.Int32;
public class שלום : משליט {
public תוצאה בית() {
int אלף = 0;
for (int λ = 0; λ < 20; λ++) אלף+=λ;
return אלף;
}
}
Google's Go programming language supports Unicode and works with UTF-8.
It really is difficult to design Unicode support for the future, in a programming language right from the beginning.
Java is one one of the languages that had this designed into the language specification. However, Unicode support in v1.0 of Java is different from v5 and v6 of the Java SDK. This is primarily due to the version of Unicode that the language specification catered to, when the language was originally designed. Java attempts to track changes in the Unicode standard with every major release.
Early implementations of the JLS could claim Unicode support, primarily because Unicode itself supported 65536 characters (v1.0 of Java supported Unicode 1.1, and Java v1.4 supported Unicode 3.0) which was compatible with the 16-bit storage space taken up by characters. That changed with Unicode 3.1 - its an evolving standard, usually with more characters getting added in each release. The characters added later in 3.1 were called supplementary characters. Support for supplementary characters were added in Java 5 via JSR-204; Java 5 and 6 support Unicode 4.0.
Therefore, don't be surprised if different programming languages implement Unicode support differently.
On the other hand, PHP(!!) and Ruby did not have Unicode support built into them during inception.
PS: Support for v5.1 of Unicode is to be made in Java 7.
Java and the .NET languages, as other commenters have pointed out, although Java's strings are UTF-16 rather than UCS or UTF-8. (At the time, it seemed like a sensible idea! Now clearly either UTF-8 or UCS would be better.) And Python 3 is really a different, incompatible language from Python 1.x and 2.x, so it qualifies too.
The Plan9 languages around 1992 were probably the first to do this: their dialect of C, rc, Alef, mk, ACID, and so on, were all Unicode-enabled. They took the very simple approach that anything that wasn't ASCII was an identifier character. See their paper from 1993 on the subject. (This is the project where UTF-8 was invented, which meant they could do this in a pretty compatible way, in particular without plumbing binary-versus-text through all their programs.)
Other languages that support non-ASCII identifiers include current PHP.
Perl 6 has complete unicode support from scratch.
(With the Rakudo Perl 6 compiler being the first implementation)
General overview
Unicode operators
Strings, Regular expressions and grammars all operate based on graphemes, even for those codepoint combination for which there is no composed representation (a composed representation artificial codepoint is generated on the fly for those cases).
A special encoding exists to handle data of unknown encoding "utf8-c8": this assumes utf-8 when possible, but creates artificial codepoints for unencodable sequences, allowing them to roundtrip if necessary.
Python 3.x: http://docs.python.org/dev/3.0/whatsnew/3.0.html
Sometimes, a feature that was included in a language when it was first designed is not always the best.
Languages have changed over time and many have become bloated with extra features, while not necessarily keeping up-to-date with the features it first included.
So I just throw out the idea that you shouldn't necessarily discount languages that have recently added Unicode. They will have the advantage of adding Unicode to an already mature development tool, and getting the chance to do it right the first time.
With that in mind, I want to ensure that Delphi is included here, as one of your answers. Embarcadero added Unicode in their Delphi 2009 version and did a mighty fine job on it. It was enough to finally prompt me to upgrade from the Delphi 4 that I had been using for 10 years.
Java uses characters from the Unicode character set.
java and .net languages