Looking for a gem with normalizers (NFD, NFKD, NFC, NFKC) for JRuby 1.8.2 (native implementation) - unicode

Is there a native gem (so it can be used with JRuby 1.8.2) which implements UTF-8 normalizers (NFD, NFKD, NFC, NFKC)?

Ruby v1.8 is really flaky on Unicode. I find v1.9 the minimal Ruby version for sane processing. Even then, the unicode_utils gem for v1.9.1 or better is absolutely indispensable. It has things like full casemapping and normalization functions. You really do need it.
Unfortunately, it doesn’t include collation, so you can’t do alphabetic sorts in Ruby the way you can in Perl or languages with access to the ICU libraries. Collation is the hardest to get right so it is not surprising that it is missing. But it is critical because it underlies nearly everything we ever do with text. It’s not just about sorting; it’s about simple string comparisons. Most people don’t realize this.
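To make the normalization and string-comparison point concrete, here is a sketch in Python's standard unicodedata module (Python chosen purely for illustration; Ruby 1.8 has no built-in equivalent):

```python
import unicodedata

composed = "caf\u00e9"      # 'é' as one precomposed code point (NFC form)
decomposed = "cafe\u0301"   # 'e' + U+0301 combining acute accent (NFD form)

# The two strings render identically but compare unequal code-point-for-code-point.
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
print(unicodedata.normalize("NFD", composed) == decomposed)  # True

# NFKC/NFKD additionally fold compatibility characters:
print(unicodedata.normalize("NFKC", "\ufb01"))               # U+FB01 ligature -> "fi"
```

This is exactly why normalization underlies even simple string comparison: without it, visually identical strings are unequal.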
I talk a little about Ruby’s Unicode support, and what you can do to make your life easier there, in my third OSCON talk from a couple of weeks ago. I confess that I gave up on Ruby v1.8; it was just too frustrating.
That’s not a knock against Ruby, because the same thing can be said for most languages today that aren’t the latest versions.
You will not be happy with Ruby and Unicode unless you’re running v1.9.
If you aren’t running Python v3 (and preferably v3.2, or better yet v3.3) with a wide build, you will be unhappy in Python with Unicode.
If you aren’t running Java v1.7, you will be unhappy in Java with Unicode — and maybe even then. :(
If you aren’t running Perl v5.14 or better, you will be arguably unhappy in Perl with Unicode.
The situation with those four therefore is quite unlike the one(s) with PHP, Javascript, and Go. With those latter three languages, it doesn’t matter what version you run:
With the first two you will always be unhappy with their Unicode support. This is really really terrible because the people using them can almost never switch to a real language with real Unicode support. The niche is too special-purpose.
Whereas with Go you will never be unhappy with its Unicode support — unless you’re in a hurry: the normalization module is very close to ready and should be out soon, while the collation module is being worked on, but it really is a great deal harder.
Is there any possible way for you to use Ruby v1.9?

Related

A development platform for unicode spell checker?

I have decided to develop a (Unicode) spell checker for my final year project, for a south Asian language. I want to develop it as a plugin or a web service. But I need to decide on a suitable development platform for it. (It will not just check against a dictionary file; morphological analysis/generation modules (a stemmer) will also be used.)
Would JavaScript be able to handle such processing with a fair response time?
Will I be able to process a large dictionary on client side?
Are there any better suggestions that you can make?
Javascript is not up to the task, at least not by itself; its Unicode support is too primitive, and in many parts, actually missing. For example, Javascript has no support for Unicode grapheme clusters.
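The grapheme-cluster gap is easy to demonstrate; a sketch in Python, whose standard library (like JavaScript's) also lacks built-in grapheme segmentation:

```python
# One user-perceived character ("é") built from two code points.
s = "e\u0301"   # 'e' + U+0301 COMBINING ACUTE ACCENT

print(len(s))   # 2 -- the language counts code points, not graphemes
# Reversing by code point detaches the accent and attaches it to nothing:
print(s[::-1] == "\u0301e")   # True -- the "reversed" string is broken
```

Any per-character operation (truncation, reversal, cursor movement) goes wrong like this without real grapheme-cluster support such as Perl's \X or ICU's BreakIterator.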
If you use Java, then make sure you use the ICU libraries so that you can get all the whizbang Unicode properties you’ll need for text segmentation. The place where Java’s native Unicode processing breaks down is in its regex library, which is why Android JNIs over to the ICU C/C++ regex library. There are a lot of NLP tools written for Java, some of which you might find handy. Most of these that I am aware of though are for English or at least Western languages.
If you are willing to run part of your computation server-side via CGI instead of just client-side action, you are no longer bound by language choice. For example, you might combine Javascript on the client with Perl on the server, whose Unicode support is even better than Java’s. How that would meld together and how to get the performance and behavior you would want depends on just what you actually want to do.
Perl also has quite a good number of industry-standard NLP modules widely available for it, most of which already know to use Unicode, since like Java, Perl uses Unicode internally.
A brief slide presentation on using NLP tools in Perl for certain sorts of morphological analysis, namely stemming and lemmatization, is available here. The presentation is known to work under Safari, Firefox, or Chrome, but not so well under Opera or Microsoft’s Internet Explorer.
I am not aware of any tools specifically targeting Asian languages, although Perl does support UAX#11 (East Asian Width) and UAX#14 (Unicode Line Breaking) via the Unicode::LineBreak module from CPAN. Perl also comes with a fully compliant collation module (implementing UTS#10, the Unicode Collation Algorithm) by way of the standard Unicode::Collate module, with locale support available from the also-standard Unicode::Collate::Locale module, where many Asian locales are supported. If you are working with CJK languages, you may want access to the Unihan database, available via the Unicode::Unihan module from CPAN. Even more fundamentally, Perl has native support for Unicode extended grapheme clusters by way of its \X metacharacter in its builtin regex engine, which neither Java nor Javascript provides.
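To see why proper collation matters, here is a crude accent-insensitive sort key sketched in Python; this is emphatically not a UCA implementation like Unicode::Collate or ICU, just an illustration of the problem naive code-point sorting has:

```python
import unicodedata

def accent_insensitive_key(word):
    """Strip combining marks after NFD decomposition -- a crude collation key."""
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # Tie-break on the original spelling so the sort stays deterministic.
    return (base.casefold(), word)

words = ["zebra", "\u00c9clair", "eclair", "apple"]
print(sorted(words))                              # naive: accents/case sort oddly
print(sorted(words, key=accent_insensitive_key))  # ['apple', 'eclair', 'Éclair', 'zebra']
```

A real UCA implementation handles multi-level weights, contractions, and locale tailoring; this sketch only shows why raw code-point order is wrong for alphabetic sorts.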
All this is the sort of thing you are likely to need, and find terribly lacking, in Javascript.

Why should I use Perl instead of Ruby/Python/etc? [closed]

As it currently stands, this question is not a good fit for our Q&A format. We expect answers to be supported by facts, references, or expertise, but this question will likely solicit debate, arguments, polling, or extended discussion. If you feel that this question can be improved and possibly reopened, visit the help center for guidance.
Closed 12 years ago.
I love Ruby and have been using it for a few years to handle day-to-day scripting tasks. Lately however, I've had a number of people tell me that Perl is where it's at. I have nothing against Perl, but it seems like it's kind of fallen behind the times a bit.
However, that's probably just my perception, so I'm asking all of you, what makes Perl so great? I'm genuinely seeking information here; I'd like to understand why this language has such ardent followers.
I know a good handful of hackers who left Perl to go to Ruby. Python is obviously a nice language too. I am neither saying nor implying anything against either.
Pros for Perl 5
Since about 2005 or so, Perl has been undergoing a fairly dramatic renaissance in both CPAN and core releases. Perl 6 has helped drive this by sending concepts like role-based OO back to Perl 5. Strawberry Perl has made Perl hacking on Windows more like *nix.
The CPAN is huge, still growing, and most of the more widely used authors/teams are responsive to bugfixes. Most popular Perl modules are tested widely and well. CPAN testers recently sent their 10 millionth test report.
Many of the big kits have good communities associated where expert help is available quickly.
The tool chain has become very flexible.
The combination of perlbrew, local::lib, and cpanminus lets users (even without root) have an arbitrary number of perl versions and libraries accessible on the same box.
Many of the things that Java, Ruby, and Python do right have come back to Perl, and with facility. For example:
KinoSearch is Lucene but even faster by some benchmarks.
Catalyst is Rails but more flexible. It’s a completely agnostic C with regard to the M and V.
Plack is Python’s WSGI + Ruby’s Rack.
It’s as fast and personal or readable and robust as you want it to be.
A short one-liner can edit every HTML file in your tree when you’re in a hurry to fix something.
A clear and robust program with error reporting, logging, and feedback built on any of the 6 or 7 suitable HTML/XML packages could do the same for a client.
Perlmonks. Though there are notable exceptions, the Perl community is generally friendly, helpful, and positive.
There are quite a few good Perl jobs waiting to be filled. The back and forth between the high level languages has left oodles of Perl in the wild without a matching crop of Perl-centric devs. (I get 5-7 cold calls from recruiters a year.)
It’s fun. In quotes: “Perl has the happiest users.” I can’t speak to the scientific nature of that but I can say I only program today because Perl exists. Many other Perl hackers share this stupid giddiness for the language.
Keep in mind it’s not a zero sum game. The more languages you can wield, the better.
If I had to name one great strength of Perl, it's one word: CPAN.
Having worked with Ruby as well, I'd not say that Perl is necessarily better or worse, but definitely more mature. It is, after all, much older. However, it's not decrepit. It has plenty of modern stuff, e.g., Moose and the 5.10 and 5.12 updates have fixed a lot of problems that the ancient 5.0.x had.
(And if you're wondering: Perl 5 and Perl 6 are different languages. The similar name is an unfortunate mistake. Though Perl 5 does borrow ideas from Perl 6 and vice versa.)
CPAN.
The syntax of Perl is sometimes painful to look at, but it is available on Unix machines everywhere. With command-line access to the huge number of packages in CPAN (which can also be browsed on the web), Perl is the de facto standard because of its broad applicability and availability.
These days, IMO the main reason to use Perl is that you can be pretty confident that just about any UNIX system will have it available, even on the sparser commercial UNIX distros.
Also, it has some features that make it work very conveniently with the UNIX shell and filesystem. Perl one-liners are convenient in shell scripting when you need a little more power.
If you're not on a UNIX machine then there's probably little advantage over more modern scripting languages.
First of all, I love Python and Ruby as well. In fact, I think anything you can do in any one of the three languages you can do in the others just as easily.
CPAN however is a big advantage. There are not many times I find myself looking for a specific general functionality and not finding a module for it.
The greatest thing for me, however, is that I can do absolutely everything I want, quickly, and in 10 different ways if I like — but maybe that's just because Perl is my 'mother tongue'.
Anyway, I think it depends on what you want to do. If you want to create a scalable website or web application with all the plumbing (authentication, authorization, session tracking, database ORM, etc, etc) taken care of, it can be done in Perl, but the hassle is not worth it. Go with Python (Django) or Ruby (Rails 3.0 rocks) then.
Good luck, and watch out for setting off flamewars with this subject; this kind of stuff can seriously get you hurt ;)
Rob

Parrot - Can I use it? And how?

I've had an eye on Parrot for a little while. I understand that it's a virtual machine designed to run a variety of dynamic languages, and to allow a degree of interoperability (e.g. calling a Perl function from a Python script). So, my question is in two parts:
Is Parrot ready for general use? I can see releases are being made, but it's not obvious whether it's something that works well now, or still a work in progress.
Is there any documentation on how to use Parrot? I've looked at links in previous questions, but all the documentation I can find talks about the various levels of Parrot-specific code (PIR, PASM, etc.), or how to add support for more languages. That's great, but how do I run code in existing languages on Parrot? And how do I use code written in another language?
Finally, I don't want to start a flamewar, but I know Parrot is tied up with Perl 6. I prefer Python. I understand Python is a supported language, but realistically, is it perceived as a multi-language VM, or is it a Perl 6 interpreter with other languages included as curiosities?
I'm a Python developer primarily, so I'm biased. But probably in the same direction as you.
Parrot is intended to be a multi-language VM. Its Perl roots show sometimes ("0" is false, and the bootstrapping language NQP is a subset of Perl), but at the runtime level it's quite language-agnostic.
That said, interop between languages won't be entirely seamless. For example, the String type will most likely be used as a base by all languages, but a Ruby object will probably need wrappers (but not proxies) to act pythonic. There's no story for object interop, at least not so far.
The Python 3 compiler "Pynie" has quite a way to go. Here's the repo: http://bitbucket.org/allison/pynie. Maybe you'd like to help out? Right now it's quite young; it doesn't even have objects yet.
And to answer your actual question:
Sort of. It's not fast and the languages that target it aren't complete, but it won't crash or corrupt your memory.
Normally, you write code in your favourite high-level language (Python) and compile your .py code to Parrot (and from there, you can compile it to native code if you want to). Another dev can write their Perl(6) code and compile it to Parrot, and, if the compilers have been written with interop in mind, you'll be able to call a Perl function from Python.
It is still work in progress, but it's mature enough for language implementors and library developers. Caveat: some subsystems are getting reworked (garbage collection, embedding), so there might be some bumps on the road ahead.
Each language needs a compiler that generates code Parrot understands. These compilers are released separately. (see http://trac.parrot.org/parrot/wiki/Languages )
Most languages targeting Parrot are in an early, incomplete state, so interoperability isn't a big issue right now. Parrot isn't a Perl 6 interpreter, but Rakudo Perl 6 happens to be one of the most heavily developed compilers that targets Parrot.

When generating code, what language should you generate?

I've worked on a number of products that make use of code generation. It seems to be the only way to achieve both a high degree of user-customizability and high execution speed.
The downside is that we are requiring users to install a compiler (primarily on MS Windows).
This has been an on-going headache, because vendors like MS keep obsoleting compilers, and some users tend to have more than one compiler installed.
We're considering using GNU C, and possibly C++, but even there, there are continual version issues.
I've considered possibly generating assembly language, in an effort to get off the compiler-version-treadmill, but assembly languages are all machine-specific.
Ideally there would be some way to produce generated code that would be flexible, run fast, and not expose us to the whims of third-party providers.
Maybe I'm overlooking something simple, like Java. Any ideas would be appreciated. Thanks.
If you're considering C and even assembler, take a look at LLVM first: http://llvm.org
I might be missing some context here, but could you just pin yourself to a specific version? E.g., .NET 2.0 can be installed side by side with .NET 1.1 and .NET 3.5, as well as other versions that will come out in the future. So as long as your code makes use of a specific version of a compiler, what's the problem?
I've considered possibly generating assembly language, in an effort to get off the compiler-version-treadmill, but assembly languages are all machine-specific.
That would be called a compiler :)
Why don't you stick to C90?
I haven't heard of many severe standards violations on gcc's part, as long as you don't use extensions.
And you can always distribute a certain version of gcc along with your product, say, 4.3.2, giving an option to users to use their own compiler at their own risk.
As long as all code is generated by you (i.e., you don't embed your instructions into others' code), there shouldn't be any problems in testing against this version and using it to compile your libraries.
If you want to generate assembly language code, you may take a look at asmjit.
One option would be to use a language/environment that provides access to the compiler in code; For example, here is a C# example.
Why not ship a GNU C compiler with your code generator? That way you have no version issues, and the client can constantly generate code that is usable.
It sounds like you're looking for LLVM.
Start here: The Code Generation conference
In the spirit of "it might not be too late to add my 2 cents", as in #Alvin's answer's case, here is something I'd think about: if your application is meant to last for some years, it is going to face several changes in how applications and systems work.
For instance, suppose you were thinking about this 10 years ago. I was watching Dexter back then, but I guess you actually have memories of how things were at that time. From what I can tell, multithreading was not much of an issue to developers in 2000, and now it is. So Moore's law broke for them. Before that, people didn't even care what would happen in "Y2K".
Speaking of Moore's law, processors are indeed getting quite fast, so maybe certain optimizations won't even be necessary. And possibly the array of optimizations will be much bigger: some processors are getting optimizations for several server-centric tasks (XML, cryptography, compression, and regex! I am surprised such things can get done on a chip) and also use less energy (which is probably very important for warfare hardware...).
My point is that focusing on what exists today as a platform for tomorrow is not a good idea. Make it work today, and it will likely keep working tomorrow (backward compatibility is especially valued by Microsoft, Apple is not bad at it, it seems, and Linux is very liberal about letting you make things work as you want).
There is, yes, one thing you can do: attach your technology to something that just won't (likely) die, such as JavaScript. I'm serious: JavaScript VMs are getting terribly efficient nowadays and are only going to get better, plus everyone loves it, so it's not going to disappear suddenly. If you need more efficiency/features, maybe target the CLR or JVM?
Also, I believe multithreading will become more and more of an issue. I have a gut feeling the number of processor cores will follow a Moore's law of its own. And architectures are more than likely to change, from the looks of the cloud buzz.
PS: In any case, I believe C optimizations of the past are still quite valid under modern compilers!
I would stick to the language you are writing the generator in. You can generate and compile Java code in Java, Python code in Python, C# in C#, and even Lisp in Lisp, etc.
But it is not clear whether such languages are sufficiently fast for you. For top speed I would choose to generate C++ and use GCC for compilation.
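The generate-in-the-same-language approach is easy to sketch in Python, which can compile generated source at runtime (the polynomial example is purely illustrative):

```python
# Generate a specialized function as source text, then compile and run it.
coeffs = [3, 2, 1]  # represents 3*x**2 + 2*x + 1
terms = " + ".join(
    f"{c}*x**{len(coeffs) - 1 - i}" for i, c in enumerate(coeffs)
)
src = f"def poly(x):\n    return {terms}\n"

namespace = {}
exec(compile(src, "<generated>", "exec"), namespace)
print(namespace["poly"](2))   # 3*4 + 2*2 + 1 = 17
```

No external compiler to install, no version treadmill: the runtime that executes the generator also executes the generated code. The trade-off is that you stay at the host language's speed instead of native-code speed.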
Why not use something like SpiderMonkey or Rhino (JavaScript support in Java or C++). You can export your objects to JavaScript namespaces, and your users don't have to compile anything.
Embed an interpreter for a language like Lua/Scheme into your program, and generate code in that language.

What exactly is Parrot?

I understand that Parrot is a virtual machine, but I feel like I'm not completely grasping the idea behind it.
As I understand, it's a virtual machine that's being made to handle multiple languages. Is this correct?
What are the advantages of using a virtual machine instead of just an interpreter?
What specifically is Parrot doing that makes it such a big deal?
Parrot is a virtual machine specifically designed to handle several languages, especially the dynamic languages. Beyond some of the interesting technology involved, since it can handle more than one language, it will be able to cross language boundaries. For instance, once it can compile Ruby, Perl, and Python, it should be easy to cross those boundaries to let me use a Ruby library from Perl, a Perl library from Python, or whatever combination I like.
Parrot started in the Perl world and many of the people working on it are experienced Perl people. Instead of using the current Perl interpreter, which is showing its age, Parrot allows Perl to have features such as distributable pre-compiled modules (which everyone else has had for a long time) and a smarter garbage collector.
Chris covered the user-facing differences, so I'll cover the other side.
Parrot is register-based rather than stack-based. What that means is that compiler developers can more easily optimize the way in which the registers should be allocated for a given piece of code. In addition, the compilation from Parrot bytecode to machine code can, in theory, be faster than for stack-based code, since real CPUs are register-based and there is a lot more experience optimizing for them.
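The stack-vs-register distinction can be sketched with two toy interpreters in Python (purely illustrative, nothing Parrot-specific):

```python
# Stack machine: operands are implicit, taken from the top of the stack.
def run_stack(program):
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
    return stack[-1]

# Register machine: every instruction names its registers explicitly,
# which gives an optimizer (and a register allocator) more to work with.
def run_reg(program, nregs=4):
    regs = [0] * nregs
    for op, *args in program:
        if op == "load":
            regs[args[0]] = args[1]
        elif op == "add":
            dst, a, b = args
            regs[dst] = regs[a] + regs[b]
    return regs

print(run_stack([("push", 2), ("push", 3), ("add",)]))                  # 5
print(run_reg([("load", 0, 2), ("load", 1, 3), ("add", 2, 0, 1)])[2])   # 5
```

Both compute 2 + 3, but in the register form the data flow is visible in the instructions themselves, which is what makes register-based bytecode easier to optimize and map onto real hardware.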
Parrot is a bytecode interpreter (possibly with a JIT at a future stage). Think Java and its virtual machine, except that Java is (at the moment) more geared towards static languages, and Parrot is geared towards dynamic languages from the beginning.
Also see Cody's excellent answer! Highly recommended.
Others have given excellent answers, so what remains for me is to explain what "dynamic" languages actually mean.
In the context of a virtual machine it means that the type of a variable is not known at compile time. In "static" languages the type (or at least a parent class of it) is known at compile time, and many optimizations build on that knowledge.
On the other hand in dynamic languages you might know if a variable holds a container type (like an array) or a scalar (string, number, ...), but you have much less type information at compile time.
Another characteristic is that dynamic languages usually make type conversions much easier; for example, in Perl and JavaScript, if you use a string as a number, it is automatically converted to a number.
Parrot is designed to make such operations easy and fast, and to allow optimizations without having type information at compile time.
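What "dynamic" means in practice is easy to see in Python, where the same code runs against whatever types arrive at runtime:

```python
def double(x):
    # No compile-time type: '+' is resolved per call, at runtime.
    return x + x

print(double(21))        # 42
print(double("ab"))      # 'abab'
print(double([1, 2]))    # [1, 2, 1, 2]

# Python requires the string-to-number conversion to be explicit,
# where Perl and JavaScript would coerce automatically:
print(int("21") + 21)    # 42
```

A VM for such languages cannot bake the operand types into the bytecode; it must dispatch on them at runtime, which is exactly the case Parrot is built for.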
Here is The Official Parrot Wiki.
You can find lots of info and links there.
The bottom of the Parrot wiki home page also displays the latest headlines from the Planet Parrot feed aggregator.
In addition to the VM, the Parrot project is building a very powerful tool chain to make it easier to port existing languages, or develop new ones.
The Parrot VM will also provide other languages under-the-covers support for many powerful new Perl 6 features (please see the Official Perl 6 Wiki for more Perl 6 info).
Parrot will provide interoperability between modules of differing languages, so that, for example, other languages can take advantage of what will become the huge Perl 6 version of CPAN (the vast Perl 5 module archive, which Perl 6 will be able to access via the forthcoming Perl 5.12).
Honestly, I didn't know it was that big of a deal. It has come a long way, but just isn't seeing much use. The main target language has yet to really arrive, and has lost a huge amount of mind-share among industry professionals. Meanwhile, other solutions like .Net and projects like Jython show us that the here-and-now can beat out any perceived hype.
Parrot will be what Java aimed for but never achieved: a VM for all OSes and platforms.
Parrot will implement the ideas behind Microsoft's Common Language Runtime for any dynamic language, and will be truly cross-platform.
On top of everything Parrot is and will be free and open source
Parrot will become the de facto standard for open source programming with dynamic languages