A development platform for unicode spell checker? - unicode

I have decided to develop a (Unicode) spell checker for a South Asian language as my final year project. I want to develop it as a plugin or a web service, but I need to decide on a suitable development platform for it. (It will not just check against a dictionary file; morphological analysis / generation modules (a stemmer) will also be used.)
Would JavaScript be able to handle such processing with a fair response time?
Will I be able to process a large dictionary on the client side?
Are there any better suggestions that you can make?

Javascript is not up to the task, at least not by itself; its Unicode support is too primitive, and in many parts, actually missing. For example, Javascript has no support for Unicode grapheme clusters.
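To make the grapheme-cluster point concrete, here is a rough sketch in Python (chosen only because it is compact). The helper name rough_clusters is my own, and the grouping rule (attach every combining mark to the preceding character) is a simplification I am assuming for illustration; it is not the full UAX #29 algorithm. It shows why counting code points is not the same as counting what a user sees as characters, which matters for a spell checker working on an Indic script.

```python
import unicodedata


def rough_clusters(text):
    """Group each base character with the combining marks that follow it.

    This only approximates UAX #29 grapheme clusters: it ignores Hangul
    syllables, emoji sequences, ZWJ, and similar rules.
    """
    clusters = []
    for ch in text:
        # General categories Mn, Mc and Me are all combining marks.
        if clusters and unicodedata.category(ch).startswith("M"):
            clusters[-1] += ch
        else:
            clusters.append(ch)
    return clusters


# Devanagari "namaste": NA, MA, SA, VIRAMA, TA, VOWEL SIGN E
word = "\u0928\u092e\u0938\u094d\u0924\u0947"
print(len(word))                  # number of code points
print(len(rough_clusters(word)))  # number of approximate clusters
```

The two counts differ because the virama and the vowel sign are separate code points that attach to the consonant before them.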
If you use Java, then make sure you use the ICU libraries so that you can get all the whizbang Unicode properties you’ll need for text segmentation. The place where Java’s native Unicode processing breaks down is in its regex library, which is why Android JNIs over to the ICU C/C++ regex library. There are a lot of NLP tools written for Java, some of which you might find handy. Most of these that I am aware of though are for English or at least Western languages.
If you are willing to run part of your computation server-side via CGI instead of just client-side action, you are no longer bound by language choice. For example, you might combine Javascript on the client with Perl on the server, whose Unicode support is even better than Java’s. How that would meld together and how to get the performance and behavior you would want depends on just what you actually want to do.
Perl also has quite a good number of industry-standard NLP modules widely available for it, most of which already know to use Unicode, since like Java, Perl uses Unicode internally.
A brief slide presentation on using NLP tools in Perl for certain sorts of morphological analysis, namely stemming and lemmatization, is available here. The presentation is known to work under Safari, Firefox, or Chrome, but not so well under Opera or Microsoft’s Internet Explorer.
I am not aware of any tools specifically targeting Asian languages, although Perl does support UAX#11 (East Asian Width) and UAX#14 (Unicode Linebreaking) via the Unicode::LineBreak module from CPAN, and Perl does come with a fully-compliant collation module (implementing UTS#10, the Unicode Collation Algorithm) by way of the standard Unicode::Collate module, with locale support available from the also-standard Unicode::Collate::Locale module, where many Asian locales are supported. If you are using CJK languages, you may want access to the Unihan database, available via the Unicode::Unihan module from CPAN. Even more fundamentally, Perl has native support for Unicode extended grapheme clusters by way of its \X metacharacter in its builtin regex engine, which neither Java nor Javascript provides.
All this is the sort of thing you are likely to need, and find terribly lacking, in Javascript.

Related

Do interpreted languages need an operating system to work?

Do interpreted languages such as Java and Python need an operating system to work?
For example, on a bare-metal ARM microcontroller, can an interpreter be installed such that we can have both compiled code such as C, and interpreted code such as Python working together, Or is an OS needed to support this?
Of course you can write an interpreter that runs on bare-metal, it is just that if the platform does not have an OS any run-time support the language needs must be part of the interpreter. To the extent in some cases that such an interpreter might essentially be an OS. That is if it provides the services to operate a system, it could be called an operating system.
It is perhaps not as simple as interpreted vs compiled. Java, for example, runs on a virtual machine and is "compiled" to bytecode. The bytecode is interpreted (or just-in-time compiled in some cases), rather than the Java source directly. In an embedded system, it is possible that you would deploy cross-compiled bytecode on the target rather than the source. Certainly, however, JVMs exist for bare-metal. Some support multi-threading through a third party RTOS, others either have that support built-in or do not support threading at all.
There are interpreters for cut-down subsets of JavaScript and Python that run on bare-metal microcontrollers. I am not sure about full implementations, but it is technically possible given sufficient run-time support even if not explicitly implemented. To fully support some of these languages along with all the standard and third-party libraries and frameworks a developer might expect may require so much run-time support and resource that it is simpler to deploy an OS, so implementations for resource constrained systems are often subsets or have restricted libraries.
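The same compile-then-interpret split can be seen in miniature in CPython, which I am using here purely as an analogy (this is not how a JVM is driven). The built-in compile() turns source text into a code object holding bytecode, and exec() then runs that bytecode without looking at the source again.

```python
# Step 1: compile source text to a code object (bytecode) once.
source = "result = 2 + 3"
code = compile(source, "<demo>", "exec")

# The compiled form is a sequence of bytes, not source text.
print(type(code.co_code))

# Step 2: the virtual machine executes the bytecode; the source text
# is not consulted again.
namespace = {}
exec(code, namespace)
print(namespace["result"])
```

On a constrained target you would typically do step 1 on the development machine and ship only the output of that step.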
Java needs a VM - virtual machine. It isn't interpreted, but executes byte code. Interpreted would mean grabbing the source in run-time as it goes, like BASIC.
When Java was new and exciting around year 2000, everyone thought it would be the next big general-purpose language, replacing C++. The syntax was so clean, it was "pure OO" and not some "filthy hybrid".
It was the major buzz word of the time. Schools stopped teaching C and C++. MCU manufacturers started to make chips with Java VM in hardware. Microsoft made their own Java "standard". Everyone was high on the Java hype.
Then as the Internet hype as a whole collapsed in 2002, it took the Java hype with it. In the sober hang-over afterwards, people started to realize that things like byte code, VMs and garbage collection probably don't belong on bare metal systems.
They went back to using compiled C for hardware-related programming. Or in fact they never stopped, since Java never quite made it there, save for some oddball exotic architectures.
Java remained in use only in the areas where it was suitable, namely web, desktop and mobile development. And so it got a second golden age when the smart phone hype struck around 2010.
No. See for example picoJava, which is one of several solutions for running Java natively. You can't get closer to bare metal than running bytecode on the CPU.
No. Some 8-bit computers had interpreted languages in ROM despite not having anything reasonably resembling a modern operating system. The Apple 2 is one example. You could boot the system without any disks or tapes, and it would go straight to a BASIC prompt, where you could write basic (no pun intended) programs.
Note that "operating system" is a somewhat vague term when speaking about those days - these 8-bit computers did have some level of firmware, and this firmware did provide some OS-type functionality like access to basic peripherals. In those days, what we now know as an OS was more commonly called a "DOS" - a Disk Operating System. MS-DOS is one of them, as well as Apple's ProDOS. These DOS's evolved into our modern-day operating systems (e.g. Windows 95 was based on top of MS-DOS, while modern Windows versions derive from a separate branch that was largely re-implemented with more modern techniques), so one could claim that their ancestors are the closest they had to what we now call an OS.
But what is an interpreter but a piece of software?
In a more theoretical sense, an interpreter is simply software - a program that takes input and produces output. Suppose you were to implement a custom solid-state Turing Machine. In this case, your "input" would be the program to be interpreted, and the "output" would be the program's behavior. If "software" can run without an operating system, then an interpreter can.
Is this model a little simplified? Of course. The difference is a matter of degree, not nature. Add very basic user input and output capabilities (e.g. a TTY) and you have the foundation to implement all, or nearly all, of the basic functionality of a language such as Java byte code, Python, or BASIC. The main things you would be missing are libraries and whatnot that depend on things like screen manipulation, multiprocessing, and networking, but you could handle them with time too.

Looking for gem with normalisers (NFD, NFKD, NFC, NFKC) for jruby 1.8.2 (native implementation)

Is there a native gem (so it can be used for jruby 1.8.2) which implements UTF8 normalizers (NFD, NFKD, NFC, NFKC)?
Ruby v1.8 is really flaky on Unicode. I find v1.9 the minimal Ruby version for sane processing. Even then, the unicode_utils gem for v1.9.1 or better is absolutely indispensable. It has things like full casemapping and normalization functions. You really do need it.
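For anyone unsure what the four normalization forms actually do, here is a small sketch. It is in Python rather than Ruby, only because Python ships unicodedata.normalize in its standard library; it says nothing about the unicode_utils API. The sample characters are my own choice.

```python
import unicodedata

composed = "\u00e9"      # LATIN SMALL LETTER E WITH ACUTE
decomposed = "e\u0301"   # "e" followed by COMBINING ACUTE ACCENT
ligature = "\ufb01"      # LATIN SMALL LIGATURE FI

# Canonical forms: NFD decomposes, NFC recomposes.
assert unicodedata.normalize("NFD", composed) == decomposed
assert unicodedata.normalize("NFC", decomposed) == composed

# Compatibility forms additionally fold characters such as ligatures.
assert unicodedata.normalize("NFC", ligature) == ligature
assert unicodedata.normalize("NFKC", ligature) == "fi"
assert unicodedata.normalize("NFKD", composed + ligature) == "e\u0301fi"

# Canonically equivalent strings only compare equal after normalization.
print(composed == decomposed)
print(unicodedata.normalize("NFC", composed)
      == unicodedata.normalize("NFC", decomposed))
```

The last two lines are the practical reason to care: without normalization, two strings that render identically can fail a plain equality test.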
Unfortunately, it doesn’t include collation, so you can’t do alphabetic sorts in Ruby the way you can in Perl or languages with access to the ICU libraries. Collation is the hardest to get right so it is not surprising that it is missing. But it is critical because it underlies nearly everything we ever do with text. It’s not just about sorting; it’s about simple string comparisons. Most people don’t realize this.
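A quick way to see why collation matters is to sort a few words by raw code point. The sketch below is in Python for brevity; the word list and the rough_key helper are hypothetical, and the key is a crude stand-in I am assuming for illustration (strip combining marks, fold case). It is nowhere near a real UTS #10 implementation, which handles multiple comparison levels, tailoring per locale, and much more.

```python
import unicodedata

words = ["zebra", "Apple", "\u00e9clair", "apple"]

# Plain sorting compares code points: uppercase first, accented letters last.
by_code_point = sorted(words)


def rough_key(word):
    """Crude sort key: drop combining marks and fold case.

    An assumption-laden stand-in for real collation, not UTS #10.
    """
    decomposed = unicodedata.normalize("NFD", word)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return (base.casefold(), word)


by_rough_key = sorted(words, key=rough_key)
print(by_code_point)
print(by_rough_key)
```

With code point order the accented word lands after "zebra"; with the rough key it sorts next to the other words starting with "e".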
I talk about Ruby’s Unicode support and what you can do to make your life easier there a little in my third OSCON talk from a couple weeks ago. I confess that I gave up on Ruby v1.8; it was just too too frustrating.
That’s not a knock against Ruby, because the same thing can be said for most languages today that aren’t the latest versions.
You will not be happy with Ruby and Unicode unless you’re running v1.9.
If you aren’t running Python v3 (and preferably v3.2 or probably v3.3) with a wide build, you will be unhappy in Python with Unicode.
If you aren’t running Java v1.7, you will be unhappy in Java with Unicode — and maybe even then. :(
If you aren’t running Perl v5.14 or better, you will be arguably unhappy in Perl with Unicode.
The situation with those four therefore is quite unlike the one(s) with PHP, Javascript, and Go. With those latter three languages, it doesn’t matter what version you run, because
With the first two you will always be unhappy with their Unicode support. This is really really terrible because the people using them can almost never switch to a real language with real Unicode support. The niche is too special-purpose.
Whereas with Go you will never be unhappy with its Unicode support, unless you’re in a hurry: the normalization module is very close to ready and may be out already, while the collation module is being worked on, but it really is a great deal harder.
Is there any possible way for you to use Ruby v1.9?

Parrot - Can I use it? And how?

I've had an eye on Parrot for a little while. I understand that it's a virtual machine designed to run a variety of dynamic languages, and to allow a degree of interoperability (e.g. calling a Perl function from a Python script). So, my question is in two parts:
Is Parrot ready for general use? I can see releases are being made, but it's not obvious whether it's something that works well now, or still a work in progress.
Is there any documentation on how to use Parrot? I've looked at links in previous questions, but all the documentation I can find talks about the various levels of Parrot-specific code (PIR, PASM, etc.), or how to add support for more languages. That's great, but how do I run code in existing languages on Parrot? And how do I use code written in another language?
Finally, I don't want to start a flamewar, but I know Parrot is tied up with Perl 6. I prefer Python. I understand Python is a supported language, but realistically, is it perceived as a multi-language VM, or is it a Perl 6 interpreter with other languages included as curiosities?
I'm a Python developer primarily, so I'm biased. But probably in the same direction as you.
Parrot is intended to be a multi-language VM. Its Perl roots show sometimes ("0" is false, the bootstrapping language NQP is a subset of Perl), but at the runtime level it's quite language-agnostic.
That said, interop between languages won't be entirely seamless. For example, the String type will most likely be used as a base by all languages, but a Ruby object will probably need wrappers (but not proxies) to act pythonic. There's no story for object interop, at least not so far.
The Python 3 compiler "Pynie" has quite a way to go. Here's the repo http://bitbucket.org/allison/pynie. Maybe you'd like to help out? Right now it's quite young, not even objects yet.
And to answer your actual question:
Sort of. It's not fast and the languages that target it aren't complete, but it won't crash or corrupt your memory.
Normally, you write code in your favourite High Level Language (Python) and compile your .py code to Parrot (and from there, you can compile it to native code if you want to). Another dev can write their Perl(6) code and compile it to Parrot and, if the compilers have been written with interop in mind, you'll be able to call a Perl function from Python.
It is still work in progress, but it's mature enough for language implementors and library developers. Caveat: some subsystems are getting reworked (garbage collection, embedding), so there might be some bumps on the road ahead.
Each language needs a compiler that generates code Parrot understands. These compilers are released separately. (see http://trac.parrot.org/parrot/wiki/Languages )
Most languages targeting Parrot are in an early incomplete state, so interoperability isn't a big issue right now. Parrot isn't a Perl 6 interpreter, but Rakudo Perl 6 happens to be one of the most heavily developed compilers that target Parrot.

What languages can be used to make dynamic websites?

So, there are several languages which will allow you to create a website, as long as you configure the server(s) well enough.
To my knowledge, there is:
PHP
ASP.NET
Ruby (on Rails, what is that all about?)
And thusly, my knowledge is limited. Ruby and ASP, I've only heard of, never worked with. If there are other languages, I suppose they have some way to make files containing the needed html. It would then suffice to add a line to the Apache config to associate the file-extension.
And if other languages: are there any notable characteristics about the one(s) you mention?
ANY language can be used to make a dynamic website - you could do it in COBOL or FORTRAN if you were twisted enough. Back in the olden days (about 10 years ago) most dynamic websites were done with CGI scripts - all you needed was a program that could read data from standard input and write data (usually HTML) to standard output.
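To show how little a CGI script needs, here is a minimal sketch in Python. The function name build_response and the "name" query parameter are made up for this example. It reads the query string from the QUERY_STRING environment variable (which is how CGI passes GET parameters; POST bodies arrive on standard input) and writes headers, a blank line, and HTML to standard output.

```python
#!/usr/bin/env python3
import html
import os
import sys
from urllib.parse import parse_qs


def build_response(query_string):
    """Return a complete CGI response: headers, blank line, then HTML."""
    params = parse_qs(query_string)
    # "name" is a hypothetical query parameter chosen for this example.
    name = params.get("name", ["world"])[0]
    body = "<html><body><h1>Hello, {}!</h1></body></html>".format(
        html.escape(name))
    return "Content-Type: text/html; charset=utf-8\r\n\r\n" + body


if __name__ == "__main__":
    # The web server passes the query string in an environment variable
    # and sends whatever the script writes to standard output to the client.
    sys.stdout.write(build_response(os.environ.get("QUERY_STRING", "")))
```

Any language that can read an environment variable and print text can do the same, which is the whole point of the answer above.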
Most modern languages have libraries and frameworks to make it easier. As well as the languages you have already mentioned, Java, C# and Python are probably the most common in use today.
Typically a web framework will have:
a way of mapping URLs to a class or function to handle the request
a mechanism for extracting data from a request and converting it into an easy to use form
a template system to easily create HTML by filling in the blanks
an easy way to access a database, such as an ORM
mechanisms to handle caching, redirections, errors etc
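As a toy illustration of the first few items in that list, here is a sketch of a tiny WSGI application in Python using only the standard library. The routes, handler names and template are all hypothetical; a real framework does far more (request parsing, database access, caching), so treat this as the bare skeleton only.

```python
from string import Template

# Template system: fill in the blanks of a fixed HTML page.
PAGE = Template("<html><body><h1>$title</h1><p>$message</p></body></html>")


def home(environ):
    return PAGE.substitute(title="Home", message="Welcome")


def about(environ):
    return PAGE.substitute(title="About", message="A toy example")


# URL mapping: path -> handler function (hypothetical routes).
ROUTES = {"/": home, "/about": about}


def app(environ, start_response):
    """Minimal WSGI application: route the request, render, handle errors."""
    handler = ROUTES.get(environ.get("PATH_INFO", "/"))
    if handler is None:
        start_response("404 Not Found",
                       [("Content-Type", "text/plain; charset=utf-8")])
        return [b"Not found"]
    body = handler(environ).encode("utf-8")
    start_response("200 OK", [("Content-Type", "text/html; charset=utf-8")])
    return [body]
```

To try it locally, the standard library's wsgiref.simple_server.make_server("", 8000, app).serve_forever() should be enough.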
You can find a comparison of popular web frameworks on wikipedia.
How can you forget Java ? :)
Python
It runs on Windows, Linux/Unix, Mac OS X, and has been ported to the Java and .NET virtual machines.
Python is a perfect scripting language for web applications, e.g. via mod_python for the Apache web server. With Web Server Gateway Interface a standard API has been developed to facilitate these applications. Web application frameworks or application servers like Django, Pylons, TurboGears, web2py and Zope support developers in the design and maintenance of complex applications. Around libraries like NumPy, Scipy and Matplotlib, Python is a standard in scientific computing.
Among the users of Python are YouTube and the original BitTorrent client. Large organizations that make use of Python include Google, Yahoo!, CERN, NASA, and ITA.
This could be of interest to you.
Through CGI, virtually any programming language that produces output may be used for web page generation.
Basically, you can use any language (if you are hosting your own server)
Very closely related and very interesting is this article where LISP has been used to build a very successful website.
Python has a 3rd party module CherryPy which can be used with or without a http server.
Amongst others: Erlang (YAWS, Mochiweb), Python
JSP has the advantage that it automatically wraps your code in a servlet, compiles that to bytecodes, then uses the just-in-time Java compiler to recompile critical sections into native object code. I am not aware of any alternative that optimizes your work automatically in this way.
Also allows you to develop and deploy on any combination of Windows, Mac OS X, or Linux.
If you'd like to choose one for the beginning, you should check out PHP first. It gives you the basic clues about how dynamic sites work in general.
After you've become familiar with the basics, I recommend ASP.NET.
First off, you should know that ASP.NET is a technology and not a language. (It actually supports any language that can be used on the .NET platform.) Also it is not to be confused with classic ASP. (The old ASP was much more like PHP.)
ASP.NET is very easy to begin with, and after you have some clues about its concepts, you can always dig deeper and customize everything in it. The http://asp.net site is a very good starting point, if you are to learn it. I think it is really worth the effort, because even if you choose not to stick to it, it will give you some interesting ideas and concepts.
Its most important advantages are:
The code is compiled (and NOT interpreted like PHP), and it has very good performance. (In a performance comparison, it is 10-15 times faster. http://www.misfitgeek.com/pages/Perf_Stat_0809.htm)
It can be run on Windows without effort, and on Linux / Mac / etc using the Mono project.
It implements the Separation of Concerns principle very well.
It has most of the general functionality you'll need built-in. (Such as membership, roles, database management, and so on.)

What exactly is Parrot?

I understand that Parrot is a virtual machine, but I feel like I'm not completely grasping the idea behind it.
As I understand, it's a virtual machine that's being made to handle multiple languages. Is this correct?
What are the advantages of using a virtual machine instead of just an interpreter?
What specifically is Parrot doing that makes it such a big deal?
Parrot is a virtual machine specifically designed to handle several languages, especially the dynamic languages. Apart from the interesting technology involved, the fact that it can handle more than one language means it will be able to cross language boundaries. For instance, once it can compile Ruby, Perl, and Python, it should be easy to cross those boundaries to let me use a Ruby library in Python, a Perl library from Python, or whatever combination I like.
Parrot started in the Perl world and many of the people working on it are experienced Perl people. Instead of using the current Perl interpreter, which is showing its age, Parrot allows Perl to have features such as distributable pre-compiled modules (which everyone else has had for a long time) and a smarter garbage collector.
Chris covered the user-facing differences, so I'll cover the other side.
Parrot is register-based rather than stack-based. What that means is that compiler developers can more easily optimize the way in which the registers should be allocated for a given piece of code. In addition, the compilation from Parrot bytecode to machine code can, in theory, be faster than stack-based code since we run register-based systems and have a lot more experience optimizing for them.
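For readers who have not met the distinction, here is a toy sketch in Python of the two styles. The instruction names and encodings are invented for this example and have nothing to do with real Parrot or JVM opcodes. Both programs compute (2 + 3) * 4; the difference is whether operands are implicit (top of stack) or named explicitly (register numbers).

```python
def run_stack(program):
    """Stack machine: operands are pushed and popped implicitly."""
    stack = []
    for op, *args in program:
        if op == "push":
            stack.append(args[0])
        elif op == "add":
            b, a = stack.pop(), stack.pop()
            stack.append(a + b)
        elif op == "mul":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
    return stack.pop()


def run_register(program, num_registers=4):
    """Register machine: every instruction names its operands explicitly."""
    regs = [0] * num_registers
    for op, dst, x, y in program:
        if op == "set":
            regs[dst] = x            # load a constant into a register
        elif op == "add":
            regs[dst] = regs[x] + regs[y]
        elif op == "mul":
            regs[dst] = regs[x] * regs[y]
    return regs[0]                   # by convention here, r0 holds the result


# Both programs compute (2 + 3) * 4.
stack_program = [("push", 2), ("push", 3), ("add",), ("push", 4), ("mul",)]
register_program = [
    ("set", 1, 2, None),   # r1 = 2
    ("set", 2, 3, None),   # r2 = 3
    ("add", 1, 1, 2),      # r1 = r1 + r2
    ("set", 2, 4, None),   # r2 = 4
    ("mul", 0, 1, 2),      # r0 = r1 * r2
]
print(run_stack(stack_program), run_register(register_program))
```

Because the register form spells out where each value lives, a compiler or JIT has explicit information to work with when mapping values onto real hardware registers.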
Parrot is a bytecode interpreter (possibly with a JIT at a future stage). Think Java and its virtual machine, except that Java is (at the moment) more geared towards static languages, and Parrot is geared towards dynamic languages from the beginning.
Also see Cody's excellent answer! Highly recommended.
Others have given excellent answers, so what remains for me is to explain what "dynamic" languages actually mean.
In the context of a virtual machine it means that the type of a variable is not known at compile time. In "static" languages the type (or at least a parent class of it) is known at compile time, and many optimizations build on that knowledge.
On the other hand in dynamic languages you might know if a variable holds a container type (like an array) or a scalar (string, number, ...), but you have much less type information at compile time.
Another characteristic is that dynamic languages usually make type conversions much easier, for example in perl and javascript if you use a string as a number, it is automatically converted to a number.
Parrot is designed to make such operations easy and fast, and to allow optimizations without having type information at compile time.
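Here is a small sketch of the kind of run-time work this implies. Python itself does not convert strings to numbers automatically, so the helpers to_number and dynamic_add below are hypothetical and only loosely imitate the Perl/JavaScript behaviour described above (for instance, they raise an error on non-numeric strings instead of quietly producing a default).

```python
def to_number(value):
    """Coerce a value to a number at run time, loosely in the style of Perl."""
    if isinstance(value, (int, float)):
        return value
    if isinstance(value, str):
        try:
            return int(value)
        except ValueError:
            return float(value)
    raise TypeError("cannot use {} as a number".format(type(value).__name__))


def dynamic_add(a, b):
    # The operand types are only discovered here, when the code runs.
    return to_number(a) + to_number(b)


print(dynamic_add("40", 2))
print(dynamic_add("1.5", "2.5"))
```

A VM for dynamic languages has to make checks like these cheap, because it cannot rely on the compiler having ruled them out in advance.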
Here is The Official Parrot Wiki.
You can find lots of info and links there.
The bottom of the Parrot wiki home page also displays the latest headlines from the Planet Parrot feed aggregator.
In addition to the VM, the Parrot project is building a very powerful tool chain to make it easier to port existing languages, or develop new ones.
The Parrot VM will also provide other languages under-the-covers support for many powerful new Perl 6 features (please see the Official Perl 6 Wiki for more Perl 6 info).
Parrot will provide interoperability between modules of differing languages, so that for example, other languages can take advantage of what will become the huge Perl 6 version of CPAN (the vast Perl 5 module archive, which Perl 6 will be able to access via the forthcoming Perl 5.12).
Honestly, I didn't know it was that big of a deal. It has come a long way, but just isn't seeing much use. The main target language has yet to really arrive, and has lost a huge mind-share among the industry professionals. Meanwhile, other solutions like .Net and projects like Jython show us that the here-and-now can beat out any perceived hype.
Parrot will be what Java aimed for but never achieved - a VM for all OS's and platforms
Parrot will implement the ideas behind Microsoft's Common Language Runtime for any dynamic language, and be truly cross-platform
On top of everything Parrot is and will be free and open source
Parrot will become the de facto standard for open source programming with dynamic languages