GraalVM: How to implement compiler optimizations?

I want to develop a tool that performs certain optimizations in a program based on the program structure. For example, let's say I want to identify if-else within a loop, and my tool shall rewrite it into two loops.
I want the tool to be able to rewrite programs from a wide range of languages, for example Java, C++, Python, JavaScript, etc.
I am exploring if GraalVM can be used for this purpose, to act as the common platform in which I can implement the same optimizations for various languages.
Does GraalVM have a common intermediate representation (something like the LLVM IR)? I looked at the documentation but I am not sure where to get started. Any pointers?
Note: I am not looking for inter-operability between languages. You can assume that the programs I want to rewrite are written in one single language; the language may be different for different programs.

GraalVM has two components that are relevant for this:
the Graal compiler, which compiles Java bytecode to native code;
Truffle, a framework for implementing other programming languages on top of GraalVM.
Languages implemented with the Truffle framework get partially evaluated to Java bytecode, which is then compiled by the Graal compiler. This article/talk gives more details, including the IR used by the Graal compiler: https://chrisseaton.com/truffleruby/jokerconf17/. Depending on your concrete use case, you may want to hook into Truffle, the Truffle partial evaluator, or the Graal compiler.
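Incidentally, the rewrite described in the question (splitting an if-else inside a loop into two loops) is commonly called loop unswitching. A sketch of the before/after shapes in plain Scala; the function names and the loop-invariant flag are made up for illustration:

    // Before: the condition is re-evaluated on every iteration,
    // even though `flag` does not change inside the loop.
    def before(xs: Array[Int], flag: Boolean): Int = {
      var sum = 0
      for (x <- xs) {
        if (flag) sum += x * 2
        else sum += x
      }
      sum
    }

    // After loop unswitching: the loop-invariant test is hoisted,
    // leaving two specialized loops with no branch in the body.
    def after(xs: Array[Int], flag: Boolean): Int = {
      var sum = 0
      if (flag) for (x <- xs) sum += x * 2
      else for (x <- xs) sum += x
      sum
    }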

Related

Compilation / Code Generation of External Scala DSL

My understanding is that it is quite simple to create & parse an external DSL in Scala (e.g. representing rules). Is my assumption correct that the DSL can only be interpreted at runtime, and that Scala does not support code generation (like ANTLR) for achieving better performance?
EDIT: To be more precise, my question is whether I could achieve this (create an external domain-specific language and generate Java/Scala code) with built-in Scala tools/libraries (e.g. http://www.artima.com/pins1ed/combinator-parsing.html), rather than writing a whole parser / code generator completely by myself in Scala. It's also clear that you can achieve this with third-party tools, but then you have to learn additional stuff and take on additional dependencies. I'm new to the area of implementing DSLs, so I have no gut feeling yet for when to use external tools like ANTLR and what you can (with reasonable effort) do with Scala's on-board tools.
Is my assumption correct that the DSL can only be interpreted at runtime but does not support code generation (like ANTLR) for achieving better performance?
No, this is wrong. It is possible to write a compiler in Scala; after all, Scala is Turing-complete (i.e. you can write anything in it), and you don't even need Turing-completeness for a compiler.
Some examples of compilers written in Scala include
the Scala compiler itself (in all its variations, Scala-JVM, Scala.js, Scala-native, Scala-virtualized, Typelevel Scala, the abandoned Scala.NET, …)
the Dotty compiler
Scalisp
Scalispa
… and many others …
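To make the "on-board tools" part of the question concrete: below is a minimal sketch using Scala's parser combinators (the library the linked Artima chapter describes; it ships with older Scala distributions and is the separate scala-parser-combinators module in recent versions). It parses a tiny made-up rule DSL and generates Scala source as a plain string; the DSL, the Rule class, and all names are hypothetical:

    import scala.util.parsing.combinator.JavaTokenParsers

    object RuleParser extends JavaTokenParsers {
      // AST for a hypothetical rule: "when <field> > <number> then <action>"
      case class Rule(field: String, threshold: Double, action: String)

      def rule: Parser[Rule] =
        "when" ~> ident ~ (">" ~> floatingPointNumber) ~ ("then" ~> ident) ^^ {
          case field ~ n ~ action => Rule(field, n.toDouble, action)
        }

      // Naive code generation: emit a Scala snippet from the AST.
      def generate(r: Rule): String =
        s"if (${r.field} > ${r.threshold}) ${r.action}()"

      def main(args: Array[String]): Unit =
        parseAll(rule, "when temperature > 30.5 then coolDown") match {
          case Success(r, _) => println(generate(r)) // if (temperature > 30.5) coolDown()
          case failure       => println(failure)
        }
    }

This only emits source text; for real performance gains you would feed the generated code to scalac/javac or emit bytecode directly, but it shows that parsing plus code generation needs nothing beyond the standard tooling.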

Is it possible/useful to transpile Scala to golang?

Scala Native has recently been released, but the garbage collector it uses (for now) is extremely rudimentary and makes it unsuitable for serious use.
So I wonder: why not just transpile Scala to Go (a la Scala.js)? It's going to be a fast, portable runtime. And their GC is getting better and better. Not to mention the inheritance of a great concurrency model: channels and goroutines.
So why did scala-native choose to go so low level with LLVM?
What would be the catch with a golang transpiler?
There are two kinds of languages that are good targets for compilers:
Languages whose semantics closely match the source language's semantics.
Languages which have very low-level and thus very general semantics (or one might argue: no semantics at all).
Examples for #1 include: compiling ECMAScript 2015 to ECMAScript 5 (most language additions were specifically designed as syntactic sugar for existing features, you just have to desugar them), compiling CoffeeScript to ECMAScript, compiling TypeScript to ECMAScript (basically, after type checking, just erase the types and you are done), compiling Java to JVM bytecode, compiling C♯ to CIL bytecode, compiling Python to CPython bytecode, compiling Python to PyPy bytecode, compiling Ruby to YARV bytecode, compiling Ruby to Rubinius bytecode, compiling ECMAScript to SpiderMonkey bytecode.
Examples for #2 include: machine code for a general purpose CPU (RISC even more so), C--, LLVM.
Compiling Scala to Go fits neither of the two. Their semantics are very different.
You need either a language with powerful low-level semantics as the target language, so that you can build your own semantics on top, or you need a language with closely matching semantics, so that you can map your own semantics into the target language.
In fact, even JVM bytecode is already too high-level! It has constructs such as classes that do not match constructs such as Scala's traits, so there has to be a fairly complex encoding of traits into classes and interfaces. Likewise, before invokedynamic, it was actually pretty much impossible to represent dynamic dispatch on structural types in JVM bytecode. The Scala compiler had to resort to reflection, or in other words, deliberately stepping outside of the semantics of JVM bytecode (which resulted in a terrible performance overhead for method dispatch on structural types compared to method dispatch on other class types, even though both are the exact same thing).
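For illustration, here is what the structural-types case looks like from the Scala side (the quack example is my own); the reflective lookup happens behind the scenes on every call, because the pre-invokedynamic invocation instructions all require a named class or interface:

    import scala.language.reflectiveCalls

    // Dispatch on a structural type: any object with a matching
    // `quack(): String` method is accepted, so the Scala compiler
    // has to fall back to runtime reflection for the call.
    def quack(d: { def quack(): String }): String = d.quack()

    class Duck  { def quack(): String = "quack" }
    class Robot { def quack(): String = "beep"  }

    quack(new Duck())  // works, but each call goes through reflection
    quack(new Robot()) // no shared interface relates Duck and Robot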
Proper Tail Calls are another example: we would like to have them in Scala, but because JVM bytecode is not powerful enough to express them without a very complex mapping (basically, you have to forego using the JVM's call stack altogether and manage your own stack, which destroys both performance and Java interoperability), it was decided to not have them in the language.
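A small sketch of that limitation (my own example): a self tail call can be verified by @tailrec and compiled into a bytecode-level loop, but mutual tail calls cannot be expressed in JVM bytecode without abandoning the JVM call stack:

    import scala.annotation.tailrec

    // Self tail call: @tailrec proves the compiler rewrites it into
    // a loop, so it runs in constant stack space.
    @tailrec
    def countDown(n: Long): Long =
      if (n == 0) 0 else countDown(n - 1)

    // Mutual tail calls: each call is in tail position, but JVM
    // bytecode has no general tail-call instruction, so a large n
    // overflows the stack (and @tailrec would reject these methods).
    def isEven(n: Long): Boolean = if (n == 0) true  else isOdd(n - 1)
    def isOdd(n: Long): Boolean  = if (n == 0) false else isEven(n - 1)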
Go has some of the same problems: in order to implement Scala's expressive non-local control-flow constructs such as exceptions or threads, we need an equally expressive non-local control-flow construct to map them to. For typical target languages, this "expressive non-local control-flow construct" is either continuations or the venerable GOTO. Go has GOTO, but it is deliberately limited in its "non-localness". For humans writing code, limiting the expressive power of GOTO is a good thing, but for a compiler target language, not so much.
It is very likely possible to rig up powerful control-flow using goroutines and channels, but now we are already leaving the comfortable confines of just mapping Scala semantics to Go semantics, and start building Scala high-level semantics on top of Go high-level semantics that weren't designed for such usage. Goroutines weren't designed as a general control-flow construct to build other kinds of control-flow on top of. That's not what they're good at!
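Scala itself gives a taste of what such an encoding costs: scala.util.control.Breaks implements break by throwing a ControlThrowable that the enclosing breakable catches, i.e. one non-local construct (break) is built on top of another (exceptions). A minimal example:

    import scala.util.control.Breaks

    // `break` throws a ControlThrowable that the surrounding
    // `breakable` block catches: non-local control flow encoded
    // on top of a different non-local construct (exceptions).
    val b = new Breaks
    b.breakable {
      for (i <- 1 to 10) {
        if (i > 3) b.break()
        println(i) // prints 1, 2, 3
      }
    }

A compiler targeting a language without an equally expressive construct has to invent this kind of machinery for every non-local jump in the source language.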
So why did scala-native choose to go so low level with LLVM?
Because that's precisely what LLVM was designed for and is good at.
What would be the catch with a golang transpiler?
The semantics of the two languages are too different for a direct mapping and Go's semantics are not designed for building different language semantics on top of.
their GC is getting better and better
So can Scala Native's. As far as I understand, the current choice of Boehm-Demers-Weiser is basically one of laziness: it's there, it works, you can drop it into your code and it'll just do its thing.
Note that changing the GC is under discussion. There are other GCs which are designed as drop-ins rather than being tightly coupled to the host VM's object layout. E.g. IBM is currently in the process of re-structuring J9, their high-performance JVM, into a set of loosely coupled, independently re-usable "runtime building blocks" components and releasing them under a permissive open source license.
The project is called "Eclipse OMR" (source on GitHub) and it is already production-ready: the Java 8 implementation of IBM J9 was built completely out of OMR components. There is a Ruby + OMR project which demonstrates how the components can easily be integrated into an existing language runtime, because the components themselves assume no language semantics and no specific memory or object layout. The commit which swaps out the GC and adds a JIT and a profiler clocks in at just over 10000 lines. It isn't production-ready, but it boots and runs Rails. They also have a similar project for CPython (not public yet).
why not just transpile Scala to Go (a la Scala.js)?
Note that Scala.js has a lot of the same problems I mentioned above. But they are doing it anyway, because the gain is huge: you get access to every web browser on the planet. There is no comparable gain for a hypothetical Scala.go.
There's a reason why there are initiatives for getting low-level semantics into the browser such as asm.js and WebAssembly, precisely because compiling a high-level language to another high-level language always has this "semantic gap" you need to overcome.
In fact, note that even for lowish-level languages that were specifically designed as compilation targets for a specific language, you can still run into trouble. E.g. Java has generics, JVM bytecode doesn't. Java has inner classes, JVM bytecode doesn't. Java has anonymous classes, JVM bytecode doesn't. All of these have to be encoded somehow, and specifically the encoding (or rather non-encoding) of generics has caused all sorts of pain.
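The generics case is easy to demonstrate from Scala (my own minimal example): after erasure, the element type is simply gone at runtime, which is exactly the kind of information loss such a non-encoding causes:

    // Type erasure: both lists have the same runtime class, so the
    // JVM cannot tell a List[Int] from a List[String] at runtime.
    val xs: List[Int]    = List(1, 2, 3)
    val ys: List[String] = List("a", "b")
    println(xs.getClass == ys.getClass) // true

    (xs: Any) match {
      // Unchecked: the String type argument is erased, so this
      // pattern matches even though xs is a List[Int].
      case _: List[String] => println("matches anyway!")
      case _               => println("never reached")
    }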

What would be a good approach to generate native code from an interpreter written with scala parser combinators?

I already have an interpreter for my language.
It is implemented with:
parser -> Scala parser combinators;
AST -> Scala case classes;
evaluator -> Scala pattern matching.
Now I want to compile the AST to native code and, hopefully, also to Java bytecode.
I am thinking of two main options to accomplish at least one of these two tasks:
generate LLVM IR code;
generate C code and/or Java code;
Note: GCJ and SLEM seem to be unusable (GCJ only works with simple code, as far as I could test).
Short Answer
I'd go with Java Bytecode.
Long Answer
The thing is, the higher-level the language you compile to,
The slower and more cumbersome the compilation process is
The more flexibility you get
For instance, if you compile to C, you can then get a lot of possible backends for C compilers - you can generate Java Bytecode, LLVM IR, asm for many architectures, etc., but you basically compile twice. If you choose LLVM IR you're already halfway to compiling to asm (parsing LLVM IR is far faster than parsing a language such as C), but you'll have a very hard time getting Java Bytecode from that. Both intermediate languages can compile to native, though.
I think compiling to some intermediate representation is preferable to compiling to a general-purpose programming language. Between LLVM IR and Java Bytecode I'd go with Java Bytecode - even though I personally like LLVM IR better - because you wrote that you basically want both, and while you can sort of convert Java Bytecode to LLVM IR, the other direction is very difficult.
The only remaining difficulty is translating your language to Java Bytecode. This related question about tools that can make it easier might help.
Finally, another advantage of Java Bytecode is that it'll play well with your interpreter, effectively allowing you to easily build a HotSpot-like JIT (or even a tracing compiler).
I agree with @Oak about the choice of bytecode as the simplest target. A possible Scala library for generating bytecode is CafeBabe by @psuter.
You cannot do everything with it, but for small projects it could be sufficient. The syntax is also very clear. Please see the project wiki for more information.
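CafeBabe aside, a common alternative is the ASM library (my suggestion, not mentioned in the answers above). A minimal sketch in Scala that emits a runnable Hello class, equivalent to a Java class whose main prints "hi":

    import java.nio.file.{Files, Paths}
    import org.objectweb.asm.{ClassWriter, Opcodes}

    object EmitHello {
      def main(args: Array[String]): Unit = {
        val cw = new ClassWriter(ClassWriter.COMPUTE_FRAMES)
        cw.visit(Opcodes.V1_8, Opcodes.ACC_PUBLIC, "Hello", null,
          "java/lang/Object", null)

        // public static void main(String[] args)
        val mv = cw.visitMethod(
          Opcodes.ACC_PUBLIC | Opcodes.ACC_STATIC,
          "main", "([Ljava/lang/String;)V", null, null)
        mv.visitCode()
        mv.visitFieldInsn(Opcodes.GETSTATIC,
          "java/lang/System", "out", "Ljava/io/PrintStream;")
        mv.visitLdcInsn("hi")
        mv.visitMethodInsn(Opcodes.INVOKEVIRTUAL,
          "java/io/PrintStream", "println", "(Ljava/lang/String;)V", false)
        mv.visitInsn(Opcodes.RETURN)
        mv.visitMaxs(0, 0) // recomputed because of COMPUTE_FRAMES
        mv.visitEnd()

        cw.visitEnd()
        Files.write(Paths.get("Hello.class"), cw.toByteArray)
      }
    }

After running this, `java Hello` prints "hi". A compiler backend is essentially this, driven by a traversal of your AST.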

Interpreter vs. Code Generator Xtext

I have a DSL written using Xtext, and I want to execute it to get something useful out of it.
I wrote a myDslGenerator class implementing the IGenerator interface in Xtend to generate Java code, and it's working fine.
I have two questions:
What is the difference between an interpreter and a code generator?
Aren't both for executing a DSL?
How do I write an interpreter? Is there any step-by-step tutorial? I found many tutorials on generating code with Xtend, but couldn't find any on writing an interpreter.
Thank you,
Salman
Basically, interpreters and code generators work very differently. Code generators are like a compiler: they create executable code for your DSL in another language; interpreters, on the other hand, traverse your DSL and execute it in your own environment. This means the generated code does not have to (though of course it can) depend on your DSL and can be faster/more optimized, while interpreters need to understand the constructs of your language but can be executed directly in your development IDE, without requiring you to run an additional application.
AFAIK Xtext does not support writing interpreters; it's somewhat out of its scope (not entirely: for Xbase expressions there is an XbaseInterpreter instance that can be reused, provided you set its classpath correctly), as interpreters are extremely language-specific.
I also don't know of any step-by-step tutorial about interpreting Xtext DSLs (not even for the XbaseInterpreter), but it basically boils down to a traversal of the AST: as each node is traversed, the corresponding statement is executed dynamically. For this traversal to work as expected, the interpreter has to maintain a (possibly hierarchical) context of variables and other references.
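As a rough sketch of what that boils down to (plain Scala case classes rather than Xtext's EMF model; all names hypothetical): the interpreter pattern-matches on node types and threads a context of variables through the recursion:

    // A toy expression AST and its interpreter.
    sealed trait Expr
    case class Num(v: Int)                              extends Expr
    case class Var(name: String)                        extends Expr
    case class Add(l: Expr, r: Expr)                    extends Expr
    case class Let(name: String, value: Expr, body: Expr) extends Expr

    // The environment is the "context of variables" mentioned above;
    // nested Lets extend it, giving the hierarchical scoping.
    def eval(e: Expr, env: Map[String, Int]): Int = e match {
      case Num(v)          => v
      case Var(n)          => env(n)
      case Add(l, r)       => eval(l, env) + eval(r, env)
      case Let(n, v, body) => eval(body, env + (n -> eval(v, env)))
    }

    eval(Let("x", Num(2), Add(Var("x"), Num(3))), Map.empty) // 5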

Code generation for Java JVM / .NET CLR

I am taking a compilers course at college, and we must generate code for our invented language, targeting any platform we want. I think the simplest choice is generating code for the Java JVM or the .NET CLR. Any suggestions on which one to choose, and which APIs out there can help me with this task? I already have all the semantic analysis done; I just need to generate code for a given program.
Thank you
From what I know, at a high level the two VMs are actually quite similar: both are classic stack-based machines, with largely high-level operations (e.g. virtual method dispatch is an opcode). That said, the CLR lets you get down to the metal if you want, as it has raw data pointers with arithmetic, raw function pointers, unions, etc. It also has proper tail calls. So, if the implementation of the language needs any of the above (e.g. the Scheme spec mandates tail calls), or if it is significantly advantaged by having those features, then you would probably want to go the CLR way.
The other advantage there is that you get a stock API to emit bytecode, System.Reflection.Emit; even though it is somewhat limited for full-fledged compiler scenarios, it is still generally enough for a simple compiler.
With the JVM, the two main advantages you get are better portability and the fact that the bytecode itself is arguably simpler (because it has fewer features).
Another option I came across is a library called RunSharp that can generate MSIL code at runtime using Reflection.Emit, but in a nicer, more user-friendly way that feels more like C#. The latest version of the library can be found here:
http://code.google.com/p/runsharp/
In .NET you can use the Reflection.Emit namespace to generate MSIL code.
See the MSDN link: http://msdn.microsoft.com/en-us/library/3y322t50.aspx