Xtext: grammar for language with significant/semantic whitespace - eclipse

How can I use Xtext to parse languages with semantic whitespace? I'm trying to write a grammar for CoffeeScript and I can't find any good documentation on this.

Here's an example of a whitespace-sensitive language in Xtext.

AFAIK, you can't.
To parse Python-like languages, you'd need the lexer to emit INDENT and DEDENT tokens. For that to happen, you'd need semantic predicates to be supported inside lexer rules (Xtext's terminal rules). Such a predicate would first check whether the position in the line of the next input character equals 0 (the beginning of the line) and whether that character is a ' ' or '\t'.
But browsing through the documentation, I don't see that this is supported by Xtext at the moment. Since Xtext 2.0, support has been added for syntactic predicates in production rules (see: 6.2.8. Syntactic Predicates), but not in terminal rules.
The only way to do this with Xtext would be to let the lexer produce terminal spaces and line-breaks, but this would make an utter mess of your production rules.
If you want to parse such a language using Java (and a Java oriented parser generator) I'd recommend ANTLR, in which you can emit such INDENT and DEDENT tokens quite easily. But if you're keen on Eclipse integration, then I don't see how you'd be able to do this using Xtext, sorry.
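The INDENT/DEDENT idea described above can be sketched independently of any parser generator. The following is a minimal illustration in Python, not Xtext or ANTLR code: the function name `tokenize`, the `LINE` token kind, and the rule that a tab counts as one column are all assumptions made for the example.

```python
def tokenize(source):
    """Yield (kind, text) pairs, tracking indentation with a stack of widths."""
    indents = [0]  # stack of currently open indentation widths
    for line in source.splitlines():
        if not line.strip():
            continue  # blank lines do not open or close blocks
        stripped = line.lstrip(" \t")
        # Assumption for this sketch: every space or tab counts as one column.
        width = len(line) - len(stripped)
        if width > indents[-1]:
            indents.append(width)
            yield ("INDENT", "")
        while width < indents[-1]:
            indents.pop()
            yield ("DEDENT", "")
        if width != indents[-1]:
            raise ValueError(f"inconsistent dedent: {line!r}")
        yield ("LINE", stripped.rstrip())
    # Close any blocks that are still open at end of input.
    while len(indents) > 1:
        indents.pop()
        yield ("DEDENT", "")


if __name__ == "__main__":
    for token in tokenize("if x\n  y\n    z\nw\n"):
        print(token)
```

The lexer has to carry the indentation stack from one line to the next, which is the kind of state that a context-free terminal rule cannot express on its own.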

Version 2.8 of Xtext comes with support for Whitespace-Aware Languages. This version ships with the "Home Automation Example" that you can use as a template.

For people interested in CoffeeScript, Adam Schmideg has an Eclipse plugin that uses Xtext.
For people interested in parsing Python-like DSLs in Xtext, Ralf Ebert's code for Todotext mentioned above is no longer available from GitHub, but you can find it in the Eclipse test repository. See the original thread about this work and the Eclipse issue that was raised about it.
I have been playing with this code today, and my conclusion is that it no longer works in the current version of Xtext. When Xtext is used in Eclipse, I think it does "partial parsing". This is not compatible with the stateful lexer you need to process indentation-sensitive languages. So I suspect that even if you patch the lexer, the Eclipse editor will not work. In the issue, it looks like Ralf proposed patches to address these problems, but looking into the Xtext source, these changes seem to be long gone. If I am wrong and someone can get it to work, I would be very interested.
There is a different implementation here, but I cannot get that to work with the current version of Xtext either.
Instead I have switched to parboiled, which supports indentation-based grammars out of the box.

Related

how to extend existing LSP language server to derivative language?

VS Code has great Bash language support -- syntax highlighting (via shellscript), LSP support (via Bash-IDE), linting (via vscode-shellcheck), and a debugger (via bash debug). There is a Bash-based TAP-compliant unit test framework called BATS that also has syntax highlighting support for some new BATS-specific keywords (via bats). I'd primarily like to marry the Bash LSP with the BATS syntax language so that I can get individual tests listed in the VS Code outline panel, but I'd also be potentially interested in shellcheck linting as well (though that is a bigger project, I think).
There are pieces of all of this architecture that are extensively documented, but I haven't found anything that documents how all of these work together, let alone how to then derive from them. A fundamental question therefore is: can I define a new language as an extension of an existing one such that I pick up all of this language support for free, albeit with different/added keyword definitions? I guess each piece has its own definition of the different syntax elements (e.g. how a "function" is defined), so perhaps I need to override each one of them individually somehow?

How to write diff code with syntax highlight in Github

GitHub supports syntax highlighting as follows:
```javascript
let message = 'hello world!'
```
And it supports diffs as follows (but WITHOUT syntax highlighting):
```diff
-let message = 'hello world!'
+let message = 'hello stackoverflow!'
```
How can I get both 'syntax highlighting' AND 'diff'?
No, this is not a supported feature at this time.
GitHub documents their processing of lightweight markup languages (including Markdown, among others) in github/markup. Note step 3:
Syntax highlighting is performed on code blocks. See github/linguist for more information about syntax highlighting.
If we follow that link, we find a list of grammars that Linguist uses to provide syntax highlighting on GitHub. Linguist can only apply one of the grammars in that list to a block of code at a time. Of course, one of the grammars is Diff. However, that grammar knows nothing about the language of code being diffed, so you don't get syntax highlighting of that.
Of course, there are other languages which are often combined. For example, HTML is often included in a templating language. Therefore, in addition to the HTML grammar, we also find grammars for HTML+Django, HTML+ECR, HTML+EEX, HTML+ERB, and HTML+PHP. In each case, the single grammar is aware of two languages: the specific templating language and the HTML which is interspersed within the template.
To accomplish the same thing with a diff, you would need a separate "diff" grammar for every single language listed. In other words, the number of grammars would double. Of course, a way to avoid this might be to treat diff differently. When diff is specified, they could run the block through the syntax highlighter twice, once for diff and once for the source language. However, at least when processing code blocks in lightweight markup languages, they have not implemented such a feature.
And if they ever were to implement such a feature in the future, it would likely be more complicated than simply running the code block through twice. After all, every line of the diff has diff-specific content which would confuse the other language grammar. Therefore, every grammar would need to be diff aware, or each line would need to be fed to the grammar separately with the diff parts removed. The problem with the latter is that the grammar would not have the context of each line and is more likely to get things wrong. Whether such a solution is possible is outside the scope of this answer, but the point is that it is reasonable to expect that such a feature would be a much lower priority to support due to the complexity involved.
So why does GitHub do syntax highlighting in other places on its website? Because, in those cases, it has access to the two source files being diffed and it generates the diff itself. Each source is first highlighted (avoiding the complexity mentioned above), then the diff is created from the two highlighted source files. However, a diff included in a Markdown code block is already a diff when GitHub first sees it. There is no way for them to highlight the pre-diff code first. In other words, the process they currently use would not be transferable to supporting the requested feature.
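The "feed each line separately with the diff parts removed" approach mentioned above can be sketched as follows. This is only an illustration and has nothing to do with how Linguist works: `highlight_line` is a hypothetical toy highlighter standing in for a real grammar, the `[kw]...[/kw]` markup and the keyword set are invented for the example, and diff file headers such as `---`/`+++` get no special treatment.

```python
import re

# Illustrative subset of JavaScript keywords (an assumption for this sketch).
KEYWORDS = {"let", "const", "var", "function", "return"}


def highlight_line(code):
    """Toy stand-in for a real grammar: wrap known keywords in [kw]...[/kw]."""
    def mark(match):
        word = match.group(0)
        return f"[kw]{word}[/kw]" if word in KEYWORDS else word
    return re.sub(r"\b[A-Za-z_]\w*\b", mark, code)


def highlight_diff(diff_text):
    """Split off the leading diff marker, highlight the rest, and rejoin."""
    out = []
    for line in diff_text.splitlines():
        if line and line[0] in "+- ":
            marker, code = line[0], line[1:]
        else:
            marker, code = "", line
        out.append(marker + highlight_line(code))
    return "\n".join(out)


if __name__ == "__main__":
    print(highlight_diff("-let message = 'hello world!'\n"
                         "+let message = 'hello stackoverflow!'"))
```

Because each line is highlighted in isolation, the toy highlighter would also mark a keyword that appears inside a string literal or a multi-line comment, which is the loss of context the answer warns about.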
You would need to post-process the output of the git diff in order to add syntax highlighting for the right language of the file being diff'ed.
But since you are asking for GitHub, that post-processing is not in your control, and is not provided by GitHub at the moment in its GFM (GitHub Flavored Markdown Spec).
It is supported for source files, in a regular diff like this one or in a PR: GitHub does the syntax highlighting of the two versions of the file, and then computes the diff.
It is not supported in a regular Markdown fenced code block, where the +/- of a diff would throw off the syntax highlighting engine, considering there is no "diff" operation done here (just the writer trying to add diff +/- symbols).

Is it possible/easy to build a VS Code extension that does syntax highlighting with a lexer?

I am building an experimental lexer generator and I think it would be cool to output simple syntax highlighters for VS Code. The input grammar goes through the classic regular language -> NFA -> DFA transformation, then generates state machine code (it also has some unconventional features to support nested languages). Converting all this back into tmlanguage definitions is a complicated problem, and I'm starting to wonder if a VS Code extension is a better option. The question is:
Are VS Code syntax highlighting internals completely tied to the tmlanguage regex scanner, or would it be possible to write an extension that provides tokens / highlight ranges programmatically?
Is there an API that would make this reasonably straightforward, or would this project be a tour de force?
As of VS Code 1.15, you have to use TextMate grammars for syntax highlighting. There's a feature request open that tracks what you are after: https://github.com/Microsoft/vscode/issues/1967

Use java regex in a Scala application

Scala.js uses the JavaScript regexp library instead of the Java regexp library. JavaScript regexps are a bit different; for instance, they do not implement look-behind (?<=X).
How can I use the Java regexp in a Scala.js application?
(in order to have the look-behind feature)
PS: I am aware that it is sometimes possible to simulate look-behind in JavaScript regexps, as shown here, and I am currently doing this. However, this question is about how to use the Java regexp in Scala.js.
PPS: I am also aware that there is an external JavaScript regexp library with look-behind, xregexp. Again, this question is about how to use the Java regexp in Scala.js.
Basically, you can't -- the Java Regexp, like most of the Java runtime environment, doesn't exist in the browser, so it doesn't exist for Scala.js. If you specifically need the Java Regex behavior, using one of those simulator libraries is probably your only option, at least for the time being.
As a rule of thumb, if something comes from the JRE, you should assume it does not exist in Scala.js unless someone has explicitly ported it. A growing amount has been ported, but it's still a fairly modest fraction of the total Java Runtime...
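For readers unfamiliar with the look-behind simulation the question alludes to, one common form matches the prefix in a non-capturing group and keeps only a capture group. The sketch below is in Python purely for illustration (Python's `re` does support look-behind itself), and the helper name `find_after_prefix` is hypothetical; the same pattern shape can be used in engines without look-behind.

```python
import re


def find_after_prefix(text, prefix, body):
    """Return matches of `body` that directly follow `prefix`, without (?<=...).

    `prefix` and `body` are regex fragments. The prefix is matched in a
    non-capturing group and discarded; only the captured body is returned.
    """
    pattern = re.compile(f"(?:{prefix})({body})")
    return [m.group(1) for m in pattern.finditer(text)]


if __name__ == "__main__":
    print(find_after_prefix("price: $42, cost: $7, id 99", r"\$", r"\d+"))
```

Unlike a true look-behind, this consumes the prefix, so results can differ when one match's prefix overlaps the previous match.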

Filter Eclipse CDT textual search results by program-structural features

I want to use Eclipse CDT to search for all occurrences of a regular expression, say foo(_bar_)*baz, in the bodies (or the bodies and declarations) of functions/methods meeting a certain criterion. For the example, let's make it all functions/methods named ignore_me (but within all classes and namespaces).
Is that possible somehow?
I'm not aware of a way to do this sort of search in Eclipse CDT.
Clang's AST matchers provide a rich domain-specific language for expressing queries like yours (and many other kinds), but using them requires writing code (in a clang plugin or standalone clang-based tool). It may be relatively straightforward to write a simple tool that allows you to write a query and searches a codebase for matches. I'm not sure whether such a tool is already available.