How to generate stream of custom symbols instead of ints in lexer?

How to generate stream of custom symbols instead of ints in lexer? - lex

I am asking for gplex, however it might be the case, the solution to the problem works for other lex-derived tools.
I wrote all rules, everything is fine with one exception. The type of the scan method of the generated scanner is int, and I would like to be MySymbol (which would consist of id of the token -- INT, STR, PLUS, so on, its value, and possible location in the file).
I checked the samples (not many of them), but they are very simplistic and just write out the fact rule was matched, I've read the manual, but it starts from parser perspective and for now I am a bit lost.
One of my rules in lex file:
while { return new MySymbol(MyTokens.WHILE); }
All I have now is scanning phase, I have to finish it, and then I will think about parser.

Yacc and Yacc-like tools (here GPLex) relies on side effect. Normally you could think of returning the data, but here you are returning token id, and any extra data has to be "passed" via special variables like yyval.

Related

Visual Studio Code Providers for language extension

I am trying to learn how to implement some of the helper providers: autocomplete, signature help and hover.
I am doing it for a framework that, as far as I know, it cannot be executed outside its main application, so one way I thought to go about this (get the objects types, methods and docs) is by parsing its documentation.
For example the Hover provider; once the cursor is hovering the word, I can search for it in the documentation and display the result:
class HSHoverProvider implements vscode.HoverProvider {
public provideSignatureHelp(
document: vscode.TextDocument,
position: vscode.Position,
token: vscode.CancellationToken
): vscode.SignatureHelp {
// get current word/line under the cursor and find a match inside the docs
...
return new vscode.Hover(data);
}
}
...
context.subscriptions.push(
vscode.languages.registerHoverProvider("lua", new HSHoverProvider())
);
This works fine when the action is directly on the initial declaration. I can parse directly the line and find what I need with a regex.
-- hovering over `application`, I check the context with a regex.
local app = hs.application('Code')
However, I am having a hard time when it comes to a "reference". Searching the document for the declaration of app with a regex approach leads to many edge cases, mainly because of the declaration scope:
Example:
-- declaration target
local app = hs.application('Code')
local function foo()
local app = hs.pasteboard()
end
local function bar()
if 'foo' then
local app = hs.alert()
end
do local app = hs.window.focusedWindow() end
-- a regex will have a hard time to understand which declaration is correct
print(app:title())
end
This lead me thinking that a regex is not the appropriate solution. I also thought that implementing vscode.DefinitionProvider will give me some insight but it did not.
I've tried to look at other extensions that do already the same thing (mainly Lua Language Server by sumneko), but I am not able to understand how they went for it (besides they are using the language server approach).
How would I go for something like this? Do I need an AST tree and inspect from there? Would using the language server be a better choice? Am I missing a bigger picture or I just need a more robust document parser?
Any insight is appreciated. Thanks in advance

The usual approach in such cases is to create a symbol table. You start by parsing the code for which you want to provide the tooling. From the parse tree (or syntax tree, depending on the parser tool used) you generate your symbol table, which holds the informations you need, including the nesting of blocks and symbols, the type of symbols (e.g. object name or object reference) and the scope for which a symbol is valid.

Matlab function signature changes

Let us say that I have a Matlab function and I change its signature (i.e. add parameter). As Matlab does not 'compile' is there an easy way to determine which other functions do not use the right signature (i.e. submits the additional parameter). I do not want to determine this at runtime (i.e. get an error message) or have to do text searches. Hope this makes sense. Any feedback would be very much appreciated. Many thanks.

If I understand you correctly, you want to change a function's signature and find all functions/scripts/classes that call it in the "old" way, and change it to the "new" way.
You also indicated you don't want to do it at runtime, or do text searches, but there is no way to detect "incorrect" calls at "parse-time", so I'm afraid these demands leave no option at all to detect old function calls...
What I would do in that case is temporarily add a few lines to the new function:
function myFunc(param1, param2, newParam) % <-- the NEW signature
if nargin == 2
clc, error('old call detected.'); end
and then run the main script/function/whatever in which this function resides. You'll get one error for each time something calls the function incorrectly, along with the error stack in the Matlab command window.
It is then a matter of clicking on the link in the bottom of the error stack, correct the function call, and repeat from the top until no more errors occur.
Don't forget to remove these lines when you're done, or better, replace the word error with warning just to capture anything that was missed.
Better yet: if you're on linux, a text search would be a matter of
$ grep -l 'myFunc(.*,.*); *.m'
which will list all the files having the "incorrect" call. That's not too difficult I'd say...You can probably do a similar thing with the standard windows search, but I can't test that right now.

This is more or less what the dependency report was invented for. Using that tool, you can find what functions/scripts call your altered function. Then it is just a question of manually inspecting every occurrence.
However, I'd advise to make your changes to the function signature such that backwards compatibility is maintained. You can do so by specifying default values for new parameters and/or issuing a warning in those scenarios. That way, your code will run, and you will get run-time hints of deprecated code (which is more or less a necessary evil in interpreted/dynamic languages).
For many dynamic languages (and MATLAB specifically) it is generally impossible to fully inspect the code without the interpreter executing the code. Just imagine the following piece of code:
x = magic(10);
In general, you'd say that the magic function is called. However, magic could map to a totally different function. This could be done in ways that are invisible to a static analysis tool (such as the dependency report): e.g. eval('magic = 1:100;');.
The only way is to go through your whole code base, either inspecting every occurrence manually (which can be found easily with a text search) or by running a test that fully covers your code base.
edit:
There is however a way to access intermediate outputs of the MATLAB parser. This can be accessed using the undocumented and unsupported mtree function (which can be called like this: t = mtree(file, '-file'); for every file in your code base). Using the resulting structure you might be able to find calls with a certain amount of parameters.

Varnish 3.0.3 req.hash_always_miss vs Vary

I'm trying to build a system that can purge and regenerate URLs as required for a particular system. I previously was having issues with purging when the system located the object by hash but missed the variant as I didn't have a "purge;" in my vcl_miss (only in my vcl_hit, some guides/example vcl files do not mention this need but the main documentation does here).
What I'm trying to figure-out is if I need to do something similar for a REGEN call. From my understanding, "set req.hash_always_miss = true;" will mean that the old hash is missed and a new hash object is generated. Subsequent calls will find the new hash, but may still miss that object if there is not an appropriate variant in the cache.
Could someone confirm for me whether a subsequent request missing the variant in the new object will lead directly to a cache miss and fetch, rather than finding any of the variants from the previous object?

hash_always_miss will only influence the current/ongoing request and the cache contents that it replaces. A fetch will always happen, and the object will be put into the cache using the same rules as any other miss/fetch sequence.
The "old" other variants of the same hash are still valid objects and will be served to a client indicating request headers matching the varied headers.
hash_always_miss will replace the current variant, and nothing else.
To answer your question, the second part of your sentence is most correct.

How to internationalize java source code?

EDIT: I completely re-wrote the question since it seems like I was not clear enough in my first two versions. Thanks for the suggestions so far.
I would like to internationalize the source code for a tutorial project (please notice, not the runtime application). Here is an example (in Java):
/** A comment */
public String doSomething() {
System.out.println("Something was done successfully");
}
in English , and then have the French version be something like:
/** Un commentaire */
public String faitQuelqueChose() {
System.out.println("Quelque chose a été fait avec succès.");
}
and so on. And then have something like a properties file somewhere to edit these translations with usual tools, such as:
com.foo.class.comment1=A comment
com.foo.class.method1=doSomething
com.foo.class.string1=Something was done successfully
and for other languages:
com.foo.class.comment1=Un commentaire
com.foo.class.method1=faitQuelqueChose
com.foo.class.string1=Quelque chose a été fait avec succès.
I am trying to find the easiest, most efficient and unobtrusive way to do this with the least amount of manual grunt work (other than obviously translating the actual text). Preferably working under Eclipse. For example, the original code would be written in English, then externalized (to properties, preferably leaving the original source untouched), translated (humanly) and then re-generated (as a separate source file / project).
Some trails I have found (other than what AlexS suggested):
AntLR, a language parser / generator. There seems to be a supporting Eclipse plugin
Using Eclipse's AST (Abstract Syntax Tree) and I guess building some kind of plugin.
I am just surprised there isn't a tool out there that does this already.

I'd use unique strings as methodnames (or anything you want to be replaced by localized versions.
public String m37hod_1() {
System.out.println(m355a6e_1);
}
then I'd define a propertyfile for each language like this:
m37hod_1=doSomething
m355a6e_1="Something was done successfully"
And then I'd write a small program parsing the sourcefiles and replacing the strings. So everything just outside eclipse.
Or I'd use the ant task Replace and propertyfiles as well, instead of a standalone translation program.
Something like that:
<replace
file="${src}/*.*"
value="defaultvalue"
propertyFile="${language}.properties">
<replacefilter
token="m37hod_1"
property="m37hod_1"/>
<replacefilter
token="m355a6e_1"
property="m355a6e_1"/>
</replace>
Using one of these methods you won't have to explain anything about localization in your tutorials (except you want to), but can concentrate on your real topic.

What you want is a massive code change engine.
ANTLR won't do the trick; ASTs are necessary but not sufficient. See my essay on Life After Parsing. Eclipse's "AST" may be better, if the Eclipse package provides some support for name and type resolution; otherwise you'll never be able to figure out how to replace each "doSomething" (might be overloaded or local), unless you are willing to replace them all identically (and you likely can't do that, because some symbols refer to Java library elements).
Our DMS Software Reengineering Toolkit could be used to accomplish your task. DMS can parse Java to ASTs (including comment capture), traverse the ASTs in arbitrary ways, analyze/change ASTs, and the export modified ASTs as valid source code (including the comments).
Basically you want to enumerate all comments, strings, and declarations of identifiers, export them to an external "database" to be mapped (manually? by Google Translate?) to an equivalent. In each case you want to note not only the item of interest, but its precise location (source file, line, even column) because items that are spelled identically in the original text may need different spellings in the modified text.
Enumeration of strings is pretty easy if you have the AST; simply crawl the tree and look for tree nodes containing string literals. (ANTLR and Eclipse can surely do this, too).
Enumeration of comments is also straightforward if the parser you have captures comments. DMS does. I'm not quite sure if ANTLR's Java grammar does, or the Eclipse AST engine; I suspect they are both capable.
Enumeration of declarations (classes, methods, fields, locals) is relatively straightforward; there's rather more cases to worry about (e.g., anonymous classes containing extensions to base classes). You can code a procedure to walk the AST and match the tree structures, but here's the place that DMS starts to make a difference: you can write surface-syntax patterns that look like the source code you want to match. For instance:
pattern local_for_loop_index(i: IDENTIFIER, t: type, e: expression, e2: expression, e3:expression): for_loop_header
= "for (\t \i = \e,\e2,\e3)"
will match declarations of local for loop variables, and return subtrees for the IDENTIFIER, the type, and the various expressions; you'd want to capture just the identifier (and its location, easily done by taking if from the source position information that DMS stamps on every tree node). You'd probably need 10-20 such patterns to cover the cases of all the different kinds of identifiers.
Capture step completed, something needs to translate all the captured entities to your target language. I'll leave that to you; what's left is to put the translated entities back.
The key to this is the precise source location. A line number isn't good enough in practice; you may have several translated entities in the same line, in the worst case, some with different scopes (imagine nested for loops for example). The replacement process for comments, strings and the declarations are straightforward; rescan the tree for nodes that match any of the identified locations, and replace the entity found there with its translation. (You can do this with DMS and ANTLR. I think Eclipse ADT requires you generate a "patch" but I guess that would work.).
The fun part comes in replacing the identifier uses. For this, you need to know two things:
for any use of an identifier, what is the declaration is uses; if you know this, you can replace it with the new name for the declaration; DMS provides full name and type resolution as well as a usage list, making this pretty easy, and
Do renamed identifiers shadow one another in scopes differently than the originals? This is harder to do in general. However, for the Java language, we have a "shadowing" check, so you can at least decide after renaming that you have an issues. (There's even a renaming procedure that can be used to resolve such shadowing conflicts
After patching the trees, you simply rewrite the patched tree back out as a source file using DMS's built-in prettyprinter. I think Eclipse AST can write out its tree plus patches. I'm not sure ANTLR provides any facilities for regenerating source code from ASTs, although somebody may have coded one for the Java grammar. This is harder to do than it sounds, because of all the picky detail. YMMV.
Given your goal, I'm a little surprised that you don't want a sourcefile "foo.java" containing "class foo { ... }" to get renamed to .java. This would require not only writing the transformed tree to the translated file name (pretty easy) but perhaps even reconstructing the directory tree (DMS provides facilities for doing directory construction and file copies, too).
If you want to do this for many languages, you'd need to run the process once per language. If you wanted to do this just for strings (the classic internationalization case), you'd replace each string (that needs changing, not all of them do) by a call on a resource access with a unique resource id; a runtime table would hold the various strings.

One approach would be to finish the code in one language, then translate to others.
You could use Eclipse to help you.
Copy the finished code to language-specific projects.
Then:
Identifiers: In the Outline view (Window>Show View>Outline), select each item and Refactor>Rename (Alt+Shift+R). This takes care of renaming the identifier wherever it's used.
Comments: Use Search>File to find all instances of "/*" or "//". Click on each and modify.
Strings:
Use Source>Externalize strings to find all of the literal strings.
Search>File for "Messages.getString()".
Click on each result and modify.
On each file, ''Edit>Find/Replace'', replacing "//\$NON-NLS-.*\$" with empty string.

for the printed/logged string, java possess some internatization functionnalities, aka ResourceBundle. There is a tutorial about this on oracle site
Eclipse also possess a funtionnality for this ("Externalize String", as i recall).
for the function name, i don't think there anything out, since this will require you to maintain the code source on many version...
regards

Use .properties file, like:
Locale locale = new Locale(language, country);
ResourceBundle captions= ResourceBundle.getBundle("Messages",locale);
This way, Java picks the Messages.properties file according to the current local (which is acquired from the operating system or Java locale settings)
The file should be on the classpath, called Messages.properties (the default one), or Messages_de.properties for German, etc.
See this for a complete tutorial:
http://docs.oracle.com/javase/tutorial/i18n/intro/steps.html
As far as the source code goes, I'd strongly recommend staying with English. Method names like getUnternehmen() are worse to the average developer then plain English ones.
If you need to familiarize foreign developers to your code, write a proper developer documentation in their language.
If you'd like to have Javadoc in both English and other languages, see this SO thread.

You could write your code using freemarker templates (or another templating language such as velocity).
doSomething.tml
/** ${lang['doSomething.comment']} */
public String ${lang['doSomething.methodName']}() {
System.out.println("${lang['doSomething.message']}");
}
lang_en.prop
doSomething.comment=A comment
doSomething.methodName=doSomething
doSomething.message=Something was done successfully
And then merge the template with each language prop file during your build (using Ant / Gradle / Maven etc.)

What's the best way to check that the output is stable using junit?

I have a method which takes a data structure and writes to an OutputStream. I'd like to write a unit test which ensures that the output of that function for a given input doesn't change unexpectedly.
I think I have the test structured the right way, using JUnit Theories. My #DataPoints are pairs of identifiers and example data structures, and I have an #Theory that the output from the method will be identical to a file checked in to version control that I fetch using getResourceAsStream using the identifier.
The painful part, however, is generating those files in the first place. My current very ugly solution is a boolean constant called WRITE_TEST_OUTPUT, which is usually false. When I want to regenerate the output files (either because I've changed the list of #DataPoints or the desired behaviour of the method has changed) I set this constant to true, and there's a test which when this is true runs the function again to write the current output files to a directory in /tmp. It then asserts that the constant is false, so I don't accidentally check in a version in which the constant is true.
It's very convenient to have the generation pretend to be a test so I can run it from within Eclipse, but it seems like a horrible kludge. This seems like the sort of thing people must do all the time - what kind of solutions have other people found?

What I'd probably do is detect the file's existence rather than use an externally modified constant. So you'd have something like
public void testMethodOutput() {
if (new File(expectedOutput).exists()) {
// check method output against file
} else {
// generate expected output from current behaviour; test passes automatically
// (saves output to the expected-output file)
}
}
When you want to update the expected output, delete the file, run tests, commit the new version to version control. No source change necessary.
It seems a little strange that the correct output is determined by looking at what a given version of the program does, rather than something you yourself set up (what if that version has a subtle bug?). Possibly that's a hint that your test is too coarse-grained, and that in turn might hint that your method is itself too coarse-grained. (Test smells are nearly always design smells.)