Since DOM is a tree representation of XML and XSLT processors use tree structures as their input, I was wondering: what is the difference between the DOM and the tree structure of an XSLT processor? And how can an XSLT processor such as Saxon be implemented on SAX? Does that mean that it generates the input tree structure from the events that SAX reports? Isn't that a little cumbersome? Why use something that doesn't already provide a tree structure?
Since DOM is a tree representation of XML
While you can use DOM with XML, DOM is most often a tree representation of HTML.
and xslt processors use tree structures as their input,
When I use Saxon at the command line, I give it a file. That file is a sequence of bytes, not a tree. Something has to transform this sequence into a tree.
I was wondering: what is the difference between the DOM and the tree structure of an XSLT processor?
Listing all the differences would take too long, but here are some salient ones:
DOM trees are supposed to conform to a set of standards. They have a public interface. The trees used by XSLT processors do not have to conform to any standard. An XSLT processor could very well create its own tree, which is designed to be used only for its own purposes.
DOM trees are usually structurally mutable. Trees used for XSLT processing could be structurally mutable but are usually used immutably; see the sketch after this list. (By "structurally mutable" I'm talking about adding, removing, and moving elements around. There's no facility in XSLT to modify the input as it is processed.)
DOM trees support things that don't make sense in XSLT processing. For instance, event handlers.
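To make the mutability point concrete, here is a minimal Scala sketch (my own illustration, not part of any standard) that builds a DOM tree through the standard JAXP API and mutates it in place; the internal trees of XSLT processors typically expose no such operations:

    import javax.xml.parsers.DocumentBuilderFactory

    object DomMutationDemo {
      def main(args: Array[String]): Unit = {
        // Build an empty DOM document using the standard JAXP API.
        val doc = DocumentBuilderFactory.newInstance()
          .newDocumentBuilder().newDocument()

        // DOM trees are structurally mutable: nodes can be attached at will...
        val root  = doc.createElement("root")
        val child = doc.createElement("child")
        doc.appendChild(root)
        root.appendChild(child)

        // ...and just as easily detached again.
        root.removeChild(child)
        println(root.getChildNodes.getLength) // prints 0
      }
    }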
And how can an XSLT processor such as Saxon be implemented on SAX? Does that mean that it generates the input tree structure from the events that SAX reports? Isn't that a little cumbersome? Why use something that doesn't already provide a tree structure?
Yes, an XSLT processor would use SAX's events to generate a tree for internal processing. (And it seems this is what Saxon does.) There are good reasons one might use SAX to construct a custom tree rather than, say, use a library that constructs a DOM tree.

Earlier I said that XSLT processors usually treat their tree as structurally immutable. This allows optimizations that a DOM implementation might not be able to perform. Here's an example: if your tree is structurally immutable, you can compute the list of child elements of a node once and for all; it can never change once it is computed. In a DOM tree, however, the list of child elements of a node can change because the tree is mutable, so you have to have provisions to handle changes. Custom trees also do not have to support things that don't make sense for XSLT processing, like events. These kinds of optimizations may lead to an application that takes less memory and runs faster.
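As an illustration of the idea (a sketch only, not Saxon's actual internals), here is how one might fold SAX events into a simple immutable tree in Scala; since a node's child list can never change after construction, it can be computed once and shared freely:

    import javax.xml.parsers.SAXParserFactory
    import org.xml.sax.Attributes
    import org.xml.sax.helpers.DefaultHandler
    import scala.collection.mutable

    // A deliberately simple immutable node type (names invented for this sketch).
    final case class Elem(name: String, children: Vector[Elem])

    // Folds the SAX event stream into the immutable tree using a stack of
    // partially built nodes.
    final class TreeBuilder extends DefaultHandler {
      private val stack = mutable.Stack[(String, mutable.ArrayBuffer[Elem])]()
      var root: Option[Elem] = None

      override def startElement(uri: String, localName: String,
                                qName: String, atts: Attributes): Unit =
        stack.push((qName, mutable.ArrayBuffer.empty[Elem]))

      override def endElement(uri: String, localName: String,
                              qName: String): Unit = {
        val (name, kids) = stack.pop()
        val node = Elem(name, kids.toVector)
        if (stack.isEmpty) root = Some(node) else stack.top._2 += node
      }
    }

    object BuildTree {
      def main(args: Array[String]): Unit = {
        val handler = new TreeBuilder
        SAXParserFactory.newInstance().newSAXParser()
          .parse(new java.io.File(args(0)), handler)
        println(handler.root)
      }
    }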
I have specified a grammar using ANTLR4 in VS Code with the extension by Mike Lischke. I am wondering if there is a way to parse a program that conforms to the grammar and eventually generate some XML tags from it.
Xtext provides this by generating a .xtend file that contains the famous doGenerate method, in which we access the objects and then generate new code.
There's no "write this parse tree out as XML" functionality built into ANTLR.
It would not be too hard to write a listener that produces XML while traversing the parse tree. You'd have to make decisions about which properties to include in your XML, as well as which to make attributes.
Probably, most people wanting to serialize to XML create an AST from the parse tree (parse trees can be rather verbose, depending upon the grammar). With an AST you could even annotate it, and use a library to serialize the AST as XML (something like JAXB, for instance).
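For instance, a quick serializer over ANTLR's generic ParseTree interface (the listener-free variant of the approach above) might look like this Scala sketch; this is not a built-in ANTLR feature, and the rule names come from the generated parser's getRuleNames:

    import org.antlr.v4.runtime.ParserRuleContext
    import org.antlr.v4.runtime.tree.{ParseTree, TerminalNode}

    object ParseTreeXml {
      private def escape(s: String): String =
        s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;")

      // Rule nodes become elements named after the rule; tokens become text.
      def toXml(t: ParseTree, ruleNames: Array[String]): String = t match {
        case term: TerminalNode =>
          escape(term.getText)
        case ctx: ParserRuleContext =>
          val tag  = ruleNames(ctx.getRuleIndex)
          val body = (0 until ctx.getChildCount)
            .map(i => toXml(ctx.getChild(i), ruleNames))
            .mkString
          s"<$tag>$body</$tag>"
        case other =>
          escape(other.getText)
      }
    }

You would call it as, say, ParseTreeXml.toXml(parser.startRule(), parser.getRuleNames), where startRule stands in for your grammar's actual start rule.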
In The Pragmatic Programmer:
Normally, you can simply hide a third-party product behind a well-defined, abstract interface. In fact, we've always been able to do so on any project we've worked on. But suppose you couldn't isolate it that cleanly. What if you had to sprinkle certain statements liberally throughout the code? Put that requirement in metadata, and use some automatic mechanism, such as Aspects (see page 39) or Perl, to insert the necessary statements into the code itself.
Here the author is referring to Aspect Oriented Programming and Perl as tools that support "automatic mechanisms" for inserting metadata.
In my mind I envision some type of run-time injection of code. How does Perl allow for "automatic mechanisms" for inserting metadata?
Skip ahead to the section on Code Generators. The author provides a number of examples of processing input files to generate code, including this one:
Another example of melding environments using code generators happens when different programming languages are used in the same application. In order to communicate, each code base will need some information in common: data structures, message formats, and field names, for example. Rather than duplicate this information, use a code generator. Sometimes you can parse the information out of the source files of one language and use it to generate code in a second language. Often, though, it is simpler to express it in a simpler, language-neutral representation and generate the code for both languages, as shown in Figure 3.4 on the following page. Also see the answer to Exercise 13 on page 286 for an example of how to separate the parsing of the flat file representation from code generation.
The answer to Exercise 13 is a set of Perl programs used to generate C and Pascal data structures from a common input file.
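The book's programs are in Perl; purely to illustrate the same technique, here is a Scala sketch that reads an invented flat format of name:type lines and emits both a C struct and a Pascal record from the one description:

    object GenStructs {
      // Map the neutral type names onto each target language.
      val cTypes      = Map("int" -> "int",     "string" -> "char *")
      val pascalTypes = Map("int" -> "Integer", "string" -> "String")

      def main(args: Array[String]): Unit = {
        // Each input line looks like "name:string" or "count:int".
        val fields = scala.io.Source.fromFile(args(0)).getLines()
          .map(_.trim).filter(_.nonEmpty)
          .map { line => val Array(n, t) = line.split(":"); (n, t) }
          .toList

        val c = fields.map { case (n, t) => s"  ${cTypes(t)} $n;" }
          .mkString("typedef struct {\n", "\n", "\n} Record;\n")
        val pascal = fields.map { case (n, t) => s"    $n : ${pascalTypes(t)};" }
          .mkString("type TRecord = record\n", "\n", "\n  end;\n")

        println(c)
        println(pascal)
      }
    }

Change the neutral description once, rerun the generator, and both code bases stay in sync.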
I am in the process of writing a toy compiler in Scala. The target language itself looks like Scala but is an open field for experimentation.
After several large refactorings I can't find a good way to model my abstract syntax tree. I would like to use the facilities of Scala's pattern matching; the problem is that the tree carries changing information (like types and symbols) along the compilation process.
I can see a couple of solutions, none of which I like:
case classes with mutable fields (I believe the Scala compiler does this): the problem is that those fields are not present at each stage of the compilation and thus have to be nulled (or Option'd), and it becomes really heavy to debug/write code. Moreover, if, for example, I find a node with a null type after the typing phase, I have a really hard time finding the cause of the bug.
a huge trait/case class hierarchy: something like Node, NodeWithSymbol, NodeWithType, ... Seems like a pain to write AND work with
something completely hand-crafted with extractors
I'm also not sure if it is good practice to go with a fully immutable AST, especially in Scala where there is no implicit sharing (because the compiler is not aware of immutability) and it could hurt performance to copy the tree all the time.
Can you think of an elegant pattern to model my tree using Scala's powerful type system?
TL;DR I prefer to keep the AST immutable and carry things like type information in a separate structure, e.g. a Map, that can be referenced by IDs stored in the AST. But there is no perfect answer.
You're by no means the first to struggle with this question. Let me list some options:
1) Mutable structures that get updated at each phase. All the up and downsides you mention.
2) Traits/cake pattern. Feasible, but expensive (there's no sharing) and kinda ugly.
3) A new tree type at each phase. In some ways this is the theoretically cleanest. Each phase can deal only with a structure produced for it by the previous phase. Plus the same approach carries all the way from front end to back end. For instance, you may "desugar" at some point and having a new tree type means that downstream phase(s) don't have to even consider the possibility of node types that are eliminated by desugaring. Also, low level optimizations usually need IRs that are significantly lower level than the original AST. But this is also a lot of code since almost everything has to be recreated at each step. This approach can also be slow since there can be almost no data sharing between phases.
4) Label every node in the AST with an ID and use that ID to reference information in other data structures (maps and vectors and such) that hold information computed for each phase; a sketch follows below. In many ways this is my favorite. It retains immutability, maximizes sharing and minimizes the "excess" code you have to write. But you still have to deal with the potential for "missing" information that can be tricky to debug. It's also not as fast as the mutable option, though faster than any option that requires producing a new tree at each phase.
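Here is option 4 in miniature, as a Scala sketch with invented names: the AST is immutable and carries only IDs, and the typing phase returns a side map instead of mutating nodes:

    object AstWithIds {
      type NodeId = Int

      sealed trait Expr { def id: NodeId }
      final case class Lit(id: NodeId, value: Int)           extends Expr
      final case class Add(id: NodeId, lhs: Expr, rhs: Expr) extends Expr

      sealed trait Type
      case object IntType extends Type

      // The typing phase never touches the tree; it accumulates a map.
      def typeCheck(e: Expr,
                    acc: Map[NodeId, Type] = Map.empty): Map[NodeId, Type] =
        e match {
          case Lit(id, _) => acc + (id -> IntType)
          case Add(id, lhs, rhs) =>
            typeCheck(rhs, typeCheck(lhs, acc)) + (id -> IntType)
        }

      def main(args: Array[String]): Unit = {
        val tree  = Add(0, Lit(1, 2), Lit(2, 40))
        val types = typeCheck(tree)
        // A phase that finds no entry for an ID fails loudly at types(id),
        // which is easier to trace than a silently null field.
        println(types(tree.id)) // IntType
      }
    }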
I recently started writing a toy verifier for a small language, and I am using the Kiama library for the parser, resolver and type checker phases.
Kiama is a Scala library for language processing. It enables convenient analysis and transformation of structured data. The programming styles supported by the library are based on well-known formal language processing paradigms, including attribute grammars, tree rewriting, abstract state machines, and pretty printing.
I'll try to summarise my (fairly limited) experience:
[+] Kiama comes with several examples, and the main contributor usually responds quickly to questions asked on the mailing list
[+] The attribute grammar paradigm allows for a nice separation into "immutable components" of the nodes, e.g., names and subnodes, and "mutable components", e.g., type information
[+] The library comes with a versatile rewriting system which, so far, has covered all my use cases
[+] Parts of the library, e.g., the pretty printer, make nice examples of DSLs and of various functional patterns/approaches/ideas
[-] The learning curve is definitely steep, even with examples and the mailing list at hand
[-] Implementing the resolving phase in a "purely functional" style (cf. my question) seems tricky, but a hybrid approach (which I haven't tried yet) seems to be possible
[-] The attribute grammar paradigm and the resulting separation of concerns doesn't make it obvious how to document the properties nodes have in the end (cf. my question)
[-] Rumour has it that the attribute grammar paradigm does not yield the fastest implementations
Summarising my summary, I enjoy using Kiama a lot and I strongly recommend that you give it a try, or at least have a look at the examples.
(PS. I am not affiliated with Kiama)
I'm using the scala^Z3 tool for a small library that (among other things) prints the constraints of a Z3Context in LaTeX format. While it's possible to traverse the Z3AST and LaTeX-ify the expressions by string comparison, it would be much nicer to use the object structure of the z3.scala.dsl package. Is there a way to obtain a z3.scala.dsl.Tree from a Z3AST?
It's true that the DSL is currently "write only", in that you can use it to create trees and ship them to Z3 but not to read them back.
The standard way to read Z3 trees is to use getASTKind and getDeclKind from Z3Context. The classes that represent the results are Z3ASTKind and Z3DeclKind respectively. (Since most trees are applications, the latter is where most of the information is).
It looks like the way to do this is to create the original constraints using z3.scala.dsl, then add each constraint using Z3Context.assertCnstr(tree: Tree[BoolSort]). This way I have the whole DSL tree for easy transformation to LaTeX. For some reason the examples on the scala^Z3 website assemble the AST without using the DSL at all, so this alternative wasn't obvious.
Are there any tools which diff hierarchies?
For example, consider the following hierarchy:
A has child B.
B has child C.
which is compared to:
A has child B.
A has child C.
I would like a tool that shows that C has moved from a child of B to a child of A. Do any such utilities exist? If there are no specific tools, I'm not opposed to writing my own, so what are some good algorithms which are applicable to this problem?
A great general resource for diffing hierarchies (not specifically XML, HTML, etc) is the Hierarchical-Diff github project based on a bit of Dartmouth research. They have a pretty extensive list of related work ranging from XML diffing, to configuration file diffing to HTML diffing.
In general, actually performing diffs/patches on tree structures is a fairly well-solved problem, but displaying those diffs in a manner that makes sense to humans is still the wild west. That's doubly true when your data structure already has some semantic meaning, like with HTML.
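If labels are unique, one simple starting point (a sketch of my own, not one of the algorithms from the research above) is to record each node's parent in both versions and report nodes whose parent changed:

    object TreeMoveDiff {
      final case class Node(label: String, children: List[Node] = Nil)

      // Depth-first walk collecting child -> parent edges.
      def parents(root: Node): Map[String, String] = {
        def walk(n: Node): List[(String, String)] =
          n.children.flatMap(c => (c.label -> n.label) :: walk(c))
        walk(root).toMap
      }

      def main(args: Array[String]): Unit = {
        val before = Node("A", List(Node("B", List(Node("C")))))
        val after  = Node("A", List(Node("B"), Node("C")))

        val (pBefore, pAfter) = (parents(before), parents(after))
        for ((label, oldParent) <- pBefore;
             newParent <- pAfter.get(label) if oldParent != newParent)
          println(s"$label moved from child of $oldParent to child of $newParent")
        // prints: C moved from child of B to child of A
      }
    }

Real differencers also have to handle renames, duplicates, insertions and deletions, which is where the algorithms from the literature come in.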
You might consider our SmartDifferencer tools.
These tools compare computer source code files in a diff-like way. Unlike diff, which is line-oriented, these tools see changes according to code structure (variable name, expression, statement, block, function, class, etc.) as plausible edits ("move, insert, delete, replace, copy, rename"), producing answers that make sense to programmers.
These computer source codes have exactly the "hierarchy" structure you are suggesting; the various constructs nest. Specifically to your topic, typically code blocks can nest inside code blocks. The SmartDifferencer tools use target-language accurate parsers to "deconstruct" the source text into these hierarchical entities. We have a Smart Differencer for XML in which you can obviously write nested tags.
The answer isn't reported as "the Nth child of M has moved", although it is actually computed that way by operating on the parse trees produced by the parsers. Rather, it is reported as "code fragment of type at line x col y to line a col b has moved/..."
The answer, my good sir, is: depth-first search, also known as depth-first traversal. You might find some use for the Visitor pattern.
You can't swing a dead cat without hitting some sort of implementation for this when dealing with comparing XML trees. Take a gander at diffxml for an example.
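For the traversal itself, a visitor-style depth-first walk is only a few lines; here is a minimal Scala sketch (names invented for illustration):

    object DfsVisitor {
      final case class Node(label: String, children: List[Node] = Nil)

      trait Visitor { def visit(node: Node, depth: Int): Unit }

      // Pre-order depth-first traversal: visit a node, then recurse.
      def dfs(node: Node, visitor: Visitor, depth: Int = 0): Unit = {
        visitor.visit(node, depth)
        node.children.foreach(dfs(_, visitor, depth + 1))
      }

      def main(args: Array[String]): Unit = {
        val tree = Node("A", List(Node("B", List(Node("C")))))
        dfs(tree, (node, depth) => println("  " * depth + node.label))
      }
    }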