How to properly parse a multi-character token in a tree-sitter scanner function

Tree-sitter allows you to use an external scanner for those tokens that are tricky to parse or that depend on specific states like multiline strings.
The scanner takes a convenient lexer object with several methods that allow you to "scan" the document looking for the proper token characters.
Two of the key parts of this lexer are lookahead, which tells you the next character the lexer is looking at, and advance, which moves the lexer pointer to the next character.
However, after reading the docs and several other parsers that make use of this, it is still not clear to me whether calling these methods will "affect" the overall tree-sitter parse or whether they are just local to my function invocation.
Especially tricky is parsing a multi-character token (more than two characters, in fact), because you need to "advance" the lexer, consuming characters that may turn out to belong to other tokens. One possible escape is to just return false after consuming them and let tree-sitter move on to the next step of the parse, but that may skip other valid tokens that depend on the characters I already consumed.
Of course, I could move this parsing to the bottom of the scan function, but then other, shorter tokens might shadow this longer one and also produce an incorrect parse.
As far as I know, there is no way to "rewind" the parser to undo the "consumption" of the characters, so I am not sure how to deal with this.
The tokens that I'm trying to parse are {js| for string opening and |js} for string closing.
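For illustration, here is a minimal sketch in C of how a scan function for these two delimiters might be structured, using only the documented TSLexer fields (lookahead, advance, mark_end, result_symbol). The token names, the scan_delimiter helper, and the "mylang" language name are placeholders invented for the example, not taken from any real grammar:

#include <stdbool.h>
#include "tree_sitter/parser.h"

// Hypothetical token names for this sketch; in a real grammar they would
// mirror the order of the externals array in grammar.js.
enum TokenType {
  STRING_OPEN,   // {js|
  STRING_CLOSE,  // |js}
};

// Try to match an exact multi-character delimiter, advancing the lexer
// one character at a time. Returns true only if every character matched.
static bool scan_delimiter(TSLexer *lexer, const char *delimiter) {
  for (const char *c = delimiter; *c != '\0'; c++) {
    if (lexer->lookahead != *c) return false;
    lexer->advance(lexer, false);  // false = not skipping whitespace
  }
  return true;
}

// "mylang" is a placeholder for the grammar's actual name.
bool tree_sitter_mylang_external_scanner_scan(void *payload, TSLexer *lexer,
                                              const bool *valid_symbols) {
  if (valid_symbols[STRING_OPEN] && lexer->lookahead == '{') {
    if (scan_delimiter(lexer, "{js|")) {
      lexer->mark_end(lexer);            // token ends exactly here
      lexer->result_symbol = STRING_OPEN;
      return true;
    }
    return false;                        // delimiter did not match; no token
  }

  if (valid_symbols[STRING_CLOSE] && lexer->lookahead == '|') {
    if (scan_delimiter(lexer, "|js}")) {
      lexer->mark_end(lexer);
      lexer->result_symbol = STRING_CLOSE;
      return true;
    }
    return false;
  }

  return false;
}

The part I can state with confidence from the documentation is the mark_end behaviour: by default every character you advance past becomes part of the token, but after mark_end has been called, further advance calls only peek ahead and do not grow the returned token, which is the usual way a scanner looks past the end of a candidate token without committing those characters to it.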

Related

Parse::RecDescent: does it only operate on a string?

I was looking at using Parse::RecDescent to parse some large files. I was figuring I'd pass it a token at a time. After looking at it a while, it appears that the tokenizer is built into it and you have to pass it the whole string up front. Is this correct?
Yes. You normally pass the complete text to be parsed as a string.
However, note that it's documented that if you pass the text as a reference:
$parser->startrule(\$text);
then the matched portion of $text will be removed, leaving only what did not match. It may be possible to design your grammar so that you can parse a file in chunks.

Can RegexParser support patterns with (white)spaces in them?

We want to create regex patterns with whitespace. However, that seems to conflict with the token parsing done by the RegexParser: the pieces of the input character stream are broken into separate tokens before the individual Rules (/Parsers) ever see the inputs. Therefore the Rules will never be able to match their intended inputs.
Is there any workaround or suggested approach for this?
The RegexParser skips whitespace by default:
The parsing methods call the method skipWhitespace (defaults to true) and, if true, skip any whitespace before each parser is called.
RegexParser Doc
Overriding skipWhitespace() should fix your issue.

lex default token definition syntax

I guess this is a simple question, but I have found no reference. I have a small lex file defining some tokens from a string and altering them (actually converting them to uppercase).
Basically it is a list of commands like this:
word {setToUppercase(yytext);}
Where setToUppercase is a procedure to change case and store it.
I need to have the complete input string with the altered words. Is there a way to define a default token / rest-of-tokens rule so I can associate them with unaltered storage in an output string?
You can do that in one shot with:
.|\n {save_str(yytext);}
I said it was an easy one.
. {save_str(yytext);}
\n {save_str(yytext);}
This way all characters, including newlines, are handled.

Encoding special chars in XSLT output

I have built a set of scripts, part of which transform XML documents from one vocabulary to a subset of the document in another vocabulary.
For reasons that are opaque to me, but apparently non-negotiable, the target platform (Java-based) requires the output document to have 'encoding="UTF-8"' in the XML declaration, but some special characters within text nodes must be encoded with their hex Unicode value - e.g. '”' must be replaced with '&#x201D;' and so forth. I have not been able to acquire a definitive list of which chars must be encoded, but it does not appear to be as simple as "all non-ASCII".
Currently, I have a horrid mess of VBScript using ADODB to directly check each line of the output file after processing, and replace characters where necessary. This is painfully slow, and unsurprisingly some characters get missed (and are consequently nuked by the target platform).
While I could waste time "refining" the VBScript, the long-term aim is to get rid of that entirely, and I'm sure there must be a faster and more accurate way of achieving this, ideally within the XSLT stage itself.
Can anyone suggest any fruitful avenues of investigation?
(edit: I'm not convinced that character maps are the answer - I've looked at them before, and unless I'm mistaken, since my input could conceivably contain any unicode character, I would need to have a map containing all of them except the ones I don't want encoded...)
<xsl:output encoding="us-ascii"/>
Tells the serialiser that it has to produce ASCII-compatible output. That should force it to produce character references for all non-ASCII characters in text content and attribute values. (Should there be non-ASCII in other places like tag or attribute names, serialisation will fail.)
Well, with XSLT 2.0, which you have tagged your post with, you can use a character map; see http://www.w3.org/TR/xslt20/#character-maps.

How can I encode spaces as %20 in a POST from WWW::Mechanize?

I'm using WWW::Mechanize to do some standard website traversal, but at one point I have to construct a special POST request and send it off. All this requires session cookies.
In the POST request I'm making, spaces are being encoded to + symbols, but I need them encoded as a %20.
I can't figure out how to alter this behaviour. I realise that they are equivalent, but for reasons that are out of my hands, this is what I have to do.
Thanks for any help.
This is hard-coded in URI::_query::query_form(). It translates the spaces to +.
$val =~ s/ /+/g;
It then calls URI::_query::query with the joined pairs, where the only + signs should be encoded spaces. The easiest thing to do is probably to intercept calls to URI::_query::query with Hook::LexWrap, modify the argument before the call starts so you can turn + into %20, and go on from there.
A little bit more annoying would be to redefine URI::_query::query. It's not that long, and you just need to add some code at the beginning of the subroutine to transform the arguments before it continues.
Or, you can fix the broken parser on the other side. :)
I have a couple chapters on dealing with method overriding and dynamic subroutines in Mastering Perl. The trick is to do it without changing the original source so you don't introduce new problems for everyone else.
This appears to be hardcoded in URI::_query::query_form(). I'd conditionally modify that based on a global as is done with $URI::DEFAULT_QUERY_FORM_DELIMITER and submit your change to the URI maintainer.
Other than that, perhaps you could use an LWP::UserAgent request_prepare callback handler?