I want to parse Scala grammar with flex and bison. But I don't know how to parse the newline token in Scala grammar.
If I parse newline as a token T_NL, Here's the Toy.l for example:
...
[a-zA-Z_][a-zA-Z0-9_]* { yylval->literal = strdup(yy_text); return T_ID; }
\n { yylval->token = T_LN; return T_LN; }
[ \t\v\f\r] { /* skip whitespaces */ }
...
And here's the Toy.y for example:
function_def: 'def' T_ID '(' argument_list ')' return_expression '=' expression T_NL
;
argument_list: argument
| argument ',' argument_list
;
expression: ...
;
return_expression: ...
;
You could see that I have to skip T_NL in all other statements and definitions in Toy.y, which is really boring.
Please educate me with source code example!
This is a clear case where bison push-parsers are useful. The basic idea is that the decision to send an NL token (or tokens) can only be made when the following token has been identified (and, in one corner case, the second following token).
The advantage of push parsers is that they let us implement strategies like this, where there is not necessarily a one-to-one relationship between input lexemes and tokens sent to the parser. I'm not going to deal with all the particularities of setting up a push parser (though it's not difficult); you should refer to the bison manual for details. [Note 1]
First, it's important to read the Scala language description with care. Newline processing is described in section 2.13:
A newline in a Scala source text is treated as the special token “nl” if the three following criteria are satisfied:
The token immediately preceding the newline can terminate a statement.
The token immediately following the newline can begin a statement.
The token appears in a region where newlines are enabled.
Rules 1 and 2 are simple lookup tables, which are precisely defined in the following two paragraphs. There is just one minor exception to rule 2 has a minor exception, described below:
A case token can begin a statement only if followed by a class or object token.
One hackish possibility to deal with that exception would be to add case[[:space:]]+class and case[[:space:]]+object as lexemes, on the assumption that no-one will put a comment between case and class. (Or you could use a more sophisticated pattern, which allows comments as well as whitespace.) If one of these lexemes is recognised, it could either be sent to the parser as a single (fused) token, or it could be sent as two tokens using two invocations of SEND in the lexer action. (Personally, I'd go with the fused token, since once the pair of tokens has been recognised, there is no advantage to splitting them up; afaik, there's no valid program in which case class can be parsed as anything other than case class. But I could be wrong.)
To apply rules one and two, then, we need two lookup tables indexed by token number: token_can_end_stmt and token_cannot_start_stmt. (The second one has its meaning reversed because most tokens can start statements; doing it this way simplifies initialisation.)
/* YYNTOKENS is defined by bison if you specify %token-tables */
static bool token_can_end_stmt[YYNTOKENS] = {
[T_THIS] = true, [T_NULL] = true, /* ... */
};
static bool token_cannot_start_stmt[YYNTOKENS] = {
[T_CATCH] = true, [T_ELSE] = true, /* ... */
};
We're going to need a little bit of persistent state, but fortunately when we're using a push parser, the scanner does not need to return to its caller every time it recognises a token, so we can keep the persistent state as local variables in the scan loop. (That's another advantage of the push-parser architecture.)
From the above description, we can see that what we're going to need to maintain in the scanner state are:
some indication that a newline has been encountered. This needs to be a count, not a boolean, because we might need to send two newlines:
if two tokens are separated by at least one completely blank line (i.e a line which contains no printable characters), then two nl tokens are inserted.
A simple way to handle this is to simply compare the current line number with the line number at the previous token. If they are the same, then there was no newline. If they differ by only one, then there was no blank line. If they differ by more than one, then there was either a blank line or a comment (or both). (It seems odd to me that a comment would not trigger the blank line rule, so I'm assuming that it does. But I could be wrong, which would require some adjustment to this scanner.) [Note 2]
the previous token (for rule 1). It's only necessary to record the token number, which is a simple small integer.
some way of telling whether we're in a "region where newlines are enabled" (for rule 3). I'm pretty sure that this will require assistance from the parser, so I've written it that way here.
By centralising the decision about sending a newline into a single function, we can avoid a lot of code duplication. My typical push-parser architecture uses a SEND macro anyway, to deal with the boilerplate of saving the semantic value, calling the parser, and checking for errors; it's easy to add the newline logic there:
// yylloc handling mostly omitted for simplicity
#define SEND_VALUE(token, tag, value) do { \
yylval.tag = value; \
SEND(token); \
} while(0);
#define SEND(token) do { \
int status = YYPUSH_MORE; \
if (yylineno != prev_lineno) \
&& token_can_end_stmt[prev_token] \
&& !token_cannot_start_stmt[token] \
&& in_new_line_region) { \
status = yypush_parse(ps, T_NL, NULL, &yylloc, &nl_enabled); \
if (status == YYPUSH_MORE \
&& yylineno - prev_lineno > 1) \
status = yypush_parse(ps, T_NL, NULL, &yylloc, &nl_enabled); \
} \
nl_encountered = 0; \
if (status == YYPUSH_MORE) \
status = yypush_parse(ps, token, &yylval, &yylloc, &nl_enabled); \
if (status != YYPUSH_MORE) return status; \
prev_token = token; \
prev_lineno = yylineno; \
while (0);
Specifying the local state in the scanner is extremely simple; just place the declarations and initialisations at the top of your scanner rules, indented. Any indented code prior to the first rule is inserted directly into yylex, almost at the top of the function (so it is executed once per call, not once per lexeme):
%%
int nl_encountered = 0;
int prev_token = 0;
int prev_lineno = 1;
bool nl_enabled = true;
YYSTYPE yylval;
YYLTYPE yylloc = {0};
Now, the individual rules are pretty simple (except for case). For example, we might have rules like:
"while" { SEND(T_WHILE); }
[[:lower:]][[:alnum:]_]* { SEND_VALUE(T_VARID, str, strdup(yytext)); }
That still leaves the question of how to determine if we are in a region where newlines are enabled.
Most of the rules could be handled in the lexer by just keeping a stack of different kinds of open parentheses, and checking the top of the stack: If the parenthesis on the top of the stack is a {, then newlines are enabled; otherwise, they are not. So we could use rules like:
[{([] { paren_stack_push(yytext[0]); SEND(yytext[0]); }
[])}] { paren_stack_pop(); SEND(yytext[0]); }
However, that doesn't deal with the requirement that newlines be disabled between a case and its corresponding =>. I don't think it's possible to handle that as another type of parenthesis, because there are lots of uses of => which do not correspond with a case and I believe some of them can come between a case and it's corresponding =>.
So a better approach would be to put this logic into the parser, using lexical feedback to communicate the state of the newline-region stack, which is what is assumed in the calls to yypush_parse above. Specifically, they share one boolean variable between the scanner and the parser (by passing a pointer to the parser). [Note 3] The parser then maintains the value of this variable in MRAs in each rule which matches a region of potentially different newlinedness, using the parse stack itself as a stack. Here's a small excerpt of a (theoretical) parser:
%define api.pure full
%define api.push-pull push
%locations
%parse-param { bool* nl_enabled; }
/* More prologue */
%%
// ...
/* Some example productions which modify nl_enabled: */
/* The actions always need to be before the token, because they need to take
* effect before the next lookahead token is requested.
* Note how the first MRA's semantic value is used to keep the old value
* of the variable, so that it can be restored in the second MRA.
*/
TypeArgs : <boolean>{ $$ = *nl_enabled; *nl_enabled = false; }
'[' Types
{ *nl_enabled = $1; } ']'
CaseClause : <boolean>{ $$ = *nl_enabled; *nl_enabled = false; }
"case" Pattern opt_Guard
{ *nl_enabled = $1; } "=>" Block
FunDef : FunSig opt_nl
<boolean>{ $$ = *nl_enabled; *nl_enabled = true; }
'{' Block
{ *nl_enabled = $3; } '}'
Notes:
Push parsers have many other advantages; IMHO they are they are the solution of choice. In particular, using push parsers avoids the circular header dependency which plagues attempts to build pure parser/scanner combinations.
There is still the question of multiline comments with preceding and trailing text:
return /* Is this comment
a newline? */ 42
I'm not going to try to answer that question.)
It would be possible to keep this flag in the YYLTYPE structure, since only one instance of yylloc is ever used in this example. That might be a reasonable optimisation, since it cuts down on the number of parameters sent to yypush_parse. But it seemed a bit fragile, so I opted for a more general solution here.
Related
I am using this to find customer name in text file. Names are each on a separate line. I need to find exact name. If searching for Nick specifically it should find Nick only but my code will say found even if only Nickolson is in te list.
On*:text:*!Customer*:#: {
if ($read(system\Customer.txt,$2)) {
.msg $chan $2 Customer found in list! | halt }
else { .msg $chan 4 $2 Customer not found in list. | halt }
}
You have to loop through every matching line and see if the line is an exact match
Something like this
On*:text:*!Custodsddmer*:#: {
var %nick
; loop over all lines that contains nick
while ($read(customer.txt, nw, *nick*, $calc($readn + 1))) {
; check if the line is an exact match
if ($v1 == nick) {
%nick = $v1
; stop the loop because a result is found
break;
}
}
if (%nick == $null) {
.msg $chan 4 $2 Customer not found in list.
}
else{
.msg $chan $2 Customer found in list!
}
You can find more here: https://en.wikichip.org/wiki/mirc/text_files#Iterating_Over_Matches
If you're looking for exact match in a new line separate list, then you can use the 'w' switch without using wildcard '*' character.
From mIRC documentation
$read(filename, [ntswrp], [matchtext], [N])
Scans the file info.txt for a line beginning with the word mirc and
returns the text following the match value. //echo $read(help.txt, w,
*help*)
Because we don't want the wildcard matching, but a exact match, we would use:
$read(customers.txt, w, Nick)
Complete Code:
ON *:TEXT:!Customer *:#: {
var %foundInTheList = $read(system\Customer.txt, w, $2)
if (%foundInTheList) {
.msg # $2 Customer found in list!
}
else {
.msg 4 # $2 Customer not found in list.
}
}
Few remarks on Original code
Halting
halt should only use when you forcibly want to stop any future processing to take place. In most cases, you can avoid it, by writing you code flow in a way it will behave like that without explicitly using halting.
It will also resolve new problems that may arise, in case you will want to add new code, but you will wonder why it isn't executing.. because of the darn now forgotten halt command.
This will also improve you debugging, in the case it will not make you wonder on another flow exit, without you knowing.
Readability
if (..) {
.... }
else { .. }
When considering many lines of codes inside the first { } it will make it hard to notice the else (or elseif) because mIRC remote parser will put on the same identification as the else line also the line above it, which contains the closing } code. You should almost always few extra code in case of readability, especially which it costs new nothing!, as i remember new lines are free of charge.
So be sure the to have the rule of thump of every command in a new line. (that includes the closing bracket)
Matching Text
On*:text:*!Customer*:#: {
The above code has critical problem, and bug.
Critical: Will not work, because on*:text contains no space between on and *:text
Bug: !Customer will match EVERYTHING-BEFORE!customerANDAFTER <NICK>, which is clearly not desired behavior. What you want is :!Customer *: will only match if the first word was !customer and you must enter at least another text, because I've used [SPACE]*.
I have a publication, essentially what's below:
Meteor.publish('entity-filings', function publishFunction(cik, queryArray, limit) {
if (!cik || !filingsArray)
console.error('PUBLICATION PROBLEM');
var limit = 40;
var entityFilingsSelector = {};
if (filingsArray.indexOf('all-entity-filings') > -1)
entityFilingsSelector = {ct: 'filing',cik: cik};
else
entityFilingsSelector = {ct:'filing', cik: cik, formNumber: { $in: filingsArray} };
return SB.Content.find(entityFilingsSelector, {
limit: limit
});
});
I'm having trouble with the filingsArray part. filingsArray is an array of regexes for the Mongo $in query. I can hardcode filingsArray in the publication as [/8-K/], and that returns the correct results. But I can't get the query to work properly when I pass the array from the router. See the debugged contents of the array in the image below. The second and third images are the client/server debug contents indicating same content on both client and server, and also identical to when I hardcode the array in the query.
My question is: what am I missing? Why won't my query work, or what are some likely reasons it isn't working?
In that first screenshot, that's a string that looks like a regex literal, not an actual RegExp object. So {$in: ["/8-K/"]} will only match literally "/8-K/", which is not the same as {$in: [/8-K/]}.
Regexes are not EJSON-able objects, so you won't be able to send them over the wire as publish function arguments or method arguments or method return values. I'd recommend sending a string, then inside the publish function, use new RegExp(...) to construct a regex object.
If you're comfortable adding new methods on the RegExp prototype, you could try making RegExp an EJSON-able type, by putting this in your server and client code:
RegExp.prototype.toJSONValue = function () {
return this.source;
};
RegExp.prototype.typeName = function () {
return "regex";
}
EJSON.addType("regex", function (str) {
return new RegExp(str);
});
After doing this, you should be able to use regexes as publish function arguments, method arguments and method return values. See this meteorpad.
/8-K/.. that's a weird regex. Try /8\-K/.
A minus (-) sign is a range indicator and usually used inside square brackets. The reason why it's weird because how could you even calculate a range between 8 and K? If you do not escape that, it probably wouldn't be used to match anything (thus your query would not work). Sometimes, it does work though. Better safe than never.
/8\-K/ matches the string "8-K" anywhere once.. which I assume you are trying to do.
Also it would help if you would ensure your publication would always return something.. here's a good area where you could fail:
if (!cik || !filingsArray)
console.error('PUBLICATION PROBLEM');
If those parameters aren't filled, console.log is probably not the best way to handle it. A better way:
if (!cik || !filingsArray) {
throw "entity-filings: Publication problem.";
return false;
} else {
// .. the rest of your publication
}
This makes sure that the client does not wait unnecessarily long for publications statuses as you have successfully ensured that in any (input) case you returned either false or a Cursor and nothing in between (like surprise undefineds, unfilled Cursors, other garbage data.
I saw a very manual way of doing this in another post: How do I add a query parameter to a URL?
This doesn't seem very intuitive, but someone there mentioned an easier way to accomplish this using the upcoming "URL scope". Is this feature out yet, and how would I use it?
If you're using the stdlib mixer, you should be able to use the URL scope which provides helper functions for adding, viewing, editing, and removing URL params. Here's a quick example:
$original_url = "http://cuteoverload.com/2013/08/01/buttless-monkey-jams?hi=there"
$new_url = url($original_url) {
log(param("hi"))
param("hello", "world")
remove_param("hi")
}
log($new_url)
Tritium Tester example here: http://tester.tritium.io/9fcda48fa81b6e0b8700ccdda9f85612a5d7442f
Almost forgot, link to docs: http://tritium.io/current (You'll want to click on the URL category).
AFAIK, there's no built-in way of doing so.
I'll post here how I did to append a query param, making sure that it does not get duplicated if already on the url:
Inside your functions/main.ts file, you can declare:
# Adds a query parameter to the URL string in scope.
# The parameter is added as the last parameter in
# the query string.
#
# Sample use:
# $("//a[#id='my_link]") {
# attribute("href") {
# value() {
# appendQueryParameter('MVWomen', '1')
# }
# }
# }
#
# That will add MVwomen=1 to the end of the query string,
# but before any hash arguments.
# It also takes care of deciding if a ? or a #
# should be used.
#func Text.appendQueryParameter(Text %param_name, Text %param_value) {
# this beautiful regex is divided in three parts:
# 1. Get anything until a ? or # is found (or we reach the end)
# 2. Get anything until a # is found (or we reach the end - can be empty)
# 3. Get the remainder (can be empty)
replace(/^([^#\?]*)(\?[^#]*)?(#.*)?$/) {
var('query_symbol', '?')
match(%2, /^\?/) {
$query_symbol = '&'
}
# first, it checks if the %param_name with this %param_value already exists
# if so, we don't do anything
match_not(%2, concat(%param_name, '=', %param_value)) {
# We concatenate the URL until ? or # (%1),
# then the query string (%2), which can be empty or not,
# then the query symbol (either ? or &),
# then the name of the parameter we are appending,
# then an equals sign,
# then the value of the parameter we are appending
# and finally the hash fragment, which can be empty or not
set(concat(%1, %2, $query_symbol, %param_name, '=', %param_value, %3))
}
}
}
The other features you want (remove, modify) can be achieved similarly (by creating a function inside functions/main.ts and leveraging some regex magic).
Hope it helps.
I want to know whether is it possible to reject numeric phrases or numeric values while indexing or searching in Lucene.net.
For example (this is one line),
Hi all my no is 4756396
Now, when I index or search it should reject the numeric value 4756396 to be indexed or searched. I tried making a custom stop word list with 1, 2, 3, 4, 5, 6, etc, but I guess it will only ignore if a single number will appears.
You can copy the StandardAnalyzer and customize the grammar (simple JFlex stuff) to reject number tokens. If you do that, you'll need to port back the analyzer to Java since JFlex will generate java code, tho you could give it a try with C# Flex.
You could also write a TokenFilter that scans tokens one by one and rejects them if they are numbers. If you wanna filter only whole numbers and still retain numbers that are for example separate by hyphens, the filter could simply attempt a double.TryParse() and if it fails you accept the Token. A more robust and customizable solution would still use a lexical parser.
Edit:
Heres a quick sample of what I mean, with a little main method that shows how to use it. In this I used a TryParse() to filter out tokens, if it were for a more complex production system I'd use a lexical parser system. (take a look at C# Flex for that)
public class NumericFilter : TokenFilter
{
private ITermAttribute termAtt ;
public NumericFilter(TokenStream tokStream)
: base(tokStream)
{
termAtt = AddAttribute<ITermAttribute>();
}
public override bool IncrementToken()
{
while (base.input.IncrementToken())
{
string term = termAtt.Term;
double res ;
if(double.TryParse(term, out res))
{
// skip this token
continue;
}
// accept this token
return true;
}
// no more token in the stream
return false;
}
}
static void Main(string[] args)
{
RAMDirectory dir = new RAMDirectory();
IndexWriter iw = new IndexWriter(dir, new KeywordAnalyzer(), IndexWriter.MaxFieldLength.UNLIMITED);
Document d = new Document();
Field f = new Field("text", "", Field.Store.YES, Field.Index.ANALYZED);
d.Add(f);
// use our Filter here
f.SetTokenStream(new NumericFilter(new LowerCaseFilter(new WhitespaceTokenizer(new StringReader("I have 300 dollars")))));
iw.AddDocument(d);
iw.Commit();
IndexReader reader = iw.GetReader();
// print all terms in the text field
TermEnum terms = reader.Terms(new Term("text", ""));
do
{
Console.WriteLine(terms.Term.Text);
}
while (terms.Next());
reader.Dispose();
iw.Dispose();
Console.ReadLine();
Environment.Exit(42);
}
I've coded a parser based on Scala parser combinators:
class SxmlParser extends RegexParsers with ImplicitConversions with PackratParsers {
[...]
lazy val document: PackratParser[AstNodeDocument] =
((procinst | element | comment | cdata | whitespace | text)*) ^^ {
AstNodeDocument(_)
}
[...]
}
object SxmlParser {
def parse(text: String): AstNodeDocument = {
var ast = AstNodeDocument()
val parser = new SxmlParser()
val result = parser.parseAll(parser.document, new CharArrayReader(text.toArray))
result match {
case parser.Success(x, _) => ast = x
case parser.NoSuccess(err, next) => {
tool.die("failed to parse SXML input " +
"(line " + next.pos.line + ", column " + next.pos.column + "):\n" +
err + "\n" +
next.pos.longString)
}
}
ast
}
}
Usually the resulting parsing error messages are rather nice. But sometimes it becomes just
sxml: ERROR: failed to parse SXML input (line 32, column 1):
`"' expected but `' found
^
This happens if a quote characters is not closed and the parser reaches the EOT. What I would like to see here is (1) what production the parser was in when it expected the '"' (I've multiple ones) and (2) where in the input this production started parsing (which is an indicator where the opening quote is in the input). Does anybody know how I can improve the error messages and include more information about the actual internal parsing state when the error happens (perhaps something like a production rule stacktrace or whatever can be given reasonably here to better identify the error location). BTW, the above "line 32, column 1" is actually the EOT position and hence of no use here, of course.
I don't know yet how to deal with (1), but I was also looking for (2) when I found this webpage:
https://wiki.scala-lang.org/plugins/viewsource/viewpagesrc.action?pageId=917624
I'm just copying the information:
A useful enhancement is to record the input position (line number and column number) of the significant tokens. To do this, you must do three things:
Make each output type extend scala.util.parsing.input.Positional
invoke the Parsers.positioned() combinator
Use a text source that records line and column positions
and
Finally, ensure that the source tracks positions. For streams, you can simply use scala.util.parsing.input.StreamReader; for Strings, use scala.util.parsing.input.CharArrayReader.
I'm currently playing with it so I'll try to add a simple example later
In such cases you may use err, failure and ~! with production rules designed specifically to match the error.