JSONata: words to lowerCamelCase

I have a string consisting of words and punctuation, such as "Accept data protection terms / conditions (German)". I need to normalize it to camelCase, removing punctuation.
My closest attempt so far fails to camelCase the words; I only manage to turn them into kebab-case or snake_case:
$normalizeId := function($str) <s:s> {
    $str.$lowercase()
        .$replace(/\s+/, '-')
        .$replace(/[^-a-zA-Z0-9]+/, '')
};

Anindya's answer works for your example input, but if (German) was not capitalized, it would result in the incorrect output:
"acceptDataProtectionTermsConditionsgerman"
Link to playground
This version would work and prevent that bug:
(
    $normalizeId := function($str) <s:s> {
        $str
        /* normalize everything to lowercase */
        .$lowercase()
        /* replace any punctuation with a - */
        .$replace(/[^-a-zA-Z0-9]+/, '-')
        /* find each letter with a dash in front,
           strip the dash and uppercase the letter */
        .$replace(/-(.)/, function($m) { $m.groups[0].$uppercase() })
        /* clean up any leftover dashes */
        .$replace("-", '')
    };
    $normalizeId($$)
    /* OUTPUT: "acceptDataProtectionTermsConditionsGerman" */
)
Link to playground
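As a quick check (reusing the $normalizeId definition above, just with a literal argument), the lowercase-first approach gives the same result even when "(German)" arrives entirely in lowercase:
$normalizeId("accept data protection terms / conditions (german)")
/* OUTPUT: "acceptDataProtectionTermsConditionsGerman" */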

You should target the letters which have a space in front, and capitalize them by using a regex like /\s(.)/. (In the snippet below, punctuation is first collapsed into dashes, so the regex targets /-(.)/ instead.)
Here is my snippet (edited):
(
    $upper := function($a) {
        $a.groups[0].$uppercase()
    };
    $normalizeId := function($str) <s:s> {
        $str.$lowercase()
            .$replace(/[^-a-zA-Z0-9]+/, '-')
            .$replace(/-(.)/, $upper)
            .$replace(/-/, '')
    };
    $normalizeId("Accept data protection terms / conditions (German)");
)
/* OUTPUT: "acceptDataProtectionTermsConditionsGerman" */
Edit: Thanks @vitorbal. The "$lower" function in the earlier regex replacement was not necessary, and it did not handle the scenario you mentioned. Thanks for pointing that out. I have updated my snippet and added a link to the playground below.
Link to playground

Related

How to parse new line in Scala grammar with flex/bison?

I want to parse Scala grammar with flex and bison. But I don't know how to parse the newline token in Scala grammar.
If I parse the newline as a token T_NL, here's Toy.l for example:
...
[a-zA-Z_][a-zA-Z0-9_]* { yylval->literal = strdup(yytext); return T_ID; }
\n { yylval->token = T_NL; return T_NL; }
[ \t\v\f\r] { /* skip whitespaces */ }
...
And here's the Toy.y for example:
function_def: 'def' T_ID '(' argument_list ')' return_expression '=' expression T_NL
;
argument_list: argument
| argument ',' argument_list
;
expression: ...
;
return_expression: ...
;
As you can see, I would have to allow for T_NL in all the other statements and definitions in Toy.y, which is really tedious.
Please educate me with a source code example!
This is a clear case where bison push-parsers are useful. The basic idea is that the decision to send an NL token (or tokens) can only be made when the following token has been identified (and, in one corner case, the second following token).
The advantage of push parsers is that they let us implement strategies like this, where there is not necessarily a one-to-one relationship between input lexemes and tokens sent to the parser. I'm not going to deal with all the particularities of setting up a push parser (though it's not difficult); you should refer to the bison manual for details. [Note 1]
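For orientation, here is a minimal sketch of the push-parser plumbing (the details, including any extra %parse-param arguments, are in the bison manual; next_token is a hypothetical helper):
/* In the grammar file: */
%define api.pure full
%define api.push-pull push

/* In the scanner's driver: create one parser state object and push
   tokens at it; yypush_parse returns YYPUSH_MORE while it wants
   more input. */
yypstate *ps = yypstate_new();
int status;
do {
    int token = next_token(&yylval, &yylloc);  /* hypothetical helper */
    status = yypush_parse(ps, token, &yylval, &yylloc);
} while (status == YYPUSH_MORE);
yypstate_delete(ps);
In the answer below, the pushes happen inside the SEND macro rather than in a separate driver loop, but the mechanics are the same.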
First, it's important to read the Scala language description with care. Newline processing is described in section 2.13:
A newline in a Scala source text is treated as the special token “nl” if the three following criteria are satisfied:
The token immediately preceding the newline can terminate a statement.
The token immediately following the newline can begin a statement.
The token appears in a region where newlines are enabled.
Rules 1 and 2 are simple lookup tables, which are precisely defined in the following two paragraphs. Rule 2 has just one minor exception, described below:
A case token can begin a statement only if followed by a class or object token.
One hackish possibility to deal with that exception would be to add case[[:space:]]+class and case[[:space:]]+object as lexemes, on the assumption that no-one will put a comment between case and class. (Or you could use a more sophisticated pattern, which allows comments as well as whitespace.) If one of these lexemes is recognised, it could either be sent to the parser as a single (fused) token, or it could be sent as two tokens using two invocations of SEND in the lexer action. (Personally, I'd go with the fused token, since once the pair of tokens has been recognised, there is no advantage to splitting them up; afaik, there's no valid program in which case class can be parsed as anything other than case class. But I could be wrong.)
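In flex terms, the fused-token idea might look like this (the token names are invented for illustration, and SEND is the macro defined below):
case[[:space:]]+class     { SEND(T_CASE_CLASS); }
case[[:space:]]+object    { SEND(T_CASE_OBJECT); }
"case"                    { SEND(T_CASE); }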
To apply rules one and two, then, we need two lookup tables indexed by token number: token_can_end_stmt and token_cannot_start_stmt. (The second one has its meaning reversed because most tokens can start statements; doing it this way simplifies initialisation.)
/* YYNTOKENS is defined by bison if you specify %token-table */
static bool token_can_end_stmt[YYNTOKENS] = {
[T_THIS] = true, [T_NULL] = true, /* ... */
};
static bool token_cannot_start_stmt[YYNTOKENS] = {
[T_CATCH] = true, [T_ELSE] = true, /* ... */
};
We're going to need a little bit of persistent state, but fortunately when we're using a push parser, the scanner does not need to return to its caller every time it recognises a token, so we can keep the persistent state as local variables in the scan loop. (That's another advantage of the push-parser architecture.)
From the above description, we can see that what we're going to need to maintain in the scanner state are:
some indication that a newline has been encountered. This needs to be a count, not a boolean, because we might need to send two newlines:
if two tokens are separated by at least one completely blank line (i.e. a line which contains no printable characters), then two nl tokens are inserted.
A simple way to handle this is to simply compare the current line number with the line number at the previous token. If they are the same, then there was no newline. If they differ by only one, then there was no blank line. If they differ by more than one, then there was either a blank line or a comment (or both). (It seems odd to me that a comment would not trigger the blank line rule, so I'm assuming that it does. But I could be wrong, which would require some adjustment to this scanner.) [Note 2]
the previous token (for rule 1). It's only necessary to record the token number, which is a simple small integer.
some way of telling whether we're in a "region where newlines are enabled" (for rule 3). I'm pretty sure that this will require assistance from the parser, so I've written it that way here.
By centralising the decision about sending a newline into a single function, we can avoid a lot of code duplication. My typical push-parser architecture uses a SEND macro anyway, to deal with the boilerplate of saving the semantic value, calling the parser, and checking for errors; it's easy to add the newline logic there:
// yylloc handling mostly omitted for simplicity
#define SEND_VALUE(token, tag, value) do { \
    yylval.tag = value;                    \
    SEND(token);                           \
  } while(0)
#define SEND(token) do {                                               \
    int status = YYPUSH_MORE;                                          \
    if (yylineno != prev_lineno                                        \
        && token_can_end_stmt[prev_token]                              \
        && !token_cannot_start_stmt[token]                             \
        && nl_enabled) {                                               \
      status = yypush_parse(ps, T_NL, NULL, &yylloc, &nl_enabled);     \
      if (status == YYPUSH_MORE                                        \
          && yylineno - prev_lineno > 1)                               \
        status = yypush_parse(ps, T_NL, NULL, &yylloc, &nl_enabled);   \
    }                                                                  \
    nl_encountered = 0;                                                \
    if (status == YYPUSH_MORE)                                         \
      status = yypush_parse(ps, token, &yylval, &yylloc, &nl_enabled); \
    if (status != YYPUSH_MORE) return status;                          \
    prev_token = token;                                                \
    prev_lineno = yylineno;                                            \
  } while (0)
Specifying the local state in the scanner is extremely simple; just place the declarations and initialisations at the top of your scanner rules, indented. Any indented code prior to the first rule is inserted directly into yylex, almost at the top of the function (so it is executed once per call, not once per lexeme):
%%
int nl_encountered = 0;
int prev_token = 0;
int prev_lineno = 1;
bool nl_enabled = true;
YYSTYPE yylval;
YYLTYPE yylloc = {0};
Now, the individual rules are pretty simple (except for case). For example, we might have rules like:
"while" { SEND(T_WHILE); }
[[:lower:]][[:alnum:]_]* { SEND_VALUE(T_VARID, str, strdup(yytext)); }
That still leaves the question of how to determine if we are in a region where newlines are enabled.
Most of the rules could be handled in the lexer by just keeping a stack of different kinds of open parentheses, and checking the top of the stack: If the parenthesis on the top of the stack is a {, then newlines are enabled; otherwise, they are not. So we could use rules like:
[{([] { paren_stack_push(yytext[0]); SEND(yytext[0]); }
[])}] { paren_stack_pop(); SEND(yytext[0]); }
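Here paren_stack_push and paren_stack_pop are assumed helpers rather than flex builtins; a minimal sketch (fixed-size, no overflow handling) might be:
#include <stdbool.h>

/* Stack of currently open brackets. Newlines are enabled when the
   innermost open bracket is a brace (or there is none). */
static char paren_stack[256];
static int  paren_top = -1;

static void paren_stack_push(char c) { paren_stack[++paren_top] = c; }
static void paren_stack_pop(void)    { if (paren_top >= 0) --paren_top; }
static bool newlines_enabled(void)   { return paren_top < 0 || paren_stack[paren_top] == '{'; }
As the next paragraph explains, though, this lexer-only approach falls short for case ... =>, which is why the logic ends up in the parser.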
However, that doesn't deal with the requirement that newlines be disabled between a case and its corresponding =>. I don't think it's possible to handle that as another type of parenthesis, because there are lots of uses of => which do not correspond with a case, and I believe some of them can come between a case and its corresponding =>.
So a better approach would be to put this logic into the parser, using lexical feedback to communicate the state of the newline-region stack, which is what is assumed in the calls to yypush_parse above. Specifically, the scanner and the parser share one boolean variable (by passing a pointer to the parser). [Note 3] The parser then maintains the value of this variable in mid-rule actions (MRAs), one in each rule which matches a region of potentially different newlinedness, using the parse stack itself as a stack. Here's a small excerpt of a (theoretical) parser:
%define api.pure full
%define api.push-pull push
%locations
%parse-param { bool* nl_enabled; }
/* More prologue */
%%
// ...
/* Some example productions which modify nl_enabled: */
/* The actions always need to be before the token, because they need to take
* effect before the next lookahead token is requested.
* Note how the first MRA's semantic value is used to keep the old value
* of the variable, so that it can be restored in the second MRA.
*/
TypeArgs : <boolean>{ $$ = *nl_enabled; *nl_enabled = false; }
             '[' Types
           { *nl_enabled = $1; } ']'
         ;
CaseClause : <boolean>{ $$ = *nl_enabled; *nl_enabled = false; }
             "case" Pattern opt_Guard
             { *nl_enabled = $1; } "=>" Block
           ;
FunDef : FunSig opt_nl
         <boolean>{ $$ = *nl_enabled; *nl_enabled = true; }
         '{' Block
         { *nl_enabled = $3; } '}'
       ;
Notes:
1. Push parsers have many other advantages; IMHO they are the solution of choice. In particular, using push parsers avoids the circular header dependency which plagues attempts to build pure parser/scanner combinations.
2. There is still the question of multiline comments with preceding and trailing text:
return /* Is this comment
a newline? */ 42
I'm not going to try to answer that question.
3. It would be possible to keep this flag in the YYLTYPE structure, since only one instance of yylloc is ever used in this example. That might be a reasonable optimisation, since it cuts down on the number of parameters sent to yypush_parse. But it seemed a bit fragile, so I opted for a more general solution here.

What are all the Unicode properties a Perl 6 character will match?

The .uniprop method returns a single property:
put join ', ', 'A'.uniprop;
I get back one property (the general category):
Lu
Looking around I didn't see a way to get all the other properties (including derived ones such as ID_Start and so on). What am I missing? I know I can go look at the data files, but I'd rather have a single method that returns a list.
I am mostly interested in this because regexes understand properties and match the right properties. I'd like to take any character and show which properties it will match.
"A".uniprop("Alphabetic") will get the Alphabetic property. Are you asking for what other properties are possible?
All of the properties that have a checkmark by them will likely work. That page just displays the status of roast testing for them: https://github.com/perl6/roast/issues/195
This may be more useful to you: https://github.com/rakudo/rakudo/blob/master/src/core/Cool.pm6#L396-L483
The first hash just maps aliases for the property names to the full names. The second hash specifies whether the property is B for boolean, S for a string, I for integer, nv for numeric value, na for Unicode Name, and a few other specials.
If I didn't understand your question, please let me know and I will revise this answer.
Update: Seems you want to find out all the properties that will match. What you will want to do is iterate over all of https://github.com/rakudo/rakudo/blob/master/src/core/Cool.pm6#L396-L483, looking only at the string, integer and boolean properties. Here is the full thing: https://gist.github.com/samcv/ae09060a781bb4c36ae6cac80ea9325f
sub MAIN {
    use Test;
    my $char = 'a';
    my @result = what-matches($char);
    for @result {
        ok EVAL("'$char' ~~ /$_/"), "$char ~~ /$_/";
    }
}
use nqp;
sub what-matches (Str:D $chr) {
    my @result;
    my %prefs = prefs();
    for %prefs.keys -> $key {
        given %prefs{$key} {
            when 'S' {
                my $propval = $chr.uniprop($key);
                if $key eq 'Block' {
                    @result.push: "<:In" ~ $propval.trans(' ' => '') ~ ">";
                }
                elsif $propval {
                    @result.push: "<:" ~ $key ~ "<" ~ $chr.uniprop($key) ~ ">>";
                }
            }
            when 'I' {
                @result.push: "<:" ~ $key ~ "<" ~ $chr.uniprop($key) ~ ">>";
            }
            when 'B' {
                @result.push: ($chr.uniprop($key) ?? "<:$key>" !! "<:!$key>");
            }
        }
    }
    @result;
}
sub prefs {
my %prefs = nqp::hash(
'Other_Grapheme_Extend','B','Titlecase_Mapping','tc','Dash','B',
'Emoji_Modifier_Base','B','Emoji_Modifier','B','Pattern_Syntax','B',
'IDS_Trinary_Operator','B','ID_Continue','B','Diacritic','B','Cased','B',
'Hangul_Syllable_Type','S','Quotation_Mark','B','Radical','B',
'NFD_Quick_Check','S','Joining_Type','S','Case_Folding','S','Script','S',
'Soft_Dotted','B','Changes_When_Casemapped','B','Simple_Case_Folding','S',
'ISO_Comment','S','Lowercase','B','Join_Control','B','Bidi_Class','S',
'Joining_Group','S','Decomposition_Mapping','S','Lowercase_Mapping','lc',
'NFKC_Casefold','S','Simple_Lowercase_Mapping','S',
'Indic_Syllabic_Category','S','Expands_On_NFC','B','Expands_On_NFD','B',
'Uppercase','B','White_Space','B','Sentence_Terminal','B',
'NFKD_Quick_Check','S','Changes_When_Titlecased','B','Math','B',
'Uppercase_Mapping','uc','NFKC_Quick_Check','S','Sentence_Break','S',
'Simple_Titlecase_Mapping','S','Alphabetic','B','Composition_Exclusion','B',
'Noncharacter_Code_Point','B','Other_Alphabetic','B','XID_Continue','B',
'Age','S','Other_ID_Start','B','Unified_Ideograph','B','FC_NFKC_Closure','S',
'Case_Ignorable','B','Hyphen','B','Numeric_Value','nv',
'Changes_When_NFKC_Casefolded','B','Expands_On_NFKD','B',
'Indic_Positional_Category','S','Decomposition_Type','S','Bidi_Mirrored','B',
'Changes_When_Uppercased','B','ID_Start','B','Grapheme_Extend','B',
'XID_Start','B','Expands_On_NFKC','B','Other_Uppercase','B','Other_Math','B',
'Grapheme_Link','B','Bidi_Control','B','Default_Ignorable_Code_Point','B',
'Changes_When_Casefolded','B','Word_Break','S','NFC_Quick_Check','S',
'Other_Default_Ignorable_Code_Point','B','Logical_Order_Exception','B',
'Prepended_Concatenation_Mark','B','Other_Lowercase','B',
'Other_ID_Continue','B','Variation_Selector','B','Extender','B',
'Full_Composition_Exclusion','B','IDS_Binary_Operator','B','Numeric_Type','S',
'kCompatibilityVariant','S','Simple_Uppercase_Mapping','S',
'Terminal_Punctuation','B','Line_Break','S','East_Asian_Width','S',
'ASCII_Hex_Digit','B','Pattern_White_Space','B','Hex_Digit','B',
'Bidi_Paired_Bracket_Type','S','General_Category','S',
'Grapheme_Cluster_Break','S','Grapheme_Base','B','Name','na','Ideographic','B',
'Block','S','Emoji_Presentation','B','Emoji','B','Deprecated','B',
'Changes_When_Lowercased','B','Bidi_Mirroring_Glyph','bmg',
'Canonical_Combining_Class','S',
);
}
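For a character like 'a', the pushed assertions come out along these lines (abridged; the exact set and order will depend on your Rakudo and Unicode versions):
my @matches = what-matches('a');
.say for @matches;
# e.g. <:InBasicLatin>, <:Script<Latin>>, <:Alphabetic>, <:Lowercase>, <:!Uppercase>, ...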
OK, so here's another take on answering this question, but the solution is not perfect. Bring the downvotes!
If you join #perl6 channel on freenode, there's a bot called unicodable6 which has functionality that you may find useful. You can ask it to do this (e.g. for character A and π simultaneously):
<AlexDaniel> propdump: Aπ
<unicodable6> AlexDaniel, https://gist.github.com/b48e6062f3b0d5721a5988f067259727
Not only does it show the value of each property, but if you give it more than one character it will also highlight the differences!
Yes, it seems like you're looking for a way to do that within Perl 6, and this answer is not it. But in the meantime it's very useful. Internally, Unicodable just iterates through this list of properties. So basically this is identical to the other answer in this thread.
I think someone can make a module out of this (hint-hint), and then the answer to your question will be “just use module Unicode::Propdump”.

Lex program rules not working

%{
#include <stdio.h>
int sline=0,mline=0;
%}
%%
"/*"[a-zA-Z0-9 \t\n]*"*/" { mline++; }
"//".* { sline++; }
.|\n { fprintf(yyout,"%s",yytext); }
%%
int main(int argc, char *argv[])
{
    if (argc != 3)
    {
        printf("Invalid number of arguments!\n");
        return 1;
    }
    yyin = fopen(argv[1], "r");
    yyout = fopen(argv[2], "w");
    yylex();
    printf("Single line comments = %d\nMultiline comments = %d\nTotal comments = %d\n", sline, mline, sline + mline);
    return 0;
}
I am trying to make a Lex program which would count the number of comment lines (single-line comments and multi-line comments separately).
Using this code, I gave a .c file and a blank text file as input and output arguments.
When a multi-line comment contains any special characters, the pattern doesn't match and mline is not incremented for that comment.
How do I fix this problem?
Below is a nudge in the right direction. The main difference between what you did and what I have done is that I made only two regexes - one for whitespace and one for ident (identifiers). What I mean by identifiers is anything that you want to comment out. This regex can obviously be expanded to include other characters and symbols. I also just defined the three patterns that begin and end comments and associated them with tokens that we could pass to the syntax analyzer (but that's a whole new topic).
I also changed the way that you feed input to the program. I find it cleaner to redirect input to a program from a file and redirect output to another file - if you need this.
Here is an example of how you might use this program:
flex filename.l
g++ lex.yy.c -o lexer
./lexer < input.txt
You can redirect the output to another file if you need to by using:
./lexer < input.txt > output.txt
Instead of the last command above.
Note: the '.' (dot) character at the end of the pattern matching is used as a catch-all for characters, sequences of characters, symbols, etc. that do not have a match.
There are many nuances to pattern matching using regex to match comment lines. For example, this would still match even if the comment line was part of a string.
Ex. " //This is a comment in a string! "
You will need to do a little more work to get past these nuances - like I said, this is a nudge in the right direction.
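For instance, one way to get past the comment-in-a-string nuance (an illustrative flex rule, not part of the snippet below) is to match string literals before any comment rules, so their contents are consumed whole:
\"([^"\\\n]|\\.)*\"   { /* string literal: consumed whole, so comment
                           markers inside it never reach the comment rules */ }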
You can do something similar to this to accomplish your goal:
%{
#include <stdio.h>
int sline = 0;
int mline = 0;
#define T_SLINE 0001
#define T_BEGIN_MLINE 0002
#define T_END_MLINE 0003
#define T_UNKNOWN 0004
%}
WSPACE [ \t\r]+
IDENT [a-zA-Z0-9]
%%
"//" {
printf("TOKEN: T_SLINE LEXEME: %s\n", yytext);
sline++;
return T_SLINE;
}
"/*" {
printf("TOKEN: T_BEGIN_MLINE LEXEME: %s\n", yytext);
return T_BEGIN_MLINE;
}
"*/" {
printf("TOKEN: T_END_MLINE LEXEME: %s\n", yytext);
mline++;
return T_END_MLINE;
}
{IDENT} {/*Do nothing*/}
{WSPACE} { /*Do Nothing*/}
. {
printf("TOKEN: UNKNOWN LEXEME: %s\n", yytext);
return T_UNKNOWN;
}
%%
int yywrap(void) { return 1; }
int main(void) {
while ( yylex() );
printf("Single-line comments = %d\n Multi-line comments = %d\n Total comments = %d\n", sline, mline, (sline + mline));
return 0;
}
The problem is your regex for multiline comments:
"/*"[a-zA-Z0-9 \t\n]*"*/"
This only matches multiline comments that ONLY contain letters, digits, spaces, tabs, and newlines. If the comment contains anything else it won't match. You want something like:
/"*"([^*]|"*"+[^*/])*"*"+/
This will match anything except a */ between the /* and */.
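If it helps, here is a minimal self-contained scanner built around that pattern (the file layout and variable name are just for illustration):
%{
#include <stdio.h>
int mline = 0;
%}
%%
"/*"([^*]|"*"+[^*/])*"*"+"/"   { mline++; }
.|\n                           { /* pass over everything else */ }
%%
int yywrap(void) { return 1; }
int main(void) {
    yylex();
    printf("Multiline comments = %d\n", mline);
    return 0;
}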
Below is the full lex code to count the number of comment lines and executable lines.
%{
int cc=0,cl=0,el=0,flag=0;
%}
%x cmnt
%%
^[ \t]*"//".*\n {cc++;cl++;}
.+"//".*\n {cc++;cl++;el++;}
^[ \t]*"/*" {BEGIN cmnt;}
<cmnt>\n {cl++;}
<cmnt>.\n {cl++;}
<cmnt>"*/"\n {cl++;cc++;BEGIN 0;}
<cmnt>"*/" {cl++;cc++;BEGIN 0;}
.*"/*".*"*/".+\n {cc++;cl++;}
.+"/*".*"*/".*\n {cc++;cl++;el++;}
.+"/*" {BEGIN cmnt;}
.\n {el++;}
%%
int main()
{
    yyin = fopen("abc.cpp", "r");
    yyout = fopen("abc.txt", "w");
    yylex();
    fprintf(yyout, "Comment Count: %d \nCommented Lines: %d \nExecutable Lines: %d", cc, cl, el);
    return 0;
}
int yywrap()
{
return 1;
}
The program takes its input from a C++ source file, abc.cpp, and writes the output to the file abc.txt.

How to tell Eclipse not to format parts of a code file (when pressing Ctrl + Shift + F)

I really love the autoformat feature. It makes your code more readable and, in the case of JavaScript, tells you when there are syntax errors (missing brackets etc.).
However, sometimes the formatting makes the code harder to read, e.g. when it puts a long array initialisation onto a single line. In that case I don't want Eclipse to format it, but rather leave it spread over multiple lines.
E.g.
define([
'jquery',
'aloha',
'aloha/plugin',
'ui/ui',
'ui/scopes',
'ui/button',
'ui/toggleButton',
'ui/port-helper-attribute-field',
'ui/text'
// 'css!youtube/css/youtube.css'
],
function(
$,
Aloha,
Plugin,
Ui,
Scopes,
Button,
ToggleButton,
AttributeField)
{
This array should stay like this and not become this:
define(['jquery', 'aloha', 'aloha/plugin', 'ui/ui', 'ui/scopes', 'ui/button', 'ui/toggleButton', 'ui/port-helper-attribute-field', 'ui/text' ], function($, Aloha, Plugin, Ui, Scopes, Button, ToggleButton, AttributeField) {
Is there a special tag to tell Eclipse not to format the code?
OK, it took me some time to find the right setting, so I will post a tutorial here.
Go to Window -> Preferences and search for the formatter you are using. In my case it was under 'Aptana Studio' -> 'Formatter'. (Depending on your package this differs, e.g. the Java formatter is under 'Java' -> 'Code Style' -> 'Formatter'.)
Now create a new formatter profile, since you can't override the built-in one.
Now enable the formatter tags.
Now you can use the
- @formatter:off
- @formatter:on
tags to disable code formatting.
Example:
this code:
function hello() { return 'hello';
}
//@formatter:off
/*
|\ _,,,---,,_
/,`.-'`' -. ;-;;,_
|,4- ) )-,_..;\ ( `'-'
'---''(_/--' `-'\_) fL
*/
//@formatter:on
function
world() {
return 'world';
}
will get formatted like this:
function hello() {
return 'hello';
}
//@formatter:off
/*
|\ _,,,---,,_
/,`.-'`' -. ;-;;,_
|,4- ) )-,_..;\ ( `'-'
'---''(_/--' `-'\_) fL
*/
//@formatter:on
function world() {
return 'world';
}
Note how the function definition is formatted correctly, while the ASCII art isn't.
Credits:
Katja Christiansen, for their comment
https://stackoverflow.com/a/3353765/639035, for a similar answer
Try to make an empty comment after each line:
define([ //
'jquery', //
'aloha', //
'aloha/plugin', //
'ui/ui', //
'ui/scopes', //
'ui/button', //
'ui/toggleButton', //
...
Not nice, but I think it will work.

Getting a string which ends with a string "lngt" in Lex

I am writing a lex script to tokenize C ASTs. I want to write a regex in lex to get a string that ends with a specific string "lngt" but does not include "lngt" in the final string returned by lex. So basically the string form would be (.*lngt), but I haven't been able to figure out how to do this in lex. Any advice/direction would be really helpful.
Example: I have this line in my file:
#65 string_cst type: #71 strg: Reverse order of the given number is : %d lngt: 42
I want to retrieve the string after "strg:" and before "lngt:", i.e. "Reverse order of the given number is : %d" (NOTE: this string could be composed of any possible characters).
Thanks.
This question is similar to one I answered here. It can be done by writing your own state machine in lex. It could also be done by writing some C code, as shown in the cited answer or in the other texts cited below.
If we assume that the string you want is always between "strg" and "lngt" then this is the same as any other non-symmetric string delimiters.
%x STRG LETTERL LN LNG LNGT
ws [ \t\r\n]+
%%
<INITIAL>"strg: " {
BEGIN(STRG);
}
<STRG>[^l]*l {
yymore();
BEGIN(LETTERL);
}
<LETTERL>n {
yymore();
BEGIN(LN);
}
<LN>g {
yymore();
BEGIN(LNG);
}
<LNG>t {
yymore();
BEGIN(LNGT);
}
<LNGT>":" {
printf("String is '%s'\n", yytext);
BEGIN(INITIAL);
}
<LETTERL>[^n] {
BEGIN(STRG);
yymore();
}
<LN>[^g] {
BEGIN(STRG);
yymore();
}
<LNG>[^t] {
BEGIN(STRG);
yymore();
}
<LNGT>[^:] {
BEGIN(STRG);
yymore();
}
<INITIAL>{ws} /* skip */ ;
<INITIAL>. /* skip anything not in the string */
%%
To quote my other answer:
There are suggested solutions on several university compiler courses. The one that explains it well is here (at Manchester), which cites a couple of good books that also cover the problems:
J.Levine, T.Mason & D.Brown: Lex and Yacc (2nd ed.)
M.E.Lesk & E.Schmidt: Lex - A Lexical Analyzer Generator
The two techniques described are to use Start Conditions to explicitly specify the state machine, or manual input to read characters directly.
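A sketch of the second technique, using input() to read raw characters once "strg: " has been seen (buffer size, whitespace trimming and error handling are left out):
%{
#include <stdio.h>
#include <string.h>
%}
%%
"strg: "    {
    char buf[4096];
    int n = 0, c;
    /* read raw characters until the buffer ends with "lngt:" */
    while ((c = input()) != EOF && n < (int)sizeof buf - 1) {
        buf[n++] = (char)c;
        if (n >= 5 && memcmp(buf + n - 5, "lngt:", 5) == 0) {
            buf[n - 5] = '\0';   /* drop the delimiter itself */
            printf("String is '%s'\n", buf);
            break;
        }
    }
}
.|\n    { /* skip everything else */ }
%%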