I am writing a lex script to tokenize C ASTs. I want to write a regex in lex that matches a string ending with the specific string "lngt", but without including "lngt" in the final string returned by lex. So basically the pattern would be (.*lngt), but I haven't been able to figure out how to do this in lex. Any advice/direction would be really helpful.
Example: I have this line in my file:
#65 string_cst type: #71 strg: Reverse order of the given number is : %d lngt: 42
I want to retrieve the string after strg: and before lngt:, i.e. "Reverse order of the given number is : %d" (NOTE: this string could be composed of any characters).
Thanks.
This question needs an answer similar to the one I wrote here. It can be done by writing your own state machine in lex. It could also be done by writing some C code, as shown in the cited answer or in the other texts cited below.
If we assume that the string you want is always between "strg" and "lngt", then this is the same problem as handling any other pair of non-symmetric string delimiters.
%x STRG LETTERL LN LNG LNGT
ws [ \t\r\n]+
%%
<INITIAL>"strg: " {
    /* start collecting the string after the opening delimiter */
    BEGIN(STRG);
}
<STRG>[^l]*l {
    yymore();
    BEGIN(LETTERL);
}
<LETTERL>n {
    yymore();
    BEGIN(LN);
}
<LN>g {
    yymore();
    BEGIN(LNG);
}
<LNG>t {
    yymore();
    BEGIN(LNGT);
}
<LNGT>":" {
    /* yytext has accumulated everything up to and including "lngt:",
       so print all but those last 5 characters */
    printf("String is '%.*s'\n", (int)(yyleng - 5), yytext);
    BEGIN(INITIAL);
}
<LETTERL>[^n] {
    BEGIN(STRG);
    yymore();
}
<LN>[^g] {
    BEGIN(STRG);
    yymore();
}
<LNG>[^t] {
    BEGIN(STRG);
    yymore();
}
<LNGT>[^:] {
    BEGIN(STRG);
    yymore();
}
<INITIAL>{ws} /* skip */ ;
<INITIAL>. /* skip anything not in the string */ ;
%%
To quote my other answer:
There are suggested solutions in several university compiler courses. The one that explains it well is here (at Manchester); it cites a couple of good books that also cover the problem:
J.Levine, T.Mason & D.Brown: Lex and Yacc (2nd ed.)
M.E.Lesk & E.Schmidt: Lex - A Lexical Analyzer Generator
The two techniques described are to use start conditions to explicitly specify the state machine, or to use manual input to read characters directly.
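For comparison, here is a minimal sketch of the second technique (reading characters by hand), written in Python rather than lex. The function name extract_between and its default delimiters are assumptions made for illustration. Treating " lngt:" (with one leading space) as the closing delimiter is also an assumption, based on the sample line in the question.

```python
def extract_between(line, start="strg: ", end=" lngt:"):
    """Return the text between `start` and `end`, or None if either is missing.

    The scan below is a tiny hand-written state machine: `matched` counts how
    many characters of `end` have been seen in a row. The simple fallback is
    only valid because the first character of `end` does not occur again
    later in `end` (an assumption that holds for " lngt:").
    """
    begin = line.find(start)
    if begin < 0:
        return None
    begin += len(start)

    matched = 0
    for pos in range(begin, len(line)):
        ch = line[pos]
        if ch == end[matched]:
            matched += 1
            if matched == len(end):
                # Slice off the delimiter so it is not part of the result.
                return line[begin:pos + 1 - len(end)]
        else:
            # Restart, but keep this character if it could begin the delimiter.
            matched = 1 if ch == end[0] else 0
    return None


sample = "#65 string_cst type: #71 strg: Reverse order of the given number is : %d lngt: 42"
print(extract_between(sample))
```

Each value of matched plays the role of one start condition (LETTERL, LN, LNG, LNGT) in the lex version.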
I have a string consisting of words and punctuation, such as "Accept data protection terms / conditions (German)". I need to normalize that to camelCase, removing punctuation.
My closest attempt so far fails to camelCase the words; I only manage to turn them into kebab-case or snake_case:
$normalizeId := function($str) <s:s> {
$str.$lowercase()
.$replace(/\s+/, '-')
.$replace(/[^-a-zA-Z0-9]+/, '')
};
Anindya's answer works for your example input, but if (German) was not capitalized, it would result in the incorrect output:
"acceptDataProtectionTermsConditionsgerman"
Link to playground
This version would work and prevent that bug:
(
  $normalizeId := function($str) <s:s> {
    $str
      /* normalize everything to lowercase */
      .$lowercase()
      /* replace any "punctuations" with a - */
      .$replace(/[^-a-zA-Z0-9]+/, '-')
      /* Find all letters with a dash in front,
         strip the dash and uppercase the letter */
      .$replace(/-(.)/, function($m) { $m.groups[0].$uppercase() })
      /* Clean up any leftover dashes */
      .$replace("-", '')
  };
  $normalizeId($$)
  /* OUTPUT: "acceptDataProtectionTermsConditionsGerman" */
)
Link to playground
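To make the four steps easier to follow outside the playground, here is a rough Python sketch of the same pipeline. It is not JSONata, and the function name normalize_id is simply a hypothetical Python spelling of $normalizeId.

```python
import re


def normalize_id(text: str) -> str:
    # 1. Normalize everything to lowercase.
    lowered = text.lower()
    # 2. Replace every run of characters outside [-a-zA-Z0-9] with a dash.
    dashed = re.sub(r"[^-a-zA-Z0-9]+", "-", lowered)
    # 3. Uppercase each character that follows a dash, dropping that dash.
    camel = re.sub(r"-(.)", lambda m: m.group(1).upper(), dashed)
    # 4. Remove any leftover dashes (for example a trailing one).
    return camel.replace("-", "")


print(normalize_id("Accept data protection terms / conditions (German)"))
```

Lowercasing first is what makes "(german)" and "(German)" come out the same.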
You should target the letters that have a space in front and capitalize them, using this regex: /\s(.)/.
Here is my snippet (edited):
(
$upper := function($a) {
$a.groups[0].$uppercase()
};
$normalizeId := function($str) <s:s> {
$str.$lowercase()
.$replace(/[^-a-zA-Z0-9]+/, '-')
.$replace(/-(.)/, $upper)
.$replace(/-/, '')
};
$normalizeId("Accept data protection terms / conditions (German)");
)
/* OUTPUT: "acceptDataProtectionTermsConditionsGerman" */
Edit: Thanks @vitorbal. The "$lower" function on regex replacement earlier was not necessary, and did not handle the scenario you mentioned. Thanks for pointing that out. I have updated my snippet as well as added a link to the playground below.
Link to playground
I've just started getting into VS Code snippets. They seem really handy.
Is there a way to ensure that what a user entered at a tabstop starts with a lowercase letter?
Here's my test case/sandbox:
"junk": {
"prefix": "junk",
"body": [
"original:${1:type some string here then tab}",
"lower:${1/(.*)/${1:/downcase}/}",
"upper:${1/(.*)/${1:/upcase}/}",
"capitalized:${1/(.*)/${1:/capitalize}/}",
"camel:${1/(.*)/${1:/camelcase}/}",
"pascal:${1/(.*)/${1:/pascalcase}/}",
],
"description": "junk"
}
and here's what it produces:
original:SomeValue
lower:somevalue
upper:SOMEVALUE
capitalized:SomeValue
camel:somevalue
pascal:Somevalue
"camel" is pretty close but I want to preserve the capital if the user entered a camelcase value.
I just want the first character lower no matter what.
The answer is:
${1/(.)(.*)/${1:/downcase}$2/}
Just to clarify, if you look at this commit: https://github.com/microsoft/vscode/commit/3d6389bb336b8ca9b12bc1e772f7056d5c03d3ee
function _toCamelCase(value: string): string {
    const match = value.match(/[a-z0-9]+/gi);
    console.log(match)
    if (!match) {
        return value;
    }
    return match.map((word, index) => {
        if (index === 0) {
            return word.toLowerCase();
        } else {
            return word.charAt(0).toUpperCase()
                + word.substr(1).toLowerCase();
        }
    })
        .join('');
}
the camelcase transform is intended for input like
some-value
some_value
some.value
I think any character not matching [a-z0-9] (case-insensitively) will work as the separator between words. So your case of SomeValue is not the intended use of camelcase: according to the function above, the entire SomeValue is one match (the match is case-insensitive), and then that entire word is lowercased.
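To see both behaviours side by side, here is a small Python approximation. to_camel_case mirrors the logic of the _toCamelCase function quoted above, and lower_first mimics the (.)(.*) transform from the answer. Both names are hypothetical, and Python's re module stands in for the JavaScript regex engine that VS Code actually uses.

```python
import re


def to_camel_case(value: str) -> str:
    # Split into runs of letters/digits, ignoring case, like /[a-z0-9]+/gi.
    words = re.findall(r"[a-z0-9]+", value, flags=re.IGNORECASE)
    if not words:
        return value
    first, rest = words[0], words[1:]
    # First word fully lowercased; later words capitalized then lowercased.
    return first.lower() + "".join(w[:1].upper() + w[1:].lower() for w in rest)


def lower_first(value: str) -> str:
    # Lowercase only the first character and keep the rest untouched.
    return re.sub(r"(.)(.*)", lambda m: m.group(1).lower() + m.group(2), value, count=1)


for text in ("SomeValue", "some-value", "some_value", "some.value"):
    print(text, "->", to_camel_case(text), "|", lower_first(text))
```

With SomeValue there is no separator, so to_camel_case sees a single word and lowercases all of it, while lower_first keeps the inner capital.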
I am using this to find a customer name in a text file. Names are each on a separate line. I need to find the exact name: if searching for Nick specifically, it should find Nick only, but my code reports a match even if only Nickolson is in the list.
On*:text:*!Customer*:#: {
if ($read(system\Customer.txt,$2)) {
.msg $chan $2 Customer found in list! | halt }
else { .msg $chan 4 $2 Customer not found in list. | halt }
}
You have to loop through every matching line and check whether the line is an exact match.
Something like this:
On *:text:*!Customer*:#: {
  var %nick
  ; loop over all lines that contain nick
  while ($read(customer.txt, nw, *nick*, $calc($readn + 1))) {
    ; check if the line is an exact match
    if ($v1 == nick) {
      %nick = $v1
      ; stop the loop because a result is found
      break
    }
  }
  if (%nick == $null) {
    .msg $chan 4 $2 Customer not found in list.
  }
  else {
    .msg $chan $2 Customer found in list!
  }
}
You can find more here: https://en.wikichip.org/wiki/mirc/text_files#Iterating_Over_Matches
If you're looking for an exact match in a newline-separated list, you can use the 'w' switch without the wildcard '*' character.
From the mIRC documentation:
$read(filename, [ntswrp], [matchtext], [N])
Scans the file info.txt for a line beginning with the word mirc and
returns the text following the match value. //echo $read(help.txt, w,
*help*)
Because we don't want wildcard matching but an exact match, we would use:
$read(customers.txt, w, Nick)
Complete Code:
ON *:TEXT:!Customer *:#: {
  var %foundInTheList = $read(system\Customer.txt, w, $2)
  if (%foundInTheList) {
    .msg # $2 Customer found in list!
  }
  else {
    .msg # 4 $2 Customer not found in list.
  }
}
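The difference between a "contains" match and a whole-line match can be illustrated outside mIRC with a short Python sketch. The helper names below are hypothetical, and comparing case-insensitively is an assumption about the desired behaviour rather than something the question states.

```python
def name_in_lines(lines, name):
    """True only if some line, ignoring surrounding whitespace, equals `name`."""
    wanted = name.casefold()
    return any(line.strip().casefold() == wanted for line in lines)


def name_contained_in_lines(lines, name):
    """The looser check: True if `name` appears anywhere inside a line."""
    wanted = name.casefold()
    return any(wanted in line.casefold() for line in lines)


customers = ["Nickolson\n", "Maria\n"]
print(name_contained_in_lines(customers, "Nick"))  # substring match succeeds
print(name_in_lines(customers, "Nick"))            # whole-line match does not
```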
A few remarks on the original code
Halting
halt should only be used when you forcibly want to stop any further processing from taking place. In most cases you can avoid it by writing your code flow so that it behaves that way without explicit halting.
Avoiding it also prevents new problems that may arise when you add new code and then wonder why it isn't executing, because of the now-forgotten halt command.
It also makes debugging easier, since you won't be left wondering about another exit from the flow that you didn't know about.
Readability
if (..) {
.... }
else { .. }
When there are many lines of code inside the first { }, it becomes hard to notice the else (or elseif), because the mIRC remote parser puts the line above it, which contains the closing }, on the same indentation as the else line. You should almost always prefer a few extra lines for the sake of readability, especially since it costs nothing; as I remember, new lines are free of charge.
So as a rule of thumb, put every command on a new line (that includes the closing bracket).
Matching Text
On*:text:*!Customer*:#: {
The above code has a critical problem and a bug.
Critical: It will not work, because on*:text contains no space between on and *:text.
Bug: *!Customer* will match EVERYTHING-BEFORE!customerANDAFTER <NICK>, which is clearly not the desired behavior. What you want is :!Customer *:, which will only match if the first word is !customer and at least one more word follows, because I've used [SPACE]*.
Is there any way to search for a string only inside a function definition?
Suppose there is a C program file a.c that contains the definitions of several functions, but I want search output only when the string is present inside the definition of a specific function (let's say do_something()). Is there any way to search for a string like that from the command prompt?
For example, for the following code:
#include <stdio.h>
void f(int n,
int j,
int k)
{
printf("name is is pankaj ");
printf("name is is kumar ");
printf("name is is mayank ");
}
int main()
{
printf("name is is pankaj ");
return 0;
}
For the above program, I want only the one occurrence of pankaj that is present in function f(); I don't want the pankaj present in the main function in the search output.
Please ignore any semantic or syntax errors in the program; my query is only about searching for a string in the program.
Of course, try this:
$0 ~ fun {
    count = 1
    while (! ($0 ~ /{/))
        getline
    getline
}
count > 0 {
    if ($0 ~ /{/)
        count++
    if ($0 ~ /}/)
        count--
    if ($0 ~ query)
        print FILENAME ": l" FNR ". " $0
}
And invoke the script like this:
awk -v query="pankaj" -v fun="void f[(]" -f script.awk inputfile.java
Where query is the string to search and fun the regex for the function name.
This script counts { and } to see when we leave the function and should print the line if a match is found.
Edit: you may want to extend the regex for counting brackets, perhaps an extra check to see if they aren't placed in comments is required (although you'd never do that).
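The same brace-counting idea can be sketched in Python, which may be easier to step through than the awk script. The function name search_in_function is hypothetical. Like the awk version, this sketch does not know about braces inside comments or string literals, and it counts every brace on a line rather than just testing whether one is present.

```python
import re


def search_in_function(lines, function_regex, query):
    """Return (line_number, text) pairs for lines containing `query`
    that lie inside the body of the first function matching `function_regex`."""
    hits = []
    depth = 0
    state = "outside"  # "outside" -> "signature" -> "body" -> "outside"
    for number, line in enumerate(lines, start=1):
        if state == "outside":
            if not re.search(function_regex, line):
                continue
            state = "signature"
        opens, closes = line.count("{"), line.count("}")
        if state == "signature":
            if opens == 0:
                continue  # still inside a multi-line parameter list
            state = "body"
        depth += opens - closes
        if query in line:
            hits.append((number, line.strip()))
        if depth <= 0:
            state, depth = "outside", 0  # the function's closing brace was seen
    return hits


SAMPLE = [
    "#include <stdio.h>",
    "void f(int n,",
    "       int j,",
    "       int k)",
    "{",
    '    printf("name is is pankaj ");',
    '    printf("name is is kumar ");',
    '    printf("name is is mayank ");',
    "}",
    "int main()",
    "{",
    '    printf("name is is pankaj ");',
    "    return 0;",
    "}",
]

for number, text in search_in_function(SAMPLE, r"void f\(", "pankaj"):
    print(f"line {number}: {text}")
```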
%{
#include <stdio.h>
int sline=0,mline=0;
%}
%%
"/*"[a-zA-Z0-9 \t\n]*"*/" { mline++; }
"//".* { sline++; }
.|\n { fprintf(yyout,"%s",yytext); }
%%
int main(int argc,char *argv[])
{
if(argc!=3)
{
printf("Invalid number of arguments!\n");
return 1;
}
yyin=fopen(argv[1],"r");
yyout=fopen(argv[2],"w");
yylex();
printf("Single line comments = %d\nMultiline comments=%d\nTotal comments = %d\n",sline,mline,sline+mline);
return 0;
}
I am trying to make a Lex program which would count the number of comment lines (single-line comments and multi-line comments separately).
Using this code, I gave a .c file and a blank text file as input and output arguments.
When I have any special characters in a multi-line comment, it doesn't work for that comment, and mline is not incremented for it.
How do I fix this problem?
Below is a nudge in the right direction. The main difference between what you did and what I have done is that I made only two regexes: one for whitespace and one for ident (identifiers). What I mean by identifiers is anything that you might want to comment out. This regex can obviously be expanded to include other characters and symbols. I also defined the three patterns that begin and end comments and associated them with tokens that we could pass to the syntax analyzer (but that's a whole new topic).
I also changed the way that you feed input to the program. I find it cleaner to redirect input to a program from a file and redirect output to another file - if you need this.
Here is an example of how you might use this program:
flex filename.l
g++ lex.yy.c -o lexer
./lexer < input.txt
You can redirect the output to another file if you need to by using:
./lexer < input.txt > output.txt
Instead of the last command above.
Note: the '.'(dot) character at the end of the pattern matching is used as a catch-all for characters, sequences of characters, symbols, etc. that do not have a match.
There are many nuances to pattern matching using regex to match comment lines. For example, this would still match even if the comment line was part of a string.
Ex. " //This is a comment in a string! "
You will need to do a little more work to get past these nuances - like I said, this is a nudge in the right direction.
You can do something similar to this to accomplish your goal:
%{
#include <stdio.h>
int sline = 0;
int mline = 0;
#define T_SLINE 0001
#define T_BEGIN_MLINE 0002
#define T_END_MLINE 0003
#define T_UNKNOWN 0004
%}
WSPACE [ \t\r]+
IDENT [a-zA-Z0-9]
%%
"//" {
printf("TOKEN: T_SLINE LEXEME: %s\n", yytext);
sline++;
return T_SLINE;
}
"/*" {
printf("TOKEN: T_BEGIN_MLINE LEXEME: %s\n", yytext);
return T_BEGIN_MLINE;
}
"*/" {
printf("TOKEN: T_END_MLINE LEXEME: %s\n", yytext);
mline++;
return T_END_MLINE;
}
{IDENT} {/*Do nothing*/}
{WSPACE} { /*Do Nothing*/}
. {
printf("TOKEN: UNKNOWN LEXEME: %s\n", yytext);
return T_UNKNOWN;
}
%%
int yywrap(void) { return 1; }
int main(void) {
while ( yylex() );
printf("Single-line comments = %d\n Multi-line comments = %d\n Total comments = %d\n", sline, mline, (sline + mline));
return 0;
}
The problem is your regex for multiline comments:
"/*"[a-zA-Z0-9 \t\n]*"*/"
This only matches multiline comments that ONLY contain letters, digits, spaces, tabs, and newlines. If the comment contains anything else it won't match. You want something like:
"/*"([^*]|"*"+[^*/])*"*"+"/"
This will match anything except a */ between the /* and */.
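The effect of widening the pattern can be checked quickly with Python's re module. This is only an illustration: Python regex syntax differs from lex, so the two patterns below are translations, and the names NARROW, BROAD and source are made up for the example.

```python
import re

# Translation of the original rule: only letters, digits, spaces, tabs, newlines.
NARROW = re.compile(r"/\*[a-zA-Z0-9 \t\n]*\*/")

# Translation of the suggested rule: anything up to the first closing "*/".
BROAD = re.compile(r"/\*(?:[^*]|\*+[^*/])*\*+/")

source = "int x; /* plain words */ int y; /* symbols: a+b; $% * still inside */"

print("narrow pattern found:", len(NARROW.findall(source)))
print("broad pattern found: ", len(BROAD.findall(source)))
```

The second comment contains punctuation and a lone asterisk, so only the broader pattern recognizes it.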
Below is the full lex code to count the number of comment lines and executable lines.
%{
int cc=0,cl=0,el=0,flag=0;
%}
%x cmnt
%%
^[ \t]*"//".*\n {cc++;cl++;}
.+"//".*\n {cc++;cl++;el++;}
^[ \t]*"/*" {BEGIN cmnt;}
<cmnt>\n {cl++;}
<cmnt>.\n {cl++;}
<cmnt>"*/"\n {cl++;cc++;BEGIN 0;}
<cmnt>"*/" {cl++;cc++;BEGIN 0;}
.*"/*".*"*/".+\n {cc++;cl++;}
.+"/*".*"*/".*\n {cc++;cl++;el++;}
.+"/*" {BEGIN cmnt;}
.\n {el++;}
%%
int main()
{
yyin=fopen("abc.cpp","r");
yyout=fopen("abc.txt","w");
yylex();
fprintf(yyout,"Comment Count: %d \nCommented Lines: %d \nExecutable Lines: %d",cc,cl,el);
}
int yywrap()
{
return 1;
}
The program takes a C++ source file, abc.cpp, as input and writes the output to the file abc.txt.