lex - Removing "/*" also removes internal stars - lex

I'm trying to pull comments out of a C file, but my code removes all the stars instead of just the /* and */ delimiters. Can anyone help?
Input: /**A**/ or /***/
Desired output: *A* and *
My output: *A and nothing
Code:
"/*" /* comment */ BEGIN(Comment);
<Comment>{
[^*] /* not a '*' */ ECHO;
"*"+[^/] /* '*'s not followed by '/' */ ECHO;
"*"+"/" /* end of Comment */ BEGIN(INITIAL);
}

Change your last two patterns to
"*"+/[^/]
"*/"
Your last pattern consumes every * at the end of the comment along with the closing /, so those stars are never echoed. Changing only the last rule is not enough: for /***/, for example, /* starts the comment, then ** is matched by the second-to-last pattern and the / is matched by [^*], so the end of the comment is never recognized.
"*"+/[^/] matches any run of * that is followed by something other than a /, without consuming the character that follows (the / in the pattern is lex's trailing-context operator). This is necessary because the character that follows could itself be the * of the */ that closes the comment.

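To see why the corrected rules give the desired output, here is a minimal Python sketch that imitates them. It is an illustration only, not lex itself: the name echo_like_lex is made up for this example, the trailing context is emulated with a lookahead, and text outside comments is echoed on the assumption that lex's default rule applies there.

```python
import re

# Rules tried while inside a comment, mirroring the corrected lex patterns.
IN_COMMENT = re.compile(r"""
    (?P<text>[^*])            # any character that is not a star
  | (?P<stars>\*+(?=[^/]))    # stars followed by a non-slash (looked at, not consumed)
  | (?P<end>\*/)              # end of comment
""", re.VERBOSE)


def echo_like_lex(src):
    """Return what the corrected rules would echo for the input src."""
    out = []
    pos = 0
    in_comment = False
    while pos < len(src):
        if not in_comment:
            if src.startswith("/*", pos):
                in_comment = True      # BEGIN(Comment)
                pos += 2
            else:
                out.append(src[pos])   # assumed default rule: echo unmatched text
                pos += 1
            continue
        m = IN_COMMENT.match(src, pos)
        if m is None:
            break                      # unterminated comment: nothing more matches
        if m.lastgroup == "end":
            in_comment = False         # BEGIN(INITIAL)
        else:
            out.append(m.group())      # ECHO
        pos = m.end()
    return "".join(out)


print(echo_like_lex("/**A**/"))
print(echo_like_lex("/***/"))
```

For /***/ the stars rule can only take a single *, because the second * has to stay available for the closing */.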
This regex matches non-nesting C comments:
"/*"([^*]|[*]*[^*/])*"*"+"/"
Here is a complete Lex program which strips C comments from input, replacing each one with a space.
%%
"/*"([^*]|[*]*[^*/])*"*"+"/" putc(' ', yyout);
%%
However, this fails to provide helpful diagnostics. For instance, if something like /* /* */ occurs, it's nice to generate a warning about a suspicious-looking start of a comment within a comment. Also, if a comment is unterminated, it's useful to detect that and produce a diagnostic about where that comment started.
For these reasons, it may be best to handle C comments by recognizing just the /* sequence and then taking over with a custom piece of code that reads characters from the yyin stream and recognizes the rest of the comment.
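As a rough sketch of that hand-written approach, here it is in Python rather than in a lex action. The function name strip_comments and the message wording are made up for this example, and string literals containing /* are deliberately not handled.

```python
def strip_comments(text):
    """Replace each /* ... */ comment with one space.

    Returns (stripped_text, diagnostics). String literals are not
    recognized, which is a simplification of real C lexing.
    """
    out = []
    diagnostics = []
    i = 0
    line = 1
    n = len(text)
    while i < n:
        if not text.startswith("/*", i):
            if text[i] == "\n":
                line += 1
            out.append(text[i])
            i += 1
            continue
        # Found the start of a comment: take over and scan by hand.
        start_line = line
        i += 2
        closed = False
        while i < n:
            if text.startswith("*/", i):
                i += 2
                closed = True
                break
            if text.startswith("/*", i):
                diagnostics.append(
                    f"line {line}: '/*' inside comment opened on line {start_line}"
                )
                i += 1  # step one character so that "/*/" can still close
                continue
            if text[i] == "\n":
                line += 1
            i += 1
        if not closed:
            diagnostics.append(f"line {start_line}: unterminated comment")
        out.append(" ")
    return "".join(out), diagnostics


print(strip_comments("x /* a /* b */ y"))
```

Remembering the line on which the comment opened is what makes the unterminated-comment message useful.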

doxygen alias to struct, class, etc

I want to create an alias that internally issues a \struct command referring to some specific struct and adds some additional commands:
ALIASES += "thing{2}=\struct \2 \n \n \xrefitem thingList\"\" \"List of Things\" \2 this thing belongs to that \ref \1"
the alias is invoked in some normal doxy-comment:
/**
*
* \thing{SomeThing, SomeThingStruct}
*
* \brief ..sdfsdf
*/
typedef struct sSomeTag SomeThingStruct;
It mostly does what it should, and the xrefitem list is also generated correctly, but I get this warning:
warning: the name `\_linebr' supplied as the argument of the \class, \struct, \union, or \include command is not an input file
because it interprets the \n in the alias as the second argument to the \struct command.
How can I define my alias so that it does not produce this warning?
See documentation about ALIASES in the doxygen documentation.
A few points directly from the documentation:
ALIASES This tag can be used to specify a number of aliases that act
as commands in the documentation. An alias has the form: name=value
For example adding "sideeffect=@par Side Effects:\n" will allow you to
put the command \sideeffect (or @sideeffect) in the documentation,
which will result in a user-defined paragraph with heading "Side
Effects:". You can put \n's in the value part of an alias to insert
newlines (in the resulting output). You can put ^^ in the value part
of an alias to insert a newline as if a physical newline was in the
original file.
Things to note here:
the use of the equal sign (=) between name and value (it had been lost while copying and has since been corrected in the question)
upper case versus lower case in the alias name (with a mismatch you get a message such as: warning: Found unknown command `\thing'; this has also been corrected in the question)
instead of \n you may need ^^
So the alias should read:
ALIASES += thing{2}="\struct \2 ^^ ^^ \xrefitem thingList\"\" \"List of Things\" \2 this thing belongs to that \ref \1"

Move to the beginning of next code block in Vim

Assuming I have the following Perl code open in Vim:
if (@arr = something()) {
for (@arr) {
some_function($_->{some_key});
# some
# more
# code
while (some_other_funtion($_)) {
write_log('working');
}
}
}
and the cursor is at the beginning of the line with some_function, how can I move the cursor to any of:
the start of the while
the { of the while
the first line inside the while block (with the call to write_log)
Searching for { is not an option, because there could be many { that do not start a new inner code block; see the parameter of some_function, for example.
It seems you are defining a “code block” to be { } that contain at least one line. You can most easily search for those just by searching for a { at the end of a line:
/{$
/{ means search for a {, and $ represents an anchor to the end of the line.
There might be cases where a { opens a block, but is not the last character of a line:
while (some_other_funtion($_)) { # this while is very important
write_log('working');
}
To take this into account, do the following search for a { that is not closed on the same line:
/{[^}]*$
/ – search for
{ – a { character
[^}] – followed by any character that is not a }
* – repeated 0 or more times
$ – until the end of the line
(Vim regexes are not always the same as in Perl, but this particular one is.)
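Since this particular expression reads the same in most regex dialects, it can be tried outside Vim. The following is only a sketch in Python: the helper name block_open_lines and the sample lines are made up for the illustration.

```python
import re

# Same pattern as the Vim search /{[^}]*$ : a "{" with no "}" after it on the line.
BLOCK_OPEN = re.compile(r"\{[^}]*$")


def block_open_lines(lines):
    """Return the 1-based numbers of the lines that open a block."""
    return [n for n, line in enumerate(lines, start=1) if BLOCK_OPEN.search(line)]


sample = [
    "if (@arr = something()) {",
    "    some_function($_->{some_key});",
    "    while (more($_)) { # this while is very important",
    "        write_log('working');",
    "    }",
    "}",
]

print(block_open_lines(sample))
```

The some_function line is skipped because its only { is closed again before the end of the line.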
You could define a mapping for that second search by putting this in your .vimrc:
noremap <Leader>nb /{[^}]*$<CR>
That would let you jump to the next block by pressing <Leader> (\ by default) n b.
Since it uses :noremap, it affects Select mode too. You won’t want that if your <Leader> is a printable character (which it is by default). In that case, add the line sunmap <Leader>nb below the previous line to fix Select mode.
% , $, and ^ are your best friends. (cursor to matching enclosure, end of line, beginning of line).
At the beginning of your code block, ':1$' will put your cursor on the first bracket.
% will advance you to the next 'matching' end of your code block, assuming it is balanced. If your code is out of balance, the cursor won't move. It actually counts matching-type opening and closing braces which follow and if there is an imbalance, the cursor will not move. Usually the terminal will beep: as in 'Doh! You have a problem.' It's very useful and it works with '{}[]()'
Good way to check your code and ensure that the end of the block exists. It will skip as many lines as exist between the braces (or parens or brackets) to place the cursor on the matching enclosure.
This file is small but assuming you're on line 1 (:1)
:1$ - end of line first code block
:2 - puts the cursor at the 'f' in 'for' on line 2 rather than the white space preceding.
% - jumps you to the closing ')' on that line.
% - jumps you to the opening '(' on that line.
$ - takes you to the '{' which opens the for loop code
% - jumps the cursor to the ending '}' of the for loop
% - takes you back to the top (% is bi-directional. )
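The counting that % does can be modelled in a few lines. This is a simplified Python sketch, not Vim's actual implementation: match_bracket is a made-up name, and brackets inside strings or comments are counted like any others.

```python
PAIRS = {"(": ")", "[": "]", "{": "}"}
REVERSE = {close: open_ for open_, close in PAIRS.items()}


def match_bracket(text, pos):
    """Return the index of the bracket matching text[pos], or None.

    None means the character is not a bracket or the brackets are
    unbalanced, which is the case where the cursor would not move.
    """
    here = text[pos]
    if here in PAIRS:
        partner, step = PAIRS[here], 1       # scan forward from an opener
    elif here in REVERSE:
        partner, step = REVERSE[here], -1    # scan backward from a closer
    else:
        return None
    depth = 0
    i = pos
    while 0 <= i < len(text):
        if text[i] == here:
            depth += 1
        elif text[i] == partner:
            depth -= 1
            if depth == 0:
                return i
        i += step
    return None


code = "for (@arr) { f(x); }"
print(match_bracket(code, 4), match_bracket(code, 11))
```

Calling it on the result takes you back to where you started, which is the bi-directional behaviour described above.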
Play with it. There's a reason that Intellij's text editor has a vim mode. It's powerful.
Also, pretty good vim manual here that covers some of this stuff and much more.
https://www.pks.mpg.de/~mueller/docs/suse10.1/suselinux-manual_en/manual/sec.suse.vim.html

Comment out some part of a line in matlab function

As the question suggests I want to comment out some part of a line in MATLAB.
Also I want to comment out some part of a line not till the end of line.
The reason is that I have to try two different versions of a line and I don't want to duplicate the line. I know it is easy to comment/uncomment if I duplicate the line, but I want it this way.
This is not possible within a single line (as far as I know), but you can split your expression across multiple lines:
x=1+2+3 ... optional comments for each line
... * factorA ... can be inserted here
* factorB ...
+4;
Here * factorA is commented out and * factorB is used, resulting in the term x=1+2+3*factorB+4.
The documentation contains a similar example, commenting out one part of an array:
header = ['Last Name, ', ...
'First Name, ', ...
... 'Middle Initial, ', ...
'Title']
Nope, this is not possible. From help '%':
% Percent. The percent symbol is used to begin comments.
Logically, it serves as an end-of-line character. Any
following text on the line is ignored or printed by the
HELP system.
So just copy-paste the line, or write a tiny function so that it's easier to switch between versions.

Lexing/Parsing "here" documents

For those that are experts in lexing and parsing... I am attempting to write a series of programs in perl that would parse out IBM mainframe z/OS JCL for a variety of purposes, but am hitting a roadblock in methodology. I am mostly following the lexing/parsing ideology put forth in "Higher Order Perl" by Mark Jason Dominus, but there are some things that I can't quite figure out how to do.
JCL has what's called inline data, which is very similar to "here" documents. I am not quite sure how to lex these into tokens.
The layout for inline data is as follows:
//DDNAME DD *
this is the inline data
this is some more inline data
/*
...
Conventionally, the "*" after the "DD" signifies that following lines are the inline data itself, terminated by either "/*" or the next valid JCL record (starting with "//" in the first 2 columns).
More advanced, the inline data could appear as such:
//DDNAME DD *,DLM=ZZ
//THIS LOOKS LIKE JCL BUT IT'S ACTUALLY DATA
//MORE DATA MASQUERADING AS JCL
ZZ
...
Sometimes the inline data is itself JCL (perhaps to be pumped to a program or the internal reader, whatever).
But here's the rub. In JCL, the records are 80 bytes, fixed in length. Everything past column 72 (cols 73-80) is a "comment". As well, everything following a blank that follows valid JCL is likewise a comment. Since I am looking to manipulate JCL in my programs and spit it back out, I'd like to capture comments so that I can preserve them.
So, here's an example of inline comments in the case of inline data:
//DDNAME DD *,DLM=ZZ THIS IS A COMMENT COL73DAT
data
...
ZZ
...more JCL
I originally thought that I could have my top-most lexer pull in a line of JCL and immediately create a non-token for cols 1-72 and then a token (['COL73COMMENT',$1]) for the column 73 comment, if any. This would then pass downstream to the next iterator/tokenizer a string of the cols 1-72 text followed by the col73 token.
But how would I, downstream from there, grab the inline data? I'd originally figured that the top-most tokenizer could look for a "DD \*(,DLM=(\S*))" (or the like) and then just keep pulling records from the feeding iterator until it hit the delimiter or a valid JCL starter ("//").
But you may see the issue here... I can't have 2 topmost tokenizers... either the tokenizer that looks for COL73 comments must be the top or the tokenizer that gets inline data must be at the top.
I imagine that perl parsers have the same challenge, since seeing
<<DELIM
isn't necessarily the end of the line, followed by the here document data. After all, you could see perl like:
my $this=$obj->ingest(<<DELIM)->reformat();
inline here document data
more data
DELIM
How would the tokenizer/parser know to tokenize the ")->reformat();" and then still grab the following records as-is? In the case of the inline JCL data, those lines are passed as-is, cols 73-80 are NOT comments in that case...
So, any takers on this? I know there will be tons of questions clarifying my needs and I'm happy to clarify as much as is needed.
Thanks in advance for any help...
In this answer I will concentrate on heredocs, because the lessons can be easily transferred to the JCL.
Any language that supports heredocs is not context-free, and thus cannot be parsed with common techniques like recursive descent. We need a way to guide the lexer along more twisted paths, but in doing so, we can maintain the appearance of a context-free language. All we need is another stack.
For the parser, we treat introductions to heredocs <<END as string literals. But the lexer has to be extended to do the following:
When a heredoc introduction is encountered, it adds the terminator to the stack.
When a newline is encountered, the body of the heredoc is lexed, until the stack is empty. After that, normal parsing is resumed.
Take care to update the line number appropriately.
In a hand-written combined parser/lexer, this could be implemented like so:
use strict; use warnings; use 5.010;
my $s = <<'INPUT-END'; pos($s) = 0;
<<A <<B
body 1
A
body 2
B
<<C
body 3
C
INPUT-END
my @strs;
push @strs, parse_line() while pos($s) < length($s);
for my $i (0 .. $#strs) {
    say "STRING $i:";
    say $strs[$i];
}
sub parse_line {
    my @strings;
    my @heredocs;
    $s =~ /\G\s+/gc;
    # get the markers
    while ($s =~ /\G<<(\w+)/gc) {
        push @strings, '';
        push @heredocs, [ \$strings[-1], $1 ];
        $s =~ /\G[^\S\n]+/gc; # spaces that are not newlines
    }
    # lex the EOL
    $s =~ /\G\n/gc or die "Newline expected";
    # process the deferred heredocs:
    while (my $heredoc = shift @heredocs) {
        my ($placeholder, $marker) = @$heredoc;
        # non-greedy, so the body stops at the first line holding the marker
        $s =~ /\G(.*?\n)$marker\n/sgc or die "Heredoc <<$marker expected";
        $$placeholder = $1;
    }
    return @strings;
}
Output:
STRING 0:
body 1
STRING 1:
body 2
STRING 2:
body 3
The Marpa parser simplifies this a bit by allowing events to be triggered once a certain token is parsed. These are called pauses, because the built-in lexing pauses a moment for you to take over. Here is a high-level overview and a short blogpost describing this technique with the demo code on Github.
In case anyone was wondering how I decided to resolve this, here is what I did.
My main lexing routine accepts an iterator that pumps full lines of text (which can take it from a file, a string, whatever I want). The routine uses that to create another iterator, which examines the line for "comments" after column 72, which it will then return as a "mainline" token followed by a "col72" token. This iterator is then used to create yet another iterator, which passes the col72 tokens through unchanged, but takes the mainline tokens and lexes them into atomic tokens (things like STRING, NUMBER, COMMA, NEWLINE, etc).
But here's the crux... the lexing routine has the ORIGINAL ITERATOR still... so when it receives a token that indicates there is a "here" document, it continues processing tokens until it hits a NEWLINE token (meaning end of the actual line of text) and then uses the original iterator to pull off the here document data. Since that iterator feeds the atomic tokens iterator, pulling from it then prevents those lines from being atomized.
To illustrate, think of iterators like hoses. The first hose is the main iterator. To that I attach the col72 iterator hose, and to that I attach the atomic tokenizer hose. As streams of characters go in the first hose, atomized tokens come out the end of the third hose. But I can attach a 2-way nozzle to the first hose that will allow its output to come out the alternate nozzle, preventing that data from going into the second hose (and hence the third hose). When I'm done diverting the data through the alternate nozzle, I can turn that off and then data begins flowing through the second and third hoses again.
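The hose picture can be sketched with Python generators. This is only an illustration under several assumptions: the token names, the DD_INLINE pattern and the function names are invented here, the third (atomizing) hose is left out, and inline data is ended only by its delimiter, not by the next // record, which would need a one-line pushback.

```python
import re

# Hypothetical pattern for "DD *" with an optional DLM= delimiter.
DD_INLINE = re.compile(r"\bDD \*(?:,DLM=(\S+))?")


def split_col72(lines):
    """Second hose: split each record into cols 1-72 and the cols 73-80 comment."""
    for line in lines:
        yield line[:72].rstrip(), line[72:80].strip()


def tokenize(lines):
    source = iter(lines)  # first hose, kept so it can be tapped directly
    for main, comment in split_col72(source):
        yield ("MAINLINE", main)
        if comment:
            yield ("COL72", comment)
        m = DD_INLINE.search(main)
        if m:
            delim = m.group(1) or "/*"
            data = []
            for raw in source:  # alternate nozzle: these records bypass split_col72
                if raw.startswith(delim):
                    break
                data.append(raw)
            yield ("INLINE", data)


records = [
    "//SYSIN DD *,DLM=ZZ".ljust(72) + "SEQ00010",
    "//THIS LOOKS LIKE JCL BUT IT'S ACTUALLY DATA",
    "ZZ",
    "//NEXT DD DUMMY",
]
for token in tokenize(records):
    print(token)
```

Because split_col72 pulls one record at a time from the same iterator, anything tokenize reads from source directly never reaches the later stages.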
Easy-peasey.

replace string1 with string2 in many java files, only in comments

I have around 3000 instances of a replacement to make across hundreds of files. Replacing all occurrences of string1 with string2 was easy. IntelliJ allows me to replace all occurrences in "comments and strings".
The problem is that the same string appears both in comments and in real code. I would like to restrict the replacement to comments only (we use a mix of /**/ and //).
Any library/IDE/script that can do this?
use Regexp::Common 'comment';
...
s/($RE{comment}{'C++'})/(my $x = $1) =~ s#string1#string2#g; $x/ge;
Try using the following regex to find all comments, and then replace what you want within each match:
/(?>\/\*[^\*\/]*\*\/|\/\/([^\n])*\n)/
The first part, \/\*[^\*\/]*\*\/, finds /* */ pairs: something that starts with /*, then contains only characters other than * and /, and then ends with */. Note that a block comment with a * or / anywhere in its body (a Javadoc comment whose lines start with *, for instance) is therefore not matched.
The other part matches something that starts with //, contains anything but newlines ([^\n]*), and runs to the end of the line (\n).
Thus it should match all simple comments.
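The same idea of matching whole comments and substituting only inside each match can be sketched in Python. Treat this as an approximation, not a Java parser: the names TOKEN and replace_in_comments are made up, string and character literals are matched only so that comment markers inside them are skipped, and Java text blocks (""") are not handled.

```python
import re

# Comments are what we want to edit; literals are matched so they can be skipped.
TOKEN = re.compile(r'''
    (?P<comment> /\*.*?\*/ | //[^\n]* )
  | (?P<literal> "(?:\\.|[^"\\\n])*" | '(?:\\.|[^'\\\n])*' )
''', re.VERBOSE | re.DOTALL)


def replace_in_comments(source, old, new):
    """Replace old with new inside /* */ and // comments only."""
    def fix(match):
        if match.group("comment") is not None:
            return match.group().replace(old, new)
        return match.group()  # string or char literal: leave untouched
    return TOKEN.sub(fix, source)


java = (
    'String s = "string1 // not a comment"; // string1 here\n'
    "/* string1 */ int string1;"
)
print(replace_in_comments(java, "string1", "string2"))
```

Run it over a copy of the files first and review the diff, since a sketch like this has not been checked against every corner of Java syntax.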