Removing comments with JFlex, but keeping line terminators - lex

I'm writing a lexical specification for JFlex (it's like flex, but for Java). I have a problem with TraditionalComment (/* */) and DocumentationComment (/** */). So far I have this, taken from the JFlex User's Manual:
LineTerminator = \r|\n|\r\n
InputCharacter = [^\r\n]
WhiteSpace = {LineTerminator} | [ \t\f]
/* comments */
Comment = {TraditionalComment} | {EndOfLineComment} | {DocumentationComment}
TraditionalComment = "/*" [^*] ~"*/" | "/*" "*"+ "/"
EndOfLineComment = "//" {InputCharacter}* {LineTerminator}
DocumentationComment = "/**" {CommentContent} "*"+ "/"
CommentContent = ( [^*] | \*+ [^/*] )*
{Comment} { /* Ignore comments */ }
{LineTerminator} { return LexerToken.PASS; }
LexerToken.PASS means that I later pass the line terminators through to the output. Now, what I want to do is:
Ignore everything inside a comment except line terminators.
For example, consider this input:
/* Some
* quite long comment. */
In fact it is /* Some\n * quite long comment. */\n. With the current lexer it is converted to a single line: the output is a single '\n'. But I would like to have 2 lines, '\n\n'. In general, I want the output to always have the same number of lines as the input. How can I do this?

After a couple of days I found a solution. I will post it here in case somebody has the same problem.
The trick is, after recognizing a comment, to go through its body once more and pass on any line terminators you find instead of ignoring them:
%{
public StringBuilder newLines;
%}
// ...
{Comment} {
char[] ch;
ch = yytext().toCharArray();
newLines = new StringBuilder();
for (char c : ch)
{
if (c == '\n')
{
newLines.append(c);
}
}
return LexerToken.NEW_LINES;
}
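The loop above only looks for '\n', while the LineTerminator macro in the question also allows a bare \r. A minimal sketch of a variant that counts \r\n, \r, and \n as one terminator each is shown here; the class and method names are hypothetical helpers, not part of JFlex, and emitting '\n' for every terminator is an assumption about the desired output.

```java
// Hypothetical helper, not part of JFlex: keeps only the line terminators of a matched comment.
final class CommentNewlines {

    /** Returns one '\n' for every \r\n, bare \r, or bare \n found in the comment text. */
    static String lineTerminatorsOf(String commentText) {
        StringBuilder newLines = new StringBuilder();
        for (int i = 0; i < commentText.length(); i++) {
            char c = commentText.charAt(i);
            if (c == '\r') {
                // Treat \r\n as a single terminator by skipping the following \n.
                if (i + 1 < commentText.length() && commentText.charAt(i + 1) == '\n') {
                    i++;
                }
                newLines.append('\n');
            } else if (c == '\n') {
                newLines.append('\n');
            }
        }
        return newLines.toString();
    }

    public static void main(String[] args) {
        String comment = "/* Some\n * quite long comment. */";
        System.out.println(lineTerminatorsOf(comment).length()); // prints 1
    }
}
```

Inside the {Comment} action, the result of lineTerminatorsOf(yytext()) could be stored in the newLines field before returning LexerToken.NEW_LINES.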

In a StringTemplate how to temporarily suppress automatic indentation?

Suppose a template:
fooTemplate() ::= <<
I want this to be indented normally.
# I do not want this line to be indented.
>>
To explain the motivation:
I am generating C code and I do not want the preprocessor directives to be indented, e.g.
#if
To be clear the fooTemplate is not the only template.
It is called by other templates (which may nest several levels deep).
Introducing a special character into the template to temporarily disable indentation would be acceptable.
fooTemplate() ::= <<
I want this to be indented normally.
<\u0008># I do not want this line to be indented.
>>
I see that indentation is actually applied by the 'AutoIndentWriter': https://github.com/antlr/stringtemplate4/blob/master/doc/indent.md
I implemented my own 'SemiAutoIndentWriter', which looks for a magic character (\b in my case) in the stream.
When the magic character is seen, it sets a 'suppressIndent' flag, which suppresses indentation until the next newline.
package org.stringtemplate.v4;
import java.io.IOException;
import java.io.Writer;
/** Like AutoIndentWriter, but skips the indent for a line once a '\b' marker has been seen. */
public class SemiAutoIndentWriter extends AutoIndentWriter {
public boolean suppressIndent = false;
public SemiAutoIndentWriter (Writer out) {
super(out);
}
@Override
public int write(String str) throws IOException {
int n = 0;
int nll = newline.length();
int sl = str.length();
for (int i=0; i<sl; i++) {
char c = str.charAt(i);
if ( c=='\b' ) {
suppressIndent = true;
continue;
}
// found \n or \r\n newline?
if ( c=='\r' ) continue;
if ( c=='\n' ) {
suppressIndent = false;
atStartOfLine = true;
charPosition = -nll; // set so the write below sets to 0
out.write(newline);
n += nll;
charIndex += nll;
charPosition += n; // wrote n more char
continue;
}
// normal character
// check to see if we are at the start of a line; need indent if so
if ( atStartOfLine ) {
if (! suppressIndent) n+=indent();
atStartOfLine = false;
}
n++;
out.write(c);
charPosition++;
charIndex++;
}
return n;
}
}
Note that '<\b>' is not recognized as a special character by ST4, but '<\u0008>' is.
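The marker logic can be tried without StringTemplate on the classpath. The following is a minimal standalone sketch, not the ST4 API: the class IndentMarkerDemo and its indent method are hypothetical names, and it applies one fixed indent string instead of ST4's indent stack.

```java
// Standalone illustration of the marker idea; it does not use or extend any ST4 class.
final class IndentMarkerDemo {

    /**
     * Indents every line of text with the given indent string, except lines
     * in which a '\b' marker appears before the first visible character.
     * The marker itself is never written to the result.
     */
    static String indent(String text, String indent) {
        StringBuilder out = new StringBuilder();
        boolean atStartOfLine = true;
        boolean suppressIndent = false;
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (c == '\b') {            // magic character: switch indentation off for this line
                suppressIndent = true;
                continue;
            }
            if (c == '\r') {            // drop \r so that \r\n becomes \n
                continue;
            }
            if (c == '\n') {            // a newline re-enables indentation
                suppressIndent = false;
                atStartOfLine = true;
                out.append('\n');
                continue;
            }
            if (atStartOfLine) {
                if (!suppressIndent) {
                    out.append(indent);
                }
                atStartOfLine = false;
            }
            out.append(c);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(indent("int x;\n\b#if FOO\nint y;", "    "));
    }
}
```

As in the writer above, the marker has to come before the first visible character of the line; once the indent has been written it cannot be taken back.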

Insert multiple lines of text into a Rich Text content control with OpenXML

I'm having difficulty getting a content control to follow multi-line formatting. It seems to interpret everything I'm giving it literally. I am new to OpenXML and I feel like I must be missing something simple.
I am converting my multi-line string using this function.
private static void parseTextForOpenXML(Run run, string text)
{
string[] newLineArray = { Environment.NewLine, "<br/>", "<br />", "\r\n" };
string[] textArray = text.Split(newLineArray, StringSplitOptions.None);
bool first = true;
foreach (string line in textArray)
{
if (!first)
{
run.Append(new Break());
}
first = false;
Text txt = new Text { Text = line };
run.Append(txt);
}
}
I insert it into the control with this
public static WordprocessingDocument InsertText(this WordprocessingDocument doc, string contentControlTag, string text)
{
SdtElement element = doc.MainDocumentPart.Document.Body.Descendants<SdtElement>().FirstOrDefault(sdt => sdt.SdtProperties.GetFirstChild<Tag>().Val == contentControlTag);
if (element == null)
throw new ArgumentException("ContentControlTag " + contentControlTag + " doesn't exist.");
element.Descendants<Text>().First().Text = text;
element.Descendants<Text>().Skip(1).ToList().ForEach(t => t.Remove());
return doc;
}
I call it with something like...
doc.InsertText("Primary", primaryRun.InnerText);
Although I've tried InnerXML and OuterXML as well. The results look something like
Example AttnExample CompanyExample AddressNew York, NY 12345 or
<w:r xmlns:w="http://schemas.openxmlformats.org/wordprocessingml/2006/main"><w:t>Example Attn</w:t><w:br /><w:t>Example Company</w:t><w:br /><w:t>Example Address</w:t><w:br /><w:t>New York, NY 12345</w:t></w:r>
The method works fine for simple text insertion. It's just when I need it to interpret the XML that it doesn't work for me.
I feel like I must be super close to getting what I need, but my fiddling is getting me nowhere. Any thoughts? Thank you.
I believe the way I was trying to do it was doomed to fail. Setting the Text attribute of an element is always going to be interpreted as text to be displayed it seems. I ended up having to take a slightly different tack. I created a new insert method.
public static WordprocessingDocument InsertText(this WordprocessingDocument doc, string contentControlTag, Paragraph paragraph)
{
SdtElement element = doc.MainDocumentPart.Document.Body.Descendants<SdtElement>().FirstOrDefault(sdt => sdt.SdtProperties.GetFirstChild<Tag>().Val == contentControlTag);
if (element == null)
throw new ArgumentException("ContentControlTag " + contentControlTag + " doesn't exist.");
OpenXmlElement cc = element.Descendants<Text>().First().Parent;
cc.RemoveAllChildren();
cc.Append(paragraph);
return doc;
}
It starts the same way, getting the Content Control by searching for its Tag. But then I get the parent of its first Text element, remove the child elements that were there, and replace them with a paragraph element.
It's not exactly what I had envisioned, but it seems to work for my needs.

Generating DXL documentation using Doxygen : if is shown as a function

I am trying to generate some DXL documentation using Doxygen, but the results are often incorrect. DXL is a scripting language with a C/C++-like syntax and some differences; for example, semicolons can be omitted, which creates problems when generating the documentation. What should I do to correct this?
Here is an example of my DXL code:
string replace (string sSource, string sSearch, string sReplace) {
int iLen = length sSource
if (iLen == 0) return ""
int iLenSearch = length(sSearch)
if (iLenSearch == 0) {
return ""
}
char firstChar = sSearch[0]
Buffer s = create()
int pos = 0, d1,d2;
int i
while (pos < iLen) {
char ch = sSource[pos];
bool found = true
if (ch != firstChar) {pos ++; s+= ch; continue}
for (i = 1; i < iLenSearch; i++) {
if (sSource[pos+i] != sSearch[i]) { found = false; break }
}
if (!found) {pos++; s+= ch; continue}
s += sReplace
pos += iLenSearch
}
string result = stringOf s
delete s
return result }
As I said, the main difference from C that may cause Doxygen to misinterpret this code is that in DXL we don't have to use ";".
Thanks in advance.
You must do three things to apply Doxygen successfully on DXL scripts:
1.) In Doxygen-GUI, 'Wizard' tab, section 'Mode' choose 'Optimize for C or PHP'
2.) The DXL code must be C-conformant, i.e. each statement ends with a semicolon ';'
3.) In tab 'Expert' set language mapping for DXL and INC files in section 'Project' under 'EXTENSION_MAPPING':
dxl=C
inc=C
This all tells Doxygen to treat DXL scripts as C code.
Further, for DOORS to recognize a DXL file documented for Doxygen as valid and bind it to a menu item, the file must follow a certain header structure, consisting of a single-line comment and a multi-line comment, e.g.
// <dxl-file>
/**
* @file <dxl-file>
* @copyright (c) ...
* @author Th. Grosser
* @date 01 Dec 2017
* @brief ...
*/

Talend: Split a data file into two flows/streams (header_info, data_rows)

Please, I don't need a full solution, just a few hints on how to approach this. Anyhow, here is the problem I am tackling:
I have a file (bloomberg answer file) which is built as follows:
we have a header part (I am only interested in the
START-OF-FIELDS[...]END-OF-FIELDS; varying amount of fields!)
then there is the data part: START-OF-DATA[...]END-OF-DATA. Where each row: unique_id|some_val|some_val|EXCH_CODE|ID_BB_GLOBAL|NAME|SECURITY_TYP|TICKER\n
Shortened example file:
START-OF-FILE
RUNDATE=20150921
PROGRAMFLAG=oneshot
DATEFORMAT=yyyymmdd_sep
FIRMNAME=dl111111
FILETYPE=pc
REPLYFILENAME=r150921020044_20426_01_00
SECMASTER=yes
DERIVED=yes
CREDITRISK=yes
USERNUMBER=1111111
WS=0
SN=111111
CLOSINGVALUES=yes
SECID=BB_GLOBAL
PROGRAMNAME=getdata
START-OF-FIELDS
EXCH_CODE
ID_BB_GLOBAL
NAME
SECURITY_TYP
TICKER
END-OF-FIELDS
TIMESTARTED=Mon Sep 21 01:01:18 BST 2015
START-OF-DATA
BBG004C5BLW2|0|5|LABUAN INTL FIN|BBG004C5BLW2|1MDB GLOBAL INVESTMENTS|EURO-DOLLAR|OGIMK|
BBG000MGZ064|0|5|HK|BBG000MGZ064|361 DEGREES INTERNATIONAL|Common Stock|1361|
BBG000QVRHX9|0|5|AV|BBG000QVRHX9|3BG EMCORE CONVRT GLB-A|Open-End Fund|EMBDGCA|
BBG000BP52R2|0|5|US|BBG000BP52R2|3M CO|Common Stock|MMM|
BBG0068TPTD9|0|5|TRACE|BBG0068TPTD9|51JOB INC|US DOMESTIC|JOBS|
BBG0069D1BR3|0|5|NOT LISTED|BBG0069D1BR3|51JOB INC|EURO-DOLLAR|JOBS|
BBG000BJD1D4|0|5|US|BBG000BJD1D4|51JOB INC-ADR|ADR|JOBS|
BBG008CTTWK1|0|5|FRANKFURT|BBG008CTTWK1|AABAR INVESTMENTS PJSC|EURO MTN|AABAR|
BBG008D4J9S9|0|5|FRANKFURT|BBG008D4J9S9|AABAR INVESTMENTS PJSC|EURO MTN|AABAR|
BBG008B2BXH2|0|5|SIX|BBG008B2BXH2|AARGAUISCHE KANTONALBANK|DOMESTIC|KBAARG|
BBG0016WJL30|0|5|LX|BBG0016WJL30|AB-AMERICAN INCOME PT-ATEURH|Open-End Fund|ABAATEH|
BBG006F3D598|0|5|BH|BBG006F3D598|ABBEY CAPITAL DAILY FUTURE-B|Fund of Funds|ABBDFUB|
END-OF-DATA
TIMEFINISHED=Mon Sep 21 01:03:22 BST 2015
END-OF-FILE
And now my questions
How can I split this file into 2 flows (field_names; data_rows)?
My problem was:
The regex component only works on row level...
The tFileInputMSDelimited does bring me nowhere...
I don't want to start parsing the file by hand (tJava)... or do I have to?
Thanks for any hints in advance,
Marco
No need for Java code; check out this very simple job:
Generally, the header has a fixed row count, so we only need to play with row numbers:
tFileInputDelimited1: header 11 and limit 5
tFileInputDelimited2: header 20 and footer 3
This works fine. If the row positions are dynamic, try to find these positions, save them in variables, and then use this job based on those variables. You can also refer to my answer here.
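For the dynamic case, the positions can be derived from the marker lines. Here is a minimal sketch in plain Java, which could be adapted to a tJava component; the class and method names are hypothetical. It assumes that the "header" setting means the number of rows to skip and "limit" the number of rows to read, which should be checked against the component documentation.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical helper: derives "rows to skip" and "rows to read" from two marker lines.
final class SectionBounds {

    /**
     * Returns {header, limit}: header is the number of lines up to and including
     * the start marker, limit is the number of lines strictly between the markers.
     */
    static int[] headerAndLimit(List<String> lines, String startMarker, String endMarker) {
        int start = lines.indexOf(startMarker);
        int end = lines.indexOf(endMarker);
        if (start < 0 || end < start) {
            throw new IllegalArgumentException("markers not found or out of order");
        }
        return new int[] { start + 1, end - start - 1 };
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
                "START-OF-FILE", "RUNDATE=20150921",
                "START-OF-FIELDS", "EXCH_CODE", "TICKER", "END-OF-FIELDS",
                "START-OF-DATA",
                "BBG000BP52R2|0|5|US|BBG000BP52R2|3M CO|Common Stock|MMM|",
                "END-OF-DATA", "END-OF-FILE");
        int[] fields = headerAndLimit(lines, "START-OF-FIELDS", "END-OF-FIELDS");
        int[] data = headerAndLimit(lines, "START-OF-DATA", "END-OF-DATA");
        System.out.println(fields[0] + " " + fields[1]); // prints 3 2
        System.out.println(data[0] + " " + data[1]);     // prints 7 1
    }
}
```

The two pairs could then be stored in context variables or globalMap and used for the header and limit settings of the two input components.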
I'd use tJavaFlex and some Java code. If you look at the actual code, it's not that hard to understand how it works even if you don't really know Java.
Begin:
boolean header = false;
boolean data = false;
String headerData = "";
String line;
Main:
line = input_row.line;
if(line.equalsIgnoreCase("START-OF-FIELDS") ) { header = true; }
if(line.equalsIgnoreCase("END-OF-FIELDS") ) { header = false; }
if(line.equalsIgnoreCase("START-OF-DATA") ) { data = true; }
if(line.equalsIgnoreCase("END-OF-DATA") ) { data = false; }
if(header && !line.equalsIgnoreCase("START-OF-FIELDS")) {
headerData += line + "|";
}
if (data) {
if(line.equalsIgnoreCase("START-OF-DATA")) {
output_row.line = headerData.substring(0,headerData.length()-1); //remove the trailing delimiter.
} else {
output_row.line = line;
}
} else {
continue; //lets go to the next line.
}
End:
//if you want to handle the header separately:
globalMap.put("headerData",headerData);
Hope this helps.

How to check whether each element of a string array contains data in C#

I have created a web application that uses a textbox. It can contain multiple lines of data because I have set its TextMode property to MultiLine.
My problem is that I want to check whether each line contains data, so I am using a count variable that counts how many lines contain data.
string[] data;
int cntindex;
data = txt_invoicenumber.Text.ToString().Split("\n".ToCharArray());
cntindex = data.Length;
for (j = 0; j < cntindex; j++)
{
if (data[j]!="")
{
inv_count++;
}
}
It's not working.
Please help me.
I guess this is because new line is \r\n so there is a '\r' also on empty lines.
Change the if statement to:
if (data[j].Trim().Length != 0)
Firstly, you don't need to call ToString() on the .Text property, as it is already a string.
Try this:
string[] lines = txt_invoicenumber.Text.Split(new[] { Environment.NewLine }, StringSplitOptions.None);
int lineCount = 0;
foreach(string line in lines)
{
if(!string.IsNullOrEmpty(line))
{
lineCount ++;
this.ProcessLine(line);
}
}
var lb = new String[] { "\r\n" };
var lines = txt_invoicenumber.Text.Split(lb, StringSplitOptions.None).Length;
This will count empty lines too. If you don't want to count empty lines, use the StringSplitOptions.RemoveEmptyEntries value.
Don't count 100% on "\r\n" if you have little control over your environment though.
This is the answer I came up with.
String[] lines = TextBox1.Text.Split(new Char[] { '\r', '\n' },
StringSplitOptions.RemoveEmptyEntries);
Int32 validLineCount = lines.Length;