Matching specific characters in FLEX/LEX

Say for example I have strings like
The Undiscovered Country
Return of the Jedi
The Motion Picture
The Phantom Menace
Attack of the Clones
I only want it to return the ones starting with "The". How do I match the first three characters of a line specifically?

You can match text starting only at the beginning of a line by putting a ^ at the beginning of the pattern. (The beginning of a line is either the beginning of the input, or the character immediately following a newline (\n) character.)
So the following will categorize lines into whole lines starting with The and all other whole lines:
^"The ".* { /* yytext is a line starting with The, not including the newline. */ }
.+ { /* yytext is a line not starting with The. */ }
\n ; /* Ignore newline characters */
Note: The pattern only matches lines which start with The (including the space). That's not the same as lines starting The (which would include These errors), and nor is it the same as lines starting with the word The (which might include something like The--only--way forward). In general, getting a lexical specification right means thinking carefully about all the corner cases, and deciding for each one what the desired outcome is.
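The corner cases from the note can be checked quickly outside flex. The following is a minimal Python sketch, not flex code: it assumes Python's re module with re.MULTILINE, where ^ anchors at the start of every line much like the flex anchor described above. The names STARTS_WITH_THE and lines_starting_with_the are made up for this illustration.

```python
import re

# With re.MULTILINE, ^ matches at the start of the input and after each newline.
# "." never matches a newline here, so ".*" stops at the end of the line.
STARTS_WITH_THE = re.compile(r"^The .*", re.MULTILINE)

def lines_starting_with_the(text):
    """Return every whole line that starts with 'The ' (including the space)."""
    return STARTS_WITH_THE.findall(text)

titles = "\n".join([
    "The Undiscovered Country",
    "Return of the Jedi",
    "The Motion Picture",
    "The Phantom Menace",
    "Attack of the Clones",
])

for line in lines_starting_with_the(titles):
    print(line)
```

Lines such as "These errors" or "The--only--way forward" are not returned, because the pattern requires the space after The.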

Related

Flex: easy way to see if a line has any content?

Among many rules in my Altair BASIC Flex file is this one:
[\n] {
    ++num_lines;
    ++num_statements;
    return '\n';
}
++num_statements; is not actually correct: in theory the line might be empty (due to bad data in the .BAS file, for instance) and thus not have any statements on it. So is there any way to know whether there are any tokens in front of the \n since the last \n? I know you can do this with BEGIN() et al., but that seems like a LOT of work for a simple problem! Is there an easier way?
It's easy to match a blank line, although I'm not sure that's really what you're looking for.
The first pattern matches a line which only contains space and tab characters (adjust as necessary to match other whitespace). The second pattern matches the same whitespace when it's not at the beginning of a line. (Actually, it would match the whitespace anywhere, but at the beginning of a line, the first pattern wins.)
^[ \t]*\n ;
[ \t]*\n { ++num_statements; return '\n'; }
Instead of counting lines yourself, I suggest you use %option yylineno so flex will count them for you. (In yylineno.)
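To sanity-check the two patterns against sample input, the same idea can be mirrored in a few lines of Python. This is only a sketch under assumptions: it uses Python's re module rather than flex, it counts one statement per non-blank line as the rules above do, and the names BLANK_LINE and count_lines_and_statements are invented for the example.

```python
import re

# Same idea as the first flex pattern: a line holding only spaces and tabs.
BLANK_LINE = re.compile(r"^[ \t]*\n", re.MULTILINE)

def count_lines_and_statements(source):
    """Return (num_lines, num_statements), skipping blank lines for statements."""
    num_lines = source.count("\n")
    blank_lines = len(BLANK_LINE.findall(source))
    return num_lines, num_lines - blank_lines

program = "10 PRINT 1\n\n  \t\n20 END\n"
print(count_lines_and_statements(program))
```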

Does Unicode have a special marker character?

In the mid-90s my father created an encoding for engineering purposes on his company's computers. It was close to ISO 8859-2 (Latin 2), but with some differences.
For example, it added a special "MARKER CHARACTER". This character was not meant to be a literal character, but it was not a control character either.
Its purpose was to be inserted by machine wherever text needed to be split into parts. See the following Python parser script:
import re

# Put a ~ marker before each opening and after each closing bracket pair.
# re.sub returns a new string, so the result has to be assigned back.
text = re.sub(r'\{\{', r'~{{', text)
text = re.sub(r'\[\[', r'~[[', text)
text = re.sub(r'\]\]', r']]~', text)
text = re.sub(r'\}\}', r'}}~', text)
parts = text.strip('~').split('~')
inCurly = [False]
inSharp = [False]
whereAmI = ['']
for part in parts:
    if part[:2] == '{{':
        inCurly.append(True)
        whereAmI.append('Curly')
    elif part[:2] == '[[':
        inSharp.append(True)
        whereAmI.append('Sharp')
    if whereAmI[-1] == 'Sharp' and not inCurly[-1]:
        # some advanced magic on current part,
        # if it is directly surrounded by sharp brackets,
        # but these sharp brackets are not in curly brackets anyhow
        # (not: "{{ (( [[ some text ]] )) }}")
        pass
    # detecting closing brackets and popping inSharp, inCurly, whereAmI
# joining parts back to text
This is an easy parser for advanced purposes; you can detect more kinds of parentheses or quotation marks as you want. But it has one huge fault: it breaks when a ~ occurs in the text.
For this and similar purposes (but in C, I think) he added that marker character to his encoding/character set.
For years I have used three German "sharp s" characters for this purpose: ßßß, because it is almost impossible to see three of them in a row. But this is not an ideal solution.
Yesterday my father told me this story and I immediately thought: is there an equivalent in Unicode? Unicode is a modern, evolving standard that has spread across the world over the past decade. Shouldn't there be a special character for exactly this purpose?
I don't think there's anything called that specifically, but you might find zero-width space or information separator, among others, suitable for the purpose. You can arbitrarily select any character as your marker, and use an escape character if it occurs within the string.
In the control pictures block, there is a symbol for the group separator.
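As an illustration of the suggestion above, here is a minimal Python sketch that swaps the ~ for one of the information separators. The choice of U+001F (INFORMATION SEPARATOR ONE) is arbitrary, the function name split_on_brackets is hypothetical, and rejecting input that already contains the marker is a simplification that stands in for a real escaping scheme.

```python
import re

# Arbitrary choice for this sketch: U+001F INFORMATION SEPARATOR ONE.
MARKER = "\u001f"

def split_on_brackets(text):
    """Split text before each {{ or [[ and after each ]] or }}."""
    if MARKER in text:
        # Simplification: a fuller version would escape the marker instead.
        raise ValueError("input already contains the marker character")
    text = re.sub(r"(\{\{|\[\[)", MARKER + r"\1", text)
    text = re.sub(r"(\]\]|\}\})", r"\1" + MARKER, text)
    # Drop empty pieces produced by adjacent markers.
    return [part for part in text.split(MARKER) if part]

print(split_on_brackets("price ~ 5 {{note [[link]] more}} end"))
```

A tilde in the input now passes through untouched, since it is no longer the delimiter.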

Do Unicode's line breaking rules require the last character to be a mandatory break?

I'm trying to use libunibreak (https://github.com/adah1972/libunibreak) to mark the possible line breaks in some given unicode text.
Libunibreak gives back four possible options for each code unit in some text:
LINEBREAK_MUSTBREAK
LINEBREAK_ALLOWBREAK
LINEBREAK_NOBREAK
LINEBREAK_INSIDEACHAR
Hopefully these are self-explanatory. I would expect that MUSTBREAK corresponds to newline characters like LF. However, for any given text Libunibreak always indicates that the last character is MUSTBREAK.
So for example with the string "abc", the output would be [NOBREAK,NOBREAK,MUSTBREAK]. For "abc\n" the output would be [NOBREAK,NOBREAK,NOBREAK,MUSTBREAK]. I use the MUSTBREAK attribute to start a new line when drawing text so the first case ("abc") creates an extra linebreak that shouldn't be there.
Is this behaviour what Unicode specifies or is this a quirk of the library implementation I'm using?
Yes, this is what the Unicode line breaking algorithm specifies. Rule LB3 in UAX #14: Unicode Line Breaking Algorithm, section 6.1 "Non-tailorable Line Breaking Rules" says:
Always break at the end of text.
The spec further explains:
[This rule is] designed to deal with degenerate cases, so that there is [...] at least one line break for the whole text.
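One way to live with this rule when drawing text is to read MUSTBREAK as "a line ends after this character" instead of "start a new line after this character"; the break at the end of the text then simply closes the last line. The Python sketch below illustrates that reading. It does not call libunibreak: the constants are string stand-ins for the library's C result codes, the function name split_mandatory_lines is made up, and it assumes one result per character as in the ASCII examples from the question.

```python
# Stand-ins for libunibreak's result codes (illustrative values only).
MUSTBREAK = "MUSTBREAK"
ALLOWBREAK = "ALLOWBREAK"
NOBREAK = "NOBREAK"

def split_mandatory_lines(text, breaks):
    """Cut text after every MUSTBREAK position; never emit a trailing empty line."""
    lines = []
    start = 0
    for index, kind in enumerate(breaks):
        if kind == MUSTBREAK:
            lines.append(text[start:index + 1])
            start = index + 1
    if start < len(text):
        # Only reached if the final position was not marked MUSTBREAK.
        lines.append(text[start:])
    return lines

print(split_mandatory_lines("abc", [NOBREAK, NOBREAK, MUSTBREAK]))
```

With this reading, "abc" and "abc\n" both produce a single line, and no extra line is started after the final break.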

Why does Github Flavored Markup only add newlines for lines that start with [\w\<]?

In our site (which is aimed at highly non-technical people), we let them use Markdown when sending emails. That way, they get nice things like bold, italic, etc. Being non-technical, however, they would never get past the “add two lines to make newlines actually work” quirk.
For that reason mainly, we are using a variant of Github Flavored Markdown.
We mainly borrowed this part:
# in very clear cases, let newlines become <br /> tags
text.gsub!(/^[\w\<][^\n]*\n+/) do |x|
  x =~ /\n{2}/ ? x : (x.strip!; x << "  \n")
end
This works well, but in some cases it doesn’t add the new-lines, and I guess the key to that is the “in very clear cases” part of that comment.
If I interpret it correctly, this is only adding newlines for lines that start with either a word character or a ‘<’.
Does anyone know why that is? Particularly, why ‘<’?
What would be the harm in just adding the two spaces to essentially anything (lines starting with spaces, hyphens, anything)?
'<' character is used at the beginning of a line to quote messages. I guess that is the reason.
The other answer to this question is quite wrong. This has nothing to do with quoting, and the character for markdown quoting is >.
^[\w\<][^\n]*\n+
Let's break the above regex into parts:
^ anchors the match at the start of a line (in Ruby, ^ matches after every newline, not only at the start of the string).
[\w\<] matches a single word character or a literal <. Inside a character class, \< is just an escaped <, not a GNU-style word boundary, so lines that begin with an HTML tag qualify too.
[^\n]* matches any number of non-newline characters.
\n matches a newline.
+ is an ordinary "one or more" quantifier applied to \n, so any run of newlines after the line is included in the match.
Together, this sets x to a line of text plus the newline(s) that follow it. Then the heavy work is done by the next line:
x =~ /\n{2}/ ? x : (x.strip!; x << "  \n")
This says: "if x contains two consecutive line breaks (that is, it already ends a paragraph), leave x as is. Otherwise, strip x and append two spaces and a newline", which Markdown renders as a <br />.
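To experiment with which lines the pattern touches, here is an approximate Python translation of the Ruby snippet. It is a sketch for exploration, not the code GitHub ships: the function name add_hard_breaks is made up, Python's re and str.strip are assumed to be close enough to Ruby's gsub! and strip! for this purpose, and the appended suffix is two spaces plus a newline.

```python
import re

# Lines that start with a word character or "<", plus the newline(s) after them.
CLEAR_CASE = re.compile(r"^[\w<][^\n]*\n+", re.MULTILINE)

def add_hard_breaks(text):
    """Append two spaces to matching lines unless a blank line already follows."""
    def fix(match):
        chunk = match.group(0)
        if "\n\n" in chunk:
            return chunk                  # paragraph break: leave untouched
        return chunk.strip() + "  \n"     # single newline: make it a hard break
    return CLEAR_CASE.sub(fix, text)

sample = "one\ntwo\n\n- item\n<b>bold</b>\nlast\n"
print(repr(add_hard_breaks(sample)))
```

In the sample, the line starting with a hyphen is left alone, which matches the behaviour described in the question.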

Unicode character for marking

We are going to digitize a lot of books. We want to mark the places of line breaks in the original book without influencing the flow of the digital book. Which invisible Unicode character can be used to mark such special places in a raw file?
(\n will be used to indicate the end of a paragraph.)
This is a sentence
in the original book that
I want to mark line
break places.
What is the proper character to replace *:
This is a sentence * in the original book that * I want to mark line * break places.
Unicode has no concept of a hidden character that represents a line break in some original but does not cause line break in rendering. Unicode encodes plain text data, and its control characters for line breaks have an effect when plain text is rendered.
What matters here is how the files will be used. If they need to be processable with plain text editors, then you need to decide: either the line breaks are replicated in default rendering, or they are omitted when creating the file. You can’t make them invisible. And different text editors, like Notepad and Emacs, may well use different line control conventions; one program’s end of line is another program’s end of paragraph.
If the files will only be processed by programs that you create, then you can use whatever conventions you like. The most logical one is this:
“Line and Paragraph Separator. The Unicode Standard provides two unambiguous characters, U+2028 line separator and U+2029 paragraph separator, to separate lines and paragraphs. They are considered the default form of denoting line and paragraph boundaries in Unicode plain text. A new line is begun after each line separator. A new paragraph is begun after each paragraph separator. As these characters are separator codes, it is not necessary either to start the first line or paragraph or to end the last line or paragraph with them. Doing so would indicate that there was an empty paragraph or line following. The paragraph separator can be inserted between paragraphs of text. Its use allows the creation of plain text files, which can be laid out on a different line width at the receiving end. The line separator can be used to indicate an unconditional end of line.”
http://www.unicode.org/versions/Unicode6.1.0/ch16.pdf (pages 6 and 7 in the PDF)
Beware that U+2028 and U+2029 are generally not understood by text editors. They are suitable for storing data in plain text format. When the text is to be rendered, the rendering software has the option of ignoring the original division into lines and treating U+2028 as equivalent to a space, except if preceded by a hyphen (which poses a problem that cannot be resolved without higher-level information: a line that ends with “foo-” and is followed by a line beginning with “bar” could represent the word “foobar” as hyphenated for line breaking, or a hyphenated compound “foo-bar” or, in some cases, the combination “foo- bar”).
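For files that only your own programs read, the convention could look roughly like the Python sketch below. It assumes U+2028 marks each original line end inside a paragraph, as described above; the function names are invented for the example, and the hyphen ambiguity is deliberately left unhandled.

```python
LINE_SEPARATOR = "\u2028"  # U+2028 LINE SEPARATOR

def store_paragraph(printed_lines):
    """Join the printed lines of one paragraph, remembering where each ended."""
    return LINE_SEPARATOR.join(printed_lines)

def recover_lines(stored):
    """Get back the line division of the original book."""
    return stored.split(LINE_SEPARATOR)

def reflow(stored):
    """Render as running text by treating each stored line end as a space."""
    return stored.replace(LINE_SEPARATOR, " ")

page = [
    "This is a sentence",
    "in the original book that",
    "I want to mark line",
    "break places.",
]
stored = store_paragraph(page)
print(reflow(stored))
```

Note that Python's str.splitlines() also treats U+2028 as a line boundary, so it would undo the reflow on such data; the sketch uses split(LINE_SEPARATOR) to keep the intent explicit.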
Use the line feed character (LF, "\n", 0x0A) and/or maybe carriage return (CR, "\r", 0x0D).
I.e., the regular characters for this purpose.