In emacs, how do I force certain characters to act as end of statement delineators? - emacs

I've created a new major mode derived from cc-mode, because I'm using a meta-language that is mostly C-like, but is parsed to generate code automatically.
Say I have something like this:
struct MyNewStruct
{
int newInt = 32;
{
[flag, different-flag]
string newString = "foo";
}
}
I need the ']' character to effectively be equivalent to the ; or the next line, declaring the string, doesn't indent properly.
I've tried using M-x modify-syntax-entry for ']' and making it both a closing character as well as a punctuation character (according to the GNU manual on syntax tables), but it doesn't look like it's allowed to belong to two character classes simultaneously (unless one of those character classes is a comment). (And if it's just a punctuation character, that causes other problems.)
I can't change the grammar of the meta-language, so adding a semicolon after the close bracket isn't possible.

In this case, the real answer was to pick something that was syntactically closer to my meta-language. csharp-mode already parses the brackets correctly and marks sections enclosed in brackets as statements, not statement-cont.

Related

Does Unicode have a special marker character?

My father created in mid 90's an encoding for his engineering purposes for his company's computers. It was close to ISO 8859-2 (Latin 2), but with some differences.
For example there was added a special "MARKER CHARACTER". This character wasn't determined to be a literal, but also it wasn't a control character.
The purpose of this character was to be inserted by machine when needed to split text into parts. See the following Python parser script:
re.sub(r'\{\{', r'~{{', text)
re.sub(r'\[\[', r'~[[', text)
re.sub(r'\]\]', r']]~', text)
re.sub(r'\}\}', r'}}~', text)
parts = text.strip('~').split('~')
inCurly = [False]
inSharp = [False]
whereAmI = ['']
for part in parts:
if part[:2] == '{{':
inCurly.append(True)
whereAmI.append('Curly')
elif part[:2] == '[[':
inSharp.append(True)
whereAmI.append('Sharp')
if whereAmI[-1] == 'Sharp' and not inCurly[-1]:
# some advanced magic on current part,
# if it is directly surrounded by sharp brackets,
# but these sharp brackets are not in curly brackets anyhow
# (not: "{{ (( [[ some text ]] )) }}")
# detecting closing brackets and popping inSharp, inCurly, whereAmI
# joining parts back to text
This is an easy parser for advanced purposes, you can detect more parenthesis or quotation marks as you want. But this have one huge fault. It break things when a ~ is in text.
For this purpose and similar purposes like this (but in C lang I think) he added to his encoding/character set that marker character.
For years I use for this purpose three german "sharp s": ßßß, because it is almost impossible to see three of them in a row. But this is not an ideal solution.
Yesterday my father told me this story and I immediatelly thought: is there some equivalent in an Unicode family? Unicode is a modern developing standard spreaded all over the world in past decade drastically. There should be a special character only for this particular purpose, or not?
I don't think there's anything called that specifically, but you might find zero-width space or information separator, among others, suitable for the purpose. You can arbitrarily select any character as your marker, and use an escape character if it occurs within the string.
In the control pictures block, there is a symbol for the group separator.

How to search for any unicode symbol in a character string?

I've got an existing DOORS module which happens to have some rich text entries; these entries have some symbols in them such as 'curly' quotes. I'm trying to upgrade a DXL macro which exports a LaTeX source file, and the problem is that these high-number symbols are not considered "standard UTF-8" by TexMaker's import function (and in any case probably won't be processed by Xelatex or other converters) . I can't simply use the UnicodeString functions in DXL because those break the rest of the rich text, and apparently the character identifier charOf(decimal_number_code) only works over the basic set of characters, i.e. less than some numeric code value. For example, charOf(8217) should create a right-curly single quote, but when I tried code along the lines of
if (charOf(8217) == one_char)
I never get a match. I did copy the curly quote from the DOORS module and verified via an online unicode analyzer that it was definitely Unicode decimal value 8217 .
So, what am I missing here? I just want to be able to detect any symbol character, identify it correctly, and then replace it with ,e.g., \textquoteright in the output stream.
My overall setup works for lower-count chars, since this works:
( c is a single character pulled from a string)
thedeg = charOf(176)
if( thedeg == c )
{
temp += "$\\degree$"
}
Got some help from DXL coding experts over at IBM forums.
Quoting the important stuff (there's some useful code snippets there as well):
Hey, you are right it seems intOf(char) and charOf(int) both do some
modulo 256 and therefore cut anything above that off. Try:
int i=8217;
char c = addr_(i);
print c;
Which then allows comparison of c with any input char.

How do I specify a unicode literal that requires more than four hex digits in Antlr?

I want to define a lexer rule for ranges between unicode characters that have code points that need more than four hexadecimal digits to identify. To be concrete, I want to declare the following rule:
ID_Continue : [\uE0100-\uE01EF] ;
Unfortunately, it doesn't work. This rule will match characters that are not in this range. (I'm not certain to what exact behaviour this results in, but it isn't the one I want.) I've tried also the following (padding with leading zeros and using 8 digits):
ID_Continue : [\U000E0100-\U000E01EF] ;
But it seems to result in the same unwanted behaviour.
I am using Antlr4 and the IntelliJ plugin for it for testing.
Does Antlr4 not support unicode literals above \uFFFF?
No, ANTLR's max is the same as Java's Character.MAX_VALUE
If you look at (a part of) ANTLR4's lexer grammar you will see these rules:
// Any kind of escaped character that we can embed within ANTLR literal strings.
fragment EscSeq
: Esc
( [btnfr"'\\] // The standard escaped character set such as tab, newline, etc.
| UnicodeEsc // A Unicode escape sequence
| . // Invalid escape character
| EOF // Incomplete at EOF
)
;
...
fragment UnicodeEsc
: 'u' (HexDigit (HexDigit (HexDigit HexDigit?)?)?)?
;
...
fragment Esc : '\\' ;
Note: the limitation to the BMP is purely a Java limitation. Other targets might go much further. For instance my MySQL grammar, written for ANTLR3 (C target) can easily lex e.g. emojis from beyond the BMP. This works for quoted strings as well as IDENTIFIERs.
What's a bit strange here is however that I haven't specified that range in the grammar (it uses only the BMP). Still the parser can parse any utf-8 input. Might be a bug in the target runtime, though I'm happy it exists :-D

.tmlanguage escape sequences and rule priorities

I'm implementing a syntax highlighter in Apple's Swift language by parsing .tmlanguage files and applying styles to a NSMutableAttributtedString.
I'm testing with javascript code, a javascript.tmlanguage file, and the monokai.tmtheme theme (both last included in sublime text 3) to check that the syntax get highlighted correctly. By applying each rule (patterns) in the .tmlanguage file in the same order they come, the syntax is almost perfectly highlighted.
The problem I'm having right now is that I don't know how to know that a quote (") should be escaped when it has a backslash before it (\"). Am I missing something in the .tmlanguage file that specifies that?. Other problem is that I have no idea how to know that other rules should be ignored when inside others, for example:
I'm getting double slashes taken as comments when inside strings: "http://stackoverflow.com/" a url is recognised as comment after //
Also double or single quotes are taken as strings when inside comments: // press "Enter" to continue, the word "Enter" gets highlighted as string when should be same color as comments
So, I don't know if there is some priority for some rules over others in the convention, or if there is something in the files that I haven't noticed.
Help please!
Update:
Here is a better example of what I meant by escape quotes:
I'm getting this: while all the letters should be yellow except for the escaped sequence (/") which should be blue.
The question is. How do I know that /" should be escaped? The rule for that piece of code is:
Maybe I am late to answer this. You can apply the following method.
(Ugly) In your end regex, use ([^/])(") and in your endCaptures, it would be
1 = string.quote.double.js
2 = punctuation.definition.string.end.js
If the string must be single line, you can use match=(")(.*)("), captures=
1 = punctuation.definition.string.begin.js
2 = string.quote.double.js
3 = punctuation.definition.string.end.js
and use your patterns
You can try applyEndPatternLast and see if it is allowed. Set applyEndPatternLast=1 will do.
The priority is that earlier rules in the file are prioritized over later rules. As an example, in my Python Improved language definition, I have a scope that contains a series of all-caps constants used in Django, a popular Python web framework. I also have a generic constant.other.allcaps.python scope that recognizes (just about) anything in all caps. Since the Django constants rule is before the allcaps rule in the .tmLanguage file, I can color it with a theme using one color, while the later-occurring "highlight everything in all caps" only grabs identifiers that are NOT part of the first list.
Because of this, you should put your "comments" scope(s) as early in the file as possible, then write your parser in such a way that it obeys the rule I described above. However, it's slightly more complicated than that, as I believe items in the repository are prioritized based on where their include line is, not where the repository rule is defined in the file. You may want to do some testing to verify that, though.
Unfortunately I'm not sure what you mean about the escaped quotes - could you expand on that, and maybe add an example or two?
Hope this helps.
Assuming that / is the correct character for escaping a double quote mark, the following should work:
"str_double_quote": {
"begin": "\"",
"end": "\"",
"name": "string.quoted.double.swift",
"patterns": [
{
"name": "constant.character.escape.swift",
"match": "/[\"/]"
}
]
}
You can match an escaped double quote mark (/") and a literal forward slash (//) in the patterns to consume them before the end marker is used to handle them.
If the character for escaping is actually a backslash, then the tricky bit is that there are two levels of escaping, for the JSON encoding as well as the regular expression syntax. To match \", the regular expression requires you to escape the backslash (\\"). JSON requires you to escape backslashes and double quotes, resulting in \\\\\" in a TextMate JSON grammar file. The match expression would thus be \\\\[\"\\\\].

Is it possible to change an emacs syntax table based on context?

I'm working on improving an emacs major mode for UnrealScript. One of the (many) quirks is that it allows syntax like this for specifying tooltips in the Unreal editor:
var() int MyEditorVar <Foo=Bar|Tooltip=My tooltip text isn't quoted>;
The angle brackets after the variable declaration denote a pipe-separated list of Key=Value metadata pairs, and the metadata is not quoted but can contain quote marks -- a pipe (|) or right angle bracket (>) denotes the end.
Is there a way I can get the emacs syntax table to recognize this context-dependent syntax in a useful way? I'd like everything except for pipes and right angle brackets to be highlighted in some way inside of these variable metadata declarations, but otherwise retain their normal highlighting.
Right now, the single quote character is set up to be a quote delimiter (syntax designator "), so font-lock-mode interprets such a quote as starting a quoted string, which it's not in this very specific instance, so it mishighlights everything until it finds another supposedly matching single quote.
You'll need to setup a syntax-propertize-function which lets you apply different syntax designators to different characters in the buffer, depending on their context.
Grep for syntax-propertize-function in Emacs's lisp directory to see various examples (from simple to pretty complex ones).
You'll probably want to mark the "=" chars after your "Foo" and after your "Tooltip" as "generic string delimiter", then do the same with the corresponding terminating "|" and ">". An alternative could be to mark the char before the ">" as a (closing) generic string delimiter, so that you can then mark the "<" and ">" as open&close parens.