How to deal with overlapping token matching on syntax highlighting extension? - visual-studio-code

In VS Code syntax-highlighting extensions, the JSON grammar file has two sections: "patterns" and "repository".
{
"patterns": [...],
"repository": {...}
}
The 'patterns' array contains the regexes that VS Code tries to match against the current line to turn chunks of code into tokens. The patterns are tried in the order they appear in the array. If an entry matches, the rest are not tried.
The way matching is performed seems to cause a dilemma. Say that I have a language with the following syntax:
fn [public] MyFunction() {}
class [private, abstract] MyClass {}
Let's say that I want to match 3 different token types:
1. Match context-sensitive keywords, located between the brackets following the introducer keywords 'fn' and 'class'. In this case: 'public', 'private', and 'abstract'.
2. Match the introducer keywords 'fn' and 'class'.
3. Match the '[' and ']' as "decoration grouping brackets".
The problem I see is that, because vscode finds a match from the 'patterns' section and then doesn't try to match other entries, whatever is defined first gets precedence, before the internal-matching-cursor is moved to the next token to highlight. In other words:
A. If token type #1 is matched first, then #2 and #3 can't match because the cursor will be like this after match #1: fn [public]^ MyFunction() {}
B. If token type #2 is matched first, then #1 can't match because the cursor will be like this after match #2: fn^ [public] MyFunction() {}
C. If token type #3 is matched first, then #1 and #2 can't match because the cursor will be like this after match #3: fn [public]^ MyFunction() {}
So, how does one deal with these kinds of matches, where the tokens being matched overlap? Is it by having the regexes use lookbehind, like (?<=fn|class)\s*\[ ? Or is there another built-in way?
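One built-in mechanism worth noting here: a single TextMate rule can assign scopes to several capture groups at once, and a begin/end rule can carry its own nested patterns that apply only between its two delimiters. The sketch below builds such a rule as a Python dict and prints it as grammar JSON. The scope names, the `source.mylang` identifier, and the `declaration` repository key are hypothetical placeholders, not taken from any real extension.

```python
import json

# Hypothetical grammar for the example language in the question.
grammar = {
    "scopeName": "source.mylang",  # placeholder scope name
    "patterns": [{"include": "#declaration"}],
    "repository": {
        "declaration": {
            # One match yields two tokens: group 1 is the introducer
            # keyword (type #2), group 2 is the opening bracket (type #3).
            "begin": r"\b(fn|class)\s*(\[)",
            "beginCaptures": {
                "1": {"name": "keyword.declaration.mylang"},
                "2": {"name": "punctuation.section.brackets.begin.mylang"},
            },
            "end": r"\]",
            "endCaptures": {
                "0": {"name": "punctuation.section.brackets.end.mylang"},
            },
            # These patterns are only tried between "[" and "]", so the
            # words stay context-sensitive (type #1).
            "patterns": [
                {
                    "match": r"\b(public|private|abstract)\b",
                    "name": "storage.modifier.mylang",
                }
            ],
        }
    },
}

# json.dumps takes care of escaping the backslashes for the grammar file.
print(json.dumps(grammar, indent=2))
```

With a rule shaped like this, no pattern has to "win" over the others: the introducer keyword and the opening bracket are tokenized by one match, and the modifiers are matched by the nested patterns afterwards.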

Related

MongoDB: How to only pull document with exact search term

I'm sure this is somewhere on here but I can't seem to find it. I'm trying to pull a document from a large file that only matches an exact term in a field, as opposed to anything with those letters in it.
More precisely, I'm trying to use .find({"name":"Eli"}) to pull the documents with that name, but my search is pulling every name with those letters (such as elizabeth or ophelia)
You can use a regular expression match to make sure you do not return names that merely contain the same letters.
Something like this:
const name = "Eli"
const query = new RegExp(`^${name}$`)
const user = await Collection.find({ name: { $regex: query } })
I am using 2 key operators from RegEx here: ^ and $
Putting ^ in front of a regular expression will match all strings that start with the pattern given.
Putting $ at the end of a regular expression will match all strings that end with the pattern given.
So essentially you are asking Mongoose to find the records whose name both begins and ends with Eli, with nothing in between. This keeps Elizabeth out of the results, but still returns every document whose name is exactly Eli.
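The effect of the two anchors can be checked locally without a database. The sketch below builds the same kind of filter document in Python and tests it with the standard `re` module as a stand-in for the server-side regex engine. The helper names are made up for this example, and `re.escape` is an addition so that a name containing regex metacharacters is still treated literally.

```python
import re

def exact_name_filter(name: str) -> dict:
    """Build a filter document that matches only the exact name."""
    # ^ and $ anchor the pattern to the start and end of the field value.
    return {"name": {"$regex": f"^{re.escape(name)}$"}}

def would_match(filter_doc: dict, value: str) -> bool:
    """Local approximation of the match, using Python's re module."""
    return re.search(filter_doc["name"]["$regex"], value) is not None

query = exact_name_filter("Eli")
for candidate in ["Eli", "Elizabeth", "Ophelia"]:
    print(candidate, would_match(query, candidate))
```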

Why is #regex used in task.json in Azure DevOps extension? What does it check for?

I came across this and was wondering what this means and how it works?
What's the significance of using #regex here and how does it expand?
https://github.com/microsoft/azure-pipelines-tasks/blob/master/Tasks/DownloadPackageV0/task.json
"endpointUrl": "{{endpoint.url}}/{{ #regex ([a-fA-F0-9\\-]+/)[a-fA-F0-9\\-]+ feed }}_apis/Packaging/Feeds/{{ #regex [a-fA-F0-9\\-]*/([a-fA-F0-9\\-]+) feed }}{{#if view}}#{{{view}}}{{/if}}/Packages?includeUrls=false"
Also I would like to know how many packages will it return and display in the Task input UI dropdown if there are thousands of packages in the feed. Is there a known limit like first 100 or something?
#regex doesn't appear to actually be documented anywhere, but it takes two space-delimited arguments. The first is a regular expression and the second is a "path expression" identifying what value to match against, in this case the value of the feed input parameter. If the regex matches the value, it returns the first capturing subexpression, otherwise it returns the empty string.
In this particular context, the feed parameter is formatted as 'projectId/feedId', where projectId and feedId are GUIDs, and projectId and the / are eliminated for organization-scoped feeds (i.e. feeds that are not inside a project). The first regex therefore extracts the project ID and inserts it into the URL, and the second regex extracts the feed ID and inserts it into the URL.
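The extraction described above can be approximated in a few lines of Python. This is only a sketch of the described behavior (first capture group on a match, empty string otherwise), not the actual Azure DevOps implementation. The helper name, the sample GUIDs, the base URL, and the choice of `re.search` are all assumptions made for illustration.

```python
import re

# The two patterns from task.json after JSON unescaping ("\\-" becomes "\-").
PROJECT_RE = r"([a-fA-F0-9\-]+/)[a-fA-F0-9\-]+"
FEED_RE = r"[a-fA-F0-9\-]*/([a-fA-F0-9\-]+)"

def regex_helper(pattern: str, value: str) -> str:
    """Return the first capture group if the pattern matches, else ""."""
    m = re.search(pattern, value)
    return m.group(1) if m else ""

# Hypothetical project-scoped feed value in 'projectId/feedId' form.
feed = "0f1e2d3c-aaaa-bbbb-cccc-000000000001/9a8b7c6d-dddd-eeee-ffff-000000000002"

project_part = regex_helper(PROJECT_RE, feed)  # project ID plus trailing "/"
feed_part = regex_helper(FEED_RE, feed)        # feed ID only

# Placeholder base URL standing in for {{endpoint.url}}.
url = (
    "https://example.invalid/"
    f"{project_part}_apis/Packaging/Feeds/{feed_part}/Packages?includeUrls=false"
)
print(url)
```

Note that the first capture group includes the trailing slash, which is why the template can place `_apis` directly after it.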
As of this writing, the default limit on the API it's calling is 1000.
Regex stands for regular expression, which allows you to match any pattern rather than an exact string. You can find more info on how to use it in Azure Devops here
This regex is very specific. The pattern is stored in JSON, so each \\ is an escaped backslash and the regex engine actually sees ([a-fA-F0-9\-]+/)[a-fA-F0-9\-]+, where \- is a literal hyphen. It matches one or more characters that are each 1) a letter a-f (lower or upper case), 2) a digit 0-9, or 3) a hyphen, followed by / and then again one or more of those characters.
You can copy the unescaped regex ([a-fA-F0-9\-]+/)[a-fA-F0-9\-]+ into https://regexr.com/ to play around with it and see what does and doesn't match the pattern.
Examples:
it matches: a/a, a/b, abcdef-/dcba
but doesn't match: /a, abcdef, this-doesn't-match
Note that the full endpoint consists of concatenations of both regular expression and hardcoded strings!

how to get matched group number in pcre2

I want to use PCRE2 to match a string.
For example, I have several string patterns: "a", "b", "c", "d", and "e".
I have a long text "str" to match against.
I construct the pattern "a|b|c|d|e" and match it against "str" using pcre2_match.
How do I know which pattern matched?
I just want the number of the matched pattern, not "a" or "b", because I don't want to compare the match against "a", "b", "c", "d", and "e" again.
Assuming you're using the PCRE2 library directly and have access to all of its features, you have several solutions for this, from the simplest to the most involved:
Use numbered capture groups: (a)|(b)|(c)|(d)
Use named capture groups: (?<a>a)|(?<b>b)|(?<c>c)|(?<d>d)
Use marks: a(*MARK:a)|b(*MARK:b)|c(*MARK:c)|d(*MARK:d)
Use callouts: a(?C{a})|b(?C{b})|c(?C{c})|d(?C{d})
If you really can't modify your input pattern, use PCRE2_AUTO_CALLOUT and find some way to map pattern offsets to branches, then remember the last pattern offset seen before the end of the match.
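The first option (numbered capture groups) can be illustrated with Python's `re` module, used here as a stand-in rather than the PCRE2 C API. Each alternative is wrapped in its own group, and the index of the group that participated in the match identifies the branch. The function and variable names are made up for this sketch, and it assumes the individual patterns are literal strings with no capture groups of their own.

```python
import re

patterns = ["a", "b", "c", "d", "e"]

# Wrap every alternative in its own capture group: (a)|(b)|(c)|(d)|(e)
combined = re.compile("|".join(f"({re.escape(p)})" for p in patterns))

def which_branch(text: str):
    """Return the 0-based index of the alternative that matched, or None."""
    m = combined.search(text)
    if m is None:
        return None
    # Exactly one top-level group takes part in any match, so lastindex
    # is the 1-based number of the branch that matched.
    return m.lastindex - 1

print(which_branch("xxcxx"))
print(which_branch("zzz"))
```

With PCRE2 itself the equivalent step would be to inspect the output vector after the match and find which capture group is set, so no string comparison against the original patterns is needed.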

Is it possible to influence how '_' lambda arguments are ordered?

I have a statement which currently looks like this:
arrays.foldLeft(0)((offset, array) => array.copyTo(largerArray, offset))
It would be great to express it as follows, but this is not possible since foldLeft is designed to take the seed argument first:
arrays.foldLeft(0)(_.copyTo(largerArray, _))
This is purely superficial - I'm just curious!
p.s. copyTo returns the next offset in this example.
The SLS seems to say "no".
Section 6.23, Anonymous Functions/Placeholder Syntax for Anonymous Functions:
An expression (of syntactic category Expr) may contain embedded
underscore symbols _ at places where identifiers are legal. Such an
expression represents an anonymous function where subsequent
occurrences of underscores denote successive parameters.
and
If an expression e binds underscore sections u1 , . . . , un, in this order, it is equivalent to the anonymous function (u'1 , ... u'n ) => e' where each u'i results from ui by replacing the
underscore with a fresh identifier and e' results from e by
replacing each underscore section ui by u'i.
Emphasis is mine - it explicitly states in both relevant sections that a preserved ordering is assumed.
Personally, I think it makes sense to enforce that, if "only" for readability reasons.

Regular expression repetition: how to match expressions of variable lengths?

Essentially, here's what I want to do:
if ($expression =~ /^\d{num}\w{num}$/)
{
#doSomething
}
where num is not an identifier, but could stand for any integer greater than 0 (\d and \w were arbitrarily chosen). I want to match a string iff it contains two groups of related characters, one group immediately followed by the other, and the number of characters in each group is the same.
For this example, 123abc and 021202abcdef would match, but 43abc would not, and neither would 12ab3c or 1234acbcde.
Don’t think of the string as growing from left to right, but rather from the outside in:
xy
x(xy)y
xx(xy)yy
Your regex would then be something like:
/^(x(?1)?y)$/
Where (?1) is a reference to the outer pair of parentheses. ? makes it optional in order to give a “base case” of sorts to the recursive match. This is probably the simplest example of how regexes can be used to match context-free grammars—though it’s generally easier to get right with a parser generator or parser combinator library.
Well, there's
if ($expression =~ /^(\d+)([[:alpha:]]+)$/ && length($1)==length($2))
{
#doSomething
}
A regex isn't always the best option.
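For comparison, the same two-step check (match the overall shape with a regex, then compare the group lengths in ordinary code) might look like this in Python. It is a sketch rather than a direct translation of the Perl above: it uses [0-9] and [A-Za-z] as stand-ins for \d and [[:alpha:]], and the function name is made up for the example.

```python
import re

# Digits followed by letters; fullmatch anchors at both ends of the string.
SHAPE = re.compile(r"([0-9]+)([A-Za-z]+)")

def has_balanced_groups(expression: str) -> bool:
    """True if the string is N digits followed by exactly N letters."""
    m = SHAPE.fullmatch(expression)
    return m is not None and len(m.group(1)) == len(m.group(2))

for sample in ["123abc", "021202abcdef", "43abc", "12ab3c", "1234acbcde"]:
    print(sample, has_balanced_groups(sample))
```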