Is there a standard treesitter construct for parsing an arbitrary-length list?

One very common parsing scenario in programming languages is an arbitrary-length nonempty list of elements with a separator, for example:
[1, 2, 3, 4, 5]
f(a, b, c)
I've been parsing this in treesitter as follows:
list: $ => seq(
  repeat(seq($.element, ',')),
  $.element
)
This works, but it's common enough that I wonder whether treesitter has a built-in construct for it. Does it?

In several grammars (e.g. Rust, Go), we define helper functions for this:
function commaSep1(rule) {
  return seq(rule, repeat(seq(',', rule)))
}

function commaSep(rule) {
  return optional(commaSep1(rule))
}
We could include these functions as part of the Tree-sitter DSL, but since it's so easy to
define your own helper functions like this, I think it's best to keep the DSL small.
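With the helper in scope, the rule from the question collapses to a one-line call. A minimal sketch (the grammar wrapper and the element rule are illustrative, not from the question):

module.exports = grammar({
  name: 'example',
  rules: {
    list: $ => commaSep1($.element),  // equivalent to the rule in the question
    element: $ => /[0-9]+/,           // hypothetical element definition
  },
});

function commaSep1(rule) {
  return seq(rule, repeat(seq(',', rule)));
}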


Fill columns independently

I have a Python class with two attributes: the first is a Polars time series, the second a list of strings. A dictionary provides a mapping from strings to functions: each string is associated with a function that returns a one-column Polars frame. A method then creates a Polars data frame whose first column is the time series and whose other columns are created by those functions. The columns are all independent. Is there a way to create this data frame in parallel?
Here I try to define a minimal example:
class data_frame_constr():
    function_list: List[str]
    time_series: pl.DataFrame

    def compute_indicator_matrix(self) -> pl.DataFrame:
        for element in self.function_list:
            self.time_series.with_column(
                [
                    # construct columns in the loop; mapping[element] is a
                    # custom function that returns a pl column
                    mapping[element]
                ]
            )
        return self.time_series
For example, function_list = ["square", "square_root"]. The time frame is a one-column time series; I need to create square and square-root columns (or other custom, more complex functions identified by name), but I only know the list of functions at runtime, as specified in the constructor.
You can use the with_columns context to provide a list of expressions, as long as the expressions are independent. (Note the plural: with_columns.) Polars will attempt to run all expressions in the list in parallel, even if the list of expressions is generated dynamically at run-time.
def mapping(func_str: str) -> pl.Expr:
    '''Generate Expression from function string'''
    ...

def compute_indicator_matrix(self) -> pl.DataFrame:
    expr_list = [mapping(next_funct_str)
                 for next_funct_str in self.function_list]
    self.time_series = self.time_series.with_columns(expr_list)
    return self.time_series
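The body of mapping is elided above; as a purely hypothetical sketch (the column name "value" and the two expressions are assumptions about your data, not from the question), it might look like:

def mapping(func_str: str) -> pl.Expr:
    '''Generate Expression from function string (hypothetical sketch)'''
    # "value" is an assumed name for the time-series value column
    exprs = {
        "square": pl.col("value").pow(2).alias("square"),
        "square_root": pl.col("value").sqrt().alias("square_root"),
    }
    return exprs[func_str]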
One note: it is a common misconception that Polars is a generic thread pool that will run any and all code in parallel. This is not true.
If any of your expressions call external libraries or custom Python bytecode functions (e.g., using a lambda function, map, apply, etc.), then your code will be subject to the Python GIL and will run single-threaded, no matter how you code it. Thus, try to use only Polars expressions to achieve your objectives, rather than calling external libraries or Python functions.
For example, try the following. (Choose a value of nbr_rows that will stress your computing platform.) If we run the code below, it will run in parallel because everything is expressed using Polars expressions, without calling external libraries or custom Python code. The result is embarrassingly parallel performance.
nbr_rows = 100_000_000
df = pl.DataFrame({
    'col1': pl.repeat(2, nbr_rows, eager=True),
})
df.with_columns([
    pl.col('col1').pow(1.1).alias('exp_1.1'),
    pl.col('col1').pow(1.2).alias('exp_1.2'),
    pl.col('col1').pow(1.3).alias('exp_1.3'),
    pl.col('col1').pow(1.4).alias('exp_1.4'),
    pl.col('col1').pow(1.5).alias('exp_1.5'),
])
However, if we instead write the code using lambda functions that call Python bytecode, then it will run very slowly.
import math
df.with_columns([
    pl.col('col1').apply(lambda x: math.pow(x, 1.1)).alias('exp_1.1'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.2)).alias('exp_1.2'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.3)).alias('exp_1.3'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.4)).alias('exp_1.4'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.5)).alias('exp_1.5'),
])

Why are macros based on abstract syntax trees better than macros based on string preprocessing?

I am beginning my journey of learning Rust. I came across this line in Rust by Example:
However, unlike macros in C and other languages, Rust macros are expanded into abstract syntax trees, rather than string preprocessing, so you don't get unexpected precedence bugs.
Why is an abstract syntax tree better than string preprocessing?
If you have this in C:
#define X(A,B) A+B
int r = X(1,2) * 3;
The value of r will be 7, because the preprocessor expands it to 1+2 * 3, which is 1+(2*3).
In Rust, you would have:
macro_rules! X { ($a:expr,$b:expr) => { $a+$b } }
let r = X!(1,2) * 3;
This will evaluate to 9, because the compiler will interpret the expansion as (1+2)*3. This is because the compiler knows that the result of the macro is supposed to be a complete, self-contained expression.
That said, the C macro could also be defined like so:
#define X(A,B) ((A)+(B))
This would avoid any non-obvious evaluation problems, including the arguments themselves being reinterpreted due to context. However, when you're using a macro, you can never be sure whether or not the macro has correctly accounted for every possible way it could be used, so it's hard to tell what any given macro expansion will do.
By using AST nodes instead of text, Rust ensures this ambiguity can't happen.
A classic example using the C preprocessor is
#define MUL(a, b) a * b
// ...
int res = MUL(x + y, 5);
The use of the macro will expand to
int res = x + y * 5;
which is very far from the expected
int res = (x + y) * 5;
This happens because the C preprocessor really just does simple text-based substitution; it's not an integral part of the language itself. Preprocessing and parsing are two separate steps.
If the preprocessor instead parsed the macro like the rest of the compiler, which happens for languages where macros are part of the actual language syntax, this is no longer a problem as things like precedence (as mentioned) and associativity are taken into account.
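For comparison, here is a sketch of the same MUL macro written with Rust's macro_rules! (the names are carried over from the C example above):

macro_rules! mul {
    // each metavariable captures a complete expression AST node
    ($a:expr, $b:expr) => { $a * $b };
}

fn main() {
    let (x, y) = (1, 2);
    let res = mul!(x + y, 5);
    // the expansion is parsed as (x + y) * 5, not x + y * 5
    assert_eq!(res, 15);
}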

In Kotlin, I can override some existing operators but what about creating new operators?

In Kotlin, I see I can override some operators, such as + by function plus(), and * by function times() ... but for some things like Sets, the preferred (set theory) symbols/operators don't exist. For example A∩B for intersection and A∪B for union.
I can't seem to define my own operators; there is no clear syntax to say what symbol to use for an operator. For example, if I want to make a function for $$ as an operator:
operator fun String.$$(other: String) = "$this !!whatever!! $other"
// or even
operator fun String.whatever(other: String) = "$this !!whatever!! $other" // how do I say this is the $$ symbol?!?
I get the same error for both:
Error:(y, x) Kotlin: 'operator' modifier is inapplicable on this function: illegal function name
What are the rules for what operators can be created or overridden?
Note: this question is intentionally written and answered by the author (Self-Answered Questions), so that the idiomatic answers to commonly asked Kotlin topics are present in SO.
Kotlin only allows a very specific set of operators to be overridden and you cannot change the list of available operators.
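For reference, a minimal sketch of overriding one of the permitted operators (the Vec class here is made up for illustration; plus is the fixed name Kotlin maps to +):

data class Vec(val x: Int, val y: Int) {
    // 'plus' is one of the fixed operator names, so this enables Vec + Vec
    operator fun plus(other: Vec) = Vec(x + other.x, y + other.y)
}

val v = Vec(1, 2) + Vec(3, 4) // Vec(x=4, y=6)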
You should take care when overriding operators to stay in the spirit of the original operator, or of other common uses of the mathematical symbol. But sometimes the typical symbol isn't available. For example, set union ∪ can easily be treated as + because that conceptually makes sense, and Kotlin already provides the built-in operator Set<T>.plus(); or you could get creative and use an infix function for this case:
// already provided by Kotlin:
// operator fun <T> Set<T>.plus(elements: Iterable<T>): Set<T>
// and now add my new one, lower case 'u' is pretty similar to math symbol ∪
infix fun <T> Set<T>.u(elements: Set<T>): Set<T> = this.plus(elements)
// and therefore use any of...
val union1 = setOf(1,2,5) u setOf(3,6)
val union2 = setOf(1,2,5) + setOf(3,6)
val union3 = setOf(1,2,5).plus(setOf(3,6)) // plus() is an operator fun, not infix, so it needs a regular call
Or maybe it is more clear as:
infix fun <T> Set<T>.union(elements: Set<T>): Set<T> = this.plus(elements)
// and therefore
val union4 = setOf(1,2,5) union setOf(3,6)
And continuing with your list of Set operators, intersection uses the symbol ∩, so assuming every programmer has a font where the letter 'n' looks like ∩, we could get away with:
infix fun <T> Set<T>.n(elements: Set<T>): Set<T> = this.intersect(elements)
// and therefore...
val intersect = setOf(1,3,5) n setOf(3,5)
or via operator overloading of * as:
operator fun <T> Set<T>.times(elements: Set<T>): Set<T> = this.intersect(elements)
// and therefore...
val intersect = setOf(1,3,5) * setOf(3,5)
Although you can already use the existing standard library infix function intersect() as:
val intersect = setOf(1,3,5) intersect setOf(3,5)
In cases where you are inventing something new, you need to pick the closest operator or function name. For example, to negate a Set of enums, maybe use the - operator (unaryMinus()) or the ! operator (not()):
enum class Things {
    ONE, TWO, THREE, FOUR, FIVE
}
operator fun Set<Things>.unaryMinus() = Things.values().toSet().minus(this)
operator fun Set<Things>.not() = Things.values().toSet().minus(this)
// and therefore use any of...
val current = setOf(Things.THREE, Things.FIVE)
println(-current) // [ONE, TWO, FOUR]
println(-(-current)) // [THREE, FIVE]
println(!current) // [ONE, TWO, FOUR]
println(!!current) // [THREE, FIVE]
println(current.not()) // [ONE, TWO, FOUR]
println(current.not().not()) // [THREE, FIVE]
Be thoughtful: operator overloading can be very helpful, or it can lead to confusion and chaos. You have to decide what is best while maintaining code readability. Sometimes the operator is best if it fits the norm for that symbol; otherwise, use an infix replacement that is similar to the original symbol, or a descriptive word, so that there is no chance of confusion.
Always check the Kotlin Stdlib API Reference because many operators you want might already be defined, or have equivalent extension functions.
One other thing...
And about your $$ operator, technically you can do that as:
infix fun String.`$$`(other: String) = "$this !!whatever!! $other"
But because you need to escape the name of the function, it will be ugly to call:
val text = "you should do" `$$` "you want"
That isn't truly operator overloading, and it only works because the function can be made infix.

How do I turn the Result type into something useful?

I wanted a list of numbers:
auto nums = iota(0, 5000);
Now nums is of type Result. It cannot be cast to int[], and it cannot be used as a drop-in replacement for int[].
It's not very clear from the docs how to actually use an iota as a range. Am I using the wrong function? What's the way to make a "range" in D?
iota, like many functions in Phobos, is lazy. Result is a promise to give you what you need when you need it, but no value is actually computed yet. You can pass it to a foreach statement, for example, like so:
import std.range: iota;
import std.stdio: writeln;

foreach (i; iota(0, 5000)) {
    writeln(i);
}
You don't need it for a simple foreach though:
foreach (i; 0..5000) {
    writeln(i);
}
That aside, it is hopefully clear that iota is useful by itself. Being lazy also allows for costless chaining of transformations:
import std.algorithm: map;

/* values are computed only once, in writeln */
iota(5).map!(x => x*3).writeln;
// [0, 3, 6, 9, 12]
If you need a "real" list of values, use array from std.array to delazify it:
import std.array: array;

int[] myArray = iota(0, 5000).array;
As a side note, be warned that the word "range" has a specific meaning in D that isn't "range of numbers": it describes a model of iterators, much like generators in Python. iota is a range (i.e. an iterator) that produces a range (in the common sense) of numbers.
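To make that concrete, here is a minimal sketch of a hand-rolled input range (the Counter struct is made up for illustration); anything exposing empty, front, and popFront works with foreach just like iota does:

import std.stdio: writeln;

// a hypothetical counter implementing the input-range interface
struct Counter {
    int current, stop;
    @property bool empty() const { return current >= stop; }
    @property int front() const { return current; }
    void popFront() { current += 1; }
}

void main() {
    foreach (i; Counter(0, 5)) {
        writeln(i); // 0 1 2 3 4, just like iota(0, 5)
    }
}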

returning multiple dissimilar data structures from R function in PL/R

I have been looking at various discussions here on SO and other places, and the general consensus seems to be that if one is returning multiple dissimilar data structures from an R function, they are best returned as a list(a, b) and then accessed by index. Except, when using an R function via PL/R inside a Perl program, the R list function flattens the list and also stringifies even the numbers. For example:
my $res = $sth->fetchrow_arrayref;
# now, $res is a single, flattened, stringified list
# even though the R function was supposed to return
# list([1, "foo", 3], [2, "bar"])
#
# instead, $res looks like c(\"1\", \""foo"\", \"3\", \"2\", \""bar"\")
# or some such nonsense
Using a data.frame doesn't work because the two arrays being returned are not symmetrical, and the function croaks.
So, how do I return a single data structure from an R function that is made up of an arbitrary set of nested data structures, and still be able to access each individual bundle from Perl as simply $res->[0], $res->[1] or $res->{'employees'}, $res->{'pets'}? Update: I am looking for an R equivalent of Perl's [[1, "foo", 3], [2, "bar"]] or even [[1, "foo", 3], {a => 2, b => "bar"}].
Addendum: The main thrust of my question is how to return multiple dissimilar data structures from a PL/R function. The stringification noted above is secondary but also problematic, because I convert the data to JSON, and all those extra quotes just add useless data transferred between the server and the user.
I think you have a few problems here. The first is that you can't just return an array in this case, because it won't pass PostgreSQL's array checks (arrays must be symmetric, with all elements of the same type, etc.). Remember that if you are calling PL/R from PL/Perl across a query interface, PostgreSQL type constraints are going to be an issue.
You have a couple of options.
You could return setof text[], with one data structure per row.
Or you could return some sort of structured data using structures PostgreSQL understands, like:
CREATE TYPE ab AS (
    a text,
    b text
);

CREATE TYPE r_retval AS (
    labels text[],
    my_ab ab
);
This would allow you to return something like:
{labels => [1, "foo", 3], ab => {a => 'foo', b => 'bar'} }
But at any rate, you have to put it into a data structure that the PostgreSQL planner can understand; that is what I think is missing in your example.
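As a sketch of the query side (my_r_func is a hypothetical function name, and the PL/R body is omitted since it depends on your data), the composite fields can then be addressed individually, which is what gives you $res->{'labels'} style access from Perl/DBI:

-- hypothetical: a PL/R function declared to return the composite type above
-- CREATE OR REPLACE FUNCTION my_r_func() RETURNS r_retval AS $$ ... $$ LANGUAGE plr;

SELECT f.labels, (f.my_ab).a, (f.my_ab).b
FROM my_r_func() AS f;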