Is there a way for an expression in an aggregation context to refer to a previous expression in the aggregation?
import polars as pl
df = pl.DataFrame(dict(
    x=[0, 0, 1, 1],
    y=[1, 2, 3, 4],
))
df.groupby("x").agg([
    pl.col("x").sum().alias("sum_x"),
    (pl.col("sum_x") / pl.count()).alias("mean_x"),
])
# pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value:
# NotFound("Unable to get field named \"sum_x\". Valid fields: [\"x\", \"y\"]")
This does not work naively because, as the error clearly indicates, expressions in a context cannot refer to previous expressions. The workaround for the select context does not work for the groupby context because agg does not keep all the data around like with_column does.
Similar to the selection context, in the groupby context expressions are executed in parallel and thus cannot refer to each other in the same context.
You need to enforce sequential execution by adding a select:
df.groupby("x").agg([
    pl.col("x").sum().alias("sum_x"),
    pl.count()
]).select([
    "sum_x",
    (pl.col("sum_x") / pl.col("count")).alias("mean_x")
])
Related
I am trying to translate a pyspark solution into scala.
Here is the code:
conditions_ = [when(df1[c]!=df2[c], lit(c)).otherwise("") for c in df1.columns if c not in ['firstname','middlename','lastname']]
status = (when(df1["id"].isNull(), lit("added"))
    .when(df2["id"].isNull(), lit("deleted"))
    .when(size(array_remove(array(*conditions_), "")) > 0, lit("updated"))
    .otherwise("unchanged"))
For Scala, I am simply trying to use expr instead of * to substitute the conditions_ expression in my when clause, but it is not supported because of the for-comprehension syntax.
Can you please point me to the right syntax for adding a loop in the when clause, calculating the count of column differences dynamically?
If you want to unpack an array in Scala, you can use the following syntax:
when(size(array_remove(array(conditions_:_*), "")) > 0, lit("updated"))
Examples of the "_*" operator
I have a Python class with two data attributes: the first is a Polars time series, the second a list of strings.
A dictionary provides a mapping from strings to functions: each string is associated with a function that returns a one-column Polars frame.
A class method then creates a Polars data frame whose first column is the time series and whose other columns are built with these functions.
The columns are all independent.
Is there a way to create this data frame in parallel?
Here I try to define a minimal example:
class data_frame_constr():
    function_list: List[str]
    time_series: pl.DataFrame

    def compute_indicator_matrix(self) -> pl.DataFrame:
        for element in self.function_list:
            self.time_series = self.time_series.with_column(
                # mapping[element] is a custom function that returns a pl
                # column; this is where the columns are built in a loop
                mapping[element]
            )
        return self.time_series
For example, function_list = ["square", "square_root"].
The time frame is a one-column time series; I need to create square and square-root columns (or other custom, complex functions identified by name), but I know the list of functions only at runtime, as specified in the constructor.
You can use the with_columns context to provide a list of expressions, as long as the expressions are independent. (Note the plural: with_columns.) Polars will attempt to run all expressions in the list in parallel, even if the list of expressions is generated dynamically at run-time.
def mapping(func_str: str) -> pl.Expr:
    '''Generate Expression from function string'''
    ...

def compute_indicator_matrix(self) -> pl.DataFrame:
    expr_list = [mapping(next_funct_str)
                 for next_funct_str in self.function_list]
    self.time_series = self.time_series.with_columns(expr_list)
    return self.time_series
One note: it is a common misconception that Polars is a generic threadpool that will run any and all code in parallel. This is not true.
If any of your expressions call external libraries or custom Python bytecode functions (e.g., using a lambda function, map, apply, etc.), then your code will be subject to the Python GIL and will run single-threaded, no matter how you code it. Thus, try to use only Polars expressions to achieve your objectives (rather than calling external libraries or Python functions).
For example, try the following. (Choose a value of nbr_rows that will stress your computing platform.) If we run the code below, it will run in parallel because everything is expressed using Polars expressions, without calling external libraries or custom Python code. The result is embarrassingly parallel performance.
nbr_rows = 100_000_000
df = pl.DataFrame({
    'col1': pl.repeat(2, nbr_rows, eager=True),
})
df.with_columns([
    pl.col('col1').pow(1.1).alias('exp_1.1'),
    pl.col('col1').pow(1.2).alias('exp_1.2'),
    pl.col('col1').pow(1.3).alias('exp_1.3'),
    pl.col('col1').pow(1.4).alias('exp_1.4'),
    pl.col('col1').pow(1.5).alias('exp_1.5'),
])
However, if we instead write the code using lambda functions that call Python bytecode, then it will run very slowly.
import math
df.with_columns([
    pl.col('col1').apply(lambda x: math.pow(x, 1.1)).alias('exp_1.1'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.2)).alias('exp_1.2'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.3)).alias('exp_1.3'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.4)).alias('exp_1.4'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.5)).alias('exp_1.5'),
])
I have 2 arrays and I'd like to interleave their values.
For example
Interleave([[1,2],[2,2],[3,2]], [[3,1],[4,1],[5,1]]); // Should yield [[1,2],[3,1],[2,2],[4,1],[3,2],[5,1]]
However I can't get something like this to work.
function Interleave(Set1,Set2) =[for(x=[0:len(Set1)-1]) Set1[x], Set2[x]];
The comma is not doing what you expect: it separates two list entries. The first one is the for generator, and the second one is just Set2[x], which is undefined because x belongs to the for.
You can see what's going on when using just a string for the second part:
function Interleave(Set1,Set2) =[for(x=[0:len(Set1)-1]) Set1[x], "test"];
// ECHO: [[1, 2], [2, 2], [3, 2], "test"]
To get the interleaved result, you can generate a list with two entries for each iteration and unwrap it using each:
function Interleave(Set1,Set2) =[for(x=[0:1:len(Set1)-1]) each [Set1[x], Set2[x]]];
Also note the ..:1:.. step size; it ensures sane behavior in case Set1 is empty. The start:end form of a range behaves differently for backward-compatibility reasons.
I wanted a list of numbers:
auto nums = iota(0, 5000);
Now nums is of type Result. It cannot be cast to int[], and it cannot be used as a drop-in replacement for int[].
It's not very clear from the docs how to actually use an iota as a range. Am I using the wrong function? What's the way to make a "range" in D?
iota, like many functions in Phobos, is lazy. Result is a promise to give you what you need when you need it, but no value is actually computed yet. You can pass it to a foreach statement, for example, like so:
import std.range : iota;
import std.stdio : writeln;

foreach (i; iota(0, 5000)) {
    writeln(i);
}
You don't need it for a simple foreach though:
foreach (i; 0..5000) {
    writeln(i);
}
That aside, it is hopefully clear that iota is useful by itself. Being lazy also allows for costless chaining of transformations:
import std.algorithm : map;

/* values are computed only once, in writeln */
iota(5).map!(x => x*3).writeln;
// [0, 3, 6, 9, 12]
If you need a "real" list of values use array from std.array to delazify it:
int[] myArray = iota(0, 5000).array;
As a side note, be warned that the word range has a specific meaning in D that isn't "range of numbers" but describes a model of iterators, much like generators in Python. iota is a range (so an iterator) that produces a range (in the common meaning) of numbers.
When quoted using quote do:, records aren't converted to tuples containing the record fields:
iex(1)> quote do: is_bitstring("blah")
{:is_bitstring, [context: Elixir, import: Kernel], ["blah"]}
iex(2)> quote do: Computer.new("Test")
{{:., [], [{:__aliases__, [alias: false], [:Computer]}, :new]}, [], [[name: "Test"]]}
iex(4)> c = Computer.new("Test")
Computer[name: "Test", type: nil, processor: nil, hard_drives: []]
iex(5)> c
Computer[name: "Test", type: nil, processor: nil, hard_drives: []]
iex(6)> quote do: c
{:c, [], Elixir}
Also, when I try doing this in my code:
defmacro computer([do: code]) do
  # macro logic here
  # build computer record based on macro logic
  computer = Computer.new(params)
  quote do: unquote(computer)
end
I get an error:
** (CompileError) elixir/test/lib/computer_dsl_test.exs: tuples in quoted expressions must have 2 or 3 items, invalid quoted expression: Computer[name: "", type: nil, processor: nil, hard_drives: []]
I thought that records were just tuples with wrapper functions of some sort. The Elixir "Getting Started" guide states "A record is simply a tuple where the first element is the record module name." Is there something I am missing? Is there a function I can call on a record to get the tuple representation? I am aware of the raw: true option, but I am not sure how to use that on an existing record.
Any insights?
Records are tuples. The output you see on the console is just formatted for easier inspection. You can check that records are tuples if you inspect them with raw: true:
iex(1)> defrecord X, a: 1, b: 2
iex(2)> x = X.new
X[a: 1, b: 2] # This is formatted output. x is really a tuple
iex(3)> IO.inspect x, raw: true
{X, 1, 2}
As can be seen, a record instance is really a tuple. You can also pattern match on it (although I don't recommend this):
iex(4)> {a, b, c} = x
iex(8)> a
X
iex(9)> b
1
iex(10)> c
2
The quote you are mentioning serves a completely different purpose. It turns an Elixir expression into an AST representation that can be injected into the rest of the AST, most often from a macro. Quote is relevant only at compile time, and as such, it can't even know what is in your variable. So when you say:
quote do: Computer.new("Test")
the result you get is the AST representation of the call to the Computer.new function. But the function is not called at this point.
Just reading the error message and the Elixir "Getting Started" guide on macro definitions, it appears that the result of a quote has the form:
In general, each node (tuple) above follows the following format:
{ tuple | atom, list, list | atom }
The first element of the tuple is an atom or another tuple in the same representation;
The second element of the tuple is an list of metadata, it may hold information like the node line number;
The third element of the tuple is either a list of arguments for the function call or an atom. When an atom, it means the tuple represents a variable.
Besides the node defined above, there are also five Elixir literals that when quoted return themselves (and not a tuple). They are:
:sum #=> Atoms
1.0 #=> Numbers
[1,2] #=> Lists
"binaries" #=> Strings
{key, value} #=> Tuples with two elements
My guess is that unquote is the reverse function of quote, and so it expects as its argument one of the above forms. This is not the case for the computer record.
I think the unquote is not necessary there (although I didn't try to understand the intent of your code...), and that
defmacro computer([do: code]) do  # why do you need this argument?
  quote do: Computer.new
end
should be ok.