Fill columns independently - python-polars

I have a Python class with two data members: the first is a Polars time series (a DataFrame), the second a list of strings.
A dictionary provides a mapping from string to function: each string in the list is associated with a function that returns a one-column Polars frame.
Then there is a class method that creates a Polars DataFrame whose first column is the time series and whose remaining columns are built with these functions.
Columns are all independent.
Is there a way to create this data frame in parallel?
Here I try to define a minimal example:
import polars as pl
from typing import List

class data_frame_constr():
    function_list: List[str]
    time_series: pl.DataFrame

    def compute_indicator_matrix(self) -> pl.DataFrame:
        for element in self.function_list:
            self.time_series.with_column(
                [
                    # here is where we construct columns with the loop;
                    # mapping[element] is a custom function that returns a pl column
                    mapping[element]
                ]
            )
        return self.time_series
For example, function_list = ["square", "square_root"].
The time frame is a one-column time series; I need to create square and square-root columns (or columns from other custom, more complex functions, identified by name), but I only know the list of functions at runtime, as specified in the constructor.

You can use the with_columns context to provide a list of expressions, as long as the expressions are independent. (Note the plural: with_columns.) Polars will attempt to run all expressions in the list in parallel, even if the list of expressions is generated dynamically at run-time.
def mapping(func_str: str) -> pl.Expr:
    '''Generate Expression from function string'''
    ...

def compute_indicator_matrix(self) -> pl.DataFrame:
    expr_list = [mapping(next_funct_str)
                 for next_funct_str in self.function_list]
    self.time_series = self.time_series.with_columns(expr_list)
    return self.time_series
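For instance, a minimal sketch of such a mapping function for the example above might look like the following; the column name "value" and the two expressions are placeholder assumptions, not part of the original question:

def mapping(func_str: str) -> pl.Expr:
    '''Generate Expression from function string (hypothetical column name "value")'''
    exprs = {
        "square": pl.col("value").pow(2).alias("square"),
        "square_root": pl.col("value").sqrt().alias("square_root"),
    }
    return exprs[func_str]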
One note: it is a common misconception that Polars is a generic Threadpool that will run any/all code in parallel. This is not true.
If any of your expressions call external libraries or custom Python bytecode functions (e.g., using a lambda function, map, apply, etc.), then your code will be subject to the Python GIL and will run single-threaded, no matter how you code it. Thus, try to use only Polars expressions to achieve your objectives (rather than calling external libraries or Python functions).
For example, try the following. (Choose a value of nbr_rows that will stress your computing platform.) If we run the code below, it will run in parallel because everything is expressed using Polars expressions without calling external libraries or custom Python code. The result is embarrassingly parallel performance.
nbr_rows = 100_000_000
df = pl.DataFrame({
    'col1': pl.repeat(2, nbr_rows, eager=True),
})
df.with_columns([
    pl.col('col1').pow(1.1).alias('exp_1.1'),
    pl.col('col1').pow(1.2).alias('exp_1.2'),
    pl.col('col1').pow(1.3).alias('exp_1.3'),
    pl.col('col1').pow(1.4).alias('exp_1.4'),
    pl.col('col1').pow(1.5).alias('exp_1.5'),
])
However, if we instead write the code using lambda functions that call Python bytecode, then it will run very slowly.
import math
df.with_columns([
    pl.col('col1').apply(lambda x: math.pow(x, 1.1)).alias('exp_1.1'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.2)).alias('exp_1.2'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.3)).alias('exp_1.3'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.4)).alias('exp_1.4'),
    pl.col('col1').apply(lambda x: math.pow(x, 1.5)).alias('exp_1.5'),
])

Related

How to access a collection of heterogeneous functions randomly?

I am implementing an evolutionary algorithm with a numerical genetic encoding (0-n), where each number from 0 to n represents a function. I have implemented a numpy version where it is possible to do the following. The actual implementation is a bit more complicated, but this snippet captures the core functionality.
import numpy as np

n = 3
max_ops = 10

# Generate randomly generated args and OPs
# (number_of_iterations, min_val_arg, max_val_arg, arg_count and arg_shape
#  are defined elsewhere in the full implementation)
for i in range(number_of_iterations):
    args = np.random.randint(min_val_arg, max_val_arg,
                             size=(arg_count, arg_shape[0], arg_shape[1]))
    gene_of_operations = np.random.randint(0, n, size=(max_ops))

# A collection of OP encodings and operations. Doesn't need to be a dict.
dict_of_n_OPs = {
    0: np.add,
    1: np.multiply,
    2: np.diff,
}

# @njit
def execute_genome(gene_of_operations, args, dict_of_n_OPs):
    result = 0
    for op, arg in zip(gene_of_operations, args):
        result += dict_of_n_OPs[op](arg)
    return result

## executing the gene
results = execute_genome(gene_of_operations, args, dict_of_n_OPs)
print(results)
Now, adding the njit decorator expects a statically typed function, where heterogeneously typed collections such as my dict_of_n_OPs are not supported. I have tried rendering it as a numpy array, a numba.typed.Dict, and a numba.typed.List, but discovered that none of them supports heterogeneous types.
What would be a numba-compliant approach that allows executing different functions based on a numerical encoding such as '00201', where the number 0 would execute function 0?
Is the only way an n-line if/else statement for n unique operations/functions?

Transforming `PCollection` with many elements into a single element

I am trying to convert a PCollection, that has many elements, into a PCollection that has one element. Basically, I want to go from:
[1,2,3,4,5,6]
to:
[[1,2,3,4,5,6]]
so that I can work with the entire PCollection in a DoFn.
I've tried CombineGlobally(lambda x: x), but only a portion of elements get combined into an array at a time, giving me the following result:
[1,2,3,4,5,6] -> [[1,2],[3,4],[5,6]]
Or something to that effect.
This is my relevant portion of my script that I'm trying to run:
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline

raw_input = range(1024)

def run_test():
    with TestPipeline() as test_pl:
        input = test_pl | "Create" >> beam.Create(raw_input)

        def combine(x):
            print(x)
            return x

        (
            input
            | "Global aggregation" >> beam.CombineGlobally(combine)
        )
        # the pipeline runs automatically when the with block exits
run_test()
I figured out a pretty painless way to do this, which I missed in the docs:
The more general way to combine elements, and the most flexible, is
with a class that inherits from CombineFn.
CombineFn.create_accumulator(): This creates an empty accumulator. For
example, an empty accumulator for a sum would be 0, while an empty
accumulator for a product (multiplication) would be 1.
CombineFn.add_input(): Called once per element. Takes an accumulator
and an input element, combines them and returns the updated
accumulator.
CombineFn.merge_accumulators(): Multiple accumulators could be
processed in parallel, so this function helps merging them into a
single accumulator.
CombineFn.extract_output(): It allows to do additional calculations
before extracting a result.
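To make those four methods concrete, here is a minimal sketch for the sum example mentioned in that description (the class name SumFn is hypothetical, not from the original answer):

import apache_beam as beam

class SumFn(beam.CombineFn):
    def create_accumulator(self):
        return 0                         # empty accumulator for a sum

    def add_input(self, accumulator, element):
        return accumulator + element     # called once per element

    def merge_accumulators(self, accumulators):
        return sum(accumulators)         # combine partial sums computed in parallel

    def extract_output(self, accumulator):
        return accumulator               # no extra post-processing needed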
I suppose supplying a lambda function that simply passes its argument to the "vanilla" CombineGlobally wouldn't do what I expected initially. That functionality has to be specified by me (although I still think it's weird this isn't built into the API).
You can find more about subclassing CombineFn here, which I found very helpful:
A CombineFn specifies how multiple values in all or part of a
PCollection can be merged into a single value—essentially providing
the same kind of information as the arguments to the Python “reduce”
builtin (except for the input argument, which is an instance of
CombineFnProcessContext). The combining process proceeds as follows:
1. Input values are partitioned into one or more batches.
2. For each batch, the create_accumulator method is invoked to create a fresh initial "accumulator" value representing the combination of zero values.
3. For each input value in the batch, the add_input method is invoked to combine more values with the accumulator for that batch.
4. The merge_accumulators method is invoked to combine accumulators from separate batches into a single combined output accumulator value, once all of the accumulators have had all the input values in their batches added to them. This operation is invoked repeatedly, until there is only one accumulator value left.
5. The extract_output operation is invoked on the final accumulator to get the output value. Note: If this CombineFn is used with a transform that has defaults, apply will be called with an empty list at expansion time to get the default value.
So, by subclassing CombineFn, I wrote this simple implementation, Aggregated, that does exactly what I want:
import apache_beam as beam
from apache_beam.testing.test_pipeline import TestPipeline

raw_input = range(1024)

class Aggregated(beam.CombineFn):
    def create_accumulator(self):
        return []

    def add_input(self, accumulator, element):
        accumulator.append(element)
        return accumulator

    def merge_accumulators(self, accumulators):
        merged = []
        for a in accumulators:
            for item in a:
                merged.append(item)
        return merged

    def extract_output(self, accumulator):
        return accumulator

def run_test():
    with TestPipeline() as test_pl:
        input = test_pl | "Create" >> beam.Create(raw_input)
        (
            input
            | "Global aggregation" >> beam.CombineGlobally(Aggregated())
            | "print" >> beam.Map(print)
        )
        # the pipeline runs automatically when the with block exits
run_test()
You can also accomplish what you want with side inputs, e.g.
with beam.Pipeline() as p:
    pcoll = ...
    (p
     # Create a PCollection with a single element.
     | beam.Create([None])
     # This will process the singleton exactly once,
     # with the entirety of pcoll passed in as a second argument as a list.
     | beam.Map(
         lambda _, pcoll_as_side: ...consume pcoll_as_side here...,
         pcoll_as_side=beam.pvalue.AsList(pcoll)))
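For instance, a runnable version of this side-input pattern might look like the following; the element values and the print consumer are illustrative placeholders, not part of the original answer:

import apache_beam as beam

with beam.Pipeline() as p:
    pcoll = p | "Values" >> beam.Create([1, 2, 3, 4, 5, 6])
    (p
     | "Singleton" >> beam.Create([None])
     # The entire pcoll arrives as one Python list via the side input.
     | "Consume" >> beam.Map(
         lambda _, pcoll_as_side: print(pcoll_as_side),
         pcoll_as_side=beam.pvalue.AsList(pcoll)))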

Why won't the exp function work in pyspark?

I'm trying to calculate odds ratios from the coefficients of a logistic regression but I'm encountering a problem best summed up by this code:
import pyspark.sql.functions as F
F.exp(1.2)
This fails with
py4j.Py4JException: Method exp([class java.lang.Double]) does not exist
An integer fails similarly. I don't get how a Double can cause a problem for the exp function?
If you have a look at the documentation for pyspark.sql.functions.exp(), it takes an input of a col object. Hence it will not work for a float value such as 1.2.
Create a dataframe or a Column object which you can use in F.exp()
Example would be:
df = df.withColumn("exp_x", F.exp(F.col("some_col_named_x")))
As @pissall mentioned, pyspark.sql.functions.exp takes a Column object as its parameter, but you can use pyspark.sql.functions.lit (introduced in version 1.3.0) to create a Column object from a literal value.
from pyspark.sql.functions import exp, lit
df = df.withColumn("exp_1", exp(lit(1)))

Can operations on a numpy.memmap be deferred?

Consider this example:
import numpy as np
a = np.array(1)
np.save("a.npy", a)
a = np.load("a.npy", mmap_mode='r')
print(type(a))
b = a + 2
print(type(b))
which outputs
<class 'numpy.core.memmap.memmap'>
<class 'numpy.int32'>
So it seems that b is not a memmap any more, and I assume that this forces numpy to read the whole a.npy, defeating the purpose of the memmap. Hence my question, can operations on memmaps be deferred until access time?
I believe subclassing ndarray or memmap could work, but don't feel confident enough about my Python skills to try it.
Here is an extended example showing my problem:
import numpy as np

# create 8 GB file
# np.save("memmap.npy", np.empty([1000000000]))

# I want to print the first value using f and memmaps
def f(value):
    print(value[1])

# this is fast: f receives a memmap
a = np.load("memmap.npy", mmap_mode='r')
print("a = ")
f(a)

# this is slow: b has to be read completely; converted into an array
b = np.load("memmap.npy", mmap_mode='r')
print("b + 1 = ")
f(b + 1)
Here's a simple example of an ndarray subclass that defers operations on it until a specific element is requested by indexing.
I'm including this to show that it can be done, but it almost certainly will fail in novel and unexpected ways, and require substantial work to make it usable.
For a very specific case it may be easier than redesigning your code to solve the problem in a better way.
I'd recommend reading over these examples from the docs to help understand how it works.
import numpy as np

class Defered(np.ndarray):
    """
    An array class that defers calculations applied to it, only
    calculating them when an index is requested
    """
    def __new__(cls, arr):
        arr = np.asanyarray(arr).view(cls)
        arr.toApply = []
        return arr

    def __array_ufunc__(self, ufunc, method, *inputs, **kwargs):
        ## Convert all arguments to ndarray, otherwise arguments
        #  of type Defered will cause infinite recursion;
        #  also store self as None, to be replaced later on
        newinputs = []
        for i in inputs:
            if i is self:
                newinputs.append(None)
            elif isinstance(i, np.ndarray):
                newinputs.append(i.view(np.ndarray))
            else:
                newinputs.append(i)

        ## Store function to apply and necessary arguments
        self.toApply.append((ufunc, method, newinputs, kwargs))
        return self

    def __getitem__(self, idx):
        ## Get index and convert to regular array
        sub = self.view(np.ndarray).__getitem__(idx)

        ## Apply stored actions
        for ufunc, method, inputs, kwargs in self.toApply:
            inputs = [i if i is not None else sub for i in inputs]
            sub = super().__array_ufunc__(ufunc, method, *inputs, **kwargs)

        return sub
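A quick, hypothetical usage sketch (it assumes the large memmap.npy file from the question exists; only the indexed element ends up being read and transformed):

a = np.load("memmap.npy", mmap_mode='r')
d = Defered(a)     # wrap the memmap; nothing is read yet
d = d + 1          # the ufunc call is only recorded, not executed
print(d[1])        # the stored addition is applied to this single element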
This will fail if modifications are made to it that don't use numpy's universal functions. For instance, percentile and median aren't based on ufuncs and would end up loading the entire array. Likewise, if you pass it to a function that iterates over the array, or that applies an index to a substantial portion of it, the entire array will be loaded.
This is just how Python works. By default numpy operations return a new array, so b never exists as a memmap - it is created when + is called on a.
There are a couple of ways to work around this. The simplest is to do all operations in place,
a += 1
This requires opening the memory-mapped array for both reading and writing,
a = np.load("a.npy", mmap_mode='r+')
Of course this isn't any good if you don't want to overwrite your original array.
In this case you need to specify that b should be memmapped.
b = np.memmap("b.npy", mmap+mode='w+', dtype=a.dtype, shape=a.shape)
Assigning can be done by using the out keyword provided by numpy ufuncs.
np.add(a, 2, out=b)
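Putting those pieces together, a minimal end-to-end sketch (the file names and example data are placeholders; note that np.memmap writes a raw buffer, not a .npy file):

import numpy as np

# hypothetical input file; any large saved array works the same way
np.save("input.npy", np.arange(1_000_000, dtype=np.float64))

a = np.load("input.npy", mmap_mode='r')      # read-only memmap of the source
b = np.memmap("output.dat", mode='w+',       # raw on-disk buffer for the result
              dtype=a.dtype, shape=a.shape)
np.add(a, 2, out=b)                          # result is written to the mapped file
b.flush()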

Scala closures on wikipedia

Found the following snippet on the Closure page on Wikipedia:
//# Return a list of all books with at least 'threshold' copies sold.
def bestSellingBooks(threshold: Int) = bookList.filter(book => book.sales >= threshold)
//# or
def bestSellingBooks(threshold: Int) = bookList.filter(_.sales >= threshold)
Correct me if I'm wrong, but this isn't a closure? It is a function literal, an anonymous function, a lambda function, but not a closure?
Well... if you want to be technical, this is a function literal which is translated at runtime into a closure, closing the open terms (binding them to a val/var in the scope of the function literal). Also, in the context of this function literal (_.sales >= threshold), threshold is a free variable, as the function literal itself doesn't give it any meaning. By itself, _.sales >= threshold is an open term. At runtime, it is bound to the local variable of the function each time the function is called.
Take this function for example, generating closures:
def makeIncrementer(inc: Int): (Int => Int) = (x: Int) => x + inc
At runtime, the following code produces 3 closures. It's also interesting to note that b and c are not the same closure (b == c gives false).
val a = makeIncrementer(10)
val b = makeIncrementer(20)
val c = makeIncrementer(20)
I still think the example given on Wikipedia is a good one, albeit not quite covering the whole story. It's quite hard to give an example of actual closures by the strictest definition without an actual memory dump of a running program. It's the same with the class-object relation. You usually give an example of an object by defining a class Foo { ... and then instantiating it with val f = new Foo, saying that f is the object.
-- Flaviu Cipcigan
Notes:
Reference: Programming in Scala, Martin Odersky, Lex Spoon, Bill Venners
Code compiled with Scala version 2.7.5.final running on Java 1.6.0_14.
I'm not entirely sure, but I think you're right. Doesn't a closure require state (I guess free variables...)?
Or maybe the bookList is the free variable?
As far as I understand, this is a closure: it contains a formal parameter, threshold, and a context variable, bookList, from the enclosing scope. So the return value (List[Any]) of the function may change while applying the filter predicate function; it varies based on the elements of the bookList variable from the context.