polars LazyFrame.with_context().filter() throws unexpected NotFoundError for column - python-polars

I have two LazyFrames, df1 and df2.
After filtering df2 by the max value of df1, I want to concatenate them.
But the combination of with_context() and filter() on LazyFrames raises a NotFoundError.
What's the best way to do this?
import polars as pl

df1 = pl.DataFrame({'foo': [0, 1], 'bar': ['a', 'a']}).lazy()
df2 = pl.DataFrame({'foo': [1, 2, 3], 'bar': ['b', 'b', 'b']}).lazy()

df = pl.concat(
    [
        df1,
        df2.with_context(df1.select(pl.col('foo').alias('foo_')))
           .filter(pl.col('foo') > pl.col('foo_').max())
    ]
).collect()
# ---------------------------------------------------------------------------
# NotFoundError                             Traceback (most recent call last)
# <ipython-input-2-cf44deab2d4b> in <module>
#       4 df2 = pl.DataFrame({'foo': [1, 2, 3], 'bar': ['b', 'b', 'b']}).lazy()
#       5
# ----> 6 df = pl.concat(
#       7     [
#       8         df1,
#
# 1 frames
# /usr/local/lib/python3.8/dist-packages/polars/utils.py in wrapper(*args, **kwargs)
#     327     def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
#     328         _rename_kwargs(fn.__name__, kwargs, aliases)
# --> 329         return fn(*args, **kwargs)
#     330
#     331     return wrapper
#
# /usr/local/lib/python3.8/dist-packages/polars/internals/lazyframe/frame.py in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
#    1166             streaming,
#    1167         )
# -> 1168         return pli.wrap_df(ldf.collect())
#    1169
#    1170     def sink_parquet(
#
# NotFoundError: foo_
When I assign the comparison result to a column, the error is not raised:
(df2.with_context(df1.select(pl.col('foo').alias('foo_')))
    .with_column((pl.col('foo') > pl.col('foo_').max()).alias('x'))
    .filter(pl.col('x'))).collect()
# OK
But if I drop that column after filter(), the error appears again.
(df2.with_context(df1.select(pl.col('foo').alias('foo_')))
    .with_column((pl.col('foo') > pl.col('foo_').max()).alias('x'))
    .filter(pl.col('x'))
    .drop('x')).collect()
# NotFoundError: foo_
Finally I found that this works. But what is the difference from the previous attempts?
(It seems verbose. Is there a better solution?)
(df2.with_context(df1.select(pl.col('foo').alias('foo_')))
    .with_column(pl.col('foo_').max())
    .filter(pl.col('foo') > pl.col('foo_'))
    .drop('foo_')).collect()
# OK
Possibly related:
https://stackoverflow.com/a/71108312/7402018
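One alternative that avoids with_context() entirely (only a sketch, and it assumes a polars version whose LazyFrame.join supports how='cross'): materialize df1's max as a one-column frame, cross-join it onto df2, filter on it, and drop the helper column.

import polars as pl

df1 = pl.DataFrame({'foo': [0, 1], 'bar': ['a', 'a']}).lazy()
df2 = pl.DataFrame({'foo': [1, 2, 3], 'bar': ['b', 'b', 'b']}).lazy()

# Cross-join the single-row max of df1 onto df2, filter on it, then drop it.
df = pl.concat(
    [
        df1,
        df2.join(df1.select(pl.col('foo').max().alias('foo_max')), how='cross')
           .filter(pl.col('foo') > pl.col('foo_max'))
           .drop('foo_max'),
    ]
).collect()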

Related

How to create an array with a normal distribution using scipy.stats in PySpark with a UDF (or any other way)?

I am currently working on migrating Python scripts to PySpark. I have this Python script that works fine:
### PYTHON
import pandas as pd
import scipy.stats as st

def fnNormalDistribution(mean, std, n):
    box = list(eval('st.norm')(*[mean, std]).rvs(n))
    return box

df = pd.DataFrame([[18.2500365, 2.7105814157004193],
                   [9.833353, 2.121324586200329],
                   [41.55563866666666, 7.118716782527054]],
                  columns=['mean', 'std'])
df
| mean | std |
|------------|----------|
| 18.250037| 2.710581|
| 9.833353| 2.121325|
| 41.555639| 7.118717|
n = 100 #Example
df['random_values'] = df.apply(lambda row: fnNormalDistribution(row["mean"], row["std"], n), axis=1)
df
| mean | std | random_values |
|------------|----------|--------------------------------------------------|
| 18.250037| 2.710581|[17.752189993958638, 18.883038367927465, 16.39...]|
| 9.833353| 2.121325|[10.31806454283759, 8.732261487201594, 11.6782...]|
| 41.555639| 7.118717|[38.17469739795093, 43.16514466083524, 49.2668...]|
But when I try to migrate it to PySpark, I get the following error:
### PYSPARK
def fnNormalDistribution(mean, std, n):
    box = list(eval('st.norm')(*[mean, std]).rvs(n))
    return box

udf_fnNomalDistribution = f.udf(fnNormalDistribution, t.ArrayType(t.DoubleType()))

columns = ['mean', 'std']
data = [(18.2500365, 2.7105814157004193),
        (9.833353, 2.121324586200329),
        (41.55563866666666, 7.118716782527054)]

df = spark.createDataFrame(data=data, schema=columns)
df.show()
| mean | std |
|------------|----------|
| 18.250037| 2.710581|
| 9.833353| 2.121325|
| 41.555639| 7.118717|
df = df.withColumn('random_values', udf_fnNomalDistribution('mean','std',f.lit(n)))
df.show()
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
  File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 604, in main
  File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 596, in process
  File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\serializers.py", line 211, in dump_stream
    self.serializer.dump_stream(self._batched(iterator), stream)
  File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\serializers.py", line 132, in dump_stream
    for obj in iterator:
  File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\serializers.py", line 200, in _batched
    for item in iterator:
  File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 450, in mapper
  File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 450, in <genexpr>
  File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 85, in <lambda>
  File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\util.py", line 73, in wrapper
    return f(*args, **kwargs)
  File "C:\Users\Ubits\AppData\Local\Temp/ipykernel_10604/2493247477.py", line 2, in fnNormalDistribution
  File "<string>", line 1, in <module>
NameError: name 'st' is not defined
Is there some way to use the same function in PySpark, or to get the random_values column in another way? I googled it without success.
Thanks
I was trying this, and it can indeed be fixed by moving the scipy.stats import (st) inside fnNormalDistribution, as samkart suggested.
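For reference, a minimal sketch of that fix with a plain UDF (my own illustration, assuming an existing SparkSession named spark and the df with mean/std columns from the question):

from pyspark.sql import functions as f, types as t

def fnNormalDistribution(mean, std, n):
    import scipy.stats as st  # imported inside the function so it exists on the workers
    return st.norm(mean, std).rvs(n).tolist()

udf_fnNormalDistribution = f.udf(fnNormalDistribution, t.ArrayType(t.DoubleType()))

n = 100
df = df.withColumn('random_values', udf_fnNormalDistribution('mean', 'std', f.lit(n)))
df.show()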
I will just leave my example here as Fugue may provide a more readable way to bring this to Spark, especially around handling schema. Full code below.
import pandas as pd

def fnNormalDistribution(mean, std, n):
    import scipy.stats as st
    box = (eval('st.norm')(*[mean, std]).rvs(n)).tolist()
    return box

df = pd.DataFrame([[18.2500365, 2.7105814157004193],
                   [9.833353, 2.121324586200329],
                   [41.55563866666666, 7.118716782527054]],
                  columns=['mean', 'std'])

n = 100  # Example

def helper(df: pd.DataFrame) -> pd.DataFrame:
    df['random_values'] = df.apply(lambda row: fnNormalDistribution(row["mean"], row["std"], n), axis=1)
    return df

from fugue import transform
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# transform can take either a pandas or a Spark DataFrame as input.
# If engine is None, it runs on pandas.
sdf = transform(df,
                helper,
                schema="*, random_values:[float]",
                engine=spark)
sdf.show()

ruamel.yaml ComposerError when using alias/ as name

I am trying to parse the following document
hello:
  there: &there_value 1
foo:
  *there_value: 3
This gets correctly parsed with the safe loader:
>>> from ruamel.yaml import YAML
>>> document = """
... hello:
...   there: &there_value 1
... foo:
...   *there_value: 3
... """
>>> yaml = YAML(typ="safe")
>>> yaml.load(document)
{'hello': {'there': 1}, 'foo': {1: 3}}
The round-trip (standard) loader throws an error:
>>> yaml = YAML()
>>> yaml.load(document)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "...\site-packages\ruamel\yaml\main.py", line 434, in load
    return constructor.get_single_data()
  File "...\site-packages\ruamel\yaml\constructor.py", line 119, in get_single_data
    node = self.composer.get_single_node()
  File "...\site-packages\ruamel\yaml\composer.py", line 76, in get_single_node
    document = self.compose_document()
  File "...\site-packages\ruamel\yaml\composer.py", line 99, in compose_document
    node = self.compose_node(None, None)
  File "...\site-packages\ruamel\yaml\composer.py", line 143, in compose_node
    node = self.compose_mapping_node(anchor)
  File "...\site-packages\ruamel\yaml\composer.py", line 223, in compose_mapping_node
    item_value = self.compose_node(node, item_key)
  File "...\site-packages\ruamel\yaml\composer.py", line 117, in compose_node
    raise ComposerError(
ruamel.yaml.composer.ComposerError: found undefined alias 'there_value:'
  in "<unicode string>", line 6, column 3:
    *there_value: 3
    ^ (line: 6)
I am using Python 3.8.10, ruamel.yaml version 0.17.21.
As Anthon suggested in their comment, : is a valid character in an anchor/alias name,
so *there_value: looks for an alias named there_value:, which is not defined; only there_value is.
The solution is to add a space after the alias: *there_value : 3.
hello:
  there: &there_value 1
foo:
  *there_value : 3
This loads correctly with both the round-trip and the safe loader.
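A quick check of that (a minimal sketch of my own, assuming ruamel.yaml 0.17.x as in the question):

from ruamel.yaml import YAML

fixed = """
hello:
  there: &there_value 1
foo:
  *there_value : 3
"""

for typ in ("rt", "safe"):
    data = YAML(typ=typ).load(fixed)
    assert data["hello"]["there"] == 1
    assert data["foo"][1] == 3  # the alias resolved to the integer key 1
    print(typ, "loads fine")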

corruption loop for a data frame in PySpark

I'm writing a loop to corrupt my dataset in PySpark, and I want to control the errors.
First, I made a list of error types: 9 errors and 1 with no error.
# code_erreur = ["replace", "inverse", "inserte", "delete", "espace", "NA"]
code_erreur = ["replace"]
varibale = ["VARIABLEA", "VARIABLEB"]
I start by coding just the "replace" error:
select a random letter, in a random "varibale", and replace it with another random letter.
My input:
VARIABLEA | VARIABLEB
BLUE      | WHITE
PINK      | DARK
My expected output:
VARIABLEA | VARIABLEB
BLTE      | WHITE
PINK      | DARM
And I made a loop:
def algo_corruption(lettre, code_erreur, nombre_erreur, varibale, data):
    alp = list(string.ascii_uppercase)
    table_corruption = []
    for i in range(1, data.count()):
        code_erreur_choisi = random.choice(code_erreur)
        varibale_choisie = random.choice(varibale)
        table_corruption.append((code_erreur_choisi,
                                 varibale_choisie))

    cols = ["code_erreur_choisi", "varibale_choisie"]
    result = spark.createDataFrame(table_corruption, cols)
    result = result.withColumn("id", monotonically_increasing_id())
    data = data.withColumn("id", monotonically_increasing_id())
    data_join_result = data.join(result, "id", "inner").drop("id")

    for j in range(1, data_join_result.count()):
        if data_join_result.filter(col("code_erreur_choisi") == "replace"):
            data_corrp = (data_join_result[varibale_choisie]
                          .replace(random.choice(data_join_result.collect()[j][varibale_choisie]),
                                   random.choice(alp)))
            display(data_corrp)
        else:
            print("erreur pas encore codée")
But that doesn't work; I always get errors like:
ValueError: R
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<command-3614001298541202> in <module>()
----> 1 algo_corruption(code_erreur, varibale, extrac_base_train)

<command-1181487391956055> in algo_corruption(code_erreur, varibale, data)
     29         if data_join_result.filter(col("code_erreur_choisi") == "replace"):
     30             data_corrp = (data_join_result[varibale_choisie].replace(random.choice(data_join_result.collect()[j][varibale_choisie]),
---> 31                           data_join_result.collect()[j][lettre_choisie]))
     32             display(data_corrp)
     33

/databricks/spark/python/pyspark/sql/types.py in __getitem__(self, item)
   1517             raise KeyError(item)
   1518         except ValueError:
-> 1519             raise ValueError(item)
   1520
   1521     def __getattr__(self, item):

ValueError: R
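For illustration, a minimal sketch of one way to express the intended "replace" corruption with a plain Python UDF (my own sketch, not from the thread, assuming an existing SparkSession named spark and the VARIABLEA/VARIABLEB columns from the example):

import random
import string

from pyspark.sql import functions as f, types as t

def corrupt_replace(value):
    # Swap one randomly chosen character of the string for a random uppercase letter.
    if not value:
        return value
    i = random.randrange(len(value))
    return value[:i] + random.choice(string.ascii_uppercase) + value[i + 1:]

udf_corrupt = f.udf(corrupt_replace, t.StringType())

df = spark.createDataFrame([("BLUE", "WHITE"), ("PINK", "DARK")],
                           ["VARIABLEA", "VARIABLEB"])

# Corrupt one randomly chosen column (a single choice for the whole frame).
target = random.choice(["VARIABLEA", "VARIABLEB"])
df.withColumn(target, udf_corrupt(f.col(target))).show()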

Cryptic python error 'classobj' object has no attribute '__getitem__'. Why am I getting this?

I really wish I could be more specific here, but I have read through related questions and none of them seem to relate to the issue I am experiencing, and I have no understanding of it. This is for a homework assignment, so I am hesitant to put up all my code for the program; here is a stripped-down version. Run this and you will see the issue.
import copy

class Ordering:
    def __init__(self, tuples):
        self.pairs = copy.deepcopy(tuples)
        self.sorted = []
        self.unsorted = []
        for x in self.pairs:
            self.addUnsorted(left(x))
            self.addUnsorted(right(x))

    def addUnsorted(self, item):
        isPresent = False
        for x in self.unsorted:
            if x == item:
                isPresent = True
        if isPresent == False:
            self.unsorted.append(left(item))
Here I have created a class, Ordering, that takes a list of the form [('A', 'B'), ('C', 'B'), ('D', 'A')] (where A must come before B, C must come before B, etc.) and is supposed to return it in partially ordered form. I am trying to debug my code to see if it works correctly, but I have not been able to yet because of the error message I get back.
When I input the following in my terminal:
print Ordering[('A', 'B'), ('C', 'B'), ('D', 'A')]
I get back the following error message:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'classobj' object has no attribute '__getitem__'
Why is this?!
To access an element of a list, use square brackets. To instantiate a class, use parens.
In other words, do not use:
print Ordering[('A', 'B'), ('C', 'B'), ('D', 'A')]
Use:
print Ordering((('A', 'B'), ('C', 'B'), ('D', 'A')))
This will generate another error from deeper in the code but, since this is a homework assignment, I will let you think about that one a bit.
How to use __getitem__:
As a minimal example, here is a class that returns squares via __getitem__:
class HasItems(object):
    def __getitem__(self, key):
        return key**2
In operation, it looks like this:
>>> a = HasItems()
>>> a[4]
16
Note the square brackets.
Answer to "Why is this?"
Your demo code is not complete (ref. the comment above), but the __getitem__ issue clearly comes from the print statement that indexes the class object itself (which has no __getitem__ method), rather than from anything inside the class.
>>> aList = [ ('A','B'), ('C','D'), ('E','F')] # the stated format of input
>>> aList # validated to be a list
[('A', 'B'), ('C', 'D'), ('E', 'F')]
>>> type( aList ) # cross-validated
<type 'list'>
>>> for x in aList:        # iterate over members
...     print x, type( x ) # show value and type
...     left( x )          # request as in demo-code
...
('A', 'B') <type 'tuple'>
Traceback (most recent call last): <<< demo-code does not have it
File "<stdin>", line 3, in <module>
NameError: name 'left' is not defined
>>> dir( Ordering ) # .__getitem__ method missing
[ '__doc__', '__init__', '__module__', 'addUnsorted']
>>> dir( aList[0] ) # .__getitem__ method present
['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__',
'__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__',
'__getslice__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__',
'__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
'__repr__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
'count', 'index']

How to add meta_data to Pandas dataframe?

I use Pandas DataFrames heavily and need to attach some data to a dataframe, for example to record the creation time of the dataframe, an additional description of it, etc.
I just can't find a reserved field of the DataFrame class to keep such data.
So I changed the core\frame.py file to add a line _reserved_slot = {} to solve my issue. I am posting the question here just to ask: is it OK to do so, or is there a better way to attach metadata to a dataframe/column/row etc.?
#----------------------------------------------------------------------
# DataFrame class

class DataFrame(NDFrame):
    _auto_consolidate = True
    _verbose_info = True
    _het_axis = 1
    _col_klass = Series

    _AXIS_NUMBERS = {
        'index': 0,
        'columns': 1
    }

    _reserved_slot = {}  # Add by bigbug to keep extra data for dataframe

    _AXIS_NAMES = dict((v, k) for k, v in _AXIS_NUMBERS.iteritems())
EDIT: (add a demo of witingkuo's approach)
>>> df = pd.DataFrame(np.random.randn(10,5), columns=list('ABCDEFGHIJKLMN')[0:5])
>>> df
A B C D E
0 0.5890 -0.7683 -1.9752 0.7745 0.8019
1 1.1835 0.0873 0.3492 0.7749 1.1318
2 0.7476 0.4116 0.3427 -0.1355 1.8557
3 1.2738 0.7225 -0.8639 -0.7190 -0.2598
4 -0.3644 -0.4676 0.0837 0.1685 0.8199
5 0.4621 -0.2965 0.7061 -1.3920 0.6838
6 -0.4135 -0.4991 0.7277 -0.6099 1.8606
7 -1.0804 -0.3456 0.8979 0.3319 -1.1907
8 -0.3892 1.2319 -0.4735 0.8516 1.2431
9 -1.0527 0.9307 0.2740 -0.6909 0.4924
>>> df._test = 'hello'
>>> df2 = df.shift(1)
>>> print df2._test
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python\lib\site-packages\pandas\core\frame.py", line 2051, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute '_test'
>>>
This is not supported right now. See https://github.com/pydata/pandas/issues/2485. The reason is that the propagation of these attributes is non-trivial. You can certainly assign data, but almost all pandas operations return a new object, where the assigned data will be lost.
Your _reserved_slot will become a class variable, which will not work if you want to assign different values to different DataFrames. Instead, you can probably assign what you want to the instance directly (see the session after the sketch below).
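A minimal sketch of that class-variable pitfall (my own illustration, using a subclass rather than an edited frame.py): the dict ends up shared by every instance.

import pandas as pd

class MyFrame(pd.DataFrame):
    _reserved_slot = {}  # class attribute: one dict shared by all instances

a = MyFrame()
b = MyFrame()
a._reserved_slot['birth_time'] = '2013-01-01'
print(b._reserved_slot)  # {'birth_time': '2013-01-01'} -- it leaked into b as well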
In [6]: import pandas as pd
In [7]: df = pd.DataFrame()
In [8]: df._test = 'hello'
In [9]: df._test
Out[9]: 'hello'
I think a decent workaround is putting your dataframe into a dictionary, with your metadata under other keys. So if you have a dataframe with cashflows, like:
df = pd.DataFrame({'Amount': [-20, 15, 25, 30, 100]},index=pd.date_range(start='1/1/2018', periods=5))
You can create your dictionary with additional metadata and put the dataframe there
out = {'metadata': {'Name': 'Whatever', 'Account': 'Something else'}, 'df': df}
and then access the dataframe as out['df'] and the metadata as out['metadata'].
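A short usage sketch, continuing from the out dict above:

print(out['metadata']['Name'])     # 'Whatever'
print(out['metadata']['Account'])  # 'Something else'
print(out['df'].head())            # the wrapped cashflow DataFrame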