What is the Polars equivalent of the Pandas `.isna()` method?

I'm trying to replace Pandas with Polars in production code, for better memory performance.
What would be the Polars equivalent of Pandas .isna() method? I couldn't find any good equivalent in the doc.

Polars has .is_null(). Note that Pandas has .isnull() as well, which is an alias for .isna().
Per the example in the Polars docs:
s = pl.Series("a", [1.0, 2.0, 3.0, None])
s.is_null()
shape: (4,)
Series: 'is_null' [bool]
[
false
false
false
true
]
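For completeness, .is_null() also works inside expressions; a minimal sketch (the DataFrame here is illustrative, not from the question):
import polars as pl
df = pl.DataFrame({"a": [1.0, 2.0, None]})
# Keep only the rows where "a" is null.
df.filter(pl.col("a").is_null())
# Count the nulls, similar to pandas' .isna().sum().
df.select(pl.col("a").is_null().sum())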

Is it semantically possible to optimize LazyFrame -> Fill Null -> Cast to Categorical?

Here is a trivial benchmark based on a real-life workload.
import gc
import time
import numpy as np
import polars as pl
df = (  # I have a dataframe like this from reading a csv.
    pl.Series(
        name="x",
        values=np.random.choice(
            ["ASPARAGUS", "BROCCOLI", ""], size=30_000_000
        ),
    )
    .to_frame()
    .with_column(
        pl.when(pl.col("x") == "").then(None).otherwise(pl.col("x"))
    )
)
start = time.time()
df.lazy().with_column(
    pl.col("x").cast(pl.Categorical).fill_null("MISSING")
).collect()
end = time.time()
print(f"Cast then fill_null took {end-start:.2f} seconds.")
Cast then fill_null took 0.93 seconds.
gc.collect()
start = time.time()
df.lazy().with_column(
    pl.col("x").fill_null("MISSING").cast(pl.Categorical)
).collect()
end = time.time()
print(f"Fill_null then cast took {end-start:.2f} seconds.")
Fill_null then cast took 1.36 seconds.
(1) Am I correct to think that casting to categorical then filling null will always be faster?
(2) Am I correct to think that the result will always be identical regardless of the order?
(3) If the answers are "yes" and "yes", is it possible that someday Polars will do this rearrangement automatically? Or is it actually impossible to try all these sorts of permutations in a general query optimizer?
Thanks.
1: yes
2: somewhat. The logical categorical representation will always be the same. The physical representation changes with the order of occurrence of the string values: doing fill_null before the cast means "MISSING" will be encountered earlier. But this should be seen as an implementation detail (see the sketch below).
3: Yes, this is something we can automatically optimize. Just today we merged something similar: https://github.com/pola-rs/polars/pull/4883
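To illustrate point 2, here is a minimal sketch (not part of the original answer) of how the physical codes depend on the order of first occurrence; the exact integers are an implementation detail and may differ between versions:
import polars as pl
# "MISSING" appears first in s1 but second in s2, so it gets a different
# physical code even though the logical string values are the same.
s1 = pl.Series("x", ["MISSING", "ASPARAGUS", "MISSING"]).cast(pl.Categorical)
s2 = pl.Series("x", ["ASPARAGUS", "MISSING", "MISSING"]).cast(pl.Categorical)
print(s1.to_physical())  # typically [0, 1, 0]
print(s2.to_physical())  # typically [0, 1, 1]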

Similar function to scipy.stats.zscore but based on another "sample"

I have 2 datasets which describe the same process, and I expect the same general range of values. What I would like to do is use scipy.stats.zscore on one dataset, but instead of using the sample mean and standard deviation, use the mean and standard deviation from the other dataset. Is there such an equivalent function?
It sounds like you want scipy.stats.zmap.
In [141]: import numpy as np
In [142]: from scipy.stats import zmap
In [143]: olddata = np.array([3.67, 4.01, 3.60, 5.36, 3.65, 2.01, 2.75, 4.43, 2.74, 3.89, 3.60])
In [144]: newdata = np.array([1.0, 2.4, 2.5, 3.25, 5.6])
In [145]: zmap(newdata, olddata)
Out[145]: array([-3.05378533, -1.41573956, -1.29873629, -0.42121177, 2.32836506])
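For reference, zmap with its default ddof=0 is the same as standardizing newdata with olddata's mean and (population) standard deviation; a quick check under that assumption:
# Matches zmap(newdata, olddata), which uses ddof=0 by default.
(newdata - olddata.mean()) / olddata.std()
# Pass ddof=1 to both std() and zmap for the sample standard deviation instead.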

scipy.stats skewness does not correctly provide skewness results

I noticed that the skewness returned from scipy.stats is not correct; pandas .skew() actually provides better results.
I am currently trying to replicate a classic paper, Expected Stock Returns and Volatility by French & Schwert. I use S&P 500 data from 1928 to 1984. I follow the formula in the paper for the standard deviation of the return, and I am able to get the same results for the mean and the std dev of the std dev.
However, when I use the scipy.stats.skew function, I cannot get any number for the std dev of the S&P return. The function returns "nan", where it should clearly return a value.
I switched to pandas .skew(), and it returned the correct value, as in the paper.
Clearly, something is wrong with the scipy.stats.skew() function.
Results by scipy.stats.skew():
['Adj Close_gspc', 'Adj Close_gspc_lag', 'SP_Return', 'SP_Return_square',
'SP_Return_lag', 'SP_varianceMon', 'SP_varianceMon_sqrRoot']
array([ 0.6922229 , 0.69186265, -0.11292165, 4.23571807, -1.9556035 ,
5.39873607, nan])
Results by pandas .skew():
Adj Close_gspc 0.693745
Adj Close_gspc_lag 0.693384
SP_Return -0.113170
SP_Return_square 4.245033
SP_Return_lag -1.959904
SP_varianceMon 5.410609
SP_varianceMon_sqrRoot 2.800919
dtype: float64
You haven't provided enough information or sample code to reproduce the nan that you get.
To make scipy.stats.skew compute the same value as the skew() method in Pandas, add the argument bias=False.
Here's an example.
First, the imports:
In [21]: import numpy as np
In [22]: import pandas as pd
In [23]: from scipy.stats import skew
Generate some data:
In [24]: np.random.seed(8675309)
In [25]: x = np.random.weibull(0.2, size=15)
Compute the skew with scipy and with Pandas:
In [26]: skew(x, bias=False)
Out[26]: 3.7582525674514544
In [27]: pd.Series(x).skew()
Out[27]: 3.7582525674514544
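One possible cause of the nan (unconfirmed, since the original data isn't shown): scipy.stats.skew propagates NaNs by default, while pandas skips them. If that is what is happening, nan_policy='omit' reproduces the pandas behaviour; a hypothetical illustration using the same x:
# Append a NaN to the data used above.
y = np.append(x, np.nan)
skew(y, bias=False)                     # nan, because NaNs propagate by default
skew(y, bias=False, nan_policy='omit')  # same value as skew(x, bias=False)
pd.Series(y).skew()                     # pandas skips the NaN by default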

VectorAssembler behavior and aggregating sparse data with dense

Can someone explain the behavior of VectorAssembler?
from pyspark.ml.linalg import Vectors
from pyspark.ml.feature import VectorAssembler
assembler = VectorAssembler(
    inputCols=['CategoryID', 'CountryID', 'CityID', 'tf'],
    outputCol="features")
output = assembler.transform(tf)
output.select("features").show(truncate=False)
The show method returns:
(262147,[0,1,2,57344,61006,80641,126469,142099,190228,219556,221426,231784],[2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
When I use the same variable "output" with take, I get a different return:
output.select('features').take(1)
[Row(features=SparseVector(262147, {0: 2.0, 1: 1.0, 2: 1.0, 57344: 1.0, 61006: 1.0, 80641: 1.0, 126469: 1.0, 142099: 1.0, 190228: 1.0, 219556: 1.0, 221426: 1.0, 231784: 1.0}))]
By the way, consider this case: there is a sparse array output from "tfidf", and I still have additional data (metadata) available. I need to somehow aggregate sparse arrays in PySpark DataFrames with the metadata for an LSH algorithm. I've tried VectorAssembler, as you can see, but it also returns a dense vector. Are there any tricks to combine the data and still have sparse data as output?
Only the format of the two returns is different; in both cases, you actually get the same sparse vector.
In the first case, you get a sparse vector with 3 elements: the dimension (262147), and two lists, containing the indices & values respectively of the nonzero elements. You can easily verify that the length of these lists is the same, as it should be:
len([0,1,2,57344,61006,80641,126469,142099,190228,219556,221426,231784])
# 12
len([2.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0,1.0])
# 12
In the second case you again get a sparse vector with the same first element, but here the two lists are combined into a dictionary of the form {index: value}, which again has the same length as the lists of the previous representation:
len({0: 2.0, 1: 1.0, 2: 1.0, 57344: 1.0, 61006: 1.0, 80641: 1.0, 126469: 1.0, 142099: 1.0, 190228: 1.0, 219556: 1.0, 221426: 1.0, 231784: 1.0} )
# 12
Since assembler.transform() returns a Spark dataframe, the difference is due to the different formats returned by the Spark SQL functions show and take, respectively.
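To see that the two printed forms describe the same object, you can construct a SparseVector both ways and compare them; a small sketch using just the first three nonzero entries from the output above:
from pyspark.ml.linalg import SparseVector, Vectors
# (size, [indices], [values]) and (size, {index: value}) are two spellings
# of the same sparse vector.
v_lists = Vectors.sparse(262147, [0, 1, 2], [2.0, 1.0, 1.0])
v_dict = SparseVector(262147, {0: 2.0, 1: 1.0, 2: 1.0})
print(v_lists == v_dict)  # True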
By the way, consider this case [...]
It is not at all clear what exactly you are asking here, and in any case I suggest you open a new question on this with a reproducible example, since it sounds like a different subject...

filter spark dataframe based on maximum value of a column

I want to do something like this:
df
  .withColumn("newCol", <some formula>)
  .filter(s"""newCol > ${(math.min(max("newCol").asInstanceOf[Double], 10))}""")
Exception I'm getting:
org.apache.spark.sql.Column cannot be cast to java.lang.Double
Can you please suggest a way to achieve what I want?
I assume newCol is already present in df, then:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._
df
  .withColumn("max_newCol", max($"newCol").over(Window.partitionBy()))
  .filter($"newCol" > least($"max_newCol", lit(10.0)))
Instead of max($"newCol").over(Window.partitionBy()) you can also just write max($"newCol").over()
I think the DataFrame describe function is what you are looking for.
ds.describe("age", "height").show()
// output:
// summary age height
// count 10.0 10.0
// mean 53.3 178.05
// stddev 11.6 15.7
// min 18.0 163.0
// max 92.0 192.0
I'd separate both steps and either:
val newDF = df
  .withColumn("newCol", <some formula>)
// Spark 2.1 or later
// With 1.x use join
newDF.alias("l").crossJoin(
  newDF.alias("r")).where($"l.newCol" > least($"r.newCol", lit(10.0)))
or
newDF.where(
  $"newCol" > (newDF.select(max($"newCol")).as[Double].first min 10.0))
The solution has two parts.
Part I
Find the maximum value,
df.select(max($"col1")).first()(0)
Part II
Use that value to filter on it
df.filter($"col1" === df.select(max($"col1")).first()(0)).show
Bonus
To avoid potential errors, you can also get the maximum value in the specific format you need, using the .get family on it: df.select(max($"col1")).first.getDouble(0)
In this case col1 is DoubleType, so I chose to pick it in the correct format. You can get pretty much all other types. Options are:
getBoolean, getClass, getDecimal, getFloat, getJavaMap, getLong, getSeq, getString, getTimestamp, getAs, getByte, getDate, getDouble, getInt, getList, getMap, getShort, getStruct, getValuesMap
Making the full solution in this case:
df.filter($"col1" === df.select(max($"col1")).first.getDouble(0)).show