Is it semantically possible to optimize LazyFrame -> Fill Null -> Cast to Categorical? - python-polars

Here is a trivial benchmark based on a real-life workload.
import gc
import time
import numpy as np
import polars as pl
df = ( # I have a dataframe like this from reading a csv.
pl.Series(
name="x",
values=np.random.choice(
["ASPARAGUS", "BROCCOLI", ""], size=30_000_000
),
)
.to_frame()
.with_column(
pl.when(pl.col("x") == "").then(None).otherwise(pl.col("x"))
)
)
start = time.time()
df.lazy().with_column(
pl.col("x").cast(pl.Categorical).fill_null("MISSING")
).collect()
end = time.time()
print(f"Cast then fill_null took {end-start:.2f} seconds.")
Cast then fill_null took 0.93 seconds.
gc.collect()
start = time.time()
df.lazy().with_column(
pl.col("x").fill_null("MISSING").cast(pl.Categorical)
).collect()
end = time.time()
print(f"Fill_null then cast took {end-start:.2f} seconds.")
Fill_null then cast took 1.36 seconds.
(1) Am I correct to think that casting to categorical then filling null will always be faster?
(2) Am I correct to think that the result will always be identical regardless of the order?
(3) If the answers are "yes" and "yes", is it possible that someday polars will do this rearrangement automatically? Or is it actually impossible try all these sorts of permutations in a general query optimizer?
Thanks.

1: yes
2: somewhat. The logical categorcal representatition will always be the same. The physical changes by the order of occurrence of the string values. Doing fill_null before the cast, means "MISSING" will be found earlier. But this should be seen as an implementation detail.
3: Yes, this is something we can automatically optimize. Just today we merged something similar: https://github.com/pola-rs/polars/pull/4883

Related

Cast a list column into dummy columns in Python Polars?

I have a very large data frame where there is a column that is a list of numbers representing category membership.
Here is a dummy version
import pandas as pd
import numpy as np
segments = [str(i) for i in range(1_000)]
# My real data is ~500m rows
nums = np.random.choice(segments, (100_000,10))
df = pd.DataFrame({'segments': [','.join(n) for n in nums]})
userId
segments
0
885,106,49,138,295,254,26,460,0,844
1
908,709,454,966,151,922,666,886,65,708
2
664,713,272,241,301,498,630,834,702,289
3
60,880,906,471,437,383,878,369,556,876
4
817,183,365,171,23,484,934,476,273,230
...
...
Note that there is a known list of segments (0-999 in the example)
I want to cast this into dummy columns indicating membership to each segment.
I found a few ways of doing this:
In pandas:
df_one_hot_encoded = (df['segments']
.str.split(',')
.explode()
.reset_index()
.assign(__one__=1)
.pivot_table(index='index', columns='segments', values='__one__', fill_value=0)
)
(takes 8 seconds on a 100k row sample)
And polars
df2 = pl.from_pandas(df[['segments']])
df_ans = (df2
.with_columns([
pl.arange(0, len(df2)).alias('row_index'),
pl.col('segments').str.split(','),
pl.lit(1).alias('__one__')
])
.explode('segments')
.pivot(index='row_index', columns='segments', values='__one__')
.fill_null(0)
)
df_one_hot_encoded = df_ans.to_pandas()
(takes 1.5 seconds inclusive of the conversion to and from pandas, .9s without)
However, I hear .pivot is not efficient, and that it does not work well with lazy frames.
I tried other solutions in polars, but they were much slower:
_ = df2.lazy().with_columns(**{segment: pl.col('segments').str.contains(segment) for segment in segments}).collect()
(2 seconds)
(df2
.with_columns([
pl.arange(0, len(df2)).alias('row_index'),
pl.col('segments').str.split(',')
])
.explode('segments')
.to_dummies(columns=['segments'])
.groupby('row_index')
.sum()
)
(4 seconds)
Does anyone know a better solution than the .9s pivot?
This approach ends up being slower than the pivot but it's a got a different trick so I'll include it.
df2=pl.from_pandas(df)
df2_ans=(df2.with_row_count('userId').with_column(pl.col('segments').str.split(',')).explode('segments') \
.with_columns([pl.when(pl.col('segments')==pl.lit(str(i))).then(pl.lit(1,pl.Int32)).otherwise(pl.lit(0,pl.Int32)).alias(str(i)) for i in range(1000)]) \
.groupby('userId')).agg(pl.exclude('segments').sum())
df_one_hot_encoded = df2_ans.to_pandas()
A couple of other observations. I'm not sure if you checked the output of your str.contains method but I would think that wouldn't work because, for example, 15 is contained within 154 when looking at strings.
The other thing, which I guess is just a preference, is the with_row_count syntax vs the pl.arrange. I don't think the performance of either is better (at least not significantly so) but you don't have to reference the df name to get the len of it which is nice.
I tried a couple other things that were also worse including not doing the explode and just doing is_in but that was slower. I tried using bools instead of 1s and 0s and then aggregating with any but that was slower.

pyspark explode performance

Background
I use explode to transpose columns to rows.
This works very well in general with good performance.
The source dataframe (df_audit in below code) is dynamic so can contain different structure.
Problem
Recently have incoming dataframe with very large number of columns (5 thousand) - the below code runs successfully but is very slow to run the line starting 'exploded'.
Anyone faced similar problems? I could split up the dataframe to multiple dataframes (broken out by columns) or might there be better way? Or example code?
Example code
key_cols = ["cola", "colb", "colc"]
cols = [col for col in df_audit.columns if col not in key_cols]
exploded = explode(array([struct(lit(c).alias("key"), col(c).alias("val")) for c in cols])).alias("exploded")
df_audit = df_audit.select(key_cols + [exploded]).select(key_cols + ["exploded.key", "exploded.val"])
Both lit() and col() are for some reason quite slow when used in a loop. You can try instead with arrays_zip():
exploded = explode(
arrays_zip(split(lit(','.join(cols)), ',').alias('key'), array(cols).alias('val'))
).alias('exploded')
In my quick test on 5k columns, this runs for ~6s vs. original ~25s.
Sharing some timings for bzu's approach and OP's approach based on colaboratory notebook.
cols = ['i'+str(i) for i in range(5000)]
# OP's method
%timeit func.array(*[func.struct(func.lit(k).alias('k'), func.col(k).alias('v')) for k in cols])
# 34.7 s ± 2.84 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
# bzu's method
%timeit func.arrays_zip(func.split(func.lit(','.join(cols)), ',').alias('k'), func.array(cols).alias('v'))
# 10.7 s ± 1.41 s per loop (mean ± std. dev. of 7 runs, 1 loop each)
Thank you bzu & samkart but for some reason I cannot get the new line working.
I have created a simple example that doesn't work as follows if you can see something obvious I am missing.
from pyspark.sql.functions import (
array, arrays_zip, coalesce, col, explode, lit, lower, split, struct,substring,)
from pyspark.sql.types import StringType
def process_data():
try:
logger.info("\ntest 1")
df_audit = spark.createDataFrame([("1", "foo", "abc", "xyz"),("2", "bar", "def", "zab"),],["id", "label", "colx", "coly"])
logger.info("\ntest 2")
key_cols = ["id", "label"]
cols = [col for col in df_audit.columns if col not in key_cols]
logger.info("\ntest 3")
# exploded = explode(array([struct(lit(c).alias("key"), col(c).alias("val")) for c in cols])).alias("exploded")
exploded = explode(arrays_zip(split(lit(','.join(cols)), ',').alias('key'), array(cols).alias('val'))).alias('exploded')
logger.info("\ntest 4")
df_audit = df_audit.select(key_cols + [exploded]).select(key_cols + ["exploded.key", "exploded.val"])
df_audit.show()
except Exception as e:
logger.error("Error in process_audit_data: {}".format(e))
return False
return True
When I call process_data function I get following logged:
test 1
test 2
test 3
test 4
Error in process_audit_data: No such struct field key in 0, 1.
Note: it does work successfully with the commented exploded line
Many thanks

How do i get datetime into a tableau extract with pantab

I have this code:
import pantab
import pandas as pd
import datetime
df = pd.DataFrame([
[datetime.date(2018,2,20), 4],
[datetime.date(2018,2,20), 4],
], columns=["date", "num_of_legs"])
pantab.frame_to_hyper(df, "example.hyper", table="animals")
which causes this error:
TypeError: Invalid value "datetime.date(2018, 2, 20)" found (row 0 column 0)
Is there a fix?
This has apparently been a problem since time 2020. You have to know which columns are datetime columns because panda dtypes treat dates and strings as objects. There's a world where data scientists don't care about dates, but not mine, apparently. Anyway, here's the solution awaiting the day when pandas makes the date dtype:
[https://github.com/innobi/pantab/issues/100][1]
And just to reduce it down to do what I did:
def createHyper(xlsx, filename, tablename):
for name in xlsx.columns:
if 'date' in name.lower():
xlsx[name] = pd.to_datetime(xlsx[name],errors='coerce',utc=True)
pantab.frame_to_hyper(xlsx,filename,table=tablename)
return open(filename, 'rb')
errors = 'coerce' makes it so you can have dates like 1/1/3000 which is handy in a world of scd2 tables
utc = True was necessary for me because my dates were timezone sensitive, yours may not be.
I screwed up the hyperlink, It's not working. Damn it. Thank you to the anonymous editor who will inevitable show up and fix it. I'm very sorry.

PySpark approxSimilarityJoin() not returning any results

I am trying to find similar users by vectorizing user features and sorting by distance between user vectors in PySpark. I'm running this in Databricks on Runtime 5.5 LTS ML cluster (Scala 2.11, Spark 2.4.3)
Following the code in the docs, I am using approxSimilarityJoin() method from the pyspark.ml.feature.BucketedRandomProjectionLSH model.
I have found similar users successfully using approxSimilarityJoin(), but every now and then I come across a user of interest that apparently has no users similar to them.
Usually when approxSimilarityJoin() doesn't return anything, I assume it's because the threshold parameter is set to low. That fixes the issue sometimes, but now I've tried using a threshold of 100000 and still getting nothing back.
I define the model as
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=1.0)
I'm not sure if I changing bucketLength or numHashTables would help in obtaining results.
The following example shows a pair of users where approxSimilarityJoin() returned something (dataA, dataB) and a pair of users (dataC, dataD) where it didn't.
from pyspark.ml.linalg import Vectors
from pyspark.sql.functions import col
dataA = [(0, Vectors.dense([0.7016968702094931,0.2636417660310031,4.155293362824633,4.191398632883099]),)]
dataB = [(1, Vectors.dense([0.3757117100334294,0.2636417660310031,4.1539923630906745,4.190086328785612]),)]
dfA = spark.createDataFrame(dataA, ["customer_id", "scaledFeatures"])
dfB = spark.createDataFrame(dataB, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0,
numHashTables=3)
model = brp.fit(dfA)
# returns
# theshold of 100000 is clearly overkill
# A dataframe with dfA and dfB feature vectors and a EuclideanDistance of 0.32599039770730354
model.approxSimilarityJoin(dfA, dfB, 100000, distCol="EuclideanDistance").show()
dataC = [(0, Vectors.dense([1.1600056435954367,78.27652460873155,3.5535837780801396,0.0030949620591871887]),)]
dataD = [(1, Vectors.dense([0.4660731192450482,39.85571715054726,1.0679201943112886,0.012330725745062067]),)]
dfC = spark.createDataFrame(dataC, ["customer_id", "scaledFeatures"])
dfD = spark.createDataFrame(dataD, ["customer_id", "scaledFeatures"])
brp = BucketedRandomProjectionLSH(inputCol="scaledFeatures", outputCol="hashes", bucketLength=2.0,
numHashTables=3)
model = brp.fit(dfC)
# returns empty df
model.approxSimilarityJoin(dfC, dfD, 100000, distCol="EuclideanDistance").show()
I was able to obtain results to the second half of the example above by increasing the bucketLength parameter value to 15. The threshold could have been lowered because the Euclidean Distance was ~34.
Per the PySpark docs:
bucketLength = the length of each hash bucket, a larger bucket lowers the false negative rate

PySpark filtering gives inconsistent behavior

So I have a data set where I do some transformations and the last step is to filter out rows that have a 0 in a column called frequency. The code that does the filtering is super simple:
def filter_rows(self, name: str = None, frequency_col: str = 'frequency', threshold: int = 1):
df = getattr(self, name)
df = df.where(df[frequency_col] >= threshold)
setattr(self, name, df)
return self
The problem is a very strange behavior where if I put a rather high threshold like 10, it works fine, filtering out all the rows below 10. But if I make the threshold just 1, it does not remove the 0s! Here is an example of the former (threshold=10):
{"user":"XY1677KBTzDX7EXnf-XRAYW4ZB_vmiNvav7hL42BOhlcxZ8FQ","domain":"3a899ebbaa182778d87d","frequency":12}
{"user":"lhoAWb9U9SXqscEoQQo9JqtZo39nutq3NgrJjba38B10pDkI","domain":"3a899ebbaa182778d87d","frequency":9}
{"user":"aRXbwY0HcOoRT302M8PCnzOQx9bOhDG9Z_fSUq17qtLt6q6FI","domain":"33bd29288f507256d4b2","frequency":23}
{"user":"RhfrV_ngDpJex7LzEhtgmWk","domain":"390b4f317c40ac486d63","frequency":14}
{"user":"qZqqsNSNko1V9eYhJB3lPmPp0p5bKSq0","domain":"390b4f317c40ac486d63","frequency":11}
{"user":"gsmP6RG13azQRmQ-RxcN4MWGLxcx0Grs","domain":"f4765996305ccdfa9650","frequency":10}
{"user":"jpYTnYjVkZ0aVexb_L3ZqnM86W8fr082HwLliWWiqhnKY5A96zwWZKNxC","domain":"f4765996305ccdfa9650","frequency":15}
{"user":"Tlgyxk_rJF6uE8cLM2sArPRxiOOpnLwQo2s","domain":"f89838b928d5070c3bc3","frequency":17}
{"user":"qHu7fpnz2lrBGFltj98knzzbwWDfU","domain":"f89838b928d5070c3bc3","frequency":11}
{"user":"k0tU5QZjRkBwqkKvMIDWd565YYGHfg","domain":"f89838b928d5070c3bc3","frequency":17}
And now here is some of the data with threshold=1:
{"user":"KuhSEPFKACJdNyMBBD2i6ul0Nc_b72J4","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"EP1LomZ3qAMV3YtduC20","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"UxulBfshmCro-srE3Cs5znxO5tnVfc0_yFps","domain":"d69cb6f62b885fec9b7d","frequency":1}
{"user":"v2OX7UyvMVnWlDeDyYC8Opk-va_i8AwxZEsxbk","domain":"d69cb6f62b885fec9b7d","frequency":0}
{"user":"4hu1uE2ucAYZIrNLeOY2y9JMaArFZGRqjgKzlKenC5-GfxDJQQbLcXNSzj","domain":"68b588cedbc66945c442","frequency":0}
{"user":"5rFMWm_A-7N1E9T289iZ65TIR_JG_OnZpJ-g","domain":"68b588cedbc66945c442","frequency":1}
{"user":"RLqoxFMZ7Si3CTPN1AnI4hj6zpwMCJI","domain":"68b588cedbc66945c442","frequency":1}
{"user":"wolq9L0592MGRfV_M-FxJ5Wc8UUirjqjMdaMDrI","domain":"68b588cedbc66945c442","frequency":0}
{"user":"9spTLehI2w0fHcxyvaxIfo","domain":"68b588cedbc66945c442","frequency":1}
I should note that before this step I perform some other transformations, and I've noticed weird behaviors in Spark in the past sometimes doing very simple things like this after a join or a union can give very strange results where eventually the only solution is to write out the data and read it back in again and do the operation in a completely separate script. I hope there is a better solution than this!