read_json on polars causes OutOfSpec error - python-polars

I have started to evaluate Polars and it looks amazing compared to Pandas. My use case is running data processing tasks on "medium"-sized data, and so far it looks very promising.
However, reading a JSON file causes:
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OutOfSpec("offsets must not exceed the values length")
The call is:
import polars as pr
pr.read_json('./data/yelp_academic_dataset_review.json', json_lines=True)
The file size is 5.0 GB; it was taken from the Kaggle Yelp dataset.
I am running on a Mac: 16 GB RAM, 2.3 GHz Quad-Core Intel Core i7, Polars 0.13.58.
What might be the reason?
Thanks

Update: Polars >= 0.13.59
As of Polars 0.13.59, this has been fixed. You can now read a JSON file with more than 2 GB of text in a column, so the workaround below is no longer needed.
And as an added bonus, the JSON parser is now much faster.
The problem
It appears to be neither a RAM limitation nor a malformed input file. Instead, it appears to be a limitation in read_json on the amount of data being parsed: more than 2 GB of text in a single column (the limit noted in the update above).
I threw my Threadripper Pro (with 512 GB of RAM) at this. If I read the file into RAM:
import polars as pl
from io import StringIO
with open("/tmp/yelp_academic_dataset_review.json") as json_file:
    file_lines = json_file.readlines()
len(file_lines)
We get 6,990,280 lines.
>>> len(file_lines)
6990280
Using binary search, I discovered that reading the first 3,785,593 lines works:
>>> pl.read_json(StringIO("".join(file_lines[0:3_785_593])), json_lines=True)
shape: (3785593, 9)
┌────────────────────────┬──────┬─────────────────────┬───────┬─────┬───────┬─────────────────────────────────────┬────────┬────────────────────────┐
│ business_id ┆ cool ┆ date ┆ funny ┆ ... ┆ stars ┆ text ┆ useful ┆ user_id │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 ┆ ┆ f64 ┆ str ┆ i64 ┆ str │
╞════════════════════════╪══════╪═════════════════════╪═══════╪═════╪═══════╪═════════════════════════════════════╪════════╪════════════════════════╡
│ XQfwVwDr-v0ZS3_CbbE5Xw ┆ 0 ┆ 2018-07-07 22:09:11 ┆ 0 ┆ ... ┆ 3.0 ┆ If you decide to eat here, just ... ┆ 0 ┆ mh_-eMZ6K5RLWhZyISBhwA │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7ATYjTIgM3jUlt4UM3IypQ ┆ 1 ┆ 2012-01-03 15:28:18 ┆ 0 ┆ ... ┆ 5.0 ┆ I've taken a lot of spin classes... ┆ 1 ┆ OyoGAe7OKpv6SyGZT5g77Q │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ YjUWPpI6HXG530lwP-fb2A ┆ 0 ┆ 2014-02-05 20:30:30 ┆ 0 ┆ ... ┆ 3.0 ┆ Family diner. Had the buffet. Ec... ┆ 0 ┆ 8g_iMtfSiwikVnbP2etR0A │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kxX2SOes4o-D3ZQBkiMRfA ┆ 1 ┆ 2015-01-04 00:01:03 ┆ 0 ┆ ... ┆ 5.0 ┆ Wow! Yummy, different, delicio... ┆ 1 ┆ _7bHUi9Uuf5__HHc_Q8guQ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ EaqASiPkxV9OUkvsAp4ODg ┆ 0 ┆ 2015-03-17 20:48:03 ┆ 0 ┆ ... ┆ 4.0 ┆ Small hole in the wall, yet plen... ┆ 0 ┆ OPZWPj14g2LQnDWJjMioWQ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ WbCCGpq_XIr-2_jSXISZKQ ┆ 0 ┆ 2015-08-18 23:26:40 ┆ 1 ┆ ... ┆ 3.0 ┆ Easy street access with adequate... ┆ 0 ┆ 1rPlm6liFDqv8oSmuHSefA │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ld_H5-FpZOWm_tkzwkPYQQ ┆ 0 ┆ 2014-09-25 01:10:49 ┆ 0 ┆ ... ┆ 1.0 ┆ Think twice before staying here.... ┆ 1 ┆ Rz8za5LT_qXBgsL0ice5Qw │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ t0Qyogb4x--K9i5b0AoDCg ┆ 0 ┆ 2017-09-20 14:18:52 ┆ 0 ┆ ... ┆ 5.0 ┆ Reasonably priced, fast friendly... ┆ 0 ┆ uab7_Z8GPeiZ_Un-Jl3fVg │
└────────────────────────┴──────┴─────────────────────┴───────┴─────┴───────┴─────────────────────────────────────┴────────┴────────────────────────┘
But reading one more line causes the error:
>>> pl.read_json(StringIO("".join(file_lines[0:3_785_594])), json_lines=True)
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OutOfSpec("offsets must not exceed the values length")', /github/home/.cargo/git/checkouts/arrow2-8a2ad61d97265680/c720eb2/src/array/growable/utf8.rs:70:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/corey/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/io.py", line 917, in read_json
return DataFrame._read_json(source, json_lines)
File "/home/corey/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/frame.py", line 818, in _read_json
self._df = PyDataFrame.read_json(file, json_lines)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: OutOfSpec("offsets must not exceed the values length")
And yet reading a swath of records around that breakpoint reveals nothing particularly wrong or malformed.
pl.read_json(StringIO("".join(file_lines[3_785_592:3_785_595])), json_lines=True)
shape: (3, 9)
┌────────────────────────┬──────┬─────────────────────┬───────┬─────┬───────┬─────────────────────────────────────┬────────┬────────────────────────┐
│ business_id ┆ cool ┆ date ┆ funny ┆ ... ┆ stars ┆ text ┆ useful ┆ user_id │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 ┆ ┆ f64 ┆ str ┆ i64 ┆ str │
╞════════════════════════╪══════╪═════════════════════╪═══════╪═════╪═══════╪═════════════════════════════════════╪════════╪════════════════════════╡
│ t0Qyogb4x--K9i5b0AoDCg ┆ 0 ┆ 2017-09-20 14:18:52 ┆ 0 ┆ ... ┆ 5.0 ┆ Reasonably priced, fast friendly... ┆ 0 ┆ uab7_Z8GPeiZ_Un-Jl3fVg │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ wEdzUMaLE2ebYoe7Z0XGaA ┆ 0 ┆ 2017-07-18 00:16:16 ┆ 0 ┆ ... ┆ 1.0 ┆ I apologize to the readers of Ye... ┆ 0 ┆ tVkr6-lasqKzafoV5K4JfA │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ZF0tt7hn6WK3-aNWgtLcFA ┆ 0 ┆ 2016-08-01 22:08:07 ┆ 0 ┆ ... ┆ 5.0 ┆ Great place. Interesting to see ... ┆ 0 ┆ 9XT2LHohnC8v0T1H4Jxs2Q │
└────────────────────────┴──────┴─────────────────────┴───────┴─────┴───────┴─────────────────────────────────────┴────────┴────────────────────────┘
And there's nothing in the input file in that swath that suggests problems, other than a long review:
head -3785595 yelp_academic_dataset_review.json | tail -3
{"review_id":"kWSOtQvuANZIaCpnb2jNbA","user_id":"uab7_Z8GPeiZ_Un-Jl3fVg","business_id":"t0Qyogb4x--K9i5b0AoDCg","stars":5.0,"useful":0,"funny":0,"cool":0,"text":"Reasonably priced, fast friendly service, delicious Mexican food. Our go-to place for Mexican takeout in Exton\/Lionville. They also have tables for dining, you order at the counter. Exceptional value for high quality fresh food.","date":"2017-09-20 14:18:52"}
{"review_id":"sOOPVuf02-Lz75cTI33KEw","user_id":"tVkr6-lasqKzafoV5K4JfA","business_id":"wEdzUMaLE2ebYoe7Z0XGaA","stars":1.0,"useful":0,"funny":0,"cool":0,"text":"I apologize to the readers of Yelp in advance for the length of this review. However, I felt the need to say what is on my mind. The one star I gave is for the kind and intelligent hostess who needs to be in the manager's position as he does not know how to do his job. Firstly, this is NOT New York Pizza and the original in Brooklyn should be embarrassed that it bears their namesake. Ordered pizza for pickup. Arrived, got my pizza and went to my car. Opened the box to double check it and it was all sauce, with a very minute amount of \"mozzarella,\" which felt like rubber. I tasted the sauce that was on my finger from when I touched the cheese... horrifically BLAND. What happened next was worse than the bland sauce. I spoke to a very kind hostess, and asked for cheese to be added. She obliged and said they would remake it. The manager, who I see several other people have had issues with, came over and was extremely condescending. He explained that it's because they put sauce on top of the cheese... ok so why was there no cheese under the sauce then either hunny? Why he felt the need to explain to me why my pizza had no cheese is beyond me, especially when the situation had already been rectified. I then asked if that was also why there were burns on the top as well, and he found it amusing and stated \"it's only one burnt bubble...\" (It was waaaay more than one, but ok). Why is ANY PART OF MY FOOD BURNT SIR?! He then felt the need to explain how a coal brick oven works... I'm from Brooklyn, I've had PLENTY of pizza that is cooked this way, like for example, at Grimaldi's in BROOKLYN. When done properly, it doesn't come out BURNT on ANY PART of it. Anyway, I went from wanting cheese to wanting my money back, simply because of the manager's attitude. Which btw my refund was incorrect, but I wanted to leave so badly that I didn't even address that part. THEN he sarcastically offered me a free pizza, after I requested my money back, and when I declined he condescendingly gave me a $25 gift card and his business card. Sweetheart, I wanted cheese not a free meal, which your hostess had already taken care of before your snarky attitude disrupted our peaceful convo, get your life together. This immediately escalated from me allowing this business a chance to create a long time loyal patron and just getting CHEESE, to wanting to never set foot in this place again. I assume by his smug demeanor that he is accustomed to treating his patrons this way. Anyway, I found a homeless person and gave him the gift card. I can only hope the homeless man wasn't offended by me giving him a gift card for this disgusting place.","date":"2017-07-18 00:16:16"}
{"review_id":"yzgx106UX9OlyBh0tq2G0g","user_id":"9XT2LHohnC8v0T1H4Jxs2Q","business_id":"ZF0tt7hn6WK3-aNWgtLcFA","stars":5.0,"useful":0,"funny":0,"cool":0,"text":"Great place. Interesting to see and learn the history about it. Can get some really cool pictures. Been here a few times and will keep coming back when we're in the area.","date":"2016-08-01 22:08:07"}
Even if I try to cut the file, steering well clear of those records...
head -3500000 yelp_academic_dataset_review.json > head.json
tail -1000000 yelp_academic_dataset_review.json > tail.json
cat head.json tail.json > try.json
We still get an error from reading 4.5 million records...
>>> pl.read_json('/tmp/try.json', json_lines=True)
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: OutOfSpec("offsets must not exceed the values length")', /github/home/.cargo/git/checkouts/arrow2-8a2ad61d97265680/c720eb2/src/array/growable/utf8.rs:70:14
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/corey/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/io.py", line 917, in read_json
return DataFrame._read_json(source, json_lines)
File "/home/corey/.virtualenvs/StackOverflow3.10/lib/python3.10/site-packages/polars/internals/frame.py", line 818, in _read_json
self._df = PyDataFrame.read_json(file, json_lines)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: OutOfSpec("offsets must not exceed the values length")
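One way to sanity-check that the failure is about cumulative size rather than any particular record is to total up the text bytes line by line. Here is a rough sketch of such a check (my own, not part of the original investigation; it assumes the 2 GB-of-text-per-column limit mentioned in the update above, and it only counts the text field, so the numbers are approximate):
import json

# Running total of UTF-8 bytes in the "text" field; Arrow string columns with
# 32-bit offsets top out at 2**31 - 1 bytes (~2 GiB).
total = 0
with open("/tmp/yelp_academic_dataset_review.json") as f:
    for i, line in enumerate(f, start=1):
        total += len(json.loads(line)["text"].encode("utf-8"))
        if i in (3_785_593, 3_785_594):
            print(i, total, total > 2**31 - 1)
If the hypothesis holds, the total should cross the 2 GiB mark right around line 3,785,594, which would explain why no individual record looks wrong.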
Workaround
If you cut the input file into smaller slices, use read_json on the smaller slices, and concatenate the result, you'll get your DataFrame.
I'll simulate that on my machine as follows. (You can cut your file into larger pieces than 1 million records. I just chose that as an easy number.)
import polars as pl
from io import StringIO
with open("/tmp/yelp_academic_dataset_review.json") as json_file:
    file_lines = json_file.readlines()
slice_size = 1_000_000
df = pl.concat(
    [
        pl.read_json(
            StringIO("".join(file_lines[offset: (offset + slice_size)])),
            json_lines=True,
        )
        for offset in range(0, len(file_lines), slice_size)
    ]
)
df
shape: (6990280, 9)
┌────────────────────────┬──────┬─────────────────────┬───────┬─────┬───────┬─────────────────────────────────────┬────────┬────────────────────────┐
│ business_id ┆ cool ┆ date ┆ funny ┆ ... ┆ stars ┆ text ┆ useful ┆ user_id │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ str ┆ i64 ┆ ┆ f64 ┆ str ┆ i64 ┆ str │
╞════════════════════════╪══════╪═════════════════════╪═══════╪═════╪═══════╪═════════════════════════════════════╪════════╪════════════════════════╡
│ XQfwVwDr-v0ZS3_CbbE5Xw ┆ 0 ┆ 2018-07-07 22:09:11 ┆ 0 ┆ ... ┆ 3.0 ┆ If you decide to eat here, just ... ┆ 0 ┆ mh_-eMZ6K5RLWhZyISBhwA │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 7ATYjTIgM3jUlt4UM3IypQ ┆ 1 ┆ 2012-01-03 15:28:18 ┆ 0 ┆ ... ┆ 5.0 ┆ I've taken a lot of spin classes... ┆ 1 ┆ OyoGAe7OKpv6SyGZT5g77Q │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ YjUWPpI6HXG530lwP-fb2A ┆ 0 ┆ 2014-02-05 20:30:30 ┆ 0 ┆ ... ┆ 3.0 ┆ Family diner. Had the buffet. Ec... ┆ 0 ┆ 8g_iMtfSiwikVnbP2etR0A │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ kxX2SOes4o-D3ZQBkiMRfA ┆ 1 ┆ 2015-01-04 00:01:03 ┆ 0 ┆ ... ┆ 5.0 ┆ Wow! Yummy, different, delicio... ┆ 1 ┆ _7bHUi9Uuf5__HHc_Q8guQ │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2vLksaMmSEcGbjI5gywpZA ┆ 2 ┆ 2021-03-31 16:55:10 ┆ 1 ┆ ... ┆ 5.0 ┆ This spot offers a great, afford... ┆ 2 ┆ Zo0th2m8Ez4gLSbHftiQvg │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ R1khUUxidqfaJmcpmGd4aw ┆ 0 ┆ 2019-12-30 03:56:30 ┆ 0 ┆ ... ┆ 4.0 ┆ This Home Depot won me over when... ┆ 1 ┆ mm6E4FbCMwJmb7kPDZ5v2Q │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Rr9kKArrMhSLVE9a53q-aA ┆ 0 ┆ 2022-01-19 18:59:27 ┆ 0 ┆ ... ┆ 5.0 ┆ For when I'm feeling like ignori... ┆ 1 ┆ YwAMC-jvZ1fvEUum6QkEkw │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ VAeEXLbEcI9Emt9KGYq9aA ┆ 7 ┆ 2018-01-02 22:50:47 ┆ 3 ┆ ... ┆ 3.0 ┆ Located in the 'Walking District... ┆ 10 ┆ 6JehEvdoCvZPJ_XIxnzIIw │
└────────────────────────┴──────┴─────────────────────┴───────┴─────┴───────┴─────────────────────────────────────┴────────┴────────────────────────┘
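If you would rather not hold the whole file in a Python list before slicing, the same idea works streaming from the file handle. A sketch (the same slicing workaround, just pulling fixed-size chunks of lines with itertools.islice):
import polars as pl
from io import StringIO
from itertools import islice

slice_size = 1_000_000
frames = []
with open("/tmp/yelp_academic_dataset_review.json") as json_file:
    while True:
        # Grab up to slice_size lines at a time straight off the file handle.
        chunk = list(islice(json_file, slice_size))
        if not chunk:
            break
        frames.append(pl.read_json(StringIO("".join(chunk)), json_lines=True))
df = pl.concat(frames)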

Related

Given a data frame with n columns of numbers, how could you calculate the Pearson correlation of all column-pair combinations?

Let's say I have a Polars data frame like this:
=> shape: (19, 5)
┌───────────────┬─────────┬───────────┬───────────┬──────────┐
│ date ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═══════════╪═══════════╪══════════╡
│ 1674777600000 ┆ 51.39 ┆ 12.84 ┆ 50.0799 ┆ 16.535 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674691200000 ┆ 52.43 ┆ 13.14 ┆ 49.84 ┆ 16.54 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674604800000 ┆ 51.87 ┆ 12.88 ┆ 49.75 ┆ 15.97 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1674518400000 ┆ 51.22 ┆ 12.81 ┆ 50.1 ┆ 16.01 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672876800000 ┆ 45.3 ┆ 12.7 ┆ 47.185 ┆ 13.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672790400000 ┆ 44.77 ┆ 12.355 ┆ 47.32 ┆ 12.86 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672704000000 ┆ 45.77 ┆ 12.91 ┆ 47.84 ┆ 12.91 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1672358400000 ┆ 46.01 ┆ 12.57 ┆ 47.29 ┆ 12.55 │
└───────────────┴─────────┴───────────┴───────────┴──────────┘
I'm looking to calculate the Pearson correlation between each pair-combination of all columns (except the date one). The result would look something like this:
=> shape: (5, 5)
┌───────────────┬─────────┬───────────┬───────────┬──────────┐
│ symbol ┆ open_AA ┆ open_AADI ┆ open_AADR ┆ open_AAL │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ utf8 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═══════════════╪═════════╪═══════════╪═══════════╪══════════╡
│ open_AA ┆ 1 ┆ 1 ┆ .1 ┆ -.5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AADI ┆ .2 ┆ 1 ┆ .2 ┆ .4 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AADR ┆ .4 ┆ .2 ┆ 1 ┆ .3 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ open_AAL. ┆ -.45 ┆ -.6 ┆ 50.1 ┆ 1 │
└───────────────┴─────────┴───────────┴───────────┴──────────┘
My hunch is that I need to do the following:
Get the cartesian product of columns [1..] as a new data frame.
Using Polars expressions, calculate the pearson_corr of each series pair.
I'm new to Polars and am having trouble with the syntax. Can anyone point me in the right direction?
Say you start with:
df = pl.DataFrame({"date":[5,6,7],"foo": [1, 3, 9], "bar": [4, 1, 3], "ham": [2, 18, 9]})
You want to exclude some cols, so let's put those in a variable
excl_cols=['date']
Then...
(
    df.drop(excl_cols)   # use drop to exclude the date column (or whatever columns you don't want)
    .pearson_corr()      # this is the meat and potatoes of the request, but it's missing your symbol column on the left
    .select(
        [
            pl.Series(df.drop(excl_cols).columns).alias('symbol'),  # a Series of the column names, to become its own column
            pl.all(),    # then just every other column
        ]
    )
)
shape: (3, 4)
┌────────┬───────────┬───────────┬───────────┐
│ symbol ┆ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ f64 ┆ f64 ┆ f64 │
╞════════╪═══════════╪═══════════╪═══════════╡
│ foo ┆ 1.0 ┆ -0.052414 ┆ 0.169695 │
│ bar ┆ -0.052414 ┆ 1.0 ┆ -0.993036 │
│ ham ┆ 0.169695 ┆ -0.993036 ┆ 1.0 │
└────────┴───────────┴───────────┴───────────┘
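If you do want the pair-by-pair route from your hunch, you can also build one correlation expression per column pair instead of using DataFrame.pearson_corr. A sketch (assuming a Polars version that exposes pl.pearson_corr; newer releases spell it pl.corr):
from itertools import combinations

import polars as pl

cols = [c for c in df.columns if c != "date"]

# One Pearson correlation expression per unordered column pair.
pairwise = df.select(
    [
        pl.pearson_corr(a, b).alias(f"{a}|{b}")
        for a, b in combinations(cols, 2)
    ]
)
This gives you a single row of pairwise correlations, which you can then reshape however you like; the pearson_corr()-on-the-frame approach above is simpler if you want the full square matrix.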
Use DataFrame.pearson_corr
In [9]: df.drop('date').pearson_corr()
Out[9]:
shape: (2, 2)
┌─────────┬───────────┐
│ open_AA ┆ open_AADI │
│ --- ┆ --- │
│ f64 ┆ f64 │
╞═════════╪═══════════╡
│ 1.0 ┆ 1.0 │
│ 1.0 ┆ 1.0 │
└─────────┴───────────┘

Efficient way to rename columns from pivot

Currently, pivot joins the name of the "values" column and the value from the "columns" column into the new column name, using an underscore. Example from the data below: new column name = "monthly_qty" + "_" + "product_a".
>>> data = pl.DataFrame({"month":["Jan", "Jan", "Feb", "Feb", "Mar", "Mar"], "type":["product_a", "product_b"]*3, "monthly_qty":[10,20]*3, "monthly_amt":[5., 8.]*3})
>>> data
shape: (6, 4)
┌───────┬───────────┬─────────────┬─────────────┐
│ month ┆ type ┆ monthly_qty ┆ monthly_amt │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ str ┆ i64 ┆ f64 │
╞═══════╪═══════════╪═════════════╪═════════════╡
│ Jan ┆ product_a ┆ 10 ┆ 5.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Jan ┆ product_b ┆ 20 ┆ 8.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Feb ┆ product_a ┆ 10 ┆ 5.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Feb ┆ product_b ┆ 20 ┆ 8.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mar ┆ product_a ┆ 10 ┆ 5.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mar ┆ product_b ┆ 20 ┆ 8.0 │
└───────┴───────────┴─────────────┴─────────────┘
>>> data = data.pivot(index="month", columns="type", values=["monthly_qty", "monthly_amt"])
>>> data
shape: (3, 5)
┌───────┬───────────────────────┬───────────────────────┬───────────────────────┬───────────────────────┐
│ month ┆ monthly_qty_product_a ┆ monthly_qty_product_b ┆ monthly_amt_product_a ┆ monthly_amt_product_b │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ f64 │
╞═══════╪═══════════════════════╪═══════════════════════╪═══════════════════════╪═══════════════════════╡
│ Jan ┆ 10 ┆ 20 ┆ 5.0 ┆ 8.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Feb ┆ 10 ┆ 20 ┆ 5.0 ┆ 8.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mar ┆ 10 ┆ 20 ┆ 5.0 ┆ 8.0 │
└───────┴───────────────────────┴───────────────────────┴───────────────────────┴───────────────────────┘
I wish to rename the columns as below, but I'm not sure what the most efficient way is.
old column = "monthly_qty_product_a"
new_column = "product_a:monthly_qty"
This is what I can think of for now, provided that the number of underscores is fixed.
>>> new_cols = {col: col if col == "month" else f"{'_'.join(col.split('_')[2:])}:{'_'.join(col.split('_')[0:2])}" for col in data.columns}
>>> data.rename(new_cols)
shape: (3, 5)
┌───────┬───────────────────────┬───────────────────────┬───────────────────────┬───────────────────────┐
│ month ┆ product_a:monthly_qty ┆ product_b:monthly_qty ┆ product_a:monthly_amt ┆ product_b:monthly_amt │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ i64 ┆ f64 ┆ f64 │
╞═══════╪═══════════════════════╪═══════════════════════╪═══════════════════════╪═══════════════════════╡
│ Jan ┆ 10 ┆ 20 ┆ 5.0 ┆ 8.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Feb ┆ 10 ┆ 20 ┆ 5.0 ┆ 8.0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ Mar ┆ 10 ┆ 20 ┆ 5.0 ┆ 8.0 │
└───────┴───────────────────────┴───────────────────────┴───────────────────────┴───────────────────────┘
This will not work if the value column has more than one underscore, e.g. "monthly_growth_pct".
Is there a better way of doing this? Any advice is much appreciated
Thanks!
There is no way in DataFrame.pivot to control this naming.
I would suggest modifying your long-format dataframe (6 x 4) a bit by renaming the column monthly_qty to monthly_qty<CHAR>, where <CHAR> is a character you are quite sure is not present, for example !:
data = data.rename({"monthly_qty":"monthly_qty!"})
Proceed with the pivot, and then split on ! in your renaming logic.
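Putting that together on the example data, a sketch might look like the following (starting again from the original 6 x 4 long-format frame, renaming both value columns with the sentinel, then splitting; the names long_df/wide_df are just illustrative, and ! is assumed never to appear in a value-column name):
sentinel = "!"

# Rename the value columns so the sentinel marks where the value name ends.
long_df = data.rename({"monthly_qty": "monthly_qty!", "monthly_amt": "monthly_amt!"})

wide_df = long_df.pivot(
    index="month", columns="type", values=["monthly_qty!", "monthly_amt!"]
)

# 'monthly_qty!_product_a' -> 'product_a:monthly_qty'
new_cols = {}
for col in wide_df.columns:
    if sentinel in col:
        value_part, type_part = col.split(sentinel + "_", 1)
        new_cols[col] = f"{type_part}:{value_part}"
wide_df = wide_df.rename(new_cols)
Because the split happens on the sentinel rather than on underscores, value columns like "monthly_growth_pct" no longer cause trouble.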

Replacing a pivot with a lazy groupby operation

I'm pivoting a rather large dataframe of shape (10_000_000, 678) into one of approx. shape (770_000, 8_789) to create a dataset for an ML algorithm. It's a relatively slow operation, taking about half an hour on a high-RAM cluster I am using, and I'm wondering if there is a way to speed it up. Here is a minimal example, with a larger one below:
import polars as pl
import numpy as np
data = {
"id": [1,1,1,2,2,2,3,3,3],
"rank": [1,2,3,1,2,3,1,2,3], # rank is always repeating 1-3 (or 0-12 in large example)
"A": np.random.random((9)),
"B": np.random.random((9)),
}
df = pl.DataFrame(data)
df_pivot = df.pivot(values=["A", "B"], index="id", columns="rank")
# Now rename columns, since they are currently:
# df_pivot.columns
# ['id', '1', '2', '3', '1', '2', '3']
ranks = [1,2,3]
renamed_columns = df_pivot.columns[:1]
for col in df.columns[2:]:
for rank in ranks:
renamed_columns.append(f"{col}_{rank}")
df_pivot.columns = renamed_columns
# df_pivot
shape: (3, 7)
┌─────┬──────────┬──────────┬──────────┬──────────┬──────────┬──────────┐
│ id ┆ A_1 ┆ A_2 ┆ A_3 ┆ B_1 ┆ B_2 ┆ B_3 │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪══════════╪══════════╪══════════╪══════════╪══════════╪══════════╡
│ 1 ┆ 0.867957 ┆ 0.854234 ┆ 0.408062 ┆ 0.076254 ┆ 0.899092 ┆ 0.059019 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0.642296 ┆ 0.670476 ┆ 0.480494 ┆ 0.4254 ┆ 0.536173 ┆ 0.492312 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 0.778481 ┆ 0.151697 ┆ 0.330138 ┆ 0.6661 ┆ 0.4086 ┆ 0.992057 │
└─────┴──────────┴──────────┴──────────┴──────────┴──────────┴──────────┘
The Polars pivot code states in a comment:
Polars lazy does not implement a pivot because it is impossible to know the schema without materializing the whole dataset. This makes a pivot quite a terrible operation for performant workflows. An optimization can never be pushed down passed a pivot.
And in the groupby.pivot code:
Polars'/arrow memory is not ideal for transposing operations like pivots. If you have a relatively large table, consider using a groupby over a pivot.
Some questions:
Is it possible to replace the above pivot example by a (preferably lazy) combination of groupby and something else? This SO post about pandas suggests an equivalency of groupby + "unstack" with pivot. Polars does not implement an unstack function, afaik.
Is the above suggestion more performant than the current pivot implementation? (See the larger example below).
I actually do know the schema ahead of time, since in my situation rank is a known series ([1, 2, 3] in the example). If implemented, would a lazy pivot where one can supply the schema be more performant than the eager one?
Should I be implementing it differently?
# Much larger example, but with 10_000 rows instead of 10_000_000
# 10_000 runs in 3 seconds, 100_000 runs in 40 seconds (M1 macbook)
from string import ascii_lowercase
import polars as pl
import numpy as np
ranks = np.arange(13)
N_ROWS = 10_000 # this could be ~10_000_000
df = (pl.DataFrame({"ID": np.arange(N_ROWS)})).join(
pl.DataFrame({"rank": ranks}), how="cross"
)
# create 26**2 dummy column names
column_names = []
for letter1 in ascii_lowercase:
for letter2 in ascii_lowercase:
column_names.append(letter1 + letter2)
# stack frames to create: ID, ranks, aa, ab, ..., zz
df = df.hstack(
pl.DataFrame({letter: np.random.random(len(df)) for letter in column_names})
)
df_pivot = df.pivot(values=df.columns[2:], index="ID", columns="rank")
renamed_columns = df_pivot.columns[:1]
for col in df.columns[2:]:
for rank in ranks:
renamed_columns.append(f"{col}_{rank}")
df_pivot.columns = renamed_columns
How about a non-lazy solution that brings your wall-clock time on the much larger example with N_ROWS = 1_000_000 from over 7 minutes down to around ... 10 seconds?
The Algorithm
I actually do know the schema ahead of time, since in my situation rank is a known series ([1, 2, 3] in the example). If implemented, would a lazy pivot where one can supply the schema be more performant than the eager one?
We're going to take advantage of the structure of the data. We'll re-sort the data strategically, and use slice on each series. (Slices are nearly free.)
I've also added an ID column with dtype Int64 so that we can use frame_equal to compare the results of this algorithm to the output of the pivot code from the example.
Note that the algorithm is not in Lazy mode.
ser_slices = [
    s.slice(rank * N_ROWS, N_ROWS).alias(s.name + "_" + str(rank))
    for s in df.sort(["rank", "ID"])[:, 2:]
    for rank in range(0, 13)
]
result = (
    pl.DataFrame(ser_slices)
    .with_row_count('ID')
    .with_column(
        pl.col('ID').cast(pl.Int64)
    )
)
Performance Comparison
Let's compare the performance and output of the algorithm above with the pivot code in your example.
We'll use your larger example with N_ROWS = 1_000_000.
The Algorithm Above (Slices in Eager Mode)
If you watch the performance of your CPU on this algorithm (e.g., in top on Linux), you'll notice that the algorithm runs heavily in parallel.
import time
start = time.perf_counter()
ser_slices = [
    s.slice(rank * N_ROWS, N_ROWS).alias(s.name + "_" + str(rank))
    for s in df.sort(["rank", "ID"])[:, 2:]
    for rank in range(0, 13)
]
result = (
    pl.DataFrame(ser_slices)
    .with_row_count('ID')
    .with_column(
        pl.col('ID').cast(pl.Int64)
    )
)
result
print(time.perf_counter() - start)
shape: (1000000, 8789)
┌────────┬──────────┬──────────┬──────────┬─────┬──────────┬──────────┬──────────┬──────────┐
│ ID ┆ aa_0 ┆ aa_1 ┆ aa_2 ┆ ... ┆ zz_9 ┆ zz_10 ┆ zz_11 ┆ zz_12 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════╪══════════╪══════════╪══════════╪═════╪══════════╪══════════╪══════════╪══════════╡
│ 0 ┆ 0.702774 ┆ 0.250239 ┆ 0.023121 ┆ ... ┆ 0.348179 ┆ 0.530304 ┆ 0.380147 ┆ 0.194915 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0.184479 ┆ 0.562245 ┆ 0.038145 ┆ ... ┆ 0.575752 ┆ 0.254793 ┆ 0.126996 ┆ 0.557823 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0.432553 ┆ 0.111145 ┆ 0.937674 ┆ ... ┆ 0.493157 ┆ 0.843966 ┆ 0.6257 ┆ 0.044151 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 0.607535 ┆ 0.389257 ┆ 0.864887 ┆ ... ┆ 0.765563 ┆ 0.312805 ┆ 0.085054 ┆ 0.4972 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 999996 ┆ 0.101384 ┆ 0.918382 ┆ 0.024 ┆ ... ┆ 0.643435 ┆ 0.905557 ┆ 0.8266 ┆ 0.460866 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 999997 ┆ 0.164607 ┆ 0.766515 ┆ 0.565382 ┆ ... ┆ 0.493534 ┆ 0.595359 ┆ 0.601306 ┆ 0.637546 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 999998 ┆ 0.213503 ┆ 0.874676 ┆ 0.165461 ┆ ... ┆ 0.676855 ┆ 0.730082 ┆ 0.9647 ┆ 0.710811 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 999999 ┆ 0.246028 ┆ 0.963617 ┆ 0.065186 ┆ ... ┆ 0.1091 ┆ 0.913634 ┆ 0.425842 ┆ 0.715304 │
└────────┴──────────┴──────────┴──────────┴─────┴──────────┴──────────┴──────────┴──────────┘
>>> print(time.perf_counter() - start)
10.33561857099994
Roughly 10 seconds. Not bad.
Pivot (from the example code)
If you watch your CPU monitor, you'll notice that the pivot code is largely single-threaded.
import time
start = time.perf_counter()
df_pivot = df.pivot(values=df.columns[2:], index="ID", columns="rank")
renamed_columns = df_pivot.columns[:1]
for col in df.columns[2:]:
    for rank in ranks:
        renamed_columns.append(f"{col}_{rank}")
df_pivot.columns = renamed_columns
df_pivot
print(time.perf_counter() - start)
shape: (1000000, 8789)
┌────────┬──────────┬──────────┬──────────┬─────┬──────────┬──────────┬──────────┬──────────┐
│ ID ┆ aa_0 ┆ aa_1 ┆ aa_2 ┆ ... ┆ zz_9 ┆ zz_10 ┆ zz_11 ┆ zz_12 │
│ --- ┆ --- ┆ --- ┆ --- ┆ ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ f64 ┆ f64 ┆ ┆ f64 ┆ f64 ┆ f64 ┆ f64 │
╞════════╪══════════╪══════════╪══════════╪═════╪══════════╪══════════╪══════════╪══════════╡
│ 0 ┆ 0.702774 ┆ 0.250239 ┆ 0.023121 ┆ ... ┆ 0.348179 ┆ 0.530304 ┆ 0.380147 ┆ 0.194915 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 0.184479 ┆ 0.562245 ┆ 0.038145 ┆ ... ┆ 0.575752 ┆ 0.254793 ┆ 0.126996 ┆ 0.557823 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 0.432553 ┆ 0.111145 ┆ 0.937674 ┆ ... ┆ 0.493157 ┆ 0.843966 ┆ 0.6257 ┆ 0.044151 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 0.607535 ┆ 0.389257 ┆ 0.864887 ┆ ... ┆ 0.765563 ┆ 0.312805 ┆ 0.085054 ┆ 0.4972 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 999996 ┆ 0.101384 ┆ 0.918382 ┆ 0.024 ┆ ... ┆ 0.643435 ┆ 0.905557 ┆ 0.8266 ┆ 0.460866 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 999997 ┆ 0.164607 ┆ 0.766515 ┆ 0.565382 ┆ ... ┆ 0.493534 ┆ 0.595359 ┆ 0.601306 ┆ 0.637546 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 999998 ┆ 0.213503 ┆ 0.874676 ┆ 0.165461 ┆ ... ┆ 0.676855 ┆ 0.730082 ┆ 0.9647 ┆ 0.710811 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 999999 ┆ 0.246028 ┆ 0.963617 ┆ 0.065186 ┆ ... ┆ 0.1091 ┆ 0.913634 ┆ 0.425842 ┆ 0.715304 │
└────────┴──────────┴──────────┴──────────┴─────┴──────────┴──────────┴──────────┴──────────┘
>>> print(time.perf_counter() - start)
442.1277434679996
Over 7 minutes. (Given that, I didn't bother to time the two with N_ROWS = 10_000_000)
Comparison of the output
Do they produce the same result?
>>> result.frame_equal(df_pivot)
True

Polars is throwing an error when I convert from eager to lazy execution

This code works and returns the expected result.
import polars as pl
df = pl.DataFrame({
    'A': [1, 2, 3, 3, 2, 1],
    'B': [1, 1, 1, 2, 2, 2],
})
(df
    # .lazy()
    .groupby('B')
    .apply(lambda x: x
        .with_columns(
            [pl.col("A").shift(i).alias(f"A_lag_{i}") for i in range(3)]
        )
    )
    .with_columns(
        [pl.col(f'A_lag_{i}') / pl.col('A') for i in range(3)]
    )
    # .collect()
)
However, if you uncomment the .lazy() and .collect() lines, you get NotFoundError: f'A_lag_0.
I've tried a few versions of this code, but I can't entirely tell whether I'm doing something wrong or whether this is a bug in Polars.
This doesn't address the error that you are receiving, but the more idiomatic way to express this in Polars is to use the over expression. For example:
(
    df
    .lazy()
    .with_columns([
        pl.col("A").shift(i).over('B').alias(f"A_lag_{i}")
        for i in range(3)])
    .with_columns([
        (pl.col(f"A_lag_{i}") / pl.col("A")).suffix('_result')
        for i in range(3)])
    .collect()
)
shape: (6, 8)
┌─────┬─────┬─────────┬─────────┬─────────┬────────────────┬────────────────┬────────────────┐
│ A ┆ B ┆ A_lag_0 ┆ A_lag_1 ┆ A_lag_2 ┆ A_lag_0_result ┆ A_lag_1_result ┆ A_lag_2_result │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 ┆ f64 ┆ f64 ┆ f64 │
╞═════╪═════╪═════════╪═════════╪═════════╪════════════════╪════════════════╪════════════════╡
│ 1 ┆ 1 ┆ 1 ┆ null ┆ null ┆ 1.0 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 1 ┆ 2 ┆ 1 ┆ null ┆ 1.0 ┆ 0.5 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 3 ┆ 2 ┆ 1 ┆ 1.0 ┆ 0.666667 ┆ 0.333333 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2 ┆ 3 ┆ null ┆ null ┆ 1.0 ┆ null ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 2 ┆ 3 ┆ null ┆ 1.0 ┆ 1.5 ┆ null │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2 ┆ 1 ┆ 2 ┆ 3 ┆ 1.0 ┆ 2.0 ┆ 3.0 │
└─────┴─────┴─────────┴─────────┴─────────┴────────────────┴────────────────┴────────────────┘

Fast apply of a function to Polars Dataframe

What are the fastest ways to apply functions to polars DataFrames - pl.DataFrame or pl.internals.lazy_frame.LazyFrame? This question is piggy-backing off Apply Function to all columns of a Polars-DataFrame
I am trying to concat all columns and hash the value using hashlib in python standard library. The function I am using is below:
import hashlib
import os

def hash_row(row):
    os.environ['PYTHONHASHSEED'] = "0"
    row = str(row).encode('utf-8')
    return hashlib.sha256(row).hexdigest()
However, given that this function requires a string as input, it needs to be applied to every cell within a pl.Series. Working with a small amount of data should be okay, but when we have closer to 100m rows this becomes very problematic. The question for this thread is: how can we apply such a function in the most performant way across an entire Polars Series?
Pandas
Offers a few options to create new columns, and some are more performant than others.
df['new_col'] = df['some_col'] * 100 # vectorized calls
Another option is to create custom functions for row-wise operations.
def apply_func(row):
    return row['some_col'] + row['another_col']
df['new_col'] = df.apply(lambda row: apply_func(row), axis=1) # using apply operations
From my experience, the fastest way is to create numpy vectorized solutions.
import numpy as np
def np_func(some_col, another_col):
    return some_col + another_col
vec_func = np.vectorize(np_func)
df['new_col'] = vec_func(df['some_col'].values, df['another_col'].values)
Polars
What is the best solution for Polars?
Let's start with this data of various types:
import polars as pl
df = pl.DataFrame(
    {
        "col_int": [1, 2, 3, 4],
        "col_float": [10.0, 20, 30, 40],
        "col_bool": [True, False, True, False],
        "col_str": pl.repeat("2020-01-01", 4, eager=True),
    }
).with_column(pl.col("col_str").str.strptime(pl.Date).alias("col_date"))
df
shape: (4, 5)
┌─────────┬───────────┬──────────┬────────────┬────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date │
╞═════════╪═══════════╪══════════╪════════════╪════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 │
└─────────┴───────────┴──────────┴────────────┴────────────┘
Polars: DataFrame.hash_rows
I should first point out that Polars itself has a hash_rows function that will hash the rows of a DataFrame, without first needing to cast each column to a string.
df.hash_rows()
shape: (4,)
Series: '' [u64]
[
16206777682454905786
7386261536140378310
3777361287274669406
675264002871508281
]
If you find this acceptable, then this would be the most performant solution. You can cast the resulting unsigned int to a string if you need to. Note: hash_rows is available only on a DataFrame, not a LazyFrame.
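For example, attaching the row hashes as a string column might look like this (a sketch; hash_rows returns a UInt64 Series, which with_column accepts directly):
df = df.with_column(
    df.hash_rows().cast(pl.Utf8).alias("row_hash")
)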
Using polars.concat_str and apply
If you need to use your own hashing solution, then I recommend using the polars.concat_str function to concatenate the values in each row to a string. From the documentation:
polars.concat_str(exprs: Union[Sequence[Union[polars.internals.expr.Expr, str]], polars.internals.expr.Expr], sep: str = '') → polars.internals.expr.Expr
Horizontally Concat Utf8 Series in linear time. Non utf8 columns are cast to utf8.
So, for example, here is the resulting concatenation on our dataset.
df.with_column(
    pl.concat_str(pl.all()).alias('concatenated_cols')
)
shape: (4, 6)
┌─────────┬───────────┬──────────┬────────────┬────────────┬────────────────────────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date ┆ concatenated_cols │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date ┆ str │
╞═════════╪═══════════╪══════════╪════════════╪════════════╪════════════════════════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 110.0true2020-01-012020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ 220.0false2020-01-012020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 330.0true2020-01-012020-01-01 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ 440.0false2020-01-012020-01-01 │
└─────────┴───────────┴──────────┴────────────┴────────────┴────────────────────────────────┘
Taking the next step and using the apply method and your function would yield:
df.with_column(
    pl.concat_str(pl.all()).apply(hash_row).alias('hash')
)
shape: (4, 6)
┌─────────┬───────────┬──────────┬────────────┬────────────┬─────────────────────────────────────┐
│ col_int ┆ col_float ┆ col_bool ┆ col_str ┆ col_date ┆ hash │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ bool ┆ str ┆ date ┆ str │
╞═════════╪═══════════╪══════════╪════════════╪════════════╪═════════════════════════════════════╡
│ 1 ┆ 10.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ 1826eb9c6aeb0abcdd2999a76eee576e... │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ ea50f5b11957bfc92b5ab7545b3ac12c... │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 30.0 ┆ true ┆ 2020-01-01 ┆ 2020-01-01 ┆ eef039d8dedadcc282d6fa9473e071e8... │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 4 ┆ 40.0 ┆ false ┆ 2020-01-01 ┆ 2020-01-01 ┆ dcc5c57e0b5fdf15320a84c6839b0e3d... │
└─────────┴───────────┴──────────┴────────────┴────────────┴─────────────────────────────────────┘
Please remember that any time Polars calls external libraries or runs Python bytecode, you are subject to the Python GIL, which means single-threaded performance - no matter how you code it. From the Polars Cookbook section Do Not Kill The Parallelization!:
We have all heard that Python is slow, and does "not scale." Besides the overhead of running "slow" bytecode, Python has to remain within the constraints of the Global Interpreter Lock (GIL). This means that if you were to use a lambda or a custom Python function to apply during a parallelized phase, Polars speed is capped running Python code preventing any multiple threads from executing the function.
This all feels terribly limiting, especially because we often need those lambda functions in a .groupby() step, for example. This approach is still supported by Polars, but keeping in mind bytecode and the GIL costs have to be paid.
To mitigate this, Polars implements a powerful syntax defined not only in its lazy API, but also in its eager API.
Thanks cbilot - I was unaware of hash_rows. Your solution is nearly identical to what I wrote. One thing I have to mention, though:
concat_str did not work for me when there are nulls in the series, so I had to cast to Utf8 and fill_null first. Then I was able to concat_str and apply hash_row on the result.
def set_datatypes_and_replace_nulls(df, idcol="factset_person_id"):
    return (
        df
        .with_columns([
            pl.col("*").cast(pl.Utf8, strict=False),
            pl.col("*").fill_null(pl.lit(""))
        ])
        .with_columns([
            pl.concat_str(pl.col("*").exclude(exclude_cols)).alias('concatstr'),
        ])
    )

def hash_concat(df):
    return (
        df
        .with_columns([
            pl.col("concatstr").apply(hash_row).alias('sha256hash')
        ])
    )
After this we need to aggregate the hashes by ID.
df = (
    df
    .pipe(set_datatypes_and_replace_nulls)
    .pipe(hash_concat)
)
# something like the below...
part1 = (
    df.lazy()
    .groupby("id")
    .agg(
        [
            pl.col("concatstr").unique().list(),
        ]
    )
)
Thanks for improving with pl.hash_rows.