Is there a good way to do `zfill` in polars? - python-polars

Is it proper to use pl.Expr.apply to throw the python function zfill at my data? I'm not looking for a performant solution.
pl.col("column").apply(lambda x: str(x).zfill(5))
Is there a better way to do this?
And to follow up I'd love to chat about what a good implementation could look like in the discord if you have some insight (assuming one doesn't currently exist).

Edit: Polars 0.13.43 and later
With version 0.13.43 and later, Polars has a str.zfill expression to accomplish this. str.zfill will be faster than the approach below and should be preferred.
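For reference, a minimal sketch of what that can look like (assuming you start from integers, which need a cast to Utf8 first; the column name here is made up):
import polars as pl
df = pl.DataFrame({"num": [1, 10, 100, None]})
# str.zfill pads string data, so cast the integer column to Utf8 first
df.with_column(pl.col("num").cast(pl.Utf8).str.zfill(5).alias("result"))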
Prior to Polars 0.13.43
From your question, I'm assuming that you are starting with a column of integers.
lambda x: str(x).zfill(5)
If so, here's an approach that adheres rather strictly to the behavior of pandas' zfill:
import polars as pl
df = pl.DataFrame({"num": [-10, -1, 0, 1, 10, 100, 1000, 10000, 100000, 1000000, None]})
z = 5
df.with_column(
    pl.when(pl.col("num").cast(pl.Utf8).str.lengths() > z)
    .then(pl.col("num").cast(pl.Utf8))
    .otherwise(pl.concat_str([pl.lit("0" * z), pl.col("num").cast(pl.Utf8)]).str.slice(-z))
    .alias("result")
)
shape: (11, 2)
┌─────────┬─────────┐
│ num ┆ result │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════════╪═════════╡
│ -10 ┆ 00-10 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ -1 ┆ 000-1 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 00000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 00001 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10 ┆ 00010 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 100 ┆ 00100 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1000 ┆ 01000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10000 ┆ 10000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 100000 ┆ 100000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 1000000 ┆ 1000000 │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ null ┆ null │
└─────────┴─────────┘
Comparing the output to pandas:
df.with_column(pl.col('num').cast(pl.Utf8)).get_column('num').to_pandas().str.zfill(z)
0 00-10
1 000-1
2 00000
3 00001
4 00010
5 00100
6 01000
7 10000
8 100000
9 1000000
10 None
dtype: object
If you are starting with strings, then you can simplify the code by getting rid of any calls to cast.
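For instance, if the column already holds strings, the same logic could be wrapped in a small helper; this is just a sketch, and zfill_expr / num_str are made-up names rather than anything from Polars:
import polars as pl
def zfill_expr(col: str, width: int) -> pl.Expr:
    # hypothetical helper: left-pad a Utf8 column with zeros, pandas-zfill style
    s = pl.col(col)
    return (
        pl.when(s.str.lengths() > width)
        .then(s)
        .otherwise(pl.concat_str([pl.lit("0" * width), s]).str.slice(-width))
    )
df_str = pl.DataFrame({"num_str": ["-10", "1", "100000", None]})
df_str.with_column(zfill_expr("num_str", 5).alias("result"))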
Edit: On a dataset with 550 million records, this took about 50 seconds on my machine. (Note: this runs single-threaded)
Edit2: To shave off some time, you can use the following:
result = df.lazy().with_column(
    pl.col('num').cast(pl.Utf8).alias('tmp')
).with_column(
    pl.when(pl.col("tmp").str.lengths() > z)
    .then(pl.col("tmp"))
    .otherwise(pl.concat_str([pl.lit("0" * z), pl.col("tmp")]).str.slice(-z))
    .alias("result")
).drop('tmp').collect()
but it didn't save that much time.

Related

Finding first index of where value in column B is greater than a value in column A

I'd like to know the first occurrence (index) when a value in column A is greater than in column B. Currently I use a for loop (and it's super slow) but I'd imagine it's possible to do that in a rolling window.
df = polars.DataFrame({"idx": [i for i in range(5)], "col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
# apply some window function?
# result of first indices where a value in column B is greater than the value in column A
result = polars.Series([2,2,2,3,None])
I'm still trying to understand polars' concept of windows, but I imagine the pseudo code would look something like this:
for window length compare values in both columns, use arg_min() to get the index
if the resulting index is not found (e.g. value None or 0), increase window length and make a second pass
make passes until some max window_len
Current for loop implementation:
df = polars.DataFrame({"col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
for i in range(0, df.shape[0]):
    # `arg_max()` returns 0 when there's no such index or if the index is actually 0
    series = (df.select("col_a")[i,0] < df.select("col_b")[i:])[:,0]
    idx_found = True in series
    if idx_found:
        print(i + series.arg_max())
    else:
        print("None")
# output:
2
2
2
3
None
Edit 1:
This almost solves the problem. But we still don't know whether arg_max found an actual True value or didn't find an index, since it returns 0 in both cases.
One idea is that we're never satisfied with the answer 0 and make a second scan for all values where the result was 0 but now with a longer window.
df_res = df.groupby_dynamic("idx", every="1i", period="5i").agg(
    [
        (polars.col("col_a").head(1) < polars.col("col_b")).arg_max().alias("res")
    ]
)
# add the window start (idx) to the window-relative arg_max to get an absolute index
df.select(polars.col("idx")) + df_res.select(polars.col("res"))
Edit 2:
This is the final solution: the first pass is made from the code in Edit 1. The following passes (with increasingly wider windows/periods) can be made with:
increase_window_size = "10i"
df_res.groupby_dynamic("idx", every="1i", period=increase_window_size).agg(
    [
        (polars.col("col_a").head(1) < polars.col("col_b")).filter(polars.col("res").head(1) == 0).arg_max().alias("res")
    ]
)
Starting from...
df=pl.DataFrame({"idx": [i for i in range(5)], "col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
For each row, you want the min idx where the current row's col_a is less than every subsequent row's col_b.
The first step is to add two columns that contain all the data as lists, and then explode those into a much longer DataFrame.
df.with_columns([
    pl.col('col_b').list(),
    pl.col('idx').list().alias('b_indx')]) \
  .explode(['col_b','b_indx'])
From here, we want to apply a filter so we're only keeping rows where b_indx is at least as big as idx AND col_a is less than col_b.
df.with_columns([
    pl.col('col_b').list(),
    pl.col('idx').list().alias('b_indx')]) \
  .explode(['col_b','b_indx']) \
  .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b')))
There are a couple of ways you could clean that up; one is groupby + agg + sort:
df.with_columns([
    pl.col('col_b').list(),
    pl.col('idx').list().alias('b_indx')]) \
  .explode(['col_b','b_indx']) \
  .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
  .groupby(['idx']).agg([pl.col('b_indx').min()]).sort('idx')
The other way is to just do unique by idx
df.with_columns([
    pl.col('col_b').list(),
    pl.col('idx').list().alias('b_indx')]) \
  .explode(['col_b','b_indx']) \
  .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
  .unique(subset='idx')
Lastly, to get the null values back, you have to join the result back to the original df. To keep with the theme of adding to the end of the chain we'd want a right join, but right joins aren't an option, so the join has to go at the beginning instead.
df.join(
    df.with_columns([
        pl.col('col_b').list(),
        pl.col('idx').list().alias('b_indx')])
    .explode(['col_b','b_indx'])
    .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b')))
    .unique(subset='idx'),
    on='idx', how='left').get_column('b_indx')
shape: (5,)
Series: 'b_indx' [i64]
[
2
2
2
3
null
]
Note:
I was curious about the performance difference between my approach and jqurious's approach, so I did
import numpy as np
df=pl.DataFrame({'col_a':np.random.randint(1,10,10000), 'col_b':np.random.randint(1,10,10000)}).with_row_count('idx')
then ran each code chunk. Mine took 1.7s while jqurious's took just 0.7s BUT his answer isn't correct...
For instance...
df.join(
    df.with_columns([
        pl.col('col_b').list(),
        pl.col('idx').list().alias('b_indx')])
    .explode(['col_b','b_indx'])
    .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b')))
    .unique(subset='idx'),
    on='idx', how='left').select(['idx','col_a','col_b',pl.col('b_indx').alias('result')]).head(5)
yields...
shape: (5, 4)
┌─────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ u32 │
╞═════╪═══════╪═══════╪════════╡
│ 0 ┆ 4 ┆ 1 ┆ 2 │ 4<5 at indx2
│ 1 ┆ 1 ┆ 4 ┆ 1 │ 1<4 at indx1
│ 2 ┆ 3 ┆ 5 ┆ 2 │ 3<5 at indx2
│ 3 ┆ 4 ┆ 2 ┆ 5 │ off the page
│ 4 ┆ 5 ┆ 4 ┆ 5 │ off the page
└─────┴───────┴───────┴────────┘
whereas
df.with_columns(
    pl.when(pl.col("col_a") < pl.col("col_b"))
    .then(1)
    .cumsum()
    .backward_fill()
    .alias("result") + 1
).head(5)
yields
shape: (5, 4)
┌─────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i32 │
╞═════╪═══════╪═══════╪════════╡
│ 0 ┆ 4 ┆ 1 ┆ 2 │ 4<5 at indx2
│ 1 ┆ 1 ┆ 4 ┆ 2 │ not right
│ 2 ┆ 3 ┆ 5 ┆ 3 │ not right
│ 3 ┆ 4 ┆ 2 ┆ 4 │ off the page
│ 4 ┆ 5 ┆ 4 ┆ 4 │ off the page
└─────┴───────┴───────┴────────┘
Performance
This scales pretty terribly: bumping the df from 10,000 rows to 100,000 made my kernel crash, and going from 10,000 to 20,000 made it take 5.7s, which makes sense since the exploded frame grows with the square of the row count. To mitigate this, you can do overlapping chunks.
First let's make a function
def idx_finder(df):
    return (
        df.join(
            df.with_columns([
                pl.col('col_b').list(),
                pl.col('idx').list().alias('b_indx')])
            .explode(['col_b','b_indx'])
            .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b')))
            .unique(subset='idx'),
            on='idx', how='left')
        .select(['idx','col_a','col_b',pl.col('b_indx').alias('result')])
    )
Let's get some summary stats:
print(df.groupby('col_a').agg(pl.col('col_b').max()).sort('col_a'))
shape: (9, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═══════╡
│ 1 ┆ 9 │
│ 2 ┆ 9 │
│ 3 ┆ 9 │
│ 4 ┆ 9 │
│ ... ┆ ... │
│ 6 ┆ 9 │
│ 7 ┆ 9 │
│ 8 ┆ 9 │
│ 9 ┆ 9 │
└───────┴───────┘
This tells us that the biggest value of col_b is 9 for every value of col_a, which means that whenever the result is null and col_a is 9, it's a true null; a null with col_a < 9 might just mean the match lies beyond the current chunk.
With that, we do
chunks=[]
chunks.append(idx_finder(df[0:10000])) # arbitrarily picking 10,000 per chunk
Then take a look at
chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))
shape: (2, 4)
┌──────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ u32 │
╞══════╪═══════╪═══════╪════════╡
│ 9993 ┆ 8 ┆ 6 ┆ null │
│ 9999 ┆ 3 ┆ 1 ┆ null │
└──────┴───────┴───────┴────────┘
Let's cut off this chunk at idx=9992 and then start the next chunk at 9993
curindx=chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))[0,0]
chunks[-1]=chunks[-1].filter(pl.col('idx')<curindx)
With that we can reformulate this logic into a while loop
curindx=0
chunks=[]
while curindx<=df.shape[0]:
    print(curindx)
    chunks.append(idx_finder(df[curindx:(curindx+10000)]))
    curchunkfilt=chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))
    if curchunkfilt.shape[0]==0:
        curindx+=10000
    elif curchunkfilt[0,0]>curindx:
        curindx=chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))[0,0]
    else:
        print("curindx not advancing")
        break
    chunks[-1]=chunks[-1].filter(pl.col('idx')<curindx)
Finally just
pl.concat(chunks)
As long as we're looping, here's another approach
If the gaps between the A/B matches are small then this will end up being fast, as it scales with the gap size rather than with the size of the df. It just uses shift:
df=df.with_columns(pl.lit(None).alias('result'))
y=0
while True:
    print(y)
    maxB=df.filter(pl.col('result').is_null()).select(pl.col('col_b').max())[0,0]
    df=df.with_columns((
        pl.when(
            (pl.col('result').is_null()) & (pl.col('col_a')<pl.col('col_b').shift(-y))
        ).then(pl.col('idx').shift(-y)).otherwise(pl.col('result'))).alias('result'))
    y+=1
    if df.filter((pl.col('result').is_null()) & (pl.col('col_a')<maxB) & ~(pl.col('col_b').shift(-y).is_null())).shape[0]==0:
        break
With my random data of 1.2m rows it only took 2.6s with a max row offset of 86. If, in your real data, the gaps are on the order of, let's just say, 100,000 then it'd be close to an hour.
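To make the shift mechanics concrete, here's a small sketch of what a single offset (y=2) compares on the example frame from the question; the alias names are just for illustration:
import polars as pl
df = pl.DataFrame({"idx": [0, 1, 2, 3, 4],
                   "col_a": [1, 2, 3, 4, 4],
                   "col_b": [1, 1, 5, 5, 3]})
# at offset y=2 each row's col_a is compared against the col_b two rows ahead;
# rows near the end have no row that far ahead, so the comparison is null there
df.with_columns([
    pl.col("col_b").shift(-2).alias("col_b_2_ahead"),
    (pl.col("col_a") < pl.col("col_b").shift(-2)).alias("match_at_offset_2"),
])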

python-polars create new column by dividing by two existing columns

In pandas, the following creates a new column in a dataframe by dividing two existing columns. How do I do this in polars? Bonus if done in the fastest way using polars.LazyFrame
df = pd.DataFrame({"col1":[10,20,30,40,50], "col2":[5,2,10,10,25]})
df["ans"] = df["col1"]/df["col2"]
print(df)
You want to avoid pandas-style coding and use the Polars Expressions API. Expressions are the heart of Polars and yield the best performance.
Here's how we would code this using Expressions, including using Lazy mode:
(
    df
    .lazy()
    .with_column(
        (pl.col('col1') / pl.col('col2')).alias('result')
    )
    .collect()
)
shape: (5, 3)
┌──────┬──────┬────────┐
│ col1 ┆ col2 ┆ result │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ f64 │
╞══════╪══════╪════════╡
│ 10 ┆ 5 ┆ 2.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 20 ┆ 2 ┆ 10.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 30 ┆ 10 ┆ 3.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 40 ┆ 10 ┆ 4.0 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 50 ┆ 25 ┆ 2.0 │
└──────┴──────┴────────┘
Here's a section of the User Guide that may help transitioning from Pandas-style coding to using Polars Expressions.
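For what it's worth, the eager equivalent is the same expression without the lazy()/collect() wrapper (a sketch, assuming the same df):
df.with_column(
    (pl.col('col1') / pl.col('col2')).alias('result')
)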

Filtering selected columns based on column aggregate

I wish to select only columns with fewer than 3 unique values. I can generate a boolean mask via pl.all().n_unique() < 3, but I don't know if I can use that mask via the polars API for this.
Currently, I am solving it via python. Is there a more idiomatic way?
import polars as pl, pandas as pd
df = pl.DataFrame({"col1":[1,1,2], "col2":[1,2,3], "col3":[3,3,3]})
# target is:
# df_few_unique = pl.DataFrame({"col1":[1,1,2], "col3":[3,3,3]})
# my attempt:
mask = df.select(pl.all().n_unique() < 3).to_numpy()[0]
cols = [col for col, m in zip(df.columns, mask) if m]
df_few_unique = df.select(cols)
df_few_unique
Equivalent in pandas:
df_pandas = df.to_pandas()
mask = (df_pandas.nunique() < 3)
df_pandas.loc[:, mask]
Edit: after some thinking, I discovered an even easier way to do this, one that doesn't rely on boolean masking at all.
pl.select(
    [s for s in df if s.n_unique() < 3]
)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 3 │
└──────┴──────┘
Previous answer
One easy way is to use the compress function from Python's itertools.
from itertools import compress
df.select(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 3 │
└──────┴──────┘
compress allows us to apply a boolean mask to a list, which in this case is a list of column names.
list(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
['col1', 'col3']

Best way to get percentage counts in Polars

I frequently need to calculate the percentage counts of a variable. For example for the dataframe below
df = pl.DataFrame({"person": ["a", "a", "b"],
"value": [1, 2, 3]})
I want to return a dataframe like this:
person   percent
a        0.667
b        0.333
What I have been doing is the following, but I can't help but think there must be a more efficient, more idiomatic polars way to do this:
n_rows = len(df)
(
    df
    .with_column(pl.lit(1).alias('percent'))
    .groupby('person')
    .agg([pl.sum('percent') / n_rows])
)
polars.count will help here. When called without arguments, polars.count returns the number of rows in a particular context.
(
    df
    .groupby("person")
    .agg([pl.count().alias("count")])
    .with_column((pl.col("count") / pl.sum("count")).alias("percent_count"))
)
shape: (2, 3)
┌────────┬───────┬───────────────┐
│ person ┆ count ┆ percent_count │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ f64 │
╞════════╪═══════╪═══════════════╡
│ a ┆ 2 ┆ 0.666667 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 0.333333 │
└────────┴───────┴───────────────┘
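As a side note, here's a quick sketch of how the context determines what pl.count() counts (the frame is the same toy df from the question):
import polars as pl
df = pl.DataFrame({"person": ["a", "a", "b"], "value": [1, 2, 3]})
# selection context: counts every row in the frame -> 3
df.select(pl.count())
# groupby context: counts the rows of each group -> a: 2, b: 1
df.groupby("person").agg(pl.count())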

convert a pandas loc operation that needed the index to assign values to polars

In this example I have three columns: 'DayOfWeek', 'Time' and 'Risk'.
I want to group by 'DayOfWeek', take only the first element, and assign a high risk to it. This means the first known hour in each day of week is the one that has the highest risk. The rest are initialized to 'Low' risk.
In pandas I had an additional column for the index, but in polars I do not. I could artificially create one, but is it even necessary?
Can I do this more cleverly with polars?
df['risk'] = "Low"
df = df.sort('Time')
df.loc[df.groupby("DayOfWeek").head(1).index, "risk"] = "High"
The index is unique in this case and goes to range(n)
Here is my solution btw. (I don't really like it)
df = df.with_column(pl.arange(0, df.shape[0]).alias('pseudo_index'))
# find lowest time for day
indexes_df = df.sort('Time').groupby('DayOfWeek').head(1)
# Set 'High' as col for all rows from groupby
indexes_df = indexes_df.select('pseudo_index').with_column(pl.lit('High').alias('risk'))
# Left join will generate null values for all values that are not in indexes_df 'pseudo_index'
df = df.join(indexes_df, how='left', on='pseudo_index').select([
    pl.all().exclude(['pseudo_index', 'risk']), pl.col('risk').fill_null(pl.lit('low'))
])
You can use window functions to find where the first "index" of the "DayOfWeek" group equals the "index" column.
For that we only need to set an "index" column. We can do that easily with:
A method: df.with_row_count(<name>)
An expression: pl.arange(0, pl.count()).alias(<name>)
After that we can use this predicate:
pl.first("index").over("DayOfWeek") == pl.col("index")
Finally we use a when -> then -> otherwise expression to use that condition and create our new "Risk" column.
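Before the full example, here's a minimal sketch of the same idea using the with_row_count route; the tiny frame is made up purely to show the predicate in context:
import polars as pl
df = pl.DataFrame({"DayOfWeek": [3, 2, 3, 2, 1]})
# the first row seen for each DayOfWeek gets "High", everything else "Low"
df.with_row_count("index").with_column(
    pl.when(pl.first("index").over("DayOfWeek") == pl.col("index"))
    .then("High")
    .otherwise("Low")
    .alias("Risk")
).drop("index")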
Example
Let's start with some data. In the snippet below I create an hourly date range and then determine the weekdays from that.
Preparing data
from datetime import datetime
import polars as pl
df = pl.DataFrame({
    "Time": pl.date_range(datetime(2022, 6, 1), datetime(2022, 6, 30), "1h").sample(frac=1.5, with_replacement=True).sort(),
}).select([
    pl.arange(0, pl.count()).alias("index"),
    pl.all(),
    pl.col("Time").dt.weekday().alias("DayOfWeek"),
])
print(df)
shape: (1045, 3)
┌───────┬─────────────────────┬───────────┐
│ index ┆ Time ┆ DayOfWeek │
│ --- ┆ --- ┆ --- │
│ i64 ┆ datetime[ns] ┆ u32 │
╞═══════╪═════════════════════╪═══════════╡
│ 0 ┆ 2022-06-29 22:00:00 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 ┆ 2022-06-14 11:00:00 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 2022-06-11 21:00:00 ┆ 6 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ 2022-06-27 20:00:00 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1041 ┆ 2022-06-11 09:00:00 ┆ 6 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1042 ┆ 2022-06-18 22:00:00 ┆ 6 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1043 ┆ 2022-06-18 01:00:00 ┆ 6 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 1044 ┆ 2022-06-23 18:00:00 ┆ 4 │
└───────┴─────────────────────┴───────────┘
Computing Risk values
df = df.with_column(
    pl.when(
        pl.first("index").over("DayOfWeek") == pl.col("index")
    ).then(
        "High"
    ).otherwise(
        "Low"
    ).alias("Risk")
).drop("index")
print(df)
shape: (1045, 3)
┌─────────────────────┬───────────┬──────┐
│ Time ┆ DayOfWeek ┆ Risk │
│ --- ┆ --- ┆ --- │
│ datetime[ns] ┆ u32 ┆ str │
╞═════════════════════╪═══════════╪══════╡
│ 2022-06-29 22:00:00 ┆ 3 ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-14 11:00:00 ┆ 2 ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-11 21:00:00 ┆ 6 ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-27 20:00:00 ┆ 1 ┆ High │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ ... ┆ ... ┆ ... │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-11 09:00:00 ┆ 6 ┆ Low │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-18 22:00:00 ┆ 6 ┆ Low │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-18 01:00:00 ┆ 6 ┆ Low │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2022-06-23 18:00:00 ┆ 4 ┆ Low │
└─────────────────────┴───────────┴──────┘