I have two dataframes that I'd like to join if one column's value is contained in the other column. The dataframes look like this:
df1 = pl.DataFrame({"col1": [1, 2, 3], "col2": ["x1, x2, x3", "x2, x3", "x3"]})
df2 = pl.DataFrame({"col3": [4, 5, 6], "col4": ["x1", "x2", "x3"]})
I tried to do:
model_data = df1.join(df2, on="col2")
Which does not produce the desired result. What I'd like to see is something like this:
col1 col2 col3 col4
1 "x1, x2, x3" 4 "x1"
1 "x1, x2, x3" 5 "x2"
1 "x1, x2, x3" 6 "x3"
2 "x2, x3" 5 "x2"
2 "x2, x3" 6 "x3"
3 "x3" 6 "x3"
It's a question of how you do the join when one value is contained by another value. I could not find good examples of this in the docs.
You want to split col2 on the delimiter and then .explode() it, similar to the question "python-polars split string column into many columns by delimiter".
Then you can perform the .join()
>>> (df1.with_column(df1["col2"].str.split(", ").alias("col4"))
... .explode("col4")
... .join(df2, on="col4")
... .select(df1.columns + df2.columns))
shape: (6, 4)
┌──────┬────────────┬──────┬──────┐
│ col1 ┆ col2 ┆ col3 ┆ col4 │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ i64 ┆ str │
╞══════╪════════════╪══════╪══════╡
│ 1 ┆ x1, x2, x3 ┆ 4 ┆ x1 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ x1, x2, x3 ┆ 5 ┆ x2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ x1, x2, x3 ┆ 6 ┆ x3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ x2, x3 ┆ 5 ┆ x2 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ x2, x3 ┆ 6 ┆ x3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 3 ┆ x3 ┆ 6 ┆ x3 │
└──────┴────────────┴──────┴──────┘
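If it helps to see what the split / explode / join chain does row by row, here is a rough plain-Python model of the same logic. It is only a sketch and doesn't use polars at all; the tuple layout and the `col3_by_key` lookup are my own names, and it assumes col4 is unique in df2.

```python
# Plain-Python model of split -> explode -> join (no polars needed).
df1_rows = [(1, "x1, x2, x3"), (2, "x2, x3"), (3, "x3")]  # (col1, col2)
df2_rows = [(4, "x1"), (5, "x2"), (6, "x3")]              # (col3, col4)

# Index df2 by its join key (assumes col4 values are unique).
col3_by_key = {col4: col3 for col3, col4 in df2_rows}

joined = []
for col1, col2 in df1_rows:
    # split() turns the string into a list; looping over it is the "explode".
    for key in col2.split(", "):
        # Inner-join semantics: keys that are missing from df2 are dropped.
        if key in col3_by_key:
            joined.append((col1, col2, col3_by_key[key], key))

for row in joined:
    print(row)
```

The six printed tuples line up with the six rows of the polars output above.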
Another approach, which might counterintuitively be faster, is to cross join the two dataframes and then filter out the rows where col4 is not contained in col2.
That would look something like this:
cj=df1.join(df2, how='cross')
filt=cj.apply(lambda x: x[3] in x[1])
cj.with_column(filt.to_series().alias('filt')).filter(pl.col('filt')==True).select(pl.exclude('filt'))
Essentially what happens is that you create cj which is a df that has every row of df1 and df2 mashed together. You then create filt which is just a series of Trues and Falses that you can filter by. You filter by that and then do a select to exclude that helper column. You just have to be careful of those index positions in the lambda expression of that second line.
You'll have to test the performance of this against @jqurious's method. If this one is faster (a big if, I don't know), it's because str.split followed by explode isn't as efficient as just mashing everything together. Unfortunately, the series.str.contains method expects a fixed regex or literal, which is why this uses a lambda.
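One thing to watch with the `in` check inside the lambda: on strings it is a substring test, not a whole-token test. The sketch below is plain Python with a made-up value ("x10" is not in the question's data) just to show the difference; both function names are mine.

```python
def contains_substring(col2: str, col4: str) -> bool:
    # What `x[3] in x[1]` does for two strings: plain substring search.
    return col4 in col2


def contains_token(col2: str, col4: str) -> bool:
    # Stricter check: split on the delimiter and compare whole tokens.
    return col4 in col2.split(", ")


print(contains_substring("x10, x2", "x1"))  # True, because "x1" is a prefix of "x10"
print(contains_token("x10, x2", "x1"))      # False
print(contains_token("x1, x2, x3", "x2"))   # True
```

With the question's sample data both checks agree, so this only matters if some keys are prefixes or substrings of others.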
I'd like to know, for each row, the first index (at or after that row) where the value in column B is greater than that row's value in column A. Currently I use a for loop (and it's super slow), but I'd imagine it's possible to do this with a rolling window.
df = polars.DataFrame({"idx": [i for i in range(5)], "col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
# apply some window function?
# result of first indices where a value in column B is greater than the value in column A
result = polars.Series([2,2,2,3,None])
I'm still trying to understand polars' concept of windows, but I imagine the pseudocode would look something like this:
for window length compare values in both columns, use arg_min() to get the index
if the resulting index is not found (e.g. value None or 0), increase window length and make a second pass
make passes until some max window_len
Current for loop implementation:
df = polars.DataFrame({"col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
for i in range(0, df.shape[0]):
    # `arg_max()` returns 0 when there's no such index or if the index is actually 0
    series = (df.select("col_a")[i,0] < df.select("col_b")[i:])[:,0]
    idx_found = True in series
    if idx_found:
        print(i + series.arg_max())
    else:
        print("None")
# output:
2
2
2
3
None
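For checking any faster solution, a brute-force reference in plain Python can be handy. This is only a sketch (the function name is mine, and it is still quadratic in the worst case), but it keeps "found at offset 0" and "not found" apart, which is the ambiguity with arg_max mentioned in the comment above.

```python
from typing import List, Optional


def first_greater_index(col_a: List[int], col_b: List[int]) -> List[Optional[int]]:
    """For each row i, the first j >= i with col_b[j] > col_a[i], else None."""
    result = []
    for i, a in enumerate(col_a):
        # next() stops at the first match; None is the "not found" default.
        match = next((j for j in range(i, len(col_b)) if col_b[j] > a), None)
        result.append(match)
    return result


print(first_greater_index([1, 2, 3, 4, 4], [1, 1, 5, 5, 3]))
# [2, 2, 2, 3, None]
```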
Edit 1:
This almost solves the problem. But we still don't know whether arg_max found an actual True value or didn't find an index, since it returns 0 in both cases.
One idea is that we're never satisfied with the answer 0 and make a second scan for all values where the result was 0 but now with a longer window.
df_res = df.groupby_dynamic("idx", every="1i", period="5i").agg(
    [
        (polars.col("col_a").head(1) < polars.col("col_b")).arg_max().alias("res")
    ]
)
# "res" is an offset within each window; add the window's "idx" to it to get the absolute row index
Edit 2:
This is the final solution: the first pass is made from the code in Edit 1. The following passes (with increasingly wider windows/periods) can be made with:
increase_window_size = "10i"
df_res.groupby_dynamic("idx", every="1i", period=increase_window_size).agg(
[
(polars.col("col_a").head(1) < polars.col("col_b")).filter(polars.col("res").head(1) == 0).arg_max().alias("res")
]
)
Starting from...
df=pl.DataFrame({"idx": [i for i in range(5)], "col_a": [1,2,3,4,4], "col_b": [1,1,5,5,3]})
For each row, you want the smallest idx, at or after the current row, whose col_b is greater than the current row's col_a.
The first step is to add two columns that will contain all the data as a list and then we want to explode those into a much longer DataFrame.
df.with_columns([
pl.col('col_b').list(),
pl.col('idx').list().alias('b_indx')]) \
.explode(['col_b','b_indx'])
From here, we want to apply a filter so we're only keeping rows where the b_indx is at least as big as the idx AND the col_a is less than col_b
df.with_columns([
pl.col('col_b').list(),
pl.col('idx').list().alias('b_indx')]) \
.explode(['col_b','b_indx']) \
.filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b')))
There are a couple ways you could clean that up, one is to groupby+agg+sort
df.with_columns([
pl.col('col_b').list(),
pl.col('idx').list().alias('b_indx')]) \
.explode(['col_b','b_indx']) \
.filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
.groupby(['idx']).agg([pl.col('b_indx').min()]).sort('idx')
The other way is to just do unique by idx
df.with_columns([
pl.col('col_b').list(),
pl.col('idx').list().alias('b_indx')]) \
.explode(['col_b','b_indx']) \
.filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
.unique(subset='idx')
Lastly, to get the null values back you have to join it back to the original df. To keep with the theme of adding to the end of the chain we'd want a right join but right joins aren't an option so we have to put the join back at the beginning.
df.join(
df.with_columns([
pl.col('col_b').list(),
pl.col('idx').list().alias('b_indx')]) \
.explode(['col_b','b_indx']) \
.filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
.unique(subset='idx'),
on='idx',how='left').get_column('b_indx')
shape: (5,)
Series: 'b_indx' [i64]
[
2
2
2
3
null
]
Note:
I was curious about the performance difference between my approach and jqurious's approach, so I did (with numpy imported as np)
df=pl.DataFrame({'col_a':np.random.randint(1,10,10000), 'col_b':np.random.randint(1,10,10000)}).with_row_count('idx')
then ran each code chunk. Mine took 1.7s while jqurious's took just 0.7s BUT his answer isn't correct...
For instance...
df.join(
df.with_columns([
pl.col('col_b').list(),
pl.col('idx').list().alias('b_indx')]) \
.explode(['col_b','b_indx']) \
.filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
.unique(subset='idx'),
on='idx',how='left').select(['idx','col_a','col_b',pl.col('b_indx').alias('result')]).head(5)
yields...
shape: (5, 4)
┌─────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ u32 │
╞═════╪═══════╪═══════╪════════╡
│ 0 ┆ 4 ┆ 1 ┆ 2 │ 4<5 at indx2
│ 1 ┆ 1 ┆ 4 ┆ 1 │ 1<4 at indx1
│ 2 ┆ 3 ┆ 5 ┆ 2 │ 3<5 at indx2
│ 3 ┆ 4 ┆ 2 ┆ 5 │ off the page
│ 4 ┆ 5 ┆ 4 ┆ 5 │ off the page
└─────┴───────┴───────┴────────┘
whereas
df.with_columns(
pl.when(pl.col("col_a") < pl.col("col_b"))
.then(1)
.cumsum()
.backward_fill()
.alias("result") + 1
).head(5)
yields
shape: (5, 4)
┌─────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ i32 │
╞═════╪═══════╪═══════╪════════╡
│ 0 ┆ 4 ┆ 1 ┆ 2 │ 4<5 at indx2
│ 1 ┆ 1 ┆ 4 ┆ 2 │ not right
│ 2 ┆ 3 ┆ 5 ┆ 3 │ not right
│ 3 ┆ 4 ┆ 2 ┆ 4 │ off the page
│ 4 ┆ 5 ┆ 4 ┆ 4 │ off the page
└─────┴───────┴───────┴────────┘
Performance
This scales pretty terribly, bumping the df from 10,000 rows to 100,000 made my kernel crash. Going from 10,000 to 20,000 made it take 5.7s which makes sense since we're squaring the size of the df. To mitigate this, you can do overlapping chunks.
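To put rough numbers on that squaring: every row gets a list holding all n values, so the exploded frame has n * n rows before the filter runs. A quick back-of-the-envelope sketch (the function name is mine, and it only counts rows, not bytes):

```python
def exploded_rows(n_rows: int) -> int:
    # Each of the n rows carries a list of all n values, so explode yields n * n rows.
    return n_rows * n_rows


for n in (10_000, 20_000, 100_000):
    print(f"{n:>7,} rows -> {exploded_rows(n):>14,} rows after explode")
```

Doubling the input quadruples the intermediate frame, which is roughly in line with going from 1.7s to 5.7s, and 100,000 rows means ten billion intermediate rows.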
First let's make a function
def idx_finder(df):
    return(df.join(
        df.with_columns([
            pl.col('col_b').list(),
            pl.col('idx').list().alias('b_indx')]) \
        .explode(['col_b','b_indx']) \
        .filter((pl.col('b_indx')>=pl.col('idx')) & (pl.col('col_a')<pl.col('col_b'))) \
        .unique(subset='idx'),
        on='idx',how='left').select(['idx','col_a','col_b',pl.col('b_indx').alias('result')]))
Let's get some summary stats:
print(df.groupby('col_a').agg(pl.col('col_b').max()).sort('col_a'))
shape: (9, 2)
┌───────┬───────┐
│ col_a ┆ col_b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═══════╪═══════╡
│ 1 ┆ 9 │
│ 2 ┆ 9 │
│ 3 ┆ 9 │
│ 4 ┆ 9 │
│ ... ┆ ... │
│ 6 ┆ 9 │
│ 7 ┆ 9 │
│ 8 ┆ 9 │
│ 9 ┆ 9 │
└───────┴───────┘
This tells us that, for every value of col_a, the largest value of col_b is 9. So whenever the result is null and col_a is 9, it's a true null.
With that, we do
chunks=[]
chunks.append(idx_finder(df[0:10000])) # arbitrarily picking 10,000 per chunk
Then take a look at
chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))
shape: (2, 4)
┌──────┬───────┬───────┬────────┐
│ idx ┆ col_a ┆ col_b ┆ result │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ i64 ┆ i64 ┆ u32 │
╞══════╪═══════╪═══════╪════════╡
│ 9993 ┆ 8 ┆ 6 ┆ null │
│ 9999 ┆ 3 ┆ 1 ┆ null │
└──────┴───────┴───────┴────────┘
Let's cut off this chunk at idx=9992 and then start the next chunk at 9993
curindx=chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))[0,0]
chunks[-1]=chunks[-1].filter(pl.col('idx')<curindx)
With that we can reformulate this logic into a while loop
curindx=0
chunks=[]
while curindx<=df.shape[0]:
    print(curindx)
    chunks.append(idx_finder(df[curindx:(curindx+10000)]))
    curchunkfilt=chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))
    if curchunkfilt.shape[0]==0:
        curindx+=10000
    elif curchunkfilt[0,0]>curindx:
        curindx=chunks[-1].filter((pl.col('result').is_null()) & (pl.col('col_a')<9))[0,0]
    else:
        print("curindx not advancing")
        break
    chunks[-1]=chunks[-1].filter(pl.col('idx')<curindx)
Finally just
pl.concat(chunks)
As long as we're looping here's another approach
If the gaps between the A/B matches are small then this will end up being fast as it scales according to the gap size rather than by the size of the df. It just uses shift
df=df.with_columns(pl.lit(None).alias('result'))
y=0
while True:
    print(y)
    maxB=df.filter(pl.col('result').is_null()).select(pl.col('col_b').max())[0,0]
    df=df.with_columns((
        pl.when(
            (pl.col('result').is_null()) & (pl.col('col_a')<pl.col('col_b').shift(-y))
        ).then(pl.col('idx').shift(-y)).otherwise(pl.col('result'))).alias('result'))
    y+=1
    if df.filter((pl.col('result').is_null()) & (pl.col('col_a')<maxB) & ~(pl.col('col_b').shift(-y).is_null())).shape[0]==0:
        break
With my random data of 1.2m rows it only took 2.6s with a max row offset of 86. If, in your real data, the gaps are on the order of, let's just say, 100,000 then it'd be close to an hour.
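To make the "scales with the gap size" point concrete, here is a plain-Python model of the same idea: on pass number y, every still-unresolved row i is compared against col_b at row i + y. It's a sketch only (names are mine) and it leaves out the maxB early-exit that the polars loop uses.

```python
from typing import List, Optional


def first_greater_by_offset(col_a: List[int], col_b: List[int]) -> List[Optional[int]]:
    n = len(col_a)
    result: List[Optional[int]] = [None] * n
    unresolved = set(range(n))
    offset = 0
    # One pass per offset; the number of passes is bounded by the largest gap.
    while unresolved and offset < n:
        for i in list(unresolved):
            j = i + offset
            if j >= n:
                # Ran off the end of the data: this row stays None.
                unresolved.discard(i)
            elif col_b[j] > col_a[i]:
                result[i] = j
                unresolved.discard(i)
        offset += 1
    return result


print(first_greater_by_offset([1, 2, 3, 4, 4], [1, 1, 5, 5, 3]))
# [2, 2, 2, 3, None]
```

On the small example it needs three passes (offsets 0, 1 and 2), no matter how many rows there are in total.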
I am working with multiple parquet datasets that were written with nested structs (sometimes multiple levels deep). I need to output a flattened (no struct) schema. Right now the only way I can think to do that is to use for loops to iterate through the columns. Here is a simplified example where I'm for looping.
while len([x.name for x in df if x.dtype == pl.Struct]) > 0:
    for col in df:
        if col.dtype == pl.Struct:
            df = df.unnest(col.name)
This works, maybe that is the only way to do it, and if so it would be helpful to know that. But Polars is pretty neat and I'm wondering if there is a more functional way to do this without all the looping and reassigning the df to itself.
If you have a df like this:
df=pl.DataFrame({'a':[1,2,3], 'b':[2,3,4], 'c':[3,4,5], 'd':[4,5,6], 'e':[5,6,7]}).select([pl.struct(['a','b']).alias('ab'), pl.struct(['c','d']).alias('cd'),'e'])
You can unnest the ab and cd at the same time by just doing
df.unnest(['ab','cd'])
If you don't know your column names and types in advance, you can just use a list comprehension like this:
[col_name for col_name,dtype in zip(df.columns, df.dtypes) if dtype==pl.Struct]
We can now just put that list comprehension in the unnest method.
df=df.unnest([col_name for col_name,dtype in zip(df.columns, df.dtypes) if dtype==pl.Struct])
If you have structs inside structs like:
df=pl.DataFrame({'a':[1,2,3], 'b':[2,3,4], 'c':[3,4,5], 'd':[4,5,6], 'e':[5,6,7]}).select([pl.struct(['a','b']).alias('ab'), pl.struct(['c','d']).alias('cd'),'e']).select([pl.struct(['ab','cd']).alias('abcd'),'e'])
then I don't think you can get away from some kind of while loop but this might be more concise:
while any([x==pl.Struct for x in df.dtypes]):
    df=df.unnest([col_name for col_name,dtype in zip(df.columns, df.dtypes) if dtype==pl.Struct])
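As a mental model of why the loop needs one pass per nesting level, here is the same idea on a single row represented as nested dicts. This is a plain-Python sketch, not polars; the function name is mine, and raising on duplicate names is my own assumption about how you'd want a collision handled.

```python
def unnest_one_level(row: dict) -> dict:
    """Replace every dict-valued entry with its fields, one level only."""
    flat = {}
    for name, value in row.items():
        fields = value if isinstance(value, dict) else {name: value}
        for field_name, field_value in fields.items():
            if field_name in flat:
                # Two structs exposing the same field name would collide here.
                raise ValueError(f"duplicate column name: {field_name}")
            flat[field_name] = field_value
    return flat


row = {"abcd": {"ab": {"a": 1, "b": 2}, "cd": {"c": 3, "d": 4}}, "e": 5}
passes = 0
while any(isinstance(value, dict) for value in row.values()):
    row = unnest_one_level(row)
    passes += 1

print(passes, row)
# 2 {'a': 1, 'b': 2, 'c': 3, 'd': 4, 'e': 5}
```

Two levels of nesting take two passes, and the parent names (abcd, ab, cd) disappear along the way.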
This is a minor addition. If you're concerned about constantly re-looping through a large number of columns, you can create a recursive function to address only structs (and nested structs).
def unnest_all(self: pl.DataFrame):
    cols = []
    for next_col in self:
        if next_col.dtype != pl.Struct:
            cols.append(next_col)
        else:
            cols.extend(next_col.struct.to_frame().unnest_all().get_columns())
    return pl.DataFrame(cols)

pl.DataFrame.unnest_all = unnest_all
So, using the second example by @Dean MacGregor above:
df = (
pl.DataFrame(
{"a": [1, 2, 3], "b": [2, 3, 4], "c": [
3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
)
.select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
.select([pl.struct(["ab", "cd"]).alias("abcd"), "e"])
)
df
df.unnest_all()
>>> df
shape: (3, 2)
┌───────────────┬─────┐
│ abcd ┆ e │
│ --- ┆ --- │
│ struct[2] ┆ i64 │
╞═══════════════╪═════╡
│ {{1,2},{3,4}} ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {{2,3},{4,5}} ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {{3,4},{5,6}} ┆ 7 │
└───────────────┴─────┘
>>> df.unnest_all()
shape: (3, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a ┆ b ┆ c ┆ d ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ 5 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 5 ┆ 6 ┆ 7 │
└─────┴─────┴─────┴─────┴─────┘
And using the first example:
df = pl.DataFrame(
{"a": [1, 2, 3], "b": [2, 3, 4], "c": [
3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
).select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
df
df.unnest_all()
>>> df
shape: (3, 3)
┌───────────┬───────────┬─────┐
│ ab ┆ cd ┆ e │
│ --- ┆ --- ┆ --- │
│ struct[2] ┆ struct[2] ┆ i64 │
╞═══════════╪═══════════╪═════╡
│ {1,2} ┆ {3,4} ┆ 5 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {2,3} ┆ {4,5} ┆ 6 │
├╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ {3,4} ┆ {5,6} ┆ 7 │
└───────────┴───────────┴─────┘
>>> df.unnest_all()
shape: (3, 5)
┌─────┬─────┬─────┬─────┬─────┐
│ a ┆ b ┆ c ┆ d ┆ e │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 3 ┆ 4 ┆ 5 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 3 ┆ 4 ┆ 5 ┆ 6 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 4 ┆ 5 ┆ 6 ┆ 7 │
└─────┴─────┴─────┴─────┴─────┘
In the end, I'm not sure that this saves you much wall-clock time (or RAM).
The other answers taught me a lot. I encountered a new situation where I wanted to easily be able to get each column labeled with all the structs it came from. i.e. for
pl.col("my").struct.field("test").struct.field("thing")
I wanted to recover
my.test.thing
as a string which I could easily use when reading a subset of columns with pyarrow via
pq.ParquetDataset(path).read(columns = ["my.test.thing"])
Since there are many hundreds of columns and the nesting can go quite deep, I wrote functions to do a depth first search on the schema, extract the columns in that pyarrow friendly format, then I can use those to select each column unnested all in one go.
First, I worked with the pyarrow schema because I couldn't figure out how to drill into the structs in the polars schema:
schema = df.to_arrow().schema
Navigating structs in that schema is quirky: the top level behaves differently from the deeper levels. I ended up writing two functions, the first to navigate the top level structure and the second to continue the search below:
def schema_top_level_DFS(pa_schema):
    top_level_stack = list(range(len(pa_schema)))
    while top_level_stack:
        working_top_level_index = top_level_stack.pop()
        working_element_name = pa_schema.names[working_top_level_index]
        if type(pa_schema.types[working_top_level_index]) == pa.lib.StructType:
            second_level_stack = list(range(len(pa_schema.types[working_top_level_index])))
            while second_level_stack:
                working_second_level_index = second_level_stack.pop()
                schema_DFS(pa_schema.types[working_top_level_index][working_second_level_index],working_element_name)
        else:
            column_paths.append(working_element_name)

def schema_DFS(incoming_element,upstream_names):
    current_name = incoming_element.name
    combined_names = ".".join([upstream_names,current_name])
    if type(incoming_element.type) == pa.lib.StructType:
        stack = list(range(len(incoming_element.type)))
        while stack:
            working_index = stack.pop()
            working_element = incoming_element.type[working_index]
            schema_DFS(working_element,combined_names)
    else:
        column_paths.append(combined_names)
So that running
column_paths = []
schema_top_level_DFS(schema)
gives me column paths like
['struct_name_1.inner_struct_name_2.thing1','struct_name_1.inner_struct_name_2.thing2']
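The shape of that search is easier to see on a toy schema. Below is a plain-Python sketch that treats the schema as nested dicts (a dict value stands for a struct, a string for a leaf type). That representation and the function name are mine, not pyarrow's, and it returns the paths instead of appending to a global list.

```python
def dotted_paths(schema: dict, prefix: str = "") -> list:
    """Depth-first walk that returns one dotted path per leaf column."""
    paths = []
    for name, dtype in schema.items():
        full_name = f"{prefix}.{name}" if prefix else name
        if isinstance(dtype, dict):
            # A struct: recurse into its fields, carrying the path so far.
            paths.extend(dotted_paths(dtype, full_name))
        else:
            paths.append(full_name)
    return paths


schema = {
    "abcd": {
        "ab": {"a": "int64", "b": "int64"},
        "cd": {"c": "int64", "d": "int64"},
    },
    "e": "int64",
}
print(dotted_paths(schema))
# ['abcd.ab.a', 'abcd.ab.b', 'abcd.cd.c', 'abcd.cd.d', 'e']
```

This version keeps schema order. The stack-based functions above pop from the end, so they emit the same paths in reverse order.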
To actually do the unnesting, I wasn't sure how to do better than a function with a case statement:
def return_pl_formatting(col_string):
    col_list = col_string.split(".")
    match len(col_list):
        case 1:
            return pl.col(col_list[0]).alias(col_string)
        case 2:
            return pl.col(col_list[0]).struct.field(col_list[1]).alias(col_string)
        case 3:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).alias(col_string)
        case 4:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).alias(col_string)
        case 5:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).struct.field(col_list[4]).alias(col_string)
        case 6:
            return pl.col(col_list[0]).struct.field(col_list[1]).struct.field(col_list[2]).struct.field(col_list[3]).struct.field(col_list[4]).struct.field(col_list[5]).alias(col_string)
Then get my unnested and nicely named df with:
df.select([return_pl_formatting(x) for x in column_paths])
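One possible way around the fixed number of cases is to fold over the path parts, applying one step per nesting level. The sketch below shows only the fold, with stand-in callables that build a string so it runs without polars; `build_nested`, `root` and `step` are hypothetical names of mine, and the polars call in the comment is untested.

```python
from functools import reduce


def build_nested(path, root, step):
    """Fold a dotted path like "a.b.c" into step(step(root("a"), "b"), "c")."""
    head, *rest = path.split(".")
    return reduce(step, rest, root(head))


# With polars this would presumably be called as (assumption, not run here):
#   build_nested(p, pl.col, lambda expr, name: expr.struct.field(name)).alias(p)
# Stand-in callables that just render the call chain as text:
rendered = build_nested(
    "my.test.thing",
    root=lambda name: f'pl.col("{name}")',
    step=lambda expr, name: f'{expr}.struct.field("{name}")',
)
print(rendered)
# pl.col("my").struct.field("test").struct.field("thing")
```

A path with a single part skips the fold entirely and just returns `root(head)`, which corresponds to case 1 above.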
To show the output on the example from @Dean MacGregor:
test = (
pl.DataFrame(
{"a": [1, 2, 3], "b": [2, 3, 4], "c": [
3, 4, 5], "d": [4, 5, 6], "e": [5, 6, 7]}
)
.select([pl.struct(["a", "b"]).alias("ab"), pl.struct(["c", "d"]).alias("cd"), "e"])
.select([pl.struct(["ab", "cd"]).alias("abcd"), "e"])
)
column_paths = []
schema_top_level_DFS(test.to_arrow().schema)
print(test.select([return_pl_formatting(x) for x in column_paths]))
┌─────┬───────────┬───────────┬───────────┬───────────┐
│ e ┆ abcd.cd.d ┆ abcd.cd.c ┆ abcd.ab.b ┆ abcd.ab.a │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 ┆ i64 ┆ i64 │
╞═════╪═══════════╪═══════════╪═══════════╪═══════════╡
│ 5 ┆ 4 ┆ 3 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 6 ┆ 5 ┆ 4 ┆ 3 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┤
│ 7 ┆ 6 ┆ 5 ┆ 4 ┆ 3 │
└─────┴───────────┴───────────┴───────────┴───────────┘
I wish to select only columns with fewer than 3 unique values. I can generate a boolean mask via pl.all().n_unique() < 3, but I don't know if I can use that mask via the polars API for this.
Currently, I am solving it via python. Is there a more idiomatic way?
import polars as pl, pandas as pd
df = pl.DataFrame({"col1":[1,1,2], "col2":[1,2,3], "col3":[3,3,3]})
# target is:
# df_few_unique = pl.DataFrame({"col1":[1,1,2], "col3":[3,3,3]})
# my attempt:
mask = df.select(pl.all().n_unique() < 3).to_numpy()[0]
cols = [col for col, m in zip(df.columns, mask) if m]
df_few_unique = df.select(cols)
df_few_unique
Equivalent in pandas:
df_pandas = df.to_pandas()
mask = (df_pandas.nunique() < 3)
df_pandas.loc[:, mask]
Edit: after some thinking, I discovered an even easier way to do this, one that doesn't rely on boolean masking at all.
pl.select(
[s for s in df
if s.n_unique() < 3]
)
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 3 │
└──────┴──────┘
Previous answer
One easy way is to use the compress function from Python's itertools.
from itertools import compress
df.select(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
shape: (3, 2)
┌──────┬──────┐
│ col1 ┆ col3 │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪══════╡
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 1 ┆ 3 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 2 ┆ 3 │
└──────┴──────┘
compress allows us to apply a boolean mask to a list, which in this case is a list of column names.
list(compress(df.columns, df.select(pl.all().n_unique() < 3).row(0)))
['col1', 'col3']
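If you want to see compress on its own, without polars in the picture, here is a small sketch with the columns held in a plain dict (my stand-in for the DataFrame) and the mask built from the number of distinct values:

```python
from itertools import compress

columns = {"col1": [1, 1, 2], "col2": [1, 2, 3], "col3": [3, 3, 3]}

# One boolean per column: True when the column has fewer than 3 distinct values.
mask = [len(set(values)) < 3 for values in columns.values()]

# compress keeps the items whose matching mask entry is truthy.
kept = list(compress(columns, mask))

print(mask)  # [True, False, True]
print(kept)  # ['col1', 'col3']
```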
Is there a way to reference another Polars Dataframe in Polars expressions without using lambdas?
Just to use a simple example - suppose I have two dataframes:
df_1 = pl.DataFrame(
{
"time": pl.date_range(
low=date(2021, 1, 1),
high=date(2022, 1, 1),
interval="1d",
),
"x": pl.arange(0, 366, eager=True),
}
)
df_2 = pl.DataFrame(
{
"time": pl.date_range(
low=date(2021, 1, 1),
high=date(2021, 2, 1),
interval="1mo",
),
"y": [50, 100],
}
)
For each y value in df_2, I would like to find the maximum date in df_1, conditional on the x value being lower than the y.
I am able to perform this using apply/lambda (see below), but just wondering whether there is a more idiomatic way of performing this operation?
df_2.groupby("y").agg(
pl.col("y").apply(lambda s: df_1.filter(pl.col("x") < s).select(pl.col("time")).max()[0,0]).alias('latest')
)
Edit:
Is it possible to pre-filter df_1 prior to using join_asof? Switching the question to look for the min instead of the max, this is what I would do for an individual case:
(
df_2
.filter(pl.col('y') == 50)
.join_asof(
df_1
.sort("x")
.filter(pl.col('time') > date(2021,11,1))
.select([
pl.col("time").cummin().alias("time_min"),
pl.col("x").alias("original_x"),
(pl.col("x") + 1).alias("x"),
]),
left_on="y",
right_on="x",
strategy="forward",
)
)
Is there a way to generalise this merge without using a loop / apply function?
Edit: Generalizing a join
One somewhat-dangerous approach to generalizing a join (so that you can run any sub-queries and filters that you like) is to use a "cross" join.
I say "somewhat-dangerous" because the number of row combinations considered in a cross join is M x N, where M and N are the number of rows in your two DataFrames. So if your two DataFrames are 1 million rows each, you have (1 million x 1 million) row combinations that are being considered. This process can exhaust your RAM or simply take a long time.
If you'd like to try it, here's how it would work (along with some arbitrary filters that I constructed, just to show the ultimate flexibility that a cross-join creates).
(
df_2.lazy()
.join(
df_1.lazy(),
how="cross"
)
.filter(pl.col('time_right') >= pl.col('time'))
.groupby('y')
.agg([
pl.col('time').first(),
pl.col('x')
.filter(pl.col('y') > pl.col('x'))
.max()
.alias('max(x) for(y>x)'),
pl.col('time_right')
.filter(pl.col('y') > pl.col('x'))
.max()
.alias('max(time_right) for(y>x)'),
pl.col('time_right')
.filter(pl.col('y') <= pl.col('x'))
.filter(pl.col('time_right') > pl.col('time'))
.min()
.alias('min(time_right) for(two filters)'),
])
.collect()
)
shape: (2, 5)
┌─────┬────────────┬─────────────────┬──────────────────────────┬──────────────────────────────────┐
│ y ┆ time ┆ max(x) for(y>x) ┆ max(time_right) for(y>x) ┆ min(time_right) for(two filters) │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ date ┆ i64 ┆ date ┆ date │
╞═════╪════════════╪═════════════════╪══════════════════════════╪══════════════════════════════════╡
│ 100 ┆ 2021-02-01 ┆ 99 ┆ 2021-04-10 ┆ 2021-04-11 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 50 ┆ 2021-01-01 ┆ 49 ┆ 2021-02-19 ┆ 2021-02-20 │
└─────┴────────────┴─────────────────┴──────────────────────────┴──────────────────────────────────┘
Couple of suggestions:
I strongly recommend running the cross-join in Lazy mode.
Try to filter directly after the cross-join, to eliminate row combinations that you will never need. This reduces the burden on the later groupby step.
Given the explosive potential of row combinations for cross-joins, I tried to steer you toward a join_asof (which did solve the original sample question). But if you need the flexibility beyond what a join_asof can provide, the cross-join will provide ultimate flexibility -- at a cost.
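For intuition about what the cross join is doing, here is a plain-Python model of part of it: pair every row with every row, filter right away, then aggregate per y. It is a sketch only; the tuples stand in for the two DataFrames with the same values as the question, and the dictionary keys are my own names.

```python
from datetime import date, timedelta
from itertools import product

df_1 = [(date(2021, 1, 1) + timedelta(days=x), x) for x in range(366)]  # (time, x)
df_2 = [(date(2021, 1, 1), 50), (date(2021, 2, 1), 100)]                # (time, y)

# Cross join: every df_2 row paired with every df_1 row, filtered immediately
# so the aggregation step only sees the combinations it can actually use.
pairs = [
    (time, y, time_right, x)
    for (time, y), (time_right, x) in product(df_2, df_1)
    if time_right >= time
]

summary = {}
for time, y in df_2:
    group = [(t, x) for (_, pair_y, t, x) in pairs if pair_y == y]
    summary[y] = {
        "max_x": max(x for _, x in group if y > x),
        "min_time_right": min(t for t, x in group if y <= x and t > time),
    }

print(len(pairs))
for y, stats in summary.items():
    print(y, stats)
```

The early filter leaves 701 of the 2 x 366 = 732 pairs here. On real data the unfiltered count is M x N, which is where the RAM goes.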
join_asof
We can use a join_asof to accomplish this, with two wrinkles.
The Algorithm
(
df_2
.sort("y")
.join_asof(
(
df_1
.sort("x")
.select([
pl.col("time").cummax().alias("time_max"),
(pl.col("x") + 1),
])
),
left_on="y",
right_on="x",
strategy="backward",
)
.drop(['x'])
)
shape: (2, 3)
┌────────────┬─────┬────────────┐
│ time ┆ y ┆ time_max │
│ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ date │
╞════════════╪═════╪════════════╡
│ 2021-01-01 ┆ 50 ┆ 2021-02-19 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2021-02-01 ┆ 100 ┆ 2021-04-10 │
└────────────┴─────┴────────────┘
This matches the output of your code.
In steps
Let's add some extra information to our query, to elucidate how it works.
(
df_2
.sort("y")
.join_asof(
(
df_1
.sort("x")
.select([
pl.col("time").cummax().alias("time_max"),
pl.col("x").alias("original_x"),
(pl.col("x") + 1).alias("x"),
])
),
left_on="y",
right_on="x",
strategy="backward",
)
)
shape: (2, 5)
┌────────────┬─────┬────────────┬────────────┬─────┐
│ time ┆ y ┆ time_max ┆ original_x ┆ x │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ date ┆ i64 ┆ date ┆ i64 ┆ i64 │
╞════════════╪═════╪════════════╪════════════╪═════╡
│ 2021-01-01 ┆ 50 ┆ 2021-02-19 ┆ 49 ┆ 50 │
├╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌┤
│ 2021-02-01 ┆ 100 ┆ 2021-04-10 ┆ 99 ┆ 100 │
└────────────┴─────┴────────────┴────────────┴─────┘
Getting the maximum date
Instead of attempting a "non-equi" join or sub-queries to obtain the maximum date for x or any lesser value of x, we can use a simpler approach: sort df_1 by x and calculate the cumulative maximum date for each x. That way, when we join, we can join to a single row in df_1 and be certain that for any x, we are getting the maximum date for that x and all lesser values of x. The cumulative maximum is displayed above as time_max.
less-than (and not less-than-or-equal-to)
From the documentation for join_asof:
A “backward” search selects the last row in the right DataFrame whose ‘on’ key is less than or equal to the left’s key.
Since you want "less than" and not "less than or equal to", we can simply increase each value of x by 1. Since x and y are integers, this will work. The result above displays both the original value of x (original_x), and the adjusted value (x) used in the join_asof.
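The two wrinkles (cumulative max plus the shifted key) can be modelled in plain Python with bisect, which may make the "backward" search easier to picture. This is a sketch under my own naming, using the same values as the question's df_1; it is not how polars implements join_asof.

```python
from bisect import bisect_right
from datetime import date, timedelta
from itertools import accumulate

xs = list(range(366))
times = [date(2021, 1, 1) + timedelta(days=x) for x in xs]

# Cumulative maximum of time in x order (the "time_max" column).
time_max = list(accumulate(times, max))

# Shift the join key by 1 so "<=" on the shifted key means "<" on the original x.
shifted_keys = [x + 1 for x in xs]


def asof_backward(y):
    # Position of the last shifted key that is <= y; -1 means nothing matched.
    pos = bisect_right(shifted_keys, y) - 1
    return None if pos < 0 else (xs[pos], time_max[pos])


print(asof_backward(50))   # (49, datetime.date(2021, 2, 19))
print(asof_backward(100))  # (99, datetime.date(2021, 4, 10))
```

Both results agree with the original_x and time_max columns in the output above.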
If x and y are floats, you can add an arbitrarily small amount to x (e.g., x + 0.000000001) to force the non-equality.
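A caveat on the fixed small increment: for large floats it can be smaller than the gap between neighbouring representable values, in which case the addition does nothing. A quick sketch of the problem and one possible alternative, math.nextafter (standard library, Python 3.9+). Using it for this join is my suggestion, not something from the polars docs.

```python
import math

x = 1e12
print(x + 0.000000001 == x)             # True: the increment is lost to rounding
print(math.nextafter(x, math.inf) > x)  # True: next representable float above x

small = 1.0
print(small + 0.000000001 > small)      # True: fine at this magnitude
```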
I frequently need to calculate the percentage counts of a variable. For example for the dataframe below
df = pl.DataFrame({"person": ["a", "a", "b"],
"value": [1, 2, 3]})
I want to return a dataframe like this:
person  percent
a       0.667
b       0.333
What I have been doing is the following, but I can't help but think there must be a more efficient / polars way to do this
n_rows = len(df)
(
df
.with_column(pl.lit(1)
.alias('percent'))
.groupby('person')
.agg([pl.sum('percent') / n_rows])
)
polars.count will help here. When called without arguments, polars.count returns the number of rows in a particular context.
(
df
.groupby("person")
.agg([pl.count().alias("count")])
.with_column((pl.col("count") / pl.sum("count")).alias("percent_count"))
)
shape: (2, 3)
┌────────┬───────┬───────────────┐
│ person ┆ count ┆ percent_count │
│ --- ┆ --- ┆ --- │
│ str ┆ u32 ┆ f64 │
╞════════╪═══════╪═══════════════╡
│ a ┆ 2 ┆ 0.666667 │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ 1 ┆ 0.333333 │
└────────┴───────┴───────────────┘
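For a quick cross-check of those numbers outside polars, the same count-then-divide logic is a few lines with collections.Counter. This is a plain-Python sketch on the person column only:

```python
from collections import Counter

people = ["a", "a", "b"]

counts = Counter(people)        # rows per person
total = sum(counts.values())    # total number of rows
percent = {person: n / total for person, n in counts.items()}

print(percent)
```

It gives two thirds for "a" and one third for "b", matching the percent_count column.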