How to get index corresponding to quantile in Polars List? - python-polars

Suppose I have the following dataframe
df = pl.DataFrame({'x':[[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]]})
To get the nth percentile, I can do the following:
list_quantile_30 = pl.element().quantile(0.3)
df.select(pl.col('x').arr.eval(list_quantile_30))
But I can't figure out how to get the index corresponding to the percentile. Here is how I would do it using numpy:
import numpy as np
series = [0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]
np.searchsorted(series, np.percentile(series, 30))
Is there a way to do this in a Polars way without using apply?

Continuing from your example, you could use pl.arg_where to search for a condition.
df = pl.DataFrame({'x':[[0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20]]})
list_quantile_30 = pl.element().quantile(0.3)
df.with_column(
    pl.col('x').arr.eval(
        pl.arg_where(list_quantile_30 <= pl.element()).first()
    ).flatten().alias("arg_where")
)
shape: (1, 2)
┌────────────────┬───────────┐
│ x ┆ arg_where │
│ --- ┆ --- │
│ list[i64] ┆ u32 │
╞════════════════╪═══════════╡
│ [0, 2, ... 20] ┆ 3 │
└────────────────┴───────────┘
This convinces me to add a pl.search_sorted in polars as well.
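As a follow-up, recent Polars versions do expose a search_sorted. A minimal sketch on a plain Series, assuming your release has Series.search_sorted with a side argument:
import polars as pl

s = pl.Series([0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20])
# index of the first element >= the 0.3 quantile, analogous to np.searchsorted
s.search_sorted(s.quantile(0.3), side='left')  # 3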

Python-Polars update DataFrame function similar to Pandas DataFrame.update()

Thanks for the prompt responses. Based on them, I have modified the question and also provided a numeric code example.
I am from the market research industry. We analyse survey databases. One requirement for survey tables is that blank rows and columns must not be suppressed. Blank rows and/or columns may result when we generate a table on a filtered database.
To avoid this zero suppression, we create a blank table with all rows/columns, then create the actual table with pandas and update the blank table with the actual table's numbers using the pandas DataFrame.update function. This way, we retain rows/columns with zero estimates. My sincere apologies for not pasting code, as this is my first question on Stack Overflow.
Here's the example dataframe:
dict = {
    'state': ['state 1', 'state 2', 'state 3', 'state 4', 'state 5',
              'state 6', 'state 7', 'state 8', 'state 9', 'state 10'],
    'development': ['Low', 'Medium', 'Low', 'Medium', 'High', 'Low', 'Medium', 'Medium', 'Low', 'Medium'],
    'investment': ['50-500MN', '<50MN', '<50MN', '<50MN', '500MN+', '50-500MN', '<50MN', '50-500MN', '<50MN', '<50MN'],
    'population': [22, 19, 25, 24, 19, 21, 33, 36, 22, 36],
    'gdp': [18, 19, 29, 23, 22, 19, 35, 18, 26, 27]
}
I convert it into a dataframe:
df = pl.DataFrame(dict)
I filter it using a criterion:
df2 = df.filter(pl.col('development') != 'High')
And then generate a pivot table
df2.pivot(index='development', columns='investment', values='gdp')
The resulting table has one row suppressed ('High' development) and one column suppressed ('500MN+' investment).
The solution I am looking for is to update the blank table with all rows and columns with the pivot table generated. Wherever there are no values, they would be replaced with a zero.
What you want is a left join.
Let's say you have:
# example data reconstructed from the output below
students = ['Alex', 'Bob', 'Clarke', 'Darren']
data = [('Alex', 10), ('Bob', 12), ('Darren', 13)]

studentsdf = pl.DataFrame({'name': students})
datadf = pl.DataFrame({'name': [x[0] for x in data], 'age': [x[1] for x in data]})
Then you'd do:
studentsdf.join(datadf, on='name', how='left')
shape: (4, 2)
┌────────┬──────┐
│ name ┆ age │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪══════╡
│ Alex ┆ 10 │
│ Bob ┆ 12 │
│ Clarke ┆ null │
│ Darren ┆ 13 │
└────────┴──────┘
If you want to "update" the studentsdf with that new info you'd just assign it like this:
studentsdf=studentsdf.join(datadf, on='name', how='left')
Even though that looks like it makes a copy, under the hood Polars is just moving memory pointers around rather than copying all the underlying data.
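Applied to the pivot example from the question, a rough sketch of the same idea, continuing from the question's df2 (the template frame listing every development level is hypothetical, and fill_null turns the missing cells into zeros):
# hypothetical "blank" template: every row the report must show
template = pl.DataFrame({'development': ['Low', 'Medium', 'High']})

pivoted = df2.pivot(index='development', columns='investment', values='gdp')

# left join keeps all template rows; cells with no data become null, then 0
result = template.join(pivoted, on='development', how='left').fill_null(0)
A suppressed column such as '500MN+' would still need to be added explicitly, e.g. with with_column(pl.lit(0).alias('500MN+')).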
You haven't written any code, so I won't either, but you can do what's suggested in https://github.com/pola-rs/polars/issues/6211
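For what it's worth, newer Polars releases also ship a DataFrame.update method, which mirrors the pandas workflow described in the question. A sketch, assuming your version has it and using hypothetical blank/actual frames:
# `blank` is the pre-built table with all rows/columns, `actual` the computed one;
# with left-join semantics every row of `blank` is kept and matching values are overwritten
updated = blank.update(actual, on='development', how='left')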

when / then / otherwise with values from numpy array

Say I have
df = pl.DataFrame({'group': [1, 1, 1, 3, 3, 3, 4, 4]})
I have a numpy array of values, which I'd like to replace 'group' 3 with
values = np.array([9, 8, 7])
Here's what I've tried:
(
    df
    .with_column(
        pl.when(pl.col('group') == 3)
        .then(values)
        .otherwise(pl.col('group'))
        .alias('group')
    )
)
In [4]: df.with_column(pl.when(pl.col('group')==3).then(values).otherwise(pl.col('group')).alias('group'))
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
Cell In [4], line 1
----> 1 df.with_column(pl.when(pl.col('group')==3).then(values).otherwise(pl.col('group')).alias('group'))
File ~/tmp/.venv/lib/python3.8/site-packages/polars/internals/whenthen.py:132, in When.then(self, expr)
111 def then(
112 self,
113 expr: (
(...)
121 ),
122 ) -> WhenThen:
123 """
124 Values to return in case of the predicate being `True`.
125
(...)
130
131 """
--> 132 expr = pli.expr_to_lit_or_expr(expr)
133 pywhenthen = self._pywhen.then(expr._pyexpr)
134 return WhenThen(pywhenthen)
File ~/tmp/.venv/lib/python3.8/site-packages/polars/internals/expr/expr.py:118, in expr_to_lit_or_expr(expr, str_to_lit)
116 return expr.otherwise(None)
117 else:
--> 118 raise ValueError(
119 f"did not expect value {expr} of type {type(expr)}, maybe disambiguate with"
120 " pl.lit or pl.col"
121 )
ValueError: did not expect value [9 8 7] of type <class 'numpy.ndarray'>, maybe disambiguate with pl.lit or pl.col
How can I do this correctly?
A few things to consider.
One is that you should always convert your numpy arrays to Polars Series, as we use the Arrow memory specification underneath, not numpy's.
Second is that when -> then -> otherwise operates on columns of equal length. We nudge the API in a direction where you define a logical statement based on the columns in your DataFrame, so you should not need to know the indices (nor the length of a group) that you want to replace. This allows for many optimizations, because if you do not define indices to replace, we can push a filter down before that expression.
Anyway, your specific situation does know the length of the group, so we must use something different. We can first compute the indices where the condition holds and then mutate at those indices.
df = pl.DataFrame({
    "group": [1, 1, 1, 3, 3, 3, 4, 4]
})
values = np.array([9, 8, 7])

# compute indices of the predicate
idx = df.select(
    pl.arg_where(pl.col("group") == 3)
).to_series()

# mutate on those locations
df.with_column(
    df["group"].set_at_idx(idx, pl.Series(values))
)
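On newer Polars versions the same approach should carry over, assuming the renames set_at_idx -> scatter and with_column -> with_columns apply to your release (same df and values as above):
idx = df.select(pl.arg_where(pl.col("group") == 3)).to_series()
df.with_columns(df["group"].scatter(idx, pl.Series(values)))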
Here's all I could come up with
df.with_column(
    pl.col("group")
    .cumcount()
    .over(pl.col("group"))
    .alias("idx")
).apply(
    lambda x: values[x[1]] if x[0] == 3 else x[0]
).select(
    pl.col("apply").alias("group")
)
Surely there's a simpler way?
Output:
shape: (8, 1)
┌───────┐
│ group │
│ --- │
│ i64 │
╞═══════╡
│ 1 │
├╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌┤
│ 9 │
├╌╌╌╌╌╌╌┤
│ 8 │
├╌╌╌╌╌╌╌┤
│ 7 │
├╌╌╌╌╌╌╌┤
│ 4 │
├╌╌╌╌╌╌╌┤
│ 4 │
└───────┘

Polars - how to parallelize lambda that uses only Polars expressions?

This runs on a single core, despite seemingly not using any non-Polars stuff. What am I doing wrong?
(The goal is to convert the list in the doc_ids field of every row into its string representation, s.t. [1, 2, 3] (list[int]) -> '[1, 2, 3]' (string).)
import polars as pl
df = pl.DataFrame(dict(ent = ['a', 'b'], doc_ids = [[2,3], [3]]))
df = (
    df.lazy()
    .with_column(
        pl.concat_str([
            pl.lit('['),
            pl.col('doc_ids').apply(lambda x: x.cast(pl.Utf8)).arr.join(', '),
            pl.lit(']')
        ])
        .alias('docs_str')
    )
    .drop('doc_ids')
).collect()
In general, we want to avoid apply at all costs. It acts like a black-box function that Polars cannot optimize, leading to single-threaded performance.
Here's one way to eliminate apply: replace it with arr.eval. arr.eval lets us treat a list as if it were an Expression/Series, so we can use standard expressions on it.
(
    df.lazy()
    .with_column(
        pl.concat_str(
            [
                pl.lit("["),
                pl.col("doc_ids")
                .arr.eval(pl.element().cast(pl.Utf8))
                .arr.join(", "),
                pl.lit("]"),
            ]
        ).alias("docs_str")
    )
    .drop("doc_ids")
    .collect()
)
shape: (2, 2)
┌─────┬──────────┐
│ ent ┆ docs_str │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════════╡
│ a ┆ [2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ [3] │
└─────┴──────────┘
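Another way to drop the apply, assuming your Polars version supports casting the inner dtype of a list column directly, is to cast the whole column instead of using arr.eval (same df as above):
(
    df.lazy()
    .with_column(
        pl.concat_str(
            [
                pl.lit("["),
                # cast list[i64] -> list[str] in one step, then join
                pl.col("doc_ids").cast(pl.List(pl.Utf8)).arr.join(", "),
                pl.lit("]"),
            ]
        ).alias("docs_str")
    )
    .drop("doc_ids")
    .collect()
)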

Lightweight syntax for filtering a polars DataFrame on a multi-column key?

I'm wondering if there's a lightweight syntax for filtering a polars DataFrame against a multi-column key, other than inner/anti joins. (There's nothing wrong with the joins, but it would be nice if there's something more compact).
Using the following frame as an example:
import polars as pl
df = pl.DataFrame(
    data=[
        ["x", 123, 4.5, "misc"],
        ["y", 456, 10.0, "other"],
        ["z", 789, 99.5, "value"],
    ],
    columns=["a", "b", "c", "d"],
)
A PostgreSQL statement could use a VALUES expression, like so...
(("a","b") IN (VALUES ('x',123),('y',456)))
...and a pandas equivalent might set a multi-column index.
pf.set_index( ["a","b"], inplace=True )
pf[ pf.index.isin([('x',123),('y',456)]) ]
The polars syntax would look like this:
df.join(
    pl.DataFrame(
        data=[('x', 123), ('y', 456)],
        columns={col: tp for col, tp in df.schema.items() if col in ("a", "b")},
        orient='row',
    ),
    on=["a", "b"],
    how="inner",  # or 'anti' for "not in"
)
Is a multi-column is_in construct, or equivalent expression, currently available with polars? Something like the following would be great if it exists (or could be added):
df.filter( pl.cols("a","b").is_in([('x',123),('y',456)]) )
In the next Polars release (>0.13.44) this will work on the struct datatype.
We convert the two (or more) columns we want to check into a struct with pl.struct and call the is_in expression. (Conversion to a struct is a free operation.)
df = pl.DataFrame(
    data=[
        ["x", 123, 4.5, "misc"],
        ["y", 456, 10.0, "other"],
        ["z", 789, 99.5, "value"],
    ],
    columns=["a", "b", "c", "d"],
)

df.filter(
    pl.struct(["a", "b"]).is_in([{"a": "x", "b": 123}, {"a": "y", "b": 456}])
)
shape: (2, 4)
┌─────┬─────┬──────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ str │
╞═════╪═════╪══════╪═══════╡
│ x ┆ 123 ┆ 4.5 ┆ misc │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y ┆ 456 ┆ 10.0 ┆ other │
└─────┴─────┴──────┴───────┘
Filtering by data in another DataFrame.
The idiomatic way to filter data by presence in another DataFrame is with semi and anti joins. Inner joins also filter by presence, but they include the columns of the right-hand DataFrame, whereas a semi join does not; it only filters the left-hand side.
semi: keep rows/keys that are in both DataFrames
anti: remove rows/keys that are in both DataFrames
The reason why these joins are preferred over is_in is that they are much faster and currently allow for more optimization.
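A minimal sketch of the semi/anti join approach for the example above (the keys frame is hypothetical):
keys = pl.DataFrame({"a": ["x", "y"], "b": [123, 456]})

df.join(keys, on=["a", "b"], how="semi")  # keep rows whose (a, b) key is in keys
df.join(keys, on=["a", "b"], how="anti")  # keep rows whose (a, b) key is not in keys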

How can I count elements while aggregating in ClickHouse?

I have data like this:
customer_id - product_id
1 - 10
1 - 11
1 - 12
1 - 11
2 - 15
2 - 20
After aggregating, I want to get:
customer_id - product_id
1 - {10:1, 11:2, 12:1}
2 - {15:1, 20:1}
Try this query:
SELECT
    customer_id,
    arrayMap((product_id, count) -> (product_id, count),
             untuple(sumMap([product_id], [1]))) AS result
FROM
(
    /* Emulate the test dataset. */
    SELECT
        data.1 AS customer_id,
        data.2 AS product_id
    FROM
    (
        SELECT arrayJoin([(1, 10), (1, 11), (1, 12), (1, 11),
                          (2, 15), (2, 20)]) AS data
    )
)
GROUP BY customer_id
/*
┌─customer_id─┬─result─────────────────┐
│ 1 │ [(10,1),(11,2),(12,1)] │
│ 2 │ [(15,1),(20,1)] │
└─────────────┴────────────────────────┘
*/
WITH dataset AS
(
    SELECT
        data.1 AS customer_id,
        data.2 AS product_id
    FROM
    (
        SELECT arrayJoin([
            (1, 10), (1, 11), (1, 12), (1, 11), (2, 15), (2, 20)
        ]) AS data
    )
)
SELECT
    customer_id,
    arrayMap(
        x -> (x, arrayCount(y -> (y = x), groupArray(product_id) AS product_ids)),
        arrayDistinct(product_ids)
    ) AS result
FROM dataset
GROUP BY customer_id
┌─customer_id─┬─result─────────────────┐
│ 1 │ [(10,1),(11,2),(12,1)] │
│ 2 │ [(15,1),(20,1)] │
└─────────────┴────────────────────────┘