Python-Polars update DataFrame function similar to Pandas DataFrame.update()

Thanks for the prompt responses. Based on them, I have modified the question and added a numeric code example.
I am from the market research industry. We analyse survey databases. One requirement for survey tables is that blank rows and columns must not be suppressed. Blank rows and/or columns can occur when a table is generated on a filtered database.
To avoid this zero suppression, we create a blank table with all rows/columns, build the actual table with Pandas, and then update the blank table with the actual numbers using pandas' DataFrame.update(). This way, we retain rows/columns with zero estimates. My sincere apologies for not pasting code earlier; this is my first question on Stack Overflow.
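For reference, a minimal sketch of that pandas workflow looks roughly like this (the frame names and toy values are illustrative only, not the real survey data):
import pandas as pd
# Blank shell holding every category we must report, pre-filled with zeros.
blank = pd.DataFrame(0.0, index=['Low', 'Medium', 'High'], columns=['<50MN', '50-500MN', '500MN+'])
# Table computed on the filtered data; categories that were filtered out are simply absent.
actual = pd.DataFrame({'<50MN': [19.0], '50-500MN': [18.0]}, index=['Low'])
# DataFrame.update() overwrites matching cells in place; everything else stays zero.
blank.update(actual)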
Here's the example dataframe:
data = {
    'state': ['state 1', 'state 2', 'state 3', 'state 4', 'state 5',
              'state 6', 'state 7', 'state 8', 'state 9', 'state 10'],
    'development': ['Low', 'Medium', 'Low', 'Medium', 'High', 'Low', 'Medium', 'Medium', 'Low', 'Medium'],
    'investment': ['50-500MN', '<50MN', '<50MN', '<50MN', '500MN+', '50-500MN', '<50MN', '50-500MN', '<50MN', '<50MN'],
    'population': [22, 19, 25, 24, 19, 21, 33, 36, 22, 36],
    'gdp': [18, 19, 29, 23, 22, 19, 35, 18, 26, 27],
}
I convert it into a DataFrame:
import polars as pl
df = pl.DataFrame(data)
I filter it using a criterion:
df2 = df.filter(pl.col('development') != 'High')
And then generate a pivot table:
df2.pivot(index='development', columns='investment', values='gdp')
The resulting table has one row suppressed ('High' development) and one column suppressed ('500MN+' investment).
What I am looking for is a way to update a blank table containing all rows and columns with the generated pivot table, so that any cells with no value are filled with zero.

What you want is a left join.
Let's say you have:
students = ['Alex', 'Bob', 'Clarke', 'Darren']
data = [('Alex', 10), ('Bob', 12), ('Darren', 13)]
studentsdf = pl.DataFrame({'name': students})
datadf = pl.DataFrame({'name': [x[0] for x in data], 'age': [x[1] for x in data]})
Then you'd do:
studentsdf.join(datadf, on='name', how='left')
shape: (4, 2)
┌────────┬──────┐
│ name ┆ age │
│ --- ┆ --- │
│ str ┆ i64 │
╞════════╪══════╡
│ Alex ┆ 10 │
│ Bob ┆ 12 │
│ Clarke ┆ null │
│ Darren ┆ 13 │
└────────┴──────┘
If you want to "update" studentsdf with that new info, you just assign the result back:
studentsdf=studentsdf.join(datadf, on='name', how='left')
Even though this looks like it makes a copy, under the hood Polars is just moving memory pointers around rather than copying all the underlying data.
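Applied to the question's own data, a minimal sketch of the same idea might look like this (the blank skeleton frame and the names pivoted/result are illustrative, not from the question):
# Pivot of the filtered data, exactly as in the question.
pivoted = df2.pivot(index='development', columns='investment', values='gdp')
# Skeleton frame listing every development level we must report, even if filtered out.
blank = pl.DataFrame({'development': ['Low', 'Medium', 'High']})
# Left join keeps all skeleton rows; fill_null(0) turns the resulting gaps into zeros.
result = blank.join(pivoted, on='development', how='left').fill_null(0)
A column that was filtered out entirely (here '500MN+') would still be missing, since pivot only creates columns that exist in df2; it could be added separately, e.g. with with_column(pl.lit(0).alias('500MN+')).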

You haven't written any code, so I won't either, but you can do what's suggested in https://github.com/pola-rs/polars/issues/6211

Related

Polars - how to parallelize lambda that uses only Polars expressions?

This runs on a single core, despite not using (seemingly) any non-Polars stuff. What am I doing wrong?
(The goal is to convert the list in the doc_ids field of every row into its string representation, i.e. [1, 2, 3] (list[int]) -> '[1, 2, 3]' (string).)
import polars as pl
df = pl.DataFrame(dict(ent = ['a', 'b'], doc_ids = [[2,3], [3]]))
df = (
    df.lazy()
    .with_column(
        pl.concat_str([
            pl.lit('['),
            pl.col('doc_ids').apply(lambda x: x.cast(pl.Utf8)).arr.join(', '),
            pl.lit(']')
        ])
        .alias('docs_str')
    )
    .drop('doc_ids')
).collect()
In general, we want to avoid apply at all costs. It acts like a black-box function that Polars cannot optimize, leading to single-threaded performance.
Here's one way to eliminate apply: replace it with arr.eval. arr.eval lets us treat the list as if it were an expression/Series, so we can use standard expressions on it.
(
df.lazy()
.with_column(
pl.concat_str(
[
pl.lit("["),
pl.col("doc_ids")
.arr.eval(pl.element().cast(pl.Utf8))
.arr.join(", "),
pl.lit("]"),
]
).alias("docs_str")
)
.drop("doc_ids")
.collect()
)
shape: (2, 2)
┌─────┬──────────┐
│ ent ┆ docs_str │
│ --- ┆ --- │
│ str ┆ str │
╞═════╪══════════╡
│ a ┆ [2, 3] │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ [3] │
└─────┴──────────┘

Lightweight syntax for filtering a polars DataFrame on a multi-column key?

I'm wondering if there's a lightweight syntax for filtering a polars DataFrame against a multi-column key, other than inner/anti joins. (There's nothing wrong with the joins, but it would be nice if there's something more compact).
Using the following frame as an example:
import polars as pl
df = pl.DataFrame(
    data=[
        ["x", 123, 4.5, "misc"],
        ["y", 456, 10.0, "other"],
        ["z", 789, 99.5, "value"],
    ],
    columns=["a", "b", "c", "d"],
)
A PostgreSQL statement could use a VALUES expression, like so...
(("a","b") IN (VALUES ('x',123),('y',456)))
...and a pandas equivalent might set a multi-column index.
pf.set_index( ["a","b"], inplace=True )
pf[ pf.index.isin([('x',123),('y',456)]) ]
The polars syntax would look like this:
df.join(
    pl.DataFrame(
        data=[('x', 123), ('y', 456)],
        columns={col: tp for col, tp in df.schema.items() if col in ("a", "b")},
        orient='row',
    ),
    on=["a", "b"],
    how="inner",  # or 'anti' for "not in"
)
Is a multi-column is_in construct, or equivalent expression, currently available with polars? Something like the following would be great if it exists (or could be added):
df.filter( pl.cols("a","b").is_in([('x',123),('y',456)]) )
In the next polars release (>0.13.44) this will work on the struct datatype.
We convert the 2 (or more) columns we want to check into a struct with pl.struct and call the is_in expression. (A conversion to struct is a free operation.)
df = pl.DataFrame(
    data=[
        ["x", 123, 4.5, "misc"],
        ["y", 456, 10.0, "other"],
        ["z", 789, 99.5, "value"],
    ],
    columns=["a", "b", "c", "d"],
)
df.filter(
    pl.struct(["a", "b"]).is_in([{"a": "x", "b": 123}, {"a": "y", "b": 456}])
)
shape: (2, 4)
┌─────┬─────┬──────┬───────┐
│ a ┆ b ┆ c ┆ d │
│ --- ┆ --- ┆ --- ┆ --- │
│ str ┆ i64 ┆ f64 ┆ str │
╞═════╪═════╪══════╪═══════╡
│ x ┆ 123 ┆ 4.5 ┆ misc │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ y ┆ 456 ┆ 10.0 ┆ other │
└─────┴─────┴──────┴───────┘
Filtering by data in another DataFrame.
The idiomatic way to filter data by presence in another DataFrame is a semi or anti join. An inner join also filters by presence, but it adds the columns of the right-hand DataFrame, whereas a semi join only filters the left-hand side.
semi: keep rows whose key appears in the other DataFrame
anti: keep rows whose key does not appear in the other DataFrame
The reason why these joins are preferred over is_in is that they are much faster and currently allow for more optimization.
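For illustration, a minimal sketch of the join-based filter on the question's frame might look like this (the keys frame name is just a placeholder for the two-row key set from the question):
# Two-row key frame listing the (a, b) pairs we want to match on.
keys = pl.DataFrame(
    data=[("x", 123), ("y", 456)],
    columns=["a", "b"],
    orient="row",
)
# semi: keep only rows of df whose (a, b) key appears in keys.
df.join(keys, on=["a", "b"], how="semi")
# anti: keep only rows of df whose (a, b) key does not appear in keys.
df.join(keys, on=["a", "b"], how="anti")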

Expression in Polars select context that refers to earlier alias

Is there a way to allow an expression in Polars to refer to a previous aliased expression? For example, this code that defines two new columns errors because the second new column refers to the first:
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1]))
df.select([
    (pl.col('x') + 1).alias('y'),
    (pl.col('y') * 2).alias('z'),
])
# pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value:
# NotFound("Unable to get field named \"y\". Valid fields: [\"x\"]")
The error makes it obvious that the failure is caused by the first alias not being visible to the second expression. Is there a straightforward way to make this work?
All polars expressions within a context are executed in parallel. So they cannot refer to a column that does not yet exist.
A context is:
df.with_columns
df.select
df.groupby(..).agg
This means you need to enforce sequential execution for expressions that refer to other expressions' outputs.
In your case I would do:
(df.with_column(
    (pl.col('x') + 1).alias('y')
).select([
    pl.col('y'),
    (pl.col('y') * 2).alias('z')
]))
One workaround is to pull each new column out into its own with_column call and then do a final select to keep only the columns you actually want. You will probably want to make sure this is done lazily.
import polars as pl
df = pl.DataFrame(dict(x=[0, 0, 1]))
(df
    .lazy()
    .with_column((pl.col("x") + 1).alias("y"))
    .with_column((pl.col("y") * 2).alias("z"))
    .select(["y", "z"])
    .collect()
)
# shape: (3, 2)
# ┌─────┬─────┐
# │ y ┆ z │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1 ┆ 2 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1 ┆ 2 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 2 ┆ 4 │
# └─────┴─────┘

Groupby aggregate two columns into a dictionary in Polars

Given the following data, I'm looking to group by and combine two columns into one column holding a dictionary. One column supplies the keys, while the values come from another column that is aggregated into a list first.
import polars as pl
data = pl.DataFrame(
    {
        "names": ["foo", "ham", "spam", "cheese", "egg", "foo"],
        "dates": ["1", "1", "2", "3", "3", "4"],
        "groups": ["A", "A", "B", "B", "B", "C"],
    }
)
>>> print(data)
names dates groups
0 foo 1 A
1 ham 1 A
2 spam 2 B
3 cheese 3 B
4 egg 3 B
5 foo 4 C
# This is what I'm trying to do:
groups combined
0 A {'1': ['foo', 'ham']}
1 B {'2': ['spam'], '3': ['cheese', 'egg']}
2 C {'4': ['foo']}
In pandas I can do this using two groupby statements, and in pyspark with a set of operations around "map_from_entries", but despite various attempts I haven't figured out a way in polars. So far I use agg_list(), convert to pandas and apply a lambda. While this works, it certainly doesn't feel right.
data = data.groupby(["groups", "dates"])["names"].agg_list()
data = (
    data.to_pandas()
    .groupby(["groups"])
    .apply(lambda x: dict(zip(x["dates"], x["names_agg_list"])))
    .reset_index(name="combined")
)
Alternatively, inspired by this post, I've tried a number of variations similar to the following, including converting the dict to JSON strings, among other things.
data = data.groupby(["groups"]).agg(
    pl.apply(exprs=["dates", "names_agg_list"], f=build_dict).alias("combined")
)
With the release of polars>=0.12.10 you can do this:
print(data
    .groupby(["groups", "dates"]).agg(pl.col("names").list().keep_name())
    .groupby("groups")
    .agg([
        pl.apply([pl.col("dates"), pl.col("names")], lambda s: dict(zip(s[0], s[1].to_list())))
    ])
)
shape: (3, 2)
┌────────┬─────────────────────────────────────┐
│ groups ┆ dates │
│ --- ┆ --- │
│ str ┆ object │
╞════════╪═════════════════════════════════════╡
│ A ┆ {'1': ['foo', 'ham']} │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ C ┆ {'4': ['foo']} │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ B ┆ {'3': ['cheese', 'egg'], '2': ['... │
└────────┴─────────────────────────────────────┘
This is not really how you should be using DataFrames, though. There is likely a solution that lets you work with flatter DataFrames and doesn't require putting slow Python objects into them.
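As an illustration of that suggestion (a sketch, not code from the answer), one flatter layout keeps the per-date name lists as native list columns instead of Python dict objects:
flat = (
    data
    .groupby(["groups", "dates"])
    .agg(pl.col("names").list().keep_name())   # names per (group, date) as a list column
    .groupby("groups")
    .agg([pl.col("dates"), pl.col("names")])   # per group: a list of dates plus a list of name-lists
)
Everything stays in native polars types, so later operations on these columns can still be optimized and run in parallel.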

Is it possible to create a 2d json array with PostgreSQL

tab:
num │ value_two │ value_three │ value_four
─────┼───────────┼─────────────┼────────────
1 │ a │ A │ 4.0
2 │ a │ A2 │ 75.0
3 │ b │ A3 │ 7.0
I want to create a 2D json array like this
[[1,"a","A",4.0],[2,"a","A2",75.0],[3,"b","A3",7.0]]
I have tried two things:
First, SELECT json_agg(tab) FROM tab, but it returns an array of objects.
The second thing I tried almost works, but it returns a 2D array of strings.
SELECT json_agg(ARRAY[num::TEXT,value_two,value_three,value_four::TEXT]) FROM tab
[["1","a","A","4.0"],["2","a","A2","75.0"],["3","b","A3","7.0"]]
Short answer:
=# select json_agg(json_build_array(num, value_two, value_three, value_four)) as answer
from tab;
answer
-----------------------------------------------------------------
[[1, "a", "A", 4.0], [2, "a", "A2", 75.0], [3, "b", "A3", 7.0]]
(1 row)
Native PostgreSQL arrays like the one you created with
ARRAY[num::TEXT,value_two,value_three,value_four::TEXT]
are strictly typed, which is why you had to cast num and value_four to text.
To get the type mixing allowed in JSON, use json_build_array() instead.