Sort polars DataFrame using column with text and numericals - python-polars

If I have a DataFrame like
┌────────┬─────────────────┐
│ Name   ┆ Value           │
│ ---    ┆ ---             │
│ str    ┆ list[str]       │
╞════════╪═════════════════╡
│ No. 1  ┆ ["None", "!!!"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 10 ┆ ["0.3", "OK"]   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 2  ┆ ["1.1", "OK"]   │
└────────┴─────────────────┘
How can I sort it by the numerical value? I.e. I want to pull the string from the Name column and extract only the numerical part to use as the sort key, giving:
┌────────┬─────────────────┐
│ Name   ┆ Value           │
│ ---    ┆ ---             │
│ str    ┆ list[str]       │
╞════════╪═════════════════╡
│ No. 1  ┆ ["None", "!!!"] │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 2  ┆ ["1.1", "OK"]   │
├╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ No. 10 ┆ ["0.3", "OK"]   │
└────────┴─────────────────┘
I can't see the polars expression needed, and I'm not sure you can pass a custom Python function.
Thanks

You can use str.extract to get the number from the string, using a regular expression.
Then cast it to int and sort:
pl.DataFrame({"Name": ["No. 1", "No. 12", "No. 2"]}).sort(
pl.col("Name").str.extract(r"No\. ([0-9]*)", 1).cast(int)
)

Also, if you want to sort by the numbers inside the Value list:
df.sort(
    pl.col("Value").arr.get(0).cast(pl.Float32, strict=False),
    nulls_last=False
)
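Putting the two together on the frame from the question (a sketch only; the data below is retyped from the question, and the regex simply grabs the digits after "No. "):

import polars as pl

df = pl.DataFrame(
    {
        "Name": ["No. 1", "No. 10", "No. 2"],
        "Value": [["None", "!!!"], ["0.3", "OK"], ["1.1", "OK"]],
    }
)

# Extract the digits from "Name", cast to integer, and use that as the sort key.
df.sort(pl.col("Name").str.extract(r"No\. (\d+)", 1).cast(pl.Int64))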

Related

Polars table convert a list column to separate rows i.e. unnest a list column to multiple rows

I have a Polars dataframe in the form:
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
┌─────┬────────────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[str] │
╞═════╪════════════╡
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
└─────┴────────────┘
I want to convert it to the following form. I plan to save to a parquet file, and query the file (with sql).
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ "a" │
│ 1 ┆ "b" │
│ 2 ┆ "a" │
│ 3 ┆ "c" │
│ 3 ┆ "d" │
└─────┴─────┘
I have seen an answer that works on struct columns, but df.unnest('b') on my data results in the error:
SchemaError: Series of dtype: List(Utf8) != Struct
I also found a github issue that shows list can be converted to a struct, but I can't work out how to do that, or if it applies here.
To decompose a column of lists, you can use the .explode() method (doc):
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
df.explode("b")
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 1 ┆ a │
│ 1 ┆ b │
│ 2 ┆ a │
│ 3 ┆ c │
│ 3 ┆ d │
└─────┴─────┘
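Since the question mentions saving the result to a Parquet file, a possible follow-up (the file name is just a placeholder, and depending on your polars version the method may be called to_parquet):

df.explode("b").write_parquet("exploded.parquet")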

How to select the last non-null value from one column and also the value from another column on the same row in Polars?

Below is a non-working example in which I retrieve the last available 'Open', but how do I get the corresponding 'Time'?
sel = self.data.select([pl.col('Time'),
                        pl.col('Open').drop_nulls().last()])
For instance, you can use .filter() to select the rows that do not contain null and then take the last row.
Here is an example:
df = pl.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": ["cat", None, "owl", None, None]
})
┌─────┬──────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪══════╡
│ 1 ┆ cat │
│ 2 ┆ null │
│ 3 ┆ owl │
│ 4 ┆ null │
│ 5 ┆ null │
└─────┴──────┘
df.filter(
    pl.col("b").is_not_null()
).select(pl.all().last())
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═════╡
│ 3 ┆ owl │
└─────┴─────┘
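The same idea can also be expressed per column inside a single select, filtering on the null mask of "b" within each expression (a sketch, not part of the original answer):

df.select([
    pl.col("a").filter(pl.col("b").is_not_null()).last(),
    pl.col("b").drop_nulls().last(),
])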

Ordinal encoding of column in polars

I would like to do an ordinal encoding of a column. Pandas has the nice and convenient method pd.factorize(); however, I would like to achieve the same in polars.
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
┌─────┬───────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
╞═════╪═══════╡
│ 5 ┆ hi │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 8 ┆ hello │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ 10 ┆ hi │
└─────┴───────┘
desired result:
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 0 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 0 │
└─────┴─────┘
You can join with a dummy DataFrame that contains the unique values and the ordinal encoding you are interested in:
df = pl.DataFrame({"a": [5, 8, 10], "b": ["hi", "hello", "hi"]})
unique = df.select(
    pl.col("b").unique(maintain_order=True)
).with_row_count(name="ordinal")
df.join(unique, on="b")
Or you could "misuse" the fact that categorical values are backed by u32 integers.
df.with_column(
    pl.col("b").cast(pl.Categorical).to_physical().alias("ordinal")
)
Both methods output:
shape: (3, 3)
┌─────┬───────┬─────────┐
│ a ┆ b ┆ ordinal │
│ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ u32 │
╞═════╪═══════╪═════════╡
│ 5 ┆ hi ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 8 ┆ hello ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┤
│ 10 ┆ hi ┆ 0 │
└─────┴───────┴─────────┘
Here's another way to do it, although I doubt it's better than the dummy DataFrame approach from @ritchie46:
(
    df.with_columns([
        pl.col('b').unique().list().alias('uniq'),
        pl.col('b').unique().list().arr.eval(pl.element().rank()).alias('uniqid'),
    ])
    .explode(['uniq', 'uniqid'])
    .filter(pl.col('b') == pl.col('uniq'))
    .select(pl.exclude('uniq'))
    .with_column(pl.col('uniqid') - 1)
)
There's almost certainly a way to improve this, but basically it creates a new column called uniq, which is a list of all the unique values of the column, as well as uniqid, which (I think, and it seems to be) is the 1-indexed order of the values. It then explodes those, creating a row for every value in uniq, and then filters out the rows that don't equal column b. Since rank gives a 1-index (rather than a 0-index) you have to subtract 1 and exclude the uniq column, which we don't care about since it's the same as b.
If the order is not important you could use .rank(method="dense")
>>> df.select(pl.all().rank(method="dense") - 1)
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 1 │
└─────┴─────┘
If it is, you could:
>>> (
...     df.with_row_count()
...     .with_columns([
...         pl.col("row_nr").first()
...         .over(col)
...         .rank(method="dense")
...         .alias(col) - 1
...         for col in df.columns
...     ])
...     .drop("row_nr")
... )
shape: (3, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞═════╪═════╡
│ 0 ┆ 0 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 0 │
└─────┴─────┘

In Python polars convert a json string column to dict for filtering

Hi, I have a dataframe where I have a column called tags which is a json string.
I want to filter this dataframe on the tags column so it only contains rows where a certain tag key is present or where a tag has a particular value.
I guess I could do a string-contains match, but I think it may be more robust to convert the json into a dict first and use has_key etc.?
What would be the recommended way to do this in python polars ?
Thanks
Polars does not have a generic dictionary type. Instead, dictionaries are imported/mapped as structs. Each dictionary key is mapped to a struct 'field name', and the corresponding dictionary value becomes the value of this field.
However, there are some constraints for creating a Series of type struct. Two of them are:
- all structs must have the same field names.
- the field names must be listed in the same order.
In your description, you mention has_key, which indicates that the dictionaries will not have the same keys. As such, creating a column of struct from your dictionaries will not work. (For more information, you can see this Stack Overflow response.)
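For illustration only (not the question's data, and assuming a polars version with struct support): dictionaries that do share identical keys map cleanly to a struct column:

import polars as pl

# Both dicts have the same keys in the same order, so polars can infer
# a single struct dtype for the "tags" column.
pl.DataFrame({"tags": [
    {"name": "Maria", "office": "Seattle"},
    {"name": "Josh", "office": "Austin"},
]})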
json_path_match
I suggest using json_path_match, which extracts values based on some simple JSONPath syntax. Using JSONPath syntax, you should be able to query whether a key exists and retrieve its value. (For simple unnested dictionaries, these are the same query.)
For example, let's start with this data:
import polars as pl
json_list = [
    """{"name": "Maria",
    "position": "developer",
    "office": "Seattle"}""",
    """{"name": "Josh",
    "position": "analyst",
    "termination_date": "2020-01-01"}""",
    """{"name": "Jorge",
    "position": "architect",
    "office": "",
    "manager_st_dt": "2020-01-01"}""",
]
df = pl.DataFrame(
    {
        "tags": json_list,
    }
).with_row_count("id", 1)
df
shape: (3, 2)
┌─────┬────────────────────┐
│ id ┆ tags │
│ --- ┆ --- │
│ u32 ┆ str │
╞═════╪════════════════════╡
│ 1 ┆ {"name": "Maria", │
│ ┆ "posit... │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ {"name": "Josh", │
│ ┆ "positi... │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ {"name": "Jorge", │
│ ┆ "posit... │
└─────┴────────────────────┘
To query for values:
df.with_columns([
    pl.col('tags').str.json_path_match(r"$.name").alias('name'),
    pl.col('tags').str.json_path_match(r"$.office").alias('location'),
    pl.col('tags').str.json_path_match(r"$.manager_st_dt").alias('manager start date'),
])
shape: (3, 5)
┌─────┬────────────────────┬───────┬──────────┬────────────────────┐
│ id ┆ tags ┆ name ┆ location ┆ manager start date │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ str ┆ str ┆ str ┆ str │
╞═════╪════════════════════╪═══════╪══════════╪════════════════════╡
│ 1 ┆ {"name": "Maria", ┆ Maria ┆ Seattle ┆ null │
│ ┆ "posit... ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ {"name": "Josh", ┆ Josh ┆ null ┆ null │
│ ┆ "positi... ┆ ┆ ┆ │
├╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 3 ┆ {"name": "Jorge", ┆ Jorge ┆ ┆ 2020-01-01 │
│ ┆ "posit... ┆ ┆ ┆ │
└─────┴────────────────────┴───────┴──────────┴────────────────────┘
Notice the null values. This is the return value when a key is not found. We'll use this fact for the has_key functionality you mentioned.
Also, if we look at the "location" column, you'll see that json_path_match does distinguish between an empty string ("office": "") and a key that is not found.
To filter for the presence of a key, we simply keep the rows where the match is not null:
df.filter(
    pl.col('tags').str.json_path_match(r"$.manager_st_dt").is_not_null()
)
shape: (1, 2)
┌─────┬───────────────────┐
│ id ┆ tags │
│ --- ┆ --- │
│ u32 ┆ str │
╞═════╪═══════════════════╡
│ 3 ┆ {"name": "Jorge", │
│ ┆ "posit... │
└─────┴───────────────────┘
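The same approach covers the other half of the question, filtering on a particular tag value rather than just its presence. A sketch, e.g. keeping rows where the position tag equals "developer":

df.filter(
    pl.col('tags').str.json_path_match(r"$.position") == "developer"
)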
The json_path_match will also work with nested structures. (See the Syntax page for details.)
One limitation, however: json_path_match will only return the first match for a query, rather than a list of matches. If your JSON strings are not lists or nested dictionaries, this won't be a problem.

How to form dynamic expressions without breaking on types

Any way to make the dynamic polars expressions not break with errors?
Currently I'm just excluding the columns by type, but just wondering if there is a better way.
For example, I have a df coming from parquet; if I just execute an expression on all columns, it might break for certain types. Instead I want to contain these errors and possibly return a default value like None or -1 or something else.
import polars as pl
df = pl.scan_parquet("/path/to/data/*.parquet")
print(df.schema)
# Prints: {'date_time': <class 'polars.datatypes.Datetime'>, 'incident': <class 'polars.datatypes.Utf8'>, 'address': <class 'polars.datatypes.Utf8'>, 'city': <class 'polars.datatypes.Utf8'>, 'zipcode': <class 'polars.datatypes.Int32'>}
Now if i form generic expression on top of this, there are chances it may fail. For example,
# Finding positive count across all columns
# Fails due to: exceptions.ComputeError: cannot compare Utf8 with numeric data
print(df.select((pl.all() > 0).count().prefix("__positive_count_")).collect())
# Finding unique counts across all columns
# Fails due to: pyo3_runtime.PanicException: 'unique_counts' not implemented for datetime[ns] data types
print(df.select(pl.all().unique_counts().prefix("__unique_count_")).collect())
# Finding empty string count across all columns
# Fails due to: exceptions.SchemaError: Series dtype Int32 != utf8
# Note: this could have been avoided by doing an explicit cast to string first
print(df.select((pl.all().str.lengths() > 0).count().prefix("__empty_count_")).collect())
I'll keep to things that work in lazy mode, as it appears that you are working in lazy mode with Parquet files.
Let's use this data as an example:
import polars as pl
from datetime import datetime
df = pl.DataFrame(
    {
        "col_int": [-2, -2, 0, 2, 2],
        "col_float": [-20.0, -10, 10, 20, 20],
        "col_date": pl.date_range(datetime(2020, 1, 1), datetime(2020, 5, 1), "1mo"),
        "col_str": ["str1", "str2", "", None, "str5"],
        "col_bool": [True, False, False, True, False],
    }
).lazy()
df.collect()
shape: (5, 5)
┌─────────┬───────────┬─────────────────────┬─────────┬──────────┐
│ col_int ┆ col_float ┆ col_date ┆ col_str ┆ col_bool │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[ns] ┆ str ┆ bool │
╞═════════╪═══════════╪═════════════════════╪═════════╪══════════╡
│ -2 ┆ -20.0 ┆ 2020-01-01 00:00:00 ┆ str1 ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ -2 ┆ -10.0 ┆ 2020-02-01 00:00:00 ┆ str2 ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 0 ┆ 10.0 ┆ 2020-03-01 00:00:00 ┆ ┆ false │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ 2020-04-01 00:00:00 ┆ null ┆ true │
├╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ 2 ┆ 20.0 ┆ 2020-05-01 00:00:00 ┆ str5 ┆ false │
└─────────┴───────────┴─────────────────────┴─────────┴──────────┘
Using the col Expression
One feature of the col expression is that you can supply a datatype, or even a list of datatypes. For example, if we want to contain our queries to floats, we can do the following:
df.select((pl.col(pl.Float64) > 0).sum().suffix("__positive_count_")).collect()
shape: (1, 1)
┌────────────────────────────┐
│ col_float__positive_count_ │
│ --- │
│ u32 │
╞════════════════════════════╡
│ 3 │
└────────────────────────────┘
(Note: (pl.col(...) > 0) creates a series of boolean values that need to be summed, not counted)
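To make that concrete, a quick sketch on the example frame: count() tallies every (non-null) boolean, while sum() tallies only the Trues (here 5 vs. 3 for col_float):

df.select([
    (pl.col(pl.Float64) > 0).count().suffix("__count"),
    (pl.col(pl.Float64) > 0).sum().suffix("__sum"),
]).collect()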
To include more than one datatype, you can supply a list of datatypes to col.
df.select(
(pl.col([pl.Int64, pl.Float64]) > 0).sum().suffix("__positive_count_")
).collect()
shape: (1, 2)
┌──────────────────────────┬────────────────────────────┐
│ col_int__positive_count_ ┆ col_float__positive_count_ │
│ --- ┆ --- │
│ u32 ┆ u32 │
╞══════════════════════════╪════════════════════════════╡
│ 2 ┆ 3 │
└──────────────────────────┴────────────────────────────┘
You can also combine these into the same select statement if you'd like.
df.select(
    [
        (pl.col(pl.Utf8).str.lengths() == 0).sum().suffix("__empty_count"),
        pl.col(pl.Utf8).is_null().sum().suffix("__null_count"),
        (pl.col([pl.Float64, pl.Int64]) > 0).sum().suffix("_positive_count"),
    ]
).collect()
shape: (1, 4)
┌──────────────────────┬─────────────────────┬──────────────────────────┬────────────────────────┐
│ col_str__empty_count ┆ col_str__null_count ┆ col_float_positive_count ┆ col_int_positive_count │
│ --- ┆ --- ┆ --- ┆ --- │
│ u32 ┆ u32 ┆ u32 ┆ u32 │
╞══════════════════════╪═════════════════════╪══════════════════════════╪════════════════════════╡
│ 1 ┆ 1 ┆ 3 ┆ 2 │
└──────────────────────┴─────────────────────┴──────────────────────────┴────────────────────────┘
The Cookbook has a handy list of datatypes.
Using the exclude expression
Another handy trick is to use the exclude expression. With this, we can select all columns except columns of certain datatypes. For example:
df.select(
    [
        pl.exclude(pl.Utf8).max().suffix("_max"),
        pl.exclude([pl.Utf8, pl.Boolean]).min().suffix("_min"),
    ]
).collect()
shape: (1, 7)
┌─────────────┬───────────────┬─────────────────────┬──────────────┬─────────────┬───────────────┬─────────────────────┐
│ col_int_max ┆ col_float_max ┆ col_date_max ┆ col_bool_max ┆ col_int_min ┆ col_float_min ┆ col_date_min │
│ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ f64 ┆ datetime[ns] ┆ u32 ┆ i64 ┆ f64 ┆ datetime[ns] │
╞═════════════╪═══════════════╪═════════════════════╪══════════════╪═════════════╪═══════════════╪═════════════════════╡
│ 2 ┆ 20.0 ┆ 2020-05-01 00:00:00 ┆ 1 ┆ -2 ┆ -20.0 ┆ 2020-01-01 00:00:00 │
└─────────────┴───────────────┴─────────────────────┴──────────────┴─────────────┴───────────────┴─────────────────────┘
Unique counts
One caution: unique_counts results in Series of varying lengths.
df.select(pl.col("col_int").unique_counts().prefix(
"__unique_count_")).collect()
shape: (3, 1)
┌────────────────────────┐
│ __unique_count_col_int │
│ --- │
│ u32 │
╞════════════════════════╡
│ 2 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
└────────────────────────┘
df.select(pl.col("col_float").unique_counts().prefix(
"__unique_count_")).collect()
shape: (4, 1)
┌──────────────────────────┐
│ __unique_count_col_float │
│ --- │
│ u32 │
╞══════════════════════════╡
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 1 │
├╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌╌┤
│ 2 │
└──────────────────────────┘
As such, these should not be combined into the same result. Each column/Series of a DataFrame must have the same length.
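If a single number of unique values per column is what's wanted instead, a hedged alternative is n_unique, which returns one value per column (assuming it is implemented for all the dtypes present) and so can be combined into a single result:

df.select(pl.all().n_unique().prefix("__n_unique_")).collect()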