Polars convert list of strings to list of categoricals - python-polars

I'm trying to improve the performance of my Polars code by converting a list of strings to a list of categoricals for my tags column:
shape: (3, 2)
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[str] │
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
However, I get the following error:
ValueError: could not convert value 'Unknown' as a Literal
Does polars support lists of categoricals?

Polars does support lists of Categoricals.
The issue is that you're using pl.list() instead of pl.List(): data type names start with an uppercase letter.
>>> df.with_columns(pl.col('b').cast(pl.List(pl.Categorical)))
shape: (3, 2)
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[cat] │
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
pl.list() is something different - it appears to be shorthand syntax for pl.col().list()


Polars table convert a list column to separate rows i.e. unnest a list column to multiple rows

I have a Polars dataframe in the form:
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ list[str] │
│ 1 ┆ ["a", "b"] │
│ 2 ┆ ["a"] │
│ 3 ┆ ["c", "d"] │
I want to convert it to the following form. I plan to save to a parquet file, and query the file (with sql).
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
│ 1 ┆ "a" │
│ 1 ┆ "b" │
│ 2 ┆ "a" │
│ 3 ┆ "c" │
│ 3 ┆ "d" │
I have seen an answer that works on struct columns, but df.unnest('b') on my data results in the error:
SchemaError: Series of dtype: List(Utf8) != Struct
I also found a GitHub issue that shows a list can be converted to a struct, but I can't work out how to do that, or whether it applies here.
To split a list column into separate rows, you can use the .explode() method:
df = pl.DataFrame({'a':[1,2,3], 'b':[['a','b'],['a'],['c','d']]})
df.explode('b')
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
│ 1 ┆ a │
│ 1 ┆ b │
│ 2 ┆ a │
│ 3 ┆ c │
│ 3 ┆ d │

How to select the last non-null value from one column and also the value from another column on the same row in Polars?

Below is a non working example in which I retrieve the last available 'Open' but how do I get corresponding 'Time'?
sel = self.data.select([pl.col('Time'),
For instance, you can use .filter() to keep only the rows where the column is not null, and then take the last row.
Here is an example:
df = pl.DataFrame({
    "a": [1,2,3,4,5],
    "b": ["cat", None, "owl", None, None]
})
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
│ 1 ┆ cat │
│ 2 ┆ null │
│ 3 ┆ owl │
│ 4 ┆ null │
│ 5 ┆ null │
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ str │
│ 3 ┆ owl │

Sum columns based on column names in a list for polars

In Python Polars, I can add one or more columns to make a new column by using an expression like:
frame.with_column((pl.col('colname1') + pl.col('colname2')).alias('new_colname'))
However, if I have all the column names in a list, is there a way to sum all the columns in that list and create a new column based on the result?
The sum expression supports horizontal summing. From the docs:
List[Expr] -> aggregate the sum value horizontally.
Sample code for reference:
import polars as pl
df = pl.DataFrame({"a": [1, 2, 3], "b": [1, 2, None]})
This results in something like,
shape: (3, 2)
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
│ 1 ┆ 1 │
│ 2 ┆ 2 │
│ 3 ┆ null │
On this you can do something like,
cols = ["a", "b"]
df2 = df.select(pl.sum([pl.col(i) for i in cols]).alias('new_colname'))
Which will result in,
shape: (3, 1)
│ new_colname │
│ --- │
│ i64 │
│ 2 │
│ 4 │
│ null │

Merge list column with constant column in polars

I have a dataframe like:
pl.DataFrame({'a': [['a', 'b'], None, ['c', 'd', 'e'], None], 't':['x', 'y', None, None]})
shape: (4, 2)
│ a ┆ t │
│ --- ┆ --- │
│ list[str] ┆ str │
│ ["a", "b"] ┆ x │
│ null ┆ y │
│ ["c", "d", "e"] ┆ null │
│ null ┆ null │
I'd like to have a transformation that results in:
│ a │
│ --- │
│ list[str] │
│ ["a", "b", "x"] │
│ ["y"] │
│ ["c", "d", "e"] │
│ null │
However, the obvious solutions which come to mind don't seem to work.
results in
│ a ┆ t │
│ --- ┆ --- │
│ list[str] ┆ str │
│ ["a", "b", "x"] ┆ x │
│ null ┆ y │
│ ["c", "d", ... null] ┆ null │
│ null ┆ null │
Strangely, somehow
col('t').apply(lambda s: [s]).arr.concat(col('a'))
results in an error saying that the dataframe length has changed:
ShapeError: Could not add column. The Series length 5 differs from the DataFrame height: 4
I don't understand why concatenating the two Series together should produce a new series of a different length. Is this a bug?
I have tried a number of ways to produce a solution but continue to run into errors. For example, using a list comprehension works to add the arrays together, but .append does not.
def combine(d):
    x, y = d['a'], d['t']
    if x and y:
        # return x.append(y)  # produces error
        return [a for a in x] + [b for b in y]
    if x and not y:
        return [a for a in x]
    if y and not x:
        return [b for b in y]
    # return None  # (produces error)
    return ['None']
pl.struct([col('a'), col('t')]).apply(combine).alias('combined')
│ a ┆ t ┆ combined │
│ --- ┆ --- ┆ --- │
│ list[str] ┆ str ┆ list[str] │
│ ["a", "b"] ┆ x ┆ ["a", "b", "x"] │
│ null ┆ y ┆ ["y"] │
│ ["c", "d", "e"] ┆ null ┆ ["c", "d", "e"] │
│ null ┆ null ┆ ["None"] │
This gets part of the way there but now we have to deal with ["None"] at some point.

How to get row_count for a group in polars?

The usage might look like the code below:
out_df = df.select([
The data should be like this:
md5 row_count
a 1
a 2
b 1
Maybe I'm misunderstanding, as your output has both values 1 and 2 for a. Assuming you meant 2 for both:
You are very close. Polars has .count():
import polars as pl
df = pl.DataFrame({"md5": ["a", "a", "b"]})
out_df = df.select([
Which prints out this:
shape: (3, 2)
│ md5 ┆ row_count │
│ --- ┆ --- │
│ str ┆ u32 │
│ a ┆ 2 │
│ a ┆ 2 │
│ b ┆ 1 │
If I understand correctly, you want a running count for each value within its group.
You can do this:
df = pl.DataFrame({"md5": ["a", "a", "b"]})
shape: (3, 2)
│ md5 ┆ row_count │
│ --- ┆ --- │
│ str ┆ i32 │
│ a ┆ 1 │
│ a ┆ 2 │
│ b ┆ 1 │
We still have to add a dummy column "ones", because (as of polars==0.10.23) we cannot apply a window function over literals. We will add this functionality.