Is there a way to read from stdin with polars? I have tried a few different ways and always hit errors.
$ printf "1,2,3\n1,2,3\n" | python -c 'import polars as pl; import sys; pl.read_csv("/dev/stdin")'
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mmfs1/gscratch/stergachislab/mvollger/miniconda3/envs/snakemake/lib/python3.9/site-packages/polars/io.py", line 398, in read_csv
    df = DataFrame._read_csv(
  File "/mmfs1/gscratch/stergachislab/mvollger/miniconda3/envs/snakemake/lib/python3.9/site-packages/polars/internals/frame.py", line 585, in _read_csv
    self._df = PyDataFrame.read_csv(
OSError: No such device (os error 19)
Thanks in advance,
Mitchell
$ printf "a,b,c\n1,2,3\n" | python3 -c 'import polars as pl; import sys; import io; print(pl.read_csv(io.StringIO(sys.stdin.read())))'
shape: (1, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 2 ┆ 3 │
└─────┴─────┴─────┘
Just read from STDIN, wrap the string in io.StringIO, and pass that to polars.read_csv.
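For a longer script, the same idea is easier to read spelled out. A minimal sketch that does nothing beyond what the one-liner above already does:
import io
import sys

import polars as pl

# read all of stdin into memory, then hand polars a file-like object
df = pl.read_csv(io.StringIO(sys.stdin.read()))
print(df)
Note that this buffers all of stdin in memory, which is fine for small inputs.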
I have a python-polars dataframe:
┌───────┬─────┐
│ val1 ┆ val2│
│ --- ┆ --- │
│ f32 ┆ i32 │
╞═══════╪═════╡
│ 0.0 ┆ 3 │
│ 1.0 ┆ 3 │
│ 1.0 ┆ 3 │
│ 1.0 ┆ 3 │
│ ... ┆ ... │
│ 0.667 ┆ 3 │
│ 1.0 ┆ 3 │
│ 0.333 ┆ 3 │
│ 0.333 ┆ 3 │
└───────┴─────┘
I'd like to do this operation to replace the two columns above with one column obtained by:
((val1 * 100) << 16) + val2
I have a function that takes values from the two columns and performs this operation, but it is a Python lambda, which is both inexpressive and quite slow.
df.select(((pl.col("val1") * 100) << 16) + pl.col("val2"))
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
# TypeError: unsupported operand type(s) for <<: 'Expr' and 'int'
I can run +, -, * or / in place of <<, but it does not work with bitwise shift operators.
Is this possible in polars? I saw that there is a function in polars-rust that does a left shift, but I don't know how or if that can be used with python.
You can use pyarrow, since converting from polars to arrow is free and, more importantly, because pyarrow does have bit shifting (although only for integers).
# start with
import polars as pl
import pyarrow as pa
import pyarrow.compute as pc

df = pl.DataFrame({'val1': [0.0, 1.0, 1.0, 0.667, 1, 0, 0.33, 0.33],
                   'val2': [3, 3, 3, 3, 3, 3, 3, 3]})

# compute val1 * 100, cast it to Int64, and convert the frame to arrow
padf = df.with_columns([(pl.col('val1') * 100).cast(pl.Int64())]).to_arrow()

# shift left with pyarrow's `shift_left`, bring the result back into polars, then add val2
df = (
    df.with_columns(
        pl.from_arrow(pc.shift_left(padf['val1'], pa.scalar(16))).alias('shifted')
    )
    .with_columns((pl.col('shifted') + pl.col('val2')).alias('result'))
)
I just noticed @jquerious's comment. That's actually a good approach, because np.left_shift is a ufunc.
It just needs a couple of tweaks:
import numpy as np

df.with_columns(
    np.left_shift((pl.col('val1') * 100).cast(pl.Int64()), 16).alias('npres')
)
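As an aside (not from either answer above): a left shift by 16 bits is just multiplication by 2**16, so the same result can also be computed with plain polars arithmetic, no pyarrow or numpy required. A sketch, reusing the column names from the question:
df.with_columns(
    # (val1 * 100) << 16 is equivalent to (val1 * 100) * 2**16
    (((pl.col('val1') * 100).cast(pl.Int64()) * 2**16) + pl.col('val2')).alias('result')
)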
polars.LazyFrame.var returns the variance of each column in a table, as below:
>>> df = pl.DataFrame({"a": [1, 2, 3, 4], "b": [1, 2, 1, 1], "c": [1, 1, 1, 1]}).lazy()
>>> df.collect()
shape: (4, 3)
┌─────┬─────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞═════╪═════╪═════╡
│ 1 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 1 ┆ 1 │
└─────┴─────┴─────┘
>>> df.var().collect()
shape: (1, 3)
┌──────────┬──────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞══════════╪══════╪═════╡
│ 1.666667 ┆ 0.25 ┆ 0.0 │
└──────────┴──────┴─────┘
I wish to select only the columns whose variance is > 0 from a LazyFrame, but couldn't find a solution.
I can iterate over the columns of an eager DataFrame and filter them by condition, as below:
>>> data.var()
shape: (1, 3)
┌──────────┬──────┬─────┐
│ a ┆ b ┆ c │
│ --- ┆ --- ┆ --- │
│ f64 ┆ f64 ┆ f64 │
╞══════════╪══════╪═════╡
│ 1.666667 ┆ 0.25 ┆ 0.0 │
└──────────┴──────┴─────┘
>>> cols = pl.select([s for s in data.var() if (s > 0).all()]).columns
>>> cols
['a', 'b']
>>> data.select(cols)
shape: (4, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 1 │
└─────┴─────┘
But it doesn't work in LazyFrame:
>>> data = data.lazy()
>>> data
<polars.internals.lazyframe.frame.LazyFrame object at 0x7f0e3d9966a0>
>>> cols = pl.select([s for s in data.var() if (s > 0).all()]).columns
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "<stdin>", line 1, in <listcomp>
  File "/home/jasmine/miniconda3/envs/jupyternb/lib/python3.9/site-packages/polars/internals/lazyframe/frame.py", line 421, in __getitem__
    raise TypeError(
TypeError: 'LazyFrame' object is not subscriptable (aside from slicing). Use 'select()' or 'filter()' instead.
The reason for doing this on a LazyFrame is to maximize performance. Any advice would be much appreciated. Thanks!
polars doesn't know the variances until it has computed them, and that happens at the same time it produces the results, so there's no way to filter the reported columns that is more performant than just displaying all of them, at least as far as the polars calculation is concerned. (python/jupyter may take a bit longer to display more columns than fewer.)
With that said you could do something like this:
df.var().melt().filter(pl.col('value')>0).collect()
which gives you what you want in one line but it's a different shape.
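For the example frame above, that one-liner should return something like:
shape: (2, 2)
┌──────────┬──────────┐
│ variable ┆ value    │
│ ---      ┆ ---      │
│ str      ┆ f64      │
╞══════════╪══════════╡
│ a        ┆ 1.666667 │
├╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌┤
│ b        ┆ 0.25     │
└──────────┴──────────┘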
You could also do something like this:
dfvar = df.var()
dfvar.select(
    dfvar.melt()
    .filter(pl.col('value') > 0)
    .select('variable')
    .collect()
    .to_series()
    .to_list()
).collect()
Building on the answer from @dean MacGregor, we:
do the var calculation
melt
apply the filter
extract the variable column with column names
pass it as a list to select
df.select(
    df.var()
    .melt()
    .filter(pl.col('value') > 0)
    .collect()["variable"]
    .to_list()
).collect()
shape: (4, 2)
┌─────┬─────┐
│ a ┆ b │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞═════╪═════╡
│ 1 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 2 ┆ 2 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 3 ┆ 1 │
├╌╌╌╌╌┼╌╌╌╌╌┤
│ 4 ┆ 1 │
└─────┴─────┘
I have a Polars dataframe like below:
df_cat = pl.DataFrame(
[
pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical)
])
print(df_cat)
shape: (5, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ a ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ E │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ b ┆ G │
└───────┴───────┘
The following filter runs perfectly fine:
print(df_cat.filter(pl.col('a_cat') == 'c'))
shape: (2, 2)
┌───────┬───────┐
│ a_cat ┆ b_cat │
│ --- ┆ --- │
│ cat ┆ cat │
╞═══════╪═══════╡
│ c ┆ F │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┤
│ c ┆ G │
└───────┴───────┘
What I want is to use a list of strings to run the filter more efficiently. So I tried, and ended up with the following error message:
print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
---------------------------------------------------------------------------
ComputeError Traceback (most recent call last)
d:\GitRepo\Test2\stockEMD3.ipynb Cell 9 in <cell line: 1>()
----> 1 print(df_cat.filter(pl.col('a_cat').is_in(['c'])))
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\dataframe\frame.py:2185, in DataFrame.filter(self, predicate)
2181 if _NUMPY_AVAILABLE and isinstance(predicate, np.ndarray):
2182 predicate = pli.Series(predicate)
2184 return (
-> 2185 self.lazy()
2186 .filter(predicate) # type: ignore[arg-type]
2187 .collect(no_optimization=True, string_cache=False)
2188 )
File c:\ProgramData\Anaconda3\envs\charm3.9\lib\site-packages\polars\internals\lazyframe\frame.py:660, in LazyFrame.collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, string_cache, no_optimization, slice_pushdown)
650 projection_pushdown = False
652 ldf = self._ldf.optimization_toggle(
653 type_coercion,
654 predicate_pushdown,
(...)
658 slice_pushdown,
659 )
--> 660 return pli.wrap_df(ldf.collect())
ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache
From this Stack Overflow link I understand that "You need to set a global string cache to compare categoricals created in different columns/lists.", but my questions are:
Why does the == single-string filter work?
What is the proper way to filter a categorical column with a list of strings?
Thanks!
Actually, you don't need to set a global string cache to compare strings to Categorical variables. You can use cast to accomplish this.
Let's use this data. I've included the integer values that underlie the Categorical variables to demonstrate something later.
import polars as pl
df_cat = (
pl.DataFrame(
[
pl.Series("a_cat", ["c", "a", "b", "c", "X"], dtype=pl.Categorical),
pl.Series("b_cat", ["F", "G", "E", "S", "X"], dtype=pl.Categorical),
]
)
.with_column(
pl.all().to_physical().suffix('_phys')
)
)
df_cat
shape: (5, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ b ┆ E ┆ 2 ┆ 2 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Comparing a categorical variable to a string
If we cast a Categorical variable back to its string values, we can make any comparison we need. For example:
df_cat.filter(pl.col('a_cat').cast(pl.Utf8).is_in(['a', 'c']))
shape: (3, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ c ┆ F ┆ 0 ┆ 0 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ a ┆ G ┆ 1 ┆ 1 │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌╌╌╌╌┤
│ c ┆ S ┆ 0 ┆ 3 │
└───────┴───────┴────────────┴────────────┘
Or in a filter step comparing the string values of two Categorical variables that do not share the same string cache.
df_cat.filter(pl.col('a_cat').cast(pl.Utf8) == pl.col('b_cat').cast(pl.Utf8))
shape: (1, 4)
┌───────┬───────┬────────────┬────────────┐
│ a_cat ┆ b_cat ┆ a_cat_phys ┆ b_cat_phys │
│ --- ┆ --- ┆ --- ┆ --- │
│ cat ┆ cat ┆ u32 ┆ u32 │
╞═══════╪═══════╪════════════╪════════════╡
│ X ┆ X ┆ 3 ┆ 4 │
└───────┴───────┴────────────┴────────────┘
Notice that it is the string values being compared (not the integers underlying the two Categorical variables).
The equality operator on Categorical variables
The following statements are equivalent:
df_cat.filter((pl.col('a_cat') == 'a'))
df_cat.filter((pl.col('a_cat').cast(pl.Utf8) == 'a'))
The former is syntactic sugar for the latter, as the former is a common use case.
As the error states: ComputeError: joins/or comparisons on categorical dtypes can only happen if they are created under the same global string cache.
Comparisons of categorical values are only allowed under a global string cache. You really want to set this in such a case as it speeds up comparisons and prevents expensive casts to strings.
Setting this on the start of your query will ensure it runs:
import polars as pl
pl.Config.set_global_string_cache()
This is a new answer based on the one from @ritchie46.
As of Polars 0.15.15 it is now:
import polars as pl
pl.toggle_string_cache(True)
A StringCache() context manager can also be used; see the polars documentation:
with pl.StringCache():
    print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))
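Note that the error message says the categoricals must be "created under the same global string cache", so the safest pattern is to build the frame and run the comparison inside the same cache. A self-contained sketch, reusing the data from the question:
import polars as pl

with pl.StringCache():
    # both the categorical columns and the comparison values are interned in this cache
    df_cat = pl.DataFrame(
        [
            pl.Series("a_cat", ["c", "a", "b", "c", "b"], dtype=pl.Categorical),
            pl.Series("b_cat", ["F", "G", "E", "G", "G"], dtype=pl.Categorical),
        ]
    )
    print(df_cat.filter(pl.col('a_cat').is_in(['a', 'c'])))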
In Polars, the select and with_column methods broadcast any scalars that they get, including literals:
import polars as pl

df = pl.DataFrame({"x": [1, 2, 3]})
df.with_column(pl.lit(1).alias("y"))
# shape: (3, 2)
# ┌─────┬─────┐
# │ x ┆ y │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 1 ┆ 1 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 2 ┆ 1 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 3 ┆ 1 │
# └─────┴─────┘
The agg method does not broadcast literals:
import polars as pl
df = pl.DataFrame(dict(x=[1,1,0,0])).groupby("x")
df.agg(pl.lit(1).alias("y"))
# exceptions.ComputeError: returned aggregation is a different length: 1 than the group lengths: 2
Is there an operation I can apply that will broadcast a scalar and ignore a non-scalar? Something like this:
df.agg(something(pl.lit(1)).alias("y"))
# shape: (2, 2)
# ┌─────┬─────┐
# │ x ┆ y │
# │ --- ┆ --- │
# │ i64 ┆ i64 │
# ╞═════╪═════╡
# │ 0 ┆ 1 │
# ├╌╌╌╌╌┼╌╌╌╌╌┤
# │ 1 ┆ 1 │
# └─────┴─────┘
You will need to use pl.repeat(1, pl.count()) to expand the literal to the group size.
(answer from the Polars Issue tracker - https://github.com/pola-rs/polars/issues/2987#issuecomment-1079617229)
df = pl.DataFrame(dict(x=[1, 1, 0, 0])).groupby("x")
df.agg(pl.repeat(1, pl.count())).explode(pl.col('literal'))
This may be helpful.
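If you want the (2, 2) shape from the question, one scalar per group rather than one row per element, taking the first element of the repeated column should also work. A sketch building on the answer above (not from the original answer):
df = pl.DataFrame(dict(x=[1, 1, 0, 0]))
# expand the literal to the group length, then keep a single value per group
df.groupby("x").agg(pl.repeat(1, pl.count()).first().alias("y"))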
I would like to call a numpy universal function (ufunc) that has two positional arguments in polars.
df.with_column(
    numpy.left_shift(pl.col('col1'), 8)
)
Above attempt results in the following error message
Traceback (most recent call last):
  File "<stdin>", line 2, in <module>
  File "/usr/local/lib/python3.8/dist-packages/polars/internals/expr.py", line 181, in __array_ufunc__
    out_type = ufunc(np.array([1])).dtype
TypeError: left_shift() takes from 2 to 3 positional arguments but 1 were given
There are other ways to perform this computation, e.g.,
df['col1'] = numpy.left_shift(df['col1'], 8)
... but I'm trying to use this with a polars.LazyFrame.
I'm using polars 0.13.13 and Python 3.8.
Edit: as of Polars 0.13.19, the apply method converts Numpy datatypes to Polars datatypes without requiring the Numpy item method.
When you need to pass only one column from polars to the ufunc (as in your example), the easiest method is to use the apply function on that column.
import numpy as np
import polars as pl
df = pl.DataFrame({"col1": [2, 4, 8, 16]}).lazy()
df.with_column(
    pl.col("col1").apply(lambda x: np.left_shift(x, 8).item()).alias("result")
).collect()
shape: (4, 2)
┌──────┬────────┐
│ col1 ┆ result │
│ --- ┆ --- │
│ i64 ┆ i64 │
╞══════╪════════╡
│ 2 ┆ 512 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 1024 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 2048 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 16 ┆ 4096 │
└──────┴────────┘
If you need to pass multiple columns from Polars to the ufunc, then use the struct expression with apply.
df = pl.DataFrame({"col1": [2, 4, 8, 16], "shift": [1, 1, 2, 2]}).lazy()
df.with_column(
    pl.struct(["col1", "shift"])
    .apply(lambda cols: np.left_shift(cols["col1"], cols["shift"]).item())
    .alias("result")
).collect()
shape: (4, 3)
┌──────┬───────┬────────┐
│ col1 ┆ shift ┆ result │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ i64 │
╞══════╪═══════╪════════╡
│ 2 ┆ 1 ┆ 4 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 4 ┆ 1 ┆ 8 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 8 ┆ 2 ┆ 32 │
├╌╌╌╌╌╌┼╌╌╌╌╌╌╌┼╌╌╌╌╌╌╌╌┤
│ 16 ┆ 2 ┆ 64 │
└──────┴───────┴────────┘
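For what it's worth, an answer earlier in this thread calls np.left_shift directly on an expression, which suggests that newer Polars versions dispatch two-argument ufuncs through __array_ufunc__. If your version supports it, a sketch like this avoids apply entirely (untested against 0.13.13, where the question's error occurred):
import numpy as np
import polars as pl

df = pl.DataFrame({"col1": [2, 4, 8, 16]}).lazy()
# the ufunc receives the expression plus the scalar shift amount
df.with_column(np.left_shift(pl.col("col1"), 8).alias("result")).collect()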
One note: the use of the numpy item method may no longer be needed in future releases of Polars. (Presently, the apply method does not always automatically translate between numpy dtypes and Polars dtypes.)
Does this help?