Can I use a pandas-like string expression for filtering a DataFrame? - python-polars

I am considering replacing pandas with polars in a tool that lets users supply predicate expressions for filtering/subsetting data rows. Users can currently pass any expression that the pandas.DataFrame.query method can parse, such as "x > 1", as a very simple example.
However, I can't seem to find a way to use the same types of string expressions with polars.DataFrame.filter so that I can swap out pandas for polars without requiring users to change their predicate expressions.
The only thing I've found that's close to my question is this posting: String as a condition in a filter
Unfortunately, that's not quite what I need, as it still requires a string expression like "pl.col('x') > 1" rather than simply "x > 1".
Is there a way to use the simpler ("agnostic") syntax with polars?
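For reference, this is the kind of call the tool currently makes with pandas (a toy frame here, purely illustrative, not the real data):
import pandas as pd

pdf = pd.DataFrame({"x": [0, 1, 2]})
pdf.query("x > 1")   # users supply the predicate string; pandas parses and applies it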
Using the example from the polars.DataFrame.filter docs:
>>> df = pl.DataFrame(
...     {
...         "foo": [1, 2, 3],
...         "bar": [6, 7, 8],
...         "ham": ["a", "b", "c"],
...     }
... )
When calling df.filter, I'm forced to use expressions like the following:
pl.col("foo") < 3
(pl.col("foo") < 3) & (pl.col("ham") == "a")
However, I want to be able to use the following string expressions instead, respectively, so that the users of the tool (currently using pandas) do not have to be aware of the polars-specific syntax (thus allowing me to swap libraries without impacting users):
"foo < 3"
"foo < 3 & ham == 'a'"
When I attempt to do so, here's what happens. This is puzzling, because str is listed as one of the supported types for the predicate argument, yet the docs show no examples of string predicates, so it is unclear what syntax str predicates actually support:
>>> df.filter("foo < 3")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/Caskroom/miniconda/base/envs/gedi_subset/lib/python3.10/site-packages/polars/internals/dataframe/frame.py", line 2565, in filter
    self.lazy()
  File "/usr/local/Caskroom/miniconda/base/envs/gedi_subset/lib/python3.10/site-packages/polars/utils.py", line 391, in wrapper
    return fn(*args, **kwargs)
  File "/usr/local/Caskroom/miniconda/base/envs/gedi_subset/lib/python3.10/site-packages/polars/internals/lazyframe/frame.py", line 1165, in collect
    return pli.wrap_df(ldf.collect())
exceptions.NotFoundError: foo < 3
What I was expecting was the same return value that df.filter(pl.col("foo") < 3) would return.
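For the example data above, that expected result would look something like this:
shape: (2, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
│ 2   ┆ 7   ┆ b   │
└─────┴─────┴─────┘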

You could try to use the SQLContext for that.
import polars as pl

ctxt = pl.SQLContext()

df = pl.DataFrame(
    {
        "foo": [1, 2, 3],
        "bar": [6, 7, 8],
        "ham": ["a", "b", "c"],
    }
)
ctxt.register("df", df.lazy())

string_expr = "foo < 3 and ham = 'a'"

(ctxt.query(f"""
    SELECT * FROM df
    WHERE {string_expr}
"""))
shape: (1, 3)
┌─────┬─────┬─────┐
│ foo ┆ bar ┆ ham │
│ --- ┆ --- ┆ --- │
│ i64 ┆ i64 ┆ str │
╞═════╪═════╪═════╡
│ 1   ┆ 6   ┆ a   │
└─────┴─────┴─────┘
Note that SQL doesn't use the bitwise & or the == equality operator the way pandas query strings do, so you might need to replace & with AND and == with =.
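If you need to keep accepting the existing pandas-style predicate strings unchanged, a minimal translation layer could rewrite them before they are interpolated into the SQL query. This is only a sketch covering the operators mentioned above (pandas_predicate_to_sql is a made-up helper, not part of polars), and it does not handle '&', '|' or '==' appearing inside quoted literals:
import re

def pandas_predicate_to_sql(expr: str) -> str:
    # best-effort rewrite of a pandas.DataFrame.query-style predicate
    # into a SQL WHERE clause
    expr = expr.replace("==", "=")             # equality: == -> =
    expr = re.sub(r"\s*&\s*", " AND ", expr)   # conjunction: & -> AND
    expr = re.sub(r"\s*\|\s*", " OR ", expr)   # disjunction: | -> OR
    return expr

pandas_predicate_to_sql("foo < 3 & ham == 'a'")   # -> "foo < 3 AND ham = 'a'"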

Related

polars LazyFrame.with_context().filter() throws unexpected NotFoundError for column

I have two LazyFrames, df1 and df2.
After filtering df2 according to the max value in df1, I want to concatenate them.
But the combination of with_context() and filter() on LazyFrames raises a NotFoundError.
What's the best way to do this?
import polars as pl

df1 = pl.DataFrame({'foo': [0, 1], 'bar': ['a', 'a']}).lazy()
df2 = pl.DataFrame({'foo': [1, 2, 3], 'bar': ['b', 'b', 'b']}).lazy()

df = pl.concat(
    [
        df1,
        df2.with_context(df1.select(pl.col('foo').alias('foo_')))
           .filter(pl.col('foo') > pl.col('foo_').max())
    ]).collect()
# ---------------------------------------------------------------------------
# NotFoundError                             Traceback (most recent call last)
# <ipython-input-2-cf44deab2d4b> in <module>
#       4 df2 = pl.DataFrame({'foo': [1, 2, 3], 'bar': ['b', 'b', 'b']}).lazy()
#       5
# ----> 6 df = pl.concat(
#       7     [
#       8         df1,
#
# 1 frames
# /usr/local/lib/python3.8/dist-packages/polars/utils.py in wrapper(*args, **kwargs)
#     327     def wrapper(*args: P.args, **kwargs: P.kwargs) -> T:
#     328         _rename_kwargs(fn.__name__, kwargs, aliases)
# --> 329         return fn(*args, **kwargs)
#     330
#     331     return wrapper
#
# /usr/local/lib/python3.8/dist-packages/polars/internals/lazyframe/frame.py in collect(self, type_coercion, predicate_pushdown, projection_pushdown, simplify_expression, no_optimization, slice_pushdown, common_subplan_elimination, streaming)
#    1166             streaming,
#    1167         )
# -> 1168         return pli.wrap_df(ldf.collect())
#    1169
#    1170     def sink_parquet(
#
# NotFoundError: foo_
When I assign the comparison result to a column first, the error is not raised:
(df2.with_context(df1.select(pl.col('foo').alias('foo_')))
    .with_column((pl.col('foo') > pl.col('foo_').max()).alias('x'))
    .filter(pl.col('x'))).collect()
# OK
But if I drop that column after filter(), the error appears again.
(df2.with_context(df1.select(pl.col('foo').alias('foo_')))
    .with_column((pl.col('foo') > pl.col('foo_').max()).alias('x'))
    .filter(pl.col('x'))
    .drop('x')).collect()
# NotFoundError: foo_
Finally I found that this works, but what makes it different from the previous attempts?
(It seems verbose. Is there a better solution?)
(df2.with_context(df1.select(pl.col('foo').alias('foo_')))
    .with_column(pl.col('foo_').max())
    .filter(pl.col('foo') > pl.col('foo_'))
    .drop('foo_')).collect()
# OK
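For illustration only (a sketch I have not verified to behave identically), computing the max eagerly and passing it in as a plain Python value would avoid with_context entirely:
max_foo = df1.collect()['foo'].max()   # eager: materialize df1 and take the max of 'foo'

df = pl.concat(
    [
        df1,
        df2.filter(pl.col('foo') > max_foo)
    ]).collect()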
related?
https://stackoverflow.com/a/71108312/7402018

Read and merge large tables on computer cluster

I need to merge several large tables (up to 10 GB each) into a single one. To do so I am using a computer cluster with 50+ cores and 10+ GB of RAM that runs on Linux.
I always end up with an error message like: "Cannot allocate vector of size X Mb".
Given that commands like memory.limit(size=X) are Windows-specific and not accepted here, I cannot find a way around this to merge my large tables.
Any suggestion is welcome!
This is the code I use:
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
temp = list.files(pattern="*.txt$")
gc()
Here the error occurs:
myfiles = parLapply(cl, temp, function(x) read.csv(x,
                                                   header = TRUE,
                                                   sep = ";",
                                                   stringsAsFactors = F,
                                                   encoding = "UTF-8",
                                                   na.strings = c("NA", "99", "")))
myfiles.final = do.call(rbind, myfiles)
You could just use merge, for example:
mergedTable <- merge(table1, table2, by = "dbSNP_RSID")
If your samples have overlapping column names, then you'll find that the mergedTable has (for example) columns called Sample1.x and Sample1.y. This can be fixed by renaming the columns before or after the merge.
Reproducible example:
x <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
                matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
                              sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
y <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
                matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
                              sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
colnames(x)[2:101] <- paste0("Sample", 1:100)
colnames(y)[2:101] <- paste0("Sample", 101:200)
mergedDf <- merge(x, y, by = "dbSNP_RSID")
One way to approach this is with Python and dask. A dask dataframe is stored mostly on disk rather than in RAM, which lets you work with larger-than-RAM data, and dask can parallelize computations cleverly. A nice tutorial on ways of working with big data can be found in this kaggle post, which might also be helpful for you, and I also suggest checking out the docs on dask performance here. To be clear: if your data fits in RAM, a regular R data frame or pandas dataframe will be faster.
Here's a dask solution, which assumes the tables have named columns so the concat operation can align them. Please add to your question if there are any other special requirements about the data we need to consider.
import dask.dataframe as dd
import glob

tmp = glob.glob("*.txt")

dfs = []
for f in tmp:
    # read the large tables
    ddf = dd.read_table(f)
    # make a list of all the dfs
    dfs.append(ddf)

# row-wise concat of the data
dd_all = dd.concat(dfs)

# repartition the df to 1 partition for saving
dd_all = dd_all.repartition(npartitions=1)

# save the data
# provide a list of one name if you don't want the partition number appended on
dd_all.to_csv(['all_big_files.tsv'], sep='\t')
If you just want to concatenate all the tables together, you can do something like this in straight Python (you could also use Linux cat/paste).
with open('all_big_files.tsv', 'w') as O:
    file_number = 0
    for f in tmp:
        with open(f, 'r') as F:
            if file_number == 0:
                for line in F:
                    line = line.rstrip()
                    O.write(line + '\n')
            else:
                # skip the header line
                F.readline()
                for line in F:
                    line = line.rstrip()
                    O.write(line + '\n')
        file_number += 1

Aligning and italicising table column headings using Rmarkdown and pander

I am writing an rmarkdown document, knitting to pdf, with tables taken from portions of the lists returned by ezANOVA (from the ez package). The tables are made using the pander package. A toy Rmarkdown file with a toy dataset is below.
---
title: "Table Doc"
output: pdf_document
---
```{r global_options, include=FALSE}
#set global knit options parameters.
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
                      echo=FALSE, warning=FALSE, message=FALSE, dev = 'pdf')
```
```{r, echo=FALSE}
# toy data
id <- rep(c(1,2,3,4), 5)
group1 <- factor(rep(c("A", "B"), 10))
group2 <- factor(rep(c("A", "B"), each = 10))
dv <- runif(20, min = 0, max = 10)
df <- data.frame(id, group1, group2, dv)
```
``` {r anova, echo = FALSE}
library(ez)
library(plyr)
library(pander)
# create anova object
anOb <- ezANOVA(df,
                dv = dv,
                wid = id,
                between = c(group1, group2),
                type = 3,
                detailed = TRUE)
# extract the output table from the anova object, reduce it down to only desired columns
anOb <- data.frame(anOb[[1]][, c("Effect", "F", "p", "p<.05")])
# format entries in columns
anOb[,2] <- format( round (anOb[,2], digits = 1), nsmall = 1)
anOb[,3] <- format( round (anOb[,3], digits = 4), nsmall = 1)
pander(anOb, justify = c("left", "center", "center", "right"))
```
Now I have a few problems
a) For the last three columns I would like to have the column heading in the table aligned in the center, but the actual column entries underneath those headings aligned to the right.
b) I would like to have the column headings 'F' and 'p' in italics, and the 'p' in the 'p<.05' column in italics too, with the rest in normal font, so they read F, p and p<.05.
I tried renaming the column headings using plyr::rename like so
anOb <- rename(anOb, c("F" = "italic(F)", "p" = "italic(p)", "p<.05" = ""))
But it didn't work
In markdown, you have to use the markdown syntax for italics, which is wrapping text between stars or underscores:
> names(anOb) <- c('Effect', '*F*', '*p*', '*p<.05*')
> pander(anOb)
-----------------------------------------
    Effect       *F*     *p*     *p<.05*
--------------- ------ -------- ---------
  (Intercept)    52.3   0.0019      *
    group1       1.3    0.3180
    group2       2.0    0.2261
 group1:group2   3.7    0.1273
-----------------------------------------
If you want to do that in a programmatic way, you can also use the pandoc.emphasis helper function to add the stars to a string.
But your other problem is due to a bug in the package, for which I've just proposed a fix on GH. Please feel free to give that branch a try and report back on GH -- I will try to get some time later this week to clean up the related unit tests and merge the branch if everything seems to be OK.

Cryptic python error 'classobj' object has no attribute '__getitem__'. Why am I getting this?

I really wish I could be more specific here, but I have read through related questions and none of them seem to relate to the issue I am experiencing, and I have no understanding of it. This is for a homework assignment, so I am hesitant to put up all my code for the program; here is a stripped-down version. Run this and you will see the issue.
import copy

class Ordering:
    def __init__(self, tuples):
        self.pairs = copy.deepcopy(tuples)
        self.sorted = []
        self.unsorted = []
        for x in self.pairs:
            self.addUnsorted(left(x))
            self.addUnsorted(right(x))

    def addUnsorted(self, item):
        isPresent = False
        for x in self.unsorted:
            if x == item:
                isPresent = True
        if isPresent == False:
            self.unsorted.append(left(item))
Here I have created a class, Ordering, that takes a list of the form [('A', 'B'), ('C', 'B'), ('D', 'A')] (where a must come before b, c must come before b, etc.) and is supposed to return it in partially ordered form. I am working on debugging my code to see if it works correctly, but I have not been able to yet because of the error message I get back.
When I input the following in my terminal:
print Ordering[('A', 'B'), ('C', 'B'), ('D', 'A')]
I get back the following error message:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
TypeError: 'classobj' object has no attribute '__getitem__'
Why is this?!
To access an element of a list, use square brackets. To instantiate a class, use parens.
In other words, do not use:
print Ordering[('A', 'B'), ('C', 'B'), ('D', 'A')]
Use:
print Ordering((('A', 'B'), ('C', 'B'), ('D', 'A')))
This will generate another error from deeper in the code but, since this is a homework assignment, I will let you think about that one a bit.
How to use __getitem__:
As a minimal example, here is a class that returns squares via __getitem__:
class HasItems(object):
    def __getitem__(self, key):
        return key**2
In operation, it looks like this:
>>> a = HasItems()
>>> a[4]
16
Note the square brackets.
Answer to "Why is this?"
Your demo code is not complete (ref. comment above); however, the issue with the .__getitem__ method is clearly related to the statement that prints an object (which, for other reasons, failed to respond to the .__getitem__ call), rather than to the class itself.
>>> aList = [ ('A','B'), ('C','D'), ('E','F')]   # the stated format of input
>>> aList                                        # validated to be a list
[('A', 'B'), ('C', 'D'), ('E', 'F')]
>>> type( aList )                                # cross-validated
<type 'list'>
>>> for x in aList:                              # iterator over members
...     print x, type( x )                       # show value and type
...     left( x )                                # request as in demo-code
...
('A', 'B') <type 'tuple'>
Traceback (most recent call last):               # <<< demo-code does not have it
  File "<stdin>", line 3, in <module>
NameError: name 'left' is not defined

>>> dir( Ordering )                              # .__getitem__ method missing
['__doc__', '__init__', '__module__', 'addUnsorted']

>>> dir( aList[0] )                              # .__getitem__ method present
['__add__', '__class__', '__contains__', '__delattr__', '__doc__', '__eq__',
 '__format__', '__ge__', '__getattribute__', '__getitem__', '__getnewargs__',
 '__getslice__', '__gt__', '__hash__', '__init__', '__iter__', '__le__', '__len__',
 '__lt__', '__mul__', '__ne__', '__new__', '__reduce__', '__reduce_ex__',
 '__repr__', '__rmul__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__',
 'count', 'index']

How to add meta_data to Pandas dataframe?

I use Pandas dataframes heavily, and I need to attach some data to a dataframe, for example to record the birth time of the dataframe, an additional description, etc.
I just can't find any reserved field in the dataframe class for keeping such data.
So I changed core\frame.py to add a line _reserved_slot = {} to solve my issue. I am posting the question here just to ask whether it is OK to do so, or whether there is a better way to attach meta-data to a dataframe/column/row, etc.
#----------------------------------------------------------------------
# DataFrame class
class DataFrame(NDFrame):
    _auto_consolidate = True
    _verbose_info = True
    _het_axis = 1
    _col_klass = Series
    _AXIS_NUMBERS = {
        'index': 0,
        'columns': 1
    }
    _reserved_slot = {}  # Add by bigbug to keep extra data for dataframe
    _AXIS_NAMES = dict((v, k) for k, v in _AXIS_NUMBERS.iteritems())
EDIT: (demo of the issue with witingkuo's approach)
>>> df = pd.DataFrame(np.random.randn(10,5), columns=list('ABCDEFGHIJKLMN')[0:5])
>>> df
         A        B        C        D        E
0   0.5890  -0.7683  -1.9752   0.7745   0.8019
1   1.1835   0.0873   0.3492   0.7749   1.1318
2   0.7476   0.4116   0.3427  -0.1355   1.8557
3   1.2738   0.7225  -0.8639  -0.7190  -0.2598
4  -0.3644  -0.4676   0.0837   0.1685   0.8199
5   0.4621  -0.2965   0.7061  -1.3920   0.6838
6  -0.4135  -0.4991   0.7277  -0.6099   1.8606
7  -1.0804  -0.3456   0.8979   0.3319  -1.1907
8  -0.3892   1.2319  -0.4735   0.8516   1.2431
9  -1.0527   0.9307   0.2740  -0.6909   0.4924
>>> df._test = 'hello'
>>> df2 = df.shift(1)
>>> print df2._test
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python\lib\site-packages\pandas\core\frame.py", line 2051, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute '_test'
>>>
This is not supported right now. See https://github.com/pydata/pandas/issues/2485. The reason is that the propagation of these attributes is non-trivial. You can certainly assign data, but almost all pandas operations return a new object, where the assigned data will be lost.
Your _reserved_slot will become a class variable. That might not work if you want to assign different values to different DataFrames. You can probably assign what you want to the instance directly:
In [6]: import pandas as pd
In [7]: df = pd.DataFrame()
In [8]: df._test = 'hello'
In [9]: df._test
Out[9]: 'hello'
I think a decent workaround is putting your dataframe into a dictionary with your metadata as other keys. So if you have a dataframe with cashflows, like:
df = pd.DataFrame({'Amount': [-20, 15, 25, 30, 100]},index=pd.date_range(start='1/1/2018', periods=5))
You can create your dictionary with additional metadata and put the dataframe there
out = {'metadata': {'Name': 'Whatever', 'Account': 'Something else'}, 'df': df}
and then access the dataframe as out['df'] and the metadata as out['metadata'].
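For example, retrieving the pieces afterwards (using the key names defined above):
out['df']                     # the cashflow dataframe
out['metadata']['Name']       # 'Whatever'
out['metadata']['Account']    # 'Something else'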