Django ORM - Select All Records from One Table That Do Not Exist in Another Table

Let's say we have two models:

class A(models.Model):
    f1 = models.CharField(max_length=100)
    f2 = models.IntegerField()
    f3 = models.BooleanField()

class B(models.Model):
    f1 = models.CharField(max_length=100)
    f2 = models.IntegerField()
    f3 = models.DecimalField(max_digits=10, decimal_places=3)
And this data:

A(f1='rat', f2=100, f3=True)
A(f1='cat', f2=200, f3=True)
A(f1='dog', f2=300, f3=False)

B(f1='eagle', f2=100, f3=3.14)
B(f1='cat', f2=200, f3=9.81)
B(f1='dog', f2=300, f3=100.500)
I need to select the objects from table B that do not have a matching combination of the f1 and f2 fields in table A.
In my case it will be:
B(f1=eagle, f2=100, f3=3.14)
The following objects are not relevant, because their f1 and f2 values exist in both tables:
B(f1=cat, f2=200, f3=9.81)
B(f1=dog, f2=300, f3=100.500)
Is it possible to select this data using the Django ORM?
I tried to find information about subqueries, but could not find a good example.

You can use exclude with Q objects:

from django.db.models import Q

B.objects.exclude(
    Q(f1__in=A.objects.values_list('f1', flat=True)) &
    Q(f2__in=A.objects.values_list('f2', flat=True))
)

Note that this checks f1 and f2 independently rather than as a pair, so it can also exclude rows whose exact (f1, f2) combination does not exist in A; see the pair-wise sketch below.
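A pair-wise variant without a correlated subquery, as a minimal sketch (it assumes the set of (f1, f2) pairs in A is small enough to be loaded into memory and turned into a Q object):

from django.db.models import Q

# collect the existing (f1, f2) pairs from A and OR them together
pairs_q = Q()
for f1, f2 in A.objects.values_list('f1', 'f2'):
    pairs_q |= Q(f1=f1, f2=f2)

# rows of B whose (f1, f2) combination does not appear in A
result = B.objects.exclude(pairs_q) if pairs_q else B.objects.all()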

Resolved it this way:
from django.db.models import OuterRef, Exists
a_queryset = A.objects.filter(f1=OuterRef('f1'), f2=OuterRef('f2'))
result_queryset = B.objects.filter(~Exists(a_queryset))
# B(f1=eagle, f2=100, f3=3.14)
And the reverse:
result_queryset = B.objects.filter(Exists(a_queryset))
# B(f1=cat, f2=200, f3=9.81)
# B(f1=dog, f2=300, f3=100.500)
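For reference, passing an Exists expression (or its negation) straight to filter() only works on Django 3.0 and later, if memory serves; on older versions the same query can be spelled with annotate(), as a minimal sketch:

from django.db.models import Exists, OuterRef

a_queryset = A.objects.filter(f1=OuterRef('f1'), f2=OuterRef('f2'))
result_queryset = (
    B.objects
    .annotate(has_match=Exists(a_queryset))
    .filter(has_match=False)
)
# B(f1='eagle', f2=100, f3=3.14)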

Related

Find nodes with multiple parents using networkx

Let's say I have a directed graph (this is about relational tables). I want to find the M:N tables so I can track the relationships they enable.
from pathlib import Path
import subprocess
import networkx as nx

def write_svg(g, name):
    temp = "temp.dot"
    suffix = "jpg"
    nx.nx_agraph.write_dot(g, temp)
    pa_img = Path(f"{name}.{suffix}")
    li_cmd = f"/opt/local/bin/dot {temp} -T {suffix} -o {pa_img}".split()
    subprocess.check_output(li_cmd)

G = nx.DiGraph()
G.add_edge("C1", "P1")
G.add_edge("C2", "P1")
G.add_edge("C21", "C2")
G.add_edge("MN12", "P1")
G.add_edge("MN12", "P2")
G.add_nodes_from([
    ("MN12", {"color": "red"})
])
Running this renders the graph. What I am getting at is that MN12 has both P1 and P2 as parents, so I want to consider P2 to be related to P1, with MN12 as the mapping table.
In other words, if I hard-code the relationship graph I want:
G = nx.DiGraph()
G.add_edge("C1", "P1")
G.add_edge("C2", "P1")
G.add_edge("C21", "C2")
G.add_edge("P2(using MN12)", "P1")
G.add_nodes_from([
    ("P2(using MN12)", {"color": "green"})
])
Note that C21 remains a child of C2. Only MN12 is modified, because it has 2 parents.
Now, I know I can see the degree of a given node.
Going back to my input graph:
(Pdb++) G.degree('MN12')
2
(Pdb++) G.degree('C1')
1
(Pdb++) G.degree('C2')
2
(Pdb++) G.degree('P1')
3
But how do I see that the arrows from MN12 go towards both P1 and P2? Is this even a question for networkx?
You can use either out_edges or successors:
>>> G.out_edges("MN12")
OutEdgeDataView([('MN12', 'P1'), ('MN12', 'P2')])
>>> list(G.successors("MN12"))
['P1', 'P2']
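To find every node that points at two or more parents (the M:N candidates), here is a minimal sketch, assuming edges run from child to parent as in the example graph:

import networkx as nx

G = nx.DiGraph()
G.add_edges_from([
    ("C1", "P1"), ("C2", "P1"), ("C21", "C2"),
    ("MN12", "P1"), ("MN12", "P2"),
])

# a node is an M:N candidate if its outgoing edges reach two or more parents
mapping_tables = [n for n in G.nodes if G.out_degree(n) >= 2]
print(mapping_tables)              # ['MN12']
print(list(G.successors("MN12")))  # ['P1', 'P2']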

Read and merge large tables on computer cluster

I need to merge several large tables (up to 10 GB each) into a single one. To do so I am using a computer cluster with 50+ cores and 10+ GB of RAM that runs on Linux.
I always end up with an error message like: "Cannot allocate vector of size X Mb".
Given that commands like memory.limit(size=X) are Windows-specific and not accepted here, I cannot find a way around this to merge my large tables.
Any suggestion welcome!
This is the code I use:
library(parallel)
no_cores <- detectCores() - 1
cl <- makeCluster(no_cores)
temp = list.files(pattern="*.txt$")
gc()
Here the error occurs:
myfiles = parLapply(cl, temp, function(x) read.csv(x,
                                                   header = TRUE,
                                                   sep = ";",
                                                   stringsAsFactors = FALSE,
                                                   encoding = "UTF-8",
                                                   na.strings = c("NA", "99", "")))
myfiles.final = do.call(rbind, myfiles)
You could just use merge, for example:
mergedTable <- merge(table1, table2, by = "dbSNP_RSID")
If your samples have overlapping column names, then you'll find that the mergedTable has (for example) columns called Sample1.x and Sample1.y. This can be fixed by renaming the columns before or after the merge.
Reproducible example:
x <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
                matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
                              sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
y <- data.frame(dbSNP_RSID = paste0("rs", sample(1e6, 1e5)),
                matrix(paste0(sample(c("A", "C", "T", "G"), 1e7, replace = TRUE),
                              sample(c("A", "C", "T", "G"), 1e7, replace = TRUE)), ncol = 100))
colnames(x)[2:101] <- paste0("Sample", 1:100)
colnames(y)[2:101] <- paste0("Sample", 101:200)
mergedDf <- merge(x, y, by = "dbSNP_RSID")
One way to approach this is with Python and dask. A dask dataframe lives mostly on disk rather than in RAM, which lets you work with larger-than-RAM data, and it can parallelize computations for you. A nice tutorial on ways to work with big data can be found in this kaggle post, which might also be helpful for you, and I also suggest checking out the docs on dask performance here. To be clear, if your data fits in RAM, a regular R data frame or pandas dataframe will be faster.
Here's a dask solution, which assumes the tables have matching column names so the concat operation can align them. Please add to your question if there are any other special requirements about the data we need to consider.
import dask.dataframe as dd
import glob

tmp = glob.glob("*.txt")

dfs = []
for f in tmp:
    # read the large tables
    ddf = dd.read_table(f)
    # make a list of all the dfs
    dfs.append(ddf)

# row-wise concat of the data
dd_all = dd.concat(dfs)

# repartition the df to 1 partition for saving
dd_all = dd_all.repartition(npartitions=1)

# save the data
# provide a list of one name if you don't want the partition number appended
dd_all.to_csv(['all_big_files.tsv'], sep='\t')
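Once written, the merged file can be loaded back lazily and worked on the same way; a small sketch, reusing the file name from above:

import dask.dataframe as dd

merged = dd.read_csv('all_big_files.tsv', sep='\t')
print(merged.columns)  # column names are available from the metadata immediately
print(len(merged))     # row count; this triggers an actual computation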
If you just want to concatenate all the tables together, you can do something like this in straight Python (you could also use Linux cat/paste):
with open('all_big_files.tsv', 'w') as O:
    file_number = 0
    for f in tmp:
        with open(f, 'r') as F:
            if file_number == 0:
                for line in F:
                    line = line.rstrip()
                    O.write(line + '\n')
            else:
                # skip the header line
                l = F.readline()
                for line in F:
                    line = line.rstrip()
                    O.write(line + '\n')
        file_number += 1

Matlab's writetable() function only exporting partial data of a table

The writetable() function at the end of my code only exports the first row (namely the variable names FR_1w, FR_2w and FR_3w), whereas I want the entire table to be exported and written as .xls or .xlsx.
V=[{A B C};...
{A1 B1 C1};...
{A2 B2 C2}];
X=cell2table(V);
X.Properties.VariableNames= {'FR_1w' 'FR_2w' 'FR_3w'};
X.Properties.RowNames= {'4Weeks' '12Weeks' '24Weeks'};
writetable(X, 'X.xlsx')
n.b. Variables in table V are 3x1 cells.
My workaround solution:
Z=[{A{1,1} B{1,1} C{1,1}};...
{A{2,1} B{2,1} C{2,1}};...
{A{3,1} B{3,1} C{3,1}};...
{A1{1,1} B1{1,1} C1{1,1}};...
{A1{2,1} B1{2,1} C1{2,1}};...
{A1{3,1} B1{3,1} C1{3,1}};...
{A2{1,1} B2{1,1} C2{1,1}};...
{A2{2,1} B2{2,1} C2{2,1}};...
{A2{3,1} B2{3,1} C2{3,1}}];
Tstat = cell2table(Z);
Tstat.Properties.VariableNames = {'ok1' 'ok2' 'ok3'};
Tstat.Properties.RowNames = {'one' 'two' 'three' 'four' 'five' ...
'six' 'seven' 'eight' 'nine'};
writetable(Tstat, 'TstatOverview.xlsx')

Aligning and italicising table column headings using Rmarkdown and pander

I am writing an R Markdown document, knitted to PDF, with tables taken from portions of the lists returned by ezANOVA (from the ez package). The tables are made using the pander package. A toy R Markdown file with a toy dataset is below.
---
title: "Table Doc"
output: pdf_document
---
```{r global_options, include=FALSE}
# set global knitr chunk options
knitr::opts_chunk$set(fig.width=12, fig.height=8, fig.path='Figs/',
                      echo=FALSE, warning=FALSE, message=FALSE, dev = 'pdf')
```
```{r, echo=FALSE}
# toy data
id <- rep(c(1,2,3,4), 5)
group1 <- factor(rep(c("A", "B"), 10))
group2 <- factor(rep(c("A", "B"), each = 10))
dv <- runif(20, min = 0, max = 10)
df <- data.frame(id, group1, group2, dv)
```
``` {r anova, echo = FALSE}
library(ez)
library(plyr)
library(pander)
# create anova object
anOb <- ezANOVA(df,
                dv = dv,
                wid = id,
                between = c(group1, group2),
                type = 3,
                detailed = TRUE)

# extract the output table from the anova object, reduce it down to only the desired columns
anOb <- data.frame(anOb[[1]][, c("Effect", "F", "p", "p<.05")])

# format entries in columns
anOb[, 2] <- format(round(anOb[, 2], digits = 1), nsmall = 1)
anOb[, 3] <- format(round(anOb[, 3], digits = 4), nsmall = 1)

pander(anOb, justify = c("left", "center", "center", "right"))
```
Now I have a few problems
a) For the last three columns I would like to have the column heading in the table aligned in the center, but the actual column entries underneath those headings aligned to the right.
b) I would like the column headings 'F' and 'p' in italics, and the 'p' in the 'p<.05' column in italics too, with the rest in normal font.
I tried renaming the column headings using plyr::rename like so:
anOb <- rename(anOb, c("F" = "italic(F)", "p" = "italic(p)", "p<.05" = ""))
but it didn't work.
In markdown, you have to use the markdown syntax for italics, which is wrapping the text between stars or underscores:
> names(anOb) <- c('Effect', '*F*', '*p*', '*p<.05*')
> pander(anOb)
-----------------------------------------
Effect *F* *p* *p<.05*
--------------- ------ -------- ---------
(Intercept) 52.3 0.0019 *
group1 1.3 0.3180
group2 2.0 0.2261
group1:group2 3.7 0.1273
-----------------------------------------
If you want to do that in a programmatic way, you can also use the pandoc.emphasis helper function to add the stars to a string.
But your other problem is due to a bug in the package, for which I've just proposed a fix on GH. Please feel free to give that branch a try and report back on GH -- I will try to get some time later this week to clean up the related unit tests and merge the branch if everything seems to be OK.

How to add meta_data to Pandas dataframe?

I use Pandas dataframes heavily, and I need to attach some data to them, for example to record the birth time of the dataframe, an additional description, etc.
I just can't find a reserved field of the DataFrame class to keep such data.
So I changed the core\frame.py file to add a line _reserved_slot = {} to solve my issue. I am posting the question here just to ask: is it OK to do so? Or is there a better way to attach metadata to a dataframe/column/row etc.?
#----------------------------------------------------------------------
# DataFrame class
class DataFrame(NDFrame):
    _auto_consolidate = True
    _verbose_info = True
    _het_axis = 1
    _col_klass = Series

    _AXIS_NUMBERS = {
        'index': 0,
        'columns': 1
    }
    _reserved_slot = {}  # added by bigbug to keep extra data for the dataframe
    _AXIS_NAMES = dict((v, k) for k, v in _AXIS_NUMBERS.iteritems())
EDIT: (demo of witingkuo's approach below)
>>> df = pd.DataFrame(np.random.randn(10,5), columns=list('ABCDEFGHIJKLMN')[0:5])
>>> df
A B C D E
0 0.5890 -0.7683 -1.9752 0.7745 0.8019
1 1.1835 0.0873 0.3492 0.7749 1.1318
2 0.7476 0.4116 0.3427 -0.1355 1.8557
3 1.2738 0.7225 -0.8639 -0.7190 -0.2598
4 -0.3644 -0.4676 0.0837 0.1685 0.8199
5 0.4621 -0.2965 0.7061 -1.3920 0.6838
6 -0.4135 -0.4991 0.7277 -0.6099 1.8606
7 -1.0804 -0.3456 0.8979 0.3319 -1.1907
8 -0.3892 1.2319 -0.4735 0.8516 1.2431
9 -1.0527 0.9307 0.2740 -0.6909 0.4924
>>> df._test = 'hello'
>>> df2 = df.shift(1)
>>> print df2._test
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "D:\Python\lib\site-packages\pandas\core\frame.py", line 2051, in __getattr__
(type(self).__name__, name))
AttributeError: 'DataFrame' object has no attribute '_test'
>>>
This is not supported right now; see https://github.com/pydata/pandas/issues/2485. The reason is that the propagation of these attributes is non-trivial. You can certainly assign data, but almost all pandas operations return a new object, where the assigned data will be lost.
Your _reserved_slot will become a class variable, which might not work if you want to assign different values to different DataFrames. You can probably just assign what you want to the instance directly:
In [6]: import pandas as pd
In [7]: df = pd.DataFrame()
In [8]: df._test = 'hello'
In [9]: df._test
Out[9]: 'hello'
I think a decent workaround is putting your dataframe into a dictionary, with your metadata under other keys. So if you have a dataframe with cash flows, like:
df = pd.DataFrame({'Amount': [-20, 15, 25, 30, 100]},index=pd.date_range(start='1/1/2018', periods=5))
you can create a dictionary with the additional metadata and put the dataframe in there:
out = {'metadata': {'Name': 'Whatever', 'Account': 'Something else'}, 'df': df}
and then access the pieces as out['df'] and out['metadata'].
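A quick sketch of how that reads afterwards, reusing the names from above:

import pandas as pd

df = pd.DataFrame({'Amount': [-20, 15, 25, 30, 100]},
                  index=pd.date_range(start='1/1/2018', periods=5))
out = {'metadata': {'Name': 'Whatever', 'Account': 'Something else'}, 'df': df}

print(out['metadata']['Name'])    # 'Whatever'
print(out['df']['Amount'].sum())  # 150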