How to get multiple vertices/edges in a single Gremlin query? - Titan

I am in a situation where I need to get two different types of vertices using a single query. For example, assume that the graph has the following structure:
Node("User")--Edge("is_member")-->Node("Membership")--Edge("is_member")-->Node("Group")
Assume that the nodes have the following properties:
Membership
status
date
Group
name
date
type
Now, I need to get all the Membership nodes that a user is_member of, along with the corresponding Group's name. How do I write a Gremlin query for this?
I am using the Bulbs framework. How do I store the result in a python object?

For user u1, the following query gives you a map whose keys are the Membership nodes and whose values are the lists of group names reachable from each membership node:
m=[:];u1.out('is_member').groupBy(m){it}{it.out('is_member').name}
Output is:
gremlin> m
==>v[m1]=[group1]
==>v[m2]=[group2, group3]
Here is the sample graph used:
g = new TinkerGraph()
u1 = g.addVertex('u1')
u2 = g.addVertex('u2')
m1 = g.addVertex('m1')
m2 = g.addVertex('m2')
g1 = g.addVertex('g1')
g2 = g.addVertex('g2')
g3 = g.addVertex('g3')
g.addEdge(u1, m1, 'is_member')
g.addEdge(u1, m2, 'is_member')
g.addEdge(u2, m2, 'is_member')
g.addEdge(m1, g1, 'is_member')
g.addEdge(m2, g2, 'is_member')
g.addEdge(m2, g3, 'is_member')
g1.name = 'group1'
g2.name = 'group2'
g3.name = 'group3'
See also: How do I write a sub-query?
(tested with Gremlin 2)
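As for the Bulbs part of the question, here is a minimal sketch of running that script from Python and keeping the result in a plain Python object. It assumes a Bulbs Graph connected to a Titan/Rexster server; the gremlin proxy call (g.gremlin.execute) and the user_id parameter are assumptions, so check them against your Bulbs version:
from bulbs.titan import Graph

g = Graph()
# the script returns the map m; user_id is a hypothetical vertex id standing in for u1
script = ("m=[:]; g.v(user_id).out('is_member')"
          ".groupBy(m){it}{it.out('is_member').name}; m")
resp = g.gremlin.execute(script, params=dict(user_id=4))
result = resp.content  # deserialized response: membership node -> list of group names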

Related

A scalable graph method for finding cliques in completely connected components (PySpark)

I'm trying to split the GraphFrame connectedComponents output so that each component is divided into sub-groups that are completely connected, meaning all vertices in a sub-group are connected to each other.
I'm using a NetworkX method to achieve this, as follows:
from pyspark.sql.functions import col, concat, lit, when
from pyspark.sql.types import LongType, StringType, StructField, StructType
import pandas as pd
import networkx as nx
from networkx.algorithms.clique import find_cliques

def create_subgroups(edges, components, key_name='component'):
    # joining the edges to enrich them with the component id
    sub_components = edges.join(
        components,
        [(edges.dst == components.id) | (edges.src == components.id)]
    ).select('src', 'dst', key_name).drop_duplicates()
    # caching the table using a temp table (save_temp_table is a helper defined elsewhere)
    sub_components = save_temp_table(sub_components, f'inner_sub_{key_name}s', zorder=[key_name])
    schema = StructType([
        StructField("index", LongType(), True),
        StructField("id", StringType(), True),
    ])
    # applying a pandas UDF to enrich each vertex with its new sub-component id
    sub_components = sub_components.groupby(key_name).applyInPandas(
        pd_create_subgroups, schema
    ).where('id != "not_connected"').drop_duplicates()
    # joining the output, multiplying each vertex by the number of sub-groups found
    components = components.join(sub_components, 'id', 'left')
    components = components.withColumn(
        key_name,
        when(col('index').isNull(), col(key_name))
        .otherwise(concat(col(key_name), lit('_'), col('index')))
    ).drop('index')
    return components

def pd_create_subgroups(pdf):
    # building the graph
    gnx = nx.from_pandas_edgelist(pdf, 'src', 'dst')
    # removing one-degree nodes
    outdeg = gnx.degree()
    to_remove = [n[0] for n in outdeg if n[1] == 1]
    gnx.remove_nodes_from(to_remove)
    bic = list(find_cliques(gnx))
    if len(bic) <= 2:
        return pd.DataFrame(data={"index": [-1], "id": ["not_connected"]})
    res = {"index": [], "id": []}
    ind = 0
    for i in bic:
        if len(i) < 3:
            continue
        for id in i:
            res['index'] = res['index'] + [ind]
            res['id'] = res['id'] + [id]
        ind += 1
    return pd.DataFrame(res)

# creating sub-components if necessary
subgroups = create_subgroups(edges, components, key_name='component')
My problem is that there's a very large component containing 80% of the vertices, causing very slow cluster performance. I've been trying to use labelPropagation to create smaller groups, but it didn't do the trick: it split the graph in a way that isn't suitable, separating vertices that should have been in the same groups.
This issue was resolved by separating the vertices into N groups, pulling all the edges for each vertex in a group, and calculating the sub-groups using the find_cliques method.
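A minimal sketch of that resolution, under stated assumptions: vertices and edges are the GraphFrame DataFrames from above, N and the bucket column are hypothetical, and pd_create_subgroups/schema are the ones defined earlier:
from pyspark.sql.functions import abs as sql_abs, hash as sql_hash

N = 32  # hypothetical number of buckets
# assign every vertex to one of N buckets
bucketed = vertices.withColumn('bucket', sql_abs(sql_hash('id')) % N)
# pull all edges touching a vertex in each bucket
bucket_edges = edges.join(
    bucketed, (edges.src == bucketed.id) | (edges.dst == bucketed.id)
).select('src', 'dst', 'bucket').drop_duplicates()
# run the clique-finding pandas UDF per bucket instead of per giant component
subgroups = bucket_edges.groupby('bucket').applyInPandas(pd_create_subgroups, schema)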

Phylogenetic model using multiple entries for each species

I am relatively new to phylogenetic regression models. In the past I used PGLS when I had only one entry for each species in my tree. Now I have a dataset with thousands of records for a total of 9 species, and I would like to run a phylogenetic model. I read the tutorials of the most common packages (e.g. caper), but I am unsure how to build the model.
When I try to create the object for caper, i.e. using:
obj <- comparative.data(phy = Tree, data = Data, names.col = species, vcv = TRUE, na.omit = FALSE, warn.dropped = TRUE)
I get the message:
Error in `row.names<-.data.frame`(`*tmp*`, value = value) :
  duplicate 'row.names' are not allowed
In addition: Warning message:
non-unique values when setting 'row.names': ‘Species1’, ‘Species2’, ‘Species3’, ‘Species4’, ‘Species5’, ‘Species6’, ‘Species7’, ‘Species8’, ‘Species9’
I understood that I may solve this by applying an MCMCglmm model, but I am unfamiliar with Bayesian models.
Thanks in advance for your help.
This is indeed not going to work with a simple PGLS from caper because it cannot deal with individuals as a random effect. I suggest you use MCMCglmm, which is not much more complex to understand and will allow you to have individuals as a random effect. You can find excellent documentation from the package's author here or here, or alternative documentation that deals more with some specific aspects of the package (namely tree uncertainty) here.
Really briefly to get you going:
## Your comparative data
comp_data <- comparative.data(phy = my_tree, data = my_data,
                              names.col = species, vcv = TRUE)
Note that you can have a specimen column that can look like this:
taxa var1 var2 specimen
1 A 0.08730689 a spec1
2 B 0.47092692 a spec1
3 C -0.26302706 b spec1
4 D 0.95807782 b spec1
5 E 2.71590217 b spec1
6 A -0.40752058 a spec2
7 B -1.37192856 a spec2
8 C 0.30634567 b spec2
9 D -0.49828379 b spec2
10 E 1.42722363 b spec2
You can then set up your formula (similar to a simple lm formula):
## Your formula
my_formula <- variable1 ~ variable2
And your MCMC settings:
## Setting the prior list (see the MCMCglmm course notes for details)
prior <- list(R = list(V = 1, nu = 0.002),
              G = list(G1 = list(V = 1, nu = 0.002)))
## Setting the MCMC parameters
## Number of iterations
nitt <- 12000
## Length of burnin
burnin <- 2000
## Amount of thinning
thin <- 5
And you should then be able to run a default MCMCglmm:
## Extracting the comparative data
mcmc_data <- comp_data$data
## As MCMCglmm requires a column named animal for it to identify it as a phylo
## model we include an extra column with the species names in it.
mcmc_data <- cbind(animal = rownames(mcmc_data), mcmc_data)
mcmc_tree <- comp_data$phy
## The MCMCglmm
mod_mcmc <- MCMCglmm(fixed = my_formula,
                     random = ~ animal + specimen,
                     family = "gaussian",
                     pedigree = mcmc_tree,
                     data = mcmc_data,
                     nitt = nitt,
                     burnin = burnin,
                     thin = thin,
                     prior = prior)

SQL query to include V and E props in the result

I feel like this should be simple, but lots of searching and experimenting has not produced any results.
I have a very simple ODB database with one V and one E class. V has various props and E has one prop: "order"
This simple ODB SQL query...
select expand(out()) from #12:34
...returns all the props from the V records connected by the "out" edges on #12:34 (i.e. the child records of #12:34), working as expected.
But what I'm not able to do is to also include the one prop from those edges, as well as sort by that E prop (which should be trivial once I can get it into the projections).
This works fine in OrientDB v3.0.0 RC2 (the GA will be released in a few days):
MATCH
{class:V, where:(#rid = ?), as:aVertex}.outE(){as:theEdge}.inV(){as:otherVertex}
RETURN theEdge:{*}, otherVertex:{*}
you can also return single properties with
RETURN theEdge.prop1 as p1, theEdge.prop2 as p2, otherVertex.propX as p3
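If you need the result in Python, here is a hedged sketch with the pyorient driver, sorting client-side by the edge's "order" property. The connection details, credentials, the inlined #12:34 rid (with @rid in place of the answer's parameter placeholder), and the exact layout of the returned projections are all assumptions:
import pyorient

client = pyorient.OrientDB("localhost", 2424)
client.db_open("mydb", "admin", "admin")
query = ("MATCH {class:V, where:(@rid = #12:34), as:aVertex}"
         ".outE(){as:theEdge}.inV(){as:otherVertex} "
         "RETURN theEdge:{*}, otherVertex:{*}")
rows = client.query(query)
# each OrientRecord exposes its projections via oRecordData;
# the nested layout may differ between driver versions
rows.sort(key=lambda r: r.oRecordData['theEdge']['order'])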

Adding edge attribute causes TypeError: 'AtlasView' object does not support item assignment

Using networkx 2.0, I am trying to dynamically add an additional edge attribute by looping through all the edges. The graph is a MultiDiGraph.
According to the tutorial it seems to be possible to add edge attributes the way I do in the code below:
import networkx as nx

g = nx.read_gpickle("../pickles/" + gname)
yearmonth = gname[:7]
g.name = yearmonth  # works
for source, target in g.edges():
    g[source][target]['yearmonth'] = yearmonth
This code throws the following error:
TypeError: 'AtlasView' object does not support item assignment
What am I doing wrong?
That happens when your graph is an nx.MultiGraph (or nx.MultiDiGraph). In that case you need an extra edge key, going from 0 to n-1, where n is the number of parallel edges between the two nodes.
Try:
for source, target in g.edges():
    g[source][target][0]['yearmonth'] = yearmonth
The tutorial example is intended for a nx.Graph.
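Note that indexing with [0] only updates the first parallel edge between each node pair. A variant that covers every parallel edge in a MultiDiGraph, using the standard keys=True iteration and set_edge_attributes from networkx 2.x:
for source, target, key in g.edges(keys=True):
    g[source][target][key]['yearmonth'] = yearmonth
# equivalently, set the same value on every edge in one call
nx.set_edge_attributes(g, yearmonth, 'yearmonth')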

Produce State Frequencies and Entropy Index for a Particular Variable

I am able to generate separate plots from my data set (DISDATAE) for different variables (GENDER, RACE), such as:
seqIplot(DISDATAE.seq, border = NA, group = DISDATAE$GENDER, sortv = "from.start")
seqIplot(DISDATAE.seq, border = NA, group = DISDATAE$RACE, sortv = "from.start")
How do I generate separate state frequency and entropy tables for each variable?
I used this syntax for the entire data set: seqstatd(DISDATAE.seq[, 1:4]), but I was unable to create one for separate variables.
Just use by. I illustrate using the mvad data shipped with TraMineR:
library(TraMineR)
data(mvad)
# creating the state sequence object
mvad.seq <- seqdef(mvad[, 15:86])
## Distributions and cross-sectional entropies by sex
by(mvad.seq, mvad$male, seqstatd)
Hope this helps.