How to get nodes that have a given amount of outgoing relationships with a given property in Neo4j Cypher? - nosql

In my domain a node can have several relationships of the same type to other entities. Each relationship has several properties, and I'd like to retrieve the nodes that have at least 2 outgoing relationships with a given property value.
EG: A relationship between nodes have a property year. How do I find the nodes that have at least two outgoing relationships with the year set to 2012?
My Cypher query so far looks like this (syntax error):
START x = node(*)
MATCH x-[r:RELATIONSHIP_TYPE]->y
WITH COUNT(r.year == 2012) AS years
WHERE HAS(r.year) AND years > 1
RETURN x;
I tried also nesting queries but I believe that it's not allowed in Cypher. The closest thing is the following but I do not know how to get rid of the nodes with value 1:
START n = node(*)
MATCH n-[r:RELATIONSHIP_TYPE]->c
WHERE HAS(r.year) AND r.year == 2012
RETURN n, COUNT(r) AS counter
ORDER BY counter DESC

Try this query
START n = node(*)
MATCH n-[r:RELATIONSHIP_TYPE]->c
WHERE HAS(r.year) AND r.year=2012
WITH n, COUNT(r) AS rc
WHERE rc > 1
RETURN n, rc
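The step that makes this query work is the WITH clause: it aggregates the matching relationships per start node, and the second WHERE then filters on that aggregate. As a rough illustration of the same filter, count, filter logic outside Neo4j, here is a minimal Python sketch. The edge list, node names and the helper nodes_with_min_rels are made-up assumptions for illustration, not data or API from the question.

```python
from collections import Counter

# Hypothetical edges: (start_node, end_node, relationship properties)
edges = [
    ("a", "b", {"year": 2012}),
    ("a", "c", {"year": 2012}),
    ("a", "d", {"year": 2011}),
    ("b", "c", {"year": 2012}),
    ("c", "d", {}),  # relationship without a year property
]

def nodes_with_min_rels(edges, year, min_count):
    """Return {node: count} for nodes that have at least min_count
    outgoing relationships whose 'year' property equals year."""
    # Roughly: WHERE HAS(r.year) AND r.year = 2012
    counts = Counter(
        start for start, _end, props in edges
        if props.get("year") == year
    )
    # Roughly: WITH n, COUNT(r) AS rc WHERE rc > 1
    return {node: rc for node, rc in counts.items() if rc >= min_count}

result = nodes_with_min_rels(edges, year=2012, min_count=2)
print(result)
```

Node "b" has only one matching relationship, so it is dropped by the second filter, which is the part the original attempts were missing.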


Few minizinc questions on constraints

A little bit of background: I'm trying to make a model for clustering a Design Structure Matrix (DSM). I made a draft model and have a couple of questions. Most of them are not directly related to DSM per se.
include "globals.mzn";
int: dsmSize = 7;
int: maxClusterSize = 7;
int: maxClusters = 4;
int: powcc = 2;
enum dsmElements = {A, B, C, D, E, F,G};
array[dsmElements, dsmElements] of int: dsm =
[|1,1,0,0,1,1,0
|0,1,0,1,0,0,1
|0,1,1,1,0,0,1
|0,1,1,1,1,0,1
|0,0,0,1,1,1,0
|1,0,0,0,1,1,0
|0,1,1,1,0,0,1|];
array[1..maxClusters] of var set of dsmElements: clusters;
array[1..maxClusters] of var int: clusterCard;
constraint forall(i in 1..maxClusters)(
clusterCard[i] = pow(card(clusters[i]), powcc)
);
% #1
% constraint forall(i, j in clusters where i != j)(card(i intersect j) == 0);
% #2
constraint forall(i, j in 1..maxClusters where i != j)(
card(clusters[i] intersect clusters[j]) == 0
);
% #3
% constraint all_different([i | i in clusters]);
constraint (clusters[1] union clusters[2] union clusters[3] union clusters[4]) = dsmElements;
var int: intraCost = sum(i in 1..maxClusters, j, k in clusters[i] where k != j)(
(dsm[j,k] + dsm[k,j]) * clusterCard[i]
) ;
var int: extraCost = sum(el in dsmElements,
c in clusters where card(c intersect {el}) = 0,
k,j in c)(
(dsm[j,k] + dsm[k,j]) * pow(card(dsmElements), powcc)
);
var int: TCC = trace("\(intraCost), \(extraCost)\n", intraCost+extraCost);
solve maximize TCC;
Question 1
I was under the impression that constraints #1 and #2 are the same. However, it seems they are not. Why? What is the difference?
Question 2
How can I replace constraint #2 with all_different? Does it make sense?
Question 3
Why does trace("\(intraCost), \(extraCost)\n", intraCost+extraCost); show nothing in the output? The output I see using gecode is:
Running dsm.mzn
intraCost, extraCost
clusters = array1d(1..4, [{A, B, C, D, E, F, G}, {}, {}, {}]);
clusterCard = array1d(1..4, [49, 0, 0, 0]);
----------
<snipped to save space>
----------
clusters = array1d(1..4, [{B, C, D, G}, {A, E, F}, {}, {}]);
clusterCard = array1d(1..4, [16, 9, 0, 0]);
----------
==========
Finished in 5s 419msec
Question 4
The expression constraint (clusters[1] union clusters[2] union clusters[3] union clusters[4]) = dsmElements;, here I wanted to say that the union of all clusters should match the set of all nodes. Unfortunately, I did not find a way to make this big union more dynamic, so for now I just manually provide all clusters. Is there a way to make this expression return union of all sets from the array of sets?
Question 5
Basically, if I understand it correctly, for example from here, the Intra-cluster cost is the sum of all interactions within a cluster multiplied by the size of the cluster in some power, basically the cardinality of the set of nodes, that represents the cluster.
The Extra-cluster cost is a sum of interactions between some random element that does not belong to a cluster and all elements of that cluster multiplied by the cardinality of the whole space of nodes to some power.
The main question here is: are the intraCost and extraCost in the model correct (they seem to be, but still), and is there a better way to express these sums?
Thanks!
(Perhaps you would get more answers if you separate this into multiple questions.)
Question 3:
Here's an answer on the trace question:
When running the model, the trace actually shows this:
intraCost, extraCost
which is not what you expect, of course. Trace takes effect when the model is created, but at that stage these two decision variables have no values yet, so MiniZinc shows only the variable names. They get values after the (first) solution is reached, and these can then be shown in the output section.
trace is mostly used to see what's happening in loops, where one can trace the (fixed) loop variables etc.
If you trace an array of decision variables, they will be represented in a different fashion: the array x will be shown as X_INTRODUCED_0_ etc.
And you can also use trace for domain reflection, e.g. using lb and ub to get the lower/upper value of the domain of a variable ("safe approximation of the bounds" as the documentation states it: https://www.minizinc.org/doc-2.5.5/en/predicates.html?highlight=ub_array). Here's an example which shows the domain of the intraCost variable:
constraint
trace("intraCost: \(lb(intraCost))..\(ub(intraCost))\n")
;
which shows
intraCost: -infinity..infinity
You can read a little more about trace here https://www.minizinc.org/doc-2.5.5/en/efficient.html?highlight=trace .
Update: answers to questions 1, 2 and 4.
Constraints #1 and #2 mean the same thing, i.e. that the elements in clusters should be disjoint. Constraint #1 is a little different in that it loops over decision variables, while constraint #2 uses plain indices. One can guess that #2 is faster, since #1 uses where i != j on decision variables, which must be translated to some extra constraints. (And using i < j instead should be a little faster.)
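The all_different constraint states something slightly different: on an array of set variables it requires the sets to be pairwise distinct rather than pairwise disjoint (the global constraint for disjointness is all_disjoint). Depending on the underlying solver, a global constraint might be faster if it's translated to an efficient algorithm in the solver.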
In the model there is also the following constraint which states that all elements must be used:
constraint (clusters[1] union clusters[2] union clusters[3] union clusters[4]) = dsmElements;
Apart from efficiency, all the constraints above can be replaced with one single constraint, partition_set, which ensures that the clusters are disjoint and that all elements in dsmElements are used in clusters.
constraint partition_set(clusters,dsmElements);
It might be faster to also combine with the all_different constraint, but that has to be tested.
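What partition_set asks for can also be checked outside the solver, which is handy for sanity-checking a reported solution. The following is a minimal Python sketch under the assumption that a partition means "pairwise disjoint and covering the universe"; the helper name is_partition and the overlapping example are illustrative, while the first cluster list mirrors the last solution printed in the question.

```python
from itertools import combinations

def is_partition(clusters, universe):
    """True if the clusters are pairwise disjoint and their union
    equals the universe (empty clusters are allowed)."""
    # Each unordered pair is checked once, like using i < j instead of i != j
    pairwise_disjoint = all(
        a.isdisjoint(b) for a, b in combinations(clusters, 2)
    )
    covers_universe = set().union(*clusters) == set(universe)
    return pairwise_disjoint and covers_universe

universe = set("ABCDEFG")
solution = [set("BCDG"), set("AEF"), set(), set()]
overlapping = [set("ABCD"), set("DEFG"), set(), set()]  # D appears twice

print(is_partition(solution, universe))
print(is_partition(overlapping, universe))
```

Iterating over combinations rather than all ordered pairs halves the number of disjointness checks, which is the same idea as the i < j remark above.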

How To Use kmedoids from pyclustering with set number of clusters

I am trying to use k-medoids to cluster some trajectory data I am working with (multiple points along the trajectory of an aircraft). I want to cluster these into a set number of clusters (as I know how many types of paths there should be).
I have found that k-medoids is implemented inside the pyclustering package, and am trying to use that. I am technically able to get it to cluster, but I do not know how to control the number of clusters. I originally thought it was directly tied to the number of elements inside what I called initial_medoids, but experimentation shows that it is more complicated than this. My relevant code snippet is below.
Note that D holds a list of lists. Each list corresponds to a single trajectory.
import numpy as np
from scipy.spatial.distance import directed_hausdorff
from pyclustering.cluster.kmedoids import kmedoids

def hausdorff(u, v):
    # Symmetric Hausdorff distance between two trajectories
    d = max(directed_hausdorff(u, v)[0], directed_hausdorff(v, u)[0])
    return d

traj_count = len(traj_lst)
D = np.zeros((traj_count, traj_count))
for i in range(traj_count):
    for j in range(i + 1, traj_count):
        distance = hausdorff(traj_lst[i], traj_lst[j])
        D[i, j] = distance
        D[j, i] = distance

initial_medoids = [104, 345, 123, 1]
kmedoids_instance = kmedoids(traj_lst, initial_medoids)
kmedoids_instance.process()
cluster_lst = kmedoids_instance.get_clusters()[0]
num_clusters = len(np.unique(cluster_lst))
print('There were %i clusters found' % num_clusters)
I have a total of 1900 trajectories, and the code above finds 1424 clusters. I had expected that I could control the number of clusters through the length of initial_medoids, as I did not see any option to input the number of clusters into the program, but this seems unrelated. Could anyone point out the mistake I am making? How do I choose the number of clusters?
To obtain the clusters, you need to call get_clusters():
cluster_lst = kmedoids_instance.get_clusters()
Not get_clusters()[0] (in this case it is a list of object indexes in the first cluster):
cluster_lst = kmedoids_instance.get_clusters()[0]
And you are correct: you can control the number of clusters through initial_medoids.
It is true that you can control the number of clusters, which corresponds to the length of initial_medoids.
The documentation is not clear about this. The get_clusters function "Returns list of medoids of allocated clusters represented by indexes from the input data". So this function does not return the cluster labels. It returns the indices of rows in your original (input) data.
Please check the shape of cluster_lst in your example, using .get_clusters() and not .get_clusters()[0], as annoviko suggested. In your case, this shape should be (4,). So you have a list of four elements (clusters), each containing the indices of rows in your original data.
To get, for example, data from the first cluster, use:
kmedoids_instance = kmedoids(traj_lst, initial_medoids)
kmedoids_instance.process()
cluster_lst = kmedoids_instance.get_clusters()
# cluster_lst[0] holds the row indices of the first cluster
traj_lst_first_cluster = [traj_lst[i] for i in cluster_lst[0]]
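Since the result is a list of index lists rather than one label per trajectory, it can help to convert it when labels are what you need. Below is a minimal sketch in plain Python, assuming the clusters have the shape described above (one list of row indices per cluster). The helper clusters_to_labels and the sample indices are hypothetical and not part of pyclustering.

```python
def clusters_to_labels(clusters, n_items):
    """Turn a list of index lists into one cluster label per item.
    Items that appear in no cluster keep the label -1."""
    labels = [-1] * n_items
    for cluster_id, members in enumerate(clusters):
        for index in members:
            labels[index] = cluster_id
    return labels

# Made-up result for 6 items and 3 medoids
clusters = [[0, 3], [1, 4, 5], [2]]
labels = clusters_to_labels(clusters, n_items=6)

print(len(clusters))  # number of clusters = number of index lists
print(labels)
```

With this shape, the number of clusters is simply len(clusters), whereas counting unique values inside clusters[0] counts the members of the first cluster.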

Group by with paging (take skip)

I am trying to implement some kind of paging, but I need to do it on a grouped result: every time I fetch a page, all data for a given group must be fetched.
Below code:
var erere = dbCtx.StatusViewList
.GroupBy(p => p.TurbineNumber)
.OrderBy(p => p.FirstOrDefault().TurbineNumber)
.Skip(0)
.Take(10)
.ToList();
I have 200k items and the statement above is so slow that the connection times out. My best guess is that it's the OrderBy that slows it down. Any suggestions on how to do this, or how to speed up the statement above?
In your case, grouping on the server side is not needed at all, because you fetch all the data anyway and the grouping only adds overhead on the server. So try another approach:
var groupPage = dbCtx.StatusViewList.Select(x => x.TurbineNumber)
    .Distinct().OrderBy(x => x).Skip(40).Take(20).ToList();
var data = dbCtx.StatusViewList.Where(x => groupPage.Contains(x.TurbineNumber))
    .ToList().GroupBy(x => x.TurbineNumber).ToList();
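The idea here is a two-step fetch: first page over the distinct keys only, then load the rows for just those keys and group them on the client. A minimal in-memory Python sketch of that approach follows; the rows, field names and the helper page_of_groups are illustrative assumptions and stand in for the database queries.

```python
# Hypothetical rows standing in for the StatusViewList table
rows = [
    {"turbine": 4, "status": "A"}, {"turbine": 4, "status": "B"},
    {"turbine": 5, "status": "D"}, {"turbine": 5, "status": "K"},
    {"turbine": 6, "status": "C"}, {"turbine": 6, "status": "Z"},
    {"turbine": 7, "status": "E"},
]

def page_of_groups(rows, skip, take):
    """Return {turbine: [statuses]} for one page of turbine numbers."""
    # Step 1: page over the distinct, ordered keys only
    keys = sorted({row["turbine"] for row in rows})[skip:skip + take]
    # Step 2: fetch the rows for those keys and group them client-side
    groups = {key: [] for key in keys}
    for row in rows:
        if row["turbine"] in groups:
            groups[row["turbine"]].append(row["status"])
    return groups

page = page_of_groups(rows, skip=1, take=2)
print(page)
```

Every group on the page is complete, at the cost of pages that vary in row count.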
The GroupBy needs to visit all elements to group all StatusViews into groups of StatusViews that have equal TurbineNumber.
After that, you take every group, and from every group you take the first element and ask for its TurbineNumber, in order to sort by TurbineNumber.
Apparently you take into account that a group of StatusViews might be empty (FirstOrDefault instead of First), but then again, you assume that FirstOrDefault never returns null.
One of the things that could speed up your query is using the Key of your groups. The Key is the element on which you grouped, in your case the TurbineNumber: all elements in a group have the same TurbineNumber.
var result = dbCtx.StatusViewList
.GroupBy(statusView => statusView.TurbineNumber)
.OrderBy(group => group.Key)
...
I think that will be a first step to improve performance.
However, you return a fixed number of Groups. Some Groups might be huge, 1000s of elements, some groups might be small: only one element. So the result of one page could be 10 groups, each with 1000 elements, having a total of 10000 elements. It could also be 10 groups, each with 1 element, a total of 10 elements. I'm not sure if this would be the result you want by paging.
Wouldn't you prefer a page that always has the same number of elements, preferably with the same TurbineNumber? If there are not enough StatusViews with the same TurbineNumber, fill the rest of your page with the next TurbineNumber. If there are too many StatusViews with this TurbineNumber, divide them over several pages.
Something like:
TurbineNumber StatusView
4 A
4 B
4 F
5 D
5 K
6 C
6 Z
6 Q
6 W
7 E
To do this, don't GroupBy, use OrderBy and then Skip and Take
IEnumerable<StatusView> GetPage(int pageNr, int pageSize)
{
    return dbCtx.StatusViewList
        .OrderBy(statusView => statusView.TurbineNumber)
        .Skip(pageNr * pageSize)
        .Take(pageSize);
}
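To see what fixed-size pages look like on the sample table above, here is a minimal Python sketch of the same order, skip, take sequence on an in-memory list. The tuples reuse the sample values, and get_page is an illustrative stand-in for the database query, not the actual implementation.

```python
# (TurbineNumber, StatusView) pairs from the sample table
rows = [
    (4, "A"), (4, "B"), (4, "F"), (5, "D"), (5, "K"),
    (6, "C"), (6, "Z"), (6, "Q"), (6, "W"), (7, "E"),
]

def get_page(rows, page_nr, page_size):
    """Order by turbine number, then skip page_nr pages and take one."""
    ordered = sorted(rows, key=lambda row: row[0])  # stable sort
    start = page_nr * page_size
    return ordered[start:start + page_size]

second_page = get_page(rows, page_nr=1, page_size=4)
last_page = get_page(rows, page_nr=2, page_size=4)
print(second_page)
print(last_page)
```

Note that a group can straddle a page boundary here (turbine 6 is split over two pages), which is the trade-off against paging over whole groups.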
If you create an extra index for TurbineNumber, this will be very fast:
In your DbContext.OnModelCreating(DbModelBuilder modelBuilder):
// Add an extra index on TurbineNumber:
var indexAttribute = new IndexAttribute("TurbineIndex", 0) {IsUnique = false};
var indexAnnotation = new IndexAnnotation(indexAttribute);
modelBuilder.Entity<StatusView>()
    .Property(statusView => statusView.TurbineNumber)
    .HasColumnAnnotation(IndexAnnotation.AnnotationName, indexAnnotation);

MongoDB: combining limit and sort when using the find function

I have a MongoDB example with document a and document b:
a_id type
1 1
2 2
3 3
4 4
Now I want to extract the last N (1, 2, 3, 4, 5, ...) values in table b, in the same order as in the example above. But if I use the skip function:
b.find().skip(M)
then if M > N the result is empty, which is wrong. I want M to be dynamic.
If I use sort and limit, then the result is not in the correct order:
b.find().sort({$natural:-1}).limit(M)
result:
4 4
3 3
I want a solution!
You can use the same skip() to access the last N documents in the collection.
N = Last N documents to be accessed
So the query is
b.find().skip(b.count() - N).pretty()
or, since the mongo shell is just JavaScript, you can write:
var totalCount = b.count()
b.find().skip(totalCount - N).pretty()
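One edge case worth handling is N being larger than the number of documents, since count minus N then becomes negative. A minimal Python sketch of the clamped skip computation follows; it models skip on a plain list, and the helper last_n is hypothetical rather than a MongoDB driver call.

```python
def last_n(documents, n):
    """Return the last n documents in their original order.
    The skip value is clamped at 0 so that n > len(documents)
    yields everything instead of a negative skip."""
    skip = max(0, len(documents) - n)
    return documents[skip:]

# Made-up (a_id, type) pairs mirroring the example rows
docs = [(1, 1), (2, 2), (3, 3), (4, 4)]

print(last_n(docs, 2))
print(last_n(docs, 10))
```

The same max(0, totalCount - N) expression could be passed to skip() in the shell, assuming N is known on the client before the query is issued.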

What is the query to access the nodes in reverse to a relationship in neo4j

I am using neo4j as my graph db. I am having some problem with the queries. Here is the scenario.
I have a neo4j index called "users".
I have all the user nodes in the users index.
I have another index called "comments".
Every comment is a node.
And every comment has a relationship "HAS_COMMENT" with a user node.
So I have, user_node ->HAS_COMMENT-> comment_node
I can get all the comments of a user by this query.
$ start n = node:users(username='user1') match n-[r:HAS_COMMENT] -> a return a;
Now, I want to get in reverse direction. I have to get username from comment.
This is what I am trying, but I am getting a null result.
$ start n = node:comments(_id='c101') match n-[r:HAS_COMMENT] -> a return a;
c101 is my comment id(node id); and it is present in db.
How can I do this?
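You can match arrows either way. The first query below follows outgoing HAS_COMMENT relationships, the second follows incoming ones: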
start n = node:comments(_id='c101') match n-[r:HAS_COMMENT] -> a return a;
start n = node:comments(_id='c101') match n<-[r:HAS_COMMENT] - a return a;
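The reason the first pattern returns nothing for a comment node is that the relationship is stored with a direction, from user to comment. A minimal Python sketch of directed lookups on a made-up edge list illustrates this; the node names and the helpers outgoing and incoming are assumptions for illustration only.

```python
# Hypothetical directed relationships: (start, type, end)
edges = [
    ("user1", "HAS_COMMENT", "c101"),
    ("user1", "HAS_COMMENT", "c102"),
    ("user2", "HAS_COMMENT", "c103"),
]

def outgoing(node, rel_type):
    """Nodes reached by following rel_type away from node: n-[r]->a"""
    return [end for start, rel, end in edges
            if start == node and rel == rel_type]

def incoming(node, rel_type):
    """Nodes that point at node via rel_type: n<-[r]-a"""
    return [start for start, rel, end in edges
            if end == node and rel == rel_type]

print(outgoing("user1", "HAS_COMMENT"))  # comments of a user
print(outgoing("c101", "HAS_COMMENT"))   # empty: comments have no outgoing edge
print(incoming("c101", "HAS_COMMENT"))   # the user who owns the comment
```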