How to speed up "global" queries in Titan DB? - titan

We are using Titan with Persistit as backend, for a graph with about 100.000 vertices. Our use-case is quite complex, but the current problem can be illustrated with a simple example. Let's assume that we are storing Books and Authors in the graph. Each Book vertex has an ISBN number, which is unique for the whole graph.
I need to answer the following query:
Give me the set of ISBN numbers of all Books in the Graph.
Currently, we are doing it like this:
// retrieve graph instance
TitanGraph graph = getGraph();
// Start a Gremlin query (I omit the generics for brevity here)
GremlinPipeline gremlin = new GremlinPipeline().start(graph);
// get all vertices in the graph which represent books (we have author vertices, too!)
gremlin.V("type", "BOOK");
// the ISBN numbers are unique, so we use a Set here
Set<String> isbnNumbers = new HashSet<String>();
// iterate over the gremlin result and retrieve the vertex property
while(gremlin.hasNext()){
Vertex v = gremlin.next();
isbnNumbers.add(v.getProperty("ISBN"));
}
return isbnNumbers;
My question is: is there a smarter way to do this faster? I am new to Gremlin, so it might very well be that I do something horribly stupid here. The query currently takes 2.5 seconds, which is not too bad, but I would like to speed it up, if possible. Please consider the backend as fixed.

I doubt that there is a much faster way (you will always need to iterate over all book vertices), however a less verbose solution to your task is possible with groovy/gremlin.
On the sample graph you can run e.g. the following query:
gremlin> namesOfJaveProjs = []; g.V('lang','java').name.store(namesOfJaveProjs)
gremlin> namesOfJaveProjs
==>lop
==>ripple
Or for your book graph:
isbnNumbers = []; g.V('type','BOOK').ISBN.store(isbnNumbers)

Related

Concise way to filter on two child attributes in ArangoDB (AQL / Spring Data ArangoDB)

In ArangoDB I have documents in a trip collection which is related to documents in a driver collection via edges in a tripToDriver collection, and the trip documents are also related to documents in a departure collection via edges in a departureToTrip collection.
To fetch trips where their driver has a given idNumber and their associated departure has a startTime after a supplied date/time, I've successfully written the following AQL:
FOR doc IN trip
LET drivers = (FOR v IN 1..1 OUTBOUND doc tripToDriver RETURN v)
LET departures = (FOR v in 1..1 INBOUND doc departureToTrip RETURN v
FILTER drivers[0].idNumber == '999999-9999' AND departures[0].startTime >= '2018-07-30'
RETURN doc
But I wonder if there is a more concise / elegant way to achieve the same results?
A related question, since I'm using Spring Data ArangoDB: Is it possible to achieve this result with derived queries?
For a single relation I was able to create a query like:
Iterable<Trip> findTripsByDriversIdNumber( String driverId ); but haven't had luck incorporating the departure relation into this signature (maybe because it's inbound?).
First of all your query only works if you have only one connected driver/departure for every trip. You're fetching all linked drivers but only check the first found one.
If this is your model it is totally ok, but I would recommend to do the idNumber/startTime check within the sub queries. Then, because we only need to know that at least one driver/departure fits our filter condition, we add a LIMIT 1 to the sub query and return only a true. This is enough we need for our FITLER in our main query.
FOR doc IN trip
FILTER (FOR v IN 1..1 OUTBOUND doc tripToDriver FILTER v.idNumber == #idNumber LIMIT 1 RETURN true)[0]
FILTER (FOR v IN 1..1 INBOUND doc departureToTrip FILTER v.startTime >= #startTime LIMIT 1 RETURN true)[0]
RETURN doc
I tested to solve your case with a derived query. It would work, if there wasn't a bug in the current arangodb-spring-data release. The bug is already fixed but not yet released. You can already test it using a snapshot version of arangodb-spring-data (1.3.1-SNAPSHOT or 2.3.1-SNAPSHOT depending on your Spring Data version, see supported versions).
The following derived query method worked for me with the snapshot version.
Iterable<Trip> findByDriversIdNumberAndDeparturesStartTimeGreaterThanEqual(String idNumber, LocalDate startTime);
To make the dervied query work you need the following annotated fields in your class Trip
#Relations(edges = TripToDriver.class, direction = Direction.OUTBOUND, maxDepth = 1)
private Collection<Driver> drivers;
#Relations(edges = DepartureToTrip.class, direction = Direction.INBOUND, maxDepth = 1)
private Collection<Departure> departures;
I also created an working example project on github.

Gremlin query to find the count of a label for all the nodes

Sample query
The following query returns me the count of a label say
"Asset " for a particular id (0) has >>>
g.V().hasId(0).repeat(out()).emit().hasLabel('Asset').count()
But I need to find the count for all the nodes that are present in the graph with a condition as above.
I am able to do it individually but my requirement is to get the count for all the nodes that has that label say 'Asset'.
So I am expecting some thing like
{ v[0]:2
{v[1]:1}
{v[2]:1}
}
where v[1] and v[2] has a node under them with a label say "Asset" respectively, making the overall count v[0] =2 .
There's a few ways you could do it. It's maybe a little weird, but you could use group()
g.V().
group().
by().
by(repeat(out()).emit().hasLabel('Asset').count())
or you could do it with select() and then you don't build a big Map in memory:
g.V().as('v').
map(repeat(out()).emit().hasLabel('Asset').count()).as('count').
select('v','count')
if you want to maintain hierarchy you could use tree():
g.V(0).
repeat(out()).emit().
tree().
by(project('v','count').
by().
by(repeat(out()).emit().hasLabel('Asset')).select(values))
Basically you get a tree from vertex 0 and then apply a project() over that to build that structure per vertex in the tree. I had a different way to do it using union but I found a possible bug and had to come up with a different method (actually Gremlin Guru, Daniel Kuppitz, came up with the above approach). I think the use of project is more natural and readable so definitely the better way. Of course as Mr. Kuppitz pointed out, with project you create an unnecessary Map (which you just get rid of with select(values)). The use of union would be better in that sense.

Titan - Cassandra: Process entire set of vertices of a given type without running out of memory

I'm new to Titan and looking for the best way to iterate over the entire set of vertices with a given label without running out of memory. I come from a strong SQL background so I am still working on switching my way of thinking away from SQL-type thinking. Let's say I have 1 million profile vertices. I would like to iterate over each one and perform some type of statistical analysis of the information linked to each profile. I don't really care how long the entire analysis process takes, but I need to iterate over all of the profiles. In SQL I would do SELECT * FROM MY_TABLE, using a scroll-sensitive result, fetch the next result, grab and process the info linked to that row, then fetch the next result. I also don't care if the result is real-time accurate as it is just for gathering general stats, so if a new profile is added during iteration and I miss it, that's ok.
Even if there is a way to grab all the values for a given property, that would probably work too because then I could go through that list and grab each vertex by its ID for example.
I believe titan does lazy loading so you should be able to just iterate over the whole graph:
GraphTraversal<Vertex, Vertex> it = graph.traversal().V();
while(it.hasNext()){
Vertex v = it.next():
//Do what you want here
}
Another option would be to use the range step so that you explicitly choose the range of vertices you need. For example:
List<Vertex> vertices = graph.traversal().V().range(0, 3).toList();
//Do what you want with your batch of vertices.
With regards to getting vertices of a specific type you can query vertices based on their internal properties. For example if you have and internal property "TYPE" which defined the type you are interested in. You can query for those vertices by:
graph.traversal().V().has("TYPE", "A"); //Gets vertices of type A
graph.traversal().V().has("TYPE", "B"); //Gets vertices of type B

UPSERTing with TitanDB

I'm having my first steps as a TitanDB user. That being, I'd like to know how to make an upsert / conditionally insert a vertex inside a TitanTransaction (in the style of "get or create").
I have a unique index on the vertex/property I want to create/lookup.
Here's a one-liner "getOrCreate" for Titan 1.0 and TinkerPop 3:
getOrCreate = { id ->
g.V().has('userId', id).tryNext().orElseGet{ g.addV('userId', id).next() }
}
As taken from the new TinkerPop "Getting Started" Tutorial. Here is the same code translated to java:
public Vertex getOrCreate(Object id) {
return g.V().has('userId', id).tryNext().orElseGet(() -> g.addV('userId', id).next());
}
Roughly speaking, every Cassandra insert is an "upsert". If you look at the Titan representation of vertices and edges in a Cassandra-like model, you'll find vertices and edges each get their own rows. This means a blind write of an edge will have the given behavior you're looking for: what you write is what will win. Doing this with a vertex isn't supported directly by Titan.
But I don't think this is what you're looking for. If you're looking to enforce uniqueness, why not use the unique() modifier on a Titan composite index? (From the documentation):
mgmt.buildIndex('byNameUnique', Vertex.class).addKey(name).unique().buildCompositeIndex()
With a Cassandra storage backend, you need to enable consistency locking. As with any database, managing the overhead of uniqueness comes at a cost which you need to consider when writing your data. This way if you insert a vertex which violates your uniqueness requirement, the transaction will fail.

QueryDSL: querying relations and properties

I'm using QueryDSL with JPA.
I want to query some properties of an entity, it's like this:
QPost post = QPost.post;
JPAQuery q = new JPAQuery(em);
List<Object[]> rows = q.from(post).where(...).list(post.id, post.name);
It works fine.
If i want to query a relation property, e.g. comments of a post:
List<Set<Comment>> rows = q.from(post).where(...).list(post.comments);
It's also fine.
But when I want to query relation and simple properties together, e.g.
List<Object[]> rows = q.from(post).where(...).list(post.id, post.name, post.comments);
Then something went wrong, generiting a bad SQL syntax.
Then I realized that it's not possible to query them together in one SQL statement.
Is it possible that QueryDSL would somehow deal with relations and generate additional queries (just like what hibernate does with lazy relations), and load the results in?
Or should I just query twice, and then merge both result lists?
P.S. what i actually want is each post with its comments' ids. So a function to concat each post's comment ids is better, is this kind of expressin possible?
q.list(post.id, post.name, post.comments.all().id.join())
and generate a subquery sql like (select group_concat(c.id) from comments as c inner join post where c.id = post.id)
Querydsl JPA is restricted to the expressivity of JPQL, so what you are asking for is not possible with Querydsl JPA. You can though try to express it with Querydsl SQL. It should be possible. Also as you don't project entities, but literals and collections it might work just fine.
Alternatively you can load the Posts with only the Comment ids loaded and then project the id, name and comment ids to something else. This should work when accessors are annotated.
The simplest thing would be to query for Posts and use fetchJoin for comments, but I'm assuming that's too slow for you use case.
I think you ought to simply project required properties of posts and comments and group the results by hand (if required). E.g.
QPost post=...;
QComment comment=..;
List<Tuple> rows = q.from(post)
// Or leftJoin if you want also posts without comments
.innerJoin(comment).on(comment.postId.eq(post.id))
.orderBy(post.id) // Could be used to optimize grouping
.list(new QTuple(post.id, post.name, comment.id));
Map<Long, PostWithComments> results=...;
for (Tuple row : rows) {
PostWithComments res = results.get(row.get(post.id));
if (res == null) {
res = new PostWithComments(row.get(post.id), row.get(post.name));
results.put(res.getPostId(), res);
}
res.addCommentId(row.get(comment.id));
}
NOTE: You cannot use limit nor offset with this kind of queries.
As an alternative, it might be possible to tune your mappings so that 1) Comments are always lazy proxies so that (with property access) Comment.getId() is possible without initializing the actual object and 2) using batch fetch* on Post.comments to optimize collection fetching. This way you could just query for Posts and then access id's of their comments with little performance hit. In most cases you shouldn't even need those lazy proxies unless your Comment is very fat. That kind of code would certainly look nicer without low level row handling and you could also use limit and offset in your queries. Just keep an eye on your query log to make sure everything works as intended.
*) Batch fetching isn't directly supported by JPA, but Hibernate supports it through mapping and Eclipselink through query hints.
Maybe some day Querydsl will support this kind of results grouping post processing out-of-box...