Counting Super Nodes On Titan

In my system I have the requirement that the number of edges on a node must be stored both as an internal property on the vertex and as a vertex-centric index on a specific outgoing edge. This naturally requires me to count the number of edges on the node after all the data has finished loading. I do so as follows:
long edgeCount = graph.getGraph().traversal().V(vertexId).bothE().count().next();
However, when I scale up my tests to the point where some of my nodes are "super" nodes, I get the following exception on the above line:
Caused by: com.netflix.astyanax.connectionpool.exceptions.TransportException: TransportException: [host=127.0.0.1(127.0.0.1):9160, latency=4792(4792), attempts=1]org.apache.thrift.transport.TTransportException: Frame size (70936735) larger than max length (62914560)!
at com.netflix.astyanax.thrift.ThriftConverter.ToConnectionPoolException(ThriftConverter.java:197) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:65) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.AbstractOperationImpl.execute(AbstractOperationImpl.java:28) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.ThriftSyncConnectionFactoryImpl$ThriftConnection.execute(ThriftSyncConnectionFactoryImpl.java:153) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.connectionpool.impl.AbstractExecuteWithFailoverImpl.tryOperation(AbstractExecuteWithFailoverImpl.java:119) ~[astyanax-core-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.connectionpool.impl.AbstractHostPartitionConnectionPool.executeWithFailover(AbstractHostPartitionConnectionPool.java:352) ~[astyanax-core-3.8.0.jar!/:3.8.0]
at com.netflix.astyanax.thrift.ThriftColumnFamilyQueryImpl$4.execute(ThriftColumnFamilyQueryImpl.java:538) ~[astyanax-thrift-3.8.0.jar!/:3.8.0]
at com.thinkaurelius.titan.diskstorage.cassandra.astyanax.AstyanaxKeyColumnValueStore.getNamesSlice(AstyanaxKeyColumnValueStore.java:112) ~[titan-cassandra-1.0.0.jar!/:na]
What is the best way to fix this? Should I simply increase the frame size, or is there a better way to count the number of edges on the node?

Yes, you will need to increase the frame size. When you have a supernode, there is a really big row that needs to be read out of the storage backend, and this is even true in the OLAP case. I agree that if you are planning to calculate this on every vertex in the graph, this would be best done as an OLAP operation.
This and several other good tips can be found in this Titan mailing list thread. Keep in mind that the link is pretty old; the concepts are still valid, but some of the Titan configuration property names may have changed since then.
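For illustration, here is a minimal sketch of raising the frame size when opening the graph programmatically. The property name below is taken from the Titan 1.0 configuration reference, so verify it against your version, and Cassandra's own thrift_framed_transport_size_in_mb in cassandra.yaml must be raised to at least the same value:
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
// Assumed property name -- check the configuration reference for your Titan version.
TitanGraph graph = TitanFactory.build()
        .set("storage.backend", "cassandra")
        .set("storage.hostname", "127.0.0.1")
        .set("storage.cassandra.frame-size-mb", 128)  // well above the ~60 MB limit shown in the stack trace
        .open();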

A task like this is OLAP by nature and should be performed with a distributed system, not a single traversal.
There is a concept called GraphComputer in TinkerPop 3 which can be used to perform such a task.
It basically allows you to run Gremlin queries that are evaluated across multiple machines.
For example, you can use SparkGraphComputer to run your queries on top of Apache Spark.
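As a rough sketch (the Hadoop graph properties file name and the traversal-source syntax are assumptions and vary by TinkerPop version), computing the degree of every vertex as an OLAP job could look like this:
import java.util.Map;
import org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.GraphTraversalSource;
import org.apache.tinkerpop.gremlin.spark.process.computer.SparkGraphComputer;
import org.apache.tinkerpop.gremlin.structure.Graph;
import org.apache.tinkerpop.gremlin.structure.T;
import org.apache.tinkerpop.gremlin.structure.util.GraphFactory;
import static org.apache.tinkerpop.gremlin.process.traversal.dsl.graph.__.bothE;
// Open the graph through its OLAP (Hadoop/Spark) configuration rather than directly.
Graph hadoopGraph = GraphFactory.open("conf/hadoop-graph.properties");
GraphTraversalSource g = hadoopGraph.traversal(GraphTraversalSource.computer(SparkGraphComputer.class));
// Degree of every vertex, keyed by vertex id, evaluated in parallel across the cluster.
Map<Object, Object> degrees = g.V().group().by(T.id).by(bothE().count()).next();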

Related

R-tree - Remove algorithm using reinsertion

I am trying to implement an R-tree in Scala, following the guidelines from the original paper about the R-tree structure. In the deletion algorithm section it is stated:
Reinsert all entries of nodes in set Q. Entries from eliminated leaf nodes are reinserted in tree leaves as described in Insert, but entries from higher-level nodes must be placed higher in the tree, so that leaves of their dependent subtrees will be on the same level as leaves of the main tree.
I can't wrap my head around the last part. What is meant by "higher-level nodes must be placed higher in the tree"? How is that implemented? My idea was to remove the nodes that underflow, add their entries to the set Q, and at the end reinsert those entries using Insert. Is that incorrect, or only partially correct and missing something extra? If you can explain using examples as well, that would be great.
Nodes must be reinserted at the correct height, or the tree will become invalid. Remember that all leaves must be at the same level.
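As a minimal sketch of that bookkeeping (Entry, insert and insertAtLevel are hypothetical helpers, not from the paper): when condensing the tree, remember the level each eliminated node came from along with its entries, then reinsert each entry at that same level so every leaf ends up on one level.
// level 0 = leaf level; higher values are internal levels.
class QueuedEntry {
    final Entry entry;
    final int level;
    QueuedEntry(Entry entry, int level) { this.entry = entry; this.level = level; }
}
// CondenseTree collects the entries of every underflowing node n as new QueuedEntry(e, n.level).
void reinsertAll(java.util.List<QueuedEntry> q) {
    for (QueuedEntry qe : q) {
        if (qe.level == 0) {
            insert(qe.entry);                   // leaf entry: normal Insert from the paper
        } else {
            insertAtLevel(qe.entry, qe.level);  // subtree entry: insert so its leaves land on the leaf level
        }
    }
}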
Inserting and removing values in an R-tree is quite an expensive operation when you need to keep it optimally balanced for fast window or nearest-neighbour requests, especially in a multi-threaded environment.
A more efficient approach is to use a single writer (an actor or thread) that gathers updates in batches, packs a new R-tree instance, and publishes it in a volatile variable for readers.
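A minimal sketch of that single-writer, publish-via-volatile pattern (the RTree and Update types and their methods are placeholders for whichever implementation you pick):
class RTreeHolder {
    private volatile RTree current = RTree.empty();   // readers always see a fully built tree
    // Called only from the single writer thread/actor.
    void applyBatch(java.util.List<Update> batch) {
        RTree rebuilt = current.packWith(batch);       // build a fresh, balanced instance
        current = rebuilt;                             // the volatile write publishes it atomically
    }
    // Lock-free read path for window / nearest-neighbour queries.
    RTree snapshot() {
        return current;
    }
}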
Here is a comparison of some R-tree implementations that can be used in this way from Scala.

OLAP and OLTP queries in Gremlin

In Gremlin,
s = graph.traversal()
g = graph.traversal(computer())
I know the first one is for OLTP and the second for OLAP, and I know the difference between OLAP and OLTP at the definition level. I have the following questions on this:
1. How do the above queries differ in how they work?
2. Can I use the second one, using 'g', in queries in my application to get results? (I know this 'g' one gives results faster than the first one.)
3. What is the difference between OLAP and OLTP, with an example?
Thanks in advance.
From the user's perspective, in terms of results, there's no real difference between OLAP and OLTP. The Gremlin statements are the same save for the configuration of the TraversalSource, as you have shown with your use of computer() and other settings.
The difference is more in how the traversal is executed behind the scenes. OLAP-based traversals are meant to process the "entire graph" (i.e. all vertices/edges, perhaps more than once), whereas OLTP-based traversals are meant to process smaller bodies of data, typically starting with one or a handful of vertices and traversing from there. When you consider graphs at the scale of "billions of edges", it's easy to understand why an efficient mechanism like OLAP is needed to process them.
You really shouldn't think of OLTP vs OLAP as "faster" vs "slower". It's probably better to think of it as it is described in the documentation:
OLTP: real-time, limited data accessed, random data access, sequential processing, querying
OLAP: long running, entire data set accessed, sequential data access, parallel processing, batch processing
There's no reason why you can't use an OLAP traversal in your applications so long as your application is aware of the requirements of that traversal. If you have some SLA that says that REST requests must complete in under 0.5 seconds and you decide to use an OLAP traversal to get the answer, you will undoubtedly break your SLA. Assuming you execute the OLAP traversal job over Spark, it will take Spark 10-15 seconds just to get organized to run your job.
I'm not sure how to provide an example of OLAP and OLTP, except to talk about the use cases a little bit more, so it should be clear as to when to use one as opposed to the other. In any case, let's assume you have a graph with 10 billion edges. You would want your OLTP traversals to always start with some form of index lookup - like a traversal that shows the average age of the friends of the user "stephenm":
g.V().has('username','stephenm').out('knows').values('age').mean()
But what if I want to know the average age of every user in my database? In this case I don't have any index I can use to look up a "small set of starting vertices" - I have to process all the many millions/billions of vertices in my graph. This is a perfect use case for OLAP:
g.V().hasLabel('user').values('age').mean()
OLAP is also great for understanding growth of your graph and for maintaining your graph. With billions of edges and a high data ingestion rate, not knowing that your graph is growing improperly is a death sentence. It's good to use OLAP to grab global statistics over all the data in the graph:
g.E().label().groupCount()
g.V().label().groupCount()
In the above examples, you get an edge/vertex label distribution. If you have an idea of how your graph is growing, this can be a good indicator of whether or not your data ingestion process is working properly. On a billion-edge graph, trying to execute even one of these traversals without OLAP would take "forever", if it ever finished at all without error.

Performance tuning for creating a de Bruijn graph in OrientDB

I have searched the Internet and the manual for performance tuning options to decrease the build time of my de Bruijn graph in OrientDB.
Everything that follows is done in Java.
In OrientDB:
Kmers are indexed.
Edges are property-based.
What I want to do is:
read in a multiple sequence file
split each sequence into kmers
add the kmers to the database and create edges between adjacent kmers
Steps one and two are done already.
So when I add a kmer to OrientDB, I have to check whether that kmer already exists, and if it does I need its vertex so I can add a new edge.
Is there any fast way to do so?
I have already created a local hash map with the kmer as key and the OrientDB RID as value, but fetching the vertex still seems to take a lot of time.
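For reference, this is roughly the lookup-and-link step I am describing (a minimal sketch using OrientDB's Blueprints API; the class name "Kmer", the property "seq" and the edge label "next" are just placeholders):
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import java.util.Map;
Vertex getOrCreateKmer(OrientGraph graph, Map<String, Object> kmerToRid, String kmer) {
    Object rid = kmerToRid.get(kmer);
    if (rid != null) {
        return graph.getVertex(rid);              // existing kmer: load directly by RID
    }
    Vertex v = graph.addVertex("class:Kmer");     // new kmer vertex
    v.setProperty("seq", kmer);
    kmerToRid.put(kmer, v.getId());
    return v;
}
// then, for each pair of adjacent kmers: previous.addEdge("next", current);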
I already tried with:
OGlobalConfiguration.USE_WAL.setValue(false);
OGlobalConfiguration.TX_USE_LOG.setValue(false);
declareIntent(new OIntentMassiveInsert());
I need nearly 3 hours to add 256 kmers and 40,000,000 edges.
Also, the size of the created DB is 9 GB, while the input file was 40 MB.
Any suggestions on how to improve this?
Feel free to ask if something is not understandable.
Thanks a lot.
Michael
Edit:
Do you have any experience with the record grow factor? I think the node record holds some in- and out-edge information in it by default. Could I improve the runtime with RECORD_GROW_FACTOR?

Titan batch graph

I want to add a new property (and sometimes add edges) to a selection of nodes in an existing graph of 2 million nodes and 10+ million edges. I thought of using BatchGraph, but from its wiki it looks like it does not support any retrieval queries.
For example: retrieve these nodes with g.V('id',1).has('prop1','text1') and update 'prop1' to 'text2'.
What is the best way to do this?
I don't think you need to use BatchGraph here. It sounds as if you are doing a large graph mutation, in which case it would probably be best to just write a Gremlin script to do your changes. You don't have a very large graph, so unless you plan to do some very complex mutations (e.g. a fat multi-step traversal), it shouldn't take very long to execute. If you do think it's going to run "long", you should think of ways to parallelize the job; if you go that route you might consider using gpars.
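A minimal sketch of such a script (TinkerPop 3 syntax shown here; adjust it to whatever Gremlin version your Titan release ships with):
// Find the target vertices via the indexed property and rewrite it in one pass.
graph.traversal().V().has("prop1", "text1")
     .property("prop1", "text2")
     .iterate();
graph.tx().commit();
For a larger selection you would typically commit in batches (e.g. every few thousand mutated vertices) rather than in a single transaction.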
As your graph grows, you will find that you need to use Faunus for most data administration. Specifically, that means utilizing the script step.

Scala/Akka/STM design for large shared state?

I am new to Scala and Akka and am considering using it to solve a problem. Suppose I have a calculation engine (that searches for a solution). I'd like to parallelize that search both across cpus and across nodes by giving each cpu on each node its own engine instance.
The engine inputs consist of a small number of scalar inputs and a very large hash table. Each engine instance would use its scalar inputs to make some small local change to the hash table, calculate a goodness, then discard its changes (they do not need to be committed/seen by any other engine instance). The goodness value would be returned to some coordinator that would choose among the results.
I have been reading a bit about the STM TransactionalMap as a vehicle for shared state. This seems ideal, but I don't really see any complete examples using it as shared state.
Questions:
Does the actor/stm model seem right for this problem?
Can you show a specific example of how to distribute the shared state? (Is it a Ref[TransactionalMap[,]] passed as a message?)
Is there anything different about distributing the shared state within a node as opposed to across different nodes?
Inquiring Minds Want to Know,
Allan
In terms of handling shared memory, it doesn't sound like STM would be the right fit here, because you don't want the changes made in the engine instances to be committed to the shared copy of the hash table.
Instead, an immutable HashMap might be a better fit. The elements that do not change in the map can be shared by the engine instances with only the differences in each map taking additional memory space.
The actor model would fit what you want to do very well. Set up one actor for each engine instance you want and pass it a message with the scalar values and the hash map, then have it return the results to the coordinator.