Titan index issues with Cassandra storage backend

I am populating a single Titan 1.0.0 instance with a moderately sized graph in order to test its query performance. I am using Cassandra 2.0.17 as the storage backend.
The problem is that I am not able to create vertex indexes, and hence cannot query optimally. I have read the docs and tried to follow them carefully, without much success. I am using the following Groovy script for the schema definition, data population, and index creation:
import com.thinkaurelius.titan.core.*;
import com.thinkaurelius.titan.core.schema.*;
import com.thinkaurelius.titan.graphdb.database.management.ManagementSystem;
import java.time.temporal.ChronoUnit;
graph = TitanFactory.open('conf/my-titan.properties');
mgmt = graph.openManagement();
// Build graph schema
// Node properties
idProp = mgmt.containsPropertyKey('userId') ?
    mgmt.getPropertyKey('userId') : mgmt.makePropertyKey('userId').dataType(String.class).cardinality(Cardinality.SINGLE).make();
isPublicProp = mgmt.containsPropertyKey('isPublic') ?
    mgmt.getPropertyKey('isPublic') : mgmt.makePropertyKey('isPublic').dataType(Boolean.class).cardinality(Cardinality.SINGLE).make();
completionPercentageProp = mgmt.containsPropertyKey('completionPercentage') ?
    mgmt.getPropertyKey('completionPercentage') : mgmt.makePropertyKey('completionPercentage').dataType(Integer.class).cardinality(Cardinality.SINGLE).make();
genderProp = mgmt.containsPropertyKey('gender') ?
    mgmt.getPropertyKey('gender') : mgmt.makePropertyKey('gender').dataType(String.class).cardinality(Cardinality.SINGLE).make();
regionProp = mgmt.containsPropertyKey('region') ?
    mgmt.getPropertyKey('region') : mgmt.makePropertyKey('region').dataType(String.class).cardinality(Cardinality.SINGLE).make();
lastLoginProp = mgmt.containsPropertyKey('lastLogin') ?
    mgmt.getPropertyKey('lastLogin') : mgmt.makePropertyKey('lastLogin').dataType(String.class).cardinality(Cardinality.SINGLE).make();
registrationProp = mgmt.containsPropertyKey('registration') ?
    mgmt.getPropertyKey('registration') : mgmt.makePropertyKey('registration').dataType(String.class).cardinality(Cardinality.SINGLE).make();
ageProp = mgmt.containsPropertyKey('age') ?
    mgmt.getPropertyKey('age') : mgmt.makePropertyKey('age').dataType(Integer.class).cardinality(Cardinality.SINGLE).make();
mgmt.commit();
nUsers = 0
println 'Starting nodes population...';
// Load users
new File('/home/jarandaf/soc-pokec-profiles.txt').eachLine {
    try {
        fields = it.split('\t').take(8);
        userId = fields[0];
        isPublic = fields[1] == '1';
        completionPercentage = fields[2] as int;
        gender = fields[3] == '1' ? 'male' : 'female';
        region = fields[4];
        lastLogin = fields[5];
        registration = fields[6];
        age = fields[7] as int;
        graph.addVertex('userId', userId, 'isPublic', isPublic, 'completionPercentage', completionPercentage, 'gender', gender, 'region', region, 'lastLogin', lastLogin, 'registration', registration, 'age', age);
    } catch (Exception e) {
        // Silently skip malformed lines...
    }
    nUsers += 1
    if (nUsers % 100000 == 0) println String.valueOf(nUsers) + ' loaded...';
};
graph.tx().commit();
println 'Nodes population finished';
// Index users by userId, gender and age
println 'Getting node properties...';
mgmt = graph.openManagement();
userId = mgmt.getPropertyKey('userId');
gender = mgmt.getPropertyKey('gender');
age = mgmt.getPropertyKey('age');
println 'Building byUserId index...';
if (mgmt.getGraphIndex('byUserId') == null) mgmt.buildIndex('byUserId', Vertex.class).addKey(userId).buildCompositeIndex();
println 'Building byGender index...';
if (mgmt.getGraphIndex('byGender') == null) mgmt.buildIndex('byGender', Vertex.class).addKey(gender).buildCompositeIndex();
println 'Building byAge index...';
if (mgmt.getGraphIndex('byAge') == null) mgmt.buildIndex('byAge', Vertex.class).addKey(age).buildCompositeIndex();
mgmt.commit();
// Wait for the indexes to become available
println 'Awaiting byUserId graph index status...';
ManagementSystem.awaitGraphIndexStatus(graph, 'byUserId')
    .status(SchemaStatus.REGISTERED)
    .timeout(10, ChronoUnit.MINUTES)
    .call();
println 'Awaiting byGender graph index status...';
ManagementSystem.awaitGraphIndexStatus(graph, 'byGender')
    .status(SchemaStatus.REGISTERED)
    .timeout(10, ChronoUnit.MINUTES)
    .call();
println 'Awaiting byAge graph index status...';
ManagementSystem.awaitGraphIndexStatus(graph, 'byAge')
    .status(SchemaStatus.REGISTERED)
    .timeout(10, ChronoUnit.MINUTES)
    .call();
// Reindex the existing data
mgmt = graph.openManagement();
println 'Reindexing data by byUserId index...';
mgmt.updateIndex(mgmt.getGraphIndex('byUserId'), SchemaAction.REINDEX).get();
println 'Reindexing data by byGender index...';
mgmt.updateIndex(mgmt.getGraphIndex('byGender'), SchemaAction.REINDEX).get();
println 'Reindexing data by byAge index...';
mgmt.updateIndex(mgmt.getGraphIndex('byAge'), SchemaAction.REINDEX).get();
mgmt.commit();
// Enable indexes
println 'Enabling byUserId index...'
ManagementSystem.awaitGraphIndexStatus(graph, 'byUserId').status(SchemaStatus.ENABLED).call();
println 'Enabling byGender index...'
ManagementSystem.awaitGraphIndexStatus(graph, 'byGender').status(SchemaStatus.ENABLED).call();
println 'Enabling byAge index...'
ManagementSystem.awaitGraphIndexStatus(graph, 'byAge').status(SchemaStatus.ENABLED).call();
graph.close();
The error I am getting is the following, and it is related to the reindex phase:
08:24:26 ERROR com.thinkaurelius.titan.graphdb.database.management.ManagementLogger - Evicted [2#0ac717511509-mybox] from cache but waiting too long for transactions to close. Stale transaction alert on: [standardtitantx[0x4b8696a4], standardtitantx[0x2d39f30a], standardtitantx[0x0da9172d], standardtitantx[0x7c6c7909], standardtitantx[0x79dd0a38], standardtitantx[0x5999c49e], standardtitantx[0x5aaba4a7]]
08:24:26 ERROR com.thinkaurelius.titan.graphdb.database.management.ManagementLogger - Evicted [3#0ac717511509-mybox] from cache but waiting too long for transactions to close. Stale transaction alert on: [standardtitantx[0x4b8696a4], standardtitantx[0x2d39f30a], standardtitantx[0x0da9172d], standardtitantx[0x7c6c7909], standardtitantx[0x79dd0a38], standardtitantx[0x5999c49e], standardtitantx[0x5aaba4a7]]
08:24:26 ERROR com.thinkaurelius.titan.graphdb.database.management.ManagementLogger - Evicted [4#0ac717511509-mybox] from cache but waiting too long for transactions to close. Stale transaction alert on: [standardtitantx[0x4b8696a4], standardtitantx[0x2d39f30a], standardtitantx[0x0da9172d], standardtitantx[0x7c6c7909], standardtitantx[0x79dd0a38], standardtitantx[0x5999c49e], standardtitantx[0x5aaba4a7]]
Any hints on this would be much appreciated.

The errors you get indicate that you have open transactions when you try to modify the schema. Titan needs to wait for all transactions to complete before it can modify the schema. See the answer from Matthias Broecheler on the mailing list for more information.
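As a side note, you can check for and close stale transactions yourself before touching the schema. A minimal sketch, assuming the script runs against the same StandardTitanGraph instance that owns the transactions (verify getOpenTransactions() against your exact Titan version):
// roll back the console's own transaction first
graph.tx().rollback();
// copy the set before iterating, since rollback() removes entries from it
new ArrayList(graph.getOpenTransactions()).each { tx -> tx.rollback() };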
In general, you should avoid reindexing if possible, as it requires Titan to walk over all vertices to check whether they need to be added to the index being updated. The documentation contains more information about this process.
For your use case, you can simply create all indexes before you load any data. When you then add the data after all indexes are ready, it will simply be added to the indexes as it is written. That way, you should be able to use the indexes immediately.
A minimal example for the schema creation in Groovy (but it should be basically the same in Java):
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.Cardinality;
import org.apache.tinkerpop.gremlin.structure.Vertex;
graph = TitanFactory.open('conf/my-titan.properties')
mgmt = graph.openManagement()
id = mgmt.makePropertyKey('id').dataType(String.class).cardinality(Cardinality.SINGLE).make()
// some other properties that will not be indexed
mgmt.makePropertyKey('isPublic').dataType(Boolean.class).cardinality(Cardinality.SINGLE).make()
mgmt.makePropertyKey('completionPercentage').dataType(Integer.class).cardinality(Cardinality.SINGLE).make()
// I prefer to use vertex labels to differentiate between different 'types' of vertices, but this isn't necessary
user = mgmt.makeVertexLabel('User').make()
mgmt.buildIndex('UserById', Vertex.class).addKey(id).indexOnly(user).buildCompositeIndex()
mgmt.commit()
I removed all the checks for already existing schema elements for simplicity, but you can of course add them again.
After the schema creation, you can add your data just like before.
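To sanity-check that a lookup actually uses the new index, you can profile the traversal. A small sketch ('some-user-id' is a placeholder value):
g = graph.traversal()
// hasLabel('User') is needed here because the index was built with indexOnly(user)
g.V().hasLabel('User').has('id', 'some-user-id').profile()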
A final note about index management: try to always define the property keys that you want to index in the same transaction in which you create the index. Otherwise, Titan cannot know whether there is already data that needs to be added to the new index, which again requires a complete scan of all data. This might require choosing a different name for a property. When you later add, for example, a new vertex label post, you might want to use a new name like postId instead of reusing the property id, to avoid a scan of all existing data.
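For example, if you later introduce posts, a sketch of doing both in one transaction (the postId key, Post label, and PostById index name are hypothetical):
mgmt = graph.openManagement()
// key and index are created in the same management transaction,
// so Titan knows no existing data could need reindexing
postId = mgmt.makePropertyKey('postId').dataType(String.class).cardinality(Cardinality.SINGLE).make()
post = mgmt.makeVertexLabel('Post').make()
mgmt.buildIndex('PostById', Vertex.class).addKey(postId).indexOnly(post).buildCompositeIndex()
mgmt.commit()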

Related

Orientdb NoSQL conditionally execute a query

I'm using the OrientDB REST API and am trying to find a way to check for an edge and create it if it does not exist, using only one POST request. Multiple queries and commands are fine; I just want to minimize the overhead of the back and forth with the server.
I have written a query to check for the edge in OrientDB's built in Tolkien-Arda dataset:
SELECT IF(COUNT(*) > 0, "True", "False") FROM
    (SELECT FROM Begets
     WHERE OUT IN (SELECT FROM Creature WHERE uniquename='rosaBaggins')
     AND IN IN (SELECT FROM Creature WHERE uniquename='adalgrimTook'))
This ugly monstrosity of a query just counts how many edges go from rosaBaggins to adalgrimTook, and returns "True" if there are more than 0 and "False" otherwise.
However, I'm not sure how to take the next step and execute the CREATE EDGE command if true. Help would be appreciated with this, or with writing my insane query more efficiently; I get the feeling that I've done it the hard way.
If you want, you can do that through the Java API; this code checks whether an outgoing edge from rosaBaggins to adalgrimTook exists:
import java.util.ArrayList;
import java.util.List;
import com.orientechnologies.orient.client.remote.OServerAdmin;
import com.orientechnologies.orient.core.sql.OCommandSQL;
import com.tinkerpop.blueprints.Vertex;
import com.tinkerpop.blueprints.impls.orient.OrientGraph;
import com.tinkerpop.blueprints.impls.orient.OrientGraphFactory;
import org.apache.commons.collections.CollectionUtils;

String DB = "<db name>";
String path = "remote:localhost/" + DB;
OServerAdmin serverAdmin;
try
{
    serverAdmin = new OServerAdmin(path).connect("<username>", "<password>");
    if (serverAdmin.existsDatabase())
    {
        OrientGraphFactory factory = new OrientGraphFactory(path);
        OrientGraph g = factory.getTx();
        Iterable<Vertex> result = g.command(new OCommandSQL("SELECT FROM #18:0 WHERE out().uniquename contains 'adalgrimTook'")).execute();
        List<Vertex> list = new ArrayList<Vertex>();
        CollectionUtils.addAll(list, result.iterator());
        if (list.size() == 0)
        {
            System.out.println("Edge doesn't exist, I'm creating it ...");
            g.command(new OCommandSQL("CREATE EDGE connection FROM (SELECT FROM Creature WHERE uniquename = 'rosaBaggins') TO (SELECT FROM Creature WHERE uniquename = 'adalgrimTook')")).execute();
        }
        else
        {
            System.out.println("Edge already exists");
        }
        serverAdmin.close();
    }
}
catch (Exception e)
{
    e.printStackTrace();
}
Hope it helps
Regards
Since it was not mentioned that it has to be Java, I'll provide the pure SQL implementation of this edge upsert:
let $1 = select from user where _id = 'x5mxEBwhMfiLSQHaK';
let $2 = select expand(both('user_story')) from story where _id = '5ab4ddea1908792c6aa06a93';
let $3 = select intersect($1, $2);
if($3.size() > 0) {
return 'already inserted';
}
create edge user_story from (select from user where _id = 'x5mxEBwhMfiLSQHaK') to (select from story where _id = '5ab4ddea1908792c6aa06a93')
return 'just inserted';
I did not use the original Tolkien-Arda dataset, but feel free to fill that code in.
The structure consists of a user and a story written by that user. If they aren't linked yet, an edge (user_story) is created.
Using part of @mitchken's answer, I've figured it out.
LET $1 = SELECT expand(bothE('Begets')) from Creature where uniquename='asdfasdf';\
LET $2 = SELECT expand(bothE('Begets')) FROM Creature WHERE uniquename='adalgrimTook';\
LET $3 = SELECT INTERSECT($1, $2);\
LET $4 = CREATE EDGE Begets FROM (SELECT FROM Creature WHERE uniquename='asdfasdf') TO (SELECT FROM Creature WHERE uniquename='adalgrimTook');\
SELECT IF($3.INTERSECT.size() > 0, 'Already exists', $4)
Sending this script to the server the first time creates a new edge between 'asdfasdf' (a vertex I just created) and 'adalgrimTook', and returns the @rid of the new edge. The second time I send the script to the server, it returns 'Already exists'.
Also important to note (it took me a lot of frustration to figure this out) is that the LET statements will not work in the CLI or the Browse tab of the web GUI, but they work just fine as a POST script.

How to prevent titan generate duplicate records with same property?

I have a Spark app which inserts data into Titan with Goblin, but it inserts duplicate vertices with the same name. The test condition 'if not result:' does not match, even though I am in the same session.
def savePartition(p):
    print('savePartition', p)
    from goblin import element, properties
    class Brand(element.Vertex):
        name = properties.Property(properties.String)
    import asyncio
    loop = asyncio.get_event_loop()
    from goblin.app import Goblin
    app = loop.run_until_complete(Goblin.open(loop))
    app.register(Brand)
    async def go(app):
        session = await app.session()
        for i in p:
            if i['brand']:
                traversal = session.traversal(Brand)
                result = await traversal.has(Brand.name, i['brand']).oneOrNone()
                if not result:  # TODO: remove duplicates
                    print(i)
                    brand = Brand()
                    brand.name = i['brand']
                    session.add(brand)
                    session.flush()
        await app.close()
    loop.run_until_complete(go(app))
rdd.foreachPartition(savePartition)
How can I fix this? Thanks a lot.
I am not sure how this would work with Goblin, but if you want Titan to prevent duplicates based on a vertex property, you can use Titan composite indices and specify that they must be unique. For example, you could do the following:
mgmt = graph.openManagement()
name = mgmt.makePropertyKey('name').dataType(String.class).make()
mgmt.buildIndex('byNameUnique', Vertex.class).addKey(name).unique().buildCompositeIndex()
mgmt.commit()
The above will specify that the name property on vertices must be unique.
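One caveat: on an eventually consistent backend like Cassandra, Titan only enforces uniqueness reliably if the index also acquires locks. A sketch of the same schema with locking enabled (mgmt.setConsistency and ConsistencyModifier.LOCK are the documented Titan APIs for this):
mgmt = graph.openManagement()
name = mgmt.makePropertyKey('name').dataType(String.class).make()
index = mgmt.buildIndex('byNameUnique', Vertex.class).addKey(name).unique().buildCompositeIndex()
// acquire a lock on this index when writing, so concurrent
// transactions cannot both insert the same name
mgmt.setConsistency(index, ConsistencyModifier.LOCK)
mgmt.commit()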

Graph processing increasingly gets slower on titan + dynamoDB (local) as more vertices/edges are added

I am working with Titan 1.0, using the AWS DynamoDB Local implementation as the storage backend on a 16GB machine. My use case involves periodically generating graphs containing vertices and edges on the order of 120K. Every time I generate a new graph in memory, I check the graph stored in the DB and either (i) add vertices/edges that do not exist, or (ii) update properties if they already exist (existence is determined by a 'Label' and a 'Value' attribute). Note that the 'Value' property is indexed. Transactions are committed in batches of 500 vertices.
Problem: I find that this process gets slower each time I process a new graph (the 1st graph finished in 45 mins with an initially empty DB, the 2nd took 2.5 hours, the 3rd 3.5 hours, the 4th 6 hours, the 5th 10 hours, and so on). In fact, when processing a given graph, it is fairly quick at start time but progressively gets slower (initial batches take 2-4 secs, and later this increases to hundreds of seconds for the same batch size of 500 nodes; sometimes I even see 1000-2000 secs for a batch). This is the processing time alone (see the approach below); commits always take 8-10 secs. I configured the JVM heap size to 10G, and I notice that when the app is running it eventually uses all of it.
Question: Is this behavior to be expected? It seems to me something is wrong here (either in my config or my approach?). Any help or suggestions would be greatly appreciated.
Approach:
Starting from the root node of the in-memory graph, I retrieve all child nodes and maintain a queue
For each child node, I check to see if it exists in DB, else create new node, and update some properties
Vertex dbVertex = dbgraph.traversal().V()
    .has(currentVertexInMem.label(), "Value",
        (String) currentVertexInMem.value("Value"))
    .tryNext()
    .orElseGet(() -> createVertex(dbgraph, currentVertexInMem));
if (dbVertex != null) {
    // Update properties
    updateVertexProperties(dbgraph, currentVertexInMem, dbVertex);
}
// Add edge if necessary
if (parentDBVertex != null) {
    GraphTraversal<Vertex, Edge> edgeIt = graph.traversal().V(parentDBVertex).outE()
        .has("EdgeProperty1", eProperty1) // eProperty1 is a String input parameter
        .has("EdgeProperty2", eProperty2); // eProperty2 is a Long input parameter
    boolean doCreateEdge = true;
    Edge e = null;
    while (edgeIt.hasNext()) {
        e = edgeIt.next();
        if (e.inVertex().equals(dbVertex)) {
            doCreateEdge = false;
            break;
        }
    }
    if (doCreateEdge) {
        e = parentDBVertex.addEdge("EdgeLabel", dbVertex, "EdgeProperty1", eProperty1, "EdgeProperty2", eProperty2);
    }
    e = null;
}
...
if ((processedVertexCount.get() % 500 == 0)
        || processedVertexCount.get() == verticesToProcess.get()) {
    graph.tx().commit();
}
Create function:
public static Vertex createVertex(Graph graph, Vertex clientVertex) {
    Vertex newVertex = null;
    switch (clientVertex.label()) {
    case "Label 1":
        newVertex = graph.addVertex(T.label, clientVertex.label(), "Value",
            clientVertex.value("Value"),
            "Property1-1", clientVertex.value("Property1-1"),
            "Property1-2", clientVertex.value("Property1-2"));
        break;
    case "Label 2":
        newVertex = graph.addVertex(T.label, clientVertex.label(), "Value",
            clientVertex.value("Value"), "Property2-1",
            clientVertex.value("Property2-1"),
            "Property2-2", clientVertex.value("Property2-2"));
        break;
    default:
        newVertex = graph.addVertex(T.label, clientVertex.label(), "Value",
            clientVertex.value("Value"));
        break;
    }
    return newVertex;
}
Schema Def: (Showing some of the indexes)
Note:
"EdgeLabel" = Constants.EdgeLabels.Uses
"EdgeProperty1" = Constants.EdgePropertyKeys.EndpointId
"EdgeProperty2" = Constants.EdgePropertyKeys.Timestamp
public void createSchema() {
    // Create schema
    TitanManagement mgmt = dbgraph.openManagement();
    mgmt.set("cache.db-cache", true);
    // Vertex properties
    PropertyKey value = mgmt.getPropertyKey(Constants.VertexPropertyKeys.Value);
    if (value == null) {
        value = mgmt.makePropertyKey(Constants.VertexPropertyKeys.Value).dataType(String.class).make();
        mgmt.buildIndex(Constants.GraphIndexes.ByValue, Vertex.class).addKey(value).buildCompositeIndex(); // INDEX
    }
    PropertyKey shapeSet = mgmt.getPropertyKey(Constants.VertexPropertyKeys.ShapeSet);
    if (shapeSet == null) {
        shapeSet = mgmt.makePropertyKey(Constants.VertexPropertyKeys.ShapeSet).dataType(String.class).cardinality(Cardinality.SET).make();
        mgmt.buildIndex(Constants.GraphIndexes.ByShape, Vertex.class).addKey(shapeSet).buildCompositeIndex();
    }
    ...
    // Edge labels and properties
    EdgeLabel uses = mgmt.getEdgeLabel(Constants.EdgeLabels.Uses);
    if (uses == null) {
        uses = mgmt.makeEdgeLabel(Constants.EdgeLabels.Uses).multiplicity(Multiplicity.MULTI).make();
        PropertyKey timestampE = mgmt.getPropertyKey(Constants.EdgePropertyKeys.Timestamp);
        if (timestampE == null) {
            timestampE = mgmt.makePropertyKey(Constants.EdgePropertyKeys.Timestamp).dataType(Long.class).make();
        }
        PropertyKey endpointIDE = mgmt.getPropertyKey(Constants.EdgePropertyKeys.EndpointId);
        if (endpointIDE == null) {
            endpointIDE = mgmt.makePropertyKey(Constants.EdgePropertyKeys.EndpointId).dataType(String.class).make();
        }
        // Indexes
        mgmt.buildEdgeIndex(uses, Constants.EdgeIndexes.ByEndpointIDAndTimestamp, Direction.BOTH, endpointIDE,
            timestampE);
    }
    mgmt.commit();
}
The behavior you are experiencing is expected. Today, DynamoDB Local is a testing tool built on SQLite. If you need to support high TPS for large and periodic data loads, I recommend you use the DynamoDB service.
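For reference, a minimal sketch of pointing Titan at the real DynamoDB service instead of DynamoDB Local (property names follow the dynamodb-titan-storage-backend README; treat the exact endpoint and class name as assumptions to verify against your version):
graph = TitanFactory.build().
    set('storage.backend', 'com.amazon.titan.diskstorage.dynamodb.DynamoDBStoreManager').
    // assumed region endpoint; DynamoDB Local would typically be http://localhost:4567
    set('storage.dynamodb.client.endpoint', 'https://dynamodb.us-east-1.amazonaws.com').
    open()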

Grails, multiple saving to mongodb throws optimistic locking exception

I have a Grails job which updates the totalSellCount of each product. I have a map, productTotalSellCount, which maps each product's identifier to its total sell count, and I iterate over it to update every product like this:
productTotalSellCount.each { k, v ->
    Product product = Product.findByIdentifier(k)
    product.totalSellCount = v
    product.save(flush: true)
}
I have around 50k products, and this daily scheduled job always fails with an optimistic locking exception. Help!
Instead of using GORM, you should use batch updates. For example:
org.hibernate.StatelessSession session = grails.util.Holders.applicationContext.sessionFactory.openStatelessSession()
org.hibernate.Transaction tx = session.beginTransaction()
groovy.sql.Sql sql = new groovy.sql.Sql(session.connection())
// Create batches of 100 update statements before executing them against the db
sql.withBatch(100, "update product set totalSellCount = :val0 where identifier = :val1") {
    groovy.sql.BatchingStatementWrapper stmt ->
    productTotalSellCount.each { identifier, value ->
        stmt.addBatch(val0: value, val1: identifier)
    }
}
tx.commit()
session.close()
You can also try parallel execution using GPars.
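A sketch of the GPars variant (groovyx.gpars.GParsPool is the standard entry point; the pool size of 8 is an arbitrary assumption, and each thread should do its own batching or session handling):
import groovyx.gpars.GParsPool
GParsPool.withPool(8) {
    // process the map entries in parallel; eachParallel is added by GParsPool
    productTotalSellCount.entrySet().toList().eachParallel { entry ->
        // perform the (batched) update for entry.key / entry.value here,
        // using a separate session or statement per thread
    }
}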

Delete from cassandra Table in Spark

I'm using Spark with Cassandra, and I'm reading some rows from my table in order to delete them using the primary key. This is my code:
val lines = sc.cassandraTable[(String, String, String, String)](CASSANDRA_SCHEMA, table).
    select("a", "b", "c", "d").
    where("d=?", d).cache()
lines.foreach(r => {
    val session: Session = connector.openSession
    val delete = "DELETE FROM " + CASSANDRA_SCHEMA + "." + table + " where channel='" + r._1 + "' and ctid='" + r._2 + "' and cvid='" + r._3 + "';"
    session.execute(delete)
    session.close()
})
But this method creates a session for each row, and it takes a lot of time. So, is it possible to delete my rows using sc.cassandraTable, or is there a better solution than mine?
Thank you
I don't think there's support for delete at the moment in the Cassandra Connector. To amortize the cost of connection setup, the recommended approach is to apply the operation to each partition.
So your code would look like this:
lines.foreachPartition(partition => {
    val session: Session = connector.openSession // once per partition
    partition.foreach { elem =>
        val delete = "DELETE FROM " + CASSANDRA_SCHEMA + "." + table + " where channel='" + elem._1 + "' and ctid='" + elem._2 + "' and cvid='" + elem._3 + "';"
        session.execute(delete)
    }
    session.close()
})
You could also look into using DELETE FROM ... WHERE pk IN (list) and a similar approach to build up the list for each partition. This will be even more performant, but it might break with very large partitions, as the IN list will become correspondingly long. Repartitioning your target RDD before applying this function will help.
You asked the question a long time ago, so you probably found the answer already. :P Just to share, here is what I did in Java. This code works beautifully against my local Cassandra instance. But it does not work against our BETA or PRODUCTION instances, because I suspect there are multiple instances of the Cassandra database there, and the delete only worked against one instance while the data got replicated right back. :(
Please do share if you were able to get it to work against your Cassandra production environment with multiple instances of it running!
public static void deleteFromCassandraTable(Dataset myData, SparkConf sparkConf) {
    CassandraConnector connector = CassandraConnector.apply(sparkConf);
    myData.foreachPartition(partition -> {
        Session session = connector.openSession();
        while (partition.hasNext()) {
            Row row = (Row) partition.next();
            boolean isTested = (boolean) row.get(0);
            String product = (String) row.get(1);
            long reportDateInMillSeconds = ((Timestamp) row.get(2)).getTime();
            String id = (String) row.get(3);
            String deleteMyData = "DELETE FROM test.my_table"
                + " WHERE is_tested=" + isTested
                + " AND product='" + product + "'"
                + " AND report_date=" + reportDateInMillSeconds
                + " AND id=" + id + ";";
            System.out.println("%%% " + deleteMyData);
            ResultSet deleteResult = session.execute(deleteMyData);
            boolean result = deleteResult.wasApplied();
            System.out.println("%%% deleteResult =" + result);
        }
        session.close();
    });
}