Graph processing increasingly gets slower on titan + dynamoDB (local) as more vertices/edges are added - titan

I am working with titan 1.0 using AWS dynamoDB local implementation as storage backend on a 16GB machine. My use case involves generating graphs periodically containing vertices & edges in the order of 120K. Every time I generate a new graph in-memory, I check the graph stored in DB and either (i) add vertices/edges that do not exist, or (ii) update properties if they already exist (existence is determined by 'Label' and a 'Value' attribute). Note that the 'Value' property is indexed. Transactions are committed in batches of 500 vertices.
Problem: I find that this process gets slower each time I process a new graph (1st graph finished in 45 mins with empty db initially, 2nd took 2.5 hours, 3rd in 3.5 hours, 4th in 6 hours, 5th in 10 hours and so on). In fact, when processing a given graph, it is fairly quick at start time but progressively gets slower (initial batches take 2-4 secs and later on it increases to 100s of seconds for same batch size of 500 nodes; I also see sometimes it takes 1000-2000 secs for a batch). This is the processing time alone (see approach below); commit takes between 8-10 secs always. I configured the jvm heap size to 10G, and I notice that when the app is running it is eventually using up all of it.
Question: Is this behavior to be expected? It seems to me something is wrong here (either in my config / approach?). Any help or suggestions would be greatly appreciated.
Approach:
Starting from the root node of the in-memory graph, I retrieve all child nodes and maintain a queue
For each child node, I check to see if it exists in DB, else create new node, and update some properties
Vertex dbVertex = dbgraph.traversal().V()
.has(currentVertexInMem.label(), "Value",
(String) currentVertexInMem.value("Value"))
.tryNext()
.orElseGet(() -> createVertex(dbgraph, currentVertexInMem));
if (dbVertex != null) {
// Update Properties
updateVertexProperties(dbgraph, currentVertexInMem, dbVertex);
}
// Add edge if necessary
if (parentDBVertex != null) {
GraphTraversal<Vertex, Edge> edgeIt = graph.traversal().V(parentDBVertex).outE()
.has("EdgeProperty1", eProperty1) // eProperty1 is String input parameter
.has("EdgeProperty2", eProperty2); // eProperty2 is Long input parameter
Boolean doCreateEdge = true;
Edge e = null;
while (edgeIt.hasNext()) {
e = edgeIt.next();
if (e.inVertex().equals(dbVertex)) {
doCreateEdge = false;
break;
}
if (doCreateEdge) {
e = parentDBVertex.addEdge("EdgeLabel", dbVertex, "EdgeProperty1", eProperty1, "EdgeProperty2", eProperty2);
}
e = null;
it = null;
}
...
if ((processedVertexCount.get() % 500 == 0)
|| processedVertexCount.get() == verticesToProcess.get()) {
graph.tx().commit();
}
Create function:
public static Vertex createVertex(Graph graph, Vertex clientVertex) {
Vertex newVertex = null;
switch (clientVertex.label()) {
case "Label 1":
newVertex = graph.addVertex(T.label, clientVertex.label(), "Value",
clientVertex.value("Value"),
"Property1-1", clientVertex.value("Property1-1"),
"Property1-2", clientVertex.value("Property1-2"));
break;
case "Label 2":
newVertex = graph.addVertex(T.label, clientVertex.label(), "Value",
clientVertex.value("Value"), "Property2-1",
clientVertex.value("Property2-1"),
"Property2-2", clientVertex.value("Property2-2"));
break;
default:
newVertex = graph.addVertex(T.label, clientVertex.label(), "Value",
clientVertex.value("Value"));
break;
}
return newVertex;
}
Schema Def: (Showing some of the indexes)
Note:
"EdgeLabel" = Constants.EdgeLabels.Uses
"EdgeProperty1" = Constants.EdgePropertyKeys.EndpointId
"EdgeProperty2" = Constants.EdgePropertyKeys.Timestamp
public void createSchema() {
// Create Schema
TitanManagement mgmt = dbgraph.openManagement();
mgmt.set("cache.db-cache",true);
// Vertex Properties
PropertyKey value = mgmt.getPropertyKey(Constants.VertexPropertyKeys.Value);
if (value == null) {
value = mgmt.makePropertyKey(Constants.VertexPropertyKeys.Value).dataType(String.class).make();
mgmt.buildIndex(Constants.GraphIndexes.ByValue, Vertex.class).addKey(value).buildCompositeIndex(); // INDEX
}
PropertyKey shapeSet = mgmt.getPropertyKey(Constants.VertexPropertyKeys.ShapeSet);
if (shapeSet == null) {
shapeSet = mgmt.makePropertyKey(Constants.VertexPropertyKeys.ShapeSet).dataType(String.class).cardinality(Cardinality.SET).make();
mgmt.buildIndex(Constants.GraphIndexes.ByShape, Vertex.class).addKey(shapeSet).buildCompositeIndex();
}
...
// Edge Labels and Properties
EdgeLabel uses = mgmt.getEdgeLabel(Constants.EdgeLabels.Uses);
if (uses == null) {
uses = mgmt.makeEdgeLabel(Constants.EdgeLabels.Uses).multiplicity(Multiplicity.MULTI).make();
PropertyKey timestampE = mgmt.getPropertyKey(Constants.EdgePropertyKeys.Timestamp);
if (timestampE == null) {
timestampE = mgmt.makePropertyKey(Constants.EdgePropertyKeys.Timestamp).dataType(Long.class).make();
}
PropertyKey endpointIDE = mgmt.getPropertyKey(Constants.EdgePropertyKeys.EndpointId);
if (endpointIDE == null) {
endpointIDE = mgmt.makePropertyKey(Constants.EdgePropertyKeys.EndpointId).dataType(String.class).make();
}
// Indexes
mgmt.buildEdgeIndex(uses, Constants.EdgeIndexes.ByEndpointIDAndTimestamp, Direction.BOTH, endpointIDE,
timestampE);
}
mgmt.commit();
}

The behavior you experience is expected. Today, DynamoDB Local is a testing tool built on SQLite. If you need to support high TPS for large and periodic data loads, I recommend you use the DynamoDB service.

Related

Spark mapPartitionsToPai execution time

In the current project I am working, we are using spark as computation engine for one of workflows.
Workflow is as follows
We have product catalog being served from several pincodes. User logged in from any particular pin code should be able to see least available cost from all available serving pincodes.
Least cost is calculated as follows
product price+dist(pincode1,pincode2) -
pincode2 being user pincode and pincode1 being source pincode. Apply the above formula for all source pincodes and identify the least available one.
My Core spark logic looks like this
pincodes.javaRDD().cartesian(pincodePrices.javaRDD()).mapPartitionsToPair(new PairFlatMapFunction<Iterator<Tuple2<Row,Row>>, Row, Row>() {
#Override
public Iterator<Tuple2<Row, Row>> call(Iterator<Tuple2<Row, Row>> t)
throws Exception {
MongoClient mongoclient = MongoClients.create("mongodb://localhost");
MongoDatabase database = mongoclient.getDatabase("catalogue");
MongoCollection<Document>pincodeCollection = database.getCollection("pincodedistances");
List<Tuple2<Row,Row>> list =new LinkedList<>();
while (t.hasNext()) {
Tuple2<Row, Row>tuple2 = t.next();
Row pinRow = tuple2._1;
Integer srcPincode = pinRow.getAs("pincode");
Row pricesRow = tuple2._2;
Row pricesRow1 = (Row)pricesRow.getAs("leastPrice");
Integer buyingPrice = pricesRow1.getAs("buyingPrice");
Integer quantity = pricesRow1.getAs("quantity");
Integer destPincode = pricesRow1.getAs("pincodeNum");
if(buyingPrice!=null && quantity>0) {
BasicDBObject dbObject = new BasicDBObject();
dbObject.append("sourcePincode", srcPincode);
dbObject.append("destPincode", destPincode);
//System.out.println(srcPincode+","+destPincode);
Number distance;
if(srcPincode.intValue()==destPincode.intValue()) {
distance = 0;
}else {
Document document = pincodeCollection.find(dbObject).first();
distance = document.get("distance", Number.class);
}
double margin = 0.02;
Long finalPrice = Math.round(buyingPrice+(margin*buyingPrice)+distance.doubleValue());
//Row finalPriceRow = RowFactory.create(finalPrice,quantity);
StructType structType = new StructType();
structType = structType.add("finalPrice", DataTypes.LongType, false);
structType = structType.add("quantity", DataTypes.LongType, false);
Object values[] = {finalPrice,quantity};
Row finalPriceRow = new GenericRowWithSchema(values, structType);
list.add(new Tuple2<Row, Row>(pinRow, finalPriceRow));
}
}
mongoclient.close();
return list.iterator();
}
}).reduceByKey((priceRow1,priceRow2)->{
Long finalPrice1 = priceRow1.getAs("finalPrice");
Long finalPrice2 = priceRow2.getAs("finalPrice");
if(finalPrice1.longValue()<finalPrice2.longValue())return priceRow1;
return priceRow2;
}).collect().forEach(tuple2->{
// Business logic to push computed price to mongodb
}
I am able to get the answer correctly, however mapPartitionsToPair is taking a bit of time(~22 secs for just 12k records).
After browsing internet I found that mapPartitions performs better than mapPartitionsToPair, but I am not sure how to emit (key,value) from mapPartitions and then sort it.
Is there any alternative for above transformations or any better approach is highly appreciated.
Spark Cluster: Standalone(1 executor, 6 cores)

Entity Framework is too slow during mapping data up to 100k

I have min 100 000 data into a Job_Details table and I'm using Entity Framework to map the data.
This is the code:
public GetJobsResponse GetImportJobs()
{
GetJobsResponse getJobResponse = new GetJobsResponse();
List<JobBO> lstJobs = new List<JobBO>();
using (NSEXIM_V2Entities dbContext = new NSEXIM_V2Entities())
{
var lstJob = dbContext.Job_Details.ToList();
foreach (var dbJob in lstJob.Where(ie => ie.IMP_EXP == "I" && ie.Job_No != null))
{
JobBO job = MapBEJobforSearchObj(dbJob);
lstJobs.Add(job);
}
}
getJobResponse.Jobs = lstJobs;
return getJobResponse;
}
I found to this line is taking about 2-3 min to execute
var lstJob = dbContext.Job_Details.ToList();
How can i solve this issue?
To outline the performance issues with your example: (see inline comments)
public GetJobsResponse GetImportJobs()
{
GetJobsResponse getJobResponse = new GetJobsResponse();
List<JobBO> lstJobs = new List<JobBO>();
using (NSEXIM_V2Entities dbContext = new NSEXIM_V2Entities())
{
// Loads *ALL* entities into memory. This effectively takes all fields for all rows across from the database to your app server. (Even though you don't want it all)
var lstJob = dbContext.Job_Details.ToList();
// Filters from the data in memory.
foreach (var dbJob in lstJob.Where(ie => ie.IMP_EXP == "I" && ie.Job_No != null))
{
// Maps the entity to a DTO and adds it to the return collection.
JobBO job = MapBEJobforSearchObj(dbJob);
lstJobs.Add(job);
}
}
// Returns the DTOs.
getJobResponse.Jobs = lstJobs;
return getJobResponse;
}
First: pass your WHERE clause to EF to pass to the DB server rather than loading all entities into memory..
public GetJobsResponse GetImportJobs()
{
GetJobsResponse getJobResponse = new GetJobsResponse();
using (NSEXIM_V2Entities dbContext = new NSEXIM_V2Entities())
{
// Will pass the where expression to be DB server to be executed. Note: No .ToList() yet to leave this as IQueryable.
var jobs = dbContext.Job_Details..Where(ie => ie.IMP_EXP == "I" && ie.Job_No != null));
Next, use SELECT to load your DTOs. Typically these won't contain as much data as the main entity, and so long as you're working with IQueryable you can load related data as needed. Again this will be sent to the DB Server so you cannot use functions like "MapBEJobForSearchObj" here because the DB server does not know this function. You can SELECT a simple DTO object, or an anonymous type to pass to a dynamic mapper.
var dtos = jobs.Select(ie => new JobBO
{
JobId = ie.JobId,
// ... populate remaining DTO fields here.
}).ToList();
getJobResponse.Jobs = dtos;
return getJobResponse;
}
Moving the .ToList() to the end will materialize the data into your JobBO DTOs/ViewModels, pulling just enough data from the server to populate the desired rows and with the desired fields.
In cases where you may have a large amount of data, you should also consider supporting server-side pagination where you pass a page # and page size, then utilize a .Skip() + .Take() to load a single page of entries at a time.

Load the graph from the titan db for a specified depth in a single or efficitent query

We are using titan db to store the graph infomation. we have cassandra + es as backend storage and index. We are trying to load the graph data to represent the graph in the webui.
This is the approach i am following.
public JSONObject getGraph(long vertexId, final int depth) throws Exception {
JSONObject json = new JSONObject();
JSONArray vertices = new JSONArray();
JSONArray edges = new JSONArray();
final int currentDepth = 0;
TitanGraph graph = GraphFactory.getInstance().getGraph();
TitanTransaction tx = graph.newTransaction();
try {
GraphTraversalSource g = tx.traversal();
Vertex parent = graphDao.getVertex(vertexId);
loadGraph(g, parent, currentDepth + 1, depth, vertices, edges);
json.put("vertices", vertices);
json.put("edges", edges);
return json;
} catch (Throwable e) {
log.error(e.getMessage(), e);
if (tx != null) {
tx.rollback();
}
throw new Exception(e.getMessage(), e);
} finally {
if (tx != null) {
tx.close();
}
}
}
private void loadGraph(final GraphTraversalSource g, final Vertex vertex, final int currentDepth,
final int maxDepth, final JSONArray vertices, final JSONArray edges) throws Exception {
vertices.add(toJSONvertex));
List<Edge> edgeList = g.V(vertex.id()).outE().toList();
if (edgeList == null || edgeList.size() <= 0) {
return;
}
for (Edge edge : edgeList) {
Vertex child = edge.inVertex();
edges.add(Schema.toJSON(vertex, edge, child));
if (currentDepth < maxDepth) {
loadGraph(g, child, currentDepth + 1, maxDepth, vertices, edges);
}
}
}
But this is taking a bit lot of time for the depth 3 when we have more number of nodes exist in the tree it is taking about 1 min to load the data.
Please help me are there any better mechanisms to load the graph efficiently?
You might see better performance performing your full query across a single traversal execution - e.g., g.V(vertex.id()).outV().outV().outE() for depth 3 - but any vertices with very high edge cardinality are going to make this query slow no matter how you execute it.
To add to the answer by #Benjamin performing one traversal as opposed to many little ones which are constantly expanding will indeed be faster. Titan uses lazy loading so you should take advantage of that.
The next thing I would recommend is to also multithread each of your traversals and writes. Titan actually supports simultaneous writes very nicely. You can achieve this using Transactions.

How to compare two data rows in one data set in BIRT

I'm new to BIRT and need an answer to the following question:
How to compare two data rows in one data set in BIRT and then print it out to the document?
I am assuming you have a reason for not using a self-join query to bring in the data. One simple thing you could do is have 2 identical datasets and then create a new joint dataset using the 2.
With an Oracle DB, you could easily achieve this with pure SQL using the "Analytic Function" LAG (see the Oracle documentation for details).
Independent from the DB, with BIRT, you could use a variable last_row:
Create some computed columns to keep the results of your comparisons. e.g. "FIRST_COLUMN_CHANGED" as boolean.
afterOpen event:
last_row = null;
onFetch event (pls note I'm not sure wether the actual data columns start at 0 or 1):
if (last_row != null) {
if (last_row[0] == row[0]) {
row["FIRST_COLUMN_CHANGED"] = false;
} else {
row["FIRST_COLUMN_CHANGED"] = true;
}
} else {
// do computations for the first record.
row["FIRST_COLUMN_CHANGED"] = true;
}
// Copy the current row to last_row
last_row = {};
// modify depending on the number of columns
for (var i=0; i<10; i++) {
last_row[i] = row[i];
}

Is there an Update Object holder on Entity Framework?

I'm currently inserting/updating fields like this (if there's a better way, please say so - we're always learning)
public void UpdateChallengeAnswers(List<ChallengeAnswerInfo> model, Decimal field_id, Decimal loggedUserId)
{
JK_ChallengeAnswers o;
foreach (ChallengeAnswerInfo a in model)
{
o = this.FindChallengeAnswerById(a.ChallengeAnswerId);
if (o == null) o = new JK_ChallengeAnswers();
o.answer = FilterString(a.Answer);
o.correct = a.Correct;
o.link_text = "";
o.link_url = "";
o.position = FilterInt(a.Position);
o.updated_user = loggedUserId;
o.updated_date = DateTime.UtcNow;
if (o.challenge_id == 0)
{
// New record
o.challenge_id = field_id; // FK
o.created_user = loggedUserId;
o.created_date = DateTime.UtcNow;
db.JK_ChallengeAnswers.AddObject(o);
}
else
{
// Update record
this.Save();
}
}
this.Save(); // Commit changes
}
As you can see there is 2 times this.Save() (witch invokes db.SaveChanges();)
when Adding we place the new object into a Place Holder with the AddObject method, in other words, the new object is not committed right away and we can place as many objects we want.
But when it's an update, I need to Save first before moving on to the next object, is there a method that I can use in order to, let's say:
if (o.challenge_id == 0)
{
// New record
o.challenge_id = field_id;
o.created_user = loggedUserId;
o.created_date = DateTime.UtcNow;
db.JK_ChallengeAnswers.AddObject(o);
}
else
{
// Update record
db.JK_ChallengeAnswers.RetainObject(o);
}
}
this.Save(); // Only save once when all objects are ready to commit
}
So if there are 5 updates, I don't need to save into the database 5 times, but only once at the end.
Thank you.
Well if you have an object which is attached to the graph, if you modify values of this object, then the entity is marked as Modified.
If you simply do .AddObject, then the entity is marked as Added.
Nothing has happened yet - only staging of the graph.
Then, when you execute SaveChanges(), EF will translate the entries in the OSM to relevant store queries.
Your code looks a bit strange. Have you debugged through (and ran a SQL trace) to see what is actually getting executed? Because i can't see why you need that first .Save, because inline with my above points, since your modifying the entities in the first few lines of the method, an UPDATE statement will most likely always get executed, regardless of the ID.
I suggest you refactor your code to handle new/modified in seperate method. (ideally via a Repository)
Taken from Employee Info Starter Kit, you can consider the code snippet as below:
public void UpdateEmployee(Employee updatedEmployee)
{
//attaching and making ready for parsistance
if (updatedEmployee.EntityState == EntityState.Detached)
_DatabaseContext.Employees.Attach(updatedEmployee);
_DatabaseContext.ObjectStateManager.ChangeObjectState(updatedEmployee, System.Data.EntityState.Modified);
_DatabaseContext.SaveChanges();
}