Speeding up Titan graph traversals

I have been playing around with a very small graph of 264,346 vertices and 733,846 edges. I bulk imported it into Titan using BatchGraph and the recommended settings. This worked out fine - it is surprisingly efficient.
I am now playing around with graph traversals and have implemented a quick Dijkstra's algorithm in Java using the Java API. I should mention that I have a local Cassandra node that my Titan database runs on - it is not the embedded one.
An average Dijkstra run takes over a minute for an average path in this graph, which is extremely long. I would expect even a very slow Dijkstra run on such a graph to take under 10 seconds (with an in-memory graph, average queries on a graph this size would take well under 1 second).
What are the best practices for running such algorithms over Titan efficiently?
I will give parts of the simple Dijkstra implementation in case the way I am accessing vertices and edges is not the most efficient way.
Getting the graph instance:
TitanGraph graph = TitanFactory.open("cassandra:localhost");
Parts of the Dijkstra implementation (specifically involving graph access):
public int run(long src, long trg) {
    this.pqueue.add(new PQNode(0, src));
    this.nodeMap.put(src, new NodeInfo(src, 0, -1));
    int dist = Integer.MAX_VALUE;

    while (!this.pqueue.isEmpty()) {
        PQNode current = this.pqueue.poll();
        NodeInfo nodeInfo = this.nodeMap.get(current.getNodeId());
        long u = nodeInfo.getNodeId();
        if (u == trg) {
            dist = nodeInfo.getDistance();
            break;
        }
        if (nodeInfo.isSeen())
            continue;
        this.expansion++;

        TitanVertex vertex = graph.getVertex(u);
        for (TitanEdge out : vertex.getEdges()) {
            Direction dir = out.getDirection(vertex);
            if (dir.compareTo(Direction.OUT) != 0 && dir.compareTo(Direction.BOTH) != 0) {
                continue;
            }
            TitanVertex v = out.getOtherVertex(vertex);
            long vId = (long) v.getId();
            NodeInfo vInfo = this.nodeMap.get(vId);
            if (vInfo == null) {
                vInfo = new NodeInfo(vId, Integer.MAX_VALUE, -1);
                this.nodeMap.put(vId, vInfo);
            }
            int weight = out.getProperty("weight");
            int currentDistance = nodeInfo.getDistance() + weight;
            if (currentDistance < vInfo.getDistance()) {
                vInfo.setParent(u);
                vInfo.setDistance(currentDistance);
                this.pqueue.add(new PQNode(currentDistance, vId));
            }
        }
        nodeInfo.setSeen(true);
    }
    return dist;
}
How should I proceed trying to execute such algorithms efficiently over Titan?

Forgetting about your code/algorithm for a moment, I have to agree that, as you say, your graph is "very small". That raises the question of why you chose Cassandra as your backend. For a graph of that size, I would try out BerkeleyDB to see if that helps. Or, if you want to go really extreme, just use the fastest Blueprints implementation there is: TinkerGraph!
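For illustration, a minimal sketch of both options, assuming a Titan 0.x setup with the Blueprints 2 API (the exact configuration keys can vary between Titan versions):
import org.apache.commons.configuration.BaseConfiguration;
import org.apache.commons.configuration.Configuration;
import com.thinkaurelius.titan.core.TitanFactory;
import com.thinkaurelius.titan.core.TitanGraph;
import com.tinkerpop.blueprints.impls.tg.TinkerGraph;

public class BackendExamples {
    public static void main(String[] args) {
        // Titan backed by BerkeleyDB JE on local disk - no Cassandra round-trips
        Configuration conf = new BaseConfiguration();
        conf.setProperty("storage.backend", "berkeleyje");
        conf.setProperty("storage.directory", "/tmp/titan-berkeleyje");
        TitanGraph bdbGraph = TitanFactory.open(conf);

        // Or skip Titan entirely and use an in-memory Blueprints graph
        TinkerGraph inMemory = new TinkerGraph();

        bdbGraph.shutdown();
        inMemory.shutdown();
    }
}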
As for the code sample itself, one suggestion might be to stop iterating all edges on a vertex if you don't need to:
for (TitanEdge out : vertex.getEdges()) {
    Direction dir = out.getDirection(vertex);
    if (dir.compareTo(Direction.OUT) != 0 && dir.compareTo(Direction.BOTH) != 0) {
        continue;
    }
Instead, if you just want outgoing edges (which is what your direction filter keeps), tell Titan that: vertex.getEdges(Direction.OUT). That should give you some extra efficiency there - I'd even do that if you decided to just load into TinkerGraph. It also eliminates the need for that getOtherVertex line, because for an outgoing edge you know the neighbour is just out.getVertex(Direction.IN), which I'd guess to be faster.
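A minimal sketch of that tightened loop, written against the generic Blueprints types (the Titan-specific types should behave the same way; the distance bookkeeping from the original run() method is left out):
import com.tinkerpop.blueprints.Direction;
import com.tinkerpop.blueprints.Edge;
import com.tinkerpop.blueprints.Vertex;

// Relax all outgoing neighbours of `vertex`; nodeMap/pqueue updates omitted.
void expand(Vertex vertex) {
    // ask the graph only for outgoing edges instead of filtering client-side
    for (Edge out : vertex.getEdges(Direction.OUT)) {
        // for an outgoing edge, the neighbour is the edge's IN (head) vertex
        Vertex v = out.getVertex(Direction.IN);
        int weight = out.getProperty("weight");
        // ... update nodeMap / pqueue for v exactly as in the original run() ...
    }
}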

Related

Looking for advice on improving a custom function in AnyLogic

I'm estimating last mile delivery costs in a large urban network using by-route distances. I have over 8000 customer agents and over 100 retail store agents plotted in a GIS map using lat/long coordinates. Each customer receives deliveries from its nearest store (by route). The goal is to get two distance measures in this network for each store:
d0_bar: the average distance from a store to all of its assigned customers
d1_bar: the average distance between all customers common to a single store
I've written a startup function with a simple foreach loop to assign each customer to a store based on by-route distance (customers have a parameter, "customer.pStore" of Store type). This function also adds, in turn, each customer to the store agent's collection of customers ("store.colCusts"; it's an array list with Customer type elements).
Next, I have a function that iterates through the store agent population and calculates the two average distance measures above (d0_bar & d1_bar) and writes the results to a txt file (see code below). The code works, fortunately. However, the problem is that with such a massive dataset, the process of iterating through all customers/stores and retrieving distances via the openstreetmap.org API takes forever. It's been initializing ("Please wait...") for about 12 hours. What can I do to make this code more efficient? Or, is there a better way in AnyLogic of getting these two distance measures for each store in my network?
Thanks in advance.
//for each store, record all customers assigned to it
for (Store store : stores)
{
    distancesStore.print(store.storeCode + "," + store.colCusts.size() + ","
            + store.colCusts.size() * (store.colCusts.size() - 1) / 2 + ",");

    //calculates average distance from store j to customer nodes that belong to store j
    double sumFirstDistByStore = 0.0;
    int h = 0;
    while (h < store.colCusts.size())
    {
        sumFirstDistByStore += store.distanceByRoute(store.colCusts.get(h));
        h++;
    }
    distancesStore.print((sumFirstDistByStore / store.colCusts.size()) / 1609.34 + ",");

    //calculates average of distances between all customer nodes belonging to store j
    double custDistSumPerStore = 0.0;
    int loopLimit = store.colCusts.size();
    int i = 0;
    while (i < loopLimit - 1)
    {
        int j = 1;
        while (j < loopLimit)
        {
            custDistSumPerStore += store.colCusts.get(i).distanceByRoute(store.colCusts.get(j));
            j++;
        }
        i++;
    }
    distancesStore.print((custDistSumPerStore / (loopLimit * (loopLimit - 1) / 2)) / 1609.34);
    distancesStore.println();
}
Firstly a few simple comments:
Have you tried timing a single distanceByRoute call? E.g. can you try running store.distanceByRoute(store.colCusts.get(0)); just to see how long a single call takes on your system. Routing is generally pretty slow, but it would be good to know what the speed limit is.
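For example, something like this pasted into the same function would answer that - it only uses constructs already in your code, plus AnyLogic's traceln, and assumes the first store has at least one assigned customer:
for (Store store : stores) {
    long t0 = System.nanoTime();
    double d = store.distanceByRoute(store.colCusts.get(0));
    traceln("one distanceByRoute call: " + (System.nanoTime() - t0) / 1e6 + " ms, distance = " + d);
    break; // timing one call on one store is enough for a rough baseline
}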
The first simple change is to use java parallelism. Instead of using this:
for (Store store : stores)
{ ...
use this:
stores.parallelStream().forEach(store -> {
...
});
This will process the store entries in parallel using the standard Java Streams API.
It also looks like the second loop - where the average distance between customers is calculated - doesn't take mirroring into account. That is to say, the distance a->b is equal to b->a. Hence, for example, 4 customers require only 6 calculations: 1->2, 1->3, 1->4, 2->3, 2->4, 3->4. In contrast, with 4 customers your second while loop performs 9 calculations: i=0, j in {1,2,3}; i=1, j in {1,2,3}; i=2, j in {1,2,3}, which seems wrong unless I am misunderstanding your intention. A sketch of the corrected loop is below.
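A sketch of the de-duplicated inner loop (same names as in your function), so that each unordered pair is measured exactly once:
double custDistSumPerStore = 0.0;
int loopLimit = store.colCusts.size();
for (int i = 0; i < loopLimit - 1; i++) {
    // start j at i + 1: skips i == j and counts each pair only once
    for (int j = i + 1; j < loopLimit; j++) {
        custDistSumPerStore += store.colCusts.get(i).distanceByRoute(store.colCusts.get(j));
    }
}
// loopLimit * (loopLimit - 1) / 2 unordered pairs, matching the divisor you already use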
Generally, for long running operations it is a good idea to include some traceln to show progress with associated timing.
Please have a look at the above and post your results. With more information, additional performance improvements may be possible.

How to compute connected components in Spark when the data is too large

When computing connected components on big data, I find it very difficult to merge them in Spark.
The data structure in my research can be simplified to RDD[Array[Int]]. For example:
RDD[Array(1,2,3), Array(1,4), Array(5,6), Array(5,6,7,8), Array(9), Array(1)]
The objective is to merge two Arrays if they have a non-empty intersection, ending up with arrays that share no elements. Therefore, after merging, it should be:
RDD[Array(1,2,3,4), Array(5,6,7,8), Array(9)]
The problem is similar to connected components in the Pregel framework for graph algorithms. One solution is to first find the edge connections between Arrays using a cartesian product and then merge them. However, in my case there are 300K Arrays with a total size of 1 GB, so the time and memory complexity would be roughly 300K*300K. When I run the program in Spark on my Mac Pro, it gets completely stuck.
Thanks
Here is my solution. It might not be elegant, but it works for small data. Whether it can be applied to large data needs further proof.
import org.apache.spark.rdd.RDD

def mergeCanopy(canopies: RDD[Array[Int]]): Array[Array[Int]] = {
    // try to merge two canopies at a time into a set of disjoint canopies
    def mergeOrAppend(disjoint: Set[Array[Int]], cluster: Array[Int]): Set[Array[Int]] = {
        var disjoints = disjoint
        for (clus <- disjoint) {
            if ((clus.toSet & cluster.toSet).nonEmpty) {
                disjoints += (clus.toSet ++ cluster.toSet).toArray
                disjoints -= clus
                return disjoints
            }
        }
        disjoints += cluster
        return disjoints
    }

    val s = Set[Array[Int]]()
    val c = canopies.aggregate(s)(mergeOrAppend, _ ++ _)
    c.toArray
}
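For what it's worth, the merging step itself is just a union-find over the element values. Below is a hedged, non-distributed Java sketch of that idea - it assumes the arrays have been collected to the driver, so it only illustrates the logic rather than a Spark solution:
import java.util.*;

/** Merges int arrays that share at least one element (union-find over element values). */
public class MergeOverlapping {

    private final Map<Integer, Integer> parent = new HashMap<>();

    // find the root of x, registering x if unseen, with path compression
    private int find(int x) {
        parent.putIfAbsent(x, x);
        int root = x;
        while (parent.get(root) != root) root = parent.get(root);
        while (parent.get(x) != root) { int next = parent.get(x); parent.put(x, root); x = next; }
        return root;
    }

    private void union(int a, int b) {
        parent.put(find(a), find(b));
    }

    public Collection<Set<Integer>> merge(List<int[]> arrays) {
        for (int[] arr : arrays) {
            if (arr.length > 0) find(arr[0]);               // register singletons such as Array(9)
            for (int i = 1; i < arr.length; i++) union(arr[0], arr[i]);
        }
        // group every seen element under its root
        Map<Integer, Set<Integer>> groups = new HashMap<>();
        for (Integer x : parent.keySet()) {
            groups.computeIfAbsent(find(x), k -> new TreeSet<>()).add(x);
        }
        return groups.values();
    }

    public static void main(String[] args) {
        List<int[]> data = Arrays.asList(
                new int[]{1, 2, 3}, new int[]{1, 4}, new int[]{5, 6},
                new int[]{5, 6, 7, 8}, new int[]{9}, new int[]{1});
        System.out.println(new MergeOverlapping().merge(data));
        // prints the groups {1, 2, 3, 4}, {5, 6, 7, 8} and {9} (group order may vary)
    }
}
For data that genuinely does not fit on one machine, building an edge list from each array and running GraphX's connectedComponents() is the more common distributed route.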

How to use MapReduce for k-Means Spatial Clustering

I'm new to MongoDB and map-reduce and want to evaluate spatial data using k-means spatial clustering. I found this article, which seems to be a good description of the algorithm, but I have no clue how to translate it into a mongo shell script. Assume my data looks like:
{
    _id: ObjectID(),
    loc: { x: <longitude>, y: <latitude> },
    user: <userid>
}
And I can use k = sqrt(n/2), where n is the number of samples.
I can use aggregates to get the bounding extents of the data and the count, etc.
I got a bit lost at the reference to a file of the cluster points, which I assume would just be another collection, and I have no idea how to do the iteration, or whether that would be done in the client or in the database.
OK, I have made a little progress on this, in that I have generated an array of initial random points that I need to compute the sum of squared distances against during the map-reduce phase, but I do not know how to pass these to the map function. I took a stab at writing the map function:
var mapCluster = function() {
    var key = -1;
    var sos = 0;
    var pos;
    for (var i = 0; i < pts.length; i++) {
        var dx = pts[i][0] - this.loc.x;
        var dy = pts[i][1] - this.loc.y;
        var sumOfSquares = dx*dx + dy*dy;
        if (i == 0 || sumOfSquares < sos) {
            key = i;
            sos = sumOfSquares;
            pos = this.loc;
        }
    }
    emit(key, pos);
};
In this case the cluster points look like the following, which probably will not work:
var pts = [ [x,y], [x1,y1], ... ];
So for each map-reduce iteration, we compare all the collection points against this array and emit the index of the cluster point we are closest to, along with the location of the collection point. Then, in the reduce function, the average of the points associated with each index would be used to create the new cluster point location. Finally, in the finalize function, I can update the cluster document.
I assume I could do a findOne() on the cluster document to load the cluster points in the map function, but do we want to load this document on every call to map? Or is there a way to load it once per iteration?
So it looks like you can do the above using the scope variable like this:
db.main.mapReduce( mapCluster, mapReduce, { scope: { pnts: pnts, ... }} );
You have to be careful about variable names in the scope: since these are placed into the scope of the map, reduce, and finalize functions, they can collide with existing variable names.
What have you tried?
Note that you will need more than one round of mappers.
With the canonical approach of running k-means on MR, you need one mapper/reducer per iteration.
So, can you try to write the map and reduce steps of a single iteration only?
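To make the per-iteration structure concrete, here is a hedged sketch in plain Java (not MongoDB) of what a single round computes: assign every point to its nearest cluster point (the map step), then recompute each cluster point as the mean of its assigned points (the reduce step). The arrays and the driver loop are made up for illustration:
import java.util.Arrays;

public class KMeansIteration {

    /** One k-means round: assign points to the nearest centroid, then recompute centroids. */
    static double[][] iterate(double[][] points, double[][] centroids) {
        int k = centroids.length;
        double[][] sums = new double[k][2];
        int[] counts = new int[k];

        // "map" step: find the nearest centroid for every point
        for (double[] p : points) {
            int best = 0;
            double bestDist = Double.MAX_VALUE;
            for (int i = 0; i < k; i++) {
                double dx = centroids[i][0] - p[0];
                double dy = centroids[i][1] - p[1];
                double d = dx * dx + dy * dy;   // squared distance is enough for comparison
                if (d < bestDist) { bestDist = d; best = i; }
            }
            sums[best][0] += p[0];
            sums[best][1] += p[1];
            counts[best]++;
        }

        // "reduce" step: each new centroid is the mean of the points assigned to it
        double[][] next = new double[k][2];
        for (int i = 0; i < k; i++) {
            if (counts[i] == 0) { next[i] = centroids[i]; continue; }  // keep empty clusters in place
            next[i][0] = sums[i][0] / counts[i];
            next[i][1] = sums[i][1] / counts[i];
        }
        return next;
    }

    public static void main(String[] args) {
        double[][] pts = { {0, 0}, {0, 1}, {10, 10}, {10, 11} };
        double[][] centroids = { {0, 5}, {10, 5} };
        for (int round = 0; round < 5; round++) {   // one mapReduce call per round in the MongoDB setup
            centroids = iterate(pts, centroids);
        }
        System.out.println(Arrays.deepToString(centroids));
    }
}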

Beat Detection on iPhone with WAV files and OpenAL

Using this website, I have tried to make a beat detection engine: http://www.gamedev.net/reference/articles/article1952.asp
{
    ALfloat energy = 0;
    ALfloat aEnergy = 0;
    ALint beats = 0;
    bool init = false;
    ALfloat Ei[42];
    ALfloat V = 0;
    ALfloat C = 0;

    ALshort *hold;
    hold = new ALshort[[myDat length]/2];
    [myDat getBytes:hold length:[myDat length]];

    ALuint uiNumSamples;
    uiNumSamples = [myDat length]/4;

    if(alDatal == NULL)
        alDatal = (ALshort *) malloc(uiNumSamples*2);
    if(alDatar == NULL)
        alDatar = (ALshort *) malloc(uiNumSamples*2);

    for (int i = 0; i < uiNumSamples; i++)
    {
        alDatal[i] = hold[i*2];
        alDatar[i] = hold[i*2+1];
    }

    energy = 0;

    for(int start = 0; start<(22050*10); start+=512){
        for(int i = start; i<(start+512); i++){
            energy+= ((alDatal[i]*alDatal[i]) + (alDatal[i]*alDatar[i]));
        }

        aEnergy = 0;
        for(int i = 41; i>=0; i--){
            if(i ==0){
                Ei[0] = energy;
            }
            else {
                Ei[i] = Ei[i-1];
            }
            if(start >= 21504){
                aEnergy+=Ei[i];
            }
        }
        aEnergy = aEnergy/43.f;

        if (start >= 21504) {
            for(int i = 0; i<42; i++){
                V += (Ei[i]-aEnergy);
            }
            V = V/43.f;

            C = (-0.0025714*V)+1.5142857;
            init = true;

            if(energy >(C*aEnergy)) beats++;
        }
    }
}
alDatal and alDatar are of type (short *).
myDat is an NSData object that holds the actual audio data of a WAV file at
22050 Hz, 16-bit stereo.
This doesn't seem to work correctly. If anyone could help me out that would be amazing. I've been stuck on this for 3 days.
The desired result is that after the 10 seconds' worth of data has been processed, I should be able to multiply the beat count by 6 and get an estimated beats per minute.
My current result is 389 beats every 10 seconds, i.e. 2334 BPM, and I know the song is right around 120 BPM.
That code really has been smacked about with the ugly stick. If you're going to ask other people to find your bugs for you, it's a good idea to make things presentable first. Strangely enough, this will often help you to find them for yourself too.
So, before I point out some of the more fundamental errors, I have to make a few schoolmarmly suggestions:
Don't sprinkle your code with magic numbers. Is it really that hard to type a few lines like const ALuint SAMPLE_RATE = 22050? Trust me, it makes life a lot easier.
Use variable names that you aren't going to mix up easily. One of your bugs is a substitution of alDatal for alDatar. That probably wouldn't have happened if they were called left and right. Similarly, what is the point of having a meaningful variable name like energy if you're just going to stick it alongside the meaningless but more or less identical aEnergy? Why not something informative like average?
Declare variables close to where you're going to use them and in the appropriate scope. Another of your bugs is that you don't reset your calculated energy sum when you move your averaging window, so the energy will just add up and up. But you don't need the energy outside that loop, and if you declared it inside the problem couldn't happen.
There are some other things I personally find a little irksome, like the random bracing and indentation, and mixing of C and C++ allocations, and odd inconsistent scraps of Hungarian prefixing, but at least some of those may be more a matter of taste so I won't go on.
Anyway, here are some reasons why your code doesn't work:
First up, look at the right hand side of this line:
energy+= ((alDatal[i]*alDatal[i]) + (alDatal[i]*alDatar[i]));
You want the square of each channel value, so it should really say:
energy+= ((alDatal[i]*alDatal[i]) + (alDatar[i]*alDatar[i]));
Spot the difference? Not easy with those names, is it?
Second, you should be computing the total energy over each window of samples, but you're only setting energy = 0 outside the outer loop. So the sum accumulates, and consequently the current window energy will always be the biggest you've ever encountered.
Third, your variance calculation is wrong. You have:
V += (Ei[i]-aEnergy);
But it should be the sum of the squares of the differences from the mean:
V += (Ei[i] - aEnergy) * (Ei[i] - aEnergy);
There may well be other errors as well. For instance, you don't allocate the data buffers if they're not NULL, but assume that they're the right length -- which you've only just calculated. You may justify that in terms of some consistent usage you've stuck to throughout your code, but from the perspective of what we can see here it looks like a pretty bad idea.
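Putting those fixes together, here is a hedged Java sketch of the same window / history / variance logic. It assumes `interleaved` already holds the 16-bit stereo samples, uses a consistent history length of 43 windows (the original mixes 42 and 43), and simply copies the threshold constants from the code above:
/** Counts beats in interleaved 16-bit stereo PCM, mirroring the code above with the three fixes applied. */
static int countBeats(short[] interleaved) {
    final int WINDOW = 512;   // samples per analysis window
    final int HISTORY = 43;   // roughly one second of windows at 22050 Hz
    double[] history = new double[HISTORY];
    int windowsSeen = 0;
    int beats = 0;

    for (int start = 0; start + 2 * WINDOW <= interleaved.length; start += 2 * WINDOW) {
        // fix 2: reset the energy for every window
        double energy = 0;
        for (int i = 0; i < WINDOW; i++) {
            double left = interleaved[start + 2 * i];
            double right = interleaved[start + 2 * i + 1];
            // fix 1: square each channel
            energy += left * left + right * right;
        }

        if (windowsSeen >= HISTORY) {
            double average = 0;
            for (double e : history) average += e;
            average /= HISTORY;

            // fix 3: variance is the mean of the squared differences from the average
            double variance = 0;
            for (double e : history) variance += (e - average) * (e - average);
            variance /= HISTORY;

            double c = -0.0025714 * variance + 1.5142857;
            if (energy > c * average) beats++;
        }

        // shift the history and store the newest window energy in slot 0
        System.arraycopy(history, 0, history, 1, HISTORY - 1);
        history[0] = energy;
        windowsSeen++;
    }
    return beats;
}
If the count still looks far too high after these fixes, check the scale of the samples: the article's threshold constants assume a particular energy scale, so normalising the raw 16-bit values (for example dividing by 32768) before computing the energies is worth trying.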

Random AI / Switch case?

So I have a very simple game going here; right now the AI is nearly perfect, and I want it to make mistakes every now and then. The only way the player can win is if I slow the computer down to a mind-numbingly easy level.
My logic is to have a switch-case statement like this:
int number = randomNumber;
switch (number) {
    case 1:
        // computer moves the complete opposite way it's supposed to
        break;
    case 2:
        // computer moves the correct way
        break;
}
How can I have it select case 2 68% (random percentage, just an example) of the time, but still allow for some chance to make the computer fail? Is a switch case the right way to go? This way the difficulty and speed can stay high but the player can still win when the computer makes a mistake.
I'm on the iPhone. If there's a better/different way, I'm open to it.
Generating Random Numbers in Objective-C
int randNumber = rand() % 100;   // 0..99 (ignoring the slight modulo bias)
if( randNumber < 68 )
{
    // 68% path: values 0..67
}
else
{
    // 32% path: values 68..99
}
int randomNumber = GenerateRandomNumberBetweenZeroAndHundred();
if (randomNumber < 68) {
    // 68% of the time: computer moves the correct way
} else {
    // 32% of the time: computer moves the complete opposite way it's supposed to
}
Many PRNGs will offer a random number in the range [0,1). You can use an if-statement instead:
n = randomFromZeroToOne()
if n <= 0.68:
    PlaySmart()
else:
    PlayStupid()
If you're going to generate an integer from 1 to N, instead of a float, beware of modulo bias in your results.
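For illustration only (the question is about the iPhone, where arc4random_uniform(100) already gives an unbiased value), here is what the bias-free version looks like in Java, where Random.nextInt(bound) handles this for you:
import java.util.Random;

public class MoveChooser {
    private static final Random RNG = new Random();

    // nextInt(100) is uniform over 0..99, so comparing against 68 gives exactly a 68% chance
    static boolean shouldPlayCorrectly() {
        return RNG.nextInt(100) < 68;
    }

    public static void main(String[] args) {
        System.out.println(shouldPlayCorrectly() ? "correct move" : "mistake");
    }
}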