Heuristics to find "surprising" mutual friends - facebook

I have the undirected friends graph of my Facebook friends, G i.e. G[i][j] = G[j][i] = true iff my friend i and friend j are friends with each other. I want to find "surprising" mutual friends i.e. pairs of my friends I normally would not expect to know each other. What are some good heuristics/algorithms I can apply? My initial idea is to run a clustering algorithm (not sure which one is the best) and see if I can find edges going across clusters. Any other ideas? What's a good clustering algorithm I can use that takes in a G and spits out clusters.

Here is my idea. Friendship is an edge. Surprising friendship is an edge, such that if you remove the edge, the distance between the two nodes becomes very large.
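A minimal sketch of that idea, assuming the friends graph is stored as a dict mapping each friend to a set of their friends (the names `detour_distance` and `surprising_edges` are made up for illustration). For every edge it runs a breadth-first search that ignores the edge itself and records how long the detour is; a bridge with no detour at all scores infinity.

```python
from collections import deque

def detour_distance(adj, u, v):
    """Shortest-path length from u to v when the direct edge u-v is ignored.

    Returns None if v cannot be reached without that edge.
    """
    seen = {u}
    queue = deque([(u, 0)])
    while queue:
        node, dist = queue.popleft()
        for nxt in adj[node]:
            if node == u and nxt == v:
                continue  # skip the direct edge we are scoring
            if nxt == v:
                return dist + 1
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None

def surprising_edges(adj):
    """Return (detour, u, v) for every undirected edge, longest detour first."""
    scored = []
    for u in adj:
        for v in adj[u]:
            if u < v:  # visit each undirected edge once
                d = detour_distance(adj, u, v)
                scored.append((float("inf") if d is None else d, u, v))
    return sorted(scored, reverse=True)

# Two tight groups of friends (0-1-2 and 3-4-5) joined by one friendship 2-3.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}
for detour, u, v in surprising_edges(adj)[:3]:
    print(u, v, detour)
```

This costs one BFS per edge, so for a large friends list it may be worth capping the search depth and treating anything beyond the cap as "very large".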

The answer by Wu Yongzheng can be tied to an existing network concept that gives a robust and perhaps more sensitive measure, i.e. a quantitative take on "the distance between the two nodes becomes very large". This concept is edge betweenness. In this context one would compute an estimated version. See e.g. https://en.wikipedia.org/wiki/Betweenness_centrality and http://igraph.sourceforge.net/doc/R/betweenness.html.


Measuring the "remoteness" of a node in a graph

I mapped out all the edges of a graph in the ancient BBS game 'TradeWars 2002'. There are 1000 nodes. The graph is officially a directed graph, although most edges between nodes are undirected. The graph is strongly connected.
I modelled the universe in networkx. I'd like to use networkx methods to identify the "most remote" nodes in the network. I don't know how to articulate "most remote" in graph theory terminology though. But the idea I have is nodes that would be bumped into very rarely when someone is transiting between two other arbitrary nodes. And the idea that on the edge of the well-connected nodes, there might be a string of nodes that extends out along a single path that terminates.
A visualization of what I imagine is node 733. Pretty unlikely someone accidentally stumbles onto that one, compared to other better-connected nodes.
What could I use from networkx library to quantify some measure of 'remoteness'?
This is the entire universe:
But the idea I have is nodes that would be bumped into very rarely when someone is transiting between two other arbitrary nodes.
As @Joel mentioned, there are many centrality measures available, and there are often strong correlations between them such that many of them will probably give you more or less what you want.
That being said, I think the class of centrality measures that most closely reflect your intuition are based on random walks. Some of these are pretty costly to compute (although see this paper for some recent improvements on that front) but luckily there is a strong correspondence between the Eigenvector centrality and the frequency with which nodes are visited by a random walker.
The implementation in networkx is available via networkx.algorithms.centrality.eigenvector_centrality.
networkx has a collection of algorithms for this kind of problem: centrality measures. For example, you can use the simplest function, closeness_centrality:
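import networkx as nx
# Create a random graph
G = nx.gnp_random_graph(50, 0.1)
# Compute the closeness centrality of every node (output truncated)
nx.closeness_centrality(G)
{0: 0.3888888888888889,
1: 0.45794392523364486,
2: 0.35507246376811596,
3: 0.4375,
4: 0.4083333333333333,
5: 0.3684210526315789,
...}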
# Draw the graph
labels = {n: n for n in G.nodes}
nx.draw(G, with_labels=True, labels=labels)
The most remote (least central) nodes can then be listed by returning the nodes with the lowest closeness_centrality (note the node IDs and the nodes in blue circles in the picture above):
c = nx.closeness_centrality(G)
sorted(c.items(), key=lambda x: x[1])[:5]
[(48, 0.28823529411764703),
(7, 0.33793103448275863),
(11, 0.35251798561151076),
(2, 0.35507246376811596),
(46, 0.362962962962963)]
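To make the measure less of a black box, here is a simplified pure-Python sketch of closeness for a connected, undirected graph: the number of other reachable nodes divided by the sum of BFS distances to them. The function name and the toy graph are invented for illustration, and networkx's own implementation handles directed and disconnected graphs differently, so treat this as an approximation of the idea rather than a drop-in replacement.

```python
from collections import deque

def closeness(adj, source):
    """(reachable nodes - 1) / (sum of BFS distances from source)."""
    dist = {source: 0}
    queue = deque([source])
    while queue:
        node = queue.popleft()
        for nxt in adj[node]:
            if nxt not in dist:
                dist[nxt] = dist[node] + 1
                queue.append(nxt)
    total = sum(dist.values())
    return (len(dist) - 1) / total if total else 0.0

# A well-connected core (0, 1, 2) with a dead-end string 2-3-4-5 hanging off it.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}

# Least central (most remote) nodes first.
ranked = sorted(adj, key=lambda n: closeness(adj, n))
print(ranked[0], closeness(adj, ranked[0]))
```

In this toy graph the end of the dead-end string comes out as the most remote node, which matches the intuition about node 733.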

Understanding Titan Traversals

I am trying to write a highly scalable system with titandb. I have a situation where some nodes are highly connected.
Imagine the following example at much larger scale.
Now I have the following situations:
I want to find all the friends of node X.
I want to find a specific friend of node X for example 5.
For scenario 1 I do: g.V(X).out(friend).toList(). For scenario 2 I do: g.V(X).out(friend).hasId(5).next(). Both of these traversals will work but scale poorly as X gets more friends. Can I optimise this situation by putting more information on the edge label? For example, if on the edge between X and 5 I change the label to freind_with_5, will the following be faster:
From my understanding this will be faster as only 1 edge will be traversed. However, if I make such a change to my edge labels, how would I find all the friends of X?
You could encode data into your edge label, but you would do that at the cost of complicating your graph schema considerably and, as you note, making it hard to do simple things like "find all my friends". I don't think you should take that approach.
The preferred method for dealing with this is with vertex-centric indices. If you denormalize any data to your edges, you should do it with those indices in mind (and not by encoding that data into the edge label). Put some unique identifier for the friend on the "friend" edge and index that.
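As a purely conceptual illustration of why a vertex-centric index helps, the toy Python class below keeps a per-vertex lookup table next to the plain edge list. Nothing here is Titan's API: the class, its methods and the `friend_index` dict are hypothetical stand-ins for what the database maintains internally once such an index is defined.

```python
class Vertex:
    """Toy in-memory vertex. Illustrative only: this is NOT Titan's API."""

    def __init__(self, vid):
        self.vid = vid
        self.out_edges = []      # every outgoing edge as (label, target_id)
        self.friend_index = {}   # hypothetical per-vertex index: target_id -> edge

    def add_friend(self, target_id):
        edge = ("friend", target_id)
        self.out_edges.append(edge)
        self.friend_index[target_id] = edge

    def all_friends(self):
        """Scenario 1: still a simple scan over one uniform 'friend' label."""
        return [target for label, target in self.out_edges if label == "friend"]

    def find_friend_scan(self, target_id):
        """Scenario 2 without an index: touches up to degree(X) edges."""
        for edge in self.out_edges:
            if edge == ("friend", target_id):
                return edge
        return None

    def find_friend_indexed(self, target_id):
        """Scenario 2 with the index: one lookup, independent of degree."""
        return self.friend_index.get(target_id)


x = Vertex("X")
for friend_id in range(1, 1001):
    x.add_friend(friend_id)

print(x.find_friend_indexed(5))
print(len(x.all_friends()))
```

The point of the sketch is that the label stays a plain "friend", so "find all my friends" keeps working, while the lookup of one specific friend no longer depends on how many edges X has.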
If your supernodes are especially large (millions+ edges) you should also consider Titan's vertex partitioning feature.

Optimized search. How to reduce the complexity?

Here is a problem I'm trying to solve using graph algorithms. Answer to this question is easy if one is familiar with different graph traversal algorithms. What I want to learn is how can we reduce the complexity of this problem?
Let's say we have to traverse in someone's network - Friends, Friends of
Friends (FoF) and FoFoF (1st, 2nd, 3rd Degree.. up to 6th degree) to
search for a particular thing, say 'people living in California'. The
complexity of the problem greatly increases when you have 1000 friends
and your 1000 friends have 1000 friends each and so on.
Let's say we want to do an optimized search, where you know the
destination node (here, a person living in California). How will you
reduce the complexity of the problem?
The program you submit should return the degree by which that person
is connected to you. [where the 'destination node' is your Degree 1st
(Friend), or 2nd (friend of friend) or 3rd Degree (FoFoF) or a Degree
greater than 3rd degree].
Assuming your graph is unweighted, doing a Breadth First Search will give you shortest paths (which effectively are the degrees that you need). If the destination is known you can also use Dijkstra's algorithm to find a shortest path to that specific node, although if the graph is unweighted just doing the BFS will be more efficient, as its complexity is lower than Dijkstra's. Also, if I understand correctly, your output has to cover only 4 cases: degrees 1, 2, 3 or higher than that. If so, you can just BFS the first three levels and store the results. Then you can answer the question in constant time by checking for the existence of such a person in the data obtained via BFS.
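A minimal sketch of that depth-limited BFS, assuming the network is available as a dict from each person to a list of their friends (the function names and the tiny example network are invented for illustration):

```python
from collections import deque

def degrees_up_to(adj, start, max_depth=3):
    """BFS from start, recording the degree of everyone within max_depth hops."""
    depth = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if depth[node] == max_depth:
            continue  # do not expand beyond the last level we care about
        for nxt in adj.get(node, ()):
            if nxt not in depth:
                depth[nxt] = depth[node] + 1
                queue.append(nxt)
    return depth

def degree_of(depth, person):
    """Constant-time lookup; None means 'more than max_depth degrees away'."""
    return depth.get(person)

# A chain me - a - b - c - d, so d is a 4th-degree connection.
network = {"me": ["a"], "a": ["me", "b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
stored = degrees_up_to(network, "me")
print(degree_of(stored, "c"), degree_of(stored, "d"))
```

The BFS itself still visits everyone within three hops once, so the saving is in the repeated lookups afterwards, not in the initial traversal.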

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced into this area by doing some tests, and as a first experiment I'd like to create clusters on tweets based on content similarity. The basic idea for the experiment would be storing tweets on a database and periodically calculate the clustering (ie. using a cron job). Please note that the database would obtain new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Perform Jaccard similarity and compare with each of the existing clusters. If result > threshold then it would be assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
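As a rough sketch of steps 1 and 2, the snippet below extracts character 3-grams and computes the Jaccard similarity between two tweets. Using character n-grams rather than word n-grams is an assumption made here to stay language independent, and the function names and sample tweets are illustrative.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a, b):
    """|A ∩ B| / |A ∪ B|; two empty sets are treated as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

t1 = char_ngrams("I'm checking in at Cafe Roma")
t2 = char_ngrams("I'm checking in at Cafe Luna")
t3 = char_ngrams("My klout score is 42")
print(jaccard(t1, t2), jaccard(t1, t3))
```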
Now I see some problems with this basic approach. Let's put aside computational cost, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10 which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold but similarity with C1 would be higher. In that case Tn should go to C1. This could be done brute force as well to assign the tweet to the cluster with maximum similarity.
And last of all, it's the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend me which approach should I aim to? I read some mentions about LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guiding.
From what I'm reading a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case idea would be being able to cluster similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ..." which are similar to each other would be one case, or "My klout score is ..."). Also note that I'd like this to be language independent, so I'm not interested in having to deal with specific language issues.
It looks to me like you are trying to address two different problems in one, i.e. "syntactic" and "semantic" clustering. They are quite different problems, especially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using an N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as few as 3-4 features). Normalization is also an important factor here: removing the RT tag, mentions and hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it is unlikely to work if you try to aggregate a random sample of tweets, because they usually carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (e.g. the ones matching a keyword or a hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is the fact that you need online clustering, and not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union find) aren't really suited for online clustering, unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution in my experience, because you need to retain all the information of every cluster member to update the "average" every time new data gets in. Also, algorithms like hierarchical clustering and union find work well because they can join pre-existing clusters if a link of similarity is found between them, and they don't simply assign a new element to the "closest" cluster, which is what you suggested in your post.
Algorithms like MinHash (or SimHash) are indeed more suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(nlog(n)) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph, by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
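The pipeline described above (sketch each tweet, link tweets whose sketches are similar, take connected components) might look roughly like the following. It is a simplified sketch under several assumptions: character bigrams as features, 64 salted BLAKE2 hashes as the hash family, and a plain all-pairs comparison of sketches standing in for a real index query (an LSH-style index is what would avoid the pairwise loop). All function names are invented for illustration.

```python
import hashlib

def bigrams(text, n=2):
    """Character n-grams after lowercasing and collapsing whitespace."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def minhash_sketch(features, num_hashes=64):
    """For each salted hash function, keep the minimum hash over all features."""
    sketch = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(8, "little")
        hashes = (
            int.from_bytes(
                hashlib.blake2b(f.encode("utf-8"), digest_size=8, salt=salt).digest(),
                "big",
            )
            for f in features
        )
        sketch.append(min(hashes, default=0))  # default covers empty feature sets
    return sketch

def estimate_similarity(a, b):
    """Fraction of matching sketch positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def find(parent, x):
    """Union-find root lookup with path halving."""
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x

def cluster(texts, threshold=0.5):
    """Connected components of the graph linking texts with similar sketches."""
    sketches = [minhash_sketch(bigrams(t)) for t in texts]
    parent = list(range(len(texts)))
    for i in range(len(texts)):
        for j in range(i + 1, len(texts)):
            if estimate_similarity(sketches[i], sketches[j]) >= threshold:
                parent[find(parent, i)] = find(parent, j)
    groups = {}
    for i in range(len(texts)):
        groups.setdefault(find(parent, i), []).append(i)
    return sorted(groups.values())

tweets = [
    "I'm checking in at Cafe Roma",
    "I'm checking in at Cafe Luna",
    "My klout score is 42",
]
print(cluster(tweets))
```

Because clusters are connected components, a new tweet that is similar to members of two existing clusters merges them, which is the behaviour the answer contrasts with "assign to the closest cluster".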
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
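A small sketch of canopy pre-clustering as described above, using a loose outer threshold and a tight inner threshold. The function name, the parameter names and the one-dimensional example data are illustrative assumptions; for tweets the distance could be something like one minus the Jaccard similarity.

```python
def canopies(points, dist, t_outer, t_inner):
    """Greedy canopy pre-clustering; returns a list of possibly overlapping canopies."""
    if t_outer < t_inner:
        raise ValueError("the outer radius must be at least the inner radius")
    remaining = list(points)
    result = []
    while remaining:
        center = remaining.pop(0)      # the first object starts a new canopy
        canopy = [center]
        still_candidates = []
        for p in remaining:
            d = dist(center, p)
            if d < t_outer:
                canopy.append(p)       # within the outer radius: joins the canopy
            if d >= t_inner:
                still_candidates.append(p)  # not tightly covered: may start its own
        remaining = still_candidates
        result.append(canopy)
    return result

points = [0, 1, 2, 10, 11]
print(canopies(points, lambda a, b: abs(a - b), t_outer=3, t_inner=1.5))
```

Here the point 2 ends up in two canopies, which illustrates the overlapping (non-disjoint) nature of the result.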
However, don't expect useful results from clustering tweets. Tweet data is just too much noise. Most tweets have just a few words, too little to define a good similarity. On the other hand, you have the various retweets that are near duplicates - but trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?

Dijkstra algorithm for iPhone

It is possible to easily use the GPS functionality in the iPhone since sdk 3.0, but it is explicitly forbidden to use Google's Maps.
This has two implications, I think:
You will have to provide maps yourself
You will have to calculate the shortest routes yourself.
I know that calculating the shortest route has puzzled mathematicians for ages, but both TomTom and Google are doing a great job, so that issue seems to have been solved.
Searching on the 'net, not being a mathematician myself, I came across the Dijkstra Algorithm. Is there anyone of you who has successfully used this algorithm in a Maps-like app in the iPhone?
Would you be willing to share it with me/the community?
Would this be the right approach, or are there other options?
Thank you so much for your consideration.
I do not believe Dijkstra's algorithm would be useful for real-world mapping because, as Tom Leys said (I would comment on his post, but lack the rep to do so), it requires a single starting point. If the starting point changes, everything must be recalculated, and I would imagine this would be quite slow on a device like the iPhone for a significantly large data set.
Dijkstra's algorithm is for finding the shortest path to all nodes (from a single starting node). Game programmers use a directed search such as A*. Where Dijkstra processes the node that is closest to the starting position first, A* first processes the node that is estimated to lie on the shortest route to the end position.
The way this works is that you provide a cheap "estimate" function from any given position to the end point. A good example is how far a bird would fly to get there. A* adds this to the current distance from the start for each node and then chooses the node that seems to be on the shortest path.
The better your estimate, the shorter the time it will take to find a good path. If this time is still too long, you can do a path find on a simple map and then another on a more complex map to find the route between the places you found on the simple map.
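A compact A* sketch along those lines, using the straight-line ("as the bird flies") distance as the estimate. It assumes the road graph is a dict of node -> list of (neighbour, length) arcs, that every node has coordinates, and that no arc is shorter than the straight line between its endpoints, so the estimate never overestimates. The function name and the toy data are made up for illustration.

```python
import heapq
import math

def a_star(graph, coords, start, goal):
    """Return (cost, path) from start to goal, or None if goal is unreachable."""
    def estimate(node):
        return math.dist(coords[node], coords[goal])  # straight-line distance

    best = {start: 0.0}          # cheapest known distance from the start
    came_from = {}
    closed = set()
    heap = [(estimate(start), start)]
    while heap:
        _, node = heapq.heappop(heap)
        if node == goal:
            path = [node]
            while node in came_from:
                node = came_from[node]
                path.append(node)
            return best[goal], path[::-1]
        if node in closed:
            continue
        closed.add(node)
        for nxt, length in graph.get(node, ()):
            candidate = best[node] + length
            if candidate < best.get(nxt, math.inf):
                best[nxt] = candidate
                came_from[nxt] = node
                # priority = distance so far + estimated distance still to go
                heapq.heappush(heap, (candidate + estimate(nxt), nxt))
    return None

coords = {"A": (0, 0), "B": (1, 0), "C": (2, 0), "D": (1, 2)}
graph = {
    "A": [("B", 1), ("D", 2.5)],
    "B": [("A", 1), ("C", 1)],
    "C": [("B", 1), ("D", 2.5)],
    "D": [("A", 2.5), ("C", 2.5)],
}
print(a_star(graph, coords, "A", "C"))
```

With an estimate of zero everywhere, the same loop degenerates into Dijkstra's algorithm, which is one way to see how the two relate.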
After much searching, I have found an article on A* for you to read
Dijkstra's algorithm is O(m log n) for n nodes and m edges (for a single path) and is efficient enough to be used for network routing. This means that it's efficient enough to be used for a one-off computation.
Briefly, Dijkstra's algorithm works like:
Take the start node
Assign it a depth of zero
Insert it into a priority queue at its depth key
While the priority queue is not empty:
    Pop the node with the lowest depth from the priority queue
    Record the node that you came from so you can track the path back
    Mark the node as having been visited
    If this node is the destination:
        Stop searching
    For each neighbour:
        If the neighbour has not previously been visited:
            Calculate depth as depth of current node + distance to neighbour
            Insert neighbour into the priority queue at the calculated depth.
Return the destination node and list of the nodes through which it was reached.
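The steps above can be sketched in Python with the standard heapq module as the priority queue. The graph format (a dict of node -> list of (neighbour, distance) arcs) and the toy street data are assumptions for illustration; as a small variation on the outline, a neighbour is only pushed when the new depth improves on the best one seen so far.

```python
import heapq

def dijkstra(graph, start, goal):
    """Return (distance, path) from start to goal, or None if unreachable."""
    dist = {start: 0}
    prev = {}                    # node we came from, to track the path back
    visited = set()
    heap = [(0, start)]
    while heap:
        depth, node = heapq.heappop(heap)
        if node in visited:
            continue             # stale queue entry
        visited.add(node)
        if node == goal:
            break
        for nxt, length in graph.get(node, ()):
            new_depth = depth + length
            if nxt not in visited and new_depth < dist.get(nxt, float("inf")):
                dist[nxt] = new_depth
                prev[nxt] = node
                heapq.heappush(heap, (new_depth, nxt))
    if goal not in visited:
        return None
    path = [goal]
    while path[-1] != start:
        path.append(prev[path[-1]])
    return dist[goal], path[::-1]

# One-way streets between intersections, with distances.
streets = {
    "A": [("B", 4), ("C", 1)],
    "C": [("B", 2)],
    "B": [("D", 5)],
    "D": [],
}
print(dijkstra(streets, "A", "D"))
```

Stopping as soon as the destination is popped is what makes this a single-pair search rather than a computation of distances to every node.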
Contrary to popular belief, Dijkstra's algorithm is not necessarily an all-pairs shortest path calculator, although it can be adapted to do this.
You would have to get a graph of the streets and intersections with the distances between the intersections. If you had this data you could use Dijkstra's algorithm to compute a shortest route.
If you look at the technology TomTom calls 'IQ Routes', they measure actual speed and travel time per stretch of road per time of day. This makes the expected arrival time more accurate and more fact-based: http://www.tomtom.com/page/iq-routes
Calculating a route using the A* algorithm is plenty fast enough on an iPhone with offline map data. I have experience of doing this commercially. I use the A* algorithm as documented on Wikipedia, and I keep the road network in memory and re-use it; once it's loaded, routing even over a large area like Spain or the western half of Canada is practically instant.
I take data from OpenStreetMap or elsewhere and convert it into a directed graph, assuming (which is the right way to do it according to those who know) that any two roads sharing a point with the same ID are joined. I assign weights to different types of roads based on expected speeds, and if a portion of a road is one-way I create only a single arc; two-way roads get two arcs, one in each direction. That's pretty much the whole thing apart from some ad-hoc code to prevent dangerous turns, and implementing routing restrictions.
This was discussed earlier here: What algorithms compute directions from point a to point b on a map?
Have a look at CloudMade. They offer a free service for iPhone and iPad that allows navigation based on your current location. It is built on OpenStreetMap and has some nifty features like making your own map style. It is a little slow from time to time but it's totally free.