simulation of creation of social network graph given present snapshot - simulation

I am using http://networkrepository.com/socfb-B-anon.php dataset for my analysis. I would like to do some analysis of how this present graph is formed from scratch. Is there any existing social network simulation framework for this kind of problem?
I am also open to using any other dataset if one is available. I would need a timestamp for every edge (the time at which its two nodes were connected).

The Barabási–Albert (BA) model describes a preferential attachment model for generating networks, or graphs. It iteratively builds a graph by adding new nodes and connecting them to previously added nodes. Each new node is attached to existing nodes with a probability proportional to their degree: an existing node with degree k is chosen with probability k divided by the sum of the degrees of all existing nodes (which is twice the number of edges in the graph).
This algorithm has been shown to produce graphs that are scale-free, which means the distribution of degrees follows a power law, which is a typical property of social networks.
This can be seen as a 'simulation' of a growing social network, where users are more likely to 'befriend' or 'follow' popular existing users. Of course it's not a complete simulation because it assumes a new user is done befriending or following other users right after they created an account, but it might be a good starting point for your exploration.
A timestamp for each edge or node creation can be generated by maintaining a counter during the creation process and incrementing it as you add more edges or nodes to the graph.
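As a rough illustration of the two ideas above (degree-proportional attachment and a running timestamp), here is a minimal pure-Python sketch. The function name and the parameters `n` (final number of nodes), `m` (edges added per new node) and `seed` are assumptions made for this example, and the timestamp is just a counter, not real wall-clock time. It produces a synthetic history; it does not recover the actual history of your dataset.

```python
import random


def barabasi_albert_with_timestamps(n, m, seed=None):
    """Grow a graph by preferential attachment.

    Returns a list of (new_node, old_node, timestamp) edges, where the
    timestamp is a counter incremented once per created edge.
    """
    if m < 1 or n <= m:
        raise ValueError("need 1 <= m < n")
    rng = random.Random(seed)
    edges = []
    # Each node appears in this list once per unit of degree, so a uniform
    # draw from it is a degree-proportional draw over nodes.
    endpoints = []
    # The first new node simply connects to the m initial nodes.
    targets = list(range(m))
    timestamp = 0
    for new_node in range(m, n):
        for old_node in targets:
            edges.append((new_node, old_node, timestamp))
            timestamp += 1
        endpoints.extend(targets)
        endpoints.extend([new_node] * m)
        # Pick m distinct targets for the next node, weighted by degree.
        chosen = set()
        while len(chosen) < m:
            chosen.add(rng.choice(endpoints))
        targets = list(chosen)
    return edges


if __name__ == "__main__":
    for edge in barabasi_albert_with_timestamps(n=8, m=2, seed=42):
        print(edge)
```

Sorting the resulting edges by timestamp replays the growth of the graph step by step, which is the kind of edge history the question asks for.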
Hopefully this answer gives you enough terminology to further your research.

Related

pgRouting with custom network?

I have a cost network, but it's not a street mapping network. I know the nodes and edges as I defined them. pgRouting looks like a good choice, but every single example I can find uses Open Street Map as the data. I don't have GPS coordinates. The x1,y1 for nodes makes no sense in my graphs, my nodes have specific ids, not coordinates. The costs aren't calculated from the coordinates, they're assigned by me on the various edges based on domain knowledge specific to my domain.
Are there any examples of how to create a custom network in pgRouting? I'm really struggling because the examples are "and then you use this tool to import OSM data"...which doesn't help me at all.
#Chris Kessel
I don't know if this is still relevant, but it may help others:
Basically, what you need is a table of edges, where the column 'source' holds the id of the node at one end of the edge and the column 'target' holds the id of the node at the other end. You also need a defined cost for each edge. I'm not sure what this will be for you; usually it's distance or time units.
Usually this is done from geographic data using the pgr_createTopology function, but in your case you will just need to create this yourself, I suppose.
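To make the shape of such a table concrete, here is a small Python stand-in (not pgRouting itself). The rows are hypothetical (id, source, target, cost) tuples with made-up ids and costs, and the routing is a plain Dijkstra search. The point is only that nothing beyond node ids and costs is needed; in pgRouting the same rows would live in a PostgreSQL table and be handed to a routing function such as pgr_dijkstra through an edges query.

```python
import heapq

# Hypothetical edge rows: (id, source, target, cost). No coordinates needed.
EDGES = [
    (1, 10, 20, 4.0),
    (2, 20, 30, 1.5),
    (3, 10, 30, 7.0),
    (4, 30, 40, 2.0),
]


def shortest_path_cost(edges, start, goal):
    """Return the cheapest total cost from start to goal, or None."""
    adjacency = {}
    for _edge_id, source, target, cost in edges:
        # Edges are treated as directed: source -> target.
        adjacency.setdefault(source, []).append((target, cost))
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if node == goal:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale queue entry
        for neighbor, cost in adjacency.get(node, []):
            candidate = dist + cost
            if candidate < best.get(neighbor, float("inf")):
                best[neighbor] = candidate
                heapq.heappush(queue, (candidate, neighbor))
    return None


if __name__ == "__main__":
    print(shortest_path_cost(EDGES, 10, 40))
```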
I think this link can help you:
https://anitagraser.com/2011/02/07/a-beginners-guide-to-pgrouting/
The answer to the question "Are there any examples of how to create a custom network in pgRouting?" is: yes, there are.

Multiple object tracking using radar data and extended kalman filter

thanks in advance.
I am new to the multiple object tracking field and have been working on this for a couple of days. I have developed a first version of a single object tracker using an extended Kalman filter. I am estimating position and velocity, assuming a constant acceleration model. Now my question is: how can I convert the existing model to track multiple objects? The main problem is that I am using radar data, so I have not been able to find references for developing the tracker. One good example, or the steps to achieve this, would help me understand the concept.
The answer to this question depends on a lot of things. For example, how much control and knowledge do you have over the whole system? If you know how many targets you need to track you can add all of them to the Kalman Filter state and for every measurement you perform data association to find out to which object a given measurement belongs. An easy association metric would be nearest neighbor.
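A minimal sketch of nearest neighbor association, assuming each track's predicted position and each measurement are plain coordinate tuples. The function name and the `gate` parameter (a maximum allowed distance) are assumptions for this example. Euclidean distance is used for simplicity; with a Kalman filter one would more commonly use a distance that accounts for the innovation covariance.

```python
import math


def associate_nearest_neighbor(predicted, measurements, gate):
    """Greedily pair tracks with measurements, closest pairs first.

    predicted: list of predicted track positions (tuples).
    measurements: list of measured positions (tuples).
    gate: pairs farther apart than this are never associated.
    Returns a dict mapping track index -> measurement index.
    """
    candidates = []
    for track_idx, position in enumerate(predicted):
        for meas_idx, measurement in enumerate(measurements):
            distance = math.dist(position, measurement)
            if distance <= gate:
                candidates.append((distance, track_idx, meas_idx))
    candidates.sort()

    assignment = {}
    used_measurements = set()
    for _distance, track_idx, meas_idx in candidates:
        # Each track and each measurement is used at most once.
        if track_idx in assignment or meas_idx in used_measurements:
            continue
        assignment[track_idx] = meas_idx
        used_measurements.add(meas_idx)
    return assignment


if __name__ == "__main__":
    tracks = [(0.0, 0.0), (10.0, 10.0)]
    detections = [(9.5, 10.2), (0.3, -0.1), (50.0, 50.0)]
    print(associate_nearest_neighbor(tracks, detections, gate=2.0))
```

Measurements left unassigned (the third one in the usage example) are either clutter or candidates for new targets.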
If you don't know how many targets there will be, you will want to implement track management, where each target you are tracking is represented by a track and you can model the birth and death probabilities of targets.
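Here is a deliberately simplified sketch of track management. Instead of modelling birth and death probabilities, it uses hit and miss counters, which is a common heuristic stand-in and not the probabilistic approach itself. The class names, the `confirm_hits` and `max_misses` thresholds, and the `assignment` argument (a mapping from track index to measurement index, produced by whatever data association step you use) are all assumptions for this example.

```python
from dataclasses import dataclass
from itertools import count


@dataclass
class Track:
    track_id: int
    position: tuple
    hits: int = 1
    misses: int = 0
    confirmed: bool = False


class TrackManager:
    def __init__(self, confirm_hits=3, max_misses=2):
        self.confirm_hits = confirm_hits
        self.max_misses = max_misses
        self.tracks = []
        self._ids = count()

    def step(self, assignment, measurements):
        """Process one scan.

        assignment maps an index into self.tracks to an index into
        measurements; unassigned measurements give birth to new tracks.
        """
        used = set(assignment.values())
        for idx, track in enumerate(self.tracks):
            if idx in assignment:
                # A real tracker would run the EKF update step here.
                track.position = measurements[assignment[idx]]
                track.hits += 1
                track.misses = 0
                if track.hits >= self.confirm_hits:
                    track.confirmed = True
            else:
                track.misses += 1
        # Death: drop tracks that went unobserved for too long.
        self.tracks = [t for t in self.tracks if t.misses <= self.max_misses]
        # Birth: every unassigned measurement starts a tentative track.
        for meas_idx, measurement in enumerate(measurements):
            if meas_idx not in used:
                self.tracks.append(Track(next(self._ids), measurement))
```

Each track would own its own filter state, so the single-target EKF can be reused unchanged, once per track.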
Multi Target Tracking is a vast field and if you want to have an in-depth mathematical introduction I would recommend the 2015 survey paper "Multitarget Tracking" by Ba-Ngu Vo et al. You should be able to find a preprint pdf online.
If you are looking for a more lightweight tutorial, I would assume it is possible to find a tutorial or example code online to start from. As mentioned in the first paragraph, nearest neighbor association for a fixed number of objects might be a good first step.

framework for distributed algorithm

I have to do a project where I have a dynamic graph and each node executes my algorithm to calculate the PageRank.
My question is: is there a framework that allows me to run an algorithm at the same time on each node (the algorithm is not centralized)?
Yes, Giraph is probably the most common example and can do exactly what you are looking for. However, it isn't trivial to set up; there is a question from yesterday on SO about materials for Giraph: https://stackoverflow.com/questions/22817423/material-related-to-giraph/
Other examples would be GraphX (http://amplab.github.io/graphx/) from Spark and GraphLab (http://graphlab.org/projects/index.html), but I don't have any experience with those. All of these frameworks let you write code for a single node and execute it for each node in a graph. They also allow you to distribute the algorithm across multiple servers for large graphs, but that isn't necessary if your graph is small enough.
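To give a feel for the "write code for a node" style, here is a single-process Python sketch of PageRank in that vertex-centric form. It does not use the Giraph, GraphX or GraphLab APIs; the function name, the `out_links` input format (every node is a key, mapped to its list of out-neighbors) and the fixed number of supersteps are assumptions for this example. Each superstep has a send phase, where every node distributes its rank along its out-edges, and a compute phase, where every node updates itself from the messages it received.

```python
def pagerank_vertex_centric(out_links, damping=0.85, supersteps=30):
    """PageRank written as per-node send and compute phases.

    out_links: dict mapping each node to a list of nodes it links to.
    Every node must appear as a key, even if it has no out-links.
    """
    nodes = list(out_links)
    n = len(nodes)
    rank = {v: 1.0 / n for v in nodes}
    for _ in range(supersteps):
        inbox = {v: 0.0 for v in nodes}
        dangling = 0.0
        # Send phase: each node splits its rank among its out-neighbors.
        for v in nodes:
            targets = out_links[v]
            if targets:
                share = rank[v] / len(targets)
                for w in targets:
                    inbox[w] += share
            else:
                # Nodes without out-links spread their rank uniformly.
                dangling += rank[v]
        # Compute phase: each node only looks at its own inbox.
        rank = {
            v: (1.0 - damping) / n + damping * (inbox[v] + dangling / n)
            for v in nodes
        }
    return rank


if __name__ == "__main__":
    graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
    for node, value in sorted(pagerank_vertex_centric(graph).items()):
        print(node, round(value, 4))
```

In a real framework the two loops over nodes run in parallel across workers, and the inbox is filled by message passing rather than by a shared dictionary.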

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced to this area by doing some tests, and as a first experiment I'd like to create clusters of tweets based on content similarity. The basic idea for the experiment would be storing tweets in a database and periodically calculating the clustering (e.g. using a cron job). Please note that the database would obtain new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Perform Jaccard similarity and compare with each of the existing clusters. If result > threshold then it would be assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
Now I see some problems with this basic approach. Let's put aside computational cost, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10 which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold but similarity with C1 would be higher. In that case Tn should go to C1. This could be done brute force as well to assign the tweet to the cluster with maximum similarity.
And last of all, there's the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend which approach I should aim for? I read some mentions of LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guidance.
From what I'm reading a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case would be clustering similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ...") which are similar to each other would be one case, and "My klout score is ..." another. Also note that I'd like this to be language independent, so I'm not interested in having to deal with specific language issues.
It looks to me like you are trying to address two different problems in one, i.e. "syntactic" and "semantic" clustering. They are quite different problems, especially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using an N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as few as 3-4 features). Normalization is also an important factor here: removing the RT tag, mentions and hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it is unlikely to work if you try to aggregate a random sample of tweets, because they usually carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (e.g. the ones matching a keyword or a hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is the fact that you need online clustering, and not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union find) aren't really suited for online clustering, unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution in my experience, because you need to retain all the information of every cluster member to update the "average" every time new data gets in. Also, algorithms like hierarchical clustering and union find work well because they can join pre-existing clusters if a link of similarity is found between them, and they don't simply assign a new element to the "closest" cluster, which is what you suggested in your post.
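For the syntactic case, the normalization and N-gram comparison mentioned earlier in this answer could look roughly like the sketch below. It assumes word-level bigrams (N=2) and whitespace tokenization, which keeps it language independent but crude; the function names and the exact regular expressions are assumptions for this example, not a fixed recipe.

```python
import re


def normalize(tweet):
    """Lowercase a tweet and strip links, the RT tag, mentions, hashtags."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # links
    text = re.sub(r"\brt\b", " ", text)        # retweet tag
    text = re.sub(r"[@#]\w+", " ", text)       # mentions and hashtags
    return text.split()


def ngrams(tokens, n=2):
    """Return the set of word n-grams of a token list."""
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity of two sets (0.0 if both are empty)."""
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


if __name__ == "__main__":
    t1 = "RT @bob I'm checking in at Joe's Cafe #coffee http://4sq.com/abc"
    t2 = "I'm checking in at Central Park http://4sq.com/xyz"
    print(jaccard(ngrams(normalize(t1)), ngrams(normalize(t2))))
```

The two example tweets share the "I'm checking in at" prefix, so three of their seven distinct bigrams overlap.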
Algorithms like MinHash (or SimHash) are indeed more suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(nlog(n)) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph, by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
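The sketch-index-graph pipeline described above could be prototyped along these lines. Several things here are assumptions for illustration: an in-memory dictionary stands in for the database, salted BLAKE2 hashes from the standard library serve as the hash family, candidate lookup uses LSH banding (a common companion technique that the paragraph does not spell out), and connected components are maintained incrementally with union find. Feature sets are expected to be non-empty sets of strings.

```python
import hashlib
from collections import defaultdict


def minhash_sketch(features, num_hashes=64):
    """Return one minimum hash value per salted hash function."""
    sketch = []
    for seed in range(num_hashes):
        salt = seed.to_bytes(4, "big")
        sketch.append(min(
            int.from_bytes(
                hashlib.blake2b(f.encode("utf-8"), digest_size=8,
                                salt=salt).digest(), "big")
            for f in features))
    return sketch


def estimate_similarity(a, b):
    """Fraction of matching positions, an estimate of Jaccard similarity."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


class MinHashIndex:
    def __init__(self, num_hashes=64, bands=16, threshold=0.5):
        self.num_hashes = num_hashes
        self.bands = bands
        self.rows = num_hashes // bands
        self.threshold = threshold
        self.sketches = {}               # doc id -> sketch (the "index")
        self.buckets = defaultdict(set)  # band key -> doc ids
        self.parent = {}                 # union-find forest over doc ids

    def _find(self, x):
        while self.parent[x] != x:
            self.parent[x] = self.parent[self.parent[x]]
            x = self.parent[x]
        return x

    def add(self, doc_id, features):
        """Query the index with a new document, then store it."""
        sketch = minhash_sketch(features, self.num_hashes)
        self.parent[doc_id] = doc_id
        keys = [(b, tuple(sketch[b * self.rows:(b + 1) * self.rows]))
                for b in range(self.bands)]
        candidates = set()
        for key in keys:
            candidates |= self.buckets[key]
        for other in candidates:
            if estimate_similarity(sketch, self.sketches[other]) >= self.threshold:
                # An edge in the similarity graph: merge the two components.
                self.parent[self._find(doc_id)] = self._find(other)
        self.sketches[doc_id] = sketch
        for key in keys:
            self.buckets[key].add(doc_id)

    def clusters(self):
        """Return the connected components as a list of sets of doc ids."""
        groups = defaultdict(set)
        for doc_id in self.parent:
            groups[self._find(doc_id)].add(doc_id)
        return list(groups.values())
```

Because a new tweet can match members of two existing components, adding it can merge clusters, which is the behavior that simple "assign to the closest cluster" schemes lack.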
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
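A one-pass sketch of this scheme, under the assumption that objects arrive in sequence and are only compared against canopies that already exist (so an early object is never added retroactively to a canopy created later). The function name and the `distance`, `inner` and `outer` parameters are assumptions for this example.

```python
def canopy_clusters(points, distance, inner, outer):
    """Group points into overlapping canopies in a single pass.

    Each canopy is represented by the first point that started it.
    Returns a list of (center, members) pairs.
    """
    if inner > outer:
        raise ValueError("inner radius must not exceed outer radius")
    canopies = []
    for point in points:
        starts_new = True
        for center, members in canopies:
            d = distance(point, center)
            if d <= outer:
                members.append(point)  # may join several canopies
            if d <= inner:
                starts_new = False     # tightly covered by this canopy
        if starts_new:
            canopies.append((point, [point]))
    return canopies


if __name__ == "__main__":
    data = [0.0, 0.5, 2.0, 5.0, 1.2]
    for center, members in canopy_clusters(
            data, lambda a, b: abs(a - b), inner=1.0, outer=2.5):
        print(center, members)
```

For tweets, `distance` could be one minus the Jaccard similarity of their N-gram sets.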
However, don't expect useful results from clustering tweets. Tweet data is just too much noise. Most tweets have just a few words, too few to define a good similarity. On the other hand, you have the various retweets that are near duplicates, but those are trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?

Clearing Mesh of Graph

If we do information visualization of documents, the graph generated across multiple documents often forms a mesh. A clear picture is easy to get when the data load is small, so summarization is a good thing. But if the number of documents reaches the millions, then even with summarization the graph forms a big mesh.
I am a bit perplexed about how to clear the mesh. Reading and working through http://www.jerrytalton.net/research/Talton04SSMSA.report/Talton04SSMSA.pdf has not been much help, as the data is huge.
I would be grateful if any learned members could help me out.
Regards,
SK
Are you talking about creating a graph or network of the documents? For example, you could have a network of documents linked by their citations, by having shared authors, by having the same terms appearing in them, etc. This isn't generally called a mesh problem; instead, it is an automatic graph layout problem.
You need either better layout algorithms or to do some kind of clustering and reduction. There are many clustering algorithms you can use, for example Wakita & Tsurumi's:
Ken Wakita and Toshiyuki Tsurumi. 2007. Finding community structure in mega-scale social networks: [extended abstract]. Proc. 16th international conference on World Wide Web (WWW '07). 1275-1276. DOI=10.1145/1242572.1242805.
One that is particularly targeted at reducing complexity through "graph summarization" is Navlakha et al. 2008:
Saket Navlakha, Rajeev Rastogi, and Nisheeth Shrivastava. 2008. Graph summarization with bounded error. Proc. 2008 ACM SIGMOD international conference on Management of data (SIGMOD '08). 419-432. DOI=10.1145/1376616.1376661.
You could also check out my latest paper, which replaces common repeating patterns in the network with representative glyphs:
Dunne, C. & Shneiderman, B. 2013. Motif simplification: improving network visualization readability with fan, connector, and clique glyphs. Proc. 2013 SIGCHI Conference on Human Factors in Computing Systems (CHI '13).
Here's an example picture of the reduction possible: