Why am I getting different community detection results for NetworkX and Gephi? - networkx

I've got a network with undirected, weighted edges. I'm interested in whether nodes A and B are in the same community. When I run "modularity" in Gephi, the modularity class for node A is usually, though not always, distinct from that of B. However, when I switch over to Python and run, on the exact same underlying data, either louvain_communities() (from the networkx.algorithms.community module) or community_louvain.best_partition() (from the community module), A is always in the same community as B. I've tried this at various resolutions, but keep getting similar results: the Python modules are grouping A and B significantly more often than Gephi.
My question is: What is the Gephi method doing differently? My understanding is that Gephi uses the Louvain method; the variables I can see (resolution, using weights, etc.) seem to be the same. Why the disparity?
Edit:
I was asked to provide some code for this.
Basically I have edge tuples with weights like so:
edge_tuples = [(A,B,5),(A,C,11),(B,C,41),…]
I have nodes as a list:
nodes = [A,B,C,…]
I use networkx to make a graph:
G = nx.Graph()
G.add_nodes_from(nodes)
G.add_weighted_edges_from(edge_tuples)
If I’m using community, I get the partition like so:
partition = community.community_louvain.best_partition(G,resolution=.7)
The resolution could be whatever, but .7 is in line with what I've tried before.
In Gephi, I'm just using ordinary node and edge tables. These are generated and exported as CSVs in the process of creating the edge_tuples described above (i.e. I make them both out of the same data and just export a CSV before making the networkx graph), so I don't see where the underlying data would differ, although I'm certainly open to correction.
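Since Louvain implementations are typically randomized, a single run from each tool is hard to compare. A minimal sketch of how the "usually, though not always" observation could be quantified on the Python side follows. The helper name co_membership_rate and the hand-written partitions are hypothetical; in practice the list would be filled by repeated calls such as best_partition(G, resolution=.7, random_state=seed).

```python
def co_membership_rate(partitions, a, b):
    """Fraction of partitions in which nodes a and b share a community.

    Each partition is a dict mapping node -> community id, which is the
    format returned by community_louvain.best_partition().
    """
    if not partitions:
        raise ValueError("need at least one partition")
    together = sum(1 for p in partitions if p[a] == p[b])
    return together / len(partitions)


# Hypothetical partitions standing in for repeated Louvain runs with
# different seeds (assumption: real runs would replace this list).
runs = [
    {"A": 0, "B": 0, "C": 1},
    {"A": 0, "B": 1, "C": 1},
    {"A": 2, "B": 2, "C": 2},
    {"A": 1, "B": 0, "C": 0},
]

rate = co_membership_rate(runs, "A", "B")
print(rate)  # 0.5
```

Comparing this rate against the share of Gephi runs that put A and B in the same modularity class gives a like-for-like number instead of an impression.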


How to create a "Denoising Autoencoder" in Matlab?

I know MATLAB has the function trainAutoencoder(input, settings) to create and train an autoencoder. The result supports the two functions encode and decode.
But this is only applicable to normal autoencoders. What if you want a denoising autoencoder? I searched and found some sample code that uses the network function to convert the autoencoder to a normal network and then calls train(network, noisyInput, smoothOutput), like a denoising autoencoder.
But there are multiple missing parts:
How can this new network object be used to "encode" new data points? It doesn't support encode().
How can the "latent" variables (the features) be extracted from this "network"?
I would appreciate it if anyone could help me resolve this issue.
Thanks,
-Moein
At present (2019a), MATLAB does not permit users to add layers manually to an autoencoder. If you want to build your own, you will have to start from scratch using the layers provided by MATLAB.
In order to use trainNetwork(...) to train your model, you will have to find a way to insert your data into an object called imageDatastore. The difficulty with autoencoder data is that there is NO label, which is required by imageDatastore, hence you will have to find a smart way around it; essentially you are dealing with a so-called OCC (One Class Classification) problem.
https://www.mathworks.com/help/matlab/ref/matlab.io.datastore.imagedatastore.html
Use activations(...) to dump outputs from intermediate (hidden) layers
https://www.mathworks.com/help/deeplearning/ref/activations.html?searchHighlight=activations&s_tid=doc_srchtitle
I swung between MATLAB and Python (Keras) for deep learning for a couple of weeks and eventually chose the latter, although I am a long-term, loyal MATLAB user and a rookie in Python. My two cents: the former has too many restrictions regarding deep learning.
Good luck.:-)
If by 'simulation' you mean prediction/inference, simply use activations(...) to dump the outputs of any intermediate (hidden) layer, as I mentioned earlier, so that you can check them.
Another way is to construct an identical network with the encoding part only, copy your trained parameters into it, and feed it your simulated signals.
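The "encoding part only" idea can be sketched outside MATLAB in a few lines of plain Python. This is only an illustration of the principle: the weights below are made-up stand-ins for trained parameters, and the single logistic layer per side is an assumption, not what trainAutoencoder necessarily produces.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(x, weights, biases):
    """One fully connected layer with a logistic activation."""
    return [sigmoid(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]


# Hypothetical "trained" parameters: 3 inputs -> 2 latent units -> 3 outputs.
W_enc = [[0.5, -0.2, 0.1], [0.3, 0.8, -0.5]]
b_enc = [0.0, 0.1]
W_dec = [[0.4, 0.6], [-0.3, 0.2], [0.7, -0.1]]
b_dec = [0.0, 0.0, 0.0]


def encode(x):
    # The encoder is just the first half of the full network,
    # reusing the same trained weights.
    return dense(x, W_enc, b_enc)


def decode(z):
    return dense(z, W_dec, b_dec)


x = [1.0, 0.0, 0.5]
latent = encode(x)      # the "latent" features
recon = decode(latent)  # the full network is decode(encode(x))
print(len(latent), len(recon))  # 2 3
```

Whatever the framework, the latent features are the activations of the bottleneck layer, so an encoder-only copy and an "activations at layer k" call should give the same numbers.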

Running DBSCAN in ELKI

I am trying to cluster some geospatial data, and I previously tried the WEKA library.
I found this benchmarking, and decided to try ELKI.
Despite the advice not to use ELKI as a Java library (which is supposed to be less maintained than the UI), I incorporated it in my application, and I can say that I am quite happy with the results. The structures it uses to store data are far more efficient than the ones used by Weka, and the option of using a spatial index is definitely a plus.
However, when I compare the results of Weka's DBSCAN with those of ELKI's DBSCAN, I am a bit puzzled. I would accept that different implementations can produce slightly different results, but this magnitude of difference makes me think there is something wrong with the algorithm (probably with my code). The number of clusters and their geometry are very different between the two implementations.
For the record, I am using the latest version of ELKI (0.6.0), and the parameters I used for my simulations were:
minpts=50
epsilon=0.008
I coded two DBSCAN functions (for Weka and ELKI), where the "entry point" is a csv with points, and the "output" for both of them is also identical: a function that calculates the concave hull of a set of points (one for each cluster). Since the function that reads the csv file into an ELKI "database" is relatively simple, I think my problem could be:
a) in the parametrization of the algorithm;
b) reading the results (most likely).
Parametrizing DBSCAN does not pose any challenges, and I use the two compulsory parameters, which I previously tested through the UI:
ListParameterization params2 = new ListParameterization();
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.MINPTS_ID, minPoints);
params2.addParameter(de.lmu.ifi.dbs.elki.algorithm.clustering.DBSCAN.Parameterizer.EPSILON_ID, epsilon);
Reading the result is a bit more challenging, as I don't completely understand the organization of the structure that stores the clusters. My idea is to iterate over each cluster, get the list of points, and pass it to the function that calculates the concave hull, in order to generate a polygon.
ArrayList<Clustering<?>> cs = ResultUtil.filterResults(result, Clustering.class);
for (Clustering<?> c : cs) {
    System.out.println("clusters: " + c.getAllClusters().size());
    for (de.lmu.ifi.dbs.elki.data.Cluster<?> cluster : c.getAllClusters()) {
        if (!cluster.isNoise()) {
            // Collect the coordinates of every point in this cluster
            Coordinate[] ptList = new Coordinate[cluster.size()];
            int ct = 0;
            for (DBIDIter iter = cluster.getIDs().iter(); iter.valid(); iter.advance()) {
                ptList[ct] = dataMap.get(DBIDUtil.toString(iter));
                ++ct;
            }
            //there are no "empty" clusters
            assertTrue(ptList.length > 0);
            GeoPolygon poly = getBoundaryFromCoordinates(ptList);
            // Compare strings with equals(); == only compares references
            if ("Polygon".equals(poly.getCoordinates().getGeometryType())) {
                try {
                    out.write(poly.coordinates.toText() + "\n");
                } catch (IOException e) {
                    // TODO Auto-generated catch block
                    e.printStackTrace();
                }
            } else {
                System.out.println(poly.getCoordinates().getGeometryType());
            }
        }//!noise
    }
}
I noticed that the "noise" was coming up as a cluster, so I ignored this cluster (I don't want to draw it).
I am not sure if this is the right way of reading the clusters, as I can't find many examples. I also have some questions for which I have not found answers yet:
What is the difference between getAllClusters() and getTopLevelClusters()?
Are the DBSCAN clusters "nested", i.e.: can we have points that belong to many clusters at the same time? Why?
I read somewhere that we should not use the database IDs to identify the points, as they are for ELKI's internal use, but what other way there is to get the list of points in each cluster? I read that you can use a relation for the labels, but I am not sure how to actually implement this...
Any comments that could point me in the right direction, or any code suggestions to iterate over the result set of ELKI's DBSCAN would be really welcome! I also used ELKI's OPTICSxi in my code, and I have even more questions regarding those results, but I guess I'll save that for another post.
This is mostly a follow-up to @Anony-Mousse, who gave a pretty complete answer.
getTopLevelClusters() and getAllClusters() do the same for DBSCAN, as DBSCAN does not produce hierarchical clusters.
DBSCAN clusters are disjoint. Treating clusters with isNoise()==true as singleton objects is likely the best way to handle noise. Clusters returned by our OPTICSXi implementation are also disjoint, but you should consider the members of all child clusters to be part of the outer cluster. For convex hulls, an efficient approach is to first compute the convex hull of the child clusters; then, for the parent, compute the convex hull of the additional objects plus the convex hull points of all children.
The RangeDBIDs approach mentioned by @Anony-Mousse is pretty clean for static databases. A clean approach that also works with dynamic databases is to have an additional relation that identifies the objects. When using a CSV file as input, instead of relying on the line numbering to be consistent, you would just add a non-numeric column containing labels, e.g. object123. This is the best approach from a logical point of view - if you want to be able to identify objects, give them a unique identifier. ;-)
We use ELKI for teaching, and we have verified its DBSCAN algorithm very carefully (you can find a DBSCAN step by step demonstration here, and ELKI results exactly match this). The DBSCAN and OPTICS code in Weka was contributed by a student a long time ago, and has never been verified to the same extent. From a quick check, Weka does not produce the correct results on our class exercise data set.
Because the exercise data set has the same extent of 10 in each dimension, we can adjust the epsilon parameter by 1/10, and then the Weka result seems to match the solution; so @Anony-Mousse's finding appears to be correct: Weka's implementation enforces a [0;1] scaling on the data.
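The effect of such a forced [0;1] scaling on epsilon can be seen with a small sketch. The four points are made up so that both dimensions have an extent of 10, mirroring the exercise data set described above; min_max_scale is a hypothetical helper, not Weka's or ELKI's code.

```python
def min_max_scale(points):
    """Rescale every dimension of `points` to the range [0, 1]."""
    dims = list(zip(*points))
    lows = [min(d) for d in dims]
    spans = [(max(d) - min(d)) or 1.0 for d in dims]
    return [tuple((v - lo) / s for v, lo, s in zip(p, lows, spans))
            for p in points]


def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5


# Made-up data with an extent of 10 in each dimension.
points = [(0.0, 0.0), (10.0, 10.0), (3.0, 4.0), (6.0, 8.0)]
scaled = min_max_scale(points)

d_raw = euclidean(points[2], points[3])     # distance in original units
d_scaled = euclidean(scaled[2], scaled[3])  # distance after [0, 1] scaling

# With the same extent in every dimension, all distances shrink by that
# extent, so epsilon has to shrink by the same factor (here 1/10).
print(d_raw, round(d_raw / d_scaled, 6))
```

If the dimensions have different extents, the scaling distorts distances unevenly and no single corrected epsilon reproduces the unscaled result.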
Accessing the DBIDs of ELKI works, if you pay attention to how they are assigned.
For a static database, getDBIDs() will return a RangeDBIDs object, and it can give you an offset into the database. This is very reliable. But if you always restart your process, the DBIDs will be assigned deterministically anyway (only when using the MiniGUI, they will differ if you rerun a job!)
This will also be more efficient than DBIDUtil.toString.
DBSCAN results are not hierarchical, so every cluster should be a top level cluster.
As for Weka, it sometimes does automatic normalization. Then the epsilon value will be distorted. For geographic data, I would prefer geodetic distance anyway; Euclidean distance on latitude and longitude does not make sense.
Check this part of Weka's code: the "norm" function, used by EuclideanDataObject. This does look to me as if Weka's DBSCAN enforces a normalization on the data set! Try scaling your data to [0:1] (I'm pretty sure there is a filter for this in ELKI) and check whether the results are identical afterwards.
Judging from this code snippet, I would blame Weka. The code above also looks a bit inefficient to me. The filter approach makes IMHO more sense than this enforced filtering in the data objects.
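On the geodetic distance point: a rough illustration of why Euclidean distance on raw latitude/longitude is misleading is the haversine formula below. It treats the Earth as a sphere (an approximation), and haversine_km is a hypothetical helper rather than an ELKI function.

```python
import math

EARTH_RADIUS_KM = 6371.0088  # mean Earth radius; spherical approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two lat/lon points."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlmb = math.radians(lon2 - lon1)
    a = (math.sin(dphi / 2) ** 2
         + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2)
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# One degree of longitude at the equator vs. at 60 degrees north:
# identical in "degree space", roughly a factor of two apart on the ground.
at_equator = haversine_km(0.0, 0.0, 0.0, 1.0)
at_60_north = haversine_km(60.0, 0.0, 60.0, 1.0)
print(round(at_equator, 1), round(at_60_north, 1))
```

An epsilon of 0.008 degrees therefore covers a different ground distance east-west than north-south, and the mismatch grows with latitude.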

framework for distributed algorithm

I have to do a project where I have a dynamic graph and each node executes my algorithm to calculate the PageRank.
My question is: Is there a framework that allows me to run an algorithm at the same time on each node (the algorithm is not centralized)?
Yes, Giraph is probably the most common example of this and can do exactly what you are looking for. However, it isn't trivial to set up; there is a question from yesterday on SO about materials for Giraph: https://stackoverflow.com/questions/22817423/material-related-to-giraph/
Other examples would be GraphX (http://amplab.github.io/graphx/) from Spark and GraphLab (http://graphlab.org/projects/index.html), but I don't have any experience with those. However, all of those frameworks let you write code for a single node and execute it for each node in a graph. They also allow you to distribute the algorithm across multiple servers for large graphs, but that isn't necessary if your graph is small enough.
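To give a feel for the "write code for a node, run it on every node" model, here is a single-machine simulation in Python. It is only a sketch of the synchronous, vertex-centric style; it does not use the Giraph, GraphX, or GraphLab APIs, and the function name and toy graph are made up.

```python
def pagerank_supersteps(out_links, damping=0.85, supersteps=50):
    """Simulate synchronous, vertex-centric PageRank on one machine.

    out_links maps each vertex to the list of vertices it links to.
    """
    n = len(out_links)
    rank = {v: 1.0 / n for v in out_links}
    for _ in range(supersteps):
        inbox = {v: 0.0 for v in out_links}
        dangling = 0.0
        # "Compute" phase: each vertex only uses its own rank and out-edges
        # and sends a message (its rank share) to every neighbour.
        for v, targets in out_links.items():
            if targets:
                share = rank[v] / len(targets)
                for t in targets:
                    inbox[t] += share
            else:
                dangling += rank[v]  # rank of sink vertices is spread evenly
        # Barrier: every vertex updates itself from the messages it received.
        rank = {v: (1 - damping) / n + damping * (inbox[v] + dangling / n)
                for v in out_links}
    return rank


# Toy graph (hypothetical): vertex "d" has no incoming links.
graph = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank_supersteps(graph)
print(max(ranks, key=ranks.get))  # c
```

In a real framework the body of the inner loop is what you would write as the per-vertex function; the framework handles message delivery, the barrier between supersteps, and the distribution across machines.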

Doubts about clustering methods for tweets

I'm fairly new to clustering and related topics so please forgive my questions.
I'm trying to get introduced to this area by doing some tests, and as a first experiment I'd like to create clusters of tweets based on content similarity. The basic idea for the experiment would be storing tweets in a database and periodically calculating the clustering (i.e. using a cron job). Please note that the database would obtain new tweets from time to time.
Being ignorant in this field, my idea (probably naive) would be to do something like this:
1. For each new tweet in the db, extract N-grams (N=3 for example) into a set
2. Perform Jaccard similarity and compare with each of the existing clusters. If result > threshold then it would be assigned to that cluster
3. Once finished I'd get M clusters containing similar tweets
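Steps 1 and 2 can be sketched as follows. Character n-grams are an assumption here (word n-grams would work the same way), chosen because they need no language-specific tokenization; the helper names and sample tweets are made up.

```python
def char_ngrams(text, n=3):
    """Set of character n-grams of a lowercased, whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def jaccard(a, b):
    """Jaccard similarity |A & B| / |A | B| of two sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


t1 = "I'm checking in at Central Park"
t2 = "I'm checking in at Union Square"
t3 = "My klout score is 55"

g1, g2, g3 = char_ngrams(t1), char_ngrams(t2), char_ngrams(t3)
same_source = jaccard(g1, g2)   # tweets sharing a template
unrelated = jaccard(g1, g3)     # tweets from different templates
print(same_source > unrelated)  # True
```

The threshold in step 2 would be applied to values like same_source; where to put it depends on the data and has to be tuned.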
Now I see some problems with this basic approach. Let's put aside computational cost, how would the comparison between a tweet and a cluster be done? Assuming I have a tweet Tn and a cluster C1 containing T1, T4, T10 which one should I compare it to? Given that we're talking about similarity, it could well happen that sim(Tn,T1) > threshold but sim(Tn,T4) < threshold. My gut feeling tells me that something like an average should be used for the cluster, in order to avoid this problem.
Also, it could happen that sim(Tn, C1) and sim(Tn, C2) are both > threshold but similarity with C1 would be higher. In that case Tn should go to C1. This could be done brute force as well to assign the tweet to the cluster with maximum similarity.
And last of all, it's the computational issue. I've been reading a bit about minhash and it seems to be the answer to this problem, although I need to do some more research on it.
Anyway, my main question would be: could someone with experience in the area recommend which approach I should aim for? I read some mentions of LSA and other methods, but trying to cope with everything is getting a bit overwhelming, so I'd appreciate some guidance.
From what I'm reading a tool for this would be hierarchical clustering, as it would allow regrouping of clusters whenever new data enters. Is this correct?
Please note that I'm not looking for any complicated case. My use case idea would be being able to cluster similar tweets into groups without any previous information. For example, tweets from Foursquare ("I'm checking in ..." which are similar to each other would be one case, or "My klout score is ..."). Also note that I'd like this to be language independent, so I'm not interested in having to deal with specific language issues.
It looks to me like you are trying to address two different problems in one, i.e. "syntactic" and "semantic" clustering. They are quite different problems, especially if you are in the realm of short-text analysis (and Twitter is the king of short-text analysis, of course).
"Syntactic" clustering means aggregating tweets that come, most likely, from the same source. Your example of Foursquare fits perfectly, but it is also common for retweets, people sharing online newspaper articles or blog posts, and many other cases. For this type of problem, using an N-gram model is almost mandatory, as you said (my experience suggests that N=2 is good for tweets, since you can find significant tweets that have as few as 3-4 features). Normalization is also an important factor here: removing the RT tag, mentions, and hashtags might help.
"Semantic" clustering means aggregating tweets that share the same topic. This is a much more difficult problem, and it likely won't work if you try to aggregate a random sample of tweets, because they usually carry too little information. These techniques might work, though, if you restrict your domain to a specific subset of tweets (e.g. those matching a keyword or a hashtag). LSA could be useful here, while it is useless for syntactic clusters.
Based on your observation, I think what you want is syntactic clustering. Your biggest issue, though, is that you need online clustering, not static clustering. The classical clustering algorithms that would work well in the static case (like hierarchical clustering, or union find) aren't really suited to online clustering, unless you redo the clustering from scratch every time a new tweet gets added to your database. "Averaging" the clusters to add new elements isn't a great solution in my experience, because you need to retain all the information of every cluster member to update the "average" every time new data comes in. Also, algorithms like hierarchical clustering and union find work well because they can join pre-existing clusters if a link of similarity is found between them; they don't simply assign a new element to the "closest" cluster, which is what you suggested in your post.
Algorithms like MinHash (or SimHash) are indeed more suited to online clustering, because they support the idea of "querying" for similar documents. MinHash is essentially a way to obtain pairs of documents that exceed a certain threshold of similarity (in particular, MinHash can be considered an estimator of Jaccard similarity) without having to rely on a quadratic algorithm like pairwise comparison (it is, in fact, O(nlog(n)) in time). It is, though, quadratic in space, therefore a memory-only implementation of MinHash is useful for small collections only (say 10000 tweets). In your case, though, it can be useful to save "sketches" (i.e., the set of hashes you obtain by min-hashing a tweet) of your tweets in a database to form an "index", and query the new ones against that index. You can then form a similarity graph, by adding edges between vertices (tweets) that matched the similarity query. The connected components of your graph will be your clusters.
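A compact sketch of that pipeline (sketches, similarity edges, connected components) is shown below. It is illustrative only: the seeded blake2b hashes, the signature length of 64, the 0.5 threshold, and the sample tweets are all assumptions, and the all-pairs comparison stands in for a real index query.

```python
import hashlib


def shingles(text, n=2):
    """Set of character n-grams (N=2, as suggested above for tweets)."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + n] for i in range(len(normalized) - n + 1)}


def minhash_signature(items, num_hashes=64):
    """One minimum per seeded hash function; the list is the 'sketch'."""
    signature = []
    for seed in range(num_hashes):
        prefix = str(seed).encode() + b":"
        signature.append(min(
            int.from_bytes(
                hashlib.blake2b(prefix + item.encode(), digest_size=8).digest(),
                "big")
            for item in items))
    return signature


def estimated_jaccard(sig_a, sig_b):
    """Share of matching positions, an estimator of Jaccard similarity."""
    return sum(1 for x, y in zip(sig_a, sig_b) if x == y) / len(sig_a)


def connected_components(n, edges):
    """Union-find over vertices 0..n-1; returns sorted lists of indices."""
    parent = list(range(n))

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b in edges:
        parent[find(a)] = find(b)
    groups = {}
    for v in range(n):
        groups.setdefault(find(v), []).append(v)
    return sorted(groups.values())


tweets = [
    "I'm checking in at Central Park",
    "I'm checking in at Central Perk",
    "My klout score is 55",
    "My klout score is 56",
]
sketches = [minhash_signature(shingles(t)) for t in tweets]
threshold = 0.5  # assumed similarity threshold
edges = [(i, j)
         for i in range(len(tweets)) for j in range(i + 1, len(tweets))
         if estimated_jaccard(sketches[i], sketches[j]) >= threshold]
clusters = connected_components(len(tweets), edges)
print(clusters)
```

For an online setting the stored sketches play the role of the index: a new tweet is sketched once and compared against the stored signatures, and any match adds an edge that may merge two existing components.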
This sounds a lot like canopy pre-clustering to me.
Essentially, each cluster is represented by the first object that started the cluster.
Objects within the outer radius join the cluster. Objects that are not within the inner radius of at least one cluster start a new cluster. This way, you get an overlapping (non-disjoint!) quantization of your dataset. Since this can drastically reduce the data size, it can be used to speed up various algorithms.
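A single-pass version of this idea, written as a sketch: the function name, the 1-D toy data, and the two radii are made up, and the distance is passed in so that something like 1 - Jaccard could be used for tweets.

```python
def canopy_clusters(points, distance, inner, outer):
    """Single-pass canopy pre-clustering (requires inner <= outer).

    Returns a list of (center, members) pairs; canopies may overlap.
    """
    if inner > outer:
        raise ValueError("inner radius must not exceed outer radius")
    canopies = []
    for p in points:
        covered = False
        for center, members in canopies:
            d = distance(p, center)
            if d <= outer:
                members.append(p)   # joins every canopy within the outer radius
            if d <= inner:
                covered = True      # close enough that no new canopy is needed
        if not covered:
            canopies.append((p, [p]))  # p becomes the representative
    return canopies


# Toy 1-D data with absolute difference as the distance (hypothetical).
data = [0.0, 0.5, 1.5, 3.0, 3.2]
result = canopy_clusters(data, lambda a, b: abs(a - b), inner=1.0, outer=2.0)
for center, members in result:
    print(center, members)
# 0.0 [0.0, 0.5, 1.5]
# 1.5 [1.5, 3.0, 3.2]
# 3.0 [3.0, 3.2]
```

Note how 1.5, 3.0 and 3.2 each appear in two canopies: that is the overlapping, non-disjoint quantization mentioned above.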
However, don't expect useful results from clustering tweets. Tweet data is just too noisy. Most tweets have just a few words, too few to define a good similarity. On the other hand, you have the various retweets that are near duplicates - but those are trivial to detect.
So what would be a good cluster of tweets? Can this n-gram similarity actually capture this?

Non-linear regression models in PostgreSQL using R

Background
I have climate data (temperature, precipitation, snow depth) for all of Canada between 1900 and 2009. I have written a basic website and the simplest page allows users to choose category and city. They then get back a very simple report (without the parameters and calculations section):
The primary purpose of the web application is to provide a simple user interface so that the general public can explore the data in meaningful ways. (A list of numbers is not meaningful to the general public, nor is a website that provides too many inputs.) The secondary purpose of the application is to provide climatologists and other scientists with deeper ways to view the data. (Using too many inputs, of course.)
Tool Set
The database is PostgreSQL with R (mostly) installed. The reports are written using iReport and generated using JasperReports.
Poor Model Choice
Currently, a linear regression model is applied against annual averages of daily data. The linear regression model is calculated within a PostgreSQL function as follows:
SELECT
regr_slope( amount, year_taken ),
regr_intercept( amount, year_taken ),
corr( amount, year_taken )
FROM
temp_regression
INTO STRICT slope, intercept, correlation;
The results are returned to JasperReports using:
SELECT
year_taken,
amount,
year_taken * slope + intercept,
slope,
intercept,
correlation,
total_measurements
INTO result;
JasperReports calls into PostgreSQL using the following parameterized analysis function:
SELECT
year_taken,
amount,
measurements,
regression_line,
slope,
intercept,
correlation,
total_measurements,
execute_time
FROM
climate.analysis(
$P{CityId},
$P{Elevation1},
$P{Elevation2},
$P{Radius},
$P{CategoryId},
$P{Year1},
$P{Year2}
)
ORDER BY year_taken
This is not an optimal solution because it gives the false impression that the climate is changing at a slow, but steady rate.
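For reference, the three aggregates used above correspond to an ordinary least-squares fit of amount (Y) on year_taken (X) plus the Pearson correlation. A plain Python sketch of the same computation, with made-up sample values and a hypothetical linear_fit helper:

```python
def linear_fit(xs, ys):
    """Least-squares slope and intercept of y on x, plus Pearson correlation."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    correlation = sxy / (sxx * syy) ** 0.5
    return slope, intercept, correlation


# Hypothetical annual averages rising by exactly 0.5 per year.
years = [2000, 2001, 2002, 2003, 2004]
amounts = [10.0, 10.5, 11.0, 11.5, 12.0]

slope, intercept, correlation = linear_fit(years, amounts)
print(slope, intercept, correlation)
```

A single slope and intercept are fitted over the whole period, which is why the regression line cannot show any change in the rate of warming or cooling.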
Questions
Using functions that take two parameters (e.g., year [X] and amount [Y]), such as PostgreSQL's regr_slope:
What is a better regression model to apply?
What CRAN (R) packages provide such models? (Installable, ideally, using apt-get.)
How can the R functions be called within a PostgreSQL function?
If no such functions exist:
What parameters should I try to obtain for functions that will produce the desired fit?
How would you recommend showing the best fit curve?
Keep in mind that this is a web app for use by the general public. If the only way to analyse the data is from an R shell, then the purpose has been defeated. (I know this is not the case for most R functions I have looked at so far.)
Thank you!
The awesome pl/r package allows you to run R inside PostgreSQL as a procedural language. There are some gotchas because R likes to think about data in terms of vectors, which is not what an RDBMS does. It is still a very useful package, as it gives you R inside of PostgreSQL, saving you some of the roundtrips in your architecture.
And pl/r is apt-get-able for you as it has been part of Debian / Ubuntu for a while. Start with apt-cache show postgresql-8.4-plr (that is on testing, other versions/flavours have it too).
As for the appropriate modeling: that is a whole different ballgame. loess is a fair suggestion for something non-parametric, and you probably also want some sort of dynamic model, either ARMA/ARIMA or lagged regression. The choice of modeling is pretty critical given how politicized the topic is.
I don't think autoregression is what you want. Non-linear isn't what you want either, because that implies discontinuous data. You have continuous data; it just may not be a straight line. If you're just visualizing, and especially if you don't know what the shape is supposed to be, then loess is what you want.
It's easy to also get a confidence interval band around the line if you just plot the data with ggplot2.
qplot(x, y, data = df, geom = 'point') + stat_smooth()
That will make a nice plot.
If you want a simpler graph in base R:
plot(x, y)
lines(loess.smooth(x,y))
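To show what a loess-style smoother does under the hood, here is a simplified sketch in Python: a local linear fit with tricube weights at each x. It is not a reimplementation of R's loess or loess.smooth (no robustness iterations, and the span handling is an assumption), and the data are made up.

```python
def loess(xs, ys, span=0.5):
    """Local linear regression with tricube weights, evaluated at each x."""
    n = len(xs)
    k = max(2, int(round(span * n)))  # number of neighbours in each local fit
    fitted = []
    for x0 in xs:
        # Bandwidth: distance from x0 to its k-th nearest point.
        h = sorted(abs(x - x0) for x in xs)[k - 1] or 1e-12
        w = [(1 - min(abs(x - x0) / h, 1.0) ** 3) ** 3 for x in xs]
        sw = sum(w)
        mx = sum(wi * x for wi, x in zip(w, xs)) / sw
        my = sum(wi * y for wi, y in zip(w, ys)) / sw
        sxx = sum(wi * (x - mx) ** 2 for wi, x in zip(w, xs))
        sxy = sum(wi * (x - mx) * (y - my) for wi, x, y in zip(w, xs, ys))
        slope = sxy / sxx if sxx > 0 else 0.0
        fitted.append(my + slope * (x0 - mx))
    return fitted


# Hypothetical data: a bend around x = 5 that one straight line would hide.
x = list(range(10))
y = [1.0, 1.2, 0.9, 1.1, 1.0, 1.4, 2.1, 2.9, 4.2, 5.0]
smooth = loess(x, y, span=0.5)
print([round(v, 2) for v in smooth])
```

Because every fitted value comes from a separate weighted regression on nearby points, the curve can stay flat in the early years and bend upwards later, which a single global slope cannot do.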
May I propose a different solution? Just use PostgreSQL to pull the data, feed it into some R script and finally show the results. The R script may be as complicated as you want as long as the user doesn't have to deal with it.
You may want to have a look at rapache, an Apache module that allows running R scripts in a webpage.
A couple of videos illustrating its use:
Hello world application
Jeffrey Horner's presentation of RApache + links to working apps
In particular, check how the San Francisco Estuary Institute Web Query Tool allows the user to interact with the parameters.
As for the regression, I'm not an expert, so I may be saying something extremely stupid... but wouldn't something like a LOESS regression be OK for this?