Modularity across sparsity - networkx

I am clear on Figures 3 A-G of "Altered functional and structural brain network organization in autism", but have been stumbling on Fig. 3H.
I computed modularity with NetworkX, but neither group's modularity exceeds 0.3, nor does ttest2 return 0!
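For reference, a minimal sketch of how one might compute modularity at a given sparsity threshold with NetworkX. The thresholding step, variable names, and the use of greedy modularity maximization are my own assumptions here, not necessarily the paper's exact pipeline:

    import numpy as np
    import networkx as nx
    from networkx.algorithms import community

    def modularity_at_sparsity(corr, sparsity):
        """Keep the strongest `sparsity` fraction of edges of a
        correlation matrix, then compute modularity."""
        n = corr.shape[0]
        upper = corr[np.triu_indices(n, k=1)]
        cutoff = np.quantile(upper, 1.0 - sparsity)  # e.g. sparsity=0.1 keeps top 10%
        adj = (corr >= cutoff).astype(int)
        np.fill_diagonal(adj, 0)
        G = nx.from_numpy_array(adj)
        parts = community.greedy_modularity_communities(G)
        return community.modularity(G, parts)

Comparing the two groups' modularity curves would then be a two-sample t-test per sparsity level, e.g. scipy.stats.ttest_ind (the Python analogue of MATLAB's ttest2).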

Related

Using neural networks (MLP) for estimation

I'm new to NNs and I have this problem:
I have a dataset with 300 rows and 33 columns. Each row also has 3 more columns with the results.
I'm trying to use an MLP to train a model so that when I have a new row, it estimates those 3 result columns.
I can easily reduce the training error to 0.001, but when I use cross-validation it keeps estimating very poorly.
It estimates correctly if I use the same entries it was trained on, but for values that weren't used in training the results are very wrong.
I'm using two hidden layers with 20 neurons each, so my architecture is [33 20 20 3].
For the activation function I'm using the bipolar sigmoid.
Do you have any suggestions on what I could change to improve this?
Overfitting
As mentioned in the comments, this perfectly describes overfitting.
I strongly suggest reading the Wikipedia article on overfitting, as it describes the causes well, but I'll summarize some key points here.
Model complexity
Overfitting often happens when your model is needlessly complex for the problem. I don't know anything about your dataset, but I'm guessing [33 20 20 3] is more parameters than necessary for this prediction.
Try running your cross-validation again, this time with either fewer layers or fewer nodes per layer. Right now you are using 33*20 + 20*20 + 20*3 = 1120 parameters (weights) to make your prediction; is that necessary?
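As a hedged illustration of that experiment (using scikit-learn's MLPRegressor rather than whatever framework you have, with placeholder data and hyperparameters):

    import numpy as np
    from sklearn.neural_network import MLPRegressor
    from sklearn.model_selection import cross_val_score

    X = np.random.rand(300, 33)   # stand-in for your 300x33 features
    y = np.random.rand(300, 3)    # stand-in for your 3 result columns

    # compare cross-validated R^2 for smaller and larger architectures
    for hidden in [(5,), (10,), (20, 20)]:
        mlp = MLPRegressor(hidden_layer_sizes=hidden, max_iter=2000,
                           random_state=0)
        print(hidden, cross_val_score(mlp, X, y, cv=5).mean())

If the smaller nets cross-validate no worse than [33 20 20 3], the extra capacity was only being used to memorize the training set.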
Regularization
A common solution to overfitting is regularization. The driving principle is KISS (keep it simple, stupid).
By applying an L1 regularizer to your weights, you express a preference for using the smallest number of weights to solve your problem. The network will pull many weights to 0, as they aren't needed.
By applying an L2 regularizer to your weights, you express a preference for lower-rank solutions to your problem. This means that your network will prefer weight matrices that span lower dimensions. Practically, this means your weights will be smaller numbers and are less likely to be able to "memorize" the data.
What are L1 and L2? These are types of vector norms. L1 is the sum of the absolute values of your weights. L2 is the square root of the sum of squares of your weights (L3 is the cube root of the sum of cubes, L4 ...).
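For concreteness, a minimal sketch of what these penalties look like when added to a loss (pure NumPy; note that in practice the "L2" penalty is usually the squared norm, i.e. the plain sum of squares):

    import numpy as np

    def l1_penalty(weights, lam=0.01):
        # sum of absolute values: pushes many weights exactly to 0
        return lam * np.sum(np.abs(weights))

    def l2_penalty(weights, lam=0.01):
        # (squared) L2: shrinks all weights toward small values
        return lam * np.sum(weights ** 2)

    # total loss = data term + penalties over all weight matrices, e.g.
    # loss = mse(y_pred, y_true) + l2_penalty(W1) + l2_penalty(W2)

The regularization strength lam is a placeholder; it is another hyperparameter you would tune by cross-validation.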
Distortions
Another commonly used technique is to augment your training data with distorted versions of your training samples. This only makes sense with certain types of data. For instance, images can be rotated, scaled, shifted, or have Gaussian noise added, etc. without dramatically changing the content of the image.
By adding distortions, your network will no longer memorize your data, but will also learn when things look similar to your data. The number 1 rotated 2 degrees still looks like a 1, so the network should be able to learn from both of these.
Only you know your data. If this is something that can be done with your data (even just adding a little gaussian noise to each feature), then maybe this is worth looking into. But do not use this blindly without considering the implications it may have on your dataset.
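If Gaussian-noise augmentation does make sense for your features, a minimal sketch (the noise scale and number of copies are assumptions you would need to tune):

    import numpy as np

    def augment_with_noise(X, y, copies=3, sigma=0.01):
        """Append `copies` noisy duplicates of each sample;
        sigma is relative to each feature's standard deviation."""
        scale = sigma * X.std(axis=0)
        X_aug = [X] + [X + np.random.normal(0.0, scale, X.shape)
                       for _ in range(copies)]
        y_aug = [y] * (copies + 1)
        return np.vstack(X_aug), np.vstack(y_aug)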
Careful analysis of data
I put this last because it is an indirect response to the overfitting problem. Check your data before pumping it through a black-box algorithm (like a neural network). Here are a few questions worth answering if your network doesn't work (a quick sketch of these checks follows the list):
Are any of my features strongly correlated with each other?
How do baseline algorithms perform? (Linear regression, logistic regression, etc.)
How are my training samples distributed among classes? Do I have 298 samples of one class and 1 sample of the other two?
How similar are my samples within a class? Maybe I have 100 samples for this class, but all of them are the same (or nearly the same).
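A quick, hedged sketch of those checks with pandas/scikit-learn (the data here is a placeholder for yours):

    import numpy as np
    import pandas as pd
    from sklearn.linear_model import LinearRegression
    from sklearn.model_selection import cross_val_score

    X = pd.DataFrame(np.random.rand(300, 33))  # stand-in features
    y = np.random.rand(300, 3)                 # stand-in targets

    # 1. strongly correlated feature pairs
    corr = X.corr().abs()
    mask = corr.values > 0.9
    np.fill_diagonal(mask, False)
    print("highly correlated pairs:", np.argwhere(mask))

    # 2. baseline: does plain linear regression already do as well?
    print("linear baseline R^2:",
          cross_val_score(LinearRegression(), X, y, cv=5).mean())

    # 3./4. duplicated or near-duplicated samples
    print("exact duplicate rows:", X.duplicated().sum())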

How do you determine how many variables is too many for a CCA?

I am running a CCA of some ecological data with ~50 sites and several hundred species. I know that you have to be careful when your number of explanatory variables approaches your number of samples. I have 23 explanatory variables, so this isn't a problem for me, but I have also heard that using too many explanatory variables can start to "un-constrain" the CCA.
Are there any guidelines about how many explanatory variables are appropriate? So far, I have just plotted them all and then removed the ones that appear to be redundant (leaving me with 8). Can I use the inertia values to help inform/justify this?
Thanks
This is the same question as asking "how many variables are too many for regression analysis?". Not "almost the same", but exactly the same: CCA is an ordination of the fitted values of a linear regression. In the most severe cases you can over-fit. In CCA this is evident when the first eigenvalues of CCA and (unconstrained) CA are almost identical and the ordinations look similar in the first dimensions (you can use Procrustes analysis to check this). An extreme case would be that residual variation disappears, but in ordination you focus on the first dimensions, and there the constraints can get lost much earlier than in later constrained axes or in the residuals.
More importantly: you must see CCA as a kind of regression analysis and take the same attitude to constraints as to explanatory (independent) variables in regression. If you have no prior hypothesis to study, you have all the problems of model selection in regression analysis plus the problems of multivariate ordination, but these are non-technical problems that should be handled somewhere other than Stack Overflow.
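A minimal sketch of the Procrustes check mentioned above, assuming you already have site scores on the first axes of the constrained (CCA) and unconstrained (CA) ordinations as arrays (in R you would more likely use vegan's procrustes/protest directly):

    import numpy as np
    from scipy.spatial import procrustes

    # placeholder arrays; in practice export these site scores
    # from your ordination software
    cca_scores = np.random.rand(50, 2)
    ca_scores = np.random.rand(50, 2)

    mtx1, mtx2, disparity = procrustes(cca_scores, ca_scores)
    # disparity near 0 means the constrained and unconstrained
    # ordinations are nearly identical, i.e. the constraints are
    # not doing much constraining
    print("Procrustes disparity:", disparity)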

Conjugate Gradient with a noisy function

OK, so I'm using RMSprop or SGD to get a neural network to learn its parameters. But after a while, both training and validation errors appear to have stagnated (outside of random fluctuations; I'm using dropout).
So I decided to try conjugate gradient to refine the values. I still obviously don't want it to overfit, so I kept the dropout. But, of course, this makes the objective function noisy. So, I guess my question is: does conjugate gradient (or L-BFGS, etc.) require noiseless functions? Or can they work in the presence of noise?
Thanks!
Gradient-based optimization algorithms are highly sensitive to noise, because the derivative calculations are distorted by the discontinuities the noise introduces across the function domain.
To optimize noisy objective functions it is better to use heuristic algorithms: genetic algorithms, simulated annealing, ant colony optimization, particle swarm optimization... These are not based on gradients and therefore do not share this weakness.
You can read more about these algorithms in the book:
Duc Pham, D. Karaboga, Intelligent Optimisation Techniques. London, United Kingdom: Springer-Verlag London, 2000.
If you are interested in Simulated Annealing, you can also read:
Peter Rossmanith, “Simulated Annealing” in Algorithms Unplugged, Vöcking, B., Alt, H., Dietzfelbinger, M., Reischuk, R., Scheideler, C., Vollmer, H., Wagner, D., Ed. Berlin, Germany: Springer-Verlag Berlin Heidelberg, 2011, ch. 41, pp. 393-400.
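As a hedged illustration, SciPy ships a simulated-annealing-style optimizer, dual_annealing, that needs only function evaluations; the objective below is a stand-in for a noisy loss:

    import numpy as np
    from scipy.optimize import dual_annealing

    def noisy_objective(x):
        # stand-in for a noisy loss, e.g. a dropout-perturbed network error
        return np.sum(x ** 2) + np.random.normal(0.0, 0.1)

    bounds = [(-5.0, 5.0)] * 4   # one (low, high) pair per parameter
    result = dual_annealing(noisy_objective, bounds, seed=0)
    print(result.x, result.fun)

Be aware that these heuristics scale poorly to the parameter counts of real neural networks; a cheaper alternative in your case may be to average the noisy loss/gradient over several dropout masks before each step.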

Neural Network learning rate and batch weight update

I have programmed a Neural Network in Java and am now working on the back-propagation algorithm.
I've read that batch updates of the weights give a more stable gradient search than online weight updates.
As a test I've created a time series function of 100 points, such that x = [0..99] and y = f(x). I've created a Neural Network with one input and one output and 2 hidden layers with 10 neurons for testing. What I am struggling with is the learning rate of the back-propagation algorithm when tackling this problem.
I have 100 input points so when I calculate the weight change dw_{ij} for each node it is actually a sum:
dw_{ij} = dw_{ij,1} + dw_{ij,2} + ... + dw_{ij,p}
where p = 100 in this case.
Now the weight updates become really huge, and therefore my error E bounces around so much that it is hard to find a minimum. The only way I got proper behaviour was when I set the learning rate η to something like 0.7 / p^2.
Is there some general rule for setting the learning rate, based on the amount of samples?
http://francky.me/faqai.php#otherFAQs :
Subject: What learning rate should be used for backprop?
In standard backprop, too low a learning rate makes the network learn very slowly. Too high a learning rate makes the weights and objective function diverge, so there is no learning at all. If the objective function is quadratic, as in linear models, good learning rates can be computed from the Hessian matrix (Bertsekas and Tsitsiklis, 1996). If the objective function has many local and global optima, as in typical feedforward NNs with hidden units, the optimal learning rate often changes dramatically during the training process, since the Hessian also changes dramatically. Trying to train a NN using a constant learning rate is usually a tedious process requiring much trial and error. For some examples of how the choice of learning rate and momentum interact with numerical condition in some very simple networks, see ftp://ftp.sas.com/pub/neural/illcond/illcond.html
With batch training, there is no need to use a constant learning rate. In fact, there is no reason to use standard backprop at all, since vastly more efficient, reliable, and convenient batch training algorithms exist (see Quickprop and RPROP under "What is backprop?" and the numerous training algorithms mentioned under "What are conjugate gradients, Levenberg-Marquardt, etc.?").
Many other variants of backprop have been invented. Most suffer from the same theoretical flaw as standard backprop: the magnitude of the change in the weights (the step size) should NOT be a function of the magnitude of the gradient. In some regions of the weight space, the gradient is small and you need a large step size; this happens when you initialize a network with small random weights. In other regions of the weight space, the gradient is small and you need a small step size; this happens when you are close to a local minimum. Likewise, a large gradient may call for either a small step or a large step. Many algorithms try to adapt the learning rate, but any algorithm that multiplies the learning rate by the gradient to compute the change in the weights is likely to produce erratic behavior when the gradient changes abruptly. The great advantage of Quickprop and RPROP is that they do not have this excessive dependence on the magnitude of the gradient. Conventional optimization algorithms use not only the gradient but also second-order derivatives or a line search (or some combination thereof) to obtain a good step size.
With incremental training, it is much more difficult to concoct an algorithm that automatically adjusts the learning rate during training. Various proposals have appeared in the NN literature, but most of them don't work. Problems with some of these proposals are illustrated by Darken and Moody (1992), who unfortunately do not offer a solution. Some promising results are provided by LeCun, Simard, and Pearlmutter (1993), and by Orr and Leen (1997), who adapt the momentum rather than the learning rate. There is also a variant of stochastic approximation called "iterate averaging" or "Polyak averaging" (Kushner and Yin 1997), which theoretically provides optimal convergence rates by keeping a running average of the weight values. I have no personal experience with these methods; if you have any solid evidence that these or other methods of automatically setting the learning rate and/or momentum in incremental training actually work in a wide variety of NN applications, please inform the FAQ maintainer (saswss@unx.sas.com).
References:
Bertsekas, D. P. and Tsitsiklis, J. N. (1996), Neuro-Dynamic Programming, Belmont, MA: Athena Scientific, ISBN 1-886529-10-8.
Darken, C. and Moody, J. (1992), "Towards faster stochastic gradient search," in Moody, J.E., Hanson, S.J., and Lippmann, R.P. (eds.), Advances in Neural Information Processing Systems 4, San Mateo, CA: Morgan Kaufmann, pp. 1009-1016.
Kushner, H.J. and Yin, G. (1997), Stochastic Approximation Algorithms and Applications, NY: Springer-Verlag.
LeCun, Y., Simard, P.Y., and Pearlmutter, B. (1993), "Automatic learning rate maximization by online estimation of the Hessian's eigenvectors," in Hanson, S.J., Cowan, J.D., and Giles, C.L. (eds.), Advances in Neural Information Processing Systems 5, San Mateo, CA: Morgan Kaufmann, pp. 156-163.
Orr, G.B. and Leen, T.K. (1997), "Using curvature information for fast stochastic search," in Mozer, M.C., Jordan, M.I., and Petsche, T. (eds.), Advances in Neural Information Processing Systems 9, Cambridge, MA: The MIT Press, pp. 606-612.
Credits:
Archive-name: ai-faq/neural-nets/part1
Last-modified: 2002-05-17
URL: ftp://ftp.sas.com/pub/neural/FAQ.html
Maintainer: saswss@unx.sas.com (Warren S. Sarle)
Copyright 1997, 1998, 1999, 2000, 2001, 2002 by Warren S. Sarle, Cary, NC, USA.
A simple solution would be to average the weight updates over a batch instead of summing them. This way you can just use a learning rate of 0.7 (or any other value of your liking), without having to worry about optimizing yet another parameter.
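A minimal sketch of the difference (pure NumPy; the shapes are illustrative):

    import numpy as np

    p = 100      # number of training samples in the batch
    eta = 0.7    # learning rate
    # per-sample weight gradients for one layer; placeholder values
    grads = np.random.randn(p, 10, 10)

    dw_sum = eta * grads.sum(axis=0)    # step grows with batch size p ...
    dw_avg = eta * grads.mean(axis=0)   # ... averaging keeps it independent of p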
More interesting information about batch updating and learning rates can be found in this article by Wilson (2003).
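Since the FAQ quoted above recommends RPROP, here is a hedged sketch of its sign-based update (a simplified variant in the style of iRPROP-; the constants are the commonly published defaults, but treat this as illustrative rather than a reference implementation):

    import numpy as np

    def rprop_step(w, grad, prev_grad, step, eta_plus=1.2, eta_minus=0.5,
                   step_min=1e-6, step_max=50.0):
        # step sizes depend only on the SIGN of the gradient, not its magnitude
        sign_change = grad * prev_grad
        step = np.where(sign_change > 0,
                        np.minimum(step * eta_plus, step_max), step)
        step = np.where(sign_change < 0,
                        np.maximum(step * eta_minus, step_min), step)
        grad = np.where(sign_change < 0, 0.0, grad)  # skip update after a sign flip
        w = w - np.sign(grad) * step
        return w, grad, step

This sidesteps the learning-rate question entirely, which is exactly why the FAQ recommends it for batch training.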

What's usual success rate for neural network models?

I am building a system with a NN trained for classification.
I am interested in the error rates of the systems you have built.
A classic example from the UCI ML repository is the Iris data set. A NN trained on it is almost perfect, with an error rate of 0-1%; however, it is a very basic dataset.
My network has the following structure: 80 in, 208 hidden, 2 out.
My result is an 8% error rate on the testing dataset.
Basically, in this question I want to ask about various research results you have encountered in your work, papers, etc.
Addition 1:
The error rate is of course on testing data, not training, so it is a completely new dataset for the network.
Addition 2 (from my comment under the question):
My new results: 1200 entries, 900 training, 300 testing. 85 in Class1, 1115 in Class2. Of the 85, 44 are in the testing set. Error rate: 6%. That is not so bad, because 44 is ~15% of 300, so I am about 2.5 times better.
Model performance is completely problem-specific. Even among situations with similar quality and volumes of development data, with identical target variable definitions, performance can vary substantially. Obviously, the more similar the problem definitions, the more likely the performance of different models are to match.
Another thing to consider is the difference between technical performance and business performance. In some applications, an accuracy of 52% is tremendously profitable, whereas in other areas an accuracy of 98% would be hopelessly low.
Let me also add that besides what Predictor mentions, measuring your performance on the training set is usually useless as a guide to determine how your classifier would perform on previously unseen data. Many times with relatively simple classifiers you can get 0% error rate on the training set without learning anything useful (this is called overfitting).
What is more commonly used (and more helpful in determining how your classifier works) is either held-out data or cross-validation; better still, separate your data into three sets: training, validation and testing.
Also, it is very hard to get a sense of how well a classifier works from one threshold, reporting only true positives + true negatives. People tend to also evaluate false positives and false negatives, and plot ROC curves to see/evaluate the tradeoff. So, when saying "2.5 times better", you should be clear that you're comparing to a classifier that classifies everything as C2, which is a pretty crappy baseline.
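A hedged sketch of that kind of evaluation with scikit-learn (the data and classifier are placeholders matching the rough class balance described above):

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.metrics import confusion_matrix, roc_auc_score
    from sklearn.model_selection import train_test_split

    # placeholder imbalanced data: ~85 of 1200 samples in Class1
    X = np.random.rand(1200, 80)
    y = (np.random.rand(1200) < 85 / 1200).astype(int)

    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.25,
                                              stratify=y, random_state=0)
    clf = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)

    scores = clf.predict_proba(X_te)[:, 1]
    preds = (scores > 0.5).astype(int)
    print(confusion_matrix(y_te, preds))            # counts at one threshold
    print("ROC AUC:", roc_auc_score(y_te, scores))  # threshold-free summary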
See for example this paper:
Danilo P. Mandic and Jonathon A. Chambers (2000), "Towards the Optimal Learning Rate for Backpropagation," Neural Processing Letters 11: 1-5.