How does MALLET set its default hyperparameters for LDA, i.e. alpha and beta?

I have a question about MALLET topic modelling: how does it set its default hyperparameters for LDA, i.e. alpha and beta?

The default for alpha is 5.0 divided by the number of topics. You can think of this as five "pseudo-words" of weight on the uniform distribution over topics. If the document is short, we expect to stay closer to the uniform prior. If the document is long, we would feel more confident moving away from the prior.
With hyperparameter optimization, the alpha value for each topic can be different. They usually become smaller than the default setting.
The default value for beta is 0.01. This means that each topic has a weight on the uniform prior equal to the size of the vocabulary divided by 100. This seems to be a good value. With optimization turned on, the value rarely changes by more than a factor of two.
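To make the arithmetic concrete, here is a minimal Python sketch of those defaults. The topic and vocabulary counts are made up; only the constants 5.0 and 0.01 come from the defaults described above.

```python
# Hypothetical corpus sizes, just to show the arithmetic of the defaults.
num_topics = 20
vocab_size = 50_000

alpha_sum = 5.0                    # total "pseudo-word" weight on the uniform topic prior
alpha = alpha_sum / num_topics     # symmetric per-topic alpha: 0.25 for 20 topics
beta = 0.01                        # per-word weight on the uniform word prior
beta_sum = beta * vocab_size       # total per-topic weight: vocab_size / 100 = 500 here

print(alpha, beta_sum)             # 0.25 500.0
```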

Related

Are bounded distributions automatically adjusted to ensure the area under the curve is still equal to 1.0?

For instance, if I'm simulating demand, but the mean is close to zero and the SD is large enough that the normal distribution includes negative values, then negative demand outcomes are possible. So we use a bounded normal with a minimum of zero to prevent that from occurring. However, the probabilities of the remaining possible demand values no longer sum to 1.0, so the curve should be raised up the y-axis just a bit. This is more of a theoretical question, because in practice I can't imagine it making much of a difference. After all, each demand outcome's probability would simply be increased equally (by an amount equal to the area under the curve below zero divided by the number of remaining possible demand outcomes), making this mostly a moot point.
Does anyone know if AnyLogic automatically adjusts bounded distributions for this? Thanks.
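I can't speak to AnyLogic's internals, but for reference, a truncated normal in the textbook sense does re-normalize: the density below the bound is not just dropped, the remaining density is rescaled (multiplicatively, by the probability mass kept) so it still integrates to 1. A small scipy sketch with hypothetical demand parameters:

```python
import numpy as np
from scipy import stats

mu, sigma = 2.0, 5.0                    # mean close to zero relative to the SD
a = (0.0 - mu) / sigma                  # lower bound of 0, expressed in SD units
b = np.inf                              # no upper bound
trunc = stats.truncnorm(a, b, loc=mu, scale=sigma)

# The truncated distribution still has total probability 1 ...
print(trunc.cdf(np.inf))                # 1.0

# ... and its density at any x >= 0 is the original normal density divided by
# P(X >= 0), i.e. every remaining outcome is scaled up by the same factor.
x = 3.0
keep_mass = 1.0 - stats.norm(mu, sigma).cdf(0.0)
print(stats.norm(mu, sigma).pdf(x) / keep_mass, trunc.pdf(x))   # the two agree
```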

Neural Networks - Different learning rate for each weight

I have a few questions regarding the theory behind gradient descent in neural networks.
First question: let's say we have 5 weights, one for each of the 5 features, and we want to compute the gradient. How does the algorithm do this internally? Does it take the first weight (W1), try increasing it a bit (or decreasing it), and, once it is done, move on to the 2nd weight? Or does it do it differently and more efficiently, by changing more than one weight simultaneously?
Second question: if feature 1 is far more important than feature 2, so that the same change (in %) of W1 has a bigger effect on the loss than the same change of W2, isn't it better to have a different learning rate for each weight? If we have only one learning rate, we set it by taking into account only the most impactful weight, right?
For question 1:
It just does gradient descent. You don't wiggle the weights independently: you stack your weights into a vector/matrix/tensor W and compute an increment delta_W, which is itself a vector/matrix/tensor (respectively). Once you know this increment, you apply it to all weights at once.
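As a rough illustration of "all weights at once", here is a toy gradient-descent loop on a 5-feature linear model in plain numpy (nothing library-specific; the data are random). The gradient with respect to every weight comes out of one computation, and one update moves all five weights simultaneously:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))                              # 100 samples, 5 features
y = X @ np.array([3.0, -2.0, 0.5, 0.0, 1.0]) + rng.normal(scale=0.1, size=100)

W = np.zeros(5)            # all 5 weights stacked in one vector
lr = 0.1
for _ in range(200):
    err = X @ W - y                     # forward pass
    grad = X.T @ err / len(y)           # gradient of the (half) MSE w.r.t. ALL weights
    W -= lr * grad                      # one update changes every weight at once

print(W)                                # close to [3, -2, 0.5, 0, 1]
```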
For question 2:
There are already many algorithms that tune the learning rate per parameter; see for example RMSprop and Adam. Roughly speaking, those adapt the step size based on how often (and how strongly) each parameter's gradient intervenes.
Regarding the "importance" that you describe:
so the same change (in %) of W1 has a bigger effect on loss compared to W2, isn't it better to have a different learning rate for each weight
You are just describing the gradient! In that case W1 has a larger gradient than W2, and it is already being updated with a larger step, so to speak. It wouldn't make much sense, though, to tune its learning rate independently unless you have more information about its role (e.g. the frequency mentioned above).
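To make the per-parameter idea concrete, here is a rough RMSprop-style sketch (the update rule is the standard one; the toy gradient values are made up). Each weight is divided by a running estimate of its own gradient magnitude, which acts like an individual learning rate:

```python
import numpy as np

def rmsprop_step(W, grad, cache, lr=0.01, decay=0.9, eps=1e-8):
    # Running average of squared gradients, kept separately for each weight
    cache = decay * cache + (1 - decay) * grad**2
    # Effective step size for each weight is lr / sqrt(cache)
    W = W - lr * grad / (np.sqrt(cache) + eps)
    return W, cache

W = np.zeros(5)
cache = np.zeros(5)
grad = np.array([10.0, 0.1, 1.0, 0.01, 5.0])   # placeholder; would come from backprop
W, cache = rmsprop_step(W, grad, cache)
print(W)   # weights with large and small gradients end up taking comparable steps
```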

Multiclass classification or regression?

I am trying to train a CNN model to classify images based on their aesthetic score. There are 200,000 images and every image is rated by more than 100 subjects. The mean score is calculated and the scores are normalized.
The distribution of the scores is approximately Gaussian, so I have decided to build a 10-class classification model after assigning an appropriate weight to each class, since the data is imbalanced.
My question:
For this problem, the scores are continuous, i.e. 0 < 0.2 < 0.3 < 0.4 < 0.5 < ... < 1.
Does that mean this is a regression problem? If so, how do I balance the data for a regression problem, since most of the data points lie between 0.4 and 0.6?
Thanks!
Since your labels are continuous, you could divide them into 10 equal-frequency quantiles using something like pandas.qcut() and assign a label to each bin. This turns the regression problem into a classification problem.
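A minimal sketch of that binning step (the scores here are simulated, just to show the call):

```python
import numpy as np
import pandas as pd

# Hypothetical normalized aesthetic scores in (0, 1), clustered around 0.5
scores = pd.Series(np.random.beta(5, 5, size=200_000))

# Equal-frequency binning: each of the 10 classes gets roughly the same number
# of images, which also side-steps part of the imbalance problem.
labels = pd.qcut(scores, q=10, labels=False)
print(pd.Series(labels).value_counts())
```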
As far as the imbalance is concerned, you may want to try oversampling the minority classes. This helps ensure your model is not biased towards the majority classes.
Hope this helps.
I would recommend doing a histogram equalization over ALL of your participants' ratings first, so that the ratings are distributed equally.
Then, for each image in your training set, calculate the expected value (and, if you also want to, the variance). The expected value is just the mean of the votes. For the variance, there are standard functions in (almost) every programming language that take an array of votes and output the variance.
Now take the expected value (and, if you want, also the variance) as the ground truth for your network.
EDIT: Histogram Equalization:
Histogram equalization is a method to use the given numerical range as efficiently as possible.
In the context of images, this would change the pixel values so that the darkest pixel becomes 0 and the lightest becomes 255. Furthermore, every grayscale value is redistributed so that, on average, each value occurs as often as every other. For your dataset you want the same, even though your values are not from 0 to 255 but from 0 to 10. Also, you don't need to (and shouldn't) round the resulting values to integers. This way, ranges where votes occur often get spread out and ranges where votes are rare get contracted.
Maybe you should first calculate the expected value and then do the histogram equalization over the expected values of all images.
This way the CNN should be able to better differentiate those small differences.
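One way to implement the equalization over continuous scores (a sketch, not necessarily the only way) is a rank / empirical-CDF transform. A small numpy example with made-up per-image mean votes:

```python
import numpy as np

def hist_equalize(values, new_min=0.0, new_max=1.0):
    """Map values through their empirical CDF so the output is roughly uniform
    on [new_min, new_max]; no rounding to integers, as suggested above."""
    ranks = np.argsort(np.argsort(values))        # rank of each value, 0..n-1 (ties broken arbitrarily)
    cdf = ranks / (len(values) - 1)               # empirical CDF in [0, 1]
    return new_min + cdf * (new_max - new_min)

# Hypothetical per-image mean votes, clustered between 0.4 and 0.6
means = np.clip(np.random.normal(0.5, 0.07, size=1000), 0, 1)
targets = hist_equalize(means)                    # spread-out regression targets
print(targets.min(), targets.max(), np.var(targets))
```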

Do I have to use a Scale-Layer after every BatchNorm Layer?

I am using Caffe, specifically pycaffe, to create my neural network. I noticed that I have to use a BatchNorm layer to get a positive result. I am using the kappa score as my evaluation metric.
I have now seen several different placements of the BatchNorm layers in networks. But I also came across the ScaleLayer, which is not in the Layer Catalogue but is often mentioned together with the BatchNorm layer.
Do you always need to put a ScaleLayer after a BatchNorm layer, and what does it do?
From the original batch normalization paper by Ioffe & Szegedy: "we make sure that the transformation inserted in the network can represent the identity transform." Without the Scale layer after the BatchNorm layer, that would not be the case because the Caffe BatchNorm layer has no learnable parameters.
I learned this from the Deep Residual Networks git repo; see item 6 under disclaimers and known issues there.
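For illustration, the usual Conv -> BatchNorm -> Scale -> ReLU block can be written with pycaffe's NetSpec roughly like this (layer names and hyperparameters are my own choices, assuming a standard Caffe build):

```python
import caffe
from caffe import layers as L

n = caffe.NetSpec()
n.data = L.Input(shape=dict(dim=[1, 3, 32, 32]))
n.conv1 = L.Convolution(n.data, num_output=64, kernel_size=3, pad=1, bias_term=False)
n.bn1 = L.BatchNorm(n.conv1, in_place=True)                # normalizes, but learns no gamma/beta
n.scale1 = L.Scale(n.bn1, bias_term=True, in_place=True)   # learnable scale (gamma) and bias (beta)
n.relu1 = L.ReLU(n.scale1, in_place=True)

print(n.to_proto())   # emits the prototxt for this block
```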
In general, you will get no benefit from a Scale layer juxtaposed with batch normalization. Each is a linear transformation. Where BatchNorm shifts and rescales the data so that the new distribution has a mean of 0 and variance of 1, Scale compresses the entire range into a specified interval, typically [0, 1]. Since they're both linear transformations, if you do them in sequence, the second will entirely undo the work of the first.
They also deal somewhat differently with outliers. Consider a set of data: ten values, five each of -1 and +1. BatchNorm will not change this at all: it already has mean 0 and variance 1. For consistency, let's specify the same interval for Scale, [-1, 1], which is also a popular choice.
Now add an outlier of, say, 99 to the mix. Scale will transform the set to the range [-1, 1], so that there are now five -1.00 values, one +1.00 value (the former 99), and five values of -0.96 (formerly +1).
BatchNorm worries about the mean and standard deviation, not the max and min values. The new mean is +9; the SD is 28.48 (rounding everything to 2 decimal places). The numbers will be scaled to roughly five values each of -0.35 and -0.28, and one value of 3.16.
Whether one scaling works better than the other depends much on the skew and scatter of your distribution. I prefer BatchNorm, as it tends to differentiate better in dense regions of a distribution.
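For what it's worth, the outlier example above can be reproduced in a few lines of numpy, comparing min-max rescaling with mean/variance standardization:

```python
import numpy as np

data = np.array([-1.0] * 5 + [1.0] * 5 + [99.0])

# Min-max rescaling to [-1, 1]
minmax = -1 + 2 * (data - data.min()) / (data.max() - data.min())
print(np.round(minmax, 2))                       # five -1.0, five -0.96, one 1.0

# Mean/variance standardization (what BatchNorm does at normalization time)
standardized = (data - data.mean()) / data.std()
print(data.mean(), np.round(data.std(), 2))      # 9.0  28.48
print(np.round(standardized, 2))                 # five -0.35, five -0.28, one 3.16
```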

Do you have to normalize the data for a neural net if it is already scaled?

I'm currently trying to preprocess my training data ready for a multi-layer perceptron. The data I downloaded consists of 20,000 instances and 16 attributes, all of which are coordinate values of pixels used for letter recognition. The data has already been scaled from its original form to values between 0 and 15 before being published.
However, since it's already been scaled, is it still necessary to perform normalization on it? I've tried to read around and look at previous examples, but have come across conflicting points. Some papers state that scaling is a form of normalization, whereas others say that normalization means bringing the values into the range 0-1.
Since I'm using WEKA, I tried their Normalize filter during pre-processing and it caused the accuracy to decrease by around 2%, which makes me think it could be unnecessary. But then again, I've read that it may only have a positive effect later in training.
So my question is:
What is the difference between scaling to a range such as 0-15 and normalizing it? Should I still normalize it on top of the scaling that's already been done?
In your case you do not need to. Normalizing data is done so that an attribute on a different scale does not dominate distance computations and, ultimately, the clustering or classification results.
As an example, suppose you have two attributes, weight and income. Weight will be between roughly 10 and 200 kg at most, while income can range from $10,000 to $20,000,000. But most people's incomes will be between $10,000 and $120,000; values above that are outliers. If you do not normalize your data before using a multi-layer perceptron, the outcome of your neural network will be decided by these outliers.
In your case this situation is already mitigated by the existing scaling, so you do not need additional normalization.
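To illustrate that a 0-15 range versus a 0-1 range is just a linear rescaling, here is a plain numpy sketch with stand-in data (not the actual letter-recognition set, and not the WEKA filter itself):

```python
import numpy as np

# Stand-in for the 0-15 pixel-coordinate features: 20,000 instances, 16 attributes
X = np.random.randint(0, 16, size=(20000, 16)).astype(float)

# "Normalization" to [0, 1] is just another linear rescaling of the same information
X_01 = X / 15.0

# Standardization (zero mean, unit variance per attribute) is linear as well
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

print(X.min(), X.max(), X_01.min(), X_01.max())   # 0.0 15.0 0.0 1.0
```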