Clustering using K-means algorithm for documents [closed] - distance

How do I calculate the distance between two documents? In k-means for numbers you have to calculate the distance between two points. I know that I can use the cosine function.
I want to perform clustering on RSS documents. I have done stemming and removed the stop words from the documents. I have counted the frequency of each word in each document. Now I want to implement the k-means algorithm.

I'm assuming that your difficulty is in creating the feature vectors? Create a feature vector for each document by
Collecting all the words across documents to form one giant vocabulary vector
Setting each element of that vector to be the count of the corresponding term in the document.
For example, if you have
Document 1 = the quick brown fox jumped over the brown dog
Document 2 = the brown cows eat hippo meat
Then the total set of words is [the,quick,brown,fox,jumped,over,dog,cows,eat,hippo,meat] and the document vectors are
Document 1 = [2,1,2,1,1,1,1,0,0,0,0]
Document 2 = [1,0,1,0,0,0,0,1,1,1,1]
And now you just have two giant feature vectors that represent the documents, and you can run k-means clustering on them. As others have said, Euclidean distance can be used to calculate the distance between documents (cosine similarity, which you mention, is another common choice for text).
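A minimal MATLAB sketch of the whole pipeline (my own illustration, assuming a recent MATLAB with strsplit and the Statistics Toolbox's kmeans; the stemming and stop-word steps from the question are omitted):
docs = {'the quick brown fox jumped over the brown dog', ...
        'the brown cows eat hippo meat'};
% Build the vocabulary and the term-count matrix (one row per document)
words = cellfun(@strsplit, docs, 'UniformOutput', false);
vocab = unique([words{:}]);
X = zeros(numel(docs), numel(vocab));
for i = 1:numel(docs)
    for j = 1:numel(vocab)
        X(i, j) = sum(strcmp(words{i}, vocab{j}));  % count of term j in doc i
    end
end
% Cluster the document vectors into k groups
k = 2;
idx = kmeans(X, k);  % idx(i) is the cluster assignment of document i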

There are various distance functions. One is the Euclidean distance.

You can use the Euclidean distance formula for an n-dimensional system:
sqrt((x1-x2)^2 + (y1-y2)^2 + (z1-z2)^2 + ...)
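In MATLAB, assuming p and q are two feature vectors of the same length, this is simply:
d = sqrt(sum((p - q).^2));  % equivalently, norm(p - q)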

Related

What algorithm is randperm based on? [closed]

Where can I find what algorithm MATLAB's randperm function uses? Is it the Fisher-Yates (Knuth) shuffling algorithm or something else?
For MATLAB releases as early as R2009b, randperm is implemented as follows:
function p = randperm(n)
[ignore, p] = sort(rand(1, n));  % the sort order of n random numbers is a random permutation
You can see it for yourself by typing:
type randperm
Basically, randperm generates n random numbers and sorts them, returning the resulting array of ordered indices p as the random permutation. The time complexity of this is O(n log n) at best, worse than the Fisher-Yates shuffle, which runs in O(n).
EDIT: Dennis points out that in later releases randperm runs in O(n) time, so it has obviously been improved. However, it is now a built-in function, so its implementation can no longer be inspected.
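For comparison, here is a minimal sketch of the Fisher-Yates shuffle in MATLAB (a hypothetical fisher_yates function of my own, not MATLAB's actual implementation):
function p = fisher_yates(n)
% Start from the identity permutation and swap each position
% with a uniformly random earlier (or same) position.
p = 1:n;
for i = n:-1:2
    j = randi(i);         % uniform random index in 1..i
    p([i j]) = p([j i]);  % swap p(i) and p(j)
end
Each element requires one swap, giving the O(n) running time mentioned above.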

What does the value of the gradient during NN training signify? [closed]

The default value for the gradient in the gradient descent approach is 1e-5.
Is this a very small value for generalization to a testing set? What range should I keep it in?
Does the gradient signify the error between the targets and the predicted class during TRAINING (i.e. using the training data)?
If you're not using regularization, you should check several values for the learning rate and several values for the number of iterations. You should do this on a hold-out set (also called a validation set). If you're using regularization, you should not tune the number of iterations this way; instead, try several values for the weight of the regularization term (usually called C or lambda).
As for values, people try learning rates from 2^-10 to 2^-1. It is also generally useful if your feature values are in a reasonable numerical range, from -1 to 1 or from 0 to 1.
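A hypothetical sketch of that search in MATLAB (train_model and validation_error are placeholder names standing in for your own training and evaluation code, not a real API):
% Try learning rates 2^-10 .. 2^-1 and keep the one with the
% lowest error on the hold-out (validation) set.
rates = 2.^(-10:-1);
errs = zeros(size(rates));
for i = 1:numel(rates)
    model = train_model(Xtrain, ytrain, rates(i));  % placeholder trainer
    errs(i) = validation_error(model, Xval, yval);  % placeholder scorer
end
[~, best] = min(errs);
best_rate = rates(best);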

Need a more suitable number type in matlab [closed]

I want to do some statistical calculations in MATLAB. My numbers are very small (between 0 and 1), and massive multiplications make them even smaller. I was using the double type for my work, but I noticed it only stores 5 digits of my number, and for larger numbers it stores the power of 10. This surely leads to a really big error in the final answers. How can I use more accurate number types? Thanks for the help.
Have you considered working in log-space? Represent each number x = exp(-y) by its exponent y. The exponents y now range between 0 and inf and should be much more robust to the dynamic range.
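A minimal sketch of the idea (my own example): multiplying many small probabilities becomes summing their logs, which avoids underflow.
p = rand(1, 2000) * 0.1;  % many small numbers in (0, 0.1)
direct = prod(p)          % underflows to exactly 0
logp = sum(log(p))        % a well-behaved (large negative) number
% exp(logp) would recover the product but may itself underflow;
% keep subsequent operations in log-space instead.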

Best tool for visualizing functions [closed]

I am looking for tools which can help me visualize (vector-valued) functions, e.g. quadratic functions like (1/2)x'Ax + q'x, where x' is the transpose of x, and so on. Which would be the best tool for that? I just want to give it the function and have it automatically plot it. I know I can generate the function values myself and then plot them using the plot function, but I want something which can do it automatically. Is there anything for that?
You're probably looking for ezplot:
figure, ezplot(@(x) 0.5*[x,x]*A*[x;x] + q'*[x;x], [xmin, xmax])
should do the trick (this plots the quadratic along the direction [1;1], scaled by the scalar x). Use scalars for xmin and xmax.
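If x is genuinely 2-D, you can surface-plot the quadratic with ezsurf; a small sketch with an A and q of my own choosing:
A = [2 0; 0 1];  % example 2x2 matrix
q = [1; -1];     % example linear term
% f(x) = (1/2) x'Ax + q'x, expanded and vectorized for x = [x1; x2]
f = @(x1, x2) 0.5*(A(1,1)*x1.^2 + (A(1,2) + A(2,1))*x1.*x2 + A(2,2)*x2.^2) ...
    + q(1)*x1 + q(2)*x2;
figure, ezsurf(f, [-3 3 -3 3])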

Strange behaviour in Neural Network training? [closed]

I have created a neural network for detecting spam.
It involves the following steps:
1. Formation of a tf-idf matrix of terms and mails.
2. Reduction of the matrix using PCA.
3. Feeding the 20 most important terms (according to their eigenvalues) to the neural network as features.
I'm training it for 1 = spam and 0 = not spam.
EDITS:
I decided to train it with a batch size of 7 mails because it was prone to an out-of-memory error while forming the matrix. I used the standard Enron dataset of ham and spam.
I trained the neural network via back-propagation: 1 input, 1 hidden, and 1 output layer, with 20 neurons in the input layer and 6 neurons in the hidden layer.
I started training with the original spam mails in my Gmail, which gave very bad results, before switching to the Enron dataset. Satisfactory outputs were obtained after quite a lot of training.
6 out of 14 mails were detected as spam when I tested.
I alternated the training batches, batch 1 of spam mails, batch 2 of ham mails, and so on, so that the network is trained towards an output of 1 for spam and 0 for ham.
But now, after too much training (almost 400-500 mails, I guess), it is giving bad results again. I reduced the learning rate, but that did not help.
What's going wrong?
To summarize my comments into an answer... If your net is producing results that you would expect, and then after additional training the output is less accurate, there is a good chance it is overtrained.
This is especially prone to happen if your data set is small or doesn't vary enough. Finding the optimal number of epochs is mostly trial and error; a common way to automate it is to watch the error on a hold-out set and stop when it starts rising, as sketched below.
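A hypothetical early-stopping sketch in MATLAB (train_one_epoch and validation_error are placeholder names for your own training and evaluation code, not a real API):
best_err = inf; patience = 5; bad = 0;
for epoch = 1:500
    net = train_one_epoch(net, Xtrain, ytrain);  % placeholder trainer
    err = validation_error(net, Xval, yval);     % placeholder scorer
    if err < best_err
        best_err = err; best_net = net; bad = 0; % remember the best model so far
    else
        bad = bad + 1;                           % epochs without improvement
        if bad >= patience, break; end           % stop: likely overtraining
    end
end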