Artificial Neural Network in MATLAB - matlab

I'm new to neural networks, but I have a question regarding NNs and ANNs.
I have a list of objects. Each object contains a longitude, a latitude and a list of words.
What I want to do is predict the location based on the text contained in the object (similar texts should have similar locations). Right now I'm using cosine similarity to calculate the similarity between the objects' texts, but I'm stuck on how I can use that information to train my neural network. I have a matrix containing each object and how many times each word appeared in that object. E.g. if I had these two objects:
Obj C: 54.123, 10.123, [This is a text for object C]
Obj B: 57.321, 11.113, [This is a another text for object B]
Then I have something like the following matrix
       This  is  a  text  for  object  C  another  B
ObjC:   1    1   1   1     1     1     1     0      0
ObjB:   1    1   1   1     1     1     0     1      1
I would also have something like the following for the distance between the two objects (note that the numbers are not real):
       ObjC  ObjB
ObjC   1     0.25
ObjB   0.25  1
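For reference, a small sketch of how the cosine similarity between the two word-count rows above can be computed in MATLAB (the 0.25 in the table is a placeholder, so the actual value will differ):
M = [1 1 1 1 1 1 1 0 0;    % ObjC word counts
     1 1 1 1 1 1 0 1 1];   % ObjB word counts
cos_sim = (M(1,:) * M(2,:)') / (norm(M(1,:)) * norm(M(2,:)));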
I have looked at how to use a neural network either to classify things into groups (like A, B, C) or to predict something like a housing price, but I haven't found anything helpful for my problem.
I would consider the prediction correct if it is within some distance X of the true location, since I'm dealing with locations.
This might be a stupid question, but could someone point me in the right direction?
Everything helps!
Regards
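A minimal sketch of one way such a setup could look with MATLAB's Neural Network Toolbox, treating the task as a regression from word counts to coordinates (the hidden-layer size, variable names and layout below are illustrative assumptions, not taken from the question):
% X: nWords-by-nObjects matrix of word counts (one column per object)
% T: 2-by-nObjects matrix of targets [latitude; longitude]
net = fitnet(10);                  % one hidden layer with 10 neurons (arbitrary choice)
[net, tr] = train(net, X, T);      % fit the regression network
pred = net(X(:,1));                % predicted [lat; lon] for the first object
A prediction could then be counted as correct whenever the distance between pred and the true coordinates falls below the chosen threshold.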

Related

How to define a triangular membership function for fuzzy controller design?

I am designing a fuzzy controller and for that, I have to define 3 triangular function sets. They are:
1 large
2 medium
3 small
But my problem is that I have only the following data:
Maximum input = 3 Minimum input= 0.1
Maximum output = 5.5 Minimum output= 0.8
How do I define the three triangular set ranges based on only this information?
Here is the formula for a triangular membership function
f=0 if x<=a
f=(x-a)/(b-a) if a<=x<=b
f=(c-x)/(c-b) if b<=x<=c
f=0 if x>c
where a is the min, c is the max and b is the midpoint.
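In MATLAB, the same piecewise formula can be written compactly as an anonymous function (just a sketch; trimf_manual is an illustrative name):
trimf_manual = @(x, a, b, c) max(min((x - a)./(b - a), (c - x)./(c - b)), 0);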
In your case, take the top situation where the max is 3 and the min is 0.1. The midpoint is (3+0.1)/2=1.55, so you have
f=0 if x<=0.1
f=(x-0.1)/(1.55-0.1) if 0.1<=x<=1.55
f=(3-x)/(3-1.55) if 1.55<=x<=3
f=0 if x>3
You should be able to take the 2nd example from here, but if not let me know. Something worth pointing out is that the midpoint may not be the ideal b in your situation. Any point between a and c could serve as your b, just know that it is the point where the membership function equals 1.
It is difficult to tell, but it looks like you may have been given parameters for only two of the functions, perhaps for small and large or medium and large. You may need to use some judgement for the third membership function.
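If the Fuzzy Logic Toolbox is available, trimf implements this membership function directly; a quick sketch using the numbers from the worked example above:
x = linspace(0, 3.5, 200);
y = trimf(x, [0.1 1.55 3]);   % parameters [a b c] = [min midpoint max]
plot(x, y), xlabel('input'), ylabel('membership degree')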

Understanding data types and attribute values

I would like to understand the data types and attribute values for the following data; understanding them will help me make the correct decision when selecting a classification or clustering algorithm.
My data consists of 100 folders, each containing images, so I selected some categories to classify these images based on their content,
like { sea, sky, lion, ... etc. }
categorical attributes
folder-name    total images   sea   sky   food   animals
folder1        100            10    2     0      5
folder2        20             0     1     15     3
etc.
Total images refers to the total number of images in that folder; the number in each category column is the frequency of images of that category found in the folder. For example, sea pictures found in folder 1 = 10 (10 images are sea photos), etc.
I know the values here are discrete, but what is the attribute type { interval, nominal, ordinal }?
The values were grouped based on a simple comparison: if folder1.image1 = sea then 1, otherwise 0; I then grouped the image values to build the table above.
To convert the frequency values to ordinal, I am calculating the frequency percentage: if it is 10% then 1, if 20% then 2. Is this way correct?
Any advice, thanks.
As I said in my comments, you could implement different approaches to clustering:
Euclidean distance (let's say spot the 10% most frequent terms, build the space accordingly (one axis per term), and then measure the distance between documents (folders))
Jaccard index
CLIQUE looks interesting, but I'm not familiar enough with it.
tf-idf is good for spotting non-frequent terms (files) and claiming that documents that have these terms are similar and belong to the same class.
As I mentioned before, I would start with something really simple like ranking by terms or Euclidean distance to "feel" the data; see the sketch below. As you proceed you will get more ideas.
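A minimal sketch of that Euclidean-distance starting point in MATLAB (this assumes the Statistics Toolbox; F and the normalization step are illustrative, using the two rows from the table above, and would be built from all 100 folders in practice):
F = [10 2 0 5;                            % folder1 category counts
     0 1 15 3];                           % folder2 category counts
Fnorm = bsxfun(@rdivide, F, sum(F, 2));   % counts -> per-folder proportions
D = squareform(pdist(Fnorm));             % pairwise Euclidean distances between folders
idx = kmeans(Fnorm, 2);                   % simple k-means on the normalized profiles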

Using and understanding MATLAB's TreeBagger (a random forest) method

I'm trying to use MATLAB's TreeBagger method, which implements a random forest.
I get some results, and can do a classification in MATLAB after training the classifier.
However I'd like to "see" the trees, or want to know how the classification works.
For example, let's run this minimal example, I found here: Matlab treebagger example
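A hedged sketch of roughly what such a call looks like (the feature values and labels below are made up just to match the 6x2 / 6x1 summary shown further down; the real data is in the linked example):
X = [1 300; 2 400; 3 500; 4 600; 5 700; 6 800];     % 6x2 training features (made up)
Y = [0; 0; 0; 1; 1; 1];                              % 6x1 class labels (made up)
B = TreeBagger(20, X, Y, 'Method', 'classification');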
So, I end up with a classifier stored in "B".
How can I inspect the trees? Like having a look at each node, to see on which criteria (e.g. feature) the decision is made?
Entering B returns:
B =
TreeBagger
Ensemble with 20 bagged decision trees:
Training X: [6x2]
Training Y: [6x1]
Method: classification
Nvars: 2
NVarToSample: 2
MinLeaf: 1
FBoot: 1
SampleWithReplacement: 1
ComputeOOBPrediction: 0
ComputeOOBVarImp: 0
Proximity: []
ClassNames: '0' '1'
I can't see something like B.trees or so.
And a follow-up question would be:
How do you port random-forest code that you prototyped in MATLAB to another language?
Then you need to know how each tree works, so you can implement it in the target language.
I hope you get the point, or understand my query ;)
Thanks for answers!
Best,
Patrick
I found out how to inspect the trees by running the view() command. E.g. for inspecting the first tree of the example:
>> view(B.Trees{1})
Decision tree for classification
1 if x2<650 then node 2 elseif x2>=650 then node 3 else 0
2 if x1<4.5 then node 4 elseif x1>=4.5 then node 5 else 1
3 class = 0
4 class = 0
5 class = 1
By passing some more arguments to the view() command, the tree can also be visualized:
view(B.Trees{1},'mode','graph')
To view multiple trees, just use a loop (here t is your TreeBagger object, i.e. B in the example above, and the loop bound should match the number of trees in the ensemble):
for n = 1:30   % number of trees
    view(t.Trees{n});
end
You can find the source here.
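For the follow-up about porting the trees to another language, you need the split variable, threshold and child indices of every node. A hedged sketch for reading them off programmatically, assuming a MATLAB release where B.Trees{1} is a CompactClassificationTree object (older releases return classregtree objects with different property names):
t = B.Trees{1};
for node = 1:t.NumNodes
    if ~isnan(t.CutPoint(node))                           % branch node
        fprintf('node %d: if %s < %g go to node %d, else go to node %d\n', ...
            node, t.CutPredictor{node}, t.CutPoint(node), ...
            t.Children(node,1), t.Children(node,2));
    else                                                  % leaf node
        fprintf('node %d: class = %s\n', node, t.NodeClass{node});
    end
end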

How to visualize binary data?

I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data each having 1000 attributes; 3 of them should be alike or similar in a way. Applying clustering will reveal the clusters. Since I know the number of clusters
I only need to find similar rows. The Hamming distance tells us the similarity between rows, and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always find the asked number of clusters]
I want to take that knowledge
and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data while my question concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear, ask about anything you are missing.
For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(x, 'average');
dendrogram(l);
and got the following plot:
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the Hamming distance than the Euclidean distance (which linkage uses by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
You can start by visualizing your data with a 'barcode' plot and then labeling the rows with the cluster group they belong to:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates (A is the 6x1000 binary data matrix)
[r,c] = find(A);
Y = bsxfun(@minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(@minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
The Hamming distance between binary strings a and b is equal to the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, you could create a 6-by-6 matrix filled with the Hamming distances. The matrix would be symmetric (the distance from a to b is the same as the distance from b to a) and the diagonal would be 0 (the distance from a to itself is null).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i = 1:6
    for j = 1:6
        hamming_dist(i,j) = sum(xor(x(i,:), x(j,:)));
    end
end
(And yes, this code is redundant given the symmetry and zero diagonal, but the computation is minimal and optimizing it is not worth the effort.)
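If the Statistics Toolbox is available, the same matrix can be produced in one line (a sketch; note that pdist returns normalized Hamming distances, i.e. the fraction of differing bits, so multiply by the string length to get counts):
hamming_dist = squareform(pdist(x, 'hamming')) * size(x, 2);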
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description of the problem helped shape this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (here a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1, which is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distances are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not in the same cluster has ~50% dissimilarity, which is what you would expect from random binary data.
Presentation
I recommend a Kohonen network (self-organizing map) or a similar technique to present your data in, say, 2 dimensions. In general this area is called dimensionality reduction.
You can also go a simpler way, e.g. Principal Component Analysis, but there's no guarantee you can effectively remove 998 dimensions :P
scikit-learn is a good Python package to get you started; similar packages exist for MATLAB, Java, etc. I can assure you it's rather easy to implement some of these algorithms yourself.
Concerns
I have a concern over your data set though. 6 data points is a really small number. Moreover, your attributes seem boolean at first glance; if that's the case, Manhattan distance is what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if the attributes are actually one 1000-bit binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points there are only 2^6 = 64 possible distinct attribute columns, which means 936 out of your 1000 attributes are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test the significance of your clusters, run K-means several times with different initial conditions and check whether you get the same clusters. If you get different clusters every time, or even just from time to time, you cannot really trust your result.
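A small sketch of that stability check in MATLAB (the number of repetitions is arbitrary; x is the 6x1000 binary matrix):
nruns = 10;
labels = zeros(size(x,1), nruns);
for r = 1:nruns
    labels(:,r) = kmeans(x, 3, 'distance', 'hamming');   % fresh random initialization each run
end
% If the columns of labels agree across runs (up to a relabeling of cluster ids),
% the clusters can be considered stable; if they keep changing, don't trust them.
disp(labels)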
I used a barcode-type visualization for my data. The code posted here earlier by Oleg was too heavy for my solution (image files were over 500 kB), so I used image() to make the figures:
function barcode(A)
    % A is the binary data matrix; map 0/1 to colormap indices 2/4
    B = (A+1)*2;
    image(B);
    colormap flag;                     % with 'flag', index 2 is white and index 4 is black
    set(gca,'Ydir','Normal')
    axis([0 size(B,2) 0 size(B,1)]);
    ax = gca;
    ax.TickDir = 'out';
end

Dot Product: * Command vs. Loop gives different results

I have two vectors in Matlab, z and beta. Vector z is a 1x17:
1 0.430742139435890 0.257372971229541 0.0965909090909091 0.694329541928697 0 0.394960106863064 0 0.100000000000000 1 0.264704325268675 0.387774594078319 0.269207605609567 0.472226643323253 0.750000000000000 0.513121013402805 0.697062571025173
... and beta is a 17x1:
6.55269487769363e+26
0
0
-56.3867588816768
-2.21310778926413
0
57.0726052009847
0
3.47223691057151e+27
-1.00249317882651e+27
3.38202232046686
1.16425987969027
0.229504956512063
-0.314243264212449
-0.257394312588330
0.498644243389556
-0.852510642195370
I'm dealing with some singularity issues, and I noticed that if I want to compute the dot product of z*beta, I potentially get 2 different solutions. If I use the * command, z*beta = 18.5045. If I write a loop to compute the dot product (below), I get a solution of 0.7287.
summation = 0;
for i = 1:17
    addition = z(1,i)*beta(i);
    summation = summation + addition;
end
Any idea what's going on here?
Here's a link to the data: https://dl.dropboxusercontent.com/u/16594701/data.zip
The problem here is that addition of floating point numbers is not associative. When summing a sequence of numbers of comparable magnitude, this is not usually a problem. However, in your sequence, most numbers are around 1 or 10, while several entries have magnitude 10^26 or 10^27. Numerical problems are almost unavoidable in this situation.
The wikipedia page http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems shows a worked example where (a + b) + c is not equal to a + (b + c), i.e. demonstrating that the order in which you add up floating point numbers does matter.
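A tiny MATLAB illustration of the same effect, with magnitudes comparable to those in your beta vector:
a = 1e27;  b = -1e27;  c = 1;
(a + b) + c     % returns 1
a + (b + c)     % returns 0, because c is absorbed when added to -1e27 first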
I would guess that this is a homework assignment designed to illustrate these exact issues. If not, I'd ask what the data represents to suss out the appropriate approach. It would probably be much more productive to find out why such large numbers are being produced in the first place than trying to make sense of the dot product that includes them.