I've been given an assignment based on section 10.6 of the book Neural-Symbolic
Cognitive Reasoning. Basically, I have to replicate a binary classification experiment, and to explain my issue with it I first need to describe the context of the problem.
Context of the problem
The goal is to classify trains as eastbound or westbound, where each train has a set of cars as shown in Figure 10.2. To classify a train, certain features of the train, along with features of its cars, must be considered.
The data set contains the following attributes:
for each train,
(a) the number of cars (3 to 5),
and (b) the number of different loads (1 to 4);
and for each car,
(c) the number of wheels (2 or 3),
(d) the length (short or long),
(e) the shape (closed-top rectangle, open-top rectangle, double open rectangle, ellipse, engine, hexagon, jagged top, open trap, sloped top, or U-shaped),
(f) the number of loads (0 to 3),
and (g) the shape of the load (circle, hexagon, rectangle, or triangle).
Then, ten boolean variables describe whether any particular pair of types of load are on adjacent cars of the train (each car carries a single type of load):
(h) there is a rectangle next to a rectangle,
(i) a rectangle next to a triangle,
(j) a rectangle next to a hexagon,
(k) a rectangle next to a circle,
(l) a triangle next to a triangle,
(m) a triangle next to a hexagon,
(n) a triangle next to a circle,
(o) a hexagon next to a hexagon,
(p) a hexagon next to a circle,
(q) a circle next to a circle.
Finally, the class attribute may be either east or west.
The issue
The book says that the experiment used
a network containing 32 input neurons and one output neuron (denoting east) [...]. The 32 inputs encode: the number of cars in a train; the number of different loads in a train; the number of wheels, the length, and the shape of each car; the number of loads in each car; the shape of the load of each car; and the ten boolean variables described above.
So, what I can't understand is how to map those features to 32 input neurons. The way that I'm counting those neurons is as follows.
How I'm mapping those features to input neurons
2 neurons to represent attributes (a) and (b) listed above,
10 neurons to represent the ten boolean variables (attribute (h) to (q)),
and for each car:
5 neurons to represent attributes (c), (d), (e), (f) and (g)
Since there are at most 5 cars in each train, there should be 25 neurons in total to represent attributes (c) to (g) for each car. This way, the network would have 2 + 10 + 25 = 37 input neurons, not 32 as the book says. So, what am I getting wrong here? Thanks in advance.
Edit #1:
According to the book, the values given to the attributes are as follows:
attributes referring to properties of cars that do not exist are assigned the value false. As usual, −1 is used to denote false and 1 to denote true in the case of boolean variables. Further, we assign values 1, 2, 3, ... to any attributes that have multiple values, in the order which they are presented above. So, in the case, for example, of the shape of the load, 1 is used to denote circle, 2 to denote hexagon, 3 to denote rectangle, and so on. Of course, for the corresponding neurons, instead of the bipolar function, we use a linear activation function h(x) = x.
My mistake was here:
Since there are at most 5 cars in each train, there should be 25 neurons in total to represent attributes (c) to (g) for each car
The first car doesn't need the five attributes (c) to (g), because it's the engine of the train. Only the four remaining cars need them, so the network has the expected 2 + 10 + 4 × 5 = 32 input neurons.
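For concreteness, here is a small MATLAB sketch of one plausible layout of those 32 inputs; the attribute values below are made up, and the exact ordering of the slots is my assumption, not the book's:
% One plausible encoding of a single train into the 32 inputs (made-up values).
% Per the book: booleans are -1/1, multi-valued attributes are 1, 2, 3, ...,
% and attributes of cars that do not exist are set to false (-1).
train = [4, 2];                         % (a) number of cars, (b) number of loads
car2  = [2, 1, 3, 1, 3];                % (c) wheels, (d) length, (e) shape,
car3  = [3, 2, 1, 2, 1];                %     (f) number of loads, (g) load shape
car4  = [2, 1, 6, 1, 2];
car5  = -ones(1, 5);                    % only 4 cars, so the 5th slot is all false
adjac = [-1 1 -1 -1 -1 -1 -1 -1 -1 1];  % (h)-(q) boolean adjacency variables
input_vec = [train, car2, car3, car4, car5, adjac];
numel(input_vec)                        % ans = 32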
Following this question and this tutorial I've created a simple net just like the tutorial's, but with 100x100 images and a first convolution kernel of 11x11 with pad=0.
I understand that the formula is (W−F+2P)/S+1, and in my case the dimension became [51x51x3] (3 being the RGB channels). But the number 96 pops up in my net diagram, and as that tutorial says, it is the third dimension of the output; in other words, my net after the first conv became [51x51x96]. I couldn't figure out how the number 96 is calculated, and why.
Isn't the convolution layer supposed to pass through the three color channels, with an output of three feature maps? How come its dimension grows like this? Isn't it true that we have one kernel for each channel? How does this one kernel create 96 (or, in the first tutorial, 256 or 384) feature maps?
You are mixing input channels and output channels.
Your input image has three channels: R, G, and B. Each filter in your conv layer acts on all three channels at once, over its spatial kernel size (e.g., 3-by-3), and outputs a single number per spatial location. So, if you have one filter in your layer, then your output would have only one output channel(!)
Normally, you would like to compute more than a single filter at each layer; this is what the num_output parameter in convolution_param is used for: it allows you to define how many filters will be trained in a specific convolutional layer.
Thus a Conv layer
layer {
  type: "Convolution"
  name: "my_conv"
  bottom: "x"  # shape 3-by-100-by-100
  top: "y"
  convolution_param {
    num_output: 32  # number of filters = number of output channels
    kernel_size: 3
  }
}
Will output "y" with shape 32-by-98-by-98.
For each input I have, there is an associated 49x2 matrix. Here's what one input-output pair looks like:
input :
[Car1, Car2, Car3 ..., Car118]
output :
[[Label1 Label2]
[Label1 Label2]
...
[Label1 Label2]]
where both Label1 and Label2 are label-encoded and have, respectively, 1200 and 1300 different classes.
Just to make sure: is this what we call a multi-output multi-class problem?
I tried flattening the output, but I feared the model wouldn't understand that all similar labels share the same classes.
Is there a Keras layer that can handle output of this peculiar array shape?
Generally, multi-class problems correspond to models outputting a probability distribution over the set of classes (typically scored against the one-hot encoding of the actual class through cross-entropy). Now, independently of whether you structure it as one single output, two outputs, 49 outputs, or 49 x 2 = 98 outputs, that would mean having 1,200 x 49 + 1,300 x 49 = 122,500 output units - which is something a computer can handle, but maybe not the most convenient thing to have. You could try having each class output be a single (e.g. linear) unit and rounding its value to choose the label, but, unless the labels have some numerical meaning (e.g. order, sizes, etc.), that is not likely to work.
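To make that scoring concrete, here is a toy sketch of one softmax output evaluated against a one-hot label (written in MATLAB since the arithmetic is framework-independent; the numbers are made up):
% Score a 4-class softmax output against a one-hot encoded label.
logits = [2.0, 0.5, -1.0, 0.1];           % made-up raw network outputs
probs  = exp(logits) / sum(exp(logits));  % softmax: a probability distribution
onehot = [0, 1, 0, 0];                    % the actual class is class 2
xent   = -sum(onehot .* log(probs))       % cross-entropy loss for this sample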
If the order of the elements in the input has some meaning (that is, shuffling it would affect the output), I think I'd approach the problem with an RNN, like an LSTM or a bidirectional LSTM model, with two outputs. Use return_sequences=True and TimeDistributed Dense softmax layers for the outputs; then, for each 118-long input, you'd have 118 pairs of outputs. You can then use temporal sample weighting to drop, for example, the first 69 (or maybe the first 35 and the last 34 if you're using a bidirectional model) and compute the loss with the remaining 49 pairs of labellings. Or, if that makes sense for your data (maybe it doesn't), you could go with something more advanced like CTC, which is also implemented in Keras (thanks @indraforyou).
If the order in the input has no meaning but the order of the outputs does, then you could have an RNN where your input is the original 118-long vector plus a pair of labels (each one-hot encoded), and the output is again a pair of labels (again two softmax layers). The idea would be that you get one "row" of the 49x2 output on each frame, and then you feed it back to the network along with the initial input to get the next one; at training time, you would have the input repeated 49 times along with the "previous" label (an empty label for the first one).
If there are no sequential relationships to exploit (i.e. the order of the input and the output does not have a special meaning), then the problem would only be truly represented by the initial 122,500 output units (plus all the hidden units you may need to get those right). You could also try some kind of middle ground between a regular network and an RNN, where you have the two softmax outputs and, along with the 118-long vector, include the "id" of the output that you want (e.g. as a 49-long one-hot encoded vector); if the "meaning" of each label at each of the 49 outputs is similar, or comparable, it may work.
I'm trying to implement an algorithm in MATLAB.
The algorithm (or stages) is as follows:
(1) Choose an illuminant.
(2) Calculate colour signals for all the 24 reflectances under that illuminant.
(3) Multiply, element-by-element, each sensor-response vector (columns or R (see variables)) by the colour signal.
(4) Sum the result over all wavelengths (which should leave me with 72 values: 24 R values (one for each surface), 24 G values, and 24 B values).
(5) Create an image from the calculated sensor responses by assigning each reflectance a 100x100 pixel square and arranging the squares in a pattern of 4 rows and 6 columns (like a Macbeth ColourChecker).
I think I'm getting confused at stage 4 (but I might be implementing it wrong earlier)...
These are my variables:
A %an illuminant vector of size 31x1.
R %colour camera sensitivities of size 31x3. (the columns of this matrix are the red, green, and blue sensor response functions of a camera).
S %surface reflectances (24) of size 31x24 from a Macbeth ColourChecker (each column is a different reflectance function).
WAV %Reference wavelength values of size 31x1.
This is what I've implemented:
(1) choose A (as it's the only illuminant I've made):
A;
(2) calculate colour signals for all 24 reflectances:
cSig_1A = S(:,1).*A;
cSig_2A = S(:,2).*A;
.
. %all 24 columns of S
.
cSig_24A = S(:,24).*A;
(3) multiply each sensor-response vector (the columns of R (RGB)) by the colour signals:
% R.*reflectances G.*reflectances B.*reflectances
a1=R(:,1).*cSig_1A; a12=R(:,2).*cSig_1A; a13=R(:,3).* cSig_1A;
b1=R(:,1).*cSig_2A; b12=R(:,2).*cSig_2A; b13=R(:,3).* cSig_2A;
.
. %all 24 signals (think this is correct)
.
x1=R(:,1).*cSig_24A; x12=R(:,2).*cSig_24A; x13=R(:,3).*cSig_24A;
Assuming I've done the previous steps correctly, I'm not sure how to sum the results over wavelengths so that only 72 values are left, and then how to create an image from them.
Maybe the wording is confusing me, but if you could give me some guidance, that would be great. It's much appreciated. Thanks in advance.
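In case it helps to see the steps end to end, here is a sketch of how (2)-(5) can be vectorized, assuming the variables A, R, and S above; the scaling by the maximum is my own assumption, just to get displayable values:
% Steps (2)-(4): colour signals for all 24 reflectances, then sum over
% the 31 wavelengths via a single matrix product.
C = bsxfun(@times, S, A);   % 31x24 colour signals (S .* A in R2016b+)
RGB = C' * R;               % 24x3: each row is one surface's [R G B] response
RGB = RGB / max(RGB(:));    % scale into [0,1] for display (assumption)
% Step (5): 24 patches of 100x100 pixels in a 4-by-6 layout.
img = zeros(400, 600, 3);
for k = 1:24
    [row, col] = ind2sub([4 6], k);
    img(100*(row-1)+(1:100), 100*(col-1)+(1:100), :) = ...
        repmat(reshape(RGB(k,:), 1, 1, 3), 100, 100);
end
image(img); axis image off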
I have a 6x1000 dataset of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data, each having 1000 attributes; 3 of them should be alike or similar in a way. Applying clustering will reveal the clusters. Since I know the number of clusters,
I only need to find similar rows. Hamming distance tells us the similarity between rows, and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always find the asked number
of clusters]
I want to take that knowledge
and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data, while my question concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear, ask about anything you are missing.
For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x, I ran the following code:
l = linkage(x, 'average');
dendrogram(l);
and got the following plot:
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the Hamming distance than the Euclidean distance (which linkage uses by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
You can start by visualizing your data with a 'barcode' plot and then labeling the rows with the cluster group they belong to:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates (A is the 6x1000 binary data matrix)
[r,c] = find(A);
Y = bsxfun(@minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(@minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
For binary strings a and b, the Hamming distance is equal to the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, you could create a 6-by-6 matrix filled with the Hamming distances. The matrix would be symmetric (the distance from a to b is the same as the distance from b to a) and the diagonal would be 0 (the distance from a to itself is null).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i = 1:6
    for j = 1:6
        hamming_dist(i,j) = sum(xor(x(i,:), x(j,:)));
    end
end
(And yes, this code is redundant given the symmetry and zero diagonal, but the computation is minimal and optimizing it is not worth the effort.)
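If you have the Statistics Toolbox, the same matrix can be had in one line; note that pdist's 'hamming' metric returns the fraction of differing bits, so multiply by the string length to get counts:
% Equivalent one-liner: pairwise Hamming distances as absolute counts.
hamming_dist = squareform(pdist(x, 'hamming')) * size(x, 2);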
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your kmeans approach, but your added description of the problem helped shape this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? It is a simple distance table. Each row and each column represents a series of data (here, a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between strings 1 and 2 is the same as between strings 2 and 1, which is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distances are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data from different clusters has ~50% dissimilarity (which is what you would expect by chance for binary data).
Presentation
I recommend a Kohonen network (self-organizing map) or a similar technique to present your data in, say, 2 dimensions. In general, this area is called dimensionality reduction.
You can also go a simpler way, e.g. Principal Component Analysis, but there's no guarantee you can effectively remove 998 dimensions :P
scikit-learn is a good Python package to get you started; similar ones exist for MATLAB, Java, etc. I can assure you it's rather easy to implement some of these algorithms yourself.
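For example, a minimal MATLAB sketch, assuming your 6x1000 data is in x and the kmeans assignments in idx (pca needs the Statistics Toolbox):
% Project the 6 rows onto their first two principal components and
% colour each point by its cluster assignment.
[~, score] = pca(double(x));   % score has at most 5 columns for 6 rows
scatter(score(:,1), score(:,2), 60, idx, 'filled');
text(score(:,1)+0.2, score(:,2), cellstr(num2str((1:size(x,1))')));
xlabel('PC 1'); ylabel('PC 2');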
Concerns
I have a concern about your data set, though. Six data points is a really small number. Moreover, your attributes seem boolean at first glance; if that's the case, Manhattan distance is what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if the attributes are actually a 1000-bit-long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points, each attribute (column) can show only 2^6 = 64 distinct patterns, which means at least 936 out of your 1000 attributes are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test the significance of your clusters, run K-means several times with different initial conditions and check whether you get the same clusters. If you get different clusters every time, or even from time to time, you cannot really trust your result.
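A sketch of that stability check, assuming the 6x1000 data is in x (the number of runs is arbitrary):
% Re-run k-means with different random initializations and check whether
% the partitions agree (labels are arbitrary, so compare co-membership).
runs = 10;
labels = zeros(size(x,1), runs);
for k = 1:runs
    labels(:,k) = kmeans(x, 3, 'distance', 'hamming');
end
same  = @(a,b) isequal(bsxfun(@eq,a,a'), bsxfun(@eq,b,b'));
agree = arrayfun(@(k) same(labels(:,1), labels(:,k)), 1:runs);
fprintf('%d of %d runs match the first partition\n', sum(agree), runs);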
I used a barcode-type visualization for my data. The code posted here earlier by Oleg was too heavy for my solution (the image files were over 500 KB), so I used image() to make the figures:
function barcode(A)
    B = (A+1)*2;               % map 0/1 data to colormap indices 2/4
    image(B);
    colormap flag;             % index 2 = white, index 4 = black
    set(gca,'Ydir','Normal')
    axis([0 size(B,2) 0 size(B,1)]);
    ax = gca;
    ax.TickDir = 'out';
end
I have a Lisp program on roulette wheel selection. I am trying to understand the theory behind it, but I cannot understand anything.
How do you calculate the fitness of the selected string?
For example, if I have the string 01101, how did they get the fitness value of 169?
Is it that the binary encoding of 01101 evaluates to 13, so I square the value and get the answer 169?
That sounds lame, but somehow I am getting the right answers by doing that.
The fitness function you have is therefore F(x) = x^2: 01101 in binary is 13, and 13^2 = 169.
The roulette wheel calculates the proportion (according to its fitness) of the whole that an individual (string) takes; this is then used to randomly select a set of strings for the next generation.
Suggest you read this a few times.
The "fitness function" for a given problem is chosen (often) arbitrarily keeping in mind that as the "fitness" metric rises, the solution should approach optimality. For example for a problem in which the objective is to minimize a positive value, the natural choice for F(x) would be 1/x.
For the problem at hand, it seems that the fitness function has been given as F(x) = val(x)*val(x), though one cannot be certain from just a single (x, F(x)) value pair.
Roulette-wheel selection is just a commonly employed method of fitness-based pseudo-random selection. This is easy to understand if you've ever played roulette or watched 'Wheel of Fortune'.
Let us consider the simplest case, where F(x) = val(x).
Suppose we have four values: 1, 2, 3, and 4.
This implies that these "individuals" have fitnesses 1, 2, 3, and 4 respectively. Now, the probability of selecting an individual x1 is F(x1)/(sum of all F(x)). That is to say, since the sum of the fitnesses here is 10, the probabilities of selection would be, respectively, 0.1, 0.2, 0.3, and 0.4.
Now, if we consider these probabilities from a cumulative perspective, the values of x would be mapped to the following ranges of "probability":
1 ---> (0.0, 0.1]
2 ---> (0.1, (0.1 + 0.2)] ---> (0.1, 0.3]
3 ---> (0.3, (0.1 + 0.2 + 0.3)] ---> (0.3, 0.6]
4 ---> (0.6, (0.1 + 0.2 + 0.3 + 0.4)] ---> (0.6, 1.0]
That is, an instance R of a uniformly distributed random variable on the normalised interval (0, 1] is four times as likely to fall in the interval corresponding to 4 as in the one corresponding to 1.
To put it another way, suppose you were to spin a roulette-wheel-type structure with each x assigned a sector whose area is in proportion to its value of F(x); then the probability that the indicator stops in any given sector is directly proportional to the value of F(x) for that x.
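Putting it together, here is a minimal MATLAB sketch of the whole selection step (the values are made up for illustration):
% Roulette-wheel selection with fitness F(x) = x^2.
vals    = [13, 4, 8, 19];          % decoded binary strings, e.g. 01101 -> 13
fitness = vals .^ 2;               % e.g. 13 -> 169
p       = fitness / sum(fitness);  % selection probabilities
edges   = cumsum(p);               % cumulative sector boundaries on (0, 1]
spins   = rand(1, 6);              % six spins of the wheel
picked  = arrayfun(@(r) find(r <= edges, 1), spins);
vals(picked)                       % the selected individuals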