Normalizing data before feeding to an RNN

I want to feed the following example data into an RNN:
[[[100  22   1]
  [120  30   0]
  [232  11   0]]

 [[110  23   1]
  [230  32   1]
  [500  43   1]]]
So there are 3 features: the first two are numerical with different scales, and the third is binary. The size is 2x3x3 (2 examples, each with sequence length 3 and 3 features).
I would like to normalize these data. I read that you do not have to normalize binary features. But what about the other two?
Let's say I use the following method:
(x-min(x))/(max(x)-min(x))
a) Should I normalize them within each example independently? That is, calculate the min and max for each example separately?
b) Or should I normalize based on the whole dataset?
Which one is the correct way?
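To make the two options concrete, here is a minimal MATLAB sketch of option (b), taking the min and max per feature over the whole dataset (the variable names and layout are illustrative assumptions):
x1 = [100 120 232; 110 230 500];  % feature 1: rows = examples, cols = time steps
x2 = [ 22  30  11;  23  32  43];  % feature 2
b  = [  1   0   0;   1   1   1];  % binary feature 3, left unnormalized
n1 = (x1 - min(x1(:))) / (max(x1(:)) - min(x1(:)));  % global min/max per feature
n2 = (x2 - min(x2(:))) / (max(x2(:)) - min(x2(:)));
For option (a) you would instead compute the statistics per example, e.g. min(x1, [], 2) and max(x1, [], 2), and normalize each row with its own values.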

Related

Clustering matrix distance between 3 time series

I have a question about the application of clustering techniques, more concretely K-means.
I have a data frame with 3 sensors (A,B,C):
time     |  A |  B |  C
8:00:00  |  6 | 10 | 11
8:30:00  | 11 | 17 | 20
9:00:00  | 22 | 22 | 15
9:30:00  | 20 | 22 | 21
10:00:00 | 17 | 26 | 26
10:30:00 | 16 | 45 | 29
11:00:00 | 19 | 43 | 22
11:30:00 | 20 | 32 | 22
...      | .. | .. | ..
And I want to group sensors that have the same behavior.
My question is: looking at the data frame above, should I calculate the correlation between the columns and then apply the Euclidean distance to this correlation matrix, obtaining a 3x3 matrix of distances?
Or should I transpose my data frame and then compute the dist() matrix with the Euclidean metric only, which also gives a 3x3 matrix of distances?
You have just three sensors. That means you'll need three values: d(A,B), d(B,C) and d(A,C). Any "clustering" here does not seem to make sense to me, certainly not k-means. K-means is for points (!) in R^d for small d.
Choose any form of time series similarity that you like. Could be simply correlation, but also DTW and the like.
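For example, a correlation-based dissimilarity between the three series takes only a few lines of MATLAB (a hedged sketch; X below merely restates the table from the question, columns = sensors A, B, C):
X = [ 6 10 11; 11 17 20; 22 22 15; 20 22 21; 17 26 26; 16 45 29; 19 43 22; 20 32 22];
R = corrcoef(X);  % 3x3 correlation matrix between the sensor columns
D = 1 - R         % one common way to turn correlation into a dissimilarity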
Q1: No. Why: The correlation is not needed here.
Q2: No. Why: I'd calculate the distances differently.
For the first row, R's built-in dist() function (which uses the Euclidean distance by default)
dist(c(6, 10, 11))
gives you the pairwise distances between the values:
  1 2
2 4
3 5 1
Items 2 and 3 are closest to each other. That's simple.
But there is no single way to calculate the distance between a point and a group of points. There you need a linkage function (min/max/average/...)
What I would do using R's built-in kmeans() function (a code sketch follows the list):
Ignore the time column,
(assuming there are no NA values in any A,B,C columns)
scale the data if necessary (here they all seem to have same order of magnitude)
perform KMeans analysis on the A,B,C columns, with k = 1...n ; evaluate results
perform a final KMeans with your suitable choice of k
get the cluster assignments for each row
put them in a new column to the right of C
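A minimal sketch of those steps in MATLAB (the list mentions R's kmeans(), but the steps translate directly; assuming A, B, C are column vectors of readings and the final k = 2 is just a placeholder choice):
X = [A, B, C];                 % ignore the time column
% X = zscore(X);               % scale if necessary (magnitudes are similar here)
for k = 1:4                    % try several k and evaluate the results
    [~, ~, sumd] = kmeans(X, k, 'Replicates', 10);
    fprintf('k = %d: total within-cluster distance %.1f\n', k, sum(sumd));
end
idx = kmeans(X, 2, 'Replicates', 10);  % final run with the chosen k
result = [X, idx];             % cluster assignments as a new column next to C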

Vector with specific number of equally spaced values

I am familiar with the command:
(-5:0.1:5)
which creates a vector of values, equally spaced by 0.1, from -5 to 5.
However, is there a way to produce a vector of values, equally spaced from -5 to 5, such that there are, say, 100 values in the vector?
(-5:0.1:5) gives a vector with 101 values, however, is there a way to get a vector of 100 values without manually calculating the step size?
Yes, there is: use the function linspace. See the documentation for details.
linspace(1,10,10)
ans =
1 2 3 4 5 6 7 8 9 10
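Applied to the range in the question, the following returns 100 equally spaced values from -5 to 5, with the step size (10/99) computed automatically:
linspace(-5, 5, 100)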
Also, this question is a duplicate of an earlier question.

Import a txt weighted adjacency list into a matrix

I want to create a weighted adjacency matrix. Is there a good way that will work even with a huge data set?
I have this abc.txt file for example:
abc.txt
1 2 50
2 3 70
3 1 42
1 3 36
The result should be:
matrix =
    0  50  36
    0   0  70
   42   0   0
How can I construct a weighted adjacency matrix from input dataset graph file as shown above that will contain the weights?
So basically input file has 3 columns and the third column is the weights of each edge.
You could also apply spconvert to the output of importdata:
matrix = full(spconvert(importdata('abc.txt')));
What you have is a sparse definition of a matrix, so using sparse is the simplest way to create it. If your matrix is sparse (many zeros) you may also want to keep the sparse matrix, because it requires less memory; in that case, just delete the last line (the full() conversion).
S = load('abc.txt');                 % read the three-column edge list
M = sparse(S(:,1), S(:,2), S(:,3));  % row index, column index, weight
M = full(M)                          % convert to a dense matrix
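One caveat worth hedging: sparse() sizes the result by the largest indices that actually occur, so if the highest-numbered node has no outgoing edge the matrix will not come out square. Passing the dimensions explicitly avoids that:
n = max(max(S(:,1:2)));                         % number of nodes
M = full(sparse(S(:,1), S(:,2), S(:,3), n, n))  % force an n-by-n matrix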

How to visualize binary data?

I have a dataset 6x1000 of binary data (6 data points, 1000 boolean dimensions).
I perform cluster analysis on it
[idx, ctrs] = kmeans(x, 3, 'distance', 'hamming');
And I get the three clusters. How can I visualize my result?
I have 6 rows of data each having 1000 attributes; 3 of them should be alike or similar in a way. Applying clustering will reveal the clusters. Since I know the number of clusters
I only need to find similar rows. Hamming distance tell us the similarity between rows and the result is correct that there are 3 clusters.
[EDIT: for any reasonable data, kmeans will always find the asked-for number of clusters]
I want to take that knowledge
and make it easily observable and understandable without having to write huge explanations.
Matlab's example is not suitable since it deals with numerical 2D data, while my question concerns n-dimensional categorical data.
The dataset is here http://pastebin.com/cEWJfrAR
[EDIT1: how to check if clusters are significant?]
For more information please visit the following link:
https://chat.stackoverflow.com/rooms/32090/discussion-between-oleg-komarov-and-justcurious
If the question is not clear, ask about anything you are missing.
For representing the differences between high-dimensional vectors or clusters, I have used Matlab's dendrogram function. For instance, after loading your dataset into the matrix x I ran the following code:
l = linkage(x, 'average');
dendrogram(l);
and got a dendrogram plot.
The height of the bar that connects two groups of nodes represents the average distance between members of those two groups. In this case it looks like (5 and 6), (1 and 2), and (3 and 4) are clustered.
If you would rather use the Hamming distance than the Euclidean distance (which linkage uses by default), then you can just do
l = linkage(x, 'average', {'hamming'});
although it makes little difference to the plot.
You can start by visualizing your data with a 'barcode' plot and then labeling the rows with the cluster group they belong to:
% Create figure
figure('pos',[100,300,640,150])
% Calculate patch xy coordinates
[r,c] = find(A);
Y = bsxfun(@minus,r,[.5,-.5,-.5, .5])';
X = bsxfun(@minus,c,[.5, .5,-.5,-.5])';
% plot patch
patch(X,Y,ones(size(X)),'EdgeColor','none','FaceColor','k');
% Set axis prop
set(gca,'pos',[0.05,0.05,.9,.9],'ylim',[0.5 6.5],'xlim',[0.5 1000.5],'xtick',[],'ytick',1:6,'ydir','reverse')
% Cluster
c = kmeans(A,3,'distance','hamming');
% Add lateral labeling of the clusters
nc = numel(c);
h = text(repmat(1010,nc,1),1:nc,reshape(sprintf('%3d',c),3,numel(c))');
cmap = hsv(max(c));
set(h,{'Background'},num2cell(cmap(c,:),2))
Definition
For binary strings a and b, the Hamming distance is equal to the number of ones (population count) in a XOR b (see Hamming distance).
Solution
Since you have six data strings, you could create a 6 by 6 matrix filled with the Hamming distances. The matrix would be symmetric (the distance from a to b is the same as the distance from b to a) and the diagonal is 0 (the distance from a to itself is zero).
For example, the Hamming distance between your first and second string is:
hamming_dist12 = sum(xor(x(1,:),x(2,:)));
Loop that and fill your matrix:
hamming_dist = zeros(6);
for i=1:6,
for j=1:6,
hamming_dist(i,j) = sum(xor(x(i,:),x(j,:)));
end
end
(And yes, this code is redundant given the symmetry and the zero diagonal, but the computation is minimal and optimizing it is not worth the effort.)
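If the Statistics Toolbox is available, the same matrix can be computed without the loop; pdist's 'hamming' metric returns the fraction of differing positions, so multiply by the string length to get counts:
hamming_dist = squareform(pdist(x, 'hamming')) * size(x, 2);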
Print your matrix as a spreadsheet in text format, and let the reader find which data string is similar to which.
This does not use your "kmeans" approach, but your added description regarding the problem helped shaping this out-of-the-box answer. I hope it helps.
Results
0 182 481 495 490 500
182 0 479 489 492 488
481 479 0 180 497 517
495 489 180 0 503 515
490 492 497 503 0 174
500 488 517 515 174 0
Edit 1:
How to read the table? The table is a simple distance table. Each row and each column represent a series of data (herein a binary string). The value at the intersection of row 1 and column 2 is the Hamming distance between string 1 and string 2, which is 182. The distance between string 1 and 2 is the same as between string 2 and 1, this is why the matrix is symmetric.
Data analysis
Three clusters can readily be identified: 1-2, 3-4 and 5-6, whose Hamming distance are, respectively, 182, 180, and 174.
Within a cluster, the data has ~18% dissimilarity. By contrast, data not part of a cluster has ~50% dissimilarity (which is random given binary data).
Presentation
I recommend Kohonen network or similar technique to present your data in, say, 2 dimensions. In general this area is called Dimensionality reduction.
You can also go a simpler way, e.g. Principal Component Analysis, but there's no guarantee you can effectively remove 998 of the 1000 dimensions :P
scikit-learn is a good Python package to get you started; similar packages exist for MATLAB, Java, etc. I can assure you it's rather easy to implement some of these algorithms yourself.
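As a hedged sketch of the PCA route (assuming the Statistics Toolbox's pca function and the cluster indices idx from the kmeans call in the question):
[~, score] = pca(double(x));     % project the 6 binary rows onto principal components
scatter(score(:,1), score(:,2), 60, idx, 'filled')      % color points by cluster
text(score(:,1), score(:,2), cellstr(num2str((1:6)')))  % label the 6 data points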
Concerns
I have a concern over your data set though. 6 data points is really a small number. Moreover, your attributes seem boolean at first glance; if that's the case, Manhattan distance is what you should use. I think (someone correct me if I'm wrong) Hamming distance only makes sense if your attributes are somehow related, e.g. if the attributes are actually a 1000-bit long binary string rather than 1000 independent 1-bit attributes.
Moreover, with 6 data points there are only 2^6 = 64 possible distinct attribute columns, which means at least 936 out of your 1000 attributes are either truly redundant or indistinguishable from redundant.
K-means almost always finds as many clusters as you ask for. To test the significance of your clusters, run K-means several times with different initial conditions and check whether you get the same clusters. If you get different clusters every time, or even from time to time, you cannot really trust your result.
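A sketch of that stability check, reusing the kmeans call from the question:
runs = 10;
labels = zeros(runs, size(x, 1));
for r = 1:runs                 % a different random initialization each run
    labels(r, :) = kmeans(x, 3, 'distance', 'hamming');
end
% Compare the rows of labels up to a renaming of the cluster ids; if the
% groupings differ between runs, the clustering is not stable.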
I used a barcode-type visualization for my data. The code posted earlier by Oleg was too heavy for my solution (the image files were over 500 kB), so I used image() to make the figures:
function barcode(A)
B = (A+1)*2;     % map 0 -> 2 (white) and 1 -> 4 (black) in the flag colormap
image(B);
colormap flag;
set(gca,'Ydir','Normal')
axis([0 size(B,2) 0 size(B,1)]);
ax = gca;
ax.TickDir = 'out';
end
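Usage, assuming the dataset has been loaded into the 6-by-1000 matrix x as before:
barcode(x)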

Calculating chi-squared from Poisson-distributed data

So I have a table of values
v            :  0   1   2   3   4   5   6   7   8   9
# times obs. :  5  19  23  21  14  12   3   2   1   0
I am supposed to calculate chi-squared assuming the data fits a Poisson distribution with mean u = 3.
I have to group values >=6 all in one bin.
I am unsure of how to plot the Poisson distribution, and most of all how to control what goes into which bin, if that makes sense.
I have plotted a histogram using histc before..but it was with random numbers that I normalized. The amount in each bin was set for me.
I am super new...sorry if this question sucks.
You use bar to plot a bar graph in MATLAB.
So this is what you do:
v = 0:9;
f = [5 19 23 21 14 12 3 2 1 0];
fc = f(v < 6);               % copy the counts for v = 0..5 into a new array
fc(end+1) = sum(f(v >= 6));  % append the summed counts for v >= 6 as one bin
figure
bar(v(v <= 6), fc);          % 7 bars: v = 0..5 plus the ">= 6" bin
That should do the trick...
Now, you didn't actually ask about the chi-squared calculation. I would urge you not to put all values of v >= 6 into one bin for that calculation, as it will give you a really bad result.
There is another technique: if you use the hist function, you can choose the bins - and Matlab will automatically put things that exceed the limits into the last bin. So if your observations were in the array Obs, you can do what was asked with:
h = hist(Obs, 0:6);
figure
bar(0:6, h)
The advantage is that you have the array h available (frequencies) for other calculations.
If you do instead
hist(Obs, 0:6)
Matlab will plot the graph for you in a single statement (but you don't have the values...)
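For completeness, a hedged sketch of the chi-squared statistic itself, reusing f and fc from the snippet above (mean u = 3 as stated, with v >= 6 lumped into the last bin):
mu = 3;
N  = sum(f);                                  % total number of observations
p  = exp(-mu) * mu.^(0:5) ./ factorial(0:5);  % Poisson P(v = k) for k = 0..5
p(end+1) = 1 - sum(p);                        % P(v >= 6)
E  = N * p;                                   % expected counts per bin
chi2 = sum((fc - E).^2 ./ E)                  % Pearson chi-squared statistic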