MATLAB performance: sum and loop

I am trying to implement a function that mimics a 2D convolution on an image (https://www.tensorflow.org/api_docs/python/tf/nn/conv2d). I am not allowed to use a library for this. For a batch of images x(n, w, h, c) with size (50, 28, 28, 1), I calculate 32 features per pixel like so:
for im = 1:batch_size
    for i = 1:self.shape_input(2)
        for j = 1:self.shape_input(3)
            % element-wise product of the padded input window with the filter weights
            y_unagg = x_virtual(im, i:(self.pad*2+i), j:(self.pad*2+j), :, :) .* self.W_hat;
            % sum over the window rows, columns and input channels,
            % keeping one value per output filter
            y(im, i, j, :) = reshape(sum(sum(sum(y_unagg, 2), 3), 4), ...
                1, 1, 1, self.shape_filter(4));
        end
    end
end
This takes roughly half a second. The second time around I calculate a batch with size (50, 14, 14, 32) and map it to 64 features. This time it takes 6 seconds. Is there any way I could speed this up?

Related

UMAP validation: problem calculating trustworthiness_vector

I have a dataset with over 200,000 samples and 256 features. I used UMAP with n_components = 8, 16, 32, 64 to reduce the data from 256 dimensions down to 8, 16, 32, and 64, respectively. I do not have labels, so I want to use UMAP's validation on the embedded data. But I got the error "0xC00000FD" (a stack overflow) when I ran umap.validation.trustworthiness_vector(source=df_raw_data.to_numpy(), embedding=df_embedding.to_numpy(), max_k=K) with K = 30, and a segmentation fault on WSL. How can I handle this situation?
I have tried reducing n_components to 2, but the problem still happens.

ORL dataset: PyTorch dataset input data

I am trying to make a neural network in PyTorch to recognize faces from the famous Olivetti faces dataset (ORL dataset). The images are 32x32 = 1024 pixels each, and there are 400 of them in total across 40 classes. I loaded the dataset from the .mat file into familiar Python variables.
from scipy.io import loadmat
import pandas as pd

orl = loadmat('ORL_32x32.mat')
x = orl["fea"]
y = orl["gnd"]
df = pd.DataFrame(x)
df_label = pd.DataFrame(y)
df.to_csv("data.csv", index=False)
df_label.to_csv("y.csv", index=False)
And after that I did the following code
label = torchvision.transforms.functional.to_tensor(df_label.values) #shape torch.Size([1, 400, 1])
df_tensor = torchvision.transforms.functional.to_tensor(df.values) #shape torch.Size([1, 400, 1024])
After that, I created a tensor dataset and started training through epochs.
import torch
from torch.utils.data import TensorDataset

trn = TensorDataset(df_tensor, label)
#print(type(trn))
trn_dataloader = torch.utils.data.DataLoader(trn, batch_size=400, shuffle=False, num_workers=4)
for epoch in range(EPOCHS):
    for batch_idx, (data, target) in enumerate(trn_dataloader):
        print(data.shape)  # torch.Size([1, 400, 1024])
This is a big problem, because data.shape should be torch.Size([1, 1, 1024]), i.e. just one image, not the whole dataset treated as a single image.
What is the best way to solve this?
You have specified a batch size of 400 for the dataloader, which you stated is the number of images in the dataset, so the data tensor in the dataloader loop contains all of the images. Once the dimensions are fixed as described below, setting the batch size to 1 will give data the shape (1, 1, 1024).
Depending on how you train your model you will adjust the batch size accordingly, but you usually do not train with a batch size of 1.
Since you are working with PyTorch, I would advise reshaping your data into the standard layout for images, which is (batch size, number of channels, height, width). It looks like you are working with flattened images, so in that case the shape should be (batch size, number of features).
To me it seems like your data.csv is arranged such that, when loaded, the channel and batch dimensions get mixed up. This can be fixed by permuting the tensor:
df_tensor = df_tensor.permute(1, 0, 2) # Shape: (1, 400, 1024) -> (400, 1, 1024)
Or by dropping the channel dimension, since these are flattened images:
df_tensor = df_tensor.squeeze(0) # Shape: (1, 400, 1024) -> (400, 1024)
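Putting it together, a minimal sketch of the corrected pipeline (assuming df_tensor and label are the tensors built above; the batch size of 40 is just an illustrative choice, not taken from the question):
import torch
from torch.utils.data import TensorDataset, DataLoader

# Drop the channel dimension added by to_tensor: (1, 400, 1024) -> (400, 1024)
features = df_tensor.squeeze(0).float()
# Labels: (1, 400, 1) -> (400,)
targets = label.squeeze(0).squeeze(-1).long()

trn = TensorDataset(features, targets)
# Illustrative batch size; tune it to your training setup
trn_dataloader = DataLoader(trn, batch_size=40, shuffle=True)

for data, target in trn_dataloader:
    print(data.shape)    # torch.Size([40, 1024])
    print(target.shape)  # torch.Size([40])
    break
With that layout, each iteration of the dataloader yields a batch of flattened images rather than the whole dataset as one sample.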

DBSCAN on 3d coordinates doesn't find clusters

I'm trying to cluster points in a 3D coordinates DataFrame of 1428 points.
The clusters are relatively flat planes, elongated clouds of points. They are very obvious clusters, so I was hoping to use unsupervised clustering that does not need the expected number of clusters as input. KMeans does not separate them properly, and it does require the number of clusters:
[KMeans plot results]
The data looks as follows:
5 6 7
0 9207.495280 18922.083277 4932.864
1 5831.199280 3441.735280 5756.326
2 8985.735280 12511.719280 7099.844
3 8858.223280 28883.151280 5689.652
4 6801.399277 6468.759280 7142.524
... ... ... ...
1423 10332.927277 22041.855280 5136.252
1424 6874.971277 12937.563277 5467.216
1425 8952.471280 28849.887280 5710.522
1426 7900.611277 19128.255280 4803.122
1427 10234.635277 18734.631280 5631.286
[1428 rows x 3 columns]
I was hoping DBSCAN would deal better with this data. However, when I try the following (I played around with eps and min_samples but without success):
from sklearn.cluster import DBSCAN
dbscan = DBSCAN(eps=10, min_samples = 50)
clusters = dbscan.fit_predict(X)
print('Clusters found', dbscan.labels_)
len(clusters)
I get this output:
Clusters found [-1 -1 -1 ... -1 -1 -1]
1428
I am confused about how to get this to work, especially since KMeans did work:
import sklearn.cluster as sk_cluster

kmeans = sk_cluster.KMeans(init='k-means++', n_clusters=9, n_init=50)
kmeans.fit_predict(X)
centroids = kmeans.cluster_centers_
kmeans_labels = kmeans.labels_
error = kmeans.inertia_
print("The total error of the clustering is: ", error)
print('\nCluster labels')
print(kmeans_labels)
The total error of the clustering is: 4994508618.792263
Cluster labels
[8 0 7 ... 3 8 1]
Remember this golden rule:
Always perform normalization on your data before feeding it to an ML / DL algorithm.
The reason is that your columns have different ranges: one column might span [10000, 20000] and another [4000, 5000]. When you plot these coordinates, the points are stretched very differently along each axis, and distance-based clustering/classification will not work well (regression might). Scaling brings every column into the same range while preserving the relative distances within each one, just at a different scale, much like zooming in and out on Google Maps changes the scale without changing the geography.
You are free to choose the normalization algorithm; sklearn.preprocessing offers many of them.
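As a tiny illustration of why the scale matters (hypothetical numbers, not from the question's data):
import numpy as np
from sklearn.preprocessing import MinMaxScaler

# Two points: a large-range column and a small-range column
a = np.array([[10000.0, 4000.0],
              [20000.0, 5000.0]])

# Raw Euclidean distance is dominated by the first column
print(np.linalg.norm(a[0] - a[1]))  # ~10049.9

# After MinMax scaling, both columns contribute equally
a_scaled = MinMaxScaler().fit_transform(a)
print(np.linalg.norm(a_scaled[0] - a_scaled[1]))  # ~1.414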
Edit:
Use this code:
import numpy as np
from sklearn.preprocessing import MinMaxScaler
from sklearn.cluster import DBSCAN

scaler = MinMaxScaler()
scaler.fit(X)
X_norm = scaler.transform(X)

dbscan = DBSCAN(eps=0.05, min_samples=3, leaf_size=30)
clusters = dbscan.fit_predict(X_norm)
np.unique(dbscan.labels_)
array([-1, 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15,
16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32,
33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47])
Since DBSCAN is a density-based approach, I also tried sklearn's normalizer (from sklearn.preprocessing import normalize), which rescales each sample to unit norm rather than putting the features on a common scale. It didn't work, and it shouldn't for DBSCAN, which needs each feature to contribute at a similar scale to the density estimate.
So I went with the MinMax scaler, which puts every feature into the same [0, 1] range. One thing to note: since the scaled data points are all below 1, the epsilon should be chosen in a similarly small range as well.
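A minimal sketch, assuming X_norm is the MinMax-scaled array from above, that inspects nearest-neighbour distances so eps can be picked on that same scale:
import numpy as np
from sklearn.neighbors import NearestNeighbors

# For each point, distances to its 3 closest points in X_norm
# (the point itself is included at distance 0)
nbrs = NearestNeighbors(n_neighbors=3).fit(X_norm)
distances, _ = nbrs.kneighbors(X_norm)

# Sorted distance to the farthest of those neighbours;
# a knee in this curve is a common heuristic for choosing eps
k_dist = np.sort(distances[:, -1])
print(k_dist[:5], k_dist[-5:])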
Kudos :)

clustering and matlab

I'm trying to cluster some data I have from the KDD Cup 1999 dataset. The output from the file looks like this:
0,tcp,http,SF,239,486,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,8,8,0.00,0.00,0.00,0.00,1.00,0.00,0.00,19,19,1.00,0.00,0.05,0.00,0.00,0.00,0.00,0.00,normal.
with 48 thousand different records in that format. I have cleaned the data up and removed the text, keeping only the numbers. The output now looks like this:
I created a comma-delimited file in Excel and saved it as a CSV file, then created a data source from the CSV file in MATLAB. I've tried running it through the FCM toolbox in MATLAB (findcluster outputs 38 data types, which is expected with 38 columns).
The clusters, however, don't look like clusters, or it's not accepting the data and working the way I need it to.
Could anyone help me find the clusters? I'm new to MATLAB, so I don't have any experience with it, and I'm also new to clustering.
The method:
1. Choose the number of clusters (K)
2. Initialize centroids (K patterns randomly chosen from the data set)
3. Assign each pattern to the cluster with the closest centroid
4. Calculate the mean of each cluster to be its new centroid
5. Repeat from step 3 until a stopping criterion is met (no pattern moves to another cluster)
This is what I'm trying to achieve:
This is what I'm getting:
load kddcup1.dat
plot(kddcup1(:,1),kddcup1(:,2),'o')
[center,U,objFcn] = fcm(kddcup1,2);
Iteration count = 1, obj. fcn = 253224062681230720.000000
Iteration count = 2, obj. fcn = 241493132059137410.000000
Iteration count = 3, obj. fcn = 241484544542298110.000000
Iteration count = 4, obj. fcn = 241439204971005280.000000
Iteration count = 5, obj. fcn = 241090628742523840.000000
Iteration count = 6, obj. fcn = 239363408546874750.000000
Iteration count = 7, obj. fcn = 238580863900727680.000000
Iteration count = 8, obj. fcn = 238346826370420990.000000
Iteration count = 9, obj. fcn = 237617756429912510.000000
Iteration count = 10, obj. fcn = 226364785036628320.000000
Iteration count = 11, obj. fcn = 94590774984961184.000000
Iteration count = 12, obj. fcn = 2220521449216102.500000
Iteration count = 13, obj. fcn = 2220521273191876.200000
Iteration count = 14, obj. fcn = 2220521273191876.700000
Iteration count = 15, obj. fcn = 2220521273191876.700000
figure
plot(objFcn)
title('Objective Function Values')
xlabel('Iteration Count')
ylabel('Objective Function Value')
maxU = max(U);
index1 = find(U(1, :) == maxU);
index2 = find(U(2, :) == maxU);
figure
line(kddcup1(index1, 1), kddcup1(index1, 2), 'linestyle',...
'none','marker', 'o','color','g');
line(kddcup1(index2,1),kddcup1(index2,2),'linestyle',...
'none','marker', 'x','color','r');
hold on
plot(center(1,1),center(1,2),'ko','markersize',15,'LineWidth',2)
plot(center(2,1),center(2,2),'kx','markersize',15,'LineWidth',2)
Since you are new to machine-learning/data-mining, you shouldn't tackle such advanced problems. After all, the data you are working with was used in a competition (KDD Cup'99), so don't expect it to be easy!
Besides, the data was intended for a classification task (supervised learning), where the goal is to predict the correct class (bad/good connection). You seem to be interested in clustering (unsupervised learning), which is generally more difficult.
This sort of dataset requires a lot of preprocessing and clever feature extraction. People usually employ domain knowledge (network intrusion detection) to obtain better features from the raw data; directly applying simple algorithms like K-means will generally yield poor results.
For starters, you need to normalize the attributes so they are all on the same scale: when computing the Euclidean distance in step 3 of your method, features with values such as 239 and 486 will dominate the features with small values such as 0.05, thus distorting the result.
Another point to remember is that too many attributes can be a bad thing (curse of dimensionality), so you should look into feature selection or dimensionality reduction techniques.
Finally, I suggest you familiarize yourself with a simpler dataset...
