What is the structure of a Torch dataset? - matlab

I am beginning to use Torch 7 and I want to build my own dataset for classification. I have already made pixel images and corresponding labels, but I do not know how to feed that data to Torch. I read other people's code and found that they use datasets with the extension '.t7', which I think is a tensor type. Is that right? I also wonder how I can convert my pixel images (I actually made them in MATLAB from the MNIST dataset) into the t7 format that Torch expects. There must be some internal dataset structure in the t7 format, but I cannot find it documented (the same goes for the labels).
To sum up: I have pixel images and labels and want to convert them to a t7 format compatible with Torch.
Thanks in advance!

The '.t7' dataset files hold Lua tables of labeled Tensors.
For example, the following Lua code:
-- download and unpack the small CIFAR-10 archive if it is not already present
if (not paths.filep("cifar10torchsmall.zip")) then
    os.execute('wget -c https://s3.amazonaws.com/torch7/data/cifar10torchsmall.zip')
    os.execute('unzip cifar10torchsmall.zip')
end
-- load the serialized dataset and inspect its structure
loaded_t7 = torch.load('cifar10-train.t7')
print(loaded_t7)
will print the following (here through iTorch):
{
data : ByteTensor - size: 10000x3x32x32
label : ByteTensor - size: 10000
}
This means the file contains a table of two ByteTensors, one keyed "data" and the other keyed "label".
To answer your question, you should first read your images (with torchx, for example: https://github.com/nicholas-leonard/torchx/blob/master/README.md ), then put them in a table alongside a Tensor of labels. The following code is just a draft to help you out. It assumes there are two classes and that all your images sit in the same folder, ordered by class.
require 'torchx';
require 'image';  -- needed for image.load
-- Read the whole dataset (the chosen extension is png)
files = paths.indexdir("/Path/to/your/images/", 'png', true)
data1 = {}
for i = 1, files:size() do
    local img1 = image.load(files:filename(i), 3)  -- force 3 channels
    table.insert(data1, img1)
end
-- Create the label table (class 1 comes first, then class 2)
label1 = {}
for i = 1, #data1 do
    if i <= number_of_images_of_the_first_class then  -- set this to your class split
        label1[i] = 1
    else
        label1[i] = 2
    end
end
-- Reshape the tables into Tensors
label = torch.Tensor(label1)
data = torch.Tensor(#data1, 3, 16, 16)  -- adjust 3x16x16 to your channel count and image size
for i = 1, #data1 do
    data[i] = data1[i]
end
-- Create the table to save
Data_to_Write = { data = data, label = label }
-- Save the table in /tmp
torch.save("/tmp/Saved_Data.t7", Data_to_Write)
This code could certainly be tidier, but it spells out every step, and it works with Torch 7 and Jupyter 5.0.0.
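Since you built your images in MATLAB, the simplest bridge is to write them to disk as PNG files ordered by class, so the Lua draft above can pick them up with paths.indexdir. A minimal MATLAB sketch, assuming a 28x28xN array images and an N-by-1 vector labels (both names are placeholders for your own variables):
% write the MNIST-style images out as PNGs, grouped so class 1 comes first
outdir = '/Path/to/your/images/';
[labels, order] = sort(labels);              % reorder so images are grouped by class
images = images(:,:,order);
for i = 1:size(images,3)
    imwrite(images(:,:,i), fullfile(outdir, sprintf('img_%05d.png', i)));
end
The zero-padded file names keep the images in class order when indexed, and number_of_images_of_the_first_class in the Lua draft is then simply sum(labels==1).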
Hope it helps.
Regards

Related

How to apply word2vec for k-means clustering?

I am new to word2vec. Using this method, I am trying to form clusters based on words that word2vec extracts from the abstracts of scientific publications. To this end, I first retrieved sentences from the abstracts via Stanford NLP and put each sentence on its own line in a text file. The text file required by deeplearning4j's word2vec was then ready to process (http://deeplearning4j.org/word2vec).
Since the texts come from scientific fields, there are a lot of mathematical terms and brackets. See the sample sentences below:
The meta-analysis showed statistically significant effects of pharmacopuncture compared to conventional treatment = 3.55 , P = .31 , I-2 = 16 % ) .
90 asymptomatic hypertensive subjects associated with LVH , DM , or RI were randomized to receive D&G herbal capsules 1 gm/day , 2 gm/day , or identical placebo capsules in double-blind and parallel fashion for 12 months .
After preparing the text file, I ran word2vec as below:
SentenceIterator iter = new LineSentenceIterator(new File(".../filename.txt"));
iter.setPreProcessor(new SentencePreProcessor() {
    @Override
    public String preProcess(String sentence) {
        return sentence.toLowerCase();
    }
});
// Split on whitespace in the line to get words
TokenizerFactory t = new DefaultTokenizerFactory();
t.setTokenPreProcessor(new CommonPreprocessor());

log.info("Building model....");
Word2Vec vec = new Word2Vec.Builder()
        .minWordFrequency(5)
        .iterations(1)
        .layerSize(100)
        .seed(42)
        .windowSize(5)
        .iterate(iter)
        .tokenizerFactory(t)
        .build();

log.info("Fitting Word2Vec model....");
vec.fit();

log.info("Writing word vectors to text file....");
// Write word vectors
WordVectorSerializer.writeWordVectors(vec, "abs_terms.txt");
This script creates a text file containing many words with their vector values, one word per row, as below:
pills -4.559159278869629E-4 0.028691953048110008 0.023867368698120117 ...
tricuspidata -0.00431067543104291 -0.012515762820839882 0.0074045853689312935 ...
As a subsequent step, this text file was used to form clusters via k-means in Spark. See the code below:
val rawData = sc.textFile("...abs_terms.txt")
// parse each row into a dense vector of its numeric components
val extractedFeatureVector = rawData.map(s => Vectors.dense(s.split(' ').slice(2,101).map(_.toDouble))).cache()
val numberOfClusters = 10
val numberOfIterations = 100
// we use the KMeans object provided by MLlib to train the model
val model = KMeans.train(extractedFeatureVector, numberOfClusters, numberOfIterations)
model.clusterCenters.foreach(println)
// get the cluster index for each word
val AltCompByCluster = rawData.map { row =>
  (model.predict(Vectors.dense(row.split(' ').slice(2,101).map(_.toDouble))),
   row.split(',').slice(0,1).head)
}
AltCompByCluster.foreach(println)
As a result of the Scala code above, I retrieved 10 clusters based on the word vectors produced by word2vec. However, when I checked the clusters, no obvious common words appeared; that is, I could not get reasonable clusters as I expected. Based on this bottleneck of mine, I have a few questions:
1) In some word2vec tutorials I have seen that no data cleaning is done; in other words, prepositions etc. are left in the text. How should I apply a cleaning procedure when using word2vec?
2) How can I visualize the clustering results in an explanatory way?
3) Can I use the word2vec word vectors as input to neural networks? If so, which kind of neural network (convolutional, recursive, recurrent) would be more suitable for my goal?
4) Is word2vec meaningful for my goal?
Thanks in advance.

Organising large datasets in Matlab

I have a problem I hope you can help me with.
I have imported a large dataset (a 200000 x 5 cell array) into MATLAB with the following structure:
'Year' 'Country' 'X' 'Y' 'Value'
Columns 1 and 5 contain numeric values, while columns 2 to 4 contain strings.
I would like to arrange all this information into a variable that would have the following structure:
NewVariable{Country_1 : Country_n , Year_1 : Year_n}(Y_1 : Y_n , X_1 : X_n)
All I can think of is looping through the whole dataset to find matches between the Country, Year, X and Y values by combining the if and strcmp functions, but that seems like the most inefficient way of achieving what I am trying to do.
Can anyone help me out?
Thanks in advance.
As mentioned in the comments, you can use categorical arrays:
% some arbitrary data:
country = repmat('ca',10,1);
country = [country; repmat('cb',10,1)];
country = [country; repmat('cc',10,1)];
T = table(repmat((2001:2005)',6,1), cellstr(country), ...
    cellstr(repmat(['x1'; 'x2'; 'x3'],10,1)), ...
    cellstr(repmat(['y1'; 'y2'; 'y3'],10,1)), ...
    randperm(30)', 'VariableNames', {'Year','Country','X','Y','Value'});
% convert all non-number data to categorical arrays:
T.Country = categorical(T.Country);
T.X = categorical(T.X);
T.Y = categorical(T.Y);
% here is an example for using categorical array:
newVar = T(T.Country=='cb' & T.Year==2004,:);
The table class is made for such things and is very convenient. Just expand the logical statement in the last line, T.Country=='cb' & T.Year==2004, to match your needs.
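If you then want the Y-by-X matrix of values for one country/year slice (the NewVariable layout you describe), unstack can pivot the filtered table. A sketch, assuming each (X,Y) pair occurs at most once per slice:
% pivot one slice: rows become Y categories, one column per X category
slice = T(T.Country=='cb' & T.Year==2004, {'X','Y','Value'});
M = unstack(slice, 'Value', 'X');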
Tell me if this helps ;)

Find and replace in matlab?

I want to do a find-and-replace-all in MATLAB (as we do in MS Office).
https://www.dropbox.com/s/hxfqunjwhnvkl1f/matlab.mat?dl=0
I have a cell array LUT_HS_complete (identifiers in column 1, protein names in column 2, and a summary in column 3); this is my lookup table. On the other hand, I have my protein-protein interaction data (named Second_layer, with identifiers in the first two columns and the score in column 3).
I want to replace the first two columns of Second_layer with the corresponding protein names from my lookup table.
I tried strmatch, but that didn't help me.
Source_gene = Second_layer(:,1); Source_gene = regexprep(Source_gene,'[-/\s]','');
Target_gene = Second_layer(:,2); Target_gene = regexprep(Target_gene,'[-/\s]','');
Inter_score = Second_layer(:,3);
%%
for i = 1:length(Source_gene(1:end,1))
    SG = strmatch(Source_gene(i), LUT_HS_complete(1:end,1), 'exact');
    renamed_Source_gene(SG,1) = LUT_HS_complete(SG,2);
end
for j = 1:length(Target_gene(1:end,1))
    TG = strmatch(Target_gene(j), LUT_HS_complete(1:end,1), 'exact');
    renamed_Target_gene(TG,1) = LUT_HS_complete(TG,2);
end
If you could find a solution, it would be a great help.
Might this work for you?
renamed_Second_layer(:,1) = LUT_HS_complete(cellfun(@(x) find(strcmp(x, LUT_HS_complete(:,1))), Second_layer(:,1)), 2);
renamed_Second_layer(:,2) = LUT_HS_complete(cellfun(@(x) find(strcmp(x, LUT_HS_complete(:,1))), Second_layer(:,2)), 2);
renamed_Second_layer(:,3) = Second_layer(:,3);
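If some identifiers might be missing from the lookup table, an ismember-based version degrades more gracefully than find/strcmp, since unmatched rows are simply left unchanged. A sketch along the same lines (untested against your .mat file):
% replace identifiers with protein names; unmatched entries are kept as-is
renamed_Second_layer = Second_layer;
for col = 1:2
    [tf, loc] = ismember(Second_layer(:,col), LUT_HS_complete(:,1));
    renamed_Second_layer(tf,col) = LUT_HS_complete(loc(tf),2);
end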

Read data from a website with Matlab

Can someone show me how to read data from this website: http://www.amlbook.com/data/zip/features.train
I used to copy and paste to form an array in my MATLAB editor, but this time the amount of data seems too large for that...
% urlread fetches the page body as one string (function names are case-sensitive, so lowercase)
block = urlread('http://www.amlbook.com/data/zip/features.train');
readData = textscan(block, '%f%f%f', 'delimiter', char(9));
train1 = readData{1};
train2 = readData{2};
train3 = readData{3};
clear readData
Three 7291×1 double arrays are imported, representing the three columns of the file on the website.
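On newer MATLAB releases (R2014b and later), urlread is discouraged in favour of webread; a sketch of the same fetch:
% fetch the file as plain text and parse it the same way
opts = weboptions('ContentType','text');
block = webread('http://www.amlbook.com/data/zip/features.train', opts);
readData = textscan(block, '%f%f%f');   % textscan treats runs of whitespace as delimiters by default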

How to load a specially formatted data file into matlab?

I need to load a data file, test.dat, into MATLAB. The contents of the data file look like this:
*a682 1233~0.2
*a2345 233~0.8 345~0.2 4567~0.3
*a3457 345~0.9 34557~1.2 34578~0.2 9809~0.1 2345~2.9 23452~0.9 334557~1.2 234578~0.2 19809~0.1 23452~2.9 3452~0.9 4557~1.2 3578~0.2 92809~0.1 12345~2.9 232452~0.9 33557~1.6 23478~0.6 198099~2.1 234532~2.9 …
How can I read this type of file into MATLAB and use the terms such as *a2345 to identify a row, which links to the corresponding terms 233~0.8 345~0.2 4567~0.3?
Thanks.
Because each of the rows has a different size, you either have to use a cell array, a structure, or pad a matrix with NaN or zeros. I chose a cell array, hope that is OK! Here is the code:
datfile = 'test.dat';
txt = fileread(datfile);                      % read the whole file into one string
ids = regexp(txt, '\*a\d+', 'match');         % row identifiers such as *a682
data(:,1) = ids';
rows = regexp(txt, '\*a\d+', 'split');        % the text between identifiers
rows = rows(2:end)';                          % drop the empty piece before the first id
for i = 1:size(rows,1)
    data{i,2} = regexp(rows{i}, '\d+~[\d.]+', 'match');  % tokens like 345~0.9
end
What this creates is a cell array called data where the first column of every row holds your *a682-style id and the second column holds a cell array of that row's data tokens. To get at them you can use:
data{1}
to show the id,
data{1,2}
to show the cell contents, and
data{1,2}{1}
to show a specific data point.
This should work and is relatively simple!
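If you then want the tokens as numbers rather than strings, sscanf can split each id~value pair; a sketch (the row index 2 is only for illustration):
% convert the 'id~value' tokens of one row into an N-by-2 numeric matrix
tok = data{2,2};                % e.g. {'233~0.8','345~0.2','4567~0.3'}
pairs = cell2mat(cellfun(@(s) sscanf(s,'%f~%f')', tok(:), 'UniformOutput', false));
% pairs(:,1) holds the ids, pairs(:,2) the values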