How to prepare input file for svmstruct - classification

I want to use svmstruct for my Named Entity Recognition task. Some of my features for each token are not in numerical format (mostly textual features such as n-char affixes or word shape, etc.). Since svmstruct's input format is the same as the svmlight format, I would like to know how I should convert those textual features to numerical ones?
All the best

Basically, you need to encode your text data as binary categories.
For example, let's say you have the data:
affix     shape
====================
ing       lower
          initcap
ed        allcaps
What you want to send to svmstruct is something like this:
affix_ing:1 shape_lower:1
shape_initcap:1
affix_ed:1 shape_allcaps:1
Now you can't use words as column identifiers, but svmstruct uses a sparse format, so you can use widely separated column numbers as long as they are unique.
This is a great application for a hash function. So the technique is to make up column IDs on the fly and dummy-encode your discrete data:
hash(colName + colValue) => 1
Depending on your data you might not need colName. Is a colName likely to collide with a colValue?
You can use a hash function like MurmurHash or CityHash to get a huge index space with fast calculation and few collisions.
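To make the trick concrete, here is a minimal sketch (MATLAB here, but any language works; the simple polynomial hash below stands in for MurmurHash/CityHash, and the feature values are made up):
% Dummy-encode one token's textual features as svmlight-style "index:1" pairs.
feats = {'affix','ing'; 'shape','lower'};   % textual features of one token
dim   = 2^20;                               % size of the hashed feature space
idx   = zeros(1, size(feats,1));
for k = 1:size(feats,1)
    key = [feats{k,1} '_' feats{k,2}];      % e.g. 'affix_ing'
    h = 0;                                  % simple polynomial string hash
    for c = double(key)
        h = mod(h*31 + c, dim);
    end
    idx(k) = h + 1;                         % svmlight indices start at 1
end
idx  = unique(idx);                         % indices must be unique and increasing
line = strtrim(sprintf('%d:1 ', idx));      % e.g. '27001:1 901223:1'
% prepend the label/target required by the svmstruct/svmlight line format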

Related

How to handle flag/exception values

In Paraview, I am working with a dataset that uses the value -99999 as a flag value. I'd like to be able to manipulate the dataset without these values causing issues with things like glyphs and colorbars. Nominally, I'd like the data to be "ignored".
A little about the data: I've got both scalar and vector point data, sitting on a fixed 2D spatial mesh at set temporal intervals.
Although -99999 is very far beyond the values the data might otherwise show, using a threshold filter isn't an option because the flag can occur at different places at different times. The way Paraview's threshold filter works means that the point ID of a fixed point in space will change as the number of filtered points changes through time.
In case it matters, the data are in a netCDF file that is read in via an XMF header file and the XDMF Reader, since the CF reader doesn't work (possibly because of my unstructured triangular mesh). The netCDF data have the _FillValue global attribute; however, this doesn't appear to be picked up by Paraview.
You could use a Programmable Filter to replace values at or below -99999 with NaN. Provided the data is not a vtkMultiBlockDataSet, you can use the following script in the Programmable Filter:
import numpy as np
from vtk.numpy_interface import dataset_adapter as dsa
# name of the point-data array to clean (replace 'name' with your array's name)
name = 'name'
# flag threshold: values at or below this limit become NaN
limit = -99999
# 'inputs' and 'self' are provided by ParaView inside the Programmable Filter
array = inputs[0].PointData[name].copy()
array[array <= limit] = np.nan
out = dsa.WrapDataObject(self.GetOutput())
out.PointData.append(array, name)
Note: if the data of interest is cell data, replace PointData with CellData in the script.
Note 2: the script was tested on ParaView 5.6.

Is there a native function in MATLAB that exports a string array to CSV and vice versa?

Now that string arrays are a thing since R2016b, is there a native function that exports a string array to a CSV file and vice versa?
A function that fills the same role that csvread and csvwrite filled for numeric arrays in the old days, but for string arrays. And to relax the requirement, say the string array contains columns of pure strings and columns of pure doubles; stock prices with timestamp strings would be an example.
Native = not looping with fprintf. But if you are certain MATLAB hasn't included any such functions yet, feel free to answer with the best approach thus far, without any restrictions.
Without any native function, pre-R2013a, looping with fprintf was the only way I could think of, and it was awful. Given its past reputation for inefficiency, I still don't trust looping in MATLAB.
Post-R2016b, one can convert a string array to a cell array with num2cell and then to a table with cell2table. The table can be written to a CSV file with writetable. This is actually fast, as writetable is fast; only num2cell slows the whole process down a little. However, formatting is impossible along the way.
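For reference, that pipeline looks roughly like this (a quick sketch with a made-up string array S; not from any shipped example):
% Pre-R2019a route: string array -> cell array -> table -> CSV.
S = ["AAPL" "170.25"; "GOOG" "1050.10"];    % made-up stock data
C = num2cell(S);                            % string array to cell array
T = cell2table(C);                          % cell array to table
writetable(T, 'prices.csv', 'WriteVariableNames', false);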
Post-R2019a, cell2table can be skipped thanks to writecell, which is nice, but the (slightly) time-consuming step is still num2cell, and formatting should still be impossible. (I don't have R2019a to test it.)
Is there a better way, or is this another one of those basic things left to be desired in MATLAB?
writematrix and readmatrix are the functions to do that since R2019a.
%If S1 is a string array that you want to write to foobar.csv, then:
writematrix(S1,'foobar.csv');
%To read this csv file back into MATLAB as the same string array, use:
S2 = readmatrix('foobar.csv','OutputType','string');
%Verifying the result:
isequal(S1,S2)
ans =
  logical
   1
Loops have been significantly improved since R2015b. Not all loops are slow, and not all vectorised versions are faster. The correct approach is to use timeit when in doubt.
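For example, when in doubt you could time the call above directly (a hypothetical check, reusing S1 and foobar.csv from this answer):
% timeit runs the function handle several times and returns a typical time
t = timeit(@() writematrix(S1, 'foobar.csv'));
fprintf('writematrix took %.4f s\n', t);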

dlmwrite for specific rows and columns in matlab

I have 2D data in 10x10 matrices, and it looks like this:
here is the table
However, the data is updated and appended after every dt of the calculation, therefore I would like to reorganize it and write it out for specific rows and columns; you may see this table in the link.
I normally use this code to write it:
t = t + dt;
if ss == 2000
    % append the current 10x10 matrix Tnew to the output file
    dlmwrite('d:\Model_Results_Theta.txt', Tnew, '-append');
    ss = 0;
end
Could you recommend a different way to organize the data based on specific rows and columns using MATLAB? Thanks in advance!
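One possible direction (a sketch, not from the thread; the rows/cols choices below are hypothetical) is to index the submatrix you want before appending it:
% Append only rows 1:5 and columns [1 3 5] of the current matrix Tnew.
rows = 1:5;                      % rows of interest
cols = [1 3 5];                  % columns of interest
dlmwrite('d:\Model_Results_Theta.txt', Tnew(rows, cols), '-append');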

MATLAB Saving and Loading Feature Vectors

I am trying to load feature vectors into classifiers such as a k-nearest neighbors classifier.
I have my code for GLCM, so I get contrast, correlation, energy, and homogeneity as numbers (feature vectors).
My question is, how can I save every set of feature vectors from all the training images? I have seen somewhere that people had a .set file to load into classifiers (maybe it is a special case for that particular classifier toolbox), for example:
load 'mydata.set';
I suppose it does not have to be a .set file.
I'd just need a way to store all the feature vectors from all the training images in a separate file that can be loaded.
I've googled and found this, which may be useful, but I am not entirely sure.
Thanks for your time and help in advance.
Regards.
If you arrange your feature vectors as the columns of an array called X, then just issue the command
save('some_description.mat','X');
Alternatively, if you want the save file to be readable, say in ASCII, then just use this instead:
save('some_description.txt', 'X', '-ASCII');
Later, when you want to re-use the data, just say
var = {'X'}; % <-- You can modify this if you want to load multiple variables.
load('some_description.mat', var{:});
X = load('some_description.txt', '-ascii'); % <-- Use this instead if you saved to the .txt file (ASCII files do not store variable names).
Then the variable named 'X' will be loaded into the workspace and its columns will be the same feature vectors you computed before.
You will want to replace the some_description part of each file name above and instead use something that allows you to easily identify which data set's feature vectors are saved in the file (if you have multiple data sets). Your array of feature vectors may also be called something besides X, so you can change the name accordingly.
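If it helps, here is a rough sketch of assembling GLCM features from a set of training images into the columns of such an X before saving (the file names are hypothetical, and graycomatrix/graycoprops require the Image Processing Toolbox):
% Column i of X holds [contrast; correlation; energy; homogeneity] of image i.
imgFiles = {'train1.png', 'train2.png', 'train3.png'};   % your training images
X = zeros(4, numel(imgFiles));
for i = 1:numel(imgFiles)
    I = imread(imgFiles{i});
    if size(I, 3) == 3
        I = rgb2gray(I);                 % GLCM expects a grayscale image
    end
    stats = graycoprops(graycomatrix(I));
    X(:, i) = [stats.Contrast; stats.Correlation; ...
               stats.Energy; stats.Homogeneity];
end
save('training_features.mat', 'X');      % load it later as shown above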

Text Categorization datasets for MATLAB

I am looking for a reliable dataset for Text categorization tasks in MATLAB format.
I want to run some experiments and don't want to spend too much time preprocessing the text and creating feature vectors. I need something ready so that I can plug it into my algorithm. I found MATLAB files for the Reuters dataset here: link text
Everything is ready here, but I want to use a subset of it. In it, "fea" contains the feature vectors for each document. However, it seems that it is not a normal matrix. I want, for example, to select the top 1000 documents in "fea". If you just download it and load it into MATLAB, you will see what I mean.
So, if possible, I need a solution for the above-mentioned dataset, or any alternative datasets.
Thanks in advance.
It is stored as a sparse matrix. Extract the first 1000 documents (rows), and if you have enough memory, you can convert it to a full dense matrix:
load Reuters21578.mat
TF = full( fea(1:1000,:) );
Let's check the variables we have:
>> whos
  Name          Size                 Bytes  Class     Attributes

  TF          1000x18933         151464000  double
  fea         8293x18933           4749196  double    sparse
  gnd         8293x1                 66344  double
  testIdx     2347x1                 18776  double
  trainIdx    5946x1                 47568  double
so you can see TF is now about 150MB.
Other than that, the rest is self-explanatory:
fea: term-frequency matrix, rows are documents, columns are terms
gnd: category of each document, where numel(unique(gnd)) == 65
trainIdx/testIdx: split of instances (documents) for classification purposes, contains indices of rows, used as: tr = fea(trainIdx,:); tt = fea(testIdx,:);