I want to do text categorization on a dataset of news. I have a lot of features like subject, keyword, summary, etc. All of these features are stored in one cell array of structs, each struct looking like this:
label: 'misc.forsale'
subj: ' Motorcycle wanted.'
keyword: [1x190 char]
reference: []
organization: ' Worcester Polytechnic Institute'
from: ' kedz#bigwpi.WPI.EDU (John Kedziora)'
summary: []
lines: ' 11'
vocab: [4x2 double]
I want to classify them with class = classify(test, train, target, 'diaglinear'); but this function only receives arrays as input and does not accept cells or structs.
I can't convert this cell array to one multidimensional array because the number of features varies (for example, one subject has two words and another has three words).
What can I do?
Do some feature extraction first. For example, tokenize the strings, then use TF-IDF.
You can include the key with the tokens. This is a common practice in information retrieval. See the Xapian manual for an example.
Usually, you will do some stemming, e.g. Examples -> exampl. Now, just add a prefix to make the words distinct depending on their occurrence. E.g. Sexampl when the subject contained example and Kexampl when it was a keyword.
Then you have a "bag of words" representation that is used everywhere. They even do this for mining images; it's called "visual words" then. These aren't English-language words either.
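To make the prefixing idea concrete, here is a minimal sketch in Python (not MATLAB, purely as an illustration). The field names subj and keyword come from the struct in the question. The prefix map, the toy crude_stem function, and the plain log(N/df) weighting are assumptions chosen for brevity, not a recommendation of a particular stemmer or TF-IDF variant.

```python
import math
import re
from collections import Counter

# Hypothetical field-to-prefix map, following the Sexampl / Kexampl idea.
FIELD_PREFIXES = {"subj": "S", "keyword": "K"}


def crude_stem(word):
    """Toy stemmer (assumption): strips a few suffixes. Use a real stemmer in practice."""
    for suffix in ("es", "s", "e"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)]
    return word


def prefixed_tokens(doc):
    """Tokenize each field and tag every stemmed token with its field prefix."""
    tokens = []
    for field, prefix in FIELD_PREFIXES.items():
        text = doc.get(field) or ""  # empty fields ([] in the question) yield no tokens
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            tokens.append(prefix + crude_stem(word))
    return tokens


def tfidf_matrix(docs):
    """Return (vocab, rows); every row has one weight per vocabulary entry."""
    token_lists = [prefixed_tokens(d) for d in docs]
    vocab = sorted({t for tokens in token_lists for t in tokens})
    n_docs = len(docs)
    # Document frequency: in how many documents does each token occur?
    doc_freq = Counter(t for tokens in token_lists for t in set(tokens))
    rows = []
    for tokens in token_lists:
        counts = Counter(tokens)
        rows.append([counts[t] * math.log(n_docs / doc_freq[t]) for t in vocab])
    return vocab, rows
```

Because every row has the same length as the vocabulary, the result is a plain numeric matrix, however many words each subject or keyword field originally contained.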
I am dynamically storing data from different data recorders in timetables, nested in a structure DATA, such as DATA.Motor (timetable with motor data), DATA.Actuators (timetable with actuators data) and so on.
My objective is to have a function that synchronizes and merges these timetables so I can work with one big timetable.
I am trying to use synchronize to merge and synchronize those timetables:
fields = fieldnames(DATA);
TT = synchronize(DATA.(fields{1:end}));
but get the following error:
Expected one output from a curly brace or dot indexing expression, but there were 3 results.
This confuses me because DATA.(fields{1}) returns the timetable of the first field name of the DATA structure.
Any thoughts on how I can solve this are greatly appreciated.
The problem here is that fields{1:end} is returning a "comma-separated list", and you're not allowed to use one of those as a struct dot-index expression. I.e. it's as if you tried the following, which is not legal:
DATA.('Motor','Actuators')
One way to fix this is to pull out the values from DATA into a cell array, and then you can use {:} indexing to generate the comma-separated list as input to synchronize, like this:
DATA = struct('Motor', timetable(datetime, rand), ...
'Actuators', timetable(datetime, rand));
DATA_c = struct2cell(DATA);
TT = synchronize(DATA_c{:});
I'm a beginner to Weka and I'm trying to use it for text classification. I have seen how to use the StringToWordVector filter for classification. My question is, is there any way to add more features to the text I'm classifying? For example, if I wanted to add POS tags and named entity tags to the text, how would I use these features in a classifier?
It depends on the format of your dataset and the preprocessing steps you perform. For instance, let us suppose that you have pre-POS-tagged your texts, looking like:
The_det dog_n barks_v ._p
So you can build a specific tokenizer (see weka.core.tokenizers) to generate two tokens per word: one would be "The" and the other would be "The_det", so you keep the tag information.
If you want only tagged words, then you can just ensure that "_" is not a delimiter in the weka.core.tokenizers.WordTokenizer.
My advice is to have both the words and the tagged words, so a simpler way would be to write a script that joins the texts and the tagged texts. From a file containing "The dog barks" and another one containing "The_det dog_n barks_v ._p", it would generate a file with "The The_det dog dog_n barks barks_v . ._p". You may even forget about the order unless you are going to make use of n-grams.
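A script along those lines could look like the sketch below (Python, illustration only). It takes a shortcut relative to the two-file description: it recovers each plain word from the tagged token itself, assuming the tag always follows the last underscore. The function name interleave_tagged is hypothetical.

```python
def interleave_tagged(tagged_text, sep="_"):
    """Emit each plain word followed by its tagged form, keeping the order."""
    out = []
    for token in tagged_text.split():
        # Split on the LAST separator so words containing "_" keep their prefix.
        word, found, _tag = token.rpartition(sep)
        if found and word:
            out.append(word)  # the plain word, e.g. "dog"
        out.append(token)     # the tagged word, e.g. "dog_n"
    return " ".join(out)
```

The resulting text can then go through an ordinary word tokenizer, as long as "_" is not one of its delimiters.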
I have some MATLAB code that prints a cell array to Excel. The size of the matrix is 50x13.
Row 1 holds the column names.
Column 1 holds dates and the remaining columns hold numbers.
The date format defined in the code is:
dFormat = struct;
dFormat.Style = struct( 'NumberFormat', '_(* #,##0.00_);_(* (#,##0.00);_(* "-"??_);_(#_)' );
dFormat.Font = struct( 'Size', 8 );
Can someone please explain to me what the dFormat.Style code means?
Thanks
The first line creates an empty struct (struct with no fields) called dFormat. A structure can contain pretty much anything in one of its fields, including another structure. The second line adds a field called 'Style' to the dFormat struct and sets it equal to another struct with a field called 'NumberFormat'. The 'NumberFormat' field is set equal to that long string of characters. You now have a structure of structures. The third line is similar to the second.
Note that the first line isn't really necessary unless dFormat already exists and needs to be "zeroed out", as assigning to dFormat.Style will create it implicitly. However, using the struct function can make code more readable in some cases, as objects use a similar notation for accessing methods and properties. In other words, all of your code could be replaced with:
dFormat.Style.NumberFormat = '_(* #,##0.00_);_(* (#,##0.00);_(* "-"??_);_(#_)';
dFormat.Font.Size = 8;
See this video from the MathWorks for more details and this list of helpful structure functions and examples.
@horchler already elaborated on structs, but I imagine you may actually be more interested in the content of this struct's Style field.
In case you are solely interested in _(* #,##0.00_);_(* (#,##0.00);_(* "-"??_);_(#_), that does not really look like something MATLAB-related to me.
My best guess is that this code is later fed to some other program, for example to build an Excel file.
I previously posted on how to display and access structure array content. The file consisted of states, capitals, and populations. Now I'm having trouble creating a new structure by organizing these states in alphabetical order. I did this with the sortrows function, and I tried pairing up the values of population and the capitals with the alphabetical states, but I can't seem to get it to be an array. I want it to be an array so I can write it to a file. This is what I have so far:
fid=fopen('Regions_list.txt')
file=textscan(fid,'%s %s %f','delimiter',',')
State=file{1}
Capital=file{2}
Population=num2cell(file{3})
sortedStates=sortrows(State)
n=length(State)
regions=struct('State',State,...
'Capital',Capital,...
'Population',Population)
for k=1:n;
region=sortedStates(k);
state_name={regions.State};
state_reference=strcmpi(state_name,region);
state_info=regions(state_reference)
end
I hope I'm making myself clear.
Use this to sort the cell array read in (no conversion needed), then write to file with this.
With respect to your sorting issue, the function SORT will return as its second output a sort index, which can be used to apply the same sort order to other arrays. For example, you could sort your arrays before you create your structure array:
[sortedStates,sortIndex] = sort(State);
regions = struct('State',sortedStates,...
'Capital',Capital(sortIndex),...
'Population',Population(sortIndex));
Or you could apply your sorting after you create your structure array:
regions = struct('State',State,...
'Capital',Capital,...
'Population',Population);
[~,sortIndex] = sort({regions.State});
regions = regions(sortIndex);
However, I'm not sure what you mean when you say "I want it to be an array so I can write to a file."
Sql query:
select * from test_mart
where replace(replace(replace(replace(replace(replace(lower(name),'+'),'_'),'the '),' the'),'a '),' a')='tariq'
I could run the above query easily if I were simply using SQLite, but the current project uses Core Data, and I am not very familiar with NSPredicate.
The requirement is to remove everything BUT alphanumeric characters, which means removing special characters.
The characters that should be valid in the comparison would be
ABCDEFGHIJKLMNOPQRSTUVWXYZ1234567890
But we should not fail the comparison for the following characters
:;,~`!@#$%^&*()_-+="'/?.>,<|\
Or for the following words
'the' 'an' 'a'
Some examples:
'Walmart' would be seen as the same payee as 'Wal-Mart'
'The Shoe Store' would be seen as the same payee as 'Shoe Store'
'Domino's Pizza' would be seen as the same payee as 'Dominos Pizza'
'Test Payee;' would be seen as the same payee as 'Test Payee'
Can anyone suggest an appropriate predicate or regular expression?
Thanks
I would have an extra field in the database which would be a processed version of the original with all the irrelevant characters stripped out. Then use that for comparisons.
You might want to look at the Soundex algorithm, which may suit your purposes better.
It seems to me that you would want to normalize your data before it ever gets saved into the Core Data store. So if you're given "Wal-Mart", normalize it to "walmart" once, and then save it. Then you won't be doing all of this expensive on-the-fly comparison many, many times.
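For reference, here is a minimal sketch of the common American Soundex variant, written in Python purely as an illustration (in the app it would have to be ported to Objective-C or Swift). The handling of h and w follows the usual rule that they do not separate letters with the same code. Note that Soundex is normally applied per word and does nothing about articles such as "the".

```python
# Letter groups that share a Soundex digit; vowels, h, w and y have no digit.
_GROUPS = (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
           ("l", "4"), ("mn", "5"), ("r", "6"))
_CODES = {ch: digit for letters, digit in _GROUPS for ch in letters}


def soundex(name):
    """Return the 4-character Soundex code of name, or '' if it has no letters."""
    letters = [c for c in name.lower() if "a" <= c <= "z"]  # drops punctuation
    if not letters:
        return ""
    result = letters[0].upper()
    prev = _CODES.get(letters[0], "")
    for ch in letters[1:]:
        if ch in "hw":
            continue  # h and w do not break a run of identical codes
        code = _CODES.get(ch, "")
        if code and code != prev:
            result += code
        prev = code  # a vowel resets prev, so a repeated code is kept after it
    return (result + "000")[:4]
```

Since punctuation is dropped before encoding, "Wal-Mart" and "Walmart" receive the same code.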
The normalization would be fairly simple, given your rules:
Strip the words "a", "an", and "the"
Remove punctuation
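The two rules above can be sketched as follows (Python, illustration only; in the app the same logic would run once before saving, and the result would go into an extra attribute that the predicate compares by equality). The function name normalize_payee is hypothetical. Lowercasing, keeping only ASCII letters and digits, and joining the remaining words with single spaces are assumptions on my part.

```python
import re

STOP_WORDS = {"a", "an", "the"}


def normalize_payee(name):
    """Lowercase, remove punctuation, then drop the words 'a', 'an' and 'the'."""
    # Delete punctuation outright (no space inserted), so "Wal-Mart" -> "walmart".
    cleaned = re.sub(r"[^a-z0-9\s]", "", name.lower())
    # Filter stop words on whole-word boundaries only, never inside other words.
    words = [w for w in cleaned.split() if w not in STOP_WORDS]
    return " ".join(words)
```

Two payees then count as the same when normalize_payee gives the same string for both, which covers the four example pairs from the question.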