Can WEKA do classification via clustering?

I am new to the WEKA tool. Can I combine classification and clustering, i.e. first cluster the data and then classify the instances cluster-wise? What steps do I need to follow?
Thanks in advance.

Yes, you can. It is easy with the ClassificationViaClustering classifier (class weka.classifiers.meta.ClassificationViaClustering).
Steps in Java pseudocode:
import java.io.BufferedReader;
import java.io.FileReader;
import weka.classifiers.meta.ClassificationViaClustering;
import weka.clusterers.SimpleKMeans;
import weka.core.DenseInstance;
import weka.core.Instance;
import weka.core.Instances;
1. Create a SimpleKMeans clusterer
SimpleKMeans skm = new SimpleKMeans();
skm.setNumClusters(5); // in this example the clusterer uses 5 clusters
2. Read the dataset and set the class index
BufferedReader reader = new BufferedReader(new FileReader("[path].arff")); // replace [path] with the path to your dataset
Instances data = new Instances(reader);
reader.close();
data.setClassIndex([your class index]); // if the first attribute is your class, use 0
3. Create the classifier
ClassificationViaClustering cvc = new ClassificationViaClustering();
cvc.setClusterer(skm); // let your classifier use the SimpleKMeans clusterer
cvc.buildClassifier(data);
Then, when you want to classify a new instance:
Instance instanceToClassify = new DenseInstance(data.firstInstance()); // Weka 3.7+; in Weka 3.6 use new Instance(data.firstInstance())
instanceToClassify.setDataset(data); // the instance to be classified has to have access to the dataset
double predictedClass = cvc.classifyInstance(instanceToClassify); // classify the instance by the cluster it belongs to ("class" is a reserved word in Java)
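The idea behind the steps above can be illustrated outside of Weka. The following is only a conceptual sketch in plain Python, assuming that each cluster is labelled with the majority class of its training instances; Weka's actual cluster-to-class mapping may differ. The function names and the toy data are hypothetical.

```python
from collections import Counter, defaultdict

def map_clusters_to_classes(cluster_ids, labels):
    """Return {cluster_id: majority class label} from training assignments."""
    votes = defaultdict(Counter)
    for cluster_id, label in zip(cluster_ids, labels):
        votes[cluster_id][label] += 1
    # most_common(1) yields the (label, count) pair with the highest count
    return {cid: counts.most_common(1)[0][0] for cid, counts in votes.items()}

def classify(cluster_id, cluster_to_class):
    """Predict the class of an instance from the cluster it was assigned to."""
    return cluster_to_class.get(cluster_id)  # None for a cluster never seen in training

# Toy training data: cluster assignment and true class of six instances
train_clusters = [0, 0, 1, 1, 1, 2]
train_labels = ["yes", "yes", "no", "no", "yes", "no"]
mapping = map_clusters_to_classes(train_clusters, train_labels)
```

A new instance is first assigned to a cluster by the clusterer, and classify() then looks up the class attached to that cluster.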

Related

Testing a trading system on bootstrap samples using Arch library in python

I am trying to test the hypothesis that a trading strategy outperforms buy and hold. I have the returns of the original data, containing 1261 observations, as the sample to bootstrap from.
I want to know whether I have applied the bootstrap correctly.
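import pandas as pd
from arch.bootstrap import CircularBlockBootstrap

def back_test_series(x):
    # return the resampled returns as a pandas Series
    df = pd.DataFrame(x, columns=['Close'])
    return df.Close

bs = CircularBlockBootstrap(40, sample_return)  # block size of 40 observations
results = bs.apply(back_test_series, 2500)      # 2500 bootstrap replications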
Above, sample_return is the sample containing 2761 returns on actual data. I created 2500 bootstrapped samples containing 2761 observations each,
and then took the cumulative product of the returns to get a price time series.
time_series = []
for simu in results:
    df = pd.DataFrame(simu, columns=["Close"])
    df['Close'] = (1 + df['Close']).cumprod()  # compound the returns into a price path
    time_series.append(df)
and finally ran my backtest on the price series obtained from the bootstrap.
final_results = []
for simulation in time_series:  # enumerate() would pass (index, DataFrame) tuples instead of the DataFrame
    x = Backtesting.scrip_backtest(simulation)
    final_results.append(x)
Backtesting.scrip_backtest is my trading strategy, which returns stats such as buy-and-hold CAGR, strategy CAGR, and the standard deviation of strategy returns.
My question is: can I use the bootstrap in this way? Should I use MovingBlockBootstrap or CircularBlockBootstrap?
Is it correct to run a trading strategy on bootstrapped time series as described above?
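To make the resample-then-compound procedure described above concrete, here is a rough sketch in plain Python that does not use the arch library. It assumes a circular block scheme in which a block that runs past the end of the sample wraps around to the beginning (a moving block scheme would instead only draw blocks that fit inside the sample). The function names and the toy returns are hypothetical, and the sketch says nothing about whether the procedure is statistically appropriate.

```python
import random
from itertools import accumulate

def circular_block_sample(returns, block_size, rng):
    """Draw one bootstrap sample of len(returns) using wrap-around blocks."""
    n = len(returns)
    sample = []
    while len(sample) < n:
        start = rng.randrange(n)  # any position may start a block
        sample.extend(returns[(start + i) % n] for i in range(block_size))
    return sample[:n]  # trim the last block so the sample has n observations

def to_price_path(returns, start_price=1.0):
    """Compound simple returns into a price series (the cumulative product step)."""
    path = accumulate(returns, lambda price, r: price * (1 + r), initial=start_price)
    return list(path)[1:]  # drop the initial price itself

rng = random.Random(0)
toy_returns = [0.01, -0.02, 0.005, 0.03, -0.01, 0.0, 0.02, -0.005]
resampled = circular_block_sample(toy_returns, block_size=3, rng=rng)
prices = to_price_path(resampled)
```

Each resampled return series keeps short runs of consecutive observations intact, which is the reason for resampling blocks rather than single observations.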

Is it possible to load a model which is stored with model.module.state_dict() but load with model.state_dict()

I have trained a model on two GPUs and saved it with model.module.state_dict(). Now I want to load this model on one GPU. Can I load the trained model directly with model.state_dict()?
Thanks in advance!

You can refer to this question.
You can either wrap the model in nn.DataParallel for loading, or rename the keys like this:
from collections import OrderedDict
import torch

# original file saved from a DataParallel model
state_dict = torch.load('myfile.pth.tar', map_location='cpu')
# create a new OrderedDict whose keys do not contain `module.`
new_state_dict = OrderedDict()
for k, v in state_dict.items():
    name = k[len('module.'):] if k.startswith('module.') else k  # remove `module.` only if present
    new_state_dict[name] = v
# load params
model.load_state_dict(new_state_dict)
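But since you saved the model with model.module.state_dict() instead of model.state_dict(), the key names may differ from what the code above expects: model.module is the wrapped network itself, so its keys normally carry no `module.` prefix and the file can usually be loaded directly with model.load_state_dict(). If neither method works, print the saved dict and the model to see what you need to change, like this: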
state_dict = torch.load('myfile.pth.tar')
print(state_dict)
print(model)
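The key-renaming step can be tried without PyTorch installed. The sketch below uses plain ordered dictionaries with made-up keys and integer placeholders instead of tensors; strip_module_prefix is a hypothetical helper, not a PyTorch function.

```python
from collections import OrderedDict

PREFIX = "module."

def strip_module_prefix(state_dict):
    """Return a copy of state_dict whose keys have a leading 'module.' removed."""
    return OrderedDict(
        (key[len(PREFIX):] if key.startswith(PREFIX) else key, value)
        for key, value in state_dict.items()
    )

# Keys as they look when saved with model.state_dict() on a DataParallel wrapper
saved_from_wrapper = OrderedDict([("module.fc.weight", 1), ("module.fc.bias", 2)])
# Keys as they look when saved with model.module.state_dict()
saved_from_module = OrderedDict([("fc.weight", 1), ("fc.bias", 2)])

cleaned = strip_module_prefix(saved_from_wrapper)
unchanged = strip_module_prefix(saved_from_module)
```

Because the helper only strips the prefix when it is present, it is safe to apply to a checkpoint saved either way.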

Write Matrix Data to Each Member of Datatype in HDF5 file via MATLAB

This is my first go at trying to create an HDF5 file from scratch using the Low-Level commands via MATLAB.
My issue is that I am having a hard time trying to write data to each specific member in the datatype on my dataset.
First, I create a new HDF5 file, and set the right layer of groups:
new_h5 = H5F.create('new_hdf5_file.h5','H5F_ACC_TRUNC','H5P_DEFAULT','H5P_DEFAULT');
new_h5 = H5G.create(new_h5,'first','H5P_DEFAULT','H5P_DEFAULT','H5P_DEFAULT');
new_h5 = H5G.create(new_h5,'second','H5P_DEFAULT','H5P_DEFAULT','H5P_DEFAULT');
Then, I create my datatype:
datatype = H5T.create('H5T_compound',20);
H5T.insert(datatype,'first_element',0,'H5T_NATIVE_INT');
H5T.insert(datatype,'second_element',4,'H5T_NATIVE_DOUBLE');
H5T.insert(datatype,'third_element',12,'H5T_NATIVE_DOUBLE');
Then, I format that into my dataset:
new_h5 = H5D.create(new_h5,'location',datatype,H5S.create('H5S_SCALAR'),'H5P_DEFAULT');
subset = H5D.get_type(H5D.open(new_h5,'/first/second/location'));
mem_type = H5T.get_member_type(subset,0);
I receive an error with the following command:
H5D.write(mem_type,'H5ML_DEFAULT','H5S_ALL','H5S_ALL','H5P_DEFAULT',data);
Error using hdf5lib2
Unhandled HDF5 class (H5T_NO_CLASS) encountered. It is not possible to write to this attribute or dataset.
So, I try this method instead:
new_h5 = H5D.create(new_h5,'location',datatype,H5S.create_simple(2,dims,dims),'H5P_DEFAULT'); %where dims are the dimensions of all matrices of data structure
H5D.write(mem_type,'H5ML_DEFAULT','H5S_ALL','H5S_ALL','H5P_DEFAULT',data); %where data is a structure
I receive an error with the following command:
H5D.write(mem_type,'H5ML_DEFAULT','H5S_ALL','H5S_ALL','H5P_DEFAULT',data);
Error using hdf5lib2
Attempted to transfer too many values to or from the library buffer.
When looking here for the XML tags for the error messages, it describes the above error as "illegalArrayAccess." Apparently, according to this question, you can only write to 4 members without the buffer throwing an error?
Is this correct? How can I correctly write to each member? I am about to reach my mental limit trying to figure this one out.
EDIT:
References kept here for general information:
HDF5 Compound Datatypes Example
HDF5 Compound Datatypes
H5D.write MATLAB Command

I found out why I could not write the data, and I have solved the problem. I had my dimensions set incorrectly (in code I forgot to include originally). My apologies. I had my dimensions like this:
dims = fliplr(size(data_matrix));
Here dims came out as 15x250. The error occurred because the library tried to write a 250x15 matrix for each member, while the buffer only held 250x1 of data for each member.
The following code will (generically) work for writing data to each member:
new_h5 = H5F.create('new_hdf5_file.h5','H5F_ACC_TRUNC','H5P_DEFAULT','H5P_DEFAULT');
new_h5 = H5G.create(new_h5,'first','H5P_DEFAULT','H5P_DEFAULT','H5P_DEFAULT');
new_h5 = H5G.create(new_h5,'second','H5P_DEFAULT','H5P_DEFAULT','H5P_DEFAULT');
datatype = H5T.create('H5T_compound',20);
H5T.insert(datatype,'first_element',0,'H5T_NATIVE_INT');
H5T.insert(datatype,'second_element',4,'H5T_NATIVE_DOUBLE');
H5T.insert(datatype,'third_element',12,'H5T_NATIVE_DOUBLE');
dims = fliplr(size(data_matrix)); dims = [1 dims(1,2)];
new_h5 = H5D.create(new_h5,'location',datatype,H5S.create_simple(2,dims,dims),'H5P_DEFAULT');
H5D.write(new_h5,'H5ML_DEFAULT','H5S_ALL','H5S_ALL','H5P_DEFAULT',data_structure);
where data_matrix is a 15x250 matrix containing all the data, and data_structure is a structure containing 15 fields, each 250x1 in size.
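The byte offsets passed to H5T.insert above (0, 4, 12, with a total size of 20) describe a packed record: a 4-byte integer followed by two 8-byte doubles. As an illustration only, the same layout can be reproduced with Python's struct module; this is not HDF5 code, and the column values are made up.

```python
import struct

# Mirror of the compound type: int32 at offset 0, doubles at offsets 4 and 12.
# "=" selects standard sizes with no alignment padding, so the record is 20 bytes.
RECORD = struct.Struct("=idd")

def pack_records(first, second, third):
    """Pack three equal-length columns into one byte string, one record per row."""
    return b"".join(RECORD.pack(a, b, c) for a, b, c in zip(first, second, third))

buffer = pack_records([1, 2], [0.5, 1.5], [10.0, 20.0])

# Read 'second_element' of row 1 directly: row offset plus member offset 4
second_of_row_1 = struct.unpack_from("=d", buffer, RECORD.size * 1 + 4)[0]
```

Each member contributes one value per record, which is why the data supplied for a member must be a single column whose length matches the dataspace.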

Is it possible to load word2vec pre-trained available vectors into spark?

Is there a way to load Google's or GloVe's pre-trained vectors (models), such as GoogleNews-vectors-negative300.bin.gz, into Spark and perform operations such as findSynonyms that Spark provides? Or do I need to do the loading and the operations from scratch?
In the post "Load Word2Vec model in Spark", Tom Lous suggests converting the bin file to txt and starting from there. I have already done that, but what comes next?
In a question I posted yesterday, I got an answer saying that models in Parquet format can be loaded in Spark, so I'm posting this question to make sure there is no other option.

Disclaimer: I'm pretty new to spark, but the below at least works for me.
The trick is figuring out how to construct a Word2VecModel from a set of word vectors as well as handling some of the gotchas in trying to create the model this way.
First, load your word vectors into a Map. For example, I have saved my word vectors to a parquet format (in a folder called "wordvectors.parquet") where the "term" column holds the String word and the "vector" column holds the vector as an array[float], and I can load it like so in Java:
// Loads the dataset with the "term" column holding the word and the "vector" column
// holding the vector as an array[float]
Dataset<Row> vectorModel = pSpark.read().parquet("wordvectors.parquet");
// convert the dataset to a map
Map<String, List<Float>> vectorMap = Arrays.stream((Row[]) vectorModel.collect())
    .collect(Collectors.toMap(row -> row.getAs("term"), row -> row.getList(1)));
// convert to the format that the word2vec model expects: float[] rather than List<Float>
Map<String, float[]> word2vecMap = vectorMap.entrySet().stream()
    .collect(Collectors.toMap(Map.Entry::getKey, entry -> (float[]) Floats.toArray(entry.getValue())));
// need to convert to a Scala immutable map because that's what word2vec needs
scala.collection.immutable.Map<String, float[]> scalaMap = toScalaImmutableMap(word2vecMap);

private static <K, V> scala.collection.immutable.Map<K, V> toScalaImmutableMap(Map<K, V> pFromMap) {
    final List<Tuple2<K, V>> list = pFromMap.entrySet().stream()
        .map(e -> Tuple2.apply(e.getKey(), e.getValue()))
        .collect(Collectors.toList());
    Seq<Tuple2<K, V>> scalaSeq = JavaConverters.asScalaBufferConverter(list).asScala().toSeq();
    return (scala.collection.immutable.Map<K, V>) scala.collection.immutable.Map$.MODULE$.apply(scalaSeq);
}
Now you can construct the model from scratch. Due to a quirk in how Word2VecModel works, you must set the vector size manually, and do so in a weird way. Otherwise it defaults to 100 and you get an error when trying to invoke .transform(). Here is a way I've found that works, not sure if everything is necessary:
// not used for fitting, only for setting the vector size param (not sure if this is needed or if result.set is enough)
Word2Vec parent = new Word2Vec();
parent.setVectorSize(300);
Word2VecModel result = new Word2VecModel("w2vmodel", new org.apache.spark.mllib.feature.Word2VecModel(scalaMap)).setParent(parent);
result.set(result.vectorSize(), 300);
Now you should be able to use result.transform() like you would with a self-trained model.
I haven't tested whether other Word2VecModel functions work correctly; I only tested .transform().
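Independently of Spark, it can help to see what the converted txt file contains and what a synonym lookup does with it. The sketch below is plain Python, not Spark code. It assumes the usual word2vec text layout (an optional "vocab_size dimension" header line, then one "word v1 v2 ..." line per word) and vectors with at least two dimensions; parse_vectors, cosine and find_synonyms are hypothetical helper names, and the three sample vectors are invented.

```python
import math

def parse_vectors(lines):
    """Parse 'word v1 v2 ...' lines into {word: [floats]}."""
    vectors = {}
    for line in lines:
        parts = line.split()
        if len(parts) <= 2:  # skip blank lines and a 'vocab_size dim' header
            continue
        vectors[parts[0]] = [float(x) for x in parts[1:]]
    return vectors

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def find_synonyms(vectors, word, k):
    """Return the k words most similar to 'word' as (word, similarity) pairs."""
    target = vectors[word]
    scored = [(other, cosine(target, vec)) for other, vec in vectors.items() if other != word]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]

sample = ["3 2", "king 1.0 0.0", "queen 0.9 0.1", "apple 0.0 1.0"]
vectors = parse_vectors(sample)
nearest = find_synonyms(vectors, "king", 1)
```

A brute-force scan like this is fine for experiments but does not scale to millions of words, which is where a distributed model such as the one constructed above becomes useful.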

Classification of new instances in weka

In our training set, we performed feature selection (ex. CfsSubsetEval GreedyStepwise) and then classified the instances using a classifier (ex. J48). We have saved the model Weka created.
Now we want to classify new [unlabeled] instances (which still have the original number of attributes of the training set before it underwent feature selection). Are we right in assuming that we should perform the same feature selection on this set of new [unlabeled] instances so that we can evaluate it using the saved model (to make the training and test sets compatible)? If yes, how can we filter the test set?
Thank you for helping!

Yes, both test and training set must have the same number of attributes and each attribute must correspond to the same thing. So you should remove the same attributes (that you removed from training set) from your test set before classification.

I don't think you have to perform feature selection on the test set. If your test set still has the original number of attributes, load it, and in the "Preprocess" window manually remove all the attributes that were removed from the training set during feature selection.

You must apply to the test set the same filter that you previously applied to the training set. You can use the WEKA API to apply the same filter to the test set as well.
import weka.attributeSelection.AttributeSelection;
import weka.attributeSelection.CfsSubsetEval;
import weka.attributeSelection.GreedyStepwise;
import weka.core.Instances;
import weka.filters.Filter;
import weka.filters.unsupervised.attribute.Remove;

Instances trainSet = ...; // get training set
Instances testSet = ...; // get testing set
AttributeSelection attsel = new AttributeSelection(); // apply feature selection on training data
CfsSubsetEval ws = new CfsSubsetEval();
GreedyStepwise search = new GreedyStepwise();
attsel.setEvaluator(ws);
attsel.setSearch(search);
attsel.SelectAttributes(trainSet);
int[] retArr = attsel.selectedAttributes(); // get indices of selected attributes
Remove remove = new Remove(); // set up the filter for removing attributes
remove.setAttributeIndicesArray(retArr);
remove.setInvertSelection(true); // retain the selected, remove all others
remove.setInputFormat(trainSet);
trainSet = Filter.useFilter(trainSet, remove);
// now apply the same filter to the testing set as well
testSet = Filter.useFilter(testSet, remove);
// now you are good to go!
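The principle shared by all three answers is that the attribute indices are chosen once, on the training data, and then reused unchanged on the test data. A minimal plain-Python sketch of that principle follows; it is not Weka code, and keep_columns, the selected indices and the toy rows are all hypothetical.

```python
def keep_columns(rows, kept_indices):
    """Return rows restricted to the given column indices, in that order."""
    return [[row[i] for i in kept_indices] for row in rows]

# Hypothetical outcome of feature selection on the training set:
# keep attributes 0 and 2, plus the class attribute at index 3.
selected = [0, 2, 3]

train = [[5.1, 0.2, 1.4, "a"], [6.2, 0.4, 4.5, "b"]]
test = [[5.9, 0.3, 4.2, "?"]]  # unlabeled instance with the original attributes

train_reduced = keep_columns(train, selected)
test_reduced = keep_columns(test, selected)  # same indices, no new selection
```

Running feature selection again on the test set could pick different attributes, which would make the test data incompatible with the saved model.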
