Why doesn't the fastText text classification example apply a LabelEncoder to the labels? - fasttext

I'm new to fastText and have read the tutorial: https://fasttext.cc/docs/en/supervised-tutorial.html.
I downloaded the sample data and found that the labels are plain strings.
$ head cooking.stackexchange.txt
__label__sauce __label__cheese How much does potato starch affect a cheese sauce recipe?
__label__food-safety __label__acidity Dangerous pathogens capable of growing in acidic environments
__label__cast-iron __label__stove How do I cover up the white spots on my cast iron stove?
__label__restaurant Michelin Three Star Restaurant; but if the chef is not there
__label__knife-skills __label__dicing Without knife skills, how can I quickly and accurately dice vegetables?
__label__storage-method __label__equipment __label__bread What's the purpose of a bread box?
__label__baking __label__food-safety __label__substitutions __label__peanuts how to seperate peanut oil from roasted peanuts at home?
__label__chocolate American equivalent for British chocolate terms
__label__baking __label__oven __label__convection Fan bake vs bake
__label__sauce __label__storage-lifetime __label__acidity __label__mayonnaise Regulation and balancing of readymade packed mayonnaise and other sauces
Here is the train and test code from the tutorial:
>>> model = fasttext.train_supervised(input="cooking.train", lr=1.0)
Read 0M words
Number of words: 9012
Number of labels: 734
Progress: 100.0% words/sec/thread: 81469 lr: 0.000000 loss: 6.405640 eta: 0h0m
>>> model.test("cooking.valid")
(3000L, 0.563, 0.245)
My question is: why is no label encoding (say, sklearn's LabelEncoder) applied to the labels? I ran the example and it worked well, which confused me.
[UPDATED] --------
IMO, the code would look like this:
import fasttext
from sklearn import preprocessing

texts_train, labels_train = load_dataset()

label_encoder = preprocessing.LabelEncoder()
labels_train = label_encoder.fit_transform(labels_train)

with open('cooking.train.2', 'w') as f:
    for i in range(len(texts_train)):
        f.write('%s __label__%d\n' % (texts_train[i], labels_train[i]))

model = fasttext.train_supervised('cooking.train.2', lr=1.0)
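For completeness, here is a hypothetical follow-up (not from the tutorial) showing how a prediction from such a model could be mapped back to the original string label, assuming the label_encoder fitted above is still in scope:

# Hypothetical: decode a numeric __label__ prediction back to the original string label.
labels, probs = model.predict("How much does potato starch affect a cheese sauce recipe?", k=1)
encoded = int(labels[0].replace('__label__', ''))   # e.g. '__label__42' -> 42
original_label = label_encoder.inverse_transform([encoded])[0]
print(original_label, probs[0])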

Related

Incorrect result of fastText model

I created a fastText model to do sentiment analysis on comments. I used a training file with 51% positive comments, 47% negative, and 2% neutral.
When I test it with some given sentences, the results are always split roughly 0.49 positive, 0.47 negative, 0.02 neutral, even if I type a single negative word.
I have the following code :
import fasttext
model = fasttext.train_supervised(TRAIN_FILE, lr=0.1, dim=20, epoch=20, word_ngrams=1 , loss='softmax')
model.save_model(MODEL_FILE)
model = fasttext.load_model(MODEL_FILE)
pred = model.predict(['bad'], k=3)
print(pred)
I always get a result of around 0.4... positive, 0.4... negative, 0.0... neutral:
([['__label__négatif', '__label__positif', '__label__neutre']], array([[0.49168783, 0.47954634, 0.02879585]]))
Can someone tell me where the mistake lies?

How to check via callbacks if alpha is decreasing? + How to load all cores during training?

I'm training Doc2Vec and, using callbacks, trying to see whether alpha is decreasing over training time with this code:
class EpochSaver(CallbackAny2Vec):
    '''Callback to save model after each epoch.'''

    def __init__(self, path_prefix):
        self.path_prefix = path_prefix
        self.epoch = 0
        os.makedirs(self.path_prefix, exist_ok=True)

    def on_epoch_end(self, model):
        savepath = get_tmpfile(
            '{}_epoch{}.model'.format(self.path_prefix, self.epoch)
        )
        model.save(savepath)
        print(
            "Model alpha: {}".format(model.alpha),
            "Model min_alpha: {}".format(model.min_alpha),
            "Epoch saved: {}".format(self.epoch + 1),
            "Start next epoch"
        )
        self.epoch += 1


def train():
    workers = multiprocessing.cpu_count() * 4
    model = Doc2Vec(
        DocIter(),
        vec_size=600, alpha=0.03, min_alpha=0.00025, epochs=20,
        min_count=10, dm=1, hs=1, negative=0, workers=workers,
        callbacks=[EpochSaver("./checkpoints")]
    )
    print(
        "HS", model.hs, "Negative", model.negative, "Epochs",
        model.epochs, "Workers: ", model.workers,
        "Model alpha: {}".format(model.alpha)
    )
And while training I see that alpha is not changing over time. On each callback I see alpha = 0.03.
Is it possible to check whether alpha is decreasing? Or is it really not decreasing at all during training?
One more question:
How can I benefit from all my cores while training doc2vec?
As far as I can see, no core is loaded to more than about 30%.
The model.alpha property only holds the initially-configured starting-alpha – it's not updated to the effective learning-rate through training.
So, even if the value is being decreased properly (and I expect that it is), you wouldn't see it in the logging you've added.
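If you still want to log an estimate of the decaying rate, here is a rough sketch (my own addition, assuming gensim's documented linear decay from alpha down to min_alpha over the configured epochs; it computes an expected value rather than reading one back from the training loop):

from gensim.models.callbacks import CallbackAny2Vec

class AlphaLogger(CallbackAny2Vec):
    """Prints the *expected* effective alpha after each epoch,
    assuming linear decay from model.alpha down to model.min_alpha."""
    def __init__(self):
        self.epoch = 0

    def on_epoch_end(self, model):
        progress = (self.epoch + 1) / model.epochs
        expected = model.alpha - (model.alpha - model.min_alpha) * progress
        print("Epoch {}: expected effective alpha ~ {:.5f}".format(self.epoch + 1, expected))
        self.epoch += 1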
Separate observations about your code:
in gensim versions at least through 3.5.0, maximum training throughput is most often reached with some value of workers between 3 and the number of cores, but usually not the full number of cores (if it's higher than 12) or larger. So workers=multiprocessing.cpu_count()*4 is likely going to be much slower than what you could achieve with a lower number (see the short sketch after these notes).
if your corpus is large enough to support 600-dimensional vectors and to discard words with fewer than min_count=10 occurrences, negative sampling may work faster and get better results than the hs mode. (The pattern in published work seems to be to prefer negative sampling with larger corpora.)
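A minimal sketch of the workers suggestion above (the cap of 12 is just a heuristic drawn from the observation, not a gensim rule):

import multiprocessing

# more threads than physical cores rarely helps, and on many-core machines
# throughput often peaks well below the core count
workers = min(multiprocessing.cpu_count(), 12)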

Low alpha for NLTK agreement using MASI distance

I'm getting a very low value for Krippendorff's alpha when I calculate agreement in NLTK using MASI as the distance function.
Three coders (Inky, Blinky, and Sue) are instructed to assign topic labels (love, gifts, slime, or gaming) to two texts (text01 and text02), based on what the texts are about. Each text can be about more than one topic, so coders may assign each text more than one label. The data and the code used to make the calculations are shown below:
import nltk
from nltk.metrics import agreement
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance

# (coder, item, label)
data = [('inky','text01',frozenset(['love','gifts'])),
        ('blinky','text01',frozenset(['love','gifts'])),
        ('sue','text01',frozenset(['love','gifts'])),
        ('inky','text02',frozenset(['slime','gaming'])),
        ('blinky','text02',frozenset(['slime'])),
        ('sue','text02',frozenset(['slime','gaming']))]

jaccard_task = nltk.AnnotationTask(distance=jaccard_distance)
masi_task = nltk.AnnotationTask(distance=masi_distance)
tasks = [jaccard_task, masi_task]

for task in tasks:
    task.load_array(data)
    print("Statistics for dataset using {}".format(task.distance))
    print("C: {}\nI: {}\nK: {}".format(task.C, task.I, task.K))
    print("Pi: {}".format(task.pi()))
    print("Kappa: {}".format(task.kappa()))
    print("Multi-Kappa: {}".format(task.multi_kappa()))
    print("Alpha: {}".format(task.alpha()))
    print()
When I run the code, I get the following results:
Statistics for dataset using <function jaccard_distance at 0x09D26DB0>
C: {'inky', 'sue', 'blinky'}
I: {'text01', 'text02'}
K: {frozenset({'slime'}), frozenset({'love', 'gifts'}), frozenset({'gaming', 'slime'})}
Pi: 0.7272727272727273
Kappa: 0.7777777777777777
Multi-Kappa: 0.7499999999999999
Alpha: 0.75
Statistics for dataset using <function masi_distance at 0x09D26DF8>
C: {'inky', 'sue', 'blinky'}
I: {'text01', 'text02'}
K: {frozenset({'slime'}), frozenset({'love', 'gifts'}), frozenset({'gaming', 'slime'})}
Pi: 0.8172727272727272
Kappa: 0.8511111111111113
Multi-Kappa: 0.8324999999999998
Alpha: -1.5
My question is, why is the alpha so low when using the MASI distance function compared to Jaccard?
I was unable to reproduce the error and got the correct value of Krippendorff's alpha with MASI distance when running the provided code. I used Python 3.5.2, NumPy 1.18.2, NLTK 3.4.5. Thus, the most probable answer is that one needs to update NLTK.
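As a quick sanity check before updating (a generic snippet, not part of the original answer), you can print the versions you are actually running:

import sys, numpy, nltk

print(sys.version)
print("numpy", numpy.__version__)
print("nltk", nltk.__version__)
# if your NLTK is older than 3.4.5, upgrading may fix the alpha value:
#   pip install --upgrade nltk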

Keras ImageDataGenerator Slow

I am looking for the best approach to train on larger-than-memory data in Keras and am currently noticing that the vanilla ImageDataGenerator tends to be slower than I would hope.
I have two networks training on the Kaggle Cats vs. Dogs dataset (25,000 images):
1) this approach is exactly the code from: http://www.pyimagesearch.com/2016/09/26/a-simple-neural-network-with-python-and-keras/
2) same as (1) but using an ImageDataGenerator instead of loading the data into memory
Note: for below, "preprocessing" means resizing, scaling, flattening
I find the following on my gtx970:
For network 1, it takes ~0s per epoch.
For network 2, it takes ~36s per epoch if the preprocessing is done in the data generator.
For network 2, it takes ~13s per epoch if preprocessing is done in a first-pass outside of the data generator.
Is this likely the speed limit for ImageDataGenerator (13s seems like the usual 10-100x difference between disk and RAM...)? Are there approaches/mechanisms better suited for training on larger-than-memory data when using Keras?
e.g. Perhaps there is a way to get the ImageDataGenerator in Keras to save its processed images after the first epoch?
Thanks!
I assume you might already have solved this, but nevertheless...
Keras image preprocessing has the option of saving the results by setting the save_to_dir argument in the flow() or flow_from_directory() function:
https://keras.io/preprocessing/image/
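For illustration, a minimal sketch of that option (directory names and sizes are placeholders, not taken from the question):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1.0 / 255)

# every batch drawn from this iterator is also written to disk,
# so the preprocessed/augmented images can be inspected or reused later
train_iter = datagen.flow_from_directory(
    'data/train',             # hypothetical folder with one subfolder per class
    target_size=(64, 64),
    batch_size=32,
    class_mode='binary',
    save_to_dir='augmented',  # this folder should already exist
    save_prefix='aug',
    save_format='png')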
In my understanding, the problem is that augmented images are used only once in a model's training cycle, and not even reused across several epochs. So it's a huge waste of GPU cycles while the CPU is struggling.
I found the following solution:
I generate as many augmentations in RAM as I can;
I use them for training across a frame of epochs, 10 to 30, whatever it takes to get noticeable convergence;
after that I generate a new batch of augmented images (by implementing on_epoch_end) and the process goes on.
This approach keeps the GPU busy most of the time, while still benefiting from data augmentation. I use a custom Sequence subclass to generate the augmentations and fix the class imbalance at the same time.
EDIT: adding some code to clarify the idea
from pyutilz.string import read_config_file
from tqdm.notebook import tqdm
from pathlib import Path
from gc import collect
import numpy as np
import tensorflow
import logging
import random
import json
import cv2

class StoppingFromFile(tensorflow.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        if read_config_file('control.ini', 'ML', 'stop', globals()):
            if stop is not None:
                if stop == True or stop == 'True':
                    logging.warning('Model should be stopped according to the control file')
                    self.model.stop_training = True

class AugmentedBalancedSequence(tensorflow.keras.utils.Sequence):
    def __init__(self, images_and_classes: dict, input_size: tuple, class_sizes: list, augmentations_fn: object, preprocessing_fn: object, batch_size: int = 10,
                 num_class_samples=100, frame_length: int = 5, aug_p: float = 0.1, aug_pipe_p: float = 0.2, is_validation: bool = False,
                 disk_saving_prob: float = .01, disk_example_nfiles: int = 50):
        """
        From a dict of file paths grouped by class label, creates an augmented, balanced training set every N epochs.
        If the current class is too scarce, ensures that the current frame has no duplicate final images.
        If it's rich enough, ensures that the current frame has no duplicate base images.
        """
        logging.info(f'Got {len(images_and_classes)} classes.')

        self.disk_example_nfiles = disk_example_nfiles; self.disk_saving_prob = disk_saving_prob; self.cur_example_file = 0
        self.images_and_classes = images_and_classes
        self.num_class_samples = num_class_samples
        self.augmentations_fn = augmentations_fn
        self.preprocessing_fn = preprocessing_fn
        self.is_validation = is_validation
        self.frame_length = frame_length
        self.batch_size = batch_size
        self.class_sizes = class_sizes
        self.input_size = input_size
        self.aug_pipe_p = aug_pipe_p
        self.aug_p = aug_p

        self.images = None
        self.epoch = 0
        # print(f'got frame_length={self.frame_length}')
        self._generate_data()

    def __len__(self):
        return int(np.ceil(len(self.images) / float(self.batch_size)))

    def __getitem__(self, idx):
        a = idx * self.batch_size; b = a + self.batch_size
        return self.images[a:b], self.labels[a:b]

    def on_epoch_end(self):
        self.epoch += 1

        # allow changing the sequence's settings from a control file between epochs
        mydict = {}
        fname = 'control.json'
        p = Path(fname)
        if p.is_file():
            try:
                with open(fname) as f:
                    mydict = json.load(f)
                for var, val in mydict.items():
                    if hasattr(self, var):
                        converted = val  # ast.literal_eval(val)
                        if converted is not None:
                            if getattr(self, var) != converted:
                                setattr(self, var, converted)
                                print(f'{var} became {val}')
            except Exception as e:
                logging.error(str(e))

        if self.epoch % self.frame_length == 0:
            # print('generating data...')
            self._generate_data()

    def _add_sample(self, image, label):
        from random import random

        idx = self.indices[self.img_sent]

        # occasionally dump an example image to disk for visual inspection
        if self.disk_saving_prob > 0:
            if random() < self.disk_saving_prob:
                self.cur_example_file += 1
                if self.cur_example_file > self.disk_example_nfiles:
                    self.cur_example_file = 1
                Path(r'example_images/').mkdir(parents=True, exist_ok=True)
                cv2.imwrite(f'example_images/test{self.cur_example_file}.jpg', cv2.cvtColor(image, cv2.COLOR_RGB2BGR))

        if self.preprocessing_fn:
            self.images[idx] = self.preprocessing_fn(image)
        else:
            self.images[idx] = image

        self.labels[idx] = label
        self.img_sent += 1

    def _generate_data(self):
        logging.info('Generating new set of augmented data...')
        collect()
        # del self.images
        # del self.labels
        # collect()

        if self.num_class_samples:
            expected_length = len(self.images_and_classes) * self.num_class_samples
        else:
            expected_length = sum(self.class_sizes.values())

        if self.images is None:
            self.images = np.empty((expected_length,) + (self.input_size[1],) + (self.input_size[0],) + (3,))
            self.labels = np.empty((expected_length), np.int32)

        self.indices = np.random.choice(expected_length, expected_length, replace=False)
        self.img_sent = 0

        collect()

        relaxed_augmentation_pipeline = self.augmentations_fn(p=self.aug_p, pipe_p=self.aug_pipe_p)
        maxed_out_augmentation_pipeline = self.augmentations_fn(p=self.aug_p, pipe_p=1.0)

        # for each class
        x, y = [], []
        nartificial = 0
        for label, images in tqdm(self.images_and_classes.items()):
            if self.num_class_samples is None:
                # just all native samples, without augmentations
                for image in images:
                    self._add_sample(image, label)
            else:
                # if there are enough native samples
                if len(images) >= self.num_class_samples:
                    # randomly select samples of this class which will participate in this frame of epochs
                    indices = np.random.choice(len(images), self.num_class_samples, replace=False)
                    # apply the albumentations pipeline to the selected samples
                    for idx in indices:
                        if not self.is_validation:
                            self._add_sample(relaxed_augmentation_pipeline(image=images[idx])['image'], label)
                        else:
                            self._add_sample(images[idx], label)
                else:
                    # randomly pick the next image from the existing ones; keep applying the augmentation pipeline
                    # (with maxed-out probability) until we get num_class_samples DIFFERENT images
                    hashes = set()
                    norig = 0
                    while len(hashes) < self.num_class_samples:
                        if self.is_validation and norig < len(images):
                            # just include all originals first
                            image = images[norig]
                        else:
                            image = maxed_out_augmentation_pipeline(image=random.choice(images))['image']

                        next_hash = np.sum(image)
                        if next_hash not in hashes or (self.is_validation and norig <= len(images)):
                            # print(f'Adding orig {norig} out of {self.num_class_samples}, hashes={hashes}')
                            self._add_sample(image, label)
                            if next_hash in hashes:
                                norig += 1
                                hashes.add(norig)
                            else:
                                hashes.add(next_hash)
                                nartificial += 1

        # self.images = self.images[indices]; self.labels = self.labels[indices]
        logging.info(f'Generated {self.img_sent} samples ({nartificial} artificial)')
once I have images and classes loaded,
train_datagen = AugmentedBalancedSequence(
    images_and_classes=images_and_classes_train,
    input_size=INPUT_SIZE, class_sizes=class_sizes_train, num_class_samples=UPSCALE_SAMPLES,
    augmentations_fn=get_albumentations_pipeline, aug_p=AUG_P, aug_pipe_p=AUG_PIPE_P,
    preprocessing_fn=preprocess_input, batch_size=BATCH_SIZE, frame_length=FRAME_LENGTH,
    disk_saving_prob=0.05)

val_datagen = AugmentedBalancedSequence(
    images_and_classes=images_and_classes_val,
    input_size=INPUT_SIZE, class_sizes=class_sizes_val, num_class_samples=None,
    augmentations_fn=get_albumentations_pipeline, preprocessing_fn=preprocess_input,
    batch_size=BATCH_SIZE, frame_length=FRAME_LENGTH, is_validation=True)
and after the model is instantiated, I do
model.fit(train_datagen, epochs=600, verbose=1,
          validation_data=(val_datagen.images, val_datagen.labels),
          validation_batch_size=BATCH_SIZE,
          callbacks=[checkpointer, StoppingFromFile()], validation_freq=1)

OneR WEKA - wrong prediction?

I am trying to make a ranking of attributes depending on their predictive power by using OneR in WEKA iteratively. At every run I remove the chosen attribute to see what the next best is.
I have done this for all my attributes, and some (3 out of ten attributes) get 'ranked' higher than others, although they have a lower percentage of correct predictions, a smaller average ROC Area, and less compact rules.
As I understand it, OneR just looks at the frequency tables of the attribute it has against the class values, so it shouldn't care whether I take attributes out or not... but I am probably missing something.
Would anyone have an idea?
As an alternative you can use the OneR package (available on CRAN, more information here: OneR - Establishing a New Baseline for Machine Learning Classification Models).
With the option verbose = TRUE you get the accuracy of all attributes, e.g.:
> library(OneR)
> example(OneR)
OneR> data <- optbin(iris)
OneR> model <- OneR(data, verbose = TRUE)
Attribute Accuracy
1 * Petal.Width 96%
2 Petal.Length 95.33%
3 Sepal.Length 74.67%
4 Sepal.Width 55.33%
---
Chosen attribute due to accuracy
and ties method (if applicable): '*'
OneR> summary(model)
Rules:
If Petal.Width = (0.0976,0.791] then Species = setosa
If Petal.Width = (0.791,1.63] then Species = versicolor
If Petal.Width = (1.63,2.5] then Species = virginica
Accuracy:
144 of 150 instances classified correctly (96%)
Contingency table:
Petal.Width
Species (0.0976,0.791] (0.791,1.63] (1.63,2.5] Sum
setosa * 50 0 0 50
versicolor 0 * 48 2 50
virginica 0 4 * 46 50
Sum 50 52 48 150
---
Maximum in each column: '*'
Pearson's Chi-squared test:
X-squared = 266.35, df = 4, p-value < 2.2e-16
(full disclosure: I am the author of this package and I would be very interested in the results you get)
The OneR classifier looks a bit like nearest-neighbor in this respect. Given that, the following applies: in the source code of the OneR classifier, it says:
// if this attribute is the best so far, replace the rule
if (noRule || r.m_correct > m_rule.m_correct) {
    m_rule = r;
}
Thus, it should be possible (either in 1-R generally or in this implementation) for an attribute to block another, and yet be removed later in your process.
Say you have attributes 1, 2, and 3, winning in 50%, 30%, and 20% of cases respectively, and in all cases where attribute 1 is best, attribute 3 is second best.
Then, when attribute 1 is left out, attribute 3 wins with 70%, even though attribute 2 previously ranked as "better" than attribute 3 when all three were compared.