Low alpha for NLTK agreement using MASI distance

I'm getting a very low value for Krippendorff's alpha when I calculate agreement in NLTK using MASI as the distance function.
Three coders (Inky, Blinky, and Sue) are instructed to assign topic labels (love, gifts, slime, or gaming) to two texts (text01 and text02), based on what the texts are about. Each text can be about more than one topic, so coders may assign each text more than one label. The data and the code used to make the calculations are shown below:
import nltk
from nltk.metrics import agreement
from nltk.metrics.distance import masi_distance
from nltk.metrics.distance import jaccard_distance

# (coder, item, label)
data = [('inky', 'text01', frozenset(['love', 'gifts'])),
        ('blinky', 'text01', frozenset(['love', 'gifts'])),
        ('sue', 'text01', frozenset(['love', 'gifts'])),
        ('inky', 'text02', frozenset(['slime', 'gaming'])),
        ('blinky', 'text02', frozenset(['slime'])),
        ('sue', 'text02', frozenset(['slime', 'gaming']))]

jaccard_task = nltk.AnnotationTask(distance=jaccard_distance)
masi_task = nltk.AnnotationTask(distance=masi_distance)
tasks = [jaccard_task, masi_task]

for task in tasks:
    task.load_array(data)
    print("Statistics for dataset using {}".format(task.distance))
    print("C: {}\nI: {}\nK: {}".format(task.C, task.I, task.K))
    print("Pi: {}".format(task.pi()))
    print("Kappa: {}".format(task.kappa()))
    print("Multi-Kappa: {}".format(task.multi_kappa()))
    print("Alpha: {}".format(task.alpha()))
    print()
When I run the code, I get the following results:
Statistics for dataset using <function jaccard_distance at 0x09D26DB0>
C: {'inky', 'sue', 'blinky'}
I: {'text01', 'text02'}
K: {frozenset({'slime'}), frozenset({'love', 'gifts'}), frozenset({'gaming', 'slime'})}
Pi: 0.7272727272727273
Kappa: 0.7777777777777777
Multi-Kappa: 0.7499999999999999
Alpha: 0.75
Statistics for dataset using <function masi_distance at 0x09D26DF8>
C: {'inky', 'sue', 'blinky'}
I: {'text01', 'text02'}
K: {frozenset({'slime'}), frozenset({'love', 'gifts'}), frozenset({'gaming', 'slime'})}
Pi: 0.8172727272727272
Kappa: 0.8511111111111113
Multi-Kappa: 0.8324999999999998
Alpha: -1.5
My question is, why is the alpha so low when using the MASI distance function compared to Jaccard?

I was unable to reproduce the error and got the correct value of Krippendorff's alpha with MASI distance when running the provided code, using Python 3.5.2, NumPy 1.18.2, and NLTK 3.4.5. The most probable answer, then, is that you need to update NLTK.
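As a sanity check after upgrading, here is a small sketch of what the two distance functions return for the one disagreement in the data above (the values shown are what a recent NLTK produces):

from nltk.metrics.distance import jaccard_distance, masi_distance

a = frozenset(['slime', 'gaming'])
b = frozenset(['slime'])

# Jaccard distance: 1 - |a & b| / |a | b| = 1 - 1/2 = 0.5
print(jaccard_distance(a, b))
# MASI scales the Jaccard score by a monotonicity weight m;
# b is a proper subset of a, so m = 0.67 and the distance is
# 1 - (1/2) * 0.67 = 0.665
print(masi_distance(a, b))

If the alpha is still negative after upgrading, checking these raw pairwise distances is a quick way to confirm whether masi_distance itself behaves as expected.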


How can I get the numbers for the correlation matrix from Pandas Profiling

I really like the heatmap, but what I need are the numbers behind the heatmap (AKA correlation matrix).
Is there an easy way to extract the numbers?
It was a bit hard to track down, but starting from the documentation, specifically the report structure, and digging into the function get_correlation_items(summary) in the source, we reach a call that loops over each of the correlation types in the summary. To obtain the summary object itself, we look up the caller, get_report_structure(summary), and find that its summary argument is simply the description_set property.
Given the above, we can now do the following using version 2.9.0:
import numpy as np
import pandas as pd
from pandas_profiling import ProfileReport
df = pd.DataFrame(
    np.random.rand(100, 5),
    columns=["a", "b", "c", "d", "e"]
)
profile = ProfileReport(df, title="StackOverflow", explorative=True)
correlations = profile.description_set["correlations"]
print(correlations.keys())
dict_keys(['pearson', 'spearman', 'kendall', 'phi_k'])
To see a specific correlation do:
correlations["phi_k"]["e"]
a 0.000000
b 0.112446
c 0.289983
d 0.000000
e 1.000000
Name: e, dtype: float64
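Each entry appears to be a plain pandas DataFrame, so you can also grab or export the whole matrix at once; a small sketch (the file name is just an example):

phi_k_matrix = correlations["phi_k"]  # a regular pandas DataFrame
print(phi_k_matrix)
phi_k_matrix.to_csv("phi_k_correlations.csv")  # e.g., export for later use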

GPflow change point kernel issue with multiple dimensions

I'm following the tutorial here for implementing a change point kernel in gpflow.
However, I have 3 inputs and 1 output, and I would like the changepoint kernel to be on the first input dimension only, with standard kernels on the other two input dimensions. I'm getting the following error:
InvalidArgumentError: Incompatible shapes: [2000,3,1] vs. [3,2000,1] [Op:Mul] name: mul/
Below is a minimum working example. Could anyone please let me know where I'm going wrong?
gpflow version 2.0.0.rc1
import pandas as pd
import gpflow
from gpflow.utilities import print_summary

df_all = pd.read_csv(
    'https://raw.githubusercontent.com/ipan11/gp/master/dataset.csv')

# Training dataset in numpy format
X = df_all[['X1', 'X2', 'X3']].to_numpy()
Y1 = df_all['Y'].to_numpy().reshape(-1, 1)

# Changepoint kernel only on the first dimension and standard kernels for the other two dimensions
base_k1 = gpflow.kernels.Matern32(lengthscale=0.2, active_dims=[0])
base_k2 = gpflow.kernels.Matern32(lengthscale=2., active_dims=[0])
k1 = gpflow.kernels.ChangePoints([base_k1, base_k2], [.4], steepness=5)
k2 = gpflow.kernels.Matern52(lengthscale=[1., 1.], active_dims=[1, 2])
k_all = k1 + k2
print_summary(k_all)

m1 = gpflow.models.GPR(data=(X, Y1), kernel=k_all, mean_function=None)
print_summary(m1)

opt = gpflow.optimizers.Scipy()

def objective_closure():
    return -m1.log_marginal_likelihood()

opt_logs = opt.minimize(objective_closure, m1.trainable_variables,
                        options=dict(maxiter=100))
The correct answer would be to move the active_dims=[0] from the base_k* kernels to the ChangePoints() kernel:
k1 = gpflow.kernels.ChangePoints([base_k1, base_k2], [0.4], steepness=5, active_dims=[0])
but this is currently not supported in GPflow 2, which is a bug. I've opened an issue on GitHub and will update this answer once it's fixed (if you feel up to having a go at fixing this bug, feel free to open a pull request; help is always welcome!).
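For reference, this is what the intended construction would look like once the bug is fixed (a sketch only; it is not functional on GPflow 2.0.0rc1 because of the bug above):

base_k1 = gpflow.kernels.Matern32(lengthscale=0.2)
base_k2 = gpflow.kernels.Matern32(lengthscale=2.)
# active_dims lives on the ChangePoints kernel, not on the base kernels
k1 = gpflow.kernels.ChangePoints([base_k1, base_k2], [0.4], steepness=5,
                                 active_dims=[0])
k2 = gpflow.kernels.Matern52(lengthscale=[1., 1.], active_dims=[1, 2])
k_all = k1 + k2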

Please help debug my call to the scipy library for a Kolmogorov-Smirnov test

I am completing an assignment but cannot get the right results from a Kolmogorov-Smirnov test for a small sample of observations against a 'norm' distribution.
I have set up a minimal sample in a Jupyter notebook with the expected kstest results, tried running it in several environments, and reviewed the call for hours. The answer key says my ks_value and p_value are wildly wrong, but I cannot see my error.
- The sample I have is from the test run in the answer key; it is a 1-D array, a valid input option.
- The sample mean and standard deviation I compute look right.
- If I change ddof it makes a small difference (the hint is to use ddof=0).
- 'norm' is a valid distribution for kstest.
The library documentation is at:
https://docs.scipy.org/doc/scipy-0.14.0/reference/generated/scipy.stats.kstest.html#scipy-stats-kstest
Any ideas or comments?
Would you expect a sample = [0.37, 0.27, 0.69, 0.56, 0.26] compared to a normal distribution to have a 'KS test statistic' of 0.64 or 0.24, and a 'p-value' of 0.02 or 0.94?
TIA
import pandas as pd
import numpy as np
from scipy.stats import kstest
sample = [0.37, 0.27, 0.69, 0.56, 0.26]
normal_args = (np.mean(sample), np.std(sample, ddof=0))
print('mean', normal_args[0])
print('std', normal_args[1])
ks_value, p_value = kstest(sample, 'norm', normal_args)
print('ks_value', ks_value)
print('p_value', p_value)
print('')
print('#####posted solution')
print('expected ks_value = 0.63919407')
print('expected p_value = 0.01650327')
mean 0.43000000000000005
std 0.1688786546606764
ks_value 0.23881183701141995
p_value 0.9379686201081335
#####posted solution
expected ks_value = 0.63919407
expected p_value = 0.01650327
My bad, a newbie mistake.
The function signature defines the third argument as args=(). I had passed my parameter tuple in treating it as a plain positional input. Changing the call to
ks_value, p_value = kstest(sample, 'norm', args=normal_args)
yields the correct response.
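For completeness, a minimal self-contained version of the corrected call (same data as above):

import numpy as np
from scipy.stats import kstest

sample = [0.37, 0.27, 0.69, 0.56, 0.26]
normal_args = (np.mean(sample), np.std(sample, ddof=0))

# Pass the fitted distribution parameters via the args keyword
ks_value, p_value = kstest(sample, 'norm', args=normal_args)
print('ks_value', ks_value)
print('p_value', p_value)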

How to implement an exponentially decaying learning rate in Keras by following the global step

Look at the following example:
# encoding: utf-8
import numpy as np
import pandas as pd
import random
import math
from keras import Sequential
from keras.layers import Dense, Activation
from keras.optimizers import Adam, RMSprop
from keras.callbacks import LearningRateScheduler

X = [i * 0.05 for i in range(100)]

def step_decay(epoch):
    initial_lrate = 1.0
    drop = 0.5
    epochs_drop = 2.0
    lrate = initial_lrate * math.pow(drop,
                                     math.floor((1 + epoch) / epochs_drop))
    return lrate

def build_model():
    model = Sequential()
    model.add(Dense(32, input_shape=(1,), activation='relu'))
    model.add(Dense(1, activation='linear'))
    adam = Adam(lr=0.5)
    model.compile(loss='mse', optimizer=adam)
    return model

model = build_model()
lrate = LearningRateScheduler(step_decay)
callback_list = [lrate]

for ep in range(20):
    X_train = np.array(random.sample(X, 10))
    y_train = np.sin(X_train)
    X_train = np.reshape(X_train, (-1, 1))
    y_train = np.reshape(y_train, (-1, 1))
    model.fit(X_train, y_train, batch_size=2, callbacks=callback_list,
              epochs=1, verbose=2)
In this example, the LearningRateScheduler does not change the learning rate at all, because with each call to fit() the epoch index passed to the scheduler is always 0. The learning rate is therefore just constant (1.0, according to step_decay). Instead of setting epochs > 1 directly, I have to run the outer loop shown in the example and, inside each loop, run just 1 epoch. (This is the case when I implement deep reinforcement learning rather than supervised learning.)
My question is how to set an exponentially decaying learning rate in my example, and how to get the learning rate in each iteration of ep.
You can actually pass two arguments to the function you give to LearningRateScheduler.
According to the Keras documentation, the scheduler is
a function that takes an epoch index as input (integer, indexed from 0) and current learning rate and returns a new learning rate as output (float).
So, basically, simply replace your initial_lr with a function parameter, like so:
def step_decay(epoch, lr):
    # initial_lrate = 1.0  # no longer needed
    drop = 0.5
    epochs_drop = 2.0
    lrate = lr * math.pow(drop, math.floor((1 + epoch) / epochs_drop))
    return lrate
The actual function you implement is not exponential decay (as you mention in your title) but a staircase function.
Also, you mention that your learning rate does not change inside your loop. That's true, because you set model.fit(..., epochs=1, ...) and epochs_drop = 2.0 at the same time. I am not sure whether this is your desired behavior; you are providing a toy example, and it's not clear in that case.
I would like to add the more common case where you don't mix a for loop with fit() and just provide a different epochs parameter in your fit() call. In this case you have the following options:
First of all, Keras provides decay functionality itself in its predefined optimizers. For example, in your case with Adam(), the actual code is:
lr = lr * (1. / (1. + self.decay * K.cast(self.iterations, K.dtype(self.decay))))
which is not exactly exponential either, and it's somewhat different from TensorFlow's. Also, it's only used when decay > 0.0, obviously.
To follow the TensorFlow convention for exponential decay, you should implement:
decayed_learning_rate = learning_rate * decay_rate ^ (global_step / decay_steps)
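For instance, a minimal sketch of that formula as a scheduler function at per-epoch granularity (the initial_lr, decay_rate, and decay_steps values here are illustrative assumptions):

def exponential_decay(epoch, lr):
    initial_lr = 1.0    # assumed starting rate
    decay_rate = 0.96   # assumed decay factor
    decay_steps = 10.0  # assumed decay interval, in epochs
    # epoch plays the role of global_step when updating once per epoch
    return initial_lr * decay_rate ** (epoch / decay_steps)

lrate = LearningRateScheduler(exponential_decay)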
Depending on your needs, you can either implement a Callback subclass and define a function within it (see the steps below) or use LearningRateScheduler, which is essentially exactly that: a Callback subclass, with some checking, that updates the learning rate at the start of each epoch.
If you want finer handling of your learning rate policy (per batch, for example), you have to implement your own subclass, since as far as I know there is no implemented subclass for this task. The good part is that it's super easy:
Create a subclass
class LearningRateExponentialDecay(Callback):
and add the __init__() function, which will initialize your instance with all needed parameters and also create a global_step variable to keep track of the iterations (batches):
def __init__(self, init_learning_rate, decay_rate, decay_steps):
    self.init_learning_rate = init_learning_rate
    self.decay_rate = decay_rate
    self.decay_steps = decay_steps
    self.global_step = 0
Finally, add the actual function inside the class:
def on_batch_begin(self, batch, logs=None):
    actual_lr = float(K.get_value(self.model.optimizer.lr))
    # Note: ** is exponentiation in Python (^ is bitwise XOR)
    decayed_learning_rate = actual_lr * self.decay_rate ** (self.global_step / self.decay_steps)
    K.set_value(self.model.optimizer.lr, decayed_learning_rate)
    self.global_step += 1
The really cool part is that if you want the above subclass to update the rate every epoch instead, you can use on_epoch_begin(self, epoch, logs=None), which nicely has epoch as a parameter in its signature. That case is even easier, as you can skip the global step altogether (no need to keep track of it now, unless you want a fancier way to apply your decay) and use epoch in its place.
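Putting the pieces together, a hedged usage sketch (the hyperparameter values are purely illustrative, and the snippets above assume from keras.callbacks import Callback and import keras.backend as K):

# Illustrative values only; tune for your problem
lr_decay = LearningRateExponentialDecay(init_learning_rate=0.5,
                                        decay_rate=0.96,
                                        decay_steps=100)
model.fit(X_train, y_train, batch_size=2, epochs=20,
          callbacks=[lr_decay], verbose=2)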

Keras ImageDataGenerator Slow

I am looking for the best approach to train on larger-than-memory data in Keras, and I am currently noticing that the vanilla ImageDataGenerator tends to be slower than I would hope.
I have two networks training on the Kaggle cats vs. dogs dataset (25,000 images):
1) this approach is exactly the code from: http://www.pyimagesearch.com/2016/09/26/a-simple-neural-network-with-python-and-keras/
2) same as (1) but using an ImageDataGenerator instead of loading into memory the data
Note: for below, "preprocessing" means resizing, scaling, flattening
I find the following on my gtx970:
For network 1, it takes ~0s per epoch.
For network 2, it takes ~36s per epoch if the preprocessing is done in the data generator.
For network 2, it takes ~13s per epoch if preprocessing is done in a first-pass outside of the data generator.
Is this likely the speed limit for ImageDataGenerator (13s seems like the usual 10-100x difference between disk and RAM...)? Are there approaches/mechanisms better suited for training on larger-than-memory data when using Keras?
e.g. Perhaps there is a way to get the ImageDataGenerator in Keras to save its processed images after the first epoch?
Thanks!
I assume you already might have solved this, but nevertheless...
Keras image preprocessing has the option of saving the results by setting the save_to_dir argument in the flow() or flow_from_directory() function:
https://keras.io/preprocessing/image/
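A minimal sketch of that option (the directory paths and sizes here are placeholders):

from keras.preprocessing.image import ImageDataGenerator

datagen = ImageDataGenerator(rescale=1. / 255)
# save_to_dir writes each generated (resized/augmented) image to disk,
# so you can inspect or reuse the processed output
flow = datagen.flow_from_directory('data/train',
                                   target_size=(64, 64),
                                   batch_size=32,
                                   save_to_dir='data/augmented',
                                   save_format='jpeg')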
As I understand it, the problem is that augmented images are used only once in a training cycle of a model, not even across several epochs. So it's a huge waste of GPU cycles while the CPU is struggling. I found the following solution:
- I generate as many augmentations in RAM as I can.
- I use them for training across a frame of epochs, 10 to 30, whatever it takes to get a noticeable convergence.
- After that, I generate a new batch of augmented images (by implementing on_epoch_end) and the process goes on.
This approach keeps the GPU busy most of the time, while still benefiting from data augmentation. I use a custom Sequence subclass to generate the augmentations and fix class imbalance at the same time.
EDIT: adding some code to clarify the idea:
from pyutilz.string import read_config_file
from tqdm.notebook import tqdm
from pathlib import Path
from gc import collect
import numpy as np
import tensorflow
import logging
import random
import json
import cv2

class StoppingFromFile(tensorflow.keras.callbacks.Callback):
    def on_epoch_end(self, epoch, logs=None):
        # read_config_file is expected to populate the 'stop' variable via globals()
        if read_config_file('control.ini', 'ML', 'stop', globals()):
            if stop is not None:
                if stop == True or stop == 'True':
                    logging.warning('Model should be stopped according to the control file')
                    self.model.stop_training = True

class AugmentedBalancedSequence(tensorflow.keras.utils.Sequence):
    def __init__(self, images_and_classes: dict, input_size: tuple, class_sizes: list,
                 augmentations_fn: object, preprocessing_fn: object, batch_size: int = 10,
                 num_class_samples=100, frame_length: int = 5, aug_p: float = 0.1,
                 aug_pipe_p: float = 0.2, is_validation: bool = False,
                 disk_saving_prob: float = .01, disk_example_nfiles: int = 50):
        """
        From a dict of file paths grouped by class label, creates an augmented,
        balanced training set every N epochs.
        If the current class is too scarce, ensures that the current frame has no
        duplicate final images. If it's rich enough, ensures that the current frame
        has no duplicate base images.
        """
        logging.info(f'Got {len(images_and_classes)} classes.')
        self.disk_example_nfiles = disk_example_nfiles
        self.disk_saving_prob = disk_saving_prob
        self.cur_example_file = 0
        self.images_and_classes = images_and_classes
        self.num_class_samples = num_class_samples
        self.augmentations_fn = augmentations_fn
        self.preprocessing_fn = preprocessing_fn
        self.is_validation = is_validation
        self.frame_length = frame_length
        self.batch_size = batch_size
        self.class_sizes = class_sizes
        self.input_size = input_size
        self.aug_pipe_p = aug_pipe_p
        self.aug_p = aug_p
        self.images = None
        self.epoch = 0
        self._generate_data()

    def __len__(self):
        return int(np.ceil(len(self.images) / float(self.batch_size)))

    def __getitem__(self, idx):
        a = idx * self.batch_size
        b = a + self.batch_size
        return self.images[a:b], self.labels[a:b]

    def on_epoch_end(self):
        self.epoch += 1
        # Allow changing instance attributes at runtime via a control file on disk
        mydict = {}
        fname = 'control.json'
        p = Path(fname)
        if p.is_file():
            try:
                with open(fname) as f:
                    mydict = json.load(f)
                for var, val in mydict.items():
                    if hasattr(self, var):
                        converted = val
                        if converted is not None:
                            if getattr(self, var) != converted:
                                setattr(self, var, converted)
                                print(f'{var} became {val}')
            except Exception as e:
                logging.error(str(e))
        # Regenerate the augmented data once per frame of epochs
        if self.epoch % self.frame_length == 0:
            self._generate_data()

    def _add_sample(self, image, label):
        # Place the (optionally preprocessed) image at a pre-shuffled position,
        # occasionally saving an example to disk for visual inspection
        idx = self.indices[self.img_sent]
        if self.disk_saving_prob > 0:
            if random.random() < self.disk_saving_prob:
                self.cur_example_file += 1
                if self.cur_example_file > self.disk_example_nfiles:
                    self.cur_example_file = 1
                Path('example_images/').mkdir(parents=True, exist_ok=True)
                cv2.imwrite(f'example_images/test{self.cur_example_file}.jpg',
                            cv2.cvtColor(image, cv2.COLOR_RGB2BGR))
        if self.preprocessing_fn:
            self.images[idx] = self.preprocessing_fn(image)
        else:
            self.images[idx] = image
        self.labels[idx] = label
        self.img_sent += 1

    def _generate_data(self):
        logging.info('Generating new set of augmented data...')
        collect()
        if self.num_class_samples:
            expected_length = len(self.images_and_classes) * self.num_class_samples
        else:
            expected_length = sum(self.class_sizes.values())
        if self.images is None:
            self.images = np.empty((expected_length,) + (self.input_size[1],)
                                   + (self.input_size[0],) + (3,))
            self.labels = np.empty((expected_length), np.int32)
        self.indices = np.random.choice(expected_length, expected_length, replace=False)
        self.img_sent = 0
        collect()
        relaxed_augmentation_pipeline = self.augmentations_fn(p=self.aug_p, pipe_p=self.aug_pipe_p)
        maxed_out_augmentation_pipeline = self.augmentations_fn(p=self.aug_p, pipe_p=1.0)
        nartificial = 0
        # For each class...
        for label, images in tqdm(self.images_and_classes.items()):
            if self.num_class_samples is None:
                # Just all native samples, without augmentations
                for image in images:
                    self._add_sample(image, label)
            else:
                if len(images) >= self.num_class_samples:
                    # Enough native samples: randomly select the ones which will
                    # participate in this frame of epochs
                    indices = np.random.choice(len(images), self.num_class_samples, replace=False)
                    # Apply the albumentations pipeline to the selected samples
                    for idx in indices:
                        if not self.is_validation:
                            self._add_sample(relaxed_augmentation_pipeline(image=images[idx])['image'], label)
                        else:
                            self._add_sample(images[idx], label)
                else:
                    # Randomly pick the next image from the existing ones; keep applying the
                    # augmentation pipeline (with maxed-out probability) until we get
                    # num_class_samples DIFFERENT images
                    hashes = set()
                    norig = 0
                    while len(hashes) < self.num_class_samples:
                        if self.is_validation and norig < len(images):
                            # Just include all originals first
                            image = images[norig]
                        else:
                            image = maxed_out_augmentation_pipeline(image=random.choice(images))['image']
                        next_hash = np.sum(image)
                        if next_hash not in hashes or (self.is_validation and norig <= len(images)):
                            self._add_sample(image, label)
                            if next_hash in hashes:
                                norig += 1
                                hashes.add(norig)
                            else:
                                hashes.add(next_hash)
                                nartificial += 1
        logging.info(f'Generated {self.img_sent} samples ({nartificial} artificial)')
Once I have the images and classes loaded, I create the generators:
train_datagen = AugmentedBalancedSequence(
    images_and_classes=images_and_classes_train,
    input_size=INPUT_SIZE, class_sizes=class_sizes_train,
    num_class_samples=UPSCALE_SAMPLES,
    augmentations_fn=get_albumentations_pipeline,
    aug_p=AUG_P, aug_pipe_p=AUG_PIPE_P,
    preprocessing_fn=preprocess_input,
    batch_size=BATCH_SIZE, frame_length=FRAME_LENGTH,
    disk_saving_prob=0.05)

val_datagen = AugmentedBalancedSequence(
    images_and_classes=images_and_classes_val,
    input_size=INPUT_SIZE, class_sizes=class_sizes_val,
    num_class_samples=None,
    augmentations_fn=get_albumentations_pipeline,
    preprocessing_fn=preprocess_input,
    batch_size=BATCH_SIZE, frame_length=FRAME_LENGTH,
    is_validation=True)
and after the model is instantiated, I do:
model.fit(train_datagen, epochs=600, verbose=1,
          validation_data=(val_datagen.images, val_datagen.labels),
          validation_batch_size=BATCH_SIZE,
          callbacks=[checkpointer, StoppingFromFile()],
          validation_freq=1)