Why is my learned Bayesian network not rooted at the binary class variable?

I want to compare bankrupt firm profiles among different countries using Bayesian Networks (pomegranate library in Python). The class is a binary variable (1=bankrupt, 0=active company), and the rest are financial features.
I get two very different BN structures for the two countries, with Class at the top of the structure for the first country and lower down for the second.
Why is this happening? How can I compare the structures across countries if Class is not the parent?
Here is my code for plotting the BN structures:
from pomegranate import *
import graphviz
model = BayesianNetwork.from_samples(
    X=df[['WC/TA', 'RE/TA', 'EBIT/TA', 'BVE/TL', 'Class']].values,
    algorithm='exact',
    state_names=altman_features + ['Class'])
p = model.log_probability(X=df[['WC/TA', 'RE/TA', 'EBIT/TA', 'BVE/TL', 'Class']].values).sum()
model.plot()
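One hedged way to compare the learned graphs across countries, assuming your pomegranate version exposes the fitted structure attribute, is to print the parent set of each node instead of relying only on the plot (a minimal sketch, not a definitive answer to the question):
# model.structure is a tuple of tuples: entry i holds the parent indices of node i.
names = altman_features + ['Class']
for i, parents in enumerate(model.structure):
    print(names[i], '<-', [names[p] for p in parents])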

Related

How to use FaceNet and DBSCAN with multiple embeddings per identity?

I have the following setting:
A surveillance system takes photos of people's faces (there is a varying number of photos for each person).
I run FaceNet for each photo and get a list of embedding vectors for each person (each person is represented by a list of embeddings, not by a single one).
The problem:
I want to cluster the observed people using DBSCAN, but I need to guarantee that face embeddings from the same person go to the same cluster (remember we can have multiple photos of the same person, and we already know they must belong to the same cluster).
One solution could be to get a "mean" or average embedding for each person, but I believe this data loss is going to produce bad results.
Another solution could be to concatenate N embeddings (with N constant) in a single vector and pass that 512xN vector to DBSCAN, but the problem with this is that the order in which the embeddings are appended to this vector is going to produce different results.
Has anyone faced this same problem?
deepface wraps the Facenet face recognition model. The regular face recognition process is shown below.
#!pip install deepface
from deepface import DeepFace
my_set = [
    ["img1.jpg", "img2.jpg"],
    ["img1.jpg", "img3.jpg"],
]
obj = DeepFace.verify(my_set, model_name = 'Facenet')
for i in obj:
    print(i["distance"])
If you need the embeddings generated by facenet, you can adopt deepface as well.
from deepface.commons import functions
from deepface.basemodels import Facenet
model = Facenet.loadModel()
#this detects and aligns faces. Facenet expects 160x160x3 shaped inputs.
img1 = functions.preprocess_face("img1.jpg", target_size = (160, 160))
img2 = functions.preprocess_face("img2.jpg", target_size = (160, 160))
#this finds embeddings for images
img1_embedding = model.predict(img1)
img2_embedding = model.predict(img2)
Embeddings will be 128-dimensional vectors for Facenet. You can run any clustering algorithm on the embeddings. I have applied k-means for this kind of study. I don't have any experience with DBSCAN, but you can apply it once you have the embeddings.
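If you do go with DBSCAN, here is a minimal sketch using scikit-learn (not part of the original answer; the embeddings are assumed to be stacked into one array, and eps is only a placeholder you would tune for Facenet distances):
import numpy as np
from sklearn.cluster import DBSCAN

# Stack the per-face Facenet embeddings into an (n_faces, 128) array.
embeddings = np.vstack([img1_embedding, img2_embedding])
clustering = DBSCAN(eps=10, min_samples=2, metric='euclidean').fit(embeddings)
print(clustering.labels_)  # -1 marks noise; equal labels mean same cluster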
Besides, you can use other face recognition models within deepface, such as VGG-Face, OpenFace, Facebook DeepFace, DeepID and Dlib.

Interclass and Intraclass classification structure of CNN

I am working on an inter-class and intra-class classification problem with one CNN. For example, first there are two classes, Cat and Dog; then within Cat there is a classification into three different breeds of cats, and within Dog there are five different breeds of dogs.
I haven't tried coding it yet; I am just working out whether such a design is feasible.
My question is: what would a feasible design for this kind of problem be?
For training, I am thinking of first designing a CNN-1 network that differentiates cats from dogs across all the training images. After the separation of cat and dog, CNN-2 and CNN-3 will be trained further on these images for each breed of dog and cat. I am just not sure how the testing would work in this setup.
I have approached a similar problem previously in Python. Hopefully this is helpful and you can come up with an alternative implementation in Matlab if that is what you are using.
After all was said and done, I landed on a single model for all predictions. For your purpose you could have one binary output for dog vs. cat, another multi-class output for the dog breeds, and another multi-class output for the cat breeds.
Using Tensorflow, I created a mask for the irrelevant classes. For example, if the image was of a cat, then all of the dog breeds are irrelevant and they should not impact model training for that example. This required a customized TF Dataset (that converted 0's to -1 for the mask) and a customized loss function that returned 0 error when the mask was present for that example.
Finally for the training process. Specific to your question, you will have to create custom accuracy functions that can handle the mask values how you want them to, but otherwise this part of the process should be standard. It was best practice to evenly spread out the classes among the training data but they can all be trained together.
If you google "Multi-Task Training" you can find additional resources for this problem.
Here are some code snips if you are interested:
For the customize TF dataset that masked irrelevant labels...
# Replace 0's with -1 for mask when there aren't any labels
def produce_mask(features):
    for filt, tensor in features.items():
        if "target" in filt:
            condition = tf.equal(tf.math.reduce_sum(tensor), 0)
            features[filt] = tf.where(condition, tf.ones_like(tensor) * -1, tensor)
    return features
def create_dataset(filepath, batch_size=10):
    ...
    # **** This is where the mask was applied to the dataset
    dataset = dataset.map(produce_mask, num_parallel_calls=cpu_count())
    ...
    return parsed_features
Custom loss function. I was using binary-crossentropy because my problem was multi-label. You will likely want to adapt this to categorical-crossentropy.
# Custom loss function
def masked_binary_crossentropy(y_true, y_pred):
    mask = backend.cast(backend.not_equal(y_true, -1), backend.floatx())
    return backend.binary_crossentropy(y_true * mask, y_pred * mask)
Then for the custom accuracy metrics: I was using top-k accuracy; you may need to modify this for your purposes, but it will give you the general idea. When comparing this to the loss function, instead of converting everything to 0, which would over-inflate the accuracy, this function filters those values out entirely. That works because the outputs are measured individually, so each output (binary, cat breed, dog breed) gets a different accuracy measure, filtered only to the relevant examples.
Here, backend is the Keras backend.
def top_5_acc(y_true, y_pred, k=5):
    mask = backend.cast(backend.not_equal(y_true, -1), tf.bool)
    mask = tf.math.reduce_any(mask, axis=1)
    masked_true = tf.boolean_mask(y_true, mask)
    masked_pred = tf.boolean_mask(y_pred, mask)
    return top_k_categorical_accuracy(masked_true, masked_pred, k)
Edit
No, in the scenario I described above there is only one model and it is trained with all of the data together. There are 3 outputs to the single model. The mask is a major part of this as it allows the network to only adjust weights that are relevant to the example. If the image was a cat, then the dog breed prediction does not result in loss.
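To make the single-model idea concrete, here is a minimal sketch of a three-output Keras model (layer sizes and output names are illustrative assumptions, not taken from the answer; the masked loss defined above is reused for the breed heads, though as noted you would likely adapt it to a masked categorical cross-entropy):
from tensorflow import keras
from tensorflow.keras import layers

inputs = keras.Input(shape=(128, 128, 3))
x = layers.Conv2D(32, 3, activation='relu')(inputs)
x = layers.MaxPooling2D()(x)
x = layers.Flatten()(x)
x = layers.Dense(64, activation='relu')(x)

# One head per task: cat vs. dog, cat breed (3 classes), dog breed (5 classes)
animal = layers.Dense(1, activation='sigmoid', name='animal_target')(x)
cat_breed = layers.Dense(3, activation='softmax', name='cat_breed_target')(x)
dog_breed = layers.Dense(5, activation='softmax', name='dog_breed_target')(x)

model = keras.Model(inputs, [animal, cat_breed, dog_breed])
model.compile(optimizer='adam',
              loss={'animal_target': 'binary_crossentropy',
                    'cat_breed_target': masked_binary_crossentropy,
                    'dog_breed_target': masked_binary_crossentropy})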

How do I generate a class_labels.txt in Keras for usage in a CoreML model?

I have been trying to create an iOS app using Core ML. I have trained a convolutional neural network in Keras, but when I use coremltools to convert this model to a Core ML model, it shows that the output is a multidimensional array, while I want it to be a class probability. How do I generate a .txt file with the class labels in Keras?
This is the code I use to generate a Core ML model:
import coremltools
coreml_model = coremltools.converters.keras.convert(
    "chars74kV3.0.h5",
    class_labels="class_labels.txt",
    image_input_names=['input'],
    input_names=['input'],
    image_scale=255.)
coreml_model.author = 'Thijs van der Heijden'
coreml_model.license = 'MIT'
coreml_model.description = 'A basic Deep Convolutional Neural Network to classify handwritten letters.'
coreml_model.input_description['input'] = 'A 64x64 pixel Image'
coreml_model.save('chars74k.mlmodel')
The class_labels.txt file should just be a plain text file with one label per line, in order of the classes in your training set. For example,
dog
cat
person
would be your label file for a three-class network where class 0 was "dog", class 1 was "cat", and class 2 was "person". If this is a public classification dataset, you should have that information with the dataset, and if it's your own you'll just have to create such a mapping file. You'd have to do this anyway to associate class numbers with values.
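For instance, if the model was trained with a Keras ImageDataGenerator and flow_from_directory, one small sketch (assuming a generator named train_generator; this is not part of the original answer) for writing that file is:
# class_indices maps label name -> class index, so sort names by their index.
labels = sorted(train_generator.class_indices, key=train_generator.class_indices.get)
with open("class_labels.txt", "w") as f:
    f.write("\n".join(labels))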

Reporting log-likelihood / perplexity of spark LDA model (different in local vs distributed models?)

Given a training corpus docsWithFeatures, I've trained an LDA model in Spark (via Scala API) like so:
import org.apache.spark.mllib.clustering.{LDA, DistributedLDAModel, LocalLDAModel}
val n_topics = 10;
val lda = new LDA().setK(n_topics).setMaxIterations(20)
val ldaModel = lda.run(docsWithFeatures)
val distLDAModel = ldaModel.asInstanceOf[DistributedLDAModel]
And now I want to report the log-likelihood and perplexity of the model.
I can get the log-likelihood like so:
scala> distLDAModel.logLikelihood
res11: Double = -2600097.2875547716
But this is where things get weird. I also wanted the perplexity, which is only implemented for a local model, so I run:
val localModel = distLDAModel.toLocal
Which lets me get the (log) perplexity like so:
scala> localModel.logPerplexity(docsWithFeatures)
res14: Double = 0.36729132682898674
But the local model also supports the log-likelihood calculation, which I run like this:
scala> localModel.logLikelihood(docsWithFeatures)
res15: Double = -3672913.268234148
So what's going on here? Shouldn't the two log-likelihood values be the same? The documentation for a distributed model says
"logLikelihood: log likelihood of the training corpus, given the inferred topics and document-topic distributions"
while for a local model it says:
"logLikelihood(documents): Calculates a lower bound on the provided documents given the inferred topics."
I guess these are different, but it's not clear to me how or why. Which one should I use? That is, which one is the "true" likelihood of the model, given the training documents?
To summarize, two main questions:
1 - How and why are the two log-likelihood values different, and which should I use?
2 - When reporting perplexity, am I correct in thinking that I should use the exponential of the logPerplexity result? (But why does the model give log perplexity instead of just plain perplexity? Am I missing something?)
1) These two log-likelihood values differ because they are computing the log-likelihood for two different models. DistributedLDAModel is effectively computing the log-likelihood w.r.t. a model where the parameters for the topics and the mixing weights for each of the documents are constants (as I mentioned in another post, the DistributedLDAModel is essentially regularized PLSI, though you need to use logPrior to also account for the regularization), while the LocalLDAModel takes the view that the topic parameters as well as the mixing weights for each document are random variables. So in the case of LocalLDAModel you have to integrate (marginalize) out the topic parameters and document mixing weights in order to compute the log-likelihood (and this is what makes the variational approximation/lower bound necessary, though even without the approximation the log-likelihoods would not be the same since the models are just different.)
As far as which one you should use, my suggestion (without knowing what you ultimately want to do) would be to go with the log-likelihood method attached to the class you originally trained (i.e. the DistributedLDAModel). As a side note, the primary (only?) reason that I can see to convert a DistributedLDAModel into a LocalLDAModel via toLocal is to enable the computation of topic mixing weights for a new (out-of-training) set of documents (for more info on this see my post on this thread: Spark MLlib LDA, how to infer the topics distribution of a new unseen document?), an operation which is not (but could be) supported in DistributedLDAModel.
2) Log-perplexity is just the negative log-likelihood divided by the number of tokens in your corpus. If you divide the log-perplexity by math.log(2.0), then the resulting value can also be interpreted as the approximate number of bits per token needed to encode your corpus (as a bag of words) given the model.
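As a quick worked check of those two points (plain Python arithmetic using the values reported above, not Spark code):
import math

log_perplexity = 0.36729132682898674       # localModel.logPerplexity from above
local_log_likelihood = -3672913.268234148  # localModel.logLikelihood from above

perplexity = math.exp(log_perplexity)                 # plain perplexity, about 1.44
bits_per_token = log_perplexity / math.log(2.0)       # about 0.53 bits per token
num_tokens = -local_log_likelihood / log_perplexity   # roughly 1e7 tokens in this corpus
print(perplexity, bits_per_token, num_tokens)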

Generating synthetic data of mixed type (numerical/categorical) in Matlab (or any other)

I'm trying to generate some synthetic data for experiments. When it comes to data sets with numerical features this is rather easy, I just use a Gaussian mixture (using Netlab, a package for Matlab) and that's done.
Now, I also need to generate some data sets with numerical and categorical features. The numerical part I can easily do using the above method; what about the categorical part?
I was thinking to generate a categorical feature with (say) 3 categories with probabilities of 68.2% (+/- 1 sigma), 27.2% (between +/- 1 sigma and +/- 2 sigma), and 4.6% (the rest) within the objects with the same label.
And perhaps another categorical feature with 5 categories, with probabilities of 34.1%, 34.1%, 13.6%, 13.6%, 4.6% - again, within the objects with the same label.
Does that make sense to you guys? Any thoughts?
I can easily write the code for the above, but if you know of any function that does it for me - please let me know.
Thanks!
It's easy to do in Python using numpy:
import numpy as np

# Each row is a one-hot draw from a 3-category distribution with the given probabilities.
np.random.multinomial(n=1, pvals=[.3, .3, .4], size=10)
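For instance, to match the 3-category feature described in the question, a short sketch (not from the original answer) that draws samples and converts the one-hot rows into category labels:
import numpy as np

# Probabilities from the question: 68.2%, 27.2%, 4.6%
draws = np.random.multinomial(n=1, pvals=[0.682, 0.272, 0.046], size=1000)
categories = draws.argmax(axis=1)  # one categorical value (0, 1 or 2) per sample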