Is this way of using Naive Bayes in MATLAB correct?

I'm using Naive Bayes in MATLAB for classification like this:
dataFull = csvread('haberman.data');        % 306x4: age, year of operation, positive nodes, class
dataTraining = dataFull(1:250, :);          % first 250 rows for training
dataTest = dataFull(251:end, :);            % remaining rows for testing
dataTrainingClass = dataTraining(:, 4);     % class label is the 4th column
dataTraining = dataTraining(:, 1:3);
dataTestClass = dataTest(:, 4);
dataTest = dataTest(:, 1:3);
nb = NaiveBayes.fit(dataTraining, dataTrainingClass);
predictions = predict(nb, dataTest);
correct = (dataTestClass == predictions);
accuracy = sum(correct) / length(correct)
This uses Haberman's Survival dataset from the UCI Machine Learning Repository.
Am I doing this right?
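For reference, newer MATLAB releases (R2014b and later) replace the NaiveBayes class with fitcnb; a minimal sketch of the same workflow, reusing the variables defined above:
% Same workflow with the newer fitcnb API (Gaussian Naive Bayes by default).
mdl = fitcnb(dataTraining, dataTrainingClass);
predictions = predict(mdl, dataTest);
accuracy = mean(dataTestClass == predictions)   % fraction of test rows classified correctly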

Related

PySpark image dimension reduction with PCA

I am using PySpark in the AWS cloud to extract image features:
import pyspark.sql.functions as F
from pyspark.ml.image import ImageSchema
from pyspark.ml.linalg import DenseVector, VectorUDT

ImageSchema.imageFields
img2vec = F.udf(lambda x: DenseVector(ImageSchema.toNDArray(x).flatten()),
                VectorUDT())
df_vec = df_cat.withColumn('original_vectors', img2vec("image"))
df_vec.show()
After standardizing the data:
from pyspark.ml.feature import MinMaxScaler

standardizer = MinMaxScaler(inputCol="original_vectors",
                            outputCol="scaledFeatures",
                            min=-1.0,
                            max=1.0)
# withStd=True, withMean=True)
model_std = standardizer.fit(df_vec)
df_std = model_std.transform(df_vec)
df_std.show()
... when I apply PCA for dimension reduction, I receive an error that I have not been able to debug for a couple of weeks :(
[Error_1 and Error_2: screenshots of the error traces, not reproduced here]
Could you please help me solve this?
I am using PySpark spark-3.0.3-bin-hadoop2.7.
from pyspark.ml.feature import StandardScaler
from pyspark.ml.linalg import Vectors, VectorUDT

img2vec = F.udf(lambda x: Vectors.dense(x), VectorUDT())
df = df.withColumn("data_as_vector", img2vec("data_as_resized_array"))
standardizer = StandardScaler(withMean=True, withStd=True, inputCol="data_as_vector", outputCol="scaledFeatures")
For images, you first need to resize the image data with the code below, and you must use the resized image data:
from typing import Iterator
import numpy as np
import pandas as pd
from PIL import Image

def resize_img(img_data, resize=True):
    mode = 'RGBA' if (img_data.nChannels == 4) else 'RGB'
    img = Image.frombytes(mode=mode, data=img_data.data, size=[img_data.width, img_data.height])
    img = img.convert('RGB') if (mode == 'RGBA') else img
    img = img.resize([224, 224], resample=Image.Resampling.BICUBIC) if (resize) else img
    arr = convert_bgr_array_to_rgb_array(np.asarray(img))
    arr = arr.reshape([224*224*3]) if (resize) else arr.reshape([img_data.width*img_data.height*3])
    return arr

def resize_image_udf(dataframe_batch_iterator: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    for dataframe_batch in dataframe_batch_iterator:
        dataframe_batch["data_as_resized_array"] = dataframe_batch.apply(resize_img, args=(True,), axis=1)
        dataframe_batch["data_as_array"] = dataframe_batch.apply(resize_img, args=(False,), axis=1)
        yield dataframe_batch

resized_df = df_image.select("image.*").mapInPandas(resize_image_udf, schema)
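Note that convert_bgr_array_to_rgb_array is not defined in the answer. Spark's ImageSchema stores pixel data in BGR (OpenCV) channel order, so a minimal version of that helper, reusing the numpy import above, could look like this:
def convert_bgr_array_to_rgb_array(img_array):
    # Swap the BGR channel order used by Spark's ImageSchema to RGB.
    B, G, R = img_array.T
    return np.array((R, G, B)).T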
Then you can run the StandardScaler and PCA with:
from pyspark.ml.feature import PCA

model_std = standardizer.fit(df)
df = model_std.transform(df)

# PCA (n_components is the number of components you choose to keep).
# Note: inputCol='data_as_vector' runs PCA on the raw vectors; to use the
# standardized vectors instead, set inputCol='scaledFeatures'.
pca = PCA(k=n_components, inputCol='data_as_vector', outputCol='pcaFeatures')
model_pca = pca.fit(df)
# Transform the images
df = model_pca.transform(df)
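If it helps when choosing k, the fitted PCAModel also reports how much variance each component captures:
# Proportion of variance captured by each of the k principal components.
print(model_pca.explainedVariance)
print(model_pca.explainedVariance.toArray().sum())  # total variance retained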
I think I am too late to answer your question, sorry.

ValueError: Target size (torch.Size([128])) must be the same as input size (torch.Size([112]))

I have a training function, inside which there are two vectors:
d_labels_a = torch.zeros(128)
d_labels_b = torch.ones(128)
Then I have these features:
# Compute output
features_a = nets[0](input_a)
features_b = nets[1](input_b)
features_c = nets[2](inputs)
And then a domain classifier (nets[4]) makes predictions:
d_pred_a = torch.squeeze(nets[4](features_a))
d_pred_b = torch.squeeze(nets[4](features_b))
d_pred_a = d_pred_a.float()
d_pred_b = d_pred_b.float()
print(d_pred_a.shape)
The error is raised in the loss function:
pred_a = torch.squeeze(nets[3](features_a))
pred_b = torch.squeeze(nets[3](features_b))
pred_c = torch.squeeze(nets[3](features_c))
loss = criterion(pred_a, labels_a) + criterion(pred_b, labels_b) + criterion(pred_c, labels) + d_criterion(d_pred_a, d_labels_a) + d_criterion(d_pred_b, d_labels_b)
The problem is that the size of d_pred_a/b differs from that of d_labels_a/b, but only after a certain point. Indeed, when I print the shape of d_pred_a/b it is torch.Size([128]), but then it changes to torch.Size([112]) on its own.
It comes from here:
# Compute output
features_a = nets[0](input_a)
features_b = nets[1](input_b)
features_c = nets[2](inputs)
because if I print the shape of features_a it is torch.Size([128, 2048]), but it changes to torch.Size([112, 2048]).
nets[0] is a VGG, like this:
class VGG16(nn.Module):
    def __init__(self, input_size, batch_norm=False):
        super(VGG16, self).__init__()
        self.in_channels, self.in_width, self.in_height = input_size
        self.block_1 = VGGBlock(self.in_channels, 64, batch_norm=batch_norm)
        self.block_2 = VGGBlock(64, 128, batch_norm=batch_norm)
        self.block_3 = VGGBlock(128, 256, batch_norm=batch_norm)
        self.block_4 = VGGBlock(256, 512, batch_norm=batch_norm)

    @property
    def input_size(self):
        return self.in_channels, self.in_width, self.in_height

    def forward(self, x):
        x = self.block_1(x)
        x = self.block_2(x)
        x = self.block_3(x)
        x = self.block_4(x)
        # x = self.avgpool(x)
        x = torch.flatten(x, 1)
        return x
I solved it. The problem was the last, incomplete batch: I set drop_last=True in the DataLoader and it worked.
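A minimal sketch of that fix (train_dataset stands in for the actual dataset): with drop_last=True every batch holds exactly 128 samples, so the hard-coded d_labels_a/b tensors always match the prediction size. An alternative would be to size the label tensors from the actual batch, e.g. torch.zeros(features_a.size(0)).
from torch.utils.data import DataLoader

# Drop the final incomplete batch so every batch has exactly 128 samples,
# matching the fixed-size d_labels_a = torch.zeros(128) tensors.
loader = DataLoader(train_dataset, batch_size=128, shuffle=True, drop_last=True)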

How can I measure Precision and Recall on Logistic Regression with PySpark?

I am using a Logistic Regression model on PySpark through Databricks, but I am not able to get my precision and recall. Everything works fine and I am able to get my ROC, but there is no attribute or library for precision and recall.
from pyspark.mllib.evaluation import MulticlassMetrics

lrModel = LogisticRegression()
predictions = bestModel.transform(testData)
# Instantiate metrics object
results = predictions.select(['probability', 'label'])
results_collect = results.collect()
results_list = [(float(i[0][0]), 1.0-float(i[1])) for i in results_collect]
scoreAndLabels = sc.parallelize(results_list)
metrics = MulticlassMetrics(scoreAndLabels)
# Overall statistics
precision = metrics.precision()
recall = metrics.recall()
f1Score = metrics.fMeasure()
print("Summary Stats")
print("Precision = %s" % precision)
print("Recall = %s" % recall)
print("F1 Score = %s" % f1Score)
>>>Summary Stats
>>>Precision = 0.0
>>>Recall = 0.0
>>>F1 Score = 0.0
I was able to create my own function to do so. It returns everything and more. I am using MulticlassMetrics() from the mllib package. Since it is multiclass, it calculates metrics for each label, so you have to specify which label you want to retrieve.
### Model Evaluator User Defined Functions
import pyspark.sql.functions as F
from pyspark.sql.types import DoubleType
from pyspark.mllib.evaluation import MulticlassMetrics

def udfModelEvaluator(dfPredictions, labelColumn='label'):
    colSelect = dfPredictions.select(
        [F.col('prediction').cast(DoubleType()),
         F.col(labelColumn).cast(DoubleType()).alias('label')])
    metrics = MulticlassMetrics(colSelect.rdd)
    mAccuracy = metrics.accuracy
    mPrecision = metrics.precision(1.0)   # metrics for the positive label
    mRecall = metrics.recall(1.0)
    mF1 = metrics.fMeasure(1.0, 1.0)
    cm = metrics.confusionMatrix().toArray()   # compute the matrix once
    mMatrix = cm.astype(int)
    mTP = cm[1][1]
    mTN = cm[0][0]
    mFP = cm[0][1]
    mFN = cm[1][0]
    mResults = [mAccuracy, mPrecision, mRecall, mF1, mMatrix, mTP, mTN, mFP, mFN,
                "Return [[0]=Accuracy, [1]=Precision, [2]=Recall, [3]=F1, [4]=ConfusionMatrix, [5]=TP, [6]=TN, [7]=FP, [8]=FN]"]
    return mResults
To call the function:
metricsList = udfModelEvaluator(predictionsData, "label")
metricsList
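As a side note, the 0.0 results in the question most likely come from feeding raw probabilities into MulticlassMetrics, which expects (prediction, label) pairs. A shorter alternative, sketched here with the built-in evaluator from pyspark.ml (assuming the usual prediction and label columns):
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

# Weighted precision/recall computed directly from the predictions DataFrame.
evaluator = MulticlassClassificationEvaluator(labelCol="label", predictionCol="prediction")
precision = evaluator.setMetricName("weightedPrecision").evaluate(predictions)
recall = evaluator.setMetricName("weightedRecall").evaluate(predictions)
f1 = evaluator.setMetricName("f1").evaluate(predictions)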

Fit with additional parameters

I am quite new to MATLAB and I am trying to use this code I found online.
I am trying to fit a curve described by HydrodynamicSpectrum, but instead of fitting only after supplying fvA and fmA as fixed inputs, I would like to obtain fitted values for these parameters as well.
I have tried removing them and changing them, but nothing works. I was wondering if anyone here could point me in the right direction.
specFunc = @(f, para)HydrodynamicSpectrum(f, [para fvA fmA]);
[fit.AXfc, fit.AXD] = NonLinearFit(fit.f(indXY), fit.AXSpec(indXY), specFunc, [iguess_AXfc iguess_AXD]);
[fit.AYfc, fit.AYD] = NonLinearFit(fit.f(indXY), fit.AYSpec(indXY), specFunc, [iguess_AYfc iguess_AYD]);
[fit.ASumfc, fit.ASumD] = NonLinearFit(fit.f(indSum), fit.ASumSpec(indSum), specFunc, [iguess_ASumfc iguess_ASumD]);
predictedAX = HydrodynamicSpectrum(fit.f, [fit.AXfc fit.AXD fvA fmA]);
predictedAY = HydrodynamicSpectrum(fit.f, [fit.AYfc fit.AYD fvA fmA]);
predictedASum = HydrodynamicSpectrum(fit.f, [fit.ASumfc fit.ASumD fvA fmA]);
function spec = HydrodynamicSpectrum(f, para)
    fc = para(1);
    D = para(2);
    fv = para(3);
    fm = para(4);
    f = abs(f); % Kludge!
    spec = D/pi^2*(1+sqrt(f/fv))./((fc - f.*sqrt(f./fv) - (f.^2)/fm).^2 + (f + f.*sqrt(f./fv)).^2);
function [fc, D, sfc, sD] = NonLinearFit(f, spec, specFunc, init)
    func = @(para, f)spec./specFunc(f, para);
    [paraFit, resid, J] = nlinfit(f, ones(1, length(spec)), func, init);
    fc = paraFit(1);
    D = paraFit(2);
    ci = nlparci(real(paraFit), real(resid), real(J)); % Kludge!!
    sfc = (ci(1,2) - ci(1,1))/4;
    sD = (ci(2,2) - ci(2,1))/4;
[paraFit, resid, J] = nlinfit(f, ones(1, length(spec)), func, init);
It looks like you get your fitted parameters from this line, and you are processing them further to get other values out of the function. You can modify your second function to return them as well.
As there are very few comments and the question seems application-specific, there is not much more help I can give with what you have presented.
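For illustration, a minimal sketch of that idea: let nlinfit estimate all four parameters [fc D fv fm] instead of fixing fv and fm (iguess_fv and iguess_fm are hypothetical starting guesses you would supply):
% Hypothetical sketch: fit all four parameters instead of fixing fv and fm.
f    = fit.f(indXY);
spec = fit.AXSpec(indXY);
specFunc = @(f, para) HydrodynamicSpectrum(f, para);   % para = [fc D fv fm]
func = @(para, f) spec./specFunc(f, para);             % same ratio-to-one trick as NonLinearFit
init = [iguess_AXfc iguess_AXD iguess_fv iguess_fm];   % starting guesses
[paraFit, resid, J] = nlinfit(f, ones(1, length(spec)), func, init);
fc = paraFit(1); D = paraFit(2); fv = paraFit(3); fm = paraFit(4);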

How to programmatically construct a large cell array

How could I automatically construct a dataset like the one below, assuming that the matrix summary_whts has roughly 400 columns?
lrwghts = dataset(...
{summary_whts(:,01),'w00'},...
{summary_whts(:,02),'w01'},...
{summary_whts(:,03),'w02'},...
{summary_whts(:,04),'w03'},...
{summary_whts(:,05),'w04'},...
{summary_whts(:,06),'w05'},...
{summary_whts(:,07),'w06'},...
{summary_whts(:,08),'w07'},...
{summary_whts(:,09),'w08'},...
{summary_whts(:,10),'w09'},...
{summary_whts(:,11),'w10'},...
{summary_whts(:,12),'w11'},...
'ObsNames',summary_mthd);
Why not use a simple loop to populate the dataset?
nCols = size(summary_whts, 2);   % number of columns of summary_whts
C = cell(nCols, 2);              % plain cell array; C avoids shadowing the dataset function
for i = 1:nCols
    C{i,1} = summary_whts(:,i);
    C{i,2} = sprintf('w%04d', i);
end
C{end+1, 1} = 'ObsNames';
C{end, 2} = summary_mthd;
At last, I found it! This is what I was looking for:
cat = [];   % note: this shadows MATLAB's built-in cat function
for i = 0:(size(X,2))
    cat = [cat; sprintf('w%03d', i)];
end
cat = cellstr(cat);
lrwghts = dataset({summary_whts, cat{:}}, 'ObsNames', cellstr(summary_mthd));
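For reference, in newer MATLAB releases table has superseded dataset; a minimal sketch of the same construction (assuming summary_whts and summary_mthd as above):
% Build one name per column (w000, w001, ...) and construct the table directly.
names = arrayfun(@(k) sprintf('w%03d', k), 0:size(summary_whts,2)-1, 'UniformOutput', false);
lrwghts = array2table(summary_whts, 'VariableNames', names, 'RowNames', cellstr(summary_mthd));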