Getting "missing argument" error in ParamGridBuilder on PySpark

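I am currently implementing a gradient-boosted tree classification model (GBTClassifier) in PySpark, based on a Kaggle dataset. My final columns after fitting the pipeline are the feature columns plus features and classlabel (see finalcolumns in the full code below).
I am now trying to tune parameters with ParamGridBuilder. Here is my parameter grid code: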
param_grid=ParamGridBuilder.addGrid(gradboost.maxDepth,[2,3,4]).addGrid(gradboost.minInfoGain,[0.0, 0.1, 0.2, 0.3]).addGrid(gradboost.stepSize,[0.05, 0.1, 0.2, 0.4]).build()
and I am getting the error below:
param_grid=ParamGridBuilder.addGrid(gradboost.maxDepth,[2,3,4]).addGrid(gradboost.minInfoGain,[0.0, 0.1, 0.2, 0.3]).addGrid(gradboost.stepSize,[0.05, 0.1, 0.2, 0.4]).build()
TypeError: addGrid() missing 1 required positional argument: 'values'
I have not used ParamGridBuilder before. Do these arrays of values represent the columns of my current dataframe? Kindly help me figure out the error and explain the basic concept behind these values. Here is my full code:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer,VectorIndexer,OneHotEncoder,VectorAssembler
from pyspark.ml.classification import GBTClassifier
from pyspark.ml import Pipeline
from pyspark.ml.tuning import ParamGridBuilder,CrossValidator
from pyspark.ml.evaluation import BinaryClassificationEvaluator
spark=SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").appName("Gradientboostapp").enableHiveSupport().getOrCreate()
data= spark.read.csv("C:/Users/codemen/Desktop/Timeseries Analytics/liver_patient.csv",header=True, inferSchema=True)
#data.show()
print(data.count())
#data.printSchema()
print("After deleting null values")
data=data.na.drop()
print(data.count())
data.show(5)
gender_column=data.columns[1:2]
#print(categorical_column)
stringindexstage=[StringIndexer(inputCol=c,outputCol='genderindexed')for c in gender_column]
#print(stringindexstage)
stringindexstage=stringindexstage+[StringIndexer(inputCol='category',outputCol='classlabel')]
for x in stringindexstage:
    data=x.fit(data).transform(data)
data.show(3)
#data.show(3)
#print ("Type of",type(stringindexstage))
onehotencoderstage=[OneHotEncoder(inputCol='genderindexed', outputCol='onehot'+c) for c in gender_column]
for onehot in onehotencoderstage:
    data=onehot.transform(data)
data.show()
#vector assembler
print("data current")
data.show(3)
feature_column=['Age','onehotGender','Total_Bilirubin', 'Direct_Bilirubin', 'Alkaline_Phosphotase', 'Alamine_Aminotransferase', 'Aspartate_Aminotransferase', 'Total_Protiens',
'Albumin', 'Albumin_and_Globulin_Ratio']
print(feature_column)
#Vector Assembler stage
vectorassmblestage=[VectorAssembler(inputCols=feature_column,outputCol="features")]
#pipeline model
#allstages=stringindexstage+onehotencoderstage+vectorassmblestage
#for i in allstages:
#
pipelinestage=Pipeline(stages=vectorassmblestage)
#
# #fitting variable
pipelinemodel=pipelinestage.fit(data)
#
# #Transform Data
#
finalcolumns=feature_column+['features','classlabel']
#
dataframe=pipelinemodel.transform(data).select(finalcolumns)
print("final column print")
dataframe.show(5)
#splitting data into train test
(traindata, testdata)=dataframe.randomSplit([0.7,0.3],seed=1234)
#gradientboosting
gradboost=GBTClassifier(featuresCol='features',labelCol='classlabel',maxIter=10)
#parameter tuning
param_grid=ParamGridBuilder.addGrid(gradboost.maxDepth,[2,3,4]).addGrid(gradboost.minInfoGain,[0.0, 0.1, 0.2, 0.3]).addGrid(gradboost.stepSize,[0.05, 0.1, 0.2, 0.4]).build()
##Evaluation
print("Evaluation stage")
evaluator=BinaryClassificationEvaluator(rawPredictionCol='prediction')
#crossvalidation state
print("cross validation stage")
crossvalidation=CrossValidator(estimator=gradboost,estimatorParamMaps=param_grid,evaluator=evaluator)
crossvalidateData=crossvalidation.fit(dataframe)
##prediction on Training Data
print("Prediction in Training data ....")
predictTrain=crossvalidateData.transform(traindata)
predictTrain.show(10)
Thank you in advance

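For starters, it looks like you need to instantiate ParamGridBuilder with parentheses. addGrid is an instance method, so calling it on the class itself binds gradboost.maxDepth to self and the list to param, leaving values unfilled, which is exactly what the TypeError reports.
So:
param_grid = ParamGridBuilder() \
    .addGrid(...)
As for the values: they are not dataframe columns. Each list holds the candidate settings for one hyperparameter of the estimator, and build() returns every combination of them for CrossValidator to try.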
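To illustrate why the parentheses matter and what the value lists mean, here is a minimal pure-Python sketch. ToyGridBuilder is a hypothetical stand-in written for this example, not PySpark's class; it only assumes that the real builder follows the same chained-method pattern and that build() produces every combination of the listed values. The parameter names are plain strings here rather than the gradboost.maxDepth Param objects.

```python
from itertools import product


class ToyGridBuilder:
    """Hypothetical stand-in for a builder with chained addGrid calls."""

    def __init__(self):
        self._grid = {}

    def addGrid(self, param, values):
        # Record the candidate values for one hyperparameter and
        # return self so that calls can be chained.
        self._grid[param] = list(values)
        return self

    def build(self):
        # One dict per combination: the Cartesian product of all value lists.
        names = list(self._grid)
        return [dict(zip(names, combo))
                for combo in product(*(self._grid[n] for n in names))]


# Calling the method on the class itself: "maxDepth" is bound to `self`,
# the list is bound to `param`, and `values` is left unfilled.
try:
    ToyGridBuilder.addGrid("maxDepth", [2, 3, 4])
    error_message = None
except TypeError as exc:
    error_message = str(exc)

# Calling it on an instance works as intended.
grid = (ToyGridBuilder()
        .addGrid("maxDepth", [2, 3, 4])
        .addGrid("minInfoGain", [0.0, 0.1, 0.2, 0.3])
        .addGrid("stepSize", [0.05, 0.1, 0.2, 0.4])
        .build())

print(error_message)
print(len(grid))  # 3 * 4 * 4 combinations
```

With the grid from the question, cross-validation would therefore evaluate 3 * 4 * 4 = 48 parameter combinations, each on every fold.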

Related

Spark vs scikit-learn

I use PySpark for traffic classification with a decision tree model, and I measured the time required to train the model: 2 min 17 s. I then performed the same task with scikit-learn, where training took 1 min 19 s. Why is Spark slower, given that it is supposed to perform the task in a distributed way?
This is the code for PySpark:
df = (spark.read.format("csv")\
.option('header', 'true')\
.option("inferSchema", "true")\
.load("D:/PHD Project/Paper_3/Datasets_Download/IP Network Traffic Flows Labeled with 75 Apps/Dataset-Unicauca-Version2-87Atts.csv"))
from pyspark.ml.classification import DecisionTreeClassifier
dt = DecisionTreeClassifier(featuresCol = 'features', labelCol = 'label', maxDepth = 10)
pModel = dt.fit(trainDF)
And this is the code for scikit-learn:
import warnings
import pandas as pd
warnings.filterwarnings('ignore')
path = 'D:/PHD Project/Paper_3/Datasets_Download/IP Network Traffic Flows Labeled with 75 Apps/Dataset-Unicauca-Version2-87Atts.csv'
df = pd.read_csv(path)
#df.info()
%%time
from sklearn.tree import DecisionTreeClassifier
model = DecisionTreeClassifier()
model.fit(X_train, y_train)

How to export a PyTorch model to MATLAB?

I have created and trained a model in PyTorch, and I would like to be able to use the trained model within a MATLAB program. Based on this post I have been exporting the model to ONNX and then attempting to load the ONNX model in MATLAB. I followed this PyTorch tutorial on how to export a model to ONNX.
However, I get this error:
Error using nnet.internal.cnn.onnx.importONNXNetwork>iHandleTranslationIssues
Unable to import the network because of the following issues:
1 operator(s) : Unable to create an input layer for ONNX input #1 (with name 'input') because its data format is unknown or not supported as a MATLAB input layer.
If you know the input format, pass it by using the "InputDataFormats" parameter.
The input shape declared in the ONNX file is '(batch_size, 12)'.
1 operator(s) : Unable to create an output layer for ONNX network output #1 (with name 'output') because its data format is unknown or not supported as a MATLAB
output layer. If you know the output format, pass it using the 'OutputDataFormats' parameter.
To import the ONNX network as a dlnetwork, set the 'TargetNetwork' value to 'dlnetwork'.
To import the ONNX network as a layer graph with weights, use importONNXLayers.
To import the ONNX network as a function, use importONNXFunction.
Error in nnet.internal.cnn.onnx.importONNXNetwork (line 37)
iHandleTranslationIssues(translationIssues);
Error in importONNXNetwork (line 113)
Network = nnet.internal.cnn.onnx.importONNXNetwork(modelfile, varargin{:});
Here is a minimal example to create the error
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.onnx
class FFNN(nn.Module):
    def __init__(self):
        super(FFNN, self).__init__()
        input_size = 12
        self.layer1 = nn.Linear(input_size, 24)
        self.layer2 = nn.Linear(24, 24)
        self.layer3 = nn.Linear(24, 12)
        self.norm1 = nn.BatchNorm1d(12, eps=1e-05, momentum=0.1, affine=False, track_running_stats=True)
        self.layer4 = nn.Linear(12, 6)
        self.layer5 = nn.Linear(6, 1)

    def forward(self, x):
        x = F.relu(self.layer1(x))
        x = F.relu(self.layer2(x))
        x = self.norm1(self.layer3(x))
        x = F.relu(self.layer4(x))
        out = self.layer5(x)
        return out
net = FFNN()
net.eval()
batch_size = 1
input_size = 12
x = torch.randn(batch_size, input_size, requires_grad=True) # random model input for onnx
out = net(x)
input_names = ['x1', 'x2', 'x3', 'x4', 'x5', 'x6', 'x7', 'Qm', 'l1', 'uPre', 'basal', 'dG']
output_names = ['output']
# Export the model
torch.onnx.export(net,                        # model being run
                  x,                          # model input (or a tuple for multiple inputs)
                  'model.onnx',               # where to save the model (can be a file or file-like object)
                  export_params=True,         # store the trained parameter weights inside the model file
                  opset_version=10,           # the ONNX version to export the model to
                  do_constant_folding=True,   # whether to execute constant folding for optimization
                  input_names=['input'],      # the model's input names
                  output_names=['output'],    # the model's output names
                  dynamic_axes={'input': {0: 'batch_size'},    # variable length axes
                                'output': {0: 'batch_size'}})
nnMPC = importONNXNetwork("model.onnx"); % produces error
However I can validate the model in ONNX in python and it loads correctly.
So I think the problem is how I am loading it into MATLAB.
(my validation code)
import onnx
import onnxruntime
import numpy as np
onnx_model = onnx.load('model.onnx')
onnx.checker.check_model(onnx_model)
ort_session = onnxruntime.InferenceSession('model.onnx')
def to_numpy(tensor):
    return tensor.detach().cpu().numpy() if tensor.requires_grad else tensor.cpu().numpy()
# compute ONNX Runtime output prediction
ort_inputs = {ort_session.get_inputs()[0].name: to_numpy(x)}
ort_outs = ort_session.run(None, ort_inputs)
# compare ONNX Runtime and PyTorch results
np.testing.assert_allclose(to_numpy(out), ort_outs[0], rtol=1e-03, atol=1e-05)
print("Exported model has been tested with ONNXRuntime, and the result looks good!")

MultiOutputRegressor in spark/scala for xgboost

Is there a MultiOutputRegressor-style wrapper for XGBoost in Spark/Scala that runs in a distributed way with xgboost4j-spark? My requirement is to predict multiple targets.
Please find the Python code snippet below for reference.
from sklearn.multioutput import MultiOutputRegressor
from sklearn import ensemble
import xgboost as xgb
xgbr = xgb.XGBRegressor(max_depth=1, eta=0.01, silent=1, subsample= 0.8, reg_lambda = 1.515,
reg_alpha= 0.0017, min_child_weight=7, colsample_bytree=0.85, nthread= 32,
gamma= 0.01, objective='reg:linear',tree_method= 'approx', booster = 'gbtree', n_estimators=100)
X = ["indep_1","indep_2","indep3","indep_4","indep_5","indep_6","indep_7","indep_8"]
Y = ["dep_1","dep_2","dep_3","dep_4"]
model = MultiOutputRegressor(xgbr).fit(X, Y)

Error when using Seaborn in jupyter notebook(pyspark)

I am trying to visualize data using Seaborn. I have created a dataframe using SQLContext in PySpark. However, when I call lmplot it results in an error. I am not sure what I am missing. Given below is my code (I am using Jupyter Notebook):
import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)
df = sqlContext.read.load('file:///home/cloudera/Downloads/WA_Sales_Products_2012-14.csv',
format='com.databricks.spark.csv',
header='true',inferSchema='true')
sns.lmplot(x='Quantity', y='Year', data=df)
Error trace:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-86-2a2b43993475> in <module>()
----> 2 sns.lmplot(x='Quantity', y='Year', data=df)
/home/cloudera/anaconda3/lib/python3.5/site-packages/seaborn/regression.py in lmplot(x, y, data, hue, col, row, palette, col_wrap, size, aspect, markers, sharex, sharey, hue_order, col_order, row_order, legend, legend_out, x_estimator, x_bins, x_ci, scatter, fit_reg, ci, n_boot, units, order, logistic, lowess, robust, logx, x_partial, y_partial, truncate, x_jitter, y_jitter, scatter_kws, line_kws)
557 hue_order=hue_order, size=size, aspect=aspect,
558 col_wrap=col_wrap, sharex=sharex, sharey=sharey,
--> 559 legend_out=legend_out)
560
561 # Add the markers here as FacetGrid has figured out how many levels of the
/home/cloudera/anaconda3/lib/python3.5/site-packages/seaborn/axisgrid.py in __init__(self, data, row, col, hue, col_wrap, sharex, sharey, size, aspect, palette, row_order, col_order, hue_order, hue_kws, dropna, legend_out, despine, margin_titles, xlim, ylim, subplot_kws, gridspec_kws)
255 # Make a boolean mask that is True anywhere there is an NA
256 # value in one of the faceting variables, but only if dropna is True
--> 257 none_na = np.zeros(len(data), np.bool)
258 if dropna:
259 row_na = none_na if row is None else data[row].isnull()
TypeError: object of type 'DataFrame' has no len()
Any help or pointer is appreciated. Thank you in advance:-)
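sqlContext.read.load(...) returns a Spark DataFrame, not a pandas DataFrame. The traceback shows seaborn calling len(data), which a Spark DataFrame does not support, so seaborn does not convert it for you.
Try:
sns.lmplot(x='Quantity', y='Year', data=df.toPandas())
df.toPandas() returns a pandas DataFrame built from the Spark DataFrame. Note that it collects all rows to the driver, so it is only suitable when the data fits in memory.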

Printing confusion matrix to file produces illegal characters

I am classifying a set of images stored as tuples in a csv file.
The confusion matrix displayed on the terminal is correct. But when I write the same confusion matrix to a file, it produces illegal characters (32-bit hex).
Here's the code:
from sklearn.metrics import confusion_matrix
import numpy as np
import os
import csv
from sklearn import svm
from sklearn import cross_validation
from sklearn import linear_model
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt
from sklearn import metrics
import cPickle
def prec(num):
    return "%0.5f"%num
outfile = open("output/linear_svm_output.txt","a")
for dim in [20,30,40]:
    images=[]
    labels=[]
    name = str(dim)+"x"+str(dim)+".csv"
    with open(name,'r') as file:
        reader = csv.reader(file,delimiter=',')
        for line in file:
            labels.append(line[0])
            line=line[2:] # Remove the label
            image=[int(pixel) for pixel in line.split(',')]
            images.append(np.array(image))
    clf = svm.LinearSVC()
    print clf
    kf = cross_validation.KFold(len(images),n_folds=10,indices=True, shuffle=True, random_state=4)
    print "\nDividing dataset using `Kfold()` -:\n\nThe training dataset has been divided into " + str(len(kf)) + " parts\n"
    for train, test in kf:
        training_images=[]
        training_labels=[]
        for i in train:
            training_images.append(images[i])
            training_labels.append(labels[i])
        testing_images=[]
        testing_labels=[]
        for i in test:
            testing_images.append(images[i])
            testing_labels.append(labels[i])
        clf.fit(training_images,training_labels)
        predicted = clf.predict(testing_images)
        print prec(clf.score(testing_images, testing_labels))
        outfile.write(prec(clf.score(testing_images, testing_labels)))
        outfile.write(str(clf))
        outfile.write(confusion_matrix(testing_labels, predicted))
        print confusion_matrix(testing_labels, predicted)
        # outfile.write(metrics.classification_report(testing_labels, predicted))
    print "\nDividing dataset using `train_test_split()` -:\n"
    training_images, testing_images, training_labels, testing_labels = cross_validation.train_test_split(images,labels, test_size=0.2, random_state=0)
    clf = clf.fit(training_images,training_labels)
    score = clf.score(testing_images,testing_labels)
    predicted = clf.predict(testing_images)
    print prec(score)
    outfile.write(str(clf))
    outfile.write(confusion_matrix(testing_labels, predicted))
    print confusion_matrix(testing_labels, predicted)
    # outfile.write(metrics.classification_report(testing_labels, predicted))
Output in file-
302e 3939 3338 374c 696e 6561 7253 5643
2843 3d31 2e30 2c20 636c 6173 735f 7765
...
Use the following to write the matrix to a file as readable text:
with open(filename, 'w') as f:
    f.write(np.array2string(confusion_matrix(y_test, pred), separator=', '))
outfile.write(confusion_matrix(testing_labels, predicted)) writes the array's raw bytes to the file, which is why the matrix shows up in binary form. If you want to write it as human-readable text and you are using Python 2.x, try this:
print >> outfile, confusion_matrix(testing_labels, predicted)
This sends the output of print to outfile instead of stdout.
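For readers on Python 3, where the print >> syntax no longer exists, the same idea might look like the sketch below. It is only an illustration under stated assumptions: the nested list is a made-up 2x2 matrix standing in for the array returned by confusion_matrix, and io.StringIO stands in for a file opened in text mode.

```python
import io

# Hypothetical 2x2 confusion matrix; a nested list stands in for the
# NumPy array returned by confusion_matrix.
matrix = [[50, 2],
          [3, 45]]

# StringIO stands in for a file opened in text mode.
outfile = io.StringIO()

# Python 3 spelling of `print >> outfile, ...`: print() converts its
# arguments to text with str() and writes them to the given file object.
for row in matrix:
    print(" ".join(str(count) for count in row), file=outfile)

text = outfile.getvalue()
print(text, end="")
```

The point in both versions is the same: convert the matrix to text before it reaches the file, rather than handing the array object to write().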