PySpark TypeError: object of type 'ParamGridBuilder' has no len()

I am trying to tune my model on Databricks using PySpark.
I receive the following error:
TypeError: object of type 'ParamGridBuilder' has no len()
My code is listed below.
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
als = ALS(userCol = "userId",itemCol="movieId", ratingCol="rating", coldStartStrategy="drop", nonnegative = True, implicitPrefs = False)
# Imports ParamGridBuilder package
from pyspark.ml.tuning import ParamGridBuilder
# Creates a ParamGridBuilder, and adds hyperparameters
param_grid = ParamGridBuilder().addGrid(als.rank, [5,10,20,40]).addGrid(als.maxIter, [5,10,15,20]).addGrid(als.regParam,[0.01,0.001,0.0001,0.02])
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",predictionCol="prediction")
# Imports CrossValidator package
from pyspark.ml.tuning import CrossValidator
# Creates cross validator and tells Spark what to use when training and evaluates
cv = CrossValidator(estimator=als,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=5)
model = cv.fit(training)
TypeError: object of type 'ParamGridBuilder' has no len()
Full Error Log:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-1952169986445972> in <module>()
----> 1 model = cv.fit(training)
2
3 # Extract best combination of values from cross validation
4
5 best_model = model.bestModel
/databricks/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
130 return self.copy(params)._fit(dataset)
131 else:
--> 132 return self._fit(dataset)
133 else:
134 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
/databricks/spark/python/pyspark/ml/tuning.py in _fit(self, dataset)
279 est = self.getOrDefault(self.estimator)
280 epm = self.getOrDefault(self.estimatorParamMaps)
--> 281 numModels = len(epm)

It simply means that your object does not support len() (unlike lists). Thus, in your line
param_grid = ParamGridBuilder()
    .addGrid(als.rank, [5,10,20,40])
    .addGrid(als.maxIter, [5,10,15,20])
    .addGrid(als.regParam, [0.01,0.001,0.0001,0.02])
you should add .build() at the end to actually construct the grid.
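For reference, a minimal sketch of the corrected tuning code (same estimator, evaluator, and grid values as in the question). ParamGridBuilder.build() returns a plain Python list of param maps, which is what CrossValidator expects for estimatorParamMaps and what len() can be called on:
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# .build() turns the builder into a list of param maps (one per combination)
param_grid = (ParamGridBuilder()
              .addGrid(als.rank, [5, 10, 20, 40])
              .addGrid(als.maxIter, [5, 10, 15, 20])
              .addGrid(als.regParam, [0.01, 0.001, 0.0001, 0.02])
              .build())

print(len(param_grid))  # 4 * 4 * 4 = 64 candidate parameter combinations

cv = CrossValidator(estimator=als,
                    estimatorParamMaps=param_grid,
                    evaluator=evaluator,
                    numFolds=5)
model = cv.fit(training)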

Related

Mismatch in Input Shape of tf.keras model

I am getting a mismatch in the input shape of a tf.keras model. The code block is given below with the stack trace. I am using hub.KerasLayer as my first layer. The model is being built to be trained with TensorFlow Federated (TFF). The inputs to the model are variable-length strings. Kindly suggest a way out.
# Making a TensorFlow model
import tensorflow as tf
import tensorflow_hub as hub
import tensorflow_federated as tff
from tensorflow import keras

def create_keras_model():
    encoder = hub.load("https://tfhub.dev/google/tf2-preview/gnews-swivel-20dim/1")
    return tf.keras.models.Sequential([
        hub.KerasLayer(encoder, input_shape=[], dtype=tf.string, trainable=True),
        keras.layers.Dense(32, activation='relu'),
        keras.layers.Dense(16, activation='relu'),
        keras.layers.Dense(1, activation='sigmoid'),
    ])

def model_fn():
    # We _must_ create a new model here, and _not_ capture it from an external
    # scope. TFF will call this within different graph contexts.
    keras_model = create_keras_model()
    return tff.learning.from_keras_model(
        keras_model,
        input_spec=preprocessed_example_dataset.element_spec,
        loss=tf.keras.losses.BinaryCrossentropy(),
        metrics=[tf.keras.metrics.Accuracy()])

iterative_process = tff.learning.build_federated_averaging_process(
    model_fn,
    client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
    server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0))
WARNING:tensorflow:From /usr/local/lib/python3.6/dist-packages/tensorflow/python/ops/resource_variable_ops.py:1817: calling BaseResourceVariable.__init__ (from tensorflow.python.ops.resource_variable_ops) with constraint is deprecated and will be removed in a future version.
Instructions for updating:
If using Keras pass *_constraint arguments to layers.
WARNING:tensorflow:Model was constructed with shape (None,) for input Tensor("keras_layer_input:0", shape=(None,), dtype=string), but it was called on an input with incompatible shape (None, None).
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
<ipython-input-27-68fa27e65b7e> in <module>()
3 model_fn,
4 client_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=0.02),
----> 5 server_optimizer_fn=lambda: tf.keras.optimizers.SGD(learning_rate=1.0))
18 frames
/usr/local/lib/python3.6/dist-packages/tensorflow/python/framework/func_graph.py in wrapper(*args, **kwargs)
    966     except Exception as e:  # pylint:disable=broad-except
    967       if hasattr(e, "ag_error_metadata"):
--> 968         raise e.ag_error_metadata.to_exception(e)
    969       else:
    970         raise
ValueError: in user code:

    /usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/learning/federated_averaging.py:91 __call__  *
        num_examples_sum = dataset.reduce(
    /usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/learning/model_utils.py:152 forward_pass  *
        self._model.forward_pass(batch_input, training), model_lib.BatchOutput)
    /usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/learning/keras_utils.py:391 forward_pass  *
        return self._forward_pass(batch_input, training=training)
    /usr/local/lib/python3.6/dist-packages/tensorflow_federated/python/learning/keras_utils.py:359 _forward_pass  *
        predictions = self._keras_model(inputs, training=training)
    /usr/local/lib/python3.6/dist-packages/tensorflow_hub/keras_layer.py:222 call  *
        result = f()
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/saved_model/load.py:486 _call_attribute  **
        return instance.__call__(*args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py:580 __call__
        result = self._call(*args, **kwds)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/def_function.py:618 _call
        results = self._stateful_fn(*args, **kwds)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:2419 __call__
        graph_function, args, kwargs = self._maybe_define_function(args, kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:2735 _maybe_define_function
        *args, **kwargs)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:2238 canonicalize_function_inputs
        self._flat_input_signature)
    /usr/local/lib/python3.6/dist-packages/tensorflow/python/eager/function.py:2305 _convert_inputs_to_signature
        format_error_message(inputs, input_signature))

    ValueError: Python inputs incompatible with input_signature:
      inputs: (
        Tensor("batch_input:0", shape=(None, None), dtype=string))
      input_signature: (
        TensorSpec(shape=(None,), dtype=tf.string, name=None))

Not able to pass StringIndexer as list to the model pipeline stage

PySpark pipelines are quite new to me. I am trying to create the stages in a pipeline by passing the list below:
pipeline = Pipeline().setStages([indexer,assembler,dtc_model])
where I am applying feature indexing on multiple columns:
cat_col = ['Gender','Habit','Mode']
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(training_data_0) for column in cat_col ]
On running fit on the pipeline I am getting the error below:
model_pipeline = pipeline.fit(train_df)
How can I pass the list to the stages? Is there any workaround or better way to achieve this?
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-3999694668013877> in <module>
----> 1 model_pipeline = pipeline.fit(train_df)
/databricks/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
130 return self.copy(params)._fit(dataset)
131 else:
--> 132 return self._fit(dataset)
133 else:
134 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
/databricks/spark/python/pyspark/ml/pipeline.py in _fit(self, dataset)
95 if not (isinstance(stage, Estimator) or isinstance(stage, Transformer)):
96 raise TypeError(
---> 97 "Cannot recognize a pipeline stage of type %s." % type(stage))
98 indexOfLastEstimator = -1
99 for i, stage in enumerate(stages):
TypeError: Cannot recognize a pipeline stage of type <class 'list'>.
Try the below:
cat_col = ['Gender','Habit','Mode']
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(training_data_0) for column in cat_col ]
assembler = VectorAssembler...
dtc_model = DecisionTreeClassifier...
# Create pipeline using transformers and estimators
stages = indexer
stages.append(assembler)
stages.append(dtc_model)
pipeline = Pipeline().setStages(stages)
model_pipeline = pipeline.fit(train_df)
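Equivalently, since indexer is already a Python list of fitted StringIndexer models, you can build the flat stages list in one expression (a small sketch reusing the variable names above). The key point is that the pipeline's stages must be a flat list of Estimators and Transformers, not a list that itself contains a list:
from pyspark.ml import Pipeline

# list concatenation keeps the stages flat: [indexer_1, ..., indexer_n, assembler, dtc_model]
pipeline = Pipeline(stages=indexer + [assembler, dtc_model])
model_pipeline = pipeline.fit(train_df)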

Sympy .all_coeffs() returned list is not readable by scipy

I have a question about the data type of the result returned by Sympy's Poly.all_coeffs(). I have only started using Sympy recently.
My Sympy transfer function is the following:
Then I run this code:
n,d = fraction(Gs)
num = Poly(n,s)
den = Poly(d,s)
num_c = num.all_coeffs()
den_c = den.all_coeffs()
I get:
Then I run this code:
from scipy import signal
#nu = [5000000.0]
#de = [4.99, 509000.0]
nu = num_c
de = den_c
sys = signal.lti(nu, de)
w,mag,phase = signal.bode(sys)
plt.plot(w/(2*np.pi), mag)
and the result is:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-131-fb960684259c> in <module>
4 nu = num_c
5 de = den_c
----> 6 sys = signal.lti(nu, de)
But if I use those commented-out 'nu' and 'de' plain Python lists instead, the program works. So what is wrong here?
Why did you show just a bit of the error? Why not the full message, maybe even the full traceback?
In [60]: sys = signal.lti(num_c, den_c)
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-60-21f71ecd8884> in <module>
----> 1 sys = signal.lti(num_c, den_c)
/usr/local/lib/python3.6/dist-packages/scipy/signal/ltisys.py in __init__(self, *system, **kwargs)
590 self._den = None
591
--> 592 self.num, self.den = normalize(*system)
593
594 def __repr__(self):
/usr/local/lib/python3.6/dist-packages/scipy/signal/filter_design.py in normalize(b, a)
1609 leading_zeros = 0
1610 for col in num.T:
-> 1611 if np.allclose(col, 0, atol=1e-14):
1612 leading_zeros += 1
1613 else:
<__array_function__ internals> in allclose(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in allclose(a, b, rtol, atol, equal_nan)
2169
2170 """
-> 2171 res = all(isclose(a, b, rtol=rtol, atol=atol, equal_nan=equal_nan))
2172 return bool(res)
2173
<__array_function__ internals> in isclose(*args, **kwargs)
/usr/local/lib/python3.6/dist-packages/numpy/core/numeric.py in isclose(a, b, rtol, atol, equal_nan)
2267 y = array(y, dtype=dt, copy=False, subok=True)
2268
-> 2269 xfin = isfinite(x)
2270 yfin = isfinite(y)
2271 if all(xfin) and all(yfin):
TypeError: ufunc 'isfinite' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
Now look at the elements of the num_c list (same for den_c):
In [55]: num_c[0]
Out[55]: 500000.000000000
In [56]: type(_)
Out[56]: sympy.core.numbers.Float
The scipy code is doing numpy testing on the inputs, so it first turns the lists into arrays:
In [61]: np.array(num_c)
Out[61]: array([500000.000000000], dtype=object)
This array contains sympy object(s). numpy can't cast that to a float under the 'safe' casting rule, but an explicit astype uses 'unsafe' casting as the default:
In [63]: np.array(num_c).astype(float)
Out[63]: array([500000.])
So let's convert both lists into valid numpy float arrays:
In [64]: sys = signal.lti(np.array(num_c).astype(float), np.array(den_c).astype(float))
In [65]: sys
Out[65]:
TransferFunctionContinuous(
array([100200.4008016]),
array([1.00000000e+00, 1.02004008e+05]),
dt: None
)
Conversion in a list comprehension also works:
sys = signal.lti([float(i) for i in num_c],[float(i) for i in den_c])
You likely need to convert the sympy objects to floats / lists of floats.
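Putting that together, here is a small end-to-end sketch; the transfer function is reconstructed from the coefficients shown above (500000 over 4.99*s + 509000), since the original expression is not included in the question:
from sympy import symbols, fraction, Poly
from scipy import signal

s = symbols('s')
Gs = 500000 / (4.99*s + 509000)   # stand-in reconstructed from the coefficients above

n, d = fraction(Gs)
# convert the sympy numbers to plain Python floats before handing them to scipy
num_c = [float(c) for c in Poly(n, s).all_coeffs()]
den_c = [float(c) for c in Poly(d, s).all_coeffs()]

sys = signal.lti(num_c, den_c)    # no 'isfinite' TypeError with float inputs
w, mag, phase = signal.bode(sys)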

Databricks UDF calling an external web service cannot be serialised (PicklingError)

I am using Databricks and have a column in a dataframe that I need to update for every record with an external web service call. In this case it is using the Azure Machine Learning Service SDK and makes a service call. This code works fine when not run as a UDF in Spark (i.e. just Python), but it throws a serialization error when I try to call it as a UDF. The same happens if I use a lambda and a map with an RDD.
The model uses fastText and can be invoked fine from Postman or Python via a normal HTTP call, or using the Webservice SDK from AMLS - it's just when it is a UDF that it fails with this message:
TypeError: can't pickle _thread._local objects
The only workaround I can think of is to loop through each record in the dataframe sequentially and update it with a call, but this is not very efficient. I don't know if this is a Spark error or whether it is because the service is loading a fastText model. When I use the UDF and mock a return value, it works, though.
Error at bottom...
from azureml.core.webservice import Webservice, AciWebservice
from azureml.core import Workspace
def predictModelValue2(summary, modelName, modelLabel):
    raw_data = '[{"label": "' + modelLabel + '", "model": "' + modelName + '", "as_full_account": "' + summary + '"}]'
    prediction = service.run(raw_data)
    return prediction

from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf

predictModelValueUDF = udf(predictModelValue2)
DVIRCRAMFItemsDFScored1 = DVIRCRAMFItemsDF.withColumn("Result", predictModelValueUDF("Summary", "ModelName", "ModelLabel"))
TypeError: can't pickle _thread._local objects
During handling of the above exception, another exception occurred:
PicklingError                             Traceback (most recent call last)
in
----> 2 x = df.withColumn("Result", predictModelValueUDF("Summary", "ModelName", "ModelLabel"))
/databricks/spark/python/pyspark/sql/udf.py in wrapper(*args)
    194         @functools.wraps(self.func, assigned=assignments)
    195         def wrapper(*args):
--> 196             return self(*args)
    197
    198         wrapper.__name__ = self._name
/databricks/spark/python/pyspark/sql/udf.py in __call__(self, *cols)
    172
    173     def __call__(self, *cols):
--> 174         judf = self._judf
    175         sc = SparkContext._active_spark_context
    176         return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
/databricks/spark/python/pyspark/sql/udf.py in _judf(self)
    156         # and should have a minimal performance impact.
    157         if self._judf_placeholder is None:
--> 158             self._judf_placeholder = self._create_judf()
    159         return self._judf_placeholder
    160
/databricks/spark/python/pyspark/sql/udf.py in _create_judf(self)
    165         sc = spark.sparkContext
    166
--> 167         wrapped_func = _wrap_function(sc, self.func, self.returnType)
    168         jdt = spark._jsparkSession.parseDataType(self.returnType.json())
    169         judf = sc._jvm.org.apache.spark.sql.execution.python.UserDefinedPythonFunction(
/databricks/spark/python/pyspark/sql/udf.py in _wrap_function(sc, func, returnType)
     33 def _wrap_function(sc, func, returnType):
     34     command = (func, returnType)
---> 35     pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
     36     return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
     37                                   sc.pythonVer, broadcast_vars, sc._javaAccumulator)
/databricks/spark/python/pyspark/rdd.py in _prepare_for_python_RDD(sc, command)
   2461     # the serialized command will be compressed by broadcast
   2462     ser = CloudPickleSerializer()
-> 2463     pickled_command = ser.dumps(command)
   2464     if len(pickled_command) > sc._jvm.PythonUtils.getBroadcastThreshold(sc._jsc):  # Default 1M
   2465         # The broadcast will have same life cycle as created PythonRDD
/databricks/spark/python/pyspark/serializers.py in dumps(self, obj)
    709             msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
    710             cloudpickle.print_exec(sys.stderr)
--> 711             raise pickle.PicklingError(msg)
    712
    713
PicklingError: Could not serialize object: TypeError: can't pickle _thread._local objects
I am not an expert in Databricks or Spark, but pickling functions from the local notebook context is always problematic when you are touching complex objects like the service object. In this particular case, I would recommend removing the dependency on the AzureML service object and just using requests to call the service.
Pull the key from the service:
# retrieve the API keys. two keys were generated.
key1, key2 = service.get_keys()
scoring_uri = service.scoring_uri
You should be able to use these strings in the UDF directly without pickling issues -- here is an example of how you would call the service with just requests, applied to your UDF:
import requests, json

def predictModelValue2(summary, modelName, modelLabel):
    input_data = json.dumps({"summary": summary, "modelName": modelName, ...})
    headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + key1}
    # call the service for scoring
    resp = requests.post(scoring_uri, input_data, headers=headers)
    return resp.text[1]
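Wiring that function up is then the same as in your original code (a sketch reusing your dataframe and column names), since it now only closes over plain strings:
from pyspark.sql.functions import udf

predictModelValueUDF = udf(predictModelValue2)
DVIRCRAMFItemsDFScored1 = DVIRCRAMFItemsDF.withColumn(
    "Result", predictModelValueUDF("Summary", "ModelName", "ModelLabel"))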
On a side note, though: your UDF will be called for each row in your dataframe, and each time it will make a network call -- that will be very slow. I would recommend looking for ways to batch the execution. As you can see from your constructed JSON, service.run will accept an array of items, so you should call it in batches of 100 or so (see the sketch below).
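Here is one hedged sketch of what that batching could look like, using a Spark 3-style pandas UDF so that each call scores a whole batch of rows with a single HTTP request. key1, scoring_uri, and the JSON field names are carried over from the code above; the assumption that the service returns a JSON array with one prediction per input row is mine, so adjust the response parsing to your service's actual contract.
import json
import requests
import pandas as pd
from pyspark.sql.functions import pandas_udf
from pyspark.sql.types import StringType

@pandas_udf(StringType())
def predictModelValueBatchUDF(summary: pd.Series, modelName: pd.Series, modelLabel: pd.Series) -> pd.Series:
    # build one JSON array for the whole batch of rows handed to this call
    payload = json.dumps([
        {"label": l, "model": m, "as_full_account": s}
        for s, m, l in zip(summary, modelName, modelLabel)
    ])
    headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + key1}
    resp = requests.post(scoring_uri, payload, headers=headers)
    # assumed response shape: a JSON array with one prediction per input row
    return pd.Series(json.loads(resp.text)).astype(str)

DVIRCRAMFItemsDFScored1 = DVIRCRAMFItemsDF.withColumn(
    "Result", predictModelValueBatchUDF("Summary", "ModelName", "ModelLabel"))
The number of rows handed to each call can be capped with the spark.sql.execution.arrow.maxRecordsPerBatch configuration if the service prefers smaller payloads.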

Multiclass Classification Evaluator in PySpark

from pyspark.ml.classification import MultilayerPerceptronClassifier
inputneurons = len(pipe_df.columns)
nn = MultilayerPerceptronClassifier(layers = [inputneurons,20,2])
nn_model = nn.fit(train_data)
results = nn_model.transform(test_data)
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
evaluator = MulticlassClassificationEvaluator()
mlp_accuracy = evaluator.evaluate(results)
and when I run it, it shows these errors:
---------------------------------------------------------------------------
Py4JJavaError Traceback (most recent call last)
in ()
23 evaluator = MulticlassClassificationEvaluator()
24
---> 25 mlp_accuracy = evaluator.evaluate(results)
26
27
I tried BinaryClassificationEvaluator as well, but it doesn't work either.
Does anyone know what's wrong here? I am new to PySpark...