Databricks UDF calling an external web service cannot be serialised (PicklingError) - pyspark

I am using Databricks and have a column in a dataframe that I need to update for every record with an external web service call. In this case the call uses the Azure Machine Learning Service SDK to invoke a deployed service. This code works fine when not run as a UDF in Spark (i.e. as plain Python), but it throws a serialization error when I try to call it as a UDF. The same happens if I use a lambda and a map with an RDD.
The model uses fastText and can be invoked fine from Postman or Python via a normal HTTP call, or using the Webservice SDK from AMLS - it's only when it is a UDF that it fails with this message:
TypeError: can't pickle _thread._local objects
The only workaround I can think of is to loop through each record in the dataframe sequentially and update it with a call, but that is not very efficient. I don't know whether this is a Spark error or whether it happens because the service is loading a fastText model. When I use the UDF and mock a return value, it works though.
Error at bottom...
from azureml.core.webservice import Webservice, AciWebservice
from azureml.core import Workspace

def predictModelValue2(summary, modelName, modelLabel):
    # `service` is the deployed AMLS web service object obtained earlier (not shown)
    raw_data = '[{"label": "' + modelLabel + '", "model": "' + modelName + '", "as_full_account": "' + summary + '"}]'
    prediction = service.run(raw_data)
    return prediction

from pyspark.sql.types import FloatType
from pyspark.sql.functions import udf

predictModelValueUDF = udf(predictModelValue2)

DVIRCRAMFItemsDFScored1 = DVIRCRAMFItemsDF.withColumn("Result", predictModelValueUDF("Summary", "ModelName", "ModelLabel"))
TypeError: can't pickle _thread._local objects
During handling of the above exception, another exception occurred:
PicklingError Traceback (most recent call last)
in
----> 2 x = df.withColumn("Result", predictModelValueUDF("Summary", "ModelName", "ModelLabel"))
/databricks/spark/python/pyspark/sql/udf.py in wrapper(*args)
194 @functools.wraps(self.func, assigned=assignments)
195 def wrapper(*args):
--> 196 return self(*args)
197
198 wrapper.__name__ = self._name
/databricks/spark/python/pyspark/sql/udf.py in __call__(self, *cols)
172
173 def __call__(self, *cols):
--> 174 judf = self._judf
175 sc = SparkContext._active_spark_context
176 return Column(judf.apply(_to_seq(sc, cols, _to_java_column)))
/databricks/spark/python/pyspark/sql/udf.py in _judf(self)
156 # and should have a minimal performance impact.
157 if self._judf_placeholder is None:
--> 158 self._judf_placeholder = self._create_judf()
159 return self._judf_placeholder
160
/databricks/spark/python/pyspark/sql/udf.py in _create_judf(self)
165 sc = spark.sparkContext
166
--> 167 wrapped_func = _wrap_function(sc, self.func, self.returnType)
168 jdt = spark._jsparkSession.parseDataType(self.returnType.json())
169 judf = sc._jvm.org.apache.spark.sql.execution.python.UserDefinedPythonFunction(
/databricks/spark/python/pyspark/sql/udf.py in _wrap_function(sc, func, returnType)
33 def _wrap_function(sc, func, returnType):
34 command = (func, returnType)
---> 35 pickled_command, broadcast_vars, env, includes = _prepare_for_python_RDD(sc, command)
36 return sc._jvm.PythonFunction(bytearray(pickled_command), env, includes, sc.pythonExec,
37 sc.pythonVer, broadcast_vars, sc._javaAccumulator)
/databricks/spark/python/pyspark/rdd.py in _prepare_for_python_RDD(sc, command)
2461 # the serialized command will be compressed by broadcast
2462 ser = CloudPickleSerializer()
--> 2463 pickled_command = ser.dumps(command)
2464 if len(pickled_command) > sc._jvm.PythonUtils.getBroadcastThreshold(sc._jsc): # Default 1M
2465 # The broadcast will have same life cycle as created PythonRDD
/databricks/spark/python/pyspark/serializers.py in dumps(self, obj)
709 msg = "Could not serialize object: %s: %s" % (e.__class__.__name__, emsg)
710 cloudpickle.print_exec(sys.stderr)
--> 711 raise pickle.PicklingError(msg)
712
713
PicklingError: Could not serialize object: TypeError: can't pickle _thread._local objects

I am not an expert in Databricks or Spark, but pickling functions from the local notebook context is always problematic when you are touching complex objects like the service object. In this particular case, I would recommend removing the dependency on the azureML service object and just using requests to call the service.
Pull the key from the service:
# retrieve the API keys. two keys were generated.
key1, key2 = service.get_keys()
scoring_uri = service.scoring_uri
You should be able to use these strings in the UDF directly without pickling issues -- here is an example of how you would call the service with just requests, applied to your UDF:
import requests, json

def predictModelValue2(summary, modelName, modelLabel):
    # same payload shape as in your question
    input_data = json.dumps([{"label": modelLabel, "model": modelName, "as_full_account": summary}])
    headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + key1}
    # call the service for scoring
    resp = requests.post(scoring_uri, data=input_data, headers=headers)
    return resp.text
On a side note, though: your UDF will be called for each row in your data frame, and each time it will make a network call -- that will be very slow. I would recommend looking for ways to batch the execution. As you can see from the JSON you construct, service.run will accept an array of items, so you should call it in batches of 100 or so.
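One sketch of how that batching could look (my code, not tested against your service) is a scalar pandas UDF: Spark hands it whole Arrow batches of rows, so each batch becomes a single HTTP call instead of one call per row, and the batch size can be tuned with spark.sql.execution.arrow.maxRecordsPerBatch. It assumes the scoring_uri and key1 strings from above and the payload field names from your question:

import json
import requests
import pandas as pd
from pyspark.sql.functions import pandas_udf, PandasUDFType
from pyspark.sql.types import StringType

# Rough sketch only: one POST per Arrow batch rather than per row.
# `scoring_uri` and `key1` are the plain strings retrieved from the service above.
@pandas_udf(StringType(), PandasUDFType.SCALAR)
def predictBatch(summary, modelName, modelLabel):
    headers = {'Content-Type': 'application/json', 'Authorization': 'Bearer ' + key1}
    payload = json.dumps([
        {"label": lbl, "model": mdl, "as_full_account": smry}
        for smry, mdl, lbl in zip(summary, modelName, modelLabel)
    ])
    resp = requests.post(scoring_uri, data=payload, headers=headers)
    # assumes the service returns a JSON array with one prediction per input row
    return pd.Series([str(p) for p in json.loads(resp.text)])

DVIRCRAMFItemsDFScored1 = DVIRCRAMFItemsDF.withColumn(
    "Result", predictBatch("Summary", "ModelName", "ModelLabel"))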

Related

Is there a way to use spark MLLib CrossValidator without parameter grid?

I want to use cross-validation instead of the normal validation-set approach simply as a means of getting a better estimate of the test error rate. I am using the Spark MLlib DataFrame-based API. However, if I run the following code -
cv = tuning.CrossValidator(estimator=randomForestRegressor, evaluator=evaluator, numFolds=5)
cv_model = cv.fit(vsdf)
I get the error -
KeyError Traceback (most recent call last)
<ipython-input-44-d4e7a9d3602e> in <module>
----> 1 cv_model = cv.fit(vsdf)
C:\Spark\spark-3.1.2-bin-hadoop3.2\python\pyspark\ml\base.py in fit(self, dataset, params)
159 return self.copy(params)._fit(dataset)
160 else:
--> 161 return self._fit(dataset)
162 else:
163 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
C:\Spark\spark-3.1.2-bin-hadoop3.2\python\pyspark\ml\tuning.py in _fit(self, dataset)
667 def _fit(self, dataset):
668 est = self.getOrDefault(self.estimator)
--> 669 epm = self.getOrDefault(self.estimatorParamMaps)
670 numModels = len(epm)
671 eva = self.getOrDefault(self.evaluator)
C:\Spark\spark-3.1.2-bin-hadoop3.2\python\pyspark\ml\param\__init__.py in getOrDefault(self, param)
344 return self._paramMap[param]
345 else:
--> 346 return self._defaultParamMap[param]
347
348 def extractParamMap(self, extra=None):
KeyError: Param(parent='CrossValidator_a9121a59fda3', name='estimatorParamMaps', doc='estimator param maps')
I guess this is because I have not provided any parameter-map to search over. Is there no way to do cross-validation in spark-ml without a parameter grid?
Yes, you can do it. You need to pass an empty parameter grid, though.
Something like this should work; it will behave like a normal k-fold cross-validator.
params_grid = ParamGridBuilder().build()
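For completeness, a minimal sketch of how that empty grid plugs into the cross-validator from the question (reusing the randomForestRegressor, evaluator and vsdf names defined there):

from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

# An empty grid is a list with a single empty param map, so estimatorParamMaps
# has a length and CrossValidator just does plain k-fold evaluation.
params_grid = ParamGridBuilder().build()

cv = CrossValidator(estimator=randomForestRegressor,
                    estimatorParamMaps=params_grid,
                    evaluator=evaluator,
                    numFolds=5)
cv_model = cv.fit(vsdf)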

Not able to pass StringIndexer as list to the model pipeline stage

PySpark pipelines are quite new to me. I am trying to create the stages in a pipeline by passing the list below:
pipeline = Pipeline().setStages([indexer,assembler,dtc_model])
where I am applying feature indexing on multiple columns:
cat_col = ['Gender','Habit','Mode']
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(training_data_0) for column in cat_col ]
On running fit on the pipeline I get the error below:
model_pipeline = pipeline.fit(train_df)
How can we pass the list to the stages, or is there any workaround or better way to do this?
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-3999694668013877> in <module>
----> 1 model_pipeline = pipeline.fit(train_df)
/databricks/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
130 return self.copy(params)._fit(dataset)
131 else:
--> 132 return self._fit(dataset)
133 else:
134 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
/databricks/spark/python/pyspark/ml/pipeline.py in _fit(self, dataset)
95 if not (isinstance(stage, Estimator) or isinstance(stage, Transformer)):
96 raise TypeError(
---> 97 "Cannot recognize a pipeline stage of type %s." % type(stage))
98 indexOfLastEstimator = -1
99 for i, stage in enumerate(stages):
TypeError: Cannot recognize a pipeline stage of type <class 'list'>.
Try the below:
cat_col = ['Gender','Habit','Mode']
indexer = [StringIndexer(inputCol=column, outputCol=column+"_index").fit(training_data_0) for column in cat_col ]
assembler = VectorAssembler...
dtc_model = DecisionTreeClassifier...
# Create pipeline using transformers and estimators
stages = indexer
stages.append(assembler)
stages.append(dtc_model)
pipeline = Pipeline().setStages(stages)
model_pipeline = pipeline.fit(train_df)
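Equivalently (just a stylistic variation), the stage list can be built in one expression, which avoids appending to the indexer list in place:

pipeline = Pipeline(stages=indexer + [assembler, dtc_model])
model_pipeline = pipeline.fit(train_df)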

Using pretrained models from sparknlp on Databricks

I am trying to follow the official examples from John Snow Labs, but every time I get a TypeError: 'JavaPackage' object is not callable error. I followed all of the steps in the Databricks install documentation, but no matter which of the two walkthroughs I try, it fails.
An example of the first (after doing the installs):
import sparknlp
from sparknlp.pretrained import *
pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
recognize_entities_dl download started this may take some time.
TypeError: 'JavaPackage' object is not callable
TypeError Traceback (most recent call last)
<command-937510457011238> in <module>
----> 1 pipeline = PretrainedPipeline('recognize_entities_dl', 'en')
2
3 # ner_bert = NerDLModel.pretrained('ner_dl_bert')
4
5 # pipeline = PretrainedPipeline('recognize_entities_dl', 'en', 'https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_dl_bert_en_2.4.3_2.4_1584624951079.zip')
/databricks/python/lib/python3.7/site-packages/sparknlp/pretrained.py in __init__(self, name, lang, remote_loc, parse_embeddings, disk_location)
89 def __init__(self, name, lang='en', remote_loc=None, parse_embeddings=False, disk_location=None):
90 if not disk_location:
---> 91 self.model = ResourceDownloader().downloadPipeline(name, lang, remote_loc)
92 else:
93 self.model = PipelineModel.load(disk_location)
/databricks/python/lib/python3.7/site-packages/sparknlp/pretrained.py in downloadPipeline(name, language, remote_loc)
49 def downloadPipeline(name, language, remote_loc=None):
50 print(name + " download started this may take some time.")
---> 51 file_size = _internal._GetResourceSize(name, language, remote_loc).apply()
52 if file_size == "-1":
53 print("Can not find the model to download please check the name!")
/databricks/python/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, name, language, remote_loc)
190 def __init__(self, name, language, remote_loc):
191 super(_GetResourceSize, self).__init__(
--> 192 "com.johnsnowlabs.nlp.pretrained.PythonResourceDownloader.getDownloadSize", name, language, remote_loc)
193
194
/databricks/python/lib/python3.7/site-packages/sparknlp/internal.py in __init__(self, java_obj, *args)
127 super(ExtendedJavaWrapper, self).__init__(java_obj)
128 self.sc = SparkContext._active_spark_context
--> 129 self._java_obj = self.new_java_obj(java_obj, *args)
130 self.java_obj = self._java_obj
131
/databricks/python/lib/python3.7/site-packages/sparknlp/internal.py in new_java_obj(self, java_class, *args)
137
138 def new_java_obj(self, java_class, *args):
--> 139 return self._new_java_obj(java_class, *args)
140
141 def new_java_array(self, pylist, java_class):
/databricks/spark/python/pyspark/ml/wrapper.py in _new_java_obj(java_class, *args)
65 java_obj = getattr(java_obj, name)
66 java_args = [_py2java(sc, arg) for arg in args]
---> 67 return java_obj(*java_args)
68
69 @staticmethod
TypeError: 'JavaPackage' object is not callable
I get a similar, if not the exact same, error if I try:
pipeline = PretrainedPipeline('recognize_entities_dl', 'en', 'https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/models/ner_dl_bert_en_2.4.3_2.4_1584624951079.zip')
I also get the same error for the second example. The Databricks Runtime Version is: 6.5 (includes Apache Spark 2.4.5, Scala 2.11), which is on the list of approved runtimes.
I'm not sure what the error messages mean or how to resolve them.
I found out that 'JavaPackage' object is not callable is caused by the spark-nlp assembly jars being missing. So I made sure that these jars were downloaded and then placed on BOTH the executors and the driver, e.g.
when building the Spark Docker image, do something like
RUN cd /opt/spark/jars && \
wget https://s3.amazonaws.com/auxdata.johnsnowlabs.com/public/spark-nlp-assembly-2.6.4.jar
and also on the driver image/machine make sure the jar exists in the local directory. Then set
conf.set("spark.driver.extraClassPath", "/opt/spark/jars/spark-nlp-assembly-2.6.4.jar")
conf.set("spark.executor.extraClassPath", "/opt/spark/jars/spark-nlp-assembly-2.6.4.jar")
The solution for Databricks might be a bit different, so instead of baking in the jars you may need to host them on S3 and refer to them that way.
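Outside Databricks (e.g. the Docker setup above), a minimal sketch of wiring those two settings into a session, assuming the assembly jar really is at that path on both the driver and every executor:

from pyspark import SparkConf
from pyspark.sql import SparkSession

# Assumes spark-nlp-assembly-2.6.4.jar has already been downloaded to this path
# on the driver machine and on every executor image (see the wget step above).
jar_path = "/opt/spark/jars/spark-nlp-assembly-2.6.4.jar"

conf = SparkConf()
conf.set("spark.driver.extraClassPath", jar_path)
conf.set("spark.executor.extraClassPath", jar_path)

# extraClassPath is only picked up when the JVM starts, so set it before
# the first SparkSession is created.
spark = SparkSession.builder.config(conf=conf).getOrCreate()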

Error with show variable in data viewer for jupyter notebook

In the recent VS Code release, they added a feature to view the active variables in a Jupyter Notebook and also to view the values in a variable with Data Viewer.
However, every time I try to view the values in Data Viewer, VS Code throws the error below. It says the reason is that the object is of data type int64 and not string, but I am sure that should not be a reason not to show the variable. Is anyone facing similar issues? I tried with a simple data frame and it's working fine.
Error: Failure during variable extraction:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-16-eae5f1f55b35> in <module>
97
98 # Transform this back into a string
---> 99 print(_VSCODE_json.dumps(_VSCODE_targetVariable))
100 del _VSCODE_targetVariable
101
~/anaconda3/lib/python3.7/json/__init__.py in dumps(obj, skipkeys, ensure_ascii, check_circular, allow_nan, cls, indent, separators, default, sort_keys, **kw)
229 cls is None and indent is None and separators is None and
230 default is None and not sort_keys and not kw):
--> 231 return _default_encoder.encode(obj)
232 if cls is None:
233 cls = JSONEncoder
~/anaconda3/lib/python3.7/json/encoder.py in encode(self, o)
197 # exceptions aren't as detailed. The list call should be roughly
198 # equivalent to the PySequence_Fast that ''.join() would do.
--> 199 chunks = self.iterencode(o, _one_shot=True)
200 if not isinstance(chunks, (list, tuple)):
201 chunks = list(chunks)
~/anaconda3/lib/python3.7/json/encoder.py in iterencode(self, o, _one_shot)
255 self.key_separator, self.item_separator, self.sort_keys,
256 self.skipkeys, _one_shot)
--> 257 return _iterencode(o, 0)
258
259 def _make_iterencode(markers, _default, _encoder, _indent, _floatstr,
~/anaconda3/lib/python3.7/json/encoder.py in default(self, o)
177
178 """
--> 179 raise TypeError(f'Object of type {o.__class__.__name__} '
180 f'is not JSON serializable')
181
TypeError: Object of type int64 is not JSON serializable
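The traceback boils down to the fact that the standard library json module cannot encode NumPy scalar types such as int64; a minimal reproduction (my own sketch, unrelated to the notebook above):

import json
import numpy as np

json.dumps({"value": 5})                 # works: plain Python int
# json.dumps({"value": np.int64(5)})     # raises TypeError: Object of type int64 is not JSON serializable
json.dumps({"value": int(np.int64(5))})  # casting to a native int first works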

PySpark TypeError: object of type 'ParamGridBuilder' has no len()

I am trying to tune my model on Databricks using PySpark.
I receive the following error:
TypeError: object of type 'ParamGridBuilder' has no len()
My code has been listed below.
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator
als = ALS(userCol = "userId",itemCol="movieId", ratingCol="rating", coldStartStrategy="drop", nonnegative = True, implicitPrefs = False)
# Imports ParamGridBuilder package
from pyspark.ml.tuning import ParamGridBuilder
# Creates a ParamGridBuilder, and adds hyperparameters
param_grid = ParamGridBuilder().addGrid(als.rank, [5,10,20,40]).addGrid(als.maxIter, [5,10,15,20]).addGrid(als.regParam,[0.01,0.001,0.0001,0.02])
evaluator = RegressionEvaluator(metricName="rmse", labelCol="rating",predictionCol="prediction")
# Imports CrossValidator package
from pyspark.ml.tuning import CrossValidator
# Creates cross validator and tells Spark what to use when training and evaluates
cv = CrossValidator(estimator = als,
                    estimatorParamMaps = param_grid,
                    evaluator = evaluator,
                    numFolds = 5)
model = cv.fit(training)
TypeError: object of type 'ParamGridBuilder' has no len()
Full Error Log:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-1952169986445972> in <module>()
----> 1 model = cv.fit(training)
2
3 # Extract best combination of values from cross validation
4
5 best_model = model.bestModel
/databricks/spark/python/pyspark/ml/base.py in fit(self, dataset, params)
130 return self.copy(params)._fit(dataset)
131 else:
--> 132 return self._fit(dataset)
133 else:
134 raise ValueError("Params must be either a param map or a list/tuple of param maps, "
/databricks/spark/python/pyspark/ml/tuning.py in _fit(self, dataset)
279 est = self.getOrDefault(self.estimator)
280 epm = self.getOrDefault(self.estimatorParamMaps)
--> 281 numModels = len(epm)
It simply means that your object does not have a length (unlike lists). Thus, in your lines
param_grid = ParamGridBuilder()
.addGrid(als.rank, [5,10,20,40])
.addGrid(als.maxIter, [5,10,15,20])
.addGrid(als.regParam, [0.01,0.001,0.0001,0.02])
you should add .build() at the end to actually construct the grid.
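For reference, a corrected version of the grid from the question (same values, with .build() appended so the result is a list of param maps, which does have a len()):

from pyspark.ml.tuning import ParamGridBuilder

param_grid = (ParamGridBuilder()
              .addGrid(als.rank, [5, 10, 20, 40])
              .addGrid(als.maxIter, [5, 10, 15, 20])
              .addGrid(als.regParam, [0.01, 0.001, 0.0001, 0.02])
              .build())  # .build() produces the list CrossValidator can iterate over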