Run bootstrap in parallel - PySpark

I have a function that takes two Spark DataFrames and some other arguments and returns a scalar value.
I would like to run this whole process n times, bootstrapping each time (to fill missing values in the DataFrames above), and return the outputs as n rows.
I tried the following on a simple problem:
def sum_fn(a, b):
    return a + b
rdd = spark.sparkContext.parallelize(list(range(1, 9+1)))
df = rdd.map(lambda x: (x, sum_fn(1, x))).toDF()
display(df)
This works fine. However, when I use my own function, which takes Spark DataFrames (sdf) as input, instead of sum_fn,
I get an error:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object
Could someone please suggest how I could achieve this?
Thanks
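A likely explanation for the traceback above: Spark DataFrames live on the driver, and a function passed to rdd.map has to be pickled and shipped to the executors. A closure that references a DataFrame drags along objects that hold thread locks, which cannot be pickled. The sketch below reproduces that failure without Spark and then shows one common workaround, running the n bootstrap iterations from the driver with a thread pool. The names holds_lock and run_once are hypothetical stand-ins, not part of the original code; run_once only mimics "resample, then return a scalar".

```python
import pickle
import random
import threading
from concurrent.futures import ThreadPoolExecutor

# 1) Reproduce the serialization failure without Spark. The dict is a stand-in
#    (assumption) for any object that holds a thread lock internally.
holds_lock = {"lock": threading.RLock()}
try:
    pickle.dumps(holds_lock)
    pickle_error = None
except TypeError as exc:
    pickle_error = str(exc)
print(pickle_error)


# 2) Hypothetical stand-in for the real function: it resamples a small list
#    with replacement and returns a scalar. The real version would take the
#    two DataFrames and be called on the driver, never inside rdd.map.
def run_once(seed, values=(1.0, 2.0, 4.0, 8.0)):
    rng = random.Random(seed)
    sample = [rng.choice(values) for _ in values]
    return sum(sample) / len(sample)


n = 5
with ThreadPoolExecutor(max_workers=4) as pool:
    # pool.map preserves input order, so results[i] belongs to seed i
    results = list(pool.map(run_once, range(n)))
print(results)
```

Each thread submits its own Spark jobs from the driver, so the cluster still does the heavy lifting. The list of n scalars can then be turned into a DataFrame with spark.createDataFrame if a tabular result is needed.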

Related

How can we fix this AttributeError: 'bytes' object has no attribute 'map'?

I'm trying to run this code on an AWS EMR instance to extract features from images with a transfer learning model, but I get this error: AttributeError: 'bytes' object has no attribute 'map'
input = np.stack(content_series.map(preprocess))
preds = model.predict(input)
Here is the whole code:
def preprocess(content):
    img = Image.open(io.BytesIO(content)).resize([224, 224])
    arr = img_to_array(img)
    return preprocess_input(arr)
def featurize_series(model, content_series):
    """
    Featurize a pd.Series of raw images using the input model.
    :return: a pd.Series of image features
    """
    input = np.stack(content_series.map(preprocess))
    preds = model.predict(input)
    output = [p.flatten() for p in preds]
    return pd.Series(output)
@pandas_udf('array<float>', PandasUDFType.SCALAR_ITER)
def featurize_udf(content_series_iter):
    '''
    This method is a Scalar Iterator pandas UDF wrapping our featurization function.
    The decorator specifies that this returns a Spark DataFrame column of type ArrayType(FloatType).
    :param content_series_iter: This argument is an iterator over batches of data, where each batch
                                is a pandas Series of image data.
    '''
    # With Scalar Iterator pandas UDFs, we can load the model once and then re-use it
    # for multiple data batches. This amortizes the overhead of loading big models.
    model = model_fn()
    for content_series in content_series_iter:
        yield featurize_series(model, content_series)
Do you have any suggestions? Thanks.
I tried changing the instance configuration.
The error occurs because content_series.map applies the preprocess function to each element of content_series, which is assumed to be a pandas Series. The error message, however, shows that content_series is a 'bytes' object, which has no 'map' attribute.
The content_series_iter argument of featurize_udf is meant to be an iterator over batches of image data, where each batch is a pandas Series. To resolve the error, modify featurize_udf so that any 'bytes' object it receives is converted to a pandas Series before the preprocess function is applied:
@pandas_udf('array<float>', PandasUDFType.SCALAR_ITER)
def featurize_udf(content_series_iter):
    model = model_fn()
    for content_series in content_series_iter:
        # A bare bytes object has neither .map() nor .tolist(), so wrap it in a one-element Series
        if isinstance(content_series, bytes):
            content_series = pd.Series([content_series])
        yield featurize_series(model, content_series)
This should resolve the AttributeError and allow you to run the code successfully on the AWS EMR instance.

Force Polars to store Python dictionary as pl.Object

I am running into the dreaded "struct orders must remain the same" error with a dictionary nested in my record. For my use case, I don't need a Polars struct; I am fine with storing the original dictionary as a pl.Object. However, my attempt to do so is not working:
df = pl.DataFrame(
    data=c.fetchall(),
    columns=[
        ("primary_key", pl.Int32),
        ("string_col", pl.Object),
        ("dict_col", pl.Object),
    ],
)
However, this is still raising an exception:
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: ComputeError(Borrowed("struct orders must remain the same"))', /Users/runner/work/polars/polars/polars/polars-core/src/frame/row.rs:457:86
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
File:
df = pl.DataFrame(
File "venv/lib/python3.9/site-packages/polars/internals/dataframe/frame.py", line 299, in __init__
self._df = sequence_to_pydf(
File "venv/lib/python3.9/site-packages/polars/internals/construction.py", line 612, in sequence_to_pydf
pydf = PyDataFrame.read_rows(data, infer_schema_length)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(Borrowed("struct orders must remain the same"))
This is with polars 0.14.18

PySpark: Count every element in flatmap

I am having trouble counting every element in a list that I have created in PySpark.
Here is what I am working with:
test2 = words.filter(lambda line: re.match(r'^[AEIOU]', line)).take(10)
test2
[u'EBook', u'Author:', u'English', u'OF', u'EBOOK', u'Inc.,', u'Etext', u'Inc.,', u'Etexts', u'Etext']
Now I want to confirm that the count of test2 is 10. But every time I use test2.count(), it gives me an error:
Traceback (most recent call last):
File "", line 1, in
TypeError: count() takes exactly one argument (0 given)
Can someone help me learn how to count the elements properly?
Thank you!
test2 is a list, so you should be doing len(test2) to find the number of elements. The function count(), when called on a list, will return the number of occurrences of whatever you pass as a parameter.
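To make the difference concrete, here is a small sketch using the list from the question (the u'' prefixes are dropped, which does not change the result in Python 3). The variable names total and etext_hits are only illustrative.

```python
test2 = ['EBook', 'Author:', 'English', 'OF', 'EBOOK',
         'Inc.,', 'Etext', 'Inc.,', 'Etexts', 'Etext']

total = len(test2)                 # number of elements in the list
etext_hits = test2.count('Etext')  # occurrences of one specific value

print(total)
print(etext_hits)
```

If the goal is to count every matching element rather than just the first ten, call count() on the filtered RDD itself, before take(10) turns it into a plain Python list.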

Getting AttributeError: 'OneHotEncoder' object has no attribute '_jdf' in PySpark

I am trying to implement a gradient boosting algorithm on a Kaggle dataset in PySpark for learning purposes. I am getting the error below:
Traceback (most recent call last):
File "C:/SparkCourse/Gradientboost.py", line 29, in <module>
output=assembler.transform(data)
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\base.py", line 105, in transform
File "C:\spark\python\lib\pyspark.zip\pyspark\ml\wrapper.py", line 281, in _transform
AttributeError: 'OneHotEncoder' object has no attribute '_jdf'
The corresponding code is:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer,VectorIndexer,OneHotEncoder,VectorAssembler
spark=SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").appName("Gradientboostapp").enableHiveSupport().getOrCreate()
data= spark.read.csv("C:/Users/codemen/Desktop/Timeseries Analytics/liver_patient.csv",header=True, inferSchema=True)
#data.show()
print(data.count())
#data.printSchema()
print("After deleting null values")
data=data.na.drop()
print(data.count())
data=StringIndexer(inputCol="Gender",outputCol="GenderIndex").fit(data)
#let onehot encode the data
data=OneHotEncoder(inputCol="GenderIndex",outputCol="gendervec")
usedfeature=["Age","gendervec","Total_Bilirubin","Direct_Bilirubin","Alkaline_Phosphotase","Alamine_Aminotransferase","Aspartate_Aminotransferase","Total_Protiens","Albumin","Albumin_and_Globulin_Ratio"]
#
assembler=VectorAssembler(inputCols=usedfeature,outputCol="features")
output=assembler.transform(data)
output.select("features","category").show()
I converted the Gender category to numerical form using StringIndexer, then tried to one-hot encode the GenderIndex values. The error occurs when I run VectorAssembler. I may be missing a very simple concept here. Kindly help me figure it out.
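This line of code is incorrect: data=OneHotEncoder(inputCol="GenderIndex",outputCol="gendervec"). You are setting data to the OneHotEncoder() object itself, not transforming the data, so VectorAssembler later receives an encoder where it expects a DataFrame. You need to call transform to encode the data. The StringIndexer line has the same problem: .fit(data) returns a fitted model, not a DataFrame, so it also needs a transform call. It should look like this:
indexer_model = StringIndexer(inputCol="Gender", outputCol="GenderIndex").fit(data)
data = indexer_model.transform(data)
encoder = OneHotEncoder(inputCol="GenderIndex", outputCol="gendervec")
data = encoder.transform(data)
Note that on Spark 3.0 and later OneHotEncoder is an estimator, so there the last line becomes data = encoder.fit(data).transform(data).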

Numba: UntypedAttributeError in class method

I have the following class and method that should convolve an array with a kernel.
import numpy as np
from numpy.fft import fft2 as FFT, ifft2 as IFFT
from PIL import Image
from tqdm import trange, tqdm
from numba import jit
from time import sleep
import _kernel
class convolve(object):
    """ contains methods to convolve two images """
    def __init__(self, image_array, kernel):
        self.array = image_array
        self.kernel = kernel
        self.__rangeX_ = self.array.shape[0]
        self.__rangeY_ = self.array.shape[1]
        self.__rangeKX_ = self.kernel.shape[0]
        self.__rangeKY_ = self.kernel.shape[1]
        if (self.__rangeKX_ >= self.__rangeX_ or \
            self.__rangeKY_ >= self.__rangeY_):
            raise ValueError('Must submit suitable sizes for convolution.')
    @jit(nopython=True)
    def spaceConv(self):
        """ normal convolution, O(N^2*n^2). This is usually too slow """
        # pad array for convolution
        offsetX = self.__rangeKX_ // 2
        offsetY = self.__rangeKY_ // 2
        self.array = np.pad(self.array, \
            [(offsetY, offsetY), (offsetX, offsetX)], \
            mode='constant', constant_values=0)
        # this is the O(N^2) part of this algorithm
        for i in xrange(self.__rangeX_ - 2*offsetX):
            for j in xrange(self.__rangeY_ - 2*offsetY):
                # Now O(n^2) portion
                total = 0.0
                for k in xrange(2*offsetX+1):
                    for t in xrange(2*offsetY+1):
                        total += self.kernel[k][t] * self.array[i+k][j+t]
                self.array[i+offsetX][j+offsetY] = total
        return self.array
As an additional note (in case anyone asks), _kernel just generates specific kernels one may want to convolve the image with (e.g. Gaussian, Moffat, etc.), so it has nothing to do with this class.
When I call the above class on an image and kernel, I get the following error:
Traceback (most recent call last):
File "fftconv.py", line 147, in <module>
plt.imshow(conv.spaceConv(), interpolation='none', cmap='gray')
File "/root/anaconda2/lib/python2.7/site-packages/numba/dispatcher.py", line 304, in _compile_for_args
raise e
numba.errors.UntypedAttributeError: Caused By:
Traceback (most recent call last):
File "/root/anaconda2/lib/python2.7/site-packages/numba/compiler.py", line 249, in run
stage()
File "/root/anaconda2/lib/python2.7/site-packages/numba/compiler.py", line 465, in stage_nopython_frontend
self.locals)
File "/root/anaconda2/lib/python2.7/site-packages/numba/compiler.py", line 789, in type_inference_stage
infer.propagate()
File "/root/anaconda2/lib/python2.7/site-packages/numba/typeinfer.py", line 717, in propagate
raise errors[0]
UntypedAttributeError: Unknown attribute "rangeKX" of type pyobject
File "fftconv.py", line 45
[1] During: typing of get attribute at fftconv.py (45)
Failed at nopython (nopython frontend)
Unknown attribute "rangeKX" of type pyobject
File "fftconv.py", line 45
[1] During: typing of get attribute at fftconv.py (45)
This error may have been caused by the following argument(s):
- argument 0: cannot determine Numba type of value <__main__.convolve object at 0xaff5628c>
Usually I'm pretty good at tracing Python errors back to their cause, but because I'm not familiar with the inner workings of Numba, I'm not sure why it doesn't know what type offsetX is. Any suggestions?
One step performed by numba is type inference. This assigns types to the different values present in the function so that it can be compiled into fast code.
The error means that numba doesn't understand the type of the first argument of the function (self in this case). Numba works best on plain functions whose arguments are scalars or arrays (all numeric). One option is to move the O(n^2) loop into a function of its own, have that function receive the arrays and any other values explicitly, and decorate it with numba.njit (or numba.jit(nopython=True), which is equivalent).
It is also worth trying the code as is, with "nopython=True" removed. If the performance is good enough, leave it alone :). That may happen because numba.jit can detect loops inside the code that can be compiled in "no python" mode and automatically arranges for the loop itself to run at full speed. The explicit "nopython=True" keyword disables that fallback, though.
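A minimal sketch of the first suggestion, assuming the image and kernel are 2-D float arrays. The function name space_conv and the sample data are made up for illustration. The try/except fallback is only there so the sketch still runs where numba is not installed. Like the loop in the question, it does not flip the kernel, and unlike the question's version it writes into a separate output array instead of modifying the padded input in place.

```python
import numpy as np

try:
    from numba import njit
except ImportError:
    # Fallback (assumption): without numba the function runs as plain Python
    def njit(func):
        return func


@njit
def space_conv(array, kernel):
    """Sliding-window sum over a zero-padded array; output has the input's shape."""
    rows, cols = array.shape
    krows, kcols = kernel.shape
    off_r = krows // 2
    off_c = kcols // 2
    # zero-pad by hand so that only simple array operations are compiled
    padded = np.zeros((rows + 2 * off_r, cols + 2 * off_c))
    padded[off_r:off_r + rows, off_c:off_c + cols] = array
    out = np.zeros((rows, cols))
    for i in range(rows):
        for j in range(cols):
            total = 0.0
            for k in range(krows):
                for t in range(kcols):
                    total += kernel[k, t] * padded[i + k, j + t]
            out[i, j] = total
    return out


# Illustrative data only: a 4x4 ramp image and a 3x3 box kernel
image = np.arange(16.0).reshape(4, 4)
kernel = np.ones((3, 3))
result = space_conv(image, kernel)
print(result)
```

The class method can then stay undecorated and simply return space_conv(self.array, self.kernel), so numba only ever sees plain arrays and never has to type self.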