SARIMAX statsmodels weird error in Databricks - pyspark

I'm running a grid search optimization in a Databricks notebook. The same code runs on my local machine, but when I try to run it on Databricks I get a TypeError as follows:
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''
The fitting process I'm running is this (note that it uses fixed p, d, q, P, D, Q, m values, because I need to check why no models are being fitted):
import numpy as np
import statsmodels.api as sm

exodus_train = np.array(np.random.normal(2, 1, size=(25, 1)))
model = sm.tsa.statespace.SARIMAX(train,
                                  order=[2, 0, 0],
                                  exog=exodus_train,
                                  seasonal_order=[2, 0, 0, 12],
                                  enforce_stationarity=False,
                                  enforce_invertibility=False).fit()
It then throws a TypeError:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<command-1275539631463044> in <module>
4 seasonal_order=[2,0,0,12],
5 enforce_stationarity=False,
----> 6 enforce_invertibility=False).fit()
/databricks/python/lib/python3.7/site-packages/statsmodels/tsa/statespace/mlemodel.py in fit(self, start_params, transformed, cov_type, cov_kwds, method, maxiter, full_output, disp, callback, return_params, optim_score, optim_complex_step, optim_hessian, flags, **kwargs)
430 """
431 if start_params is None:
--> 432 start_params = self.start_params
433 transformed = True
434
/databricks/python/lib/python3.7/site-packages/statsmodels/tsa/statespace/sarimax.py in start_params(self)
966 # Although the Kalman filter can deal with missing values in endog,
967 # conditional sum of squares cannot
--> 968 if np.any(np.isnan(endog)):
969 mask = ~np.isnan(endog).squeeze()
970 endog = endog[mask]
TypeError: ufunc 'isnan' not supported for the input types, and the inputs could not be safely coerced to any supported types according to the casting rule ''safe''

In case this happens to someone else: this error occurs when your time series values use commas as the decimal separator, or when your column is not a float dtype.
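A minimal sketch of the fix under that assumption (the train series here is illustrative):

import pandas as pd

# values read in as strings with commas as the decimal separator
train = pd.Series(["1,5", "2,3", "4,0"])

# swap the separator and coerce to float so np.isnan can handle the values
train = train.str.replace(",", ".", regex=False).astype(float)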

pyspark: run bootstrap in parallel

I have a function that takes two Spark dataframes and some other arguments and outputs a scalar value.
I would like to bootstrap n times (to fill missing values in the dataframes above), running the whole process above each time, and return an output with n rows.
I tried the below for a simple problem:
def sum_fn(a, b):
    return a + b

rdd = spark.sparkContext.parallelize(list(range(1, 9 + 1)))
df = rdd.map(lambda x: (x, sum_fn(1, x))).toDF()
display(df)
This works fine; however, when I plug in my function, which takes Spark dataframes as input, instead of sum_fn, I get an error:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object
Could someone please suggest how I could do the same?
Thanks

Type error in cv2.getRotationMatrix2D - Only Size-1 Arrays can be converted to Python Scalars

I am trying to use cv2.getRotationMatrix2D to perform a simple rotation but am getting a type error. I assume it may be because of the cv2 or numpy version, so I am giving the details below.
cv2 version - 4.2.0
numpy version - 1.18.4
Python - 3.7.3
image = cv2.imread('4.jpg')
(h, w) = image[:2]
center = (w // 2, h // 2)
M = cv2.getRotationMatrix2D(center, 45, 1)
rotated = cv2.warpAffine(image, M, (w, h))
plt.axis('off')
plt.imshow(cv2.cvtColor(rotated, cv2.COLOR_BGR2RGB))
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-30-930e769f51b7> in <module>
2 (h,w) = image[:2]
3 center=(w//2,h//2)
----> 4 M = cv2.getRotationMatrix2D(center,45,1)
5 rotated = cv2.warpAffine(image,M,(w,h))
6 plt.axis('off')
TypeError: only size-1 arrays can be converted to Python scalars
Change the line
(h, w) = image[:2]
to
(h, w) = image.shape[:2]
image[:2] slices the first two rows of the image array rather than returning the image's dimensions, so h and w end up holding arrays instead of scalars, which is what triggers the "only size-1 arrays" error.
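For reference, the corrected snippet in full (a minimal sketch; it assumes matplotlib is imported as plt and that '4.jpg' exists):

import cv2
from matplotlib import pyplot as plt

image = cv2.imread('4.jpg')
(h, w) = image.shape[:2]                    # shape is (height, width, channels)
center = (w // 2, h // 2)                   # scalar coordinates, as cv2 expects
M = cv2.getRotationMatrix2D(center, 45, 1)  # rotate 45 degrees, scale 1.0
rotated = cv2.warpAffine(image, M, (w, h))
plt.axis('off')
plt.imshow(cv2.cvtColor(rotated, cv2.COLOR_BGR2RGB))
plt.show()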

Numba: UntypedAttributeError in class method

I have the following class and method that should convolve an array with a kernel.
import numpy as np
from numpy.fft import fft2 as FFT, ifft2 as IFFT
from PIL import Image
from tqdm import trange, tqdm
from numba import jit
from time import sleep
import _kernel

class convolve(object):
    """ contains methods to convolve two images """
    def __init__(self, image_array, kernel):
        self.array = image_array
        self.kernel = kernel
        self.__rangeX_ = self.array.shape[0]
        self.__rangeY_ = self.array.shape[1]
        self.__rangeKX_ = self.kernel.shape[0]
        self.__rangeKY_ = self.kernel.shape[1]
        if (self.__rangeKX_ >= self.__rangeX_ or
                self.__rangeKY_ >= self.__rangeY_):
            raise ValueError('Must submit suitable sizes for convolution.')

    @jit(nopython=True)
    def spaceConv(self):
        """ normal convolution, O(N^2*n^2). This is usually too slow """
        # pad array for convolution
        offsetX = self.__rangeKX_ // 2
        offsetY = self.__rangeKY_ // 2
        self.array = np.pad(self.array,
                            [(offsetY, offsetY), (offsetX, offsetX)],
                            mode='constant', constant_values=0)
        # this is the O(N^2) part of this algorithm
        for i in xrange(self.__rangeX_ - 2*offsetX):
            for j in xrange(self.__rangeY_ - 2*offsetY):
                # Now O(n^2) portion
                total = 0.0
                for k in xrange(2*offsetX+1):
                    for t in xrange(2*offsetY+1):
                        total += self.kernel[k][t] * self.array[i+k][j+t]
                self.array[i+offsetX][j+offsetY] = total
        return self.array
As an additional note (in case anyone asks), _kernel just generates specific kernels one may want to convolve the image with (e.g. Gaussian, Moffat, etc.), so it has nothing to do with this class.
When I call the above class on an image and kernel, I get the following error:
Traceback (most recent call last):
File "fftconv.py", line 147, in <module>
plt.imshow(conv.spaceConv(), interpolation='none', cmap='gray')
File "/root/anaconda2/lib/python2.7/site-packages/numba/dispatcher.py", line 304, in _compile_for_args
raise e
numba.errors.UntypedAttributeError: Caused By:
Traceback (most recent call last):
File "/root/anaconda2/lib/python2.7/site-packages/numba/compiler.py", line 249, in run
stage()
File "/root/anaconda2/lib/python2.7/site-packages/numba/compiler.py", line 465, in stage_nopython_frontend
self.locals)
File "/root/anaconda2/lib/python2.7/site-packages/numba/compiler.py", line 789, in type_inference_stage
infer.propagate()
File "/root/anaconda2/lib/python2.7/site-packages/numba/typeinfer.py", line 717, in propagate
raise errors[0]
UntypedAttributeError: Unknown attribute "rangeKX" of type pyobject
File "fftconv.py", line 45
[1] During: typing of get attribute at fftconv.py (45)
Failed at nopython (nopython frontend)
Unknown attribute "rangeKX" of type pyobject
File "fftconv.py", line 45
[1] During: typing of get attribute at fftconv.py (45)
This error may have been caused by the following argument(s):
- argument 0: cannot determine Numba type of value <__main__.convolve object at 0xaff5628c>
Usually I'm pretty good at tracing Python errors back to their cause, but because I'm not familiar with the inner workings of Numba, I'm not sure why it doesn't know what type offsetX is. Any suggestions?
One step performed by numba is type inference. This assigns types to the different values present in the function so that it can be compiled (in a way that runs fast).
The error means that numba doesn't understand the first input argument of the function (self in this case). Numba works best on plain functions whose arguments are scalars or arrays (all numeric). One option would be to move the O(n^2) loop into a function of its own, have that function receive the arrays and any other values explicitly, and decorate it with numba.njit (or numba.jit(nopython=True), which is equivalent).
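A minimal sketch of that refactoring (the space_conv name and its signature are illustrative, not from the original code):

import numpy as np
from numba import njit

@njit
def space_conv(padded, kernel):
    # plain function over numeric arrays, so numba can infer every type
    kx, ky = kernel.shape
    out_x = padded.shape[0] - kx + 1
    out_y = padded.shape[1] - ky + 1
    result = np.zeros((out_x, out_y))
    for i in range(out_x):
        for j in range(out_y):
            total = 0.0
            for k in range(kx):
                for t in range(ky):
                    total += kernel[k, t] * padded[i + k, j + t]
            result[i, j] = total
    return result

The spaceConv method would then just pad self.array and delegate with return space_conv(padded, self.kernel).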
Also worth a try is running the code "as is" but removing nopython=True. If the performance is good enough, leave it alone :). That may work because numba.jit is able to detect loops inside the code that can be compiled in nopython mode and automatically arranges for the loop itself to be compiled at full speed (so-called loop-lifting). The explicit nopython=True keyword disables that fallback mode, though.
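A toy illustration of that fallback (smooth is an illustrative function, not from the question):

import numpy as np
from numba import jit

@jit  # no nopython=True: numba can fall back to object mode here,
      # while still loop-lifting and compiling the numeric loop below
def smooth(values):
    out = np.empty(values.shape[0] - 2)
    for i in range(out.shape[0]):
        out[i] = (values[i] + values[i + 1] + values[i + 2]) / 3.0
    return out

print(smooth(np.arange(10.0)))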

Assign a value to a single SFrame element

I want to assign a value to a single element (i.e. single row and column) in an SFrame.
I am using the Python Notebook and importing graphlab.
I created an SFrame with dimensions 16364 rows x 37 columns.
The column 'test' contains zeros.
I have used the following syntax to set the value:
sf[1]['test'] = 3;
If I then type:
sf[1]['test']
then I see the correct value, i.e "3"
But if I type:
sf
then I just see zeros for all rows of column 'test'.
The same is true for sf.head(), sf['test'], and sf['test'].head().
I don't understand why one syntax shows the value "3" while the other does not. Is the value in sf[1]['test'] 3 or 0?
SFrames are immutable, so they don't actually support item assignment. The reason for the difference you see here is that
sf[1]['test']
isn't actually referring to the SFrame at all. sf[1] returns a dictionary whose keys match the SFrame's column names and whose values match the second row of the SFrame. When you assign a number to sf[1]['test'], you are changing the value of the "test" key in the dictionary that was returned, so the SFrame sf is not involved in the assignment at all. The syntax that comes closest to referencing only the second value of the column "test" and assigning the value 3 to it is:
sf['test'][1] = 3
which would return this error:
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
<ipython-input-5-c52dab41d5dd> in <module>()
----> 1 sf['test'][1] = 3
TypeError: 'SArray' object does not support item assignment
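Since element assignment isn't supported, one workaround is to rebuild the column instead. A sketch, assuming the GraphLab Create API (the small SFrame and the id column are purely illustrative):

import graphlab as gl

sf = gl.SFrame({'test': [0, 0, 0, 0]})
sf = sf.add_row_number('id')  # add an explicit row index to select on
# replace the whole column with a new SArray whose second value is 3
sf['test'] = sf.apply(lambda row: 3 if row['id'] == 1 else row['test'])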

Python ggplot Plot Save Method Syntax?

I have tried several iterative versions, based on the error message prompts, to save a ggplot to a file using either the pixels or inches method:
ggsave(filename="nlrundiff.jpg", width=4, height=4, units='in', plot=plt)
No success in either case; the relevant excerpt of the resulting error message is as follows:
/usr/local/lib/python2.7/dist-packages/ggplot/utils/ggutils.pyc in ggsave(filename, plot, device, format, path, scale, width, height, units, dpi, limitsize, **kwargs)
118 from_inch = {"in":lambda x:x,"cm":lambda x: x * 2.54, "mm":lambda x: x * 2.54 * 10}
119
--> 120 w, h = figure.get_size_inches()
121 issue_size = False
122 if width is None:
AttributeError: 'NoneType' object has no attribute 'get_size_inches'
Is this an input syntax error on my part or a Python ggplot bug?
Thanks.
Calling ggsave(...) without specifying the ggplot object, or without drawing it beforehand, is not supported (though the above is a bug; it should print a user-readable message).
So either draw the ggplot object with gg.draw() or print(gg) and then call ggsave(...) as you did above, or pass in the ggplot object directly: ggsave(gg, filename=...).
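A minimal sketch of the keyword form (the gg object and data frame are illustrative):

import pandas as pd
from ggplot import ggplot, aes, geom_line, ggsave

# build a ggplot object; the asker's plot would take its place
df = pd.DataFrame({'x': range(10), 'y': range(10)})
gg = ggplot(aes(x='x', y='y'), data=df) + geom_line()

# draw the plot, then save it, passing the ggplot object via plot=
print(gg)
ggsave(filename="nlrundiff.jpg", width=4, height=4, units='in', plot=gg)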