Force Polars to store a Python dictionary as pl.Object - python-polars

I am running into the dreaded "struct orders must remain the same" error with a dictionary nested in my record. For my use case, I don't care about creating a Polars struct; I am OK with just storing the original dictionary as a pl.Object. However, my attempt to do so is not working:
df = pl.DataFrame(
    data=c.fetchall(),
    columns=[
        ("primary_key", pl.Int32),
        ("string_col", pl.Object),
        ("dict_col", pl.Object),
    ],
)
This still raises an exception:
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: ComputeError(Borrowed("struct orders must remain the same"))', /Users/runner/work/polars/polars/polars/polars-core/src/frame/row.rs:457:86
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File:
    df = pl.DataFrame(
  File "venv/lib/python3.9/site-packages/polars/internals/dataframe/frame.py", line 299, in __init__
    self._df = sequence_to_pydf(
  File "venv/lib/python3.9/site-packages/polars/internals/construction.py", line 612, in sequence_to_pydf
    pydf = PyDataFrame.read_rows(data, infer_schema_length)
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: ComputeError(Borrowed("struct orders must remain the same"))
This is with Polars 0.14.18.
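One workaround, if you are stuck on a version with this panic, is to bypass the row-based constructor (PyDataFrame.read_rows, which is where the struct inference happens) and build each column explicitly as a pl.Series with a forced dtype. A minimal sketch, assuming c.fetchall() returns tuples in the same column order:
import polars as pl

rows = c.fetchall()  # assumed: list of (primary_key, string_col, dict_col) tuples
df = pl.DataFrame([
    pl.Series("primary_key", [r[0] for r in rows], dtype=pl.Int32),
    pl.Series("string_col", [r[1] for r in rows], dtype=pl.Object),
    pl.Series("dict_col", [r[2] for r in rows], dtype=pl.Object),
])
Since each Series is constructed directly with dtype=pl.Object, no schema is inferred from the nested dictionaries, so the struct-order check should never run.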

Related

I'm working on a keystroke dynamics project and getting this error:

Traceback (most recent call last):
  File "N:\NCI Cyber Security\Sem 3\keycalculatingValue.py", line 37, in
    finalData['key'] = chr(keyList[j])
TypeError: integer argument expected, got float
It is hard to say without seeing more code, but it looks like keyList[j] is returning a float, while chr() requires an integer in the range 0 through 1,114,111. If the integer is out of that range it raises a ValueError; if the input type is wrong it raises a TypeError, although the latter does not seem to be documented.
The following code gives the same error:
(j := 1.0) and chr(j)
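If keyList[j] holds whole-number key codes that merely arrived as floats, the fix is simply to cast to int before calling chr():
finalData['key'] = chr(int(keyList[j]))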

PySpark: Run bootstrap in parallel

I have a function that takes two Spark DataFrames and some other arguments and outputs a scalar value.
I would like to bootstrap (to fill missing values in the DataFrames above) n times with the whole process above, and return the output with n rows.
I tried the following for a simple problem:
def sum_fn(a, b):
    return a + b

rdd = spark.sparkContext.parallelize(list(range(1, 9 + 1)))
df = rdd.map(lambda x: (x, sum_fn(1, x))).toDF()
display(df)
This works fine. However, when I swap in my own function, which takes Spark DataFrames as input, instead of sum_fn, I get an error:
Traceback (most recent call last):
  File "/databricks/spark/python/pyspark/serializers.py", line 476, in dumps
    return cloudpickle.dumps(obj, pickle_protocol)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 72, in dumps
    cp.dump(obj)
  File "/databricks/spark/python/pyspark/cloudpickle/cloudpickle_fast.py", line 540, in dump
    return Pickler.dump(self, obj)
TypeError: cannot pickle '_thread.RLock' object
PicklingError: Could not serialize object: TypeError: cannot pickle '_thread.RLock' object
Could someone please suggest how I could do this?
Thanks
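The underlying issue is that a Spark DataFrame is a driver-side handle holding a reference to the SparkSession/SparkContext, which contains a _thread.RLock; any closure passed to rdd.map that captures a DataFrame therefore cannot be pickled and shipped to the executors. One alternative (a sketch only, with my_bootstrap_fn, sdf1, sdf2 and n as hypothetical stand-ins for the function and inputs described above) is to run the n bootstrap iterations as a driver-side loop and collect the scalar results into a single DataFrame:
# my_bootstrap_fn(sdf1, sdf2, seed) is assumed to return a scalar
results = [(i, my_bootstrap_fn(sdf1, sdf2, seed=i)) for i in range(n)]
out = spark.createDataFrame(results, ["iteration", "value"])
display(out)
Each call still runs as distributed Spark jobs internally; only the loop over the n replicates is sequential on the driver.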

PySpark: Count every element in flatmap

I am having trouble counting every element in a list that I have created in PySpark.
Here is what I am working with:
test2 = words.filter(lambda line: re.match(r'^[AEIOU]', line)).take(10)
test2
[u'EBook', u'Author:', u'English', u'OF', u'EBOOK', u'Inc.,', u'Etext', u'Inc.,', u'Etexts', u'Etext']
Now I want to confirm that the count of test2 is 10. But every time I use test2.count(), it gives me an error:
Traceback (most recent call last):
  File "", line 1, in
TypeError: count() takes exactly one argument (0 given)
Can someone help me learn how to count the elements properly?
Thank you!
test2 is a plain Python list (take(10) collects the results to the driver), so you should use len(test2) to find the number of elements. The count() method, when called on a list, returns the number of occurrences of whatever value you pass as an argument, which is why calling it with no arguments raises a TypeError.
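For example:
test2 = [u'EBook', u'Author:', u'English', u'OF', u'EBOOK']
len(test2)             # 5 -> number of elements
test2.count(u'EBook')  # 1 -> occurrences of the value u'EBook'
If you want Spark itself to do the counting, call count() on the RDD before collecting, e.g. words.filter(lambda line: re.match(r'^[AEIOU]', line)).count(); RDD.count() takes no arguments.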

Getting AttributeError: 'OneHotEncoder' object has no attribute '_jdf' in PySpark

I am trying to implement a gradient boosting algorithm on a Kaggle dataset in PySpark for learning purposes. I am facing the error given below:
Traceback (most recent call last):
  File "C:/SparkCourse/Gradientboost.py", line 29, in <module>
    output=assembler.transform(data)
  File "C:\spark\python\lib\pyspark.zip\pyspark\ml\base.py", line 105, in transform
  File "C:\spark\python\lib\pyspark.zip\pyspark\ml\wrapper.py", line 281, in _transform
AttributeError: 'OneHotEncoder' object has no attribute '_jdf'
The corresponding code is:
from pyspark.sql import SparkSession
from pyspark.ml.feature import StringIndexer, VectorIndexer, OneHotEncoder, VectorAssembler

spark = SparkSession.builder.config("spark.sql.warehouse.dir", "file:///C:/temp").appName("Gradientboostapp").enableHiveSupport().getOrCreate()
data = spark.read.csv("C:/Users/codemen/Desktop/Timeseries Analytics/liver_patient.csv", header=True, inferSchema=True)
# data.show()
print(data.count())
# data.printSchema()
print("After deleting null values")
data = data.na.drop()
print(data.count())
data = StringIndexer(inputCol="Gender", outputCol="GenderIndex").fit(data)
# let's one-hot encode the data
data = OneHotEncoder(inputCol="GenderIndex", outputCol="gendervec")
usedfeature = ["Age", "gendervec", "Total_Bilirubin", "Direct_Bilirubin", "Alkaline_Phosphotase", "Alamine_Aminotransferase", "Aspartate_Aminotransferase", "Total_Protiens", "Albumin", "Albumin_and_Globulin_Ratio"]
assembler = VectorAssembler(inputCols=usedfeature, outputCol="features")
output = assembler.transform(data)
output.select("features", "category").show()
I converted the Gender category into numerical form using StringIndexer, and then tried to perform one-hot encoding on the GenderIndex values. The error occurs when VectorAssembler runs. I may be missing a very silly concept here; kindly help me figure it out.
This line of code is incorrect: data=OneHotEncoder(inputCol="GenderIndex",outputCol="gendervec"). You are setting data equal to the OneHotEncoder() object itself, not to transformed data. You need to call transform() to encode the data. It should look like this:
encoder = OneHotEncoder(inputCol="GenderIndex", outputCol="gendervec")
data = encoder.transform(data)
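Note that the StringIndexer line a few steps earlier has a related problem: fit() returns a StringIndexerModel rather than a transformed DataFrame, so data ends up holding a model object. A sketch of the corrected sequence (same column names as the question; in Spark 2.x OneHotEncoder is a plain Transformer, so no fit() is needed):
data = StringIndexer(inputCol="Gender", outputCol="GenderIndex").fit(data).transform(data)
encoder = OneHotEncoder(inputCol="GenderIndex", outputCol="gendervec")
data = encoder.transform(data)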

Numba: UntypedAttributeError in class method

I have the following class and method that should convolve an array with a kernel.
import numpy as np
from numpy.fft import fft2 as FFT, ifft2 as IFFT
from PIL import Image
from tqdm import trange, tqdm
from numba import jit
from time import sleep

import _kernel

class convolve(object):
    """ contains methods to convolve two images """

    def __init__(self, image_array, kernel):
        self.array = image_array
        self.kernel = kernel

        self.__rangeX_ = self.array.shape[0]
        self.__rangeY_ = self.array.shape[1]
        self.__rangeKX_ = self.kernel.shape[0]
        self.__rangeKY_ = self.kernel.shape[1]

        if (self.__rangeKX_ >= self.__rangeX_ or
                self.__rangeKY_ >= self.__rangeY_):
            raise ValueError('Must submit suitable sizes for convolution.')

    @jit(nopython=True)
    def spaceConv(self):
        """ normal convolution, O(N^2*n^2). This is usually too slow """

        # pad array for convolution
        offsetX = self.__rangeKX_ // 2
        offsetY = self.__rangeKY_ // 2

        self.array = np.pad(self.array,
                            [(offsetY, offsetY), (offsetX, offsetX)],
                            mode='constant', constant_values=0)

        # this is the O(N^2) part of this algorithm
        for i in xrange(self.__rangeX_ - 2*offsetX):
            for j in xrange(self.__rangeY_ - 2*offsetY):
                # Now O(n^2) portion
                total = 0.0
                for k in xrange(2*offsetX+1):
                    for t in xrange(2*offsetY+1):
                        total += self.kernel[k][t] * self.array[i+k][j+t]
                self.array[i+offsetX][j+offsetY] = total

        return self.array
As an additional note (in case anyone asks), _kernel just generates specific kernels one may want to convolve the image with (e.g. Gaussian, Moffat, etc.), so it has nothing to do with this class.
When I call the above class on an image and kernel, I get the following error:
Traceback (most recent call last):
  File "fftconv.py", line 147, in <module>
    plt.imshow(conv.spaceConv(), interpolation='none', cmap='gray')
  File "/root/anaconda2/lib/python2.7/site-packages/numba/dispatcher.py", line 304, in _compile_for_args
    raise e
numba.errors.UntypedAttributeError: Caused By:
Traceback (most recent call last):
  File "/root/anaconda2/lib/python2.7/site-packages/numba/compiler.py", line 249, in run
    stage()
  File "/root/anaconda2/lib/python2.7/site-packages/numba/compiler.py", line 465, in stage_nopython_frontend
    self.locals)
  File "/root/anaconda2/lib/python2.7/site-packages/numba/compiler.py", line 789, in type_inference_stage
    infer.propagate()
  File "/root/anaconda2/lib/python2.7/site-packages/numba/typeinfer.py", line 717, in propagate
    raise errors[0]
UntypedAttributeError: Unknown attribute "rangeKX" of type pyobject
File "fftconv.py", line 45
[1] During: typing of get attribute at fftconv.py (45)
Failed at nopython (nopython frontend)
Unknown attribute "rangeKX" of type pyobject
File "fftconv.py", line 45
[1] During: typing of get attribute at fftconv.py (45)
This error may have been caused by the following argument(s):
- argument 0: cannot determine Numba type of value <__main__.convolve object at 0xaff5628c>
Usually I'm pretty good at tracing Python errors back to their cause, but because I'm not familiar with the inner workings of Numba, I'm not sure why it doesn't know what type offsetX is. Any suggestions?
One step performed by Numba is type inference. This assigns types to the different values present in the function so that it can be compiled (in a way that runs fast).
The error means that Numba doesn't understand the first input argument of the function (self in this case). Numba works best on plain functions whose arguments are scalars or arrays (all numeric). One option would be to move the O(n^2) loop into a function of its own, have that function receive the arrays and any other values explicitly, and decorate that function with numba.njit (or numba.jit(nopython=True), which is equivalent).
Also worth a try is running the code "as is" with nopython=True removed. If the performance is then good enough, leave it alone :). That may work because numba.jit is able to detect loops inside the code that can be compiled in nopython mode and automatically compile just those loops at full speed; the explicit nopython=True keyword disables that fallback mode, though.
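A minimal sketch of that refactor (names are illustrative, not from the original post): pull the nested loops out of the class into a free function that receives only NumPy arrays and integers, and decorate that function instead:
import numpy as np
from numba import njit

@njit
def _space_conv(padded, kernel, out, offsetX, offsetY):
    # plain nested loops over numeric arrays: exactly what nopython mode compiles well
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            total = 0.0
            for k in range(2 * offsetX + 1):
                for t in range(2 * offsetY + 1):
                    total += kernel[k, t] * padded[i + k, j + t]
            out[i, j] = total
    return out
spaceConv would then pad self.array, allocate out = np.empty(self.array.shape), and call _space_conv(padded, self.kernel, out, offsetX, offsetY), keeping all attribute access on self outside the compiled code.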