Does KernelDensity.estimate in pyspark.mllib.stat.KernelDensity work when input data is normally distributed? - pyspark

Does PySpark's KernelDensity.estimate work correctly on a dataset that is normally distributed? I get an error when I try it. I have filed a bug report titled "KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not normally distributed)".
Example code:
from pyspark.mllib.stat import KernelDensity
vecRDD = sc.parallelize(colVec)
kd = KernelDensity()
kd.setSample(vecRDD)
kd.setBandwidth(3.0)
# Find density estimates for the given values
densities = kd.estimate(samplePoints)
When the data is NOT Gaussian, I get, for example:
For reference, using Scala, for Gaussian data,
import org.apache.spark.mllib.stat.KernelDensity
val vecRDD = sc.parallelize(colVec)
val kd = new KernelDensity().setSample(vecRDD).setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(samplePoints)
I get:

I faced the same issue and was able to narrow it down to a minimal test case. If you're using NumPy in Python to generate the data in the RDD, then that's the problem!
import numpy as np
from pyspark.mllib.stat import KernelDensity
kd = KernelDensity()
kd.setSample(sc.parallelize([0.0, 1.0, 2.0, 3.0])) # THIS WORKS
# kd.setSample(sc.parallelize([0.0, np.float32(1.0), 2.0, 3.0])) # THIS FAILS
kd.estimate([0.0, 1.0])
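If this was your issue as well, simply convert the NumPy data to Python base types until the Spark issue is fixed. You can do that with float(x) or the .item() method of a NumPy scalar. The np.asscalar function did the same thing, but it was deprecated in NumPy 1.16 and later removed.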
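As a minimal sketch of that conversion (the array contents and variable names here are made up for illustration), the NumPy values can be turned into built-in Python floats before they are handed to Spark:

```python
import numpy as np

# Hypothetical sample data; the failing case above used an np.float32 value.
col_vec = np.array([0.0, 1.0, 2.0, 3.0], dtype=np.float32)

# ndarray.tolist() converts every element to a built-in Python float.
plain_values = col_vec.tolist()

# For a single NumPy scalar, .item() or float() does the same job.
one_value = np.float32(1.0).item()

print([type(v).__name__ for v in plain_values])
print(type(one_value).__name__)
```

The resulting plain list can then be passed to sc.parallelize before calling setSample.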


Different results with torchvision transforms

Correct me if I am wrong.
The 'classic' way to pass images through torchvision transforms is to use Compose, as shown on its doc page. This, however, requires a PIL Image as input.
An alternative is to use ConvertImageDtype with torch.nn.Sequential. This bypasses the need for a PIL Image, and in my case it is much faster because I work with NumPy arrays.
My problem is that the results are not identical. Below is an example with a custom Normalize. I would like to use torch.nn.Sequential (tr) because it is faster for my needs, but the summed squared error compared to Compose (tr2) is very large (~810).
from PIL import Image
import torchvision.transforms as T
import numpy as np
import torch
o = np.random.rand(64, 64, 3) * 255
o = np.array(o, dtype=np.uint8)
i = Image.fromarray(o)
# Tensor pipeline: works directly on a uint8 CHW tensor
tr = torch.nn.Sequential(
    T.Resize(224, interpolation=T.InterpolationMode.BICUBIC),
    T.ConvertImageDtype(torch.float),
    T.Normalize([0.48145466, 0.4578275, 0.40821073], [0.26862954, 0.26130258, 0.27577711]),
)
# PIL pipeline: needs a PIL Image as input
tr2 = T.Compose([
    T.Resize(224, interpolation=T.InterpolationMode.BICUBIC),
    T.ToTensor(),
    T.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711)),
])
out = tr(torch.from_numpy(o).permute(2,0,1).contiguous())
out2 = tr2(i)
print(((out - out2) ** 2).sum())
The interpolation method seems to matter A LOT, and if I use the default BILINEAR the error is ~7, but I need to use BICUBIC.
The problem seems to lie in ConvertImageDtype vs ToTensor, because if I replace
ToTensor with ConvertImageDtype results are identical (cannot do the other way around
because ToTensor is not a subclass of Module and I cannot use it with nn.Sequential).
However, the following gives identical results
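# Same two pipelines, but without the Resize step
tr = torch.nn.Sequential(
    T.ConvertImageDtype(torch.float),
    T.Normalize([0.48145466, 0.4578275, 0.40821073], [0.26862954, 0.26130258, 0.27577711]),
)
tr2 = T.Compose([T.ToTensor(), T.Normalize((0.48145466, 0.4578275, 0.40821073), (0.26862954, 0.26130258, 0.27577711))])
out = tr(torch.from_numpy(o).permute(2,0,1).contiguous())
out2 = tr2(i)
print(((out - out2) ** 2).sum())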
This means that the interpolation changes something in the results, which matters only
when I use ToTensor vs ConvertImageDtype.
Any input is appreciated.
This is documented in the torchvision docs for Resize:
The output image might be different depending on its type: when downsampling, the interpolation of PIL images and tensors is slightly different, because PIL applies antialiasing. This may lead to significant differences in the performance of a network. Therefore, it is preferable to train and serve a model with the same input types. See also below the antialias parameter, which can help making the output of PIL images and tensors closer.
Passing antialias=True produces almost identical results.
This is interesting because the doc says that it can be set to True for InterpolationMode.BILINEAR only mode. Yet I am using BICUBIC and it still works.
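To illustrate why antialiasing changes downsampled output at all, here is a small NumPy sketch. It is only a toy model: the signal is made up, and the box filter is a stand-in rather than the kernel PIL or torchvision actually uses.

```python
import numpy as np

# A 1-D "image" row that alternates between black and white pixels.
row = np.tile([0.0, 255.0], 32)  # 64 samples

# Downsampling by 2 without antialiasing: keep every second sample.
naive = row[::2]

# Downsampling by 2 with a simple box filter: average each pair first.
smoothed = row.reshape(-1, 2).mean(axis=1)

print(naive[:4])     # every kept sample is black
print(smoothed[:4])  # every output sample is mid-gray
```

Both results have the same length, yet their values differ completely, which is the kind of gap that can appear when one pipeline filters before resampling and the other does not.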

How to get the prediction of a model in PySpark

I have developed a clustering model using PySpark and I just want to predict the class of one vector. Here is the code:
spark = SparkSession.builder.config("spark.sql.warehouse.dir",
vecAssembler = VectorAssembler(inputCols=FEATURES_COL, outputCol="features")
df_kmeans = vecAssembler.transform(df).select('LCLid', 'features')
k = 6
kmeans = KMeans().setK(k).setSeed(1).setFeaturesCol("features")
model = kmeans.fit(df_kmeans)
centers = model.clusterCenters()
predictions = model.transform(df_kmeans)
transformed = model.transform(df_kmeans).select('LCLid', 'prediction')
rows = transformed.collect()
Say that I have a vector of features V and I want to predict which class it belongs to.
I tried a method that I found in this link,
but it doesn't work since I'm working with SparkSession, not SparkContext.
I see that you have handled the basic steps of creating your model. What you still need to do is apply your k-means model to the vector you want to cluster (as you did in line 10) and then read off the prediction. In other words, redo the work done in line 10, but on the new feature vector V. To understand this better, I invite you to read this answer on Stack Overflow:
KMeans clustering in PySpark.
I also want to add that the problem with the example you are following is not due to using SparkSession rather than SparkContext, as those are just entry points to the Spark APIs. You can also access a SparkContext through a SparkSession, since the two have been unified since Spark 2.0. PySpark's k-means works much like scikit-learn's; the main difference is the set of predefined functions in the Spark Python API (PySpark).
You can call the predict method of the kmeans model using a Spark ML Vector:
from pyspark.ml.linalg import Vectors
model.predict(Vectors.dense([1, 0]))
Here [1,0] is just an example. It should have the same length as your feature vector.
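Conceptually, the prediction is just the index of the nearest cluster center. The NumPy sketch below shows that idea, assuming the default Euclidean distance; the centers and the vector are hypothetical values standing in for model.clusterCenters() and your own V.

```python
import numpy as np

# Hypothetical cluster centers, standing in for model.clusterCenters().
centers = np.array([[0.0, 0.0], [1.0, 0.0], [5.0, 5.0]])

# Hypothetical feature vector V; it must have the same length as each center.
v = np.array([0.9, 0.2])

# Squared Euclidean distance from V to every center.
distances = ((centers - v) ** 2).sum(axis=1)

# The predicted cluster is the index of the closest center.
cluster = int(np.argmin(distances))
print(cluster)
```

This can serve as a sanity check on a single vector without building a DataFrame.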

Feature Selection in Multivariate Linear Regression

import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
a = make_regression(n_samples=300,n_features=5,noise=5)
df1 = pd.DataFrame(a[0])
df1 = pd.concat([df1,pd.DataFrame(a[1].T)],axis=1,ignore_index=True)
Correlation Matrix
Now I can ask my question. How can I choose features that will be included in the model?
I am not that well-versed in python as I use R most of the time.
But it should be something like this:
# Create a model
model = LinearRegression()
# Call the .fit method and pass in your data
model.fit(Variables, Target)
# Or simply do
model = LinearRegression().fit(Variables, Target)
# So based on the dataset head provided, it should be
model = LinearRegression().fit(X, Y)
To do feature selection, you need to fit the model first and then check the p-value of each coefficient. Typically, a p-value of 5% (.05) or less is a good cut-off point. If a variable's p-value is above .05, the variable is not statistically significant and you can remove it from your model. Note that scikit-learn's LinearRegression does not report p-values, so you would need to compute them separately, for example with statsmodels. You can also look at the correlation matrix to see which variables have the weakest correlation with the target. scikit-learn does provide automated selectors in sklearn.feature_selection (for example SelectKBest and RFE), but in the end, statistics are just numbers. It is up to humans to interpret the results.
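The correlation-based check can be sketched with NumPy alone. The data below is synthetic and hypothetical (it is not the make_regression output from the question), and a simple ranking by absolute correlation is only a first screening step, not a full selection procedure.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: 300 samples, 5 features, only features 0 and 3 drive the target.
X = rng.normal(size=(300, 5))
y = 4.0 * X[:, 0] - 3.0 * X[:, 3] + rng.normal(scale=0.5, size=300)

# Pearson correlation of every feature column with the target.
corr = np.array([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])

# Rank features by absolute correlation, strongest first.
ranking = np.argsort(-np.abs(corr))
print(ranking)
print(np.round(corr, 2))
```

Features at the end of the ranking are candidates for removal, but correlated predictors can mislead this check, so it is worth confirming with the fitted model.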

How to calculate a raster mean/average from individual maps using GDAL in Python?

I have six monthly raster maps of ET data in TIFF format, for the months April to September, and would like to compute the mean of those six maps as a single mean ET map.
ETmaps_average.tif (I need such a map!)
Any idea?
I would prefer to do it using the GDAL package in Python 3.7. Thanks
I have made a few assumptions here, but given that those are true this should solve your problem.
All data can fit in memory
All images have the same size (and same geotransform)
All images have a single band
You should be able to modify the code in case some of the above assumptions are not true
from osgeo import gdal
import numpy as np
file_paths = ['''List of paths to your files''']
# We build one large np array of all images (this requires that all data fits in memory)
res = []
for f in file_paths:
    ds = gdal.Open(f)
    res.append(ds.GetRasterBand(1).ReadAsArray())  # We assume that all rasters have a single band
stacked = np.dstack(res)  # We assume that all rasters have the same dimensions
mean = np.mean(stacked, axis=-1)
# Finally save a new raster with the result.
# This assumes that all inputs have the same geotransform since we just copy the first
driver = gdal.GetDriverByName('GTiff')
result = driver.CreateCopy('ETmaps_average.tif', gdal.Open(file_paths[0]))
result.GetRasterBand(1).WriteArray(mean)  # Overwrite the copied band with the mean
result = None  # Close the dataset so the data is flushed to disk
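If the rasters contain nodata cells, a plain mean will be skewed by the nodata value. The sketch below shows one way to ignore such cells with a NumPy masked array. The nodata value of -9999 and the tiny arrays are hypothetical; they stand in for the arrays returned by ReadAsArray().

```python
import numpy as np

NODATA = -9999.0  # hypothetical nodata value; real rasters may use a different one

# Three small hypothetical rasters standing in for the arrays read from file.
maps = [
    np.array([[1.0, 2.0], [NODATA, 4.0]]),
    np.array([[3.0, NODATA], [NODATA, 6.0]]),
    np.array([[5.0, 6.0], [NODATA, 8.0]]),
]

stacked = np.stack(maps, axis=0)

# Mask nodata cells so they do not contribute to the mean.
masked = np.ma.masked_equal(stacked, NODATA)

# Cells that are nodata in every input stay nodata in the output.
mean = masked.mean(axis=0).filled(NODATA)
print(mean)
```

The resulting array can be written to the output band in the same way as the plain mean.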

Do I have to preprocess test data using neural networks?

I am using Keras (version 2.0.0) and I'd like to make use of pretrained models such as VGG16.
To get started, I ran the example from the Keras documentation site for extracting features with VGG16:
from keras.applications.vgg16 import VGG16
from keras.preprocessing import image
from keras.applications.vgg16 import preprocess_input
import numpy as np
model = VGG16(weights='imagenet', include_top=False)
img_path = 'elephant.jpg'
img = image.load_img(img_path, target_size=(224, 224))
x = image.img_to_array(img)
x = np.expand_dims(x, axis=0)
x = preprocess_input(x)
features = model.predict(x)
The preprocess_input() function used here bothers me (it zero-centers the input by the mean pixel, as can be seen in the source code).
Do I really have to preprocess input data (validation/test data) before using a trained model?
If yes, can one conclude that you always have to know which preprocessing steps were performed during the training phase?
If no, does preprocessing of validation/test data cause a bias?
I appreciate your help.
Yes, you should use the preprocessing step. You can retrain the model without it, but the first layers will then have to learn to center your data, which is a waste of parameters.
If you do not recenter the inputs for a model that was trained on centered data, its performance will suffer.
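For intuition, here is a NumPy sketch of what mean-pixel zero-centering looks like. It follows my understanding of the 'caffe'-style preprocessing used for VGG16 (RGB to BGR, then subtracting the ImageNet channel means); the function name and test image are hypothetical, and in practice you should keep calling preprocess_input itself.

```python
import numpy as np

# ImageNet per-channel means in BGR order, as used by 'caffe'-style preprocessing.
IMAGENET_MEAN_BGR = np.array([103.939, 116.779, 123.68], dtype=np.float32)

def zero_center(batch_rgb):
    """Sketch of mean-pixel zero-centering for a batch of RGB images (N, H, W, 3)."""
    batch_bgr = batch_rgb[..., ::-1].astype(np.float32)  # RGB -> BGR
    return batch_bgr - IMAGENET_MEAN_BGR                 # subtract each channel's mean

# Hypothetical test batch: one 224x224 mid-gray image.
x = np.full((1, 224, 224, 3), 128.0, dtype=np.float32)
x_centered = zero_center(x)
print(x_centered[0, 0, 0])
```

Whatever transformation is used, the same one has to be applied to training, validation, and test inputs.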
Great thread on reddit :