How would I go about saving the min_max_scaler attributes so that I can apply the same transform to data in a different notebook?
For full disclosure, I've trained a NN in one notebook and am running it in a different location. It is simple for me to load the trained weights of the NN in the second location, but I need to scale the data before feeding it into the model. To be accurate, I believe the scaling has to use the original scale attributes.
Per the documentation, you can recreate what min max scaler does using
X_std = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
X_scaled = X_std * (max - min) + min
where X is your original dataset and (min, max) is the feature_range passed to the scaler. (As long as your feature range is the default of (0, 1), the second line above is not needed: you will come out with X_scaled = X_std.)
If you want to do this same computation using your already fitted MinMaxScaler instead of your original dataset, consider the following example (again assuming the feature range is left at the default (0, 1)):
from sklearn.preprocessing import MinMaxScaler
import pandas as pd
import numpy as np
# Test data set
X = pd.DataFrame(np.random.randint(0, 100, size=(20,4)))
# Test scaler
scaler = MinMaxScaler()
sklearn_result = scaler.fit_transform(X)
# Compute manually, and verify the results match up to machine precision
manual_result = (X - scaler.data_min_) / (scaler.data_max_ - scaler.data_min_)
(sklearn_result - manual_result).abs().max().max()  # Is around 1e-16
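To reuse the fitted scaling in another notebook, one option is to store the fitted per-feature minimum and maximum and reload them there. The sketch below is numpy-only and assumes the default feature range (0, 1); the file name scaler_params.npz and the stand-in arrays are hypothetical, and in practice data_min and data_max would be scaler.data_min_ and scaler.data_max_.

```python
import os
import tempfile

import numpy as np

# Stand-in training data; in the real workflow these two arrays would be
# scaler.data_min_ and scaler.data_max_ from the fitted MinMaxScaler.
rng = np.random.default_rng(0)
X_train = rng.integers(0, 100, size=(20, 4)).astype(float)
data_min = X_train.min(axis=0)
data_max = X_train.max(axis=0)

# Notebook 1: persist the fitted attributes (hypothetical file name).
params_path = os.path.join(tempfile.mkdtemp(), "scaler_params.npz")
np.savez(params_path, data_min=data_min, data_max=data_max)

# Notebook 2: reload them and apply the same transform to new data.
params = np.load(params_path)
value_range = params["data_max"] - params["data_min"]
X_new = rng.integers(0, 100, size=(5, 4)).astype(float)
X_new_scaled = (X_new - params["data_min"]) / value_range

# Sanity check: the training data itself maps into [0, 1].
X_train_scaled = (X_train - params["data_min"]) / value_range
```

Pickling the whole scaler object, for example with joblib.dump and joblib.load, is another common route when both notebooks run the same scikit-learn version.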
This can't be vectorised and we need to loop through the rows. I'm wondering if this can be done effectively in polars without casting. I see in the polars documentation for polars.DataFrame.rows it says:
Row-iteration is not optimal as the underlying data is stored in columnar form; where possible, prefer export via one of the dedicated export/output methods.
In pandas/numpy, the fastest way I can imagine is to use numba, roughly like this:
import numba as nb
import numpy as np

@nb.jit(nopython=True)
def exponential_sum(signal, decay, initial_value=0):
    # Note: initial_value is accepted but not used in this rough version
    n = len(signal)
    exp_sum = np.zeros(n)
    exp_sum[0] = signal[0]
    for i in range(1, n):
        # Decay the running sum, then add the new observation
        exp_sum[i] = exp_sum[i-1] * decay[i] + signal[i]
    return exp_sum
where decay = np.exp(times.diff() * alpha) (alpha controls the decay rate and the half-life).
I have tried casting to numpy and using numba as mentioned, but I'm wondering if there's an in-polars approach that is performant and, if not, whether it is just not implemented or whether it is a problem polars is not well suited for due to its storage format.
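For reference, the recurrence above can also be written with cumulative operations, which map onto a columnar layout more naturally than a row loop. The following numpy sketch is a rewrite under assumptions, not a polars answer: it treats decay[0] as 1 and divides by a running product of the decay factors, so it loses precision or underflows when that product becomes very small (long series or fast decay). The function names and the sample values are hypothetical.

```python
import numpy as np


def exponential_sum_loop(signal, decay):
    """Reference implementation of the recurrence, one row at a time."""
    exp_sum = np.zeros(len(signal))
    exp_sum[0] = signal[0]
    for i in range(1, len(signal)):
        exp_sum[i] = exp_sum[i - 1] * decay[i] + signal[i]
    return exp_sum


def exponential_sum_cumulative(signal, decay):
    """Same recurrence via cumprod/cumsum (assumes the product stays well above 0)."""
    d = np.asarray(decay, dtype=float).copy()
    d[0] = 1.0  # the first decay factor is never used by the recurrence
    running = np.cumprod(d)  # running[i] = decay[1] * ... * decay[i]
    return running * np.cumsum(np.asarray(signal, dtype=float) / running)


rng = np.random.default_rng(0)
times = np.cumsum(rng.uniform(0.1, 1.0, size=50))
alpha = -0.1  # hypothetical decay rate; negative so that decay factors are < 1
decay = np.exp(np.diff(times, prepend=times[0]) * alpha)
signal = rng.normal(size=50)

loop_result = exponential_sum_loop(signal, decay)
cumulative_result = exponential_sum_cumulative(signal, decay)
```

If the running product stays in a safe numeric range, the same two cumulative steps should be expressible with cumulative product and cumulative sum expressions in a dataframe library, without iterating over rows.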
import numpy as np
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LinearRegression
from sklearn.datasets import make_regression
a = make_regression(n_samples=300,n_features=5,noise=5)
df1 = pd.DataFrame(a[0])
df1 = pd.concat([df1,pd.DataFrame(a[1].T)],axis=1,ignore_index=True)
df1.rename(columns={0:"X1",1:"X2",2:"X3",3:"X4",4:"X5",5:"Target"},inplace=True)
sns.heatmap(df1.corr(),annot=True);
Correlation Matrix
Now I can ask my question. How can I choose features that will be included in the model?
I am not that well-versed in python as I use R most of the time.
But it should be something like this:
# Create a model
model = LinearRegression()
# Call the .fit method and pass in your data
model.fit(Variables,Target)
# Or simply do
model = LinearRegression().fit(Variables,Target)
# So based on the dataset head provided, it should be
X = df1[['X1','X2','X3','X4','X5']]
Y = df1['Target']
model = LinearRegression().fit(X,Y)
To do feature selection, you need to fit the model first and then check the p-value of each coefficient. Typically, a p-value of 5% (.05) or less is a good cut-off point. If the p-value is above .05, the variable is not statistically significant and you can consider removing it from your model. Note that scikit-learn's LinearRegression does not report p-values, so you would have to compute them yourself or fit the model with a package such as statsmodels. You can also look at the correlation matrix to see which variables have the weakest correlation with the target. The sklearn.feature_selection module (for example SelectKBest or RFE) can automate part of this, but in the end, statistics are just numbers. It is up to humans to interpret the results.
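As a rough illustration of the correlation-based screening mentioned above, the numpy sketch below ranks features by the absolute value of their correlation with the target. The synthetic data, the coefficients, and the variable names are hypothetical, and a ranking like this ignores interactions between features, so it is a first look rather than a selection rule.

```python
import numpy as np

rng = np.random.default_rng(0)
n_samples = 300
feature_names = ["X1", "X2", "X3", "X4", "X5"]

# Hypothetical data: the target depends strongly on X1, weakly on X2,
# and not at all on X3..X5.
X = rng.normal(size=(n_samples, 5))
target = 10.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=1.0, size=n_samples)

# Pearson correlation of each feature column with the target.
correlations = np.array([np.corrcoef(X[:, j], target)[0, 1] for j in range(5)])

# Feature names ordered from strongest to weakest absolute correlation.
order = np.argsort(-np.abs(correlations))
ranked = [(feature_names[j], float(correlations[j])) for j in order]
```

Features at the bottom of ranked are the first candidates to drop, subject to checking how the model performs without them.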
I am required to calculate the following in Paraview:
How can I calculate the transpose used in the above formula? Basically, I would like to know how to calculate the transpose of a matrix in ParaView.
As suggested by @Nico Vuaille, you should make use of Numpy support in ParaView. Simply apply a Programmable Filter to the dataset of interest, and supply a script comparable to the following.
import numpy as np
u = inputs[0].PointData['Velocity']
# Calculate gradient here, say uGrad
output.PointData.append(uGrad, 'Gradient')
EDIT: I have actually tried to generate your calculation with one of my datasets and realised that my answer and comments are not so helpful. Therefore, this is what I would suggest now, which should work:
Load your dataset in ParaView
Apply a Gradient / Gradient Of Unstructured Dataset filter on your dataset and select the velocity field as the input field (I used Gradient Of Unstructured Dataset, from which you have the possibility to also directly work out both divergence and vorticity fields).
Apply a Programmable Filter to the dataset you obtained from the previous step and supply the code below.
Script
import numpy as np
grad = inputs[0].PointData['Gradients']
omega = (grad - np.transpose(grad, axes=(0, 2, 1))) / 2
output.PointData.append(omega, 'Omega')
You should end up with another item in your ParaView pipeline that only contains the expected Omega.
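Outside ParaView, the transpose step can be checked on a plain numpy array. The sketch assumes the gradient array has shape (number of points, 3, 3), which is what the axes=(0, 2, 1) argument in the script implies; the random values are only placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
grad = rng.normal(size=(4, 3, 3))  # placeholder: 4 points, one 3x3 tensor each

# Swap the last two axes to transpose every 3x3 tensor independently.
grad_t = np.transpose(grad, axes=(0, 2, 1))

omega = (grad - grad_t) / 2   # antisymmetric part
strain = (grad + grad_t) / 2  # symmetric part
```

Leaving axis 0 in place is what keeps the points separate: only the row and column indices of each tensor are exchanged.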
EDIT 2: The input file is using the XMDF format. When loaded into ParaView, it is interpreted as a Multi-Block Dataset of Blocks. As a result, the code snippet provided to the Script argument of Programmable Filter has to be updated to:
import numpy as np
import paraview.vtk.numpy_interface.dataset_adapter as dsa
# Process each block of the multi-block dataset separately
for i in range(inputs[0].GetNumberOfBlocks()):
    data = dsa.WrapDataObject(inputs[0].GetBlock(i))
    grad = data.PointData['Gradients']
    omega = (grad - np.transpose(grad, axes=(0, 2, 1))) / 2
    data.PointData.append(omega, 'Omega')
    output.SetBlock(i, data.VTKObject)
I think this can be easily computed using Python calculator (no need for programmable filter):
To compute the gradient, type:
gradient(u)
To compute the symmetric part of the tensor gradient(u):
strain(u)
To compute the antisymmetric part, Omega, of the gradient tensor:
gradient(u) - strain(u)
Note that the gradient(u) tensor can be written as follows:
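Assuming the usual split of a tensor into its symmetric and antisymmetric parts, which matches strain(u) and Omega above, the decomposition reads:

```latex
\nabla u
  = \underbrace{\tfrac{1}{2}\left(\nabla u + (\nabla u)^{T}\right)}_{\text{strain}(u)}
  + \underbrace{\tfrac{1}{2}\left(\nabla u - (\nabla u)^{T}\right)}_{\Omega}
```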
I have six monthly raster maps of ET data in TIFF format, for the months April to September, and would like to get the mean of those six maps as a single mean ET map.
ETmaps_04.tif
ETmaps_05.tif
ETmaps_06.tif
ETmaps_07.tif
ETmaps_08.tif
ETmaps_09.tif
ETmaps_average.tif (I need such a map!)
Any idea?
I would prefer to do it using the GDAL package in Python 3.7. Thanks.
I have made a few assumptions here, but given that those are true this should solve your problem.
All data can fit in memory
All images have the same size (and same geotransform)
All images have a single band
You should be able to modify the code in case some of the above assumptions are not true
from osgeo import gdal
import numpy as np
file_paths = ['''List of paths to your files''']
# We build one large np array of all images (this requires that all data fits in memory)
res = []
for f in file_paths:
    ds = gdal.Open(f)
    res.append(ds.GetRasterBand(1).ReadAsArray())  # We assume that all rasters have a single band
stacked = np.dstack(res)  # We assume that all rasters have the same dimensions
mean = np.mean(stacked, axis=-1)
# Finally save a new raster with the result.
# This assumes that all inputs have the same geotransform since we just copy the first
driver = gdal.GetDriverByName('GTiff')
result = driver.CreateCopy('ETmaps_average.tif', gdal.Open(file_paths[0]))
result.GetRasterBand(1).WriteArray(mean)
result = None  # Dereference the dataset so GDAL flushes and closes the file
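If the rasters contain a nodata value, a plain mean would mix it into the result. The numpy sketch below shows one way to handle that, assuming a hypothetical nodata value of -9999 and small stand-in arrays in place of the bands read with ReadAsArray(); the real nodata value would have to be taken from the files or their metadata.

```python
import numpy as np

NODATA = -9999.0  # hypothetical nodata marker; check the actual rasters

# Stand-ins for the arrays returned by ReadAsArray() for each month.
month_a = np.array([[1.0, 2.0], [NODATA, 4.0]])
month_b = np.array([[3.0, NODATA], [NODATA, 8.0]])
month_c = np.array([[5.0, 6.0], [NODATA, 12.0]])

stacked = np.dstack([month_a, month_b, month_c])  # shape (rows, cols, months)

# Mask nodata cells so they do not contribute to the mean.
masked = np.ma.masked_equal(stacked, NODATA)

# Cells with no valid month at all are written back as nodata.
mean = masked.mean(axis=-1).filled(NODATA)
```

The resulting mean array has the same two-dimensional shape as each input band, so it can be written with WriteArray in the same way as in the code above.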
Does pyspark's KernelDensity.estimate work correctly on a dataset that is normally distributed? I get an error when I try that. I have filed https://issues.apache.org/jira/browse/SPARK-20803 (KernelDensity.estimate in pyspark.mllib.stat.KernelDensity throws net.razorvine.pickle.PickleException when input data is normally distributed (no error when data is not normally distributed))
Example code:
from pyspark.mllib.stat import KernelDensity
vecRDD = sc.parallelize(colVec)
kd = KernelDensity()
kd.setSample(vecRDD)
kd.setBandwidth(3.0)
# Find density estimates for the given values
densities = kd.estimate(samplePoints)
When the data is NOT Gaussian, I get, e.g.:
5.6654703477e-05,0.000100010001,0.000100010001,0.000100010001,.....
For reference, using Scala, for Gaussian data,
Code:
val vecRDD = sc.parallelize(colVec)
val kd = new KernelDensity().setSample(vecRDD).setBandwidth(3.0)
// Find density estimates for the given values
val densities = kd.estimate(samplePoints)
I get:
[0.04113814235801906,1.0994865517293571E-163,0.0,0.0,.....
I faced the same issue and was able to track down the issue to a very minimal test case. If you're using Numpy in Python to generate the data in the RDD, then that's the problem!
import numpy as np
from pyspark.mllib.stat import KernelDensity
kd = KernelDensity()
kd.setSample(sc.parallelize([0.0, 1.0, 2.0, 3.0])) # THIS WORKS
# kd.setSample(sc.parallelize([0.0, np.float32(1.0), 2.0, 3.0])) # THIS FAILS
kd.setBandwidth(0.35)
kd.estimate([0.0, 1.0])
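If this was your issue as well, simply convert the Numpy data to Python base types until the Spark issue is fixed. You can do that with the np.asscalar function; note that it was deprecated in NumPy 1.16 and removed in 1.23, so on current versions use the .item() method or float() instead.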
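A minimal sketch of that conversion, using only numpy: both tolist() and float() produce plain Python floats, which is the form that worked in the test case above. The sample values are placeholders.

```python
import numpy as np

samples = np.array([0.0, 1.0, 2.0, 3.0], dtype=np.float32)

# ndarray.tolist() converts every element to a Python built-in type.
as_python = samples.tolist()

# For a single numpy scalar, float() or .item() does the same job.
one_value = float(np.float32(1.0))
same_value = np.float32(1.0).item()
```

The resulting list can then be passed to sc.parallelize(...) in place of the numpy values.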