Pyspark: How to create a new column based of the existing column - pyspark

What is the equivalent of this operation in Pyspark?
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['new_set'] = df['Set']
Expected output
Set Type new_set
0 Z A Z
1 Z B Z
2 X B X
3 Y C Y

df = df.withColumn('new_set', col('set'))

Related

why does this Cubic Spline Error in dimensions appear?

def f(x):
return 1/(1 + (x**2))
from scipy.interpolate import CubicSpline
a = -1
b = 1
n = 5
xArray = np.linspace(a,b,n)
yArray = f(xArray)
x = np.linspace(a,b,nPts)
y = CubicSpline(xArray, yArray, x)
plt.plot(x, y, label="Interpolation, " + str(n) + " points")
Im wondering whats the problem in using cubic spline in this way. The error that I get says there is a wrong dimension?
ValueError: x and y must have same first dimension, but have shapes (101,) and (1,
I see your misunderstanding here roots from misinterpretation of the 'extrapolate' keyword, to quote the documentation of CubicSpline
extrapolate{bool, ‘periodic’, None}, optional
If bool, determines whether to extrapolate to out-of-bounds points
based on first and last intervals, or to return NaNs. If ‘periodic’,
periodic extrapolation is used. If None (default), extrapolate is set
to ‘periodic’ for bc_type='periodic' and to True otherwise.
is a boolean and not the list of points for which you want to interpolate and or extrapolate.
The correct usage is to fit a CubicSpline first and then use it to interpolate or extrapolate
def f(x):
return 1/(1 + (x**2))
from scipy.interpolate import CubicSpline
import numpy as np
import matplotlib.pyplot as plt
a = -1
b = 1
n = 5
xArray = np.linspace(a,b,n)
yArray = f(xArray)
x = np.linspace(a,b,101)
cs = CubicSpline(xArray, yArray, True) # fit a cubic spline
y = cs(x) # interpolate/extrapolate
plt.plot(x, y, label="Interpolation, " + str(n) + " points")
plt.show()
The above code will work

How to plot ROC curve in pyspark for GBTClassifier?

I am trying to plot the ROC curve for a gradient boosting model. I have come across this post but it doesn't seem to work for the GBTclassifier model. pyspark extract ROC curve?
I am using a dataset in databricks and below is my code. It gives the following error
AttributeError: 'PipelineModel' object has no attribute 'summary'
%fs ls databricks-datasets/adult/adult.data
from pyspark.sql.functions import *
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler, VectorSlicer
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator,MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
import pandas as pd
dataset = spark.table("adult")
# spliting the train and test data frames
splits = dataset.randomSplit([0.7, 0.3])
train_df = splits[0]
test_df = splits[1]
def predictions(train_df,
target_col,
):
"""
#Function attributes
dataframe - training df
target - target varibale in the model
"""
# one hot encoding and assembling
encoding_var = [i[0] for i in train_df.dtypes if (i[1]=='string') & (i[0]!=target_col)]
num_var = [i[0] for i in train_df.dtypes if ((i[1]=='int') | (i[1]=='double')) & (i[0]!=target_col)]
string_indexes = [StringIndexer(inputCol = c, outputCol = 'IDX_' + c, handleInvalid = 'keep') for c in encoding_var]
onehot_indexes = [OneHotEncoderEstimator(inputCols = ['IDX_' + c], outputCols = ['OHE_' + c]) for c in encoding_var]
label_indexes = StringIndexer(inputCol = target_col, outputCol = 'label', handleInvalid = 'keep')
assembler = VectorAssembler(inputCols = num_var + ['OHE_' + c for c in encoding_var], outputCol = "features")
gbt = GBTClassifier(featuresCol = 'features', labelCol = 'label',
maxDepth = 5,
maxBins = 45,
maxIter = 20)
pipe = Pipeline(stages = string_indexes + onehot_indexes + [assembler, label_indexes, gbt])
model = pipe.fit(train_df)
return model
gbt_model = predictions(train_df = train_df,
target_col = 'income')
import matplotlib.pyplot as plt
plt.figure(figsize=(5,5))
plt.plot([0, 1], [0, 1], 'r--')
plt.plot(gbt_model.summary.roc.select('FPR').collect(),
gbt_model.summary.roc.select('TPR').collect())
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()
Based on your error, have a look at PipelineModel in this doc:
https://spark.apache.org/docs/2.4.3/api/python/pyspark.ml.html#pyspark.ml.PipelineModel
There is not attribute summary in on an object of this class. Instead, I believe you need to access the stages of the PipelineModel individually, such as gbt_model.stages[-1] (which should give access to your last stage - the GBTClassifier. Then try and play around with the attributes there, such as:
gbt_model.stages[-1].summary
And if your GBTClassifier has a summary, you'll find it there. Hope this helps.

Add a vector to every column of a matrix, using Scala Breeze

I have a matrix M of (L x N) rank and I want to add the same vector v of length L to every column of the matrix. Is there a way do this please, using Scala Breeze?
I tried:
val H = DenseMatrix.zeros(L,N)
for (j <- 0 to L) {
H (::,j) = M(::,j) + v
}
but this doesn't really fit Scala's immutability as H is then already defined and therefore gives a reassignment to val error. Any suggestions appreciated!
To add a vector to all columns of a matrix, you don't need to loop through columns; you can use the column broadcasting feature, for your example:
H(::,*) + v // assume v is breeze dense vector
Should work.
import breeze.linalg._
val L = 3
val N = 2
val v = DenseVector(1.0,2.0,3.0)
val H = DenseMatrix.zeros[Double](L, N)
val result = H(::,*) + v
//result: breeze.linalg.DenseMatrix[Double] = 1.0 1.0
// 2.0 2.0
// 3.0 3.0

Variable Dependence scipy.special.genlaguerre

I'm new to python. If I wanted L(n, a, x) where L is the general Laguerre polynomial then I could simply use
from scipy.special import genlaguerre
print(genlaguerre(n, a))
However, I am having trouble obtaining something like L(n, a, 2 pi x) since there is no explicit variable dependence in the function genlaguerre.
The object returned by genlaguerre(n, a) is callable; you call it to evaluate it at a given x.
For example,
In [71]: import numpy as np
In [72]: import matplotlib.pyplot as plt
In [73]: from scipy.special import genlaguerre
In [74]: n = 3
In [75]: alpha = 4.5
In [76]: L = genlaguerre(n, alpha)
To get the value of the polynomial at x, call L(x):
In [77]: L(0)
Out[77]: 44.6875
In [78]: L(1)
Out[78]: 23.895833333333332
In [79]: L([2, 2.5, 3])
Out[79]: array([ 9.60416667, 4.58333333, 0.8125 ])
In [80]: x = np.linspace(0, 14, 100)
In [81]: plt.plot(x, L(x))
Out[81]: [<matplotlib.lines.Line2D at 0x11cde42b0>]
In [82]: plt.xlabel('x')
Out[82]: <matplotlib.text.Text at 0x11cddc4a8>
In [83]: plt.ylabel('$L_{%d}^{(%g)}(x)$' % (n, alpha))
Out[83]: <matplotlib.text.Text at 0x11cdce320>
In [84]: plt.grid()
Here's the plot generated by the above code:

In Scipy LeastSq - How to add the penalty term

If the object function is
How to code it in python?
I've already coded the normal one:
import numpy as np
import scipy as sp
from scipy.optimize import leastsq
import pylab as pl
m = 9 #the degree of the polynomial
def real_func(x):
return np.sin(2*np.pi*x) #sin(2 pi x)
def fake_func(p, x):
f = np.poly1d(p) #polynomial
return f(x)
def residuals(p, y, x):
return y - fake_func(p, x)
#randomly choose 9 points as x
x = np.linspace(0, 1, 9)
x_show = np.linspace(0, 1, 1000)
y0 = real_func(x)
#add normalize noise
y1 = [np.random.normal(0, 0.1) + y for y in y0]
p0 = np.random.randn(m)
plsq = leastsq(residuals, p0, args=(y1, x))
print 'Fitting Parameters :', plsq[0]
pl.plot(x_show, real_func(x_show), label='real')
pl.plot(x_show, fake_func(plsq[0], x_show), label='fitted curve')
pl.plot(x, y1, 'bo', label='with noise')
pl.legend()
pl.show()
Since the penalization term is also just quadratic, you could just stack it together with thesquares of the error and use weights 1 for data and lambda for the penalization rows.
scipy.optimize.curvefit does weighted least squares, if you don't want to code it yourself.