PySpark: How to create a new column based on an existing column

What is the equivalent of this operation in PySpark?
import pandas as pd
import numpy as np
df = pd.DataFrame({'Type':list('ABBC'), 'Set':list('ZZXY')})
df['new_set'] = df['Set']
Expected output:
  Set Type new_set
0   Z    A       Z
1   Z    B       Z
2   X    B       X
3   Y    C       Y

from pyspark.sql.functions import col
df = df.withColumn('new_set', col('Set'))

Related

Why does this cubic spline dimension error appear?

def f(x):
    return 1/(1 + (x**2))
from scipy.interpolate import CubicSpline
a = -1
b = 1
n = 5
xArray = np.linspace(a,b,n)
yArray = f(xArray)
x = np.linspace(a,b,nPts)
y = CubicSpline(xArray, yArray, x)
plt.plot(x, y, label="Interpolation, " + str(n) + " points")
I'm wondering what the problem is with using CubicSpline this way. The error I get says there is a wrong dimension:
ValueError: x and y must have same first dimension, but have shapes (101,) and (1,

The misunderstanding is about what the CubicSpline constructor accepts: it does not take the points at which you want to evaluate the spline. Its signature is CubicSpline(x, y, axis=0, bc_type='not-a-knot', extrapolate=None), so the third positional argument is axis, and the extrapolate keyword, to quote the documentation of CubicSpline,
extrapolate{bool, ‘periodic’, None}, optional
If bool, determines whether to extrapolate to out-of-bounds points
based on first and last intervals, or to return NaNs. If ‘periodic’,
periodic extrapolation is used. If None (default), extrapolate is set
to ‘periodic’ for bc_type='periodic' and to True otherwise.
is a boolean (or 'periodic'), not the list of points at which you want to interpolate or extrapolate. The constructor returns a spline object rather than an array of values, which is presumably why plt.plot(x, y) then reports mismatched shapes.
The correct usage is to fit a CubicSpline first and then call it to interpolate or extrapolate:
def f(x):
    return 1/(1 + (x**2))
from scipy.interpolate import CubicSpline
import numpy as np
import matplotlib.pyplot as plt
a = -1
b = 1
n = 5
xArray = np.linspace(a,b,n)
yArray = f(xArray)
x = np.linspace(a,b,101)
cs = CubicSpline(xArray, yArray, extrapolate=True) # fit a cubic spline
y = cs(x) # evaluate the spline at the points in x
plt.plot(x, y, label="Interpolation, " + str(n) + " points")
plt.show()
The above code works.
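To see what the extrapolate flag actually changes, here is a minimal sketch that fits the same five knots twice and evaluates both splines inside and outside the knot range. The names x_knots, cs_extrap and cs_no_extrap are only illustrative and are not from the question.

```python
import numpy as np
from scipy.interpolate import CubicSpline

def f(x):
    return 1 / (1 + x**2)

# Five knots on [-1, 1], as in the question.
x_knots = np.linspace(-1, 1, 5)
y_knots = f(x_knots)

# Same data, two extrapolation settings (passed by keyword).
cs_extrap = CubicSpline(x_knots, y_knots, extrapolate=True)
cs_no_extrap = CubicSpline(x_knots, y_knots, extrapolate=False)

x_inside = 0.3    # within [-1, 1]
x_outside = 1.5   # beyond the last knot

# Inside the knot range both splines agree.
print(cs_extrap(x_inside), cs_no_extrap(x_inside))
# Outside it, one extrapolates and the other returns nan.
print(cs_extrap(x_outside), cs_no_extrap(x_outside))
```

Inside the knot range the flag makes no difference; it only decides what happens for points outside [-1, 1].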

How to plot ROC curve in pyspark for GBTClassifier?

I am trying to plot the ROC curve for a gradient boosting model. I have come across the post "pyspark extract ROC curve?", but it doesn't seem to work for the GBTClassifier model.
I am using a dataset in databricks and below is my code. It gives the following error
AttributeError: 'PipelineModel' object has no attribute 'summary'
%fs ls databricks-datasets/adult/adult.data
from pyspark.sql.functions import *
from pyspark.ml.classification import RandomForestClassifier, GBTClassifier
from pyspark.ml.feature import StringIndexer, OneHotEncoderEstimator, VectorAssembler, VectorSlicer
from pyspark.ml import Pipeline
from pyspark.ml.evaluation import BinaryClassificationEvaluator,MulticlassClassificationEvaluator
from pyspark.ml.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, TrainValidationSplit
import pandas as pd
dataset = spark.table("adult")
# splitting the train and test data frames
splits = dataset.randomSplit([0.7, 0.3])
train_df = splits[0]
test_df = splits[1]
def predictions(train_df,
                target_col,
                ):
    """
    #Function attributes
    train_df - training df
    target_col - target variable in the model
    """
    # one hot encoding and assembling
    encoding_var = [i[0] for i in train_df.dtypes if (i[1]=='string') & (i[0]!=target_col)]
    num_var = [i[0] for i in train_df.dtypes if ((i[1]=='int') | (i[1]=='double')) & (i[0]!=target_col)]
    string_indexes = [StringIndexer(inputCol = c, outputCol = 'IDX_' + c, handleInvalid = 'keep') for c in encoding_var]
    onehot_indexes = [OneHotEncoderEstimator(inputCols = ['IDX_' + c], outputCols = ['OHE_' + c]) for c in encoding_var]
    label_indexes = StringIndexer(inputCol = target_col, outputCol = 'label', handleInvalid = 'keep')
    assembler = VectorAssembler(inputCols = num_var + ['OHE_' + c for c in encoding_var], outputCol = "features")
    gbt = GBTClassifier(featuresCol = 'features', labelCol = 'label',
                        maxDepth = 5,
                        maxBins = 45,
                        maxIter = 20)
    pipe = Pipeline(stages = string_indexes + onehot_indexes + [assembler, label_indexes, gbt])
    model = pipe.fit(train_df)
    return model
gbt_model = predictions(train_df = train_df,
                        target_col = 'income')
import matplotlib.pyplot as plt
plt.figure(figsize=(5,5))
plt.plot([0, 1], [0, 1], 'r--')
plt.plot(gbt_model.summary.roc.select('FPR').collect(),
gbt_model.summary.roc.select('TPR').collect())
plt.xlabel('FPR')
plt.ylabel('TPR')
plt.show()

Based on your error, have a look at PipelineModel in this doc:
https://spark.apache.org/docs/2.4.3/api/python/pyspark.ml.html#pyspark.ml.PipelineModel
There is no summary attribute on an object of this class. Instead, I believe you need to access the stages of the PipelineModel individually, such as gbt_model.stages[-1], which should give you the last stage (the fitted GBTClassifier model). Then try the attributes there, such as:
gbt_model.stages[-1].summary
If your GBTClassifier model has a summary, you'll find it there. Hope this helps.
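If the fitted GBT stage turns out to expose no summary either, the ROC curve can still be built from scored predictions. Below is a minimal sketch in plain NumPy, assuming you have already collected a positive-class score and a 0/1 label per row to the driver (for example from the output of gbt_model.transform(test_df); which columns hold them is an assumption about your pipeline). The helper roc_points is hypothetical and the four sample rows are made-up toy values, not results from the adult dataset.

```python
import numpy as np

def roc_points(scores, labels):
    """Return (fpr, tpr) arrays for 0/1 labels and real-valued scores.

    Each row is treated as its own threshold, so tied scores produce a
    staircase. Both classes must be present in labels.
    """
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=int)
    order = np.argsort(-scores, kind="stable")   # highest score first
    labels = labels[order]
    tps = np.cumsum(labels)        # true positives after each row
    fps = np.cumsum(1 - labels)    # false positives after each row
    tpr = np.concatenate([[0.0], tps / tps[-1]])
    fpr = np.concatenate([[0.0], fps / fps[-1]])
    return fpr, tpr

# Toy values for illustration only.
scores = [0.9, 0.8, 0.7, 0.6]
labels = [1, 0, 1, 0]
fpr, tpr = roc_points(scores, labels)

# Area under the curve by the trapezoidal rule.
auc = float(np.sum(np.diff(fpr) * (tpr[1:] + tpr[:-1]) / 2))
print(fpr, tpr, auc)
# plt.plot(fpr, tpr) would then draw the curve as in the question.
```

For a large test set it is likely better to downsample or bin the scores before collecting them, since this sketch keeps one point per row.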

Add a vector to every column of a matrix, using Scala Breeze

I have a matrix M of size (L x N) and I want to add the same vector v of length L to every column of the matrix. Is there a way to do this using Scala Breeze?
I tried:
val H = DenseMatrix.zeros(L,N)
for (j <- 0 to L) {
  H (::,j) = M(::,j) + v
}
but this doesn't really fit Scala's immutability, as H is already defined, and it therefore gives a "reassignment to val" error. Any suggestions appreciated!

To add a vector to all columns of a matrix, you don't need to loop through the columns; you can use Breeze's column broadcasting. For your example:
M(::, *) + v // assumes v is a Breeze DenseVector
A complete example, with a zero matrix H standing in for M:
import breeze.linalg._
val L = 3
val N = 2
val v = DenseVector(1.0,2.0,3.0)
val H = DenseMatrix.zeros[Double](L, N)
val result = H(::,*) + v
// result: breeze.linalg.DenseMatrix[Double] =
// 1.0  1.0
// 2.0  2.0
// 3.0  3.0

Variable Dependence scipy.special.genlaguerre

I'm new to Python. If I wanted L(n, a, x), where L is the generalized Laguerre polynomial, I could simply use
from scipy.special import genlaguerre
print(genlaguerre(n, a))
However, I am having trouble obtaining something like L(n, a, 2 pi x) since there is no explicit variable dependence in the function genlaguerre.

The object returned by genlaguerre(n, a) is callable; you call it to evaluate it at a given x.
For example,
In [71]: import numpy as np
In [72]: import matplotlib.pyplot as plt
In [73]: from scipy.special import genlaguerre
In [74]: n = 3
In [75]: alpha = 4.5
In [76]: L = genlaguerre(n, alpha)
To get the value of the polynomial at x, call L(x):
In [77]: L(0)
Out[77]: 44.6875
In [78]: L(1)
Out[78]: 23.895833333333332
In [79]: L([2, 2.5, 3])
Out[79]: array([ 9.60416667, 4.58333333, 0.8125 ])
In [80]: x = np.linspace(0, 14, 100)
In [81]: plt.plot(x, L(x))
Out[81]: [<matplotlib.lines.Line2D at 0x11cde42b0>]
In [82]: plt.xlabel('x')
Out[82]: <matplotlib.text.Text at 0x11cddc4a8>
In [83]: plt.ylabel('$L_{%d}^{(%g)}(x)$' % (n, alpha))
Out[83]: <matplotlib.text.Text at 0x11cdce320>
In [84]: plt.grid()
[Figure: plot of $L_{3}^{(4.5)}(x)$ for x from 0 to 14, generated by the code above.]
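For the L(n, a, 2 pi x) case from the question, the same idea applies: scale the argument before calling the polynomial object. A minimal sketch follows; the wrapper name L_scaled is only illustrative, and scipy.special.eval_genlaguerre is used as a cross-check that evaluates the polynomial directly from (n, alpha, x).

```python
import numpy as np
from scipy.special import genlaguerre, eval_genlaguerre

n = 3
alpha = 4.5
L = genlaguerre(n, alpha)   # callable polynomial object

def L_scaled(x):
    """Evaluate L_n^(alpha)(2*pi*x) by scaling the argument before the call."""
    return L(2 * np.pi * np.asarray(x, dtype=float))

x = np.linspace(0, 1, 5)
values = L_scaled(x)

# Same values computed without building the polynomial object first.
values_direct = eval_genlaguerre(n, alpha, 2 * np.pi * x)
print(values)
print(values_direct)
```

If you only ever need numeric values, calling eval_genlaguerre directly avoids constructing the polynomial object at all.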

In Scipy LeastSq - How to add the penalty term

If the objective function includes a penalty term (the formula was posted as an image), how do I code it in Python?
I've already coded the normal one:
import numpy as np
import scipy as sp
from scipy.optimize import leastsq
import pylab as pl
m = 9 # number of polynomial coefficients (degree m - 1)
def real_func(x):
    return np.sin(2*np.pi*x) # sin(2 pi x)
def fake_func(p, x):
    f = np.poly1d(p) # polynomial
    return f(x)
def residuals(p, y, x):
    return y - fake_func(p, x)
# choose 9 evenly spaced points as x
x = np.linspace(0, 1, 9)
x_show = np.linspace(0, 1, 1000)
y0 = real_func(x)
# add normally distributed noise
y1 = [np.random.normal(0, 0.1) + y for y in y0]
p0 = np.random.randn(m)
plsq = leastsq(residuals, p0, args=(y1, x))
print('Fitting Parameters:', plsq[0])
pl.plot(x_show, real_func(x_show), label='real')
pl.plot(x_show, fake_func(plsq[0], x_show), label='fitted curve')
pl.plot(x, y1, 'bo', label='with noise')
pl.legend()
pl.show()

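Since the penalization term is also just quadratic, you can stack it together with the squared errors and use weight 1 for the data rows and lambda for the penalization rows. For a penalty of the form lambda * sum(p**2), that means appending sqrt(lambda) * p to the residual vector, because leastsq squares every entry it is given.
scipy.optimize.curve_fit does weighted least squares, if you don't want to code it yourself.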
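The stacking idea can be sketched as follows. This assumes the penalty is lam * sum(p**2) (the original formula is not shown, so that form is an assumption), and the names fit_poly_penalized, lam and n_coef are hypothetical. A fixed random seed and a cubic polynomial are used only to keep the example small and reproducible.

```python
import numpy as np
from scipy.optimize import leastsq

def fit_poly_penalized(x, y, n_coef, lam):
    """Fit polynomial coefficients p minimizing
    sum((y - poly(p)(x))**2) + lam * sum(p**2)."""
    def residuals(p):
        data_res = y - np.poly1d(p)(x)    # ordinary residuals, weight 1
        penalty_res = np.sqrt(lam) * p    # squares to lam * p**2
        return np.concatenate([data_res, penalty_res])
    p0 = np.zeros(n_coef)                 # simple starting point
    p_hat, _ = leastsq(residuals, p0)
    return p_hat

rng = np.random.default_rng(0)
x = np.linspace(0, 1, 9)
y = np.sin(2 * np.pi * x) + rng.normal(0, 0.1, size=x.size)

lam = 0.1
p_plain = fit_poly_penalized(x, y, n_coef=4, lam=0.0)   # no penalty
p_pen = fit_poly_penalized(x, y, n_coef=4, lam=lam)     # with penalty

print('unpenalized:', p_plain)
print('penalized:  ', p_pen)
```

The penalized coefficients come out smaller in norm than the unpenalized ones, which is the shrinkage the penalty is meant to produce.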
