getting 'StructField' object has no attribute '_get_object_id' on BinaryClassificationMetrics - pyspark

I was trying to get the binary classification report on PySpark and I ran into this error:
'StructField' object has no attribute '_get_object_id'
Here is my code:
%%spark
from pyspark.mllib.evaluation import BinaryClassificationMetrics
predictionAndLabels = test_pred.rdd.map(lambda row: (float(row['label']), row['prediction']))
metrics = BinaryClassificationMetrics(predictionAndLabels)
Also, based on the documentation (a link!), it apparently does not support F1 measure, recall, etc. Any idea why, or how we can extract them without low-level coding?

I don't think you have to go that deep. Taking the example data from the documentation you linked, and assuming your threshold is a p = 0.5 cutoff, you can just do something like:
# f1 = 2 * precision * recall / (precision + recall)
# precision = tp / (tp + fp)
# recall = tp / (tp + fn)
from pyspark.sql.functions import col
scoreAndLabels = sc.parallelize([(0.1, 0.0), (0.1, 1.0), (0.4, 0.0), (0.6, 0.0), (0.6, 1.0), (0.6, 1.0), (0.8, 1.0)], 2)
df = scoreAndLabels.toDF()
threshold = 0.5
tp = df.where((col('_1') >= threshold) & (col('_2') == 1.0)).count()
fp = df.where((col('_1') >= threshold) & (col('_2') == 0.0)).count()
fn = df.where((col('_1') < threshold) & (col('_2') == 1.0)).count()
precision = tp / (tp+fp)
recall = tp / (tp+fn)
f1 = 2 * (precision * recall) / (precision + recall)
returns f1 = 0.75.
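If you would rather not hand-roll the counts at all, MulticlassMetrics (unlike BinaryClassificationMetrics) does expose precision, recall, and F-measure per label. A minimal sketch, assuming your test_pred predictions are already hard 0.0/1.0 labels:
from pyspark.mllib.evaluation import MulticlassMetrics
# RDD of (prediction, label) pairs, both as floats
predictionAndLabels = test_pred.rdd.map(lambda row: (float(row['prediction']), float(row['label'])))
metrics = MulticlassMetrics(predictionAndLabels)
print(metrics.precision(1.0))   # precision for the positive class
print(metrics.recall(1.0))      # recall for the positive class
print(metrics.fMeasure(1.0))    # F1 for the positive class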

Related

In IPython, a function named display gives me an error

# Kepler's Laws.py
# plots the orbit of a planet in an eccentric orbit to illustrate
# the sweeping out of equal areas in equal times, with sun at focus
# The eccentricity of the orbit is random and determined by the initial velocity
# program uses normalised units (G = 1)
# program by Peter Borcherds, University of Birmingham, England
from vpython import *
from random import random
from IPython import display
import pandas as pd

def MonthStep(time, offset=20, whole=1):  # mark the end of each "month"
    global ccolor  # have to make it global, since label uses it before it is updated
    if whole:
        Ltext = str(int(time * 2 + dt))  # end of 'month'; printing twice the time gives about 12 'months' in a 'year'
    else:
        Ltext = duration + str(time * 2) + ' "months"\n Initial speed: ' + str(round(speed, 3))
        ccolor = color.white
    label(pos=planet.pos, text=Ltext, color=ccolor,
          xoffset=offset * planet.pos.x, yoffset=offset * planet.pos.y)
    ccolor = (0.5 * (1 + random()), random(), random())  # randomise colour of radial vector
    return ccolor

scene = display(title="Kepler's law of equal areas", width=1000, height=1000, range=3.2)
duration = 'Period: '
sun = sphere(color=color.yellow, radius=0.1)  # motion of sun is ignored (or centre of mass coordinates)
scale = 1.0
poss = vector(0, scale, 0)
planet = sphere(pos=poss, color=color.cyan, radius=0.02)

while 1:
    velocity = -vector(0.7 + 0.5 * random(), 0, 0)  # gives a satisfactory range of eccentricities
    ##velocity = -vector(0.984, 0, 0)  # gives period of 12.0 "months"
    speed = mag(velocity)
    steps = 20
    dt = 0.5 / float(steps)
    step = 0
    time = 0
    ccolor = color.white
    oldpos = vector(planet.pos)
    ccolor = MonthStep(time)
    curve(pos=[sun.pos, planet.pos], color=ccolor)
    while not (oldpos.x > 0 and planet.pos.x < 0):
        rate(steps * 2)  # keep rate down so that development of orbit can be followed
        time += dt
        oldpos = vector(planet.pos)  # construction vector(planet.pos) makes oldpos a variable in its own right
        # oldpos = planet.pos would make "oldpos" point to "planet.pos"
        # oldpos = planet.pos[:] does not work, because vector does not permit slicing
        denom = mag(planet.pos) ** 3
        velocity -= planet.pos * dt / denom  # inverse square law; force points toward sun
        planet.pos += velocity * dt
        # plot orbit
        curve(pos=[oldpos, planet.pos], color=color.red)
        step += 1
        if step == steps:
            step = 0
            ccolor = MonthStep(time)
            curve(pos=[sun.pos, planet.pos], color=color.white)
        else:
            # plot radius vector
            curve(pos=[sun.pos, planet.pos], color=ccolor)
        if scene.kb.keys:
            print("key pressed")
            duration = 'Duration: '
            break
    MonthStep(time, 50, 0)
    label(pos=(2.5, -2.5, 0), text='Click for another orbit')
    scene.mouse.getclick()
    for obj in scene.objects:
        if obj is sun or obj is planet:
            continue
        obj.visible = 0  # clear the screen to do it again
I copied this Kepler's Laws code from Google and ran it in PyCharm, but I get this error:
scene = display(title="Kepler's law of equal areas", width=1000, height=1000, range=3.2)
TypeError: 'module' object is not callable
I found some information on Google suggesting that the "pandas" library could fix this error, so I tried it, but it did not help. What should I do?
Replace "display" with "canvas", which is the correct name of this entity.

Survival probabilities in AFT survival model Pyspark

I would like to know how to calculate survival probabilities in PySpark with the AFTSurvivalRegression method. I have seen this example on the web:
from pyspark.ml.regression import AFTSurvivalRegression
from pyspark.ml.linalg import Vectors
training = spark.createDataFrame([
    (1.218, 1.0, Vectors.dense(1.560, -0.605)),
    (2.949, 0.0, Vectors.dense(0.346, 2.158)),
    (3.627, 0.0, Vectors.dense(1.380, 0.231)),
    (0.273, 1.0, Vectors.dense(0.520, 1.151)),
    (4.199, 0.0, Vectors.dense(0.795, -0.226))], ["label", "censor", "features"])
quantileProbabilities = [0.3, 0.6]
aft = AFTSurvivalRegression(quantileProbabilities=quantileProbabilities,
                            quantilesCol="quantiles")
model = aft.fit(training)
# Print the coefficients, intercept and scale parameter for AFT survival regression
print("Coefficients: " + str(model.coefficients))
print("Intercept: " + str(model.intercept))
print("Scale: " + str(model.scale))
model.transform(training).show(truncate=False)
But with this I can only predict survival times. I can also get quantiles, but I do not know exactly how they work. My question is: how can I get the probability that a person will survive beyond a specific time?
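There is no built-in "survival probability at time t" in the API, but Spark's AFT implementation assumes a Weibull distribution of the survival time: log(T) = x·β + intercept + scale·ε, with ε following a standard extreme-value distribution. Under that assumption the survival function is S(t | x) = exp(-exp((ln t - x·β - intercept) / scale)), which you can evaluate yourself. A minimal sketch; the helper function below is my own, not part of Spark:
import numpy as np

def survival_probability(model, features, t):
    # S(t | x) = exp(-exp((ln t - (x . beta + intercept)) / scale)) for a Weibull AFT model
    mu = float(np.dot(np.asarray(features), model.coefficients.toArray())) + model.intercept
    return float(np.exp(-np.exp((np.log(t) - mu) / model.scale)))

# e.g. probability that a subject with the first row's features survives past t = 2.0
print(survival_probability(model, [1.560, -0.605], 2.0))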

How to create Bezier curves from B-Splines in Sympy?

I need to draw a smooth curve through some points, which I then want to show as an SVG path. So I create a B-Spline with scipy.interpolate, and can access some arrays that I suppose fully define it. Does anyone know a reasonably simple way to create Bezier curves from these arrays?
import numpy as np
from scipy import interpolate
x = np.array([-1, 0, 2])
y = np.array([ 0, 2, 0])
x = np.r_[x, x[0]]
y = np.r_[y, y[0]]
tck, u = interpolate.splprep([x, y], s=0, per=True)
cx = tck[1][0]
cy = tck[1][1]
print( 'knots: ', list(tck[0]) )
print( 'coefficients x: ', list(cx) )
print( 'coefficients y: ', list(cy) )
print( 'degree: ', tck[2] )
print( 'parameter: ', list(u) )
The red points are the 3 initial points in x and y. The green points are the 6 coefficients in cx and cy. (Their values repeat after the 3rd, so each green point has two green index numbers.)
Return values tck and u are described in the scipy.interpolate.splprep documentation:
knots: [-1.0, -0.722, -0.372, 0.0, 0.277, 0.627, 1.0, 1.277, 1.627, 2.0]
# 0 1 2 3 4 5
coefficients x: [ 3.719, -2.137, -0.053, 3.719, -2.137, -0.053]
coefficients y: [-0.752, -0.930, 3.336, -0.752, -0.930, 3.336]
degree: 3
parameter: [0.0, 0.277, 0.627, 1.0]
Not sure starting with a B-Spline makes sense here: form a Catmull-Rom curve through the points (with the virtual "before first" and "after last" points overlaid on real points) and then convert that to a Bezier curve using a relatively trivial transform. E.g. given your points p0, p1, and p2, the Catmull-Rom segment {p2, p0, p1, p2} yields the curve p0--p1, {p0, p1, p2, p0} yields p1--p2, and {p1, p2, p0, p1} yields p2--p0. Then you trivially convert those, and you have your SVG path.
As demonstrator, hit up https://editor.p5js.org/ and paste in the following code:
var points = [{x:150, y:100}, {x:50, y:300}, {x:300, y:300}];
// add virtual points:
points = points.concat(points);

function setup() {
  createCanvas(400, 400);
  tension = createSlider(1, 200, 100);
}

function draw() {
  background(220);
  points.forEach(p => ellipse(p.x, p.y, 4));
  for (let n = 0; n < 3; n++) {
    let [c1, c2, c3, c4] = points.slice(n, n + 4);
    let t = 0.06 * tension.value();
    bezier(
      // on-curve start point
      c2.x, c2.y,
      // control point 1
      c2.x + (c3.x - c1.x) / t,
      c2.y + (c3.y - c1.y) / t,
      // control point 2
      c3.x - (c4.x - c2.x) / t,
      c3.y - (c4.y - c2.y) / t,
      // on-curve end point
      c3.x, c3.y
    );
  }
}
Which will draw a closed loop through the three points, with the slider controlling the tension.
Converting that to Python code should be an almost effortless exercise: there is barely any code for us to write =)
And, of course, now you're left with creating the SVG path, but that's hardly an issue: you know all the Bezier points now, so just start building your <path d=...> string while you iterate.
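For illustration, a minimal Python sketch of that conversion, which builds the d attribute of an SVG <path> directly (the function name and the fixed tension t = 6 are my own choices, matching the slider's default above):
def catmull_rom_to_svg_path(points, t=6.0):
    pts = points + points  # add virtual points, as in the p5.js sketch
    d = "M {} {} ".format(pts[1][0], pts[1][1])
    for n in range(len(points)):
        c1, c2, c3, c4 = pts[n:n + 4]
        # same catmull-rom -> bezier control point transform as in draw() above
        d += "C {} {} {} {} {} {} ".format(
            c2[0] + (c3[0] - c1[0]) / t, c2[1] + (c3[1] - c1[1]) / t,
            c3[0] - (c4[0] - c2[0]) / t, c3[1] - (c4[1] - c2[1]) / t,
            c3[0], c3[1])
    return d + "Z"

print(catmull_rom_to_svg_path([(150, 100), (50, 300), (300, 300)]))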
A B-spline curve is just a collection of Bezier curves joined together, so it is certainly possible to convert it back to multiple Bezier curves without any loss of shape fidelity. The algorithm involved is called "knot insertion", and there are different ways to do it, the two most famous algorithms being Boehm's algorithm and the Oslo algorithm. You can refer to this link for more details.
Here is an almost direct answer to your question (but for the non-periodic case):
import aggdraw
import numpy as np
import scipy.interpolate as si
from PIL import Image
# from https://stackoverflow.com/a/35007804/2849934
def scipy_bspline(cv, degree=3):
    """ cv: Array of control vertices
        degree: Curve degree
    """
    count = cv.shape[0]
    degree = np.clip(degree, 1, count - 1)
    kv = np.clip(np.arange(count + degree + 1) - degree, 0, count - degree)
    max_param = count - degree  # non-periodic case
    spline = si.BSpline(kv, cv, degree)
    return spline, max_param

# based on https://math.stackexchange.com/a/421572/396192
def bspline_to_bezier(cv):
    cv_len = cv.shape[0]
    assert cv_len >= 4, "Provide at least 4 control vertices"
    spline, max_param = scipy_bspline(cv, degree=3)
    for i in range(1, max_param):
        spline = si.insert(i, spline, 2)
    return spline.c[:3 * max_param + 1]

def draw_bezier(d, bezier):
    path = aggdraw.Path()
    path.moveto(*bezier[0])
    for i in range(1, len(bezier) - 1, 3):
        v1, v2, v = bezier[i:i + 3]
        path.curveto(*v1, *v2, *v)
    d.path(path, aggdraw.Pen("black", 2))

cv = np.array([[ 40., 148.], [ 40.,  48.],
               [244.,  24.], [160., 120.],
               [240., 144.], [210., 260.],
               [110., 250.]])
im = Image.fromarray(np.ones((400, 400, 3), dtype=np.uint8) * 255)
bezier = bspline_to_bezier(cv)
d = aggdraw.Draw(im)
draw_bezier(d, bezier)
d.flush()
# show/save im
I didn't look much into the periodic case, but hopefully it's not too difficult.

Spark mllib.stat.Statistics - kolmogorovSmirnovTest CDF

I am looking through the Spark example HypothesisTestingKolmogorovSmirnovTestExample.scala and can't seem to figure out the CDF aspect.
Their example:
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.mllib.stat.Statistics
import org.apache.spark.rdd.RDD
val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25)) // an RDD of sample data
val myCDF = Map(0.1 -> 0.2, 0.15 -> 0.6, 0.2 -> 0.05, 0.3 -> 0.05, 0.25 -> 0.1)
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
println(testResult2)
This returns:
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
This makes sense. What doesn't make sense is when I try to have it not reject the null:
val data: RDD[Double] = sc.parallelize(Seq(0.1, 0.15, 0.2, 0.3, 0.25)) // an RDD of sample data
val myCDF = Map(0.1 -> 0.1, 0.15 -> 0.15, 0.2 -> 0.2, 0.3 -> 0.3, 0.25 -> 0.25) //CDF matching the data distribution
val testResult2 = Statistics.kolmogorovSmirnovTest(data, myCDF)
println(testResult2)
This ALSO returns:
Very strong presumption against null hypothesis: Sample follows theoretical distribution.
What gives? The CDF and the data are the exact same distribution, are they not? Why would the Null be rejected? What am I assuming/doing wrong?
When can you use the KS test?
The KS test is a goodness-of-fit test, executed after fitting a distribution to the data.
The test tells you whether the identified distribution for the data is plausible, and we validate this with the p-value.
If the p-value is > 0.05, the distribution you chose for the data is acceptable; if the p-value is < 0.05, you need to fit the data with a different distribution.
Rejecting the null means the p-value is < 0.05: the data does not fit the given distribution.
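As for why the second example is still rejected: the Map you pass is used as the theoretical CDF evaluated at each sample value (a Scala Map[Double, Double] is itself a Double => Double function). Mapping each value to itself therefore amounts to testing against the CDF of a Uniform(0, 1) distribution, and five samples all at or below 0.3 are a poor fit for that. The empirical CDF of five points jumps by 0.2 at each sorted value, so a Map consistent with the data would look more like Map(0.1 -> 0.2, 0.15 -> 0.4, 0.2 -> 0.6, 0.25 -> 0.8, 0.3 -> 1.0), which should no longer be rejected.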

How to add JulianDate objects or offsets in Skyfield

The JulianDate object in Skyfield is a handy way to quickly produce and hold a set of time values in Julian Days, and pass them to Skyfield's at() method to calculate astronomical positions in various coordinates. (see an example script)
However, I can't seem to find an add or offset method so that I can add a time offset or an iterable of offsets to a JulianDate object. I always seem to struggle with dates and times.
Here is a very simple, abstracted example. I generate jd60 which is offset from an arbitrary jd0 by 60 days. As a simple check I calculate the position of the earth at the two times and make sure it moves by about 60 degrees.
from skyfield.api import load, JulianDate
import numpy as np
data = load('de421.bsp')
earth = data['earth']
Start with an arbitrary t_zero:
jd0 = JulianDate(utc=(2016, 1, 17.4329, 22.8, 4, 39.3))  # (year, month, day, hour, minute, second)
Now, make a second JulianDate object offset by 60 days
This works:
tim = list(jd0.tt_tuple())
tim[2] += 60 # add 60 days (~1/6 of a year)
jd60 = JulianDate(utc=tuple(tim))
But, what I would like is something like:
jd60 = jd0.add(delta_utc=(0, 0, 60, 0, 0, 0))  # fictitious method
Now calculate the positions and find the approximate angle, just to see that it worked.
p0 = earth.at(jd0).position.km
p60 = earth.at(jd60).position.km
dot = (p0*p60).sum()
cos_theta = dot / np.sqrt( (p0**2).sum() * (p60**2).sum() )
print((180. / np.pi) * np.arccos(cos_theta))
print("should be roughly 60 degrees")
gives
60.6215331601
should be roughly 60 degrees
Reference here; my solution is as follows:
from skyfield.api import load
import numpy as np
data = load('de421.bsp')
earth = data['earth']
ts = load.timescale()
t = ts.utc(2016, 1, np.linspace(17.4329, 77.4329, 61), 22.8, 4, 39.3)
p = earth.at(t)
p0 = p.position.au[:,0]
p60 = p.position.au[:,60]
dot = (p0*p60).sum()
cos_theta = dot / np.sqrt( (p0**2).sum() * (p60**2).sum() )
print((180./np.pi) * np.arccos(cos_theta))
print("should be roughly 60 degrees")