pyspark - error in haversine formula

I am trying to implement a haversine_distance calculator in PySpark.
I am re-using Python code that I used before for the same purpose, so this is what I did:
1. Implemented a function for haversine_distance as a UDF
2. In my dataframe, used it to compute the distance between two lat/long points
3. When I ran a check on the values, they seemed correct
4. But when I tried to put a where clause on dist_km I got an error:
File "<stdin>", line 13, in haversine_distance
TypeError: a float is required
Sample Values in the dataframe upd_join_final66_df:
+------------+------------+--------+---------+---------+
|delv_lat_upd|delv_lng_upd|LATITUDE|LONGITUDE| dist_km|
+------------+------------+--------+---------+---------+
| 33.7871| -84.382| 33.7602| -84.5398|14.893161|
| 33.7871| -84.382| 33.7602| -84.5398|14.893161|
| 33.7874| -84.3822| 33.7602| -84.5398|14.881769|
| 33.7874| -84.3822| 33.7602| -84.5398|14.881769|
| 33.7874| -84.3822| 33.7602| -84.5398|14.881769|
| 33.7874| -84.3822| 33.7602| -84.5398|14.881769|
| 33.7874| -84.3822| 33.7602| -84.5398|14.881769|
| 33.7872| -84.3822| 33.7602| -84.5398| 14.87728|
| 33.7872| -84.3822| 33.7602| -84.5398| 14.87728|
| 33.7872| -84.3822| 33.7602| -84.5398| 14.87728|
| 33.7872| -84.3822| 33.7602| -84.5398| 14.87728|
| 33.7872| -84.3822| 33.7602| -84.5398| 14.87728|
| 33.7871| -84.3821| 33.7602| -84.5398|14.884104|
| 33.7871| -84.3821| 33.7602| -84.5398|14.884104|
| 33.7869| -84.382| 33.7602| -84.5398|14.888724|
| 33.7871| -84.3825| 33.7602| -84.5398|14.847878|
| 33.7869| -84.3818| 33.7602| -84.5398|14.906844|
| 33.7869| -84.3818| 33.7602| -84.5398|14.906844|
| 33.7869| -84.3818| 33.7602| -84.5398|14.906844|
| 33.7869| -84.3818| 33.7602| -84.5398|14.906844|
+------------+------------+--------+---------+---------+
Code:
from math import radians, cos, sin, asin, sqrt, atan2, pi
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def haversine_distance(lat1, lon1, lat2, lon2):
    """
    Calculate the great circle distance between two points
    on the earth (specified in decimal degrees)
    """
    deg2rad = pi/180.0
    # convert decimal degrees to radians
    lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
    # haversine formula
    dlon = lon2 - lon1
    dlat = lat2 - lat1
    a = sin(dlat/2.0)**2.0 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2.0
    c = 2.0 * asin(sqrt(a))
    #c = 2.0 * atan2(sqrt(a), sqrt(1.0-a))
    r = 6372.8  # radius of earth in kilometers; use 3959.87433 for miles
    return c * r

haversine_distance_udf = udf(haversine_distance, FloatType())

upd_join_final66_df = upd_join_final66_df.withColumn(
    'dist_km',
    haversine_distance_udf(upd_join_final66_df['LATITUDE'], upd_join_final66_df['LONGITUDE'],
                           upd_join_final66_df['delv_lat_upd'], upd_join_final66_df['delv_lng_upd'])
)
upd_join_final66_df.registerTempTable("fac66")
upd_join_final66_df.registerTempTable("fac66")
When I run the command below to spot-check, there is no error:
spark.sql("select delv_lat_upd, delv_lng_upd, LATITUDE, LONGITUDE, dist_km \
from fac66 \
").show()
When I try to query with a condition on dist_km, I get the error:
An error occurred while calling o4882.showString.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 1041.0 failed 4 times, most recent failure: Lost task 0.3 in stage 1041.0 (TID 81684, 10.0.0.15, executor 2): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 171, in main
process()
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 166, in process
serializer.dump_stream(func(split_index, iterator), outfile)
File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 220, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 138, in dump_stream
for obj in iterator:
File "/usr/hdp/current/spark2-client/python/pyspark/serializers.py", line 209, in _batched
for item in iterator:
File "<string>", line 1, in <lambda>
File "/usr/hdp/current/spark2-client/python/pyspark/worker.py", line 70, in <lambda>
return lambda *a: f(*a)
File "<stdin>", line 13, in haversine_distance
TypeError: a float is required

The issue was caused by null values in the data, which made the formula fail. Adding a null check to the UDF fixes it:

from math import radians, cos, sin, asin, sqrt, atan2, pi

def haversine_distance(lat1, lon1, lat2, lon2):
    if lon1 is None or lat1 is None or lon2 is None or lat2 is None:
        return None
    else:
        deg2rad = pi/180.0
        # convert decimal degrees to radians
        lon1, lat1, lon2, lat2 = map(radians, [lon1, lat1, lon2, lat2])
        # haversine formula
        dlon = lon2 - lon1
        dlat = lat2 - lat1
        a = sin(dlat/2.0)**2.0 + cos(lat1) * cos(lat2) * sin(dlon/2.0)**2.0
        c = 2.0 * asin(sqrt(a))
        #c = 2.0 * atan2(sqrt(a), sqrt(1.0-a))
        r = 6372.8  # radius of earth in kilometers; use 3959.87433 for miles
        return c * r
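As an alternative that sidesteps the Python UDF (and the null problem) entirely, the same formula can be written with Spark's built-in column functions, which propagate nulls automatically. A minimal sketch, assuming the same column names as in the dataframe above:
from pyspark.sql import functions as F

def haversine_km(lat1, lon1, lat2, lon2):
    # arguments are Spark columns; radians/sin/cos/asin/sqrt are built-in SQL functions
    dlat = F.radians(lat2) - F.radians(lat1)
    dlon = F.radians(lon2) - F.radians(lon1)
    a = F.sin(dlat / 2) ** 2 + F.cos(F.radians(lat1)) * F.cos(F.radians(lat2)) * F.sin(dlon / 2) ** 2
    return 2 * 6372.8 * F.asin(F.sqrt(a))

upd_join_final66_df = upd_join_final66_df.withColumn(
    'dist_km',
    haversine_km(F.col('LATITUDE'), F.col('LONGITUDE'),
                 F.col('delv_lat_upd'), F.col('delv_lng_upd'))
)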

Related

How to create an array in Pyspark with normal distribution with scipy.stats with UDF (or any other way)?

I am currently working on migrating Python scripts to PySpark. I have this Python script that works fine:
### PYTHON
import pandas as pd
import scipy.stats as st

def fnNormalDistribution(mean, std, n):
    box = list(eval('st.norm')(*[mean, std]).rvs(n))
    return box

df = pd.DataFrame([[18.2500365,2.7105814157004193],
                   [9.833353,2.121324586200329],
                   [41.55563866666666,7.118716782527054]],
                  columns = ['mean','std'])
df
| mean | std |
|------------|----------|
| 18.250037| 2.710581|
| 9.833353| 2.121325|
| 41.555639| 7.118717|
n = 100 #Example
df['random_values'] = df.apply(lambda row: fnNormalDistribution(row["mean"], row["std"], n), axis=1)
df
| mean | std | random_values |
|------------|----------|--------------------------------------------------|
| 18.250037| 2.710581|[17.752189993958638, 18.883038367927465, 16.39...]|
| 9.833353| 2.121325|[10.31806454283759, 8.732261487201594, 11.6782...]|
| 41.555639| 7.118717|[38.17469739795093, 43.16514466083524, 49.2668...]|
but when I try to migrate it to PySpark I get the following error:
### PYSPARK
import pyspark.sql.functions as f
import pyspark.sql.types as t

def fnNormalDistribution(mean, std, n):
    box = list(eval('st.norm')(*[mean, std]).rvs(n))
    return box

udf_fnNomalDistribution = f.udf(fnNormalDistribution, t.ArrayType(t.DoubleType()))
columns = ['mean','std']
data = [(18.2500365,2.7105814157004193),
        (9.833353,2.121324586200329),
        (41.55563866666666,7.118716782527054)]
df = spark.createDataFrame(data=data, schema=columns)
df.show()
| mean | std |
|------------|----------|
| 18.250037| 2.710581|
| 9.833353| 2.121325|
| 41.555639| 7.118717|
df = df.withColumn('random_values', udf_fnNomalDistribution('mean','std',f.lit(n)))
df.show()
PythonException:
An exception was thrown from the Python worker. Please see the stack trace below.
Traceback (most recent call last):
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 604, in main
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 596, in process
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\serializers.py", line 211, in dump_stream
self.serializer.dump_stream(self._batched(iterator), stream)
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\serializers.py", line 132, in dump_stream
for obj in iterator:
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\serializers.py", line 200, in _batched
for item in iterator:
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 450, in mapper
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 450, in <genexpr>
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\worker.py", line 85, in <lambda>
File "C:\Spark\spark-3.1.2-bin-hadoop3.2\python\lib\pyspark.zip\pyspark\util.py", line 73, in wrapper
return f(*args, **kwargs)
File "C:\Users\Ubits\AppData\Local\Temp/ipykernel_10604/2493247477.py", line 2, in fnNormalDistribution
File "<string>", line 1, in <module>
NameError: name 'st' is not defined
Is there some way to use the same function in PySpark, or to get the random_values column in another way? I googled it with no luck.
Thanks
I was trying this and it can indeed be fixed by moving the scipy.stats import inside fnNormalDistribution, as samkart suggested.
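A minimal sketch of that direct fix, assuming the same df and n as above (the only real change is importing scipy.stats inside the function, so the UDF no longer depends on a driver-side global):
import pyspark.sql.functions as f
import pyspark.sql.types as t

def fnNormalDistribution(mean, std, n):
    # import inside the function so it is also available on the executors
    import scipy.stats as st
    return st.norm(mean, std).rvs(n).tolist()

udf_fnNormalDistribution = f.udf(fnNormalDistribution, t.ArrayType(t.DoubleType()))
df = df.withColumn('random_values', udf_fnNormalDistribution('mean', 'std', f.lit(n)))
df.show()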
I will just leave my example here as Fugue may provide a more readable way to bring this to Spark, especially around handling schema. Full code below.
import pandas as pd

def fnNormalDistribution(mean, std, n):
    import scipy.stats as st
    box = (eval('st.norm')(*[mean, std]).rvs(n)).tolist()
    return box

df = pd.DataFrame([[18.2500365,2.7105814157004193],
                   [9.833353,2.121324586200329],
                   [41.55563866666666,7.118716782527054]],
                  columns = ['mean','std'])

n = 100 #Example

def helper(df: pd.DataFrame) -> pd.DataFrame:
    df['random_values'] = df.apply(lambda row: fnNormalDistribution(row["mean"], row["std"], n), axis=1)
    return df

from fugue import transform
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# transform can take either a pandas or a Spark DataFrame as input
# If engine is None, it will run on pandas
sdf = transform(df,
                helper,
                schema="*, random_values:[float]",
                engine=spark)
sdf.show()

pytorch forecasting extract feature from hidden layer

I'm following the PyTorch Forecasting tutorial: https://pytorch-forecasting.readthedocs.io/en/latest/tutorials/building.html
I implemented an LSTM using AutoRegressiveBaseModelWithCovariates and initialized the model from my dataset.
from pytorch_forecasting.models.rnn import RecurrentNetwork
...
model = RecurrentNetwork.from_dataset(dataset_with_covariates)
I've been asked to get the output of a hidden layer and visualize it with t-SNE or UMAP (something I've done before with Keras). I'm new to PyTorch, unfortunately. Does anyone know how to do this?
Here's the summary.
| Name | Type | Params
----------------------------------------------------------------------------------
0 | loss | MAE | 0
1 | logging_metrics | ModuleList | 0
2 | logging_metrics.0 | SMAPE | 0
3 | logging_metrics.1 | MAE | 0
4 | logging_metrics.2 | RMSE | 0
5 | logging_metrics.3 | MAPE | 0
6 | logging_metrics.4 | MASE | 0
7 | embeddings | MultiEmbedding | 47
8 | embeddings.embeddings | ModuleDict | 47
9 | embeddings.embeddings.level_0 | Embedding | 12
10 | embeddings.embeddings.supervisorvehiclestatus | Embedding | 35
11 | rnn | LSTM | 2.5 K
12 | output_projector | Linear | 11
----------------------------------------------------------------------------------
2.5 K Trainable params
0 Non-trainable params
2.5 K Total params
0.010 Total estimated model params size (MB)
In an attempt to find the layer name, I did:
for name, layer in model.named_modules():
    print(name, layer)
RecurrentNetwork(
(loss): MAE()
(logging_metrics): ModuleList(
(0): SMAPE()
(1): MAE()
(2): RMSE()
(3): MAPE()
(4): MASE()
)
(embeddings): MultiEmbedding(
(embeddings): ModuleDict(
(group_name): Embedding(4, 3)
(categorical_var ): Embedding(7, 5)
)
)
(rnn): LSTM(28, 10, num_layers=2, batch_first=True, dropout=0.1)
(output_projector): Linear(in_features=10, out_features=1, bias=True)
)
loss MAE()
logging_metrics ModuleList(
(0): SMAPE()
(1): MAE()
(2): RMSE()
(3): MAPE()
(4): MASE()
)
logging_metrics.0 SMAPE()
logging_metrics.1 MAE()
logging_metrics.2 RMSE()
logging_metrics.3 MAPE()
logging_metrics.4 MASE()
embeddings MultiEmbedding(
(embeddings): ModuleDict(
(group_name): Embedding(4, 3)
(categorical_var ): Embedding(7, 5)
)
)
embeddings.embeddings ModuleDict(
(group_name): Embedding(4, 3)
(categorical_var ): Embedding(7, 5)
)
embeddings.embeddings.level_0 Embedding(4, 3)
embeddings.embeddings.categorical_var Embedding(7, 5)
rnn LSTM(28, 10, num_layers=2, batch_first=True, dropout=0.1)
output_projector Linear(in_features=10, out_features=1, bias=True)
I thought I could do something like this to get the activations, but it is not working.
def get_hidden_features(x, layer):
    activation = {}
    def get_activation(name):
        def hook(m, i, o):
            activation[name] = o.detach()
        return hook
    model.register_forward_hook(get_activation(layer))
    _ = model(x)
    return activation[layer]

outhidden = get_hidden_features(x, "rnn")
Returns:
AttributeError: 'Output' object has no attribute 'detach'
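One approach that should work (a sketch, not verified against pytorch-forecasting internals): register the forward hook on the rnn submodule rather than on the whole model, and unpack the LSTM output tuple before detaching. The AttributeError above most likely appears because hooking the top-level model hands the hook pytorch-forecasting's Output named tuple, which has no .detach().
import torch

def get_hidden_features(model, x, layer_name="rnn"):
    activation = {}

    def hook(module, inputs, output):
        # nn.LSTM returns (output, (h_n, c_n)); keep the per-timestep output tensor
        out = output[0] if isinstance(output, tuple) else output
        activation[layer_name] = out.detach()

    # hook the named submodule, not the whole model
    handle = dict(model.named_modules())[layer_name].register_forward_hook(hook)
    with torch.no_grad():
        _ = model(x)
    handle.remove()
    return activation[layer_name]

# hidden = get_hidden_features(model, x, "rnn")  # (batch, seq_len, hidden_size), ready for t-SNE/UMAP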

GPS and MPU 6050 don't work together in one program

I'm working on a project that uses an MPU 6050 and a GPS module together on a Raspberry Pi. If I run the GPS code alone, or the MPU 6050 (accelerometer/gyroscope) code alone, each works fine.
The problem is when I run the code for the GPS and the MPU 6050 together in one program: it only runs the MPU 6050 code and ignores much of the GPS code and output.
I think the problem is related to the interfacing of the GPS and the MPU 6050.
Any help is appreciated. Thanks in advance.
import gyro_acc
#import speed
import sys, time, requests, jsonify
import serial
import pynmea2

# defining the api-endpoint
# API_ENDPOINT = "https://gp-anamoly-api.herokuapp.com/create/anomaly"

# initialize MPU
gyro_acc.MPU_Init()
print(" ::: Start Reading ::: ")

# tracker for the number of potholes/cracks
i = 0

port = "/dev/ttyS0"
ser = serial.Serial(port, baudrate=9600, timeout=0.2)
dataout = pynmea2.NMEAStreamReader()

while True:
    # read data
    Gx, Gy, Gz, Ax, Ay, Az = gyro_acc.mpu_read()
    # print("Gx={:.2f} deg/s | Gy={:.2f} deg/s | Gz={:.2f} deg/s".format(Gx, Gy, Gz) + " || Ax={:.2f} m/s^2 | Ay={:.2f} m/s^2 | Az={:.2f} m/s^2".format(Ax, Ay, Az))
    print("Gx={:.2f} deg/s | Gy={:.2f} deg/s | Gz={:.2f} deg/s".format(Gx, Gy, Gz))
    print("Ax={:.2f} m/s^2 | Ay={:.2f} m/s^2 | Az={:.2f} m/s^2".format(Ax, Ay, Az))
    # get the location and speed from gps data
    newdata = ser.readline()
    if newdata[0:6] == "$GPRMC":
        newmsg = pynmea2.parse(newdata)
        lat = newmsg.latitude
        lng = newmsg.longitude
        gps = "Latitude=" + str(lat) + " and Longitude=" + str(lng)
        print(gps)
    if newdata[0:6] == "$GPVTG":
        newmsg = pynmea2.parse(newdata)
        speed = newmsg.spd_over_grnd_kmph
        speed_output = "Speed= {}KM/H".format(speed)
        print(speed_output)
    # separator
    print("===========================")
    # print data dynamically
    # sys.stdout.write("\rGx={:.2f} deg/s | Gy={:.2f} deg/s | Gz={:.2f} deg/s".format(Gx, Gy, Gz)
    #     + " || Ax={:.2f} m/s^2 | Ay={:.2f} m/s^2 | Az={:.2f} m/s^2".format(Ax, Ay, Az) + "\n"
    #     + "Speed: {:.1f} KM/H".format(vehicle_speed))
    # sys.stdout.flush()
    # print speed dynamically
    # sys.stdout.write("\rSpeed: {:.1f} KM/H".format(vehicle_speed))
    # sys.stdout.flush()
    # check if it detects a pothole/crack and save to database
    if Az > 13.25:
        # print pothole detected
        # sys.stdout.write("\n")
        print("---------------------------")
        print("Pothole/Crack Detected (No.{})".format(i))
        print("---------------------------")
        i += 1
    # sleep for one second
    time.sleep(1)
The library gyro_acc is another module I have in the project.
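One thing that stands out (an observation on the posted code, not a verified fix for the interleaving problem): on Python 3, ser.readline() returns bytes, so newdata[0:6] == "$GPRMC" never matches and the GPS branches are silently skipped, while the one-second sleep lets the serial buffer back up. A sketch of the GPS part of the loop with the line decoded first and the buffer drained each pass:
    # inside the while True loop, replacing the single ser.readline() call
    while ser.in_waiting:
        raw = ser.readline()
        newdata = raw.decode("ascii", errors="replace").strip()
        if newdata.startswith("$GPRMC"):
            newmsg = pynmea2.parse(newdata)
            print("Latitude={} and Longitude={}".format(newmsg.latitude, newmsg.longitude))
        elif newdata.startswith("$GPVTG"):
            newmsg = pynmea2.parse(newdata)
            print("Speed= {}KM/H".format(newmsg.spd_over_grnd_kmph))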

Convert KMeans "centres" output to PySpark dataframe

I'm running a K-means clustering model and I want to analyse the cluster centroids. However, the centers output is a LIST of my 20 centroids, with their coordinates (8 each) as an ARRAY. I need it as a dataframe, with clusters 1:20 as rows and their attribute values (centroid coordinates) as columns, like so:
c1 | 0.85 | 0.03 | 0.01 | 0.00 | 0.12 | 0.01 | 0.00 | 0.12
c2 | 0.25 | 0.80 | 0.10 | 0.00 | 0.12 | 0.01 | 0.00 | 0.77
c3 | 0.05 | 0.10 | 0.00 | 0.82 | 0.00 | 0.00 | 0.22 | 0.00
The dataframe format is important because what I WANT to do is:
For each centroid
Identify the 3 strongest attributes
Create a "name" for each of the 20 centroids that is a concatenation of the 3 most dominant traits in that centroid
For example:
c1 | milk_eggs_cheese
c2 | meat_milk_bread
c3 | toiletries_bread_eggs
This code is running in Zeppelin on EMR 5.19 with Spark 2.4. The model works great, but this is the boilerplate code from the Spark documentation (https://spark.apache.org/docs/latest/ml-clustering.html#k-means), which produces the list-of-arrays output that I can't really use.
centers = model.clusterCenters()
print("Cluster Centers: ")
for center in centers:
    print(center)
This is an excerpt of the output I get.
Cluster Centers:
[0.12391775 0.04282062 0.00368751 0.27282358 0.00533401 0.03389095
0.04220946 0.03213536 0.00895981 0.00990327 0.01007891]
[0.09018751 0.01354349 0.0130329 0.00772877 0.00371508 0.02288211
0.032301 0.37979978 0.002487 0.00617438 0.00610262]
[7.37626746e-02 2.02469798e-03 4.00944473e-04 9.62304581e-04
5.98964859e-03 2.95190585e-03 8.48736175e-01 1.36797882e-03
2.57451073e-04 6.13320072e-04 5.70559278e-04]
Based on How to convert a list of array to Spark dataframe I have tried this:
df = sc.parallelize(centers).toDF(['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
df.show()
But this throws the following error:
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
model.clusterCenters() gives you a list of numpy arrays, not a list of lists like in the answer you have linked. Just convert the numpy arrays to lists before creating the dataframe:
bla = [e.tolist() for e in centers]
df = sc.parallelize(bla).toDF(['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
#or: df = spark.createDataFrame(bla, ['fresh_items', 'wine_liquor', 'baby', 'cigarettes', 'fresh_meat', 'fruit_vegetables', 'bakery', 'toiletries', 'pets', 'coffee', 'cheese'])
df.show()
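To get to the three-strongest-attributes naming described in the question, one option is a sketch using pandas, since the centroid table is only 20 rows (column names are the ones assumed above):
pdf = df.toPandas()
pdf.insert(0, 'cluster', ['c{}'.format(i + 1) for i in range(len(pdf))])
# name each centroid after its three largest coordinates
pdf['name'] = pdf.drop(columns='cluster').apply(
    lambda row: '_'.join(row.nlargest(3).index), axis=1)
print(pdf[['cluster', 'name']])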

Copy a categorical variable with its value labels

Is it possible to copy a labeled categorical variable in a single line, or do I generally have to copy over labels as a separate step?
In the case I'm looking at, egen ... group() comes close, but changes the underlying integers:
sysuse auto
** starts them from different indices
egen mycut = cut(mpg), at(0 20 30 50) label icodes
egen mycut_copy = group(mycut), label
** does weird stuff
egen mycut2 = cut(mpg), at(0 20 30 50) label icodes
replace mycut2 = group(mycut2)
egen mycut_copy2 = group(mycut2), label
** the correct approach?
egen mycut3 = cut(mpg), at(0 20 30 50) label icodes
gen mycut_copy3 = mycut3
label values mycut_copy3 mycut3
You can do what you want very easily using the lesser-known clonevar command:
sysuse auto, clear
egen mycut = cut(mpg), at(0 20 30 50) label icodes
clonevar mycut2 = mycut
list mycut* in 1/10, separator(0)
+----------------+
| mycut mycut2 |
|----------------|
1. | 20- 20- |
2. | 0- 0- |
3. | 20- 20- |
4. | 20- 20- |
5. | 0- 0- |
6. | 0- 0- |
7. | 20- 20- |
8. | 20- 20- |
9. | 0- 0- |
10. | 0- 0- |
+----------------+
Note that group() refers to different functions when used with generate and egen, which is why you do not get the same results.