Calculate centroid of a set of coordinates on a PySpark dataframe - pyspark

I have a dataframe similar to
+----+-----+-------+------+------+------+
| cod| name|sum_vol| date| lat| lon|
+----+-----+-------+------+------+------+
|aggc|23124| 37|201610|-15.42|-32.11|
|aggc|23124| 19|201611|-15.42|-32.11|
| abc| 231| 22|201610|-26.42|-43.11|
| abc| 231| 22|201611|-26.42|-43.11|
| ttx| 231| 10|201610|-22.42|-46.11|
| ttx| 231| 10|201611|-22.42|-46.11|
| tty| 231| 25|201610|-25.42|-42.11|
| tty| 231| 45|201611|-25.42|-42.11|
|xptx| 124| 62|201611|-26.43|-43.21|
|xptx| 124| 260|201610|-26.43|-43.21|
|xptx|23124| 50|201610|-26.43|-43.21|
|xptx|23124| 50|201611|-26.43|-43.21|
+----+-----+-------+------+------+------+
Where, for each name, I have several different lat/lon values in the same dataframe. I would like to use shapely to calculate the centroid for each user:
Point(lat, lon).centroid
This UDF would be able to calculate it:
from shapely.geometry import MultiPoint
from pyspark.sql.functions import udf
from pyspark.sql.types import ArrayType, DoubleType

def f(x):
    return list(MultiPoint(tuple(x.values)).centroid.coords[0])

get_centroid = udf(lambda x: f(x), ArrayType(DoubleType()))
But how can I apply it to a list of coordinates of each user? It seems that a UDAF on a group by is not a viable solution in this case.

You want to execute a 3rd-party, plain Python function, and that function is neither associative nor commutative. The only choice you have is to:
group the records (you can use RDD.groupBy or collect_list),
apply the function,
then flatMap (RDD) or join (DF) the results back (see the sketch below).
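A minimal PySpark sketch of the collect_list variant, assuming an active SparkSession named spark, the example dataframe bound to df, and shapely available on the executors:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, DoubleType
from shapely.geometry import MultiPoint

def centroid_of(points):
    # points is the list of [lat, lon] pairs collected for one (cod, name) group
    return list(MultiPoint([tuple(p) for p in points]).centroid.coords[0])

get_centroid_udf = F.udf(centroid_of, ArrayType(DoubleType()))

centroids = (df
    .groupBy('cod', 'name')
    .agg(F.collect_list(F.array('lat', 'lon')).alias('points'))
    .withColumn('centroid', get_centroid_udf('points')))
centroids.show(truncate=False)

The UDF runs once per group on the collected list, so very large groups are materialized in memory on a single executor.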

Related

Comparing two Identically structured Dataframes in Spark

val originalDF = Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",600,80000),(3,"rishi","ahmedabad",510,65000)).toDF("id","name","city","credit_score","credit_limit")
val changedDF= Seq((1,"gaurav","jaipur",550,70000),(2,"sunil","noida",650,90000),(4,"Joshua","cochin",612,85000)).toDF("id","name","city","creditscore","credit_limit")
So the above two dataframes have the same table structure, and I want to find out the ids for which the values have changed in the other dataframe (changedDF). I tried the except() function in Spark, but it gives me two rows. id is the common column between these two dataframes.
changedDF.except(originalDF).show
+---+------+------+-----------+------------+
| id| name| city|creditscore|credit_limit|
+---+------+------+-----------+------------+
| 4|Joshua|cochin| 612| 85000|
| 2| sunil| noida| 650| 90000|
+---+------+------+-----------+------------+
Whereas I only want the common ids for which there have been any changes, like this:
+---+------+------+-----------+------------+
| id| name| city|creditscore|credit_limit|
+---+------+------+-----------+------------+
| 2| sunil| noida| 650| 90000|
+---+------+------+-----------+------------+
Is there any way to find only the common ids for which the data has changed?
Can anybody suggest an approach I can follow to achieve this?
You can do an inner join of the dataframes; that will restrict the result to the common ids.
originalDF.alias("a").join(changedDF.alias("b"), col("a.id") === col("b.id"), "inner")
.select("a.*")
.except(changedDF)
.show
This yields the rows (with their original values) for the common ids whose data has changed:
+---+-----+-----+------------+------------+
| id| name| city|credit_score|credit_limit|
+---+-----+-----+------------+------------+
| 2|sunil|noida| 600| 80000|
+---+-----+-----+------------+------------+
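A PySpark sketch of the same inner-join-plus-except idea, assuming the two frames are bound to original_df and changed_df, share exactly the same schema, and use id as the key column:

from pyspark.sql import functions as F

changed_ids = (original_df.alias("a")
    .join(changed_df.alias("b"), F.col("a.id") == F.col("b.id"), "inner")
    .select("a.*")            # keep only the ids present in both frames
    .subtract(changed_df))    # drop rows that also appear unchanged in changed_df
changed_ids.show()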

Approx quantile on a array of doubles - Spark dataframe

I have a spark dataframe defined as:
+----------------+--------------------+-----------+
| id | amt_list|ct_tran_amt|
+----------------+--------------------+-----------+
|1 |[2.99, 7.73, 193....| 23|
|2 |[9.99, 9.95, 5.0,...| 17|
|3 |[4.57, 14.06, 0.7...| 19|
How do I calculate approximate quantiles (the 1st and 3rd quartiles) as new columns?
df.stat.approxQuantile("amt",Array(0.25,0.75), 0.001) does not take a wrapped array as input.
I'm not aware of a built-in Spark function to do this, so I would go for a UDF:
def calcPercentile(perc:Double) = udf((xs:Seq[Double]) => xs.sorted.apply(((xs.size-1)*perc).toInt))
df
.withColumn("QT1", calcPercentile(0.25)($"amt_list"))
.withColumn("QT3", calcPercentile(0.75)($"amt_list"))
.show()
EDIT:
There is also an approach without UDF:
df
.withColumn("Q1", sort_array($"amt_list")(((size($"amt_list")-1)*0.25).cast("int")))
.withColumn("Q3", sort_array($"amt_list")(((size($"amt_list")-1)*0.75).cast("int")))
.show()
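For completeness, a PySpark sketch of the same index-into-the-sorted-array approach (nearest-rank, no interpolation), assuming the dataframe above is bound to df:

from pyspark.sql import functions as F

q1_idx = ((F.size('amt_list') - 1) * 0.25).cast('int')
q3_idx = ((F.size('amt_list') - 1) * 0.75).cast('int')

df_q = (df
    .withColumn('Q1', F.sort_array('amt_list')[q1_idx])   # index the sorted array with a computed position
    .withColumn('Q3', F.sort_array('amt_list')[q3_idx]))
df_q.show()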

How to transform DataFrame before joining operation?

The following code is used to extract ranks from the column products. The ranks are the second numbers in each pair [...]. For example, in [[222,66],[333,55]] the ranks are 66 and 55 for the products with PK 222 and 333, respectively.
But the code is very slow in Spark 2.2 when df_products is around 800 MB:
df_products.createOrReplaceTempView("df_products")

val result = df.as("df2")
  .join(spark.sql("SELECT * FROM df_products")
          .select($"product_PK", explode($"products").as("products"))
          .withColumnRenamed("product_PK", "product_PK_temp")
          .as("df1"),
        $"df2.product_PK" === $"df1.product_PK_temp" and $"df2.rec_product_PK" === $"df1.products.product_PK",
        "left")
  .drop($"df1.product_PK_temp")
  .select($"product_PK", $"rec_product_PK", coalesce($"df1.products.col2", lit(0.0)).as("rank_product"))
This is a small sample of df_products and df:
df_products =
+----------+--------------------+
|product_PK| products|
+----------+--------------------+
| 111|[[222,66],[333,55...|
| 222|[[333,24],[444,77...|
...
+----------+--------------------+
df =
+----------+-----------------+
|product_PK| rec_product_PK|
+----------+-----------------+
| 111| 222|
| 222| 888|
+----------+-----------------+
The code above works well when the arrays in each row of products contain a small number of elements. But when there are many elements per row [[..],[..],...], the code seems to get stuck and does not advance.
How can I optimize the code? Any help is really highly appreciated.
Is it possible, for example, to transform df_products into the following DataFrame before joining?
df_products =
+----------+--------------------+------+
|product_PK| rec_product_PK| rank|
+----------+--------------------+------+
| 111| 222| 66|
| 111| 333| 55|
| 222| 333| 24|
| 222| 444| 77|
...
+----------+--------------------+------+
As per my answer here, you can transform df_products using something like this:
import org.apache.spark.sql.functions.explode

val df1 = df_products.withColumn("array_elem", explode(df_products("products")))
val df2 = df1.select("product_PK", "array_elem.*")
This assumes products is an array of structs. If products is an array of arrays, you can use the following instead:
val df2 = df1.withColumn("rank", df1("array_elem").getItem(1))

List to DataFrame in pyspark

Can someone tell me how to convert a list containing strings to a DataFrame in pyspark? I am using Python 3.6 with Spark 2.2.1. I have just started learning the Spark environment and my data looks like below:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
Now, I want to create a DataFrame as follows:
-----------------------------------
| ID | words                      |
-----------------------------------
|  1 | ['apple','ball','ballon']  |
|  2 | ['cat','camel','james']    |
-----------------------------------
I also want to add an ID column, which is not present in the data.
You can convert the list to a list of Row objects, then use spark.createDataFrame which will infer the schema from your data:
from pyspark.sql import Row
R = Row('ID', 'words')
# use enumerate to add the ID column
spark.createDataFrame([R(i, x) for i, x in enumerate(my_data)]).show()
+---+--------------------+
| ID| words|
+---+--------------------+
| 0|[apple, ball, bal...|
| 1| [cat, camel, james]|
| 2| [none, focus, cake]|
+---+--------------------+
Try this -
data_array = []
for i in range(0, len(my_data)):
    data_array.append((i, my_data[i]))
df = spark.createDataFrame(data=data_array, schema=["ID", "words"])
df.show()
Try this -- the simplest approach
from pyspark.sql import *
x = Row(utc_timestamp=utc, routine='routine name', message='your message')
data = [x]
df = sqlContext.createDataFrame(data)
Simple Approach:
my_data =[['apple','ball','ballon'],['cat','camel','james'],['none','focus','cake']]
spark.sparkContext.parallelize(my_data).zipWithIndex() \
    .toDF(["id", "words"]).show(truncate=False)
+---------------------+-----+
|id |words|
+---------------------+-----+
|[apple, ball, ballon]|0 |
|[cat, camel, james] |1 |
|[none, focus, cake] |2 |
+---------------------+-----+

Apply different aggregate function to a PySpark groupby

I have a dataframe with a structure similar to
+----+-----+-------+------+------+------+
| cod| name|sum_vol| date| lat| lon|
+----+-----+-------+------+------+------+
|aggc|23124| 37|201610|-15.42|-32.11|
|aggc|23124| 19|201611|-15.42|-32.11|
| abc| 231| 22|201610|-26.42|-43.11|
| abc| 231| 22|201611|-26.42|-43.11|
| ttx| 231| 10|201610|-22.42|-46.11|
| ttx| 231| 10|201611|-22.42|-46.11|
| tty| 231| 25|201610|-25.42|-42.11|
| tty| 231| 45|201611|-25.42|-42.11|
|xptx| 124| 62|201611|-26.43|-43.21|
|xptx| 124| 260|201610|-26.43|-43.21|
|xptx|23124| 50|201610|-26.43|-43.21|
|xptx|23124| 50|201611|-26.43|-43.21|
+----+-----+-------+------+------+------+
and now I want to aggregate the lat and lon values, but using my own function:
from pyspark.sql.functions import udf
from pyspark.sql.types import FloatType

def get_centroid(lat, lon):
    # ...do whatever I need here
    return t_lat, t_lon

get_c = udf(lambda x, y: get_centroid(x, y), FloatType())
gg = df.groupby('cod', 'name').agg(get_c('lat', 'lon'))
but I get the following error:
u"expression 'pythonUDF' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;"
Is there a way to get the elements of the group by and operate on them, without having to use a UDAF? Something similar to pandas
df.groupby(['cod','name'])[['lat', 'lon']].apply(f).to_frame().reset_index()
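One way to approximate that pandas-style apply is the group-then-apply (collect_list) approach from the first answer in this section. The sketch below assumes get_centroid accepts a list of latitudes and a list of longitudes and returns a (t_lat, t_lon) pair:

from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, FloatType

# collect all lat/lon values per group, then hand the two lists to get_centroid
get_c = F.udf(lambda lats, lons: list(get_centroid(lats, lons)), ArrayType(FloatType()))

gg = (df
    .groupby('cod', 'name')
    .agg(F.collect_list('lat').alias('lats'),
         F.collect_list('lon').alias('lons'))
    .withColumn('centroid', get_c('lats', 'lons')))
gg.show()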