Deciles or other quantile rank for Pyspark column - pyspark

I have a pyspark DF with multiple numeric columns and I want to, for each column calculate the decile or other quantile rank for that row based on each variable.
This is simple for pandas as we can create a new column for each variable using the qcut function to assign the value 0 to n-1 for 'q' as in pd.qcut(x,q=n).
How can this be done in pyspark? I have tried the following but clearly the break points are not unique between these thirds. I want to get the lower 1/3 of the data assigned 1, the next 1/3 assigned 2 and the top 1/3 assigned 3. I want to be able to change this and perhaps use 1/10, 1/32 etc
w = Window.partitionBy(data.var1).orderBy(data.var1)
d2=df.select(
"var1",
ntile(3).over(w).alias("ntile3")
)
agged=d2.groupby('ntile3').agg(F.min("var1").alias("min_var1"),F.max("var1").alias("max_var1"),F.count('*'))
agged.show()
+------+--------+--------+--------+
|ntile3|min_var1|max_var1|count(1)|
+------+--------+--------+--------+
| 1| 0.0| 210.0| 517037|
| 3| 0.0| 206.0| 516917|
| 2| 0.0| 210.0| 516962|
+------+--------+--------+--------+

QuantileDiscretizer from 'pyspark.ml.feature' can be used.
values = [(0.1,), (0.4,), (1.2,), (1.5,)]
df = spark.createDataFrame(values, ["values"])
qds = QuantileDiscretizer(numBuckets=2,
... inputCol="values", outputCol="buckets", relativeError=0.01, handleInvalid="error")
bucketizer = qds.fit(df)
bucketizer.setHandleInvalid("skip").fit(df).transform(df).show()
+------+-------+
|values|buckets|
+------+-------+
| 0.1| 0.0|
| 0.4| 1.0|
| 1.2| 1.0|
| 1.5| 1.0|
+------+-------+

You can use the percent_rank from pyspark.sql.functions with a window function. For instance for computing deciles you can do:
from pyspark.sql.window import Window
from pyspark.sql.functions import ceil, percent_rank
w = Window.orderBy(data.var1)
data.select('*', ceil(10 * percent_rank().over(w)).alias("decile"))
By doing so you first compute the percent_rank, and then you multiply this by 10 and take the upper integer. Consequently, all values with a percent_rank between 0 and 0.1 will be added to decile 1, all values with a percent_rank between 0.1 and 0.2 will be added to decile 2, etc.

In the accepted answer fit is called two times. Thus change from
bucketizer = qds.fit(df)
bucketizer.setHandleInvalid("skip").fit(df).transform(df).show()
to
qds.setHandleInvalid("skip").fit(df).transform(df).show()

Related

PySpark: Group by two columns, count the pairs, and divide the average of two different columns

I have a dataframe with several columns, some of which are labeled PULocationID, DOLocationID, total_amount, and trip_distance. I'm trying to group by both PULocationID and DOLocationID, then count the combination each into a column called "count". I also need to take the average of total_amount and trip_distance and divide them into a column called "trip_rate". The end DF should be:
PULocationID
DOLocationID
count
trip_rate
123
422
1
5.2435
3
27
4
6.6121
Where (123,422) are paired together once for a trip rate of $5.24 and (3, 27) are paired together 4 times where the trip rate is $6.61.
Through reading some other threads, I'm able to group by the locations and count them using the below:
df.groupBy("PULocationID", 'DOLocationID').agg(count(lit(1)).alias("count")).show()
OR I can group by the locations and get the averages of the two columns I need using the below:
df.groupBy("PULocationID", 'DOLocationID').agg({'total_amount':'avg', 'trip_distance':'avg'}).show()
I tried a couple of things to get the trip_rate, but neither worked:
df.withColumn("trip_rate", (pyspark.sql.functions.col("total_amount") / pyspark.sql.functions.col("trip_distance")))
df.withColumn("trip_rate", df.total_amount/sum(df.trip_distance))
I also can't figure out how to combine the two queries that work (i.e. count of locations + averages).
Using this as an example input DataFrame:
+------------+------------+------------+-------------+
|PULocationID|DOLocationID|total_amount|trip_distance|
+------------+------------+------------+-------------+
| 123| 422| 10.487| 2|
| 3| 27| 19.8363| 3|
| 3| 27| 13.2242| 2|
| 3| 27| 6.6121| 1|
| 3| 27| 26.4484| 4|
+------------+------------+------------+-------------+
You can chain together the groupBy, agg, and select (you could also use withColumn and drop if you only need the 4 columns).
import pyspark.sql.functions as F
new_df = df.groupBy(
"PULocationID",
"DOLocationID",
).agg(
F.count(F.lit(1)).alias("count"),
F.avg(F.col("total_amount")).alias("avg_amt"),
F.avg(F.col("trip_distance")).alias("avg_distance"),
).select(
"PULocationID",
"DOLocationID",
"count",
(F.col("avg_amt") / F.col("avg_distance")).alias("trip_rate")
)
new_df.show()
+------------+------------+-----+-----------------+
|PULocationID|DOLocationID|count| trip_rate|
+------------+------------+-----+-----------------+
| 123| 422| 1| 5.2435|
| 3| 27| 4|6.612100000000001|
+------------+------------+-----+-----------------+

transform distinct row values to different columns with corresponding rows using Pyspark

I'm new to Pyspark and trying to transform data
Given dataframe
Col1
A=id1a A=id2a B=id1b C=id1c B=id2b
D=id1d A=id3a B=id3b C=id2c
A=id4a C=id3c
Required:
A B C
id1a id1b id1c
id2a id2b id2c
id3a id3b id3b
id4a null null
I have tried pivot, but that gives first value.
There might be a better way , however an approach is splitting the column on spaces to create array of the entries and then using higher order functions(spark 2.4+) to split on the '=' for each entry in the splitted array .Then explode and create 2 columns one with the id and one with the value. Then we can assign a row number to each partition and groupby then pivot:
import pyspark.sql.functions as F
df1 = (df.withColumn("Col1",F.split(F.col("Col1"),"\s+")).withColumn("Col1",
F.explode(F.expr("transform(Col1,x->split(x,'='))")))
.select(F.col("Col1")[0].alias("cols"),F.col("Col1")[1].alias("vals")))
from pyspark.sql import Window
w = Window.partitionBy("cols").orderBy("cols")
final = (df1.withColumn("Rnum",F.row_number().over(w)).groupBy("Rnum")
.pivot("cols").agg(F.first("vals")).orderBy("Rnum"))
final.show()
+----+----+----+----+----+
|Rnum| A| B| C| D|
+----+----+----+----+----+
| 1|id1a|id1b|id1c|id1d|
| 2|id2a|id2b|id2c|null|
| 3|id3a|id3b|id3c|null|
| 4|id4a|null|null|null|
+----+----+----+----+----+
this is how df1 looks like after the transformation:
df1.show()
+----+----+
|cols|vals|
+----+----+
| A|id1a|
| A|id2a|
| B|id1b|
| C|id1c|
| B|id2b|
| D|id1d|
| A|id3a|
| B|id3b|
| C|id2c|
| A|id4a|
| C|id3c|
+----+----+
May be I don't know the full picture, but the data format seems to be strange. If nothing can be done at the data source, then some collects, pivots and joins will be needed. Try this.
import pyspark.sql.functions as F
test = sqlContext.createDataFrame([('A=id1a A=id2a B=id1b C=id1c B=id2b',1),('D=id1d A=id3a B=id3b C=id2c',2),('A=id4a C=id3c',3)],schema=['col1','id'])
tst_spl = test.withColumn("item",(F.split('col1'," ")))
tst_xpl = tst_spl.select(F.explode("item"))
tst_map = tst_xpl.withColumn("key",F.split('col','=')[0]).withColumn("value",F.split('col','=')[1]).drop('col')
#%%
tst_pivot = tst_map.groupby(F.lit(1)).pivot('key').agg(F.collect_list(('value'))).drop('1')
#%%
tst_arr = [tst_pivot.select(F.posexplode(coln)).withColumnRenamed('col',coln) for coln in tst_pivot.columns]
tst_fin = reduce(lambda df1,df2:df1.join(df2,on='pos',how='full'),tst_arr).orderBy('pos')
tst_fin.show()
+---+----+----+----+----+
|pos| A| B| C| D|
+---+----+----+----+----+
| 0|id3a|id3b|id1c|id1d|
| 1|id4a|id1b|id2c|null|
| 2|id1a|id2b|id3c|null|
| 3|id2a|null|null|null|
+---+----+----+----+----

How to convert numerical values to a categorical variable using pyspark

pyspark dataframe which have a range of numerical variables.
for eg
my dataframe have a column value from 1 to 100.
1-10 - group1<== the column value for 1 to 10 should contain group1 as value
11-20 - group2
.
.
.
91-100 group10
how can i achieve this using pyspark dataframe
# Creating an arbitrary DataFrame
df = spark.createDataFrame([(1,54),(2,7),(3,72),(4,99)], ['ID','Var'])
df.show()
+---+---+
| ID|Var|
+---+---+
| 1| 54|
| 2| 7|
| 3| 72|
| 4| 99|
+---+---+
Once the DataFrame has been created, we use floor() function to find the integral part of a number. For eg; floor(15.5) will be 15. We need to find the integral part of the Var/10 and add 1 to it, because the indexing starts from 1, as opposed to 0. Finally, we have need to prepend group to the value. Concatenation can be achieved with concat() function, but keep in mind that since the prepended word group is not a column, so we need to put it inside lit() which creates a column of a literal value.
# Requisite packages needed
from pyspark.sql.functions import col, floor, lit, concat
df = df.withColumn('Var',concat(lit('group'),(1+floor(col('Var')/10))))
df.show()
+---+-------+
| ID| Var|
+---+-------+
| 1| group6|
| 2| group1|
| 3| group8|
| 4|group10|
+---+-------+

How to randomly selecting rows from one dataframeusing information from another dataframe

The following I am attempting in Scala-Spark.
I'm hoping someone can give me some guidance on how to tackle this problem or provide me with some resources to figure out what I can do.
I have a dateCountDF with a count corresponding to a date. I would like to randomly select a certain number of entries for each dateCountDF.month from another Dataframe entitiesDF where dateCountDF.FirstDate<entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate and then place all the results into a new Dataframe. See Bellow for Data Example
I'm not at all sure how to approach this problem from a Spark-SQl or Spark-MapReduce perspective. The furthest I got was the naive approach, where I use a foreach on a dataFrame and then refer to the other dataframe within the function. But this doesn't work because of the distributed nature of Spark.
val randomEntites = dateCountDF.foreach(x => {
val count:Int = x(1).toString().toInt
val result = entitiesDF.take(count)
return result
})
DataFrames
**dateCountDF**
| Date | Count |
+----------+----------------+
|2016-08-31| 4|
|2015-12-31| 1|
|2016-09-30| 5|
|2016-04-30| 5|
|2015-11-30| 3|
|2016-05-31| 7|
|2016-11-30| 2|
|2016-07-31| 5|
|2016-12-31| 9|
|2014-06-30| 4|
+----------+----------------+
only showing top 10 rows
**entitiesDF**
| ID | FirstDate | LastDate |
+----------+-----------------+----------+
| 296| 2014-09-01|2015-07-31|
| 125| 2015-10-01|2016-12-31|
| 124| 2014-08-01|2015-03-31|
| 447| 2017-02-01|2017-01-01|
| 307| 2015-01-01|2015-04-30|
| 574| 2016-01-01|2017-01-31|
| 613| 2016-04-01|2017-02-01|
| 169| 2009-08-23|2016-11-30|
| 205| 2017-02-01|2017-02-01|
| 433| 2015-03-01|2015-10-31|
+----------+-----------------+----------+
only showing top 10 rows
Edit:
For clarification.
My inputs are entitiesDF and dateCountDF. I want to loop through dateCountDF and for each row I want to select a random number of entities in entitiesDF where dateCountDF.FirstDate<entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate
To select random you do like this in scala
import random
def sampler(df, col, records):
# Calculate number of rows
colmax = df.count()
# Create random sample from range
vals = random.sample(range(1, colmax), records)
# Use 'vals' to filter DataFrame using 'isin'
return df.filter(df[col].isin(vals))
select random number of rows you want store in dataframe and the add this data in the another dataframe for this you can use unionAll.
also you can refer this answer

Applying a Window function to calculate differences in pySpark

I am using pySpark, and have set up my dataframe with two columns representing a daily asset price as follows:
ind = sc.parallelize(range(1,5))
prices = sc.parallelize([33.3,31.1,51.2,21.3])
data = ind.zip(prices)
df = sqlCtx.createDataFrame(data,["day","price"])
I get upon applying df.show():
+---+-----+
|day|price|
+---+-----+
| 1| 33.3|
| 2| 31.1|
| 3| 51.2|
| 4| 21.3|
+---+-----+
Which is fine and all. I would like to have another column that contains the day-to-day returns of the price column, i.e., something like
(price(day2)-price(day1))/(price(day1))
After much research, I am told that this is most efficiently accomplished through applying the pyspark.sql.window functions, but I am unable to see how.
You can bring the previous day column by using lag function, and add additional column that does actual day-to-day return from the two columns, but you may have to tell spark how to partition your data and/or order it to do lag, something like this:
from pyspark.sql.window import Window
import pyspark.sql.functions as func
from pyspark.sql.functions import lit
dfu = df.withColumn('user', lit('tmoore'))
df_lag = dfu.withColumn('prev_day_price',
func.lag(dfu['price'])
.over(Window.partitionBy("user")))
result = df_lag.withColumn('daily_return',
(df_lag['price'] - df_lag['prev_day_price']) / df_lag['price'] )
>>> result.show()
+---+-----+-------+--------------+--------------------+
|day|price| user|prev_day_price| daily_return|
+---+-----+-------+--------------+--------------------+
| 1| 33.3| tmoore| null| null|
| 2| 31.1| tmoore| 33.3|-0.07073954983922816|
| 3| 51.2| tmoore| 31.1| 0.392578125|
| 4| 21.3| tmoore| 51.2| -1.403755868544601|
+---+-----+-------+--------------+--------------------+
Here is longer introduction into Window functions in Spark.
Lag function can help you resolve your use case.
from pyspark.sql.window import Window
import pyspark.sql.functions as func
### Defining the window
Windowspec=Window.orderBy("day")
### Calculating lag of price at each day level
prev_day_price= df.withColumn('prev_day_price',
func.lag(dfu['price'])
.over(Windowspec))
### Calculating the average
result = prev_day_price.withColumn('daily_return',
(prev_day_price['price'] - prev_day_price['prev_day_price']) /
prev_day_price['price'] )