Create an NxN matrix from a PySpark dataframe - pyspark

I am quite new to PySpark. I have a 10k text data set and I computed Jaccard distances using MinHash LSH.
The output I got looks like this, for example:
col1 col2 dist
A    B    0.77
B    C    0.56
C    A    0.88
I want to convert this into an NxN matrix format:
     A    B    C
A    0    0.77 0.88
B    0.77 0    0.56
C    0.88 0.56 0
Is there any way to create this using PySpark?
I would appreciate any suggestions.

It can be done using the code below. However, it will be computationally intensive because of the groupBy, pivot, union, and then another groupBy with aggregation. The two groupBy-pivots are needed because your data contains both orderings of a pair, A-B and B-A.
import pyspark.sql.functions as F

# Pivot once with col1 as rows and once with col2 as rows (covers both A-B and B-A)
df1 = df.groupBy("col1").pivot("col2").agg(F.first("dist")).orderBy("col1")
df2 = df.groupBy(F.col("col2").alias("col1")).pivot("col1").agg(F.first("dist")).orderBy("col1")
df3 = df1.union(df2)

# Collapse the two halves into one row per label and put 0 on the diagonal
df3.groupBy("col1") \
   .agg(*(F.first(x, True).alias(x) for x in df3.columns if x != "col1")) \
   .fillna(0) \
   .orderBy("col1") \
   .show()
+----+----+----+----+
|col1| A| B| C|
+----+----+----+----+
| A| 0.0|0.77|0.88|
| B|0.77| 0.0|0.56|
| C|0.88|0.56| 0.0|
+----+----+----+----+
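A lighter variant (a sketch, not part of the original answer) is to mirror the pairs and add an explicit zero diagonal before a single groupBy/pivot, so only one pivot pass is needed. It assumes the input df has the columns col1, col2, dist as in the example:
import pyspark.sql.functions as F

# Mirror each pair so both A-B and B-A are present
mirrored = df.select(F.col("col2").alias("col1"), F.col("col1").alias("col2"), "dist")

# Distinct labels, used to add the zero-distance diagonal (A-A, B-B, ...)
labels = df.select(F.col("col1").alias("l")).union(df.select("col2")).distinct()
diagonal = labels.select(F.col("l").alias("col1"), F.col("l").alias("col2"), F.lit(0.0).alias("dist"))

# One groupBy/pivot over the completed pair list
matrix = (df.union(mirrored).union(diagonal)
            .groupBy("col1").pivot("col2").agg(F.first("dist"))
            .orderBy("col1"))
matrix.show()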

Related

How would I filter a dataframe by a column's percentile value in Scala Spark

Say I have this dataframe:
val df = Seq(("Mike",1),("Kevin",2),("Bob",3),("Steve",4)).toDF("name","score")
and I want to filter this dataframe so that it only returns rows where the "score" column is greater than or equal to the 75th percentile. How would I do this?
Thanks so much and have a great day!
What you want to base your filter on is the upper quartile.
It is also known as the third quartile (Q3) or the 75th empirical percentile, and 75% of the data lies below this point.
Based on the answer here, you can use Spark's approxQuantile to get what you want:
val q = df.stat.approxQuantile("score", Array(.75), 0)
q: Array[Double] = Array(3.0)
This array (q) gives you the boundary between the 3rd and 4th quartiles.
A simple Spark filter should then get you what you want:
df.filter($"score" >= q.head).show
+-----+-----+
| name|score|
+-----+-----+
| Bob| 3|
|Steve| 4|
+-----+-----+
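For anyone doing the same in PySpark, a rough equivalent (a sketch that mirrors the Scala example above, not from the original answer) uses the same approxQuantile API on the DataFrame:
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("Mike", 1), ("Kevin", 2), ("Bob", 3), ("Steve", 4)], ["name", "score"])

# approxQuantile(column, probabilities, relativeError); 0.0 asks for the exact quantile
q75 = df.approxQuantile("score", [0.75], 0.0)[0]

df.filter(F.col("score") >= q75).show()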

compute the rate in pyspark dataframe

I have a spark dataframe like this :
date isF
190502 1
190502 0
190503 1
190504 1
190504 0
190505 1
I would like to compute, for each date, the rate of rows where isF = 1.
The expected result is :
date rate
190502 0.5
190503 1
190504 0.5
190505 1
I tried something like this, but here I compute the sum; how can I compute the rate instead?
stats_daily_df = (tx_wd_df
    .groupBy("date", "isF")
    .agg(  # select
        when(col("isF") == 1, (sum("isF")).alias("sum"))
        .otherwise(0))  # else 0.00
)
IIUC, the below can help:
from pyspark.sql import functions as f

df.groupBy('date').agg((f.sum('isF') / f.count('isF')).alias('rate')).show()
+------+----+
| date|rate|
+------+----+
|190505| 1.0|
|190502| 0.5|
|190504| 0.5|
|190503| 1.0|
+------+----+
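Since isF only takes the values 0 and 1, the mean of the column is the same rate, so an equivalent sketch (same assumed df as above) is:
from pyspark.sql import functions as f

# avg of a 0/1 column equals the share of rows where the column is 1
df.groupBy('date').agg(f.avg('isF').alias('rate')).orderBy('date').show()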

Deciles or other quantile rank for Pyspark column

I have a PySpark DataFrame with multiple numeric columns, and for each column I want to calculate the decile (or another quantile) rank of each row based on that variable.
This is simple in pandas, where we can create a new column for each variable using the qcut function to assign values 0 to n-1 for q buckets, as in pd.qcut(x, q=n).
How can this be done in PySpark? I have tried the following, but clearly the break points are not unique between these thirds. I want the lower 1/3 of the data assigned 1, the next 1/3 assigned 2, and the top 1/3 assigned 3. I also want to be able to change this and use, say, 1/10, 1/32, etc.
w = Window.partitionBy(df.var1).orderBy(df.var1)
d2 = df.select(
    "var1",
    ntile(3).over(w).alias("ntile3")
)
agged = d2.groupby('ntile3').agg(F.min("var1").alias("min_var1"), F.max("var1").alias("max_var1"), F.count('*'))
agged.show()
+------+--------+--------+--------+
|ntile3|min_var1|max_var1|count(1)|
+------+--------+--------+--------+
| 1| 0.0| 210.0| 517037|
| 3| 0.0| 206.0| 516917|
| 2| 0.0| 210.0| 516962|
+------+--------+--------+--------+
QuantileDiscretizer from 'pyspark.ml.feature' can be used.
from pyspark.ml.feature import QuantileDiscretizer

values = [(0.1,), (0.4,), (1.2,), (1.5,)]
df = spark.createDataFrame(values, ["values"])
qds = QuantileDiscretizer(numBuckets=2, inputCol="values", outputCol="buckets",
                          relativeError=0.01, handleInvalid="error")
bucketizer = qds.fit(df)
bucketizer.setHandleInvalid("skip").fit(df).transform(df).show()
+------+-------+
|values|buckets|
+------+-------+
| 0.1| 0.0|
| 0.4| 1.0|
| 1.2| 1.0|
| 1.5| 1.0|
+------+-------+
You can use percent_rank from pyspark.sql.functions with a window function. For instance, to compute deciles you can do:
from pyspark.sql.window import Window
from pyspark.sql.functions import ceil, percent_rank
w = Window.orderBy(data.var1)
data.select('*', ceil(10 * percent_rank().over(w)).alias("decile"))
By doing so you first compute the percent_rank, then multiply it by 10 and take the upper integer. Consequently, all values with a percent_rank above 0 and up to 0.1 are assigned to decile 1, all values with a percent_rank above 0.1 and up to 0.2 to decile 2, and so on (the row(s) holding the minimum value have a percent_rank of exactly 0 and therefore end up in decile 0).
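Since the question mentions several numeric columns, here is a sketch of how the percent_rank approach could be looped over columns (the DataFrame name data matches the answer above; the column list and bucket count are assumptions):
from pyspark.sql import functions as F
from pyspark.sql.window import Window

n_buckets = 10            # 10 for deciles; use 3, 32, ... for other splits
cols = ["var1", "var2"]   # numeric columns to rank

ranked = data
for c in cols:
    # no partitionBy: the whole column is ranked globally (single-partition warning)
    w = Window.orderBy(F.col(c))
    ranked = ranked.withColumn(c + "_decile", F.ceil(n_buckets * F.percent_rank().over(w)))
ranked.show()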
In the accepted answer, fit is called twice. So change from
bucketizer = qds.fit(df)
bucketizer.setHandleInvalid("skip").fit(df).transform(df).show()
to
qds.setHandleInvalid("skip").fit(df).transform(df).show()

Padding in a Pyspark Dataframe

I have a PySpark dataframe (original dataframe) with the data below (all columns have string datatype):
id Value
1 103
2 1504
3 1
I need to create a new, modified dataframe with padding in the value column, so that the length of this column is 4 characters. If the length is less than 4 characters, 0's should be added, as shown below:
id Value
1 0103
2 1504
3 0001
Can someone help me out? How can I achieve this with a PySpark dataframe? Any help will be appreciated.
You can use lpad from the functions module:
from pyspark.sql.functions import lpad

df.select('id', lpad(df['value'], 4, '0').alias('value')).show()
+---+-----+
| id|value|
+---+-----+
| 1| 0103|
| 2| 1504|
| 3| 0001|
+---+-----+
Using the PySpark lpad function in conjunction with withColumn:
import pyspark.sql.functions as F
dfNew = dfOrigin.withColumn('Value', F.lpad(dfOrigin['Value'], 4, '0'))
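As a side note (not from either answer), Spark's lpad also truncates values that are already longer than the target length, which is easy to check with a small throwaway example (the value "12345" below is made up):
from pyspark.sql import SparkSession
from pyspark.sql.functions import lpad

spark = SparkSession.builder.getOrCreate()
df = spark.createDataFrame([("1", "103"), ("2", "1504"), ("3", "12345")], ["id", "Value"])

# Values shorter than 4 characters are zero-padded; "12345" is cut down to "1234"
df.withColumn("Value", lpad("Value", 4, "0")).show()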

Why does the Spark ML ALS algorithm print RMSE = NaN?

I use ALS to predict ratings; this is my code:
import org.apache.spark.ml.recommendation.ALS
import org.apache.spark.ml.evaluation.RegressionEvaluator

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("user_id")
  .setItemCol("business_id")
  .setRatingCol("stars")
val model = als.fit(training)

// Evaluate the model by computing the RMSE on the test data
val predictions = model.transform(testing)
predictions.sort("user_id").show(1000)

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("stars")
  .setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")
But I get some negative scores and the RMSE is NaN:
+-------+-----------+---------+------------+
|user_id|business_id| stars| prediction|
+-------+-----------+---------+------------+
| 0| 2175| 4.0| 4.0388923|
| 0| 5753| 3.0| 2.6875196|
| 0| 9199| 4.0| 4.1753435|
| 0| 16416| 2.0| -2.710618|
| 0| 6063| 3.0| NaN|
| 0| 23076| 2.0| -0.8930751|
Root-mean-square error = NaN
How can I get a good result?
Negative values don't matter, as RMSE squares the errors first. You probably have missing (NaN) prediction values. You could drop them:
predictions.na.drop(Seq("prediction"))
That can be a bit misleading, though; alternatively, you could fill those values with your lowest/highest/average rating.
I'd also recommend rounding predictions with x < min_rating or x > max_rating to the lowest/highest rating, which would improve your RMSE.
EDIT:
Some extra info here: https://issues.apache.org/jira/browse/SPARK-14489
Since Spark version 2.2.0 you can set the coldStartStrategy parameter to drop in order to drop any rows in the DataFrame of predictions that contain NaN values. The evaluation metric will then be computed over the non-NaN data and will be valid.
model.setColdStartStrategy("drop")
A small correction will solve this issue:
predictions.na.drop()
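For completeness, a PySpark sketch of the same pipeline with coldStartStrategy set to "drop" on the estimator, so NaN predictions never reach the evaluator (the column names follow the question; the training and testing DataFrames are assumed to exist):
from pyspark.ml.recommendation import ALS
from pyspark.ml.evaluation import RegressionEvaluator

als = ALS(maxIter=5, regParam=0.01,
          userCol="user_id", itemCol="business_id", ratingCol="stars",
          coldStartStrategy="drop")   # drop rows that would otherwise get NaN predictions

model = als.fit(training)
predictions = model.transform(testing)

evaluator = RegressionEvaluator(metricName="rmse", labelCol="stars", predictionCol="prediction")
print("Root-mean-square error =", evaluator.evaluate(predictions))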