I am using ALS to predict ratings; this is my code:
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
.setMaxIter(5)
.setRegParam(0.01)
.setUserCol("user_id")
.setItemCol("business_id")
.setRatingCol("stars")
val model = als.fit(training)
// Evaluate the model by computing the RMSE on the test data
val predictions = model.transform(testing)
predictions.sort("user_id").show(1000)
val evaluator = new RegressionEvaluator()
.setMetricName("rmse")
.setLabelCol("stars")
.setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")
But I get some negative scores and the RMSE is NaN:
+-------+-----------+---------+------------+
|user_id|business_id| stars| prediction|
+-------+-----------+---------+------------+
| 0| 2175| 4.0| 4.0388923|
| 0| 5753| 3.0| 2.6875196|
| 0| 9199| 4.0| 4.1753435|
| 0| 16416| 2.0| -2.710618|
| 0| 6063| 3.0| NaN|
| 0| 23076| 2.0| -0.8930751|
Root-mean-square error = NaN
How can I get a good result?
Negative values don't matter here, since RMSE squares the errors anyway. The NaN most likely comes from empty (NaN) prediction values: ALS returns NaN for users or items it did not see during training. You could drop those rows:
predictions.na.drop(Seq("prediction"))
That can be a bit misleading, though; alternatively, you could fill those values with your lowest/highest/average rating.
I'd also recommend clamping predictions below min_rating or above max_rating to the lowest/highest rating, which would improve your RMSE.
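For example, here is a rough Scala sketch of those two suggestions, dropping (or filling) NaN predictions and then clamping to the rating range; the 1.0-5.0 bounds and the 3.0 fill value are assumptions for a 5-star scale:
import org.apache.spark.sql.functions.{col, when}

// Drop rows where ALS produced no prediction (NaN), e.g. cold-start users/items...
val cleaned = predictions.na.drop(Seq("prediction"))
// ...or, alternatively, fill them with a default rating (global average assumed to be 3.0):
// val cleaned = predictions.na.fill(3.0, Seq("prediction"))

// Clamp predictions to the assumed valid rating range [1.0, 5.0]:
val clamped = cleaned.withColumn("prediction",
  when(col("prediction") < 1.0, 1.0)
    .when(col("prediction") > 5.0, 5.0)
    .otherwise(col("prediction")))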
EDIT:
Some extra info here: https://issues.apache.org/jira/browse/SPARK-14489
Since Spark version 2.2.0 you can set the coldStartStrategy parameter to drop in order to drop any rows in the DataFrame of predictions that contain NaN values. The evaluation metric will then be computed over the non-NaN data and will be valid.
model.setColdStartStrategy("drop")
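Equivalently (Spark 2.2.0+), you can set coldStartStrategy on the ALS estimator before fitting, so the fitted model drops such rows automatically:
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("user_id")
  .setItemCol("business_id")
  .setRatingCol("stars")
  .setColdStartStrategy("drop")  // rows with NaN predictions are dropped at transform time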
A small correction will solve this issue:
predictions.na.drop()
I am really new to PySpark and am trying to translate some Python code into PySpark.
I start with a pandas DataFrame, convert it to a document-term matrix, and then apply PCA.
The UDF:
import itertools

from pyspark.sql.functions import col, udf
from pyspark.sql.types import IntegerType

class MultiLabelCounter():
    def __init__(self, classes=None):
        self.classes_ = classes

    def fit(self, y):
        self.classes_ = sorted(set(itertools.chain.from_iterable(y)))
        self.mapping = dict(zip(self.classes_, range(len(self.classes_))))
        return self

    def transform(self, y):
        yt = []
        for labels in y:
            data = [0] * len(self.classes_)
            for label in labels:
                data[self.mapping[label]] += 1
            yt.append(data)
        return yt

    def fit_transform(self, y):
        return self.fit(y).transform(y)

mlb = MultiLabelCounter()
df_grouped = df_grouped.withColumnRenamed("collect_list(full)", "full")
udf_mlb = udf(lambda x: mlb.fit_transform(x), IntegerType())
mlb_fitted = df_grouped.withColumn('full', udf_mlb(col("full")))
I am of course getting NULL results.
I am using Spark 2.4.4.
EDIT
Adding sample input and output as requested.
Input:
|id|val|
|--|---|
|1|[hello,world]|
|2|[goodbye, world]|
|3|[hello,hello]|
Output:
|id|hello|goodbye|world|
|--|-----|-------|-----|
|1|1|0|1|
|2|0|1|1|
|3|2|0|0|
Based on the input data shared, I tried replicating your output and it works. Please see below.
Input Data
df = spark.createDataFrame(data=[(1, ['hello', 'world']), (2, ['goodbye', 'world']), (3, ['hello', 'hello'])], schema=['id', 'vals'])
df.show()
+---+----------------+
| id| vals|
+---+----------------+
| 1| [hello, world]|
| 2|[goodbye, world]|
| 3| [hello, hello]|
+---+----------------+
Now, use explode to create separate rows out of the vals list items. Then pivot and count will calculate the frequency. Finally, replace the null values with 0 using fillna(0). See below:
from pyspark.sql.functions import *
df1 = df.select(['id', explode(col('vals'))]).groupBy("id").pivot("col").agg(count(col("col")))
df1.fillna(0).orderBy("id").show()
Output
+---+-------+-----+-----+
| id|goodbye|hello|world|
+---+-------+-----+-----+
| 1| 0| 1| 1|
| 2| 1| 0| 1|
| 3| 0| 2| 0|
+---+-------+-----+-----+
I'd like to learn how to use the K-Means algorithm in Spark.
I have a Parquet file and I would like to analyze it with k-means. How can I tell Spark to analyze only specific columns? How can I remove null values from rows? Can someone write simple code showing how to do it?
Thank you
If you want specific columns, just do a select on the DataFrame and then use VectorAssembler; KMeans requires a Vector column as its input features.
You can delete, fill or replace null values using DataFrameNaFunctions.
See example below:
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler

val dataset = spark.range(10)
  .select('id.cast("double").as("c1"), ('id / 2).cast("double").as("c2"))
val assembler = new VectorAssembler()
.setInputCols(dataset.columns)
.setOutputCol("myFeatures")
val output = assembler.transform(dataset)
// Trains a k-means model.
val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("myFeatures")
val model = kmeans.fit(output)
// Make predictions
val predictions = model.transform(output)
predictions.show()
+---+---+----------+----------+
| c1| c2|myFeatures|prediction|
+---+---+----------+----------+
|0.0|0.0| (2,[],[])| 1|
|1.0|0.5| [1.0,0.5]| 1|
|2.0|1.0| [2.0,1.0]| 1|
|3.0|1.5| [3.0,1.5]| 1|
|4.0|2.0| [4.0,2.0]| 1|
|5.0|2.5| [5.0,2.5]| 0|
|6.0|3.0| [6.0,3.0]| 0|
|7.0|3.5| [7.0,3.5]| 0|
|8.0|4.0| [8.0,4.0]| 0|
|9.0|4.5| [9.0,4.5]| 0|
+---+---+----------+----------+
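As for null values, here is a small sketch of the DataFrameNaFunctions mentioned above (the column names c1 and c2 are just reused from the example; substitute your own):
// Drop rows that have a null in any of the selected feature columns...
val noNulls = dataset.na.drop(Seq("c1", "c2"))
// ...or fill nulls with a default value before assembling features:
val filled = dataset.na.fill(0.0, Seq("c1", "c2"))
// ...or replace specific placeholder values:
val replaced = dataset.na.replace("c1", Map(-1.0 -> 0.0))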
In short
I have a Cartesian product (cross join) of two DataFrames and a function which gives a score for each element of this product. I now want to get a few "best matched" elements of the second DF for every member of the first DF.
In detail
What follows is a simplified example as my real code is somewhat bloated with additional fields and filters.
Given two sets of data, each having some id and value:
// simple rdds of tuples
val rdd1 = sc.parallelize(Seq(("a", 31),("b", 41),("c", 59),("d", 26),("e",53),("f",58)))
val rdd2 = sc.parallelize(Seq(("z", 16),("y", 18),("x",3),("w",39),("v",98), ("u", 88)))
// convert them to dataframes:
val df1 = spark.createDataFrame(rdd1).toDF("id1", "val1")
val df2 = spark.createDataFrame(rdd2).toDF("id2", "val2")
and some function which, for a pair of elements from the first and second datasets, gives their "matching score":
def f(a:Int, b:Int):Int = (a * a + b * b * b) % 17
// convert it to udf
val fu = udf((a:Int, b:Int) => f(a, b))
we can create the product of the two sets and calculate a score for every pair:
val dfc = df1.crossJoin(df2)
val r = dfc.withColumn("rez", fu(col("val1"), col("val2")))
r.show
+---+----+---+----+---+
|id1|val1|id2|val2|rez|
+---+----+---+----+---+
| a| 31| z| 16| 8|
| a| 31| y| 18| 10|
| a| 31| x| 3| 2|
| a| 31| w| 39| 15|
| a| 31| v| 98| 13|
| a| 31| u| 88| 2|
| b| 41| z| 16| 14|
| c| 59| z| 16| 12|
...
And now we want to have this result grouped by id1:
r.groupBy("id1").agg(collect_set(struct("id2", "rez")).as("matches")).show
+---+--------------------+
|id1| matches|
+---+--------------------+
| f|[[v,2], [u,8], [y...|
| e|[[y,5], [z,3], [x...|
| d|[[w,2], [x,6], [v...|
| c|[[w,2], [x,6], [v...|
| b|[[v,2], [u,8], [y...|
| a|[[x,2], [y,10], [...|
+---+--------------------+
But really we want to retain only a few (say 3) of the "matches", those having the best score (say, the lowest score).
The question is
How to get the "matches" sorted and reduced to top-N elements? Probably it is something about collect_list and sort_array, though I don't know how to sort by inner field.
Is there a way to ensure optimization in case of large input DFs - e.g. choosing minimums directly while aggregating. I know it could be done easily if I wrote the code without spark - keeping small array or priority queue for every id1 and adding element where it should be, possibly dropping out some previously added.
E.g. it's ok that cross-join is costly operation, but I want to avoid wasting memory on the results most of which I'm going to drop in the next step. My real use case deals with DFs with less than 1 mln entries so cross-join is yet viable but as we want to select only 10-20 top matches for each id1 it seems to be quite desirable not to keep unnecessary data between steps.
To start, we need to take only the first n rows per group. To do this we partition the DF by 'id1' and sort each group by 'rez'. We use that window to add a row-number column to the DF, so that a where clause can keep the first n rows. Then you can continue with the same code you wrote, grouping by 'id1' and collecting the list; only now it already contains just the best-scoring (lowest rez) rows.
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions._

val n = 3
val w = Window.partitionBy($"id1").orderBy($"rez")  // ascending: lowest rez = best match
val res = r.withColumn("rn", row_number().over(w))
  .where($"rn" <= n)
  .groupBy("id1")
  .agg(collect_set(struct("id2", "rez")).as("matches"))
A second option that might be better because you won't need to group the DF twice:
import org.apache.spark.sql.Row

val sortTakeUDF = udf { (xs: Seq[Row], n: Int) =>
  xs.sortBy(_.getAs[Int]("rez")).take(n).map { case Row(id2: String, rez: Int) => (id2, rez) }
}
r.groupBy("id1").agg(sortTakeUDF(collect_set(struct("id2", "rez")), lit(n)).as("matches"))
Here we create a UDF that takes the array column and an integer value n. The UDF sorts the array by your 'rez' field and returns only the first n elements.
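For reference, if you are on Spark 2.4 or later, here is a sketch of a UDF-free variant of the same idea: putting the rez score first in the struct makes the built-in array_sort order by it, and slice then keeps the n lowest-scoring matches (reusing r and n from above; note the struct fields come out as (rez, id2) in this variant):
import org.apache.spark.sql.functions.{array_sort, col, collect_list, slice, struct}

// Sort by the inner "rez" field by making it the first struct field,
// then keep only the n best (lowest-score) matches per id1.
val topN = r.groupBy("id1")
  .agg(slice(array_sort(collect_list(struct(col("rez"), col("id2")))), 1, n).as("matches"))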
I have a PySpark DF with multiple numeric columns, and for each column I want to calculate the decile or other quantile rank of each row based on that variable.
This is simple in pandas, as we can create a new column for each variable using the qcut function to assign values 0 to n-1, as in pd.qcut(x, q=n).
How can this be done in PySpark? I have tried the following, but clearly the break points are not unique between these thirds. I want the lower 1/3 of the data assigned 1, the next 1/3 assigned 2, and the top 1/3 assigned 3. I also want to be able to change this and perhaps use 1/10, 1/32, etc.
from pyspark.sql import functions as F
from pyspark.sql.functions import ntile
from pyspark.sql.window import Window

w = Window.partitionBy(data.var1).orderBy(data.var1)
d2 = df.select(
    "var1",
    ntile(3).over(w).alias("ntile3")
)
agged = d2.groupby('ntile3').agg(F.min("var1").alias("min_var1"), F.max("var1").alias("max_var1"), F.count('*'))
agged.show()
+------+--------+--------+--------+
|ntile3|min_var1|max_var1|count(1)|
+------+--------+--------+--------+
| 1| 0.0| 210.0| 517037|
| 3| 0.0| 206.0| 516917|
| 2| 0.0| 210.0| 516962|
+------+--------+--------+--------+
QuantileDiscretizer from 'pyspark.ml.feature' can be used.
from pyspark.ml.feature import QuantileDiscretizer

values = [(0.1,), (0.4,), (1.2,), (1.5,)]
df = spark.createDataFrame(values, ["values"])
qds = QuantileDiscretizer(numBuckets=2, inputCol="values", outputCol="buckets",
                          relativeError=0.01, handleInvalid="error")
bucketizer = qds.fit(df)
bucketizer.setHandleInvalid("skip").fit(df).transform(df).show()
+------+-------+
|values|buckets|
+------+-------+
| 0.1| 0.0|
| 0.4| 1.0|
| 1.2| 1.0|
| 1.5| 1.0|
+------+-------+
You can use percent_rank from pyspark.sql.functions with a window function. For instance, for computing deciles you can do:
from pyspark.sql.window import Window
from pyspark.sql.functions import ceil, percent_rank
w = Window.orderBy(data.var1)
data.select('*', ceil(10 * percent_rank().over(w)).alias("decile"))
By doing so you first compute the percent_rank, then multiply it by 10 and take the ceiling. Consequently, all values with a percent_rank in (0, 0.1] end up in decile 1, all values with a percent_rank in (0.1, 0.2] end up in decile 2, and so on up to decile 10. (The single row with percent_rank exactly 0 gets decile 0, which you may want to bump to 1.)
In the accepted answer, fit is called twice. So change from
bucketizer = qds.fit(df)
bucketizer.setHandleInvalid("skip").fit(df).transform(df).show()
to
qds.setHandleInvalid("skip").fit(df).transform(df).show()
I am attempting the following in Scala-Spark.
I'm hoping someone can give me some guidance on how to tackle this problem or provide me with some resources to figure out what I can do.
I have a dateCountDF with a count corresponding to each date. I would like to randomly select a certain number of entries for each dateCountDF month from another DataFrame, entitiesDF, where dateCountDF.FirstDate < entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate, and then place all the results into a new DataFrame. See below for a data example.
I'm not at all sure how to approach this problem from a Spark SQL or Spark MapReduce perspective. The furthest I got was the naive approach, where I use a foreach on a DataFrame and then refer to the other DataFrame within the function. But this doesn't work because of the distributed nature of Spark.
val randomEntites = dateCountDF.foreach(x => {
  val count: Int = x(1).toString().toInt
  val result = entitiesDF.take(count)
  return result
})
DataFrames
**dateCountDF**
+----------+-----+
|      Date|Count|
+----------+-----+
|2016-08-31|    4|
|2015-12-31|    1|
|2016-09-30|    5|
|2016-04-30|    5|
|2015-11-30|    3|
|2016-05-31|    7|
|2016-11-30|    2|
|2016-07-31|    5|
|2016-12-31|    9|
|2014-06-30|    4|
+----------+-----+
only showing top 10 rows
**entitiesDF**
+---+----------+----------+
| ID| FirstDate|  LastDate|
+---+----------+----------+
|296|2014-09-01|2015-07-31|
|125|2015-10-01|2016-12-31|
|124|2014-08-01|2015-03-31|
|447|2017-02-01|2017-01-01|
|307|2015-01-01|2015-04-30|
|574|2016-01-01|2017-01-31|
|613|2016-04-01|2017-02-01|
|169|2009-08-23|2016-11-30|
|205|2017-02-01|2017-02-01|
|433|2015-03-01|2015-10-31|
+---+----------+----------+
only showing top 10 rows
Edit:
For clarification.
My inputs are entitiesDF and dateCountDF. I want to loop through dateCountDF and, for each row, select a random number of entities from entitiesDF where dateCountDF.FirstDate < entitiesDF.Date && entitiesDF.Date <= dateCountDF.LastDate.
To select random rows you can do something like the following (note that this example is in PySpark rather than Scala):
import random

def sampler(df, col, records):
    # Calculate number of rows
    colmax = df.count()
    # Create random sample from range
    vals = random.sample(range(1, colmax), records)
    # Use 'vals' to filter DataFrame using 'isin'
    return df.filter(df[col].isin(vals))
Select the random number of rows you want, store them in a DataFrame, and then append that data to the other DataFrame; for this you can use unionAll.
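Since the question is about Scala, here is a rough Scala sketch of that last step; filteredEntitiesDF, count, and resultDF are hypothetical placeholders (the entities already filtered to the date range, the number of rows to pick, and the accumulated output):
import org.apache.spark.sql.functions.rand

// Pick `count` random rows from the (already date-filtered) entities...
val sampled = filteredEntitiesDF.orderBy(rand()).limit(count)
// ...and append them to the accumulated result (use unionAll on Spark < 2.0):
val combined = resultDF.union(sampled)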
You can also refer to this answer.