How to use K-means with a parquet file - Scala

I'd like to learn how to use the k-means algorithm on Spark.
I have a parquet file and I would like to analyze it with k-means. How can I tell Spark to analyze only specific columns? How can I remove rows with null values? Can someone show a simple example of how to do it?
Thank you

If you want specific columns, just do a select on the DataFrame and then use VectorAssembler; KMeans requires a Vector column as its input features.
You can delete, fill or replace null values using DataFrameNaFunctions.
See the example below:
import org.apache.spark.ml.clustering.KMeans
import org.apache.spark.ml.feature.VectorAssembler
import spark.implicits._

val dataset = spark.range(10)
  .select('id.cast("double").as("c1"), ('id / 2).cast("double").as("c2"))

// Assemble the selected columns into a single Vector column
val assembler = new VectorAssembler()
  .setInputCols(dataset.columns)
  .setOutputCol("myFeatures")
val output = assembler.transform(dataset)

// Train a k-means model
val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("myFeatures")
val model = kmeans.fit(output)

// Make predictions
val predictions = model.transform(output)
predictions.show()
+---+---+----------+----------+
| c1| c2|myFeatures|prediction|
+---+---+----------+----------+
|0.0|0.0| (2,[],[])| 1|
|1.0|0.5| [1.0,0.5]| 1|
|2.0|1.0| [2.0,1.0]| 1|
|3.0|1.5| [3.0,1.5]| 1|
|4.0|2.0| [4.0,2.0]| 1|
|5.0|2.5| [5.0,2.5]| 0|
|6.0|3.0| [6.0,3.0]| 0|
|7.0|3.5| [7.0,3.5]| 0|
|8.0|4.0| [8.0,4.0]| 0|
|9.0|4.5| [9.0,4.5]| 0|
+---+---+----------+----------+
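For the parquet and null-handling part of the question, a minimal sketch could look like this (the file path and the column names c1/c2 are placeholders, not from the question):
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("kmeans-example").getOrCreate()

// read the parquet file (hypothetical path)
val raw = spark.read.parquet("/path/to/data.parquet")

// keep only the columns you want to cluster on (hypothetical names)
// and drop any row that has a null in them
val dataset = raw.select("c1", "c2").na.drop()
The resulting dataset can then be fed to the VectorAssembler exactly as above.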

Related

PySpark UDF: a fit transform example

I am really new to PySpark and am trying to translate some Python code into PySpark.
I start with a pandas DataFrame, convert it to a document-term matrix and then apply PCA.
The UDF:
import itertools

class MultiLabelCounter():
    def __init__(self, classes=None):
        self.classes_ = classes

    def fit(self, y):
        self.classes_ = sorted(set(itertools.chain.from_iterable(y)))
        self.mapping = dict(zip(self.classes_, range(len(self.classes_))))
        return self

    def transform(self, y):
        yt = []
        for labels in y:
            data = [0] * len(self.classes_)
            for label in labels:
                data[self.mapping[label]] += 1
            yt.append(data)
        return yt

    def fit_transform(self, y):
        return self.fit(y).transform(y)
mlb = MultiLabelCounter()
df_grouped = df_grouped.withColumnRenamed("collect_list(full)", "full")
udf_mlb = udf(lambda x: mlb.fit_transform(x), IntegerType())
mlb_fitted = df_grouped.withColumn('full', udf_mlb(col("full")))
I am of course getting NULL results.
I am using Spark 2.4.4.
EDIT
Adding sample input and output as per request
Input:
|id|val|
|--|---|
|1|[hello,world]|
|2|[goodbye, world]|
|3|[hello,hello]|
Output:
|id|hello|goodbye|world|
|--|-----|-------|-----|
|1|1|0|1|
|2|0|1|1|
|3|2|0|0|
Based on the input data you shared, I tried replicating your output and it works. Please see below.
Input Data
df = spark.createDataFrame(data=[(1, ['hello', 'world']), (2, ['goodbye', 'world']), (3, ['hello', 'hello'])], schema=['id', 'vals'])
df.show()
+---+----------------+
| id| vals|
+---+----------------+
| 1| [hello, world]|
| 2|[goodbye, world]|
| 3| [hello, hello]|
+---+----------------+
Now, use explode to create a separate row for each item in the vals list. Then pivot and count will calculate the frequency per label. Finally, replace the null values with 0 using fillna(0). See below:
from pyspark.sql.functions import *
df1 = df.select(['id', explode(col('vals'))]).groupBy("id").pivot("col").agg(count(col("col")))
df1.fillna(0).orderBy("id").show()
Output
+---+-------+-----+-----+
| id|goodbye|hello|world|
+---+-------+-----+-----+
| 1| 0| 1| 1|
| 2| 1| 0| 1|
| 3| 0| 2| 0|
+---+-------+-----+-----+

How to convert an assembler vector to a data frame?

I just used VectorAssembler to normalize my features for an ML application.
def kmeansClustering(k: Int): sql.DataFrame = {
  val assembler = new VectorAssembler()
    .setInputCols(this.listeOfName())
    .setOutputCol("features")
  val intermediaireDF = assembler
    .transform(this.filterNumeric())
    .select("features")
  val kmeans = new KMeans().setK(k).setSeed(1L)
  val model = kmeans.fit(intermediaireDF)
  val predictions = model.transform(intermediaireDF)
  predictions
}
As a result I got a DataFrame with two columns, the assembled features vector and the prediction:
+--------------------+----------+
| features|prediction|
+--------------------+----------+
|[-27.482279,153.0...| 0|
|[-27.47059,153.03...| 2|
|[-27.474531,153.0...| 3|
.................................
So I want to compute something like avg and std per group for each original column, but the features are assembled into a single vector and I can't manipulate them individually.
I've tried to use org.apache.spark.ml.feature.VectorDisassembler, but it did not work.
val disassembler = new VectorDisassembler().setInputCol("vectorCol")
disassembler.transform(df).show()
Any suggestions?
Actually you do not need to remove the original columns to perform your clustering.
// creating sample data
val df = spark.range(10).select('id as "a", 'id % 3 as "b")

val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("features")

// Here I delete the select so as to keep all the columns
val intermediaireDF = assembler.transform(df)

// I specify explicitly which column contains the features
val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features")

// And the rest remains unchanged
val model = kmeans.fit(intermediaireDF)
val predictions = model.transform(intermediaireDF)
predictions.show(6)
+---+---+----------+----------+
| a| b| features|prediction|
+---+---+----------+----------+
| 1| 0| [1.0,0.0]| 1|
| 2| 1| [2.0,1.0]| 1|
| 3| 2| [3.0,2.0]| 1|
| 4| 0| [4.0,0.0]| 1|
| 5| 1| [5.0,1.0]| 0|
| 6| 2| [6.0,2.0]| 0|
+---+---+----------+----------+
And from there, you can compute what you need.
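For example, since the original columns a and b are still present next to the prediction column, per-cluster statistics are just a groupBy away (a minimal sketch based on the sample data above):
import org.apache.spark.sql.functions.{avg, stddev}

predictions
  .groupBy("prediction")
  .agg(avg("a"), stddev("a"), avg("b"), stddev("b"))
  .show()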

Remove all records that are duplicated in a Spark dataframe

I have a Spark dataframe with multiple columns. I want to find and remove the rows which have duplicated values in a column (the other columns can be different).
I tried using dropDuplicates(col_name), but it only drops the duplicate entries while still keeping one record in the dataframe. What I need is to remove all entries that originally contained duplicate values.
I am using Spark 1.6 and Scala 2.10.
I would use window functions for this. Let's say you want to remove rows with a duplicated id:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.count

df
  .withColumn("cnt", count("*").over(Window.partitionBy($"id")))
  .where($"cnt" === 1)
  .drop("cnt")
  .show()
This can be done by grouping by the column (or columns) you want to check for duplicates, then aggregating and filtering the results.
Example dataframe df:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
| 4| 4|
| 4| 5|
+---+---+
Group by the id column to remove the rows with duplicated ids (the last two rows):
import org.apache.spark.sql.functions.{count, first}

val df2 = df.groupBy("id")
  .agg(first($"num").as("num"), count($"id").as("count"))
  .filter($"count" === 1)
  .select("id", "num")
This will give you:
+---+---+
| id|num|
+---+---+
| 1| 1|
| 2| 2|
| 3| 3|
+---+---+
Alternatively, it can be done using a join. It will be slower, but if there are a lot of columns there is no need to use first($"num").as("num") for each one to keep them.
val df2 = df.groupBy("id").agg(count($"id").as("count")).filter($"count" === 1).select("id")
val df3 = df.join(df2, Seq("id"), "inner")
I added a killDuplicates() method to the open source spark-daria library that uses @Raphael Roth's solution. Here's how to use the code:
import com.github.mrpowers.spark.daria.sql.DataFrameExt._
df.killDuplicates(col("id"))
// you can also supply multiple Column arguments
df.killDuplicates(col("id"), col("another_column"))
Here's the code implementation:
import org.apache.spark.sql.{Column, DataFrame}
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, count}

object DataFrameExt {

  implicit class DataFrameMethods(df: DataFrame) {

    def killDuplicates(cols: Column*): DataFrame = {
      df
        .withColumn(
          "my_super_secret_count",
          count("*").over(Window.partitionBy(cols: _*))
        )
        .where(col("my_super_secret_count") === 1)
        .drop(col("my_super_secret_count"))
    }

  }

}
You might want to leverage the spark-daria library to keep this logic out of your codebase.

Rearrange the order of Spark columns

I have a Spark dataframe with many columns. Using Spark and Scala, I would like to select the columns in a specified order, but I don't want to hardcode the desired order. In pseudo-code, I'd like to do something like:
val colNames = df.columns
val newOrder = colNames(colNames.length) ++ colNames(0:colNames.length-1)
df.select(newOrder)
How can I do this? Thanks!
You can do something like this:
val df = Seq((1,2,3)).toDF("A","B","C")
df.select(df.columns.last, df.columns.dropRight(1): _*).show
+---+---+---+
| C| A| B|
+---+---+---+
| 3| 1| 2|
+---+---+---+
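If you prefer to build the new ordering as a plain sequence first (closer to the pseudo-code in the question), one way is to map the column names to Columns and splat them into select; this is just a sketch of the same idea:
import org.apache.spark.sql.functions.col

// move the last column to the front, then select in that order
val newOrder = df.columns.last +: df.columns.dropRight(1)
df.select(newOrder.map(col): _*).show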

Why does the Spark ML ALS algorithm print RMSE = NaN?

I use ALS to predict ratings; this is my code:
import org.apache.spark.ml.evaluation.RegressionEvaluator
import org.apache.spark.ml.recommendation.ALS

val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("user_id")
  .setItemCol("business_id")
  .setRatingCol("stars")
val model = als.fit(training)

// Evaluate the model by computing the RMSE on the test data
val predictions = model.transform(testing)
predictions.sort("user_id").show(1000)

val evaluator = new RegressionEvaluator()
  .setMetricName("rmse")
  .setLabelCol("stars")
  .setPredictionCol("prediction")
val rmse = evaluator.evaluate(predictions)
println(s"Root-mean-square error = $rmse")
But I get some negative scores and the RMSE is NaN:
+-------+-----------+---------+------------+
|user_id|business_id| stars| prediction|
+-------+-----------+---------+------------+
| 0| 2175| 4.0| 4.0388923|
| 0| 5753| 3.0| 2.6875196|
| 0| 9199| 4.0| 4.1753435|
| 0| 16416| 2.0| -2.710618|
| 0| 6063| 3.0| NaN|
| 0| 23076| 2.0| -0.8930751|
Root-mean-square error = NaN
How to get a good result?
Negative values don't matter, as RMSE squares the values first. Probably you have empty (NaN) prediction values. You could drop them:
predictions.na.drop(Seq("prediction"))
That can be a bit misleading, though; alternatively, you could fill those values with your lowest/highest/average rating.
I'd also recommend rounding predictions below min_rating and above max_rating to the lowest/highest ratings, which would improve your RMSE.
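A minimal sketch of that clamping, assuming ratings are on a 1-to-5 scale (the bounds are an assumption, not something stated in the question):
import org.apache.spark.sql.functions.{col, when}

val clamped = predictions.withColumn(
  "prediction",
  when(col("prediction") < 1.0, 1.0)     // assumed minimum rating
    .when(col("prediction") > 5.0, 5.0)  // assumed maximum rating
    .otherwise(col("prediction"))
)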
EDIT:
Some extra info here: https://issues.apache.org/jira/browse/SPARK-14489
Since Spark version 2.2.0 you can set the coldStartStrategy parameter to drop in order to drop any rows in the DataFrame of predictions that contain NaN values. The evaluation metric will then be computed over the non-NaN data and will be valid.
model.setColdStartStrategy("drop")
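Equivalently, you can set the strategy on the estimator before fitting, so every model it produces already drops NaN predictions:
val als = new ALS()
  .setMaxIter(5)
  .setRegParam(0.01)
  .setUserCol("user_id")
  .setItemCol("business_id")
  .setRatingCol("stars")
  .setColdStartStrategy("drop") // available since Spark 2.2.0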
A small correction will solve this issue:
predictions.na.drop()