How to use Spark ML ALS algorithm? [duplicate] - scala

Is it possible to factorize a Spark dataframe column? By factorizing I mean mapping each unique value in the column to an ID, so that equal values get the same ID.
For example, the original dataframe:
+----------+----------------+--------------------+
|      col1|            col2|                col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370|                   A|
|1473492972|4060600988513370|                   A|
|1473509764|4060600988513370|                   B|
|1473513432|4060600988513370|                   C|
|1473513432|4060600988513370|                   A|
+----------+----------------+--------------------+
to the factorized version:
+----------+----------------+--------------------+
|      col1|            col2|                col3|
+----------+----------------+--------------------+
|1473490929|4060600988513370|                   0|
|1473492972|4060600988513370|                   0|
|1473509764|4060600988513370|                   1|
|1473513432|4060600988513370|                   2|
|1473513432|4060600988513370|                   0|
+----------+----------------+--------------------+
In plain Scala this would be fairly simple, but since Spark distributes its dataframes over nodes I'm not sure how to keep a consistent mapping A->0, B->1, C->2.
Also, assume the dataframe is pretty big (gigabytes), which means loading one entire column into the memory of a single machine might not be possible.
Can it be done?

You can use StringIndexer to encode letters into indices:
import org.apache.spark.ml.feature.StringIndexer

val indexer = new StringIndexer()
  .setInputCol("col3")
  .setOutputCol("col3Index")

val indexed = indexer.fit(df).transform(df)
indexed.show()
+----------+----------------+----+---------+
|      col1|            col2|col3|col3Index|
+----------+----------------+----+---------+
|1473490929|4060600988513370|   A|      0.0|
|1473492972|4060600988513370|   A|      0.0|
|1473509764|4060600988513370|   B|      1.0|
|1473513432|4060600988513370|   C|      2.0|
|1473513432|4060600988513370|   A|      0.0|
+----------+----------------+----+---------+
Data:
val df = spark.createDataFrame(Seq(
  (1473490929, "4060600988513370", "A"),
  (1473492972, "4060600988513370", "A"),
  (1473509764, "4060600988513370", "B"),
  (1473513432, "4060600988513370", "C"),
  (1473513432, "4060600988513370", "A"))).toDF("col1", "col2", "col3")
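If you want the exact output asked for in the question (col3 itself replaced by an integer id), a small follow-up sketch on the indexed dataframe from above can cast the index and drop the helper column:
import org.apache.spark.sql.functions.col

// cast the double index to an integer and overwrite col3
val factorized = indexed
  .withColumn("col3", col("col3Index").cast("int"))
  .drop("col3Index")
factorized.show()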

You can use a user-defined function.
First, create the mapping you need:
import org.apache.spark.sql.functions.udf

val updateFunction = udf { (x: String) =>
  x match {
    case "A" => 0
    case "B" => 1
    case "C" => 2
    case _   => 3
  }
}
And now you only have to apply it to your DataFrame:
df.withColumn("col3", updateFunction(df.col("col3")))

Related

how to convert assembler vector to data frame?

I just used VectorAssembler to normalize my features for an ML application.
def kmeansClustering(k: Int): sql.DataFrame = {
  val assembler = new VectorAssembler()
    .setInputCols(this.listeOfName())
    .setOutputCol("features")
  val intermediaireDF = assembler
    .transform(this.filterNumeric())
    .select("features")
  val kmeans = new KMeans().setK(k).setSeed(1L)
  val model = kmeans.fit(intermediaireDF)
  val predictions = model.transform(intermediaireDF)
  predictions
}
As a result I got a dataframe with a features vector column and a prediction column:
+--------------------+----------+
| features|prediction|
+--------------------+----------+
|[-27.482279,153.0...| 0|
|[-27.47059,153.03...| 2|
|[-27.474531,153.0...| 3|
.................................
I want to compute something like the mean and standard deviation per group for each original column, but the features are assembled into a single vector and I can't manipulate them individually.
I've tried to use org.apache.spark.ml.feature.VectorDisassembler, but it did not work.
val disassembler = new VectorDisassembler().setInputCol("vectorCol")
disassembler.transform(df).show()
Any suggestions?
Actually you do not need to remove the original columns to perform your clustering.
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.ml.clustering.KMeans
import spark.implicits._

// creating sample data
val df = spark.range(10).select('id as "a", 'id % 3 as "b")

val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("features")

// Here I drop the select so as to keep all the columns
// (on your data this would be assembler.transform(this.filterNumeric()))
val intermediaireDF = assembler.transform(df)

// I specify explicitly what the feature column is
val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features")

// And the rest remains unchanged
val model = kmeans.fit(intermediaireDF)
val predictions = model.transform(intermediaireDF)
predictions.show(6)
+---+---+----------+----------+
| a| b| features|prediction|
+---+---+----------+----------+
| 1| 0| [1.0,0.0]| 1|
| 2| 1| [2.0,1.0]| 1|
| 3| 2| [3.0,2.0]| 1|
| 4| 0| [4.0,0.0]| 1|
| 5| 1| [5.0,1.0]| 0|
| 6| 2| [6.0,2.0]| 0|
+---+---+----------+----------+
And from there, you can compute what you need.
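For example, a short sketch of per-cluster mean and standard deviation on the original columns a and b from the sample above:
import org.apache.spark.sql.functions.{avg, stddev}

// group by the cluster assignment and aggregate the original columns
predictions.groupBy("prediction")
  .agg(avg("a"), stddev("a"), avg("b"), stddev("b"))
  .show()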

Sum columns of a Spark dataframe and create another dataframe

I have a dataframe like the one below, and I am trying to create another dataframe from it with 2 columns: the column name and the sum of the values in that column.
So far, I've tried this (in Spark 2.2.0), but it throws a stack trace:
val get_count: (String => Long) = (c: String) => {
  df.groupBy("id")
    .agg(sum(c) as "s")
    .select("s")
    .collect()(0)
    .getLong(0)
}
val sqlfunc = udf(get_count)
summary = summary.withColumn("sum_of_column", sqlfunc(col("c")))
Are there any other alternatives of accomplishing this task?
I think that the most efficient way is to do an aggregation and then build a new dataframe. That way you avoid a costly explode.
First, let's create the dataframe. BTW, it's always nice to provide the code to do it when you ask a question. This way we can reproduce your problem in seconds.
import spark.implicits._
import org.apache.spark.sql.functions._

val df = Seq((1, 1, 0, 0, 1), (1, 1, 5, 0, 0),
             (0, 1, 0, 6, 0), (0, 1, 0, 4, 3))
  .toDF("output_label", "ID", "C1", "C2", "C3")
Then we build the list of columns that we are interested in, the aggregations, and compute the result.
val cols = (1 to 3).map(i => s"C$i")
val aggs = cols.map(name => sum(col(name)).as(name))
val agg_df = df.agg(aggs.head, aggs.tail :_*) // See the note below
agg_df.show
+---+---+---+
| C1| C2| C3|
+---+---+---+
| 5| 10| 4|
+---+---+---+
We almost have what we need, we just need to collect the data and build a new dataframe:
val agg_row = agg_df.first
cols.map(name => name -> agg_row.getAs[Long](name))
.toDF("column", "sum")
.show
+------+---+
|column|sum|
+------+---+
| C1| 5|
| C2| 10|
| C3| 4|
+------+---+
EDIT:
NB: df.agg(aggs.head, aggs.tail :_*) may seem strange. The idea is simply to compute all the aggregations defined in aggs. One would expect something simpler, like df.agg(aggs : _*). Yet the signature of the agg method is as follows:
def agg(expr: org.apache.spark.sql.Column, exprs: org.apache.spark.sql.Column*)
probably to ensure that at least one column is used, and this is why you need to split aggs into aggs.head and aggs.tail.
What I do is define a method that creates a struct from the desired values:
def kv(columnsToTranspose: Array[String]) = explode(array(columnsToTranspose.map {
  c => struct(lit(c).alias("k"), col(c).alias("v"))
}: _*))
This function receives the list of columns to transpose (the last 3 columns in your case) and transforms each of them into a struct with the column name as key and the column value as value.
Then use that method to create the structs and process them as you want:
df.withColumn("kv", kv(df.columns.tail.tail))
.select( $"kv.k".as("column"), $"kv.v".alias("values"))
.groupBy("column")
.agg(sum("values").as("sum"))
First apply the previously defined function so the desired columns become that struct, then deconstruct the struct to get a key column and a value column in each row.
Then you can aggregate by the column name and sum the values.
INPUT
+------------+---+---+---+---+
|output_label| id| c1| c2| c3|
+------------+---+---+---+---+
| 1| 1| 0| 0| 1|
| 1| 1| 5| 0| 0|
| 0| 1| 0| 6| 0|
| 0| 1| 0| 4| 3|
+------------+---+---+---+---+
OUTPUT
+------+---+
|column|sum|
+------+---+
| c1| 5|
| c3| 4|
| c2| 10|
+------+---+

How to parallelize processing of a dataframe in Apache Spark with combinations over a column

I'm looking for a solution to build an aggregation over all combinations of the values in a column. For example, I have a data frame as below:
val df = Seq(("A", 1), ("B", 2), ("C", 3), ("A", 4), ("B", 5)).toDF("id", "value")
+---+-----+
| id|value|
+---+-----+
| A| 1|
| B| 2|
| C| 3|
| A| 4|
| B| 5|
+---+-----+
I am looking for an aggregation over all combinations of the values in the column "id". Below is a solution I found, but it cannot exploit Spark's parallelism: it works only on the driver node or on a single executor. Is there a better solution that gets rid of the for loop?
import spark.implicits._
import org.apache.spark.sql.Row
import org.apache.spark.sql.functions._
import org.apache.spark.sql.types.{IntegerType, StringType, StructField, StructType}

val list = df.select($"id").distinct().orderBy($"id").as[String].collect()
val combinations = (1 to list.length flatMap (x => list.combinations(x))) filter (_.length > 1)

val schema = StructType(
  StructField("indexvalue", IntegerType, true) ::
  StructField("segment", StringType, true) :: Nil)

var initialDF = spark.createDataFrame(sc.emptyRDD[Row], schema)

for (x <- combinations) {
  initialDF = initialDF.union(df.filter($"id".isin(x: _*))
    .agg(expr("sum(value)").as("indexvalue"))
    .withColumn("segment", lit(x.mkString("+"))))
}
initialDF.show()
initialDF.show()
+----------+-------+
|indexvalue|segment|
+----------+-------+
| 12| A+B|
| 8| A+C|
| 10| B+C|
| 15| A+B+C|
+----------+-------+
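One possible direction (a sketch I'm adding here, not from the original post): instead of looping and unioning on the driver, materialize the (segment, id) pairs as a small DataFrame, join it to df, and aggregate once, so the sums run as a single distributed job. This still assumes the distinct ids themselves fit on the driver, as in the original code.
import spark.implicits._
import org.apache.spark.sql.functions.sum

val ids = df.select($"id").distinct().orderBy($"id").as[String].collect()
// all combinations of length >= 2, e.g. A+B, A+C, B+C, A+B+C
val combos = (2 to ids.length).flatMap(n => ids.combinations(n).map(_.toSeq))

// one row per (segment, member id)
val segments = combos
  .flatMap(c => c.map(id => (c.mkString("+"), id)))
  .toDF("segment", "id")

df.join(segments, "id")
  .groupBy("segment")
  .agg(sum("value").as("indexvalue"))
  .show()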

Spark Dataframe - Method to take a row as input & a dataframe as output

I need to write a method that iterates over all the rows of DF2 and generates a DataFrame based on some conditions.
Here are the inputs DF1 & DF2:
val df1Columns = Seq("Eftv_Date","S_Amt","A_Amt","Layer","SubLayer")
val df2Columns = Seq("Eftv_Date","S_Amt","A_Amt")
var df1 = List(
  List("2016-10-31", "1000000", "1000", "0", "1"),
  List("2016-12-01", "100000", "950", "1", "1"),
  List("2017-01-01", "50000", "50", "2", "1"),
  List("2017-03-01", "50000", "100", "3", "1"),
  List("2017-03-30", "80000", "300", "4", "1")
).map(row => (row(0), row(1), row(2), row(3), row(4))).toDF(df1Columns: _*)
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
|2017-03-01| 50000| 100| 3| 1|
|2017-03-30| 80000| 300| 4| 1|
+----------+-------+-----+-----+--------+
val df2 = List(
  List("2017-02-01", "0", "400")
).map(row => (row(0), row(1), row(2))).toDF(df2Columns: _*)
+----------+-----+-----+
| Eftv_Date|S_Amt|A_Amt|
+----------+-----+-----+
|2017-02-01| 0| 400|
+----------+-----+-----+
Now I need to write a method that filters DF1 based on the Eftv_Date values from each row of DF2.
For example, the first row of DF2 has Eftv_Date = 2017-02-01, so I need to filter DF1 to the records whose Eftv_Date is less than or equal to that date. This will generate the 3 records below:
Expected Result :
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
I have written the method as below and called it using the map function:
def transformRows(row: Row) = {
  val dateEffective = row.getAs[String]("Eftv_Date")
  val df1LayerMet = df1.where(col("Eftv_Date").leq(dateEffective))
  df1 = df1LayerMet
  df1
}
val x = df2.map(transformRows)
But while calling this I am facing this error:
Error:(154, 24) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val x = df2.map(transformRows)
Note: We could implement this using a join, but I need to implement a custom Scala method, since there are a lot of transformations involved. For simplicity I have mentioned only one condition.
Seems you need a non-equi join:
df1.alias("a").join(
df2.select("Eftv_Date").alias("b"),
df1("Eftv_Date") <= df2("Eftv_Date") // non-equi join condition
).select("a.*").show
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
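If a custom per-row Scala method is really required instead of the join, a hedged sketch of a workaround (my addition, not part of the answer above): the map over df2 fails because there is no Encoder for a DataFrame result, and df1 cannot be used inside executor-side code anyway, so collect the small df2 to the driver and drive the filtering from there:
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// df2 is assumed small enough to collect to the driver
val dates = df2.select("Eftv_Date").collect().map(_.getString(0))

// one filtered DataFrame per df2 row; further transformations can go here
val filteredFrames: Array[DataFrame] =
  dates.map(d => df1.where(col("Eftv_Date") <= d))

filteredFrames.foreach(_.show())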

Simple Roll-Down With Spark Dataframe (Scala)

If you have a simple dataframe that looks like this:
val n = sc.parallelize(List[String](
  "Alice", null, null,
  "Bob", null, null,
  "Chuck"
)).toDF("name")
Which looks like this:
//+-----+
//| name|
//+-----+
//|Alice|
//| null|
//| null|
//| Bob|
//| null|
//| null|
//|Chuck|
//+-----+
How can I use dataframe roll-down functions to get:
//+-----+
//| name|
//+-----+
//|Alice|
//|Alice|
//|Alice|
//| Bob|
//| Bob|
//| Bob|
//|Chuck|
//+-----+
Note: Please state any needed imports; I suspect these include:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.{WindowSpec, Window}
Note: Some sites I tried to mimic are:
http://xinhstechblog.blogspot.com/2016/04/spark-window-functions-for-dataframes.html
and
https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html
I've come across something like this in the past, so I realize that Spark versions will differ. I am using 1.5.2 on the cluster (where this solution is more useful) and 2.0 in local emulation, so I prefer a 1.5.2-compatible solution.
Also, I'd like to get away from writing SQL directly - that is, avoid using sqlContext.sql(...).
If you have another column that allows grouping of the values, here's a suggestion:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._
val df = Seq(
  (Some("Alice"), 1),
  (None, 1),
  (None, 1),
  (Some("Bob"), 2),
  (None, 2),
  (None, 2),
  (Some("Chuck"), 3)
).toDF("name", "group")
val result = df.withColumn("new_col", min(col("name")).over(Window.partitionBy("group")))
result.show()
+-----+-----+-------+
| name|group|new_col|
+-----+-----+-------+
|Alice| 1| Alice|
| null| 1| Alice|
| null| 1| Alice|
| Bob| 2| Bob|
| null| 2| Bob|
| null| 2| Bob|
|Chuck| 3| Chuck|
+-----+-----+-------+
On the other hand, if you only have a column that allows ordering, but not grouping, the solution is a little harder. My first idea is to create a subset and then do a join:
import org.apache.spark.sql.functions._
import org.apache.spark.sql.expressions.Window
import sqlContext.implicits._
val df = Seq(
  (Some("Alice"), 1),
  (None, 2),
  (None, 3),
  (Some("Bob"), 4),
  (None, 5),
  (None, 6),
  (Some("Chuck"), 7)
).toDF("name", "order")
val subset = df
  .select("name", "order")
  .where(col("name").isNotNull)
  .withColumn("next", lead("order", 1).over(Window.orderBy("order")))

val partial = df.as("a")
  .join(subset.as("b"), col("a.order") >= col("b.order") && (col("a.order") < subset("next")), "left")

val result = partial.select(coalesce(col("a.name"), col("b.name")).as("name"), col("a.order"))
result.show()
+-----+-----+
| name|order|
+-----+-----+
|Alice| 1|
|Alice| 2|
|Alice| 3|
| Bob| 4|
| Bob| 5|
| Bob| 6|
|Chuck| 7|
+-----+-----+
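As a side note (my addition, not part of the answer above, and it needs Spark 2.0+ for last(..., ignoreNulls), so it will not help on the 1.5.2 cluster): the ordering-only case can also be written with last over a running window, which skips the join entirely:
import org.apache.spark.sql.functions.{col, last}
import org.apache.spark.sql.expressions.Window

// carry the last non-null name forward, ordered by the "order" column
val w = Window.orderBy("order").rowsBetween(Long.MinValue, 0)
val filled = df.withColumn("name", last(col("name"), ignoreNulls = true).over(w))
filled.show()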