how to convert assembler vector to data frame? - scala

I just used VectorAssembler to normalize my features for an ML application.
def kmeansClustering ( k : Int ) : sql.DataFrame = {
val assembler = new VectorAssembler()
val intermediaireDF = assembler
val kmeans = new KMeans().setK(k).setSeed(1L)
val model = kmeans.fit(intermediaireDF)
val predictions = model.transform(intermediaireDF)
As a result I got a DataFrame with two columns:
| features|prediction|
|[-27.482279,153.0...| 0|
|[-27.47059,153.03...| 2|
|[-27.474531,153.0...| 3|
So I want to perform something like avg and std by group for each column but the features are assembled and I can't do manipulation on them.
I tried using a VectorDisassembler, but it did not work:
val disassembler = new VectorDisassembler().setInputCol("vectorCol")
Any suggestions?

Actually you do not need to remove the original columns to perform your clustering.
// creating sample data
val df = spark.range(10).select('id as "a", 'id %3 as "b")
val assembler = new VectorAssembler()
.setInputCols(Array("a", "b")).setOutputCol("features")
// Here I delete the select so as to keep all the columns
val intermediaireDF = assembler.transform(this.filterNumeric())
// I specify explicitly what the feature column is
val kmeans = new KMeans().setK( 2 ).setSeed(1L).setFeaturesCol("features")
// And the rest remains unchanged
val model = kmeans.fit(intermediaireDF)
val predictions = model.transform(intermediaireDF)
| a| b| features|prediction|
| 1| 0| [1.0,0.0]| 1|
| 2| 1| [2.0,1.0]| 1|
| 3| 2| [3.0,2.0]| 1|
| 4| 0| [4.0,0.0]| 1|
| 5| 1| [5.0,1.0]| 0|
| 6| 2| [6.0,2.0]| 0|
And from there, you can compute what you need.
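As a minimal sketch of that last step, assuming a `predictions` DataFrame that still carries the original columns `a` and `b` next to `prediction`, the per-cluster mean and standard deviation can be computed with a plain `groupBy`. The object and method names are hypothetical, and a small hand-written table stands in for the output of a fitted KMeans model.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{avg, col, stddev}

object ClusterStats {
  // Mean and sample standard deviation of each feature column, per cluster.
  def statsByCluster(predictions: DataFrame, featureCols: Seq[String]): DataFrame = {
    val aggs = featureCols.flatMap { c =>
      Seq(avg(col(c)).alias(s"avg_$c"), stddev(col(c)).alias(s"std_$c"))
    }
    predictions.groupBy("prediction").agg(aggs.head, aggs.tail: _*)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("cluster-stats").getOrCreate()
    import spark.implicits._
    // Hand-written stand-in for model.transform(...): columns a, b, prediction.
    val predictions = Seq((1L, 0L, 1), (2L, 1L, 1), (5L, 1L, 0), (6L, 2L, 0))
      .toDF("a", "b", "prediction")
    statsByCluster(predictions, Seq("a", "b")).orderBy("prediction").show()
    spark.stop()
  }
}
```

Because the raw columns are kept alongside `features`, no disassembling of the vector is needed for this.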


Scala (NOT pyspark) map linear regression coefficients to feature names (categorical and continuous)

I have a dataframe in scala that looks like this
| id|group| normalized_amount|query_id| y| y1|group1|groupIndexed| groupEncoded|
| 1| B| 0.22874172014806| 1| 0.317739988492575| 0| B| 1.0|(2,[1],[1.0])|
| 2| A| -1.42432215217563| 2| -1.32008967486074| 0| C| 0.0|(2,[0],[1.0])|
| 3| B| -2.03644548423379| 3| -1.65740392834359| 0| B| 1.0|(2,[1],[1.0])|
| 4| B| 0.425753803902096| 4|-0.127591370989296| 0| C| 0.0|(2,[0],[1.0])|
| 5| A| 0.521050829955076| 5| 0.824285664580579| 1| A| 2.0| (2,[],[])|
| 6| A|-0.0416682439998418| 6| 0.321350404322885| 1| C| 0.0|(2,[0],[1.0])|
| 7| A| -1.2787327462978| 7| -0.88099379032367| 0| A| 2.0| (2,[],[])|
| 8| A| 0.431780409975322| 8| 0.575249966796747| 1| C| 0.0|(2,[0],[1.0])|
And I'm performing a linear regression of y on group1 (a categorical variable of 3 categories) and normalized_amount (a continuous variable) as follows
var assembler = new VectorAssembler().setInputCols(Array("groupEncoded", "normalized_amount")).setOutputCol("features")
val dfFeatures = assembler.transform(df)
var lr = new LinearRegression()
var lrModel = lr.fit(dfFeatures)
var lrPrediction = lrModel.transform(dfFeatures)
I can access coefficients and standard errors as follows
lrModel.coefficients //model coefficient estimates (not intercept)
lrModel.summary.coefficientStandardErrors //standard error of intercept and coefficients, not sure in which order
My questions are
how can I figure out which feature corresponds to which coefficient estimate (for categorical values, I need to figure out the coefficient of each category)? Same with standard errors?
how can I choose which category to "leave out" as the reference category?
how to perform a linear regression with no intercept?
I've seen some answers to similar questions, but they are all in pyspark and not in scala, and I'm only using scala
Given the transformed DataFrame (the one that includes the prediction) and a LogisticRegressionModel, you can access the attributes of the VectorAssembler output column. This code is from Databricks; I slightly adapted it to use a LogisticRegressionModel instead of a Pipeline. Note that you can choose whether you want the intercept to be estimated or not:
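import org.apache.spark.ml.attribute.AttributeGroup
import org.apache.spark.ml.classification.LogisticRegressionModel
import org.apache.spark.ml.regression.LinearRegression
import org.apache.spark.sql.DataFrame

val lrToFit: LinearRegression = ???
// With this dataframe as your transformed df that includes the prediction
val df: DataFrame = ???
val lr: LogisticRegressionModel = ???
val schema = df.schema
// Using the schema, the attributes of the VectorAssembler output (features) can be extracted
val features = AttributeGroup.fromStructField(schema(lr.getFeaturesCol))
// Feature names in vector order, with the intercept first when it was fitted
val featureNames: Array[String] = if (lr.getFitIntercept) {
  Array("(Intercept)") ++ features.attributes.get.map(_.name.get)
} else {
  features.attributes.get.map(_.name.get)
}
val coefficients = lr.coefficients.toArray
// Put the intercept first so it lines up with "(Intercept)" in featureNames
val coeffs = if (lr.getFitIntercept) {
  Array(lr.intercept) ++ coefficients
} else {
  coefficients
}
featureNames.zip(coeffs).foreach { case (feature, coeff) =>
  println(s"$feature\t$coeff")
}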
This is a method that can be used if you load a pretrained model because in that case you might not know the order of the features in the VectorAssembler transformation. I think that you will need to select the reference category manually.
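On the no-intercept part of the question, a minimal sketch: calling `setFitIntercept(false)` on the estimator switches the intercept off before fitting. The label column `y` and the features column `features` are taken from the question; the object name is hypothetical.

```scala
import org.apache.spark.ml.regression.LinearRegression

object NoInterceptRegression {
  // Estimator configured to regress y on the assembled features without an intercept.
  val lrNoIntercept: LinearRegression = new LinearRegression()
    .setFeaturesCol("features")
    .setLabelCol("y")
    .setFitIntercept(false)
}
```

Fitting would then look like `NoInterceptRegression.lrNoIntercept.fit(dfFeatures)`. With the intercept disabled, `getFitIntercept` returns false, so code that branches on it will not add an "(Intercept)" entry to the feature names.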

Selecting specific rows from different dataframes within a map scope

Hello I am new to Spark and scala, and I have three similar dataframes as the following:
| Country|1/22/20|1/23/20|1/24/20|
| Chad| 1| 0| 5|
|Paraguay| 4| 6| 3|
| Russia| 0| 0| 1|
df2 and df3 have exactly the same structure, just with different values.
I would like to apply a function to each row of df1 but I also need to select the same row (using the Country as key) from the other two dataframes because I need the selected rows as input arguments for the function I want to apply.
I thought of using{ r =>
val selectedRowDf2 = selectRow using r at column "Country" ...
val selectedRowDf3 = selectRow using r at column "Country" ...
r.apply(functionToApply(r, selectedRowDf2, selectedRowDf3)
I also tried with map but I get an error as follows:
Error:(238, 23) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[Unit])org.apache.spark.sql.Dataset[Unit].
Unspecified value parameter evidence$6.{
A possible approach is to prefix the columns of each DataFrame with a key that uniquely identifies them, and then join all the DataFrames into a single DataFrame on the country column. The desired operation can then be performed on each row of the merged DataFrame.
def appendColWithKey(df: DataFrame, key: String) = {
  var newdf = df
  // Prefix every column name with the key
  df.schema.foreach(s => {
    newdf = newdf.withColumnRenamed(s.name, s"$key${s.name}")
  })
  newdf
}
val kdf1 = appendColWithKey(df1, "key1_")
val kdf2 = appendColWithKey(df2, "key2_")
val kdf3 = appendColWithKey(df3, "key3_")
val tempdf1 = kdf1.join(kdf2, col("key1_country") === col("key2_country"))
val tempdf = tempdf1.join(kdf3, col("key1_country") === col("key3_country"))
val finaldf = tempdf
.withColumnRenamed("key1_country", "country")
| country|key1_1/22/20|key1_1/23/20|key1_1/24/20|key2_1/22/20|key2_1/23/20|key2_1/24/20|key3_1/22/20|key3_1/23/20|key3_1/24/20|
| Chad| 1| 0| 5| 1| 0| 5| 1| 0| 5|
|Paraguay| 4| 6| 3| 4| 6| 3| 4| 6| 3|
| Russia| 0| 0| 1| 0| 0| 1| 0| 0| 1|

Spark Dataframe - Method to take row as input & dataframe as output

I need to write a method that iterates over all the rows of DF2 and generates a DataFrame based on some conditions.
Here are the inputs DF1 & DF2:
val df1Columns = Seq("Eftv_Date","S_Amt","A_Amt","Layer","SubLayer")
val df2Columns = Seq("Eftv_Date","S_Amt","A_Amt")
var df1 = List(
.map(row =>(row(0), row(1),row(2),row(3),row(4))).toDF(df1Columns:_*)
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
|2017-03-01| 50000| 100| 3| 1|
|2017-03-30| 80000| 300| 4| 1|
val df2 = List(
).map(row =>(row(0), row(1),row(2))).toDF(df2Columns:_*)
| Eftv_Date|S_Amt|A_Amt|
|2017-02-01| 0| 400|
Now I need to write a method that filters DF1 based on the Eftv_Date values from each row of DF2.
For example, the first row of df2 has Eftv_Date = Feb 01 2017, so I need to filter df1 to the records with Eftv_Date less than or equal to Feb 01 2017. This will generate 3 records as below:
Expected Result :
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
I have written the method as below and called it using map function.
def transformRows(row: Row ) = {
val dateEffective = row.getAs[String]("Eftv_Date")
val df1LayerMet = df1.where(col("Eftv_Date").leq(dateEffective))
df1 = df1LayerMet
val x =
But while calling this I am facing this error:
Error:(154, 24) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val x =
Note: We can implement this using a join, but I need to implement a custom Scala method to do this, since there are a lot of transformations involved. For simplicity I have mentioned only one condition.
Seems you need a non-equi join:
df1("Eftv_Date") <= df2("Eftv_Date") // non-equi join condition
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
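A minimal sketch of how such a non-equi join could be written end to end, using the sample rows from the question. The object name, the method name and the `cutoff` alias are hypothetical; the dates are kept as `yyyy-MM-dd` strings, so comparing them as strings agrees with date order.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

object NonEquiJoin {
  // Pair every df1 row with each df2 cutoff date that is on or after it.
  def filterByEffectiveDate(df1: DataFrame, df2: DataFrame): DataFrame = {
    // Rename the df2 date so the two Eftv_Date columns cannot be confused.
    val cutoffs = df2.select(df2("Eftv_Date").alias("cutoff"))
    df1.join(cutoffs, df1("Eftv_Date") <= cutoffs("cutoff"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[1]").appName("non-equi-join").getOrCreate()
    import spark.implicits._
    val df1 = Seq(
      ("2016-10-31", 1000000, 1000, 0, 1),
      ("2016-12-01", 100000, 950, 1, 1),
      ("2017-01-01", 50000, 50, 2, 1),
      ("2017-03-01", 50000, 100, 3, 1),
      ("2017-03-30", 80000, 300, 4, 1)
    ).toDF("Eftv_Date", "S_Amt", "A_Amt", "Layer", "SubLayer")
    val df2 = Seq(("2017-02-01", 0, 400)).toDF("Eftv_Date", "S_Amt", "A_Amt")
    filterByEffectiveDate(df1, df2).show()
    spark.stop()
  }
}
```

The `cutoff` column is kept in the result so that, when df2 has several rows, each output row can be traced back to the df2 row that selected it. It can be dropped afterwards if only the df1 columns are wanted.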

How to map adjacent elements in scala

I have an RDD[String] in device,timestamp,on/off format. How do I calculate the amount of time each device is switched on? What is the best way of doing this in Spark?
on means 1 and off means 0
Intermediate step 1
A,((1335953754 - 1335952933),(1335995228 - 1335994294))
B,((1336002622- 1336001513),(1336007462 - 1336006905))
Intermediate step 2
I'll assume that the RDD[String] can be parsed into an RDD of DeviceLog, where DeviceLog is:
case class DeviceLog(id: String, timestamp: Long, onoff: Int)
The DeviceLog class is pretty straightforward.
// initialize contexts
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
These initialize the Spark context and SQL context that we'll use for DataFrames.
Step 1:
val input = List(
val df = input.toDF()
| id| timestamp|onoff|
| A|1335952933| 1|
| A|1335953754| 0|
| A|1335994294| 1|
| A|1335995228| 0|
| B|1336001513| 1|
| B|1336002622| 0|
| B|1336006905| 1|
| B|1336007462| 0|
Step 2: Partition by device id, order by timestamp and retain pair information (on/off)
val wSpec = Window.partitionBy("id").orderBy("timestamp")
val df1 = df
.withColumn("spend", lag("timestamp", 1).over(wSpec))
.withColumn("one", lag("onoff", 1).over(wSpec))
.where($"spend".isNotNull)
| id| timestamp|onoff| spend|one|
| A|1335953754| 0|1335952933| 1|
| A|1335994294| 1|1335953754| 0|
| A|1335995228| 0|1335994294| 1|
| B|1336002622| 0|1336001513| 1|
| B|1336006905| 1|1336002622| 0|
| B|1336007462| 0|1336006905| 1|
Step 3: Compute upTime and filter by criteria
val df2 = df1
.withColumn("upTime", $"timestamp" - $"spend")
.withColumn("criteria", $"one" - $"onoff")
.where($"criteria" === 1)
| id| timestamp|onoff| spend|one|upTime|criteria|
| A|1335953754| 0|1335952933| 1| 821| 1|
| A|1335995228| 0|1335994294| 1| 934| 1|
| B|1336002622| 0|1336001513| 1| 1109| 1|
| B|1336007462| 0|1336006905| 1| 557| 1|
Step 4: group by id and sum
val df3 = df2.groupBy($"id").agg(sum("upTime"))
| id|sum(upTime)|
| A| 1755|
| B| 1666|
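The same adjacent-pair logic can also be sketched on plain Scala collections, which may be handy for checking it without a Spark session. This is only an illustration under the assumption that the logs fit in memory; the `UpTime` object and its members are hypothetical names, and the sample values are the ones from the table in Step 1.

```scala
final case class DeviceLog(id: String, timestamp: Long, onoff: Int)

object UpTime {
  // Sample rows copied from the Step 1 table.
  val sample: Seq[DeviceLog] = Seq(
    DeviceLog("A", 1335952933L, 1), DeviceLog("A", 1335953754L, 0),
    DeviceLog("A", 1335994294L, 1), DeviceLog("A", 1335995228L, 0),
    DeviceLog("B", 1336001513L, 1), DeviceLog("B", 1336002622L, 0),
    DeviceLog("B", 1336006905L, 1), DeviceLog("B", 1336007462L, 0)
  )

  // For each device: sort by time, look at adjacent pairs with sliding(2),
  // and add up the gaps where an "on" row is directly followed by an "off" row.
  def upTimePerDevice(logs: Seq[DeviceLog]): Map[String, Long] =
    logs.groupBy(_.id).map { case (id, deviceLogs) =>
      val total = deviceLogs
        .sortBy(_.timestamp)
        .sliding(2)
        .collect {
          case Seq(prev, cur) if prev.onoff == 1 && cur.onoff == 0 =>
            cur.timestamp - prev.timestamp
        }
        .sum
      id -> total
    }
}
```

`UpTime.upTimePerDevice(UpTime.sample)` gives 1755 for A and 1666 for B, the same totals as the window-function version above. Here `sliding(2)` plays the role of `lag(..., 1)`.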

How to combine (join) information across an Array[DataFrame]

I have an Array[DataFrame] and I want to check, for each row of each data frame, if there is any change in the values by column. Say I have the first row of three data frames, like:
The first column is the ID, and my ideal output for this ID would be:
(0, 1, 1, 0)
meaning that the second and third columns changed while the fourth did not.
I attach here a bit of data to replicate my setting
val rdd = sc.parallelize(Array((0,1.0,0.4,0.1),
val rdd2 = sc.parallelize(Array((0,3.0,0.2,0.1),
val rdd3 = sc.parallelize(Array((0,5.0,0.4,0.1),
val df = rdd.toDF("id", "prop1", "prop2", "prop3")
val df2 = rdd2.toDF("id", "prop1", "prop2", "prop3")
val df3 = rdd3.toDF("id", "prop1", "prop2", "prop3")
val result:Array[DataFrame] = new Array[DataFrame](3)
result.update(0, df)
result.update(1, df2)
result.update(2, df3)
How can I map over the array and get my output?
You can use countDistinct with groupBy:
import org.apache.spark.sql.functions.{countDistinct}
val exprs = Seq("prop1", "prop2", "prop3")
.map(c => (countDistinct(c) > 1).cast("integer").alias(c))
val combined = result.reduce(_ unionAll _)
val aggregatedViaGroupBy = combined
  .groupBy("id")
  .agg(exprs.head, exprs.tail: _*)
// +---+-----+-----+-----+
// | id|prop1|prop2|prop3|
// +---+-----+-----+-----+
// | 0| 1| 1| 0|
// | 1| 1| 0| 0|
// | 2| 1| 1| 1|
// | 3| 1| 1| 1|
// | 4| 0| 0| 0|
// +---+-----+-----+-----+
First we need to join all the DataFrames together.
val combined = result.reduceLeft((a,b) => a.join(b,"id"))
To compare all the columns with the same label (e.g., "prop1"), I found it easier (at least for me) to operate at the RDD level. We first transform the data into (id, Seq[Double]).
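val finalResults = combined.rdd.map {
  x =>
    // (id, all remaining property values as doubles)
    (x.getInt(0), x.toSeq.tail.map(_.asInstanceOf[Double]))
}.map {
  case (i, d) =>
    // 0 if every value for the property is identical, 1 otherwise
    def checkAllEqual(l: Seq[Double]) = if (l.toSet.size == 1) 0 else 1
    // one group of three property values per DataFrame
    val g = d.grouped(3).toList
    val g1 = checkAllEqual(g.map(x => x(0)))
    val g2 = checkAllEqual(g.map(x => x(1)))
    val g3 = checkAllEqual(g.map(x => x(2)))
    (i, g1, g2, g3)
}.toDF("id", "prod1", "prod2", "prod3")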
This will print:
| id|prod1|prod2|prod3|
| 0| 1| 1| 0|
| 1| 1| 0| 0|
| 2| 1| 1| 1|
| 3| 1| 1| 1|
| 4| 0| 0| 0|
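The grouping step can be tried on its own with plain Scala collections. A small sketch, assuming the joined row for one id has already been flattened into a Seq[Double] holding three properties per DataFrame; `ChangeFlags`, `changeFlags` and `row0` are hypothetical names, and `row0` uses the id 0 values from the question.

```scala
object ChangeFlags {
  // 0 if every value is identical, 1 if at least one differs.
  def checkAllEqual(values: Seq[Double]): Int =
    if (values.toSet.size == 1) 0 else 1

  // Split the flat row into one chunk per DataFrame, transpose so that each
  // inner list holds one property across all DataFrames, then flag changes.
  def changeFlags(flat: Seq[Double], propsPerFrame: Int): Seq[Int] =
    flat.grouped(propsPerFrame).toList.transpose.map(checkAllEqual)

  // Values for id 0 from the three DataFrames, in join order.
  val row0: Seq[Double] = Seq(1.0, 0.4, 0.1, 3.0, 0.2, 0.1, 5.0, 0.4, 0.1)
}
```

`ChangeFlags.changeFlags(ChangeFlags.row0, 3)` yields 1, 1, 0, which matches the row for id 0 in the table above. Using `transpose` avoids writing one line per property, at the cost of requiring every chunk to have the same length.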