How to convert an assembler vector back to data frame columns? - Scala

I just used VectorAssembler to normalize my features for an ML application.
def kmeansClustering(k: Int): sql.DataFrame = {
  val assembler = new VectorAssembler()
    .setInputCols(this.listeOfName())
    .setOutputCol("features")
  val intermediaireDF = assembler
    .transform(this.filterNumeric())
    .select("features")
  val kmeans = new KMeans().setK(k).setSeed(1L)
  val model = kmeans.fit(intermediaireDF)
  val predictions = model.transform(intermediaireDF)
  predictions
}
As a result I got a DataFrame with just the assembled features vector and the prediction:
+--------------------+----------+
| features|prediction|
+--------------------+----------+
|[-27.482279,153.0...| 0|
|[-27.47059,153.03...| 2|
|[-27.474531,153.0...| 3|
.................................
Now I want to compute something like the average and standard deviation per cluster for each original column, but the features are assembled into a single vector and I can't manipulate them individually.
I've tried to use org.apache.spark.ml.feature.VectorDisassembler, but it did not work.
val disassembler = new VectorDisassembler().setInputCol("vectorCol")
disassembler.transform(df).show()
Any suggestions?

Actually you do not need to remove the original columns to perform your clustering.
// creating sample data
val df = spark.range(10).select('id as "a", 'id % 3 as "b")
val assembler = new VectorAssembler()
  .setInputCols(Array("a", "b"))
  .setOutputCol("features")

// Here I drop the .select so as to keep all the columns
val intermediaireDF = assembler.transform(df)

// I specify explicitly which column holds the features
val kmeans = new KMeans().setK(2).setSeed(1L).setFeaturesCol("features")

// And the rest remains unchanged
val model = kmeans.fit(intermediaireDF)
val predictions = model.transform(intermediaireDF)
predictions.show(6)
+---+---+----------+----------+
| a| b| features|prediction|
+---+---+----------+----------+
| 1| 0| [1.0,0.0]| 1|
| 2| 1| [2.0,1.0]| 1|
| 3| 2| [3.0,2.0]| 1|
| 4| 0| [4.0,0.0]| 1|
| 5| 1| [5.0,1.0]| 0|
| 6| 2| [6.0,2.0]| 0|
+---+---+----------+----------+
And from there, you can compute what you need.
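For instance, the per-cluster statistics asked about in the question can now be computed directly on the untouched columns. A minimal sketch, assuming the sample columns a and b from above (avg and stddev come from org.apache.spark.sql.functions):

import org.apache.spark.sql.functions.{avg, stddev}

// mean and standard deviation of each original column, per cluster
predictions
  .groupBy("prediction")
  .agg(
    avg("a").alias("avg_a"), stddev("a").alias("std_a"),
    avg("b").alias("avg_b"), stddev("b").alias("std_b"))
  .show()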

Related

Scala (NOT pyspark) map linear regression coefficients to feature names (categorical and continuous)

I have a DataFrame in Scala that looks like this
df.show
+---+-----+-------------------+--------+------------------+--------+------+------------+-------------+
| id|group| normalized_amount|query_id| y| y1|group1|groupIndexed| groupEncoded|
+---+-----+-------------------+--------+------------------+--------+------+------------+-------------+
| 1| B| 0.22874172014806| 1| 0.317739988492575| 0| B| 1.0|(2,[1],[1.0])|
| 2| A| -1.42432215217563| 2| -1.32008967486074| 0| C| 0.0|(2,[0],[1.0])|
| 3| B| -2.03644548423379| 3| -1.65740392834359| 0| B| 1.0|(2,[1],[1.0])|
| 4| B| 0.425753803902096| 4|-0.127591370989296| 0| C| 0.0|(2,[0],[1.0])|
| 5| A| 0.521050829955076| 5| 0.824285664580579| 1| A| 2.0| (2,[],[])|
| 6| A|-0.0416682439998418| 6| 0.321350404322885| 1| C| 0.0|(2,[0],[1.0])|
| 7| A| -1.2787327462978| 7| -0.88099379032367| 0| A| 2.0| (2,[],[])|
| 8| A| 0.431780409975322| 8| 0.575249966796747| 1| C| 0.0|(2,[0],[1.0])|
And I'm performing a linear regression of y on group1 (a categorical variable of 3 categories) and normalized_amount (a continuous variable) as follows
var assembler = new VectorAssembler().setInputCols(Array("groupEncoded", "normalized_amount")).setOutputCol("features")
val dfFeatures = assembler.transform(df)
var lr = new LinearRegression()
var lrModel = lr.fit(dfFeatures)
var lrPrediction = lrModel.transform(dfFeatures)
I can access coefficients and standard errors as follows
lrModel.intercept
lrModel.coefficients //model coefficient estimates (not intercept)
lrModel.summary.coefficientStandardErrors //standard error of intercept and coefficients, not sure in which order
My questions are
how can I figure out which feature corresponds to which coefficient estimate (for categorical variables, I need the coefficient of each category)? Same for the standard errors?
how can I choose which category to "leave out" as the reference category?
how to perform a linear regression with no intercept?
I've seen some answers to similar questions, but they are all in PySpark and not in Scala, and I'm only using Scala.
With a DataFrame like your transformed df (the one that includes the prediction) and the fitted model, you can access the attributes of the VectorAssembler output column. This code comes from Databricks; I slightly adapted it for a LogisticRegressionModel used directly instead of inside a Pipeline (the same approach works for a LinearRegressionModel). Note that you can choose whether or not to estimate the intercept:
import org.apache.spark.ml.attribute.AttributeGroup

val lrToFit: LinearRegression = ???
lrToFit.setFitIntercept(false)

// With this dataframe as your transformed df that includes the prediction
val df: DataFrame = ???
val lr: LogisticRegressionModel = ???
val schema = df.schema

// Using the schema, the attribute names of the VectorAssembler output ("features") can be extracted
val features = AttributeGroup.fromStructField(schema(lr.getFeaturesCol)).attributes.get.map(_.name.get)
val featureNames: Array[String] = if (lr.getFitIntercept) {
  Array("(Intercept)") ++ features
} else {
  features
}
val coefficients = lr.coefficients.toArray
val coeffs = if (lr.getFitIntercept) {
  Array(lr.intercept) ++ coefficients // intercept first, to line up with "(Intercept)" above
} else {
  coefficients
}
featureNames.zip(coeffs).foreach { case (feature, coeff) =>
  println(s"$feature\t$coeff")
}
This is the method to use if you load a pretrained model, because in that case you might not know the order of the features coming out of the VectorAssembler transformation. I think you will need to select the reference category manually.
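For the standard errors: if I read the Spark documentation correctly, lrModel.summary.coefficientStandardErrors is only populated when the "normal" solver is used, and it lists the coefficient errors in feature order with the intercept error last. A hedged sketch, reusing the features array built above together with the lrModel fitted in the question:

// Sketch, not verified on your Spark version: coefficientStandardErrors is documented
// to return the coefficient errors first and the intercept error last ("normal" solver only)
val stdErrs = lrModel.summary.coefficientStandardErrors
val errNames = if (lrModel.getFitIntercept) features ++ Array("(Intercept)") else features
errNames.zip(stdErrs).foreach { case (feature, se) =>
  println(s"$feature\t$se")
}

As for the reference category: with the usual StringIndexer + OneHotEncoder pipeline, the level that ends up with the last index is the one dropped when dropLast is true, so you can influence it through StringIndexer.setStringOrderType (Spark 2.3+) or keep every level with OneHotEncoder.setDropLast(false); as far as I know there is no option to name the reference level directly.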

Selecting specific rows from different dataframes within a map scope

Hello, I am new to Spark and Scala, and I have three similar DataFrames like the following:
df1:
+--------+-------+-------+-------+
| Country|1/22/20|1/23/20|1/24/20|
+--------+-------+-------+-------+
|    Chad|      1|      0|      5|
|Paraguay|      4|      6|      3|
|  Russia|      0|      0|      1|
+--------+-------+-------+-------+
df2 and df3 have exactly the same structure, just with different values.
I would like to apply a function to each row of df1, but I also need to select the matching row (using Country as the key) from the other two DataFrames, because those rows are input arguments for the function I want to apply.
I thought of using
df1.map{ r =>
val selectedRowDf2 = selectRow using r at column "Country" ...
val selectedRowDf3 = selectRow using r at column "Country" ...
r.apply(functionToApply(r, selectedRowDf2, selectedRowDf3))
}
I also tried with map but I get an error as follows:
Error:(238, 23) not enough arguments for method map: (implicit evidence$6: org.apache.spark.sql.Encoder[Unit])org.apache.spark.sql.Dataset[Unit].
Unspecified value parameter evidence$6.
df1.map{
A possible approach is to prefix the columns of each DataFrame with a key that uniquely identifies their origin, and then merge all the DataFrames into a single one using the country column. The desired operation can then be performed on each row of the merged DataFrame, as sketched after the output below.
def appendColWithKey(df: DataFrame, key: String) = {
  var newdf = df
  df.schema.foreach(s => {
    newdf = newdf.withColumnRenamed(s.name, s"$key${s.name}")
  })
  newdf
}
val kdf1 = appendColWithKey(df1, "key1_")
val kdf2 = appendColWithKey(df2, "key2_")
val kdf3 = appendColWithKey(df3, "key3_")
val tempdf1 = kdf1.join(kdf2, col("key1_country") === col("key2_country"))
val tempdf = tempdf1.join(kdf3, col("key1_country") === col("key3_country"))
val finaldf = tempdf
  .drop("key2_country")
  .drop("key3_country")
  .withColumnRenamed("key1_country", "country")
finaldf.show(10)
//Output
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| country|key1_1/22/20|key1_1/23/20|key1_1/24/20|key2_1/22/20|key2_1/23/20|key2_1/24/20|key3_1/22/20|key3_1/23/20|key3_1/24/20|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
| Chad| 1| 0| 5| 1| 0| 5| 1| 0| 5|
|Paraguay| 4| 6| 3| 4| 6| 3| 4| 6| 3|
| Russia| 0| 0| 1| 0| 0| 1| 0| 0| 1|
+--------+------------+------------+------------+------------+------------+------------+------------+------------+------------+
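From here, the per-row computation described in the question can run over finaldf, since each row now carries the values from df1, df2 and df3 side by side. A rough sketch, assuming the nine renamed date columns are numeric, that spark.implicits._ is in scope (e.g. in spark-shell) for the Dataset encoder, and a hypothetical functionToApply:

// functionToApply is a placeholder for the real per-country computation
def functionToApply(r1: Seq[Double], r2: Seq[Double], r3: Seq[Double]): Double =
  (r1 ++ r2 ++ r3).sum

val computed = finaldf.map { row =>
  val country = row.getString(0)
  // after the renames, columns 1-3 come from df1, 4-6 from df2, 7-9 from df3
  val values = row.toSeq.tail.map(_.toString.toDouble)
  val Seq(fromDf1, fromDf2, fromDf3) = values.grouped(3).toSeq
  (country, functionToApply(fromDf1, fromDf2, fromDf3))
}.toDF("country", "result")
computed.show()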

Spark Dataframe - Method to take row as input & dataframe has output

I need to write a method that iterates over all the rows of DF2 and generates a DataFrame based on some conditions.
Here are the inputs DF1 & DF2:
val df1Columns = Seq("Eftv_Date","S_Amt","A_Amt","Layer","SubLayer")
val df2Columns = Seq("Eftv_Date","S_Amt","A_Amt")
var df1 = List(
  List("2016-10-31", "1000000", "1000", "0", "1"),
  List("2016-12-01", "100000", "950", "1", "1"),
  List("2017-01-01", "50000", "50", "2", "1"),
  List("2017-03-01", "50000", "100", "3", "1"),
  List("2017-03-30", "80000", "300", "4", "1")
).map(row => (row(0), row(1), row(2), row(3), row(4))).toDF(df1Columns: _*)
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
|2017-03-01| 50000| 100| 3| 1|
|2017-03-30| 80000| 300| 4| 1|
+----------+-------+-----+-----+--------+
val df2 = List(
  List("2017-02-01", "0", "400")
).map(row => (row(0), row(1), row(2))).toDF(df2Columns: _*)
+----------+-----+-----+
| Eftv_Date|S_Amt|A_Amt|
+----------+-----+-----+
|2017-02-01| 0| 400|
+----------+-----+-----+
Now I need to write a method that filters DF1 based on the Eftv_Date value of each row of DF2.
For example, the first row has df2.Eftv_Date = Feb 01 2017, so I need to filter df1 to the records with Eftv_Date less than or equal to Feb 01 2017. This will produce the 3 records below:
Expected Result :
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
I have written the method as below and called it using the map function.
def transformRows(row: Row) = {
  val dateEffective = row.getAs[String]("Eftv_Date")
  val df1LayerMet = df1.where(col("Eftv_Date").leq(dateEffective))
  df1 = df1LayerMet
  df1
}
val x = df2.map(transformRows)
But while calling this I am facing this error:
Error:(154, 24) Unable to find encoder for type stored in a Dataset. Primitive types (Int, String, etc) and Product types (case classes) are supported by importing spark.implicits._ Support for serializing other types will be added in future releases.
val x = df2.map(transformRows)
Note: we could implement this using a join, but I need to implement a custom Scala method to do this, since there are a lot of transformations involved. For simplicity I have mentioned only one condition.
It seems you need a non-equi join:
df1.alias("a").join(
df2.select("Eftv_Date").alias("b"),
df1("Eftv_Date") <= df2("Eftv_Date") // non-equi join condition
).select("a.*").show
+----------+-------+-----+-----+--------+
| Eftv_Date| S_Amt|A_Amt|Layer|SubLayer|
+----------+-------+-----+-----+--------+
|2016-10-31|1000000| 1000| 0| 1|
|2016-12-01| 100000| 950| 1| 1|
|2017-01-01| 50000| 50| 2| 1|
+----------+-------+-----+-----+--------+
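If you do want this wrapped in a standalone Scala method (per the note above) rather than expressed as a join, one option is to collect the df2 dates to the driver and build one filtered DataFrame per row. A sketch, assuming df2 is small and that spark.implicits._ is in scope for .as[String]; the yyyy-MM-dd strings compare correctly as plain strings:

import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col

// one filtered copy of `base` per effective date in `dates` (only sensible if `dates` is small)
def filterByEftvDates(base: DataFrame, dates: DataFrame): Array[DataFrame] =
  dates.select("Eftv_Date")
    .as[String]
    .collect()
    .map(cutoff => base.where(col("Eftv_Date") <= cutoff))

filterByEftvDates(df1, df2).foreach(_.show())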

How to map adjacent elements in scala

I have an RDD[String] in device,timestamp,on/off format. How do I calculate the amount of time each device is switched on? What is the best way of doing this in Spark?
on means 1 and off means 0
E.g
A,1335952933,1
A,1335953754,0
A,1335994294,1
A,1335995228,0
B,1336001513,1
B,1336002622,0
B,1336006905,1
B,1336007462,0
Intermediate step 1
A,((1335953754 - 1335952933),(1335995228 - 1335994294))
B,((1336002622 - 1336001513),(1336007462 - 1336006905))
Intermediate step 2
(A,(821,934))
(B,(1109,557))
output
(A,1755)
(B,1666)
I'll assume that the RDD[String] can be parsed into an RDD of DeviceLog, where DeviceLog is:
case class DeviceLog(id: String, timestamp: Long, onoff: Int)
The DeviceLog class is pretty straightforward.
// initialize contexts
val sc = new SparkContext(conf)
val sqlContext = new HiveContext(sc)
import sqlContext.implicits._                    // needed for toDF() and the $"..." syntax below
import org.apache.spark.sql.expressions.Window   // window spec used in step 2
import org.apache.spark.sql.functions.{lag, sum} // column functions used below
Those initialize the Spark context and SQL context that we'll use for DataFrames.
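Since the question actually starts from an RDD[String], the parsing step I assumed could look roughly like this (a sketch, assuming well-formed "device,timestamp,onoff" lines and that sqlContext.implicits._ is imported for toDF):

// parse the raw lines into DeviceLog records
val raw = sc.parallelize(Seq(
  "A,1335952933,1",
  "A,1335953754,0"))

val parsed = raw.map { line =>
  val Array(id, ts, onoff) = line.split(",")
  DeviceLog(id, ts.toLong, onoff.toInt)
}

val dfFromRaw = parsed.toDF() // same shape as the df built from `input` below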
Step 1:
val input = List(
  DeviceLog("A", 1335952933, 1),
  DeviceLog("A", 1335953754, 0),
  DeviceLog("A", 1335994294, 1),
  DeviceLog("A", 1335995228, 0),
  DeviceLog("B", 1336001513, 1),
  DeviceLog("B", 1336002622, 0),
  DeviceLog("B", 1336006905, 1),
  DeviceLog("B", 1336007462, 0))
val df = input.toDF()
df.show()
+---+----------+-----+
| id| timestamp|onoff|
+---+----------+-----+
| A|1335952933| 1|
| A|1335953754| 0|
| A|1335994294| 1|
| A|1335995228| 0|
| B|1336001513| 1|
| B|1336002622| 0|
| B|1336006905| 1|
| B|1336007462| 0|
+---+----------+-----+
Step 2: Partition by device id, order by timestamp and retain pair information (on/off)
val wSpec = Window.partitionBy("id").orderBy("timestamp")
val df1 = df
  .withColumn("spend", lag("timestamp", 1).over(wSpec))
  .withColumn("one", lag("onoff", 1).over(wSpec))
  .where($"spend" isNotNull)
df1.show()
+---+----------+-----+----------+---+
| id| timestamp|onoff| spend|one|
+---+----------+-----+----------+---+
| A|1335953754| 0|1335952933| 1|
| A|1335994294| 1|1335953754| 0|
| A|1335995228| 0|1335994294| 1|
| B|1336002622| 0|1336001513| 1|
| B|1336006905| 1|1336002622| 0|
| B|1336007462| 0|1336006905| 1|
+---+----------+-----+----------+---+
Step 3: Compute upTime and filter by criteria
val df2 = df1
  .withColumn("upTime", $"timestamp" - $"spend")
  .withColumn("criteria", $"one" - $"onoff")
  .where($"criteria" === 1)
df2.show()
+---+----------+-----+----------+---+------+--------+
| id| timestamp|onoff|     spend|one|upTime|criteria|
+---+----------+-----+----------+---+------+--------+
| A|1335953754| 0|1335952933| 1| 821| 1|
| A|1335995228| 0|1335994294| 1| 934| 1|
| B|1336002622| 0|1336001513| 1| 1109| 1|
| B|1336007462| 0|1336006905| 1| 557| 1|
+---+----------+-----+----------+---+------+--------+
Step 4: Group by id and sum
val df3 = df2.groupBy($"id").agg(sum("upTime"))
df3.show()
+---+-----------+
| id|sum(upTime)|
+---+-----------+
| A| 1755|
| B| 1666|
+---+-----------+
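If you need the result back in the (device, total) tuple form shown in the question, a small sketch (assuming sum(upTime) comes back as a Long, since the timestamps are Longs):

val totals = df3.rdd.map(r => (r.getString(0), r.getLong(1)))
totals.collect().foreach(println)
// (A,1755)
// (B,1666)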

How to combine (join) information across an Array[DataFrame]

I have an Array[DataFrame] and I want to check, for each row of each data frame, if there is any change in the values by column. Say I have the first row of three data frames, like:
(0,1.0,0.4,0.1)
(0,3.0,0.2,0.1)
(0,5.0,0.4,0.1)
The first column is the ID, and my ideal output for this ID would be:
(0, 1, 1, 0)
meaning that the second and third columns changed while the fourth did not.
I attach here a bit of data to replicate my setting
val rdd = sc.parallelize(Array(
  (0, 1.0, 0.4, 0.1),
  (1, 0.9, 0.3, 0.3),
  (2, 0.2, 0.9, 0.2),
  (3, 0.9, 0.2, 0.2),
  (4, 0.3, 0.5, 0.5)))
val rdd2 = sc.parallelize(Array(
  (0, 3.0, 0.2, 0.1),
  (1, 0.9, 0.3, 0.3),
  (2, 0.2, 0.5, 0.2),
  (3, 0.8, 0.1, 0.1),
  (4, 0.3, 0.5, 0.5)))
val rdd3 = sc.parallelize(Array(
  (0, 5.0, 0.4, 0.1),
  (1, 0.5, 0.3, 0.3),
  (2, 0.3, 0.3, 0.5),
  (3, 0.3, 0.3, 0.1),
  (4, 0.3, 0.5, 0.5)))
val df = rdd.toDF("id", "prop1", "prop2", "prop3")
val df2 = rdd2.toDF("id", "prop1", "prop2", "prop3")
val df3 = rdd3.toDF("id", "prop1", "prop2", "prop3")
val result:Array[DataFrame] = new Array[DataFrame](3)
result.update(0, df)
result.update(1,df2)
result.update(2,df3)
How can I map over the array and get my output?
You can use countDistinct with groupBy:
import org.apache.spark.sql.functions.countDistinct

val exprs = Seq("prop1", "prop2", "prop3")
  .map(c => (countDistinct(c) > 1).cast("integer").alias(c))

val combined = result.reduce(_ unionAll _)

val aggregatedViaGroupBy = combined
  .groupBy($"id")
  .agg(exprs.head, exprs.tail: _*)
aggregatedViaGroupBy.show
// +---+-----+-----+-----+
// | id|prop1|prop2|prop3|
// +---+-----+-----+-----+
// | 0| 1| 1| 0|
// | 1| 1| 0| 0|
// | 2| 1| 1| 1|
// | 3| 1| 1| 1|
// | 4| 0| 0| 0|
// +---+-----+-----+-----+
Another approach: first we need to join all the DataFrames together.
val combined = result.reduceLeft((a,b) => a.join(b,"id"))
To compare all the columns with the same label (e.g., "prop1"), I found it easier (at least for me) to operate at the RDD level. We first transform the data into (id, Seq[Double]).
val finalResults = combined.rdd.map { x =>
  (x.getInt(0), x.toSeq.tail.map(_.asInstanceOf[Double]))
}.map { case (i, d) =>
  def checkAllEqual(l: Seq[Double]) = if (l.toSet.size == 1) 0 else 1
  // after the join, each group of 3 holds (prop1, prop2, prop3) from one of the three DataFrames
  val g = d.grouped(3).toList
  val g1 = checkAllEqual(g.map(x => x(0)))
  val g2 = checkAllEqual(g.map(x => x(1)))
  val g3 = checkAllEqual(g.map(x => x(2)))
  (i, g1, g2, g3)
}.toDF("id", "prop1", "prop2", "prop3")
finalResults.show()
finalResults.show()
This will print:
+---+-----+-----+-----+
| id|prop1|prop2|prop3|
+---+-----+-----+-----+
| 0| 1| 1| 0|
| 1| 1| 0| 0|
| 2| 1| 1| 1|
| 3| 1| 1| 1|
| 4| 0| 0| 0|
+---+-----+-----+-----+