merge new rows with previous rows data in dataframe in spark scala

Input Spark Dataframe df (OLTP):
+----+---------+------+
|name|date     |amount|
+----+---------+------+
|abc |4/6/2018 |100   |
|abc |4/6/2018 |200   |
|abc |4/13/2018|300   |
+----+---------+------+
Expected DF (OLAP):
+----+---------+------+
|name|date     |amount|
+----+---------+------+
|abc |4/6/2018 |100   |
|abc |4/6/2018 |200   |
|abc |4/13/2018|100   |
|abc |4/13/2018|200   |
|abc |4/13/2018|300   |
+----+---------+------+
My code:
val df = df1.union(df1)
+----+---------+------+
|name|date |amount|
+----+---------+------+
|abc |4/6/2018 |100 |
|abc |4/6/2018 |200 |
|abc |4/13/2018|300 |
|abc |4/6/2018 |100 |
|abc |4/6/2018 |200 |
|abc |4/13/2018|300 |
+----+---------+------+
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag
val w1 = Window.orderBy("date")
val ExpectedDF = df.withColumn("previousAmount", lag("amount", 1).over(w1)).withColumn("newdate", lag("date", 1).over(w1))
ExpectedDF.show(false)
+----+---------+------+--------------+---------+
|name|date |amount|previousAmount|newdate |
+----+---------+------+--------------+---------+
|abc |4/13/2018|300 |null |null |
|abc |4/13/2018|300 |300 |4/13/2018|
|abc |4/6/2018 |100 |300 |4/13/2018|
|abc |4/6/2018 |200 |100 |4/6/2018 |
|abc |4/6/2018 |100 |200 |4/6/2018 |
|abc |4/6/2018 |200 |100 |4/6/2018 |
+----+---------+------+--------------+---------+

import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.lag

def main(args: Array[String]): Unit = {
  // One SparkSession is enough; a separate SparkContext/SQLContext is not needed
  val ss = SparkSession.builder().master("local").appName("Excel-read-write").getOrCreate()
  var df1 = ss.read.format("com.databricks.spark.csv")
    .option("header", "true")
    .option("inferSchema", "true")
    .load("oldRecords.csv")
  df1.show(false)
  val rowCount = df1.count()
  println("---- df1 row count ----" + rowCount)
  if (rowCount > 0) {
    for (i <- 0 until rowCount.toInt - 1) {
      // Duplicate the rows, then pull the previous row's amount and date next to each row
      val df2 = df1.union(df1)
      val w1 = Window.orderBy("date")
      val df3 = df2.withColumn("previousAmount", lag("amount", 1).over(w1))
        .withColumn("newdate", lag("date", 1).over(w1))
      // Drop the first row, which has no previous row
      val df4 = df3.filter(df3.col("newdate").isNotNull)
      val df5 = df4.select("name", "amount", "newdate").distinct()
      // show() returns Unit, so call it directly instead of concatenating it into println
      df5.show(false)
      df1 = df5.withColumnRenamed("newdate", "date")
    }
  }
}
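
One way to read the expected output is that every date should carry all rows recorded on or before it. Under that assumption (it is an interpretation of the tables above, not something the question states), a non-equi self-join against the distinct dates avoids the loop. This is only a sketch: the names rows, snapshots, snapshot_date and snapshot_d are made up for illustration, and to_date with a format string assumes Spark 2.2 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, to_date}

val spark = SparkSession.builder().master("local[*]").appName("carry-forward-sketch").getOrCreate()
import spark.implicits._

// Sample input copied from the question
val df1 = Seq(
  ("abc", "4/6/2018", 100),
  ("abc", "4/6/2018", 200),
  ("abc", "4/13/2018", 300)
).toDF("name", "date", "amount")

// Parse the M/d/yyyy strings so that dates compare chronologically
val rows = df1.withColumn("d", to_date(col("date"), "M/d/yyyy"))

// One row per (name, date): the dates for which a full picture is wanted
val snapshots = rows
  .select(col("name").as("s_name"), col("date").as("snapshot_date"), col("d").as("snapshot_d"))
  .distinct()

// Attach to each snapshot date every row recorded on or before it
val expected = rows
  .join(snapshots, col("name") === col("s_name") && col("d") <= col("snapshot_d"))
  .orderBy(col("snapshot_d"), col("amount"))
  .select(col("name"), col("snapshot_date").as("date"), col("amount"))

expected.show(false)
```

The join grows with the number of distinct dates, so on large tables it may be worth restricting snapshots to the dates that actually need rebuilding.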

Related

How to rank dataframe depending on a group of rows in a column?

I have this dataframe :
+-----+----------+---------+
|num |Timestamp |frequency|
+-----+----------+---------+
|20.0 |1632899456|4 |
|20.0 |1632901256|4 |
|20.0 |1632901796|4 |
|20.0 |1632899155|4 |
|10.0 |1632901743|2 |
|10.0 |1632899933|2 |
|91.0 |1632899756|1 |
|32.0 |1632900776|1 |
|41.0 |1632900176|1 |
+-----+----------+---------+
I want to add a column containing the rank of each frequency. The new dataframe would be like this :
+-----+----------+---------+------------+
|num |Timestamp |frequency|rank |
+-----+----------+---------+------------+
|20.0 |1632899456|4 |1 |
|20.0 |1632901256|4 |1 |
|20.0 |1632901796|4 |1 |
|20.0 |1632899155|4 |1 |
|10.0 |1632901743|2 |2 |
|10.0 |1632899933|2 |2 |
|91.0 |1632899756|1 |3 |
|32.0 |1632900776|1 |3 |
|41.0 |1632900176|1 |3 |
+-----+----------+---------+------------+
I am using Spark version 2.4.3 and SQLContext, with scala language.

You can use dense_rank:
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, desc}
val df2 = df.withColumn("rank", dense_rank().over(Window.orderBy(desc("frequency"))))
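
For reference, here is a self-contained sketch of the same idea on a few rows taken from the question. The SparkSession setup and the name ranked are assumptions added so the snippet runs on its own. Note that a window without partitionBy moves all rows into a single partition, which is fine for small data but can be slow on large tables.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{dense_rank, desc}

val spark = SparkSession.builder().master("local[*]").appName("dense-rank-sketch").getOrCreate()
import spark.implicits._

// A subset of the rows from the question
val df = Seq(
  (20.0, 1632899456L, 4),
  (20.0, 1632901256L, 4),
  (10.0, 1632901743L, 2),
  (91.0, 1632899756L, 1),
  (32.0, 1632900776L, 1)
).toDF("num", "Timestamp", "frequency")

// dense_rank gives equal frequencies the same rank and leaves no gaps between ranks
val ranked = df.withColumn("rank", dense_rank().over(Window.orderBy(desc("frequency"))))

ranked.show(false)
```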

How to find Delta of a column in apache spark using SCALA [duplicate]

I created a dataframe in Spark by grouping on column1 and date and summing the amount.
val table = df1.groupBy($"column1",$"date").sum("amount")
Column1 |Date   |Amount
A       |1-jul  |1000
A       |1-june |2000
A       |1-May  |2000
A       |1-dec  |3000
A       |1-Nov  |2000
B       |1-jul  |100
B       |1-june |300
B       |1-May  |400
B       |1-dec  |300
Now I want to add a new column with the difference between the amounts of any two dates from the table.

You can use a Window function if the calculation is fixed, for example the difference from the previous month or from two months earlier. For that you can use the lag and lead functions with a Window.
First, though, you need to convert the date column as below so that it can be ordered.
+-------+------+--------------+------+
|Column1|Date |Date_Converted|Amount|
+-------+------+--------------+------+
|A |1-jul |2017-07-01 |1000 |
|A |1-june|2017-06-01 |2000 |
|A |1-May |2017-05-01 |2000 |
|A |1-dec |2017-12-01 |3000 |
|A |1-Nov |2017-11-01 |2000 |
|B |1-jul |2017-07-01 |100 |
|B |1-june|2017-06-01 |300 |
|B |1-May |2017-05-01 |400 |
|B |1-dec |2017-12-01 |300 |
+-------+------+--------------+------+
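
The answer does not show how Date_Converted was produced. One possible way, sketched below, is to split the string, map the first three letters of the month name to a month number, and build an ISO date. The year 2017 is taken from the table above and is otherwise an assumption, as are the names monthNumber, parts, day, monthKey and converted. typedLit assumes Spark 2.2 or later.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, format_string, lower, split, substring, to_date, trim, typedLit}

val spark = SparkSession.builder().master("local[*]").appName("date-convert-sketch").getOrCreate()
import spark.implicits._

val df = Seq(
  ("A", "1-jul", 1000),
  ("A", "1-june", 2000),
  ("A", "1-May", 2000),
  ("A", "1-dec", 3000),
  ("A", "1-Nov", 2000)
).toDF("Column1", "Date", "Amount")

// Lookup from the lower-cased first three letters of the month name to its number
val monthNumber = typedLit(Map(
  "jan" -> 1, "feb" -> 2, "mar" -> 3, "apr" -> 4, "may" -> 5, "jun" -> 6,
  "jul" -> 7, "aug" -> 8, "sep" -> 9, "oct" -> 10, "nov" -> 11, "dec" -> 12))

val parts = split(col("Date"), "-")
val day = trim(parts.getItem(0)).cast("int")
val monthKey = substring(lower(trim(parts.getItem(1))), 1, 3)

// Build a yyyy-MM-dd string (year assumed to be 2017) and parse it as a date
val converted = df.withColumn(
  "Date_Converted",
  to_date(format_string("2017-%02d-%02d", monthNumber(monthKey), day)))

converted.show(false)
```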
You can find the difference between previous month and current month by doing
import org.apache.spark.sql.expressions._
val windowSpec = Window.partitionBy("Column1").orderBy("Date_Converted")
import org.apache.spark.sql.functions._
df.withColumn("diff_Amt_With_Prev_Month", $"Amount" - when((lag("Amount", 1).over(windowSpec)).isNull, 0).otherwise(lag("Amount", 1).over(windowSpec)))
.show(false)
You should have
+-------+------+--------------+------+------------------------+
|Column1|Date |Date_Converted|Amount|diff_Amt_With_Prev_Month|
+-------+------+--------------+------+------------------------+
|B |1-May |2017-05-01 |400 |400.0 |
|B |1-june|2017-06-01 |300 |-100.0 |
|B |1-jul |2017-07-01 |100 |-200.0 |
|B |1-dec |2017-12-01 |300 |200.0 |
|A |1-May |2017-05-01 |2000 |2000.0 |
|A |1-june|2017-06-01 |2000 |0.0 |
|A |1-jul |2017-07-01 |1000 |-1000.0 |
|A |1-Nov |2017-11-01 |2000 |1000.0 |
|A |1-dec |2017-12-01 |3000 |1000.0 |
+-------+------+--------------+------+------------------------+
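
The when/otherwise null check can also be folded into lag itself, since lag accepts a default value as its third argument. The sketch below uses the B rows from the table above; the SparkSession setup and the name withDiff are assumptions added to make it runnable on its own. With an integer Amount column the differences come out as integers rather than doubles.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{col, lag}

val spark = SparkSession.builder().master("local[*]").appName("lag-default-sketch").getOrCreate()
import spark.implicits._

// ISO-formatted date strings sort chronologically, so they can be used for ordering here
val df = Seq(
  ("B", "2017-05-01", 400),
  ("B", "2017-06-01", 300),
  ("B", "2017-07-01", 100),
  ("B", "2017-12-01", 300)
).toDF("Column1", "Date_Converted", "Amount")

val windowSpec = Window.partitionBy("Column1").orderBy("Date_Converted")

// lag(column, offset, default) returns the default when there is no previous row
val withDiff = df.withColumn(
  "diff_Amt_With_Prev_Month",
  col("Amount") - lag("Amount", 1, 0).over(windowSpec))

withDiff.show(false)
```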
You can increase the lagging position for previous two months as
df.withColumn("diff_Amt_With_Prev_two_Month", $"Amount" - when((lag("Amount", 2).over(windowSpec)).isNull, 0).otherwise(lag("Amount", 2).over(windowSpec)))
.show(false)
which will give you
+-------+------+--------------+------+----------------------------+
|Column1|Date |Date_Converted|Amount|diff_Amt_With_Prev_two_Month|
+-------+------+--------------+------+----------------------------+
|B |1-May |2017-05-01 |400 |400.0 |
|B |1-june|2017-06-01 |300 |300.0 |
|B |1-jul |2017-07-01 |100 |-300.0 |
|B |1-dec |2017-12-01 |300 |0.0 |
|A |1-May |2017-05-01 |2000 |2000.0 |
|A |1-june|2017-06-01 |2000 |2000.0 |
|A |1-jul |2017-07-01 |1000 |-1000.0 |
|A |1-Nov |2017-11-01 |2000 |0.0 |
|A |1-dec |2017-12-01 |3000 |2000.0 |
+-------+------+--------------+------+----------------------------+
I hope the answer is helpful

Assuming those two dates are present in each group of your table.
My imports:
import org.apache.spark.sql.functions.{concat_ws, collect_list, lit, udf}
Prepare the dataframe:
scala> val seqRow = Seq(
| ("A","1- jul",1000),
| ("A","1-june",2000),
| ("A","1-May",2000),
| ("A","1-dec",3000),
| ("B","1-jul",100),
| ("B","1-june",300),
| ("B","1-May",400),
| ("B","1-dec",300))
seqRow: Seq[(String, String, Int)] = List((A,1- jul,1000), (A,1-june,2000), (A,1-May,2000), (A,1-dec,3000), (B,1-jul,100), (B,1-june,300), (B,1-May,400), (B,1-dec,300))
scala> val input_df = sc.parallelize(seqRow).toDF("column1","date","amount")
input_df: org.apache.spark.sql.DataFrame = [column1: string, date: string ... 1 more field]
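
The steps that follow use a group_df with the columns sum_amount and collect_val, but the answer never shows how it was built. Judging from the imports (concat_ws, collect_list) and from the UDF splitting each entry on "$", it was probably something like the sketch below. This is a reconstruction, not the author's code.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, collect_list, concat_ws, sum}

val spark = SparkSession.builder().master("local[*]").appName("group-df-sketch").getOrCreate()
import spark.implicits._

val input_df = Seq(
  ("A", "1- jul", 1000), ("A", "1-june", 2000), ("A", "1-May", 2000), ("A", "1-dec", 3000),
  ("B", "1-jul", 100), ("B", "1-june", 300), ("B", "1-May", 400), ("B", "1-dec", 300)
).toDF("column1", "date", "amount")

// Per group: total amount, plus a list of "date$amount" strings for the UDF to parse
val group_df = input_df
  .groupBy("column1")
  .agg(
    sum("amount").as("sum_amount"),
    collect_list(concat_ws("$", col("date"), col("amount").cast("string"))).as("collect_val"))

group_df.show(false)
```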
Now write a UDF for your case,
scala> def calc_diff = udf((list : Seq[String],startMonth : String,endMonth : String) => {
| //get the month and their values
| val monthMap = list.map{str =>
| val splitText = str.split("\\$")
| val month = splitText(0).split("-")(1).trim
|
| (month.toLowerCase,splitText(1).toInt)
| }.toMap
|
| val stMnth = monthMap(startMonth)
| val endMnth = monthMap(endMonth)
| endMnth - stMnth
|
| })
calc_diff: org.apache.spark.sql.expressions.UserDefinedFunction
Now prepare the output:
scala> val (month1 : String,month2 : String) = ("jul","dec")
month1: String = jul
month2: String = dec
scala> val req_df = group_df.withColumn("diff",calc_diff('collect_val,lit(month1.toLowerCase),lit(month2.toLowerCase)))
req_df: org.apache.spark.sql.DataFrame = [column1: string, sum_amount: bigint ... 2 more fields]
scala> val req_df = group_df.withColumn("diff",calc_diff('collect_val,lit(month1.toLowerCase),lit(month2.toLowerCase))).drop('collect_val)
req_df: org.apache.spark.sql.DataFrame = [column1: string, sum_amount: bigint ... 1 more field]
scala> req_df.orderBy('column1).show
+-------+----------+----+
|column1|sum_amount|diff|
+-------+----------+----+
| A| 8000|2000|
| B| 1100| 200|
+-------+----------+----+
Hope, this is what you want.

(table.filter($"Date".isin("1-jul", "1-dec"))
.groupBy("Column1")
.pivot("Date")
.agg(first($"Amount"))
.withColumn("diff", $"1-dec" - $"1-jul")
).show
+-------+-----+-----+----+
|Column1|1-dec|1-jul|diff|
+-------+-----+-----+----+
| B| 300| 100| 200|
| A| 3000| 1000|2000|
+-------+-----+-----+----+

join dataframes and perform operation

Hello, I have a dataframe that is updated every day. Each day I need to add the new qte and the new ca to the old values and update the date.
So I need to update the rows that already exist and add the new ones. Here is an example of what I would like to have at the end:
val histocaisse = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")
val hist = histocaisse
.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("article_id", 'pos_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(DoubleType))
.withColumn("ca", 'ca.cast(DoubleType))
val histocaisse2 = spark.read
.format("csv")
.option("header", "true") //reading the headers
.load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")
val hist2 = histocaisse2.withColumn("pos_id", 'pos_id.cast(LongType))
.withColumn("article_id", 'pos_id.cast(LongType))
.withColumn("date", 'date.cast(DateType))
.withColumn("qte", 'qte.cast(DoubleType))
.withColumn("ca", 'ca.cast(DoubleType))
hist2.show(false)
hist (day 1):
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-07|2.5 |3.5 |
|2 |2 |2000-01-07|14.7|12.0|
|3 |3 |2000-01-07|3.5 |1.2 |
+------+----------+----------+----+----+
hist2 (day 2):
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|2.5 |3.5 |
|2 |2 |2000-01-08|14.7|12.0|
|3 |3 |2000-01-08|3.5 |1.2 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
Expected result:
+------+----------+----------+----+----+
|pos_id|article_id|date |qte |ca |
+------+----------+----------+----+----+
|1 |1 |2000-01-08|5.0 |7.0 |
|2 |2 |2000-01-08|29.4|24.0|
|3 |3 |2000-01-08|7.0 |2.4 |
|4 |4 |2000-01-08|3.5 |1.2 |
|5 |5 |2000-01-08|14.5|1.2 |
|6 |6 |2000-01-08|2.0 |1.25|
+------+----------+----------+----+----+
Here is what I did:
val histoCombinaison2=hist2.join(hist,Seq("article_id","pos_id"),"left")
.groupBy("article_id","pos_id").agg((hist2("qte")+hist("qte")) as ("qte"),(hist2("ca")+hist("ca")) as ("ca"),hist2("date"))
histoCombinaison2.show()
and I got the following exception:
Exception in thread "main" org.apache.spark.sql.AnalysisException: expression '`qte`' is neither present in the group by, nor is it an aggregate function. Add to group by or wrap in first() (or first_value) if you don't care which value you get.;
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$class.failAnalysis(CheckAnalysis.scala:40)
at org.apache.spark.sql.catalyst.analysis.Analyzer.failAnalysis(Analyzer.scala:58)
at org.apache.spark.sql.catalyst.analysis.CheckAnalysis$$anonfun$checkAnalysis$1.org$apache$spark$sql$catalyst$analysis$CheckAnalysis$class$$anonfun$$checkValidAggregateExpression$1(CheckAnalysis.scala:218)

// import functions
import org.apache.spark.sql.functions.{coalesce, lit}
// we might not need groupBy,
// since after join, all the information will be in the same row
// so instead of using aggregate function, we simply combine the related fields as a new column.
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
coalesce(hist2("date"), hist1("date")).alias("date"),
(coalesce(hist2("qte"), lit(0)) + coalesce(hist1("qte"), lit(0))).alias("qte"),
(coalesce(hist2("ca"), lit(0)) + coalesce(hist1("ca"), lit(0))).alias("ca"))
.orderBy("pos_id", "article_id")
// df.show()
+------+----------+----------+----+----+
|pos_id|article_id| date| qte| ca|
+------+----------+----------+----+----+
| 1| 1|2000-01-08| 5.0| 7.0|
| 2| 2|2000-01-08|29.4|24.0|
| 3| 3|2000-01-08| 7.0| 2.4|
| 4| 4|2000-01-08| 3.5| 1.2|
| 5| 5|2000-01-08|14.5| 1.2|
| 6| 6|2000-01-08| 2.0|1.25|
+------+----------+----------+----+----+
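
The left join works here because every key in the older data also appears in hist2. If a key could exist only in the older data, a full outer join would keep it as well. The sketch below shows that variant on a reduced data set; the row with id 7 is hypothetical and was added only to exercise that case, and dates are kept as strings for brevity.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

val spark = SparkSession.builder().master("local[*]").appName("upsert-sketch").getOrCreate()
import spark.implicits._

val cols = Seq("pos_id", "article_id", "date", "qte", "ca")

// Older data; id 7 is a made-up row that has no counterpart in hist2
val hist1 = Seq(
  (1L, 1L, "2000-01-07", 2.5, 3.5),
  (2L, 2L, "2000-01-07", 14.7, 12.0),
  (7L, 7L, "2000-01-07", 1.0, 1.0)
).toDF(cols: _*)

// Newer data; id 4 is new
val hist2 = Seq(
  (1L, 1L, "2000-01-08", 2.5, 3.5),
  (2L, 2L, "2000-01-08", 14.7, 12.0),
  (4L, 4L, "2000-01-08", 3.5, 1.2)
).toDF(cols: _*)

// Full outer join keeps keys from both sides; coalesce treats a missing side as 0
val merged = hist2.join(hist1, Seq("article_id", "pos_id"), "full_outer")
  .select(
    col("pos_id"), col("article_id"),
    coalesce(hist2("date"), hist1("date")).alias("date"),
    (coalesce(hist2("qte"), lit(0.0)) + coalesce(hist1("qte"), lit(0.0))).alias("qte"),
    (coalesce(hist2("ca"), lit(0.0)) + coalesce(hist1("ca"), lit(0.0))).alias("ca"))
  .orderBy("pos_id", "article_id")

merged.show(false)
```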
Thanks.

As I mentioned in my comment, you should define your schema and use it when reading the csv into a dataframe:
import sqlContext.implicits._
import org.apache.spark.sql.types._
val schema = StructType(Seq(
StructField("pos_id", LongType, true),
StructField("article_id", LongType, true),
StructField("date", DateType, true),
StructField("qte", DoubleType, true), // qte holds fractional values such as 2.5, so it must be a double
StructField("ca", DoubleType, true)
))
val hist1 = sqlContext.read
.format("csv")
.option("header", "true")
.schema(schema)
.load("C:/Users/MHT/Desktop/histocaisse_dte1.csv")
hist1.show
val hist2 = sqlContext.read
.format("csv")
.option("header", "true") //reading the headers
.schema(schema)
.load("C:/Users/MHT/Desktop/histocaisse_dte2.csv")
hist2.show
Then you should use the when function to define the logic you need:
val df = hist2.join(hist1, Seq("article_id", "pos_id"), "left")
.select($"pos_id", $"article_id",
when(hist2("date").isNotNull, hist2("date")).otherwise(when(hist1("date").isNotNull, hist1("date")).otherwise(lit(null))).alias("date"),
(when(hist2("qte").isNotNull, hist2("qte")).otherwise(lit(0)) + when(hist1("qte").isNotNull, hist1("qte")).otherwise(lit(0))).alias("qte"),
(when(hist2("ca").isNotNull, hist2("ca")).otherwise(lit(0)) + when(hist1("ca").isNotNull, hist1("ca")).otherwise(lit(0))).alias("ca"))
I hope the answer is helpful

How to transform the dataframe into label feature vector?

I am running a logistic regression model in Scala and I have a data frame like below:
df
+-----------+------------+
|x |y |
+-----------+------------+
| 0| 0|
| 0| 33|
| 0| 58|
| 0| 96|
| 0| 1|
| 1| 21|
| 0| 10|
| 0| 65|
| 1| 7|
| 1| 28|
+-----------+------------+
I need to transform this into something like this:
+-----+------------------+
|label| features |
+-----+------------------+
| 0.0|(1,[1],[0]) |
| 0.0|(1,[1],[33]) |
| 0.0|(1,[1],[58]) |
| 0.0|(1,[1],[96]) |
| 0.0|(1,[1],[1]) |
| 1.0|(1,[1],[21]) |
| 0.0|(1,[1],[10]) |
| 0.0|(1,[1],[65]) |
| 1.0|(1,[1],[7]) |
| 1.0|(1,[1],[28]) |
+-----+------------------+
I tried
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
val assembler = new VectorAssembler()
.setInputCols(Array("x"))
.setOutputCol("Feature")
var lrModel= lr.fit(daf.withColumnRenamed("x","label").withColumnRenamed("y","features"))
Any help is appreciated.

Given the dataframe as
+---+---+
|x |y |
+---+---+
|0 |0 |
|0 |33 |
|0 |58 |
|0 |96 |
|0 |1 |
|1 |21 |
|0 |10 |
|0 |65 |
|1 |7 |
|1 |28 |
+---+---+
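and doing the following
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.types.DoubleType
// Assemble the input columns into a single vector column named "features"
val assembler = new VectorAssembler()
.setInputCols(Array("x", "y"))
.setOutputCol("features")
// Cast x to double and expose it as the label column
val output = assembler.transform(df).select($"x".cast(DoubleType).as("label"), $"features")
output.show(false)
would give you the result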
+-----+----------+
|label|features |
+-----+----------+
|0.0 |(2,[],[]) |
|0.0 |[0.0,33.0]|
|0.0 |[0.0,58.0]|
|0.0 |[0.0,96.0]|
|0.0 |[0.0,1.0] |
|1.0 |[1.0,21.0]|
|0.0 |[0.0,10.0]|
|0.0 |[0.0,65.0]|
|1.0 |[1.0,7.0] |
|1.0 |[1.0,28.0]|
+-----+----------+
Now using LogisticRegression would be easy
val lr = new LogisticRegression()
.setMaxIter(10)
.setRegParam(0.3)
.setElasticNetParam(0.8)
val lrModel = lr.fit(output)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
You will have output as
Coefficients: [1.5672602877378823,0.0] Intercept: -1.4055020984891717
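
Note that the code above also feeds x, the label, into the feature vector, which is why the only non-zero coefficient belongs to x. The question's desired output has a single feature taken from y, so a variant along the following lines may be closer to what was asked. This is a sketch: the SparkSession setup and the name labeled are assumptions, and the regularisation settings from the answer are left out so that the default parameters apply.

```scala
import org.apache.spark.ml.classification.LogisticRegression
import org.apache.spark.ml.feature.VectorAssembler
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

val spark = SparkSession.builder().master("local[*]").appName("label-features-sketch").getOrCreate()
import spark.implicits._

// Data from the question: x is the label, y the only predictor
val df = Seq(
  (0, 0), (0, 33), (0, 58), (0, 96), (0, 1),
  (1, 21), (0, 10), (0, 65), (1, 7), (1, 28)
).toDF("x", "y")

// Only y goes into the feature vector, so the label does not leak into the features
val assembler = new VectorAssembler()
  .setInputCols(Array("y"))
  .setOutputCol("features")

val labeled = assembler.transform(df)
  .select(col("x").cast("double").as("label"), col("features"))

labeled.show(false)

val lrModel = new LogisticRegression().setMaxIter(10).fit(labeled)
println(s"Coefficients: ${lrModel.coefficients} Intercept: ${lrModel.intercept}")
```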
