How to add multiple columns to a Spark DataFrame using Scala

I have a requirement where I have to add 5 columns (to an existing DF) for 5 months of a year.
The existing DF looks like this:
EId EName Esal
1 abhi 1100
2 raj 300
3 nanu 400
4 ram 500
The Output should be as follows:
EId EName Esal Jan Feb March April May
1 abhi 1100 1100 1100 1100 1100 1100
2 raj 300 300 300 300 300 300
3 nanu 400 400 400 400 400 400
4 ram 500 500 500 500 500 500
I can do this one by one with withColumn, but that takes a lot of time.
Is there a way I can run some loop and keep adding columns until my conditions are exhausted?
Many thanks in advance.

You can use foldLeft. You'll need to create a List of the columns that you want.
df.show
+---+----+----+
| id|name| sal|
+---+----+----+
| 1| A|1100|
+---+----+----+
val list = List("Jan", "Feb", "Mar", "Apr") // ... you get the idea
list.foldLeft(df)((df, month) => df.withColumn(month, $"sal")).show
+---+----+----+----+----+----+----+
| id|name| sal| Jan| Feb| Mar| Apr|
+---+----+----+----+----+----+----+
| 1| A|1100|1100|1100|1100|1100|
+---+----+----+----+----+----+----+
So, basically, you fold over the list you created, starting with the original DataFrame and applying the transformation at each step as you traverse the list.
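To make the mechanics concrete, this is what that fold expands to for the four-month list above (nothing new here, just the withColumn chain written out by hand):
// The foldLeft above is equivalent to chaining withColumn once per list element
val expanded = df
  .withColumn("Jan", $"sal")
  .withColumn("Feb", $"sal")
  .withColumn("Mar", $"sal")
  .withColumn("Apr", $"sal")
expanded.show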

Yes, you can do the same using foldLeft. foldLeft traverses the elements of the collection from left to right, threading the accumulated value (here, the DataFrame) through each step.
So you can store the desired columns in a List.
For example:
import spark.implicits._

val BazarDF = Seq(
  ("Veg", "tomato", 1.99),
  ("Veg", "potato", 0.45),
  ("Fruit", "apple", 0.99),
  ("Fruit", "pineapple", 2.59)
).toDF("Type", "Item", "Price")
Create a List of (column name, value) pairs; as an example, a null value cast to StringType is used (lit("null") would give the literal string "null", which is probably not what you want):
import org.apache.spark.sql.functions.lit
import org.apache.spark.sql.types.StringType

val ColNameWithDatatype = List(
  ("Jan", lit(null).cast(StringType)),
  ("Feb", lit(null).cast(StringType))
)
val BazarWithColumnDF1 = ColNameWithDatatype.foldLeft(BazarDF) { (tempDF, colName) =>
  tempDF.withColumn(colName._1, colName._2)
}
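As a quick sanity check (nothing here beyond the code above), the two new columns show up after Price and hold null in every row:
BazarWithColumnDF1.printSchema() // Jan and Feb are appended as string columns
BazarWithColumnDF1.show()        // both new columns contain null for every row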

Keep in mind that the withColumn method of DataFrame can cause performance issues when called in a loop:
Spark DAG differs with 'withColumn' vs 'select'
This is even mentioned in the API docs:
https://spark.apache.org/docs/latest/api/scala/org/apache/spark/sql/Dataset.html#withColumn(colName:String,col:org.apache.spark.sql.Column):org.apache.spark.sql.DataFrame
this method introduces a projection internally. Therefore, calling it multiple times, for instance, via loops in order to add multiple columns can generate big plans which can cause performance issues and even StackOverflowException. To avoid this, use select with the multiple columns at once.
The safer way is to do it with select:
import org.apache.spark.sql.functions.col

val months = List("Jan", "Feb", "March", "April", "May")
val monthsColumns = months.map { month: String =>
  col("sal").as(month)
}
val updatedDf = df.select(df.columns.map(col) ++ monthsColumns: _*)
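If you are on Spark 3.3 or later, Dataset.withColumns (plural) also adds several columns in a single projection, so it avoids the plan blow-up of repeated withColumn calls. A minimal sketch, reusing the months list defined above:
// Builds a Map[String, Column] of new column names to expressions and adds them at once
val updatedDf2 = df.withColumns(
  months.map(month => month -> col("sal")).toMap
)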

Related

Spark Combining Disparate rate Dataframes in Time

Using Spark and Scala, I have two DataFrames with data values.
I'm trying to accomplish something that would be trivial when processed serially, but seems daunting when processed in a cluster.
Let's say I have two sets of values. One of them is very regular:
Relative Time   Value1
10              1
20              2
30              3
And I want to combine it with another value that is very irregular:
Relative Time   Value2
1               100
22              200
And get this (driven by Value1):
Relative Time   Value1   Value2
10              1        100
20              2        100
30              3        200
Note: There are a few scenarios here. One of them is that Value1 is a massive DataFrame and Value2 only has a few hundred values. The other scenario is that they're both massive.
Also note: I depict Value2 as arriving very slowly, and it might, but it could also arrive much faster than Value1, so I may have 10 or 100 values of Value2 before my next value of Value1, and I'd want the latest. Because of this, doing a union of them and windowing it doesn't seem practical.
How would I accomplish this in Spark?
I think you can do a full outer join between the two tables, then use the last function (over a window ordered by time) to look back to the closest preceding value of value2.
import spark.implicits._
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.last

val df1 = spark.sparkContext.parallelize(Seq(
  (10, 1),
  (20, 2),
  (30, 3)
)).toDF("Relative Time", "value1")

val df2 = spark.sparkContext.parallelize(Seq(
  (1, 100),
  (22, 200)
)).toDF("Relative Time", "value2_temp")

val df = df1.join(df2, Seq("Relative Time"), "outer")
val window = Window.orderBy("Relative Time")
val result = df
  .withColumn("value2", last($"value2_temp", ignoreNulls = true).over(window))
  .filter($"value1".isNotNull)
  .drop("value2_temp")
result.show()
+-------------+------+------+
|Relative Time|value1|value2|
+-------------+------+------+
| 10| 1| 100|
| 20| 2| 100|
| 30| 3| 200|
+-------------+------+------+

Group and aggregate dataset in spark scala without using spark.sql()

I have a dataset with account information of customers as below
customerID  accountID  balance
ID001       ACC001     20
ID002       ACC002     400
ID003       ACC003     500
ID002       ACC004     30
I want to group by and aggregate the above data to get the output below, without using spark.sql(); using the Dataset/DataFrame API is allowed.
accounts                                  number of accounts  totalBalance  averageBalance
[ID001,ACC001,20]                         1                   20            20
[[ID002,ACC002,400], [ID002,ACC004,30]]   2                   430           215
[ID003,ACC003,500]                        1                   500           500
I tried using ds.groupBy("accountID").agg(Map("balance" -> "avg")), but with the Map form I can only get a single aggregate such as the average. I need help doing multiple aggregations without using spark.sql().
Appreciate any help to achieve the above. Thanks
Here is your solution
import spark.implicits._
import org.apache.spark.sql.functions.{count, sum, avg}

val cust_data = Seq[(String, String, Int)](
  ("ID001", "ACC001", 20),
  ("ID002", "ACC002", 400),
  ("ID003", "ACC003", 500),
  ("ID002", "ACC004", 30)
).toDF("customerID", "accountID", "balance")

val out_df = cust_data.groupBy("customerID").agg(
  count($"accountID").alias("number_of_accounts"),
  sum($"balance").alias("totalBalance"),
  avg($"balance").alias("averageBalance")
)
out_df.show()
+----------+------------------+------------+--------------+
|customerID|number_of_accounts|totalBalance|averageBalance|
+----------+------------------+------------+--------------+
| ID001| 1| 20| 20.0|
| ID002| 2| 430| 215.0|
| ID003| 1| 500| 500.0|
+----------+------------------+------------+--------------+
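If you also need the accounts column from the expected output (the grouped rows themselves), one possible sketch, assuming the same three source columns, is to collect the rows as structs alongside the other aggregates:
import org.apache.spark.sql.functions.{collect_list, struct, count, sum, avg}

// collect_list(struct(...)) gathers the grouped rows into an array column
val out_with_accounts = cust_data.groupBy("customerID").agg(
  collect_list(struct($"customerID", $"accountID", $"balance")).alias("accounts"),
  count($"accountID").alias("number_of_accounts"),
  sum($"balance").alias("totalBalance"),
  avg($"balance").alias("averageBalance")
)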

Spark Scala sum of values by unique key

If I have key/value pairs that comprise an item (key) and its sales (value):
bolt 45
bolt 5
drill 1
drill 1
screw 1
screw 2
screw 3
So I want to obtain an RDD where each element is the sum of the values for every unique key:
bolt 50
drill 2
screw 6
My current code is like this:
val salesRDD = sc.textFile("/user/bigdata/sales.txt")
val pairs = salesRDD.map(s => (s, 1))
val counts = pairs.reduceByKey((a, b) => a + b)
counts.collect().foreach(println)
But my results come out like this:
(bolt 5,1)
(drill 1,2)
(bolt 45,1)
(screw 2,1)
(screw 3,1)
(screw 1,1)
How should I edit my code to get the above result?
Java way, hope you can convert this to Scala. If you load the file into a Dataset/DataFrame with name and sales columns (say salesDF), you just need a groupBy and a sum (a plain count would only count rows per key, not add up the sales):
salesDF.groupBy(salesDF.col("name")).sum("sales");
+-----+----------+
| name|sum(sales)|
+-----+----------+
| bolt|        50|
|drill|         2|
|screw|         6|
+-----+----------+
Also, please use Datasets and DataFrames rather than RDDs. You will find them a lot handier.
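If you do want to stay with the RDD code from the question, a minimal sketch of the fix (assuming each line is space-separated, like `bolt 45`) is to parse the value out of the line instead of pairing the whole line with 1:
val salesRDD = sc.textFile("/user/bigdata/sales.txt")
// split each line into (item, amount) instead of (wholeLine, 1)
val pairs = salesRDD.map { line =>
  val Array(item, amount) = line.split("\\s+")
  (item, amount.toInt)
}
val totals = pairs.reduceByKey(_ + _)   // sum the amounts per unique key
totals.collect().foreach(println)       // e.g. (bolt,50), (drill,2), (screw,6) -- order not guaranteed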

how to join two dataframes and subtract two columns from the dataframe

I have two dataframes which look like below.
I am trying to find the diff between the two amounts based on ID.
Dataframe 1:
ID I Amt
1 null 200
null 2 200
3 null 600
dataframe 2
ID I Amt
2 null 300
3 null 400
Output
Df
ID Amt(df2-df1)
2 100
3 -200
This query doesn't work; the subtraction doesn't work:
df = df1.join(df2, df1["coalesce(ID, I)"] == df2["coalesce(ID, I)"], 'inner').select(
    (df1["amt"] - df2["amt"]), df1["coalesce(ID, I)"]).show()
I would do a couple of things differently. To make it easier to know what column is in what dataframe, I would rename them. I would also do the coalesce outside of the join itself.
val joined = df1
  .withColumn("joinKey", coalesce($"ID", $"I"))
  .select($"joinKey", $"Amt".alias("DF1_AMT"))
  .join(
    df2
      .withColumn("joinKey", coalesce($"ID", $"I"))
      .select($"joinKey", $"Amt".alias("DF2_AMT")),
    "joinKey")
Then you can easily perform your calculation:
joined.withColumn("DIFF",$"DF2_AMT" - $"DF1_AMT").show
+-------+-------+-------+------+
|joinKey|DF1_AMT|DF2_AMT| DIFF|
+-------+-------+-------+------+
| 2| 200| 300| 100.0|
| 3| 600| 400|-200.0|
+-------+-------+-------+------+
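If you only want the two columns from the expected output (ID and the difference), a final select on the joined frame gives exactly that shape (Amt_diff is just an arbitrary name for df2 minus df1):
joined
  .withColumn("Amt_diff", $"DF2_AMT" - $"DF1_AMT")
  .select($"joinKey".alias("ID"), $"Amt_diff")
  .show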

How to mark a row in a group of rows in spark dataframe

Hi, I would like to mark a row from a group of records based on some rules. I have a dataframe like the one below:
id price date
a 100 2016
a 200 2016
a 100 2016
b 100 2016
b 100 2015
My output dataframe should be
id price date
a 200 2016
b 100 2016
In the given dataframe the rules are based on two columns: from each group of ids (a, b), pick first by the maximum price and then by the most recent date. My actual rules are more complicated and involve a lot of other columns too.
What is the best approach for solving a problem like this, where I need to pick a row from a group of rows based on some rules? Any help would be appreciated. Thanks
Try this.
import spark.implicits._
import org.apache.spark.sql.functions.{struct, max}

val df = Seq(("a",100,2016), ("a",200,2016), ("a",100,2016), ("b",100,2016), ("b",100,2015)).toDF("id", "price", "date")
df.show

val df1 = df
  .select($"id", struct($"price", $"date").alias("data"))
  .groupBy($"id")
  .agg(max("data").alias("data"))
  .select($"id", $"data.price", $"data.date")
df1.show
You will get the output like below.
+---+-----+----+
| id|price|date|
+---+-----+----+
| b| 100|2016|
| a| 200|2016|
+---+-----+----+
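Since the real rules involve more columns, a more general pattern is a window with an explicit ordering: rank the rows inside each id group by whatever rule columns you need and keep the first one. A minimal sketch (the ordering here just mirrors the price-then-date rule above):
import org.apache.spark.sql.expressions.Window
import org.apache.spark.sql.functions.{row_number, desc}

val w = Window.partitionBy("id").orderBy(desc("price"), desc("date"))
val picked = df
  .withColumn("rn", row_number().over(w))   // rn = 1 is the best row per id under this ordering
  .filter($"rn" === 1)
  .drop("rn")
picked.show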